Monday, December 26, 2011

Who wrote all this stuff?

Every folio of Catesby’s collections contains an assortment of pieces of information in addition to the dried plants themselves. There are various handwritten notes, some written on labels pasted onto the pages and some written directly on the pages, some in ink and some in pencil. Some specimens have typed identification labels. A few have labels from the Linnaean Typification Project.
Who the heck wrote all this stuff? What does it mean?

Let’s start by consulting Dandy. (This would be the ultimate reference for the Sloane: J.E. Dandy, ed., The Sloane Herbarium, Trustees of the British Museum, 1958.) Here is what he says about H.S. 212 and H.S. 232.

H.S. 212: Volumen Plantarum, quas Carolina misit D. CATESBY [m. scr. ignot.]. [96 ff] [Catesby’s collection.]
There is a list, m. scr. Ignot., on the title-page of the volume. Many specimens are referred to Ray, and many named by Solander; good specimens, some with Catesby’s labels. Amman has written determinations on ff. 25, 42, 53.

H.S. 232. Plants gathered in Carolina and the Bahama Islands by Mr. MARK CATESBY [m. scr. ignot.]. [139 ff.]
See H.S. 212, of which this forms a second volume. At the beginning is a duplicate of the list in H.S. 212. The specimens are referred to Ray, and some are named by Solander. Some have Catesby’s labels.

H.S. 212 f. 4 has lots of good examples of these texts. Take the specimen on the left, the conifer. Below the plant are the letters "R.H.S.P. 10.1". Below that a beautifully handwritten label reads "1 Cupressus disticha Linn." On that label in pencil is: "Sp. Pl. 1003." A typed label reads:

Taxodium distichum (L.) Richard
Catesby 1-11
See also HS 232 f 69, f 85
Det. Richard A. Howard 1982

Dandy says “many specimens are referred to Ray.” John Ray was an English naturalist who lived 1627-1705. He wrote one of the first works classifying plants according to structural features, the Historia Plantarum. Hans Sloane made notes on the pages of Catesby’s specimens referring to this book. He also refers to R.H.S., presumably Ray’s Hortus Siccus – but what is that? Ray’s own herbaria? Dandy writes “Many [of Sloane’s specimens] are indicated in the margins of a copy of Ray’s Historia Plantarum in the Department of Botany, which thus forms an index to the collections, so far as the specimens have been ‘referred to Ray’….” (p. 18). I haven’t seen this copy of Ray – it might be interesting to add these images to the collection one day.

Dandy doesn’t mention this, but some of Sloane’s notes mention Plukenet. This would be Leonard Plukenet, 1642-1706, another English botanist whose extensive collection of dried plants ended up with Sloane in 1710.

Lots of plants have lovely handwritten labels glued on to the pages such as this one that reads “Cupressus disticha Linn” (which is quite close to the current scientific name). Dandy says that many specimens were “named by Solander.” Daniel Carlsson Solander, 1733-1782, happens to have been a student of Linnaeus, who taught him botany at Uppsala University. Solander went to England in 1760 and got a job at the British Museum. He accompanied Joseph Banks on his expeditions to the Pacific and Australia (1768-1771) and to Iceland (1772). For the next ten years he ran the Natural History Department at the British Museum. I don’t know when he wrote the labels for Catesby’s plants, but they are written in Linnaean binomials and many of the specimens cite Linnaeus as the authority, presumably from the Species Plantarum first published in 1753.

The other authority Solander cites in his binomials is “mscr.” Check out H.S. 212 f. 20, with one specimen labeled “Sophora cerulea Mscr.” Dandy says of Catesby’s specimens “a large number (especially in H.S. 212) are named by Solander and some are described as new in his MSS.” Almost none of these names are currently valid scientific names. Solander pops up in descriptions of other Sloane collections, having described plants collected by others for his MSS. But where is this manuscript? Apparently it was not published. Solander and Banks collected a massive amount of material in Botany Bay (named for these botanists), and Solander moved in with Banks in 1771 specifically to prepare this material for publication. I don’t know if the Catesby materials were included in this project, but it seems plausible (based on some quick web searching). Solander wrote up a pile of descriptions and Banks paid for 700 copper engravings, but the book was not yet finished when Solander died in 1782. Banks was too busy running the Royal Society and the Royal Botanic Gardens Kew to finish it himself, so it remained unpublished. This had a serious impact on the naming of the Australian collections. Could it also mean that Solander’s work in identifying Carolina plant species would forever go to waste?

Johann Amman, who wrote determinations on H.S. 212 ff. 25, 42, 53 lived from 1707 to 1741. He was born in Switzerland and became a professor of botany at St. Petersburg. He wrote a number of letters to Sloane between 1734 and 1741, in which he criticized Linnaeus for basing his classification system solely on numbers of floral parts.

My reading about Amman led me to the solution to another mystery. H.S. 212 f. 4 contains a specimen of Hamamelis virginiana L. Below the specimen is the penciled notation “Hamamelis Gronov.” Apparently Johan Frederik Gronovius described a specimen of Hamamelis, so maybe this was Hamamelis gronovius for a time. Who was Gronovius? He was another of the botanical club of the early 18th century, a regular correspondent of both Linnaeus and Catesby and seems to have defended Catesby to Linnaeus, who apparently took a dislike to Catesby. He was Dutch. John Clayton collected plants from Virginia and sent them to Gronovius, who used them in his Flora Virginica (1739-1743).

A number of specimens have typed labels glued onto the pages next to them. Most of these are the work of Richard A. Howard. Howard was director of Harvard’s Arnold Arboretum from 1966 until 1978. (During World War II he trained pilots how to survive in the jungles of the South Pacific – who says botany isn’t exciting?) He visited the Natural History Museum in 1982, where he identified the specimens that appear in Catesby’s Natural History. The labels refer to plates in the Natural History. (The modern names for Catesby’s plants. Howard, Richard A. and Staples, George W. 1983. J. Arnold Arbor 64:511–546.)

For this specimen, Howard gave it a current scientific name, Taxodium distichum. He named the page in Natural History where this appears: volume 1, folio 11. He names two folios in Catesby's other collection of dried plants where this species also appears, although those two specimens have since been identified as the related Taxodium ascendens.

A few specimens have stickers from the Linnaean Typification Project added by James Reveal, professor emeritus of the University of Maryland, honorary curator of the New York Botanical Garden, and current curator of

What else? There are lots of pencil scribbles, mostly modern scientific names and references to pages in Natural History. Mark Spencer doesn’t know who added those.

I am continually amazed at the willingness of many to write on or glue stuff on the pages of a historic manuscript. But I suppose that’s how you build a multitext – if everyone had kept the pages pristine there wouldn’t be nearly so much to discover.

Monday, December 12, 2011

Text or Collection?

We care about our data and want it to be as broadly useful and long-lived as possible. One way we are trying to ensure its utility is by capturing and storing it in formats appropriate to its nature. Images are simple. We are creating directories of full-resolution TIFF images, and directories of versions in other formats derived from the original TIFFs: JPGs of different sizes, and Pyramidal TIFFs for delivery to dynamic web-based viewers. These images can come to their viewers through our CITE Image-Service, or interested parties can simply download them from the server, individually or in batches. Other data requires more thought. In this post I want to describe the process by which we determined whether a particular set of data is properly a “text” or a “collection”.

These terms have very specific meanings in the CITE architecture. A “text” is…
…a collection of leaf-nodes consisting of character-data and well-formed XML markup,
…each leaf node having a specified place in a sequence, and
…a specified position in a citation-hierarchy at least one level deep,
…and the collection as a whole taking its place in an ontological hierarchy of text-group, work, edition/translation.
Or, in slightly more general terms, something is a text if it consists of language, if you can cite it as a whole unambiguously, if you can cite its parts unambiguously, and if it is intended to be read in a sequence.

A “collection” is…
…a group of data-objects each consisting of one or more fields, but each object having the same fields,
…which objects may be in a sequence (an ordered collection), or not (an unordered collection),
…having no citation-hierarchy beyond the object’s identifier, although individual fields may be addressed.
Or, in other words, a dictionary is a “collection”, as is a database of plant-records, a telephone directory, and the digital representations of the folios of a book.
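
As a rough illustration of the two definitions above, the following Python sketch models a text as a sequence of cited leaf nodes and a collection as a group of objects that all share the same fields. Every class name, field name, and identifier here is hypothetical; this is not the CITE implementation, only one way to picture the distinction.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class LeafNode:
    """One citable unit of a text (hypothetical model)."""
    citation: tuple[str, ...]  # place in a citation hierarchy, at least one level deep
    sequence: int              # place in the reading order
    content: str               # character data (XML markup left out of this sketch)


@dataclass
class CollectionObject:
    """One object in a collection: an identifier plus named fields."""
    identifier: str
    fields: dict[str, str]


def is_valid_text(nodes: list[LeafNode]) -> bool:
    """Every node needs a citation, and no two nodes may share a sequence slot."""
    sequences = [n.sequence for n in nodes]
    return (
        all(len(n.citation) >= 1 for n in nodes)
        and len(set(sequences)) == len(sequences)
    )


def is_valid_collection(objects: list[CollectionObject]) -> bool:
    """Every object in a collection must carry the same set of field names."""
    field_sets = {frozenset(o.fields) for o in objects}
    return len(field_sets) <= 1


# Two lines of a poem, cited one level deep and read in order.
poem = [
    LeafNode(("1",), 1, "first line of a poem"),
    LeafNode(("2",), 2, "second line of a poem"),
]

# Two label records with identical fields; their order carries no meaning.
labels = [
    CollectionObject("label-1", {"folio": "4", "transcription": "Cupressus disticha Linn."}),
    CollectionObject("label-2", {"folio": "20", "transcription": "Sophora cerulea Mscr."}),
]
```

The checks capture only the structural part of the definitions. Whether something is meant to be read in sequence is a judgment about use, which no validation function can make.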

Clearly there is some possible overlap and ambiguity. It would be possible to treat a sonnet as an “ordered collection of poetic lines”; it would be possible to deliver a telephone directory as a “text with a one-level deep citation-hierarchy”. You can argue whether a telephone directory or a dictionary is an “ordered” or “unordered” collection: is alphabetical order inherent, or merely convenient? In the case of the sonnet and phone-book, some reflection on the principal mode of interaction resolves the question: a sonnet is intended to be read (even if an analysis of a sonnet might pull out and quote particular lines), and a phone book is intended for random access (even if we occasionally read down a column to find just which John Smith we want to call).

In the case of Horti Sicci 212 and 232, the herbarium volumes of Mark Catesby that Amy and Patrick have begun recording, and which will be the first volumes from the Sloane to go online, we faced a trickier question: Attached to each specimen are one or more labels; these labels fall into discrete categories (as Amy described in the previous post); clearly all the labels by Hans Sloane are one set; all the copperplate labels another; those by Richard Howard yet another set. Is each of these sets a “text” or a “collection”?

My first instinct was “text”. We are entirely indebted to the Homer Multitext for our vision of a digital herbarium, and the tools with which to create one. The HMT offers what would seem to be a very clear parallel: the scholia to the poetic text of the Iliad on Byzantine manuscripts. These are marginal notes on the text; there are different categories of notes, and the editors of the HMT have successfully treated each category as a separate “text”. (The full online catalogue of those texts is here). It seemed to make sense to follow this established standard, which the students at Holy Cross, Furman University, and the University of Houston have used to make significant discoveries about the history of Greek epic poetry.

The Venetus A Manuscript of the Iliad:
One folio, many texts
Like the scholia, the labels on Catesby’s specimens offer commentary; like the scholia, the different categories come from different periods of history. Both scholia and labels can be identified unambiguously through citation, and in each case through a one-level-deep citation scheme. In terms of ontology, the labels are considerably less ambiguous, since their authorship is clear; even the anonymous handwritten labels are distinguishable as a set by handwriting, while all of the scholia were copied by the same scribe, and the contents of each set of scholia come from many different ancient sources.

I talked through this question with Neel Smith of the College of the Holy Cross, whose instinct about the nature of knowledge and its digital avatars is unerring. And I now think the collected labels are not texts, but collections. Here’s why. Each scholion has a place in a sequence only because it refers to a specific part of the poem on which it comments. It is worth reading the scholia in order only because we read the Iliad in order. The labels do not comment on a text, but on what is clearly a collection (the collection of Catesby’s specimens), and an unordered collection at that.

So it does not seem that we do violence to the labels by considering them members of collections. At this point, we can set aside ontological rigor and ask questions of convenience.

Catesby, H.S. 212, page 4:
One folio, many collections
Will anyone be likely to want to read each of Richard Howard’s labels, in order, in isolation from the specimens? Probably not. Will it be convenient to have these labels’ contents exposed through a collection-service? Probably, since the collection-service emphasizes query-and-retrieval, while the text-service emphasizes location-and-scrolling. Other advantages? Yes, the CITE Collection Service draws data from Google Fusion Tables, and the data in those tables is easily edited in place, while the CITE Text-service draws data from XML files tabulated and stored in Google BigTable, and thus editing a text is a more laborious process; for the near- and medium-term future, transcriptions of these labels are likely to be subject to more editing even as they are exposed to the public view.

What advantages did the editors of the Homer Multitext gain from decreeing the scholia to be Texts rather than Collections? Internal markup, more than anything. The scholia are in Greek; they are argumentative; they discuss linguistic features of the poem, geographical places, personal names, and often quote from other literature. All of these are features that are inherent to the content of the scholia, and varied enough to justify internal markup of an XML text. While our botanical labels offer some of these, each is short enough, and likely to contain at most only a few external references (place-names, cross-references, vel sim.) that we can better capture that information through indexing.

So this is simply one example of the kind of thinking that occupies us as we plan and proceed to publish these snapshots of botanical history.

Mark Catesby

Last month I knew nothing of Mark Catesby. Now the man is becoming a friend, and getting to know him is introducing me to a whole cast of characters from the age of discovery. We come together through his dried plant specimens.

Mark Catesby (1682-1749) lived in Charleston, South Carolina from 1722 to 1726. During these years he traveled around the state collecting plant specimens and making observations on the natural history of the area. He sent his specimens to Hans Sloane in London. Sloane bound them into two volumes, H.S. 212 and H.S. 232, and made them part of his herbarium collection. After Catesby returned to England in 1726 he got to work on his Natural History of Carolina, Florida and the Bahama Islands, which includes his own watercolors of flora and fauna he observed.

A couple of weeks ago Patrick McMillan and I sat down to identify and list the plants contained in the volumes – a quick and first step. Or so we thought. The volumes are a trove of treasures and minor mysteries. We’ve been solving some of them and uncovering new ones. What are these plants? Where were they collected? What relationship do the herbarium specimens have to the watercolors? Who made all the notes on the pages?

Identifying the specimens is easy if you have Patrick McMillan on your team. He can identify many of them at sight. What he can’t identify instantly, we look up – through images on the Internet, descriptions and keys in Alan S. Weakley’s Flora of the Southern and Mid-Atlantic States (Working draft 15 May 2011), occasional emails to experts in particular taxa. The plants’ beautiful state of preservation certainly helps – Catesby did a wonderful job pressing them, and whatever he or Sloane treated them with has kept away the vermin. I double check scientific names and authorities and transcribe all the text on the page attached to the specimen – there may be a little or a lot. Patrick adds notes if the specimen is particularly interesting.

Here’s how we work: We set up our MacBooks side by side. I keep track of our data in Bento, which is ideal for this purpose. Each specimen gets its own record. Every type of text on the page gets its own field within the record. I keep Weakley’s Flora open on my desktop for quick reference. We go through the volumes folio by folio, opening each page in Preview. The digital photographs are clear enough that we can zoom in close enough to see tiny (if jaggy) details – even pubescence is visible. I also keep open the University of Wisconsin’s digital facsimile of Catesby’s Natural History to cross-check specimens that appear as watercolors. The USDA Plants Database is handy for scientific names and images. We fire off the occasional email, and work together to decipher Catesby’s handwriting.

I said identification is easy. This work takes hours. We sit there for three hours, six hours, drinking black coffee as the day slips by and dried plants and Latin imprint themselves in our brains so that we dream in binomials. After that work is done, I go back through the images and finish the transcriptions and clean-up on my own. Patrick spends evenings at home working ahead on identifications. H.S. 212 has 96 folios and 259 specimens. H.S. 232 has 139 folios and we’ve identified about 170 specimens; there are a number of Bahamian plants in this collection, and we haven’t worried about them for the moment.

Organizing the texts on the pages presents some challenges. Catesby was older than Carl Linnaeus (1707-1778), so his work predates binomial nomenclature. Sloane wrote notes on many of the folios in his distinctive handwriting, mostly references to works by John Ray. Sloane (or someone) pasted Catesby’s own descriptions of a few specimens on the pages. Someone else with copperplate handwriting who lived after Linnaeus wrote Latin binomials – many of them correct but many others never seen before or since – that are pasted on the pages next to a number of specimens. Someone wrote little notations in pencil directly on the pages – cryptic notes such as “ii.34.” Richard A. Howard identified some of the specimens in 1982, and his typed labels are pasted onto the pages.

Some folios are covered with these different texts. Others contain no text at all, only dried plants. Those are the pages where an expert taxonomist really comes in handy. But the pages full of labels are the really fun ones. In my next posting I’ll explain more of what is going on with those labels, and how we are deciphering them all.

— Amy Hackney Blackwell

Sunday, December 11, 2011

Borrowing from Homer: A Data Model for a Digital Herbarium

ὡς δ᾽ Εὖρός τε Νότος τ᾽ ἐριδαίνετον ἀλλήλοιιν
οὔρεος ἐν βήσσῃς βαθέην πελεμιζέμεν ὕλην
φηγόν τε μελίην τε τανύφλοιόν τε κράνειαν,
αἵ τε πρὸς ἀλλήλας ἔβαλον τανυήκεας ὄζους
ἠχῇ θεσπεσίῃ, πάταγος δέ τε ἀγνυμενάων,
ὣς Τρῶες καὶ Ἀχαιοὶ ἐπ᾽ ἀλλήλοισι θορόντες
δῄουν, οὐδ᾽ ἕτεροι μνώοντ᾽ ὀλοοῖο φόβοιο.

And as the East Wind and the South strive with one another
in shaking a deep wood in the glades of a mountain,
a wood of beech and ash and smooth-barked cornel,
and these dash one against the other their long boughs
with a wondrous din, and there is a crashing of broken branches;
even so the Trojans and Achaeans leapt one upon another
and made havoc, nor would either side take thought of ruinous flight.
Homer, Iliad 16.767-773

Many labels on one specimen in Catesby’s Hortus Siccus
We are building a digital herbarium. Its heart will be digital images of volumes of botanical specimens collected by the English naturalists who worked in the Carolinas during the 1700s. Each of these volumes is a complex piece of information technology. A volume contains many folios; each folio may contain one or more specimens; each specimen consists of one or more dried plants, and zero or more labels. The content, nature, and date of the labels vary, from a single number penciled on the paper of the folio itself, to elaborately and elegantly handwritten notes identifying the plant by common and Linnaean name, noting where it was collected, and sometimes describing how the native Americans used it; many specimens have 18th century notes, supplemented by later notes, including typewritten labels from the 20th century. The information on the notes is sometimes detailed and accurate, sometimes made obsolete as botanical science advanced over three centuries, and sometimes completely incorrect. Some labels cross reference other volumes, either volumes of specimens or volumes of botanical drawings. The languages of these texts are English, French, and Latin. Places have English or Native American names. And of course the plants themselves are relics of a natural world now a third of a millennium in our past. We would like this project ultimately to be a pyramid, with the beginnings of botanical science (Theophrastus and Aristotle, their works in Greek and English translations) at the base, and the living collections of the South Carolina Botanical Garden and plants growing in the wild at the pinnacle. This adds more complexity.
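
The nesting described above (volume, folio, specimen, plants and labels) can be pictured with a small Python sketch. The class names, the `kind` categories, and the `label_count` helper are illustrative assumptions, not the project's actual schema; the sample transcriptions are taken from the conifer on H.S. 212 f. 4.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Label:
    kind: str           # hypothetical category, e.g. "handwritten", "pencil", "typed"
    transcription: str


@dataclass
class Specimen:
    plants: int                                         # one or more dried plants
    labels: list[Label] = field(default_factory=list)   # zero or more labels

    def __post_init__(self) -> None:
        if self.plants < 1:
            raise ValueError("a specimen consists of at least one dried plant")


@dataclass
class Folio:
    number: int
    specimens: list[Specimen] = field(default_factory=list)


@dataclass
class Volume:
    name: str
    folios: list[Folio] = field(default_factory=list)

    def label_count(self) -> int:
        """Total number of labels across every specimen in the volume."""
        return sum(len(s.labels) for f in self.folios for s in f.specimens)


conifer = Specimen(
    plants=1,
    labels=[
        Label("handwritten", "1 Cupressus disticha Linn."),
        Label("pencil", "Sp. Pl. 1003."),
        Label("typed", "Taxodium distichum (L.) Richard"),
    ],
)
unlabeled = Specimen(plants=1)  # some folios carry only dried plants
volume = Volume("H.S. 212", [Folio(4, [conifer, unlabeled])])
```

A nested structure like this is convenient for thinking about the physical volume. It says nothing yet about how the parts are cited or linked to one another.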

Theophrastus, On Plants, in Greek
For our digital herbarium to be a truly scientific publication, it must satisfy several requirements. First, the data must be completely and freely available in an unmediated form such that future researchers can gather it en masse for further study. Second, any additional metadata that we add must not interfere with this direct access to the data, and must itself be freely useable and reusable, without mediation. By “mediation” we mean any interface to the data that imposes restrictions on a researcher. If images are available only in low-resolution versions, or if it is impossible to download programmatically an entire library of images, for example, future scholars will be limited in their ability to answer questions that we have not anticipated. If the data exists in a format that depends on particular software, which might become unavailable, or if the data requires a scholar to request permission to copy it, it ceases to be scientific data and becomes a collection of private and ephemeral curiosities. Digital images afford us the luxury of broadly duplicative access, free from worries about the safety of the physical objects they represent.

Catesby’s Illustration of Hamamelis
At the same time, as we, our collaborators, or anyone else makes connections between objects in this collection, adds descriptions, comments, or articulates insights, that new data, those new objects of knowledge, must connect. So our digital herbarium needs mediation in the form of a mechanism for linking objects. This mediation must be simple, durable, and non-exclusive (see above). For linking to be effective, a citation pointing to an object must be easily resolved; that is, the reader (a human or a machine) must be able to follow the citation to the datum to which it points, retrieve that object, and work with it. A linking citation must be able to be general or specific, just as we can cite a whole book, or a particular passage of a book. When a specific citation is resolved, the reader should get the specific, but also have access to the larger context, just as you can follow a citation to a particular passage in a book, and then read the chapter containing that passage, or the whole book. A citation like this should not depend on a particular technology, or a particular organization of data, but should be firmly and permanently linked to the object to which it refers.

And at the moment, we want to deliver these objects to readers through the dominant medium of our day, end-user applications on the world wide web, accessed through personal computers or mobile computing devices.

Unmediated access is straightforward: create images, texts, and other data in documented open formats, and place them in a world-readable directory on a web-server. People can download particular items of interest, or they can use one of many common utilities to download the entire contents of the directory automatically; typing the terminal command “wget -r” will accomplish this on any Unix, Linux, or Mac OS X computer.

A potential pitfall for unmediated data is complexity of markup. Collaborative standards-bodies in the sciences and humanities have defined richly featured dialects of XML that can capture and describe innumerable features of a document; SQL-based databases allow intricate structures of tables and relations. Too often, the ability to capture every feature, every relationship in a single document or database is too tempting, and the result is data of such overwhelming complexity as to be useless to any but its creators (if even to them). So the “mediation of complexity” is to be avoided in favor of simplicity and a strict separation of concerns.

Developing web-based end-user applications is not hard either. There are many attractive online presentations of old books, galleries of images of plants, and texts of various kinds. The hard part is the middle: mediation that is non-exclusive, simple, and based on citations that resolve. In order to build the kind of applications we want, on the sound scholarly principles we value, we need a digital library infrastructure that addresses all of these needs.

Such an infrastructure exists thanks to the Homer Multitext, a project of the Center for Hellenic Studies of Harvard University under the editorship of Casey Dué and Mary Ebbott. That project brings together a wealth of texts, images, and other data related to the history of Greek epic poetry. Like Botanica Caroliniana, the Homer Multitext (HMT) is a collaborative project covering centuries of history and a vast and diverse body of data.

This infrastructure is called C·I·T·E, and is described in the HMT’s blog and on that project’s website. It is the product of many years of development by many friends of the HMT, but it is mostly dependent on the brilliant insights of Neel Smith, a professor of Classics at The College of the Holy Cross. Its acronym stands for Collections, Indices, Texts, and Extensions. Collections are bodies of data-objects—volumes, specimens, &c. A Text is a text, such as Mark Catesby’s 1754 Natural History of Carolina, Florida, and the Bahama Islands, or Carolus Linnaeus’ Species Plantarum.* An Extension adds specific functionality to Collections of particular types of data; for our purposes, the main Extension is for Collections of images, because images are a type of data for which we have special expectations.

Each object in a Collection is uniquely identified, and that identifier, structured as a Universal Resource Name (URN), is all that is required to retrieve an object in a collection. Texts are cited similarly by a URN, which can be general, pointing to a whole work, or highly specific pointing to a particular passage in a work.
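
To make the idea of general and specific citations concrete, here is a minimal Python sketch. The `urn:example:` layout is invented for illustration and is not the real C·I·T·E URN syntax; the class and method names are likewise assumptions. It shows only that a specific citation can be checked against a more general one and can always yield its larger context.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TextCitation:
    work: str                    # identifies the whole work
    passage: str | None = None   # None means "the whole work"

    @classmethod
    def parse(cls, urn: str) -> "TextCitation":
        # Assumed layout: urn:example:<work>[:<passage>] (hypothetical syntax).
        parts = urn.split(":")
        if len(parts) not in (3, 4) or parts[:2] != ["urn", "example"]:
            raise ValueError(f"unrecognized citation: {urn!r}")
        return cls(work=parts[2], passage=parts[3] if len(parts) == 4 else None)

    def context(self) -> "TextCitation":
        """The general citation (the whole work) containing this one."""
        return TextCitation(self.work)

    def contains(self, other: "TextCitation") -> bool:
        """True if this citation equals, or is more general than, `other`."""
        if self.work != other.work:
            return False
        if self.passage is None:
            return True
        if other.passage is None:
            return False
        return (
            other.passage == self.passage
            or other.passage.startswith(self.passage + ".")
        )


whole = TextCitation.parse("urn:example:iliad")
book = TextCitation.parse("urn:example:iliad:16")
line = TextCitation.parse("urn:example:iliad:16.767")
```

Here `whole.contains(line)` and `book.contains(line)` both hold, while `line.contains(book)` does not, and `line.context()` returns the citation for the whole work.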

The “I” in C·I·T·E is for Indices, the primary and simple linking mechanism. Indices are very, very simple data-structures: pairs of URNs. An index can thus link Data-objects, Images, and textual passages, in any combination. Whatever complexity comes to exist in a digital library is constructed out of pairs of citations.
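
A short Python sketch can show how far pairs of identifiers go. The identifiers and the two indices below are made up for illustration, and `lookup` and `follow` are hypothetical helpers, not part of any C·I·T·E service.

```python
# Each index is nothing more than a list of identifier pairs (all hypothetical).
label_to_specimen = [
    ("urn:example:labels:1", "urn:example:specimens:212-4-1"),
    ("urn:example:labels:2", "urn:example:specimens:212-4-1"),
    ("urn:example:labels:3", "urn:example:specimens:212-20-1"),
]
specimen_to_image = [
    ("urn:example:specimens:212-4-1", "urn:example:images:212-f4"),
    ("urn:example:specimens:212-20-1", "urn:example:images:212-f20"),
]


def lookup(index, urn):
    """Return every identifier paired with `urn`, in either direction."""
    found = []
    for left, right in index:
        if left == urn:
            found.append(right)
        elif right == urn:
            found.append(left)
    return found


def follow(urn, *indices):
    """Chain lookups through several indices, one hop per index."""
    current = [urn]
    for index in indices:
        current = [hit for item in current for hit in lookup(index, item)]
    return current


# From a label to the image of the folio it sits on, in two hops.
images = follow("urn:example:labels:1", label_to_specimen, specimen_to_image)
```

Because each index stays a flat list of pairs, a new kind of relationship means adding another list, with no change to the objects being linked.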

For each of the four kinds of information in a C·I·T·E library there is a Service that allows discovery, querying, and retrieval based on URN citations. These services are technology-agnostic, because they are defined by relatively simple vocabularies of requests, which yield relatively simple structured responses. Out of these requests and responses we can build end-user applications. Another project indebted to the Homer Multitext is the publication of biblical manuscripts from Lichfield Cathedral. These are exposed via a web-application that draws all of its data through C·I·T·E services. Some of the data resides in South Carolina, some in Houston, Texas, but it is linked through indices and aggregated for human readers.

[update: the services linked in this paragraph have been superseded] The current implementations of these services rely heavily on resources made available by Google, particularly Google AppEngine and Google Fusion Tables. The Collections that drive the Biblical Manuscripts from Lichfield Cathedral application are in Fusion Tables here. The texts on which that application draws are in a Google AppEngine implementation of the C·I·T·E Text Service here. For these collections and texts, a small Java application receives requests defined by the C·I·T·E protocol, draws data from Google, and returns predictable results.

We have begun building our own data model for Botanica Caroliniana with a few Fusion Tables that describe the volumes of Mark Catesby’s collections and his writings on the natural history of our region, from the first half of the 18th century. We will follow with other collections: of specimens, of their labels, and of our collaborators’ own comments. We will post updates here!

— Christopher Blackwell

* I will talk about the difference between a Text and a Collection in a future post.

Sunday, December 4, 2011

Welcome, and Introduction

We are bringing together a library of historical botany that we hope will extend as far back as Theophrastus, the ancient Greek father of plant taxonomy, and up to the plants of the Carolinas living today in the wild, and under curation in the South Carolina Botanical Garden.

The centerpiece of this library will be digital facsimile editions of herbaria collected by the first European botanists to explore the Carolinas, collecting and identifying the native plants of the region. Many of these volumes have been in the care of the Natural History Museum in London, as part of the Sloane Herbarium. Its curator, Mark Spencer, has been a generous collaborator in our work to photograph and document these collections.

The work of digitization, alignment, and analysis of the images resulting from this collaboration has been funded by the National Science Foundation under Grants No. 0916148 & No. 0916421. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).