Sunday, December 11, 2011

Borrowing from Homer: A Data Model for a Digital Herbarium

ὡς δ᾽ Εὖρός τε Νότος τ᾽ ἐριδαίνετον ἀλλήλοιιν
οὔρεος ἐν βήσσῃς βαθέην πελεμιζέμεν ὕλην
φηγόν τε μελίην τε τανύφλοιόν τε κράνειαν,
αἵ τε πρὸς ἀλλήλας ἔβαλον τανυήκεας ὄζους
ἠχῇ θεσπεσίῃ, πάταγος δέ τε ἀγνυμενάων,
ὣς Τρῶες καὶ Ἀχαιοὶ ἐπ᾽ ἀλλήλοισι θορόντες
δῄουν, οὐδ᾽ ἕτεροι μνώοντ᾽ ὀλοοῖο φόβοιο.

And as the East Wind and the South strive with one another
in shaking a deep wood in the glades of a mountain,
a wood of beech and ash and smooth-barked cornel,
and these dash one against the other their long boughs
with a wondrous din, and there is a crashing of broken branches;
even so the Trojans and Achaeans leapt one upon another
and made havoc, nor would either side take thought of ruinous flight.
Homer, Iliad 16.767-773

Many labels on one specimen in Catesby’s Hortus Siccus
We are building a digital herbarium. Its heart will be digital images of volumes of botanical specimens collected by the English naturalists who worked in the Carolinas during the 1700s. Each of these volumes is a complex piece of information technology. A volume contains many folios; each folio may contain one or more specimens; each specimen consists of one or more dried plants, and zero or more labels. The content, nature, and date of the labels vary, from a single number penciled on the paper of the folio itself, to elaborately and elegantly handwritten notes identifying the plant by common and Linnaean name, noting where it was collected, and sometimes describing how the Native Americans used it; many specimens have 18th-century notes, supplemented by later notes, including typewritten labels from the 20th century. The information on the notes is sometimes detailed and accurate, sometimes made obsolete as botanical science advanced over three centuries, and sometimes completely incorrect. Some labels cross-reference other volumes, either volumes of specimens or volumes of botanical drawings. The languages of these texts are English, French, and Latin. Places have English or Native American names. And of course the plants themselves are relics of a natural world now a third of a millennium in our past. We would like this project ultimately to be a pyramid, with the beginnings of botanical science (Theophrastus and Aristotle, their works in Greek and in English translation) at the base, and the living collections of the South Carolina Botanical Garden and plants growing in the wild at the pinnacle. This adds more complexity.
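The containment described above (volume, folio, specimen, plants and labels) can be sketched as a handful of plain record types. This is only an illustration, not our actual data model: the class names, the field names, and the sample values are all assumptions made for the sketch.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Label:
    text: str
    language: str                  # e.g. "English", "French", "Latin"
    century: Optional[int] = None  # e.g. 18 or 20; None when undated


@dataclass
class Specimen:
    plants: List[str]                                   # one or more dried plants
    labels: List[Label] = field(default_factory=list)   # zero or more labels

    def __post_init__(self):
        # A specimen without any plant is not a specimen.
        if not self.plants:
            raise ValueError("a specimen needs at least one plant")


@dataclass
class Folio:
    number: int
    specimens: List[Specimen] = field(default_factory=list)


@dataclass
class Volume:
    title: str
    folios: List[Folio] = field(default_factory=list)

    def label_count(self) -> int:
        """Count every label on every specimen in the volume."""
        return sum(len(s.labels) for f in self.folios for s in f.specimens)


# Made-up example values, for illustration only.
specimen = Specimen(
    plants=["dried plant"],
    labels=[
        Label("penciled number", "English", 18),
        Label("typewritten label", "English", 20),
    ],
)
volume = Volume("Hortus Siccus", [Folio(1, [specimen])])
print(volume.label_count())  # 2
```

Even this toy version shows where the complexity lies: the labels, not the plants, carry most of the variation.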

Theophrastus, On Plants, in Greek
For our digital herbarium to be a truly scientific publication, it must satisfy several requirements. First, the data must be completely and freely available in an unmediated form such that future researchers can gather it en masse for further study. Second, any additional metadata that we add must not interfere with this direct access to the data, and must itself be freely useable and reusable, without mediation. By “mediation” we mean any interface to the data that imposes restrictions on a researcher. If images are available only in low-resolution versions, or if it is impossible to download programmatically an entire library of images, for example, future scholars will be limited in their ability to answer questions that we have not anticipated. If the data exists in a format that depends on particular software, which might become unavailable, or if the data requires a scholar to request permission to copy it, it ceases to be scientific data and becomes a collection of private and ephemeral curiosities. Digital images afford us the luxury of broadly duplicative access, free from worries about the safety of the physical objects they represent.

Catesby’s Illustration of Hamamelis
At the same time, as we, our collaborators, or anyone else make connections between objects in this collection, add descriptions or comments, or articulate insights, that new data, those new objects of knowledge, must connect. So our digital herbarium needs mediation in the form of a mechanism for linking objects. This mediation must be simple, durable, and non-exclusive (see above). For linking to be effective, a citation pointing to an object must be easily resolved; that is, the reader (a human or a machine) must be able to follow the citation to the datum to which it points, retrieve that object, and work with it. A linking citation must be able to be general or specific, just as we can cite a whole book, or a particular passage of a book. When a specific citation is resolved, the reader should get the specific, but also have access to the larger context, just as you can follow a citation to a particular passage in a book, and then read the chapter containing that passage, or the whole book. A citation like this should not depend on a particular technology, or a particular organization of data, but should be firmly and permanently linked to the object to which it refers.

And at the moment, we want to deliver these objects to readers through the dominant medium of our day, end-user applications on the world wide web, accessed through personal computers or mobile computing devices.

Unmediated access is straightforward: create images, texts, and other data in documented open formats, and place them in a world-readable directory on a web-server. People can download particular items of interest, or they can use one of many common utilities to download the entire contents of the directory automatically; typing the terminal command “wget -r http://address.of.data/” will accomplish this on any Unix, Linux, or Mac OS X computer.
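The same kind of harvesting can be scripted. The following is a minimal sketch of the first step a recursive downloader takes: reading a directory listing and collecting the links that lie beneath it. It assumes the web-server publishes an ordinary HTML index page; the function and class names are our own inventions, and the sketch does no downloading itself.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect links from a directory-listing page that stay below base_url."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip anchors without a target and the column-sorting links
        # ("?C=N;O=D" and the like) that some servers add to listings.
        if href is None or href.startswith("?"):
            return
        url = urljoin(self.base_url, href)
        # Stay below the starting directory; never climb to the parent.
        if url.startswith(self.base_url) and url != self.base_url:
            self.links.append(url)


def links_in_listing(html, base_url):
    """Return the absolute URLs of items listed under base_url."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

A full harvester would fetch each returned URL and repeat the process for any that end in a slash, which is what a recursive download amounts to.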

A potential pitfall for unmediated data is complexity of markup. Collaborative standards-bodies in the sciences and humanities have defined richly featured dialects of XML that can capture and describe innumerable features of a document; SQL-based databases allow intricate structures of tables and relations. Too often, the ability to capture every feature, every relationship in a single document or database is too tempting, and the result is data of such overwhelming complexity as to be useless to any but its creators (if even to them). So the “mediation of complexity” is to be avoided in favor of simplicity and a strict separation of concerns.

Developing web-based end-user applications is not hard either. There are many attractive online presentations of old books, galleries of images of plants, and texts of various kinds. It is the middle that is hard: the mediation that is non-exclusive, simple, and based on citations that resolve. In order to build the kind of applications we want, on the sound scholarly principles we value, we need a digital library infrastructure that addresses all of these needs.

Such an infrastructure exists thanks to the Homer Multitext, a project of the Center for Hellenic Studies of Harvard University under the editorship of Casey Dué and Mary Ebbott. That project brings together a wealth of texts, images, and other data related to the history of Greek epic poetry. Like Botanica Caroliniana, the Homer Multitext (HMT) is a collaborative project covering centuries of history and a vast and diverse body of data.

This infrastructure is called C·I·T·E, and is described in the HMT’s blog and on that project’s website. It is the product of many years of development by many friends of the HMT, but it is mostly dependent on the brilliant insights of Neel Smith, a professor of Classics at The College of the Holy Cross. Its acronym stands for Collections, Indices, Texts, and Extensions. Collections are bodies of data-objects—volumes, specimens, &c. A Text is a text, such as Mark Catesby’s 1754 Natural History of Carolina, Florida, and the Bahama Islands, or Carolus Linnaeus’ Species Plantarum.* An Extension adds specific functionality to Collections of particular types of data; for our purposes, the main Extension is for Collections of images, because images are a type of data for which we have special expectations.

Each object in a Collection is uniquely identified, and that identifier, structured as a Uniform Resource Name (URN), is all that is required to retrieve an object in a collection. Texts are cited similarly by a URN, which can be general, pointing to a whole work, or highly specific, pointing to a particular passage in a work.
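To make the idea of general and specific citation concrete, here is a minimal sketch. It assumes a colon-delimited text URN of the shape urn:cts:namespace:work:passage, in which the passage is optional; that shape, the example URN for the Iliad lines quoted at the top of this post, and the function names are assumptions made for illustration, not a statement of the C·I·T·E specification.

```python
from typing import NamedTuple, Optional


class TextUrn(NamedTuple):
    namespace: str
    work: str
    passage: Optional[str]  # None when the URN cites the whole work


def parse_text_urn(urn: str) -> TextUrn:
    """Split a text URN of the assumed shape into its components."""
    parts = urn.split(":")
    if len(parts) not in (4, 5) or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"not a text URN of the assumed shape: {urn!r}")
    passage = parts[4] if len(parts) == 5 and parts[4] else None
    return TextUrn(parts[2], parts[3], passage)


def containing_work(urn: str) -> str:
    """Drop the passage component: move from the specific to its context."""
    parsed = parse_text_urn(urn)
    return f"urn:cts:{parsed.namespace}:{parsed.work}"


# Assumed example: the simile from Iliad 16 quoted above.
specific = "urn:cts:greekLit:tlg0012.tlg001:16.767-16.773"
print(parse_text_urn(specific).passage)  # 16.767-16.773
print(containing_work(specific))         # urn:cts:greekLit:tlg0012.tlg001
```

The point of the sketch is that the citation itself carries enough structure to move from a passage to the work that contains it, with no reference to where or how the text is stored.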

The “I” in C·I·T·E is for Indices, the primary and simple linking mechanism. Indices are very, very simple data-structures: pairs of URNs. An index can thus link Data-objects, Images, and textual passages, in any combination. Whatever complexity comes to exist in a digital library is constructed out of pairs of citations.
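An index of this kind is small enough to sketch in a few lines. In the sketch below an index is nothing more than a list of URN pairs, and a lookup returns everything paired with a given URN. The URNs are hypothetical placeholders invented for the example, not identifiers from any real collection.

```python
from typing import List, Tuple

# An index is just a list of (URN, URN) pairs.
Index = List[Tuple[str, str]]


def linked_to(index: Index, urn: str) -> List[str]:
    """Return every URN paired with `urn`, in either direction."""
    found = []
    for left, right in index:
        if left == urn:
            found.append(right)
        elif right == urn:
            found.append(left)
    return found


# Hypothetical URNs, for illustration only.
specimen_index: Index = [
    ("urn:cite:example:specimens.1", "urn:cite:example:images.12"),
    ("urn:cite:example:specimens.1", "urn:cite:example:labels.3"),
    ("urn:cite:example:labels.3", "urn:cts:example:notes.transcription:1"),
]

print(linked_to(specimen_index, "urn:cite:example:specimens.1"))
# ['urn:cite:example:images.12', 'urn:cite:example:labels.3']
```

Following pairs from one index to the next (specimen to label, label to transcribed text) is how richer structures are assembled without any single complicated document.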

For each of the four kinds of information in a C·I·T·E library there is a Service that allows discovery, querying, and retrieval based on URN citations. These services are technology-agnostic, because they are defined by relatively simple vocabularies of requests, which yield relatively simple structured responses. Out of these requests and responses we can build end-user applications. Another project indebted to the Homer Multitext is the publication of biblical manuscripts from Lichfield Cathedral. These are exposed via a web-application that draws all of its data through C·I·T·E services. Some of the data resides in South Carolina, some in Houston, Texas, but it is linked through indices and aggregated for human readers.
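A request to such a service can be as plain as a web address with two parameters: what is wanted, and the URN of the thing wanted. The sketch below builds one. The service address is a placeholder, and the request name "GetPassage" and the parameter names "request" and "urn" are assumptions for illustration; the documentation of the services themselves is the authority on the actual vocabulary.

```python
from urllib.parse import urlencode


def build_request(service_url: str, request: str, urn: str) -> str:
    """Compose a request URL from a service address, a request name, and a URN."""
    query = urlencode({"request": request, "urn": urn})
    return f"{service_url}?{query}"


# Placeholder address and assumed request name, for illustration only.
url = build_request(
    "http://example.org/texts",
    "GetPassage",
    "urn:cts:greekLit:tlg0012.tlg001:16.767",
)
print(url)
```

Because the whole exchange is a URL going out and structured text coming back, the service behind it can be rewritten in any technology without breaking the applications built on it.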

[update: the services linked in this paragraph have been superseded] The current implementations of these services rely heavily on resources made available by Google, particularly Google AppEngine and Google Fusion Tables. The Collections that drive the Biblical Manuscripts from Lichfield Cathedral application are in Fusion Tables here. The texts on which that application draws are in a Google AppEngine implementation of the C·I·T·E Text Service here. For these collections and texts, a small Java application receives requests defined by the C·I·T·E protocol, draws data from Google, and returns predictable results.

We have begun building our own data model for Botanica Caroliniana with a few Fusion Tables that describe the volumes of Mark Catesby’s collections and his writings on the natural history of our region, from the first half of the 18th century. We will follow with other collections: of specimens, of their labels, and of our collaborators’ own comments. We will post updates here!

— Christopher Blackwell

* I will talk about the difference between a Text and a Collection in a future post.