ὡς δ᾽ Εὖρός τε Νότος τ᾽ ἐριδαίνετον ἀλλήλοιιν
οὔρεος ἐν βήσσῃς βαθέην πελεμιζέμεν ὕλην
φηγόν τε μελίην τε τανύφλοιόν τε κράνειαν,
αἵ τε πρὸς ἀλλήλας ἔβαλον τανυήκεας ὄζους
ἠχῇ θεσπεσίῃ, πάταγος δέ τε ἀγνυμενάων,
ὣς Τρῶες καὶ Ἀχαιοὶ ἐπ᾽ ἀλλήλοισι θορόντες
δῄουν, οὐδ᾽ ἕτεροι μνώοντ᾽ ὀλοοῖο φόβοιο.
And as the East Wind and the South strive with one another
in shaking a deep wood in the glades of a mountain,
a wood of beech and ash and smooth-barked cornel,
and these dash one against the other their long boughs
with a wondrous din, and there is a crashing of broken branches;
even so the Trojans and Achaeans leapt one upon another
and made havoc, nor would either side take thought of ruinous flight.
Homer, Iliad 16.767-773
Many labels on one specimen in Catesby’s Hortus Siccus
Theophrastus, On Plants, in Greek
Catesby’s Illustration of Hamamelis
And at the moment, we want to deliver these objects to readers through the dominant medium of our day, end-user applications on the world wide web, accessed through personal computers or mobile computing devices.
Unmediated access is straightforward: create images, texts, and other data in documented open formats, and place them in a world-readable directory on a web-server. People can download particular items of interest, or they can use one of many common utilities to download the entire contents of the directory automatically; typing the terminal command “wget -r http://address.of.data/” will accomplish this on any Unix, Linux, or Mac OS X computer where wget is installed.
A potential pitfall for unmediated data is complexity of markup. Collaborative standards-bodies in the sciences and humanities have defined richly featured dialects of XML that can capture and describe innumerable features of a document; SQL-based databases allow intricate structures of tables and relations. Too often, the chance to capture every feature and every relationship in a single document or database proves irresistible, and the result is data of such overwhelming complexity as to be useless to any but its creators (if even to them). So the “mediation of complexity” is to be avoided in favor of simplicity and a strict separation of concerns.
Developing web-based end-user applications is not hard either. There are many attractive online presentations of old books, galleries of images of plants, and texts of various kinds. The hard part is the middle: mediation that is non-exclusive, simple, and based on citations that resolve. In order to build the kind of applications we want, on the sound scholarly principles we value, we need a digital library infrastructure that addresses all of these needs.
Such an infrastructure exists thanks to the Homer Multitext, a project of the Center for Hellenic Studies of Harvard University under the editorship of Casey Dué and Mary Ebbott. That project brings together a wealth of texts, images, and other data related to the history of Greek epic poetry. Like Botanica Caroliniana, the Homer Multitext (HMT) is a collaborative project covering centuries of history and a vast and diverse body of data.
This infrastructure is called C·I·T·E, and is described in the HMT’s blog and on that project’s website. It is the product of many years of development by many friends of the HMT, but it is mostly dependent on the brilliant insights of Neel Smith, a professor of Classics at The College of the Holy Cross. Its acronym stands for Collections, Indices, Texts, and Extensions. Collections are bodies of data-objects—volumes, specimens, &c. A Text is a text, such as Mark Catesby’s 1754 Natural History of Carolina, Florida, and the Bahama Islands, or Carolus Linnaeus’ Species Plantarum.* An Extension adds specific functionality to Collections of particular types of data; for our purposes, the main Extension is for Collections of images, because images are a type of data for which we have special expectations.
Each object in a Collection is uniquely identified, and that identifier, structured as a Uniform Resource Name (URN), is all that is required to retrieve an object in a collection. Texts are cited similarly by a URN, which can be general, pointing to a whole work, or highly specific, pointing to a particular passage in a work.
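To make the difference between a general and a specific text citation concrete, here is a minimal Python sketch. It assumes a colon-delimited layout of the form `urn:cts:<namespace>:<work>[:<passage>]`; the class name, the function name, and the example URNs are chosen for illustration only and are not taken from the project's own data or code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TextUrn:
    """A text citation split into parts (layout assumed for illustration)."""
    namespace: str
    work: str
    passage: Optional[str]  # None when the URN points to the whole work


def parse_text_urn(urn: str) -> TextUrn:
    """Split 'urn:cts:<namespace>:<work>[:<passage>]' into its components."""
    parts = urn.split(":")
    if len(parts) not in (4, 5) or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"not a recognised text URN: {urn!r}")
    # A missing or empty final component means "the whole work".
    passage = parts[4] if len(parts) == 5 and parts[4] else None
    return TextUrn(namespace=parts[2], work=parts[3], passage=passage)


# Illustrative URNs: one general, one pointing to a range of lines.
whole_work = parse_text_urn("urn:cts:greekLit:tlg0012.tlg001")
one_passage = parse_text_urn("urn:cts:greekLit:tlg0012.tlg001:16.767-16.773")
```

Both citations share the same `work` component; only the second carries a `passage`, so an application can tell from the identifier alone how much text is being requested.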
The “I” in C·I·T·E is for Indices, the primary and simple linking mechanism. Indices are very, very simple data-structures: pairs of URNs. An index can thus link Data-objects, Images, and textual passages, in any combination. Whatever complexity comes to exist in a digital library is constructed out of pairs of citations.
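As a rough illustration of how little structure an index needs, the Python sketch below stores an index as a plain list of URN pairs and derives a lookup table from it. The identifiers are hypothetical, invented for this example, and treating each pair as navigable in both directions is an assumption of the sketch rather than something the C·I·T·E design is known to require.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

UrnPair = Tuple[str, str]


def build_lookup(pairs: Iterable[UrnPair]) -> Dict[str, List[str]]:
    """Map every URN to the URNs it is paired with, in both directions."""
    lookup: Dict[str, List[str]] = defaultdict(list)
    for left, right in pairs:
        lookup[left].append(right)
        lookup[right].append(left)
    return dict(lookup)


# Hypothetical identifiers, invented for this example.
SPECIMEN = "urn:cite:example:specimens.1"
IMAGE = "urn:cite:example:images.1"
PASSAGE = "urn:cts:example:catesby.history:1.1"

index_pairs = [
    (SPECIMEN, IMAGE),  # a specimen and an image of it
    (IMAGE, PASSAGE),   # that image and a passage of text about it
]

lookup = build_lookup(index_pairs)
```

Here `lookup[IMAGE]` holds both the specimen and the passage, so the image serves as the hinge between an object and a text even though no single record mentions all three.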
For each of the four kinds of information in a C·I·T·E library there is a Service that allows discovery, querying, and retrieval based on URN citations. These services are technology-agnostic, because they are defined by relatively simple vocabularies of requests, which yield relatively simple structured responses. Out of these requests and responses we can build end-user applications. Another project indebted to the Homer Multitext is the publication of biblical manuscripts from Lichfield Cathedral. These are exposed via a web-application that draws all of its data through C·I·T·E services. Some of the data resides in South Carolina, some in Houston, Texas, but it is linked through indices and aggregated for human readers.
[update: the services linked in this paragraph have been superseded] The current implementations of these services rely heavily on resources made available by Google, particularly Google AppEngine and Google Fusion Tables. The Collections that drive the Biblical Manuscripts from Lichfield Cathedral application are in Fusion Tables here. The texts on which that application draws are in a Google AppEngine implementation of the C·I·T·E Text Service here. For these collections and texts, a small Java application receives requests defined by the C·I·T·E protocol, draws data from Google, and returns predictable results.
We have begun building our own data model for Botanica Caroliniana with a few Fusion Tables that describe the volumes of Mark Catesby’s collections and his writings on the natural history of our region, from the first half of the 18th century. We will follow with other collections: of specimens, of their labels, and of our collaborators’ own comments. We will post updates here!
— Christopher Blackwell
* I will talk about the difference between a Text and a Collection in a future post.