Monday, December 12, 2011

Text or Collection?

We care about our data and want it to be as broadly useful and long-lived as possible. One way we are trying to ensure its utility is by capturing and storing it in formats appropriate to its nature. Images are simple. We are creating directories of full-resolution TIFF images, and directories of versions in other formats derived from the original TIFFs: JPGs of different sizes, and Pyramidal TIFFs for delivery to dynamic web-based viewers. These images can come to their viewers through our CITE Image-Service, or interested parties can simply download them from the server, individually or in batches. Other data requires more thought. In this post I want to describe the process by which we determined whether a particular set of data is properly a “text” or a “collection”.

These terms have very specific meanings in the CITE architecture. A “text” is…
…a collection of leaf-nodes consisting of character-data and well-formed XML markup,
…each leaf node having a specified place in a sequence, and
…a specified position in a citation-hierarchy at least one level deep,
…and the collection as a whole taking its place in an ontological hierarchy of text-group, work, edition/translation.
Or, in slightly more general terms, something is a text if it consists of language, if you can cite it as a whole unambiguously, if you can cite its parts unambiguously, and if it is intended to be read in a sequence.
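To make the definition concrete, here is a minimal sketch of a “text” as a data structure. It is not the CITE implementation; the class and method names (LeafNode, Text, next_after) and the sample identifiers are hypothetical, chosen only to mirror the four properties listed above.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# A citation is a path through the citation hierarchy, e.g. ("1", "2").
Citation = Tuple[str, ...]


@dataclass(frozen=True)
class LeafNode:
    citation: Citation  # position in the citation hierarchy
    content: str        # character data, possibly with well-formed XML markup


@dataclass
class Text:
    # Place in the hierarchy of text-group, work, edition/translation.
    text_group: str
    work: str
    version: str
    # List order is the reading sequence.
    nodes: List[LeafNode]

    def __post_init__(self) -> None:
        seen = set()
        for node in self.nodes:
            if len(node.citation) < 1:
                raise ValueError("citation hierarchy must be at least one level deep")
            if node.citation in seen:
                raise ValueError(f"ambiguous citation: {node.citation}")
            seen.add(node.citation)

    def index_of(self, citation: Citation) -> int:
        for i, node in enumerate(self.nodes):
            if node.citation == citation:
                return i
        raise KeyError(citation)

    def get(self, citation: Citation) -> LeafNode:
        return self.nodes[self.index_of(citation)]

    def next_after(self, citation: Citation) -> Optional[LeafNode]:
        """Return the node that follows in reading order, or None at the end."""
        i = self.index_of(citation) + 1
        return self.nodes[i] if i < len(self.nodes) else None


# A fourteen-line poem with a one-level citation scheme (placeholder content).
sonnet = Text(
    text_group="example-poet",
    work="example-sonnet",
    version="example-edition",
    nodes=[LeafNode((str(n),), f"[line {n} of the poem]") for n in range(1, 15)],
)
```

The point of the sketch is that “what comes next?” is a meaningful question for every node of a text, and that each node can be named without ambiguity.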

A “collection” is…
…a group of data-objects each consisting of one or more fields, but each object having the same fields,
…which objects may be in a sequence (an ordered collection), or not (an unordered collection),
…having no citation-hierarchy beyond the object’s identifier, although individual fields may be addressed.
Or, in other words, a dictionary is a “collection”, as is a database of plant-records, a telephone directory, and the digital representations of the folios of a book.
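The same kind of sketch works for a “collection”. Again, this is an illustration under our own assumptions rather than the CITE Collection Service itself: the Collection class, its method names, and the sample directory entries (including the phone numbers) are made up for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Union

Record = Dict[str, str]


@dataclass
class Collection:
    field_names: List[str]   # every object has exactly these fields
    ordered: bool = False    # ordered or unordered collection
    objects: Dict[str, Record] = field(default_factory=dict)

    def add(self, identifier: str, record: Record) -> None:
        if set(record) != set(self.field_names):
            raise ValueError("each object must have the same fields")
        if identifier in self.objects:
            raise ValueError(f"identifier already in use: {identifier}")
        self.objects[identifier] = dict(record)

    def get(self, identifier: str, field_name: Optional[str] = None) -> Union[Record, str]:
        """Cite a whole object by identifier, or address one of its fields."""
        obj = self.objects[identifier]
        return obj if field_name is None else obj[field_name]


# A tiny telephone directory with invented entries.
directory = Collection(field_names=["name", "number"])
directory.add("entry-1", {"name": "John Smith", "number": "555-0100"})
directory.add("entry-2", {"name": "John Smith", "number": "555-0101"})
```

There is no hierarchy here beyond the identifier: an object is found by its identifier and nothing deeper than one of its fields can be addressed.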

Clearly there is some possible overlap and ambiguity. It would be possible to treat a sonnet as an “ordered collection of poetic lines”; it would be possible to deliver a telephone directory as a “text with a one-level deep citation-hierarchy”. You can argue whether a telephone directory or a dictionary is an “ordered” or an “unordered” collection: is alphabetical order inherent, or merely convenient? In the case of the sonnet and phone-book, some reflection on the principal mode of interaction resolves the question: a sonnet is intended to be read (even if an analysis of a sonnet might pull out and quote particular lines), and a phone book is intended for random access (even if we occasionally read down a column to find just which John Smith we want to call).

In the case of Horti Sicci 212 and 232, the herbarium volumes of Mark Catesby that Amy and Patrick have begun recording, and which will be the first volumes from the Sloane to go online, we faced a trickier question. Attached to each specimen are one or more labels, and these labels fall into discrete categories (as Amy described in the previous post): clearly all the labels by Hans Sloane are one set, all the copperplate labels another, and those by Richard Howard yet another. Is each of these sets a “text” or a “collection”?

My first instinct was “text”. We are entirely indebted to the Homer Multitext for our vision of a digital herbarium, and the tools with which to create one. The HMT offers what would seem to be a very clear parallel: the scholia to the poetic text of the Iliad on Byzantine manuscripts. These are marginal notes on the text; there are different categories of notes, and the editors of the HMT have successfully treated each category as a separate “text”. (The full online catalogue of those texts is here). It seemed to make sense to follow this established standard, which the students at Holy Cross, Furman University, and the University of Houston have used to make significant discoveries about the history of Greek epic poetry.


The Venetus A Manuscript of the Iliad:
One folio, many texts
Like the scholia, the labels on Catesby’s specimens offer commentary; like the scholia, the different categories come from different periods of history. Both scholia and labels can be identified unambiguously through citation, and in each case through a one-level-deep citation scheme. In terms of ontology, the labels are considerably less ambiguous, since their authorship is clear; even the anonymous handwritten labels are distinguishable as a set by handwriting, while all of the scholia were copied by the same scribe, and the contents of each set of scholia come from many different ancient sources.

I talked through this question with Neel Smith of the College of the Holy Cross, whose instinct about the nature of knowledge and its digital avatars is unerring. And I now think the collected labels are not texts, but collections. Here’s why. Each scholion has a place in a sequence only because it refers to a specific part of the poem on which it comments. It is worth reading the scholia in order only because we read the Iliad in order. The labels do not comment on a text, but on what is clearly a collection (the collection of Catesby’s specimens), and an unordered collection at that.

So it does not seem that we do violence to the labels by considering them members of collections. At this point, we can set aside ontological rigor and ask questions of convenience.

Catesby, H.S. 212, page 4:
One folio, many collections
Will anyone be likely to want to read each of Richard Howard’s labels, in order, in isolation from the specimens? Probably not. Will it be convenient to have these labels’ contents exposed through a collection-service? Probably, since the collection-service emphasizes query-and-retrieval, while the text-service emphasizes location-and-scrolling. Other advantages? Yes: the CITE Collection Service draws data from Google Fusion Tables, and the data in those tables is easily edited in place, while the CITE Text-service draws data from XML files tabulated and stored in Google BigTable, so editing a text is a more laborious process. For the near- and medium-term future, transcriptions of these labels are likely to be subject to more editing even as they are exposed to public view.
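The difference between the two modes of access can be sketched in a few lines. This is only an illustration: the records, their field names, and the query and scroll functions are hypothetical placeholders, not real transcriptions and not the API of either CITE service.

```python
from typing import Dict, List, Sequence

Label = Dict[str, str]

# Placeholder records; the identifiers and transcriptions are invented.
LABELS: List[Label] = [
    {"id": "label-1", "category": "Sloane", "specimen": "specimen-a",
     "transcription": "[placeholder transcription 1]"},
    {"id": "label-2", "category": "Howard", "specimen": "specimen-a",
     "transcription": "[placeholder transcription 2]"},
    {"id": "label-3", "category": "Howard", "specimen": "specimen-b",
     "transcription": "[placeholder transcription 3]"},
]


def query(labels: Sequence[Label], **criteria: str) -> List[Label]:
    """Query-and-retrieval: return every object whose fields match."""
    return [
        label for label in labels
        if all(label.get(name) == value for name, value in criteria.items())
    ]


def scroll(passages: Sequence[str], start: int, count: int) -> List[str]:
    """Location-and-scrolling: go to a position and read onward in sequence."""
    return list(passages[start:start + count])


howard_labels = query(LABELS, category="Howard")
labels_on_specimen_a = query(LABELS, specimen="specimen-a")
```

For the labels, the useful questions are of the first kind (“which labels are attached to this specimen?”, “which labels belong to this category?”), which is why the collection model suits them.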

What advantages did the editors of the Homer Multitext gain from decreeing the scholia to be Texts rather than Collections? Internal markup, more than anything. The scholia are in Greek; they are argumentative; they discuss linguistic features of the poem, geographical places, personal names, and often quote from other literature. All of these are features that are inherent to the content of the scholia, and varied enough to justify internal markup of an XML text. Our botanical labels offer some of these features, but each label is short and likely to contain at most a few external references (place-names, cross-references, vel sim.), so we can better capture that information through indexing.

So this is simply one example of the kind of thinking that occupies us as we plan and proceed to publish these snapshots of botanical history.