Thursday, July 26, 2012

Corpus Botany

Michael Dosmann, Curator of Living Collections at the Arnold Arboretum at Harvard University, gave this sage advice to anyone working to manage a collection of botanical data: “Don’t spend your life chasing taxonomy!” The world of botanical taxonomy is endlessly complex and dynamic, changing rapidly from month to month. It is a global effort to build a single hierarchical tree that captures reality, with contributors working from different directions using different and evolving techniques to understand a body of data that is expanding with new discoveries and shrinking from anthropogenic changes to the planet. It is built on a foundation that began with Theophrastus in the 4th Century BCE and was canonized by the 17th and 18th century natural philosophers of Europe, but this traditional foundation is being bent rather violently to accommodate three subsequent centuries of new understanding.

The traditional taxonomic ladder is captured in the mnemonic “King Philip Came Over For Good Sex” (shout-out to XKCD): Kingdom, Phylum, Class, Order, Family, Genus, Species.

But the Integrated Taxonomic Information System (ITIS) presents online users with the following: Kingdom, Subkingdom, Infrakingdom, Division, Subdivision, Infradivision, Superclass, Class, Subclass, Superorder, Order, Suborder, Family, Subfamily, Tribe, Subtribe, Genus… (I cut it off at the Genus level, since the point was made).

This list clearly shows an ongoing process of rebuilding-the-ship-as-it-sails, shoehorning sub- and super-categories into the ladder in order to reflect a growing understanding of increasing complexity.

Hence Dosmann’s advice: You can’t wait for this to get sorted out before getting down to work.

For Botanica Caroliniana we want to collect and juxtapose useful data on the history of botanical science. We are not in a hurry and are willing to take the time to work methodically, to separate concerns, to recognize that the underlying data is more important than an immediate, glossy online presentation. But we don’t want to wait forever.

And we need to name plants. These names must be unique, stable, and machine-actionable. Linnaean binomials are pretty good, and traditional. They are supposed to be unique. They are not stable, as further study will inevitably split species, rearrange genera; ITIS and IPNI (the Integrated Plant Names Index) will happily provide countless synonyms for any given Linnaean binomial. They are certainly not machine-actionable.

For our digital library we need machine-actionable identifiers that we can use now. They need to be unique and stable within the digital library, while accommodating subsequent changes to the scientific reality of the objects to which they point. Here we can borrow from the disciplines of information science and corpus linguistics.

Corpus Linguistics. It is extremely difficult to make assertions about “how the English Language works”; people keep saying new things, keep changing how they speak, keep encountering new situations that need new words and new constructions. It is much, much easier to make assertions about “the language of New York Times reporting from 1941 - 1945”: How did the NYT refer to the enemies and allies of the United States? What verbs did they use for military victory and defeat, for casualty figures, to describe economic hardships at home? Answers to those questions are easier to formulate, and can be assessed in the context of the explicitly defined corpus. Many answers that are valid for one corpus would be invalid for another: racial slurs that were acceptable to the NYT in 1942 would never be allowed in print today; the language describing the Soviet Union (I bet) changed dramatically between 1943 and 1947. Corpus linguistics allows us to study real phenomena within defined constraints that make intractable datasets manageable.

Namespaces and Arbitrary Identifiers.  Information scientists deal with large numbers of things. Amazon.com sells billions of products; they need to keep track of those products, to share information about them through the digital medium that the company inhabits. Under these circumstances it is immediately obvious that the acts of identification and description must be separate. This is not complicated: each product has a unique, machine-actionable ID, which points to a body of data that includes description, price, reader reviews, and so forth.

Digital librarians have a greater challenge than online merchants, since their IDs need to survive in the wild, outside the confines of a particular database. The “Rachael Ray 1.5 Quart Whistling Teakettle” has an Amazon ID of 1343145892. That ID, elsewhere on the Internet, points to a product in a Japanese cosmetics catalogue, a Seller Profile on Ebay.com, and a team-building event for Western Union employees, to name a few.

The answer for a digital library is to use Namespaces. There are no doubt billions of digital objects in the online universe with an ID of 19. But there is only one with an ID that is urn:cite:botcar:sloane.19. That is, “a URN using the CITE protocol, in the BOTCAR namespace, in the Sloane collection, number 19.”
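To make the reading of that URN concrete, here is a minimal Python sketch that splits a URN of this shape into the parts named above. The class name CiteUrn, the function name parse_cite_urn, and the field names are hypothetical, chosen only for illustration; this is not the CITE protocol's reference parser, and it assumes the simple colon-and-dot layout seen in the examples in this post.

```python
from typing import NamedTuple


class CiteUrn(NamedTuple):
    """The three parts of a URN that follow the fixed 'urn:cite:' prefix."""
    namespace: str   # e.g. "botcar"
    collection: str  # e.g. "sloane"
    object_id: str   # e.g. "19"


def parse_cite_urn(urn: str) -> CiteUrn:
    """Split 'urn:cite:<namespace>:<collection>.<object>' into its parts.

    Assumption: components are separated by colons, and the collection
    is separated from the object ID by the first dot in the fourth
    component. Anything after a further colon is ignored here.
    """
    parts = urn.split(":")
    if len(parts) < 4 or parts[0] != "urn" or parts[1] != "cite":
        raise ValueError(f"not a CITE-style URN: {urn!r}")
    collection, dot, object_id = parts[3].partition(".")
    if not dot or not collection or not object_id:
        raise ValueError(f"expected '<collection>.<object>' in {urn!r}")
    return CiteUrn(parts[2], collection, object_id)
```

Calling parse_cite_urn("urn:cite:botcar:sloane.19") yields the namespace "botcar", the collection "sloane", and the object ID "19", while a bare "19" is rejected. The bare number means nothing until the namespace and collection qualify it.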

Corpus Botany. For Botanica Caroliniana, we give each object of our interest a URN identifier: herbaria, folios in herbaria, specimens on folios, digital images of folios, and the notional species which these represent.

These species URNs are the glue that holds this digital library together. So the species Acer negundo L. has a URN: urn:cite:botcar:species.Acernegundo. The last element of the URN is somewhat human-readable, but it is worth emphasizing that this is an ID, an arbitrary identifier, and nothing more. As a data-object, urn:cite:botcar:species.Acernegundo identifies a notional species that we can label, for human readers, Acer negundo L., that we can supply with bibliography, or that we can link to an ITIS record (TSN serial no. 28749).

The important thing, though, is that we can build our digital library simply by creating a graph of URNs, with each URN maintaining a strict separation of concerns. A specimen (urn:cite:botcar:sloane.422) appears on a folio (urn:cite:fufolio:CatesbyHS212.12) and is an example of a species (urn:cite:botcar:species.Acernegundo), which belongs to a genus (urn:cite:botcar:genera.Acer), which in turn belongs to a family (urn:cite:botcar:family.Sapindaceae); the folio (urn:cite:fufolio:CatesbyHS212.12) is illustrated by a digital image (urn:cite:fufolioimg:Caroliniana.Catesby_HS212_012_0493), and the specimen itself is illustrated by a region of interest on that image (urn:cite:fufolioimg:Caroliniana.Catesby_HS212_012_0493:0.423,0.2,0.549,0.715).
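The sentence above can be sketched as a small graph of subject-verb-object statements. This is only an illustration in Python, not how Botanica Caroliniana actually stores its data. The URNs are the ones quoted in the text; the verb isMemberOf is the one used later in this post, while appearsOn, isExampleOf, and isIllustratedBy, and the helper function follow, are hypothetical names invented for this sketch.

```python
# Each statement is (subject URN, verb, object URN).
TRIPLES = [
    ("urn:cite:botcar:sloane.422", "appearsOn",
     "urn:cite:fufolio:CatesbyHS212.12"),
    ("urn:cite:botcar:sloane.422", "isExampleOf",
     "urn:cite:botcar:species.Acernegundo"),
    ("urn:cite:botcar:species.Acernegundo", "isMemberOf",
     "urn:cite:botcar:genera.Acer"),
    ("urn:cite:botcar:genera.Acer", "isMemberOf",
     "urn:cite:botcar:family.Sapindaceae"),
    ("urn:cite:fufolio:CatesbyHS212.12", "isIllustratedBy",
     "urn:cite:fufolioimg:Caroliniana.Catesby_HS212_012_0493"),
    ("urn:cite:botcar:sloane.422", "isIllustratedBy",
     "urn:cite:fufolioimg:Caroliniana.Catesby_HS212_012_0493"
     ":0.423,0.2,0.549,0.715"),
]


def follow(start, verbs, triples=TRIPLES):
    """Walk the graph from `start`, following one verb per step.

    Returns every URN reachable by that exact sequence of verbs.
    """
    current = [start]
    for verb in verbs:
        current = [obj
                   for node in current
                   for (subj, v, obj) in triples
                   if subj == node and v == verb]
    return current


# From specimen to species to genus to family.
family = follow("urn:cite:botcar:sloane.422",
                ["isExampleOf", "isMemberOf", "isMemberOf"])
```

Here family holds a single URN, urn:cite:botcar:family.Sapindaceae. The specimen record never mentions the family; that fact is recovered by walking the links, which is what keeping the concerns separate buys us.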

In the above sentences, every noun is represented by a URN, and each verb can be as well. This is the subject of a subsequent posting on this blog.

For now, it is enough to say that we do not have to capture the entire taxonomic world in order to build a useful digital library of historical scientific data. We can give URNs to the objects of our concern, and only to those objects. Because our linking mechanisms are arbitrary identifiers, the structure of our digital library can accommodate advances in scientific understanding easily. If genus Acer is split in two, and Acer negundo suddenly belongs to a new genus, we need make only one change to the hundreds of thousands of points in our graph.* We change one entry from:

urn:cite:botcar:species.Acernegundo isMemberOf urn:cite:botcar:genera.Acer

to 

urn:cite:botcar:species.Acernegundo isMemberOf urn:cite:botcar:genera.SomeNewGenus

If Acer is renamed by some authority, but remains a member of Sapindaceae and the parent of Acer negundo, then we are under no obligation to change our URN, which can remain “the genus is referred to in Botanica Caroliniana as urn:cite:botcar:genera.Acer”. We can and should update any data fields pointed to by that URN to reflect the new scientific reality.
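Both cases can be shown in a few lines of Python. This is a sketch under stated assumptions: the dictionaries, the function names (genus_of, reassign_parent, relabel), and the replacement label "SomeNewName" are all hypothetical, and the real data live in tables rather than in Python dictionaries. The point is only that a split touches one link, and a rename touches one label and no links at all.

```python
SPECIES = "urn:cite:botcar:species.Acernegundo"
GENUS = "urn:cite:botcar:genera.Acer"
SPECIMEN = "urn:cite:botcar:sloane.422"

# Links between URNs (the graph).
is_member_of = {
    SPECIES: GENUS,
    GENUS: "urn:cite:botcar:family.Sapindaceae",
}
is_example_of = {SPECIMEN: SPECIES}

# Human-readable labels, kept apart from the identifiers.
labels = {SPECIES: "Acer negundo L.", GENUS: "Acer"}


def genus_of(specimen):
    """Resolve a specimen's genus by walking specimen -> species -> genus."""
    return is_member_of[is_example_of[specimen]]


def reassign_parent(child, new_parent):
    """Taxonomic split or move: change exactly one link in the graph."""
    is_member_of[child] = new_parent


def relabel(urn, new_label):
    """Rename by an authority: the URN stays, only its label changes."""
    labels[urn] = new_label


# Case 1: the species moves to a (hypothetical) new genus.
reassign_parent(SPECIES, "urn:cite:botcar:genera.SomeNewGenus")

# Case 2: the genus is renamed (hypothetical label); its URN is untouched.
relabel(GENUS, "SomeNewName")
```

After these two calls, every specimen that is an example of the species resolves to the new genus without its own record having been edited, and urn:cite:botcar:genera.Acer is still a valid identifier that now carries an updated label.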

So this is what we are calling “corpus botany”: comprehensive coverage of data in a constrained, publicly defined corpus, with separation of concerns, tied together with namespaced, arbitrary identifiers.

The evolving collections of URN identifiers are stored in public Google Fusion Tables:

  • Hortus Siccus Facsimile Model (Google Fusion Table)
  • Catesby Specimen Collection URNs (Google Fusion Table)
  • Botanical Species URNs (Google Fusion Table)
  • Botanical Genus URNs (Google Fusion Table)
  • Botanical Family URNs (Google Fusion Table)

The organization of these URNs into Collections, how we can use them in online publications, and the important topic of how to link them together, will be the subject of subsequent posts.

    * Genus Aster is problematic. Alan S. Weakley, in the Flora of the Southern and Mid-Atlantic States, says, “It is now abundantly clear that the traditional, broad circumscription of Aster, as a genus of some 250 species of North America and Eurasia, is untenable.”