Friday, July 27, 2012

An African Melon in South Carolina

In Oxford’s Sherard collection of Mark Catesby specimens lurks this tidbit, Sher-2195, a tendrilled vine with heavily dissected leaves, characteristic of the cucumber family, Cucurbitaceae. Patrick has identified it as Momordica charantia L., commonly known as balsam pear, balsam apple, bitter melon, and bitter gourd. The common names are a bit confusing because a related species, Momordica balsamina, also goes by most of them. No matter.

The notes on the specimen page read: “Bryonia.
Cucumis parvus Marianus, Bryonia alba foliis minoribus, polycarpus Pluk. Manu (?) 59” and “Mr. Catesby S. Carolina USA(?) from the upper part of the country.” The fact that this is in the Sherard collection in Oxford suggests that this plant was collected by Mark Catesby in South Carolina in 1722 or so. The note suggests that he found it some distance from the coast.

But this plant is not native to North America. Alan Weakley says that the vines of the genus Momordica are native to the Old World tropics. Momordica charantia is a native of Africa. Weakley’s flora contains no distribution map for Momordica charantia (though it does for M. balsamina), and notes only that the species has been found recently in the Panhandle of Florida. The USDA Plants distribution map doesn’t have any record of this species occurring in South Carolina.

So Mark Catesby cut a specimen of this plant in South Carolina in 1722. How did it get there?

Charleston, SC, was founded in 1670. It was a major port, and one of the points in North America where ships from Africa unloaded their cargoes of slaves and African plants. Could an African cucurbit make its way from Charleston to the “upper part of the country” by 1720? Fifty years is a long time. Was the plant in a settler’s garden? Did someone bring seeds from Europe?

Momordica was a known garden plant by the early 1800s. Thomas Jefferson planted balsam apple, apparently M. balsamina, in his garden at Monticello in 1810. The Monticello website claims that M. balsamina was introduced into Europe by 1568. (The source for this claim is the book Flowers and Herbs of Early America by Lawrence D. Griffith and Barbara Temple Lombardi, Yale University Press, 2008. I need to check this out of the Clemson library as soon as I head back to campus. I'm dying to know who brought it to Europe, where they got it, and how we know this.) Anyway, the plant had apparently been used as a medicine in Europe for over two centuries and was attractive enough for Jefferson to consider the seeds worth acquiring and planting in his annual bed.

Balsam apples appear in 18th and 19th century American paintings. The Pope Brown Collection of South Carolina Natural History contains this depiction of a Balsam Apple, either Momordica balsamina or Momordica charantia, painted c. 1765-1775 – not so long after Catesby. The Metropolitan Museum of Art has “Still Life: Balsam Apple and Vegetables,” an oil painting done by James Peale of Maryland. So there were definitely Momordica growing on the East Coast between 1765 and 1820.

Eat The Weeds reports that M. charantia occurs from Connecticut south to Florida and west to Texas, as well as in parts south. It’s all over Florida today. This website comfortingly informs me that no one knows where it came from originally.

M. charantia certainly has a global distribution today. It’s a common vegetable throughout Asia, Africa, South America and the Caribbean. (Here's a nice botanical illustration done by a Japanese high school student for the Tsukuba Botanical Garden.) Despite its reputation for bitterness and the fact that it is poisonous if eaten raw, it apparently enhances a variety of dishes. It has also been used for centuries as a folk remedy for all manner of ailments, and today is the subject of numerous studies of its pharmacological properties. Some experts think it might be useful for controlling diabetes.

Thursday, July 26, 2012

Corpus Botany

Michael Dosmann, Curator of Living Collections at the Arnold Arboretum at Harvard University, gave this sage advice to anyone working to manage a collection of botanical data: “Don’t spend your life chasing taxonomy!” The world of botanical taxonomy is endlessly complex and dynamic, changing rapidly from month to month. It is a global effort to build a single hierarchical tree that captures reality, with contributors working from different directions using different and evolving techniques to understand a body of data that is expanding with new discoveries and shrinking from anthropogenic changes to the planet. It is built on a foundation that began with Theophrastus in the 4th Century BCE and was canonized by the 17th and 18th century natural philosophers of Europe, but this traditional foundation is being bent rather violently to accommodate three subsequent centuries of new understanding.

The traditional taxonomic ladder is captured in the mnemonic “King Philip Came Over For Good Sex” (shout-out to XKCD): Kingdom, Phylum, Class, Order, Family, Genus, Species.

But the Integrated Taxonomic Information System (ITIS) presents online users with the following: Kingdom, Subkingdom, Infrakingdom, Division, Subdivision, Infradivision, Superclass, Class, Subclass, Superorder, Order, Suborder, Family, Subfamily, Tribe, Subtribe, Genus… (I cut it off at the Genus level, since the point was made).

This list clearly shows an ongoing process of rebuilding-the-ship-as-it-sails, shoehorning sub- and super-categories into the ladder in order to reflect a growing understanding of increasing complexity.

Hence Dosmann’s advice: You can’t wait for this to get sorted out before getting down to work.

For Botanica Caroliniana we want to collect and juxtapose useful data on the history of botanical science. We are not in a hurry and are willing to take the time to work methodically, to separate concerns, to recognize that the underlying data is more important than an immediate, glossy online presentation. But we don’t want to wait forever.

And we need to name plants. These names must be unique, stable, and machine-actionable. Linnaean binomials are pretty good, and traditional. They are supposed to be unique. They are not stable, as further study will inevitably split species and rearrange genera; ITIS and IPNI (the Integrated Plant Names Index) will happily provide countless synonyms for any given Linnaean binomial. They are certainly not machine-actionable.

For our digital library we need machine-actionable identifiers that we can use now. They need to be unique and stable within the digital library, while accommodating subsequent changes to the scientific reality of the objects to which they point. Here we can borrow from the disciplines of information science and corpus linguistics.

Corpus Linguistics. It is extremely difficult to make assertions about “how the English Language works”; people keep saying new things, keep changing how they speak, keep encountering new situations that need new words and new constructions. It is much, much easier to make assertions about “the language of New York Times reporting from 1941 - 1945”: How did the NYT refer to the enemies and allies of the United States? What verbs did they use for military victory and defeat, for casualty figures, to describe economic hardships at home? Answers to those questions are easier to formulate, and can be assessed in the context of the explicitly defined corpus. Many answers that are valid for one corpus would be invalid for another: racial slurs that were acceptable to the NYT in 1942 would never be allowed in print today; the language describing the Soviet Union (I bet) changed dramatically between 1943 and 1947. Corpus linguistics allows us to study real phenomena within defined constraints that make intractable datasets manageable.

Namespaces and Arbitrary Identifiers. Information scientists deal with large numbers of things. Amazon sells billions of products; they need to keep track of those products, to share information about them through the digital medium that the company inhabits. Under these circumstances it is immediately obvious that the acts of identification and description must be separate. This is not complicated: each product has a unique, machine-actionable ID, which points to a body of data that includes description, price, reader reviews, and so forth.

Digital librarians have a greater challenge than online merchants, since their IDs need to survive in the wild, outside the confines of a particular database. The “Rachael Ray 1.5 Quart Whistling Teakettle” has an Amazon ID of 1343145892. That ID, elsewhere on the Internet, points to a product in a Japanese cosmetics catalogue, a Seller Profile, and a team-building event for Western Union Employees, to name a few.

The answer for a digital library is to use Namespaces. There are no doubt billions of digital objects in the online universe with an ID of 19. But there is only one with an ID that is urn:cite:botcar:sloane.19. That is, “a URN using the CITE protocol, in the BOTCAR namespace, in the Sloane collection, number 19.”
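
The way such an identifier breaks down can be made concrete with a minimal Python sketch. The `CiteUrn` class and `parse_cite_urn` function are hypothetical names invented for this illustration, not part of any CITE software, and the sketch assumes only the simple shape read out above: four colon-separated parts, with the collection and object number joined by a dot.

```python
from typing import NamedTuple


class CiteUrn(NamedTuple):
    """The parts of a URN such as urn:cite:botcar:sloane.19 (hypothetical helper)."""
    namespace: str   # e.g. "botcar"
    collection: str  # e.g. "sloane"
    object_id: str   # e.g. "19"


def parse_cite_urn(urn: str) -> CiteUrn:
    """Split a URN of the assumed form urn:cite:NAMESPACE:COLLECTION.OBJECT."""
    parts = urn.split(":")
    if len(parts) != 4 or parts[0] != "urn" or parts[1] != "cite":
        raise ValueError(f"not a simple CITE URN: {urn!r}")
    # Split on the first dot only, so an object id may itself contain dots.
    collection, sep, object_id = parts[3].partition(".")
    if not sep or not collection or not object_id:
        raise ValueError(f"expected COLLECTION.OBJECT in {urn!r}")
    return CiteUrn(parts[2], collection, object_id)


parsed = parse_cite_urn("urn:cite:botcar:sloane.19")
print(parsed.namespace, parsed.collection, parsed.object_id)  # botcar sloane 19
```

A bare "19" is rejected by this parser, which is the point: outside its namespace the number identifies nothing in particular.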

Corpus Botany. For Botanica Caroliniana, we give each object of our interest a URN identifier: herbaria, folios in herbaria, specimens on folios, digital images of folios, and the notional species which these represent.

These species URNs are the glue that holds this digital library together. So the species Acer negundo L. has a URN: urn:cite:botcar:species.Acernegundo. The last element of the URN is somewhat human-readable, but it is worth emphasizing that this is an ID, an arbitrary identifier, and nothing more. As a data-object, urn:cite:botcar:species.Acernegundo identifies a notional species that we can label, for human readers, Acer negundo L., that we can supply with bibliography, or that we can link to an ITIS record (TSN serial no. 28749).

The important thing, though, is that we can build our digital library simply by creating a graph of URNs, with each URN maintaining a strict separation of concerns. A specimen (urn:cite:botcar:sloane.422) appears on a folio (urn:cite:fufolio:CatesbyHS212.12) and is an example of a species (urn:cite:botcar:species.Acernegundo), which belongs to a genus (urn:cite:botcar:genera.Acer), which in turn belongs to a family (urn:cite:botcar:family.Sapindaceae); the folio (urn:cite:fufolio:CatesbyHS212.12) is illustrated by a digital image (urn:cite:fufolioimg:Caroliniana.Catesby_HS212_012_0493), and the specimen itself is illustrated by a region of interest on that image (urn:cite:fufolioimg:Caroliniana.Catesby_HS212_012_0493:0.423,0.2,0.549,0.715).
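
The sentence above can be written down as a handful of subject-verb-object statements. What follows is a rough sketch rather than the project's actual data model: the URNs are the ones quoted in the paragraph, but the list layout, the verb names `appearsOn`, `isExampleOf` and `isIllustratedBy`, and the helpers `follow` and `family_of` are all assumptions made up for illustration.

```python
# Each statement is (subject URN, verb, object URN).
TRIPLES = [
    ("urn:cite:botcar:sloane.422", "appearsOn", "urn:cite:fufolio:CatesbyHS212.12"),
    ("urn:cite:botcar:sloane.422", "isExampleOf", "urn:cite:botcar:species.Acernegundo"),
    ("urn:cite:botcar:species.Acernegundo", "isMemberOf", "urn:cite:botcar:genera.Acer"),
    ("urn:cite:botcar:genera.Acer", "isMemberOf", "urn:cite:botcar:family.Sapindaceae"),
    ("urn:cite:fufolio:CatesbyHS212.12", "isIllustratedBy",
     "urn:cite:fufolioimg:Caroliniana.Catesby_HS212_012_0493"),
    ("urn:cite:botcar:sloane.422", "isIllustratedBy",
     "urn:cite:fufolioimg:Caroliniana.Catesby_HS212_012_0493:0.423,0.2,0.549,0.715"),
]


def follow(subject, verb, triples=TRIPLES):
    """Return every object URN linked to `subject` by `verb`."""
    return [o for s, v, o in triples if s == subject and v == verb]


def family_of(specimen):
    """Walk specimen -> species -> genus -> family, one hop at a time."""
    (species,) = follow(specimen, "isExampleOf")
    (genus,) = follow(species, "isMemberOf")
    (family,) = follow(genus, "isMemberOf")
    return family


print(family_of("urn:cite:botcar:sloane.422"))
```

Nothing in the traversal looks inside a URN or at a plant name; it only matches whole identifiers, which is what keeps the concerns separate.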

In the above sentences, every noun is represented by a URN, and each verb can be as well. This is the subject of a subsequent posting on this blog.

For now, it is enough to say that we do not have to capture the entire taxonomic world in order to build a useful digital library of historical scientific data. We can give URNs to the objects of our concern, and only to those objects. Because our linking mechanisms are arbitrary identifiers, the structure of our digital library can accommodate advances in scientific understanding easily. If genus Acer is split in two, and Acer negundo suddenly belongs to a new genus, we need make only one change to the hundreds of thousands of points in our graph.* We change one entry from:

urn:cite:botcar:species.Acernegundo isMemberOf urn:cite:botcar:genera.Acer

to:

urn:cite:botcar:species.Acernegundo isMemberOf urn:cite:botcar:genera.SomeNewGenus
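
As a small sketch of that single change, suppose the graph is held as a list of (subject, verb, object) statements. The list layout, the `reparent` helper and the `isExampleOf` verb are assumptions for illustration only; `SomeNewGenus` is the same placeholder used above.

```python
def reparent(triples, child, old_parent, new_parent):
    """Return a new list in which child's isMemberOf link points at new_parent.

    Also returns how many statements were rewritten.
    """
    updated, changed = [], 0
    for subject, verb, obj in triples:
        if (subject, verb, obj) == (child, "isMemberOf", old_parent):
            obj = new_parent
            changed += 1
        updated.append((subject, verb, obj))
    return updated, changed


graph = [
    ("urn:cite:botcar:sloane.422", "isExampleOf", "urn:cite:botcar:species.Acernegundo"),
    ("urn:cite:botcar:species.Acernegundo", "isMemberOf", "urn:cite:botcar:genera.Acer"),
    ("urn:cite:botcar:genera.Acer", "isMemberOf", "urn:cite:botcar:family.Sapindaceae"),
]

graph, changed = reparent(
    graph,
    child="urn:cite:botcar:species.Acernegundo",
    old_parent="urn:cite:botcar:genera.Acer",
    new_parent="urn:cite:botcar:genera.SomeNewGenus",
)
print(changed)  # 1
```

The specimen statement is untouched: it points at the species URN, so it never needed to know which genus the species belonged to.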

If Acer is renamed by some authority, but remains a member of Sapindaceae and the parent of Acer negundo, then we are under no obligation to change our URN, which can remain “the genus is referred to in Botanica Caroliniana as urn:cite:botcar:genera.Acer”. We can and should update any data fields pointed to by that URN to reflect the new scientific reality.
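
The renaming case can be sketched the same way, keeping descriptive data in a record keyed by the URN. The field name `label` and the new name `RenamedGenus` are hypothetical, chosen only to show that the links are left alone.

```python
# Descriptive data keyed by URN; the field name "label" is illustrative only.
records = {
    "urn:cite:botcar:genera.Acer": {"label": "Acer"},
}

# Links between objects refer to URNs only, never to labels.
links = [
    ("urn:cite:botcar:species.Acernegundo", "isMemberOf", "urn:cite:botcar:genera.Acer"),
    ("urn:cite:botcar:genera.Acer", "isMemberOf", "urn:cite:botcar:family.Sapindaceae"),
]
links_before = list(links)

# A hypothetical renaming: only the human-readable label changes.
records["urn:cite:botcar:genera.Acer"]["label"] = "RenamedGenus"

genus_urn = links[0][2]
print(genus_urn, "->", records[genus_urn]["label"])
```

After the update the URN still ends in "Acer", which looks odd to a human reader but is harmless, since the identifier was never meant to carry meaning.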

So this is what we are calling “corpus botany”: comprehensive coverage of data in a constrained, publicly defined corpus, with separation of concerns, tied together with namespaced, arbitrary identifiers.

The evolving collections of URN identifiers are stored in public Google Fusion Tables:

  • Hortus Siccus Facsimile Model (Google Fusion Table)
  • Catesby Specimen Collection URNs (Google Fusion Table)
  • Botanical Species URNs (Google Fusion Table)
  • Botanical Genus URNs (Google Fusion Table)
  • Botanical Family URNs (Google Fusion Table)

The organization of these URNs into Collections, how we can use them in online publications, and the important topic of how to link them together will be the subject of subsequent posts.

    * Genus Aster is problematic. Alan S. Weakley, in the Flora of the Southern and Mid-Atlantic States, says, “It is now abundantly clear that the traditional, broad circumscription of Aster, as a genus of some 250 species of North America and Eurasia, is untenable.”