Over the past few months I have been working on how to represent biological taxonomy and nomenclature using Description Logics. Here I combine these thoughts with a rather naive view of DNA Barcoding to suggest a new approach to taxonomy.
Description Logic (DL) is an extension of frame based languages (such as those used in object orientated programming paradigms) and semantic networks (e.g. WordNet) that links them to first-order predicate logic, thus enabling the representation of application domains in formal, well understood ways that can be reasoned over by machines. DL has come to the fore in recent years with the advent of the Web Ontology Language (OWL) from the World Wide Web Consortium (W3C), two sub-languages of which, OWL-DL and OWL-Lite, are based on DL. Notably these two sub-languages guarantee decidability, i.e. that reasoning completes within finite time. From now on I’ll use the terminology of OWL-DL and OWL-Lite rather than generic DL terms. The OWL terms are more likely to be understood by a general reader who can read the OWL documentation as background. A concept in DL is referred to as a class in OWL. A role in DL is a property in OWL.
There are three principal features within OWL:
- Classes are groups of individuals that belong together typically because they share some properties or property values.
- Individuals are instances of classes.
- Properties are statements of relationships between individuals or from individuals to data values.
There are other features within the language that allow the expression of things such as equivalence, cardinality and the domains and ranges of properties. Using OWL principally involves asserting specialization hierarchies of classes and inferring unknown subclass relationships and class membership using an inference engine such as FaCT++. A set of OWL assertions is frequently referred to as an ontology.
DL languages such as OWL appear to be an ideal way to formally present taxonomic monographs – there appears to be a natural mapping between OWL classes and biological taxa; OWL properties and biological characters; OWL individuals and biological specimens. Such a presentation would allow biological taxonomies to be manipulated by machines. It may be possible to allow inference to occur over multiple, potentially overlapping, classification systems. This may facilitate greater efficiency in capturing, disseminating and re-purposing taxonomic data as well as incorporating it into large scale genomic and climatological analyses.
Here I contend, however, that there isn’t a natural fit between DL and biological taxonomy and that this is informative of the nature of biological taxonomy as it has been carried out to date. OWL has been conceived as a language to formally describe any application domain with the rigour of knowing that the resultant model can be reasoned over. Biological monographs have evolved as a mechanism for semi-formally describing the domain of living organisms but without the logical rigour of ensuring decidability.
Class Membership – Why Taxonomists Can Argue.
In OWL individuals can be asserted as being a member of a class using the “rdf:type” predicate. In taxonomy an author will likewise assert that particular specimens are members of a taxon. Indeed, by naming a taxon the author is asserting that at least one specimen, the holotype, must be a member of the taxon under the rules of the ICBN and ICZN. It is also considered good practice for a taxonomist to cite all the specimens examined and assign each to one of the taxa described in the monograph.
The other way to decide class membership is via intension (also known as connotation), as opposed to the extension that results from listing members one by one. In OWL, classes are defined by the restriction and equivalent class constructs. It could be asserted, for example, that a class consists of “all specimens with four white petals”, meaning that anything that is a specimen and has four petals that are white is a member of this class and anything that is not a specimen, or does not have four white petals, is not a member of the class. Likewise in taxonomy an author provides a description of the taxon in words and specimens that fit the description are deemed to be members of the taxon.
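The two routes to class membership can be sketched in a few lines of Python. This is only an illustration of the logic, not OWL itself: the `Specimen` record, its fields and the specimen identifiers are hypothetical names invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Specimen:
    # Hypothetical record; the field names are assumptions for this sketch.
    identifier: str
    petal_count: int
    petal_colour: str


# Membership by assertion: the author lists the members one by one,
# much as a monograph cites the specimens examined.
asserted_members = {"E001", "E002"}


def is_member_by_assertion(specimen: Specimen) -> bool:
    return specimen.identifier in asserted_members


# Membership by description: necessary and sufficient conditions,
# here "all specimens with four white petals".
def is_member_by_description(thing: object) -> bool:
    return (
        isinstance(thing, Specimen)
        and thing.petal_count == 4
        and thing.petal_colour == "white"
    )


cited = Specimen("E001", 4, "white")
uncited = Specimen("E003", 5, "white")
```

Here `cited` passes both tests, while `uncited` fails the description because it has five petals. Anything that is not a `Specimen` at all fails the description regardless of its other properties.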
Unfortunately taxonomic descriptions are not as rigorous as OWL class descriptions. It is rarely stated clearly which characteristics, if any, are mandatory for a specimen to be a member of a taxon. Typically the user of a monograph will interpret the meaning of the description in combination with an examination of the cited specimens or illustrations of them and, most importantly, in the light of their own experience. Even diagnostic keys are used primarily as discovery mechanisms that help the user find a subset of taxa for consideration.
This use of both taxon descriptions and exemplar cited specimens is crucial. It means that the taxonomic monograph can never be inferred over by a machine in the same way it can be done by a human. During the process of examining cited specimens the human user builds up their own, personal taxon description that is dependent on their skills and experience. They then use this description to decide which individual specimens are members of the taxon. This is a remarkable process as it allows experts to carry out tasks such as identifying fragmentary materials but it also allows experts to have contradictory opinions.
For a machine to carry out a similar process it would be necessary to extract a new class description from the cited specimens as the human does. If the specimens have been scored for each of the properties in the ontology then the automatically extracted description should be the same as the original description and nothing more. Indeed if the specimens didn’t already fit the description then the ontology would have been invalid. What the machine can’t do, and the human user can, is take into account additional properties that weren’t considered in the first place.
If we accept that a machine can’t do what a human can do in this process we need to think carefully about what we are trying to do in the field of biodiversity informatics. Why should we dedicate resources to databasing historical taxonomy beyond indexing the text if it will never be machine interpretable? Shouldn’t we ascertain that the results of our work are useful before we dedicate the time and effort? If information presented in the literature may only ever be useful when interpreted by humans then it only need be imaged and indexed, not databased.
The notion that we can scale taxonomy by increased use of information technology is erroneous, at least in part, if there will always be a bottleneck of overworked taxonomic experts.
If the products of biological taxonomy are to be integrated into large scale analyses, such as those required for climate modeling, then they need to change so that they are logically rigorous descriptions of the world rather than principally literary works. A major re-think of why we do taxonomy the way we do is needed if our outputs are to be useful for future generations.
I have, of course, a solution to propose.
Invert the Notion of Identity
The principal problem with current taxonomy is the fluid notion of identity. Specimens can be identified to taxa, typically by ‘naming them’, but there is no normative way of settling any uncertainties or disputes because:
- Names are not reliable pointers to taxa. If a specimen is named but the taxonomic classification used in the naming is not specified then it can’t be known which taxon (of the multiple possible taxon concepts for that name) it has been identified to. See Taxa, Taxon Names and Globally Unique Identifiers in Perspective.
- Descriptions require human interpretation. As described above, the use of exemplar specimens combined with descriptions means that identifications will vary between experts.
- Relationships between descriptions are vague. The same name may be used for several separately defined taxa. The descriptions of these taxa may use the same or different morphological characteristics. Some descriptions will omit characteristics used in other descriptions that are ostensibly about the same taxon. It is therefore not possible to say whether two descriptions overlap, are equivalent or do not intersect at all.
To overcome these issues two things are required:
- Unambiguous identifiers. In place of the current system of nomenclature (where names point to multiple taxa) we need a system that “hard links” a name (identifier) to the circumscription of a taxon. The resulting identifier can only ever mean one thing.
- Normative, Unambiguous Circumscriptions of Taxa. A method of circumscribing taxa that does not require interpretation by a human and is not subject to context.
We are at a turning point in the study of biodiversity where DNA barcoding initiatives have the ability to deliver a single tool that meets both these requirements – but it does need a shift in thinking before this leap can be made.
Up to now the assumption has been that we are discovering taxa in nature and then attempting to describe them. It is undoubtedly true that taxa do exist in nature. However, in order to construct a usable map of biodiversity, we need to turn this on its head. It is the act of minting an identifier and linking it to a circumscription that creates the taxon. We then discover which specimens in the wild fit into this taxon. Philosophically this is how we act anyway (see Identifiers, Identity and Me). Taxa are currently hypotheses (things we invent) that may break down as our knowledge grows.
An analogy frequently used in the past is the establishment of a set of pigeon holes for biodiversity that we then post specimens into. These pigeon holes aren’t entirely arbitrary (their size and distribution is based on some a priori expectation of what occurs in nature) but they are fixed, inflexible entities and nature isn’t guaranteed to fit neatly into them all just as it may not be entirely divisible into species and subspecies.
My proposal is very simple. We establish a system where researchers can publish “Barcode Taxa”. Each of these consists of a registered DNA Barcode that has an associated, immutable human readable name and an HTTP URI. The system would take the form of a single register, perhaps linked to the existing BOLD system. Barcode Taxa are conceptually different from existing species for which we have barcodes. With Barcode Taxa the barcode comes first.
The human readable name should be based on an existing scientific binomial or trinomial name followed (in place of the author string and/or year) by the standardized name of the barcode region in square brackets e.g. Puma concolor [CO1] or Rhododendron luteum [rbcL+matK].
Whilst the human readable string makes it possible to refer to the taxon unambiguously in the literature the HTTP URI would allow machines to reference it unambiguously across the internet.
Any individual specimen that has the barcode for Rhododendron luteum [rbcL+matK] is Rhododendron luteum [rbcL+matK]. Any individual or specimen that does not have that barcode is not Rhododendron luteum [rbcL+matK], no matter what the morphology involved.
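A minimal sketch of what a register entry and its membership test might look like follows. Everything specific in it is an assumption made for illustration: the `BarcodeTaxon` class, the example.org URI and the short sequence are invented, and "has the barcode" is taken naively to mean an exact match after trimming whitespace and case, whereas a real register would need an agreed matching rule.

```python
from dataclasses import dataclass


def normalise(sequence: str) -> str:
    # Strip whitespace and upper-case so formatting differences are ignored.
    return "".join(sequence.split()).upper()


@dataclass(frozen=True)
class BarcodeTaxon:
    # Hypothetical register entry; all field names are assumptions.
    binomial: str   # existing scientific name the label is based on
    region: str     # standardized name of the barcode region
    sequence: str   # the registered barcode, already normalised
    uri: str        # HTTP URI minted for the taxon

    @property
    def name(self) -> str:
        # Human readable name: binomial followed by the region in brackets.
        return f"{self.binomial} [{self.region}]"

    def includes(self, specimen_sequence: str) -> bool:
        # Naive assumption: membership means an exact sequence match.
        return normalise(specimen_sequence) == self.sequence


taxon = BarcodeTaxon(
    binomial="Rhododendron luteum",
    region="rbcL+matK",
    sequence=normalise("ATGTCACCAC AAACAGAGAC"),  # made-up placeholder
    uri="http://example.org/barcode-taxa/0001",   # hypothetical URI
)
```

With this entry, `taxon.name` gives the human readable string and `taxon.includes(...)` answers the membership question from the sequence alone, with no reference to morphology.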
Once the Barcode Taxon is established then secondary taxonomic products can be produced. These would take the form of descriptions and keys that are intended to enable people to estimate, from the morphology, whether an individual specimen is a member of a specific Barcode Taxon. Importantly these secondary taxonomic products can be tested for accuracy because there is a normative/standard measure of correctness.
I am not suggesting that all identification is done by barcoding. I am suggesting that the efficacy of diagnostic tools be established so that one can be, for example, 80% certain that one has identified the correct taxon using a particular key. This can be done independently of taxonomic experts. An ecologist could establish a field test for insect pupae and barcode a sample of those to establish how accurate the field test is.
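The ecologist’s check could be as simple as the sketch below, which compares what a key said with what the barcode said for a sample of specimens. The function name and the sample data are made up for illustration, and a real study would also want a confidence interval for the estimate.

```python
def key_accuracy(results: list[tuple[str, str]]) -> float:
    """Fraction of specimens where the key agreed with the barcode.

    Each item is (name given by the key, name given by the barcode).
    The barcode identification is treated as the normative answer.
    """
    if not results:
        raise ValueError("at least one barcoded specimen is needed")
    agreements = sum(1 for by_key, by_barcode in results if by_key == by_barcode)
    return agreements / len(results)


# Made-up sample: ten pupae identified in the field and then barcoded.
sample = (
    [("Taxon A [CO1]", "Taxon A [CO1]")] * 8
    + [("Taxon A [CO1]", "Taxon B [CO1]")] * 2
)

accuracy = key_accuracy(sample)
```

For this invented sample the key agrees with the barcode in eight of ten cases, so the field test would be reported as 80% accurate.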
In the majority of cases it is presumed that the standard barcoding regions will match roughly with variation of biological significance. If it does not then another barcoding region could be chosen and used as well e.g. Rhododendron luteum [atpH]. There is no presumed relationship between Rhododendron luteum [atpH] and Rhododendron luteum [rbcL+matK]. Any correlation between the two would have to be established experimentally.
Another analogy is that of a geodetic datum. WGS84 is suitable for the vast majority of georeferencing. It is feasible to use other geodetic datums though and there is not, necessarily, a guarantee of being able to convert between all the different combinations. They are essentially separate reference systems. Different barcoding regions would behave in a similar way. They would define different, unrelated namespaces for Barcode Taxa.
Barcode Taxa imply nothing about phylogeny although it is assumed that they would be used as putative terminal taxa in phylogenetic analysis and are likely to be monophyletic because of the regions chosen. The discovery of paraphyletic Barcode Taxa would not invalidate them. They would still be useful measures of biodiversity but could be flagged as being known to be ‘unnatural’.
Barcode Taxa imply no hierarchy but their use in combination with the proposed Phylocode would produce a predictable method of naming higher taxa and building the tree of life. They would certainly be more stable than the current proposal to use the existing binomial nomenclature.
Ecologists working with groups that have not been monographed would not have to wait for taxonomy to catch up before they can publish stable names for the organisms they are studying. They could simply barcode them: if the barcode matched an existing one they would use that name, and if it didn’t they would just create a name that seems appropriate. Later a taxonomist may link this Barcode Taxon in with others in some larger treatment with keys and descriptions but that doesn’t have to happen in the lifetime of the ecologist.
It cannot be stressed too strongly that the erection of a Barcode Taxon bears no formal link to existing taxa of the same bi/trinomial. There is no intention that all individuals identified as Rhododendron luteum Sweet will have the barcode of Rhododendron luteum [rbcL+matK] although it would be expected that some of them would. Indeed it is impossible to absolutely define the precise set of individuals that are included in Rhododendron luteum Sweet whereas it would be possible to do that for Rhododendron luteum [rbcL+matK] and so the question of whether they are the same or not simply can’t be answered. This is the issue with the current approach of trying to establish barcodes for existing taxa. There is no way of measuring success. The only way we can say that barcode X uniquely identifies taxon Y is if barcode X defines taxon Y. Otherwise we simply don’t know.
If the purpose of taxonomy is to produce a system that people can actually use to hang data on, so that both people and machines can then infer more knowledge from the linked data, then this is really the only game in town. People who need to identify critical groups are already turning to barcoding the parts of those groups they are interested in and using those barcodes as definitions of the taxa. The relevance of Linnaean style taxonomy is becoming increasingly tenuous. It will be cut off at the root when molecular biologists and ecologists team up and find they simply don’t need to talk to the people at the museum.
[In addition to the comments below there is a related discussion on Taxacom with 40+ messages in it]
“they simply don’t need to talk to the people at the museum”
So is it your contention that Barcodes used for identification need not be associated with specimens, nor have any repeatability – that the concept of type specimens is somehow unnecessary with barcoding? The folks who are actively building a repository of “barcode types” and those supplying specimens for that project would strongly disagree.
@DLM
I *do* advocate voucher specimens. Yes good scientific practice to voucher everything. Nomenclatural type specimens are another thing. If the barcode defines the taxon then anything you sequence that has that barcode is a valid member of the taxon in equal standing with any other specimen that passes that test. So every voucher specimen counts. Nomenclatural type specimens are just some weird way to calculate the correct name for a conventional taxon. How morphologically dissimilar from a type does a specimen have to be to be considered a member of the taxon? etc. We could happily ignore all the stuff in red folders in the herbarium. Just sequence good quality material rather than having the fetish for the old that is required by the codes of nomenclature.
I am taking a slightly naive approach to barcoding and assuming they are reproducible and reliable all the time.
My remark on not needing museums at all was a bit of showmanship perhaps. We would need specimen repositories.
Roger: I’m not sure I understand exactly what you mean when you write “information presented in the literature … only need be imaged and indexed not databased”. In the realm of biodiversity informatics, 90% of what we do in databasing taxonomic information is what I would call “indexing”. The goal, in my mind, has never been to get a computer to do the work of a taxonomist; it has been to make the job of the taxonomist easier and more efficient. Several times in history, certain practitioners of biological taxonomy suffering from “physics envy” have endeavored to refine the taxonomic process to pure and objective logic (e.g., numerical taxonomy). Such initiatives have generally not met with consistent success.
My rationalization of this is that taxonomy (and associated nomenclature) has existed as it has for the past 250 years because it serves a useful purpose for humans. As such, the primary client of taxonomy is, has been, and should continue to be optimized for humans. Biodiversity Informatics should not, in my mind, be about automating the intellectual process; but rather it should be about maximizing the efficiency of information access by humans.
You wrote: “The notion that we can scale taxonomy by increased use of information technology is erroneous, at least in part, if there will always be a bottleneck of overworked taxonomic experts.”
If we want taxonomy to continue to fulfill the role it has fulfilled for two and a half centuries, then I think we are a long, long way from the time when electronic intelligence will be able to supplement, let alone replace, the intellectual process of taxonomy. However, the bottleneck can be opened rather dramatically through the application of technology and biodiversity informatics. Having observed my PhD advisor engage in the art and science of taxonomy (he is the most prolific ichthyological taxonomist in the past 100 years or so), I noted early on how poor was the ratio of time he spent implementing his expertise, vs. time he spent manually typing and formatting diagnoses, synonymies, keys, bibliographies and tables of character data for manuscripts; looking up particular publications in his office library; going to a larger library to access some publication that he didn’t have his own copy of, etc. All told, he probably spent 15-20% of his time doing stuff that capitalized off his taxonomic expertise, and the rest of his time conducting tasks that are tedious and slow for a human, but incredibly easy and fast for a computer. In fact, it was watching him work that led me down the path of biodiversity informatics in the first place: I knew that computer technology could streamline the process dramatically. And I still believe it can!
I disagree that taxonomy — as it now exists — cannot fulfill the needs of tasks like climate modelling; because I think such tasks do not require the level of precision of taxon-concept comparison that you seem to be striving for. By itself, nomenclature can allow automated reasoning to the resolution of primary type specimens, at least.
There is no denying that DNA Barcoding (at least as a concept) has real merit and value. I would say the same thing about the Phylocode approach to nomenclature. The important thing to remember is that where they have their strongest merit and value does not exactly coincide with the reasons why Linnean nomenclature has been so successful over the past 250 years. In other words, a system such as what you describe will certainly allow for certain things that traditional nomenclature is not optimally designed for. But it will fail to fulfill certain things that the traditional system has been (and continues to be) extremely effective for.
Addressing the first sentence of your last paragraph, I think that the purpose of taxonomy is to produce a system that people can actually use to *communicate with each other*. This involves much more than simply hanging data from. And even when data are hung on names, they still fulfill most of the need, most of the time, at the level of precision that most people care about (this includes things like climate modeling).
This is not to say that the approach you describe does not have merit. For its purposes, I think it has merit (see also Godfray, 2002, Nature 417:17-19; http://www.nature.com/nature/journal/v417/n6884/full/417017a.html — and the basic tenets of these ideas have been around quite a bit longer). Just don’t think of it as a replacement to traditional taxonomy.
Thanks for this Rich. You raise some good points that are true. Notably names as tags works fine for most things and a more precise system may not be needed for many applications. Indeed I would argue that taxonomy isn’t needed for many applications.
I worry about your sunk losses argument i.e. just because we have been doing it this way for so long does not mean it is the best way to do it in the future.
The other worry I have is with the definition of what works. Neither you nor I are the best people to say whether taxonomy works or not. Really we need to test such an argument with users of the system. This assumes we are not doing taxonomy just as an end in itself, which may be justified.
I generally agree; but I’m not sure it’s just a matter of “sunk losses”. Yes, there is a component of that; but at a higher level, can you give me another example of a convention of this sort that has survived as long as it has, and has maintained such widespread (universal) adoption? Maybe it’s just habit; maybe it’s just fear of change. But it also might be an example of a system that works at some level.
I agree that our needs are evolving; and perhaps they are evolving faster and more substantially now than they ever have before. This is why I am not opposed to alternative strategies (I’m not a Phylocode-basher or a Barcode-basher). They are certainly worth pursuing, in my opinion; as long as they are fulfilling a real need that is not currently being met by the existing system.
However, I just don’t see these alternate approaches to organizing and naming biodiversity as “replacements” to the existing system, because the existing system continues to fulfill certain needs as well.
I generally agree with you here Roger. I think some form of radical change will eventually be unavoidable. With the increase of information technology, the tools/services with it, and the increased discussion and challenge of traditional nomenclature, the way we deal with “taxon names” will inevitably change. My background is not taxonomy or nomenclature, and maybe I am out of place to say this, but I have had concerns for a while about the unstructured, unordered, illogical processes that are used when dealing with nomenclature and taxon concepts. We just have to look at the current discussions on Taxacom to see that fundamental issues have not (and probably will not) ever go away. And with the influence of IT, I can see an approach along the lines you discuss being far more logical, ordered and therefore, useful to IT tools in the whole.
Rich wrote:
“The goal, in my mind, has never been to get a computer to do the work of a taxonomist; it has been to make the job of the taxonomist easier and more efficient”
I think I agree with you Rich to a degree, as this is probably the easiest, quickest, most advantageous use of IT tools and processes. BUT imagine how useful it would be to be able to get a computer to identify specimens, suggest new taxon names/concepts (my “Organism Concepts”, Roger’s “Barcode Taxa”), map relationships between organism concepts, pick up on discrepancies, etc, etc!
Roger, I agree with you and Rich …
My background on the edge of mycology gives me a slightly different perspective (that might be wrong!). For the majority of mushrooms classical morphology and ICBN nomenclature are still the principal driver. That classical view is increasingly informed by sequence data, both by barcodes for identification, and by multi-gene sequences for phylogenetic reconstruction.
However, fungi are a big group. If you focus on cryptic micro-fungi of economic importance then you get a different picture. Here you find a different approach has emerged due to limited morphology, and/or the limitation of gross morphology as an indicator of pathogenicity.
For example, the current global epidemic of wheat stem rust, which is formally defined as a race of Puccinia graminis, is now more universally, but informally known as ‘Ug99’ (for Uganda 1999), or its informal synonym ‘Pgt TTKS’ (for Puccinia graminis f.sp. tritici race TTKS [I don’t know what TTKS stands for]). Just Google Ug99 and you’ll find many of the pages won’t even refer to the formal name at all.
My view is that Ug99, and its synonym Pgt TTKS, are ‘effectively’ barcode names. If a strain has the resistance markers for Ug99 then it is Ug99. In addition, all the valuable information around the world is linked to the namestring Ug99.
Plant pathologists working on fungi, bacteria and viruses have adopted this approach for a long time now. In the pre-molecular era the plant pathologists used ICBN acceptable designations such as ‘Puccinia graminis formae specialis tritici’ to indicate physiologically distinct races without morphological distinction (c.f. pathovar in bacteriology). Now we commonly see informal name strings, with implied barcoding, such as ‘Ug99’ or ‘Pgt TTKS’ to pin down the concept.
Also, formal virus names such as ‘Influenza A subtype H5N1’ have embedded ‘barcoding’ – with implications for the viral species concept definition.
How different are these extant working practices from your proposal for inverting the notion of identity so it is based on a ‘barcode’ identifier rather than a mental construct?
I know even less about the domain of microbial ecology but I do see the enormous number of unassigned fungal environmental sequences in GenBank. It can only be a matter of time before humans start inventing namestrings for clusters of such sequences (which are effectively defined by barcodes) that are remote from anything with a Latin binomial. Maybe it already has started and I haven’t noticed. They’ll do that because they will want to link those barcode defined namestrings to associated data.
The issue I see is that systematists’ traditional working practice compounds two fuzzy sets: the fuzzy set of living things ‘out there in the real world’, and the fuzzy set of namestrings we’ve applied to the fuzzy set of living things. Your inversion removes the fuzziness but I don’t see how it solves the inevitable problem of linking old with new.
The challenge for biodiversity informatics is to provide the universal linking/disambiguating mechanism for all these namestrings, their associated defined species concepts (fuzzy or not), and all the associated data.