The first two parts of the monograph to be looked at were published in Notes from the Royal Botanic Garden Edinburgh – the house journal of the gardens until 1990.
- Cullen, J. (1980) A revision of Rhododendron. I. Subgenus Rhododendron sections Rhododendron and Pogonanthum. Notes from the Royal Botanic Garden Edinburgh 39: 1–207.
- Chamberlain, D.F. (1982) A revision of Rhododendron. II. Subgenus Hymenanthes. Notes from the Royal Botanic Garden Edinburgh 39: 209–486.
Between them these publications cover 544 species – more or less half the genus.
The entire run of the Notes has now been digitized to page images for BHL-Europe, so I have access to good quality pictures of the text. We have an in-house OCR service that I can drop these images into to create text or other outputs. I started by dropping all 200+ images from the first publication into the OCR and creating 200+ text files, but this didn’t make sense because many of the species accounts run across multiple pages. What I needed was the contiguous text for the whole publication. I could have concatenated the text files, but I figured the OCR software would do a better job if it was working through one big document, as it would learn from previous pages – OK, maybe this is fantasy, but it is worth a try. Using Preview (the Mac’s default PDF and image viewer) I created a single PDF containing all the images and put that through the OCR processor. The result was not only a single text file but also a PDF of the whole publication including the OCR’d text. Job done! Can I stop now?
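For anyone wanting to reproduce this workflow without Preview or an in-house OCR service, roughly the same thing can be scripted with open-source tools. The sketch below is only my assumption of how it might be done with img2pdf and ocrmypdf (which drives Tesseract); it is not the toolchain actually used here, and all the file names are made up.

```python
# Minimal sketch: stitch page images into one PDF, then add an OCR text layer
# and dump the recognised text to a plain-text "sidecar" file.
# Assumes img2pdf and ocrmypdf are installed; paths are hypothetical.
import glob
import subprocess

page_images = sorted(glob.glob("pages/*.jpg"))  # the 200+ page scans, in order

# One PDF containing every page image, so the OCR engine works through
# the whole publication as a single document.
subprocess.run(["img2pdf", *page_images, "-o", "revision_part1.pdf"], check=True)

# OCR the combined PDF: the output PDF hides the text behind the page images,
# and the sidecar file holds the contiguous text of the whole publication.
subprocess.run(
    ["ocrmypdf", "--sidecar", "revision_part1.txt",
     "revision_part1.pdf", "revision_part1_ocr.pdf"],
    check=True,
)
```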
This process showed how easy it is to create digital versions of publications. The PDFs produced are not very friendly, being almost 100 MB each, but they can be read online and indexed, so they do fulfil the basic requirement of making ‘legacy’ publications available. Because of their size I do not attach the PDFs here.
Two points jump to mind:
- The accuracy of the OCR is masked because the text is hidden behind the page images. Although the document is searchable, if a search term is not found we can’t tell whether it genuinely isn’t there or whether the OCR failed for that word in that location. This digitization process is likely to engender a false sense of security.
- The PDFs of the publications do not enable re-mixing or querying of the data beyond simple text searching. Questions like “What species occur in Yunnan, China?” can only be answered by working through the text manually – something that might be quicker with the printed version.
Making text available to read online is useful in that it facilitates distribution and discovery of that text, but that is all it does.
The next step is to try to turn what is basically a descriptive narrative into more useful information that can be used to answer the simple questions people are likely to ask about biodiversity. At the least it has to be massaged into a set of web pages, one for each species, for use in EOL. There are two aspects to this process:
- Syntax – this is really the easy bit, although time consuming. The text of the monograph has a particular syntax – an ordering of characters into words and sentences. We need to mark up the document with another syntax that will allow a machine to extract chunks of information. This isn’t too difficult to do at a coarse level because the monographs are highly structured, but it becomes harder the more finely grained the syntax becomes. It inevitably involves a lot of manual work and I’ll cover it in another post.
- Semantics – this involves tougher decisions but isn’t that time consuming. We need to decide what chunks of information in the document we want to extract, what chunks we can practically extract, and reach some kind of compromise. Different chunks of text can be seen in the document. Some have no biological meaning at all, e.g. a page or a paragraph. Others have useful biological meaning, e.g. a distribution string like “NE Burma, China (Yunnan, Sichuan, W Guizhou)” in the context of a species description. The decisions made about what to extract will affect the syntax used and how long it will take to impose that syntax on the raw text of the document (a rough sketch of what such extraction might enable follows this list). Making these decisions will be the subject of another blog post.
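As a rough illustration of why those chunks matter, here is a sketch of the kind of query that becomes possible once the chunks are explicit. The markup syntax, the species names and the second distribution string are all invented for the example – only the Yunnan distribution string is quoted from the paragraph above – and the real syntax for the monograph is still to be decided.

```python
# Sketch only: assumes a hypothetical lightweight markup in which each species
# account is tagged with #SPECIES and its range with #DISTRIBUTION.
import re

marked_up_text = """\
#SPECIES Rhododendron sp. A
#DISTRIBUTION NE Burma, China (Yunnan, Sichuan, W Guizhou)
#SPECIES Rhododendron sp. B
#DISTRIBUTION Nepal, Bhutan
"""  # placeholder records, not taken from the monograph

# Pair each species name with the distribution string that follows it.
records = re.findall(r"#SPECIES (.+)\n#DISTRIBUTION (.+)", marked_up_text)

# Once the chunks are explicit, "What species occur in Yunnan, China?"
# is a one-line query rather than a manual read-through.
in_yunnan = [name for name, dist in records if "Yunnan" in dist]
print(in_yunnan)  # ['Rhododendron sp. A']
```

The point is not the particular pattern matching but that a machine can only answer this sort of question once the biologically meaningful chunks have been made explicit.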
Hi Roger – really interested in what you are doing. With the BHL-Australia instance, we are looking to provide some generalised tools for users to mark/extract content on any page (linking back to the page image as the basis for the information extracted). The process is effectively to annotate the page with LOD metadata that we can then harvest and integrate into other tools. The top priorities are
I’m very interested in strategies which work for you and any ways we can promote consistency in what you and BHL-Europe do and what we do. Thanks, Donald
Thanks Donald.
These blog posts are really thinking-aloud stuff. In the back of my head the two questions I have are: Will this produce useful data? Could it cost-effectively produce better data than just indexing the text?
I had thought of writing a tool but decided I didn’t understand the problem well enough – so it is manual text editing and some regular expression substitutions for now.
All the best,
Roger