Construction of a biodiversity knowledge repository using a text mining-based framework

Riza Batista-Navarro, Chrysoula Zerva and Sophia Ananiadou
School of Computer Science
University of Manchester
Manchester, United Kingdom
{riza.batista,chrysoula.zerva,sophia.ananiadou}@manchester.ac.uk

Abstract

In our aim to make the information encapsulated by biodiversity literature more accessible and searchable, we have developed a text mining-based framework for automatically transforming text into a structured knowledge repository. A text mining workflow employing information extraction techniques, i.e., named entity recognition and relation extraction, was implemented in the Argo platform and was subsequently applied to biodiversity literature to extract structured information. The resulting annotations were stored in a repository following the emerging Open Annotation standard, thus promoting interoperability with external applications. Accessible as a SPARQL endpoint, the repository supports knowledge discovery over a huge amount of biodiversity literature by retrieving annotations matching user-specified queries.

1 Introduction

Big data (huge data collections) are proliferating in many disciplines at a rate that is much faster than what our analytical abilities can handle. One particular discipline that has amassed big data is biological diversity, more popularly known as biodiversity: the study of variability amongst all life forms. On the one hand, researchers in this domain collect primary data pertaining to the occurrence or distribution of species, and store this information in a structured format (e.g., spreadsheets, database tables). On the other hand, findings or observations resulting from their analysis of primary data are usually reported in literature (e.g., monographs, books, journal articles or reports), often referred to as secondary data. Written in natural language, secondary data lacks the structure that primary data comes with, rendering the knowledge it contains obscured and inaccessible. In order to make information from secondary data available in a structured and thus searchable form, we have developed a repository containing information automatically extracted from biodiversity literature by a customisable text mining workflow. To maximise its interoperability with external tools or services, we have made the knowledge repository available as a Resource Description Framework (RDF) triple store that conforms to the Open Annotation standard [1]. We then demonstrate how the repository, accessible as a SPARQL endpoint, facilitates query-based search, thus making the information contained in biodiversity literature discoverable.

[1] http://www.openannotation.org

A handful of other tools for storing biodiversity information in RDF format exist. Most of them, however, do not have the capability to automatically understand text written in natural language. Tools such as RDF123 (Han et al., 2008) and BiSciCol Triplifier (Stucky et al., 2014), for example, accept only data that is already in the form of structured tables. The browser extension Spotter (Parr et al., 2007) generates RDF-formatted annotations over blog posts, not by automatically extracting information from the textual content but rather by requiring its users to manually enter structured descriptive metadata. Most similar to our work is a system for automatically extracting RDF triples pertaining to species’ morphological characteristics from the literature on Flora of North America (Cui et al., 2010). Their semantic annotation application provided the user with an opportunity to revise automatically generated annotations, an option that can also be enabled in our approach. We note though that our work is uniquely underpinned by a highly customisable and extensible workflow. In this way, when domain experts call for other types of information to be captured, our framework will require only minimal development time and effort to fulfil the task.

2 Methodology

In this section, we present in detail our framework for constructing the knowledge repository. We begin by briefly describing the corpus of biodiversity documents that was utilised, and then outline the various steps in the text mining workflow. We finally proceed to explaining how the Open Annotation specification was adopted in order to store the information extracted from our corpus.

2.1 Document selection

The Biodiversity Heritage Library (BHL) [2] is a database of biodiversity literature maintained by a consortium of natural history and botanical libraries all over the world. A product of the various partners’ digitisation efforts, BHL currently contains almost 110,000 titles, equivalent to almost 50 million pages of text resulting from the application of optical character recognition (OCR) tools on scanned images of legacy materials. For this work, we decided to narrow down the scope of the knowledge repository to the requirements of our ongoing project whose aim is to comprehensively collect both primary and secondary information on biodiversity in the Philippines.

[2] http://www.biodiversitylibrary.org

To this end, we retrieved only the subset of English BHL pages which are relevant to the Philippines, i.e., the union of (1) the set of pages which mention either “Philippines” or “Philippine” within their content, and (2) the set of pages contained by books or volumes whose titles mention “Philippines” or “Philippine”. This resulted in a corpus of a total of 155,635 pages (around 12GB in size).

2.2 Development of text mining workflow

One of the primary interests of our collaborators in the project is the discovery of fundamental species-centric knowledge, particularly information on species’ geographic locations, habitat, anatomical parts as well as authorities (i.e., persons who described them). Guided by user requirements, we cast this work as an information extraction task requiring: (1) named entity recognition (NER) for taxa, locations, habitat, anatomical parts and persons; and (2) binary relation extraction focussing on the following types of associations: taxon-location, taxon-habitat, taxon-anatomical part and taxon-person.

To carry out these tasks on our corpus, we integrated various natural language processing (NLP) tools into one workflow using the Argo platform. Argo [3] is a web-based, graphical workbench that facilitates the construction and execution of bespoke modular text mining workflows. Underpinning it is a library of diverse elementary NLP components, each of which performs a specific task. Argo’s graphical block diagramming interface for workflow construction provides access to the component library, representing its components as configurable blocks that can be interconnected to define a processing sequence.

[3] http://argo.nactem.ac.uk

Figure 1: Text mining workflow

The workflow that we developed, depicted in Figure 1, combines several components for pre-processing, syntactic and semantic analyses. It begins with an SFTP Document Reader which loads the plain-text corpus from a remote server. This is followed by a Regex Annotator which attempts to detect paragraph boundaries based on the occurrence of newline characters. The paragraphs are then segmented by the LingPipe Sentence Splitter [4] into sentences, each of which is decomposed into tokens by the GENIA Tagger (Tsuruoka et al., 2005) which also performs part-of-speech tagging, lemmatisation and chunking.

[4] http://alias-i.com/lingpipe

Figure 2: Our Open Annotation representation of related entities
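As a rough illustration of the paragraph detection step described above, the following Python sketch marks paragraph boundaries with a regular expression. The pattern (one or more blank lines), the function name paragraph_spans and the sample text are assumptions made for this example only; the actual pattern configured in the Regex Annotator of the Argo workflow is not stated in this paper.

```python
import re
from typing import List, Tuple

# Assumption: a paragraph boundary is a newline, optional whitespace and
# another newline (i.e. at least one blank line). The pattern used in the
# actual workflow is not given in the paper.
PARAGRAPH_BREAK = re.compile(r"\n\s*\n")


def paragraph_spans(text: str) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets of the non-empty paragraphs."""
    spans = []
    start = 0
    for match in PARAGRAPH_BREAK.finditer(text):
        # Skip stretches that contain only whitespace.
        if text[start:match.start()].strip():
            spans.append((start, match.start()))
        start = match.end()
    # The text after the last boundary is a paragraph too, if non-empty.
    if text[start:].strip():
        spans.append((start, len(text)))
    return spans


# Made-up sample text, used only to exercise the function.
sample = "First paragraph, line one.\nLine two.\n\nSecond paragraph.\n"
for begin, end in paragraph_spans(sample):
    print(begin, end, repr(sample[begin:end]))
```

Returning character offsets rather than copied strings means the detected paragraphs can be kept as stand-off annotations over the unmodified page text, which later components can then refine into sentences and tokens.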
The next component, the Biodiversity Concept Tagger, is a machine learning-based NER tool [5] that applies a conditional random fields (CRF) model (Lafferty et al., 2001) to assign labels to token sequences. The labels in this case correspond to the following categories: taxon, location, habitat, anatomical part, quality and person.

[5] http://nersuite.nlplab.org

The succeeding components in the workflow contribute towards the relation extraction task. Enju Parser performs deep syntactic parsing and extracts syntactic dependencies amongst sentence tokens. Its outputs are used by the next component, the Predicate Argument Structure Extractor, to compute semantic dependencies in the form of predicate-argument structures. The five instances of the Dependency Extractor component then make use of the predicate-argument structures to detect relationships between names categorised under the specified entity types. The first instance, for example, detects only relationships between taxon and person names, while the last one captures related anatomical parts and qualities. The Type Mapper ensures that all of the named entities and relations extracted conform to the same annotation schema before they are all saved in Open Annotation format by the last component, the Annotation Store Writer. We briefly describe next how our extracted annotations are encoded according to this format.

2.3 Adopting the Open Annotation model

The Open Annotation (OA) Core Data Model is an emerging W3C-recommended standard for encoding associations between any annotation and resource (i.e., what is being annotated). Built upon the Resource Description Framework (RDF), the OA model represents an annotation as having a body and a target, with the former somehow describing the latter, e.g., by assigning a label or identifier. Following this fundamental idea and other relevant recommendations given in the specification [6], we represented the named entity and relation annotations extracted by our text mining workflow in OA format, as depicted in Figure 2. For brevity, prefixes were used in this figure instead of full namespaces, e.g., oa for http://www.w3.org/ns/oa#.

[6] http://www.openannotation.org/spec/core

Once the RDF triples had been generated, they were automatically loaded into a new Apache Jena TDB [7] store, which was then exposed as a SPARQL endpoint by Fuseki [8].

[7] https://jena.apache.org/documentation/tdb
[8] https://jena.apache.org/documentation/fuseki2

3 Example use case

We present an example of how our repository, now in the form of a SPARQL-enabled triple store, can facilitate knowledge discovery. A user might be interested, for example, in learning which specific geographic locations have been described in the literature as having associations with certain species, e.g., the bird family of hornbills. Shown in Listing 1 is a query in SPARQL, the query language for RDF, that retrieves a list of all such locations, as well as the number of times that the relationship was mentioned in the source document.

Listing 1: An example SPARQL query that will retrieve locations related to hornbills.

    PREFIX rdfs:
    PREFIX oa:
    PREFIX rdf:
    PREFIX bd:
    SELECT ?tx ?lc (COUNT(?lc) as ?cnt) ?src WHERE {
      ?annotation oa:hasBody ?body .
      GRAPH ?body {
        ?a rdf:type bd:Relation .
        ?a bd:Relation:mention1 ?mention1 .
        ?a bd:Relation:mention2 ?mention2 .
      }
      ?mention1 oa:hasTarget ?target1 .
      GRAPH ?comp1 {
        ?target1 rdf:type bd:Taxon . }
      ?target1 oa:hasSelector ?selector1 .
      ?selector1 oa:default ?d1 .
      ?d1 oa:exact ?tx .
      FILTER(regex(?tx, "Hornbill", "i")) .
      ?mention2 oa:hasTarget ?target2 .
      GRAPH ?comp2 {
        ?target2 rdf:type bd:Location . }
      ?target2 oa:hasSelector ?selector2 .
      ?selector2 oa:default ?d2 .
      ?d2 oa:exact ?lc .
      ?target2 oa:hasSource ?src . }
    GROUP BY ?tx ?lc ?src
    ORDER BY DESC (?cnt)

4 Conclusion

In this paper, we presented a framework for building a knowledge repository that: (1) applies a customisable text mining workflow to extract information in the form of named entities and relationships between them; (2) stores the automatically extracted knowledge as RDF triples compliant with the Open Annotation specification; and (3) facilitates the discovery of otherwise obscured knowledge by enabling query-based retrieval of annotations from a SPARQL endpoint. We note that the triple store can be exposed via other application programming interfaces, i.e., web services that abstract away from SPARQL to make querying straightforward for non-technical users.

We envision that our knowledge repository will facilitate the enhancement of search applications, e.g., information retrieval systems. It has been made accessible as a SPARQL endpoint [9] that accepts POST requests. The body of the request should be set to a valid SPARQL query while the headers should be configured to hold the following name-value pairs: (1) Accept: text/csv and (2) Content-Type: application/sparql-query.

[9] http://nactem.ac.uk/copious-demo/annotations/sparql

Acknowledgments

We would like to thank Prof. Marilou Nicolas for her valuable inputs. This work is funded by the

References

Hong Cui, Kenneth Jiang, and Partha Pratim Sanyal. 2010. From Text to RDF Triple Store: An Application for Biodiversity Literature. In Proceedings of the Association for Information Science and Technology (ASIST 2010).

Lushan Han, Tim Finin, Cynthia Parr, Joel Sachs, and Anupam Joshi. 2008. RDF123: From Spreadsheets to RDF. In Amit Sheth et al., editors, Proceedings of the 7th International Semantic Web Conference (ISWC 2008), pages 451–466. Springer Berlin Heidelberg, Berlin, Heidelberg.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the Eighteenth International Conference on Machine Learning (2001), pages 282–289, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Cynthia Parr, Joel Sachs, Lushan Han, and Taowei Wang. 2007. RDF123 and Spotter: Tools for generating OWL and RDF for biodiversity data in spreadsheets and unstructured text. In Proceedings of Biodiversity Information Standards Annual Conference (TDWG 2007).

Brian J. Stucky, John Deck, Tom Conlin, Lukasz Ziemba, Nico Cellinese, and Robert Guralnick. 2014. The BiSciCol Triplifier: bringing biodiversity data to the Semantic Web. BMC Bioinformatics, 15(1):1–9.

Y. Tsuruoka, Y. Tateisi, J.-D. Kim, T. Ohta, J. McNaught, S. Ananiadou, and J. Tsujii. 2005. Developing a Robust Part-of-Speech Tagger for Biomedical Text. In Advances in Informatics - 10th Panhellenic Conference on Informatics, volume 3746 of Lecture Notes in Computer Science, pages 382–392. Springer-Verlag, Volos, Greece, November.
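The request format described in the conclusion (a POST whose body is the SPARQL query, with Accept: text/csv and Content-Type: application/sparql-query headers) can be sketched in Python using only the standard library. This is a minimal sketch and not the authors' client code: the function names build_request, parse_csv and run_query are hypothetical, the query shown is a generic placeholder, and the endpoint URL is the one given in the footnote, which may no longer be online.

```python
import csv
import io
import urllib.request
from typing import Dict, List

# Endpoint URL as given in the paper's footnote; it may no longer be online.
ENDPOINT = "http://nactem.ac.uk/copious-demo/annotations/sparql"


def build_request(query: str, endpoint: str = ENDPOINT) -> urllib.request.Request:
    """Build a POST request whose body is the SPARQL query itself."""
    return urllib.request.Request(
        endpoint,
        data=query.encode("utf-8"),
        headers={
            "Accept": "text/csv",
            "Content-Type": "application/sparql-query",
        },
        method="POST",
    )


def parse_csv(payload: str) -> List[Dict[str, str]]:
    """Turn a text/csv result into one dictionary per result row."""
    return list(csv.DictReader(io.StringIO(payload)))


def run_query(query: str, timeout: float = 30.0) -> List[Dict[str, str]]:
    """Send the query and parse the response (this performs a network call)."""
    with urllib.request.urlopen(build_request(query), timeout=timeout) as response:
        return parse_csv(response.read().decode("utf-8"))


# Placeholder query, not taken from the paper. Only the request is built
# here; nothing is sent until run_query() is called.
request = build_request("SELECT * WHERE { ?s ?p ?o } LIMIT 5")
print(request.get_method(), request.full_url)
```

Passing the text of Listing 1 (with its namespace prefixes filled in) to run_query would, assuming the endpoint is reachable, return one dictionary per result row keyed by the selected variable names.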