Construction of a biodiversity knowledge repository
                      using a text mining-based framework
                Riza Batista-Navarro, Chrysoula Zerva and Sophia Ananiadou

                      School of Computer Science
                        University of Manchester
                      Manchester, United Kingdom
{riza.batista,chrysoula.zerva,sophia.ananiadou}@manchester.ac.uk


Abstract

In our aim to make the information encapsulated by biodiversity literature
more accessible and searchable, we have developed a text mining-based
framework for automatically transforming text into a structured knowledge
repository. A text mining workflow employing information extraction
techniques, i.e., named entity recognition and relation extraction, was
implemented in the Argo platform and was subsequently applied to
biodiversity literature to extract structured information. The resulting
annotations were stored in a repository following the emerging Open
Annotation standard, thus promoting interoperability with external
applications. Accessible as a SPARQL endpoint, the repository supports
knowledge discovery over a huge amount of biodiversity literature by
retrieving annotations matching user-specified queries.

1   Introduction

Big data (huge data collections) are proliferating in many disciplines at a
rate much faster than our analytical abilities can handle. One particular
discipline that has amassed big data is biological diversity, more popularly
known as biodiversity: the study of variability amongst all life forms. On
the one hand, researchers in this domain collect primary data pertaining to
the occurrence or distribution of species, and store this information in a
structured format (e.g., spreadsheets, database tables). On the other hand,
findings or observations resulting from their analysis of primary data are
usually reported in literature (e.g., monographs, books, journal articles or
reports), often referred to as secondary data. Written in natural language,
secondary data lacks the structure that primary data comes with, rendering
the knowledge it contains obscured and inaccessible. In order to make
information from secondary data available in a structured and thus
searchable form, we have developed a repository containing information
automatically extracted from biodiversity literature by a customisable text
mining workflow. To maximise its interoperability with external tools or
services, we have made the knowledge repository available as a Resource
Description Framework (RDF) triple store that conforms to the Open
Annotation standard¹. We then demonstrate how the repository, accessible as
a SPARQL endpoint, facilitates query-based search, thus making the
information contained in biodiversity literature discoverable.

¹ http://www.openannotation.org

   A handful of other tools for storing biodiversity information in RDF
format exist. Most of them, however, do not have the capability to
automatically understand text written in natural language. Tools such as
RDF123 (Han et al., 2008) and BiSciCol Triplifier (Stucky et al., 2014), for
example, accept only data that is already in the form of structured tables.
The browser extension Spotter (Parr et al., 2007) generates RDF-formatted
annotations over blog posts, not by automatically extracting information
from the textual content but rather by requiring its users to manually enter
structured descriptive metadata. Most similar to our work is a system for
automatically extracting RDF triples pertaining to species' morphological
characteristics from the literature on Flora of North America (Cui et al.,
2010). Their semantic annotation application provided the user with an
opportunity to revise automatically generated annotations, an option that
can also be enabled in our approach. We note though that our work
is uniquely underpinned by a highly customisable and extensible workflow. In
this way, when domain experts call for other types of information to be
captured, our framework will require only minimal development time and
effort to fulfil the task.

2   Methodology

In this section, we present in detail our framework for constructing the
knowledge repository. We begin by briefly describing the corpus of
biodiversity documents that was utilised, and then outline the various steps
in the text mining workflow. Finally, we explain how the Open Annotation
specification was adopted in order to store the information extracted from
our corpus.

2.1   Document selection

The Biodiversity Heritage Library (BHL)² is a database of biodiversity
literature maintained by a consortium of natural history and botanical
libraries all over the world. A product of the various partners'
digitisation efforts, BHL currently contains almost 110,000 titles,
equivalent to almost 50 million pages of text resulting from the application
of optical character recognition (OCR) tools to scanned images of legacy
materials. For this work, we decided to narrow down the scope of the
knowledge repository to the requirements of our ongoing project whose aim is
to comprehensively collect both primary and secondary information on
biodiversity in the Philippines.

² http://www.biodiversitylibrary.org

   To this end, we retrieved only the subset of English BHL pages which are
relevant to the Philippines, i.e., the union of (1) the set of pages which
mention either “Philippines” or “Philippine” within their content, and (2)
the set of pages contained in books or volumes whose titles mention
“Philippines” or “Philippine”. This resulted in a corpus totalling 155,635
pages (around 12 GB in size).

2.2   Development of text mining workflow

One of the primary interests of our collaborators in the project is the
discovery of fundamental species-centric knowledge, particularly information
on species' geographic locations, habitat, anatomical parts as well as
authorities (i.e., persons who described them). Guided by user requirements,
we cast this work as an information extraction task requiring: (1) named
entity recognition (NER) for taxa, locations, habitat, anatomical parts and
persons; and (2) binary relation extraction focussing on the following types
of associations: taxon-location, taxon-habitat, taxon-anatomical part and
taxon-person.
   To carry out these tasks on our corpus, we integrated various natural
language processing (NLP) tools into one workflow using the Argo platform.
Argo³ is a web-based, graphical workbench that facilitates the construction
and execution of bespoke modular text mining workflows. Underpinning it is a
library of diverse elementary NLP components, each of which performs a
specific task. Argo's graphical block diagramming interface for workflow
construction provides access to the component library, representing the
components as configurable blocks that can be interconnected to define the
processing sequence.

³ http://argo.nactem.ac.uk

   The workflow that we developed, depicted in Figure 1, combines several
components for pre-processing, syntactic and semantic analyses. It begins
with an SFTP Document Reader which loads the plain-text corpus from a remote
server. This is followed by a Regex Annotator which attempts to detect
paragraph boundaries based on the occurrence of newline characters. The
paragraphs are then segmented by the LingPipe Sentence Splitter⁴ into
sentences, each of which is decomposed into tokens by the GENIA Tagger
(Tsuruoka et al., 2005) which also performs part-of-speech tagging,
lemmatisation and chunking. The next component, the Biodiversity Concept
Tagger, is a machine learning-based NER tool⁵ that applies a conditional
random fields (CRF) model (Lafferty et al., 2001) to assign labels to token
sequences. The labels in this case correspond to the following categories:
taxon, location, habitat, anatomical part, quality and person.

Figure 1: Text mining workflow

⁴ http://alias-i.com/lingpipe
⁵ http://nersuite.nlplab.org

   The succeeding components in the workflow contribute towards the relation
extraction task. Enju Parser performs deep syntactic parsing and extracts
syntactic dependencies amongst sentence tokens. Its outputs are used by the
next component, the Predicate Argument Structure Extractor, to compute
semantic dependencies in the form of predicate-argument structures. The five
instances of the Dependency Extractor component then make use of the
predicate-argument structures to detect relationships between names
categorised under the specified entity types. The first instance, for
example, detects only relationships between taxon and person names, while
the last one captures related anatomical parts and qualities. The Type
Mapper ensures that all of the named entities and relations extracted
conform with the same annotation schema before they are all saved in Open
Annotation format by the last component, the Annotation Store Writer. We
briefly describe next how our extracted annotations are encoded according to
this format.

2.3   Adopting the Open Annotation model

The Open Annotation (OA) Core Data Model is an emerging W3C-recommended
standard for encoding associations between any annotation and resource
(i.e., what is being annotated). Built upon the Resource Description
Framework (RDF), the OA model represents an annotation as having a body and
a target, with the former somehow describing the latter, e.g., by assigning
a label or identifier. Following this fundamental idea and other relevant
recommendations given in the specification⁶, we represented the named entity
and relation annotations extracted by our text mining workflow in OA format,
as depicted in Figure 2. For brevity, prefixes were used in this figure
instead of full namespaces, e.g., oa for http://www.w3.org/ns/oa#.

Figure 2: Our Open Annotation representation of related entities

⁶ http://www.openannotation.org/spec/core

   Once the RDF triples had been generated, they were automatically loaded
into a new Apache Jena TDB⁷ store, which was then exposed as a SPARQL
endpoint by Fuseki⁸.

⁷ https://jena.apache.org/documentation/tdb
⁸ https://jena.apache.org/documentation/fuseki2

3   Example use case

We present an example of how our repository, now in the form of a
SPARQL-enabled triple store, can facilitate knowledge discovery. A user
might be interested, for example, in learning which specific geographic
locations have been described in the literature as having associations with
certain
species, e.g., the bird family of hornbills. Shown in Listing 1 is a query
in SPARQL, the query language for RDF, that retrieves a list of all such
locations, as well as the number of times that the relationship was
mentioned in the source document.

Listing 1: An example SPARQL query that will retrieve locations related to
hornbills.

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX oa: <http://www.w3.org/ns/oa#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX bd: <http://nactem.ac.uk/schema/uima/typesystem/MiningBiodiversityTypeSystem#uk.ac.nactem.uima.biodiv.>
SELECT ?tx ?lc (COUNT(?lc) as ?cnt) ?src
WHERE {
 ?annotation oa:hasBody ?body .
 GRAPH ?body {
  ?a rdf:type bd:Relation .
  ?a bd:Relation:mention1 ?mention1 .
  ?a bd:Relation:mention2 ?mention2 . }
 ?mention1 oa:hasTarget ?target1 .
 GRAPH ?comp1 {
  ?target1 rdf:type bd:Taxon . }
 ?target1 oa:hasSelector ?selector1 .
 ?selector1 oa:default ?d1 .
 ?d1 oa:exact ?tx .
 FILTER(regex(?tx, "Hornbill", "i")) .
 ?mention2 oa:hasTarget ?target2 .
 GRAPH ?comp2 {
  ?target2 rdf:type bd:Location . }
 ?target2 oa:hasSelector ?selector2 .
 ?selector2 oa:default ?d2 .
 ?d2 oa:exact ?lc .
 ?target2 oa:hasSource ?src . }
GROUP BY ?tx ?lc ?src
ORDER BY DESC (?cnt)

4   Conclusion

In this paper, we presented a framework for building a knowledge repository
that: (1) applies a customisable text mining workflow to extract information
in the form of named entities and relationships between them; (2) stores the
automatically extracted knowledge as RDF triples compliant with the Open
Annotation specification; and (3) facilitates the discovery of otherwise
obscured knowledge by enabling query-based retrieval of annotations from a
SPARQL endpoint. We note that the triple store can be exposed via other
application programming interfaces, i.e., web services that abstract away
from SPARQL to make querying straightforward for non-technical users.
   We envision that our knowledge repository will facilitate the enhancement
of search applications, e.g., information retrieval systems. It has been
made accessible as a SPARQL endpoint⁹ that accepts POST requests. The body
of the request should be set to a valid SPARQL query while the headers
should be configured to hold the following name-value pairs: (1) Accept:
text/csv and (2) Content-Type: application/sparql-query.

⁹ http://nactem.ac.uk/copious-demo/annotations/sparql

Acknowledgments

We would like to thank Prof. Marilou Nicolas for her valuable inputs. This
work is funded by the British Council [172722806 (COPIOUS)], and is
partially supported by the Engineering and Physical Sciences Research
Council [EP/1038099/1 (CDT)].

References

Hong Cui, Kenneth Jiang, and Partha Pratim Sanyal. 2010. From Text to RDF
  Triple Store: An Application for Biodiversity Literature. In Proceedings
  of the Association for Information Science and Technology (ASIST 2010).

Lushan Han, Tim Finin, Cynthia Parr, Joel Sachs, and Anupam Joshi. 2008.
  RDF123: From Spreadsheets to RDF. In Amit Sheth et al., editors,
  Proceedings of the 7th International Semantic Web Conference (ISWC 2008),
  pages 451–466. Springer Berlin Heidelberg, Berlin, Heidelberg.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001.
  Conditional Random Fields: Probabilistic Models for Segmenting and
  Labeling Sequence Data. In Proceedings of the Eighteenth International
  Conference on Machine Learning (2001), pages 282–289, San Francisco, CA,
  USA. Morgan Kaufmann Publishers Inc.

Cynthia Parr, Joel Sachs, Lushan Han, and Taowei Wang. 2007. RDF123 and
  Spotter: Tools for generating OWL and RDF for biodiversity data in
  spreadsheets and unstructured text. In Proceedings of Biodiversity
  Information Standards Annual Conference (TDWG 2007).

Brian J. Stucky, John Deck, Tom Conlin, Lukasz Ziemba, Nico Cellinese, and
  Robert Guralnick. 2014. The BiSciCol Triplifier: bringing biodiversity
  data to the Semantic Web. BMC Bioinformatics, 15(1):1–9.

Y. Tsuruoka, Y. Tateisi, J.-D. Kim, T. Ohta, J. McNaught, S. Ananiadou, and
  J. Tsujii. 2005. Developing a Robust Part-of-Speech Tagger for Biomedical
  Text. In Advances in Informatics - 10th Panhellenic Conference on
  Informatics, volume 3746 of Lecture Notes in Computer Science, pages
  382–392. Springer-Verlag, Volos, Greece, November.
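
As an illustration of the endpoint usage described in the Conclusion (a POST
request whose body is a SPARQL query, sent with the headers Accept: text/csv
and Content-Type: application/sparql-query), the following is a minimal
Python sketch using only the standard library. The function names
build_sparql_request, parse_csv_results and run_query are hypothetical and
are not part of the repository or of any published client; the endpoint URL
is the one given in the paper, and its current availability is an
assumption.

```python
import csv
import io
import urllib.request

# Endpoint URL as stated in the paper; it may no longer be reachable.
ENDPOINT = "http://nactem.ac.uk/copious-demo/annotations/sparql"


def build_sparql_request(query, endpoint=ENDPOINT):
    """Build (but do not send) a POST request carrying a SPARQL query.

    Hypothetical helper: it only applies the two headers named in the paper.
    """
    headers = {
        "Accept": "text/csv",
        "Content-Type": "application/sparql-query",
    }
    return urllib.request.Request(
        endpoint,
        data=query.encode("utf-8"),  # the request body is the raw query text
        headers=headers,
        method="POST",
    )


def parse_csv_results(csv_text):
    """Turn a text/csv result into a list of dicts keyed by variable name."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def run_query(query, endpoint=ENDPOINT, timeout=30):
    """Send the query and return the parsed rows (requires network access)."""
    request = build_sparql_request(query, endpoint)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_csv_results(response.read().decode("utf-8"))
```

Assuming the endpoint is reachable, calling run_query with the text of
Listing 1 would return one dict per result row, with the keys tx, lc, cnt
and src taken from the SELECT clause. Keeping request construction separate
from sending makes it possible to inspect the headers and body without any
network access.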