<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Construction of a biodiversity knowledge repository using a text mining-based framework</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Riza Batista-Navarro</string-name>
          <email>riza.batista@manchester.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chrysoula Zerva</string-name>
          <email>chrysoula.zerva@manchester.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sophia Ananiadou</string-name>
          <email>sophia.ananiadou@manchester.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Computer Science, University of Manchester, Manchester</institution>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
      </contrib-group>
      <fpage>22</fpage>
      <lpage>25</lpage>
      <abstract>
        <p>In our aim to make the information encapsulated by biodiversity literature more accessible and searchable, we have developed a text mining-based framework for automatically transforming text into a structured knowledge repository. A text mining workflow employing information extraction techniques, i.e., named entity recognition and relation extraction, was implemented in the Argo platform and was subsequently applied on biodiversity literature to extract structured information. The resulting annotations were stored in a repository following the emerging Open Annotation standard, thus promoting interoperability with external applications. Accessible as a SPARQL endpoint, the repository supports knowledge discovery over a huge amount of biodiversity literature by retrieving annotations matching user-specified queries.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Big data—huge data collections—are
proliferating in many disciplines at a rate far faster
than our analytical capabilities can handle. One
particular discipline that has amassed big data is
biological diversity, more popularly known as
biodiversity: the study of variability amongst all life
forms. On the one hand, researchers in this
domain collect primary data pertaining to the
occurrence or distribution of species, and store this
information in a structured format (e.g.,
spreadsheets, database tables). On the other hand,
findings or observations resulting from their analysis
of primary data are usually reported in literature
(e.g., monographs, books, journal articles or
reports), often referred to as secondary data.
Written in natural language, secondary data lacks the
structure that primary data comes with, rendering
the knowledge it contains obscured and
inaccessible. In order to make information from secondary
data available in a structured and thus
searchable form, we have developed a repository
containing information automatically extracted from
biodiversity literature by a customisable text
mining workflow. To maximise its interoperability
with external tools or services, we have made the
knowledge repository available as a Resource
Description Framework (RDF) triple store that
conforms with the Open Annotation standard (http://www.openannotation.org). We
then demonstrate how the repository, accessible
as a SPARQL endpoint, facilitates query-based
search, thus making the information contained in
biodiversity literature discoverable.</p>
      <p>
        A handful of other tools for storing biodiversity
information in RDF format exist. Most of them,
however, do not have the capability to
automatically understand text written in natural language.
Tools such as RDF123
        <xref ref-type="bibr" rid="ref2">(Han et al., 2008)</xref>
        and
BiSciCol Triplifier
        <xref ref-type="bibr" rid="ref5">(Stucky et al., 2014)</xref>
        , for
example, accept only data that is already in the form
of structured tables. The browser extension
Spotter
        <xref ref-type="bibr" rid="ref4">(Parr et al., 2007)</xref>
        generates RDF-formatted
annotations over blog posts, not by automatically
extracting information from the textual content
but rather by requiring its users to manually
enter structured descriptive metadata. Most similar
to our work is a system for automatically
extracting RDF triples pertaining to species’
morphological characteristics, from the literature on Flora of
North America
        <xref ref-type="bibr" rid="ref1">(Cui et al., 2010)</xref>
        . Their
semantic annotation application provided the user with
an opportunity to revise automatically generated
annotations, an option that can also be enabled
in our approach. We note though that our work
is uniquely underpinned by a highly customisable
and extensible workflow. In this way, when
domain experts call for other types of information to
be captured, our framework will require only
minimal development time and effort to fulfil the task.</p>
    </sec>
    <sec id="sec-methodology">
      <title>Methodology</title>
      <p>In this section, we present in detail our framework
for constructing the knowledge repository. We
begin by briefly describing the corpus of biodiversity
documents that was utilised, and then outline the
various steps in the text mining workflow. We
finally proceed to explaining how the Open
Annotation specification was adopted in order to store the
information extracted from our corpus.</p>
    </sec>
    <sec id="sec-2">
      <title>Document selection</title>
      <p>The Biodiversity Heritage Library (BHL) is a
database of biodiversity literature maintained by
a consortium of natural history and botanical
libraries all over the world. A product of the various
partners’ digitisation efforts, BHL currently
contains almost 110,000 titles, equivalent to almost 50
million pages of text resulting from the
application of optical character recognition (OCR) tools
on scanned images of legacy materials. For this
work, we decided to narrow down the scope of the
knowledge repository to the requirements of our
ongoing project whose aim is to comprehensively
collect both primary and secondary information on
biodiversity in the Philippines.</p>
      <p>To this end, we retrieved only the subset of
English BHL pages which are relevant to the
Philippines, i.e., the union of (1) the set of pages
which mention either “Philippines” or
“Philippine” within their content, and (2) the set of pages
contained by books or volumes whose titles
mention “Philippines” or “Philippine”. This resulted in
a corpus totalling 155,635 pages (around 12 GB
in size).</p>
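      <p>The page-selection rule above can be sketched as follows; the data structures here are invented for illustration and do not reflect BHL's actual export format:

```python
# Illustrative sketch of the corpus-selection rule: keep a page if its
# own text mentions "Philippine(s)", or if the title of the volume
# containing it does. The dicts below are invented sample data.

def mentions_philippines(text):
    """True if the text mentions 'Philippine' or 'Philippines'."""
    return "philippine" in text.lower()  # matches singular and plural

def select_pages(pages):
    """pages: iterable of dicts with 'text' and 'volume_title' keys."""
    return [
        p for p in pages
        if mentions_philippines(p["text"])
        or mentions_philippines(p["volume_title"])
    ]

pages = [
    {"text": "Birds of the Philippines ...", "volume_title": "Ibis"},
    {"text": "Notes on European flora.", "volume_title": "Flora Europaea"},
    {"text": "A new hornbill.", "volume_title": "Fauna of the Philippine Islands"},
]
print(len(select_pages(pages)))  # 2 of the 3 sample pages match
```
</p>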
    </sec>
    <sec id="sec-3">
      <title>Development of text mining workflow</title>
      <p>One of the primary interests of our
collaborators in the project is the discovery of
fundamental species-centric knowledge, particularly
information on species’ geographic locations, habitat,
anatomical parts as well as authorities (i.e.,
persons who described them). Guided by user
requirements, we cast this work as an information
extraction task requiring: (1) named entity
recognition (NER) for taxa, locations, habitat,
anatomical parts and persons; and (2) binary relation
extraction focussing on the following types of
associations: taxon-location, taxon-habitat,
taxon-anatomical part and taxon-person.</p>
      <p>To carry out these tasks on our corpus, we
integrated various natural language processing (NLP)
tools into one workflow using the Argo platform.
Argo (http://argo.nactem.ac.uk) is a web-based, graphical workbench that
facilitates the construction and execution of
bespoke modular text mining workflows.
Underpinning it is a library of diverse elementary NLP
components, each of which performs a specific task.
Argo’s graphical block diagramming interface for
workflow construction provides access to the
component library, representing components as configurable
blocks that can be interconnected to define the
processing sequence.</p>
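      <p>The idea of a workflow assembled from interchangeable blocks can be sketched as follows; this is an illustrative toy, not Argo's actual API:

```python
# Toy sketch of a modular text mining workflow: each component consumes
# and enriches a shared document structure, and components are chained
# in a user-defined order. All names here are illustrative.

def sentence_splitter(doc):
    """Naively split the text into sentences on '. ' boundaries."""
    doc["sentences"] = [s for s in doc["text"].split(". ") if s]
    return doc

def tokeniser(doc):
    """Split each sentence into whitespace-delimited tokens."""
    doc["tokens"] = [s.split() for s in doc["sentences"]]
    return doc

def run_workflow(doc, components):
    for component in components:  # interconnected blocks, in sequence
        doc = component(doc)
    return doc

doc = run_workflow(
    {"text": "Hornbills occur in Luzon. They nest in cavities."},
    [sentence_splitter, tokeniser],
)
print(len(doc["sentences"]))  # 2
```
</p>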
      <p>
        The workflow that we developed, depicted in
Figure 1, combines several components for
preprocessing, syntactic and semantic analyses. It
begins with an SFTP Document Reader which loads
the plain-text corpus from a remote server. This is
followed by a Regex Annotator which attempts to
detect paragraph boundaries based on the
occurrence of newline characters. The paragraphs are
then segmented by the LingPipe Sentence
Splitter (http://alias-i.com/lingpipe) into sentences, each of which is decomposed
into tokens by the GENIA Tagger
        <xref ref-type="bibr" rid="ref6">(Tsuruoka et
al., 2005)</xref>
which also performs part-of-speech
tagging, lemmatisation and chunking. The next
component, the Biodiversity Concept Tagger, is a
machine learning-based named entity recogniser that applies a
conditional random fields (CRF) model
        <xref ref-type="bibr" rid="ref3">(Lafferty et al.,
2001)</xref>
        to assign labels to token sequences. The
labels in this case correspond to the following
categories: taxon, location, habitat, anatomical part,
quality and person.
      </p>
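      <p>To illustrate what this labelling step produces, the following sketch decodes BIO-style labels over a token sequence into typed entity spans; the tokens, labels and routine are invented examples, not the Biodiversity Concept Tagger itself:

```python
# Illustrative decoder for BIO-style sequence labels, as typically
# emitted by a CRF tagger: B-<type> starts an entity, I-<type>
# continues it, and O marks tokens outside any entity.

def decode_bio(tokens, labels):
    """Convert parallel token/label lists into (type, text) entities."""
    entities, current = [], None
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):       # beginning of a new entity
            if current:
                entities.append(current)
            current = (label[2:], [token])
        elif label.startswith("I-") and current and current[0] == label[2:]:
            current[1].append(token)     # continuation of the entity
        else:                            # "O" or an inconsistent label
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return [(etype, " ".join(words)) for etype, words in entities]

tokens = ["Buceros", "hydrocorax", "occurs", "in", "Luzon"]
labels = ["B-Taxon", "I-Taxon", "O", "O", "B-Location"]
print(decode_bio(tokens, labels))
# [('Taxon', 'Buceros hydrocorax'), ('Location', 'Luzon')]
```
</p>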
      <p>The succeeding components in the workflow
contribute towards the relation extraction task.
Enju Parser performs deep syntactic parsing and
extracts syntactic dependencies amongst sentence
tokens. Its outputs are used by the next
component, the Predicate Argument Structure
Extractor, to compute semantic dependencies in the form
of predicate-argument structures. The five
instances of the Dependency Extractor component
then make use of the predicate-argument
structures to detect relationships between names
categorised under the specified entity types. The first
instance, for example, detects only relationships
between taxon and person names, while the last
one captures related anatomical parts and
qualities. The Type Mapper ensures that all of the
named entities and relations extracted conform
with the same annotation schema before they are
all saved in Open Annotation format by the last
component, the Annotation Store Writer. We
briefly describe next how our extracted
annotations are encoded according to this format.</p>
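      <p>The pairing step performed by each Dependency Extractor instance can be sketched as follows; the data structures, predicate names and matching logic are simplified illustrations, not the component's actual implementation:

```python
# Simplified sketch of relation pairing: given named entities (span ->
# type) and predicate-argument structures linking spans, keep only the
# pairs whose entity types match the requested combination.

def extract_relations(entities, pas, type1, type2):
    """pas: list of (predicate, span_a, span_b) triples."""
    relations = []
    for predicate, span_a, span_b in pas:
        types = (entities.get(span_a), entities.get(span_b))
        if types == (type1, type2):
            relations.append((span_a, predicate, span_b))
        elif types == (type2, type1):
            relations.append((span_b, predicate, span_a))
    return relations

entities = {"Buceros hydrocorax": "taxon",
            "Luzon": "location",
            "forest": "habitat"}
pas = [("occur_in", "Buceros hydrocorax", "Luzon"),
       ("found_in", "Buceros hydrocorax", "forest")]
print(extract_relations(entities, pas, "taxon", "location"))
# [('Buceros hydrocorax', 'occur_in', 'Luzon')]
```
</p>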
    </sec>
    <sec id="sec-oa">
      <title>Adopting the Open Annotation model</title>
      <p>The Open Annotation (OA) Core Data Model is
an emerging W3C-recommended standard for
encoding associations between any annotation and
resource (i.e., what is being annotated). Built
upon the Resource Description Framework (RDF),
the OA model represents an annotation as
having a body and a target, with the former
somehow describing the latter, e.g., by assigning a
label or identifier. Following this fundamental idea
and other relevant recommendations given in the
specification (http://www.openannotation.org/spec/core), we represented the named entity
and relation annotations extracted by our text
mining workflow in OA format, as depicted in
Figure 2. For brevity, prefixes were used in this
figure instead of full namespaces, e.g., oa for
http://www.w3.org/ns/oa#.</p>
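      <p>As a minimal sketch of this idea, a single named-entity annotation might be serialised into OA triples along the following lines; the identifiers and exact property set are illustrative, not our full schema:

```python
# Minimal sketch of Open Annotation triples: an annotation node links
# a body (identifying the entity) to a target (the annotated span).
# Identifiers such as ex:ann1 are invented for illustration.

OA = "http://www.w3.org/ns/oa#"

def annotation_triples(ann_id, body_id, target_id):
    """Return (subject, predicate, object) triples for one annotation."""
    return [
        (ann_id, "rdf:type", OA + "Annotation"),
        (ann_id, OA + "hasBody", body_id),
        (ann_id, OA + "hasTarget", target_id),
    ]

triples = annotation_triples("ex:ann1", "ex:taxon-Buceros", "ex:page42-span")
for s, p, o in triples:
    print(s, p, o)
```
</p>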
      <p>Once the RDF triples had been generated, they
were automatically loaded into a new Apache
Jena TDB store (https://jena.apache.org/documentation/tdb), which
was then exposed as a SPARQL endpoint by
Fuseki (https://jena.apache.org/documentation/fuseki2).</p>
    </sec>
    <sec id="sec-usecase">
      <title>Example use case</title>
      <p>We present an example of how our repository, now
in the form of a SPARQL-enabled triple store, can
facilitate knowledge discovery. A user might be
interested, for example, in learning which
specific geographic locations have been described in
the literature as having associations with certain
species, e.g., the bird family of hornbills. Shown
in Listing 1 is a query in SPARQL, the query
language for RDF, that retrieves a list of all such
locations, as well as the number of times that the
relationship was mentioned in the source document.
Listing 1: An example SPARQL query that will
retrieve locations related to hornbills.
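As the listing itself is not reproduced here, a sketch of a query in this spirit, written against a hypothetical vocabulary (the repository's actual predicates and labels differ), might look like:

```sparql
# Prefixes assumed declared: ex: (a hypothetical vocabulary namespace)
# and rdfs: (the RDF Schema namespace).
SELECT ?location (COUNT(?mention) AS ?mentions)
WHERE {
  ?mention ex:taxon ?taxon ;
           ex:location ?location .
  ?taxon rdfs:label "hornbill" .
}
GROUP BY ?location
ORDER BY DESC(?mentions)
```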
</p>
    </sec>
    <sec id="sec-conclusion">
      <title>Conclusion</title>
      <p>In this paper, we presented a framework for
building a knowledge repository that: (1) applies a
customisable text mining workflow to extract
information in the form of named entities and
relationships between them; (2) stores the
automatically extracted knowledge as RDF triples
compliant with the Open Annotation specification; and
(3) facilitates the discovery of otherwise obscured
knowledge by enabling query-based retrieval of
annotations from a SPARQL endpoint. We note
that the triple store can be exposed via other
application programming interfaces, e.g., web services
that abstract away from SPARQL to make
querying straightforward for non-technical users.</p>
      <p>We envision that our knowledge repository will
facilitate the enhancement of search applications,
e.g., information retrieval systems. It has been
made accessible as a SPARQL endpoint that
accepts POST requests. The body of the request
should be set to a valid SPARQL query while the
headers should be configured to hold the
following name-value pairs: (1) Accept: text/csv and (2)
Content-Type: application/sparql-query.
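Such a request can be issued with standard tooling; the following sketch uses Python's standard library, with a placeholder URL standing in for the repository's actual address:

```python
# Sketch of the POST request described above, using only the Python
# standard library. ENDPOINT is a placeholder, not the repository's
# actual address.
import urllib.request

ENDPOINT = "https://example.org/sparql"  # placeholder URL
query = "SELECT ?s WHERE { ?s ?p ?o } LIMIT 10"

request = urllib.request.Request(
    ENDPOINT,
    data=query.encode("utf-8"),                      # body: the raw query
    headers={
        "Accept": "text/csv",                        # results as CSV
        "Content-Type": "application/sparql-query",  # raw-query content type
    },
    method="POST",
)
# urllib.request.urlopen(request) would send it; not executed here.
print(request.get_method(), request.get_header("Accept"))
```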
</p>
    </sec>
    <sec id="sec-ack">
      <title>Acknowledgments</title>
      <p>
We would like to thank Prof. Marilou Nicolas for
her valuable inputs. This work is funded by the
British Council [172722806 (COPIOUS)], and is
partially supported by the Engineering and
Physical Sciences Research Council [EP/1038099/1
(CDT)].</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Hong</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kenneth</given-names>
            <surname>Jiang</surname>
          </string-name>
          , and Partha Pratim Sanyal.
          <year>2010</year>
          .
          <article-title>From Text to RDF Triple Store: An Application for Biodiversity Literature</article-title>
          .
          <source>In Proceedings of the Association for Information Science and Technology (ASIST</source>
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Lushan</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Tim</given-names>
            <surname>Finin</surname>
          </string-name>
          , Cynthia Parr, Joel Sachs, and
          <string-name>
            <given-names>Anupam</given-names>
            <surname>Joshi</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>RDF123: From Spreadsheets to RDF</article-title>
          . In Amit Sheth et al., editors,
          <source>Proceedings of the 7th International Semantic Web Conference (ISWC</source>
          <year>2008</year>
          ), pages
          <fpage>451</fpage>
          -
          <lpage>466</lpage>
          . Springer Berlin Heidelberg, Berlin, Heidelberg.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>John D.</given-names>
            <surname>Lafferty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Andrew</given-names>
            <surname>McCallum</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Fernando C. N.</given-names>
            <surname>Pereira</surname>
          </string-name>
          .
          <year>2001</year>
          .
          <article-title>Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data</article-title>
          .
          <source>In Proceedings of the Eighteenth International Conference on Machine Learning</source>
          (
          <year>2001</year>
          ), pages
          <fpage>282</fpage>
          -
          <lpage>289</lpage>
          , San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Cynthia</given-names>
            <surname>Parr</surname>
          </string-name>
          , Joel Sachs, Lushan Han, and
          <string-name>
            <given-names>Taowei</given-names>
            <surname>Wang</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>RDF123 and Spotter: Tools for generating OWL and RDF for biodiversity data in spreadsheets and unstructured text</article-title>
          .
          <source>In Proceedings of Biodiversity Information Standards Annual Conference (TDWG</source>
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Brian J.</given-names>
            <surname>Stucky</surname>
          </string-name>
          , John Deck, Tom Conlin, Lukasz Ziemba, Nico Cellinese, and
          <string-name>
            <given-names>Robert</given-names>
            <surname>Guralnick</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>The BiSciCol Triplifier: bringing biodiversity data to the Semantic Web</article-title>
          .
          <source>BMC Bioinformatics</source>
          ,
          <volume>15</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tsuruoka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tateisi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-D.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ohta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>McNaught</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ananiadou</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Tsujii</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Developing a Robust Part-of-Speech Tagger for Biomedical Text</article-title>
          .
          <source>In Advances in Informatics - 10th Panhellenic Conference on Informatics, volume 3746 of Lecture Notes in Computer Science</source>
          , pages
          <fpage>382</fpage>
          -
          <lpage>392</lpage>
          . Springer-Verlag, Volos, Greece, November.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>