<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Web Application for Extracting Key Domain Information for Scientific Publications using Ontology</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Weijia Xu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>American Society of Plant Biologists, Rockville</institution>
          ,
          <addr-line>Maryland</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Botany and Plant Pathology, Oregon State University, Oregon</institution>
          ,
          <addr-line>Portland</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Texas Advanced Computing Center, University of Texas at Austin, Austin</institution>
          ,
          <addr-line>Texas</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We present demos of an ongoing project, domain informational vocabulary extraction (DIVE), which aims to enrich digital publications by detecting entities and key informational words and by adding annotations. The system implements multiple strategies for biological entity detection, including regular expression rules, ontologies, and a keyword dictionary. The extracted entities are stored in a database and made accessible through an interactive web application for curation and evaluation by authors. Through the web interface, the user can make additional annotations and corrections to the current results. These updates can then be used to improve entity detection in subsequently processed articles. Although the system is being developed in the context of annotating journal articles, it can also be beneficial to domain curators and researchers at large.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Keywords</title>
      <p>Information systems applications;
Information integration; Ontology; Text Mining</p>
    </sec>
    <sec id="sec-2">
      <title>I. INTRODUCTION</title>
      <p>Due to its technical depth and rich informational content, a
journal article often requires that readers, domain experts, and
curators invest significant amounts of time and effort to fully
comprehend and make intelligent use of its content. This can be
especially true in emerging areas, where novel ideas and new
terminologies may be presented without precedent. As new
technologies accelerate scientific discovery and more content
becomes available online, the number of new articles that must
be read and understood continues to rise. Therefore, there is a
pressing need to develop computational methods and tools that
can enrich the information content of digital publications,
improve their accessibility and utility, and facilitate readers’
understanding by creating links between journal articles and
relevant database entities during the article production process.
To address this need, we present software developments from
an ongoing project, DIVE, which features automatic extraction of
informational vocabulary, web-based access and curation tools,
and integration into the digital publication process.</p>
      <p>The framework implements several strategies for entity
extraction, including regular expression rules, ontologies,
and a keyword dictionary. The extracted
biological entities are then stored in a database and made
accessible through an interactive web application for curation
and evaluation by authors and other domain experts. Through
the web interface, a user can make additional annotations and
corrections to the initial result set. The updates are stored and
managed in the relational database for future improvements.</p>
      <p>We present application demos to illustrate this framework
using a set of plant biology articles. We detail the design and
implementation of the system, including entity detection, the
extraction pipeline, and the web interface; we also present a use
case demonstration. We would like to engage publishers and
biology data curators in discussion and feedback.</p>
      <p>
        There are three major steps in processing the documents:
text extraction, entity candidate extraction, and candidate
assessment. The input to the text extraction step is a
structured document tagged with the Journal Article Tag Suite
(JATS) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The input document is parsed into two data
structures, one for textual data and one for structural data. This dual data
structure allows efficient text processing of the publication
content while making it easy to retrieve the
metastructure around a particular set of words during the
subsequent processing steps. To identify informational
vocabulary candidates, our application implements four sets
of extraction rules: regular expression rules, a word dictionary,
publishing conventions, and ontology rules. The ontology rules
utilize five biological ontologies: gene ontology [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ],
plant ontology [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], plant trait ontology [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], plant environment
condition ontology [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and Chemical Entities of Biological
Interest (ChEBI) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The results of document processing
are stored in a MySQL database, which serves as the data store for
the web application.
      </p>
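      <p>The dual data structure and the rule-based candidate extraction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the DIVE implementation: the sample fragment, the regular expression, the one-term dictionary, and the function names build_dual_structure, context_of, and find_candidates are all introduced here for illustration only.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import re
import xml.etree.ElementTree as ET

# Hypothetical miniature JATS-like fragment; real input would be a full article.
SAMPLE = """<article><body>
<sec><title>Results</title>
<p>SAUR36 promotes leaf senescence in Arabidopsis.</p>
<fig><caption><p>Expression of SAUR36 in leaves.</p></caption></fig>
</sec></body></article>"""

GENE_RULE = re.compile(r"\b[A-Z]{2,}[0-9]+\b")  # assumed rule for gene-like symbols
DICTIONARY = {"leaf senescence"}                # assumed keyword dictionary


def build_dual_structure(xml_text):
    """Return (text, spans): flat text plus (start, end, path) per text run."""
    root = ET.fromstring(xml_text)
    chunks, spans = [], []
    offset = 0

    def add(run, path):
        nonlocal offset
        run = " ".join(run.split())  # normalize whitespace
        if not run:
            return
        chunks.append(run)
        spans.append((offset, offset + len(run), path))
        offset += len(run) + 1  # +1 for the space that joins runs

    def walk(elem, path):
        here = path + "/" + elem.tag
        add(elem.text or "", here)
        for child in elem:
            walk(child, here)
            add(child.tail or "", here)

    walk(root, "")
    return " ".join(chunks), spans


def context_of(position, spans):
    """Look up the element path that encloses a character offset."""
    for start, end, path in spans:
        if start <= position < end:
            return path
    return None


def find_candidates(text, spans):
    """Apply the regex rule and the dictionary rule to the flat text."""
    found = []
    for m in GENE_RULE.finditer(text):
        found.append((m.group(), "regex", context_of(m.start(), spans)))
    lowered = text.lower()
    for term in sorted(DICTIONARY):
        for m in re.finditer(re.escape(term), lowered):
            found.append((term, "dictionary", context_of(m.start(), spans)))
    return found


text, spans = build_dual_structure(SAMPLE)
candidates = find_candidates(text, spans)
for name, rule, path in candidates:
    print(name, rule, path)
```
]]></preformat>
      <p>Because each candidate carries the element path of its enclosing text run, a later step could tell, for example, whether an entity occurs inside a figure caption.</p>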
      <p>The web front end in our prototype is implemented using
Django (v 1.8). Because Django is based on Python, the web front end is easily
programmable and extensible, and it can be plugged into multiple
popular databases. It forms the presentation layer of the
system, relying on the back-end code to run the entity
extraction algorithms on the manuscript and to transfer the
results in JSON format.</p>
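      <p>As a rough illustration of the JSON hand-off between the back end and the presentation layer, the sketch below serializes a list of entity records with the Python standard library. The field names (article, count, entities, name, type, occurrences) and the function name entities_to_json are hypothetical; the actual payload layout used by DIVE is not described here.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import json


def entities_to_json(article_id, entities):
    """Serialize extracted entity records into a JSON string."""
    payload = {
        "article": article_id,
        "count": len(entities),
        "entities": entities,
    }
    # sort_keys gives a stable key order, which simplifies testing and diffing.
    return json.dumps(payload, sort_keys=True)


# Hypothetical records for illustration only.
records = [
    {"name": "SAUR36", "type": "gene", "occurrences": 2},
    {"name": "leaf senescence", "type": "trait", "occurrences": 1},
]
payload = entities_to_json("1002.xml", records)
print(payload)
```
]]></preformat>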
      <p>
        The system can benefit the entire life cycle of the digital
publication, from initial manuscript submission to publishing
the article and presenting information to readers. At the initial
manuscript submission stage, the manuscript can be processed
to extract known key informational vocabulary, such as
biological entities, as well as to identify potential new technical
words. That information may be used by editors to identify
appropriate reviewers for the manuscript. After the article has
been accepted for publication, additional information about the
key informational words, such as links to external repositories
or reference sites, may also be embedded during the
prepublication production process to enrich the information
content and accessibility. Publication curators may also
leverage the information for curation. New information defined
and verified by experts may also be injected into other
information resources, such as Planteome [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>II. APPLICATION FEATURES OVERVIEW</title>
      <p>We use an example to illustrate the features of our web
interface, including the views, layouts, and
functions available. Our prototype includes 609 manuscripts
from the journal Plant Physiology.</p>
      <p>The publication list view (Figure 1) is a paginated list of all
the articles, each shown with an external DOI reference and the article title.</p>
      <p>
        Figure 2 shows an example of exploring the entities extracted
from a full article (i.e., 1002.xml in Figure 1). At the top of the
page, the title and abstract of the article [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] are presented to
give the user some context for the manuscript. The entities
found in this article are organized in a table. Each row includes
information such as the name, type, and number of occurrences of an
entity in the article. The XRef column presents possible
matches to existing ontology terms. If available, links to other
online databases with more information about
that entity are also presented. For example, “leaf senescence” is matched to a
term in the trait ontology, with a link to the corresponding entry in
the Planteome ontology database. In another example, “MES”
is matched to a term in ChEBI, but a link to the external
database was missing at the time of writing. The “species” column shows
a prediction of which species the entity is likely associated with,
based on the proximity of the term to the species names
appearing in the article and/or on the species indicated by the ontologies. The
“Figure caption” column shows whether the entity has been
used within a figure caption in the article.
      </p>
      <p>
        The entity extraction phase can also generate phrases that,
while matching an ontology term alphabetically, are not used in the
sense implied by the term. Such situations require expert
knowledge and input. Therefore, each row also includes user control buttons for
editing the record. Figure 3 shows an example of the entity
record editing interface. In this view, the record has editable fields
where a user may correct or enter new values. A
dynamic search box can be used to search for and add new species
to the species menu if the appropriate species was not
detected or inferred from the article. This search box uses an
online service from NCBI to provide a comprehensive list
of options as the user types into it. The sentences in which
the entity occurs are extracted from the manuscript, with
the entity name highlighted in yellow. This provides
nearly complete context for the entity, as given in
the manuscript text.
      </p>
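      <p>The proximity-based species prediction described in this section can be sketched as follows. The text above does not specify how proximity is measured, so this sketch assumes the simplest option: the character distance between the start of an entity mention and the start of a species mention. The function name nearest_species and the sample text are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import re


def nearest_species(text, entity, species_names):
    """Return the species whose mention lies closest to any mention of entity.

    Distance is measured in characters between mention start offsets
    (an assumed metric). Returns None if either side is never mentioned.
    """
    entity_positions = [m.start() for m in re.finditer(re.escape(entity), text)]
    best_name, best_dist = None, None
    for name in species_names:
        for m in re.finditer(re.escape(name), text):
            for pos in entity_positions:
                dist = abs(m.start() - pos)
                if best_dist is None or dist < best_dist:
                    best_name, best_dist = name, dist
    return best_name


# Hypothetical sample text for illustration only.
TEXT = (
    "SAUR36 promotes leaf senescence in Arabidopsis thaliana. "
    "A related gene family has been described in Oryza sativa."
)
SPECIES = ["Arabidopsis thaliana", "Oryza sativa"]

print(nearest_species(TEXT, "SAUR36", SPECIES))
print(nearest_species(TEXT, "described", SPECIES))
```
]]></preformat>
      <p>A production system would likely combine such a score with species information from the matched ontology term, as the text notes, and leave the final decision to the curator.</p>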
      <p>It is important to note that some entities can be matched to
multiple ontology terms. Such cases are currently resolved
based on the general priorities we assigned to each extraction
rule during the extraction phase. However, for a particular
article, the result may not always be the most appropriate one.
The prototype is still under development, and we welcome
feedback from domain researchers and publishing
professionals for future developments and improvements.</p>
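      <p>The priority-based resolution of multiple matches mentioned above can be sketched as follows. The priority order, the rule labels, and the term identifiers are placeholders assumed for this example; the priorities actually assigned in DIVE are not listed in this paper.</p>
      <preformat preformat-type="code"><![CDATA[
```python
# Assumed priority order (lower number wins); placeholder values only.
RULE_PRIORITY = {"regex": 0, "convention": 1, "ontology": 2, "dictionary": 3}


def resolve(matches):
    """Keep one match per entity: the one from the highest-priority rule.

    matches is an iterable of (entity, rule, term_id) tuples. When two
    matches share the same priority, the first one seen is kept.
    """
    best = {}
    for entity, rule, term_id in matches:
        current = best.get(entity)
        if current is None or RULE_PRIORITY[rule] < RULE_PRIORITY[current[0]]:
            best[entity] = (rule, term_id)
    return best


# Hypothetical matches; the identifiers are placeholders, not real term IDs.
matches = [
    ("MES", "dictionary", "DICT:placeholder"),
    ("MES", "ontology", "CHEBI:placeholder"),
    ("leaf senescence", "ontology", "TO:placeholder"),
]
resolved = resolve(matches)
print(resolved)
```
]]></preformat>
      <p>Because a fixed global order cannot be right for every article, the losing matches would ideally be retained so that a curator can override the choice in the editing interface.</p>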
    </sec>
    <sec id="sec-4">
      <title>ACKNOWLEDGMENT</title>
      <p>DIVE is partially supported by CyVerse (NSF awards
DBI-0735191 and DBI-1265383) and by Gramene, a comparative
plant genomics database (NSF award IOS-1127112).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <collab>National Center for Biotechnology Information</collab>
          .
          <source>Journal Article Tag Suite</source>
          . http://jats.nlm.nih.gov/,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Ashburner</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ball</surname>
            ,
            <given-names>C.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blake</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Botstein</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Butler</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cherry</surname>
            ,
            <given-names>J.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Davis</surname>
            <given-names>A.P.</given-names>
          </string-name>
          (
          <year>2000</year>
          )
          <article-title>Gene Ontology: tool for the unification of biology</article-title>
          .
          <source>Nature Genetics</source>
          ,
          <volume>25</volume>
          , no.
          <issue>1</issue>
          , pp
          <fpage>25</fpage>
          -
          <lpage>29</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Jaiswal</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Avraham</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ilic</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kellogg</surname>
            ,
            <given-names>E. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McCouch</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pujar</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zapata</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          (
          <year>2005</year>
          ).
          <article-title>Plant Ontology (PO): a Controlled Vocabulary of Plant Structures and Growth Stages</article-title>
          .
          <source>Comparative and Functional Genomics</source>
          ,
          <volume>6</volume>
          (
          <issue>7-8</issue>
          ),
          <fpage>388</fpage>
          -
          <lpage>397</lpage>
          . http://doi.org/10.1002/cfg.496
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Arnaud</surname>
            ,
            <given-names>E</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cooper</surname>
            ,
            <given-names>L</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shrestha</surname>
            ,
            <given-names>R</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Menda</surname>
            ,
            <given-names>N</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nelson</surname>
            ,
            <given-names>R T</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Matteis</surname>
            ,
            <given-names>L</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Skofic</surname>
            ,
            <given-names>M</given-names>
          </string-name>
          (
          <year>2012</year>
          )
          <article-title>Towards a Reference Plant Trait Ontology for Modeling Knowledge of Plant Traits and Phenotypes</article-title>
          . In
          <source>KEOD</source>
          , pp.
          <fpage>220</fpage>
          -
          <lpage>225</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <source>Plant Environment Condition Ontology</source>
          , http://bioportal.bioontology.org/ontologies/PECO#
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Degtyarenko</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>de Matos</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ennis</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hastings</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zbinden</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McNaught</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ashburner</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2008</year>
          ).
          <article-title>ChEBI: a database and ontology for chemical entities of biological interest</article-title>
          .
          <source>Nucleic Acids Research</source>
          ,
          <volume>36</volume>
          (Database issue),
          <fpage>D344</fpage>
          -
          <lpage>D350</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Cooper</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Jaiswal</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          (
          <year>2016</year>
          )
          <article-title>The Plant Ontology: A Tool for Plant Genomics</article-title>
          .
          <source>Plant Bioinformatics: Methods and Protocols</source>
          ,
          <fpage>89</fpage>
          -
          <lpage>114</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Hou</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Gan</surname>
            ,
            <given-names>S. S.</given-names>
          </string-name>
          (
          <year>2013</year>
          ).
          <article-title>SAUR36, a small auxin up RNA gene, is involved in the promotion of leaf senescence in Arabidopsis</article-title>
          .
          <source>Plant Physiology</source>
          ,
          <volume>161</volume>
          (
          <issue>2</issue>
          ),
          <fpage>1002</fpage>
          -
          <lpage>1009</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>