<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>a
journal article often requires that readers</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Enhancing Information Accessibility of Scientific Publications with Text Mining and Ontology</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Weijia Xu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>American Society of Plant Biologists</institution>
          ,
          <addr-line>Rockville, Maryland</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Botany and Plant Pathology, Oregon State University, Oregon</institution>
          ,
          <addr-line>Portland</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Texas Advanced Computing Center, University of Texas at Austin</institution>
          ,
          <addr-line>Austin, Texas</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We present an ongoing effort to use text mining methods and existing biological ontologies to help readers access the information contained in scientific articles. Our approach combines multiple strategies for biological entity detection with association analysis on the extracted entities. The entity extraction process utilizes regular expression rules, ontologies, and a keyword dictionary to obtain a comprehensive list of biological entities. In addition to extracting a list of entities, we also apply natural language processing and association analysis techniques to generate inferences among entities and compare them to known relations documented in the existing ontologies.</p>
      </abstract>
      <kwd-group>
        <kwd>Information Ontology</kwd>
        <kwd>Text Mining</kwd>
        <kwd>Association Analysis</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>I. INTRODUCTION</title>
      <p>To address this challenge, we present software
developments from an ongoing project, DIVE, which features
automatic extraction of informational vocabulary and web-based
access and curation tools. The framework implements several
strategies for entity extraction, including regular
expression rules, ontologies, and a keyword dictionary. The
extracted biological entities are then stored in a
database and made accessible through an interactive web
application for curation and evaluation by authors and other
domain experts. Additional text mining and association
analysis can be run on the extracted entities to help readers
understand the paper. The system can benefit the entire
life cycle of a digital publication, from initial manuscript
submission to publishing the article and presenting information
to readers. New information defined and verified by experts
may also be injected into other information resources.</p>
    </sec>
    <sec id="sec-2">
      <title>2) Entity Candidate Detection</title>
      <p>We implemented a rule-based approach for processing the text
and structure of a publication in order to identify informational
vocabulary candidates. The detection rules can be defined based on
various heuristics and requirements, such as publishing
requirements, naming conventions, and domain ontologies.
New rules can be added on demand over time. Currently, four
types of rules are implemented in DIVE: regular
expression rules, word dictionary rules, publishing convention
rules, and ontology rules.</p>
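      <p>As a minimal sketch of how the four rule types could be applied to one tagged passage (this is not the DIVE implementation; the pattern, the word lists, the tag set, and the function name <monospace>detect_candidates</monospace> are hypothetical and chosen only for illustration):</p>
      <preformat preformat-type="code"><![CDATA[
```python
import re
import xml.etree.ElementTree as ET

# Hypothetical rule definitions; every pattern and word list here is illustrative.
REGEX_RULES = {
    # A capital letter, 2-5 letters, then digits (a crude gene-like naming pattern).
    "gene_like": re.compile(r"\b[A-Z][A-Za-z]{2,5}[0-9]+\b"),
}
INCLUDE_WORDS = {"chlorophyll"}   # word dictionary: always treat as candidates
EXCLUDE_WORDS = {"Fig1"}          # word dictionary: never treat as candidates
CONVENTION_TAGS = {"italic"}      # publishing convention: enclosing tags to inspect
ONTOLOGY_TERMS = {"leaf", "root"} # stand-in for terms loaded from ontology files


def detect_candidates(xml_text):
    """Return {candidate: set of rule types that detected it}."""
    root = ET.fromstring(xml_text)
    text = "".join(root.itertext())
    hits = {}

    def add(word, rule):
        # The exclusion list overrides every other rule.
        if word not in EXCLUDE_WORDS:
            hits.setdefault(word, set()).add(rule)

    # Regular expression rules: match naming conventions in the plain text.
    for pattern in REGEX_RULES.values():
        for match in pattern.finditer(text):
            add(match.group(0), "regex")

    # Dictionary and ontology rules: look up each word in the term lists.
    for word in re.findall(r"[A-Za-z0-9]+", text):
        lower = word.lower()
        if lower in INCLUDE_WORDS:
            add(lower, "dictionary")
        if lower in ONTOLOGY_TERMS:
            add(lower, "ontology")

    # Publishing convention rules: take the content of selected enclosing tags.
    for tag in CONVENTION_TAGS:
        for elem in root.iter(tag):
            add("".join(elem.itertext()).strip(), "convention")
    return hits


sample = (
    "<p>Expression of <italic>Arabidopsis thaliana</italic> CesA6 in leaf "
    "tissue alters chlorophyll content (Fig1).</p>"
)
candidates = detect_candidates(sample)
for name in sorted(candidates):
    print(name, sorted(candidates[name]))
```
]]></preformat>
      <p>Recording which rule types fired for each candidate, rather than a flat list of words, keeps the information needed later to judge how much validation a candidate requires.</p>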
      <p>
        The regular expression rules utilize common naming
conventions to identify biological entities, such as gene names,
protein names, molecular structures, and chemical compounds.
Each rule is defined as a regular expression and used for
matching candidate words. The word dictionary rule
consists of a pre-defined list of words that should be included
in or excluded from the candidate lists. The publication content is
searched against the list at run time. The publishing
convention rules are used to identify words that are in a special
format, such as italics, or in a particular component of the
publication, such as a figure legend. The enclosing tags of the
candidates are used to define each rule. Additional rules can
be added by specifying additional tag values or by using
naming conventions to detect entities like species names. The
ontology rules utilize five biological ontologies: the Gene
Ontology [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], the Plant Ontology [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], the Plant Trait Ontology [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], the Plant
Environment Condition Ontology [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], and Chemical Entities of
Biological Interest (ChEBI) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
    </sec>
    <sec>
      <title>3) Entity Candidate Assessment</title>
      <p>
By applying the extraction rules listed above, a set of entity
candidates can be detected from the input document. Some
candidates may be detected by multiple rules, and different
detection rules have different accuracy. Ontology-based and
dictionary-based approaches have the highest certainty;
candidates identified only by other rules need further
validation. We currently implement two automatic
validation mechanisms: one is based on previously
validated results, and the other is based on co-location with
other confirmed entities. However, the primary method of
validation is domain expert evaluation through the web
interface, which is detailed in the following section.
      </p>
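      <p>The assessment logic described above could be sketched as follows. This is an illustration under stated assumptions, not the DIVE implementation: the status labels, the function name <monospace>assess</monospace>, and the use of simple substring matching within a sentence as the co-location test are all hypothetical choices.</p>
      <preformat preformat-type="code"><![CDATA[
```python
# Rule types treated as high certainty, following the text above.
HIGH_CERTAINTY_RULES = {"ontology", "dictionary"}


def assess(candidates, sentences, previously_validated):
    """Assign a status to each candidate.

    candidates: {name: set of rule types that detected it}
    sentences: list of sentence strings from the document
    previously_validated: names accepted in earlier curation rounds
    """
    status = {}
    for name, rules in candidates.items():
        if rules & HIGH_CERTAINTY_RULES or name in previously_validated:
            status[name] = "confirmed"
        else:
            status[name] = "needs_review"

    # Co-location pass: promote a candidate that shares a sentence with an
    # entity confirmed in the first pass (promotions are not chained).
    confirmed = {n for n, s in status.items() if s == "confirmed"}
    for name in candidates:
        if status[name] != "needs_review":
            continue
        for sentence in sentences:
            if name in sentence and any(
                other in sentence for other in confirmed if other != name
            ):
                status[name] = "confirmed_by_colocation"
                break
    return status


candidates = {
    "leaf": {"ontology"},
    "CesA6": {"regex"},
    "XyZ9": {"regex"},
    "Arabidopsis thaliana": {"convention"},
}
sentences = ["CesA6 is expressed in leaf tissue.", "XyZ9 was not detected."]
status = assess(candidates, sentences, previously_validated={"Arabidopsis thaliana"})
for name in sorted(status):
    print(name, status[name])
```
]]></preformat>
      <p>Candidates left as <monospace>needs_review</monospace> are the ones that would be passed on for expert evaluation, which remains the primary validation method.</p>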
    </sec>
    <sec id="sec-3">
      <title>B. Association Analysis</title>
      <p>
        Association analysis can be used to generate
inferences between values from two or more fields of the data
under a given condition using the FP-Growth algorithm [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The
analysis starts by selecting and aggregating a subset of the data
specified by the input parameters as a list of records, also
known as transactions. The algorithm scans the
selected data set to compute the frequency of each value, also
referred to as an item, and stores the frequency and the
co-occurrence of each item with other items (a group of co-occurring
items is referred to as an itemset) in
a tree structure named the frequent pattern tree (FP-tree). The
frequent itemsets can then be identified from the FP-tree to
generate inferences among subsets of values.
      </p>
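      <p>The FP-tree construction and mining steps described above can be sketched compactly in Python. This is an illustrative sketch only, not the DIVE implementation: the names <monospace>Node</monospace>, <monospace>build_tree</monospace>, <monospace>fp_growth</monospace>, and <monospace>rules</monospace>, the toy transactions, and the thresholds are assumptions made for the example.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import defaultdict


class Node:
    """One FP-tree node: an item, its parent, a count, and child nodes."""

    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 0, {}


def build_tree(transactions, min_support):
    """Build an FP-tree from (items, count) pairs; return {item: [nodes]}."""
    freq = defaultdict(int)
    for items, count in transactions:
        for item in set(items):
            freq[item] += count
    keep = {i for i, c in freq.items() if c >= min_support}
    root, header = Node(None, None), defaultdict(list)
    for items, count in transactions:
        # Order frequent items by descending frequency so shared prefixes merge.
        ordered = sorted((i for i in set(items) if i in keep),
                         key=lambda i: (-freq[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = node.children[item] = Node(item, node)
                header[item].append(child)
            child.count += count
            node = child
    return header


def fp_growth(transactions, min_support, suffix=frozenset()):
    """Yield (itemset, support count) for every frequent itemset."""
    header = build_tree(transactions, min_support)
    for item, nodes in header.items():
        itemset = suffix | {item}
        yield itemset, sum(n.count for n in nodes)
        # Conditional pattern base: the prefix path above each node of `item`.
        conditional = []
        for n in nodes:
            path, p = [], n.parent
            while p is not None and p.item is not None:
                path.append(p.item)
                p = p.parent
            if path:
                conditional.append((path, n.count))
        yield from fp_growth(conditional, min_support, itemset)


def rules(frequent, min_conf):
    """Return single-consequent rules as (antecedent, consequent, confidence)."""
    out = []
    for itemset, supp in frequent.items():
        for item in itemset:
            antecedent = itemset - {item}
            if antecedent and supp / frequent[antecedent] >= min_conf:
                out.append((antecedent, item, supp / frequent[antecedent]))
    return out


# Toy transactions: the set of terms extracted from each of four articles.
docs = [
    {"leaf", "drought", "ABA"},
    {"leaf", "drought"},
    {"leaf", "ABA"},
    {"root", "drought", "ABA"},
]
frequent = dict(fp_growth([(d, 1) for d in docs], min_support=2))
found = rules(frequent, min_conf=0.6)
for antecedent, consequent, conf in found:
    print(sorted(antecedent), "->", consequent, round(conf, 2))
```
]]></preformat>
      <p>Because every subset of a frequent itemset is itself frequent, the antecedent of each rule is guaranteed to be present in <monospace>frequent</monospace>, so the confidence can be computed by a direct lookup.</p>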
      <sec id="sec-3-1">
        <title>III. PRELIMINARY RESULTS</title>
        <p>Figure 2 shows the top 20 inference rules based on all ontology
terms extracted from the collection. Each label indicates a
frequent itemset found in the collection. Each directional arrow
indicates an inference on co-occurrence between two
itemsets. The shade of the arrow indicates the
confidence level of the rule.</p>
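        <p>For reference, the confidence of a rule is commonly defined in terms of support. The text does not state which definition is used, so the standard one is assumed here, with T denoting the set of transactions and X and Y denoting disjoint itemsets:</p>
        <disp-formula>
          <tex-math>\operatorname{supp}(X) = \frac{|\{t \in T : X \subseteq t\}|}{|T|}, \qquad \operatorname{conf}(X \Rightarrow Y) = \frac{\operatorname{supp}(X \cup Y)}{\operatorname{supp}(X)}</tex-math>
        </disp-formula>
        <p>Under this definition, a rule with confidence close to 1 means that nearly every transaction containing X also contains Y.</p>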
        <p>Such visual representations of inferred associations between
diverse entity types could help a researcher
form insights. The associations could also serve as a similarity
metric between articles, which could help editors gauge the
novelty of a new article submission.</p>
        <p>We continue to evaluate the performance of
the entity extraction on larger data sets and to improve its
accuracy. We are gathering feedback from domain researchers
and publishing professionals for further evaluation of entity
candidates. We are also comparing the inferences
from association analysis with known relationships
documented in the existing ontologies.</p>
      </sec>
      <sec id="sec-3-2">
        <title>ACKNOWLEDGMENT</title>
        <p>DIVE is partially supported by CyVerse (NSF awards
DBI-0735191 and DBI-1265383) and by Gramene, a comparative
plant genomics database (NSF award IOS-1127112).</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <collab>National Center for Biotechnology Information</collab>
          .
          <source>Journal Article Tag Suite</source>
          . http://jats.nlm.nih.gov/,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Ashburner</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ball</surname>
            ,
            <given-names>C.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blake</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Botstein</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Butler</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cherry</surname>
            ,
            <given-names>J.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Davis</surname>
            ,
            <given-names>A.P.</given-names>
          </string-name>
          .
          <article-title>Gene Ontology: tool for the unification of biology</article-title>
          .
          <source>Nature Genetics</source>
          <volume>25</volume>
          , no.
          <issue>1</issue>
          (
          <year>2000</year>
          ):
          <fpage>25</fpage>
          -
          <lpage>29</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Jaiswal</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Avraham</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ilic</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kellogg</surname>
            ,
            <given-names>E. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McCouch</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pujar</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zapata</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          (
          <year>2005</year>
          ).
          <article-title>Plant Ontology (PO): a Controlled Vocabulary of Plant Structures and Growth Stages</article-title>
          .
          <source>Comparative and Functional Genomics</source>
          ,
          <volume>6</volume>
          (
          <issue>7-8</issue>
          ),
          <fpage>388</fpage>
          -
          <lpage>397</lpage>
          . http://doi.org/10.1002/cfg.496
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Arnaud</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cooper</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shrestha</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Menda</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nelson</surname>
            ,
            <given-names>R.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Matteis</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Skofic</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2012</year>
          )
          <article-title>Towards a Reference Plant Trait Ontology for Modeling Knowledge of Plant Traits and Phenotypes</article-title>
          . In
          <source>KEOD</source>
          , pp.
          <fpage>220</fpage>
          -
          <lpage>225</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <source>Plant Environment Condition Ontology</source>
          , http://bioportal.bioontology.org/ontologies/PECO#
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Degtyarenko</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>de Matos</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ennis</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hastings</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zbinden</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McNaught</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ashburner</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2008</year>
          ).
          <article-title>ChEBI: a database and ontology for chemical entities of biological interest</article-title>
          .
          <source>Nucleic Acids Research</source>
          ,
          <volume>36</volume>
          (Database issue),
          <fpage>D344</fpage>
          -
          <lpage>D350</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Cooper</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Jaiswal</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          (
          <year>2016</year>
          )
          <article-title>The Plant Ontology: A Tool for Plant Genomics</article-title>
          .
          <source>Plant Bioinformatics: Methods and Protocols</source>
          ,
          <fpage>89</fpage>
          -
          <lpage>114</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pei</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Yin</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          “
          <article-title>Mining frequent patterns without candidate generation</article-title>
          ,” in
          <source>ACM SIGMOD Record</source>
          ,
          <year>2000</year>
          , vol.
          <volume>29</volume>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>