Enhancing Information Accessibility of Scientific
    Publications with Text Mining and Ontology
Weijia Xu, Amit Gupta
Texas Advanced Computing Center
University of Texas at Austin
Austin, Texas USA
{xwj, agupta}@tacc.utexas.edu

Pankaj Jaiswal
Department of Botany and Plant Pathology
Oregon State University
Oregon, Portland, USA
jaiswalp@oregonstate.edu

Crispin Taylor, Patti Lockhart
American Society of Plant Biologists
Rockville, Maryland, USA
{ctaylor, plockhart}@aspb.org


Abstract— We present an ongoing effort to use text mining methods and
existing biological ontologies to help readers access the information
contained in scientific articles. Our approach uses multiple strategies for
biological entity detection and applies association analysis to the extracted
entities. The entity extraction process utilizes regular expression rules,
ontologies, and a keyword dictionary to obtain a comprehensive list of
biological entities. In addition to extracting a list of entities, we also
apply natural language processing and association analysis techniques to
generate inferences among entities and compare them to known relations
documented in the existing ontologies.

Keywords— Information systems applications; Ontology; Text Mining;
Association Analysis

I. INTRODUCTION

Due to its technical depth and rich informational content, a journal article
often requires that readers, domain experts, and curators invest significant
amounts of time and effort to fully comprehend and make intelligent use of its
content. This can be especially true in emerging areas, where novel ideas and
new terminologies may be presented without precedent. As new technologies
accelerate scientific discovery and more content becomes available online, the
number of new articles that must be read and understood continues to rise.
There are over 22 million references indexed by MEDLINE. Therefore, there is a
pressing need to develop computational methods and tools that can facilitate
readers' understanding of the content of a publication.

To address this challenge, we present software developments from an ongoing
project, DIVE, which features automatic extraction of informational vocabulary
and web-based access and curation tools. The framework implements several
strategies for entity extraction, including regular expression rules,
ontologies, and a keyword dictionary. The extracted biological entities are
then stored in a database and made accessible through an interactive web
application for curation and evaluation by authors and other domain experts.
Additional text mining and association analysis can be run on the extracted
entities to aid readers' understanding of the paper. The system can benefit
the entire life cycle of a digital publication, from initial manuscript
submission to publishing the article and presenting information to readers.
New information defined and verified by experts may also be injected into
other information resources.

II. METHODS AND IMPLEMENTATIONS

A. Entity Detection Workflow

Figure 1. Processing workflow overview

Figure 1 shows an overview of our processing workflow. There are three major
steps in processing the documents: text extraction, entity candidate
extraction, and candidate assessment.

1) Text Extraction
The text extraction step processes the input structured document tagged with
JATS [1]. During this step, the input document is processed into two data
structures, one for textual data and one for structural data. The textual data
consists of a list of string representations of the body text of the journal
article. The structural data includes the metadata present in the input
document, such as section marks and special formatting marks. A mapping is
maintained between the textual values and the metadata by their global
positions in the original document. This dual data structure allows for
efficient text processing of the publication content while still making it
easy to retrieve the structure around a particular set of words during the
subsequent steps of processing.

2) Entity Candidate Detection
We implemented a rule-based approach for processing the text and structure in
order to identify informational vocabulary candidates. The detection rules can
be defined based on various heuristics and requirements, such as publishing
requirements, naming conventions, and domain ontologies.
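As a rough illustration of how a rule-based detector of this kind might be organized, the following Python sketch combines one regular expression rule with an include/exclude word dictionary. It is not the DIVE implementation: the pattern, the word lists, and the names `REGEX_RULES`, `INCLUDE_WORDS`, `EXCLUDE_WORDS`, `Candidate`, and `detect_candidates` are all assumptions made for this example.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical rules for illustration only; these patterns and word lists
# are assumptions, not the rules used in DIVE.
REGEX_RULES = {
    # Assumed naming convention: 2-5 uppercase letters followed by digits.
    "gene_like": re.compile(r"\b[A-Z]{2,5}[0-9]+\b"),
}
INCLUDE_WORDS = {"auxin", "chlorophyll"}  # dictionary rule: always candidates
EXCLUDE_WORDS = {"DNA1"}                  # dictionary rule: never candidates


@dataclass(frozen=True)
class Candidate:
    text: str   # matched word
    start: int  # character offset in the input text
    rule: str   # name of the rule that produced the match


def detect_candidates(text: str) -> List[Candidate]:
    """Return candidates from all rules, ordered by position in the text."""
    found = []
    # Regular expression rules, filtered by the exclusion dictionary.
    for name, pattern in REGEX_RULES.items():
        for m in pattern.finditer(text):
            if m.group() not in EXCLUDE_WORDS:
                found.append(Candidate(m.group(), m.start(), name))
    # Dictionary rule: case-insensitive lookup of each word.
    for m in re.finditer(r"\b\w+\b", text):
        if m.group().lower() in INCLUDE_WORDS:
            found.append(Candidate(m.group(), m.start(), "dictionary"))
    return sorted(found, key=lambda c: c.start)


if __name__ == "__main__":
    sample = "Auxin signaling regulates ARF7 and ARF19, but not DNA1."
    for cand in detect_candidates(sample):
        print(cand)
```

Recording the character offset and the originating rule with each candidate keeps enough information to look up the surrounding document structure later and to weight candidates by how they were detected.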
New rules can be added on demand over time. Currently, there are four types of
rules implemented in DIVE: regular expression rules, word dictionary rules,
publishing convention rules, and ontology rules.

The regular expression rules utilize common naming conventions to identify
biological entities, such as gene names, protein names, molecular structures,
and chemical compounds. Each rule is defined as a regular expression and used
for matching candidate words. The word dictionary rule consists of a
pre-defined list of words that should be included in or excluded from the
candidate lists. The publication content is searched against the list at run
time. The publishing convention rules are used to identify words that are in a
special format, such as italics, or in a particular component of the
publication, such as a figure legend. The enclosing tags of the candidates are
used to define each rule. Additional rules can be added by specifying
additional tag values or by using naming conventions to detect entities such
as species names. The ontology rules utilize five biological ontologies: the
Gene Ontology [2], the Plant Ontology [3], the Plant Trait Ontology [4], the
Plant Environment Condition Ontology [5], and Chemical Entities of Biological
Interest (ChEBI) [6].

3) Entity Candidate Assessment
By applying the extraction rules listed above, a set of entity candidates can
be detected from the input document. Some candidates might be detected by
multiple rules. Different detection rules also have different accuracy.
Ontology-based and dictionary-based approaches have the highest certainty.
Candidates identified only by other rules need further validation. We have
currently implemented two automatic validation mechanisms. One is based on
previously validated results; the other is based on co-location with other
confirmed entities. However, the primary method of validation is domain expert
evaluation through the web interface, which is detailed in the following
section.

B. Association Analysis
Association analysis can be used to generate inferences between values from
two or more fields of the data under a given condition, using the FP-Growth
algorithm [8]. The analysis starts by selecting and aggregating a subset of
the data, specified by the input parameters, as a list of records, also known
as transactions. The algorithm scans the selected data set to compute the
frequency of each value, also referred to as an item, and stores the frequency
and the co-occurrence with other items, collectively referred to as an
itemset, in a tree structure named the frequent pattern tree (FP-tree). The
frequent itemsets can then be identified from the FP-tree to generate
inferences among subsets of values.

III. PRELIMINARY RESULTS

Figure 2 shows the top 20 inference rules based on all ontology terms
extracted from the collection. Each label indicates a frequent itemset found
in the collection. The directional arrow indicates an inference on
co-occurrence between two itemsets. The shade of the directional arrow
indicates the confidence level of the rule. Such visual representations of
inferred associations between diverse entity types could greatly aid a
researcher in forming insights. This also has the potential to serve as a
similarity metric between articles that could help editors gauge the novelty
of a new article submission.

Figure 2. Top 20 inference rules from association analysis.

We are continuing to evaluate the performance of the entity extraction over
large data sets and to improve its accuracy. We are gathering feedback from
domain researchers and publishing professionals for further entity candidate
evaluations. We are also working on comparing the inferences from association
analysis with known relationships documented in the existing ontologies.

ACKNOWLEDGMENT

DIVE is partially supported by CyVerse (NSF awards DBI-0735191 and
DBI-1265383) and Gramene, a Comparative Plant Genomics Database (NSF award
IOS-1127112).

REFERENCES

[1] National Center for Biotechnology Information. Journal Article Tag Suite.
    http://jats.nlm.nih.gov/, 2013.
[2] Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry,
    J.M., Davis, A.P. "Gene Ontology: tool for the unification of biology."
    Nature Genetics 25, no. 1 (2000): 25-29.
[3] Jaiswal, P., Avraham, S., Ilic, K., Kellogg, E.A., McCouch, S., Pujar, A.,
    Zapata, F. (2005). Plant Ontology (PO): a Controlled Vocabulary of Plant
    Structures and Growth Stages. Comparative and Functional Genomics, 6(7-8),
    388–397. http://doi.org/10.1002/cfg.496
[4] Arnaud, E., Cooper, L., Shrestha, R., Menda, N., Nelson, R.T., Matteis, L.,
    Skofic, M. (2012). Towards a Reference Plant Trait Ontology for Modeling
    Knowledge of Plant Traits and Phenotypes. In KEOD, pp. 220-225.
[5] Plant Environment Condition Ontology,
    http://bioportal.bioontology.org/ontologies/PECO#
[6] Degtyarenko, K., de Matos, P., Ennis, M., Hastings, J., Zbinden, M.,
    McNaught, A., Ashburner, M. (2008). ChEBI: a database and ontology for
    chemical entities of biological interest. Nucleic Acids Research, 36
    (Database issue), D344–D350.
[7] Cooper, L. and Jaiswal, P. (2016). The Plant Ontology: A Tool for Plant
    Genomics. Plant Bioinformatics: Methods and Protocols, 89-114.
[8] Han, J., Pei, J., and Yin, Y. "Mining frequent patterns without candidate
    generation," in ACM SIGMOD Record, 2000, vol. 29, no. 2, pp. 1–12.