A Web Application for Extracting Key Domain Information for Scientific Publications using Ontology

Weijia Xu, Amit Gupta
Texas Advanced Computing Center
University of Texas at Austin
Austin, Texas USA
{xwj, agupta}@tacc.utexas.edu

Pankaj Jaiswal
Department of Botany and Plant Pathology
Oregon State University
Oregon, Portland, USA
jaiswalp@oregonstate.edu

Crispin Taylor, Patti Lockhart
American Society of Plant Biologists
Rockville, Maryland, USA
{ctaylor, plockhart}@aspb.org


Abstract— We present demos of an ongoing project, domain informational vocabulary extraction (DIVE), which aims to enrich digital publications by detecting entities and key informational words and by adding annotations. The system implements multiple strategies for biological entity detection, including regular expression rules, ontologies, and a keyword dictionary. The extracted entities are then stored in a database and made accessible through an interactive web application for curation and evaluation by authors. Through the web interface, the user can make additional annotations and corrections to the current results. The updates can then be used to improve entity detection in subsequently processed articles. Although the system is being developed in the context of annotating journal articles, it can also be beneficial to domain curators and researchers at large.

Keywords—Information systems applications; Information integration; Ontology; Text Mining

I. INTRODUCTION

Due to its technical depth and rich informational content, a journal article often requires that readers, domain experts, and curators invest significant amounts of time and effort to fully comprehend and make intelligent use of its content. This can be especially true in emerging areas, where novel ideas and new terminologies may be presented without precedent. As new technologies accelerate scientific discovery and more content becomes available online, the number of new articles that must be read and understood continues to rise. Therefore, there is a pressing need to develop computational methods and tools that can enrich the information content of digital publications, improve their accessibility and utility, and facilitate readers' understanding by creating links between journal articles and relevant database entities during the article production process. To address this issue, we present software developments from an ongoing project, DIVE, which features automatic extraction of informational vocabulary, web-based access and curation tools, and integration into the digital publication process.

The framework implements several strategies for entity extraction, including regular expression rules, ontologies, and a keyword dictionary. The extracted biological entities are then stored in a database and made accessible through an interactive web application for curation and evaluation by authors and other domain experts. Through the web interface, a user can make additional annotations and corrections to the initial result set. The updates are stored and managed via the relational database for future improvements.

We present application demos to illustrate this framework using a set of plant biology articles. We detail the design and implementation of the system, including entity detection, the extraction pipeline, and the web interface; we also present a use case demonstration. We would like to engage publishers and biology data curators in discussion and feedback.

There are three major steps in processing the documents: text extraction, entity candidate extraction, and candidate assessment. The input for the text extraction process is a structured document tagged with the Journal Article Tag Suite (JATS) [1]. The input document is processed into two data structures, one for textual data and one for structural data. This dual data structure allows for efficient text processing of the publication content while still making it easy to retrieve the meta-structure around a particular set of words during the subsequent processing steps. To identify informational vocabulary candidates, our application implements four sets of extraction rules: regular expression rules, a word dictionary, publishing conventions, and ontology rules. The ontology rules utilize five biological ontologies: the Gene Ontology [2], Plant Ontology [3], Plant Trait Ontology [4], Plant Environment Condition Ontology [5], and Chemical Entities of Biological Interest (ChEBI) [6]. The results from document processing are stored in a MySQL database, which serves as data storage for the web application.

The web front end in our prototype is implemented using Django (v1.8). Based on Python, the web front end is easily programmable, extensible, and pluggable with multiple popular databases. It forms the presentation layer of this system, relying on the back-end code to run the entity extraction algorithms on the manuscript and to transfer the results in JSON format.

The system can benefit the entire life cycle of the digital publication, from initial manuscript submission to publishing the article and presenting information to readers. At the initial manuscript submission stage, the manuscript can be processed to extract known key informational vocabulary, such as biological entities, as well as to identify potential new technical words. That information may be used by editors to identify appropriate reviewers for the manuscript. After the article has been accepted for publication, additional information about the key informational words, such as links to external repositories or reference sites, may also be embedded during the pre-publication production process to enrich the information content and accessibility. Publication curators may also
leverage the information for curation. New information defined and verified by experts may also be injected into other information resources, such as Planteome [7].

II. APPLICATION FEATURES OVERVIEW

Let us use an example to illustrate the features of our web interface, thereby displaying the various views, layouts, and functions available. Our prototype includes 609 manuscripts from the journal Plant Physiology.

Figure 1. Collection paginated view

The publication list view (Figure 1) is a paginated list of all the articles with an external DOI reference and the article title.

Figure 2. Interface for exploring entities in a publication

Figure 2 shows an example of exploring entities extracted from a full article (i.e., 1002.xml in Figure 1). At the top of the page, the title and abstract of the article [8] are presented to give the user some context for the manuscript. The list of entities found in this article is organized in a table. Each row includes information such as the name, type, and number of occurrences of an entity in the article. The XRef column presents possible matches to existing ontology terms. If available, links are also presented to other online databases with more information on that entity. For example, “leaf senescence” is matched to a term in the trait ontology with a link to the corresponding entry in the Planteome ontology database. In another example, “MES” is matched to a term in ChEBI, but a link to the external database is missing at this time. The “species” column shows a prediction of which species this entity is likely associated with, based on the proximity of the term to the species names that appear in the article and/or as indicated by the ontologies. The “Figure caption” column shows whether the entity has been used within a figure caption in the article.

It is important to note that some entities can be matched to multiple ontology terms. Such cases are currently resolved based on the general priorities we assigned to each extraction rule during the extraction phase. However, for a particular article, the result may not always be the most appropriate one. The entity extraction phase can also generate phrases that match the spelling of an ontology term but are not used in the sense implied by the term. Those situations require expert knowledge and input.

Therefore, each row also includes user control buttons for editing the record. Figure 3 shows an example of the entity record editing interface. In this view, there are editable fields of this record where a user may correct or enter new values. A dynamic search box can be used to search and add new species into the species menu, if the appropriate species was not detected or inferred from the article. This search box uses an online service from NCBI to provide a comprehensive list of options as the user types into it. Sentences in which this entity occurs are extracted from the manuscript with the entity name highlighted in yellow. This gives the user fuller context for the entity, taken directly from the manuscript text.

Figure 3. Interface for showing/editing entity details

The prototype is still under development, and we welcome feedback from domain researchers and publishing professionals for future developments and improvements.

ACKNOWLEDGMENT

DIVE is partially supported by CyVerse (NSF awards DBI-0735191 and DBI-1265383) and Gramene, a Comparative Plant Genomics Database (NSF award IOS-1127112).

REFERENCES

[1] National Center for Biotechnology Information. Journal Article Tag Suite. http://jats.nlm.nih.gov/, 2013.
[2] Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P. (2000). Gene Ontology: tool for the unification of biology. Nature Genetics, 25(1), 25-29.
[3] Jaiswal, P., Avraham, S., Ilic, K., Kellogg, E. A., McCouch, S., Pujar, A., Zapata, F. (2005). Plant Ontology (PO): a Controlled Vocabulary of Plant Structures and Growth Stages. Comparative and Functional Genomics, 6(7-8), 388–397. http://doi.org/10.1002/cfg.496
[4] Arnaud, E., Cooper, L., Shrestha, R., Menda, N., Nelson, R. T., Matteis, L., Skofic, M. (2012). Towards a Reference Plant Trait Ontology for Modeling Knowledge of Plant Traits and Phenotypes. In KEOD, pp. 220-5.
[5] Plant Environment Condition Ontology, http://bioportal.bioontology.org/ontologies/PECO#
[6] Degtyarenko, K., de Matos, P., Ennis, M., Hastings, J., Zbinden, M., McNaught, A., Ashburner, M. (2008). ChEBI: a database and ontology for chemical entities of biological interest. Nucleic Acids Research, 36(Database issue), D344–D350.
[7] Cooper, L. and Jaiswal, P. (2016). The Plant Ontology: A Tool for Plant Genomics. Plant Bioinformatics: Methods and Protocols, 89-114.
[8] Hou, K., Wu, W., & Gan, S. S. (2013). SAUR36, a small auxin up RNA gene, is involved in the promotion of leaf senescence in Arabidopsis. Plant Physiology, 161(2), 1002-1009.
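
Illustrative sketch. Section II notes that an entity can be matched by more than one extraction rule and that such cases are resolved using general priorities assigned to each rule. The following minimal Python sketch shows one way such a resolution step could work; it is not taken from the DIVE code base. The rule names, the priority values, the gene-symbol pattern, the toy term lists, and the function names are all assumptions made for illustration only.

```python
import re
from dataclasses import dataclass

# Hypothetical priorities (assumption): a lower number wins on overlap.
RULE_PRIORITY = {"ontology": 0, "dictionary": 1, "regex": 2}

# Toy term lists standing in for real ontology and dictionary lookups.
ONTOLOGY_TERMS = {"leaf senescence": "trait ontology", "MES": "ChEBI"}
DICTIONARY_TERMS = {"senescence", "auxin"}
# Hypothetical pattern for gene-like symbols such as SAUR36.
GENE_PATTERN = re.compile(r"\b[A-Z]{2,}[0-9]+\b")


@dataclass(frozen=True)
class Candidate:
    text: str   # matched surface string
    start: int  # character offset where the match begins
    end: int    # character offset just past the match
    rule: str   # name of the rule that produced the match


def find_terms(text, terms, rule):
    """Find whole-word occurrences of each term in the text."""
    found = []
    for term in terms:
        # All-uppercase terms (e.g. abbreviations) are matched case-sensitively.
        flags = 0 if term.isupper() else re.IGNORECASE
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, flags):
            found.append(Candidate(m.group(0), m.start(), m.end(), rule))
    return found


def extract_candidates(text):
    """Run every rule and pool the candidates without resolving conflicts."""
    candidates = find_terms(text, ONTOLOGY_TERMS, "ontology")
    candidates += find_terms(text, DICTIONARY_TERMS, "dictionary")
    candidates += [Candidate(m.group(0), m.start(), m.end(), "regex")
                   for m in GENE_PATTERN.finditer(text)]
    return candidates


def resolve(candidates):
    """Among overlapping candidates, keep the one from the highest-priority rule."""
    ordered = sorted(
        candidates,
        key=lambda c: (RULE_PRIORITY[c.rule], -(c.end - c.start), c.start),
    )
    kept = []
    for c in ordered:
        # Accept a candidate only if it overlaps nothing already accepted.
        if all(c.end <= k.start or c.start >= k.end for k in kept):
            kept.append(c)
    return sorted(kept, key=lambda c: c.start)


sentence = "SAUR36 promotes leaf senescence in Arabidopsis."
entities = resolve(extract_candidates(sentence))
for e in entities:
    print(e.rule, repr(e.text), e.start, e.end)
```

For the sample sentence, the dictionary hit "senescence" overlaps the ontology hit "leaf senescence" and is dropped because the ontology rule has the higher assumed priority, while the non-overlapping regular expression hit "SAUR36" is kept. A fixed global ordering like this is simple, but as noted in Section II it will not always give the most appropriate result for a particular article, which is why the curation interface lets experts correct it.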