A Web Application for Extracting Key Domain Information for Scientific Publications using Ontology

Weijia Xu, Amit Gupta
Texas Advanced Computing Center, University of Texas at Austin, Austin, Texas, USA
{xwj, agupta}@tacc.utexas.edu

Pankaj Jaiswal
Department of Botany and Plant Pathology, Oregon State University, Oregon, Portland, USA
jaiswalp@oregonstate.edu

Crispin Taylor, Patti Lockhart
American Society of Plant Biologists, Rockville, Maryland, USA
{ctaylor, plockhart}@aspb.org

Abstract: We present demos of an ongoing project, domain informational vocabulary extraction (DIVE), which aims to enrich digital publications by detecting entities and key informational words and by adding annotations. The system implements multiple strategies for biological entity detection, including regular expression rules, ontologies, and a keyword dictionary. The extracted entities are stored in a database and made accessible through an interactive web application for curation and evaluation by authors. Through the web interface, the user can make additional annotations and corrections to the current results. The updates can then be used to improve entity detection in subsequently processed articles. Although the system is being developed in the context of annotating journal articles, it can also be beneficial to domain curators and researchers at large.

Keywords: Information systems applications; Information integration; Ontology; Text Mining

I. INTRODUCTION

Due to its technical depth and rich informational content, a journal article often requires that readers, domain experts, and curators invest significant amounts of time and effort to fully comprehend and make intelligent use of its content. This can be especially true in emerging areas, where novel ideas and new terminologies may be presented without precedent. As new technologies accelerate scientific discovery and more content becomes available online, the number of new articles that must be read and understood continues to rise. Therefore, there is a pressing need to develop computational methods and tools that can enrich the information content of digital publications, improve their accessibility and utility, and facilitate readers’ understanding by creating links between journal articles and relevant database entities during the article production process. To address this issue, we present software developments from an ongoing project, DIVE, which features automatic extraction of informational vocabulary, web-based access and curation tools, and integration into the digital publication process.

The framework implements several strategies for entity extraction, including regular expression rules, ontologies, and a keyword dictionary. The extracted biological entities are stored in a database and made accessible through an interactive web application for curation and evaluation by authors and other domain experts. Through the web interface, a user can make additional annotations and corrections to the initial result set. The updates are stored and managed via the relational database for future improvements.

We present application demos to illustrate this framework using a set of plant biology articles. We detail the design and implementation of the system, including entity detection, the extraction pipeline, and the web interface; we also present a use case demonstration. We would like to engage publishers and biology data curators in discussion and feedback.

There are three major steps in processing the documents: text extraction, entity candidate extraction, and candidate assessment. The input for the text extraction process is a structured document tagged with the Journal Article Tag Suite (JATS) [1]. The input document is processed into two data structures, one for textual data and one for structural data. This dual data structure allows efficient text processing of the publication content while still making it easy to retrieve the meta-structure around a particular set of words during the subsequent processing steps. To identify informational vocabulary candidates, our application implements four sets of extraction rules: regular expression rules, a word dictionary, publishing conventions, and ontology rules. The ontology rules use five biological ontologies: the Gene Ontology [2], the Plant Ontology [3], the Plant Trait Ontology [4], the Plant Environment Condition Ontology [5], and Chemical Entities of Biological Interest (ChEBI) [6]. The results of document processing are stored in a MySQL database, which serves as the data store for the web application.

The web front end in our prototype is implemented using Django (v 1.8). Based on Python, the web front end is easily programmable, extensible, and pluggable with multiple popular databases. It forms the presentation layer of the system, relying on the back end code to run the entity extraction algorithms on the manuscript and to transfer the results in JSON format.

The system can benefit the entire life cycle of a digital publication, from initial manuscript submission to publishing the article and presenting information to readers. At the initial manuscript submission stage, the manuscript can be processed to extract known key informational vocabulary, such as biological entities, as well as to identify potential new technical words. That information may be used by editors to identify appropriate reviewers for the manuscript. After the article has been accepted for publication, additional information about the key informational words, such as links to external repositories or reference sites, may also be embedded during the pre-publication production process to enrich the information content and accessibility. Publication curators may also leverage the information for curation. New information defined and verified by experts may also be injected into other information resources, such as Planteome [7].

II. APPLICATION FEATURES OVERVIEW

We use an example to illustrate the views, layouts, and functions available in our web interface. Our prototype includes 609 manuscripts from the journal Plant Physiology.

Figure 1. Collection paginated view

The publication list view (Figure 1) is a paginated list of all the articles, each shown with an external DOI reference and the article title.

Figure 2. Interface for exploring entities in a publication

Figure 2 shows an example of exploring the entities extracted from a full article (i.e., 1002.xml in Figure 1). At the top of the page, the title and abstract of the article [8] are presented to give the user some context for the manuscript. The list of entities found in the article is organized in a table. Each row includes information such as the name and type of an entity and its number of occurrences in the article. The XRef column presents possible matches to existing ontology terms. If available, links are also presented to other online databases with more information about the entity. For example, “leaf senescence” is matched to a term in the trait ontology, with a link to the corresponding entry in the Planteome ontology database. In another example, “MES” is matched to a term in ChEBI, but a link to the external database is missing at this time. The “species” column shows a prediction of which species the entity is likely associated with, based on the proximity of the term to species names appearing in the article and/or on the species indicated by the ontologies. The “Figure caption” column shows whether the entity has been used within a figure caption in the article.

It is important to note that some entities can be matched to multiple ontology terms. Such cases are currently resolved based on the general priorities we assigned to each extraction rule during the extraction phase. However, for a particular article, the result may not always be the most appropriate one.

The entity extraction phase can also generate phrases that, while matching an ontology term alphabetically, are not used for the purpose implied by the term. Those situations require expert knowledge and input.

Therefore, each row also includes user control buttons for editing the record. Figure 3 shows an example of the entity record editing interface. In this view, the record has editable fields where a user may correct values or enter new ones. A dynamic search box can be used to search for and add new species to the species menu if the appropriate species was not detected or inferred from the article. This search box uses an online service from NCBI to provide a comprehensive list of options as the user types. Sentences in which the entity occurs are extracted from the manuscript, with the entity name highlighted in yellow. These sentences give the user nearly complete context for the entity as it is used in the manuscript text.

Figure 3. Interface for showing/editing entity details

The prototype is still under development, and we welcome feedback from domain researchers and publishing professionals for future developments and improvements.

ACKNOWLEDGMENT

DIVE is partially supported by CyVerse (NSF awards DBI-0735191 and DBI-1265383) and Gramene, a Comparative Plant Genomics Database (NSF award IOS-1127112).

REFERENCES

[1] National Center for Biotechnology Information. Journal Article Tag Suite. http://jats.nlm.nih.gov/, 2013.
[2] Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P. (2000). Gene Ontology: tool for the unification of biology. Nature Genetics, 25(1), 25-29.
[3] Jaiswal, P., Avraham, S., Ilic, K., Kellogg, E.A., McCouch, S., Pujar, A., Zapata, F. (2005). Plant Ontology (PO): a Controlled Vocabulary of Plant Structures and Growth Stages. Comparative and Functional Genomics, 6(7-8), 388–397. http://doi.org/10.1002/cfg.496
[4] Arnaud, E., Cooper, L., Shrestha, R., Menda, N., Nelson, R.T., Matteis, L., Skofic, M. (2012). Towards a Reference Plant Trait Ontology for Modeling Knowledge of Plant Traits and Phenotypes. In KEOD, pp. 220-5.
[5] Plant Environment Condition Ontology, http://bioportal.bioontology.org/ontologies/PECO#
[6] Degtyarenko, K., de Matos, P., Ennis, M., Hastings, J., Zbinden, M., McNaught, A., Ashburner, M. (2008). ChEBI: a database and ontology for chemical entities of biological interest. Nucleic Acids Research, 36(Database issue), D344–D350.
[7] Cooper, L. and Jaiswal, P. (2016). The Plant Ontology: A Tool for Plant Genomics. Plant Bioinformatics: Methods and Protocols, 89-114.
[8] Hou, K., Wu, W., & Gan, S.S. (2013). SAUR36, a small auxin up RNA gene, is involved in the promotion of leaf senescence in Arabidopsis. Plant Physiology, 161(2), 1002-1009.
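
As a supplementary illustration of the rule-priority resolution described in Section II, the following Python sketch shows how candidates produced by several extraction rules might be reduced to one label per entity. It is only a minimal sketch under stated assumptions, not the DIVE implementation: the priority ordering, the toy term tables, the gene-name pattern, and the names `find_candidates`, `resolve`, and `count_occurrences` are all hypothetical and chosen for the example.

```python
import re

# Hypothetical rule priorities (lower number wins). The actual ordering
# used by DIVE is not given in the text.
RULE_PRIORITY = {"ontology": 0, "dictionary": 1, "regex": 2}

# Toy stand-ins for the real resources (ontology term lists, keyword dictionary).
ONTOLOGY_TERMS = {"leaf senescence": "trait ontology", "auxin": "ChEBI"}
KEYWORD_DICTIONARY = {"auxin": "hormone"}
# Assumed pattern for gene-like symbols such as SAUR36.
GENE_PATTERN = re.compile(r"\b[A-Z]{2,}[0-9]+\b")


def find_candidates(text):
    """Yield (entity, rule, label) for every rule that matches the text."""
    lowered = text.lower()
    for term, source in ONTOLOGY_TERMS.items():
        if term in lowered:
            yield term, "ontology", source
    for word, label in KEYWORD_DICTIONARY.items():
        if word in lowered:
            yield word, "dictionary", label
    for match in GENE_PATTERN.finditer(text):
        yield match.group(0), "regex", "gene"


def resolve(candidates):
    """Keep one (rule, label) per entity, chosen by rule priority."""
    best = {}
    for entity, rule, label in candidates:
        current = best.get(entity)
        if current is None or RULE_PRIORITY[rule] < RULE_PRIORITY[current[0]]:
            best[entity] = (rule, label)
    return best


def count_occurrences(text, entity):
    """Count case-insensitive occurrences of an entity name in the text."""
    return len(re.findall(re.escape(entity), text, flags=re.IGNORECASE))


sentence = ("SAUR36 promotes leaf senescence in Arabidopsis; "
            "SAUR36 expression responds to auxin.")
entities = resolve(find_candidates(sentence))
for name, (rule, label) in entities.items():
    print(name, rule, label, count_occurrences(sentence, name))
```

Here "auxin" is proposed by both the ontology rule and the dictionary rule, and the assumed priorities keep the ontology label. In a curation workflow like the one described above, a correction entered by an expert would override whatever label such a fixed ordering selects for a given article.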