Enhancing Information Accessibility of Scientific Publications with Text Mining and Ontology

Weijia Xu, Amit Gupta
Texas Advanced Computing Center
University of Texas at Austin
Austin, Texas, USA
{xwj, agupta}@tacc.utexas.edu

Pankaj Jaiswal
Department of Botany and Plant Pathology
Oregon State University
Oregon, Portland, USA
jaiswalp@oregonstate.edu

Crispin Taylor, Patti Lockhart
American Society of Plant Biologists
Rockville, Maryland, USA
{ctaylor, plockhart}@aspb.org

Abstract— We present an ongoing effort to use text mining methods and existing biological ontologies to help readers access the information contained in scientific articles. Our approach combines multiple strategies for biological entity detection with association analysis on the extracted entities. The entity extraction process uses regular expression rules, ontologies, and a keyword dictionary to obtain a comprehensive list of biological entities. In addition to extracting the list of entities, we apply natural language processing and association analysis techniques to generate inferences among entities and compare them to known relations documented in the existing ontologies.

Keywords—Information systems applications; Ontology; Text Mining; Association Analysis

I. INTRODUCTION

Due to its technical depth and rich informational content, a journal article often requires that readers, domain experts, and curators invest significant amounts of time and effort to fully comprehend and make intelligent use of its content. This can be especially true in emerging areas, where novel ideas and new terminologies may be presented without precedent. As new technologies accelerate scientific discovery and more content becomes available online, the number of new articles that must be read and understood continues to rise. There are over 22 million references indexed by MEDLINE. Therefore, there is a pressing need to develop computational methods and tools that can facilitate readers' understanding of the content of a publication.

To address this challenge, we present software developments from an ongoing project, DIVE, which features automatic extraction of informational vocabulary together with web-based access and curation tools. The framework implements several strategies for entity extraction, including regular expression rules, ontologies, and a keyword dictionary. The extracted biological entities are then stored in a database and made accessible through an interactive web application for curation and evaluation by authors and other domain experts. Additional text mining and association analysis can be run on the extracted entities to help readers understand the paper. The system can benefit the entire life cycle of the digital publication, from initial manuscript submission to publishing the article and presenting information to readers. New information defined and verified by experts may also be injected into other information resources.

II. METHODS AND IMPLEMENTATIONS

A. Entity Detection Workflow

Figure 1. Processing workflow overview

Figure 1 shows an overview of our processing workflow. There are three major steps in processing the documents: text extraction, entity candidate extraction, and candidate assessment.

1) Text Extraction

The text extraction step processes the input structured document, which is tagged with JATS [1]. During this step, the input document is processed into two data structures, one for textual data and one for structural data. The textual data consists of a list of strings representing the body text of the journal article. The structural data consists of the metadata present in the input document, such as section marks and special formatting marks. A mapping is maintained between the textual values and the metadata by their global positions in the original document. This dual data structure allows for efficient text processing of the publication content while still making it easy to retrieve the meta structure around a particular set of words during the subsequent processing steps.

2) Entity Candidate Detection

We implemented a rule-based approach for processing the text and structure in order to identify informational vocabulary candidates. The detection rules can be defined based on various heuristics and requirements, such as publishing requirements, naming conventions, and domain ontologies. New rules can be added on demand over time. Currently, four types of rules are implemented in DIVE: regular expression rules, word dictionary rules, publishing convention rules, and ontology rules.

The regular expression rules use common naming conventions to identify biological entities, such as gene names, protein names, molecule structures, and chemical compounds. Each rule is defined as a regular expression and is used for matching candidate words.

The word dictionary rule consists of a pre-defined list of words that should be included in or excluded from the candidate list. The publication content is searched against the list at run time.

The publishing convention rules are used to identify words that are in a special format, such as italic, or in a particular component of the publication, such as a figure legend. The enclosing tags of the candidates are used to define each rule. Additional rules can be added by specifying additional tag values or by using naming conventions to detect entities such as species names.

The ontology rules use five biological ontologies: the Gene Ontology [2], the Plant Ontology [3], the Plant Trait Ontology [4], the Plant Environment Condition Ontology [5], and Chemical Entities of Biological Interest (ChEBI) [6].

3) Entity Candidate Assessment

By applying the extraction rules listed above, a set of entity candidates can be detected from the input document. Some candidates might be detected by multiple rules. Different detection rules also have different accuracy. Ontology-based and dictionary-based rules have the highest certainty. Candidates identified only by other rules need further validation. We have currently implemented two automatic validation mechanisms. One is based on previously validated results; the other is based on co-location with other confirmed entities. However, the primary method of validation is domain expert evaluation through the web interface.

B. Association Analysis

Association analysis can be used to generate inferences between values from two or more fields of the data under a given condition, using the FP-Growth algorithm [8]. The analysis starts by selecting and aggregating the subset of data specified by the input parameters as a list of records, also known as transactions. The algorithm scans the selected data set to compute the frequency of each value, also referred to as an item, and stores the frequency and the co-occurrence with other items, collectively referred to as an itemset, in a tree structure named the frequent pattern tree (FP-tree). The frequent itemsets can then be identified from the FP-tree to generate inferences among subsets of values.

III. PRELIMINARY RESULTS

Figure 2 shows the top 20 inference rules based on all ontology terms extracted from the collection. Each label indicates a frequent itemset found in the collection. A directional arrow indicates an inference on co-occurrence between two itemsets. The shade of the arrow indicates the confidence level of the rule. Such visual representations of inferred associations between diverse entity types could substantially aid a researcher in forming insights. They also have the potential to serve as a similarity metric between articles, which could help editors gauge the novelty of a new article submission.

Figure 2. Top 20 inference rules from association analysis.

We are continuing to evaluate the performance of the entity extraction over a large data set and to improve its accuracy. We are gathering feedback from domain researchers and publishing professionals for further evaluation of entity candidates. We are also working on comparing the inferences from association analysis with known relationships documented in the existing ontologies.

ACKNOWLEDGMENT

DIVE is partially supported by CyVerse (NSF awards DBI-0735191 and DBI-1265383) and by Gramene, a Comparative Plant Genomics Database (NSF award IOS-1127112).

REFERENCES

[1] National Center for Biotechnology Information. Journal Article Tag Suite. http://jats.nlm.nih.gov/, 2013.
[2] Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P. "Gene Ontology: tool for the unification of biology." Nature Genetics 25, no. 1 (2000): 25-29.
[3] Jaiswal, P., Avraham, S., Ilic, K., Kellogg, E.A., McCouch, S., Pujar, A., Zapata, F. (2005). Plant Ontology (PO): a Controlled Vocabulary of Plant Structures and Growth Stages. Comparative and Functional Genomics, 6(7-8), 388–397. http://doi.org/10.1002/cfg.496
[4] Arnaud, E., Cooper, L., Shrestha, R., Menda, N., Nelson, R.T., Matteis, L., Skofic, M. (2012). Towards a Reference Plant Trait Ontology for Modeling Knowledge of Plant Traits and Phenotypes. In KEOD, pp. 220-225.
[5] Plant Environment Condition Ontology, http://bioportal.bioontology.org/ontologies/PECO#
[6] Degtyarenko, K., de Matos, P., Ennis, M., Hastings, J., Zbinden, M., McNaught, A., Ashburner, M. (2008). ChEBI: a database and ontology for chemical entities of biological interest. Nucleic Acids Research, 36 (Database issue), D344–D350.
[7] Cooper, L. and Jaiswal, P. (2016). The Plant Ontology: A Tool for Plant Genomics. Plant Bioinformatics: Methods and Protocols, 89-114.
[8] Han, J., Pei, J., and Yin, Y. "Mining frequent patterns without candidate generation," in ACM SIGMOD Record, 2000, vol. 29, no. 2, pp. 1–12.
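As an illustration of the rule-based candidate detection described in Section II.A, the following minimal Python sketch combines one regular expression rule with a word dictionary rule and records which rules matched each candidate. The pattern, the word lists, the function name `detect_candidates`, and the sample sentence are all assumptions made for this example; the paper does not give DIVE's actual rules.

```python
import re
from collections import defaultdict

# Hypothetical rules for illustration only; not DIVE's actual rule set.
REGEX_RULES = {
    # Assumed naming convention: 2-5 uppercase letters followed by digits.
    "gene_like": re.compile(r"\b[A-Z]{2,5}[0-9]+\b"),
}
INCLUDE_WORDS = {"auxin", "chlorophyll"}  # dictionary rule: words to include
EXCLUDE_WORDS = {"FIG2"}                  # dictionary rule: words to exclude


def detect_candidates(text):
    """Return a dict mapping each candidate to the set of rules that matched it."""
    hits = defaultdict(set)
    # Regular expression rules: every match becomes a candidate.
    for rule_name, pattern in REGEX_RULES.items():
        for match in pattern.finditer(text):
            hits[match.group(0)].add(rule_name)
    # Dictionary inclusion: search the text against the word list.
    for word in re.findall(r"[A-Za-z0-9]+", text):
        if word.lower() in INCLUDE_WORDS:
            hits[word.lower()].add("dictionary")
    # Dictionary exclusion: drop candidates that are on the exclude list.
    for word in EXCLUDE_WORDS:
        hits.pop(word, None)
    return dict(hits)


sample = "Expression of FT1 and PIN12 increased after auxin treatment; see FIG2."
candidates = detect_candidates(sample)
for name, matched_rules in sorted(candidates.items()):
    print(name, sorted(matched_rules))
```

Keeping the set of matching rules per candidate reflects the assessment step, where candidates found by several rules, or by dictionary or ontology rules, can be treated as more certain than those found by a single pattern.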
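The notions of transaction, itemset, and rule confidence used in Section II.B can be made concrete with a small Python sketch. It is not FP-Growth: it counts itemsets by brute-force enumeration rather than building an FP-tree, which finds the same frequent itemsets on small inputs but does not scale the way FP-Growth does. The transactions, the support threshold, and the function names `frequent_itemsets` and `pair_rules` are hypothetical and are not taken from the DIVE data.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: support} for itemsets of up to max_size items.

    Brute-force enumeration, used here only to illustrate the concepts;
    FP-Growth obtains the same sets from an FP-tree.
    """
    counts = Counter()
    for items in transactions:
        unique = sorted(set(items))
        for size in range(1, max_size + 1):
            for combo in combinations(unique, size):
                counts[frozenset(combo)] += 1
    total = len(transactions)
    return {s: c / total for s, c in counts.items() if c / total >= min_support}


def pair_rules(supports):
    """Return (antecedent, consequent, confidence) for each frequent pair."""
    found = []
    for itemset, support in supports.items():
        if len(itemset) != 2:
            continue
        for antecedent in itemset:
            (consequent,) = itemset - {antecedent}
            # confidence(A -> B) = support(A and B) / support(A)
            confidence = support / supports[frozenset([antecedent])]
            found.append((antecedent, consequent, confidence))
    return sorted(found, key=lambda r: (-r[2], r[0], r[1]))


# Hypothetical transactions: the ontology terms detected in four articles.
transactions = [
    {"leaf", "photosynthesis", "chlorophyll"},
    {"leaf", "photosynthesis"},
    {"root", "auxin"},
    {"leaf", "chlorophyll"},
]
supports = frequent_itemsets(transactions, min_support=0.5)
inferred = pair_rules(supports)
for antecedent, consequent, confidence in inferred:
    print(f"{antecedent} -> {consequent} (confidence {confidence:.2f})")
```

In this toy input, every article mentioning "chlorophyll" also mentions "leaf", so that rule has confidence 1.0, while the reverse rule is weaker. The confidence value is the quantity that the arrow shading in Figure 2 is described as encoding.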