Proceedings of the 9th International Conference on Biological Ontology (ICBO 2018), Corvallis, Oregon, USA
ICBO 2018, August 7-10, 2018

Computational Classification of Phenologs Across Biological Diversity

Ian R Braun, Carolyn J Lawrence-Dill
Genetics, Development, and Cell Biology
Iowa State University
Ames, IA, United States
irbraun@iastate.edu

Phenotypic diversity analyses are the basis for research discoveries ranging from basic biology to applied research. Phenotypic analyses often benefit from the availability of large quantities of high-quality data in a standardized format. Image and spectral analyses have been shown to enable high-throughput, computational classification of a variety of phenotypes and traits. However, equivalent phenotypes expressed across individuals or groups that are not anatomically similar can pose a problem for such classification methods. In these cases, high-throughput, computational classification is still possible if the phenotypes are documented using standardized, language-based descriptions. Conversion of language-based phenotypes to computer-readable “EQ” statements enables such large-scale analyses. EQ statements are composed of entities (e.g., leaf) and qualities (e.g., increased length) drawn from terms in ontologies. In this work, we present a method for automatically converting free-text descriptions of plant phenotypes to EQ statements using a machine learning approach. Random forest classifiers identify potential matches between phenotype descriptions and terms from a set of ontologies including GO (Gene Ontology), PO (Plant Ontology), and PATO (Phenotype and Trait Ontology), among others. These candidate ontology terms are combined into candidate EQ statements, which are probabilistically evaluated with respect to a natural language parse of the phenotype description. Models and parameters in this method are trained using a dataset of plant phenotypes and curator-converted EQ statements from the Plant PhenomeNET project (Oellrich, Walls et al., 2015). Preliminary results comparing predicted and curated EQ statements are presented. Potential use across datasets to enable automated phenolog discovery is discussed.

Keywords: phenologs; phenotypes; text mining; ontologies

I. INTRODUCTION

Identifying phenologs (comparable phenotypes with hypothesized shared genetic origin) within and between species enables candidate gene prediction for phenotypes of interest in agriculture and medicine alike [1,2,3]. For systems or species which are not anatomically similar, the use of image-based phenotype data makes phenolog identification difficult. In these cases, however, semantic analysis of text-based representations of the phenotypes can provide enough information to identify phenologs and generate hypotheses about the underlying biology of interest [4].

Plant PhenomeNET is a phenotype similarity network composed of phenotypes from six different model plant species that demonstrates the utility of this approach [3]. In the construction of Plant PhenomeNET, curators converted text-based representations of the phenotypes into sets of EQ statements, composed of entities (e.g., leaf) and qualities (e.g., increased length), both represented by ontology terms. The similarity for each pair of phenotypes was then calculated based on the overlap in the sets of ontology terms present in each phenotype’s EQ statements. The goal of the work presented here is to automate the process of converting text-based phenotypes to EQ statements using machine learning and natural language processing techniques, so that such phenotype similarity networks can be generated and expanded more easily.

II. METHODS

A. Plant PhenomeNET Dataset

The Plant PhenomeNET dataset of phenotype descriptions, corresponding atomized statements, and corresponding curator-generated EQ statements is used as the source of both training and testing data in this work. The atomized statements in this dataset are used as input to the described methods, with the aim of automatically generating logical EQ statements which are similar to those generated by the curators.

B. Mapping Text to Candidate Terms

The purpose of the first method employed is to map each input atomized statement to a subset of the available ontology terms, containing only those terms that match the text (that is, terms that may be used to describe a portion of the text). To do this, random forest machine learning models specific to each ontology are trained to classify pairs of text and ontology terms as either matching or not, and are then used to produce probabilities with which the ontology terms may be ranked for a given atomized statement. Features used to represent pairs of text and ontology terms take into account semantic similarity, syntactic similarity, and contextual similarity with respect to the ontology structure. The top-ranking ontology terms are taken as candidate terms.

C. Composing Candidate EQ Statements

For each atomized statement, the candidate ontology terms are used to construct a set of all possible candidate EQ statements. This is done by combining the terms from appropriate ontologies into appropriate roles within the EQ statement structure. Some rules specific to the ontologies used are enforced. For example, the inclusion of a relational PATO term as the quality necessitates a secondary entity term.

D. Evaluating Candidate EQ Statements

This process evaluates each candidate EQ statement that was composed in the previous step. The atomized statement that was used to generate the candidate EQ statements is processed with the Stanford CoreNLP pipeline, specifically to produce a dependency graph of the text.
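As a rough illustration of the candidates being evaluated in this step, the composition described in Section II.C might be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `Term`, `CandidateEQ`, `compose_candidates`, and `ENTITY_ONTOLOGIES` are hypothetical, the assignment of PO and GO terms to the entity role is assumed for the example, and the only rule enforced is the one given in the text (a relational PATO quality requires a secondary entity).

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Optional


@dataclass(frozen=True)
class Term:
    term_id: str              # placeholder identifier, not a real ontology ID
    ontology: str             # e.g. "PO", "GO", "PATO"
    relational: bool = False  # only meaningful for PATO quality terms


@dataclass(frozen=True)
class CandidateEQ:
    entity: Term
    quality: Term
    secondary_entity: Optional[Term] = None


# Assumption for this sketch: which ontologies may fill which role.
ENTITY_ONTOLOGIES = {"PO", "GO"}
QUALITY_ONTOLOGIES = {"PATO"}


def compose_candidates(terms: List[Term]) -> List[CandidateEQ]:
    """Enumerate every entity/quality combination allowed by the role rules."""
    entities = [t for t in terms if t.ontology in ENTITY_ONTOLOGIES]
    qualities = [t for t in terms if t.ontology in QUALITY_ONTOLOGIES]
    candidates = []
    for entity, quality in product(entities, qualities):
        if quality.relational:
            # A relational quality needs a secondary entity distinct from
            # the primary one.
            for secondary in entities:
                if secondary != entity:
                    candidates.append(CandidateEQ(entity, quality, secondary))
        else:
            candidates.append(CandidateEQ(entity, quality))
    return candidates
```

With two candidate entity terms, one ordinary quality, and one relational quality, this enumeration yields four candidate EQ statements: two simple ones and two that carry a secondary entity.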
Each candidate ontology term identified in the previous step is assigned to the node in the dependency graph that is most similar to that ontology term (as measured by similarity metrics of high importance in the random forest models). With each candidate ontology term assigned to a node in the dependency graph, a given EQ statement can be represented by the shortest path in the graph from the Entity term to the Quality term. Distributions of the length of these paths and edge types along the paths are generated from the training data. The structural probability of a candidate EQ statement is defined as the frequency with which its E-to-Q path appears in the training data. The overall quality score q for an EQ statement is a weighted average of this structural probability and the average probability of the terms, as output by the random forest models.

III. RESULTS AND DISCUSSION

Random forest classifiers specific to each ontology were evaluated using standard precision and recall curves (Figure 1). For the purposes of this evaluation, a predicted term is considered correct if its predicted probability exceeds the threshold value and the term is present in the curated EQ statement for that atomized statement. In addition to binary precision and recall, hierarchical similarity metrics are used to evaluate the average similarity between predicted and curated terms with respect to the structure of the ontology (Figure 1).

Figure 1. Precision-recall curve for predicted PATO terms of holdout atomized statements from Plant PhenomeNET. Average hierarchical precision and recall are shown between all positive predictions and the closest correct PATO terms.

For each predicted EQ statement, its similarity to the corresponding curated EQ statement was measured (Figure 2).

Figure 2. Histogram of similarities (weighted Jaccard) between predicted and curated EQ statements for holdout atomized statements from Plant PhenomeNET. Shaded predictions have quality scores exceeding the learned quality threshold value.

This preliminary work demonstrates the utility of using machine learning and natural language processing techniques for automating or assisting the work of translating text-based phenotypes into EQ statements. Our current and ongoing work is focused on 1) adapting the methods to handle more complex phenotypes which map to multiple EQ statements, and 2) using and adapting existing tools to extract phenotype descriptions from the literature in order to build an expanded dataset of text descriptions.

REFERENCES

[1] McGary KL, Park TJ, Woods JO, Cha HJ, Wallingford JB, Marcotte EM. Systematic discovery of nonobvious human disease models through orthologous phenotypes. Proc Natl Acad Sci USA. 2010 Apr 6;107(14):6544-9. doi:10.1073/pnas.0910200107.

[2] Hoehndorf R, Schofield PN, Gkoutos GV. PhenomeNET: a whole-phenome approach to disease gene discovery. Nucleic Acids Res. 2011 Oct;39(18):e119. doi:10.1093/nar/gkr538.

[3] Oellrich A, Walls RL, Cannon EK, Cannon SB, Cooper L, Gardiner J, Gkoutos GV, Harper L, He M, Hoehndorf R, Jaiswal P, Kalberer SR, Lloyd JP, Meinke D, Menda N, Moore L, Nelson RT, Pujar A, Lawrence CJ, Huala E. An ontology approach to comparative phenomics in plants. Plant Methods. 2015 Feb 25;11:10. doi:10.1186/s13007-015-0053-y.

[4] Braun I, Balhoff J, Berardini TZ, Cooper L, Gkoutos G, Harper L, Huala E, Jaiswal P, Kazic T, Lapp H, Macklin JA, Specht CD, Vision T, Walls RL, Lawrence-Dill CJ. 'Computable' phenotypes enable comparative and predictive phenomics among plant species across domains of life. In: Thessen, AE (Ed.) Application of Semantic Technologies in Biodiversity Science. Studies on the Semantic Web, IOS Press/AKA Verlag. To appear.
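The scoring of candidate EQ statements described in Section II.D can be illustrated with a small, self-contained sketch. It is an illustration under stated assumptions rather than the authors' implementation: the dependency graph is given as toy `(head, dependent, label)` triples instead of real CoreNLP output, an E-to-Q path is represented as the tuple of edge labels along it, the structural probability is taken as that tuple's relative frequency among training paths, and the function names and the `weight` parameter are hypothetical.

```python
from collections import deque


def shortest_path(edges, start, goal):
    """Breadth-first search over an undirected view of the dependency graph.

    edges: iterable of (head, dependent, label) triples.
    Returns the tuple of edge labels on a shortest path, or None if the
    two nodes are not connected.
    """
    adjacency = {}
    for head, dependent, label in edges:
        adjacency.setdefault(head, []).append((dependent, label))
        adjacency.setdefault(dependent, []).append((head, label))
    queue = deque([(start, ())])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for neighbor, label in adjacency.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, path + (label,)))
    return None


def structural_probability(path, training_path_counts):
    """Relative frequency of this E-to-Q path among the training paths."""
    total = sum(training_path_counts.values())
    if path is None or total == 0:
        return 0.0
    return training_path_counts.get(path, 0) / total


def quality_score(struct_prob, term_probs, weight=0.5):
    """Weighted average of the structural and mean term probabilities.

    weight is a hypothetical parameter; the text does not state its value.
    """
    mean_term_prob = sum(term_probs) / len(term_probs)
    return weight * struct_prob + (1 - weight) * mean_term_prob
```

For a toy statement such as "long leaves", with the entity assigned to the node "leaves" and the quality to the node "long", the path would be the single edge joining them, and the score would combine that path's training frequency with the random forest probabilities of the two terms.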