=Paper=
{{Paper
|id=Vol-2285/ICBO_2018_paper_58
|storemode=property
|title=Computational Classification of Phenologs Across Biological Diversity
|pdfUrl=https://ceur-ws.org/Vol-2285/ICBO_2018_paper_58.pdf
|volume=Vol-2285
|authors= Ian Braun,Carolyn Lawrence-Dill
|dblpUrl=https://dblp.org/rec/conf/icbo/BraunL18
}}
==Computational Classification of Phenologs Across Biological Diversity==
<pdf width="1500px">https://ceur-ws.org/Vol-2285/ICBO_2018_paper_58.pdf</pdf>
<pre>
       Proceedings of the 9th International Conference on Biological Ontology (ICBO 2018), Corvallis, Oregon, USA                       1


    Computational Classification of Phenologs Across
                 Biological Diversity

                                                   Ian R Braun, Carolyn J Lawrence-Dill
                                                   Genetics, Development, and Cell Biology
                                                             Iowa State University
                                                           Ames, IA, United States
                                                              irbraun@iastate.edu


    Phenotypic diversity analyses are the basis for research discoveries ranging from basic
biology to applied research. Phenotypic analyses often benefit from the availability of large
quantities of high-quality data in a standardized format. Image and spectral analyses have been
shown to enable high-throughput, computational classification of a variety of phenotypes and
traits. However, equivalent phenotypes expressed across individuals or groups that are not
anatomically similar can pose a problem for such classification methods. In these cases,
high-throughput, computational classification is still possible if the phenotypes are documented
using standardized, language-based descriptions. Conversion of language-based phenotypes to
computer-readable “EQ” statements enables such large-scale analyses. EQ statements are composed
of entities (e.g., leaf) and qualities (e.g., increased length) drawn from terms in ontologies.
In this work, we present a method for automatically converting free-text descriptions of plant
phenotypes to EQ statements using a machine learning approach. Random forest classifiers
identify potential matches between phenotype descriptions and terms from a set of ontologies
including GO (Gene Ontology), PO (Plant Ontology), and PATO (Phenotype and Trait Ontology),
among others. These candidate ontology terms are combined into candidate EQ statements, which
are probabilistically evaluated with respect to a natural language parse of the phenotype
description. Models and parameters in this method are trained using a dataset of plant
phenotypes and curator-converted EQ statements from the Plant PhenomeNET project (Oellrich,
Walls et al., 2015). Preliminary results comparing predicted and curated EQ statements are
presented. Potential use across datasets to enable automated phenolog discovery is discussed.

    Keywords: phenologs; phenotypes; text mining; ontologies

                        I. INTRODUCTION

    Identifying phenologs (comparable phenotypes with hypothesized shared genetic origin) within
and between species enables candidate gene prediction for phenotypes of interest in agriculture
and medicine alike [1,2,3]. For systems or species which are not anatomically similar, the use
of image-based phenotype data makes phenolog identification difficult. In these cases, however,
semantic analysis of text-based representations of the phenotypes can provide enough information
to identify phenologs and generate hypotheses about the underlying biology of interest [4].

    Plant PhenomeNET is a phenotype similarity network composed of phenotypes from six different
model plant species that demonstrates the utility of this approach [3]. In the construction of
Plant PhenomeNET, curators converted text-based representations of the phenotypes into sets of
EQ statements, composed of entities (e.g., leaf) and qualities (e.g., increased length), both
represented by ontology terms. The similarity for each pair of phenotypes was then calculated
based on the overlap in the sets of ontology terms present in each phenotype’s EQ statements.
The goal of the work presented here is to automate the process of converting text-based
phenotypes to EQ statements using machine learning and natural language processing techniques,
so that such phenotype similarity networks can be generated and expanded more easily.

                        II. METHODS

A. Plant PhenomeNET Dataset

    The Plant PhenomeNET dataset of phenotype descriptions, corresponding atomized statements,
and corresponding curator-generated EQ statements is used as the source of both training and
testing data in this work. The atomized statements in this dataset are used as input to the
described methods, with the aim of automatically generating logical EQ statements which are
similar to those generated by the curators.

B. Mapping Text to Candidate Terms

    The purpose of the first method employed is to map each input atomized statement to a subset
of the available ontology terms, which contains only those terms that match the text (that is,
terms that may be used to describe a portion of the text). To do this, random forest machine
learning models specific to each ontology are trained to classify pairs of text and ontology
terms as either matching or not, and are then used to produce probabilities with which the
ontology terms may be ranked for a given atomized statement. Features used to represent pairs of
text and ontology terms take into account semantic similarity, syntactic similarity, and
contextual similarity with respect to the ontology structure. The top-ranking ontology terms are
taken as candidate terms.
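
    The term-mapping step of Section II.B (build features for each pair of text and ontology
term, score the pair, rank the terms, keep the top few) might be sketched as follows. This is
only an illustration under stated assumptions: the two lexical features, the EX: term
identifiers, and toy_probability (a stand-in for the match probability that a trained random
forest would output) are all hypothetical and are not taken from the paper.

```python
from typing import Callable, Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase whitespace tokens with surrounding punctuation removed."""
    tokens = (t.strip(".,;()").lower() for t in text.split())
    return [t for t in tokens if t]


def pair_features(statement: str, term_label: str) -> List[float]:
    """Two toy lexical features for one (statement, ontology term) pair."""
    s, t = set(tokenize(statement)), set(tokenize(term_label))
    overlap = len(s & t)
    jaccard = overlap / len(s | t) if (s | t) else 0.0
    coverage = overlap / len(t) if t else 0.0  # share of term words found in the text
    return [jaccard, coverage]


def toy_probability(features: List[float]) -> float:
    """Hypothetical stand-in for a trained classifier's match probability."""
    return sum(features) / len(features)


def rank_terms(
    statement: str,
    terms: Dict[str, str],
    match_probability: Callable[[List[float]], float],
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Score every term against the statement and keep the top_k by probability."""
    scored = [
        (term_id, match_probability(pair_features(statement, label)))
        for term_id, label in terms.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical term identifiers and labels, for illustration only.
terms = {"EX:1": "leaf", "EX:2": "increased length", "EX:3": "root"}
candidates = rank_terms("leaf with increased length", terms, toy_probability)
print(candidates)
```

    In practice the scoring function would be replaced by one model per ontology, and the
features would include the semantic, syntactic, and ontology-context similarities described in
the text.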


       ICBO 2018                                                   August 7-10, 2018                                                    1


C. Composing Candidate EQ Statements
     For each atomized statement, the candidate ontology terms
are used to construct a set of all possible candidate EQ
statements. This is done by combining the terms from
appropriate ontologies into appropriate roles within the EQ
statement structure. Some rules specific to the ontologies used
are enforced. For example, the inclusion of a relational PATO
term as the quality necessitates a secondary entity term.
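
    The composition step described above can be illustrated with a short sketch. It assumes
that candidate terms are already split into entity and quality candidates and that the set of
relational qualities is known in advance; the EQ tuple, the plain-text labels, and the rule
that the secondary entity must differ from the primary one are illustrative assumptions, not
details given in the paper.

```python
from itertools import product
from typing import List, NamedTuple, Optional, Set


class EQ(NamedTuple):
    entity: str
    quality: str
    secondary_entity: Optional[str] = None


def compose_eq_statements(
    entities: List[str], qualities: List[str], relational: Set[str]
) -> List[EQ]:
    """Enumerate candidate EQ statements from candidate entity and quality terms."""
    statements: List[EQ] = []
    for entity, quality in product(entities, qualities):
        if quality in relational:
            # A relational quality requires a secondary entity (assumed distinct here).
            statements.extend(
                EQ(entity, quality, other) for other in entities if other != entity
            )
        else:
            statements.append(EQ(entity, quality))
    return statements


# Labels stand in for ontology terms; "attached to" is treated as relational for illustration.
eqs = compose_eq_statements(
    entities=["leaf", "stem"],
    qualities=["increased length", "attached to"],
    relational={"attached to"},
)
for eq in eqs:
    print(eq)
```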

D. Evaluating Candidate EQ Statements
    This process evaluates each candidate EQ statement that
was composed in the previous step. The atomized statement
that was used to generate the candidate EQ statements is
processed with the Stanford CoreNLP pipeline, specifically to
produce a dependency graph of the text.

Figure 1. Precision-recall curve for predicted PATO terms of holdout atomized statements from
Plant PhenomeNET. Average hierarchical precision and recall are shown between all positive
predictions and the closest correct PATO terms.

    Each candidate ontology term identified in the previous step is assigned to the node in the
dependency graph that is most similar to that ontology term (as measured by similarity metrics
of high importance in the random forest models). With each candidate ontology term assigned to a
node in the dependency graph, a given EQ statement can be represented by the shortest path in
the graph from the Entity term to the Quality term. Distributions of the length of these paths
and edge types along the paths are generated from the training data. The structural probability
of a candidate EQ statement is defined as the frequency with which its E-to-Q path appears in
the training data.
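
    A minimal sketch of the structural probability follows, under several assumptions: the
dependency graph is written by hand rather than produced by Stanford CoreNLP, it is traversed
as an undirected graph, a path is identified by its sequence of edge labels, and the training
path counts are invented for illustration.

```python
from collections import Counter, deque
from typing import Dict, List, Optional, Tuple

Graph = Dict[str, List[Tuple[str, str]]]  # node -> [(neighbor, edge label)]


def build_graph(edges: List[Tuple[str, str, str]]) -> Graph:
    """Undirected adjacency lists from (head, label, dependent) triples."""
    graph: Graph = {}
    for head, label, dependent in edges:
        graph.setdefault(head, []).append((dependent, label))
        graph.setdefault(dependent, []).append((head, label))
    return graph


def shortest_path_labels(graph: Graph, start: str, goal: str) -> Optional[Tuple[str, ...]]:
    """Breadth-first search; returns the edge labels along a shortest path, or None."""
    queue = deque([(start, ())])
    seen = {start}
    while queue:
        node, labels = queue.popleft()
        if node == goal:
            return labels
        for neighbor, label in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, labels + (label,)))
    return None


def structural_probability(path: Optional[Tuple[str, ...]], training_paths: Counter) -> float:
    """Relative frequency of this entity-to-quality path among the training paths."""
    total = sum(training_paths.values())
    if path is None or total == 0:
        return 0.0
    return training_paths[path] / total


# Hand-written dependency edges for "leaves have increased length" (illustrative only).
graph = build_graph([
    ("have", "nsubj", "leaves"),
    ("have", "obj", "length"),
    ("length", "amod", "increased"),
])
path = shortest_path_labels(graph, "leaves", "length")

# Hypothetical counts of entity-to-quality paths observed in training data.
training_paths = Counter({("nsubj", "obj"): 3, ("amod",): 1})
print(path, structural_probability(path, training_paths))
```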
    The overall quality score q for a candidate EQ statement is a weighted average of this
structural probability and the average probability of its terms, as output by the random forest
models.
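
    The quality score can be written out as a short function. The weight parameter and its
default of 0.5 are assumptions made for this sketch; the paper does not state the weighting,
only that the parameters of the method are trained.

```python
from statistics import mean
from typing import Sequence


def quality_score(
    structural_prob: float, term_probs: Sequence[float], weight: float = 0.5
) -> float:
    """Weighted average of the structural probability and the mean term probability."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * structural_prob + (1.0 - weight) * mean(term_probs)


# Illustrative inputs: a path frequency of 0.75 and two term probabilities.
q = quality_score(0.75, [0.9, 0.7], weight=0.5)
print(round(q, 3))
```

    A candidate could then be accepted when q exceeds a threshold, in the spirit of the learned
quality threshold mentioned in the caption of Figure 2.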
                III. RESULTS AND DISCUSSION

    Random forest classifiers specific to each ontology were evaluated using standard precision
and recall curves (Figure 1). For the purposes of this evaluation, a predicted term is
considered correct if its predicted probability exceeds the threshold value and the term is
present in the curated EQ statement for that atomized statement. In addition to binary precision
and recall, hierarchical similarity metrics are used to evaluate the average similarity between
predicted and curated terms with respect to the structure of the ontology (Figure 1).

    For each predicted EQ statement, its similarity to the corresponding curated EQ statement
was measured (Figure 2).

Figure 2. Histogram of similarities (weighted Jaccard) between predicted and curated EQ
statements for holdout atomized statements from Plant PhenomeNET. Shaded predictions have
quality scores exceeding the learned quality threshold value.

    This preliminary work demonstrates the utility of using machine learning and natural
language processing techniques for automating or assisting the work of translating text-based
phenotypes into EQ statements. Our current and ongoing work is focused on 1) adapting the
methods to handle more complex phenotypes which map to multiple EQ statements, and 2) using and
adapting existing tools to extract phenotype descriptions from the literature in order to build
an expanded dataset of text descriptions.

                        REFERENCES

[1]   McGary KL, Park TJ, Woods JO, Cha HJ, Wallingford JB, Marcotte EM. Systematic discovery of
      nonobvious human disease models through orthologous phenotypes. Proc Natl Acad Sci USA.
      2010 Apr 6;107(14):6544-9. doi:10.1073/pnas.0910200107.
[2]   Hoehndorf R, Schofield PN, Gkoutos GV. PhenomeNET: a whole-phenome approach to disease
      gene discovery. Nucleic Acids Res. 2011 Oct;39(18):e119. doi:10.1093/nar/gkr538.
[3]   Oellrich A, Walls RL, Cannon EK, Cannon SB, Cooper L, Gardiner J, Gkoutos GV, Harper L,
      He M, Hoehndorf R, Jaiswal P, Kalberer SR, Lloyd JP, Meinke D, Menda N, Moore L,
      Nelson RT, Pujar A, Lawrence CJ, Huala E. An ontology approach to comparative phenomics
      in plants. Plant Methods. 2015 Feb 25;11:10. doi:10.1186/s13007-015-0053-y.
[4]   Braun I, Balhoff J, Berardini TZ, Cooper L, Gkoutos G, Harper L, Huala E, Jaiswal P,
      Kazic T, Lapp H, Macklin JA, Specht CD, Vision T, Walls RL, Lawrence-Dill CJ. 'Computable'
      phenotypes enable comparative and predictive phenomics among plant species across domains
      of life. In: Thessen, AE (Ed.) Application of Semantic Technologies in Biodiversity
      Science. Studies on the Semantic Web, IOS Press/AKA Verlag. To appear.
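
    The two evaluation measures named in Section III and in the figure captions, hierarchical
precision and recall and weighted Jaccard similarity, might be computed as in the sketch below.
The ancestor-set definition of hierarchical precision and recall, the per-term weights, and the
toy ontology are assumptions made for illustration; the paper does not specify these details.

```python
from typing import Dict, Set, Tuple

Parents = Dict[str, Set[str]]  # term -> its direct parent terms


def with_ancestors(term: str, parents: Parents) -> Set[str]:
    """The term together with all of its ancestors in the ontology graph."""
    found: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents.get(current, ()))
    return found


def hierarchical_precision_recall(
    predicted: str, curated: str, parents: Parents
) -> Tuple[float, float]:
    """Overlap of ancestor sets, relative to the predicted and to the curated term."""
    p = with_ancestors(predicted, parents)
    c = with_ancestors(curated, parents)
    shared = len(p & c)
    return shared / len(p), shared / len(c)


def weighted_jaccard(a: Set[str], b: Set[str], weights: Dict[str, float]) -> float:
    """Summed weight of shared terms over summed weight of all terms (default weight 1)."""
    union = sum(weights.get(term, 1.0) for term in a | b)
    shared = sum(weights.get(term, 1.0) for term in a & b)
    return shared / union if union else 0.0


# Toy ontology fragment and weights, for illustration only.
parents = {
    "increased length": {"length"},
    "decreased length": {"length"},
    "length": {"quality"},
}
hp, hr = hierarchical_precision_recall("length", "increased length", parents)
similarity = weighted_jaccard(
    {"leaf", "increased length"}, {"leaf", "decreased length"}, {"leaf": 2.0}
)
print(hp, hr, similarity)
```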



</pre>