Enriching the Human Phenotype Ontology with inferred axioms from textual descriptions*

Shahad Kudama (1), Rafael Berlanga (1), Ernesto Jiménez-Ruiz (2)
(1) Jaume I University, Castellón, Spain
(2) University of Oslo, Oslo, Norway

Abstract. The Human Phenotype Ontology (HP) is a reference vocabulary of human phenotypic abnormalities. Besides the textual information attached to each ontology concept (general definitions, descriptions, synonyms, etc.), HP provides computer-readable logical definitions (axioms) of terms, which allow human phenotypic abnormalities to be related to entities from anatomy, pathology, biochemistry and other areas. In this paper we present a prototype that generates new axiomatic knowledge from the textual descriptions of each HP term. The prototype (i) detects terms that appear in the textual descriptions but not in the given logical expressions, (ii) generates pair combinations of those terms, (iii) builds triples after detecting the most probable relation between each pair of terms using a statistical model and, finally, (iv) suggests the most probable triples to the user, who can decide which ones to add to the original axioms.

1 Introduction

The large number of public knowledge resources available on the Web have been developed regardless of the processing and integration needs of modern information systems, and this fact is obstructing their massive use. We clearly need richer lexicons and axiomatic knowledge resources.

In this paper, we focus on the axiomatic knowledge resources and address the problem of how to exploit these resources to improve processing and integration tasks. The research is done in the technological context of the Semantic Web, because one of its main objectives is to generate semantic annotations from knowledge resources.
In particular, there is special interest in information processing in the phenotype and genotype field, where descriptions tend to be logical representations that allow inferencing over them. The main challenge is how to exploit the semantic properties of these resources in processing and analysis.

In this paper, we rely on the Human Phenotype Ontology (HP) [1] and the annotation facilities provided by BioPortal [2].

HP is an ontology, expressed in the Web Ontology Language (OWL, a family of knowledge representation languages for authoring ontologies) [3], that aims at providing a standardized vocabulary of phenotypic abnormalities related to human disease. Each term in the HP describes a phenotypic abnormality, such as 'atrial septal defect'. It currently contains approximately 11,000 terms and over 115,000 annotations to hereditary diseases. It also includes axioms for the terms, which are a formal way to describe taxonomies and classification networks, essentially defining the structure of knowledge for various domains: the nouns representing classes of objects and the verbs representing relations between the objects.

* This work was partially funded by the BIGMED project (IKT 259055) and the SIRIUS Centre for Scalable Data Access (Research Council of Norway, project no.: 237889).

We focus on the task of extracting axioms from textual descriptions of phenotypes. We summarize the main objectives of the work presented in this paper as follows:
– Analysis of the HP axioms to understand how relations between HP classes are expressed.
– Use of the descriptive textual annotations of the HP classes to detect terms that are not used in the related axioms.
– Design of a statistical model to infer relations between a given pair of ontology classes.
– Use of the statistical model to generate a list of triples (subject, relation, object) ranked by the probability of this relation holding between the two concepts.
– Selection of the most probable triples, which are proposed as new knowledge for a subsequent (manual) assessment that converts valid and relevant triples into suitable HP axioms.

The processing of free text and the discovery of implicit relations are two of the most important challenges in this work. On the one hand, free text brings ambiguity and vagueness. On the other hand, potential relations between classes will most probably not be explicitly expressed as verbs in the textual information, and thus the relation names will need to be inferred.

2 Related work

In this section, we briefly introduce some approaches relevant to the work presented in this paper. In [4], Obol (Open Bio-Ontology Language) is described and attempts to reason over the resulting definitions are presented. [5, 6] represent efforts to normalize the Gene Ontology [7] in a way that can be better exploited by reasoners. Thanks to the logical definitions of an ontology, we can gradually begin to automate many aspects of ontology development, detecting errors and filling in missing relationships.

Another related work is Semantic Medline, more specifically SemMedDB [8]. This approach aims at building a triple store of semantic annotations in UMLS that are extracted from predications identified in PubMed abstracts. These predications are associated with the predefined set of relationships of the UMLS Semantic Network. Unfortunately, the tool for extracting predications (i.e., SemRep) depends on specific versions of the UMLS Metathesaurus, which are not freely available and do not have the same domain coverage as BioPortal. Moreover, the relations provided by SemRep are different from those used in HP and BioPortal, as discussed later in Section 4.
3 Methodology

We have represented the axioms in HP as triples (subject, relation, object) in order to generate a statistical model able to infer the most suitable relation for each (subject, object) pair of annotations. Apart from the axioms, the HP ontology contains, for each class or term, a set of lexical metadata: definition, description, synonyms, etc. We extracted and annotated this lexical information using the BioPortal annotator (footnote 3), an online service that discovers annotations for biomedical texts with classes from the different ontologies stored in BioPortal [2]. Using the extracted annotations, we generated for each HP class a list of pairs (subject, object) with all possible combinations of the annotations. We give the list of pairs (subject, object) as input to the statistical model and obtain the same list enriched with relations (subject, relation, object), ranked by the probability given by the statistical model. The last step is guided by the user, who reviews the list of ranked triples, decides which ones are useful and, finally, builds new axioms to be added to the HP classes.

Footnote 3: https://bioportal.bioontology.org/annotator

3.1 Transforming axioms into triples

We extracted from HP all the textual information and all the axioms for each term. We then expressed the axioms in a form that is easier to use, by implementing a method that parses the OWL axioms defining a class and transforms them into observation triples (statistical units): subject, relation, object. The following example shows the different stages in going from a Description Logic axiom to a set of triples.
Starting with, for example, the following axiom (belonging to HP_0000871, Panhypopituitarism):

    ['SomeValuesFrom(BFO_0000051 IntersectionOf(PATO_0000462
      SomeValuesFrom(RO_0000052 IntersectionOf(
      IntersectionOf(GO_0003008 SomeValuesFrom(BFO_0000066 UBERON_0002196))
      SomeValuesFrom(BFO_0000050 UBERON_0000468)))
      SomeValuesFrom(RO_0002573 PATO_0000460)))']

We focused only on the classes involved in the axiom (the concrete OWL constructor or restriction is not relevant for our approach), and we worked just with the name of the ontology, not the term code. For example, (HP_0100752, UBERON_3010224) is changed to (HP, UBERON). The reduced axiom and the corresponding triples are shown below.

Reduced axiom:

    ['BFO', ['PATO', ['RO', [['GO', ['BFO', 'UBERON']], ['BFO', 'UBERON']]], ['RO', 'PATO']]]

Corresponding triples:

    (HP_0000871, BFO, PATO)
    (PATO, RO, GO)
    (GO, BFO, UBERON)
    (RO, BFO, UBERON)
    (PATO, RO, BFO)
    (PATO, RO, PATO)

3.2 Generating the statistical models for axioms

After moving from each axiom to a set of triples, abstracting the ontological information, we estimate the probabilities between the different components of these triples. More specifically, our aim is to estimate the following conditional distributions: $P(s^* \mid r)$ for subject-relation pairs, $P(o^* \mid r)$ for object-relation pairs, and $P(r \mid s^*, o^*)$ for relation against subject-object pairs. When estimating these probabilities, we abstract $s$ and $o$ to their component ontologies (denoted with the $^*$ superscript) so that we can rank relation schemas. With the previous distributions, we can rank the inferred triples for each pair extracted from the textual descriptions. We use maximum likelihood estimation (MLE), with the following factorization:

$$P(s^*, r, o^*) = P(r \mid s^*) \cdot P(r \mid o^*) \cdot P(r \mid s^*, o^*)$$

3.3 Generating new knowledge through semantic annotation

We annotated the HP descriptions of each concept by using BioPortal.
With these annotations, all possible triples are generated by combining pairs of annotations that co-occur in each sentence of the description with all the potential relations that can hold between them. We also add constraints over the entities to be related: both subject and object have to be in the same sentence, and the subject or the object (or both) must not appear in the given axioms, so we can be sure that new triples add knowledge. Finally, by using the statistical model, the candidate triples for each HP class are ranked.

4 Results

Results are provided as triples (subject, relation, object), representing knowledge that is not present within the axioms associated with the HP classes. We obtained 76,348 new triples for 8,582 HP classes, so the number of triples we infer for each HP class is, on average, about 8.9 (footnote 4).

As an example, we have the term HP_0100752 with preferred label 'Hepatic anomalous lobulation'. Currently, this term does not have any (direct) axiom associated in the HP. The term HP_0100752 also has the following textual descriptions: 'Anomalous liver lobulation', 'Abnormal liver lobulation' and 'Formation of abnormal lobules (small masses of tissue) in the liver'. After using the proposed method, we obtained the following triples:

    subject    relation        object      prob.
    masses     inheres in      liver       0.18081
    masses     inheres in      tissue      0.18081
    abnormal   inheres in      liver       0.18081
    abnormal   inheres in      tissue      0.18081
    masses     has modifier    abnormal    0.12623
    tissue     part of         liver       0.02376

We have performed a preliminary evaluation by comparing the extracted triples against the SemMedDB predications [8] (footnote 5). For this purpose, we crossed extracted triples and predications by subject and object. As a result, we were able to match 41,200 (54%) triples to SemMedDB predications. This indicates that our approach generates meaningful triples. We also inspected non-matched triples, and many of them can be considered correct.
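Two of the steps described so far can be illustrated with a short sketch: the sentence-level pair generation of Section 3.3 and the crossing by subject and object used in this evaluation. This is only an illustration under stated assumptions, not the prototype's code. The function names `candidate_pairs` and `match_by_pair`, the decision to treat pairs as ordered, and all toy inputs (including the SemMedDB-style predications and the axiom terms) are hypothetical; only the three triples are taken from the example table.

```python
from itertools import permutations


def candidate_pairs(sentence_terms, axiom_terms):
    """Pair up terms annotated in ONE sentence (Section 3.3 constraints).

    A pair is kept only if the subject or the object (or both) is absent
    from the class's existing axioms, so every kept pair can add knowledge.
    Treating pairs as ordered is an assumption of this sketch.
    """
    return [
        (subj, obj)
        for subj, obj in permutations(sentence_terms, 2)
        if subj not in axiom_terms or obj not in axiom_terms
    ]


def match_by_pair(triples, predications):
    """Keep the triples whose (subject, object) also occurs in a predication.

    The relation name is ignored: the crossing uses subject and object only.
    """
    known = {(subj, obj) for subj, _rel, obj in predications}
    return [t for t in triples if (t[0], t[2]) in known]


# Hypothetical input: terms found in one sentence, and terms already
# used in the axioms of the same class.
terms = ["masses", "tissue", "liver"]
pairs = candidate_pairs(terms, axiom_terms={"tissue", "liver"})

# Triples from the example table; the predications are made-up toy data.
triples = [
    ("masses", "inheres in", "liver"),
    ("masses", "inheres in", "tissue"),
    ("tissue", "part of", "liver"),
]
predications = [
    ("tissue", "PART_OF", "liver"),
    ("liver", "PART_OF", "abdomen"),
]
matched = match_by_pair(triples, predications)
print(len(pairs), len(matched), f"{len(matched) / len(triples):.0%}")
```

In this toy run the pairs (tissue, liver) and (liver, tissue) are discarded because both terms already occur in the axioms, and one of the three triples finds a predication with the same subject and object. Matching on unordered pairs instead would be an equally plausible reading of the crossing step.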
However, a strict evaluation must be performed to assess their true accuracy.

As for relations, SemMedDB deals with a much richer set of relations than HP. Analyzing the matched triple-predication pairs, the main alignments identified between SemMedDB and HP relations are: (PART_OF, part of), (OCCURS_IN, inheres in), (ASSOCIATED_WITH, inheres in) and (AFFECTS, has modifier). However, SemMedDB and HP relationships are not easily comparable, as they are used in different ways. This issue deserves an in-depth study in future work.

Footnote 4: Raw results: http://krono.act.uji.es/swat4ls_2017/results.txt
Footnote 5: SemMedDB predications: http://krono.act.uji.es/swat4ls_2017/evaluation.txt

To sum up, the results are promising, as we are able to extract knowledge that is not explicitly present in the axioms. With this knowledge, experts should decide which extracted triples are useful to them and then create and add new axioms associated with the HP classes.

5 Conclusions

Many efforts have been made to give structure and formal definitions to biomedical ontologies, which enable the use of reasoners to infer (implicit) knowledge from the ontology. There is still, however, plenty of work to do in this area, as the domain keeps evolving and the ontologies need to keep track of this new knowledge in a coherent and complete manner. The maximum potential of any ontology will be obtained when all its terms have a complete and exhaustive set of logical definitions.

In this paper, we have presented a method to enrich the logical information available in the HP classes. Using a statistical model and extracting the concepts missing from the axioms, the system proposes a list of candidate triples that can be used by experts to build new axioms. We have compared the generated triples with the SemMedDB predications, showing a notable overlap between them, which suggests that the triples are meaningful.
As future work, we plan to define further relevance criteria to provide a better ranking of triples, as using only probability thresholds does not always give good results. We also need to design and implement a solid and complete evaluation process, since evaluating manually is not manageable given the large amount of data we are dealing with. We also plan to make use of the alignment between HP and UMLS to obtain a richer lexicon associated with HP classes, because BioPortal annotations are often too short and consequently do not cover the full semantics of the text. Finally, the system could build the axioms automatically and tune the statistical model with user feedback, since accepting or rejecting a triple is valuable information to be used as input to the statistical model.

References

1. Köhler, S., et al.: The Human Phenotype Ontology in 2017. Nucleic Acids Research 45 (2017) D865–D876
2. Noy, N.F., et al.: BioPortal: ontologies and integrated data resources at the click of a mouse. Nucleic Acids Research 37(Web-Server-Issue) (2009) 170–173
3. World Wide Web Consortium: OWL 2 Web Ontology Language document overview. W3C (2009)
4. Mungall, C.J.: Obol: Integrating language and meaning in bio-ontologies. Wiley InterScience (2004) 509–520
5. Mungall, C.J., et al.: Cross-product extensions of the Gene Ontology. Journal of Biomedical Informatics 44 (2011) 80–86
6. Wroe, C., Stevens, R., Goble, C.A., Ashburner, M.: A methodology to migrate the Gene Ontology to a description logic environment using DAML+OIL. In: PSB. (2003)
7. Ashburner, M., et al.: Creating the Gene Ontology resource: design and implementation. Wiley InterScience (2001) 425–433
8. Kilicoglu, H., et al.: SemMedDB: a PubMed-scale repository of biomedical semantic predications. Bioinformatics 28(23) (2012)