The coevolution of ontologies and knowledge-based analytics in bioinformatics

Robert Hoehndorf

Computational Bioscience Research Center, Computer, Electrical and Mathematical Sciences & Engineering Division, King Abdullah University of Science and Technology, 4700 KAUST, Thuwal 23955, Saudi Arabia

Abstract
I discuss the coevolution of bio-ontologies and analytical bioinformatics methods in response to the evolving landscape of life sciences. I focus on the role of ontologies, in particular the Gene Ontology, in capturing and describing biological knowledge, and the challenges and developments in ontology-based bioinformatics, particularly in light of new computational methods and machine learning. The main theme is the bidirectional influence between ontologies and bioinformatics methods as they evolved together, and how ontologies have shaped advancements in the analysis, representation, and understanding of biological data by providing a unifying layer of knowledge.

Keywords
bio-ontology, knowledge-based analytics, Artificial Intelligence

ICBO 2022, September 25-28, 2022, Ann Arbor, MI, USA
Email: robert.hoehndorf@kaust.edu.sa (R. Hoehndorf)
URL: https://leechuck.de (R. Hoehndorf)
ORCID: 0000-0001-8149-5890 (R. Hoehndorf)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

Developing high-quality ontologies is expensive, and, like most infrastructure components of the life sciences, ontologies have evolved in response to specific needs and requirements of the biomedical community. At the same time, new tools utilizing ontologies emerged to enable or improve analysis of biological data. In my talk, I will explore how bio-ontologies have evolved in response to a changing bioinformatics environment and how bioinformatics tools and methods evolved in response to changing ontologies; my main aim will be to characterize the current changes in bioinformatics through large-scale application of machine learning, and how ontologies have to change to accommodate these changes.

The Gene Ontology (GO), the first bio-ontology that was and still is widely used, emerged as a consequence of breakthroughs in gene and genome sequencing and the resulting understanding of how many genes are conserved in different organisms [1]. This novel understanding, combined with the rapid change of knowledge in the field of molecular biology, necessitated the development of the GO, to keep track of the changing knowledge in the field and simultaneously provide a means to describe our knowledge of gene and protein functions. Using the GO for describing protein functions solved many challenges. A form of deductive inference (“true path rule”) allowed capturing the most specific information possible about a protein while still allowing inference of more general information, and use of a taxonomy allowed knowledge to evolve by gradually adding more specific functions to a protein without invalidating previous assertions.

Today, some of the most exciting developments (and challenges) in bio-ontologies still occur in fields where novel experimental techniques are leading to a radical change of our understanding of biological phenomena. For example, our understanding of cell types has recently changed drastically as a result of single-cell sequencing technologies and the detailed information they make available about cell types and their relations; ontologies of cell types had to change accordingly [2], and cell ontologies are now one of the most active areas of bio-ontology development (as evidenced, for example, by the regular CELLS workshop co-located with the International Conference on Biomedical Ontologies).

Yet, what the early development of the GO (and similar ontologies) has shown is that the development and evolution of ontologies in life sciences is not a one-way road determined only by changes in experimental techniques; rather, the availability of ontologies has also led to novel computational analysis methods, and ontologies will change in response to the emergence of novel methods. Two methods are particularly noteworthy here: ontology enrichment analysis and semantic similarity measures. Both are among the most widely used computational analysis methods involving ontologies. An ontology enrichment analysis uses an ontology together with its annotations in order to determine whether there is a function that is statistically enriched in a set of genes or gene products [3, 4, 5]. Ontology-based semantic similarity measures utilize the knowledge contained in ontologies (in particular within the formal axioms) to define measures of similarity between ontology classes, sets of classes, instances of classes, or entities annotated with (sets of) classes [6]. Semantic similarity was first used to query for and retrieve “semantically” related proteins [7], and later extended to find other entities with some association using a “guilt by association” approach.

My key take-away message from these methods is that bioinformatics has developed a set of computational methods that crucially relied on ontologies providing accurate results.
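To illustrate how an enrichment analysis depends on ontology-based inference, the following is a minimal sketch in Python. The toy class hierarchy, the gene identifiers, and the function names (`ancestors`, `propagate`, `enrichment_p`) are invented for illustration and are not taken from the GO or from any cited tool. The sketch uses a one-sided hypergeometric test, which is one common choice for over-representation; the tools cited in [3, 4, 5] may use different statistics.

```python
from math import comb

# Hypothetical toy ontology: class -> set of direct superclasses (is-a edges).
PARENTS = {
    "kinase activity": {"catalytic activity"},
    "catalytic activity": {"molecular function"},
    "DNA binding": {"molecular function"},
    "molecular function": set(),
}

# Hypothetical direct (most specific) annotations: gene -> classes.
DIRECT = {
    "g1": {"kinase activity"},
    "g2": {"kinase activity"},
    "g3": {"catalytic activity"},
    "g4": {"DNA binding"},
    "g5": {"DNA binding"},
    "g6": {"DNA binding"},
}


def ancestors(term):
    """Return the term together with all of its (transitive) superclasses."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def propagate(direct):
    """True path rule: annotation to a class implies annotation to every superclass."""
    return {
        gene: set().union(*(ancestors(t) for t in terms))
        for gene, terms in direct.items()
    }


def enrichment_p(term, study, annotations):
    """One-sided hypergeometric p-value P(X >= k) for over-representation of `term`."""
    population = len(annotations)                              # N: all annotated genes
    with_term = sum(term in ts for ts in annotations.values()) # K: genes with the term
    drawn = len(study)                                         # n: size of the study set
    hits = sum(term in annotations[g] for g in study)          # k: study genes with the term
    tail = sum(
        comb(with_term, i) * comb(population - with_term, drawn - i)
        for i in range(hits, min(with_term, drawn) + 1)
    )
    return tail / comb(population, drawn)


ANNOTATIONS = propagate(DIRECT)
STUDY = {"g1", "g2", "g3"}
```

With these toy data, `enrichment_p("catalytic activity", STUDY, ANNOTATIONS)` evaluates to 0.05, whereas the same call on the unpropagated `DIRECT` annotations gives 0.5. The difference comes entirely from the inferred annotations, which is why the result is only as reliable as the inferences the ontology licenses.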
Both enrichment analysis and semantic similarity require that inferences in ontologies, in particular inferences about annotated genes or gene products (the “true path rule” in GO and more elaborate versions of this rule), are accurate (accurate in the sense that they are biologically correct and experimentally verifiable). Early ontologies did not always produce accurate inferences [8, 9, 10, 11], and finding these incorrect inferences has, arguably, led to one of the most active periods for ontology development and quality improvement, where the community applied and developed methods inspired, among others, by philosophy [12], linguistics [13], and logics [14].

With further improvement in experimental methods, in particular the emergence of high-throughput sequencing methods, the demands on ontologies rose further, both in terms of their accuracy and in terms of their detail and discriminatory power. Ontologies now had to cope with Big Data, and manually building ontologies would no longer scale in many domains. During this time, ontology design patterns [15], upper ontologies [16, 17], more elaborate ontology design principles, and community standards allowed ontologies to “scale up” both to capturing Big Data and to capturing more detailed nuances in biological phenomena. The new problem arose that our tools (reasoners and ontology editors such as Protégé) no longer scaled to the new size and complexity of ontologies. The solution was to switch to different tools like Elk [18], and apply modularization techniques such as MIREOT [19]. While these solve the problem of scalability to Big Data, they have also hidden (and lost) some information: automated reasoners such as Elk consider only a tiny subset of the language we use to formalize ontologies, and modularization techniques can hide inconsistencies and therefore allow inconsistencies to increase [20].

As a result, a switch took place within the bio-ontologies community and the focus was no longer only on “ontologies” as formal artifacts capturing domain knowledge accurately, but rather on constructing “knowledge graphs” in which the focus is on linking information in some (vaguely) meaningful manner. The tendency to focus more on “knowledge graphs” instead of ontologies was by no means universal but certainly noticeable and is still ongoing today. The move was motivated by the desire to focus on “relatedness” instead of precision, and to find ways to integrate (i.e., link) large amounts of resources, in particular in the biomedical domain; the resources that are linked were often not ontologies but (medical) terminologies, so that ontological precision may have been an obstacle to successful integration.

At the same time, and further motivating the focus on knowledge graphs instead of ontologies, novel knowledge graph analytics approaches emerged, in particular machine learning methods that would operate directly on graphs or knowledge graphs [21, 22], and graph neural networks that can exploit the knowledge graphs for various tasks [23]. In particular, knowledge graph embedding methods have been adopted widely within the bioinformatics community to exploit information in knowledge graphs for predictive or analytical tasks. Several knowledge graph embedding methods have been developed [21], but some of the most popular are based on the principle that, if the fact $r(a, b)$ is in the knowledge graph, then $\vec{a} + \vec{r} \approx \vec{b}$ (where $\vec{a}$ etc. are the “embedding” vectors of some dimension that “represent” $a$, $r$, and $b$ in a distributed manner) [24]. The advantages of these embedding methods are their interpretability, simplicity, and almost universal applicability.

The role of ontologies in graph-based machine learning methods (such as knowledge graph embeddings or graph neural networks) is to provide a source of nodes, and the formal axioms in the ontologies provide a source of relatedness (edges) that make up the resulting graph [25]. Yet, many aspects that have been considered crucial in developing ontologies are lost, specifically all benefits arising from semantics, both logical and ontological [26]: the ability for complex queries; ensured consistency; and deductive inference. In particular, deductive inference (which is required both for complex queries and for determining consistency) is crucial for exploring the knowledge ontologies contain beyond what has been explicitly asserted, but this ability for deductive inference is largely lost in graph-based methods.

Before ontologies (considered here as artifacts which explicitly and formally specify a conceptualization of a domain using a logic-based language) can become relevant in machine learning in bioinformatics, methods that can utilize the semantics of ontologies first need to be developed, because very few such methods exist in the field of AI; and it is even more of a challenge to tune such methods to the specific peculiarities of bio-ontologies, which have distinct properties when compared to ontologies used in other domains, in particular computer science.

Some new methods that apply machine learning to bio-ontologies have emerged over the past years. While some of these methods are simple extensions of learning from graph-structured data or learning from text, more recent approaches aim to explicitly address the missing formal semantics in machine learning models. These neuro-symbolic methods can produce deductive inferences directly, either by implementing a deduction system using neural approaches or by generating model structures using neural approaches. Establishing this correspondence between classical semantics and neural networks enables novel applications of ontologies and places new demands on them, but also opens novel opportunities, both for bioinformatics and AI. In bioinformatics, these methods allow machine learning to utilize the vast and rich knowledge contained in bio-ontologies, thereby endowing the machine learning models with domain knowledge (and the ability to explore the knowledge more deeply than would be possible using only knowledge graphs), which can be used to provide access to the results of over a hundred years of experiments that are now contained in ontologies and knowledge bases. One of the most obvious areas of application is rare diseases, where only little training data will ever be available. For AI, bio-ontologies provide a vast and largely underused resource of knowledge with direct implications for health, the environment, and well-being.

References

[1] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, M. J. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. I. Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M. Rubin, G. Sherlock, Gene ontology: tool for the unification of biology, Nature Genetics 25 (2000) 25–29. URL: http://dx.doi.org/10.1038/75556. doi:10.1038/75556.
[2] D. Osumi-Sutherland, C. Xu, M. Keays, A. P. Levine, P. V. Kharchenko, A. Regev, E. Lein, S. A. Teichmann, Cell type ontologies of the human cell atlas, Nature Cell Biology 23 (2021) 1129–1135. URL: https://doi.org/10.1038/s41556-021-00787-7. doi:10.1038/s41556-021-00787-7.
[3] S. W. Doniger, N. Salomonis, K. D. Dahlquist, K. Vranizan, S. C. Lawlor, B. R. Conklin, MAPPFinder: using Gene Ontology and GenMAPP to create a global gene-expression profile from microarray data, Genome Biology 4 (2003) R7. URL: https://doi.org/10.1186/gb-2003-4-1-r7. doi:10.1186/gb-2003-4-1-r7.
[4] A. Subramanian, P. Tamayo, V. K. Mootha, S. Mukherjee, B. L. Ebert, M. A. Gillette, A. Paulovich, S. L. Pomeroy, T. R. Golub, E. S. Lander, J. P. Mesirov, Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles, Proceedings of the National Academy of Sciences of the United States of America 102 (2005) 15545–15550. URL: http://www.pnas.org/content/102/43/15545.abstract. doi:10.1073/pnas.0506580102.
[5] M. D. Robinson, J. Grigull, N. Mohammad, T. R. Hughes, FunSpec: a web-based cluster interpreter for yeast, BMC Bioinformatics 3 (2002) 35. URL: https://doi.org/10.1186/1471-2105-3-35. doi:10.1186/1471-2105-3-35.
[6] S. Harispe, S. Ranwez, S. Janaqi, J. Montmain, The semantic measures library and toolkit: fast computation of semantic similarity and relatedness using biomedical ontologies, Bioinformatics 30 (2014) 740–742. URL: http://bioinformatics.oxfordjournals.org/content/30/5/740.abstract. doi:10.1093/bioinformatics/btt581.
[7] P. W. Lord, R. D. Stevens, A. Brass, C. A. Goble, Investigating semantic similarity measures across the gene ontology: the relationship between sequence and annotation, Bioinformatics 19 (2003) 1275–1283. URL: http://bioinformatics.oxfordjournals.org/content/19/10/1275.abstract. doi:10.1093/bioinformatics/btg153.
[8] B. Smith, J. Williams, S. Schulze-Kremer, The ontology of the gene ontology, AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium (2003) 609–613. URL: http://view.ncbi.nlm.nih.gov/pubmed/14728245.
[9] B. Smith, C. Rosse, The role of foundational relations in the alignment of biomedical ontologies, Medinfo 11 (2004) 444–448.
[10] B. Smith, Against fantology, in: M. E. Reicher, J. C. Marek (Eds.), Experience and Analysis. Proceedings of the 27th International Wittgenstein Symposium, volume 6, 2005, pp. 153–170. URL: http://dx.doi.org/10.1186/gb-2004-6-1-r7. doi:10.1186/gb-2004-6-1-r7.
[11] W. Ceusters, P. Elkin, B. Smith, Referent tracking: The problem of negative findings, Stud Health Technol Inform (2006).
[12] B. Smith, W. Ceusters, Ontological realism: A methodology for coordinated evolution of scientific ontologies, Appl. Ontol. 5 (2010) 139–188.
[13] M. Bada, L. Hunter, Enrichment of OBO ontologies, Journal of Biomedical Informatics 40 (2007) 300–315. URL: https://doi.org/10.1016/j.jbi.2006.07.003. doi:10.1016/j.jbi.2006.07.003.
[14] R. Hoehndorf, F. Loebe, J. Kelso, H. Herre, Representing default knowledge in biomedical ontologies: application to the integration of anatomy and phenotype ontologies, BMC Bioinform. 8 (2007). URL: https://doi.org/10.1186/1471-2105-8-377. doi:10.1186/1471-2105-8-377.
[15] D. Osumi-Sutherland, M. Courtot, J. P. Balhoff, C. Mungall, Dead simple OWL design patterns 8 (2017). URL: https://doi.org/10.1186/s13326-017-0126-0. doi:10.1186/s13326-017-0126-0.
[16] B. Smith, W. Ceusters, B. Klagges, J. Köhler, A. Kumar, J. Lomax, C. Mungall, F. Neuhaus, A. L. Rector, C. Rosse, Relations in biomedical ontologies, Genome Biol 6 (2005) R46. URL: http://dx.doi.org/10.1186/gb-2005-6-5-r46. doi:10.1186/gb-2005-6-5-r46.
[17] B. Smith, M. Ashburner, C. Rosse, J. Bard, W. Bug, W. Ceusters, L. J. Goldberg, K. Eilbeck, A. Ireland, C. J. Mungall, N. Leontis, P. R. Serra, A. Ruttenberg, S. A. Sansone, R. H. Scheuermann, N. Shah, P. L. Whetzel, S. Lewis, The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration, Nat Biotech 25 (2007) 1251–1255.
[18] Y. Kazakov, M. Krötzsch, F. Simancik, The incredible elk, Journal of Automated Reasoning 53 (2014) 1–61. URL: http://dx.doi.org/10.1007/s10817-013-9296-3. doi:10.1007/s10817-013-9296-3.
[19] M. Courtot, N. Juty, C. Knüpfer, D. Waltemath, A. Zhukova, A. Dräger, M. Dumontier, A. Finney, M. Golebiewski, J. Hastings, S. Hoops, S. Keating, D. B. Kell, S. Kerrien, J. Lawson, A. Lister, J. Lu, R. Machne, P. Mendes, M. Pocock, N. Rodriguez, A. Villeger, D. J. Wilkinson, S. Wimalaratne, C. Laibe, M. Hucka, N. Le Novère, Controlled vocabularies and semantics in systems biology, Molecular systems biology 7 (2011). URL: http://dx.doi.org/10.1038/msb.2011.77. doi:10.1038/msb.2011.77.
[20] L. T. Slater, G. V. Gkoutos, R. Hoehndorf, Towards semantic interoperability: finding and repairing hidden contradictions in biomedical ontologies, BMC Medical Informatics and Decision Making 20 (2020). URL: https://doi.org/10.1186/s12911-020-01336-2. doi:10.1186/s12911-020-01336-2.
[21] Q. Wang, Z. Mao, B. Wang, L. Guo, Knowledge graph embedding: A survey of approaches and applications, IEEE Transactions on Knowledge and Data Engineering 29 (2017) 2724–2743. doi:10.1109/TKDE.2017.2754499.
[22] M. Ali, C. T. Hoyt, D. Domingo-Fernández, J. Lehmann, H. Jabeen, BioKEEN: a library for learning and evaluating biological knowledge graph embeddings, Bioinformatics 35 (2019) 3538–3540. URL: https://doi.org/10.1093/bioinformatics/btz117. doi:10.1093/bioinformatics/btz117.
[23] X.-M. Zhang, L. Liang, L. Liu, M.-J. Tang, Graph neural networks and their current applications in bioinformatics, Frontiers in Genetics 12 (2021). URL: https://doi.org/10.3389/fgene.2021.690049. doi:10.3389/fgene.2021.690049.
[24] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, O. Yakhnenko, Translating embeddings for modeling multi-relational data, Advances in neural information processing systems 26 (2013).
[25] J. Chen, P. Hu, E. Jimenez-Ruiz, O. M. Holter, D. Antonyrajah, I. Horrocks, OWL2Vec*: embedding of OWL ontologies, Machine Learning (2021). URL: https://doi.org/10.1007/s10994-021-05997-6. doi:10.1007/s10994-021-05997-6.
[26] F. Loebe, H. Herre, Formal semantics and ontologies - towards an ontological account of formal semantics, in: C. Eschenbach, M. Grüninger (Eds.), Formal Ontology in Information Systems, Proceedings of the Fifth International Conference, FOIS 2008, Saarbrücken, Germany, October 31st - November 3rd, 2008, volume 183 of Frontiers in Artificial Intelligence and Applications, IOS Press, 2008, pp. 49–62. URL: https://doi.org/10.3233/978-1-58603-923-3-49. doi:10.3233/978-1-58603-923-3-49.
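
Supplementary illustration. The translation principle for knowledge graph embeddings described in the main text (if r(a, b) is in the knowledge graph, the embedding of a plus the embedding of r should lie close to the embedding of b [24]) can be sketched as a scoring function. This is a minimal sketch only: the entity names, the two-dimensional vectors, and the helper names `transe_score` and `rank_tails` are hypothetical, the vectors are hand-picked rather than trained, and the Euclidean distance is one possible choice of dissimilarity.

```python
import math


def transe_score(a, r, b):
    """Euclidean distance between a + r and b; values near 0 make r(a, b) plausible."""
    return math.sqrt(sum((ai + ri - bi) ** 2 for ai, ri, bi in zip(a, r, b)))


# Hypothetical 2-dimensional embeddings, hand-picked for illustration (not trained).
EMB = {
    "geneA": [0.0, 0.0],
    "geneB": [2.0, 0.0],
    "functionX": [1.0, 1.0],
    "functionY": [3.0, 1.0],
}

# Hypothetical embedding of a single relation, here called "has function".
HAS_FUNCTION = [1.0, 1.0]


def rank_tails(head, relation, candidates):
    """Order candidate tail entities from most to least plausible for (head, relation, ?)."""
    return sorted(candidates, key=lambda c: transe_score(EMB[head], relation, EMB[c]))
```

For example, `rank_tails("geneA", HAS_FUNCTION, ["functionY", "functionX"])` places `functionX` first, because `geneA + HAS_FUNCTION` coincides with the vector chosen for `functionX`. Nothing in the score uses the axioms of an ontology, which is the limitation the main text points out for graph-based methods.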