Investigating Ontology Use in Artificial Intelligence and Machine Learning for Biomedical Research - A Preliminary Report from A Literature Review

Asiyah Yu Lin*1, Andrey Ibrahim Seleznev2, Tianming “Danny” Ning3, Paulene Grier4,5, Lalisa “Mariam” Lin6, Christopher Travieso7, Ansu Chatterjee8, Jaleal Sanjak9

1 National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA
2 Walter Johnson High School, Bethesda, MD 20817, USA
3 Winston Churchill High School, Potomac, MD 20854, USA
4 Thomas Stone High School, Waldorf, MD 20601, USA
5 College of Southern Maryland, La Plata, MD 20646, USA
6 Walt Whitman High School, Bethesda, MD 20817, USA
7 Our Lady Of Good Counsel High School, Olney, MD 20832, USA
8 Office of Director, National Institutes of Health, Bethesda, MD 20852, USA
9 National Center for Advancing Translational Sciences, National Institutes of Health, Rockville, MD 20850, USA

Abstract
In this report, the authors conducted a comprehensive literature review to answer a question: how are ontologies being used in AI/ML approaches to solve biomedical research problems? A selection of 107 papers was reviewed, and data were extracted to answer questions regarding how, what, who, and where ontology-aware AI/ML approaches were applied in the biomedical domain, as well as the mechanics of ontology use in AI/ML frameworks. The ontologies were used either as categories of data or to compute the knowledge. Among many other ontologies, the Gene Ontology dominated the use of ontologies in AI/ML-based biomedical problem solving. A lack of collaboration was observed via the co-authorship network analysis.

Keywords
Ontology, Artificial Intelligence, Machine Learning, literature review

ICBO 2022, September 25-28, 2022, Ann Arbor, MI, USA
EMAIL: asiyah.lin@nih.gov (A. Y. Lin)
ORCID: 0000-0003-2620-0345 (A. Y. Lin)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073

1. Introduction

As a form of knowledge representation, ontologies organize knowledge and data hierarchically (“tree-like”) and horizontally (“network-like” or “graph-like”) using semantic relations, such as “is-a” or “part-of”. Artificial Intelligence (AI) and/or Machine Learning (ML) often apply mathematical models that require numeric data as input. The large and fast-growing volume of biomedical data has benefited the fast-advancing AI/ML algorithms and frameworks. However, leveraging the non-numerical, semantic, and hierarchical relations from an ontology remains a challenge in AI/ML [1]. In this report, the authors conducted a literature review to answer a question: how are ontologies being used in AI/ML approaches to solve biomedical research problems?

2. Method

On Sep. 4, 2022, a total of 503 papers were retrieved from the PubMed Central® (PMC) archive using keywords appearing in the title and abstract: ontology, artificial intelligence, machine learning, deep learning, neural network, and embedding, within a 5-year range from 2017 to 2022. Out of the 503 papers, the authors selected the 250 most relevant papers to screen due to time constraints. In total, 107 papers were selected for this report based on the eligibility criterion of being a research paper that solves a biomedical scientific problem. Excluded papers (n = 143) are review or comment papers; papers that solve an engineering problem rather than a biomedical scientific problem, such as Natural Language Processing (NLP) problems or using AI/ML to develop an ontology (e.g. predicting new relations or new classes); and irrelevant papers.

To facilitate the information extraction process and make the user interface easy and intuitive to use, a Google Form was designed for extracting the related text from the papers. The senior reviewer (AYL) then reviewed the screening of all 250 papers and the information extraction of the 107 papers to cross-check the results. The raw dataset of reviewers’ responses was deposited to the Zenodo repository. A DOI (10.5281/zenodo.7769984) was reserved for this dataset. Co-authorship network analysis was conducted using Gephi software (https://gephi.org/).

3. Results

After reviewing the abstracts of 250 papers, the authors identified four major categories of ontology use in AI/ML: 1. The whole ontology or ontology terms are used as data labels in the training datasets; 2. The ontological representation is transformed into a numerical data representation that is used in the downstream AI/ML, which includes calculating terms’ semantic similarities, constructing concept association matrices, using word embedding algorithms, etc.; 3. The ontology, as a graph or network structure, is used as a part of the neural network architecture; 4. The ontology classification is the target of the AI/ML classifier.

What follows are the specific questions answered via this literature review.

1. What biomedical problems are solved using ontology-aware AI/ML?

The biomedical problems being solved are mostly focused on gene function prediction (25 papers) or ontology annotation (14 papers). 7 papers use ontology-aware AI/ML to perform protein/gene interaction prediction, and 6 papers perform disease gene, protein, or variant prediction. Other topics include drug-drug interaction, drug repurposing, drug target, drug toxicity, and pathway membership prediction. In the clinical area, a few papers focus on clinical outcome prediction from EHR; anatomical site prediction from radiology reports, images, or pathology reports; and patient similarity prediction from clinical trials. Interestingly, there are papers using ontology and AI/ML to mine social media data for sentiment prediction and drug off-label use prediction.

Figure 1: Word cloud of the biomedical problems (generated by https://www.wordclouds.com/)

2. What ontologies are being used?

Besides 7 papers that did not mention the names of the ontologies used, 100 papers specified the ontologies being used. The use of Gene Ontology (GO) is dominant: out of 107 papers, 65 (60.7%) utilize GO in their AI/ML pipeline or architecture to solve their scientific problems. The next 4 most frequently used ontologies are: SNOMED CT and Human Phenotype Ontology (HPO) (9 papers, 8.4%), UMLS (6 papers, 5.6%), and Disease Ontology (DOID) (5 papers, 4.7%). Besides those, the Infectious Disease Ontology (IDO), ChEBI, FMA, and the Chinese version of MeSH were each used in more than 1 paper. Many papers develop specific ontologies for their specific task. In addition to the dominant use of GO, 38 (35.5%) ontologies cover topics related to disease, phenotype, or conditions. This result shows the lack of diversity of biomedical ontology use in AI/ML for biomedical research. It also shows the potential benefit of a unified ontology that covers diseases, phenotypes, and conditions.

3. How is ontology being used in the AI/ML algorithm or architecture?

There are two big categories of how ontology is being used in AI/ML algorithms: A) using ontology as categories of data, or B) computing the knowledge. In category A, 42 papers (39%) used ontologies as training data, and 24 papers (22%) used ontologies as the classifier’s target. In category B, the most popular use is to transform the ontology into a numeric representation. 54 papers (50.4%) used different methodologies, such as embedding, semantic similarity, and information content, to convert a text-based ontology into a matrix table with numbers. Only 12 papers (11.2%) utilized the whole ontology’s content and structure as a layer in a neural network architecture.

Out of the 107 papers, 31 papers (29%) applied a neural network architecture. Among these, 11 papers used a convolutional neural network, 7 papers used a deep neural network, 6 papers used a long short-term memory network (including Bi-LSTM and Bo-LSTM), 3 papers used a recurrent neural network, and 2 papers used an artificial neural network. Deep learning technology was applied in 4 papers. There is a growing practice of using a variety of embedding methods to transform the ontology into a low-dimensional vector space: 6 papers used Node2Vec, 4 papers used Word2Vec, 2 papers used Doc2Vec, 2 papers used Onto2Vec, and 1 paper used OPA2Vec and DL2Vec. While new methodologies are tested in those papers, traditional classifiers are still being applied: 8 papers applied Support Vector Machine (SVM), 6 papers applied Random Forest, 4 papers used a Naive Bayes classifier or k-nearest neighbor, and 3 papers used logistic regression techniques. In most cases, the authors claimed that ontology-aware AI/ML outperforms traditional classifiers.

4. Who publishes those papers, and where?

The authors also looked at the geographical distribution of the published papers. The top 5 countries that publish the most are: USA (33 papers), China (26 papers), UK and Saudi Arabia (10 papers each), France (7 papers), and Germany, Korea, and Portugal (7 papers each). 26 papers have authors across different countries. Of these, 4 papers were produced by China-USA collaborations, and 2 papers were produced by France-Lebanon collaborations. The observed dominance of USA publishing may be biased, because the authors only selected the USA-based PMC as the source database to retrieve papers.

5. How did authors collaborate in research?

The authors were interested in learning who the researchers in this field are and how they collaborate. A network analysis was performed based on co-authorship. The resulting research network shows a lack of collaboration in this research area. Most of the authors are in isolated groups (Figure 2A). The hub analysis of the network reveals one active hub center, Dr. Robert Hoehndorf from the King Abdullah University of Science and Technology (KAUST) in Saudi Arabia. He has many papers published with many authors; however, his co-authorship network is limited to the UK and Saudi Arabia (Figure 2B). Community analysis showed that, besides the community formed by the UK and Saudi Arabia, a few Chinese researchers form their own community via co-authorships. This result shows that many collaborative activities, such as focused conferences, workshops, meetings, and hackathons, are needed to promote creativity and innovation in science. The authors suggest that more workshops such as Role of Ontology in Biomedical AI (ROBI) should be held, and that a community of scientists working in this specific area should be established.

Figure 2A: Network analysis of the co-authorship of the 107 papers. A node denotes an author; color denotes community; the size of a node denotes the centrality of an author; the size of a link denotes the count of co-authorships between authors.

Figure 2B: Hub analysis showed that Dr. Robert Hoehndorf and his group form an active hub and a small community comprised of Dr. Hoehndorf’s collaborators in Saudi Arabia and the UK.

4. Conclusion

In conclusion, ontology provides contextually rich data that helps AI/ML achieve higher performance compared to similar methods without ontologies. However, the applications of ontology-aware AI/ML in the biomedical domain are still largely limited to gene or protein function predictions. The lack of cross-discipline collaboration, specifically in applications in the biomedical domain, is alarming. Funding to support collaborative initiatives and community development is needed in this area. Workshops such as ROBI should be continued and expanded.

Utilizing the graph structure and semantics within an ontology requires more complex neural network architectures along with many other components, such as the neuro-symbolic approach. Explainable AI is an emerging field in which explanatory techniques can explicitly show why a recommendation or a prediction is made.

This literature review is biased by the selection of PMC as the pool from which papers were retrieved. Many methodological papers were published as conference proceedings or white papers. Rising topics such as neuro-symbolic learning and explainable AI were not investigated. Future work includes extending the search to other repositories, such as Europe PMC, IEEE, PMLR, DBLP, and arXiv, and to other topics, such as the use of neuro-symbolic learning [2] and explainable AI [3] in the biomedical domain. Leveraging an ontology of AI/ML to annotate more details on AI/ML components, to allow better analysis, is another future direction.

5. Acknowledgement

AYL, AC and JS are supported by The Office of Data Science Strategy, NIH, via the Data and Technology Advancement (DATA) National Service Scholar program. AIS, TDN, PG, LL, and CT are members of a local youth group, Biomedical Informatics Research for Youth, located in Maryland. This group was founded by AYL. AYL and JS are mentors of this youth group. AYL conceived the idea of the paper, designed the search strategy and survey questions, evaluated the results, conducted analysis, and wrote the manuscript. All other authors contributed to the paper review and data entry. AIS and TDN contributed to data curation for co-authorship network analysis.

6. References

[1] Kulmanov M, Smaili FZ, Gao X, Hoehndorf R: Semantic similarity and machine learning with ontologies. Brief Bioinform 2021, 22(4).
[2] Hassan M, Guan H, Melliou A, Wang Y, Sun Q, Zeng S, Liang W, Zhang Y, Zhang Z, Hu Q: Neuro-Symbolic Learning: Principles and Applications in Ophthalmology. arXiv preprint arXiv:2208.00374 2022.
[3] Holzinger A, Malle B, Saranti A, Pfeifer B: Towards multi-modal causability with Graph Neural Networks enabling information fusion for explainable AI. Information Fusion 2021, 71:28-37.
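
As an illustration of the category B use reported under question 3 of the Results (converting an ontology into a matrix of numbers via information content and semantic similarity), the following is a minimal sketch and not a method taken from any of the reviewed papers. It assumes a toy is-a hierarchy whose term names and annotation counts are made up, and it uses a Resnik-style measure, in which the similarity of two terms is the information content of their most informative common ancestor.

```python
import math

# Hypothetical toy is-a hierarchy: term -> list of parent terms (names are made up).
PARENTS = {
    "biological_process": [],
    "metabolic_process": ["biological_process"],
    "cellular_process": ["biological_process"],
    "lipid_metabolism": ["metabolic_process"],
    "protein_metabolism": ["metabolic_process"],
}

# Hypothetical number of genes annotated directly to each term.
DIRECT_COUNTS = {
    "biological_process": 0,
    "metabolic_process": 2,
    "cellular_process": 4,
    "lipid_metabolism": 1,
    "protein_metabolism": 1,
}


def ancestors(term):
    """Return the term itself plus all of its is-a ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content():
    """IC(t) = -log p(t), where p(t) counts annotations to t or any descendant."""
    cumulative = {term: 0 for term in PARENTS}
    for term, count in DIRECT_COUNTS.items():
        for anc in ancestors(term):
            cumulative[anc] += count
    total = sum(DIRECT_COUNTS.values())
    return {term: -math.log(c / total) for term, c in cumulative.items()}


def resnik(a, b, ic):
    """Resnik-style similarity: IC of the most informative common ancestor."""
    return max(ic[t] for t in ancestors(a) & ancestors(b))


def similarity_matrix():
    """Return the sorted term list and the pairwise similarity matrix."""
    ic = information_content()
    terms = sorted(PARENTS)
    return terms, [[resnik(a, b, ic) for b in terms] for a in terms]


if __name__ == "__main__":
    terms, matrix = similarity_matrix()
    for term, row in zip(terms, matrix):
        print(f"{term:20s}", " ".join(f"{value:5.2f}" for value in row))
```

Each row of the resulting matrix is a numeric feature vector for one ontology term, which is the kind of input a downstream classifier requires. Real pipelines would read the hierarchy and annotations from ontology and annotation files rather than hard-coded dictionaries.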
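
The co-authorship analysis described under question 5 of the Results was carried out in Gephi. The sketch below is not that pipeline; it only illustrates, under stated assumptions, how author lists can be turned into a weighted co-authorship network. The author names are placeholders, the number of distinct co-authors is used as a simple stand-in for centrality, and connected components are used as a simple stand-in for community detection.

```python
from collections import Counter
from itertools import combinations

# Hypothetical author lists, one per paper (placeholder names, not real data).
PAPERS = [
    ["A", "B", "C"],
    ["A", "B"],
    ["D", "E"],
    ["F"],
]


def coauthor_edges(papers):
    """Count how many papers each unordered pair of authors shares."""
    edges = Counter()
    for authors in papers:
        for a, b in combinations(sorted(set(authors)), 2):
            edges[(a, b)] += 1
    return edges


def build_graph(papers, edges):
    """Adjacency sets; single-author papers still contribute isolated nodes."""
    graph = {a: set() for authors in papers for a in authors}
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def degree(graph):
    """Number of distinct co-authors per author (a simple centrality proxy)."""
    return {author: len(neighbours) for author, neighbours in graph.items()}


def components(graph):
    """Connected components via depth-first search, largest first."""
    seen = set()
    groups = []
    for start in sorted(graph):
        if start in seen:
            continue
        group, stack = [], [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            group.append(node)
            for nxt in graph[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        groups.append(sorted(group))
    return sorted(groups, key=len, reverse=True)


if __name__ == "__main__":
    edges = coauthor_edges(PAPERS)
    graph = build_graph(PAPERS, edges)
    print("edge weights:", dict(edges))
    print("degree:", degree(graph))
    print("groups:", components(graph))
```

In a network like this, many small components with no links between them correspond to the isolated groups the report describes, and an author with an unusually high degree corresponds to a hub.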