Investigating Ontology Use in Artificial Intelligence and Machine Learning for Biomedical Research
- A Preliminary Report from a Literature Review

Asiyah Yu Lin*1, Andrey Ibrahim Seleznev2, Tianming “Danny” Ning3, Paulene Grier4,5,
Lalisa “Mariam” Lin6, Christopher Travieso7, Ansu Chatterjee8, Jaleal Sanjak9

1 National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA
2 Walter Johnson High School, Bethesda, MD 20817, USA
3 Winston Churchill High School, Potomac, MD 20854, USA
4 Thomas Stone High School, Waldorf, MD 20601, USA
5 College of Southern Maryland, La Plata, MD 20646, USA
6 Walt Whitman High School, Bethesda, MD 20817, USA
7 Our Lady of Good Counsel High School, Olney, MD 20832, USA
8 Office of Director, National Institutes of Health, Bethesda, MD 20852, USA
9 National Center for Advancing Translational Sciences, National Institutes of Health, Rockville, MD 20850, USA


Abstract
In this report, the authors conducted a comprehensive literature review to answer a question:
how are ontologies being used in AI/ML approaches to solve biomedical research problems?
A selection of 107 papers was reviewed, and data were extracted to answer questions regarding
how, what, who, and where ontology-aware AI/ML approaches were applied in the biomedical
domain, as well as the mechanics of ontology use in AI/ML frameworks. The ontologies were
used either as categories of data or to compute the knowledge. Among many other
ontologies, the Gene Ontology dominated the use of ontologies in AI/ML-based biomedical
problem solving. A lack of collaboration was observed via the co-authorship network analysis.

Keywords
Ontology, Artificial Intelligence, Machine Learning, literature review

ICBO 2022, September 25-28, 2022, Ann Arbor, MI, USA
EMAIL: asiyah.lin@nih.gov (A. Y. Lin)
ORCID: 0000-0003-2620-0345 (A. Y. Lin)
©️ 2022 Copyright for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073

1. Introduction

As a form of knowledge representation, ontologies organize knowledge and data
hierarchically (“tree-like”) and horizontally (“network-like” or “graph-like”) using semantic
relations, such as “is-a” or “part-of”. Artificial Intelligence (AI) and Machine Learning (ML)
often apply mathematical models that require numeric data as input. The fast-growing, large
volume of biomedical data has benefited the fast-advancing AI/ML algorithms and frameworks.
However, leveraging the non-numerical, semantic, and hierarchical relations from an
ontology remains a challenge in AI/ML [1]. In this report, the authors conducted a literature
review to answer a question: how are ontologies being used in AI/ML approaches to solve
biomedical research problems?

2. Method

On September 4, 2022, a total of 503 papers were retrieved from the PubMed Central® (PMC)
archive using keywords appearing in the title and abstract: ontology, artificial intelligence,
machine learning, deep learning, neural network, and embedding, within a five-year range
from 2017 to 2022. Out of the 503 papers, the authors selected 250 highly relevant papers to
screen due to time constraints. In total, 107 papers were selected for this report based on the
eligibility criterion of a research paper solving a biomedical scientific problem. Excluded papers
(n = 143) are review or comment papers; papers that do not solve a biomedical scientific
problem but rather an engineering problem, such as Natural Language Processing (NLP)
problems or using AI/ML to develop ontologies (e.g., predicting new relations or new classes);
and irrelevant papers. To facilitate the information extraction process and make the user
interface easy and intuitive to use, a Google Form was designed for extracting the related text
from the papers. The senior reviewer (AYL) then reviewed the screening of all 250 papers and
the information extraction of the 107 papers to cross-check the results. The raw dataset of the
reviewers’ responses was deposited in the Zenodo repository. A DOI (10.5281/zenodo.7769984)
was reserved for this dataset. Co-authorship network analysis was conducted using Gephi
software (https://gephi.org/).

3. Results

After reviewing the abstracts of 250 papers, the authors identified four major categories of
ontology use in AI/ML: (1) using the whole ontology or ontology terms as data labels in the
training datasets; (2) transforming the ontological representation into a numerical data
representation to be used in downstream AI/ML, which includes calculating terms’ semantic
similarities, constructing concept association matrices, and using word embedding algorithms;
(3) using the ontology as a graph or network structure that forms part of a neural network
architecture; (4) using the ontology classification as the target of the AI/ML classifier.

What follows are the specific questions answered through this literature review.

    1. What biomedical problems are solved using ontology-aware AI/ML?

The biomedical problems being solved are mostly focused on gene function prediction
(25 papers) or ontology annotation (14 papers). Seven papers use ontology-aware AI/ML to
perform protein/gene interaction prediction, and 6 papers predict disease genes, proteins, or
variants. Other topics include drug-drug interaction, drug repurposing, drug target, drug
toxicity, and pathway membership prediction. In the clinical area, a few papers focus on
clinical outcome prediction from EHR, anatomical site prediction from radiology report,
image, or pathology report, and patient similarity prediction from clinical trials. Interestingly,
there are papers using ontology and AI/ML to mine social media data for sentiment prediction
and drug off-label use prediction.

Figure 1: Word cloud of the biomedical problems (generated by https://www.wordclouds.com/)

    2. What ontologies are being used?

Besides 7 papers that did not mention the names of the ontologies used, 100 papers specified
the ontologies being used. The use of the Gene Ontology (GO) is dominant: out of 107 papers,
65 (60.7%) utilized GO in their AI/ML pipeline or architecture to solve their scientific
problems. The next 4 most frequently used ontologies are SNOMED CT and the Human
Phenotype Ontology (HPO) (9 papers, 8.4%), UMLS (6 papers, 5.6%), and the Disease
Ontology (DOID) (5 papers, 4.7%). Besides those, the Infectious Disease Ontology (IDO),
ChEBI, FMA, and the Chinese version of MeSH were used in more than 1 paper. Many papers
developed specific ontologies for their specific tasks. In addition to the dominant use of GO,
38 (35.5%) ontologies cover topics related to disease, phenotype, or conditions. This result
shows the lack of diversity of biomedical ontology use in AI/ML for biomedical research. It
also shows the potential benefit of a unified ontology that covers diseases, phenotypes, and
conditions.

    3. How is ontology being used in the AI/ML algorithm or architecture?

There are two broad categories of how ontology is being used in AI/ML algorithms: A) using
the ontology as categories of data, or B) using it to compute the knowledge. In category A,
42 papers (39%) used ontologies as training data, and 24 papers (22%) used ontologies as the
classifier’s target. In category B, the most popular use is to transform the ontology into a
numeric representation: 54 papers (50.5%) used different methodologies, such as embedding,
semantic similarity, and information content, to convert a text-based ontology into a numeric
matrix. Only 12 papers (11.2%) utilized the whole ontology’s content and structure as a layer
in a neural network architecture.

Out of the 107 papers, 31 papers (29%) applied a neural network architecture. Among these,
11 papers used convolutional neural networks, 7 papers used deep neural networks, 6 papers
used long short-term memory networks including Bi-LSTM and Bo-LSTM, 3 papers used
recurrent neural networks, and 2 papers used artificial neural networks. Deep learning
technology was applied in 4 papers. There is a growing practice of using a variety of
embedding methods to transform the ontology into a low-dimensional vector space: 6 papers
used Node2Vec, 4 papers used Word2Vec, 2 papers used Doc2Vec, 2 papers used Onto2Vec,
and 1 paper used OPA2Vec and DL2Vec. While new methodologies are tested in those papers,
traditional classifiers are still being applied: 8 papers applied Support Vector Machine (SVM),
6 papers applied Random Forest, 4 papers used a Naive Bayes classifier or k-nearest neighbor,
and 3 papers used logistic regression techniques. In most cases, the authors claimed that
ontology-aware AI/ML outperforms traditional classifiers.

    4. Who publishes those papers, and where?

The authors also looked at the geographical distribution of the published papers. The top 5
countries that publish the most are: USA (33 papers), China (26 papers), UK and Saudi Arabia
(10 papers each), France (7 papers), and Germany, Korea, and Portugal (7 papers each).
Twenty-six papers have authors from different countries. Of these, 4 papers were produced by
China-USA collaborations and 2 papers by France-Lebanon collaborations. The observed
dominance of USA publications may be biased, because the authors selected only the
USA-based PMC as the source database for retrieving papers.

    5. How did authors collaborate in research?

The authors were interested in learning who the researchers in this field are and how they
collaborate. A network analysis was performed based on co-authorship. The resulting research
network shows a lack of collaboration in this research area. Most of the authors are in isolated
groups (Figure 2A). The hub analysis of the network reveals one active hub center, Dr. Robert
Hoehndorf from the King Abdullah University of Science and Technology (KAUST) in Saudi
Arabia. He has published many papers with many authors; however, his co-authorship network
is limited to the UK and Saudi Arabia (Figure 2B). Community analysis showed that, besides
the community formed by the UK and Saudi Arabia, a few Chinese researchers form their own
community via co-authorships. This result shows that many collaborative activities, such as
focused conferences, workshops, meetings, and hackathons, are needed to promote creativity
and innovation in science. The authors suggest that more workshops such as Role of Ontology
in Biomedical AI (ROBI) should be held, and that a community of scientists working in this
specific area should be established.

Figure 2A: Network analysis of the co-authorship of the 107 papers. A node denotes an author.
Color denotes community; the size of a node denotes the centrality of an author; the size of a
link denotes the count of co-authorships between authors.

Figure 2B: Hub analysis showed that Dr. Robert Hoehndorf and his group form an active hub
and a small community comprising Dr. Hoehndorf’s collaborators in Saudi Arabia and the UK.

4. Conclusion

In conclusion, ontology provides contextually rich data that helps AI/ML achieve higher
performance compared to similar methods without ontologies. However, the applications of
ontology-aware AI/ML in the biomedical domain are still limited to gene or protein function
predictions. The lack of cross-discipline collaboration, specifically in applications in the
biomedical domain, is alarming. Funding to support collaborative initiatives and community
development is needed in this area. Workshops such as ROBI should be continued and
expanded.

Utilizing the graph structure and semantics within an ontology requires more complex neural
network architectures along with many other components, such as the neuro-symbolic
approach. Explainable AI is an emerging field in which explanatory techniques can explicitly
show why a recommendation or a prediction is made. This literature review is biased by the
selection of PMC as the pool for retrieval. Many methodological papers were published as
conference proceedings or white papers. Rising topics such as neuro-symbolic AI and
explainable AI were not investigated. Future work includes extending the search to other
repositories, such as Europe PMC, IEEE, PMLR, DBLP, and arXiv, and to other topics, such as
the use of neuro-symbolic [2] and explainable AI [3] in the biomedical domain. Leveraging an
ontology of AI/ML to annotate more details on AI/ML components, allowing better analysis, is
another future direction.

5. Acknowledgement

AYL, AC and JS are supported by The Office of Data Science Strategy, NIH, via the Data and
Technology Advancement (DATA) National Service Scholar program. AIS, TDN, PG, LL, and
CT are members of a local youth group, Biomedical Informatics Research for Youth, located in
Maryland. This group was founded by AYL. AYL and JS are mentors of this youth group. AYL
conceived the idea of the paper, designed the search strategy and survey questions, evaluated
the results, conducted analysis, and wrote the manuscript. All other authors contributed to the
paper review and data entry. AIS and TDN contributed to data curation for co-authorship
network analysis.

6. References

[1] Kulmanov M, Smaili FZ, Gao X, Hoehndorf R: Semantic similarity and machine learning
    with ontologies. Brief Bioinform 2021, 22(4).
[2] Hassan M, Guan H, Melliou A, Wang Y, Sun Q, Zeng S, Liang W, Zhang Y, Zhang Z, Hu Q:
    Neuro-Symbolic Learning: Principles and Applications in Ophthalmology. arXiv preprint
    arXiv:2208.00374 2022.
[3] Holzinger A, Malle B, Saranti A, Pfeifer B: Towards multi-modal causability with Graph
    Neural Networks enabling information fusion for explainable AI. Information Fusion 2021,
    71:28-37.
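
The Results describe papers that convert an ontology into a numeric matrix using methods such as semantic similarity and information content. The following is a minimal sketch of that idea, not the method of any reviewed paper. The toy “is-a” hierarchy, the term names, and the annotation counts are all hypothetical, and the Resnik measure (the information content of the most informative common ancestor) is used here only as one well-known information-content-based similarity.

```python
import math

# Hypothetical toy "is-a" hierarchy: term -> list of parent terms.
PARENTS = {
    "root": [],
    "metabolic process": ["root"],
    "transport": ["root"],
    "lipid metabolism": ["metabolic process"],
    "glucose metabolism": ["metabolic process"],
}

# Hypothetical number of items (e.g. genes) annotated directly to each term.
DIRECT_COUNTS = {
    "root": 0,
    "metabolic process": 2,
    "transport": 4,
    "lipid metabolism": 1,
    "glucose metabolism": 1,
}


def ancestors(term):
    """Return the term itself plus every term reachable via is-a."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content():
    """IC(t) = log(total / n(t)); n(t) counts annotations to t or its descendants."""
    cumulative = dict.fromkeys(PARENTS, 0)
    for term, count in DIRECT_COUNTS.items():
        for anc in ancestors(term):
            cumulative[anc] += count
    total = cumulative["root"]
    return {t: math.log(total / n) for t, n in cumulative.items() if n > 0}


def resnik(a, b, ic):
    """Resnik similarity: the highest IC among the common ancestors of a and b."""
    common = ancestors(a) & ancestors(b)
    return max((ic[t] for t in common if t in ic), default=0.0)


ic = information_content()
terms = sorted(PARENTS)
# Term-by-term similarity matrix: the numeric form a downstream model would consume.
matrix = [[resnik(a, b, ic) for b in terms] for a in terms]

for term, row in zip(terms, matrix):
    print(f"{term:20s}", " ".join(f"{value:5.2f}" for value in row))
```

In this toy setting the two sibling terms under "metabolic process" receive a non-zero similarity, while terms that share only the root receive zero, which is how the hierarchy ends up encoded in the numbers.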