Exploring the Use of Ontology Components for Distantly-Supervised Disease and Phenotype Named Entity Recognition

Sumyyah Toonsi1,2,†, Şenay Kafkas1,2,† and Robert Hoehndorf1,2,*

1 Computer, Electrical and Mathematical Sciences & Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955, Kingdom of Saudi Arabia
2 Computational Bioscience Research Center, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955, Kingdom of Saudi Arabia


Abstract
The lack of curated corpora is one of the major obstacles for Named Entity Recognition (NER). With the advancements in deep learning and the development of robust language models, distant supervision utilizing weakly labelled data is often used to alleviate this problem. Previous approaches utilized weakly labeled corpora from Wikipedia or from the literature. However, to the best of our knowledge, none of them explored the use of the different ontology components for disease/phenotype NER under the distant supervision scheme. In this study, we explored whether different ontology components can be used to develop a distantly supervised disease/phenotype entity recognition model. We trained different models by considering ontology labels, synonyms, definitions, axioms and their combinations, in addition to a model trained on literature. Results showed that content from the disease/phenotype ontologies can be exploited to develop a NER model performing at the state-of-the-art level. In particular, models that utilised both the ontology definitions and axioms showed competitive performance compared to the model trained on literature. This removes the need to find and annotate external corpora. Furthermore, models trained using ontology components made zero-shot predictions on the test datasets that were not made by the models trained on the literature-based datasets.

Keywords
Named Entity Recognition, Text mining, ontologies




                                1. Introduction
Named Entity Recognition (NER) is a form of Natural Language Processing (NLP) that aims to
identify and classify named entities such as organisations, persons, diseases and genes in text.
NER is a challenging task due to the nature of language, which includes abbreviations,
synonymous entities, and, in general, variable descriptions of entities.


Proceedings of the International Conference on Biomedical Ontologies 2023, August 28th-September 1st, 2023, Brasilia, Brazil
* Corresponding author.
† These authors contributed equally.
Email: sumyyah.toonsi@kaust.edu.sa (S. Toonsi); senay.kafkas@kaust.edu.sa (Ş. Kafkas); robert.hoendorf@kaust.edu.sa (R. Hoehndorf)
ORCID: 0000-0003-4746-4649 (S. Toonsi); 0000-0001-7509-5786 (Ş. Kafkas); 0000-0001-8149-5890 (R. Hoehndorf)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).







   Early methods for NER used dictionaries due to their applicability and time efficiency. Lexical
approaches such as the NCBO (National Center for Biomedical Ontology) annotator [1], ZOOMA
[2], and the OBO (Open Biological and Biomedical Ontologies) annotator [3] are not able to
recognise new concepts and cannot detect all variations of expressions. This is because once
dictionaries are constructed with terms, they can only find exact matches to those terms. Hence,
dictionary-based approaches suffer from low recall.
   With the emergence of machine learning, better NER methods were developed. This was
possible through exposing statistical models to curated text where mentions of entities are
identified by human curators and provided to these models. Subsequently, these models were
able to generalize to unseen entities better than previous methods. For instance, GNormPlus [4]
was developed to find gene/protein mentions using a supervised model which demonstrated
competitive results at the time. Although supervised methods showed remarkable improvements
in performance, they require curated instances for the model to learn. That is, the model expects
instances of text where mentions of entities are clearly provided to learn to distinguish concepts
of interest. This becomes a serious problem when one wants to recognise a novel/unexplored
concept. Moreover, supervised methods often fail to recognise concepts not covered by the
curated corpora.
   To alleviate the need for curated corpora, distant supervision was explored for NER. In
particular, distantly supervised models are trained on a weakly labeled training set, i.e., one
obtained from an imprecise source. For instance, dictionaries can be used to annotate text with
exact matches, which produces both false positives and false negatives. Methods such as BOND
[5], PatNER [6], ChemNER [7], PhenoTagger [8], Conf-MPU [9], and the approach of Dong and
colleagues [10] demonstrated the potential of distant supervision for NER. The aforementioned
methods created weakly labeled sets using labels and synonyms found in ontologies/vocabularies
to extract training instances from unlabeled corpora. These instances were then used to train
different models, which in some cases outperformed state-of-the-art methods.
   Inspired by the advances achieved by distant supervision, we explored the contribution of
different components of ontologies (labels and synonyms, definitions, and complex axioms)
to the task of NER under the distant supervision scheme. In all of the previously mentioned
distantly-supervised NER methods, only labels and synonyms of ontologies/vocabularies were
used to create the weakly labeled corpora from literature. The use of different ontology
components to develop NER models has not been comprehensively explored for
diseases/phenotypes. In addition to the use of labels and synonyms, in this study, we go a step
further and explore the use of definitions and axioms to develop a disease/phenotype NER
model. We hypothesize that the dense and rich knowledge found in ontologies can be used to
develop NER models without the need for external corpora such as literature abstracts. We
conducted our experiments on disease and phenotype entity recognition because the study of
diseases and phenotypes is important for understanding disease diagnosis, treatment and
epidemiology.




2. Materials and Methods
2.1. Ontologies, literature resource and benchmark corpora
2.1.1. Ontologies
We used the Disease Ontology (DO) [11] (downloaded on 15/April/2022) and the MEDIC
vocabulary [12] (downloaded on 1/March/2022) in our study. DO is an ontology from the Open
Biomedical Ontologies (OBO) [11], whereas MEDIC is a vocabulary of disease terms represented
in the Web Ontology Language (OWL) [12]. We used the Human Phenotype Ontology (HPO)
[13] (downloaded on 5/Jan/2022) for the phenotype concepts.

2.1.2. Literature
We used Medline [14] as a literature resource to generate our abstract-based weakly labeled
dataset. To select abstracts that cover ontology concepts, we used an in-house index covering
32,923,095 Medline records (downloaded on Dec-15-2022) generated using Elasticsearch [15].
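The paper does not detail the index schema; as an illustration, the following minimal sketch queries such an index for a dictionary term using the Python Elasticsearch client (8.x API), where the index name "medline" and the field name "abstract" are our assumptions.

```python
# A minimal sketch (not the authors' code) of querying a Medline index for a
# dictionary term. The index name "medline" and field name "abstract" are
# hypothetical; the in-house schema is not described in the paper.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")    # index location assumed

resp = es.search(
    index="medline",                           # hypothetical index name
    query={"match_phrase": {"abstract": "livedoid vasculitis"}},
    size=5,                                    # the paper keeps the top [1-5] hits
)
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_id"])           # default relevance scoring
```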

2.1.3. Benchmark corpora
To evaluate the named entity recognition models, we used four benchmark corpora: the NCBI–
Disease Corpus [16], the MedMentions Corpus (disease and phenotype) [17], and GSC+ [18].
NCBI–Disease is a widely used corpus where disease mentions are annotated and reviewed by
multiple annotators. MedMentions is a large corpus annotated with an extensive set of Unified
Medical Language System (UMLS) concepts. We selected the abstracts with disease annotations
from MedMentions and named this the MedMentions–disease Corpus. To form this corpus, we
used UMLS-to-MeSH mappings to obtain the MeSH codes and selected the disease concepts
which exist in our disease dictionary (described in Section 2.2). Similarly, we selected the
abstracts with phenotype concepts for which we found UMLS-to-HPO mappings and named
this dataset MedMentions–phenotype. GSC+ is a widely used benchmarking dataset covering
phenotype concepts, particularly from HPO. We used the test dataset version released by [8].
Table 1 shows the distribution of the abstracts and annotations in the four benchmark corpora.

Table 1
Statistics of benchmark corpora
                    Corpus                          Abstracts   Annotations
                    NCBI–disease train              593         5146
                    NCBI–disease dev                100         788
                    NCBI–disease test               100         960
                    MedMentions–disease test        879         3726
                    MedMentions–phenotype train     1291        6772
                    MedMentions–phenotype dev       428         2287
                    MedMentions–phenotype test      405         2190
                    GSC+ test                       228         1933




2.2. Dictionary generation
We generated and used two dictionaries to weakly label Medline abstracts for disease and
phenotype concepts. To generate our dictionaries, first, we extracted the labels and synonyms
of all concepts from MEDIC, DO and HPO. Second, we filtered out potentially ambiguous
labels/synonyms, which are often stop words, short labels/synonyms (1 or 2 characters long),
and labels/synonyms shared by two different concepts. For example, DO contains the synonym
"go" for the "geroderma osteodysplasticum" concept (DOID:0111266); this synonym is
ambiguous with the verb "go". Filtering out ambiguous names is a common practice in text
mining workflows that rely on lexical matches. We used the Natural Language Toolkit (NLTK)
stop words [19] and filtered out any label/synonym in MEDIC, DO and HPO exactly matching
a stop word. In both sources, we did not find any match with the list of stop words. We also
filtered out the labels/synonyms having fewer than 3 characters to avoid false positives.
Additionally, for the generation of the disease dictionary, we filtered out all the disease
labels/synonyms which exactly match protein labels/synonyms from the HUGO Gene
Nomenclature Committee (HGNC) Database [20] to avoid false positive matches with protein
names. Third, we generated the plural form of each label/synonym by using the Inflect Python
module [21]. For example, the module generates “tetanic cataracts” for the given multi-word
term, “tetanic cataract” (DOID:13822). Our final disease dictionary covers 244,903 disease
labels and synonyms of 29,374 distinct concepts from MEDIC and DO. The final phenotype
dictionary covers 79,010 phenotype labels and synonyms of 14,631 distinct concepts from HPO.
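As an illustration of the filtering and pluralization steps above, the following minimal sketch uses a toy term map; the HGNC symbol set is a placeholder, and the stop words are loaded through the nltk package rather than the published list used in the paper.

```python
# A minimal sketch of the dictionary filtering and pluralization steps.
# `raw_terms` and `hgnc_symbols` are toy placeholders; the stop words are
# loaded via nltk (requires nltk.download("stopwords")).
import inflect
from nltk.corpus import stopwords

engine = inflect.engine()
stop_words = set(stopwords.words("english"))
hgnc_symbols = {"GO", "CAT", "SET"}                  # hypothetical gene/protein names
raw_terms = {
    "geroderma osteodysplasticum": "DOID:0111266",
    "go": "DOID:0111266",                            # ambiguous synonym
    "tetanic cataract": "DOID:13822",
}

dictionary = {}
for term, concept_id in raw_terms.items():
    if term.lower() in stop_words:                   # drop stop-word terms
        continue
    if len(term) < 3:                                # drop 1-2 character terms
        continue
    if term.upper() in hgnc_symbols:                 # drop clashes with protein names
        continue
    dictionary[term] = concept_id
    dictionary[engine.plural(term)] = concept_id     # e.g. "tetanic cataracts"

print(sorted(dictionary))                            # "go" was filtered out
```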

2.3. Ontology components used
An ontology 𝑂, as previously described in [22], has four main components:

    • Classes and relations, where classes and relations are assigned unique identifiers.
    • Domain vocabulary, where labels and synonyms are linked to ontology classes and
      relations.
    • Textual definitions, where descriptions about classes and relations are provided, usually
      in natural language.
    • Formal axioms, where relations between concepts are described in some formal language
      and possibly linked to other ontologies and sources.

We used the labels and synonyms, textual definitions, and formal axioms separately to create
weakly labeled corpora; the statistics are reported in Table 2.

Table 2
Statistics of used ontology components
              Component          DO and MEDIC              HPO
              Labels/synonyms    35,333                    16,307
              Definitions        9,435 and 19,939 dummy    10,202 and 2,451 dummy
              Axioms             30,834                    37,062




2.4. Training dataset construction
2.4.1. Abstracts from literature
To generate the training set for distant supervision, first, we retrieved the relevant literature by
searching the indexed Medline for the exact match of each label/synonym from the dictionaries.
We retrieved the top [1-5] Medline abstract/title hits per concept, identified based on the
default Elasticsearch relevance scoring settings (TF-IDF [23] based scoring). Second, we used
the dictionaries to annotate the downloaded abstracts lexically and converted the annotations
to the I-O-B format (a common format for tagging tokens in a chunking task, where 𝐵 marks
the first token (Beginning) of an annotation, 𝐼 a subsequent (Inside) token of the same
annotation, and 𝑂 a token that is not annotated (Outside)) [24] by using spaCy [25]. Finally,
we obtained two sets of corpora: one for the disease concepts and the other for the phenotype
concepts. We found 16,307 distinct phenotype labels/synonyms belonging to 6,962 classes
from HPO in at least one Medline record by searching the indexed literature. These concepts
are covered by 16,096, 31,372, 46,032, 60,098 and 74,087 distinct Medline abstracts/titles at
top 1, 2, 3, 4 and 5 hits respectively, and we used them as our training sets for phenotypes. We
found 35,333 distinct disease labels/synonyms linked to 8,400 distinct concepts from MEDIC
and DO in at least one Medline record. These concepts are covered by 41,698, 81,007, 118,295,
154,060 and 187,462 distinct Medline abstracts/titles at top 1, 2, 3, 4 and 5 hits respectively,
and we used them as our training sets for disease concepts.
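As an illustration of the weak labelling step, the sketch below tags dictionary matches in a sentence and emits I-O-B labels. The use of spaCy's PhraseMatcher and filter_spans is our assumption; the paper only states that spaCy was used for the conversion.

```python
# A minimal sketch of the weak labelling step: find dictionary matches in a
# sentence and emit I-O-B tags. PhraseMatcher/filter_spans usage is our
# assumption, not necessarily the authors' exact pipeline.
import spacy
from spacy.matcher import PhraseMatcher
from spacy.util import filter_spans

nlp = spacy.blank("en")                               # tokenizer only
terms = ["livedoid vasculitis", "vasculitis"]         # toy dictionary entries

matcher = PhraseMatcher(nlp.vocab, attr="LOWER")      # case-insensitive matching
matcher.add("DISEASE", [nlp.make_doc(t) for t in terms])

doc = nlp("Livedoid vasculitis is a rare form of vasculitis.")
spans = [doc[start:end] for _, start, end in matcher(doc)]

tags = ["O"] * len(doc)
for span in filter_spans(spans):                      # keep longest, non-overlapping
    tags[span.start] = "B"
    for i in range(span.start + 1, span.end):
        tags[i] = "I"

for token, tag in zip(doc, tags):
    print(f"{token.text}\t{tag}")
```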

Table 3
Example of using the class DOID:0040099 to create different weakly labeled sets. Text in bold refers to
text annotated as B/I classes in the IOB format.
   Component Ontology representation                         Dataset representation
   Labels         name: Livedoid vasculitis                  Livedoid vasculitis
   Synonyms       synonym: “livedoid vasculopathy” EXACT Livedoid vasculopathy
   Axioms         DOID:0040099 SubClassOf DOID:865           Livedoid vasculitis is a vasculitis
   Definitions    “A vasculitis with purpuric ulcers.”       A vasculitis with purpuric ulcers.



2.4.2. Labels and synonyms
Using the direct labels and synonyms from ontologies, we created two sets, one for phenotypes
and one for diseases. For phenotypes, the labels and synonyms extracted from HPO were directly
considered as positives, as shown in Table 3. For diseases, we used the labels and synonyms
from both DO and MEDIC. The labels and synonyms were retrieved from the dictionary
described in Section 2.2.

2.4.3. Definitions
Definitions in DO are available in natural language. To associate a concept with its definition,
we added the concept label/synonyms to the beginning of the definition as shown in Table 3.
For concepts which lacked definitions, we simply included their labels/synonyms with a dummy
sentence replicated for all. For instance, if a disease 𝑋 does not have a definition, its dummy
definition is “𝑋 is a disease”. Since definitions can include other concepts (e.g. parent concepts)
in their description, mentions of such concepts can be troublesome. To partially resolve this
issue, we annotated the definitions with the dictionaries described in Section 2.2. Matches
against the dictionaries were treated as positive mentions of concepts. In total, we retrieved
9,435 definitions from DO and used dummy definitions for 19,939 concepts. For phenotypes,
we included definitions for 10,202 concepts and used dummy definitions for 2,451 concepts.
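A minimal sketch of this construction, with an illustrative helper name, is shown below.

```python
# A minimal sketch of building a definition-based training sentence; the
# helper name and the "kind" argument are illustrative, not the paper's code.
def definition_sentence(label, definition=None, kind="disease"):
    """Prepend the concept label to its definition, or build a dummy one."""
    if definition:
        return f"{label} {definition}"
    return f"{label} is a {kind}"                # dummy definition for concept X

print(definition_sentence("Livedoid vasculitis",
                          "A vasculitis with purpuric ulcers."))
print(definition_sentence("disease X"))          # -> "disease X is a disease"
```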

2.4.4. Axioms
Axioms are not readily available for natural language tasks since they are expressed in a formal
language. To tackle this issue, we first processed axioms as previously described in [26]. Next,
we replaced ontology identifiers with their labels/synonyms. We also included axioms which
reference external ontologies and replaced their identifiers with names as shown in Table 3.
   For diseases, we used 30,834 axioms from DO. For phenotypes, we included 37,062 axioms
from HPO. Axioms of both ontologies included references to external ontologies, which we
downloaded and processed to map their identifiers to their names. The external ontologies that
were included are: the Basic Formal Ontology (BFO) [27], the Chemical Entities of Biological
Interest (ChEBI) [28], the Cell Ontology (CL) [29], the Gene Ontology (GO), the Relation
Ontology (RO) [30], and the Uber-anatomy Ontology (UBERON) [31].
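As an illustration, the sketch below substitutes identifiers with labels in a single axiom; the identifier-to-label map and the keyword rewriting are toy assumptions, whereas the paper follows the axiom processing of [26].

```python
# A minimal sketch of rendering a formal axiom as text by substituting class
# identifiers with labels. The id-to-label map and the "SubClassOf" rewriting
# are toy assumptions.
import re

id_to_label = {
    "DOID:0040099": "Livedoid vasculitis",
    "DOID:865": "vasculitis",                    # labels may come from external ontologies
}

def verbalize(axiom):
    """Replace CURIE-style identifiers (e.g. DOID:865) with ontology labels."""
    text = re.sub(r"\b[A-Za-z]+:\d+\b",
                  lambda m: id_to_label.get(m.group(), m.group()),
                  axiom)
    return text.replace("SubClassOf", "is a")    # simple keyword rewriting

print(verbalize("DOID:0040099 SubClassOf DOID:865"))
# -> "Livedoid vasculitis is a vasculitis" (cf. Table 3)
```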

2.5. Named entity recognition using distant supervision
NER refers to identifying the boundaries of entity mentions in text (disease and phenotype
mentions in our case). We used distant supervision to train our models, using BioBERT to
recognise disease and phenotype mentions in text. Figure 1 depicts the system overview.
   BioBERT is a BERT (Bidirectional Encoder Representations from Transformers) [32] language
model pre-trained on large biomedical corpora. BERT is a contextualized word representation
model trained using masked language modeling. It provides self-supervised deep bidirectional
representations from unlabeled text by jointly conditioning on both left and right contexts. The
pre-trained BERT model can be fine-tuned with an additional output layer to generate models
for various desired NLP tasks. We used simpletransformers [33], which provides a wrapper
model, to distantly supervise an entity recognition model. More specifically, the wrapped
model is used to fine-tune BERT models by adding a token-level classifier on top that classifies
tokens into one of the output classes, which are I-O-B (Inside-Outside-Beginning). In the
training phase, our models are initialised with weights from BioBERT-Base v1.1 [34] and then
fine-tuned on the disease and phenotype entity recognition task using our training corpora.
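As an illustration, the following minimal sketch fine-tunes a BioBERT token classifier with simpletransformers; the hyperparameters and the tiny training frame are illustrative rather than the paper's exact configuration.

```python
# A minimal sketch of fine-tuning a BioBERT token classifier with
# simpletransformers. Settings and data are illustrative only.
import pandas as pd
from simpletransformers.ner import NERModel, NERArgs

# Weakly labelled tokens: one row per token, grouped by sentence_id.
train_df = pd.DataFrame(
    [(0, "Livedoid", "B"), (0, "vasculitis", "I"), (0, "is", "O"),
     (0, "a", "O"), (0, "vasculitis", "B"), (0, ".", "O")],
    columns=["sentence_id", "words", "labels"],
)

args = NERArgs()
args.num_train_epochs = 1                        # illustrative setting
args.overwrite_output_dir = True

model = NERModel("bert", "dmis-lab/biobert-base-cased-v1.1",
                 labels=["B", "I", "O"], args=args, use_cuda=False)
model.train_model(train_df)

predictions, _ = model.predict(["Livedoid vasculopathy was diagnosed."])
print(predictions)
```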


3. Results
We set up our experiments on four separate benchmarking corpora covering phenotype and
disease concepts: NCBI–disease, MedMentions–disease, MedMentions–phenotype and GSC+.
We reported our NER results using the Precision, Recall and F-score metrics. We used a relaxed
scheme to calculate the metrics where we considered any partial overlap between the prediction
and the curated annotations to be a true positive. That is, predictions are considered to be
positives whenever the indices (locations in text) of the prediction and the curated annotations
overlap.
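A minimal sketch of this relaxed matching over character spans, with illustrative function names, is shown below.

```python
# A minimal sketch of the relaxed matching scheme: a prediction counts as a
# true positive if its character span overlaps any curated annotation.
def overlaps(a, b):
    """True if half-open character spans (start, end) intersect."""
    return a[0] < b[1] and b[0] < a[1]

def relaxed_prf(predicted, gold):
    tp = sum(any(overlaps(p, g) for g in gold) for p in predicted)
    hit_gold = sum(any(overlaps(g, p) for p in predicted) for g in gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = hit_gold / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# The partial overlap (0, 8) vs gold (0, 19) counts as a true positive.
print(relaxed_prf(predicted=[(0, 8), (30, 40)], gold=[(0, 19)]))
```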




Figure 1: System Overview
This figure depicts the training and test phases in our system. In the training phase, we used ontologies
to create a dictionary from the labels, synonyms and their plural forms. We used this dictionary to
create distant datasets from Medline abstracts and different ontology parts (labels/synonyms, axioms
and definitions). Later, this distant dataset is used for training a BioBERT NER model by using the
SimpleTransformers wrapper. In the test phase, the trained model is tested on different benchmarking
corpora.


   Table 4 shows the performance of the disease NER models which are distantly supervised on
different ontology components or on abstracts (the best F1-score is achieved at top 1, see
Additional File 1) on the disease test sets (see Table 1). For the sake of comparison, we also
included a supervised BioBERT model that is trained on the NCBI-disease training set. Our
results showed that the supervised BioBERT model trained on the curated set performed best
on NCBI–disease (0.94 F1-score) because concepts are highly conserved in this dataset. To fairly
compare the performance of the methods, we further evaluated the models on the
MedMentions–disease dataset. Results showed that the distantly supervised models (trained on
abstracts and on definitions plus axioms) achieved higher F1-scores (0.68 for abstracts and 0.67
for definitions plus axioms) compared to the model trained on the curated set (0.66 F1-score),
which is biased towards the NCBI–disease dataset (we found an 80% overlap in concept IDs
between the NCBI training and test sets). The models trained solely on labels and synonyms,
axioms, or definitions showed lower F1-scores compared to the model trained on abstracts. On
the other hand, the model trained on definitions plus axioms achieved a competitive F1-score
compared to the model trained on abstracts. This result is most evident on the
MedMentions-disease test set.




Table 4
Disease NER results
                        Training data             Precision     Recall     F1
                        NCBI-disease (test set)
                        Labels and synonyms       0.64          0.36       0.46
                        Axioms                    0.68          0.59       0.63
                        Definitions               0.87          0.80       0.83
                        Definitions and axioms    0.91          0.76       0.83
                        Literature abstracts      0.92          0.81       0.86
                        Curated NCBI train        0.91          0.96       0.94
                        MedMentions-disease (test set)
                        Labels and synonyms       0.41          0.26       0.32
                        Axioms                    0.43          0.42       0.43
                        Definitions               0.48          0.82       0.61
                        Definitions and axioms    0.58          0.79       0.67
                        Literature abstracts      0.60          0.78       0.68
                        Curated NCBI train        0.58          0.77       0.66

Table 5
Phenotype NER results
                      Training data                 Precision     Recall        F1
                      MedMentions-phenotype (test set)
                      Labels and synonyms           0.33          0.75          0.46
                      Axioms                        0.31          0.58          0.40
                      Definitions                   0.47          0.80          0.59
                      Definitions and axioms        0.55          0.77          0.64
                      Literature abstracts          0.60          0.82          0.69
                      Curated MedMentions train     0.61          0.79          0.69
                      GSC+ (test set)
                      Labels and synonyms           0.32          0.71          0.44
                      Axioms                        0.40          0.60          0.48
                      Definitions                   0.61          0.77          0.68
                      Definitions and axioms        0.65          0.74          0.69
                      Literature abstracts          0.73          0.78          0.75
                      Curated MedMentions train     0.61          0.53          0.57


   Table 5 presents the performance of the models in phenotype NER on the GSC+ and
MedMentions-phenotype test datasets. We included the MedMentions-phenotype dataset
to test our models thoroughly and to train the supervised model on sufficient data. With the
inclusion of context at a large scale, the model trained on the weakly labelled abstracts achieved
the highest F1-score (0.69 on MedMentions-phenotype and 0.75 on GSC+) compared to the
other models. On the other hand, the model trained on the curated set was not robust to the
change of dataset, as it performed poorly on GSC+ (0.57 F1-score). We observed a 6% discrepancy
between the model trained on abstracts and the model trained on weakly labelled definitions
plus axioms. We discuss the reasons for this discrepancy in detail in the “Discussion” section.




4. Discussion
Our main goal was to explore whether ontology components can help to develop distantly
supervised disease/phenotype entity recognition models which are competitive with the state-
of-the-art. To that end, we exploited ontology components to create textual context using the
labels/synonyms, axioms and definitions. We observed that utilising the context in ontologies
via distant supervision aids in developing a NER model at the state-of-the-art level. While the
models trained solely on labels and synonyms achieved the lowest scores, simply due to the
lack of context, the models incorporating context such as axioms and definitions improved the
performance over the models lacking context.
   The disease NER model trained on the axioms and definitions achieved a competitive F1-score
compared to the model trained on the abstracts. However, we observed a 6% discrepancy
between the phenotype NER models trained on the abstracts (the best F1-score is achieved at
top 2) and on axioms and definitions together. To investigate the reason for this discrepancy,
we focused on the False Positive (FP) predictions on the GSC+ test corpus. The model trained
on the weakly labeled abstracts produced 440 FPs while the model trained on the phenotype
definitions and axioms produced 608 FPs. We found that 184 out of the 608 FPs were produced
distinctly by the model trained on definitions and axioms and not by the one trained on the
abstracts. We randomly sampled 20 of these 184 FPs for further manual analysis. Our manual
analysis showed that all of them were actually True Positives that had been missed by the
GSC+ annotations. For example, we found that “Uniparental disomy” (HP:0032382) in
PMID:8103288 was captured correctly by the model but was missed by the GSC+ annotations.
More importantly, we observed that the majority of the FPs were not introduced in the
definitions and axioms training corpus but were rather predicted as zero-shot instances (i.e.
instances that were not seen by the model during training). For example, “Angelman syndrome”
in PMID:8786067, which does not correspond to any label/synonym in HPO and does not exist
in the training corpus, was annotated by the model trained on definitions and axioms.
Furthermore, the model trained on literature abstracts did not produce these FPs since they
were specifically included as 𝑂 classes in its training set. Details of our manual analysis can be
found in Additional File 1.
   We conducted our study on DO and HPO. These ontologies are widely used and therefore
contain dense content which can help to generate sufficiently large weakly labeled datasets.
Although the approach is generic and its utility can be explored for any given ontology, the
performance would depend on the density of the content of the ontology of choice. That is, if the
ontology does not sufficiently describe a concept, it is not possible to obtain a well-performing
model.


5. Conclusion
In conclusion, our analysis showed that ontology components can provide a suitable corpus
for building a NER model that is competitive with the state-of-the-art. This alleviates the need
for annotating a large number of abstracts and facilitates the creation of weakly labeled training
corpora. Easily obtained corpora are desirable since they reduce both the computational and
time overheads. To the best of our knowledge, this is the first work that uses ontology axioms
to build disease/phenotype NER models.
   Additionally, the models trained on ontology components were capable of zero-shot
predictions on the test datasets. This was not the case for the models trained on curated sets
and the models trained on the large weakly labeled literature abstracts. Our approach is generic
and its utility can be explored with any other ontology which has sufficient content describing
the concepts of interest.


Acknowledgments
We thank Dr. Mahmut Uludağ for his technical assistance in processing MEDLINE data.

This work has been supported by funding from King Abdullah University of Science
and Technology (KAUST) Office of Sponsored Research (OSR) under Award No. URF/1/4355-01-
01, URF/1/4675-01-01, URF/1/4697-01-01, URF/1/5041-01-01, REI/1/5334-01-01, FCC/1/1976-46-01
and FCC/1/1976-34-01.


References
 [1] C. Jonquet, N. H. Shah, M. A. Musen, The open biomedical annotator, in: American Medical
     Informatics Association Symposium on Translational BioInformatics, AMIA-TBI’09, San
     Francisco, CA, USA, 2009, pp. 56–60.
 [2] M. Kapushesky, et al., Gene expression atlas update–a value-added database of microarray
     and sequencing-based functional genomics experiments, Nucleic Acids Research 40 (2011)
     D1077–D1081. URL: https://doi.org/10.1093/nar/gkr913. doi:10.1093/nar/gkr913.
 [3] M. Taboada, H. Rodriguez, D. Martinez, M. Pardo, M. J. Sobrido, Automated semantic
     annotation of rare disease cases: a case study, Database 2014 (2014) bau045–bau045. URL:
     https://doi.org/10.1093/database/bau045. doi:10.1093/database/bau045.
 [4] C.-H. Wei, H.-Y. Kao, Z. Lu, GNormPlus: An integrative approach for tagging genes,
     gene families, and protein domains, BioMed Research International 2015 (2015) 1–7. URL:
     https://doi.org/10.1155/2015/918710. doi:10.1155/2015/918710.
 [5] C. Liang, Y. Yu, H. Jiang, S. Er, R. Wang, T. Zhao, C. Zhang, Bond: Bert-assisted open-
     domain named entity recognition with distant supervision, in: Proceedings of the 26th
     ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD
     ’20, Association for Computing Machinery, New York, NY, USA, 2020, p. 1054–1064. URL:
     https://doi.org/10.1145/3394486.3403149. doi:10.1145/3394486.3403149.
 [6] X. Wang, Y. Guan, Y. Zhang, Q. Li, J. Han, Pattern-enhanced named entity recognition
     with distant supervision, in: 2020 IEEE International Conference on Big Data (Big Data),
     2020, pp. 818–827. doi:10.1109/BigData50022.2020.9378052.
 [7] X. Wang, V. Hu, X. Song, S. Garg, J. Xiao, J. Han, ChemNER: Fine-grained chemistry
     named entity recognition with ontology-guided distant supervision, in: Proceedings of
     the 2021 Conference on Empirical Methods in Natural Language Processing, Association
     for Computational Linguistics, Online and Punta Cana, Dominican Republic, 2021, pp.




     5227–5240. URL: https://aclanthology.org/2021.emnlp-main.424. doi:10.18653/v1/2021.
     emnlp-main.424.
 [8] L. Luo, S. Yan, P.-T. Lai, D. Veltri, A. Oler, S. Xirasagar, R. Ghosh, M. Similuk, P. N. Robinson,
     Z. Lu, PhenoTagger: a hybrid method for phenotype concept recognition using human
     phenotype ontology, Bioinformatics 37 (2021) 1884–1890. URL: https://doi.org/10.1093/
     bioinformatics/btab019. doi:10.1093/bioinformatics/btab019.
 [9] K. Zhou, Y. Li, Q. Li, Distantly supervised named entity recognition via confidence-based
     multi-class positive and unlabeled learning, in: Proceedings of the 60th Annual Meeting
     of the Association for Computational Linguistics (Volume 1: Long Papers), Association for
     Computational Linguistics, Dublin, Ireland, 2022, pp. 7198–7211. URL: https://aclanthology.
     org/2022.acl-long.498. doi:10.18653/v1/2022.acl-long.498.
[10] H. Dong, V. Suárez-Paniagua, H. Zhang, M. Wang, A. Casey, E. Davidson, J. Chen, B. Alex,
     W. Whiteley, H. Wu, Ontology-driven and weakly supervised rare disease identification
     from clinical notes, BMC Medical Informatics and Decision Making 23 (2023). URL:
     https://doi.org/10.1186/s12911-023-02181-9. doi:10.1186/s12911-023-02181-9.
[11] L. M. Schriml, et al., Human Disease Ontology 2018 update: classification, content and
     workflow expansion, Nucleic Acids Research 47 (2018) D955–D962. URL: https://doi.org/
     10.1093/nar/gky1032. doi:10.1093/nar/gky1032.
[12] A. P. Davis, T. C. Wiegers, M. C. Rosenstein, C. J. Mattingly, MEDIC: a practical disease
     vocabulary used at the Comparative Toxicogenomics Database, Database 2012 (2012). URL:
     https://doi.org/10.1093/database/bar065. doi:10.1093/database/bar065, bar065.
[13] S. Köhler, et al., Expansion of the human phenotype ontology (HPO) knowledge base and
     resources, Nucleic Acids Research 47 (2018) D1018–D1027. URL: https://doi.org/10.1093/
     nar/gky1105. doi:10.1093/nar/gky1105.
[14] NCBI, Pubmed, 1996. https://pubmed.ncbi.nlm.nih.gov/, Last accessed on 2022-04-18.
[15] Elastic, Elasticsearch, 2010. https://www.elastic.co/, Last accessed on 2022-04-18.
[16] R. I. Doğan, R. Leaman, Z. Lu, NCBI disease corpus: A resource for disease name recognition
     and concept normalization, Journal of Biomedical Informatics 47 (2014) 1–10. URL: https:
     //doi.org/10.1016/j.jbi.2013.12.006. doi:10.1016/j.jbi.2013.12.006.
[17] S. Mohan, D. Li, Medmentions: A large biomedical corpus annotated with umls concepts,
     2019. URL: https://arxiv.org/abs/1902.09476. doi:10.48550/ARXIV.1902.09476.
[18] M. Lobo, A. Lamurias, F. M. Couto, Identifying human phenotype terms by combining
     machine learning and validation rules, BioMed Research International 2017 (2017) 1–8.
     URL: https://doi.org/10.1155/2017/8565739. doi:10.1155/2017/8565739.
[19] I. Brigadir, Nltk stop words, 2019. https://github.com/igorbrigadir/stopwords/blob/master/
     en/nltk.txt, Last accessed on 2022-09-14.
[20] S. Tweedie, B. Braschi, K. Gray, T. E. M. Jones, R. L. Seal, B. Yates, E. A. Bruford, Gene-
     names.org: the HGNC and VGNC resources in 2021, Nucleic Acids Research 49 (2020)
     D939–D946. URL: https://doi.org/10.1093/nar/gkaa980. doi:10.1093/nar/gkaa980.
[21] P. Dyson, Inflect python module, 2022. https://pypi.org/project/inflect/, Last accessed on
     2022-09-14.
[22] R. Hoehndorf, P. N. Schofield, G. V. Gkoutos, The role of ontologies in biological and
     biomedical research: a functional perspective, Briefings in bioinformatics 16 (2015) 1069–




     1080.
[23] C. Sammut, G. I. Webb (Eds.), TF–IDF, Springer US, Boston, MA, 2010, pp. 986–987. URL:
     https://doi.org/10.1007/978-0-387-30164-8_832. doi:10.1007/978-0-387-30164-8_
     832.
[24] L. A. Ramshaw, M. P. Marcus, Text chunking using transformation-based learning, in:
     ACL Third Workshop on Very Large Corpora, 1995, pp. 82–94. doi:https://doi.org/
     10.48550/arXiv.cmp-lg/9505040.
[25] M. Honnibal, I. Montani, spaCy 2: Natural language understanding with Bloom embeddings,
     convolutional neural networks and incremental parsing, 2017. To appear.
[26] F. Z. Smaili, X. Gao, R. Hoehndorf, Onto2Vec: joint vector-based representation of biological
     entities and their ontology-based annotations, Bioinformatics 34 (2018) i52–i60. URL:
     https://doi.org/10.1093/bioinformatics/bty259. doi:10.1093/bioinformatics/bty259.
[27] R. Arp, B. Smith, A. D. Spear, Building ontologies with Basic Formal Ontology, The MIT
     Press, Cambridge, Massachusetts; London, England, 2015.
[28] J. Hastings, et al., Chebi in 2016: Improved services and an expanding collection of
     metabolites, Nucleic acids research 44 (2016) D1214—9. URL: https://europepmc.org/
     articles/PMC4702775. doi:10.1093/nar/gkv1031.
[29] T. Bakken, L. Cowell, B. D. Aevermann, M. Novotny, R. Hodge, J. A. Miller, A. Lee, I. Chang,
     J. McCorrison, B. Pulendran, et al., Cell type discovery and representation in the era of
     high-content single cell phenotyping, BMC bioinformatics 18 (2017) 7–16.
[30] R. P. Huntley, M. A. Harris, Y. Alam-Faruque, J. A. Blake, S. Carbon, H. Dietze, E. C. Dimmer,
     R. E. Foulger, D. P. Hill, V. K. Khodiyar, et al., A method for increasing expressivity of
     gene ontology annotations using a compositional approach, BMC bioinformatics 15 (2014)
     1–11.
[31] C. J. Mungall, C. Torniai, G. V. Gkoutos, S. E. Lewis, M. A. Haendel, Uberon, an integrative
     multi-species anatomy ontology, Genome biology 13 (2012) 1–20.
[32] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional
     transformers for language understanding, in: Proceedings of the 2019 Conference of the
     North American Chapter of the Association for Computational Linguistics: Human
     Language Technologies, Association for Computational Linguistics, 2019. URL: https:
     //doi.org/10.18653/v1/n19-1423. doi:10.18653/v1/n19-1423.
[33] T. C. Rajapakse, Simple transformers, https://github.com/ThilinaRajapakse/
     simpletransformers, 2019.
[34] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, J. Kang, BioBERT GitHub repository,
     2019. https://github.com/dmis-lab/biobert.



A. Appendix
    • Additional File 1 (AdditionalFile1.xls): the first sheet, named “performance_on_abstracts”,
      contains the performances of the models trained on the weakly labeled abstract datasets
      selected based on the top [1-5] hits from the Elasticsearch index. The second sheet,
      named “manual_error_analysis”, contains our manual analysis results on the False
      Positives from the GSC+ dataset. The file is available from GitHub:
      https://github.com/bio-ontology-research-group/OntoNER



