<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Exploring the Use of Ontology Components for Distantly-Supervised Disease and Phenotype Named Entity Recognition</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sumyyah Toonsi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Şenay Kafkas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Robert Hoehndorf</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computational Bioscience Research Center, King Abdullah University of Science and Technology (KAUST)</institution>
          ,
          <addr-line>Thuwal, 23955, Kingdom of</addr-line>
          <country country="SA">Saudi Arabia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Computer, Electrical and Mathematical Sciences &amp; Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST)</institution>
          ,
          <addr-line>Thuwal, 23955, Kingdom of</addr-line>
          <country country="SA">Saudi Arabia</country>
        </aff>
      </contrib-group>
      <fpage>13</fpage>
      <lpage>24</lpage>
      <abstract>
        <p>The lack of curated corpora is one of the major obstacles for Named Entity Recognition (NER). With advances in deep learning and the development of robust language models, distant supervision using weakly labeled data is often used to alleviate this problem. Previous approaches used weakly labeled corpora from Wikipedia or from the literature. However, to the best of our knowledge, none of them explored the use of the different ontology components for disease/phenotype NER under the distant supervision scheme. In this study, we explored whether different ontology components can be used to develop a distantly supervised disease/phenotype entity recognition model. We trained different models by considering ontology labels, synonyms, definitions, axioms and their combinations, in addition to a model trained on literature. Results showed that content from the disease/phenotype ontologies can be exploited to develop a NER model performing at the state-of-the-art level. In particular, models that used both the ontology definitions and axioms showed competitive performance compared to the model trained on literature. This removes the need to find and annotate external corpora. Furthermore, models trained using ontology components made zero-shot predictions on the test datasets that were not made by the models trained on the literature-based datasets.</p>
      </abstract>
      <kwd-group>
        <kwd>Named Entity Recognition</kwd>
        <kwd>Text mining</kwd>
        <kwd>ontologies</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Named Entity Recognition (NER) is a Natural Language Processing (NLP) task that aims to
identify and classify named entities such as organisations, persons, diseases and genes in text. NER
is a challenging task due to the nature of language, which includes abbreviations, synonymous
entities, and, in general, variable descriptions of entities.</p>
      <p>
        Early methods for NER used dictionaries due to their applicability and time efficiency. Lexical
approaches such as the NCBO (National Center for Biomedical Ontology) annotator [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], ZOOMA
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], and the OBO (Open Biological and Biomedical Ontologies) annotator [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] are not able to
recognise new concepts and cannot detect all variations of expressions. This is because once
dictionaries are constructed with terms, they can only find exact matches to those terms. Hence,
dictionary-based approaches suffer from low recall.
      </p>
      <p>
        With the emergence of machine learning, better NER methods were developed. This was
possible through exposing statistical models to curated text where mentions of entities are
identified by human curators and provided to these models. Subsequently, these models were
able to generalize to unseen entities better than previous methods. For instance, GNormPlus [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
was developed to find gene/protein mentions using a supervised model which demonstrated
competitive results at the time. Although supervised methods showed remarkable improvements
in performance, they require curated instances for the model to learn. That is, the model expects
instances of text where mentions of entities are clearly provided to learn to distinguish concepts
of interest. This becomes a serious problem when one wants to recognise a novel/unexplored
concept. Moreover, supervised methods often fail to recognise concepts uncovered by the
curated corpora.
      </p>
      <p>
        To alleviate the need for curated corpora, distant supervision was explored for NER. In
particular, distantly supervised models are trained on a weakly labeled training set, i.e., a set obtained from
an imprecise source. For instance, dictionaries could be used to annotate text with exact matches,
which can produce both false positives and false negatives. Methods like BOND [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], PatNER [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ],
ChemNER [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], PhenoTagger [8], Conf-MPU [9] and Dong and colleagues [10] demonstrated the
potential of distant supervision for NER. The aforementioned methods created weakly labeled
sets using labels and synonyms found in ontologies/vocabularies to extract training instances
from unlabeled corpora. Later, these instances were used to train different models which in
some cases outperformed state-of-the-art methods.
      </p>
      <p>Inspired by the advances achieved by distant supervision, we explored the contribution of
different components of ontologies (labels and synonyms, definitions, and complex axioms)
to the task of NER under the distant supervision scheme. In all of the previously mentioned
distantly supervised NER methods, only labels and synonyms of ontologies/vocabularies were
used to create the weakly labeled corpora from literature. The use of different ontology
components to develop NER models has not been comprehensively explored for diseases/phenotypes.
In addition to the use of labels and synonyms, in this study, we go a step further to explore the
use of definitions and axioms to develop a disease/phenotype NER model. We hypothesize that
the dense and rich knowledge found in ontologies can be used to develop NER models without
the need for external corpora such as literature abstracts. We conducted our experiments on
disease and phenotype entity recognition because the study of diseases and phenotypes is
important for understanding disease diagnosis, treatment and epidemiology.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Materials and Methods</title>
      <sec id="sec-2-1">
        <title>2.1. Ontologies, literature resource and benchmark corpora</title>
        <sec id="sec-2-1-1">
          <title>2.1.1. Ontologies</title>
          <p>We used the Disease Ontology (DO) [11] on 15/April/2022) (downloaded on 1/March/2022)
and the MEDIC vocabulary [12] in our study. DO is an ontology from the Open Biomedical
Ontologies (OBO) [11], whereas MEDIC is a vocabulary of disease terms represented in the
Web Ontology Language (OWL) [12]. We used the Human Phenotype Ontology (HPO) [13]
(downloaded on 5/Jan/2022) for the phenotype concepts.</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>2.1.2. Literature</title>
          <p>We used Medline [14] as a literature resource to generate our abstract-based weakly labeled
dataset. To select abstracts that cover ontology concepts, we used an in-house index covering
32,923,095 Medline records (downloaded on Dec-15-2022) generated using Elasticsearch [15].</p>
        </sec>
        <sec id="sec-2-1-3">
          <title>2.1.3. Benchmark corpora</title>
          <p>To evaluate the named entity recognition models, we used four benchmark corpora: the NCBI–
Disease Corpus [16], the MedMentions Corpus (disease and phenotype subsets) [17], and GSC+ [18].
NCBI–Disease is a widely used corpus where disease mentions are annotated and reviewed by
multiple annotators. MedMentions is a large corpus annotated with an extensive set of Unified
Medical Language System (UMLS) concepts. We selected the abstracts with disease annotations
from MedMentions and named this the MedMentions–disease Corpus. To form this corpus, we
used UMLS-to-MeSH mappings from UMLS to obtain the MeSH codes and selected the disease
concepts which exist in our disease dictionary (described in Section 2.2). Similarly, we selected
the abstracts with phenotype concepts for which we found UMLS-to-HPO mappings and
named this dataset MedMentions–phenotypes. GSC+ is a widely used benchmarking dataset
covering phenotype concepts, particularly from HPO. We used the test dataset version released
by [8]. Table 1 shows the distribution of the abstracts and annotations in the four benchmark
corpora.</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Dictionary generation</title>
        <p>We generated and used two dictionaries to weakly label Medline abstracts for disease and
phenotype concepts. To generate our dictionaries, first, we extracted the labels and synonyms
of all concepts from MEDIC, DO and HPO. Second, we filtered out potentially ambiguous
labels/synonyms from the dictionary: stop words, short labels/synonyms (1 or 2 characters long) and
labels/synonyms shared by two different concepts. For example, DO
contains the synonym "go" for the "geroderma osteodysplasticum" concept (DOID:0111266).
The synonym "go" is ambiguous with the verb "go". Filtering out ambiguous names is a common
practice in text mining workflows that rely on lexical matches. We used the Natural
Language Toolkit (NLTK) stop words [19] and filtered out any exact match with the
labels/synonyms in MEDIC, DO and HPO. In none of these sources did we find a match with the list of
stop words. We also filtered out the labels/synonyms having fewer than 3 characters to avoid
false positives. Additionally, for the generation of the dictionary for diseases, we filtered out
all the disease labels/synonyms which exactly match protein labels/synonyms from the
HUGO Gene Nomenclature Committee (HGNC) Database [20] to avoid false positive matches
with protein names. Third, we generated the plural form of each label/synonym by using
the Inflect Python module [21]. For example, the module generates “tetanic cataracts” for the
given multi-word term “tetanic cataract” (DOID:13822). Our final disease dictionary covers
244,903 disease labels and synonyms of 29,374 distinct concepts from MEDIC and DO. The final
phenotype dictionary covers 79,010 phenotype labels and synonyms of 14,631 distinct concepts
from HPO.</p>
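        <p>The filtering steps above can be illustrated with a minimal sketch. It is not the pipeline used in this study: the stop-word and protein-name sets are tiny hypothetical stand-ins for the NLTK list and the HGNC names, the identifiers prefixed with EX are invented, and plural generation (done with the Inflect module in our workflow) is left out so that the example only needs the Python standard library.</p>
        <preformat><![CDATA[
```python
from collections import defaultdict

# Hypothetical stand-ins for the real resources (NLTK stop words, HGNC names).
STOP_WORDS = {"a", "an", "and", "the", "was"}
PROTEIN_NAMES = {"insulin"}


def build_dictionary(entries, stop_words=STOP_WORDS,
                     protein_names=PROTEIN_NAMES, min_length=3):
    """Return {lowercased name: concept id} after the filtering steps.

    entries: iterable of (concept_id, name) pairs (labels and synonyms).
    """
    concepts_by_name = defaultdict(set)
    for concept_id, name in entries:
        concepts_by_name[name.strip().lower()].add(concept_id)

    dictionary = {}
    for name, concept_ids in concepts_by_name.items():
        if len(name) < min_length:   # 1- or 2-character names such as "go"
            continue
        if name in stop_words:       # exact match with a stop word
            continue
        if name in protein_names:    # clashes with a protein name
            continue
        if len(concept_ids) != 1:    # name shared by different concepts
            continue
        dictionary[name] = next(iter(concept_ids))
    return dictionary


entries = [
    ("DOID:0111266", "geroderma osteodysplasticum"),
    ("DOID:0111266", "go"),
    ("DOID:13822", "tetanic cataract"),
    ("EX:1", "insulin"),        # hypothetical concept clashing with a protein
    ("EX:2", "shared name"),    # hypothetical ambiguous synonym
    ("EX:3", "shared name"),
]
disease_dictionary = build_dictionary(entries)
```
]]></preformat>
        <p>In this toy input only “geroderma osteodysplasticum” and “tetanic cataract” survive; “go” is dropped by the length filter, “insulin” by the protein-name filter, and “shared name” because it maps to two concepts.</p>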
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Ontology components used</title>
        <p>An ontology, as previously described in [22], has four main components:</p>
        <list list-type="bullet">
          <list-item><p>Classes and relations, where classes and relations are assigned unique identifiers.</p></list-item>
          <list-item><p>Domain vocabulary, where labels and synonyms are linked to ontology classes and relations.</p></list-item>
          <list-item><p>Textual definitions, where descriptions about classes and relations are provided, usually in natural language.</p></list-item>
          <list-item><p>Formal axioms, where relations between concepts are described in some formal language and possibly linked to other ontologies and sources.</p></list-item>
        </list>
        <p>We used labels and synonyms, textual definitions, and formal axioms components separately to
create weakly labeled corpora and the statistics are reported in Table 2.</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Training dataset construction</title>
        <sec id="sec-2-4-1">
          <title>2.4.1. Abstracts from literature</title>
          <p>
            To generate the training set for distant supervision, first, we retrieved the relevant literature by
searching the indexed Medline for the exact match of each label/synonym from the dictionaries.
We retrieved the top 1 to 5 Medline abstract/title hits per concept, identified based on
the default Elasticsearch relevance scoring settings (TF-IDF [
            <xref ref-type="bibr" rid="ref8">23</xref>
            ] based scoring). Second,
we used the dictionaries to annotate the downloaded abstracts lexically and converted the
annotations to the I-O-B format (a common format for tagging tokens in a chunking task where
B indicates the first token (Beginning) of an annotation, I a subsequent (Inside) token of the same
annotation, and O a token that is not annotated (Outside)) [
            <xref ref-type="bibr" rid="ref9">24</xref>
            ] by using spaCy
[
            <xref ref-type="bibr" rid="ref10">25</xref>
            ]. Finally, we obtained two sets of corpora: one for the disease concepts and the other for the
phenotype concepts. We found 16,307 distinct phenotype labels/synonyms belonging to 6,962
classes from HPO in at least one Medline record by searching the indexed literature. These
concepts are covered by 16,096, 31,372, 46,032, 60,098 and 74,087 distinct Medline abstracts/titles
at the top 1, 2, 3, 4 and 5 hits, respectively, and we used them as our training sets for phenotypes. We
found 35,333 distinct disease labels/synonyms linked to 8,400 distinct concepts from MEDIC
and DO in at least one Medline record. These concepts are covered by 41,698, 81,007, 118,295,
154,060 and 187,462 distinct Medline abstracts/titles at the top 1, 2, 3, 4 and 5 hits, respectively, and we
used them as our training sets for disease concepts.
          </p>
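          <p>The lexical annotation and I-O-B conversion described above can be sketched as follows. This is an illustrative simplification under stated assumptions: tokens are produced by whitespace splitting (our workflow used spaCy), matching is case-insensitive, and the longest dictionary match is preferred at each position. The function name <monospace>tag_iob</monospace> is hypothetical.</p>
          <preformat><![CDATA[
```python
def tag_iob(tokens, dictionary_terms):
    """Label tokens with B/I/O using longest exact dictionary matches.

    tokens: list of token strings.
    dictionary_terms: iterable of terms, each a whitespace-separated string.
    """
    term_tuples = {tuple(term.lower().split()) for term in dictionary_terms}
    max_len = max((len(term) for term in term_tuples), default=0)
    lowered = [token.lower() for token in tokens]

    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate span first, then shorter ones.
        for length in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(lowered[i:i + length]) in term_tuples:
                matched = length
                break
        if matched:
            tags[i] = "B"                       # first token of the mention
            for j in range(i + 1, i + matched):
                tags[j] = "I"                   # remaining tokens of the mention
            i += matched
        else:
            i += 1                              # token stays "O"
    return tags


tokens = "Patients with tetanic cataracts were examined".split()
terms = ["tetanic cataract", "tetanic cataracts", "cataracts"]
tags = tag_iob(tokens, terms)
```
]]></preformat>
          <p>Because only exact dictionary matches are labeled, any mention missing from the dictionary is tagged O, which is one source of the false negatives inherent to weak labeling.</p>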
        </sec>
        <sec id="sec-2-4-2">
          <title>2.4.2. Labels and synonyms</title>
          <p>Using the labels and synonyms taken directly from the ontologies, we created two sets, one for phenotypes and
one for diseases. For phenotypes, the labels and synonyms extracted from HPO were directly considered
as positives, as shown in Table 3. For diseases, we used the labels and synonyms from DO and MEDIC.
The labels and synonyms were retrieved from the dictionaries described in Section 2.2.</p>
        </sec>
        <sec id="sec-2-4-3">
          <title>2.4.3. Definitions</title>
          <p>
            Definitions in DO are available in natural language. To associate a concept with its definition,
we added the concept label/synonyms to the beginning of the definition as shown in Table 3. For
concepts which lacked definitions, we simply included their labels/synonyms with a dummy
sentence replicated for all. For instance, if a disease X does not have a definition, its dummy
definition is “X is a disease”. Since definitions can include other concepts (e.g. parent concepts)
in their description, mentions of such concepts can be troublesome. To partially resolve this
issue, we annotated the definitions with the dictionaries described in Section 2.2. Matches against
the dictionaries were treated as positive mentions of concepts. In total, we retrieved 9,435
definitions from DO and used dummy definitions for 19,939 concepts. For phenotypes, we
included definitions for 10,202 concepts and used dummy definitions for 2,451 concepts.
          </p>
        </sec>
        <sec id="sec-2-4-4">
          <title>2.4.4. Axioms</title>
          <p>
            Axioms are not readily available for natural language tasks since they are expressed in a formal
language. To tackle this issue, we first processed axioms as previously described in [
            <xref ref-type="bibr" rid="ref11">26</xref>
            ]. Next,
we replaced ontology identifiers with their labels/synonyms. We also included axioms which
reference external ontologies and replaced their identifiers with names as shown in Table 3.
          </p>
          <p>
            For diseases, we used 30,834 axioms from DO. For phenotypes, we included 37,062 axioms from
HPO. Axioms of both ontologies included references to external ontologies, which we downloaded
and processed to map their identifiers to their names. The external ontologies that were included
are the Basic Formal Ontology (BFO) [
            <xref ref-type="bibr" rid="ref12">27</xref>
            ], the Chemical Entities of Biological Interest (ChEBI)
[
            <xref ref-type="bibr" rid="ref13">28</xref>
            ], the Cell Ontology (CL) [
            <xref ref-type="bibr" rid="ref14">29</xref>
            ], the Gene Ontology (GO), the Relation Ontology (RO) [
            <xref ref-type="bibr" rid="ref15">30</xref>
            ],
and the Uber-anatomy Ontology (UBERON) [
            <xref ref-type="bibr" rid="ref16">31</xref>
            ].
          </p>
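          <p>As a rough illustration of how definitions and axioms can be turned into training text, the sketch below prefixes a definition with the concept name (falling back to the dummy sentence when no definition exists) and replaces identifiers in an axiom string with their labels. Several details are assumptions made only for this example: joining name and definition with a single space, the <monospace>PREFIX:digits</monospace> identifier pattern, the axiom syntax, the function names, and all EX identifiers and labels.</p>
          <preformat><![CDATA[
```python
import re


def definition_sentence(name, definition=None, kind="disease"):
    """Prefix a definition with the concept name, or build a dummy sentence."""
    if definition:
        return f"{name} {definition}"
    return f"{name} is a {kind}"


# Assumed identifier shape, e.g. "DOID:13822" or "HP:0032382".
ID_PATTERN = re.compile(r"\b[A-Za-z]+:\d+\b")


def verbalise_axiom(axiom, labels):
    """Replace ontology identifiers in an axiom string with their labels.

    Identifiers without a known label are left untouched.
    """
    return ID_PATTERN.sub(lambda m: labels.get(m.group(0), m.group(0)), axiom)


# Hypothetical labels and axiom, for illustration only.
labels = {"EX:0000001": "disease", "EX:0000002": "example disease"}
dummy = definition_sentence("tetanic cataract")
text = verbalise_axiom("EX:0000002 SubClassOf EX:0000001", labels)
```
]]></preformat>
          <p>The resulting strings can then be annotated with the dictionaries in the same way as literature abstracts.</p>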
        </sec>
      </sec>
      <sec id="sec-2-5">
        <title>2.5. Named entity recognition using distant supervision</title>
        <p>NER refers to identifying boundaries of entity mentions in text (disease and phenotype mentions
in our case). We used distant supervision to train our models by using BioBERT to recognise
disease and phenotype mentions in text. Figure 1 depicts the system overview.</p>
        <p>
          BioBERT is a BERT (Bidirectional Encoder Representations from Transformers) [
          <xref ref-type="bibr" rid="ref17">32</xref>
          ]
pretrained language model based on large biomedical corpora. BERT is a contextualized word
representation model trained using masked language modeling. It provides self-supervised deep
bidirectional representations from unlabeled text by jointly conditioning on both left and right
contexts. The pre-trained BERT model can be fine-tuned with an additional output layer to
generate models for various desired NLP tasks. We used simpletransformers [
          <xref ref-type="bibr" rid="ref18">33</xref>
          ] which provides
a wrapper model to distantly supervise an entity recognition model. More specifically, the
wrapped model is used to fine-tune BERT models by adding a token-level classifier on top that
classifies tokens into one of the output classes which are I-O-B (Inside-Outside-Beginning).
In the training phase, our models are initialised with weights from BioBERT-Base v1.1 [
          <xref ref-type="bibr" rid="ref19">34</xref>
          ]
and then fine-tuned on the disease and phenotype entity recognition task using our training
corpora.
        </p>
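        <p>A token-level classifier is trained on one (token, label) pair per row. The sketch below flattens weakly labeled sentences into such rows. The column names <monospace>sentence_id</monospace>, <monospace>words</monospace> and <monospace>labels</monospace> follow the layout that, to our understanding, simpletransformers expects for NER training data; they should be treated as an assumption and checked against the library documentation. The function name is hypothetical, and the model training call itself is omitted.</p>
        <preformat><![CDATA[
```python
def to_training_rows(tagged_sentences):
    """Flatten [(tokens, tags), ...] into one row per token."""
    rows = []
    for sentence_id, (tokens, tags) in enumerate(tagged_sentences):
        if len(tokens) != len(tags):
            raise ValueError(f"sentence {sentence_id}: token/tag length mismatch")
        for word, label in zip(tokens, tags):
            rows.append({"sentence_id": sentence_id, "words": word, "labels": label})
    return rows


tagged = [
    (["tetanic", "cataract", "is", "a", "disease"], ["B", "I", "O", "O", "O"]),
    (["Patients", "were", "examined"], ["O", "O", "O"]),
]
rows = to_training_rows(tagged)
```
]]></preformat>
        <p>A list of such dictionaries can be converted to a tabular structure before being handed to a training routine.</p>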
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
      <p>We set up our experiments on four separate benchmarking corpora covering phenotype and
disease concepts: NCBI–disease, MedMentions–disease, MedMentions–phenotype and GSC+.
We reported our NER results using the Precision, Recall and F-score metrics. We used a relaxed
scheme to calculate the metrics where we considered any partial overlap between the prediction
and the curated annotations to be a true positive. That is, predictions are considered to be
positives whenever the indices (locations in text) of the prediction and the curated annotations
overlap.</p>
      <p>[Figure 1: System overview. Training phase: a dictionary (labels, synonyms, plurals) is constructed from the ontology; distant datasets are generated from the ontology components (labels/synonyms, axioms, definitions) and from indexed PubMed titles and abstracts; a deep learning model (BioBERT) is trained with Simple Transformers. Test phase: the trained model performs named entity recognition on test text and produces annotated text.]</p>
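      <p>The relaxed matching scheme can be made concrete with a small sketch. Spans are assumed to be half-open character offsets (start, end). How true positives are counted for precision versus recall is not fixed by the description above; here a prediction counts as correct if it overlaps any curated annotation, and a curated annotation counts as found if any prediction overlaps it. The function names are hypothetical.</p>
      <preformat><![CDATA[
```python
def spans_overlap(a, b):
    """True if half-open spans (start, end) share at least one position."""
    return max(a[0], b[0]) < min(a[1], b[1])


def relaxed_scores(predicted, gold):
    """Return (precision, recall, f1) under partial-overlap matching."""
    correct_predictions = sum(
        1 for p in predicted if any(spans_overlap(p, g) for g in gold)
    )
    found_annotations = sum(
        1 for g in gold if any(spans_overlap(g, p) for p in predicted)
    )
    precision = correct_predictions / len(predicted) if predicted else 0.0
    recall = found_annotations / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = [(0, 10), (20, 30)]
predicted = [(5, 12), (40, 45)]
scores = relaxed_scores(predicted, gold)
```
]]></preformat>
      <p>In this toy case one of two predictions overlaps a curated annotation and one of two curated annotations is found, so precision, recall and F-score are all 0.5.</p>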
      <p>Table 4 shows the performance of the disease NER models which are distantly supervised on
different ontology components or on abstracts (best F1-score is achieved at top 1, see Additional
File 1) on the disease test sets (see Table 1). For the sake of comparison, we also included a
supervised BioBERT model that is trained on the NCBI–disease training set. Our results showed
that supervised BioBERT trained on the curated set performed the best on NCBI–disease (0.94
F1-score) because concepts are highly conserved in this dataset. To fairly compare the performance
of the methods, we further evaluated the models on the MedMentions–disease dataset. Results
showed that the distantly supervised models (trained on abstracts and on definitions plus axioms)
achieved higher F1-scores (0.68 for abstracts and 0.67 for definitions and axioms) compared
to the model trained on the curated set (0.66 F1-score), which is biased towards the
NCBI–disease dataset (we found an 80% overlap in concept IDs between the NCBI training
and test sets). The models trained solely on labels and synonyms, axioms, or definitions showed
lower F1-scores compared to the model trained on abstracts. On the other hand, the model
trained on definitions plus axioms achieved a competitive F1-score compared to the model
trained on abstracts. This result is more evident on the MedMentions–disease test set.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Discussion</title>
      <p>Our main goal was to explore whether ontology components can help to develop distantly
supervised disease/phenotype entity recognition models which are competitive with the
state of the art. To that end, we exploited ontological components to create textual context using the
labels/synonyms, axioms and definitions. We observed that using the context in ontologies
via distant supervision aids in developing a NER model at the state-of-the-art level. The
models trained solely on labels and synonyms achieved the lowest performance, simply due to lack of context,
whereas the models incorporating context such as axioms and definitions improved
upon the models that lack context.</p>
      <p>The disease NER model trained on the axioms and definitions achieved a competitive F1-score
compared to the model trained on the abstracts only. However, we observed a 6% discrepancy
between the phenotype NER models trained on the abstracts (best F1-score is achieved at top
2) and on axioms and definitions together. To investigate the reason for this discrepancy, we
focused on the False Positive (FP) predictions that we obtained on the GSC+ test corpus. The
model trained on the weakly labeled abstracts produced 440 FPs while the model trained on
the phenotype definitions and axioms produced 608 FPs. We found that 184 out of the 608 FPs
were produced only by the model trained on definitions and axioms and not by the one
trained on the abstracts. We randomly sampled 20 FPs from these 184 FPs for further manual
analysis. Our manual analysis of these 20 FPs showed that all of them were actually True
Positives that had been missed by the GSC+ dataset. For example, we found that “Uniparental
disomy” (HP:0032382) in PMID:8103288 was captured correctly by the model but was missed
by the GSC+ annotations. More importantly, we observed that the majority of the FPs were not
present in the definitions and axioms training corpus but were rather predicted as
zero-shot instances (i.e. instances that were not seen by the model during training). For example,
“Angelman syndrome” in PMID:8786067, which does not correspond to any label/synonym in
HPO and does not exist in the corpus, was annotated by the model trained on definitions and
axioms. Furthermore, the model trained on literature abstracts did not have these FPs since
they were specifically included as O (Outside) classes in the training set. Details on our manual analysis
can be found in Additional File 1.</p>
      <p>We conducted our study on DO and HPO. These ontologies are widely used and therefore
contain dense content which can help to generate sufficiently large weakly labeled datasets.
Although the approach is generic and its utility can be explored for any given ontology, the
performance would depend on the density of the content of the ontology of choice. That is, if the
ontology does not sufficiently describe a concept, it is not possible to obtain a well-performing
model.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In conclusion, our analysis showed that the ontology components can provide a suitable corpus
to build a NER model that is competitive with the state of the art. This alleviates the need for
annotating a large number of abstracts and facilitates the creation of weakly labeled training
corpora. Easily obtained corpora are desirable since they reduce both the computational and
time overheads. To the best of our knowledge, this is the first work that uses ontology axioms to build
disease/phenotype NER models.</p>
      <p>Additionally, the models trained on ontology components were capable of zero-shot predictions
on the test datasets. This was not the case for the models trained on curated sets and the
models trained on the large weakly labeled literature abstracts. Our approach is generic and
its utility can be explored with any other given ontology which has sufficient content
describing the concepts of interest.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>We thank Dr. Mahmut Uludağ for his technical assistance in processing MEDLINE data.
This work has been supported by funding from King Abdullah University of Science
and Technology (KAUST) Office of Sponsored Research (OSR) under Award No.
URF/1/4355-01-01, URF/1/4675-01-01, URF/1/4697-01-01, URF/1/5041-01-01, REI/1/5334-01-01, FCC/1/1976-46-01
and FCC/1/1976-34-01.</p>
      <p>5227–5240. URL: https://aclanthology.org/2021.emnlp-main.424. doi:10.18653/v1/2021.emnlp-main.424.
[8] L. Luo, S. Yan, P.-T. Lai, D. Veltri, A. Oler, S. Xirasagar, R. Ghosh, M. Similuk, P. N. Robinson, Z. Lu, PhenoTagger: a hybrid method for phenotype concept recognition using human phenotype ontology, Bioinformatics 37 (2021) 1884–1890. URL: https://doi.org/10.1093/bioinformatics/btab019. doi:10.1093/bioinformatics/btab019.
[9] K. Zhou, Y. Li, Q. Li, Distantly supervised named entity recognition via confidence-based multi-class positive and unlabeled learning, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 7198–7211. URL: https://aclanthology.org/2022.acl-long.498. doi:10.18653/v1/2022.acl-long.498.
[10] H. Dong, V. Suárez-Paniagua, H. Zhang, M. Wang, A. Casey, E. Davidson, J. Chen, B. Alex, W. Whiteley, H. Wu, Ontology-driven and weakly supervised rare disease identification from clinical notes, BMC Medical Informatics and Decision Making 23 (2023). URL: https://doi.org/10.1186/s12911-023-02181-9. doi:10.1186/s12911-023-02181-9.
[11] L. M. Schriml, et al., Human Disease Ontology 2018 update: classification, content and workflow expansion, Nucleic Acids Research 47 (2018) D955–D962. URL: https://doi.org/10.1093/nar/gky1032. doi:10.1093/nar/gky1032.
[12] A. P. Davis, T. C. Wiegers, M. C. Rosenstein, C. J. Mattingly, MEDIC: a practical disease vocabulary used at the Comparative Toxicogenomics Database, Database 2012 (2012). URL: https://doi.org/10.1093/database/bar065. doi:10.1093/database/bar065, bar065.
[13] S. Köhler, et al., Expansion of the human phenotype ontology (HPO) knowledge base and resources, Nucleic Acids Research 47 (2018) D1018–D1027. URL: https://doi.org/10.1093/nar/gky1105. doi:10.1093/nar/gky1105.
[14] NCBI, PubMed, 1996. https://pubmed.ncbi.nlm.nih.gov/, Last accessed on 2022-04-18.
[15] N. Elastic, Swiftype, Elastic search, 2010. https://www.elastic.co/, Last accessed on 2022-04-18.
[16] R. I. Doğan, R. Leaman, Z. Lu, NCBI disease corpus: A resource for disease name recognition and concept normalization, Journal of Biomedical Informatics 47 (2014) 1–10. URL: https://doi.org/10.1016/j.jbi.2013.12.006. doi:10.1016/j.jbi.2013.12.006.
[17] S. Mohan, D. Li, MedMentions: A large biomedical corpus annotated with UMLS concepts, 2019. URL: https://arxiv.org/abs/1902.09476. doi:10.48550/ARXIV.1902.09476.
[18] M. Lobo, A. Lamurias, F. M. Couto, Identifying human phenotype terms by combining machine learning and validation rules, BioMed Research International 2017 (2017) 1–8.</p>
      <p>
        URL: https://doi.org/10.1155/2017/8565739. doi:10.1155/2017/8565739.
[19] I. Brigadir, NLTK stop words, 2019. https://github.com/igorbrigadir/stopwords/blob/master/en/nltk.txt, Last accessed on 2022-09-14.
[20] S. Tweedie, B. Braschi, K. Gray, T. E. M. Jones, R. L. Seal, B. Yates, E. A. Bruford, Genenames.org: the HGNC and VGNC resources in 2021, Nucleic Acids Research 49 (2020) D939–D946. URL: https://doi.org/10.1093/nar/gkaa980. doi:10.1093/nar/gkaa980.
[21] P. Dyson, Inflect Python module, 2022. https://pypi.org/project/inflect/, Last accessed on 2022-09-14.
[22] R. Hoehndorf, P. N. Schofield, G. V. Gkoutos, The role of ontologies in biological and biomedical research: a functional perspective, Briefings in Bioinformatics 16 (2015) 1069–
      </p>
      <p>
        Additional file 1: AdditionalFile1.xls. The first sheet, named “performance_on_abstracts”,
contains the performance of the models trained on the weakly labeled abstract datasets
selected based on the top 1 to 5 hits from the Elasticsearch index. The second sheet, named
“manual_error_analysis”, contains our manual analysis results on the False
Positives from the GSC+ dataset. The file is available from GitHub:
https://github.com/bio-ontology-research-group/OntoNER
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Jonquet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. H.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Musen</surname>
          </string-name>
          ,
          <article-title>The open biomedical annotator</article-title>
          , in: American Medical Informatics Association Symposium on Translational BioInformatics, AMIA-TBI'09, San Francisco, CA, USA,
          <year>2009</year>
          , pp.
          <fpage>56</fpage>
          -
          <lpage>60</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kapushesky</surname>
          </string-name>
          , et al.,
          <article-title>Gene expression atlas update-a value-added database of microarray and sequencing-based functional genomics experiments</article-title>
          ,
          <source>Nucleic Acids Research</source>
          <volume>40</volume>
          (
          <year>2011</year>
          )
          <fpage>D1077</fpage>
          -
          <lpage>D1081</lpage>
          . URL: https://doi.org/10.1093/nar/gkr913. doi:10.1093/nar/gkr913.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Taboada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Rodriguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Martinez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pardo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Sobrido</surname>
          </string-name>
          ,
          <article-title>Automated semantic annotation of rare disease cases: a case study</article-title>
          ,
          <source>Database</source>
          <volume>2014</volume>
          (
          <year>2014</year>
          )
          <fpage>bau045</fpage>
          -
          <lpage>bau045</lpage>
          . URL: https://doi.org/10.1093/database/bau045. doi:10.1093/database/bau045.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.-H.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-Y.</given-names>
            <surname>Kao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <article-title>GNormPlus: An integrative approach for tagging genes, gene families, and protein domains</article-title>
          ,
          <source>BioMed Research International</source>
          <volume>2015</volume>
          (
          <year>2015</year>
          )
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          . URL: https://doi.org/10.1155/2015/918710. doi:10.1155/2015/918710.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Er</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>BOND: BERT-assisted open-domain named entity recognition with distant supervision</article-title>
          , in:
          <source>Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
          , KDD '20, Association for Computing Machinery, New York, NY, USA,
          <year>2020</year>
          , pp.
          <fpage>1054</fpage>
          -
          <lpage>1064</lpage>
          . URL: https://doi.org/10.1145/3394486.3403149. doi:10.1145/3394486.3403149.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Guan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <article-title>Pattern-enhanced named entity recognition with distant supervision</article-title>
          ,
          <source>in: 2020 IEEE International Conference on Big Data (Big Data)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>818</fpage>
          -
          <lpage>827</lpage>
          . doi:10.1109/BigData50022.2020.9378052.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Garg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <article-title>ChemNER: Fine-grained chemistry named entity recognition with ontology-guided distant supervision</article-title>
          ,
          <source>in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Online and Punta Cana, Dominican Republic,
          <year>2021</year>
          , pp.
          <fpage>1080</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>C.</given-names>
            <surname>Sammut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. I.</given-names>
            <surname>Webb</surname>
          </string-name>
          (Eds.),
          <source>TF-IDF</source>
          , Springer US, Boston, MA,
          <year>2010</year>
          , pp.
          <fpage>986</fpage>
          -
          <lpage>987</lpage>
          . URL: https://doi.org/10.1007/978-0-387-30164-8_832. doi:10.1007/978-0-387-30164-8_832.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Ramshaw</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. P.</given-names>
            <surname>Marcus</surname>
          </string-name>
          ,
          <article-title>Text chunking using transformation-based learning</article-title>
          ,
          <source>in: ACL Third Workshop on Very Large Corpora</source>
          ,
          <year>1995</year>
          , pp.
          <fpage>82</fpage>
          -
          <lpage>94</lpage>
          . doi:10.48550/arXiv.cmp-lg/9505040.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>M.</given-names>
            <surname>Honnibal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Montani</surname>
          </string-name>
          ,
          <article-title>spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing</article-title>
          ,
          <year>2017</year>
          . To appear.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>F. Z.</given-names>
            <surname>Smaili</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hoehndorf</surname>
          </string-name>
          ,
          <article-title>Onto2Vec: joint vector-based representation of biological entities and their ontology-based annotations</article-title>
          ,
          <source>Bioinformatics</source>
          <volume>34</volume>
          (
          <year>2018</year>
          )
          <fpage>i52</fpage>
          -
          <lpage>i60</lpage>
          . URL: https://doi.org/10.1093/bioinformatics/bty259. doi:10.1093/bioinformatics/bty259.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>R.</given-names>
            <surname>Arp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. D.</given-names>
            <surname>Spear</surname>
          </string-name>
          ,
          <article-title>Building ontologies with Basic Formal Ontology</article-title>
          , The MIT Press, Cambridge, Massachusetts; London, England,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hastings</surname>
          </string-name>
          , et al.,
          <article-title>ChEBI in 2016: Improved services and an expanding collection of metabolites</article-title>
          ,
          <source>Nucleic Acids Research</source>
          <volume>44</volume>
          (
          <year>2016</year>
          )
          <fpage>D1214</fpage>
          -
          <lpage>D1219</lpage>
          . URL: https://europepmc.org/articles/PMC4702775. doi:10.1093/nar/gkv1031.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>T.</given-names>
            <surname>Bakken</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cowell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. D.</given-names>
            <surname>Aevermann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Novotny</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hodge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>McCorrison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Pulendran</surname>
          </string-name>
          , et al.,
          <article-title>Cell type discovery and representation in the era of high-content single cell phenotyping</article-title>
          ,
          <source>BMC Bioinformatics</source>
          <volume>18</volume>
          (
          <year>2017</year>
          )
          <fpage>7</fpage>
          -
          <lpage>16</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>R. P.</given-names>
            <surname>Huntley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Harris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Alam-Faruque</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Blake</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Carbon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Dietze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. C.</given-names>
            <surname>Dimmer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. E.</given-names>
            <surname>Foulger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Hill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. K.</given-names>
            <surname>Khodiyar</surname>
          </string-name>
          , et al.,
          <article-title>A method for increasing expressivity of gene ontology annotations using a compositional approach</article-title>
          ,
          <source>BMC Bioinformatics</source>
          <volume>15</volume>
          (
          <year>2014</year>
          )
          <fpage>1</fpage>
          -
          <lpage>11</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>C. J.</given-names>
            <surname>Mungall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Torniai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. V.</given-names>
            <surname>Gkoutos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Haendel</surname>
          </string-name>
          ,
          <article-title>Uberon, an integrative multi-species anatomy ontology</article-title>
          ,
          <source>Genome Biology</source>
          <volume>13</volume>
          (
          <year>2012</year>
          )
          <fpage>1</fpage>
          -
          <lpage>20</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          , in:
          <source>Proceedings of the 2019 Conference of the North, Association for Computational Linguistics</source>
          ,
          <year>2019</year>
          . URL: https://doi.org/10.18653/v1/n19-1423. doi:10.18653/v1/n19-1423.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>T. C.</given-names>
            <surname>Rajapakse</surname>
          </string-name>
          , Simple Transformers, https://github.com/ThilinaRajapakse/simpletransformers,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yoon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. H.</given-names>
            <surname>So</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kang</surname>
          </string-name>
          , BioBERT GitHub repository,
          <year>2019</year>
          . (https://github.com/dmis-lab/biobert).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>