<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>C-STSS: A Context-based Short Text Semantic Similarity approach applied to biomedical named entity linking</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Asma Djellal</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maya Souilah Benabdelhafid</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Ecole Supérieure de Comptabilité et de Finance</institution>
          ,
          <addr-line>ESCF Constantine</addr-line>
          ,
          <country country="DZ">Algeria</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Lire laboratory, Abdelhamide Mehri Constantine 2 University</institution>
          ,
          <addr-line>Constantine</addr-line>
          ,
          <country country="DZ">Algeria</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This research paper delves into Human-Computer Interaction by investigating Knowledge Graph-based Question Answering systems in the biomedical domain. The study leverages Knowledge Graphs as potent tools to enhance Named Entity Linking in short texts, where limited context poses challenges. Conventional linking methods struggle with single Named Entity linking due to poor context and name variation issues, affecting their efficiency. To address these challenges, several scholars are designing Knowledge Graph-based Question Answering Systems that tackle the name variation problem by relying on the morphological forms of Named Entities, but they rarely consider their semantic similarities. This paper introduces a Context-based Short Text Semantic Similarity approach for Named Entity Linking in the biomedical domain. The proposed approach improves the performance of Question Answering systems by utilizing contextual semantic similarities in short texts and by combining knowledge-based and corpus-based methods for fine-grained meaning comparison, which allows it to address sparseness and vocabulary mismatches.</p>
      </abstract>
      <kwd-group>
        <kwd>Question Answering Systems</kwd>
        <kwd>Natural Language Processing</kwd>
        <kwd>Biomedical Named Entity Linking</kwd>
        <kwd>Contextual Semantic Similarities</kwd>
        <kwd>Short Text</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <p>
          In the ever-evolving landscape of Natural Language Processing (NLP), the challenge of deciphering the nuances of short texts, particularly within specialized domains like biomedicine, has emerged as a critical area of research. Short texts, encompassing brief queries and questions, lack the extensive context often found in longer texts, posing formidable obstacles for accurate Named Entity Linking (NEL), which is a key part of developing Question Answering Systems (QAS) [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. The core difficulty lies in disambiguating Named Entities (NE) [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], especially those sharing similar surface forms [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], and capturing subtle semantic differences essential for accurate NEL.
        </p>
        <p>
          To address these challenges, this paper proposes a novel approach that performs fine-grained meaning comparison by integrating knowledge-based and corpus-based methods [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Corpus-based methods leverage contextual information from textual data to compute general semantic relatedness between words. Meanwhile, knowledge-based methods draw upon the wealth of semantic information stored in resources like Knowledge Graphs (KG). By integrating these approaches, the study aims to overcome the sparseness and vocabulary mismatches inherent in short texts.
        </p>
        <p>This paper introduces the Context-based Short Text Semantic Similarity (C-STSS) approach, a sophisticated framework that aims to bridge the gap between the limited context of short biomedical texts and the rich semantic knowledge encompassed within specialized domains. By dissecting semantic similarities and leveraging domain-specific knowledge, C-STSS provides nuanced analysis, facilitating accurate NEL even in the face of sparse and mismatched vocabulary. This innovative approach holds the promise of revolutionizing NEL within the constraints of short texts, opening new avenues for exploration at the intersection of NLP and Human-Computer Interaction (HCI).</p>
        <p>The remainder of this paper is organized as follows. Section 2 outlines some preliminaries related to the research work. Section 3 reviews related work and analyses the drawbacks of recent biomedical NEL systems. Section 4 constitutes the bulk of the paper and presents C-STSS, our proposed approach for dealing with the NEL problem in short biomedical text. Section 5 concludes the paper and suggests directions for future work.</p>
      </sec>
    </sec>
    <sec>
      <title>2. Preliminaries</title>
      <p>[...] user queries and extracting answers by matching and reasoning in KG. For instance, to answer the question "Who is Apple CEO?" (see Figure 1), these systems tackle challenges like:</p>
      <p>1. Named Entity Recognition (NER) identifies fragments mentioning NE in text. In the above question, the mention "Apple" is identified as a NE.</p>
      <p>2. Named Entity Disambiguation (NED) seeks, for each NE, its corresponding meaning over a given KG, e.g., Wikidata. In our case, "Apple" can be linked to several Wikidata entries with different QIDs, e.g., Q89 (apple, the fruit) or Q312 (Apple Inc., the company).</p>
      <p>3. Named Entity Linking (NEL) links each NE to its exact meaning over a KG, e.g., IRIs in Wikidata, based on the surrounding context. According to the question, "Apple" has to be disambiguated as "Apple Inc., the company" with the ID Q312. Therefore, the NEL task has to link it to its IRI https://www.wikidata.org/wiki/Q312.</p>
      <p>
        It is important to note that this paper aligns with the prevailing research trend, employing the term NEL to encompass both the Disambiguation and Linking tasks, a convention adopted by several state-of-the-art approaches. Throughout the remainder of this paper, the primary focus is specifically the NEL task, rather than the complete QAS. For in-depth technical insights into the NER task, interested readers are referred to the comprehensive surveys [
        <xref ref-type="bibr" rid="ref8">8, 9</xref>
        ]. The NEL process generally involves two steps:
      </p>
      <p>• Retrieving Candidate Entities: The first step entails retrieving a set of candidate entities from the KG that the recognized NE may refer to. Various techniques are employed, including name dictionary-based methods [10], surface form expansion [11], and semantic relationships [12]. These methods rely primarily on string comparisons between the NE and the candidates, generating a set of potential entities. For example, "Apple" might be mapped to candidates like Q89 and Q312 in Wikidata (see Figure 1).</p>
      <p>• Selecting the Correct Candidate: Given that a NE can often refer to a large number of candidate entities [13], the challenge lies in selecting the most relevant one. This step requires ranking the candidate entities based on the surrounding context and selecting the highest-scored candidate that best fits the meaning of the given NE. For instance, since "Apple" can refer to both the fruit and the company, the correct candidate "Apple Inc." needs to be selected according to the context.</p>
    </sec>
    <sec id="sec-2">
      <title>3. Related Work</title>
      <p>In recent years, the focus of NLP research has extended from the general language domain to the biomedical field, driven by Biomedical NLP (BioNLP) shared tasks and the increasing application of BioNLP tools in areas like clinical research and quality improvement [14, 15, 16, 17]. More particularly, Biomedical QA (BioQA) systems have been introduced to enable innovative applications to effectively perceive, access, and understand complex biomedical knowledge [18]. On the one hand, cTAKES [19], TaggerOne [20], and QuickUMLS [21] are commonly used as rule-based, knowledge-intensive concept normalization tools. These solutions use rules to generate lexical variants for each noun phrase and then perform dictionary queries for each variant. Although they provide robust performance, they implicitly assume the availability of concept aliases in the target language and focus on normalizing mentions and recognizing NE without effectively linking them [22].</p>
      <p>Despite these developments, BioQA systems are still immature and rarely used in real-life settings. Current research often emphasizes morphological and string similarities of NE, neglecting their semantic similarities. NEL approaches are being introduced to map various expressions, terms, or abbreviations to their corresponding common semantic representation or concept identifier in a given terminology or vocabulary. Biomedical language models are being explored to improve entity-linking strategies and to achieve automatic term mapping, and some effective approaches for English corpora have been proposed. For instance, in [23], the authors proposed a collective inference approach, which leverages semantic information and structures in an ontology to solve the NEL problem for biomedical literature. Also, in [24], scholars proposed a graph-based linking approach which starts by constructing graphs for mentions, KG, and candidates and then exploits information entropy and a similarity algorithm to perform NEL. Like our approach, these contributions depend on the context and the KG. In addition, scholars in [25] proposed LATTE, a LATent Type Entity linking model, leveraging latent semantic information to improve entity linking, while the authors in [26] used semantic type information for improved entity disambiguation.</p>
      <p>In contrast to the above works, for which no benchmark had been developed to evaluate how well language models represent biomedical concepts according to their context, the authors in [27] propose a novel dataset, BioWiC, to evaluate the ability of language models to encode biomedical terms in context. Another research direction is to use, for example, BERT-based retrieve and re-rank models [28]. For instance, in [29], scholars improved biomedical pretrained language models with knowledge.</p>
      <p>Note that NEL is particularly challenging in short texts, such as questions, where limited contextual information hampers conventional linking methods. To address these challenges, this paper introduces the C-STSS approach, designed to enhance the performance of biomedical NEL systems dealing with short texts through contextual semantic similarities.</p>
      <sec id="sec-2-1">
        <title>4. C-STSS Approach for Biomedical NEL</title>
        <p>The C-STSS approach involves four main sub-processes (see Figure 2). First, the Pre-process verifies and prepares the input question and recognizes the involved NE. Then, the Expansion generates the NE context by expanding the input question. Thereafter, Candidates Generation retrieves all NE candidates from DBpedia. Finally, the Ranking sub-process uses semantic similarities to score candidates based on the generated context, and then links the NE to the highest-scored candidate. This process frames the NEL task as a ranking problem and is detailed in the following sections.</p>
      </sec>
      <sec id="sec-2-2">
        <title>4.1. Pre-Process</title>
        <p>
          The pre-processing step is vital as it significantly influences the outcome of the linking process, ensuring that the input question is refined and suitable for subsequent analysis. In the Pre-Process stage, the input question <italic>Q</italic> is subjected to critical transformations. After verifying the question's structure for grammatical or spelling errors, cleaning and normalization are performed to remove unnecessary or noisy words. This involves employing NLP techniques such as tokenization [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] and stop-word removal [30], focusing on retaining only nouns, verbs, and adjectives. Finally, cTAKES [19], an open-source NLP tool, is utilized in order to recognize the involved NE.
        </p>
        <p>It is noteworthy that, due to the brevity of questions, words from the entity mention are included in the context window, especially if the entity consists of two or more words. For instance, in the case of the NE "Malignant tumor", the contextual words "Malignant" and "tumor" are extracted as they contain meaningful common nouns.</p>
        <p>In a biomedical scenario, a sample question <italic>Q</italic> could be: "How can Cancer be prevented and detected?". Given this question as input, the pre-process outputs the set of recognized NE and a set of words <italic>W</italic>.</p>
        <sec id="sec-2-2-1">
          <p>Input: "How can Cancer be prevented and detected?"</p>
          <p>Output:</p>
          <p>• A set of words <italic>W</italic> = {Cancer, prevented, detected}</p>
          <p>• A set of NE = {Cancer}</p>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>4.2. Expansion</title>
        <p>The Expansion module aims to enhance the contextual semantic similarity measurement, particularly in short texts. In such cases, traditional entity-entity relatedness approaches become ineffective due to the lack of context, and vocabulary mismatch further complicates the measurement of similarity between candidate descriptions and context. To overcome these challenges, the Expansion module enriches and expands the question with semantically related words. Initially, a stemming algorithm is used to reduce each word <italic>w</italic> ∈ <italic>W</italic> to its root or stem in order to ensure a consistent comparison [31]. Then, it enriches the stemmed words by incorporating their synonyms using WordNet [32] as a background KG.</p>
        <p>Consequently, this module enables a more comprehensive analysis of the semantic similarities between the recognized NE and its candidates by allowing:</p>
        <p>• Lexical comparison: Words of the same family sharing the same stem, e.g., "prevention" and "prevented", can be compared. These words, although slightly varied, are semantically related.</p>
        <p>• Semantic comparison: Words with different lexical forms but similar meanings, namely synonyms, e.g., "prevented" and "avoided", bridge the vocabulary gap in short texts. Synonyms, despite their different surface forms, are strongly semantically related.</p>
        <p>At the end of the Expansion, the context window is enriched with additional related words. Following our biomedical scenario, the set of words <italic>W</italic> is enriched, resulting in the context <italic>Ctx</italic> as represented in Table 1.</p>
        <p>Input: A set of words <italic>W</italic> = {Cancer, prevented, detected}</p>
        <p>Output: A set of contextual words <italic>Ctx</italic></p>
      </sec>
      <sec>
        <title>4.3. Candidate Generation</title>
        <p>The Candidate Generation module focuses on retrieving potential candidate entities to which the NE can refer within DBpedia, a central KG comprising over 228 million entities from Wikipedia and Wikidata. The process begins with a simple string comparison to identify candidates whose names match the NE. However, dealing with name variations is a considerable challenge in the biomedical field [33]. This variation is so extensive that a single entity can have multiple names; for instance, "decreases in hemoglobin" could refer to at least four different entities in MedDRA, which all look alike: "changes in hemoglobin", "increase in hematocrit", "hemoglobin decreased", and "decreases in platelets". To address the challenge of name variation, Candidate Generation employs several techniques:</p>
        <p>• Exact String Match: Candidates sharing the exact string name with the NE are considered.</p>
        <p>• Abbreviations/Acronyms: Biomedical dictionaries are utilized to handle abbreviations and acronyms common in the biomedical domain.</p>
        <p>• Numbers: Variations in writing numbers (Arabic, Roman, or spelled out in English) are normalized for consistency.</p>
        <p>• Adjectives: Multiple adjectives associated with a single noun through connectors like "and", "/", or "or" are separated and considered individually.</p>
        <p>• Tokenization: Biomedical terms composed of multiple tokens connected by hyphens require dehyphenation for proper token sequence generation.</p>
        <p>These techniques are elaborated in Table 2, which provides an example for each case.</p>
        <p>Note that exact string matches can be retrieved using DBpedia's disambiguation pages. If multiple DBpedia entries share the same name, a disambiguation page is created to differentiate them. For that, we generate a SPARQL query, specifying the NamedEntity (disambiguation) notion and the property wikiPageDisambiguates, to retrieve all links listed on this page and add them to the set of candidates.</p>
        <p>From the previous biomedical scenario, we retrieve the set of candidates <italic>C</italic> having an exact string match with NE = Cancer by executing the SPARQL query presented in the following listing over DBpedia. The result is shown in Figure 3.</p>
      </sec>
      <sec id="sec-2-4">
        <title>4.4. Ranking</title>
        <p>The Ranking module holds immense importance in the NEL process as it discerns the most suitable candidate for the NE based on the question context. When provided with a context <italic>Ctx</italic> and a set of candidates <italic>C</italic>, this module uses a ranking algorithm to compute, for each candidate, different contextual semantic similarities.</p>
        <p>These contextual semantic similarities measure how closely the candidate aligns with the context. To this end, the algorithm computes contextual semantic similarities according to equation (1). The candidate with the highest score is identified as the correct meaning of the NE. It is essential to highlight that the similarity between each candidate and the context is measured over its description in DBpedia.</p>
        <disp-formula>
          <label>(1)</label>
          <tex-math>Score(c) = \sum \big( Sim(Ctx, c) \big)</tex-math>
        </disp-formula>
        <p>Here, <italic>Sim</italic> is a semantic similarity function. For each <italic>c</italic> ∈ <italic>C</italic>, the following semantic similarities are computed:</p>
        <p>Textual similarity: Given the NE context and a candidate <italic>c</italic> ∈ <italic>C</italic>, we create two vectors representing their textual content: the candidate description vector, noted <italic>V<sub>c</sub></italic>, and the contextual words vector, noted <italic>V<sub>ctx</sub></italic>. Note that lemmatization is applied to candidate descriptions, and stop words as well as very frequent and very rare words are omitted. We employ a standard Vector Space Model with a TF-IDF weighting scheme for representing both vectors: <italic>V</italic> = (<italic>w</italic><sub>1</sub>, <italic>w</italic><sub>2</sub>, ..., <italic>w<sub>n</sub></italic>), where each dimension <italic>w<sub>i</sub></italic> of <italic>V</italic> corresponds to the word weight and is defined as:</p>
        <disp-formula>
          <label>(2)</label>
          <tex-math>w_i = TF(t_i, d) \times IDF(t_i)</tex-math>
        </disp-formula>
        <p>Where <italic>TF</italic>(<italic>t<sub>i</sub></italic>, <italic>d</italic>) is the Term-Frequency function and denotes the frequency of the contextual word <italic>t<sub>i</sub></italic> in the candidate description <italic>d</italic>. It assesses the significance of the contextual word within the candidate's description. <italic>IDF</italic>(<italic>t<sub>i</sub></italic>) stands for Inverse Document Frequency and is derived from the number of candidates whose descriptions contain the contextual word <italic>t<sub>i</sub></italic>. In order to account for words that are overly frequent in candidate descriptions, <italic>IDF</italic>(<italic>t<sub>i</sub></italic>) is employed to assign lower weights to these less distinguishing words.</p>
        <p>Hence, in order to compute the textual similarity between the two vectors <italic>V<sub>ctx</sub></italic> = (<italic>w</italic><sub>1</sub>, <italic>w</italic><sub>2</sub>, ..., <italic>w<sub>n</sub></italic>) and <italic>V<sub>c</sub></italic> = (<italic>w'</italic><sub>1</sub>, <italic>w'</italic><sub>2</sub>, ..., <italic>w'<sub>n</sub></italic>), the cosine method is applied. This method calculates the cosine of the angle between these two vectors. It is defined as:</p>
        <disp-formula>
          <label>(3)</label>
          <tex-math>TS(Ctx, c) = \cos(V_{ctx}, V_c) = \frac{\sum_{i=1}^{n} w_i \, w'_i}{\sqrt{\sum_{i=1}^{n} w_i^{2}} \; \sqrt{\sum_{i=1}^{n} {w'_i}^{2}}}</tex-math>
        </disp-formula>
        <p>
          The primary challenge with using cosine similarity in advanced models lies in vocabulary mismatch. Cosine similarity essentially measures the correlation between the words of two textual vectors [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Consequently, this method fails to measure similarity when the vectors do not share identical words. Even if there are semantically related words, they are not taken into account. To address this drawback, we opt for knowledge-based methods to expand the input question with all semantically relevant words when generating its context. This overcomes issues such as sparseness and vocabulary mismatch while assessing textual similarity.
        </p>
        <p>Candidate Popularity: Measuring the popularity of entities is a crucial factor in determining their relevance to a given NE. According to [13], a simple linking method based solely on candidate popularity can achieve 71% accuracy. It is essential to note that certain candidates are exceptionally rare compared to others. For instance, consider NE = Cancer; while "Cancer (film)" might be a rare occurrence, "Cancer (astrology)" might be more common, with "Cancer (disease)" being the most popular entity. This observation can be formalized by analyzing candidates' incoming and outgoing links within DBpedia. The candidate popularity function, denoted as <italic>Pop</italic>(<italic>c</italic>), is defined as follows:</p>
        <disp-formula>
          <label>(4)</label>
          <tex-math>Pop(c) = \frac{Links(c)}{\sum_{c_j \in C} Links(c_j)}</tex-math>
        </disp-formula>
        <p>Here, <italic>Links</italic>(<italic>c</italic>) represents the number of links pointing to the candidate <italic>c</italic> in DBpedia.</p>
        <p>Word co-occurrence: In state-of-the-art systems, the co-occurrence feature traditionally signifies the simultaneous appearance of a set of NE within the same text, allowing them to be collectively linked. Regrettably, this approach faces limitations when applied to short texts, where the presence of multiple NE is rare. Despite that, we adapted the co-occurrence concept to measure the contextual relevance between the NE and a given candidate. In our methodology, this feature is redefined as:</p>
        <p>"The appearance of several contextual words within a given candidate description"</p>
        <p>The more distinct contextual words are found within the candidate description, the closer it aligns with the NE context. To quantify this similarity, given the NE context and a candidate <italic>c</italic>, we examine two sets of words: the set of contextual words, denoted as <italic>W<sub>ctx</sub></italic>, and the set of candidate description words, denoted as <italic>W<sub>c</sub></italic>. The word co-occurrence similarity function is defined as follows:</p>
        <disp-formula>
          <label>(5)</label>
          <tex-math>Co(Ctx, c) = \text{Co-occ}(W_{ctx}, W_c) = \frac{\sum_{i} x_i}{|W_{ctx}|}</tex-math>
        </disp-formula>
        <p>Here, ∑ <italic>x<sub>i</sub></italic> signifies the count of contextual words contained within <italic>W<sub>c</sub></italic>. This refined definition offers a nuanced understanding of word co-occurrence, enhancing the precision of context relevance measurements.</p>
        <p>The details provided above are condensed into the subsequent algorithm, which outlines our C-STSS approach for biomedical NEL. Given an input question, the C-STSS process employs the NER function to recognize the involved NE, generates the context using the Context function, and retrieves all potential candidates over DBpedia by employing the Candidates function. These candidates are selected based on the five cases explained earlier. The C-STSS algorithm furthermore incorporates functions to identify the most relevant candidate: Lemmatization is applied to the context and the candidate to omit stop words and very frequent and very rare words. The Words function retrieves feature words for the context and the candidate, shaping the subsequent analysis. The Frequency function uses a TF-IDF weighting scheme for representing the context and candidate vectors, ensuring a robust representation of the textual data.</p>
        <p>Algorithm 1: C-STSS approach for biomedical NEL</p>
        <sec id="sec-2-4-1">
          <title>Require: Question</title>
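          <p>Ensure: c ∈ C having the highest score
1: NE ← NER(Q)
2: Ctx ← Context(Q)
3: C ← Candidates(NE)
4: Best ← ∅
5: MaxScore ← 0
6: MinSD ← 0
7: L ← Lemmatization(Ctx)
8: W_ctx ← Words(L)
9: V_ctx ← Frequency(W_ctx)
10: TotalLinks ← 0
11: for each c ∈ C do
12: TotalLinks ← TotalLinks + Links(c)
13: end for
14: for each c ∈ C do
15: d ← Lemmatization(c)
16: V_c ← Frequency(d)
17: TS ← Cosine(V_ctx, V_c)
18: l ← Links(c)
19: Pop ← l / TotalLinks
20: W_c ← Words(d)
21: n ← 0
22: for each w ∈ W_ctx do
23: if w ∈ W_c then
24: n ← n + 1
25: end if
26: end for
27: Co ← n / |W_ctx|
28: Score ← ∑(TS, Pop, Co)
29: SD ← √((1/3) ∑_{i=1}^{3} (s_i − mean)²)
30: if (Score &gt; MaxScore and SD &lt; MinSD) then
31: MaxScore ← Score
32: MinSD ← SD
33: Best ← c
34: end if
35: end for
36: Return (Best)</p>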
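          <p>The final loop of Algorithm 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name select_candidate, the candidate labels, and the three similarity values per candidate are hypothetical, and the three values are assumed to be precomputed. As an additional assumption, the lowest standard deviation seen so far starts at infinity so that the first candidate can qualify.</p>
          <preformat>
```python
import math


def select_candidate(features):
    """Pick a candidate from {name: (textual, popularity, cooccurrence)}.

    The score is the sum of the three similarities; the standard deviation
    measures how balanced the three values are.
    """
    best = None
    best_score = 0.0
    best_sd = math.inf  # assumption: +inf, so the first candidate can be accepted
    for name, sims in features.items():
        score = sum(sims)
        mean = score / len(sims)
        sd = math.sqrt(sum((s - mean) ** 2 for s in sims) / len(sims))
        # Conjunctive test: higher score AND lower standard deviation.
        if score > best_score and best_sd > sd:
            best, best_score, best_sd = name, score, sd
    return best, best_score, best_sd


# Hypothetical similarity triples, for illustration only.
example_features = {
    "Cancer (disease)": (0.6, 0.7, 0.5),
    "Cancer (astrology)": (0.1, 0.25, 0.0),
    "Cancer (film)": (0.0, 0.05, 0.0),
}

best, best_score, best_sd = select_candidate(example_features)
print(best, round(best_score, 3), round(best_sd, 3))
```
          </preformat>
          <p>Taken literally, the conjunctive test can reject a later candidate that has a higher score but less balanced similarities, so the outcome may depend on the order in which candidates are visited.</p>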
          <p>To assess the similarity between words in the context
and those in the candidate descriptions, three distinct
semantic similarity metrics are calculated and combined
to score each candidate:</p>
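          <p>As a rough illustration, the three metrics (TF-IDF cosine similarity, candidate popularity, and word co-occurrence) could be sketched in plain Python as follows. All function names, the toy word lists, and the link counts are hypothetical. The smoothed IDF formula is an assumption, since the paper does not spell out its exact form, and the words are assumed to be stemmed and cleaned already.</p>
          <preformat>
```python
import math
from collections import Counter


def tfidf_vector(words, vocabulary, documents):
    """TF-IDF weights of `words` over a fixed vocabulary."""
    counts = Counter(words)
    n_docs = len(documents)
    vector = []
    for term in vocabulary:
        df = sum(1 for doc in documents if term in doc)
        # Assumption: smoothed IDF, lower for terms found in many descriptions.
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        vector.append(counts[term] * idf)
    return vector


def cosine(u, v):
    """Cosine of the angle between two equally sized vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def popularity(links, candidate):
    """Share of all candidate links that point to this candidate."""
    total = sum(links.values())
    return links[candidate] / total if total else 0.0


def cooccurrence(context_words, description_words):
    """Fraction of distinct context words found in the description."""
    ctx = set(context_words)
    if not ctx:
        return 0.0
    return len(ctx.intersection(description_words)) / len(ctx)


# Hypothetical toy data, for illustration only.
context = ["cancer", "prevent", "detect", "avoid"]
descriptions = {
    "Cancer (disease)": ["cancer", "disease", "prevent", "detect", "treatment"],
    "Cancer (astrology)": ["cancer", "zodiac", "sign", "astrology"],
}
links = {"Cancer (disease)": 900, "Cancer (astrology)": 100}

documents = list(descriptions.values())
vocabulary = sorted(set(context).union(*documents))
ctx_vec = tfidf_vector(context, vocabulary, documents)
textual = {
    name: cosine(ctx_vec, tfidf_vector(words, vocabulary, documents))
    for name, words in descriptions.items()
}
for name, words in descriptions.items():
    print(name, round(textual[name], 3), popularity(links, name),
          cooccurrence(context, words))
```
          </preformat>
          <p>On this toy data, the disease sense shares three of the four context words with its description, so it obtains the higher textual and co-occurrence values.</p>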
          <p>This similarity computation is applied iteratively to all candidate entities in order to score them. The candidate with the highest score and the lowest standard deviation is returned as the correct one.</p>
        </sec>
      </sec>
      <sec id="sec-2-5">
        <title>4.5. Discussion</title>
        <p>While various scholars focus on addressing the name variation problem in BioQA by considering the morphological forms of biomedical NE, few incorporate semantic similarities. The C-STSS approach combines NE morphological forms and contextual semantic similarities. To further enhance its efficacy, our approach integrates knowledge-based methods with corpus-based ones, alleviating issues related to sparseness and vocabulary mismatch. This fusion of techniques forms the core innovation of this research.</p>
        <p>To conclude, it is now well established that biomedical text requires methods targeted at the domain. Developments in Deep Learning and a series of successful shared challenges have contributed to steady progress in techniques for BioNLP text. Contributing to this ongoing progress and particularly focusing on computational methods, our future work will aim to create and encourage research in novel approaches for analyzing biomedical text, particularly transformer-based models, which seem to be the future of NLP as explained in recent surveys [34, 35, 36, 37, 38].</p>
    </sec>
    <sec id="sec-3">
      <title>5. Conclusion</title>
      <sec id="sec-3-1">
        <p>In recent years, KG have undergone substantial growth in both theoretical frameworks and practical applications. Despite these advancements, KGQAS encounter persistent challenges. They face limitations due to historical precedents and excessive human intervention, necessitating innovative solutions.</p>
        <p>Within the intricate domain of biomedicine, additional complexities emerge. Indeed, NEL in the medical domain is a newer problem. This paper presents a Context-based Short Text Semantic Similarity approach, designed to enhance biomedical NEL systems by exploiting contextual [...]</p>
        <p>[9] V. Yadav, S. Bethard, A survey on recent advances in named entity recognition from deep learning models, arXiv preprint arXiv:1910.11470 (2019).</p>
        <p>[10] Z. Yang, H. Lin, Y. Li, Exploiting the performance of dictionary-based bio-entity name recognition in biomedical literature, Computational biology and chemistry 32 (2008) 287–291.</p>
        <p>[11] A. Reshamwala, D. Mishra, P. Pawar, Review on natural language processing, IRACST Engineering Science and Technology: An International Journal (ESTIJ) 3 (2013) 113–116.</p>
        <p>[12] R. Meymandpour, J. G. Davis, A semantic similarity measure for linked data: An information content-based approach, Knowledge-Based Systems 109 (2016) 276–293.</p>
        <p>[13] W. Shen, J. Wang, J. Han, Entity linking with a knowledge base: Issues, techniques, and solutions, IEEE Transactions on Knowledge and Data Engineering 27 (2014) 443–460.</p>
        <p>[14] G. Frisoni, G. Moro, A. Carbonaro, A survey on event extraction for natural language understanding: Riding the biomedical literature wave, IEEE Access 9 (2021) 160721–160757.</p>
        <p>[15] T. A. Koleck, C. Dreisbach, P. E. Bourne, S. Bakken, Natural language processing of symptoms documented in free-text narratives of electronic health records: a systematic review, Journal of the American Medical Informatics Association 26 (2019) 364–379.</p>
        <p>[16] I. J. B. Young, S. Luz, N. Lone, A systematic review of natural language processing for classification tasks in the field of incident reporting and adverse event analysis, International journal of medical informatics 132 (2019) 103971.</p>
        <p>[17] E. French, B. T. McInnes, An overview of biomedical entity linking throughout the years, Journal of biomedical informatics 137 (2023) 104252.</p>
        <p>[18] Q. Jin, Z. Yuan, G. Xiong, Q. Yu, H. Ying, C. Tan, M. Chen, S. Huang, X. Liu, S. Yu, Biomedical question answering: a survey of approaches and challenges, ACM Computing Surveys (CSUR) 55 (2022) 1–36.</p>
        <p>[19] G. K. Savova, J. J. Masanz, P. V. Ogren, J. Zheng, S. Sohn, K. C. Kipper-Schuler, C. G. Chute, Mayo clinical text analysis and knowledge extraction system (ctakes): architecture, component evaluation and applications, Journal of the American Medical Informatics Association 17 (2010) 507–513.</p>
        <p>[20] R. Leaman, Z. Lu, Taggerone: joint named entity recognition and normalization with semi-markov models, Bioinformatics 32 (2016) 2839–2846.</p>
        <p>[21] L. Soldaini, N. Goharian, Quickumls: a fast, unsupervised approach for medical concept extraction, in: MedIR workshop, sigir, 2016, pp. 1–4.</p>
        <p>[22] R. Bhowmik, K. Stratos, G. de Melo, Fast and effective biomedical entity linking using a dual encoder, arXiv preprint arXiv:2103.05028 (2021).</p>
        <p>[23] J. G. Zheng, D. Howsmon, B. Zhang, J. Hahn, D. McGuinness, J. Hendler, H. Ji, Entity linking for biomedical literature, BMC medical informatics and decision making 15 (2015) 1–9.</p>
        <p>[24] H. Wang, J. G. Zheng, X. Ma, P. Fox, H. Ji, Language and domain independent entity linking with quantified collective validation, in: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015, pp. 695–704.</p>
        <p>[25] M. Zhu, B. Celikkaya, P. Bhatia, C. K. Reddy, Latte: Latent type modeling for biomedical entity linking, in: Proceedings of the AAAI conference on artificial intelligence, volume 34, 2020, pp. 9757–9764.</p>
        <p>[26] S. Vashishth, D. Newman-Griffis, R. Joshi, R. Dutt, C. P. Rosé, Improving broad-coverage medical entity linking with semantic type prediction and large-scale datasets, Journal of biomedical informatics 121 (2021) 103880.</p>
        <p>[27] H. Rouhizadeh, I. Nikishina, A. Yazdani, A. Bornet, B. Zhang, J. Ehrsam, C. Gaudet-Blavignac, N. Naderi, D. Teodoro, Biowic: An evaluation benchmark for biomedical concept representation, bioRxiv (2023) 2023–11.</p>
        <p>[28] Y. He, Z. Zhu, Y. Zhang, Q. Chen, J. Caverlee, Infusing disease knowledge into bert for health question answering, medical inference and disease name recognition, arXiv preprint arXiv:2010.03746 (2020).</p>
        <p>[29] Z. Yuan, Y. Liu, C. Tan, S. Huang, F. Huang, Improving biomedical pretrained language models with knowledge, arXiv preprint arXiv:2104.10344 (2021).</p>
        <p>[30] Z. Xu, X. Luo, S. Zhang, X. Wei, L. Mei, C. Hu, Mining temporal explicit and implicit semantic relations between entities using web search engines, Future Generation Computer Systems 37 (2014) 468–477.</p>
        <p>[31] C. Ramasubramanian, R. Ramya, Effective preprocessing activities in text mining using improved porter's stemming algorithm, International Journal of Advanced Research in Computer and Communication Engineering 2 (2013) 4536–4538.</p>
        <p>[32] C. Fellbaum, WordNet: An electronic lexical database, MIT press, 1998.</p>
        <p>[33] L. Chen, G. Varoquaux, F. M. Suchanek, A lightweight neural model for biomedical entity linking, in: Proceedings of the AAAI conference on artificial intelligence, volume 35, 2021, pp. 12657–12665.</p>
        <p>[34] L. Cai, J. Li, H. Lv, W. Liu, H. Niu, Z. Wang, Incorporating domain knowledge for biomedical text analysis into deep learning: A survey, Journal of Biomedical Informatics (2023) 104418.</p>
        <p>[35] K. S. Kalyan, A. Rajasekharan, S. Sangeetha, Ammu: a survey of transformer-based biomedical pretrained language models, Journal of biomedical informatics 126 (2022) 103982.</p>
        <p>[36] S. Islam, H. Elmekki, A. Elsebai, J. Bentahar, N. Drawel, G. Rjoub, W. Pedrycz, A comprehensive survey on applications of transformers for deep learning tasks, Expert Systems with Applications (2023) 122666.</p>
        <p>[37] K. Hall, V. Chang, C. Jayne, A review on natural language processing models for covid-19 research, Healthcare Analytics (2022) 100078.</p>
        <p>[38] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in neural information processing systems 30 (2017).</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E.</given-names>
            <surname>Dimitrakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Sgontzos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tzitzikas</surname>
          </string-name>
          ,
          <article-title>A survey on question answering systems over linked data and documents</article-title>
          ,
          <source>Journal of intelligent information systems</source>
          <volume>55</volume>
          (
          <year>2020</year>
          )
          <fpage>233</fpage>
          -
          <lpage>259</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>G.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Iglesias</surname>
          </string-name>
          ,
          <article-title>Exploiting semantic similarity for named entity disambiguation in knowledge graphs</article-title>
          ,
          <source>Expert Systems with Applications</source>
          <volume>101</volume>
          (
          <year>2018</year>
          )
          <fpage>8</fpage>
          -
          <lpage>24</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Navigli</surname>
          </string-name>
          ,
          <article-title>Word sense disambiguation: A survey</article-title>
          ,
          <source>ACM computing surveys (CSUR)</source>
          <volume>41</volume>
          (
          <year>2009</year>
          )
          <fpage>1</fpage>
          -
          <lpage>69</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kobilarov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cyganiak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ives</surname>
          </string-name>
          ,
          <article-title>Dbpedia: A nucleus for a web of open data</article-title>
          ,
          <source>in: international semantic web conference</source>
          , Springer,
          <year>2007</year>
          , pp.
          <fpage>722</fpage>
          -
          <lpage>735</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>K.</given-names>
            <surname>Bollacker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Evans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Paritosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Sturge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Taylor</surname>
          </string-name>
          ,
          <article-title>Freebase: a collaboratively created graph database for structuring human knowledge</article-title>
          ,
          <source>in: Proceedings of the 2008 ACM SIGMOD international conference on Management of data</source>
          ,
          <year>2008</year>
          , pp.
          <fpage>1247</fpage>
          -
          <lpage>1250</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D.</given-names>
            <surname>Vrandečić</surname>
          </string-name>
          ,
          <article-title>Wikidata: A new platform for collaborative data collection</article-title>
          ,
          <source>in: Proceedings of the 21st international conference on world wide web</source>
          ,
          <year>2012</year>
          , pp.
          <fpage>1063</fpage>
          -
          <lpage>1064</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M. R. A. H.</given-names>
            <surname>Rony</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chaudhuri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Usbeck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <article-title>Tree-kgqa: an unsupervised approach for question answering over knowledge graphs</article-title>
          ,
          <source>IEEE Access</source>
          <volume>10</volume>
          (
          <year>2022</year>
          )
          <fpage>50467</fpage>
          -
          <lpage>50478</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.</given-names>
            <surname>Al-Moslmi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Ocaña</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Opdahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Veres</surname>
          </string-name>
          ,
          <article-title>Named entity extraction for knowledge graphs: A literature overview</article-title>
          ,
          <source>IEEE Access</source>
          <volume>8</volume>
          (
          <year>2020</year>
          )
          <fpage>32862</fpage>
          -
          <lpage>32881</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>