<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Disambiguation via Negative Sampling</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Taiqiang Wu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xingyu Bai</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yujiu Yang</string-name>
          <email>yang.yujiu@sz.tsinghua.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Acronym Disambiguation.</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Tsinghua Shenzhen International Graduate School, Tsinghua University</institution>
          ,
          <country country="CN">P. R. China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Workshop Proce dings</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>The Acronym Disambiguation (AD) task aims to map an acronym in a sentence to the corresponding expansion among a set of candidate expansions. However, models based on domain-agnostic knowledge may perform insufficiently when directly applied to data from specific areas such as science and law. To tackle this issue, we propose a prompt-based acronym disambiguation system with a special negative sampling strategy. Specifically, we design a prompt to combine the input sentence and the candidate expansions, followed by a Pre-trained Language Model (PLM) that calculates a score for each combination. Moreover, negative expansions are randomly sampled for better training, and an additional hinge loss is added to improve the robustness of our system. Experiments show the effectiveness of our system, and we achieve competitive results in SDU@AAAI-22 Shared Task 2: Acronym Disambiguation.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Acronyms are abbreviations formed from the initial
components of words or phrases [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. They are widely used
in our daily life, especially on social media. By using
acronyms, people can avoid frequently repeating long
phrases; thus, the sentences could be shorter and more
readable. For example, we use NASA to replace the
National Aeronautics and Space Administration.
      </p>
      <p>However, for people without domain knowledge,
acronyms can be confusing at times; for example, “PPP”
may refer to Paycheck Protection Program or Public-Private
Partnership. To tackle this issue, it is necessary to build an acronym
disambiguation system that can identify the correct meaning
of acronyms in different contexts. As
shown in Figure 1, given several sentences containing
acronym POS, we need to find out the corresponding
expansion among candidate expansions in the given
dictionary. Moreover, understanding the correlation between
acronyms and their expansion is beneficial for several
tasks in natural language processing, including question
answering and machine reading comprehension.</p>
      <p>
        Acronym disambiguation is usually considered a
sequence classification task [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]; the goal is to map the
given acronym in context to the corresponding
expansion from the candidate expansion dictionary.
Previous works mainly focused on the feature construction of
acronym context to better understand semantics, such as
hand-crafted rules and patterns [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], word embeddings [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ],
graph structures [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], machine learning based methods such as CRF and SVM [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], and deep learning based methods [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ]. This line of work has further been extended to learn richer semantic features using Transformer [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], BERT [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and SciBERT [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Although these efforts have achieved significant performance on this task, most of them ignore the semantic relationship between the acronym context and the candidate expansions.
      </p>
      <p>Figure 1: An example of acronym disambiguation. Given sentences containing the acronym POS and a dictionary of candidate expansions (Part-Of-Speech, positive instances, Possessive, Position, postag), the system selects the corresponding expansion for each occurrence of POS.</p>
      <p>
        Furthermore, large-scale data during training brings
an extremely long-tail problem. The size of the original
candidate expansions in the dictionary varies, making it
hard to batch the samples during training. To address
this issue, previous works [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] dynamically add extra
expansions into the candidate expansion set. However,
they ignore the fact that the original negative candidate
expansions are related to the acronym word in semantic
meaning while the added expansions are unrelated.
      </p>
      <p>
        In this paper, we propose a prompt-based acronym disambiguation framework with a specially designed negative sampling strategy. Firstly, we design a prompt template and use it to concatenate the acronym context and the candidate expansions. Secondly, we utilize a pre-trained language model such as BERT [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] to encode each combined context separately, followed by a linear layer that maps the context vectors into logits. Since the size of the candidate expansion set varies across acronyms, we randomly sample negative expansions to pad the candidate set. Finally, we treat the original negative expansions as hard negative samples and the added ones as easy negative samples, from which an extra loss can be calculated to build a more robust system. The main contributions of this work are summarized as follows:
      </p>
      <p>• We design a prompt-based framework to resolve the acronym disambiguation problem, which can be easily modified to solve other NLP tasks such as Entity Linking.</p>
      <p>• We propose a simple yet effective dynamic negative sampling strategy and adopt a novel hinge loss to help train a robust model. The strategy can benefit other matching problems.</p>
      <p>• We conduct experiments on the SDU@AAAI-22 Shared Task 2 dataset and achieve competitive performance, demonstrating our framework's effectiveness.</p>
    </sec>
    <sec id="sec-related">
      <title>2. Related Work</title>
      <p>In this section, we mainly introduce the related studies on prompt-based models, especially BERT-based models, and then review the existing research on word sense disambiguation, which is a more general problem than acronym disambiguation.</p>
      <sec id="sec-related-1">
        <title>2.1. Prompt-based Learning</title>
        <p>
          A prompt is suggestive information that elicits the knowledge PLMs (Pre-trained Language Models) learned during pre-training; it contains a description of the task together with the corresponding answers. Prompt-based learning is a slot-filling method based on language models, which aims to construct the final prompt probabilistically as the prediction of the task. Previous exploration in prompt-based learning mainly focuses on prompt construction, including prompt engineering and answer engineering. Prompt engineering creates a prompt function applicable to the corresponding downstream tasks [
          <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
          ], while answer engineering searches for a unified answer space to which the original answers are mapped [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
        </p>
        <p>
          Multi-prompt learning, an ensemble of these two kinds of prompt engineering, aims to improve the generalization of models [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. Based on multi-prompt learning, various combinations of prompts have been explored, such as prompt augmentation [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], prompt composition [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] and prompt decomposition [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. In this work, we construct different forms of prompts manually to enrich the knowledge enhancement methods.
        </p>
      </sec>
      <sec id="sec-related-2">
        <title>2.2. Word Sense Disambiguation</title>
        <p>
          Word Sense Disambiguation (WSD) methods are divided into supervised, unsupervised and semi-supervised ones. In supervised WSD, classic machine learning based methods, such as decision trees, SVMs, ANNs and naive Bayes models, have been combined to improve the classifier [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. A WSD model based on evolutionary game theory was designed to determine the prediction of ambiguous words by calculating distribution and semantic similarity [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. A supervised neural network with LKB graph embeddings was proposed to transfer the pre-trained embeddings of synsets to the predicted ones [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ].
        </p>
        <p>
          Unsupervised WSD methods mainly cluster the unlabeled corpus to predict the category of ambiguous words. A classic hybrid model consists of self-adaptive genetic, max-min ant and ant colony algorithms [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]. WSD models based on polysemy vector representations adopted statistical polysemy, word sense numbers, and K-means to perform disambiguation [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]. A word sense mapping graph network can be combined with multilingual and multi-knowledge resources to integrate rich information in the unsupervised scenario [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ].
        </p>
        <p>In semi-supervised WSD models, the classifier is trained on the integration of annotated and unannotated corpora. A PageRank-based WSD algorithm combined plWordNet and semantic links from a valency lexicon, Wikipedia articles and the SUMO ontology [23]. A clustering and labeling strategy was used to generate labeled data for subjectivity WSD semi-automatically, which was further combined with the original annotated data [24].</p>
        <p>However, all these methods ignore the interaction between the explanation of an ambiguous word and its context. In this work, we propose a prompt-based model to better integrate the semantic relationship between acronym context and candidate expansions.</p>
      </sec>
    </sec>
    <sec id="sec-method">
      <title>3. Methodology</title>
      <p>In this section, we present the overall architecture of our proposed framework, which uses a prompt-based model to solve the acronym disambiguation problem and adopts a dynamic negative sampling strategy to improve the robustness of our model.</p>
      <sec id="sec-1-1">
        <title>Dictionary-Rest Acronym</title>
      </sec>
      <sec id="sec-1-2">
        <title>Dictionary-POS</title>
        <p>Part-Of-Speech
positive instances
Position
Possessive
postag</p>
      </sec>
      <sec id="sec-1-3">
        <title>Raw Sample</title>
      </sec>
      <sec id="sec-1-4">
        <title>Size:k</title>
      </sec>
      <sec id="sec-1-5">
        <title>Ground Truth</title>
        <p>[Part-Of-Speech]</p>
      </sec>
      <sec id="sec-1-6">
        <title>Size:1</title>
      </sec>
      <sec id="sec-1-7">
        <title>Prompts</title>
        <p>Dialogue fillers and acceptance words
affect the accuracy of POS tagging.
⨁
[SEP]POS[SEP]the meaning of POS
is or equals &lt;MASK&gt;</p>
        <sec id="sec-1-7-1">
          <title>Random Sampling</title>
        </sec>
      </sec>
      <sec id="sec-1-8">
        <title>Original Negative</title>
        <p>[positive instances]
[Position]
……</p>
      </sec>
      <sec id="sec-1-9">
        <title>Size:k-1</title>
      </sec>
      <sec id="sec-1-10">
        <title>MASK</title>
        <sec id="sec-1-10-1">
          <title>Enumeration</title>
        </sec>
      </sec>
      <sec id="sec-1-11">
        <title>Added Negative</title>
        <p>[Sound Pattern of English]
[Frequent Candidates]
……</p>
      </sec>
      <sec id="sec-1-12">
        <title>Size:N-k</title>
        <p>……
……
1 2 3
k k+1 k+2
N</p>
      </sec>
      <sec id="sec-1-13">
        <title>Train</title>
        <p>BERT</p>
      </sec>
      <sec id="sec-1-14">
        <title>Inference</title>
        <p>⨁ : Concatenate
logits
CE loss</p>
        <p>Hinge loss</p>
        <p>Dynamic Mask
logit1
logit2
logit3
…… logitk
logitk+1
logitk+2
…… logitN
3.1. Problem Statement token is inserted before and after the acronym, followed
by a string: the meaning of acronym is or equals
expanFormally, given an input sentence  =  1,  2, ...,   and sion. Finally, BERT with an additional linear layer is
acronym  =   at position  , the goal is to disambiguate employed as our encoder. For training, we will calculate
the corresponding expansions   among  candidate ex- the cross-entropy loss and adopted hinge loss [25]. For
pansions { 1,  2, ...,   }. The candidate expansions are inference, a dynamic mask strategy is adopted, in which
given in advance and their size vary. Specifically, in we will drop the logits of added expansions. Specially,
this paper, we treat this task as a classification problem we will drop the logits from the added negative samples,
by padding the candidate expansions set to fix length which can not be the answer.
with randomly chosen unrelated expansions. We will
dynamically mask the logit of added expansion in the
testing phase and choose the largest one among original 3.3. Prompt Design
candidate expansions as the final prediction.</p>
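        <p>As a minimal sketch of this scoring pipeline (the class and function names below are ours, not from a released implementation), the encoder and the dynamic mask can be written as follows.</p>
        <preformat>
# Sketch: BERT encodes each prompt; a linear layer maps the [CLS]
# vector to one logit per candidate expansion.
import torch
from torch import nn
from transformers import AutoModel

class PromptScorer(nn.Module):
    def __init__(self, plm_name="bert-base-cased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(plm_name)
        self.linear = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        # input_ids: one row per candidate prompt, shape (num_candidates, seq_len)
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]      # [CLS] vector of each prompt
        return self.linear(cls).squeeze(-1)    # one logit per candidate

def predict(logits, k):
    # Dynamic mask at inference: expansions after position k were added
    # only for padding and can never be the answer.
    masked = logits.clone()
    masked[k:] = float("-inf")
    return masked.argmax().item()
        </preformat>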
      </sec>
      <sec id="sec-method-3">
        <title>3.3. Prompt Design</title>
        <p>To build an effective prompt template, we consider a two-stage strategy. We want the model to be aware of two tasks: finding the acronym, and finding the corresponding expansion. Thus, we employ the [SEP] token to highlight the acronym, which helps the model understand where the acronym is. For the second task, previous works [26, 27] show that a longer prompt usually performs better. To add more tokens, we use the template: the meaning of acronym is or equals expansion. For French and Spanish, we employ the corresponding translations as the prompt templates.</p>
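        <p>A minimal sketch of this template (the helper name is ours), matching the prompt shown in Figure 2:</p>
        <preformat>
def build_prompt(sentence, acronym, expansion):
    # sentence + [SEP]acronym[SEP] + "the meaning of acronym is or equals expansion"
    return (sentence + "[SEP]" + acronym + "[SEP]"
            + "the meaning of " + acronym + " is or equals " + expansion)

# e.g. build_prompt("... affect the accuracy of POS tagging.",
#                   "POS", "Part-Of-Speech")
        </preformat>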
      </sec>
      <sec id="sec-method-4">
        <title>3.4. Negative Sampling</title>
        <p>
          The number of candidate expansions in the dictionary varies, making it hard to train an efficient model. Moreover, we consider the negative samples among the original candidate expansions to be related to, but not exactly, the ground truth. To improve the robustness and convergence of the model, we adopt a negative sampling strategy. We set the size of the padded candidate set to N and randomly sample expansions from the candidate expansions of other acronyms as needed. For example, if N is set to 6 and the number of original candidate expansions is 2, we need to pick 4 additional expansions. We note that [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] also proposed a similar negative sampling strategy. The difference is that we divide the negative samples into hard negative samples and easy negative samples, and thus design an extra loss.
        </p>
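        <p>As a sketch (assuming the dictionary is a mapping from each acronym to its candidate expansions; the names are ours), the padding step looks as follows.</p>
        <preformat>
import random

def pad_candidates(acronym, dictionary, n_total):
    # Original candidates: the ground truth plus the hard negatives.
    candidates = list(dictionary[acronym])
    # Easy negatives: expansions of other acronyms, sampled at random.
    others = [e for a, exps in dictionary.items() if a != acronym for e in exps]
    added = random.sample(others, n_total - len(candidates))
    # k = len(candidates) marks the boundary between original and added samples.
    return candidates + added, len(candidates)
        </preformat>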
      </sec>
      <sec id="sec-method-5">
        <title>3.5. Loss Function</title>
        <p>For the model, we consider two goals: 1) the ground truth expansion gets the highest score; 2) the original negative expansions get higher scores than the additional negative expansions. For the first goal, we employ the cross-entropy loss. Denote the predicted label as ŷ and the ground truth label as y:</p>
        <p>ℒ_ce = CE(ŷ, y),  (1)</p>
        <p>where CE denotes the cross-entropy loss function. For the second goal, we follow the idea of the hinge loss: we want the minimum of the original expansion scores S_o = {s_o,1, s_o,2, ..., s_o,k−1} to be higher than the maximum of the additional expansion scores S_a = {s_a,1, s_a,2, ..., s_a,N−k} by a margin:</p>
        <p>ℒ_hinge = max(m − min(S_o) + max(S_a), 0),  (2)</p>
        <p>where max(·) and min(·) denote the maximum and minimum functions and m is a learnable margin. Hence we get our final loss</p>
        <p>ℒ = ℒ_ce + λ · ℒ_hinge,  (3)</p>
        <p>where λ is also a learnable hyperparameter that controls the ratio of the hinge loss.</p>
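        <p>A sketch of Eqs. (1)-(3) in PyTorch (in our system m and λ are learnable; the sketch passes them as plain values for brevity, and assumes the logits are ordered as ground truth, k − 1 hard negatives, N − k easy negatives):</p>
        <preformat>
import torch
import torch.nn.functional as F

def total_loss(logits, k, margin, lam):
    # Eq. (1): cross entropy with the ground truth at index 0.
    target = torch.zeros(1, dtype=torch.long)
    ce = F.cross_entropy(logits.unsqueeze(0), target)
    # Eq. (2): the min of hard-negative scores should exceed
    # the max of easy-negative scores by a margin.
    hinge = torch.clamp(margin - logits[1:k].min() + logits[k:].max(), min=0)
    # Eq. (3): weighted sum of the two terms.
    return ce + lam * hinge
        </preformat>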
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Experiments</title>
      <p>In this section, we first introduce the experimental dataset and evaluation metrics, and then conduct comprehensive experimental studies to verify the effectiveness of our method.</p>
      <sec id="sec-2-1">
        <title>4.1. Dataset</title>
        <p>We evaluate all models on the data provided by SDU@AAAI-22 [28]. As shown in Table 1, the dataset [29] contains training and development sets in English (both scientific and legal domains), Spanish, and French, consisting of 497 English Scientific, 303 English Legal, 546 Spanish, and 669 French acronyms. For each language, a dictionary containing acronyms and their candidate expansions is provided. For Legal English, there are 3717 sentences containing 174997 tokens, and the dictionary contains 625 candidate expansions. The average expansion length over all acronyms is 3.1. The acronyms in the testing set do not appear in the training set.</p>
        <p>Table 1: Statistics of the SDU@AAAI-22 Shared Task 2 datasets (Legal English, Scientific English, French, Spanish).</p>
        <p>For Exploratory Data Analysis (EDA), we analyze the statistical features of the dataset. As shown in Figure 3 and Figure 4, we observe that: 1) for most acronyms there are more than 10 corresponding sentences, indicating that the samples are highly similar; 2) for most acronyms there are fewer than 4 candidate expansions.</p>
      </sec>
      <sec id="sec-2-2">
        <title>4.2. Evaluation Metrics</title>
        <p>Given the acronyms in sentences, the candidate expansions and the ground truth labels, we calculate the macro-averaged precision, recall and F1 score.</p>
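        <p>As a sketch, these metrics can be reproduced with scikit-learn (the shared task ships its own official scorer; this only mirrors the definition):</p>
        <preformat>
from sklearn.metrics import precision_recall_fscore_support

def macro_scores(y_true, y_pred):
    p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
    return p, r, f1
        </preformat>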
      </sec>
      <sec id="sec-2-3">
        <title>4.3. Implementation</title>
        <p>All models are implemented with the open-source transformers library from Huggingface [30]. For all datasets, we set m = 0.1 and λ = 1. The batch size is 2, and the size of the padded expansion set is N = max(k) + 2, where max(k) is the largest number of candidate expansions over all acronyms. For example, for the French dataset the maximum number of candidate expansions is 12, so we set N = 12 + 2 = 14. As for the other parameters, we set the learning rate to 3e-5 and the random seed to 10086. We pad or cut the input to a length of 128 tokens. For the French dataset, the prompt is: la signification de acronym est ou est égale à expansion. For the Spanish dataset, we use: el significado de acronym es o igual a expansion. We train our model on one V100 GPU and evaluate the results using the official script.</p>
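        <p>For reference, the hyperparameters above can be collected as follows (a sketch; the dictionary structure is an assumption on our part):</p>
        <preformat>
CONFIG = {
    "margin_init": 0.1,    # initial value of the margin m
    "lambda_init": 1.0,    # initial ratio of the hinge loss
    "batch_size": 2,
    "learning_rate": 3e-5,
    "seed": 10086,
    "max_length": 128,     # inputs are padded or cut to 128 tokens
}

def padded_set_size(dictionary):
    # N = max(k) + 2, e.g. 12 + 2 = 14 for the French dictionary
    return max(len(exps) for exps in dictionary.values()) + 2
        </preformat>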
      </sec>
      <sec id="sec-2-4">
        <title>4.4. Comparison</title>
        <sec id="sec-2-4-1">
          <title>4.4.1. Overall Performance</title>
          <p>The overall performance results on the validation set are shown in Table 2. For Legal English, we choose bert-large-cased [31] and spanbert-large-cased as the PLM. For Scientific English, we choose scibert-scivocab-cased [32]. For French, we choose bert-base-french-europeana-cased and camembert-large [33]. For Spanish, we choose bert-base-spanish-wwm-cased [34] and bert-base-multilingual-cased [31]. As shown in Table 2, we observe that most models suffer from over-fitting after 3 epochs. Moreover, we find that BERT trained on a specialized corpus performs better than BERT trained on a common corpus.</p>
          <p>Table 2: Overall performance on the validation sets, comparing bert-large-cased, spanbert-large-cased, scibert-scivocab-cased, scibert-scivocab-cased (λ = 1.5), bert-base-french-europeana-cased, camembert-large, bert-base-spanish-wwm-cased and bert-base-multilingual-cased across Legal English, Scientific English, French and Spanish.</p>
        </sec>
        <sec id="sec-2-4-2">
          <title>4.4.2. The Effect of the PLM</title>
          <p>We change the BERT variant to study the influence of different backbones on the French dataset. As shown in Figure 5, which plots the results over epochs 1 to 4 of the training stage, larger models usually obtain better results. Another interesting observation is that all models suffer from over-fitting at epoch 4.</p>
        </sec>
        <sec id="sec-2-4-3">
          <title>4.4.3. The Effect of the Margin m</title>
          <p>We change m to 0.0 and 1.0 and conduct experiments on the English Scientific dataset. According to Figure 6, a large m brings considerable change during training. In fact, a large m means a large gap is required, leading to oscillation in the loss.</p>
        </sec>
        <sec id="sec-2-4-4">
          <title>4.4.4. The Effect of the Ratio λ</title>
          <p>We change λ to 0.5 and 1.5 and conduct experiments on the English Scientific dataset. As shown in Figure 7, a larger λ leads to a lower result. In fact, a large λ means a large hinge loss, which pushes the model to focus on separating the two kinds of negative samples rather than on identifying the ground truth.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Conclusion and Future Work</title>
      <p>In this paper, we propose a novel prompt-based model,
which shows promising and competitive performance in
SDU@AAAI-22 Shared Task 2. We design an effective
prompt template that helps the model utilize the implicit
knowledge in the pre-trained language model. A dynamic
negative sampling strategy is employed to improve the
robustness and performance of our model.</p>
      <p>
        For future work, we will try to adopt a learned prompt
template rather than a fixed template, following CoOp [26].
Moreover, acronym disambiguation under a zero-shot
setting would be another interesting and valuable topic.
Utilizing the graph information in the given sentences [
        <xref ref-type="bibr" rid="ref35">35</xref>
        ] may also help.
      </p>
    </sec>
    <sec id="sec-4">
      <title>6. Acknowledgments</title>
      <p>This research was supported in part by the National
Key Research and Development Program of China (No.
2018YFB1601102) and the Shenzhen Key Laboratory of
Marine IntelliSense and Computation under Contract
ZDSYS20200811142605016. We thank the organizers of
acronym identification and disambiguation competitions
and the reviewers for their valuable comments and
suggestions.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] Y. Li, B. Zhao, A. Fuxman, F. Tao, Guess me if you can: Acronym disambiguation for enterprises, in: ACL 2018, 2018, pp. 1308-1317.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] A. P. B. Veyseh, F. Dernoncourt, Q. H. Tran, T. H. Nguyen, What does this acronym mean? Introducing a new dataset for acronym identification and disambiguation, in: Proceedings of COLING 2020, International Committee on Computational Linguistics, 2020, pp. 3285-3301.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] M. R. Ciosici, T. Sommer, I. Assent, Unsupervised abbreviation disambiguation: Contextual disambiguation using word embeddings, CoRR abs/1904.00929 (2019). URL: http://arxiv.org/abs/1904.00929. arXiv:1904.00929.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] J. Liu, C. Liu, Y. Huang, Multi-granularity sequence labeling model for acronym expansion identification, Inf. Sci. 378 (2017) 462-474.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] J. Charbonnier, C. Wartena, Using word embeddings for unsupervised acronym disambiguation, in: Proceedings of COLING 2018, Association for Computational Linguistics, 2018, pp. 2610-2619.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] Q. Jin, J. Liu, X. Lu, Deep contextualized biomedical abbreviation expansion, in: Proceedings of BioNLP@ACL, Association for Computational Linguistics, 2019, pp. 88-96.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in Neural Information Processing Systems 30: NeurIPS 2017, 2017, pp. 5998-6008.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of NAACL-HLT 2019, Volume 1 (Long and Short Papers), Association for Computational Linguistics, 2019, pp. 4171-4186.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] I. Beltagy, K. Lo, A. Cohan, SciBERT: A pretrained language model for scientific text, in: Proceedings of EMNLP-IJCNLP 2019, Association for Computational Linguistics, 2019, pp. 3613-3618.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] C. Pan, B. Song, S. Wang, Z. Luo, BERT-based acronym disambiguation with multiple training strategies, in: Proceedings of SDU@AAAI-21, volume 2831 of CEUR Workshop Proceedings, 2021.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] T. Shin, Y. Razeghi, R. L. L. IV, E. Wallace, S. Singh, AutoPrompt: Eliciting knowledge from language models with automatically generated prompts, in: Proceedings of EMNLP 2020, Association for Computational Linguistics, 2020, pp. 4222-4235.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] Z. Jiang, F. F. Xu, J. Araki, G. Neubig, How can we know what language models know, Trans. Assoc. Comput. Linguistics 8 (2020) 423-438.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] W. Yuan, G. Neubig, P. Liu, BARTScore: Evaluating generated text as text generation, CoRR abs/2106.11520 (2021).</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] T. Gao, A. Fisch, D. Chen, Making pre-trained language models better few-shot learners, in: Proceedings of ACL/IJCNLP 2021 (Volume 1: Long Papers), Association for Computational Linguistics, 2021, pp. 3816-3830.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] X. Han, W. Zhao, N. Ding, Z. Liu, M. Sun, PTR: Prompt tuning with rules for text classification, CoRR abs/2105.11259 (2021). URL: https://arxiv.org/abs/2105.11259. arXiv:2105.11259.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] L. Cui, Y. Wu, J. Liu, S. Yang, Y. Zhang, Template-based named entity recognition using BART, in: Findings of the Association for Computational Linguistics: ACL/IJCNLP 2021, Findings of ACL, Association for Computational Linguistics, 2021, pp. 1835-1845.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] A. R. Pal, D. Saha, N. S. Dash, S. K. Naskar, A. Pal, A novel approach to word sense disambiguation in Bengali language using supervised methodology, Sādhanā 44 (2019) 1-12.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] R. Tripodi, M. Pelillo, A game-theoretic approach to word sense disambiguation, Comput. Linguistics 43 (2017) 31-70.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] M. Bevilacqua, R. Navigli, Breaking through the 80% glass ceiling: Raising the state of the art in word sense disambiguation by incorporating knowledge graph information, in: Proceedings of ACL 2020, Association for Computational Linguistics, 2020, pp. 2854-2864.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] W. Alsaeedan, M. E. B. Menai, S. A. Al-Ahmadi, A hybrid genetic-ant colony optimization algorithm for the word sense disambiguation problem, Inf. Sci. 417 (2017) 20-38.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] G. Wiedemann, S. Remus, A. Chawla, C. Biemann, Does BERT make any sense? Interpretable word sense disambiguation with contextualized embeddings, in: Proceedings of the 15th Conference on Natural Language Processing, KONVENS 2019, 2019.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] W. Lu, F. Meng, S. Wang, G. Zhang, X. Zhang, A. Ouyang, X. Zhang, Graph-based Chinese word sense disambiguation with multi-knowledge integration, Comput. Mater. Continua 61 (2019) 197-212.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lei</surname>
          </string-name>
          , G. Xun,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          , FAT-RE:
          <article-title>A faster dependency-free model for relation extraction</article-title>
          ,
          <source>J. Web Semant</source>
          .
          <volume>65</volume>
          (
          <year>2020</year>
          )
          <article-title>100598</article-title>
          . URL: https://doi. org/10.1016/j.websem.
          <year>2020</year>
          .
          <volume>100598</volume>
          . doi:
          <volume>10</volume>
          .1016/j. websem.
          <year>2020</year>
          .
          <volume>100598</volume>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>