<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>SimCLAD: A Simple Framework for Contrastive Learning of Acronym Disambiguation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Bin Li (Corresponding author)</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fei Xia</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yixuan Weng</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiusheng Huang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bin Sun</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>College of Electrical and Information Engineering, Hunan University</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>School of Artificial Intelligence, University of Chinese Academy of Sciences</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Acronym disambiguation means finding the correct meaning of an ambiguous acronym in a given sentence from a dictionary of candidate expansions, and it is one of the key problems in scientific document understanding (SDU@AAAI-22). Recently, many attempts have tried to solve this problem by fine-tuning pre-trained masked language models (MLMs) to obtain a better acronym representation. However, the meaning of an acronym varies across contexts, and the corresponding phrase representations, mapped in different directions, lack discrimination in the entire vector space. Thus, the original representations of the pre-trained MLMs are not ideal for the acronym disambiguation task. In this paper, we propose a Simple framework for Contrastive Learning of Acronym Disambiguation (SimCLAD) to better understand acronym meanings. Specifically, we design a continual contrastive pre-training method that enhances the pre-trained model's generalization ability by learning the phrase-level contrastive distributions between the true meaning and ambiguous phrases. The results on acronym disambiguation in the scientific English domain show that the proposed method outperforms all other competitive state-of-the-art (SOTA) methods.</p>
      </abstract>
      <kwd-group>
        <kwd>Acronym Disambiguation</kwd>
        <kwd>Document Understanding</kwd>
        <kwd>Contrastive Learning</kwd>
        <kwd>Continual Pre-training</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Input:
Sentence: SVMs have been used for text classification
(Tong and Koller, 2002), using properties of the support
vector machine algorithm for determining what
unlabelled data to select for classification.</p>
      <p>Dictionary: 1. Support Vector Machines; 2. Support vector machines</p>
      <p>Output: Support vector machines</p>
      <p>Figure 1: Example of acronym disambiguation.</p>
      <p>Recently, pre-training technology has greatly improved the level of machine understanding [<xref ref-type="bibr" rid="ref1">1</xref>]. However, due to the complexity and ambiguity of natural language, there is still a gap between machines and humans in comprehensively understanding documents [<xref ref-type="bibr" rid="ref2">2</xref>]. In scientific document understanding (SDU@AAAI-22), acronyms are often necessary because of space limitations. It is of great significance to correctly understand and distinguish the correct acronym meaning from the given sentence [<xref ref-type="bibr" rid="ref3">3</xref>].</p>
      <p>More precisely, the document reading system is expected to find the correct expanded form of the acronym given the possible expansions from the dictionary for the acronym. This is quite important for a variety of downstream tasks that involve understanding, such as reading comprehension [<xref ref-type="bibr" rid="ref4">4</xref>], story cloze [<xref ref-type="bibr" rid="ref5">5</xref>] and medical entity disambiguation [<xref ref-type="bibr" rid="ref6">6</xref>].</p>
      <p>The acronym disambiguation task aims at finding the correct meaning of the ambiguous acronym in a given text from the dictionary [<xref ref-type="bibr" rid="ref7">7</xref>]. As shown in Figure 1, the sentence is from the scientific domain in English, where the text in bold represents the short acronym. The dictionary contains the hard-to-distinguish long forms of the acronym. Our goal is to predict the correct long-form meaning of the acronym from the dictionary (i.e., Support vector machines). A good prediction should not only understand the meaning of the context, but also differentiate between the meanings of the ambiguous phrases. Many works have attempted to incorporate manually designed rules [<xref ref-type="bibr" rid="ref8">8</xref>], handcrafted features [9, 10], word embedding [11] and pre-training technology [12, 13, 14] into this task and achieved relatively good performance. According to the results of SDU@AAAI-21 [<xref ref-type="bibr" rid="ref2">2</xref>], the pre-training method can outperform the rule-based or feature-based method by a large margin. However, the acronym meaning varies in different contexts [15], and the corresponding token representations follow an anisotropic distribution [16]. For the masked language models (MLMs), the token representation is mapped into a cramped idiomatic distribution [17, 16]. As a result, the MLMs are weak in distinguishing the ambiguous meaning of acronyms, especially in the acronym disambiguation task.</p>
      <p>SDU@AAAI-22: Workshop on Scientific Document Understanding, co-located with AAAI 2022. 2022 Vancouver, Canada. libincn@hnu.edu.cn (B. Li (Corresponding author)); xiafei2020@ia.ac.cn (F. Xia); wengsyx@gmail.com (Y. Weng); huangxiusheng2020@ia.ac.cn (X. Huang); sunbin611@hnu.edu.cn (B. Sun). © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073.</p>
      <p>Inspired by the token-aware contrastive learning method [16], a Simple framework for Contrastive Learning of Acronym Disambiguation (SimCLAD) is proposed to distinguish the distributions between the true meaning and ambiguous phrases. Specifically, we adopt the phrase-level continual contrastive pre-training method to enhance the pre-trained MLMs for a better representation of the acronyms. Extensive experiments on acronym disambiguation in the scientific English domain show that the proposed method achieves the best results compared with the other competitive state-of-the-art (SOTA) methods. The online leaderboard shows that the proposed method ranks 1st in the scientific English domain in shared task 2 of SDU@AAAI-22. The main contributions are summarized as follows:</p>
      <p>• We perform the first attempt to resolve the acronym disambiguation problem with a contrastive pre-trained model for better acronym understanding.</p>
      <p>• We extend the token-level contrastive learning method by designing a phrase-level continuing contrastive pre-training method to obtain better contrastive representations of the ambiguous acronyms.</p>
      <p>• Experiments conducted on the scientific English dataset demonstrate that the proposed method has better performance and outperforms other competitive baselines.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <sec id="sec-2-1">
        <title>2.1. Acronym disambiguation</title>
        <p>Acronym disambiguation has attracted much attention in biomedical fields [18]. The earliest methods [<xref ref-type="bibr" rid="ref8">8</xref>] utilize manually designed rules or text features to find the acronym expansions. Later, a few works [19] automatically mined the acronym expansions by analyzing web data. These methods are usually effective when an acronym appears together with the corresponding expansion in the same document.</p>
        <p>However, traditional rules or statistics cannot effectively handle these tasks with the explosive growth of information. In addition, these methods used for biomedical tasks cannot be directly transferred to other fields, such as science. Recently, deep learning based methods have promoted the development of scientific document understanding (SDU). Methods like feature-based [9], clustering [11], and pre-training model methods [14] perform well in this task. Although these methods based on the pre-training technology (i.e., MLMs) can effectively distinguish confusing phrases of the acronym, they still lack the cognition of negative samples in the representation. Different from the above methods, we use contrastive learning to obtain more obvious features for acronym disambiguation.</p>
      </sec>
      <sec id="sec-1-1">
        <title>2.2. Contrastive learning</title>
        <p>In general, methods based on contrastive learning (CL) can well distinguish the observed data from other negative samples. Many attempts at CL have been made in many areas of computer vision, including image [20] and video [21]. Most recently, a simple framework for the CL of visual representations named SimCLR [22], based on the NT-Xent loss, was proposed for better image representation.</p>
        <p>The same idea can also be found in the field of natural language processing (NLP), where many works [23, 24] are devoted to modeling better sentence-level representations with CL for the downstream tasks. Recently, Su et al. [16] propose a token-aware CL framework to learn the isotropic and discriminative distribution of token representations by restoring the original token meaning of the masked items. This method is very effective in distinguishing the token-level representations, thereby achieving better performance in sentence representation. Following this work, we further consider the phrase-level CL by recovering the probable phrases (i.e., ambiguous acronyms) during the pre-training phase to obtain a better-distinguished acronym representation.</p>
      </sec>
      <sec id="sec-1-1-1">
        <title>2.3. Continual pre-training</title>
        <p>Further continual pre-training of the pre-trained model [25] is a wise choice to alleviate the task and domain discrepancy between the upstream and the downstream tasks. Many works investigate how to better transfer general knowledge to the domain-specific task via continuing pre-training [26, 16]. In the field of SDU, the generic MLMs are weak in distinguishing confusing phrases from the dictionary. As a result, the continual pre-training method is adopted in this paper to directly improve the ability of understanding with contrastive learning.</p>
        <p>[Figure 2: overview of the proposed method. Labels: Encoder; Student; Teacher (Frozen); Embedding Layer; Preprocessed Text ("We adopt [MASK] in this paper"); Input Text.]</p>
      </sec>
    </sec>
    <sec>
      <title>3. Task introduction</title>
      <sec>
        <title>3.1. Problem definition</title>
        <p>The acronym disambiguation task aims to find the correct meaning of a given acronym in a sentence. Specifically, the sentence can be represented as <inline-formula><tex-math>x = [x_1, x_2, \ldots, x_n]</tex-math></inline-formula>, where <inline-formula><tex-math>n</tex-math></inline-formula> is the total length of the sentence. Given that the index <inline-formula><tex-math>i</tex-math></inline-formula> represents the acronym in the input sentence, the short acronym can be represented as <inline-formula><tex-math>\hat{x}_i</tex-math></inline-formula>. The corresponding meaning of the short-form acronym is chosen from the dictionary <inline-formula><tex-math>D = [d_1, d_2, \ldots, d_m]</tex-math></inline-formula>, where <inline-formula><tex-math>d_j</tex-math></inline-formula> represents a phrase in the dictionary and <inline-formula><tex-math>m</tex-math></inline-formula> represents the total number of the probable phrases. Our goal is to predict the correct phrase meaning <inline-formula><tex-math>d_j</tex-math></inline-formula> of the short acronym <inline-formula><tex-math>\hat{x}_i</tex-math></inline-formula> from the dictionary <inline-formula><tex-math>D</tex-math></inline-formula>, where <inline-formula><tex-math>i \in [1, n]</tex-math></inline-formula> and <inline-formula><tex-math>j \in [1, m]</tex-math></inline-formula>.</p>
        <p>The scientific English dataset is divided into training (7532), development (894), and testing (574) sets. All the datasets can be found in the work [27], where the training and validation sets of the scientific English dataset have been manually labeled. All the labels are collected in the dictionary.</p>
      </sec>
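      <sec>
        <p>As a minimal sketch of the problem definition in Section 3.1, an instance can be modelled as a tokenized sentence, the position of the short-form acronym, and the candidate long forms from the dictionary. The names <monospace>AcronymExample</monospace>, <monospace>overlap_score</monospace> and <monospace>disambiguate</monospace> are hypothetical, the example sentence is made up for illustration, and the word-overlap scorer is only a placeholder for a learned model, not the method proposed in this paper.</p>
        <preformat><![CDATA[
```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class AcronymExample:
    tokens: List[str]       # sentence x = [x_1, ..., x_n]
    acronym_index: int      # 0-based position of the short-form acronym
    candidates: List[str]   # dictionary D = [d_1, ..., d_m] of long forms


def overlap_score(example: AcronymExample, candidate: str) -> float:
    """Placeholder scorer: number of candidate words found in the context."""
    context = {
        token.lower().strip(".,()")
        for position, token in enumerate(example.tokens)
        if position != example.acronym_index
    }
    return float(sum(1 for word in candidate.lower().split() if word in context))


def disambiguate(
    example: AcronymExample,
    score: Callable[[AcronymExample, str], float],
) -> str:
    """Return the candidate d_j with the highest score for this example."""
    return max(example.candidates, key=lambda candidate: score(example, candidate))


# Illustrative (made-up) instance: "CL" is the ambiguous short form.
example = AcronymExample(
    tokens="We pre-train the student with CL using a contrastive loss".split(),
    acronym_index=5,
    candidates=["continual learning", "contrastive learning"],
)
prediction = disambiguate(example, overlap_score)
print(prediction)
```
]]></preformat>
        <p>Any scoring function with the same signature, for instance one backed by a fine-tuned language model, could be passed to <monospace>disambiguate</monospace> in place of the placeholder.</p>
      </sec>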
      <sec id="sec-2-2">
        <title>3.2. Evaluation metric</title>
        <p>To evaluate the performance of different methods, the Macro F1 is adopted. The definitions are shown as follows:</p>
        <disp-formula><label>(1)</label><tex-math>\begin{gathered} \mathrm{Precision} = \frac{\sum_{i=1}^{N} \mathrm{precision}_i}{N}, \qquad \mathrm{Recall} = \frac{\sum_{i=1}^{N} \mathrm{recall}_i}{N}, \\ \text{Macro F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \end{gathered}</tex-math></disp-formula>
        <p>where <inline-formula><tex-math>N</tex-math></inline-formula> is the number of total classes, and <inline-formula><tex-math>\mathrm{precision}_i</tex-math></inline-formula> and <inline-formula><tex-math>\mathrm{recall}_i</tex-math></inline-formula> represent the precision and recall of class <inline-formula><tex-math>i</tex-math></inline-formula>, respectively.</p>
      </sec>
      <sec>
        <title>3.3. Dataset</title>
        <p>The acronym disambiguation task contains the dataset of scientific English, which is shown in Table 1.</p>
        <table-wrap>
          <label>Table 1</label>
          <caption>
            <p>Statistical information of scientific English dataset.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Data</th><th>Sample Number</th><th>Ratio</th></tr>
            </thead>
            <tbody>
              <tr><td>Training Set</td><td>7532</td><td>83.69%</td></tr>
              <tr><td>Development Set</td><td>894</td><td>9.93%</td></tr>
              <tr><td>Test Set</td><td>574</td><td>6.38%</td></tr>
              <tr><td>Total</td><td>9000</td><td>100%</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Method</title>
      <sec id="sec-3-1">
        <title>4.1. Model architecture</title>
        <p>As shown in Figure 2, the overview of the proposed method contains two domain pre-trained models (a student and a teacher) which are initialized with the same parameters, i.e., SciBERT (Beltagy, Lo, and Cohan 2019). At the stage of pre-training, the parameters of the teacher are frozen to provide a good encoding representation for the student model. In addition, the teacher supports the well-formed original objectives of the MLM (i.e., masked language modeling and next sentence prediction) for the student model. Inspired by [16], we intentionally mask the original short-form acronym (<inline-formula><tex-math>x'</tex-math></inline-formula>) to distinguish the ambiguous long-form acronyms (<inline-formula><tex-math>d_1^{+}, d_2^{-}</tex-math></inline-formula>) in the teacher model, where the notations <inline-formula><tex-math>+</tex-math></inline-formula> and <inline-formula><tex-math>-</tex-math></inline-formula> denote positive and negative samples. A contrastive loss is adopted in the pre-training process of the student model. Specifically, it is obtained by masking the short-form acronym (i.e., CL) in the input sentence of the student model against the “correct meaning” produced by the teacher without masking the corresponding phrases. To get the representation of the “reference” phrase in the dictionary (dotted frame), we perform the phrase averaged method by averaging the embeddings of the tokens (i.e., contrastive learning), which is presented with the upper bar. Meanwhile, we let the representations of positive and negative (i.e., Contrastive Learning) samples stay away from each other to enhance the model’s ability to distinguish confusing samples.</p>
      </sec>
      <sec id="sec-3-2">
        <title>4.2. Phrase-level contrastive pre-training</title>
        <p>The proposed method is composed of two pre-trained models that are both initialized with the SciBERT model, where one is the student model (noted as <inline-formula><tex-math>M_S</tex-math></inline-formula>) and the other is the teacher model (noted as <inline-formula><tex-math>M_T</tex-math></inline-formula>). During the pre-training phase, we only optimize the parameters of <inline-formula><tex-math>M_S</tex-math></inline-formula>, leaving the <inline-formula><tex-math>M_T</tex-math></inline-formula> model frozen. Given an input sentence <inline-formula><tex-math>x = [x_1, \ldots, x_n]</tex-math></inline-formula>, we intentionally mask the short acronym <inline-formula><tex-math>\hat{x}_i</tex-math></inline-formula> following the same pre-training task [17]. Then, we feed the masked sentence <inline-formula><tex-math>x'</tex-math></inline-formula> into the student to perform the pre-training task. As a result, we obtain the contextual representation <inline-formula><tex-math>\tilde{h} = [\tilde{h}_1, \ldots, \tilde{h}_n]</tex-math></inline-formula> in the student model, where the [MASK] is embedded as [mask]. At the same time, the teacher model replaces the corresponding short acronym <inline-formula><tex-math>\hat{x}_i</tex-math></inline-formula> in the original sentence <inline-formula><tex-math>x</tex-math></inline-formula> with the phrase in the dictionary <inline-formula><tex-math>D</tex-math></inline-formula> as input. It is intuitive that the teacher can distinguish all the probable representations with the dictionary, where we want the student model to distinguish the correct phrase meaning through CL. In the end, the well-formed phrase representation is utilized with the averaged embeddings. Thus, the final representation of the recovered sentence <inline-formula><tex-math>h = [h_1, \ldots, h_n]</tex-math></inline-formula> against the corresponding input sentence is produced by the teacher (see Figure 2). Following the work [16], we further refine the proposed phrase-level contrastive pre-training loss</p>
        <disp-formula><label>(2)</label><tex-math>\mathcal{L}_{\mathrm{CL}} = -\sum_{i=1}^{n} \sum_{j=1}^{m} \mathbb{1}(\hat{x}_i, d_j) \log \frac{\exp\left(\mathrm{S}(\tilde{h}_i, h_j)/\tau\right)}{\sum_{k=1}^{m} \exp\left(\mathrm{S}(\tilde{h}_i, h_k)/\tau\right)},</tex-math></disp-formula>
        <p>where the indicator function <inline-formula><tex-math>\mathbb{1}(\hat{x}_i, d_j) = 1</tex-math></inline-formula> if <inline-formula><tex-math>\hat{x}_i</tex-math></inline-formula> is the masked acronym and short for the corresponding long form <inline-formula><tex-math>d_j</tex-math></inline-formula>. Otherwise, <inline-formula><tex-math>\mathbb{1}(\hat{x}_i, d_j) = 0</tex-math></inline-formula>. We use <inline-formula><tex-math>\tau</tex-math></inline-formula> as the temperature hyper-parameter, and the notation <inline-formula><tex-math>\mathrm{S}(\cdot, \cdot)</tex-math></inline-formula> represents the similarity function, for which we choose the cosine function. The <inline-formula><tex-math>m</tex-math></inline-formula> is the number of all possible long-form acronyms.</p>
      </sec>
      <sec id="sec-3-5">
        <title>4.3. Optimizing objectives</title>
        <p>Naturally, the student model learns to pull the masked acronym closer to its corresponding “true” representation produced by the teacher and away from the meaning of the other confusing phrases in the sentence.</p>
        <p>In summary, the acronym representations learned by the student are more discriminative with respect to the confusing phrases, therefore better following an isotropic distribution [16]. Furthermore, the original pre-training method of the MLM [17] is also adopted for learning good document representations, including the masked language modeling task and the next sentence prediction (NSP) task. The overall optimizing objectives are performed as the continual pre-training in the domain-specific corpus, which can be shown as</p>
        <disp-formula><tex-math>\mathcal{L} = \mathcal{L}_{\mathrm{CL}} + \mathcal{L}_{\mathrm{MLM}} + \mathcal{L}_{\mathrm{NSP}}</tex-math></disp-formula>
        <p>where the pre-training step is totally unsupervised and can be carried out with the vast scientific English dataset.</p>
        <p>Once the pre-trained model is obtained, the student model will be fine-tuned on the acronym disambiguation task.</p>
      </sec>
      <sec id="sec-3-3">
        <title>4.4. Contrastive fine-tuning</title>
        <p>Concretely, given the final hidden state <inline-formula><tex-math>h_x</tex-math></inline-formula> of the input sentence, the representation of the probable phrases can be represented as <inline-formula><tex-math>h_d</tex-math></inline-formula>. We concatenate <inline-formula><tex-math>h_x</tex-math></inline-formula> and <inline-formula><tex-math>h_d</tex-math></inline-formula> to obtain the feature <inline-formula><tex-math>h</tex-math></inline-formula> for the binary classification and contrastive learning, which can be presented as</p>
        <disp-formula><label>(3)</label><tex-math>h = [h_x ; h_d]</tex-math></disp-formula>
        <p>A non-linear projection layer is added on top of the pre-trained model for obtaining the representation. The positive sample is noted as <inline-formula><tex-math>h^{+}</tex-math></inline-formula>, and the negative sample is noted as <inline-formula><tex-math>h^{-}</tex-math></inline-formula>. The calculation of the two types of feature can be shown as follows:</p>
        <disp-formula><label>(4)</label><tex-math>z^{+} = W_2 \, \mathrm{ReLU}\left(W_1 h^{+}\right)</tex-math></disp-formula>
        <disp-formula><label>(5)</label><tex-math>z^{-} = W_2 \, \mathrm{ReLU}\left(W_1 h^{-}\right)</tex-math></disp-formula>
        <p>Finally, we perform fine-tuning in a multi-task manner and take a weighted average of the two classification losses and the contrastive loss:</p>
        <disp-formula><label>(6)</label><tex-math>\mathcal{L} = \frac{1 - \lambda}{2} \left(\mathcal{L}_{+} + \mathcal{L}_{-}\right) + \lambda \mathcal{L}_{\mathrm{CL}}</tex-math></disp-formula>
        <p>where <inline-formula><tex-math>\lambda</tex-math></inline-formula> is the weight hyper-parameter.</p>
      </sec>
    </sec>
    <sec>
      <title>5. Experiment setup</title>
      <sec id="sec-3-4">
        <title>5.1. Baseline models</title>
      </sec>
      <sec id="sec-3-6">
        <title>5.2. Pre-training strategies</title>
        <p>We use the continuing pre-training strategy with the proposed method using the SciBERT model<sup>2</sup>. Except for</p>
        <p>ability of the different baselines and add the balanced weights to get the final predictions, where more implementation details can be found in the work [34].</p>
        <p><sup>2</sup> https://huggingface.co/allenai/scibert_scivocab_cased</p>
        <p><sup>3</sup> https://huggingface.co/datasets/scientific_papers</p>
        <p><sup>4</sup> https://huggingface.co/roberta-large</p>
        <p><sup>5</sup> https://huggingface.co/allenai/scibert_scivocab_uncased</p>
      </sec>
    </sec>
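    <sec>
      <p>The phrase-level contrastive objective of Section 4.2 can be illustrated with a small, dependency-free sketch for a single masked acronym. It assumes that each candidate phrase is represented by the average of its token embeddings and that the similarity is the cosine function; the function names, the toy two-dimensional vectors and the temperature value of 0.1 are illustrative assumptions, not settings reported in this paper.</p>
      <preformat><![CDATA[
```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_pool(token_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average token embeddings into a single phrase vector."""
    count = len(token_vectors)
    return [sum(column) / count for column in zip(*token_vectors)]


def phrase_contrastive_loss(
    student_vec: Sequence[float],
    phrase_vecs: Sequence[Sequence[float]],
    positive_index: int,
    temperature: float = 0.1,  # assumed value, for illustration only
) -> float:
    """Negative log-softmax of the positive phrase over all candidates."""
    logits = [cosine(student_vec, p) / temperature for p in phrase_vecs]
    top = max(logits)  # subtract the maximum for numerical stability
    log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_denominator - logits[positive_index]


# Toy vectors: the student representation of the masked acronym and
# the teacher token embeddings of two candidate long forms.
student = [1.0, 0.0]
correct_phrase = mean_pool([[0.9, 0.1], [1.0, 0.0]])
wrong_phrase = mean_pool([[0.0, 1.0], [0.2, 0.8]])
candidates = [correct_phrase, wrong_phrase]

loss_correct = phrase_contrastive_loss(student, candidates, positive_index=0)
loss_wrong = phrase_contrastive_loss(student, candidates, positive_index=1)
print(loss_correct, loss_wrong)
```
]]></preformat>
      <p>In this toy setting the loss is small when the student vector already lies close to the correct phrase and large otherwise, which is the behaviour the pre-training objective rewards. A full implementation would sum this term over all masked acronyms in a batch.</p>
    </sec>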
    <sec id="sec-4">
      <title>6. Results</title>
      <p>The main results of our model and the baselines are shown
in Table 2. The pre-trained models outperform the rule-based
method, because the rule-based method generalizes poorly and
finds it difficult to pick the correct phrase from the
confusing acronym options in the dictionary. SciBERT
beats RoBERTa in the three scores, which indicates
that domain-specific pre-training is significant for
scientific document understanding: the scientific domain
pre-trained model can capture a deep representation of
the confusing acronyms. The hdBERT merges different
types of hidden features to get better generalization in
binary classification, thereby performing well in this task.
The results of the BERT-MT demonstrate that there are
indeed many useful tricks that help the model become
more robust. The proposed method outperforms the other
baselines in the three scores, which indicates that
continuing contrastive pre-training can further improve the
pre-trained model’s ability to represent acronyms. The
ensemble method further improves the diversity of
the final results, thereby achieving the best performance
on the test set. In summary, we rank 1st on
the online leaderboard, which is shown in Table 3.</p>
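      <p>The Macro F1 metric of Section 3.2, on which the comparison above relies, can be sketched as follows. Per-class precision and recall are averaged first, and the F1 score is the harmonic mean of the two averages. The function name and the toy labels are hypothetical, and treating every label that occurs in either the gold or the predicted sequence as a class is an assumption; the official scorer of the shared task may differ in such details.</p>
      <preformat><![CDATA[
```python
from typing import Sequence, Tuple


def macro_scores(gold: Sequence[str], predicted: Sequence[str]) -> Tuple[float, float, float]:
    """Return (macro precision, macro recall, macro F1) over all labels."""
    labels = sorted(set(gold) | set(predicted))
    precisions, recalls = [], []
    for label in labels:
        true_pos = sum(1 for g, p in zip(gold, predicted) if g == label and p == label)
        predicted_count = sum(1 for p in predicted if p == label)
        gold_count = sum(1 for g in gold if g == label)
        # A class that is never predicted (or never occurs) contributes 0.
        precisions.append(true_pos / predicted_count if predicted_count else 0.0)
        recalls.append(true_pos / gold_count if gold_count else 0.0)
    precision = sum(precisions) / len(labels)
    recall = sum(recalls) / len(labels)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy labels: two classes, one of four predictions is wrong.
gold_labels = ["a", "a", "b", "b"]
predicted_labels = ["a", "b", "b", "b"]
macro_p, macro_r, macro_f1 = macro_scores(gold_labels, predicted_labels)
print(macro_p, macro_r, macro_f1)
```
]]></preformat>
      <p>Note that this harmonic mean of the macro-averaged precision and recall is not the same quantity as the average of per-class F1 scores, so the two should not be compared directly.</p>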
    </sec>
    <sec id="sec-5">
      <title>7. Conclusion</title>
      <p>We describe a simple framework for contrastive
learning of acronym disambiguation in shared task 2 of
SDU@AAAI-22. Many baselines are implemented to
compare with the proposed method, including methods
based on pre-training, combinations of different
structures, and useful tricks. The results demonstrate that
the proposed method outperforms all other baselines,
achieving the best performance (top-1) in the acronym
disambiguation of scientific English. It can be further
concluded that the continuing contrastive pre-training
method can enhance the model’s ability to represent the
confusing long-form phrases of an acronym, and that
contrastive fine-tuning can further enhance the generalization
ability. In future work, we will extend our work as
follows: (1) using twin networks to train the teacher
and the student together; (2) adopting fine-grained
and coarse-grained embeddings in the contrastive
pre-training to better capture the meaning of the
sentence.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgement</title>
      <sec id="sec-6-1">
        <p>This work is supported by the National Key Research and Development Project of China (2018YFB1305200) and the National Natural Science Fund of China (62171183, 61801178).</p>
        <p>text, in: Biocomputing 2003, World Scientific, 2002, pp. 451–462.</p>
        <p>[9] L. Luo, Z. Yang, P. Yang, Y. Zhang, L. Wang, H. Lin, J. Wang, An attention-based bilstm-crf approach to document-level chemical named entity recognition, Bioinformatics 34 (2018) 1381–1388.</p>
        <p>[10] F. Li, Z. Mai, W. Zou, W. Ou, X. Qin, Y. Lin, W. Zhang, Systems at sdu-2021 task 1: Transformers for sentence level sequence label., in: SDU@AAAI, 2021.</p>
        <p>[11] A. Jaber, P. Martínez, Participation of uc3m in sdu@aaai-21: A hybrid approach to disambiguate scientific acronyms., in: SDU@AAAI, 2021.</p>
        <p>[12] Q. Zhong, G. Zeng, D. Zhu, Y. Zhang, W. Lin, B. Chen, J. Tang, Leveraging domain agnostic and specific knowledge for acronym disambiguation., in: SDU@AAAI, 2021.</p>
        <p>[13] D. R. Kubal, A. Nagvenkar, Effective ensembling of transformer based language models for acronyms identification., in: SDU@AAAI, 2021.</p>
        <p>[14] C. Pan, B. Song, S. Wang, Z. Luo, Bert-based acronym disambiguation with multiple training strategies, arXiv preprint arXiv:2103.00488 (2021).</p>
        <p>[15] A. P. B. Veyseh, F. Dernoncourt, W. Chang, T. H. Nguyen, Maddog: A web-based system for acronym identification and disambiguation, in: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, 2021, pp. 160–167.</p>
        <p>[16] Y. Su, F. Liu, Z. Meng, L. Shu, E. Shareghi, N. Collier, Tacl: Improving bert pre-training with token-aware contrastive learning, arXiv preprint arXiv:2111.04198 (2021).</p>
        <p>[17] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 4171–4186.</p>
        <p>[18] Q. Jin, J. Liu, X. Lu, Deep contextualized biomedical abbreviation expansion, in: Proceedings of the 18th BioNLP Workshop and Shared Task, 2019, pp. 88–96.</p>
        <p>[19] D. Nadeau, P. D. Turney, A supervised learning approach to acronym identification, in: Conference of the Canadian Society for Computational Studies of Intelligence, Springer, 2005, pp. 319–329.</p>
        <p>[20] S. Chopra, R. Hadsell, Y. LeCun, Learning a similarity metric discriminatively, with application to face verification, in: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), volume 1, IEEE, 2005, pp. 539–546.</p>
        <p>[21] X. Wang, A. Gupta, Unsupervised learning of visual representations using videos, in: Proceedings of the IEEE international conference on computer vision, 2015, pp. 2794–2802.</p>
        <p>[22] T. Chen, S. Kornblith, M. Norouzi, G. Hinton, A simple framework for contrastive learning of visual representations, in: International conference on machine learning, PMLR, 2020, pp. 1597–1607.</p>
        <p>[23] Z. Wu, S. Wang, J. Gu, M. Khabsa, F. Sun, H. Ma, Clear: Contrastive learning for sentence representation, arXiv preprint arXiv:2012.15466 (2020).</p>
        <p>[24] F. Liu, I. Vulić, A. Korhonen, N. Collier, Fast, effective, and self-supervised: Transforming masked language models into universal lexical and sentence encoders, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Punta Cana, Dominican Republic and Online, 2021. URL: https://arxiv.org/abs/2104.08027.</p>
        <p>[25] S. Gururangan, A. Marasović, S. Swayamdipta, K. Lo, I. Beltagy, D. Downey, N. A. Smith, Don’t stop pretraining: Adapt language models to domains and tasks, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 8342–8360.</p>
        <p>[26] R. Han, X. Ren, N. Peng, Econet: Effective continual pretraining of language models for event temporal reasoning, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021, pp. 5367–5380.</p>
        <p>[27] S. Y. R. J. F. D. T. H. N. Amir Pouran Ben Veyseh, Nicole Meister, MACRONYM: A Large-Scale Dataset for Multilingual and Multi-Domain Acronym Extraction, in: arXiv, 2022.</p>
        <p>[28] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, Roberta: A robustly optimized bert pretraining approach, arXiv preprint arXiv:1907.11692 (2019).</p>
        <p>[29] R. Sennrich, B. Haddow, A. Birch, Neural machine translation of rare words with subword units, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, pp. 1715–1725.</p>
        <p>[30] I. Beltagy, K. Lo, A. Cohan, Scibert: A pretrained language model for scientific text, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 3615–3620.</p>
        <p>[31] I. Loshchilov, F. Hutter, Decoupled weight decay regularization, arXiv preprint arXiv:1711.05101 (2017).</p>
        <p>[32] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, et al., Huggingface’s transformers: State-of-the-art natural language processing, arXiv preprint arXiv:1910.03771 (2019).</p>
        <p>[33] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).</p>
        <p>[34] G. P. C. Fung, J. X. Yu, H. Wang, D. W. Cheung, H. Liu, A balanced ensemble approach to weighting classifiers for text classification, in: Sixth International Conference on Data Mining (ICDM’06), IEEE, 2006, pp. 869–873.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>X.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>Pretrained models for natural language processing: A survey</article-title>
          ,
          <source>Science China Technological Sciences</source>
          (
          <year>2020</year>
          )
          <fpage>1</fpage>
          -
          <lpage>26</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A. P. B.</given-names>
            <surname>Veyseh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Dernoncourt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. H.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Celi</surname>
          </string-name>
          ,
          <article-title>Acronym identification and disambiguation shared tasks for scientific document understanding</article-title>
          , arXiv preprint arXiv:2012.11760 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
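          <string-name>
            <given-names>A. P. B.</given-names>
            <surname>Veyseh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Meister</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yoon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Dernoncourt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. H.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <article-title>Multilingual acronym extraction and disambiguation shared tasks at SDU 2022</article-title>
          ,
          <source>in: Proceedings of SDU@AAAI-22</source>
          ,
          <year>2022</year>
          .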
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Gardner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Berant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hajishirzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Talmor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Min</surname>
          </string-name>
          ,
          <article-title>On making reading comprehension more comprehensive</article-title>
          ,
          <source>in: Proceedings of the 2nd Workshop on Machine Reading for Question Answering</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>105</fpage>
          -
          <lpage>112</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Guan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>LOT: A benchmark for evaluating Chinese long text understanding and generation</article-title>
          ,
          <source>arXiv preprint arXiv:2108.12960</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Weng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <article-title>More but correct: Generating diversified and entity-revised medical response</article-title>
          ,
          <source>arXiv preprint arXiv:2108.01266</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A. P. B.</given-names>
            <surname>Veyseh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Dernoncourt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. H.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. H.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <article-title>What does this acronym mean? Introducing a new dataset for acronym identification and disambiguation</article-title>
          ,
          <source>in: Proceedings of the 28th International Conference on Computational Linguistics</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>3285</fpage>
          -
          <lpage>3301</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Schwartz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Hearst</surname>
          </string-name>
          ,
          <article-title>A simple algorithm for identifying abbreviation definitions in biomedical</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>