<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>BERT-based Acronym Disambiguation with Multiple Training Strategies</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Chunguang Pan</string-name>
          <email>panchg@deepblueai.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bingyan Song</string-name>
          <email>songby@deepblueai.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shengguang Wang</string-name>
          <email>wangshg@deepblueai.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhipeng Luo</string-name>
          <email>luozpg@deepblueai.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DeepBlue Technology (Shanghai) Co.</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>The acronym disambiguation (AD) task aims to find the correct expansion of an ambiguous acronym in a given sentence. Although acronyms are convenient to use, they can be difficult to understand, so identifying the appropriate expansion of an acronym is a practical task in natural language processing. Since little work has been done on AD in the scientific field, we propose a binary classification model that incorporates BERT and several training strategies: dynamic negative sample selection, task adaptive pretraining, adversarial training, and pseudo-labeling. Experiments on SciAD show the effectiveness of our proposed model, and our score ranks 1st in SDU@AAAI-21 shared task 2: Acronym Disambiguation.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        An acronym is a word created from the initial components
of a phrase or name, called the expansion
        <xref ref-type="bibr" rid="ref1 ref11">(Jacobs, Itai, and
Wintner 2020)</xref>
        . In many documents, especially in the scientific and medical
fields, the number of acronyms is increasing at an incredible rate.
By using acronyms, people can avoid repeating frequently used long
phrases. For example, CNN is an acronym with the expansion
Convolutional Neural Network, though depending on the context it can
also expand to other phrases, such as Condensed Nearest Neighbor.
      </p>
      <p>Understanding the correlation between acronyms and
their expansions is critical for several applications in natural
language processing, including text classification, question
answering and so on.</p>
      <p>Despite the convenience of acronyms, they can sometimes
be difficult to understand, especially for people who are not
familiar with a specific area such as the scientific or medical
field. Therefore, it is necessary to develop a system that can
automatically resolve the appropriate meaning of an acronym
from its context.</p>
      <p>Given an acronym and several possible expansions, the
acronym disambiguation (AD) task is to determine which
expansion is correct for a particular context. The scientific
acronym disambiguation task is challenging due to the high
ambiguity of acronyms. For example, as shown in Figure
1, SVM has two expansions in the dictionary. According to
the contextual information from the input sentence, SVM
here stands for Support Vector Machine, which is quite
similar to State Vector Machine.</p>
      <p>
        Consequently, AD is formulated as a classification
problem: given a sentence and an acronym, the goal is to
predict the expansion of the acronym from a given candidate
set. Over the past two decades, several kinds of approaches
have been proposed. At the beginning, pattern-matching
techniques were popular. These approaches
        <xref ref-type="bibr" rid="ref30">(Taghva and Gilbreth 1999)</xref>
        designed rules and patterns to find the corresponding
expansion of each acronym. However, because pattern-matching
methods require considerable human effort to design and
tune the rules and patterns, machine learning based methods
(e.g., CRF and SVM)
        <xref ref-type="bibr" rid="ref14 ref17">(Liu, Liu, and Huang 2017)</xref>
        have been
preferred. More recently, deep learning methods
        <xref ref-type="bibr" rid="ref12 ref2 ref3">(Charbonnier and Wartena 2018; Jin, Liu, and Lu 2019)</xref>
        have been adopted
to solve this task.
      </p>
      <p>
        Recently, pre-trained language models such as ELMo
        <xref ref-type="bibr" rid="ref24">(Peters et al. 2018)</xref>
        and BERT
        <xref ref-type="bibr" rid="ref4">(Devlin et al. 2018)</xref>
        have shown
their effectiveness in contextual representation. Inspired by
these pre-trained models, we propose a binary classification
model for acronym disambiguation.
We evaluate the proposed method on the dataset
released by the SDU@AAAI 2021 Shared Task: Acronym
Disambiguation
        <xref ref-type="bibr" rid="ref11 ref31 ref32 ref8">(Veyseh et al. 2020a)</xref>
        . Experimental results
show that our model handles the task effectively, and
we won first place in the competition.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <sec id="sec-2-1">
        <title>Acronym Disambiguation</title>
        <p>
          Acronym disambiguation has received a lot of attention in
vertical domains, especially in the biomedical field. Most of
the proposed methods
          <xref ref-type="bibr" rid="ref26">(Schwartz and Hearst 2002)</xref>
          utilize
generic rules or text patterns to discover acronym
expansions. These methods usually apply where
acronyms are co-mentioned with the corresponding
expansions in the same document. However, in scientific papers,
this rarely happens: it is very common for people to define
an acronym in one place and use it elsewhere. Thus,
such methods cannot be used for acronym disambiguation
in the scientific field.
        </p>
        <p>
          There have been a few works
          <xref ref-type="bibr" rid="ref1 ref21">(Nadeau and Turney 2005)</xref>
          on automatically mining acronym expansions by leveraging
Web data (e.g., click logs and query sessions). However, we
cannot apply them directly to scientific data, since most
scientific data are raw text, and logs of query
sessions or clicks are therefore rarely available.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Pre-trained Models</title>
        <p>Substantial work has shown that pre-trained models (PTMs)
trained on large unlabeled corpora can learn universal language
representations, which benefit downstream NLP
tasks and avoid training a new model from scratch.</p>
        <p>
          The first-generation PTMs aim to learn good word
embeddings. Because the models themselves are no longer needed by
downstream tasks once the embeddings have been learned, they are
usually very shallow for computational efficiency; examples include Skip-Gram
          <xref ref-type="bibr" rid="ref19">(Mikolov et al.
2013)</xref>
          and GloVe
          <xref ref-type="bibr" rid="ref23">(Pennington, Socher, and Manning 2014)</xref>
          .
Although these pre-trained embeddings can
capture the semantic meanings of words, they fail to capture
higher-level concepts in context, such as polysemy
disambiguation and semantic roles. The second-generation PTMs
focus on learning contextual word embeddings, such as
ELMo
          <xref ref-type="bibr" rid="ref24">(Peters et al. 2018)</xref>
          , OpenAI GPT
          <xref ref-type="bibr" rid="ref25">(Radford et al.
2018)</xref>
          and BERT
          <xref ref-type="bibr" rid="ref4">(Devlin et al. 2018)</xref>
          . These learned
encoders are still needed to generate word embeddings in
context when they are used in downstream tasks.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>Adversarial Training</title>
        <p>
          Adversarial training (AT)
          <xref ref-type="bibr" rid="ref7">(Goodfellow, Shlens, and Szegedy
2014)</xref>
          is a means of regularizing classification algorithms by
adding adversarial noise to the training data. It was first
introduced in image classification tasks, where the input data
are continuous.
        </p>
        <p>
          <xref ref-type="bibr" rid="ref20">Miyato, Dai, and Goodfellow (2017</xref>
          ) extend adversarial
and virtual adversarial training to text classification by
applying perturbations to the word embeddings, and propose
an end-to-end way of perturbing data by utilizing the
gradient information.
          <xref ref-type="bibr" rid="ref34">Zhu, Li, and Zhou (2019</xref>
          ) propose an
adversarial attention network for the task of multi-dimensional
emotion regression, which automatically rates multiple
emotion dimension scores for an input text.
        </p>
        <p>[Figure 2: Distribution of the number of acronyms per sentence. x-axis: number of acronyms per sentence; y-axis: frequency of sentences.]</p>
        <p>[Figure 3: Distribution of the number of expansions per acronym. y-axis: frequency of acronyms.]</p>
        <p>
          There are also other works for regularizing classifiers by
adding random noise to the data, such as dropout
          <xref ref-type="bibr" rid="ref29">(Srivastava et al. 2014)</xref>
          and its variant for NLP tasks, word dropout
          <xref ref-type="bibr" rid="ref10">(Iyyer et al. 2015)</xref>
          .
          <xref ref-type="bibr" rid="ref33">Xie et al. (2019)</xref>
          discuss various data
noising techniques for language models and provide
empirical analysis validating the relationship between noising
and smoothing.
          <xref ref-type="bibr" rid="ref28">Søgaard (2013)</xref>
          and
          <xref ref-type="bibr" rid="ref16">Li, Cohn, and Baldwin
(2017</xref>
          ) focus on linguistic adversaries.
        </p>
        <p>Combining the advantages of the works above, we
propose a binary classification model that utilizes BERT and
several training strategies, including adversarial training.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Data</title>
      <p>In this paper, we use the AD dataset called SciAD
released by Veyseh et al. (2020b). They collect a corpus of
6,786 English papers from arXiv and these papers consist of
2,031,592 sentences that could be used for data annotation.</p>
      <p>The dataset contains 62,441 samples, where each sample
consists of a sentence, an ambiguous acronym, and its correct
meaning (one of the meanings of the acronym recorded in
the dictionary, as shown in Figure 1).</p>
      <p>
        Figure 2 and Figure 3 show statistics of the SciAD
dataset. More specifically, Figure 2 shows the distribution
of the number of acronyms per sentence. Each sentence can
have more than one acronym, and most sentences have 1 or 2
acronyms. Figure 3 shows the distribution of the number of
expansions per acronym. This distribution
is consistent with the one presented in prior
work
        <xref ref-type="bibr" rid="ref2">(Charbonnier and Wartena 2018)</xref>
        : in both
distributions, acronyms with 2 or 3 meanings have the highest
number of samples in the dataset
        <xref ref-type="bibr" rid="ref31 ref32">(Veyseh et al. 2020b)</xref>
        .
      </p>
    </sec>
    <sec id="sec-4">
      <title>Binary Classification Model</title>
      <p>The input of the binary classification model is a sentence
with an ambiguous acronym and a possible expansion. The
model needs to predict whether the expansion is the
corresponding expansion of the given acronym. Given an
input sentence, the model will assign a predicted score to
each candidate expansion. The candidate expansion with the
highest score will be the model output. Figure 4 shows an
example of the procedure.</p>
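      <p>The selection step described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: pick_expansion and toy_score are hypothetical names, and toy_score is a word-overlap stand-in for the trained binary classifier, used only so that the example runs on its own.</p>
      <preformat><![CDATA[
```python
from typing import Callable, Sequence


def pick_expansion(
    sentence: str,
    candidates: Sequence[str],
    score_fn: Callable[[str, str], float],
) -> str:
    """Return the candidate expansion with the highest predicted score."""
    if not candidates:
        raise ValueError("at least one candidate expansion is required")
    # Score every (candidate, sentence) pair independently, then take the argmax.
    return max(candidates, key=lambda cand: score_fn(cand, sentence))


def toy_score(candidate: str, sentence: str) -> float:
    """Hypothetical stand-in for the binary classifier: word-overlap ratio."""
    cand_words = set(candidate.lower().split())
    sent_words = set(sentence.lower().split())
    return len(cand_words.intersection(sent_words)) / len(cand_words)


best = pick_expansion(
    "We train a SVM classifier, a kernel machine with support vectors.",
    ["Support Vector Machine", "State Vector Machine"],
    toy_score,
)
```
]]></preformat>
      <p>In the real system, score_fn would be the sigmoid output of the binary model for one candidate and one sentence; the argmax over candidates is the only part taken directly from the description above.</p>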
      <sec id="sec-4-1">
        <title>Input Format</title>
        <p>Since BERT can process multiple input sentences with
segment embeddings, we use the candidate expansion as the
first input segment, and the given text as the second input
segment. We separate these two input segments with the
special token [SEP]. Furthermore, we add two special tokens,
&lt;start&gt; and &lt;end&gt;, to wrap the acronym in the text,
so that the acronym gets enough attention
from the model.</p>
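        <p>A minimal sketch of this input format is given below. It assumes the standard BERT layout, in which [CLS] opens the sequence and [SEP] closes each segment; the function name build_input and the whitespace tokenization are illustrative assumptions, not details taken from the paper.</p>
        <preformat><![CDATA[
```python
from typing import List


def build_input(expansion: str, tokens: List[str], acronym_index: int) -> str:
    """Format one (candidate expansion, sentence) pair as a two-segment input."""
    # Wrap the acronym with the two marker tokens described in the text.
    marked = (
        tokens[:acronym_index]
        + ["<start>", tokens[acronym_index], "<end>"]
        + tokens[acronym_index + 1:]
    )
    # Segment 1 is the candidate expansion, segment 2 is the marked sentence.
    return "[CLS] " + expansion + " [SEP] " + " ".join(marked) + " [SEP]"


example = build_input("Support Vector Machine", ["We", "use", "SVM", "here"], 2)
```
]]></preformat>
        <p>With a real tokenizer, the two marker tokens would also have to be registered as special tokens so that they are not split into word pieces.</p>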
      </sec>
      <sec id="sec-4-2">
        <title>Binary Model Architecture</title>
        <p>The model architecture is described in detail in Figure 5.
First, we use a BERT encoder to get the representation of the
input segments. Next, we average the representations at the start and
end positions of the acronym, and concatenate this averaged
representation with the [CLS] position vector. Then, we send this
concatenated vector into a binary classifier for prediction.
The representation first passes through a dropout layer
          <xref ref-type="bibr" rid="ref29">(Srivastava et al. 2014)</xref>
          and a feedforward layer. The output of these
layers is then fed into a ReLU
          <xref ref-type="bibr" rid="ref6">(Glorot, Bordes, and
Bengio 2011)</xref>
          activation. After this, the resulting vector passes
through a dropout layer and a feedforward layer again. The
final prediction is obtained through a sigmoid
activation.
        </p>
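        <p>The classification head can be illustrated with a small pure-Python sketch. It is only a reading of the description under stated assumptions: the encoder output is replaced by a hand-written list of vectors, position 0 is assumed to be [CLS], dropout is omitted because it acts as the identity at inference time, and all names and weights are hypothetical.</p>
        <preformat><![CDATA[
```python
import math
from typing import List, Sequence

Vector = List[float]


def classifier_features(hidden: Sequence[Vector], start: int, end: int) -> Vector:
    """Average the vectors at the start and end positions, then append [CLS]."""
    span_mean = [(a + b) / 2.0 for a, b in zip(hidden[start], hidden[end])]
    return span_mean + list(hidden[0])  # hidden[0]: assumed [CLS] position


def linear(x: Vector, weights: Sequence[Vector], bias: Vector) -> Vector:
    """Plain feedforward layer: one dot product per output unit, plus bias."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def predict(features: Vector, w1, b1, w2, b2) -> float:
    """Feedforward -> ReLU -> feedforward -> sigmoid (dropout omitted)."""
    hidden = [max(0.0, v) for v in linear(features, w1, b1)]
    logit = linear(hidden, w2, b2)[0]
    return 1.0 / (1.0 + math.exp(-logit))


# Toy encoder output: three positions with two dimensions each.
toy_hidden = [[1.0, 0.0], [0.0, 2.0], [4.0, 6.0]]
features = classifier_features(toy_hidden, start=1, end=2)
score = predict(
    features,
    w1=[[1.0, 0.0, 0.0, 0.0], [0.0, -1.0, 0.0, 0.0]],
    b1=[0.0, 0.0],
    w2=[[1.0, 1.0]],
    b2=[-2.0],
)
```
]]></preformat>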
      </sec>
      <sec id="sec-4-3">
        <title>Training Strategies</title>
        <p>Pretrained Models. Experiments in previous work
have shown the effectiveness of pretrained models.
Starting from the BERT model, many improved pretrained
models have been proposed. RoBERTa uses dynamic masking and removes the next
sentence prediction task. In our experiments, we compare
BERT and RoBERTa models trained on corpora from different
fields.</p>
      </sec>
      <sec id="sec-4-4">
        <title>Dynamic Negative Sample Selection</title>
        <p>During training, we dynamically select a fixed number of negative samples for
each batch, which ensures that the model is trained on more
balanced positive and negative data, while all negative samples
still get used over the course of training.</p>
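        <p>One possible reading of this strategy is sketched below. The paper does not give the sampling procedure, so the details here are assumptions: negatives are drawn from a shuffled pool that is refilled only when it runs empty, and dynamic_batches, pos_per_batch, and neg_per_batch are hypothetical names.</p>
        <preformat><![CDATA[
```python
import random
from typing import Iterator, List, Sequence, Tuple

Pair = Tuple[str, str]  # (sentence, candidate expansion)


def dynamic_batches(
    positives: Sequence[Pair],
    negatives: Sequence[Pair],
    pos_per_batch: int,
    neg_per_batch: int,
    seed: int = 0,
) -> Iterator[List[Tuple[Pair, int]]]:
    """Yield batches that each contain a fixed number of drawn negatives."""
    if neg_per_batch > 0 and not negatives:
        raise ValueError("negatives must not be empty")
    rng = random.Random(seed)
    neg_pool: List[Pair] = []
    pos_order = list(positives)
    rng.shuffle(pos_order)
    for start in range(0, len(pos_order), pos_per_batch):
        batch = [(pair, 1) for pair in pos_order[start:start + pos_per_batch]]
        for _ in range(neg_per_batch):
            if not neg_pool:
                # Refill and reshuffle so every negative is used before repeats.
                neg_pool = list(negatives)
                rng.shuffle(neg_pool)
            batch.append((neg_pool.pop(), 0))
        yield batch


toy_pos = [("s%d" % i, "right") for i in range(4)]
toy_neg = [("s%d" % i, "wrong") for i in range(6)]
batches = list(dynamic_batches(toy_pos, toy_neg, pos_per_batch=2, neg_per_batch=3))
```
]]></preformat>
        <p>Keeping the positive-to-negative ratio fixed per batch is what balances the data; the pool refill is what lets every negative appear at some point.</p>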
      </sec>
      <sec id="sec-4-5">
        <title>Task Adaptive Pretraining</title>
        <p>
          <xref ref-type="bibr" rid="ref8">Gururangan et al. (2020)</xref>
          show that task-adaptive pretraining (TAPT) can effectively
improve model performance. The task-specific dataset
usually covers only a subset of the data used for general pretraining,
so we can achieve significant improvement by further pretraining
on the masked language model task with the given dataset.
        </p>
        <p>
          Adversarial Training. Adversarial training is a popular
approach to increasing the robustness of neural networks. As
shown in
          <xref ref-type="bibr" rid="ref20">Miyato, Dai, and Goodfellow (2017</xref>
          ),
adversarial training has a good regularization effect. By adding
perturbations to the embedding layer, we can get more stable
word representations and a more generalized model, which
significantly improves model performance on unseen data.
        </p>
        <p>
          Pseudo-Labeling. Pseudo-labeling
          <xref ref-type="bibr" rid="ref22 ref27 ref9">(Iscen et al. 2019;
Oliver et al. 2018; Shi et al. 2018)</xref>
          uses network predictions
with high confidence as labels. We mix these pseudo labels
with the training set to generate a new dataset. We
then use this new dataset to train a new binary
classification model. Pseudo-labeling has proved to be an effective
way to utilize unlabeled data for better performance.
        </p>
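        <p>For the adversarial training strategy described in this section, the perturbation added to the embedding layer can be sketched as below. The paper does not state which variant is used; this sketch assumes the L2-normalized gradient perturbation of Miyato, Dai, and Goodfellow (2017), and the function name, epsilon value, and gradient values are made up for illustration.</p>
        <preformat><![CDATA[
```python
import math
from typing import List


def fgm_perturbation(grad: List[float], epsilon: float = 1.0) -> List[float]:
    """Rescale the loss gradient w.r.t. an embedding to have L2 norm epsilon."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0 for _ in grad]  # no gradient signal, so no perturbation
    return [epsilon * g / norm for g in grad]


embedding = [0.2, -0.1, 0.4]
grad = [3.0, 0.0, -4.0]  # made-up gradient of the loss w.r.t. the embedding
delta = fgm_perturbation(grad, epsilon=0.5)
# The model is then trained on the perturbed embedding as well as the clean one.
adversarial_embedding = [e + d for e, d in zip(embedding, delta)]
```
]]></preformat>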
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Experiments</title>
      <sec id="sec-5-1">
        <title>Hyper parameters</title>
        <p>
          The batch size used in our experiments is 32. We train each
model for 15 epochs. The initial learning rate for the text
encoder is 1.0 × 10<sup>−5</sup>, and for the other parameters the initial
learning rate is set to 5.0 × 10<sup>−4</sup>. We evaluate our model
on the validation set at each epoch. If the macro F1 score
does not increase, we decay the learning rate by a factor
of 0.1. The minimum learning rate is 5.0 × 10<sup>−7</sup>. We use
the Adam optimizer
          <xref ref-type="bibr" rid="ref14">(Kingma and Ba 2017)</xref>
          in all our
experiments.
        </p>
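        <p>The decay rule above can be written as a small function. This is a sketch of the rule as stated, not the authors' code: the source does not say which scheduler implementation was used, step_learning_rate is a hypothetical name, and the validation scores in the example are invented.</p>
        <preformat><![CDATA[
```python
from typing import Tuple


def step_learning_rate(
    lr: float,
    macro_f1: float,
    best_f1: float,
    factor: float = 0.1,
    min_lr: float = 5.0e-7,
) -> Tuple[float, float]:
    """Return (new_lr, new_best_f1) after one validation pass."""
    if macro_f1 > best_f1:
        return lr, macro_f1  # improvement: keep the current rate
    # No improvement: multiply by the decay factor, but never go below min_lr.
    return max(lr * factor, min_lr), best_f1


lr, best = 1.0e-5, float("-inf")
history = []
for f1 in [0.90, 0.92, 0.91, 0.91, 0.91]:  # made-up validation scores
    lr, best = step_learning_rate(lr, f1, best)
    history.append(lr)
```
]]></preformat>
        <p>The same function would be applied separately to the encoder rate and to the rate of the other parameters, since they start from different values.</p>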
      </sec>
      <sec id="sec-5-2">
        <title>Pretrained Models</title>
        <p>Since different pretrained models are trained on different
data, we run experiments with several of them.
Table 1 shows our experimental results for different pretrained
models on the validation set. The bert-base model gets the
highest score among the commonly used pretrained models (the top three
rows in Table 1). Since a large proportion of the texts in the given
dataset come from the computer science field, the cs-roberta
model outperforms the bert-base model by 1.6 percentage points. The
best model in our experiments is the scibert model, which
achieves an F1 score of 89%.</p>
        <p>Table 1: Pretrained models compared on the validation set: bert-base-uncased, bert-large-uncased, roberta-base, cs-roberta-base, and scibert-scivocab-uncased.</p>
        <p>Combining Training Strategies. We run further
experiments on the validation set to verify the effectiveness of each
strategy mentioned above. The results are shown in Table
2. As shown in the table, the F1 score increases by 4 percentage points
with dynamic sampling. TAPT and adversarial training
further improve the performance on the validation set by 0.47
percentage points. Finally, we use the pseudo-labeling method: samples from
the test set with a score higher than 0.95 are selected and
mixed with the training set. This still slightly improves the F1
score.</p>
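        <p>The pseudo-label selection step can be sketched as follows. Only the 0.95 threshold comes from the text; treating every selected pair as a positive example, and the names select_pseudo_labels and Example, are assumptions made for this illustration.</p>
        <preformat><![CDATA[
```python
from typing import List, Sequence, Tuple

Example = Tuple[str, str]  # (sentence, candidate expansion)


def select_pseudo_labels(
    unlabeled: Sequence[Example],
    scores: Sequence[float],
    threshold: float = 0.95,
) -> List[Tuple[Example, int]]:
    """Keep unlabeled pairs whose predicted score is above the threshold."""
    if len(unlabeled) != len(scores):
        raise ValueError("unlabeled and scores must have the same length")
    # Confident predictions become positive training examples (label 1).
    return [(ex, 1) for ex, s in zip(unlabeled, scores) if s > threshold]


test_pairs = [("s1", "e1"), ("s2", "e2"), ("s3", "e3"), ("s4", "e4")]
test_scores = [0.99, 0.50, 0.96, 0.95]
pseudo = select_pseudo_labels(test_pairs, test_scores)
train_set = [(("s0", "e0"), 1)] + pseudo  # mixed with the labeled data
```
]]></preformat>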
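        <table-wrap>
          <label>Table 2</label>
          <caption>
            <p>Results of combining training strategies on the validation set.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Model</th><th>Precision</th><th>Recall</th><th>F1</th></tr>
            </thead>
            <tbody>
              <tr><td>scibert-scivocab-uncased</td><td>0.9263</td><td>0.8569</td><td>0.8902</td></tr>
              <tr><td>+dynamic sampling</td><td>0.9575</td><td>0.9060</td><td>0.9310</td></tr>
              <tr><td>+task adaptive pretraining</td><td>0.9610</td><td>0.9055</td><td>0.9324</td></tr>
              <tr><td>+adversarial training</td><td>0.9651</td><td>0.9082</td><td>0.9358</td></tr>
              <tr><td>+pseudo-labeling</td><td>0.9629</td><td>0.9106</td><td>0.9360</td></tr>
            </tbody>
          </table>
        </table-wrap>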
        <p>Error Analysis. We gather a sample of 100 development
set examples that our model misclassified and examine
them manually.</p>
        <p>From these examples, we find two main
cases in which the model gives the wrong prediction. The first
is that the candidate expansions are too similar, or even
have the same meaning in different forms. For example, in
the sentence ‘The SC is decreasing for increasing values of
...’, the correct expansion for ‘SC’ is ‘sum capacities’, while
our prediction is ‘sum capacity’, which has the same meaning
as the correct one but is in the singular form.</p>
        <p>The second is that the given sentence contains too little contextual
information for prediction. For instance, the
correct expansion for ‘ML’ in the sentence ‘ML models are
usually much more complex, see Figure.’ is ‘model logic’, whereas the
predicted expansion is ‘machine learning’. Even people can
hardly tell which one is right based only on the given
sentence.</p>
        <p>Time Complexity. To analyze the time complexity of our
proposed method, we report the actual
running time observed in our experiments. These measurements are
neither precise nor exhaustive, but we believe they
give readers a rough estimate of the time
complexity of our model.</p>
        <p>We use the TAPT strategy to further train the scibert model
on eight NVIDIA TITAN V (12GB) GPUs. It takes three
hours to train 100 epochs in total.</p>
        <p>After obtaining the new pretrained model, we train the
binary classification model on two NVIDIA TITAN V GPUs. Table 3
shows the average training and inference time per epoch
after adding adversarial training and pseudo-labeling, respectively.
The model begins to converge after five
epochs. Inference takes nearly the same time in all settings,
while the training time roughly doubles after adversarial
training is added.</p>
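        <table-wrap>
          <label>Table 3</label>
          <caption>
            <p>Average training and inference time per epoch.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Model</th><th>Train</th><th>Inference</th></tr>
            </thead>
            <tbody>
              <tr><td></td><td>1588s</td><td>150.42s</td></tr>
              <tr><td>+adversarial training</td><td>3021s</td><td>149.64s</td></tr>
              <tr><td>+pseudo-labeling</td><td>3328s</td><td>149.36s</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-5-3">
        <title>Comparison Results</title>
        <p>We compared our results with
several other models. Precision, recall, and F1 of our proposed
model are computed on the testing data via the cross-validation
method.</p>
        <list list-type="bullet">
          <list-item>
            <p>MF &amp; ADE: Non-deep-learning models that utilize rules
or hand-crafted features
              <xref ref-type="bibr" rid="ref15">(Li et al. 2018)</xref>
              .</p>
          </list-item>
          <list-item>
            <p>NOA &amp; UAD: Language-model-based baselines that train
the word embeddings using the training corpus
              <xref ref-type="bibr" rid="ref2 ref3">(Charbonnier and Wartena 2018; Ciosici and Assent 2019)</xref>
              .</p>
          </list-item>
          <list-item>
            <p>BEM &amp; DECBAE: Models that employ deep architectures
(e.g., LSTM)
              <xref ref-type="bibr" rid="ref1 ref12 ref3">(Jin, Liu, and Lu 2019; Blevins and
Zettlemoyer 2020)</xref>
              .</p>
          </list-item>
          <list-item>
            <p>GAD: A deep learning model that utilizes the syntactic
structure of the sentence
              <xref ref-type="bibr" rid="ref31 ref32">(Veyseh et al. 2020b)</xref>
              .</p>
          </list-item>
        </list>
        <table-wrap>
          <label>Table 4</label>
          <caption>
            <p>Precision, recall, and F1 of the compared models on SciAD.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Model</th><th>Precision</th><th>Recall</th><th>F1</th></tr>
            </thead>
            <tbody>
              <tr><td>MF</td><td>0.8903</td><td>0.4220</td><td>0.5726</td></tr>
              <tr><td>ADE</td><td>0.8674</td><td>0.4325</td><td>0.5772</td></tr>
              <tr><td>NOA</td><td>0.7814</td><td>0.3506</td><td>0.4840</td></tr>
              <tr><td>UAD</td><td>0.8901</td><td>0.7008</td><td>0.7837</td></tr>
              <tr><td>BEM</td><td>0.8675</td><td>0.3594</td><td>0.5082</td></tr>
              <tr><td>DECBAE</td><td>0.8867</td><td>0.7432</td><td>0.8086</td></tr>
              <tr><td>GAD</td><td>0.8927</td><td>0.7666</td><td>0.8190</td></tr>
              <tr><td>Ours</td><td>0.9695</td><td>0.9132</td><td>0.9405</td></tr>
              <tr><td>Human Performance</td><td>0.9782</td><td>0.9445</td><td>0.9610</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>As shown in Table 4, rules and features fail to capture all
patterns of expressing the meanings of the acronym, resulting
in poorer recall on expansions compared to acronyms. In
contrast, the deep learning models have comparable recall on
expansions and acronyms, showing the importance of
pretrained word embeddings and deep architectures for AD.
However, they all fall far behind human-level performance.
Among all the models, our proposed model achieves the best
results on SciAD and is very close to human
performance, which shows the capability of the strategies we
introduced above.</p>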
        <p>
          SDU@AAAI 2021 Shared Task: Acronym
Disambiguation. The competition results are shown in Table 5. We
show the scores of the top 5 ranked models as well as the
baseline model. The baseline model is released by the provider
of the SciAD dataset
          <xref ref-type="bibr" rid="ref31 ref32">(Veyseh et al. 2020b)</xref>
          . Our model
performs best among all entries on the leaderboard and outperforms the
second place by 0.32%. In addition, our model outperforms
the baseline model by 12.15%, which is a great improvement.
        </p>
        <p>Table 5: Competition leaderboard rows: Rank1, Rank2, Rank3, Rank4, Rank5, and Baseline.</p>
      </sec>
      <sec id="sec-5-4">
        <title>Conclusion</title>
        <p>In this paper, we introduce a binary classification model for
acronym disambiguation. We utilize the BERT encoder to
get the input representations and adopt several strategies,
including dynamic negative sample selection, task adaptive
pretraining, adversarial training, and pseudo-labeling.
Experiments on SciAD show the validity of our proposed model,
and we won first place in the SDU@AAAI-2021 Shared
Task 2.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Blevins</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Zettlemoyer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <year>2020</year>
          .
          <article-title>Moving down the long tail of word sense disambiguation with gloss-informed biencoders</article-title>
          . arXiv preprint arXiv:2005.02590.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Charbonnier</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Wartena</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Using word embeddings for unsupervised acronym disambiguation</article-title>
          .
          <source>In Proceedings of the 27th International Conference on Computational Linguistics</source>
          ,
          <fpage>2610</fpage>
          -
          <lpage>2619</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Ciosici</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Assent</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Abbreviation explorer-an interactive system for pre-evaluation of unsupervised abbreviation disambiguation</article-title>
          .
          <source>In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations)</source>
          ,
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>M.-W.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Bert: Pre-training of deep bidirectional transformers for language understanding</article-title>
          . arXiv preprint arXiv:1810.04805.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Glorot</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Bordes</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <year>2011</year>
          .
          <article-title>Deep sparse rectifier neural networks</article-title>
          .
          <source>In Proceedings of the fourteenth international conference on artificial intelligence and statistics</source>
          ,
          <fpage>315</fpage>
          -
          <lpage>323</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Goodfellow</surname>
            ,
            <given-names>I. J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Shlens</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Szegedy</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>Explaining and harnessing adversarial examples</article-title>
          . arXiv preprint arXiv:1412.6572.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Gururangan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Marasović</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Swayamdipta</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lo</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Beltagy</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Downey</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Smith</surname>
            ,
            <given-names>N. A.</given-names>
          </string-name>
          <year>2020</year>
          .
          <article-title>Don't stop pretraining: Adapt language models to domains and tasks</article-title>
          .
          <source>In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <fpage>8342</fpage>
          -
          <lpage>8360</lpage>
          . Online:
          <publisher-name>Association for Computational Linguistics</publisher-name>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Iscen</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Tolias</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Avrithis</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Chum</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Label propagation for deep semi-supervised learning</article-title>
          .
          <source>In Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <fpage>5070</fpage>
          -
          <lpage>5079</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Iyyer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Manjunatha</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Boyd-Graber</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Daumé III</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>Deep unordered composition rivals syntactic methods for text classification</article-title>
          .
          <source>In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)</source>
          ,
          <fpage>1681</fpage>
          -
          <lpage>1691</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Jacobs</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Itai</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Wintner</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2020</year>
          .
          <article-title>Acronyms: identification, expansion and disambiguation</article-title>
          .
          <source>Annals of Mathematics and Artificial Intelligence</source>
          <volume>88</volume>
          (
          <issue>5</issue>
          ):
          <fpage>517</fpage>
          -
          <lpage>532</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Jin</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <source>arXiv preprint arXiv:1906.03360</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>Kingma</surname>
            ,
            <given-names>D. P.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Ba</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Adam: A method for stochastic optimization</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Fuxman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Tao</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Guess me if you can: Acronym disambiguation for enterprises</article-title>
          .
          <source>In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          ,
          <fpage>1308</fpage>
          -
          <lpage>1317</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Cohn</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Baldwin</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Robust training under linguistic adversity</article-title>
          .
          <source>In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers</source>
          ,
          <fpage>21</fpage>
          -
          <lpage>27</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Multi-granularity sequence labeling model for acronym expansion identification</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <source>Information Sciences</source>
          <volume>378</volume>
          :
          <fpage>462</fpage>
          -
          <lpage>474</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G. S.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2013</year>
          .
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          ,
          <fpage>3111</fpage>
          -
          <lpage>3119</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <surname>Miyato</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Dai</surname>
            ,
            <given-names>A. M.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Goodfellow</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Adversarial training methods for semi-supervised text classification</article-title>
          .
          <source>In Proceedings of International Conference on Learning Representations</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <surname>Nadeau</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Turney</surname>
            ,
            <given-names>P. D.</given-names>
          </string-name>
          <year>2005</year>
          .
          <article-title>A supervised learning approach to acronym identification</article-title>
          .
          <source>In Conference of the Canadian Society for Computational Studies of Intelligence</source>
          ,
          <fpage>319</fpage>
          -
          <lpage>329</lpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <surname>Oliver</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Odena</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Raffel</surname>
            ,
            <given-names>C. A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Cubuk</surname>
            ,
            <given-names>E. D.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Goodfellow</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Realistic evaluation of deep semi-supervised learning algorithms</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          ,
          <fpage>3235</fpage>
          -
          <lpage>3246</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <surname>Pennington</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Socher</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C. D.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>GloVe: Global vectors for word representation</article-title>
          .
          <source>In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP)</source>
          ,
          <fpage>1532</fpage>
          -
          <lpage>1543</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <surname>Peters</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Neumann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Iyyer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Gardner</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Zettlemoyer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Deep contextualized word representations</article-title>
          .
          <source>In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)</source>
          ,
          <fpage>2227</fpage>
          -
          <lpage>2237</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <surname>Radford</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Narasimhan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Salimans</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Improving language understanding by generative pre-training</article-title>
          . URL https://s3-us-west-2.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <string-name>
            <surname>Schwartz</surname>
            ,
            <given-names>A. S.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Hearst</surname>
            ,
            <given-names>M. A.</given-names>
          </string-name>
          <year>2002</year>
          .
          <article-title>A simple algorithm for identifying abbreviation definitions in biomedical text</article-title>
          .
          <source>In Biocomputing 2003</source>
          ,
          <fpage>451</fpage>
          -
          <lpage>462</lpage>
          .
          <publisher-name>World Scientific</publisher-name>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name>
            <surname>Shi</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Gong</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ding</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Tao</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Zheng</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Transductive semi-supervised deep learning using min-max features</article-title>
          .
          <source>In Proceedings of the European Conference on Computer Vision (ECCV)</source>
          ,
          <fpage>299</fpage>
          -
          <lpage>315</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <string-name>
            <surname>Søgaard</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2013</year>
          .
          <article-title>Part-of-speech tagging with antagonistic adversaries</article-title>
          .
          <source>In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)</source>
          ,
          <fpage>640</fpage>
          -
          <lpage>644</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <string-name>
            <surname>Srivastava</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Hinton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Krizhevsky</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Salakhutdinov</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>Dropout: a simple way to prevent neural networks from overfitting</article-title>
          .
          <source>The Journal of Machine Learning Research</source>
          <volume>15</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1929</fpage>
          -
          <lpage>1958</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <string-name>
            <surname>Taghva</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Gilbreth</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>1999</year>
          .
          <article-title>Recognizing acronyms and their definitions</article-title>
          .
          <source>International Journal on Document Analysis and Recognition</source>
          <volume>1</volume>
          (
          <issue>4</issue>
          ):
          <fpage>191</fpage>
          -
          <lpage>198</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <string-name>
            <surname>Veyseh</surname>
            ,
            <given-names>A. P. B.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Dernoncourt</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Nguyen</surname>
            ,
            <given-names>T. H.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Celi</surname>
            ,
            <given-names>L. A.</given-names>
          </string-name>
          <year>2020a</year>
          .
          <article-title>Acronym identification and disambiguation shared tasks for scientific document understanding</article-title>
          .
          <source>arXiv preprint arXiv:2012.11760</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <string-name>
            <surname>Veyseh</surname>
            ,
            <given-names>A. P. B.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Dernoncourt</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Tran</surname>
            ,
            <given-names>Q. H.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Nguyen</surname>
            ,
            <given-names>T. H.</given-names>
          </string-name>
          <year>2020b</year>
          .
          <article-title>What does this acronym mean? introducing a new dataset for acronym identification and disambiguation</article-title>
          .
          <source>In Proceedings of the 28th International Conference on Computational Linguistics</source>
          ,
          <fpage>3285</fpage>
          -
          <lpage>3301</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <string-name>
            <surname>Xie</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>S. I.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lévy</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Nie</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Jurafsky</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>A. Y.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Data noising as smoothing in neural network language models</article-title>
          .
          <source>In 5th International Conference on Learning Representations, ICLR 2017</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Adversarial attention modeling for multi-dimensional emotion regression</article-title>
          .
          <source>In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <fpage>471</fpage>
          -
          <lpage>480</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>