       BERT-based Acronym Disambiguation with Multiple Training Strategies


                    Chunguang Pan 1, Bingyan Song 1, Shengguang Wang 1, Zhipeng Luo 1
                                 1 DeepBlue Technology (Shanghai) Co., Ltd
                          {panchg, songby, wangshg, luozp}@deepblueai.com




                           Abstract

  The acronym disambiguation (AD) task aims to find the correct
  expansion of an ambiguous acronym in a given sentence.
  Although it is convenient to use acronyms, they can sometimes
  be difficult to understand. Identifying the appropriate
  expansion of an acronym is therefore a practical task in
  natural language processing. Since few works have addressed
  AD in the scientific field, in this paper we propose a binary
  classification model incorporating BERT and several training
  strategies, including dynamic negative sample selection, task
  adaptive pretraining, adversarial training and pseudo-labeling.
  Experiments on SciAD show the effectiveness of our proposed
  model, and our score ranks 1st in SDU@AAAI-21 shared task 2:
  Acronym Disambiguation.


   Input:
     - Sentence: The model complexity for the SVM is determined
       by the Gaussian kernel spread and the penalty parameter.
     - Dictionary: SVM: Support Vector Machine
                        State Vector Machine
   Output: Support Vector Machine

           Figure 1: An example of acronym disambiguation
                     1    Introduction

An acronym is a word created from the initial components of a
phrase or name, called the expansion (Jacobs, Itai, and Wintner
2020). In many documents, especially in the scientific and
medical fields, the number of acronyms is increasing at an
incredible rate. By using acronyms, people can avoid repeating
frequently used long phrases. For example, CNN is an acronym
with the expansion Convolutional Neural Network, though it has
additional expansion possibilities depending on context, such as
Condensed Nearest Neighbor.
   Understanding the correlation between acronyms and their
expansions is critical for several applications in natural
language processing, including text classification, question
answering and so on.
   Despite the convenience of using acronyms, they can sometimes
be difficult to understand, especially for people who are not
familiar with the specific area, such as the scientific or
medical field. Therefore, it is necessary to develop a system
that can automatically resolve the appropriate meaning of an
acronym in its context.
   Given an acronym and several possible expansions, the acronym
disambiguation (AD) task is to determine which expansion is
correct for a particular context. The scientific acronym
disambiguation task is challenging due to the high ambiguity of
acronyms. For example, as shown in Figure 1, SVM has two
expansions in the dictionary. According to the contextual
information from the input sentence, SVM here represents Support
Vector Machine, which is quite similar to the other candidate,
State Vector Machine.
   Consequently, AD is formulated as a classification problem:
given a sentence and an acronym, the goal is to predict the
expansion of the acronym from a given candidate set. Over the
past two decades, several kinds of approaches have been
proposed. At the beginning, pattern-matching techniques were
popular: Taghva and Gilbreth (1999) designed rules and patterns
to find the corresponding expansions of each acronym. However,
since pattern-matching methods require considerable human effort
to design and tune the rules and patterns, machine learning
based methods (e.g. CRF and SVM) (Liu, Liu, and Huang 2017)
became preferred. More recently, deep learning methods
(Charbonnier and Wartena 2018; Jin, Liu, and Lu 2019) have been
adopted to solve this task.
   Recently, pre-trained language models such as ELMo (Peters et
al. 2018) and BERT (Devlin et al. 2018) have shown their
effectiveness in contextual representation. Inspired by these
pre-trained models, we propose a binary classification model
that is capable of handling acronym disambiguation. We evaluate
and verify the proposed method on the dataset released by
SDU@AAAI 2021 Shared Task: Acronym Disambiguation (Veyseh et al.
2020a). Experimental results show that our model can effectively
deal with the task, and we win first place in the competition.

Copyright © 2021 for this paper by its authors. Use permitted
under Creative Commons License Attribution 4.0 International
(CC BY 4.0).
                    2    Related Work

Acronym Disambiguation
Acronym disambiguation has received a lot of attention in
vertical domains, especially in the biomedical field. Most of
the proposed methods (Schwartz and Hearst 2002) utilize generic
rules or text patterns to discover acronym expansions. These
methods are usually applied under circumstances where acronyms
are co-mentioned with the corresponding expansions in the same
document. However, in scientific papers this rarely happens: it
is very common for people to define an acronym somewhere and use
it elsewhere. Thus, such methods cannot be used for acronym
disambiguation in the scientific field.
   There have also been a few works (Nadeau and Turney 2005) on
automatically mining acronym expansions by leveraging Web data
(e.g. click logs, query sessions). However, we cannot apply them
directly to scientific data, since most scientific data are raw
text, and logs of query sessions/clicks are therefore rarely
available.

[Figure 2: Number of acronyms per sentence. Histogram over
SciAD: 26,075 sentences contain one acronym, 8,879 contain two,
and the frequency drops off quickly for higher counts.]

[Figure 3: Number of expansions per acronym. Histogram over the
SciAD dictionary: 437 acronyms have two candidate expansions,
140 have three, with a long tail up to ten or more.]

Pre-trained Models
Substantial work has shown that pre-trained models (PTMs)
trained on large unlabeled corpora can learn universal language
representations, which are beneficial for downstream NLP tasks
and avoid training a new model from scratch.
   The first-generation PTMs aim to learn good word embeddings,
such as Skip-Gram (Mikolov et al. 2013) and GloVe (Pennington,
Socher, and Manning 2014). These models are usually very shallow
for computational efficiency, because the models themselves are
no longer needed by downstream tasks. Although these pre-trained
embeddings can capture the semantic meanings of words, they fail
to capture higher-level concepts in context, such as polysemy
disambiguation and semantic roles. The second-generation PTMs
focus on learning contextual word embeddings, such as ELMo
(Peters et al. 2018), OpenAI GPT (Radford et al. 2018) and BERT
(Devlin et al. 2018). These learned encoders are still needed to
generate word embeddings in context when used in downstream
tasks.

Adversarial Training
Adversarial training (AT) (Goodfellow, Shlens, and Szegedy 2014)
is a means of regularizing classification algorithms by
generating adversarial noise for the training data. It was first
introduced in image classification tasks, where the input data
is continuous.
   Miyato, Dai, and Goodfellow (2017) extend adversarial and
virtual adversarial training to text classification by applying
perturbations to the word embeddings, proposing an end-to-end
way of perturbing the data that utilizes gradient information.
Zhu, Li, and Zhou (2019) propose an adversarial attention
network for the task of multi-dimensional emotion regression,
which automatically rates multiple emotion dimension scores for
an input text.
   There are also other works that regularize classifiers by
adding random noise to the data, such as dropout (Srivastava et
al. 2014) and its variant for NLP tasks, word dropout (Iyyer et
al. 2015). Xie et al. (2019) discuss various data noising
techniques for language models and provide empirical analysis
validating the relationship between noising and smoothing.
Søgaard (2013) and Li, Cohn, and Baldwin (2017) focus on
linguistic adversaries.
   Combining the advantages of the works above, we propose a
binary classification model utilizing BERT and several training
strategies such as adversarial training.

                         3    Data

In this paper, we use the AD dataset SciAD released by Veyseh et
al. (2020b). They collect a corpus of 6,786 English papers from
arXiv; these papers consist of 2,031,592 sentences that could be
used for data annotation.
[Figure 4: Acronym disambiguation based on the binary
classification model. The same input sentence ("The MSE of ...
consists of variance and squared bias.") is paired with each
candidate expansion of MSE ("mean squared error", "model
selection eqn", "minimum square error"); BERT scores each pair
(0.95, 0.37 and 0.56 respectively), and an argmax over the
scores selects the output. For each sample, the model needs to
predict whether the given expansion matches the acronym or not,
and the expansion with the highest score is taken as the correct
one.]

   The dataset contains 62,441 samples; each sample involves a
sentence, an ambiguous acronym, and its correct meaning (one of
the meanings of the acronym recorded in the dictionary), as
shown in Figure 1.
   Figure 2 and Figure 3 demonstrate statistics of the SciAD
dataset. More specifically, Figure 2 shows the distribution of
the number of acronyms per sentence. Each sentence can contain
more than one acronym, and most sentences have 1 or 2 acronyms.
Figure 3 shows the distribution of the number of expansions per
acronym. This distribution is consistent with the one presented
in prior work (Charbonnier and Wartena 2018): in both
distributions, acronyms with 2 or 3 meanings account for the
largest number of samples in the dataset (Veyseh et al. 2020b).
           4    Binary Classification Model

The input of the binary classification model is a sentence with
an ambiguous acronym and one possible expansion. The model needs
to predict whether this expansion is the corresponding expansion
of the given acronym. Given an input sentence, the model assigns
a predicted score to each candidate expansion, and the candidate
expansion with the highest score is the model output. Figure 4
shows an example of the procedure.
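For concreteness, the scoring-and-argmax procedure of Figure 4
can be sketched in a few lines of Python; here score_pair is a
stand-in for the trained binary model of this section, and the
dictionary format follows Figure 1:

    def disambiguate(sentence, acronym, dictionary, score_pair):
        # dictionary maps an acronym to its candidate expansions, e.g.
        # {"SVM": ["Support Vector Machine", "State Vector Machine"]}.
        # score_pair(expansion, sentence) is assumed to return the binary
        # model's matching score for a single (expansion, sentence) pair.
        candidates = dictionary[acronym]
        return max(candidates, key=lambda exp: score_pair(exp, sentence))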
Input Format
Since BERT can process a pair of input sentences with segment
embeddings, we use the candidate expansion as the first input
segment and the given text as the second input segment. We
separate these two input segments with the special token [SEP].
Furthermore, we add two special boundary tokens to wrap the
acronym in the text, which ensures that the acronym receives
enough attention from the model.
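A minimal sketch of this input construction with the HuggingFace
transformers tokenizer follows. The boundary token strings <acr>
and </acr> are placeholders of our own choosing (the paper's
exact token names are not preserved in this version), and
max_length is likewise an assumption:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
    # Register two boundary tokens that wrap the acronym; the encoder's
    # embedding matrix must be resized accordingly before training.
    tokenizer.add_tokens(["<acr>", "</acr>"])

    expansion = "Support Vector Machine"
    sentence = ("The model complexity for the <acr> SVM </acr> is determined "
                "by the Gaussian kernel spread and the penalty parameter.")

    # Segment A: candidate expansion; segment B: sentence.
    # Produces: [CLS] expansion [SEP] sentence [SEP]
    encoded = tokenizer(expansion, sentence, truncation=True,
                        max_length=192, return_tensors="pt")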
Binary Model Architecture
The model architecture is described in detail in Figure 5.
First, we use a BERT encoder to get the representation of the
input segments. Next, we calculate the mean of the
representations at the start and end positions of the acronym,
and concatenate it with the [CLS] position vector. Then, we send
this concatenated vector into a binary classifier for
prediction: the representation first passes through a dropout
layer (Srivastava et al. 2014) and a feedforward layer; the
output is then fed into a ReLU (Glorot, Bordes, and Bengio 2011)
activation; after this, the resulting vector passes through a
dropout layer and a feedforward layer again, and the final
prediction is obtained through a sigmoid activation.

[Figure 5: The binary classification model. The input tokens
(e.g. "[CLS] Bayesian network [SEP] B ##N is also applied to
projection layer [SEP]") are encoded by BERT; the mean of the
acronym's boundary representations is concatenated with the
[CLS] vector and fed through Dropout(0.2), Dense(1536x128),
ReLU, Dropout(0.1), Dense(128x1) and a sigmoid.]
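The head described above can be sketched as follows. This is our
reading of Figure 5 rather than the authors' released code: with
768-dimensional BERT vectors, the concatenation of the [CLS]
vector and the acronym span vector is 1536-dimensional:

    import torch
    import torch.nn as nn
    from transformers import AutoModel

    class AcronymBinaryClassifier(nn.Module):
        def __init__(self, name="allenai/scibert_scivocab_uncased"):
            super().__init__()
            self.encoder = AutoModel.from_pretrained(name)
            self.head = nn.Sequential(
                nn.Dropout(0.2),
                nn.Linear(1536, 128),  # [CLS] (768) + acronym span (768)
                nn.ReLU(),
                nn.Dropout(0.1),
                nn.Linear(128, 1),
            )

        def forward(self, input_ids, attention_mask, start_idx, end_idx):
            hidden = self.encoder(input_ids=input_ids,
                                  attention_mask=attention_mask).last_hidden_state
            rows = torch.arange(hidden.size(0), device=hidden.device)
            # Mean of the representations at the acronym's boundary positions.
            span = (hidden[rows, start_idx] + hidden[rows, end_idx]) / 2
            cls = hidden[:, 0]  # [CLS] position vector
            logit = self.head(torch.cat([cls, span], dim=-1)).squeeze(-1)
            return torch.sigmoid(logit)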
Training Strategies
Pretrained Models  Experiments from previous work have shown the
effectiveness of pretrained models. Starting from BERT, many
improved pretrained models have been proposed; RoBERTa, for
example, uses dynamic masking and removes the next sentence
prediction task. In our experiments, we compare BERT and RoBERTa
models trained on corpora from different fields.

Dynamic Negative Sample Selection  During training, we
dynamically select a fixed number of negative samples for each
batch. This ensures that the model is trained on more balanced
positive and negative data, while all negative samples are still
used over the course of training.
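A minimal sketch of this selection, assuming each training
sentence comes with one gold expansion and that the remaining
dictionary entries form the negative pool; the ratio of three
negatives per positive is an illustrative choice, not a number
from the paper:

    import random

    def sample_batch(batch, dictionary, neg_per_pos=3, seed=0):
        # batch: iterable of (sentence, acronym, gold_expansion) triples.
        # Drawing fresh negatives for every batch keeps classes balanced
        # while still cycling through all negatives during training.
        rng = random.Random(seed)
        samples = []
        for sentence, acronym, gold in batch:
            samples.append((gold, sentence, 1))
            negatives = [e for e in dictionary[acronym] if e != gold]
            rng.shuffle(negatives)
            samples.extend((e, sentence, 0) for e in negatives[:neg_per_pos])
        return samples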
Task Adaptive Pretraining  Gururangan et al. (2020) show that
task-adaptive pretraining (TAPT) can effectively improve model
performance. The task-specific dataset usually covers only a
subset of the data used for general pretraining, so we can
achieve a significant improvement by continuing to pretrain the
masked language model on the given dataset.
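A sketch of TAPT with the transformers Trainer, under the
assumption that masked language modeling is simply continued on
the raw SciAD sentences; the masking rate is illustrative, and
the epoch count follows the 100 TAPT epochs reported in Section
5:

    from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                              DataCollatorForLanguageModeling,
                              Trainer, TrainingArguments)

    name = "allenai/scibert_scivocab_uncased"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForMaskedLM.from_pretrained(name)

    sentences = [
        "The model complexity for the SVM is determined by ...",
    ]  # in practice: all sentences from the SciAD training data

    encodings = tokenizer(sentences, truncation=True, max_length=128)
    dataset = [{"input_ids": ids} for ids in encodings["input_ids"]]

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="tapt-scibert",
                               num_train_epochs=100,
                               per_device_train_batch_size=32),
        data_collator=DataCollatorForLanguageModeling(tokenizer,
                                                      mlm_probability=0.15),
        train_dataset=dataset,
    )
    trainer.train()
    model.save_pretrained("tapt-scibert")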
Adversarial Training  Adversarial training is a popular approach
to increasing the robustness of neural networks. As shown in
Miyato, Dai, and Goodfellow (2017), adversarial training has a
good regularization effect. By adding perturbations to the
embedding layer, we can obtain more stable word representations
and a more generalized model, which significantly improves model
performance on unseen data.
Pseudo-Labeling  Pseudo-labeling (Iscen et al. 2019; Oliver et
al. 2018; Shi et al. 2018) uses network predictions with high
confidence as labels. We mix these pseudo labels with the
training set to generate a new dataset, and then use this new
dataset to train a new binary classification model.
Pseudo-labeling has proved to be an effective approach to
utilizing unlabeled data for better performance.
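A minimal sketch of the selection step, using the 0.95 threshold
reported in Section 5; predict_proba stands in for scoring an
(expansion, sentence) pair with the trained model:

    def pseudo_label(test_pairs, predict_proba, threshold=0.95):
        # Keep test-set pairs whose predicted matching score exceeds the
        # threshold and treat them as labeled positives; they are then
        # mixed with the training set and the classifier is retrained.
        return [(expansion, sentence, 1)
                for expansion, sentence in test_pairs
                if predict_proba(expansion, sentence) > threshold]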

                     5    Experiments

Hyperparameters
The batch size used in our experiments is 32. We train each
model for 15 epochs. The initial learning rate for the text
encoder is 1.0 × 10⁻⁵, and for all other parameters the initial
learning rate is set to 5.0 × 10⁻⁴. We evaluate our model on the
validation set at each epoch; if the macro F1 score does not
increase, we decay the learning rate by a factor of 0.1. The
minimum learning rate is 5.0 × 10⁻⁷. We use the Adam optimizer
(Kingma and Ba 2017) in all our experiments.
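These settings map onto standard PyTorch components as sketched
below, with ReduceLROnPlateau standing in for the
decay-on-no-improvement rule (the paper does not name its exact
scheduler) and model assumed to be the classifier sketched in
Section 4:

    import torch

    encoder_params = [p for n, p in model.named_parameters()
                      if n.startswith("encoder")]
    other_params = [p for n, p in model.named_parameters()
                    if not n.startswith("encoder")]

    optimizer = torch.optim.Adam([
        {"params": encoder_params, "lr": 1e-5},  # text encoder
        {"params": other_params, "lr": 5e-4},    # all other parameters
    ])
    # Decay by 0.1 whenever macro F1 fails to improve, down to 5e-7.
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="max", factor=0.1, patience=0, min_lr=5e-7)

    # After each epoch's validation: scheduler.step(macro_f1)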
Pretrained Models
Since different pretrained models are trained on different data,
we run experiments with several pretrained models. Table 1 shows
their results on the validation set. The bert-base model gets
the highest score among the commonly used pretrained models (the
top 3 lines in Table 1). Since a large portion of the texts in
the given dataset comes from the computer science field, the
cs-roberta model outperforms the bert-base model by 1.6
percentage points. The best model in our experiments is the
scibert model, which achieves an F1 score of 89%.

   Model                      Precision    Recall       F1
   bert-base-uncased            0.9176     0.8160     0.8638
   bert-large-uncased           0.9034     0.7693     0.8311
   roberta-base                 0.9008     0.7687     0.8295
   cs-roberta-base              0.9216     0.8415     0.8797
   scibert-scivocab-uncased     0.9263     0.8569     0.8902

   Table 1: Results on the validation set using different
   pretrained models.

Training Procedure
We incorporate all the training strategies introduced above to
improve the performance of our proposed binary classification
model. According to the experimental results in Table 1, we
choose scibert as the fundamental pretrained model and use the
TAPT technique to train a new pretrained model. Then we add the
dynamic negative sample selection and adversarial training
strategies to train the binary classification model. After this,
we utilize the pseudo-labeling technique and obtain the final
binary classification model.

Further Experiments
Combining training strategies  We run further experiments on the
validation set to verify the effectiveness of each strategy
mentioned above. The results are shown in Table 2. As shown in
the table, the F1 score increases by 4 percentage points with
dynamic sampling. TAPT and adversarial training further improve
the performance on the validation set by 0.47 percentage points
in total. Finally, we use the pseudo-labeling method: samples
from the test set with a score higher than 0.95 are selected and
mixed with the training set, which still slightly improves the
F1 score.

   Model                         Precision    Recall       F1
   scibert-scivocab-uncased       0.9263      0.8569     0.8902
   +dynamic sampling              0.9575      0.9060     0.9310
   +task adaptive pretraining     0.9610      0.9055     0.9324
   +adversarial training          0.9651      0.9082     0.9358
   +pseudo-labeling               0.9629      0.9106     0.9360

   Table 2: Results on the validation set using different
   training approaches.

Error Analysis  We gather a sample of 100 development set
examples that our model misclassified and inspect them manually.
   From these examples, we find two main cases where the model
gives a wrong prediction. The first is that the candidate
expansions are too similar, or even have the same meaning in
different forms. For example, in the sentence 'The SC is
decreasing for increasing values of ...', the correct expansion
for 'SC' is 'sum capacities' while our prediction is 'sum
capacity', which has the same meaning as the correct one but in
the singular form.
   The second is that the given sentence contains too little
contextual information for prediction. For instance, the correct
expansion for 'ML' in the sentence 'ML models are usually much
more complex, see Figure.' is 'model logic', while the predicted
expansion is 'machine learning'. Even people can hardly tell
which one is right based only on the given sentence.
                                                                 contrast, the deep learning model has comparable recall on
Time Complexity  To analyze the time complexity of our proposed
method, we report measurements of the actual running time
observed in our experiments. The discussion is not precise or
exhaustive, but we believe it is enough to offer readers a rough
estimate of the time complexity of our model.
   We utilize the TAPT strategy to further train the scibert
model on eight NVIDIA TITAN V GPUs (12GB); it takes three hours
to train 100 epochs in total.
   After getting the new pretrained model, we train the binary
classification model on two NVIDIA TITAN V GPUs. The average
per-epoch training and inference times after adding adversarial
training and pseudo-labeling are shown in Table 3. The model
begins to converge after five epochs. Inference takes nearly the
same time throughout, while the training time doubles once
adversarial training is added.

   Model                    Train     Inference
   base model               1588s      150.42s
   +adversarial training    3021s      149.64s
   +pseudo-labeling         3328s      149.36s

   Table 3: Time complexity.

Comparison Results  We compare our results with several other
models. Precision, Recall and F1 of our proposed model are
computed on the test data via the cross-validation method.

• MF & ADE: Non-deep-learning models that utilize rules or
  hand-crafted features (Li et al. 2018).

• NOA & UAD: Language-model-based baselines that train word
  embeddings on the training corpus (Charbonnier and Wartena
  2018; Ciosici and Assent 2019).

• BEM & DECBAE: Models that employ deep architectures (e.g.
  LSTM) (Jin, Liu, and Lu 2019; Blevins and Zettlemoyer 2020).

• GAD: A deep learning model that utilizes the syntactic
  structure of the sentence (Veyseh et al. 2020b).

   Model                    Precision    Recall      F1
   MF                        0.8903      0.4220    0.5726
   ADE                       0.8674      0.4325    0.5772
   NOA                       0.7814      0.3506    0.4840
   UAD                       0.8901      0.7008    0.7837
   BEM                       0.8675      0.3594    0.5082
   DECBAE                    0.8867      0.7432    0.8086
   GAD                       0.8927      0.7666    0.8190
   Ours                      0.9695      0.9132    0.9405
   Human Performance         0.9782      0.9445    0.9610

   Table 4: Results of different models on the test dataset.

   As shown in Table 4, rules and hand-crafted features fail to
capture all patterns of expressing the meanings of an acronym,
resulting in poorer recall on expansions compared to acronyms.
In contrast, the deep learning models have comparable recall on
expansions and acronyms, showing the importance of pretrained
word embeddings and deep architectures for AD. However, they all
fall far behind human-level performance. Among all the models,
our proposed model achieves the best results on SciAD and is
very close to human performance, which shows the capability of
the strategies we introduced above.

SDU@AAAI 2021 Shared Task: Acronym Disambiguation  The
competition results are shown in Table 5. We show the scores of
the top 5 ranked models as well as the baseline model. The
baseline model is released by the provider of the SciAD dataset
(Veyseh et al. 2020b). Our model performs best on the
leaderboard, outperforming the second place by 0.32 percentage
points. In addition, our model outperforms the baseline model by
12.15 percentage points, which is a great improvement.

         Model       Precision    Recall      F1
         Rank1        0.9695      0.9132    0.9405
         Rank2        0.9694      0.9073    0.9373
         Rank3        0.9652      0.9009    0.9319
         Rank4        0.9595      0.8959    0.9266
         Rank5        0.9548      0.8907    0.9216
         Baseline     0.8927      0.7666    0.8190

                  Table 5: Leaderboard.

                     6    Conclusion

In this paper, we introduce a binary classification model for
acronym disambiguation. We utilize a BERT encoder to get the
input representations and adopt several training strategies,
including dynamic negative sample selection, task adaptive
pretraining, adversarial training and pseudo-labeling.
Experiments on SciAD show the validity of our proposed model,
and we win first place in SDU@AAAI-2021 Shared Task 2.

                      References

Blevins, T., and Zettlemoyer, L. 2020. Moving down the long tail
of word sense disambiguation with gloss-informed biencoders.
arXiv preprint arXiv:2005.02590.

Charbonnier, J., and Wartena, C. 2018. Using word embeddings for
unsupervised acronym disambiguation. In Proceedings of the 27th
International Conference on Computational Linguistics,
2610-2619.

Ciosici, M., and Assent, I. 2019. Abbreviation Explorer - an
interactive system for pre-evaluation of unsupervised
abbreviation disambiguation. In Proceedings of the 2019
Conference of the North American Chapter of the Association for
Computational Linguistics (Demonstrations), 1-5.

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. BERT:
Pre-training of deep bidirectional transformers for language
understanding. arXiv preprint arXiv:1810.04805.
Glorot, X.; Bordes, A.; and Bengio, Y. 2011. Deep sparse
rectifier neural networks. In Proceedings of the Fourteenth
International Conference on Artificial Intelligence and
Statistics, 315-323.

Goodfellow, I. J.; Shlens, J.; and Szegedy, C. 2014. Explaining
and harnessing adversarial examples. arXiv preprint
arXiv:1412.6572.

Gururangan, S.; Marasović, A.; Swayamdipta, S.; Lo, K.; Beltagy,
I.; Downey, D.; and Smith, N. A. 2020. Don't stop pretraining:
Adapt language models to domains and tasks. In Proceedings of
the 58th Annual Meeting of the Association for Computational
Linguistics, 8342-8360. Online: Association for Computational
Linguistics.

Iscen, A.; Tolias, G.; Avrithis, Y.; and Chum, O. 2019. Label
propagation for deep semi-supervised learning. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition,
5070-5079.

Iyyer, M.; Manjunatha, V.; Boyd-Graber, J.; and Daumé III, H.
2015. Deep unordered composition rivals syntactic methods for
text classification. In Proceedings of the 53rd Annual Meeting
of the Association for Computational Linguistics and the 7th
International Joint Conference on Natural Language Processing
(Volume 1: Long Papers), 1681-1691.

Jacobs, K.; Itai, A.; and Wintner, S. 2020. Acronyms:
identification, expansion and disambiguation. Annals of
Mathematics and Artificial Intelligence 88(5):517-532.

Jin, Q.; Liu, J.; and Lu, X. 2019. Deep contextualized
biomedical abbreviation expansion. arXiv preprint
arXiv:1906.03360.

Kingma, D. P., and Ba, J. 2017. Adam: A method for stochastic
optimization.

Li, Y.; Zhao, B.; Fuxman, A.; and Tao, F. 2018. Guess me if you
can: Acronym disambiguation for enterprises. In Proceedings of
the 56th Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), 1308-1317.

Li, Y.; Cohn, T.; and Baldwin, T. 2017. Robust training under
linguistic adversity. In Proceedings of the 15th Conference of
the European Chapter of the Association for Computational
Linguistics: Volume 2, Short Papers, 21-27.

Liu, J.; Liu, C.; and Huang, Y. 2017. Multi-granularity sequence
labeling model for acronym expansion identification. Information
Sciences 378:462-474.

Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean,
J. 2013. Distributed representations of words and phrases and
their compositionality. In Advances in Neural Information
Processing Systems, 3111-3119.

Miyato, T.; Dai, A. M.; and Goodfellow, I. 2017. Adversarial
training methods for semi-supervised text classification. In
Proceedings of the International Conference on Learning
Representations.

Nadeau, D., and Turney, P. D. 2005. A supervised learning
approach to acronym identification. In Conference of the
Canadian Society for Computational Studies of Intelligence,
319-329. Springer.

Oliver, A.; Odena, A.; Raffel, C. A.; Cubuk, E. D.; and
Goodfellow, I. 2018. Realistic evaluation of deep
semi-supervised learning algorithms. In Advances in Neural
Information Processing Systems, 3235-3246.

Pennington, J.; Socher, R.; and Manning, C. D. 2014. GloVe:
Global vectors for word representation. In Proceedings of the
2014 Conference on Empirical Methods in Natural Language
Processing (EMNLP), 1532-1543.

Peters, M.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee,
K.; and Zettlemoyer, L. 2018. Deep contextualized word
representations. In Proceedings of the 2018 Conference of the
North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, Volume 1 (Long
Papers), 2227-2237.

Radford, A.; Narasimhan, K.; Salimans, T.; and Sutskever, I.
2018. Improving language understanding by generative
pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf.

Schwartz, A. S., and Hearst, M. A. 2002. A simple algorithm for
identifying abbreviation definitions in biomedical text. In
Biocomputing 2003. World Scientific. 451-462.

Shi, W.; Gong, Y.; Ding, C.; Ma, Z.; Tao, X.; and Zheng, N.
2018. Transductive semi-supervised deep learning using min-max
features. In Proceedings of the European Conference on Computer
Vision (ECCV), 299-315.

Søgaard, A. 2013. Part-of-speech tagging with antagonistic
adversaries. In Proceedings of the 51st Annual Meeting of the
Association for Computational Linguistics (Volume 2: Short
Papers), 640-644.

Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; and
Salakhutdinov, R. 2014. Dropout: A simple way to prevent neural
networks from overfitting. The Journal of Machine Learning
Research 15(1):1929-1958.

Taghva, K., and Gilbreth, J. 1999. Recognizing acronyms and
their definitions. International Journal on Document Analysis
and Recognition 1(4):191-198.

Veyseh, A. P. B.; Dernoncourt, F.; Nguyen, T. H.; Chang, W.; and
Celi, L. A. 2020a. Acronym identification and disambiguation
shared tasks for scientific document understanding. arXiv
preprint arXiv:2012.11760.

Veyseh, A. P. B.; Dernoncourt, F.; Tran, Q. H.; and Nguyen, T.
H. 2020b. What does this acronym mean? Introducing a new dataset
for acronym identification and disambiguation. In Proceedings of
the 28th International Conference on Computational Linguistics,
3285-3301.

Xie, Z.; Wang, S. I.; Li, J.; Lévy, D.; Nie, A.; Jurafsky, D.;
and Ng, A. Y. 2019. Data noising as smoothing in neural network
language models. In 5th International Conference on Learning
Representations, ICLR 2017.

Zhu, S.; Li, S.; and Zhou, G. 2019. Adversarial attention
modeling for multi-dimensional emotion regression. In
Proceedings of the 57th Annual Meeting of the Association for
Computational Linguistics, 471-480.