=Paper=
{{Paper
|id=Vol-2831/paper28
|storemode=property
|title=AT-BERT: Adversarial Training BERT for Acronym Identification Winning Solution for SDU@AAAI-21
|pdfUrl=https://ceur-ws.org/Vol-2831/paper28.pdf
|volume=Vol-2831
|authors=Danqing Zhu,Wangli Lin,Yang Zhang,Qiwei Zhong,Guanxiong Zeng,Weilin Wu,Jiayu Tang
|dblpUrl=https://dblp.org/rec/conf/aaai/ZhuLZZZWT21
}}
==AT-BERT: Adversarial Training BERT for Acronym Identification Winning Solution for SDU@AAAI-21==
Danqing Zhu, Wangli Lin, Yang Zhang, Qiwei Zhong, Guanxiong Zeng, Weilin Wu, Jiayu Tang
Alibaba Group, Hangzhou, China
{danqing.zdq, wangli.lwl, zy142206, yunwei.zqw, moshi.zgx, william.wwl, jiayu.tangjy}@alibaba-inc.com

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

===Abstract===
Acronym identification focuses on finding acronyms and the phrases they abbreviate, which is crucial for scientific document understanding tasks. However, the limited size of manually annotated datasets hinders further improvement on the problem. Recent breakthroughs in language models pre-trained on large corpora clearly show that unsupervised pre-training can vastly improve the performance of downstream tasks. In this paper, we present an Adversarial Training BERT method named AT-BERT, our winning solution to the acronym identification task of the Scientific Document Understanding (SDU) Challenge at AAAI 2021. Specifically, a pre-trained BERT is adopted to capture better semantic representations. We then incorporate the FGM adversarial training strategy into the fine-tuning of BERT, which makes the model more robust and generalized. Furthermore, an ensemble mechanism is devised to combine the representations learned by multiple BERT variants. Assembling all these components together, experimental results on the SciAI dataset show that our proposed approach outperforms all other competitive state-of-the-art methods.

===Introduction===
Acronyms are widely used in many technical documents to reduce duplicate references to the same concept. According to a report (Barnett and Doubleday 2020) analyzing more than 24 million article titles and 18 million article abstracts published between 1950 and 2019, at least one acronym appeared in 19% of the titles and 73% of the abstracts.
As the number of scientific papers published every year grows, the number of acronyms keeps climbing as well. However, not all acronyms are written in the standard way (i.e., taking the first letter of each word and putting the letters together in capitals); there are many other ways of writing, e.g., XGBoost is an acronym of eXtreme Gradient Boosting (Chen and Guestrin 2016). Thus, automatic identification of acronyms and discovery of their associated definitions are crucial for text understanding tasks such as question answering (Ackermann et al. 2020; Veyseh 2016), slot filling (Pouran Ben Veyseh, Dernoncourt, and Nguyen 2019) and definition extraction (Kang et al. 2020).

Several approaches have been proposed to solve the acronym identification problem over the last two decades. The majority of the prior methods are rule-based (Schwartz and Hearst 2002; Okazaki and Ananiadou 2006) or feature-based (Kuo et al. 2009; Liu, Liu, and Huang 2017), employing manually designed rules or features for acronym and long-form prediction. Because the rules and features are specially designed for finding long forms, these methods have high precision; however, they fail to capture all the diverse forms of acronym expression (Harris and Srinivasan 2019). In contrast, by taking advantage of pre-trained word embeddings and deep architectures, deep learning models such as LSTM-CRF show promising results for acronym identification (Veyseh et al. 2020b). Although these works have made great progress, some limitations still hinder further improvement, such as the limited size of manually annotated acronym data and the noise in automatically created datasets.

Motivated by the above observations, the first publicly available and largest manually annotated acronym identification dataset in the scientific domain was released (Veyseh et al. 2020b), and the Scientific Document Understanding (SDU) Challenge (Veyseh et al. 2020a) for the acronym identification task was hosted (https://sites.google.com/view/sdu-aaai21/shared-task). The task aims to identify acronyms (i.e., short forms) and their meanings (i.e., long forms) in documents; a toy example is shown in Table 1, and an illustrative encoding of this example is given after the table. In this paper, we formulate the problem as a sentence-level sequence labeling problem and design a novel BERT-based ensemble model called Adversarial Training BERT (AT-BERT). Specifically, considering that the training data is relatively small, we adopt the pre-trained BERT model as the sentence encoder, which is pre-trained on general-domain corpora and yields a significant improvement on downstream tasks with supervised fine-tuning (Beltagy, Lo, and Cohan 2019). Furthermore, we leverage FGM (Miyato, Dai, and Goodfellow 2017), an adversarial training strategy, to improve the generalization ability of the model and make it more robust to noisy data. Finally, we utilize a multi-BERT ensemble to fully exploit the representations learned by multiple BERT variants (Xu et al. 2020). Combining these respective advantages, our proposed model won first prize in SDU@AAAI-21, outperforming all the other competitive methods.

Input: Existing methods for learning with noisy labels (LNL) primarily take a loss correction approach.
Output: Existing methods for <u>learning with noisy labels</u> ('''LNL''') primarily take a loss correction approach.
Table 1: A toy example of the acronym identification task. The acronym is shown in bold font and the long form is shown with an underline.
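To make the sequence-labeling formulation concrete, the following is a minimal illustration of how the Table 1 example could be encoded with the BIO label set used by the dataset (B-short, I-short, B-long, I-long, O). The whitespace tokenization shown here is an assumption for illustration, not the dataset's actual tokenization.

<syntaxhighlight lang="python">
# Illustrative BIO encoding of the toy example in Table 1 (tokenization assumed).
tokens = ["Existing", "methods", "for", "learning", "with", "noisy", "labels",
          "(", "LNL", ")", "primarily", "take", "a", "loss", "correction",
          "approach", "."]
labels = ["O", "O", "O", "B-long", "I-long", "I-long", "I-long",
          "O", "B-short", "O", "O", "O", "O", "O", "O",
          "O", "O"]
assert len(tokens) == len(labels)  # one label per token
</syntaxhighlight>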
The main contributions are summarized as follows:
• To the best of our knowledge, this is the first work to incorporate an adversarial training strategy into a BERT-based model for the acronym identification task in the scientific domain.
• We propose a novel framework for acronym identification, including a pre-trained BERT for semantic representation, an adversarial training strategy that makes the model more robust and generalized, and a multi-BERT ensemble mechanism that achieves superior performance.
• Extensive experiments are conducted on the data offered by SDU@AAAI-21, demonstrating the effectiveness of our proposed method.

===Related Work===
In this section, we first introduce related studies on the sequence labeling problem, especially BERT-based models, and then review existing research on adversarial training.

====Sequence Labeling and BERT-based Models====
In this paper, we formulate acronym identification as a sequence labeling problem. Traditional approaches to sequence labeling are mainly rule-based or feature-based (Okazaki and Ananiadou 2006; Kuo et al. 2009). Recently, deep learning models have achieved promising results; for instance, the LSTM-CRF model (Li et al. 2020) uses LSTMs to extract contextualized representations and performs sequence optimization with a CRF. With the development of pre-trained language models, BERT-based models achieve state-of-the-art results on natural language tasks. BERT (Kenton and Toutanova 2019) is a multi-layer bidirectional Transformer encoder pre-trained on Wikipedia and BooksCorpus; it has produced state-of-the-art results on a wide variety of NLP tasks and inspired many variants. RoBERTa (Liu et al. 2019) uses BPE (Byte Pair Encoding) and dynamic masking to increase the shared vocabulary; it optimizes the training strategy of BERT and achieves better performance. ALBERT (Lan et al. 2019) uses factorized embedding parameterization and cross-layer parameter sharing to reduce the number of model parameters. ERNIE (Sun et al. 2019) proposes a new masking strategy based on phrases and entities, in which customized tasks are continuously introduced and trained through multi-task learning.

Acronym identification is more challenging than general sequence labeling problems because acronyms are diverse and ambiguous. Thus, contextualized representations are crucial, and BERT-based models with better semantic representations are more suitable for the task.

====Adversarial Training====
Adversarial training, in which a network is trained on adversarial examples, is an important way to enhance the robustness of neural networks. The Fast Gradient Sign Method (FGSM) (Goodfellow, Shlens, and Szegedy 2015) and its variant, the Fast Gradient Method (FGM) (Miyato, Dai, and Goodfellow 2017), were among the first methods proposed for adversarial training. FGSM and FGM generate adversarial examples by adding gradient-based perturbations to the input samples, using different normalization strategies, and they rely heavily on the assumption that the loss function is linear. In contrast, Projected Gradient Descent (PGD) (Madry et al. 2018) is an iterative attack with multiple steps, where each iteration projects the perturbation onto a specified range. PGD trades increased computational cost for better effectiveness, and many PGD-based methods have been proposed to improve efficiency: YOPO (Zhang et al. 2019) computes only the gradient of the first layer, while FreeAT (Shafahi et al. 2019) and FreeLB (Zhu et al. 2020) further reduce the frequency of gradient computation.

Considering that the dataset for acronym identification is relatively small and easy to overfit, we incorporate an adversarial training strategy into BERT-based models to achieve more robust and generalized performance.
===Methodology===
In this section, we present the overall architecture of our proposed method, which uses a BERT-based model to solve the sequence labeling problem and adopts an adversarial training strategy to improve the robustness of the model.

====Overview====
We propose a BERT-based classification model built on an adversarial training strategy, called Adversarial Training BERT (AT-BERT). As shown in Figure 1, a pre-trained BERT model is used for semantic feature encoding, and the downstream acronym identification task is solved by feeding its output representations into linear classifiers. In addition, because of the complexity of acronyms in scientific documents and the relatively small training dataset, the model is prone to overfitting; we therefore use FGM to add perturbations to the input samples for adversarial training, making the model more robust and generalized. Finally, to further improve accuracy, we train different BERT models, such as BERT, SciBERT, RoBERTa, ALBERT and ELECTRA, and take an average ensemble over all the models to achieve superior performance.

Figure 1: The overall architecture of the proposed AT-BERT approach.

====BERT for the Sequence Labeling Problem====
BERT (Bidirectional Encoder Representations from Transformers) is a state-of-the-art language model for NLP. It uses the encoder structure of the Transformer (Vaswani et al. 2017) for deep self-supervised learning and requires task-specific fine-tuning. The Transformer is an attention mechanism that learns contextual relations between words (or subwords) in a text. In this paper, the downstream task is a single-sentence tagging problem. We denote a sequence with T words as W = (w_1, w_2, ..., w_T), and its corresponding targets as Y = (y_1, y_2, ..., y_T). BERT trains an encoder that generates a contextualized vector representation for each token as a hidden state:

H = \mathrm{BERT}(w_1, w_2, \ldots, w_T; \theta) = (h_1, h_2, \ldots, h_T) \qquad (1)

The hidden states are then fed into a fully connected layer with a softmax unit to obtain the predicted probability distribution for each token. The model is trained with the cross-entropy loss function, defined as

L = -\sum_{i=1}^{C} \sum_{j=1}^{T} y_j^i \log s_j^i \qquad (2)

where y^i and s^i are the ground-truth and predicted probability distributions, and C is the number of categories.
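The tagging model of Equations (1)-(2) can be sketched as follows. This is a minimal illustration assuming the Hugging Face transformers and PyTorch APIs, not the authors' released implementation; sub-word alignment and padding positions are assumed to carry the label -100 so they are ignored by the loss.

<syntaxhighlight lang="python">
import torch
import torch.nn as nn
from transformers import AutoModel

# Label set from the dataset description (BIO format).
LABELS = ["O", "B-short", "I-short", "B-long", "I-long"]

class BertTagger(nn.Module):
    """Sketch of Equations (1)-(2): BERT encoder + per-token linear classifier."""
    def __init__(self, model_name="bert-base-uncased", num_labels=len(LABELS)):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)   # H = BERT(w_1..w_T; theta)
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask, labels=None):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        logits = self.classifier(self.dropout(hidden))          # (batch, T, num_labels)
        loss = None
        if labels is not None:
            # Token-level cross-entropy, Equation (2); -100 marks ignored positions.
            loss = nn.CrossEntropyLoss(ignore_index=-100)(
                logits.view(-1, logits.size(-1)), labels.view(-1))
        return loss, logits
</syntaxhighlight>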
====Adversarial Training for BERT====
Adversarial training is an important way to enhance the robustness of a model through adversarial samples. An adversarial example is an instance with small, intentional feature perturbations that induce the model to make a false prediction. During adversarial training, the input samples are first mixed with a small perturbation to generate adversarial samples (Szegedy et al. 2014). The model is then trained on both the original input samples and the generated adversarial samples to enhance its robustness and generalization. Madry et al. (2018) abstracted the general form of adversarial training as the following min-max formulation:

\min_{\theta} \mathbb{E}_{(x,y)\sim D}\Big[\max_{r_{adv}\in S} L(\theta, x + r_{adv}, y)\Big] \qquad (3)

where x represents the input representation of the sample, y is the corresponding target, r_adv is the perturbation applied to the input, S is the perturbation space, and L is a loss function such as Equation (2). First, the inner maximization problem finds the perturbation at a given data point x within the perturbation space that generates adversarial examples with high loss; this can be seen as an attack on a given neural network. Second, the outer minimization problem finds the model parameters θ that minimize the "adversarial loss" given by the inner attack problem.

With the above definition of adversarial training, we now describe how a small perturbation is applied to the input sample to generate adversarial samples in our task. There are many related adversarial training methods, such as FGSM, the single-step algorithm FGM, the multi-step algorithm PGD, and FreeLB. Since these can be regarded as one family of methods, we briefly introduce FGM, which makes a simple extension to the perturbation computation of FGSM. The main idea is to add a perturbation to the input that increases the loss, namely in the direction in which the gradient of the loss function rises. Specifically, the adversarial perturbation is defined as

g = \nabla_x L(\theta, x, y) \qquad (4)

r_{adv} = \epsilon \cdot \frac{g}{\|g\|_2} \qquad (5)

where g is the gradient of the loss with respect to x, the L2 norm is used to normalize g in Equation (5), and ε is a hyperparameter that defaults to 1. In our acronym identification task, the perturbation r_adv is added to the embedding of the input word. The overall architecture of the proposed AT-BERT is shown in Figure 1.
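Below is a minimal sketch of how the FGM perturbation of Equations (4)-(5) can be wrapped around the word-embedding weights during fine-tuning. The default epsilon of 1, the embedding parameter name, and the two-pass training loop follow the common FGM recipe and are assumptions about the implementation, not the authors' exact code.

<syntaxhighlight lang="python">
import torch

class FGM:
    """Fast Gradient Method on the word-embedding weights (Equations (4)-(5)).

    attack():  adds r_adv = epsilon * g / ||g||_2 to the embedding matrix.
    restore(): puts the original weights back after the adversarial backward pass.
    """
    def __init__(self, model, epsilon=1.0, emb_name="word_embeddings"):
        self.model, self.epsilon, self.emb_name = model, epsilon, emb_name
        self.backup = {}

    def attack(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad and self.emb_name in name and param.grad is not None:
                self.backup[name] = param.data.clone()
                norm = torch.norm(param.grad)          # ||g||_2, Equation (5)
                if norm != 0:
                    param.data.add_(self.epsilon * param.grad / norm)

    def restore(self):
        for name, param in self.model.named_parameters():
            if name in self.backup:
                param.data = self.backup[name]
        self.backup = {}

# One adversarial training step (sketch):
#   loss, _ = model(input_ids, attention_mask, labels); loss.backward()   # clean gradients
#   fgm.attack(); adv_loss, _ = model(input_ids, attention_mask, labels); adv_loss.backward()
#   fgm.restore(); optimizer.step(); optimizer.zero_grad()
</syntaxhighlight>

Applying the perturbation to the continuous embedding matrix, rather than to the discrete tokens, keeps the attack differentiable, which is why FGM-style adversarial training is a natural fit for text models.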
===Experiments===
In this section, we first introduce the experimental dataset and evaluation metrics, and then conduct comprehensive experimental studies to verify the effectiveness of our method.

====Dataset====
We evaluate all models on the dataset provided by SDU@AAAI-21. It contains a training set of 14,006 samples, a development set of 1,717 samples, and a test set of 1,750 samples, as shown in Table 2.

Data | Sample Number | Ratio
training set | 14,006 | 80.16%
development set | 1,717 | 9.82%
test set | 1,750 | 10.02%
total | 17,473 | 100%
Table 2: The statistical information of the dataset.

The task aims to identify acronyms (i.e., short forms) and their meanings (i.e., long forms) in documents. The dataset provides the boundaries of acronyms and long forms in each sentence using the BIO format (i.e., the label set includes B-short, I-short, B-long, I-long and O). The percentage of each label category over all tokens is shown in Figure 2; the distribution of label classes is clearly biased. Each sample in the training and development sets has three attributes:
• tokens: the list of words (tokens) of the sample.
• labels: the short-form and long-form labels of the words in BIO format. The labels B-short and B-long identify the beginning of a short-form and long-form phrase, respectively; the labels I-short and I-long indicate words inside a short-form or long-form phrase; finally, the label O indicates that the word is not part of any short-form or long-form phrase.
• id: the unique ID of the sample.
The test set has no labels attribute. We refer readers to (Veyseh et al. 2020b) for more details.

Figure 2: Category distribution of the training set.

====Evaluation Metrics====
Following previous work (Veyseh et al. 2020b), results are evaluated by macro-averaged precision, recall, and F1 score on the test set, computed over correct predictions of short-form (i.e., acronym) and long-form (i.e., phrase) boundaries in the sentences. A short-form or long-form boundary prediction is counted as correct if the predicted beginning and end equal the ground-truth beginning and end of the short-form or long-form boundary, respectively. The official score (denoted MacroF1) is the macro average of the short-form and long-form prediction F1 scores.
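The boundary-level scoring described above can be sketched as follows: spans are recovered from the BIO tags, a predicted span counts as correct only when both its start and end match a gold span, and the reported MacroF1 is the average of the short-form and long-form F1. This follows the textual description rather than the official scorer; gold and pred are assumed to be lists of per-sample label sequences.

<syntaxhighlight lang="python">
def bio_spans(labels, kind):
    """Return (start, end) index spans of type `kind` ('short' or 'long') from BIO labels."""
    spans, start = [], None
    for i, tag in enumerate(labels + ["O"]):            # sentinel closes a trailing span
        if tag == f"B-{kind}":
            if start is not None:
                spans.append((start, i - 1))
            start = i
        elif tag != f"I-{kind}" and start is not None:
            spans.append((start, i - 1))
            start = None
    return spans

def span_f1(gold, pred, kind):
    """F1 over exact boundary matches for one span type."""
    g = [set(bio_spans(seq, kind)) for seq in gold]
    p = [set(bio_spans(seq, kind)) for seq in pred]
    tp = sum(len(gi & pi) for gi, pi in zip(g, p))
    n_pred, n_gold = sum(len(x) for x in p), sum(len(x) for x in g)
    prec = tp / n_pred if n_pred else 0.0
    rec = tp / n_gold if n_gold else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def macro_f1(gold, pred):
    """Macro average of short-form and long-form F1 (the competition's MacroF1)."""
    return (span_f1(gold, pred, "short") + span_f1(gold, pred, "long")) / 2
</syntaxhighlight>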
====Compared Methods====
We experiment with four schemes: baselines, BERT models, Adversarial Training for BERT (AT-BERT), and model ensembles.

(a) Baselines
• Rule-based methods: these models employ manually designed rules to extract acronyms and long forms from the text. The evaluation code and results are provided by SDU@AAAI-21 (https://github.com/amirveyseh/AAAI-21-SDU-shared-task-1-AI).
• Deep learning models: as shown in previous work (Veyseh et al. 2020b), the F1 score of the LSTM-CRF model is only one percentage point higher than that of the rule-based models, so we do not re-implement LSTM-CRF ourselves. More details on these models and their hyperparameters are given in (Veyseh et al. 2020b).

(b) BERT models
• BERT: BERT (Kenton and Toutanova 2019) is a multi-layer bidirectional Transformer encoder trained with a masked language modeling (MLM) objective and a next sentence prediction task. It comes in two sizes and we experiment with both: the BERT_BASE architecture (L=12, H=768, A=12, 110M parameters in total) and the BERT_LARGE architecture (L=24, H=1024, A=16, 355M parameters in total) provided by Hugging Face (Wolf et al. 2020).
• SciBERT: SciBERT is the pre-trained model presented by Beltagy, Lo, and Cohan, which is based on BERT_BASE and trained on a large corpus of scientific text. It has achieved new state-of-the-art results on a suite of tasks in the scientific domain (Beltagy, Lo, and Cohan 2019; Zhong et al. 2021).
• RoBERTa: RoBERTa (Liu et al. 2019) improves the original implementation of BERT for better performance by using dynamic masking, removing the next sentence prediction task, and training with larger batches, on more data, and for longer. RoBERTa follows the same architecture as BERT.
• ALBERT: the ALBERT model (Lan et al. 2019) presents two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT: first, splitting the embedding matrix into two smaller matrices; second, using repeated layers shared among groups.
• ELECTRA: ELECTRA (Clark et al. 2020) proposes a more effective pre-training method. Instead of corrupting some positions of the input with [MASK], ELECTRA replaces some tokens of the input with plausible alternatives sampled from a small generator network and trains a discriminator to predict whether each token in the corrupted input was replaced by the generator or not. The pre-trained discriminator can then be fine-tuned on downstream tasks.

(c) AT-BERT models
To address the risk of overfitting and poor generalization caused by the limited training data, we use the FGM algorithm for adversarial training on the various BERT models.

(d) Model ensemble
Model ensembling is a commonly used method to improve model accuracy. We take an average ensemble of the output probability distributions of the various BERT models to obtain the final predictions, as sketched below. In general, model fusion requires that the fused models perform well individually and differ from each other, so we finally fuse four models: BERT_LARGE, RoBERTa, ALBERT, and ELECTRA (named BERT-E for short). The ensemble of the AT-BERT models, i.e., equipped with the adversarial training strategy, is denoted AT-BERT-E.
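A sketch of the average ensemble used for BERT-E and AT-BERT-E: the per-token softmax distributions of the fine-tuned taggers are averaged and the argmax is taken. For brevity it assumes the models share the forward signature of the earlier tagging sketch and a common tokenization; in practice each model would consume inputs from its own tokenizer and the token-level predictions would be aligned back to words.

<syntaxhighlight lang="python">
import torch

@torch.no_grad()
def ensemble_predict(models, input_ids, attention_mask):
    """Average the per-token probability distributions of several fine-tuned taggers
    (e.g. BERT-large, RoBERTa, ALBERT, ELECTRA) and return the argmax label ids."""
    probs = None
    for model in models:
        model.eval()
        _, logits = model(input_ids, attention_mask)     # (batch, T, num_labels)
        p = torch.softmax(logits, dim=-1)
        probs = p if probs is None else probs + p
    probs /= len(models)                                 # average ensemble
    return probs.argmax(dim=-1)                          # (batch, T) label ids
</syntaxhighlight>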
====Implementation====
All models are implemented with the open-source transformers library from Hugging Face (Wolf et al. 2020), which provides thousands of pre-trained models for text tasks such as sequence classification and information extraction, along with APIs to quickly download, use, and fine-tune those pre-trained models on one's own datasets. The deep learning framework used in this paper is PyTorch. We use two V100 GPUs with 12 cores to run the experiments.

For the above models we do not modify the original network structure; for more detailed network structures and parameters, please refer to transformers (Wolf et al. 2020). For each BERT variant we pick the best learning rate and number of epochs on the development set and report the corresponding test results. We found that most models are close to convergence when the number of epochs is 3, the learning rate is 2e-5, the maximum sentence length is 512, and the batch size is set to occupy as much GPU memory as possible; we therefore set these training parameters uniformly for all models. More detailed parameter settings are shown in Table 3.

Parameter | SciBERT | BERT | BERT_LARGE | RoBERTa | ALBERT | ELECTRA
pretrained model | scibert-scivocab-uncased | bert-base-uncased | bert-large-uncased | roberta-large | albert-xxlarge-v2 | google/electra-large-discriminator
epoch | 3 | 3 | 3 | 3 | 4 | 3
batch size | 16 | 16 | 16 | 16 | 8 | 16
learning rate | 2e-5 | 2e-5 | 2e-5 | 2e-5 | 5e-6 | 2e-5
max seq len | 512 | 512 | 512 | 512 | 512 | 512
attention probs dropout prob | 0.1 | 0.1 | 0.1 | 0.1 | 0 | 0.1
hidden dropout prob | 0.1 | 0.1 | 0.1 | 0.1 | 0 | 0.1
classifier dropout prob | - | - | - | - | 0.1 | -
num attention heads | 16 | 12 | 16 | 16 | 64 | 16
num hidden layers | 24 | 12 | 24 | 24 | 12 | 24
hidden size | 1024 | 768 | 1024 | 1024 | 4096 | 1024
hidden act | gelu | gelu | gelu | gelu | gelu_new | gelu
intermediate size | 3072 | 3072 | 4096 | 4096 | 16384 | 4096
vocab size | 30522 | 30522 | 30522 | 50265 | 30000 | 30522
Table 3: Model architecture and main parameters of our experiments. Pretrained checkpoints: https://github.com/allenai/scibert, https://huggingface.co/bert-base-uncased, https://huggingface.co/bert-large-uncased, https://huggingface.co/roberta-large, https://huggingface.co/albert-xxlarge-v2, https://huggingface.co/google/electra-large-discriminator.

====Performance Comparison====
The comparison results are shown in Table 4. The main observations are summarized as follows:

(1) Compared with the rule-based method and the LSTM-CRF model, all BERT-based models achieve better results, illustrating the advantage of pre-trained BERT. Due to its conservative nature, the rule-based method has higher precision but far lower recall than all other models. With unsupervised pre-training on large corpora, the BERT-based models outperform LSTM-CRF on all evaluation metrics.

(2) Among the six BERT-based models, SciBERT has the same architecture and training strategy as BERT_BASE; however, because its pre-training corpus is more relevant to our task, SciBERT outperforms BERT_BASE by 1.03% MacroF1. Meanwhile, BERT_LARGE has a more complex architecture and more parameters, and thus performs better than SciBERT. Taking advantage of larger training corpora and more effective training strategies, the other BERT-based models such as RoBERTa and ELECTRA improve further.

(3) With the FGM adversarial training strategy, as shown in Figure 3, the AT-BERT models clearly outperform their counterparts without adversarial training. This improvement indicates that the adversarial training strategy has a positive effect on the BERT-based models' performance.

(4) Comparing the ensemble strategies, the BERT-E model is superior to any single BERT-based model, especially in precision and MacroF1; a similar phenomenon occurs when comparing AT-BERT-E with the single AT-BERT models. The best-performing model, AT-BERT-E, surpasses the baseline methods, i.e., the rule-based method and the LSTM-CRF model, by 8.66 and 7.57 MacroF1 points, respectively.

These observations demonstrate the effectiveness of the different components of our proposed AT-BERT model. However, the best performance is still below human performance, leaving many research opportunities in this scenario.

Scheme | Methodology | Acronym P(%) | Acronym R(%) | Acronym F1(%) | Long Form P(%) | Long Form R(%) | Long Form F1(%) | MacroF1(%)
Baseline | RULE | 90.67 | 91.71 | 91.18 | 95.78 | 66.09 | 78.21 | 85.46
Baseline | LSTM-CRF | 88.58 | 86.93 | 87.75 | 85.33 | 85.38 | 85.36 | 86.55
BERT | BERT_BASE | 92.88 | 92.50 | 92.69 | 87.20 | 89.96 | 88.56 | 90.63
BERT | SciBERT | 92.61 | 90.82 | 91.71 | 90.96 | 92.37 | 91.66 | 91.69
BERT | BERT_LARGE | 94.07 | 94.28 | 94.18 | 90.60 | 91.44 | 91.02 | 92.60
BERT | RoBERTa | 93.10 | 92.63 | 92.86 | 92.77 | 93.92 | 93.35 | 93.11
BERT | ALBERT | 91.82 | 94.22 | 93.01 | 91.69 | 94.36 | 93.00 | 93.00
BERT | ELECTRA | 92.79 | 93.99 | 93.39 | 91.25 | 94.42 | 92.81 | 93.10
AT-BERT | BERT_LARGE | 94.34 | 93.17 | 93.75 | 92.04 | 93.24 | 92.64 | 93.20
AT-BERT | RoBERTa | 94.50 | 93.36 | 93.93 | 91.83 | 94.73 | 93.26 | 93.60
AT-BERT | ALBERT | 92.48 | 94.01 | 93.24 | 92.73 | 94.44 | 93.56 | 93.41
AT-BERT | ELECTRA | 94.38 | 92.88 | 93.63 | 92.66 | 93.86 | 93.26 | 93.45
Ensemble | BERT-E | 94.62 | 92.72 | 93.66 | 92.83 | 93.99 | 93.18 | 93.43
Ensemble | AT-BERT-E | 94.87 | 93.99 | 94.43 | 92.84 | 94.79 | 93.80 | 94.12
Human | Human Performance | 98.51 | 94.33 | 96.37 | 96.89 | 94.79 | 95.82 | 96.09
Table 4: Performance comparison of the compared methods.

Figure 3: Comparison of MacroF1 for models with and without (w/o) adversarial training.

====Case Study====
We further analyze the prediction results of BERT and AT-BERT. An interesting example (DEV-1629) is shown in Table 5. The long form corresponding to "CNNs", "RNNs", and "CRNNs" is "convolutional and/or recurrent neural nets", whereas plain BERT predicts only "recurrent neural nets". This example is confusing because "recurrent neural nets" alone can be considered the long form of "RNNs"; the general BERT model is easily misled by the token "and/or" and ignores the preceding token "convolutional". The experimental results show that our proposed AT-BERT indeed has better robustness and generalization.

tokens | this | study | were | convolutional | and/or | recurrent | neural | nets | ( | CNNs | , | RNNs | , | or | CRNNs | ) | ,
label | O | O | O | B-long | I-long | I-long | I-long | I-long | O | B-short | O | B-short | O | O | B-short | O | O
w/o AT | O | O | O | O | O | B-long | I-long | I-long | O | B-short | O | B-short | O | O | B-short | O | O
with AT | O | O | O | B-long | I-long | I-long | I-long | I-long | O | B-short | O | B-short | O | O | B-short | O | O
Table 5: Case analysis with and without (w/o) adversarial training.

===Conclusion and Future Work===
In this paper, we proposed a novel BERT-based model called AT-BERT for acronym identification, the winning solution of the acronym identification task at the AAAI-21 Workshop on Scientific Document Understanding. An FGM-based adversarial training strategy was incorporated into the fine-tuning of BERT variants, and an average ensemble mechanism was devised to capture better representations from multiple BERT variants. Extensive experiments on the SciAI dataset achieved the best performance among all competitive methods, which verifies the effectiveness of the proposed approach. In the future, we will optimize our model from two perspectives: one is to explore more adversarial training strategies, such as PGD and FreeLB, for the BERT model; the other is to try different loss functions, such as Dice Loss (Li et al. 2019) and Focal Loss (Lin et al. 2017), to alleviate class imbalance.

===Acknowledgments===
We thank the organizers of the acronym identification and disambiguation competitions and the reviewers for their valuable comments and suggestions.
===References===
Ackermann, C. F.; Beller, C. E.; Boxwell, S. A.; Katz, E. G.; and Summers, K. M. 2020. Resolution of Acronyms in Question Answering Systems. US Patent 10,572,597.
Barnett, A.; and Doubleday, Z. 2020. Meta-Research: The Growth of Acronyms in the Scientific Literature. eLife 9: e60080.
Beltagy, I.; Lo, K.; and Cohan, A. 2019. SciBERT: Pretrained Language Model for Scientific Text. In EMNLP.
Chen, T.; and Guestrin, C. 2016. XGBoost: A Scalable Tree Boosting System. In KDD, 785–794.
Clark, K.; Luong, M.; Le, Q. V.; and Manning, C. D. 2020. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. In ICLR.
Goodfellow, I. J.; Shlens, J.; and Szegedy, C. 2015. Explaining and Harnessing Adversarial Examples. In ICLR.
Harris, C. G.; and Srinivasan, P. 2019. My Word! Machine versus Human Computation Methods for Identifying and Resolving Acronyms. Computación y Sistemas 23(3).
Kang, D.; Head, A.; Sidhu, R.; Lo, K.; Weld, D. S.; and Hearst, M. A. 2020. Document-Level Definition Detection in Scholarly Documents: Existing Models, Error Analyses, and Future Directions. arXiv preprint arXiv:2010.05129.
Kenton, J. D. M.-W. C.; and Toutanova, L. K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT, 4171–4186.
Kuo, C.-J.; Ling, M. H.; Lin, K.-T.; and Hsu, C.-N. 2009. BIOADI: A Machine Learning Approach to Identifying Abbreviations and Definitions in Biological Literature. In BMC Bioinformatics, volume 10, S7. Springer.
Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; and Soricut, R. 2019. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. In ICLR.
Li, J.; Sun, A.; Han, J.; and Li, C. 2020. A Survey on Deep Learning for Named Entity Recognition. IEEE Transactions on Knowledge and Data Engineering.
Li, X.; Sun, X.; Meng, Y.; Liang, J.; Wu, F.; and Li, J. 2019. Dice Loss for Data-imbalanced NLP Tasks. arXiv preprint arXiv:1911.02855.
Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; and Dollár, P. 2017. Focal Loss for Dense Object Detection. In ICCV, 2980–2988.
Liu, J.; Liu, C.; and Huang, Y. 2017. Multi-granularity Sequence Labeling Model for Acronym Expansion Identification. Information Sciences 378: 462–474.
Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; and Stoyanov, V. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692.
Madry, A.; Makelov, A.; Schmidt, L.; Tsipras, D.; and Vladu, A. 2018. Towards Deep Learning Models Resistant to Adversarial Attacks. In ICLR.
Miyato, T.; Dai, A. M.; and Goodfellow, I. J. 2017. Adversarial Training Methods for Semi-Supervised Text Classification. In ICLR.
Okazaki, N.; and Ananiadou, S. 2006. Building an Abbreviation Dictionary Using a Term Recognition Approach. Bioinformatics 22(24): 3089–3095.
Pouran Ben Veyseh, A.; Dernoncourt, F.; and Nguyen, T. H. 2019. Improving Slot Filling by Utilizing Contextual Information. arXiv preprint, arXiv:1911.
Schwartz, A. S.; and Hearst, M. A. 2002. A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text. In Biocomputing 2003, 451–462. World Scientific.
Shafahi, A.; Najibi, M.; Ghiasi, M. A.; Xu, Z.; Dickerson, J.; Studer, C.; Davis, L. S.; Taylor, G.; and Goldstein, T. 2019. Adversarial Training for Free! In NeurIPS, 3358–3369.
Sun, Y.; Wang, S.; Li, Y.; Feng, S.; Chen, X.; Zhang, H.; Tian, X.; Zhu, D.; Tian, H.; and Wu, H. 2019. ERNIE: Enhanced Representation Through Knowledge Integration. arXiv preprint arXiv:1904.09223.
Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I. J.; and Fergus, R. 2014. Intriguing Properties of Neural Networks. In ICLR.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention Is All You Need. In NIPS, 5998–6008.
Veyseh, A. P. B. 2016. Cross-lingual Question Answering Using Common Semantic Space. In TextGraphs, 15–19.
Veyseh, A. P. B.; Dernoncourt, F.; Nguyen, T. H.; Chang, W.; and Celi, L. A. 2020a. Acronym Identification and Disambiguation Shared Tasks for Scientific Document Understanding. In AAAI Workshop on Scientific Document Understanding.
Veyseh, A. P. B.; Dernoncourt, F.; Tran, Q. H.; and Nguyen, T. H. 2020b. What Does This Acronym Mean? Introducing a New Dataset for Acronym Identification and Disambiguation. In COLING, 3285–3301.
Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; Davison, J.; Shleifer, S.; von Platen, P.; Ma, C.; Jernite, Y.; Plu, J.; Xu, C.; Scao, T. L.; Gugger, S.; Drame, M.; Lhoest, Q.; and Rush, A. M. 2020. Transformers: State-of-the-Art Natural Language Processing. In EMNLP, 38–45.
Xu, Y.; Qiu, X.; Zhou, L.; and Huang, X. 2020. Improving BERT Fine-Tuning via Self-Ensemble and Self-Distillation. arXiv preprint arXiv:2002.10345.
Zhang, D.; Zhang, T.; Lu, Y.; Zhu, Z.; and Dong, B. 2019. You Only Propagate Once: Accelerating Adversarial Training via Maximal Principle. In NeurIPS, 227–238.
Zhong, Q.; Zeng, G.; Zhu, D.; Zhang, Y.; Lin, W.; Chen, B.; and Tang, J. 2021. Leveraging Domain Agnostic and Specific Knowledge for Acronym Disambiguation. In AAAI Workshop on Scientific Document Understanding.
Zhu, C.; Cheng, Y.; Gan, Z.; Sun, S.; Goldstein, T.; and Liu, J. 2020. FreeLB: Enhanced Adversarial Training for Natural Language Understanding. In ICLR.