Acronym Extraction with Hybrid Strategies
Siheng Li1,† , Cheng Yang1,† , Tian Liang1,† , Xinyu Zhu1 , Chengze Yu1 and Yujiu Yang1,∗
1
    Tsinghua Shenzhen International Graduate School, Tsinghua University


                                             Abstract
                                             Acronym extraction plays an important role in scientific document understanding. Recently, the AAAI-22 Workshop on
                                             Scientific Document Understanding released multiple high-quality datasets and attracted widespread attention. In this work,
                                             we present our hybrid strategies with adversarial training for this task. Specifically, we first apply pre-trained models to obtain
                                             contextualized text encoding. Then, on the one hand, we employ a sequence labeling strategy with BiLSTM and CRF to tag
                                             each word in a sentence. On the other hand, we use a span selection strategy that directly predicts the acronym and long-form
                                             spans. In addition, we adopt adversarial training to further improve the robustness and generalization ability of our models.
                                             Experimental results show that both methods outperform strong baselines and rank high on the SDU@AAAI-22 - Shared
                                             Task 1: Acronym Extraction, our scores rank 2nd in 4 test sets and 3rd in 3 test sets. Moreover, the ablation study further
                                             verifies the effectiveness of each component. Our code is available at https://github.com/carlyoung1999/AAAI-SDU-Task1 .

                                             Keywords
                                             Acronym Extraction, Natural Language Processing, BERT


1. Introduction                                                                                                                             Input:
                                                                                                                                            Existing methods for learning with noisy labels (LNL)
An acronym consists of the initial letters of the corre-                                                                                    primarily take a loss correction approach.
sponding terminology and is widely used in scientific
                                                                                                                                            Output:
documents for its convenience. However, this also makes
                                                                                                                                            Acronym: LNL
it difficult to understand scientific documents for both                                                                                    Long-form: learning with noisy labels
humans and machines. In natural language processing,
accurate acronym extraction is beneficial for the down-
stream applications like question answering [1], defini-                                                                              Figure 1: An example of Acronym Extraction.
tion extraction [2] and relation extraction [3, 4]. Recently,
SDU@AAAI-22 released multiple datasets [5] for scien-
tific document understanding, and we focus on the task of
acronym extraction [6], which aims to extract acronyms                                                                                to capture feature interactions between adjacent words
and their corresponding explanations (long-forms); a toy                                                                              further and employ CRF to model the dependency be-
example can be seen in Figure 1.                                                                                                      tween sequence labels for the sequence labeling strategy.
    Traditional approaches utilize rule-based pattern [7]                                                                             As for the span selection strategy, we use binary tag-
or manual features [8] which are labor-force and time-                                                                                gers to predict the start and end index for acronyms or
consumed. Recently, deep learning based methods [9, 10]                                                                               long-forms. To further improve our models’ robustness
are preferred for their better performance and end-to-end                                                                             and generalization ability, we employ adversarial train-
learning.                                                                                                                             ing, which dynamically adds noise to avoid overfitting.
    In this paper, we propose two strategies for acronym                                                                              These two strategies get comparable performance, and
extraction, sequence labeling strategy and span selection                                                                             we choose the better one for evaluation according to their
strategy. Specifically, we first use pre-trained language                                                                             performance in the development set. Our contributions
models like BERT [11] or RoBERTa [12] to obtain contex-                                                                               are as follows:
tualized word representations. Then, we utilize BiLSTM
                                                                                                                                           • We propose two strategies for acronym extrac-
The second workshop on Scientific Document Understanding at AAAI                                                                             tion, sequence labeling and span selection.
2022                                                                                                                                       • Our adversarial training further improves the ro-
∗
†
     Corresponding author.                                                                                                                   bustness and generalization ability of our models.
    These authors contributed equally.                                                                                                     • Experiments show that our models outper-
Envelope-Open lisiheng21@mails.tsinghua.edu.cn (S. Li);
yangc21@mails.tsinghua.edu.cn (C. Yang);
                                                                                                                                             form strong baselines and rank high in the
liangt21@mails.tsinghua.edu.cn (T. Liang);                                                                                                   SDU@AAAI-22 - Shared Task 1: Acronym Ex-
zhuxy21@mails.tsinghua.edu.cn (X. Zhu);                                                                                                      traction.
ycz21@mails.tsinghua.edu.cn (C. Yu);
yang.yujiu@sz.tsinghua.edu.cn (Y. Yang)
                                       © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License
                                       Attribution 4.0 International (CC BY 4.0).
    CEUR
    Workshop
    Proceedings
                  http://ceur-ws.org
                  ISSN 1613-0073
                                       CEUR Workshop Proceedings (CEUR-WS.org)
2. Related Works                                                 strategy with BPE (Byte-Pair-Encoding) and dynamic
                                                                 masking to increase shared vocabulary, thus providing
In this section, we introduce the related studies for            more fine-grained representations and stronger robust-
acronym extraction, including Rule-based, LSTM-based,            ness. SciBERT [24] has the same structure as BERT, while
and Pre-trained-based methods.                                   it is well pre-trained to process scientific documents
                                                                 specifically. Many works utilize the power of pre-trained
2.1. Rule-based                                                  models for acronym extraction. Pan et al. [25] proposes
                                                                 a multi-task learning method based on BERT-CRF and
Traditional acronym extraction methods mainly focus              BERT-Span, which makes full use of these two separate
on rule-based methods. Specifically, most of them [13]           models through redefining the fusion loss function and
utilize generic rules or text patterns to discover acronym       achieves great performance. Li et al. [26] utilizes Sen-
expansions in the field of biomedicine. Torres-Schumann          tence Piece byte-pair encoding to relabel sentences. Then,
and Schulz [14] further extend rule sets to hidden Markov        they are embedded into the XLNet [27] for processing.
models and improve both recall and precision values. Re-
cently, a new work [15] has made a comprehensive intro-
duction to the rule-based machine identification meth-           3. Methodology
ods. They comprehensively classify present Rule-based
models, analyze two separate approaches (a machine al-           3.1. Task Formulation
gorithm and a crowd-sourcing approach), and compare
                                                          Given a text 𝑋 = {𝑥1 , 𝑥2 , ..., 𝑥𝑙 } where each 𝑥𝑖 is a word and
them in detail. However, Due to the conservative nature
                                                          𝑙 represents text length, acronym extraction aims to find
of rule-based models, this method requires complicated
                                                          all acronyms and long-forms mentioned in this text. For-
manual formulations and lacks flexibility.
                                                          mally, the model needs to automatically extract acronym
                                                          mention set 𝒜 = {[𝑠1 , 𝑒1 ), [𝑠2 , 𝑒2 ), ..., [𝑠𝑛 , 𝑒𝑛 )}, where 𝑠𝑖 and
2.2. LSTM-based                                           𝑒𝑖 denotes the start and end position of the i-th acronym
Taking advantage of LSTM [16]’s power for text model- respectively. In addition, the model also needs to extract
ing, LSTM-based methods has got decent performance in long-form mention set ℬ = {[𝑠1 , 𝑒1 ), [𝑠2 , 𝑒2 ), ..., [𝑠𝑚 , 𝑒𝑚 )},
acronym extraction. They mainly focus on better seman- similar with 𝒜.
tic representations and attention mechanisms. DECBAE
[17] extracts contextualized features with BioELMo [18] 3.2. Overview
and provides these features to specific abbreviated BiL-
                                                          We describe our hybrid strategies to extract acronyms and
STMs, achieving good performance. In addition, they
                                                          long-forms in this section. At first, we use pre-trained
use a simple but effective heuristic method for automat-
                                                          models for tokenizing and encoding the original sentence.
ically collecting datasets from a large corpus. Li et al.
                                                          Then, we employ a BiLSTM-CRF head to model acronym
[19] propose a novel topic-attention model and compare
                                                          extraction as a sequence labeling task and a BiLSTM-Span
the performance of different attention mechanisms em-
                                                          head to model it as a span selection task. In addition, to
bedded in LSTM and ELMo. Their model is applied to
                                                          improve the robustness and generalization of our models,
the acronym task of medical terms. To further capture
                                                          we apply adversarial training techniques.
the dependency between sequence labels, Veyseh et al.
[20] propose to combine LSTM with CRF for Acronym
identification and Disambiguation.                        3.3. BERT Encoder
                                                                 We adopt BERT or RoBERTa as a text encoder to capture
2.3. Pre-trained-based                                           rich contextualized word embeddings. For brevity, we
                                                                 use BERT to indicate both BERT and RoBERTa following.
Language models pre-trained with a large corpus have
                                                                 Given the input 𝑋 = {𝑥1 , 𝑥2 , ..., 𝑥𝑙 }, with the help of deep
shown promising performance in lots of downstream
                                                                 multi-head attention layers, BERT captures contextual-
tasks. One of the most popular is Bidirectional Encoder
                                                                 ized representation for each token. The encoding process
Representations from Transformers (BERT) [21], which
                                                                 is as follows:
obtains rich semantic representations by Masked LM task
in the pre-training stage. BERT has been applied to many                 𝐻 = BERT([𝑥1 , 𝑥2 , ..., 𝑥𝑙 ]) = [ℎ1 , ℎ2 , ..., ℎ𝑙 ]𝑇 ,   (1)
NLP tasks like information extraction [22] and dialogue
state tracking [23].                                             where 𝐻 ∈ ℝ𝑙×𝑑 , and 𝑑 denotes hidden dimension.
   In addition, it is worth mentioning that there have
been many fine-grained improvements or specific domain
variants of BERT. RoBERTa [12] optimizes the training
                                  acronyms: [“DL”]                                                                     acronyms: [“DL”]
                                  long-forms: [“Deep Learning”]                                                        long-forms: [“Deep Learning”]


                B-Acronym         O           O        B-Long       I-Long
                                                                                        S-Acronym     1                0            0            0               0
     output         O             O           O        B-Long       I-Long
   sequences:                                                                           E-Acronym     1                0            0            0               0
                B-Acronym         O           O        B-Long         O

                                                                                         S-Long       0                0            0            1               0
                                        CRF Tagger                                       E-Long       0                0            0            0               1


                                       Linear Layer                                                       S-Acronym Tagger                           S-Long Tagger
                                                                                                                             Binary Taggers
                                                                                                          E-Acronym Tagger                           E-Long Tagger


                 BiLSTM      BiLSTM         BiLSTM         BiLSTM    BiLSTM
                                                                                                    BiLSTM          BiLSTM        BiLSTM      BiLSTM           BiLSTM


                                      BERT Encoder
                                                                                                                             BERT Encoder

       input:     DL         stands          for           Deep     Learning
                                                                                          input:     DL             stands         for         Deep          Learning


Figure 2: The model architecture of our Sequence Labeling
strategy.                                                                            Figure 3:       The model architecture of our Span Selection
                                                                                     strategy.


3.4. Sequence Labeling Strategy
                                                                                     denotes transition scorer in CRF and is a learnable matrix
For this strategy, we first transform the character-level                            practically. 𝑍 (𝑋 ) is the normalization factor to constraint
position labels provided by raw datasets to token-level                              the probability in (0, 1). The loss function is negative log-
BIO labels as follows:                                                               likelihood :
                                                                                                         ℒ𝑆𝐿 = − log(𝑃(𝑌 |𝑋 )).                (5)
      • B-Acronym: Beginning of an acronym.
      • I-Acronym: Inside of an acronym.                                                For the inference, we use the Viterbi algorithm [28]
      • B-Long: Beginning of a long-form.                                            for decoding the best label sequence.
      • I-Long: Inside of a long-form.
      • O: Outside of any acronym and long-form.                                     3.5. Span Selection Strategy
   To solve this sequence labeling problem, we adopt                                 We also formulate it as an extractive span selection task,
a BERT-BiLSTM-CRF method, and the architecture is                                    aiming to find the text span of acronyms and long-forms
shown in Figure 2. First, we utilize a BiLSTM network                                directly. Similar to the sequence labeling strategy, we
to capture feature interactions between adjacent words                               transform the character-level labels [𝑠𝑡𝑎𝑟𝑡, 𝑒𝑛𝑑) provided
further:                                                                             by raw datasets to token-level [𝑠𝑡𝑎𝑟𝑡, 𝑒𝑛𝑑) for the follow-
                   𝐻 ′ = BiLSTM(𝐻 ),                (2)                              ing token classification.
                                                                                        We adopt the same BERT encoder and LSTM network
where 𝐻 ′ ∈ ℝ𝑙×2𝑑 . Then, a linear classifier transforms 𝐻 ′                         as above to get contextualized word representations 𝐻 ′ ∈
into the logits of 5 BIO labels defined above:                                       ℝ𝑙×2𝑑 . Then we construct four binary taggers:
                  𝐿 = [𝐿0 , 𝐿1 , 𝐿2 , 𝐿3 , 𝐿4 ] = 𝐻 ′ 𝑊𝐿 ,                     (3)        • S-Acronym Tagger predicts whether a token is
                                                                                            the start of an acronym.
where 𝑊𝐿 ∈ ℝ2𝑑×5 and 𝐿 = [𝐿0 , 𝐿1 , 𝐿2 , 𝐿3 , 𝐿4 ] ∈ ℝ𝑙×5 are
the logits.                                                                               • E-Acronym Tagger predicts whether a token is
  To model the dependency between sequence labels,                                          the end of an acronym.
we adopt a Linear Chain CRF (Conditional Random Field)                                    • S-Long Tagger predicts whether a token is the
[28], the probability of a tagged sequence is:                                              start of a long-form.
                                                                                          • E-long Tagger predicts whether a token is the
                              𝑙                        𝑙                                    end of a long-form.
                       exp(∑𝑖=1 𝜑(𝑦𝑖 |𝑥𝑖 ) + ∑𝑖=1 𝜓 (𝑦𝑖 |𝑦𝑖−1 ))
    𝑃(𝑌 |𝑋 ) =                                                   ,             (4)
                                       𝑍 (𝑋 )                  We apply a simple linear layer to represent these tag-
where 𝑌 = [𝑦1 , 𝑦2 , ..., 𝑦𝑙 ] is the ground truth label se- gers which work as follows:
quence and 𝑦𝑖 is the label for 𝑖-th token. 𝜑(⋅) represents
                                                                          𝐿 = [𝐿0 , 𝐿1 , 𝐿2 , 𝐿3 ] = 𝐻 ′ 𝑊𝑆 ,     (6)
emission scorer which refers to the logits 𝐿 above. 𝜓 (⋅)
                                                 English                       Persian                 Vietnamese
                         Method
                                         P          R        F1         P         R       F1       P          R        F1
                         Rule         0.33        0.15      0.20       0.95      0.44    0.60     0.82       0.39      0.53
                         BERT         0.82        0.85      0.83       0.94      0.47    0.63     0.82       0.73      0.77
                         RoBERTa      0.84        0.88      0.86       0.94      0.52    0.67     0.97       0.48      0.64
                         Ours-SL      0.86        0.88      0.87       0.87      0.59    0.70     0.98       0.65      0.78
                         Ours-SS      0.86        0.89      0.87       0.84      0.67    0.73     0.81       0.91      0.85

Table 1
Performance comparison on the development sets of scientific domain.


                                             English                           Persian                     Vietnamese
                     Ranking
                                     P            R         F1          P          R        F1         P          R         F1
                            1       0.89         0.92      0.90    0.76          0.82     0.79     0.85        0.82         0.84
                            2       0.89⋆        0.89⋆     0.89⋆   0.60⋆         0.69⋆    0.63⋆    0.83        0.84         0.83
                            3       0.85         0.87      0.86    0.92          0.43     0.59     0.96⋆       0.62⋆        0.76⋆
                            4       0.83         0.88      0.86    0.64          0.51     0.57     0.64        0.66         0.65

Table 2
Performance comparison on the test sets of scientific domain, ⋆ indicates the score of our model.


where 𝑊𝑆 ∈ ℝ2𝑑×4 , and 𝐿 = [𝐿0 , 𝐿1 , 𝐿2 , 𝐿3 ] ∈ ℝ𝑙×4 are                       Datasets                  Training     Development    Test
logits for 4 classes declared above. The loss function is                        English Scientific         3980                 497   498
binary cross entropy:                                                            Persian                    1336                 167   168
                                                                                 Vietnamese                 1274                 159   160
            𝑙   3
                     𝑗          𝑗            𝑗               𝑗                   English Legal              3564                 445   446
 ℒ𝑆𝑆 = ∑ ∑[−𝑦𝑖 ⋅log(𝜎(𝑙𝑖 ))+(1−𝑦𝑖 )⋅log(1−𝜎(𝑙𝑖 ))], (7)
         𝑖=0 𝑗=0                                                                 French                     7783                 973   973
                                                                                 Spanish                    5928                 741   741
        𝑗                                                          𝑗             Danish                     3082                 385   386
where 𝑦𝑖 is the label for 𝑖-th token regarding class 𝑗, 𝑙𝑖 is
the logit for 𝑖-th token regarding class 𝑗, and 𝜎(𝑥) denotes
                                                                              Table 3
sigmoid function.                                                             Statistics of the datasets, the first three belongs to scientific
   For the inference, we first predict the class label of                     domain while the others belongs to legal domain.
each token. Then, we match each S-Acronym token with
the nearest E-Acronym token to get an acronym. The
operation for long-form is the same.
                                                                              3.7. Objective Function
3.6. Adversarial Training                                                     We jointly train our models with adversarial training, for
                                                                              sequence labeling strategy:
To enhance the robustness and generalization ability of
our models, we adopt adversarial training. Specifically,                   ℒ = ℒ𝑆𝐿 + 𝛼ℒ𝐴𝑑𝑣 .                                                  (9)
given an input 𝑋, we incorporate a posterior regulariza-
tion mechanism [29]:                                     For span selection strategy:

                                                                                                   ℒ = ℒ𝑆𝑆 + 𝛼ℒ𝐴𝑑𝑣 .                      (10)
            ℒ𝐴𝑑𝑣 = max ∑ Div(𝑓𝜃 (𝑋 )||𝑓𝜃 (𝑋 + 𝜖)),                 (8)
                    ‖𝜖‖≤𝑎
                                                            The 𝛼 is used for controlling the significance of adversar-
where Div is some f-divergence1 , 𝜖 is noise, 𝑎 is noise ial training.
norm and 𝑓𝜃 represents the predict function in our mod-
els, like CRF tagger and Binary taggers. This loss regular-
izes the posterior difference between original and noisy 4. Experiments
inputs to avoid overfitting. Practically, we use an inner
loop to search the most adversarial direction.              4.1. Datasets
                                                            Our experiments are conducted on the official dataset
1
  We use Jensen-Shannon divergence in our experiments.      of SDU@AAAI-22 - Shared Task 1: Acronym Extrac-
                                    English                        French                   Spanish                     Danish
            Method
                             P            R         F1        P         R     F1       P       R       F1       P          R       F1
            Rule           0.32        0.10      0.16      0.22     0.06     0.10    0.17    0.07     0.10    0.10       0.06      0.08
            BERT           0.88        0.87      0.88      0.94     0.94     0.94    0.89    0.90     0.89    0.93       0.94      0.93
            RoBERTa        0.87        0.88      0.88      0.78     0.76     0.77    0.88    0.88     0.88    0.90       0.92      0.91
            Ours-SL        0.88        0.88      0.88      0.95     0.94     0.94    0.90    0.90     0.90    0.94       0.95      0.94
            Ours-SS        0.89        0.88      0.89      0.95     0.93     0.94    0.90    0.90     0.90    0.95       0.93      0.94

Table 4
Performance comparison on the development sets of legal domain.


                                 English                          French                    Spanish                      Danish
         Ranking
                       P           R           F1         P         R        F1       P        R       F1           P          R        F1
              1      0.90         0.92        0.91       0.94     0.95      0.94    0.90     0.91     0.91     0.95        0.98         0.96
              2      0.88⋆        0.91⋆       0.90⋆      0.92⋆    0.93⋆     0.93⋆   0.90     0.91     0.90     0.95        0.97         0.96
              3      0.87         0.91        0.89       0.93     0.92      0.92    0.90⋆    0.90⋆    0.90⋆    0.95⋆       0.95⋆        0.95⋆
              4      0.87         0.90        0.88       0.81     0.80      0.81    0.90     0.90     0.90     0.89        0.90         0.89

Table 5
Performance comparison on the test sets of legal domain, ⋆ indicates the score of our model.


                                       English Scientific                           • Roberta-based This is similar with above, ex-
            Method
                                          P         R      F1                         cept RoBERTa [12] as text encoder.
            Ours-SL                    0.86      0.88      0.87
            Ours-SL w/o CRF            0.84      0.88      0.86               4.3. Implementations
            Ours-SL w/o AT             0.86      0.87      0.86
                                                                              For baselines, we select pre-train models trained with
            Ours-SS                    0.86      0.89      0.87               corresponding language corpora in HuggingFace Trans-
            Ours-SS w/o AT             0.86      0.87      0.86               formers [30]. As for ours, we adopt the best pre-trained
Table 6
                                                                              models according to their performance in the develop-
Ablation studies in the development set of English Scientific.                ment set. Specifically, we adopt roberta-base3 for En-
                                                                              glish, roberta-fa-zwnj-base-ner4 for Persian, bert-base-
                                                                              vi-cased5 for Vietnamese, bert-base-fr-cased6 for French,
                                                                              bert-base-es-cased7 for Spanish and danish-bert-botxo-
tion. They provide the data of scientific domain includ-
                                                                              ner-dane8 for Danish.
ing English, Persian and Vietnamese; and legal domain
                                                                                 We tune the hyper-parameters according to the perfor-
including English, French, Spanish and Danish. Table 3
                                                                              mance in the development set. For the sequence labeling
summarizes the statistics of datasets used in our experi-
                                                                              strategy, the batch size, LSTM layer, LSTM hidden size,
ments.
                                                                              adversarial training weight are 8, 1, 256, 0.1, respectively.
                                                                              The batch size, LSTM layer, LSTM hidden size, and ad-
4.2. Baselines                                                                versarial training weight for our span selection strategy
                                                                              are 16, 1, 256, and 1.0. We run all experiments using
To investigate the effectiveness of our proposed approach,
                                                                              PyTorch 1.9.1 on the Nvidia GeForce RTX 3090 GPU, In-
we compare it with the following three baselines:
                                                                              tel(R) Xeon(R) Platinum 8260L CPU on Ubuntu 18.04.4
        • Rule-based This method utilizes a manually de-                      LTS OS. Our code will be released soon.
          signed pattern to extract acronyms and is pro-
          vided by SDU@AAAI-22 2 .
        • BERT-based This method employs BERT [21] as
          a text encoder to get contextualized word repre-                    3
                                                                                https://huggingface.co/roberta-base
          sentation, then employs a classification head to                    4
                                                                                https://huggingface.co/HooshvareLab/roberta-fa-zwnj-base-ner
          tag each word.                                                      5
                                                                                https://huggingface.co/Geotrend/bert-base-vi-cased
                                                                              6
                                                                                https://huggingface.co/Geotrend/bert-base-fr-cased
                                                                              7
                                                                                https://huggingface.co/Geotrend/bert-base-es-cased
2                                                                             8
    https://github.com/amirveyseh/AAAI-22-SDU-shared-task-1-AE                  https://huggingface.co/Maltehb/danish-bert-botxo-ner-dane
4.4. Results                                               6. Acknowledgments
4.4.1. Scientific Domain                                  This research was supported in part by the National
The comparison between the proposed model and base- Key Research and Development Program of China (No.
line models is shown in Table 1. The main observations 2020YFB1708200) and the Shenzhen Key Laboratory of
can be summarized as follows:                             Marine IntelliSense and Computation under Contract
                                                          ZDSYS20200811142605016.
     • Compared with manually designed rule-based
       methods, pre-trained model-based methods have
       huge advantages because they can capture rea- References
       sonable word representations.
                                                           [1] C. F. Ackermann, C. E. Beller, S. A. Boxwell, E. G.
     • The difference between the BERT model and
                                                               Katz, K. M. Summers, Resolution of acronyms
       RoBERTa model is remarkable. We conjecture
                                                               in question answering systems, 2020. US Patent
       this is due to the datasets being small; thus, the
                                                               10,572,597.
       results depend more on the power of the pre-
                                                           [2] D. Kang, A. Head, R. Sidhu, K. Lo, D. S. Weld, M. A.
       trained model.
                                                               Hearst, Document-level definition detection in
     • Our two strategies get similar results and outper-      scholarly documents: Existing models, error anal-
       form all baseline methods. We submit the better         yses, and future directions, in: Proceedings of
       one for testing.                                        SDP@EMNLP 2020, Association for Computational
Table 2 shows the top 4 scores in the test sets of the            Linguistics, 2020, pp. 196–206.
scientific domain; our method gets decent performance         [3] Y. Shi, Y. Yang, Relational facts extraction with
and ranks 2st in English and Persian, 3st in Vietnamese.          splitting mechanism, in: 2020 IEEE International
                                                                  Conference on Knowledge Graph, ICKG 2020„ IEEE,
                                                                  2020, pp. 374–379.
4.4.2. Legal Domain
                                                              [4] L. Ding, Z. Lei, G. Xun, Y. Yang, FAT-RE: A faster
The comparison is shown in Table 4, the observations are          dependency-free model for relation extraction, J.
similar with Scientific Domain, and our method outper-            Web Semant. 65 (2020) 100598.
forms all baseline models stably. Table 5 shows the top       [5] S. Y. R. J. F. D. T. H. N. Amir Pouran Ben Vey-
4 scores in the test sets; our method gets decent perfor-         seh, Nicole Meister, MACRONYM: A Large-
mance and ranks 2 in English and French, 3 in Spanish             Scale Dataset for Multilingual and Multi-Domain
and Danish.                                                       Acronym Extraction, in: arXiv, 2022.
                                                              [6] S. Y. R. J. F. D. T. H. N. Amir Pouran Ben Veyseh,
4.5. Ablation Study                                               Nicole Meister, Multilingual Acronym Extraction
                                                                  and Disambiguation Shared Tasks at SDU 2022, in:
To further prove the effectiveness of each component, we          Proceedings of SDU@AAAI-22, 2022.
run ablation studies on the development set of English        [7] N. Okazaki, S. Ananiadou, Building an abbrevia-
Scientific, as shown in Table 6. We find that: (1) for our        tion dictionary using a term recognition approach,
sequence labeling strategy, CRF is necessary because it           Bioinform. 22 (2006) 3089–3095.
helps capture the dependency between sequence labels. [8] C. Kuo, M. H. T. Ling, K. Lin, C. Hsu, BIOADI: a
(2) adversarial training is beneficial to both strategies by      machine learning approach to identifying abbrevia-
adding reasonable noises, which improve our models’               tions and definitions in biological literature, BMC
robustness and generalization performance.                        Bioinform. 10 (2009) 7.
                                                              [9] D. Zhu, W. Lin, Y. Zhang, Q. Zhong, G. Zeng,
                                                                  W. Wu, J. Tang, AT-BERT: adversarial training
5. Conclusion                                                     BERT for acronym identification winning solution
In this paper, we explore and propose two strategies              for sdu@aaai-21, CoRR abs/2101.03700 (2021).
with adversarial training for SDU@AAAI-22 - Shared           [10] N. Egan, J. Bohannon, Primer ai’s systems for
Task 1: Acronym Extraction. Experiments show that                 acronym identification and disambiguation, in: Pro-
our methods outperform strong baseline methods in all             ceedings of the SDU@AAAI 2021, volume 2831 of
7 datasets. In addition, our score ranks high in the test         CEUR Workshop Proceedings, 2021.
sets. For future work, we will try to solve the problem of   [11] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT:
class imbalance in both strategies.                               pre-training of deep bidirectional transformers for
                                                                  language understanding, in: Proceedings of the
                                                                  NAACL-HLT 2019, Volume 1 (Long and Short Pa-
     pers), Association for Computational Linguistics,            putational Linguistics, 2019, pp. 3613–3618.
     2019, pp. 4171–4186.                                    [25] C. Pan, B. Song, S. Wang, Z. Luo, Bert-based
[12] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen,          acronym disambiguation with multiple training
     O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov,              strategies, in: Proceedings of the SDU@AAAI 2021,
     Roberta: A robustly optimized BERT pretraining               volume 2831 of CEUR Workshop Proceedings, 2021.
     approach, CoRR abs/1907.11692 (2019).                   [26] F. Li, Z. Mai, W. Zou, W. Ou, X. Qin, Y. Lin,
[13] A. S. Schwartz, M. A. Hearst, A simple algorithm for         W. Zhang, Systems at SDU-2021 task 1: Trans-
     identifying abbreviation definitions in biomedical           formers for sentence level sequence label, in: Pro-
     text, Pacific Symposium on Biocomputing. Pacific             ceedings of the SDU@AAAI 2021, volume 2831 of
     Symposium on Biocomputing 4 (2003) 451–462.                  CEUR Workshop Proceedings, 2021.
[14] E. Torres-Schumann, K. U. Schulz, Stable meth-          [27] Z. Yang, Z. Dai, Y. Yang, J. G. Carbonell, R. Salakhut-
     ods for recognizing acronym-expansion pairs: from            dinov, Q. V. Le, Xlnet: Generalized autoregressive
     rule sets to hidden markov models, Int. J. Document          pretraining for language understanding, in: Ad-
     Anal. Recognit. 8 (2006).                                    vances in Neural Information Processing Systems
[15] C. G. Harris, P. Srinivasan, My word! machine                32, NeurIPS 2019, 2019, pp. 5754–5764.
     versus human computation methods for identifying        [28] C. Sutton, A. McCallum, An introduction to condi-
     and resolving acronyms, Computación y Sistemas               tional random fields, Found. Trends Mach. Learn. 4
     23 (2019).                                                   (2012) 267–373.
[16] S. Hochreiter, J. Schmidhuber, Long short-term          [29] H. Cheng, X. Liu, L. Pereira, Y. Yu, J. Gao, Posterior
     memory, Neural Comput. 9 (1997) 1735–1780.                   differential regularization with f-divergence for im-
[17] Q. Jin, J. Liu, X. Lu, Deep contextualized biomedical        proving model robustness, in: Proceedings of the
     abbreviation expansion, in: Proceedings of the               NAACL-HLT 2021, Association for Computational
     BioNLP@ACL 2019, Association for Computational               Linguistics, 2021, pp. 1078–1089.
     Linguistics, 2019, pp. 88–96.                           [30] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. De-
[18] Q. Jin, B. Dhingra, W. W. Cohen, X. Lu, Prob-                langue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Fun-
     ing biomedical embeddings from language models,              towicz, J. Davison, S. Shleifer, P. von Platen, C. Ma,
     CoRR abs/1904.02181 (2019).                                  Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger,
[19] I. Li, M. Yasunaga, M. Y. Nuzumlali, C. Caraballo,           M. Drame, Q. Lhoest, A. M. Rush, Transformers:
     S. Mahajan, H. M. Krumholz, D. R. Radev, A neural            State-of-the-art natural language processing, in:
     topic-attention model for medical term abbreviation          Proceedings of the EMNLP 2020 - Demos, Associ-
     disambiguation, CoRR abs/1910.14076 (2019).                  ation for Computational Linguistics, Online, 2020,
[20] A. P. B. Veyseh, F. Dernoncourt, Q. H. Tran, T. H.           pp. 38–45.
     Nguyen, What does this acronym mean? introduc-
     ing a new dataset for acronym identification and
     disambiguation, arXiv preprint arXiv:2010.14678
     (2020).
[21] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova,
     Bert: Pre-training of deep bidirectional transform-
     ers for language understanding, arXiv preprint
     arXiv:1810.04805 (2018).
[22] Z. Wei, J. Su, Y. Wang, Y. Tian, Y. Chang, A novel
     cascade binary tagging framework for relational
     triple extraction, in: Proceedings of the 58th An-
     nual Meeting of the Association for Computational
     Linguistics, ACL 2020, Association for Computa-
     tional Linguistics, 2020, pp. 1476–1488.
[23] S. Kim, S. Yang, G. Kim, S. Lee, Efficient dialogue
     state tracking by selectively overwriting memory,
     in: Proceedings of the 58th Annual Meeting of the
     Association for Computational Linguistics, ACL
     2020, Association for Computational Linguistics,
     2020, pp. 567–582.
[24] I. Beltagy, K. Lo, A. Cohan, Scibert: A pretrained
     language model for scientific text, in: Proceedings
     of the EMNLP-IJCNLP 2019, Association for Com-