<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Siheng Li</string-name>
          <email>lisiheng21@mails.tsinghua.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cheng Yang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tian Liang</string-name>
          <email>liangt21@mails.tsinghua.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xinyu Zhu</string-name>
          <email>zhuxy21@mails.tsinghua.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chengze Yu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yujiu Yang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Acronym Extraction, Natural Language Processing, BERT</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Tsinghua Shenzhen International Graduate School, Tsinghua University</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Acronym extraction plays an important role in scientific document understanding. Recently, the AAAI-22 Workshop on Scientific Document Understanding released multiple high-quality datasets and attracted widespread attention. In this work, we present our hybrid strategies with adversarial training for this task. Specifically, we first apply pre-trained models to obtain contextualized text encodings. On the one hand, we employ a sequence labeling strategy with a BiLSTM and a CRF to tag each word in a sentence. On the other hand, we use a span selection strategy that directly predicts the acronym and long-form spans. In addition, we adopt adversarial training to further improve the robustness and generalization ability of our models. Experimental results show that both methods outperform strong baselines and rank high on SDU@AAAI-22 - Shared Task 1: Acronym Extraction: our scores rank 2nd on 4 test sets and 3rd on 3 test sets. Moreover, the ablation study further verifies the effectiveness of each component. Our code is available at https://github.com/carlyoung1999/AAAI-SDU-Task1.</p>
      </abstract>
      <kwd-group>
        <kwd>Acronym Extraction</kwd>
        <kwd>Natural Language Processing</kwd>
        <kwd>BERT</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Input:</title>
      <p>Existing methods for learning with noisy labels (LNL)
primarily take a loss correction approach.</p>
    </sec>
    <sec id="sec-2">
      <title>Output:</title>
      <p>Acronym: LNL</p>
    </sec>
    <sec id="sec-3">
      <title>Long-form: learning with noisy labels</title>
      <sec id="sec-3-1">
        <title>1. Introduction</title>
        <p>An acronym consists of the initial letters of the corresponding terminology and is widely used in scientific documents for its convenience. However, this also makes scientific documents difficult to understand for both humans and machines. In natural language processing, accurate acronym extraction is beneficial for downstream applications like question answering [1], definition extraction [2] and relation extraction [3, 4]. Recently, the second Workshop on Scientific Document Understanding at AAAI-22 released multiple high-quality acronym extraction datasets and attracted widespread attention.</p>
        <p>In this work, we present hybrid strategies with adversarial training for this task. We first apply pre-trained models to obtain contextualized text encodings. For the sequence labeling strategy, we employ a BiLSTM to further capture feature interactions between adjacent words and a CRF to model the dependency between sequence labels. For the span selection strategy, we use binary taggers to predict the start and end indices of acronyms and long-forms. To further improve our models' robustness and generalization ability, we employ adversarial training, which dynamically adds noise to avoid overfitting. The two strategies achieve comparable performance, and we choose the better one for evaluation according to their performance on the development set. Our contributions are as follows:</p>
        <p>• We propose two strategies for acronym extraction: sequence labeling and span selection.</p>
        <p>• Our adversarial training further improves the robustness and generalization ability of our models.</p>
        <p>• Experiments show that our models outperform strong baselines and rank high on the shared task test sets.</p>
      </sec>
      <sec id="sec-3-2">
        <title>2. Related</title>
      </sec>
      <sec id="sec-3-3">
        <title>Works</title>
        <p>In this section, we introduce the related studies for
acronym extraction, including Rule-based, LSTM-based,
and Pre-trained-based methods.</p>
        <sec id="sec-3-3-1">
          <title>2.1. Rule-based</title>
          <p>Traditional acronym extraction methods mainly focus on rule-based approaches. Specifically, most of them [13] utilize generic rules or text patterns to discover acronym expansions in the field of biomedicine. Torres-Schumann and Schulz [14] further extend rule sets to hidden Markov models and improve both recall and precision. Recently, a new work [15] has given a comprehensive introduction to rule-based machine identification methods; they classify present rule-based methods (a machine algorithm and a crowd-sourcing approach) and compare them in detail. However, due to the conservative nature of rule-based models, these methods require complicated manual formulations and lack flexibility.</p>
        </sec>
        <sec id="sec-3-3-2">
          <title>2.2. LSTM-based</title>
          <p>Taking advantage of the power of LSTM [16] for text modeling, LSTM-based methods have achieved decent performance in acronym extraction. They mainly focus on better semantic representations and attention mechanisms. DECBAE [17] extracts contextualized features with BioELMo [18] and provides these features to abbreviation-specific BiLSTMs, achieving good performance. In addition, they use a simple but effective heuristic method for automatically collecting datasets from a large corpus. Li et al. [19] propose a novel topic-attention model and compare the performance of different attention mechanisms embedded in LSTM and ELMo; their model is applied to the acronym task of medical terms. To further capture the dependency between sequence labels, Veyseh et al. [20] propose to combine LSTM with CRF for acronym identification and disambiguation.</p>
        </sec>
        <sec id="sec-3-3-3">
          <title>2.3. Pre-trained-based</title>
          <p>Language models pre-trained on a large corpus have shown promising performance on many downstream tasks. One of the most popular is Bidirectional Encoder Representations from Transformers (BERT) [21], which obtains rich semantic representations through the masked language modeling task in the pre-training stage. BERT has been applied to many NLP tasks like information extraction [22] and dialogue state tracking [23].</p>
          <p>In addition, it is worth mentioning that there have been many fine-grained improvements and domain-specific variants of BERT. RoBERTa [12] optimizes the training strategy with BPE (Byte-Pair Encoding) and dynamic masking to increase the shared vocabulary, thus providing more fine-grained representations and stronger robustness. SciBERT [24] has the same structure as BERT, but is pre-trained specifically to process scientific documents. Many works utilize the power of pre-trained models for acronym extraction. Pan et al. [25] propose a multi-task learning method based on BERT-CRF and BERT-Span, which makes full use of these two separate models by redefining the fusion loss function and achieves great performance. Li et al. [26] utilize SentencePiece byte-pair encoding to relabel sentences, which are then fed into XLNet [27] for processing.</p>
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>Methodology</title>
        <p>similar with  .</p>
        <sec id="sec-3-4-1">
          <title>3.2. Overview</title>
          <p>Given a text X = {x_1, x_2, ..., x_n}, where each x_i is a word and n represents the text length, acronym extraction aims to find all acronyms and long-forms mentioned in this text. Formally, the model needs to automatically extract the acronym mention set 𝒜 = {[s_1, e_1), [s_2, e_2), ..., [s_m, e_m)}, where s_i and e_i denote the start and end positions of the i-th acronym, respectively. In addition, the model also needs to extract the long-form mention set ℬ = {[s_1, e_1), [s_2, e_2), ..., [s_k, e_k)}, defined similarly to 𝒜.</p>
          <p>We describe our hybrid strategies to extract acronyms and long-forms in this section. First, we use pre-trained models to tokenize and encode the original sentence. Then, we employ a BiLSTM-CRF head to model acronym extraction as a sequence labeling task and a BiLSTM-Span head to model it as a span selection task. In addition, to improve the robustness and generalization of our models, we apply adversarial training techniques.</p>
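          <p>To make the task format concrete, the following minimal illustration (ours, not from the shared task toolkit) shows the mention sets for the running example “DL stands for Deep Learning”, assuming 0-based character offsets and half-open [s, e) intervals.</p>
          <preformat>
# Hypothetical illustration of the task format: half-open character spans
# [s, e) for acronyms and long-forms, assuming 0-based offsets.
text = "DL stands for Deep Learning"

acronym_mentions = [(0, 2)]      # "DL"
long_form_mentions = [(14, 27)]  # "Deep Learning"

for start, end in acronym_mentions + long_form_mentions:
    print(repr(text[start:end]))
          </preformat>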
        </sec>
        <sec id="sec-3-4-2">
          <title>3.3. BERT Encoder</title>
          <p>We adopt BERT or RoBERTa as the text encoder to capture rich contextualized word embeddings. For brevity, we use BERT to refer to both BERT and RoBERTa in the following. Given the input X = {x_1, x_2, ..., x_n}, BERT captures a contextualized representation for each token with the help of deep multi-head attention layers. The encoding process is as follows:</p>
          <p>H = BERT([x_1, x_2, ..., x_n]) = [h_1, h_2, ..., h_n], (1)</p>
          <p>where H ∈ ℝ^{n×d} and d denotes the hidden dimension.</p>
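          <p>As a concrete illustration of Eq. (1), the following minimal sketch obtains contextualized token representations H from a pre-trained encoder through the HuggingFace Transformers library; roberta-base is used only as an example checkpoint, and the snippet is not part of the released codebase.</p>
          <preformat>
# Sketch: obtain contextualized token representations H (Eq. 1)
# from a pre-trained encoder via HuggingFace Transformers.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
encoder = AutoModel.from_pretrained("roberta-base")

sentence = "DL stands for Deep Learning"
batch = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = encoder(**batch)
H = outputs.last_hidden_state  # shape: (1, n_tokens, d)
print(H.shape)
          </preformat>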
        </sec>
        <sec id="sec-3-4-3">
          <title>3.4. Sequence Labeling Strategy</title>
          <p>For this strategy, we first transform the character-level position labels provided by the raw datasets into token-level BIO labels as follows:</p>
          <p>• B-Acronym: Beginning of an acronym.</p>
          <p>• I-Acronym: Inside of an acronym.</p>
          <p>• B-Long: Beginning of a long-form.</p>
          <p>• I-Long: Inside of a long-form.</p>
          <p>• O: Outside of any acronym and long-form.</p>
          <p>To solve this sequence labeling problem, we adopt a BERT-BiLSTM-CRF method, and the architecture is shown in Figure 2. First, we utilize a BiLSTM network to further capture feature interactions between adjacent words:</p>
          <p>H' = BiLSTM(H), (2)</p>
          <p>where H' ∈ ℝ^{n×2d}. Then, a linear classifier transforms H' into the logits of the 5 BIO labels defined above:</p>
          <p>P = [p_0, p_1, p_2, p_3, p_4] = H' W_P, (3)</p>
          <p>where W_P ∈ ℝ^{2d×5} and P = [p_0, p_1, p_2, p_3, p_4] ∈ ℝ^{n×5} are the logits.</p>
          <p>To model the dependency between sequence labels, we adopt a Linear Chain CRF (Conditional Random Field) [28]. The probability of a tagged sequence is</p>
          <p>p(y | X) = exp(Σ_{i=1}^{n} E(y_i | x_i) + Σ_{i=1}^{n} T(y_i | y_{i-1})) / Z(X), (4)</p>
          <p>where y = [y_1, y_2, ..., y_n] is the ground-truth label sequence and y_i is the label for the i-th token, E(·) represents the emission scorer, which refers to the logits P above, T(·) represents the transition scorer, and Z(X) is the normalization term. The loss function is the negative log likelihood of the ground-truth sequence:</p>
          <p>ℒ_SL = -log p(y | X). (5)</p>
          <p>[Figure 2: architecture of the sequence labeling strategy. The input “DL stands for Deep Learning” is encoded by BERT and a BiLSTM, passed through a linear layer and a CRF tagger, and tagged as B-Acronym, O, O, B-Long, I-Long; the extracted acronym is “DL” and the long-form is “Deep Learning”.]</p>
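          <p>A minimal PyTorch sketch of this head follows, assuming the BERT outputs H from Eq. (1) and the third-party pytorch-crf package for the CRF layer; the class name and default sizes are illustrative, not the released implementation.</p>
          <preformat>
# Sketch of the sequence labeling head (Eqs. 2-5): BiLSTM over BERT outputs,
# a linear emission layer for the 5 BIO tags, and a CRF.
# Assumes the third-party pytorch-crf package (torchcrf).
import torch.nn as nn
from torchcrf import CRF

NUM_TAGS = 5  # B-Acronym, I-Acronym, B-Long, I-Long, O

class SequenceLabelingHead(nn.Module):
    def __init__(self, hidden_dim=768, lstm_dim=256):
        super().__init__()
        self.bilstm = nn.LSTM(hidden_dim, lstm_dim, num_layers=1,
                              batch_first=True, bidirectional=True)
        self.emission = nn.Linear(2 * lstm_dim, NUM_TAGS)  # Eq. 3
        self.crf = CRF(NUM_TAGS, batch_first=True)         # Eq. 4

    def forward(self, H, tags=None, mask=None):
        H_prime, _ = self.bilstm(H)      # Eq. 2
        logits = self.emission(H_prime)  # emission scores P
        if tags is not None:
            # Training: negative log likelihood of the gold tag sequence (Eq. 5).
            return -self.crf(logits, tags, mask=mask, reduction="mean")
        # Inference: best-scoring tag sequence per sentence.
        return self.crf.decode(logits, mask=mask)
          </preformat>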
        </sec>
        <sec id="sec-3-4-4">
          <title>3.5. Span Selection Strategy</title>
          <p>We also formulate acronym extraction as an extractive span selection task, aiming to find the text spans of acronyms and long-forms directly. Similar to the sequence labeling strategy, we transform the character-level labels [s, e) provided by the raw datasets to token-level labels for the following token classification.</p>
          <p>We adopt the same BERT encoder and BiLSTM network as above to get contextualized word representations H' ∈ ℝ^{n×2d}. Then we construct four binary taggers:</p>
          <p>• S-Acronym Tagger predicts whether a token is the start of an acronym.</p>
          <p>• E-Acronym Tagger predicts whether a token is the end of an acronym.</p>
          <p>• S-Long Tagger predicts whether a token is the start of a long-form.</p>
          <p>• E-Long Tagger predicts whether a token is the end of a long-form.</p>
          <p>[Figure: architecture of the span selection strategy. The input “DL stands for Deep Learning” is encoded by BERT and a BiLSTM, and the four binary taggers (S-Acronym, E-Acronym, S-Long, E-Long) predict the start and end tokens of acronyms and long-forms.]</p>
          <p>We apply a simple linear layer to represent these taggers, which works as follows:</p>
          <p>Q = [q_0, q_1, q_2, q_3] = H' W_Q, (6)</p>
          <p>where W_Q ∈ ℝ^{2d×4} and Q = [q_0, q_1, q_2, q_3] ∈ ℝ^{n×4} are the logits for the 4 classes declared above. The loss function is binary cross entropy:</p>
          <p>ℒ_SS = -Σ_{i=1}^{n} Σ_{c} [y_{i,c} log σ(q_{i,c}) + (1 - y_{i,c}) log(1 - σ(q_{i,c}))], (7)</p>
          <p>where y_{i,c} is the label for the i-th token regarding class c, q_{i,c} is the logit for the i-th token regarding class c, and σ(·) denotes the sigmoid function.</p>
          <p>For the inference, we first predict the class label of
each token. Then, we match each S-Acronym token with
the nearest E-Acronym token to get an acronym. The
operation for long-form is the same.</p>
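          <p>The sketch below illustrates the binary taggers of Eqs. (6)-(7) and the nearest-end matching rule described above; it assumes the BiLSTM outputs H' and, for matching, that a span's end token is at or after its start token. Names and defaults are illustrative.</p>
          <preformat>
# Sketch of the span selection head (Eqs. 6-7) and the nearest-end
# matching rule used at inference time for acronyms.
import torch
import torch.nn as nn

class SpanSelectionHead(nn.Module):
    # Four taggers: S-Acronym, E-Acronym, S-Long, E-Long.
    def __init__(self, lstm_dim=256, num_taggers=4):
        super().__init__()
        self.taggers = nn.Linear(2 * lstm_dim, num_taggers)  # Eq. 6
        self.loss_fn = nn.BCEWithLogitsLoss()                # Eq. 7

    def forward(self, H_prime, labels=None):
        logits = self.taggers(H_prime)  # shape: (batch, n_tokens, 4)
        if labels is not None:
            return self.loss_fn(logits, labels.float())
        return torch.sigmoid(logits)

def match_acronym_spans(start_tokens, end_tokens):
    """Pair each predicted S-Acronym token with the nearest E-Acronym token."""
    spans = []
    for s in start_tokens:
        candidates = [e for e in end_tokens if e >= s]
        if candidates:
            spans.append((s, min(candidates) + 1))  # half-open token span
    return spans
          </preformat>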
        </sec>
        <sec id="sec-3-4-5">
          <title>3.6. Adversarial Training</title>
          <p>To enhance the robustness and generalization ability of our models, we adopt adversarial training. Specifically, given an input X, we incorporate a posterior differential regularization mechanism [29]:</p>
          <p>ℒ_adv = max_{‖δ‖ ≤ ε} Σ Div(f(X) || f(X + δ)), (8)</p>
          <p>where Div is an f-divergence (we use the Jensen-Shannon divergence in our experiments), δ is the noise, ε is the noise norm, and f represents the prediction function in our models, i.e., the CRF tagger or the binary taggers. This loss regularizes the posterior difference between original and noisy inputs to avoid overfitting. In practice, we use an inner loop to search for the most adversarial direction.</p>
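          <p>A simplified sketch of this regularizer is given below; it performs a single inner ascent step on an element-wise bounded embedding perturbation and uses the Jensen-Shannon divergence. The callable model_forward is hypothetical and stands in for the prediction function f of our models.</p>
          <preformat>
# Minimal sketch of the adversarial regularizer (Eq. 8): perturb the input
# embeddings with noise found by one inner ascent step, and penalize the
# Jensen-Shannon divergence between clean and perturbed predictions.
# model_forward is a hypothetical callable mapping embeddings to class log-probs.
import torch
import torch.nn.functional as F

def js_divergence(p_logp, q_logp):
    # Jensen-Shannon divergence between two log-probability tensors.
    m = 0.5 * (p_logp.exp() + q_logp.exp())
    return 0.5 * (F.kl_div(m.log(), p_logp.exp(), reduction="batchmean") +
                  F.kl_div(m.log(), q_logp.exp(), reduction="batchmean"))

def adversarial_loss(model_forward, embeddings, epsilon=1e-3, step_size=1e-3):
    clean_logp = model_forward(embeddings).detach()
    noise = torch.zeros_like(embeddings).uniform_(-epsilon, epsilon).requires_grad_(True)
    # Inner loop (a single step here) searching for the most adversarial direction.
    adv_logp = model_forward(embeddings + noise)
    div = js_divergence(clean_logp, adv_logp)
    grad = torch.autograd.grad(div, noise)[0]
    noise = (noise + step_size * grad.sign()).clamp(-epsilon, epsilon).detach()
    # Final regularization term with the fixed perturbation.
    return js_divergence(clean_logp, model_forward(embeddings + noise))
          </preformat>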
        </sec>
        <sec id="sec-3-4-6">
          <title>3.7. Objective Function</title>
          <p>We jointly train our models with adversarial training. For the sequence labeling strategy:</p>
          <p>ℒ = ℒ_SL + α ℒ_adv. (9)</p>
          <p>For the span selection strategy:</p>
          <p>ℒ = ℒ_SS + α ℒ_adv. (10)</p>
          <p>The weight α is used for controlling the significance of adversarial training.</p>
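          <p>In code, the joint objective is simply the task loss plus the weighted adversarial term; the α values mentioned below in Section 4 are 0.1 for sequence labeling and 1.0 for span selection. The helper is a trivial illustration, not part of the released code.</p>
          <preformat>
# Joint objective (Eqs. 9-10): task loss plus weighted adversarial regularizer.
# alpha is 0.1 for sequence labeling and 1.0 for span selection in our settings.
def total_loss(task_loss, adv_loss, alpha):
    return task_loss + alpha * adv_loss
          </preformat>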
        </sec>
      </sec>
      <sec id="sec-3-5">
        <title>4. Experiments</title>
        <sec id="sec-3-5-1">
          <title>4.1. Datasets</title>
          <p>Our experiments are conducted on the official dataset of SDU@AAAI-22 - Shared Task 1: Acronym Extraction. The organizers provide data of the scientific domain, including English, Persian, and Vietnamese, and of the legal domain, including English, French, Spanish, and Danish. Table 3 summarizes the statistics of the datasets used in our experiments.</p>
          <p>[Table 3: dataset statistics, reported as Training / Development / Test examples. English Scientific: 3980 / 497 / 498; Persian: 1336 / 167 / 168; Vietnamese: 1274 / 159 / 160; English Legal: 3564 / 445 / 446; French: 7783 / 973 / 973; Spanish: 5928 / 741 / 741; Danish: 3082 / 385 / 386.]</p>
        </sec>
        <sec id="sec-3-5-1b">
          <title>4.2. Baselines</title>
          <p>To investigate the effectiveness of our proposed approach, we compare it with the following three baselines:</p>
          <p>• Rule-based: This method utilizes a manually designed pattern to extract acronyms and is provided by SDU@AAAI-22 (https://github.com/amirveyseh/AAAI-22-SDU-shared-task-1-AE).</p>
          <p>• BERT-based: This method employs BERT [21] as a text encoder to get contextualized word representations, then employs a classification head to tag each word.</p>
          <p>• RoBERTa-based: This method is the same as the BERT-based baseline, but uses RoBERTa [12] as the text encoder.</p>
          <p>For the baselines, we select pre-trained models trained with the corresponding language corpora in HuggingFace Transformers [30]. As for ours, we adopt the best pre-trained models according to their performance in the development set. Specifically, we adopt roberta-base (https://huggingface.co/roberta-base) for English, roberta-fa-zwnj-base-ner (https://huggingface.co/HooshvareLab/roberta-fa-zwnj-base-ner) for Persian, bert-base-vi-cased (https://huggingface.co/Geotrend/bert-base-vi-cased) for Vietnamese, bert-base-fr-cased (https://huggingface.co/Geotrend/bert-base-fr-cased) for French, bert-base-es-cased (https://huggingface.co/Geotrend/bert-base-es-cased) for Spanish, and danish-bert-botxo-ner-dane (https://huggingface.co/Maltehb/danish-bert-botxo-ner-dane) for Danish.</p>
          <p>We tune the hyper-parameters according to the performance in the development set. For the sequence labeling strategy, the batch size, number of LSTM layers, LSTM hidden size, and adversarial training weight are 8, 1, 256, and 0.1, respectively. For the span selection strategy, they are 16, 1, 256, and 1.0. We run all experiments using PyTorch 1.9.1 on an Nvidia GeForce RTX 3090 GPU and an Intel(R) Xeon(R) Platinum 8260L CPU on Ubuntu 18.04.4 LTS. Our code will be released soon.</p>
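          <p>For reference, a small helper (ours, purely illustrative) that loads the per-language checkpoints listed above through HuggingFace Transformers could look as follows.</p>
          <preformat>
# Sketch: pre-trained encoders used per language, loaded through
# HuggingFace Transformers. The mapping mirrors the checkpoints listed above.
from transformers import AutoModel, AutoTokenizer

ENCODERS = {
    "english": "roberta-base",
    "persian": "HooshvareLab/roberta-fa-zwnj-base-ner",
    "vietnamese": "Geotrend/bert-base-vi-cased",
    "french": "Geotrend/bert-base-fr-cased",
    "spanish": "Geotrend/bert-base-es-cased",
    "danish": "Maltehb/danish-bert-botxo-ner-dane",
}

def load_encoder(language):
    name = ENCODERS[language]
    return AutoTokenizer.from_pretrained(name), AutoModel.from_pretrained(name)
          </preformat>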
        </sec>
        <sec id="sec-3-5-2">
          <title>4.4. Results</title>
          <sec id="sec-3-5-2-1">
            <title>4.4.1. Scientific Domain</title>
            <p>The comparison between the proposed model and the baseline models is shown in Table 1 [Table 1: performance comparison of Rule, BERT, RoBERTa, Ours-SL, and Ours-SS on the test sets of the scientific domain; ⋆ indicates the score of our model]. The main observations can be summarized as follows:</p>
            <p>• Compared with manually designed rule-based methods, pre-trained model-based methods have huge advantages because they can capture reasonable word representations.</p>
            <p>• The difference between the BERT model and the RoBERTa model is remarkable. We conjecture this is because the datasets are small; thus, the results depend more on the power of the pre-trained model.</p>
            <p>• Our two strategies get similar results and outperform all baseline methods. We submit the better one for testing.</p>
          </sec>
          <sec id="sec-3-5-2-2">
            <title>4.4.2. Legal Domain</title>
            <p>The comparison is shown in Table 4. The observations are similar to those in the scientific domain, and our method stably outperforms all baseline models. Table 5 shows the top 4 scores on the test sets; our method gets decent performance and ranks 2nd in English and French, and 3rd in Spanish and Danish.</p>
          </sec>
        </sec>
        <sec id="sec-3-5-3">
          <title>4.5. Ablation Study</title>
          <p>To further prove the effectiveness of each component, we run ablation studies on the development set of English Scientific, as shown in Table 6. [Table 6: ablation studies on the development set of English Scientific (P / R / F1). Ours-SS: 0.86 / 0.89 / 0.87; Ours-SS w/o AT: 0.86 / 0.87 / 0.86.] We find that: (1) for our sequence labeling strategy, CRF is necessary because it helps capture the dependency between sequence labels; (2) adversarial training is beneficial to both strategies by adding reasonable noise, which improves our models' robustness and generalization performance.</p>
        </sec>
      </sec>
      <sec id="sec-3-6">
        <title>5. Conclusion</title>
        <p>In this paper, we explore and propose two strategies
with adversarial training for SDU@AAAI-22 - Shared
Task 1: Acronym Extraction. Experiments show that
our methods outperform strong baseline methods in all
7 datasets. In addition, our score ranks high in the test
sets. For future work, we will try to solve the problem of
class imbalance in both strategies.</p>
      </sec>
      <sec id="sec-3-7">
        <title>6. Acknowledgments</title>
        <p>This research was supported in part by the National
Key Research and Development Program of China (No.
2020YFB1708200) and the Shenzhen Key Laboratory of
Marine IntelliSense and Computation under Contract
ZDSYS20200811142605016.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C. F.</given-names>
            <surname>Ackermann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. E.</given-names>
            <surname>Beller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Boxwell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. G.</given-names>
            <surname>Katz</surname>
          </string-name>
          ,
          <string-name>
            <surname>K. M. Summers</surname>
          </string-name>
          ,
          <article-title>Resolution of acronyms in question answering systems</article-title>
          ,
          <year>2020</year>
          . US Patent
          <volume>10</volume>
          ,
          <issue>572</issue>
          ,
          <fpage>597</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Head</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sidhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Weld</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Hearst</surname>
          </string-name>
          ,
          <article-title>Document-level definition detection in scholarly documents: Existing models, error analyses, and future directions</article-title>
          ,
          <source>in: Proceedings of SDP@EMNLP</source>
          <year>2020</year>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          ,
          <year>2020</year>
          , pp.
          <fpage>196</fpage>
          -
          <lpage>206</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Relational facts extraction with splitting mechanism</article-title>
          ,
          <source>in: 2020 IEEE International Conference on Knowledge Graph, ICKG 2020</source>
          , IEEE,
          <year>2020</year>
          , pp.
          <fpage>374</fpage>
          -
          <lpage>379</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lei</surname>
          </string-name>
          , G. Xun,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          , FAT-RE:
          <article-title>A faster dependency-free model for relation extraction</article-title>
          ,
          <source>J. Web Semant</source>
          .
          <volume>65</volume>
          (
          <year>2020</year>
          )
          <fpage>100598</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A. P. B.</given-names>
            <surname>Veyseh</surname>
          </string-name>
          , N. Meister, S. Yoon, R. Jain, F. Dernoncourt, T. H. Nguyen,
          <article-title>MACRONYM: A Large-Scale Dataset for Multilingual and Multi-Domain Acronym Extraction</article-title>
          , in: arXiv,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A. P. B.</given-names>
            <surname>Veyseh</surname>
          </string-name>
          , N. Meister, S. Yoon, R. Jain, F. Dernoncourt, T. H. Nguyen,
          <source>Multilingual Acronym Extraction and Disambiguation Shared Tasks at SDU</source>
          <year>2022</year>
          ,
          <source>in: Proceedings of SDU@AAAI-22</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>N.</given-names>
            <surname>Okazaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ananiadou</surname>
          </string-name>
          ,
          <article-title>Building an abbreviation dictionary using a term recognition approach</article-title>
          ,
          <source>Bioinform</source>
          .
          <volume>22</volume>
          (
          <year>2006</year>
          )
          <fpage>3089</fpage>
          -
          <lpage>3095</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C.</given-names>
            <surname>Kuo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. H. T.</given-names>
            <surname>Ling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hsu</surname>
          </string-name>
          ,
          <article-title>BIOADI: a machine learning approach to identifying abbreviations and definitions in biological literature</article-title>
          ,
          <source>BMC Bioinform</source>
          .
          <volume>10</volume>
          (
          <year>2009</year>
          )
          <fpage>7</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhong</surname>
          </string-name>
          , G. Zeng,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          , AT-BERT:
          <article-title>adversarial training BERT for acronym identification winning solution for sdu@aaai-21</article-title>
          , CoRR abs/2101.03700 (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>N.</given-names>
            <surname>Egan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bohannon</surname>
          </string-name>
          ,
          <article-title>Primer ai's systems for acronym identification and disambiguation</article-title>
          ,
          <source>in: Proceedings of the SDU@AAAI</source>
          <year>2021</year>
          , volume
          <volume>2831</volume>
          <source>of CEUR Workshop Proceedings</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>in: Proceedings of the NAACL-HLT</source>
          <year>2019</year>
          ,
          Volume 1 (Long and Short Papers), Association for Computational Linguistics, pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, CoRR abs/1907.11692 (2019).</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] A. S. Schwartz, M. A. Hearst, A simple algorithm for identifying abbreviation definitions in biomedical text, Pacific Symposium on Biocomputing 4 (2003) 451–462.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] E. Torres-Schumann, K. U. Schulz, Stable methods for recognizing acronym-expansion pairs: from rule sets to hidden Markov models, Int. J. Document Anal. Recognit. 8 (2006).</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] C. G. Harris, P. Srinivasan, My word! Machine versus human computation methods for identifying and resolving acronyms, Computación y Sistemas 23 (2019).</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Comput. 9 (1997) 1735–1780.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] Q. Jin, J. Liu, X. Lu, Deep contextualized biomedical abbreviation expansion, in: Proceedings of BioNLP@ACL 2019, Association for Computational Linguistics, 2019, pp. 88–96.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] Q. Jin, B. Dhingra, W. W. Cohen, X. Lu, Probing biomedical embeddings from language models, CoRR abs/1904.02181 (2019).</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] I. Li, M. Yasunaga, M. Y. Nuzumlali, C. Caraballo, S. Mahajan, H. M. Krumholz, D. R. Radev, A neural topic-attention model for medical term abbreviation disambiguation, CoRR abs/1910.14076 (2019).</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] A. P. B. Veyseh, F. Dernoncourt, Q. H. Tran, T. H. Nguyen, What does this acronym mean? Introducing a new dataset for acronym identification and disambiguation, arXiv preprint arXiv:2010.14678 (2020).</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] Z. Wei, J. Su, Y. Wang, Y. Tian, Y. Chang, A novel cascade binary tagging framework for relational triple extraction, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Association for Computational Linguistics, 2020, pp. 1476–1488.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] S. Kim, S. Yang, G. Kim, S. Lee, Efficient dialogue state tracking by selectively overwriting memory, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Association for Computational Linguistics, 2020, pp. 567–582.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] I. Beltagy, K. Lo, A. Cohan, SciBERT: A pretrained language model for scientific text, in: Proceedings of EMNLP-IJCNLP 2019, Association for Computational Linguistics, 2019, pp. 3613–3618.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] C. Pan, B. Song, S. Wang, Z. Luo, BERT-based acronym disambiguation with multiple training strategies, in: Proceedings of SDU@AAAI 2021, volume 2831 of CEUR Workshop Proceedings, 2021.</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[26] F. Li, Z. Mai, W. Zou, W. Ou, X. Qin, Y. Lin, W. Zhang, Systems at SDU-2021 Task 1: Transformers for sentence level sequence label, in: Proceedings of SDU@AAAI 2021, volume 2831 of CEUR Workshop Proceedings, 2021.</mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>[27] Z. Yang, Z. Dai, Y. Yang, J. G. Carbonell, R. Salakhutdinov, Q. V. Le, XLNet: Generalized autoregressive pretraining for language understanding, in: Advances in Neural Information Processing Systems 32, NeurIPS 2019, 2019, pp. 5754–5764.</mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>[28] C. Sutton, A. McCallum, An introduction to conditional random fields, Found. Trends Mach. Learn. 4 (2012) 267–373.</mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>[29] H. Cheng, X. Liu, L. Pereira, Y. Yu, J. Gao, Posterior differential regularization with f-divergence for improving model robustness, in: Proceedings of NAACL-HLT 2021, Association for Computational Linguistics, 2021, pp. 1078–1089.</mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>[30] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, A. M. Rush, Transformers: State-of-the-art natural language processing, in: Proceedings of EMNLP 2020 - Demos, Association for Computational Linguistics, Online, 2020, pp. 38–45.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>