Acronym Extraction with Hybrid Strategies

Siheng Li1,†, Cheng Yang1,†, Tian Liang1,†, Xinyu Zhu1, Chengze Yu1 and Yujiu Yang1,∗

1 Tsinghua Shenzhen International Graduate School, Tsinghua University
∗ Corresponding author. † These authors contributed equally.
lisiheng21@mails.tsinghua.edu.cn (S. Li); yangc21@mails.tsinghua.edu.cn (C. Yang); liangt21@mails.tsinghua.edu.cn (T. Liang); zhuxy21@mails.tsinghua.edu.cn (X. Zhu); ycz21@mails.tsinghua.edu.cn (C. Yu); yang.yujiu@sz.tsinghua.edu.cn (Y. Yang)
The second workshop on Scientific Document Understanding at AAAI 2022.

Abstract
Acronym extraction plays an important role in scientific document understanding. Recently, the AAAI-22 Workshop on Scientific Document Understanding released multiple high-quality datasets and attracted widespread attention. In this work, we present our hybrid strategies with adversarial training for this task. Specifically, we first apply pre-trained models to obtain contextualized text encodings. Then, on the one hand, we employ a sequence labeling strategy with BiLSTM and CRF to tag each word in a sentence. On the other hand, we use a span selection strategy that directly predicts the acronym and long-form spans. In addition, we adopt adversarial training to further improve the robustness and generalization ability of our models. Experimental results show that both methods outperform strong baselines and rank high on SDU@AAAI-22 Shared Task 1: Acronym Extraction; our scores rank 2nd on 4 test sets and 3rd on 3 test sets. Moreover, an ablation study verifies the effectiveness of each component. Our code is available at https://github.com/carlyoung1999/AAAI-SDU-Task1.

Keywords
Acronym Extraction, Natural Language Processing, BERT

1. Introduction

An acronym consists of the initial letters of the corresponding terminology and is widely used in scientific documents for its convenience. However, this also makes scientific documents harder to understand for both humans and machines. In natural language processing, accurate acronym extraction benefits downstream applications such as question answering [1], definition extraction [2] and relation extraction [3, 4]. Recently, SDU@AAAI-22 released multiple datasets [5] for scientific document understanding, and we focus on the task of acronym extraction [6], which aims to extract acronyms and their corresponding explanations (long-forms); a toy example is shown in Figure 1.

Figure 1: An example of Acronym Extraction.
Input: Existing methods for learning with noisy labels (LNL) primarily take a loss correction approach.
Output: Acronym: LNL; Long-form: learning with noisy labels.

Traditional approaches rely on rule-based patterns [7] or manual features [8], which are labor-intensive and time-consuming. Recently, deep learning based methods [9, 10] are preferred for their better performance and end-to-end learning.

In this paper, we propose two strategies for acronym extraction: a sequence labeling strategy and a span selection strategy. Specifically, we first use pre-trained language models such as BERT [11] or RoBERTa [12] to obtain contextualized word representations.
Then, we utilize a BiLSTM to further capture feature interactions between adjacent words, and employ a CRF to model the dependency between sequence labels for the sequence labeling strategy. For the span selection strategy, we use binary taggers that predict the start and end indices of acronyms and long-forms. To further improve our models' robustness and generalization ability, we employ adversarial training, which dynamically adds noise to avoid overfitting. The two strategies achieve comparable performance, and we choose the better one for evaluation according to its performance on the development set. Our contributions are as follows:

• We propose two strategies for acronym extraction, sequence labeling and span selection.
• Our adversarial training further improves the robustness and generalization ability of our models.
• Experiments show that our models outperform strong baselines and rank high in SDU@AAAI-22 Shared Task 1: Acronym Extraction.

2. Related Works

In this section, we introduce related studies on acronym extraction, including rule-based, LSTM-based, and pre-trained-model-based methods.

2.1. Rule-based

Traditional acronym extraction mainly relies on rule-based methods. Most of them [13] utilize generic rules or text patterns to discover acronym expansions in the field of biomedicine. Torres-Schumann and Schulz [14] further extend rule sets to hidden Markov models and improve both recall and precision. Recently, a comprehensive survey of rule-based machine identification methods [15] classifies existing rule-based models, analyzes two separate approaches (a machine algorithm and a crowd-sourcing approach), and compares them in detail. However, due to the conservative nature of rule-based models, this line of work requires complicated manual formulations and lacks flexibility.

2.2. LSTM-based

Taking advantage of LSTM [16]'s power for text modeling, LSTM-based methods have achieved decent performance in acronym extraction. They mainly focus on better semantic representations and attention mechanisms. DECBAE [17] extracts contextualized features with BioELMo [18] and feeds them into abbreviation-specific BiLSTMs, achieving good performance; it also uses a simple but effective heuristic to automatically collect datasets from a large corpus. Li et al. [19] propose a novel topic-attention model and compare different attention mechanisms embedded in LSTM and ELMo; their model is applied to acronyms of medical terms. To further capture the dependency between sequence labels, Veyseh et al. [20] propose combining LSTM with CRF for acronym identification and disambiguation.

2.3. Pre-trained-based

Language models pre-trained on large corpora have shown promising performance on many downstream tasks. One of the most popular is Bidirectional Encoder Representations from Transformers (BERT) [21], which obtains rich semantic representations through the masked language modeling task in the pre-training stage. BERT has been applied to many NLP tasks such as information extraction [22] and dialogue state tracking [23]. In addition, many fine-grained improvements or domain-specific variants of BERT have appeared. RoBERTa [12] optimizes the training strategy with BPE (Byte-Pair Encoding) and dynamic masking to increase the shared vocabulary, thus providing more fine-grained representations and stronger robustness. SciBERT [24] has the same structure as BERT but is pre-trained specifically on scientific documents. Many works exploit pre-trained models for acronym extraction. Pan et al. [25] propose a multi-task learning method based on BERT-CRF and BERT-Span, which makes full use of these two separate models by redefining the fusion loss function and achieves strong performance. Li et al. [26] utilize SentencePiece byte-pair encoding to relabel sentences, which are then fed into XLNet [27] for processing.

3. Methodology

3.1. Task Formulation

Given a text X = {x_1, x_2, ..., x_l}, where each x_i is a word and l is the text length, acronym extraction aims to find all acronyms and long-forms mentioned in the text. Formally, the model needs to automatically extract the acronym mention set 𝒜 = {[s_1, e_1), [s_2, e_2), ..., [s_n, e_n)}, where s_i and e_i denote the start and end positions of the i-th acronym, respectively. In addition, the model also needs to extract the long-form mention set ℬ = {[s_1, e_1), [s_2, e_2), ..., [s_m, e_m)}, defined analogously to 𝒜.
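To make the formulation concrete, the following is a minimal sketch that represents one annotated sentence as the two span sets 𝒜 and ℬ over the tokenized example from Figure 1. The data structure and helper names are our own illustrative assumptions, not part of the shared-task format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Span = Tuple[int, int]  # token-level [start, end) interval

@dataclass
class AcronymAnnotation:
    """Illustrative container: the token list plus the acronym span set A
    and the long-form span set B from the task formulation."""
    tokens: List[str]
    acronyms: List[Span] = field(default_factory=list)
    long_forms: List[Span] = field(default_factory=list)

    def surface(self, span: Span) -> str:
        s, e = span
        return " ".join(self.tokens[s:e])

# The toy example from Figure 1, whitespace-tokenized for illustration.
example = AcronymAnnotation(
    tokens="Existing methods for learning with noisy labels ( LNL ) "
           "primarily take a loss correction approach .".split(),
    acronyms=[(8, 9)],      # "LNL"
    long_forms=[(3, 7)],    # "learning with noisy labels"
)

print(example.surface(example.acronyms[0]))    # LNL
print(example.surface(example.long_forms[0]))  # learning with noisy labels
```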
3.2. Overview

In this section, we describe our hybrid strategies for extracting acronyms and long-forms. First, we use pre-trained models to tokenize and encode the original sentence. Then, we employ a BiLSTM-CRF head to model acronym extraction as a sequence labeling task, and a BiLSTM-Span head to model it as a span selection task. In addition, to improve the robustness and generalization of our models, we apply adversarial training.

3.3. BERT Encoder

We adopt BERT or RoBERTa as the text encoder to capture rich contextualized word embeddings. For brevity, we use BERT to refer to both BERT and RoBERTa in the following. Given the input X = {x_1, x_2, ..., x_l}, BERT captures a contextualized representation for each token through its deep multi-head attention layers. The encoding process is:

H = \mathrm{BERT}([x_1, x_2, ..., x_l]) = [h_1, h_2, ..., h_l]^T,    (1)

where H ∈ ℝ^{l×d} and d denotes the hidden dimension.

Figure 2: The model architecture of our Sequence Labeling strategy (BERT encoder, BiLSTM, linear layer and CRF tagger, producing BIO label sequences for the input "DL stands for Deep Learning").

Figure 3: The model architecture of our Span Selection strategy (BERT encoder, BiLSTM and four binary taggers, S-Acronym, E-Acronym, S-Long and E-Long, for the input "DL stands for Deep Learning").

3.4. Sequence Labeling Strategy

For this strategy, we first transform the character-level position labels provided by the raw datasets into token-level BIO labels as follows:

• B-Acronym: beginning of an acronym.
• I-Acronym: inside of an acronym.
• B-Long: beginning of a long-form.
• I-Long: inside of a long-form.
• O: outside of any acronym and long-form.

To solve this sequence labeling problem, we adopt a BERT-BiLSTM-CRF method; the architecture is shown in Figure 2. First, we utilize a BiLSTM network to further capture feature interactions between adjacent words:

H' = \mathrm{BiLSTM}(H),    (2)

where H' ∈ ℝ^{l×2d}. Then, a linear classifier transforms H' into the logits of the 5 BIO labels defined above:

L = [L_0, L_1, L_2, L_3, L_4] = H' W_L,    (3)

where W_L ∈ ℝ^{2d×5} and L = [L_0, L_1, L_2, L_3, L_4] ∈ ℝ^{l×5} are the logits.

To model the dependency between sequence labels, we adopt a linear-chain CRF (Conditional Random Field) [28]; the probability of a tagged sequence is:

P(Y|X) = \frac{\exp\left(\sum_{i=1}^{l} \varphi(y_i|x_i) + \sum_{i=1}^{l} \psi(y_i|y_{i-1})\right)}{Z(X)},    (4)

where Y = [y_1, y_2, ..., y_l] is the ground-truth label sequence and y_i is the label of the i-th token. φ(⋅) is the emission scorer, which refers to the logits L above; ψ(⋅) is the transition scorer of the CRF and is in practice a learnable matrix; Z(X) is the normalization factor that constrains the probability to (0, 1). The loss function is the negative log-likelihood:

\mathcal{L}_{SL} = -\log(P(Y|X)).    (5)

For inference, we use the Viterbi algorithm [28] to decode the best label sequence.
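The following is a minimal PyTorch sketch of the sequence labeling head described above (Eqs. 2-5). It assumes HuggingFace Transformers for the encoder and the pytorch-crf package for the linear-chain CRF; class names and the hidden size are our own choices, not the authors' released code.

```python
import torch
import torch.nn as nn
from transformers import AutoModel
from torchcrf import CRF  # pip install pytorch-crf

BIO_LABELS = ["O", "B-Acronym", "I-Acronym", "B-Long", "I-Long"]

class BertBiLstmCrf(nn.Module):
    """BERT encoder -> BiLSTM -> linear emission scores -> linear-chain CRF."""
    def __init__(self, model_name="roberta-base", lstm_hidden=256):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        d = self.encoder.config.hidden_size
        self.bilstm = nn.LSTM(d, lstm_hidden, num_layers=1,
                              batch_first=True, bidirectional=True)  # Eq. (2)
        self.emission = nn.Linear(2 * lstm_hidden, len(BIO_LABELS))  # Eq. (3)
        self.crf = CRF(len(BIO_LABELS), batch_first=True)            # Eq. (4)

    def forward(self, input_ids, attention_mask, labels=None):
        h = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        h, _ = self.bilstm(h)
        logits = self.emission(h)
        mask = attention_mask.bool()
        if labels is not None:
            # Negative log-likelihood of the gold tag sequence, Eq. (5).
            return -self.crf(logits, labels, mask=mask, reduction="mean")
        # Viterbi decoding of the best label sequence at inference time.
        return self.crf.decode(logits, mask=mask)
```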
3.5. Span Selection Strategy

We also formulate the problem as an extractive span selection task, aiming to find the text spans of acronyms and long-forms directly. Similar to the sequence labeling strategy, we transform the character-level [start, end) labels provided by the raw datasets into token-level [start, end) labels for the following token classification. We adopt the same BERT encoder and BiLSTM network as above to obtain contextualized word representations H' ∈ ℝ^{l×2d}. Then we construct four binary taggers:

• S-Acronym Tagger predicts whether a token is the start of an acronym.
• E-Acronym Tagger predicts whether a token is the end of an acronym.
• S-Long Tagger predicts whether a token is the start of a long-form.
• E-Long Tagger predicts whether a token is the end of a long-form.

We implement these taggers with a single linear layer:

L = [L_0, L_1, L_2, L_3] = H' W_S,    (6)

where W_S ∈ ℝ^{2d×4} and L = [L_0, L_1, L_2, L_3] ∈ ℝ^{l×4} are the logits of the 4 classes declared above. The loss function is the binary cross-entropy:

\mathcal{L}_{SS} = -\sum_{i=1}^{l} \sum_{j=0}^{3} \left[ y_i^j \log\sigma(l_i^j) + (1 - y_i^j) \log(1 - \sigma(l_i^j)) \right],    (7)

where y_i^j is the label of the i-th token for class j, l_i^j is the corresponding logit, and σ(⋅) denotes the sigmoid function.

For inference, we first predict the class labels of each token. Then, we match each S-Acronym token with the nearest E-Acronym token to obtain an acronym; long-forms are obtained in the same way.
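Below is a minimal sketch of the span selection head and the nearest-end matching used at inference (Eqs. 6-7). It assumes the same hypothetical BERT-plus-BiLSTM backbone as the previous sketch, and the 0.5 threshold and the inclusive-end convention are our assumptions rather than details from the released implementation.

```python
import torch
import torch.nn as nn

class SpanSelectionHead(nn.Module):
    """Four binary taggers over BiLSTM features: S-Acronym, E-Acronym, S-Long, E-Long."""
    def __init__(self, feature_dim=512):
        super().__init__()
        self.taggers = nn.Linear(feature_dim, 4)        # Eq. (6): W_S in R^{2d x 4}
        self.bce = nn.BCEWithLogitsLoss(reduction="sum")

    def forward(self, features, targets=None):
        logits = self.taggers(features)                 # (batch, seq_len, 4)
        if targets is not None:
            return self.bce(logits, targets.float())    # Eq. (7)
        return torch.sigmoid(logits)

def match_spans(start_probs, end_probs, threshold=0.5):
    """Pair every predicted start token with the nearest end token at or after it."""
    starts = [i for i, p in enumerate(start_probs) if p >= threshold]
    ends = [i for i, p in enumerate(end_probs) if p >= threshold]
    spans = []
    for s in starts:
        candidates = [e for e in ends if e >= s]
        if candidates:
            spans.append((s, min(candidates) + 1))      # token-level [start, end)
    return spans

# Usage: acronym spans come from tagger columns 0/1, long-form spans from columns 2/3.
# probs = head(features)                                # (1, seq_len, 4)
# acronyms = match_spans(probs[0, :, 0], probs[0, :, 1])
# long_forms = match_spans(probs[0, :, 2], probs[0, :, 3])
```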
3.6. Adversarial Training

To enhance the robustness and generalization ability of our models, we adopt adversarial training. Specifically, given an input X, we incorporate a posterior regularization mechanism [29]:

\mathcal{L}_{Adv} = \max_{\|\epsilon\| \le a} \sum \mathrm{Div}\left(f_\theta(X) \,\|\, f_\theta(X + \epsilon)\right),    (8)

where Div is an f-divergence (we use the Jensen-Shannon divergence in our experiments), ε is the noise, a is the noise norm, and f_θ denotes the prediction function of our models, i.e., the CRF tagger or the binary taggers. This loss regularizes the posterior difference between the original and noisy inputs to avoid overfitting. In practice, we use an inner loop to search for the most adversarial direction.

3.7. Objective Function

We jointly train our models with adversarial training. For the sequence labeling strategy:

\mathcal{L} = \mathcal{L}_{SL} + \alpha \mathcal{L}_{Adv}.    (9)

For the span selection strategy:

\mathcal{L} = \mathcal{L}_{SS} + \alpha \mathcal{L}_{Adv}.    (10)

The weight α controls the contribution of adversarial training.
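A minimal sketch of the adversarial regularizer in Eq. (8) and the joint objective in Eqs. (9)-(10) is shown below. It assumes the perturbation is added to the input embeddings, a single-step inner loop, and that the divergence is taken over per-token output logits; the authors' implementation may differ in the number of inner steps, the norm constraint, and where the noise is injected.

```python
import torch
import torch.nn.functional as F

def js_divergence(p_logits, q_logits):
    """Jensen-Shannon divergence between two per-token label distributions."""
    p, q = F.softmax(p_logits, dim=-1), F.softmax(q_logits, dim=-1)
    m = 0.5 * (p + q)
    return 0.5 * (F.kl_div(m.log(), p, reduction="batchmean")
                  + F.kl_div(m.log(), q, reduction="batchmean"))

def adversarial_loss(model_forward, embeddings, noise_norm=1e-3, inner_steps=1):
    """Eq. (8): search a small perturbation epsilon that maximizes the posterior
    divergence, then return that divergence as a regularizer.
    `model_forward` is a hypothetical callable mapping embeddings to logits."""
    clean_logits = model_forward(embeddings).detach()
    noise = torch.zeros_like(embeddings).normal_(0, 1e-5).requires_grad_(True)
    for _ in range(inner_steps):
        adv_logits = model_forward(embeddings + noise)
        div = js_divergence(clean_logits, adv_logits)
        grad, = torch.autograd.grad(div, noise)
        # Step in the most adversarial direction, then project onto the norm ball.
        noise = (noise + noise_norm * F.normalize(grad, dim=-1)).detach()
        noise = noise.clamp(-noise_norm, noise_norm).requires_grad_(True)
    return js_divergence(clean_logits, model_forward(embeddings + noise))

# Joint objective, Eqs. (9)-(10):
# loss = task_loss + alpha * adversarial_loss(forward_fn, input_embeddings)
```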
We run all experiments using To investigate the effectiveness of our proposed approach, PyTorch 1.9.1 on the Nvidia GeForce RTX 3090 GPU, In- we compare it with the following three baselines: tel(R) Xeon(R) Platinum 8260L CPU on Ubuntu 18.04.4 • Rule-based This method utilizes a manually de- LTS OS. Our code will be released soon. signed pattern to extract acronyms and is pro- vided by SDU@AAAI-22 2 . • BERT-based This method employs BERT [21] as a text encoder to get contextualized word repre- 3 https://huggingface.co/roberta-base sentation, then employs a classification head to 4 https://huggingface.co/HooshvareLab/roberta-fa-zwnj-base-ner tag each word. 5 https://huggingface.co/Geotrend/bert-base-vi-cased 6 https://huggingface.co/Geotrend/bert-base-fr-cased 7 https://huggingface.co/Geotrend/bert-base-es-cased 2 8 https://github.com/amirveyseh/AAAI-22-SDU-shared-task-1-AE https://huggingface.co/Maltehb/danish-bert-botxo-ner-dane 4.4. Results 6. Acknowledgments 4.4.1. Scientific Domain This research was supported in part by the National The comparison between the proposed model and base- Key Research and Development Program of China (No. line models is shown in Table 1. The main observations 2020YFB1708200) and the Shenzhen Key Laboratory of can be summarized as follows: Marine IntelliSense and Computation under Contract ZDSYS20200811142605016. • Compared with manually designed rule-based methods, pre-trained model-based methods have huge advantages because they can capture rea- References sonable word representations. [1] C. F. Ackermann, C. E. Beller, S. A. Boxwell, E. G. • The difference between the BERT model and Katz, K. M. Summers, Resolution of acronyms RoBERTa model is remarkable. We conjecture in question answering systems, 2020. US Patent this is due to the datasets being small; thus, the 10,572,597. results depend more on the power of the pre- [2] D. Kang, A. Head, R. Sidhu, K. Lo, D. S. Weld, M. A. trained model. Hearst, Document-level definition detection in • Our two strategies get similar results and outper- scholarly documents: Existing models, error anal- form all baseline methods. We submit the better yses, and future directions, in: Proceedings of one for testing. SDP@EMNLP 2020, Association for Computational Table 2 shows the top 4 scores in the test sets of the Linguistics, 2020, pp. 196–206. scientific domain; our method gets decent performance [3] Y. Shi, Y. Yang, Relational facts extraction with and ranks 2st in English and Persian, 3st in Vietnamese. splitting mechanism, in: 2020 IEEE International Conference on Knowledge Graph, ICKG 2020„ IEEE, 2020, pp. 374–379. 4.4.2. Legal Domain [4] L. Ding, Z. Lei, G. Xun, Y. Yang, FAT-RE: A faster The comparison is shown in Table 4, the observations are dependency-free model for relation extraction, J. similar with Scientific Domain, and our method outper- Web Semant. 65 (2020) 100598. forms all baseline models stably. Table 5 shows the top [5] S. Y. R. J. F. D. T. H. N. Amir Pouran Ben Vey- 4 scores in the test sets; our method gets decent perfor- seh, Nicole Meister, MACRONYM: A Large- mance and ranks 2 in English and French, 3 in Spanish Scale Dataset for Multilingual and Multi-Domain and Danish. Acronym Extraction, in: arXiv, 2022. [6] S. Y. R. J. F. D. T. H. N. Amir Pouran Ben Veyseh, 4.5. Ablation Study Nicole Meister, Multilingual Acronym Extraction and Disambiguation Shared Tasks at SDU 2022, in: To further prove the effectiveness of each component, we Proceedings of SDU@AAAI-22, 2022. 
4.5. Ablation Study

To further verify the effectiveness of each component, we run ablation studies on the development set of English Scientific, as shown in Table 6. We find that: (1) for our sequence labeling strategy, the CRF is necessary because it helps capture the dependency between sequence labels; (2) adversarial training benefits both strategies by adding reasonable noise, which improves our models' robustness and generalization.

Table 6: Ablation studies on the development set of English Scientific.

Method            P     R     F1
Ours-SL           0.86  0.88  0.87
Ours-SL w/o CRF   0.84  0.88  0.86
Ours-SL w/o AT    0.86  0.87  0.86
Ours-SS           0.86  0.89  0.87
Ours-SS w/o AT    0.86  0.87  0.86

5. Conclusion

In this paper, we explore and propose two strategies with adversarial training for SDU@AAAI-22 Shared Task 1: Acronym Extraction. Experiments show that our methods outperform strong baselines on all 7 datasets, and our scores rank high on the test sets. For future work, we will try to address the class imbalance problem in both strategies.

6. Acknowledgments

This research was supported in part by the National Key Research and Development Program of China (No. 2020YFB1708200) and the Shenzhen Key Laboratory of Marine IntelliSense and Computation under Contract ZDSYS20200811142605016.

References

[1] C. F. Ackermann, C. E. Beller, S. A. Boxwell, E. G. Katz, K. M. Summers, Resolution of acronyms in question answering systems, 2020. US Patent 10,572,597.
[2] D. Kang, A. Head, R. Sidhu, K. Lo, D. S. Weld, M. A. Hearst, Document-level definition detection in scholarly documents: Existing models, error analyses, and future directions, in: Proceedings of SDP@EMNLP 2020, Association for Computational Linguistics, 2020, pp. 196–206.
[3] Y. Shi, Y. Yang, Relational facts extraction with splitting mechanism, in: 2020 IEEE International Conference on Knowledge Graph, ICKG 2020, IEEE, 2020, pp. 374–379.
[4] L. Ding, Z. Lei, G. Xun, Y. Yang, FAT-RE: A faster dependency-free model for relation extraction, J. Web Semant. 65 (2020) 100598.
[5] A. P. B. Veyseh, N. Meister, S. Yoon, R. Jain, F. Dernoncourt, T. H. Nguyen, MACRONYM: A large-scale dataset for multilingual and multi-domain acronym extraction, arXiv, 2022.
[6] A. P. B. Veyseh, N. Meister, S. Yoon, R. Jain, F. Dernoncourt, T. H. Nguyen, Multilingual acronym extraction and disambiguation shared tasks at SDU 2022, in: Proceedings of SDU@AAAI-22, 2022.
[7] N. Okazaki, S. Ananiadou, Building an abbreviation dictionary using a term recognition approach, Bioinform. 22 (2006) 3089–3095.
[8] C. Kuo, M. H. T. Ling, K. Lin, C. Hsu, BIOADI: a machine learning approach to identifying abbreviations and definitions in biological literature, BMC Bioinform. 10 (2009) 7.
[9] D. Zhu, W. Lin, Y. Zhang, Q. Zhong, G. Zeng, W. Wu, J. Tang, AT-BERT: adversarial training BERT for acronym identification winning solution for SDU@AAAI-21, CoRR abs/2101.03700 (2021).
[10] N. Egan, J. Bohannon, Primer AI's systems for acronym identification and disambiguation, in: Proceedings of SDU@AAAI 2021, volume 2831 of CEUR Workshop Proceedings, 2021.
[11] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, in: Proceedings of NAACL-HLT 2019, Volume 1 (Long and Short Papers), Association for Computational Linguistics, 2019, pp. 4171–4186.
[12] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, CoRR abs/1907.11692 (2019).
[13] A. S. Schwartz, M. A. Hearst, A simple algorithm for identifying abbreviation definitions in biomedical text, Pacific Symposium on Biocomputing (2003) 451–462.
[14] E. Torres-Schumann, K. U. Schulz, Stable methods for recognizing acronym-expansion pairs: from rule sets to hidden Markov models, Int. J. Document Anal. Recognit. 8 (2006).
[15] C. G. Harris, P. Srinivasan, My word! Machine versus human computation methods for identifying and resolving acronyms, Computación y Sistemas 23 (2019).
[16] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Comput. 9 (1997) 1735–1780.
[17] Q. Jin, J. Liu, X. Lu, Deep contextualized biomedical abbreviation expansion, in: Proceedings of BioNLP@ACL 2019, Association for Computational Linguistics, 2019, pp. 88–96.
[18] Q. Jin, B. Dhingra, W. W. Cohen, X. Lu, Probing biomedical embeddings from language models, CoRR abs/1904.02181 (2019).
[19] I. Li, M. Yasunaga, M. Y. Nuzumlali, C. Caraballo, S. Mahajan, H. M. Krumholz, D. R. Radev, A neural topic-attention model for medical term abbreviation disambiguation, CoRR abs/1910.14076 (2019).
[20] A. P. B. Veyseh, F. Dernoncourt, Q. H. Tran, T. H. Nguyen, What does this acronym mean? Introducing a new dataset for acronym identification and disambiguation, arXiv preprint arXiv:2010.14678 (2020).
[21] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[22] Z. Wei, J. Su, Y. Wang, Y. Tian, Y. Chang, A novel cascade binary tagging framework for relational triple extraction, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Association for Computational Linguistics, 2020, pp. 1476–1488.
[23] S. Kim, S. Yang, G. Kim, S. Lee, Efficient dialogue state tracking by selectively overwriting memory, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Association for Computational Linguistics, 2020, pp. 567–582.
[24] I. Beltagy, K. Lo, A. Cohan, SciBERT: A pretrained language model for scientific text, in: Proceedings of EMNLP-IJCNLP 2019, Association for Computational Linguistics, 2019, pp. 3613–3618.
[25] C. Pan, B. Song, S. Wang, Z. Luo, BERT-based acronym disambiguation with multiple training strategies, in: Proceedings of SDU@AAAI 2021, volume 2831 of CEUR Workshop Proceedings, 2021.
[26] F. Li, Z. Mai, W. Zou, W. Ou, X. Qin, Y. Lin, W. Zhang, Systems at SDU-2021 task 1: Transformers for sentence level sequence label, in: Proceedings of SDU@AAAI 2021, volume 2831 of CEUR Workshop Proceedings, 2021.
[27] Z. Yang, Z. Dai, Y. Yang, J. G. Carbonell, R. Salakhutdinov, Q. V. Le, XLNet: Generalized autoregressive pretraining for language understanding, in: Advances in Neural Information Processing Systems 32, NeurIPS 2019, 2019, pp. 5754–5764.
[28] C. Sutton, A. McCallum, An introduction to conditional random fields, Found. Trends Mach. Learn. 4 (2012) 267–373.
[29] H. Cheng, X. Liu, L. Pereira, Y. Yu, J. Gao, Posterior differential regularization with f-divergence for improving model robustness, in: Proceedings of NAACL-HLT 2021, Association for Computational Linguistics, 2021, pp. 1078–1089.
[30] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, A. M. Rush, Transformers: State-of-the-art natural language processing, in: Proceedings of EMNLP 2020 - Demos, Association for Computational Linguistics, Online, 2020, pp. 38–45.