=Paper=
{{Paper
|id=Vol-2831/paper28
|storemode=property
|title=AT-BERT: Adversarial Training BERT for Acronym Identification Winning Solution for SDU@AAAI-21
|pdfUrl=https://ceur-ws.org/Vol-2831/paper28.pdf
|volume=Vol-2831
|authors=Danqing Zhu,Wangli Lin,Yang Zhang,Qiwei Zhong,Guanxiong Zeng,Weilin Wu,Jiayu Tang
|dblpUrl=https://dblp.org/rec/conf/aaai/ZhuLZZZWT21
}}
==AT-BERT: Adversarial Training BERT for Acronym Identification Winning Solution for SDU@AAAI-21==
Danqing Zhu, Wangli Lin, Yang Zhang, Qiwei Zhong, Guanxiong Zeng, Weilin Wu, Jiayu Tang
Alibaba Group, Hangzhou, China
{danqing.zdq, wangli.lwl, zy142206, yunwei.zqw, moshi.zgx, william.wwl, jiayu.tangjy}@alibaba-inc.com

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

===Abstract===
Acronym identification focuses on finding acronyms and the phrases they abbreviate, which is crucial for scientific document understanding tasks. However, the limited size of manually annotated datasets hinders further improvement on the problem. Recent breakthroughs in language models pre-trained on large corpora clearly show that unsupervised pre-training can vastly improve the performance of downstream tasks. In this paper, we present an Adversarial Training BERT method named AT-BERT, our winning solution to the acronym identification task of the Scientific Document Understanding (SDU) Challenge at AAAI 2021. Specifically, a pre-trained BERT is adopted to capture better semantic representations. We then incorporate the FGM adversarial training strategy into the fine-tuning of BERT, which makes the model more robust and generalized. Furthermore, an ensemble mechanism is devised to combine the representations learned by multiple BERT variants. Assembling all these components together, experimental results on the SciAI dataset show that our proposed approach outperforms all other competitive state-of-the-art methods.

===Introduction===
Acronyms are widely used in many technical documents to reduce duplicate references to the same concept. According to a report (Barnett and Doubleday 2020) analyzing more than 24 million article titles and 18 million article abstracts published between 1950 and 2019, at least one acronym appeared in 19% of the titles and 73% of the abstracts.
As the number of scientific papers published every year grows, the number of acronyms keeps climbing as well. However, not all acronyms are written in the standard way (i.e., taking the first letter of each word and putting the letters together in capitals); there are many other ways of writing, e.g., XGBoost is an acronym of eXtreme Gradient Boosting (Chen and Guestrin 2016). Thus, automatic identification of acronyms and discovery of their associated definitions are crucial for text understanding tasks such as question answering (Ackermann et al. 2020; Veyseh 2016), slot filling (Pouran Ben Veyseh, Dernoncourt, and Nguyen 2019) and definition extraction (Kang et al. 2020).

Several approaches have been proposed to solve the acronym identification problem over the last two decades. The majority of the prior methods are rule-based (Schwartz and Hearst 2002; Okazaki and Ananiadou 2006) or feature-based (Kuo et al. 2009; Liu, Liu, and Huang 2017), employing manually designed rules or features for acronym and long-form prediction. Because the rules and features are specially designed for finding long forms, these methods have high precision; however, they fail to capture all the diverse forms of acronym expression (Harris and Srinivasan 2019). In contrast, by taking advantage of pre-trained word embeddings and deep architectures, deep learning models such as LSTM-CRF show promising results for acronym identification (Veyseh et al. 2020b). Although these works have made great progress, some limitations still hinder further improvement, such as the limited size of manually annotated acronym data and the noise in automatically created datasets.

Motivated by the above observations, the first publicly available and largest manually annotated acronym identification dataset in the scientific domain was released (Veyseh et al. 2020b), and the Scientific Document Understanding (SDU) Challenge (Veyseh et al. 2020a) for the acronym identification task was hosted (https://sites.google.com/view/sdu-aaai21/shared-task). The task aims to identify acronyms (i.e., short forms) and their meanings (i.e., long forms) in documents; a toy example is shown in Table 1, and an illustrative encoding of this example is given after the table. In this paper, we formulate the problem as a sentence-level sequence labeling problem and design a novel BERT-based ensemble model called Adversarial Training BERT (AT-BERT). Specifically, considering that the training data is relatively small, we adopt the pre-trained BERT model as the sentence encoder, which is pre-trained on general-domain corpora and yields a significant improvement on downstream tasks with supervised fine-tuning (Beltagy, Lo, and Cohan 2019). Furthermore, we leverage FGM (Miyato, Dai, and Goodfellow 2017), an adversarial training strategy, to improve the generalization ability of the model and make it more robust to noisy data. Finally, we utilize a multi-BERT ensemble to fully exploit the representations learned by multiple BERT variants (Xu et al. 2020). Combining these respective advantages, our proposed model won first prize in SDU@AAAI-21, outperforming all the other competitive methods.

Input: Existing methods for learning with noisy labels (LNL) primarily take a loss correction approach.
Output: Existing methods for <u>learning with noisy labels</u> ('''LNL''') primarily take a loss correction approach.
Table 1: A toy example of the acronym identification task. The acronym is shown in bold font and the long form is shown with an underline.
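To make the sequence-labeling formulation concrete, the following is a minimal illustration of how the Table 1 example could be encoded with the BIO label set used by the dataset (B-short, I-short, B-long, I-long, O). The whitespace tokenization shown here is an assumption for illustration, not the dataset's actual tokenization.

<syntaxhighlight lang="python">
# Illustrative BIO encoding of the toy example in Table 1 (tokenization assumed).
tokens = ["Existing", "methods", "for", "learning", "with", "noisy", "labels",
          "(", "LNL", ")", "primarily", "take", "a", "loss", "correction",
          "approach", "."]
labels = ["O", "O", "O", "B-long", "I-long", "I-long", "I-long",
          "O", "B-short", "O", "O", "O", "O", "O", "O",
          "O", "O"]
assert len(tokens) == len(labels)  # one label per token
</syntaxhighlight>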
The main contributions are summarized as follows:
• To the best of our knowledge, this is the first work to incorporate an adversarial training strategy into a BERT-based model for the acronym identification task in the scientific domain.
• We propose a novel framework for acronym identification, including a pre-trained BERT for semantic representation, an adversarial training strategy that makes the model more robust and generalized, and a multi-BERT ensemble mechanism that achieves superior performance.
• Extensive experiments are conducted on the data offered by SDU@AAAI-21, demonstrating the effectiveness of our proposed method.

===Related Work===
In this section, we first introduce related studies on the sequence labeling problem, especially BERT-based models, and then review existing research on adversarial training.

====Sequence Labeling and BERT-based Models====
In this paper, we formulate acronym identification as a sequence labeling problem. Traditional approaches to sequence labeling are mainly rule-based or feature-based (Okazaki and Ananiadou 2006; Kuo et al. 2009). Recently, deep learning models have achieved promising results; for instance, the LSTM-CRF model (Li et al. 2020) uses LSTMs to extract contextualized representations and performs sequence optimization with a CRF. With the development of pre-trained language models, BERT-based models achieve state-of-the-art results on natural language tasks. BERT (Kenton and Toutanova 2019) is a multi-layer bidirectional Transformer encoder pre-trained on Wikipedia and BooksCorpus; it has produced state-of-the-art results on a wide variety of NLP tasks and inspired many variants. RoBERTa (Liu et al. 2019) uses BPE (Byte Pair Encoding) and dynamic masking to increase the shared vocabulary; it optimizes the training strategy of BERT and achieves better performance. ALBERT (Lan et al. 2019) uses factorized embedding parameterization and cross-layer parameter sharing to reduce the number of model parameters. ERNIE (Sun et al. 2019) proposes a new masking strategy based on phrases and entities, in which customized tasks are continuously introduced and trained through multi-task learning.

Acronym identification is more challenging than general sequence labeling problems because acronyms are diverse and ambiguous. Thus, contextualized representations are crucial, and BERT-based models with better semantic representations are more suitable for the task.

====Adversarial Training====
Adversarial training, in which a network is trained on adversarial examples, is an important way to enhance the robustness of neural networks. The Fast Gradient Sign Method (FGSM) (Goodfellow, Shlens, and Szegedy 2015) and its variant, the Fast Gradient Method (FGM) (Miyato, Dai, and Goodfellow 2017), were among the first methods proposed for adversarial training. FGSM and FGM generate adversarial examples by adding gradient-based perturbations to the input samples, using different normalization strategies, and they rely heavily on the assumption that the loss function is linear. In contrast, Projected Gradient Descent (PGD) (Madry et al. 2018) is an iterative attack with multiple steps, where each iteration projects the perturbation onto a specified range. PGD trades increased computational cost for better effectiveness, and many PGD-based methods have been proposed to improve efficiency: YOPO (Zhang et al. 2019) computes only the gradient of the first layer, while FreeAT (Shafahi et al. 2019) and FreeLB (Zhu et al. 2020) further reduce the frequency of gradient computation.

Considering that the dataset for acronym identification is relatively small and easy to overfit, we incorporate an adversarial training strategy into BERT-based models to achieve more robust and generalized performance.
===Methodology===
In this section, we present the overall architecture of our proposed method, which uses a BERT-based model to solve the sequence labeling problem and adopts an adversarial training strategy to improve the robustness of the model.

====Overview====
We propose a BERT-based classification model built on an adversarial training strategy, called Adversarial Training BERT (AT-BERT). As shown in Figure 1, a pre-trained BERT model is used for semantic feature encoding, and the downstream acronym identification task is solved by feeding its output representations into linear classifiers. In addition, because of the complexity of acronyms in scientific documents and the relatively small training dataset, the model is prone to overfitting; we therefore use FGM to add perturbations to the input samples for adversarial training, making the model more robust and generalized. Finally, to further improve accuracy, we train different BERT models, such as BERT, SciBERT, RoBERTa, ALBERT and ELECTRA, and take an average ensemble over all the models to achieve superior performance.

Figure 1: The overall architecture of the proposed AT-BERT approach.

====BERT for the Sequence Labeling Problem====
BERT (Bidirectional Encoder Representations from Transformers) is a state-of-the-art language model for NLP. It uses the encoder structure of the Transformer (Vaswani et al. 2017) for deep self-supervised learning and requires task-specific fine-tuning. The Transformer is an attention mechanism that learns contextual relations between words (or subwords) in a text. In this paper, the downstream task is a single-sentence tagging problem. We denote a sequence with T words as W = (w_1, w_2, ..., w_T), and its corresponding targets as Y = (y_1, y_2, ..., y_T). BERT trains an encoder that generates a contextualized vector representation for each token as a hidden state:

H = \mathrm{BERT}(w_1, w_2, \ldots, w_T; \theta) = (h_1, h_2, \ldots, h_T) \qquad (1)

The hidden states are then fed into a fully connected layer with a softmax unit to obtain the predicted probability distribution for each token. The model is trained with the cross-entropy loss function, defined as

L = -\sum_{i=1}^{C} \sum_{j=1}^{T} y_j^i \log s_j^i \qquad (2)

where y^i and s^i are the ground-truth and predicted probability distributions, and C is the number of categories.
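The tagging model of Equations (1)-(2) can be sketched as follows. This is a minimal illustration assuming the Hugging Face transformers and PyTorch APIs, not the authors' released implementation; sub-word alignment and padding positions are assumed to carry the label -100 so they are ignored by the loss.

<syntaxhighlight lang="python">
import torch
import torch.nn as nn
from transformers import AutoModel

# Label set from the dataset description (BIO format).
LABELS = ["O", "B-short", "I-short", "B-long", "I-long"]

class BertTagger(nn.Module):
    """Sketch of Equations (1)-(2): BERT encoder + per-token linear classifier."""
    def __init__(self, model_name="bert-base-uncased", num_labels=len(LABELS)):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)   # H = BERT(w_1..w_T; theta)
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask, labels=None):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        logits = self.classifier(self.dropout(hidden))          # (batch, T, num_labels)
        loss = None
        if labels is not None:
            # Token-level cross-entropy, Equation (2); -100 marks ignored positions.
            loss = nn.CrossEntropyLoss(ignore_index=-100)(
                logits.view(-1, logits.size(-1)), labels.view(-1))
        return loss, logits
</syntaxhighlight>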
====Adversarial Training for BERT====
Adversarial training is an important way to enhance the robustness of a model through adversarial samples. An adversarial example is an instance with small, intentional feature perturbations that induce the model to make a false prediction. During adversarial training, the input samples are first mixed with a small perturbation to generate adversarial samples (Szegedy et al. 2014). The model is then trained on both the original input samples and the generated adversarial samples to enhance its robustness and generalization. Madry et al. (2018) abstracted the general form of adversarial training as the following min-max formulation:

\min_{\theta} \mathbb{E}_{(x,y)\sim D}\Big[\max_{r_{adv}\in S} L(\theta, x + r_{adv}, y)\Big] \qquad (3)

where x represents the input representation of the sample, y is the corresponding target, r_adv is the perturbation applied to the input, S is the perturbation space, and L is a loss function such as Equation (2). First, the inner maximization problem finds the perturbation at a given data point x within the perturbation space that generates adversarial examples with high loss; this can be seen as an attack on a given neural network. Second, the outer minimization problem finds the model parameters θ that minimize the "adversarial loss" given by the inner attack problem.

With the above definition of adversarial training, we now describe how a small perturbation is applied to the input sample to generate adversarial samples in our task. There are many related adversarial training methods, such as FGSM, the single-step algorithm FGM, the multi-step algorithm PGD, and FreeLB. Since these can be regarded as one family of methods, we briefly introduce FGM, which makes a simple extension to the perturbation computation of FGSM. The main idea is to add a perturbation to the input that increases the loss, namely in the direction in which the gradient of the loss function rises. Specifically, the adversarial perturbation is defined as

g = \nabla_x L(\theta, x, y) \qquad (4)

r_{adv} = \epsilon \cdot \frac{g}{\|g\|_2} \qquad (5)

where g is the gradient of the loss with respect to x, the L2 norm is used to normalize g in Equation (5), and ε is a hyperparameter that defaults to 1. In our acronym identification task, the perturbation r_adv is added to the embedding of the input word. The overall architecture of the proposed AT-BERT is shown in Figure 1.
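Below is a minimal sketch of how the FGM perturbation of Equations (4)-(5) can be wrapped around the word-embedding weights during fine-tuning. The default epsilon of 1, the embedding parameter name, and the two-pass training loop follow the common FGM recipe and are assumptions about the implementation, not the authors' exact code.

<syntaxhighlight lang="python">
import torch

class FGM:
    """Fast Gradient Method on the word-embedding weights (Equations (4)-(5)).

    attack():  adds r_adv = epsilon * g / ||g||_2 to the embedding matrix.
    restore(): puts the original weights back after the adversarial backward pass.
    """
    def __init__(self, model, epsilon=1.0, emb_name="word_embeddings"):
        self.model, self.epsilon, self.emb_name = model, epsilon, emb_name
        self.backup = {}

    def attack(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad and self.emb_name in name and param.grad is not None:
                self.backup[name] = param.data.clone()
                norm = torch.norm(param.grad)          # ||g||_2, Equation (5)
                if norm != 0:
                    param.data.add_(self.epsilon * param.grad / norm)

    def restore(self):
        for name, param in self.model.named_parameters():
            if name in self.backup:
                param.data = self.backup[name]
        self.backup = {}

# One adversarial training step (sketch):
#   loss, _ = model(input_ids, attention_mask, labels); loss.backward()   # clean gradients
#   fgm.attack(); adv_loss, _ = model(input_ids, attention_mask, labels); adv_loss.backward()
#   fgm.restore(); optimizer.step(); optimizer.zero_grad()
</syntaxhighlight>

Applying the perturbation to the continuous embedding matrix, rather than to the discrete tokens, keeps the attack differentiable, which is why FGM-style adversarial training is a natural fit for text models.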
===Experiments===
In this section, we first introduce the experimental dataset and evaluation metrics, and then conduct comprehensive experimental studies to verify the effectiveness of our method.

====Dataset====
We evaluate all models on the dataset provided by SDU@AAAI-21. It contains a training set of 14,006 samples, a development set of 1,717 samples, and a test set of 1,750 samples, as shown in Table 2.

Data | Sample Number | Ratio
training set | 14,006 | 80.16%
development set | 1,717 | 9.82%
test set | 1,750 | 10.02%
total | 17,473 | 100%
Table 2: The statistical information of the dataset.

The task aims to identify acronyms (i.e., short forms) and their meanings (i.e., long forms) in documents. The dataset provides the boundaries of acronyms and long forms in each sentence using the BIO format (i.e., the label set includes B-short, I-short, B-long, I-long and O). The percentage of each label category over all tokens is shown in Figure 2; the distribution of label classes is clearly biased. Each sample in the training and development sets has three attributes:
• tokens: the list of words (tokens) of the sample.
• labels: the short-form and long-form labels of the words in BIO format. The labels B-short and B-long identify the beginning of a short-form and long-form phrase, respectively; the labels I-short and I-long indicate words inside a short-form or long-form phrase; finally, the label O indicates that the word is not part of any short-form or long-form phrase.
• id: the unique ID of the sample.
The test set has no labels attribute. We refer readers to (Veyseh et al. 2020b) for more details.

Figure 2: Category distribution of the training set.

====Evaluation Metrics====
Following previous work (Veyseh et al. 2020b), results are evaluated by macro-averaged precision, recall, and F1 score on the test set, computed over correct predictions of short-form (i.e., acronym) and long-form (i.e., phrase) boundaries in the sentences. A short-form or long-form boundary prediction is counted as correct if the predicted beginning and end equal the ground-truth beginning and end of the short-form or long-form boundary, respectively. The official score (denoted MacroF1) is the macro average of the short-form and long-form prediction F1 scores.
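The boundary-level scoring described above can be sketched as follows: spans are recovered from the BIO tags, a predicted span counts as correct only when both its start and end match a gold span, and the reported MacroF1 is the average of the short-form and long-form F1. This follows the textual description rather than the official scorer; gold and pred are assumed to be lists of per-sample label sequences.

<syntaxhighlight lang="python">
def bio_spans(labels, kind):
    """Return (start, end) index spans of type `kind` ('short' or 'long') from BIO labels."""
    spans, start = [], None
    for i, tag in enumerate(labels + ["O"]):            # sentinel closes a trailing span
        if tag == f"B-{kind}":
            if start is not None:
                spans.append((start, i - 1))
            start = i
        elif tag != f"I-{kind}" and start is not None:
            spans.append((start, i - 1))
            start = None
    return spans

def span_f1(gold, pred, kind):
    """F1 over exact boundary matches for one span type."""
    g = [set(bio_spans(seq, kind)) for seq in gold]
    p = [set(bio_spans(seq, kind)) for seq in pred]
    tp = sum(len(gi & pi) for gi, pi in zip(g, p))
    n_pred, n_gold = sum(len(x) for x in p), sum(len(x) for x in g)
    prec = tp / n_pred if n_pred else 0.0
    rec = tp / n_gold if n_gold else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def macro_f1(gold, pred):
    """Macro average of short-form and long-form F1 (the competition's MacroF1)."""
    return (span_f1(gold, pred, "short") + span_f1(gold, pred, "long")) / 2
</syntaxhighlight>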
====Compared Methods====
We experiment with four schemes: baselines, BERT models, Adversarial Training for BERT (AT-BERT), and model ensembles.

(a) Baselines
• Rule-based methods: these models employ manually designed rules to extract acronyms and long forms from the text. The evaluation code and results are provided by SDU@AAAI-21 (https://github.com/amirveyseh/AAAI-21-SDU-shared-task-1-AI).
• Deep learning models: as shown in previous work (Veyseh et al. 2020b), the F1 score of the LSTM-CRF model is only one percentage point higher than that of the rule-based models, so we do not re-implement LSTM-CRF ourselves. More details on these models and their hyperparameters are given in (Veyseh et al. 2020b).

(b) BERT models
• BERT: BERT (Kenton and Toutanova 2019) is a multi-layer bidirectional Transformer encoder trained with a masked language modeling (MLM) objective and a next sentence prediction task. It comes in two sizes and we experiment with both: the BERT_BASE architecture (L=12, H=768, A=12, 110M parameters in total) and the BERT_LARGE architecture (L=24, H=1024, A=16, 355M parameters in total) provided by Hugging Face (Wolf et al. 2020).
• SciBERT: SciBERT is the pre-trained model presented by Beltagy, Lo, and Cohan, which is based on BERT_BASE and trained on a large corpus of scientific text. It has achieved new state-of-the-art results on a suite of tasks in the scientific domain (Beltagy, Lo, and Cohan 2019; Zhong et al. 2021).
• RoBERTa: RoBERTa (Liu et al. 2019) improves the original implementation of BERT for better performance by using dynamic masking, removing the next sentence prediction task, and training with larger batches, on more data, and for longer. RoBERTa follows the same architecture as BERT.
• ALBERT: the ALBERT model (Lan et al. 2019) presents two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT: first, splitting the embedding matrix into two smaller matrices; second, using repeated layers shared among groups.
• ELECTRA: ELECTRA (Clark et al. 2020) proposes a more effective pre-training method. Instead of corrupting some positions of the input with [MASK], ELECTRA replaces some tokens of the input with plausible alternatives sampled from a small generator network and trains a discriminator to predict whether each token in the corrupted input was replaced by the generator or not. The pre-trained discriminator can then be fine-tuned on downstream tasks.

(c) AT-BERT models
To address the risk of overfitting and poor generalization caused by the limited training data, we use the FGM algorithm for adversarial training on the various BERT models.

(d) Model ensemble
Model ensembling is a commonly used method to improve model accuracy. We take an average ensemble of the output probability distributions of the various BERT models to obtain the final predictions, as sketched below. In general, model fusion requires that the fused models perform well individually and differ from each other, so we finally fuse four models: BERT_LARGE, RoBERTa, ALBERT, and ELECTRA (named BERT-E for short). The ensemble of the AT-BERT models, i.e., equipped with the adversarial training strategy, is denoted AT-BERT-E.
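A sketch of the average ensemble used for BERT-E and AT-BERT-E: the per-token softmax distributions of the fine-tuned taggers are averaged and the argmax is taken. For brevity it assumes the models share the forward signature of the earlier tagging sketch and a common tokenization; in practice each model would consume inputs from its own tokenizer and the token-level predictions would be aligned back to words.

<syntaxhighlight lang="python">
import torch

@torch.no_grad()
def ensemble_predict(models, input_ids, attention_mask):
    """Average the per-token probability distributions of several fine-tuned taggers
    (e.g. BERT-large, RoBERTa, ALBERT, ELECTRA) and return the argmax label ids."""
    probs = None
    for model in models:
        model.eval()
        _, logits = model(input_ids, attention_mask)     # (batch, T, num_labels)
        p = torch.softmax(logits, dim=-1)
        probs = p if probs is None else probs + p
    probs /= len(models)                                 # average ensemble
    return probs.argmax(dim=-1)                          # (batch, T) label ids
</syntaxhighlight>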
====Implementation====
All models are implemented with the open-source transformers library from Hugging Face (Wolf et al. 2020), which provides thousands of pre-trained models for text tasks such as sequence classification and information extraction, along with APIs to quickly download, use, and fine-tune those pre-trained models on one's own datasets. The deep learning framework used in this paper is PyTorch. We use two V100 GPUs with 12 cores to run the experiments.

For the above models we do not modify the original network structure; for more detailed network structures and parameters, please refer to transformers (Wolf et al. 2020). For each BERT variant we pick the best learning rate and number of epochs on the development set and report the corresponding test results. We found that most models are close to convergence when the number of epochs is 3, the learning rate is 2e-5, the maximum sentence length is 512, and the batch size is set to occupy as much GPU memory as possible; we therefore set these training parameters uniformly for all models. More detailed parameter settings are shown in Table 3.

Parameter | SciBERT | BERT | BERT_LARGE | RoBERTa | ALBERT | ELECTRA
pretrained model | scibert-scivocab-uncased | bert-base-uncased | bert-large-uncased | roberta-large | albert-xxlarge-v2 | google/electra-large-discriminator
epoch | 3 | 3 | 3 | 3 | 4 | 3
batch size | 16 | 16 | 16 | 16 | 8 | 16
learning rate | 2e-5 | 2e-5 | 2e-5 | 2e-5 | 5e-6 | 2e-5
max seq len | 512 | 512 | 512 | 512 | 512 | 512
attention probs dropout prob | 0.1 | 0.1 | 0.1 | 0.1 | 0 | 0.1
hidden dropout prob | 0.1 | 0.1 | 0.1 | 0.1 | 0 | 0.1
classifier dropout prob | - | - | - | - | 0.1 | -
num attention heads | 16 | 12 | 16 | 16 | 64 | 16
num hidden layers | 24 | 12 | 24 | 24 | 12 | 24
hidden size | 1024 | 768 | 1024 | 1024 | 4096 | 1024
hidden act | gelu | gelu | gelu | gelu | gelu_new | gelu
intermediate size | 3072 | 3072 | 4096 | 4096 | 16384 | 4096
vocab size | 30522 | 30522 | 30522 | 50265 | 30000 | 30522
Table 3: Model architecture and main parameters of our experiments. Pretrained checkpoints: https://github.com/allenai/scibert, https://huggingface.co/bert-base-uncased, https://huggingface.co/bert-large-uncased, https://huggingface.co/roberta-large, https://huggingface.co/albert-xxlarge-v2, https://huggingface.co/google/electra-large-discriminator.

====Performance Comparison====
The comparison results are shown in Table 4. The main observations are summarized as follows:

(1) Compared with the rule-based method and the LSTM-CRF model, all BERT-based models achieve better results, illustrating the advantage of pre-trained BERT. Due to its conservative nature, the rule-based method has higher precision but far lower recall than all other models. With unsupervised pre-training on large corpora, the BERT-based models outperform LSTM-CRF on all evaluation metrics.

(2) Among the six BERT-based models, SciBERT has the same architecture and training strategy as BERT_BASE; however, because its pre-training corpus is more relevant to our task, SciBERT outperforms BERT_BASE by 1.03% MacroF1. Meanwhile, BERT_LARGE has a more complex architecture and more parameters, and thus performs better than SciBERT. Taking advantage of larger training corpora and more effective training strategies, the other BERT-based models such as RoBERTa and ELECTRA improve further.

(3) With the FGM adversarial training strategy, as shown in Figure 3, the AT-BERT models clearly outperform their counterparts without adversarial training. This improvement indicates that the adversarial training strategy has a positive effect on the BERT-based models' performance.

(4) Comparing the ensemble strategies, the BERT-E model is superior to any single BERT-based model, especially in precision and MacroF1; a similar phenomenon occurs when comparing AT-BERT-E with the single AT-BERT models. The best-performing model, AT-BERT-E, surpasses the baseline methods, i.e., the rule-based method and the LSTM-CRF model, by 8.66 and 7.57 MacroF1 points, respectively.

These observations demonstrate the effectiveness of the different components of our proposed AT-BERT model. However, the best performance is still below human performance, leaving many research opportunities in this scenario.

Scheme | Methodology | Acronym P(%) | Acronym R(%) | Acronym F1(%) | Long Form P(%) | Long Form R(%) | Long Form F1(%) | MacroF1(%)
Baseline | RULE | 90.67 | 91.71 | 91.18 | 95.78 | 66.09 | 78.21 | 85.46
Baseline | LSTM-CRF | 88.58 | 86.93 | 87.75 | 85.33 | 85.38 | 85.36 | 86.55
BERT | BERT_BASE | 92.88 | 92.50 | 92.69 | 87.20 | 89.96 | 88.56 | 90.63
BERT | SciBERT | 92.61 | 90.82 | 91.71 | 90.96 | 92.37 | 91.66 | 91.69
BERT | BERT_LARGE | 94.07 | 94.28 | 94.18 | 90.60 | 91.44 | 91.02 | 92.60
BERT | RoBERTa | 93.10 | 92.63 | 92.86 | 92.77 | 93.92 | 93.35 | 93.11
BERT | ALBERT | 91.82 | 94.22 | 93.01 | 91.69 | 94.36 | 93.00 | 93.00
BERT | ELECTRA | 92.79 | 93.99 | 93.39 | 91.25 | 94.42 | 92.81 | 93.10
AT-BERT | BERT_LARGE | 94.34 | 93.17 | 93.75 | 92.04 | 93.24 | 92.64 | 93.20
AT-BERT | RoBERTa | 94.50 | 93.36 | 93.93 | 91.83 | 94.73 | 93.26 | 93.60
AT-BERT | ALBERT | 92.48 | 94.01 | 93.24 | 92.73 | 94.44 | 93.56 | 93.41
AT-BERT | ELECTRA | 94.38 | 92.88 | 93.63 | 92.66 | 93.86 | 93.26 | 93.45
Ensemble | BERT-E | 94.62 | 92.72 | 93.66 | 92.83 | 93.99 | 93.18 | 93.43
Ensemble | AT-BERT-E | 94.87 | 93.99 | 94.43 | 92.84 | 94.79 | 93.80 | 94.12
Human | Human Performance | 98.51 | 94.33 | 96.37 | 96.89 | 94.79 | 95.82 | 96.09
Table 4: Performance comparison of the compared methods.

Figure 3: Comparison of MacroF1 for models with and without (w/o) adversarial training.

====Case Study====
We further analyze the prediction results of BERT and AT-BERT. An interesting example (DEV-1629) is shown in Table 5. The long form corresponding to "CNNs", "RNNs", and "CRNNs" is "convolutional and/or recurrent neural nets", whereas plain BERT predicts only "recurrent neural nets". This example is confusing because "recurrent neural nets" alone can be considered the long form of "RNNs"; the general BERT model is easily misled by the token "and/or" and ignores the preceding token "convolutional". The experimental results show that our proposed AT-BERT indeed has better robustness and generalization.

tokens | this | study | were | convolutional | and/or | recurrent | neural | nets | ( | CNNs | , | RNNs | , | or | CRNNs | ) | ,
label | O | O | O | B-long | I-long | I-long | I-long | I-long | O | B-short | O | B-short | O | O | B-short | O | O
w/o AT | O | O | O | O | O | B-long | I-long | I-long | O | B-short | O | B-short | O | O | B-short | O | O
with AT | O | O | O | B-long | I-long | I-long | I-long | I-long | O | B-short | O | B-short | O | O | B-short | O | O
Table 5: Case analysis with and without (w/o) adversarial training.

===Conclusion and Future Work===
In this paper, we proposed a novel BERT-based model called AT-BERT for acronym identification, the winning solution of the acronym identification task at the AAAI-21 Workshop on Scientific Document Understanding. An FGM-based adversarial training strategy was incorporated into the fine-tuning of BERT variants, and an average ensemble mechanism was devised to capture better representations from multiple BERT variants. Extensive experiments on the SciAI dataset achieved the best performance among all competitive methods, which verifies the effectiveness of the proposed approach. In the future, we will optimize our model from two perspectives: one is to explore more adversarial training strategies, such as PGD and FreeLB, for the BERT model; the other is to try different loss functions, such as Dice Loss (Li et al. 2019) and Focal Loss (Lin et al. 2017), to alleviate class imbalance.

===Acknowledgments===
We thank the organizers of the acronym identification and disambiguation competitions and the reviewers for their valuable comments and suggestions.
===References===
Ackermann, C. F.; Beller, C. E.; Boxwell, S. A.; Katz, E. G.; and Summers, K. M. 2020. Resolution of Acronyms in Question Answering Systems. US Patent 10,572,597.
Barnett, A.; and Doubleday, Z. 2020. Meta-Research: The Growth of Acronyms in the Scientific Literature. eLife 9: e60080.
Beltagy, I.; Lo, K.; and Cohan, A. 2019. SciBERT: Pretrained Language Model for Scientific Text. In EMNLP.
Chen, T.; and Guestrin, C. 2016. XGBoost: A Scalable Tree Boosting System. In KDD, 785–794.
Clark, K.; Luong, M.; Le, Q. V.; and Manning, C. D. 2020. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. In ICLR.
Goodfellow, I. J.; Shlens, J.; and Szegedy, C. 2015. Explaining and Harnessing Adversarial Examples. In ICLR.
Harris, C. G.; and Srinivasan, P. 2019. My Word! Machine versus Human Computation Methods for Identifying and Resolving Acronyms. Computación y Sistemas 23(3).
Kang, D.; Head, A.; Sidhu, R.; Lo, K.; Weld, D. S.; and Hearst, M. A. 2020. Document-Level Definition Detection in Scholarly Documents: Existing Models, Error Analyses, and Future Directions. arXiv preprint arXiv:2010.05129.
Kenton, J. D. M.-W. C.; and Toutanova, L. K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT, 4171–4186.
Kuo, C.-J.; Ling, M. H.; Lin, K.-T.; and Hsu, C.-N. 2009. BIOADI: A Machine Learning Approach to Identifying Abbreviations and Definitions in Biological Literature. In BMC Bioinformatics, volume 10, S7. Springer.
Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; and Soricut, R. 2019. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. In ICLR.
Li, J.; Sun, A.; Han, J.; and Li, C. 2020. A Survey on Deep Learning for Named Entity Recognition. IEEE Transactions on Knowledge and Data Engineering.
Li, X.; Sun, X.; Meng, Y.; Liang, J.; Wu, F.; and Li, J. 2019. Dice Loss for Data-imbalanced NLP Tasks. arXiv preprint arXiv:1911.02855.
Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; and Dollár, P. 2017. Focal Loss for Dense Object Detection. In ICCV, 2980–2988.
Liu, J.; Liu, C.; and Huang, Y. 2017. Multi-granularity Sequence Labeling Model for Acronym Expansion Identification. Information Sciences 378: 462–474.
Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; and Stoyanov, V. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint arXiv:1907.11692.
Madry, A.; Makelov, A.; Schmidt, L.; Tsipras, D.; and Vladu, A. 2018. Towards Deep Learning Models Resistant to Adversarial Attacks. In ICLR.
Miyato, T.; Dai, A. M.; and Goodfellow, I. J. 2017. Adversarial Training Methods for Semi-Supervised Text Classification. In ICLR.
Okazaki, N.; and Ananiadou, S. 2006. Building an Abbreviation Dictionary Using a Term Recognition Approach. Bioinformatics 22(24): 3089–3095.
Pouran Ben Veyseh, A.; Dernoncourt, F.; and Nguyen, T. H. 2019. Improving Slot Filling by Utilizing Contextual Information. arXiv preprint, arXiv:1911.
Schwartz, A. S.; and Hearst, M. A. 2002. A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text. In Biocomputing 2003, 451–462. World Scientific.
Shafahi, A.; Najibi, M.; Ghiasi, M. A.; Xu, Z.; Dickerson, J.; Studer, C.; Davis, L. S.; Taylor, G.; and Goldstein, T. 2019. Adversarial Training for Free! In NeurIPS, 3358–3369.
Sun, Y.; Wang, S.; Li, Y.; Feng, S.; Chen, X.; Zhang, H.; Tian, X.; Zhu, D.; Tian, H.; and Wu, H. 2019. ERNIE: Enhanced Representation Through Knowledge Integration. arXiv preprint arXiv:1904.09223.
Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I. J.; and Fergus, R. 2014. Intriguing Properties of Neural Networks. In ICLR.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention Is All You Need. In NIPS, 5998–6008.
Veyseh, A. P. B. 2016. Cross-lingual Question Answering Using Common Semantic Space. In TextGraphs, 15–19.
Veyseh, A. P. B.; Dernoncourt, F.; Nguyen, T. H.; Chang, W.; and Celi, L. A. 2020a. Acronym Identification and Disambiguation Shared Tasks for Scientific Document Understanding. In AAAI Workshop on Scientific Document Understanding.
Veyseh, A. P. B.; Dernoncourt, F.; Tran, Q. H.; and Nguyen, T. H. 2020b. What Does This Acronym Mean? Introducing a New Dataset for Acronym Identification and Disambiguation. In COLING, 3285–3301.
Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; Davison, J.; Shleifer, S.; von Platen, P.; Ma, C.; Jernite, Y.; Plu, J.; Xu, C.; Scao, T. L.; Gugger, S.; Drame, M.; Lhoest, Q.; and Rush, A. M. 2020. Transformers: State-of-the-Art Natural Language Processing. In EMNLP, 38–45.
Xu, Y.; Qiu, X.; Zhou, L.; and Huang, X. 2020. Improving BERT Fine-Tuning via Self-Ensemble and Self-Distillation. arXiv preprint arXiv:2002.10345.
Zhang, D.; Zhang, T.; Lu, Y.; Zhu, Z.; and Dong, B. 2019. You Only Propagate Once: Accelerating Adversarial Training via Maximal Principle. In NeurIPS, 227–238.
Zhong, Q.; Zeng, G.; Zhu, D.; Zhang, Y.; Lin, W.; Chen, B.; and Tang, J. 2021. Leveraging Domain Agnostic and Specific Knowledge for Acronym Disambiguation. In AAAI Workshop on Scientific Document Understanding.
Zhu, C.; Cheng, Y.; Gan, Z.; Sun, S.; Goldstein, T.; and Liu, J. 2020. FreeLB: Enhanced Adversarial Training for Natural Language Understanding. In ICLR.