BERT-based Acronym Disambiguation with Multiple Training Strategies

Chunguang Pan, Bingyan Song, Shengguang Wang, Zhipeng Luo
DeepBlue Technology (Shanghai) Co., Ltd
{panchg, songby, wangshg, luozp}@deepblueai.com

Abstract

The acronym disambiguation (AD) task aims to find the correct expansion of an ambiguous acronym in a given sentence. Although it is convenient to use acronyms, they can sometimes be difficult to understand. Identifying the appropriate expansion of an acronym is a practical task in natural language processing. Since little work has been done on AD in the scientific field, in this paper we propose a binary classification model incorporating BERT and several training strategies, including dynamic negative sample selection, task adaptive pretraining, adversarial training and pseudo-labeling. Experiments on SciAD show the effectiveness of our proposed model, and our score ranks 1st in SDU@AAAI-21 shared task 2: Acronym Disambiguation.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Figure 1: An example of acronym disambiguation. Input sentence: "The model complexity for the SVM is determined by the Gaussian kernel spread and the penalty parameter." Dictionary: SVM -> Support Vector Machine; State Vector Machine. Output: Support Vector Machine.

1 Introduction

An acronym is a word created from the initial components of a phrase or name, called the expansion (Jacobs, Itai, and Wintner 2020). In much of the literature, especially in scientific and medical fields, the number of acronyms is increasing at an incredible rate. By using acronyms, people can avoid repeating frequently used long phrases. For example, CNN is an acronym with the expansion Convolutional Neural Network, though it has additional possible expansions depending on context, such as Condensed Nearest Neighbor.

Understanding the correlation between acronyms and their expansions is critical for several applications in natural language processing, including text classification, question answering and so on.

Despite the convenience of using acronyms, they can sometimes be difficult to understand, especially for people who are not familiar with the specific area, such as the scientific or medical field. Therefore, it is necessary to develop a system that can automatically resolve the appropriate meaning of acronyms in different contexts.

Given an acronym and several possible expansions, the acronym disambiguation (AD) task is to determine which expansion is correct for a particular context. Scientific acronym disambiguation is challenging due to the high ambiguity of acronyms. For example, as shown in Figure 1, SVM has two expansions in the dictionary. According to the contextual information from the input sentence, the SVM here stands for Support Vector Machine, which is quite similar to State Vector Machine.

Consequently, AD is formulated as a classification problem: given a sentence and an acronym, the goal is to predict the expansion of the acronym from a given candidate set. Over the past two decades, several kinds of approaches have been proposed. At the beginning, pattern-matching techniques were popular: Taghva and Gilbreth (1999) designed rules and patterns to find the corresponding expansions of each acronym. However, since pattern-matching methods require considerable human effort to design and tune the rules and patterns, machine learning based methods (e.g., CRF and SVM) (Liu, Liu, and Huang 2017) came to be preferred. More recently, deep learning methods (Charbonnier and Wartena 2018; Jin, Liu, and Lu 2019) have been adopted to solve this task.
Recently, pre-trained language models such as ELMo (Peters et al. 2018) and BERT (Devlin et al. 2018) have shown their effectiveness in contextual representation. Inspired by these pre-trained models, we propose a binary classification model that is capable of handling acronym disambiguation. We evaluate and verify the proposed method on the dataset released by SDU@AAAI 2021 Shared Task: Acronym Disambiguation (Veyseh et al. 2020a). Experimental results show that our model can effectively deal with the task, and we won first place in the competition.

2 Related Work

Acronym Disambiguation

Acronym disambiguation has received a lot of attention in vertical domains, especially in the biomedical field. Most of the proposed methods (Schwartz and Hearst 2002) utilize generic rules or text patterns to discover acronym expansions. These methods apply under circumstances where acronyms are co-mentioned with their corresponding expansions in the same document. In scientific papers, however, this rarely happens: it is very common for people to define an acronym in one place and use it elsewhere. Thus, such methods cannot be used for acronym disambiguation in the scientific field.

There have been a few works (Nadeau and Turney 2005) on automatically mining acronym expansions by leveraging Web data (e.g., click logs, query sessions). However, we cannot apply them directly to scientific data, since most scientific data is raw text, and logs of query sessions/clicks are rarely available.

Pre-trained Models

Substantial work has shown that pre-trained models (PTMs) trained on large unlabeled corpora can learn universal language representations, which are beneficial for downstream NLP tasks and avoid training a new model from scratch. The first-generation PTMs aim to learn good word embeddings. These models are usually very shallow for computational efficiency, such as Skip-Gram (Mikolov et al. 2013) and GloVe (Pennington, Socher, and Manning 2014), because they themselves are no longer needed by downstream tasks. Although these pre-trained embeddings can capture the semantic meanings of words, they fail to capture higher-level concepts in context, such as polysemous disambiguation and semantic roles. The second-generation PTMs focus on learning contextual word embeddings, such as ELMo (Peters et al. 2018), OpenAI GPT (Radford et al. 2018) and BERT (Devlin et al. 2018). These learned encoders are still needed to generate word embeddings in context when used in downstream tasks.

Adversarial Training

Adversarial training (AT) (Goodfellow, Shlens, and Szegedy 2014) is a means of regularizing classification algorithms by adding adversarial noise to the training data. It was first introduced in image classification tasks, where the input data is continuous. Miyato, Dai, and Goodfellow (2017) extend adversarial and virtual adversarial training to text classification by applying perturbations to the word embeddings, and propose an end-to-end way of perturbing the data by utilizing the gradient information. Zhu, Li, and Zhou (2019) propose an adversarial attention network for multi-dimensional emotion regression, which automatically rates multiple emotion dimension scores for an input text.

There are also other works on regularizing classifiers by adding random noise to the data, such as dropout (Srivastava et al. 2014) and its variant for NLP tasks, word dropout (Iyyer et al. 2015). Xie et al. (2019) discuss various data noising techniques for language models and provide empirical analysis validating the relationship between noising and smoothing. Søgaard (2013) and Li, Cohn, and Baldwin (2017) focus on linguistic adversaries.

Combining the advantages of the works above, we propose a binary classification model utilizing BERT and several training strategies such as adversarial training.

3 Data

In this paper, we use the AD dataset called SciAD released by Veyseh et al. (2020b). They collected a corpus of 6,786 English papers from arXiv; these papers consist of 2,031,592 sentences that could be used for data annotation. The dataset contains 62,441 samples, where each sample involves a sentence, an ambiguous acronym, and its correct meaning (one of the meanings of the acronym recorded in the dictionary, as shown in Figure 1).

Figure 2 and Figure 3 show statistics of the SciAD dataset. More specifically, Figure 2 shows the distribution of the number of acronyms per sentence. A sentence can contain more than one acronym, and most sentences have 1 or 2 acronyms. Figure 3 shows the distribution of the number of expansions per acronym. This distribution is consistent with the corresponding distribution presented in prior work (Charbonnier and Wartena 2018): in both, acronyms with 2 or 3 meanings account for the largest number of samples in the dataset (Veyseh et al. 2020b).

Figure 2: Number of acronyms per sentence (most sentences contain 1 or 2 acronyms).

Figure 3: Number of expansions per acronym (acronyms with 2 or 3 expansions are the most frequent).

4 Binary Classification Model

The input of the binary classification model is a sentence with an ambiguous acronym and a possible expansion. The model needs to predict whether the expansion is the corresponding expansion of the given acronym. Given an input sentence, the model assigns a predicted score to each candidate expansion, and the candidate expansion with the highest score is the model output. Figure 4 shows an example of the procedure.

Figure 4: Acronym disambiguation based on the binary classification model. For the sentence "The MSE of ... consists of variance and squared bias.", the model scores each candidate expansion of MSE (mean squared error: 0.95, model selection eqn: 0.37, minimum square error: 0.56) and takes the argmax, selecting the expansion with the highest score as the correct one.
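For concreteness, the selection step in Figure 4 can be written as a few lines of Python. This is a minimal sketch of the inference loop only; `score_pair` stands in for the binary model of this section, and its name, like the example values, is illustrative rather than taken from the authors' code.

```python
from typing import Callable, Dict, List

def disambiguate(sentence: str,
                 acronym: str,
                 dictionary: Dict[str, List[str]],
                 score_pair: Callable[[str, str], float]) -> str:
    """Return the candidate expansion whose pairing with the sentence
    scores highest under the binary classifier."""
    candidates = dictionary[acronym]
    # score_pair(expansion, sentence) is assumed to return the sigmoid
    # output of the binary model, a matching score in [0, 1].
    scores = [score_pair(expansion, sentence) for expansion in candidates]
    return candidates[scores.index(max(scores))]

# Example (Figure 1): with a trained scorer, the call below should yield
# "Support Vector Machine".
# disambiguate("The model complexity for the SVM is determined by ...",
#              "SVM",
#              {"SVM": ["Support Vector Machine", "State Vector Machine"]},
#              score_pair)
```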
Input Format

Since BERT can process multiple input sentences with segment embeddings, we use the candidate expansion as the first input segment and the given text as the second input segment, separated by the special token [SEP]. Furthermore, we add two special tokens to wrap the acronym in the text, which ensures that the acronym gets enough attention from the model.

Binary Model Architecture

The model architecture is described in detail in Figure 5. First, we use a BERT encoder to get the representation of the input segments. Next, we calculate the mean of the representations at the start and end positions of the acronym, and concatenate it with the [CLS] position vector. Then we send this concatenated vector into a binary classifier for prediction. The representation first passes through a dropout layer (Srivastava et al. 2014) and a feedforward layer. The output of these layers is then fed into a ReLU activation (Glorot, Bordes, and Bengio 2011). After this, the resulting vector passes through a dropout layer and a feedforward layer again, and the final prediction is obtained through a sigmoid activation.

Figure 5: The binary classification model. The BERT representations of the acronym tokens are mean-pooled and concatenated with the [CLS] vector, then passed through Dropout(0.2), a dense layer, ReLU, Dropout(0.1), a dense layer, and a sigmoid.
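The description above can be made concrete with a short PyTorch sketch. The marker tokens `<acr>` and `</acr>`, the 128-unit hidden layer, and the dropout rates are our reading of Figure 5 and the text; the authors did not release code, so this is one plausible implementation rather than the official one.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "allenai/scibert_scivocab_uncased"

class BinaryADModel(nn.Module):
    """Binary classifier: does the candidate expansion match the acronym?"""

    def __init__(self, encoder_name: str = MODEL_NAME):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        # [CLS] vector concatenated with the pooled acronym span vector,
        # then dropout -> dense -> ReLU -> dropout -> dense -> sigmoid.
        self.head = nn.Sequential(
            nn.Dropout(0.2),
            nn.Linear(2 * hidden, 128),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(128, 1),
        )

    def forward(self, input_ids, attention_mask, acr_start, acr_end):
        states = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        cls_vec = states[:, 0]                       # [CLS] position
        rows = torch.arange(states.size(0))
        # Mean of the representations at the acronym start/end positions.
        span_vec = (states[rows, acr_start] + states[rows, acr_end]) / 2
        logits = self.head(torch.cat([cls_vec, span_vec], dim=-1))
        return torch.sigmoid(logits).squeeze(-1)

# Input format: expansion as segment A, sentence as segment B, with the
# acronym wrapped in added marker tokens (names assumed; the paper only
# says "two special tokens").
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<acr>", "</acr>"]})
model = BinaryADModel()
model.encoder.resize_token_embeddings(len(tokenizer))

batch = tokenizer(
    "Support Vector Machine",
    "The model complexity for the <acr> SVM </acr> is determined by ...",
    return_tensors="pt")
```

The `acr_start`/`acr_end` indices would be located by searching `input_ids` for the two marker tokens; that bookkeeping is omitted here for brevity.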
Training Strategies

Pretrained Models Experiments in previous work have shown the effectiveness of pretrained models. Starting from the original BERT, many improved pretrained models have been proposed; RoBERTa, for instance, uses dynamic masking and removes the next sentence prediction task. In our experiments, we compare BERT and RoBERTa models trained on corpora from different fields.

Dynamic Negative Sample Selection During training, we dynamically select a fixed number of negative samples for each batch. This ensures that the model is trained on more balanced positive and negative data, while all negative samples are still used over the course of training; a sketch is given below.
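The paper does not spell out the exact sampling scheme, so the following is one plausible reading: keep a shuffled pool of negatives, draw a fixed number per batch, and reshuffle the pool only when it is exhausted, so batches stay balanced while every negative is eventually visited.

```python
import random

class DynamicNegativeSampler:
    """Extend each batch of positive samples with freshly drawn negatives."""

    def __init__(self, negatives, neg_per_batch=16, seed=0):
        self.negatives = list(negatives)
        self.neg_per_batch = neg_per_batch
        self.rng = random.Random(seed)
        self._pool = []          # negatives not yet used in the current pass

    def _next_negative(self):
        if not self._pool:       # pool exhausted: start a new shuffled pass
            self._pool = self.negatives[:]
            self.rng.shuffle(self._pool)
        return self._pool.pop()

    def batch(self, positive_batch):
        """Return the positives plus `neg_per_batch` sampled negatives."""
        out = list(positive_batch)
        out.extend(self._next_negative() for _ in range(self.neg_per_batch))
        self.rng.shuffle(out)    # mix labels within the batch
        return out
```

With roughly equal numbers of positives and negatives per batch, the binary cross-entropy loss sees balanced classes without any negative pair ever being discarded.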
Task Adaptive Pretraining Gururangan et al. (2020) show that task-adaptive pretraining (TAPT) can effectively improve model performance. The task-specific dataset usually covers only a subset of the data used for general pretraining, so we can achieve a significant improvement by continuing to pretrain the masked language model task on the given dataset.

Adversarial Training Adversarial training is a popular approach to increasing the robustness of neural networks. As shown by Miyato, Dai, and Goodfellow (2017), adversarial training has a good regularization effect. By adding perturbations to the embedding layer, we obtain more stable word representations and a more generalized model, which significantly improves model performance on unseen data.
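The paper does not name the exact perturbation method, so the sketch below uses the common fast-gradient variant (in the spirit of Miyato, Dai, and Goodfellow 2017): after the normal backward pass, nudge the word-embedding weights along the gradient, run a second forward/backward pass on the perturbed model, then restore the weights.

```python
import torch

class FGM:
    """Fast-gradient perturbation of the embedding layer (sketch)."""

    def __init__(self, model, emb_name="word_embeddings", epsilon=1.0):
        self.model = model
        self.emb_name = emb_name   # substring identifying embedding weights
        self.epsilon = epsilon
        self.backup = {}

    def attack(self):
        for name, param in self.model.named_parameters():
            if (param.requires_grad and self.emb_name in name
                    and param.grad is not None):
                self.backup[name] = param.data.clone()
                norm = torch.norm(param.grad)
                if norm != 0:
                    # Step along the gradient: the worst-case direction.
                    param.data.add_(self.epsilon * param.grad / norm)

    def restore(self):
        for name, param in self.model.named_parameters():
            if name in self.backup:
                param.data = self.backup[name]
        self.backup = {}

# Per training step:
#   loss.backward()                      # gradients on clean inputs
#   fgm.attack()                         # perturb embeddings
#   adv_loss = criterion(model(...), labels)
#   adv_loss.backward()                  # accumulate adversarial gradients
#   fgm.restore()                        # undo the perturbation
#   optimizer.step(); optimizer.zero_grad()
```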
Pseudo-Labeling Pseudo-labeling (Iscen et al. 2019; Oliver et al. 2018; Shi et al. 2018) uses network predictions with high confidence as labels. We mix these pseudo-labels with the training set to generate a new dataset, and then use this new dataset to train a new binary classification model. Pseudo-labeling has proved to be an effective way of exploiting unlabeled data for better performance.
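Concretely, with the 0.95 confidence threshold used later in Section 5, the dataset construction reduces to a few lines (function and variable names are ours):

```python
def build_pseudo_labeled_set(train_samples, test_samples, test_scores,
                             threshold=0.95):
    """Mix confident test-set predictions into the training data.

    `test_scores[i]` is the binary model's sigmoid score for
    `test_samples[i]`; pairs scoring above `threshold` are treated as
    confirmed (expansion, sentence) matches with label 1.
    """
    pseudo = [(sample, 1)
              for sample, score in zip(test_samples, test_scores)
              if score > threshold]
    return list(train_samples) + pseudo
```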
Training Procedure

We incorporate all the training strategies introduced above to improve the performance of our proposed binary classification model. Based on the experimental results in Table 1, we choose SciBERT as the fundamental pretrained model and use the TAPT technique to train a new pretrained model. We then add the dynamic negative sample selection and adversarial training strategies to train the binary classification model. Finally, we apply the pseudo-labeling technique and obtain the final binary classification model.

5 Experiments

Hyperparameters

The batch size used in our experiments is 32, and we train each model for 15 epochs. The initial learning rate for the text encoder is 1.0 × 10−5; for all other parameters, it is set to 5.0 × 10−4. We evaluate the model on the validation set at each epoch, and if the macro F1 score does not increase, we decay the learning rate by a factor of 0.1, down to a minimum of 5.0 × 10−7. We use the Adam optimizer (Kingma and Ba 2017) in all experiments.
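As a sketch, this optimization setup maps onto standard PyTorch utilities as follows (reusing the encoder/head split from the model sketch in Section 4; this is our reconstruction, not the authors' training script):

```python
import torch

optimizer = torch.optim.Adam([
    {"params": model.encoder.parameters(), "lr": 1.0e-5},  # text encoder
    {"params": model.head.parameters(), "lr": 5.0e-4},     # other parameters
])

# Multiply learning rates by 0.1 whenever the validation macro F1 fails to
# improve, with a floor of 5.0e-7.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.1, patience=0, min_lr=5.0e-7)

# After each epoch: scheduler.step(macro_f1)
```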
Pretrained Models

Since different pretrained models are trained on different data, we experiment with several of them. Table 1 shows the results of different pretrained models on the validation set. The bert-base model gets the highest score among the commonly used pretrained models (the top three lines in Table 1). Since a large fraction of the texts in the given dataset comes from the computer science field, the cs-roberta model outperforms the bert-base model by 1.6 points. The best model in our experiments is SciBERT, which achieves an F1 score of 89%.

Table 1: Results on the validation set using different pretrained models.

Model                      Precision  Recall  F1
bert-base-uncased          0.9176     0.8160  0.8638
bert-large-uncased         0.9034     0.7693  0.8311
roberta-base               0.9008     0.7687  0.8295
cs-roberta-base            0.9216     0.8415  0.8797
scibert-scivocab-uncased   0.9263     0.8569  0.8902

Further Experiments

Combining training strategies We run further experiments on the validation set to verify the effectiveness of each strategy described above; the results are shown in Table 2. The F1 score increases by about 4 points with dynamic sampling. TAPT and adversarial training further improve performance on the validation set by 0.47 points. Finally, we apply the pseudo-labeling method: samples from the test set with a score higher than 0.95 are selected and mixed with the training set, which still slightly improves the F1 score.

Table 2: Results on the validation set using different training approaches.

Model                      Precision  Recall  F1
scibert-scivocab-uncased   0.9263     0.8569  0.8902
+dynamic sampling          0.9575     0.9060  0.9310
+task adaptive pretraining 0.9610     0.9055  0.9324
+adversarial training      0.9651     0.9082  0.9358
+pseudo-labeling           0.9629     0.9106  0.9360

Error Analysis For error analysis, we gather a sample of 100 misclassified development set examples and inspect them manually. We find two main cases in which the model gives a wrong prediction. In the first, the candidate expansions are too similar, sometimes even having the same meaning in different forms. For example, in the sentence "The SC is decreasing for increasing values of ...", the correct expansion for "SC" is "sum capacities" while our prediction is "sum capacity", which has the same meaning as the correct one but in the singular form. In the second, the given sentence contains too little contextual information for prediction. For instance, the correct expansion for "ML" in the sentence "ML models are usually much more complex, see Figure." is "model logic", while the predicted expansion is "machine learning". Even a human could hardly tell which one is right based only on the given sentence.

Time complexity To analyze the time complexity of our proposed method, we report the actual running times observed in our experiments. This discussion is neither precise nor exhaustive, but we believe it is enough to give readers a rough estimate of the time complexity of our model.

We use the TAPT strategy to further train the SciBERT model on eight NVIDIA TITAN V GPUs (12 GB); training 100 epochs takes three hours in total. After obtaining the new pretrained model, we train the binary classification model on two NVIDIA TITAN V GPUs. The average per-epoch training and inference times after adding adversarial training and pseudo-labeling are shown in Table 3. The model begins to converge after five epochs. Inference takes nearly the same time throughout, while the training time doubles once adversarial training is added.

Table 3: Time complexity.

Model                   Train   Inference
base model              1588s   150.42s
+adversarial training   3021s   149.64s
+pseudo-labeling        3328s   149.36s

Comparison Results We compare our results with several other models. Precision, Recall and F1 of our proposed model are computed on the test data via cross-validation.

• MF & ADE: non-deep-learning models that utilize rules or hand-crafted features (Li et al. 2018).
• NOA & UAD: language-model-based baselines that train word embeddings on the training corpus (Charbonnier and Wartena 2018; Ciosici and Assent 2019).
• BEM & DECBAE: models employing deep architectures (e.g., LSTMs) (Jin, Liu, and Lu 2019; Blevins and Zettlemoyer 2020).
• GAD: a deep learning model utilizing the syntactic structure of the sentence (Veyseh et al. 2020b).

As shown in Table 4, rules and hand-crafted features fail to capture all the patterns in which the meaning of an acronym is expressed, resulting in poorer recall on expansions than on acronyms. In contrast, the deep learning models have comparable recall on expansions and acronyms, showing the importance of pre-trained word embeddings and deep architectures for AD. However, they all fall far behind human-level performance. Among all the models, our proposed model achieves the best results on SciAD and comes very close to human performance, which demonstrates the capability of the strategies introduced above.

Table 4: Results of different models on the test set.

Model               Precision  Recall  F1
MF                  0.8903     0.4220  0.5726
ADE                 0.8674     0.4325  0.5772
NOA                 0.7814     0.3506  0.4840
UAD                 0.8901     0.7008  0.7837
BEM                 0.8675     0.3594  0.5082
DECBAE              0.8867     0.7432  0.8086
GAD                 0.8927     0.7666  0.8190
Ours                0.9695     0.9132  0.9405
Human Performance   0.9782     0.9445  0.9610

SDU@AAAI 2021 Shared Task: Acronym Disambiguation The competition results are shown in Table 5, which lists the scores of the top five ranked models as well as the baseline model released by the provider of the SciAD dataset (Veyseh et al. 2020b). Our model performs best on the leaderboard, outperforming the second place by 0.32% and the baseline model by 12.15%, a substantial improvement.

Table 5: Leaderboard.

Model     Precision  Recall  F1
Rank 1    0.9695     0.9132  0.9405
Rank 2    0.9694     0.9073  0.9373
Rank 3    0.9652     0.9009  0.9319
Rank 4    0.9595     0.8959  0.9266
Rank 5    0.9548     0.8907  0.9216
Baseline  0.8927     0.7666  0.8190

6 Conclusion

In this paper, we introduce a binary classification model for acronym disambiguation. We utilize a BERT encoder to obtain the input representations and adopt several training strategies, including dynamic negative sample selection, task adaptive pretraining, adversarial training and pseudo-labeling. Experiments on SciAD show the validity of our proposed model, and we won first place in SDU@AAAI-21 Shared Task 2.

References

Blevins, T., and Zettlemoyer, L. 2020. Moving down the long tail of word sense disambiguation with gloss-informed biencoders. arXiv preprint arXiv:2005.02590.

Charbonnier, J., and Wartena, C. 2018. Using word embeddings for unsupervised acronym disambiguation. In Proceedings of the 27th International Conference on Computational Linguistics, 2610–2619.

Ciosici, M., and Assent, I. 2019. Abbreviation explorer: an interactive system for pre-evaluation of unsupervised abbreviation disambiguation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), 1–5.

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Glorot, X.; Bordes, A.; and Bengio, Y. 2011. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 315–323.

Goodfellow, I. J.; Shlens, J.; and Szegedy, C. 2014. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.

Gururangan, S.; Marasović, A.; Swayamdipta, S.; Lo, K.; Beltagy, I.; Downey, D.; and Smith, N. A. 2020. Don't stop pretraining: Adapt language models to domains and tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 8342–8360. Online: Association for Computational Linguistics.

Iscen, A.; Tolias, G.; Avrithis, Y.; and Chum, O. 2019. Label propagation for deep semi-supervised learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5070–5079.

Iyyer, M.; Manjunatha, V.; Boyd-Graber, J.; and Daumé III, H. 2015. Deep unordered composition rivals syntactic methods for text classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 1681–1691.

Jacobs, K.; Itai, A.; and Wintner, S. 2020. Acronyms: identification, expansion and disambiguation. Annals of Mathematics and Artificial Intelligence 88(5):517–532.

Jin, Q.; Liu, J.; and Lu, X. 2019. Deep contextualized biomedical abbreviation expansion. arXiv preprint arXiv:1906.03360.

Kingma, D. P., and Ba, J. 2017. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Li, Y.; Zhao, B.; Fuxman, A.; and Tao, F. 2018. Guess me if you can: Acronym disambiguation for enterprises. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1308–1317.

Li, Y.; Cohn, T.; and Baldwin, T. 2017. Robust training under linguistic adversity. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, 21–27.

Liu, J.; Liu, C.; and Huang, Y. 2017. Multi-granularity sequence labeling model for acronym expansion identification. Information Sciences 378:462–474.

Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, 3111–3119.

Miyato, T.; Dai, A. M.; and Goodfellow, I. 2017. Adversarial training methods for semi-supervised text classification. In Proceedings of the International Conference on Learning Representations.

Nadeau, D., and Turney, P. D. 2005. A supervised learning approach to acronym identification. In Conference of the Canadian Society for Computational Studies of Intelligence, 319–329. Springer.

Oliver, A.; Odena, A.; Raffel, C. A.; Cubuk, E. D.; and Goodfellow, I. 2018. Realistic evaluation of deep semi-supervised learning algorithms. In Advances in Neural Information Processing Systems, 3235–3246.

Pennington, J.; Socher, R.; and Manning, C. D. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543.

Peters, M.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2227–2237.

Radford, A.; Narasimhan, K.; Salimans, T.; and Sutskever, I. 2018. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf.

Schwartz, A. S., and Hearst, M. A. 2002. A simple algorithm for identifying abbreviation definitions in biomedical text. In Biocomputing 2003, 451–462. World Scientific.

Shi, W.; Gong, Y.; Ding, C.; Ma, Z.; Tao, X.; and Zheng, N. 2018. Transductive semi-supervised deep learning using min-max features. In Proceedings of the European Conference on Computer Vision (ECCV), 299–315.

Søgaard, A. 2013. Part-of-speech tagging with antagonistic adversaries. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 640–644.

Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15(1):1929–1958.

Taghva, K., and Gilbreth, J. 1999. Recognizing acronyms and their definitions. International Journal on Document Analysis and Recognition 1(4):191–198.

Veyseh, A. P. B.; Dernoncourt, F.; Nguyen, T. H.; Chang, W.; and Celi, L. A. 2020a. Acronym identification and disambiguation shared tasks for scientific document understanding. arXiv preprint arXiv:2012.11760.

Veyseh, A. P. B.; Dernoncourt, F.; Tran, Q. H.; and Nguyen, T. H. 2020b. What does this acronym mean? Introducing a new dataset for acronym identification and disambiguation. In Proceedings of the 28th International Conference on Computational Linguistics, 3285–3301.

Xie, Z.; Wang, S. I.; Li, J.; Lévy, D.; Nie, A.; Jurafsky, D.; and Ng, A. Y. 2019. Data noising as smoothing in neural network language models. In 5th International Conference on Learning Representations, ICLR 2017.

Zhu, S.; Li, S.; and Zhou, G. 2019. Adversarial attention modeling for multi-dimensional emotion regression. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 471–480.