Effective Ensembling of Transformer based Language Models for Acronyms Identification

Divesh Kubal, Apurva Nagvenkar
CRIMSON AI
divesh.kubal@crimsoni.ai, apurva.nagvenkar@crimsoni.ai

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

An acronym can be viewed as a word constructed by taking the initial components of a phrase. This study deals with the problem of identification and extraction of an acronym's short and long-form. The proposed approach solves the Acronym Identification (AI) task of the scientific document understanding (SDU@AAAI-21) shared task. This paper models the Acronym Identification task as a sentence-level sequence-labeling problem. The proposed method is an ensemble of various Language Models trained with hyper-parameter tuning. This ensembling technique is then coupled with post-processing steps to extract the best possible predictions. The trained model's performance is evaluated against standard evaluation metrics such as precision, recall, and F1-score. The final model achieves an F1-score of 95.60%, a precision of 93.97%, and a recall of 97.95% on the development dataset. On the test data, the proposed model achieves an F1-score of 92.08%, a precision of 89.70%, and the highest recall of 94.59% compared to the other participants' results in the competition.

Introduction

Acronyms in the technical and scientific domains are increasing at an exponential rate, owing to the huge amount of research conducted in technical and non-technical domains. The term "acronym" refers to the name given to a particular word or phrase by taking the first letters of each word of the phrase (Mack 2012). For example, 'ANN' is an acronym that stands for 'Artificial Neural Network'. The more common or general term is "abbreviation". Abbreviations encompass acronyms as well as a few abbreviations that use letters other than the initial characters of phrases, such as 'Mr.' for 'Mister'. Hence, there is a thin line that distinguishes acronyms from abbreviations. Acronyms serve a vital role in writing scientific or research-related technical documents, patents, etc., by preventing content repetition. This speeds up the reading process and paves the way for an easier understanding of the content written inside a document.

Several techniques have been proposed to extract acronyms from a given input text corpus. These systems are rule-based (Schwartz and Hearst 2002) or machine learning-based (Jacobs, Itai, and Wintner 2020; Kuo et al. 2009; Liu, Liu, and Huang 2017). Some techniques use word-embedding methods (Kirchhoff and Turner 2016) to extract acronyms. There are also several packages available in Python (Cook 2019; Schwartz and Hearst 2002) that extract acronyms and their expansions. The understanding of acronyms and their expansions is an important task in the following use-cases:

• Text understanding: There can be multiple expansions of an acronym. Hence, identification of the correct contextual meaning is important to understand the text in an unambiguous manner.
• Information retrieval: When a document collection is queried with a query containing an acronym, the results should contain the relevant documents.
• Machine translation: Acronyms pose a big challenge when translating a source language into its target language.
• Text summarization: It is advisable to use the acronym counterpart of an expansion to summarize the text.

Figure 1: Examples of Acronym Identification

This paper presents an effective ensembling-based approach to automatically extract acronyms along with their extended/long-forms.
The proposed approach combines ensembling of Language Models, hyper-parameter tuning, and post-processing to obtain the best possible results. This paper is structured as follows: a survey of related work gives a quick overview of existing approaches used to solve the problem of Acronym Identification (Veyseh et al. 2020a). After the related work section, an in-depth explanation of the proposed system architecture is given. A thorough comparative analysis of results under different scenarios is presented in the Results and Discussions section. Finally, the paper closes with a quick conclusion and future directions.

Related Work

The existing Acronym Identification systems can be broadly classified into rule-based, feature & machine learning-based, and Deep Learning-based systems, as depicted in figure 2. The rule-based systems mainly consist of techniques that use rules, patterns, and regular expressions to extract acronyms and their expansions. The machine learning-based techniques first extract the features required for Acronym Identification, and then a classifier (such as a Support Vector Machine or Conditional Random Fields) is trained over these features. The Deep Learning-based techniques use state-of-the-art models such as Recurrent Neural Networks (RNN), Long Short Term Memory (LSTM) networks, and transformer-based models.

Figure 2: Acronym Identification Systems Broad Classification

The Acronym Identification techniques can be traced back to the year 1999, when (Taghva and Gilbreth 1999) proposed a system named the Acronym Finding Program (AFP). It is a simple regex-based system that identifies candidate acronyms (upper-case words of three to ten characters). It attempts to find an acronym's expansion by scanning a 2n-word window, where n is the number of letters in the candidate acronym. The final system was evaluated on 1328 files. Many regex-based Acronym Identification systems were proposed afterward. The issue with regex-based techniques is that it is impossible to incorporate all the possible rules, and considerable manual work is required to identify patterns and then write rules.

A popular approach to detect acronyms was proposed by (Schwartz and Hearst 2002), with an implementation available at https://github.com/philgooch/abbreviation-extraction. This technique is capable of extracting complicated acronyms and their expansions. It works in two stages. The first stage identifies candidate acronyms using predefined patterns (an acronym adjacent to its expansion, in either order). In the second stage, the overlapping characters in an acronym and its expansion are counted, and this count is compared to a specified threshold. As this technique cannot store contextual information, it fails to extract acronyms and expansions with long-term dependencies.

One of the machine learning-based Acronym Identification systems, proposed by (Kuo et al. 2009), extracts features and then uses various algorithms such as Support Vector Machines, Naive Bayes, Logistic Regression, and Monte-Carlo Sampling Logistic Regression for training.

(Liu, Liu, and Huang 2017) proposed a Latent-state Neural Conditional Random Fields (LNCRF) system, which couples Conditional Random Fields with nonlinear hidden layers. This system models the task of Acronym Identification as a sequence labeling problem, and it surpassed many baseline models in existence at that time.

A machine learning-based approach proposed by (Jacobs, Itai, and Wintner 2020) aims to extract acronyms by automatically building an acronym-based dictionary from an unannotated dataset. One of the critical features of this system is that it is capable of extracting non-local acronyms too, i.e., extracting the expanded form of an acronym even if the short form is not present in the given sentence.

Another approach, based on a Long Short Term Memory - Conditional Random Field (LSTM-CRF) model and proposed by (Veyseh et al. 2020b), provides an in-depth treatment of both Acronym Identification and Disambiguation.

The rule-based and classical machine learning-based models are not able to capture contextual information. An acronym extraction system that combines machine learning with a neural network-based contextual model was proposed by (Kirchhoff and Turner 2016). This model can store contextual information and hence can also be used to solve the task of acronym disambiguation.

Effective Ensembling of Language Models for Sequence Labeling Problem: EELM-SLP

The proposed system architecture is depicted in figure 3. The main components of the Effective Ensembling of Language Models for Sequence Labeling Problem (EELM-SLP) system architecture are as follows:
1. Data Acquisition/Collection.
2. Finetuning of transformer-based Language Models for the sequence-labeling task.
3. Hyperparameter tuning and retraining of Language Models.
4. Ensembling and postprocessing to obtain final predictions.
5. Evaluation on the development and test datasets.

Figure 3: Effective Ensembling of Language Models for Sequence Labeling Problem (EELM-SLP) System Architecture

Data Acquisition/Collection

The entire dataset can be downloaded from https://github.com/amirveyseh/AAAI-21-SDU-shared-task-1-AI. A sample snapshot of the data is depicted in figure 1. The data is split into training, development, and test sets. The training dataset consists of 14006 labeled sentences (Veyseh et al. 2020b), and the development data consists of 1717 labeled sentences. Each datapoint in the train and development data has an 'id' representing the train or development datapoint ID, 'tokens', a list containing the sentence split into individual tokens, and 'labels', the token-wise annotations associated with those tokens. The test dataset consists of about 1750 datapoints where only 'id' and 'tokens' are provided. The 'labels' mark the short- and long-form acronyms in BIO format (Beginning, Inside, Outside); B-long, I-long, B-short, and I-short are the entity labels used, with O marking tokens outside any acronym or expansion.
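To make the data format concrete, the following minimal sketch loads the training file and inspects one datapoint. The file path follows the shared-task repository layout and may differ locally; this is illustrative only and not part of the authors' released code.

```python
import json

# Illustrative only: the path assumes the shared-task repository layout.
with open("AAAI-21-SDU-shared-task-1-AI/dataset/train.json", encoding="utf-8") as f:
    train_data = json.load(f)

sample = train_data[0]
print(sample["id"])      # datapoint identifier
print(sample["tokens"])  # the sentence split into individual tokens
print(sample["labels"])  # one BIO tag per token: B-short, I-short, B-long, I-long or O
```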
Finetuning of transformer-based Language Models for the sequence-labeling task

The proposed approach uses six transformer-based pretrained Language Models, which are finetuned on the downstream sequence labeling task. The following Language Models are used: BERT (Devlin et al. 2018), RoBERTa (Liu et al. 2019), XLM-RoBERTa (Conneau et al. 2019), CamemBERT (Martin et al. 2019), Longformer (Beltagy, Peters, and Cohan 2020) and DistilBERT (Sanh et al. 2019). BERT is trained with the objectives of masked language modeling (MLM) and next sentence prediction (NSP). For BERT, RoBERTa, XLM-RoBERTa, CamemBERT, and DistilBERT, the base-cased configurations are used; the AllenAI base-4096 configuration is used for Longformer.

The configurations of the finetuned Language Models are as follows:
• The BERT and CamemBERT models have 12 layers, 768-dimensional hidden states, 12 attention heads, and 109M parameters.
• RoBERTa has the same configuration as BERT, except that it has 125M parameters.
• XLM-RoBERTa has approximately 270M parameters, with 12 layers, 768-dimensional hidden states, 3072-dimensional feed-forward hidden states, and 8 attention heads, and is pretrained on CommonCrawl data in 100 languages.
• DistilBERT has fewer parameters (65M) and only 6 layers; it is distilled from the BERT-base configuration.
• Longformer has approximately 149M parameters; here, 4096 indicates that the model is pretrained on documents of maximum length 4096.

All the above-mentioned Language Models are finetuned on the training dataset for sentence-level sequence labeling. Every model is trained separately, and the latest iteration is stored.
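As a rough illustration (not the authors' released code), a single model can be finetuned for this token-labeling task with the Simple Transformers library used in the next subsection. The model strings follow Hugging Face naming, and the toy training frame below only shows the column layout the library expects; in practice the rows would be built from the 'tokens' and 'labels' fields of the training file.

```python
import pandas as pd
from simpletransformers.ner import NERModel, NERArgs

# Toy training frame in the column layout Simple Transformers expects for NER.
train_df = pd.DataFrame(
    [(0, "Artificial", "B-long"), (0, "Neural", "I-long"), (0, "Network", "I-long"),
     (0, "(", "O"), (0, "ANN", "B-short"), (0, ")", "O")],
    columns=["sentence_id", "words", "labels"],
)

args = NERArgs()
args.labels_list = ["B-long", "I-long", "B-short", "I-short", "O"]

# One of the six models; the others swap the type/name pair, e.g.
# ("bert", "bert-base-cased") or ("longformer", "allenai/longformer-base-4096").
model = NERModel("roberta", "roberta-base", args=args, use_cuda=True)
model.train_model(train_df)

# Token-level BIO predictions for an unseen sentence.
predictions, raw_outputs = model.predict(["Recurrent Neural Network ( RNN ) models are used ."])
```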
Hyperparameter tuning of Language Models for the Sequence Labeling Task

It was observed that during the inference phase the output length was truncated to 128 tokens, which is the default value of the 'max_seq_length' parameter. To preserve the entire token length, two parameters were investigated: 'sliding_window' and 'max_seq_length'. The sliding window prevents the truncation of sentences by splitting the input sequence into multiple windows if it exceeds the maximum sequence length. The problem with the sliding window is that contextual information is broken during prediction, and hence it was not used. Experiments were carried out with the max_seq_length parameter, and it was observed that the models performed best when max_seq_length was set to 350; the other values tried were 128, 256, 300, 400, 450, and 512. The language models were finetuned using the 'Simple Transformers' library (Rajapakse 2020, https://simpletransformers.ai/) and 'Hugging Face' pretrained models (Wolf et al. 2019, https://huggingface.co/). A GeForce RTX 2080 Ti 11GB GPU was used to train all the models on a system with 32GB of primary memory.

Table 1 lists the final hyperparameters used to finetune the Language Models on the sequence labeling task. The 'adam_epsilon' is the epsilon value used by the Adam optimizer. 'best_model_dir' is the directory where the best model is automatically stored after the completion of all epochs. 'cache_dir' is where the processed data is stored in a form consumable by PyTorch (Paszke et al. 2019). 'train_custom_parameters_only' is kept False, as we utilize all the parameters available in the Language Models. 'dataloader_num_workers' specifies the number of CPU workers used for data processing. As this is a sequence labeling task, the 'cased' configurations of the Language Models are used, and hence 'do_lower_case' is kept False. The 'early_stopping_metric' is the evaluation loss, and the 'early_stopping_patience' is kept at 3. The training process avoids evaluation while training; hence, 'evaluate_during_training' is kept False. 'fp16' corresponds to 16-bit training, sometimes referred to as mixed-precision training, and is kept True. 'gradient_accumulation_steps' is kept at 1 and the 'learning_rate' at 4e-05. The training was carried out on a single Graphics Processing Unit (GPU) system, so 'n_gpu' was fixed at 1. As the model should not be saved too frequently (to spare secondary storage), 'save_steps' is kept at a high value of 20,000. The number of training epochs was set to 50, but it was observed that the models converged after around 20 epochs; hence the training process was stopped manually as soon as the loss became stagnant at its lowest point. The 'labels_list' corresponds to the target labels and was set to ["B-long", "I-long", "B-short", "I-short", "O"]. 'max_seq_length' is one of the hyperparameters that was tuned. These are some of the hyperparameters known to have a large impact on finetuning Language Models for a sequence labeling task. After the final hyperparameters were determined, all the Language Models were finetuned again for the sentence-level sequence labeling task to identify acronyms in a given input text.

Table 1: Final hyperparameters used to finetune the Language Models for the sequence labeling task

Hyperparameter                      Value
adam_epsilon                        1e-08
best_model_dir                      outputs/best_model/
cache_dir                           cache/
train_custom_parameters_only        False
dataloader_num_workers              4
do_lower_case                       False
early_stopping_consider_epochs      False
early_stopping_delta                0
early_stopping_metric               eval_loss
early_stopping_metric_minimize      True
early_stopping_patience             3
encoding                            null
eval_batch_size                     8
evaluate_during_training            False
evaluate_during_training_silent     True
evaluate_during_training_steps      2000
evaluate_during_training_verbose    False
fp16                                True
gradient_accumulation_steps         1
learning_rate                       4e-05
local_rank                          -1
logging_steps                       50
max_grad_norm                       1.0
max_seq_length                      350
multiprocessing_chunksize           500
n_gpu                               1
no_cache                            False
no_save                             False
num_train_epochs                    50
output_dir                          outputs/
overwrite_output_dir                True
process_count                       4
reprocess_input_data                True
save_best_model                     True
save_eval_checkpoints               True
save_model_every_epoch              True
save_steps                          20000
save_optimizer_and_scheduler        True
silent                              False
train_batch_size                    16
use_cached_eval_features            False
use_early_stopping                  False
use_multiprocessing                 True
warmup_ratio                        0.06
warmup_steps                        2628
weight_decay                        0
classification_report               False
labels_list                         B-long, I-long, B-short, I-short, O
lazy_loading                        False
lazy_loading_start_line             0
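Assuming the Simple Transformers library named above, the Table 1 settings map onto its NERArgs configuration object roughly as follows. Only a representative subset is shown; the remaining fields keep the library defaults listed in the table.

```python
from simpletransformers.ner import NERArgs

# Representative subset of the Table 1 settings expressed as NERArgs fields.
args = NERArgs()
args.labels_list = ["B-long", "I-long", "B-short", "I-short", "O"]
args.max_seq_length = 350            # tuned value; the default of 128 truncates tokens
args.learning_rate = 4e-05
args.adam_epsilon = 1e-08
args.num_train_epochs = 50           # training was stopped manually around epoch 20
args.train_batch_size = 16
args.eval_batch_size = 8
args.gradient_accumulation_steps = 1
args.do_lower_case = False           # cased models, so casing is preserved
args.fp16 = True                     # mixed-precision training
args.evaluate_during_training = False
args.save_steps = 20000              # avoid saving checkpoints too frequently
args.warmup_ratio = 0.06
args.weight_decay = 0
args.output_dir = "outputs/"
args.best_model_dir = "outputs/best_model/"
```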
Ensembling and postprocessing to obtain final predictions

After all the language models were finetuned on the training data, the ensembling technique was applied. The advantage of ensembling is that a robust discriminator can be constructed from comparatively weaker models. In this paper, an ensembling approach based on RoBERTa prioritization is used; table 2 shows that the F1-score of RoBERTa on the development dataset is higher than that of the other individual models. The steps are as follows (a minimal sketch of the voting scheme is given after the list):
1. Initially, the predictions from all six finetuned language models are extracted.
2. For every sentence, token-level predictions are collected from all six models. Hence, for each token of every sentence, six predictions are obtained. Each prediction is one of the labels 'O', 'B-short', 'I-short', 'B-long' and 'I-long'.
3. If most models predict 'O' but any model predicts one of 'B-short', 'I-short', 'B-long' or 'I-long', then the non-'O' prediction is considered.
4. If there is a conflict between 'B-short', 'I-short', 'B-long', and 'I-long', then the prediction suggested by RoBERTa is taken. If, in this case, RoBERTa's prediction is 'O', then the prediction with the highest frequency is considered.
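The following is a minimal sketch of the voting scheme just described, assuming the per-token labels from the six models are available as plain strings; it is an illustration of the stated rules, not the authors' implementation.

```python
from collections import Counter

def ensemble_token(predictions, roberta_prediction):
    """Combine the six per-token predictions for a single token.

    `predictions` holds one BIO label per model and `roberta_prediction`
    is the label proposed by the finetuned RoBERTa model.
    """
    non_o = [p for p in predictions if p != "O"]
    if not non_o:
        return "O"                               # every model predicts 'O'
    if len(set(non_o)) == 1:
        return non_o[0]                          # a single non-'O' label was proposed
    if roberta_prediction != "O":
        return roberta_prediction                # conflicting labels: defer to RoBERTa
    return Counter(non_o).most_common(1)[0][0]   # RoBERTa says 'O': take the majority

# Example: four models predict 'O', while RoBERTa and one other model disagree.
print(ensemble_token(["O", "O", "B-short", "O", "B-long", "O"], roberta_prediction="B-short"))
```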
The above ensembling technique increased the recall. Its limitation is that it sometimes predicted 'I-short' and 'I-long' as the beginning tags of a span, which lowered the precision for some tokens. To overcome this problem, the following post-processing rules were applied (a short sketch of these fixes is given at the end of this subsection):

• If a predicted tag sequence starts with 'I-short' (an 'I-short' not preceded by 'B-short' or 'I-short'), it is replaced with 'B-short'.
• The pattern 'O', 'I-long' is replaced with 'B-long', 'I-long' to maintain consistency.

By applying these post-processing rules, the precision improved, and hence the proposed EELM-SLP system surpassed the other individually trained models on the development dataset.

All the models were trained on an 11GB Nvidia RTX 2080 Ti GPU for 20-22 epochs, with each epoch taking around three to four minutes. Each model took around 90 minutes to train, and due to early stopping, the training was stopped as soon as the loss stopped decreasing. The proposed architecture is efficient and can be integrated into an actual production environment. The prediction pipeline is designed to take advantage of CPU parallelism: if the input contains a batch of sentences, the predictions are computed in parallel, independent batches and then combined in the order in which the sentences appear in the original paragraph or list.
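The two BIO-consistency fixes can be expressed as a small function over a sentence's predicted tag sequence. This sketch reflects one straightforward reading of the rules above and is illustrative only.

```python
def postprocess(tags):
    """Repair ensembled BIO tags according to the two rules described above."""
    fixed = list(tags)
    for i, tag in enumerate(fixed):
        # Rule 1: a span must not begin with 'I-short'.
        if tag == "I-short" and (i == 0 or fixed[i - 1] not in ("B-short", "I-short")):
            fixed[i] = "B-short"
        # Rule 2: the pattern 'O', 'I-long' becomes 'B-long', 'I-long'.
        if tag == "O" and i + 1 < len(fixed) and fixed[i + 1] == "I-long":
            fixed[i] = "B-long"
    return fixed

# Example: the leading 'I-short' and the 'O' before 'I-long' are both repaired.
print(postprocess(["I-short", "O", "O", "I-long", "I-long"]))
# -> ['B-short', 'O', 'B-long', 'I-long', 'I-long']
```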
Results and Discussions

Table 2 depicts the evaluation metrics on the development dataset. The top-3 performers among the individual language models by F1-score are RoBERTa, followed by XLM-RoBERTa, followed by BERT. RoBERTa achieves the highest individual F1-score of 92.18%, XLM-RoBERTa achieves an F1-score of 91.95%, and the third-best performing model, BERT, achieves an F1-score of 91.6%. The ensembling technique surpasses the hyperparameter-finetuned individual Language Models and achieves a precision of 92.36%, recall of 97.2%, and F1-score of 94.72%. The proposed Effective Ensembling of Language Models for Sequence Labeling Problem (EELM-SLP) approach achieves the highest F1-score of 95.6% on the development data. It can be clearly seen that the proposed approach surpasses the other individual language models by a considerable margin. For all the individual finetuned models and for the proposed approach, the class-wise scores, namely precision, recall, and F1-score for the 'long' and 'short' labels, are also reported.

Table 2: Evaluation results for the development dataset (all values in percentage). For each algorithm (hyperparameters finetuned), the overall scores are followed by the class-wise scores for the 'long' and 'short' labels.

Algorithm                                  Precision  Recall  F1     Long P  Long R  Long F1  Short P  Short R  Short F1
Ensembling + Post Processing (EELM-SLP)    93.35      97.95   95.6   91.43   97.27   94.26    95.27    98.63    96.92
Ensembling Technique                       92.36      97.2    94.72  90.9    96.65   93.69    93.81    97.74    95.74
RoBERTa                                    93.19      91.2    92.18  91.65   92.56   92.1     94.74    89.83    92.22
XLM-RoBERTa                                93.05      90.89   91.95  91.49   92      91.75    94.61    89.77    92.12
BERT                                       92.34      90.88   91.6   90.43   90.27   90.35    94.24    91.48    92.84
CamemBERT                                  93.45      88.93   91.13  92.32   90.64   90.98    95.58    87.23    91.21
Longformer                                 93.14      88.59   90.81  90.49   90.2    90.34    95.8     86.97    91.17
DistilBERT                                 91.04      90.51   90.77  88.47   88.03   88.25    93.6     92.98    93.29

In table 3, the evaluation on the test dataset is presented using the standard evaluation metrics of precision, recall, and F1-score. Initially, the test dataset was evaluated with the RoBERTa model finetuned before hyperparameter tuning. This initial evaluation resulted in a precision, recall, and F1-score of 90.85%, 91.73%, and 91.29%, respectively. RoBERTa was then finetuned again after hyperparameter tuning, which further improved the metrics: the hyperparameter-tuned RoBERTa achieved a precision of 90.26%, recall of 92.46%, and F1-score of 91.34%. One of the significant highlights is the result obtained after applying the ensembling technique, which improved the F1-score from 91.34% to 91.86%. Finally, the highest F1-score was achieved after applying the post-processing step: the final proposed system achieved an F1-score of 92.08%, precision of 89.7%, and recall of 94.59%.

Table 3: Evaluation results for the test dataset (all values in percentage)

Algorithm                                  Precision  Recall  F1-score
Ensembling + Post Processing (EELM-SLP)    89.7       94.59   92.08
Ensembling Technique                       89.93      93.87   91.86
RoBERTa after hyperparameter tuning        90.26      92.46   91.34
RoBERTa before hyperparameter tuning       90.85      91.73   91.29
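The shared task supplies its own scorer, which is not reproduced here; as a rough approximation, entity-level precision, recall and F1 over BIO tags can be computed with the seqeval package, as the sketch below shows on toy data.

```python
from seqeval.metrics import precision_score, recall_score, f1_score

# Toy gold/predicted tag sequences; in practice these would be the development-set
# annotations and the ensembled, post-processed predictions, one list per sentence.
y_true = [["B-long", "I-long", "I-long", "O", "B-short", "O"]]
y_pred = [["B-long", "I-long", "I-long", "O", "B-short", "O"]]

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("f1-score: ", f1_score(y_true, y_pred))
```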
Conclusion and Future Scope

In this paper, an Effective Ensembling of Language Models for Sequence Labeling Problem (EELM-SLP) approach is proposed. The importance of the ensembling technique is clearly visible in the development and test dataset results. The final model achieves a precision of 93.35%, recall of 97.95%, and F1-score of 95.6% on the development dataset, and a precision of 89.7%, recall of 94.59%, and F1-score of 92.08% on the test set. On the development dataset, the proposed approach improves all three metrics of precision, recall, and F1-score. On the test dataset, although the recall and F1-score are the highest, the precision is lowered; this might be due to false positives introduced by the ensembling and post-processing steps. On the one hand, the ensembling and post-processing steps increase the recall and F1-score, but on the other hand, they lower the precision. Hence, the post-processing can be refined further. Moreover, this paper uses all the language models in their 'base' configuration and not in their 'large' configuration. In the future, experiments can be carried out with 'large'-configuration language models, which might further improve the scores.

References

Beltagy, I.; Peters, M. E.; and Cohan, A. 2020. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150.

Conneau, A.; Khandelwal, K.; Goyal, N.; Chaudhary, V.; Wenzek, G.; Guzmán, F.; Grave, E.; Ott, M.; Zettlemoyer, L.; and Stoyanov, V. 2019. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.

Cook, B. 2019. ACRONYM: Acronym CReatiON for You and Me. arXiv preprint arXiv:1903.12180.

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Jacobs, K.; Itai, A.; and Wintner, S. 2020. Acronyms: identification, expansion and disambiguation. Annals of Mathematics and Artificial Intelligence 88(5): 517–532.

Kirchhoff, K.; and Turner, A. M. 2016. Unsupervised resolution of acronyms and abbreviations in nursing notes using document-level context models. In Proceedings of the Seventh International Workshop on Health Text Mining and Information Analysis, 52–60.

Kuo, C.-J.; Ling, M. H.; Lin, K.-T.; and Hsu, C.-N. 2009. BIOADI: a machine learning approach to identifying abbreviations and definitions in biological literature. In BMC Bioinformatics, volume 10, S7. Springer.

Liu, J.; Liu, C.; and Huang, Y. 2017. Multi-granularity sequence labeling model for acronym expansion identification. Information Sciences 378: 462–474.

Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; and Stoyanov, V. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Mack, C. A. 2012. How to write a good scientific paper: acronyms. Journal of Micro/Nanolithography, MEMS, and MOEMS 11(4): 040102.

Martin, L.; Muller, B.; Suárez, P. J. O.; Dupont, Y.; Romary, L.; de la Clergerie, É. V.; Seddah, D.; and Sagot, B. 2019. CamemBERT: a tasty French language model. arXiv preprint arXiv:1911.03894.

Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, 8026–8037.

Rajapakse, T. 2020. Simple Transformers. https://simpletransformers.ai/.

Sanh, V.; Debut, L.; Chaumond, J.; and Wolf, T. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.

Schwartz, A. S.; and Hearst, M. A. 2002. A simple algorithm for identifying abbreviation definitions in biomedical text. In Biocomputing 2003, 451–462. World Scientific.

Taghva, K.; and Gilbreth, J. 1999. Recognizing acronyms and their definitions. International Journal on Document Analysis and Recognition 1(4): 191–198.

Veyseh, A. P. B.; Dernoncourt, F.; Nguyen, T. H.; Chang, W.; and Celi, L. A. 2020a. Acronym Identification and Disambiguation shared tasks for Scientific Document Understanding. arXiv preprint arXiv:2012.11760.

Veyseh, A. P. B.; Dernoncourt, F.; Tran, Q. H.; and Nguyen, T. H. 2020b. What Does This Acronym Mean? Introducing a New Dataset for Acronym Identification and Disambiguation. arXiv preprint arXiv:2010.14678.

Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. 2019. HuggingFace's Transformers: State-of-the-art Natural Language Processing. arXiv preprint arXiv:1910.03771.