SciDr at SDU-2020: IDEAS - Identifying and Disambiguating Everyday Acronyms for Scientific Domain

Aadarsh Singh, Priyanshu Kumar*
Indian Institute of Technology (Indian School of Mines) Dhanbad, Jharkhand, India
aadarshsingh191198@gmail.com, kpriyanshu256@gmail.com

* Authors have equal contribution.
Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

We present our systems submitted for the shared tasks of Acronym Identification (AI) and Acronym Disambiguation (AD) held under the Workshop on SDU. We mainly experiment with BERT and SciBERT. In addition, we assess the effectiveness of "BIOless" tagging and blending, along with the prowess of ensembling, in AI. For AD, we formulate the problem as a span prediction task, experiment with different training techniques and also leverage the use of external data. Our systems rank 11th and 3rd in the AI and AD tasks respectively.

1 Introduction

An acronym is an abbreviation formed from the initial letters of other words and pronounced as a word. The usage of acronyms in articles and speech has increased as it avoids the effort of remembering long complex terms. However, this increased usage of acronyms has also given rise to the problems of Acronym Identification (AI) and Acronym Disambiguation (AD). AI is the task of identifying which parts of a sentence constitute acronyms and their corresponding long forms, whereas AD is the task of correctly predicting the long form expansion of an acronym given a context of its usage. AI and AD are beneficial for applications like question answering (Ackermann et al. 2020) and definition extraction (Kumar et al. (2020); Singh, Kumar, and Sinha (2020)). Since both AI and AD benefit from domain knowledge, manual identification and disambiguation of acronyms by domain experts is possible; however, it is tiresome and expensive. Hence, there is a dire need to develop intelligent systems that can mimic the role of domain experts and help us automate AI and AD.

In this paper, we present our approach for the shared tasks of Acronym Identification and Acronym Disambiguation held under the workshop on Scientific Document Understanding (SDU). The problem of AI is treated as a sequence tagging problem. AD is treated as a span prediction problem, i.e. given a sentence containing an acronym and the possible long forms of that acronym, we aim to extract, from the possible expansions, the span which is the most appropriate long form of the acronym as per the context in the sentence. We start the experimentation process for AI with rule based models. The experiments on both tasks are then extended to a Transformer (Vaswani et al. 2017) based architecture with BERT (Devlin et al. 2018) as the backbone of the model, followed by SciBERT (Beltagy, Lo, and Cohan 2019), which is also a BERT-based model but is pretrained on text from scientific research papers instead of the Wikipedia corpus. In addition, for AD, we experiment with different training procedures, aiming to instill knowledge about various topics into our models.

The rest of the paper is organized as follows: related work is discussed in section 2, followed by a brief description of the shared task datasets in section 3. The methodology and experimental settings are covered in sections 4 and 5. Sections 6 and 7 contain the results and discussion. Section 8 concludes the paper and also includes the scope of future work.

2 Related Work

Initial works on AI incorporate the use of rule-based methods. Park and Byrd (2001) present rule based methods for finding acronyms in free text. They make use of various patterns, text markers and linguistic cue words to detect acronyms and also their definitions. Schwartz and Hearst (2002) make use of the fact that the majority of acronyms and their long forms are found in close vicinity in a sentence, with one of them enclosed between parentheses, and thus extract short and long form pairs from sentences. They also propose an algorithm for identifying the correct long forms.
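To make this family of methods concrete, the following is a minimal sketch of the parenthesis-based pairing idea, in the spirit of (but not identical to) the Schwartz and Hearst (2002) algorithm; the function names and the suffix-trimming heuristic are our own illustrative choices.

def find_candidate(words_before, short_form):
    # Heuristic from Schwartz and Hearst (2002): a long form rarely
    # spans more than min(|A| + 5, 2 * |A|) words, |A| = acronym length.
    max_words = min(len(short_form) + 5, 2 * len(short_form))
    return words_before[-max_words:]

def matches(short_form, candidate_words):
    # The characters of the short form must appear, in order,
    # within the candidate long form (case-insensitive).
    text = " ".join(candidate_words).lower()
    pos = -1
    for ch in short_form.lower():
        pos = text.find(ch, pos + 1)
        if pos == -1:
            return False
    return True

def extract_pairs(tokens):
    # Collect (short form, long form) pairs for "long form ( SF )" patterns.
    pairs = []
    for i, tok in enumerate(tokens):
        if tok == "(" and i + 2 < len(tokens) and tokens[i + 2] == ")":
            short = tokens[i + 1]
            cand = find_candidate(tokens[:i], short)
            # Keep the shortest suffix of the window that still matches.
            while len(cand) > 1 and matches(short, cand[1:]):
                cand = cand[1:]
            if cand and matches(short, cand):
                pairs.append((short, " ".join(cand)))
    return pairs

print(extract_pairs("We use Hidden Markov Models ( HMM ) here".split()))
# [('HMM', 'Hidden Markov Models')]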
People have also tried to leverage web-search queries and logs to identify acronym-expansion pairs. A framework for automatic acronym extraction on a large scale was proposed by Jain, Cucerzan, and Azzam (2007). They scrape the web for candidate sentences (those containing acronym-expansion pairs) and then identify acronym-expansion pairs using search query logs and search results. They also try to rank acronym expansions by assigning a score to expansions using various factors. Taneva et al. (2013) target the problem of finding distinct expansions for an acronym. They make use of query click logs and clustering techniques to extract candidate expansions of acronyms and group them such that each group has a unique meaning. They then assign scores to the grouped expansions to find the appropriate expansion.

A comprehensive comparative study between rule-based and machine learning based methods for identifying and resolving acronyms has been done by Harris and Srinivasan (2019). They collect data from various resources and then experiment with machine learning algorithms, crowd-sourcing methods and a game based approach.

Liu, Liu, and Huang (2017) treat AI as a sequence labelling problem and propose the Latent-state Neural Conditional Random Fields model (LNCRF), which is superior to CRFs in handling complex sentences by making use of nonlinear hidden layers. The incorporation of neural networks into CRFs enables the learning of better representations from manually created features, which helps performance.

Many works solve the AD task by creating word vectors and then using them to rank the candidate expansions of the acronym with reference to its usage. McInnes et al. (2011) correlate acronym disambiguation with word sense disambiguation. They create 2nd order vectors of all possible long forms and the acronym with the help of word co-occurrences. The correct long form is then found using cosine similarity between the vectors. Li et al. (2018) present an end-to-end pipeline for acronym disambiguation in the enterprise domain. Due to the lack of a mapping of acronyms to their long forms, they first use data mining techniques to create a knowledge base. Further, they treat acronym disambiguation as a ranking problem and create ranking models using some manually created features.

With the advent of deep learning, researchers have tried to create more informative word vectors for the previous approach. Wu et al. (2015) first use deep learning to create neural word embeddings from medical domain data. They combine the word embeddings of a sample text in different ways and then train a Support Vector Machine (SVM) classifier for each acronym. Charbonnier and Wartena (2018) explore acronym disambiguation in the scientific research domain. They obtain word vectors from the text of scientific research papers and create vector representations for the context of the acronym. Distance minimisation between the vector of the context and that of an acronym expansion gives the appropriate expansion. Ciosici, Sommer, and Assent (2019) present an unsupervised approach for acronym disambiguation by treating it as a word prediction problem. They use word2vec (Mikolov et al. 2013) to simultaneously learn word embeddings and to predict the correct special token (a concatenation of short and long form) of a sentence. The obtained word embeddings are used to create representations of the context of the short form, and the best expansion of the short form is obtained from the candidates by minimising the distance between representations.

Many works also treat AD as a classification problem. Jin, Liu, and Lu (2019) explore the usage of contextualised BioELMo word embeddings for acronym disambiguation. They train separate BiLSTM classifiers for each acronym which output the appropriate expansion when a text is input, achieving state of the art performance on the PubMed dataset. Li et al. (2019) propose a novel neural topic attention mechanism to learn better contextualised representations for medical term acronym disambiguation. They compare the performance of LSTMs with ELMo embeddings armed with different types of attention mechanisms.

An overview of the submissions made to the shared tasks of AI and AD has been provided by the organizers (Veyseh et al. 2020a).

3 Datasets

Veyseh et al. (2020b) provide the shared task participants with a dataset for each of the AI and AD tasks, called SciAI and SciAD respectively. SciAI contains 17,506 sentences from research papers, in which the boundaries of acronyms and their long forms are labelled using the BIO format. The tag set consists of B-short, B-long, I-short, I-long and O, "short" representing the acronym and "long" representing the expansion. SciAD contains 62,441 instances covering acronyms used in the scientific domain. Each instance contains the sentence, the acronym and the correct expansion of that acronym as per its usage in the sentence. The dataset also comes with a dictionary which maps the acronyms to candidate long forms.
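For concreteness, the shape of the two datasets can be pictured as below; the field names are illustrative stand-ins, not necessarily the exact release schema.

# Illustrative only: a SciAI sample pairs tokens with BIO labels,
# while a SciAD sample references an acronym and its gold expansion,
# resolved against a shared acronym-to-candidates dictionary.
sciai_sample = {
    "tokens": ["Hidden", "Markov", "Models", "(", "HMM", ")", "are", "used"],
    "labels": ["B-long", "I-long", "I-long", "O", "B-short", "O", "O", "O"],
}
sciad_sample = {
    "tokens": ["We", "train", "an", "HMM", "on", "raw", "text", "."],
    "acronym": 3,  # index of the acronym token
    "expansion": "hidden markov models",
}
dictionary = {"HMM": ["hidden markov models", "heavy meromyosin"]}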
Both datasets differ from the existing datasets for AI and AD in that they are larger in size and contain instances belonging to the scientific domain (the majority of existing AI and AD datasets belong to the medical domain).

4 Methodology

4.1 Models

Since both tasks are similar, we try out the following models for both of them and then build upon them:

• BERT: BERT, based on the Transformer architecture, consists of multiple attention heads which apply a sequence-to-sequence transformation on the input text sequence. The training objectives of BERT make it unique: the Masked Language Model (MLM) objective learns to predict a masked token using the left and right context of the text sequence, and BERT also learns to predict whether two sentences occur in continuation or not (Next Sentence Prediction).

• SciBERT: The Allen Institute for Artificial Intelligence (AI2) pretrains the base version of BERT (SciBERT) on scientific text from 1.14 million research papers from Semantic Scholar. Owing to the similarity between the domain of the shared task dataset and SciBERT's training corpus, we believe the model will be beneficial for the tasks. We use SciBERT with SciVocab in our experiments.

4.2 AI

Problem Formulation: We can naturally cast the AI task as a NER (Named Entity Recognition) / BIO tagging task, the tags being the short-form and long-form labels of the words in BIO format. One of the interesting experiments that we perform is to make use of "BIOless" tags. Keeping all other factors constant, classifiers ought to work better if the number of classes is smaller; tagging is a token classification task, so the tagger should perform better if the number of tags is reduced. The following changes are carried out on the training data to obtain "BIOless" tags:

1. B-short and I-short tags are changed to B-short.
2. B-long and I-long tags are changed to B-long.
3. O tags are unchanged.

The models are trained on these tags and, once predictions are obtained, the definitions of the B, I and O tags (viz. beginning, inside and outside) are used to reconstruct the original tags: the first tag in a contiguous cluster is changed to B-short or B-long, and the rest of the cluster to I-short or I-long.
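A minimal sketch of this transformation and its inverse, assuming tags are plain strings, is given below; note that two adjacent entities of the same type would merge into one cluster, an inherent ambiguity of the BIOless scheme.

def to_bioless(tags):
    # Collapse I-short/I-long into B-short/B-long; O stays unchanged.
    return [t.replace("I-", "B-") for t in tags]

def to_bio(tags):
    # Reconstruct BIO tags: the first tag of a contiguous cluster keeps
    # its B- prefix, the remaining tags of the cluster become I-.
    out = []
    for i, t in enumerate(tags):
        if t != "O" and i > 0 and tags[i - 1] == t:
            out.append(t.replace("B-", "I-"))
        else:
            out.append(t)
    return out

tags = ["B-long", "I-long", "O", "B-short"]
assert to_bio(to_bioless(tags)) == tags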
Models: We experiment with the following models/variations of those already mentioned; a sketch of the CRF feature extraction follows this list.

• Conditional Random Fields (CRFs): Consider labelling sentences with POS (Part Of Speech) tags; it is highly probable that a NOUN is followed by a VERB. Tasks of this kind fall under a category which is essentially a combination of classification (classifying a word into one of the POS tags) and graphical modelling (one word influencing the POS tags of other words). Thus, these tasks involve predicting a large number of variables that depend on each other as well as on other observed variables. CRFs are a popular probabilistic method suitable for such tasks: they combine the ability of graphical models to compactly model multivariate data with the ability of classification methods to perform prediction using large sets of input features. For the current data, we use the following features as input:

For the current word -
a. The lower cased version of the word
b. The last three letters of the word
c. Whether all characters of the word are upper case
d. Whether the word is title cased
e. The POS tag of the word
f. The first two characters of the POS tag of the word
g. Whether at least 60% of the word is upper case

For neighbouring words -
a. The lower cased version of the word
b. Whether the word is title cased
c. Whether all characters of the word are upper case
d. The POS tag of the word
e. The first two characters of the POS tag of the word

• BERT base cased: We use the cased base version of BERT as the backbone of our Transformer-CRF architecture.

• SciBERT cased: We use the cased version of SciBERT as the backbone of our Transformer-CRF architecture.
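A sketch of such a feature extractor, written in the dictionary style expected by CRF toolkits like sklearn-crfsuite (feature names are our own, and POS tags are assumed to be precomputed):

def word_features(tokens, pos_tags, i):
    # Features for the token at position i; tokens and pos_tags are
    # parallel lists for one sentence.
    w = tokens[i]
    feats = {
        "word.lower": w.lower(),          # a. lower cased word
        "word[-3:]": w[-3:],              # b. last three letters
        "word.isupper": w.isupper(),      # c. all characters upper case
        "word.istitle": w.istitle(),      # d. title cased
        "postag": pos_tags[i],            # e. POS tag
        "postag[:2]": pos_tags[i][:2],    # f. first two POS characters
        # g. at least 60% of the characters are upper case
        "word.60pct_upper": sum(c.isupper() for c in w) >= 0.6 * len(w),
    }
    for off in (-1, 1):                   # neighbouring words
        j = i + off
        if 0 <= j < len(tokens):
            n = tokens[j]
            feats[f"{off}:word.lower"] = n.lower()
            feats[f"{off}:word.istitle"] = n.istitle()
            feats[f"{off}:word.isupper"] = n.isupper()
            feats[f"{off}:postag"] = pos_tags[j]
            feats[f"{off}:postag[:2]"] = pos_tags[j][:2]
    return feats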
Post Modelling Experiments: The process of ensembling gave a major boost over the scores of the base models. We used two kinds of ensembling:

• Majority Voting/Hard Voting (Wu et al. 2006): The idea is to simply go with what the majority of the models in the ensemble predict. In classification, the final prediction is the mode of the predictions of the participating models; similarly, in a tagging task, or rather token classification, the final prediction for a given sequence is the sequence of token-wise modes of the prediction sequences of the participating models. Assume y is a label, x is a token, N is the total number of base taggers employed, and T_i is a function that returns 1 if the prediction of the i-th tagger for x is y, and 0 otherwise. Then the score W(y, x) is defined as:

W(y, x) = \sum_{i=1}^{N} T_i(y, x)

The y with the highest score is chosen as the label of x (a minimal sketch of this voting appears at the end of this subsection).

• Blending (Sikdar and Gambäck 2017): We depict our process of blending models in Figure 1. The whole process consists of the following 3 stages:

a. The base models are trained on the training data, and predictions are then made on the validation data using these models.
b. The predictions obtained in the previous stage are used as the features for this stage. A CRF is fit on these features using 5-fold cross validation.
c. The five trained models obtained in the previous stage are ensembled using majority voting to make the final prediction.

Figure 1: Blending for AI.
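As a minimal sketch of the token-level hard voting defined above (assuming the N base taggers emit equal-length tag sequences; names are ours):

from collections import Counter

def hard_vote(sequences):
    # sequences: the tag sequences predicted by the N base taggers for
    # one sentence. For each token x, W(y, x) counts the taggers that
    # predict tag y; the highest-scoring y is chosen.
    return [Counter(tags).most_common(1)[0][0] for tags in zip(*sequences)]

preds = [["B-short", "O", "B-long"],
         ["B-short", "O", "O"],
         ["B-short", "B-long", "B-long"]]
print(hard_vote(preds))  # ['B-short', 'O', 'B-long']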
4.3 AD

Problem Formulation: Many existing works on AD solve the problem as text classification, i.e. given a text and an acronym, classify the long form of the acronym, or by developing rich word vector representations from which the most suitable full form is extracted out of the candidate long forms. We, instead, treat AD as a span prediction problem: the model predicts the span containing the correct long form from the concatenated text consisting of the acronym, the candidate long forms of that acronym and the sentence (in that order). The predicted span is then compared with the candidate long forms and the best match as per Jaccard score is chosen.

Each approach has its own shortcomings. For the classification approach, the size of the model increases with the dictionary size; training models for a large number of classes is difficult. A solution to this problem is to build individual models per acronym, but this might not be feasible if there are many acronyms. For the vector based methods, achieving rich representations is difficult. As for the span prediction approach, the handling of long inputs is difficult and time-consuming; we may have to compromise on the context of the acronym in order to accommodate long sequences.

To prepare our input text for the model, we take advantage of the fact that BERT can encode a pair of sequences together. The first sequence is the acronym concatenated with all its possible expansions from the dictionary, and the second sequence is the input text. Since some of the input sentences are quite long, we sample tokens from the sentences: in order to feed sufficient context of the acronym into the models, we take n/2 space delimited tokens to the left of the acronym and n/2 space delimited tokens to its right, where n is a hyperparameter. We find in our experiments that taking n sufficiently large gives almost consistent performance; we fix n to 120.

We experiment with different training approaches and pretrained weights, keeping the architecture of our model constant in all cases. The backbone of the architecture is the base version of BERT. The sequence output of the last layer of BERT (shape = (batch_size, max_len, 768)) is passed through a dense layer to reduce its shape to (batch_size, max_len, 2). The output is split into 2 parts along the last axis to obtain our token level logits for the start position and the end position. A pictorial representation of the model can be found in Figure 2.

Figure 2: Model Architecture for AD; SP and EP stand for Start Probability and End Probability.
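A sketch of this architecture using the Hugging Face transformers library is shown below; the checkpoint name is one plausible choice, and the training loop is omitted.

import torch.nn as nn
from transformers import AutoModel

class SpanPredictor(nn.Module):
    # BERT/SciBERT backbone followed by a dense layer mapping each
    # token's hidden state to a (start, end) logit pair.
    def __init__(self, backbone="allenai/scibert_scivocab_uncased"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(backbone)
        self.head = nn.Linear(self.bert.config.hidden_size, 2)  # 768 -> 2

    def forward(self, input_ids, attention_mask):
        hidden = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        logits = self.head(hidden)                # (batch, max_len, 2)
        start_logits, end_logits = logits.split(1, dim=-1)
        return start_logits.squeeze(-1), end_logits.squeeze(-1)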
Models: We experiment with the following models:

• BERT base uncased: We use the uncased base version of BERT as the backbone of our model.

• SciBERT uncased: We use the uncased version of SciBERT as the backbone of our model.

• SciBERT uncased with fine tuned LM: The dataset does not contain samples for all acronym expansions. Hence, models trained only on the provided dataset may suffer when it comes to predicting unseen acronym expansions. We try to instill some knowledge of the acronym expansions into our model by fine tuning the MLM. We scrape Wikipedia for articles (using the Wikipedia API, https://pypi.org/project/wikipedia/) related to the long forms of the acronyms present in the dictionary and fine tune the LM of SciBERT on this data. We then use the fine tuned weights for the SciBERT backbone and train it for span prediction.

• SciBERT uncased with 2 stage training: We train the model in 2 stages using different data. We prepare our own dataset from the articles scraped from Wikipedia, in which occurrences of long forms of acronyms are replaced by the acronym. We first train our model on this data and then on the shared task data. This is a supervised approach to help the model learn acronyms and expansions under-represented in the shared task data, as compared to the above approach, which is unsupervised.

Post Modelling Experiments:

• Ensemble: Since our approach outputs start and end probability distributions over the entire sequence of tokens, we cannot average probabilities from models using different tokenizers. Keeping this in mind, we average the probabilities from the two best models (as per CV), i.e. SciBERT uncased and SciBERT uncased with 2 stage training. The appropriate acronym expansion is then extracted with the help of this averaged probability, which adds robustness to our predictions.

• Ensemble with post-processing: We also devise a post-processing step that can rectify some of the mistakes of our models. All the post-processing does is the following: if a candidate expansion of an acronym is present in the sentence and the acronym is enclosed within parentheses in the sentence, then that candidate expansion is predicted as the expansion of the acronym. The motivation for this post-processing is discussed in Section 7.
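A sketch of this rule, assuming whitespace-tokenized, lower-cased comparison (function and argument names are ours):

def postprocess(sentence, acronym, candidates, model_prediction):
    # If the acronym is enclosed within parentheses and one of its
    # candidate expansions occurs verbatim in the sentence, predict
    # that expansion, overriding the model.
    s = sentence.lower()
    if f"( {acronym.lower()} )" in s:
        for cand in candidates:
            if cand.lower() in s:
                return cand
    return model_prediction

sent = "Locality preserving projections ( LPP ) reduce dimensionality ."
print(postprocess(sent, "LPP",
                  ["locality preserving projections", "load planning problem"],
                  "load planning problem"))
# locality preserving projections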
5 Experimental Settings

For the AI task, there are three kinds of experimental settings:

a. The base models were trained on the training data and evaluated on the validation data.
b. For the better performing base models, we concatenate the training and validation data and perform a 5 fold cross-validation on the concatenated dataset.
c. For blending, we perform a 5 fold cross-validation on the validation data.

For each of the above settings, training was done for 20 epochs using early stopping with a patience of 10. Model optimisation was done using BertAdam with a learning rate of 1e-3, a batch size of 16 and a gradient accumulation batch size of 32.

For the AD task, we concatenate the training and validation data and perform a 5 fold stratified cross-validation on the joined dataset (stratified with respect to acronym), as sketched below. The folds are trained for 5 epochs using early stopping with a patience of 2 and a tolerance of 1e-3. Model optimisation is done using AdamW (Loshchilov and Hutter 2018) with a learning rate of 2e-5 and a batch size of 32.
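A minimal sketch of the stratified splitting using scikit-learn; the sample fields are illustrative stand-ins for the task data.

from sklearn.model_selection import StratifiedKFold

# Toy stand-in for the combined train + validation data.
samples = [{"acronym": a, "sentence": "..."} for a in ["CNN", "RNN"] * 5]
acronyms = [s["acronym"] for s in samples]

# Stratify by acronym so each fold sees a similar acronym distribution.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(samples, acronyms)):
    train = [samples[i] for i in train_idx]
    val = [samples[i] for i in val_idx]
    # ... train for 5 epochs with early stopping (patience 2) ...
    print(fold, len(train), len(val))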
6 Results

6.1 AI

The macro F1 scores of our approaches are listed in Table 1. For the base models, validation is done using the validation data. Only the promising models, in our case the SciBERT models, are taken through the arduous cross validation process.

It should also be noted that the folds for the cross validation of the modified blending technique are extracted from the validation data, unlike the SciBERT models, which are cross validated on the combined data (train + validation); hence the two CV scores are not comparable.

Table 1: Results of AI task.

Model | Val | CV | Test
Baseline | 0.8546 | - | 0.8409
CRF | 0.8254 | - | -
CRF BIOless | 0.7994 | - | -
BERT cased | 0.9145 | - | -
BERT cased BIOless | 0.9163 | - | -
SciBERT cased | 0.9173 | - | 0.8921
SciBERT cased BIOless | 0.9165 | - | 0.9005
SciBERT cased | - | 0.9075 | 0.9023
SciBERT cased BIOless | - | 0.9073 | 0.9036
Blending with mode ensembling | - | 0.8962 | 0.9090

The other observations are enumerated as follows:

a. The official baseline, though rule based, surpasses CRF.
b. As expected, SciBERT performs better than BERT.
c. As for the BIOless variants:
• CRFs see a considerably big difference (0.026) between the BIOless and BIO variants. The hypothesis that "the tagger should perform better if the number of tags is reduced" seems to fail here. The present AI task seems a bit too complex for CRFs, as they do not even surpass the baseline score of 0.84; hence it is justifiable to treat CRFs as an exception with respect to the hypothesis.
• For all the other models/variations, BIOless is very close to the BIO variant (a difference of 0.0008 or 0.0002) or surpasses it (with a relatively larger difference of 0.0084 or 0.0013).
d. Based on the test scores, BIOless variants perform better than their corresponding BIO counterparts.
e. The test score undoubtedly shows the eminence of the modified blending technique.

Table 2 shows a comparison of our results with the top scoring submissions of the AI task.

Table 2: Comparison of AI results.

User / Team Name | Test Score
zdq | 0.9330
qinpersevere | 0.9311
Mobius | 0.9281
SciDr (Us) | 0.9090

6.2 AD

We tabulate the macro F1 scores of the models in the cross-validation and test settings in Table 3. The performance of SciBERT is superior to BERT owing to the similarity of the pretraining corpus and the task dataset. We also observe that the performance of SciBERT uncased and SciBERT uncased with 2 stage training is almost similar in both cross-validation and test, with the latter performing a bit better than the former, whereas the performance of the variant with fine-tuned LM is lower. A possible reason for this observation is the difference between the source of the data used for fine tuning (Wikipedia) and the shared task data (scientific papers). The usage of the extra data created from Wikipedia is beneficial for the model since it contains samples for some acronyms under-represented in the task dataset.

Table 3: Results of AD task.

Model | CV | Test
Baseline | - | 0.6097
BERT uncased | 0.7549 | 0.8980
SciBERT uncased | 0.8423 | 0.9244
SciBERT uncased with fine tuned LM | 0.8278 | 0.9194
SciBERT uncased with 2 stage training | 0.8424 | 0.9292
Ensemble | - | 0.9303
Ensemble with post-processing | - | 0.9319

Table 4 lists the scores of the top submissions for the AD task.

Table 4: Comparison of AD results.

User / Team Name | Test Score
DeepBlueAI | 0.9405
qwzhong | 0.9373
SciDr (Us) | 0.9319
del2z | 0.9266

7 Discussion

7.1 AI

The best proposed method for the AI task involves the use of the following three main building blocks:

• SciBERT as the base model
• the BIOless variant
• the modified blending technique, i.e. the blending method coupled with hard voting.

The reason for SciBERT performing better than the BERT model lies in the fact that its pretraining corpus is similar to our dataset. The hypothesis for using BIOless variants instead of the conventional technique seems to hold true (points c, d and e in Subsection 6.1).

Ensembling has always helped in the domain of Machine Learning. The third block, viz. the modified blending technique, is a combination of two propitious methods - blending and hard voting - and ultimately went on to give the best results. The baseline method used by the organizers had a low F1, but the precision obtained was quite good compared to that of the SciBERT cased BIOless model with hard voting. The only way to employ the adroitness of the baseline model was to stack it (and some other better performing models) with the SciBERT cased BIOless model. As is visible in Table 5, the blended model improved considerably, especially with respect to precision.

Table 5: F1, Precision and Recall of some models used in the AI task.

Model | F1 | Precision | Recall
Baseline | 0.8409 | 0.9131 | 0.7793
SciBERT cased BIOless with hard voting | 0.9036 | 0.8987 | 0.9086
Blending with mode ensembling | 0.9090 | 0.9097 | 0.9083

Figure 3 presents some sentences tagged incorrectly by the SciBERT model (Figure 3: A few erroneously tagged instances for AI). Ideally, the analysis should have been done on the best model, but it is too complex to interpret. Looking at DEV-297 and DEV-42, it is clear that the gold truths have some annotation flaws: HMM is clearly an acronym for Hidden Markov Models and yet is not labelled. Similarly, RNN, CNN and WiFi are acronyms for Recurrent Neural Network, Convolutional Neural Network and Wireless Fidelity respectively, but only CNN is marked in the ground truth. Also, "complicated neural network" is not a full form, but is used to point out the complications of the RNN and CNN networks. Our base model does well in predicting the right tags for these samples.

On the other hand, we find that in DEV-1313 and DEV-593 the model has completely failed to identify the long forms, and has also misidentified a few short forms. Two likely causes could be:

• improper tokenization of the dataset
• "and", "-", "of" etc. in between long forms

7.2 AD

The formulation of AD as a span prediction problem is quite efficient from the performance and computational expense points of view. A complete cross-validation run under the described experimental settings takes about 6 hours on average on an NVIDIA Tesla P100.

Speaking of the results, for the out-of-fold predictions of SciBERT uncased, we observe that the model errs mainly on acronyms which do not have many occurrences in the task dataset. This motivated us to attempt instilling knowledge into our models via external data.

We first examine the differences between the test set predictions of SciBERT uncased, SciBERT uncased with 2 stage training and their ensemble (represented as Normal, Stage and Ensemble respectively) to understand the difference between the models and to find out which model exhibits more confidence in its predictions.

We examine those samples where all three predictions are different (Table 6). It can be observed that the predictions of SciBERT uncased seem quite appropriate as per the context, and the contributions from the Stage model change the final prediction. There are 92 instances in the test predictions where any of the three predictions differ; these are the instances where the ensemble submission gets its test score boost.

Table 6: Mismatch of predictions between SciBERT uncased (Normal), SciBERT uncased with 2 stage training (Stage) and their soft ensemble (Ensemble).

Id | Acronym | Text | Normal | Stage | Ensemble
TS-633 | FM | Ultimately , once we select an FM , the ChI becomes a specific operator . | feature map | fuzzy measure | factorization machines
TS-811 | GS | Additionally , using WSE ( GS 's search ) we obtained 84.4 accuracy with an FPR of 0.157 and AUC value of 0.918 . | genetic search | google scholar | gold standard
TS-5682 | EL | Thus , with EL system ( ) , only two structures are possible for : ( i ) , and ( ii ) , . | external links | euler - lagrange | entity linking

We observe that some of the samples in the test set do not contain sufficient context to support disambiguation. This can be an issue, and it is difficult to say how the models will behave in such situations. Some of these samples are shown in Table 7. For the text with id TS-5572, the possible long forms of LPP are "locality preserving projections" and "load planning problem"; each model predicts one of the expansions, and both expansions seem relevant in the given context. Similar arguments hold for the text with id TS-5830, where the models get confused between "global convolution networks" and "graph convolution networks".

Table 7: Instances lacking sufficient context for AD.

Id | Acronym | Text
TS-5572 | LPP | The LPP can be briefly described as follows .
TS-5830 | GCN | Effect of both kernels added at end to get actual GCN output .

Many of the instances in the test set are such that the long form expansion of the acronym is present in the text and the acronym is present within parentheses.
Our models correctly predict the long form for most of these instances, but miss out on a few occasions. This motivated us to devise a post-processing step for such instances, where we directly check for this condition and predict accordingly, overwriting the model predictions.

8 Conclusion

We presented our approaches for Acronym Identification and Acronym Disambiguation in the scientific domain. The usage of SciBERT in both tasks is beneficial because of the similarity between its training corpus and the task domain. We addressed AI as a tagging problem; our experiments prove the usefulness of the data transformation using BIOless tags, and the adroitness of blending incorporated with hard voting. We approached AD as a span prediction problem; our experimental work demonstrates the effect of pretrained weights, external data, ensembling and post-processing. Our analysis provides some interesting insights into some of the shortcomings of the models and also some flaws in the dataset annotation. For future work, we can experiment with data augmentation and observe the behaviour of the models on both AI and AD.

9 Appendix

The source code of our approaches for AI and AD can be found at:

• AI: https://github.com/aadarshsingh191198/AAAI-21-SDU-shared-task-1-AI
• AD: https://github.com/aadarshsingh191198/AAAI-21-SDU-shared-task-2-AD

Acknowledgements

We thank Google Colab and Kaggle for their free computational resources.

References

Ackermann, C. F.; Beller, C. E.; Boxwell, S. A.; Katz, E. G.; and Summers, K. M. 2020. Resolution of acronyms in question answering systems. US Patent 10,572,597.

Beltagy, I.; Lo, K.; and Cohan, A. 2019. SciBERT: A pretrained language model for scientific text. arXiv preprint arXiv:1903.10676.

Charbonnier, J.; and Wartena, C. 2018. Using word embeddings for unsupervised acronym disambiguation.

Ciosici, M.; Sommer, T.; and Assent, I. 2019. Unsupervised Abbreviation Disambiguation: Contextual disambiguation using word embeddings. arXiv preprint arXiv:1904.00929.

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Harris, C. G.; and Srinivasan, P. 2019. My Word! Machine versus Human Computation Methods for Identifying and Resolving Acronyms. Computación y Sistemas 23(3).

Jain, A.; Cucerzan, S.; and Azzam, S. 2007. Acronym-expansion recognition and ranking on the web. In 2007 IEEE International Conference on Information Reuse and Integration, 209–214. IEEE.

Jin, Q.; Liu, J.; and Lu, X. 2019. Deep Contextualized Biomedical Abbreviation Expansion. arXiv preprint arXiv:1906.03360.

Kumar, P.; Singh, A.; Kumar, P.; and Kumar, C. 2020. An explainable machine learning approach for definition extraction. In International Conference on Machine Learning, Image Processing, Network Security and Data Sciences, 145–155. Springer.

Li, I.; Yasunaga, M.; Nuzumlalı, M. Y.; Caraballo, C.; Mahajan, S.; Krumholz, H.; and Radev, D. 2019. A Neural Topic-Attention Model for Medical Term Abbreviation Disambiguation. arXiv preprint arXiv:1910.14076.

Li, Y.; Zhao, B.; Fuxman, A.; and Tao, F. 2018. Guess Me if You Can: Acronym Disambiguation for Enterprises. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1308–1317.

Liu, J.; Liu, C.; and Huang, Y. 2017. Multi-granularity sequence labeling model for acronym expansion identification. Information Sciences 378: 462–474.

Loshchilov, I.; and Hutter, F. 2018. Fixing weight decay regularization in Adam.
McInnes, B.; Pedersen, T.; Liu, Y.; Pakhomov, S.; and Melton, G. B. 2011. Using second-order vectors in a knowledge-based method for acronym disambiguation. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning, 145–153.

Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Park, Y.; and Byrd, R. J. 2001. Hybrid text mining for finding abbreviations and their definitions. In Proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing.

Schwartz, A. S.; and Hearst, M. A. 2002. A simple algorithm for identifying abbreviation definitions in biomedical text. In Biocomputing 2003, 451–462. World Scientific.

Sikdar, U. K.; and Gambäck, B. 2017. A feature-based ensemble approach to recognition of emerging and rare named entities. In Proceedings of the 3rd Workshop on Noisy User-generated Text, 177–181.

Singh, A.; Kumar, P.; and Sinha, A. 2020. DSC IIT-ISM at SemEval-2020 Task 6: Boosting BERT with Dependencies for Definition Extraction. In Proceedings of the Fourteenth Workshop on Semantic Evaluation, 710–716. Barcelona (online): International Committee for Computational Linguistics. URL https://www.aclweb.org/anthology/2020.semeval-1.93.

Taneva, B.; Cheng, T.; Chakrabarti, K.; and He, Y. 2013. Mining acronym expansions and their meanings using query click log. In Proceedings of the 22nd International Conference on World Wide Web, 1261–1272.

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008.

Veyseh, A. P. B.; Dernoncourt, F.; Nguyen, T. H.; Chang, W.; and Celi, L. A. 2020a. Acronym Identification and Disambiguation Shared Tasks for Scientific Document Understanding. arXiv preprint arXiv:2012.11760.

Veyseh, A. P. B.; Dernoncourt, F.; Tran, Q. H.; and Nguyen, T. H. 2020b. What Does This Acronym Mean? Introducing a New Dataset for Acronym Identification and Disambiguation. In Proceedings of COLING.

Wu, C.-W.; Jan, S.-Y.; Tsai, R. T.-H.; and Hsu, W.-L. 2006. On using ensemble methods for Chinese named entity recognition. In Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing, 142–145.

Wu, Y.; Xu, J.; Zhang, Y.; and Xu, H. 2015. Clinical abbreviation disambiguation using neural word embeddings. In Proceedings of BioNLP 15, 171–176.