Participation of UC3M in SDU@AAAI-21: A Hybrid Approach to Disambiguate Scientific Acronyms

Areej Jaber,1,2 Paloma Martínez1
1 Computer Science Department, Universidad Carlos III de Madrid, 30 Av. de la Universidad, 28911 Leganés, Madrid, Spain
2 Applied Computer Science Department, PTUK University, Jaffa street, 7 Tulkarem, Palestine
a.jabir@ptuk.edu.ps, pmf@inf.uc3m.es

Copyright © 2021, for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Acronym disambiguation is a word sense disambiguation (WSD) task that consists of determining the correct expansion of an acronym in a given context. This paper describes three hybrid systems to disambiguate acronyms in scientific documents, which combine three supervised machine learning (ML) models (Support Vector Machine, Naive Bayes and K-Nearest Neighbor) with cosine similarity on the SciAD corpus. Our system achieved its best performance on the independent test set with Naive Bayes and cosine similarity, with 92.15% precision, 77.97% recall and 84.47% F1-macro.

Introduction

Acronyms are defined as 'a short form of multiple words or phrases' and are used in various types of documents. Normally an acronym's meaning is given the first time it is used in a document, but in many cases it is used alone without its meaning, as in clinical documents.

There are no standard rules for creating acronyms. Usually each acronym has more than one meaning, called an 'expansion' or 'long form'. Writing an acronym without its expansion in the same sentence makes it ambiguous. Determining the correct expansion of an acronym depends on many factors, such as the domain in which it is used. For example, the acronym 'ED' could mean 'Emergency Department' in documents from the medical domain, or 'Euclidean Distance' in documents from the mathematics domain. Furthermore, an acronym can have several expansions even within the same domain: 'RNN' has two possible expansions, 'Recurrent Neural Network' and 'Random Neural Networks', both of which are used in computer science.

Word Sense Disambiguation (WSD) is a Natural Language Processing (NLP) task that can be applied to determine the right expansion of an acronym based on its context. There are two types of WSD. All-words WSD disambiguates all words in the given context. Lexical sample WSD disambiguates a specific word in the context. Acronym disambiguation is considered a special case of lexical sample WSD.

Three main approaches are applied extensively to WSD. The first is the knowledge based approach, which integrates lexical knowledge bases and exploits semantic similarity and graph-based methods. In similarity-based methods, each expansion of the ambiguous acronym is compared to the content words appearing near the acronym (context words), and the expansion with the highest similarity (for instance, using cosine similarity) is taken to be the right one (Billami 2017).

Unsupervised ML approaches disambiguate by finding hidden structure in unlabelled data, for instance by clustering documents or sentences into groups, each one representing an expansion (Charbonnier and Wartena 2018).

Finally, supervised ML approaches require tagged corpora. Under this approach, WSD is treated as a text classification problem in which the objective is to predict the correct expansion of an acronym among its different expansions (Melacci, Globo, and Rigutini 2018). Supervised approaches achieve high performance in this type of task, but they require annotated data, which is expensive to generate. To address this problem, semi-supervised approaches are applied, in which training data are automatically generated from a few annotated examples (da Silva Sousa, Milios, and Berton 2020).

We explore word embeddings in this work as features to be used in ML algorithms; a preliminary analysis is presented in (Jaber and Martínez 2021). A word embedding is a real-valued vector that represents a single word based on the context in which it appears (Khattak et al. 2019). These numerical word representations can be built using different models, such as (Mikolov et al. 2013), (Peters et al. 2018) and (Devlin et al. 2019), which are based on different neural network architectures. Conveniently, these embeddings can be trained on a large data set, saved and reused to solve other tasks; they are called pre-trained word embeddings or pre-trained models.

In this paper, three supervised ML models combined with a knowledge based model are used to disambiguate scientific acronyms for the SDU@AAAI-21 shared task (Veyseh et al. 2020a). The rest of the paper is organized as follows: the Method section describes the data set used in this study, the features and the different models that we applied. In the Strategies section we describe how the proposed methods are experimentally conducted. Finally, we present our results compared to the baseline system.

Method

Acronyms are ambiguous because they can have multiple expansions. Determining the correct expansion of an acronym is a WSD problem. Since SciAD contains only a small set of examples for some acronyms, we combined supervised machine learning with knowledge based approaches to tackle this problem.

Data Set

The SciAD corpus (Veyseh et al. 2020b), created by the organizers of AAAI-21 shared task 2, is used in this task. SciAD was generated from 6,786 English papers from arXiv with 2,031,592 sentences. Table 1 shows the detailed numbers of annotated samples in the three data sets: training, development and test.

              Training    Development    Test
Sentences     50034       6189           6218
Tokens        1548278     190654         190111
Acronyms      731         611            618
Expansions    2150        1233           -

Table 1: Description of training, development and test data sets.

Figure 1 shows the frequencies of annotated examples per acronym; 299 acronyms have fewer than 20 annotated examples in the training data set.

Figure 1: Frequency of each number of examples per acronym across train, development and test data sets.

Additionally, the organizers provided the participants with an acronym dictionary, which contains 732 acronyms and 2308 senses, with an average of 3.15 senses per acronym. Figure 2 shows the distribution of senses for the acronyms contained in the dictionary.

Figure 2: Number of senses per acronym in the dictionary. E.g. we see that there are 437 acronyms with two expansions.

Model

Baseline In order to familiarize the participants with the task, the organizers provided a rule-based baseline in the code directory. This baseline computes the frequency of the long forms in the training data set. Afterwards, to make a prediction for each acronym in the development data set, it selects the long form with the highest frequency as the final prediction. If there is a tie, the long form that appears first among all tied long forms in the dictionary is selected as the final prediction.

Supervised ML Three supervised ML algorithms are implemented:

• Support Vector Machine (SVM): separates positive samples from negative ones by learning a linear hyperplane from a labeled data set. SVM is adapted to multi-class classification so that it can be used in WSD.
• Naive Bayes (NB): a probabilistic approach with a long history of success in WSD. It is based on Bayes' theorem and computes the conditional probability of each sense of an abbreviation given a set of features.
• K-Nearest Neighbor (KNN): classification is done by computing the Euclidean distance between each test vector and the training vectors, and using the k most similar training vectors.

Knowledge Based Approach For acronyms with few examples in the data set, which are insufficient to train a supervised ML method, a knowledge based approach is implemented. This method relies on the expansions dictionary provided by the organizers; cosine similarity was applied to the test examples. Two vectors are considered similar when their cosine similarity is close to 1 and dissimilar when it is close to 0 (Singhal 2001).

Features

Features play an important role in a WSD system. Two types of features were used.

WSD Features: Several lexical features were used to disambiguate acronyms, considering both the left and the right context of the target acronym. Our system adopted a set of lexical features that have been used successfully in WSD. Given a sentence $s$ formed by a sequence of words $[\ldots, w_{k-2}, w_{k-1}, w_k, w_{k+1}, w_{k+2}, \ldots]$, where $w_k$ is the target ambiguous acronym, we extracted the following features:

1. Word features: stemmed words for each token on both sides of the target acronym.
2. Word features with direction: the relative direction (left or right side) of the stemmed words.
3. POS (Part-Of-Speech) tag: POS tag feature for each token on both sides.
4. Position features: the distance between the feature word and the target acronym.
5. Word formation features from the acronym itself, including special characters, capital letters and numbers.

Pre-trained word embedding features: We used a pre-trained word embedding model with 300-dimensional vectors, built with FastText (Joulin et al. 2016) and generated from several English resources such as Wikipedia and data from the Common Crawl project (Mikolov et al. 2018).

Strategies

Pre processing data & Features extracting

Several pre-processing steps were conducted on the data set before extracting the features, including removing stop words and special characters and stemming the words. For the supervised ML approaches, features are formed by combining the WSD lexical features with the summation of the pre-trained word embeddings, which is computed with the following equation:

$$S = \sum_{i=0,\; i \neq k}^{|W|} v(W(i)) \tag{1}$$

where $W$ is the list of words surrounding the target acronym, $|W|$ is the length of the list, $v(\cdot)$ is the FastText pre-trained word embedding mentioned in the previous subsection, and $k$ is the position of the target acronym.

For the knowledge based approach, on the other hand, only the summed pre-trained word embedding vectors were generated, both for each example and for the candidate expansions extracted from the expansions dictionary.

Training Phase

In this phase, the training and development data sets were combined to increase the size of the data set for each acronym.

Our goal was to build a model that predicts an acronym's expansion from its context for each acronym that has more than 20 annotated examples. To achieve this goal, the training data was separated into one data set per acronym. Table 2 shows the distribution of the whole data set over the ML and knowledge based (KB) approaches: 450 acronyms with 53702 annotated examples are disambiguated by the three ML models (SVM, NB and KNN), while 282 acronyms with 2521 annotated examples are disambiguated by the cosine similarity method.

         Data set size    Acronyms    Expansions
ML       53702            450         1601
KB       2521             282         594
Total    56223            732         2195

Table 2: Distribution of data sets and acronyms over the two proposed models in the training phase.

Testing Phase

When the testing data set was released by the organizers, we divided it according to the partition of the training data described above (see Figure 4). Table 3 shows the distribution of the testing data set over the two models: 444 acronyms with 5876 annotated examples in the testing data set are disambiguated by the three ML models, and 174 acronyms with 342 annotated testing examples are disambiguated with the cosine similarity method. Figure 3 summarizes the overall process of the proposed system.

                    Data set size    # of acronyms
Machine Learning    5876             444
Knowledge based     342              174
Total               6218             618

Table 3: Distribution of data sets and acronyms over the two proposed models in the testing phase.

Figure 3: Overview of proposed approach to disambiguate acronyms.

Figure 4: Data flow chart in training and testing phases.

Evaluation & Result

The system performance was evaluated using three metrics. Precision is defined as the number of instances that actually have class label X among those classified as X (True Positives), divided by the number of all instances that were classified as X:

$$\text{Precision} = \frac{\text{TruePositives}}{\text{TruePositives} + \text{FalsePositives}} \tag{2}$$

Recall is defined as the number of instances correctly classified as class X, divided by the number of all instances that actually have class X:

$$\text{Recall} = \frac{\text{TruePositives}}{\text{TruePositives} + \text{FalseNegatives}} \tag{3}$$

F1-macro is defined as the harmonic mean of Precision and Recall:

$$F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4}$$

The training data set includes 634 expansions, from different acronyms, with fewer than 10 annotated examples each. To balance the data set, these expansions were replicated through oversampling techniques using the sklearn library. Then 5-fold cross validation was used for all acronyms in the ML models. Furthermore, the training data set contains 10 non-ambiguous acronyms, whose data sets contain only one expansion.

Table 4 shows our results in the training phase. NB with cosine similarity achieved the highest precision (90.31%) and recall (87.16%), with an F1-macro of 84.37%.

           Precision    Recall    F1-macro
NB-KB      90.31%       87.16%    84.37%
SVM-KB     90.20%       86.78%    88.16%
KNN-KB     83.85%       79.59%    79.53%

Table 4: Averaged performance of the three proposed hybrid approaches in the training phase.

Table 5 shows the final scores of our systems as reported by the organizers. The best performance, with precision 92.15%, recall 77.97% and F1-macro 84.47%, was achieved by the hybrid approach with NB and cosine similarity.

        Precision    Recall    F1-macro
NB      92.15%       77.97%    84.47%
SVM     91.66%       73.33%    81.48%
KNN     90.26%       67.51%    77.25%

Table 5: Averaged performance of the three proposed hybrid approaches on the testing data set.

Preliminary Analysis of Errors

A sample of acronyms with low accuracy in the training phase shows how strongly an imbalanced data set affects the model. We focus on the Naive Bayes approach, since the best result was achieved with it. Table 6 shows the accuracy for four acronyms. The acronym ARD, whose 247 examples are distributed between two expansions, 'accelerated robust distillation' with 46 training examples and 'adversarially robust distillation' with 201 training examples, achieved the lowest accuracy, 38%.

Acronym    Data set size    Number of expansions    Accuracy    Expansion                           Number of examples per expansion
MSE        501              3                       52%         mean squared error                  462
                                                                minimum square error                10
                                                                model selection eqn                 29
GP         552              2                       61%         gaussian process                    466
                                                                geometric programming               86
CNN        2973             4                       58%         citation nearest neighbour          14
                                                                complicated neural networks         1
                                                                condensed nearest neighbor          33
                                                                convolutional neural network        2925
ARD        247              2                       38%         accelerated robust distillation     46
                                                                adversarially robust distillation   201

Table 6: Distribution of data set size over expansions and the accuracy of the Naive Bayes model on sample acronyms.

Conclusion

In this paper, we introduced a system to disambiguate scientific acronyms. Our system's best score was achieved by a hybrid approach combining supervised ML Naive Bayes and cosine similarity, with precision 92.15%, recall 77.97% and F1-macro 84.47%.

Acknowledgments.

Thanks to Palestine Technical University-Kadoorie (PTUK) and the DeepEMR project (TIN2017-87548-C2-1-R) for partially funding this work.

References

Billami, M. 2017. A Knowledge-Based Approach to Word Sense Disambiguation by distributional selection and semantic features. CoRR abs/1702.08450. URL http://arxiv.org/abs/1702.08450.

Charbonnier, J.; and Wartena, C. 2018. Using Word Embeddings for Unsupervised Acronym Disambiguation. In Bender, E. M.; Derczynski, L.; and Isabelle, P., eds., Proceedings of the 27th International Conference on Computational Linguistics, COLING 2018, Santa Fe, New Mexico, USA, August 20-26, 2018, 2610–2619. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/C18-1221/.

da Silva Sousa, S. B.; Milios, E. E.; and Berton, L. 2020. Word sense disambiguation: an evaluation study of semi-supervised approaches with word embeddings. In 2020 International Joint Conference on Neural Networks, IJCNN 2020, Glasgow, United Kingdom, July 19-24, 2020, 1–8. IEEE. doi:10.1109/IJCNN48605.2020.9207225. URL https://doi.org/10.1109/IJCNN48605.2020.9207225.

Devlin, J.; Chang, M. W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL HLT 2019 - 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the Conference 1(Mlm): 4171–4186.

Jaber, A.; and Martínez, P. 2021. Disambiguating Clinical Abbreviations Using Pre-trained Word Embeddings. To appear in Proceedings of the 14th International Joint Conference on Biomedical Engineering Systems and Technologies: HEALTHINF. INSTICC.

Joulin, A.; Grave, E.; Bojanowski, P.; and Mikolov, T. 2016. Bag of Tricks for Efficient Text Classification. arXiv preprint arXiv:1607.01759.

Khattak, F. K.; Jeblee, S.; Pou-Prom, C.; Abdalla, M.; Meaney, C.; and Rudzicz, F. 2019. A survey of word embeddings for clinical text. Journal of Biomedical Informatics: X 4(October): 100057. ISSN 2590177X. doi:10.1016/j.yjbinx.2019.100057. URL https://doi.org/10.1016/j.yjbinx.2019.100057.

Melacci, S.; Globo, A.; and Rigutini, L. 2018. Enhancing Modern Supervised Word Sense Disambiguation Models by Semantic Lexical Resources. In Calzolari, N.; Choukri, K.; Cieri, C.; Declerck, T.; Goggi, S.; Hasida, K.; Isahara, H.; Maegaard, B.; Mariani, J.; Mazo, H.; Moreno, A.; Odijk, J.; Piperidis, S.; and Tokunaga, T., eds., Proceedings of the Eleventh International Conference on Language Resources and Evaluation, LREC 2018, Miyazaki, Japan, May 7-12, 2018. European Language Resources Association (ELRA). URL http://www.lrec-conf.org/proceedings/lrec2018/summaries/112.html.

Mikolov, T.; Grave, E.; Bojanowski, P.; Puhrsch, C.; and Joulin, A. 2018. Advances in Pre-Training Distributed Word Representations. In Calzolari, N.; Choukri, K.; Cieri, C.; Declerck, T.; Goggi, S.; Hasida, K.; Isahara, H.; Maegaard, B.; Mariani, J.; Mazo, H.; Moreno, A.; Odijk, J.; Piperidis, S.; and Tokunaga, T., eds., Proceedings of the Eleventh International Conference on Language Resources and Evaluation, LREC 2018, Miyazaki, Japan, May 7-12, 2018. European Language Resources Association (ELRA). URL http://www.lrec-conf.org/proceedings/lrec2018/summaries/721.html.

Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.; and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems 1–9. ISSN 10495258.

Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L. 2018. Deep Contextualized Word Representations. In Walker, M. A.; Ji, H.; and Stent, A., eds., Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), 2227–2237. Association for Computational Linguistics. doi:10.18653/v1/n18-1202. URL https://doi.org/10.18653/v1/n18-1202.

Singhal, A. 2001. Modern Information Retrieval: A Brief Overview. IEEE Data Eng. Bull. 24(4): 35–43. URL http://sites.computer.org/debull/A01DEC-CD.pdf.

Veyseh, A. P. B.; Dernoncourt, F.; Nguyen, T. H.; Chang, W.; and Celi, L. A. 2020a. Acronym Identification and Disambiguation Shared Tasks for Scientific Document Understanding. CoRR abs/2012.11760. URL https://arxiv.org/abs/2012.11760.

Veyseh, A. P. B.; Dernoncourt, F.; Tran, Q. H.; and Nguyen, T. H. 2020b. What Does This Acronym Mean? Introducing a New Dataset for Acronym Identification and Disambiguation. CoRR abs/2010.14678. URL https://arxiv.org/abs/2010.14678.
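
As an illustration of the rule-based baseline described under Model (most frequent long form in the training data, with ties resolved by dictionary order), a minimal Python sketch follows. It is not the organizers' code: the function name, the `(acronym, long_form)` input format and the toy data are assumptions made here for illustration, and every acronym is assumed to have at least one long form in the dictionary.

```python
from collections import Counter


def frequency_baseline(train_pairs, dictionary):
    """Map each acronym to its most frequent long form in the training pairs.

    train_pairs: iterable of (acronym, long_form) tuples (hypothetical format).
    dictionary:  dict mapping acronym -> non-empty list of long forms,
                 kept in dictionary order.
    """
    counts = Counter(train_pairs)
    predictions = {}
    for acronym, long_forms in dictionary.items():
        # max() returns the first maximal element, so a tie is resolved in
        # favour of the long form that appears first in the dictionary.
        predictions[acronym] = max(
            long_forms, key=lambda lf: counts[(acronym, lf)]
        )
    return predictions


# Toy data, invented for illustration only.
toy_dictionary = {
    "RNN": ["recurrent neural network", "random neural networks"],
    "ED": ["emergency department", "euclidean distance"],
}
toy_train = (
    [("RNN", "recurrent neural network")] * 3
    + [("RNN", "random neural networks")]
    + [("ED", "euclidean distance"), ("ED", "emergency department")]
)
toy_predictions = frequency_baseline(toy_train, toy_dictionary)
print(toy_predictions)
```

In the toy data, 'ED' has one training example per long form, so the sketch falls back to dictionary order for that acronym.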
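
The summed-embedding representation of Equation (1) and the cosine-similarity selection described under Knowledge Based Approach can be sketched as follows. This is a simplified illustration under stated assumptions, not the system itself: the three-dimensional `toy_embeddings` table stands in for the 300-dimensional FastText vectors, stop-word removal and stemming are omitted, out-of-vocabulary words are simply skipped, and all function names are hypothetical.

```python
import math


def sum_vectors(words, embeddings, dim):
    """Sum the embedding vectors of the given words."""
    total = [0.0] * dim
    for word in words:
        vector = embeddings.get(word)
        if vector is None:
            continue  # assumption: out-of-vocabulary words are skipped
        for j, value in enumerate(vector):
            total[j] += value
    return total


def context_vector(tokens, k, embeddings, dim):
    """Equation (1): sum v(W(i)) over all positions i except the target k."""
    return sum_vectors(tokens[:k] + tokens[k + 1:], embeddings, dim)


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def pick_expansion(tokens, k, candidates, embeddings, dim):
    """Return the candidate expansion most similar to the context vector."""
    context = context_vector(tokens, k, embeddings, dim)
    return max(
        candidates,
        key=lambda c: cosine_similarity(
            context, sum_vectors(c.split(), embeddings, dim)
        ),
    )


# Toy 3-dimensional vectors, invented for illustration only.
toy_embeddings = {
    "patient": [1.0, 0.0, 0.0],
    "admitted": [0.7, 0.1, 0.1],
    "hospital": [0.9, 0.1, 0.0],
    "emergency": [0.8, 0.2, 0.0],
    "department": [0.5, 0.5, 0.0],
    "euclidean": [0.0, 0.1, 0.9],
    "distance": [0.0, 0.2, 0.8],
}
toy_tokens = ["patient", "admitted", "ED", "hospital"]
toy_choice = pick_expansion(
    toy_tokens, 2, ["emergency department", "euclidean distance"],
    toy_embeddings, 3,
)
print(toy_choice)
```

With these toy vectors the medical context lies much closer to 'emergency department' than to 'euclidean distance', which is the behaviour the similarity rule relies on.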
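
As a worked check of Equation (4), the following short Python sketch recomputes F1 from the precision and recall values reported in Table 5. It assumes that the reported F1-macro is the harmonic mean of the reported averaged precision and recall, which is how Equation (4) reads; the function name is hypothetical.

```python
def f1_from_precision_recall(precision, recall):
    """Equation (4): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1-macro) from Table 5, as fractions.
table5 = {
    "NB": (0.9215, 0.7797, 0.8447),
    "SVM": (0.9166, 0.7333, 0.8148),
    "KNN": (0.9026, 0.6751, 0.7725),
}

for name, (precision, recall, reported_f1) in table5.items():
    computed = f1_from_precision_recall(precision, recall)
    print(f"{name}: computed {computed:.4f}, reported {reported_f1:.4f}")
```

Under that assumption, the recomputed values agree with the test-set scores in Table 5 to within rounding.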