Participation of UC3M in SDU@AAAI-21: A Hybrid Approach to Disambiguate Scientific Acronyms

Areej Jaber,1,2 Paloma Martínez1

1 Computer Science Department, Universidad Carlos III de Madrid, 30 Av. de la Universidad, 28911 Leganés, Madrid, Spain
2 Applied Computer Science Department, PTUK University, Jaffa street, 7 Tulkarem, Palestine
a.jabir@ptuk.edu.ps, pmf@inf.uc3m.es

Copyright © 2021, for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Acronym disambiguation is a word sense disambiguation (WSD) task that consists of determining the correct expansion of an acronym in a given context. This paper describes three hybrid systems for disambiguating acronyms in scientific documents. Each combines one of three supervised machine learning (ML) models (Support Vector Machine, Naive Bayes and K-Nearest Neighbor) with cosine similarity, and all are evaluated on the SciAD corpus. On the independent test set, our best performance was achieved by Naive Bayes combined with cosine similarity, with 92.15% precision, 77.97% recall and 84.47% F1-macro.

Introduction

Acronyms are defined as "a short form of multiple words or phrases" and are used in many types of documents. Normally the meaning of an acronym is given the first time it is used in a document, but in many cases, as in clinical documents, it appears alone without its meaning.

There are no standard rules for creating acronyms. An acronym usually has more than one meaning, each called an "expansion" or "long form". Writing an acronym without its expansion in the same sentence makes it ambiguous. Determining the correct expansion depends on many factors, such as the domain in which the acronym is used. For example, the acronym "ED" could mean "Emergency Department" in documents from the medical domain, or "Euclidean Distance" in documents from the mathematics domain. Furthermore, an acronym can have several expansions even within one domain: "RNN" has two possible expansions, "Recurrent Neural Network" and "Random Neural Networks", both of which are used in computer science.

Word Sense Disambiguation (WSD) is a Natural Language Processing (NLP) task that can be applied to determine the right expansion of an acronym based on its context. There are two types of WSD: all-words WSD, which disambiguates all words in the given context, and lexical sample WSD, which disambiguates a specific word in the context. Acronym disambiguation is considered a special case of lexical sample WSD.

Three main approaches are applied extensively to WSD. The first is the knowledge-based approach, which integrates lexical knowledge bases and exploits semantic similarity and graph-based methods. In similarity-based methods, each expansion of the ambiguous acronym is compared with the content words appearing near it (context words), and the expansion with the highest similarity (for instance, cosine similarity) is taken to be the right one (Billami 2017).

The second, unsupervised ML, disambiguates by finding hidden structure in unlabelled data, for instance by clustering documents or sentences into groups that each represent an expansion (Charbonnier and Wartena 2018).

Finally, supervised ML approaches require tagged corpora. Under this approach, WSD is treated as a text classification problem in which the objective is to predict the correct expansion of an acronym among its candidate expansions (Melacci, Globo, and Rigutini 2018). Supervised approaches achieve high performance in this type of task, but they require annotated data, which is expensive to generate. To address this problem, semi-supervised approaches automatically generate training data from a few annotated examples (da Silva Sousa, Milios, and Berton 2020).

In this work we explore word embeddings as features for ML algorithms; a preliminary analysis is given in (Jaber and Martínez 2021). A word embedding is a real-valued vector that represents a single word based on the context in which it appears (Khattak et al. 2019). These numerical word representations can be built with different models, such as those of (Mikolov et al. 2013), (Peters et al. 2018) and (Devlin et al. 2019), which are based on different neural network architectures. Such embeddings can be trained on a large data set, saved, and reused to solve other tasks; they are then called pre-trained word embeddings or pre-trained models.

In this paper, three supervised ML models combined with a knowledge-based model are used to disambiguate scientific acronyms for the SDU@AAAI-21 shared task (Veyseh et al. 2020a). The rest of the paper is organized as follows: the Method section describes the data set used in this study, the features and the different models that we applied.
In the Strategies section we describe how the proposed methods were applied experimentally. Finally, we present our results compared to the baseline system.

Method

Acronyms are ambiguous because they can have multiple expansions. Determining the correct expansion of an acronym is a WSD problem. Since SciAD contains only a small set of examples for some acronyms, we combined supervised machine learning with a knowledge-based approach to tackle this problem.

Data Set

The SciAD corpus (Veyseh et al. 2020b), created by the organizers of AAAI-21 shared task 2, is used in this task. SciAD was generated from 6,786 English papers from arXiv containing 2,031,592 sentences. Table 1 shows the number of annotated samples in the three data sets: training, development and test.

              Training    Development    Test
Sentences     50034       6189           6218
Tokens        1548278     190654         190111
Acronyms      731         611            618
Expansions    2150        1233           -

Table 1: Description of training, development and test data sets.

Figure 1 shows the frequencies of annotated examples per acronym; 299 acronyms have fewer than 20 annotated examples in the training data set.

Figure 1: Frequency of each number of examples per acronym across train, development and test data sets.

Additionally, the organizers provided the participants with an acronym dictionary that contains 732 acronyms and 2308 senses, an average of 3.15 senses per acronym. Figure 2 shows the distribution of senses over the acronyms contained in the dictionary.

Figure 2: Number of senses per acronym in the dictionary. For example, 437 acronyms have two expansions.

Model

Baseline: To familiarize the participants with the task, the organizers provided a rule-based baseline in the code directory. This baseline computes the frequency of the long forms in the training data set. Afterwards, to make a prediction for each acronym in the development data set, it selects the long form with the highest frequency as the final prediction. If there is a tie, the long form that appears first among all tied long forms in the dictionary is selected as the final prediction.

Supervised ML: Three supervised ML algorithms are implemented:

• Support Vector Machine (SVM): separates positive samples from negative ones with a linear hyperplane learned from the labeled data set. SVM is adapted to multi-class classification so that it can be used in WSD.
• Naive Bayes (NB): a probabilistic approach with a long history of success in WSD. It uses Bayes' theorem to compute the conditional probability of each sense of an acronym given a set of features.
• K-Nearest Neighbor (KNN): each test vector is classified according to its k most similar training vectors, where similarity is measured by Euclidean distance.

Knowledge Based Approach: For acronyms with few examples in the data set, which are insufficient to train a supervised ML method, a knowledge-based approach is implemented. This method is based on the expansion dictionary provided by the organizers; cosine similarity is applied to the test examples. Two vectors are said to be similar when their cosine similarity is close to 1, and dissimilar when it is close to 0 (Singhal 2001).

Features

Features play an important role in a WSD system; two types of features were used.

WSD features: Several lexical features were used to disambiguate acronyms, considering both the left and right contexts of the target acronym. Our system adopted a set of lexical features that have been used successfully in WSD. Given a sentence s formed by a sequence of words [..., w_{-2}, w_{-1}, w_k, w_{+1}, w_{+2}, ...], where w_k is the targeted ambiguous acronym, we extracted the following features:
1. Word features: the stemmed words for each token on both sides of the target acronym.
2. Word features with direction: the relative direction (left or right side) of the stemmed words.
3. POS (Part-Of-Speech) tag: the POS tag of each token on both sides.
4. Position features: the distance between the feature word and the target acronym.
5. Word formation features from the acronym itself, including special characters, capital letters and numbers.

Pre-trained word embedding features: We used a pre-trained word embedding model with 300-dimensional vectors built with FastText (Joulin et al. 2016) and generated from several English resources, such as Wikipedia and data from the Common Crawl project (Mikolov et al. 2018).

Strategies

Pre-processing Data & Feature Extraction

Several pre-processing steps were applied to the data set before extracting the features, including removing stop words and special characters and stemming the words. For the supervised ML approaches, features are formed by combining the WSD lexical features with the summation of the pre-trained word embeddings, computed with the following equation:

$$S = \sum_{i=0,\; i \neq k}^{|W|} v(W(i)) \tag{1}$$

where W is the list of words surrounding the targeted acronym, |W| is the length of the list, v(.) is the FastText pre-trained word embedding mentioned in the previous subsection, and k is the position of the target acronym.

For the knowledge-based approach, on the other hand, only the summed pre-trained word embedding vectors were generated, both for each example and for the candidate expansions extracted from the expansion dictionary.

Figure 3: Overview of proposed approach to disambiguate acronyms.

Training Phase

In this phase, the training and development data sets were combined to increase the number of examples for each acronym. Our goal was to build, for each acronym with more than 20 annotated examples, a model that predicts the acronym's expansion from its context. To achieve this, the training data was split into one data set per acronym. Table 2 shows the distribution of the whole data set over the ML and knowledge-based (KB) approaches: 450 acronyms with 53702 annotated examples are disambiguated by the three ML models (SVM, NB and KNN), while 282 acronyms with 2521 annotated examples are disambiguated by the cosine similarity method.

         Data set size    Acronyms    Expansions
ML       53702            450         1601
KB       2521             282         594
Total    56223            732         2195

Table 2: Distribution of data sets and acronyms over the two proposed models in the training phase.

Testing Phase

When the testing data set was released by the organizers, it was divided according to the split of the training data described above (see Figure 4). Table 3 shows the distribution of the testing data set over the two models: 444 acronyms with 5876 examples in the testing data set are disambiguated by the three ML models, and 174 acronyms with 342 testing examples are disambiguated by the cosine similarity method. Figure 3 summarizes the overall process for the proposed system.
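
The split between the ML and KB branches described in the Training Phase and Testing Phase can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the tuple layout of the examples, and the choice to send acronyms unseen in training to the KB branch are assumptions; the threshold of "more than 20" examples is taken from the text.

```python
from collections import Counter

MIN_EXAMPLES = 20  # threshold stated in the Training Phase description


def split_acronyms(train_examples, min_examples=MIN_EXAMPLES):
    """Partition acronyms into ML-handled and KB-handled sets.

    train_examples: iterable of (acronym, context_tokens, expansion) tuples
    (hypothetical layout chosen for this sketch).
    """
    counts = Counter(acronym for acronym, _, _ in train_examples)
    # Acronyms with more than `min_examples` examples get a supervised model.
    ml_acronyms = {a for a, n in counts.items() if n > min_examples}
    kb_acronyms = set(counts) - ml_acronyms
    return ml_acronyms, kb_acronyms


def route(acronym, ml_acronyms):
    """Decide which branch disambiguates a test example.

    Assumption: acronyms not seen in training fall back to the KB branch.
    """
    return "ML" if acronym in ml_acronyms else "KB"


if __name__ == "__main__":
    toy = [("CNN", [], "convolutional neural network")] * 25
    toy += [("ARD", [], "adversarially robust distillation")] * 3
    ml, kb = split_acronyms(toy)
    print(route("CNN", ml), route("ARD", ml), route("XYZ", ml))
```

Keeping the routing decision in one place means the same split is applied to the training and testing data, which is what the data flow in Figure 4 requires.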

                       Data set size      # of acronyms
 Machine Learning         5876                  444
 Knowledge based           342                  174
      Total               6218                  618

Table 3: Distribution of data sets and acronyms over the two proposed models in the testing phase.
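
The knowledge-based branch that handles the examples in the second row of Table 3 can be sketched as below: the context and each candidate expansion are turned into summed word vectors, and the candidate with the highest cosine similarity is chosen. This is a hedged illustration only. The two-dimensional `TOY_VECTORS` stand in for the 300-dimensional FastText vectors, and the function names, the whitespace tokenization of expansions, and the handling of unknown words are assumptions.

```python
import math

# Toy 2-D stand-ins for pre-trained word vectors (hypothetical values).
TOY_VECTORS = {
    "patient": [1.0, 0.0],
    "hospital": [0.9, 0.1],
    "emergency": [1.0, 0.1],
    "department": [0.8, 0.2],
    "euclidean": [0.1, 1.0],
    "distance": [0.2, 0.9],
}


def sum_embeddings(tokens, vectors, dim):
    """Sum the vectors of all tokens that have an embedding.

    The caller passes only the context words, i.e. the target acronym
    itself is already excluded.
    """
    total = [0.0] * dim
    for token in tokens:
        vec = vectors.get(token)
        if vec is not None:  # assumption: unknown words are skipped
            total = [t + v for t, v in zip(total, vec)]
    return total


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def disambiguate(context_tokens, candidates, vectors, dim):
    """Return the candidate expansion most similar to the context."""
    context_vec = sum_embeddings(context_tokens, vectors, dim)
    return max(
        candidates,
        key=lambda exp: cosine(
            context_vec, sum_embeddings(exp.split(), vectors, dim)
        ),
    )


if __name__ == "__main__":
    print(disambiguate(
        ["patient", "hospital"],
        ["emergency department", "euclidean distance"],
        TOY_VECTORS,
        2,
    ))
```

Because no training is involved, this branch only needs the expansion dictionary and the pre-trained vectors, which is why it suits acronyms with few annotated examples.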


            Precision    Recall    F1-macro
NB-KB       90.31%       87.16%    84.37%
SVM-KB      90.20%       86.78%    88.16%
KNN-KB      83.85%       79.59%    79.53%

Table 4: Averaged performance of the three proposed hybrid approaches in the training phase.
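
The training-phase scores were obtained after rare expansions had been replicated by oversampling. The authors report using the sklearn library for this; the sketch below instead uses only the Python standard library to show the idea of random oversampling. The function name, the `(features, label)` layout, and the target of 10 examples per expansion (suggested by the "fewer than 10 annotated examples" criterion) are assumptions.

```python
import random
from collections import defaultdict


def oversample(examples, min_count=10, seed=0):
    """Replicate examples of rare labels until each has `min_count` items.

    examples: list of (features, label) pairs, where the label is the
    expansion. Labels that already have enough examples are left as they are.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    by_label = defaultdict(list)
    for features, label in examples:
        by_label[label].append((features, label))

    balanced = list(examples)
    for label, items in by_label.items():
        missing = min_count - len(items)
        if missing > 0:
            # Draw duplicates with replacement from the existing examples.
            balanced.extend(rng.choices(items, k=missing))
    return balanced


if __name__ == "__main__":
    data = [("ctx", "adversarially robust distillation")] * 12
    data += [("ctx", "accelerated robust distillation")] * 3
    print(len(oversample(data)))
```

When combined with cross validation, oversampling is usually applied inside each training fold so that duplicates of one example do not appear in both the training and validation parts; the paper does not state which order was used.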


         Precision    Recall    F1-macro
NB       92.15%       77.97%    84.47%
SVM      91.66%       73.33%    81.48%
KNN      90.26%       67.51%    77.25%

Table 5: Averaged performance of the three proposed hybrid approaches on the testing data set.
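
Scores such as those in Tables 4 and 5 can be computed from gold and predicted expansions along the following lines. This is a sketch under an assumption: it macro-averages per-class precision, recall and F1, which is one common reading of "F1-macro"; the organizers' official scorer may aggregate differently. The function name is hypothetical.

```python
def macro_scores(gold, predicted):
    """Return macro-averaged (precision, recall, F1) over all labels."""
    labels = sorted(set(gold) | set(predicted))
    pairs = list(zip(gold, predicted))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        # Guard against empty denominators for labels never predicted or seen.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


if __name__ == "__main__":
    gold = ["a", "a", "b", "b"]
    pred = ["a", "b", "b", "b"]
    print(macro_scores(gold, pred))
```

Under per-class averaging, the macro F1 is not the harmonic mean of the macro precision and macro recall, which may explain why the F1-macro values in the tables do not follow directly from the precision and recall columns.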


Evaluation & Result

The system performance was evaluated using three metrics. Precision is the number of instances correctly classified as class X (true positives) divided by the number of all instances classified as class X:

$$\mathrm{Precision} = \frac{\mathrm{TruePositives}}{\mathrm{TruePositives} + \mathrm{FalsePositives}} \tag{2}$$

Recall is the number of instances correctly classified as class X divided by the number of all instances that actually belong to class X:

$$\mathrm{Recall} = \frac{\mathrm{TruePositives}}{\mathrm{TruePositives} + \mathrm{FalseNegatives}} \tag{3}$$

F1-macro is defined as the harmonic mean of Precision and Recall:

$$F_1 = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{4}$$

The training data set includes 634 expansions, belonging to different acronyms, with fewer than 10 annotated examples each. To balance the data set, these expansions were replicated through oversampling techniques using the sklearn library. Then 5-fold cross validation was used for all acronyms in the ML models. Furthermore, the training data set contains 10 non-ambiguous acronyms, whose data contain a single expansion.

Table 4 shows our results in the training phase: NB with cosine similarity achieved the highest precision and recall (90.31% and 87.16%), with an F1-macro of 84.37%.

Table 5 shows the final scores for our systems as reported by the organizers. The best performance, with precision 92.15%, recall 77.97% and F1-macro 84.47%, was achieved by the hybrid approach with NB and cosine similarity.

Figure 4: Data flow chart in training and testing phases.

Preliminary Analysis of Errors

A sample of low-accuracy results from the training phase shows how strongly an imbalanced data set affects the model. We focus on the Naive Bayes approach, since the best result was achieved with it. Table 6 shows the accuracy for 4 acronyms. The acronym ARD, whose 247 examples are distributed between two expansions, "accelerated robust distillation" with 46 training examples and "adversarially robust distillation" with 201 training examples,
Acronym   Data set size   Number of expansions   Accuracy   Expansion                            Number of examples per expansion
MSE       501             3                      52%        mean squared error                   462
                                                            minimum square error                 10
                                                            model selection eqn                  29
GP        552             2                      61%        gaussian process                     466
                                                            geometric programming                86
CNN       2973            4                      58%        citation nearest neighbour           14
                                                            complicated neural networks          1
                                                            condensed nearest neighbor           33
                                                            convolutional neural network         2925
ARD       247             2                      38%        accelerated robust distillation      46
                                                            adversarially robust distillation    201

Table 6: Distribution of data set size over expansions and the accuracy of the Naive Bayes model on sample acronyms.


achieved the lowest accuracy, which is 38%.

Conclusion

In this paper, we introduced a system to disambiguate scientific acronyms. Our system's best score was achieved by a hybrid approach combining supervised Naive Bayes and cosine similarity, with precision 92.15%, recall 77.97% and F1-macro 84.47%.

Acknowledgments

Thanks to Palestine Technical University-Kadoorie (PTUK) and DeepEMR project (TIN2017-87548-C2-1-R) for partially funding this work.

References

Billami, M. 2017. A Knowledge-Based Approach to Word Sense Disambiguation by distributional selection and semantic features. CoRR abs/1702.08450. URL http://arxiv.org/abs/1702.08450.
Charbonnier, J.; and Wartena, C. 2018. Using Word Embeddings for Unsupervised Acronym Disambiguation. In Bender, E. M.; Derczynski, L.; and Isabelle, P., eds., Proceedings of the 27th International Conference on Computational Linguistics, COLING 2018, Santa Fe, New Mexico, USA, August 20-26, 2018, 2610–2619. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/C18-1221/.
da Silva Sousa, S. B.; Milios, E. E.; and Berton, L. 2020. Word sense disambiguation: an evaluation study of semi-supervised approaches with word embeddings. In 2020 International Joint Conference on Neural Networks, IJCNN 2020, Glasgow, United Kingdom, July 19-24, 2020, 1–8. IEEE. doi:10.1109/IJCNN48605.2020.9207225. URL https://doi.org/10.1109/IJCNN48605.2020.9207225.
Devlin, J.; Chang, M. W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL HLT 2019 - 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the Conference 1(Mlm): 4171–4186.
Jaber, A.; and Martínez, P. 2021. Disambiguating Clinical Abbreviations Using Pre-trained Word Embeddings. To appear in Proceedings of the 14th International Joint Conference on Biomedical Engineering Systems and Technologies: HEALTHINF. INSTICC.
Joulin, A.; Grave, E.; Bojanowski, P.; and Mikolov, T. 2016. Bag of Tricks for Efficient Text Classification. arXiv preprint arXiv:1607.01759.
Khattak, F. K.; Jeblee, S.; Pou-Prom, C.; Abdalla, M.; Meaney, C.; and Rudzicz, F. 2019. A survey of word embeddings for clinical text. Journal of Biomedical Informatics: X 4(October): 100057. ISSN 2590177X. doi:10.1016/j.yjbinx.2019.100057. URL https://doi.org/10.1016/j.yjbinx.2019.100057.
Melacci, S.; Globo, A.; and Rigutini, L. 2018. Enhancing Modern Supervised Word Sense Disambiguation Models by Semantic Lexical Resources. In Calzolari, N.; Choukri, K.; Cieri, C.; Declerck, T.; Goggi, S.; Hasida, K.; Isahara, H.; Maegaard, B.; Mariani, J.; Mazo, H.; Moreno, A.; Odijk, J.; Piperidis, S.; and Tokunaga, T., eds., Proceedings of the Eleventh International Conference on Language Resources and Evaluation, LREC 2018, Miyazaki, Japan, May 7-12, 2018. European Language Resources Association (ELRA). URL http://www.lrec-conf.org/proceedings/lrec2018/summaries/112.html.
Mikolov, T.; Grave, E.; Bojanowski, P.; Puhrsch, C.; and Joulin, A. 2018. Advances in Pre-Training Distributed Word Representations. In Calzolari, N.; Choukri, K.; Cieri, C.; Declerck, T.; Goggi, S.; Hasida, K.; Isahara, H.; Maegaard, B.; Mariani, J.; Mazo, H.; Moreno, A.; Odijk, J.; Piperidis, S.; and Tokunaga, T., eds., Proceedings of the Eleventh International Conference on Language Resources and Evaluation, LREC 2018, Miyazaki, Japan, May 7-12, 2018. European Language Resources Association (ELRA). URL http://www.lrec-conf.org/proceedings/lrec2018/summaries/721.html.
Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.; and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems 1–9. ISSN 10495258.
Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L. 2018. Deep Contextualized Word Representations. In Walker, M. A.; Ji, H.; and Stent, A., eds., Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), 2227–2237. Association for Computational Linguistics. doi:10.18653/v1/n18-1202. URL https://doi.org/10.18653/v1/n18-1202.
Singhal, A. 2001. Modern Information Retrieval: A Brief Overview. IEEE Data Eng. Bull. 24(4): 35–43. URL http://sites.computer.org/debull/A01DEC-CD.pdf.
Veyseh, A. P. B.; Dernoncourt, F.; Nguyen, T. H.; Chang, W.; and Celi, L. A. 2020a. Acronym Identification and Disambiguation Shared Tasks for Scientific Document Understanding. CoRR abs/2012.11760. URL https://arxiv.org/abs/2012.11760.
Veyseh, A. P. B.; Dernoncourt, F.; Tran, Q. H.; and Nguyen, T. H. 2020b. What Does This Acronym Mean? Introducing a New Dataset for Acronym Identification and Disambiguation. CoRR abs/2010.14678. URL https://arxiv.org/abs/2010.14678.