=Paper=
{{Paper
|id=Vol-2831/paper32
|storemode=property
|title=Acronym Expander at SDU@AAAI-21: an Acronym Disambiguation Module
|pdfUrl=https://ceur-ws.org/Vol-2831/paper32.pdf
|volume=Vol-2831
|authors=João L. M. Pereira,Helena Galhardas,Dennis Shasha
|dblpUrl=https://dblp.org/rec/conf/aaai/PereiraGS21
}}
==Acronym Expander at SDU@AAAI-21: an Acronym Disambiguation Module==
<pdf width="1500px">https://ceur-ws.org/Vol-2831/paper32.pdf</pdf>
<pre>
    Acronym Expander at SDU@AAAI-21: an Acronym Disambiguation Module

                             João L. M. Pereira (1), Helena Galhardas (1), Dennis Shasha (2)
                        (1) INESC-ID and Instituto Superior Técnico, Universidade de Lisboa
                                            (2) Courant Institute, NYU
                  joaoplmpereira@tecnico.ulisboa.pt, helena.galhardas@tecnico.ulisboa.pt, shasha@cs.nyu.edu


Copyright © 2021 for this paper by its authors. Use permitted under
Creative Commons License Attribution 4.0 International (CC BY 4.0).

                            Abstract

  In order to properly determine which of several possible meanings an
  acronym A in sentence s has, any system that aims to find the correct
  meaning for A must understand the context of s.
  This paper describes the techniques we use for that problem for the
  SDU@AAAI benchmark in which context was provided in the form of
  sentences in which acronym A is present and defined.
  As a capsule summary of our results, Support Vector Machines with
  Doc2Vec techniques achieve a higher Macro F1-Measure score than Cosine
  similarity with Classic Context Vector techniques. Although these
  techniques usually work better with documents (i.e., many sentences
  rather than the one sentence offered in this benchmark), they achieved
  scores of Macro F1-Measure 86-89%.
  While these results were 5.65% worse than the best in the benchmark
  experiment, the high speed of our approach (max 0.6 seconds on average
  per sentence on a virtual machine allocated with 4 CPU cores and 32GB
  of RAM in a shared server) and the possibility that our methods are
  complementary to those of other groups may lead to high performance
  hybrid systems.

                        Introduction

   The proper expansion of an acronym depends on context. For example,
"HD" can mean Harmonic Distortion in a signal context, High Definition in
a video context, and Huntington's Disease in a medical context. Thus, any
system that hopes to help readers understand the intended meaning of an
undefined acronym in a sentence must expand that acronym using its
context.
   An acronym expander system comprises the following steps: (i)
Extraction of both acronym and (when present) its expansion within a
text. For example, if a given text has "HD (High Definition)" then HD
would be the acronym and High Definition would be the expansion. We call
this in-expansion because it can be done for a particular article on its
own. (ii) In the case that an acronym is not expanded in a text,
out-expansion chooses an expansion from a previously parsed large corpus
(training corpus) like Wikipedia (https://www.wikipedia.org/). The choice
of which of several possible expansions to choose is based on some notion
of article domain similarity between the text with a non-expanded acronym
A and the articles containing expansions for A. We participated in the
SDU@AAAI benchmark presented in Veyseh et al. (2020) that tests
out-expansion only (i.e., acronym disambiguation).
   Our system includes two techniques for out-expansion: Cosine
similarity of Classic Context Vectors (Abdalgader and Skabar 2012;
Prokofyev et al. 2013; Li, Ji, and Yan 2015) and Doc2Vec (Le and Mikolov
2014) whose outputs are used as features for Support Vector Machines
(SVMs) to create a new out-expansion technique. Moreover, we used
Wikipedia articles to enrich the training data for these techniques. Our
results show that Doc2Vec together with Support Vector Machines (SVMs)
gives the best prediction results when using Wikipedia data. Without
extra data, context vector works best.

                        Related Work

To our knowledge, systems that expand abbreviations and/or acronyms use a
pre-defined dictionary of acronym-expansions (Gooch 2012; ABBREX,
http://abbrex.com/) as opposed to trying to discover the proper expansion
based on context.
   Ciosici and Assent (2018) proposed an abbreviation/acronym expansion
system architecture that performs out-expansion. Unfortunately, their
demo paper does not provide enough technical details and their code is
proprietary.
   The remaining part of this section describes previous work on
out-expansion.
   Li, Ji, and Yan (2015) proposed two approaches to out-expansion based
on word embeddings from Word2Vec (Mikolov et al. 2013a). Their best
approach, called Surrounding Based Embedding (SBE), combines the Word2Vec
embeddings of the words surrounding the acronym or the expansion. Sim-
ilarly to SBE, Ciosici, Sommer, and Assent (2019) pro-              For the competition, we tested two representator tech-
posed Unsupervised Acronym Disambiguation (UAD) that              niques: Classic Context Vector and Doc2Vec.
replaces each expansion occurrence in the text collection by
a normalized token and retrains the Word2Vec Google News          Classic Context Vector The context vector technique is
model (Mikolov et al. 2013a) on that collection. The result-      an unsupervised method used as a baseline in Word Sense
ing model produces an embedding for each normalized to-           Disambiguation problems (Abdalgader and Skabar 2012)
ken, i.e., an expansion embedding.                                and also in acronym disambiguation problems (Prokofyev
   Thakker, Barot, and Bagul (2017) create document vector        et al. 2013; Li, Ji, and Yan 2015). We denote it as classic
embeddings using Doc2Vec for each document. For each              to distinguish it from variants or other techniques that also
set of documents D containing an expansion for an acronym         provide vectors to contexts.
A, the system trains a Doc2Vec model on D, which is used             A context vector represents a term (e.g., an acronym or
to infer the embedding for an input document i containing         expansion) by a vector based on the words that co-occur
an undefined acronym A.                                           with the term in each document of the corpus containing
   Charbonnier and Wartena (2018) proposed an out-                that term. Thus, a context vector is a sparse vector where
expansion approach based on Word2Vec embeddings                   each position corresponds to a word in any document in the
weighted by Term Frequency-Inverse Document Frequency             corpus. If the word is in a document that contains the term,
(TF-IDF) scores to find out-expansions for acronyms in            then the vector position has some positive value; otherwise,
scientific article captions.                                      the value is zero. In the classic approach, the value at each
   More recently, Pouran Ben Veyseh et al. (2020) compare         vector position corresponds to the number of co-occurrences
previous works on a new dataset (i.e., the Acronym                of the term and the co-occurring words in all the documents
Disambiguation dataset used in the SDU@AAAI competition).         of the corpus.
The authors also propose a new model called Graph-based              In acronym disambiguation, the acronym in a particular
Acronym Disambiguation (GAD). GAD uses word and sentence          sentence yields a context vector (which we call the "target
representations obtained from a Bidirectional Long Short-Term     context vector"), which contains the words occurring in that
Memory (BiLSTM) neural network. Those representations are         sentence and their number of occurrences.
complemented by syntactic structure from a dependency tree           Each possible expansion for the acronym has a context
graph to model far but important dependencies between words       vector as well (the "potential context vector"). The classic
using a Graph Convolutional Network (GCN) (Kipf and Welling       context vector technique chooses the expansion whose potential
2017). Finally, a two-layer feedforward neural network            context vector is most similar to the target context vector.
classifier is used to                                             The simplest similarity metric is cosine similarity.
guess the expansion.                                              Figure 1 presents an example of a context vector for the
   A related line of work explored the expansion of               Portable Document Format expansion built from two documents.
acronyms in enterprise texts (Feng et al. 2009; Li et al.         For instance, the words "the" and "file" occur once in each
2018). For instance, in Li et al. (2018), enterprise textual      document, so the positions reserved for these two words in
documents are used as training data as well as Wikipedia          the vector contain the value 2, while the others contain 1.
articles and a set of features like statistics based on word      Doc2Vec Doc2Vec (Le and Mikolov 2014) is a document
frequencies, word co-occurrences, and TF-IDF. Other works         embedding and an unsupervised learning technique that adds
explored acronym disambiguation in biomedical domains             the capability of automatically learning document (or para-
(Pustejovsky et al. 2001; Pakhomov, Pedersen, and Chute           graph) vectors to Word2Vec (Mikolov et al. 2013a). Given a
2005; Yu et al. 2006; Stevenson et al. 2009; Moon, Pakho-         list of words (e.g., a text document) as input, the output of
mov, and Melton 2012; Moon, McInnes, and Melton 2015;             Doc2Vec is a dense vector of real numbers (i.e., an embed-
Wu et al. 2015; 2017).                                            ding).
   Less directly related, but insightful, is the literature on       Just as Word2Vec assigns a vector to a word, Doc2Vec as-
Word Sense Disambiguation (WSD) (Navigli 2009; Moro               signs a vector of N dimensions called a document vector to
and Navigli 2015) because that work also must make use of         a document (or in the case of this benchmark to a sentence).
the context around a token (in our case, an acronym; in the          The training problem consists of finding the best set
word sense literature, a word).                                   of embedding values for each word and document (i.e.,
                                                                  Doc2Vec model parameters) that, given a document, pre-
               Out-Expansion Strategy                             dicts the set of words in that document. For example,
                                                                  consider a document consisting of a list containing the countries
   Our out-expansion strategy consists of: (i) a Representa-      in Figure 2. If the document is known to the Doc2Vec model
tor to map an input sentence to a document representation         (i.e., it was included in the training data) then we have a doc-
that holds contextual information and (ii) an Out-Expansion       ument embedding available, otherwise, a document embed-
Predictor to choose a context-appropriate out-expansion for       ding d is computed by finding the best values that maximize
each acronym found in the input sentence.                         the prediction of the country names given d.
                                                                     In contrast to Word2Vec which averages word vectors to
Representator                                                     represent a particular document, Doc2Vec creates a trained
Representators summarize text (documents or sentences) in         vector for each document in the corpus (Dai, Olah, and Le
order to capture information signals about their semantics.       2015). By comparing those document vectors through co-
                           Documents containing the Portable Document Format expansion


                       Potential Context Vector for the Portable Document Format expansion

                           Words    the   file   format   increases   in   popularity   formats    including
                           Count     2     2        1         1        1       1           1           1

                      Figure 1: Classic context vector example for portable document format expansion.
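
As a minimal sketch of the Classic Context Vector technique with cosine
similarity (not the authors' implementation), the Python snippet below sums
word counts over the sentences of each expansion to obtain its potential
context vector, and then picks the expansion whose vector is closest to the
target context vector. The toy sentences, the second expansion, and the helper
names context_vector, cosine, and predict_expansion are illustrative
assumptions; tokens are assumed to be already preprocessed.

```python
import math
from collections import Counter


def context_vector(sentences):
    """Sum word counts over tokenized sentences into one sparse vector."""
    vec = Counter()
    for tokens in sentences:
        vec.update(tokens)
    return vec


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[word] for word, count in u.items() if word in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def predict_expansion(target_tokens, training):
    """Return the expansion whose potential vector is most similar."""
    target = context_vector([target_tokens])
    potentials = {exp: context_vector(sents) for exp, sents in training.items()}
    return max(potentials, key=lambda exp: cosine(target, potentials[exp]))


# Hypothetical training sentences, chosen so that the counts for the
# first expansion match the vector shown in Figure 1.
training = {
    "portable document format": [
        ["the", "file", "format", "increases", "in", "popularity"],
        ["the", "file", "formats", "including"],
    ],
    "probability density function": [
        ["the", "distribution", "has", "a", "density"],
    ],
}

pdf_vector = context_vector(training["portable document format"])
prediction = predict_expansion(["open", "the", "file"], training)
```

Because each expansion ends up with a single summed vector, there is only one
sample per label, which is why a similarity measure rather than a trained
classifier is used with this representator.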


sine similarity, Doc2Vec can infer semantically similar documents.

Out-Expansion Predictor
Out-expansion predictors select an out-expansion for a given acronym A in
an input sentence ins. For this purpose, a predictor considers each
sentence containing a valid expansion E for A. Sentences are characterized
by the representator output explained in the previous section.
   In the case of the Classic Context Vector, we compare the input
sentence context vector with the vector resulting from summing the context
vectors of the sentences for E. In this classical approach, because we
have only a context vector per expansion E, we use cosine similarity to
evaluate similarity.
   We consider the use of machine learning classifiers as alternatives to
cosine similarity when more than one training sample is possible for an
expansion (i.e., label). This is the case of Doc2Vec, whose embeddings
represent a set of words (e.g., document or sentence), so we will have as
many samples per expansion as sets of words (e.g., sentences) where it
occurs. However, for Classic Context Vector, it is not possible to use
such a machine learning approach because we have a context vector per
expansion and so only one sample per expansion. Specifically, for the
competition we used Support Vector Machines (SVMs) where non-binary
classification was performed by a "one-vs-all" approach where a binary
SVM classifier predicts with a certain probability if a sample belongs to
a particular class. The class with highest probability is selected. We
used the LibLinear (Fan et al. 2008) implementation included in the
scikit-learn toolkit (Pedregosa et al. 2011).

          SDU@AAAI Benchmark of Out-expansion Techniques

This section describes the SDU@AAAI benchmark used to evaluate the
out-expansion techniques described in the previous section.

Datasets
The datasets that we use in this benchmark are:
SciAD contains sentences from human annotated scientific articles
  extracted from ArXiv (https://arxiv.org/). Each sentence contains an
  acronym to disambiguate. This is the dataset provided for the SDU@AAAI
  Acronym Disambiguation (AD) competition and it was proposed in Pouran
  Ben Veyseh et al. (2020). There are three data splits: (i) Train with
  50,033 sentences, (ii) Dev with 6,188 sentences, and (iii) Test with
  6,217 sentences where acronym expansion is unknown. This dataset also
  contains a dictionary with acronyms and their possible expansions.
Wikipedia contains all English articles of Wikipedia.org taken from the
  Wikipedia dump of March 1, 2020 (https://dumps.wikimedia.org/enwiki).
  We used the WikiExtractor software
  (http://medialab.di.unipi.it/wiki/Wikipedia_Extractor) to obtain the
  articles in plain text, and we used the Schwartz and Hearst (2003)
  algorithm to extract acronyms and expansions from each Wikipedia
  article.

Data Preparation
We process the datasets by removing punctuation and normalizing tokens in
order to create a better textual representation. That is, we perform the
following operations on each dataset:
SciAD We remove non-alphanumeric tokens, punctuation characters, and
  stop-words. Then, we transform each token to its stem, e.g., expander,
  expanding, and expanded all map to expand. We use the Porter Stemmer
  algorithm from the Natural Language Toolkit (NLTK) (Bird, Klein, and
  Loper 2009) for that purpose.
Wikipedia Because expansions in Wikipedia may be written in different
  formats and with plurals, we normalize the expansions found against the
  dictionary of acronym
                       Figure 2: Countries and Capitals vectors. Modified from (Mikolov et al. 2013b).


  expansions shared with SciAD. So, each expansion in the        Training execution times: the execution time to create the
  Wikipedia documents is replaced by the closest expansion         representator model based on the training sentences
  in the SciAD dictionary. Distance is computed by comparing       and/or documents.
  the Wikipedia expansion against a SciAD expansion: if the
  first 4 characters of each word are equal, we consider the     Average execution times per sentence: the average
  expansions to be equal (distance=0); otherwise, distance         execution time to predict the expansions for the acronym
  is the edit distance between both expansions. If the edit        in a sentence.
  distance is below 3, the expansions are close enough;
  otherwise, they are considered two distinct expansions.                         Experimental Evaluation
  Wikipedia expansions not close to any expansion in             This section reports on the out-expansion experiments.
  the SciAD dictionary and their corresponding documents            We run the experiments on a machine with the follow-
  are not considered for prediction because only the expan-      ing specifications: Virtual Machine (VM) with 4 CPU cores
  sions in the dictionary are valid for the SciAD evaluation     from an Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz,
  set. Furthermore, while keeping the expansions in text,        32GiB of RAM (Random Access Memory), and Ubuntu
  we apply the tokenizer from NLTK and remove the non            18.04.3 LTS.
  alphanumeric tokens, punctuation characters, and stop-
  words. Finally, we transform each token to its stem as we      Out-Expansions on the AAAI Benchmark
  did for the SciAD dataset.
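
One possible reading of the Wikipedia expansion normalization rule described
above can be sketched in Python as follows. This is an illustration under
stated assumptions, not the authors' code: the function names, the
lowercasing, the requirement that both expansions have the same number of
words for the prefix test, and the use of character-level Levenshtein distance
are all assumptions. The 4-character prefix and the threshold of 3 come from
the text.

```python
def edit_distance(a, b):
    """Character-level Levenshtein distance (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def expansion_distance(wiki_exp, sciad_exp, prefix_len=4):
    """Distance 0 if every word pair shares its first prefix_len
    characters; otherwise the edit distance between the expansions."""
    w_words = wiki_exp.lower().split()
    s_words = sciad_exp.lower().split()
    same_prefixes = len(w_words) == len(s_words) and all(
        w[:prefix_len] == s[:prefix_len] for w, s in zip(w_words, s_words)
    )
    if same_prefixes:
        return 0
    return edit_distance(wiki_exp.lower(), sciad_exp.lower())


def normalize_expansion(wiki_exp, sciad_expansions, max_distance=3):
    """Map a Wikipedia expansion to the closest SciAD expansion, or return
    None when no SciAD expansion is close enough (list must be non-empty)."""
    best = min(sciad_expansions, key=lambda s: expansion_distance(wiki_exp, s))
    if expansion_distance(wiki_exp, best) < max_distance:
        return best
    return None


# Hypothetical dictionary entries used only for illustration.
sciad = ["convolutional neural network", "support vector machine"]
plural = normalize_expansion("Convolutional Neural Networks", sciad)
hyphenated = normalize_expansion("support-vector machine", sciad)
unknown = normalize_expansion("random forest", sciad)
```

In this sketch, an expansion that maps to None would be dropped together with
its documents, mirroring the statement that only dictionary expansions are
valid for the SciAD evaluation set.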
                                                                 In this section, we report the results obtained using the
Out-expansion Techniques                                         SciAD dataset. We used the Train set as training data and the
For the SDU@AAAI AD competition, we test the following           Dev set as test data to tune the hyperparameters of Doc2Vec
out-expansion techniques: (i) Cosine similarity (Cossim)         and SVMs. Then, we used the Train and Dev sets as training
with the Classic Context Vector (Li, Ji, and Yan 2015) and       data and the Test set for evaluation. Evaluation quality
(ii) the outputs of Doc2Vec as features for Support Vector       measures were provided by the CodaLab competition evaluation
Machines (SVMs).                                                 system (footnote 6). We submitted two combinations of
                                                                 out-expansion predictors and representators: (i) cosine
Prediction and Performance Metrics                               similarity (Cossim) as predictor with Classic Context Vector
                                                                 as representator, and (ii)
                                                                 Support Vector Machine (SVM) as predictor with Doc2Vec
For the SDU@AAAI competition, we use the following               as representator.
metrics:                                                            In Table 1, we report the out-expansion macro averages
Out-expansion Macro Averages: the averages of the                for predicting expansions for acronyms in sentences:
 per-expansion Precision, Recall, and F1-Measure. These are      Precision (P), Recall (R), and F1-measure (F1). Macro
 the official metrics used in the SDU@AAAI competition, with     F1-measure is the official measure for ranking competitors in
 the Macro F1-Measure used to rank the competitors based
 on expansion prediction quality.                                   6 https://competitions.codalab.org/competitions/26611
  Acronym out-expansion technique           Dev macro avg.             Test macro avg.                 Execution times
  Predictor Representator             P (%)    R (%)    F1 (%)    P (%)    R (%)    F1 (%)   Training (s) Avg. per sentence (s)
  Cossim    Classic Context Vector    90.00% 84.68% 87.26%        92.13% 84.16% 87.96%              1.24                  0.18
  SVM       Doc2Vec                   90.49% 81.01% 85.48%        91.95% 80.15% 85.65%             92.79                  0.09

     Table 1: Out-expansion macro averages and execution times for training and average per sentence for SciAD dataset.

  Acronym out-expansion technique           Dev macro avg.             Test macro avg.                 Execution times
  Predictor Representator             P (%)    R (%)    F1 (%)    P (%)    R (%)    F1 (%)   Training (s) Avg. per sentence (s)
  Cossim    Classic Context Vector    88.24% 82.79% 85.43%        90.27% 83.73% 86.88%            504.52                  0.61
  SVM       Doc2Vec                   91.54% 81.50% 86.23%        93.57% 83.77% 88.40%           7367.32                  0.12

Table 2: Out-expansion macro averages and execution times for training and average per sentence for SciAD dataset with also
Wikipedia as training data.
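
The results discussion describes the macro F1-measure as the harmonic mean of
macro precision and macro recall. Under that reading, a minimal Python sketch
of the metrics is given below. The function names are hypothetical, and the
choice to average over every label seen in either the gold or the predicted
lists is an assumption; the official competition scorer may differ in such
details.

```python
from collections import Counter


def harmonic_mean(p, r):
    """Harmonic mean of two scores; 0.0 when both are zero."""
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)


def macro_scores(gold, predicted):
    """Macro precision, macro recall, and their harmonic mean.

    Precision and recall are computed per expansion (label) and then
    averaged with equal weight per expansion.
    """
    labels = sorted(set(gold) | set(predicted))
    gold_count = Counter(gold)
    pred_count = Counter(predicted)
    correct = Counter(g for g, p in zip(gold, predicted) if g == p)
    macro_p = sum(
        correct[l] / pred_count[l] if pred_count[l] else 0.0 for l in labels
    ) / len(labels)
    macro_r = sum(
        correct[l] / gold_count[l] if gold_count[l] else 0.0 for l in labels
    ) / len(labels)
    return macro_p, macro_r, harmonic_mean(macro_p, macro_r)


# Toy labels for illustration only.
gold = ["a", "a", "b", "c"]
predicted = ["a", "b", "b", "c"]
macro_p, macro_r, macro_f1 = macro_scores(gold, predicted)

# Consistency check against Table 1 (Cossim, Dev): P=90.00, R=84.68.
table1_dev_cossim_f1 = harmonic_mean(90.00, 84.68)
```

With the Table 1 Dev values for Cossim, harmonic_mean(90.00, 84.68) is
approximately 87.26, which agrees with the reported F1.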


the SDU@AAAI competition. We also report the best re-               Wikipedia, Cossim with Classic Context Vector is faster
sults obtained for both the Dev set used for hyperparameter         in training than SVM with Doc2Vec, while slower in
selection and the Test set as testing data. In addition, we re-     per-sentence processing. Training both techniques is much
port the execution times for training and the average per pre-      slower with the addition of Wikipedia data, yet fast enough
dicted acronym in a sentence (note that each sentence con-          for a regular machine (e.g., 2 hours to train the Doc2Vec
tains only one acronym to expand).                                  model). On average, to process a sentence, the incorporation
   We can see that Cossim with Classic Context Vector               of Wikipedia slows down Cossim with Classic Context
achieved the best results. In general, both techniques have         Vector by 0.43 seconds and slows down SVM with Doc2Vec
slightly lower recall (less than 1%) in the Test set than in        by 0.03 seconds.
the Dev set but higher macro precisions (1-2%). Since the               Most of the excellent efforts by other research groups
gains in macro precisions are higher than the losses in macro       submitted to the competition are transformer-based models
recalls for the Test set, the harmonic means of both macros         that use pretrained models like BERT (Devlin et al. 2018),
(i.e., macro F1-measures) are higher in the Test set. Differ-       ROBERTA (Liu et al. 2019), and SciBERT (Beltagy, Lo,
ences among the techniques are consistent across various            and Cohan 2019). Those works mostly distinguish them-
test sets. Regarding execution times, Cossim with Classic           selves on how they adapt such transformers models to out-
Context Vector is faster in training (by about 91s) and SVM with    expansion (Veyseh et al. 2020). The three leaderboard works
Doc2Vec is faster on average per sentence (0.09s). Clas-            use transformers and their macro F1-measures range from
sic Context Vector counts word occurrences at training time         93.19% to 94.05%. In our understanding, only three works
while Doc2Vec trains a neural network for word and docu-            including ours explored alternative techniques to transform-
ment embeddings with several iterations over the training           ers, no other work explored Doc2Vec or SVMs. Although
corpus (e.g., 200). Both training and sentence processing           our best technique scores 5.65% less than the best in the
times are low given that they are executed on a regular             competition, we believe that our techniques are distinct
machine (4 CPU cores and 32GB of RAM). Both techniques              enough to complement transformer-based techniques or to
are lightweight solutions for this problem.                         offer a lighter and faster approach to this problem, since
                                                                    transformer models, even using GPUs (Graphics Processing
Cross-Training and Additional Data for the SDU@AAAI                 Units) or TPUs (Tensor Processing Units), usually take more
Competition For our next set of experiments for the                 time to train and to process data than Doc2Vec and SVMs.
SDU@AAAI competition, we increase the training data                 Further, our approaches could work better when the context
provided by the competition sets with documents obtained            consists of entire documents rather than single sentences,
from Wikipedia, i.e., the Wikipedia dataset. We wanted to           which is our core use case.
test whether additional data and cross-training data help to
solve this problem and which techniques can benefit from
such a data increment.                                                        Conclusions and Future Work
   Table 2 shows the macro averages on the SciAD Dev                We have evaluated two rapid techniques for acronym dis-
set using SciAD train and Wikipedia documents as train-             ambiguation using the SDU@AAAI benchmarks. We have
ing data; and the macro averages and execution times on the         found that Cosine similarity with Classic Context Vector
SciAD test set using the above training sets plus the Dev set       works best when no Wikipedia data is used. SVM with
as training data.                                                   Doc2Vec outperforms Cosine similarity with Classic Con-
   In contrast to previous results where Wikipedia data was         text Vector when using Wikipedia data. Our overall re-
not used, after adding Wikipedia documents to the training          sults, as measured by F1-measure score, are within 5.7% of
data, SVM with Doc2Vec obtains the best results. That com-          the best system in competition. By analyzing the execution
bination also benefits from using Wikipedia data. The three         times of each phase (training and evaluation of sentences),
macro averages are lower when applied to the Dev set than           we showed that our approach is lightweight even on a stan-
when applied to the Test set.                                       dard computer.
   Consistent with the experimental results without                    We believe we could have improved performance if we
had used data sources in addition to Wikipedia such as abstracts from articles in web
repositories to make the domain closer to the SDU@AAAI competition data.

                  Acknowledgments
Pereira’s work was supported by national funds through FCT (Fundação para a Ciência e a
Tecnologia), under the PhD Scholarship SFRH/BD/135719/2018. Furthermore, Pereira and
Galhardas’ work was supported by national funds through FCT under the project
UIDB/50021/2020.
   Shasha’s work has been partly supported by (i) the New York University Abu Dhabi
Center for Interacting Urban Networks (CITIES), funded by Tamkeen under the NYUAD
Research Institute Award CG001 and by the Swiss Re Institute under the Quantum Cities
initiative, (ii) NYU Wireless, and (iii) U.S. National Science Foundation grants 1934388,
1840761, and 1339362.
   The server virtual machine used to run the experiments was supported by BioData.pt –
Infraestrutura Portuguesa de Dados Biológicos, project 22231/01/SAICT/2016, funded by
Portugal 2020.

                       References
Abdalgader, K., and Skabar, A. 2012. Unsupervised Similarity-based Word Sense
Disambiguation Using Context Vectors and Sentential Word Importance. ACM Transactions on
Speech and Language Processing 9(1):2–21.
Beltagy, I.; Lo, K.; and Cohan, A. 2019. SciBERT: Pretrained Language Model for
Scientific Text. In Proceedings of the 2019 Conference on Empirical Methods in Natural
Language Processing.
Bird, S.; Klein, E.; and Loper, E. 2009. Natural Language Processing with Python.
O’Reilly Media, Inc., 1st edition.
Charbonnier, J., and Wartena, C. 2018. Using word embeddings for unsupervised acronym
disambiguation. In Proceedings of the Twenty-Seventh International Conference on
Computational Linguistics, 2610–2619. Santa Fe, New Mexico: Association for
Computational Linguistics.
Ciosici, M. R., and Assent, I. 2018. Abbreviation expander - a web-based system for easy
reading of technical documents. In Proceedings of the Twenty-Seventh International
Conference on Computational Linguistics: System Demonstrations, 1–4. Santa Fe, New
Mexico: Association for Computational Linguistics.
Ciosici, M. R.; Sommer, T.; and Assent, I. 2019. Unsupervised abbreviation
disambiguation contextual disambiguation using word embeddings. arXiv preprint.
arXiv:1904.00929v2 [cs.CL]. Ithaca, NY: Cornell University Library.
Dai, A. M.; Olah, C.; and Le, Q. V. 2015. Document embedding with paragraph vectors.
arXiv preprint. arXiv:1507.07998 [cs.CL]. Ithaca, NY: Cornell University Library.
Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of Deep
Bidirectional Transformers for Language Understanding. arXiv preprint. arXiv:1810.04805
[cs.CL]. Ithaca, NY: Cornell University Library.
Fan, R.-E.; Chang, K.-W.; Hsieh, C.-J.; Wang, X.-R.; and Lin, C.-J. 2008. LIBLINEAR: A
Library for Large Linear Classification. Journal of Machine Learning Research
9:1871–1874.
Feng, S.; Xiong, Y.; Yao, C.; Zheng, L.; and Liu, W. 2009. Acronym extraction and
disambiguation in large-scale organizational web pages. In Proceedings of the Eighteenth
ACM Conference on Information and Knowledge Management, 1693–1696. New York, NY:
Association for Computing Machinery.
Gooch, P. 2012. BADREX: in situ expansion and coreference of biomedical abbreviations
using dynamic regular expressions. arXiv preprint. arXiv:1206.4522 [cs.CL]. Ithaca, NY:
Cornell University Library.
Kipf, T. N., and Welling, M. 2017. Semi-supervised classification with graph
convolutional networks. arXiv preprint. arXiv:1609.02907 [cs.CL]. Ithaca, NY: Cornell
University Library.
Le, Q. V., and Mikolov, T. 2014. Distributed representations of sentences and documents.
In Proceedings of the Thirty-First International Conference on Machine Learning,
1188–1196.
Li, Y.; Zhao, B.; Fuxman, A.; and Tao, F. 2018. Guess Me if You Can: Acronym
Disambiguation for Enterprises. In Proceedings of the Fifty-Sixth Annual Meeting of the
Association for Computational Linguistics, 1308–1317. Melbourne, Australia: Association
for Computational Linguistics.
Li, C.; Ji, L.; and Yan, J. 2015. Acronym Disambiguation Using Word Embedding. In
Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, 4178–4179.
Menlo Park, Calif.: AAAI Press.
Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.;
Zettlemoyer, L.; and Stoyanov, V. 2019. RoBERTa: A Robustly Optimized BERT Pretraining
Approach. arXiv preprint. arXiv:1907.11692 [cs.CL]. Ithaca, NY: Cornell University
Library.
Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J. 2013a. Efficient Estimation of Word
Representations in Vector Space. arXiv preprint. arXiv:1301.3781v3 [cs.CL]. Ithaca, NY:
Cornell University Library.
Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.; and Dean, J. 2013b. Distributed
Representations of Words and Phrases and Their Compositionality. In Proceedings of the
Twenty-Sixth International Conference on Neural Information Processing Systems,
3111–3119. Red Hook, NY: Curran Associates Inc.
Moon, S.; McInnes, B.; and Melton, G. B. 2015. Challenges and practical approaches with
word sense disambiguation of acronyms and abbreviations in the clinical domain.
Healthcare Informatics Research 21(1):35–42.
Moon, S.; Pakhomov, S.; and Melton, G. B. 2012. Automated disambiguation of acronyms and
abbreviations in clinical texts: window and training size considerations. AMIA Annual
Symposium proceedings 2012:1310–1319.
Moro, A., and Navigli, R. 2015. SemEval-2015 task 13: Multilingual all-words sense
disambiguation and entity linking. In Proceedings of the Ninth International Workshop on
Semantic Evaluation, 288–297. Denver, Colorado: Association for Computational
Linguistics.
Navigli, R. 2009. Word sense disambiguation: A survey. ACM Computing Surveys 41(2):1–69.
Pakhomov, S.; Pedersen, T.; and Chute, C. G. 2005. Abbreviation and acronym
disambiguation in clinical discourse. AMIA Annual Symposium proceedings 2005:589–593.
Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.;
Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; Vanderplas, J.; Passos, A.;
Cournapeau, D.; Brucher, M.; Perrot, M.; and Duchesnay, E. 2011. Scikit-learn: Machine
learning in Python. Journal of Machine Learning Research 12:2825–2830.
Pouran Ben Veyseh, A.; Dernoncourt, F.; Tran, Q. H.; and Nguyen, T. H. 2020. What does
this acronym mean? introducing a new dataset for acronym identification and
disambiguation. In Proceedings of the Twenty-Eighth International Conference on
Computational Linguistics, 3285–3301. Barcelona, Spain (Online): International Committee
on Computational Linguistics.
Prokofyev, R.; Demartini, G.; Boyarsky, A.; Ruchayskiy, O.; and Cudré-Mauroux, P. 2013.
Ontology-based word sense disambiguation for scientific literature. In Proceedings of
the Thirty-Fifth European Conference on Advances in Information Retrieval, 594–605.
Berlin, Heidelberg: Springer-Verlag.
Pustejovsky, J.; Castaño, J.; Cochran, B.; Kotecki, M.; and Morrell, M. 2001. Automatic
extraction of acronym-meaning pairs from MEDLINE databases. Studies in Health Technology
and Informatics 84(Pt 1):371–375.
Schwartz, A. S., and Hearst, M. A. 2003. A simple algorithm for identifying abbreviation
definitions in biomedical text. In Pacific Symposium on Biocomputing, 451–462.
Singapore: World Scientific Press.
Stevenson, M.; Guo, Y.; Al Amri, A.; and Gaizauskas, R. 2009. Disambiguation of
Biomedical Abbreviations. In Proceedings of the Workshop on Current Trends in Biomedical
Natural Language Processing, 71–79. Boulder, Colorado: Association for Computational
Linguistics.
Thakker, A.; Barot, S.; and Bagul, S. 2017. Acronym Disambiguation: A Domain Independent
Approach. arXiv preprint. arXiv:1711.09271v3 [cs.CL]. Ithaca, NY: Cornell University
Library.
Veyseh, A. P. B.; Dernoncourt, F.; Nguyen, T. H.; Chang, W.; and Celi, L. A. 2020.
Acronym identification and disambiguation shared tasks for scientific document
understanding. In Proceedings of the AAAI-21 Workshop on Scientific Document
Understanding.
Wu, Y.; Xu, J.; Zhang, Y.; and Xu, H. 2015. Clinical abbreviation disambiguation using
neural word embeddings. In Proceedings of the Workshop on Biomedical Natural Language
Processing, 171–176. Beijing, China: Association for Computational Linguistics.
Wu, Y.; Denny, J. C.; Trent Rosenbloom, S.; Miller, R. A.; Giuse, D. A.; Wang, L.;
Blanquicett, C.; Soysal, E.; Xu, J.; and Xu, H. 2017. A long journey to short
abbreviations: developing an open-source framework for clinical abbreviation recognition
and disambiguation (CARD). Journal of the American Medical Informatics Association
24(e1):e79–e86.
Yu, H.; Kim, W.; Hatzivassiloglou, V.; and Wilbur, J. 2006. A Large Scale, Corpus-Based
Approach for Automatically Disambiguating Biomedical Abbreviations. ACM Transactions on
Information Systems 24(3):380–404.

</pre>