T5 Encoder Based Acronym Disambiguation with Weak Supervision
Gwangho Song1 , Hongrae Lee2 and Kyuseok Shim1,3
1 Seoul National University, Seoul, South Korea
2 Google, Mountain View, CA, USA
3 Corresponding author

Scientific Document Understanding Workshop at AAAI 2022, March 1
Email: ghsong@kdd.snu.ac.kr (G. Song); mr.hongrae.lee@gmail.com (H. Lee); kshim@snu.ac.kr (K. Shim)
ORCID: 0000-0002-9450-5629 (G. Song); 0000-0002-6138-3071 (H. Lee); 0000-0001-8818-0963 (K. Shim)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)


Abstract

An acronym is a word formed by abbreviating a phrase, combining certain letters of the words in the phrase into a single term. The acronym disambiguation task selects the correct expansion of an ambiguous acronym in a sentence among the candidate expansions in a dictionary. Although acronyms are convenient to use, identifying the appropriate expansion of an acronym in a sentence is a difficult task in natural language processing. Based on the recent success of large-scale pre-trained language models such as BERT and T5, we propose a binary classification model for acronym disambiguation built on those language models. To overcome the limited coverage of the training data, we use a weak supervision approach to enlarge it. Specifically, after collecting sentences containing an expansion of an acronym from Wikipedia, we replace the expansion with its acronym and label the sentence with the expansion. Through extensive experiments, we show the effectiveness of the proposed model. Our model places in the top 3 for three of the four categories in SDU@AAAI-22 Shared Task 2: Acronym Disambiguation.

Keywords

acronym disambiguation, natural language processing, deep learning, weak supervision



1. Introduction

An acronym is a word formed by abbreviating a phrase, which is called a long-form or an expansion (e.g., AAAI for Association for the Advancement of Artificial Intelligence). Due to its brevity, acronym usage is ubiquitous in the literature, especially in the scientific and biomedical fields [1, 2, 3, 4, 5]. A report found that more than 63% of the articles in English Wikipedia contain at least one abbreviation [1]. Furthermore, among more than 24 million article titles and 18 million article abstracts published between 1950 and 2019, there is at least one acronym in 19% of the titles and 73% of the abstracts [2].

Figure 1: An example of acronym disambiguation. Input: a sentence ("Since our generative models are based on DP priors, they are designed to favor a small number of unique entities per image.") and a dictionary mapping "DP" to {Dynamic Programming, Dependency Parsing, Dirichlet Process}. Output: Dirichlet Process.
Acronyms frequently have multiple long-forms, and only one of them is valid in a specific context. For example, in a 2001 version of the WWWAAS (World-Wide Web Acronym and Abbreviation Server) database, 47.97% of acronyms had multiple expansions [6]. As another example, in the SciAD dataset released by the SDU@AAAI 2021 Shared Task: Acronym Disambiguation [5], an acronym has 3.1 long-forms on average and up to 20 long-forms. When sufficient context is not available, this leads to ambiguity in the meaning of acronyms and creates serious understanding difficulties [2, 7, 8, 9]. Thus, the acronym disambiguation task is important and challenging.

The goal of acronym disambiguation (AD) is to select the correct long-form of an ambiguous acronym in a sentence among the candidate long-forms in a dictionary. Figure 1 shows an example of acronym disambiguation. A sentence containing the ambiguous acronym "DP" and a dictionary with the long-forms of "DP" are given as the input. In the dictionary, the acronym "DP" has three possible long-forms: "Dynamic Programming", "Dependency Parsing" and "Dirichlet Process". Since "DP" stands for "Dirichlet Process" according to the context of the input sentence, a model should output "Dirichlet Process" as its expansion.

The problem of acronym disambiguation is usually cast as a classification problem whose goal is to determine whether a long-form has the same meaning as the acronym in an input sentence. Early approaches [10, 11, 12, 6] rely on traditional classification models such as SVMs, decision trees and naive Bayes classifiers. As deep learning became mainstream in natural language processing, several works employed contextualized word embeddings to create semantic representations of long-forms and context [9, 13, 14, 15, 16]. Moreover, with the recent success of pre-trained language models such as BERT [17] and T5 [18] in natural language processing, classification models for acronym disambiguation have been built on top of such pre-trained models [4, 19, 20, 21].
Figure 2: An illustration of the proposed model. The input token sequence 𝑥 = 𝑒_{𝑎,𝑗} ⊕ [SEP] ⊕ 𝑠 (e.g., the candidate long-form 𝑒_{𝑎,𝑗} = "Dynamic Programming", a [SEP] token, and the sentence 𝑠 = "Since our generative models are based on [BOA] DP [EOA] priors, they are designed to favor a small number of unique entities per image.") is fed to the encoder; the encoder output ℎ is passed to an MLP that produces the prediction score 𝑝.


To study multilingual acronym disambiguation, we develop a binary classification model by utilizing T5 [18], one of the most popular pre-trained language models, as well as mT5 [22], a multilingual variant of T5. We evaluate the proposed model on the dataset released by the SDU@AAAI 2022 Shared Task: Acronym Disambiguation [23]. Since the acronyms in the test dataset do not appear in the training dataset, the training dataset provided in the competition may not be sufficient to solve the problem. Thus, we use a weak supervision approach to enlarge the training dataset. By training on the provided training dataset as well as the weakly labeled training dataset generated by our weak supervision method, the proposed model ranks in the top 3 for three of the four categories in SDU@AAAI-22 Shared Task 2: Acronym Disambiguation.

The remainder of this paper is organized as follows. We review related work in Section 2 and present our proposed model in Section 3. In Section 4, we describe the datasets used for training the model, including the weakly labeled datasets generated by weak supervision. Finally, we discuss the experimental results in Section 5 and summarize the paper in Section 6.

2. Related Work

In this section, we present previous works on acronym disambiguation. We also summarize the pre-trained language models widely adopted in natural language processing tasks. In addition, we introduce weak supervision approaches to construct additional data.

2.1. Acronym Disambiguation

Early approaches [10, 11, 12, 6] rely on traditional classification models such as SVMs, decision trees and naive Bayes classifiers. As deep learning became mainstream in natural language processing, several works employed contextualized word embeddings to create semantic representations of long-forms and context [9, 13, 14, 15, 16]. The works in [13, 14] study the use of word embeddings [24, 25] to build classifiers for clinical abbreviation disambiguation. The UAD model proposed in [15] creates word embeddings by using additional unstructured text. The work in [9] compares the averaged context vector of the words in a long-form of an acronym with the weighted average vector of the words in the context of the acronym, based on word embeddings trained on a domain-specific corpus. In [26], the proposed model is trained to compute the similarity between a candidate long-form and the context surrounding the target acronym.

Many works utilize deep neural architectures to construct a classifier [16, 8, 4, 19, 20, 21]. At the AAAI-21 Workshop on Scientific Document Understanding (SDU@AAAI-21), the top-ranked participants [20, 19, 21] presented models for acronym disambiguation based on pre-trained language models such as RoBERTa [27] and SciBERT [28]. In [20], the problem of acronym disambiguation is treated as a span prediction problem, and the proposed model predicts the span containing the correct long-form from the concatenation of an input sentence and the candidate long-forms of the acronym in the sentence. The hdBERT model proposed in [21] combines RoBERTa and SciBERT to capture both domain-agnostic and domain-specific information. The work in [19], the winner of the acronym disambiguation shared task held at SDU@AAAI 2021, incorporates training strategies such as adversarial training [29] and task-adaptive pre-training [30]. Following a strategy similar to the recent works [19, 21], we develop a binary classification model for acronym disambiguation.
Category            | # Sentences (Train / Dev / Test) | # Acronyms (Train / Dev / Test) | Avg. # sentences per acronym (Train / Dev / Test)
Legal English       |  2,949 /   385 /   383           |   242 /  31 /  30               | 12.186 / 12.419 / 12.767
Scientific English  |  7,532 /   894 /   574           |   405 /  52 /  40               | 18.598 / 17.192 / 14.350
French              |  7,851 /   909 /   813           |   541 /  68 /  60               | 14.512 / 13.368 / 13.550
Spanish             |  6,267 /   818 /   862           |   437 /  56 /  53               | 14.341 / 14.607 / 16.264
Total               | 24,599 / 3,006 / 2,632           | 1,625 / 207 / 183               | 15.138 / 14.522 / 14.383
Table 1: Statistics of the labeled datasets



2.2. Pre-trained Language Models

Significant progress has been made across many natural language processing (NLP) tasks by pre-trained language models trained on large-scale unlabeled corpora. Based on the transformer architecture [31], a family of large-scale pre-trained language models has been developed, including BERT [17], RoBERTa [27], GPT [32] and T5 [18]. Since these models are pre-trained on datasets primarily consisting of English text, multilingual models such as mBERT [33] and mT5 [22] have also been presented. To process the multilingual texts in the datasets published in the SDU@AAAI-22 shared task on acronym disambiguation, we use both T5 and mT5 to encode input texts.

2.3. Weak Supervision

Modern machine learning models generally need large amounts of hand-labeled training data to perform well [34]. Since creating hand-labeled training datasets is time-consuming and expensive, recent works rely on weak supervision to generate noisy datasets [35, 36, 37, 38, 39, 40, 41, 42]. Distant supervision, one of the most popular techniques for weak supervision, utilizes external knowledge bases to produce noisy labels [35, 36, 43]. Other works obtain noisy labels by using crowdsourcing [40, 41, 42] or simple heuristic rules [44, 37]. The system proposed in [39] automatically generates heuristics to assign training labels to large-scale unlabeled data. Similar to the works in [35, 36, 43] based on distant supervision, we use the relationships between acronyms and their possible long-forms as the weak supervision source.

3. Acronym Disambiguation Model

We first provide the problem definition of acronym disambiguation. We next present the overall architecture and details of our proposed model.

3.1. Problem Definition

The problem of acronym disambiguation is defined as a classification problem [5]. Given a dictionary 𝒜 which maps acronyms to candidate long-forms (or expansions), let 𝒜(𝑎) = {𝑒_{𝑎,1}, ..., 𝑒_{𝑎,𝑚(𝑎)}} be the set of all candidate long-forms of an acronym 𝑎, where 𝑚(𝑎) is the size of the set. Then, for an input sentence 𝑠 = ⟨𝑤_1, 𝑤_2, ..., 𝑤_𝑛⟩ consisting of 𝑛 tokens (i.e., 𝑤_1, ..., 𝑤_𝑛) and an acronym 𝑎 = ⟨𝑤_𝑖, ..., 𝑤_𝑗⟩ with 1 ≤ 𝑖 ≤ 𝑗 ≤ 𝑛, which is a contiguous subsequence of 𝑠, we want to predict the correct long-form of the acronym 𝑎 among the candidate long-forms in 𝒜(𝑎). Note that we represent a text as a sequence of tokens by using a tokenizer such as WordPiece [45] or SentencePiece [46]. Following the existing works [19, 21], we simplify the problem into a binary classification problem. In other words, given an input sentence 𝑠, an acronym 𝑎 appearing in 𝑠 and a candidate long-form 𝑒_{𝑎,𝑘} in 𝒜(𝑎), we predict the label 𝑦, which is 1 if 𝑒_{𝑎,𝑘} is the correct long-form of 𝑎 in the context of 𝑠, and 0 otherwise.
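To make the reduction concrete, the following minimal Python sketch (the function and variable names are ours, not from the paper's code) enumerates the binary examples induced by the dictionary entry for "DP" from Figure 1.

```python
# A minimal sketch of the binary reduction: each candidate long-form in
# A(a) yields one (sentence, candidate, label) example, with y = 1 only
# for the correct long-form.
dictionary = {
    "DP": ["Dynamic Programming", "Dependency Parsing", "Dirichlet Process"],
}

def binary_examples(sentence, acronym, correct_long_form):
    """Enumerate the m(a) binary examples induced by one labeled sentence."""
    for candidate in dictionary[acronym]:
        y = 1 if candidate == correct_long_form else 0
        yield sentence, candidate, y

s = ("Since our generative models are based on DP priors, they are "
     "designed to favor a small number of unique entities per image.")
for sent, cand, y in binary_examples(s, "DP", "Dirichlet Process"):
    print(y, cand)  # 0 Dynamic Programming / 0 Dependency Parsing / 1 Dirichlet Process
```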
3.2. Model Architecture

We provide an illustration of the proposed model in Figure 2. The model consists of an encoder, which transforms an input token sequence into a vector representation, and a multi-layer perceptron (MLP) with a sigmoid activation function that outputs the prediction. We use the encoder layers of pre-trained language models such as T5 [18] or mT5 [22] to encode the input tokens, and take the hidden state of the first token as the encoder output. The encoder takes as input the concatenation of the input long-form 𝑒_{𝑎,𝑗} and the sentence 𝑠 [19]. A separator symbol (i.e., [SEP]) is used to separate them. In other words, using the symbol ⊕ to represent the concatenation of two token sequences, the input token sequence 𝑥 of the encoder is defined as

    𝑥 = 𝑒_{𝑎,𝑗} ⊕ ⟨[SEP]⟩ ⊕ 𝑠.    (1)

We also insert two special tokens [BOA] and [EOA] before and after the acronym 𝑎 in 𝑠 to highlight the position of the acronym.
For example, consider the input sentence containing the acronym "DP" and one of its candidate long-forms, "Dynamic Programming", in Figure 1. As shown in Figure 2, the encoder takes as input the token sequence obtained by concatenating "Dynamic Programming", [SEP] and the input sentence. The encoder converts the input token sequence 𝑥 into a vector representation ℎ ∈ ℝ^𝑑, where 𝑑 is the number of hidden units. The MLP layer is used to compute the prediction score 𝑝 from ℎ. That is,

    𝑝 = sigmoid(𝑊^⊤ ℎ + 𝑏),    (2)

where 𝑊 ∈ ℝ^𝑑 and 𝑏 ∈ ℝ are the parameters of the MLP layer. We interpret 𝑝 as the probability that the input long-form 𝑒_{𝑎,𝑗} is the correct long-form of the acronym 𝑎 in 𝑠.
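The following PyTorch sketch mirrors Figure 2 and Equations (1)-(2). It assumes the Hugging Face T5EncoderModel; the single linear layer follows Equation (2), while details such as how the [SEP]/[BOA]/[EOA] tokens enter the vocabulary are our assumptions, not the paper's released code.

```python
# A minimal PyTorch sketch of the encoder + MLP classifier of Section 3.2,
# assuming the Hugging Face `transformers` library.
import torch
from torch import nn
from transformers import AutoTokenizer, T5EncoderModel

class AcronymDisambiguator(nn.Module):
    def __init__(self, name="t5-base"):
        super().__init__()
        self.encoder = T5EncoderModel.from_pretrained(name)
        d = self.encoder.config.d_model      # number of hidden units d
        self.mlp = nn.Linear(d, 1)           # W in R^d and b in R, as in Eq. (2)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        h = out.last_hidden_state[:, 0, :]   # hidden state of the first token
        return torch.sigmoid(self.mlp(h)).squeeze(-1)  # prediction score p

tokenizer = AutoTokenizer.from_pretrained("t5-base")
tokenizer.add_tokens(["[SEP]", "[BOA]", "[EOA]"])   # assumed special tokens
model = AcronymDisambiguator()
model.encoder.resize_token_embeddings(len(tokenizer))

# Equation (1): x = e ⊕ [SEP] ⊕ s, with the acronym marked by [BOA]/[EOA].
x = ("Dynamic Programming [SEP] Since our generative models are based on "
     "[BOA] DP [EOA] priors, they are designed to favor a small number of "
     "unique entities per image.")
batch = tokenizer(x, return_tensors="pt")
p = model(**batch)   # probability that the candidate long-form is correct
```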
Given a set of 𝑁 sentences 𝒮 = {𝑠_1, ..., 𝑠_𝑁}, let 𝑎_𝑖 be the acronym contained in the sentence 𝑠_𝑖. For every pair of a sentence 𝑠_𝑖 ∈ 𝒮 and a long-form 𝑒_{𝑎_𝑖,𝑗} ∈ 𝒜(𝑎_𝑖), we obtain an input token sequence 𝑥_{𝑖,𝑗} by Equation (1), as well as its corresponding label 𝑦_{𝑖,𝑗}. Thus, from the sentence set 𝒮, we can build a training dataset 𝒟 = {(𝑥_{𝑖,𝑗}, 𝑦_{𝑖,𝑗}) | 1 ≤ 𝑖 ≤ 𝑁, 1 ≤ 𝑗 ≤ 𝑚(𝑎_𝑖)}. Let 𝑝_{𝑖,𝑗} denote the prediction score for 𝑥_{𝑖,𝑗}. We train the model on 𝒟 with the cross-entropy loss, defined as

    ℒ = − Σ_{𝑖=1}^{𝑁} Σ_{𝑗=1}^{𝑚(𝑎_𝑖)} [ 𝑦_{𝑖,𝑗} log 𝑝_{𝑖,𝑗} + (1 − 𝑦_{𝑖,𝑗}) log(1 − 𝑝_{𝑖,𝑗}) ].    (3)

At the inference stage, for an input sentence 𝑠 with an acronym 𝑎, we compute the prediction score for each candidate long-form in 𝒜(𝑎) and choose the one with the highest prediction score.
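Continuing the sketch above, training minimizes the loss of Equation (3) over 𝒟, and inference picks the highest-scoring candidate. The optimizer and learning rate follow Section 5.1; the rest of the setup is illustrative.

```python
# A sketch of training with the loss of Eq. (3) and of inference over A(a).
# `model` and `tokenizer` are the ones defined in the previous sketch.
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # cf. Section 5.1
loss_fn = torch.nn.BCELoss(reduction="sum")                # Eq. (3) over a batch

def train_step(input_ids, attention_mask, labels):
    p = model(input_ids=input_ids, attention_mask=attention_mask)
    loss = loss_fn(p, labels.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def predict(sentence, candidates):
    """Score every candidate long-form and return the highest-scoring one."""
    scores = []
    for e in candidates:
        batch = tokenizer(e + " [SEP] " + sentence, return_tensors="pt")
        scores.append(model(**batch).item())
    return max(zip(scores, candidates))[1]
```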
4. Datasets

We describe the labeled datasets published for the shared task on acronym disambiguation at the SDU@AAAI-22 workshop [47]. Moreover, we present the details of the additional datasets generated by our weak supervision method.

4.1. Labeled Datasets

The detailed statistics of the labeled datasets are provided in Table 1. The datasets consist of four categories (i.e., Legal English, Scientific English, French and Spanish). In total, there are 24,599, 3,006 and 2,632 sentences in the training, development and test datasets, respectively. Every sentence in the datasets has a single ambiguous acronym which is to be disambiguated. On average, an acronym appears in 14 or 15 sentences. As mentioned on the web page of the competition (https://sites.google.com/view/sdu-aaai22/shared-task), for each category there is no overlap of acronyms between any pair of the training, development and test datasets.

Table 2 shows the statistics of the dictionary for every category. A dictionary contains a mapping from an acronym to the set of its candidate long-forms. In the table, "Avg. Fanout" indicates the average number of candidate long-forms per acronym; over all categories, an acronym has 2.866 candidate long-forms on average.

Category            | # LFs | # ACs | Avg. Fanout
Legal English       | 1,126 |   456 | 2.469
Scientific English  | 2,275 |   671 | 3.390
French              | 2,578 |   926 | 2.784
Spanish             | 1,859 |   682 | 2.726
Total               | 7,838 | 2,735 | 2.866
Table 2: Statistics of the dictionaries (LF: long-form, AC: acronym)

4.2. Weakly Labeled Datasets

Among the acronyms in the dictionaries, 40.6% do not appear in the training dataset. To train the proposed model for such acronyms, we collect additional data by incorporating a weak supervision method [35]. Specifically, we first extract the sentences containing a long-form in the dictionaries from the English, French and Spanish Wikipedia dumps dated November 7, 2021. For each language, we do not use the long-forms whose number of occurrences in the Wikipedia dump is at least 1,000, since the pre-trained language models are likely to be already well-trained for such frequent long-forms. For each extracted sentence, we replace the long-form in the sentence with its acronym. We then assign the label 1 to the pair of the extracted sentence and the long-form, and the label 0 to every pair of the sentence and each of the other long-forms of the acronym.
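A minimal sketch of this labeling step is shown below. It assumes the candidate sentences have already been extracted from the Wikipedia dumps (with the frequency filter applied upstream); the function names and the example sentence are ours.

```python
# A sketch of weak labeling: replace the long-form with its acronym, label
# the (sentence, long-form) pair 1, and the other candidate long-forms 0.
dictionary = {
    "DP": ["Dynamic Programming", "Dependency Parsing", "Dirichlet Process"],
}

def weak_label(wiki_sentence, acronym, long_form):
    masked = wiki_sentence.replace(long_form, acronym)  # long-form -> acronym
    return [(masked, candidate, 1 if candidate == long_form else 0)
            for candidate in dictionary[acronym]]

# An illustrative (invented) Wikipedia sentence mentioning a long-form.
s = "A Dirichlet Process mixture favors a small number of clusters."
for sent, cand, y in weak_label(s, "DP", "Dirichlet Process"):
    print(y, cand, "|", sent)
```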
Let 𝑁_𝑠 be the maximum allowed number of sentences extracted from the Wikipedia dumps for a long-form. For each value of 𝑁_𝑠 in {1, 5, 10, 20}, we create a weakly labeled dataset. Let L and 𝑊_𝑘 denote the labeled dataset provided in the competition and the weakly labeled dataset generated with 𝑁_𝑠 = 𝑘, respectively. Then, we refer to the combination of the labeled dataset L and each of the weakly labeled datasets as L+𝑊_1, L+𝑊_5, L+𝑊_10 and L+𝑊_20, respectively. The statistics of the combined datasets are presented in Table 3. As an example, when 𝑁_𝑠 = 10, we obtain 17,254 additional sentences containing an acronym in the dictionaries by weak supervision, and the ratio of unseen acronyms in the training dataset is reduced from 40.6% to 21.6%.

Category            |      L | L+𝑊_1  | L+𝑊_5  | L+𝑊_10 | L+𝑊_20
Legal English       |  2,949 |  3,366 |  4,640 |  5,921 |  8,048
Scientific English  |  7,532 |  8,337 | 10,688 | 12,875 | 16,264
French              |  7,851 |  8,575 | 10,479 | 12,135 | 14,609
Spanish             |  6,267 |  6,980 |  9,036 | 10,922 | 13,788
Total               | 24,599 | 27,258 | 34,843 | 41,853 | 52,709
Table 3: Statistics of the labeled and weakly labeled datasets

5. Experiments

We first present the experimental setup and then report the results of our experiments, including the competition for acronym disambiguation.
5.1. Experimental Setup

We conduct all experiments on a single machine with an AMD EPYC Rome 7402P 24-core CPU and two NVIDIA GeForce RTX 3090 GPUs, under the PyTorch framework [48]. For each sentence, we consider a window of 64 tokens in which the acronym of the sentence is located in the middle, and use the sequence of tokens in that window for training. We set the batch size to 16 and use the Adam optimizer [49]. Furthermore, we use the union of the training datasets of all categories to train implementations of the proposed model for 10 epochs with a learning rate of 10^-5. Moreover, we apply dropout [50] to the encoder of the model with a dropout probability of 0.1.
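The 64-token window can be realized as below; this is a sketch under our own assumption about clamping at sentence boundaries, which the text above does not spell out.

```python
# A sketch of the 64-token window of Section 5.1: crop the token sequence so
# that the acronym sits in the middle of the window, clamping the window at
# the sentence boundaries (the exact boundary handling is our assumption).
def center_window(tokens, acronym_pos, window=64):
    start = max(0, min(acronym_pos - window // 2, len(tokens) - window))
    return tokens[start:start + window]

tokens = [f"tok{i}" for i in range(200)]
crop = center_window(tokens, acronym_pos=150)
print(len(crop), crop[0], crop[32])  # 64 tok118 tok150 (acronym mid-window)
```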
To evaluate the performance of the model, we use macro-averaged precision (P), recall (R) and F1 score (F1) computed for each long-form [15, 5] on the development and test datasets. Specifically, we first compute precision, recall and F1 score for each long-form and then report the average value over all long-forms for each measure. Furthermore, for the development data, we report the average value with its standard deviation obtained by training the models three times with different random seeds.
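For instance, the macro average can be computed per long-form as below; we use scikit-learn for illustration, though the shared task's official scorer may differ in detail.

```python
# A sketch of the macro-averaged metrics of Section 5.1: precision, recall
# and F1 are computed for each long-form and then averaged (toy data below).
from sklearn.metrics import precision_recall_fscore_support

gold = ["Dirichlet Process", "Dynamic Programming", "Dirichlet Process"]
pred = ["Dirichlet Process", "Dirichlet Process", "Dirichlet Process"]
p, r, f1, _ = precision_recall_fscore_support(
    gold, pred, average="macro", zero_division=0)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
```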
5.2. Experimental Results

Pre-trained models. We compare the performance of implementations of the proposed model with varying pre-trained models as the encoder. We use BERT [17], mBERT [33], RoBERTa [27], hdBERT [21], T5 [18] and mT5 [22] as the encoder. Since pre-trained models of various sizes are available for BERT and T5, we also vary the model size. While the default learning rate is 10^-5, we use a learning rate of 10^-6 for hdBERT since it yields better performance.

Table 4 shows the F1 score on the development dataset for each category. The results show that the implementation with T5-xlarge achieves the highest F1 score in every category except Spanish. The second best in terms of the F1 score over all categories is the implementation with mT5-xlarge as the encoder. Note that although T5 is pre-trained on English corpora, the model with the T5 encoder generalizes well to the other languages. As the size of a model increases, its accuracy tends to improve. However, the implementation with T5-xlarge performs better than that with mT5-xlarge since T5 is pre-trained with supervised training, while mT5 is not. Note that we could not evaluate larger pre-trained models such as T5-xxlarge and mT5-xxlarge due to the GPU memory limitations of our experimental environment.

Encoder               | # Params | Legal English | Scientific English | French       | Spanish      | All
BERT-base-cased [17]  |   108M   | 69.74 ± 3.21  | 65.37 ± 0.79       | 64.68 ± 0.98 | 66.64 ± 0.97 | 66.02 ± 0.42
T5E-base [18]         |   110M   | 66.94 ± 1.60  | 64.31 ± 1.02       | 66.42 ± 0.72 | 68.14 ± 1.08 | 66.32 ± 0.73
BERT-large-cased [17] |   334M   | 70.35 ± 1.57  | 66.48 ± 0.90       | 66.11 ± 0.63 | 66.90 ± 0.76 | 66.95 ± 0.52
mT5E-base [22]        |   277M   | 67.47 ± 3.37  | 62.47 ± 0.62       | 69.09 ± 1.24 | 72.88 ± 2.50 | 67.90 ± 1.59
RoBERTa-base [27]     |   125M   | 70.94 ± 2.30  | 67.82 ± 2.75       | 67.10 ± 1.68 | 71.64 ± 0.77 | 68.98 ± 0.37
mBERT-base-cased [33] |   178M   | 73.18 ± 2.46  | 66.74 ± 1.32       | 69.98 ± 1.28 | 76.74 ± 2.62 | 71.18 ± 0.91
hdBERT [21]           |   472M   | 71.03 ± 1.24  | 75.69 ± 0.49       | 67.81 ± 0.53 | 74.17 ± 0.79 | 72.25 ± 0.17
T5E-large [18]        |   335M   | 75.62 ± 1.39  | 72.85 ± 0.65       | 70.57 ± 0.46 | 72.91 ± 2.23 | 72.49 ± 0.22
mT5E-large [22]       |   564M   | 72.83 ± 0.90  | 69.62 ± 0.37       | 72.11 ± 1.18 | 78.35 ± 1.00 | 73.09 ± 0.51
mT5E-xlarge [22]      | 1,670M   | 75.44 ± 2.03  | 70.92 ± 0.88       | 72.49 ± 0.51 | 78.95 ± 0.88 | 74.08 ± 0.57
T5E-xlarge [18]       | 1,241M   | 78.73 ± 1.10  | 77.56 ± 0.63       | 72.69 ± 1.40 | 77.88 ± 0.73 | 76.24 ± 0.79
Table 4: F1 score with varying the encoder

Weak supervision. To confirm the effectiveness of the weakly labeled datasets, we train the proposed model, which uses T5-xlarge as the encoder, on both the labeled and weakly labeled datasets with varying 𝑁_𝑠 = 1, 5, 10, 20. We provide the results in Table 5. Recall that we use L and 𝑊_𝑘 to denote the labeled dataset and the weakly labeled dataset generated with 𝑁_𝑠 = 𝑘, respectively, as described in Section 4. The table shows that the F1 score becomes larger with increasing 𝑁_𝑠 for 𝑁_𝑠 = 1, 5, 10. However, when 𝑁_𝑠 = 20, the accuracy degrades since the skewness of the number of extracted sentences across long-forms increases. In other words, as 𝑁_𝑠 increases, the number of extracted sentences containing a frequent long-form grows, while the number of extracted sentences containing a rare long-form does not. Since the model performs the best when 𝑁_𝑠 = 10, we set 𝑁_𝑠 to 10 as the default value.

Data    | P            | R            | F1
L       | 79.43 ± 0.68 | 73.30 ± 0.89 | 76.24 ± 0.79
L+𝑊_1   | 81.05 ± 0.48 | 75.11 ± 0.50 | 77.97 ± 0.47
L+𝑊_5   | 81.54 ± 0.61 | 74.50 ± 0.15 | 77.86 ± 0.32
L+𝑊_10  | 81.78 ± 0.76 | 74.66 ± 0.77 | 78.06 ± 0.76
L+𝑊_20  | 81.14 ± 0.70 | 73.98 ± 0.33 | 77.40 ± 0.47
Table 5: Performance with the weakly labeled datasets

Table 6 presents some examples which are classified incorrectly with the labeled dataset only, but are classified correctly after training on both the labeled and weakly labeled datasets. The two rightmost columns show the prediction scores generated by the model trained using only the labeled dataset and using both the labeled and weakly labeled datasets with 𝑁_𝑠 = 10 (i.e., L+𝑊_10), respectively. Without the weakly labeled dataset, as shown in the table, the model fails to find the correct long-forms for the sentences. However, with the weakly labeled dataset, the prediction scores for the correct long-forms increase significantly.

Performance on the test dataset. We evaluate the implementations of our model with T5-xlarge and mT5-xlarge as the encoder after training them on both the labeled and weakly labeled datasets. When we use T5-xlarge, we set the learning rate to 9 × 10^-6 since a hyperparameter search shows that the model performs best with that learning rate. As shown in Table 7, in terms of the F1 score on the test dataset, the model with T5-xlarge performs the best for both the Legal English and Scientific English datasets. On the other hand, the model with mT5-xlarge shows better performance than that with T5-xlarge for the French and Spanish datasets. To further improve the performance of the best model in each category, we additionally train the best model using only the dataset of that category for 5 epochs with a learning rate of 10^-6. The results show that this category-wise fine-tuning improves the accuracy for every category.

SDU@AAAI-22 Shared Task: Acronym Disambiguation. In the competition, for each category, we submit the model that performed the best on the test dataset, as shown in Table 7. Our model's entries in the leaderboard (Table 8) are marked with an asterisk (bolded in the original). The results show that our model ranks 2nd for Legal English and 3rd for Scientific English and French.
Category           | Sentence                                                                                                                               | Acronym  | Correct expansion                                                   | Prediction (L) | Prediction (L+𝑊_10)
Legal English      | Slovakia welcomes the establishment of UN Women – the UN-Women.                                                                       | UN-Women | United Nations Entity for Gender Equality and Empowerment of Women | 0.678190       | 0.929378
Legal English      | There is no answer to the hopelessness and despair of the more than 30 million unemployed in the countries of the OECD.               | OECD     | Organization for Economic Cooperation and Development              | 0.202852       | 0.999734
Scientific English | The SGD is adopted to optimize the parameters.                                                                                         | SGD      | stochastic gradient descent                                        | 0.368887       | 0.998205
Scientific English | Specifically, we will interpolate the translation models as in Foster and Kuhn (2007), including a MAP combination (Bacchiani et al 2006). | MAP  | maximum a posteriori                                               | 0.184368       | 0.629905
French             | Il est entouré au Nord par l'Ouganda, à l'Est par la Tanzanie, au Sud par le Burundi et à l'Ouest par la RDC.                          | RDC      | République Démocratique du Congo                                   | 0.844930       | 0.999477
French             | De plus, il y a un représentant spécial adjoint du Secrétaire général résident à Chypre avec le rang de SSG.                           | SSG      | sous-secrétaire général                                            | 0.956114       | 0.998696
Spanish            | En cuanto al FMAM se sugirió que sería apropiado esperar hasta que se completara el debate actual sobre su reforma.                    | FMAM     | Fondo para el Medio Ambiente Mundial                               | 0.000304       | 0.999739
Spanish            | El Gobierno del Japón acoge con beneplácito la NEPAD que ha sido lanzada por los países africanos.                                     | NEPAD    | Nueva Alianza para el Desarrollo de África                         | 0.944804       | 0.990742
Table 6: Examples classified correctly by weak supervision in the development dataset

Category           | Model              | Dev P        | Dev R        | Dev F1       | Test P | Test R | Test F1
Legal English      | T5-xlarge          | 86.13 ± 0.55 | 76.11 ± 1.67 | 80.80 ± 0.88 | 84.64  | 76.71  | 80.48
Legal English      | mT5-xlarge         | 81.49 ± 1.62 | 72.22 ± 0.68 | 76.57 ± 0.36 | 82.95  | 72.80  | 77.54
Legal English      | T5-xlarge-finetune | 86.35 ± 0.21 | 78.16 ± 0.32 | 82.05 ± 0.24 | 85.52  | 77.12  | 81.11
Scientific English | T5-xlarge          | 81.72 ± 0.50 | 75.59 ± 1.15 | 78.54 ± 0.82 | 87.21  | 81.36  | 84.18
Scientific English | mT5-xlarge         | 77.10 ± 2.58 | 67.00 ± 1.85 | 71.70 ± 2.16 | 82.85  | 75.62  | 79.07
Scientific English | T5-xlarge-finetune | 82.38 ± 0.40 | 76.23 ± 0.50 | 79.18 ± 0.44 | 88.36  | 81.85  | 84.98
French             | T5-xlarge          | 79.00 ± 0.07 | 70.35 ± 0.14 | 74.43 ± 0.11 | 79.98  | 69.29  | 74.25
French             | mT5-xlarge         | 77.66 ± 1.26 | 68.17 ± 1.91 | 72.60 ± 1.48 | 80.71  | 70.42  | 75.21
French             | mT5-xlarge-finetune| 77.39 ± 0.38 | 67.99 ± 0.45 | 72.39 ± 0.38 | 80.79  | 72.20  | 76.25
Spanish            | T5-xlarge          | 86.08 ± 0.39 | 77.97 ± 1.57 | 81.83 ± 1.01 | 84.31  | 75.36  | 79.58
Spanish            | mT5-xlarge         | 84.63 ± 3.29 | 78.83 ± 2.26 | 81.63 ± 2.69 | 86.27  | 76.16  | 80.90
Spanish            | mT5-xlarge-finetune| 86.55 ± 0.39 | 80.89 ± 0.17 | 83.63 ± 0.27 | 86.33  | 76.51  | 81.12
Table 7: Performance on the test dataset of each category

Rank | Legal English (P / R / F1) | Scientific English (P / R / F1) | French (P / R / F1)  | Spanish (P / R / F1)
1    | 0.94 / 0.87 / 0.90         | 0.97 / 0.94 / 0.96              | 0.89 / 0.79 / 0.84   | 0.91 / 0.85 / 0.88
2    | 0.86 / 0.77 / 0.81 *       | 0.95 / 0.90 / 0.93              | 0.85 / 0.73 / 0.78   | 0.88 / 0.79 / 0.83
3    | 0.82 / 0.80 / 0.81         | 0.88 / 0.82 / 0.85 *            | 0.81 / 0.72 / 0.76 * | 0.86 / 0.80 / 0.83
4    | 0.79 / 0.64 / 0.70         | 0.81 / 0.77 / 0.79              | 0.76 / 0.70 / 0.73   | 0.83 / 0.80 / 0.81
5    | 0.75 / 0.61 / 0.67         | 0.81 / 0.69 / 0.75              | 0.73 / 0.64 / 0.68   | 0.86 / 0.77 / 0.81
Table 8: Leaderboard (* marks our model's entries per Section 5.2; shown in bold in the original)
6. Conclusion

We propose a binary classification model for acronym disambiguation that utilizes large-scale pre-trained language models. To increase the size of the training datasets, we use a weak supervision approach to generate weakly labeled datasets. Experimental results show that training on both the labeled and weakly labeled datasets is beneficial to the accuracy of the proposed model. In the shared task on acronym disambiguation at the AAAI-22 Workshop on Scientific Document Understanding (SDU@AAAI-22), our model ranks within the top 3 in three of the four categories.

Acknowledgments

This work was supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2020-0-00857, Development of cloud robot intelligence augmentation, sharing and framework technology to integrate and enhance the intelligence of multiple robots). It was also supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. NRF-2020R1A2C1003576).

References

[1] W. Ammar, K. Darwish, A. El Kahki, K. Hafez, ICE-TEA: In-context expansion and translation of English abbreviations, in: International Conference on Intelligent Text Processing and Computational Linguistics, Springer, 2011, pp. 41–54.
[2] A. Barnett, Z. Doubleday, Meta-research: The growth of acronyms in the scientific literature, eLife 9 (2020) e60080.
[3] R. Islamaj Dogan, G. C. Murray, A. Névéol, Z. Lu, Understanding PubMed user search behavior through log analysis, Database 2009 (2009).
[4] Q. Jin, J. Liu, X. Lu, Deep contextualized biomedical abbreviation expansion, arXiv preprint arXiv:1906.03360 (2019).
[5] A. P. B. Veyseh, F. Dernoncourt, Q. H. Tran, T. H. Nguyen, What does this acronym mean? Introducing a new dataset for acronym identification and disambiguation, arXiv preprint arXiv:2010.14678 (2020).
[6] M. Zahariev, Automatic sense disambiguation for acronyms, in: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2004, pp. 586–587.
[7] H. L. Fred, T. O. Cheng, Acronymesis: the exploding misuse of acronyms, Texas Heart Institute Journal 30 (2003) 255.
[8] A. G. Ahmed, M. F. A. Hady, E. Nabil, A. Badr, A language modeling approach for acronym expansion disambiguation, in: International Conference on Intelligent Text Processing and Computational Linguistics, Springer, 2015, pp. 264–278.
[9] J. Charbonnier, C. Wartena, Using word embeddings for unsupervised acronym disambiguation (2018).
[10] S. Pakhomov, T. Pedersen, C. G. Chute, Abbreviation and acronym disambiguation in clinical discourse, in: AMIA Annual Symposium Proceedings, volume 2005, American Medical Informatics Association, 2005, p. 589.
[11] S. Moon, S. Pakhomov, G. B. Melton, Automated disambiguation of acronyms and abbreviations in clinical texts: window and training size considerations, in: AMIA Annual Symposium Proceedings, volume 2012, American Medical Informatics Association, 2012, p. 1310.
[12] S. Moon, B. McInnes, G. B. Melton, Challenges and practical approaches with word sense disambiguation of acronyms and abbreviations in the clinical domain, Healthcare Informatics Research 21 (2015) 35–42.
[13] Y. Wu, J. Xu, Y. Zhang, H. Xu, Clinical abbreviation disambiguation using neural word embeddings, in: Proceedings of BioNLP 15, 2015, pp. 171–176.
[14] R. Antunes, S. Matos, Biomedical word sense disambiguation with word embeddings, in: International Conference on Practical Applications of Computational Biology & Bioinformatics, Springer, 2017, pp. 273–279.
[15] M. Ciosici, T. Sommer, I. Assent, Unsupervised abbreviation disambiguation: Contextual disambiguation using word embeddings, arXiv preprint arXiv:1904.00929 (2019).
[16] I. Li, M. Yasunaga, M. Y. Nuzumlalı, C. Caraballo, S. Mahajan, H. Krumholz, D. Radev, A neural topic-attention model for medical term abbreviation disambiguation, arXiv preprint arXiv:1910.14076 (2019).
[17] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[18] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, arXiv preprint arXiv:1910.10683 (2019).
[19] C. Pan, B. Song, S. Wang, Z. Luo, BERT-based acronym disambiguation with multiple training strategies, arXiv preprint arXiv:2103.00488 (2021).
[20] A. Singh, P. Kumar, SciDr at SDU-2020: IDEAS - identifying and disambiguating everyday acronyms for scientific domain, arXiv preprint arXiv:2102.08818 (2021).
[21] Q. Zhong, G. Zeng, D. Zhu, Y. Zhang, W. Lin, B. Chen, J. Tang, Leveraging domain agnostic and specific knowledge for acronym disambiguation, in: SDU@AAAI, 2021.
[22] L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, A. Barua, C. Raffel, mT5: A massively multilingual pre-trained text-to-text transformer, arXiv preprint arXiv:2010.11934 (2020).
[23] A. P. B. Veyseh, N. Meister, S. Yoon, R. Jain, F. Dernoncourt, T. H. Nguyen, Multilingual acronym extraction and disambiguation shared tasks at SDU 2022, in: Proceedings of SDU@AAAI-22, 2022.
[24] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, P. Kuksa, Natural language processing (almost) from scratch, Journal of Machine Learning Research 12 (2011) 2493–2537.
[25] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, in: Advances in Neural Information Processing Systems, 2013, pp. 3111–3119.
[26] K. Kirchhoff, A. M. Turner, Unsupervised resolution of acronyms and abbreviations in nursing notes using document-level context models, in: Proceedings of the Seventh International Workshop on Health Text Mining and Information Analysis, 2016, pp. 52–60.
[27] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692 (2019).
[28] I. Beltagy, K. Lo, A. Cohan, SciBERT: A pretrained language model for scientific text, arXiv preprint arXiv:1903.10676 (2019).
[29] T. Miyato, A. M. Dai, I. Goodfellow, Adversarial training methods for semi-supervised text classification, arXiv preprint arXiv:1605.07725 (2016).
[30] S. Gururangan, A. Marasović, S. Swayamdipta, K. Lo, I. Beltagy, D. Downey, N. A. Smith, Don't stop pretraining: Adapt language models to domains and tasks, arXiv preprint arXiv:2004.10964 (2020).
[31] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[32] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, arXiv preprint arXiv:2005.14165 (2020).
[33] J. Devlin, Multilingual BERT readme, https://github.com/google-research/bert/blob/master/multilingual.md, 2018.
[34] C. Sun, A. Shrivastava, S. Singh, A. Gupta, Revisiting unreasonable effectiveness of data in deep learning era, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 843–852.
[35] M. Mintz, S. Bills, R. Snow, D. Jurafsky, Distant supervision for relation extraction without labeled data, in: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, 2009, pp. 1003–1011.
[36] A. Go, R. Bhayani, L. Huang, Twitter sentiment classification using distant supervision, CS224N project report, Stanford 1 (2009) 2009.
[37] A. Ratner, S. H. Bach, H. Ehrenberg, J. Fries, S. Wu, C. Ré, Snorkel: Rapid training data creation with weak supervision, in: Proceedings of the VLDB Endowment. International Conference on Very Large Data Bases, volume 11, NIH Public Access, 2017, p. 269.
[38] A. Ratner, B. Hancock, J. Dunnmon, R. Goldman, C. Ré, Snorkel MeTaL: Weak supervision for multi-task learning, in: Proceedings of the Second Workshop on Data Management for End-To-End Machine Learning, 2018, pp. 1–4.
[39] P. Varma, C. Ré, Snuba: Automating weak supervision to label training data, in: Proceedings of the VLDB Endowment. International Conference on Very Large Data Bases, volume 12, NIH Public Access, 2018, p. 223.
[40] N. Dalvi, A. Dasgupta, R. Kumar, V. Rastogi, Aggregating crowdsourced binary ratings, in: Proceedings of the 22nd International Conference on World Wide Web, 2013, pp. 285–294.
[41] Y. Zhang, X. Chen, D. Zhou, M. I. Jordan, Spectral methods meet EM: A provably optimal algorithm for crowdsourcing, Advances in Neural Information Processing Systems 27 (2014) 1260–1268.
[42] M. Joglekar, H. Garcia-Molina, A. Parameswaran, Comprehensive and reliable crowd assessment algorithms, in: 2015 IEEE 31st International Conference on Data Engineering, IEEE, 2015, pp. 195–206.
[43] E. Alfonseca, K. Filippova, J.-Y. Delort, G. Garrido, Pattern learning for relation extraction with hierarchical topic models (2012).
[44] A. J. Ratner, C. M. De Sa, S. Wu, D. Selsam, C. Ré, Data programming: Creating large training sets, quickly, Advances in Neural Information Processing Systems 29 (2016) 3567–3575.
[45] R. Sennrich, B. Haddow, A. Birch, Neural machine translation of rare words with subword units, arXiv preprint arXiv:1508.07909 (2015).
[46] T. Kudo, J. Richardson, SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing, arXiv preprint arXiv:1808.06226 (2018).
[47] A. P. B. Veyseh, N. Meister, S. Yoon, R. Jain, F. Dernoncourt, T. H. Nguyen, MACRONYM: A large-scale dataset for multilingual and multi-domain acronym extraction, arXiv preprint arXiv:2202.08316 (2022).
[48] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., PyTorch: An imperative style, high-performance deep learning library, Advances in Neural Information Processing Systems 32 (2019) 8026–8037.
[49] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).
[50] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, The Journal of Machine Learning Research 15 (2014) 1929–1958.