<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>G. Song); mr.hongrae.lee@gmail.com
(H. Lee); kshim@snu.ac.kr (K. Shim)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>T5 Encoder Based Acronym Disambiguation with Weak Supervision</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gwangho Song</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hongrae Lee</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kyuseok Shim</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Google, Mountain View</institution>
          ,
          <addr-line>CA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Seoul National University</institution>
          ,
          <addr-line>Seoul</addr-line>
          ,
          <country country="KR">South Korea</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>An acronym is a word formed by abbreviating a phrase by combining certain letters of words in the phrase into a single term. Acronym disambiguation task selects the correct expansion of an ambiguous acronym in a sentence among the candidate expansions in a dictionary. Although it is convenient to use acronyms, identifying the appropriate expansions of an acronym in a sentence is a dificult task in natural language processing. Based on the recent success of the large-scale pre-trained language models such as BERT and T5, we propose a binary classification model using those language models for acronym disambiguation. To overcome the limited coverage of a training data, we use a weak supervision approach to increase the training data. Specifically, after collecting sentences containing an expansion of an acronym from Wikipedia, we replace the expansion with its acronym and label the sentence with the expansion. By conducting extensive experiments, we show the efectiveness of the proposed model. Our model is placed in the top 3 models for three of four categories in SDU@AAAI-22 shared task 2: Acronym Disambiguation.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;acronym disambiguation</kwd>
        <kwd>natural language processing</kwd>
        <kwd>deep learning</kwd>
        <kwd>weak supervision</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>Input:</title>
        <p>- Sentence: Since our generative models are
based on DP priors, they are designed to
favor a small number of unique entities per image.</p>
        <p>
          An acronym is a word formed by abbreviating a phrase
which is called a long-form or an expansion (e.g., AAAI
for Association for the Advancement of Artificial
Intelligence). Due to its brevity, its usage is ubiquitous in ⎧ Dynamic Programming
many literature and documents, especially in scientific - Dictionary: DP ⎨ Dependency Parsing
and biomedical fields [
          <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4">1, 2, 3, 4, 5</xref>
          ]. A report found that ⎩ Dirichlet Process
more than 63% of the articles in English Wikipedia
contain at least one abbreviation [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Furthermore, among Output: Dirichlet Process
more than 24 million article titles and 18 million article
abstracts published between 1950 and 2019, there is at Figure 1: An example of acronym disambiguation
least one acronym in 19% of the titles and 73% of the
abstracts [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
        </p>
        <p>
          Acronyms frequently have multiple long-forms, and disambiguation task is important and challenging.
only one of them is valid for a specific context. For exam- The goal of acronym disambiguation(AD) is to select
ple, in a 2001 version of the WWWAAS (World-Wide Web the correct long-form of an ambiguous acronym in a
senAcronym and Abbreviation Server) database, 47.97% of tence among the candidate long-forms in a dictionary.
acronyms have multiple expansions [6]. As another ex- Figure 1 shows an example of acronym disambiguation.
ample, in the SciAD dataset released by SDU@AAAI 2021 A sentence containing an ambiguous acronym “DP” and
Shared Task: Acronym Disambiguation [5], an acronym a dictionary with the long-forms of “DP” are given as
has 3.1 long-forms on average and up to 20 long-forms. the input. In the dictionary, the acronym “DP” has three
When sucfiient context is not available, this leads to the possible long-forms: “Dynamic Programming”,
“Depenambiguity of the meaning of acronyms and creates seri- dency Parsing” and “Dirichlet Process”. According to
ous understanding dificulties [
          <xref ref-type="bibr" rid="ref2">2, 7, 8, 9</xref>
          ]. Thus, acronym the context of the input sentence, since “DP” stands for
“Dirichlet Process”, a model outputs “Dirichlet Process”
as its expansion.
        </p>
        <p>The problem of acronym disambiguation is usually cast
as a classification problem whose goal is to determine
whether a long-form has the same meaning with the
acronym in an input sentence. Early approaches [10, 11,</p>
        <p>The acronym
in the sentence 
An input sentence  = Since our generative models are based on
[SOA] DP [EOA] priors, they are designed to
favor a small number of unique entities per image.
lonAgc-faonrdmidfaoter    , = Dynamic Programming
prediction score</p>
        <p>MLP</p>
        <p>ℎ</p>
        <p>
          Encoder
 =   , ⊕ [SEP] ⊕ 
12, 6] rely on the traditional classification models such as
SVMs, decision trees and naive Bayes classifiers. As deep
learning becomes more mainstream in natural language
processing, several works employ contextualized word
embeddings to create semantic representations of
longforms and context [9, 13, 14, 15, 16]. Moreover, with the
recent success of the pre-trained language models such
as BERT [17] and T5 [18] in natural language processing,
classification models for acronym disambiguation are
developed based on the pre-trained language models [
          <xref ref-type="bibr" rid="ref4">4,
19, 20, 21</xref>
          ].
        </p>
        <p>To study multilingual acronym disambiguation, we
develop a binary classification model by utilizing T5 [ 18],
which is one of the most popular pre-trained language
models, as well as mT5 [22] which is a multilingual
variant of T5. We evaluate the proposed model on the dataset
released by SDU@AAAI 2022 Shared Task: Acronym
Disambiguation [23]. Since the acronyms in the test
dataset do not appear in the training dataset, the training
dataset provided in the competition may not be suficient
to solve the problem. Thus, we use a weak supervision
approach to increase the training dataset. By training on
the provided training dataset as well as the weakly
labeled training dataset generated by our weak supervision
method, the proposed model ranks in the top 3 place for
three of four categories in SDU@AAAI-22 shared task 2:
Acronym Disambiguation.</p>
        <p>The remainder of this paper is organized as follows.</p>
        <p>We provide related work in Section 2 and present our
proposed model in Section 3. In Section 4, we describe the
datasets used for training the model, including weakly
labeled datasets generated by weak supervision. Finally,
we discuss the experimental results in Section 5 and
summarize the paper in Section 6.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>In this section, we present the previous works on</title>
        <p>acronym disambiguation. We also summarize the
pretrained language models widely adopted in various
natural language processing. In addition, we introduce weak
supervision approaches to construct additional data.
2.1. Acronym Disambiguation
Early approaches [10, 11, 12, 6] rely on the traditional
classification models such as SVMs, decision trees and
naive Bayes classifiers. As deep learning becomes more
mainstream in natural language processing, several
works employ contextualized word embeddings to
create semantic representations of long-forms and context
[9, 13, 14, 15, 16]. The works in [13, 14] study the use of
word embeddings [24, 25] to build classifiers for clinical
abbreviation disambiguation. The UAD model proposed
in [15] creates word embeddings by using additional
unstructured text. The work in [9] compares the averaged
context vector of the words in a long-form of an acronym
with the weighted average vector of the words in the
context of the acronym based on word embeddings trained
on a domain-specific corpus. In [ 26], the proposed model
is trained to compute the similarity between a
candidate long-form and the context surrounding the target
acronym.</p>
        <p>
          Many works utilize deep neural architectures to
construct a classifier [
          <xref ref-type="bibr" rid="ref4">16, 8, 4, 19, 20, 21</xref>
          ]. At the
AAAI21 Workshop on Scientific Document Understanding
(SDU@AAAI-21), the top ranked participants [20, 19, 21]
present models for acronym disambiguation based on
pre-trained language models such as RoBERTa [27] and
SciBERT [28]. In [20], the problem of acronym
disambiguation is treated as a span prediction problem, and the
proposed model predicts the span containing the correct
long-form from the concatenation of an input sentence
and candidate long-forms of the acronym in the sentence.
        </p>
        <p>The hdBERT model proposed in [21] combines RoBERTa
and SciBERT to capture both domain agnostic and
domain specific information. The work in [ 19], which is
the winner of the shared task of acronym disambiguation
held under the workshop SDU@AAAI 2021, incorporates
training strategies such as adversarial training [29] and
task-adaptive pre-training [30]. Following a similar
strategy to the recent works [19, 21], we develop a binary
classification model for acronym disambiguation.
Total
# Sentences</p>
        <p># Acronyms
Train
2.2. Pre-trained Language Models
3.1. Problem Definition
There has been significant progress across many natu- The problem of acronym disambiguation is defined as a
ral language processing (NLP) tasks by the pre-trained classification problem [ 5]. Given a dictionary  which is
language models trained on large-scale unlabeled cor- a mapping of acronyms to candidate long-forms (or
expora. Based on the transformer architecture [31], a set of pansions), let () = {,1, . . . , ,()} be the set
large-scale pre-trained language models are developed, of all candidate long-forms of an acronym , where
including BERT [17], RoBERTa [27], GPT [32] and T5 () is the size of the set. Then, for an input
sen[18]. Since these models are pre-trained on datasets pri- tence  = ⟨1, 2, . . . , ⟩ consisting of  tokens (i.e.,
marily consisting of English text, multilingual models 1, . . . , ) and an acronym  = ⟨, . . . ,  ⟩ with
such as mBERT [33] and mT5 [22] are presented. To pro- 1 ≤  ≤  ≤  which is a contiguous subsequence
cess multilingual texts in the datasets published in the of , we want to predict the correct long-form of the
shared task for acronym disambiguation in the workshop acronym  among the candidate long-forms in ().
SDU@AAAI-22, we use both T5 and mT5 to encode input Note that we represent a text as a sequence of tokens
texts. by using a tokenizer such as WordPiece [45] and
SentencePiece [46]. Following the existing works [19, 21], we
2.3. Weak Supervision simplify the problem as a binary classification problem.
In other words, given an input sentence , an acronym 
appearing in  and a candidate long-form , in (),
we predict the label  which is 1 if , is the correct
long-form of  in the context of , and 0 otherwise.</p>
        <p>Modern machine learning models generally need a large
amount of hand-labeled training sets for performance
improvement [34]. Since creating hand-labeled training
datasets is time-consuming and expensive, recent works
rely on weak supervision to generate noisy datasets
[35, 36, 37, 38, 39, 40, 41, 42]. Distant supervision, one
of the most popular techniques for weak supervision,
utilizes external knowledge bases to produce noisy
labels [35, 36, 43] Other works obtain noisy labels by
using crowdsourcing [40, 41, 42] or simple heuristic rules
[44, 37]. The system proposed in [39] automatically
generates the heuristics to assign training labels to a
largescale unlabeled data. Similar to the works in [35, 36, 43]
based on distant supervision, we use the relationships
between acronyms and their possible long-forms as the
weak supervision sources.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Acronym Disambiguation</title>
    </sec>
    <sec id="sec-4">
      <title>Model</title>
      <sec id="sec-4-1">
        <title>We first provide the problem definition of acronym disambiguation. We next present the overall architecture and details of our proposed model.</title>
        <p>3.2. Model Architecture
We provide an illustration of the proposed model in
Figure 2. The model consists of an encoder, which
transforms an input token sequence into a vector
representation, and a multi-layer perceptron (MLP) with a sigmoid
activation function to output the prediction. We use the
pre-trained language models such as T5 [18] or mT5 [22]
encoder layers to encode the input tokens, and take the
hidden state of the first token as the encoder output. The
encoder takes as input the concatenation of the input
long-form , and the sentence  [19]. A separator
symbol (i.e., [SEP]) is used to separate them. In other words,
by using the symbol ⊕ to represent the concatenation of
two token sequences, the input token sequence  of the
encoder is defined as
 = , ⊕ ⟨ [SEP]⟩ ⊕ .
(1)</p>
      </sec>
      <sec id="sec-4-2">
        <title>We also insert two special tokens [BOA] and [EOA] be</title>
        <p>fore and after the acronym  in  to highlight the
position of the acronym. For example, consider the input
sentence containing the acronym “DP” and one of its
candidate long-form “Dynamic Programming” in Figure 1.</p>
        <p>As shown in Figure 2, the encoder takes as input the
token sequence obtained by concatenating “Dynamic
Programming”, [SEP] and the input sentence. The
encoder converts the input token sequence  into a vector
representation ℎ ∈ R, where  is the number of hidden
units. The MLP layer is used to compute the prediction
score  from ℎ. That is,
 = sigmoid(  ℎ + ),
(2)
wlahyeerr.e We ∈intRerparentda∈s tRhearperopbaarbaimliteytetrhsaotftthheeinMpLuPt ELneggliaslh 2,949 3,366 4,640 5,921 8,048
lo niGngi-vfeo.nrma set,ofisthseencoternreccetslon=g-f{or1m, .o.f.t,hea}c,rolentym SEFcnireegnnlticsihfhic 77,,855312 88,,537357 1100,,467898 1122,,183755 1146,,620694
be the acronym contained in the sentence . For every Spanish 6,267 6,980 9,036 10,922 13,788
pair of a sentence  ∈  and a long-form , ∈ (), Total 24,599 27,258 34,843 41,853 52,709
wsaesenwotebenltlacieanssaietnts icno,pruwrteestopckoaennndbisnuegiqludlaeabnectlerain,,i.nbgTyhduEasqt,ausfaretotiomnt(h1=e), TStaabtliseti3cs of the labeled and weakly labeled datasets
{(, , , ) | 1 ≤  ≤ , 1 ≤  ≤ ()}. We train
the model on the training dataset . Let us denote the
prediction score for , by , . Then, we use the cross- shared-task) of the competition on acronym
disambiguaentropy loss to train the model on the training dataset tion, for each category, there is no overlap of acronyms
. In other words, the loss is defined as between any pair of the training, development and test
 () datasets. Table 2 shows the statistics of the dictionary
ℒ = − ∑︁ ∑︁ (, log , + (1 − , ) log (1 − , )). for every category. In the table, the “Avg. Fanout”
indi=1 =1 cates the average number of candidate long-forms for
(3) an acronym. A dictionary contains a mapping from an
At the inference stage, for an input sentence  with an acronym to the set of its candidate long-forms. The
numacronym , we compute the prediction score for each ber of occurrences of an acronym in the datasets of all
candidate long-form in () and choose the one with categories is 2.866 on average.
the highest prediction score.
4.2. Weakly Labeled Datasets</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Datasets</title>
      <sec id="sec-5-1">
        <title>Among the acronyms in the dictionaries, 40.6% of them</title>
        <p>do not appear in the training dataset. To train the
proWe describe the labeled datasets published for the posed model for such acronyms, we collect additional
shared task on acronym disambiguation in the workshop data by incorporating a weak supervision method [35].
SDU@AAAI-22 [47]. Moreover, we present the details of Specifically, we first extract the sentences containing a
additional datasets generated by our weak supervision long-form in the dictionaries from English, French and
method. Spanish Wikipedia dump dated November 7, 2021. For
each language, we do not use the long-form of every
4.1. Labeled Datasets acronym whose number of occurrences is at least 1,000 in
the Wikipedia dump, since the pre-trained language
modThe detailed statistics of the labeled datasets is provided els are likely to be well-trained for such frequent
longin Table 1. The datasets consist of four categories (i.e., forms. For each extracted sentence from Wikipedia, we
Legal English, Scientific English, French and Spanish). replace the long-form in the sentence with its acronym.
In total, there are 24,599, 3,006 and 2,632 sentences in We next assign 1 as the label for the pair of the extracted
the training, development and test datasets, respectively. sentence and the long-form, and 0 for every pair of
Every sentence in the datasets has a single ambiguous the sentence and each of the other long-forms of the
acronym which is to be disambiguated. On average, an acronym.
acronym appears in 14 or 15 sentences. As mentioned in Let  be the maximum allowed number of sentences
the web page (https://sites.google.com/view/sdu-aaai22/ extracted from the Wikipedia dumps for a long-form. For
BERT-base-cased [17]
T5E-base [18]
BERT-large-cased [17]
mT5E-base [22]
RoBERTa-base [27]
mBERT-base-cased [33]
hdBERT [21]
T5E-large [18]
mT5E-large [22]
mT5E-xlarge [22]
T5E-xlarge [18]
# Params
108M
110M
334M
277M
125M
178M
472M
335M
564M
1,670M
1,241M
and test datasets. Specifically, we first compute precision,
recall and F1 score for each long-form and then report
the average value of all long-forms for each measure.</p>
        <p>Furthermore, for the development data, we report the
average value with its standard deviation by training the
models three times with diferent random seeds.
5.2. Experimental Results
each value of  in {1, 5, 10, 20}, we create a weakly
labeled dataset. Let L and  denote the labeled dataset
provided in the competition and the weakly labeled
dataset generated with  = . Then, we refer to the
combination of the labeled dataset (L) and each of the
weakly labeled datasets as L+1, L+5, L+10 and
L+20, respectively. The statistics of the combined
datasets are presented in Table 3. As an example, when
 = 10, we obtain 17,254 additional sentences
containing an acronym in the dictionaries by weak supervision,
and the ratio of unseen acronyms in the training dataset
is reduced from 40.6% to 21.6%.</p>
        <p>Pre-trained models We compare the performance of
the implementations of the proposed model with varying
the pre-trained model of the encoder. We use BERT [17],
mBERT [33], RoBERTa [27], hdBERT [21], T5 [18] and
mT5 [22] as the encoder. Since pre-trained models with
5. Experiments various model sizes are available for BERT and T5, we
test them with varying the model size, too. While the
We first present the experimental setup and next report default learning rate is 10− 5, we use a learning rate of
the results of experiments including the competition for 10− 6 for hdBERT since we get a better performance with
acronym disambiguation. 10− 6.</p>
        <p>Table 4 shows the F1 score on the development dataset
5.1. Experimental Setup for each category. The results show that the
implementation with T5-xlarge achieves the highest performance in
We conduct all experiments on a single machine with an terms of the F1 score in every category except Spanish.
AMD EPYC Rome 7402P 24-Core CPU and two NVIDIA The second best in terms of the F1 score for all categories
GeForce RTX 3090 GPUs under PyTorch framework [48]. is the implementation with mT5-xlarge as the encoder.
For each sentence, we consider a window of 64 tokens Note that although T5 is pre-trained using English
corwhere the acronym in the sentence is located in the mid- pora, we can see that the model with the encoder of T5
dle of the window, and use the sequence of tokens in that is generalized well to the other languages. As the size of
window for training. We set the batch size to 16 and use a model increases, the accuracy of the model tends to be
Adam optimizer [49]. Furthermore, we use the union of improved. However, the implementation with T5-xlarge
the training datasets with all categories to train imple- performs better than that with mT5-xlarge since T5 is
mentations of the proposed model for 10 epochs with a pre-trained with supervised training, while mT5 is not.
learning late of 10− 5. Moreover, we apply dropout [50]
Note that we cannot evaluate the pre-trained models with
to the encoder of the model with a dropout probability a larger size such as T5-xxlarge and mT5-xxlarge models
of 0.1. due to GPU memory limitations used in our experiment.</p>
        <p>To evaluate the performance of the model, we use
macro-averaged precision (P), recall (R) and F1 score (F1)
computed for each long-form [15, 5] on the development</p>
      </sec>
      <sec id="sec-5-2">
        <title>Weak supervision To confirm the efectiveness of the weakly labeled datasets, we train the proposed model</title>
        <p>ifne-tuning improves the accuracy for every category.</p>
        <sec id="sec-5-2-1">
          <title>SDU@AAAI-22 Shared Task: Acronym Disambigua</title>
          <p>tion In the competition, for each category, we use the
model performed the best on the test dataset as shown in
Table 7. The bolded numbers in the table are the scores
of our model. The results show that our model ranks the
2nd place for Legal English and 3rd place for Scientific
English and French.
which uses T5-xlarge as the encoder on both the la- 6. Conclusion
beled and weakly labeled datasets with varying  =
1, 5, 10, 20. We provide the results in Table 5. Recall that We propose a binary classification model for acronym
diswe use L and Wk to denote the labeled dataset and the ambiguation by utilizing large-scale pre-trained language
weakly labeled dataset generated with  =  respec- models. To increase the size of the training datasets, we
tively, as described in Section 4. The table shows that use a weak supervision approach to generate weakly
the F1 score becomes larger with increasing the value labeled datasets. Experimental results show that
trainof  for  = 1, 5, 10. However, when  = 20, the ing on both labeled and weakly labeled datasets is
benaccuracy is degraded since the skewness of the number eficial to the accuracy of the proposed model. For the
of sentences containing an acronym increases. In other shared task on acronym disambiguation in the
AAAIwords, as  increases, the number of the extracted sen- 22 Workshop on Scientific Document Understanding
tences containing a frequent long-form becomes large, (SDU@AAAI-22), our model ranks within the 3rd place
while that of the extracted sentences containing rare long- in three of four categories.
form does not. Since the model performs the best when
 = 10, we set  to 10 as the default value.</p>
          <p>Table 6 presents some examples which are classified Acknowledgments
incorrectly with the labeled dataset only, but are
classiifed correctly after training on both labeled and weakly This work was supported by Institute of Information
labeled datasets. The two rightmost columns show the &amp; communications Technology Planning &amp; Evaluation
prediction scores generated by the model trained using (IITP) grant funded by the Korea government(MSIT) (No.
only the labeled dataset and using both the labeled and 2020-0-00857, Development of cloud robot intelligence
weakly labeled dataset with  = 10 (i.e., L+10), re- augmentation, sharing and framework technology to
inspectively. Without the weakly labeled dataset, as shown tegrate and enhance the intelligence of multiple robots).
in the table, the model fails to find the correct long-forms It was also supported by the National Research
Foundafor the sentences. However, by using the weakly labeled tion of Korea(NRF) grant funded by the Korea
governdataset, the prediction scores for the correct long-forms ment(MSIT) (No. NRF-2020R1A2C1003576).
increase significantly.</p>
          <p>Performance on the test dataset We evaluate the
implementations of our model with T5-xlarge and
mT5xlarge as the encoder after training them on both the
labeled and weakly labeled dataset. When we use
T5xlarge, we set the learning rate to 9 × 10− 6 since we
ifnd that the model performs the best with that learning
rate by a hyperparameter search. As shown in Table 7, in
terms of the F1 score on the test dataset, the model with
T5-xlarge performs the best for both Legal English and
Scientific English datasets. On the other hand, the model
with mT5-xlarge shows better performance than that
with T5-xlarge for French and Spanish datasets. To
further improve the performance of the best model in each
category, we additionally train the best model by using
only the dataset of the category for 5 epochs with a
learning rate of 10− 6. The results show that the category-wise</p>
          <p>Legal
English</p>
          <p>There is no answer to the hopelessness and despair
of the more than
30 million unemployed in the countries of the</p>
          <p>OECD.</p>
          <p>Scientific The SGD is adopted to optimize the parameters.</p>
          <p>English</p>
          <p>Specifically, we will interpolate the translation
Scientific models as in Foster and
English Kuhn (2007), including a MAP combination
(Bacchiani et al 2006).</p>
          <p>Il est entouré au Nord par l’Ouganda, à l’Est par
French la Tanzanie, au Sud par le Burundi et à l’Ouest par
la RDC.</p>
          <p>De plus, il y a un représentant spécial adjoint du
French Secrétaire général résident à Chypre avec le rang
de SSG.</p>
          <p>En cuanto al FMAMl se sugirió que sería apropiado
Spanish esperar hasta que se completara el debate actual
sobre su reforma.</p>
          <p>El Gobierno del Japón acoge con beneplácito la
Spanish NEPAD África que ha sido lanzada por los países
africanos.</p>
          <p>OECD
SGD
MAP
RDC
SSG
FMAM
NEPAD
Legal
English</p>
          <p>Slovakia welcomes the establishment of UN UN-Women
Women – the UN-Women.</p>
          <p>Prediction Prediction</p>
          <p>(L) (L+10)
[5] A. P. B. Veyseh, F. Dernoncourt, Q. H. Tran, T. H. Bert: Pre-training of deep bidirectional
transformNguyen, What does this acronym mean? introduc- ers for language understanding, arXiv preprint
ing a new dataset for acronym identification and arXiv:1810.04805 (2018).
disambiguation, arXiv preprint arXiv:2010.14678 [18] C. Rafel, N. Shazeer, A. Roberts, K. Lee, S. Narang,
(2020). M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the
[6] M. Zahariev, Automatic sense disambiguation for limits of transfer learning with a unified
text-toacronyms, in: Proceedings of the 27th annual inter- text transformer, arXiv preprint arXiv:1910.10683
national ACM SIGIR conference on Research and (2019).
development in information retrieval, 2004, pp. 586– [19] C. Pan, B. Song, S. Wang, Z. Luo, Bert-based
587. acronym disambiguation with multiple training
[7] H. L. Fred, T. O. Cheng, Acronymesis: the exploding strategies, arXiv preprint arXiv:2103.00488 (2021).
misuse of acronyms, Texas Heart Institute Journal [20] A. Singh, P. Kumar, Scidr at sdu-2020:
30 (2003) 255. Ideas–identifying and disambiguating everyday
[8] A. G. Ahmed, M. F. A. Hady, E. Nabil, A. Badr, A acronyms for scientific domain, arXiv preprint
language modeling approach for acronym expan- arXiv:2102.08818 (2021).
sion disambiguation, in: International Conference [21] Q. Zhong, G. Zeng, D. Zhu, Y. Zhang, W. Lin,
on Intelligent Text Processing and Computational B. Chen, J. Tang, Leveraging domain agnostic and
Linguistics, Springer, 2015, pp. 264–278. specific knowledge for acronym disambiguation.,
[9] J. Charbonnier, C. Wartena, Using word embed- in: SDU@ AAAI, 2021.</p>
          <p>dings for unsupervised acronym disambiguation [22] L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou,
(2018). A. Siddhant, A. Barua, C. Rafel, mt5: A massively
[10] S. Pakhomov, T. Pedersen, C. G. Chute, Abbrevi- multilingual pre-trained text-to-text transformer,
ation and acronym disambiguation in clinical dis- arXiv preprint arXiv:2010.11934 (2020).
course, in: AMIA annual symposium proceedings, [23] A. P. B. Veyseh, N. Meister, S. Yoon, R. Jain, F.
Dervolume 2005, American Medical Informatics Asso- noncourt, T. H. Nguyen, Multilingual acronym
ciation, 2005, p. 589. extraction and disambiguation shared tasks at sdu
[11] S. Moon, S. Pakhomov, G. B. Melton, Automated 2022, in: Proceedings of SDU@AAAI-22, 2022.
disambiguation of acronyms and abbreviations in [24] R. Collobert, J. Weston, L. Bottou, M. Karlen,
clinical texts: window and training size consider- K. Kavukcuoglu, P. Kuksa, Natural language
proations, in: AMIA annual symposium proceedings, cessing (almost) from scratch, Journal of machine
volume 2012, American Medical Informatics Asso- learning research 12 (2011) 2493–2537.
ciation, 2012, p. 1310. [25] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado,
[12] S. Moon, B. McInnes, G. B. Melton, Challenges and J. Dean, Distributed representations of words and
practical approaches with word sense disambigua- phrases and their compositionality, in: Advances
tion of acronyms and abbreviations in the clinical in neural information processing systems, 2013, pp.
domain, Healthcare informatics research 21 (2015) 3111–3119.</p>
          <p>35–42. [26] K. Kirchhof, A. M. Turner, Unsupervised resolution
[13] Y. Wu, J. Xu, Y. Zhang, H. Xu, Clinical abbreviation of acronyms and abbreviations in nursing notes
usdisambiguation using neural word embeddings, in: ing document-level context models, in: Proceedings
Proceedings of BioNLP 15, 2015, pp. 171–176. of the Seventh International Workshop on Health
[14] R. Antunes, S. Matos, Biomedical word sense disam- Text Mining and Information Analysis, 2016, pp.
biguation with word embeddings, in: International 52–60.</p>
          <p>Conference on Practical Applications of Computa- [27] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen,
tional Biology &amp; Bioinformatics, Springer, 2017, pp. O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov,
273–279. Roberta: A robustly optimized bert pretraining
ap[15] M. Ciosici, T. Sommer, I. Assent, Unsupervised ab- proach, arXiv preprint arXiv:1907.11692 (2019).
breviation disambiguation contextual disambigua- [28] I. Beltagy, K. Lo, A. Cohan, Scibert: A pretrained
tion using word embeddings, arXiv preprint language model for scientific text, arXiv preprint
arXiv:1904.00929 (2019). arXiv:1903.10676 (2019).
[16] I. Li, M. Yasunaga, M. Y. Nuzumlalı, C. Caraballo, [29] T. Miyato, A. M. Dai, I. Goodfellow, Adversarial
S. Mahajan, H. Krumholz, D. Radev, A neural training methods for semi-supervised text
classifitopic-attention model for medical term abbreviation cation, arXiv preprint arXiv:1605.07725 (2016).
disambiguation, arXiv preprint arXiv:1910.14076 [30] S. Gururangan, A. Marasović, S. Swayamdipta,
(2019). K. Lo, I. Beltagy, D. Downey, N. A. Smith, Don’t stop
[17] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, pretraining: adapt language models to domains and
tasks, arXiv preprint arXiv:2004.10964 (2020). [43] E. Alfonseca, K. Filippova, J.-Y. Delort, G. Garrido,
[31] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, Pattern learning for relation extraction with
hierarL. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, At- chical topic models (2012).
tention is all you need, in: Advances in neural [44] A. J. Ratner, C. M. De Sa, S. Wu, D. Selsam, C. Ré,
information processing systems, 2017, pp. 5998– Data programming: Creating large training sets,
6008. quickly, Advances in neural information processing
[32] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Ka- systems 29 (2016) 3567–3575.
plan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sas- [45] R. Sennrich, B. Haddow, A. Birch, Neural machine
try, A. Askell, et al., Language models are few-shot translation of rare words with subword units, arXiv
learners, arXiv preprint arXiv:2005.14165 (2020). preprint arXiv:1508.07909 (2015).
[33] J. Devlin, Multilingual bert readme, [46] T. Kudo, J. Richardson, Sentencepiece: A simple and
https://github.com/google-research/bert/blob/ language independent subword tokenizer and
detomaster/multilingual.md, 2018. kenizer for neural text processing, arXiv preprint
[34] C. Sun, A. Shrivastava, S. Singh, A. Gupta, Re- arXiv:1808.06226 (2018).</p>
          <p>visiting unreasonable efectiveness of data in deep [47] A. P. B. Veyseh, N. Meister, S. Yoon, R. Jain, F.
Derlearning era, in: Proceedings of the IEEE inter- noncourt, T. H. Nguyen, Macronym: A large-scale
national conference on computer vision, 2017, pp. dataset for multilingual and multi-domain acronym
843–852. extraction, arXiv preprint arXiv:1412.6980 (2022).
[35] M. Mintz, S. Bills, R. Snow, D. Jurafsky, Distant [48] A. Paszke, S. Gross, F. Massa, A. Lerer, J.
Bradsupervision for relation extraction without labeled bury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein,
data, in: Proceedings of the Joint Conference of L. Antiga, et al., Pytorch: An imperative style,
highthe 47th Annual Meeting of the ACL and the 4th In- performance deep learning library, Advances in
ternational Joint Conference on Natural Language neural information processing systems 32 (2019)
Processing of the AFNLP, 2009, pp. 1003–1011. 8026–8037.
[36] A. Go, R. Bhayani, L. Huang, Twitter sentiment [49] D. P. Kingma, J. Ba, Adam: A method for
stochasclassification using distant supervision, CS224N tic optimization, arXiv preprint arXiv:1412.6980
project report, Stanford 1 (2009) 2009. (2014).
[37] A. Ratner, S. H. Bach, H. Ehrenberg, J. Fries, S. Wu, [50] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever,
C. Ré, Snorkel: Rapid training data creation with R. Salakhutdinov, Dropout: a simple way to prevent
weak supervision, in: Proceedings of the VLDB En- neural networks from overfitting, The journal of
dowment. International Conference on Very Large machine learning research 15 (2014) 1929–1958.
Data Bases, volume 11, NIH Public Access, 2017, p.</p>
          <p>269.
[38] A. Ratner, B. Hancock, J. Dunnmon, R. Goldman,</p>
          <p>C. Ré, Snorkel metal: Weak supervision for
multitask learning, in: Proceedings of the Second
Workshop on Data Management for End-To-End
Machine Learning, 2018, pp. 1–4.
[39] P. Varma, C. Ré, Snuba: Automating weak
supervision to label training data, in: Proceedings of
the VLDB Endowment. International Conference
on Very Large Data Bases, volume 12, NIH Public</p>
          <p>Access, 2018, p. 223.
[40] N. Dalvi, A. Dasgupta, R. Kumar, V. Rastogi,
Aggregating crowdsourced binary ratings, in:
Proceedings of the 22nd international conference on World</p>
          <p>Wide Web, 2013, pp. 285–294.
[41] Y. Zhang, X. Chen, D. Zhou, M. I. Jordan, Spectral
methods meet em: A provably optimal algorithm
for crowdsourcing, Advances in neural information
processing systems 27 (2014) 1260–1268.
[42] M. Joglekar, H. Garcia-Molina, A. Parameswaran,</p>
          <p>Comprehensive and reliable crowd assessment
algorithms, in: 2015 IEEE 31st International Conference
on Data Engineering, IEEE, 2015, pp. 195–206.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>W.</given-names>
            <surname>Ammar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Darwish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>El Kahki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hafez</surname>
          </string-name>
          ,
          <article-title>Icetea: in-context expansion and translation of english abbreviations</article-title>
          ,
          <source>in: International Conference on Intelligent Text Processing and Computational Linguistics</source>
          , Springer,
          <year>2011</year>
          , pp.
          <fpage>41</fpage>
          -
          <lpage>54</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Barnett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Doubleday</surname>
          </string-name>
          , Meta-research:
          <article-title>The growth of acronyms in the scientific literature</article-title>
          ,
          <source>Elife</source>
          <volume>9</volume>
          (
          <year>2020</year>
          )
          <article-title>e60080</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R. Islamaj</given-names>
            <surname>Dogan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. C.</given-names>
            <surname>Murray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Névéol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <article-title>Understanding pubmed® user search behavior through log analysis</article-title>
          ,
          <source>Database</source>
          <year>2009</year>
          (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <article-title>Deep contextualized biomedical abbreviation expansion</article-title>
          , arXiv preprint arXiv:
          <year>1906</year>
          .
          <volume>03360</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>