=Paper=
{{Paper
|id=Vol-3164/paper18
|storemode=property
|title=T5 Encoder Based Acronym Disambiguation with Weak Supervision
|pdfUrl=https://ceur-ws.org/Vol-3164/paper18.pdf
|volume=Vol-3164
|authors=Gwangho Song,Hongrae Lee,Kyuseok Shim
|dblpUrl=https://dblp.org/rec/conf/aaai/SongLS22
}}
==T5 Encoder Based Acronym Disambiguation with Weak Supervision==
T5 Encoder Based Acronym Disambiguation with Weak Supervision

Gwangho Song (1), Hongrae Lee (2) and Kyuseok Shim (1, 3)
(1) Seoul National University, Seoul, South Korea
(2) Google, Mountain View, CA, USA
(3) Corresponding author

Scientific Document Understanding Workshop at AAAI 2022 (SDU@AAAI-22), March 1

Abstract: An acronym is a word formed by abbreviating a phrase, combining certain letters of the words in the phrase into a single term. The acronym disambiguation task selects the correct expansion of an ambiguous acronym in a sentence from the candidate expansions in a dictionary. Although acronyms are convenient to use, identifying the appropriate expansion of an acronym in a sentence is a difficult natural language processing task. Building on the recent success of large-scale pre-trained language models such as BERT and T5, we propose a binary classification model for acronym disambiguation based on those language models. To overcome the limited coverage of the training data, we use a weak supervision approach to enlarge it: after collecting sentences containing an expansion of an acronym from Wikipedia, we replace the expansion with its acronym and label the sentence with the expansion. Extensive experiments show the effectiveness of the proposed model. Our model places in the top 3 for three of the four categories in SDU@AAAI-22 Shared Task 2: Acronym Disambiguation.

Keywords: acronym disambiguation, natural language processing, deep learning, weak supervision

1. Introduction

An acronym is a word formed by abbreviating a phrase, which is called a long-form or an expansion (e.g., AAAI for Association for the Advancement of Artificial Intelligence). Due to their brevity, acronyms are ubiquitous in the literature, especially in the scientific and biomedical fields [1, 2, 3, 4, 5]. One report found that more than 63% of the articles in English Wikipedia contain at least one abbreviation [1]. Furthermore, among more than 24 million article titles and 18 million article abstracts published between 1950 and 2019, there is at least one acronym in 19% of the titles and 73% of the abstracts [2].

Acronyms frequently have multiple long-forms, and only one of them is valid in a specific context. For example, in a 2001 version of the WWWAAS (World-Wide Web Acronym and Abbreviation Server) database, 47.97% of acronyms have multiple expansions [6]. As another example, in the SciAD dataset released by the SDU@AAAI 2021 Shared Task: Acronym Disambiguation [5], an acronym has 3.1 long-forms on average and up to 20 long-forms. When sufficient context is not available, this makes the meaning of an acronym ambiguous and creates serious understanding difficulties [2, 7, 8, 9]. Thus, the acronym disambiguation task is both important and challenging.

Figure 1: An example of acronym disambiguation.
  Input sentence: "Since our generative models are based on DP priors, they are designed to favor a small number of unique entities per image."
  Dictionary entry: DP -> {Dynamic Programming, Dependency Parsing, Dirichlet Process}
  Output: Dirichlet Process

The goal of acronym disambiguation (AD) is to select the correct long-form of an ambiguous acronym in a sentence from the candidate long-forms in a dictionary. Figure 1 shows an example. A sentence containing the ambiguous acronym "DP" and a dictionary with the long-forms of "DP" are given as input. In the dictionary, the acronym "DP" has three possible long-forms: "Dynamic Programming", "Dependency Parsing" and "Dirichlet Process". Since "DP" stands for "Dirichlet Process" in the context of the input sentence, a model should output "Dirichlet Process" as its expansion.
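To make the task input and output concrete, the following sketch writes the Figure 1 instance as plain Python data. The field names are ours for illustration, not the official shared-task file format.

```python
# The acronym disambiguation instance of Figure 1 as plain Python data.
# Field names are illustrative, not the shared-task schema.
instance = {
    "sentence": ("Since our generative models are based on DP priors, "
                 "they are designed to favor a small number of unique "
                 "entities per image."),
    "acronym": "DP",
}

# Dictionary mapping the ambiguous acronym to its candidate long-forms.
dictionary = {
    "DP": ["Dynamic Programming", "Dependency Parsing", "Dirichlet Process"],
}

expected_output = "Dirichlet Process"  # the long-form valid in this context
```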
The problem of acronym disambiguation is usually cast as a classification problem whose goal is to determine whether a long-form has the same meaning as the acronym in an input sentence. Early approaches [10, 11, 12, 6] rely on traditional classification models such as SVMs, decision trees and naive Bayes classifiers. As deep learning became mainstream in natural language processing, several works employed contextualized word embeddings to create semantic representations of long-forms and context [9, 13, 14, 15, 16]. Moreover, with the recent success of pre-trained language models such as BERT [17] and T5 [18], classification models for acronym disambiguation have been built on top of such pre-trained models [4, 19, 20, 21].

To study multilingual acronym disambiguation, we develop a binary classification model utilizing T5 [18], one of the most popular pre-trained language models, as well as mT5 [22], a multilingual variant of T5. We evaluate the proposed model on the dataset released by the SDU@AAAI 2022 Shared Task: Acronym Disambiguation [23]. Since the acronyms in the test dataset do not appear in the training dataset, the training dataset provided in the competition may not be sufficient to solve the problem. Thus, we use a weak supervision approach to enlarge the training dataset. By training on the provided training dataset together with the weakly labeled training dataset generated by our weak supervision method, the proposed model ranks in the top 3 for three of the four categories in SDU@AAAI-22 Shared Task 2: Acronym Disambiguation.

The remainder of this paper is organized as follows. We review related work in Section 2 and present our proposed model in Section 3. In Section 4, we describe the datasets used for training the model, including the weakly labeled datasets generated by weak supervision. Finally, we discuss the experimental results in Section 5 and summarize the paper in Section 6.

2. Related Work

In this section, we review previous work on acronym disambiguation, summarize the pre-trained language models widely adopted in natural language processing, and introduce weak supervision approaches for constructing additional training data.

2.1. Acronym Disambiguation

Early approaches [10, 11, 12, 6] rely on traditional classification models such as SVMs, decision trees and naive Bayes classifiers. With the rise of deep learning in natural language processing, several works employ contextualized word embeddings to create semantic representations of long-forms and context [9, 13, 14, 15, 16]. The works in [13, 14] study the use of word embeddings [24, 25] to build classifiers for clinical abbreviation disambiguation. The UAD model proposed in [15] creates word embeddings by using additional unstructured text. The work in [9] compares the averaged context vector of the words in a long-form of an acronym with the weighted average vector of the words in the context of the acronym, based on word embeddings trained on a domain-specific corpus. In [26], the proposed model is trained to compute the similarity between a candidate long-form and the context surrounding the target acronym.
Many works utilize deep neural architectures to construct a classifier [16, 8, 4, 19, 20, 21]. At the AAAI-21 Workshop on Scientific Document Understanding (SDU@AAAI-21), the top-ranked participants [20, 19, 21] presented models for acronym disambiguation based on pre-trained language models such as RoBERTa [27] and SciBERT [28]. In [20], acronym disambiguation is treated as a span prediction problem, and the proposed model predicts the span containing the correct long-form from the concatenation of an input sentence and the candidate long-forms of the acronym in the sentence. The hdBERT model proposed in [21] combines RoBERTa and SciBERT to capture both domain-agnostic and domain-specific information. The work in [19], the winner of the acronym disambiguation shared task held at SDU@AAAI 2021, incorporates training strategies such as adversarial training [29] and task-adaptive pre-training [30]. Following a strategy similar to these recent works [19, 21], we develop a binary classification model for acronym disambiguation.

2.2. Pre-trained Language Models

Pre-trained language models trained on large-scale unlabeled corpora have brought significant progress across many natural language processing (NLP) tasks. Based on the transformer architecture [31], a family of large-scale pre-trained language models has been developed, including BERT [17], RoBERTa [27], GPT [32] and T5 [18]. Since these models are pre-trained on datasets consisting primarily of English text, multilingual models such as mBERT [33] and mT5 [22] have also been introduced. To process the multilingual texts in the datasets published for the acronym disambiguation shared task at SDU@AAAI-22, we use both T5 and mT5 to encode input texts.
2.3. Weak Supervision

Modern machine learning models generally need large hand-labeled training sets to perform well [34]. Since creating hand-labeled training datasets is time-consuming and expensive, recent works rely on weak supervision to generate noisy datasets [35, 36, 37, 38, 39, 40, 41, 42]. Distant supervision, one of the most popular weak supervision techniques, utilizes external knowledge bases to produce noisy labels [35, 36, 43]. Other works obtain noisy labels by crowdsourcing [40, 41, 42] or simple heuristic rules [44, 37]. The system proposed in [39] automatically generates heuristics to assign training labels to large-scale unlabeled data. Similar to the distant supervision works [35, 36, 43], we use the relationships between acronyms and their possible long-forms as the weak supervision source.

3. Acronym Disambiguation Model

We first define the acronym disambiguation problem and then present the overall architecture and details of our proposed model.

3.1. Problem Definition

The problem of acronym disambiguation is defined as a classification problem [5]. Given a dictionary $\mathcal{A}$ that maps acronyms to candidate long-forms (or expansions), let $\mathcal{A}(a) = \{e_{a,1}, \dots, e_{a,m(a)}\}$ be the set of all candidate long-forms of an acronym $a$, where $m(a)$ is the size of the set. Then, for an input sentence $s = \langle w_1, w_2, \dots, w_n \rangle$ consisting of $n$ tokens and an acronym $a = \langle w_i, \dots, w_j \rangle$ with $1 \le i \le j \le n$, which is a contiguous subsequence of $s$, we want to predict the correct long-form of the acronym $a$ among the candidate long-forms in $\mathcal{A}(a)$. Note that we represent a text as a sequence of tokens by using a tokenizer such as WordPiece [45] or SentencePiece [46]. Following the existing works [19, 21], we simplify the problem to binary classification: given an input sentence $s$, an acronym $a$ appearing in $s$ and a candidate long-form $e_{a,k} \in \mathcal{A}(a)$, we predict the label $y$, which is 1 if $e_{a,k}$ is the correct long-form of $a$ in the context of $s$, and 0 otherwise.

3.2. Model Architecture

Figure 2: An illustration of the proposed model. A candidate long-form $e_{a,j}$ (e.g., "Dynamic Programming") and the input sentence $s$, with the acronym $a$ marked by [BOA]/[EOA], are concatenated into $x = e_{a,j} \oplus [\mathrm{SEP}] \oplus s$, encoded into a hidden vector $h$, and passed through an MLP that outputs the prediction score $p$.

We provide an illustration of the proposed model in Figure 2. The model consists of an encoder, which transforms an input token sequence into a vector representation, and a multi-layer perceptron (MLP) with a sigmoid activation function that outputs the prediction. We use the encoder layers of a pre-trained language model such as T5 [18] or mT5 [22] to encode the input tokens, and take the hidden state of the first token as the encoder output. The encoder takes as input the concatenation of the candidate long-form $e_{a,j}$ and the sentence $s$ [19], separated by the separator symbol [SEP]. Using $\oplus$ to denote the concatenation of two token sequences, the input token sequence $x$ of the encoder is defined as

  $x = e_{a,j} \oplus \langle[\mathrm{SEP}]\rangle \oplus s.$  (1)

We also insert two special tokens [BOA] and [EOA] before and after the acronym $a$ in $s$ to highlight the position of the acronym. For example, consider the input sentence containing the acronym "DP" and one of its candidate long-forms, "Dynamic Programming", in Figure 1. As shown in Figure 2, the encoder takes as input the token sequence obtained by concatenating "Dynamic Programming", [SEP] and the input sentence.
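As a minimal sketch of this input construction, assuming whitespace tokens stand in for the output of the actual subword tokenizer, the sequence of Equation (1) with the [BOA]/[EOA] markers can be built as follows (the function name and indices are ours):

```python
def build_input(long_form, sentence_tokens, acr_start, acr_end):
    """Builds x = e + [SEP] + s of Equation (1), with [BOA]/[EOA]
    inserted around the acronym span sentence_tokens[acr_start..acr_end]."""
    marked = (sentence_tokens[:acr_start] + ["[BOA]"]
              + sentence_tokens[acr_start:acr_end + 1] + ["[EOA]"]
              + sentence_tokens[acr_end + 1:])
    return long_form.split() + ["[SEP]"] + marked

# Figure 2 example: "DP" is the eighth whitespace token (index 7).
tokens = ("Since our generative models are based on DP priors, they are "
          "designed to favor a small number of unique entities per image.").split()
x = build_input("Dynamic Programming", tokens, 7, 7)
```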
The encoder converts the input token sequence $x$ into a vector representation $h \in \mathbb{R}^d$, where $d$ is the number of hidden units. The MLP layer then computes the prediction score $p$ from $h$:

  $p = \mathrm{sigmoid}(W^T h + b),$  (2)

where $W \in \mathbb{R}^d$ and $b \in \mathbb{R}$ are the parameters of the MLP layer. We interpret $p$ as the probability that the input long-form $e_{a,j}$ is the correct long-form of the acronym $a$ in $s$.

Given a set of $N$ sentences $\mathcal{S} = \{s_1, \dots, s_N\}$, let $a_i$ be the acronym contained in the sentence $s_i$. For every pair of a sentence $s_i \in \mathcal{S}$ and a long-form $e_{a_i,j} \in \mathcal{A}(a_i)$, we obtain an input token sequence $x_{i,j}$ by Equation (1), together with its corresponding label $y_{i,j}$. Thus, from the sentence set $\mathcal{S}$, we build a training dataset $\mathcal{D} = \{(x_{i,j}, y_{i,j}) \mid 1 \le i \le N, 1 \le j \le m(a_i)\}$. Denoting the prediction score for $x_{i,j}$ by $p_{i,j}$, we train the model on $\mathcal{D}$ with the cross-entropy loss

  $\mathcal{L} = -\sum_{i=1}^{N} \sum_{j=1}^{m(a_i)} \big( y_{i,j} \log p_{i,j} + (1 - y_{i,j}) \log (1 - p_{i,j}) \big).$  (3)

At the inference stage, for an input sentence $s$ with an acronym $a$, we compute the prediction score of every candidate long-form in $\mathcal{A}(a)$ and choose the one with the highest score.
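A minimal PyTorch sketch of this architecture follows, assuming the HuggingFace transformers API for the T5 encoder. The class and variable names are ours, and details such as registering [BOA]/[EOA] as extra tokens and resizing the embeddings accordingly are omitted.

```python
import torch
import torch.nn as nn
from transformers import T5EncoderModel  # assumed HuggingFace dependency

class AcronymScorer(nn.Module):
    """Encoder + MLP scorer of Figure 2; a single linear layer followed
    by a sigmoid stands in for the MLP, matching Equation (2)."""

    def __init__(self, model_name="t5-base"):
        super().__init__()
        self.encoder = T5EncoderModel.from_pretrained(model_name)
        self.mlp = nn.Linear(self.encoder.config.d_model, 1)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        h = out.last_hidden_state[:, 0]                # first-token hidden state
        return torch.sigmoid(self.mlp(h)).squeeze(-1)  # prediction score p

# Training minimizes the cross-entropy of Equation (3):
#   loss = nn.functional.binary_cross_entropy(p, y.float())
# Inference scores every candidate long-form and keeps the argmax.
```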
4. Datasets

We describe the labeled datasets published for the acronym disambiguation shared task at SDU@AAAI-22 [47], and then the additional datasets generated by our weak supervision method.

4.1. Labeled Datasets

Detailed statistics of the labeled datasets are provided in Table 1. The datasets consist of four categories (Legal English, Scientific English, French and Spanish). In total, there are 24,599, 3,006 and 2,632 sentences in the training, development and test datasets, respectively. Every sentence in the datasets contains a single ambiguous acronym to be disambiguated. On average, an acronym appears in 14 or 15 sentences. As stated on the competition web page (https://sites.google.com/view/sdu-aaai22/shared-task), for each category there is no overlap of acronyms between any pair of the training, development and test datasets.

Table 1: Statistics of the labeled datasets
| Category | # Sentences (Train / Dev / Test) | # Acronyms (Train / Dev / Test) | Avg. # sentences per acronym (Train / Dev / Test) |
| Legal English | 2,949 / 385 / 383 | 242 / 31 / 30 | 12.186 / 12.419 / 12.767 |
| Scientific English | 7,532 / 894 / 574 | 405 / 52 / 40 | 18.598 / 17.192 / 14.350 |
| French | 7,851 / 909 / 813 | 541 / 68 / 60 | 14.512 / 13.368 / 13.550 |
| Spanish | 6,267 / 818 / 862 | 437 / 56 / 53 | 14.341 / 14.607 / 16.264 |
| Total | 24,599 / 3,006 / 2,632 | 1,625 / 207 / 183 | 15.138 / 14.522 / 14.383 |

Table 2 shows the statistics of the dictionary for every category. A dictionary maps each acronym to the set of its candidate long-forms, and "Avg. Fanout" denotes the average number of candidate long-forms per acronym. Across the dictionaries of all categories, an acronym has 2.866 candidate long-forms on average.

Table 2: Statistics of the dictionaries (LF: long-form, AC: acronym)
| Category | # LFs | # ACs | Avg. Fanout |
| Legal English | 1,126 | 456 | 2.469 |
| Scientific English | 2,275 | 671 | 3.390 |
| French | 2,578 | 926 | 2.784 |
| Spanish | 1,859 | 682 | 2.726 |
| Total | 7,838 | 2,735 | 2.866 |

4.2. Weakly Labeled Datasets

Among the acronyms in the dictionaries, 40.6% do not appear in the training dataset. To train the proposed model for such acronyms, we collect additional data with a weak supervision method [35]. Specifically, we first extract sentences containing a long-form from the dictionaries out of the English, French and Spanish Wikipedia dumps dated November 7, 2021. For each language, we do not use long-forms that occur at least 1,000 times in the Wikipedia dump, since the pre-trained language models are likely to be already well-trained for such frequent long-forms. For each extracted sentence, we replace the long-form in the sentence with its acronym. We then assign label 1 to the pair of the extracted sentence and that long-form, and label 0 to every pair of the sentence and each of the other long-forms of the acronym (a sketch of this labeling rule follows Table 3).

Let $N_s$ be the maximum allowed number of sentences extracted from the Wikipedia dumps for a long-form. For each value of $N_s$ in {1, 5, 10, 20}, we create a weakly labeled dataset. Let L denote the labeled dataset provided in the competition and $W_k$ the weakly labeled dataset generated with $N_s = k$; we refer to the combination of L with each weakly labeled dataset as L+$W_1$, L+$W_5$, L+$W_{10}$ and L+$W_{20}$, respectively. The statistics of the combined datasets are presented in Table 3. As an example, when $N_s = 10$, weak supervision yields 17,254 additional sentences containing an acronym in the dictionaries, and the ratio of unseen acronyms in the training dataset drops from 40.6% to 21.6%.

Table 3: Numbers of training sentences in the labeled and weakly labeled datasets
| Category | L | L+W1 | L+W5 | L+W10 | L+W20 |
| Legal English | 2,949 | 3,366 | 4,640 | 5,921 | 8,048 |
| Scientific English | 7,532 | 8,337 | 10,688 | 12,875 | 16,264 |
| French | 7,851 | 8,575 | 10,479 | 12,135 | 14,609 |
| Spanish | 6,267 | 6,980 | 9,036 | 10,922 | 13,788 |
| Total | 24,599 | 27,258 | 34,843 | 41,853 | 52,709 |
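The labeling rule of Section 4.2 reduces to a few lines. The following sketch (function and argument names are ours) produces one weakly labeled example per candidate long-form of the acronym:

```python
def weak_label(wiki_sentence, long_form, acronym, candidates):
    """Weak supervision rule of Section 4.2: replace the matched long-form
    with its acronym, label that pair 1 and every rival long-form 0.
    Returns (sentence, candidate long-form, label) triples."""
    sentence = wiki_sentence.replace(long_form, acronym)
    return [(sentence, lf, int(lf == long_form)) for lf in candidates]

examples = weak_label(
    "A Dirichlet Process is a distribution over distributions.",
    "Dirichlet Process", "DP",
    ["Dynamic Programming", "Dependency Parsing", "Dirichlet Process"],
)
# -> one positive pair for "Dirichlet Process", negative pairs for the others
```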
5. Experiments

We first present the experimental setup and then report the results of our experiments, including the competition for acronym disambiguation.

5.1. Experimental Setup

We conduct all experiments on a single machine with an AMD EPYC Rome 7402P 24-core CPU and two NVIDIA GeForce RTX 3090 GPUs, using the PyTorch framework [48]. For each sentence, we consider a window of 64 tokens positioned so that the acronym lies in the middle of the window, and use the token sequence in that window for training (a minimal sketch of this windowing follows Table 5). We set the batch size to 16 and use the Adam optimizer [49]. We train implementations of the proposed model on the union of the training datasets of all categories for 10 epochs with a learning rate of $10^{-5}$, and apply dropout [50] to the encoder with a dropout probability of 0.1.

To evaluate the performance of the model, we use macro-averaged precision (P), recall (R) and F1 score computed per long-form [15, 5] on the development and test datasets: we first compute precision, recall and F1 for each long-form and then report the average over all long-forms for each measure. Furthermore, for the development data, we report the mean and standard deviation over three trainings with different random seeds.

5.2. Experimental Results

Pre-trained models. We compare implementations of the proposed model with varying pre-trained encoders: BERT [17], mBERT [33], RoBERTa [27], hdBERT [21], T5 [18] and mT5 [22]. Since pre-trained models of various sizes are available for BERT and T5, we vary the model size as well. While the default learning rate is $10^{-5}$, we use $10^{-6}$ for hdBERT since it yields better performance.

Table 4 shows the F1 score on the development dataset for each category. The implementation with T5-xlarge achieves the highest F1 score in every category except Spanish, and the implementation with mT5-xlarge is second best in overall F1 score. Note that although T5 is pre-trained on English corpora, the model with the T5 encoder generalizes well to the other languages. As the size of a model increases, its accuracy tends to improve. The implementation with T5-xlarge performs better than that with mT5-xlarge, since T5 is pre-trained with supervised training while mT5 is not. We could not evaluate larger pre-trained models such as T5-xxlarge and mT5-xxlarge due to the GPU memory limitations of our experimental setup.

Table 4: F1 score on the development dataset with varying encoders
| Encoder | # Params | Legal English | Scientific English | French | Spanish | All |
| BERT-base-cased [17] | 108M | 69.74 ± 3.21 | 65.37 ± 0.79 | 64.68 ± 0.98 | 66.64 ± 0.97 | 66.02 ± 0.42 |
| T5E-base [18] | 110M | 66.94 ± 1.60 | 64.31 ± 1.02 | 66.42 ± 0.72 | 68.14 ± 1.08 | 66.32 ± 0.73 |
| BERT-large-cased [17] | 334M | 70.35 ± 1.57 | 66.48 ± 0.90 | 66.11 ± 0.63 | 66.90 ± 0.76 | 66.95 ± 0.52 |
| mT5E-base [22] | 277M | 67.47 ± 3.37 | 62.47 ± 0.62 | 69.09 ± 1.24 | 72.88 ± 2.50 | 67.90 ± 1.59 |
| RoBERTa-base [27] | 125M | 70.94 ± 2.30 | 67.82 ± 2.75 | 67.10 ± 1.68 | 71.64 ± 0.77 | 68.98 ± 0.37 |
| mBERT-base-cased [33] | 178M | 73.18 ± 2.46 | 66.74 ± 1.32 | 69.98 ± 1.28 | 76.74 ± 2.62 | 71.18 ± 0.91 |
| hdBERT [21] | 472M | 71.03 ± 1.24 | 75.69 ± 0.49 | 67.81 ± 0.53 | 74.17 ± 0.79 | 72.25 ± 0.17 |
| T5E-large [18] | 335M | 75.62 ± 1.39 | 72.85 ± 0.65 | 70.57 ± 0.46 | 72.91 ± 2.23 | 72.49 ± 0.22 |
| mT5E-large [22] | 564M | 72.83 ± 0.90 | 69.62 ± 0.37 | 72.11 ± 1.18 | 78.35 ± 1.00 | 73.09 ± 0.51 |
| mT5E-xlarge [22] | 1,670M | 75.44 ± 2.03 | 70.92 ± 0.88 | 72.49 ± 0.51 | 78.95 ± 0.88 | 74.08 ± 0.57 |
| T5E-xlarge [18] | 1,241M | 78.73 ± 1.10 | 77.56 ± 0.63 | 72.69 ± 1.40 | 77.88 ± 0.73 | 76.24 ± 0.79 |

Weak supervision. To confirm the effectiveness of the weakly labeled datasets, we train the proposed model with the T5-xlarge encoder on both the labeled and weakly labeled datasets with $N_s$ = 1, 5, 10, 20. The results are given in Table 5. Recall from Section 4 that L denotes the labeled dataset and $W_k$ the weakly labeled dataset generated with $N_s = k$. The F1 score grows with increasing $N_s$ for $N_s$ = 1, 5, 10. However, at $N_s$ = 20 the accuracy degrades because the skew in the number of sentences containing an acronym increases: as $N_s$ grows, the number of extracted sentences containing a frequent long-form becomes large while that of extracted sentences containing a rare long-form does not. Since the model performs best at $N_s$ = 10, we set $N_s$ to 10 as the default value.

Table 5: Performance with the weakly labeled datasets
| Data | P | R | F1 |
| L | 79.43 ± 0.68 | 73.30 ± 0.89 | 76.24 ± 0.79 |
| L+W1 | 81.05 ± 0.48 | 75.11 ± 0.50 | 77.97 ± 0.47 |
| L+W5 | 81.54 ± 0.61 | 74.50 ± 0.15 | 77.86 ± 0.32 |
| L+W10 | 81.78 ± 0.76 | 74.66 ± 0.77 | 78.06 ± 0.76 |
| L+W20 | 81.14 ± 0.70 | 73.98 ± 0.33 | 77.40 ± 0.47 |
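The 64-token context window of Section 5.1 can be cut as sketched below. This is a simplification under our own naming that clips at sentence boundaries rather than padding:

```python
def context_window(tokens, acr_idx, size=64):
    """Cuts the context window of Section 5.1 so that the acronym at
    position acr_idx sits roughly in the middle of `size` tokens."""
    start = max(0, min(acr_idx - size // 2, len(tokens) - size))
    return tokens[start:start + size]  # short sentences keep all tokens
```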
Qualitative examples. Table 6 presents examples that are classified incorrectly when training on the labeled dataset only, but correctly after training on both the labeled and weakly labeled datasets. The two rightmost columns show the prediction scores for the correct long-form produced by the model trained on the labeled dataset alone (L) and on both the labeled and weakly labeled datasets with $N_s = 10$ (L+W10), respectively. Without the weakly labeled dataset, the model fails to find the correct long-forms for these sentences; with it, the prediction scores for the correct long-forms increase significantly.

Table 6: Examples from the development dataset classified correctly after weak supervision
| Category | Sentence | Acronym | Correct expansion | Prediction (L) | Prediction (L+W10) |
| Legal English | Slovakia welcomes the establishment of UN Women – the UN-Women. | UN-Women | United Nations Entity for Gender Equality and Empowerment of Women | 0.678190 | 0.929378 |
| Legal English | There is no answer to the hopelessness and despair of the more than 30 million unemployed in the countries of the OECD. | OECD | Organization for Economic Cooperation and Development | 0.202852 | 0.999734 |
| Scientific English | The SGD is adopted to optimize the parameters. | SGD | stochastic gradient descent | 0.368887 | 0.998205 |
| Scientific English | Specifically, we will interpolate the translation models as in Foster and Kuhn (2007), including a MAP combination (Bacchiani et al 2006). | MAP | maximum a posteriori | 0.184368 | 0.629905 |
| French | Il est entouré au Nord par l'Ouganda, à l'Est par la Tanzanie, au Sud par le Burundi et à l'Ouest par la RDC. | RDC | République Démocratique du Congo | 0.844930 | 0.999477 |
| French | De plus, il y a un représentant spécial adjoint du Secrétaire général résident à Chypre avec le rang de SSG. | SSG | sous-secrétaire général | 0.956114 | 0.998696 |
| Spanish | En cuanto al FMAM se sugirió que sería apropiado esperar hasta que se completara el debate actual sobre su reforma. | FMAM | Fondo para el Medio Ambiente Mundial | 0.000304 | 0.999739 |
| Spanish | El Gobierno del Japón acoge con beneplácito la NEPAD que ha sido lanzada por los países africanos. | NEPAD | Nueva Alianza para el Desarrollo de África | 0.944804 | 0.990742 |

Performance on the test dataset. We evaluate the implementations of our model with T5-xlarge and mT5-xlarge as the encoder after training them on both the labeled and weakly labeled datasets. When using T5-xlarge, we set the learning rate to $9 \times 10^{-6}$, which a hyperparameter search found to perform best. As shown in Table 7, in terms of the F1 score on the test dataset, the model with T5-xlarge performs best on the Legal English and Scientific English datasets, while the model with mT5-xlarge performs better on the French and Spanish datasets. To further improve the best model in each category, we additionally fine-tune it on the dataset of that category only, for 5 epochs with a learning rate of $10^{-6}$. This category-wise fine-tuning improves the accuracy in every category.

Table 7: Performance on the development and test datasets of each category
| Category | Model | Dev P | Dev R | Dev F1 | Test P | Test R | Test F1 |
| Legal English | T5-xlarge | 86.13 ± 0.55 | 76.11 ± 1.67 | 80.80 ± 0.88 | 84.64 | 76.71 | 80.48 |
| Legal English | mT5-xlarge | 81.49 ± 1.62 | 72.22 ± 0.68 | 76.57 ± 0.36 | 82.95 | 72.80 | 77.54 |
| Legal English | T5-xlarge-finetune | 86.35 ± 0.21 | 78.16 ± 0.32 | 82.05 ± 0.24 | 85.52 | 77.12 | 81.11 |
| Scientific English | T5-xlarge | 81.72 ± 0.50 | 75.59 ± 1.15 | 78.54 ± 0.82 | 87.21 | 81.36 | 84.18 |
| Scientific English | mT5-xlarge | 77.10 ± 2.58 | 67.00 ± 1.85 | 71.70 ± 2.16 | 82.85 | 75.62 | 79.07 |
| Scientific English | T5-xlarge-finetune | 82.38 ± 0.40 | 76.23 ± 0.50 | 79.18 ± 0.44 | 88.36 | 81.85 | 84.98 |
| French | T5-xlarge | 79.00 ± 0.07 | 70.35 ± 0.14 | 74.43 ± 0.11 | 79.98 | 69.29 | 74.25 |
| French | mT5-xlarge | 77.66 ± 1.26 | 68.17 ± 1.91 | 72.60 ± 1.48 | 80.71 | 70.42 | 75.21 |
| French | mT5-xlarge-finetune | 77.39 ± 0.38 | 67.99 ± 0.45 | 72.39 ± 0.38 | 80.79 | 72.20 | 76.25 |
| Spanish | T5-xlarge | 86.08 ± 0.39 | 77.97 ± 1.57 | 81.83 ± 1.01 | 84.31 | 75.36 | 79.58 |
| Spanish | mT5-xlarge | 84.63 ± 3.29 | 78.83 ± 2.26 | 81.63 ± 2.69 | 86.27 | 76.16 | 80.90 |
| Spanish | mT5-xlarge-finetune | 86.55 ± 0.39 | 80.89 ± 0.17 | 83.63 ± 0.27 | 86.33 | 76.51 | 81.12 |

SDU@AAAI-22 Shared Task: Acronym Disambiguation. In the competition, for each category we submit the model that performed best on the test dataset, as shown in Table 7. Table 8 shows the leaderboard; the entries marked with * are the scores of our model. The results show that our model ranks 2nd for Legal English and 3rd for both Scientific English and French.

Table 8: Leaderboard of SDU@AAAI-22 Shared Task 2 (* = our model)
| Rank | Legal English (P / R / F1) | Scientific English (P / R / F1) | French (P / R / F1) | Spanish (P / R / F1) |
| 1 | 0.94 / 0.87 / 0.90 | 0.97 / 0.94 / 0.96 | 0.89 / 0.79 / 0.84 | 0.91 / 0.85 / 0.88 |
| 2 | 0.86 / 0.77 / 0.81 * | 0.95 / 0.90 / 0.93 | 0.85 / 0.73 / 0.78 | 0.88 / 0.79 / 0.83 |
| 3 | 0.82 / 0.80 / 0.81 | 0.88 / 0.82 / 0.85 * | 0.81 / 0.72 / 0.76 * | 0.86 / 0.80 / 0.83 |
| 4 | 0.79 / 0.64 / 0.70 | 0.81 / 0.77 / 0.79 | 0.76 / 0.70 / 0.73 | 0.83 / 0.80 / 0.81 |
| 5 | 0.75 / 0.61 / 0.67 | 0.81 / 0.69 / 0.75 | 0.73 / 0.64 / 0.68 | 0.86 / 0.77 / 0.81 * |

6. Conclusion

We propose a binary classification model for acronym disambiguation that utilizes large-scale pre-trained language models. To increase the size of the training datasets, we use a weak supervision approach to generate weakly labeled datasets. Experimental results show that training on both the labeled and weakly labeled datasets benefits the accuracy of the proposed model. In the shared task on acronym disambiguation at the AAAI-22 Workshop on Scientific Document Understanding (SDU@AAAI-22), our model ranks within the top 3 in three of the four categories.

Acknowledgments

This work was supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2020-0-00857, Development of cloud robot intelligence augmentation, sharing and framework technology to integrate and enhance the intelligence of multiple robots). It was also supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. NRF-2020R1A2C1003576).
References

[1] W. Ammar, K. Darwish, A. El Kahki, K. Hafez, ICE-TEA: In-context expansion and translation of English abbreviations, in: International Conference on Intelligent Text Processing and Computational Linguistics, Springer, 2011, pp. 41–54.
[2] A. Barnett, Z. Doubleday, Meta-research: The growth of acronyms in the scientific literature, eLife 9 (2020) e60080.
[3] R. Islamaj Dogan, G. C. Murray, A. Névéol, Z. Lu, Understanding PubMed user search behavior through log analysis, Database 2009 (2009).
[4] Q. Jin, J. Liu, X. Lu, Deep contextualized biomedical abbreviation expansion, arXiv preprint arXiv:1906.03360 (2019).
[5] A. P. B. Veyseh, F. Dernoncourt, Q. H. Tran, T. H. Nguyen, What does this acronym mean? Introducing a new dataset for acronym identification and disambiguation, arXiv preprint arXiv:2010.14678 (2020).
[6] M. Zahariev, Automatic sense disambiguation for acronyms, in: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2004, pp. 586–587.
[7] H. L. Fred, T. O. Cheng, Acronymesis: the exploding misuse of acronyms, Texas Heart Institute Journal 30 (2003) 255.
[8] A. G. Ahmed, M. F. A. Hady, E. Nabil, A. Badr, A language modeling approach for acronym expansion disambiguation, in: International Conference on Intelligent Text Processing and Computational Linguistics, Springer, 2015, pp. 264–278.
[9] J. Charbonnier, C. Wartena, Using word embeddings for unsupervised acronym disambiguation (2018).
[10] S. Pakhomov, T. Pedersen, C. G. Chute, Abbreviation and acronym disambiguation in clinical discourse, in: AMIA Annual Symposium Proceedings, volume 2005, American Medical Informatics Association, 2005, p. 589.
[11] S. Moon, S. Pakhomov, G. B. Melton, Automated disambiguation of acronyms and abbreviations in clinical texts: window and training size considerations, in: AMIA Annual Symposium Proceedings, volume 2012, American Medical Informatics Association, 2012, p. 1310.
[12] S. Moon, B. McInnes, G. B. Melton, Challenges and practical approaches with word sense disambiguation of acronyms and abbreviations in the clinical domain, Healthcare Informatics Research 21 (2015) 35–42.
[13] Y. Wu, J. Xu, Y. Zhang, H. Xu, Clinical abbreviation disambiguation using neural word embeddings, in: Proceedings of BioNLP 15, 2015, pp. 171–176.
[14] R. Antunes, S. Matos, Biomedical word sense disambiguation with word embeddings, in: International Conference on Practical Applications of Computational Biology & Bioinformatics, Springer, 2017, pp. 273–279.
[15] M. Ciosici, T. Sommer, I. Assent, Unsupervised abbreviation disambiguation: contextual disambiguation using word embeddings, arXiv preprint arXiv:1904.00929 (2019).
[16] I. Li, M. Yasunaga, M. Y. Nuzumlalı, C. Caraballo, S. Mahajan, H. Krumholz, D. Radev, A neural topic-attention model for medical term abbreviation disambiguation, arXiv preprint arXiv:1910.14076 (2019).
[17] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[18] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, arXiv preprint arXiv:1910.10683 (2019).
[19] C. Pan, B. Song, S. Wang, Z. Luo, BERT-based acronym disambiguation with multiple training strategies, arXiv preprint arXiv:2103.00488 (2021).
[20] A. Singh, P. Kumar, SciDr at SDU-2020: IDEAS – identifying and disambiguating everyday acronyms for scientific domain, arXiv preprint arXiv:2102.08818 (2021).
[21] Q. Zhong, G. Zeng, D. Zhu, Y. Zhang, W. Lin, B. Chen, J. Tang, Leveraging domain agnostic and specific knowledge for acronym disambiguation, in: SDU@AAAI, 2021.
[22] L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, A. Barua, C. Raffel, mT5: A massively multilingual pre-trained text-to-text transformer, arXiv preprint arXiv:2010.11934 (2020).
[23] A. P. B. Veyseh, N. Meister, S. Yoon, R. Jain, F. Dernoncourt, T. H. Nguyen, Multilingual acronym extraction and disambiguation shared tasks at SDU 2022, in: Proceedings of SDU@AAAI-22, 2022.
[24] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, P. Kuksa, Natural language processing (almost) from scratch, Journal of Machine Learning Research 12 (2011) 2493–2537.
[25] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, in: Advances in Neural Information Processing Systems, 2013, pp. 3111–3119.
[26] K. Kirchhoff, A. M. Turner, Unsupervised resolution of acronyms and abbreviations in nursing notes using document-level context models, in: Proceedings of the Seventh International Workshop on Health Text Mining and Information Analysis, 2016, pp. 52–60.
[27] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692 (2019).
[28] I. Beltagy, K. Lo, A. Cohan, SciBERT: A pretrained language model for scientific text, arXiv preprint arXiv:1903.10676 (2019).
[29] T. Miyato, A. M. Dai, I. Goodfellow, Adversarial training methods for semi-supervised text classification, arXiv preprint arXiv:1605.07725 (2016).
[30] S. Gururangan, A. Marasović, S. Swayamdipta, K. Lo, I. Beltagy, D. Downey, N. A. Smith, Don't stop pretraining: adapt language models to domains and tasks, arXiv preprint arXiv:2004.10964 (2020).
[31] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[32] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, arXiv preprint arXiv:2005.14165 (2020).
[33] J. Devlin, Multilingual BERT README, https://github.com/google-research/bert/blob/master/multilingual.md, 2018.
[34] C. Sun, A. Shrivastava, S. Singh, A. Gupta, Revisiting unreasonable effectiveness of data in deep learning era, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 843–852.
[35] M. Mintz, S. Bills, R. Snow, D. Jurafsky, Distant supervision for relation extraction without labeled data, in: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, 2009, pp. 1003–1011.
[36] A. Go, R. Bhayani, L. Huang, Twitter sentiment classification using distant supervision, CS224N project report, Stanford 1 (2009) 2009.
[37] A. Ratner, S. H. Bach, H. Ehrenberg, J. Fries, S. Wu, C. Ré, Snorkel: Rapid training data creation with weak supervision, in: Proceedings of the VLDB Endowment, volume 11, 2017, p. 269.
[38] A. Ratner, B. Hancock, J. Dunnmon, R. Goldman, C. Ré, Snorkel MeTaL: Weak supervision for multi-task learning, in: Proceedings of the Second Workshop on Data Management for End-To-End Machine Learning, 2018, pp. 1–4.
[39] P. Varma, C. Ré, Snuba: Automating weak supervision to label training data, in: Proceedings of the VLDB Endowment, volume 12, 2018, p. 223.
[40] N. Dalvi, A. Dasgupta, R. Kumar, V. Rastogi, Aggregating crowdsourced binary ratings, in: Proceedings of the 22nd International Conference on World Wide Web, 2013, pp. 285–294.
[41] Y. Zhang, X. Chen, D. Zhou, M. I. Jordan, Spectral methods meet EM: A provably optimal algorithm for crowdsourcing, Advances in Neural Information Processing Systems 27 (2014) 1260–1268.
[42] M. Joglekar, H. Garcia-Molina, A. Parameswaran, Comprehensive and reliable crowd assessment algorithms, in: 2015 IEEE 31st International Conference on Data Engineering, IEEE, 2015, pp. 195–206.
[43] E. Alfonseca, K. Filippova, J.-Y. Delort, G. Garrido, Pattern learning for relation extraction with hierarchical topic models (2012).
[44] A. J. Ratner, C. M. De Sa, S. Wu, D. Selsam, C. Ré, Data programming: Creating large training sets, quickly, Advances in Neural Information Processing Systems 29 (2016) 3567–3575.
[45] R. Sennrich, B. Haddow, A. Birch, Neural machine translation of rare words with subword units, arXiv preprint arXiv:1508.07909 (2015).
[46] T. Kudo, J. Richardson, SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing, arXiv preprint arXiv:1808.06226 (2018).
[47] A. P. B. Veyseh, N. Meister, S. Yoon, R. Jain, F. Dernoncourt, T. H. Nguyen, MACRONYM: A large-scale dataset for multilingual and multi-domain acronym extraction, arXiv preprint (2022).
[48] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., PyTorch: An imperative style, high-performance deep learning library, Advances in Neural Information Processing Systems 32 (2019) 8026–8037.
[49] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).
[50] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting, Journal of Machine Learning Research 15 (2014) 1929–1958.