1. Introduction

G. Song); mr.hongrae.lee@gmail.com (H. Lee); kshim@snu.ac.kr (K. Shim)

T5 Encoder Based Acronym Disambiguation with Weak Supervision

Gwangho Song

Hongrae Lee

Kyuseok Shim

1 0 Google, Mountain View , CA , USA 1 Seoul National University , Seoul , South Korea

2022

000 0 0002

An acronym is a word formed by abbreviating a phrase by combining certain letters of words in the phrase into a single term. Acronym disambiguation task selects the correct expansion of an ambiguous acronym in a sentence among the candidate expansions in a dictionary. Although it is convenient to use acronyms, identifying the appropriate expansions of an acronym in a sentence is a dificult task in natural language processing. Based on the recent success of the large-scale pre-trained language models such as BERT and T5, we propose a binary classification model using those language models for acronym disambiguation. To overcome the limited coverage of a training data, we use a weak supervision approach to increase the training data. Specifically, after collecting sentences containing an expansion of an acronym from Wikipedia, we replace the expansion with its acronym and label the sentence with the expansion. By conducting extensive experiments, we show the efectiveness of the proposed model. Our model is placed in the top 3 models for three of four categories in SDU@AAAI-22 shared task 2: Acronym Disambiguation.

eol>acronym disambiguation natural language processing deep learning weak supervision

1. Introduction Input:

- Sentence: Since our generative models are based on DP priors, they are designed to favor a small number of unique entities per image.

An acronym is a word formed by abbreviating a phrase which is called a long-form or an expansion (e.g., AAAI for Association for the Advancement of Artificial Intelligence). Due to its brevity, its usage is ubiquitous in ⎧ Dynamic Programming many literature and documents, especially in scientific - Dictionary: DP ⎨ Dependency Parsing and biomedical fields [ 1, 2, 3, 4, 5 ]. A report found that ⎩ Dirichlet Process more than 63% of the articles in English Wikipedia contain at least one abbreviation [ 1 ]. Furthermore, among Output: Dirichlet Process more than 24 million article titles and 18 million article abstracts published between 1950 and 2019, there is at Figure 1: An example of acronym disambiguation least one acronym in 19% of the titles and 73% of the abstracts [ 2 ].

Acronyms frequently have multiple long-forms, and disambiguation task is important and challenging. only one of them is valid for a specific context. For exam- The goal of acronym disambiguation(AD) is to select ple, in a 2001 version of the WWWAAS (World-Wide Web the correct long-form of an ambiguous acronym in a senAcronym and Abbreviation Server) database, 47.97% of tence among the candidate long-forms in a dictionary. acronyms have multiple expansions [6]. As another ex- Figure 1 shows an example of acronym disambiguation. ample, in the SciAD dataset released by SDU@AAAI 2021 A sentence containing an ambiguous acronym “DP” and Shared Task: Acronym Disambiguation [5], an acronym a dictionary with the long-forms of “DP” are given as has 3.1 long-forms on average and up to 20 long-forms. the input. In the dictionary, the acronym “DP” has three When sucfiient context is not available, this leads to the possible long-forms: “Dynamic Programming”, “Depenambiguity of the meaning of acronyms and creates seri- dency Parsing” and “Dirichlet Process”. According to ous understanding dificulties [ 2, 7, 8, 9 ]. Thus, acronym the context of the input sentence, since “DP” stands for “Dirichlet Process”, a model outputs “Dirichlet Process” as its expansion.

The problem of acronym disambiguation is usually cast as a classification problem whose goal is to determine whether a long-form has the same meaning with the acronym in an input sentence. Early approaches [10, 11,

The acronym in the sentence An input sentence = Since our generative models are based on [SOA] DP [EOA] priors, they are designed to favor a small number of unique entities per image. lonAgc-faonrdmidfaoter , = Dynamic Programming prediction score

MLP

ℎ

Encoder = , ⊕ [SEP] ⊕ 12, 6] rely on the traditional classification models such as SVMs, decision trees and naive Bayes classifiers. As deep learning becomes more mainstream in natural language processing, several works employ contextualized word embeddings to create semantic representations of longforms and context [9, 13, 14, 15, 16]. Moreover, with the recent success of the pre-trained language models such as BERT [17] and T5 [18] in natural language processing, classification models for acronym disambiguation are developed based on the pre-trained language models [ 4, 19, 20, 21 ].

To study multilingual acronym disambiguation, we develop a binary classification model by utilizing T5 [ 18], which is one of the most popular pre-trained language models, as well as mT5 [22] which is a multilingual variant of T5. We evaluate the proposed model on the dataset released by SDU@AAAI 2022 Shared Task: Acronym Disambiguation [23]. Since the acronyms in the test dataset do not appear in the training dataset, the training dataset provided in the competition may not be suficient to solve the problem. Thus, we use a weak supervision approach to increase the training dataset. By training on the provided training dataset as well as the weakly labeled training dataset generated by our weak supervision method, the proposed model ranks in the top 3 place for three of four categories in SDU@AAAI-22 shared task 2: Acronym Disambiguation.

The remainder of this paper is organized as follows.

We provide related work in Section 2 and present our proposed model in Section 3. In Section 4, we describe the datasets used for training the model, including weakly labeled datasets generated by weak supervision. Finally, we discuss the experimental results in Section 5 and summarize the paper in Section 6.

2. Related Work In this section, we present the previous works on

acronym disambiguation. We also summarize the pretrained language models widely adopted in various natural language processing. In addition, we introduce weak supervision approaches to construct additional data. 2.1. Acronym Disambiguation Early approaches [10, 11, 12, 6] rely on the traditional classification models such as SVMs, decision trees and naive Bayes classifiers. As deep learning becomes more mainstream in natural language processing, several works employ contextualized word embeddings to create semantic representations of long-forms and context [9, 13, 14, 15, 16]. The works in [13, 14] study the use of word embeddings [24, 25] to build classifiers for clinical abbreviation disambiguation. The UAD model proposed in [15] creates word embeddings by using additional unstructured text. The work in [9] compares the averaged context vector of the words in a long-form of an acronym with the weighted average vector of the words in the context of the acronym based on word embeddings trained on a domain-specific corpus. In [ 26], the proposed model is trained to compute the similarity between a candidate long-form and the context surrounding the target acronym.

Many works utilize deep neural architectures to construct a classifier [ 16, 8, 4, 19, 20, 21 ]. At the AAAI21 Workshop on Scientific Document Understanding (SDU@AAAI-21), the top ranked participants [20, 19, 21] present models for acronym disambiguation based on pre-trained language models such as RoBERTa [27] and SciBERT [28]. In [20], the problem of acronym disambiguation is treated as a span prediction problem, and the proposed model predicts the span containing the correct long-form from the concatenation of an input sentence and candidate long-forms of the acronym in the sentence.

The hdBERT model proposed in [21] combines RoBERTa and SciBERT to capture both domain agnostic and domain specific information. The work in [ 19], which is the winner of the shared task of acronym disambiguation held under the workshop SDU@AAAI 2021, incorporates training strategies such as adversarial training [29] and task-adaptive pre-training [30]. Following a similar strategy to the recent works [19, 21], we develop a binary classification model for acronym disambiguation. Total # Sentences

# Acronyms Train 2.2. Pre-trained Language Models 3.1. Problem Definition There has been significant progress across many natu- The problem of acronym disambiguation is defined as a ral language processing (NLP) tasks by the pre-trained classification problem [ 5]. Given a dictionary which is language models trained on large-scale unlabeled cor- a mapping of acronyms to candidate long-forms (or expora. Based on the transformer architecture [31], a set of pansions), let () = {,1, . . . , ,()} be the set large-scale pre-trained language models are developed, of all candidate long-forms of an acronym , where including BERT [17], RoBERTa [27], GPT [32] and T5 () is the size of the set. Then, for an input sen[18]. Since these models are pre-trained on datasets pri- tence = ⟨1, 2, . . . , ⟩ consisting of tokens (i.e., marily consisting of English text, multilingual models 1, . . . , ) and an acronym = ⟨, . . . , ⟩ with such as mBERT [33] and mT5 [22] are presented. To pro- 1 ≤ ≤ ≤ which is a contiguous subsequence cess multilingual texts in the datasets published in the of , we want to predict the correct long-form of the shared task for acronym disambiguation in the workshop acronym among the candidate long-forms in (). SDU@AAAI-22, we use both T5 and mT5 to encode input Note that we represent a text as a sequence of tokens texts. by using a tokenizer such as WordPiece [45] and SentencePiece [46]. Following the existing works [19, 21], we 2.3. Weak Supervision simplify the problem as a binary classification problem. In other words, given an input sentence , an acronym appearing in and a candidate long-form , in (), we predict the label which is 1 if , is the correct long-form of in the context of , and 0 otherwise.

Modern machine learning models generally need a large amount of hand-labeled training sets for performance improvement [34]. Since creating hand-labeled training datasets is time-consuming and expensive, recent works rely on weak supervision to generate noisy datasets [35, 36, 37, 38, 39, 40, 41, 42]. Distant supervision, one of the most popular techniques for weak supervision, utilizes external knowledge bases to produce noisy labels [35, 36, 43] Other works obtain noisy labels by using crowdsourcing [40, 41, 42] or simple heuristic rules [44, 37]. The system proposed in [39] automatically generates the heuristics to assign training labels to a largescale unlabeled data. Similar to the works in [35, 36, 43] based on distant supervision, we use the relationships between acronyms and their possible long-forms as the weak supervision sources.

3. Acronym Disambiguation Model We first provide the problem definition of acronym disambiguation. We next present the overall architecture and details of our proposed model.

3.2. Model Architecture We provide an illustration of the proposed model in Figure 2. The model consists of an encoder, which transforms an input token sequence into a vector representation, and a multi-layer perceptron (MLP) with a sigmoid activation function to output the prediction. We use the pre-trained language models such as T5 [18] or mT5 [22] encoder layers to encode the input tokens, and take the hidden state of the first token as the encoder output. The encoder takes as input the concatenation of the input long-form , and the sentence [19]. A separator symbol (i.e., [SEP]) is used to separate them. In other words, by using the symbol ⊕ to represent the concatenation of two token sequences, the input token sequence of the encoder is defined as = , ⊕ ⟨ [SEP]⟩ ⊕ . (1)

We also insert two special tokens [BOA] and [EOA] be

fore and after the acronym in to highlight the position of the acronym. For example, consider the input sentence containing the acronym “DP” and one of its candidate long-form “Dynamic Programming” in Figure 1.

As shown in Figure 2, the encoder takes as input the token sequence obtained by concatenating “Dynamic Programming”, [SEP] and the input sentence. The encoder converts the input token sequence into a vector representation ℎ ∈ R, where is the number of hidden units. The MLP layer is used to compute the prediction score from ℎ. That is, = sigmoid( ℎ + ), (2) wlahyeerr.e We ∈intRerparentda∈s tRhearperopbaarbaimliteytetrhsaotftthheeinMpLuPt ELneggliaslh 2,949 3,366 4,640 5,921 8,048 lo niGngi-vfeo.nrma set,ofisthseencoternreccetslon=g-f{or1m, .o.f.t,hea}c,rolentym SEFcnireegnnlticsihfhic 77,,855312 88,,537357 1100,,467898 1122,,183755 1146,,620694 be the acronym contained in the sentence . For every Spanish 6,267 6,980 9,036 10,922 13,788 pair of a sentence ∈ and a long-form , ∈ (), Total 24,599 27,258 34,843 41,853 52,709 wsaesenwotebenltlacieanssaietnts icno,pruwrteestopckoaennndbisnuegiqludlaeabnectlerain,,i.nbgTyhduEasqt,ausfaretotiomnt(h1=e), TStaabtliseti3cs of the labeled and weakly labeled datasets {(, , , ) | 1 ≤ ≤ , 1 ≤ ≤ ()}. We train the model on the training dataset . Let us denote the prediction score for , by , . Then, we use the cross- shared-task) of the competition on acronym disambiguaentropy loss to train the model on the training dataset tion, for each category, there is no overlap of acronyms . In other words, the loss is defined as between any pair of the training, development and test () datasets. Table 2 shows the statistics of the dictionary ℒ = − ∑︁ ∑︁ (, log , + (1 − , ) log (1 − , )). for every category. In the table, the “Avg. Fanout” indi=1 =1 cates the average number of candidate long-forms for (3) an acronym. A dictionary contains a mapping from an At the inference stage, for an input sentence with an acronym to the set of its candidate long-forms. The numacronym , we compute the prediction score for each ber of occurrences of an acronym in the datasets of all candidate long-form in () and choose the one with categories is 2.866 on average. the highest prediction score. 4.2. Weakly Labeled Datasets

4. Datasets Among the acronyms in the dictionaries, 40.6% of them

do not appear in the training dataset. To train the proWe describe the labeled datasets published for the posed model for such acronyms, we collect additional shared task on acronym disambiguation in the workshop data by incorporating a weak supervision method [35]. SDU@AAAI-22 [47]. Moreover, we present the details of Specifically, we first extract the sentences containing a additional datasets generated by our weak supervision long-form in the dictionaries from English, French and method. Spanish Wikipedia dump dated November 7, 2021. For each language, we do not use the long-form of every 4.1. Labeled Datasets acronym whose number of occurrences is at least 1,000 in the Wikipedia dump, since the pre-trained language modThe detailed statistics of the labeled datasets is provided els are likely to be well-trained for such frequent longin Table 1. The datasets consist of four categories (i.e., forms. For each extracted sentence from Wikipedia, we Legal English, Scientific English, French and Spanish). replace the long-form in the sentence with its acronym. In total, there are 24,599, 3,006 and 2,632 sentences in We next assign 1 as the label for the pair of the extracted the training, development and test datasets, respectively. sentence and the long-form, and 0 for every pair of Every sentence in the datasets has a single ambiguous the sentence and each of the other long-forms of the acronym which is to be disambiguated. On average, an acronym. acronym appears in 14 or 15 sentences. As mentioned in Let be the maximum allowed number of sentences the web page (https://sites.google.com/view/sdu-aaai22/ extracted from the Wikipedia dumps for a long-form. For BERT-base-cased [17] T5E-base [18] BERT-large-cased [17] mT5E-base [22] RoBERTa-base [27] mBERT-base-cased [33] hdBERT [21] T5E-large [18] mT5E-large [22] mT5E-xlarge [22] T5E-xlarge [18] # Params 108M 110M 334M 277M 125M 178M 472M 335M 564M 1,670M 1,241M and test datasets. Specifically, we first compute precision, recall and F1 score for each long-form and then report the average value of all long-forms for each measure.

Furthermore, for the development data, we report the average value with its standard deviation by training the models three times with diferent random seeds. 5.2. Experimental Results each value of in {1, 5, 10, 20}, we create a weakly labeled dataset. Let L and denote the labeled dataset provided in the competition and the weakly labeled dataset generated with = . Then, we refer to the combination of the labeled dataset (L) and each of the weakly labeled datasets as L+1, L+5, L+10 and L+20, respectively. The statistics of the combined datasets are presented in Table 3. As an example, when = 10, we obtain 17,254 additional sentences containing an acronym in the dictionaries by weak supervision, and the ratio of unseen acronyms in the training dataset is reduced from 40.6% to 21.6%.

Pre-trained models We compare the performance of the implementations of the proposed model with varying the pre-trained model of the encoder. We use BERT [17], mBERT [33], RoBERTa [27], hdBERT [21], T5 [18] and mT5 [22] as the encoder. Since pre-trained models with 5. Experiments various model sizes are available for BERT and T5, we test them with varying the model size, too. While the We first present the experimental setup and next report default learning rate is 10− 5, we use a learning rate of the results of experiments including the competition for 10− 6 for hdBERT since we get a better performance with acronym disambiguation. 10− 6.

Table 4 shows the F1 score on the development dataset 5.1. Experimental Setup for each category. The results show that the implementation with T5-xlarge achieves the highest performance in We conduct all experiments on a single machine with an terms of the F1 score in every category except Spanish. AMD EPYC Rome 7402P 24-Core CPU and two NVIDIA The second best in terms of the F1 score for all categories GeForce RTX 3090 GPUs under PyTorch framework [48]. is the implementation with mT5-xlarge as the encoder. For each sentence, we consider a window of 64 tokens Note that although T5 is pre-trained using English corwhere the acronym in the sentence is located in the mid- pora, we can see that the model with the encoder of T5 dle of the window, and use the sequence of tokens in that is generalized well to the other languages. As the size of window for training. We set the batch size to 16 and use a model increases, the accuracy of the model tends to be Adam optimizer [49]. Furthermore, we use the union of improved. However, the implementation with T5-xlarge the training datasets with all categories to train imple- performs better than that with mT5-xlarge since T5 is mentations of the proposed model for 10 epochs with a pre-trained with supervised training, while mT5 is not. learning late of 10− 5. Moreover, we apply dropout [50] Note that we cannot evaluate the pre-trained models with to the encoder of the model with a dropout probability a larger size such as T5-xxlarge and mT5-xxlarge models of 0.1. due to GPU memory limitations used in our experiment.

To evaluate the performance of the model, we use macro-averaged precision (P), recall (R) and F1 score (F1) computed for each long-form [15, 5] on the development

Weak supervision To confirm the efectiveness of the weakly labeled datasets, we train the proposed model

ifne-tuning improves the accuracy for every category.

SDU@AAAI-22 Shared Task: Acronym Disambigua

tion In the competition, for each category, we use the model performed the best on the test dataset as shown in Table 7. The bolded numbers in the table are the scores of our model. The results show that our model ranks the 2nd place for Legal English and 3rd place for Scientific English and French. which uses T5-xlarge as the encoder on both the la- 6. Conclusion beled and weakly labeled datasets with varying = 1, 5, 10, 20. We provide the results in Table 5. Recall that We propose a binary classification model for acronym diswe use L and Wk to denote the labeled dataset and the ambiguation by utilizing large-scale pre-trained language weakly labeled dataset generated with = respec- models. To increase the size of the training datasets, we tively, as described in Section 4. The table shows that use a weak supervision approach to generate weakly the F1 score becomes larger with increasing the value labeled datasets. Experimental results show that trainof for = 1, 5, 10. However, when = 20, the ing on both labeled and weakly labeled datasets is benaccuracy is degraded since the skewness of the number eficial to the accuracy of the proposed model. For the of sentences containing an acronym increases. In other shared task on acronym disambiguation in the AAAIwords, as increases, the number of the extracted sen- 22 Workshop on Scientific Document Understanding tences containing a frequent long-form becomes large, (SDU@AAAI-22), our model ranks within the 3rd place while that of the extracted sentences containing rare long- in three of four categories. form does not. Since the model performs the best when = 10, we set to 10 as the default value.

Table 6 presents some examples which are classified Acknowledgments incorrectly with the labeled dataset only, but are classiifed correctly after training on both labeled and weakly This work was supported by Institute of Information labeled datasets. The two rightmost columns show the & communications Technology Planning & Evaluation prediction scores generated by the model trained using (IITP) grant funded by the Korea government(MSIT) (No. only the labeled dataset and using both the labeled and 2020-0-00857, Development of cloud robot intelligence weakly labeled dataset with = 10 (i.e., L+10), re- augmentation, sharing and framework technology to inspectively. Without the weakly labeled dataset, as shown tegrate and enhance the intelligence of multiple robots). in the table, the model fails to find the correct long-forms It was also supported by the National Research Foundafor the sentences. However, by using the weakly labeled tion of Korea(NRF) grant funded by the Korea governdataset, the prediction scores for the correct long-forms ment(MSIT) (No. NRF-2020R1A2C1003576). increase significantly.

Performance on the test dataset We evaluate the implementations of our model with T5-xlarge and mT5xlarge as the encoder after training them on both the labeled and weakly labeled dataset. When we use T5xlarge, we set the learning rate to 9 × 10− 6 since we ifnd that the model performs the best with that learning rate by a hyperparameter search. As shown in Table 7, in terms of the F1 score on the test dataset, the model with T5-xlarge performs the best for both Legal English and Scientific English datasets. On the other hand, the model with mT5-xlarge shows better performance than that with T5-xlarge for French and Spanish datasets. To further improve the performance of the best model in each category, we additionally train the best model by using only the dataset of the category for 5 epochs with a learning rate of 10− 6. The results show that the category-wise

Legal English

There is no answer to the hopelessness and despair of the more than 30 million unemployed in the countries of the

OECD.

Scientific The SGD is adopted to optimize the parameters.

English

Specifically, we will interpolate the translation Scientific models as in Foster and English Kuhn (2007), including a MAP combination (Bacchiani et al 2006).

Il est entouré au Nord par l’Ouganda, à l’Est par French la Tanzanie, au Sud par le Burundi et à l’Ouest par la RDC.

De plus, il y a un représentant spécial adjoint du French Secrétaire général résident à Chypre avec le rang de SSG.

En cuanto al FMAMl se sugirió que sería apropiado Spanish esperar hasta que se completara el debate actual sobre su reforma.

El Gobierno del Japón acoge con beneplácito la Spanish NEPAD África que ha sido lanzada por los países africanos.

OECD SGD MAP RDC SSG FMAM NEPAD Legal English

Slovakia welcomes the establishment of UN UN-Women Women – the UN-Women.

Prediction Prediction

(L) (L+10) [5] A. P. B. Veyseh, F. Dernoncourt, Q. H. Tran, T. H. Bert: Pre-training of deep bidirectional transformNguyen, What does this acronym mean? introduc- ers for language understanding, arXiv preprint ing a new dataset for acronym identification and arXiv:1810.04805 (2018). disambiguation, arXiv preprint arXiv:2010.14678 [18] C. Rafel, N. Shazeer, A. Roberts, K. Lee, S. Narang, (2020). M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the [6] M. Zahariev, Automatic sense disambiguation for limits of transfer learning with a unified text-toacronyms, in: Proceedings of the 27th annual inter- text transformer, arXiv preprint arXiv:1910.10683 national ACM SIGIR conference on Research and (2019). development in information retrieval, 2004, pp. 586– [19] C. Pan, B. Song, S. Wang, Z. Luo, Bert-based 587. acronym disambiguation with multiple training [7] H. L. Fred, T. O. Cheng, Acronymesis: the exploding strategies, arXiv preprint arXiv:2103.00488 (2021). misuse of acronyms, Texas Heart Institute Journal [20] A. Singh, P. Kumar, Scidr at sdu-2020: 30 (2003) 255. Ideas–identifying and disambiguating everyday [8] A. G. Ahmed, M. F. A. Hady, E. Nabil, A. Badr, A acronyms for scientific domain, arXiv preprint language modeling approach for acronym expan- arXiv:2102.08818 (2021). sion disambiguation, in: International Conference [21] Q. Zhong, G. Zeng, D. Zhu, Y. Zhang, W. Lin, on Intelligent Text Processing and Computational B. Chen, J. Tang, Leveraging domain agnostic and Linguistics, Springer, 2015, pp. 264–278. specific knowledge for acronym disambiguation., [9] J. Charbonnier, C. Wartena, Using word embed- in: SDU@ AAAI, 2021.

dings for unsupervised acronym disambiguation [22] L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, (2018). A. Siddhant, A. Barua, C. Rafel, mt5: A massively [10] S. Pakhomov, T. Pedersen, C. G. Chute, Abbrevi- multilingual pre-trained text-to-text transformer, ation and acronym disambiguation in clinical dis- arXiv preprint arXiv:2010.11934 (2020). course, in: AMIA annual symposium proceedings, [23] A. P. B. Veyseh, N. Meister, S. Yoon, R. Jain, F. Dervolume 2005, American Medical Informatics Asso- noncourt, T. H. Nguyen, Multilingual acronym ciation, 2005, p. 589. extraction and disambiguation shared tasks at sdu [11] S. Moon, S. Pakhomov, G. B. Melton, Automated 2022, in: Proceedings of SDU@AAAI-22, 2022. disambiguation of acronyms and abbreviations in [24] R. Collobert, J. Weston, L. Bottou, M. Karlen, clinical texts: window and training size consider- K. Kavukcuoglu, P. Kuksa, Natural language proations, in: AMIA annual symposium proceedings, cessing (almost) from scratch, Journal of machine volume 2012, American Medical Informatics Asso- learning research 12 (2011) 2493–2537. ciation, 2012, p. 1310. [25] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, [12] S. Moon, B. McInnes, G. B. Melton, Challenges and J. Dean, Distributed representations of words and practical approaches with word sense disambigua- phrases and their compositionality, in: Advances tion of acronyms and abbreviations in the clinical in neural information processing systems, 2013, pp. domain, Healthcare informatics research 21 (2015) 3111–3119.

35–42. [26] K. Kirchhof, A. M. Turner, Unsupervised resolution [13] Y. Wu, J. Xu, Y. Zhang, H. Xu, Clinical abbreviation of acronyms and abbreviations in nursing notes usdisambiguation using neural word embeddings, in: ing document-level context models, in: Proceedings Proceedings of BioNLP 15, 2015, pp. 171–176. of the Seventh International Workshop on Health [14] R. Antunes, S. Matos, Biomedical word sense disam- Text Mining and Information Analysis, 2016, pp. biguation with word embeddings, in: International 52–60.

Conference on Practical Applications of Computa- [27] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, tional Biology & Bioinformatics, Springer, 2017, pp. O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, 273–279. Roberta: A robustly optimized bert pretraining ap[15] M. Ciosici, T. Sommer, I. Assent, Unsupervised ab- proach, arXiv preprint arXiv:1907.11692 (2019). breviation disambiguation contextual disambigua- [28] I. Beltagy, K. Lo, A. Cohan, Scibert: A pretrained tion using word embeddings, arXiv preprint language model for scientific text, arXiv preprint arXiv:1904.00929 (2019). arXiv:1903.10676 (2019). [16] I. Li, M. Yasunaga, M. Y. Nuzumlalı, C. Caraballo, [29] T. Miyato, A. M. Dai, I. Goodfellow, Adversarial S. Mahajan, H. Krumholz, D. Radev, A neural training methods for semi-supervised text classifitopic-attention model for medical term abbreviation cation, arXiv preprint arXiv:1605.07725 (2016). disambiguation, arXiv preprint arXiv:1910.14076 [30] S. Gururangan, A. Marasović, S. Swayamdipta, (2019). K. Lo, I. Beltagy, D. Downey, N. A. Smith, Don’t stop [17] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, pretraining: adapt language models to domains and tasks, arXiv preprint arXiv:2004.10964 (2020). [43] E. Alfonseca, K. Filippova, J.-Y. Delort, G. Garrido, [31] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, Pattern learning for relation extraction with hierarL. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, At- chical topic models (2012). tention is all you need, in: Advances in neural [44] A. J. Ratner, C. M. De Sa, S. Wu, D. Selsam, C. Ré, information processing systems, 2017, pp. 5998– Data programming: Creating large training sets, 6008. quickly, Advances in neural information processing [32] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Ka- systems 29 (2016) 3567–3575. plan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sas- [45] R. Sennrich, B. Haddow, A. Birch, Neural machine try, A. Askell, et al., Language models are few-shot translation of rare words with subword units, arXiv learners, arXiv preprint arXiv:2005.14165 (2020). preprint arXiv:1508.07909 (2015). [33] J. Devlin, Multilingual bert readme, [46] T. Kudo, J. Richardson, Sentencepiece: A simple and https://github.com/google-research/bert/blob/ language independent subword tokenizer and detomaster/multilingual.md, 2018. kenizer for neural text processing, arXiv preprint [34] C. Sun, A. Shrivastava, S. Singh, A. Gupta, Re- arXiv:1808.06226 (2018).

visiting unreasonable efectiveness of data in deep [47] A. P. B. Veyseh, N. Meister, S. Yoon, R. Jain, F. Derlearning era, in: Proceedings of the IEEE inter- noncourt, T. H. Nguyen, Macronym: A large-scale national conference on computer vision, 2017, pp. dataset for multilingual and multi-domain acronym 843–852. extraction, arXiv preprint arXiv:1412.6980 (2022). [35] M. Mintz, S. Bills, R. Snow, D. Jurafsky, Distant [48] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradsupervision for relation extraction without labeled bury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, data, in: Proceedings of the Joint Conference of L. Antiga, et al., Pytorch: An imperative style, highthe 47th Annual Meeting of the ACL and the 4th In- performance deep learning library, Advances in ternational Joint Conference on Natural Language neural information processing systems 32 (2019) Processing of the AFNLP, 2009, pp. 1003–1011. 8026–8037. [36] A. Go, R. Bhayani, L. Huang, Twitter sentiment [49] D. P. Kingma, J. Ba, Adam: A method for stochasclassification using distant supervision, CS224N tic optimization, arXiv preprint arXiv:1412.6980 project report, Stanford 1 (2009) 2009. (2014). [37] A. Ratner, S. H. Bach, H. Ehrenberg, J. Fries, S. Wu, [50] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, C. Ré, Snorkel: Rapid training data creation with R. Salakhutdinov, Dropout: a simple way to prevent weak supervision, in: Proceedings of the VLDB En- neural networks from overfitting, The journal of dowment. International Conference on Very Large machine learning research 15 (2014) 1929–1958. Data Bases, volume 11, NIH Public Access, 2017, p.

269. [38] A. Ratner, B. Hancock, J. Dunnmon, R. Goldman,

C. Ré, Snorkel metal: Weak supervision for multitask learning, in: Proceedings of the Second Workshop on Data Management for End-To-End Machine Learning, 2018, pp. 1–4. [39] P. Varma, C. Ré, Snuba: Automating weak supervision to label training data, in: Proceedings of the VLDB Endowment. International Conference on Very Large Data Bases, volume 12, NIH Public

Access, 2018, p. 223. [40] N. Dalvi, A. Dasgupta, R. Kumar, V. Rastogi, Aggregating crowdsourced binary ratings, in: Proceedings of the 22nd international conference on World

Wide Web, 2013, pp. 285–294. [41] Y. Zhang, X. Chen, D. Zhou, M. I. Jordan, Spectral methods meet em: A provably optimal algorithm for crowdsourcing, Advances in neural information processing systems 27 (2014) 1260–1268. [42] M. Joglekar, H. Garcia-Molina, A. Parameswaran,

Comprehensive and reliable crowd assessment algorithms, in: 2015 IEEE 31st International Conference on Data Engineering, IEEE, 2015, pp. 195–206.

[1]

Ammar ,

Darwish ,

El Kahki ,

Hafez , Icetea: in-context expansion and translation of english abbreviations , in: International Conference on Intelligent Text Processing and Computational Linguistics , Springer, 2011 , pp. 41 - 54 .

[2]

Barnett ,

Doubleday , Meta-research: The growth of acronyms in the scientific literature , Elife 9 ( 2020 ) e60080 .

[3]

R. Islamaj

Dogan ,

G. C.

Murray ,

Névéol ,

Lu , Understanding pubmed® user search behavior through log analysis , Database 2009 ( 2009 ).

[4]

Jin ,

Liu ,

Lu , Deep contextualized biomedical abbreviation expansion , arXiv preprint arXiv: 1906 . 03360 ( 2019 ).