
SimCLAD: A Simple Framework for Contrastive Learning of Acronym Disambiguation

Bin Li (0) (Corresponding author), Fei Xia (1, 2), Yixuan Weng, Xiusheng Huang (1, 2), Bin Sun (0)

(0) College of Electrical and Information Engineering, Hunan University
(1) National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
(2) School of Artificial Intelligence, University of Chinese Academy of Sciences

Acronym disambiguation means finding the correct meaning of an ambiguous acronym in a given sentence from a dictionary of candidate expansions, which is one of the key points for scientific document understanding (SDU@AAAI-22). Recently, many attempts have tried to solve this problem by fine-tuning pre-trained masked language models (MLMs) to obtain a better acronym representation. However, the meaning of an acronym varies across contexts, and the corresponding phrase representations, mapped in different directions, lack discrimination in the overall vector space. Thus, the original representations of the pre-trained MLMs are not ideal for the acronym disambiguation task. In this paper, we propose a Simple framework for Contrastive Learning of Acronym Disambiguation (SimCLAD) to better understand acronym meanings. Specifically, we design a continual contrastive pre-training method that enhances the pre-trained model's generalization ability by learning the phrase-level contrastive distributions between the true meaning and ambiguous phrases. The results on acronym disambiguation in the English scientific domain show that the proposed method outperforms all other competitive state-of-the-art (SOTA) methods.

Keywords: Acronym Disambiguation, Document Understanding, Contrastive Learning, Continual Pre-training

1. Introduction

Recently, pre-training technology has greatly improved the level of machine understanding [1]. However, due to the complexity and ambiguity of natural language, there is still a gap between machines and humans in comprehensively understanding documents [2]. In scientific document understanding (SDU@AAAI-22), acronyms are often necessary because of space limitations. It is of great significance to correctly understand and distinguish the correct acronym meaning from the given sentence [3].

More precisely, the document reading system is expected to find the correct expanded form of the acronym given its possible expansions from the dictionary. This is quite important for a variety of downstream tasks that involve understanding, such as reading comprehension [4], story cloze [5] and medical entity disambiguation [6].

The acronym disambiguation task aims at finding the correct meaning of the ambiguous acronym in a given text from the dictionary [7]. As shown in Figure 1, the sentence is from the scientific domain in English, where the text in bold represents the short acronym. The dictionary contains the easily confused long forms of the acronym. Our goal is to predict the correct long-form meaning from the dictionary (i.e., Support vector machines). A good prediction should not only understand the meaning of the context, but also differentiate the meanings of the ambiguous phrases.

Input:
  Sentence: SVMs have been used for text classification (Tong and Koller, 2002), using properties of the support vector machine algorithm for determining what unlabelled data to select for classification.
  Dictionary: 1. Support Vector Machines
              2. Support vector machines
Output: Support vector machines

Figure 1: Example of acronym disambiguation.

Many works have attempted to incorporate manually designed rules [8], handcrafted features [9, 10], word embeddings [11] and pre-training technology [12, 13, 14] into this task and achieved relatively good performance. According to the results of SDU@AAAI-21 [2], pre-training methods outperform rule-based and feature-based methods by a large margin. However, the acronym meaning varies in different contexts [15], and the corresponding token representations follow an anisotropic distribution [16]. For masked language models (MLMs), the token representations are mapped into a cramped idiomatic distribution [17, 16]. As a result, the MLMs are weak at distinguishing the ambiguous meanings of acronyms, especially in the acronym disambiguation task.

Inspired by the token-aware contrastive learning method [16], we propose a Simple framework for Contrastive Learning of Acronym Disambiguation (SimCLAD) to distinguish the distributions of the true meaning and the ambiguous phrases. Specifically, we adopt a phrase-level continual contrastive pre-training method to enhance the pre-trained MLMs for a better representation of the acronyms. Extensive experiments on acronym disambiguation in the English scientific domain show that the proposed method achieves the best results compared with other competitive state-of-the-art (SOTA) methods. The online leaderboard shows that the proposed method ranks first in the scientific English domain in shared task 2 of SDU@AAAI-22. The main contributions are summarized as follows:

• We perform the first attempt to resolve the acronym disambiguation problem with a contrastive pre-trained model for better acronym understanding.
• We extend the token-level contrastive learning method by designing a phrase-level continuing contrastive pre-training method to obtain better contrastive representations of the ambiguous acronyms.
• Experiments conducted on the scientific English dataset demonstrate that the proposed method has better performance and outperforms other competitive baselines.

SDU@AAAI-22: Workshop on Scientific Document Understanding, co-located with AAAI 2022. 2022 Vancouver, Canada.
libincn@hnu.edu.cn (B. Li (Corresponding author)); xiafei2020@ia.ac.cn (F. Xia); wengsyx@gmail.com (Y. Weng); huangxiusheng2020@ia.ac.cn (X. Huang); sunbin611@hnu.edu.cn (B. Sun)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

2. Related work

2.1. Acronym disambiguation

Acronym disambiguation has attracted much attention in biomedical fields [18]. The earliest methods [8] utilize manually designed rules or text features to find the acronym expansions. Later, a few works [19] automatically mined acronym expansions by analyzing web data. These methods are usually effective when an acronym appears together with its expansion in the same document.

However, traditional rules or statistics cannot effectively handle these tasks given the explosive growth of information. In addition, methods designed for biomedical tasks cannot be directly transferred to other fields, such as science. Recently, deep learning based methods have promoted the development of scientific document understanding (SDU). Feature-based methods [9], clustering methods [11], and pre-trained model methods [14] perform well in this task. Although the methods based on pre-training technology (i.e., MLMs) can effectively distinguish the confusing phrases of an acronym, they still lack the cognition of negative samples in the representation. Different from the above methods, we use contrastive learning to obtain more distinctive features for acronym disambiguation.

2.2. Contrastive learning

In general, methods based on contrastive learning (CL) can distinguish the observed data well from other negative samples. CL has been applied to many areas of computer vision, including image [20] and video [21]. Most recently, SimCLR [22], a simple framework for the CL of visual representations based on the NT-Xent loss, was proposed for better image representation.

The same idea can also be found in the field of natural language processing (NLP), where many works [23, 24] are devoted to modeling better sentence-level representations with CL for downstream tasks. Recently, Su et al. [16] proposed a token-aware CL framework to learn an isotropic and discriminative distribution of token representations by restoring the original token meaning of the masked items. This method is very effective in distinguishing the token-level representations, thereby achieving better performance in sentence representation. Following this work, we further consider phrase-level CL by recovering the probable phrases (i.e., ambiguous acronyms) during the pre-training phase to obtain a better-distinguished acronym representation.

2.3. Continual pre-training

Further continual pre-training of the pre-trained model [25] is a wise choice for alleviating the task and domain discrepancy between the upstream and downstream tasks. Many works investigate how to better transfer general knowledge to a domain-specific task via continuing pre-training [26, 16]. In the field of SDU, generic MLMs are weak at distinguishing confusing phrases from the dictionary. As a result, the continual pre-training method is adopted in this paper to directly improve the ability of understanding with contrastive learning.

3. Task introduction

3.1. Problem definition

The acronym disambiguation task aims to find the correct meaning of a given acronym in a sentence. Specifically, the sentence can be represented as $x = [x_1, x_2, \ldots, x_n]$, where $n$ is the total length of the sentence. Given that the index $i$ represents the acronym in the input sentence, the short acronym can be represented as $\hat{x}_i$. The corresponding meaning of the short-form acronym is chosen from the dictionary $D = [d_1, d_2, \ldots, d_m]$, where $d_j$ represents a phrase in the dictionary and $m$ represents the total number of probable phrases. Our goal is to predict the correct phrase meaning $d_j$ of the short acronym $\hat{x}_i$ from the dictionary $D$, where $i \in [1, n]$ and $j \in [1, m]$.

3.2. Evaluation metric

To evaluate the performance of different methods, the Macro F1 is adopted. The definitions are shown as follows:

$$
\begin{aligned}
\text{Precision} &= \frac{\sum_{i=1}^{N} \text{precision}_i}{N} \\
\text{Recall} &= \frac{\sum_{i=1}^{N} \text{recall}_i}{N} \\
\text{Macro F1} &= \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
\end{aligned}
\tag{1}
$$

where $N$ is the number of total classes, and $\text{precision}_i$ and $\text{recall}_i$ represent the precision and recall of class $i$, respectively.

3.3. Dataset

Table 1
Statistical information of scientific English dataset.

Data              Sample Number   Ratio
Training Set      7532            83.69%
Development Set   894             9.93%
Test Set          574             6.38%
Total             9000            100%

The acronym disambiguation task contains the dataset of scientific English, which is shown in Table 1. The dataset is divided into training (7532), development (894), and testing (574) sets. All the datasets can be found in the work [27], where the training and validation sets of the scientific English dataset have been manually labeled. All the labels are collected in the dictionary.

4. Method

4.1. Model architecture

[Figure 2: Overview of the proposed method. Panel labels: Student; Teacher (Frozen); Encoder; Embedding Layer; Preprocessed Text; Input Text. Example input: "We adopt [MASK] in this paper".]

As shown in Figure 2, the proposed method contains two domain pre-trained models (a student and a teacher), which are initialized with the same parameters, i.e., SciBERT (Beltagy, Lo, and Cohan 2019). At the stage of pre-training, the parameters of the teacher are frozen to provide a good encoding representation for the student model. In addition, the teacher supports the well-formed original objectives of the MLM (i.e., masked language modeling and next sentence prediction) for the student model. Inspired by [16], we intentionally mask the original short-form acronym ($x'$) in order to distinguish the ambiguous long-form acronyms ($d_1^{+}$, $d_2^{-}$) in the teacher model, where the notations $+$ and $-$ mark positive and negative samples. A contrastive loss is adopted in the pre-training process of the student model. Specifically, it is obtained by contrasting the masked short-form acronym (i.e., CL) in the input sentence of the student model against the "correct meaning" produced by the teacher without masking the corresponding phrases. To get the representation of the "reference" phrase in the dictionary (dotted frame), we apply a phrase averaging method that averages the embeddings of its tokens (i.e., contrastive learning), which is denoted with the upper bar. Meanwhile, we push the representations of positive and negative samples (i.e., Contrastive Learning) apart to enhance the model's ability to distinguish confusing samples.

4.2. Phrase-level contrastive pre-training

The proposed method is composed of two pre-trained models that are both initialized with the SciBERT model, where one is the student model (denoted $\mathcal{S}$) and the other is the teacher model (denoted $\mathcal{T}$). During the pre-training phase, we only optimize the parameters of $\mathcal{S}$, leaving the model $\mathcal{T}$ frozen. Given an input sentence $x = [x_1, \ldots, x_n]$, we intentionally mask the short acronym $\hat{x}_i$ following the same pre-training task [17]. Then, we feed the masked sentence $x'$ into the student to perform the pre-training task. As a result, we obtain the contextual representation $\tilde{h} = [\tilde{h}_1, \ldots, \tilde{h}_n]$ in the student model, where the [MASK] is embedded as [mask]. At the same time, the teacher model takes as input the original sentence in which the short acronym $\hat{x}_i$ is replaced with the phrase from the dictionary. Intuitively, the teacher can distinguish all the probable representations with the dictionary, and we want the student model to distinguish the correct phrase meaning through CL. In the end, the well-formed phrase representation is obtained with the averaged embeddings. Thus, the final representation of the recovered sentence $h = [h_1, \ldots, h_n]$ against the corresponding input sentence is produced by the teacher (see Figure 2). Following the work [16], we further refine the proposed phrase-level contrastive pre-training loss

$$
\mathcal{L}_{\mathrm{CL}} = -\sum_{i=1}^{n} \sum_{j=1}^{m} \mathbb{1}(\hat{x}_i, d_j) \log \frac{\exp\left(\mathrm{S}(\tilde{h}_i, h_j)/\tau\right)}{\sum_{k=1}^{K} \exp\left(\mathrm{S}(\tilde{h}_i, h_k)/\tau\right)}
\tag{2}
$$

where the indicator function $\mathbb{1}(\hat{x}_i, d_j) = 1$ if $\hat{x}_i$ is the masked acronym and is short for the corresponding long form $d_j$; otherwise, $\mathbb{1}(\hat{x}_i, d_j) = 0$. We use $\tau$ as the temperature hyper-parameter, and the notation $\mathrm{S}(\cdot, \cdot)$ represents the similarity function, for which we choose the cosine function. $K$ is the number of all possible long-form acronyms.

4.3. Optimizing objectives

Naturally, the student model learns to pull the masked acronym closer to its corresponding "true" representation produced by the teacher and away from the meanings of the other confusing phrases in the sentence.

In summary, the acronym representations learned by the student are more discriminative with respect to the confusing phrases, and therefore better follow an isotropic distribution [16]. Furthermore, the original pre-training method of the MLM [17] is also adopted for learning good document representations, including the masked language modeling task and the next sentence prediction (NSP) task. The overall optimizing objective is applied as continual pre-training on the domain-specific corpus, and can be written as

$$
\mathcal{L} = \mathcal{L}_{\mathrm{CL}} + \mathcal{L}_{\mathrm{MLM}} + \mathcal{L}_{\mathrm{NSP}}
\tag{3}
$$

where the pre-training step is totally unsupervised and can be carried out with the vast scientific English dataset. Once the pre-trained model is obtained, the student model is fine-tuned on the acronym disambiguation task.

4.4. Contrastive fine-tuning

Concretely, given the final hidden state $h_s$ of the input sentence, the representation of the probable phrases can be represented as $h_p$. We concatenate $h_s$ and $h_p$ to obtain the feature $h$ for binary classification and contrastive learning, which can be presented as

$$
h = [h_s; h_p]
\tag{4}
$$

A non-linear projection layer is added on top of the pre-trained model for obtaining the representation. The positive sample is marked with the superscript $+$, and the negative sample with the superscript $-$. The two types of features are calculated as follows:

$$
z^{+} = W_2 \, \mathrm{ReLU}\left(W_1 h^{+}\right), \qquad z^{-} = W_2 \, \mathrm{ReLU}\left(W_1 h^{-}\right)
\tag{5}
$$

Finally, we perform fine-tuning in a multi-task manner and take a weighted average of the two classification losses and the contrastive loss:

$$
\mathcal{L} = \frac{(1 - \lambda)}{2} \left(\mathcal{L}^{+} + \mathcal{L}^{-}\right) + \lambda \mathcal{L}_{\mathrm{CL}}
\tag{6}
$$

where $\lambda$ is the weight hyper-parameter.

5. Experiment setup

5.1. Baseline models

5.2. Pre-training strategies

We use the continuing pre-training strategy with the proposed method using the SciBERT model (footnote 2). Except for

2 https://huggingface.co/allenai/scibert_scivocab_cased
3 https://huggingface.co/datasets/scientific_papers
4 https://huggingface.co/roberta-large
5 https://huggingface.co/allenai/scibert_scivocab_uncased

ability of the different baselines and add the balanced weights to get the final predictions, where more implementation details can be found in the work [34].
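The weighted combination of baselines mentioned above can be illustrated with a small sketch. It is only an illustration under stated assumptions, not the scheme of [34] or the authors' implementation: the weighting rule (each model's weight proportional to its development-set score), the function names `balanced_weights` and `ensemble_predict`, and the model names and numbers are all hypothetical.

```python
from typing import Dict, List


def balanced_weights(dev_scores: Dict[str, float]) -> Dict[str, float]:
    """Normalize per-model development scores so that the weights sum to 1.

    Assumption: weights proportional to dev scores; the actual rule may differ.
    """
    total = sum(dev_scores.values())
    if total <= 0:
        raise ValueError("development scores must sum to a positive value")
    return {name: score / total for name, score in dev_scores.items()}


def ensemble_predict(probs: Dict[str, List[float]], weights: Dict[str, float]) -> int:
    """Return the index of the candidate expansion with the highest weighted probability."""
    num_candidates = len(next(iter(probs.values())))
    combined = [0.0] * num_candidates
    for name, model_probs in probs.items():
        for k in range(num_candidates):
            combined[k] += weights[name] * model_probs[k]
    return max(range(num_candidates), key=combined.__getitem__)


# Hypothetical development scores and per-candidate probabilities for one sentence.
weights = balanced_weights({"model_a": 0.90, "model_b": 0.60})
choice = ensemble_predict(
    {"model_a": [0.4, 0.6], "model_b": [0.7, 0.3]},
    weights,
)
```

With these toy numbers the weaker model's strong preference for the first candidate outweighs the stronger model's mild preference for the second, which is the kind of diversity an ensemble is meant to exploit.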

6. Results

The main results of our model and the baselines are shown in Table 2. The pre-trained models outperform the rule-based method, since the rule-based method generalizes poorly and struggles to pick the correct phrase from the confusing acronym options in the dictionary. SciBERT beats RoBERTa on all three scores, which indicates that domain-specific pre-training is significant for scientific document understanding: the scientific-domain pre-trained model can capture a deep representation of the confusing acronyms. hdBERT merges different types of hidden features to generalize better in binary classification, thereby performing well in this task. The results of BERT-MT demonstrate that many training tricks are indeed useful for improving the model's robustness. The proposed method outperforms the other baselines on all three scores, which indicates that continuing contrastive pre-training can further improve the model's ability to represent acronyms. Note that the ensemble method further improves the diversity of the final results, thereby achieving the best performance on the test set. In summary, we rank first on the online leaderboard, as shown in Table 3.
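For readers who want to reproduce scores of this kind, the macro-averaged metrics defined in Section 3.2 can be sketched as follows. This is a minimal illustration rather than the official scorer: the function name `macro_scores` is hypothetical, and taking the class set as the union of gold and predicted labels and scoring an undefined ratio as 0 are assumptions.

```python
from typing import Hashable, Sequence, Tuple


def macro_scores(
    gold: Sequence[Hashable], pred: Sequence[Hashable]
) -> Tuple[float, float, float]:
    """Return (macro precision, macro recall, F1 computed from those two averages)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    # Assumption: every label seen in gold or pred counts as a class.
    classes = sorted(set(gold) | set(pred), key=str)
    precisions, recalls = [], []
    for c in classes:
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    precision = sum(precisions) / len(classes)
    recall = sum(recalls) / len(classes)
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example with two long forms "a" and "b".
p, r, f1 = macro_scores(["a", "a", "b", "b"], ["a", "b", "b", "b"])
```

Note that the F1 here is the harmonic mean of the macro precision and macro recall, as in the definition of Section 3.2, which generally differs from averaging per-class F1 values.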

7. Conclusion

We describe a simple framework for contrastive learning of acronym disambiguation for shared task 2 of SDU@AAAI-22. Many baselines are implemented for comparison with the proposed method, including methods based on pre-training, combinations of different structures, and useful tricks. The results demonstrate that the proposed method outperforms all other baselines, achieving the best performance (top-1) on acronym disambiguation in scientific English. We further conclude that the continuing contrastive pre-training method enhances the model's ability to represent the confusing long-form phrases of an acronym, and that contrastive fine-tuning further enhances its generalization ability. In future work, we will extend our work as follows: (1) using twin networks to train the teacher and the student together; (2) adopting fine-grained and coarse-grained embeddings in the contrastive pre-training to better capture the meaning of the sentence.
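As a compact recap of the phrase-level contrastive objective of Section 4.2, the loss for a single masked acronym can be sketched in plain Python. This is an illustrative sketch, not the authors' code: the helper names (`mean_pool`, `cosine`, `phrase_contrastive_loss`), the toy two-dimensional vectors, and the temperature value 0.1 are assumptions. In practice the vectors would come from the student and the frozen teacher encoders.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def mean_pool(token_vectors: Sequence[Vector]) -> List[float]:
    """Average the token embeddings of a phrase into a single phrase vector."""
    count = len(token_vectors)
    return [sum(column) / count for column in zip(*token_vectors)]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity, the similarity function chosen in Section 4.2."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def phrase_contrastive_loss(
    student_vec: Vector,
    phrase_vecs: Sequence[Vector],
    positive_index: int,
    temperature: float = 0.1,  # assumed value, for illustration only
) -> float:
    """Negative log-softmax of the positive phrase over all candidate phrases."""
    logits = [cosine(student_vec, p) / temperature for p in phrase_vecs]
    peak = max(logits)  # subtract the maximum for numerical stability
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[positive_index]


# Toy setup: the student vector for the masked acronym and two candidate long forms,
# each represented by the mean of its (hypothetical) teacher token embeddings.
student = [1.0, 0.0]
candidates = [
    mean_pool([[1.0, 0.0], [1.0, 0.0]]),  # matches the student direction
    mean_pool([[0.0, 1.0], [0.0, 1.0]]),  # orthogonal to the student direction
]
loss_correct = phrase_contrastive_loss(student, candidates, positive_index=0)
loss_wrong = phrase_contrastive_loss(student, candidates, positive_index=1)
```

The loss is small when the student representation already points toward the correct long form and large when the labeled positive is the distant candidate, which is the behavior the pre-training objective relies on.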

Acknowledgement

This work is supported by the National Key Research and Development Project of China (2018YFB1305200) and the National Natural Science Fund of China (62171183, 61801178).

References

[1] Qiu, Sun, Xu, Shao, Dai, Huang, Pretrained models for natural language processing: A survey, Science China Technological Sciences (2020) 1–26.
[2] A. P. B. Veyseh, Dernoncourt, T. H. Nguyen, Chang, L. A. Celi, Acronym identification and disambiguation shared tasks for scientific document understanding, arXiv preprint arXiv:2012.11760 (2020a).
[3] Y. R. J. F. D. T. H. N. Amir Pouran Ben Veyseh, Nicole Meister, Multilingual Acronym Extraction and Disambiguation Shared Tasks at SDU 2022, in: Proceedings of SDU@AAAI-22, 2022.
[4] Gardner, Berant, Hajishirzi, Talmor, Min, On making reading comprehension more comprehensive, in: Proceedings of the 2nd Workshop on Machine Reading for Question Answering, 2019, pp. 105–112.
[5] Guan, Feng, Chen, He, Mao, Fan, Huang, Lot: A benchmark for evaluating chinese long text understanding and generation, arXiv preprint arXiv:2108.12960 (2021).
[6] Li, Chen, H. Liu, Weng, Sun, Li, Bai, Hu, More but correct: Generating diversified and entity-revised medical response, arXiv preprint arXiv:2108.01266 (2021).
[7] A. P. B. Veyseh, Dernoncourt, Q. H. Tran, T. H. Nguyen, What does this acronym mean? introducing a new dataset for acronym identification and disambiguation, in: Proceedings of the 28th International Conference on Computational Linguistics, 2020b, pp. 3285–3301.
[8] A. S. Schwartz, M. A. Hearst, A simple algorithm for identifying abbreviation definitions in biomedical text, in: Biocomputing 2003, World Scientific, 2002, pp. 451–462.
[9] L. Luo, Z. Yang, P. Yang, Y. Zhang, L. Wang, H. Lin, J. Wang, An attention-based bilstm-crf approach to document-level chemical named entity recognition, Bioinformatics 34 (2018) 1381–1388.
[10] F. Li, Z. Mai, W. Zou, W. Ou, X. Qin, Y. Lin, W. Zhang, Systems at sdu-2021 task 1: Transformers for sentence level sequence label., in: SDU@AAAI, 2021.
[11] A. Jaber, P. Martínez, Participation of uc3m in sdu@aaai-21: A hybrid approach to disambiguate scientific acronyms., in: SDU@AAAI, 2021.
[12] Q. Zhong, G. Zeng, D. Zhu, Y. Zhang, W. Lin, B. Chen, J. Tang, Leveraging domain agnostic and specific knowledge for acronym disambiguation., in: SDU@AAAI, 2021.
[13] D. R. Kubal, A. Nagvenkar, Effective ensembling of transformer based language models for acronyms identification., in: SDU@AAAI, 2021.
[14] C. Pan, B. Song, S. Wang, Z. Luo, Bert-based acronym disambiguation with multiple training strategies, arXiv preprint arXiv:2103.00488 (2021).
[15] A. P. B. Veyseh, F. Dernoncourt, W. Chang, T. H. Nguyen, Maddog: A web-based system for acronym identification and disambiguation, in: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, 2021, pp. 160–167.
[16] Y. Su, F. Liu, Z. Meng, L. Shu, E. Shareghi, N. Collier, Tacl: Improving bert pre-training with token-aware contrastive learning, arXiv preprint arXiv:2111.04198 (2021).
[17] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 4171–4186.
[18] Q. Jin, J. Liu, X. Lu, Deep contextualized biomedical abbreviation expansion, in: Proceedings of the 18th BioNLP Workshop and Shared Task, 2019, pp. 88–96.
[19] D. Nadeau, P. D. Turney, A supervised learning approach to acronym identification, in: Conference of the Canadian Society for Computational Studies of Intelligence, Springer, 2005, pp. 319–329.
[20] S. Chopra, R. Hadsell, Y. LeCun, Learning a similarity metric discriminatively, with application to face verification, in: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), volume 1, IEEE, 2005, pp. 539–546.
[21] X. Wang, A. Gupta, Unsupervised learning of visual representations using videos, in: Proceedings of the IEEE international conference on computer vision, 2015, pp. 2794–2802.
[22] T. Chen, S. Kornblith, M. Norouzi, G. Hinton, A simple framework for contrastive learning of visual representations, in: International conference on machine learning, PMLR, 2020, pp. 1597–1607.
[23] Z. Wu, S. Wang, J. Gu, M. Khabsa, F. Sun, H. Ma, Clear: Contrastive learning for sentence representation, arXiv preprint arXiv:2012.15466 (2020).
[24] F. Liu, I. Vulić, A. Korhonen, N. Collier, Fast, effective, and self-supervised: Transforming masked language models into universal lexical and sentence encoders, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Punta Cana, Dominican Republic and Online, 2021. URL: https://arxiv.org/abs/2104.08027.
[25] S. Gururangan, A. Marasović, S. Swayamdipta, K. Lo, I. Beltagy, D. Downey, N. A. Smith, Don't stop pretraining: Adapt language models to domains and tasks, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 8342–8360.
[26] R. Han, X. Ren, N. Peng, Econet: Effective continual pretraining of language models for event temporal reasoning, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021, pp. 5367–5380.
[27] S. Y. R. J. F. D. T. H. N. Amir Pouran Ben Veyseh, Nicole Meister, MACRONYM: A Large-Scale Dataset for Multilingual and Multi-Domain Acronym Extraction, in: arXiv, 2022.
[28] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, Roberta: A robustly optimized bert pretraining approach, arXiv preprint arXiv:1907.11692 (2019).
[29] R. Sennrich, B. Haddow, A. Birch, Neural machine translation of rare words with subword units, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, pp. 1715–1725.
[30] I. Beltagy, K. Lo, A. Cohan, Scibert: A pretrained language model for scientific text, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 3615–3620.
[31] I. Loshchilov, F. Hutter, Decoupled weight decay regularization, arXiv preprint arXiv:1711.05101 (2017).
[32] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, et al., Huggingface's transformers: State-of-the-art natural language processing, arXiv preprint arXiv:1910.03771 (2019).
[33] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).
[34] G. P. C. Fung, J. X. Yu, H. Wang, D. W. Cheung, H. Liu, A balanced ensemble approach to weighting classifiers for text classification, in: Sixth International Conference on Data Mining (ICDM'06), IEEE, 2006, pp. 869–873.