<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Disambiguation via Negative Sampling</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Taiqiang Wu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xingyu Bai</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yujiu Yang</string-name>
          <email>yang.yujiu@sz.tsinghua.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Acronym Disambiguation.</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Tsinghua Shenzhen International Graduate School, Tsinghua University</institution>
          ,
          <country country="CN">P. R. China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Workshop Proce dings</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>The Acronym Disambiguation (AD) task aims to map an acronym in a sentence to the corresponding expansion among a set of candidate expansions. However, models based on domain-agnostic knowledge may perform insufficiently when directly applied to data from specific areas such as science and law. To tackle this issue, we propose a prompt-based acronym disambiguation system with a special negative sampling strategy. Specifically, we design a prompt to combine the input sentence and the candidate expansions, followed by a Pre-trained Language Model (PLM) that calculates a score for each combination. Moreover, negative expansions are randomly sampled for better training, and an additional hinge loss is added to improve the robustness of our system. Experiments show the effectiveness of our system, and we achieve competitive results in SDU@AAAI-22 Shared Task 2: Acronym Disambiguation.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Acronyms are abbreviations formed from the initial
components of words or phrases [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. They are widely used
in our daily life, especially on social media. By using
acronyms, people can avoid frequently repeating long
phrases; thus, the sentences could be shorter and more
readable. For example, we use NASA to replace the
National Aeronautics and Space Administration.
      </p>
      <p>However, for people without domain knowledge,
acronyms can be confusing at times; for example, “PPP”
may refer to Paycheck Protection Program or Public-Private
Partnership. To tackle this issue, it is necessary to build an acronym
disambiguation system that can identify the correct meaning
of acronyms in different contexts. As
shown in Figure 1, given several sentences containing
acronym POS, we need to find out the corresponding
expansion among candidate expansions in the given
dictionary. Moreover, understanding the correlation between
acronyms and their expansion is beneficial for several
tasks in natural language processing, including question
answering and machine reading comprehension.</p>
      <p>
        Acronym disambiguation is usually considered a
sequence classification task [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]; the goal is to map the
given acronym in context to the corresponding
expansion from the candidate expansion dictionary.
Previous works mainly focused on the feature construction of
acronym context to better understand semantics, such as
hand-crafted rules and patterns [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], word embeddings [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ],
graph structures [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], machine learning based methods such as CRF and SVM [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], and deep learning based methods [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ]. This line of work has further been extended to learn richer semantic features using Transformer [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], BERT [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and SciBERT [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Although these efforts have achieved significant performance on this task, most of them ignore the semantic relationship between the acronym context and the candidate expansions.
      </p>
      <p>Figure 1: An example of acronym disambiguation. Given sentences containing the acronym POS and a dictionary of candidate expansions (Part-Of-Speech, positive instances, Possessive, Position, postag), the system selects the corresponding expansion for each occurrence of POS.</p>
      <p>
        Furthermore, large-scale data during training brings
an extremely long-tail problem. The size of the original
candidate expansions in the dictionary varies, making it
hard to batch the samples during training. To address
this issue, previous works [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] dynamically add extra
expansions into the candidate expansion set. However,
they ignore the fact that the original negative candidate
expansions are related to the acronym word in semantic
meaning while the added expansions are unrelated.
      </p>
      <p>
        In this paper, we propose a prompt-based acronym disambiguation framework with a specially designed negative sampling strategy. Firstly, we design a prompt template and use it to concatenate the acronym context and the candidate expansions. Secondly, we utilize a pre-trained language model such as BERT [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] to encode each combined context separately, followed by a linear layer that maps the context vectors into logits. Since the size of the candidate expansion set varies across acronyms, we randomly sample negative expansions to pad the candidate set. Finally, we treat the original negative expansions as hard negative samples and the added ones as easy negative samples, from which an extra loss can be calculated to build a more robust system. The main contributions of this work are summarized as follows:
      </p>
      <p>• We design a prompt-based framework to resolve the acronym disambiguation problem, which can be easily modified to solve other NLP tasks such as Entity Linking.</p>
      <p>• We propose a simple yet effective dynamic negative sampling strategy and adopt a novel hinge loss to help train a robust model. The strategy can benefit other matching problems.</p>
      <p>• We conduct experiments on the SDU@AAAI-22 Shared Task 2 dataset and achieve competitive performance, demonstrating our framework's effectiveness.</p>
    </sec>
    <sec id="sec-related">
      <title>2. Related Work</title>
      <p>In this section, we mainly introduce the related studies on prompt-based models, especially BERT-based models, and then review the existing research on word sense disambiguation, which is a more general problem than acronym disambiguation.</p>
      <sec id="sec-related-1">
        <title>2.1. Prompt-based Learning</title>
        <p>
          A prompt is suggestive information that elicits the knowledge PLMs (Pre-trained Language Models) learned during pre-training; it contains a description of the task together with the corresponding answers. Prompt-based learning is a slot-filling method based on language models, which aims to construct the final prompt probabilistically as the prediction of the task. Previous exploration in prompt-based learning mainly focuses on prompt construction, including prompt engineering and answer engineering. Prompt engineering creates a prompt function applicable to the corresponding downstream tasks [
          <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
          ], while answer engineering searches for a unified answer space to which the original answers are mapped [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
        </p>
        <p>
          Multi-prompt learning, an ensemble of these two kinds of prompt engineering, aims to improve the generalization of models [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. Based on multi-prompt learning, various combinations of prompts have been explored, such as prompt augmentation [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], prompt composition [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] and prompt decomposition [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. In this work, we construct different forms of prompts manually to enrich the knowledge enhancement methods.
        </p>
      </sec>
      <sec id="sec-related-2">
        <title>2.2. Word Sense Disambiguation</title>
        <p>
          Word Sense Disambiguation (WSD) methods are divided into supervised, unsupervised and semi-supervised ones. In supervised WSD, classic machine learning based methods, such as decision trees, SVMs, ANNs and naive Bayes models, have been combined to improve the classifier [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. A WSD model based on evolutionary game theory was designed to determine the prediction of ambiguous words by calculating distribution and semantic similarity [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. A supervised neural network with LKB graph embeddings was proposed to transfer the pre-trained embeddings of synsets to the predicted ones [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ].
        </p>
        <p>
          Unsupervised WSD methods mainly cluster the unlabeled corpus to predict the category of ambiguous words. A classic hybrid model consists of self-adaptive genetic, max-min ant and ant colony algorithms [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]. WSD models based on polysemy vector representations adopted statistical polysemy, word sense numbers, and K-means to perform disambiguation [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]. A word sense mapping graph network can be combined with multilingual and multi-knowledge resources to integrate rich information in the unsupervised scenario [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ].
        </p>
        <p>In semi-supervised WSD models, the classifier is trained on the integration of annotated and unannotated corpora. A PageRank-based WSD algorithm combined plWordNet and semantic links from a valency lexicon, Wikipedia articles and the SUMO ontology [23]. A clustering and labeling strategy was used to generate labeled data for subjectivity WSD semi-automatically, which was further combined with the original annotated data [24].</p>
        <p>However, all these methods ignore the interaction between the explanation of an ambiguous word and its context. In this work, we propose a prompt-based model to better integrate the semantic relationship between acronym context and candidate expansions.</p>
      </sec>
    </sec>
    <sec id="sec-method">
      <title>3. Methodology</title>
      <p>In this section, we present the overall architecture of our proposed framework, which uses a prompt-based model to solve the acronym disambiguation problem and adopts a dynamic negative sampling strategy to improve the robustness of our model.</p>
      <sec id="sec-1-1">
        <title>Dictionary-Rest Acronym</title>
      </sec>
      <sec id="sec-1-2">
        <title>Dictionary-POS</title>
        <p>Part-Of-Speech
positive instances
Position
Possessive
postag</p>
      </sec>
      <sec id="sec-1-3">
        <title>Raw Sample</title>
      </sec>
      <sec id="sec-1-4">
        <title>Size:k</title>
      </sec>
      <sec id="sec-1-5">
        <title>Ground Truth</title>
        <p>[Part-Of-Speech]</p>
      </sec>
      <sec id="sec-1-6">
        <title>Size:1</title>
      </sec>
      <sec id="sec-1-7">
        <title>Prompts</title>
        <p>Dialogue fillers and acceptance words
affect the accuracy of POS tagging.
⨁
[SEP]POS[SEP]the meaning of POS
is or equals &lt;MASK&gt;</p>
        <sec id="sec-1-7-1">
          <title>Random Sampling</title>
        </sec>
      </sec>
      <sec id="sec-1-8">
        <title>Original Negative</title>
        <p>[positive instances]
[Position]
……</p>
      </sec>
      <sec id="sec-1-9">
        <title>Size:k-1</title>
      </sec>
      <sec id="sec-1-10">
        <title>MASK</title>
        <sec id="sec-1-10-1">
          <title>Enumeration</title>
        </sec>
      </sec>
      <sec id="sec-1-11">
        <title>Added Negative</title>
        <p>[Sound Pattern of English]
[Frequent Candidates]
……</p>
      </sec>
      <sec id="sec-1-12">
        <title>Size:N-k</title>
        <p>……
……
1 2 3
k k+1 k+2
N</p>
      </sec>
      <sec id="sec-1-13">
        <title>Train</title>
        <p>BERT</p>
      </sec>
      <sec id="sec-1-14">
        <title>Inference</title>
        <p>⨁ : Concatenate
logits
CE loss</p>
        <p>Hinge loss</p>
        <p>Dynamic Mask
logit1
logit2
logit3
…… logitk
logitk+1
logitk+2
…… logitN
3.1. Problem Statement token is inserted before and after the acronym, followed
by a string: the meaning of acronym is or equals
expanFormally, given an input sentence  =  1,  2, ...,   and sion. Finally, BERT with an additional linear layer is
acronym  =   at position  , the goal is to disambiguate employed as our encoder. For training, we will calculate
the corresponding expansions   among  candidate ex- the cross-entropy loss and adopted hinge loss [25]. For
pansions { 1,  2, ...,   }. The candidate expansions are inference, a dynamic mask strategy is adopted, in which
given in advance and their size vary. Specifically, in we will drop the logits of added expansions. Specially,
this paper, we treat this task as a classification problem we will drop the logits from the added negative samples,
by padding the candidate expansions set to fix length which can not be the answer.
with randomly chosen unrelated expansions. We will
dynamically mask the logit of added expansion in the
testing phase and choose the largest one among original 3.3. Prompt Design
candidate expansions as the final prediction.</p>
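        <p>As a minimal sketch of this scoring pipeline (the class and function names below are ours, not from a released implementation), the encoder and the dynamic mask can be written as follows.</p>
        <preformat>
# Sketch: BERT encodes each prompt; a linear layer maps the [CLS]
# vector to one logit per candidate expansion.
import torch
from torch import nn
from transformers import AutoModel

class PromptScorer(nn.Module):
    def __init__(self, plm_name="bert-base-cased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(plm_name)
        self.linear = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        # input_ids: one row per candidate prompt, shape (num_candidates, seq_len)
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]      # [CLS] vector of each prompt
        return self.linear(cls).squeeze(-1)    # one logit per candidate

def predict(logits, k):
    # Dynamic mask at inference: expansions after position k were added
    # only for padding and can never be the answer.
    masked = logits.clone()
    masked[k:] = float("-inf")
    return masked.argmax().item()
        </preformat>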
      </sec>
      <sec id="sec-method-3">
        <title>3.3. Prompt Design</title>
        <p>To build an effective prompt template, we consider a two-stage strategy. We want the model to be aware of two tasks: finding the acronym, and finding the corresponding expansion. Thus, we employ the [SEP] token to highlight the acronym, which helps the model understand where the acronym is. For the second task, previous works [26, 27] show that a longer prompt usually performs better. To add more tokens, we use the template: the meaning of acronym is or equals expansion. For French and Spanish, we employ the corresponding translations as the prompt templates.</p>
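        <p>A minimal sketch of this template (the helper name is ours), matching the prompt shown in Figure 2:</p>
        <preformat>
def build_prompt(sentence, acronym, expansion):
    # sentence + [SEP]acronym[SEP] + "the meaning of acronym is or equals expansion"
    return (sentence + "[SEP]" + acronym + "[SEP]"
            + "the meaning of " + acronym + " is or equals " + expansion)

# e.g. build_prompt("... affect the accuracy of POS tagging.",
#                   "POS", "Part-Of-Speech")
        </preformat>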
      </sec>
      <sec id="sec-method-4">
        <title>3.4. Negative Sampling</title>
        <p>
          The number of candidate expansions in the dictionary varies, making it hard to train an efficient model. Moreover, we consider the negative samples among the original candidate expansions to be related to, but not exactly, the ground truth. To improve the robustness and convergence of the model, we adopt a negative sampling strategy. We set the size of the padded candidate set to N and randomly sample expansions from the candidate expansions of other acronyms as needed. For example, if N is set to 6 and the number of original candidate expansions is 2, we need to pick 4 additional expansions. We note that [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] also proposed a similar negative sampling strategy. The difference is that we divide the negative samples into hard negative samples and easy negative samples, and thus design an extra loss.
        </p>
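        <p>As a sketch (assuming the dictionary is a mapping from each acronym to its candidate expansions; the names are ours), the padding step looks as follows.</p>
        <preformat>
import random

def pad_candidates(acronym, dictionary, n_total):
    # Original candidates: the ground truth plus the hard negatives.
    candidates = list(dictionary[acronym])
    # Easy negatives: expansions of other acronyms, sampled at random.
    others = [e for a, exps in dictionary.items() if a != acronym for e in exps]
    added = random.sample(others, n_total - len(candidates))
    # k = len(candidates) marks the boundary between original and added samples.
    return candidates + added, len(candidates)
        </preformat>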
      </sec>
      <sec id="sec-method-5">
        <title>3.5. Loss Function</title>
        <p>For the model, we consider two goals: 1) the ground truth expansion gets the highest score; 2) the original negative expansions get higher scores than the additional negative expansions. For the first goal, we employ the cross-entropy loss. Denote the predicted label as ŷ and the ground truth label as y:</p>
        <p>ℒ_ce = CE(ŷ, y),  (1)</p>
        <p>where CE denotes the cross-entropy loss function. For the second goal, we follow the idea of the hinge loss: we want the minimum of the original expansion scores S_o = {s_o,1, s_o,2, ..., s_o,k−1} to be higher than the maximum of the additional expansion scores S_a = {s_a,1, s_a,2, ..., s_a,N−k} by a margin:</p>
        <p>ℒ_hinge = max(m − min(S_o) + max(S_a), 0),  (2)</p>
        <p>where max(·) and min(·) denote the maximum and minimum functions and m is a learnable margin. Hence we get our final loss</p>
        <p>ℒ = ℒ_ce + λ · ℒ_hinge,  (3)</p>
        <p>where λ is also a learnable hyperparameter that controls the ratio of the hinge loss.</p>
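        <p>A sketch of Eqs. (1)-(3) in PyTorch (in our system m and λ are learnable; the sketch passes them as plain values for brevity, and assumes the logits are ordered as ground truth, k − 1 hard negatives, N − k easy negatives):</p>
        <preformat>
import torch
import torch.nn.functional as F

def total_loss(logits, k, margin, lam):
    # Eq. (1): cross entropy with the ground truth at index 0.
    target = torch.zeros(1, dtype=torch.long)
    ce = F.cross_entropy(logits.unsqueeze(0), target)
    # Eq. (2): the min of hard-negative scores should exceed
    # the max of easy-negative scores by a margin.
    hinge = torch.clamp(margin - logits[1:k].min() + logits[k:].max(), min=0)
    # Eq. (3): weighted sum of the two terms.
    return ce + lam * hinge
        </preformat>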
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Experiments</title>
      <p>In this section, we first introduce the experimental dataset and evaluation metrics, and then conduct comprehensive experimental studies to verify the effectiveness of our method.</p>
      <sec id="sec-2-1">
        <title>4.1. Dataset</title>
        <p>We evaluate all models on the data provided by SDU@AAAI-22 [28]. As shown in Table 1, the dataset [29] contains training and development sets in English (both scientific and legal domains), Spanish, and French, consisting of 497 English Scientific, 303 English Legal, 546 Spanish, and 669 French acronyms. For each language, a dictionary containing acronyms and their candidate expansions is provided. For Legal English, there are 3717 sentences containing 174997 tokens, and the dictionary contains 625 candidate expansions. The average expansion length over all acronyms is 3.1. The acronyms in the testing set do not appear in the training set.</p>
        <p>Table 1: Statistics of the SDU@AAAI-22 Shared Task 2 datasets (Legal English, Scientific English, French, Spanish).</p>
        <p>For Exploratory Data Analysis (EDA), we analyze the statistical features of the dataset. As shown in Figure 3 and Figure 4, we observe that: 1) for most acronyms there are more than 10 corresponding sentences, indicating that the samples are highly similar; 2) for most acronyms there are fewer than 4 candidate expansions.</p>
      </sec>
      <sec id="sec-2-2">
        <title>4.2. Evaluation Metrics</title>
        <p>Given the acronyms in sentences, the candidate expansions and the ground truth labels, we calculate the macro-averaged precision, recall and F1 score.</p>
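        <p>As a sketch, these metrics can be reproduced with scikit-learn (the shared task ships its own official scorer; this only mirrors the definition):</p>
        <preformat>
from sklearn.metrics import precision_recall_fscore_support

def macro_scores(y_true, y_pred):
    p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
    return p, r, f1
        </preformat>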
      </sec>
      <sec id="sec-2-3">
        <title>4.3. Implementation</title>
        <p>All models are implemented with the open-source transformers library from Huggingface [30]. For all datasets, we set m = 0.1 and λ = 1. The batch size is 2, and the size of the padded expansion set is N = max(k) + 2, where max(k) is the largest number of candidate expansions over all acronyms. For example, for the French dataset the maximum number of candidate expansions is 12, so we set N = 12 + 2 = 14. As for the other parameters, we set the learning rate to 3e-5 and the random seed to 10086. We pad or cut the input to a length of 128 tokens. For the French dataset, the prompt is: la signification de acronym est ou est égale à expansion. For the Spanish dataset, we use: el significado de acronym es o igual a expansion. We train our model on one V100 GPU and evaluate the results using the official script.</p>
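        <p>For reference, the hyperparameters above can be collected as follows (a sketch; the dictionary structure is an assumption on our part):</p>
        <preformat>
CONFIG = {
    "margin_init": 0.1,    # initial value of the margin m
    "lambda_init": 1.0,    # initial ratio of the hinge loss
    "batch_size": 2,
    "learning_rate": 3e-5,
    "seed": 10086,
    "max_length": 128,     # inputs are padded or cut to 128 tokens
}

def padded_set_size(dictionary):
    # N = max(k) + 2, e.g. 12 + 2 = 14 for the French dictionary
    return max(len(exps) for exps in dictionary.values()) + 2
        </preformat>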
      </sec>
      <sec id="sec-2-4">
        <title>4.4. Comparison</title>
        <sec id="sec-2-4-1">
          <title>4.4.1. Overall Performance</title>
          <p>The overall performance results on the validation set are shown in Table 2. For Legal English, we choose bert-large-cased [31] and spanbert-large-cased as the PLM. For Scientific English, we choose scibert-scivocab-cased [32]. For French, we choose bert-base-french-europeana-cased and camembert-large [33]. For Spanish, we choose bert-base-spanish-wwm-cased [34] and bert-base-multilingual-cased [31]. As shown in Table 2, we observe that most models suffer from over-fitting after 3 epochs. Moreover, we find that BERT trained on a specialized corpus performs better than BERT trained on a common corpus.</p>
          <p>Table 2: Overall performance on the validation sets, comparing bert-large-cased, spanbert-large-cased, scibert-scivocab-cased, scibert-scivocab-cased (λ = 1.5), bert-base-french-europeana-cased, camembert-large, bert-base-spanish-wwm-cased and bert-base-multilingual-cased across Legal English, Scientific English, French and Spanish.</p>
        </sec>
        <sec id="sec-2-4-2">
          <title>4.4.2. The Effect of the PLM</title>
          <p>We change the BERT variant to study the influence of different backbones on the French dataset. As shown in Figure 5, which plots the results over epochs 1 to 4 of the training stage, larger models usually obtain better results. Another interesting observation is that all models suffer from over-fitting at epoch 4.</p>
        </sec>
        <sec id="sec-2-4-3">
          <title>4.4.3. The Effect of the Margin m</title>
          <p>We change m to 0.0 and 1.0 and conduct experiments on the English Scientific dataset. According to Figure 6, a large m brings considerable change during training. In fact, a large m means a large gap is required, leading to oscillation in the loss.</p>
        </sec>
        <sec id="sec-2-4-4">
          <title>4.4.4. The Effect of the Ratio λ</title>
          <p>We change λ to 0.5 and 1.5 and conduct experiments on the English Scientific dataset. As shown in Figure 7, a larger λ leads to a lower result. In fact, a large λ means a large hinge loss, which pushes the model to focus on separating the two kinds of negative samples rather than on identifying the ground truth.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Conclusion and Future Work</title>
      <p>In this paper, we propose a novel prompt-based model,
which shows promising and competitive performance in
SDU@AAAI-22 Shared Task 2. We design an effective
prompt template that helps the model utilize the implicit
knowledge in the pre-trained language model. A dynamic
negative sampling strategy is employed to improve the
robustness and performance of our model.</p>
      <p>
        For future work, we will try to adopt a learned prompt
template rather than a fixed template, following CoOp [26].
Moreover, acronym disambiguation under a zero-shot
setting would be another interesting and valuable topic.
Utilizing the graph information in the given sentences [
        <xref ref-type="bibr" rid="ref35">35</xref>
        ] may also help.
      </p>
    </sec>
    <sec id="sec-4">
      <title>6. Acknowledgments</title>
      <p>This research was supported in part by the National
Key Research and Development Program of China (No.
2018YFB1601102) and the Shenzhen Key Laboratory of
Marine IntelliSense and Computation under Contract
ZDSYS20200811142605016. We thank the organizers of
acronym identification and disambiguation competitions
and the reviewers for their valuable comments and
suggestions.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] Y. Li, B. Zhao, A. Fuxman, F. Tao, Guess me if you can: Acronym disambiguation for enterprises, in: ACL 2018, 2018, pp. 1308-1317.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] A. P. B. Veyseh, F. Dernoncourt, Q. H. Tran, T. H. Nguyen, What does this acronym mean? Introducing a new dataset for acronym identification and disambiguation, in: Proceedings of COLING 2020, International Committee on Computational Linguistics, 2020, pp. 3285-3301.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] M. R. Ciosici, T. Sommer, I. Assent, Unsupervised abbreviation disambiguation: Contextual disambiguation using word embeddings, CoRR abs/1904.00929 (2019). URL: http://arxiv.org/abs/1904.00929. arXiv:1904.00929.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] J. Liu, C. Liu, Y. Huang, Multi-granularity sequence labeling model for acronym expansion identification, Inf. Sci. 378 (2017) 462-474.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] J. Charbonnier, C. Wartena, Using word embeddings for unsupervised acronym disambiguation, in: Proceedings of COLING 2018, Association for Computational Linguistics, 2018, pp. 2610-2619.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] Q. Jin, J. Liu, X. Lu, Deep contextualized biomedical abbreviation expansion, in: Proceedings of BioNLP@ACL, Association for Computational Linguistics, 2019, pp. 88-96.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in Neural Information Processing Systems 30: NeurIPS 2017, 2017, pp. 5998-6008.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of NAACL-HLT 2019, Volume 1 (Long and Short Papers), Association for Computational Linguistics, 2019, pp. 4171-4186.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] I. Beltagy, K. Lo, A. Cohan, SciBERT: A pretrained language model for scientific text, in: Proceedings of EMNLP-IJCNLP 2019, Association for Computational Linguistics, 2019, pp. 3613-3618.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] C. Pan, B. Song, S. Wang, Z. Luo, BERT-based acronym disambiguation with multiple training strategies, in: Proceedings of SDU@AAAI-21, volume 2831 of CEUR Workshop Proceedings, 2021.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] T. Shin, Y. Razeghi, R. L. L. IV, E. Wallace, S. Singh, AutoPrompt: Eliciting knowledge from language models with automatically generated prompts, in: Proceedings of EMNLP 2020, Association for Computational Linguistics, 2020, pp. 4222-4235.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] Z. Jiang, F. F. Xu, J. Araki, G. Neubig, How can we know what language models know, Trans. Assoc. Comput. Linguistics 8 (2020) 423-438.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] W. Yuan, G. Neubig, P. Liu, BARTScore: Evaluating generated text as text generation, CoRR abs/2106.11520 (2021).</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] T. Gao, A. Fisch, D. Chen, Making pre-trained language models better few-shot learners, in: Proceedings of ACL/IJCNLP 2021 (Volume 1: Long Papers), Association for Computational Linguistics, 2021, pp. 3816-3830.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] X. Han, W. Zhao, N. Ding, Z. Liu, M. Sun, PTR: Prompt tuning with rules for text classification, CoRR abs/2105.11259 (2021). URL: https://arxiv.org/abs/2105.11259. arXiv:2105.11259.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] L. Cui, Y. Wu, J. Liu, S. Yang, Y. Zhang, Template-based named entity recognition using BART, in: Findings of the Association for Computational Linguistics: ACL/IJCNLP 2021, Findings of ACL, Association for Computational Linguistics, 2021, pp. 1835-1845.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] A. R. Pal, D. Saha, N. S. Dash, S. K. Naskar, A. Pal, A novel approach to word sense disambiguation in Bengali language using supervised methodology, Sādhanā 44 (2019) 1-12.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] R. Tripodi, M. Pelillo, A game-theoretic approach to word sense disambiguation, Comput. Linguistics 43 (2017) 31-70.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] M. Bevilacqua, R. Navigli, Breaking through the 80% glass ceiling: Raising the state of the art in word sense disambiguation by incorporating knowledge graph information, in: Proceedings of ACL 2020, Association for Computational Linguistics, 2020, pp. 2854-2864.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] W. Alsaeedan, M. E. B. Menai, S. A. Al-Ahmadi, A hybrid genetic-ant colony optimization algorithm for the word sense disambiguation problem, Inf. Sci. 417 (2017) 20-38.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] G. Wiedemann, S. Remus, A. Chawla, C. Biemann, Does BERT make any sense? Interpretable word sense disambiguation with contextualized embeddings, in: Proceedings of the 15th Conference on Natural Language Processing, KONVENS 2019, 2019.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] W. Lu, F. Meng, S. Wang, G. Zhang, X. Zhang, A. Ouyang, X. Zhang, Graph-based Chinese word sense disambiguation with multi-knowledge integration, Comput. Mater. Continua 61 (2019) 197-212.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lei</surname>
          </string-name>
          , G. Xun,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          , FAT-RE:
          <article-title>A faster dependency-free model for relation extraction</article-title>
          ,
          <source>J. Web Semant</source>
          .
          <volume>65</volume>
          (
          <year>2020</year>
          )
          <article-title>100598</article-title>
          . URL: https://doi. org/10.1016/j.websem.
          <year>2020</year>
          .
          <volume>100598</volume>
          . doi:
          <volume>10</volume>
          .1016/j. websem.
          <year>2020</year>
          .
          <volume>100598</volume>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>