<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>ADBCMM: Acronym Disambiguation by Building Counterfactuals and Multilingual Mixing</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yixuan Weng</string-name>
          <email>wengsyx@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
          <xref ref-type="aff" rid="aff6">6</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fei Xia</string-name>
          <email>xiafei2020@ia.ac.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
          <xref ref-type="aff" rid="aff6">6</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bin Li</string-name>
          <email>Mlibincn@hnu.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff6">6</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiusheng Huang</string-name>
          <email>huangxiusheng2020@ia.ac.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
          <xref ref-type="aff" rid="aff6">6</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shizhu He</string-name>
          <email>shizhu.he@nlpr.ia.ac.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
          <xref ref-type="aff" rid="aff6">6</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>(science)</institution>
          ,
          <addr-line>English (legal), French and Spanish were given</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Chinese Academy Sciences</institution>
          ,
          <addr-line>Beijing, 100190</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>In the datasets</institution>
          ,
          <addr-line>30,237 data in the four fields of English</addr-line>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>In the past, researchers have tried to solve AD prob-</institution>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>National Laboratory of Pattern Recognition, Institute of Automation</institution>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>National Laboratory of Pattern Recognition, Institute of Automation,Chinese Academy Sciences</institution>
          ,
          <addr-line>Beijing, 100190</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff6">
          <label>6</label>
          <institution>Red means wrong</institution>
          ,
          <addr-line>green means right. Acronyms in English</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Scientific documents often contain a large number of acronyms. Disambiguation of these acronyms will help researchers better understand the meaning of vocabulary in the documents. In the past, thanks to large amounts of data from English literature, the acronym disambiguation task was mainly applied to English literature. However, for other low-resource languages, training data is scarce, so the generalization performance of models is poor. To address this issue, this paper proposes a new method for acronym disambiguation, named ADBCMM, which can significantly improve performance in low-resource languages by building counterfactuals and multilingual mixing. Specifically, by balancing data bias in low-resource languages, ADBCMM is able to improve test performance outside the dataset. In SDU@AAAI-22 - Shared Task 2: Acronym Disambiguation, the proposed method won first place in French and Spanish. You can reproduce our results at https://github.com/WENGSYX/ADBCMM.</p>
      </abstract>
      <kwd-group>
        <kwd>Multilingual</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>https://github.com/WENGSYX/ADBCMM.</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
<p>The exchanges between countries become closer with the progress of globalization. As countries communicate more politically, economically and academically, language understanding has become a new challenge. Acronyms often appear in the scientific documents of different countries. Compared to English, acronyms are more challenging to understand in other languages, so they can become a barrier for researchers reading scientific literature and affect exchanges and cooperation between countries.</p>
      <p>Scientific documents contain a large number of acronyms, and for each of them we need to find the correct interpretation in the current context from a dictionary. For example, in “The traditional Chinese sentences are transferred into SC”, “SC” means “simplified Chinese” rather than “System Combination”. It is difficult for people who are not familiar with a language to understand related acronyms, so we need to disambiguate acronyms, which is a challenging task.</p>
      <p>In the past, researchers have tried to solve AD problems with methods such as word embedding [2] and deep learning [3]. Over the last few years, the BERT [4] model has emerged, which adopts pre-training on a large corpus. Many studies have shown that these pre-trained models (PTMs) have gained a wealth of generic characteristics.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Related Work</title>
<p>In this section, we will introduce the Acronym Disambiguation datasets [7] and how the Acronym Disambiguation tasks [8] were solved in English scenarios in the past, while introducing the difficulties of the Acronym Disambiguation tasks in other languages.</p>
      <p>2.1. AD dataset</p>
      <p>In the datasets, 30,237 data in the four fields of English (science), English (legal), French and Spanish were given. Compared with the English environment, the other languages present the following difficulties:</p>
      <p>• In Figure 1, we can find that the expansion in other languages does not necessarily contain the acronym's first letters, so it isn't easy to match them directly through rules.</p>
      <p>• Other languages lack PLMs trained on scientific language.</p>
      <p>• In Table 1, the number of samples in French and Spanish is small, so trained models are prone to bias and over-adaptation.</p>
    </sec>
    <sec id="sec-4">
      <title>3. Methods</title>
<p>Recently, researchers [5, 6] have achieved remarkable effects using the BERT model in AD tasks. However, these methods do not work well in other languages. So we used the following methods to further enhance the model's out-of-data test performance and to help researchers better understand and communicate multilingual, multi-domain scientific documents:</p>
      <p>• A simple ADBCMM approach is proposed that uses data from other languages as counterfactual datasets in AD tasks, reducing model bias.</p>
      <p>• We use the Multiple-Choice Model framework to make the model focus on word-to-word comparisons, which helps the model better understand acronyms.</p>
      <p>• Our results achieve SOTA effects on both the French and Spanish AD datasets, showing outstanding performance and surpassing all other baseline methods.</p>
      <p>In this section, we will describe the framework of the overall model, as well as a range of methods for AD datasets in other languages, including ADBCMM, Trust-loss [10], Child-Tuning [11] and R-Drop [12].</p>
      <p>Table 1 gives the specific numbers of the Acronym Disambiguation datasets (columns: Data, Train, Dev, Test, Total), covering the Acronym Disambiguation tasks for 4 different fields. The total number of each dataset is not more than 10,000.</p>
      <sec id="sec-4-1">
        <title>2.2. Previous work</title>
        <p>In the AD of SDU@AAAI-21, the teams presented their methodologies and submitted a total of 10 papers. Those papers included some excellent projects.</p>
        <p>Pan [5] trained a Binary Classification Model incorporating BERT and several training strategies. His program includes dynamic adverse sample selection, task-adaptive pretraining, adversarial training [9] and pseudo labelling. This model achieved first place.</p>
        <p>Zhong [6] believes that different pre-training models store knowledge in different fields, and better results can be achieved through model integration. He proposed a Hierarchical Dual-path BERT method to capture general and professional field language, while using RoBERTa and SciBERT to perceive and predict text. He eventually reached a 93.73% F1 value on the SciAD dataset.</p>
        <p>2.3. Difficulty with multilingual data: in the AD of SDU@AAAI-22, the organizers released AD datasets covering French and Spanish, which have the difficulties listed in Section 2 compared with the English environment.</p>
      </sec>
      <sec id="sec-4-2">
        <title>3.1. The model framework</title>
<p>We use the Multiple-Choice model framework, which is different from the Binary Classification Model used by Pan [5].</p>
        <p>The Multiple-Choice model [13] refers to adding a
classifier to the end output of the BERT model. Each
sentence has only a single output value to represent the
probability of this option.</p>
<p>In Figure 2, when we use the Multiple-Choice model, each batch contains all the possible options of the same set during training. If the words in the dictionary are insufficient, we use “Padding” for filling, and finally perform softmax classification at the output end to calculate the loss.</p>
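<p>As a minimal illustration of this option-wise scoring, the sketch below (plain Python with hypothetical names such as multiple_choice_probs; the paper's actual implementation uses BERT-style encoders) pads the candidate list to a fixed number of options and applies softmax so padded slots get zero probability:</p>

```python
import math

PAD_SCORE = float("-inf")  # padded options must receive zero probability

def multiple_choice_probs(option_scores, num_options=14):
    """Softmax over the scores of all candidate expansions of one acronym.

    option_scores holds one scalar per real dictionary entry (the single
    output value of the classifier head for that option); missing slots are
    padded so every batch has the same number of options, as in the paper's
    batches of 14 options.
    """
    scores = list(option_scores) + [PAD_SCORE] * (num_options - len(option_scores))
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # numerically stable softmax
    total = sum(exps)
    return [e / total for e in exps]

# One acronym with three candidate long forms; the rest of the batch is padding.
probs = multiple_choice_probs([2.1, 0.3, -1.0])
pred = max(range(len(probs)), key=probs.__getitem__)  # argmax picks the expansion
```

<p>With a real model, each score would be the classifier-head output for one sentence-option pair; the argmax over the resulting probabilities selects the predicted expansion.</p>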
<p>Thus, we can more accurately derive the probability of each option through comparison. Compared with the Binary Classification Model, the Multiple-Choice model captures more semantic characteristics and lets the model train and predict more comprehensively on the differences between options, rather than suffering the error interference caused by dynamically constructing negative samples.</p>
      </sec>
      <sec id="sec-4-3">
        <title>3.2. ADBCMM</title>
<p>PLMs have achieved excellent results in many NLP tasks, but potential bias in the training data can harm out-of-data testing performance. Counterfactually augmented datasets are a recent solution [14], but building counterfactual samples by hand takes a lot of human resources and money, which is not realistic.</p>
<p>We found many homonym samples by analyzing erroneous samples on the dev datasets. We think these sample errors are mainly due to model bias: over-training leads to serious over-adaptation, and out-of-dataset performance is poor. That is why we add language markup information and use modified samples from other languages as new counterfactual samples.</p>
<p>In Figure 3, the training process is like a pyramid. We first train using data in multiple languages, and then we do secondary training in a single language based on that pre-training.</p>
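<p>The pyramid schedule can be sketched as two calls to an ordinary training loop; train_adbcmm, datasets and train_epoch are hypothetical names for illustration, not the authors' code:</p>

```python
def train_adbcmm(model, datasets, target_lang, train_epoch):
    """Two-stage ADBCMM schedule.

    datasets maps language -> list of samples; train_epoch is any function
    that updates `model` on one pass over the given samples.
    """
    # Stage 1: multilingual mixing -- samples from the other languages act
    # as counterfactual data for the target language.
    mixed = [s for lang in datasets for s in datasets[lang]]
    train_epoch(model, mixed)
    # Stage 2: secondary training on the target language alone, so the model
    # recovers language-specific semantic characteristics.
    train_epoch(model, datasets[target_lang])
    return model

# Tiny demonstration with a stub training function that only logs data sizes.
seen = []
def _log_epoch(model, samples):
    seen.append(len(samples))

train_adbcmm(None, {"fr": [1, 2], "es": [3], "en": [4, 5, 6]}, "fr", _log_epoch)
# stage 1 sees all 6 mixed samples, stage 2 only the 2 French ones
```

<p>In practice each stage would of course run for several epochs with the Multiple-Choice model and optimizer described in Section 4.</p>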
<p>Why continue training with single-language materials after multilingual mixed training, instead of testing directly after training on the multilingual counterfactual datasets? Because in our experiments, with the addition of more language samples, the models may become overwhelmed. Even though French, English and Spanish belong to the Indo-European language family, they all have unique language properties, syntax and vocabulary, which becomes noise interference across languages: models may ignore semantic characteristics that are unique to a particular language and prefer to learn more common ones.</p>
<p>Our ADBCMM approach can also be further extended to translation, NER, conversation generation and other tasks, and it helps address biases caused by insufficient data in low-resource language environments.</p>
      </sec>
      <sec id="sec-4-4">
        <title>3.3. Child-Tuning</title>
<p>Because the AD datasets are small and easily learned, the model shows poor generalization capacity during testing. We used the Child-Tuning method to address this discrepancy.</p>
<p>The Child-Tuning [11] strategy only updates the corresponding Child Network during the backward parameter update, without adjusting all the parameters.</p>
<p>At the end of the first epoch, we compare the model's parameters with the original parameters to find the weights with the greatest change, and subsequently we only update the parameters of this section. This approach is like a reverse Dropout [15], and it brings performance improvements to our models.</p>
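<p>A toy, framework-free sketch of this selection step (hypothetical names; the actual Child-Tuning implementation masks gradients inside the optimizer) might look like:</p>

```python
def child_mask(params_before, params_after, keep_ratio=0.3):
    """Select the child network: the fraction of parameters whose absolute
    change during the first epoch was largest."""
    deltas = {k: abs(params_after[k] - params_before[k]) for k in params_before}
    keep = max(1, int(len(deltas) * keep_ratio))
    chosen = sorted(deltas, key=deltas.get, reverse=True)[:keep]
    return {k: (k in chosen) for k in params_before}

def masked_update(params, grads, mask, lr=0.1):
    """Apply a gradient step only to parameters inside the child network;
    all other parameters stay frozen."""
    return {k: params[k] - lr * grads[k] if mask[k] else params[k] for k in params}

# w1 changed most during the first epoch, so only w1 keeps being updated.
before = {"w1": 0.0, "w2": 0.0, "w3": 0.0}
after = {"w1": 0.5, "w2": 0.01, "w3": 0.02}
mask = child_mask(before, after, keep_ratio=1 / 3)
params = masked_update(after, {"w1": 1.0, "w2": 1.0, "w3": 1.0}, mask)
```

<p>In a real run the per-parameter deltas would be tensors, and the mask would zero out gradients before each optimizer step rather than rewriting a dictionary.</p>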
<p>Table 2 reports the results by Language and Model/Method: BETO, Flaubert-base-cased, mDeberta-v3-base, + ADBCMM, + Child-Tuning, + R-Drop, ALL, and finally in Test.</p>
        <p>4.1. Baseline</p>
        <p>We used three pre-training models, including Flaubert, BETO and mDeberta, for a total of 15 training sessions.</p>
<p>For the French and Spanish languages, we used the Flaubert-base-cased [16] model and the BETO [17] cased model respectively. These models are Bidirectional Encoder Representations from Transformers [4], both of base size. They received extensive Masked Language Model (MLM) [18] training on large single-language corpora and have state-of-the-art (SOTA) results in the related languages, so they can better capture the semantic information of words. But without additional training, the two models still need to be fine-tuned on the AD datasets to solve the Acronym Disambiguation tasks. We add a classification layer behind these models, so that they become Multiple-Choice Models. We trained the models in a single language; their results serve as our baseline, and the results of the other models are compared with them.</p>
        <p>We use argmax to choose the maximum of all values as the final result for the word to be selected. In all the experiments, we set 16 epochs and used a 1e-5 learning rate (with warmup) in Pytorch [23]. We set gradient decrease to 1e-5 and batch size to 1 (each batch contains 14 different options). We employ the AdamW optimizer [24] and use the hugging-face2 [13] framework. We only use the first 300 tokens of each sample. On an Intel 10900K server with 128G memory, we used a 24G NVIDIA 3090 GPU to train our model.</p>
<p>1 You can go to https://huggingface.co/microsoft/mdeberta-v3-base to download the model. 2 https://github.com/huggingface/transformers</p>
        <p>SDU@AAAI ranks of the Acronym Disambiguation tasks in French and Spanish.</p>
      </sec>
      <sec id="sec-4-5">
        <title>4.4. Assessment of indicators</title>
<p>In AD tasks, Macro F1 was used as the assessment indicator, calculated from the precision and recall rate of the final result.</p>
<p>Precision = TP / (TP + FP), Recall = TP / (TP + FN), F1 = 2 · Precision · Recall / (Precision + Recall), Macro F1 = (1/N) · Σ_{i=1}^{N} F1_i</p>
        <p>Here N means the total number of categories, and the precision, recall rate and F1 are computed per category. The higher the Macro F1, the better the performance.3</p>
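<p>These definitions can be checked with a small plain-Python sketch (hypothetical helper names, not part of the official scorer):</p>

```python
def prf1(tp, fp, fn):
    """Precision, recall and F1 for one category from its TP/FP/FN counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def macro_f1(per_class_counts):
    """per_class_counts is a list of (TP, FP, FN) tuples, one per category;
    Macro F1 averages the per-category F1 scores with equal weight."""
    f1s = [prf1(tp, fp, fn)[2] for tp, fp, fn in per_class_counts]
    return sum(f1s) / len(f1s)

# Two categories: one predicted perfectly (F1 = 1.0), one with errors
# (P = R = 0.75, F1 = 0.75), so Macro F1 = (1.0 + 0.75) / 2 = 0.875.
score = macro_f1([(10, 0, 0), (6, 2, 2)])
```

<p>Note that Macro F1 weights every category equally, so rare acronym expansions count as much as frequent ones.</p>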
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
<p>In Table 2, we can find that under the same conditions, mDeberta performs less well in French than Flaubert-base-cased, and less well in Spanish than BETO. We speculate that this is because mDeberta uses a large amount of data in different languages during the pre-training phase; after transferring to a specific language, it may not accurately record the semantic characteristics of that single language, so its actual performance is slightly worse than BETO and Flaubert, which were pre-trained only in a single language.</p>
<p>Both Child-Tuning and R-Drop showed excellent performance in French and Spanish, bringing a 3-5% F1 boost to our model, but compared with the ADBCMM method they still slightly underperformed. Our ADBCMM method brought more than a 10% performance boost directly to our mDeberta model, which is indeed remarkable. To ensure the reproducibility of this result, we repeated the experiment three times; in all three experiments, the mDeberta models using the ADBCMM method outperformed the plain mDeberta model by more than 10% F1.</p>
      <p>3 Below is the specific meaning of the formula. TP: the prediction is positive and the sample is positive. FP: the prediction is positive but the sample is negative. FN: the prediction is negative but the sample is positive.</p>
<p>We think that ADBCMM can significantly boost our models because of the reliable counterfactual datasets. First, they match the upstream and downstream training data; second, counterfactual datasets reduce the model's bias, so it learns information more relevant to the Acronym Disambiguation task from more text data; third, even though the datasets are collected from different languages and fields, they are all scientific documents, so the general-language mDeberta model can learn the syntax characteristics of scientific documents from more of them and further improve performance.</p>
<p>Finally, we followed ADBCMM-based methods and achieved SOTA scores in both SDU@AAAI's French and Spanish tracks. In the Acronym Disambiguation tasks [8], our Precision, Recall and Macro F1 are all SOTA. Remarkably, our approach leads the second-place F1 score by 5%-6%.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
<p>In this article, we mainly describe how to use ADBCMM in the Acronym Disambiguation tasks at SDU@AAAI-22 and compare it with other models and methods to yield SOTA results. We used a straightforward method to build counterfactual datasets in ADBCMM: we directly use datasets from other languages for training and then fine-tune a second time in the target language, which gives our models a remarkable effect. After combining the Multiple-Choice Model, Child-Tuning, R-Drop and other methods, our approach leads ahead of all other systems. Apparently, in multilingual data aggregation, simply using other languages as counterfactual datasets can improve performance. At the same time, our work provides practical help for researchers to understand scientific documentation better.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Acknowledgement</title>
    </sec>
    <sec id="sec-8">
      <title>8. Online Resources References</title>
      <p>The work is supported by the National Key Research
and Development Program of China (2020AAA0106400)
and the National Natural Science Foundation of China
(61922085, 61976211). The work is also supported
by the Beijing Academy of Artificial Intelligence
(BAAI2019QN0301), the Key Research Program of the
Chinese Academy of Sciences under Grant
(ZDBS-SSWJSC006), the independent research project of the National
Laboratory of Pattern Recognition, China and the Youth
Innovation Promotion Association CAS, China.
ing Research 15 (2014) 1929–1958. URL: [24] I. Loshchilov, F. Hutter, Fixing weight decay
reguhttp://jmlr.org/papers/v15/srivastava14a.html. larization in adam, ArXiv abs/1711.05101 (2017).
[16] H. Le, L. Vial, J. Frej, V. Segonne, M. Coavoux,</p>
      <p>B. Lecouteux, A. Allauzen, B. Crabbé, L. Besacier,
D. Schwab, Flaubert: Unsupervised language model
pre-training for french, in: Proceedings of The
12th Language Resources and Evaluation
Conference, European Language Resources Association,
Marseille, France, 2020, pp. 2479–2490. URL: https:
//www.aclweb.org/anthology/2020.lrec-1.302.
[17] J. Cañete, G. Chaperon, R. Fuentes, J.-H. Ho,</p>
      <p>H. Kang, J. Pérez, Spanish pre-trained bert model
and evaluation data, in: PML4DC at ICLR 2020,
2020.
[18] W. L. Taylor, “cloze procedure”: A new
tool for measuring readability,
Journalism Quarterly 30 (1953) 415–433. URL:
https://doi.org/10.1177/107769905303000401.
doi:10.1177/107769905303000401.</p>
      <p>arXiv:https://doi.org/10.1177/107769905303000401.
[19] P. He, J. Gao, W. Chen, Debertav3:
Improving deberta using electra-style pre-training with
gradient-disentangled embedding sharing, 2021.</p>
      <p>arXiv:2111.09543.
[20] P. He, X. Liu, J. Gao, W. Chen, Deberta:
Decodingenhanced bert with disentangled attention, in:
International Conference on Learning
Representations, 2021. URL: https://openreview.net/forum?id=</p>
      <p>XPZIaotutsD.
[21] A. Conneau, K. Khandelwal, N. Goyal, V.
Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott,
L. Zettlemoyer, V. Stoyanov, Unsupervised
crosslingual representation learning at scale, in:
Proceedings of the 58th Annual Meeting of the
Association for Computational Linguistics,
Association for Computational Linguistics, Online,
2020, pp. 8440–8451. URL: https://www.aclweb.
org/anthology/2020.acl-main.747. doi:10.18653/
v1/2020.acl-main.747.
[22] K. Clark, M.-T. Luong, Q. V. Le, C. D. Manning,</p>
      <p>ELECTRA: Pre-training text encoders as
discriminators rather than generators, in: ICLR, 2020. URL:
https://openreview.net/pdf?id=r1xMH1BtvB.
[23] A. Paszke, S. Gross, F. Massa, A. Lerer, J.
Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein,
L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito,
M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner,
L. Fang, J. Bai, S. Chintala, Pytorch: An
imperative style, high-performance deep learning
library, in: H. Wallach, H. Larochelle, A.
Beygelzimer, F. d'Alché-Buc, E. Fox, R. Garnett (Eds.),
Advances in Neural Information Processing
Systems, volume 32, Curran Associates, Inc., 2019.</p>
      <p>URL: https://proceedings.neurips.cc/paper/2019/
file/bdbca288fee7f92f2bfa9f7012727740-Paper.pdf.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>