<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>ADBCMM: Acronym Disambiguation by Building Counterfactuals and Multilingual Mixing</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yixuan Weng</string-name>
          <email>wengsyx@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
          <xref ref-type="aff" rid="aff6">6</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fei Xia</string-name>
          <email>xiafei2020@ia.ac.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
          <xref ref-type="aff" rid="aff6">6</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bin Li</string-name>
          <email>Mlibincn@hnu.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff6">6</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiusheng Huang</string-name>
          <email>huangxiusheng2020@ia.ac.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
          <xref ref-type="aff" rid="aff6">6</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shizhu He</string-name>
          <email>shizhu.he@nlpr.ia.ac.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
          <xref ref-type="aff" rid="aff6">6</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>(science)</institution>
          ,
          <addr-line>English (legal), French and Spanish were given</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Chinese Academy Sciences</institution>
          ,
          <addr-line>Beijing, 100190</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>In the datasets</institution>
          ,
          <addr-line>30,237 data in the four fields of English</addr-line>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>In the past, researchers have tried to solve AD prob-</institution>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>National Laboratory of Pattern Recognition, Institute of Automation</institution>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>National Laboratory of Pattern Recognition, Institute of Automation,Chinese Academy Sciences</institution>
          ,
          <addr-line>Beijing, 100190</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff6">
          <label>6</label>
          <institution>Red means wrong</institution>
          ,
          <addr-line>green means right. Acronyms in English</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Scientific documents often contain a large number of acronyms. Disambiguation of these acronyms will help researchers better understand the meaning of vocabulary in the documents. In the past, thanks to large amounts of data from English literature, the acronym disambiguation task was mainly applied to English literature. However, for other low-resource languages, training data is scarce, so the generalization performance of models is poor. To address this issue, this paper proposes a new method for acronym disambiguation, named ADBCMM, which can significantly improve performance in low-resource languages by building counterfactuals and multilingual mixing. Specifically, by balancing data bias in low-resource languages, ADBCMM is able to improve test performance outside the dataset. In SDU@AAAI-22 - Shared Task 2: Acronym Disambiguation, the proposed method won first place in French and Spanish. You can reproduce our results at https://github.com/WENGSYX/ADBCMM.</p>
      </abstract>
      <kwd-group>
        <kwd>Multilingual</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>https://github.com/WENGSYX/ADBCMM.</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
<p>The exchanges between countries become closer with the progress of globalization. As countries communicate more politically, economically and academically, language understanding has become a new challenge. Acronyms often appear in the scientific documents of different countries. Compared to English, acronyms are more challenging to understand in other languages, so they can become a barrier for researchers reading scientific literature and affect exchanges and cooperation between countries.</p>
      <p>Scientific documents contain a large number of acronyms, and for each of them we need to find the correct interpretation in the current context from a dictionary. For example, in “The traditional Chinese sentences are transferred into SC”, “SC” means “simplified Chinese” rather than “System Combination”. It is difficult for people who are not familiar with a language to understand related acronyms, so we need to disambiguate acronyms, which is a challenging task.</p>
      <p>In the past, researchers have tried to solve AD problems with methods such as word embedding [2] and deep learning [3]. Over the last few years, the BERT [4] model has emerged, which adopts pre-training on a large corpus. Many studies have shown that these pre-trained models (PTMs) have gained a wealth of generic characteristics.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Related Work</title>
<p>In this section, we will introduce the Acronym Disambiguation datasets [7] and how the Acronym Disambiguation tasks [8] were solved in English scenarios in the past, while introducing the difficulties of the Acronym Disambiguation tasks in other languages.</p>
      <p>2.1. AD dataset</p>
      <p>In the datasets, 30,237 data in the four fields of English (science), English (legal), French and Spanish were given. Compared with the English environment, the other languages present the following difficulties:</p>
      <p>• In Figure 1, we can find that the expansion in other languages does not necessarily contain the acronym's first letters, so it isn't easy to match them directly through rules.</p>
      <p>• Other languages lack PLMs trained on scientific language.</p>
      <p>• In Table 1, the number of samples in French and Spanish is small, so trained models are prone to bias and over-adaptation.</p>
    </sec>
    <sec id="sec-4">
      <title>3. Methods</title>
<p>Recently, researchers [5, 6] have achieved remarkable effects using the BERT model in AD tasks. However, these methods do not work well in other languages. So we used the following methods to further enhance the model's out-of-data test performance and to help researchers better understand and communicate multilingual, multi-domain scientific documents:</p>
      <p>• A simple ADBCMM approach is proposed that uses data from other languages as counterfactual datasets in AD tasks, reducing model bias.</p>
      <p>• We use the Multiple-Choice Model framework to make the model focus on word-to-word comparisons, which helps the model better understand acronyms.</p>
      <p>• Our results achieve SOTA effects on both the French and Spanish AD datasets, showing outstanding performance and surpassing all other baseline methods.</p>
      <p>In this section, we will describe the framework of the overall model, as well as a range of methods for AD datasets in other languages, including ADBCMM, Trust-loss [10], Child-Tuning [11] and R-Drop [12].</p>
      <p>Table 1 gives the specific numbers of the Acronym Disambiguation datasets (columns: Data, Train, Dev, Test, Total), covering the Acronym Disambiguation tasks for 4 different fields. The total number of each dataset is not more than 10,000.</p>
      <sec id="sec-4-1">
        <title>2.2. Previous work</title>
        <p>In the AD of SDU@AAAI-21, the teams presented their methodologies and submitted a total of 10 papers. Those papers included some excellent projects.</p>
        <p>Pan [5] trained a Binary Classification Model incorporating BERT and several training strategies. His program includes dynamic adverse sample selection, task-adaptive pretraining, adversarial training [9] and pseudo labelling. This model achieved first place.</p>
        <p>Zhong [6] believes that different pre-training models store knowledge in different fields, and better results can be achieved through model integration. He proposed a Hierarchical Dual-path BERT method to capture general and professional field language, while using RoBERTa and SciBERT to perceive and predict text. He eventually reached a 93.73% F1 value on the SciAD dataset.</p>
        <p>2.3. Difficulty with multilingual data: in the AD of SDU@AAAI-22, the organizers released AD datasets covering French and Spanish, which have the difficulties listed in Section 2 compared with the English environment.</p>
      </sec>
      <sec id="sec-4-2">
        <title>3.1. The model framework</title>
<p>We use the Multiple-Choice model framework, which is different from the Binary Classification Model used by Pan [5].</p>
        <p>The Multiple-Choice model [13] refers to adding a
classifier to the end output of the BERT model. Each
sentence has only a single output value to represent the
probability of this option.</p>
<p>In Figure 2, when we use the Multiple-Choice model, each batch contains all the possible options of the same set during training. If the words in the dictionary are insufficient, we use “Padding” for filling, and finally perform softmax classification at the output end to calculate the loss.</p>
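<p>As a minimal illustration of this option-wise scoring, the sketch below (plain Python with hypothetical names such as multiple_choice_probs; the paper's actual implementation uses BERT-style encoders) pads the candidate list to a fixed number of options and applies softmax so padded slots get zero probability:</p>

```python
import math

PAD_SCORE = float("-inf")  # padded options must receive zero probability

def multiple_choice_probs(option_scores, num_options=14):
    """Softmax over the scores of all candidate expansions of one acronym.

    option_scores holds one scalar per real dictionary entry (the single
    output value of the classifier head for that option); missing slots are
    padded so every batch has the same number of options, as in the paper's
    batches of 14 options.
    """
    scores = list(option_scores) + [PAD_SCORE] * (num_options - len(option_scores))
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # numerically stable softmax
    total = sum(exps)
    return [e / total for e in exps]

# One acronym with three candidate long forms; the rest of the batch is padding.
probs = multiple_choice_probs([2.1, 0.3, -1.0])
pred = max(range(len(probs)), key=probs.__getitem__)  # argmax picks the expansion
```

<p>With a real model, each score would be the classifier-head output for one sentence-option pair; the argmax over the resulting probabilities selects the predicted expansion.</p>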
<p>Thus, we can more accurately derive the probability of each option through comparison. Compared with the Binary Classification Model, the Multiple-Choice model captures more semantic characteristics and lets the model train and predict more comprehensively on the differences between options, rather than suffering the error interference caused by dynamically constructing negative samples.</p>
      </sec>
      <sec id="sec-4-3">
        <title>3.2. ADBCMM</title>
<p>PLMs have achieved excellent results in many NLP tasks, but potential bias in the training data can harm out-of-data testing performance. Counterfactually augmented datasets are a recent solution [14], but building counterfactual samples by hand takes a lot of human resources and money, which is not realistic.</p>
<p>We found many homonym samples by analyzing erroneous samples on the dev datasets. We think these sample errors are mainly due to model bias: over-training leads to serious over-adaptation, and out-of-dataset performance is poor. That is why we add language markup information and use modified samples from other languages as new counterfactual samples.</p>
<p>In Figure 3, the training process is like a pyramid. We first train using data in multiple languages, and then we do secondary training in a single language based on that pre-training.</p>
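<p>The pyramid schedule can be sketched as two calls to an ordinary training loop; train_adbcmm, datasets and train_epoch are hypothetical names for illustration, not the authors' code:</p>

```python
def train_adbcmm(model, datasets, target_lang, train_epoch):
    """Two-stage ADBCMM schedule.

    datasets maps language -> list of samples; train_epoch is any function
    that updates `model` on one pass over the given samples.
    """
    # Stage 1: multilingual mixing -- samples from the other languages act
    # as counterfactual data for the target language.
    mixed = [s for lang in datasets for s in datasets[lang]]
    train_epoch(model, mixed)
    # Stage 2: secondary training on the target language alone, so the model
    # recovers language-specific semantic characteristics.
    train_epoch(model, datasets[target_lang])
    return model

# Tiny demonstration with a stub training function that only logs data sizes.
seen = []
def _log_epoch(model, samples):
    seen.append(len(samples))

train_adbcmm(None, {"fr": [1, 2], "es": [3], "en": [4, 5, 6]}, "fr", _log_epoch)
# stage 1 sees all 6 mixed samples, stage 2 only the 2 French ones
```

<p>In practice each stage would of course run for several epochs with the Multiple-Choice model and optimizer described in Section 4.</p>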
<p>Why continue training with single-language materials after multilingual mixed training, instead of testing directly after training on the multilingual counterfactual datasets? Because in our experiments, with the addition of more language samples, the models may become overwhelmed. Even though French, English and Spanish belong to the Indo-European language family, they all have unique language properties, syntax and vocabulary, which becomes noise interference across languages: models may ignore semantic characteristics that are unique to a particular language and prefer to learn more common ones.</p>
<p>Our ADBCMM approach can also be further extended to translation, NER, conversation generation and other tasks, and it helps address biases caused by insufficient data in low-resource language environments.</p>
      </sec>
      <sec id="sec-4-4">
        <title>3.3. Child-Tuning</title>
<p>Because the AD datasets are small and easily learned, the model shows poor generalization capacity during testing. We used the Child-Tuning method to address this discrepancy.</p>
<p>The Child-Tuning [11] strategy only updates the corresponding Child Network during the backward parameter update, without adjusting all the parameters.</p>
<p>At the end of the first epoch, we compare the model's parameters with the original parameters to find the weights with the greatest change, and subsequently we only update the parameters of this section. This approach is like a reverse Dropout [15], and it brings performance improvements to our models.</p>
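<p>A toy, framework-free sketch of this selection step (hypothetical names; the actual Child-Tuning implementation masks gradients inside the optimizer) might look like:</p>

```python
def child_mask(params_before, params_after, keep_ratio=0.3):
    """Select the child network: the fraction of parameters whose absolute
    change during the first epoch was largest."""
    deltas = {k: abs(params_after[k] - params_before[k]) for k in params_before}
    keep = max(1, int(len(deltas) * keep_ratio))
    chosen = sorted(deltas, key=deltas.get, reverse=True)[:keep]
    return {k: (k in chosen) for k in params_before}

def masked_update(params, grads, mask, lr=0.1):
    """Apply a gradient step only to parameters inside the child network;
    all other parameters stay frozen."""
    return {k: params[k] - lr * grads[k] if mask[k] else params[k] for k in params}

# w1 changed most during the first epoch, so only w1 keeps being updated.
before = {"w1": 0.0, "w2": 0.0, "w3": 0.0}
after = {"w1": 0.5, "w2": 0.01, "w3": 0.02}
mask = child_mask(before, after, keep_ratio=1 / 3)
params = masked_update(after, {"w1": 1.0, "w2": 1.0, "w3": 1.0}, mask)
```

<p>In a real run the per-parameter deltas would be tensors, and the mask would zero out gradients before each optimizer step rather than rewriting a dictionary.</p>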
<p>Table 2 reports the results by Language and Model/Method: BETO, Flaubert-base-cased, mDeberta-v3-base, + ADBCMM, + Child-Tuning, + R-Drop, ALL, and finally in Test.</p>
        <p>4.1. Baseline</p>
        <p>We used three pre-training models, including Flaubert, BETO and mDeberta, for a total of 15 training sessions.</p>
<p>For the French and Spanish languages, we used the Flaubert-base-cased [16] model and the BETO [17] cased model respectively. These models are Bidirectional Encoder Representations from Transformers [4], both of base size. They received extensive Masked Language Model (MLM) [18] training on large single-language corpora and have state-of-the-art (SOTA) results in the related languages, so they can better capture the semantic information of words. But without additional training, the two models still need to be fine-tuned on the AD datasets to solve the Acronym Disambiguation tasks. We add a classification layer behind these models, so that they become Multiple-Choice Models. We trained the models in a single language; their results serve as our baseline, and the results of the other models are compared with them.</p>
        <p>We use argmax to choose the maximum of all values as the final result for the word to be selected. In all the experiments, we set 16 epochs and used a 1e-5 learning rate (with warmup) in Pytorch [23]. We set gradient decrease to 1e-5 and batch size to 1 (each batch contains 14 different options). We employ the AdamW optimizer [24] and use the hugging-face2 [13] framework. We only use the first 300 tokens of each sample. On an Intel 10900K server with 128G memory, we used a 24G NVIDIA 3090 GPU to train our model.</p>
<p>1 You can go to https://huggingface.co/microsoft/mdeberta-v3-base to download the model. 2 https://github.com/huggingface/transformers</p>
        <p>SDU@AAAI ranks of the Acronym Disambiguation tasks in French and Spanish.</p>
      </sec>
      <sec id="sec-4-5">
        <title>4.4. Assessment of indicators</title>
<p>In AD tasks, Macro F1 was used as the assessment indicator, calculated from the precision and recall rate of the final result.</p>
<p>Precision = TP / (TP + FP), Recall = TP / (TP + FN), F1 = 2 · Precision · Recall / (Precision + Recall), Macro F1 = (1/N) · Σ_{i=1}^{N} F1_i</p>
        <p>Here N means the total number of categories, and the precision, recall rate and F1 are computed per category. The higher the Macro F1, the better the performance.3</p>
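<p>These definitions can be checked with a small plain-Python sketch (hypothetical helper names, not part of the official scorer):</p>

```python
def prf1(tp, fp, fn):
    """Precision, recall and F1 for one category from its TP/FP/FN counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def macro_f1(per_class_counts):
    """per_class_counts is a list of (TP, FP, FN) tuples, one per category;
    Macro F1 averages the per-category F1 scores with equal weight."""
    f1s = [prf1(tp, fp, fn)[2] for tp, fp, fn in per_class_counts]
    return sum(f1s) / len(f1s)

# Two categories: one predicted perfectly (F1 = 1.0), one with errors
# (P = R = 0.75, F1 = 0.75), so Macro F1 = (1.0 + 0.75) / 2 = 0.875.
score = macro_f1([(10, 0, 0), (6, 2, 2)])
```

<p>Note that Macro F1 weights every category equally, so rare acronym expansions count as much as frequent ones.</p>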
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
<p>In Table 2, we can find that under the same conditions, mDeberta performs less well in French than Flaubert-base-cased, and less well in Spanish than BETO. We speculate that this is because mDeberta uses a large amount of data in different languages during the pre-training phase; after transferring to a specific language, it may not accurately record the semantic characteristics of that single language, so its actual performance is slightly worse than BETO and Flaubert, which were pre-trained only in a single language.</p>
<p>Both Child-Tuning and R-Drop showed excellent performance in French and Spanish, bringing a 3-5% F1 boost to our model, but compared with the ADBCMM method they still slightly underperformed. Our ADBCMM method brought more than a 10% performance boost directly to our mDeberta model, which is indeed remarkable. To ensure the reproducibility of this result, we repeated the experiment three times; in all three experiments, the mDeberta models using the ADBCMM method outperformed the plain mDeberta model by more than 10% F1.</p>
      <p>3 Below is the specific meaning of the formula. TP: the prediction is positive and the sample is positive. FP: the prediction is positive but the sample is negative. FN: the prediction is negative but the sample is positive.</p>
<p>We think that ADBCMM can significantly boost our models because of the reliable counterfactual datasets. First, they match the upstream and downstream training data; second, counterfactual datasets reduce the model's bias, so it learns information more relevant to the Acronym Disambiguation task from more text data; third, even though the datasets are collected from different languages and fields, they are all scientific documents, so the general-language mDeberta model can learn the syntax characteristics of scientific documents from more of them and further improve performance.</p>
<p>Finally, we followed ADBCMM-based methods and achieved SOTA scores in both SDU@AAAI's French and Spanish tracks. In the Acronym Disambiguation tasks [8], our Precision, Recall and Macro F1 are all SOTA. Remarkably, our approach leads the second-place F1 score by 5%-6%.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
<p>In this article, we mainly describe how to use ADBCMM in the Acronym Disambiguation tasks at SDU@AAAI-22 and compare it with other models and methods to yield SOTA results. We used a straightforward method to build counterfactual datasets in ADBCMM: we directly use datasets from other languages for training and then fine-tune a second time in the target language, which gives our models a remarkable effect. After combining the Multiple-Choice Model, Child-Tuning, R-Drop and other methods, our approach leads ahead of all other systems. Apparently, in multilingual data aggregation, simply using other languages as counterfactual datasets can improve performance. At the same time, our work provides practical help for researchers to understand scientific documentation better.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Acknowledgement</title>
    </sec>
    <sec id="sec-8">
      <title>8. Online Resources References</title>
      <p>The work is supported by the National Key Research
and Development Program of China (2020AAA0106400)
and the National Natural Science Foundation of China
(61922085, 61976211). The work is also supported
by the Beijing Academy of Artificial Intelligence
(BAAI2019QN0301), the Key Research Program of the
Chinese Academy of Sciences under Grant
(ZDBS-SSWJSC006), the independent research project of the National
Laboratory of Pattern Recognition, China and the Youth
Innovation Promotion Association CAS, China.
ing Research 15 (2014) 1929–1958. URL: [24] I. Loshchilov, F. Hutter, Fixing weight decay
reguhttp://jmlr.org/papers/v15/srivastava14a.html. larization in adam, ArXiv abs/1711.05101 (2017).
[16] H. Le, L. Vial, J. Frej, V. Segonne, M. Coavoux,</p>
      <p>B. Lecouteux, A. Allauzen, B. Crabbé, L. Besacier,
D. Schwab, Flaubert: Unsupervised language model
pre-training for french, in: Proceedings of The
12th Language Resources and Evaluation
Conference, European Language Resources Association,
Marseille, France, 2020, pp. 2479–2490. URL: https:
//www.aclweb.org/anthology/2020.lrec-1.302.
[17] J. Cañete, G. Chaperon, R. Fuentes, J.-H. Ho,</p>
      <p>H. Kang, J. Pérez, Spanish pre-trained bert model
and evaluation data, in: PML4DC at ICLR 2020,
2020.
[18] W. L. Taylor, “cloze procedure”: A new
tool for measuring readability,
Journalism Quarterly 30 (1953) 415–433. URL:
https://doi.org/10.1177/107769905303000401.
doi:10.1177/107769905303000401.</p>
      <p>arXiv:https://doi.org/10.1177/107769905303000401.
[19] P. He, J. Gao, W. Chen, Debertav3:
Improving deberta using electra-style pre-training with
gradient-disentangled embedding sharing, 2021.</p>
      <p>arXiv:2111.09543.
[20] P. He, X. Liu, J. Gao, W. Chen, Deberta:
Decodingenhanced bert with disentangled attention, in:
International Conference on Learning
Representations, 2021. URL: https://openreview.net/forum?id=</p>
      <p>XPZIaotutsD.
[21] A. Conneau, K. Khandelwal, N. Goyal, V.
Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott,
L. Zettlemoyer, V. Stoyanov, Unsupervised
crosslingual representation learning at scale, in:
Proceedings of the 58th Annual Meeting of the
Association for Computational Linguistics,
Association for Computational Linguistics, Online,
2020, pp. 8440–8451. URL: https://www.aclweb.
org/anthology/2020.acl-main.747. doi:10.18653/
v1/2020.acl-main.747.
[22] K. Clark, M.-T. Luong, Q. V. Le, C. D. Manning,</p>
      <p>ELECTRA: Pre-training text encoders as
discriminators rather than generators, in: ICLR, 2020. URL:
https://openreview.net/pdf?id=r1xMH1BtvB.
[23] A. Paszke, S. Gross, F. Massa, A. Lerer, J.
Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein,
L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito,
M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner,
L. Fang, J. Bai, S. Chintala, Pytorch: An
imperative style, high-performance deep learning
library, in: H. Wallach, H. Larochelle, A.
Beygelzimer, F. d'Alché-Buc, E. Fox, R. Garnett (Eds.),
Advances in Neural Information Processing
Systems, volume 32, Curran Associates, Inc., 2019.</p>
      <p>URL: https://proceedings.neurips.cc/paper/2019/
file/bdbca288fee7f92f2bfa9f7012727740-Paper.pdf.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>