Domain Adaptive Pretraining for Multilingual Acronym
Extraction
Usama Yaseen1,2 , Stefan Langer1,2
1 Technology, Siemens AG, Munich, Germany
2 CIS, University of Munich (LMU), Munich, Germany


Abstract
This paper presents our findings from participating in the multilingual acronym extraction shared task at SDU@AAAI-22. The task consists of acronym extraction from documents in six languages within the scientific and legal domains. To address multilingual acronym extraction, we employed a BiLSTM-CRF with multilingual XLM-RoBERTa embeddings. We further pretrained the XLM-RoBERTa model on the shared task corpus to adapt its embeddings to the task domains. Our system (team: SMR-NLP) achieved competitive performance for acronym extraction across all the languages.

Keywords
pretraining, domain adaptation, acronym extraction, XLM-RoBERTa



1. Introduction

The number of scientific papers published every year is growing at an increasing rate [1]. The authors of scientific publications employ abbreviations to make technical terms less verbose. These abbreviations take the form of acronyms or initialisms; we refer to the abbreviated term as the "acronym" and to the full term as the "long form". On the one hand, acronyms let authors avoid frequently used long phrases, making writing convenient for researchers; on the other hand, they pose a challenge to non-expert readers. This challenge is heightened by the fact that acronyms are not always written in a standard way, e.g. XGBoost is an acronym of eXtreme Gradient Boosting [2]. Following the increase in scientific publications, the number of acronyms is growing enormously as well [3]. Thus, automatic identification of acronyms and their corresponding long forms is crucial for scientific document understanding tasks.

The existing work on acronym extraction consists of carefully crafted rule-based methods [4, 5] and feature-based methods [6, 7]. These methods typically achieve high precision, as they are designed to find long forms, but they suffer from low recall [8]. Recently, deep-learning-based sequence models such as LSTM-CRF [9] have been explored for acronym extraction; however, these methods require large training data to achieve optimal performance. A major limitation of existing work on acronym extraction is that most of it focuses only on the English language.

2. Task Description and Contributions

We participate in the Acronym Extraction task [10] organized by the Scientific Document Understanding workshop 2022 (SDU@AAAI-22). The task consists of identifying acronyms (short forms) and their meanings (long forms) in documents in six languages: Danish (da), English (en), French (fr), Spanish (es), Persian (fa) and Vietnamese (vi). The task corpus [11] consists of documents from the scientific (en, fa, vi) and legal (da, en, fr, es) domains.

Our contributions are as follows:

1. We model multilingual acronym extraction as a sequence labelling task and employ contextualized multilingual XLM-RoBERTa embeddings [12]. Our system consists of a single model for multilingual acronym extraction and is hence practical for real-world usage.

2. We investigate domain-adaptive pretraining of XLM-RoBERTa on the task corpus, which results in improved performance across all the languages.

3. Methodology

In the following sections we discuss our proposed model for acronym extraction.

3.1. Multilingual Acronym Extraction

Our sequence labelling model follows the well-known architecture [13] of a bidirectional long short-term memory (BiLSTM) network with a conditional random field (CRF) output layer [14]. To address the multilingual aspect of the task, we employ contextualized multilingual XLM-RoBERTa embeddings [12] in all experiments.

usama.yaseen@siemens.com (U. Yaseen); langer.stefan@siemens.com (S. Langer)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)
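To make the sequence labelling formulation concrete, the following is a minimal, illustrative sketch (not the authors' code) of how a sentence annotated with an acronym and its long form can be converted into token-level BIO labels; the label names `B-short`/`I-short`/`B-long`/`I-long` follow a common convention for this task and are an assumption here.

```python
# Illustrative sketch (not the authors' code): converting token-span
# annotations for short forms (acronyms) and long forms into per-token
# BIO labels, as in the sequence-labelling formulation of acronym extraction.

def bio_labels(tokens, short_spans, long_spans):
    """Assign one BIO label per token.

    short_spans / long_spans are lists of (start, end) token indices,
    end exclusive. Tokens outside every span receive the label "O".
    """
    labels = ["O"] * len(tokens)
    for spans, tag in ((short_spans, "short"), (long_spans, "long")):
        for start, end in spans:
            labels[start] = f"B-{tag}"
            for i in range(start + 1, end):
                labels[i] = f"I-{tag}"
    return labels

tokens = "XGBoost stands for eXtreme Gradient Boosting".split()
labels = bio_labels(tokens, short_spans=[(0, 1)], long_spans=[(3, 6)])
print(list(zip(tokens, labels)))
```

The BiLSTM-CRF then learns to predict exactly such a label sequence, with the CRF layer enforcing valid transitions (e.g. `I-long` may only follow `B-long` or `I-long`).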
         epochs         all               da              en-sci            en-leg              fr                fa                es                vi
                      P/R/F1            P/R/F1            P/R/F1            P/R/F1            P/R/F1            P/R/F1            P/R/F1            P/R/F1
                                                                               dev
    r1     0       .841/.868/.854    .825/.833/.829    .727/.750/.738    .758/.784/.771    .738/.742/.740    .619/.539/.576    .820/.871/.845    .375/.547/.445
    r2     1       .855/.876/.866    .826/.833/.830    .747/.757/.752    .786/.793/.789    .756/.750/.753    .644/.560/.599    .832/.872/.852    .385/.615/.474
    r3     3      .857/.878/.868    .827/.833/.830    .750/.759/.755    .789/.795/.792    .788/.751/.754    .665/.557/.606    .832/.873/.852    .408/.689/.512
    r4     3              -           .77/.773/.775    .617/.703/.650    .677/.677/.677    .715/.733/.724    .864/.294/.439    .823/.850/.836    .623/.074/.132
                                                                               test
    r5     3             -          .825/.833/.829    .727/.750/.738     .758/.784/.771    .738/.742/.740   .619/.539/.576    .820/.871/.845    .375/.547/.445

Table 1
Precision (P), Recall (R) and F1-score on the development set (rows r1-r4) and test set (row r5). Here, epochs: number of pretraining epochs of XLM-RoBERTa on the task corpus; en-sci: English scientific domain; en-leg: English legal domain; all: all languages combined.
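For reference, the P/R/F1 numbers in Table 1 are span-level scores computed by the official shared-task scorer (linked in Section 4.1), which is authoritative. As a hedged illustration only, exact-match span scoring can be sketched as follows; the `(start, end, type)` tuple representation is an assumption, not the scorer's actual data format.

```python
# Illustrative sketch of span-level precision/recall/F1 with exact matching.
# The official shared-task scorer is authoritative; this only shows the idea.

def prf1(gold_spans, pred_spans):
    """Spans are (start, end, type) tuples; a prediction counts only on exact match."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)                       # exact-match true positives
    p = tp / len(pred) if pred else 0.0         # precision
    r = tp / len(gold) if gold else 0.0         # recall
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = [(0, 1, "short"), (3, 6, "long")]
pred = [(0, 1, "short"), (3, 5, "long")]        # long form predicted too short
print(prf1(gold, pred))  # (0.5, 0.5, 0.5)
```

Under exact matching, a long form that is predicted one token too short counts as a full error, which is one reason long-form recall is harder than acronym recall.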


Language         train    dev
da               3082     385
en-scientific    3980     497
en-legal         3564     445
fr               7783     973
es               5928     741
fa               1336     167
vi               1274     159

Table 2
Sentence counts of the train and development sets across the languages.

Hyperparameter        Value
hidden size           256
learning rate         5.0e-6
training epochs       20
pretraining epochs    3

Table 3
Hyperparameter settings for acronym extraction.

3.2. Domain Adaptive Pretraining

The original XLM-RoBERTa embeddings [12] are trained on filtered CommonCrawl data (general domain), whereas the data of the shared task comprises documents from the scientific and legal domains. To better adapt the contextualized representations to the target scientific and legal domains, we further pretrained the original XLM-RoBERTa model on the task corpus. Our experiments demonstrate that this domain-adaptive pretraining improves performance on acronym extraction across all the languages.

4. Experiments and Results

4.1. Dataset

Table 2 reports the sentence counts of the train and development sets for all the languages. Persian and Vietnamese have substantially fewer sentences than the rest of the languages in the corpus. As a pre-processing step, we used spaCy [15] to perform word tokenization and POS tagging.

We do not apply any strategy to explicitly account for the low amount of training data for Persian and Vietnamese. Table 3 lists the best configuration of hyperparameters. We compute the macro-averaged F1-score using the script provided by the organizers on the development set1. We employ early stopping and report the F1-score on the test set using the best-performing model on the development set.

4.2. Results

Table 1 reports the F1-score on the development and test sets for all the languages. As a baseline experiment, we combined the training data of all the languages and trained a BiLSTM-CRF model using the pretrained multilingual XLM-RoBERTa2 embeddings (row r1). This achieves an overall F1-score of 0.854.

We then pretrained the XLM-RoBERTa model for one epoch on the task corpus (train and development sets), which improves the overall F1-score by 0.012, to 0.866 (row r2). Increasing the number of pretraining epochs to 3 yields a further improvement of 0.002 in the overall F1-score (row r3).

We also experimented with training individual models for each language (including separate models for English scientific and English legal). This results in a significant decrease in F1-score for all the languages (on average 0.12 points; see row r4). This demonstrates that the BiLSTM-CRF with multilingual XLM-RoBERTa embeddings performs best when trained on several languages together, enabling effective cross-lingual transfer.

The F1-scores of our submission on the test set are reported in row r5. Our test submission achieves F1-scores similar to those on the development set for all the languages, demonstrating effective generalization to the test set; Vietnamese is an exception, where the F1-score on the test set is significantly worse than on the development set (see rows r5 vs r3).

1 https://github.com/amirveyseh/AAAI-22-SDU-shared-task-1-AE/blob/main/scorer.py
2 https://huggingface.co/xlm-roberta-base
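The domain-adaptive pretraining of Section 3.2 amounts to continuing masked-language-model (MLM) training on the task corpus. The paper does not include the pretraining code, so the following stdlib-only sketch illustrates only the standard BERT/RoBERTa-style masking step applied to each batch (mask ~15% of tokens; of those, 80% become the mask token, 10% a random token, 10% stay unchanged). The token ids `MASK_ID` and `VOCAB_SIZE` are placeholders, not the real XLM-RoBERTa values.

```python
import random

# Illustrative sketch of RoBERTa-style dynamic masking for continued MLM
# pretraining on a domain corpus (not the authors' pipeline).
MASK_ID, VOCAB_SIZE = 250001, 250002  # placeholder ids for illustration

def mask_tokens(token_ids, mask_prob=0.15, rng=random.Random(0)):
    """Return (input_ids, labels): labels are -100 except at masked positions,
    where they hold the original token id to be predicted."""
    inputs, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels[i] = tok                          # predict the original token
            roll = rng.random()
            if roll < 0.8:
                inputs[i] = MASK_ID                  # 80%: replace with mask token
            elif roll < 0.9:
                inputs[i] = rng.randrange(VOCAB_SIZE)  # 10%: random token
            # remaining 10%: keep the original token unchanged
    return inputs, labels

inputs, labels = mask_tokens(list(range(100, 200)))
print(sum(l != -100 for l in labels))  # roughly 15 of 100 positions selected
```

In practice this masking is applied on the fly while minimizing the MLM cross-entropy over the task corpus; frameworks such as Hugging Face Transformers provide this masking as a built-in data collator.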
5. Conclusion

In this paper, we described the system with which we participated in the multilingual acronym extraction shared task organized by the Scientific Document Understanding workshop 2022 (SDU@AAAI-22). We formulated multilingual acronym extraction in six languages and two domains as a sequence labelling task and employed a BiLSTM-CRF model with multilingual XLM-RoBERTa embeddings. We pretrained the XLM-RoBERTa model on the target scientific and legal domains to better adapt the multilingual XLM-RoBERTa embeddings to the target task. Our system demonstrates competitive performance on the multilingual acronym extraction task for all the languages. In the future, we would like to carry out a detailed error analysis to further enhance our multilingual acronym extraction models.

Acknowledgments

This research was supported by the Federal Ministry for Economic Affairs and Energy (Bundesministerium für Wirtschaft und Energie: https://bmwi.de), grant 01MD19003E (PLASS: https://plass.io) at Siemens AG (Technology), Munich, Germany.

References

[1] L. Bornmann, R. Mutz, Growth rates of modern science: A bibliometric analysis based on the number of publications and cited references, J. Assoc. Inf. Sci. Technol. 66 (2015) 2215–2222. URL: https://doi.org/10.1002/asi.23329.
[2] T. Chen, C. Guestrin, XGBoost: A scalable tree boosting system, in: B. Krishnapuram, M. Shah, A. J. Smola, C. C. Aggarwal, D. Shen, R. Rastogi (Eds.), Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, ACM, 2016, pp. 785–794. URL: https://doi.org/10.1145/2939672.2939785.
[3] A. P. A. Barnett, Z. Doubleday, The growth of acronyms in the scientific literature, eLife 9 (2020).
[4] A. S. Schwartz, M. A. Hearst, A simple algorithm for identifying abbreviation definitions in biomedical text, in: R. B. Altman, A. K. Dunker, L. Hunter, T. E. Klein (Eds.), Proceedings of the 8th Pacific Symposium on Biocomputing, PSB 2003, Lihue, Hawaii, USA, January 3-7, 2003, 2003, pp. 451–462. URL: http://psb.stanford.edu/psb-online/proceedings/psb03/schwartz.pdf.
[5] N. Okazaki, S. Ananiadou, Building an abbreviation dictionary using a term recognition approach, Bioinform. 22 (2006) 3089–3095. URL: https://doi.org/10.1093/bioinformatics/btl534.
[6] C. Kuo, M. H. T. Ling, K. Lin, C. Hsu, BIOADI: a machine learning approach to identifying abbreviations and definitions in biological literature, BMC Bioinform. 10 (2009) 7. URL: https://doi.org/10.1186/1471-2105-10-S15-S7.
[7] J. Liu, C. Liu, Y. Huang, Multi-granularity sequence labeling model for acronym expansion identification, Inf. Sci. 378 (2017) 462–474. URL: https://doi.org/10.1016/j.ins.2016.06.045.
[8] C. G. Harris, P. Srinivasan, My word! machine versus human computation methods for identifying and resolving acronyms, Computación y Sistemas 23 (2019). URL: https://www.cys.cic.ipn.mx/ojs/index.php/CyS/article/view/3249.
[9] A. P. B. Veyseh, F. Dernoncourt, Q. H. Tran, T. H. Nguyen, What does this acronym mean? introducing a new dataset for acronym identification and disambiguation, in: D. Scott, N. Bel, C. Zong (Eds.), Proceedings of the 28th International Conference on Computational Linguistics, COLING 2020, Barcelona, Spain (Online), December 8-13, 2020, International Committee on Computational Linguistics, 2020, pp. 3285–3301. URL: https://doi.org/10.18653/v1/2020.coling-main.292.
[10] A. P. B. Veyseh, N. Meister, S. Yoon, R. Jain, F. Dernoncourt, T. H. Nguyen, Multilingual Acronym Extraction and Disambiguation Shared Tasks at SDU 2022, in: Proceedings of SDU@AAAI-22, 2022.
[11] A. P. B. Veyseh, N. Meister, S. Yoon, R. Jain, F. Dernoncourt, T. H. Nguyen, MACRONYM: A Large-Scale Dataset for Multilingual and Multi-Domain Acronym Extraction, arXiv preprint, 2022.
[12] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, in: D. Jurafsky, J. Chai, N. Schluter, J. R. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, Association for Computational Linguistics, 2020, pp. 8440–8451. URL: https://doi.org/10.18653/v1/2020.acl-main.747.
[13] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, C. Dyer, Neural architectures for named entity recognition, in: K. Knight, A. Nenkova, O. Rambow (Eds.), NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, USA, June 12-17, 2016, The Association for Computational Linguistics, 2016, pp. 260–270. URL: https://doi.org/10.18653/v1/n16-1030.
[14] J. D. Lafferty, A. McCallum, F. C. N. Pereira, Conditional random fields: Probabilistic models for segmenting and labeling sequence data, in: C. E. Brodley, A. P. Danyluk (Eds.), Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), Williams College, Williamstown, MA, USA, June 28 - July 1, 2001, Morgan Kaufmann, 2001, pp. 282–289.
[15] M. Honnibal, I. Montani, et al., explosion/spaCy: v2.1.7: Improved evaluation, better language factories and bug fixes, 2019. URL: https://doi.org/10.5281/zenodo.3358113. doi:10.5281/zenodo.3358113.