=Paper=
{{Paper
|id=Vol-3164/paper7
|storemode=property
|title=Domain Adaptive Pretraining for Multilingual Acronym Extraction
|pdfUrl=https://ceur-ws.org/Vol-3164/paper7.pdf
|volume=Vol-3164
|authors=Usama Yaseen,Stefan Langer
|dblpUrl=https://dblp.org/rec/conf/aaai/YaseenL22
}}
==Domain Adaptive Pretraining for Multilingual Acronym Extraction==
Usama Yaseen (1,2) (usama.yaseen@siemens.com), Stefan Langer (1,2) (langer.stefan@siemens.com)
(1) Technology, Siemens AG, Munich, Germany
(2) CIS, University of Munich (LMU), Munich, Germany

Abstract

This paper presents our findings from participating in the multilingual acronym extraction shared task SDU@AAAI-22. The task consists of acronym extraction from documents in six languages within the scientific and legal domains. To address multilingual acronym extraction we employed a BiLSTM-CRF model with multilingual XLM-RoBERTa embeddings. We pretrained the XLM-RoBERTa model on the shared task corpus to further adapt the XLM-RoBERTa embeddings to the shared task domains. Our system (team: SMR-NLP) achieved competitive performance for acronym extraction across all languages.

Keywords: pretraining, domain adaptation, acronym extraction, XLM-RoBERTa

1. Introduction

The number of scientific papers published every year is growing at an increasing rate [1]. The authors of scientific publications employ abbreviations to make technical terms less verbose. These abbreviations take the form of acronyms or initialisms. We refer to the abbreviated term as the "acronym" and to the full term as the "long form". On the one hand, acronyms allow authors to avoid frequently used long phrases, which makes writing more convenient for researchers; on the other hand, they pose a challenge to non-expert readers. This challenge is heightened by the fact that acronyms are not always written in a standard way, e.g. XGBoost is an acronym of eXtreme Gradient Boosting [2]. Following the increase in scientific publications, the number of acronyms is growing enormously as well [3]. Thus, automatic identification of acronyms and their corresponding long forms is crucial for scientific document understanding tasks.

The existing work on acronym extraction consists of carefully crafted rule-based methods [4, 5] and feature-based methods [6, 7]. These methods typically achieve high precision, as they are designed to find the long form, but they suffer from low recall [8]. Recently, deep-learning-based sequence models such as LSTM-CRF [9] have been explored for acronym extraction; however, these methods require large training data to achieve optimal performance. A major limitation of existing work on acronym extraction is that most prior work focuses only on the English language.

2. Task Description and Contributions

We participate in the Acronym Extraction task [10] organized by the Scientific Document Understanding workshop 2022 (SDU@AAAI-22). The task consists of identifying acronyms (short forms) and their meanings (long forms) in documents in six languages: Danish (da), English (en), French (fr), Spanish (es), Persian (fa) and Vietnamese (vi). The task corpus [11] consists of documents from the scientific (en, fa, vi) and legal (da, en, fr, es) domains.

Our contributions are as follows:

1. We model multilingual acronym extraction as a sequence labelling task and employ contextualized multilingual XLM-RoBERTa embeddings [12]. Our system consists of a single model for multilingual acronym extraction and is hence practical for real-world usage.

2. We investigate domain adaptive pretraining of XLM-RoBERTa on the task corpus, which results in improved performance across all languages.

3. Methodology

In the following sections we discuss our proposed model for acronym extraction.

3.1. Multilingual Acronym Extraction

Our sequence labelling model follows the well-known architecture of [13], with a bidirectional long short-term memory (BiLSTM) network and a conditional random field (CRF) output layer [14]. In order to address the multilingual aspect of the task, we employed contextualized multilingual XLM-RoBERTa embeddings [12] in all experiments.
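The paper does not include an implementation, but the architecture described above maps onto standard libraries. The following is a minimal sketch, assuming PyTorch, the Hugging Face transformers library, the third-party pytorch-crf package, and an illustrative BIO-style tag set for short and long forms; the authors' exact label scheme, subword-to-word alignment and training loop are not specified in the paper.

```python
# Minimal sketch (not the authors' code): a BiLSTM-CRF tagger on top of
# XLM-RoBERTa subword embeddings, with an illustrative BIO tag set.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer
from torchcrf import CRF  # pip install pytorch-crf (assumed dependency)

TAGS = ["O", "B-short", "I-short", "B-long", "I-long"]  # illustrative tag set


class BiLstmCrfTagger(nn.Module):
    def __init__(self, encoder_name="xlm-roberta-base", hidden_size=256, num_tags=len(TAGS)):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.bilstm = nn.LSTM(
            input_size=self.encoder.config.hidden_size,
            hidden_size=hidden_size,
            batch_first=True,
            bidirectional=True,
        )
        self.emissions = nn.Linear(2 * hidden_size, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, attention_mask, tags=None):
        # Contextualized multilingual subword embeddings from XLM-RoBERTa.
        hidden = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        lstm_out, _ = self.bilstm(hidden)
        emissions = self.emissions(lstm_out)
        mask = attention_mask.bool()
        if tags is not None:
            # Negative log-likelihood under the CRF as the training loss.
            return -self.crf(emissions, tags, mask=mask, reduction="mean")
        # Viterbi decoding at inference time: one tag index per subword token.
        return self.crf.decode(emissions, mask=mask)


# Usage sketch (untrained weights, so the output is only illustrative).
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
batch = tokenizer(["XGBoost stands for eXtreme Gradient Boosting ."], return_tensors="pt")
model = BiLstmCrfTagger()
with torch.no_grad():
    print(model(batch["input_ids"], batch["attention_mask"]))
```

The hidden size of 256 mirrors Table 3; everything else in the sketch is an assumption made for illustration.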
3.2. Domain Adaptive Pretraining

The original XLM-RoBERTa embeddings [12] are trained on filtered CommonCrawl data (general domain), whereas the data of the shared task comprises documents from the scientific and legal domains. In order to better adapt the contextualized representations to the target scientific and legal domains, we further pretrained the original XLM-RoBERTa model on the corpus data. Our experiments demonstrate improved performance on the task of acronym extraction across all languages due to this domain adaptive pretraining.
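Domain adaptive pretraining of this kind is commonly realized as continued masked-language-model training on the in-domain text. The sketch below illustrates one way to do this with the Hugging Face Trainer, under stated assumptions: a plain-text file task_corpus.txt built from the task sentences, a batch size and maximum sequence length that the paper does not report, and the three pretraining epochs from Table 3.

```python
# Minimal sketch (not the authors' code): continued masked-language-model
# pretraining of xlm-roberta-base on the shared-task sentences, assuming a
# plain-text file "task_corpus.txt" (one sentence per line) built from the
# train + development sets.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Tokenize the raw task corpus for MLM pretraining.
raw = load_dataset("text", data_files={"train": "task_corpus.txt"})
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=256),
    batched=True,
    remove_columns=["text"],
)

# Dynamic masking with the standard 15% masking probability.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="xlmr-acronym-adapted",
    num_train_epochs=3,                  # 3 pretraining epochs, as in Table 3
    per_device_train_batch_size=16,      # assumed; not reported in the paper
    save_strategy="epoch",
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
).train()

# The adapted encoder is later loaded as the embedding layer of the tagger.
model.save_pretrained("xlmr-acronym-adapted")
tokenizer.save_pretrained("xlmr-acronym-adapted")
```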
This results in a ments from scientific and legal domains. In order to significant decrease in F1-score for all the languages (on better adapt the contextualized representation to the tar- average 0.12 points in F1-score, see row r4). This demon- get scientific and legal domain, we further pretrained the strates that BiLSTM-CRF with multilingual XLM-Roberta original XLM-RoBERTa model on the corpus data. Our embeddings performs best when trained with several lan- experiments demonstrate improved performance on the guages together enabling effective cross-lingual transfer. task of acronym extraction due to the domain adaptive The F1-score of our submission on the test set are pretraining across all the languages. reported in row r5. Our test submission achieves the F1-score similar to the development set for all the lan- guages demonstrating effective generalization on the test 4. Experiments and Results set; Vietnamese is an exception where F1-score on the test set is significantly worse than the F1-score on the 4.1. Dataset development set (see rows r5 vs r3). Table 2 reports sentence counts in the train and develop- 1 ment set for all the languages. Persian and Vietnamese https://github.com/amirveyseh/AAAI-22-SDU-shared-task-1- AE/blob/main/scorer.py have substantially low sentences compared to the rest of 2 https://huggingface.co/xlm-roberta-base 5. Conclusion proach, Bioinform. 22 (2006) 3089–3095. URL: https://doi.org/10.1093/bioinformatics/btl534. In this paper, we described our system with which we [6] C. Kuo, M. H. T. Ling, K. Lin, C. Hsu, BIOADI: a participate in the multilingual acronym extraction shared machine learning approach to identifying abbrevia- task organized by the Scientific Document Understand- tions and definitions in biological literature, BMC ing workshop 2022 (SDU@AAAI-22). We formulate Bioinform. 10 (2009) 7. URL: https://doi.org/10.1186/ multlilignual acronym extraction in 6 languages and 1471-2105-10-S15-S7. 2 domains as a sequence labelling task and employed [7] J. Liu, C. Liu, Y. Huang, Multi-granularity se- BiLSTM-CRF model with multilingual XLM-RoBERTa quence labeling model for acronym expansion iden- embeddings. We pretrained XLM-RoBERTa model on the tification, Inf. Sci. 378 (2017) 462–474. URL: https: target scientific and legal domain to better adapt multi- //doi.org/10.1016/j.ins.2016.06.045. lingual XLM-RoBERTa embeddings for the target task. [8] C. G. Harris, P. Srinivasan, My word! machine Our system demonstrates competitive performance on versus human computation methods for identify- the multilingual acronym extraction task for all the lan- ing and resolving acronyms, Computación y Sis- guages. In future, we would like to improve error analysis temas 23 (2019). URL: https://www.cys.cic.ipn.mx/ to further enhance our multilingual acronym extraction ojs/index.php/CyS/article/view/3249. models. [9] A. P. B. Veyseh, F. Dernoncourt, Q. H. Tran, T. H. Nguyen, What does this acronym mean? in- troducing a new dataset for acronym identifica- Acknowledgments tion and disambiguation, in: D. Scott, N. Bel, This research was supported by the Federal Ministry C. Zong (Eds.), Proceedings of the 28th Interna- for Economic Affairs and Energy ( Bundesministerium tional Conference on Computational Linguistics, für Wirtschaft und Energie: https://bmwi.de), grant COLING 2020, Barcelona, Spain (Online), Decem- 01MD19003E (PLASS: https://plass.io) at Siemens AG ber 8-13, 2020, International Committee on Com- (Technology), Munich Germany. 
4.2. Results

Table 1 reports the precision, recall and F1-score on the development and test sets for all languages.

            epochs  all             da              en-sci          en-leg          fr              fa              es              vi
                    P/R/F1          P/R/F1          P/R/F1          P/R/F1          P/R/F1          P/R/F1          P/R/F1          P/R/F1
dev   r1    0       .841/.868/.854  .825/.833/.829  .727/.750/.738  .758/.784/.771  .738/.742/.740  .619/.539/.576  .820/.871/.845  .375/.547/.445
      r2    1       .855/.876/.866  .826/.833/.830  .747/.757/.752  .786/.793/.789  .756/.750/.753  .644/.560/.599  .832/.872/.852  .385/.615/.474
      r3    3       .857/.878/.868  .827/.833/.830  .750/.759/.755  .789/.795/.792  .788/.751/.754  .665/.557/.606  .832/.873/.852  .408/.689/.512
      r4    3       -               .77/.773/.775   .617/.703/.650  .677/.677/.677  .715/.733/.724  .864/.294/.439  .823/.850/.836  .623/.074/.132
test  r5    3       -               .825/.833/.829  .727/.750/.738  .758/.784/.771  .738/.742/.740  .619/.539/.576  .820/.871/.845  .375/.547/.445

Table 1: Precision (P), recall (R) and F1-score on the development set (r1-r4) and the test set (r5). Here, epochs: number of pretraining epochs of XLM-RoBERTa on the task corpus; en-sci: English scientific domain; en-leg: English legal domain; all: all languages combined.

As a baseline experiment, we combined the training data of all languages and trained a BiLSTM-CRF model using the pretrained multilingual XLM-RoBERTa embeddings (https://huggingface.co/xlm-roberta-base); see row r1. This achieves an overall F1-score of 0.854.

We pretrained the XLM-RoBERTa model for 1 epoch on the task corpus (train and development set), which improves the overall F1-score from 0.854 to 0.866 (row r2). Increasing the number of pretraining epochs to 3 yields a further improvement to an overall F1-score of 0.868 (row r3).

We also experimented with training individual models for each language (including separate models for English scientific and English legal). This results in a significant decrease in F1-score for all languages (on average 0.12 points, see row r4). This demonstrates that BiLSTM-CRF with multilingual XLM-RoBERTa embeddings performs best when trained on several languages together, enabling effective cross-lingual transfer.

The F1-scores of our submission on the test set are reported in row r5. Our test submission achieves F1-scores similar to those on the development set for all languages, demonstrating effective generalization to the test set; Vietnamese is an exception, where the F1-score on the test set is significantly worse than on the development set (see rows r5 vs r3).
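For the combined baseline (row r1), all per-language training sets are merged into a single training pool for one multilingual model. The sketch below shows one plausible way to do this; the file paths and the token/tag file format are illustrative assumptions, not the shared task's actual distribution format.

```python
# Minimal sketch (assumption, not the authors' data loader): pooling the
# per-language/domain training files into one multilingual training set.
import random

TRAIN_FILES = {
    "da": "data/da/train.tsv",
    "en-sci": "data/en_scientific/train.tsv",
    "en-leg": "data/en_legal/train.tsv",
    "fr": "data/fr/train.tsv",
    "es": "data/es/train.tsv",
    "fa": "data/fa/train.tsv",
    "vi": "data/vi/train.tsv",
}


def read_sentences(path):
    """Yield blank-line separated sentences of `token<TAB>tag` lines."""
    with open(path, encoding="utf-8") as handle:
        sentence = []
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                if sentence:
                    yield sentence
                    sentence = []
            else:
                token, tag = line.split("\t")
                sentence.append((token, tag))
        if sentence:
            yield sentence


# One shuffled pool of sentences from all languages trains the single model.
combined = [sent for path in TRAIN_FILES.values() for sent in read_sentences(path)]
random.shuffle(combined)
print(f"{len(combined)} multilingual training sentences")
```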
5. Conclusion

In this paper, we described the system with which we participated in the multilingual acronym extraction shared task organized by the Scientific Document Understanding workshop 2022 (SDU@AAAI-22). We formulated multilingual acronym extraction in six languages and two domains as a sequence labelling task and employed a BiLSTM-CRF model with multilingual XLM-RoBERTa embeddings. We pretrained the XLM-RoBERTa model on the target scientific and legal domains to better adapt the multilingual XLM-RoBERTa embeddings to the target task. Our system demonstrates competitive performance on the multilingual acronym extraction task for all languages. In future work, we would like to improve our error analysis to further enhance our multilingual acronym extraction models.

Acknowledgments

This research was supported by the Federal Ministry for Economic Affairs and Energy (Bundesministerium für Wirtschaft und Energie: https://bmwi.de), grant 01MD19003E (PLASS: https://plass.io), at Siemens AG (Technology), Munich, Germany.

References

[1] L. Bornmann, R. Mutz, Growth rates of modern science: A bibliometric analysis based on the number of publications and cited references, J. Assoc. Inf. Sci. Technol. 66 (2015) 2215–2222. URL: https://doi.org/10.1002/asi.23329.
[2] T. Chen, C. Guestrin, XGBoost: A scalable tree boosting system, in: B. Krishnapuram, M. Shah, A. J. Smola, C. C. Aggarwal, D. Shen, R. Rastogi (Eds.), Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, ACM, 2016, pp. 785–794. URL: https://doi.org/10.1145/2939672.2939785.
[3] A. P. A. Barnett, Z. Doubleday, The growth of acronyms in the scientific literature, eLife 9 (2020).
[4] A. S. Schwartz, M. A. Hearst, A simple algorithm for identifying abbreviation definitions in biomedical text, in: R. B. Altman, A. K. Dunker, L. Hunter, T. E. Klein (Eds.), Proceedings of the 8th Pacific Symposium on Biocomputing, PSB 2003, Lihue, Hawaii, USA, January 3-7, 2003, 2003, pp. 451–462. URL: http://psb.stanford.edu/psb-online/proceedings/psb03/schwartz.pdf.
[5] N. Okazaki, S. Ananiadou, Building an abbreviation dictionary using a term recognition approach, Bioinform. 22 (2006) 3089–3095. URL: https://doi.org/10.1093/bioinformatics/btl534.
[6] C. Kuo, M. H. T. Ling, K. Lin, C. Hsu, BIOADI: a machine learning approach to identifying abbreviations and definitions in biological literature, BMC Bioinform. 10 (2009) 7. URL: https://doi.org/10.1186/1471-2105-10-S15-S7.
[7] J. Liu, C. Liu, Y. Huang, Multi-granularity sequence labeling model for acronym expansion identification, Inf. Sci. 378 (2017) 462–474. URL: https://doi.org/10.1016/j.ins.2016.06.045.
[8] C. G. Harris, P. Srinivasan, My word! Machine versus human computation methods for identifying and resolving acronyms, Computación y Sistemas 23 (2019). URL: https://www.cys.cic.ipn.mx/ojs/index.php/CyS/article/view/3249.
[9] A. P. B. Veyseh, F. Dernoncourt, Q. H. Tran, T. H. Nguyen, What does this acronym mean? Introducing a new dataset for acronym identification and disambiguation, in: D. Scott, N. Bel, C. Zong (Eds.), Proceedings of the 28th International Conference on Computational Linguistics, COLING 2020, Barcelona, Spain (Online), December 8-13, 2020, International Committee on Computational Linguistics, 2020, pp. 3285–3301. URL: https://doi.org/10.18653/v1/2020.coling-main.292.
[10] A. P. B. Veyseh, N. Meister, S. Yoon, R. Jain, F. Dernoncourt, T. H. Nguyen, Multilingual Acronym Extraction and Disambiguation Shared Tasks at SDU 2022, in: Proceedings of SDU@AAAI-22, 2022.
[11] A. P. B. Veyseh, N. Meister, S. Yoon, R. Jain, F. Dernoncourt, T. H. Nguyen, MACRONYM: A Large-Scale Dataset for Multilingual and Multi-Domain Acronym Extraction, arXiv, 2022.
[12] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, in: D. Jurafsky, J. Chai, N. Schluter, J. R. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, Association for Computational Linguistics, 2020, pp. 8440–8451. URL: https://doi.org/10.18653/v1/2020.acl-main.747.
[13] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, C. Dyer, Neural architectures for named entity recognition, in: K. Knight, A. Nenkova, O. Rambow (Eds.), NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, USA, June 12-17, 2016, The Association for Computational Linguistics, 2016, pp. 260–270. URL: https://doi.org/10.18653/v1/n16-1030.
[14] J. D. Lafferty, A. McCallum, F. C. N. Pereira, Conditional random fields: Probabilistic models for segmenting and labeling sequence data, in: C. E. Brodley, A. P. Danyluk (Eds.), Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), Williams College, Williamstown, MA, USA, June 28 - July 1, 2001, Morgan Kaufmann, 2001, pp. 282–289.
[15] M. Honnibal, I. Montani, H. Peters, S. V. Landeghem, M. Samsonov, J. Geovedi, J. Regan, G. Orosz, S. L. Kristiansen, P. O. McCann, D. Altinok, et al., explosion/spaCy: v2.1.7: Improved evaluation, better language factories and bug fixes, 2019. URL: https://doi.org/10.5281/zenodo.3358113. doi:10.5281/zenodo.3358113.