<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Computational Humanities Research Conference</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>DaCy: A Unified Framework for Danish NLP</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kenneth Enevoldsen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lasse Hansen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kristofer L.</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nielbo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Center for Humanities Computing, Aarhus University</institution>
          ,
          <country country="DK">Denmark</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Clinical Medicine, Aarhus University</institution>
          ,
          <country country="DK">Denmark</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Interacting Minds Centre, Aarhus University</institution>
          ,
          <country country="DK">Denmark</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>1</volume>
      <fpage>7</fpage>
      <lpage>19</lpage>
      <abstract>
        <p>Danish natural language processing (NLP) has in recent years obtained considerable improvements with the addition of multiple new datasets and models. However, at present, there is no coherent framework for applying state-of-the-art models for Danish. We present DaCy: a unified framework for Danish NLP built on and integrated with SpaCy. DaCy uses efficient multitask models which obtain state-of-the-art performance on named entity recognition, part-of-speech tagging, and dependency parsing. DaCy contains tools for easy integration of existing models such as for polarity, emotion, or subjectivity detection. In addition, we conduct a series of tests for biases and robustness of Danish NLP pipelines through data augmentation. DaCy large compares favorably and is especially robust to long input lengths and spelling variations and errors. All models except DaCy large display significant biases related to ethnicity while only Polyglot shows a significant gender bias. We argue that for languages with limited benchmark sets, data augmentation can be particularly useful for obtaining more realistic and fine-grained performance estimates. We provide a series of augmenters as a first step towards a more thorough evaluation of language models for low and medium resource languages and encourage further development.</p>
      </abstract>
      <kwd-group>
        <kwd>Natural Language Processing</kwd>
        <kwd>Low-resource NLP</kwd>
        <kwd>Data Augmentation</kwd>
        <kwd>Danish NLP</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-0">
        <title>1.1. DaCy</title>
        <p>
With this motivation we present DaCy: an efficient end-to-end framework for Danish NLP
with state-of-the-art performance on POS, NER and dependency parsing. DaCy fills the
gap in Danish NLP by providing a consistent interface that is easily extendable and able to
integrate other models. DaCy is built on SpaCy v.3 which comes with a range of advantages:
the framework is optimized, user-friendly, and well-documented. DaCy includes three
finetuned language models: DaCy small, based on a Danish Electra (14M parameters) [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]; DaCy
medium, based on the Danish BERT (110M parameters) [
        <xref ref-type="bibr" rid="ref21">22</xref>
        ]; and DaCy large, based on the
multilingual XLM-RoBERTa (550M parameters) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. All models have been fine-tuned to do
POS tagging, NER, and dependency parsing in a single forward pass, which increases the
efficiency of the model and allows for larger models at the same computational cost.
      </p>
      <p>
        Besides models fine-tuned for DaCy, the package includes convenient wrappers to add other
models to the pipeline. For instance, Danish models for detecting polarity, emotion, and
subjectivity classification can be added in a single line of code, and any HuggingFace Transformers
[
        <xref ref-type="bibr" rid="ref33">34</xref>
        ] model trained for sentence classification can be conveniently wrapped and included in the
pipeline using utility functions. With this functionality, DaCy aims at being a unified
framework for Danish NLP. All functionality is well-documented and covered by tutorials.<fn id="fn1"><label>1</label><p>See: https://centre-for-humanities-computing.github.io/DaCy/</p></fn>
      </p>
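      <p>
        As a hedged sketch of the underlying mechanism (not DaCy's own wrapper code), a HuggingFace sequence-classification model can be exposed as a SpaCy pipeline component roughly as follows; the factory name and the model identifier are hypothetical placeholders.
      </p>
      <preformat><![CDATA[
# A sketch of wrapping a HuggingFace sequence-classification model as a
# SpaCy component; not DaCy's exact wrapper. The factory name and model
# identifier are hypothetical placeholders.
from spacy.language import Language
from transformers import pipeline

@Language.factory("hf_sentence_clf", default_config={"model_name": ""})
def create_hf_sentence_clf(nlp, name, model_name: str):
    clf = pipeline("text-classification", model=model_name)

    def component(doc):
        result = clf(doc.text)[0]  # top label for the whole document
        doc.cats[result["label"]] = result["score"]
        return doc

    return component

# Usage (model name is a placeholder):
# nlp.add_pipe("hf_sentence_clf", config={"model_name": "danish-sentiment-model"})
]]></preformat>
      <p>
        The component writes its scores to doc.cats, which is where SpaCy stores document-level categories.
      </p>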
      <sec id="sec-1-1">
        <title>1.2. Robustness &amp; Evaluation</title>
        <p>
          Fine-tuned language models are commonly evaluated by testing performance on a gold-standard
benchmark dataset. The most commonly used benchmark for Danish is the DaNE dataset
[
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], which consists of the Danish Dependency Treebank [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ], additionally tagged for NER.
For languages with few benchmark datasets, such as Danish, the performance stability and
generalizability cannot be reliably estimated [
          <xref ref-type="bibr" rid="ref26">27</xref>
          ]. For instance, the text included in DaNE
was collected in the years 1983–1992 from both written and spoken domains [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. Given the
change of language over time and the addition of new textual domains such as social media,
this dataset is unlikely to be representative of the contemporary domains of application. For
instance, models might not be sufficiently exposed to e.g. abbreviated names, spelling errors,
or non-standard casing to correctly and robustly classify them. In this sense, the performance
obtained on DaNE is unlikely to hold for real-world use cases.
        </p>
        <p>
          To provide an additional layer of validation, we propose evaluating models on augmented
gold-standard data. Data augmentation entails generating new data by slightly modifying
existing data points [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. Data augmentation techniques such as rotation and cropping are
widely used in computer vision to reduce overfitting [
          <xref ref-type="bibr" rid="ref28">29</xref>
          ], and are becoming increasingly
common in NLP [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. The complex syntactic and semantic structure of text complicates the task
of finding useful augmentations, but simple manipulations such as synonym replacement and
random character swaps and deletions have been found to be particularly useful for supervised
learning in low-resource settings [
          <xref ref-type="bibr" rid="ref32">33</xref>
          ].
        </p>
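        <p>
          As a concrete illustration of such manipulations, a minimal, library-agnostic sketch of random character swaps and deletions might look as follows; the probability parameter is illustrative.
        </p>
        <preformat><![CDATA[
import random

def random_char_noise(text, p=0.05, rng=random):
    """Swap a character with its right neighbour or delete it, each with
    probability p; a toy version of the character-level noise above."""
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        r = rng.random()
        if r < p:
            # swap with the following character
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2
        elif r < 2 * p:
            # delete the character
            del chars[i]
        else:
            i += 1
    return "".join(chars)
]]></preformat>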
        <p>
          Although data augmentation is most commonly used for increasing the amount of training
data, it can just as well be used for evaluation purposes [
          <xref ref-type="bibr" rid="ref26">27</xref>
          ]. By augmenting a gold-standard
dataset, we can evaluate model performance when exposed to data that more closely mimics
real-life settings by adding spelling errors, more diverse names, or other manipulations. In
section 2.2, we introduce a series of augmentations and evaluate the performance of Danish
NLP pipelines on them.
        </p>
        <p>The contributions of this paper are three-fold. 1) We introduce new state-of-the-art models for
Danish dependency parsing, NER, and POS tagging. 2) We introduce the DaCy Python library as a
unified framework for state-of-the-art NLP in Danish. 3) We evaluate Danish NLP pipelines
using data augmentation and provide directions for future model development.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Methods</title>
      <sec id="sec-2-1">
        <title>2.1. Training</title>
        <p>
          To train the candidate models for DaCy, all publicly available Transformer-based language
models for Danish were fine-tuned on the DaNE corpus [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] using SpaCy 3.0.3 [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. The
models include two Danish ELECTRAs [
          <xref ref-type="bibr" rid="ref14 ref31 ref8">8, 14, 32</xref>
          ], the Danish ConvBERT [
          <xref ref-type="bibr" rid="ref17 ref31">17, 32</xref>
          ], the Danish
BERT [
          <xref ref-type="bibr" rid="ref11 ref21">11, 22</xref>
          ], and the multilingual XLM-Roberta Large [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. All models were trained with
an input length of 10 sentences until convergence using similar hyperparameters on a Quadro
RTX 8000 GPU. Adam was used as optimizer with hyperparameters β1 = 0.9 and β2 = 0.999.
Further, L2 normalization with α = 0.01 and gradient clipping with c = 1.0 was employed. For
increased efficiency, all models were trained with a multi-task objective [
          <xref ref-type="bibr" rid="ref27 ref6">6, 28</xref>
          ] on NER, POS,
and dependency parsing. This allows the training of larger models at the same computational
cost, but it is unlikely that multi-task training at this scale improves performance [
          <xref ref-type="bibr" rid="ref1 ref24">25, 1</xref>
          ].<fn id="fn2"><label>2</label><p>For a full list of models and training configurations, see the config files on GitHub: https://github.com/centre-for-humanities-computing/DaCy/tree/main/training</p></fn>
        </p>
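        <p>
          For reference, these optimizer settings correspond roughly to the following call in Thinc (SpaCy's machine-learning library); the learning rate shown is a placeholder, as the actual schedules are defined in the config files linked above.
        </p>
        <preformat><![CDATA[
# A sketch of the optimizer described above, using Thinc; the learning
# rate is a placeholder, the other values are the stated hyperparameters.
from thinc.api import Adam

optimizer = Adam(
    learn_rate=5e-5,  # placeholder; actual schedule is in the config files
    beta1=0.9,
    beta2=0.999,
    L2=0.01,          # L2 regularization (alpha in the text)
    grad_clip=1.0,    # gradient clipping (c in the text)
)
]]></preformat>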
        <p>
          Table 1 shows the performance of all fine-tuned models evaluated on DaNE’s test set. The
three best performing models in each size category, XLM-Roberta, DaBERT, and Ælæctra
Cased, are included in DaCy as the large, medium, and small models, respectively. In line with previous
findings [
          <xref ref-type="bibr" rid="ref23 ref24 ref5">25, 5, 24</xref>
          ], larger models tend to perform better, with XLM-Roberta obtaining the best
performance across the board.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Evaluation</title>
        <p>
          To evaluate the robustness of DaCy and other Danish NLP pipelines, we assessed their
performance on multiple augmented versions of the DaNE test set. All Danish models are trained on
the DaNE corpus which consists of a mix of textual data of both spoken and written origin from
the years 1983–1992 [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], with the exception of Polyglot which is trained on entities extracted
from Wikipedia [
          <xref ref-type="bibr" rid="ref25">26</xref>
          ]. As a consequence, the training data is rarely representative of the domain
in which the models will be applied. For example, social media, contemporary news media,
and historical texts have domain specific characteristics such as non-standard casing, a higher
degree of typos, use of hashtags, and historic spelling such as upper-cased nouns [
          <xref ref-type="bibr" rid="ref12 ref3 ref30">31, 3, 12</xref>
          ].
While it is infeasible to test the models on all possible domains, some of these characteristics
can be modelled using data augmentation, which can provide practitioners with an estimate of
the potential shortcomings of the model. Further, data augmentation can be used to estimate
biases against protected groups such as gender and ethnicity.
        </p>
        <p>The augmenters presented here are not meant to be exhaustive, but rather a first step towards
more thorough validation of new language models. We argue that the bar for inclusion of a
new model should be set higher than a slight increase in benchmark performance. Language
models are used in a variety of contexts which current benchmark tasks, especially for low
resource languages, do not capture. Our aim with these experiments is to provide an extra layer
of insight into the performance of language models that more closely mimics naturalistic use
cases, and to encourage the development of further augmenters. Augmentation not only provides
insights into when model performance breaks down and whether certain models are more suited
for specific use-cases than others, but can also be used to identify specific areas to improve
upon.</p>
        <p>The augmenters developed for this paper are designed in accordance with the SpaCy
framework, and are thus not necessarily tied to DaCy or Danish in particular and can be used both
during model validation and training. Comprehensive tutorials are provided on the DaCy
Github repository.</p>
        <p>
          We tested small, medium, and large SpaCy [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] and DaCy models, Stanza [
          <xref ref-type="bibr" rid="ref22">23</xref>
          ], Polyglot
[
          <xref ref-type="bibr" rid="ref25">26</xref>
          ], NERDA [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ], Flair<fn id="fn3"><label>3</label><p>As supplied by DaNLP.</p></fn> [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], and DaNLP’s BERT [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] on the DaNE test set augmented with
the following augmenters (a simplified sketch of the first two follows the list):
1. Keystroke augmentation: substitute 2%, 5%, or 15% of characters with a neighbouring
character on a Danish QWERTY keyboard.
2. ÆØÅ augmentation: substitute æ/Æ with ae/Ae, ø/Ø with oe/Oe, and å/Å with aa/Aa
to simulate some historic text variations in Danish.
3. Lower-case augmentation: convert all text to lower-case.
4. Spacing augmentation: randomly remove 5% of all whitespace.
5. Name augmentations:
a) Substitute all names (PER entities) with randomly sampled Danish names,
respecting first and last names.
b) Substitute all names with randomly sampled names of Muslim origin used in
Denmark [
          <xref ref-type="bibr" rid="ref20a">21</xref>
          ], respecting first and last names.
c) Substitute all names with sampled Danish male names, respecting first and last
names.
d) Substitute all names with sampled Danish female names, respecting first and last
names.
e) Abbreviate all first names to the first character including a full stop.
        </p>
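        <p>
          As referenced above, a simplified string-level sketch of the keystroke and ÆØÅ augmenters is shown below. The DaCy implementations operate on SpaCy training Examples so that gold annotations stay aligned, and the neighbour map here is a small illustrative subset rather than a full Danish QWERTY layout.
        </p>
        <preformat><![CDATA[
import random

# Illustrative subset of a Danish QWERTY neighbour map; the real
# augmenter covers the full keyboard layout.
NEIGHBOURS = {"a": "qsz", "e": "wrd", "n": "bhm", "r": "etf", "s": "adw"}

AEOEAA = {"æ": "ae", "ø": "oe", "å": "aa", "Æ": "Ae", "Ø": "Oe", "Å": "Aa"}

def keystroke_augment(text, p=0.05, rng=random):
    """Replace each character with a keyboard neighbour with probability p."""
    return "".join(
        rng.choice(NEIGHBOURS[c.lower()])
        if c.lower() in NEIGHBOURS and rng.random() < p
        else c
        for c in text
    )

def aeoeaa_augment(text):
    """Rewrite æ/ø/å as their historic digraph spellings."""
    return "".join(AEOEAA.get(c, c) for c in text)
]]></preformat>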
        <p>The stochastic augmentations, i.e. name and keystroke augmentations, were repeated 20
times.</p>
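        <p>
          A sketch of this protocol, assuming a hypothetical helper score_pipeline that returns a single metric for a list of augmented texts:
        </p>
        <preformat><![CDATA[
import random
import statistics

def repeated_evaluation(texts, augment, score_pipeline, n_repeats=20):
    """Apply a stochastic augmenter n_repeats times with different seeds
    and summarize the resulting scores; score_pipeline is hypothetical."""
    scores = []
    for seed in range(n_repeats):
        rng = random.Random(seed)
        augmented = [augment(t, rng=rng) for t in texts]
        scores.append(score_pipeline(augmented))
    return statistics.mean(scores), statistics.stdev(scores)
]]></preformat>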
        <p>Previous evaluations of Danish NLP tools have used the gold-standard tokens instead of
using a tokenization module. While this allows for easier comparison of the specific modules, it
inflates the performance metrics of the models and is unlikely to reflect the metric of interest,
namely, the performance during application.<fn id="fn4"><label>4</label><p>In our experiments, several of the Danish models performed worse using their own tokenizer.</p></fn> All models were tested using both their own
tokenizer (if they have one) and the SpaCy tokenizer for Danish. The performance reported in
section 3 uses the best performing tokenization module for each pipeline. For all models except
Stanza and Polyglot this was found to be the SpaCy tokenizer.</p>
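        <p>
          For pipelines exposed through SpaCy, scoring with predicted rather than gold tokenization can be sketched with SpaCy's Example and Scorer, which handle the alignment between the two tokenizations; nlp and gold_docs are assumed given.
        </p>
        <preformat><![CDATA[
from spacy.scorer import Scorer
from spacy.training import Example

def score_with_own_tokenizer(nlp, gold_docs):
    """Re-tokenize from raw text so the pipeline's own tokenization is
    scored against the gold annotations."""
    examples = [Example(nlp(doc.text), doc) for doc in gold_docs]
    return Scorer().score(examples)
]]></preformat>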
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
    </sec>
    <sec id="sec-4">
      <title>4. Discussion</title>
      <p>
        This paper has introduced the DaCy models and presented a thorough evaluation of Danish
NLP models on a battery of augmentations. DaCy models achieve state-of-the-art
performance on Danish NER, POS, and dependency parsing, and are robust to augmentations such
as keystroke errors, name changes, and lowercasing. The results from training DaCy
underline three well-known trends in deep learning and NLP: 1) larger models tend to perform
better, 2) higher quality pre-training data leads to better models, as illustrated by the
superior performance of Ælæctra compared to DaELECTRA, and 3) multilingual models perform
competitively with monolingual models [
        <xref ref-type="bibr" rid="ref24 ref34 ref5">25, 35, 5</xref>
        ].
      </p>
      <p>Our experiments with multiple augmenters revealed different patterns of strengths and
weaknesses across Danish NLP models. In general, larger models tend to be more robust to data
augmentations. Several models are highly sensitive to casing, which limits their usefulness in
certain domains. Evaluating models on augmented data provides a more holistic and realistic
estimate of the expected performance, and can reveal in which use cases one model might be
more useful than another. For example, it might be better to use DaCy medium on social
media as opposed to DaCy large as its performance is not affected by casing.</p>
      <p>The purpose of the data augmentation experiments was to evaluate the robustness of Danish
models and to open a discussion on how to present new models going forward. As more models
are developed for low and medium resource languages, properly evaluating them becomes vital
for securing robustness, transparency, and effectiveness despite limited benchmark sets. We
do not posit data augmentation as the only solution, but demonstrate that it can effectively
reveal performance differences on important factors such as casing, spelling errors, and biases
related to protected groups. As researchers, we bear the responsibility for releasing adequately
tested and robust models into the world. With the increasing ease of deployment, users must
be made aware of the level of performance they can realistically expect to achieve on their
problem, and when to choose one model over another. Social media researchers should know
that certain models are sensitive to casing, historians should know that some models handle
old text variations such as ae, oe, aa poorly, and lawyers should be aware that models might
not be able to identify abbreviated names as effectively. In this regard, transparency and
openness as to when and how models fail are crucial measures to report. Such evaluation
requires the development of infrastructure and tools, but is fast and easy to conduct once in
place. For instance, it only takes 8 minutes to test DaCy large on all augmented datasets
including bootstrapping. As part of the DaCy library, we provide several augmenters and
utility functions for evaluation that integrate with SpaCy, and encourage new NLP models
to use and expand upon them. For the continued development of low and medium resource
NLP in a direction that is beneficial for practitioners, it is vital to conduct more thorough
evaluation of new models. We suggest these augmenters not as an evaluation standard, but
as preliminary guiding principles for future development of NLP models for low and medium
resource languages in particular.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>We would like to especially thank Martin C. Jespersen for early conversations on biases and
covert weaknesses in Danish language models.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] A. Aghajanyan, A. Gupta, A. Shrivastava, X. Chen, L. Zettlemoyer, and S. Gupta. “Muppet: Massive Multi-task Representations with Pre-Finetuning”. In: arXiv:2101.11038 [cs] (2021). url: http://arxiv.org/abs/2101.11038.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] A. Akbik, T. Bergmann, D. Blythe, K. Rasul, S. Schweter, and R. Vollgraf. “FLAIR: An Easy-to-Use Framework for State-of-the-Art NLP”. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations). Minneapolis, Minnesota: Association for Computational Linguistics, 2019, pp. 54-59. doi: 10.18653/v1/N19-4010. url: https://aclanthology.org/N19-4010.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] T. Baldwin. “Social media: friend or foe of natural language processing?” In: Proceedings of the 26th Pacific Asia Conference on Language, Information, and Computation. 2012, pp. 58-59.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] A. Brogaard Pauli, M. Barrett, O. Lacroix, and R. Hvingelby. “DaNLP: An open-source toolkit for Danish Natural Language Processing”. In: Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa 2021). 2021.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei. “Language Models are Few-Shot Learners”. In: arXiv:2005.14165 [cs] (2020). url: http://arxiv.org/abs/2005.14165.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] R. Caruana. “Multitask learning”. In: Machine Learning 28.1 (1997), pp. 41-75.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] J. Chen, D. Tam, C. Raffel, M. Bansal, and D. Yang. “An Empirical Survey of Data Augmentation for Limited Data Learning in NLP”. In: arXiv:2106.07499 [cs] (2021). url: http://arxiv.org/abs/2106.07499.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] K. Clark, M.-T. Luong, Q. V. Le, and C. D. Manning. “ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators”. In: arXiv:2003.10555 [cs] (2020). url: http://arxiv.org/abs/2003.10555.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov. “Unsupervised Cross-lingual Representation Learning at Scale”. In: arXiv:1911.02116 [cs] (2020). url: http://arxiv.org/abs/1911.02116.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] L. Derczynski, M. R. Ciosici, R. Baglini, M. H. Christiansen, J. A. Dalsgaard, R. Fusaroli, P. J. Henrichsen, R. Hvingelby, A. Kirkedal, A. S. Kjeldsen, C. Ladefoged, F. Å. Nielsen, J. Madsen, M. L. Petersen, J. H. Rystrøm, and D. Varab. “The Danish Gigaword Corpus”. In: Proceedings of the 23rd Nordic Conference on Computational Linguistics. 2021.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”. In: arXiv:1810.04805 [cs] (2019). url: http://arxiv.org/abs/1810.04805.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] A. Farzindar and D. Inkpen. “Natural language processing for social media”. In: Synthesis Lectures on Human Language Technologies 8.2 (2015), pp. 1-166.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] S. Y. Feng, V. Gangal, J. Wei, S. Chandar, S. Vosoughi, T. Mitamura, and E. Hovy. “A Survey of Data Augmentation Approaches for NLP”. In: arXiv:2105.03075 [cs] (2021). url: http://arxiv.org/abs/2105.03075.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] M. Højmark-Bertelsen. “Ælæctra - A Step Towards More Efficient Danish Natural Language Processing”. 2021. url: https://github.com/MalteHB/-l-ctra/.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] M. Honnibal, I. Montani, S. Van Landeghem, and A. Boyd. spaCy: Industrial-strength Natural Language Processing in Python. 2020. doi: 10.5281/zenodo.1212303. url: https://doi.org/10.5281/zenodo.1212303.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] R. Hvingelby, A. B. Pauli, M. Barrett, C. Rosted, L. M. Lidegaard, and A. Søgaard. “DaNE: A named entity resource for Danish”. In: Proceedings of the 12th Language Resources and Evaluation Conference (LREC 2020). 2020, pp. 4597-4604.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] Z. Jiang, W. Yu, D. Zhou, Y. Chen, J. Feng, and S. Yan. “ConvBERT: Improving BERT with Span-based Dynamic Convolution”. In: arXiv:2008.02496 [cs] (2021). url: http://arxiv.org/abs/2008.02496.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] A. Johannsen, H. M. Alonso, and B. Plank. “Universal Dependencies for Danish”. In: Proceedings of the Fourteenth International Workshop on Treebanks and Linguistic Theories (2015).</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] L. Kjeldgaard. “NERDA”. GitHub, 2020. url: https://github.com/ebanalyse/NERDA.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] C. D. Manning. “Part-of-Speech Tagging from 97% to 100%: Is It Time for Some Linguistics?” In: Computational Linguistics and Intelligent Text Processing. Ed. by A. F. Gelbukh. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer, 2011, pp. 171-189. doi: 10.1007/978-3-642-19400-9_14.</mixed-citation>
      </ref>
      <ref id="ref20a">
        <mixed-citation>[21] E. V. Meldgaard. Muslimske fornavne i Danmark. 2005. url: https://nors.ku.dk/publikationer/webpublikationer/muslimske_fornavne/.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[22] J. D. Møllerhøj. Danish BERT model: BotXO has trained the most advanced BERT model. BotXO. 2019. url: https://www.botxo.ai/blog/danish-bert-model/.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[23] P. Qi, Y. Zhang, Y. Zhang, J. Bolton, and C. D. Manning. “Stanza: A Python Natural Language Processing Toolkit for Many Human Languages”. In: arXiv:2003.07082 [cs] (2020). url: http://arxiv.org/abs/2003.07082.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[24] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. “Language models are unsupervised multitask learners”. In: OpenAI blog 1.8 (2019), p. 9.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[25] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer”. In: arXiv:1910.10683 [cs, stat] (2020). url: http://arxiv.org/abs/1910.10683.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[26] R. Al-Rfou', B. Perozzi, and S. Skiena. “Polyglot: Distributed Word Representations for Multilingual NLP”. In: Proceedings of the Seventeenth Conference on Computational Natural Language Learning. Sofia, Bulgaria: Association for Computational Linguistics, 2013, pp. 183-192. url: https://aclanthology.org/W13-3520.</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[27] M. T. Ribeiro, T. Wu, C. Guestrin, and S. Singh. “Beyond Accuracy: Behavioral Testing of NLP Models with CheckList”. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics, 2020, pp. 4902-4912. doi: 10.18653/v1/2020.acl-main.442. url: https://aclanthology.org/2020.acl-main.442.</mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>[28] S. Ruder. “An Overview of Multi-Task Learning in Deep Neural Networks”. In: arXiv:1706.05098 [cs, stat] (2017). url: http://arxiv.org/abs/1706.05098.</mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>[29] C. Shorten and T. M. Khoshgoftaar. “A survey on Image Data Augmentation for Deep Learning”. In: Journal of Big Data 6.1 (2019), p. 60. doi: 10.1186/s40537-019-0197-0. url: https://doi.org/10.1186/s40537-019-0197-0.</mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>[30] sprogteknologi.dk. Sprogteknologi.dk. 2021. url: https://sprogteknologi.dk/.</mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>[31] N. Tahmasebi. “A Study on Word2Vec on a Historical Swedish Newspaper Corpus”. In: Proceedings of the Digital Humanities in the Nordic Countries 3rd Conference, DHN 2018, Helsinki, Finland, March 7-9, 2018. 2018, pp. 25-37. url: http://ceur-ws.org/Vol-2084/paper2.pdf.</mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>[32] P. T. Tamini-Sarnikowski. “Danish transformers”. GitHub, 2020. url: https://github.com/sarnikowski.</mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>[33] J. Wei and K. Zou. “EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks”. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China: Association for Computational Linguistics, 2019, pp. 6382-6388. doi: 10.18653/v1/D19-1670. url: https://www.aclweb.org/anthology/D19-1670.</mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>[34] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, and A. M. Rush. “HuggingFace's Transformers: State-of-the-art Natural Language Processing”. In: arXiv:1910.03771 [cs] (2020). url: http://arxiv.org/abs/1910.03771.</mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>[35] L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, A. Barua, and C. Raffel. “mT5: A massively multilingual pre-trained text-to-text transformer”. In: arXiv:2010.11934 [cs] (2021). url: http://arxiv.org/abs/2010.11934.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>
  </back>
</article>