                               DaCy: A Unified Framework for Danish NLP
Kenneth Enevoldsen1,2 , Lasse Hansen1,2,3 and Kristoffer L. Nielbo1,2
1 Interacting Minds Centre, Aarhus University, Denmark
2 Center for Humanities Computing, Aarhus University, Denmark
3 Department of Clinical Medicine, Aarhus University, Denmark


                                 Abstract
                                 Danish natural language processing (NLP) has in recent years obtained considerable improvements
                                 with the addition of multiple new datasets and models. However, at present, there is no coherent
                                 framework for applying state-of-the-art models for Danish. We present DaCy: a unified framework for
                                 Danish NLP built on and integrated with SpaCy. DaCy uses efficient multitask models which obtain
                                 state-of-the-art performance on named entity recognition, part-of-speech tagging, and dependency
                                 parsing. DaCy contains tools for easy integration of existing models such as for polarity, emotion, or
                                 subjectivity detection. In addition, we conduct a series of tests for biases and robustness of Danish
                                 NLP pipelines through data augmentation. DaCy large compares favorably and is especially robust
                                 to long input lengths and spelling variations and errors. All models except DaCy large display
                                 significant biases related to ethnicity while only Polyglot shows a significant gender bias. We argue
                                 that for languages with limited benchmark sets, data augmentation can be particularly useful for
                                 obtaining more realistic and fine-grained performance estimates. We provide a series of augmenters
                                 as a first step towards a more thorough evaluation of language models for low and medium resource
                                 languages and encourage further development.

                                 Keywords
                                 Natural Language Processing, Low-resource NLP, Data Augmentation, Danish NLP




1. Introduction
Danish Natural Language Processing (NLP) has seen a recent rise in resources with the in-
troduction of the Danish Gigaword Corpus [10], curated lists of Natural Language Processing
(NLP) tools by DaNLP [4] and sprogteknologi.dk, and at least five pretrained neural lan-
guage models [14, 22, 32]. Datasets and models are available for most common tasks such as
Named Entity Recognition (NER), Part-of-Speech (POS) tagging, dependency parsing, senti-
ment analysis, and coreference resolution [4, 30]. However, no coherent, efficient and state-of-
the-art framework exists for all fundamental NLP tasks. Models are developed and distributed
as disjoint projects that often require diverging package versions and have idiosyncratic APIs.
These factors complicate workflows and hamper further developments.

1.1. DaCy
With this motivation we present DaCy: an efficient end-to-end framework for Danish NLP
with state-of-the-art performance on POS, NER and dependency parsing. DaCy fills the
CHR 2021: Computational Humanities Research Conference, November 17–19, 2021, Amsterdam, The
Netherlands
✉ kenneth.enevoldsen@cas.au.dk (K. Enevoldsen)
ORCID: 0000-0001-8733-0966 (K. Enevoldsen); 0000-0003-1113-4779 (L. Hansen); 0000-0002-5116-5070 (K.L. Nielbo)
                               © 2021 Copyright for this paper by its authors.
                               Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
                               CEUR Workshop Proceedings (CEUR-WS.org)




gap in Danish NLP by providing a consistent interface that is easily extendable and able to
integrate other models. DaCy is built on SpaCy v.3 which comes with a range of advantages:
the framework is optimized, user-friendly, and well-documented. DaCy includes three fine-
tuned language models: DaCy small, based on a Danish Electra (14M parameters) [14]; DaCy
medium, based on the Danish BERT (110M parameters) [22]; and DaCy large, based on the
multilingual XLM-RoBERTa (550M parameters) [9]. All models have been fine-tuned to do
POS tagging, NER, and dependency parsing in a single forward pass, which increases the
efficiency of the model and allows for larger models at the same computational cost.
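   A minimal usage sketch is shown below; the model name and version string are illustrative (see the DaCy documentation for the available models). POS tags, named entities, and dependency arcs are all produced by the single multitask forward pass described above.

    import dacy

    # The model name is illustrative; see the DaCy documentation for available models.
    nlp = dacy.load("da_dacy_large_trf-0.1.0")

    doc = nlp("DaCy er en effektiv pipeline til dansk sprogteknologi.")

    # POS tags, dependency arcs, and named entities from one multitask model.
    for token in doc:
        print(token.text, token.pos_, token.dep_, token.head.text)
    for ent in doc.ents:
        print(ent.text, ent.label_)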
   Besides models fine-tuned for DaCy, the package includes convenient wrappers to add other
models to the pipeline. For instance, Danish models for polarity, emotion, and subjectivity
classification can be added in a single line of code, and any HuggingFace Transformers
[34] model trained for sentence classification can be conveniently wrapped and included in the
pipeline using utility functions. With this functionality, DaCy aims at being a unified frame-
work for Danish NLP. All functionality is well-documented and covered by tutorials.1
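   As an illustration of this wrapper functionality, the sketch below adds a Danish polarity component to an existing pipeline. The helper name (add_berttone_polarity) and the Doc extension attribute are assumptions based on the description above; consult the DaCy documentation for the exact API.

    import dacy
    from dacy.sentiment import add_berttone_polarity  # helper name assumed; see the DaCy docs

    nlp = dacy.load("da_dacy_medium_trf-0.1.0")  # illustrative model name

    # Adds a wrapped Danish polarity classifier as an extra pipeline component.
    nlp = add_berttone_polarity(nlp)

    doc = nlp("Jeg er meget glad for den nye model.")
    print(doc._.polarity)  # extension attribute set by the wrapped component (name assumed)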

1.2. Robustness & Evaluation
Fine-tuned language models are commonly evaluated by testing performance on a gold-standard
benchmark dataset. The most commonly used benchmark for Danish is the DaNE dataset
[16], which consists of the Danish Dependency Treebank [18], additionally tagged for NER.
For languages with few benchmark datasets, such as Danish, performance stability and
generalizability cannot be reliably estimated [27]. For instance, the text included in DaNE
was collected in the years 1983–1992 from both written and spoken domains [16]. Given the
change in language over time and the addition of new textual domains such as social media,
this dataset is unlikely to be representative of the contemporary domains of application. For
instance, models might not have been sufficiently exposed to abbreviated names, spelling errors,
or non-standard casing to correctly and robustly classify them. In this sense, the performance
obtained on DaNE is unlikely to hold for real-world use cases.
   To provide an additional layer of validation, we propose evaluating models on augmented
gold-standard data. Data augmentation entails generating new data by slightly modifying
existing data points [13]. Data augmentation techniques such as rotation and cropping are
widely used in computer vision to reduce overfitting [29], and are becoming increasingly com-
mon in NLP [7]. The complex syntactic and semantic structure of text complicates the task
of finding useful augmentations, but simple manipulations such as synonym replacement and
random character swaps and deletions have been found to be particularly useful for supervised
learning in low-resource settings [33].
   Although data augmentation is most commonly used for increasing the amount of training
data, it can just as well be used for evaluation purposes [27]. By augmenting a gold-standard
dataset, we can evaluate model performance when exposed to data that more closely mimics
real-life settings by adding spelling errors, more diverse names, or other manipulations. In
section 2.2, we introduce a series of augmentations and evaluate the performance of Danish
NLP pipelines on them.
   The contributions of this paper are three-fold. 1) We introduce new state-of-the-art models for
Danish dependency parsing, NER, and POS. 2) We introduce the DaCy Python library as a
unified framework for state-of-the-art NLP in Danish. 3) We evaluate Danish NLP pipelines

   1
       See: https://centre-for-humanities-computing.github.io/DaCy/




Table 1
Performance of models fine-tuned for DaCy. Highest scores are in bold and second-highest are underlined.
WPS indicates words per second.
                                  POS                       NER                      Dependency Parsing   Speed
 Framework      Model           Accuracy   PER     LOC     ORG     MISC    Avg. F1   UAS        LAS       WPS
 DaCy large     XLM-RoBERTa       98.39    95.53   83.90   77.82   80.16   85.20     90.59      88         4311
 DaCy medium    DaBERT            97.93    89.62   83.09   67.35   70.69   78.47     87.88      85         8335
 DaCy small     Ælæctra Cased     97.69    87.36   81.95   63.83   70.68   76.55     86.45      83        10671
                DaELECTRA         97.40    82.80   77.39   63.01   66.95    73.16    85.20      82        9855
                DaConvBERT        97.23    85.08   78.26   61.76   66.93    73.77    84.61      81        10029



using data augmentation and provide directions for future model development.


2. Methods
2.1. Training
To train the candidate models for DaCy, all publicly available Transformer-based language
models for Danish were fine-tuned on the DaNE corpus [16] using SpaCy 3.0.3 [15]. The
models include two Danish ELECTRAs [8, 14, 32], the Danish ConvBERT [17, 32], the Danish
BERT [11, 22], and the multilingual XLM-Roberta Large [9]. All models were trained with
an input length of 10 sentences until convergence using similar hyperparameters on a Quadro
RTX 8000 GPU. Adam was used as optimizer with hyperparameters β1 = 0.9 and β2 = 0.999.
Further, L2 normalization with α = 0.01 and gradient clipping with c = 1.0 was employed. For
increased efficiency, all models were trained with a multi-task objective [6, 28] on NER, POS,
and dependency parsing. This allows the training of larger models at the same computational
cost, but it is unlikely that multi-task training at this scale improves performance [25, 1].2
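   The optimizer settings above correspond to the training.optimizer block of the spaCy configuration files referenced in footnote 2. A minimal sketch of the equivalent optimizer in Thinc (spaCy's machine learning library) is given below; the learning rate is not reported above and is only a placeholder.

    from thinc.api import Adam

    # Mirrors the reported hyperparameters: beta1 = 0.9, beta2 = 0.999,
    # L2 normalization with alpha = 0.01, and gradient clipping at c = 1.0.
    # The learning rate is a placeholder and not taken from the paper.
    optimizer = Adam(
        learn_rate=5e-5,
        beta1=0.9,
        beta2=0.999,
        L2=0.01,
        grad_clip=1.0,
    )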
   Table 1 shows the performance of all fine-tuned models evaluated on DaNE’s test set. The
three best-performing models in each size category, XLM-RoBERTa, DaBERT, and Ælæctra
Cased, are included in DaCy as the large, medium, and small models, respectively. In line with previous
findings [25, 5, 24], larger models tend to perform better, with XLM-RoBERTa obtaining the best
performance across the board.

2.2. Evaluation
To evaluate the robustness of DaCy and other Danish NLP pipelines, we assessed their perfor-
mance on multiple augmented versions of the DaNE test set. All Danish models are trained on
the DaNE corpus which consists of a mix of textual data of both spoken and written origin from
the years 1983–1992 [16], with the exception of Polyglot which is trained on entities extracted
from Wikipedia [26]. As a consequence, the training data is rarely representative of the domain
in which the models will be applied. For example, social media, contemporary news media,
and historical texts have domain specific characteristics such as non-standard casing, a higher
degree of typos, use of hashtags, and historic spelling such as upper-cased nouns [31, 3, 12].
While it is infeasible to test the models on all possible domains, some of these characteristics

   2
     For a full list of models and training configurations see the config files on Github: https://github.com/centre-for-humanities-computing/DaCy/tree/main/training




can be modelled using data augmentation which can provide practitioners with an estimate of
the potential shortcomings of the model. Further, data augmentation can be used to estimate
biases against protected groups such as gender and ethnicity.
   The augmenters presented here are not meant to be exhaustive, but rather a first step towards
more thorough validation of new language models. We argue that the bar for inclusion of a
new model should be set higher than a slight increase in benchmark performance. Language
models are used in a variety of contexts which current benchmark tasks, especially for low-resource
languages, do not capture. Our aim with these experiments is to provide an extra layer
of insight into the performance of language models that more closely mimics naturalistic use
cases, and to encourage the development of further augmenters. Augmentation not only provides
insight into when model performance breaks down and whether certain models are better suited
to specific use cases than others; it can also be used to identify specific areas to improve
upon.
   The augmenters developed for this paper are designed in accordance with the SpaCy frame-
work, and are thus not necessarily tied to DaCy or Danish in particular and can be used both
during model validation and training. Comprehensive tutorials are provided on the DaCy
Github repository.
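   To illustrate how these augmenters integrate with SpaCy, the sketch below implements a simple lower-casing augmenter with the callable signature SpaCy expects, namely (nlp, example) -> Iterator[Example]. It is an illustrative sketch rather than the library's implementation; DaCy's own augmenters (keystroke, æøå, name substitution, etc.) follow the same pattern.

    import random
    from typing import Iterator

    from spacy.language import Language
    from spacy.training import Example


    def create_lower_case_augmenter(level: float = 1.0):
        """Create an augmenter that lower-cases a proportion of examples.

        Augmenters are callables taking (nlp, example) and yielding Examples,
        so they can be passed to a spacy.training.Corpus or used during training.
        """

        def augment(nlp: Language, example: Example) -> Iterator[Example]:
            if random.random() >= level:
                yield example
            else:
                example_dict = example.to_dict()
                # Lower-case both the raw text and the gold token orthography,
                # keeping all other annotations (tags, heads, entities) intact.
                doc = nlp.make_doc(example.text.lower())
                example_dict["token_annotation"]["ORTH"] = [
                    t.lower_ for t in example.reference
                ]
                yield Example.from_dict(doc, example_dict)

        return augment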
   We tested small, medium, and large SpaCy [15] and DaCy models, Stanza [23], Polyglot
[26], NERDA [19], Flair3 [2], and DaNLP’s BERT [4] on the DaNE test set augmented with
the following augmenters:

  1. Keystroke augmentation: substitute 2%, 5%, or 15% of characters with a neighbouring
     character on a Danish QWERTY keyboard.
  2. ÆØÅ augmentation: substitute æ/Æ with ae/Ae, ø/Ø with oe/Oe, and å/Å with aa/Aa
     to simulate some historic text variations in Danish.
  3. Lower-case augmentation: convert all text to lower-case.
  4. Spacing augmentation: randomly remove 5% of all whitespace.
  5. Name augmentations:
       a) Substitute all names (PER entities) with randomly sampled Danish names, respect-
          ing first and last names.
       b) Substitute all names with randomly sampled names of Muslim origin used in Den-
          mark [21], respecting first and last names.
       c) Substitute all names with sampled Danish male names, respecting first and last
          names.
       d) Substitute all names with sampled Danish female names, respecting first and last
          names.
       e) Abbreviate all first names to the first character including a full stop.

   The stochastic augmentations, i.e. name and keystroke augmentations, were repeated 20
times.
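   The evaluation itself can be run with SpaCy's built-in machinery; the sketch below reads a serialized version of the DaNE test set with an augmenter attached and scores a pipeline with Language.evaluate. The file path and model name are placeholders, and we assume SpaCy's built-in lower-casing augmenter registered as "spacy.lower_case.v1"; any of the augmenters described above can be substituted.

    import dacy
    import spacy
    from spacy.training import Corpus

    nlp = dacy.load("da_dacy_large_trf-0.1.0")  # illustrative model name

    # Fetch a registered augmenter; DaCy's augmenters can be used the same way.
    lower_case = spacy.registry.augmenters.get("spacy.lower_case.v1")(level=1.0)

    # Read the serialized DaNE test set (path is a placeholder) and apply the
    # augmenter on the fly while creating the evaluation examples.
    corpus = Corpus("corpus/dane_test.spacy", augmenter=lower_case)
    examples = list(corpus(nlp))

    # Scores include entity F1 and LAS/UAS, depending on which components
    # are present in the pipeline.
    scores = nlp.evaluate(examples)
    print(scores["ents_f"], scores["dep_las"], scores["dep_uas"])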
   Previous evaluations of Danish NLP tools have used the gold-standard tokens instead of
using a tokenization module. While this allows for easier comparison of the specific modules, it
inflates the performance metrics of the models and is unlikely to reflect the metric of interest,
namely, the performance during application.4 All models were tested using both their own
tokenizer (if they have one) and the SpaCy tokenizer for Danish. The performance reported in
   3
       As supplied by DaNLP.
   4
       In our experiments, several of the Danish models performed worse using their own tokenizer.




Table 2
Performance of Danish NLP pipelines. Wall Time is the time taken by the model to go through the DaNE
test set without augmentation. Stanza uses the spacy-stanza implementation. The speed of the DaNLP
model is reported as provided by the framework, which does not utilize batch input. However, given the
model size it can be expected to reach speeds comparable to DaCy medium. Empty cells indicate that the
framework does not include the specific model.
                 POS                                      NER                                 Dependency Parsing   Wall Time
 Model          Accuracy   Person   Location   Organization     Misc     F1     F1 w/o Misc   LAS       UAS        GPU/CPU
 DaCy large      98.37     93.33     84.88        76.49         80.16   84.39     85.65       88.44     90.85      2.9 / 34.7
 DaCy medium     98.15     89.86     83.96        64.47         70.09   77.67     79.68       86.65     89.25       1.8 / 9.9
 DaCy small      97.75     87.98     79.23        60.58         64.82   74.18     76.98       84.03     87.63      1.9 / 2.6
 DaNLP BERT                92.27     83.90        71.13                 72.84      83.20                            37.4 / -
 Flair           97.80     92.60     84.82        61.29                 70.49      81.09                             2.0 / -
 NERDA                     92.35     81.52        65.96         72.41   79.04      80.85                             2.5 / -
 Polyglot        76.26     79.25     68.06        40.69                 56.67      65.32                            - / 3.8
 SpaCy large     96.30     86.17     84.16        63.36         65.52   75.75      78.57      78.01     81.95      0.9 / 1.4
 SpaCy medium    95.71     84.55     77.29        63.16         63.25   73.23      76.01      77.73     81.87      1.2 / 1.4
 SpaCy small     94.80     78.92     69.04        53.49         61.54   67.11      68.61      74.03     78.68      1.4 / 1.5
 Stanza          97.62                                                                        83.84     87.34       29.3 / -



section 3 uses the best-performing tokenization module for each pipeline. For all models except
Stanza and Polyglot this was found to be the SpaCy tokenizer.


3. Results
Table 2 shows the overall performance of Danish NLP frameworks on POS, NER, and depen-
dency parsing on the un-augmented DaNE test set. DaCy large obtains a new state-of-the-art
on all three tasks, most notably on NER and dependency parsing. Regardless of model, per-
formance for POS is stable around 98% accuracy. POS tagging has long been at this level,
and obtaining greater accuracy has been argued to require updates to the training data rather
than new architectures [20].
   Tables 3 to 5 show a detailed performance breakdown of the models on NER, POS, and
dependency parsing on the augmented data described in section 2.2. Overall, spelling varia-
tions and abbreviated first names consistently reduce performance of all models on all tasks.
Even simple replacements of æ, ø, and å lead to performance degradation. In general, larger
models handle augmentations better than small models with DaCy large performing the best
on all augmentations with the exception of lower-casing. DaCy medium, DaNLP’s BERT, and
NERDA are based on the uncased Danish BERT [22], and are consequently not affected by
casing. The BiLSTM-based models (Stanza and Flair) perform competitively under augmen-
tations and are only consistently outperformed by DaCy large.
   On NER specifically, all models with the exception of DaCy large obtain significantly worse
performance on Muslim names as compared to Danish names. The robustness of DaCy large
likely stems from the multilingual pre-training and the model size. Similarly, DaCy small is
robust to spelling errors and outperforms larger models such as DaNLP’s BERT and NERDA,
which is likely due to its well-curated training data [10]. DaNLP’s BERT and NERDA models
were found to severely under-perform if given longer input lengths. DaCy’s models consistently
perform slightly better with more context, but are not vulnerable to shorter input. Lastly, as
expected, the lack of casing is especially detrimental for NER for the cased models, most




Table 3
NER performance of Danish NLP pipelines reported as average F1 scores excluding the MISC category on
the test set of DaNE. Best scores are marked bold and second best are underlined. * denotes that the result
is significantly different from the baseline using a significance threshold of 0.05 with Bonferroni correction
for multiple comparisons. The Danish name augmentation is considered the baseline for the Muslim, female,
and male name augmentations. Values in parentheses denote the standard deviation. NERDA limits input size to 128
wordpieces, which leads to truncation of long inputs and of inputs with high rates of keystroke errors.
                                                                      Deterministic Augmentations
                                                                                 Input Length               Names
      Model                               Baseline   Æøå    Lowercase     5 sentences     10 sentences     Abbreviated
      DaCy large                            85.6     83.5      69.7          86.5             86.5            80.1
      DaCy medium                           79.7     73.1      79.5          80.9             80.4            76.3
      DaCy small                            77.0     74.7      48.0          77.0             78.2            69.4
      DaNLP BERT                            83.2     78.6     83.1           78.6             61.9            78.1
      Flair                                 81.1     80.2     24.4           81.0             80.9            74.9
      NERDA                                 80.9     74.8     80.7           73.7             53.8            76.4
      Polyglot                              65.3     61.4      55.3          64.8             64.2            40.2
      SpaCy large                           78.6     75.4      5.7           78.8             78.8            78.0
      SpaCy medium                          76.0     74.7      9.7           76.5             76.8            76.0
      SpaCy small                           68.6     66.9      4.8           68.0             68.0            63.8
                                                      Stochastic Augmentations
                                           Names                                        Keystroke Errors
       Model            Danish       Muslim        Female      Male           2%              5%            15%
       DaCy large  86.2 (0.6)* 86.0 (0.5) 86.2 (0.5) 86.2 (0.4) 82.0 (1.2)* 76.9 (1.3)* 61.3 (1.6)*
       DaCy medium 80.3 (0.5)* 77.9 (0.8)* 80.3 (0.4) 80.2 (0.7) 65.5 (1.7)* 50.0 (1.6)* 25.8 (1.3)*
       DaCy small   76.5 (0.9) 75.7 (0.7)* 76.7 (0.8) 76.6 (0.7) 70.7 (1.6)* 62.1 (1.5)* 41.3 (1.6)*
       DaNLP BERT     82.9 (0.6)    81.0 (1.0)* 83.1 (0.5) 83.0 (0.7)     72.6 (1.2)*     60.9 (1.7)*    37.0 (1.5)*
       Flair          81.2 (0.7)    79.8 (0.7)* 81.4 (0.5) 81.5 (0.5)     78.3 (0.9)*     73.5 (1.5)*    56.3 (1.7)*
       NERDA          80.0 (1.1)*   78.1 (1.2)* 80.2 (0.8) 80.0 (0.8)     70.7 (1.4)*     57.5 (1.4)*    31.1 (1.6)*
       Polyglot       63.1 (1.2)*   41.8 (0.7)* 61.2 (1.2)* 64.8 (1.2)*   57.4 (0.9)*     46.9 (1.9)*    24.7 (1.9)*
       SpaCy large  79.5 (0.6)* 71.6 (1.1)* 79.8 (0.5)      79.4 (0.5)    72.1 (1.0)*     63.3 (1.5)*    44.9 (1.8)*
       SpaCy medium 78.2 (0.7)* 69.2 (1.4)* 78.2 (0.7)      78.5 (0.8)    70.5 (1.3)*     64.2 (1.5)*    46.9 (1.6)*
       SpaCy small  62.5 (1.6)* 57.8 (1.4)* 63.0 (1.1)      63.3 (0.9)    65.4 (0.7)*     60.5 (1.5)*    45.9 (1.6)*



notably Flair, the SpaCy models, DaCy large and DaCy small.


4. Discussion
This paper has introduced the DaCy models and presented a thorough evaluation of Danish
NLP models on a battery of augmentations. DaCy models achieve state-of-the-art perfor-
mance on Danish NER, POS, and dependency parsing, and are robust to augmentations such
as keystroke errors, name changes, and lowercasing. The results from training DaCy under-
line three well-known trends in deep learning and NLP: 1) larger models tend to perform
better; 2) higher-quality pre-training data leads to better models, as illustrated by the superior
performance of Ælæctra compared to DaELECTRA; and 3) multilingual models perform
competitively with monolingual models [25, 35, 5].
   Our experiments with multiple augmenters revealed different patterns of strengths and weak-
nesses across Danish NLP models. In general, larger models tend to be more robust to data




Table 4
POS performance of Danish NLP pipelines reported as accuracy on the test set of DaNE. Best scores
are marked bold and second best are underlined. * denotes that the result is significantly different from
the baseline using a significance threshold of 0.05 with Bonferroni correction for multiple comparisons. Values
in parentheses denote the standard deviation. NERDA limits input size to 128 wordpieces, which leads to
truncation of long inputs and of inputs with a high degree of keystroke errors.
                                    Deterministic Augmentations                        Stochastic Augmentations
                                                     Input Length                           Keystroke Errors
 Model           Baseline   Æøå    Lowercase   5 sentences     10 sentences       2%               5%             15%
 DaCy large       98.4      97.5     95.5         98.5            98.4        95.5 (0.2)*     91.1 (0.2)*      75.4 (0.6)*
 DaCy medium      98.2      96.5     98.1         97.8            97.9        93.6 (0.3)*     86.5 (0.3)*      63.3 (0.6)*
 DaCy small       97.7      95.4     95.4         97.6            97.7        93.1 (0.2)*     85.9 (0.4)*      62.5 (0.4)*
 Flair            97.8      95.0     95.0         97.7            97.7        94.7 (0.2)*     89.8 (0.3)*      72.1 (0.4)*
 Polyglot         76.3      71.6     75.6         75.7            75.6        71.7 (0.2)*     65.3 (0.3)*      49.4 (0.4)*
 SpaCy large      96.3      92.4     91.5         96.3            96.3        91.5 (0.2)*     84.8 (0.4)*      66.2 (0.5)*
 SpaCy medium     95.7      92.4     91.6         95.8            95.7        91.0 (0.3)*     84.5 (0.3)*      66.0 (0.5)*
 SpaCy small      94.8      90.5     90.3         94.8            94.8        90.7 (0.2)*     85.3 (0.3)*      69.1 (0.4)*
 Stanza           97.6      96.1     95.4         97.7            97.7        94.8 (0.2)*     90.6 (0.3)*      75.6 (0.5)*


Table 5
Dependency parsing performance of Danish NLP pipelines on the test set of DaNE reported as LAS. Best
scores are marked bold and second best are underlined. * denotes that the result is significantly different
from the baseline using a significance threshold of 0.05 with Bonferroni correction for multiple comparisons.
Values in parentheses denote the standard deviation.
                                    Deterministic Augmentations                        Stochastic Augmentations
                                                      Input Length                          Keystroke Errors
 Model           Baseline   Æøå    Lowercase   5 sentences     10 sentences       2%          5%                   15%
 DaCy large       88.4      86.2     87.0         88.3             88.3       83.7 (0.4)*     76.6 (0.5)*      53.6 (0.8)*
 DaCy medium      86.7      84.6     86.6         85.4             85.3       79.9 (0.5)*     69.9 (0.7)*      41.1 (0.9)*
 DaCy small       84.0      79.0     82.7         83.5             83.0       76.8 (0.4)*     66.2 (0.8)*      38.0 (0.6)*
 SpaCy large       78.0     71.0     74.0         77.6             77.6       69.7 (0.5)*     59.3 (0.7)*       34.8 (0.7)*
 SpaCy medium      77.7     71.2     73.8         77.4             77.4       69.6 (0.6)*     59.5 (0.6)*       35.3 (0.7)*
 SpaCy small       74.0     65.9     70.4         74.1             74.1       67.5 (0.4)*     59.1 (0.5)*       38.2 (0.7)*
 Stanza            83.8     80.2     82.5         83.9             83.9       79.0 (0.4)*     71.9 (0.5)*       49.8 (0.9)*



augmentations. Several models are highly sensitive to casing, which limits their usefulness in
certain domains. Evaluating models on augmented data provides a more holistic and realistic
estimate of the expected performance, and can reveal in which use cases one model might be
more useful than another. For example, it might be better to use DaCy medium rather than
DaCy large on social media, as DaCy medium’s performance is not affected by casing.
   The purpose of the data augmentation experiments was to evaluate the robustness of Danish
models and to open a discussion on how to present new models going forward. As more models
are developed for low and medium resource languages, properly evaluating them becomes vital
for securing robustness, transparency, and effectiveness despite limited benchmark sets. We
do not posit data augmentation as the only solution, but demonstrate that it can effectively
reveal performance differences on important factors such as casing, spelling errors, and biases
related to protected groups. As researchers, we bear the responsibility for releasing adequately




tested and robust models into the world. With the increasing ease of deployment, users must
be made aware of the level of performance they can realistically expect to achieve on their
problem, and when to choose one model over another. Social media researchers should know
that certain models are sensitive to casing, historians should know that some models handle
old text variations such as ae, oe, aa poorly, and lawyers should be aware that models might
not be able to identify abbreviated names as effectively. In this regard, transparency and
openness as to when and how models fail are crucial measures to report. Such evaluation
requires the development of infrastructure and tools, but is fast and easy to conduct once in
place. For instance, it only takes 8 minutes to test DaCy large on all augmented datasets
including bootstrapping. As part of the DaCy library, we provide several augmenters and
utility functions for evaluation that integrate with SpaCy and encourage new NLP models
to use and expand upon them. For the continued development of low and medium resource
NLP in a direction that is beneficial for practitioners, it is vital to conduct more thorough
evaluation of new models. We suggest these augmenters not as an evaluation standard, but
as preliminary guiding principles for future development of NLP models for low and medium
resource languages in particular.



Acknowledgments
We would like to especially thank Martin C. Jespersen for early conversations on biases and
covert weaknesses in Danish language models.


References
 [1]   A. Aghajanyan, A. Gupta, A. Shrivastava, X. Chen, L. Zettlemoyer, and S. Gupta. “Mup-
       pet: Massive Multi-task Representations with Pre-Finetuning”. In: arXiv:2101.11038 [cs]
       (2021). arXiv: 2101.11038. url: http://arxiv.org/abs/2101.11038.
 [2]   A. Akbik, T. Bergmann, D. Blythe, K. Rasul, S. Schweter, and R. Vollgraf. “FLAIR: An
       Easy-to-Use Framework for State-of-the-Art NLP”. In: Proceedings of the 2019 Confer-
       ence of the North American Chapter of the Association for Computational Linguistics
       (Demonstrations). Minneapolis, Minnesota: Association for Computational Linguistics,
       2019, pp. 54–59. doi: 10.18653/v1/N19-4010. url: https://aclanthology.org/N19-4010.
 [3]   T. Baldwin. “Social media: friend or foe of natural language processing?” In: Proceedings
       of the 26th Pacific Asia Conference on Language, Information, and Computation. 2012,
       pp. 58–59.
 [4]   A. Brogaard Pauli, M. Barrett, O. Lacroix, and R. Hvingelby. “DaNLP: An open-source
       toolkit for Danish Natural Language Processing”. In: Proceedings of the 23rd Nordic
       Conference on Computational Linguistics 2021). NoDaLiDa 2021. 2021.
 [5]   T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan,
       P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan,
       R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M.
       Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever,
       and D. Amodei. “Language Models are Few-Shot Learners”. In: arXiv:2005.14165 [cs]
       (2020). arXiv: 2005.14165. url: http://arxiv.org/abs/2005.14165.




 [6]   R. Caruana. “Multitask learning”. In: Machine learning 28.1 (1997), pp. 41–75.
 [7]   J. Chen, D. Tam, C. Raffel, M. Bansal, and D. Yang. “An Empirical Survey of Data
       Augmentation for Limited Data Learning in NLP”. In: arXiv:2106.07499 [cs] (2021).
       arXiv: 2106.07499. url: http://arxiv.org/abs/2106.07499.
 [8]   K. Clark, M.-T. Luong, Q. V. Le, and C. D. Manning. “ELECTRA: Pre-training Text
       Encoders as Discriminators Rather Than Generators”. In: arXiv:2003.10555 [cs] (2020).
       arXiv: 2003.10555. url: http://arxiv.org/abs/2003.10555.
 [9]   A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave,
       M. Ott, L. Zettlemoyer, and V. Stoyanov. “Unsupervised Cross-lingual Representation
        Learning at Scale”. In: arXiv:1911.02116 [cs] (2020). arXiv: 1911.02116. url: http://arxiv.org/abs/1911.02116.
[10]   L. Derczynski, M. R. Ciosici, R. Baglini, M. H. Christiansen, J. A. Dalsgaard, R. Fusaroli,
       P. J. Henrichsen, R. Hvingelby, A. Kirkedal, A. S. Kjeldsen, C. Ladefoged, F. Å. Nielsen,
       J. Madsen, M. L. Petersen, J. H. Rystrøm, and D. Varab. “The Danish Gigaword Corpus”.
       In: Proceedings of the 23rd Nordic Conference on Computational Linguistics. 2021.
[11]   J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. “BERT: Pre-training of Deep Bidi-
       rectional Transformers for Language Understanding”. In: arXiv:1810.04805 [cs] (2019).
       arXiv: 1810.04805. url: http://arxiv.org/abs/1810.04805.
[12]   A. Farzindar and D. Inkpen. “Natural language processing for social media”. In: Synthesis
       Lectures on Human Language Technologies 8.2 (2015), pp. 1–166.
[13]   S. Y. Feng, V. Gangal, J. Wei, S. Chandar, S. Vosoughi, T. Mitamura, and E. Hovy. “A
       Survey of Data Augmentation Approaches for NLP”. In: arXiv:2105.03075 [cs] (2021).
       arXiv: 2105.03075. url: http://arxiv.org/abs/2105.03075.
[14]   M. Højmark-Bertelsen. “Ælæctra - A Step Towards More Efficient Danish Natural Lan-
       guage Processing”. In: 2021. url: https://github.com/MalteHB/-l-ctra/.
[15]   M. Honnibal, I. Montani, S. Van Landeghem, and A. Boyd. spaCy: Industrial-strength
       Natural Language Processing in Python. 2020. doi: 10.5281/zenodo.1212303. url: https:
       //doi.org/10.5281/zenodo.1212303.
[16]   R. Hvingelby, A. B. Pauli, M. Barrett, C. Rosted, L. M. Lidegaard, and A. Søgaard.
       “DaNE: A named entity resource for danish”. In: Proceedings of the 12th Language Re-
       sources and Evaluation Conference. Lrec ’20. 2020, pp. 4597–4604.
[17]   Z. Jiang, W. Yu, D. Zhou, Y. Chen, J. Feng, and S. Yan. “ConvBERT: Improving BERT
       with Span-based Dynamic Convolution”. In: arXiv:2008.02496 [cs] (2021). arXiv: 2008.
       02496. url: http://arxiv.org/abs/2008.02496.
[18]   A. Johannsen, H. M. Alonso, and B. Plank. “Universal Dependencies for Danish”. In: Pro-
       ceedings of the Fourteenth International Workshop on Treebanks and Linguistic Theories
       (2015).
[19]   L. Kjeldgaard. “Nerda”. In: GitHub, 2020. url: https://github.com/ebanalyse/NERDA.
[20]   C. D. Manning. “Part-of-Speech Tagging from 97% to 100%: Is It Time for Some Lin-
       guistics?” In: Computational Linguistics and Intelligent Text Processing. Ed. by A. F.
       Gelbukh. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer, 2011, pp. 171–
        189. doi: 10.1007/978-3-642-19400-9_14.




[21]   E. V. Meldgaard. Muslimske fornavne i Danmark. 2005. url: https://nors.ku.dk/publikationer/webpublikationer/muslimske_fornavne/.
[22]   J. D. Møllerhøj. Danish BERT model: BotXO has trained the most advanced BERT
       model. BotXO. 2019. url: https://www.botxo.ai/blog/danish-bert-model/.
[23]   P. Qi, Y. Zhang, Y. Zhang, J. Bolton, and C. D. Manning. “Stanza: A Python Natural
       Language Processing Toolkit for Many Human Languages”. In: arXiv:2003.07082 [cs]
       (2020). arXiv: 2003.07082. url: http://arxiv.org/abs/2003.07082.
[24]   A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. “Language models
       are unsupervised multitask learners”. In: OpenAI blog 1.8 (2019), p. 9.
[25]   C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li,
       and P. J. Liu. “Exploring the Limits of Transfer Learning with a Unified Text-to-Text
       Transformer”. In: arXiv:1910.10683 [cs, stat] (2020). arXiv: 1910.10683. url: http://
       arxiv.org/abs/1910.10683.
[26]   R. Al-Rfou’, B. Perozzi, and S. Skiena. “Polyglot: Distributed Word Representations
       for Multilingual NLP”. In: Proceedings of the Seventeenth Conference on Computational
       Natural Language Learning. Sofia, Bulgaria: Association for Computational Linguistics,
       2013, pp. 183–192. url: https://aclanthology.org/W13-3520.
[27]   M. T. Ribeiro, T. Wu, C. Guestrin, and S. Singh. “Beyond Accuracy: Behavioral Test-
       ing of NLP Models with CheckList”. In: Proceedings of the 58th Annual Meeting of
       the Association for Computational Linguistics. Online: Association for Computational
        Linguistics, 2020, pp. 4902–4912. doi: 10.18653/v1/2020.acl-main.442. url: https://aclanthology.org/2020.acl-main.442.
[28]   S. Ruder. “An Overview of Multi-Task Learning in Deep Neural Networks”. In: arXiv:1706.05098
       [cs, stat] (2017). arXiv: 1706.05098. url: http://arxiv.org/abs/1706.05098.
[29]   C. Shorten and T. M. Khoshgoftaar. “A survey on Image Data Augmentation for Deep
       Learning”. In: Journal of Big Data 6.1 (2019), p. 60. doi: 10.1186/s40537-019-0197-0.
       url: https://doi.org/10.1186/s40537-019-0197-0.
[30]   sprogteknologi.dk. Sprogteknologi.dk. 2021. url: https://sprogteknologi.dk/.
[31]   N. Tahmasebi. “A Study on Word2Vec on a Historical Swedish Newspaper Corpus”.
       In: Proceedings of the Digital Humanities in the Nordic Countries 3rd Conference, DHN
       2018, Helsinki, Finland, March 7-9, 2018. 2018, pp. 25–37. url: http://ceur-ws.org/Vol-
       2084/paper2.pdf.
[32]   P. T. Tamini-Sarnikowski. “Danish transformers”. In: GitHub, 2020. url: https://github.
       com/sarnikowski.
[33]   J. Wei and K. Zou. “EDA: Easy Data Augmentation Techniques for Boosting Perfor-
       mance on Text Classification Tasks”. In: Proceedings of the 2019 Conference on Empiri-
       cal Methods in Natural Language Processing and the 9th International Joint Conference
       on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China: Association
       for Computational Linguistics, 2019, pp. 6382–6388. doi: 10.18653/v1/D19-1670. url:
       https://www.aclweb.org/anthology/D19-1670.




[34]   T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault,
       R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu,
       C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, and A. M. Rush. “HuggingFace’s
       Transformers: State-of-the-art Natural Language Processing”. In: arXiv:1910.03771 [cs]
       (2020). arXiv: 1910.03771. url: http://arxiv.org/abs/1910.03771.
[35]   L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, A. Barua, and
       C. Raffel. “mT5: A massively multilingual pre-trained text-to-text transformer”. In:
       arXiv:2010.11934 [cs] (2021). arXiv: 2010.11934. url: http://arxiv.org/abs/2010.11934.



