Evaluation of Data Augmentation for Named Entity Recognition in the German Legal Domain

Robin Erd¹, Leila Feddoul¹,*, Clara Lachenmaier² and Marianne Jana Mauch¹
¹ Heinz Nixdorf Chair for Distributed Information Systems, Friedrich Schiller University Jena, Jena, Germany
² Computational Linguistics, Bielefeld University, Bielefeld, Germany

Abstract
One of the techniques used to solve Natural Language Processing tasks is supervised learning, which requires large labeled datasets for model training. Such datasets are usually unavailable for specific domains or for languages other than English, and creating them manually is time-consuming. This paper explores methods to artificially expand small datasets in the German legal domain. We tested three Data Augmentation approaches on differently sized fragments of the German Legal Entity Recognition dataset: synonym replacement, mention replacement, and back translation. We evaluated the effect of training on the augmented data with a bidirectional Long Short-Term Memory network with a Conditional Random Field layer and with a Transformer-based model. Synonym replacement and mention replacement yield similarly positive results, while the latter is less time-consuming. Performing back translation on legal texts turns out to be challenging.

Keywords
Named Entity Recognition, Data Augmentation, German Legal Domain

Joint Proceedings of the ISWC 2022 International Workshops: the International Workshop on Artificial Intelligence Technologies for Legal Documents (AI4LEGAL) and the International Workshop on Knowledge Graph Summarization (KGSum), co-located with the International Semantic Web Conference, October 23–27, 2022, Virtual.
* Corresponding author.
robin.erd@uni-jena.de (R. Erd); leila.feddoul@uni-jena.de (L. Feddoul); clara.lachenmaier@uni-bielefeld.de (C. Lachenmaier); marianne.mauch@uni-jena.de (M. J. Mauch)
ORCID: 0000-0001-5340-8163 (R. Erd); 0000-0001-8896-8208 (L. Feddoul); 0000-0002-9207-3420 (C. Lachenmaier); 0000-0003-1478-1867 (M. J. Mauch)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

The project Canaréno¹ aims to support the analysis of German legal norms, which is the first step in the creation of digital administrative processes. In general, German public authorities need to follow a specific process in order to deliver an administrative service (e.g., vehicle registration) to citizens or companies. This process is not created arbitrarily but is typically based on legal bases (e.g., laws, ordinances, etc.). Thus, the very first step in the modeling of administrative processes is to gather the relevant legal bases and to analyze them subsequently. The purpose of this analysis is to identify indications in the text about possible process elements (e.g., process steps, participants, etc.). This is carried out manually, either implicitly or explicitly, by highlighting relevant words or sentences belonging to specific categories such as process main actor or process contributor. This task is not only time- and personnel-intensive but also requires a certain expertise. Therefore, the initial goal is to support this legal analysis with automatic suggestions about relevant objects.

¹ http://www.opendva.uni-jena.de/
In this context, Named Entity Recognition (NER) techniques are investigated. NER aims to detect and classify Named Entities (NEs), e.g., persons, in unstructured text. It allows machines to better understand the contained information and serves as an initial step for more complex tasks (e.g., question answering). In our context, it will be used as a basis for further processing to identify possible process steps and their interaction with other process elements. However, common NE classes often reflect generic concepts. As soon as texts cover specific domains that deal with particular phenomena, generic NER classes do not suffice to capture all concepts of the niche in question. Recent datasets use domain-specific tags for legal texts, such as lawyer, legal norm, or court [1]. Nevertheless, that dataset does not cover the specific properties related to administrative processes, and to the best of our knowledge, no labeled dataset exists that uses such tags.

NER is often solved by training supervised machine learning models, which learn complex features when provided with sufficient labeled data. Manual data labeling may involve domain experts and is time-consuming. In this context, we want to investigate techniques for Data Augmentation (DA) using another dataset similar in nature to our target data. Small datasets can be enlarged by automatically generating new training data using DA. Some of the methods for DA in the field of Natural Language Processing (NLP), and specifically NER, are, among others: synonym replacement [2, 3, 4, 5], mention replacement² [6, 5, 7], random deletion [3], random insertion [3], random swap [3, 5], noising techniques [3], TF-IDF based word replacement [8], back translation [9], and generative approaches [10, 11, 12].

There are works considering various aspects of different DA techniques, but to the best of our knowledge: (1) none uses data from the legal domain, and all of them mainly consider English data; (2) none compares different sources that can be used for synonym replacement; and (3) existing implementations of back translation either perform only segment-wise back translation while excluding entities, which limits the degree of change that can be achieved, or employ a NER model to re-annotate the back-translated sentences.

Our goal is to evaluate and compare DA techniques for the NER task, explicitly focusing on the German legal domain and thus using the German Legal Entity Recognition (LER) dataset [1]. The key contributions of this paper are:

1. A workflow for the generation and augmentation of different dataset fractions using three different DA techniques along with three different synonym sources.
2. A back translation method that (1) does not rely on pre-trained models for translation or re-annotation, and (2) translates the whole sentence, including entities, thereby enriching the mention space.
3. An evaluation and comparison of the effectiveness of the selected DA approaches using two deep learning models on a German legal dataset.

The source code for data generation and evaluation is publicly available [13, 14] under an MIT License. Generated datasets [15] and evaluation results [16] are published on Zenodo.

² Note that in NER, the terms "mention" and "entity" can be used interchangeably.

2. Related Work

Named Entity Recognition. With most research in NLP, and specifically NER, being conducted on the English language, few works exist regarding NER in the German legal domain.
Glaser et al. [17] tested three techniques for extracting entities from German legal contracts: GermaNER, DBpedia Spotlight [18], and templated NER. GermaNER and DBpedia Spotlight achieved F1-scores of 0.80 and 0.87, respectively, while templated NER was tested on a smaller dataset and achieved an F1-score of 0.92. Leitner et al. [19] evaluated different Bidirectional Long Short-Term Memory (BiLSTM) networks with a Conditional Random Field (CRF) layer for NER on the German LER dataset. The best performance was achieved using two BiLSTM-CRF models with character embeddings, with an F1-score of 0.9546. More recently, Zöllner et al. [20] compared different pre-training techniques and a modified fine-tuning process for small Bidirectional Encoder Representations from Transformers (BERT) [21] models, also using the LER dataset for evaluation and achieving an F1-score of 0.9488.

Data Augmentation. Replacement-based Techniques. The replacement of words was one of the first techniques employed for DA. Zhang et al. [2] applied WordNet-based [22] synonym replacement to eight text classification datasets. Wang et al. [23] applied Word2vec-based synonym replacement to a newly created Twitter dataset used for topic classification. Wu et al. [24] randomly replaced one to two words per sentence with a [MASK] token, which was then filled by a label-conditioned contextual language model. Liu et al. [7] randomly replaced mentions in the training data with mentions from a manually created dictionary containing mentions not part of the training data.

Combined Techniques. Wei et al. [3] presented four techniques for use with text classification data, now known as easy data augmentation (EDA): synonym replacement, random insertion, random swap, and random deletion. Kang et al. [25] extended these, adding an external knowledge-based system and modifying them to work with NER tasks. Shim et al. [4] and Issifu et al. [26] adapted the modified EDA techniques. Dai et al. [5] applied label-wise token replacement, synonym replacement, mention replacement, and shuffle-within-segments techniques to the MaSciP (materials science) and i2b2-2010 (biomedical) NER datasets. Yaseen et al. [9] used the same techniques as Dai et al., but applied them to the MaSciP and Species-800 datasets.

Back Translation. Xie et al. [8] applied back translation to a topic classification dataset, and Luque et al. [27] used back translation to augment a sentiment analysis dataset. While these applications are sentence-level tasks, there has recently been an effort to apply back translation to the sequence-labeling data required by, e.g., the NER task. Yaseen et al. [9] applied segment-wise back translation to the MaSciP and Species-800 datasets, back translating only the context and excluding the entities. They achieved an increase in F1-score of 0.0645 and 0.0148, respectively. Sabty et al. [28] applied back translation to Arabic-English code-switching³ NER data but did not improve performance. Their task differs from ours, as they also had to preserve the code-switching property of the sentences. They re-annotated the back-translated sentences using a NER model, tested trained models for translation instead of Google Translate, and tried more than one pivot language.

³ Code-switching refers to text containing more than one language in the same sentence.

While the previously mentioned works evaluate different DA approaches, they focus on general domains and mostly consider English datasets.
Furthermore, we did not find any comparison of different synonym replacement sources. Regarding back translation, the mentioned works either back translate only the context of entities or use a model to perform the translation and re-annotation.

3. Approach

Figure 1 depicts our proposed workflow. We establish a baseline by considering the train split of the chosen dataset, taking different fractions (1%, 10%, 30%, 50%, and 100%), training two models on the non-augmented data, and evaluating the performance of each model on the test split. To evaluate the impact of DA techniques on model performance, we apply the selected DA techniques to these training split fractions, train on the generated augmented fractions, and evaluate the impact of DA. Table 1 illustrates examples for each technique.

Figure 1: Overview of the proposed approach. [Diagram: fractions (1%, 10%, 30%, 50%, 100%) of the German LER train split (70%) are augmented using synonym replacement, mention replacement, and back translation; BiLSTM-CRF and XLM-RoBERTa are trained on the result and evaluated.]

Table 1
Example of augmented sentences with synonym replacement (SR), mention replacement (MR), and back translation (BT), with changes highlighted.

None  Alex is going to Los Angeles in California .
SR    Alex was walking towards Los Angeles around California .
MR    Chloe is going to Mexico in United Kingdom .
BT    Alex is going to Los Angeles , California .

We implement and evaluate the following three DA techniques, each of which attempts to create a modified copy of each sentence in the training dataset. A modified copy of a sentence can only be created if all technique-specific conditions are met. This leads to varying numbers of augmented sentences between the applied DA techniques. In addition, only sentences whose tokenization is reproducible can be augmented. One could also augment the dataset iteratively, generating multiple augmented sentences for each original sentence, but in that case a mechanism should be applied to avoid duplicates and to ensure a sufficient degree of variation. We applied just one augmentation round. Generated sentences are appended to the original training dataset.

Synonym Replacement. We substitute a percentage of the non-tagged, qualified⁴ tokens in the sentence with a replacement similar in meaning. We compare three different external sources for replacements: OpenThesaurus [29], fastText embeddings [30], and the contextual language model XLM-RoBERTa [31]. The augmentation of a sentence only succeeds if the selected replacement source provides a replacement for the selected tokens and at least one token qualifies for replacement (e.g., numbers are not qualified). Furthermore, the selected replacement percentage has to amount to at least one token.

⁴ We filter replacement candidates with a regular expression to avoid replacing or inserting, e.g., punctuation.

Mention Replacement. We replace each mention in the sentence with a random mention of the same class from the original training set. Only sentences containing mentions can be augmented using this technique. A minimal sketch of this procedure follows below.
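To make the data model concrete, the following is a minimal sketch of mention replacement over IOB2-tagged sentences. The names are our own illustration and are not taken from the released implementation [13]; a sentence is assumed to be a list of (token, tag) pairs.

```python
import random

def collect_mentions(sentences):
    """Map each entity class to all mention token sequences
    observed in the (original) training sentences."""
    mentions = {}
    for sentence in sentences:
        span, cls = [], None
        for token, tag in sentence + [("", "O")]:  # sentinel flushes the last span
            if tag.startswith("B-") or (span and not tag.startswith("I-")):
                if span:
                    mentions.setdefault(cls, []).append(span)
                span, cls = ([token], tag[2:]) if tag.startswith("B-") else ([], None)
            elif tag.startswith("I-") and span:
                span.append(token)
    return mentions

def mention_replacement(sentence, mentions):
    """Replace every mention with a random same-class mention.
    Returns None if the sentence contains no mention (not augmentable)."""
    augmented, i, replaced = [], 0, False
    while i < len(sentence):
        token, tag = sentence[i]
        if tag.startswith("B-"):
            cls = tag[2:]
            i += 1
            while i < len(sentence) and sentence[i][1] == f"I-{cls}":
                i += 1  # skip the rest of the original mention span
            new = random.choice(mentions[cls])
            augmented.append((new[0], f"B-{cls}"))
            augmented.extend((t, f"I-{cls}") for t in new[1:])
            replaced = True
        else:
            augmented.append((token, tag))
            i += 1
    return augmented if replaced else None
```

Synonym replacement follows the same structure, except that it swaps a percentage of the O-tagged, qualified tokens and queries OpenThesaurus, fastText neighbors, or a masked language model for candidates.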
Back Translation. We first extract all mentions and their classes from the sentence. We then back translate the complete sentence as a plain string, and the extracted mentions separately, using the BackTranslation⁵ Python package, which depends on external services to provide translations. We then map the extracted mentions back onto the back-translated sentence based on their token sequences⁶. As pivot language, we use English. With this process, we essentially preserve the original labels and adapt them to the new sentence, adding new mention variants to the dataset and foregoing the need for a NER model to perform the re-annotation. A minimal sketch of the mapping step follows below.

⁵ https://pypi.org/project/BackTranslation/, accessed on 03.08.2022
⁶ This is only possible if a sentence does not contain the same token sequence multiple times with different label sequences. If mapping the extracted mentions to the new sentence fails, the augmentation of this particular original sentence is canceled.
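The sketch below illustrates the label re-attachment under simplified assumptions: `translate` stands in for a German-English-German round trip (e.g., via the BackTranslation package), `tokenize` must reproduce the original tokenization (SoMaJo in our setup), and requiring a unique occurrence of each mention is a conservative reading of the condition in footnote 6. The names are illustrative, not the released code [13].

```python
def sentence_mentions(sentence):
    """Yield (token_sequence, class) for each mention in one IOB2 sentence."""
    span, cls = [], None
    for token, tag in sentence + [("", "O")]:  # sentinel flushes the last span
        if tag.startswith("B-") or (span and not tag.startswith("I-")):
            if span:
                yield span, cls
            span, cls = ([token], tag[2:]) if tag.startswith("B-") else ([], None)
        elif tag.startswith("I-") and span:
            span.append(token)

def find_unique_span(tokens, target):
    """Start index of the unique occurrence of `target` in `tokens`,
    or None if the span is absent or ambiguous."""
    hits = [i for i in range(len(tokens) - len(target) + 1)
            if tokens[i:i + len(target)] == target]
    return hits[0] if len(hits) == 1 else None

def back_translate(sentence, translate, tokenize):
    """Back translate an IOB2-tagged sentence and re-attach its labels.

    `sentence` is a list of (token, tag) pairs. Returns the augmented
    sentence, or None when any mention cannot be mapped back uniquely,
    in which case the augmentation of this sentence is canceled.
    """
    new_tokens = tokenize(translate(" ".join(tok for tok, _ in sentence)))
    labels = ["O"] * len(new_tokens)
    for span, cls in sentence_mentions(sentence):
        new_span = tokenize(translate(" ".join(span)))  # mention translated separately
        start = find_unique_span(new_tokens, new_span)
        if start is None:
            return None
        labels[start] = f"B-{cls}"
        for k in range(start + 1, start + len(new_span)):
            labels[k] = f"I-{cls}"
    return list(zip(new_tokens, labels))
```

Note that a mention translated in isolation can differ from its rendering inside the translated sentence, which is one reason the mapping, and hence the augmentation, may fail for a given sentence.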
4. Experiments

4.1. Experimental Setup

All experiments were performed on AlmaLinux 8.3 using Python 3.9.12. Training and evaluation were run on a single NVIDIA A100 GPU.

Dataset. We evaluate the DA methods on the German LER dataset, which contains ≈67,000 sentences with over 2 million tokens, annotated with 19 fine-grained semantic classes⁷. As train/dev/test splits are not provided, we split the data ourselves into 70/15/15 splits. Consequently, our training split contains 46,706 sentences. We work with IOB2 [32], the tagging scheme in which the data is provided. When working with the data during augmentation, the tokenization of the original sentence has to be reproduced. We therefore use the SoMaJo tokenizer [33], which was also used by Leitner et al. [1].

⁷ Leitner et al. found that some tags, such as street, landscape, brand, and regulation, are more difficult to predict than, e.g., judge, law, and court.

NER Models. To evaluate the effect of the DA techniques, we choose two models. The first is a BiLSTM-CRF, implemented using the FLAIR framework [34]. Following the recommendations of Akbik et al. [35], we use it with stacked German fastText and German forward and backward FLAIR embeddings, train the model using Stochastic Gradient Descent without momentum, clip gradients at 5, and anneal the learning rate against the micro F1-score on the dev split, halving it if the score does not increase for 5 consecutive epochs. We use a learning rate of 0.05 and a mini-batch size of 32, apply variational dropout, and train the model for 150 epochs, stopping earlier if the learning rate falls below 0.0001. The second model is Transformer-based, implemented using the FLERT extension [36] of the FLAIR framework. We chose the XLM-RoBERTa Transformer model (XLM-R) over models trained specifically for German, as preliminary studies showed that it achieves better results on the LER dataset than, e.g., GELECTRA [37]. We fine-tune it for 10 epochs using the AdamW optimizer with a mini-batch size of 1. The learning rate increases from 0 to 5e-6 during the warm-up phase and then decreases linearly, reaching 0 by the end of training.
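For orientation, the two setups can be expressed in FLAIR roughly as follows. This is a minimal sketch, not the published configuration [14]: the file paths are placeholders and the exact XLM-R variant (base vs. large) is an assumption on our part.

```python
from flair.datasets import ColumnCorpus
from flair.embeddings import (FlairEmbeddings, StackedEmbeddings,
                              TransformerWordEmbeddings, WordEmbeddings)
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# LER data in two-column IOB2 format (token, tag); paths are placeholders
corpus = ColumnCorpus("data/ler", {0: "text", 1: "ner"},
                      train_file="train.txt", dev_file="dev.txt",
                      test_file="test.txt")
labels = corpus.make_label_dictionary(label_type="ner")

# BiLSTM-CRF with stacked German fastText and FLAIR embeddings;
# FLAIR's defaults already give SGD without momentum and gradient clipping at 5
bilstm_crf = SequenceTagger(
    hidden_size=256,
    embeddings=StackedEmbeddings([
        WordEmbeddings("de"),            # German fastText
        FlairEmbeddings("de-forward"),
        FlairEmbeddings("de-backward"),
    ]),
    tag_dictionary=labels, tag_type="ner", use_crf=True)
ModelTrainer(bilstm_crf, corpus).train(
    "models/bilstm-crf", learning_rate=0.05, mini_batch_size=32,
    max_epochs=150, anneal_factor=0.5, patience=5, min_learning_rate=0.0001)

# FLERT-style fine-tuning of XLM-R (variant assumed); fine_tune() uses AdamW
# with linear warm-up and decay
xlmr = SequenceTagger(
    hidden_size=256,
    embeddings=TransformerWordEmbeddings("xlm-roberta-large",
                                         fine_tune=True, use_context=True),
    tag_dictionary=labels, tag_type="ner",
    use_crf=False, use_rnn=False, reproject_embeddings=False)
ModelTrainer(xlmr, corpus).fine_tune(
    "models/xlm-r", learning_rate=5e-6, mini_batch_size=1, max_epochs=10)
```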
4.2. Results

Table 2 provides the baseline results on the test set before applying the DA techniques, as well as the results after applying synonym replacement. We use the micro F1-score to evaluate model performance. The baseline results show that in very low-data settings, BiLSTM-CRF outperforms XLM-R; for all other dataset fractions, XLM-R outperforms BiLSTM-CRF. Note that relatively good performance is already achieved with only 10% and 30% of the original dataset.

Table 2
Evaluation results after training on data augmented with synonym replacement (SR), in terms of micro F1-score, with either OpenThesaurus (THE), fastText embeddings (FTX), or a contextual language model (CLM) as replacement source and different replacement percentages.

                         BiLSTM-CRF                                XLM-R
Dataset  Source  Base    20%      40%      60%      Base    20%      40%      60%
1%       THE     0.6994  -0.0035  +0.0108  +0.0096  0.6089  +0.0081  -0.0028  +0.0322
10%      THE     0.8941  +0.0130  +0.0073  +0.0042  0.9130  +0.0079  +0.0054  +0.0104
30%      THE     0.9346  +0.0010  +0.0012  +0.0014  0.9416  +0.0061  +0.0097  +0.0084
50%      THE     0.9430  +0.0059  +0.0033  +0.0040  0.9559  -0.0007  +0.0021  +0.0031
100%     THE     0.9572  +0.0023  +0.0031  +0.0013  0.9661  +0.0007  +0.0010  +0.0004
Avg.                     +0.0037  +0.0051  +0.0041          +0.0044  +0.0031  +0.0109
1%       FTX     0.6994  -0.0180  -0.0030  +0.0072  0.6089  +0.0296  +0.0224  +0.0268
10%      FTX     0.8941  +0.0047  +0.0073  +0.0079  0.9130  +0.0033  +0.0055  +0.0051
30%      FTX     0.9346  +0.0009  +0.0009  +0.0023  0.9416  +0.0089  +0.0063  +0.0075
50%      FTX     0.9430  +0.0053  +0.0048  +0.0049  0.9559  +0.0012  +0.0001  +0.0022
100%     FTX     0.9572  +0.0005  +0.0025  +0.0023  0.9661  +0.0011  +0.0014  +0.0002
Avg.                     -0.0013  +0.0025  +0.0049          +0.0088  +0.0071  +0.0084
1%       CLM     0.6994  +0.0180  -0.0039  +0.0007  0.6089  +0.0943  +0.0770  +0.0530
10%      CLM     0.8941  +0.0015  +0.0089  -0.0018  0.9130  +0.0074  +0.0018  +0.0013
30%      CLM     0.9346  +0.0002  +0.0008  +0.0002  0.9416  +0.0045  +0.0047  +0.0028
50%      CLM     0.9430  +0.0047  +0.0047  +0.0035  0.9559  +0.0020  -0.0004  -0.0005
100%     CLM     0.9572  +0.0006  -0.0005  -0.0018  0.9661  -0.0015  -0.0003  -0.0013
Avg.                     +0.0050  +0.0020  +0.0002          +0.0213  +0.0166  +0.0111

Synonym Replacement. Depending on the selected configuration, synonym replacement is the most expensive technique, taking up to 12 seconds per sentence when using the contextual language model as source in combination with a replacement percentage of 60%. The augmentation of the German LER dataset consequently took between 5 and 155 hours, increasing the dataset size by up to 87.3% when applied with a replacement percentage of 60% and fastText or the contextual language model as source. Our replacement percentages of 20%, 40%, and 60% of eligible tokens result in around 10.8%, 22.7%, and 34.6% of all tokens being replaced.

OpenThesaurus. Using OpenThesaurus as source, a higher replacement percentage improves XLM-R performance but not BiLSTM-CRF performance. XLM-R trained on the larger datasets benefits marginally from applying DA, while for BiLSTM-CRF we get a mixed picture.

fastText. With fastText embeddings, a higher replacement percentage improves BiLSTM-CRF performance but not XLM-R performance. Additionally, the performance of the XLM-R model trained on the larger datasets is affected only very slightly by DA, whereas BiLSTM-CRF shows improvements for the 50%-dataset.

Contextual Language Model. For the 1%, 10%, and 30% datasets, XLM-R benefits more than BiLSTM-CRF. In contrast to the other sources, a higher replacement percentage does not increase but rather reduces the augmentation's positive impact, across all dataset and model combinations. The augmentation affects performance on the 100%-dataset only marginally.

Overall. Figure 2 shows the average relative improvement in micro F1-score across all datasets achieved by applying synonym replacement. XLM-R benefits more from DA than BiLSTM-CRF, with the contextual language model as source yielding the greatest improvement. Applying synonym replacement leads to improvements in most cases. We conclude that the contextual language model is best used with a low replacement percentage.

Figure 2: Average micro F1-score improvements across all datasets achieved by applying synonym replacement, by source (CLM, FTX, THE), model (BiLSTM-CRF, XLM-R), and replacement percentage (20%, 40%, 60%). [Bar chart omitted.]

Mention Replacement. Mention replacement is the least expensive technique, with the augmentation taking only 0.011 seconds per sentence. It increased the dataset size by 37.854%. Table 3 lists the results achieved after training both models on the augmented datasets. The improvements for all datasets larger than the 10%-dataset are minor; the maximum relative improvement is achieved with the 1%-dataset for both models. The average change in micro F1-score across all datasets is +0.0075 for BiLSTM-CRF and +0.0194 for XLM-R. We also evaluated the effect of applying mention and synonym replacement combined, but without notable results.

Table 3
Evaluation results after training on data augmented with mention replacement (MR) or back translation (BT), in terms of micro F1-score.

                 BiLSTM-CRF                XLM-R
Dataset  Base    MR       BT       Base    MR       BT
1%       0.6994  +0.0222  +0.0065  0.6089  +0.0772  -0.0123
10%      0.8941  +0.0053  -0.0040  0.9130  +0.0103  -0.0025
30%      0.9346  +0.0032  +0.0006  0.9416  +0.0064  +0.0037
50%      0.9430  +0.0061  +0.0063  0.9559  +0.0033  -0.0008
100%     0.9572  +0.0007  -0.0003  0.9661  -0.0003  -0.0004
Avg.             +0.0075  +0.0018          +0.0194  -0.0025

Back Translation. By applying back translation, we were able to increase the dataset size by 63.24%, raising the total number of annotated entities by 17.52%. However, we do not register a significant impact on the micro F1-score of either BiLSTM-CRF or XLM-R. The average change in micro F1-score across all datasets is +0.0018 for BiLSTM-CRF and -0.0025 for XLM-R.
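All scores above are entity-level micro F1-scores as reported by FLAIR. For clarity, the following is a minimal sketch of this metric, assuming IOB2 tag sequences: a predicted entity counts as correct only on an exact boundary and class match, and counts are pooled over all classes before computing F1.

```python
def entity_spans(tags):
    """Extract (start, end, class) entity spans from one IOB2 tag sequence."""
    result, start, cls = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        if start is not None and tag != f"I-{cls}":
            result.append((start, i, cls))
            start, cls = None, None
        if tag.startswith("B-"):
            start, cls = i, tag[2:]
    return set(result)

def micro_f1(gold_tagged, pred_tagged):
    """Micro-averaged entity-level F1 over parallel lists of tag sequences."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_tagged, pred_tagged):
        g, p = entity_spans(gold), entity_spans(pred)
        tp += len(g & p)   # exact matches
        fp += len(p - g)   # spurious predictions
        fn += len(g - p)   # missed entities
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)
```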
5. Conclusion and Future Work

We implemented three different DA techniques for use with NER training data, evaluated them on data from the German legal domain, and compared different German replacement sources and percentages for synonym replacement. We believe that the proposed implementation of back translation is unique in its ability to back translate entire sentences while preserving their labels. Our workflow included two models and five different fractions of the full dataset.

We found that DA can be beneficial when working with small datasets, such as the 1%-dataset, which contains only 468 sentences. Considering that synonym and mention replacement deliver comparable improvements, the latter is more efficient. For synonym replacement, the contextual language model is the most effective source. Back translation is challenged by long and nested sentences, occasional ambiguities, and frequently occurring legal concepts that do not exist in the legal system of the pivot language's country. Back translation achieved a maximum improvement of +0.0065 using BiLSTM-CRF on the 1%-dataset. Mention replacement achieved a maximum improvement of +0.0772 using XLM-R on the 1%-dataset. Synonym replacement achieved a maximum improvement of +0.0943 using XLM-R on the 1%-dataset, with a replacement percentage of 20% and the contextual language model as source.

Future work could focus on improving the proposed back translation technique, e.g., by adding more flexibility to the re-annotation process. Mention replacement could be extended to draw replacements from, e.g., a knowledge base, in order to introduce new entities. Synonym replacement could benefit from a mechanism that prevents replacements from being too similar to the original token. In the context of the Canaréno project, this evaluation gives us more insight into which techniques are better suited to augment our small, manually annotated dataset before applying and evaluating different NER models.

Acknowledgments

The project Canaréno was funded by the Federal Ministry of the Interior and Community and the Free State of Thuringia. We also thank Prof. Dr. Birgitta König-Ries and Prof. Dr. Sina Zarrieß for their guidance and feedback.

References

[1] E. Leitner, G. Rehm, J. Moreno-Schneider, A Dataset of German Legal Documents for Named Entity Recognition, in: LREC 2020, European Language Resources Association, 2020, pp. 4478–4485. URL: https://aclanthology.org/2020.lrec-1.551.
[2] X. Zhang, J. J. Zhao, Y. LeCun, Character-level Convolutional Networks for Text Classification, in: NIPS 2015, 2015, pp. 649–657. URL: https://dl.acm.org/doi/10.5555/2969239.2969312.
[3] J. Wei, K. Zou, EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks, in: EMNLP@IJCNLP 2019, Association for Computational Linguistics, 2019, pp. 6382–6388. doi:10.18653/v1/D19-1670.
[4] H. Shim, S. Luca, D. Lowet, B. Vanrumste, Data Augmentation and Semi-Supervised Learning for Deep Neural Networks-Based Text Classifier, Association for Computing Machinery, 2020, pp. 1119–1126. doi:10.1145/3341105.3373992.
[5] X. Dai, H. Adel, An Analysis of Simple Data Augmentation for Named Entity Recognition, in: COLING 2020, International Committee on Computational Linguistics, 2020, pp. 3861–3867. doi:10.18653/v1/2020.coling-main.343.
[6] J. Raiman, J. Miller, Globally Normalized Reader, in: EMNLP 2017, Association for Computational Linguistics, 2017, pp. 1059–1069. doi:10.18653/v1/D17-1111.
[7] Q. Liu, P. Li, W. Lu, Q. Cheng, Long-tail Dataset Entity Recognition based on Data Augmentation, in: EEKE@JCDL 2020, CEUR-WS.org, 2020, pp. 79–80. URL: http://ceur-ws.org/Vol-2658/paper10.pdf.
[8] Q. Xie, Z. Dai, E. H. Hovy, T. Luong, Q. Le, Unsupervised Data Augmentation for Consistency Training, in: NIPS 2020, 2020. URL: https://dl.acm.org/doi/10.5555/3495724.3496249.
[9] U. Yaseen, S. Langer, Data Augmentation for Low-Resource Named Entity Recognition Using Backtranslation, 2021. doi:10.48550/ARXIV.2108.11703.
[10] A. Keraghel, K. Benabdeslem, B. Canita, Data augmentation process to improve deep learning-based NER task in the automotive industry field, in: IJCNN 2020, 2020, pp. 1–8. doi:10.1109/IJCNN48605.2020.9207241.
[11] R. Zhou, X. Li, R. He, L. Bing, E. Cambria, L. Si, C. Miao, MELM: Data Augmentation with Masked Entity Language Modeling for Low-Resource NER, in: ACL 2022, 2022, pp. 2251–2262. doi:10.18653/v1/2022.acl-long.160.
[12] R. Zhang, Y. Yu, C. Zhang, SeqMix: Augmenting Active Sequence Labeling via Sequence Mixup, in: EMNLP 2020, Association for Computational Linguistics, 2020, pp. 8566–8579. doi:10.18653/v1/2020.emnlp-main.691.
[13] R. Erd, L. Feddoul, fusion-jena/data-augmentation-ner-legal, 2022. URL: https://github.com/fusion-jena/data-augmentation-ner-legal.
[14] R. Erd, L. Feddoul, fusion-jena/data-augmentation-ner-legal v1.0.1, 2022. doi:10.5281/zenodo.6992392.
[15] R. Erd, L. Feddoul, C. Lachenmaier, M. J. Mauch, data-augmentation-ner-datasets, 2022. doi:10.5281/zenodo.6956603.
[16] R. Erd, L. Feddoul, C. Lachenmaier, M. J. Mauch, data-augmentation-ner-results, 2022. doi:10.5281/zenodo.6956508.
[17] I. Glaser, B. Waltl, F. Matthes, Named entity recognition, extraction, and linking in German legal contracts, in: IRIS: Internationales Rechtsinformatik Symposium, 2018, pp. 325–334.
[18] P. N. Mendes, M. Jakob, A. García-Silva, C. Bizer, DBpedia Spotlight: shedding light on the web of documents, in: I-SEMANTICS 2011, ACM International Conference Proceeding Series, ACM, 2011, pp. 1–8. doi:10.1145/2063518.2063519.
[19] E. Leitner, G. Rehm, J. Moreno-Schneider, Fine-Grained Named Entity Recognition in Legal Documents, in: SEMANTiCS 2019, Springer International Publishing, 2019, pp. 272–287. doi:10.1007/978-3-030-33220-4_20.
[20] J. Zöllner, K. Sperfeld, C. Wick, R. Labahn, Optimizing Small BERTs Trained for German NER, Inf. 12 (2021) 443. doi:10.3390/info12110443.
[21] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, in: NAACL-HLT 2019, Association for Computational Linguistics, 2019, pp. 4171–4186. doi:10.18653/v1/n19-1423.
[22] C. Fellbaum, WordNet: An Electronic Lexical Database, The MIT Press, 1998. doi:10.7551/mitpress/7287.001.0001.
[23] W. Y. Wang, D. Yang, That's So Annoying!!!: A Lexical and Frame-Semantic Embedding Based Data Augmentation Approach to Automatic Categorization of Annoying Behaviors using #petpeeve Tweets, in: EMNLP 2015, Association for Computational Linguistics, 2015, pp. 2557–2563. doi:10.18653/v1/D15-1306.
[24] X. Wu, S. Lv, L. Zang, J. Han, S. Hu, Conditional BERT Contextual Augmentation, in: ICCS 2019, Springer International Publishing, 2019, pp. 84–95. doi:10.1007/978-3-030-22747-0_7.
[25] T. Kang, A. Perotte, Y. Tang, C. Ta, C. Weng, UMLS-based data augmentation for natural language processing of clinical research literature, Journal of the American Medical Informatics Association 28 (2020) 812–823. doi:10.1093/jamia/ocaa309.
[26] A. M. Issifu, M. C. Ganiz, A Simple Data Augmentation Method to Improve the Performance of Named Entity Recognition Models in Medical Domain, in: 2021 6th International Conference on Computer Science and Engineering (UBMK), 2021, pp. 763–768. doi:10.1109/UBMK52708.2021.9558986.
[27] F. M. Luque, Atalaya at TASS 2019: Data Augmentation and Robust Embeddings for Sentiment Analysis, in: IberLEF@SEPLN 2019, CEUR-WS.org, 2019, pp. 561–570. URL: http://ceur-ws.org/Vol-2421/TASS_paper_1.pdf.
[28] C. Sabty, I. Omar, F. Wasfalla, M. Islam, S. Abdennadher, Data Augmentation Techniques on Arabic Data for Named Entity Recognition, Procedia Computer Science 189 (2021) 292–299. doi:10.1016/j.procs.2021.05.092.
[29] D. Naber, OpenThesaurus: ein offenes deutsches Wortnetz, in: Sprachtechnologie, mobile Kommunikation und linguistische Ressourcen: Beiträge zur GLDV-Tagung, Bonn, Germany, 2005, pp. 422–433.
[30] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information, Trans. Assoc. Comput. Linguistics 5 (2017) 135–146. doi:10.1162/tacl_a_00051.
[31] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised Cross-lingual Representation Learning at Scale, in: ACL 2020, Association for Computational Linguistics, 2020, pp. 8440–8451. doi:10.18653/v1/2020.acl-main.747.
[32] M. Konkol, M. Konopík, Segment Representations in Named Entity Recognition, in: Text, Speech, and Dialogue, Springer International Publishing, 2015, pp. 61–70. doi:10.1007/978-3-319-24033-6_7.
[33] T. Proisl, P. Uhrig, SoMaJo: State-of-the-art tokenization for German web and social media texts, in: WAC@ACL 2016, Association for Computational Linguistics, 2016, pp. 57–62. doi:10.18653/v1/W16-2607.
[34] A. Akbik, T. Bergmann, D. Blythe, K. Rasul, S. Schweter, R. Vollgraf, FLAIR: An Easy-to-Use Framework for State-of-the-Art NLP, in: NAACL-HLT 2019, Association for Computational Linguistics, 2019, pp. 54–59. doi:10.18653/v1/n19-4010.
[35] A. Akbik, D. Blythe, R. Vollgraf, Contextual String Embeddings for Sequence Labeling, in: COLING 2018, Association for Computational Linguistics, 2018, pp. 1638–1649. URL: https://aclanthology.org/C18-1139.
[36] S. Schweter, A. Akbik, FLERT: Document-Level Features for Named Entity Recognition, 2020. doi:10.48550/ARXIV.2011.06993.
[37] B. Chan, S. Schweter, T. Möller, German's Next Language Model, in: COLING 2020, International Committee on Computational Linguistics, 2020, pp. 6788–6796. doi:10.18653/v1/2020.coling-main.598.