<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Evaluating German Transformer Language Models with Syntactic Agreement Tests</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Karolina Zaczynska</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nils Feldhus</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Robert Schwarzenberg</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aleksandra Gabryszak</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sebastian Möller</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>German Research Center for Artificial Intelligence (DFKI)</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Pre-trained transformer language models (TLMs) have recently refashioned natural language processing (NLP): Most state-of-the-art NLP models now operate on top of TLMs to benefit from contextualization and knowledge induction. To explain their success, the scientific community has conducted numerous analyses. Among other methods, syntactic agreement tests have been utilized to analyse TLMs. Most of the studies were conducted for the English language, however. In this work, we analyse German TLMs. To this end, we design numerous agreement tasks, some of which consider peculiarities of the German language. Our experimental results show that state-of-the-art German TLMs generally perform well on agreement tasks, but we also identify and discuss syntactic structures that push them to their limits.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Pre-trained language models, in particular those
which are based on the transformer architecture
        <xref ref-type="bibr" rid="ref13">(Vaswani et al., 2017)</xref>
        , have immensely improved
the performance of various downstream models
(see, e.g. Zhang et al. (2020, 2019); Raffel et al.
(2019)). To explain their success, numerous
introspective experiments have targeted different
aspects of TLMs. It was shown, for instance, that
they encode syntactic, semantic and world
knowledge
        <xref ref-type="bibr" rid="ref8">(Petroni et al., 2019)</xref>
        and present downstream
models with a highly contextualized
representation of the input tokens
        <xref ref-type="bibr" rid="ref12">(Tenney et al., 2019)</xref>
        . For a
comprehensive overview of the many studies
conducted about arguably the most prominent of
language models, BERT
        <xref ref-type="bibr" rid="ref2">(Devlin et al., 2019)</xref>
        , we
refer the interested reader to the excellent overview
paper by Rogers et al. (2020).
      </p>
      <p>
        With the exception of experiments targeting a
multilingual BERT model
        <xref ref-type="bibr" rid="ref10">(Rogers et al., 2020)</xref>
        ,
most of the studies were conducted only for
English, however. Other languages are
underrepresented. In this work, we narrow the gap for
German by analysing the abilities and limits of
German TLMs. To the best of our knowledge, we are
the first to conduct such an analysis for the
German language.
      </p>
      <p>Compared with English, there are
considerable syntactic differences in the German
language that we consider in this work. For
example, the inflection system of the German language
is more complex, the range of morpho-syntactic
rules needed to form grammatical sentences is
larger, and the allowed word order is more diverse.
As a consequence, German language
models face specific challenges. The syntactic
agreement tests presented in this work include several
of them.</p>
      <p>Our main contributions are threefold:
1. Utilizing context-free grammars (CFGs), we
compile a German data set of controlled
syntactic correctness tests of varying
complexity. The motivation and construction of the
data set closely follow the one described
in Marvin and Linzen (2018), where
syntactic tests were conducted for English. In
particular, we devise several kinds of
subject-verb agreement as well as reflexive anaphora
agreement tasks, taking into account
peculiarities of the German language. A simple
subject-verb agreement task is given in
Example 1.1.</p>
      <p>Example 1.1. Decide which of the following
sentences is grammatical:
(a) Der Autor lacht. (The author laughs.)
(b) * Der Autor lachen. (The author laugh.)</p>
      <p>2. We use the data set to evaluate two
transformer-based language models that were
pre-trained on German corpora. During the
evaluation, contrary to prior work, we
utilize the cross-entropy loss to score the
syntactic correctness of input sentences. This
addresses a problem with the sub-word
tokenization of some TLMs that was
previously solved by discarding thousands of data
points.</p>
      <p>3. We conduct a qualitative and quantitative
analysis of the experimental results,
estimating the abilities and limits of the TLMs
tested.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Methods</title>
      <p>Our work combines and translates the targeted
syntactic evaluation of language models by
Marvin and Linzen (2018) and the assessment of
BERT’s syntactic abilities by Goldberg (2019)
from English into German. Our methods consist
of agreement test generation and model
evaluation.</p>
      <p>We created the following agreement test,
adapted from Marvin and Linzen (2018): Two sentences,
a grammatical one and an ungrammatical one, are
forwarded through a model. The sentences differ
minimally from each other at only one locus of
(un)grammaticality, i.e. one word. The model
output is monitored and if the output suggests that the
model prefers the grammatical one over the
ungrammatical one, that instance is counted as a
correct classification; otherwise, it is counted as an
incorrect classification.</p>
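      <p>As a minimal sketch in Python, the counting procedure can be written as follows, assuming a score function that assigns lower values to preferred sentences (such a score is defined later in this section):</p>
      <preformat>
def accuracy(pairs, score):
    """Fraction of (grammatical, ungrammatical) pairs for which the
    model prefers the grammatical sentence, i.e. gives it a lower loss."""
    pairs = list(pairs)
    correct = sum(score(bad) > score(good) for good, bad in pairs)
    return correct / len(pairs)

# Example 1.1 as a single test instance:
# accuracy([("Der Autor lacht.", "Der Autor lachen.")], score=sentence_loss)
      </preformat>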
      <p>Goldberg (2019) used agreement tests to
evaluate BERT models. To account for their
bidirectionality, he masked the locus of (un)grammaticality
and queried the candidate probabilities for the
mask. In Example 1.1, Der Autor [MASK]. is
forwarded through a BERT model and the
candidate probabilities at the position of the mask are
determined. If lacht receives a higher
probability than lachen, the task is solved correctly by
the language model. The author runs into
problems, however, when the candidates are tokenized
into multiple sub-word tokens, say lachen →
[lach, ##en]. In this case, the author simply
ignores the data point.</p>
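      <p>The following is a minimal sketch of this masked scoring with the transformers library (the checkpoint name follows footnote 2 in Section 4; this is an illustration of the procedure, not the code of Goldberg (2019)):</p>
      <preformat>
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-german-cased")
model = AutoModelForMaskedLM.from_pretrained("dbmdz/bert-base-german-cased")
model.eval()

def candidate_logit(masked_sentence, candidate):
    """Logit of `candidate` at the [MASK] position. Only meaningful when
    the candidate is a single token in the vocabulary; candidates that
    tokenize into multiple sub-words, say lachen → [lach, ##en], are the
    problematic case discussed above."""
    candidate_id = tokenizer.convert_tokens_to_ids(candidate)
    input_ids = tokenizer(masked_sentence, return_tensors="pt")["input_ids"]
    mask_index = (input_ids[0] == tokenizer.mask_token_id).nonzero().item()
    with torch.no_grad():
        logits = model(input_ids).logits
    return logits[0, mask_index, candidate_id].item()

# The task is solved if the grammatical candidate receives the higher logit:
solved = candidate_logit("Der Autor [MASK].", "lacht") > candidate_logit(
    "Der Autor [MASK].", "lachen")
      </preformat>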
      <p>Instead of discarding such sequences, we take
inspiration from Marvin and Linzen (2018) and
score whole sentences (without masks). However,
we still discard cases in which the two candidates
have a different number of sub-words after
tokenization, as we see the comparability impaired if
the resulting sequences of tokens are of different
lengths.</p>
      <p>We compute the sentence score with the
cross-entropy loss of the forward pass, using the input
sequence as the target:</p>
      <p>\[
\frac{1}{T} \sum_{i=1}^{T} \left( -f(S)_{i,S_i} + \log \sum_{j=1}^{V} \exp\left( f(S)_{i,j} \right) \right)
\]
where S is a sequence of T positive integer token
IDs and f : \mathbb{Z}^N \to \mathbb{R}^{N \times V} a language model
mapping N token IDs onto N token probabilities over
a vocabulary of size V. We compute this loss once with
the grammatical candidate in place and a second
time with the ungrammatical candidate in place.</p>
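      <p>A minimal sketch of this sentence scoring with the transformers library (our released code may differ in detail; the checkpoint name follows footnote 2 in Section 4):</p>
      <preformat>
import torch
import torch.nn.functional as F
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-german-cased")
model = AutoModelForMaskedLM.from_pretrained("dbmdz/bert-base-german-cased")
model.eval()

def sentence_loss(sentence):
    """Cross-entropy of the unmasked forward pass, with the input
    sequence S itself as the target (the equation above)."""
    input_ids = tokenizer(sentence, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        logits = model(input_ids).logits  # shape (1, T, V)
    # Mean over positions i of -f(S)_{i,S_i} plus the log-sum-exp
    # normalizer over the vocabulary, as in the equation above.
    return F.cross_entropy(logits[0], input_ids[0]).item()

# The instance counts as correct if the grammatical sentence scores lower:
correct = sentence_loss("Der Autor lachen.") > sentence_loss("Der Autor lacht.")
      </preformat>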
      <p>
        Please note that during the training of a
bidirectional language model, the points of interest need
to be masked to prevent information leakage
        <xref ref-type="bibr" rid="ref2">(Devlin et al., 2019)</xref>
        . In our case, information leakage
is not a problem because we compare two whole
sequences.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3 Syntactic Agreement Tests</title>
      <p>This section describes the syntactic agreement
tests we generated to evaluate German TLMs on.</p>
      <p>Our tests are inspired by the research of Marvin
and Linzen (2018) and Goldberg (2019). In
particular, we translate many of their tests on
subject-verb agreement (SVA) and reflexive anaphora
(RA) agreement from English to German
(Section 3.2). In addition, we design tests for syntactic
phenomena which are typical of the German
language (Section 3.3).</p>
      <p>The generated tasks cover a range of
difficulties. In German, the subject and the inflected verb
agree with regard to person and grammatical
number. In the simplest case, the sentences contain
only a subject and a verb. In the more
challenging cases, we added different types of distraction,
i.e. either additional non-subjective (pro)nouns as
candidates for subjects or other additional lexical
material making the sentences more complex.</p>
      <p>
        For our experiments, we consider instances
where the grammatical number of non-subjective
(pro)nouns matches the one of the subject as well
as examples where their grammatical number is
different. Furthermore, we distinguish between
local and non-local feature agreement, i.e. we
take into account whether or not the distractors
occur between the subject and its corresponding
verb. The described test scenario allows us to
compare the models’ performance with regard to the
features of the distractor as well as its distance to
the relevant verb. Therefore, the designed tests
expand the experimental setup of Marvin and Linzen
(2018) by going beyond the attractors, i.e.
intermissions defined as intervening nouns with the
opposite number from the subject
        <xref ref-type="bibr" rid="ref5">(Linzen et al.,
2016)</xref>
        .
      </p>
      <sec id="sec-3-1">
        <title>3.1 Dataset</title>
        <p>We created a dataset of 12,426 sentences using
hand-crafted context-free grammars (CFGs), as
illustrated in Example 3.1.</p>
        <p>Example 3.1. Context-free grammar for creating
sentences S from a vocabulary V to test agreement
in a simple sentence:</p>
        <p>S → NP V '.'
NP → ART N
ART → 'Die'
N → 'Autoren' | 'Richterinnen'
V → 'lachen' | 'reden'</p>
        <p>Output: Die Autoren lachen. / Die Autoren reden.
/ Die Richterinnen lachen. / Die Richterinnen
reden.</p>
        <p>As shown in the example, the CFG creates
sentences as output with varying lexical items but
with a relatively low variance. However, it allows
us to tightly control the generated sentences with
respect to the desired tests, in terms of distractor
features as well as syntactic structure and
correctness of the sentences.</p>
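        <p>For illustration, the grammar of Example 3.1 can be written down and exhaustively expanded, e.g. with NLTK (a sketch; we do not claim the data set was generated with this particular toolkit):</p>
        <preformat>
from nltk import CFG
from nltk.parse.generate import generate

# The CFG from Example 3.1 in NLTK notation ('|' separates alternatives).
grammar = CFG.fromstring("""
  S -> NP V '.'
  NP -> ART N
  ART -> 'Die'
  N -> 'Autoren' | 'Richterinnen'
  V -> 'lachen' | 'reden'
""")

# Enumerate every sentence the grammar licenses (2 nouns x 2 verbs = 4).
for tokens in generate(grammar):
    print(" ".join(tokens))
        </preformat>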
        <p>Our data set covers 14 test cases of different
challenge levels (Sections 3.2–3.3). The number
of sentences ranges from 64 to 2,160 with an
average of 1,035.5 sentences per test case. A sentence
consists of 6.88 tokens on average. The vocabulary
consists of 88 lexemes and 171 word forms. For our
corpus, we chose common words to build the
sentences, so that the TLMs were not confronted with
potentially unknown words.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2 Translated Agreement Tests</title>
        <p>In the following, we introduce the agreement tests
that we translated from the work of Marvin and
Linzen (2018).</p>
        <p>We describe three groups of tests ordered by
increasing challenge level: (1) local agreement,
no distractors, (2) local agreement, plus
distractors, and (3) non-local agreement, plus distractors.
Afterwards, we introduce tests designed to target
German phenomena specifically.</p>
        <p>Local agreement, no distractors. We first
include cases with local agreement and without a
distractor. Sentences consisting of only one
subject and verb are what we refer to as simple
sentences in the following, showcased in Example 3.2.
Example 3.2. Simple sentence with only one
subject and one verb (the locus of (un)grammaticality
is italic, the incorrect variant is preceded by *):
(a) Das Kind trinkt.
(b) * Das Kind trinken.</p>
        <p>Local agreement, plus distractors. Complex
sentences with local agreement in a sentential
complement or in an object relative clause
constitute the next level of difficulty. Those sentences
contain two subjects: one in the main clause, and
another one in the subordinate clause. In
Example 3.3, the latter functions as a sentential
complement, in Example 3.4, as an object relative clause.
For both types of subordinate clause, the verb
follows the subject directly. The subject of the main
clause is the distractor in these cases, while the
agreement between the subject and the verb of the
subordinate clause is our point of interest.
Example 3.3. SVA in a sentential complement:
(a) Die Vertreter sagten, dass das Kind trinkt.
(b) * Die Vertreter sagten, dass das Kind trinken.
Example 3.4. SVA in an object relative clause
(a) Der Autor, den die Vertreter kennen, lacht.
(b) * Der Autor, den die Vertreter kennt, lacht.
Non-local agreement, plus distractors. We also
tested TLMs on a set of constructions with
non-local agreement, induced by potentially distracting
words and phrases between the head of the subject
and its corresponding verb. With these tasks, we
are testing the language model’s ability to attend
to the subject in sentences across long contexts.</p>
        <p>Our first test case is an SVA across a
prepositional phrase (PP). We created sentences with the
subject modified by a directly following PP, which
includes a potentially attracting noun, as in
Example 3.5.</p>
        <p>Example 3.5. SVA across a PP
(a) Der Autor neben den Landstrichen lacht.
(b) * Der Autor neben den Landstrichen lachen.</p>
        <p>Furthermore, we test SVAs across subject
relative clauses which include one potentially
distracting object and verb in between subject and
corresponding verb, as in Example 3.6.</p>
        <p>Example 3.6. SVA across a subject relative clause
(a) Der Autor, der die Architekten liebt, lacht.
(b) * Der Autor, der die Architekten liebt,
lachen.</p>
        <p>The same challenge exists for SVAs across
object relative clauses which also contain potentially
distracting chunks and separate the subject and its
corresponding verb, as in Example 3.7.</p>
        <p>Example 3.7. SVA across an object relative
clause
(a) Der Autor, den die Vertreter kennen, lacht.
(b) * Der Autor, den die Vertreter kennen,
lachen.</p>
        <p>Additionally, we designed various sentences for
testing SVAs across coordinated verbal phrases
(VP), where the subject must agree in person and
number with the finite verb included in each VP. In
our test, the point of interest is the second verb of
the coordination. This kind of structure challenges
the model to recognize that the complete
subject-verb structure does not end after the first verb, but
rather also includes the second verb. We test the
SVA in verbal coordinations of different lengths
and with varying numbers of distractors.</p>
        <p>First, we test the model on sentences consisting
of a short and simple VP coordination with no
distractors, as illustrated by Example 3.8.</p>
        <p>Example 3.8. SVA in short VP coordinations (i.e.
with no distractors)
(a) Der Autor schwimmt und lacht.
(b) * Der Autor schwimmt und lachen.</p>
        <p>To increase the difficulty level, we inserted noun
phrases having a different number than the subject
into the coordinated VP. We distinguish between
verbal coordinations with a single noun
distractor (Example 3.9) and two noun distractors
(Example 3.10).</p>
        <p>Example 3.9. SVA in medium VP coordinations
(i.e. with a single noun distractor)
(a) Der Autor redet mit Menschen und lacht.
(b) * Der Autor redet mit Menschen und lachen.
Example 3.10. SVA in long VP coordinations (i.e.
with two noun distractors)
(a) Der Autor redet mit Menschen und verfolgt
die Fernsehprogramme.
(b) * Der Autor redet mit Menschen und verfolgen
die Fernsehprogramme.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3 Novel Agreement Tests</title>
        <p>In addition to the tests above that we based on
previous work, we also designed tasks which target
constructs that are more specific to the German
language.</p>
        <p>First, we test the agreement between a verb and
its corresponding subject containing an extended
modifier, i.e. an adjective modifying a subject and
extended by a further subordinate nominal or
prepositional phrase. The extended modifier is
positioned between the determiner and the noun of
the subject. In comparison to English, the
German language is much more flexible with regard
to the number and the types of allowed extensions.
To test the impact of nouns used within extended
modifiers of a subject we include sentences with
simple modifiers as well as with extended
modifiers (Example 3.11 and 3.12).</p>
        <p>Example 3.11. SVA with a simple modifier
(a) Die wartenden Autoren lachen.
(b) * Die wartenden Autoren lacht.</p>
        <p>Example 3.12. SVA with an extended modifier
(a) Die die Pflanze liebenden Autoren lachen.
(b) * Die die Pflanze liebenden Autoren lacht.</p>
        <p>Another agreement test relates to the more
diverse word order in German in comparison to
English. Example 3.13 illustrates the shift of the
direct object diese Romane from its standard
position in the middle-field (after the finite verb) to the
pre-field, and the shift of the subject der Autor to
the middle-field from its standard position in the
pre-field (before the finite verb). This movement
would not be possible in English. The German
language often allows the shift, since it marks the
case of noun phrases by the inflectional suffix of
their determiner (e.g. der Autor in nominative case
vs. den Autor in accusative case) and sometimes
also by the suffix of the noun itself (e.g. des
Autors in genitive). That property helps to
distinguish subjects (always in nominative case) from
objects or adjuncts independently of their position in
a sentence. With this test case, we can evaluate if
the model recognizes the subject in sentences
correctly, even though the subject-verb-object order is
disregarded. We exclude test sentences where the
subject and the object have the same inflectional
suffixes in nominative and accusative, i.e. an
unambiguous distinction between subject and object
is not possible solely based on the inflection.</p>
        <p>Example 3.13. Pre-field
(a) Diese Romane empfahl der Autor.
(b) * Diese Romane empfahlen der Autor.</p>
        <p>Moreover, we created sentences with reflexive
verbs, i.e. sentential phrases where the reflexive
anaphora (RA) in the accusative case follows the
verb and agrees with the subject in
grammatical number and person. The first sentence in
Examples 3.14 and 3.15 illustrates the agreement
between RA mich (accusative case) and the subject
ich in person (first) and number (singular). We use
two different tests: (a) for the recognition of a
correct person (Example 3.14), also used by Marvin
and Linzen (2018), and (b) for the recognition of
a correct case (accusative instead of incorrect
dative, Example 3.15). The correct number is always
given.</p>
        <p>Example 3.14. Subject RA agreement (person
agreement)
(a) Ich bedanke mich.
(b) * Ich bedanke sich.</p>
        <p>Example 3.15. RA in accusative (case
agreement)
(a) Ich bedanke mich.
(b) * Ich bedanke mir.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Experiments</title>
      <p>In this section, we introduce the models we
evaluate and in particular highlight their
similarities and differences. We probe transformer-based
BERT models because they are currently the
basis for many state-of-the-art downstream models
and very prominent in the community. The model
selection was driven and confined by availability.
We made use of Wolf et al. (2019)’s transformers
package.1</p>
      <p>The first model, which we refer to as GBERTlarge,
is a community model provided by the Bavarian
State Library.2 It was trained on multiple
German corpora including a recent Wikipedia dump,
EU Bookshop corpus, the Open Subtitles corpus,
a CommonCrawl corpus, a ParaCrawl corpus and
the News Crawl corpus, with 16 GB of training
material in total.</p>
      <p>The second model, which we refer to as
distilGBERT, was trained on half of the data used
to pretrain BERT, using distillation with the
supervision of GBERTlarge.3</p>
      <p>The data set, the CFGs with the list of lexical
items and the code for the experiments are publicly
available.4</p>
      <p>1 https://github.com/huggingface/transformers (Accessed: 2020-03-05)</p>
      <p>2 https://huggingface.co/dbmdz/bert-base-german-cased (Accessed: 2020-03-05)</p>
      <p>3 https://github.com/huggingface/transformers/blob/master/examples/distillation/README.md (Accessed: 2020-05-21)</p>
      <p>4 https://github.com/DFKI-NLP/gevalm/</p>
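      <p>For reference, both checkpoints can be loaded through the transformers package. The GBERTlarge identifier follows footnote 2; the distilGBERT hub name below is our assumption, derived from the distillation example in footnote 3:</p>
      <preformat>
from transformers import AutoModelForMaskedLM, AutoTokenizer

CHECKPOINTS = {
    # Community model provided by the Bavarian State Library (footnote 2).
    "GBERTlarge": "dbmdz/bert-base-german-cased",
    # Distilled German BERT; the exact hub name is an assumption here.
    "distilGBERT": "distilbert-base-german-cased",
}

models = {}
for name, checkpoint in CHECKPOINTS.items():
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForMaskedLM.from_pretrained(checkpoint).eval()
    models[name] = (tokenizer, model)
      </preformat>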
    </sec>
    <sec id="sec-5">
      <title>5 Results &amp; Discussion</title>
      <p>The coarse-grained results of our experiments are
listed in Table 1. We note that both models
perform well across the majority of tasks. This is
in line with previous work that demonstrated that
BERT models are capable of solving syntactic
agreement tasks. As shown by Goldberg (2019)
for English, for instance, our most successful
German BERT model, GBERTlarge, also scores above
80% or even 90% on most of the tasks, whereas the
LSTM-LMs probed by Marvin and Linzen (2018)
achieved scores of at most 74%.</p>
      <p>We observe that GBERTlarge outperforms
distilGBERT in thirteen out of fourteen tasks. For
example, in the case of SVA across an object
relative clause, GBERTlarge achieved a score of
92.06%, whereas distilGBERT’s score is lower by
around 18 percentage points. Based on these
observations, we assume that the larger amount of
German training data that GBERTlarge was trained
on is the distinguishing factor.</p>
      <p>There is a large overlap among the most
challenging stress tests. Four out of five tests align
when sorted in ascending order (worst
performance first, underscored in Table 1). To analyse
the stress tests further, in Table 2, we subdivide
cases between singular and plural subjects and
distractors.</p>
      <p>We expected high accuracies for the cases with
local agreement. Our results show that all those
cases, which are Simple Sentence, SVA in a
sentential complement, SVA in an object relative clause
and SVA with a simple modifier, have a score
above 94% for both models.</p>
      <p>Regarding the German-specific syntactic
constructs, we observe that both models perform well.
The movement of the subject from pre-field to
middle-field does not seem to cause any major
problems: both distilGBERT and GBERTlarge achieve
an accuracy of around 80%.</p>
      <p>As can be seen in Tables 1 and 2, VP
coordination probing cases were a major challenge for both
models. For example, distilGBERT only achieves an
overall accuracy of 0.4813 on SVA in a medium VP
coordination and 0.5167 on SVA in a long VP
coordination, while GBERTlarge achieves 0.6188 and
0.5938, respectively. In these aspects, our results
deviate considerably from the findings of
Goldberg (2019) who reported that the English BERT
models performed well on long VP tasks, too. The
respective syntactic constructs may thus be
particularly challenging for the BERT models in the
German language. Interestingly, according to Table 2,
GBERTlarge performs with an accuracy of 1.0 for
long VPs with a singular subject. We note that the
most challenging sentences for both models in all
of the VP coordination cases were the ones with a
plural subject.</p>
      <p>In contrast to the aforementioned VP
coordinations, SVA across an object relative clause for both
models and SVA across a subject relative clause
for distilGBERT show better accuracy on
sentences where the subject is plural. We assume that
in some cases the grammatical number of the
subject is more influential for the result than
the number of the distractor. We did not expect this,
given that we used the same lexemes within one
case to ensure comparability between the results.</p>
      <p>We expected that sentences in which the
grammatical number of the distractor deviates from the
number of the relevant verb (singular-plural and
plural-singular) would yield lower accuracy. This,
however, applies only to a few cases, like SVA across
an object relative clause and Pre-field. Thus, the
TLMs appear to be mostly robust against those
distractors.</p>
      <p>Inferring sound causes for why some syntactic
constructs push the models to their limits would
require a thorough statistical analysis of the data
and probably even an introspective analysis of the
model. We leave it to future work to conduct such
an analysis.</p>
    </sec>
    <sec id="sec-6">
      <title>6 Related Work</title>
      <p>There is a large body of related literature on the
syntactic evaluation of language models. For more
background, we refer the interested reader to the
works cited in the influential contribution by
Marvin and Linzen (2018) and Goldberg (2019).</p>
      <p>Gulordava et al. (2018) assessed subject-verb
agreement with an emphasis on syntactic over
semantic preference. McCoy et al. (2019)
created a data set with entailment tests. Bacon and
Regier (2019) extended Goldberg (2019) to 26
languages, excluding German, and found that
with a higher number of distractors and long-range
dependencies, BERT achieves lower accuracies for
the syntactic agreement tests.</p>
        <p>As mentioned above, we also recommend the
overview paper by Rogers et al. (2020) on
studies of BERT models specifically. Apart from the
experiments cited in this work that evaluate
multilingual models, such as MBERT, we are not aware
of any study dedicated to the agreement analysis
of German BERT models.</p>
        <p>Rönnqvist et al. (2019), nevertheless, tested
multilingual BERT models on their hierarchical
understanding of German sentences and
with a cloze test for which an arbitrary
(grammatically correct) word was masked and needed to be
filled in again.</p>
    </sec>
    <sec id="sec-7">
      <title>7 Conclusion</title>
      <p>We conducted a broad analysis of German BERT
models, targeting their syntactic abilities. We
translated agreement tests from English to German
and also designed tasks that reflect syntactic
phenomena that are typical for the German language.
The data set we generated and the accompanying
grammars are publicly available.</p>
      <p>Furthermore, we utilized the cross-entropy loss
to score whole natural sentences and this way
mitigated a problem with sub-word tokenization. Our
source code is open source, too.</p>
      <p>Our experimental results show that the German
models perform very well on most of the
agreement tasks. However, we also identified syntactic
stress tests that models for other languages appear
to solve much better. We plan to replace the
synthetic sentences with real language samples in the
future, to better reflect the diversity of the German
language in our experiments.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgements</title>
      <p>We would like to thank Leonhard Hennig for his
valuable feedback. This work has been supported
by the German Federal Ministry of Education and
Research as part of the project XAINES.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Geoff</given-names>
            <surname>Bacon</surname>
          </string-name>
          and
          <string-name>
            <given-names>Terry</given-names>
            <surname>Regier</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Does BERT agree? Evaluating knowledge of structure dependence through agreement relations</article-title>
          . arXiv preprint arXiv:1908.09892.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Jacob</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ming-Wei</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kenton</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Kristina</given-names>
            <surname>Toutanova</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          .
          <source>In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers), pages
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Yoav</given-names>
            <surname>Goldberg</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Assessing BERT's syntactic abilities</article-title>
          . arXiv preprint arXiv:1901.05287.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Kristina</given-names>
            <surname>Gulordava</surname>
          </string-name>
          , Piotr Bojanowski, Edouard Grave, Tal Linzen, and
          <string-name>
            <given-names>Marco</given-names>
            <surname>Baroni</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Colorless Green Recurrent Networks Dream Hierarchically</article-title>
          .
          <source>In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (
          <issue>Long Papers)</issue>
          , pages
          <fpage>1195</fpage>
          -
          <lpage>1205</lpage>
          , New Orleans, Louisiana. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Tal</given-names>
            <surname>Linzen</surname>
          </string-name>
          , Emmanuel Dupoux, and
          <string-name>
            <given-names>Yoav</given-names>
            <surname>Goldberg</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Assessing the ability of LSTMs to learn syntax-sensitive dependencies</article-title>
          .
          <source>Transactions of the Association for Computational Linguistics</source>
          ,
          <volume>4</volume>
          :
          <fpage>521</fpage>
          -
          <lpage>535</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Rebecca</given-names>
            <surname>Marvin</surname>
          </string-name>
          and
          <string-name>
            <given-names>Tal</given-names>
            <surname>Linzen</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Targeted syntactic evaluation of language models</article-title>
          .
          <source>In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing</source>
          , pages
          <fpage>1192</fpage>
          -
          <lpage>1202</lpage>
          , Brussels, Belgium. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Tom</given-names>
            <surname>McCoy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ellie</given-names>
            <surname>Pavlick</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Tal</given-names>
            <surname>Linzen</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference</article-title>
          .
          <source>In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>
          , pages
          <fpage>3428</fpage>
          -
          <lpage>3448</lpage>
          , Florence, Italy. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Fabio</given-names>
            <surname>Petroni</surname>
          </string-name>
          , Tim Rocktäschel, Sebastian Riedel,
          <string-name>
            <given-names>Patrick</given-names>
            <surname>Lewis</surname>
          </string-name>
          , Anton Bakhtin,
          <string-name>
            <given-names>Yuxiang</given-names>
            <surname>Wu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Alexander</given-names>
            <surname>Miller</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Language Models as Knowledge Bases?</article-title>
          <source>In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLPIJCNLP)</source>
          , pages
          <fpage>2463</fpage>
          -
          <lpage>2473</lpage>
          ,
          Hong Kong, China. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Colin</given-names>
            <surname>Raffel</surname>
          </string-name>
          , Noam Shazeer, Adam Roberts,
          <string-name>
            <given-names>Katherine</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Sharan</given-names>
            <surname>Narang</surname>
          </string-name>
          , Michael Matena,
          <string-name>
            <given-names>Yanqi</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Wei</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Peter J.</given-names>
            <surname>Liu</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Exploring the limits of transfer learning with a unified text-to-text transformer</article-title>
          . arXiv preprint arXiv:1910.10683.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Anna</given-names>
            <surname>Rogers</surname>
          </string-name>
          , Olga Kovaleva, and
          <string-name>
            <given-names>Anna</given-names>
            <surname>Rumshisky</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>A primer in BERTology: What we know about how BERT works</article-title>
          . arXiv preprint arXiv:2002.12327.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Samuel</given-names>
            <surname>Rönnqvist</surname>
          </string-name>
          , Jenna Kanerva, Tapio Salakoski, and
          <string-name>
            <given-names>Filip</given-names>
            <surname>Ginter</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Is multilingual BERT fluent in language generation?</article-title>
          <source>In Proceedings of the First NLPL Workshop on Deep Learning for Natural Language Processing</source>
          , pages
          <fpage>29</fpage>
          -
          <lpage>36</lpage>
          , Turku, Finland. Linköping University Electronic Press.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Ian</given-names>
            <surname>Tenney</surname>
          </string-name>
          , Patrick Xia, Berlin Chen, Alex Wang,
          <string-name>
            <given-names>Adam</given-names>
            <surname>Poliak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. Thomas</given-names>
            <surname>McCoy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Najoung</given-names>
            <surname>Kim</surname>
          </string-name>
          , Benjamin Van Durme,
          <string-name>
            <given-names>Samuel R.</given-names>
            <surname>Bowman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Dipanjan</given-names>
            <surname>Das</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Ellie</given-names>
            <surname>Pavlick</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>What do you learn from context? Probing for sentence structure in contextualized word representations</article-title>
          . arXiv preprint arXiv:1905.06316.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Ashish</given-names>
            <surname>Vaswani</surname>
          </string-name>
          , Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
          <string-name>
            <given-names>Łukasz</given-names>
            <surname>Kaiser</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Illia</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Attention is all you need</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          , pages
          <fpage>5998</fpage>
          -
          <lpage>6008</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Wolf</surname>
          </string-name>
          , Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and
          <string-name>
            <given-names>Jamie</given-names>
            <surname>Brew</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>HuggingFace's Transformers: State-of-the-art natural language processing</article-title>
          . arXiv preprint arXiv:1910.03771.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>Zhuosheng</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Yuwei Wu,
          <string-name>
            <given-names>Hai</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Zuchao</given-names>
            <surname>Li</surname>
          </string-name>
          , Shuailiang Zhang, Xi Zhou, and
          <string-name>
            <given-names>Xiang</given-names>
            <surname>Zhou</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Semantics-aware BERT for language understanding</article-title>
          . arXiv preprint arXiv:1909.02209.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>Zhuosheng</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Junjie Yang, and
          <string-name>
            <given-names>Hai</given-names>
            <surname>Zhao</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Retrospective reader for machine reading comprehension</article-title>
          . arXiv preprint arXiv:2001.09694.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>