<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>German Question Tags: A Computational Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yulia Clausen</string-name>
          <email>yulia.clausen@rub.de</email>
          <aff>Germanistisches Institut, Ruhr-Universität Bochum, Germany</aff>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <abstract>
        <p>The German language exhibits a range of question tags that can typically, but not always, be substituted for one another. Moreover, the same words can have other meanings while occurring in the sentence-final position. The tags’ felicity conditions were addressed in previous corpus-based and experimental work and attributed to semantic and pragmatic properties of tag questions. This paper addresses the question of whether and to what extent the differences among German tags can be determined automatically. We assess the performance of three pretrained German BERT models on a tag question dataset and fine-tune one of these models on the tag word prediction task. A close examination of this model’s output indicates that BERT can identify properties relevant for the tags’ felicity conditions and interchangeability consistent with previous studies.</p>
      </abstract>
      <kwd-group>
        <kwd>tag questions</kwd>
        <kwd>German</kwd>
        <kwd>tags</kwd>
        <kwd>annotation</kwd>
        <kwd>BERT</kwd>
        <kwd>clustering</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>(1) Lina says to her sister as they go out of the cinema:
Der Film war gut, ne?/nicht?/oder?
‘The film was good, wasn’t it?’</p>
      <p>(2) Lina comes back from the movies and says to her sister (who did not want to come):
Der Film war gut, ne!/nicht!/*oder!
‘The film was good, you know!’</p>
      <p>In (1), different tags are equally suitable for requesting confirmation of whether Lina’s sister also liked the film. In (2), however, Lina’s sister is requested to confirm her acknowledgment of the provided information, in which case oder is infelicitous [cf. 4].</p>
      <p>Felicity conditions of German tags were addressed in previous experimental and corpus-based studies, and several factors were identified as crucial for the tags’ (non-)interchangeability. Among those are syntactic and semantic properties of tag questions (TQs), as well as pragmatic inferences arising from various contextual aspects (see Section 2 for details). In this study, we pursue the question of whether the similarities and/or differences among tags, and hence cases of their potential interchangeability, can be modeled automatically. Language models, such as BERT [6], are known for their capacity to leverage semantic and other types of linguistic information from the context around a given word (see e.g. [17] for an overview). Therefore, we test whether and how well BERT can identify the properties of German tags, such as those defined in previous work, and whether we can gain new insights from this into the tags’ felicity conditions.</p>
      <p>It is worth noting that there exists another TQ-relevant distinction in German: Words functioning as tags can have other meanings while occurring in the tag position (i.e., end of a sentence). For example, nicht is also a negation particle (e.g., Kennst du das nicht? ‘Don’t you know that?’). This is a different kind of distinction, since semantically TQs differ considerably from other sentence types ending with the same word. We thus include both types of sentences in our analysis. We expect the sentence type distinction to be easier for BERT than determining the differences among individual tags. The latter, however, is of primary interest to us.</p>
      <p>Our paper makes the following contributions. We test the capacities of three existing pretrained German BERT models to differentiate among question tags as well as between TQs and other sentence types. We find that while most models capture the sentence type distinction quite well, they struggle with semantic/pragmatic differences within the tag class. Instead, BERT demonstrates a strong dependence on structural features, such as punctuation. We apply K-Means clustering to the embeddings produced by one of these models and test the overlap of the generated clusters with the linguistic properties of TQs defined in previous work. We find indications as to which of those properties are relevant for the tags’ felicity conditions in accordance with previous findings. Finally, we fine-tune the selected model on the next word prediction task with respect to two aspects: prediction of the word class (tag vs. no-tag) and form (e.g., oder vs. ne). Our experiments show that the fine-tuned model outperforms the original one in both tasks, while at the same time revealing the importance of the dataset size for meaningful prediction.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <sec id="sec-2-1">
        <title>2.1. German question tags</title>
        <p>The meaning and felicity conditions of German tags were addressed in recent corpus-based and experimental studies [<xref ref-type="bibr" rid="ref3 ref4 ref5 ref9">3, 4, 5, 9</xref>]. Several semantic/pragmatic as well as syntactic factors were found crucial for the tags’ felicity conditions and their interchangeability potential. Anchor clause type and speech act provide certain indications regarding the tags’ felicity, such that, e.g., imperative directives as in (3) are compatible only with ja [cf. 4].</p>
        <sec id="sec-2-1-1">
          <title>Max wants to play football with his friends, but his father says:</title>
          <p>Mach erst deine Hausaufgaben, ja!/*ne!/*nicht!/*oder!
‘Do your homework 昀椀rst!2’</p>
        <p>Oftentimes, additional context is required, though. For example, the TQ anchors in (1) and (2) in Section 1 are both declarative assertions, but oder is felicitous only in the former. In such cases, information about the interlocutors’ epistemic authority provides additional clues, e.g., whether the speaker is informing the addressee or asking for a confirmation (cf. statements vs. questions as functions of TQs in [12]). If the speaker is the source of information, the use of oder is typically ruled out. Further constraints are provided by the type of requested confirmation, i.e., the aspect of the anchor proposition the addressee is requested to confirm (target of confirmation in [<xref ref-type="bibr" rid="ref4 ref20">4, 20</xref>]). An example would be agreement with the speaker’s opinion vs. acknowledgment of the provided information in (1) vs. (2) in Section 1.</p>
        <p>These linguistic properties have been found to correlate with different tags as well as with each other to varying degrees ([4], p. 26), and while some of them are straightforward (e.g., anchor clause type), others are more complex and need to be inferred from the context (e.g., target of confirmation).</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Language modeling</title>
        <p>Among the growing amount of work on next word prediction with language models, several studies have focused on linguistic elements in the sentence-final position. Kato, Miyata, and Sato [11] use BERT to generate simplified substitutions for Japanese sentence-ending predicates. Li, Grissom II, and Boyd-Graber [13] predict sentence-final verbs for German and Japanese with neural models for two tasks: predicting the exact verb and a semantically similar one. Mandokoro, Oka, Matsushima, Fukada, Yoshimura, Kawahara, and Tanaka [15] train a BERT model on the task of Japanese sentence-final particle prediction.</p>
        <p>Ettinger [7] explores the role of different types of information in prediction of the sentence-final word on the basis of its left-side context for English. Similarly, we implement the tag word prediction task informed only by the tag’s left-side context. The factors tested in [7] are similar to those that play a role in the felicity conditions of German tags: semantic roles, event knowledge, and pragmatic inferences. Ettinger finds them to be particularly challenging for BERT.</p>
        <p>To our knowledge, there are no studies that explore the features of question tags or focus on
automatic tag prediction with language models.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Data</title>
        <p>We work with the TQ dataset from [4], built from three German corpora: CallHome (CH) [10], OpenSubtitles (OS) [<xref ref-type="bibr" rid="ref14">14</xref>], and Twitter (TW) [19]. This dataset contains automatically extracted TQ candidates that need to be manually disambiguated as to whether or not they end with a tag. We confine our analysis to the four most frequent tags (ja, ne, nicht, and oder), for which we unified and annotated the data with the tag/no-tag labels. The annotation was performed by four annotators: the author of this paper and three annotators with a linguistic background. The latter were provided with the annotation guidelines. To ensure the annotation quality, the author of this paper independently annotated approx. 1,000 TQ candidates from each annotator’s file. High inter-annotator agreement was reached on these data subsets: a Cohen’s kappa score of 0.9 with annotator 1 and 0.78 with annotator 2. (We could not calculate the inter-annotator agreement with annotator 3, as they did not complete their annotations, so that there were no overlapping annotations available for comparison. The annotation in this case was completed by the author of the paper.) Any conflicting annotations in these data subsets, i.e., between the author of the paper and each respective annotator, were resolved afterwards. Table 1 shows the number of annotated tag words per corpus used in this study. The annotated dataset and the annotation guidelines are available via the Open Science Framework: https://osf.io/pcng9.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Tag word embeddings</title>
      <p>We test the following existing pretrained German BERT models: bert-base-german-cased (https://huggingface.co/bert-base-german-cased), bert-base-german-dbmdz-cased (https://github.com/dbmdz/berts#german-bert), and gbert-large (https://huggingface.co/deepset/gbert-large; [2]).</p>
      <p>To generate the tag word embeddings, we extracted one TQ candidate from each record in the dataset. (Some records consist of several sentences (e.g., a tweet) and hence can contain more than one TQ candidate. We extracted each record’s first sentence ending with one of the relevant tag words.) Depending on the corpus, we applied different preprocessing steps to the extracted TQ candidates. For CH and OS, we removed all meta-language sequences. For TW, we stripped URLs (end of sentence), hashtags and @username mentions (beginning and end of sentence), and common emoticons (anywhere in sentence). Furthermore, we excluded TQ candidates consisting of fewer than three tokens including the tag word, in order to eliminate (most of) the short sequences bearing little meaning. Finally, we removed all duplicates based on case-sensitive string comparison. Examples of the preprocessed sentences in the final dataset are given in Table 2.</p>
      <p>We fed the preprocessed TQ candidates through each model and obtained embeddings consisting of either 12 layers with 768 dimensions (bert-base-german-cased and bert-base-german-dbmdz-cased) or 24 layers with 1,024 dimensions (gbert-large) per token. To get a single embedding per token, we concatenated each token’s last four layers, thus obtaining one vector with 3,072 (bert-base-german-cased and bert-base-german-dbmdz-cased) or 4,096 (gbert-large) dimensions. Finally, we extracted each tag word’s embedding, which we use here as its contextual representation.</p>
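      <p>A rough sketch of this extraction with the HuggingFace Transformers library follows; the example sentence is ours, and the tag is assumed to be a single wordpiece:</p>
      <preformat>import torch
from transformers import AutoTokenizer, AutoModel

name = "bert-base-german-dbmdz-cased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_hidden_states=True)

inputs = tokenizer("Der Film war gut, oder?", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).hidden_states   # embedding layer + 12 transformer layers

# Concatenate each token's last four layers: 4 x 768 = 3,072 dimensions.
token_vecs = torch.cat(hidden[-4:], dim=-1).squeeze(0)

# The tag word "oder" is the third-to-last wordpiece: [..., "oder", "?", "[SEP]"].
tag_vec = token_vecs[-3]
print(tag_vec.shape)                         # torch.Size([3072])</preformat>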
    </sec>
    <sec id="sec-5">
      <title>5. BERT model comparison</title>
      <p>This section discusses the output of the three BERT models with respect to the tag/no-tag distinction and the differences among the tag forms. We reduce the embeddings to three components with Principal Component Analysis (PCA), using the scikit-learn implementation (https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html), and map them into a vector space. We use the visualized data for our analysis and provide a more compact version of the plots in Appendix A for illustration. (The plots in Appendix A were created with seaborn, https://seaborn.pydata.org/. The interactive 3D plots used for our analysis were created with matplotlib, https://matplotlib.org/, and are available via the Open Science Framework: https://osf.io/pcng9.)</p>
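      <p>A minimal sketch of this reduction step (the input array is placeholder data standing in for the BERT tag vectors):</p>
      <preformat>import numpy as np
from sklearn.decomposition import PCA

tag_vectors = np.random.rand(500, 3072)   # placeholder for the BERT tag vectors

pca = PCA(n_components=3)
points = pca.fit_transform(tag_vectors)   # shape (500, 3), used for the 3D plots</preformat>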
      <sec id="sec-5-1">
        <title>5.1. bert-base-german-cased</title>
        <p>This model differentiates prima facie well among the four tag words: Vectors representing the same tags are densely grouped together, while distinct tags are visibly separated from each other (Figures 1a, 2a, 3a in Appendix A). However, each vector group is a tag/no-tag mixture (except for oder, which has no no-tag counterparts). This suggests that this model only differentiates between the surface forms of the tag words, and will most likely be insufficient in handling finer-grained distinctions, such as different types of utterances ending with the same word.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. bert-base-german-dbmdz-cased</title>
        <p>The vector groups generated by this model are less dense and have visually less space between them compared to bert-base-german-cased (Figures 1b, 2b, 3b in Appendix A). Nonetheless, the model differentiates well among the tags and provides a reasonable tag/no-tag separation in most cases. Furthermore, it subdivides the tag groups, which is not the case with bert-base-german-cased. This is particularly prominent for CH (ja, ne, and nicht) and TW (all tags).</p>
        <p>We find that the formation of subgroups (among the TQs ending with the same tag) is tied to punctuation. Tags are placed into different subgroups depending on whether they are followed by a question mark or a period. This is consistent across the tags and corpora. The tag-preceding comma also plays a role: The tags are either clearly separated (e.g., ‘, ja?’ vs. ‘ja?’ in OS/TW), or there is a gradual transition from one punctuation type to another within a subgroup (e.g., ‘ne.’ vs. ‘, ne.’ in CH).</p>
        <p>The tag/no-tag groups typically partially overlap in cases of matching punctuation (e.g., ja in OS). Given that tags with different punctuation form distinct subgroups, this suggests that the model considers tags and no-tags with the same punctuation to be more similar than the same tags with different punctuation. Thus, structural features seem to dominate over potential syntactic/semantic differences between TQs and other sentence types ending with the same tag word.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. gbert-large</title>
        <p>This model falls in between the other two, as its output looks similar to that of bert-base-german-cased in terms of compact, spatially well-separated vector groups, while at the same time providing a good tag/no-tag distinction akin to bert-base-german-dbmdz-cased (Figures 1c, 2c, 3c in Appendix A). The model shows a stable pattern across the three corpora: While the vector groups representing different tags are spatially separated, the tag/no-tag instances are situated in very close proximity to each other and even partially overlap (ja and nicht in all corpora; ne in TW). The tag/no-tag distinction for nicht generally seems to be most definite, showing practically no overlap in OS and TW. (The clear tag/no-tag distinction for nicht is also made by the bert-base-german-dbmdz-cased model.)</p>
        <p>This model also differentiates based on punctuation. In some cases, tags are divided into two distinct subgroups based on the end punctuation (ne and ja in CH). In most cases, though, the tags are ordered within their respective groups: Tags followed by a question mark and preceded by a comma are situated on one side of the vector group, whereas those ending with a period are placed on its other end. The latter is also where a (partial) overlap with the no-tags takes place, as no-tags are largely followed by a period.</p>
      </sec>
      <sec id="sec-5-1">
        <title>5.4. Summary</title>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Clustering</title>
      <p>In this section, we focus only on the tag part of the data and apply the K-Means clustering algorithm to the BERT-generated tag vectors. (We used the scikit-learn implementation: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html. The clusters are built on the original BERT vectors; the PCA-reduced vectors are used only for visualization purposes.) As discussed in the previous section, BERT groups tags by their form (and punctuation). By means of clustering, we explore whether there are any common features across these tag groups. Our assumption is that distinct tags that occur in similar contexts will have similar linguistic properties encoded in their vector representations and will hence be clustered together.</p>
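      <p>A minimal sketch of this clustering setup, run per corpus (placeholder data; any configuration beyond the scikit-learn defaults is our assumption):</p>
      <preformat>import numpy as np
from sklearn.cluster import KMeans

vectors = np.random.rand(500, 3072)                           # placeholder BERT tag vectors
forms = np.random.choice(["ja", "ne", "nicht", "oder"], 500)  # tag form per vector

for k in range(4, 11):                                        # k = 4 ... 10
    labels = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(vectors)
    for c in range(k):
        members = sorted(set(forms[labels == c]))
        if len(members) &gt; 1:                                  # an "impure" cluster
            print(f"k={k}, cluster {c}: {members}")</preformat>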
      <sec id="sec-6-1">
        <title>6.1. Cluster analysis</title>
        <p>We experiment with different numbers of clusters (k), starting with 4 (i.e., the number of tags in the dataset) and increasing it in single steps up to 10. As discussed in Section 1, tags are interchangeable only in certain contexts, which is why we are interested in impure clusters, i.e., the ones where different tag groups are partially clustered together.</p>
        <p>The general tendency we observe is that with higher k’s, each tag form is allocated to a distinct cluster or even divided into multiple clusters. Hence, we determine the highest k (below 10) with which any different tags are still clustered together, and examine the resulting impure clusters in more detail. Following this strategy, we select k = 9 for CallHome, k = 7 for Twitter, and k = 4 for OpenSubtitles. Table 3 shows the impure clusters. An overview of all clusters generated with the respective k’s can be found in Figure 4 in Appendix A.</p>
        <table-wrap id="tab3">
          <label>Table 3</label>
          <caption>
            <p>Impure K-Means clusters per corpus. k denotes the overall number of clusters, { } mark cluster boundaries, subscript numbers indicate cluster IDs in plots.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Corpus</th><th>k</th><th>Impure clusters</th></tr>
            </thead>
            <tbody>
              <tr><td>CallHome</td><td>9</td><td>{part nicht, part oder}3, {part ne, part nicht, 1 oder}7</td></tr>
              <tr><td>Twitter</td><td>7</td><td>{part ja, part nicht}2</td></tr>
              <tr><td>OpenSubtitles</td><td>4</td><td>{ne, 2 nicht, oder}4</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <sec id="sec-6-1-1">
          <title>6.1.1. CallHome</title>
          <p>Two impure clusters were generated with k = 9 (Figure 4a in Appendix A). The cluster {part nicht, part oder}3 contains the instances of these tags that are followed by a question mark and preceded by a comma. This makes up a part of the oder-subgroup and the complete nicht-subgroup with a question mark. A closer look at TQs in this corpus reveals that the ones with a question mark express requests for information or an opinion from the addressee (cf. questions and statement-question blends [12]).</p>
          <p>The cluster {part ne, part nicht, 1 oder}7 contains tags without the preceding comma and followed by a period (including occasional cases of alternative punctuation). This corresponds to a part of each respective tag’s subgroup. TQs ending with a period in this corpus are those where the speaker has epistemic authority and provides information or an opinion.</p>
          <p>We conclude that the clustering method supports the punctuation-based distinction among TQs, e.g., by utilizing the tag-preceding punctuation as a clustering criterion. The observed correlation between the end punctuation and certain TQ types can be attributed to the fact that CH contains transcribed data, where, evidently, question marks and periods represent the rising and falling intonation, respectively. This, in turn, corresponds (at least roughly) to the addressee vs. speaker epistemic authority. This correlation should be taken with a grain of salt, though, as it is not necessarily the case with other corpora, e.g., Twitter users do not follow punctuation rules strictly.</p>
        </sec>
        <sec id="sec-6-1-2">
          <title>6.1.2. OpenSubtitles</title>
          <p>One impure cluster – {ne, 2 nicht, oder}4 – was generated with k = 4 (Figure 4b in Appendix A). Any higher k merely led to multiple clusters for ja and nicht. This is not surprising, as these tags are represented by a notably larger number of instances than ne and oder in the corpus. This cluster comprises the total number of ne and oder in OS and covers a mix of different TQ types.</p>
        <p>There is almost no variation in punctuation in this corpus: TQs without the tag-preceding
comma and/or ending with a period make up less than 2% per tag. Due to this fractional amount,
these cases are not decisive for the automatic analysis.</p>
        <p>The homogeneous use of punctuation in this corpus might be explained by the fact that
subtitles are supposed to conform with standard grammar (in our case, a tag separated from
the anchor clause by a comma and followed by a question mark).</p>
          <p>For this data, K-Means prioritizes the division of large tag groups into multiple clusters over the clustering of different tags together. We find no obvious differences between the instances of ja in the two clusters generated with k = 4, e.g., they both contain directive TQs.</p>
        </sec>
        <sec id="sec-6-1-3">
          <title>6.1.3. Twitter</title>
          <p>One impure cluster was generated with k = 7 (Figure 4c in Appendix A). This cluster – {part ja, part nicht}2 – comprises the instances of the respective tags that have no preceding comma and are followed by a question mark. In Twitter, the question mark is the predominant end punctuation, and only few TQs end with a period (less than 1% with nicht and oder, 3% with ja, and 15% with ne). Thus, tags are clustered based on the presence or absence of the preceding comma, rather than the end punctuation.</p>
          <p>In general, K-Means merely assigns distinct clusters to the tag subgroups already formed by the BERT model. The clustering together of nicht with ja is not straightforward, especially since oder is situated closer to the former.</p>
        </sec>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Mapping of linguistic properties to clusters</title>
        <p>We assess how well the linguistic properties of TQs determined in previous work map onto the K-Means clusters. We use the annotations of the anchor clause type, anchor speech act, and target of confirmation from [4], available for a portion of the dataset used in this study: 940 TQs in CallHome and 641 TQs in Twitter.</p>
        <p>To test the distribution of these properties across our clusters, we apply the cluster evaluation metric V-measure [<xref ref-type="bibr" rid="ref18">18</xref>], which constitutes the harmonic mean between homogeneity (whether all TQs in a cluster belong to the same category, e.g., anchor clause type) and completeness (whether all TQs with the same properties are put into one cluster). We used the scikit-learn implementation: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.v_measure_score.html. We find that the target of confirmation has the highest match with the clusters in both corpora: its V-measure scores range between 0.13-0.16 (CH and TW), depending on the number of clusters (between 4 and 10). The anchor clause type and speech act are both associated with lower scores: 0.05-0.09 (CH) and 0.11-0.16 (TW).</p>
        <p>These results confirm previous observations that the tags’ felicity conditions only partially depend on the anchor clause type and speech act. They also support previous findings that certain tags, such as oder, are infelicitous with requests to acknowledge the provided information, while other tags, such as ne, are typical for this target of confirmation [8].</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Tag word prediction</title>
      <p>In this section, we describe the BERT Masked Language Modeling task for the tag word prediction with the model selected in Section 5. We test the impact of fine-tuning on the model’s performance and examine its predictions with regard to the tags’ interchangeability potential. We implement the training task using PyTorch [16] and the HuggingFace Transformers library [<xref ref-type="bibr" rid="ref21">21</xref>].</p>
      <sec id="sec-7-1">
        <title>7.1. Experimental setup</title>
        <p>For this task, we use the complete dataset (tags and no-tags) and fine-tune the BERT model to predict the tag word form (e.g., ne vs. ja) and class (tag vs. no-tag). We represent the no-tags with the special tokens [ntja], [ntne], and [ntnicht] to differentiate them from the respective tags in the model’s predictions. (The tag oder has no counterpart [ntoder].) The special tokens and tags are then replaced with the [mask] token. We run the training for 10 epochs with standard parameters. The performance of the fine-tuned model is compared with that of the original pretrained model (baseline).</p>
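        <p>A condensed sketch of this setup with the Transformers library, under our assumptions about how the special tokens are registered; the training loop itself (10 epochs, standard parameters) is omitted, and the sentence is illustrative:</p>
        <preformat>from transformers import AutoTokenizer, AutoModelForMaskedLM

name = "bert-base-german-dbmdz-cased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name)

# Register the no-tag placeholders as single tokens and resize the embeddings.
tokenizer.add_special_tokens({"additional_special_tokens": ["[ntja]", "[ntne]", "[ntnicht]"]})
model.resize_token_embeddings(len(tokenizer))

# A no-tag instance: the sentence-final word is represented by its class
# token, which is then replaced with the mask token and predicted.
enc = tokenizer("Kennst du das [ntnicht]", return_tensors="pt")
labels = enc["input_ids"].clone()
nt_id = tokenizer.convert_tokens_to_ids("[ntnicht]")
enc["input_ids"][enc["input_ids"] == nt_id] = tokenizer.mask_token_id
labels[enc["input_ids"] != tokenizer.mask_token_id] = -100   # loss only on the mask
loss = model(**enc, labels=labels).loss</preformat>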
        <p>The dataset is randomly split into the training and test sets (80% and 20% from each corpus, respectively). The training set is further randomly split into 80% training and 20% evaluation. We apply this configuration to (a) the whole dataset and (b) the dataset without OpenSubtitles in the training data. With this, we test how much the model relies on the OS data, which was part of its original pretraining.</p>
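        <p>One way to realize this split configuration (the random seed and helper name are our choices for illustration):</p>
        <preformat>from sklearn.model_selection import train_test_split

def split_corpus(records, seed=42):
    # 80% train / 20% test, then 20% of the train part held out for evaluation.
    train, test = train_test_split(records, test_size=0.2, random_state=seed)
    train, dev = train_test_split(train, test_size=0.2, random_state=seed)
    return train, dev, test

# Applied per corpus; the per-corpus parts are then combined for training.
ch_train, ch_dev, ch_test = split_corpus([f"CH sentence {i}" for i in range(100)])</preformat>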
        <p>Furthermore, we train the model separately on each corpus and test on the rest of the dataset. Our corpora differ in terms of style and conformity to standards: spoken telephone conversations (CH), transcribed spoken language (OS), and computer-mediated communication that can be placed somewhere between written and spoken (TW) [cf. 4]. With this, we test the suitability of different types of data for training a generalized model for tag prediction.</p>
      </sec>
      <sec id="sec-7-2">
        <title>7.2. Evaluation</title>
        <p>For each sentence, we consider the top three predictions and calculate two types of scores to evaluate the model’s performance:
• score_equal – the model predicts the correct class (tag/no-tag) and the correct form (e.g., ne-tag for ne-tag)
• score_close – the model predicts the correct class, but the form can be incorrect (e.g., ja-no-tag for nicht-no-tag or ja-tag for nicht-tag); this score includes score_equal
We sum up the probabilities that match these criteria within the top three predictions to obtain a single score. The calculation is demonstrated below for a TQ from the Twitter corpus in (4):</p>
        <p>(4) Eh Digga, das war voll fett krass alter oder?
‘Eh dude, that was absolutely totally cool man right?’</p>
        <p>The top three predictions and their probabilities for this TQ are oder (0.933), ne (0.03), and [ntnicht] (0.026). Thus, score_equal amounts to 0.933 + 0 + 0 = 0.933 (93%) and score_close to 0.933 + 0.03 + 0 = 0.963 (96%). Additionally, we report precision, recall, and F1 scores based on the model’s top prediction.</p>
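        <p>A small sketch of this score computation (the function and variable names are ours, chosen for illustration):</p>
        <preformat>def scores(top3, gold_form, gold_is_tag):
    """top3: list of (predicted_token, probability) pairs; no-tag
    predictions are the special tokens, e.g. '[ntnicht]'."""
    score_equal = score_close = 0.0
    for token, prob in top3:
        pred_is_tag = not token.startswith("[nt")
        pred_form = token if pred_is_tag else token[3:-1]
        if pred_is_tag == gold_is_tag:
            score_close += prob             # correct class, any form
            if pred_form == gold_form:
                score_equal += prob         # correct class and form
    return score_equal, score_close

# Example (4): the gold label is the tag "oder".
top3 = [("oder", 0.933), ("ne", 0.03), ("[ntnicht]", 0.026)]
print(scores(top3, "oder", True))           # (0.933, 0.963)</preformat>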
      </sec>
      <sec id="sec-7-3">
        <title>7.3. Results</title>
        <p>The score_equal and score_close results are given in Table 4. Independently of whether OS is present in the training data, we observe a considerable improvement over the baseline (both scores). The tag/no-tag distinction (score_close) reaches almost a 100% probability in most cases.</p>
        <p>With OS in the training data, the lowest score_equal values are obtained for ne &lt; oder &lt; nicht (increasing in this order). This reflects the number of the respective tags in the training part of the dataset, with less frequent tags receiving poorer scores. The baseline scores are distributed differently, suggesting that oder and ja were the most frequent tags in the model’s original training data. However, the correctness probability of the baseline model does not go beyond 50% (both scores). Given that we introduced the no-tag special tokens for this task, the baseline scores are especially low in the test containing all items (tags and no-tags).</p>
        <p>Without OS in the training data, score_equal drops drastically for nicht and ja. We attribute this to the fact that the majority of TQs with these tags come from this corpus, thus limiting the model’s exposure to this type of data during training. The importance of large datasets for predictions with BERT was emphasized in previous studies [e.g., 1, 13].</p>
        <p>Precision, recall, and F1 scores show a (notable) improvement of the fine-tuned model over the baseline for each tag (Tables 5 and 6). When trained on all corpora, the fine-tuned model shows lower recall for oder compared to the baseline. The latter provides reasonable results primarily for ja. Its predictions for ne and nicht tend towards zero.</p>
        <p>The experiments with training on one corpus and testing on the rest of the dataset resulted
in a lower performance compared to the training on the data from all corpora. This can be
explained by the limited amount of the training data (CH in particular turned out to be least
suitable for training). Another reason is that our data, especially OS and TW, is imbalanced
and certain tags are heavily underrepresented. As with the tests described above, the results
here directly depend on the amount of the training data: The tag words represented by larger
numbers of instances received higher scores.</p>
        <p>In addition to these tests, we examine the top three predictions in the results of the training on all corpora (see Section 7.1) regarding the frequency with which different tags were suggested by BERT for each original tag variant. (We look at frequencies instead of probabilities, as in our data the latter are typically considerably lower for the second and third top predictions compared to the first one. This might be different with a larger dataset, though.) We hope to find indications of the tags’ interchangeability by examining which tags might constitute the best substitutes for each other. For TQs with ne, BERT predicted ne, ja, and nicht with almost equal frequency (in 21-23% of the cases for each). For TQs with ja, nicht, or oder, the original tag was predicted in the majority of the cases (27-32%, depending on the tag). The next-best alternatives were as follows: nicht (29%) for ja, ja (25%) for nicht, and both ja and nicht (21% each) for oder. These results suggest that oder and ne are generally poor substitutes for each other, which confirms previous corpus-based results [4]. The indication that ne could be replaced by nicht or ja is consistent with the experimental evidence in [5], which shows that these tags have common characteristics. For example, they are less felicitous in TQs expressing speaker assumptions based on the addressee’s behavior.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>8. Discussion and conclusion</title>
      <p>This study explored whether the differences among the four common German tags ja, ne, nicht, and oder, such as those established in previous corpus-based and experimental work, can be interpreted and predicted automatically. Our analysis of the existing German BERT models showed that they strongly depend on structural features, such as the tag-surrounding punctuation. For example, tags and no-tags were oftentimes regarded as more similar to one another than to other instances of the respective classes due to matching punctuation, while syntactic and semantic properties of TQs were not recognizably detected.</p>
      <p>We examined the tag vectors generated by one of these models in more detail. The mapping of linguistic properties of TQs to the automatically formed clusters of the tag vectors confirmed previous observations that the target of confirmation is a more informative feature for tags’ differentiation than, for instance, the syntactic properties of the TQ anchor.</p>
      <p>Furthermore, we fine-tuned the selected model on the tag word prediction task. The tag word class (tag/no-tag) was predicted with near 100% probability in most cases. The prediction of the tag word form proved to be more challenging, though. Especially the experiments with training on single corpora highlighted the importance of the dataset size: The predicted tag word probabilities directly correlated with the number of instances they were represented by in the training set. Overall, the results showed that with standard parameters and given a large enough training dataset (14,045 tags and 20,860 no-tags, in our case) the fine-tuned model works well for this task. However, hyper-parameter optimization and class weighting are worth exploring in the future.</p>
      <p>The difficulties with the automatic distinction between the tag forms are not overly surprising, after all. Cases where different TQ types share syntactic and semantic properties of the anchor provide limited information for BERT to rely on in order to, for example, rule out the use of certain tags, such as oder in informing TQs. The absence of additional contextual information hinders the judgments about the tags’ felicity in such cases. Nonetheless, certain TQs contain sufficient information to predict the tag even without context, e.g., ja in imperative directives. Since they differ both semantically and syntactically from TQs with declarative anchors, we would expect BERT to pick up on their specific properties. However, possibly because of their underrepresentation in our dataset, these TQs were not identified. Augmentation of the dataset with certain (synthetically generated) TQ types would facilitate further testing of BERT’s capacity to detect their features.</p>
      <p>We conclude that BERT provides indications of TQ features that are useful for tag differentiation. It also seems to correctly recognize which tags constitute appropriate substitutes for each other, although this needs further testing on a larger dataset. In future work, it could be worth including the right-side context of the tags (not present in our data) to fully exploit the power of BERT to use bidirectional context.</p>
    </sec>
    <sec id="sec-9">
      <title>Acknowledgments</title>
      <p>We thank Tatjana Scheffler and Manfred Stede for discussions and valuable suggestions. We are grateful to the anonymous reviewers for their helpful comments.</p>
      <p>This research was funded by the PhD completion scholarship from the Graduate Fund of the State of Brandenburg awarded by the University of Potsdam, and by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation), CRC 1567, Project ID 470106373.</p>
    </sec>
    <sec id="sec-10">
      <title>A. Visualization of BERT Vectors</title>
      <p>[Figures 1–3: 3D visualizations of the tag word vectors per corpus, each with panels (a) bert-base-german-cased, (b) bert-base-german-dbmdz-cased, and (c) gbert-large. Figure 4: K-Means clusters with panels (a) CallHome, (b) OpenSubtitles, and (c) Twitter.]</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] F. Bianchi, B. Yu, and J. Tagliabue. “BERT Goes Shopping: Comparing Distributional Models for Product Representations”. In: Proceedings of the 4th Workshop on e-Commerce and NLP. Online: Association for Computational Linguistics, 2021, pp. 1-12. doi: 10.18653/v1/2021.ecnlp-1.1.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] B. Chan, S. Schweter, and T. Möller. “German’s Next Language Model”. In: Proceedings of the 28th International Conference on Computational Linguistics. Barcelona, Spain (Online): International Committee on Computational Linguistics, 2020, pp. 6788-6794. doi: 10.18653/v1/2020.coling-main.598.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] Y. Clausen. “You shall know a tag by the context it occurs in: An analysis of German tag questions and their responses in spontaneous conversations”. In: ConSOLE XXIX: Proceedings of the 29th Conference of the Student Organization of Linguistics in Europe. Ed. by A. Holtz, I. Kovač, R. Puggaard-Rode, and J. Wall. Leiden: Leiden University Centre for Linguistics, 2021, pp. 116-140. url: https://www.universiteitleiden.nl/binaries/content/assets/geesteswetenschappen/lucl/sole/console-xxix.pdf.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] Y. Clausen and T. Scheffler. “A corpus-based analysis of meaning variations in German tag questions: Evidence from spoken and written conversational corpora”. In: Corpus Linguistics and Linguistic Theory 18.1 (2022), pp. 1-31. doi: 10.1515/cllt-2019-0060.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] Y. Clausen and T. Scheffler. “Commitments in German Tag Questions: An Experimental Study”. In: Proceedings of the 24th Workshop on the Semantics and Pragmatics of Dialogue - Full Papers. Virtually at Brandeis, Waltham, New Jersey: SEMDIAL, 2020. url: http://semdial.org/anthology/Z20-Clausen_semdial_0014.pdf.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Ed. by J. Burstein, C. Doran, and T. Solorio. Minneapolis, MN, USA: Association for Computational Linguistics, 2019, pp. 4171-4186. doi: 10.18653/v1/N19-1423.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] A. Ettinger. “What BERT Is Not: Lessons from a New Suite of Psycholinguistic Diagnostics for Language Models”. In: Transactions of the Association for Computational Linguistics 8 (2020), pp. 34-48. doi: 10.1162/tacl_a_00298.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] J. Hagemann. “Tag questions als Evidenzmarker. Formulierungsdynamik, sequentielle Struktur und Funktionen redezuginterner tags”. In: Gesprächsforschung - Online-Zeitschrift zur verbalen Interaktion 10 (2009), pp. 145-176.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] J. Heim. “Turn-peripheral management of Common Ground: A study of Swabian gell”. In: Journal of Pragmatics 141 (2019), pp. 130-146. doi: 10.1016/j.pragma.2018.12.007.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] K. Karins, R. MacIntyre, M. Brandmair, S. Lauscher, and C. McLemore. CALLHOME German Transcripts LDC97T15. Web Download. Philadelphia: Linguistic Data Consortium, 1997. url: https://catalog.ldc.upenn.edu/LDC97T15.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] T. Kato, R. Miyata, and S. Sato. “BERT-Based Simplification of Japanese Sentence-Ending Predicates in Descriptive Text”. In: Proceedings of the 13th International Conference on Natural Language Generation. Dublin, Ireland: Association for Computational Linguistics, 2020, pp. 242-251. url: https://aclanthology.org/2020.inlg-1.31.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] D. Kimps, K. Davidse, and B. Cornillie. “A speech function analysis of tag questions in British English spontaneous dialogue”. In: Journal of Pragmatics 66 (2014), pp. 64-85. doi: 10.1016/j.pragma.2014.02.013.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] W. Li, A. Grissom II, and J. Boyd-Graber. “An Attentive Recurrent Model for Incremental Prediction of Sentence-final Verbs”. In: Findings of the Association for Computational Linguistics: EMNLP 2020. Online: Association for Computational Linguistics, 2020, pp. 126-136. doi: 10.18653/v1/2020.findings-emnlp.12.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] P. Lison and J. Tiedemann. “OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles”. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16). Portorož, Slovenia: European Language Resources Association (ELRA), 2016, pp. 923-929. url: https://aclanthology.org/L16-1147.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] S. Mandokoro, N. Oka, A. Matsushima, C. Fukada, Y. Yoshimura, K. Kawahara, and K. Tanaka. “Construction and Evaluation of a Self-Attention Model for Semantic Understanding of Sentence-Final Particles”. In: arXiv preprint (2022). doi: 10.48550/arXiv.2210.00282.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. “PyTorch: An Imperative Style, High-Performance Deep Learning Library”. In: Advances in Neural Information Processing Systems 32. Ed. by H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett. Curran Associates, Inc., 2019, pp. 8024-8035. url: http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] A. Rogers, O. Kovaleva, and A. Rumshisky. “A Primer in BERTology: What We Know About How BERT Works”. In: Transactions of the Association for Computational Linguistics 8 (2020), pp. 842-866. url: https://aclanthology.org/2020.tacl-1.54.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] A. Rosenberg and J. Hirschberg. “V-Measure: A Conditional Entropy-Based External Cluster Evaluation Measure”. In: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL). Prague, Czech Republic: Association for Computational Linguistics, 2007, pp. 410-420. url: https://aclanthology.org/D07-1043.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] T. Scheffler. “A German Twitter Snapshot”. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14). Reykjavik, Iceland: European Language Resources Association (ELRA), 2014, pp. 2284-2289. url: http://www.lrec-conf.org/proceedings/lrec2014/pdf/1146_Paper.pdf.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] M. Wiltschko, D. Denis, and A. D’Arcy. “Deconstructing variation in pragmatic function: A transdisciplinary case study”. In: Language in Society 47.4 (2018), pp. 569-599. doi: 10.1017/S004740451800057X.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, and A. Rush. “Transformers: State-of-the-Art Natural Language Processing”. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Online: Association for Computational Linguistics, 2020, pp. 38-45. doi: 10.18653/v1/2020.emnlp-demos.6.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>