<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>German Question Tags: A Computational Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yulia Clausen</string-name>
          <email>yulia.clausen@rub.de</email>
          <aff>Germanistisches Institut, Ruhr-Universität Bochum, Germany</aff>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <abstract>
        <p>The German language exhibits a range of question tags that can typically, but not always, be substituted for one another. Moreover, the same words can have other meanings while occurring in the sentence-final position. The tags’ felicity conditions were addressed in previous corpus-based and experimental work and attributed to semantic and pragmatic properties of tag questions. This paper addresses the question of whether and to what extent the differences among German tags can be determined automatically. We assess the performance of three pretrained German BERT models on a tag question dataset and fine-tune one of these models on the tag word prediction task. A close examination of this model’s output indicates that BERT can identify properties relevant for the tags’ felicity conditions and interchangeability consistent with previous studies.</p>
      </abstract>
      <kwd-group>
        <kwd>tag questions</kwd>
        <kwd>German</kwd>
        <kwd>tags</kwd>
        <kwd>annotation</kwd>
        <kwd>BERT</kwd>
        <kwd>clustering</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>(1) Lina says to her sister as they go out of the cinema:
Der Film war gut, ne?/nicht?/oder?
‘The film was good, wasn’t it?’</p>
      <p>(2) Lina comes back from the movies and says to her sister (who did not want to come):
Der Film war gut, ne!/nicht!/*oder!
‘The film was good, you know!’</p>
      <p>In (1), different tags are equally suitable for requesting confirmation of whether Lina’s sister also liked the film. In (2), however, Lina’s sister is requested to confirm her acknowledgment of the provided information, in which case oder is infelicitous [cf. 4].</p>
      <p>Felicity conditions of German tags were addressed in previous experimental and corpus-based studies, and several factors were identified as crucial for the tags’ (non-)interchangeability. Among those are syntactic and semantic properties of tag questions (TQs), as well as pragmatic inferences arising from various contextual aspects (see Section 2 for details). In this study, we pursue the question of whether the similarities and/or differences among tags, and hence cases of their potential interchangeability, can be modeled automatically. Language models, such as BERT [6], are known for their capacity to leverage semantic and other types of linguistic information from the context around a given word (see e.g. [17] for an overview). Therefore, we test whether and how well BERT can identify the properties of German tags, such as those defined in previous work, and whether we can gain new insights from this into the tags’ felicity conditions.</p>
      <p>It is worth noting that there exists another TQ-relevant distinction in German: Words functioning as tags can have other meanings while occurring in the tag position (i.e., end of a sentence). For example, nicht is also a negation particle (e.g., Kennst du das nicht? ‘Don’t you know that?’). This is a different kind of distinction, since semantically TQs differ considerably from other sentence types ending with the same word. We thus include both types of sentences in our analysis. We expect the sentence type distinction to be easier for BERT than determining the differences among individual tags. The latter, however, is of primary interest to us.</p>
      <p>Our paper makes the following contributions. We test the capacities of three existing pretrained German BERT models to differentiate among question tags as well as between TQs and other sentence types. We find that while most models capture the sentence type distinction quite well, they struggle with semantic/pragmatic differences within the tag class. Instead, BERT demonstrates a strong dependence on structural features, such as punctuation. We apply K-Means clustering to the embeddings produced by one of these models and test the overlap of the generated clusters with the linguistic properties of TQs defined in previous work. We find indications as to which of those properties are relevant for the tags’ felicity conditions in accordance with previous findings. Finally, we fine-tune the selected model on the next word prediction task with respect to two aspects: prediction of the word class (tag vs. no-tag) and form (e.g., oder vs. ne). Our experiments show that the fine-tuned model outperforms the original one in both tasks, while at the same time revealing the importance of the dataset size for meaningful prediction.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <sec id="sec-2-1">
        <title>2.1. German question tags</title>
        <p>The meaning and felicity conditions of German tags were addressed in recent corpus-based and experimental studies [<xref ref-type="bibr" rid="ref3 ref4 ref5 ref9">3, 4, 5, 9</xref>]. Several semantic/pragmatic as well as syntactic factors were found crucial for the tags’ felicity conditions and their interchangeability potential. Anchor clause type and speech act provide certain indications regarding the tags’ felicity, such that, e.g., imperative directives as in (3) are compatible only with ja [cf. 4].</p>
        <sec id="sec-2-1-1">
          <title>Max wants to play football with his friends, but his father says:</title>
          <p>Mach erst deine Hausaufgaben, ja!/*ne!/*nicht!/*oder!
‘Do your homework 昀椀rst!2’</p>
        <p>Oftentimes, additional context is required, though. For example, the TQ anchors in (1) and (2) in Section 1 are both declarative assertions, but oder is felicitous only in the former. In such cases, information about the interlocutors’ epistemic authority provides additional clues, e.g., whether the speaker is informing the addressee or asking for a confirmation (cf. statements vs. questions as functions of TQs in [12]). If the speaker is the source of information, the use of oder is typically ruled out. Further constraints are provided by the type of requested confirmation, i.e., the aspect of the anchor proposition the addressee is requested to confirm (target of confirmation in [<xref ref-type="bibr" rid="ref4 ref20">4, 20</xref>]). An example would be agreement with the speaker’s opinion vs. acknowledgment of the provided information in (1) vs. (2) in Section 1.</p>
        <p>These linguistic properties have been found to correlate with different tags as well as with each other to varying degrees ([4], p. 26), and while some of them are straightforward (e.g., anchor clause type), others are more complex and need to be inferred from the context (e.g., target of confirmation).</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Language modeling</title>
        <p>Among the growing amount of work on next word prediction with language models, several studies have focused on linguistic elements in the sentence-final position. Kato, Miyata, and Sato [11] use BERT to generate simplified substitutions for Japanese sentence-ending predicates. Li, Grissom II, and Boyd-Graber [13] predict sentence-final verbs for German and Japanese with neural models for two tasks: predicting the exact verb and a semantically similar one. Mandokoro, Oka, Matsushima, Fukada, Yoshimura, Kawahara, and Tanaka [15] train a BERT model on the task of Japanese sentence-final particle prediction.</p>
        <p>Ettinger [7] explores the role of different types of information in prediction of the sentence-final word on the basis of its left-side context for English. Similarly, we implement the tag word prediction task informed only by the tag’s left-side context. The factors tested in [7] are similar to those that play a role in the felicity conditions of German tags: semantic roles, event knowledge, and pragmatic inferences. Ettinger finds them to be particularly challenging for BERT.</p>
        <p>To our knowledge, there are no studies that explore the features of question tags or focus on
automatic tag prediction with language models.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Data</title>
        <p>We work with the TQ dataset from [4], built from three German corpora: CallHome (CH) [10], OpenSubtitles (OS) [<xref ref-type="bibr" rid="ref14">14</xref>], and Twitter (TW) [19]. This dataset contains automatically extracted TQ candidates that need to be manually disambiguated as to whether or not they end with a tag. We confine our analysis to the four most frequent tags (ja, ne, nicht, and oder), for which we unified and annotated the data with the tag/no-tag labels. The annotation was performed by four annotators: the author of this paper and three annotators with a linguistic background. The latter were provided with the annotation guidelines. To ensure the annotation quality, the author of this paper independently annotated approx. 1,000 TQ candidates from each annotator’s file. High inter-annotator agreement was reached on these data subsets: a Cohen’s kappa score of 0.9 with annotator 1 and 0.78 with annotator 2. (We could not calculate the inter-annotator agreement with annotator 3, as they did not complete their annotations, so that there were no overlapping annotations available for comparison. The annotation in this case was completed by the author of the paper.) Any conflicting annotations in these data subsets, i.e., between the author of the paper and each respective annotator, were resolved afterwards. Table 1 shows the number of annotated tag words per corpus used in this study. The annotated dataset and the annotation guidelines are available via the Open Science Framework: https://osf.io/pcng9.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Tag word embeddings</title>
      <p>We test the following existing pretrained German BERT models: bert-base-german-cased (https://huggingface.co/bert-base-german-cased), bert-base-german-dbmdz-cased (https://github.com/dbmdz/berts#german-bert), and gbert-large (https://huggingface.co/deepset/gbert-large; [2]).</p>
      <p>To generate the tag word embeddings, we extracted one TQ candidate from each record in the dataset. (Some records consist of several sentences (e.g., a tweet) and hence can contain more than one TQ candidate. We extracted each record’s first sentence ending with one of the relevant tag words.) Depending on the corpus, we applied different preprocessing steps to the extracted TQ candidates. For CH and OS, we removed all meta-language sequences. For TW, we stripped URLs (end of sentence), hashtags and @username mentions (beginning and end of sentence), and common emoticons (anywhere in sentence). Furthermore, we excluded TQ candidates consisting of fewer than three tokens including the tag word, in order to eliminate (most of) the short sequences bearing little meaning. Finally, we removed all duplicates based on case-sensitive string comparison. Examples of the preprocessed sentences in the final dataset are given in Table 2.</p>
      <p>We fed the preprocessed TQ candidates through each model and obtained embeddings consisting of either 12 layers with 768 dimensions (bert-base-german-cased and bert-base-german-dbmdz-cased) or 24 layers with 1,024 dimensions (gbert-large) per token. To get a single embedding per token, we concatenated each token’s last four layers, thus obtaining one vector with 3,072 (bert-base-german-cased and bert-base-german-dbmdz-cased) or 4,096 (gbert-large) dimensions. Finally, we extracted each tag word’s embedding, which we use here as its contextual representation.</p>
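      <p>A rough sketch of this extraction with the HuggingFace Transformers library follows; the example sentence is ours, and the tag is assumed to be a single wordpiece:</p>
      <preformat>import torch
from transformers import AutoTokenizer, AutoModel

name = "bert-base-german-dbmdz-cased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_hidden_states=True)

inputs = tokenizer("Der Film war gut, oder?", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).hidden_states   # embedding layer + 12 transformer layers

# Concatenate each token's last four layers: 4 x 768 = 3,072 dimensions.
token_vecs = torch.cat(hidden[-4:], dim=-1).squeeze(0)

# The tag word "oder" is the third-to-last wordpiece: [..., "oder", "?", "[SEP]"].
tag_vec = token_vecs[-3]
print(tag_vec.shape)                         # torch.Size([3072])</preformat>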
    </sec>
    <sec id="sec-5">
      <title>5. BERT model comparison</title>
      <p>This section discusses the output of the three BERT models with respect to the tag/no-tag distinction and the differences among the tag forms. We reduce the embeddings to three components with Principal Component Analysis (PCA), using the scikit-learn implementation (https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html), and map them into a vector space. We use the visualized data for our analysis and provide a more compact version of the plots in Appendix A for illustration. (The plots in Appendix A were created with seaborn, https://seaborn.pydata.org/. The interactive 3D plots used for our analysis were created with matplotlib, https://matplotlib.org/, and are available via the Open Science Framework: https://osf.io/pcng9.)</p>
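      <p>A minimal sketch of this reduction step (the input array is placeholder data standing in for the BERT tag vectors):</p>
      <preformat>import numpy as np
from sklearn.decomposition import PCA

tag_vectors = np.random.rand(500, 3072)   # placeholder for the BERT tag vectors

pca = PCA(n_components=3)
points = pca.fit_transform(tag_vectors)   # shape (500, 3), used for the 3D plots</preformat>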
      <sec id="sec-5-1">
        <title>5.1. bert-base-german-cased</title>
        <p>This model differentiates prima facie well among the four tag words: Vectors representing the same tags are densely grouped together, while distinct tags are visibly separated from each other (Figures 1a, 2a, 3a in Appendix A). However, each vector group is a tag/no-tag mixture (except for oder, which has no no-tag counterparts). This suggests that this model only differentiates between the surface forms of the tag words, and will most likely be insufficient in handling finer-grained distinctions, such as different types of utterances ending with the same word.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. bert-base-german-dbmdz-cased</title>
        <p>The vector groups generated by this model are less dense and have visually less space between them compared to bert-base-german-cased (Figures 1b, 2b, 3b in Appendix A). Nonetheless, the model differentiates well among the tags and provides a reasonable tag/no-tag separation in most cases. Furthermore, it subdivides the tag groups, which is not the case with bert-base-german-cased. This is particularly prominent for CH (ja, ne, and nicht) and TW (all tags).</p>
        <p>We find that the formation of subgroups (among the TQs ending with the same tag) is tied to punctuation. Tags are placed into different subgroups depending on whether they are followed by a question mark or a period. This is consistent across the tags and corpora. The tag-preceding comma also plays a role: The tags are either clearly separated (e.g., ‘, ja?’ vs. ‘ja?’ in OS/TW), or there is a gradual transition from one punctuation type to another within a subgroup (e.g., ‘ne.’ vs. ‘, ne.’ in CH).</p>
        <p>The tag/no-tag groups typically partially overlap in cases of matching punctuation (e.g., ja in OS). Given that tags with different punctuation form distinct subgroups, this suggests that the model considers tags and no-tags with the same punctuation to be more similar than the same tags with different punctuation. Thus, structural features seem to dominate over potential syntactic/semantic differences between TQs and other sentence types ending with the same tag word.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. gbert-large</title>
        <p>This model falls in between the other two, as its output looks similar to that of bert-base-german-cased in terms of compact, spatially well-separated vector groups, while at the same time providing a good tag/no-tag distinction akin to bert-base-german-dbmdz-cased (Figures 1c, 2c, 3c in Appendix A). The model shows a stable pattern across the three corpora: While the vector groups representing different tags are spatially separated, the tag/no-tag instances are situated in very close proximity to each other and even partially overlap (ja and nicht in all corpora; ne in TW). The tag/no-tag distinction for nicht generally seems to be most definite, showing practically no overlap in OS and TW. (The clear tag/no-tag distinction for nicht is also made by the bert-base-german-dbmdz-cased model.)</p>
        <p>This model also differentiates based on punctuation. In some cases, tags are divided into two distinct subgroups based on the end punctuation (ne and ja in CH). In most cases, though, the tags are ordered within their respective groups: Tags followed by a question mark and preceded by a comma are situated on one side of the vector group, whereas those ending with a period are placed on its other end. The latter is also where a (partial) overlap with the no-tags takes place, as no-tags are largely followed by a period.</p>
      </sec>
      <sec id="sec-5-1">
        <title>5.4. Summary</title>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Clustering</title>
      <p>In this section, we focus only on the tag part of the data and apply the K-Means clustering algorithm to the BERT-generated tag vectors. (We used the scikit-learn implementation: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html. The clusters are built on the original BERT vectors; the PCA-reduced vectors are used only for visualization purposes.) As discussed in the previous section, BERT groups tags by their form (and punctuation). By means of clustering, we explore whether there are any common features across these tag groups. Our assumption is that distinct tags that occur in similar contexts will have similar linguistic properties encoded in their vector representations and will hence be clustered together.</p>
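      <p>A minimal sketch of this clustering setup, run per corpus (placeholder data; any configuration beyond the scikit-learn defaults is our assumption):</p>
      <preformat>import numpy as np
from sklearn.cluster import KMeans

vectors = np.random.rand(500, 3072)                           # placeholder BERT tag vectors
forms = np.random.choice(["ja", "ne", "nicht", "oder"], 500)  # tag form per vector

for k in range(4, 11):                                        # k = 4 ... 10
    labels = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(vectors)
    for c in range(k):
        members = sorted(set(forms[labels == c]))
        if len(members) &gt; 1:                                  # an "impure" cluster
            print(f"k={k}, cluster {c}: {members}")</preformat>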
      <sec id="sec-6-1">
        <title>6.1. Cluster analysis</title>
        <p>We experiment with different numbers of clusters (k), starting with 4 (i.e., the number of tags in the dataset) and increasing it in single steps up to 10. As discussed in Section 1, tags are interchangeable only in certain contexts, which is why we are interested in impure clusters, i.e., the ones where different tag groups are partially clustered together.</p>
        <p>The general tendency we observe is that with higher k’s, each tag form is allocated to a distinct cluster or even divided into multiple clusters. Hence, we determine the highest k (below 10) with which any different tags are still clustered together, and examine the resulting impure clusters in more detail. Following this strategy, we select k = 9 for CallHome, k = 7 for Twitter, and k = 4 for OpenSubtitles. Table 3 shows the impure clusters. An overview of all clusters generated with the respective k’s can be found in Figure 4 in Appendix A.</p>
        <table-wrap id="tab3">
          <label>Table 3</label>
          <caption>
            <p>Impure K-Means clusters per corpus. k denotes the overall number of clusters, { } mark cluster boundaries, subscript numbers indicate cluster IDs in plots.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Corpus</th><th>k</th><th>Impure clusters</th></tr>
            </thead>
            <tbody>
              <tr><td>CallHome</td><td>9</td><td>{part nicht, part oder}3, {part ne, part nicht, 1 oder}7</td></tr>
              <tr><td>Twitter</td><td>7</td><td>{part ja, part nicht}2</td></tr>
              <tr><td>OpenSubtitles</td><td>4</td><td>{ne, 2 nicht, oder}4</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <sec id="sec-6-1-1">
          <title>6.1.1. CallHome</title>
          <p>Two impure clusters were generated with k = 9 (Figure 4a in Appendix A). The cluster {part nicht, part oder}3 contains the instances of these tags that are followed by a question mark and preceded by a comma. This makes up a part of the oder-subgroup and the complete nicht-subgroup with a question mark. A closer look at TQs in this corpus reveals that the ones with a question mark express requests for information or an opinion from the addressee (cf. questions and statement-question blends [12]).</p>
          <p>The cluster {part ne, part nicht, 1 oder}7 contains tags without the preceding comma and followed by a period (including occasional cases of alternative punctuation). This corresponds to a part of each respective tag’s subgroup. TQs ending with a period in this corpus are those where the speaker has epistemic authority and provides information or an opinion.</p>
          <p>We conclude that the clustering method supports the punctuation-based distinction among TQs, e.g., by utilizing the tag-preceding punctuation as a clustering criterion. The observed correlation between the end punctuation and certain TQ types can be attributed to the fact that CH contains transcribed data, where, evidently, question marks and periods represent the rising and falling intonation, respectively. This, in turn, corresponds (at least roughly) to the addressee vs. speaker epistemic authority. This correlation should be taken with a grain of salt, though, as it is not necessarily the case with other corpora, e.g., Twitter users do not follow punctuation rules strictly.</p>
        </sec>
        <sec id="sec-6-1-2">
          <title>6.1.2. OpenSubtitles</title>
          <p>One impure cluster – {ne, 2 nicht, oder}4 – was generated with k = 4 (Figure 4b in Appendix A). Any higher k merely led to multiple clusters for ja and nicht. This is not surprising, as these tags are represented by a notably larger number of instances than ne and oder in the corpus. This cluster comprises the total number of ne and oder in OS and covers a mix of different TQ types.</p>
        <p>There is almost no variation in punctuation in this corpus: TQs without the tag-preceding
comma and/or ending with a period make up less than 2% per tag. Due to this fractional amount,
these cases are not decisive for the automatic analysis.</p>
        <p>The homogeneous use of punctuation in this corpus might be explained by the fact that
subtitles are supposed to conform with standard grammar (in our case, a tag separated from
the anchor clause by a comma and followed by a question mark).</p>
          <p>For this data, K-Means prioritizes the division of large tag groups into multiple clusters over the clustering of different tags together. We find no obvious differences between the instances of ja in the two clusters generated with k = 4, e.g., they both contain directive TQs.</p>
        </sec>
        <sec id="sec-6-1-3">
          <title>6.1.3. Twitter</title>
          <p>One impure cluster was generated with k = 7 (Figure 4c in Appendix A). This cluster – {part ja, part nicht}2 – comprises the instances of the respective tags that have no preceding comma and are followed by a question mark. In Twitter, the question mark is the predominant end punctuation, and only few TQs end with a period (less than 1% with nicht and oder, 3% with ja, and 15% with ne). Thus, tags are clustered based on the presence or absence of the preceding comma, rather than the end punctuation.</p>
          <p>In general, K-Means merely assigns distinct clusters to the tag subgroups already formed by the BERT model. The clustering together of nicht with ja is not straightforward, especially since oder is situated closer to the former.</p>
        </sec>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Mapping of linguistic properties to clusters</title>
        <p>We assess how well the linguistic properties of TQs determined in previous work map onto the K-Means clusters. We use the annotations of the anchor clause type, anchor speech act, and target of confirmation from [4], available for a portion of the dataset used in this study: 940 TQs in CallHome and 641 TQs in Twitter.</p>
        <p>To test the distribution of these properties across our clusters, we apply the cluster evaluation metric V-measure [<xref ref-type="bibr" rid="ref18">18</xref>], which constitutes the harmonic mean between homogeneity (whether all TQs in a cluster belong to the same category, e.g., anchor clause type) and completeness (whether all TQs with the same properties are put into one cluster). We used the scikit-learn implementation: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.v_measure_score.html. We find that the target of confirmation has the highest match with the clusters in both corpora: its V-measure scores range between 0.13-0.16 (CH and TW), depending on the number of clusters (between 4 and 10). The anchor clause type and speech act are both associated with lower scores: 0.05-0.09 (CH) and 0.11-0.16 (TW).</p>
        <p>These results confirm previous observations that the tags’ felicity conditions only partially depend on the anchor clause type and speech act. They also support previous findings that certain tags, such as oder, are infelicitous with requests to acknowledge the provided information, while other tags, such as ne, are typical for this target of confirmation [8].</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Tag word prediction</title>
      <p>In this section, we describe the BERT Masked Language Modeling task for the tag word prediction with the model selected in Section 5. We test the impact of fine-tuning on the model’s performance and examine its predictions with regard to the tags’ interchangeability potential. We implement the training task using PyTorch [16] and the HuggingFace Transformers library [<xref ref-type="bibr" rid="ref21">21</xref>].</p>
      <sec id="sec-7-1">
        <title>7.1. Experimental setup</title>
        <p>For this task, we use the complete dataset (tags and no-tags) and fine-tune the BERT model to predict the tag word form (e.g., ne vs. ja) and class (tag vs. no-tag). We represent the no-tags with the special tokens [ntja], [ntne], and [ntnicht] to differentiate them from the respective tags in the model’s predictions. (The tag oder has no counterpart [ntoder].) The special tokens and tags are then replaced with the [mask] token. We run the training for 10 epochs with standard parameters. The performance of the fine-tuned model is compared with that of the original pretrained model (baseline).</p>
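        <p>A condensed sketch of this setup with the Transformers library, under our assumptions about how the special tokens are registered; the training loop itself (10 epochs, standard parameters) is omitted, and the sentence is illustrative:</p>
        <preformat>from transformers import AutoTokenizer, AutoModelForMaskedLM

name = "bert-base-german-dbmdz-cased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name)

# Register the no-tag placeholders as single tokens and resize the embeddings.
tokenizer.add_special_tokens({"additional_special_tokens": ["[ntja]", "[ntne]", "[ntnicht]"]})
model.resize_token_embeddings(len(tokenizer))

# A no-tag instance: the sentence-final word is represented by its class
# token, which is then replaced with the mask token and predicted.
enc = tokenizer("Kennst du das [ntnicht]", return_tensors="pt")
labels = enc["input_ids"].clone()
nt_id = tokenizer.convert_tokens_to_ids("[ntnicht]")
enc["input_ids"][enc["input_ids"] == nt_id] = tokenizer.mask_token_id
labels[enc["input_ids"] != tokenizer.mask_token_id] = -100   # loss only on the mask
loss = model(**enc, labels=labels).loss</preformat>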
        <p>The dataset is randomly split into the training and test sets (80% and 20% from each corpus, respectively). The training set is further randomly split into 80% training and 20% evaluation. We apply this configuration to (a) the whole dataset and (b) the dataset without OpenSubtitles in the training data. With this, we test how much the model relies on the OS data, which was part of its original pretraining.</p>
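        <p>One way to realize this split configuration (the random seed and helper name are our choices for illustration):</p>
        <preformat>from sklearn.model_selection import train_test_split

def split_corpus(records, seed=42):
    # 80% train / 20% test, then 20% of the train part held out for evaluation.
    train, test = train_test_split(records, test_size=0.2, random_state=seed)
    train, dev = train_test_split(train, test_size=0.2, random_state=seed)
    return train, dev, test

# Applied per corpus; the per-corpus parts are then combined for training.
ch_train, ch_dev, ch_test = split_corpus([f"CH sentence {i}" for i in range(100)])</preformat>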
        <p>Furthermore, we train the model separately on each corpus and test on the rest of the dataset. Our corpora differ in terms of style and conformity to standards: spoken telephone conversations (CH), transcribed spoken language (OS), and computer-mediated communication that can be placed somewhere between written and spoken (TW) [cf. 4]. With this, we test the suitability of different types of data for training a generalized model for tag prediction.</p>
      </sec>
      <sec id="sec-7-2">
        <title>7.2. Evaluation</title>
        <p>For each sentence, we consider the top three predictions and calculate two types of scores to evaluate the model’s performance:
• score_equal – the model predicts the correct class (tag/no-tag) and the correct form (e.g., ne-tag for ne-tag)
• score_close – the model predicts the correct class, but the form can be incorrect (e.g., ja-no-tag for nicht-no-tag or ja-tag for nicht-tag); this score includes score_equal
We sum up the probabilities that match these criteria within the top three predictions to obtain a single score. The calculation is demonstrated below for a TQ from the Twitter corpus in (4):</p>
        <p>(4) Eh Digga, das war voll fett krass alter oder?
‘Eh dude, that was absolutely totally cool man right?’</p>
        <p>The top three predictions and their probabilities for this TQ are oder (0.933), ne (0.03), and [ntnicht] (0.026). Thus, score_equal amounts to 0.933 + 0 + 0 = 0.933 (93%) and score_close to 0.933 + 0.03 + 0 = 0.963 (96%). Additionally, we report precision, recall, and F1 scores based on the model’s top prediction.</p>
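        <p>A small sketch of this score computation (the function and variable names are ours, chosen for illustration):</p>
        <preformat>def scores(top3, gold_form, gold_is_tag):
    """top3: list of (predicted_token, probability) pairs; no-tag
    predictions are the special tokens, e.g. '[ntnicht]'."""
    score_equal = score_close = 0.0
    for token, prob in top3:
        pred_is_tag = not token.startswith("[nt")
        pred_form = token if pred_is_tag else token[3:-1]
        if pred_is_tag == gold_is_tag:
            score_close += prob             # correct class, any form
            if pred_form == gold_form:
                score_equal += prob         # correct class and form
    return score_equal, score_close

# Example (4): the gold label is the tag "oder".
top3 = [("oder", 0.933), ("ne", 0.03), ("[ntnicht]", 0.026)]
print(scores(top3, "oder", True))           # (0.933, 0.963)</preformat>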
      </sec>
      <sec id="sec-7-3">
        <title>7.3. Results</title>
        <p>The score_equal and score_close results are given in Table 4. Independently of whether OS is present in the training data, we observe a considerable improvement over the baseline (both scores). The tag/no-tag distinction (score_close) reaches almost a 100% probability in most cases.</p>
        <p>With OS in the training data, the lowest score_equal values are obtained for ne &lt; oder &lt; nicht (increasing in this order). This reflects the number of the respective tags in the training part of the dataset, with less frequent tags receiving poorer scores. The baseline scores are distributed differently, suggesting that oder and ja were the most frequent tags in the model’s original training data. However, the correctness probability of the baseline model does not go beyond 50% (both scores). Given that we introduced the no-tag special tokens for this task, the baseline scores are especially low in the test containing all items (tags and no-tags).</p>
        <p>Without OS in the training data, score_equal drops drastically for nicht and ja. We attribute this to the fact that the majority of TQs with these tags come from this corpus, thus limiting the model’s exposure to this type of data during training. The importance of large datasets for predictions with BERT was emphasized in previous studies [e.g., 1, 13].</p>
        <p>Precision, recall, and F1 scores show a (notable) improvement of the fine-tuned model over the baseline for each tag (Tables 5 and 6). When trained on all corpora, the fine-tuned model shows lower recall for oder compared to the baseline. The latter provides reasonable results primarily for ja. Its predictions for ne and nicht tend towards zero.</p>
        <p>The experiments with training on one corpus and testing on the rest of the dataset resulted
in a lower performance compared to the training on the data from all corpora. This can be
explained by the limited amount of the training data (CH in particular turned out to be least
suitable for training). Another reason is that our data, especially OS and TW, is imbalanced
and certain tags are heavily underrepresented. As with the tests described above, the results
here directly depend on the amount of the training data: The tag words represented by larger
numbers of instances received higher scores.</p>
        <p>In addition to these tests, we examine the top three predictions in the results of the training on all corpora (see Section 7.1) regarding the frequency with which different tags were suggested by BERT for each original tag variant. (We look at frequencies instead of probabilities, as in our data the latter are typically considerably lower for the second and third top predictions compared to the first one. This might be different with a larger dataset, though.) We hope to find indications of the tags’ interchangeability by examining which tags might constitute the best substitutes for each other. For TQs with ne, BERT predicted ne, ja, and nicht with almost equal frequency (in 21-23% of the cases for each). For TQs with ja, nicht, or oder, the original tag was predicted in the majority of the cases (27-32%, depending on the tag). The next-best alternatives were as follows: nicht (29%) for ja, ja (25%) for nicht, and both ja and nicht (21% each) for oder. These results suggest that oder and ne are generally poor substitutes for each other, which confirms previous corpus-based results [4]. The indication that ne could be replaced by nicht or ja is consistent with the experimental evidence in [5], which shows that these tags have common characteristics. For example, they are less felicitous in TQs expressing speaker assumptions based on the addressee’s behavior.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>8. Discussion and conclusion</title>
      <p>This study explored whether the differences among the four common German tags ja, ne, nicht, and oder, such as those established in previous corpus-based and experimental work, can be interpreted and predicted automatically. Our analysis of the existing German BERT models showed that they strongly depend on structural features, such as the tag-surrounding punctuation. For example, tags and no-tags were oftentimes regarded as more similar to one another than to other instances of the respective classes due to matching punctuation, while syntactic and semantic properties of TQs were not recognizably detected.</p>
      <p>We examined the tag vectors generated by one of these models in more detail. The mapping of linguistic properties of TQs to the automatically formed clusters of the tag vectors confirmed previous observations that the target of confirmation is a more informative feature for tags’ differentiation than, for instance, the syntactic properties of the TQ anchor.</p>
      <p>Furthermore, we fine-tuned the selected model on the tag word prediction task. The tag word class (tag/no-tag) was predicted with near 100% probability in most cases. The prediction of the tag word form proved to be more challenging, though. Especially the experiments with training on single corpora highlighted the importance of the dataset size: The predicted tag word probabilities directly correlated with the number of instances they were represented by in the training set. Overall, the results showed that with standard parameters and given a large enough training dataset (14,045 tags and 20,860 no-tags, in our case) the fine-tuned model works well for this task. However, hyper-parameter optimization and class weighting are worth exploring in the future.</p>
      <p>The difficulties with the automatic distinction between the tag forms are not overly surprising, after all. Cases where different TQ types share syntactic and semantic properties of the anchor provide limited information for BERT to rely on in order to, for example, rule out the use of certain tags, such as oder in informing TQs. The absence of additional contextual information hinders the judgments about the tags’ felicity in such cases. Nonetheless, certain TQs contain sufficient information to predict the tag even without context, e.g., ja in imperative directives. Since they differ both semantically and syntactically from TQs with declarative anchors, we would expect BERT to pick up on their specific properties. However, possibly because of their underrepresentation in our dataset, these TQs were not identified. Augmentation of the dataset with certain (synthetically generated) TQ types would facilitate further testing of BERT’s capacity to detect their features.</p>
      <p>We conclude that BERT provides indications of TQ features that are useful for tag differentiation. It also seems to correctly recognize which tags constitute appropriate substitutes for each other, although this needs further testing on a larger dataset. In future work, it could be worth including the right-side context of the tags (not present in our data) to fully exploit the power of BERT to use bidirectional context.</p>
    </sec>
    <sec id="sec-9">
      <title>Acknowledgments</title>
      <p>We thank Tatjana Scheffler and Manfred Stede for discussions and valuable suggestions. We are grateful to the anonymous reviewers for their helpful comments.</p>
      <p>This research was funded by the PhD completion scholarship from the Graduate Fund of the State of Brandenburg awarded by the University of Potsdam, and by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation), CRC 1567, Project ID 470106373.</p>
    </sec>
    <sec id="sec-10">
      <title>A. Visualization of BERT Vectors</title>
      <p>[Figures 1–3: 3D visualizations of the tag word vectors per corpus, each with panels (a) bert-base-german-cased, (b) bert-base-german-dbmdz-cased, and (c) gbert-large. Figure 4: K-Means clusters with panels (a) CallHome, (b) OpenSubtitles, and (c) Twitter.]</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] F. Bianchi, B. Yu, and J. Tagliabue. “BERT Goes Shopping: Comparing Distributional Models for Product Representations”. In: Proceedings of the 4th Workshop on e-Commerce and NLP. Online: Association for Computational Linguistics, 2021, pp. 1-12. doi: 10.18653/v1/2021.ecnlp-1.1.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] B. Chan, S. Schweter, and T. Möller. “German’s Next Language Model”. In: Proceedings of the 28th International Conference on Computational Linguistics. Barcelona, Spain (Online): International Committee on Computational Linguistics, 2020, pp. 6788-6794. doi: 10.18653/v1/2020.coling-main.598.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] Y. Clausen. “You shall know a tag by the context it occurs in: An analysis of German tag questions and their responses in spontaneous conversations”. In: ConSOLE XXIX: Proceedings of the 29th Conference of the Student Organization of Linguistics in Europe. Ed. by A. Holtz, I. Kovač, R. Puggaard-Rode, and J. Wall. Leiden: Leiden University Centre for Linguistics, 2021, pp. 116-140. url: https://www.universiteitleiden.nl/binaries/content/assets/geesteswetenschappen/lucl/sole/console-xxix.pdf.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] Y. Clausen and T. Scheffler. “A corpus-based analysis of meaning variations in German tag questions: Evidence from spoken and written conversational corpora”. In: Corpus Linguistics and Linguistic Theory 18.1 (2022), pp. 1-31. doi: 10.1515/cllt-2019-0060.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] Y. Clausen and T. Scheffler. “Commitments in German Tag Questions: An Experimental Study”. In: Proceedings of the 24th Workshop on the Semantics and Pragmatics of Dialogue - Full Papers. Virtually at Brandeis, Waltham, New Jersey: SEMDIAL, 2020. url: http://semdial.org/anthology/Z20-Clausen_semdial_0014.pdf.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Ed. by J. Burstein, C. Doran, and T. Solorio. Minneapolis, MN, USA: Association for Computational Linguistics, 2019, pp. 4171-4186. doi: 10.18653/v1/N19-1423.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] A. Ettinger. “What BERT Is Not: Lessons from a New Suite of Psycholinguistic Diagnostics for Language Models”. In: Transactions of the Association for Computational Linguistics 8 (2020), pp. 34-48. doi: 10.1162/tacl_a_00298.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] J. Hagemann. “Tag questions als Evidenzmarker. Formulierungsdynamik, sequentielle Struktur und Funktionen redezuginterner tags”. In: Gesprächsforschung - Online-Zeitschrift zur verbalen Interaktion 10 (2009), pp. 145-176.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] J. Heim. “Turn-peripheral management of Common Ground: A study of Swabian gell”. In: Journal of Pragmatics 141 (2019), pp. 130-146. doi: 10.1016/j.pragma.2018.12.007.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] K. Karins, R. MacIntyre, M. Brandmair, S. Lauscher, and C. McLemore. CALLHOME German Transcripts LDC97T15. Web Download. Philadelphia: Linguistic Data Consortium, 1997. url: https://catalog.ldc.upenn.edu/LDC97T15.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] T. Kato, R. Miyata, and S. Sato. “BERT-Based Simplification of Japanese Sentence-Ending Predicates in Descriptive Text”. In: Proceedings of the 13th International Conference on Natural Language Generation. Dublin, Ireland: Association for Computational Linguistics, 2020, pp. 242-251. url: https://aclanthology.org/2020.inlg-1.31.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] D. Kimps, K. Davidse, and B. Cornillie. “A speech function analysis of tag questions in British English spontaneous dialogue”. In: Journal of Pragmatics 66 (2014), pp. 64-85. doi: 10.1016/j.pragma.2014.02.013.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] W. Li, A. Grissom II, and J. Boyd-Graber. “An Attentive Recurrent Model for Incremental Prediction of Sentence-final Verbs”. In: Findings of the Association for Computational Linguistics: EMNLP 2020. Online: Association for Computational Linguistics, 2020, pp. 126-136. doi: 10.18653/v1/2020.findings-emnlp.12.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] P. Lison and J. Tiedemann. “OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles”. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16). Portorož, Slovenia: European Language Resources Association (ELRA), 2016, pp. 923-929. url: https://aclanthology.org/L16-1147.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] S. Mandokoro, N. Oka, A. Matsushima, C. Fukada, Y. Yoshimura, K. Kawahara, and K. Tanaka. “Construction and Evaluation of a Self-Attention Model for Semantic Understanding of Sentence-Final Particles”. In: arXiv preprint (2022). doi: 10.48550/arXiv.2210.00282.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. “PyTorch: An Imperative Style, High-Performance Deep Learning Library”. In: Advances in Neural Information Processing Systems 32. Ed. by H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett. Curran Associates, Inc., 2019, pp. 8024-8035. url: http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] A. Rogers, O. Kovaleva, and A. Rumshisky. “A Primer in BERTology: What We Know About How BERT Works”. In: Transactions of the Association for Computational Linguistics 8 (2020), pp. 842-866. url: https://aclanthology.org/2020.tacl-1.54.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] A. Rosenberg and J. Hirschberg. “V-Measure: A Conditional Entropy-Based External Cluster Evaluation Measure”. In: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL). Prague, Czech Republic: Association for Computational Linguistics, 2007, pp. 410-420. url: https://aclanthology.org/D07-1043.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] T. Scheffler. “A German Twitter Snapshot”. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14). Reykjavik, Iceland: European Language Resources Association (ELRA), 2014, pp. 2284-2289. url: http://www.lrec-conf.org/proceedings/lrec2014/pdf/1146_Paper.pdf.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] M. Wiltschko, D. Denis, and A. D’Arcy. “Deconstructing variation in pragmatic function: A transdisciplinary case study”. In: Language in Society 47.4 (2018), pp. 569-599. doi: 10.1017/S004740451800057X.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, and A. Rush. “Transformers: State-of-the-Art Natural Language Processing”. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Online: Association for Computational Linguistics, 2020, pp. 38-45. doi: 10.18653/v1/2020.emnlp-demos.6.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>