=Paper=
{{Paper
|id=Vol-2624/paper5
|storemode=property
|title=To BERT or not to BERT – Comparing Contextual Embeddings in a Deep Learning Architecture for the Automatic Recognition of four Types of Speech, Thought and Writing Representation
|pdfUrl=https://ceur-ws.org/Vol-2624/paper5.pdf
|volume=Vol-2624
|authors=Annelen Brunner,Ngoc Duyen Tanja Tu,Lukas Weimer,Fotis Jannidis
|dblpUrl=https://dblp.org/rec/conf/swisstext/BrunnerTWJ20
}}
==To BERT or not to BERT – Comparing Contextual Embeddings in a Deep Learning Architecture for the Automatic Recognition of four Types of Speech, Thought and Writing Representation==
Annelen Brunner, Ngoc Duyen Tanja Tu (Leibniz-Institut für Deutsche Sprache, R5 6-13, D-68161 Mannheim, brunner|tu@ids-mannheim.de)
Lukas Weimer, Fotis Jannidis (Universität Würzburg, Am Hubland, D-97074 Würzburg, lukas.weimer|fotis.jannidis@uni-wuerzburg.de)

Abstract

We present recognizers for four very different types of speech, thought and writing representation (STWR) for German texts. The implementation is based on deep learning with two different customized contextual embeddings, namely FLAIR embeddings and BERT embeddings. This paper gives an evaluation of our recognizers with a particular focus on the differences in performance we observed between those two embeddings. FLAIR performed best for direct STWR (F1=0.85), BERT for indirect (F1=0.76) and free indirect (F1=0.59) STWR. For reported STWR, the comparison was inconclusive, but BERT gave the best average results and the best individual model (F1=0.60). Our best recognizers, our customized language embeddings and most of our test and training data are freely available and can be found via www.redewiedergabe.de or at github.com/redewiedergabe.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Speech, thought and writing representation (STWR) is an interesting phenomenon both from a narratological and a linguistic point of view. The manner in which a character's voice is incorporated into the narrative is strongly linked to narrative techniques as well as to the construction of the narrative world and is therefore a standard topic in narratology (e.g. McHale (2014); Genette (2010); Leech and Short (2013)). For some phenomena, such as free indirect discourse and stream of consciousness, there is a large amount of research (e.g. Banfield (1982); Fludernik (1993); Pascal (1977)). In linguistics, the grammatical, lexical and functional characteristics of STWR have also been of interest (e.g. Weinrich (2007); Zifonun et al. (1997); Hauser (2008); Fabricius-Hansen et al. (2018)).

To conduct either narratological or linguistic studies on STWR based on big data, being able to automatically detect different types of STWR would be of great benefit. This was our motivation to develop recognizers for the following four forms of STWR, which have been distinguished in literary and linguistic theory. (The stretch of STWR is printed in italics in the following examples.)

Direct STWR is a quotation of a character's speech, thought or writing. It is frequently – though not always – enclosed by quotation marks and/or introduced by a framing clause.

Dann sagte er: "Ich habe Hunger." (Then he said: "I'm hungry.")

Free indirect STWR, also known as "erlebte Rede" in German, is mainly used in literary texts to represent a character's thoughts while still maintaining characteristics of the narrator's voice (e.g. past tense and third person pronouns).

Er war ratlos. Woher sollte er denn hier bloß ein Mittagessen bekommen? (He was at a loss. Where should he ever find lunch here?)

Indirect STWR is a paraphrase of the character's speech, thought or writing, composed of a framing clause (not counted as part of the STWR) with a dependent subordinate clause (often using subjunctive mode) or an infinitive phrase.

Er fragte, wo das Essen sei. (He asked where the food was.)
Reported STWR is defined as a mention of a speech, thought or writing act that may or may not specify the topic and does not take the form of indirect STWR.

Er sprach über das Mittagessen. (He talked about lunch.)

In the following, we describe our approach to developing recognizers for these four STWR types and evaluate our results with a particular focus on the differences between the two contextual embeddings that proved most successful for our task, BERT and FLAIR.

2 Related work

2.1 STWR recognizers

Automatic STWR recognition focuses mainly on the forms direct STWR (e.g. Schricker et al. (2019); Jannidis et al. (2018); Tu et al. (2019); Brunner (2015); Brooke et al. (2015) for German texts; Schöch et al. (2016) for French texts) and indirect STWR (e.g. Schricker et al. (2019); Brunner (2015) for German texts; Lazaridou et al. (2017); Scheible et al. (2016) for English texts; Freitas et al. (2016) for Portuguese texts). For free indirect and reported STWR, recognizers were implemented by Brunner (2015) and Schricker et al. (2019); the latter builds upon work by the former. In addition to that, Papay and Padó (2019) propose a corpus-agnostic neural model for quotation detection.

Since we developed recognizers for German, we will only take a closer look at recognizers trained and tested on German texts. Jannidis et al. (2018) developed a deep-learning based recognizer for direct speech in German novels, which works without quotation marks and achieves an accuracy of 0.84 in sentence-wise evaluation. The algorithm by Brooke et al. (2015) is a simple rule-based algorithm which matches quotation marks; they do not report any scores. Like Jannidis et al. (2018), Tu et al. (2019) focus on developing a recognizer for direct speech which works without quotation marks, but they used a rule-based approach, achieving a sentence-level accuracy between 80.5% and 85.4% for fictional and 60.8% for non-fictional data. Brunner (2015) uses a corpus of 13 short German narratives. For each type of STWR, she implements a rule-based model and trains a RandomForest model, evaluated in ten-fold cross validation. The best F1 scores, evaluated on sentence level, were achieved by the rule-based approach for indirect (F1=0.71) and reported (F1=0.57) STWR and by the RandomForest model for direct (F1=0.87) and free indirect (F1=0.40) STWR. Schricker et al. (2019) use the same corpus, but split the data into a stratified training and test set. They use different features than Brunner and train three different machine learning algorithms: RandomForest, Support Vector Machine and Multilayer Perceptron. RandomForest was most successful and gave sentence-wise F1 scores of 0.95 for direct, 0.79 for indirect, 0.70 for free indirect and 0.49 for reported STWR. Papay and Padó (2019) test their corpus-agnostic quotation detection model on Brunner's corpus and approximate her RandomForest results.

Our recognizers fill a need regarding the recognition of STWR in German texts, as they deal with all four forms of STWR and are, at the same time, trained and tested on a much larger data base than Brunner's corpus, making our results much more reliable. In addition to that, our data comprises not only fictional but also non-fictional texts. An earlier version of our recognizer for free indirect STWR was discussed in Brunner et al. (2019). We improved on this version by adding more training data and achieving higher scores with the BERT-based model.
2.2 Language embeddings

As the testing of different language embeddings was a central component in the development of our recognizers, we will briefly outline characteristics and research concerning the two most successful ones that will be in focus in the rest of this paper: FLAIR embeddings (Akbik et al., 2018) and BERT embeddings (Devlin et al., 2019).

Both have in common that they produce context-dependent embeddings, as opposed to static word embeddings such as fastText (Bojanowski et al., 2016), GloVe (Pennington et al., 2014) or Word2Vec (Mikolov et al., 2013). That means they assign an embedding to a word based on its context and are therefore able to capture polysemy. Contextual embeddings have been shown to be of great benefit in several NLP tasks, e.g. predicting the topic of a tweet (Joshi et al., 2019) or part-of-speech tagging, lemmatization and dependency parsing (Straka et al., 2019). Though FLAIR and BERT both produce contextual embeddings, they differ in several features.

FLAIR produces character-level embeddings. A sentence is passed to the algorithm as a sequence of characters and the task is to predict the next character based on the previous characters. When using FLAIR embeddings, it is recommended to combine two independently trained models: i) a forward model and ii) a backward model. The forward model reads every character in a sentence from left to right, the backward model from the opposite direction. This character-based embedding architecture gives advantages in capturing morphological and semantic structures (cf. Akbik et al. (2018)).

While FLAIR embeddings only know their previous context, BERT embeddings know the previous as well as the succeeding context at the same time. The training of BERT embeddings consists of two tasks: 1) given a sequence of tokens in which 15% of the tokens are masked by the masked language model, predict the masked tokens based on their context; 2) given a pair of sentences, predict whether the second sentence follows the first one. The advantage of this model is that it learns associations between tokens as well as between sentences. This is important for token-level tasks, such as question answering (cf. Devlin et al. (2019)).

There are no systematic analyses comparing the performance of embeddings for a sequence labeling task in German, like we will do in this paper. However, there is work focusing on the use of different embeddings in NLP tasks on English texts: e.g. Wiedemann et al. (2019) compare BERT, ELMo and FLAIR embeddings in a word sense disambiguation task, where BERT performed best. Sharma and Daniel (2019) compare BioBERT (Lee et al., 2019) and FLAIR embeddings, more precisely the pubmed-x model, in a biomedical named entity recognition task. They find that stacking FLAIR embeddings with BioELMo yields better results than using only FLAIR. Compared to BioBERT, the results of FLAIR are very close, although the FLAIR embeddings are pretrained on a much smaller dataset. There is also a comparison between BERT (bert-base-uncased, bert-base-chinese, bert-base-multilingual-uncased), FLAIR (bg, cs, de, en, fr, nl, pl, pt, sl, sv) and ELMo (english) in a part-of-speech tagging, lemmatization and dependency parsing task on different languages: Straka et al. (2019) showed that BERT outperforms ELMo as well as FLAIR embeddings in dependency parsing. As opposed to that, ELMo performs best in part-of-speech tagging and lemmatization, followed by FLAIR. Therefore Straka et al. (2019) conclude that ELMo is best and FLAIR embeddings are second best in capturing morphological and orthographic information, while BERT is best in capturing syntactic information.
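As a concrete illustration of the two embedding types (a minimal sketch added for this edition, not part of the original paper), both can be instantiated and applied with the FLAIR framework in a few lines. The model identifiers used here are standard publicly available names and serve only as examples; TransformerWordEmbeddings is the interface of recent FLAIR versions for BERT-style models.

<pre>
from flair.data import Sentence
from flair.embeddings import FlairEmbeddings, StackedEmbeddings, TransformerWordEmbeddings

# Character-level FLAIR language models: the forward model reads the sentence
# left to right, the backward model right to left; they are used as a pair.
flair_stack = StackedEmbeddings([
    FlairEmbeddings("de-forward"),
    FlairEmbeddings("de-backward"),
])

# BERT-style embeddings see the left and right context at the same time.
bert_embeddings = TransformerWordEmbeddings("bert-base-german-cased")

sentence = Sentence('Dann sagte er: "Ich habe Hunger."')
flair_stack.embed(sentence)
for token in sentence:
    # each token now carries a context-dependent vector
    print(token.text, token.embedding.shape)
</pre>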
3 Method

We defined the recognition of STWR as a sequence labeling task on token level. For each of the four types of STWR, a separate model was trained on binary labels ("token is part of this type of STWR: yes/no"). The input data consists of chunks of up to 100 tokens, which may span several sentences. The chunks may never cross borders between different texts or cut sentences (except when a sentence exceeds 100 tokens) and can therefore also be shorter than the maximum.

To train our tagging model, we used the SequenceTagger class of the FLAIR framework (Akbik et al., 2019), which implements a BiLSTM-CRF architecture on top of a language embedding (as proposed by Huang et al. (2015)). We use two BiLSTM layers with a hidden size of 256 each and one CRF layer. This setting was decided after running tests with only one BiLSTM layer, which gave considerably worse results, and with three BiLSTM layers, which led to no significant improvements.

We tested many different configurations for the language embeddings in this setup. Initial tests were done with just fastText embeddings. The results were much worse than those of the two configurations that became our main focus: a) a fastText model stacked with a FLAIR forward and a FLAIR backward model (as recommended in Akbik et al. (2018)) and b) a BERT model.
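The following sketch (added for illustration, not the authors' actual scripts) shows how such a tagger can be set up in the FLAIR framework under the settings described above: two BiLSTM layers with hidden size 256, a CRF output layer, and configuration a), fastText stacked with FLAIR forward/backward embeddings. The corpus layout, file names, tag name and embedding paths are assumptions, and the remaining training hyperparameters are left at FLAIR defaults because they are not reported in the paper.

<pre>
from flair.datasets import ColumnCorpus
from flair.embeddings import FlairEmbeddings, StackedEmbeddings, WordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# Token-level corpus in two columns (token, binary label), pre-split into
# chunks of up to 100 tokens; paths and the tag name "direct" are placeholders.
corpus = ColumnCorpus("data/direct", {0: "text", 1: "direct"},
                      train_file="train.txt", dev_file="dev.txt", test_file="test.txt")
tag_dictionary = corpus.make_tag_dictionary(tag_type="direct")

# Configuration a): custom fastText vectors stacked with FLAIR forward/backward
# language models (placeholder paths to the custom-trained models).
embeddings = StackedEmbeddings([
    WordEmbeddings("resources/fasttext-historical-500.gensim"),
    FlairEmbeddings("resources/flair-historical-forward.pt"),
    FlairEmbeddings("resources/flair-historical-backward.pt"),
])

# BiLSTM-CRF on top of the embeddings: two BiLSTM layers, hidden size 256, CRF output.
tagger = SequenceTagger(hidden_size=256, rnn_layers=2, use_crf=True,
                        embeddings=embeddings,
                        tag_dictionary=tag_dictionary, tag_type="direct")

trainer = ModelTrainer(tagger, corpus)
trainer.train("models/direct")  # remaining hyperparameters: FLAIR defaults
</pre>

A separate model of this kind is trained for each of the four STWR types; for configuration b), the stacked embeddings are simply replaced by BERT embeddings.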
Except for free indirect, all of our recognizers were trained and tested on historical German. Using out-of-the-box embeddings, which are trained on modern texts such as a German Wikipedia dump, an open legal data dump, OpenSubtitles or the EU Bookshop corpus (Tiedemann, 2012), is therefore problematic. So we custom-trained our own fastText and FLAIR embeddings and fine-tuned the BERT embeddings. The following settings were used:

Skip-gram fastText models: We used the default settings as recommended by the fastText tutorial, i.e. we trained for five epochs, set the learning rate to 0.05 and set the minimum character n-gram size to 3 and the maximum to 6. We varied the model dimensions as well as the training material: fastTextTrain_clean is a smaller, cleaner corpus, while fastTextTrain contains additional material with OCR errors (cf. section 4.1). On each training set, one model with 300 and one model with 500 dimensions was trained.

FLAIR: A forward and a backward FLAIR embedding with a hidden size of 1024 were trained. All settings were chosen according to the recommendations of the FLAIR tutorial, i.e. the sequence length was set to 250, the mini-batch size to 100, the learning rate to 0.20, the annealing factor to 0.4 and the patience value to 25. The model stopped training after 10 epochs due to low loss.

BERT: We used the PyTorch script finetune_on_pregenerated.py to fine-tune the pretrained bert-base-german-cased model with the recommended default configuration: epochs: 3, gradient accumulation steps: 1, train batch size: 32, learning rate: 0.00003 and max_seq_len: 128.

4 Data

4.1 Training data for the embeddings

For the training/fine-tuning of the embeddings, 9,577 fictional and non-fictional German texts from the 19th and early 20th century were selected.

For the fine-tuning of the BERT embeddings, we fed all data – split into sentences – into the PyTorch script pregenerate_training_data.py, which transforms it into input data compatible with BERT. The BERT fine-tuning tutorial recommends creating one epoch of input data for each training epoch, so that BERT is not trained on the same random splits in each epoch. We fine-tuned BERT for 3 epochs, so we generated 3 epochs of data.

For the FLAIR embeddings, 70% of the 6,441,372 sentences in the data, i.e. 4,508,960 sentences, were randomly drawn to form the training corpus. For the validation corpus, 15%, i.e. 966,206 sentences, were randomly drawn. The rest was used for testing purposes.

For training the fastText embeddings, we used two different inputs. fastTextTrain contained all of the 137,093,995 tokens of our data. For fastTextTrain_clean we removed all texts that were recognized with OCR and thus contained typical OCR errors, which resulted in a smaller input set of 131,360,863 tokens.
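Read as a recipe, the embedding customization described above could look roughly like the sketch below. This is an illustration under the public fastText and FLAIR training interfaces, with placeholder file paths; only the hyperparameters explicitly reported in this paper are set, and the exact scripts used by the authors may differ.

<pre>
import fasttext
from flair.data import Dictionary
from flair.models import LanguageModel
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus

# Skip-gram fastText vectors: 5 epochs, learning rate 0.05, character n-grams of
# length 3 to 6; trained once with 300 and once with 500 dimensions per input corpus.
ft = fasttext.train_unsupervised("fastTextTrain_clean.txt", model="skipgram",
                                 dim=300, epoch=5, lr=0.05, minn=3, maxn=6)
ft.save_model("fasttext-historical-300.bin")

# Forward FLAIR character language model with hidden size 1024; the backward model
# is trained the same way with the forward flag set to False. The corpus folder
# layout (train/, valid.txt, test.txt) follows the FLAIR language-model tutorial.
dictionary = Dictionary.load("chars")
corpus = TextCorpus("flair_corpus/", dictionary, True, character_level=True)
lm = LanguageModel(dictionary, is_forward_lm=True, hidden_size=1024, nlayers=1)
trainer = LanguageModelTrainer(lm, corpus)
trainer.train("flair-historical-forward",
              sequence_length=250, mini_batch_size=100,
              learning_rate=0.20, anneal_factor=0.4, patience=25)
</pre>

The BERT side of the customization uses the scripts named above (pregenerate_training_data.py to build one epoch of input data per training epoch, then finetune_on_pregenerated.py with the listed defaults) rather than custom code.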
4.2 Training data for the recognizers

The recognizers for direct, indirect and reported STWR were trained on historical German texts – excerpts as well as full texts – that were published from the middle of the 19th to the early 20th century. The data comprises fiction as well as non-fiction (newspaper and journal articles) in nearly equal proportion; fiction is somewhat more dominant. Roughly half of the data was manually labeled by two human annotators independently of one another. Then a third person compared the annotations, adjudicated discrepancies and created a consensus annotation. The rest was labeled by a single annotator.

For indirect STWR, the training data was supplemented with 16 additional historical full texts (9 fictional and 7 non-fictional) to increase the number of instances. To speed up the annotation process, these texts were automatically annotated by one of our earlier recognizer models and then manually checked. The annotators looked at the whole texts, so false negatives were corrected as well.

All the historical data is published as corpus REDEWIEDERGABE (Brunner et al., 2020) and freely available.

As the historical data contained far too few instances of free indirect STWR, we had to create a separate training corpus for this STWR type. The basis were 150 instances of free indirect STWR with little to no context, manually extracted from 20th century novels. In addition to that, full texts and excerpts from modern popular crime novels as well as dime novels were automatically annotated with a basic rule-based recognizer that used typical surface indicators. Those annotations were then verified by human annotators. On this data, we trained an early recognizer (Brunner et al., 2019), which was then used to annotate additional historical fictional texts. These annotations were again verified by human annotators before they were added to the training material as well. It should be noted that in this semi-automated annotation process, instances that were not detected by the early recognizers had no chance of being annotated. Because of this, the data most likely contains false negatives.

For model training, our data was split into a training corpus (648,338 tokens for direct and reported; 700,202 tokens for indirect; 3,804,226 tokens for free indirect) and a validation corpus (97,316 tokens for direct, reported and indirect; 181,942 tokens for free indirect). Table 1 shows the occurrences of each form of STWR in its training and validation corpus, given in tokens, percentage of tokens in its corpus and instances. (An instance is defined here as an uninterrupted sequence of tokens annotated as the same type of STWR, which can be longer than one sentence. This is of course a simplification, as two conceptually separate stretches, such as lines of dialogue by two different people, will be counted as one instance if they follow directly after each other, but it can serve as a rough guideline. On average, a direct instance is 46 tokens long, an indirect instance 15 tokens, a reported instance 10 tokens and a free indirect instance 34 tokens.)

           Training corpus                  Validation corpus               Test corpus
           Tokens    Percent  Instances     Tokens    Percent  Instances    Tokens    Percent  Instances
Direct     212,467   32.77    6,293         24,321    24.99    878          18,307    18.71    605
Indirect    49,222    7.03    3,505          8,502     8.74    571           8,664     8.86    545
Reported    66,817   10.31    7,522         11,404     7.73    1,219        10,696    10.93    976
Free ind   236,011    6.30    6,887          7,005     3.85    205           3,002    13.09     98

Table 1: The occurrences of each form of STWR in the training, validation and test corpora, given in tokens, percentage of tokens in the respective corpus, and instances.

4.3 Test data for the recognizers

Our test data for the direct, indirect and reported STWR recognizers has 97,863 tokens and comprises excerpts from historical fictional and non-fictional texts in equal proportions. They were labeled with a consensus annotation as described in section 4.2. The test data for the free indirect STWR recognizer has 22,935 tokens and comprises 22 excerpts from dime novels, which were manually labeled by one human annotator. Table 1 shows the occurrences of each form of STWR in its test corpus, given in tokens, percentage of test corpus tokens and instances.
5 Results

We report the scores of our most successful language embedding configurations, i.e. fine-tuned BERT and fastText stacked with FLAIR forwards and backwards.

Notably, our fine-tuned BERT model performed better than the regular BERT model for all STWR types, even free indirect, where the STWR recognizers were tested on modern German fiction (as opposed to historical German fiction and non-fiction for the other models). The same is true for the custom-trained fastText + FLAIR models, which outperformed models pretrained on modern German for all STWR types as well. We speculate that this is because the customization made the models better suited for literary texts in general, even though it was done on historical German.

The most successful configuration of fastText + FLAIR varies slightly between the different forms of STWR with respect to the fastText model that gave the best results. The fastText specifications for the four types of STWR are detailed in table 2.

           Dimensions   Training data
Direct     500          fastTextTrain_clean
Indirect   300          fastTextTrain
Reported   500          fastTextTrain_clean
Free ind   300          fastTextTrain

Table 2: Dimensions and training data of the fastText models used by the different FLAIR-based recognizers.

We trained each model with the same configuration three times to correct for random variation in the deep learning results. Table 3 reports the average value of each score and the standard deviation, calculated on token level.

On average, the recognizers using BERT embeddings scored better for all types of STWR except direct, for which the recognizers using the stacked fastText and FLAIR embeddings proved consistently more successful. Most striking was BERT's advantage for free indirect, where especially the recall improved. It should be noted, though, that the FLAIR-based free indirect model consistently gave better precision.

However, when looking at the standard deviation over the three runs and the range of results, we see that the F1 score ranges of the FLAIR and BERT recognizers overlap for reported, so the results of the comparison are not conclusive for this STWR type. For the other three STWR types, the F1 score ranges are clearly distinct, even though the free indirect models show a high variance.

          fastText + FLAIR                                    BERT
          F1              Prec            Rec                 F1              Prec            Rec
Dir       0.84 (0.0047)*  0.90 (0.0245)*  0.79 (0.0094)*      0.80 (0.0047)   0.86 (0.017)    0.74 (0.0082)
Ind       0.73 (0.0082)   0.78 (0.0082)   0.68 (0.0205)       0.76 (0.0)*     0.79 (0.0236)*  0.73 (0.017)*
Rep       0.56 (0.0125)   0.68 (0.0094)   0.48 (0.0141)       0.58 (0.017)*   0.69 (0.0163)*  0.51 (0.034)*
Fr ind    0.49 (0.017)    0.86 (0.0094)*  0.35 (0.0125)       0.57 (0.0216)*  0.80 (0.017)    0.44 (0.0309)*

Table 3: Average scores over three runs for each form of STWR, standard deviation given in brackets; the best average score per metric is marked with *.

Table 4 lists the scores of the individual recognizers from the three training runs that produced the best results.

           F1     Prec   Rec    Embedding
Direct     0.85   0.93   0.78   cust. fastText+FLAIR
Indirect   0.76   0.81   0.71   BERT fine-tuned
Reported   0.60   0.67   0.54   BERT fine-tuned
Free ind   0.59   0.78   0.47   BERT fine-tuned

Table 4: Scores of our top models.
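For readers who want to reproduce this kind of table, the scoring protocol described here reduces to token-level precision, recall and F1 over the binary labels, averaged over three runs with the standard deviation reported. The snippet below is a minimal, self-contained illustration with toy data, added for this edition and not taken from the authors' code.

<pre>
from statistics import mean, pstdev

def token_prf(gold, pred):
    """Token-level precision, recall and F1 for one binary STWR label (1 = part of the STWR)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example: one gold label sequence and the predictions of three training runs.
gold = [0, 1, 1, 1, 0, 0, 1, 0]
predictions_per_run = [
    [0, 1, 1, 0, 0, 0, 1, 0],
    [0, 1, 1, 1, 0, 1, 1, 0],
    [0, 0, 1, 1, 0, 0, 1, 0],
]

runs = [token_prf(gold, pred) for pred in predictions_per_run]
for name, values in zip(("precision", "recall", "F1"), zip(*runs)):
    print(f"{name}: {mean(values):.2f} ({pstdev(values):.4f})")
</pre>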
To give an impression of how difficult it is for humans to annotate these forms, table 5 presents the agreement scores between human annotators. The scores for direct, indirect and reported STWR are based on corpus REDEWIEDERGABE, the corpus of fictional and non-fictional historical texts our test data was drawn from. The score for free indirect was calculated directly on the free indirect test corpus.

           F1     Prec   Rec    Fleiss' Kappa
Direct     0.94   0.94   0.94   0.92
Indirect   0.75   0.77   0.74   0.73
Reported   0.56   0.56   0.56   0.49
Free ind   0.69   0.64   0.73   0.66

Table 5: Human annotator agreement for the STWR types.

We performed two types of error analysis: First, we looked at the first 10,000 tokens of our test data and categorized the types of errors made by our top recognizers (cf. table 4). This gives an impression of the types of challenges the four forms of STWR pose and how well our recognizers can deal with them, which is important practical information for anyone using them.

Second, we also looked at the first 20 differences between the results of the best models trained with BERT vs. the best models trained with FLAIR. The goal was to find indications of which specific properties of the two contextual embeddings made them better or worse suited to a particular task.

As the four types of STWR have very different characteristics, we will discuss each of them separately.

5.1 Direct STWR

Direct STWR has two main characteristics: First, being a quotation of the character's voice, it tends to use first and second person pronouns and present tense. Second, it is often marked with quotation marks, but the reliability of this particular indicator varies dramatically between different texts. Its instances can also be very long, spanning multiple sentences. We observed that about half of the false positives as well as the false negatives are partial matches, i.e. the recognizer did correctly identify a stretch of direct STWR, but either broke off too early or extended it too far.

A main cause of false negatives were missing quotation marks, i.e. unmarked stretches of direct STWR, especially if those occurred in first person narration. In these cases, the recognizer is missing its two most reliable indicators for distinguishing direct STWR from narrator text at the same time. Another source of false negatives are very long stretches of direct STWR, such as embedded narratives. The recognizer loses the wider context and tends to treat this STWR as narrator text, especially if it contains nested direct STWR and exhibits characteristics such as third person pronouns and past tense.

The main source of false positives is also related to narrative perspective: In a first person narration or a letter, the recognizer tends to annotate narrator text as direct STWR – the reverse of the problem described above. Note that these cases are very hard for human annotators as well and can only be solved by knowing a wide context. The recognizer knows a context of at most 100 tokens, and we observed that wrong decisions often occur at the beginning of a context chunk and are then propagated to its end. Another source of false positives are – predictably – stretches of text in quotation marks that are not direct STWR, though these are a relatively rare occurrence. We also observed mix-ups with the forms indirect and free indirect STWR, especially if unusual punctuation was used, though this was rare as well.

In summary, we can say that for direct STWR narrative perspective is a major factor. The test material was deliberately designed to contain texts written both in first and third person perspective. If evaluated separately, we could observe a significantly better performance for third person perspective (see table 6). (We experimented with training two specialized direct models, one using only texts with first person perspective and one using only texts with third person perspective as training material, and evaluated them on the matching types of texts. Unfortunately the performance was worse than that of the model trained on the complete training corpus, probably because of the significant reduction of training material.)

                F1     Prec   Rec
First person    0.80   0.86   0.75
Third person    0.87   0.97   0.79

Table 6: Evaluation for the direct recognizer (top model, FLAIR based), split into texts with first and third person perspective.
Direct STWR is the only type of STWR where FLAIR embeddings performed better than BERT with a clear advantage. Looking at the first 20 differences between the recognizers, we found that BERT is more prone to annotating letters and first person perspective narratives as direct STWR. It also breaks off prematurely more often, indicating that FLAIR seems to be better at maintaining the context of the annotation. On the other hand, FLAIR tends to make more minor mistakes, such as not annotating a dash when it is used instead of a quotation mark to introduce direct STWR. This points to the more character-based behaviour of FLAIR, which – in general – seems to serve well for direct STWR, maybe because of the prevalence of typographical indicators. The wider context of the BERT embeddings does not seem to help with the perspective problem, but instead introduces additional errors.

5.2 Indirect STWR

Indirect representation in our definition takes the form of a subordinate clause or an infinitive phrase, dependent on a framing clause which is not part of the STWR itself. Thus, instances of indirect STWR are always shorter than one sentence. Of the four STWR forms, it is the one that is most strongly defined by its syntactical form.

One difficulty are cases where the indirect STWR contains subclauses or, conversely, is followed by a subclause that is not part of the instance. In these structures, the recognizer tends to have trouble identifying the correct borders of the STWR. Looking at the error analysis, about one third of the errors for both false positives and false negatives are partial matches, mostly caused by this problem.

The biggest cause of errors are cases where the typical indirect structure – a subclause starting with dass, ob (that, whether) or an interrogative pronoun – is paired with an unusual frame. This leads to false positives if the frame contains words that usually indicate STWR, such as es scheint außer Frage, dass ... (it seems out of the question that ...). Though this phrase does not introduce STWR, the word Frage (question) still triggers an annotation. On the flipside, cases of indirect STWR tend to be missed if they are introduced by phrases that have an unusual structure and don't contain words that are strongly associated with speech, thought or writing acts. We also observed that unusual punctuation, such as dashes, multiple dots and colons (used instead of a comma at the border of an indirect STWR), has negative effects on recognition accuracy.

In a comparison between the indirect models using BERT and FLAIR embeddings, we observed that both models make errors of the types described above, though in different places. However, overall FLAIR seems more susceptible to interference in the form of unusual punctuation or framing phrases that are interjected in the middle of a stretch of indirect STWR. It is also less successful than BERT at recognizing STWR instances that are introduced with nouns instead of verbs.
5.3 Reported STWR

Reported STWR is a fairly difficult form even for human annotators, mainly because it is so similar to pure narration that it can be hard to distinguish. It should be noted that the gold standard annotation in this case contains a number of uncertain instances that could be debatable for humans as well. Reported instances tend to be rather short, varying from one token to one sentence at most, and syntactically diverse. The most reliable indicators are words referring to speech, thought and writing acts.

Only about a fifth of the false negatives and false positives observed for reported STWR were partial matches, a significantly lower percentage than for direct and indirect STWR. This indicates that for this form, finding the correct borders of the annotation is less of a problem than deciding whether STWR is present at all.

Most errors can be attributed to problems related to speech, thought or writing words, the main indicator for reported STWR. Such words can trigger a false annotation and are the main cause of false positives. The reverse problem is even more prominent: Instances that do not use lexical material commonly associated with speech, thought and writing tend to be overlooked. Missing such unusual instances is the main problem of the recognizer, and though the direct and indirect recognizers also have better precision than recall, the difference for reported is clearly more pronounced. Another recurring error type is that the borders of the STWR were not detected correctly, missing modifiers or annotating part of the surrounding narration. We also observed some rare mix-ups with indirect STWR.

As noted above, the F1 score ranges of the FLAIR and BERT based recognizers are not distinct for reported, though BERT does perform somewhat better on average. Looking at the differences, we found that the recognizers make the same types of mistakes, but BERT is generally more open to unusual instances of reported STWR, leading to a better recall, which is the main reason for its better overall performance.

5.4 Free indirect STWR

Free indirect STWR is structurally similar to direct STWR in that it usually spans one or more consecutive sentences. It is very hard to identify using surface markers, as it is basically a shift to a character's internal thoughts that still uses the same tense and pronouns as the surrounding narration. The best indicators are emphatic punctuation such as ?, ! or -, words indicating a reference point in the present (such as now, here) and characteristics of informal speech such as dialect or modal particles.

The free indirect recognizers show the largest gap between precision and recall: nearly 0.4 points. Clearly, the problem here lies in undetected cases.
Notably however, over 40% of the false negatives are partial matches, meaning that the recognizer at least correctly detected an instance of free indirect, even though it failed to capture it completely. (The recall problem might be exacerbated by the false negatives in the training data. We ran tests where we cut out the marked instances in the training data with some context (25 or 50 tokens) and used this as training input. A detailed evaluation is beyond the scope of this paper, but the resulting recognizers had better recall but worse precision, leading to similar F1 scores.)

The main cause of false positives are cases in which some of the main indicators of free indirect (as described above) occur in narration. In addition to that, unmarked direct STWR is prone to being labeled as free indirect. As for the false negatives, about half of the missed instances contained at least one surface marker, but many are only recognizable via wider context clues or an understanding of the content.

Comparing BERT and FLAIR again, we find that BERT gives a much better recall – the same effect as with reported STWR, but more pronounced. BERT is clearly better at picking up subtle signals for free indirect STWR than FLAIR. The flip side of this is that the BERT model also produces more false positives than the FLAIR model. An interesting observation is that it sometimes annotates sentences that are not part of the free indirect STWR itself, but introduce it. Though the borders of the STWR are not detected correctly in these cases, this might indicate that the model learned that these context clues are highly relevant for identifying free indirect STWR, which is indeed the case. An example for this scenario is the following passage:

Jetzt war er mit dem Lächeln an der Reihe. Ihre Reaktionen kamen so spontan und waren so ungekünstelt und ehrlich. Hoffentlich würde sie das nie verlieren. (Now it was his turn to smile. Her reactions came so spontaneously and were so genuine and honest. Hopefully she would never lose that.)

BERT also marks the introductory sentence (Jetzt war er mit dem Lächeln an der Reihe.), which shifts the focus to the character, in addition to the free indirect instance that tells us his thoughts. The FLAIR model, on the other hand, has its strength in precision: The few false positives that it produced are often borderline cases that are attached to free indirect passages and could be read as plausible extensions.

6 Conclusion

We presented recognizers for four types of STWR which differ strongly in structure and difficulty. Our models for direct, indirect and reported were trained and tested on historical German fictional and non-fictional texts, the model for free indirect on modern German fiction. The success rates correspond closely to the reliability of humans: For indirect and reported, we even achieved scores similar to the human annotator agreement on a comparable corpus. For the types direct and free indirect, humans still clearly outperform our best models. In both cases, we believe that the need for wide contextual knowledge plays an important role in explaining the gap: For direct, the models fail most often in distinguishing between a first person narrator and a character quote. Free indirect in general is a highly context dependent form that requires an understanding of the narrative structure.

We tested a variety of different language embeddings for our task and provided a comparison of the most promising ones: FLAIR and BERT embeddings. For both, we also trained/fine-tuned models on historical texts. FLAIR gave the best scores for direct, BERT for indirect and free indirect. For reported, the results were not conclusive: Though BERT performed better on average, we observed an overlap in the F1 score ranges of the BERT and FLAIR models over multiple runs.

Most striking was the improvement achieved with BERT for free indirect STWR. In particular, BERT improved recall for the most difficult forms, free indirect and – to a lesser degree – reported, showing a greater ability to detect unusual instances.
Direct STWR was the only form where FLAIR clearly outperformed BERT. It seems that the higher sensitivity of BERT is more of a disadvantage here, as it tended to misclassify even more instances of first person narration than FLAIR.

To further improve performance, one idea is to modify our input strategy: instead of consecutive chunks of up to 100 tokens, overlapping chunks could be used as input. This might prevent the recognizers from losing context at the beginning of a chunk, which would be especially relevant for the direct and free indirect recognizers.

The top models and customized embeddings described in this paper are freely available via our homepage www.redewiedergabe.de and via GitHub. In detail, our customized BERT embeddings can be found at huggingface.co/redewiedergabe/bert-base-historical-german-rw-cased, and the custom-trained FLAIR embeddings are integrated into the FLAIR framework as de-historic-rw-forward and de-historic-rw-backward. The top recognizer models are available at github.com/redewiedergabe/tagger along with the code used for training and execution.

In addition to that, all the material used for the direct, indirect and reported recognizers and part of the material used for the free indirect recognizer is available as corpus REDEWIEDERGABE (Brunner et al., 2020) at github.com/redewiedergabe/corpus. (Unfortunately we can only publish the historical part of the free indirect material due to copyright restrictions on the modern texts.) The rich annotation of corpus REDEWIEDERGABE also offers opportunities to train more complex recognizers, e.g. by providing labels for the medium of the STWR (speech, thought or writing) as well as annotation of the framing phrase for direct and indirect STWR and of the speaker for all four forms of STWR.
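A hedged usage sketch for the released resources: the FLAIR embedding names and the Hugging Face model path are the identifiers given above, while the tagger file name is a placeholder, since the exact file layout of github.com/redewiedergabe/tagger is not described in this paper.

<pre>
from flair.data import Sentence
from flair.embeddings import FlairEmbeddings, TransformerWordEmbeddings
from flair.models import SequenceTagger

# Customized contextual embeddings released with the paper.
historic_forward = FlairEmbeddings("de-historic-rw-forward")
historic_backward = FlairEmbeddings("de-historic-rw-backward")
historic_bert = TransformerWordEmbeddings("redewiedergabe/bert-base-historical-german-rw-cased")

# Applying a downloaded recognizer model (file name is a placeholder; see the
# tagger repository for the released models and the training/execution code).
tagger = SequenceTagger.load("redewiedergabe-direct.pt")
sentence = Sentence("Dann sagte er: Ich habe Hunger.")
tagger.predict(sentence)
print(sentence.to_tagged_string())
</pre>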
References

Alan Akbik, Tanja Bergmann, Duncan Blythe, Kashif Rasul, Stefan Schweter, and Roland Vollgraf. 2019. FLAIR: An Easy-to-Use Framework for State-of-the-Art NLP. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 54–59, Minneapolis, Minnesota. Association for Computational Linguistics.

Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual String Embeddings for Sequence Labeling. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1638–1649, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Ann Banfield. 1982. Unspeakable sentences. Narration and representation in the language of fiction. Routledge & Kegan Paul, Boston u.a.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching Word Vectors with Subword Information. CoRR, abs/1607.04606.

Julian Brooke, Adam Hammond, and Graeme Hirst. 2015. GutenTag: An NLP-driven Tool for Digital Humanities Research in the Project Gutenberg Corpus. In Proceedings of the North American Chapter of the Association for Computational Linguistics – Human Language Technologies, pages 42–47.

Annelen Brunner. 2015. Automatische Erkennung von Redewiedergabe. Ein Beitrag zur quantitativen Narratologie. Number 47 in Narratologia. de Gruyter, Berlin u.a.

Annelen Brunner, Stefan Engelberg, Fotis Jannidis, Ngoc Duyen Tanja Tu, and Lukas Weimer. 2020. Corpus REDEWIEDERGABE. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 796–805, Marseille, France. European Language Resources Association.

Annelen Brunner, Ngoc Duyen Tanja Tu, Lukas Weimer, and Fotis Jannidis. 2019. Deep learning for free indirect representation. In Proceedings of the 15th Conference on Natural Language Processing (KONVENS 2019): Short Papers, pages 241–245, Erlangen, Germany. German Society for Computational Linguistics & Language Technology.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. CoRR, abs/1810.04805.

Cathrine Fabricius-Hansen, Kåre Solfjeld, and Anneliese Pitz. 2018. Der Konjunktiv: Formen und Spielräume. Number 100 in Stauffenburg Linguistik. Stauffenburg, Tübingen.

Monika Fludernik. 1993. The fictions of language and the languages of fiction. The linguistic representation of speech and consciousness. Routledge, London/New York.

Cláudia Freitas, Bianca Freitas, and Diana Santos. 2016. QUEMDISSE? Reported Speech in Portuguese. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pages 4410–4416.

Gérard Genette. 2010. Die Erzählung, 3rd edition. Number 8083 in UTB. Fink, Paderborn.

Stefan Hauser. 2008. Beobachtungen zur Redewiedergabe in der Tagespresse. Eine kontrastive Analyse. In Heinz-Helmut Lüger and Hartmut Lenk, editors, Kontrastive Medienlinguistik, pages 271–286. Verlag Empirische Pädagogik, Landau.

Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF Models for Sequence Tagging. CoRR, abs/1508.01991.

Fotis Jannidis, Albin Zehe, Leonard Konle, Andreas Hotho, and Markus Krug. 2018. Analysing Direct Speech in German Novels. In Digital Humanities im deutschsprachigen Raum – Book of abstracts.

Aditya Joshi, Sarvnaz Karimi, Ross Sparks, Cécile Paris, and C Raina MacIntyre. 2019. A Comparison of Word-based and Context-based Representations for Classification Problems in Health Informatics. In Proceedings of the BioNLP 2019 workshop, pages 135–141, Florence, Italy. Association for Computational Linguistics.

Konstantina Lazaridou, Ralf Krestel, and Felix Naumann. 2017. Identifying Media Bias by Analyzing Reported Speech. In IEEE International Conference on Data Mining – Book of abstracts, pages 943–948.

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2019. BioBERT: a pre-trained biomedical language representation model for biomedical text mining.

Geoffrey Leech and Mick Short. 2013. Style in fiction. A linguistic introduction to English fictional prose, 2nd edition. Routledge, London u.a.

Brian McHale. 2014. Speech Representation. In Peter Hühn, John Pier, Wolf Schmid, and Jörg Schönert, editors, The living handbook of narratology. Hamburg University Press, Hamburg.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space.

Sean Papay and Sebastian Padó. 2019. Quotation detection and classification with a corpus-agnostic model. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), pages 888–894, Varna, Bulgaria. INCOMA Ltd.

Roy Pascal. 1977. The dual voice. Free indirect speech and its functioning in the nineteenth-century European novel. Manchester University Press, Manchester.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.
Christof Schöch, Daniel Schlör, Stefanie Popp, Annelen Brunner, and José Calvo Tello. 2016. Straight Talk! Automatic Recognition of Direct Speech in Nineteenth-century French Novels. In Conference Abstracts, pages 346–353, Jagiellonian University & Pedagogical University, Kraków.

Christian Scheible, Roman Klinger, and Sebastian Padó. 2016. Model Architectures for Quotation Detection. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), pages 1736–1745.

Luise Schricker, Manfred Stede, and Peer Trilcke. 2019. Extraction and Classification of Speech, Thought, and Writing in German Narrative Texts. In Preliminary proceedings of the 15th Conference on Natural Language Processing (KONVENS 2019): Long Papers, pages 183–192, Erlangen, Germany. German Society for Computational Linguistics & Language Technology.

Shreyas Sharma and Ron Daniel. 2019. BioFLAIR: Pretrained Pooled Contextualized Embeddings for Biomedical Sequence Labeling Tasks.

Milan Straka, Jana Straková, and Jan Hajič. 2019. Evaluating Contextualized Embeddings on 54 Languages in POS Tagging, Lemmatization and Dependency Parsing.

Jörg Tiedemann. 2012. Parallel Data, Tools and Interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 2214–2218, Istanbul, Turkey. European Language Resources Association (ELRA).

Ngoc Duyen Tanja Tu, Markus Krug, and Annelen Brunner. 2019. Automatic recognition of direct speech without quotation marks. A rule-based approach. In Digital Humanities: multimedial & multimodal. Konferenzabstracts, pages 87–89, Frankfurt am Main/Mainz.

Harald Weinrich. 2007. Textgrammatik der deutschen Sprache, 4th revised edition. Wissenschaftliche Buchgesellschaft, Darmstadt.

Gregor Wiedemann, Steffen Remus, Avi Chawla, and Chris Biemann. 2019. Does BERT Make Any Sense? Interpretable Word Sense Disambiguation with Contextualized Embeddings. In Preliminary proceedings of the 15th Conference on Natural Language Processing (KONVENS 2019): Long Papers, pages 161–170, Erlangen, Germany. German Society for Computational Linguistics & Language Technology.

Gisela Zifonun, Ludger Hoffmann, and Bruno Strecker. 1997. Grammatik der deutschen Sprache, volume 3 of Schriften des Instituts für deutsche Sprache. de Gruyter, Berlin/New York/Amsterdam.