=Paper=
{{Paper
|id=Vol-2624/paper5
|storemode=property
|title=To BERT or not to BERT – Comparing Contextual Embeddings in a Deep Learning Architecture for the Automatic Recognition of four Types of Speech, Thought and Writing Representation
|pdfUrl=https://ceur-ws.org/Vol-2624/paper5.pdf
|volume=Vol-2624
|authors=Annelen Brunner,Ngoc Duyen Tanja Tu,Lukas Weimer,Fotis Jannidis
|dblpUrl=https://dblp.org/rec/conf/swisstext/BrunnerTWJ20
}}
==To BERT or not to BERT – Comparing Contextual Embeddings in a Deep Learning Architecture for the Automatic Recognition of four Types of Speech, Thought and Writing Representation==
Annelen Brunner, Ngoc Duyen Tanja Tu (Leibniz-Institut für Deutsche Sprache, R5 6-13, D-68161 Mannheim, brunner|tu@ids-mannheim.de)
Lukas Weimer, Fotis Jannidis (Universität Würzburg, Am Hubland, D-97074 Würzburg, lukas.weimer|fotis.jannidis@uni-wuerzburg.de)

Abstract

We present recognizers for four very different types of speech, thought and writing representation (STWR) for German texts. The implementation is based on deep learning with two different customized contextual embeddings, namely FLAIR embeddings and BERT embeddings. This paper gives an evaluation of our recognizers with a particular focus on the differences in performance we observed between those two embeddings. FLAIR performed best for direct STWR (F1=0.85), BERT for indirect (F1=0.76) and free indirect (F1=0.59) STWR. For reported STWR, the comparison was inconclusive, but BERT gave the best average results and the best individual model (F1=0.60). Our best recognizers, our customized language embeddings and most of our test and training data are freely available and can be found via www.redewiedergabe.de or at github.com/redewiedergabe.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Speech, thought and writing representation (STWR) is an interesting phenomenon both from a narratological and a linguistic point of view. The manner in which a character's voice is incorporated into the narrative is strongly linked to narrative techniques as well as to the construction of the narrative world and is therefore a standard topic in narratology (e.g. McHale (2014); Genette (2010); Leech and Short (2013)). For some phenomena, such as free indirect discourse and stream of consciousness, there is a large amount of research (e.g. Banfield (1982); Fludernik (1993); Pascal (1977)). In linguistics, the grammatical, lexical and functional characteristics of STWR have also been of interest (e.g. Weinrich (2007); Zifonun et al. (1997); Hauser (2008); Fabricius-Hansen et al. (2018)).

To conduct either narratological or linguistic studies on STWR based on big data, being able to automatically detect different types of STWR would be of great benefit. This was our motivation to develop recognizers for the following four forms of STWR, which have been distinguished in literary and linguistic theory. (The stretch of STWR is printed in italics in the following examples.)

Direct STWR is a quotation of a character's speech, thought or writing. It is frequently – though not always – enclosed by quotation marks and/or introduced by a framing clause.

Dann sagte er: "Ich habe Hunger." (Then he said: "I'm hungry.")

Free indirect STWR, also known as "erlebte Rede" in German, is mainly used in literary texts to represent a character's thoughts while still maintaining characteristics of the narrator's voice (e.g. past tense and third person pronouns).

Er war ratlos. Woher sollte er denn hier bloß ein Mittagessen bekommen? (He was at a loss. Where should he ever find lunch here?)

Indirect STWR is a paraphrase of the character's speech, thought or writing, composed of a framing clause (not counted as part of the STWR) with a dependent subordinate clause (often using subjunctive mode) or an infinitive phrase.

Er fragte, wo das Essen sei. (He asked where the food was.)
Reported STWR is defined as a mention of a speech, thought or writing act that may or may not specify the topic and does not take the form of indirect STWR.

Er sprach über das Mittagessen. (He talked about lunch.)

In the following, we describe our approach to developing recognizers for these four STWR types and evaluate our results with a particular focus on the differences between the two contextual embeddings that proved most successful for our task, BERT and FLAIR.

2 Related work

2.1 STWR recognizers

Automatic STWR recognition focuses mainly on the forms direct STWR (e.g. Schricker et al. (2019); Jannidis et al. (2018); Tu et al. (2019); Brunner (2015); Brooke et al. (2015) for German texts; Schöch et al. (2016) for French texts) and indirect STWR (e.g. Schricker et al. (2019); Brunner (2015) for German texts; Lazaridou et al. (2017); Scheible et al. (2016) for English texts; Freitas et al. (2016) for Portuguese texts). For free indirect and reported STWR, recognizers were implemented by Brunner (2015) and Schricker et al. (2019); the latter builds upon work by the former. In addition to that, Papay and Padó (2019) propose a corpus-agnostic neural model for quotation detection.

Since we developed recognizers for German, we will only take a closer look at recognizers trained and tested on German texts. Jannidis et al. (2018) developed a deep-learning based recognizer for direct speech in German novels, which works without quotation marks and achieves an accuracy of 0.84 in sentence-wise evaluation. The algorithm by Brooke et al. (2015) is a simple rule-based algorithm which matches quotation marks; they do not report any scores. Like Jannidis et al. (2018), Tu et al. (2019) focus on developing a recognizer for direct speech which works without quotation marks, but they used a rule-based approach, achieving a sentence-level accuracy between 80.5% and 85.4% for fictional and 60.8% for non-fictional data. Brunner (2015) uses a corpus of 13 short German narratives. For each type of STWR, she implements a rule-based model and trains a RandomForest model, evaluated in ten-fold cross validation. The best F1 scores, evaluated on sentence level, were achieved by the rule-based approach for indirect (F1=0.71) and reported (F1=0.57) STWR and by the RandomForest model for direct (F1=0.87) and free indirect (F1=0.40) STWR. Schricker et al. (2019) use the same corpus, but split the data into a stratified training and test set. They use different features than Brunner and train three different machine learning algorithms: RandomForest, Support Vector Machine and Multilayer Perceptron. RandomForest was most successful and gave sentence-wise F1 scores of 0.95 for direct, 0.79 for indirect, 0.70 for free indirect and 0.49 for reported STWR. Papay and Padó (2019) test their corpus-agnostic quotation detection model on Brunner's corpus and approximate her RandomForest results.

Our recognizers fill a need regarding the recognition of STWR in German texts, as they deal with all four forms of STWR and are, at the same time, trained and tested on a much larger data base than Brunner's corpus, making our results much more reliable. In addition to that, our data comprises not only fictional but also non-fictional texts. An earlier version of our recognizer for free indirect STWR was discussed in Brunner et al. (2019). We improved on this version by adding more training data and achieving higher scores with the BERT-based model.
2.2 Language embeddings

As the testing of different language embeddings was a central component in the development of our recognizers, we will briefly outline characteristics and research concerning the two most successful ones that will be in focus in the rest of this paper: FLAIR embeddings (Akbik et al., 2018) and BERT embeddings (Devlin et al., 2019).

Both have in common that they produce context-dependent embeddings, as opposed to static word embeddings such as fastText (Bojanowski et al., 2016), GloVe (Pennington et al., 2014) or Word2Vec (Mikolov et al., 2013). That means they assign an embedding to a word based on its context and are therefore able to capture polysemy. Contextual embeddings have been shown to be of great benefit in several NLP tasks, e.g. predicting the topic of a tweet (Joshi et al., 2019) or part-of-speech tagging, lemmatization and dependency parsing (Straka et al., 2019). Though FLAIR and BERT both produce contextual embeddings, they differ in several features.

FLAIR produces character-level embeddings. A sentence is passed to the algorithm as a sequence of characters and the task is to predict the next character based on the previous characters. When using FLAIR embeddings, it is recommended to combine two independently trained models: i) a forward model and ii) a backward model. The forward model reads every character in a sentence from left to right, the backward model from the opposite direction. This character-based embedding architecture gives advantages in capturing morphological and semantic structures (cf. Akbik et al. (2018)).

While FLAIR embeddings only know their previous context, BERT embeddings know the previous as well as the succeeding context at the same time. The training of BERT embeddings consists of two tasks: 1) given a sequence of tokens in which 15% of the tokens are masked by the masked language model, predict the masked tokens based on their context; 2) given a pair of sentences, predict whether the second sentence follows the first one. The advantage of this model is that it learns associations between tokens as well as between sentences. This is important for token-level tasks, such as question answering (cf. Devlin et al. (2019)).

There are no systematic analyses comparing the performance of embeddings for a sequence labeling task in German, like we will do in this paper. However, there is work focusing on the use of different embeddings in NLP tasks on English texts: e.g. Wiedemann et al. (2019) compare BERT, ELMo and FLAIR embeddings in a word sense disambiguation task, where BERT performed best. Sharma and Daniel (2019) compare BioBERT (Lee et al., 2019) and FLAIR embeddings, more precisely the pubmed-x model, in a biomedical named entity recognition task. They find that stacking FLAIR embeddings with BioELMo yields better results than using only FLAIR. Compared to BioBERT, the results of FLAIR are very close, although the FLAIR embeddings are pretrained on a much smaller dataset. There is also a comparison between BERT (bert-base-uncased, bert-base-chinese, bert-base-multilingual-uncased), FLAIR (bg, cs, de, en, fr, nl, pl, pt, sl, sv) and ELMo (english) in a part-of-speech tagging, lemmatization and dependency parsing task on different languages: Straka et al. (2019) showed that BERT outperforms ELMo as well as FLAIR embeddings in dependency parsing. As opposed to that, ELMo performs best in part-of-speech tagging and lemmatization, followed by FLAIR. Therefore Straka et al. (2019) conclude that ELMo is best and FLAIR embeddings are second best in capturing morphological and orthographic information, while BERT is best in capturing syntactic information.
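As a concrete illustration of the two embedding types (a minimal sketch added for this edition, not part of the original paper), both can be instantiated and applied with the FLAIR framework in a few lines. The model identifiers used here are standard publicly available names and serve only as examples; TransformerWordEmbeddings is the interface of recent FLAIR versions for BERT-style models.

<pre>
from flair.data import Sentence
from flair.embeddings import FlairEmbeddings, StackedEmbeddings, TransformerWordEmbeddings

# Character-level FLAIR language models: the forward model reads the sentence
# left to right, the backward model right to left; they are used as a pair.
flair_stack = StackedEmbeddings([
    FlairEmbeddings("de-forward"),
    FlairEmbeddings("de-backward"),
])

# BERT-style embeddings see the left and right context at the same time.
bert_embeddings = TransformerWordEmbeddings("bert-base-german-cased")

sentence = Sentence('Dann sagte er: "Ich habe Hunger."')
flair_stack.embed(sentence)
for token in sentence:
    # each token now carries a context-dependent vector
    print(token.text, token.embedding.shape)
</pre>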
3 Method

We defined the recognition of STWR as a sequence labeling task on token level. For each of the four types of STWR, a separate model was trained on binary labels ("token is part of this type of STWR: yes/no"). The input data consists of chunks of up to 100 tokens, which may span several sentences. The chunks may never cross borders between different texts or cut sentences (except when a sentence exceeds 100 tokens) and can therefore also be shorter than the maximum.

To train our tagging model, we used the SequenceTagger class of the FLAIR framework (Akbik et al., 2019), which implements a BiLSTM-CRF architecture on top of a language embedding (as proposed by Huang et al. (2015)). We use two BiLSTM layers with a hidden size of 256 each and one CRF layer. This setting was decided after running tests with only one BiLSTM layer, which gave considerably worse results, and with three BiLSTM layers, which led to no significant improvements.

We tested many different configurations for the language embeddings in this setup. Initial tests were done with just fastText embeddings. The results were much worse than those of the two configurations that became our main focus: a) a fastText model stacked with a FLAIR forward and a FLAIR backward model (as recommended in Akbik et al. (2018)) and b) a BERT model.
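The following sketch (added for illustration, not the authors' actual scripts) shows how such a tagger can be set up in the FLAIR framework under the settings described above: two BiLSTM layers with hidden size 256, a CRF output layer, and configuration a), fastText stacked with FLAIR forward/backward embeddings. The corpus layout, file names, tag name and embedding paths are assumptions, and the remaining training hyperparameters are left at FLAIR defaults because they are not reported in the paper.

<pre>
from flair.datasets import ColumnCorpus
from flair.embeddings import FlairEmbeddings, StackedEmbeddings, WordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# Token-level corpus in two columns (token, binary label), pre-split into
# chunks of up to 100 tokens; paths and the tag name "direct" are placeholders.
corpus = ColumnCorpus("data/direct", {0: "text", 1: "direct"},
                      train_file="train.txt", dev_file="dev.txt", test_file="test.txt")
tag_dictionary = corpus.make_tag_dictionary(tag_type="direct")

# Configuration a): custom fastText vectors stacked with FLAIR forward/backward
# language models (placeholder paths to the custom-trained models).
embeddings = StackedEmbeddings([
    WordEmbeddings("resources/fasttext-historical-500.gensim"),
    FlairEmbeddings("resources/flair-historical-forward.pt"),
    FlairEmbeddings("resources/flair-historical-backward.pt"),
])

# BiLSTM-CRF on top of the embeddings: two BiLSTM layers, hidden size 256, CRF output.
tagger = SequenceTagger(hidden_size=256, rnn_layers=2, use_crf=True,
                        embeddings=embeddings,
                        tag_dictionary=tag_dictionary, tag_type="direct")

trainer = ModelTrainer(tagger, corpus)
trainer.train("models/direct")  # remaining hyperparameters: FLAIR defaults
</pre>

A separate model of this kind is trained for each of the four STWR types; for configuration b), the stacked embeddings are simply replaced by BERT embeddings.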
Except for free indirect, all of our recognizers were trained and tested on historical German. Using out-of-the-box embeddings, which are trained on modern texts such as a German Wikipedia dump, an open legal data dump, OpenSubtitles or the EU Bookshop corpus (Tiedemann, 2012), is therefore problematic. So we custom-trained our own fastText and FLAIR embeddings and fine-tuned the BERT embeddings. The following settings were used:

Skip-gram fastText models: We used the default settings as recommended by the fastText tutorial, i.e. we trained for five epochs, set the learning rate to 0.05 and set the minimum character n-gram size to 3 and the maximum to 6. We varied the model dimensions as well as the training material: fastTextTrain_clean is a smaller, cleaner corpus, while fastTextTrain contains additional material with OCR errors (cf. section 4.1). On each training set, one model with 300 and one model with 500 dimensions was trained.

FLAIR: A forward and a backward FLAIR embedding with a hidden size of 1024 were trained. All settings were chosen according to the recommendations of the FLAIR tutorial, i.e. the sequence length was set to 250, the mini-batch size to 100, the learning rate to 0.20, the annealing factor to 0.4 and the patience value to 25. The model stopped training after 10 epochs due to low loss.

BERT: We used the PyTorch script finetune_on_pregenerated.py to fine-tune the pretrained bert-base-german-cased model with the recommended default configuration: epochs: 3, gradient accumulation steps: 1, train batch size: 32, learning rate: 0.00003 and max_seq_len: 128.

4 Data

4.1 Training data for the embeddings

For the training/fine-tuning of the embeddings, 9,577 fictional and non-fictional German texts from the 19th and early 20th century were selected.

For the fine-tuning of the BERT embeddings, we fed all data – split into sentences – into the PyTorch script pregenerate_training_data.py, which transforms it into input data compatible with BERT. The BERT fine-tuning tutorial recommends creating one epoch of input data for each training epoch, so that BERT is not trained on the same random splits in each epoch. We fine-tuned BERT for 3 epochs, so we generated 3 epochs of data.

For the FLAIR embeddings, 70% of the 6,441,372 sentences in the data, i.e. 4,508,960 sentences, were randomly drawn to form the training corpus. For the validation corpus, 15%, i.e. 966,206 sentences, were randomly drawn. The rest was used for testing purposes.

For training the fastText embeddings, we used two different inputs. fastTextTrain contained all of the 137,093,995 tokens of our data. For fastTextTrain_clean we removed all texts that were recognized with OCR and thus contained typical OCR errors, which resulted in a smaller input set of 131,360,863 tokens.
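Read as a recipe, the embedding customization described above could look roughly like the sketch below. This is an illustration under the public fastText and FLAIR training interfaces, with placeholder file paths; only the hyperparameters explicitly reported in this paper are set, and the exact scripts used by the authors may differ.

<pre>
import fasttext
from flair.data import Dictionary
from flair.models import LanguageModel
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus

# Skip-gram fastText vectors: 5 epochs, learning rate 0.05, character n-grams of
# length 3 to 6; trained once with 300 and once with 500 dimensions per input corpus.
ft = fasttext.train_unsupervised("fastTextTrain_clean.txt", model="skipgram",
                                 dim=300, epoch=5, lr=0.05, minn=3, maxn=6)
ft.save_model("fasttext-historical-300.bin")

# Forward FLAIR character language model with hidden size 1024; the backward model
# is trained the same way with the forward flag set to False. The corpus folder
# layout (train/, valid.txt, test.txt) follows the FLAIR language-model tutorial.
dictionary = Dictionary.load("chars")
corpus = TextCorpus("flair_corpus/", dictionary, True, character_level=True)
lm = LanguageModel(dictionary, is_forward_lm=True, hidden_size=1024, nlayers=1)
trainer = LanguageModelTrainer(lm, corpus)
trainer.train("flair-historical-forward",
              sequence_length=250, mini_batch_size=100,
              learning_rate=0.20, anneal_factor=0.4, patience=25)
</pre>

The BERT side of the customization uses the scripts named above (pregenerate_training_data.py to build one epoch of input data per training epoch, then finetune_on_pregenerated.py with the listed defaults) rather than custom code.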
4.2 Training data for the recognizers

The recognizers for direct, indirect and reported STWR were trained on historical German texts – excerpts as well as full texts – that were published from the middle of the 19th to the early 20th century. The data comprises fiction as well as non-fiction (newspaper and journal articles) in nearly equal proportion; fiction is somewhat more dominant. Roughly half of the data was manually labeled by two human annotators independently of one another. Then a third person compared the annotations, adjudicated discrepancies and created a consensus annotation. The rest was labeled by a single annotator.

For indirect STWR, the training data was supplemented with 16 additional historical full texts (9 fictional and 7 non-fictional) to increase the number of instances. To speed up the annotation process, these texts were automatically annotated by one of our earlier recognizer models and then manually checked. The annotators looked at the whole texts, so false negatives were corrected as well.

All the historical data is published as corpus REDEWIEDERGABE (Brunner et al., 2020) and freely available.

As the historical data contained far too few instances of free indirect STWR, we had to create a separate training corpus for this STWR type. The basis were 150 instances of free indirect STWR with little to no context, manually extracted from 20th century novels. In addition to that, full texts and excerpts from modern popular crime novels as well as dime novels were automatically annotated with a basic rule-based recognizer that used typical surface indicators. Those annotations were then verified by human annotators. On this data, we trained an early recognizer (Brunner et al., 2019), which was then used to annotate additional historical fictional texts. These annotations were again verified by human annotators before they were added to the training material as well. It should be noted that in this semi-automated annotation process, instances that were not detected by the early recognizers had no chance of being annotated. Because of this, the data most likely contains false negatives.

For model training, our data was split into a training corpus (648,338 tokens for direct and reported; 700,202 tokens for indirect; 3,804,226 tokens for free indirect) and a validation corpus (97,316 tokens for direct, reported and indirect; 181,942 tokens for free indirect). Table 1 shows the occurrences of each form of STWR in its training and validation corpus, given in tokens, percentage of tokens in its corpus and instances. (An instance is defined here as an uninterrupted sequence of tokens annotated as the same type of STWR, which can be longer than one sentence. This is of course a simplification, as two conceptually separate stretches, such as lines of dialogue by two different people, will be counted as one instance if they follow directly after each other, but it can serve as a rough guideline. On average, a direct instance is 46 tokens long, an indirect instance 15 tokens, a reported instance 10 tokens and a free indirect instance 34 tokens.)

           Training corpus                  Validation corpus               Test corpus
           Tokens    Percent  Instances     Tokens    Percent  Instances    Tokens    Percent  Instances
Direct     212,467   32.77    6,293         24,321    24.99    878          18,307    18.71    605
Indirect    49,222    7.03    3,505          8,502     8.74    571           8,664     8.86    545
Reported    66,817   10.31    7,522         11,404     7.73    1,219        10,696    10.93    976
Free ind   236,011    6.30    6,887          7,005     3.85    205           3,002    13.09     98

Table 1: The occurrences of each form of STWR in the training, validation and test corpora, given in tokens, percentage of tokens in the respective corpus, and instances.

4.3 Test data for the recognizers

Our test data for the direct, indirect and reported STWR recognizers has 97,863 tokens and comprises excerpts from historical fictional and non-fictional texts in equal proportions. They were labeled with a consensus annotation as described in section 4.2. The test data for the free indirect STWR recognizer has 22,935 tokens and comprises 22 excerpts from dime novels, which were manually labeled by one human annotator. Table 1 shows the occurrences of each form of STWR in its test corpus, given in tokens, percentage of test corpus tokens and instances.
5 Results

We report the scores of our most successful language embedding configurations, i.e. fine-tuned BERT and fastText stacked with FLAIR forwards and backwards.

Notably, our fine-tuned BERT model performed better than the regular BERT model for all STWR types, even free indirect, where the STWR recognizers were tested on modern German fiction (as opposed to historical German fiction and non-fiction for the other models). The same is true for the custom-trained fastText + FLAIR models, which outperformed models pretrained on modern German for all STWR types as well. We speculate that this is because the customization made the models better suited for literary texts in general, even though it was done on historical German.

The most successful configuration of fastText + FLAIR varies slightly between the different forms of STWR with respect to the fastText model that gave the best results. The fastText specifications for the four types of STWR are detailed in table 2.

           Dimensions   Training data
Direct     500          fastTextTrain_clean
Indirect   300          fastTextTrain
Reported   500          fastTextTrain_clean
Free ind   300          fastTextTrain

Table 2: Dimensions and training data of the fastText models used by the different FLAIR-based recognizers.

We trained each model with the same configuration three times to correct for random variation in the deep learning results. Table 3 reports the average value of each score and the standard deviation, calculated on token level.

On average, the recognizers using BERT embeddings scored better for all types of STWR except direct, for which the recognizers using the stacked fastText and FLAIR embeddings proved consistently more successful. Most striking was BERT's advantage for free indirect, where especially the recall improved. It should be noted, though, that the FLAIR-based free indirect model consistently gave better precision.

However, when looking at the standard deviation over the three runs and the range of results, we see that the F1 score ranges of the FLAIR and BERT recognizers overlap for reported, so the results of the comparison are not conclusive for this STWR type. For the other three STWR types, the F1 score ranges are clearly distinct, even though the free indirect models show a high variance.

          fastText + FLAIR                                    BERT
          F1              Prec            Rec                 F1              Prec            Rec
Dir       0.84 (0.0047)*  0.90 (0.0245)*  0.79 (0.0094)*      0.80 (0.0047)   0.86 (0.017)    0.74 (0.0082)
Ind       0.73 (0.0082)   0.78 (0.0082)   0.68 (0.0205)       0.76 (0.0)*     0.79 (0.0236)*  0.73 (0.017)*
Rep       0.56 (0.0125)   0.68 (0.0094)   0.48 (0.0141)       0.58 (0.017)*   0.69 (0.0163)*  0.51 (0.034)*
Fr ind    0.49 (0.017)    0.86 (0.0094)*  0.35 (0.0125)       0.57 (0.0216)*  0.80 (0.017)    0.44 (0.0309)*

Table 3: Average scores over three runs for each form of STWR, standard deviation given in brackets; the best average score per metric is marked with *.

Table 4 lists the scores of the individual recognizers from the three training runs that produced the best results.

           F1     Prec   Rec    Embedding
Direct     0.85   0.93   0.78   cust. fastText+FLAIR
Indirect   0.76   0.81   0.71   BERT fine-tuned
Reported   0.60   0.67   0.54   BERT fine-tuned
Free ind   0.59   0.78   0.47   BERT fine-tuned

Table 4: Scores of our top models.
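For readers who want to reproduce this kind of table, the scoring protocol described here reduces to token-level precision, recall and F1 over the binary labels, averaged over three runs with the standard deviation reported. The snippet below is a minimal, self-contained illustration with toy data, added for this edition and not taken from the authors' code.

<pre>
from statistics import mean, pstdev

def token_prf(gold, pred):
    """Token-level precision, recall and F1 for one binary STWR label (1 = part of the STWR)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example: one gold label sequence and the predictions of three training runs.
gold = [0, 1, 1, 1, 0, 0, 1, 0]
predictions_per_run = [
    [0, 1, 1, 0, 0, 0, 1, 0],
    [0, 1, 1, 1, 0, 1, 1, 0],
    [0, 0, 1, 1, 0, 0, 1, 0],
]

runs = [token_prf(gold, pred) for pred in predictions_per_run]
for name, values in zip(("precision", "recall", "F1"), zip(*runs)):
    print(f"{name}: {mean(values):.2f} ({pstdev(values):.4f})")
</pre>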
To give an impression of how difficult it is for humans to annotate these forms, table 5 presents the agreement scores between human annotators. The scores for direct, indirect and reported STWR are based on corpus REDEWIEDERGABE, the corpus of fictional and non-fictional historical texts our test data was drawn from. The score for free indirect was calculated directly on the free indirect test corpus.

           F1     Prec   Rec    Fleiss' Kappa
Direct     0.94   0.94   0.94   0.92
Indirect   0.75   0.77   0.74   0.73
Reported   0.56   0.56   0.56   0.49
Free ind   0.69   0.64   0.73   0.66

Table 5: Human annotator agreement for the STWR types.

We performed two types of error analysis: First, we looked at the first 10,000 tokens of our test data and categorized the types of errors made by our top recognizers (cf. table 4). This gives an impression of the types of challenges the four forms of STWR pose and how well our recognizers can deal with them, which is important practical information for anyone using them.

Second, we also looked at the first 20 differences between the results of the best models trained with BERT vs. the best models trained with FLAIR. The goal was to find indications of which specific properties of the two contextual embeddings made them better or worse suited to a particular task.

As the four types of STWR have very different characteristics, we will discuss each of them separately.

5.1 Direct STWR

Direct STWR has two main characteristics: First, being a quotation of the character's voice, it tends to use first and second person pronouns and present tense. Second, it is often marked with quotation marks, but the reliability of this particular indicator varies dramatically between different texts. Its instances can also be very long, spanning multiple sentences. We observed that about half of the false positives as well as the false negatives are partial matches, i.e. the recognizer did correctly identify a stretch of direct STWR, but either broke off too early or extended it too far.

A main cause of false negatives were missing quotation marks, i.e. unmarked stretches of direct STWR, especially if those occurred in first person narration. In these cases, the recognizer is missing its two most reliable indicators for distinguishing direct STWR from narrator text at the same time. Another source of false negatives are very long stretches of direct STWR, such as embedded narratives. The recognizer loses the wider context and tends to treat this STWR as narrator text, especially if it contains nested direct STWR and exhibits characteristics such as third person pronouns and past tense.

The main source of false positives is also related to narrative perspective: In a first person narration or a letter, the recognizer tends to annotate narrator text as direct STWR – the reverse of the problem described above. Note that these cases are very hard for human annotators as well and can only be solved by knowing a wide context. The recognizer knows a context of at most 100 tokens, and we observed that wrong decisions often occur at the beginning of a context chunk and are then propagated to its end. Another source of false positives are – predictably – stretches of text in quotation marks that are not direct STWR, though these are a relatively rare occurrence. We also observed mix-ups with the forms indirect and free indirect STWR, especially if unusual punctuation was used, though this was rare as well.

In summary, we can say that for direct STWR narrative perspective is a major factor. The test material was deliberately designed to contain texts written both in first and third person perspective. If evaluated separately, we could observe a significantly better performance for third person perspective (see table 6). (We experimented with training two specialized direct models, one using only texts with first person perspective and one using only texts with third person perspective as training material, and evaluated them on the matching types of texts. Unfortunately the performance was worse than that of the model trained on the complete training corpus, probably because of the significant reduction of training material.)

                F1     Prec   Rec
First person    0.80   0.86   0.75
Third person    0.87   0.97   0.79

Table 6: Evaluation for the direct recognizer (top model, FLAIR based), split into texts with first and third person perspective.
Direct STWR is the only type of STWR where FLAIR embeddings performed better than BERT with a clear advantage. Looking at the first 20 differences between the recognizers, we found that BERT is more prone to annotating letters and first person perspective narratives as direct STWR. It also breaks off prematurely more often, indicating that FLAIR seems to be better at maintaining the context of the annotation. On the other hand, FLAIR tends to make more minor mistakes, such as not annotating a dash when it is used instead of a quotation mark to introduce direct STWR. This points to the more character-based behaviour of FLAIR, which – in general – seems to serve well for direct STWR, maybe because of the prevalence of typographical indicators. The wider context of the BERT embeddings does not seem to help with the perspective problem, but instead introduces additional errors.

5.2 Indirect STWR

Indirect representation in our definition takes the form of a subordinate clause or an infinitive phrase, dependent on a framing clause which is not part of the STWR itself. Thus, instances of indirect STWR are always shorter than one sentence. Of the four STWR forms, it is the one that is most strongly defined by its syntactical form.

One difficulty are cases where the indirect STWR contains subclauses or, conversely, is followed by a subclause that is not part of the instance. In these structures, the recognizer tends to have trouble identifying the correct borders of the STWR. Looking at the error analysis, about one third of the errors for both false positives and false negatives are partial matches, mostly caused by this problem.

The biggest cause of errors are cases where the typical indirect structure – a subclause starting with dass, ob (that, whether) or an interrogative pronoun – is paired with an unusual frame. This leads to false positives if the frame contains words that usually indicate STWR, such as es scheint außer Frage, dass ... (it seems out of the question that ...). Though this phrase does not introduce STWR, the word Frage (question) still triggers an annotation. On the flipside, cases of indirect STWR tend to be missed if they are introduced by phrases that have an unusual structure and don't contain words that are strongly associated with speech, thought or writing acts. We also observed that unusual punctuation, such as dashes, multiple dots and colons (used instead of a comma at the border of an indirect STWR), has negative effects on recognition accuracy.

In a comparison between the indirect models using BERT and FLAIR embeddings, we observed that both models make errors of the types described above, though in different places. However, overall FLAIR seems more susceptible to interference in the form of unusual punctuation or framing phrases that are interjected in the middle of a stretch of indirect STWR. It is also less successful than BERT at recognizing STWR instances that are introduced with nouns instead of verbs.
5.3 Reported STWR

Reported STWR is a fairly difficult form even for human annotators, mainly because it is so similar to pure narration that it can be hard to distinguish. It should be noted that the gold standard annotation in this case contains a number of uncertain instances that could be debatable for humans as well. Reported instances tend to be rather short, varying from one token to one sentence at most, and syntactically diverse. The most reliable indicators are words referring to speech, thought and writing acts.

Only about a fifth of the false negatives and false positives observed for reported STWR were partial matches, a significantly lower percentage than for direct and indirect STWR. This indicates that for this form, finding the correct borders of the annotation is less of a problem than deciding whether STWR is present at all.

Most errors can be attributed to problems related to speech, thought or writing words, the main indicator for reported STWR. Such words can trigger a false annotation and are the main cause of false positives. The reverse problem is even more prominent: Instances that do not use lexical material commonly associated with speech, thought and writing tend to be overlooked. Missing such unusual instances is the main problem of the recognizer, and though the direct and indirect recognizers also have better precision than recall, the difference for reported is clearly more pronounced. Another recurring error type is that the borders of the STWR were not detected correctly, missing modifiers or annotating part of the surrounding narration. We also observed some rare mix-ups with indirect STWR.

As noted above, the F1 score ranges of the FLAIR and BERT based recognizers are not distinct for reported, though BERT does perform somewhat better on average. Looking at the differences, we found that the recognizers make the same types of mistakes, but BERT is generally more open to unusual instances of reported STWR, leading to a better recall, which is the main reason for its better overall performance.

5.4 Free indirect STWR

Free indirect STWR is structurally similar to direct STWR in that it usually spans one or more consecutive sentences. It is very hard to identify using surface markers, as it is basically a shift to a character's internal thoughts that still uses the same tense and pronouns as the surrounding narration. The best indicators are emphatic punctuation such as ?, ! or -, words indicating a reference point in the present (such as now, here) and characteristics of informal speech such as dialect or modal particles.

The free indirect recognizers show the largest gap between precision and recall: nearly 0.4 points. Clearly, the problem here lies in undetected cases.
Notably however, over 40% of the false negatives are partial matches, meaning that the recognizer at least correctly detected an instance of free indirect, even though it failed to capture it completely. (The recall problem might be exacerbated by the false negatives in the training data. We ran tests where we cut out the marked instances in the training data with some context (25 or 50 tokens) and used this as training input. A detailed evaluation is beyond the scope of this paper, but the resulting recognizers had better recall but worse precision, leading to similar F1 scores.)

The main cause of false positives are cases in which some of the main indicators of free indirect (as described above) occur in narration. In addition to that, unmarked direct STWR is prone to being labeled as free indirect. As for the false negatives, about half of the missed instances contained at least one surface marker, but many are only recognizable via wider context clues or an understanding of the content.

Comparing BERT and FLAIR again, we find that BERT gives a much better recall – the same effect as with reported STWR, but more pronounced. BERT is clearly better at picking up subtle signals for free indirect STWR than FLAIR. The flip side of this is that the BERT model also produces more false positives than the FLAIR model. An interesting observation is that it sometimes annotates sentences that are not part of the free indirect STWR itself, but introduce it. Though the borders of the STWR are not detected correctly in these cases, this might indicate that the model learned that these context clues are highly relevant for identifying free indirect STWR, which is indeed the case. An example for this scenario is the following passage:

Jetzt war er mit dem Lächeln an der Reihe. Ihre Reaktionen kamen so spontan und waren so ungekünstelt und ehrlich. Hoffentlich würde sie das nie verlieren. (Now it was his turn to smile. Her reactions came so spontaneously and were so genuine and honest. Hopefully she would never lose that.)

BERT also marks the introductory sentence (Jetzt war er mit dem Lächeln an der Reihe.), which shifts the focus to the character, in addition to the free indirect instance that tells us his thoughts. The FLAIR model, on the other hand, has its strength in precision: The few false positives that it produced are often borderline cases that are attached to free indirect passages and could be read as plausible extensions.

6 Conclusion

We presented recognizers for four types of STWR which differ strongly in structure and difficulty. Our models for direct, indirect and reported were trained and tested on historical German fictional and non-fictional texts, the model for free indirect on modern German fiction. The success rates correspond closely to the reliability of humans: For indirect and reported, we even achieved scores similar to the human annotator agreement on a comparable corpus. For the types direct and free indirect, humans still clearly outperform our best models. In both cases, we believe that the need for wide contextual knowledge plays an important role in explaining the gap: For direct, the models fail most often in distinguishing between a first person narrator and a character quote. Free indirect in general is a highly context dependent form that requires an understanding of the narrative structure.

We tested a variety of different language embeddings for our task and provided a comparison of the most promising ones: FLAIR and BERT embeddings. For both, we also trained/fine-tuned models on historical texts. FLAIR gave the best scores for direct, BERT for indirect and free indirect. For reported, the results were not conclusive: Though BERT performed better on average, we observed an overlap in the F1 score ranges of the BERT and FLAIR models over multiple runs.

Most striking was the improvement achieved with BERT for free indirect STWR. In particular, BERT improved recall for the most difficult forms, free indirect and – to a lesser degree – reported, showing a greater ability to detect unusual instances.
Direct STWR was the only form where FLAIR clearly outperformed BERT. It seems that the higher sensitivity of BERT is more of a disadvantage here, as it tended to misclassify even more instances of first person narration than FLAIR.

To further improve performance, one idea is to modify our input strategy: instead of consecutive chunks of up to 100 tokens, overlapping chunks could be used as input. This might prevent the recognizers from losing context at the beginning of a chunk, which would be especially relevant for the direct and free indirect recognizers.

The top models and customized embeddings described in this paper are freely available via our homepage www.redewiedergabe.de and via GitHub. In detail, our customized BERT embeddings can be found at huggingface.co/redewiedergabe/bert-base-historical-german-rw-cased, and the custom-trained FLAIR embeddings are integrated into the FLAIR framework as de-historic-rw-forward and de-historic-rw-backward. The top recognizer models are available at github.com/redewiedergabe/tagger along with the code used for training and execution.

In addition to that, all the material used for the direct, indirect and reported recognizers and part of the material used for the free indirect recognizer is available as corpus REDEWIEDERGABE (Brunner et al., 2020) at github.com/redewiedergabe/corpus. (Unfortunately we can only publish the historical part of the free indirect material due to copyright restrictions on the modern texts.) The rich annotation of corpus REDEWIEDERGABE also offers opportunities to train more complex recognizers, e.g. by providing labels for the medium of the STWR (speech, thought or writing) as well as annotation of the framing phrase for direct and indirect STWR and of the speaker for all four forms of STWR.
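A hedged usage sketch for the released resources: the FLAIR embedding names and the Hugging Face model path are the identifiers given above, while the tagger file name is a placeholder, since the exact file layout of github.com/redewiedergabe/tagger is not described in this paper.

<pre>
from flair.data import Sentence
from flair.embeddings import FlairEmbeddings, TransformerWordEmbeddings
from flair.models import SequenceTagger

# Customized contextual embeddings released with the paper.
historic_forward = FlairEmbeddings("de-historic-rw-forward")
historic_backward = FlairEmbeddings("de-historic-rw-backward")
historic_bert = TransformerWordEmbeddings("redewiedergabe/bert-base-historical-german-rw-cased")

# Applying a downloaded recognizer model (file name is a placeholder; see the
# tagger repository for the released models and the training/execution code).
tagger = SequenceTagger.load("redewiedergabe-direct.pt")
sentence = Sentence("Dann sagte er: Ich habe Hunger.")
tagger.predict(sentence)
print(sentence.to_tagged_string())
</pre>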
References

Alan Akbik, Tanja Bergmann, Duncan Blythe, Kashif Rasul, Stefan Schweter, and Roland Vollgraf. 2019. FLAIR: An Easy-to-Use Framework for State-of-the-Art NLP. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 54–59, Minneapolis, Minnesota. Association for Computational Linguistics.

Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual String Embeddings for Sequence Labeling. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1638–1649, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Ann Banfield. 1982. Unspeakable sentences. Narration and representation in the language of fiction. Routledge & Kegan Paul, Boston u.a.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching Word Vectors with Subword Information. CoRR, abs/1607.04606.

Julian Brooke, Adam Hammond, and Graeme Hirst. 2015. GutenTag: An NLP-driven Tool for Digital Humanities Research in the Project Gutenberg Corpus. In Proceedings of the North American Chapter of the Association for Computational Linguistics – Human Language Technologies, pages 42–47.

Annelen Brunner. 2015. Automatische Erkennung von Redewiedergabe. Ein Beitrag zur quantitativen Narratologie. Number 47 in Narratologia. de Gruyter, Berlin u.a.

Annelen Brunner, Stefan Engelberg, Fotis Jannidis, Ngoc Duyen Tanja Tu, and Lukas Weimer. 2020. Corpus REDEWIEDERGABE. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 796–805, Marseille, France. European Language Resources Association.

Annelen Brunner, Ngoc Duyen Tanja Tu, Lukas Weimer, and Fotis Jannidis. 2019. Deep learning for free indirect representation. In Proceedings of the 15th Conference on Natural Language Processing (KONVENS 2019): Short Papers, pages 241–245, Erlangen, Germany. German Society for Computational Linguistics & Language Technology.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. CoRR, abs/1810.04805.

Cathrine Fabricius-Hansen, Kåre Solfjeld, and Anneliese Pitz. 2018. Der Konjunktiv: Formen und Spielräume. Number 100 in Stauffenburg Linguistik. Stauffenburg, Tübingen.

Monika Fludernik. 1993. The fictions of language and the languages of fiction. The linguistic representation of speech and consciousness. Routledge, London/New York.

Cláudia Freitas, Bianca Freitas, and Diana Santos. 2016. QUEMDISSE? Reported Speech in Portuguese. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pages 4410–4416.

Gérard Genette. 2010. Die Erzählung, 3rd edition. Number 8083 in UTB. Fink, Paderborn.

Stefan Hauser. 2008. Beobachtungen zur Redewiedergabe in der Tagespresse. Eine kontrastive Analyse. In Heinz-Helmut Lüger and Hartmut Lenk, editors, Kontrastive Medienlinguistik, pages 271–286. Verlag Empirische Pädagogik, Landau.

Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF Models for Sequence Tagging. CoRR, abs/1508.01991.

Fotis Jannidis, Albin Zehe, Leonard Konle, Andreas Hotho, and Markus Krug. 2018. Analysing Direct Speech in German Novels. In Digital Humanities im deutschsprachigen Raum – Book of abstracts.

Aditya Joshi, Sarvnaz Karimi, Ross Sparks, Cécile Paris, and C Raina MacIntyre. 2019. A Comparison of Word-based and Context-based Representations for Classification Problems in Health Informatics. In Proceedings of the BioNLP 2019 workshop, pages 135–141, Florence, Italy. Association for Computational Linguistics.

Konstantina Lazaridou, Ralf Krestel, and Felix Naumann. 2017. Identifying Media Bias by Analyzing Reported Speech. In IEEE International Conference on Data Mining – Book of abstracts, pages 943–948.

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2019. BioBERT: a pre-trained biomedical language representation model for biomedical text mining.

Geoffrey Leech and Mick Short. 2013. Style in fiction. A linguistic introduction to English fictional prose, 2nd edition. Routledge, London u.a.

Brian McHale. 2014. Speech Representation. In Peter Hühn, John Pier, Wolf Schmid, and Jörg Schönert, editors, The living handbook of narratology. Hamburg University Press, Hamburg.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space.

Sean Papay and Sebastian Padó. 2019. Quotation detection and classification with a corpus-agnostic model. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), pages 888–894, Varna, Bulgaria. INCOMA Ltd.

Roy Pascal. 1977. The dual voice. Free indirect speech and its functioning in the nineteenth-century European novel. Manchester University Press, Manchester.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.
Christof Schöch, Daniel Schlör, Stefanie Popp, Annelen Brunner, and José Calvo Tello. 2016. Straight Talk! Automatic Recognition of Direct Speech in Nineteenth-century French Novels. In Conference Abstracts, pages 346–353, Jagiellonian University & Pedagogical University, Kraków.

Christian Scheible, Roman Klinger, and Sebastian Padó. 2016. Model Architectures for Quotation Detection. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), pages 1736–1745.

Luise Schricker, Manfred Stede, and Peer Trilcke. 2019. Extraction and Classification of Speech, Thought, and Writing in German Narrative Texts. In Preliminary proceedings of the 15th Conference on Natural Language Processing (KONVENS 2019): Long Papers, pages 183–192, Erlangen, Germany. German Society for Computational Linguistics & Language Technology.

Shreyas Sharma and Ron Daniel. 2019. BioFLAIR: Pretrained Pooled Contextualized Embeddings for Biomedical Sequence Labeling Tasks.

Milan Straka, Jana Straková, and Jan Hajič. 2019. Evaluating Contextualized Embeddings on 54 Languages in POS Tagging, Lemmatization and Dependency Parsing.

Jörg Tiedemann. 2012. Parallel Data, Tools and Interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 2214–2218, Istanbul, Turkey. European Language Resources Association (ELRA).

Ngoc Duyen Tanja Tu, Markus Krug, and Annelen Brunner. 2019. Automatic recognition of direct speech without quotation marks. A rule-based approach. In Digital Humanities: multimedial & multimodal. Konferenzabstracts, pages 87–89, Frankfurt am Main/Mainz.

Harald Weinrich. 2007. Textgrammatik der deutschen Sprache, 4th revised edition. Wissenschaftliche Buchgesellschaft, Darmstadt.

Gregor Wiedemann, Steffen Remus, Avi Chawla, and Chris Biemann. 2019. Does BERT Make Any Sense? Interpretable Word Sense Disambiguation with Contextualized Embeddings. In Preliminary proceedings of the 15th Conference on Natural Language Processing (KONVENS 2019): Long Papers, pages 161–170, Erlangen, Germany. German Society for Computational Linguistics & Language Technology.

Gisela Zifonun, Ludger Hoffmann, and Bruno Strecker. 1997. Grammatik der deutschen Sprache, volume 3 of Schriften des Instituts für deutsche Sprache. de Gruyter, Berlin/New York/Amsterdam.