<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>To BERT or not to BERT - Comparing contextual embeddings in a deep learning architecture for the automatic recognition of four types of speech, thought and writing representation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Annelen Brunner</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ngoc Duyen Tanja Tu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lukas Weimer</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fotis Jannidis</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institut für Deutsche Sprache (IDS)</institution>
          ,
          <addr-line>Mannheim</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universität Würzburg</institution>
          ,
          <addr-line>Am Hubland, D-97074 Würzburg</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>We present recognizers for four very different types of speech, thought and writing representation (STWR) for German texts. The implementation is based on deep learning with two different customized contextual embeddings, namely FLAIR embeddings and BERT embeddings. This paper gives an evaluation of our recognizers with a particular focus on the differences in performance we observed between those two embeddings. FLAIR performed best for direct STWR (F1=0.85), BERT for indirect (F1=0.76) and free indirect (F1=0.59) STWR. For reported STWR, the comparison was inconclusive, but BERT gave the best average results and best individual model (F1=0.60). Our best recognizers, our customized language embeddings and most of our test and training data are freely available and can be found via www.redewiedergabe.de or at github.com/redewiedergabe.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Speech, thought and writing representation
(STWR) is an interesting phenomenon both from
a narratological and a linguistic point of view.
The manner in which a character’s voice is
incorporated into the narrative is strongly linked to
narrative techniques as well as to the construction
of the narrative world and is therefore a standard
topic in narratology (e.g. McHale (2014);
        <xref ref-type="bibr" rid="ref14">Genette
(2010</xref>
        ); Leech and Short (2013)). For some
phenomena, such as free indirect discourse and
stream of consciousness, there is a large amount
of research (e.g. Banfield (1982); Fludernik
(1993); Pascal (1977)). In linguistics, the
grammatical, lexical and functional characteristics of
STWR have also been of interest (e.g. Weinrich
(2007); Zifonun et al. (1997); Hauser (2008);
Fabricius-Hansen et al. (2018)).
      </p>
      <p>To conduct either narratological or linguistic
studies of STWR on large amounts of data, the ability
to automatically detect different types of STWR
would be of great benefit. This was our
motivation to develop recognizers for the following four
forms of STWR, which have been distinguished in
literary and linguistic theory.1</p>
      <p>Direct STWR is a quotation of a character’s
speech, thought or writing. It is frequently –
though not always – enclosed by quotation marks
and/or introduced by a framing clause.</p>
    </sec>
    <sec id="sec-2">
      <title>Dann sagte er: “Ich habe Hunger.”</title>
      <p>(Then he said: “I’m hungry.”)
Free indirect STWR, also known as “erlebte
Rede” in German, is mainly used in literary
texts to represent a character’s thoughts while still
maintaining characteristics of the narrator’s voice
(e.g. past tense and third person pronouns).</p>
    </sec>
    <sec id="sec-3">
      <title>Er war ratlos. Woher sollte er denn hier bloß ein Mittagessen bekommen? (He was at a loss. Where should he ever find lunch here?)</title>
      <p>Indirect STWR is a paraphrase of the character’s
speech, thought or writing, composed of a framing
clause (not counted as part of the STWR) with
a dependent subordinate clause (often using
subjunctive mood) or an infinitive phrase.</p>
      <p>Er fragte, wo das Essen sei. (He asked
where the food was.)</p>
      <p>1 The stretch of STWR is printed in italics in the following
examples.</p>
      <p>Reported STWR is defined as a mention of a
speech, thought or writing act that may or may not
specify the topic and does not take the form of
indirect STWR.</p>
    </sec>
    <sec id="sec-4">
      <title>Er sprach über das Mittagessen. (He talked about lunch.)</title>
      <p>In the following, we will describe our approach
in developing recognizers for these four STWR
types and evaluate our results with a particular
focus on the differences between the two contextual
embeddings that proved most successful for our
task, BERT and FLAIR.
</p>
      <sec id="sec-4-1">
        <title>2 Related work</title>
        <sec id="sec-4-1-1">
          <title>2.1 STWR recognizers</title>
          <p>Automatic STWR recognition focuses mainly on
the forms direct STWR (e.g. Schricker et al.
(2019); Jannidis et al. (2018); Tu et al. (2019);
Brunner (2015); Brooke et al. (2015) for German
texts; Schöch et al. (2016) for French texts) and
indirect STWR (e.g. Schricker et al. (2019); Brunner
(2015) for German texts; Lazaridou et al. (2017);
Scheible et al. (2016) for English texts; Freitas
et al. (2016) for Portuguese texts). For free
indirect and reported STWR, recognizers were
implemented by Brunner (2015) and Schricker et al.
(2019), the latter builds upon work by the former.
In addition to that, Papay and Padó (2019)
propose a corpus-agnostic neural model for quotation
detection.</p>
          <p>Since we developed recognizers for German,
we will only take a closer look at recognizers
trained and tested on German texts. Jannidis et al.
(2018) developed a deep-learning based
recognizer for direct speech in German novels, which
works without quotation marks and achieves an
accuracy of 0.84 in sentence-wise evaluation. The
algorithm by Brooke et al. (2015) is a simple
rule-based algorithm, which matches quotation marks.
They do not report any scores. Like Jannidis et al.
(2018), Tu et al. (2019) focus on developing a
recognizer for direct speech which works without
quotation marks, but they used a rule-based
approach, achieving a sentence-level accuracy
between 80.5% and 85.4% for fictional and 60.8% for
non-fictional data. Brunner (2015) uses a corpus
of 13 short German narratives. For each type of
STWR, she implements a rule-based model and
trains a RandomForest model, evaluated in
tenfold cross validation. The best F1 scores,
evaluated on sentence level, were achieved by the
rule-based approach for indirect (F1=0.71) and
reported (F1=0.57) STWR and by the
RandomForest model for direct (F1=0.87) and free indirect
(F1=0.40) STWR. Schricker et al. (2019) use the
same corpus, but split the data into a stratified
training and test set. They use different features
than Brunner and train three different machine
learning algorithms, RandomForest, Support
Vector Machine and Multilayer Perceptron.
RandomForest was most successful and gave
sentence-wise F1 scores of 0.95 for direct, 0.79 for indirect,
0.70 for free indirect and 0.49 for reported STWR.
Papay and Padó (2019) test their corpus-agnostic
quotation detection model on Brunner’s corpus
and approximate her RandomForest results.</p>
          <p>Our recognizers fill a need regarding the
recognition of STWR in German texts, as they deal with
all four forms of STWR and are, at the same time,
trained and tested on a much larger data base than
Brunner’s corpus, making our results much more
reliable. In addition to that, our data not only
comprises fictional, but also non-fictional texts. An
earlier version of our recognizer for free indirect
STWR was discussed in Brunner et al. (2019). We
improved on this version by adding more training
data and achieved higher scores with the BERT-based
model.</p>
        </sec>
        <sec id="sec-4-1-2">
          <title>2.2 Language embeddings</title>
          <p>
            As the testing of different language embeddings
was a central component in the development of our
recognizers, we will briefly outline characteristics
and research concerning the two most successful
ones that will be in focus in the rest of this
paper: FLAIR embeddings
            <xref ref-type="bibr" rid="ref2">(Akbik et al., 2018)</xref>
            and
BERT embeddings
            <xref ref-type="bibr" rid="ref10">(Devlin et al., 2019)</xref>
            .
          </p>
          <p>
            Both have in common that they produce
context-dependent embeddings as opposed to
static word embeddings, such as fastText
            <xref ref-type="bibr" rid="ref4">(Bojanowski et al., 2016)</xref>
            , GloVe
            <xref ref-type="bibr" rid="ref26">(Pennington et al.,
2014)</xref>
            or Word2Vec
            <xref ref-type="bibr" rid="ref23">(Mikolov et al., 2013)</xref>
            . That
means they assign an embedding to a word based
on its context and are therefore able to capture
polysemy. Contextual embeddings have been
shown to be of great benefit in several NLP tasks,
e.g. predicting the topic of a tweet
            <xref ref-type="bibr" rid="ref18">(Joshi et al.,
2019)</xref>
            , part-of-speech-tagging, lemmatization and
dependency parsing
            <xref ref-type="bibr" rid="ref31">(Straka et al., 2019)</xref>
            . Though
FLAIR and BERT both produce contextual
embeddings, they differ in several features.
          </p>
          <p>FLAIR produces character-level embeddings.
A sentence is passed to the algorithm as a
sequence of characters and the task is to predict the
next character based on the previous characters.
When using FLAIR embeddings, it is
recommended to combine two independently trained
models: i) a forward model and ii) a backward model.
The forward model reads every character in a
sentence from left-to-right, the backward model from
the opposite direction. This character-based
embedding architecture gives advantages in capturing
morphological and semantic structures (cf. Akbik
et al. (2018)).</p>
          <p>While FLAIR embeddings only know their
previous context, BERT embeddings know the
previous as well as succeeding context at the same time.
The training of BERT embeddings consists of two
tasks: 1) masked language modelling: given a sequence
of tokens in which 15% of the tokens are masked,
predict the masked tokens based on their context;
2) next sentence prediction: given a pair of sentences,
predict whether the second sentence follows the first
one. The advantage of this model is that it learns
associations between tokens as well as between
sentences. This is important for token-level tasks, such as
question answering (cf. Devlin et al. (2019)).</p>
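          <p>The masking step of the first training task can be sketched in a few lines of Python (a simplified illustration only: real BERT operates on subword units and sometimes replaces a chosen token with a random word or leaves it unchanged):</p>
          <preformat>
```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=1):
    """Hide roughly 15% of the tokens; the model must predict the
    hidden tokens from the visible context on both sides."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() > mask_prob:   # most tokens stay visible
            masked.append(tok)
        else:                          # this token becomes a prediction target
            masked.append(mask_token)
            targets[i] = tok
    return masked, targets

masked, targets = mask_tokens("Er fragte , wo das Essen sei .".split())
```
          </preformat>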
          <p>
            There are no systematic analyses comparing
the performance of embeddings for a sequence
labeling task in German, as we will do in
this paper. However, there is work
focusing on the use of different embeddings in NLP
tasks in English texts: e.g. Wiedemann et al.
(2019) compare BERT, ELMo and FLAIR
embeddings in a word sense disambiguation task,
where BERT performed best. Sharma and Daniel
(2019) compare BioBERT
            <xref ref-type="bibr" rid="ref10 ref20">(Lee et al., 2019)</xref>
            and
FLAIR embeddings, more precisely the
pubmedx model, in a Biomedical Named Entity
Recognition task. They find that stacking FLAIR
embeddings with BioELMo yields better results than
using only FLAIR. Compared to BioBERT the
results of FLAIR are very close, although the FLAIR
embeddings are pretrained on a much smaller
dataset. There is also a comparison between BERT
(bert-base-uncased, bert-base-chinese,
bert-base-multilingual-uncased), FLAIR (bg, cs, de, en, fr,
nl, pl, pt, sl, sv) and ELMo (english) in a
part-of-speech-tagging, lemmatization and dependency
parsing task on different languages: Straka et al.
(2019) showed that BERT outperforms ELMo as
well as FLAIR embeddings in dependency
parsing. As opposed to that, ELMo performs best
in part-of-speech-tagging and lemmatization,
followed by FLAIR. Therefore Straka et al. (2019)
conclude that ELMo is best and FLAIR
embeddings are second best in capturing morphological
and orthographic information while BERT is best
in capturing syntactic information.
          </p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>3 Method</title>
        <p>We defined the recognition of STWR as a
sequence labeling task on token level. For each of
the four types of STWR, a separate model was
trained on binary labels (“token is part of this type
of STWR: yes/no”). The input data consists of
chunks of up to 100 tokens, which may span
several sentences. The chunks may never cross
borders between different texts or cut sentences
(except when a sentence exceeds 100 tokens) and can
therefore also be shorter than the maximum.</p>
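        <p>The chunking rules described above can be sketched as follows (an illustrative reimplementation, not the authors' actual preprocessing code; texts are passed as lists of sentences, each sentence a list of tokens):</p>
        <preformat>
```python
def make_chunks(texts, max_len=100):
    """Pack whole sentences greedily into chunks of at most max_len
    tokens. Chunks never cross text borders; a sentence is only cut
    if it alone exceeds max_len tokens."""
    chunks = []
    for sentences in texts:
        current = []
        for sent in sentences:
            if len(current) + len(sent) > max_len:
                if current:                # flush the current chunk
                    chunks.append(current)
                current = list(sent)
                # a single sentence longer than max_len is the only
                # case in which a sentence border is crossed
                while len(current) > max_len:
                    chunks.append(current[:max_len])
                    current = current[max_len:]
            else:
                current.extend(sent)
        if current:                        # a chunk always ends at a text border
            chunks.append(current)
    return chunks
```
        </preformat>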
        <p>
          To train our tagging model, we used the
SequenceTagger class of the FLAIR framework
          <xref ref-type="bibr" rid="ref1">(Akbik et al., 2019)</xref>
          which implements a
BiLSTM-CRF architecture on top of a language embedding
(as proposed by Huang et al. (2015)). We use two
BiLSTM layers with a hidden size of 256 each
and one CRF layer. This setting was decided after
running tests with only one BiLSTM layer, which
gave considerably worse results, and with three
BiLSTM layers, which led to no significant
improvements.
        </p>
        <p>We tested many different configurations for the
language embeddings in this setup. Initial tests
were done with just fastText embeddings. The
results were much worse than the two
configurations that became our main focus: a) a
fastText model stacked with a FLAIR forwards and
a FLAIR backwards model (as recommended in
Akbik et al. (2018)) and b) a BERT model.</p>
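        <p>In the FLAIR framework, this setup corresponds roughly to the following configuration sketch (not the authors' published code; the embedding file paths and the tag names are hypothetical placeholders):</p>
        <preformat>
```python
from flair.data import Dictionary
from flair.embeddings import WordEmbeddings, FlairEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger

# configuration a): custom fastText stacked with FLAIR forward/backward
# (the three file paths are placeholders for the custom-trained models)
embeddings = StackedEmbeddings([
    WordEmbeddings("resources/fasttext_custom.gensim"),
    FlairEmbeddings("resources/flair_forward.pt"),
    FlairEmbeddings("resources/flair_backward.pt"),
])

# binary tag set: token is part of this STWR type, yes/no
tag_dictionary = Dictionary(add_unk=False)
for tag in ["0", "1"]:
    tag_dictionary.add_item(tag)

tagger = SequenceTagger(
    hidden_size=256,   # per BiLSTM layer, as described above
    rnn_layers=2,      # two BiLSTM layers
    use_crf=True,      # one CRF layer on top
    embeddings=embeddings,
    tag_dictionary=tag_dictionary,
    tag_type="stwr",
)
```
        </preformat>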
        <p>
          Except for free indirect, all of our recognizers
were trained and tested on historical German.
Using out-of-the-box embeddings, which are trained
on modern texts such as the German Wikipedia dump,
the Open Legal Data dump, OpenSubtitles or the EU
Bookshop corpus
          <xref ref-type="bibr" rid="ref32">(Tiedemann, 2012)</xref>
          , is therefore
problematic. So we custom-trained our own
fastText and FLAIR embeddings and fine-tuned the
BERT embeddings. The following settings were
used:
        </p>
        <p>Skip-gram fastText models: We used the
default settings as recommended by the fastText tutorial,
i.e. we trained for five epochs, set the learning rate
to 0.05 and set the minimum character
n-gram size to 3 and the maximum to 6. We
varied the model dimensions as well as the training
material: fastTextTrain clean is a smaller, cleaner
corpus, fastTextTrain contains additional material
with OCR errors (cf. section 4.1). On each
training set, one model with 300 and one model with
500 dimensions was trained.</p>
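        <p>With the fastText command-line tool, these settings correspond to an invocation like the following (file names are placeholders):</p>
        <preformat>
```shell
# Skip-gram model with the default tutorial settings listed above
./fasttext skipgram -input fastTextTrain_clean.txt -output ft_clean_300 \
  -dim 300 -minn 3 -maxn 6 -epoch 5 -lr 0.05
```
        </preformat>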
        <p>FLAIR: A forward and a backward FLAIR
embedding with a hidden size of 1024 were trained.
All settings were chosen according to the
recommendation of the FLAIR tutorial, i.e. the sequence
length was set to 250, the mini-batch size to 100,
the learning rate to 0.20, the annealing factor to 0.4
and the patience value to 25. The model stopped
training after 10 epochs due to low loss.</p>
        <p>BERT: We used the PyTorch script
finetune_on_pregenerated.py to fine-tune the
pretrained bert-base-german-cased model with the
recommended default configuration: epochs: 3,
gradient accumulation steps: 1, train batch size:
32, learning rate: 0.00003 and max_seq_len: 128.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4 Data</title>
        <sec id="sec-4-3-1">
          <title>4.1 Training data for the embeddings</title>
          <p>For the training/fine-tuning of the embeddings,
9,577 fictional and non-fictional German texts
from the 19th and early 20th century were
selected.</p>
          <p>For the fine-tuning of the BERT embeddings,
we fed all data – split into sentences – into the
script pregenerate_training_data.py from PyTorch,
which transforms it into BERT-compatible
input data. The BERT fine-tuning tutorial
recommends creating an epoch of input data for
each training epoch, so BERT will not be trained
on the same random splits in each epoch. We
fine-tuned BERT for 3 epochs, so we generated 3
epochs of data.</p>
          <p>For the FLAIR embeddings, 70%, i.e.
4,508,960 sentences, of the 6,441,372
sentences from the data were randomly drawn to
form the training corpus. For the validation
corpus 15%, i.e. 966,206 sentences, were randomly
drawn. The rest was used for testing purposes.</p>
          <p>For training the fastText embeddings, we used
two different inputs. fastTextTrain contained all
137,093,995 tokens of our data. From
fastTextTrain clean we removed all texts that were
recognized with OCR and thus contained typical
OCR errors. This resulted in a smaller input set of
131,360,863 tokens.</p>
        </sec>
        <sec id="sec-4-3-2">
          <title>4.2 Training data for the recognizers</title>
          <p>The recognizers for direct, indirect and reported
STWR were trained on historical German texts
– excerpts as well as full texts – that were
published from the middle of the 19th to the early
20th century. The data comprises fiction as well as
non-fiction (newspaper and journal articles) in nearly
equal proportion; fiction is somewhat more
dominant. Roughly half of the data was manually
labeled by two human annotators independently of
one another. Then a third person compared the
annotations, adjudicated discrepancies and created a
consensus annotation. The rest was labeled by a
single annotator.</p>
          <p>For indirect STWR, the training data was
supplemented with 16 additional historical full texts
(9 fictional and 7 non-fictional) to increase the
number of instances. To speed up the annotation
process, these texts were automatically annotated
by one of our earlier recognizer models and then
manually checked. The annotators looked at the
whole texts, so false negatives were corrected as
well.</p>
          <p>
            All the historical data is published as corpus
REDEWIEDERGABE
            <xref ref-type="bibr" rid="ref7">(Brunner et al., 2020)</xref>
            and
freely available.
          </p>
          <p>
As the historical data contained far too few
instances of free indirect STWR, we had to
create a separate training corpus for this STWR
type. The basis was 150 instances of free
indirect STWR with little to no context, manually
extracted from 20th century novels. In addition to
that, full texts and excerpts from modern popular
crime novels as well as dime novels were
automatically annotated with a basic rule-based
recognizer that used typical surface indicators. Those
annotations were then verified by human
annotators. On this data, we trained an early recognizer
            <xref ref-type="bibr" rid="ref9">(Brunner et al., 2019)</xref>
            which was then used to
annotate additional historical fictional texts. These
annotations were again verified by human
annotators before they were added to the training
material as well. It should be noted that in this
semi-automated annotation process, instances that
were not detected by the early recognizers had no
chance of being annotated. Because of this, the
data most likely contains false negatives.
          </p>
          <p>For model training, our data was split into a
training corpus (648,338 tokens for direct and reported;
700,202 tokens for indirect; 3,804,226 tokens for free
indirect) and a validation corpus (97,316 tokens for
direct, reported and indirect; 181,942 tokens for free
indirect). Table 1 shows the occurrences of each form
of STWR in its training and validation corpus, given in
tokens, percentage of tokens in its corpus and
instances.2</p>
          <p>[Table 1, partially recoverable from the extraction:
training-corpus tokens 212,467 / 49,222 / 66,817 / 236,011
and percent/instances 32.77/6,293, 7.03/3,505,
10.31/7,522 and 6.30/6,887 for the four STWR types]</p>
          <p>Our test data for the direct, indirect and reported
STWR recognizers has 97,863 tokens and comprises
excerpts from historical fictional and non-fictional
texts in equal proportions. They were labeled with a
consensus annotation as described in section 4.2. The
test data for the free indirect STWR recognizer has
22,935 tokens and comprises 22 excerpts from dime
novels, which were manually labeled by one human
annotator. Table 1 shows the occurrences of each form
of STWR in its test corpus, given in tokens, percentage
of test corpus tokens and instances.</p>
        </sec>
      </sec>
      <sec id="sec-4-4">
        <title>5 Results</title>
        <p>We report the scores of our most successful
language embedding configurations, i.e. fine-tuned
BERT and fastText stacked with FLAIR forwards
and backwards.</p>
        <p>Notably, our fine-tuned BERT model performed
better than the regular BERT model for all STWR
types, even free indirect, where the STWR
recognizers were tested on modern German fiction
(as opposed to historical German fiction and
non-fiction for the other models). The same is true
for the custom-trained fastText + FLAIR models,
which outperformed models pretrained on
modern German for all STWR types as well.</p>
        <p>2 An instance is defined here as an uninterrupted sequence
of tokens annotated as the same type of STWR, which can be
longer than one sentence. This is of course a simplification, as
two conceptually separate stretches, such as lines of dialogue
by two different people, will be counted as one instance if
they follow directly after each other, but it can serve as a rough
guideline. On average, a direct instance is 46 tokens long, an
indirect instance 15 tokens, a reported instance 10 tokens and
a free indirect instance 34 tokens.</p>
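        <p>The instance definition in footnote 2 can be made concrete with a short Python sketch (illustrative only; labels are the binary token labels described in the method section):</p>
        <preformat>
```python
def count_instances(labels):
    """Count instances as in footnote 2: an instance is an
    uninterrupted run of tokens labeled 1, i.e. tokens that are part
    of the given STWR type. Returns the number of instances and
    their average length in tokens."""
    lengths = []
    run = 0
    for label in labels:
        if label == 1:
            run += 1
        elif run:
            lengths.append(run)
            run = 0
    if run:
        lengths.append(run)
    count = len(lengths)
    average = sum(lengths) / count if count else 0.0
    return count, average
```
        </preformat>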
        <p>We speculate that this is because the
customization made the models better suited for literary texts
in general, even though it was done on historical
German.</p>
        <p>The most successful configuration of fastText +
FLAIR varies slightly between the different forms
of STWR with respect to the fastText model that
gave the best results. The fastText specifications
for the four types of STWR are detailed in table 2.</p>
        <p>We trained each model with the same
configuration three times to correct for random
variation in the deep learning results. Table 3 reports
the average value of each score and the standard
deviation, calculated on token level.</p>
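        <p>The token-level scores can be computed from the binary labels as follows (a standard metric sketch, included to make the evaluation scheme explicit; not the authors' evaluation script):</p>
        <preformat>
```python
def token_scores(gold, pred):
    """Token-level precision, recall and F1 for binary STWR labels.
    gold and pred are equal-length sequences of 0/1 token labels."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```
        </preformat>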
        <p>On average, the recognizers using BERT
embeddings scored better for all types of STWR
except direct, for which the recognizers of the
stacked fastText and FLAIR embeddings proved
consistently more successful. Most striking was
BERT’s advantage for free indirect, where
especially the recall improved. It should be noted
though, that the FLAIR-based freeIndirect model
consistently gave better precision.</p>
        <p>However, when looking at the standard
deviation over the three runs and the range of results,
we see that the F1 score ranges of the FLAIR and
BERT recognizers overlap for reported, so the
results of the comparison are not conclusive for this
STWR type. For the other three STWR types, the
F1 score ranges are clearly distinct, even though
the free indirect models show a high variance.</p>
        <p>Table 4 lists the scores for the individual
recognizers from the three training runs that produced
the best results.</p>
        <p>To give an impression of how difficult it is for
humans to annotate these forms, table 5 presents the
agreement scores between human annotators. The
scores for direct, indirect and reported STWR are
based on corpus REDEWIEDERGABE, the
corpus of fictional and non-fictional historical texts
our test data was drawn from. The score for free
indirect was calculated directly on the free indirect
test corpus.</p>
        <p>We performed two types of error analysis: First,
we looked at the first 10,000 tokens of our test data
and categorized the types of errors made by our
top recognizers (cf. table 4). This gives an
impression of the types of challenges the four forms
of STWR pose and how well our recognizers can
deal with them, which is important practical
information for anyone using them.</p>
        <p>Second, we also looked at the first 20
differences between the results of the best models
trained with BERT vs. the best models trained
with FLAIR. The goal was to find indications of which
specific properties of the two different contextual
embeddings made them better or worse suited to a
particular task.</p>
        <p>As the four types of STWR have very different
characteristics, we will discuss each of them
separately.</p>
        <sec id="sec-4-4-0">
          <title>5.1 Direct STWR</title>
          <p>Direct STWR has two main characteristics: First,
being a quotation of the character’s voice, it
tends to use first and second person pronouns and
present tense. Second, it is often marked with
quotation marks, but the reliability of this
particular indicator varies dramatically between different
texts. Its instances can also be very long, spanning
multiple sentences. We observed that about half of
the false positives as well as the false negatives are
partial matches, i.e. the recognizer did correctly
identify a stretch of direct STWR, but either broke
off too early or extended it too far.</p>
          <p>A main cause of false negatives was missing
quotation marks, i.e. unmarked stretches of
direct STWR, especially if those occurred in first
person narration. In these cases, the recognizer is
missing its two most reliable indicators for
distinguishing direct STWR from narrator text at the same
time. Another source of false negatives are very
long stretches of direct STWR, such as embedded
narratives. The recognizer loses the wider
context and tends to treat this STWR as narrator text,
especially if it contains nested direct STWR and
exhibits characteristics such as third person
pronouns and past tense.</p>
          <p>The main source of false positives is also related
to narrative perspective: In a first person narration
or a letter, the recognizer tends to annotate
narrator text as direct STWR – the reverse problem to
the one described above. Note that these cases are
very hard for human annotators as well and can
only be solved by knowing a wide context. The
recognizer knows a context of 100 tokens at
maximum, and we observed that wrong decisions often
occur at the beginning of a context chunk and are
then propagated to its end. Another source of false
positives are – predictably – stretches of text in
quotation marks that are not direct STWR, though
these are a relatively rare occurrence. We also
observed mix-ups with the forms indirect and free
indirect STWR, especially if unusual punctuation was
used, though this was rare as well.</p>
          <p>In summary, we can say that for direct STWR
narrative perspective is a major factor. The test
material was deliberately designed to contain texts
written both in first and third person perspective.
If evaluated separately, we could observe a
significantly better performance for third person
perspective (see table 6).3</p>
          <p>Direct STWR is the only type of STWR where
FLAIR embeddings clearly performed better than
BERT. Looking at the first 20
differences between the recognizers, we found that
BERT is more prone to annotate letters and first
person perspective narratives as direct STWR. It
also breaks off prematurely more often,
indicating that FLAIR seems to be better at maintaining
the context of the annotation. On the other hand,
FLAIR tends to make more minor mistakes, such
as not annotating a dash when it is used instead of
a quotation mark to introduce direct STWR. This
points to the more character-based behaviour of
FLAIR which – in general – seems to serve well
for direct STWR, maybe because of the prevalence
of typographical indicators. The wider context of
the BERT embeddings does not seem to help with
the perspective problem, but instead introduces
additional errors.</p>
        </sec>
        <sec id="sec-4-4-1">
          <title>5.2 Indirect STWR</title>
          <p>Indirect representation in our definition takes the
form of a subordinate clause or an infinitive
phrase, dependent on a framing clause which is
not part of the STWR itself. Thus, instances of
indirect STWR are always shorter than one
sentence. Of the four STWR forms, it is the one that
is most strongly defined by its syntactical form.</p>
          <p>One difficulty is posed by cases in which the indirect
STWR contains subclauses or, conversely, is
followed by a subclause that is not part of the
instance. In these structures, the recognizer tends to
have trouble identifying the correct borders of the
STWR. When looking at the error analysis, about
one third of the errors for both false positives and
false negatives are partial matches, mostly caused
by this problem.</p>
          <p>3 We experimented with training two specialized direct
models, one using only texts with first person perspective and
one using only texts with third person perspective as training
material, and evaluated them on the matching types of texts.
Unfortunately, the performance was worse than that of the
model trained on the complete training corpus, probably
because of the significant reduction of training material.</p>
          <p>The biggest cause of errors are cases where
the typical indirect structure – a subclause
starting with dass, ob (that, whether) or an
interrogative pronoun – is paired with an unusual frame.
This leads to false positives, if the frame
contains words that usually indicate STWR, such as
es scheint außer Frage, dass ... (it seems out of
the question that ...). Though this phrase does not
introduce STWR, the word Frage (question) still
triggers an annotation. On the flip side, cases of
indirect STWR tend to be missed if they are
introduced by phrases that have an unusual structure
and do not contain words that are strongly
associated with speech, thought or writing acts. We also
observed that unusual punctuation, such as dashes,
multiple dots and colons (used instead of a comma at
the border of an indirect STWR), has negative
effects on recognition accuracy.</p>
          <p>In a comparison between the indirect models
using BERT and FLAIR embeddings, we observed
that both models make errors of the types
described above, though at different places.
However, overall FLAIR seems more susceptible to
interference in the form of unusual punctuation or
framing phrases that are interjected in the
middle of a stretch of indirect STWR. It is also less
successful than BERT in recognizing STWR
instances that are introduced with nouns instead of
verbs.
5.3</p>
        </sec>
        <sec id="sec-4-4-2">
          <title>5.3 Reported STWR</title>
          <p>Reported STWR is a fairly difficult form even for
human annotators, mainly because it is so similar
to pure narration that it can be hard to distinguish.
It should be noted that the gold standard
annotation in this case contains a number of uncertain
instances that could be debatable for humans as
well. Reported instances tend to be rather short,
varying from one token to one sentence at most,
and syntactically diverse. The most reliable
indicators are words referring to speech, thought and
writing acts.</p>
          <p>Only about a fifth of the false negatives and
false positives observed for reported STWR were
partial matches, a significantly lower percentage
than for direct and indirect STWR. This indicates
that for this form, finding the correct borders of
the annotation is less of a problem than deciding
whether STWR is present at all.</p>
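          <p>The distinction between exact and partial matches used throughout this evaluation can be sketched as follows. This is a minimal illustration with a hypothetical helper, not the evaluation code used for this paper; spans are (start, end) token offsets with an exclusive end.</p>
          <preformat>
```python
# Hypothetical sketch of exact vs. partial span matching:
# a predicted span is an exact match if its borders equal a gold span,
# a partial match if it merely overlaps a gold span, and a gold span
# counts as missed if no predicted span overlaps it.

def match_spans(gold, pred):
    """gold, pred: lists of (start, end) token offsets, end exclusive."""
    exact = [p for p in pred if p in gold]
    partial = [
        p for p in pred
        if p not in gold
        and any(p[0] in range(g[0], g[1]) or g[0] in range(p[0], p[1])
                for g in gold)
    ]
    missed = [g for g in gold
              if all(g[0] >= p[1] or p[0] >= g[1] for p in pred)]
    return exact, partial, missed

gold = [(0, 5), (10, 14)]
pred = [(0, 5), (12, 20)]  # one exact match, one overlapping prediction
exact, partial, missed = match_spans(gold, pred)
```
          </preformat>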
          <p>Most errors can be attributed to problems
related to speech, thought or writing words, the main
indicator for reported STWR. Such words can
trigger a false annotation and are the main cause of
false positives. The reverse problem is even more
prominent: Instances that do not use lexical
material commonly associated with speech, thought
and writing tend to be overlooked. Missing such
unusual instances is the main problem of the recognizer: though the direct and indirect recognizers also have better precision than recall, the difference is clearly more pronounced for reported. Another recurring error type is that the
borders of the STWR were not detected correctly,
missing modifiers or annotating part of the
surrounding narration. We also observed some rare
mixups with indirect STWR.</p>
          <p>As noted above, the F1 score ranges of the
FLAIR and BERT based recognizers are not
distinct for reported, though BERT does perform
somewhat better on average. Looking at the
differences, we found that the recognizers make the
same types of mistakes, but BERT is generally
more open to unusual instances of reported STWR,
leading to a better recall, which is the main reason
for its better overall performance.
5.4</p>
        </sec>
        <sec id="sec-4-4-3">
          <title>Free indirect STWR</title>
          <p>Free indirect STWR is structurally similar to
direct STWR in that it usually spans one or more
consecutive sentences. It is very hard to identify
using surface markers, as it is basically a shift to a
character's internal thoughts, but still uses the same
tense and pronouns as the surrounding narration.
The best indicators are emphatic punctuation such as ?, ! and –, words indicating a reference point in the
present (such as now, here) and characteristics of
informal speech such as dialect or modal particles.</p>
          <p>The free indirect recognizers show the largest
gap between precision and recall: nearly 0.4
points. Clearly, the problem here lies in
undetected cases. Notably, however, over 40% of the
false negatives are partial matches, meaning that
the recognizer at least correctly detected an
instance of free indirect, though it failed to capture
it completely.4</p>
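          <p>The effect of such a precision/recall gap on F1 can be illustrated with a small calculation. The numbers below are hypothetical, chosen only to mirror a gap of roughly 0.4 points; they are not the scores reported in this paper.</p>
          <preformat>
```python
# F1 as the harmonic mean of precision and recall: with a large gap,
# F1 stays close to the weaker of the two scores.
# The values here are hypothetical illustrations, not reported results.

def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

gapped = f1(0.8, 0.4)    # precision far above recall
balanced = f1(0.6, 0.6)  # same arithmetic mean, but no gap
```
          </preformat>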
          <p>4The recall problem might be exacerbated by the false negatives in the training data. We ran tests where we cut out the marked instances in the training data with some context (25 or 50 tokens) and used this as training input. A detailed evaluation is beyond the scope of this paper, but the resulting recognizers had better recall but worse precision, leading to similar F1 scores.</p>
          <p>False positives are mainly caused by occurrences of the main indicators of free indirect (as described above) in narration. In addition, unmarked direct STWR is prone to be labeled as free indirect. As for the false negatives, about half of the missed instances contained at least one surface marker, but many are only recognizable via wider context clues or an understanding of the content.</p>
          <p>Comparing BERT and FLAIR again, we find
that BERT gives a much better recall – the same
effect as with reported STWR, but more
pronounced. BERT is clearly better in picking up
subtle signals for free indirect STWR than FLAIR.
The flip-side of this is that the BERT model also
produces more false positives than the FLAIR
model. An interesting observation is that it
sometimes annotates sentences that are not part of
the free indirect STWR itself, but introduce it.
Though the borders of the STWR are not detected
correctly in these cases, this might indicate that the
model learned that these context clues are highly
relevant to identify free indirect STWR which is
indeed the case. An example for this scenario is
the following passage:</p>
          <p>Jetzt war er mit dem Lächeln an der Reihe. Ihre Reaktionen kamen so spontan und waren so ungekünstelt und ehrlich. Hoffentlich würde sie das nie verlieren. (Now it was his turn to smile. Her reactions came so spontaneously and were so genuine and honest. Hopefully she would never lose that.)</p>
          <p>BERT also marks the introductory sentence (underlined) that shifts the focus to the character to introduce the free indirect instance (in italics) that tells us his thoughts. The FLAIR model, on the other hand, has its strength in precision: the few false positives it produced are often borderline cases that are attached to free indirect passages and could be read as plausible extensions.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <sec id="sec-6-1">
        <title>Conclusion</title>
        <p>We presented recognizers for four types of STWR
which differ strongly in structure and difficulty.
Our models for direct, indirect and reported were
trained and tested on historical German fictional
and non-fictional texts, the model for free
indirect on modern German fiction. The success rates correspond closely to the reliability of humans:
For indirect and reported, we even achieved
similar scores to the human annotator agreement on a
comparable corpus. For the types direct and free
indirect, humans still clearly outperform our best
models. In both cases, we believe that the need
for wide contextual knowledge plays an important
role to explain the gap: For direct, the models fail
most often in distinguishing between a first
person narrator and a character quote. Free indirect
in general is a highly context dependent form that
requires an understanding of the narrative
structure.</p>
        <p>We tested a variety of different language
embeddings for our task and provided a comparison
of the most promising: FLAIR and BERT
embeddings. For both, we also trained/fine-tuned models
on historical texts. FLAIR gave the best scores for
direct, BERT for indirect and free indirect. For
reported, the results were not conclusive: Though
BERT performed better on average, we observed
an overlap in F1 score range of the BERT and
FLAIR models over multiple runs.</p>
        <p>Most striking was the improvement achieved
with BERT for free indirect STWR. In
particular, BERT improved recall for the most difficult
forms, free indirect and – to a lesser degree –
reported, showing a greater ability to detect unusual
instances.</p>
        <p>Direct STWR was the only form where FLAIR
clearly outperformed BERT. It seems that the higher sensitivity of BERT is more of a disadvantage here, as it tended to misclassify even more instances of first-person narration than FLAIR.</p>
        <p>To further improve performance, one idea is
modifying our input strategy: instead of
consecutive chunks of up to 100 tokens, overlapping
chunks could be used as input. This might
prevent the recognizers from losing context at the
beginning of a chunk, which would be especially
relevant for the direct and free indirect recognizer.</p>
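        <p>The overlapping-chunk strategy proposed above can be sketched as follows. The window size matches the 100-token chunks described in this paper, while the stride of 50 tokens is an assumption chosen for illustration, not a tuned parameter.</p>
        <preformat>
```python
# Sketch of overlapping input chunks: each window shares `stride`
# tokens with its predecessor, so tokens near a chunk border still
# receive left context in some window.

def overlapping_chunks(tokens, size=100, stride=50):
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
        start += stride
    return chunks

tokens = ["tok%d" % i for i in range(230)]
chunks = overlapping_chunks(tokens)  # windows: 0-99, 50-149, 100-199, 150-229
```
        </preformat>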
        <p>The top models and customized embeddings
described in this paper are freely available via
our homepage www.redewiedergabe.de and
via GitHub. In detail, our customized BERT
embeddings can be found at huggingface.co/
redewiedergabe/bert-base-historical-german-rw-cased, the custom-trained FLAIR embeddings are integrated into the FLAIR framework as de-historic-rw-forward and de-historic-rw-backward. The top recognizer models are
available at github.com/redewiedergabe/tagger
along with the code used for training and
execution.</p>
        <p>
          In addition to that, all the material used
for the direct, indirect and reported recognizers
and part of the material used for the free indirect recognizer (due to copyright restrictions on the modern texts, only the historical part of the free indirect material can be published) is available as corpus
REDEWIEDERGABE
          <xref ref-type="bibr" rid="ref7">(Brunner et al., 2020)</xref>
          at
github.com/redewiedergabe/corpus. The rich
annotation of corpus REDEWIEDERGABE also
offers opportunities to train more complex
recognizers, e.g. by providing labels for the medium of
the STWR (speech, thought or writing) as well as
annotation for the framing phrase for direct and
indirect STWR and the speaker for all four forms of
STWR.
        </p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Alan</given-names>
            <surname>Akbik</surname>
          </string-name>
          , Tanja Bergmann, Duncan Blythe, Kashif Rasul, Stefan Schweter, and
          <string-name>
            <given-names>Roland</given-names>
            <surname>Vollgraf</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>FLAIR: An Easy-to-Use Framework for State-of-the-Art NLP</article-title>
          .
          <source>In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations)</source>
          , pages
          <fpage>54</fpage>
          -
          <lpage>59</lpage>
          , Minneapolis, Minnesota. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Alan</given-names>
            <surname>Akbik</surname>
          </string-name>
          , Duncan Blythe, and
          <string-name>
            <given-names>Roland</given-names>
            <surname>Vollgraf</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Contextual String Embeddings for Sequence Labeling</article-title>
          .
          <source>In Proceedings of the 27th International Conference on Computational Linguistics</source>
          , pages
          <fpage>1638</fpage>
          -
          <lpage>1649</lpage>
          , Santa Fe, New Mexico, USA. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Ann</given-names>
            <surname>Banfield</surname>
          </string-name>
          .
          <year>1982</year>
          .
          <article-title>Unspeakable sentences. Narration and representation in the language of fiction</article-title>
          .
          <source>Routledge &amp; Kegan Paul</source>
          , Boston u.a.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Piotr</given-names>
            <surname>Bojanowski</surname>
          </string-name>
          , Edouard Grave, Armand Joulin, and
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Enriching Word Vectors with Subword Information</article-title>
          . CoRR, abs/1607.04606.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Julian</given-names>
            <surname>Brooke</surname>
          </string-name>
          , Adam Hammond, and
          <string-name>
            <given-names>Graeme</given-names>
            <surname>Hirst</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>GutenTag: An NLP-driven Tool for Digital Humanities Research in the Project Gutenberg Corpus</article-title>
          .
          <source>North American Chapter of the Association for Computational Linguistics - Human Language Technologies</source>
          , pages
          <fpage>42</fpage>
          -
          <lpage>47</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Annelen</given-names>
            <surname>Brunner</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Automatische Erkennung von Redewiedergabe. Ein Beitrag zur quantitativen Narratologie</article-title>
          . Number 47 in Narratologia. de Gruyter, Berlin u.a.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Annelen</given-names>
            <surname>Brunner</surname>
          </string-name>
          , Stefan Engelberg, Fotis Jannidis, Ngoc Duyen Tanja Tu, and
          <string-name>
            <given-names>Lukas</given-names>
            <surname>Weimer</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Corpus REDEWIEDERGABE</article-title>
          .
          <source>In Proceedings of The 12th Language Resources and Evaluation Conference</source>
          , pages
          <fpage>796</fpage>
          -
          <lpage>805</lpage>
          , Marseille, France. European Language Resources Association.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Annelen</given-names>
            <surname>Brunner</surname>
          </string-name>
          , Ngoc Duyen Tanja Tu, Lukas Weimer, and
          <string-name>
            <given-names>Fotis</given-names>
            <surname>Jannidis</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Deep learning for free indirect representation</article-title>
          .
          <source>In Proceedings of the 15th Conference on Natural Language Processing (KONVENS</source>
          <year>2019</year>
          )
          <article-title>: Short Papers</article-title>
          , pages
          <fpage>241</fpage>
          -
          <lpage>245</lpage>
          , Erlangen, Germany.
          <source>German Society for Computational Linguistics &amp; Language Technology.</source>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Jacob</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ming-Wei</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kenton</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Kristina</given-names>
            <surname>Toutanova</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</article-title>
          . CoRR, abs/1810.04805.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Cathrine</given-names>
            <surname>Fabricius-Hansen</surname>
          </string-name>
          , Kåre Solfjeld, and
          <string-name>
            <given-names>Anneliese</given-names>
            <surname>Pitz</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Der Konjunktiv: Formen und Spielräume</article-title>
          . Number 100 in Stauffenburg Linguistik. Stauffenburg, Tübingen.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Monika</given-names>
            <surname>Fludernik</surname>
          </string-name>
          .
          <year>1993</year>
          .
          <article-title>The fictions of language and the languages of fiction. The linguistic representation of speech and consciousness</article-title>
          . Routledge, London/New York.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Cláudia</given-names>
            <surname>Freitas</surname>
          </string-name>
          , Bianca Freitas
          , and
          <string-name>
            <given-names>Diana</given-names>
            <surname>Santos</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>QUEMDISSE? Reported Speech in Portuguese</article-title>
          .
          <source>In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC</source>
          <year>2016</year>
          )
          <article-title>- Book of abstracts</article-title>
          , pages
          <fpage>4410</fpage>
          -
          <lpage>4416</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>Gérard</given-names>
            <surname>Genette</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Die Erzählung, 3 edition</article-title>
          . Number 8083 in UTB. Fink, Paderborn.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>Stefan</given-names>
            <surname>Hauser</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Beobachtungen zur Redewiedergabe in der Tagespresse. Eine kontrastive Analyse</article-title>
          . In Heinz-Helmut Lüger and Hartmut Lenk, editors,
          <source>Kontrastive Medienlinguistik</source>
          , pages
          <fpage>271</fpage>
          -
          <lpage>286</lpage>
          . Verlag Empirische Pädagogik, Landau.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>Zhiheng</given-names>
            <surname>Huang</surname>
          </string-name>
          , Wei Xu, and
          <string-name>
            <given-names>Kai</given-names>
            <surname>Yu</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Bidirectional LSTM-CRF Models for Sequence Tagging</article-title>
          . CoRR, abs/1508.01991.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>Fotis</given-names>
            <surname>Jannidis</surname>
          </string-name>
          , Albin Zehe, Leonard Konle, Andreas Hotho, and
          <string-name>
            <given-names>Markus</given-names>
            <surname>Krug</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Analysing Direct Speech in German Novels</article-title>
          .
          <source>In Digital Humanities im deutschsprachigen Raum - Book of abstracts.</source>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <given-names>Aditya</given-names>
            <surname>Joshi</surname>
          </string-name>
          , Sarvnaz Karimi, Ross Sparks, Cécile Paris, and
          <string-name>
            <given-names>C. Raina</given-names>
            <surname>MacIntyre</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>A Comparison of Word-based and Context-based Representations for Classification Problems in Health Informatics</article-title>
          .
          <source>In Proceedings of the BioNLP 2019 workshop</source>
          , pages
          <fpage>135</fpage>
          -
          <lpage>141</lpage>
          , Florence, Italy. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <given-names>Konstantina</given-names>
            <surname>Lazaridou</surname>
          </string-name>
          , Ralf Krestel, and
          <string-name>
            <given-names>Felix</given-names>
            <surname>Naumann</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Identifying Media Bias by Analyzing Reported Speech</article-title>
          .
          <source>In IEEE International Conference on Data Mining - Book of abstracts</source>
          , pages
          <fpage>943</fpage>
          -
          <lpage>948</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <given-names>Jinhyuk</given-names>
            <surname>Lee</surname>
          </string-name>
          , Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and
          <string-name>
            <given-names>Jaewoo</given-names>
            <surname>Kang</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>BioBERT: a pre-trained biomedical language representation model for biomedical text mining</article-title>
          . CoRR, abs/1901.08746.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <given-names>Geoffrey</given-names>
            <surname>Leech</surname>
          </string-name>
          and
          <string-name>
            <given-names>Mick</given-names>
            <surname>Short</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Style in fiction. A linguistic introduction to English fictional prose, 2 edition</article-title>
          . Routledge, London u.a.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <given-names>Brian</given-names>
            <surname>McHale</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Speech Representation</article-title>
          . In Peter Hühn, John Pier, Wolf Schmid, and Jörg Schönert, editors,
          <source>The living handbook of narratology</source>
          . Hamburg University Press, Hamburg.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , Kai Chen, Greg Corrado, and
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Dean</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Efficient Estimation of Word Representations in Vector Space</article-title>
          . CoRR, abs/1301.3781.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <given-names>Sean</given-names>
            <surname>Papay</surname>
          </string-name>
          and Sebastian Padó.
          <year>2019</year>
          .
          <article-title>Quotation detection and classification with a corpus-agnostic model</article-title>
          .
          <source>In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP</source>
          <year>2019</year>
          ), pages
          <fpage>888</fpage>
          -
          <lpage>894</lpage>
          , Varna, Bulgaria. INCOMA Ltd.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <given-names>Roy</given-names>
            <surname>Pascal</surname>
          </string-name>
          .
          <year>1977</year>
          .
          <article-title>The dual voice. Free indirect speech and its functioning in the nineteenth-century European novel</article-title>
          . Manchester University Press, Manchester.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Pennington</surname>
          </string-name>
          , Richard Socher, and
          <string-name>
            <given-names>Christopher D.</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Glove: Global Vectors for Word Representation</article-title>
          .
          <source>In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          , pages
          <fpage>1532</fpage>
          -
          <lpage>1543</lpage>
          , Doha, Qatar. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name>
            <given-names>Christof</given-names>
            <surname>Schöch</surname>
          </string-name>
          , Daniel Schlör, Stefanie Popp, Annelen Brunner, and José Calvo Tello.
          <year>2016</year>
          .
          <article-title>Straight talk! Automatic Recognition of Direct Speech in Nineteenth-century French Novels</article-title>
          . In Conference Abstracts, pages
          <fpage>346</fpage>
          -
          <lpage>353</lpage>
          , Jagiellonian University &amp; Pedagogical University, Kraków.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <string-name>
            <given-names>Christian</given-names>
            <surname>Scheible</surname>
          </string-name>
          , Roman Klinger, and Sebastian Padó.
          <year>2016</year>
          .
          <article-title>Model Architectures for Quotation Detection</article-title>
          .
          <source>In Proceedings of the 54th Annual</source>
          <article-title>Meeting of the Association for Computational Linguistics (ACL) - Book of abstracts</article-title>
          , pages
          <fpage>1736</fpage>
          -
          <lpage>1745</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <string-name>
            <given-names>Luise</given-names>
            <surname>Schricker</surname>
          </string-name>
          , Manfred Stede, and
          <string-name>
            <given-names>Peer</given-names>
            <surname>Trilcke</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Extraction and Classification of Speech, Thought, and Writing in German Narrative Texts</article-title>
          .
          <source>In Preliminary proceedings of the 15th Conference on Natural Language Processing (KONVENS</source>
          <year>2019</year>
          )
          <article-title>: Long Papers</article-title>
          , pages
          <fpage>183</fpage>
          -
          <lpage>192</lpage>
          , Erlangen, Germany.
          <source>German Society for Computational Linguistics &amp; Language Technology.</source>
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <string-name>
            <given-names>Shreyas</given-names>
            <surname>Sharma</surname>
          </string-name>
          and
          <string-name>
            <given-names>Ron</given-names>
            <surname>Daniel</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>BioFLAIR: Pretrained Pooled Contextualized Embeddings for Biomedical Sequence Labeling Tasks</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <string-name>
            <given-names>Milan</given-names>
            <surname>Straka</surname>
          </string-name>
          , Jana Straková, and
          <string-name>
            <given-names>Jan</given-names>
            <surname>Hajič</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Evaluating Contextualized Embeddings on 54 Languages in POS Tagging, Lemmatization and Dependency Parsing</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <string-name>
            <given-names>Jörg</given-names>
            <surname>Tiedemann</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Parallel Data, Tools and Interfaces in OPUS</article-title>
          .
          <source>In Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC'12)</source>
          , pages
          <fpage>2214</fpage>
          -
          <lpage>2218</lpage>
          , Istanbul, Turkey.
          <source>European Language Resources Association (ELRA).</source>
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <string-name>
            <given-names>Ngoc Duyen Tanja</given-names>
            <surname>Tu</surname>
          </string-name>
          , Markus Krug, and
          <string-name>
            <given-names>Annelen</given-names>
            <surname>Brunner</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Automatic recognition of direct speech without quotation marks. A rule-based approach</article-title>
          .
          <source>In Digital Humanities: multimedial &amp; multimodal. Konferenzabstracts</source>
          , pages
          <fpage>87</fpage>
          -
          <lpage>89</lpage>
          , Frankfurt am Main/Mainz.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          <string-name>
            <given-names>Harald</given-names>
            <surname>Weinrich</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Textgrammatik der deutschen Sprache, 4., rev. edition</article-title>
          . Wiss. Buchges, Darmstadt.
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          <string-name>
            <given-names>Gregor</given-names>
            <surname>Wiedemann</surname>
          </string-name>
          , Steffen Remus, Avi Chawla, and
          <string-name>
            <given-names>Chris</given-names>
            <surname>Biemann</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Does BERT Make Any Sense? Interpretable Word Sense Disambiguation with Contextualized Embeddings</article-title>
          .
          <source>In Preliminary proceedings of the 15th Conference on Natural Language Processing (KONVENS</source>
          <year>2019</year>
          )
          <article-title>: Long Papers</article-title>
          , pages
          <fpage>161</fpage>
          -
          <lpage>170</lpage>
          , Erlangen, Germany.
          <source>German Society for Computational Linguistics &amp; Language Technology.</source>
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          <string-name>
            <given-names>Gisela</given-names>
            <surname>Zifonun</surname>
          </string-name>
          , Ludger Hoffmann, and
          <string-name>
            <given-names>Bruno</given-names>
            <surname>Strecker</surname>
          </string-name>
          .
          <year>1997</year>
          .
          <article-title>Grammatik der deutschen Sprache, volume 3 of Schriften des Instituts für deutsche Sprache</article-title>
          . de Gruyter, Berlin/New York/Amsterdam.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>