<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An Empirical Analysis of Linguistic, Typographic, and Structural Features in Simplified German Texts</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alessia Battisti Sarah Ebling Martin Volk</string-name>
          <email>alessia.battisti@uzh.ch</email>
          <email>volkg@cl.uzh.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Computational Linguistics, University of Zurich</institution>
          ,
          <addr-line>Andreasstrasse 15, 8050 Zurich</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <abstract>
        <p>English. We investigate a newly compiled corpus of simplified German texts for evidence of multiple complexity levels using unsupervised machine learning techniques. We apply linguistic features used in previous supervised machine learning research and additionally exploit structural and typographic characteristics of simplified texts. The results show a difference in complexity among the texts investigated, with optimal partitioning solutions ranging between two and four clusters. They demonstrate that both linguistic and structural/typographic features are constitutive of the clusters.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>Italiano. Esaminiamo un nuovo corpus
di testi in tedesco semplificato per
cercare delle evidenze relative a molteplici
livelli di complessità utilizzando tecniche
di apprendimento automatico non
supervisionato. Applichiamo variabili
linguistiche utilizzate in precedenti ricerche
con apprendimento automatico
supervisionato e sfruttiamo inoltre le
caratteristiche strutturali e tipografiche dei testi
semplificati. I risultati mostrano una
differenza di complessità tra i testi
analizzati, con suddivisioni ottimali variabili
da due a quattro cluster. Ciò dimostra
che sia le caratteristiche linguistiche sia
quelle strutturali/tipografiche sono
costitutive dei cluster.</p>
    </sec>
    <sec id="sec-2">
      <title>1 Introduction</title>
      <p>Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>
        Simplified language aims at providing
comprehensible information to persons with reduced reading
abilities. This group includes persons with
cognitive impairment and learning disabilities,
prelingually deaf persons, functionally illiterate persons,
and foreign language learners
        <xref ref-type="bibr" rid="ref7">(Bredel and Maaß,
2016)</xref>
        . Simplified language is characterised by
reduced lexical and syntactic complexity and
includes images, structured layout, and
explanations of difficult words. For simplified German,
several guidelines exist that define which
structures need to be avoided, which need to be
paraphrased, and which are comprehensible
        <xref ref-type="bibr" rid="ref17 ref22 ref8">(Bundesministerium für Arbeit und Soziales, 2011;
Inclusion Europe, 2009; Maaß, 2015; Netzwerk
Leichte Sprache, 2013)</xref>
        .
      </p>
      <p>Various countries have acknowledged
simplified language as a means of inclusion that
enables the target populations mentioned above to
inform themselves of their legal rights and
participate in society. German-speaking countries have
been promoting simplified language only in
recent years, in particular since the ratification of the
United Nations Convention on the Rights of
Persons with Disabilities (United Nations, 2006) in
Austria (2008), Germany (2009), and Switzerland
(2014). As a result, large amounts of texts in
simplified German have become available.</p>
      <p>
        More recently, simplified German has been
conceptualised as a construct with multiple
complexity levels
        <xref ref-type="bibr" rid="ref18 ref5 ref7">(Bock, 2014; Bredel and Maaß, 2016;
Kellermann, 2014)</xref>
        . However, these proposals
are merely theoretical: They are not yet
operationalised, i.e., no sets of guidelines exist that
distinguish the proposed levels with reference to
linguistic or other features. The social franchise
network capito,<sup>1</sup> a provider of simplification services
as well as training courses for simplified language
translators, recognises three levels of simplified
German corresponding to the Common European
Framework of Reference for Languages (CEFR)
        <xref ref-type="bibr" rid="ref10">(Council of Europe, 2001)</xref>
        levels A1, A2, and B1.
Being commercially orientated, capito does not
make its CEFR adaptation publicly available.
      </p>
      <p>1 https://www.capito.eu/ (last accessed: June 27, 2019)</p>
      <p>In this paper, we present an unsupervised
machine learning (clustering) approach to analysing
texts in simplified German with the aim of
investigating evidence of multiple complexity levels. To
the best of our knowledge, this is the first study of
its kind. We apply linguistic features used in
previous supervised machine learning research
(classification) and additionally exploit structural and
typographic characteristics of simplified texts that
have been described in the literature but not
incorporated into clustering and/or classification
approaches in the context of simplified language.</p>
      <p>The remainder of this paper is structured as
follows: Section 2 presents the research background.
Section 3 describes our approach, introducing a
novel dataset (Section 3.1), the feature design and
engineering (Section 3.2), the clustering
experiments (Section 3.3), and a discussion thereof
(Section 3.4). Section 4 offers a conclusion and an
outlook on future research questions.</p>
    </sec>
    <sec id="sec-3">
      <title>2 Research Background</title>
      <p>
        Two natural language processing tasks deal with
the concept of simplified language: automatic
readability assessment and automatic text
simplification. Readability assessment refers to the
process of determining the level of difficulty of
a text. Traditionally, this has involved taking
into account readability measures based on
surface features such as the number of syllables
in a word or number of words in a sentence,
e.g., via the Flesch Reading Ease Score
        <xref ref-type="bibr" rid="ref13">(Flesch,
1948)</xref>
        . Recently, more sophisticated models
employing deeper linguistic features such as
lexical, semantic, morphological, morphosyntactic,
syntactic, pragmatic, discourse,
psycholinguistic, and language model features have been
proposed
        <xref ref-type="bibr" rid="ref12 ref16 ref28 ref9">(Collins-Thompson, 2014; Dell’Orletta et
al., 2014; Heimann Mühlenbock, 2013; Schwarm
and Ostendorf, 2005)</xref>
        .
      </p>
      <p>
        Readability assessment implies the existence of
multiple complexity levels. Complexity levels are
identified, e.g., along school grades or levels of the
CEFR
        <xref ref-type="bibr" rid="ref1 ref15 ref23 ref24 ref29">(Hancke, 2013; Pilán and Volodina, 2018;
Reynolds, 2016; Vajjala and Lõo, 2014)</xref>
        .
      </p>
      <p>The work presented in this paper represents a
preliminary stage of the readability assessment
task for simplified German in that it investigates
empirically whether different complexity levels
exist in previous German simplification practice in
the first place.</p>
    </sec>
    <sec id="sec-4">
      <title>3 Clustering Simplified German Texts</title>
      <p>Battisti and Ebling (2019) compiled a corpus of
German/simplified German texts for use in
automatic readability assessment and automatic text
simplification. The corpus represents an
enhancement of a parallel (German/simplified German)
corpus created by Klaper et al. (2013). Compared
to its predecessor, the corpus of Battisti and Ebling
(2019) contains additional parallel data and newly
contains monolingual-only data as well as
structural and typographic information.</p>
      <p>The authors collected PDFs and web pages from
92 different domains of public offices, translation
agencies, and organisations publishing content in
German and simplified German. Overall, the
corpus consists of 6,217 documents (378 parallel and
5,461 monolingual). Metadata was recorded in
the Open Language Archives Community (OLAC)
Standard<sup>2</sup> and converted into the metadata
standard CMDI of CLARIN, a European research
infrastructure for language resources and
technology.<sup>3</sup> If available, information on the language
level of a simplified German text (typically A1,
A2, or B1) was stored in the metadata. 52
websites and 233 PDFs (amounting to approximately
26,000 sentences) have an explicit language level
label.</p>
      <p>
        Linguistic annotation was added automatically
using ParZu
        <xref ref-type="bibr" rid="ref30">(Sennrich et al., 2009)</xref>
        (for tokens
and dependency parses), NLTK
        <xref ref-type="bibr" rid="ref3">(Bird et al., 2009)</xref>
        (for sentence segmentation), TreeTagger
        <xref ref-type="bibr" rid="ref26">(Schmid,
1995)</xref>
        (for part-of-speech tags and lemmas), and
Zmorge
        <xref ref-type="bibr" rid="ref29">(Sennrich and Kunz, 2014)</xref>
        (for
morphological units). In addition, information on
text structure (e.g., paragraphs, lines), typography
(e.g., boldface, italics), and images (content,
position, and dimensions) was added. The
annotations were stored in the Text Corpus Format by
WebLicht (TCF) developed as part of CLARIN.<sup>4</sup>
For the experiments reported in this paper, we
considered the monolingual documents of the
corpus, i.e., the monolingual-only documents as well
as the simplified German side of the parallel data.
This amounted to 5,839 texts (193,845 sentences).
      </p>
      <p>2 http://www.language-archives.org/OLAC/olacms.html (last accessed: June 27, 2019)</p>
      <p>3 https://www.clarin.eu/ (last accessed: June 27, 2019)</p>
      <p>4 https://weblicht.sfs.uni-tuebingen.de/weblichtwiki/index.php/TheTCFFormat (last accessed: June 27, 2019)</p>
      <sec id="sec-4-1">
        <title>3.2 Features</title>
        <p>
          In addition to constituting the first approach to
investigating simplified German texts using
unsupervised machine learning, the unique
contribution of this paper consists of leveraging
information that has been shown to be characteristic
of simplified language
          <xref ref-type="bibr" rid="ref1 ref6 ref7">(Arfé et al., 2018; Bock,
2018; Bredel and Maaß, 2016)</xref>
          but has not been
incorporated into machine learning approaches
involving simplified language. Specifically, we
considered features derived from text structure (e.g.,
paragraphs, lines), typography (e.g., font type,
font style), and image (content, position, and
dimensions) information.
        </p>
        <p>
          In a simplified text, typographical information,
such as boldface and italics, serves as a discourse
marker signalling words and phrases that require
particular attention and convey different purposes
          <xref ref-type="bibr" rid="ref1">(Arfé et al., 2018)</xref>
          . Leveraging the concepts of
multi-modality and multi-codality in the
psychology of perception
          <xref ref-type="bibr" rid="ref27">(Schnotz, 2014)</xref>
          , images<sup>5</sup> are
supposed to support the text by activating
previous knowledge and exemplifying the objects in the
text
          <xref ref-type="bibr" rid="ref7">(Bredel and Maaß, 2016)</xref>
          .
        </p>
      </sec>
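      <sec id="sec-4-2">
        <table-wrap id="tab-1">
          <label>Table 1</label>
          <table>
            <thead>
              <tr><th>Subset</th><th>Features</th></tr>
            </thead>
            <tbody>
              <tr><td>1</td><td>All</td></tr>
              <tr><td>2</td><td>Surface</td></tr>
              <tr><td>3</td><td>Deeper</td></tr>
              <tr><td>4</td><td>Lexical + semantic</td></tr>
              <tr><td>5</td><td>Morphological + syntactic</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-4-3">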
        <p>Altogether, the feature set comprised 115
features arranged into five feature groups, as
shown in Table 1. Subset 3 (“Deeper”) consisted
of lexical, semantic, morphological, and syntactic
features. “Surface” is short for surface, structural,
and typographic features.</p>
        <p>5 For the sake of simplicity, the term “images” here subsumes pictures, pictograms, photographs, graphics, and maps.</p>
        <p>
          Surface, structural, and typographic
features: We took advantage of the structural and
typographic information included in the corpus
(cf. Section 3.1) and introduced as features the
number of images, paragraphs, lines, words of
a specific font type and style, and adherence to
a one-sentence-per-line rule. We additionally
included the number of digits and numbers in
words
          <xref ref-type="bibr" rid="ref25">(Saggion, 2017)</xref>
          , number of abbreviations
and initial letters, and the number of individual
punctuation marks and special characters. Among
the special characters was the Mediopunkt
(‘centred dot’), a typographic device proposed by
          <xref ref-type="bibr" rid="ref21">Maaß (2015)</xref>
          for visually segmenting compound
words. We also computed the Läsbarhetsindex
(‘readability index’, LIX)
          <xref ref-type="bibr" rid="ref4">(Björnsson, 1968)</xref>
          .<sup>6</sup>
        </p>
        <p>
          Lexical and semantic features: This group
included features for lexical richness, lexical
variation (e.g., nominal ratio, noun/pronoun ratio,
bilogarithmic type-token ratio (TTR) (Vajjala and Meurers, 2012)),
word frequency based on the German reference
corpus DeReKo
          <xref ref-type="bibr" rid="ref20">(Lüngen, 2017)</xref>
          , and lists of
words classified at different perceptive levels
          <xref ref-type="bibr" rid="ref14">(Glaboniat et al., 2005)</xref>
          . We also included
question words and named entities, which may strain
the reading comprehension process if the target
reader does not have the appropriate knowledge.
        </p>
        <p>
          Morphological, morphosyntactic, and
syntactic features: In this group, we included
particles, prepositions, demonstrative and
personal pronouns, and (separately) first-, second-,
and third-person pronouns. We additionally
counted adverbs, modal verbs, subjunctions,
and conjunctions. We added genitive attributes
in relation to von+dative constructions.<sup>7</sup> We
additionally included the number of negative
forms, the presence of pre- and post-modifiers,
and impersonal constructions. We took advantage
of the verbal morphology and included verbal
mood- and tense-based features
          <xref ref-type="bibr" rid="ref11">(Dell’Orletta et
al., 2011)</xref>
          . We also considered direct vs. indirect
speech constructions, the types of subordinate
clauses as well as features based on word and
sentence order.
        </p>
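        <p>As an illustration of the surface, structural, and typographic group, the following minimal sketch counts a handful of such features on plain text. It is not the extraction code used for the corpus: the function name, the dictionary keys, the use of U+00B7 for the Mediopunkt, and the operationalisation of the one-sentence-per-line rule (a line adheres if it contains exactly one sentence-final mark, located at its end) are assumptions made for this example. The counts are raw; normalisation by document length would be a separate step.</p>
        <preformat><![CDATA[
```python
import re

# Assumption: the Mediopunkt is encoded as the Unicode middle dot (U+00B7).
MEDIOPUNKT = "\u00b7"


def _is_one_sentence(line: str) -> bool:
    """Hypothetical rule: exactly one sentence-final mark, at the end of the line."""
    stripped = line.strip()
    marks = re.findall(r"[.!?]", stripped)
    return len(marks) == 1 and stripped[-1] in ".!?"


def surface_features(text: str) -> dict:
    """Count a few surface and structural features of a plain-text document."""
    # Non-empty lines and blank-line-separated paragraphs.
    lines = [ln for ln in text.splitlines() if ln.strip()]
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    adhering = sum(_is_one_sentence(ln) for ln in lines)
    return {
        "n_lines": len(lines),
        "n_paragraphs": len(paragraphs),
        "n_tokens": len(text.split()),
        "n_digits": sum(ch.isdigit() for ch in text),
        "n_commas": text.count(","),
        "n_full_stops": text.count("."),
        "n_mediopunkt": text.count(MEDIOPUNKT),
        # Share of lines that satisfy the one-sentence-per-line rule.
        "one_sentence_per_line": adhering / max(len(lines), 1),
    }
```
]]></preformat>
        <p>Counting commas and full stops separately, rather than as one punctuation total, keeps the two marks available as distinct features.</p>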
        <p>6 LIX = Nw / Ns + (W × 100) / Nw, where Nw is the
number of words, Ns is the number of sentences, and W is the
number of tokens longer than six characters.</p>
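        <p>The LIX formula in footnote 6 can be sketched in a few lines. The tokenisation below (regular-expression word matching and splitting sentences on ".", "!", and "?") is an assumption for illustration only; the corpus annotation relied on dedicated tools for tokenisation and sentence segmentation.</p>
        <preformat><![CDATA[
```python
import re


def lix(text: str) -> float:
    """Läsbarhetsindex: words per sentence plus percentage of long words."""
    # Naive segmentation (assumption): split on runs of ., ! or ?
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text)
    if not sentences or not words:
        return 0.0
    # Long words are those with more than six characters.
    long_words = sum(len(w) > 6 for w in words)
    return len(words) / len(sentences) + 100 * long_words / len(words)
```
]]></preformat>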
        <p>7 In German, the genitive attribute can be substituted by a
von+dative construction. Importantly, this is a case of
simplified German conflicting with the grammar of Standard
German, which encourages the use of the former construction.</p>
        <p>3.3.1 Method</p>
        <p>We applied agglomerative hierarchical clustering.
We used the scipy<sup>8</sup> toolkit alongside
models recursively created with the scikit-learn<sup>9</sup>
library. The data matrix was created using the
cosine similarity metric and the average linkage
function. Because of the significant variation in
length of the documents, we normalised the
features by dividing the values by the length of each
document expressed in tokens. We then performed
principal component analysis (PCA) to diminish
the sparseness of the data matrix and avoid the
curse-of-dimensionality trap. In a second
experiment, we applied feature agglomeration instead
of PCA prior to clustering. Feature agglomeration
allows for a straightforward interpretation of the
results.</p>
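        <p>The steps described above (length normalisation, dimensionality reduction, average-linkage clustering on cosine distance) might be combined roughly as follows. This is a sketch under stated assumptions, not the code used in the experiments: the function name, its parameters, the number of retained components, and the choice of cutting the tree with a fixed number of clusters are all hypothetical.</p>
        <preformat><![CDATA[
```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import FeatureAgglomeration
from sklearn.decomposition import PCA


def cluster_documents(counts, n_tokens, n_clusters=3, n_components=2, use_pca=True):
    """Cluster documents from raw feature counts (hypothetical helper)."""
    counts = np.asarray(counts, dtype=float)
    n_tokens = np.asarray(n_tokens, dtype=float)
    # Normalise each feature by the document length in tokens.
    X = counts / n_tokens[:, None]
    if use_pca:
        # First experiment: PCA before clustering.
        reduced = PCA(n_components=n_components).fit_transform(X)
    else:
        # Second experiment: merge similar features instead of projecting them.
        reduced = FeatureAgglomeration(n_clusters=n_components).fit_transform(X)
    # Agglomerative clustering with average linkage on cosine distance.
    Z = linkage(reduced, method="average", metric="cosine")
    # Cut the tree into at most n_clusters flat clusters (labels start at 1).
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```
]]></preformat>
        <p>With feature agglomeration, each reduced column is a pooled group of original features, which is what makes the resulting clusters easier to trace back to individual features than PCA components.</p>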
        <p>Given the lack of a ground truth for our data,
we evaluated the experiments using the following
metrics: silhouette score, Calinski-Harabasz
index, and the elbow method. These metrics were also
used to choose the optimal number of clusters.</p>
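        <p>In the absence of a ground truth, internal validity scores can be compared across candidate numbers of clusters. The sketch below computes the silhouette score and the Calinski-Harabasz index for several cuts of one average-linkage tree; the function name and the candidate range are assumptions, and the elbow method (a visual inspection of a score curve) is not covered.</p>
        <preformat><![CDATA[
```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.metrics import calinski_harabasz_score, silhouette_score


def score_cluster_counts(X, candidates=(2, 3, 4, 5)):
    """Score flat clusterings for several cluster counts (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    # Build the tree once, then cut it at each candidate number of clusters.
    Z = linkage(X, method="average", metric="cosine")
    scores = {}
    for k in candidates:
        labels = fcluster(Z, t=k, criterion="maxclust")
        if len(np.unique(labels)) < 2:
            continue  # both scores need at least two clusters
        scores[k] = {
            "silhouette": silhouette_score(X, labels, metric="cosine"),
            "calinski_harabasz": calinski_harabasz_score(X, labels),
        }
    return scores
```
]]></preformat>
        <p>Higher values are better for both scores, so a candidate on which they agree is a reasonable choice for the number of clusters.</p>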
      </sec>
      <sec id="sec-4-4">
        <title>3.3.2 Results</title>
        <p>Table 2 shows the results of the first three
iterations of our clustering approach after the feature
agglomeration step. We observed that a value
between 2 and 4 (inclusive) represented a good
clustering solution for the whole corpus according to
the metrics. A dendrogram corroborated these
results (cf. Figure 1).</p>
        <p>Upon inspection of the clusters, we found the
main differences to be due to the following
features: number of nouns, number of verbs,
number of paragraphs, adherence to the
one-sentence-per-line rule, number of interrogative clauses, number
of different fonts, and number of words in bold.
Considering the mean ratio of the features in a
two-cluster solution, Cluster 1 displayed a higher
frequency of nouns (0.31 vs. 0.24) and adjectives
(0.9 vs. 0.6) and a lower frequency of verbs (0.13
vs. 0.17) than Cluster 2, which in turn included a
slightly higher rate of images (0.008 vs. 0.004).</p>
      </sec>
      <sec id="sec-4-5">
        <title>3.4 Discussion</title>
        <p>The inverse proportion of the mean ratios
concerning nouns and verbs (cf. Section 3.3.2) suggested
that Cluster 1 included texts focusing on objects
or concepts, since verbs (events, actions, etc.) had
been turned into nouns (concepts, things, etc.)
following the linguistic process of nominalisation,
while the linguistic structure of texts in Cluster 2
was simpler.</p>
        <p>8 https://www.scipy.org/ (last accessed: June 27, 2019)</p>
        <p>9 https://scikit-learn.org/stable/ (last accessed: June 27, 2019)</p>
        <p>Figure 2 visualises the box plots of six of the
surface features of Subset 2 (number of full stops,
number of commas, adherence to the
one-sentence-per-line rule, number of paragraphs, number of
different fonts, number of images) based on the
three-cluster solution suggested by the
agglomerative hierarchical approach. The first cluster
consisted of texts that followed the
one-sentence-per-line rule and featured a low frequency of commas and
a high number of paragraphs. These
characteristics are crucial properties of simplified texts. Our
findings further emphasise the importance of
distinguishing among different types of punctuation
marks in the context of simplified language: while
for commas, a low frequency is indicative of
textual simplicity, the reverse is true for full stops.
Texts included in Cluster 1 did not contain
images. This outcome relates to the results of a more
recent study by Bock (2018), according to which
images should be used with caution even in
simplified German texts to avoid the potential of
distraction and cognitive overload.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4 Conclusion and Outlook</title>
      <p>In this paper, we have presented the first
approach to investigating simplified German texts
by means of unsupervised machine learning
techniques as a basis for future readability assessment
studies on this language variety. In addition, we
have introduced novel features that have been
described in the literature but not incorporated into
machine learning (clustering and/or classification)
approaches in the context of simplified language,
notably: number of images, number of
paragraphs, number of lines, number of words of a
specific font type, and adherence to a
one-sentence-per-line rule. Our findings provide evidence that
existing texts are not simplified at a unique
complexity level of German. We have demonstrated
that features based on structural information are
capable of accounting for the different complexity
levels found.</p>
      <p>As a next step, we will use the results of the
experiments presented in this paper to establish
a framework of inductively generated complexity
levels. This framework will serve as the basis for
readability assessment in the context of simplified
German. Knowledge derived from our study can
also inform automatic and manual approaches to
simplification of German.</p>
      <p>Figure 2: Six features of Subset 2.</p>
      <p>Sowmya Vajjala and Kaidi Lõo. 2014. Automatic
CEFR level prediction for Estonian learner text.
In Proceedings of the third workshop on NLP for
computer-assisted language learning, volume 107,
pages 113–127, Uppsala, Sweden.</p>
      <p>Sowmya Vajjala and Detmar Meurers. 2012. On
Improving the Accuracy of Readability Classification
using Insights from Second Language Acquisition.
In Proceedings of the 7th workshop on building
educational applications using NLP, pages 163–173,
Montréal, Canada.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Barbara</given-names>
            <surname>Arfé</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Lucia</given-names>
            <surname>Mason</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Inmaculada</given-names>
            <surname>Fajardo</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Simplifying informational text structure for struggling readers</article-title>
          . Reading and Writing,
          <volume>31</volume>
          (
          <issue>9</issue>
          ):
          <fpage>2191</fpage>
          -
          <lpage>2210</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Alessia</given-names>
            <surname>Battisti</surname>
          </string-name>
          and
          <string-name>
            <given-names>Sarah</given-names>
            <surname>Ebling</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>A corpus for automatic readability assessment and text simplification of German</article-title>
          . arXiv:1909.09067.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Steven</given-names>
            <surname>Bird</surname>
          </string-name>
          , Edward Loper, and
          <string-name>
            <given-names>Ewan</given-names>
            <surname>Klein</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Natural Language Processing with Python</article-title>
          . O'Reilly Media Inc.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Carl-Hugo</given-names>
            <surname>Björnsson</surname>
          </string-name>
          .
          <year>1968</year>
          .
          <article-title>Läsbarhet</article-title>
          . Liber, Stockholm.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Bettina M.</given-names>
            <surname>Bock</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>“Leichte Sprache”: Abgrenzung, Beschreibung und Problemstellungen aus Sicht der Linguistik</article-title>
          .
          <source>Sprache barrierefrei gestalten</source>
          , pages
          <fpage>17</fpage>
          -
          <lpage>51</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Bettina M.</given-names>
            <surname>Bock</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>“Leichte Sprache” - Kein Regelwerk. Sprachwissenschaftliche Ergebnisse und Praxisempfehlungen aus dem LeiSA-Projekt</article-title>
          .
          <source>Technical report</source>
          , Universität Leipzig.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Ursula</given-names>
            <surname>Bredel</surname>
          </string-name>
          and
          <string-name>
            <given-names>Christiane</given-names>
            <surname>Maaß</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Leichte Sprache: Theoretische Grundlagen. Orientierung für die Praxis</article-title>
          . Duden, Berlin.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <source>Bundesministerium für Arbeit und Soziales</source>
          .
          <year>2011</year>
          .
          <article-title>Verordnung zur Schaffung barrierefreier Informationstechnik nach dem Behindertengleichstellungsgesetz (Barrierefreie-Informationstechnik-Verordnung - BITV 2.0)</article-title>
          .
          <source>Technical Report Teil 1.</source>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Kevyn</given-names>
            <surname>Collins-Thompson</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Computational assessment of text readability. A survey of current and future research</article-title>
          .
          <source>ITL International Journal of Applied Linguistics</source>
          ,
          <volume>165</volume>
          (
          <issue>2</issue>
          ):
          <fpage>97</fpage>
          -
          <lpage>135</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <source>Council of Europe</source>
          .
          <year>2001</year>
          .
          <article-title>Common European Framework of Reference for Languages: Learning, teaching, assessment</article-title>
          . Cambridge University Press, Cambridge.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Felice</given-names>
            <surname>Dell'Orletta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Simonetta</given-names>
            <surname>Montemagni</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Giulia</given-names>
            <surname>Venturi</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>READ-IT: Assessing readability of Italian texts with a view to text simplification</article-title>
          .
          <source>In Proceedings of the Second Workshop on Speech and Language Processing for Assistive Technologies</source>
          , pages
          <fpage>73</fpage>
          -
          <lpage>83</lpage>
          , Edinburgh, Scotland, UK. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Felice</given-names>
            <surname>Dell'Orletta</surname>
          </string-name>
          , Martijn Wieling, Giulia Venturi, Andrea Cimino, and
          <string-name>
            <given-names>Simonetta</given-names>
            <surname>Montemagni</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Assessing the readability of sentences: Which corpora and features?</article-title>
          <source>In Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications</source>
          , pages
          <fpage>163</fpage>
          -
          <lpage>173</lpage>
          , Baltimore, Maryland, June. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Rudolph</given-names>
            <surname>Flesch</surname>
          </string-name>
          .
          <year>1948</year>
          .
          <article-title>A new readability yardstick</article-title>
          .
          <source>Journal of Applied Psychology</source>
          ,
          <volume>32</volume>
          :
          <fpage>221</fpage>
          -
          <lpage>233</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>Manuela</given-names>
            <surname>Glaboniat</surname>
          </string-name>
          , Martin Müller, Paul Rusch, Helen Schmitz, and
          <string-name>
            <given-names>Lukas</given-names>
            <surname>Wertenschlag</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Profile Deutsch</article-title>
          . Klett Langenscheidt
          , Berlin/Munich, Germany.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>Julia</given-names>
            <surname>Hancke</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Automatic Prediction of CEFR Proficiency Levels Based on Linguistic Features of Learner Language</article-title>
          .
          <source>Master's thesis</source>
          , University of Tübingen, Germany.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>Katarina</given-names>
            <surname>Heimann Mühlenbock</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>I see what you mean: Assessing readability for specific target groups</article-title>
          .
          <source>Ph.D. thesis</source>
          , University of Gothenburg.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>Inclusion</given-names>
            <surname>Europe</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Information für alle: Europäische Regeln, wie man Informationen leicht lesbar und leicht verständlich macht</article-title>
          .
          <source>Technical report</source>
          , Inclusion Europe.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <given-names>Gudrun</given-names>
            <surname>Kellermann</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Leichte und Einfache Sprache Versuch einer Definition</article-title>
          .
          <source>In Aus Politik und Zeitgeschichte</source>
          , volume
          <volume>64</volume>
          , pages
          <fpage>9</fpage>
          -
          <lpage>11</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <given-names>David</given-names>
            <surname>Klaper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Sarah</given-names>
            <surname>Ebling</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Martin</given-names>
            <surname>Volk</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Building a German/Simple German parallel corpus for automatic text simplification</article-title>
          .
          <source>In ACL Workshop on Predicting and Improving Text Readability for Target Reader Populations</source>
          , pages
          <fpage>11</fpage>
          -
          <lpage>19</lpage>
          , Sofia, Bulgaria.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <given-names>Harald</given-names>
            <surname>Lüngen</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>DEREKO - Das Deutsche Referenzkorpus</article-title>
          .
          <source>Zeitschrift für Germanistische Linguistik</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <given-names>C.</given-names>
            <surname>Maaß</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Leichte Sprache: Das Regelbuch</article-title>
          . Barrierefreie Kommunikation. Lit Verlag.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <collab>Netzwerk Leichte Sprache</collab>
          .
          <year>2013</year>
          .
          <article-title>Die Regeln für Leichte Sprache</article-title>
          .
          <source>Technical report.</source>
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <given-names>Ildikó</given-names>
            <surname>Pilán</surname>
          </string-name>
          and
          <string-name>
            <given-names>Elena</given-names>
            <surname>Volodina</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Investigating the importance of linguistic complexity features across different datasets related to language learning</article-title>
          .
          <source>In Proceedings of the Workshop on Linguistic Complexity and Natural Language Processing</source>
          , pages
          <fpage>49</fpage>
          -
          <lpage>58</lpage>
          , Santa Fe, New Mexico.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <given-names>Robert</given-names>
            <surname>Reynolds</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Insights from Russian second language readability classification: complexity-dependent training requirements, and feature evaluation of multiple categories</article-title>
          .
          <source>In Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications</source>
          , pages
          <fpage>289</fpage>
          -
          <lpage>300</lpage>
          , San Diego, California.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <given-names>Horacio</given-names>
            <surname>Saggion</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Automatic Text Simplification</article-title>
          . Morgan &amp; Claypool Publishers.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <string-name>
            <given-names>Helmut</given-names>
            <surname>Schmid</surname>
          </string-name>
          .
          <year>1995</year>
          .
          <article-title>Improvements in part-of-speech tagging with an application to German</article-title>
          .
          <source>In Proceedings of the EACL'95 SIGDAT Workshop</source>
          , pages
          <fpage>47</fpage>
          -
          <lpage>50</lpage>
          , Dublin, Ireland.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name>
            <given-names>Wolfgang</given-names>
            <surname>Schnotz</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>An Integrated Model of Text and Picture Comprehension</article-title>
          , pages
          <fpage>72</fpage>
          -
          <lpage>103</lpage>
          . Cambridge University Press, second edition.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <string-name>
            <given-names>Sarah E.</given-names>
            <surname>Schwarm</surname>
          </string-name>
          and
          <string-name>
            <given-names>Mari</given-names>
            <surname>Ostendorf</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Reading level assessment using support vector machines and statistical language models</article-title>
          .
          <source>In Proceedings of the 43rd Annual meeting of the Association for Computational Linguistics</source>
          , pages
          <fpage>523</fpage>
          -
          <lpage>530</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <string-name>
            <given-names>Rico</given-names>
            <surname>Sennrich</surname>
          </string-name>
          and
          <string-name>
            <given-names>Beat</given-names>
            <surname>Kunz</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Zmorge: A German Morphological Lexicon Extracted from Wiktionary</article-title>
          .
          <source>In Proceedings of the Ninth International Conference on Language Resources and Evaluation</source>
          , pages
          <fpage>1063</fpage>
          -
          <lpage>1067</lpage>
          , Reykjavik, Iceland. European Language Resources Association.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <string-name>
            <given-names>Rico</given-names>
            <surname>Sennrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Gerold</given-names>
            <surname>Schneider</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Martin</given-names>
            <surname>Volk</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Martin</given-names>
            <surname>Warin</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>A new hybrid dependency parser for German</article-title>
          .
          <source>In Proceedings of the Biennial GSCL Conference</source>
          , pages
          <fpage>115</fpage>
          -
          <lpage>124</lpage>
          , Potsdam.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>