An Empirical Analysis of Linguistic, Typographic, and Structural Features in Simplified German Texts

Alessia Battisti, Sarah Ebling, Martin Volk
Institute of Computational Linguistics, University of Zurich
Andreasstrasse 15, 8050 Zurich, Switzerland
alessia.battisti@uzh.ch, {ebling|volk}@cl.uzh.ch

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

English. We investigate a newly compiled corpus of simplified German texts for evidence of multiple complexity levels using unsupervised machine learning techniques. We apply linguistic features used in previous supervised machine learning research and additionally exploit structural and typographic characteristics of simplified texts. The results show a difference in complexity among the texts investigated, with optimal partitioning solutions ranging between two and four clusters. They demonstrate that both linguistic and structural/typographic features are constitutive of the clusters.

Italiano. Esaminiamo un nuovo corpus di testi in tedesco semplificato per cercare delle evidenze relative a molteplici livelli di complessità utilizzando tecniche di apprendimento automatico non supervisionato. Applichiamo variabili linguistiche utilizzate in precedenti ricerche con apprendimento automatico supervisionato e sfruttiamo inoltre le caratteristiche strutturali e tipografiche dei testi semplificati. I risultati mostrano una differenza di complessità tra i testi analizzati, con suddivisioni ottimali variabili da due a quattro cluster. Ciò dimostra che sia le caratteristiche linguistiche sia quelle strutturali/tipografiche sono costitutive dei cluster.

1 Introduction

Simplified language aims at providing comprehensible information to persons with reduced reading abilities. This group includes persons with cognitive impairment and learning disabilities, prelingually deaf persons, functionally illiterate persons, and foreign language learners (Bredel and Maaß, 2016). Simplified language is characterised by reduced lexical and syntactic complexity and includes images, structured layout, and explanations of difficult words. For simplified German, several guidelines exist that define which structures need to be avoided, which need to be paraphrased, and which are comprehensible (Bundesministerium für Arbeit und Soziales, 2011; Inclusion Europe, 2009; Maaß, 2015; Netzwerk Leichte Sprache, 2013).

Various countries have acknowledged simplified language as a means of inclusion that enables the target populations mentioned above to inform themselves of their legal rights and participate in society. German-speaking countries have been promoting simplified language only in recent years, in particular since the ratification of the United Nations Convention on the Rights of Persons with Disabilities (United Nations, 2006) in Austria (2008), Germany (2009), and Switzerland (2014). As a result, large amounts of texts in simplified German have become available.

More recently, simplified German has been conceptualised as a construct with multiple complexity levels (Bock, 2014; Bredel and Maaß, 2016; Kellermann, 2014). However, these proposals are merely theoretical: They are not yet operationalised, i.e., no sets of guidelines exist that distinguish the proposed levels with reference to linguistic or other features. The social franchise network capito (https://www.capito.eu/; last accessed: June 27, 2019), a provider of simplification services as well as training courses for simplified language translators, recognises three levels of simplified German corresponding to the Common European Framework of Reference for Languages (CEFR) (Council of Europe, 2001) levels A1, A2, and B1.
Being commercially orientated, capito does not make its CEFR adaptation publicly available.

In this paper, we present an unsupervised machine learning (clustering) approach to analysing texts in simplified German with the aim of investigating evidence of multiple complexity levels. To the best of our knowledge, this is the first study of its kind. We apply linguistic features used in previous supervised machine learning research (classification) and additionally exploit structural and typographic characteristics of simplified texts that have been described in the literature but not incorporated into clustering and/or classification approaches in the context of simplified language.

The remainder of this paper is structured as follows: Section 2 presents the research background. Section 3 describes our approach, introducing a novel dataset (Section 3.1), the feature design and engineering (Section 3.2), the clustering experiments (Section 3.3), and a discussion thereof (Section 3.4). Section 4 offers a conclusion and an outlook on future research questions.

2 Research Background

Two natural language processing tasks deal with the concept of simplified language: automatic readability assessment and automatic text simplification. Readability assessment refers to the process of determining the level of difficulty of a text. Traditionally, this has involved taking into account readability measures based on surface features such as the number of syllables in a word or the number of words in a sentence, e.g., via the Flesch Reading Ease Score (Flesch, 1948). Recently, more sophisticated models employing deeper linguistic features such as lexical, semantic, morphological, morphosyntactic, syntactic, pragmatic, discourse, psycholinguistic, and language model features have been proposed (Collins-Thompson, 2014; Dell’Orletta et al., 2014; Heimann Mühlenbock, 2013; Schwarm and Ostendorf, 2005).

Readability assessment implies the existence of multiple complexity levels. Complexity levels are identified, e.g., along school grades or levels of the CEFR (Hancke, 2013; Pilan and Volodina, 2018; Reynolds, 2016; Vajjala and Lõo, 2014).

The work presented in this paper represents a preliminary stage of the readability assessment task for simplified German in that it investigates empirically whether different complexity levels exist in previous German simplification practice in the first place.

3 Clustering Simplified German Texts

3.1 Dataset

Battisti and Ebling (2019) compiled a corpus of German/simplified German texts for use in automatic readability assessment and automatic text simplification. The corpus represents an enhancement of a parallel (German/simplified German) corpus created by Klaper et al. (2013). Compared to its predecessor, the corpus of Battisti and Ebling (2019) contains additional parallel data and newly contains monolingual-only data as well as structural and typographic information.

The authors collected PDFs and web pages from 92 different domains of public offices, translation agencies, and organisations publishing content in German and simplified German. Overall, the corpus consists of 6,217 documents (378 parallel and 5,461 monolingual). Metadata was recorded in the Open Language Archives Community (OLAC) Standard (http://www.language-archives.org/OLAC/olacms.html; last accessed: June 27, 2019) and converted into the metadata standard CMDI of CLARIN, a European research infrastructure for language resources and technology (https://www.clarin.eu/; last accessed: June 27, 2019). If available, information on the language level of a simplified German text (typically A1, A2, or B1) was stored in the metadata. 52 websites and 233 PDFs (amounting to approximately 26,000 sentences) have an explicit language level label.

Linguistic annotation was added automatically using ParZu (Sennrich et al., 2009) (for tokens and dependency parses), NLTK (Bird et al., 2009) (for sentence segmentation), TreeTagger (Schmid, 1995) (for part-of-speech tags and lemmas), and Zmorge (Sennrich and Kunz, 2014) (for morphological units). In addition, information on text structure (e.g., paragraphs, lines), typography (e.g., boldface, italics), and images (content, position, and dimensions) was added. The annotations were stored in the Text Corpus Format by WebLicht (TCF) developed as part of CLARIN (https://weblicht.sfs.uni-tuebingen.de/weblichtwiki/index.php/TheTCFFormat; last accessed: June 27, 2019).

For the experiments reported in this paper, we considered the monolingual documents of the corpus, i.e., the monolingual-only documents as well as the simplified German side of the parallel data. This amounted to 5,839 texts (193,845 sentences).

3.2 Features

In addition to constituting the first approach to investigating simplified German texts using unsupervised machine learning, the unique contribution of this paper consists of leveraging information that has been shown to be characteristic of simplified language (Arfé et al., 2018; Bock, 2018; Bredel and Maaß, 2016) but has not been incorporated into machine learning approaches involving simplified language. Specifically, we considered features derived from text structure (e.g., paragraphs, lines), typography (e.g., font type, font style), and image (content, position, and dimensions) information.

In a simplified text, typographical information, such as boldface and italics, serves as a discourse marker signalling words and phrases that require particular attention and convey different purposes (Arfé et al., 2018). Leveraging the concepts of multi-modality and multi-codality in the psychology of perception (Schnotz, 2014), images are supposed to support the text by activating previous knowledge and exemplifying the objects in the text (Bredel and Maaß, 2016). For the sake of simplicity, the term “images” here subsumes pictures, pictograms, photographs, graphics, and maps.

Subset  Features                    Number
1       All                         115
2       Surface                     26
3       Deeper                      89
4       Lexical + semantic          17
5       Morphological + syntactic   72

Table 1: Subsets of feature combinations.

Altogether, the feature set comprised 115 features arranged into five feature groups, as shown in Table 1. Subset 3 (“Deeper”) consisted of lexical, semantic, morphological, and syntactic features. “Surface” is short for surface, structural, and typographic features.

Surface, structural, and typographic features: We took advantage of the structural and typographic information included in the corpus (cf. Section 3.1) and introduced as features the number of images, paragraphs, lines, words of a specific font type and style, and adherence to a one-sentence-per-line rule. We additionally included the number of digits and numbers in words (Saggion, 2017), the number of abbreviations and initial letters, and the number of individual punctuation marks and special characters. Among the special characters was the Mediopunkt (‘centred dot’), a typographic device proposed by Maaß (2015) for visually segmenting compound words. We also computed the Läsbarhetsindex (‘readability index’, LIX) (Björnsson, 1968), defined as $\mathrm{LIX} = \frac{N_w}{N_s} + \frac{W \times 100}{N_w}$, where $N_w$ is the number of words, $N_s$ is the number of sentences, and $W$ is the number of tokens longer than six characters, so that the second term is the percentage of long tokens.

Lexical and semantic features: This group included features for lexical richness, lexical variation (e.g., nominal ratio, noun/pronoun ratio, bilogarithmic TTR (Vajjala and Meurers, 2012)), word frequency based on the German reference corpus DeReKo (Lüngen, 2017), and lists of words classified at different perceptive levels (Glaboniat et al., 2005). We also included question words and named entities, which may strain the reading comprehension process if the target reader does not have the appropriate knowledge.

Morphological, morphosyntactic, and syntactic features: In this group, we included particles, prepositions, demonstrative and personal pronouns, and (separately) first-, second-, and third-person pronouns. We additionally counted adverbs, modal verbs, subjunctions, and conjunctions. We added genitive attributes in relation to von+dative constructions. In German, the genitive attribute can be substituted by a von+dative construction; importantly, this is a case of simplified German conflicting with the grammar of Standard German, which encourages the use of the former construction. We additionally included the number of negative forms, the presence of pre- and post-modifiers, and impersonal constructions. We took advantage of the verbal morphology and included verbal mood- and tense-based features (Dell’Orletta et al., 2011). We also considered direct vs. indirect speech constructions, the types of subordinate clauses, and features based on word and sentence order.

3.3 Experiments and Results

3.3.1 Method

We applied agglomerative hierarchical clustering. We used the scipy toolkit (https://www.scipy.org/; last accessed: June 27, 2019) alongside models recursively created with the scikit-learn library (https://scikit-learn.org/stable/; last accessed: June 27, 2019). The data matrix was created using the cosine similarity metric and the average linkage function. Because of the significant variation in length of the documents, we normalised the features by dividing the values by the length of each document expressed in tokens. We then performed principal component analysis (PCA) to diminish the sparseness of the data matrix and avoid the curse-of-dimensionality trap. In a second experiment, we applied feature agglomeration instead of PCA prior to clustering. Feature agglomeration allows for a straightforward interpretation of the results.

Given the lack of a ground truth for our data, we evaluated the experiments using the following metrics: silhouette score, Calinski-Harabasz index, and the elbow method. These metrics were also used to choose the optimal number of clusters.

3.3.2 Results

Table 2 shows the results of the first three iterations of our clustering approach after the feature agglomeration step. We observed that a value between 2 and 4 (inclusive) represented a good clustering solution for the whole corpus according to the metrics. A dendrogram corroborated these results (cf. Figure 1).

          Subset 1         Subset 2         Subset 3         Subset 4         Subset 5
Clusters  Sil    CH        Sil    CH        Sil    CH        Sil    CH        Sil    CH
2         0.601  3867.1    0.373  1135.2    0.675  5214.2    0.693  3593.9    0.695  5463.2
3         0.532  2476.2    0.372  1266.3    0.617  3329.5    0.55   1824.8    0.572  3273.9
4         0.456  1698.3    0.493  1417.6    0.592  2572.7    0.505  1248.9    0.51   2517.8

Table 2: Comparison of the silhouette scores (Sil) and Calinski-Harabasz indices (CH) after feature agglomeration on all data samples.

Figure 1: Dendrogram of the texts considering agglomerated features of Subset 1.

Upon inspection of the clusters, we found the main differences to be due to the following features: number of nouns, number of verbs, number of paragraphs, adherence to one-sentence-per-line rule, number of interrogative clauses, number of different fonts, and number of words in bold. Considering the mean ratio of the features in a two-cluster solution, Cluster 1 displayed a higher frequency of nouns (0.31 vs. 0.24) and adjectives (0.9 vs. 0.6) and a lower frequency of verbs (0.13 vs. 0.17) than Cluster 2, which in turn included a slightly higher rate of images (0.008 vs. 0.004).

3.4 Discussion

The inverse proportion of the mean ratios concerning nouns and verbs (cf. Section 3.3.2) suggested that Cluster 1 included texts focusing on objects or concepts, since verbs (events, actions, etc.) had been turned into nouns (concepts, things, etc.) following the linguistic process of nominalisation, while the linguistic structure of texts in Cluster 2 was simpler.

Figure 2 visualises the box plots of six of the surface features of Subset 2 (number of full stops, number of commas, adherence to one-sentence-per-line rule, number of paragraphs, number of different fonts, number of images) based on the three-cluster solution suggested by the agglomerative hierarchical approach. The first cluster consisted of texts that followed the one-sentence-per-line rule and featured a low frequency of commas and a high number of paragraphs. These characteristics are crucial properties of simplified texts. Our findings further emphasise the importance of distinguishing among different types of punctuation marks in the context of simplified language: while for commas, a low frequency is indicative of textual simplicity, the reverse is true for full stops. Texts included in Cluster 1 did not contain images. This outcome relates to the results of a more recent study by Bock (2018), according to which images should be used with caution even in simplified German texts to avoid the potential of distraction and cognitive overload.

Figure 2: Six features of Subset 2.

4 Conclusion and Outlook

In this paper, we have presented the first approach to investigating simplified German texts by means of unsupervised machine learning techniques as a basis for future readability assessment studies on this language variety. In addition, we have introduced novel features that have been described in the literature but not incorporated into machine learning (clustering and/or classification) approaches in the context of simplified language, notably: number of images, number of paragraphs, number of lines, number of words of a specific font type, and adherence to a one-sentence-per-line rule. Our findings provide evidence that existing texts are not simplified at a unique complexity level of German. We have demonstrated that features based on structural information are capable of accounting for the different complexity levels found.

As a next step, we will use the results of the experiments presented in this paper to establish a framework of inductively generated complexity levels. This framework will serve as the basis for readability assessment in the context of simplified German. Knowledge derived from our study can also inform automatic and manual approaches to simplification of German.

References

Barbara Arfé, Lucia Mason, and Inmaculada Fajardo. 2018. Simplifying informational text structure for struggling readers. Reading and Writing, 31(9):2191–2210.

Alessia Battisti and Sarah Ebling. 2019. A corpus for automatic readability assessment and text simplification of German. arXiv:1909.09067.

Steven Bird, Edward Loper, and Ewan Klein. 2009. Natural Language Processing with Python. O’Reilly Media Inc.

Carl-Hugo Björnsson. 1968. Läsbarhet. Liber, Stockholm.

Bettina M. Bock. 2014. “Leichte Sprache”: Abgrenzung, Beschreibung und Problemstellungen aus Sicht der Linguistik. Sprache barrierefrei gestalten, pages 17–51.

Bettina M. Bock. 2018. “Leichte Sprache” - Kein Regelwerk. Sprachwissenschaftliche Ergebnisse und Praxisempfehlungen aus dem LeiSA-Projekt. Technical report, Universität Leipzig.

Ursula Bredel and Christiane Maaß. 2016. Leichte Sprache: Theoretische Grundlagen. Orientierung für die Praxis. Duden, Berlin.

Bundesministerium für Arbeit und Soziales. 2011. Verordnung zur Schaffung barrierefreier Informationstechnik nach dem Behindertengleichstellungsgesetz (Barrierefreie-Informationstechnik-Verordnung-BITV 2.0). Technical Report Teil 1.

Kevyn Collins-Thompson. 2014. Computational assessment of text readability. A survey of current and future research. ITL International Journal of Applied Linguistics, 165(2):97–135.

Council of Europe. 2001. Common European Framework of Reference for Languages: Learning, teaching, assessment. Cambridge University Press, Cambridge.

Felice Dell’Orletta, Simonetta Montemagni, and Giulia Venturi. 2011. READ–IT: Assessing readability of Italian texts with a view to text simplification. In Proceedings of the Second Workshop on Speech and Language Processing for Assistive Technologies, pages 73–83, Edinburgh, Scotland, UK. Association for Computational Linguistics.

Felice Dell’Orletta, Martijn Wieling, Giulia Venturi, Andrea Cimino, and Simonetta Montemagni. 2014. Assessing the readability of sentences: Which corpora and features? In Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications, pages 163–173, Baltimore, Maryland, June. Association for Computational Linguistics.

Rudolph Flesch. 1948. A new readability yardstick. Journal of Applied Psychology, 32:221–233.

Manuela Glaboniat, Martin Müller, Paul Rusch, Helen Schmitz, and Lukas Wertenschlag. 2005. Profile Deutsch. Klett Langenscheidt, Berlin/Munich, Germany.

Julia Hancke. 2013. Automatic Prediction of CEFR Proficiency Levels Based on Linguistic Features of Learner Language. Master’s thesis, University of Tübingen, Germany.

Katarina Heimann Mühlenbock. 2013. I see what you mean: Assessing readability for specific target groups. Ph.D. thesis, University of Gothenburg.

Inclusion Europe. 2009. Information für alle: Europäische Regeln, wie man Informationen leicht lesbar und leicht verständlich macht. Technical report, Inclusion Europe.

Gudrun Kellermann. 2014. Leichte und Einfache Sprache Versuch einer Definition. In Aus Politik und Zeitgeschichte, volume 64, pages 9–11.

David Klaper, Sarah Ebling, and Martin Volk. 2013. Building a German/Simple German parallel corpus for automatic text simplification. In ACL Workshop on Predicting and Improving Text Readability for Target Reader Populations, pages 11–19, Sofia, Bulgaria.

Harald Lüngen. 2017. DEREKO - Das Deutsche Referenzkorpus. Zeitschrift für Germanistische Linguistik.

C. Maaß. 2015. Leichte Sprache: Das Regelbuch. Barrierefreie Kommunikation. Lit Verlag.

Netzwerk Leichte Sprache. 2013. Die Regeln für Leichte Sprache. Technical report.

Ildiko Pilan and Elena Volodina. 2018. Investigating the importance of linguistic complexity features across different datasets related to language learning. In Proceedings of the Workshop on Linguistic Complexity and Natural Language Processing, pages 49–58, Santa Fe, New Mexico.

Robert Reynolds. 2016. Insights from Russian second language readability classification: complexity-dependent training requirements, and feature evaluation of multiple categories. In Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications, pages 289–300, San Diego, California.

Horacio Saggion. 2017. Automatic Text Simplification. Morgan & Claypool Publishers.

Helmut Schmid. 1995. Improvements in part-of-speech tagging with an application to German. In Proceedings of the EACL’95 SIGDAT Workshop, pages 47–50, Dublin, Ireland.

Wolfgang Schnotz. 2014. An Integrated Model of Text and Picture Comprehension, pages 72–103. Cambridge University Press, second edition.

Sarah E. Schwarm and Mari Ostendorf. 2005. Reading level assessment using support vector machines and statistical language models. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 523–530.

Rico Sennrich and Beat Kunz. 2014. Zmorge: A German Morphological Lexicon Extracted from Wiktionary. In Proceedings of the Ninth International Conference on Language Resources and Evaluation, pages 1063–1067, Reykjavik, Iceland. European Language Resources Association.

Rico Sennrich, Gerold Schneider, Martin Volk, and Martin Warin. 2009. A new hybrid dependency parser for German. In Proceedings of the Biennial GSCL Conference, pages 115–124, Potsdam.

United Nations. 2006. Convention on the Rights of Persons with Disabilities and Optional Protocol.

Sowmya Vajjala and Kaidi Lõo. 2014. Automatic CEFR level prediction for Estonian learner text. In Proceedings of the third workshop on NLP for computer-assisted language learning, volume 107, pages 113–127, Uppsala, Sweden.

Sowmya Vajjala and Detmar Meurers. 2012. On Improving the Accuracy of Readability Classification using Insights from Second Language Acquisition. In Proceedings of the 7th workshop on building educational applications using NLP, pages 163–173, Montréal, Canada.