Lexicon and Syntax: Complexity across Genres and Language Varieties

Pietro dell’Oglio•, Dominique Brunato⋄, Felice Dell’Orletta⋄
• University of Pisa
pietrodelloglio@live.it
⋄ Istituto di Linguistica Computazionale “Antonio Zampolli” (ILC–CNR), ItaliaNLP Lab - www.italianlp.it
{dominique.brunato, felice.dellorletta}@ilc.cnr.it

Abstract

This paper presents first results of ongoing work investigating the interplay between lexical complexity and syntactic complexity with respect to the nominal lexicon, and how this interplay is affected by textual genre and by the level of linguistic complexity within a genre. A cross-genre analysis is carried out for the Italian language using multi-level linguistic features automatically extracted from dependency-parsed corpora.

1 Introduction

Linguistic complexity is a multifaceted notion which has been addressed from different perspectives. One established dichotomy distinguishes a “global” vs a “local” perspective, where the former considers the complexity of the language as a whole and the latter focuses on complexity within each subdomain, i.e. phonology, morphology, syntax, discourse (Miestamo, 2008). While measuring global complexity is a very ambitious and probably hopeless endeavor, measuring local complexities is perceived as a more doable task (Kortmann and Szmrecsanyi, 2012). The level of complexity within each subdomain has indeed been formalized in terms of distinct parameters that capture either internal properties of the language (in the “absolute” notion of complexity) or phenomena correlating with processing difficulties from the language user’s viewpoint (in the “relative” notion of complexity) (Miestamo, 2008). For instance, complexity at the lexical level has been computed in terms of length (measured in characters or syllables), of frequency either of the whole surface word (Ryder and Slater, 1988; Chiari and De Mauro, 2014) or of its internal components (see e.g. the root frequency effect (Burani, 2006)), ambiguity and familiarity, among others. At the syntactic level, much attention has been paid to canonicity effects due to word order variation (Diessel, 2005; Hawkins, 1994; Futrell et al., 2015), as well as to long-distance dependencies (Gibson, 1998; Gibson, 2000), proving their effect on a wide range of psycholinguistic phenomena, such as the subject/object relative clause asymmetry or the garden path effect in main verb/reduced-relative ambiguities.

An interesting question addressed by recent corpus-driven research is how language complexity is affected by textual genre. At the syntactic level, the study by Wang and Liu (2017) on ten genres taken from the British National Corpus showed that genre-specific stylistic factors have an influence on the distribution of dependency distances and dependency direction. Similarly for Italian, Brunato and Dell’Orletta (2017) investigated the influence of genre, and of the level of complexity within genre, on a range of factors of syntactic complexity automatically computed from dependency-parsed corpora. Inspired by that work, we also intend to analyze the effect of genre on linguistic complexity. However, unlike the dominant local approach, where each subdomain is typically studied in isolation, our contribution intends to address the interrelation between different levels, i.e. lexicon and syntax. Specifically, we investigate the following questions:

• to what extent is lexical complexity influenced by genre?

• to what extent is lexical complexity influenced by the level of complexity within the same genre?

• is there a correlation between lexical complexity and syntactic complexity? Does it vary according to genre and level of complexity within the same genre?

To answer these questions, we conducted an in-depth analysis for the Italian language based on automatically dependency-parsed corpora, aimed at assessing i) the distribution of simple and complex nominal lexicon in different genres and in different language varieties of the same genre, ii) the syntactic role played by the “simple” and “complex” nouns characterizing each corpus, and iii) the correlation between “simple” and “complex” nouns and features of complexity underlying the syntactic structure in which they occur.

In what follows we first describe the corpora considered in this study. We then illustrate how lexical and syntactic complexity have been formalized. In Section 4 we discuss some preliminary findings obtained from the comparative investigation across corpora.

2 The Corpora

Four genres were considered in this study: Journalism, Scientific prose, Educational writing and Narrative. For each genre, we chose two corpora, selected to be representative of a complex and of a simple language variety for that genre. The level of complexity was established according to the expected target audience.

The Journalistic corpora are Repubblica (Rep) for the complex variety, and Due Parole (2Par) for the simple one. Rep is a corpus of 232,908 tokens made of all articles published between 2000 and 2005 in the newspaper of the same name; 2Par contains 322 articles taken from the easy-to-read magazine Due Parole [1], for a total of about 73K tokens.

The corpora representative of Scientific writing are Scientific articles (ScientArt) for the complex language variety, and Wikipedia articles (WikiArt) for the simple one. The former is made of 84 documents (471,969 tokens) covering various topics in the scientific literature. The latter is made of 293 documents (about 205K tokens) extracted from the Italian web portal “Ecology and Environment” of Wikipedia.

For the Educational writing corpora we relied on two collections of school textbooks: the ‘complex’ one (EduAdu) contains 70 texts (48,103 tokens) targeting high school students, the ‘simple’ one (EduChi) a sample of 127 texts (48,036 tokens) targeting primary school students.

Finally, the Narrative corpora are composed of the original versions of Terence and Teacher (TTorig), for the complex pole, and the corresponding simplified versions for the simple pole. Terence, which is named after the EU Terence Project [2], is made of 32 documents covering short novels for children. Teacher contains 24 documents extracted from web sites dedicated to educational resources for teachers. All Terence and Teacher texts have a simpler version (TTsemp), which is the result of a manual simplification process as described by Brunato and Dell’Orletta (2017).

All corpora were automatically tagged by the part-of-speech tagger described in (Dell’Orletta, 2009) and dependency parsed by the DeSR parser described in (Attardi et al., 2009).

[1] www.dueparole.it
[2] www.terenceproject.eu

3 Features of Linguistic Complexity

3.1 Assessment of Lexical Complexity

For each corpus we extracted all lemmas tagged as nouns, without considering proper nouns, and we classified them as ‘simple’ vs ‘complex’ nouns. Such a distinction was established according to their frequency, which is one of the most widely used parameters to assess the complexity of vocabulary (see Section 1). Frequency was here computed with respect to a reference corpus, i.e. ItWac (Baroni et al., 2009), which was chosen since it is the biggest corpus available for standard Italian, thus offering a reliable resource to evaluate word frequency on a large scale. After ranking all nouns by frequency, we pruned those with a frequency value ≤ 3 and, for each corpus, we kept the first quarter of the ranked nouns as representative of the sample of simple nouns and the last quarter as representative of the sample of complex nouns.

3.2 Assessment of Syntactic Complexity

To investigate our main research questions, that is, how lexical complexity affects syntactic complexity and the possible influence of genre and language variety on this relationship, we focused on a set of features automatically extracted from the sentence parse tree. These features were chosen since they are acknowledged to be predictors of phenomena of structural complexity, as demonstrated by their use in different scenarios, such as the assessment of learners’ language development or of the level of text readability (e.g. (Collins-Thompson, 2014; Cimino et al., 2013; Dell’Orletta et al., 2014)).

For each corpus, all the considered features were computed for all occurring nouns, for the subset of complex nouns and for the subset of simple nouns. Specifically, we focused on the following ones:

• The linear distance (in terms of tokens) separating the noun from its syntactic head (HeadDistance in all following Tables)

• The hierarchical distance (in terms of dependency arcs) separating the noun from the root of the tree (RootDistance)

• The average number of children per noun (AvgChildren)

• The average number of siblings per noun (AvgSibling)

4 Discussion

To have a first insight into the effect of genre and language variety on the interplay between lexical and syntactic complexity, we compared the main syntactic roles that nouns play in the sentence by calculating the frequency of all dependency types linking a noun to its head. This is shown in Figure 1, which reports the percentage distribution of typed dependency relationships linking a noun to its syntactic head across all corpora. For each corpus there are three columns: the first one considers data for all nouns of the corpus without any complexity label, the second one only data for the simple noun subset and the last one only data for the complex noun subset.

[Figure 1 not reproduced.]
Figure 1: Distribution of typed syntactic dependencies linking nouns to their head across corpora. For each corpus, the first column refers to all nouns; the second one to the subset of simple nouns; the third one to the subset of complex nouns.

It can be noted that the share of nouns used as prepositional complements (prep) is the highest one across all corpora, although with differences ranging from the lowest percentage (35.5%) in the ‘easy’ version of the narrative corpus (i.e. TTsemp) to the highest one (49.9%) in ScientArt (i.e. the complex language variety for the scientific writing genre). The syntactic role of prepositional complement is played more often by simple nouns than by complex nouns. This is particularly evident in ScientArt and Rep, where the difference between simple and complex nouns occurring as prepositional complements is equal to 20 and 15 percentage points, respectively. Conversely, complex nouns are more widely used as modifiers than simple nouns, especially in Rep. The percentage of nouns occurring in the subject and object position is less than 20% in all corpora. Interestingly, the highest occurrence of nominal subjects is attested in 2Par and EduChi (14.1% and 16%, respectively). This might suggest that simpler language varieties, independently of genre, make more use of explicit subjects than of implicit or pronominal ones. Besides, whether a noun is simple or complex does not particularly affect the overall presence of nominal subjects, except for ScientArt and Rep, which both show a higher percentage of simple nouns in the subject position.

A deeper understanding of the relationship between lexical and syntactic complexity was provided by the investigation of the syntactic features described in Section 3.2. Table 1 shows the average value of the monitored features with respect to all nouns (All), to the subset of complex nouns (Comp) and to the subset of simple nouns (Simp) extracted from all corpora.

            HeadDistance           AvgChildren            AvgSibling             RootDistance
Corpus      All    Comp   Simp     All    Comp   Simp     All    Comp   Simp     All    Comp   Simp
2Par        2.252  2.342  2.256    1.318  1.218  1.345    1.675  1.956  1.580    2.969  2.816  2.993
Rep         2.210  2.271  2.272    1.213  0.979  1.323    1.558  1.509  1.564    4.197  4.314  4.131
WikiArt     2.531  2.686  2.625    1.363  1.138  1.528    1.603  1.897  1.592    4.284  4.346  4.097
ScientArt   2.162  2.391  2.409    1.229  1.066  1.388    1.399  1.487  1.418    4.835  5.132  4.598
EduChi      2.177  2.338  2.171    1.311  1.303  1.353    1.523  1.621  1.458    3.408  3.387  3.388
EduAdu      2.598  2.875  2.695    1.440  1.375  1.560    1.654  1.715  1.640    4.269  4.483  4.143
TTsemp      2.167  2.334  2.172    1.342  1.335  1.470    1.690  1.789  1.659    3.017  2.953  2.882
TTorig      2.252  2.399  2.269    1.339  1.333  1.439    1.681  1.705  1.697    3.268  3.200  3.169

Table 1: Average value of the monitored syntactic features with respect to all nouns (All), to the subset of complex nouns (Comp) and to the subset of simple nouns (Simp) extracted from all the examined corpora.

We assessed whether the variation between these feature values was statistically significant in three different comparative scenarios: i) between the two corpora of the same genre, ii) between the complex corpora of each different genre and iii) between the simple corpora of each different genre. Table 2 shows the linguistic features varying significantly for all the considered comparisons according to the Wilcoxon rank-sum test, a non-parametric statistical test for two independent samples (Wild, 1997).

                        HeadDistance    AvgChildren     AvgSibling      RootDistance
Comparison              All  C    S     All  C    S     All  C    S     All  C    S
2Par vs Rep             ✓*   ✓*   ✓*    ✓*   ✓*   ✓*    ✓*   ✓*   ✓     ✓*   ✓*   ✓*
WikiArt vs ScientArt    ✓*   ✓*   ✓*    ✓*   ✗    ✓*    ✓*   ✓*   ✓*    ✓*   ✓*   ✓*
EduChi vs EduAdu        ✗    ✗    ✗     ✓*   ✗    ✓     ✓    ✗    ✗     ✓*   ✓*   ✓*
TTsemp vs TTorig        ✗    ✗    ✗     ✗    ✗    ✗     ✗    ✗    ✗     ✓*   ✓    ✓*
ScientArt vs EduAdu     ✓*   ✓*   ✓*    ✓*   ✓*   ✓*    ✓*   ✓*   ✓*    ✓*   ✓*   ✓*
Rep vs ScientArt        ✓*   ✗    ✓*    ✓*   ✓*   ✓*    ✓*   ✓*   ✓*    ✓*   ✓*   ✓*
Rep vs EduAdu           ✓*   ✓*   ✓*    ✓*   ✓*   ✓*    ✓    ✓    ✗     ✓    ✗    ✗
Rep vs TTorig           ✓*   ✓*   ✓*    ✓*   ✓*   ✓*    ✓*   ✓*   ✓*    ✓*   ✓*   ✓*
TTorig vs ScientArt     ✓*   ✓*   ✓*    ✓*   ✓*   ✓     ✓*   ✓*   ✓*    ✓*   ✓*   ✓*
TTorig vs EduAdu        ✗    ✗    ✗     ✓*   ✗    ✓     ✓*   ✗    ✓*    ✓*   ✓*   ✓*
2Par vs EduChi          ✓    ✓*   ✗     ✓*   ✓*   ✗     ✓*   ✗    ✓     ✓*   ✓*   ✓*
2Par vs TTsemp          ✗    ✓*   ✗     ✓*   ✓*   ✗     ✓    ✗    ✓*    ✗    ✗    ✓*
2Par vs WikiArt         ✓*   ✗    ✓*    ✓*   ✓    ✓*    ✓*   ✗    ✓*    ✓*   ✓*   ✓*
TTsemp vs EduChi        ✗    ✗    ✗     ✗    ✗    ✗     ✓*   ✗    ✓*    ✓*   ✓*   ✓*
TTsemp vs WikiArt       ✓*   ✓*   ✓*    ✗    ✓*   ✗     ✓*   ✗    ✓*    ✓*   ✓*   ✓*
WikiArt vs EduChi       ✓*   ✓*   ✓*    ✗    ✓*   ✓     ✗    ✗    ✗     ✓*   ✓*   ✓*

Table 2: Syntactic features that vary in a statistically significant way between the simple and the complex corpus of the same genre, between the complex corpora of each genre and between the simple corpora of each genre. “✗” means a non-significant variation; “✓” means a significant variation at <0.05; “✓*” means a very significant variation at <0.01. All=all nouns; C=complex nouns; S=simple nouns.

If we compare the two language varieties within each genre, it can be seen, for instance, that nouns are hierarchically more distant from the root in the complex than in the simple version. Such a variation, which is highly significant for all genres, affects the Journalistic genre most (2Par: 2.969; Rep: 4.197) and, to a lesser extent, the Educational one (EduChi: 3.408; EduAdu: 4.269). However, for the other monitored syntactic features, the WikiArt corpus appears slightly more difficult than its complex counterpart: its nouns are farther from their head (WikiArt: 2.531; ScientArt: 2.162) and have a richer structure in terms of number of children (WikiArt: 1.363; ScientArt: 1.229). With the exception of root distance, variations concerning the other features within the Narrative genre are not statistically significant. This is possibly due to the particular composition of the two selected corpora: indeed, both Terence and Teacher texts in their original version were already conceived for an audience of children and young students, and they were not greatly modified in their simplified version.

We finally assessed whether the variation of these features was statistically significant when comparing the simple and the complex noun subsets of the same corpus (Table 3). Along this dimension, we can observe that complex nouns have, on average, fewer dependents (AvgChildren feature) than simple ones, independently of the internal distinction within genre; on the contrary, they tend to occur farther from the root, especially in the complex variety of Scientific prose (ScientArt Comp: 5.132; ScientArt Simp: 4.598).

Corpus (simple vs complex nouns)    HeadDistance  AvgChildren  AvgSibling  RootDistance
2Par                                ✓*            ✓*           ✓*          ✓*
Rep                                 ✓*            ✓*           ✗           ✓*
WikiArt                             ✗             ✓*           ✓*          ✓*
ScientArt                           ✓*            ✓*           ✗           ✓*
EduChi                              ✗             ✓            ✗           ✗
EduAdu                              ✓             ✓*           ✗           ✓*
TTsemp                              ✓*            ✗            ✗           ✗
TTorig                              ✓*            ✓            ✗           ✗

Table 3: Linguistic features that vary in a statistically significant way between the simple and the complex nouns of the same corpus. “✗” means a non-significant variation; “✓” means a significant variation at <0.05; “✓*” means a very significant variation at <0.01.

5 Conclusion

While language complexity is a central topic in linguistic and computational linguistics research, it is typically addressed from a local perspective, where each subdomain is investigated in isolation. In this preliminary work, we have defined a method to study the interplay between lexical and syntactic complexity restricted to the nominal domain. We modeled lexical complexity in terms of frequency, and syntactic complexity in terms of a set of parse tree features formalizing phenomena of syntactic complexity. Our approach was tested on corpora selected to be representative of different genres and of different levels of complexity within each genre, in order to investigate whether noun complexity affects syntactic complexity differently along the two dimensions. We observed, e.g., that nouns tend to appear closer to the root in simple language varieties, independently of genre, while the effect of genre and linguistic complexity is less sharp with respect to the other considered features.

To gain a deeper understanding of the observed tendencies, we are currently carrying out a more in-depth analysis focusing on fine-grained features of syntactic complexity, such as the depth of the nominal subtree. Further, we would like to extend this approach to other constituents of the sentence, such as the verb.

Acknowledgments

The work presented in this paper was partially supported by the 2-year project (2017-2019) PERFORMA – Personalizzazione di pERcorsi FORMativi Avanzati, funded by Regione Toscana (Progetti Congiunti di Alta Formazione – POR FSE 2014-2020 Asse A – Occupazione) in collaboration with Meta srl company.

References

Giuseppe Attardi, Felice Dell’Orletta, Maria Simi, Joseph Turian. 2009. Accurate dependency parsing with a stacked multilayer perceptron. In Proceedings of EVALITA 2009 - Evaluation of NLP and Speech Tools for Italian 2009, Reggio Emilia, Italy, December 2009.

Marco Baroni, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. 2009. The WaCky wide web: A collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43:3, pp. 209-226.

Dominique Brunato and Felice Dell’Orletta. 2017. On the order of words in Italian: a study on genre vs complexity. International Conference on Dependency Linguistics (Depling 2017), 18-20 September 2017, Pisa, Italy.

Cristina Burani. 2006. Morfologia: i processi. In: A. Laudanna and M. Voghera (eds.), Il linguaggio. Strutture linguistiche e processi cognitivi. Bari, Laterza, 2006.

Isabella Chiari and Tullio De Mauro. 2014. The New Basic Vocabulary of Italian as a linguistic resource. Proceedings of the First Italian Conference on Computational Linguistics (CLIC-IT), Pisa, 15-19 December 2014.

Andrea Cimino, Felice Dell’Orletta, Giulia Venturi, and Simonetta Montemagni. 2013. Linguistic Profiling based on General-purpose Features and Native Language Identification. Proceedings of Eighth Workshop on Innovative Use of NLP for Building Educational Applications, Atlanta, Georgia, June 13, pp. 207-215.

Kevyn Collins-Thompson. 2014. Computational Assessment of text readability. Recent Advances in Automatic Readability Assessment and Text Simplification. Special issue of International Journal of Applied Linguistics, 165:2, John Benjamins Publishing Company, 97-135.

Felice Dell’Orletta. 2009. Ensemble system for part-of-speech tagging. In Proceedings of EVALITA 2009 - Evaluation of NLP and Speech Tools for Italian 2009, Reggio Emilia, Italy, December 2009.

Felice Dell’Orletta, Martijn Wieling, Andrea Cimino, Giulia Venturi, and Simonetta Montemagni. 2014. Assessing the Readability of Sentences: Which Corpora and Features. Proceedings of the 9th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2014), Baltimore, Maryland, USA.

Holger Diessel. 2005. Competing motivations for the ordering of main and adverbial clauses. Linguistics, 43(3):449–470.

Richard Futrell, Kyle Mahowald and Edward Gibson. 2015. Quantifying word order freedom in dependency corpora. In Proceedings of the Third International Conference on Dependency Linguistics (Depling 2015), 91–100.

Edward Gibson. 1998. Linguistic complexity: Locality of syntactic dependencies. Cognition, 68:1–76.

Edward Gibson. 2000. The dependency Locality Theory: A distance-based theory of linguistic complexity. In A. Marantz, Y. Miyashita and W. O’Neil (Eds.), Image, Language, Brain, Cambridge, MA: MIT Press, pp. 95–126.

Daniel Gildea and David Temperley. 2010. Do Grammars Minimize Dependency Length? Cognitive Science, 34(2):286–310.

John A. Hawkins. 1994. A performance theory of order and constituency. Cambridge Studies in Linguistics, 73, Cambridge University Press.

Bernd Kortmann and Benedikt Szmrecsanyi. 2012. Linguistic Complexity. Second Language Acquisition, Indigenization, Contact. Berlin, Boston: De Gruyter.

Matti Miestamo. 2008. Grammatical complexity in a crosslinguistic perspective. In: Miestamo M., Sinnemäki K. and Karlsson F. (eds), Language Complexity: Typology, Contact, Change, Amsterdam: Benjamins, 23–41.

Randall James Ryder and Wayne H. Slater. 1988. The relationship between word frequency and word knowledge. The Journal of Educational Research, 81(5):312–317.

Yaqin Wang and Haitao Liu. 2017. The effects of genre on dependency distance and dependency direction. Language Sciences, 59, 135–157.

Chris Wild. 1997. The Wilcoxon Rank-Sum Test. University of Auckland, Department of Statistics.
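
The noun classification of Section 3.1 and the four parse-tree features of Section 3.2 can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the CoNLL-style list of head positions (1-based, with 0 marking the root), the toy sentence and the frequency counts are all assumptions made for this example. The paper does not say how a noun that is itself the root is handled, so here it is simply left out of the HeadDistance average, and RootDistance counts the arcs up to the root token.

```python
def split_simple_complex(ref_freq, min_freq=3):
    """Return (simple, complex) nouns from a {noun: reference frequency} map.

    Nouns with frequency <= min_freq are pruned; the first quarter of the
    frequency ranking is taken as 'simple', the last quarter as 'complex'.
    """
    ranked = sorted(
        (noun for noun, freq in ref_freq.items() if freq > min_freq),
        key=lambda noun: (-ref_freq[noun], noun),  # most frequent first
    )
    quarter = len(ranked) // 4
    if quarter == 0:
        return [], []
    return ranked[:quarter], ranked[-quarter:]


def noun_features(heads, noun_positions):
    """Average the four features over the nouns at the given 1-based positions.

    heads[i - 1] is the position of the head of token i (0 for the root).
    """
    head_dists, root_dists, children, siblings = [], [], [], []
    for pos in noun_positions:
        head = heads[pos - 1]
        if head != 0:  # assumption: a root noun has no head distance
            head_dists.append(abs(pos - head))
        # Climb the tree, counting dependency arcs up to the root token.
        arcs, node = 0, pos
        while heads[node - 1] != 0:
            node = heads[node - 1]
            arcs += 1
        root_dists.append(arcs)
        children.append(heads.count(pos))       # tokens headed by the noun
        siblings.append(heads.count(head) - 1)  # other tokens sharing its head

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    return {
        "HeadDistance": mean(head_dists),
        "RootDistance": mean(root_dists),
        "AvgChildren": mean(children),
        "AvgSibling": mean(siblings),
    }


# Invented toy counts, NOT taken from ItWac.
toy_freq = {
    "casa": 900, "tempo": 800, "anno": 700, "libro": 400, "strada": 300,
    "ecologia": 50, "biosfera": 12, "fotosintesi": 8, "xenolito": 2,
}
simple_nouns, complex_nouns = split_simple_complex(toy_freq)

# "Il bambino legge un libro interessante": nouns at positions 2 and 5.
toy_heads = [2, 3, 0, 5, 3, 5]
features = noun_features(toy_heads, [2, 5])
print(simple_nouns, complex_nouns, features)
```

In a real setting the head list would come from the parser output for each sentence, and the per-noun values (rather than only their averages) would be kept so that two samples can be compared with a rank-sum test.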