=Paper= {{Paper |id=Vol-2253/paper39 |storemode=property |title=Lexicon and Syntax: Complexity across Genres and Language Varieties |pdfUrl=https://ceur-ws.org/Vol-2253/paper39.pdf |volume=Vol-2253 |authors=Pietro Dell'Oglio,Dominique Brunato,Felice Dell'Orletta |dblpUrl=https://dblp.org/rec/conf/clic-it/DellOglioBD18 }} ==Lexicon and Syntax: Complexity across Genres and Language Varieties== https://ceur-ws.org/Vol-2253/paper39.pdf

Lexicon and Syntax: Complexity across Genres and Language Varieties
Pietro dell’Oglio (1), Dominique Brunato (2), Felice Dell’Orletta (2)

(1) University of Pisa
pietrodelloglio@live.it

(2) Istituto di Linguistica Computazionale “Antonio Zampolli” (ILC–CNR)
ItaliaNLP Lab - www.italianlp.it
{dominique.brunato, felice.dellorletta}@ilc.cnr.it

Abstract

English. This paper presents first results of ongoing work investigating the interplay between lexical complexity and syntactic complexity with respect to the nominal lexicon, and how it is affected by textual genre and by the level of linguistic complexity within a genre. A cross-genre analysis is carried out for the Italian language using multi-level linguistic features automatically extracted from dependency-parsed corpora.

Italiano. Questo articolo presenta i primi risultati di un lavoro in corso volto a indagare la relazione tra complessità lessicale e complessità sintattica rispetto al lessico nominale e in che modo sia influenzata dal genere testuale e dal livello di complessità linguistica interno al genere. Un’analisi comparativa su più generi è condotta per la lingua italiana usando caratteristiche linguistiche multi-livello estratte automaticamente da corpora annotati fino alla sintassi a dipendenze.

1 Introduction

Linguistic complexity is a multifaceted notion which has been addressed from different perspectives. One established dichotomy distinguishes a “global” vs a “local” perspective, where the former considers the complexity of the language as a whole and the latter focuses on complexity within each subdomain, i.e. phonology, morphology, syntax, discourse (Miestamo, 2008). While measuring global complexity is a very ambitious and probably hopeless endeavor, measuring local complexities is perceived as a more doable task (Kortmann and Szmrecsanyi, 2012). Indeed, the level of complexity within each subdomain has been formalized in terms of distinct parameters that capture either internal properties of the language (in the “absolute” notion of complexity) or phenomena correlating with processing difficulties from the language user’s viewpoint (in the “relative” notion of complexity) (Miestamo, 2008). For instance, complexity at the lexical level has been computed in terms of length (measured in characters or syllables), of frequency either of the whole surface word (Ryder and Slater, 1988; Chiari and De Mauro, 2014) or of its internal components (see e.g. the root frequency effect (Burani, 2006)), of ambiguity and of familiarity, among others. At the syntactic level, much attention has been paid to canonicity effects due to word order variation (Diessel, 2005; Hawkins, 1994; Futrell et al., 2015), as well as to long-distance dependencies (Gibson, 1998; Gibson, 2000), which have been shown to affect a wide range of psycholinguistic phenomena, such as the subject/object relative clause asymmetry or the garden path effect in main verb/reduced-relative ambiguities.

An interesting question addressed by recent corpus-driven research is how language complexity is affected by textual genre. At the syntactic level, the study by Wang and Liu (2017) on ten genres taken from the British National Corpus showed that genre-specific stylistic factors have an influence on the distribution of dependency distances and dependency direction. Similarly for Italian, Brunato and Dell’Orletta (2017) investigated the influence of genre, and of the level of complexity within genre, on a range of factors of syntactic complexity automatically computed from dependency-parsed corpora. Inspired by that work, we also intend to analyze the effect of genre on linguistic complexity. However, unlike the dominant local approach, where each subdomain is typically studied in isolation, our contribution addresses the interrelation between different levels, i.e. lexicon and syntax. Specifically, we investigate the following questions:

• to what extent is lexical complexity influenced by genre?

• to what extent is lexical complexity influenced by the level of complexity within the same genre?

• is there a correlation between lexical complexity and syntactic complexity? Does it vary according to genre and level of complexity within the same genre?

To answer these questions, we conducted an in-depth analysis for the Italian language based on automatically dependency-parsed corpora, aimed at assessing i) the distribution of simple and complex nominal lexicon in different genres and in different language varieties of the same genre, ii) the syntactic role borne by the “simple” and “complex” nouns characterizing each corpus, and iii) the correlation between “simple” and “complex” nouns and features of complexity underlying the syntactic structure in which they occur.

In what follows we first describe the corpora considered in this study. We then illustrate how lexical and syntactic complexity have been formalized. In Section 4 we discuss some preliminary findings obtained from the comparative investigation across corpora.

2 The Corpora

Four genres were considered in this study: Journalism, Scientific prose, Educational writing and Narrative. For each genre, we chose two corpora, selected to be representative of a complex and of a simple language variety for that genre. The level of complexity was established according to the expected target audience.

The Journalistic corpora are Repubblica (Rep) for the complex variety, and Due Parole (2Par) for the simple one. Rep is a corpus of 232,908 tokens made of all articles published between 2000 and 2005 in the newspaper of the same name; 2Par contains 322 articles taken from the easy-to-read magazine Due Parole (www.dueparole.it), for a total of about 73K tokens.

The corpora representative of Scientific writing are Scientific articles (ScientArt) for the complex language variety, and Wikipedia articles (WikiArt) for the simple one. The former is made of 84 documents (471,969 tokens) covering various topics in the scientific literature. The latter is made of 293 documents (about 205K tokens) extracted from the Italian web portal “Ecology and Environment” of Wikipedia.

For the Educational writing corpora we relied on two collections of school textbooks: the ‘complex’ one (EduAdu) contains 70 texts (48,103 tokens) targeting high school students, the ‘simple’ one (EduChi) a sample of 127 texts (48,036 tokens) targeting primary school students.

Finally, the Narrative corpora are composed of the original versions of Terence and Teacher (TTorig), for the complex pole, and the corresponding simplified versions for the simple pole. Terence, which is named after the EU Terence Project (www.terenceproject.eu), is made of 32 documents, covering short novels for children. Teacher contains 24 documents extracted from web sites dedicated to educational resources for teachers. All Terence and Teacher texts have a simpler version (TTsemp), which is the result of a manual simplification process as described by Brunato and Dell’Orletta (2017).

All corpora were automatically tagged by the part-of-speech tagger described in Dell’Orletta (2009) and dependency parsed by the DeSR parser described in Attardi et al. (2009).

3 Features of Linguistic Complexity

3.1 Assessment of Lexical Complexity

For each corpus we extracted all lemmas tagged as nouns, without considering proper nouns, and we classified them as ‘simple’ vs ‘complex’ nouns. Such a distinction was established according to their frequency, which is one of the most used parameters to assess the complexity of vocabulary (see Section 1). Frequency was here computed with respect to a reference corpus, i.e. ItWac (Baroni et al., 2009), which was chosen since it is the biggest corpus available for standard Italian, thus offering a reliable resource to evaluate word frequency on a large scale. After ranking all nouns by frequency, we pruned those with a frequency value ≤ 3; for each corpus, we then kept the first quarter of nouns as representative of the sample of simple nouns and the last quarter as representative of the sample of complex nouns.
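The frequency-based split of Section 3.1 can be sketched as follows. This is only an illustration under stated assumptions, not the authors' code: the function name `split_simple_complex`, the dictionary `ref_freq` and the toy counts are hypothetical; the ≤ 3 threshold is assumed to apply to the reference-corpus frequency; nouns missing from the reference list are assumed to count as frequency 0; and the quarter size is rounded down.

```python
def split_simple_complex(noun_lemmas, ref_freq, min_freq=3):
    """Return (simple, complex) noun sets for one corpus.

    noun_lemmas: iterable of common-noun lemmas found in the corpus.
    ref_freq:    hypothetical mapping lemma -> frequency in a reference corpus.
    """
    # Prune nouns whose reference frequency is <= min_freq
    # (nouns absent from the reference list count as 0: an assumption).
    kept = {n for n in noun_lemmas if ref_freq.get(n, 0) > min_freq}
    # Rank from most to least frequent; ties are broken alphabetically
    # so that the result is deterministic.
    ranked = sorted(kept, key=lambda n: (-ref_freq[n], n))
    quarter = len(ranked) // 4  # rounding down is an assumption
    simple = set(ranked[:quarter])                   # first quarter
    complex_ = set(ranked[len(ranked) - quarter:])   # last quarter
    return simple, complex_


if __name__ == "__main__":
    # Invented counts, for illustration only.
    toy_freq = {"casa": 1000, "tempo": 900, "anno": 800, "cosa": 700,
                "gatto": 50, "albero": 40, "epistemologia": 6,
                "palingenesi": 4, "zzz": 2}
    simple, complex_ = split_simple_complex(toy_freq.keys(), toy_freq)
    print(sorted(simple), sorted(complex_))
```

Since the split is computed per corpus, the same lemma could fall in the simple quarter of one corpus and outside it in another; the sketch makes that dependence explicit by taking the corpus noun list as an argument.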
3.2 Assessment of Syntactic Complexity

To investigate our main research questions, that is, how lexical complexity affects syntactic complexity and the possible influence of genre and language variety on this relationship, we focused on a set of features automatically extracted from the sentence parse tree. These features were chosen since they are acknowledged to be predictors of phenomena of structural complexity, as demonstrated by their use in different scenarios, such as the assessment of learners’ language development or of the level of text readability (e.g. Collins-Thompson, 2014; Cimino et al., 2013; Dell’Orletta et al., 2014).

For each corpus, all the considered features were computed for all occurring nouns, for the subset of complex nouns and for the subset of simple nouns. Specifically, we focused on the following ones:

• The linear distance (in terms of tokens) separating the noun from its syntactic head (HeadDistance in all following Tables)

• The hierarchical distance (in terms of dependency arcs) separating the noun from the root of the tree (RootDistance)

• The average number of children per noun (AvgChildren)

• The average number of siblings per noun (AvgSibling)

4 Discussion

To have a first insight into the effect of genre and language variety on the interplay between lexical and syntactic complexity, we compared the main syntactic roles that nouns play in the sentence by calculating the frequency of all dependency types linking a noun to its head. This is shown in Figure 1, which reports the percentage distribution of typed dependency relationships linking a noun to its syntactic head across all corpora. For each corpus there are three columns: the first one considers data for all nouns of each corpus without any complexity label, the second one only data for the simple noun subset and the last one only data for the complex noun subset.

It can be noted that nouns used as prepositional complements (prep) have the highest share in all corpora, although with differences ranging from the lowest percentage (35.5%) in the ‘easy’ version of the narrative corpus (i.e. TTsemp) to the highest one (49.9%) in ScientArt (i.e. the complex language variety for the scientific writing genre). The syntactic role of prepositional complement is played more often by simple nouns than by complex nouns. This is particularly evident in ScientArt and Repubblica, where the difference between simple and complex nouns occurring as prepositional complements amounts to 20 and 15 percentage points, respectively. Conversely, complex nouns are more widely used as modifiers than simple nouns, especially in Repubblica. The percentage of nouns occurring in the subject and object position is less than 20% in all corpora. Interestingly, the highest occurrence of nominal subjects is attested in 2Par and EduChi (14.1% and 16%, respectively). This might suggest that simpler language varieties, independently of genre, make more use of explicit subjects than of implicit or pronominal ones. Besides, whether a noun is simple or complex does not particularly affect the overall presence of nominal subjects, except for ScientArt and Rep, which both show a higher percentage of simple nouns in the subject position.

Figure 1: Distribution of typed syntactic dependencies linking nouns to their head across corpora. For each corpus, the first column refers to all nouns; the second one to the subset of simple nouns; the third one to the subset of complex nouns.

A deeper understanding of the relationship between lexical and syntactic complexity was provided by the investigation of the syntactic features described in Section 3.2. Table 1 shows the average value of the monitored features with respect to all nouns (All), to the subset of complex nouns (Comp) and to the subset of simple nouns (Simp) extracted from all corpora. We assessed whether the variation between these feature values was statistically significant in three different comparative scenarios: i) between the two corpora of the same genre, ii) between the complex corpora of each different genre and iii) between the simple corpora of each different genre. Table 2 shows the linguistic features varying significantly for all the considered comparisons according to the Wilcoxon rank-sum test, a non-parametric statistical test for two independent samples (Wild, 1997). If we compare the two language varieties within each genre, it can be seen, for instance, that nouns are hierarchically more distant from the root in the complex than in the simple version. Such a variation, which is highly significant for all genres, affects more the Journalistic genre (2Par: 2.969; Rep: 4.197) and, to a lesser extent, the Educational one (EduChi: 3.408; EduAdu: 4.269). However, for the other monitored syntactic features, the Wiki corpus appears as slightly more difficult than its complex counterpart: it has nouns that are less close to their head (Wiki: 2.531; ArtScient: 2.162) and have a richer structure in terms of number of children (Wiki: 1.363; ArtScient: 1.229). With the exception of root distance, variations concerning other features within the Narrative genre are not statistically significant. This may be due to the particular composition of the two selected corpora: indeed, both Terence and Teacher texts in their original version were already conceived for an audience of children and young students, and they were not greatly modified in their simplified version.

           HeadDistance           AvgChildren            AvgSibling             RootDistance
           All    Comp   Simp     All    Comp   Simp     All    Comp   Simp     All    Comp   Simp
2Par       2.252  2.342  2.256    1.318  1.218  1.345    1.675  1.956  1.580    2.969  2.816  2.993
Rep        2.210  2.271  2.272    1.213  0.979  1.323    1.558  1.509  1.564    4.197  4.314  4.131
Wiki       2.531  2.686  2.625    1.363  1.138  1.528    1.603  1.897  1.592    4.284  4.346  4.097
ArtScient  2.162  2.391  2.409    1.229  1.066  1.388    1.399  1.487  1.418    4.835  5.132  4.598
EduChi     2.177  2.338  2.171    1.311  1.303  1.353    1.523  1.621  1.458    3.408  3.387  3.388
EduAdu     2.598  2.875  2.695    1.440  1.375  1.560    1.654  1.715  1.640    4.269  4.483  4.143
TTsemp     2.167  2.334  2.172    1.342  1.335  1.470    1.690  1.789  1.659    3.017  2.953  2.882
TTorig     2.252  2.399  2.269    1.339  1.333  1.439    1.681  1.705  1.697    3.268  3.200  3.169

Table 1: Average value of the monitored syntactic features with respect to all nouns (All), to the subset of complex nouns (Comp) and to the subset of simple nouns (Simp) extracted from all the examined corpora. Wiki and ArtScient denote the WikiArt and ScientArt corpora of Section 2.

We finally assessed whether the variation of these features was statistically significant when comparing the simple and the complex noun subsets of the same corpus (Table 3). Along this dimension, we can observe that complex nouns have, on average, fewer dependents (AvgChildren feature) than simple ones, independently of the internal distinction within genre; on the contrary, they tend to occur more distant from the root, especially in the complex variety of Scientific prose (ArtScient Comp: 5.132; ArtScient Simp: 4.598).

5 Conclusion

While language complexity is a central topic in linguistic and computational linguistics research, it is typically addressed from a local perspective,
where each subdomain is investigated in isolation. In this preliminary work, we have defined a method to study the interplay between lexical and syntactic complexity restricted to the nominal domain. We modeled the two notions in terms of frequency, with respect to lexical complexity, and of a set of parse tree features formalizing phenomena of syntactic complexity. Our approach was tested on corpora selected to be representative of different genres and of different levels of complexity within each genre, in order to investigate whether noun complexity affects syntactic complexity differently along the two dimensions. We observed e.g. that nouns tend to appear closer to the root in simple language varieties, independently of genre, while the effect of genre and linguistic complexity is less sharp with respect to the other considered features.

To have a deeper understanding of the observed tendencies, we are currently carrying out a more in-depth analysis focusing on fine-grained features of syntactic complexity, such as the depth of the nominal subtree. Further, we would like to extend this approach to other constituents of the sentence, such as the verb.

Acknowledgments

The work presented in this paper was partially supported by the 2–year project (2017-2019) PERFORMA – Personalizzazione di pERcorsi FORMativi Avanzati, funded by Regione Toscana (Progetti Congiunti di Alta Formazione – POR FSE 2014-2020 Asse A – Occupazione) in collaboration with Meta srl company.

                      HeadDistance  AvgChildren   AvgSibling    RootDistance
                      All C   S     All C   S     All C   S     All C   S
2Par vs Rep           ✓*  ✓*  ✓*    ✓*  ✓*  ✓*    ✓*  ✓*  ✓     ✓*  ✓*  ✓*
Wiki vs ArtScient     ✓*  ✓*  ✓*    ✓*  ✗   ✓*    ✓*  ✓*  ✓*    ✓*  ✓*  ✓*
EduChi vs EduAdu      ✗   ✗   ✗     ✓*  ✗   ✓     ✓   ✗   ✗     ✓*  ✓*  ✓*
TTsemp vs TTorig      ✗   ✗   ✗     ✗   ✗   ✗     ✗   ✗   ✗     ✓*  ✓   ✓*
ArtScient vs EduAdu   ✓*  ✓*  ✓*    ✓*  ✓*  ✓*    ✓*  ✓*  ✓*    ✓*  ✓*  ✓*
Rep vs ArtScient      ✓*  ✗   ✓*    ✓*  ✓*  ✓*    ✓*  ✓*  ✓*    ✓*  ✓*  ✓*
Rep vs EduAdu         ✓*  ✓*  ✓*    ✓*  ✓*  ✓*    ✓   ✓   ✗     ✓   ✗   ✗
Rep vs TTorig         ✓*  ✓*  ✓*    ✓*  ✓*  ✓*    ✓*  ✓*  ✓*    ✓*  ✓*  ✓*
TTorig vs ArtScient   ✓*  ✓*  ✓*    ✓*  ✓*  ✓     ✓*  ✓*  ✓*    ✓*  ✓*  ✓*
TTorig vs EduAdu      ✗   ✗   ✗     ✓*  ✗   ✓     ✓*  ✗   ✓*    ✓*  ✓*  ✓*
2Par vs EduChi        ✓   ✓*  ✗     ✓*  ✓*  ✗     ✓*  ✗   ✓     ✓*  ✓*  ✓*
2Par vs TTsemp        ✗   ✓*  ✗     ✓*  ✓*  ✗     ✓   ✗   ✓*    ✗   ✗   ✓*
2Par vs Wiki          ✓*  ✗   ✓*    ✓*  ✓   ✓*    ✓*  ✗   ✓*    ✓*  ✓*  ✓*
TTsemp vs EduChi      ✗   ✗   ✗     ✗   ✗   ✗     ✓*  ✗   ✓*    ✓*  ✓*  ✓*
TTsemp vs Wiki        ✓*  ✓*  ✓*    ✗   ✓*  ✗     ✓*  ✗   ✓*    ✓*  ✓*  ✓*
Wiki vs EduChi        ✓*  ✓*  ✓*    ✗   ✓*  ✓     ✗   ✗   ✗     ✓*  ✓*  ✓*

Table 2: Syntactic features that vary in a statistically significant way between the simple and the complex corpus of the same genre, between the complex corpora of each genre and between the simple corpora of each genre. “✗” means a non-significant variation; “✓” means a significant variation at p < 0.05; “✓*” means a very significant variation at p < 0.01. All=all nouns; C=complex nouns; S=simple nouns.

                                  HeadDistance  AvgChildren  AvgSibling  RootDistance
2ParSostS vs 2ParSostC            ✓*            ✓*           ✓*          ✓*
RepSostS vs RepSostC              ✓*            ✓*           ✗           ✓*
WikiSostS vs WikiSostC            ✗             ✓*           ✓*          ✓*
ArtScientSostS vs ArtScientSostC  ✓*            ✓*           ✗           ✓*
EduChiSostS vs EduChiSostC        ✗             ✓            ✗           ✗
EduAduSostS vs EduAduSostC        ✓             ✓*           ✗           ✓*
TTsempSostS vs TTsempSostC        ✓*            ✗            ✗           ✗
TTorigSostS vs TTorigSostC        ✓*            ✓            ✗           ✗

Table 3: Linguistic features that vary in a statistically significant way between the simple and the complex nouns of the same corpus. “✗” means a non-significant variation; “✓” means a significant variation at p < 0.05; “✓*” means a very significant variation at p < 0.01. SostS=simple nouns; SostC=complex nouns.
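The comparisons summarized in Tables 1 to 3 rest on two steps: computing the parse-tree measures of Section 3.2 for each noun, and comparing two samples of such values with a rank-sum test. The sketch below illustrates both under loud assumptions. The token layout, the function names `noun_features` and `rank_sum_test`, and all toy values are hypothetical and are not the DeSR output format or the authors' code. RootDistance is assumed to be 0 for the root token itself. The test uses the plain normal approximation with average ranks for ties and no tie or continuity correction; the paper does not state which variant or implementation was used.

```python
import math

# Hypothetical token layout: (token_id, lemma, coarse_pos, head_id),
# with head_id == 0 marking the sentence root.
SENTENCE = [
    (1, "il", "DET", 2),
    (2, "cane", "NOUN", 3),
    (3, "mangiare", "VERB", 0),
    (4, "la", "DET", 5),
    (5, "mela", "NOUN", 3),
    (6, "rosso", "ADJ", 5),
]


def noun_features(sentence):
    """Per-noun values of the four measures listed in Section 3.2."""
    head = {tid: h for tid, _, _, h in sentence}
    children = {}
    for tid, h in head.items():
        children.setdefault(h, []).append(tid)
    rows = []
    for tid, lemma, pos, h in sentence:
        if pos != "NOUN":
            continue
        # Count the arcs climbed until the root token is reached.
        depth, node = 0, tid
        while head[node] != 0:
            node = head[node]
            depth += 1
        rows.append({
            "lemma": lemma,
            "HeadDistance": abs(tid - h) if h else 0,  # linear, in tokens
            "RootDistance": depth,
            "Children": len(children.get(tid, [])),    # mean -> AvgChildren
            "Siblings": len(children[h]) - 1,          # mean -> AvgSibling
        })
    return rows


def rank_sum_test(x, y):
    """Wilcoxon rank-sum z statistic and two-sided p (normal approximation)."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = average_rank
        i = j
    n1, n2 = len(x), len(y)
    w = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    mean_w = n1 * (n1 + n2 + 1) / 2
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean_w) / sd_w
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


if __name__ == "__main__":
    print(noun_features(SENTENCE))
    # Invented HeadDistance samples for simple and complex nouns.
    simple_sample = [1, 1, 2, 1, 2, 1, 3, 1]
    complex_sample = [2, 3, 4, 2, 5, 3, 4, 6]
    z, p = rank_sum_test(simple_sample, complex_sample)
    print(round(z, 3), round(p, 4))
```

With real data, one sample would hold the values of a feature for the simple nouns of a corpus and the other the values for its complex nouns (as in Table 3), or the values from two different corpora (as in Table 2); a library routine would normally replace the hand-written test.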
References

Giuseppe Attardi, Felice Dell’Orletta, Maria Simi, and Joseph Turian. 2009. Accurate dependency parsing with a stacked multilayer perceptron. In Proceedings of EVALITA 2009 - Evaluation of NLP and Speech Tools for Italian 2009, Reggio Emilia, Italy, December 2009.

Marco Baroni, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. 2009. The WaCky wide web: A collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43:3, pp. 209-226.

Dominique Brunato and Felice Dell’Orletta. 2017. On the order of words in Italian: a study on genre vs complexity. International Conference on Dependency Linguistics (Depling 2017), 18-20 September 2017, Pisa, Italy.

Cristina Burani. 2006. Morfologia: i processi. In: A. Laudanna and M. Voghera (cur.) Il linguaggio. Strutture. Strutture linguistiche e processi cognitivi. Bari, Laterza, 2006.

Isabella Chiari and Tullio De Mauro. 2014. The New Basic Vocabulary of Italian as a linguistic resource. Proceedings of the First Italian Conference on Computational Linguistics (CLIC-IT), Pisa, 15-19 December 2014.

Andrea Cimino, Felice Dell’Orletta, Giulia Venturi, and Simonetta Montemagni. 2013. Linguistic Profiling based on General–purpose Features and Native Language Identification. Proceedings of Eighth Workshop on Innovative Use of NLP for Building Educational Applications, Atlanta, Georgia, June 13, pp. 207-215.

Kevyn Collins-Thompson. 2014. Computational Assessment of text readability. Recent Advances in Automatic Readability Assessment and Text Simplification. Special issue of International Journal of Applied Linguistics, 165:2, John Benjamins Publishing Company, 97-135.

Felice Dell’Orletta. 2009. Ensemble system for part-of-speech tagging. In Proceedings of EVALITA 2009 - Evaluation of NLP and Speech Tools for Italian 2009, Reggio Emilia, Italy, December 2009.

Holger Diessel. 2005. Competing motivations for the ordering of main and adverbial clauses. Linguistics, 43 (3): 449–470.

Richard Futrell, Kyle Mahowald, and Edward Gibson. 2015. Quantifying word order freedom in dependency corpora. In Proceedings of the Third International Conference on Dependency Linguistics (Depling 2015), 91–100.

Edward Gibson. 1998. Linguistic complexity: Locality of syntactic dependencies. Cognition, 68:1–76.

Edward Gibson. 2000. The dependency Locality Theory: A distance–based theory of linguistic complexity. In W.O.A. Marants and Y. Miyashita (Eds.), Image, Language and Brain, Cambridge, MA: MIT Press, pp. 95–126.

Daniel Gildea and David Temperley. 2010. Do Grammars Minimize Dependency Length? Cognitive Science, 34(2):286–310.

John A. Hawkins. 1994. A performance theory of order and constituency. Cambridge Studies in Linguistics, Cambridge University Press, 73.

Felice Dell’Orletta, Martjin Wieling, Andrea Cimino, Giulia Venturi, and Simonetta Montemagni. 2014. Assessing the Readability of Sentences: Which Corpora and Features. Proceedings of the 9th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2014), Baltimore, Maryland, USA.

Matti Miestamo. 2008. Grammatical complexity in a crosslinguistic perspective. In: Miestamo M., Sinnemäki K. and Karlsson F. (eds), Language Complexity: Typology, Contact, Change, Amsterdam: Benjamins, 23–41.

Randall James Ryder and Wayne H. Slater. 1988. The relationship between word frequency and word knowledge. The Journal of Educational Research, 81(5):312–317.

Bernd Kortmann and Benedikt Szmrecsanyi. 2012. Linguistic Complexity. Second Language Acquisition, Indigenization, Contact. Berlin, Boston: De Gruyter.

Yaqin Wang and Haitao Liu. 2017. The effects of genre on dependency distance and dependency direction. Language Sciences, 59, 135–157.

Chris Wild. 1997. The Wilcoxon Rank-Sum Test. University of Auckland, Department of Statistics.