Lexicon and Syntax: Complexity across Genres and Language Varieties
Pietro dell’Oglio•, Dominique Brunato⋄, Felice Dell’Orletta⋄

• University of Pisa
pietrodelloglio@live.it

⋄ Istituto di Linguistica Computazionale “Antonio Zampolli” (ILC–CNR)
ItaliaNLP Lab - www.italianlp.it
{dominique.brunato, felice.dellorletta}@ilc.cnr.it


Abstract

English. This paper presents the first results of ongoing work investigating the interplay
between lexical and syntactic complexity with respect to the nominal lexicon, and how this
interplay is affected by textual genre and by the level of linguistic complexity within a genre.
A cross-genre analysis is carried out for Italian using multi-level linguistic features
automatically extracted from dependency-parsed corpora.

Italiano. Questo articolo presenta i primi risultati di un lavoro in corso volto a indagare la
relazione tra complessità lessicale e complessità sintattica rispetto al lessico nominale e in
che modo sia influenzata dal genere testuale e dal livello di complessità linguistica interno al
genere. Un’analisi comparativa su più generi è condotta per la lingua italiana usando
caratteristiche linguistiche multi-livello estratte automaticamente da corpora annotati fino
alla sintassi a dipendenze.

1    Introduction

Linguistic complexity is a multifaceted notion which has been addressed from different
perspectives. One established dichotomy distinguishes a “global” from a “local” perspective,
where the former considers the complexity of the language as a whole and the latter focuses on
complexity within each subdomain, i.e. phonology, morphology, syntax, discourse (Miestamo, 2008).
While measuring global complexity is a very ambitious and probably hopeless endeavor, measuring
local complexities is perceived as a more doable task (Kortmann and Szmrecsanyi, 2012). The
level of complexity within each subdomain has indeed been formalized in terms of distinct
parameters that capture either internal properties of the language (in the “absolute” notion of
complexity) or phenomena correlating with processing difficulties from the language user’s
viewpoint (in the “relative” notion of complexity) (Miestamo, 2008). For instance, complexity at
the lexical level has been computed in terms of length (measured in characters or syllables), of
frequency either of the whole surface word (Ryder and Slater, 1988; Chiari and De Mauro, 2014)
or of its internal components (see e.g. the root frequency effect (Burani, 2006)), and of
ambiguity and familiarity, among others. At the syntactic level, much attention has been paid to
canonicity effects due to word order variation (Diessel, 2005; Hawkins, 1994; Futrell et al.,
2015), as well as to long-distance dependencies (Gibson, 1998; Gibson, 2000), which have been
shown to affect a wide range of psycholinguistic phenomena, such as the subject/object relative
clause asymmetry or the garden path effect in main verb/reduced-relative ambiguities.

An interesting question addressed by recent corpus-driven research is how language complexity is
affected by textual genre. At the syntactic level, the study by Wang and Liu (2017) on ten genres
taken from the British National Corpus showed that genre-specific stylistic factors influence
the distribution of dependency distances and dependency direction. Similarly for Italian,
Brunato and Dell’Orletta (2017) investigated the influence of genre, and of the level of
complexity within genre, on a range of factors of syntactic complexity automatically computed
from dependency-parsed corpora. Inspired by that work, we also intend to analyze the effect of
genre on linguistic complexity. However, unlike the dominant local approach, where each
subdomain is typically studied in isolation, our contribution addresses the interrelation
between different levels, i.e. lexicon and syntax. Specifically, we investigate the following
questions:

    • to what extent is lexical complexity influenced by genre?

    • to what extent is lexical complexity influenced by the level of complexity within the
      same genre?

    • is there a correlation between lexical complexity and syntactic complexity? Does it vary
      according to genre and level of complexity within the same genre?

To answer these questions, we conducted an in-depth analysis for the Italian language based on
automatically dependency-parsed corpora, aimed at assessing i) the distribution of simple and
complex nominal lexicon in different genres and in different language varieties of the same
genre; ii) the syntactic roles played by the “simple” and “complex” nouns characterizing each
corpus; iii) the correlation between “simple” and “complex” nouns and the features of complexity
underlying the syntactic structure in which they occur.

In what follows we first describe the corpora considered in this study (Section 2). We then
illustrate how lexical and syntactic complexity have been formalized (Section 3). In Section 4
we discuss some preliminary findings obtained from the comparative investigation across corpora.

2    The Corpora

Four genres were considered in this study: Journalism, Scientific prose, Educational writing and
Narrative. For each genre, we chose two corpora, selected to be representative of a complex and
of a simple language variety for that genre. The level of complexity was established according
to the expected target audience.

The Journalistic corpora are Repubblica (Rep) for the complex variety, and Due Parole (2Par) for
the simple one. Rep is a corpus of 232,908 tokens made of all articles published between 2000
and 2005 in the newspaper of the same name; 2Par contains 322 articles taken from the
easy-to-read magazine Due Parole (www.dueparole.it), for a total of about 73K tokens.

The corpora representative of Scientific writing are Scientific articles (ScientArt) for the
complex language variety, and Wikipedia articles (WikiArt) for the simple one. The former is
made of 84 documents (471,969 tokens) covering various topics in the scientific literature. The
latter is made of 293 documents (about 205K tokens) extracted from the Italian web portal
“Ecology and Environment” of Wikipedia.

For the Educational writing corpora we relied on two collections of school textbooks: the
‘complex’ one (EduAdu) contains 70 texts (48,103 tokens) targeting high school students, while
the ‘simple’ one (EduChi) is a sample of 127 texts (48,036 tokens) targeting primary school
students.

Finally, the Narrative corpora are composed of the original versions of Terence and Teacher
(TTorig), for the complex pole, and the corresponding simplified versions for the simple pole.
Terence, which is named after the EU Terence Project (www.terenceproject.eu), is made of 32
documents, covering short novels for children. Teacher contains 24 documents extracted from web
sites dedicated to educational resources for teachers. All Terence and Teacher texts have a
simpler version (TTsemp), which is the result of a manual simplification process as described by
Brunato and Dell’Orletta (2017).

All corpora were automatically tagged by the part-of-speech tagger described in (Dell’Orletta,
2009) and dependency parsed by the DeSR parser described in (Attardi et al., 2009).

3    Features of Linguistic Complexity

3.1    Assessment of Lexical Complexity

For each corpus we extracted all lemmas tagged as nouns, excluding proper nouns, and we
classified them as ‘simple’ vs ‘complex’ nouns. This distinction was established according to
their frequency, which is one of the most widely used parameters for assessing the complexity of
vocabulary (see Section 1). Frequency was computed with respect to a reference corpus, i.e.
itWaC (Baroni et al., 2009), which was chosen because it is the biggest corpus available for
standard Italian and thus offers a reliable resource for evaluating word frequency on a large
scale. After ranking all nouns by frequency, we pruned those with a frequency value ≤ 3. For
each corpus, we then kept the first quarter of the ranked nouns as the sample of simple nouns
and the last quarter as the sample of complex nouns.
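The frequency-based split described in Section 3.1 can be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the function name `split_simple_complex`, the dictionary of counts standing in for the reference-corpus frequencies, and the toy numbers are all hypothetical, as are the choices to rank distinct lemmas and to break frequency ties alphabetically.

```python
from typing import Dict, Iterable, List, Tuple


def split_simple_complex(
    noun_lemmas: Iterable[str],
    reference_freq: Dict[str, int],
    min_freq: int = 3,
) -> Tuple[List[str], List[str]]:
    """Return (simple, complex) noun lemmas by reference-corpus frequency."""
    # Keep each lemma once and prune those with frequency <= min_freq.
    kept = {n for n in noun_lemmas if reference_freq.get(n, 0) > min_freq}
    # Rank from most to least frequent; ties are broken alphabetically
    # (an arbitrary choice made here for reproducibility).
    ranked = sorted(kept, key=lambda n: (-reference_freq[n], n))
    quarter = len(ranked) // 4
    simple = ranked[:quarter]                  # first quarter: most frequent
    complex_ = ranked[len(ranked) - quarter:]  # last quarter: least frequent
    return simple, complex_


# Toy data: the counts are invented for illustration only.
TOY_FREQ = {
    "casa": 900, "anno": 800, "vita": 700, "mano": 600,
    "fiume": 50, "gelo": 40, "rovo": 10, "ghiro": 5, "zufolo": 3,
}
TOY_NOUNS = ["casa", "anno", "vita", "mano", "fiume",
             "gelo", "rovo", "ghiro", "zufolo", "casa"]

simple_nouns, complex_nouns = split_simple_complex(TOY_NOUNS, TOY_FREQ)
print(simple_nouns)   # ['casa', 'anno']
print(complex_nouns)  # ['rovo', 'ghiro']
```

In the toy run, "zufolo" is pruned because its count is not above the threshold, which leaves eight ranked lemmas and therefore two nouns in each quarter.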
3.2    Assessment of Syntactic Complexity

To investigate our main research questions, that is, how lexical complexity affects syntactic
complexity and the possible influence of genre and language variety on this relationship, we
focused on a set of features automatically extracted from the sentence parse tree. These
features were chosen because they are acknowledged to be predictors of phenomena of structural
complexity, as demonstrated by their use in different scenarios, such as the assessment of
learners’ language development or of the level of text readability (e.g. Collins-Thompson, 2014;
Cimino et al., 2013; Dell’Orletta et al., 2014).

For each corpus, all the considered features were computed for all occurring nouns, for the
subset of complex nouns and for the subset of simple nouns. Specifically, we focused on the
following ones:

    • The linear distance (in terms of tokens) separating the noun from its syntactic head
      (HeadDistance in all following Tables)

    • The hierarchical distance (in terms of dependency arcs) separating the noun from the root
      of the tree (RootDistance)

    • The average number of children per noun (AvgChildren)

    • The average number of siblings per noun (AvgSibling)

4    Discussion

To gain a first insight into the effect of genre and language variety on the interplay between
lexical and syntactic complexity, we compared the main syntactic roles that nouns play in the
sentence by calculating the frequency of all dependency types linking a noun to its head. This
is shown in Figure 1, which reports the percentage distribution of typed dependency
relationships linking a noun to its syntactic head across all corpora. For each corpus there are
three columns: the first one considers data for all nouns of the corpus without any complexity
label, the second one only data for the simple noun subset and the last one only data for the
complex noun subset.

Figure 1: Distribution of typed syntactic dependencies linking nouns to their head across
corpora. For each corpus, the first column refers to all nouns; the second one to the subset of
simple nouns; the third one to the subset of complex nouns.

It can be noted that nouns used as prepositional complements (prep) are the most frequent across
all corpora, although their share ranges from the lowest percentage (35.5%) in the ‘easy’
version of the narrative corpus (i.e. TTsemp) to the highest one (49.9%) in ScientArt (i.e. the
complex language variety for the scientific writing genre). The syntactic role of prepositional
complement is played more often by simple nouns than by complex nouns. This is particularly
evident in ScientArt and Rep, where the difference between simple and complex nouns occurring as
prepositional complements is equal to 20 and 15 percentage points, respectively. Conversely,
complex nouns are more widely used as modifiers than simple nouns, especially in Rep. The
percentage of nouns occurring in the subject and object position is less than 20% in all
corpora. Interestingly, the highest occurrence of nominal subjects is attested in 2Par and
EduChi (14.1% and 16%, respectively). This might suggest that simpler language varieties,
independently of genre, make more use of explicit subjects than of implicit or pronominal ones.
Besides, whether a noun is simple or complex does not particularly affect the overall presence
of nominal subjects, except for ScientArt and Rep, which both show a higher percentage of simple
nouns in the subject position.

A deeper understanding of the relationship between lexical and syntactic complexity was provided
by the investigation of the syntactic features described in Section 3.2. Table 1 shows the
average value of the monitored features with respect to all nouns (All), to the subset of
complex nouns (Comp) and to the subset of simple nouns (Simp) extracted from all corpora. We
assessed whether the variation between these feature values was statistically significant in
three different comparative scenarios: i) between the two corpora of the same genre, ii) between
the complex corpora of each different genre and iii) between the simple corpora of each
different genre. Table 2 shows the linguistic features varying significantly for all the
considered comparisons according to the Wilcoxon rank-sum test, a non-parametric statistical
test for two independent samples (Wild, 1997).

              HeadDistance            AvgChildren             AvgSibling              RootDistance
              All    Comp   Simp      All    Comp   Simp      All    Comp   Simp      All    Comp   Simp
  2Par        2.252  2.342  2.256     1.318  1.218  1.345     1.675  1.956  1.580     2.969  2.816  2.993
  Rep         2.210  2.271  2.272     1.213  0.979  1.323     1.558  1.509  1.564     4.197  4.314  4.131
  WikiArt     2.531  2.686  2.625     1.363  1.138  1.528     1.603  1.897  1.592     4.284  4.346  4.097
  ScientArt   2.162  2.391  2.409     1.229  1.066  1.388     1.399  1.487  1.418     4.835  5.132  4.598
  EduChi      2.177  2.338  2.171     1.311  1.303  1.353     1.523  1.621  1.458     3.408  3.387  3.388
  EduAdu      2.598  2.875  2.695     1.440  1.375  1.560     1.654  1.715  1.640     4.269  4.483  4.143
  TTsemp      2.167  2.334  2.172     1.342  1.335  1.470     1.690  1.789  1.659     3.017  2.953  2.882
  TTorig      2.252  2.399  2.269     1.339  1.333  1.439     1.681  1.705  1.697     3.268  3.200  3.169

Table 1: Average value of the monitored syntactic features with respect to all nouns (All), to
the subset of complex nouns (Comp) and to the subset of simple nouns (Simp) extracted from all
the examined corpora.

                          HeadDistance      AvgChildren       AvgSibling        RootDistance
                          All  C    S       All  C    S       All  C    S       All  C    S
  2Par vs Rep             ✓*   ✓*   ✓*      ✓*   ✓*   ✓*      ✓*   ✓*   ✓       ✓*   ✓*   ✓*
  WikiArt vs ScientArt    ✓*   ✓*   ✓*      ✓*   ✗    ✓*      ✓*   ✓*   ✓*      ✓*   ✓*   ✓*
  EduChi vs EduAdu        ✗    ✗    ✗       ✓*   ✗    ✓       ✓    ✗    ✗       ✓*   ✓*   ✓*
  TTsemp vs TTorig        ✗    ✗    ✗       ✗    ✗    ✗       ✗    ✗    ✗       ✓*   ✓    ✓*
  ScientArt vs EduAdu     ✓*   ✓*   ✓*      ✓*   ✓*   ✓*      ✓*   ✓*   ✓*      ✓*   ✓*   ✓*
  Rep vs ScientArt        ✓*   ✗    ✓*      ✓*   ✓*   ✓*      ✓*   ✓*   ✓*      ✓*   ✓*   ✓*
  Rep vs EduAdu           ✓*   ✓*   ✓*      ✓*   ✓*   ✓*      ✓    ✓    ✗       ✓    ✗    ✗
  Rep vs TTorig           ✓*   ✓*   ✓*      ✓*   ✓*   ✓*      ✓*   ✓*   ✓*      ✓*   ✓*   ✓*
  TTorig vs ScientArt     ✓*   ✓*   ✓*      ✓*   ✓*   ✓       ✓*   ✓*   ✓*      ✓*   ✓*   ✓*
  TTorig vs EduAdu        ✗    ✗    ✗       ✓*   ✗    ✓       ✓*   ✗    ✓*      ✓*   ✓*   ✓*
  2Par vs EduChi          ✓    ✓*   ✗       ✓*   ✓*   ✗       ✓*   ✗    ✓       ✓*   ✓*   ✓*
  2Par vs TTsemp          ✗    ✓*   ✗       ✓*   ✓*   ✗       ✓    ✗    ✓*      ✗    ✗    ✓*
  2Par vs WikiArt         ✓*   ✗    ✓*      ✓*   ✓    ✓*      ✓*   ✗    ✓*      ✓*   ✓*   ✓*
  TTsemp vs EduChi        ✗    ✗    ✗       ✗    ✗    ✗       ✓*   ✗    ✓*      ✓*   ✓*   ✓*
  TTsemp vs WikiArt       ✓*   ✓*   ✓*      ✗    ✓*   ✗       ✓*   ✗    ✓*      ✓*   ✓*   ✓*
  WikiArt vs EduChi       ✓*   ✓*   ✓*      ✗    ✓*   ✓       ✗    ✗    ✗       ✓*   ✓*   ✓*

Table 2: Syntactic features that vary in a statistically significant way between the simple and
the complex corpus of the same genre, between the complex corpora of each genre and between the
simple corpora of each genre. “✗” means a non-significant variation; “✓” means a significant
variation at p < 0.05; “✓*” means a very significant variation at p < 0.01. All=all nouns;
C=complex nouns; S=simple nouns.

If we compare the two language varieties within each genre, it can be seen, for instance, that
nouns are hierarchically more distant from the root in the complex than in the simple version.
This variation, which is highly significant for all genres, is largest in the Journalistic genre
(2Par: 2.969; Rep: 4.197) and, to a lesser extent, in the Educational one (EduChi: 3.408;
EduAdu: 4.269). However, for the other monitored syntactic features, the WikiArt corpus appears
slightly more difficult than its complex counterpart: its nouns are farther from their head
(WikiArt: 2.531; ScientArt: 2.162) and have a richer structure in terms of number of children
(WikiArt: 1.363; ScientArt: 1.229). With the exception of root distance, variations concerning
the other features within the Narrative genre are not statistically significant. This is
possibly due to the particular composition of the two selected corpora: both Terence and Teacher
texts in their original version were already conceived for an audience of children and young
students, and they were not greatly modified in their simplified version.

We finally assessed whether the variation of these features was statistically significant when
comparing the simple and the complex noun subsets of the same corpus (Table 3). Along this
dimension, we can observe that complex nouns have, on average, fewer dependents (AvgChildren
feature) than simple ones, independently of the internal distinction within genre; on the
contrary, they tend to occur farther from the root, especially in the complex variety of
Scientific prose (ScientArt Comp: 5.132; ScientArt Simp: 4.598).

  Corpus (simple vs complex nouns)   HeadDistance   AvgChildren   AvgSibling   RootDistance
  2Par                               ✓*             ✓*            ✓*           ✓*
  Rep                                ✓*             ✓*            ✗            ✓*
  WikiArt                            ✗              ✓*            ✓*           ✓*
  ScientArt                          ✓*             ✓*            ✗            ✓*
  EduChi                             ✗              ✓             ✗            ✗
  EduAdu                             ✓              ✓*            ✗            ✓*
  TTsemp                             ✓*             ✗             ✗            ✗
  TTorig                             ✓*             ✓             ✗            ✗

Table 3: Linguistic features that vary in a statistically significant way between the simple and
the complex nouns of the same corpus. “✗” means a non-significant variation; “✓” means a
significant variation at p < 0.05; “✓*” means a very significant variation at p < 0.01.

5    Conclusion

While language complexity is a central topic in linguistic and computational linguistics
research, it is typically addressed from a local perspective, where each subdomain is
investigated in isolation. In this preliminary work, we have defined a method to study the
interplay between lexical and syntactic complexity restricted to the nominal domain. We modeled
lexical complexity in terms of frequency, and syntactic complexity in terms of a set of parse
tree features formalizing phenomena of syntactic complexity. Our approach was tested on corpora
selected to be representative of different genres and of different levels of complexity within
each genre, in order to investigate whether noun complexity affects syntactic complexity
differently along the two dimensions. We observed, for example, that nouns tend to appear closer
to the root in simple language varieties, independently of genre, while the effect of genre and
linguistic complexity is less sharp with respect to the other considered features.

To gain a deeper understanding of the observed tendencies, we are currently carrying out a more
in-depth analysis focusing on fine-grained features of syntactic complexity, such as the depth
of the nominal subtree. Further, we would like to extend this approach to other constituents of
the sentence, such as the verb.

Acknowledgments

The work presented in this paper was partially supported by the 2–year project (2017-2019)
PERFORMA – Personalizzazione di pERcorsi FORMativi Avanzati, funded by Regione Toscana (Progetti
Congiunti di Alta Formazione – POR FSE 2014-2020 Asse A – Occupazione) in collaboration with
Meta srl company.
References

Giuseppe Attardi, Felice Dell’Orletta, Maria Simi, and Joseph Turian. 2009. Accurate dependency
  parsing with a stacked multilayer perceptron. In Proceedings of EVALITA 2009 - Evaluation of
  NLP and Speech Tools for Italian 2009, Reggio Emilia, Italy, December 2009.

Marco Baroni, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. 2009. The WaCky wide
  web: A collection of very large linguistically processed web-crawled corpora. Language
  Resources and Evaluation, 43(3):209–226.

Dominique Brunato and Felice Dell’Orletta. 2017. On the order of words in Italian: a study on
  genre vs complexity. In International Conference on Dependency Linguistics (Depling 2017),
  18-20 September 2017, Pisa, Italy.

Cristina Burani. 2006. Morfologia: i processi. In A. Laudanna and M. Voghera (eds.), Il
  linguaggio. Strutture linguistiche e processi cognitivi. Bari, Laterza.

Isabella Chiari and Tullio De Mauro. 2014. The New Basic Vocabulary of Italian as a linguistic
  resource. In Proceedings of the First Italian Conference on Computational Linguistics
  (CLIC-IT), Pisa, 15-19 December 2014.

Andrea Cimino, Felice Dell’Orletta, Giulia Venturi, and Simonetta Montemagni. 2013. Linguistic
  Profiling based on General–purpose Features and Native Language Identification. In Proceedings
  of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications,
  Atlanta, Georgia, June 13, pp. 207–215.

Kevyn Collins-Thompson. 2014. Computational Assessment of text readability. Recent Advances in
  Automatic Readability Assessment and Text Simplification. Special issue of International
  Journal of Applied Linguistics, 165(2):97–135, John Benjamins Publishing Company.

Felice Dell’Orletta. 2009. Ensemble system for part-of-speech tagging. In Proceedings of EVALITA
  2009 - Evaluation of NLP and Speech Tools for Italian 2009, Reggio Emilia, Italy, December
  2009.

Felice Dell’Orletta, Martijn Wieling, Andrea Cimino, Giulia Venturi, and Simonetta Montemagni.
  2014. Assessing the Readability of Sentences: Which Corpora and Features. In Proceedings of
  the 9th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2014),
  Baltimore, Maryland, USA.

Holger Diessel. 2005. Competing motivations for the ordering of main and adverbial clauses.
  Linguistics, 43(3):449–470.

Richard Futrell, Kyle Mahowald, and Edward Gibson. 2015. Quantifying word order freedom in
  dependency corpora. In Proceedings of the Third International Conference on Dependency
  Linguistics (Depling 2015), pp. 91–100.

Edward Gibson. 1998. Linguistic complexity: Locality of syntactic dependencies. Cognition,
  68:1–76.

Edward Gibson. 2000. The dependency Locality Theory: A distance–based theory of linguistic
  complexity. In A. Marantz, Y. Miyashita, and W. O’Neil (eds.), Image, Language, Brain.
  Cambridge, MA: MIT Press, pp. 95–126.

Daniel Gildea and David Temperley. 2010. Do Grammars Minimize Dependency Length? Cognitive
  Science, 34(2):286–310.

John A. Hawkins. 1994. A performance theory of order and constituency. Cambridge Studies in
  Linguistics, 73. Cambridge University Press.

Bernd Kortmann and Benedikt Szmrecsanyi. 2012. Linguistic Complexity. Second Language
  Acquisition, Indigenization, Contact. Berlin, Boston: De Gruyter.

Matti Miestamo. 2008. Grammatical complexity in a crosslinguistic perspective. In M. Miestamo,
  K. Sinnemäki, and F. Karlsson (eds.), Language Complexity: Typology, Contact, Change.
  Amsterdam: Benjamins, pp. 23–41.

Randall James Ryder and Wayne H. Slater. 1988. The relationship between word frequency and word
  knowledge. The Journal of Educational Research, 81(5):312–317.

Yaqin Wang and Haitao Liu. 2017. The effects of genre on dependency distance and dependency
  direction. Language Sciences, 59:135–157.

Chris Wild. 1997. The Wilcoxon Rank-Sum Test. University of Auckland, Department of Statistics.
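
As a supplementary illustration, the four parse-tree features defined in Section 3.2 (HeadDistance, RootDistance, AvgChildren, AvgSibling) could be computed for a single noun roughly as sketched below. This is a hedged sketch rather than the pipeline used in the study. It assumes a CoNLL-style list of 1-based head indices with 0 marking the root, takes HeadDistance as the absolute difference between the positions of the noun and its head, and counts as siblings the other tokens sharing the noun's head. The names `NounFeatures` and `noun_features` and the hand-built toy parse are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NounFeatures:
    head_distance: Optional[int]  # None when the noun itself is the root
    root_distance: int            # number of arcs between the noun and the root
    children: int                 # number of direct dependents
    siblings: int                 # other tokens attached to the same head


def noun_features(heads: List[int], noun_index: int) -> NounFeatures:
    """heads[i - 1] is the 1-based head of token i; 0 marks the root."""
    head = heads[noun_index - 1]
    head_distance = abs(noun_index - head) if head != 0 else None

    # Walk up the tree, counting arcs until the root token is reached.
    root_distance = 0
    current = noun_index
    while heads[current - 1] != 0:
        current = heads[current - 1]
        root_distance += 1
        if root_distance > len(heads):
            raise ValueError("head list contains a cycle")

    children = sum(1 for h in heads if h == noun_index)
    siblings = 0
    if head != 0:
        siblings = sum(
            1 for i, h in enumerate(heads, start=1)
            if h == head and i != noun_index
        )
    return NounFeatures(head_distance, root_distance, children, siblings)


# Toy sentence with a hand-built parse (for illustration only):
# 1 Il  2 gatto  3 mangia  4 il  5 pesce  6 nel  7 giardino
HEADS = [2, 3, 0, 5, 3, 3, 6]
NOUNS = [2, 5, 7]  # positions of "gatto", "pesce", "giardino"

features = [noun_features(HEADS, i) for i in NOUNS]
avg_children = sum(f.children for f in features) / len(features)
avg_siblings = sum(f.siblings for f in features) / len(features)
```

Corpus-level values such as those in Table 1 would then be averages of these per-noun values over all nouns, or over the simple or complex subset only.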