=Paper=
{{Paper
|id=Vol-2769/57
|storemode=property
|title=Is Neural Language Model Perplexity Related to Readability?
|pdfUrl=https://ceur-ws.org/Vol-2769/paper_57.pdf
|volume=Vol-2769
|authors=Alessio Miaschi,Chiara Alzetta,Dominique Brunato,Felice Dell'Orletta,Giulia Venturi
|dblpUrl=https://dblp.org/rec/conf/clic-it/MiaschiABDV20
}}
==Is Neural Language Model Perplexity Related to Readability?==
Alessio Miaschi†★, Chiara Alzetta‡★, Dominique Brunato★, Felice Dell'Orletta★, Giulia Venturi★

† Department of Computer Science, University of Pisa
★ Istituto di Linguistica Computazionale “Antonio Zampolli”, ItaliaNLP Lab, Pisa
‡ DIBRIS, University of Genoa

alessio.miaschi@phd.unipi.it, chiara.alzetta@edu.unige.it, {name.surname}@ilc.cnr.it
Abstract

This paper explores the relationship between Neural Language Model (NLM) perplexity and sentence readability. Starting from the evidence that NLMs implicitly acquire sophisticated linguistic knowledge from a huge amount of training data, our goal is to investigate whether perplexity is affected by the linguistic features used to automatically assess sentence readability and whether there is a correlation between the two metrics. Our findings suggest that this correlation is actually quite weak and that the two metrics are affected by different linguistic phenomena.¹

¹ Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction and Motivation

Standard Neural Language Models (NLMs) are trained to predict the next token given a context of previous tokens. The metric commonly used to assess the performance of a language model is perplexity, which corresponds to the inverse geometric mean of the joint probability of the words w1, ..., wn in a held-out test corpus C. While it is primarily an intrinsic metric of NLM quality, perplexity has been used in a variety of scenarios, such as classifying formal versus colloquial tweets (González, 2015), detecting the boundaries between varieties belonging to the same language family (Gamallo et al., 2017), or identifying speech samples produced by subjects with cognitive and/or language diseases, e.g. dementia (Cohen and Pakhomov, 2020) or Specific Language Impairment (Gabani et al., 2009). From the perspective of computational studies aimed at modeling human language processing, perplexity scores have also been shown to effectively match various human behavioural measures, such as gaze duration during reading (Demberg and Keller, 2008; Goodkind and Bicknell, 2018).

In this paper we focus on a less investigated perspective, addressing the connection between perplexity and readability. Since by definition perplexity gives a good approximation of how well a model recognises an unseen piece of text as a plausible one, our intuition is that lower model perplexity should be assigned to easy-to-read sentences, while difficult-to-read ones should obtain higher perplexity. On the other hand, state-of-the-art NLMs trained on huge amounts of data have been shown to implicitly learn sophisticated knowledge of language phenomena, including complex syntactic properties of sentences (Tenney et al., 2019; Jawahar et al., 2019; Miaschi et al., 2020). This could suggest that variations in linguistic complexity, especially those related to subtle morpho-syntactic and syntactic features of the sentence rather than lexical ones, might not impact model perplexity to a great extent. This assumption seems to be confirmed by the (still unpublished) results of Martinc et al. (2019), which, to our knowledge, is the only work explicitly leveraging unsupervised neural language model predictions in the context of readability assessment. According to this study, an NLM is even less perplexed by articles addressed to adults than by documents conceived for a younger readership. From a relatively different perspective, focused on the ability of automatic comprehension systems to solve cloze tests, Benzahra and Yvon (2019) showed that NLM performance is not affected by the level of text complexity.

In order to test the validity of all these hypotheses, we rely on the perplexity scores assigned by a state-of-the-art NLM for the Italian language to several datasets representative of different textual genres containing both easy- and complex-to-read sentences: ideally, such datasets should emphasise the correlation between perplexity and readability (if present), since the corpora are explicitly designed to contain both simple and difficult examples.

Contributions. We inspect whether and to what extent it is possible to find a relationship between a readability score and the perplexity of an NLM. To this aim we investigate (i) whether the perplexity of an NLM and the readability score of a set of sentences show a significant correlation and (ii) whether the two metrics are equally affected by the same set of linguistic phenomena occurring in the sentence.

2 Experimental Design

Following our research questions, we devised a set of experiments to study whether NLM perplexity reflects the level of readability of a sentence and which linguistic phenomena are mostly involved in each metric. For this purpose, we first investigated whether sentence-level perplexity scores computed with one of the most prominent NLMs correlate with the scores assigned to the same sentences by a supervised readability assessment tool. Secondly, we investigated which linguistic features of the considered sentences correlate in a statistically significant way with the perplexity and readability scores respectively. In order to verify whether correlations hold across different typologies of text, we tested our approach on five Italian datasets.

2.1 Models

READ-IT. Automatic readability assessment (henceforth ARA) was carried out using READ-IT (Dell'Orletta et al., 2011), the first readability assessment tool for Italian, which combines traditional raw-text features with lexical, morpho-syntactic and syntactic information extracted from automatically parsed documents. In READ-IT, readability assessment is modelled as a binary classification task based on Support Vector Machines using LIBSVM (Chang and Lin, 2001). The training corpora are representative of two classes of texts, i.e. difficult- vs. easy-to-read ones, both containing newspaper articles. The set of features exploited for predicting readability has been proved to capture different aspects of sentence complexity. The assigned readability score thus ranges between 0 (easy-to-read) and 1 (difficult-to-read), referring to the probability of unseen documents or sentences to belong to the class of difficult-to-read documents. For the purposes of our work, we carried out readability assessment at sentence level, making the analysis reliable for the comparison with the sentence-based perplexity of an NLM.

GePpeTto. Sentence-level perplexity scores were computed relying on GePpeTto (De Mattei et al., 2020). GePpeTto is a generative language model trained on the Italian language and built using the GPT-2 architecture (Radford et al., 2019). The model was trained on a dump of Italian Wikipedia (2.8GB) and on the itWac corpus (Baroni et al., 2009), which amounts to 11GB of web texts. The perplexity (PPL) of the model was computed as follows:

PPL = e^(NLL / N)

where NLL and N correspond respectively to the negative log-likelihood and to the length of each sentence w1:n = [w1, ..., wn] in the datasets.
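The sentence-level perplexity used here, PPL = e^(NLL/N), can be sketched in a few lines. This is a minimal illustration assuming per-token log-probabilities have already been obtained from a language model; the `sentence_perplexity` helper is ours, not part of the paper's code:

```python
import math

def sentence_perplexity(token_logprobs):
    """PPL = exp(NLL / N): NLL is the summed negative log-likelihood
    of the sentence, N the number of tokens."""
    if not token_logprobs:
        raise ValueError("empty sentence")
    nll = -sum(token_logprobs)  # negative log-likelihood of the whole sentence
    return math.exp(nll / len(token_logprobs))

# Sanity check: a model assigning every token probability 1/V yields PPL = V.
vocab_size = 50
logprobs = [math.log(1.0 / vocab_size)] * 10
print(round(sentence_perplexity(logprobs), 6))  # 50.0
```

In practice the per-token log-probabilities would come from scoring the sentence with the NLM (e.g. a GPT-2-style model); the formula itself is model-agnostic.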
2.2 Corpora

In order to test the reliability of our initial hypothesis, we chose four corpora containing different typologies of text, i.e. web pages, educational materials, narrative texts, newspaper and scientific articles. Each corpus includes a balanced amount of difficult- and easy-to-read sentences. In addition, we also considered in the analysis the Italian Universal Dependency treebank. This is meant to verify whether the connection between sentence-level readability and perplexity also holds in a well-acknowledged benchmark corpus. For each of them, we excluded from our analysis short sentences, i.e. those having fewer than 5 tokens.

PACCSS-IT² (Brunato et al., 2016): we took into account 125,977 sentences belonging to PACCSS-IT, a corpus of complex-simple aligned sentences extracted from the ItWaC corpus. The resource was built using an automatic approach for acquiring large corpora of paired sentences able to intercept structural transformations (such as deletion, reordering, etc.). For example, the two following sentences represent a pair in the corpus, where a reordering operation occurs at phrase level (i.e. the subordinate clause precedes vs. follows the main clause):

• Complex: Ringraziandola per la sua cortese attenzione, resto in attesa di risposta. [Lit: Thanking you for your kind attention, I look forward to your answer.]

• Simple: Resto in attesa di una risposta e ringrazio vivamente per l'attenzione. [Lit: I look forward to your answer and I thank you greatly for your attention.]

² http://www.italianlp.it/resources/paccss-it-parallel-corpus-of-complex-simple-sentences-for-italian/

Terence and Teacher³ (Brunato et al., 2015): two corpora of original and manually simplified texts aligned at sentence level. Terence contains short Italian novels for children and their manually simplified versions, carried out by linguists and psycholinguists targeting children with text comprehension difficulties. Teacher is a corpus of pairs of documents belonging to different genres (e.g. literature, handbooks) used in educational settings and manually simplified by teachers. We exploited 1,644 sentences belonging to these corpora.

³ http://www.italianlp.it/resources/terence-and-teacher/

Multi-Genre Multi-Type Italian corpus: a collection of Italian texts representative of three traditional textual genres: journalism, scientific prose and narrative. Each genre has been internally subdivided into two sub-corpora representative of an easy- vs. difficult-to-read variety, defined according to the intended target audience for a given genre. The journalistic prose corpus includes articles automatically downloaded from the online versions of two general-purpose newspapers⁴, while the “easy” sub-corpus contains articles from two easy-to-read newspapers⁵ addressed to adults with low literacy skills or mild intellectual disabilities. The scientific prose collection consists of scholarly publications on linguistics and computational linguistics and of Wikipedia pages downloaded from the portal “Linguistics”, representative of the complex and easy variety respectively. For the narrative genre, we included long novels written by novelists of the last century and by contemporary writers in the complex-variety corpora, while for the easy variety we collected short novels for children. The complete corpus contains 56,685 sentences.

⁴ www.repubblica.it and http://www.ilgiornale.it/
⁵ www.dueparole.it and http://www.informazionefacile.it/

Italian Universal Dependency Treebank: it includes different sections of the Italian Universal Dependency Treebank (IUDT), version 2.5 (Zeman et al., 2019). In particular, we considered two groups: a first one containing the whole Italian Stanford Dependency Treebank (ISDT)⁶ (Bosco et al., 2013), the Italian version of the multilingual Turin University Parallel Treebank (Sanguinetti and Bosco, 2015) and the Venice Italian Treebank (Delmonte et al., 2007) (24,998 sentences), all containing a mix of textual genres; and a second one including two collections of texts representative of social media language, i.e. generic tweets and tweets labelled for irony (PoSTWITA⁷ and TWITTIRÒ⁸) (Sanguinetti et al., 2018; Cignarella et al., 2019) (3,660 sentences in total).

⁶ https://github.com/UniversalDependencies/UD_Italian-ISDT
⁷ https://github.com/UniversalDependencies/UD_Italian-PoSTWITA
⁸ https://universaldependencies.org/treebanks/it_twittiro

3 Sentence Perplexity and Readability

Our analysis starts from a comparison between the average perplexity and readability scores obtained for each sentence of the five considered datasets. As shown in Table 1, readability values (column ARA) are quite homogeneous across the datasets, with low standard deviation values. On the contrary, the range of perplexity scores is wider (column PPL), going from an average score of 3,905.83 for PACCSS-IT down to 436.75 for the IUDT miscellaneous portion (Italian UD). These differences seem to provide first evidence that perplexity and readability are not correlated with each other.

Dataset                  PPL                      ARA
PACCSS-IT                3,905.83 (± 21,306.07)   0.55 (± 0.24)
Terence-Teacher          790.85 (± 5,002.62)      0.46 (± 0.27)
Multi-Genre Multi-Type   570.85 (± 4,820.12)      0.58 (± 0.31)
Italian-UD               436.75 (± 3,633.64)      0.61 (± 0.30)
Twitter-UD               986.28 (± 2,479.64)      0.59 (± 0.30)

Table 1: Perplexity (PPL) and Readability (ARA) mean and standard deviation values for the 5 datasets.

This intuition was verified by computing the Spearman's rank correlation coefficient between the perplexity and readability scores for each dataset. Results are reported in Table 2, column PPL-ARA. As can be seen, all correlation rates are significant, except for the result obtained on the Terence and Teacher corpus, possibly because the size of the corpus is too small to allow a significant comparison. Contrary to our expectations, no correlation was detected between the two metrics for any of the corpora, suggesting that perplexity and readability are independent of each other.

Dataset                  PPL-ARA   Feats
PACCSS-IT                -0.031*    0.169*
Terence-Teacher           0.014     0.149
Multi-Genre Multi-Type    0.026*    0.184*
Italian-UD               -0.054*    0.332*
Twitter-UD               -0.038*   -0.037

Table 2: Spearman's correlation coefficients between sentence-level perplexity and readability scores (PPL-ARA) and between rankings of linguistic features (Feats). Statistically significant correlations (p < 0.05) are marked with *.

To further investigate the reasons behind these scores and to deepen the analysis of the relationship between the two metrics, we investigated whether they capture the same (or similar) linguistic properties of the sentences. To this aim, we tested the presence and strength of the correlation between each of the two metrics and a set of 176 linguistic features, which have been shown to capture properties of sentence complexity (Brunato et al., 2018). In particular, this analysis is based on the set of features described in Brunato et al. (2020), which are acquired from the raw, morpho-syntactic and syntactic levels of annotation. They range from basic information on average sentence and word length to lexical information about the internal composition of the vocabulary of the text (e.g. the distribution of lemmas belonging to the Basic Italian Vocabulary (De Mauro, 2000)). They also include morpho-syntactic information (e.g. the distribution of POS tags and of the inflectional properties of verbs) and more complex aspects of sentence structure derived from syntactic annotation, modeling global and local properties of the parsed tree structure, e.g. the relative order of subjects and objects with respect to the verb, or the use of subordination. In order to extract these features, the considered corpora were morpho-syntactically annotated and dependency parsed with the UDPipe pipeline (Straka et al., 2016), with the exception of the IUDT corpus.

Column Feats of Table 2 illustrates the results of this analysis: we report the Spearman's correlation coefficients between the two rankings of linguistic features, each ordered by the strength of correlation between the feature value and the perplexity and readability score respectively. Once again we observe rather weak correlation values, the only exception being Italian-UD, which is the only one reporting a medium correlation (.332). Overall, these results corroborate our previous finding that the two metrics are not particularly related to each other, and they further suggest that the linguistic phenomena affecting the perplexity of an NLM and the readability level of a sentence are very different. Consider for example the two following sentences:

(1) Il furto è avvenuto giovedì notte.
    [The theft took place on Thursday night.]

(2) Il comitato di bioetica: no all'eutanasia.
    [The bioethics committee: no to euthanasia.]

While (1) is very easy to read, with a readability score of 0.25, it has a quite high perplexity score, i.e. 40,737.81; conversely, (2) is quite difficult to read (ARA=1) but has a very low perplexity score (PPL=11.24).

4 In-Depth Linguistic Investigation

To better explore the motivation behind these results, we performed an in-depth investigation aimed at understanding the relationship between our set of linguistic features and the two metrics taken into consideration. Since we noticed that for all datasets a higher number of features correlates with ARA than with PPL, we selected those that are significantly correlated with both metrics. The number of shared features varies for each dataset, depending on its size. For example, for the two smallest ones, i.e. Terence and Teacher and the UD Twitter Treebank, we could only consider 34.65% (61) and 44.88% (79) of the whole set of features respectively, while for the larger corpora the subset is wider: 81.81% (144) in PACCSS-IT, 78.97% (139) for Multi-Genre Multi-Type and 84.65% (149) for the IUD Treebank.
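The rank correlations reported above (Tables 2 and 3) are Spearman's rho, i.e. the Pearson correlation of rank-transformed scores. A minimal pure-Python sketch follows; in practice a library routine such as `scipy.stats.spearmanr` would be used, and the `ranks`/`spearman` helpers are illustrative, not the paper's code:

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over a run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 0-based positions i..j, shifted to 1-based
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rho = Pearson correlation of the two rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Perfectly monotone score pairs give rho close to 1; reversed, close to -1.
ppl = [12.3, 40.7, 150.9, 3905.8]
ara = [0.10, 0.25, 0.55, 0.90]
print(spearman(ppl, ara))        # close to 1.0
print(spearman(ppl, ara[::-1]))  # close to -1.0
```

The near-zero PPL-ARA coefficients in Table 2 correspond to rank vectors that are essentially unrelated, rather than monotonically aligned as in this toy example.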
Table 3 shows the top ten features for each dataset, i.e. those that obtained the strongest correlation with both PPL and ARA. As expected, correlations are generally stronger between linguistic features and readability scores, although they are lower than expected. This could be due to the fact that, even if the READ-IT classifier is trained with a similar set of features, the non-linear feature space makes it difficult to identify clear correlations with individual features. Similarly, our set of features seems to play only a marginal role in perplexity. However, this is not the case for the PACCSS-IT corpus, for which the set of considered linguistic features has a higher correlation with PPL. This can possibly be related to the partial overlap between the GePpeTto training data and the PACCSS-IT sentences, since the latter are drawn from the ItWaC corpus, which is included in GePpeTto's training data.

PACCSS-IT
PPL feature                      Corr  | ARA feature              Corr
aux num pers dist Sing+3         0.53  | xpos dist FF             0.34
dep dist cop                     0.51  | dep dist punct           0.32
avg max depth                    0.50  | upos dist PUNCT          0.32
upos dist ADP                    0.50  | ttr form                 0.29
xpos dist E                      0.50  | aux mood dist Cnd        0.25
dep dist case                    0.49  | upos dist DET            0.25
n tokens                         0.48  | dep dist det             0.25
dep dist root                    0.48  | ttr lemma                0.22
xpos dist FS                     0.48  | upos dist NOUN           0.21

Terence and Teacher
PPL feature                      Corr  | ARA feature              Corr
xpos dist B                      0.25  | dep dist det            -0.39
verbs num pers dist Sing+3       0.23  | upos dist DET           -0.38
lexical density                  0.22  | upos dist NOUN          -0.37
dep dist advmod                  0.21  | xpos dist S             -0.37
upos dist ADV                    0.21  | xpos dist RD            -0.29
verbs num pers dist Plur+3      -0.16  | upos dist ADV            0.27
xpos dist V                      0.16  | dep dist advmod          0.25
avg token per clause            -0.16  | xpos dist FF             0.25
upos dist VERB                   0.14  | avg sub chain len        0.24

Multi-Genre Multi-Type
PPL feature                      Corr  | ARA feature              Corr
n tokens                        -0.19  | principal prop dist     -0.42
dep dist root                    0.19  | ttr form                -0.34
dep dist advmod                  0.19  | xpos dist FF             0.34
upos dist ADV                    0.18  | dep dist det            -0.33
n prepositional chains          -0.18  | upos dist DET           -0.33
xpos dist B                      0.18  | upos dist PUNCT          0.33
upos dist ADP                   -0.17  | dep dist punct           0.33
xpos dist E                     -0.17  | xpos dist FB             0.31
ttr lemma                        0.16  | sub prop dist            0.27

Italian UD Treebank
PPL feature                      Corr  | ARA feature              Corr
n tokens                        -0.27  | principal prop dist     -0.53
dep dist root                    0.27  | sub proposition dist     0.40
n prepositional chains          -0.26  | n tokens                 0.39
avg max depth                   -0.24  | dep dist root           -0.39
upos dist ADP                   -0.24  | ttr form                -0.37
ttr lemma                        0.23  | avg max depth            0.36
max links len                   -0.23  | avg links len            0.35
avg max links len               -0.23  | max links len            0.34
xpos dist E                     -0.22  | avg max links len        0.34

Italian UD Twitter Treebank
PPL feature                      Corr  | ARA feature              Corr
upos dist SYM                    0.38  | upos dist PUNCT          0.30
avg max depth                   -0.28  | dep dist punct           0.30
xpos dist SYM                    0.28  | dep dist det            -0.29
in dict                         -0.24  | upos dist DET           -0.29
dep dist vocative:mention        0.23  | verbal root perc        -0.27
in dict types                   -0.22  | xpos dist RD            -0.27
ttr lemma                        0.21  | avg token per clause    -0.27
in FO                           -0.21  | subj pre                -0.27
verbal head per sent            -0.19  | obj post                -0.24

Table 3: Top 10 features for each dataset, along with their correlation scores with perplexity (PPL) and readability (ARA).

Inspecting these results, we can also observe that correlations between features and PPL seem to be more affected by genre-specific characteristics. This is particularly clear if we consider the Italian UD Twitter treebank, for which among the top ten most correlated features we find some characterising social media language, e.g. symbols (upos-xpos dist SYM) or the vocative relation, which marks a dialogue participant addressed in a text and whose specification vocative:mention is specifically used for Twitter @-mentions (dep dist vocative:mention).

5 Conclusion

The paper presented a study aimed at investigating the relationship between two metrics computed at sentence level, i.e. the perplexity of a state-of-the-art NLM for the Italian language and the readability score automatically assigned to a sentence by a supervised classifier. We carried out our analysis considering several datasets differing in textual genre and language variety. Specifically, we observed that, comparing the rankings obtained using the two metrics, we cannot find any significant correlation, either between the scores of the two metrics or with respect to the set of linguistic features that most impact their values. Further investigation within this line of research will explore whether we can draw the same observations when a different NLM is exploited to compute sentence perplexity.
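The comparison of top-ranked features in Section 4 can be sketched as a simple overlap measure between two feature rankings ordered by absolute correlation strength. The feature names below echo Table 3, but the correlation values and the `top_k_overlap` helper are purely illustrative, not the paper's code:

```python
def top_k_overlap(corr_a, corr_b, k=10):
    """Fraction of features shared by the top-k of two
    feature -> correlation maps, ranked by absolute correlation."""
    def top(c):
        return set(sorted(c, key=lambda f: abs(c[f]), reverse=True)[:k])
    return len(top(corr_a) & top(corr_b)) / k

# Toy rankings loosely inspired by Table 3 (values are illustrative):
ppl_corr = {"n tokens": 0.48, "dep dist root": 0.48, "upos dist ADP": 0.50}
ara_corr = {"ttr form": 0.29, "dep dist root": -0.39, "upos dist NOUN": 0.21}
print(top_k_overlap(ppl_corr, ara_corr, k=3))  # one of the three top features is shared
```

A low overlap between the PPL- and ARA-ranked feature lists is the quantitative counterpart of the paper's observation that the two metrics are driven by different linguistic phenomena.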
References

Marco Baroni, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. 2009. The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43(3):209–226.

Marc Benzahra and François Yvon. 2019. Measuring text readability with machine comprehension: a pilot study. In Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 412–422, Florence, Italy. Association for Computational Linguistics.

C. Bosco, S. Montemagni, and M. Simi. 2013. Converting Italian treebanks: Towards an Italian Stanford dependency treebank. In Proceedings of the ACL Linguistic Annotation Workshop & Interoperability with Discourse, Sofia, Bulgaria.

Dominique Brunato, Felice Dell'Orletta, Giulia Venturi, and Simonetta Montemagni. 2015. Design and annotation of the first Italian corpus for text simplification. In Proceedings of the 9th Linguistic Annotation Workshop, pages 31–41.

Dominique Brunato, Andrea Cimino, Felice Dell'Orletta, and Giulia Venturi. 2016. PaCCSS-IT: A parallel corpus of complex-simple sentences for automatic text simplification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 351–361, Austin, Texas. Association for Computational Linguistics.

Dominique Brunato, Lorenzo De Mattei, Felice Dell'Orletta, Benedetta Iavarone, and Giulia Venturi. 2018. Is this sentence difficult? Do you agree? In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2690–2699, Brussels, Belgium. Association for Computational Linguistics.

Dominique Brunato, Andrea Cimino, Felice Dell'Orletta, Giulia Venturi, and Simonetta Montemagni. 2020. Profiling-UD: a tool for linguistic profiling of texts. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 7145–7151, Marseille, France. European Language Resources Association.

Chih-Chung Chang and Chih-Jen Lin. 2001. LIBSVM: a library for support vector machines.

Alessandra Teresa Cignarella, Cristina Bosco, and Paolo Rosso. 2019. Presenting TWITTIRÒ-UD: An Italian Twitter treebank in Universal Dependencies. In Proceedings of the Fifth International Conference on Dependency Linguistics (Depling, SyntaxFest 2019).

Trevor Cohen and Serguei Pakhomov. 2020. A tale of two perplexities: Sensitivity of neural language models to lexical retrieval deficits in dementia of the Alzheimer's type. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1946–1957, Online. Association for Computational Linguistics.

Lorenzo De Mattei, Michele Cafagna, Felice Dell'Orletta, Malvina Nissim, and Marco Guerini. 2020. GePpeTto carves Italian into a language model. arXiv preprint arXiv:2004.14253.

Tullio De Mauro. 2000. Il dizionario della lingua italiana, volume 1. Paravia.

Felice Dell'Orletta, Simonetta Montemagni, and Giulia Venturi. 2011. READ-IT: Assessing readability of Italian texts with a view to text simplification. In Proceedings of the Second Workshop on Speech and Language Processing for Assistive Technologies, pages 73–83, Edinburgh, Scotland, UK. Association for Computational Linguistics.

Rodolfo Delmonte, Antonella Bristot, and Sara Tonelli. 2007. VIT - Venice Italian Treebank: Syntactic and quantitative features. In Proceedings of the Sixth International Workshop on Treebanks and Linguistic Theories.

V. Demberg and Frank Keller. 2008. Data from eye-tracking corpora as evidence for theories of syntactic processing complexity. Cognition, 109:193–210.

Keyur Gabani, Melissa Sherman, Thamar Solorio, Yang Liu, Lisa Bedore, and Elizabeth Peña. 2009. A corpus-based approach for the prediction of language impairment in monolingual English and Spanish-English bilingual children. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 46–55, Boulder, Colorado. Association for Computational Linguistics.

Pablo Gamallo, Jose Ramom Pichel, and Iñaki Alegria. 2017. A perplexity-based method for similar languages discrimination. In Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), pages 109–114, Valencia, Spain. Association for Computational Linguistics.

M. González. 2015. An analysis of Twitter corpora and the differences between formal and colloquial tweets. In TweetMT@SEPLN.

Adam Goodkind and Klinton Bicknell. 2018. Predictive power of word surprisal for reading times is a linear function of language model quality. In Proceedings of the 8th Workshop on Cognitive Modeling and Computational Linguistics (CMCL 2018), pages 10–18, Salt Lake City, Utah. Association for Computational Linguistics.

Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What does BERT learn about the structure of language? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), Florence, Italy.

Matej Martinc, Senja Pollak, and Marko Robnik-Šikonja. 2019. Supervised and unsupervised neural approaches to text readability. Computing Research Repository, arXiv:1503.06733. Version 2.

Alessio Miaschi, Dominique Brunato, Felice Dell'Orletta, and Giulia Venturi. 2020. Linguistic profiling of a neural language model. arXiv preprint arXiv:2010.01869.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Manuela Sanguinetti and Cristina Bosco. 2015. PartTUT: The Turin University Parallel Treebank. In Roberto Basili et al., editors, Harmonization and Development of Resources and Tools for Italian Natural Language Processing within the PARLI Project, pages 51–69. Springer.

Manuela Sanguinetti, Cristina Bosco, Alberto Lavelli, Alessandro Mazzei, and Fabio Tamburini. 2018. PoSTWITA-UD: an Italian Twitter treebank in Universal Dependencies. In Proceedings of the Eleventh Language Resources and Evaluation Conference (LREC 2018).

M. Straka, J. Hajic, and J. Strakova. 2016. UDPipe: Trainable pipeline for processing CoNLL-U files performing tokenization, morphological analysis, POS tagging and parsing. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC).

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, Florence, Italy. Association for Computational Linguistics.

Daniel Zeman, Joakim Nivre, Mitchell Abrams, et al. 2019. Universal Dependencies 2.5. LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL).