<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Is Neural Language Model Perplexity Related to Readability?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alessio Miaschi</string-name>
          <email>alessio.miaschi@phd.unipi.it</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chiara Alzetta</string-name>
          <email>chiara.alzetta@edu.unige.it</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dominique Brunato</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Felice Dell'Orletta</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giulia Venturi</string-name>
        </contrib>
      </contrib-group>
      <abstract>
        <p>This paper explores the relationship between Neural Language Model (NLM) perplexity and sentence readability. Starting from the evidence that NLMs implicitly acquire sophisticated linguistic knowledge from a huge amount of training data, our goal is to investigate whether perplexity is affected by the linguistic features used to automatically assess sentence readability and whether there is a correlation between the two metrics. Our findings suggest that this correlation is actually quite weak and that the two metrics are affected by different linguistic phenomena.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction and Motivation</title>
      <p>
        Standard Neural Language Models (NLMs) are
trained to predict the next token given a context of
previous tokens. The metric commonly used for
assessing the performance of a language model is
perplexity, which corresponds to the inverse
geometric mean of the joint probability of the words
w1, …, wn in a held-out test corpus C. While
being primarily an intrinsic metric of NLM quality,
perplexity has been used in a variety of scenarios,
such as to distinguish between formal and colloquial
tweets
        <xref ref-type="bibr" rid="ref2">(González, 2015)</xref>
        , to detect the
boundaries between varieties belonging to the same
language family
        <xref ref-type="bibr" rid="ref1">(Gamallo et al., 2017)</xref>
        or to identify
speech samples produced by subjects with
cognitive and/or language disorders, e.g. dementia
(Cohen and Pakhomov, 2020) or Specific Language
Impairment (Gabani et al., 2009). From the
perspective of computational studies aimed at
modeling human language processing, perplexity scores
have also been shown to effectively match various
human behavioural measures, such as gaze
duration during reading
        <xref ref-type="bibr" rid="ref3">(Demberg and Keller, 2008;
Goodkind and Bicknell, 2018)</xref>
        .
1 Copyright © 2020 for this paper by its authors. Use
permitted under Creative Commons License Attribution 4.0
International (CC BY 4.0).
      </p>
      <p>
        In this paper we focus on a less investigated
perspective addressing the connection between
perplexity and readability. Since by definition
perplexity gives a good approximation of how well
a model recognises an unseen piece of text as a
plausible one, our intuition is that lower model
perplexity should be assigned to easy-to-read
sentences, while difficult-to-read ones should obtain
higher perplexity. On the other hand,
state-of-the-art NLMs trained on huge data have been shown
to implicitly learn a sophisticated knowledge of
language phenomena, also with respect to
complex syntactic properties of sentences
        <xref ref-type="bibr" rid="ref11 ref4 ref6">(Tenney et
al., 2019; Jawahar et al., 2019; Miaschi et al.,
2020)</xref>
        . This could suggest that variations in
linguistic complexity, especially when related
to subtle morpho-syntactic and syntactic features
of the sentence rather than lexical ones, might not
affect model perplexity to a great extent. This
assumption seems to be confirmed by the (still
unpublished) results of Martinc et al. (2019), which is,
to our knowledge, the only study explicitly
leveraging unsupervised neural language model
predictions in the context of readability assessment.
According to this study, a NLM is even less perplexed
by articles addressed to adults than by documents
conceived for a younger readership. From a
relatively different perspective, focused on the
ability of automatic comprehension systems to solve
cloze tests, Benzahra and Yvon (2019) showed
that NLM performance is not affected by the level
of text complexity.
      </p>
      <p>In order to test the validity of all these
hypotheses, we rely on the perplexity score given
by a state-of-the-art NLM for the Italian language
to several datasets representative of different
textual genres containing both easy– and complex–
to–read sentences: ideally, such datasets should
emphasise the correlation between perplexity and
readability (if present) since the corpora are
explicitly designed to contain both simple and
difficult examples.</p>
      <p>Contributions We inspect whether and to what
extent it is possible to find a relationship between
a readability score and the perplexity of a NLM.
To this aim we investigate (i) if the perplexity
of a NLM and the readability score of a set of
sentences show a significant correlation and (ii)
whether the two metrics are equally affected by
the same set of linguistic phenomena that occur in
the sentence.
</p>
    </sec>
    <sec id="sec-2">
      <title>2 Experimental Design</title>
      <p>In line with our research questions, we devised
a set of experiments to study whether NLM
perplexity reflects the level of readability of a
sentence and which linguistic phenomena are
most involved in each metric. For this purpose,
we first investigated whether sentence-level
perplexity scores computed with one of the most
prominent NLMs correlate with the scores
assigned to the same sentences by a supervised
readability assessment tool. Secondly, we
investigated which linguistic features of the
considered sentences correlate in a statistically
significant way with the perplexity and readability
score respectively. In order to verify whether these
correlations hold across different typologies of texts,
we tested our approach on five Italian datasets.</p>
      <sec id="sec-2-1">
        <title>2.1 Models</title>
        <p>READ-IT. Automatic readability assessment (henceforth
ARA) was carried out with READ-IT (Dell'Orletta
et al., 2011), the first readability assessment tool
for Italian, which combines traditional raw-text
features with lexical, morpho-syntactic and syntactic
information extracted from automatically parsed
documents. In READ-IT, readability analysis is
modelled as a binary classification task based on
Support Vector Machines, using LIBSVM (Chang
and Lin, 2001). The training corpora are
representative of two classes of texts, i.e. difficult- vs. easy-
to-read ones, both containing newspaper articles.
The set of features exploited for predicting
readability has been shown to capture different aspects
of sentence complexity. The assigned readability
score ranges between 0 (easy-to-read) and 1
(difficult-to-read) and corresponds to the probability
that an unseen document or sentence belongs to
the class of difficult-to-read documents. For the
purposes of our work, we carried out readability
assessment at sentence level, making the analysis
reliable for the comparison with the sentence-based
perplexity of a NLM.</p>
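        <p>READ-IT itself is an SVM trained on rich linguistic features; as a purely illustrative sketch of its input-output contract (feature values in, difficulty probability between 0 and 1 out), the following uses a hypothetical logistic scorer with made-up weights, not the actual trained model:

```python
import math

# Hypothetical weights and bias standing in for the trained classifier;
# READ-IT actually uses an SVM over raw-text, lexical, morpho-syntactic
# and syntactic features.
WEIGHTS = {"avg_word_length": 0.9, "sentence_length": 0.05, "subordinate_clauses": 0.6}
BIAS = -6.0

def readability_score(features):
    """Map sentence features to a difficulty probability in [0, 1],
    where 0 means easy-to-read and 1 means difficult-to-read."""
    z = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# Toy feature vectors for an "easy" and a "hard" sentence.
easy = {"avg_word_length": 4.0, "sentence_length": 8, "subordinate_clauses": 0}
hard = {"avg_word_length": 6.5, "sentence_length": 30, "subordinate_clauses": 2}
print(readability_score(easy))  # well below 0.5 (easy class)
print(readability_score(hard))  # well above 0.5 (difficult class)
```

The point of the sketch is only the interpretation of the output: a score near 0 places the sentence in the easy-to-read class, a score near 1 in the difficult-to-read class.</p>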
        <p>
          GePpeTto. Sentence-level perplexity scores were
computed relying on GePpeTto (De Mattei et al.,
2020). GePpeTto is a generative language model
trained on the Italian language and built using the
GPT-2 architecture
          <xref ref-type="bibr" rid="ref7">(Radford et al., 2019)</xref>
          . The
model was trained on a dump of Italian Wikipedia
(2.8GB) and on the itWac corpus (Baroni et al.,
2009), which amounts to 11GB of web texts. The
perplexity (PPL) of the model was computed as
follows:
        </p>
        <p>PPL = e^(NLL / N)
where NLL and N correspond respectively to the
negative log-likelihood and to the length of each
sentence w1:n = [w1, …, wn] in the datasets.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2 Corpora</title>
        <p>In order to test the reliability of our initial
hypothesis, we chose four corpora containing different
typologies of texts, i.e. web pages, educational
materials, narrative texts, and newspaper and scientific
articles. Each corpus includes a balanced amount
of difficult- and easy-to-read sentences. In
addition, we also considered in the analysis the Italian
Universal Dependency treebank, in order to
verify whether the connection between
sentence-level readability and perplexity also holds in a
well-acknowledged benchmark corpus. For each
dataset, we excluded from our analysis short
sentences, i.e. those with fewer than 5 tokens.</p>
        <p>PACCSS-IT2 (Brunato et al., 2016): we took into
account 125,977 sentences belonging to
PACCSS-IT, a corpus of complex-simple aligned sentences
extracted from the ItWaC corpus. The resource
was built using an automatic approach for
acquiring large corpora of paired sentences able to
intercept structural transformations (such as deletion,
reordering, etc.). For example, the two following
sentences represent a pair in the corpus, where a
reordering operation occurs at the phrase level (i.e. the
subordinate clause precedes vs. follows the main
clause):</p>
      <p>Complex: Ringraziandola per la sua cortese
attenzione, resto in attesa di risposta. [Lit:
Thanking you for your kind attention, I look
forward to your answer.]
Simple: Resto in attesa di una risposta e
ringrazio vivamente per l'attenzione. [Lit: I
look forward to your answer and I thank you
greatly for your attention.]
2http://www.italianlp.it/resources/paccss-it-parallelcorpus-of-complex-simple-sentences-for-italian/
Terence and Teacher3 (Brunato et al., 2015): two
corpora of original and manually simplified texts
aligned at the sentence level. Terence contains short
Italian novels for children and their manually
simplified versions, carried out by linguists and
psycholinguists targeting children with text
comprehension difficulties. Teacher is a corpus of pairs
of documents belonging to different genres (e.g.
literature, handbooks) used in educational settings
manually simplified by teachers. We exploited
1,644 sentences belonging to these corpora.
Multi–Genre Multi–Type Italian corpus: a
collection of Italian texts representative of three
traditional textual genres: Journalism, Scientific prose
and Narrative. Each genre has been internally
subdivided into two sub-corpora representative of an
easy- vs difficult-to-read variety, which was
defined according to the intended target audience for
a given genre. The journalistic prose corpus
includes articles automatically downloaded from the
online versions of two general-purpose
newspapers4, while the “easy” sub-corpus contains
articles from two easy-to-read newspapers5 addressed
to adults with low literacy skills or mild
intellectual disabilities. The scientific prose
collection consists of scholarly publications on
linguistics and computational linguistics and Wikipedia
pages downloaded from the portal “Linguistics”,
representative of the complex and easy variety
respectively. For the narrative genre, we included
long novels written by novelists of the last
century and by contemporary writers in the complex
variety, while for the easy variety we collected
short novels for children. The complete
corpus contains 56,685 sentences.</p>
      </sec>
      <sec id="sec-2-3">
        <title>Italian Universal Dependency Treebank</title>
        <p>
          It includes different sections of the Italian Universal
Dependency Treebank (IUDT), version 2.5
          <xref ref-type="bibr" rid="ref12">(Zeman et al., 2019)</xref>
          . In particular, we considered
two groups: a first one containing the whole Italian
Stanford Dependency Treebank (ISDT)6 (Bosco et
al., 2013), the Italian version of the multilingual
Turin University Parallel Treebank
          <xref ref-type="bibr" rid="ref8">(Sanguinetti
and Bosco, 2015)</xref>
          and the Venice Italian Treebank
(Delmonte et al., 2007) (24,998 sentences), all
containing a mix of textual genres; and a second
one including two collections of texts
representative of social media language, i.e. generic tweets
and tweets labelled for irony (PosTWITA7 and
TWITTIRO8)
          <xref ref-type="bibr" rid="ref9">(Sanguinetti et al., 2018; Cignarella
et al., 2019)</xref>
          (3,660 sentences in total).
3http://www.italianlp.it/resources/terence-and-teacher/
4www.repubblica.it and http://www.ilgiornale.it/
5www.dueparole.it and http://www.informazionefacile.it/
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3 Sentence Perplexity and Readability</title>
      <p>Our analysis starts from a comparison between
the average perplexity and readability scores
obtained for each sentence of the five considered
datasets. As shown in Table 1, readability
values (column ARA) are quite homogeneous across
the datasets, with low standard deviation values.
On the contrary, the range of perplexity scores is
wider (column PPL), going from an average score
of 3,905.83 for PACCSS-IT to 436.75 for the IUDT
miscellaneous portion (Italian UD). These
differences provide a first piece of evidence that
perplexity and readability are not correlated with each other.</p>
      <p>We verified this intuition by computing the
Spearman's rank correlation coefficient between
the perplexity and readability scores for each
dataset. Results are reported in Table 2, column
PPL-ARA. As can be seen, the correlation rates
are statistically significant for all corpora except
Terence and Teacher, possibly because that corpus
is too small to allow a significant comparison. Yet,
contrary to our expectations, the coefficients are so
low that no substantial correlation emerges between
the two metrics for any corpus, suggesting that
perplexity and readability are independent of each other.</p>
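      <p>Spearman's rank correlation, used throughout this analysis, is the Pearson correlation of the rank-transformed scores; a minimal self-contained implementation (ignoring tie correction, and with toy score values rather than the paper's actual data):

```python
def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks of
    the two score lists (no tie correction, for clarity)."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=vs.__getitem__)
        r = [0.0] * len(vs)
        for rank, idx in enumerate(order, start=1):
            r[idx] = float(rank)
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

# Made-up per-sentence scores: here the rankings are exactly reversed,
# so rho is -1; values near 0 would indicate no monotonic relationship.
ppl = [3905.8, 12.4, 520.0, 88.9]
ara = [0.25, 1.0, 0.4, 0.7]
print(spearman(ppl, ara))  # → -1.0
```

Because it works on ranks rather than raw values, the coefficient is insensitive to the very different scales of PPL and ARA.</p>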
      <p>To further investigate the reasons behind these
scores and to deepen the analysis about the
relationship between the two metrics, we investigated
whether they capture the same (or similar)
linguistic properties of the sentences. To this aim,
we tested the presence and strength of the
correlation between each of the two metrics and a
set of 176 linguistic features, which have been
shown to capture properties of sentence
complexity
(Brunato et al., 2018).
6https://github.com/UniversalDependencies/UD_Italian-ISDT</p>
      <p>7https://github.com/UniversalDependencies/UD_Italian-PoSTWITA
8https://universaldependencies.org/treebanks/it_twittiro</p>
      <sec id="sec-3-1">
        <title>Table 1: average perplexity (PPL) per dataset, mean (standard deviation)</title>
        <p>PACCSS-IT: 3,905.83 (21,306.07); Terence-Teacher: 790.85 (5,002.62); Multi-Genre Multi-Type: 570.85 (4,820.12); Italian-UD: 436.75 (3,633.64); Twitter-UD: 986.28 (2,479.64)</p>
        <p>In particular, this
analysis is based on the set of features described in
Brunato et al. (2020), which are acquired from
raw, morpho-syntactic and syntactic levels of
annotation. They range from basic information on
the average sentence and word length, to
lexical information about the internal composition of
the vocabulary of the text (e.g. the distribution of
lemmas belonging to the Basic Italian Vocabulary
(De Mauro, 2000)). They also include
morpho-syntactic information (e.g. the distribution of POS
tags and of the inflectional properties of verbs) and more
complex aspects of sentence structure derived from
syntactic annotation and modeling global and
local properties of parsed tree structure, e.g. the
relative order of subjects and objects with respect
to the verb, the use of subordination. In order
to extract these features, the considered corpora
were morpho-syntactically annotated and
dependency parsed by the UDPipe pipeline
          <xref ref-type="bibr" rid="ref10">(Straka et
al., 2016)</xref>
          , with the exception of the IUDT corpus.
        </p>
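        <p>As a toy illustration of how a handful of these features (sentence length, lemma type/token ratio, UPOS distribution) can be read off a parsed sentence, the sketch below assumes the parse is available as simple (form, lemma, UPOS) triples; the actual feature set is far richer (176 features over full UD annotation):

```python
from collections import Counter

def profile(tokens):
    """Compute a few sentence-level profiling features from
    (form, lemma, UPOS) triples of a parsed sentence."""
    n = len(tokens)
    lemmas = [lemma for _, lemma, _ in tokens]
    upos = Counter(pos for _, _, pos in tokens)
    return {
        "n_tokens": n,                              # sentence length
        "ttr_lemma": len(set(lemmas)) / n,          # lemma type/token ratio
        "upos_dist": {pos: c / n for pos, c in upos.items()},  # POS distribution
    }

# Toy, hand-annotated parse of "Il furto è avvenuto giovedì notte."
sentence = [
    ("Il", "il", "DET"), ("furto", "furto", "NOUN"), ("è", "essere", "AUX"),
    ("avvenuto", "avvenire", "VERB"), ("giovedì", "giovedì", "NOUN"),
    ("notte", "notte", "NOUN"), (".", ".", "PUNCT"),
]
feats = profile(sentence)
print(feats["n_tokens"], feats["upos_dist"]["NOUN"])  # 7 tokens, NOUN share 3/7
```

Each such feature value is then correlated, per dataset, against the sentence's PPL and ARA scores.</p>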
        <p>Column Feats of Table 2 illustrates the results
of this analysis: we report the Spearman’s
correlation coefficients between the two rankings of
linguistic features, each ordered by strength of
correlation between feature value and perplexity score
and readability score respectively. Once again we
observe rather weak correlation values, with the
only exception of Italian-UD, which reports
a medium correlation (.332).
Overall, these results corroborate our previous findings
that the two metrics are not particularly related
to each other, and they further suggest that the
linguistic phenomena affecting the perplexity of
a NLM and the readability level of a sentence are
very different. Consider for example the two
following sentences:</p>
        <p>(1) Il furto è avvenuto giovedì notte. [The theft took place on Thursday night.]</p>
        <p>(2) Il comitato di bioetica: no all'eutanasia. [The bioethics committee: no to euthanasia.]</p>
        <p>While (1) is very easy to read, with a
readability score of 0.25, it has a quite high perplexity
score (PPL = 40,737.81); conversely, (2) is quite
difficult to read (ARA = 1) but has a very low
perplexity score (PPL = 11.24).</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 In-Depth Linguistic Investigation</title>
      <p>To better explore the motivation behind these
results, we performed an in-depth investigation
aimed at understanding the relationship between our
set of linguistic features and the two metrics taken
into consideration. Since we noticed that for all
datasets a higher number of features correlates
with ARA than with PPL, we selected those that
are significantly correlated with both metrics.
The number of shared features varies for each
dataset, depending on its size. For example, for
the two smallest ones, i.e. Terence and Teacher
and the UD Twitter Treebank, we could only
consider 34.65% (61) and 44.88% (79) of the whole
set of features respectively, while for the larger
corpora the sub-set is wider: 81.81% (144) in
PACCSS-IT, 78.97% (139) for Multi-Genre
Multi-Type and 84.65% (149) for the Italian UD Treebank.</p>
      <p>Table 3 shows the top ten features for each
dataset, i.e. those that obtained the strongest
correlation with both PPL and ARA. As expected,
correlations are generally stronger between
linguistic features and readability scores, although they
are lower than expected. This could be due to the
fact that, even if the READ-IT classifier is trained
with a similar set of features, the non-linear
feature space makes it difficult to identify clear
correlations with individual features. Similarly, our
set of features seems to play only a marginal role
in perplexity. However, this is not the case for the
PACCSS-IT corpus, for which the set of
considered linguistic features shows a higher correlation
with PPL. This can possibly be related to the
partial overlap between the GePpeTto training data
and the PACCSS-IT sentences, since the latter are
drawn from the ItWaC corpus, which is included in
GePpeTto's training data.</p>
      <p>Inspecting these results, we can also observe
that correlations between features and PPL seem
to be more affected by genre–specific
characteristics.</p>
      <p>This is particularly clear if we
consider the Italian UD Twitter treebank, for which
among the top ten most correlated features we
find some characterising social media
language, e.g. symbols (upos-xpos dist SYM) or the
vocative relation, which marks a dialogue
participant addressed in the text
(dep dist vocative:mention).</p>
    </sec>
    <sec id="sec-5">
      <title>5 Conclusion</title>
      <p>The paper presented a study aimed at investigating
the relationship between two metrics computed at
sentence level, i.e. the perplexity of a state-of-the-art
NLM for the Italian language and the readability score
automatically assigned to a sentence by a
supervised classifier. We carried out our analysis
considering several datasets differing in
textual genre and language variety. Comparing
the rankings obtained using the two metrics, we
could not find any significant correlation, either
between the scores of the two metrics or with
respect to the set of linguistic features that most
impact their values. Further investigation within
this line of research will explore whether we can
draw the same observations when a different NLM
is exploited to compute sentence perplexity.</p>
      <p>Marco Baroni, Silvia Bernardini, Adriano Ferraresi,
and Eros Zanchetta. 2009. The WaCky wide
web: a collection of very large linguistically
processed web-crawled corpora. Language Resources
and Evaluation, 43(3):209–226.</p>
      <p>Marc Benzahra and François Yvon. 2019.
Measuring text readability with machine comprehension: a
pilot study. In Proceedings of the Fourteenth
Workshop on Innovative Use of NLP for Building
Educational Applications, pages 412–422, Florence, Italy,
August. Association for Computational Linguistics.</p>
      <p>Cristina Bosco, Simonetta Montemagni, and Maria
Simi. 2013. Converting Italian treebanks: Towards
an Italian Stanford dependency treebank. In
Proceedings of the ACL Linguistic Annotation Workshop
&amp; Interoperability with Discourse, Sofia, Bulgaria, August.</p>
      <p>Dominique Brunato, Felice Dell’Orletta, Giulia
Venturi, and Simonetta Montemagni. 2015. Design and
annotation of the first italian corpus for text
simplification. In Proceedings of The 9th Linguistic
Annotation Workshop, pages 31–41.</p>
      <p>Dominique Brunato, Andrea Cimino, Felice
Dell’Orletta, and Giulia Venturi. 2016.
PaCCSSIT: A parallel corpus of complex-simple sentences
for automatic text simplification. In Proceedings of
the 2016 Conference on Empirical Methods in
Natural Language Processing, pages 351–361, Austin,
Texas, November. Association for Computational
Linguistics.</p>
      <p>Dominique Brunato, Lorenzo De Mattei, Felice
Dell’Orletta, Benedetta Iavarone, and Giulia
Venturi. 2018. Is this sentence difficult? do you agree?
In Proceedings of the 2018 Conference on
Empirical Methods in Natural Language Processing, pages
2690–2699, Brussels, Belgium, October-November.</p>
      <p>Association for Computational Linguistics.</p>
      <p>Dominique Brunato, Andrea Cimino, Felice
Dell’Orletta, Giulia Venturi, and Simonetta
Montemagni. 2020. Profiling-UD: a tool for
linguistic profiling of texts. In Proceedings of
The 12th Language Resources and Evaluation
Conference, pages 7145–7151, Marseille, France,
May. European Language Resources Association.
</p>
      <p>Chih-Chung Chang and Chih-Jen Lin. 2001. LIBSVM:
a library for support vector machines.</p>
      <p>
Alessandra Teresa Cignarella, Cristina Bosco, and
Paolo Rosso. 2019. Presenting TWITTIRÒ-UD:
An Italian Twitter treebank in Universal
Dependencies. In Proceedings of the Fifth International
Conference on Dependency Linguistics (Depling,
SyntaxFest 2019).</p>
      <p>Trevor Cohen and Serguei Pakhomov. 2020. A tale
of two perplexities: Sensitivity of neural language
models to lexical retrieval deficits in dementia of the
Alzheimer’s type. In Proceedings of the 58th
Annual Meeting of the Association for Computational
Linguistics, pages 1946–1957, Online, July.
Association for Computational Linguistics.</p>
      <p>Lorenzo De Mattei, Michele Cafagna, Felice
Dell’Orletta, Malvina Nissim, and Marco Guerini.
2020. Geppetto carves italian into a language
model. arXiv preprint arXiv:2004.14253.</p>
      <p>Tullio De Mauro. 2000. Il dizionario della lingua
italiana, volume 1. Paravia.</p>
      <p>Felice Dell’Orletta, Simonetta Montemagni, and
Giulia Venturi. 2011. READ–IT: Assessing
readability of Italian texts with a view to text simplification.
In Proceedings of the Second Workshop on Speech
and Language Processing for Assistive
Technologies, pages 73–83, Edinburgh, Scotland, UK, July.
Association for Computational Linguistics.</p>
      <p>Rodolfo Delmonte, Antonella Bristot, and Sara Tonelli.
2007. VIT - Venice Italian Treebank: Syntactic and
quantitative features. In Proceedings of the Sixth
International Workshop on Treebanks and Linguistic
Theories.</p>
      <p>V. Demberg and Frank Keller. 2008. Data from
eyetracking corpora as evidence for theories of syntactic
processing complexity. Cognition, 109:193–210.
Keyur Gabani, Melissa Sherman, Thamar Solorio,
Yang Liu, Lisa Bedore, and Elizabeth Peña. 2009.
A corpus-based approach for the prediction of
language impairment in monolingual English and
Spanish-English bilingual children. In Proceedings
of Human Language Technologies: The 2009
Annual Conference of the North American Chapter
of the Association for Computational Linguistics,
pages 46–55, Boulder, Colorado, June. Association
for Computational Linguistics.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Pablo</given-names>
            <surname>Gamallo</surname>
          </string-name>
          , José Ramom Pichel, and Iñaki Alegria.
          <year>2017</year>
          .
          <article-title>A perplexity-based method for similar languages discrimination</article-title>
          .
          <source>In VarDial2017 workshop at EACL 2017. Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects</source>
          , pages
          <fpage>109</fpage>
          -
          <lpage>114</lpage>
          ,Valencia, Spain, April 3,
          <year>2017</year>
          . Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>M. González</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>An analysis of twitter corpora and the differences between formal and colloquial tweets</article-title>
          . In TweetMT@SEPLN.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Adam</given-names>
            <surname>Goodkind</surname>
          </string-name>
          and
          <string-name>
            <given-names>Klinton</given-names>
            <surname>Bicknell</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Predictive power of word surprisal for reading times is a linear function of language model quality</article-title>
          . In Asad B.
          <string-name>
            <surname>Sayeed</surname>
          </string-name>
          , Cassandra Jacobs, Tal Linzen, and Marten Van Schijndel, editors,
          <source>Proceedings of the 8th Workshop on Cognitive Modeling and Computational Linguistics</source>
          ,
          <string-name>
            <surname>CMCL</surname>
          </string-name>
          <year>2018</year>
          , Salt Lake City, Utah, USA, January
          <volume>7</volume>
          ,
          <year>2018</year>
          , pages
          <fpage>10</fpage>
          -
          <lpage>18</lpage>
          . Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Ganesh</given-names>
            <surname>Jawahar</surname>
          </string-name>
          , Benoît Sagot, and Djamé Seddah.
          <year>2019</year>
          .
          <article-title>What does BERT learn about the structure of language?</article-title>
          In 57th Annual Meeting of the Association for Computational Linguistics (ACL), Florence, Italy.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Matej</given-names>
            <surname>Martinc</surname>
          </string-name>
          , Senja Pollak, and Marko Robnik-Šikonja.
          <year>2019</year>
          .
          <article-title>Supervised and unsupervised neural approaches to text readability</article-title>
          .
          <source>Computing Research Repository, arXiv:1503.06733. Version</source>
          <volume>2</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Alessio</given-names>
            <surname>Miaschi</surname>
          </string-name>
          , Dominique Brunato, Felice Dell'Orletta,
          <string-name>
            <given-names>and Giulia</given-names>
            <surname>Venturi</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Linguistic profiling of a neural language model</article-title>
          . arXiv preprint arXiv:2010.01869.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Alec</given-names>
            <surname>Radford</surname>
          </string-name>
          , Jeffrey Wu, Rewon Child, David Luan,
          <string-name>
            <given-names>Dario</given-names>
            <surname>Amodei</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Ilya</given-names>
            <surname>Sutskever</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Language models are unsupervised multitask learners</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Manuela</given-names>
            <surname>Sanguinetti</surname>
          </string-name>
          and
          <string-name>
            <given-names>Cristina</given-names>
            <surname>Bosco</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>ParTUT: the Turin University Parallel Treebank</article-title>
          . In Roberto Basili et al., editors,
          <source>Harmonization and Development of Resources and Tools for Italian Natural Language Processing within the PARLI Project, pages 51-69</source>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Manuela</given-names>
            <surname>Sanguinetti</surname>
          </string-name>
          , Cristina Bosco, Alberto Lavelli, Alessandro Mazzei, and
          <string-name>
            <given-names>Fabio</given-names>
            <surname>Tamburini</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>PoSTWITA-UD: an Italian Twitter treebank in Universal Dependencies</article-title>
          .
          <source>In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>M.</given-names>
            <surname>Straka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hajič</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Straková</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>UDPipe: trainable pipeline for processing CoNLL-U files performing tokenization, morphological analysis, POS tagging and parsing</article-title>
          .
          <source>In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC).</source>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Ian</given-names>
            <surname>Tenney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Dipanjan</given-names>
            <surname>Das</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Ellie</given-names>
            <surname>Pavlick</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>BERT rediscovers the classical NLP pipeline</article-title>
          .
          <source>In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>
          , pages
          <fpage>4593</fpage>
          -
          <lpage>4601</lpage>
          , Florence, Italy, July. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Daniel</given-names>
            <surname>Zeman</surname>
          </string-name>
          , Joakim Nivre, Mitchell Abrams, et al.
          <year>2019</year>
          .
          <article-title>Universal Dependencies 2.5</article-title>
          .
          <source>LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL)</source>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>