<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Assessing the Usefulness of Different Feature Sets for Predicting the Comprehension Difficulty of Text</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Brian Mac Namee</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>John D. Kelleher</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Noel Fitzpatrick</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dublin Institute Of Technology</institution>
          ,
          <country country="IE">Ireland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Computer Science, University College Dublin</institution>
          ,
          <country country="IE">Ireland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Within English second language acquisition there is an enthusiasm for using authentic texts as learning materials in classroom and online settings. This enthusiasm, however, is tempered by the difficulty of finding authentic texts at suitable levels of comprehension difficulty for specific groups of learners. An automated way to rate the comprehension difficulty of a text would make finding suitable texts a much more manageable task. While readability metrics have been in use for over 50 years, they capture only a small amount of what constitutes comprehension difficulty. In this paper we examine other features of texts that are related to comprehension difficulty and assess their usefulness in building automated prediction models. We investigate readability metrics, vocabulary-based features, and syntax-based features, and show that the best prediction accuracies are possible with a combination of all three.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Within English second language acquisition there is a fundamental difficulty in
defining what is meant by authentic as opposed to non-authentic or artificial
language usage. For example, is authentic usage only the remit of
geographical countries where English is the first language or an official language of
communication? Within language teaching an opposition can be made between
language usage that is fabricated for the teaching of English as a second
language (ESL) and language usage that is not fabricated. This shift between
forms of usage can be seen in the text books used in the learning of ESL,
where fabricated sentences are often used to highlight specific forms of language
or adapted material is incorporated into the reading and listening material.</p>
      <p>
        Proponents of authentic usage tend to highlight the authentic as
capturing what language is as socio-linguistic utterance in context [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. For example,
Cambridge University Press, one of the major text book publishers in English
language teaching (ELT), has a discussion board that highlights the main
advantages of using authentic materials in the classroom. The advantages listed
include: helping students to learn how to communicate in the real world,
learning language in context, and increased motivation for learners
(http://bit.ly/2xLHXWh).
      </p>
      <sec id="sec-1-1">
        <p>There are, however, some disadvantages to using authentic material in
English language teaching. Foremost amongst these is that the language is not
primarily designed for learning but for communication between native speakers.
This can mean that the level of language used in authentic material is too
difficult, in terms of the complexity of sentences and, more importantly, the use
of unfamiliar words or idiomatic expressions. Authentic, but difficult, texts can
make the gap between the presumed level of the student or class and the
difficulty of the text too big, leading students to quickly lose their motivation.
Reliable methods to automatically determine the comprehension difficulty of a
text could greatly mitigate these disadvantages by making it easy for teachers or
learners to source authentic materials of an appropriate level. Readability metrics
are one long-standing approach to doing this.</p>
        <p>Readability is a term used to refer to the overall understandability or
comprehension level of a text. There are a number of established, widely used readability
metrics in the literature, such as Flesch and FOG (see Section 3.1 for more
details on readability metrics). Whilst these metrics go some way towards
determining the comprehension difficulty of text, in general they all tend to focus
on specific, narrow features of the language used: most readability metrics are
defined as functions over counts of word syllables and/or sentence length. As
W.H. DuBay points out: "The variables used in the readability formulas show us
the skeleton of a text. It is up to us to flesh out that skeleton with tone, content,
organization, coherence, and design" [6, p. 56]. DuBay's analysis
highlights that there are many more features of the language used in a text,
beyond those modelled by traditional readability metrics, that impinge on the
comprehension difficulty of that text. Investigating these features is the
motivation behind the work described in this paper. We analyse how useful different
sets of features of the language used in a text are in modelling the comprehension
difficulty of that text. In this work we consider readability metrics, syntax-based
features, and vocabulary-based features.</p>
        <p>The paper is structured as follows: Section 2 describes how we designed and
created a dataset of texts annotated by comprehension difficulty; Section 3
describes the features we created and used in our models; Section 4 describes
the different models we trained and presents the results of our evaluation
experiments; in Section 5 we conclude the paper by discussing our results and
highlighting some areas for future research.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Data</title>
      <p>
        In order to build models to assess the usefulness of different features of the
language used in text to predict comprehension difficulty we needed a dataset of
texts annotated by comprehension difficulty. The first design decision in creating
this dataset was to decide on the comprehension difficulty levels that we would
use for annotation. One option would have been to use the Common European
Framework of Reference for Languages (CEFR) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] levels for annotation.
Indeed, over the last number of years the development of the CEFR has led to
increased awareness of more nuanced understandings of language levels for
learners. However, after a review of the CEFR it was decided that, for the purposes
of this project, the CEFR levels did not give enough detail in terms of language
difficulty for them to be incorporated. Instead we based our
comprehension difficulty annotations on the traditional English as a Second Language
levels, which closely follow the Cambridge levels: Beginner, Elementary, Lower
Intermediate, Intermediate, Upper Intermediate and Advanced.
      </p>
      <p>Next we collected a corpus of texts whose original purpose was not ESL. The
corpus contained 948 texts from a range of international English language online
news sources that we expected to include texts at different comprehension
difficulty levels. The average length of these texts is 457.5 words (with a standard
deviation of 379.7). We hired a number of ESL teachers to annotate these texts
with difficulty levels through a bespoke annotation tool that presented texts to
annotators in random order. Our review of the annotations revealed that there
were no low-level beginner texts in the corpus; this is to be expected, as authentic
texts at this level are rare. This left us with five levels of difficulty (ESL levels
Elementary to Advanced). The distribution of difficulty levels within the corpus
is shown in Figure 1.</p>
    </sec>
    <sec id="sec-3">
      <title>Feature Design</title>
      <p>There is a wide range of potential descriptive features that could be used in
building a predictive model of text comprehension difficulty. The high-level
domain concepts identified as important in this problem were: existing readability
measures and related features, features based on the vocabulary in a text, and
features based on a syntactic analysis of the text. In the following sections we
describe the sets of features we developed and used from each of these domains.</p>
      <sec id="sec-3-1">
        <title>Readability Metrics</title>
        <p>
          There are a number of well-known readability metrics, for example: FOG [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ],
Flesch [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], and Coleman-Liau [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. These metrics attempt to measure how easy
it is to read a piece of text and are generally a function of the word length
(either in terms of syllables or characters) and/or sentence length in a text.
For example, Equation 1 defines the calculation of the Flesch readability metric
[
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. In the case of the Flesch metric the readability scores range between 0 and
100, where 0 indicates that the text is unreadable and 100 indicates that the
text is extremely easy to read. For several of these readability metrics mappings
between the metric scores and school levels have been proposed.
        </p>
        <p>Flesch = 206.835 - 1.015 × (total words / total sentences) - 84.6 × (total syllables / total words)   (1)</p>
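For concreteness, Equation 1 can be restated as executable code. The following is a minimal sketch; the function and parameter names are ours, not part of any standard library:

```python
def flesch(total_words, total_sentences, total_syllables):
    """Flesch reading-ease score (Equation 1): higher scores mean easier text."""
    return (206.835
            - 1.015 * (total_words / total_sentences)
            - 84.6 * (total_syllables / total_words))
```

For example, a 100-word text with 10 sentences and 120 syllables scores approximately 95.2, i.e. extremely easy to read.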
        <p>
          Figure 2 presents a scatter plot matrix (SPLOM) illustrating the linear
relationships between a number of standard readability metrics: Flesch,
Automated Readability Index (ARI) [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], Fog, Lix [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], SMOG [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], and Coleman-Liau. The
graphs along the main diagonal of the SPLOM present a density plot of the scores
generated by the corresponding readability metric when it is applied to the documents
in our corpus. The off-diagonal scatterplots reveal that many of these readability
metrics have strong linear relationships. For example, the Lix and SMOG metrics
have a very strong positive linear relationship. These strong linear relationships
indicate that many of these metrics
are measuring close variants of the same thing. Some of the metrics, however,
do appear to capture other aspects of readability. For example, in
the scatter-plots that plot the ARI metric against Flesch, the linear
pattern evident in many of the other scatter-plots breaks down.
        </p>
        <p>As noted in the introduction, readability metrics do not provide a full measure
of the comprehension difficulty of a text. For example, a text that includes many
idiomatic phrases or novel turns of phrase may get a good readability score, but
this does not indicate that it will be easy to understand or comprehend such
text. That said, readability metrics do provide an objective standard and do
provide some information regarding comprehension difficulty. In developing our
predictive models we considered all of the readability metrics shown in Figure 2
as input features.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Vocabulary-based Features</title>
        <p>The words used in a text can have a direct impact on the comprehension
difficulty of the text. The use of complex words is likely to make a text more
difficult to read and to comprehend. This is why so many readability metrics
use some measure of word length (syllable or character count) as a proxy for
word complexity in the calculation of readability. A striking example of this is
the FOG (also known as Gunning Fog) readability metric, which explicitly takes
the number of complex words into account in its calculation, see Equation 2.
FOG defines complex words as those words with three or more syllables (where
common suffixes, e.g. -ed, -ing, etc., are not counted as syllables) and which are
not proper nouns, familiar words, or compound nouns.</p>
        <p>FOG = 0.4 × ((total words / total sentences) + 100 × (complex words / total words))   (2)</p>
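Equation 2 translates just as directly into code; counting the complex words themselves (syllable counting, suffix stripping, proper-noun filtering) is assumed to have been done already in this sketch:

```python
def fog(total_words, total_sentences, complex_words):
    """Gunning FOG index (Equation 2): higher scores mean harder text."""
    return 0.4 * ((total_words / total_sentences)
                  + 100.0 * (complex_words / total_words))
```

A 100-word text of 5 sentences containing 10 complex words scores 0.4 × (20 + 10) = 12.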
        <p>The FOG metric is an example of a readability metric that relies on word
categories (e.g., familiar words, proper nouns, etc.) as operationalised by
pre-specified lists. A challenge for such metrics is how best to define these word
lists.</p>
        <p>While the occurrence of specific words might work as a predictor of text
difficulty, intuitively documents that include larger numbers of generally rare
words are likely to be more difficult to understand than documents that
primarily use more common words. To capture this intuition we used what we refer to
as rare-word features, which measure the predominance of rare words within a
document in a generalisable way.</p>
        <p>
          We based our rare-word features on word frequencies from the British
National Corpus (BNC). We chose the BNC as our background corpus
because it is a balanced sampled corpus, so it is reasonable to extrapolate from
the frequencies found in the BNC to general English [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. We defined our rare-word
features by binning the words in the BNC into 9 bins based on word frequency.
A challenge faced in the definition of any binning process, however, is to define
appropriate thresholds between bins. In this case, the challenge was to define
thresholds between common, rare and very rare words. Noting that the word
frequencies in the BNC follow a Zipf distribution, we defined our bins such that
each subsequent bin contained the most common remaining words and the set
of words in each bin would account for a predefined percentage of the tokens in
the corpus. For example, Bin 1 contained the set of most frequent words in the
BNC such that these words accounted for approximately 50% of the tokens in
the BNC (this bin contained the 63 most common words in the corpus). Bin 2
contained the set of next most frequent words such that together these words
accounted for 25% of the tokens in the BNC (this bin contained 822 words). The
other bins were defined in a similar way: Bin 3 contained the remaining most
common words that together accounted for 10% of the tokens in the corpus, the
words in Bin 4 accounted for 5% of the tokens in the corpus, Bin 5 also accounted
for 5% of the tokens, Bin 6 accounted for 2%, and Bins 7, 8 and 9 accounted
for 1% each. Once we had defined our bins we represented the distribution of
common and rare words in a document by calculating the percentage of words
in the document that belong within each bin. For example, the Bin 1
percentage feature recorded the percentage of words in the document that belonged
to Bin 1. Consequently, we developed 9 features based on our word frequency
bins: Bin1%, Bin2%, …, Bin9%. Together, these bin percentage features give
an overall sense of the number of very common and very rare words, as well as
everything in between, in a document.
        </p>
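The binning scheme can be sketched as follows. This is an illustration of the procedure described above rather than the exact implementation used: the cumulative coverage targets encode the 50/25/10/5/5/2/1/1/1 percentages, and we assume that words unseen in the background corpus fall into the rarest bin.

```python
from collections import Counter

# Cumulative shares of corpus tokens covered by bins 1..9:
# 50%, +25%, +10%, +5%, +5%, +2%, +1%, +1%, +1%.
CUMULATIVE_TARGETS = [0.50, 0.75, 0.85, 0.90, 0.95, 0.97, 0.98, 0.99, 1.00]

def build_bins(freqs):
    """Assign each word in a background-corpus frequency table to a bin 1-9,
    most frequent words first, advancing to the next bin at each target."""
    total = sum(freqs.values())
    bins, covered, bin_idx = {}, 0.0, 0
    for word, count in sorted(freqs.items(), key=lambda kv: -kv[1]):
        bins[word] = bin_idx + 1
        covered += count / total
        while bin_idx < 8 and covered >= CUMULATIVE_TARGETS[bin_idx]:
            bin_idx += 1
    return bins

def bin_percentages(tokens, bins):
    """The Bin1%..Bin9% features for one document (unseen words -> bin 9)."""
    counts = Counter(bins.get(token, 9) for token in tokens)
    return [100.0 * counts[b] / len(tokens) for b in range(1, 10)]
```

Applied to a document's token list, `bin_percentages` yields the nine features described above, which always sum to 100%.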
        <p>
          We created two other vocabulary-based features: one measured the lexical
diversity of a document and the other the frequency of named entities in a
document. Lexical diversity measures the range of different words used in a
document, with a greater range indicating a higher diversity [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. Lexical diversity
is often used as a measure of text difficulty and to measure the language
competency of writers. For example, lexical diversity has been used in studies to
measure the language competency skills of foreign and second language learners [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
In our work we used a basic and intuitive measure of lexical diversity known
as the type-token ratio (TTR) [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]. TTR is calculated as the number of unique
words in a text (types) divided by the number of words in the text (tokens), see
Equation 3. TTR values range from 0 to 1 with a higher number indicating
greater lexical diversity.
        </p>
        <p>TTR = count of unique words / total words   (3)</p>
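In code, the TTR of Equation 3 is a one-liner over a tokenised text (whether tokens are first lower-cased is a design choice we leave open here):

```python
def type_token_ratio(tokens):
    """TTR (Equation 3): number of unique word types over total tokens."""
    return len(set(tokens)) / len(tokens)
```

For example, `["the", "cat", "sat", "on", "the", "mat"]` gives 5/6 ≈ 0.83, since "the" repeats.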
        <p>
          The final vocabulary feature we used was the percentage of words within a
document that are part of named entity expressions. We identified the named
entities in the text using the named entity recognition module of the Stanford
CoreNLP software [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. The motivation for including a feature based on named
entities in our work was that named entities often pose difficulties to ESL
students, particularly those who come from a different cultural background to
that in which the authentic text was generated.
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>Syntax-based Features</title>
        <p>
          The occurrence of particular parts of speech and/or syntactic structures may
affect the difficulty of a text from an ESL perspective. For example,
prepositions and prepositional clauses, conjunctions, subjunctions, adverbs and
adverbial clauses can all pose difficulties to ESL students. To capture these syntactic
phenomena within our models we generated a set of features by first parsing the
texts and then generating features from the parse tree annotations. We parsed
the texts using the Stanford CoreNLP software [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. The Stanford CoreNLP
outputs parse trees annotated with the Penn Treebank tagset; for more details
on the tagset see [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ].
        </p>
        <p>The first set of features we created from the parse trees were the percentages
of each word-level part-of-speech (POS) tag in each text. These POS percentages
were generated by simply dividing the count of occurrences of each POS tag in
a text by the total number of POS tags in the text. The second set of syntactic
features generated from the parse trees measured the distribution of syntactic
tags in each text (e.g. tags such as ADJP adjective phrase, SBAR subordinate
clause, etc.). These features were defined in a very similar manner to the POS
percentage features: we simply counted the number of occurrences of each
syntactic tag in the parse trees generated from a text and divided these counts by
the total number of syntactic tags in that parse tree set.</p>
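Both sets of percentage features reduce to the same counting pattern. A minimal sketch over a document's tag sequence, where the `tagset` argument would be the Penn Treebank POS or syntactic tag inventory:

```python
from collections import Counter

def tag_percentages(tags, tagset):
    """Percentage of each tag in a document's tag sequence. Tags outside
    `tagset` still contribute to the denominator."""
    counts = Counter(tags)
    return {tag: 100.0 * counts[tag] / len(tags) for tag in tagset}
```

The same function serves for word-level POS tags and for phrasal syntactic tags; only the tag sequence extracted from the parse trees differs.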
        <p>Inspired somewhat by the relationship between lexical diversity and text
difficulty, we created two features to capture the diversity of POS and syntactic
tags within a text: the first feature simply counted the number of different POS
tags that occurred at least once in the trees generated from a text;
similarly, the second diversity feature counted the number of different syntactic
tags that occurred at least once in those trees. This was
based on an intuition that a greater range of POS tags or syntactic tags within
a single document could cause comprehension difficulties.</p>
        <p>The last three features we generated from the parse trees were designed to
capture the complexity of the sentences in a text. In 1979 Flesch motivated the
inclusion of a parameter based on the length of a sentence within his readability
scores as follows:</p>
        <p>The longer the sentence, the more ideas your mind has to hold in
suspense until its final decision on what all the words mean together. Longer
sentences are more likely to be complex – more subordinate clauses, more
prepositional phrases and so on. That means more mental work for the
reader. So the longer a sentence, the harder it is to read. [9, p. 22]</p>
        <p>Certainly Flesch's argument is a good motivation for including sentence
length in a measure of readability, and also in a measure of text difficulty.
However, sentence length alone does not do justice to the potential differences in
difficulty between sentences of the same length. For example, a sentence that
includes multiple clause embeddings is likely to be more difficult to comprehend
than a sentence of a similar length that is composed of a simple (if long) list.
To capture these aspects of complexity we created three features based on
the set of parse trees generated from the text. These were:
1. Max Embeddedness: the maximal phrasal parse tree depth for sentences
within a text (implemented by iterating through the parse trees for a text,
linearising each parse tree, then reading each linearised parse tree from left
to right and, during this iterative process, keeping track of the maximum
number of open brackets encountered at any point).
2. Average Embeddedness: the average phrasal parse tree depth for sentences
within a text (implemented in a similar way to Max Embeddedness).
3. Average Phrasal Parse Tree Nodes: simply the average number of phrasal
parse tree nodes in sentences within a text.</p>
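The bracket-counting procedure described for Max Embeddedness can be sketched directly; the input here is assumed to be the document's parse trees already linearised to bracketed strings:

```python
def max_embeddedness(linearised_trees):
    """Maximum open-bracket depth across a document's linearised parse trees,
    i.e. the deepest phrasal embedding found in any of its sentences."""
    deepest = 0
    for tree in linearised_trees:
        depth = 0
        for ch in tree:
            if ch == "(":
                depth += 1
                deepest = max(deepest, depth)
            elif ch == ")":
                depth -= 1
    return deepest
```

For the single tree `(S (NP (DT the) (NN cat)) (VP (VBD sat)))` the maximum depth is 3; Average Embeddedness tracks the per-tree maxima and averages them instead.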
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Models</title>
      <p>To evaluate the power of the different feature sets described in Section
3 to predict the comprehension difficulty of a document we built and evaluated
multi-variate prediction models using each feature subset: readability metrics,
vocabulary-based features, and syntax-based features. We also consider the
performance of a model using the full combined set of features generated, and a
model using only a subset of what appear to be the most useful features.</p>
      <p>To address the class imbalance in our dataset (we have many more
documents at the intermediate level than at any other level, see Figure 1) we
converted from a categorical classification problem across the five different levels
to a numeric prediction problem in which each level is associated with a numeric
score. The mappings are as follows: elementary: 10, lower-intermediate: 30,
intermediate: 50, upper-intermediate: 70, and advanced: 90. For prediction problems
with ordinal targets this is a sensible approach to handling class imbalance.</p>
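The level-to-score mapping, together with an obvious way to map a model's numeric prediction back to a level, looks like the following; the `nearest_level` helper is our own illustration, not something the paper specifies:

```python
LEVEL_SCORES = {"elementary": 10, "lower-intermediate": 30, "intermediate": 50,
                "upper-intermediate": 70, "advanced": 90}

def nearest_level(prediction):
    """Map a numeric model prediction back to the closest difficulty level."""
    return min(LEVEL_SCORES, key=lambda level: abs(LEVEL_SCORES[level] - prediction))
```

With this encoding a regression model's errors respect the ordering of the levels: predicting 70 for an intermediate (50) text is penalised less than predicting 90.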
      <p>
        To select a subset of the most useful features from the set available we use a
simple rank-and-prune feature selection approach [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. An importance score for
each feature, capturing the strength of its relationship with the numeric
comprehension difficulty target, is calculated, and features are ordered from strongest
to weakest according to these scores. In this case we calculate the F-score [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] for
each feature. The features with the top 30 scores (chosen to reduce the feature
set to approximately one third of its original size) are selected for inclusion in
the feature selection set. These features and their importance scores are shown
in Table 1; the highest-ranked include: Number of Unique POS Tags Used,
Number of Unique Syn. Tags Used, Word Count, Maximum Embeddedness,
Lexical Diversity, SMOG Readability Metric, Average Sentence Length, Flesch
Readability Metric, FOG Readability Metric, Average Embeddedness, Average
Phrasal Parse Tree Nodes, Lix Readability Metric, Coleman-Liau Readability
Metric, % Cardinal Number (CD) POS Tags, ARI Readability Metric, Average
Word Length, % Adjective (JJ) POS Tags, % Noun, Plural (NNS) POS Tags,
% Preposition (IN) POS Tags, % Fragment (FRAG) Syn. Tags, % Prepositional
Phrase (PP) Syn. Tags, % Verb, Gerund (VBG) POS Tags, % Symbol (SYM)
POS Tags, % Bin-7 Vocabulary, and % Unknown (X) Syn. Tags.
It is interesting to note that the most useful features found are the
syntax-based counts of the variety of POS and syntactic tags used in a text. It is also
interesting that a mixture of simple measures (e.g. word count), readability
metrics, and vocabulary-based and syntax-based features is selected rather than
a large set of a single type.
      </p>
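The rank-and-prune procedure itself is straightforward. The sketch below uses a squared Pearson correlation as a stand-in importance score; the paper ranks features by the F-score of [3], which we do not reproduce here:

```python
def correlation_score(xs, ys):
    """Squared Pearson correlation between one feature column and the target."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy) if sxx and syy else 0.0

def rank_and_prune(features, target, k):
    """Keep the names of the k features with the strongest target relationship."""
    ranked = sorted(features, key=lambda name: -correlation_score(features[name], target))
    return ranked[:k]
```

Any scoring function with the same signature (including the F-score) can be dropped in place of `correlation_score` without changing the ranking machinery.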
      <p>
        In all cases the models used are support vector regression (SVR) models [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ],
as implemented in the Python scikit-learn package (http://scikit-learn.org).
SVR models were chosen as they have been widely shown to be effective across
a broad range of multi-variate prediction problems and deal well with features
displaying strong co-linearity (for example the different readability metrics).
      </p>
      <p>To evaluate the performance of each model we perform a 10-fold cross
validation experiment, measuring model performance using mean absolute prediction
error.</p>
      <sec id="sec-4-1">
        <p>The performances of the models built using the five different feature sets
are shown in Table 2. We can see that the model built using the selected feature
subset performs best out of the models tested, although it is only very marginally
better than the model trained using the full feature set.</p>
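The two ingredients of this evaluation, mean absolute error and 10-fold splitting, can be sketched independently of the underlying SVR model:

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between annotated and predicted scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def k_fold_splits(n_examples, k=10):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation;
    every example appears in exactly one test fold."""
    fold_sizes = [n_examples // k + (1 if i < n_examples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_examples))
        yield train, test
        start += size
```

With the level encoding of Section 4, an MAE of 20 corresponds on average to being one difficulty level away from the annotated label.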
        <p>We can illustrate the ability of these models to distinguish between documents
of different comprehension difficulty through boxplots that show the
distribution of predictions for each difficulty level. These are shown in Figure 3.
We include the boxplot for the FOG readability metric as a baseline, as well as
the other feature sets. These boxplots clearly show a progression in the ability
of models trained using different feature sets to separate texts into the different
comprehension difficulty levels with which they are labelled.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>The ability to automatically rate the comprehension difficulty of texts would
greatly reduce the challenge of using authentic texts in ESL classrooms and
online services. While the use of readability metrics has been demonstrated to be
a very useful determination of the general level of a text, this is not sufficient
for rating comprehension difficulty. Comprehension difficulty is influenced by
more features than just the simple measures of word and sentence complexity
incorporated into readability metrics. In this paper we describe an analysis of
the different types of features of texts that are useful for predicting readability. We
consider three different groups of features: readability metrics, syntax-based
features, and vocabulary-based features. We base this analysis on a corpus of 948
texts collected from a range of international English language online news sources
that were expertly annotated with the traditional English as a Second Language
levels (Beginner, Elementary, Lower Intermediate, Intermediate, Upper
Intermediate and Advanced), of which five occurred in the corpus. We perform our
analysis by building and evaluating predictive models using different feature
subsets extracted from the document corpus.</p>
      <p>The first thing this analysis illustrates is a confirmation that, although
readability metrics can provide some indication of the comprehension difficulty
of a text for ESL, they are not sufficient to do the job of automatic rating
accurately. This result highlights that it is necessary to make a distinction
between the level of comprehension difficulty of a text, in particular for English
as a Foreign Language, and standard readability scores. The second thing we
show is that none of the individual groups of features leads to models that are
better than one trained with features from the different groups combined.</p>
      <p>
        There are many extensions that could be made to the analysis described in
this paper. For example, the level of comprehension difficulty of a text can be
linked to the cultural context, the use of idiosyncrasies, the use of idioms, and
novel turns of phrase. While the identification of these aspects of text can be
attempted through the use of computational models (see for example [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] for the
identification of idiomatic structures), they all remain open research challenges
for computational models of language. Nevertheless, features based on these
aspects could be examined. Similarly, the use of specific grammar points
(e.g. particular tenses) is known to cause comprehension difficulties. While the
use of syntax-based features based on POS and syntactic tags captures these to
some extent, representing their use more directly in specific features would be
beneficial.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Benveniste</surname>
          </string-name>
          , E.: Problemes de linguistique generale: I.
          <string-name>
            <surname>Gallimard</surname>
          </string-name>
          (
          <year>1975</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bjornsson</surname>
            ,
            <given-names>C.H.</given-names>
          </string-name>
          : Lasbarhet. Stockholm: Liber (
          <year>1968</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Y.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>C.J.:</given-names>
          </string-name>
          <article-title>Combining svms with various feature selection strategies</article-title>
          .
          <source>In: Feature extraction</source>
          , pp.
          <volume>315</volume>
          –
          <fpage>324</fpage>
          . Springer (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Coleman</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liau</surname>
            ,
            <given-names>T.L.</given-names>
          </string-name>
          :
          <article-title>A computer readability formula designed for machine scoring</article-title>
          .
          <source>Journal of Applied Psychology</source>
          <volume>60</volume>
          (
          <issue>2</issue>
          ),
          <volume>283</volume>
          (
          <year>1975</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
5.
<string-name><surname>Drucker</surname>, <given-names>H.</given-names></string-name>,
<string-name><surname>Burges</surname>, <given-names>C.J.C.</given-names></string-name>,
<string-name><surname>Kaufman</surname>, <given-names>L.</given-names></string-name>,
<string-name><surname>Smola</surname>, <given-names>A.J.</given-names></string-name>,
<string-name><surname>Vapnik</surname>, <given-names>V.N.</given-names></string-name>:
<article-title>Support vector regression machines</article-title>.
<source>In: Advances in Neural Information Processing Systems</source>
<volume>9</volume>, pp.
<fpage>155</fpage>–<lpage>161</lpage>. MIT Press (<year>1997</year>)
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
6.
<string-name><surname>DuBay</surname>, <given-names>W.H.</given-names></string-name>:
<article-title>The principles of readability</article-title>.
<source>Tech. Rep. ED490073</source>, Impact Information (<year>2004</year>)
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
7.
<string-name><surname>Duran</surname>, <given-names>P.</given-names></string-name>,
<string-name><surname>Malvern</surname>, <given-names>D.</given-names></string-name>,
<string-name><surname>Richards</surname>, <given-names>B.</given-names></string-name>,
<string-name><surname>Chipere</surname>, <given-names>N.</given-names></string-name>:
<article-title>Developmental trends in lexical diversity</article-title>.
<source>Applied Linguistics</source>
<volume>25</volume>(<issue>2</issue>),
<fpage>220</fpage>–<lpage>242</lpage> (<year>2004</year>)
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
8. Council of Europe:
<article-title>Common European Framework of Reference for Languages: Learning, Teaching, Assessment</article-title>
. Cambridge University Press (<year>2001</year>)
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Flesch</surname>
            ,
            <given-names>R.F.</given-names>
          </string-name>
          :
          <article-title>How to write plain English: A book for lawyers and consumers</article-title>
          .
<source>HarperCollins</source>
          (
          <year>1979</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Gunning</surname>
            ,
<given-names>R.</given-names>
</string-name>
:
          <article-title>The technique of clear writing</article-title>
          .
          <source>McGraw-Hill</source>
          , New York (
          <year>1952</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
11.
<string-name><surname>Kelleher</surname>, <given-names>J.D.</given-names></string-name>,
<string-name><surname>Mac Namee</surname>, <given-names>B.</given-names></string-name>,
<string-name><surname>D'Arcy</surname>, <given-names>A.</given-names></string-name>:
<article-title>Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies</article-title>
. MIT Press (<year>2015</year>)
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Leech</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rayson</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , et al.:
          <article-title>Word frequencies in written and spoken English: Based on the British National Corpus</article-title>
          .
          <source>Routledge</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Surdeanu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bauer</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Finkel</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bethard</surname>
            ,
            <given-names>S.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McClosky</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
<article-title>The Stanford CoreNLP natural language processing toolkit</article-title>
. In:
<source>Association for Computational Linguistics (ACL) System Demonstrations</source>
, pp.
<fpage>55</fpage>–<lpage>60</lpage> (<year>2014</year>),
http://www.aclweb.org/anthology/P/P14/P14-5010
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
14.
<string-name><surname>McLaughlin</surname>, <given-names>G.H.</given-names></string-name>:
<article-title>SMOG grading: a new readability formula</article-title>.
<source>Journal of Reading</source>
<volume>12</volume>(<issue>8</issue>),
<fpage>639</fpage>–<lpage>646</lpage> (<year>1969</year>)
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
15.
<string-name><surname>McCarthy</surname>, <given-names>P.M.</given-names></string-name>,
<string-name><surname>Jarvis</surname>, <given-names>S.</given-names></string-name>:
<article-title>MTLD, vocd-D, and HD-D: A validation study of sophisticated approaches to lexical diversity assessment</article-title>.
<source>Behavior Research Methods</source>
<volume>42</volume>(<issue>2</issue>),
<fpage>381</fpage>–<lpage>392</lpage> (<year>2010</year>),
http://dx.doi.org/10.3758/BRM.42.2.381
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Salton</surname>
            ,
            <given-names>G.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ross</surname>
            ,
            <given-names>R.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kelleher</surname>
            ,
            <given-names>J.D.</given-names>
          </string-name>
          :
          <article-title>Idiom token classi cation using sentential distributed semantics</article-title>
          .
          <source>In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics</source>
          . pp.
          <volume>194</volume>
          {
          <fpage>204</fpage>
          . Association of Computational Linguistics (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Senter</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smith</surname>
            ,
            <given-names>E.A.</given-names>
          </string-name>
          :
<source>Automated readability index</source>
. Tech. rep.,
<source>University of Cincinnati, OH</source>
          (
          <year>1967</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
18.
<string-name><surname>Taylor</surname>, <given-names>A.</given-names></string-name>,
<string-name><surname>Marcus</surname>, <given-names>M.</given-names></string-name>,
<string-name><surname>Santorini</surname>, <given-names>B.</given-names></string-name>:
<article-title>The Penn Treebank: an overview</article-title>.
<source>In: Treebanks</source>, pp.
<fpage>5</fpage>–<lpage>22</lpage>. Springer (<year>2003</year>)
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Templin</surname>
            ,
            <given-names>M.C.</given-names>
          </string-name>
          :
<article-title>Certain language skills in children: their development and interrelationships</article-title>
. University of Minnesota Press (
          <year>1957</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>