<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Leveraging the Intuitions of Lay People on Linguistic Complexity for Automatic Sentence Readability Assessment</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ignatios Charalampidis</string-name>
          <email>ignatios.charalampidis@uni-tuebingen.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiaobin Chen</string-name>
          <email>xiaobin.chen@uni-tuebingen.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>72072</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>EvalLAC'25: 2nd Workshop on Automatic Evaluation of Learning and Assessment Content</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Hector Research Institute of Education Sciences and Psychology (University of Tübingen)</institution>
          ,
          <addr-line>Walter-Simon-Strasse 12, Tübingen</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Automatic readability assessment (ARA) is an invaluable tool for second language teachers, as it can be used to evaluate the difficulty of educational materials. Previous studies on ARA often used text-level linguistic complexity features to predict readability, while few studies have focused on the sentence level. This paper presents a study in which crowd-sourced evaluations of sentence readability were collected from non-expert native and non-native speakers of English. 50 participants were asked to evaluate 1000 sentences in English extracted from diverse sources for lexical, grammatical and overall sentence difficulty separately, and the resulting dataset of 10,000 evaluations (hereinafter referred to as SR-Crowd) is released for future research. Several machine learning (ML) models were trained on this data using linguistic and neural features. The best model relies on neural and linguistic features and achieves a Quadratic Weighted Kappa (QWK) score of 0.7. Further, our findings suggest that readability perception differs among groups and individuals, and a single ARA pipeline may not be enough to capture readability accurately for all of them. This research offers valuable insights that can be applied in the areas of evaluating language learning materials, content filtering, and text summarization, where accurate sentence-level readability assessment is essential.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In the field of second language acquisition (SLA), the Input Hypothesis, proposed by linguist Stephen
Krashen, suggests that learners acquire language most effectively when exposed to input that is slightly
beyond their current level of competence, often referred to as i+1, where i represents current competence
and 1 represents new knowledge [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This approach emphasizes the importance of providing learners
with comprehensible input that is slightly above their current level, allowing them to gradually expand
their linguistic abilities. One way to achieve this in language learning is by providing materials whose
readability is suitable for the learners’ level.
      </p>
      <p>
        Readability refers to the degree to which a piece of writing is understood and absorbed by its readers
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Greater readability reduces the effort required for reading and enhances the reading speed for all
readers, especially for those who may struggle with comprehension [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. This, in turn, boosts academic
performance and leads to higher educational outcomes [
        <xref ref-type="bibr" rid="ref4 ref5 ref6">4, 5, 6</xref>
        ]. Therefore, foreign language teachers
need to assess the readability of the text materials they provide to learners to make sure they add
just enough new information to facilitate learning (+1 in Krashen’s terms). In this paper we use the
terms readability, complexity and difficulty loosely and interchangeably, even though we are well aware
that technically they are not the same concept. Complexity hereinafter refers mostly to linguistic
sophistication and is taken to mean the inverse of readability (the higher the complexity, the less
readable a text is), while difficulty is used mostly in the context of the overall cognitive effort
required to read a text.
      </p>
      <p>
        Though essential, readability assessment remains a laborious task for educators. This necessitated the
creation of Automatic Readability Assessment (ARA) software and algorithms, whose goal is to evaluate
the readability of a given text automatically. There are multiple ways to perform ARA, including simple
formulas, machine learning (ML) and linguistic features, and more recently with the use of neural
networks. Simple formulas rely on surface-level (or raw) features, such as the number of syllables
and sentence length or the number of infrequent words in a text [
        <xref ref-type="bibr" rid="ref10 ref7 ref8 ref9">7, 8, 9, 10</xref>
        ]. In order to capture
linguistic complexity more systematically and thoroughly other studies proposed the use of ML models
for classification and a wide variety of linguistic features with high levels of success [
        <xref ref-type="bibr" rid="ref11 ref6">11, 6</xref>
        ]. Since there
is high correlation between linguistic complexity and readability, linguistic features serve as a good
proxy for readability and are therefore explored in this study.
      </p>
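      <p>For illustration, the surface-feature approach can be sketched with a Flesch Reading Ease-style score computed from words per sentence and syllables per word. This is our own minimal sketch, not the authors' code; the vowel-group syllable counter is a rough heuristic, not the official syllabification algorithm.</p>

```python
import re

def count_syllables(word: str) -> int:
    """Rough heuristic: count groups of consecutive vowels (at least 1)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease:
    206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words).
    Higher scores indicate more readable text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    n_sent = max(1, len(sentences))
    n_words = max(1, len(words))
    return 206.835 - 1.015 * (n_words / n_sent) - 84.6 * (syllables / n_words)
```

      <p>A short, monosyllabic sentence scores far higher (more readable) than a long, polysyllabic one, which is all such surface formulas can capture.</p>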
      <p>
        ARA has been successfully applied on the document level, namely assessment of entire texts [
        <xref ref-type="bibr" rid="ref12 ref13 ref14">12, 13,
14</xref>
        ]. However, document-level readability suffers from lack of granularity; a text can be classified as a
whole, but then no insight about its individual constituents is to be gained. This is problematic, for
example, in cases where it is desirable to assess parts of a text in terms of complexity in order to simplify
only those sentences that might not be readable enough for a particular audience (in this case students
studying a foreign language). That is why research has shifted towards sentence-level ARA more
recently [
        <xref ref-type="bibr" rid="ref15 ref16 ref17">15, 16, 17</xref>
        ]. To achieve sentence-level ARA previous research has leveraged judgments made
by language experts or very proficient speakers and used them to train ML ARA models [
        <xref ref-type="bibr" rid="ref15 ref17 ref18">18, 17, 15</xref>
        ].
Nonetheless, previous studies relied on the judgments of a few annotators, who provided a single,
holistic score of readability. For language learning purposes, it would be useful to tease apart the various
factors that come into play when assessing readability, most notably vocabulary and grammar. That
will provide insight into which features influence perception of readability the most, thereby leading to
better ARA solutions.
      </p>
      <p>In this study we aim to fill this gap by gathering sentence readability judgments from lay people.
Specifically, a corpus of 1000 sentences for human evaluation was curated.¹ The task was
crowdsourced and 50 participants (not language experts) were asked to assess random and diverse sentences
on a scale of 1 to 100 for lexical, grammatical and overall difficulty separately. Then several ML
classification and regression models were built and trained using the crowd’s judgments as training
data in an effort to discover which models perform best and how to best aggregate the intuitions of
lay people into readability scores that can be used for training said models. We attempt to answer the
following research questions:</p>
      <sec id="sec-1-1">
        <title>Research Questions</title>
        <p>1. Can lay people’s intuitions be used to train ML models for ARA?</p>
        <p>2. Which of the three criteria (lexical, grammatical and overall complexity) can be predicted best using ML?</p>
        <p>3. Which ML models and features predict readability with the highest accuracy?</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        ARA aims to automatically compute a score (either numerical or categorical) which shows how readable
a text is. This can be achieved in a plethora of ways, including simple formulas, machine learning
(ML) and linguistic features, and more recently with the use of neural networks. Simple formulas
rely on surface-level (or raw) features, such as the number of syllables and sentence length or the
number of infrequent words in a text [
        <xref ref-type="bibr" rid="ref10 ref7 ref8 ref9">7, 8, 9, 10</xref>
        ]. However, readability is influenced heavily by many
factors, mostly grammatical and lexical, which raw text features alone cannot capture. In order to
capture readability more systematically and thoroughly other studies proposed the use of ML models
for classification and a wide variety of syntactic, lexical-morphological and semantic features [
        <xref ref-type="bibr" rid="ref11 ref12 ref6">11, 6, 12</xref>
        ]
with high levels of success.
      </p>
      <p>
        Several studies focused on document-level ARA, where the goal is to classify the readability level of
an entire text. In [
        <xref ref-type="bibr" rid="ref13 ref19 ref20">13, 19, 20</xref>
        ] linguistic and surface-level features were successfully used to train ML
classification models, with reported accuracy reaching 93.3%, 98.12% and 80% respectively. (¹ All data
and code will be made available soon at https://github.com/IgChar/sentence-readability-crowd.) Using a
combination of linguistic features and the output of BERT-like [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] neural network models, the authors
in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] were able to achieve an almost perfect accuracy of 0.99 on the OneStopEnglish corpus [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], and
an accuracy of 0.9 on WeeBit corpus [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>These studies reported impressive results for document-level readability classification. However,
for the purposes of SLA, a more robust and granular approach to ARA is necessary, especially if the
end goal is text simplification and/or complexification of individual sentences. To accommodate this
necessity other studies have focused on sentence-level ARA, thereby addressing the issue of lack of
granular readability assessment. Various approaches and setups have been proposed for sentence-level
ARA.</p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref11 ref16 ref23">11, 16, 23</xref>
        ] ARA was framed as a pairwise comparison between sentences and several linguistic
features were used to determine which sentence is more difficult to read. All studies reported accuracy
scores above 78%.
      </p>
      <p>
        Other studies attempted to extract absolute readability scores for sentences without relying on
comparisons. [
        <xref ref-type="bibr" rid="ref15 ref18 ref24">15, 24, 18</xref>
        ] achieved this by building sentence corpora and asking evaluators to assess
sentence readability subjectively on a Likert scale. Then the authors used linguistic features to predict
those scores, achieving high accuracy and Quadratic Weighted Kappa (QWK) scores [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] and revealing
which features influence readability the most according to people’s perception.
      </p>
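      <p>Since QWK recurs throughout this paper as the main agreement metric, here is a minimal self-contained sketch of how it is computed for ordinal labels: disagreements are penalized by the squared distance between classes and normalized against chance agreement. Function and variable names are illustrative, not from the cited works.</p>

```python
def quadratic_weighted_kappa(y_true, y_pred, n_classes):
    """QWK between two ordinal ratings in {0, ..., n_classes-1}.
    1.0 = perfect agreement; ~0.0 = chance-level agreement."""
    # Observed confusion matrix
    O = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        O[t][p] += 1
    n = len(y_true)
    # Marginal histograms of the two raters
    hist_t = [sum(row) for row in O]
    hist_p = [sum(O[i][j] for i in range(n_classes)) for j in range(n_classes)]
    num = den = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            w = (i - j) ** 2 / (n_classes - 1) ** 2  # quadratic penalty
            expected = hist_t[i] * hist_p[j] / n      # chance agreement count
            num += w * O[i][j]
            den += w * expected
    return 1.0 - num / den
```

      <p>Because the penalty grows quadratically with class distance, confusing A1 with A2 costs far less than confusing A1 with C2, which is why QWK is preferred over plain accuracy for CEFR-style ordinal scales.</p>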
      <p>
        Most notably, [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] constructed the CEFR-based Sentence Profile (CEFR-SP), a collection of sentences
classified by two language professionals into the six CEFR levels (A1, A2, B1, B2, C1, C2). Unlike
previous methods, they used BERT [
        <xref ref-type="bibr" rid="ref21">21</xref>
          ] for classification in two setups: 1) they extracted neural
features from the sentences (embeddings) and built SVM and KNN classifiers with them and 2) they used
BERT directly as a classifier. They then compared those methods and found that the BERT classifier
outperformed the other methods with a QWK of 0.628 (rather high performance, indicating high
agreement between ground truth and predicted values), but failed to accurately predict A1 and C levels,
probably due to the dataset being unbalanced.
      </p>
      <p>
        The integration of neural and linguistic features has also been explored for ARA on both document
and sentence levels [
        <xref ref-type="bibr" rid="ref14 ref26">26, 14</xref>
        ]. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] investigated two distinct approaches: 1) they combined sentences
with linguistic features and fed the resulting array into a wide range of BERT-like neural networks
for classification purposes and 2) they leveraged the output of these models as an additional feature
in an XGBoost model, which was found to outperform other machine learning (ML) models in their
experiments. Furthermore, the authors’ hybrid approach, which utilized the output of BART-large [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ],
in conjunction with linguistic features, achieved an accuracy score of 0.729.
      </p>
      <p>Previous studies have employed various approaches to ARA at the sentence level, leveraging linguistic,
neural, and raw text features to achieve high assessment accuracy. However, there remain areas for
further investigation. Specifically, previous research has primarily employed a single score or categorical
value as a ground truth label to represent readability. Assessing people’s perception of lexical and
grammatical difficulty separately could provide more nuanced and fine-grained assessments of sentence
readability. Furthermore, most prior work has relied on idealized sentences from official sources, which
are often devoid of ungrammatical, ill-formed, or colloquial structures and vocabulary. As a result,
systems trained on such data may struggle to generalize in contexts where more natural language is
used. To fill these gaps 1) we recruited non-experts, both native and non-native speakers of English,
and explored how they perceive lexical, grammatical and overall sentence difficulty separately, 2) we
used a diverse corpus of sentences covering more colloquial and possibly idiomatic language use and
3) we explored the efficacy of a variety of handcrafted linguistic and neural features and methods in
producing reliable ARA models.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Study Design and Motivation</title>
      <p>We theorize that complexity is not an inherent universal property of text, but rather one that depends
on the receiver. Language experts perceive readability differently from second language (L2) learners
and lay people with varying proficiency levels. What is more, while experts (e.g. foreign language
teachers) may share a common understanding of linguistic complexity amongst themselves, lay people’s
individual differences, language proficiency level and lack of proper training may cause them to have
vastly different opinions on complexity and readability, so much so that modeling a common perception
of readability among this heterogeneous group using ML is impossible. To test this hypothesis we
collected judgments of complexity from lay people with the purpose of training ARA models on them.
The assumption is that if lay people’s judgments on the same sentences are too diverse with a high
deviation, then ML models trained on this data will fail to converge. Conversely, if said models achieve an
adequate performance, then this hints at the possibility that the collected data can be used to model some
common perception or ground truth that the general crowd shares. Further, we argue that extracting
the ground truth from lay people’s judgments is possible by leveraging the wisdom of the crowd
[28], namely the idea that a large heterogeneous group of people can collectively make decisions and
predictions that are often more accurate than those made by individuals or experts. The
aim of this study is to explore whether the intuitions of the crowd (lay people) can be modeled and
captured using various ML methods and features, as well as to describe which methods resulted in higher
performance.</p>
      <sec id="sec-3-1">
        <title>3.1. Data Collection</title>
        <p>We collected evaluations from 50 participants for 1000 sentences in English and compiled the Sentence
Readability Crowd corpus (SR-Crowd). A data collection tool was made in the form of a website where
participants would log in in order to assess the complexity of the SR-Crowd sentences. Participants recruited via
the platform www.prolific.com (hereinafter Prolific) were shown 200 random sentences each and they
were asked to evaluate them in terms of lexical complexity, grammatical complexity and overall reading
difficulty (henceforth referred to as V, G and O respectively) on a scale from 1 to 100, where 1 is the
lowest difficulty and 100 is the highest, resulting in 10,000 evaluations in total, with each sentence
being evaluated by 10 random participants. The 1-100 scale was selected because it provides more
flexibility and leeway for assessors, enabling them to express their subjective judgments about sentence
readability with greater precision than a traditional Likert scale would allow.</p>
        <p>Prior to data collection each participant was presented with 15 example sentences and their respective
evaluations done by linguists and language experts. These model evaluations act as ground-truth for the
participants to learn from and calibrate their assessments accordingly. In other words, the participants
were first primed on the examples and then they were asked to evaluate the difficulty of random
sentences based on the examples they had seen. While the goal of the study was to examine the
intuitions of the crowd, some example sentences and their respective indicative dificulty scores had
to be presented to the participants in order to prevent them from overestimating or underestimating
the complexity of sentences. We believe that showing participants a few examples is a middle-ground
solution between training them to do readability and complexity assessment and letting them do
the assessment without any criterion or standard whatsoever. The former would defeat the purpose
of leveraging lay people’s intuitions while the latter would render it more likely for participants to
underperform in the task by judging sentences too quickly without thinking. By presenting some
sentences and their respective indicative complexity scores as examples we minimize this probability
while also avoiding to impose strict standards that would bias or cloud the participants’ intuitive
judgment.</p>
        <p>Regarding the creation of SR-Crowd, many available corpora options were considered. The aim
was to collect data from sources as diverse as possible to ensure that the results reflected the full
range of English language usage. One dataset that fulfilled this criterion is the C4 corpus [29], which
was specifically designed for training large language models. As it already included a broad range
of documents and websites scraped from the internet, it was deemed suitable for this purpose. In
order to promote diversity even more in our sentence corpus, we drew on a specially curated subset
of the C4 corpus, dubbed Repset by Suzuki et al. [30]. This subset was created to aid in faster
training of LLMs with less data and is therefore expected to contain diverse language and content,
which is why it was chosen for this study. The sentences to be evaluated were randomly
sampled from the Repset corpus and then randomly assigned to participants of the study, with the
only criterion being sentence length. Each participant received a balanced mixture of short and long
sentences, thereby increasing the likelihood that all participants got sentences of varying levels of
readability, since sentence length correlates with perceived complexity. This allowed participants to
assess the full spectrum of linguistic complexity.</p>
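        <p>The length-balanced assignment described above can be sketched as follows. This is an illustrative reconstruction under our own assumptions (splitting at the median length and drawing half of each participant's quota from each pool), not the authors' exact sampling procedure; all names are hypothetical.</p>

```python
import random

def assign_balanced(sentences, n_participants, per_participant, seed=0):
    """Give each participant a balanced mix of short and long sentences.
    Sentences are split at the median character length; each participant
    draws half their quota from each pool. Illustrative sketch only."""
    rng = random.Random(seed)
    ranked = sorted(sentences, key=len)
    mid = len(ranked) // 2
    short_pool, long_pool = ranked[:mid], ranked[mid:]
    assignments = []
    for _ in range(n_participants):
        half = per_participant // 2
        batch = rng.sample(short_pool, half) + rng.sample(long_pool, per_participant - half)
        rng.shuffle(batch)  # hide the short/long ordering from the rater
        assignments.append(batch)
    return assignments
```

        <p>Because sentence length correlates with perceived complexity, stratifying on length is a cheap way to expose every rater to a wide readability range without pre-judging the sentences.</p>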
        <p>Several ML classification and regression models were built and trained using the crowd’s judgments
as training data. In the following sections we describe the data we collected, how the crowd-sourced
judgments for each sentence were aggregated into a single readability label for V, G and O, and which
features and models were used for classification.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Data Description</title>
        <p>The participants in this study came from a wide range of linguistic and proficiency backgrounds,
including various native languages and levels of English proficiency. The sample was divided roughly
evenly between individuals who were born and raised speaking English (27 participants) and those
whose primary language at home was not English (23). As far as gender is concerned, 27 participants
were male and 23 were female. Participants reported a total of 15 different native languages: English
(en), Polish (pl), Isizulu (zu), Portuguese (pt), Greek (el), Spanish (es), Dutch (nl), Hungarian (hu), Italian
(it), Urdu (ur), Xhosa (xh), Shona (sn), French (fr), German (de), Unknown (xx). Table 1 shows the
distribution of speakers. As for the participants’ self-reported proficiency levels, most participants
reported a general proficiency of C1, followed by C2, B2 and finally B1. Interestingly, some participants
claimed that their native language is English, but reported a general proficiency of less than C2 level.
Proficiency levels are summarized in Table 2.</p>
        <p>Here a few statistical characteristics of SR-Crowd are mentioned. We report the descriptive statistics
of the dataset as a whole, as well as those of each criterion separately (Vocabulary (V), Grammar (G) and
Overall (O) sentence dificulty). Table 3 summarizes the mean, variance, std, range, kurtosis, skew and
inter-quartile range for all participants for V,G,O. These values do not change significantly across the
diferent criteria, which indicates that there is high correlation between V,G and O in natural language;
sentences that are lexically complex often present grammatical complexity as well. Likewise the same
values are reported across sentences in Table 4.</p>
        <p>One property of the data is skewness to the left. Participants did not make use of the full range (1-100),
but rather gravitated towards lower values. The mean and median of the dataset are located around 37,
with values above 70 being very rare and values below 30 being more common. This is best illustrated
in Figure 1, where the kurtosis values for all three axes are displayed. Apparently participants generally
deemed sentences as simple and easy to understand, maybe partly because they were not trained or
specifically instructed before data collection and probably due to their high language proficiency.</p>
        <p>To reduce the efects of variability, all evaluations that deviated by 2 std above or below the mean were
not considered during model building. This can be further illustrated in Figure 2 which displays the std
for all axes per participant. Some participants gave consistently low scores for most sentences, which is
unlikely to reflect the true intuition of the crowd, since one would expect at least some of those sentences
to be considerably more dificult to read and understand than others. A low std among evaluations of
diferent sentences by the same participant is an indication that said participant underperformed.
(a) Lexical Dificulty
(b) Grammatical Dificulty
(c) Overall Dificulty</p>
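        <p>The ±2 std outlier filter described above can be sketched as follows. This is a minimal illustration assuming the filter is applied to a flat list of 1-100 evaluations for one criterion; the paper does not specify whether the mean is computed dataset-wide, per sentence, or per participant, so the grouping here is our assumption.</p>

```python
def filter_outliers(scores, k=2.0):
    """Drop scores more than k standard deviations from the mean.
    `scores` is assumed to be a flat list of 1-100 evaluations
    for a single criterion (V, G or O)."""
    n = len(scores)
    mean = sum(scores) / n
    variance = sum((x - mean) ** 2 for x in scores) / n  # population variance
    std = variance ** 0.5
    return [x for x in scores if abs(x - mean) <= k * std]
```

        <p>A single extreme rating (e.g. 99 among scores clustered in the low 30s) falls outside the 2-std band and is discarded, while the clustered ratings survive unchanged.</p>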
        <p>Finally, we hypothesize that participants who are native speakers of English may have evaluated
most sentences as easy to understand because they are very proficient themselves. Here the diferences
between the two groups (native vs non-native speakers of English) are examined. Any participant who
has a std below the threshold of 10 is shown in Table 5. Most participants who showed little variation
among their assessments are indeed English speaking. Except for participant with userid 13, all others’
assessments are on the lower end of the scale, as evident by the mean values for the three axes.</p>
        <p>This discrepancy between the two groups can be further illustrated in the box plots in Figure 3. Native
speakers had a tendency to evaluate sentences as less complex than the non-native group did. Native
speakers’ mean was generally lower than the non-native group’s and their std also. This may indicate a
fundamental diference between the two groups which would prove that readability is subjective and it
should be modeled based on who is perceiving it.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Data Preparation and Scaling</title>
        <p>
          Since the data collected is in the range of 1 to 100, any model trained on it would have to produce
a continuous variable in the same range as an ARA score. This means that ARA scoring is framed
as a regression problem, which in turn means that the MSE loss should be used for the training and
evaluation of the models presented in this study. However, for the sake of comparison with other
baselines that frame ARA as a classification problem and for future studies that will aim to compare
results among methods, this study reports classification scores, namely QWK, accuracy and Macro F1.
In order to achieve this comparability, our data had to be quantized into discrete classes from 0 to 5,
reflecting CEFR levels. That way more representative scores can be reported that are compatible with
the results of other studies (e.g. [
          <xref ref-type="bibr" rid="ref17">17, 31</xref>
          ]).
        </p>
        <p>Turning a range of 1 to 100 into a 6-tier scale requires thresholding to optimal values. One approach
would be to discretize based on 7 cutof values separated linearly, efectively splitting the 1 to 100 range
into 6 equally spaced bins. However, we noticed that the mean of evaluations is not centered around 50,
but around 37. Moreover, most participants refrained from scoring higher than 67 on average, which
means that higher scores are rare. To address this imbalance, we used thresholds obtained thusly: First
we averaged all values per sentence per criterion and then calculated equal length bins starting from
the minimum of these averages up to the maximum. That way the data was split into 6 bins with
an unequal amount of sentences. We chose this method because it splits the data into a distribution
similar to that of real texts. That is to say, sentences of lower-to-medium complexity (A2, B1 and B2)
are more frequent, while very simple or very challenging sentences (A1, C1 and C2) are less frequently
encountered. For brevity we show the cutof values and the number of sentences per bin only for the
Overall Dificulty criterion (O) in Table 6.</p>
        <p>Using the first method of thresholding yields a Gaussian distribution of data, with A1 and C2 having
the least amount of sentences and B1 having the most. Another factor we experimented with is averaging
multiple assessments into a single score for each criterion (V, G and O) for each sentence. Since every
sentence was evaluated by multiple participants, we had to decide how to average those judgments
into a single score that is most representative of the sentence’s dificulty. Concretely, we considered 1)
averaging values per sentence per criterion first and then thresholding (hereinafter referred to as AVG
THRESHOLD) and 2) thresholding to discrete levels first and then taking majority vote per sentence
per criterion (referred to as MAJORITY VOTING). In both cases outliers were ignored. It was found
that AVG THRESHOLD yielded the best results, so we chose to run our experiments using ground truth
labels extracted with this method.</p>
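        <p>The two aggregation strategies can be sketched as follows, assuming 6 equal-width bins spanning the observed range of scores as described above. All helper names are hypothetical; this is our illustration, not the released code.</p>

```python
from collections import Counter

def make_bins(values, n_bins=6):
    """Equal-width bin edges spanning min..max of the values
    (returns the n_bins - 1 internal edges)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + width * i for i in range(1, n_bins)]

def to_level(score, edges):
    """Map a raw 1-100 score to a discrete level 0-5 via the bin edges."""
    for level, edge in enumerate(edges):
        if score < edge:
            return level
    return len(edges)

def avg_threshold(judgments, edges):
    """AVG THRESHOLD: average the raw scores first, then discretize."""
    return to_level(sum(judgments) / len(judgments), edges)

def majority_voting(judgments, edges):
    """MAJORITY VOTING: discretize each score, then take the most common level."""
    levels = [to_level(j, edges) for j in judgments]
    return Counter(levels).most_common(1)[0][0]
```

        <p>The two strategies can disagree: for ratings of 10, 12 and 50 with edges at 20, 30, 40, 50 and 60, averaging first yields level 1, while per-rating discretization followed by a majority vote yields level 0, since two of the three raw scores fall in the lowest bin.</p>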
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Model Building and Criterial Features</title>
      <p>Several models were trained on the collected data and a total of four general pipeline variations were
built: 1) Traditional ML models trained solely on linguistic features as input, 2) traditional ML models
trained on neural embeddings as input, 3) BERT-like models fine-tuned for classification on SR-Crowd,
4) A pipeline that combines BERT-like models’ scores &amp; embeddings and linguistic features with a
traditional ML model as the backbone. As far as traditional ML modeling is concerned an XGBoost
model was used. Preliminary experiments showed that XGBoost achieves better performance than
other methods (SVM, Random Forests), so we conducted our main experiments with it. The neural
models tested are a) MiniLM [32], a lightweight sentence transformer, b) BGE-M3 [33], a multilingual
BERT transformer, c) DeBERTa-v3-large [34], a state-of-the-art Transformer often used for classification
with peak performance, and d) ModernBERT [35], a modernized bidirectional encoder-only Transformer
capable of encoding long contexts.</p>
      <sec id="sec-4-1">
        <title>4.1. ML Models with Linguistic Complexity Features</title>
        <p>
          Pipeline #1 is a classification setting that involves criterial feature extraction, as done in [
          <xref ref-type="bibr" rid="ref15 ref24">24, 15</xref>
          ]. The
CTAP platform [36] was used for the extraction of linguistic complexity features. CTAP extracts a
comprehensive list of 576 syntactic, lexical and morphological complexity features which can be used
to classify texts in terms of linguistic complexity. Since complexity is highly correlated with readability,
those features serve as a good proxy of readability. This pipeline achieves a QWK score of 0.5 across all
three criteria. More specifically, XGBoost was used with 576 CTAP features as input and the aggregated
participants’ judgments as output.
        </p>
        <p>As can be seen in Table 7, XGBoost with CTAP features a relatively low QWK and moderate accuracy
and F1 scores. It is worth noting that XGBoost model’s predictions tend to center around the value of 2,
which is the most common level found in the SR-Crowd after thresholding the participants’ assessments.
This may make the model not generalize well for more diverse datasets.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. ML Models with Embeddings as Features</title>
        <p>Neural embeddings are often used for classification, clustering and retrieval tasks. To test their efficacy
in classifying sentence difficulty, we extracted embeddings using three state-of-the-art models: a)
jina-embeddings-v3 [37], b) OpenAI’s text-embedding-3-large and c) Gemma 2b-it [38] (the Gemma
implementation can be found at https://huggingface.co/trapoom555/Gemma-2B-Text-Embedding-cft).
After the embeddings were extracted, they were fed into an XGBoost model.</p>
        <p>As can be seen in Table 8, results were very poor, indicating that neural embeddings alone do not
capture readability accurately. Even though the F1 score and accuracy are around 0.4, QWK is universally
very low, indicating that the model produces almost random predictions centered around the most
common level in the dataset. This could be attributed to the fact that neural embeddings are
optimized to retain semantic information, not information about grammatical or lexical complexity.</p>
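<p>Moderate accuracy paired with near-zero QWK is exactly the signature of a degenerate predictor. The toy example below (with made-up label counts, not the actual SR-Crowd distribution) shows that always predicting the majority level yields 0.4 accuracy but a QWK of exactly 0:</p>

```python
import numpy as np

# Illustrative imbalanced label set: level 2 is the majority class.
labels = np.array([2] * 40 + [1] * 20 + [3] * 20 + [0] * 10 + [4] * 10)
preds = np.full_like(labels, 2)        # degenerate predictor: always the majority level

accuracy = (preds == labels).mean()    # moderate-looking accuracy

# QWK of the constant predictor: observed and chance-expected agreement coincide.
n = 6
O = np.zeros((n, n))
for t, p in zip(labels, preds):
    O[t, p] += 1
W = (np.subtract.outer(np.arange(n), np.arange(n)) ** 2) / (n - 1) ** 2
E = np.outer(O.sum(axis=1), O.sum(axis=0)) / O.sum()
qwk = 1.0 - (W * O).sum() / (W * E).sum()   # 0 for any constant predictor
```

<p>By construction a constant predictor can never exceed chance agreement, so QWK exposes the failure that raw accuracy and F1 partially hide.</p>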
      </sec>
      <sec id="sec-4-3">
        <title>4.3. BERT Model Training</title>
        <p>Pipeline#2 involved the use of BERT-like models directly as classifiers by attaching a classification head
to the Transformer model, without extracting features or using other ML models. Every BERT-like
model was trained using the Adam optimizer in a 3-fold cross-validation setting with different seeds for
train-test splitting. For training efficiency, the larger BERT models were trained using parameter-efficient
LoRAs [39], which greatly reduce training time while performing comparably to full fine-tuning, as
our preliminary experiments showed. Only MiniLM was trained normally, without LoRAs. All models
were trained with a batch size of 64 (factoring in gradient accumulation steps).</p>
        <p>
          Contrary to [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], we found that treating ARA as a regression problem and then thresholding to
optimal cutoff values (similar to the method used to aggregate assessments described above) worked
best. This works because CEFR levels (unlike other categorical labels) are ordinal values: each level
indicates higher complexity than the previous one. As for thresholding, optimal cutoff values were
found using SciPy’s minimize function in Python, which finds optimal parameters for a given function.
We also tried the Python library Optuna, which discovers optimal hyperparameters for ML models,
but found SciPy’s minimize to work well for the purposes of this study. In Table 9 we show an
overview of the models’ performance.
        </p>
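<p>The regression-then-thresholding step can be sketched as follows; the toy data, the cutoff initialization and the mean-absolute-error objective are illustrative assumptions, since the text only specifies that SciPy’s minimize was used to find optimal cutoffs:</p>

```python
import numpy as np
from scipy.optimize import minimize

def to_levels(scores, cuts):
    # Map continuous regression outputs to ordinal levels 0-5 via five cutoffs.
    return np.digitize(scores, np.sort(cuts))

def objective(cuts, scores, labels):
    # Illustrative objective: mean absolute level error between the thresholded
    # predictions and the aggregated crowd labels (lower is better).
    return np.abs(to_levels(scores, cuts) - labels).mean()

# Toy data: regression outputs sitting slightly above the true levels 0-5.
labels = np.tile(np.arange(6), 20)
scores = labels + 0.1
init_cuts = np.array([0.5, 1.5, 2.5, 3.5, 4.5])

res = minimize(objective, init_cuts, args=(scores, labels), method="Nelder-Mead")
best_cuts = np.sort(res.x)
```

<p>A gradient-free method such as Nelder-Mead is the natural choice here because the objective is piecewise constant in the cutoffs, so gradient-based optimizers would see zero gradient almost everywhere.</p>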
        <p>In terms of QWK, all models seem to peak at around 0.65. However, the bigger models produce values
in the full range of 0 to 5, whereas MiniLM produced values centered around 2 and 3, as these were the
most common in the dataset. ModernBERT performed best with an accuracy of 0.55 for grammatical
difficulty prediction, while being the most efficient model to train and run inference with. DeBERTa
shows promise, and with better hyperparameter tuning it could reach higher performance, but its
computational cost and training time make it an unappealing choice for future experiments.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Combining Linguistic and Neural Features with BERT scores</title>
        <p>
          Previous studies have demonstrated that combining linguistic and neural features can lead to better
performance [
          <xref ref-type="bibr" rid="ref14 ref26">26, 14, 40</xref>
          ]. To test whether this holds true for SR-Crowd, pipeline#3 combined all features
extracted with the previously described methods into a single traditional ML model. ModernBERT’s
score combined with CTAP features achieved the highest overall accuracy of 0.48, with the grammatical
criterion performing best at 0.48 accuracy and 0.7 QWK. The macro F1 score of 0.364 indicates a
reasonable balance across classes, while the QWK score of 0.7 reflects strong agreement between
predicted and actual levels. This suggests that the combination of ModernBERT and CTAP effectively
leveraged the strengths of both methods.
        </p>
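<p>Concretely, the fusion step amounts to concatenating the CTAP feature matrix with the model-derived columns before fitting the downstream classifier; the placeholder data and sizes below are a sketch (576 CTAP features plus one ModernBERT regression score per sentence):</p>

```python
import numpy as np

n_sentences = 200
ctap_features = np.zeros((n_sentences, 576))  # placeholder: CTAP complexity features
bert_score = np.zeros((n_sentences, 1))       # placeholder: ModernBERT regression scores

# Fused design matrix handed to the traditional ML backbone (XGBoost in this
# study), e.g. xgboost.XGBClassifier(...).fit(X, y) on the aggregated labels.
X = np.hstack([ctap_features, bert_score])
```

<p>Pipeline#4 extends the same matrix with sentence-embedding columns; the fusion mechanics are identical, only the width of X grows.</p>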
        <p>On the other hand, combining neural and linguistic features in an XGBoost classifier did not yield
better results for SR-Crowd. The model performed the same as, or even worse than, the baseline that uses
CTAP features alone. What is more, its predictions are unbalanced, with levels 1-3 being overrepresented.</p>
        <p>Finally, pipeline#4 combines BERT scores from ModernBERT, CTAP features and embeddings from
the embedding models, but does not yield better results. Table 10 shows an overview of the results.</p>
        <p>Overall, the results indicate that combinations of models, such as ModernBERT with CTAP features
and XGBoost, tend to yield better performance than single approaches. Furthermore, we showed
that fine-tuning these models can be done cheaply and efficiently by harnessing newer methods
(e.g. LoRA), thereby dramatically decreasing training and inference costs.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>This study presented a new dataset of sentence difficulty evaluations performed by non-experts in the
field of linguistics or teaching. The dataset was analysed and several ML models were trained on it
for sentence-based ARA. It was found that participants had a tendency to evaluate sentences as easy
to understand, with the average of their evaluations being 37 out of 100. This tendency was found
to be slightly more pronounced for native speakers. Furthermore, the standard deviation across
participants for the same sentences was rather high, indicating that agreement was low among the
participants assigned to evaluate those sentences. Since evaluations were on a scale of 1-100, and
because we want our models and methods to be comparable to those in related works, we converted
this scale to one between 0 and 5, roughly representing CEFR levels, as other works do. Ground truth
labels were then extracted by first removing outliers and then averaging the assessments for V, G and O
separately, as this was found to lead to higher convergence in the trained models.</p>
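<p>The label-aggregation step described above can be sketched as follows; the z-score outlier rule, the cutoff value and the helper name are assumptions for illustration, since the exact outlier criterion is not restated here:</p>

```python
import numpy as np

def aggregate_ratings(ratings, z_cut=1.5):
    """Average one sentence's 1-100 crowd ratings after dropping outliers.

    Hypothetical sketch: the study removes outliers before averaging the V, G
    and O assessments separately; a z-score cut is assumed as the outlier rule.
    """
    r = np.asarray(ratings, dtype=float)
    if r.std() > 0:
        # Drop ratings further than z_cut standard deviations from the mean.
        r = r[np.abs(r - r.mean()) <= z_cut * r.std()]
    return r.mean()
```

<p>The averaged 1-100 score per sentence is subsequently mapped onto the 0-5 CEFR-like scale via the thresholding procedure discussed in section 3.3.</p>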
      <p>As regards mapping the 1-100 range of assessments to the CEFR scale (0-5, A1-C2), it was found
that using AVG THRESHOLD as described in section 3.3 resulted in splitting the data into bins
roughly following a Gaussian distribution; most sentences were classified as levels 2-3, while fewer
sentences were classified as levels 0, 1, 4 and 5. This is commensurate with our expectations, as we
assumed that sentences sampled from random lay texts would follow this distribution. This
assumption, however, should be tested more thoroughly in the future using corpus analysis. In any
case, our method of aggregating the crowd’s judgments to produce ground truth labels and train our
models on them proved to be significantly more effective than MAJORITY VOTING, as models trained
on these aggregated judgments had a much higher QWK score.</p>
      <p>As for modeling, BERT-like models performed moderately well in sentence difficulty prediction, and
definitely better than ML models trained on linguistic complexity features from CTAP alone. However,
the combination of BERT regression scores and CTAP features using an XGBoost model produced the
best outcome, with a QWK of 0.7 for grammatical difficulty prediction and an average score of 0.67
for V and O. This indicates that the best pipeline, which combines linguistic features with a regression
score from ModernBERT, is not significantly better or worse at predicting any of the three criteria (V, G
and O). Further, our statistical analysis shows that participants’ judgments on these criteria may be
interdependent: a sentence that is lexically difficult may influence lay people’s perception of its
grammatical and overall difficulty as well.</p>
      <p>Finally, the best model’s performance is only moderately high, indicating that readability is highly
subjective and that the readability judgments of one group of people may not reflect those of others due
to individual differences. To develop effective ARA models, it is crucial to train them on the unique
intuitions and perceptual norms of specific target audience groups, rather than relying solely on expert
opinions or those of highly proficient speakers. Examining the divide between expert and non-expert
judgments is a goal for a future study. However, using the intuitions of the crowd as training data
seems to lead to generalizable ARA models that capture the general crowd’s perception.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <sec id="sec-6-1">
        <title>Statement</title>
        <p>The authors have not employed any Generative AI tools.</p>
        <p>
[28] J. Surowiecki, The Wisdom of Crowds: Why the Many Are Smarter Than the Few and How
Collective Wisdom Shapes Business, Economies, Societies, and Nations, Doubleday, New York,
2004.
[29] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring
the limits of transfer learning with a unified text-to-text transformer, Journal of Machine Learning
Research 21 (2020) 1–67. URL: http://jmlr.org/papers/v21/20-074.html.
[30] J. Suzuki, H. Zen, H. Kazawa, Extracting representative subset from extensive text data for training
pre-trained language models, Information Processing &amp; Management 60 (2023) 103249. URL:
https://www.sciencedirect.com/science/article/pii/S0306457322003508. doi:10.1016/j.ipm.2022.103249.
[31] T. Naous, M. J. Ryan, A. Lavrouk, M. Chandra, W. Xu, ReadMe++: Benchmarking multilingual
language models for multi-domain readability assessment, in: Y. Al-Onaizan, M. Bansal, Y.-N. Chen
(Eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,
Association for Computational Linguistics, Miami, Florida, USA, 2024, pp. 12230–12266. URL:
https://aclanthology.org/2024.emnlp-main.682. doi:10.18653/v1/2024.emnlp-main.682.
[32] W. Wang, F. Wei, L. Dong, H. Bao, N. Yang, M. Zhou, MiniLM: Deep self-attention distillation for
task-agnostic compression of pre-trained transformers, 2020. doi:10.48550/arXiv.2002.10957.
[33] J. Chen, S. Xiao, P. Zhang, K. Luo, D. Lian, Z. Liu, M3-embedding: Multi-linguality,
multifunctionality, multi-granularity text embeddings through self-knowledge distillation, in: L.-W. Ku,
A. Martins, V. Srikumar (Eds.), Findings of the Association for Computational Linguistics: ACL
2024, Association for Computational Linguistics, Bangkok, Thailand, 2024, pp. 2318–2335. URL:
https://aclanthology.org/2024.findings-acl.137/. doi:10.18653/v1/2024.findings-acl.137.
[34] P. He, J. Gao, W. Chen, DeBERTaV3: Improving DeBERTa using ELECTRA-style pre-training with
gradient-disentangled embedding sharing, 2021. doi:10.48550/arXiv.2111.09543.
[35] B. Warner, A. Chafin, B. Clavié, O. Weller, O. Hallström, S. Taghadouini, A. Gallagher, R. Biswas,
F. Ladhak, T. Aarsen, N. Cooper, G. Adams, J. Howard, I. Poli, Smarter, better, faster, longer: A
modern bidirectional encoder for fast, memory efficient, and long context finetuning and inference,
2024. URL: https://arxiv.org/abs/2412.13663. arXiv:2412.13663.
[36] X. Chen, D. Meurers, CTAP: A web-based tool supporting automatic complexity analysis, in:
D. Brunato, F. Dell’Orletta, G. Venturi, T. François, P. Blache (Eds.), Proceedings of the Workshop
on Computational Linguistics for Linguistic Complexity (CL4LC), The COLING 2016 Organizing
Committee, Osaka, Japan, 2016, pp. 113–119. URL: https://aclanthology.org/W16-4113/.
[37] S. Sturua, I. Mohr, M. K. Akram, M. Günther, B. Wang, M. Krimmel, F. Wang, G. Mastrapas,
A. Koukounas, A. Koukounas, N. Wang, H. Xiao, jina-embeddings-v3: Multilingual embeddings
with task lora, 2024. URL: https://arxiv.org/abs/2409.10173. arXiv:2409.10173.
[38] G. Team, Gemma (2024). URL: https://www.kaggle.com/m/3301. doi:10.34740/KAGGLE/M/3301.
[39] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, LoRA: Low-rank
adaptation of large language models, in: ICLR 2022, 2022. URL: https://www.microsoft.com/en-us/
research/publication/lora-low-rank-adaptation-of-large-language-models/.
[40] F. Liu, J. Lee, Hybrid models for sentence readability assessment, in: E. Kochmar, J. Burstein,
A. Horbach, R. Laarmann-Quante, N. Madnani, A. Tack, V. Yaneva, Z. Yuan, T. Zesch (Eds.),
Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications
(BEA 2023), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 448–454. URL:
https://aclanthology.org/2023.bea-1.37/. doi:10.18653/v1/2023.bea-1.37.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S. D.</given-names>
            <surname>Krashen</surname>
          </string-name>
          , Principles and Practice in Second Language Acquisition, Pergamon Press, Oxford,
          <year>1982</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>E.</given-names>
            <surname>Dale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Chall</surname>
          </string-name>
          ,
          <article-title>A formula for predicting readability</article-title>
          : Instructions,
          <source>Educational Research Bulletin</source>
          <volume>27</volume>
          (
          <year>1948</year>
          )
          <fpage>37</fpage>
          -
          <lpage>54</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>W. H.</given-names>
            <surname>DuBay</surname>
          </string-name>
          , The Principles of Readability, Impact Information, Costa Mesa, CA,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H. A. E.</given-names>
            <surname>Mesmer</surname>
          </string-name>
          , Tools for Matching Readers to Texts: Research-Based
          <string-name>
            <surname>Practices</surname>
          </string-name>
          , Guilford Press, New York,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>K. E.</given-names>
            <surname>Stanovich</surname>
          </string-name>
          ,
          <article-title>Matthew effects in reading: Some consequences of individual differences in the acquisition of literacy</article-title>
          ,
          <source>Journal of education 189</source>
          (
          <year>2009</year>
          )
          <fpage>23</fpage>
          -
          <lpage>55</lpage>
          . doi:10.1177/0022057409189001-204.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.-T.</given-names>
            <surname>Sung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-C.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. B.</given-names>
            <surname>Dyson</surname>
          </string-name>
          ,
          <string-name>
            <surname>K.-E. Chang</surname>
            ,
            <given-names>Y.-C.</given-names>
          </string-name>
          <string-name>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Leveling l2 texts through readability: Combining multilevel linguistic features with the cefr</article-title>
          ,
          <source>The Modern Language Journal</source>
          <volume>99</volume>
          (
          <year>2015</year>
          )
          <fpage>371</fpage>
          -
          <lpage>391</lpage>
          . doi:10.1111/modl.12213.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Kincaid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. P.</given-names>
            <surname>Fishburne</surname>
          </string-name>
          <string-name>
            <surname>Jr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. L.</given-names>
            <surname>Rogers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. S.</given-names>
            <surname>Chissom</surname>
          </string-name>
          ,
          <article-title>Derivation of new readability formulas (automated readability index, fog count and flesch reading ease formula) for navy enlisted personnel (</article-title>
          <year>1975</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Chall</surname>
          </string-name>
          , E. Dale, Readability Revisited:
          <article-title>The New Dale-Chall Readability Formula</article-title>
          , Brookline Books, Cambridge, MA,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D. R.</given-names>
            <surname>Smith</surname>
          </string-name>
          , et al.,
          <source>The Lexile Scale in Theory and Practice</source>
          .
          <source>Final Report, Technical Report ERIC ED307577</source>
          , MetaMetrics, Inc., Washington, DC,
          <year>1989</year>
          . Available from ERIC: https://eric.ed.gov/?id=ED307577.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>W.</given-names>
            <surname>Dubay</surname>
          </string-name>
          ,
          <article-title>Unlocking language: The classic readability studies</article-title>
          ,
          <source>IEEE Transactions on Professional Communication</source>
          <volume>51</volume>
          (
          <year>2009</year>
          )
          <fpage>416</fpage>
          -
          <lpage>417</lpage>
          . doi:10.1109/TPC.2008.2007872.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Vajjala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Meurers</surname>
          </string-name>
          ,
          <article-title>Assessing the relative reading level of sentence pairs for text simplification</article-title>
          , in: S. Wintner,
          <string-name>
            <given-names>S.</given-names>
            <surname>Goldwater</surname>
          </string-name>
          , S. Riezler (Eds.),
          <source>Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics</source>
          , Association for Computational Linguistics, Gothenburg, Sweden,
          <year>2014</year>
          , pp.
          <fpage>288</fpage>
          -
          <lpage>297</lpage>
          . URL: https://aclanthology.org/E14-1031/. doi:10.3115/v1/E14-1031.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>I.</given-names>
            <surname>Pilán</surname>
          </string-name>
          , E. Volodina,
          <string-name>
            <given-names>R.</given-names>
            <surname>Johansson</surname>
          </string-name>
          ,
          <article-title>Rule-based and machine learning approaches for second language sentence-level readability</article-title>
          , in: J.
          <string-name>
            <surname>Tetreault</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Burstein</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          Leacock (Eds.),
          <source>Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications</source>
          , Association for Computational Linguistics, Baltimore, Maryland,
          <year>2014</year>
          , pp.
          <fpage>174</fpage>
          -
          <lpage>184</lpage>
          . URL: https://aclanthology.org/W14-1821/. doi:10.3115/v1/W14-1821.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S.</given-names>
            <surname>Vajjala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Meurers</surname>
          </string-name>
          ,
          <article-title>On improving the accuracy of readability classification using insights from second language acquisition</article-title>
          , in: J.
          <string-name>
            <surname>Tetreault</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Burstein</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          Leacock (Eds.),
          <source>Proceedings of the Seventh Workshop on Building Educational Applications Using NLP</source>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Montréal, Canada,
          <year>2012</year>
          , pp.
          <fpage>163</fpage>
          -
          <lpage>173</lpage>
          . URL: https://aclanthology.org/W12-2019/.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>B. W.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. S.</given-names>
            <surname>Jang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Pushing on text readability assessment: A transformer meets handcrafted linguistic features</article-title>
          , in:
          <string-name>
            <surname>M.-F. Moens</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Specia</surname>
          </string-name>
          , S. W.-t. Yih (Eds.),
          <source>Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Online and
          <string-name>
            <given-names>Punta</given-names>
            <surname>Cana</surname>
          </string-name>
          , Dominican Republic,
          <year>2021</year>
          , pp.
          <fpage>10669</fpage>
          -
          <lpage>10686</lpage>
          . URL: https://aclanthology.org/2021.emnlp-main.834/. doi:10.18653/v1/2021.emnlp-main.834.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>D.</given-names>
            <surname>Brunato</surname>
          </string-name>
          , L. De Mattei,
          <string-name>
            <given-names>F.</given-names>
            <surname>Dell'Orletta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Iavarone</surname>
          </string-name>
          , G. Venturi,
          <article-title>Is this sentence difficult? Do you agree?</article-title>
          , in: E.
          <string-name>
            <surname>Riloff</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Chiang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Hockenmaier</surname>
          </string-name>
          , J. Tsujii (Eds.),
          <source>Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Brussels, Belgium,
          <year>2018</year>
          , pp.
          <fpage>2690</fpage>
          -
          <lpage>2699</lpage>
          . URL: https://aclanthology.org/D18-1289/. doi:10.18653/v1/D18-1289.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Ambati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Reddy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Steedman</surname>
          </string-name>
          ,
          <article-title>Assessing relative sentence complexity using an incremental CCG parser</article-title>
          , in: K.
          <string-name>
            <surname>Knight</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Nenkova</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          Rambow (Eds.),
          <source>Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Association for Computational Linguistics
          , San Diego, California,
          <year>2016</year>
          , pp.
          <fpage>1051</fpage>
          -
          <lpage>1057</lpage>
          . URL: https://aclanthology.org/N16-1120/. doi:10.18653/v1/N16-1120.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Arase</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Uchida</surname>
          </string-name>
          , T. Kajiwara,
          <article-title>CEFR-based sentence difficulty annotation and assessment</article-title>
          , in: Y.
          <string-name>
            <surname>Goldberg</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Kozareva</surname>
          </string-name>
          , Y. Zhang (Eds.),
          <source>Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Abu Dhabi, United Arab Emirates,
          <year>2022</year>
          , pp.
          <fpage>6206</fpage>
          -
          <lpage>6219</lpage>
          . URL: https://aclanthology.org/2022.emnlp-main.416/. doi:10.18653/v1/2022.emnlp-main.416.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>L.</given-names>
            <surname>Seife</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Kallel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Möller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Naderi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Roller</surname>
          </string-name>
          ,
          <article-title>Subjective text complexity assessment for German</article-title>
          , in:
          <string-name><given-names>N.</given-names> <surname>Calzolari</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>Béchet</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Blache</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Choukri</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Cieri</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Declerck</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Goggi</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Isahara</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Maegaard</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Mariani</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Mazo</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Odijk</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Piperidis</surname></string-name>
          (Eds.),
          <source>Proceedings of the Thirteenth Language Resources and Evaluation Conference</source>
          , European Language Resources Association, Marseille, France,
          <year>2022</year>
          , pp.
          <fpage>707</fpage>
          -
          <lpage>714</lpage>
          . URL: https://aclanthology.org/2022.lrec-1.74/.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>F.</given-names>
            <surname>Dell'Orletta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Montemagni</surname>
          </string-name>
          ,
          <string-name><given-names>G.</given-names> <surname>Venturi</surname></string-name>
          ,
          <article-title>READ-IT: Assessing readability of Italian texts with a view to text simplification</article-title>
          , in:
          <string-name><given-names>N.</given-names> <surname>Alm</surname></string-name>
          (Ed.),
          <source>Proceedings of the Second Workshop on Speech and Language Processing for Assistive Technologies</source>
          , Association for Computational Linguistics, Edinburgh, Scotland, UK,
          <year>2011</year>
          , pp.
          <fpage>73</fpage>
          -
          <lpage>83</lpage>
          . URL: https://aclanthology.org/W11-2308/.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>M.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name><given-names>E.</given-names> <surname>Kochmar</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Briscoe</surname></string-name>,
          <article-title>Text readability assessment for second language learners</article-title>
          , in:
          <string-name><given-names>J.</given-names> <surname>Tetreault</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Burstein</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Leacock</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Yannakoudakis</surname></string-name>
          (Eds.),
          <source>Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications</source>
          , Association for Computational Linguistics, San Diego, CA,
          <year>2016</year>
          , pp.
          <fpage>12</fpage>
          -
          <lpage>22</lpage>
          . URL: https://aclanthology.org/W16-0502/. doi:10.18653/v1/W16-0502.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name><given-names>M.-W.</given-names> <surname>Chang</surname></string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          , in:
          <string-name><given-names>J.</given-names> <surname>Burstein</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Doran</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Solorio</surname></string-name>
          (Eds.),
          <source>Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers),
          Association for Computational Linguistics
          , Minneapolis, Minnesota,
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          . URL: https://aclanthology.org/N19-1423/. doi:10.18653/v1/N19-1423.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>S.</given-names>
            <surname>Vajjala</surname>
          </string-name>
          ,
          <string-name><given-names>I.</given-names> <surname>Lučić</surname></string-name>
          ,
          <article-title>OneStopEnglish corpus: A new corpus for automatic readability assessment and text simplification</article-title>
          , in:
          <string-name><given-names>J.</given-names> <surname>Tetreault</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Burstein</surname></string-name>,
          <string-name><given-names>E.</given-names> <surname>Kochmar</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Leacock</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Yannakoudakis</surname></string-name>
          (Eds.),
          <source>Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications</source>
          , Association for Computational Linguistics, New Orleans, Louisiana,
          <year>2018</year>
          , pp.
          <fpage>297</fpage>
          -
          <lpage>304</lpage>
          . URL: https://aclanthology.org/W18-0535/. doi:10.18653/v1/W18-0535.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>F.</given-names>
            <surname>Dell'Orletta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wieling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Venturi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cimino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Montemagni</surname>
          </string-name>
          ,
          <article-title>Assessing the readability of sentences: Which corpora and features?</article-title>
          , in:
          <string-name><given-names>J.</given-names> <surname>Tetreault</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Burstein</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Leacock</surname></string-name>
          (Eds.),
          <source>Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications</source>
          , Association for Computational Linguistics, Baltimore, Maryland,
          <year>2014</year>
          , pp.
          <fpage>163</fpage>
          -
          <lpage>173</lpage>
          . URL: https://aclanthology.org/W14-1820/. doi:10.3115/v1/W14-1820.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>S.</given-names>
            <surname>Stajner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. P.</given-names>
            <surname>Ponzetto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Stuckenschmidt</surname>
          </string-name>
          ,
          <article-title>Automatic assessment of absolute sentence complexity</article-title>
          , in:
          <source>Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>4096</fpage>
          -
          <lpage>4102</lpage>
          . URL: https://doi.org/10.24963/ijcai.2017/572. doi:10.24963/ijcai.2017/572.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>R.</given-names>
            <surname>Bakeman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Gottman</surname>
          </string-name>
          ,
          <source>Observing interaction: An introduction to sequential analysis</source>
          , Cambridge University Press,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>T.</given-names>
            <surname>Deutsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jasbi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shieber</surname>
          </string-name>
          ,
          <article-title>Linguistic features for readability assessment</article-title>
          , in:
          <string-name><given-names>J.</given-names> <surname>Burstein</surname></string-name>,
          <string-name><given-names>E.</given-names> <surname>Kochmar</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Leacock</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Madnani</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Pilán</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Yannakoudakis</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Zesch</surname></string-name>
          (Eds.),
          <source>Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications</source>
          , Association for Computational Linguistics, Seattle, WA, USA → Online,
          <year>2020</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>17</lpage>
          . URL: https://aclanthology.org/2020.bea-1.1/. doi:10.18653/v1/2020.bea-1.1.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ghazvininejad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <string-name><given-names>L.</given-names> <surname>Zettlemoyer</surname></string-name>
          ,
          <article-title>BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension</article-title>
          , in:
          <string-name><given-names>D.</given-names> <surname>Jurafsky</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Chai</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Schluter</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Tetreault</surname></string-name>
          (Eds.),
          <source>Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          , Association for Computational Linguistics, Online,
          <year>2020</year>
          , pp.
          <fpage>7871</fpage>
          -
          <lpage>7880</lpage>
          . URL: https://aclanthology.org/2020.acl-main.703/. doi:10.18653/v1/2020.acl-main.703.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>