<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>O. M. Cumbicus-Pineda);
itziar.gonzalezd@ehu.eus (I. Gonzalez-Dios); a.soroa@ehu.eus (A. Soroa)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Linguistic Capabilities for a Checklist-based evaluation in Automatic Text Simplification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Oscar M. Cumbicus-Pineda</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Itziar Gonzalez-Dios</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aitor Soroa</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Carrera de Computación, Facultad de la Energía las Industrias y los Recursos Naturales No Renovables, Universidad Nacional de Loja</institution>
          ,
          <addr-line>Loja</addr-line>
          ,
          <country country="EC">Ecuador</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Ixa group, HiTZ center, University of the Basque Country (UPV/EHU)</institution>
          ,
          <addr-line>Informatika Fakultatea, Manuel Lardizabal 1, 20018 Donostia</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Ixa group, University of the Basque Country, UPV/EHU</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0001</lpage>
      <abstract>
        <p>Evaluation in Automatic Text Simplification (ATS) has been carried out by means of automatic metrics such as SARI and BLEU, by manual analysis that takes into account the grammar/fluency, meaning preservation and simplicity of the outputs, by readability metrics, or by extrinsic evaluation via NLP tasks. These metrics and dimensions give an overview of what the systems are doing, but we do not know exactly what their strong and weak points are. Inspired by recent literature on Natural Language Processing classification tasks, in this paper we explore the checklist-based evaluation of the linguistic capabilities ATS systems need to meet. We apply this evaluation to a syntax-aware edit-based ATS system and we point out the weaknesses and strengths of the system, which can also lead to improvements of the system.</p>
      </abstract>
      <kwd-group>
        <kwd>Automatic Text Simplification</kwd>
        <kwd>Manual evaluation</kwd>
        <kwd>Checklists</kwd>
        <kwd>Capabilities</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Automatic Text Simplification is a Natural Language Processing (NLP) research line which aims
to reduce the complexity of a text at both lexical and syntactic levels for a certain target audience.
The interested reader is referred to the following works for detailed information about ATS
[
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4 ref5 ref6">1, 2, 3, 4, 5, 6</xref>
        ].
      </p>
      <p>
        As with many other Natural Language Generation (NLG) tasks [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], the evaluation of ATS is still
an open question that raises major concerns in the community. Automatic metrics do not capture all
the nuances of text simplification, and human evaluation is costly and difficult to reproduce. Still,
the research community is putting a lot of effort into systematising ATS evaluation and making
manual evaluation as reliable as possible [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ]. Moreover, it is worth noting that most
work on ATS focuses exclusively on English.
      </p>
      <p>
        The most successful methods in ATS today are based on deep learning techniques, and are often
cast as a machine translation task where the system learns to “translate” from complex sentences
to simpler counterparts. Evaluation of neural models is, however, a difficult task, as they are black
boxes that are trained in an end-to-end fashion. Current trends in evaluating and debugging
neural models focus on methods that go beyond traditional metrics (e.g.
accuracy), such as adversarial rules (perturbations of the input that preserve its semantics, e.g.
changing ‘what’ to ‘which’, ‘movie’ to ‘film’ or introducing a typo, but that induce changes in a
black box model’s predictions) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] or checklists (a matrix of general linguistic capabilities and
tests) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. For example, in the case of sentiment analysis tasks, the identification of words that
carry positive, negative, or neutral sentiment, comparatives and superlatives, negation or named
entities are the linguistic characteristics considered. Although checklists are primarily intended
for classification tasks, in our opinion, they can also be applied to generation tasks such as ATS.
      </p>
      <p>In this paper, we open a path towards the study of the linguistic capabilities required for ATS
with the aim of better understanding how ATS systems work. In this way, we can identify the
weak and strong spots of the systems. To that end, we analyse the outputs of three different neural
ATS systems trained in English, Italian and Spanish and we present the evaluation of one system.
The contributions of this paper are: i) an analysis of the outputs of three different systems for three
languages, ii) a list of linguistic capabilities for ATS, iii) a checklist evaluation of a system and
future directions to improve it.</p>
      <p>This paper is structured as follows: in Section 2 we present the approaches to evaluate ATS
systems, in Section 3 we detail our approach and describe the linguistic capabilities, in Section
4 we present the checklist-based evaluation of a system, and in Section 5 we conclude and outline
future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Evaluation in ATS</title>
      <p>Evaluation in ATS is a research concern for the community. At the moment, systems are usually
evaluated automatically or evaluated via human ratings.</p>
      <p>
        Regarding the automatic evaluation, the most used automatic metrics are BLEU [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], and SARI
[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. These metrics are language independent since they mainly rely on n-gram overlap (BLEU)
or on measuring the words that are added, deleted or kept (SARI). Other metrics, however, need
language-dependent tools such as a parser in the case of SAMSA [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] or question generation and
answering systems as in QUESTEVAL for Sentence Simplification [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. However, some of these
tools are not available for many languages and cannot be applied. Readability assessment metrics
such as Flesch–Kincaid [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] are also language-dependent, in this case, for English. Although
some readability metrics have been adapted to some languages e.g. Fernandez-Huerta index for
Spanish [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], not all the languages have their own formulae. Other metrics that have been used
and proposed to evaluate ATS systems are TER [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], ROUGE [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], C-Score [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] or the E-Score [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ].
To ease the calculation of automatic metrics and to facilitate comparison, the package
EASSE was created [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], which includes BLEU, SARI and Flesch-Kincaid Grade Level.
      </p>
      <p>
        Concerning human evaluation, three criteria are mainly used: grammar/fluency, meaning
preservation, and simplicity [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Commonly, a Likert scale from 1 to 5 is used to give the ratings. In
the Quality Assessment for Text Simplification shared task, a three-level scale was used with the
bad/ok/good levels for each of the criteria [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]. They also created a combination of the three scores
called overall, which rewarded meaning preservation and simplicity more than grammaticality.
Human judgments, however, can vary across the target audience of the simplification and the
evaluators. Moreover, the evaluators can be linguists, simplification experts, members of the
target audience or crowdsourcing workers, and they may or may not be paid. In order to assist evaluators,
a reading comprehension test was done in [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] and specific questions for the task were posed
in [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ]. In this study, the authors also asked evaluators about the original sentences, since in the
dataset they curated, depending on the language, more than 15 % of the sentences were not correct.
      </p>
      <p>
        Other techniques that have been used to evaluate ATS systems are information measures
against a specially curated reference corpus to evaluate different linguistic phenomena [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ],
eye tracking [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] or extrinsic evaluation via information extraction [
        <xref ref-type="bibr" rid="ref28 ref29">28, 29</xref>
        ], a chunk-based question
generation system [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ], machine translation [
        <xref ref-type="bibr" rid="ref30 ref31">30, 31</xref>
        ], or semantic role labelling [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ].
      </p>
      <p>
        As Alva-Manchego et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] point out, metrics such as BLEU and SARI are flawed, and it is
necessary to keep all their limitations in mind. Regarding the human evaluation criteria (leaving
aside the costs and possible bias), they wonder whether grammar/fluency, meaning preservation, and
simplicity are enough. Moreover, unless an error analysis is made, we do not know what is happening
or what neural systems understand. In this line, Shardlow and Nawaz [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ] present a
framework of six types of error found in clinical neural ATS. These error types can be summarised
as changes with or without loss or alteration of the original meaning, reduction of the information
leading to the loss of critical information, word repetitions and no changes.
      </p>
      <p>
        In order to better understand what systems (not restricted to neural ones) are simplifying, in this
paper we propose a checklist evaluation for ATS by focusing on linguistic and simplicity
phenomena or capabilities. Checklists and similar techniques have been successfully used in other
NLP tasks such as sentiment analysis, duplicate question detection, machine comprehension [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ],
contradiction detection in dialogue [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ], hate speech detection [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ], offensive content detection
[
        <xref ref-type="bibr" rid="ref35">35</xref>
        ], or bias analysis [
        <xref ref-type="bibr" rid="ref36">36</xref>
        ]. Some capabilities such as negation have also been studied [
        <xref ref-type="bibr" rid="ref37">37</xref>
        ]
across different natural language inference tasks. By using general linguistic capabilities, our
aim is also to be as language independent as possible. We are aware that this evaluation is also
expensive, but it is necessary to understand and find the weak spots of the systems, which can
lead to the development of methods to improve them. To our knowledge, this is also the first time
checklists are proposed for generation tasks.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Checklist-based Evaluation: Linguistic Capabilities for ATS</title>
      <p>
        To create the list of the capabilities, we have analysed the outputs of three ATS neural systems:
a re-implementation of the edit-based system EditNTS [
        <xref ref-type="bibr" rid="ref38">38</xref>
        ] (EditNTS), a syntax-aware edit-based
system [
        <xref ref-type="bibr" rid="ref39">39</xref>
        ] (Edit+Synt), and a transformer built by us. We have trained and tested these systems in
the following corpora: for English, in Wikilarge/TurkCorpus [
        <xref ref-type="bibr" rid="ref40 ref41 ref42">40, 41, 42</xref>
        ], for Spanish in Simplext
[
        <xref ref-type="bibr" rid="ref43">43</xref>
        ] and for Italian in the combination of a subset of the PaCCSS-it corpus [
        <xref ref-type="bibr" rid="ref44">44</xref>
        ], the SIMPITIKI
corpus [
        <xref ref-type="bibr" rid="ref45">45</xref>
        ] and the Terence-Teacher corpus [
        <xref ref-type="bibr" rid="ref46">46</xref>
        ]. In Table 1 we show the results of the systems for
each dataset.
      </p>
      <p>We have randomly selected a sample of 15 original sentence pairs for each dataset, together
with the outputs of each system. In Table 2 we show an example of a sentence from Wikilarge.
Based on this sample, we have analysed the features related to grammar, meaning preservation
and simplicity included in the sentences and we have created a list of them. We have also identified
important features not related to these dimensions. This analysis has been carried out by a linguist
who is an expert in text simplification, a native speaker of Spanish with C1 proficiency in English and Italian.
We decided to analyse only 15 sentences for each language because we realised that the
most important errors kept recurring.</p>
      <p>Table 1: Results of the systems for each dataset (Wikilarge, Simplext, PaCCSS-it-SIMPITIKI-TerenceTeacher).</p>
      <sec id="sec-3-1">
        <title>3.1. Capabilities for ATS</title>
        <p>In the following sections, we present the linguistic capabilities we propose to reveal the strong and
weak points of ATS systems. These capabilities are meant to be useful for general simplification,
but they can be adapted depending on the target audience and the purpose of simplification. As
mentioned before, ATS is manually evaluated on the basis of three dimensions: grammar/fluency,
meaning preservation and simplicity. We add two new dimensions to this list: the prerequisites,
a set of basic checks any simplification should comply with, and ethical aspects,
which cover any ethical issue that may arise because of the produced simplifications. We
define these capabilities in terms of general linguistic phenomena so that they can be applied
to all languages.</p>
        <sec id="sec-3-1-1">
          <title>3.1.1. Prerequisites</title>
          <p>Two types of prerequisites need to be checked before the manual evaluation starts. The first
one is no simplification (P0), which means that the original sentence does not need to be
simplified and, therefore, the output should be a copy of the input, e.g.
Take the square root of the variance. In this case, the evaluation does not continue. If the system,
however, does not simplify a sentence that needs to be simplified, the capability is not satisfied
and the evaluation is also stopped, because the system has simply copied the original sentence.</p>
          <p>The second one is related to system errors: sentences with such errors should be discarded since they
are not fluent enough to be evaluated. Based on the most common system errors, the capabilities
we propose are:
• (P1) No UNK tokens: el de UNK y los UNK.
• (P2) No non-required quotation marks or strange characters: the ” ” ” ” ” ” ” ” ” ” ” ” ” ” ” ”
• (P3) No relation or alignment problems: While at Kahn he was chief architect for the Fisher
Building in 1928 . -&gt; he was the current conductor of the boston symphony orchestra.</p>
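          <p>Since P1 and P2 concern surface patterns, a rough automatic pre-filter could flag candidate sentences before manual annotation starts. The following Python sketch is only an illustration and not part of the evaluated systems: the function names, the placeholder string UNK and the rule of three or more consecutive quotation marks are assumptions made here.</p>
          <preformat><![CDATA[
```python
import re

# Assumed placeholder emitted by a system for out-of-vocabulary words.
UNK_TOKEN = "UNK"
# Quotation-like characters considered by the P2 heuristic (an assumption).
QUOTE_CHARS = "\"'“”‘’«»`"

# Three or more quotation marks in a row, optionally separated by spaces.
_QUOTE_RUN = re.compile(r"(?:[" + re.escape(QUOTE_CHARS) + r"]\s*){3,}")


def violates_p1(output, unk_token=UNK_TOKEN):
    """Return True if the output contains the unknown-word placeholder."""
    target = unk_token.lower()
    # Strip brackets and sentence punctuation so "<unk>" or "UNK." also match.
    tokens = (tok.strip("<>[].,;:!?").lower() for tok in output.split())
    return any(tok == target for tok in tokens)


def violates_p2(output):
    """Return True if the output contains a run of stray quotation marks."""
    return _QUOTE_RUN.search(output) is not None
```
]]></preformat>
          <p>Sentences flagged by such a filter would still need a manual check, and P3 (relation or alignment problems) is not covered by this sketch.</p>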
        </sec>
        <sec id="sec-3-1-2">
          <title>3.1.2. Grammar/Fluency</title>
          <p>
            These capabilities take into account the structure of the language. Although it has been argued that
the fluency of neural systems is close to human fluency, this has only been validated for English [
            <xref ref-type="bibr" rid="ref47">47</xref>
            ].
That is why fluency and grammar need to be taken into account when evaluating ATS outputs.
• Word level:
– (G1) No repeated words, e.g. repeated determiners (las las casas), adjectives (the same
whole whole was), nouns, prepositions, or conjugated verbs (estos fueron fue)...
– (G2) No change of tense (sarebbe -&gt; è) or modality (è -&gt; può essere) in the verbs,
unless necessary for a certain target audience
• Morpho-syntactic level:
– (G3) Correct and finished phrases or sentences (∗against their own, ∗defensores de los
derechos humanos en, ∗en ahora, ∗. is the national parks’ biggest or ∗promete aumentar que las
misiones en 2010)
– (G4) Correct agreement of subject and verb (he volunteers (...) and ∗search), grammatical
gender of determiners, nouns and adjectives (∗dentro de las apoyo or ∗a organization),
contractions (∗de el or ∗can not)...
– (G5) Correct phraseology: verb + preposition (∗enviar en los habitantes), collocations...
– (G6) No repeated arguments: double subject or verb (∗it is it also encloses)...
• Cohesion level (both at inter- and intra-sentence levels):
– (G7) Correct punctuation (as , , he)
– (G8) Correct grammatical order: phrase order, sentence order
– (G9) Correct coreference
– (G10) No definiteness change (the organization -&gt; an organization)
          </p>
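          <p>Some word-level capabilities could be pre-screened automatically to support the annotator. As an illustration of G1 only, the Python sketch below flags words that occur twice in a row; the function names are hypothetical and the approach is a simplifying assumption. It misses repetitions of different forms of the same word (estos fueron fue) and flags legitimate repetitions (had had), so it can only suggest candidates for manual inspection.</p>
          <preformat><![CDATA[
```python
import re


def repeated_words(sentence):
    """Return the words that appear twice in a row, ignoring case."""
    tokens = re.findall(r"\w+", sentence.lower())
    return [cur for prev, cur in zip(tokens, tokens[1:]) if prev == cur]


def satisfies_g1(sentence):
    """Heuristic G1 check: True if no word is immediately repeated."""
    return not repeated_words(sentence)
```
]]></preformat>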
        </sec>
        <sec id="sec-3-1-3">
          <title>3.1.3. Meaning preservation</title>
          <p>
            Meaning preservation is related to the fidelity of the simplified sentence to the
original sentence in terms of meaning. As the capabilities may differ from sentence to sentence,
e.g. presence of negation or adverbial information, we propose two levels in this dimension: the
mandatory capabilities, which should always be evaluated, and the optional capabilities, which
will be evaluated only in the cases where the phenomenon is present, e.g. negation should be
evaluated if there is a negation in the original sentence.
• Mandatory
– (M1) Important information kept (all the arguments and adjuncts that are necessary to
understand the whole meaning of the sentence).
– (M2) Register (formal, informal, literary, technical...) kept, unless a change is required by the target
audience
– (M3) No meaning change, or only subtle changes of nuance, e.g. deleting or adding
emphasizers (la risposta è tecnicamente no.)
• Optional (depending on the sentence)
– (M4) Named entities unaltered (La ministra de Defensa -&gt; la ministra de asuntos sociales)
– (M5) Negation kept
– (M6) Temporal adverbs and relations kept
– (M7) Numerical expressions kept and/or not altered except for rounding (check simplicity
capabilities [
            <xref ref-type="bibr" rid="ref48">48</xref>
            ])
– (M8) Correct lexical simplifications (project focuses on the laws of motion -&gt; project focuses
on health care)
– (M9) No overly general lexical simplification (educated workers -&gt; people)
– (M10) No unnecessary clichés or idioms that affect the meaning (Ma non è tutto ! -&gt; ma non
è tutto oro quel che luccica.)
          </p>
        </sec>
        <sec id="sec-3-1-4">
          <title>3.1.4. Simplicity</title>
          <p>
            These capabilities are related to simplification studies and guidelines [
            <xref ref-type="bibr" rid="ref49">49</xref>
            ], and to summaries of
easy-to-read guidelines [
            <xref ref-type="bibr" rid="ref50">50</xref>
            ]. As in the meaning dimension we also define here mandatory and
optional capabilities.
          </p>
          <p>• Mandatory
– (S1) Shorter sentences (in our opinion, explanations should be added in another sentence)
– (S2) Same term for same concept
– (S3) Logical or temporal ordering of relations
– (S4) Active voice (instead of passive)
– (S5) Simple, frequent words
– (S6) Same term consistently used
– (S7) Only one main idea covered per sentence
– (S8) Only one finite verb per sentence
– (S9) Simple punctuation
• Optional
– (S10) No legal, foreign and technical jargon
– (S11) ‘you’ used to speak directly to readers
– (S12) Use of the number and not the word
– (S13) Rounded numerical expressions
– (S14) Better-known names for named entities
– (S15) Necessary and correct elaborations, explanations
– (S16) Elided arguments or verbs recovered
– (S17) No exceptions to exceptions</p>
        </sec>
        <sec id="sec-3-1-5">
          <title>3.1.5. Ethical aspects</title>
          <p>The research and analysis of ethical aspects has gained a lot of importance in recent years in
NLP. Given that one of the goals of ATS is to adapt texts for people with difficulties, special care should
be taken and the maxim Primum non nocere (first, do no harm) should be of great importance. That
is why we think that this dimension should also be taken into account. Below we present
two ethical violations we have found in our analysis.
• (E1) No wrong information or misinformation (Disney received a full-size Oscar statuette and
seven miniature ones, presented to him by 10-year-old child actress Shirley Temple. -&gt; Disney sold
to him by in old child shirley temple.), unnecessary/wrong elaborations (in the area
Provence-Alpes-Côte Azur in the Nord-Pas-de-Calais region.), hallucinations or explanations/information
whose truth we cannot verify (Military career Donaldson enlisted in the Australian
Army on 18 June 2002 . -&gt; War war II military career Donaldson left the united states on 18 june
2002.)
• (E2) No stereotypes that are absent from the original or unnecessary mentions of discriminated or minority groups:
Detenidos tres menores por amenazar e injuriar a otra menor a través de una red social -&gt; la
guardia civil detiene a a los red emigrantes.</p>
          <p>To our knowledge, ethical aspects have not been taken into account in the evaluation of ATS and
we are open to discussing them, as well as other capabilities, with the community.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Capability score</title>
        <p>In order to quantitatively evaluate the aforementioned capabilities, we propose to score each
capability separately for each sentence. That is, for each sentence, the evaluator indicates whether a
capability has been fulfilled with a binary score (1: yes, 0: no). For example, the sentence simplified
by Edit+Synt presented in Table 2 fails capability G6 in the grammar dimension and
S5 in the simplicity dimension.</p>
        <p>To score a sample/corpus, we calculate the percentage of the positive scores for each capability.
That is, if we are evaluating the capability G1 in a sample of 50 sentences, and it is fulfilled in 48
of them, the score of the capability G1 will be 96 %. In the case of the optional capabilities, only
the sentences that have that feature should be taken into account.</p>
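        <p>The scoring procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the tooling used in this work (the annotation was done in a spreadsheet): the function name and the data layout, one dictionary per sentence mapping a capability identifier to 1, 0 or None (capability not applicable to that sentence), are hypothetical.</p>
        <preformat><![CDATA[
```python
from collections import defaultdict


def capability_scores(annotations):
    """Compute the percentage of fulfilled sentences for each capability.

    annotations: list of dicts, one per sentence, mapping a capability id
    (e.g. "G1") to 1 (fulfilled), 0 (not fulfilled) or None (not applicable,
    as for optional capabilities whose phenomenon is absent).
    """
    fulfilled = defaultdict(int)
    applicable = defaultdict(int)
    for sentence in annotations:
        for capability, score in sentence.items():
            if score is None:
                continue  # sentence does not count for this capability
            applicable[capability] += 1
            fulfilled[capability] += score
    # Capabilities that were never applicable are left out of the result.
    return {
        capability: 100.0 * fulfilled[capability] / applicable[capability]
        for capability in applicable
    }
```
]]></preformat>
        <p>With 50 annotated sentences of which 48 fulfil G1, the function returns 96.0 for G1, matching the example in the text; an optional capability is scored only over the sentences in which it was annotated.</p>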
        <p>To interpret the scores we propose a scale (Table 3) with the following values: 96-100 % perfect,
81-95 % substantial, 61-80 % moderate and &lt; 60 % low. This scale is inspired by the interpretation of
Cohen’s kappa but, since one of the main aims of ATS is to help people understand texts, we think
that the rating needs to be strict, and that is why all the capabilities below 60 % are considered
low. Capabilities that score less than 80 % should also be addressed by system developers.</p>
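        <p>The interpretation scale can be expressed as a small lookup. The Python sketch below is illustrative only; the function name is hypothetical and, because the scale is stated in whole percentages, the sketch assumes that a score falling between two listed ranges (for example 95.5 % or exactly 60 %) is assigned to the lower band.</p>
        <preformat><![CDATA[
```python
def interpret_score(score):
    """Map a capability score (percentage, 0-100) to a label of the scale.

    Assumption: scores between two listed ranges fall into the lower band.
    """
    if score >= 96:
        return "perfect"
    if score >= 81:
        return "substantial"
    if score >= 61:
        return "moderate"
    return "low"
```
]]></preformat>
        <p>For instance, a capability fulfilled in 48 of 50 sentences (96 %) is rated perfect, while one fulfilled in 75 % of the sentences is rated moderate and would be a candidate for improvement.</p>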
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Case study: Checklist evaluation for Edit+Synt at Wikilarge</title>
      <p>As a case study, in this section we evaluate the capabilities of the sentences simplified by Edit+Synt
(the system with the best quantitative performance) in the Wikilarge corpus. We have randomly chosen
10 % of the test set (36 sentences) to carry out this analysis (the ones used to create the list of
capabilities were discarded). Five of the sentences were sentences where no simplification should
be carried out and no simplification was performed. So, we have annotated in total 31 sentences
according to the capabilities. The annotator is an expert in text simplification and, once
trained in the task, she spent an average of 90 seconds per sentence pair. The annotation was
done in a spreadsheet. In Table 4 we group the capabilities by their score. We only show the
optional capabilities if there are 5 or more sentences to evaluate.</p>
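      <table-wrap id="tab-4">
        <label>Table 4</label>
        <caption>
          <p>Capabilities grouped by their score.</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th>Score</th>
              <th>Capabilities</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>Low (&lt; 60 %)</td>
              <td>P0, G3, M1, M3, M8, M9, S1, S7, S8, S10</td>
            </tr>
            <tr>
              <td>Moderate (61-80 %)</td>
              <td>G4, S4, S5</td>
            </tr>
            <tr>
              <td>Substantial (81-95 %)</td>
              <td>G1, G2, G5, G6, G7, G8, M2, M4, S3, S9, E1, E2</td>
            </tr>
            <tr>
              <td>Perfect (96-100 %)</td>
              <td>P1, P2, P3, G9, G10, M7, S2, S6</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>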
      <p>Let us explain the results grouped by dimension (Table 5 presents examples of the
violated capabilities to illustrate the errors). In the case of the prerequisites, the ones related to
system errors (P1, P2, and P3) are successfully fulfilled. However, in the case of no simplification
(P0), the system has not simplified 9 sentences (which, as mentioned before, were discarded from
evaluation), but only two of them were correctly unaltered. This result suggests that preprocessing
should be applied before performing any simplification step, as proposed by Scarton et al. [51].</p>
      <p>Regarding grammar, the weakest points of the system are related to the correctness of
phrases (G3) and to agreement (G4). This indicates that the system fails to properly exploit
phrase-level information. Strong points of the system are, however, the word-level (G1 and G2) and
cohesion-level (G7, G8, G9 and G10) capabilities.</p>
      <p>With respect to meaning preservation, the system struggles to keep important information
(M1), to avoid meaning changes (M3), and to perform correct lexical simplifications (M8 and M9). This indicates that the
system tends to perform too many delete operations that affect the original message of the
sentence. Lexical simplification is also a challenge for this system, which can be addressed by
performing it as a separate step, as many other systems do. The strong points are related to the register
(M2), and the preservation of named entities (M4) and numerical expressions (M7).</p>
      <p>Concerning simplicity, the system really struggles to create shorter sentences (S1) and
simple sentences (S7 and S8). This is linked to the splitting operation, which is a challenge for deep
learning based systems. In order to overcome this problem, performing rule-based simplification
as a previous step has been proposed [52]. Moreover, the system does not handle technical words
(S10) and does not use simple, frequent words (S5), most likely because these words
are missing from the training data. This could be addressed together with the lexical simplification.
Some effort should also be put into avoiding the passive voice (S4). The strong capabilities are, however, the
use of the same term for the same concept (S2) and the preservation of logical and temporal
ordering (S3). This may be due to the fact that the system does not perform any reordering
operation and follows the order of the original sentence.</p>
      <table-wrap id="tab-5">
        <label>Table 5</label>
        <caption>
          <p>Examples of the violated capabilities.</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th>Violated capability</th>
              <th>Complex</th>
              <th>Output</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>P0</td>
              <td>He was appointed Companion of Honour (CH) in 1988.</td>
              <td>He was appointed Companion of Honour (CH) in 1988. the name was given to the nobel prize in physics.</td>
            </tr>
            <tr>
              <td>G3</td>
              <td>There he had one daughter, later baptized as Mary Ann Fisher Power, to Ann (e) Power.</td>
              <td>He had one daughter, later ∅ as Mary Ann Fisher Power.</td>
            </tr>
            <tr>
              <td>G4</td>
              <td>The wounds inflicted by a club are generally known as bludgeoning or blunt-force trauma injuries.</td>
              <td>The wounds was by a club are generally known as bludgeoning or blunt-force .</td>
            </tr>
            <tr>
              <td>M1</td>
              <td>They are rivaled as biological materials in toughness only by chitin.</td>
              <td>they are only ∅ by chitin.</td>
            </tr>
            <tr>
              <td>M3</td>
              <td>The vagina is remarkably elastic and stretches to many times its normal diameter during vaginal birth.</td>
              <td>The vagina is very thin elastic and stretches to many times its normal diameter during ∅ birth.</td>
            </tr>
            <tr>
              <td>M8</td>
              <td>The tongue is sticky because of the presence of glycoprotein-rich mucous, which both lubricates movement in and out of the snout and helps to catch ants and termites, which adhere to it.</td>
              <td>The tongue is sticky because of the presence of glycoprotein-rich mucous, which both lubricates movement. Out of the snout and helps to catch ants and termites, which brings to it.</td>
            </tr>
            <tr>
              <td>M9</td>
              <td>Stands were eventually added behind each set of goals during the 1980s and 1990s as the ground began to be modernised.</td>
              <td>Stands were eventually added behind each set of goals during the 1980s and 1990s as the ground began to be finished.</td>
            </tr>
            <tr>
              <td>S1/S7/S8</td>
              <td>Civil unrest in northern Italy spawns the medieval musical form of Geisslerlieder, penitential songs sung by wandering bands of Flagellants.</td>
              <td>civil unrest in northern Italy made the medieval musical form of Geisslerlieder, penitential songs sung by wandering bands of Flagellants.</td>
            </tr>
            <tr>
              <td>E1</td>
              <td>He left a detachment of 11,000 troops to garrison the newly conquered region.</td>
              <td>He left a detachment of 11,000 troops to garrison the newly conquered region around the world, and was used in the area.</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
      <p>Finally, looking at the ethical aspects, we see that the system has not added stereotypes (E2)
(93.5 %), but it is close to the limit regarding the addition of wrong information (E1) (83.9 %). The
sentences that contain added wrong information or hallucinations tend to be longer than the
original ones, so this can be a hint to detect this kind of error.</p>
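      <p>The observation that outputs with added wrong information tend to be longer than their originals suggests a simple screening heuristic. The Python sketch below is an assumption-laden illustration rather than a method evaluated in this work: the function names are hypothetical, length is measured in whitespace-separated tokens, and the threshold of 1.0 (output longer than input) is an arbitrary starting point.</p>
      <preformat><![CDATA[
```python
def length_ratio(original, output):
    """Ratio of output length to original length, in whitespace tokens."""
    original_len = max(len(original.split()), 1)  # avoid division by zero
    return len(output.split()) / original_len


def flag_possible_additions(pairs, threshold=1.0):
    """Return the indices of (original, output) pairs whose output is longer
    than threshold times the original, as candidates for an E1 check."""
    return [
        index
        for index, (original, output) in enumerate(pairs)
        if length_ratio(original, output) > threshold
    ]
```
]]></preformat>
      <p>A flagged sentence is only a candidate: legitimate elaborations also lengthen the output, so the annotator still has to decide whether E1 is violated.</p>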
      <p>In general, we can say that the system passes the checklist exam, since many capabilities are
in the ranges of perfect and substantial. However, there are weak points that should be addressed
and treated, which we know thanks to this methodology.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and Future Work</title>
      <p>In this paper we have proposed to evaluate ATS systems based on their linguistic capabilities.
To that end, inspired by the checklist method, we have defined a first set of linguistic capabilities
required for a sentence/text to be a correct simplification. These capabilities are grouped in the
three dimensions along which ATS is usually manually evaluated, but we have also added the prerequisites and
the ethical aspects. We think that adding the ethical aspects is important since one of the main
aims of ATS is to help people understand texts, and no wrong information or biases should be
included or amplified. Moreover, based on these capabilities, we can systematically understand what systems
are doing and what their weak and strong points are. This can open ways
to improve the systems and guide future research.</p>
      <p>We have also shown the validity of the proposal for evaluating ATS systems by analysing the
outputs of the Edit+Synt system. Based on this evaluation, we have seen that the system performs
quite well but needs improvement in the correctness of phrases and agreement, in keeping the
important information, and in creating shorter sentences.</p>
      <p>We are open to discussing more capabilities with the community, and to adapting or specifying
them. We would like to test other languages and other systems, and even to automate the analysis of
some of the features to facilitate the manual evaluation. As suggested by the reviewers, it will also
be interesting to i) stratify the analysis sample of the datasets, e.g. based on sentence length
and depth or readability measures, to analyse other kinds of errors and create more capabilities; ii)
carry out the analysis on other datasets from other domains that can include abstract language,
figurative language, and sarcasm; iii) perform pilot studies to determine a better threshold for the
interpretation of the capability score; iv) explore how to visualise
the evaluation; and, finally, v) compare our results to the ones obtained with traditional (human
and automatic) evaluation methods and try to find correlations.</p>
      <p>There is still a lot of work to do before we get outputs that can be used by people who adapt
and/or simplify texts, or directly by people who need the simplified/adapted texts. In this sense,
checklist evaluation of linguistic capabilities can open a way towards better-quality ATS.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>We sincerely thank the anonymous reviewers for their comments and suggestions. We
acknowledge the following projects: DeepText (KK-2020/00088), DeepReading RTI2018-096846-B-C21
(MCIU/AEI/FEDER, UE), BigKnowledge for Text Mining, BBVA and IXA group (Basque
Government excellence research group IT1343-19).</p>
      <p>in: Proceedings of the Workshop on Automatic Text Simplification - Methods and
Applications in the Multilingual Society (ATS-MA 2014), Association for
Computational Linguistics and Dublin City University, Dublin, Ireland, 2014, pp. 30–40. URL:
https://www.aclweb.org/anthology/W14-5604. doi:10.3115/v1/W14-5604.</p>
      <p>[51] C. Scarton, P. Madhyastha, L. Specia, Deciding when, how and for whom to simplify, in:
ECAI 2020, volume 325, IOS Press, 2020, pp. 2172–2179.</p>
      <p>[52] M. Maddela, F. Alva-Manchego, W. Xu, Controllable Text Simplification with Explicit
Paraphrasing, arXiv preprint arXiv:2010.11004 (2020).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>I.</given-names>
            <surname>Gonzalez-Dios</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Aranzabe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Díaz de Ilarraza</surname>
          </string-name>
          ,
          <article-title>Testuen sinplifikazio automatikoa: arloaren egungo egoera</article-title>
          ,
          <source>Linguamática</source>
          <volume>5</volume>
          (
          <year>2013</year>
          )
          <article-title>43–63</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Shardlow</surname>
          </string-name>
          , A Survey of Automated Text Simplification,
          <source>International Journal of Advanced Computer Science and Applications</source>
          <volume>4</volume>
          (
          <year>2014</year>
          )
          <article-title>58–70</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Siddharthan</surname>
          </string-name>
          , A Survey of Research on Text Simplification,
          <source>ITL - International Journal of Applied Linguistics</source>
          <volume>165</volume>
          (
          <year>2014</year>
          )
          <article-title>259–298</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H.</given-names>
            <surname>Saggion</surname>
          </string-name>
          , Automatic Text Simplification,
          <source>Synthesis Lectures on Human Language Technologies</source>
          <volume>10</volume>
          (
          <year>2017</year>
          )
          <article-title>1–137</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>F.</given-names>
            <surname>Alva-Manchego</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Scarton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Specia</surname>
          </string-name>
          ,
          <article-title>Data-driven sentence simplification: Survey and benchmark</article-title>
          ,
          <source>Computational Linguistics</source>
          <volume>46</volume>
          (
          <year>2020</year>
          )
          <article-title>135–187</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.</given-names>
            <surname>Sikka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pink</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Mago</surname>
          </string-name>
          , A Survey on Text Simplification, arXiv preprint arXiv:2008.08612
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Celikyilmaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <article-title>Evaluation of text generation: A survey</article-title>
          , arXiv preprint arXiv:2006.14799
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Belz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Graham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Reiter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shimorina</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the Workshop on Human Evaluation of NLP Systems (HumEval)</source>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Online,
          <year>2021</year>
          . URL: https://www.aclweb.org/anthology/2021.humeval-1.0.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Shimorina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Belz</surname>
          </string-name>
          ,
          <article-title>The human evaluation datasheet 1.0: A template for recording details of human evaluation experiments in NLP</article-title>
          ,
          <year>2021</year>
          . arXiv:2103.09710.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Ribeiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Guestrin</surname>
          </string-name>
          ,
          <article-title>Semantically equivalent adversarial rules for debugging NLP models</article-title>
          ,
          <source>in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>856–865</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Ribeiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Guestrin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <article-title>Beyond accuracy: Behavioral testing of NLP models with CheckList</article-title>
          ,
          <source>in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          , Association for Computational Linguistics, Online,
          <year>2020</year>
          , pp.
          <fpage>4902–4912</fpage>
          . URL: https://www.aclweb.org/anthology/2020.acl-main.442. doi:10.18653/v1/2020.acl-main.442.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>K.</given-names>
            <surname>Papineni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Roukos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ward</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-J.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>Bleu: a method for automatic evaluation of machine translation</article-title>
          ,
          <source>in: Proceedings of the 40th annual meeting of the Association for Computational Linguistics</source>
          ,
          <year>2002</year>
          , pp.
          <fpage>311–318</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>W.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Pavlick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Callison-Burch</surname>
          </string-name>
          ,
          <article-title>Optimizing statistical machine translation for text simplification</article-title>
          ,
          <source>Transactions of the Association for Computational Linguistics</source>
          <volume>4</volume>
          (
          <year>2016</year>
          )
          <article-title>401–415</article-title>
          . URL: https://www.aclweb.org/anthology/Q16-1029. doi:10.1162/tacl_a_00107.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>E.</given-names>
            <surname>Sulem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Abend</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rappoport</surname>
          </string-name>
          ,
          <article-title>Semantic Structural Evaluation for Text Simplification</article-title>
          ,
          <source>in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)</source>
          <year>2018</year>
          , pp.
          <fpage>685–696</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>T.</given-names>
            <surname>Scialom</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Staiano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Villemonte de la Clergerie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Sagot</surname>
          </string-name>
          ,
          <source>Rethinking Automatic Evaluation in Sentence Simplification</source>
          ,
          <year>2021</year>
          . arXiv:2104.07560.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Kincaid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. P.</given-names>
            <surname>Fishburne Jr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. L.</given-names>
            <surname>Rogers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. S.</given-names>
            <surname>Chissom</surname>
          </string-name>
          ,
          <article-title>Derivation of new readability formulas (automated readability index, fog count and flesch reading ease formula) for navy enlisted personnel</article-title>
          ,
          <source>Technical Report, Naval Technical Training Command Millington TN Research Branch</source>
          ,
          <year>1975</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J.</given-names>
            <surname>Fernández Huerta</surname>
          </string-name>
          , Medidas sencillas de lecturabilidad,
          <source>Consigna</source>
          <volume>214</volume>
          (
          <year>1959</year>
          )
          <article-title>29–32</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>M.</given-names>
            <surname>Snover</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Dorr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Schwartz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Micciulla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Makhoul</surname>
          </string-name>
          ,
          <article-title>A study of translation edit rate with targeted human annotation</article-title>
          ,
          <source>in: Proceedings of association for machine translation in the Americas</source>
          , volume
          <volume>200</volume>
          , Citeseer,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>C.-Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Rouge: A package for automatic evaluation of summaries</article-title>
          ,
          <source>in: Text summarization branches out</source>
          ,
          <year>2004</year>
          , pp.
          <fpage>74–81</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>I.</given-names>
            <surname>Temnikova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Maneva</surname>
          </string-name>
          ,
          <article-title>The C-Score – Proposing a Reading Comprehension Metrics as a Common Evaluation Measure for Text Simplification</article-title>
          ,
          <source>in: Proceedings of the Second Workshop on Predicting and Improving Text Readability for Target Reader Populations</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>20–29</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>S.</given-names>
            <surname>Mathias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bhattacharyya</surname>
          </string-name>
          ,
          <article-title>How Hard Can it Be? The E-Score - A Scoring Metric to Assess the Complexity of Text</article-title>
          ,
          <source>in: Proceedings of Quality Assessment for Text Simplification (QATS) Workshop</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>10–14</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>F.</given-names>
            <surname>Alva-Manchego</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Scarton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Specia</surname>
          </string-name>
          ,
          <article-title>EASSE: Easier automatic sentence simplification evaluation</article-title>
          ,
          <source>in: EMNLP-IJCNLP 2019-Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing (demo session)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>49–54</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>S.</given-names>
            <surname>Štajner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Popović</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Saggion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Specia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fishel</surname>
          </string-name>
          ,
          <article-title>Shared Task on Quality Assessment for Text Simplification</article-title>
          ,
          <source>in: Proceedings of the Quality Assessment for Text Simplification (QATS)</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>22–31</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Mandya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Nomoto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Siddharthan</surname>
          </string-name>
          ,
          <article-title>Lexico-syntactic text simplification and compression with typed dependencies</article-title>
          ,
          <source>in: 25th International Conference on Computational Linguistics</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>I.</given-names>
            <surname>Gonzalez-Dios</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Aranzabe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. D.</given-names>
            <surname>de Ilarraza</surname>
          </string-name>
          ,
          <article-title>Making Biographical Data in Wikipedia Readable: A Pattern-based Multilingual Approach</article-title>
          ,
          <source>in: Proceedings of the Workshop on Automatic Text Simplification - Methods and Applications in the Multilingual Society (ATS-MA 2014)</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>11–20</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>C.</given-names>
            <surname>Gasperin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Maziero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Aluisio</surname>
          </string-name>
          ,
          <article-title>Challenging choices for text simplification</article-title>
          ,
          <source>in: International Conference on Computational Processing of the Portuguese Language</source>
          , Springer,
          <year>2010</year>
          , pp.
          <fpage>40–50</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>L.</given-names>
            <surname>Rello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bautista</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Baeza-Yates</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Gervás</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hervás</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Saggion</surname>
          </string-name>
          ,
          <article-title>One half or 50%? An eye-tracking study of number representation readability</article-title>
          ,
          <source>in: IFIP Conference on Human-Computer Interaction</source>
          , Springer,
          <year>2013</year>
          , pp.
          <fpage>229–245</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Evans</surname>
          </string-name>
          ,
          <article-title>Comparing Methods for the Syntactic Simplification of Sentences in Information Extraction</article-title>
          ,
          <source>Literary and Linguistic Computing</source>
          <volume>26</volume>
          (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>R.</given-names>
            <surname>Evans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Orasan</surname>
          </string-name>
          ,
          <article-title>Sentence Simplification for Semantic Role Labelling and Information Extraction</article-title>
          ,
          <source>in: Proceedings of Recent Advances in Natural Language Processing</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>285–294</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>K.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Soni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <article-title>Exploring the effects of sentence simplification on Hindi to English machine translation system</article-title>
          ,
          <source>in: Proceedings of the Workshop on Automatic Text Simplification - Methods and Applications in the Multilingual Society (ATS-MA 2014)</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>21–29</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>S.</given-names>
            <surname>Štajner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Popović</surname>
          </string-name>
          ,
          <article-title>Can text simplification help machine translation?</article-title>
          ,
          <source>in: Proceedings of the 19th Annual Conference of the European Association for Machine Translation</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>230–242</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>M.</given-names>
            <surname>Shardlow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Nawaz</surname>
          </string-name>
          ,
          <article-title>Neural text simplification of clinical letters with a domain specific phrase table</article-title>
          ,
          <source>in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>380–389</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Nie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Williamson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bansal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kiela</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Weston</surname>
          </string-name>
          ,
          <article-title>I like fish, especially dolphins: Addressing Contradictions in Dialogue Modelling</article-title>
          , arXiv preprint arXiv:2012.13391
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>P.</given-names>
            <surname>Röttger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Vidgen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Waseem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Margetts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pierrehumbert</surname>
          </string-name>
          ,
          <article-title>Hatecheck: Functional tests for hate speech detection models</article-title>
          , arXiv preprint arXiv:2012.15606 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bhatt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dandapat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sitaram</surname>
          </string-name>
          ,
          <article-title>A case study of efficacy and challenges in practical human-in-loop evaluation of nlp systems using checklist</article-title>
          ,
          <source>in: Proceedings of the Workshop on Human Evaluation of NLP Systems (HumEval)</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>120–130</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Bi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>On robustness and bias analysis of bert-based relation extraction</article-title>
          , arXiv e-prints (
          <year>2020</year>
          )
          arXiv–2009
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Hossain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kovatchev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dutta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Blanco</surname>
          </string-name>
          ,
          <article-title>An analysis of natural language inference benchmarks through the lens of negation</article-title>
          ,
          <source>in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>9106–9118</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rezagholizadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. C. K.</given-names>
            <surname>Cheung</surname>
          </string-name>
          ,
          <article-title>EditNTS: An neural programmer-interpreter model for sentence simplification through explicit editing</article-title>
          ,
          <source>in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>
          , Association for Computational Linguistics, Florence, Italy,
          <year>2019</year>
          , pp.
          <fpage>3393–3402</fpage>
          . URL: https://www.aclweb.org/anthology/P19-1331. doi:10.18653/v1/P19-1331.
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>O. M.</given-names>
            <surname>Cumbicus-Pineda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gonzalez-Dios</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Soroa</surname>
          </string-name>
          ,
          <article-title>A Syntax-Aware Edit-based System for Text Simplification</article-title>
          ,
          <source>in: Proceedings of RANLP 2021</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bernhard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          ,
          <article-title>A monolingual tree-based translation model for sentence simplification</article-title>
          ,
          <source>in: Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)</source>
          ,
          <year>2010</year>
          , pp.
          <fpage>1353–1361</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lapata</surname>
          </string-name>
          ,
          <article-title>Sentence simplification with deep reinforcement learning</article-title>
          ,
          <source>in: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>584–594</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>W.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Pavlick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Callison-Burch</surname>
          </string-name>
          ,
          <article-title>Optimizing statistical machine translation for text simplification</article-title>
          ,
          <source>Transactions of the Association for Computational Linguistics</source>
          <volume>4</volume>
          (
          <year>2016</year>
          )
          <fpage>401–415</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>H.</given-names>
            <surname>Saggion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Štajner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mille</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Rello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Drndarevic</surname>
          </string-name>
          ,
          <article-title>Making it simplext: Implementation and evaluation of a text simplification system for Spanish</article-title>
          ,
          <source>ACM Transactions on Accessible Computing (TACCESS)</source>
          <volume>6</volume>
          (
          <year>2015</year>
          )
          <fpage>1–36</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          [44]
          <string-name>
            <given-names>D.</given-names>
            <surname>Brunato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cimino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Dell'Orletta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Venturi</surname>
          </string-name>
          ,
          <article-title>Paccss-it: A parallel corpus of complex-simple sentences for automatic text simplification</article-title>
          ,
          <source>in: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>351–361</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          [45]
          <string-name>
            <given-names>S.</given-names>
            <surname>Tonelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Aprosio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Saltori</surname>
          </string-name>
          ,
          <article-title>Simpitiki: a simplification corpus for Italian</article-title>
          ,
          <source>Proc. of CLiC-it</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          [46]
          <string-name>
            <given-names>D.</given-names>
            <surname>Brunato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Dell'Orletta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Venturi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Montemagni</surname>
          </string-name>
          ,
          <article-title>Design and annotation of the first Italian corpus for text simplification</article-title>
          ,
          <source>in: Proceedings of The 9th Linguistic Annotation Workshop</source>
          , Association for Computational Linguistics, Denver, Colorado, USA,
          <year>2015</year>
          , pp.
          <fpage>31–41</fpage>
          . URL: https://www.aclweb.org/anthology/W15-1604. doi:10.3115/v1/W15-1604.
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          [47]
          <string-name>
            <given-names>L.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>É.</given-names>
            <surname>de la Clergerie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bordes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Sagot</surname>
          </string-name>
          ,
          <article-title>Multilingual unsupervised sentence simplification</article-title>
          , arXiv preprint arXiv:2005.00352 (version 16 Apr 2021) (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          [48]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bautista</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hervás</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Gervás</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Power</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Williams</surname>
          </string-name>
          ,
          <article-title>A system for the simplification of numerical expressions at different levels of understandability</article-title>
          ,
          <source>in: Natural Language Processing for Improving Textual Accessibility (NLP4ITA 2013)</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>10–19</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          [49]
          <string-name>
            <given-names>I.</given-names>
            <surname>Gonzalez-Dios</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Aranzabe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. D.</given-names>
            <surname>de Ilarraza</surname>
          </string-name>
          ,
          <article-title>The corpus of Basque simplified texts (CBST)</article-title>
          ,
          <source>Language Resources and Evaluation</source>
          <volume>52</volume>
          (
          <year>2018</year>
          )
          <fpage>217–247</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref50">
        <mixed-citation>
          [50]
          <string-name>
            <given-names>R.</given-names>
            <surname>Mitkov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Štajner</surname>
          </string-name>
          ,
          <article-title>The fewer, the better? a contrastive study about ways to simplify</article-title>
          ,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>