<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">When the Scale is Unclear - Analysis of the Interpretation of Rating Scales in Human Evaluation of Text Simplification</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Regina</forename><surname>Stodden</surname></persName>
							<email>regina.stodden@hhu.de</email>
							<affiliation key="aff0">
								<orgName type="institution">Heinrich Heine University Düsseldorf</orgName>
								<address>
									<addrLine>Universitätsstraße 1</addrLine>
									<postCode>40225</postCode>
									<settlement>Düsseldorf</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">When the Scale is Unclear - Analysis of the Interpretation of Rating Scales in Human Evaluation of Text Simplification</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">9D83E0D5BC953486E15DBCAA58951FFD</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T11:07+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>text simplification</term>
					<term>human evaluation</term>
					<term>scale interpretation</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In the evaluation of text simplification, human ratings are of the highest importance, as automatic metrics are not yet sufficient. However, no best practices for the human evaluation of text simplification exist so far. Hence, several different rating scales and definitions of evaluation dimensions are used to evaluate the outputs of text simplification systems. The scales also lack analysis regarding their reliability and interpretation. Therefore, in this paper, we analyse the interpretation of the rating scales of the evaluation dimensions meaning preservation and simplicity based on simplification pairs with no change. Our analysis shows that annotators interpreted the scale of the simplicity dimension differently: on the one hand, the lowest value was interpreted to describe that the simplified sentence is more complex than the original sentence, and on the other hand, that the simplified sentence is as complex as the original sentence. Overall, the paper emphasises that best practices for the human evaluation of text simplification are needed to reduce misinterpretation of the scales.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Text simplification is the manual or automatic process of generating a simpler version of a complex text or sentence while preserving its meaning. Simplified texts are easier to understand, for example, for non-native speakers or people with lower literacy. Besides simplicity, meaning preservation and grammaticality are important criteria for a good simplification of a text. Thus, these criteria are also used to evaluate automatic text simplification systems <ref type="bibr" target="#b0">[1]</ref>. For evaluation, the original text and its generated simplified version are aligned into a simplification pair. This pair can be evaluated manually or automatically <ref type="bibr" target="#b0">[1]</ref>.</p><p>So far, manual evaluation is the most reliable method to judge text simplification <ref type="bibr" target="#b0">[1]</ref>, as the existing automatic evaluation metrics, for example, focus either only on lexical changes, e.g., SARI <ref type="bibr" target="#b1">[2]</ref>, or on meaning preservation, e.g., BLEU <ref type="bibr" target="#b2">[3]</ref>. Nevertheless, human evaluation also has its weaknesses because no best practices for text simplification evaluation exist. Currently, three dimensions are most often used in research, i.e., meaning preservation, simplicity and fluency <ref type="bibr" target="#b0">[1]</ref>. Even if there is agreement on these dimensions, there is no agreement on the questions and scales used for evaluation (see <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b4">5,</ref><ref type="bibr" target="#b5">6,</ref><ref type="bibr" target="#b6">7]</ref>). 
Although Likert scales <ref type="bibr" target="#b7">[8]</ref> are often used in text simplification evaluation and other evaluation tasks, many options exist for using and interpreting a Likert scale <ref type="bibr" target="#b8">[9]</ref>.</p><p>In this paper, we analyse the interpretation of different existing scales, including Likert scales, used for human evaluation in six text simplification datasets. We investigate whether different scale interpretations exist by looking at human ratings of simplification pairs for which the original and the simplified sentences are identical. In detail, we answer the following research questions: I) Do human annotators agree on one label in the judgment of simplicity of identical sentence pairs, e.g., the middle or the lowest score value? II) Do human annotators agree on one label in the judgment of meaning preservation of identical sentence pairs, i.e., the highest score value? III) Do human annotators stick to their interpretation of a rating scale in all of their ratings?</p><p>In the following, we first summarise the state of the art in manual evaluation of text simplification. Then, we describe our methods and data and build our hypotheses. Afterwards, we present our results, conclude with a final interpretation and discussion of the results, and mention possible future work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>The human evaluation of natural language processing tasks is very costly and time-consuming; hence, automatic metrics are developed and optimised. For text simplification evaluation, several directions also exist, e.g., evaluation on multiple references <ref type="bibr" target="#b1">[2]</ref>, evaluation without any reference <ref type="bibr" target="#b9">[10]</ref> or evaluation of structural simplifications <ref type="bibr" target="#b10">[11]</ref>. However, all of these metrics still have some limitations; hence, they should only be used for quickly comparing and assessing different text simplification systems <ref type="bibr" target="#b0">[1]</ref>.</p><p>For a more detailed evaluation, human evaluation is required. In human text simplification evaluation, common evaluation dimensions exist, i.e., meaning preservation, simplicity, and grammaticality <ref type="bibr" target="#b0">[1]</ref>, but there is no agreement on the questions asked per evaluation dimension or the scale used for evaluation.</p><p>For the same dimension, e.g., fluency (also called grammaticality), several definitions and questions exist: <ref type="bibr" target="#b10">[11]</ref> ask the raters if the output sentence is grammatical, <ref type="bibr" target="#b11">[12]</ref> ask if the simplified sentence is grammatical and fluent, and <ref type="bibr" target="#b12">[13]</ref> state that "fluency indicates if the output is syntactically correct". Even if the statements sound similar, they emphasise different points and, hence, the raters may focus on different aspects during the rating. Especially if a rater is not an expert in text simplification, minor differences may lead to incomparable results. 
There is also a discussion of whether sentence pairs should be rated by experts or by crowd workers of the target group <ref type="bibr" target="#b0">[1]</ref>.</p><p>Furthermore, there is no agreement on a rating scale: most approaches prefer Likert scales (see <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b4">5]</ref>), but others prefer continuous scales (see <ref type="bibr" target="#b6">[7]</ref>). However, Likert scales are also used differently, e.g., with a scale ranging from 1 to 5 (see <ref type="bibr" target="#b4">[5]</ref>) or from -2 to +2 (see <ref type="bibr" target="#b5">[6]</ref>). Following <ref type="bibr" target="#b8">[9]</ref>, Likert scales can also differ regarding other aspects, e.g., single-item vs. multi-item, same distance between consecutive points (ordinal vs. interval), odd or even number of points, each point labeled vs. only end points labeled, descending vs. ascending order, and negatively or positively stated items.</p><p>In text simplification evaluation, the most common rating scales are 5-point Likert scales, e.g., <ref type="bibr" target="#b4">[5]</ref>, a scale from -2 to +2, e.g., <ref type="bibr" target="#b5">[6]</ref>, and a continuous scale from 0 to 100, e.g., <ref type="bibr" target="#b13">[14,</ref><ref type="bibr" target="#b6">7]</ref>. On the one hand, <ref type="bibr" target="#b6">[7]</ref> argue that a continuous scale leads to more consistent inter-annotator agreement in text simplification evaluation, as has already been shown for machine translation. On the other hand, <ref type="bibr" target="#b5">[6]</ref> prefer a Likert scale with negative to positive scale points including a neutral middle point, because it is helpful for rating sentence pairs in which the simplified sentence is more complex than or equally complex as the original sentence. However, both scales include a middle point. 
Following <ref type="bibr" target="#b8">[9,</ref><ref type="bibr" target="#b14">15]</ref>, annotators interpret the middle point as, e.g., "undecided", "neutral", or "no opinion", which might not always be the interpretation the scale developers intended. Overall, the different scales and their interpretations make it difficult to compare the ratings of different system outputs and, therefore, distort text simplification evaluation.</p></div>
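To make the differences between these scale conventions concrete, the following minimal Python sketch (illustrative only; the function name and example scores are our own and are not taken from any of the cited works) linearly rescales ratings from the three scale types discussed above onto a common [0, 1] interval:

```python
def to_unit_interval(score, lo, hi):
    """Linearly map a rating from its native scale [lo, hi] onto [0, 1]."""
    return (score - lo) / (hi - lo)

# Hypothetical ratings on the three scale types discussed above.
print(to_unit_interval(4, 1, 5))     # 5-point Likert scale -> 0.75
print(to_unit_interval(0, -2, 2))    # -2 to +2 scale: neutral point -> 0.5
print(to_unit_interval(75, 0, 100))  # continuous 0-100 scale -> 0.75
```

Note that such a linear rescaling assumes all raters interpret the scale points identically; whether that assumption holds is exactly what the analysis in this paper questions.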
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Method</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Data</head><p>As we want to analyse the interpretation of rating scales by different annotators, a dataset with ratings by at least two annotators is required. Therefore, in our analysis, we focus on QATS <ref type="bibr" target="#b15">[16]</ref> <ref type="foot" target="#foot_0">1</ref> , HSplit <ref type="bibr" target="#b5">[6]</ref> <ref type="foot" target="#foot_1">2</ref> , PWKP test <ref type="bibr" target="#b10">[11]</ref> <ref type="foot" target="#foot_2">3</ref> , ASSET <ref type="bibr" target="#b6">[7]</ref> <ref type="foot" target="#foot_3">4</ref> , human-likert and system-likert <ref type="bibr" target="#b13">[14]</ref>  <ref type="foot" target="#foot_4">5</ref> , and Fusion <ref type="bibr" target="#b17">[18,</ref><ref type="bibr" target="#b18">19]</ref> <ref type="foot" target="#foot_5">6</ref> . An overview of their relevant evaluation dimensions, scales and numbers of raters per dataset is given in Table <ref type="table" target="#tab_0">1</ref>.</p><p>Additionally, grammaticality is rated in all datasets. However, it is rated only absolutely on the simplified sentence and not in relation to the original sentence, so we do not consider it in the analysis. The simplicity rating of QATS is also not considered for the same reason.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Hypotheses</head><p>The dataset selection already showed the differences between human evaluations in text simplification. Even if the name and the idea behind the evaluation dimensions are very similar, the judgements are collected I) on scales with different sizes, i.e., 3, 5 and 100, II) on scales with different point names, i.e., "good" to "bad" or "strongly disagree" to "strongly agree", III) by crowd workers or experts, IV) on different item types, i.e., questions or statements, V) on different types of simplification pairs, i.e., manually or automatically simplified sentences, VI) on sources which are reused for text simplification, e.g., English Wikipedia and Simple English Wikipedia (in HSplit), or which are directly designed for text simplification (in ASSET, human-likert), and VII) on sentence pairs with different aspirations regarding the simplicity level, e.g., the simplified sentence must be simpler or the simplified sentence can also be more complex. These points make it difficult to compare judgements of text simplification systems reported in system papers. In the following, we analyse whether more problems in human evaluation exist. To this end, we analyse whether the annotators consistently understood the scales in each of the datasets.</p><p>To analyse the interpretation of the scales, we compare the ratings of simplification pairs in which no change was made from the original to the simplified sentence. These sentence pairs are further called no-change pairs. As complexity assessment is a subjective task, different ratings of the simplifications are expected. However, if the simplified sentence is identical to the original sentence, the rating can be expected to be the same, because it is not the absolute simplicity of the sentence that is measured but the change/simplification, which does not exist in this case. 
Hence, we use the no-change pairs of the datasets to check whether different interpretations of the rating scales exist. An overview of the proportion of no-change pairs per dataset and the dataset sizes is given in Table <ref type="table">2</ref>.</p><p>We focus on the analysis of the evaluation dimensions of simplicity and meaning preservation. The interpretation of the grammaticality dimension could not be analysed, as in all datasets grammaticality was only rated for the simplified sentence but not for the original sentence. In the analysis, we verify the following hypotheses, which are based on the dataset and scale descriptions in the previous section.</p><p>Hypothesis 1: In HSplit and Fusion, the simplicity ratings of no-change pairs are equal to the neutral element, i.e., 0. The simplicity ratings in HSplit are judged on a scale ranging from -2 to +2 including the neutral element 0. Following the scale definition in <ref type="bibr" target="#b5">[6]</ref>, the neutral element of the scale indicates that the simplicity of the original and the simplified sentence of a pair is the same. Hence, we hypothesise that the simplicity ratings of no-change pairs in HSplit and Fusion are equal to 0. A score of -2 indicates a more complex simplified sentence and +2 an easier simplified sentence compared to the original sentence.</p><p>Hypothesis 2: In ASSET, human-likert, and system-likert, the simplicity ratings of no-change pairs are equal to the lowest element of the scale, i.e., 0, as it indicates the worst simplification. In ASSET, human-likert, and system-likert, the annotators rate the relative simplicity of a simplification pair based on their level of agreement with a given statement. The scale ranges from 0 (strongly disagree) to 100 (strongly agree). Hence, the lowest value indicates a rejection of the statement, which is interpretable as the worst simplification. 
In the rating instruction <ref type="bibr" target="#b19">[20]</ref>, the question is raised of how to annotate the sentence pair if the original and the simplified sentence are exactly the same. The answer refers to the formulation of the dimension that some change should have been made. However, it does not indicate an expected behaviour of the annotators, e.g., to not judge an identical pair or to judge it with a specific value. Hence, we can only assume that the lowest score, i.e., 0, indicates both that the simplified sentence is more complex than the original sentence and that the simplified sentence is as simple/complex as the original sentence. Following this interpretation, a score of 50 would indicate that the simplified sentence is roughly 50% simpler than the original sentence.</p><p>Hypothesis 3: The meaning preservation rating is equal to the maximum element in QATS, HSplit, PWKP test, ASSET, human-likert, system-likert, and Fusion. In no-change pairs, the meaning of the original sentence is exactly the same as in the simplified sentence. As meaning preservation measures, in all corpora, the extent to which the meaning is preserved in the simplified compared to the original sentence, we hypothesise the highest possible value for no-change pairs in the evaluation dimension of meaning preservation. The highest possible value for QATS and PWKP test<ref type="foot" target="#foot_6">7</ref> is 3, for HSplit 5, and for ASSET, human-likert and system-likert 100, respectively. In contrast, the lowest possible values would indicate that the simplified sentence has a completely different meaning than the original sentence. Even if the scale has a middle element, this element does not have to indicate a neutral element, as it does for the simplicity scale in HSplit. 
Following <ref type="bibr" target="#b8">[9]</ref>, the middle element can also express the indecision of the rater, which is more likely in this case.</p><p>Hypothesis 4: If different interpretations of the scales exist, the rater groups' ratings significantly differ for sentence pairs in which the original and the simplified sentences are not identical.</p><p>If at least one of the previous hypotheses can be disproved, the rating behaviour of the annotators will be analysed in more detail. Deviations from the hypothesised scores lead to the assumption that the raters understood the rating scales differently. To evaluate the extent of the misunderstanding, we compare the ratings per sentence pair, including sentence pairs with changes, of different rater groups, e.g., those preferring the highest or the middle value of the scale. For example, if a rater group rated the simplicity of no-change pairs of ASSET with 50 and not with the assumed score of 0, we have a closer look at their simplicity ratings on pairs with a change. If a rater group prefers 50 for no-change pairs, they most likely annotate the pairs with a change differently than the rater group preferring 0. Hence, it is hypothesised that the ratings of such rater groups significantly differ from each other.</p></div>
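The selection of no-change pairs described above can be sketched in a few lines of Python; the annotation records below are invented for illustration and are not taken from any of the six datasets:

```python
from collections import Counter

# Hypothetical annotation records: (original, simplified, dimension, score).
records = [
    ("The cat sat.", "The cat sat.", "simplicity", 0),
    ("The cat sat.", "The cat sat.", "simplicity", 50),
    ("A complex sentence.", "A simple one.", "simplicity", 80),
    ("The cat sat.", "The cat sat.", "meaning", 100),
]

def no_change_scores(records, dimension):
    """Tally the scores given to pairs where original == simplified."""
    return Counter(
        score for orig, simp, dim, score in records
        if orig == simp and dim == dimension
    )

print(no_change_scores(records, "simplicity"))
```

If annotators interpreted the scale consistently, such a tally would concentrate on a single value; a spread over, e.g., 0 and 50 is exactly the disagreement the hypotheses test for.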
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Results and Discussion</head><p>Each of the selected datasets contains some no-change pairs, which are rated by a different number of annotators. An overview of the number of no-change pairs and annotators of no-change pairs per dataset is provided in Table <ref type="table">2</ref>. For the ratings of ASSET, human-likert, and system-likert, we normalised the human judgements by their individual mean and standard deviation, following the description in <ref type="bibr" target="#b6">[7,</ref><ref type="bibr" target="#b13">14]</ref>. In the following, we analyse the raters' interpretation of the dimensions simplicity and meaning preservation to disprove or corroborate the hypotheses.</p></div>
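The per-annotator normalisation mentioned above (each judgement standardised by that annotator's own mean and standard deviation, following [7, 14]) can be sketched as follows; the worker ids and raw scores are invented for illustration:

```python
from statistics import mean, stdev

def normalise_by_annotator(ratings):
    """Z-score each annotator's ratings by their own mean and std deviation."""
    out = {}
    for annotator, scores in ratings.items():
        m, s = mean(scores), stdev(scores)
        out[annotator] = [(x - m) / s for x in scores]
    return out

# Hypothetical raw 0-100 judgements from two crowd workers.
raw = {"w1": [0, 50, 100], "w2": [40, 50, 60]}
print(normalise_by_annotator(raw))
```

Note how two workers with very different raw spreads map to the same normalised values; this is the point of the normalisation, which makes annotators with different uses of the scale range comparable.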
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 2</head><p>Overview of the size of the datasets. Sentences correspond to the number of different original sentences in the datasets, whereas sentence pairs correspond to the number of different simplification pairs, e.g., produced by different systems or humans. No-change pairs are pairs in which the original and the simplified sentence are exactly the same. An annotation record is a rated score on one of the evaluation dimensions of one of the sentence pairs by one rater. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>QATS HSplit PWKP test ASSET system-likert human-likert Fusion</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Simplicity Rating</head><p>In HSplit, the ratings of the experts are consistent and corroborate Hypothesis 1. All ratings of the 346 identical sentence pairs agree on the assumed neutral value of 0, except for one annotator in three of the overall 7840 annotation records (0.03%).</p><p>In Fusion, on average, only 6 of 338 no-change sentence pairs (1.88%) are not scored with the neutral value as assumed in Hypothesis 1. The overall average score of all no-change pairs' simplicity judgements of all three annotators is equal to -0.0026±0.05. Interestingly, deviations exist in both directions, i.e., closer to simpler and closer to more difficult.</p><p>In ASSET, the annotators do not agree in their ratings of the no-change pairs on the dimension of simplicity. For each pair, roughly half of the annotators decide on the minimum value, as hypothesised, and roughly the other half on the middle value. One annotator per no-change pair rates simplicity with the highest possible score. In contrast to Hypothesis 2, the simplicity ratings in ASSET are not always equal to the lowest element.</p></div>
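Summary statistics of the form reported above (mean ± standard deviation over all no-change judgements) can be reproduced with a short Python sketch; the ratings below are invented, and the choice of the population standard deviation is our assumption, not stated in the paper:

```python
from statistics import mean, pstdev

def summarise(scores):
    """Mean and population standard deviation of a list of ratings."""
    return mean(scores), pstdev(scores)

# Invented no-change simplicity ratings on a -2..+2 scale: mostly the
# neutral value 0, with two deviations in opposite directions.
scores = [0, 0, 0, 0, -1, 0, 0, 1, 0, 0]
m, s = summarise(scores)
print(f"{m:.4f} ± {s:.2f}")  # -> 0.0000 ± 0.45
```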
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 3</head><p>Overview of the simplicity ratings (normalised with their mean and standard deviation) of all annotators who rated more than one no-change pair of human-likert or system-likert. The first column contains an anonymised version of the worker ids (each worker is assigned an id based on the order of occurrence of their name in the dataset). The last two columns in both tables highlight whether the annotators rated a similar score for the pairs (same) or not (differ).</p><p>Similar to the annotators' behaviour in ASSET, in human-likert and system-likert, the annotators can be split into three rating groups: preferring 0, 50 or 100 for no-change pairs. Again, in contrast to Hypothesis 2, the simplicity ratings in human-likert and system-likert are not always equal to the lowest element. Further analysis is required to check whether these ratings were given by mistake or due to different scale interpretations (see subsection 4.3).</p><p>The results of these datasets show that some crowd workers and experts interpret the simplicity scale as hypothesised. In contrast, the number of points and whether the points range from negative to positive or are only positive seem to influence the interpretation of the scale: both datasets with a scale ranging between -2 and +2 achieved a higher consistency than those with a scale ranging between 0 and 100. The different interpretations might be due to different understandings of the middle point of the scale <ref type="bibr" target="#b8">[9]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Meaning Preservation Rating</head><p>For the dimension of meaning preservation, the human ratings in human-likert, system-likert and ASSET largely meet the values assumed in the hypotheses, i.e., close to 100. For all 5 identical pairs, more than 80% of the raters give a score in the maximum category (80 to 100). But for some of the sentence pairs, one of the annotators rates the meaning preservation either with a value between 0 and 19 or between 40 and 59. Hence, a small proportion also interpreted the scale differently than hypothesised in Hypothesis 3. In comparison to the simplicity rating scale, the meaning preservation scale seems easier to understand, which might be due to a clearer formulation of the scale item.</p><p>The annotators of HSplit again all agree on the same rating, here the maximum value, except for 8 out of 346 identical pairs (2.31%). Furthermore, in QATS all no-change pairs are rated with the highest value, i.e., "good". In Fusion, 15 of the 338 no-change pairs (4.43%) were rated with a different value than the highest value. The overall average score of all no-change pairs' meaning preservation judgements of all three annotators is equal to 4.98±0.12. Hence, Hypothesis 3 can also be confirmed for HSplit, QATS and Fusion.</p><p>In contrast, in PWKP test, the ratings are below the hypothesised values. Each of the annotators rated the no-change pairs with a score ranging on average from 2.275 to 2.525. 3 of the 5 annotators annotated half of the no-change pairs with the highest value, but another rater only selected it for 30% of the pairs. Hence, for PWKP test, Hypothesis 3 is disproved. However, it must be considered that the alignment of PWKP test was reproduced. Hence, the results of PWKP test must be interpreted with caution, because the observed effects might be due to misalignment.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Consistent Interpretations</head><p>Roughly half of the annotators in ASSET, human-likert and system-likert annotated simplicity with the lowest value and the other half with the middle value. As stated in Hypothesis 4, we analyse whether the annotators stick to their scale interpretation or not.</p><p>In system-likert and human-likert, 16 of 34 annotators rated more than one no-change pair on the simplicity dimension. 10 of the 16 annotators are consistent in their ratings (see Table <ref type="table">3</ref>): they rated the no-change pairs all either with a score between 0 and 19 or between 40 and 59. Looking closer at the ratings, 5 of the 10 raters decided on a score between 0 and 19 on all of their no-change pairs, as hypothesised in Hypothesis 2, and the other half on a score between 80 and 100. However, 6 of the 16 raters alternate between the lowest, middle or highest value; hence, they seem to have no clear scale interpretation.</p><p>In ASSET, 20 crowd workers annotated more than one no-change pair. 13 of them always annotated the same value for all simplicity ratings of their no-change pairs (see Table <ref type="table" target="#tab_3">4</ref>). Similar to system-likert and human-likert, the annotators are split into nearly equally sized groups preferring either the lowest or the highest value for simplicity. Overall, we can confirm that different simplicity scale interpretations occur in system-likert, human-likert and ASSET. The different understandings of the lowest value might be due to an unintended misinterpretation of the middle value <ref type="bibr" target="#b8">[9]</ref>.</p><p>To further investigate the different scale interpretations also on simplification pairs with a change, we divided the raters into groups based on their preferred score on the no-change pairs, i.e., preference-1 and preference-50. 
The groups are compared sentence-wise on the evaluation dimension of simplicity.</p><p>The different interpretations are also visible in the averages of the simplicity ratings of both groups. Comparing all sentence pairs with changes rated by both rater groups using a Mann-Whitney U test, the simplicity ratings differ significantly between both groups in system-likert and human-likert (U=252213.0, p≤0.01) and in ASSET (U=64127.0, p≤0.01). Hence, it seems that both groups interpret the simplicity scale differently but apply their respective interpretations to all rated pairs. Hypothesis 4 can be corroborated.</p></div>
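The group comparison above relies on the Mann-Whitney U test; in practice one would typically call scipy.stats.mannwhitneyu, but to make the statistic itself concrete, here is a small self-contained Python sketch with invented ratings (the group values are hypothetical, not taken from the datasets):

```python
def mann_whitney_u(a, b):
    """Mann-Whitney U statistic via rank sums, with average ranks for ties."""
    pooled = sorted(list(a) + list(b))
    # Assign each distinct value the average of the 1-based ranks it occupies.
    ranks = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        ranks[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    rank_sum_a = sum(ranks[x] for x in a)
    u_a = rank_sum_a - len(a) * (len(a) + 1) / 2
    u_b = len(a) * len(b) - u_a
    return min(u_a, u_b)

# Invented simplicity ratings of changed pairs from two rater groups.
group_low = [0, 5, 10, 12, 20]    # raters who used the lowest value
group_mid = [45, 50, 50, 55, 60]  # raters who used the middle value
print(mann_whitney_u(group_low, group_mid))  # U = 0.0: complete separation
```

A U near 0 indicates that one group's ratings almost entirely fall below the other's; a p-value would then be obtained from the U distribution (or its normal approximation), as done for the reported results.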
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusion and Future Work</head><p>Concerning the research questions asked, in the datasets analysed, human annotators (experts and crowd workers) mostly agree on one label, i.e., the highest value of the scale, in the judgements of meaning preservation. In contrast, the analysis has also shown that different scale interpretations exist for the evaluation dimension of simplicity in the datasets with crowdsourced human ratings on a continuous scale. Some raters prefer the lowest value and some the middle value of the scale to indicate the same level of simplicity in no-change pairs. However, the values are not randomly distributed: a clear distinction between raters who annotate the lowest or the neutral element on several no-change pairs is possible. This leads to the assumption that they did not rate the lowest or the middle element by mistake but understood the scale differently.</p><p>Following the analysis results, on the one hand, the interpretation of the simplicity scale is consistent when rated by experts or when using a neutral element for simplicity. On the other hand, crowd workers had different interpretations of the simplicity scale, i.e., either the lowest or the middle element of the scale indicates no change in simplicity. The scale and the annotation process could also be improved, e.g., by reformulating the definition or the scale endpoints; the crowd workers could become more certain by seeing more examples before the annotation, or one could rely only on (trained) experts.</p><p>In contrast, the expert ratings in HSplit and PWKP test, the crowd worker ratings in Fusion regarding all evaluation dimensions, and the ratings in ASSET regarding meaning preservation are congruent with the hypothesised values. Overall, a deeper analysis of the interpretation of human rating scales in text simplification is required. 
Therefore, a user study could be conducted in which several sentence pairs, with and without changes, would be rated on different scales or with different instructions by crowd workers and experts.</p><p>Not only the different interpretations of the scales by human raters but also the different implementations of the scales limit the comparability of human evaluations of text simplification. Hence, best practices, as published, e.g., for natural language generation <ref type="bibr" target="#b20">[21]</ref>, are in high demand for text simplification. We hope that this paper raises awareness of the problems in text simplification evaluation and kicks off a discussion of these challenges, e.g., training human annotators or showing them examples, formulating clear and precise statements or questions for the evaluation dimensions, choosing the best number of points on a scale (e.g., 0 to 100 or -2 to +2), and deciding between ratings by experts or crowd workers.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Overview of the evaluation dimensions, scales and raters (CW = crowd workers) and sources per dataset.</figDesc><table><row><cell>dataset</cell><cell>simplicity definition</cell><cell>simplicity scale</cell><cell>meaning preservation definition</cell><cell>meaning preservation scale</cell><cell>Source</cell><cell># raters</cell></row><row><cell>QATS</cell><cell>-</cell><cell>1 (bad), 2 (ok), 3 (good)</cell><cell>-</cell><cell>1 (bad), 2 (ok), 3 (good)</cell><cell>EventS, EncBrit, LSLight</cell><cell>-</cell></row><row><cell>HSplit</cell><cell>"Is the output simpler than the input?"</cell><cell>-2 to +2</cell><cell>Does the output preserve the meaning of the input?</cell><cell>1 to 5</cell><cell>TurkCorpus</cell><cell>3 experts</cell></row><row><cell>PWKP test</cell><cell>-</cell><cell>1 (no), 2 (maybe), 3 (yes)</cell><cell>Does the output add information, compared to the input? Does the output remove important information, compared to the input?</cell><cell>1 (no), 2 (maybe), 3 (yes)</cell><cell>PWKP</cell><cell>5 experts</cell></row><row><cell>ASSET</cell><cell>The simplified sentence is easier to understand than the original sentence.</cell><cell>0 ("strongly disagree") to 100 ("strongly agree")</cell><cell>The simplified sentence adequately expresses the meaning of the original, perhaps omitting the least important information.</cell><cell>0 ("strongly disagree") to 100 ("strongly agree")</cell><cell>TurkCorpus</cell><cell>ASSET: 15 CW &amp; HL+SL: 12-35 CW</cell></row><row><cell>Fusion</cell><cell>How much simpler is sentence 2 than sentence 1?</cell><cell>-2 (much less simple) to +2 (much simpler)</cell><cell>Sentence 2 preserves the meaning of sentence 1</cell><cell>1 to 5</cell><cell>Newsela</cell><cell>3 CW</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 4</head><label>4</label><figDesc>Overview of the simplicity ratings (normalised with their mean and standard deviation) of all annotators who rated more than one no-change pair of ASSET. The last two columns highlight whether the annotators rated a similar score for the pairs (same) or not (differ).</figDesc><table><row><cell>worker_id</cell><cell>67</cell><cell>90</cell><cell>143</cell><cell>200</cell><cell>311</cell><cell>same</cell><cell>differ</cell></row><row><cell>0</cell><cell>2.46</cell><cell>-</cell><cell>2.46</cell><cell>-</cell><cell>-</cell><cell>x</cell></row><row><cell>2</cell><cell cols="3">49.94 98.51 -</cell><cell>-</cell><cell>-</cell><cell>x</cell></row><row><cell>3</cell><cell cols="3">45.16 52.98 1.19</cell><cell cols="2">52.98 1.19</cell><cell>x</cell></row><row><cell>5</cell><cell>-</cell><cell>5.56</cell><cell>-</cell><cell>2.66</cell><cell>2.66</cell><cell>x</cell></row><row><cell>6</cell><cell>0.82</cell><cell>0.82</cell><cell>0.82</cell><cell>0.82</cell><cell>0.82</cell><cell>x</cell></row><row><cell>7</cell><cell>-</cell><cell cols="5">49.59 49.59 49.59 49.59 x</cell></row><row><cell>8</cell><cell>0.66</cell><cell>0.66</cell><cell>0.66</cell><cell>0.66</cell><cell>0.66</cell><cell>x</cell></row><row><cell>11</cell><cell>2.88</cell><cell>9.67</cell><cell cols="2">96.02 -</cell><cell>-</cell><cell>x</cell></row><row><cell>12</cell><cell cols="2">49.80 -</cell><cell>-</cell><cell cols="3">50.77 49.80 x</cell></row><row><cell>13</cell><cell>1.07</cell><cell>1.07</cell><cell>1.07</cell><cell>1.07</cell><cell>1.07</cell><cell>x</cell></row><row><cell>14</cell><cell>-</cell><cell cols="4">47.70 39.89 97.47 48.67</cell><cell>x</cell></row><row><cell>15</cell><cell cols="6">49.91 49.91 49.91 49.91 49.91 x</cell></row><row><cell>18</cell><cell cols="2">49.74 -</cell><cell cols="2">49.74 -</cell><cell>-</cell><cell>x</cell></row><row><cell>19</cell><cell>3.67</cell><cell>1.72</cell><cell>1.72</cell><cell>1.72</cell><cell>4.64</cell><cell>x</cell></row><row><cell>20</cell><cell cols="2">49.95 -</cell><cell cols="2">49.95 5.36</cell><cell>5.36</cell><cell>x</cell></row><row><cell>23</cell><cell>-</cell><cell cols="2">49.76 -</cell><cell cols="3">49.76 49.76 x</cell></row><row><cell>26</cell><cell>-</cell><cell>-</cell><cell>-</cell><cell>3.46</cell><cell>97.77</cell><cell>x</cell></row><row><cell>27</cell><cell>0.73</cell><cell>0.73</cell><cell>0.73</cell><cell>0.73</cell><cell>0.73</cell><cell>x</cell></row><row><cell>28</cell><cell cols="4">49.94 49.94 49.94 -</cell><cell>-</cell><cell>x</cell></row><row><cell>29</cell><cell>-</cell><cell>-</cell><cell>-</cell><cell cols="2">28.44 75.11</cell><cell>x</cell></row><row><cell cols="7">present. In system-likert and human-likert, the rater group preference-1 (n_raters=5, n_ratings=911, M=52.87±40.18) has an overall lower simplicity average than the rater group preference-50 (n_raters=5, n_ratings=634, M=63.77±33.88) on simplification pairs with a change. The same also applies to ASSET: preference-1 (n_raters=7, n_ratings=571, M=35.58±37.24), preference-50 (n_raters=6, n_ratings=292, M=44.43±33.22).</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">The data is available online https://qats2016.github.io/shared.html.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">The human judgements of HSplit are available online https://github.com/eliorsulem/simplification-acl2018.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">Due to a currently dead link to the system outputs of the sentence pairs, we instead copied the system outputs provided in EASSE<ref type="bibr" target="#b16">[17]</ref> in the given order. However, the sentence pairs of 2 system outputs could not be found. Hence, our version of the dataset contains only 500 sentence pairs. The human judgements are available online https://github.com/eliorsulem/SAMSA/blob/master/Human_evaluation_benchmark.ods. The original sentences and system outputs are available in EASSE https://github.com/feralvam/easse/tree/master/easse/resources/data.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">The human judgements of ASSET are available online https://github.com/facebookresearch/asset/tree/master/human_ratings.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4">The human judgements are available online http://dl.fbaipublicfiles.com/questeval/simplification_human_evaluations.tar.gz.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_5">The data will be available here https://cs.pomona.edu/~dkauchak/simplification/. Currently, it is only available upon request from the authors.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_6">In PWKP test, the meaning preservation score is based on the averaged reversed ratings of information gain and information loss (see<ref type="bibr" target="#b10">[11]</ref>).</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>This research is part of the PhD-program "Online Participation", supported by the North Rhine-Westphalian (German) funding scheme "Forschungskolleg". We thank the anonymous and non-anonymous reviewers for their valuable feedback during the preparation of this paper.</p></div>
			</div>


			<div type="funding">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>GLOBE: https://user.phil.hhu.de/~stodden (R. Stodden); ORCID: 0000-0002-7470-0961 (R. Stodden)</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Data-driven sentence simplification: Survey and benchmark</title>
		<author>
			<persName><forename type="first">F</forename><surname>Alva-Manchego</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Scarton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Specia</surname></persName>
		</author>
		<idno type="DOI">10.1162/coli_a_00370</idno>
		<ptr target="https://aclanthology.org/2020.cl-1.4" />
	</analytic>
	<monogr>
		<title level="j">Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">46</biblScope>
			<biblScope unit="page" from="135" to="187" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Optimizing statistical machine translation for text simplification</title>
		<author>
			<persName><forename type="first">W</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Napoles</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Pavlick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Callison-Burch</surname></persName>
		</author>
		<idno type="DOI">10.1162/tacl_a_00107</idno>
		<ptr target="https://aclanthology.org/Q16-1029" />
	</analytic>
	<monogr>
		<title level="j">Transactions of the Association for Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="page" from="401" to="415" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Bleu: a method for automatic evaluation of machine translation</title>
		<author>
			<persName><forename type="first">K</forename><surname>Papineni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Roukos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Ward</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-J</forename><surname>Zhu</surname></persName>
		</author>
		<idno type="DOI">10.3115/1073083.1073135</idno>
		<ptr target="https://aclanthology.org/P02-1040" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</title>
				<meeting>the 40th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics<address><addrLine>Philadelphia, Pennsylvania, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2002">2002</date>
			<biblScope unit="page" from="311" to="318" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Automatic text simplification for social good: Progress and challenges</title>
		<author>
			<persName><forename type="first">S</forename><surname>Štajner</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2021.findings-acl.233</idno>
		<ptr target="https://aclanthology.org/2021.findings-acl.233" />
	</analytic>
	<monogr>
		<title level="m">Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Association for Computational Linguistics</title>
				<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="2637" to="2652" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Controllable text simplification with explicit paraphrasing</title>
		<author>
			<persName><forename type="first">M</forename><surname>Maddela</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Alva-Manchego</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Xu</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2021.naacl-main.277</idno>
		<ptr target="https://aclanthology.org/2021.naacl-main.277" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics</title>
				<meeting>the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="3536" to="3553" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Simple and effective text simplification using semantic and neural methods</title>
		<author>
			<persName><forename type="first">E</forename><surname>Sulem</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Abend</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rappoport</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/P18-1016</idno>
		<ptr target="https://aclanthology.org/P18-1016" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics</title>
				<meeting>the 56th Annual Meeting of the Association for Computational Linguistics<address><addrLine>Melbourne, Australia</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="162" to="173" />
		</imprint>
	</monogr>
	<note>: Long Papers), Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">ASSET: A dataset for tuning and evaluation of sentence simplification models with multiple rewriting transformations</title>
		<author>
			<persName><forename type="first">F</forename><surname>Alva-Manchego</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Martin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bordes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Scarton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Sagot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Specia</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.acl-main.424</idno>
		<ptr target="https://aclanthology.org/2020.acl-main.424" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</title>
				<meeting>the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="4668" to="4679" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">A technique for the measurement of attitudes</title>
		<author>
			<persName><forename type="first">R</forename><surname>Likert</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Archives of Psychology</title>
		<imprint>
			<biblScope unit="volume">22</biblScope>
			<date type="published" when="1932">1932</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Evidence-based survey design: The use of a midpoint on the likert scale</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">Y Y</forename><surname>Chyung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Roberts</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Swanson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Hankinson</surname></persName>
		</author>
		<idno type="DOI">10.1002/pfi.21727</idno>
		<ptr target="https://onlinelibrary.wiley.com/doi/pdf/10.1002/pfi.21727" />
	</analytic>
	<monogr>
		<title level="j">Performance Improvement</title>
		<imprint>
			<biblScope unit="volume">56</biblScope>
			<biblScope unit="page" from="15" to="23" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Reference-less quality estimation of text simplification systems</title>
		<author>
			<persName><forename type="first">L</forename><surname>Martin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Humeau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P.-E</forename><surname>Mazaré</surname></persName>
		</author>
		<author>
			<persName><forename type="first">É</forename><surname>De La Clergerie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bordes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Sagot</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/W18-7005</idno>
		<ptr target="https://aclanthology.org/W18-7005" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 1st Workshop on Automatic Text Adaptation (ATA), Association for Computational Linguistics</title>
				<meeting>the 1st Workshop on Automatic Text Adaptation (ATA), Association for Computational Linguistics<address><addrLine>Tilburg, the Netherlands</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="29" to="38" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Semantic structural evaluation for text simplification</title>
		<author>
			<persName><forename type="first">E</forename><surname>Sulem</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Abend</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rappoport</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/N18-1063</idno>
		<ptr target="https://aclanthology.org/N18-1063" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</title>
		<title level="s">Long Papers</title>
		<meeting>the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies<address><addrLine>New Orleans, Louisiana</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="685" to="696" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Hybrid simplification using deep semantics and machine translation</title>
		<author>
			<persName><forename type="first">S</forename><surname>Narayan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Gardent</surname></persName>
		</author>
		<idno type="DOI">10.3115/v1/P14-1041</idno>
		<ptr target="https://aclanthology.org/P14-1041" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics</title>
				<meeting>the 52nd Annual Meeting of the Association for Computational Linguistics<address><addrLine>Baltimore, Maryland</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="435" to="445" />
		</imprint>
	</monogr>
	<note>: Long Papers), Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Neural sentence simplification with semantic dependency information</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wan</surname></persName>
		</author>
		<ptr target="https://ojs.aaai.org/index.php/AAAI/article/view/17578" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the AAAI Conference on Artificial Intelligence</title>
				<meeting>the AAAI Conference on Artificial Intelligence</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="volume">35</biblScope>
			<biblScope unit="page" from="13371" to="13379" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<author>
			<persName><forename type="first">T</forename><surname>Scialom</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Martin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Staiano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Éric</forename><surname>Villemonte De La Clergerie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Sagot</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2104.07560</idno>
		<title level="m">Rethinking automatic evaluation in sentence simplification</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Stuck in the middle: The use and interpretation of mid-points in items on questionnaires</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">T</forename><surname>Nadler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Weston</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">C</forename><surname>Voyles</surname></persName>
		</author>
		<idno type="DOI">10.1080/00221309.2014.994590</idno>
		<idno>pMID: 25832738</idno>
		<ptr target="https://doi.org/10.1080/00221309.2014.994590" />
	</analytic>
	<monogr>
		<title level="j">The Journal of General Psychology</title>
		<imprint>
			<biblScope unit="volume">142</biblScope>
			<biblScope unit="page" from="71" to="89" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Shared task on quality assessment for text simplification</title>
		<author>
			<persName><forename type="first">S</forename><surname>Štajner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Popović</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Saggion</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Specia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Fishel</surname></persName>
		</author>
		<ptr target="http://www.lrec-conf.org/proceedings/lrec2016/workshops/LREC2016Workshop-QATS_Proceedings.pdf#page=28" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Workshop on Quality Assessment for Text Simplification (QATS), Association for Computational Linguistics</title>
				<meeting>the Workshop on Quality Assessment for Text Simplification (QATS), Association for Computational Linguistics<address><addrLine>Portorož, Slovenia</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="22" to="37" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">EASSE: Easier automatic sentence simplification evaluation</title>
		<author>
			<persName><forename type="first">F</forename><surname>Alva-Manchego</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Martin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Scarton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Specia</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/D19-3009</idno>
		<ptr target="https://aclanthology.org/D19-3009" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations, Association for Computational Linguistics</title>
				<meeting>the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations, Association for Computational Linguistics<address><addrLine>Hong Kong, China</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="49" to="54" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<title level="m" type="main">Crowdsourcing Text Simplification with Sentence Fusion, Bachelor thesis</title>
		<author>
			<persName><forename type="first">M</forename><surname>Schwarzer</surname></persName>
		</author>
		<ptr target="https://cs.pomona.edu/classes/cs190/thesis_examples/Schwarzer.18.pdf" />
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
		<respStmt>
			<orgName>Pomona College</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Improving human text simplification with sentence fusion</title>
		<author>
			<persName><forename type="first">M</forename><surname>Schwarzer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Tanprasert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Kauchak</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2021.textgraphs-1.10" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Fifteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-15)</title>
				<meeting>the Fifteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-15)<address><addrLine>Mexico City, Mexico</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="106" to="114" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<title level="m" type="main">Automatic Sentence Simplification with Multiple Rewriting Transformations</title>
		<author>
			<persName><forename type="first">F</forename><surname>Alva-Manchego</surname></persName>
		</author>
		<ptr target="https://etheses.whiterose.ac.uk/28690/" />
		<imprint>
			<date type="published" when="2020">2020</date>
			<pubPlace>Sheffield, UK</pubPlace>
		</imprint>
		<respStmt>
			<orgName>University of Sheffield</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Phd thesis</note>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Best practices for the human evaluation of automatically generated text</title>
		<author>
			<persName><forename type="first">C</forename><surname>Van Der Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Gatt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Van Miltenburg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wubben</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Krahmer</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/W19-8643</idno>
		<ptr target="https://aclanthology.org/W19-8643" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 12th International Conference on Natural Language Generation, Association for Computational Linguistics</title>
				<meeting>the 12th International Conference on Natural Language Generation, Association for Computational Linguistics<address><addrLine>Tokyo, Japan</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="355" to="368" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
