<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings (CEUR-WS.org)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>of Text Simplification Operations in Evaluation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sophia Ananiadou</string-name>
          <email>sophia.ananiadou@manchester.ac.uk</email>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Laura Vásquez-Rodríguez</string-name>
          <email>laura.vasquezrodriguez@manchester.ac.uk</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matthew Shardlow</string-name>
          <email>M.Shardlow@mmu.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Piotr Przybyła</string-name>
          <email>piotr.przybyla@ipipan.waw.pl</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <kwd-group kwd-group-type="author">
          <kwd>Text Simplification</kwd>
          <kwd>Evaluation</kwd>
          <kwd>Edit-operations</kwd>
          <kwd>Simplification-operations</kwd>
          <kwd>Wikipedia-based datasets</kwd>
        </kwd-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computing and Mathematics, Manchester Metropolitan University</institution>
          ,
          <addr-line>Manchester</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Computer Science, Polish Academy of Sciences</institution>
          ,
          <addr-line>Warsaw</addr-line>
          ,
          <country country="PL">Poland</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>National Centre for Text Mining, The University of Manchester</institution>
          ,
          <addr-line>Manchester</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>The Alan Turing Institute</institution>
          ,
          <addr-line>London</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <fpage>57</fpage>
      <lpage>69</lpage>
      <abstract>
        <p>Research in Text Simplification (TS) has relied mostly on Wikipedia-based datasets and the SARI evaluation metric as the preferred means for creating and evaluating new simplification methods. Previous studies have pointed out the flaws of these evaluation resources, including incorrect alignment of simple/complex sentence pairs, sentences with no simplification, and a lack of variety in the simplification operations performed. However, there has been no further analysis of the impact of the original data distribution with regard to the type of simplification operations performed. In this paper, we set up a systematic benchmark of the most common TS datasets, basing our evaluation on different protocols for split selection (e.g., random or Monte Carlo selection). We perform an operation-based investigation, demonstrating in detail the limitations of existing simplification datasets. Further, we make recommendations for future standardised practices in the design, creation and evaluation of TS resources.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        TS methods transform complex text fragments into their simple variants, according to specific
operations and audiences. Non-native speakers can significantly benefit from the substitution
of complex words with simpler ones [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], while other audiences, such as people with aphasia, will benefit
more from short, simple sentences [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Although, categorising what complexity means for
diferent audiences is useful for evaluation, TS remains a challenging task to benchmark for
the following reasons: 1) The basic concept of simplicity (relying on language complexity) is
vague and hard to define quantitatively, which means that proficient language users usually
come up with diferent simplifications for a given sentence; 2) The possible usages of TS include
scenarios aimed at diferent target audiences (e.g., children, non-native readers, people with
      </p>
      <p>CEUR
Workshop
Proceedings
htp:/ceur-ws.org
aphasia or dyslexia) and domains (e.g., scientific texts, medical and legal documents), who
may require diferent simplification methods; and 3) Using a gold-standard for TS evaluation
requires human annotations which is time consuming and costly. This is usually avoided in a
way similar to other Natural Language Generation (NLG) tasks (e.g., machine translation) by
obtaining human annotated reference simplifications and evaluating systems based on their
similarities to these. Although this mechanism of evaluation allows an unlimited number of
systems and variants to be evaluated without further human efort, there are a number of factors
we have to consider when interpreting the results.</p>
      <p>
        Firstly, there may be many equally good simplifications for a given sentence, so comparison
to a single reference may penalise them unfairly. Although some TS datasets provide multiple
references, these cannot capture the rich diversity of possible simplifications. Secondly,
automatic similarity measures, such as BLEU [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], ROUGE [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] or SARI [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] have been previously
shown to have limitations (e.g., weak correlation with human judgement, dependence on the
quality of references, failure to capture task-dependent aspects such as simplicity), both in general tasks
[
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ] and in the context of TS [
        <xref ref-type="bibr" rid="ref5 ref8 ref9">5, 8, 9</xref>
        ]. Thirdly, how data is split between training and test data
can influence results. This is well-known in general [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], but has not attracted much attention
in TS. Finally, simplification operations may be unevenly distributed in TS datasets, affecting
the types of simplifications that a model learns to produce. Test splits may not reflect the same
simplification operations as in the training split from the same dataset.
      </p>
      <p>In this paper, we explore the impact of data splits (random and stratified) on English TS
datasets and set up a systematic benchmark on the existing datasets with altered distributions.
Our contributions are: 1) An operations-based analysis of TS datasets generated by stratification
algorithms; 2) A performance evaluation on experimental operation-based datasets; and 3)
Recommendations towards a standardised practice for building and evaluating new TS datasets.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Previous studies [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] have demonstrated the poor quality of the TS datasets used to establish the
state of the art. In particular, Wikipedia-based datasets [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] have incorrectly aligned
complex-simple sentence pairs (e.g., sentences with no semantic similarity to each other), and pairs
with no simplification or unbalanced simplification operations (e.g., datasets that perform
mostly deletions). In contrast, Newsela is a better-quality dataset [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] created by professional
translators; however, it includes a restrictive data agreement that prohibits publishing or sharing
the data, preventing research reproducibility and the sharing of splits or alignments. For these
reasons, we have not included Newsela in our study.
      </p>
      <p>
        Operations-based analysis of datasets is less common and is mostly performed for specific
scenarios. Alva-Manchego et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] performed a detailed text-feature-based analysis of the
ASSET dataset, including sentence splits, word deletions, insertions and reordering. Xu et al.
[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] analysed a sample of 200 sentences from the PWKP dataset [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] and classified them based
on whether or not they were simplifications. Genuine simplifications were classified into these
categories: deletions only, paraphrasing only, and a combination of both.
      </p>
      <p>
        Despite the efforts to improve these datasets in terms of the variety of simplification operations
performed and the amount of gold-standard references [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], the statistical distributions of these
datasets have not been explored. Recent work in the NLG domain has suggested how the
use of random splits can inflate model performance [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Further, there is a strong
argument for biased or adversarial splits [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], demonstrating that dataset distribution is
relevant in NLP. Neither of these has been considered for TS.
      </p>
      <p>
        Another important issue to consider is the unsuitability of TS evaluation metrics. Over the
past few years, the TS research community avoided using the BLEU evaluation metric [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] due to
its low correlation with simplicity. Moreover, when simplicity is directly compared with human
evaluation, it shows a negative correlation with meaning preservation [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], since building
simple sentences also involves removing information from the original ones. As of today, the
only available means of TS evaluation is SARI [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], which is not only limited as a measure of
‘simplicity gain’ in a lexical paraphrasing setting, but is also potentially flawed when multiple
rewrite operations are present [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. As mentioned above, the automatic evaluation of simplicity is
still an open question in the TS domain.
      </p>
      <p>
        For the development of TS systems, simplification operations can also play a fundamental
role, where they are explicitly identified or supplied to a TS model. The EditNTS system,
a neural programmer-interpreter model [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], detects and predicts ADD, DELETE and KEEP
simplification operations during training. Other systems, such as SeqLabel [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], perform
automatic identification of operations in the original parallel corpus, creating a new annotated
corpus for training the model.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Operation-based Simplification Experiments</title>
      <p>We conducted a systematic analysis of the key operations we identified across all commonly
available TS datasets. Initially, we analysed the number of deletions, insertions and replacements
in the different subsets of each TS dataset (i.e., train, development and test, when available).
We did not include the split operation, since our preliminary analysis using HSplit did not
show relevant changes from an edit-distance perspective. Next, we analysed the impact of
these operations on the output sentences, comparing how much a complex sentence changes
in the presence of these transformations (Section 3.1 and Section 3.2). Furthermore, we
analysed their distribution with regard to these simplification operations, proposing new
scenarios to benchmark on these new distributions (Section 3.3).</p>
      <sec id="sec-3-1">
        <title>3.1. Creating Operation-based Datasets</title>
        <p>
          We performed our analysis using common Wikipedia-based TS datasets, including: WikiSmall
and WikiLarge [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], TurkCorpus [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] and ASSET [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].<sup>1</sup> In particular, we focused on analysing
the original TS datasets and our proposed experimental datasets, which are modified versions
of WikiLarge and WikiSmall using different distribution methods. We have chosen these
resources since they provide test, development and training subsets, which are essential for
our distribution experiments. We analysed these datasets under the following classifications:
        </p>
        <p>
          <sup>1</sup>For evaluation, we limited our study to ASSET, since it shows a wider variety of operations based on its
edit-distance [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]. Also, due to space constraints, we have included the WikiSmall dataset analysis in the Appendix.
        </p>
        <p>
          Original distribution: we examined all subsets of the original TS datasets with no
modification, applying the metrics defined in Section 3.2. We quantified the distribution divergence
between subsets (test compared to train and test compared to development) by calculating the
Kullback-Leibler divergence (KL-divergence) [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] and Jensen-Shannon divergence (JSD) [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ]. As a result,
the WikiLarge dataset had a KL-divergence of 0.46 and a JSD of 0.41, confirming
that the split of this dataset is not truly random. This can be seen in detail in
the distribution of these subsets in Figure 1a. We also determined that there is a significant
number of sentences with no operations and sentences that changed 100% during the
simplification process. A post-hoc manual inspection of these cases showed
that they corresponded to inaccurate simplifications resulting from poor
alignments or noise. Given these results, we proposed additional distributions to improve the
spread of simplification operations in the WikiSmall and WikiLarge datasets.
        </p>
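        <p>The divergence computation above can be sketched as follows. This is a minimal illustration rather than our exact implementation; the ten-bucket histogram binning and the toy percent-of-change values are assumptions made for the example:</p>
        <preformat>
```python
import math

def kl(p, q):
    """Kullback-Leibler divergence (base 2) between two discrete distributions."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: a symmetrised, smoothed variant of KL."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def histogram(values, bins=10):
    """Normalised histogram of percent-of-change values in [0, 100]."""
    counts = [0] * bins
    for v in values:
        counts[min(int(v / (100 / bins)), bins - 1)] += 1
    total = sum(counts)
    return [c / total for c in counts]

# Toy percent-of-change values for two subsets (illustration only)
train = [0, 10, 20, 35, 40, 55, 60, 80, 90, 100]
test = [0, 5, 10, 15, 20, 25, 30, 35, 40, 45]

p, q = histogram(train), histogram(test)
divergence = jsd(p, q)  # 0 for identical distributions, up to 1 in base 2
```
        </preformat>
        <p>A JSD close to 0 would indicate near-identical subset distributions; the 0.41 measured for WikiLarge therefore points to a split that is far from random.</p>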
        <p>
          Random distribution: to create randomly distributed datasets, we merged all the subsets
from the original dataset into a single dataset, shuffled the data using NumPy [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] and recreated
the subsets, keeping their original sizes. We repeated this process using 5 different random
seeds (155, 324, 393, 728, 989). The seeds were randomly generated, except for 324,
which belongs to the original implementation of EditNTS and to the initial explorations in our
previous work [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]. In Figure 1, we can see a comparison of the original (Figure 1a) and the
random distribution (Figure 1c) for seed 324.<sup>2</sup>
        </p>
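        <p>The resplitting procedure can be sketched as below. The exact NumPy calls in the original scripts may differ; the Generator API and the toy corpus are assumptions for illustration:</p>
        <preformat>
```python
import numpy as np

def resplit(pairs, sizes, seed):
    """Merge all subsets, shuffle with a fixed seed, and rebuild the
    train/dev/test splits with the original subset sizes."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(pairs))
    shuffled = [pairs[i] for i in order]
    n_train, n_dev, n_test = sizes
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_dev],
            shuffled[n_train + n_dev:n_train + n_dev + n_test])

# Toy corpus of (complex, simple) pairs; the seeds follow the paper
corpus = [("complex-%d" % i, "simple-%d" % i) for i in range(10)]
splits = {seed: resplit(corpus, (6, 2, 2), seed) for seed in (155, 324, 393, 728, 989)}
```
        </preformat>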
        <p>Minimised poor-alignments distribution: we manually inspected the sentences at the
right-hand end of Figure 1a and observed that sentences close to 100% change correspond
to incorrect simplifications or alignments. Based on this, we created new datasets by removing
from 2% to 20% of the sentences with the worst alignments from the original dataset.
These splits were not randomised, to isolate the effect of removing the poor
alignments from TS datasets, and duplicates were removed. Figures 1d and 1e show the decrease in
the percentage of change in WikiLarge when using this heuristic, including a significantly greater
reduction of change in the test sets compared to the other subsets.</p>
        <p>Stratified distribution: sentences in TS datasets can be analysed not only by the changes
made from the original to the simplified sentence, but also by the operation type. Our main
goal in building new stratified splits is to have a similar number of operations of each type
(e.g., deletions, insertions and replacements) in each subset. Since a single sentence simplification
can involve multiple operations, it is difficult to achieve the desired distribution between subsets.
Among the algorithms evaluated, we selected a Monte Carlo algorithm<sup>3</sup> as our best approach
based on the resulting operations distribution. The original datasets were redistributed according to this
algorithm; the dataset subsets were rebuilt and then analysed, as for the random distribution.
We generated 500,000 random splits, searching for the one with the best standard deviation between
the amounts of DELETE, INSERT and REPLACE operations in each subset. At every 100,000 iterations, we
saved the 2 best candidates based on their standard deviation: one in the training set and
one in the development and test sets, minimising the difference in their individual standard
deviations. For WikiLarge, the most suitable splits were iterations 200,000 and 400,000, whereas
for WikiSmall these were iterations 300,000 and 500,000. We show the latter in the Appendix.</p>
        <p><sup>2</sup>To avoid multiple sources of randomness, we have modified the EditNTS system to guarantee that our model
results are deterministic. Our adaptations to the model can be found at our fork of the original GitHub repository:
https://github.com/lmvasque/EditNTS-eval.</p>
        <p><sup>3</sup>Our implementation of the Monte Carlo algorithm runs multiple iterations of the random distribution and
calculates the standard deviation of each attempt. It then chooses the distribution with the smallest standard
deviation as the best approach over n attempts.</p>
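        <p>The search described in footnote 3 can be sketched as below, with far fewer iterations and toy per-sentence operation counts; normalising counts by subset size is our assumption, since the subsets differ in size:</p>
        <preformat>
```python
import random
import statistics

def op_rates(subset):
    """Average number of each operation type (INSERT, DELETE, REPLACE)
    per sentence in a subset of per-sentence count triples."""
    return [sum(ops[i] for ops in subset) / len(subset) for i in range(3)]

def monte_carlo_split(samples, sizes, attempts, seed):
    """Try many random splits and keep the one whose operation rates
    are most evenly spread across train/dev/test (smallest summed
    standard deviation per operation type)."""
    rng = random.Random(seed)
    n_train, n_dev, n_test = sizes
    best, best_score = None, None
    for _ in range(attempts):
        shuffled = samples[:]
        rng.shuffle(shuffled)
        split = (shuffled[:n_train],
                 shuffled[n_train:n_train + n_dev],
                 shuffled[n_train + n_dev:n_train + n_dev + n_test])
        # For each operation type, how much its rate varies across subsets
        score = sum(statistics.stdev(rates)
                    for rates in zip(*(op_rates(s) for s in split)))
        if best is None or best_score > score:
            best, best_score = split, score
    return best, best_score

# Toy per-sentence (insert, delete, replace) counts
samples = [((i * 7) % 5, (i * 3) % 4, i % 3) for i in range(20)]
best, score = monte_carlo_split(samples, (12, 4, 4), attempts=200, seed=1)
```
        </preformat>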
        <p>
          Once the original and new experimental datasets were created and analysed, we evaluated
the effect they had on the performance of the EditNTS [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] model by measuring the change
in SARI score when training on the redistributed datasets. We adapted the original code with
some minor modifications to run in our setting, including: model randomisation with fixed
seeds, scripts for data preprocessing and the automation of test-set evaluation. We trained the
models on the original and the experimental subsets (poor-alignments reduction, random and
stratified distributions) using the same hyperparameters as the EditNTS model (batch size,
epochs, dropout, and learning rate). Next, we evaluated the performance of the newly trained
model by using ASSET as an external test subset. Finally, we manually inspected a sample of
the model outputs for all the proposed datasets. The adaptations for the EditNTS model, the
experimental subsets, the model outputs and the source code are documented via GitHub.<sup>4</sup>
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Quantifying Simplification Operations</title>
        <p>
          Wikipedia-based TS datasets were created collaboratively by volunteers, with the main goal
of supporting learning for non-native speakers. Apart from the rule of writing in Simple
English,<sup>5</sup> there were no specific guidelines on how to simplify text, such as the type and
amount of simplification allowed, or whether it should match the original Wikipedia article.
Except for specific studies done by Alva-Manchego et al. [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] and Xu et al. [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], there is no
accurate notion of which simplification operations are performed across TS datasets. These studies
are less comprehensive, since they investigate specific TS datasets or a limited set of TS operations.
Consequently, we analysed common TS datasets by using the following metrics:
        </p>
        <p>
          Simplification operations count: we quantified the percentage of edits required to
transform a complex sentence into a simple one (henceforth, edit-distance [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ]). To achieve this,
we calculated the edit-distance between two sentences by adapting the Wagner–Fischer
algorithm [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ] to determine changes at the token level (e.g., words) rather than the character level. This
method defines how many tokens in the complex sentence were changed in the simplified
output (e.g., 2 tokens deleted from one version to another count as 2 changes).
Prior to the analysis, sentences were lowercased. Values are expressed as a change
percentage, where 0% indicates sentences with no changes and 100% indicates completely
different sentences. In Figure 1 we show the edit-distance analysis for WikiLarge, for the original
splits (Figure 1a) and also, for the randomised (Figure 1c), poor-alignments-based (Figure 1d,
1e) and stratified splits (Figure 1f). Random and stratified experimental splits clearly show
a more even distribution of sentences between subsets, according to the amount of change
required to obtain a new simplification from a complex sentence. On the other hand, removing
poor alignments without a proper redistribution leaves the test sets with the majority of samples
having minimal or no change.
        </p>
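        <p>The token-level adaptation of Wagner-Fischer can be sketched as follows; the normalisation by the longer sentence length is our assumption for mapping edit counts onto a 0-100% change scale:</p>
        <preformat>
```python
def token_edit_distance(complex_sent, simple_sent):
    """Wagner-Fischer dynamic programming over tokens instead of
    characters: the minimum number of token insertions, deletions and
    substitutions needed to turn one sentence into the other."""
    a = complex_sent.lower().split()
    b = simple_sent.lower().split()
    # dp[i][j] = edits needed to turn a[:i] into b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i
    for j in range(len(b) + 1):
        dp[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # delete a token
                           dp[i][j - 1] + 1,          # insert a token
                           dp[i - 1][j - 1] + cost)   # keep or replace
    return dp[len(a)][len(b)]

def percent_change(complex_sent, simple_sent):
    """Edits as a percentage of the longer sentence: 0% means no
    change, 100% means a completely different sentence."""
    edits = token_edit_distance(complex_sent, simple_sent)
    longest = max(len(complex_sent.split()), len(simple_sent.split()))
    return 100.0 * edits / longest if longest else 0.0
```
        </preformat>
        <p>For example, replacing a single word in a six-token sentence yields a change of one sixth, i.e. roughly 16.7%.</p>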
        <p>Simplification operation types: after extracting the token-level edits made between two
sentences, we classified them into simplification operations: INSERT (one or more tokens added),
DELETE (one or more tokens removed) and REPLACE (one or more tokens substituted). These
three basic operations can be performed at the lexical level.<sup>6</sup> We show in Figure 1b the
simplification operation types for the WikiLarge dataset. These results show not only how unbalanced
these operations are between subsets but also the predominance of DELETE operations in the
development and training subsets of WikiLarge. The DELETE effect is also
noticeable when manually checking the outputs of the models: a majority of the simplification
operations performed deletions in the original sentence, rather than substitutions
or insertions. Furthermore, in Figure 3 we perform a more exhaustive comparison, analysing
the operation counts and their distribution across all our experiments.</p>
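        <p>The classification of token-level edits into operation types can be approximated with difflib's opcodes; this is a standard-library sketch rather than our Wagner-Fischer-based extraction, and the example sentences are invented:</p>
        <preformat>
```python
from difflib import SequenceMatcher

def operation_counts(complex_sent, simple_sent):
    """Count token-level INSERT, DELETE and REPLACE operations
    between a complex sentence and its simplification."""
    a = complex_sent.lower().split()
    b = simple_sent.lower().split()
    counts = {"INSERT": 0, "DELETE": 0, "REPLACE": 0}
    for tag, i1, i2, j1, j2 in SequenceMatcher(a=a, b=b).get_opcodes():
        if tag == "insert":
            counts["INSERT"] += j2 - j1                 # tokens added
        elif tag == "delete":
            counts["DELETE"] += i2 - i1                 # tokens removed
        elif tag == "replace":
            counts["REPLACE"] += max(i2 - i1, j2 - j1)  # tokens substituted
    return counts
```
        </preformat>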
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Evaluating Operation-based Datasets</title>
        <p>We performed an evaluation of the proposed datasets by retraining the EditNTS model with
the development and training subsets (for both WikiSmall and WikiLarge).<sup>7</sup> Once the models were
trained, we evaluated their performance using the SARI scores provided by the model evaluation
scripts. In our evaluation setting, the test subset of the ASSET dataset was used to test the
trained models. We also report the average results over all ASSET references for each
complex sentence, since our EditNTS-based implementation evaluated one
test reference at a time. Figure 2 shows a comparison of the SARI scores of all the models,
for both WikiLarge-based and WikiSmall-based models. We also include error bars (standard
deviation of the averaged ASSET observations) for all calculations. We observed that randomising
the distribution and reducing the poor alignments helped for the WikiSmall dataset, whereas
using the Monte Carlo algorithm and performing more substantial reductions in the distribution
contributed more for WikiLarge.</p>
        <p><sup>6</sup>We also merged DELETE and INSERT in cases where the same word or phrase is deleted and then inserted
again, calling this the MOVE operation. However, since the count of MOVE operations was insignificant, we
only report on the three main operations: INSERT, DELETE and REPLACE.</p>
        <p><sup>7</sup>We did not retrain the model using the traditional subsets (i.e., TurkCorpus, ASSET), since our objective was to
study the statistical weaknesses of the aforementioned datasets.</p>
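        <p>Because our setting scored one ASSET reference at a time, the reported figures are means with standard deviations over references. The aggregation can be sketched as below; sari_score here is a placeholder stub, and a real single-reference SARI implementation (e.g., from an evaluation toolkit) would be used instead:</p>
        <preformat>
```python
import statistics

def sari_score(source, output, reference):
    """Placeholder for a real single-reference SARI implementation;
    here a dummy token-overlap score so the sketch runs end to end."""
    out, ref = set(output.split()), set(reference.split())
    return 100.0 * len(out.intersection(ref)) / len(out.union(ref))

def average_sari(source, output, references):
    """Score one reference at a time and report the mean and standard
    deviation, mirroring our per-reference evaluation and error bars."""
    scores = [sari_score(source, output, ref) for ref in references]
    return statistics.mean(scores), statistics.stdev(scores)

refs = ["the cat sat on the mat", "a cat sat on a mat"]
mean, sd = average_sari("the feline sat upon the mat",
                        "the cat sat on the mat", refs)
```
        </preformat>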
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Discussion</title>
      <p>Firstly, we conclude from our analyses (Section 3) that the original TS datasets do not follow an
even distribution between subsets. We observe that the test, development and train subsets differ
when measured by the amount of change from the complex to the simple sentence.
Furthermore, our evaluation on the experimental subsets shows that the random distributions
produce significant variations in SARI score, even though their composition is similar. For
WikiSmall (Figure 2a), the results between random splits showed an increase of up to 7 percentage
points in SARI score, just from randomising the dataset composition and rebuilding the dataset. For
WikiLarge (Figure 2b), we found a similar effect with the Monte Carlo algorithm, which is a
split randomised over 200,000 iterations. The main difference is that this algorithm selects the
best score among all the generated random samples, rather than an arbitrary one. In this setting, the
variation in SARI score is about 5 percentage points. A difference in SARI score should be
interpreted as a measure of simplicity gain, which provides a relative comparison of correctness
between simplifications. However, it cannot be interpreted as indicating the best possible simplification,
since this evaluation metric fails to measure simplicity alone, as mentioned in Section 2.</p>
      <p>Secondly, the WikiSmall and WikiLarge datasets show a significant amount of noise and of sentences
that are not simplifications. Interestingly, we can see in Figure 1e that aggressively
removing 15% of the dataset considerably reduced the number of sentences with a percentage
of change higher than 40%. Despite this, the performance difference between the original model orig_100%
and its reduced version orig_85% was no more than 0.02% for both WikiSmall and
WikiLarge. For the model orig_80% (which has 20% of estimated noise reduction), we observed a
different scenario for WikiSmall: in comparison with the orig_100% model, the performance of
this dataset dropped by 2.6%. The WikiSmall dataset is significantly smaller than WikiLarge (3x), and
so such a reduction affects a higher number of real simplifications. In contrast, the WikiLarge model
orig_98% has minimal noise reduction, keeping its composition almost unchanged.
We presume that the decrease in model performance relates to having the same dataset
composition but fewer sentence samples (despite their lower quality).</p>
      <p>Thirdly, we discuss the dataset composition with respect to the operation counts (Figure 3).
Due to the large size of the training corpus, the counts in the train subset are similar for all the
datasets. However, that is not the case for the test and development subsets, where we noticed
meaningful differences. We observe a consistent decrease in all the operations for the models
where we removed the 'poor alignments'. Nevertheless, as mentioned earlier, orig_80% was
the only model that presented a decrease in performance, with a minimal number of edit
operations. By contrast, despite the similar distribution of operations between random
datasets, we did observe performance variations between these models. It is relevant to consider
that the test and development subsets are considerably smaller than the training subset (359 test / 992
dev / 296,402 train in the original sentence pairs). We presume that this could minimise the
effect of the new distributions on model performance. As future work, we would consider
changing the original subset sizes to further explore the effect of simplification operations.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Recommendations for TS datasets quality assessment</title>
      <p>Although evaluation metrics and model outputs alone do not provide enough
information about a dataset, we believe it is important to follow a structured procedure to assess the
quality of a dataset. To ensure interpretable methods for dataset quality assessment, we make
the following recommendations for TS dataset evaluation.</p>
      <p>Noisy alignment detection: current TS datasets are automatically aligned and are therefore
likely to contain incorrect or unaligned sentence pairs. We propose a heuristic in which
these inaccurate alignments are detected by quantifying the amount of change between the
complex sentence and the gold-reference ones. This can be implemented by sorting TS datasets
by edit-distance values so that sentences with a higher amount of change are grouped together,
providing a straightforward way to detect and remove noise. The ideal threshold at
which sentences are removed can be determined by visually inspecting these groups.</p>
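      <p>The grouping step of this heuristic can be sketched as follows; the band width and the toy scores are illustrative assumptions:</p>
      <preformat>
```python
def noise_report(pairs, change, band_width=10):
    """Group sentence pairs into percent-of-change bands so that the
    noisiest bands (near 100%) can be inspected and a removal
    threshold chosen by visual inspection."""
    bands = {}
    for pair in pairs:
        band = min(int(change[pair] // band_width) * band_width, 100)
        bands.setdefault(band, []).append(pair)
    for band in sorted(bands, reverse=True):
        print("%3d%%+ change: %d pairs" % (band, len(bands[band])))
    return bands

# Toy percent-of-change scores per sentence pair
change = {"p1": 5, "p2": 12, "p3": 98, "p4": 100, "p5": 55}
bands = noise_report(list(change), change)
```
      </preformat>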
      <p>Simplification operations distribution: depending on the audience, some simplification
operations can be more useful than others. Ideally, we would expect not only a variety of
simplification operations but also a similar distribution of operations between subsets, tailored
to a given simplification need. There are valid scenarios in which particular operations may
suffice (e.g., the REPLACE operation for complex word simplification for non-native speakers).
Other areas, such as news simplification, require more elaborate constructions, involving
simplifications not only at the lexical level but also at the discourse level (e.g., news for the general public
retargeted to school children in the Newsela dataset). Using token-based edit distance, we
can perform a global count of the simplification operations performed and an evaluation of their
distribution, as an aid for stratifying TS datasets as needed.</p>
      <p>Dataset stability: from our experiments, we have observed that dataset distribution
significantly affects TS model performance (measured as an increase or decrease in SARI score). Our
recommendation is to perform dataset randomisation with different random seeds to evaluate
the impact of data distribution on TS model performance. In addition, datasets of significant
size, such as WikiLarge, proved to be more stable in this setting (less variation in SARI score
between random seeds).</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions</title>
      <p>In this paper, we have performed a systematic analysis of the most common TS operations,
demonstrating the statistical limitations of English TS datasets. Our analysis can be reproduced
through our published scripts, which can also be used to analyse any other parallel TS dataset
for quality assessment. Moreover, we carried out a detailed evaluation of all of our experimental
settings, including distributions with poor-alignments reduction, randomisation and
stratification using the Monte Carlo algorithm. Finally, we have proposed a set of recommendations for
the creation of more reliable and standardised datasets, towards a better ecosystem of TS
evaluation resources.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>We would like to thank Nhung T.H. Nguyen for her valuable discussions and comments. Laura
Vásquez-Rodríguez’s work was funded by the Kilburn Scholarship from the University of
Manchester. Piotr Przybyła’s work was supported by the Polish National Agency for Academic
Exchange through a Polish Returns grant number PPN/PPO/2018/1/00006.</p>
    </sec>
    <sec id="sec-8">
      <title>A. Appendices</title>
      <p>A.1. Poor-alignments analysis</p>
      <p>[Figure: poor-alignment distributions under two randomisations: (a) Random (Seed 155); (b) Random (Seed 324).]</p>
    </sec>
  </body>
</article>