SIMPITIKI: a Simplification corpus for Italian

Sara Tonelli, Alessio Palmero Aprosio, Francesca Saltori
Fondazione Bruno Kessler
satonelli@fbk.eu, aprosio@fbk.eu, fsaltori@fbk.eu

Abstract

English. In this work, we analyse whether Wikipedia can be used to leverage simplification pairs instead of Simple Wikipedia, which has proved unreliable for assessing automatic simplification systems and is available only in English. We focus on sentence pairs in which the target sentence is the outcome of a Wikipedia edit marked as 'simplified', and manually annotate simplification phenomena following an existing scheme proposed for previous simplification corpora in Italian. The outcome of this work is the SIMPITIKI corpus, which we make freely available, with pairs of sentences extracted from Wikipedia edits and annotated with simplification types. The resource also contains another corpus with roughly the same number of simplifications, which was manually created by simplifying documents in the administrative domain.

Italiano. In questo lavoro si analizza la possibilità di utilizzare Wikipedia per selezionare coppie di frasi semplificate. Si propone questa soluzione come un'alternativa a Simple Wikipedia, che si è dimostrata inattendibile per studiare la semplificazione automatica ed è disponibile solo in inglese. Ci concentriamo soltanto su coppie di frasi in cui la frase target è il frutto di una modifica in Wikipedia indicata dagli editor come un caso di semplificazione. Tali coppie sono annotate manualmente secondo una classificazione delle tipologie di semplificazione già utilizzata in altri studi, e vengono rese liberamente disponibili nel corpus SIMPITIKI. La risorsa include anche un secondo corpus, contenente circa lo stesso numero di semplificazioni, realizzato intervenendo manualmente su alcuni documenti nel dominio amministrativo.

1 Introduction

In recent years, the shift of interest from rule-based to data-driven automated simplification has led to new research related to the creation of simplification corpora. These are parallel monolingual corpora, possibly aligned at sentence level, in which source and target are an original and a simplified version of the same sentence. Such corpora are needed both for training automatic simplification systems and for their evaluation.

For English, several approaches have been evaluated on the Parallel Wikipedia Simplification corpus (Zhu et al., 2010), containing around 108,000 automatically aligned sentence pairs from cross-linked articles between Simple and Normal English Wikipedia. Although this resource has boosted research on data-driven simplification, it has some major drawbacks: it is available only in English, the automatic alignment between Simple and Normal versions shows poor quality, and only around 50% of the sentence pairs correspond to real simplifications (according to a sample analysis performed on 200 pairs by Xu et al. (2015)).

In this work, we present a study aimed at assessing the possibility of leveraging a simplification corpus from Wikipedia in a semi-automated way, starting from Wikipedia edits. The study is inspired by the work presented in Woodsend and Lapata (2011), in which a set of parallel sentences was extracted from the Simple Wikipedia revision history. However, the present work differs in that: (i) we use the Italian Wikipedia revision history, demonstrating that the approach can be applied also to languages other than English and to Wikipedia edits that were not created for educational purposes, and (ii) we manually select the actual simplifications and label them following the annotation scheme already applied to other Italian corpora.
This makes possible the comparison with other resources for text simplification, and allows a seamless integration between different corpora.

Our methodology can be summarised as follows: we first select the edited sentence pairs which were commented as 'simplified' in Wikipedia edits, filtering out some specific simplification types (Section 3). Then, we manually check the extracted pairs and, in case of simplification, we annotate the types in compliance with the existing annotation scheme for Italian (Section 4). Finally, we analyse the annotated pairs and compare their characteristics with the other corpora available for Italian (Section 5).

2 Related work

Given the increasing relevance of large corpora with parallel simplification pairs, several efforts have been devoted to developing them. The most widely used corpus of this kind is the Parallel Wikipedia Simplification corpus (Zhu et al., 2010), which was automatically leveraged by extracting Normal and Simple Wikipedia sentence pairs. However, Xu et al. (2015) have recently presented a position paper in which they describe several shortcomings of this resource and recommend that the research community drop it as the standard benchmark for simplification. Alternative approaches, suggesting to further refine the selection of Normal – Simple parallel sentences to target specific phenomena like lexical simplification, have also been proposed (Yatskar et al., 2010), but have had limited application. The fact that Simple Wikipedia is not available for languages other than English has proved beneficial to the development of alternative resources: manually or automatically created corpora have been proposed, among others, for Brazilian Portuguese (Pereira et al., 2009), German (Klaper et al., 2013) and Spanish (Bott and Saggion, 2011).

3 Corpus extraction

The extraction of the pairs has been performed using the dump of the Italian Wikipedia available on a dedicated website.1 This huge XML file (more than 1 TB uncompressed) contains the history of every editing operation on every page of Wikipedia since it was first published. In particular, the Italian edition of Wikipedia contains 1.3M pages and is maintained by around 2,500 active editors, who have made more than 60M edits in 15 years of activity. The Italian language is spoken by 70M people, therefore there are on average 35 active editors per million speakers, giving the Italian Wikipedia the highest ratio among the 25 most spoken languages in the world.

We parse the 60M edits using a tool written in Java, developed internally and freely available on the SIMPITIKI website.2 The user who edits a Wikipedia page can insert a comment explaining why he or she has modified a particular part of the article. This comment is not mandatory, but it is included most of the time. We first select the edits whose comment includes words such as "semplificato" (simplified), "semplice" (simple), "semplificazione" (simplification), and similar. Then, the obtained set is further filtered by removing edits marked with technical tags such as "Template", "Protected page" and "New page". This eliminates, for instance, simplifications involving the page template rather than the textual content. The text in Wikipedia pages is written in the Wiki Markup Language, therefore it needs to be cleaned; we use the Bliki engine3 for this task. Finally, the obtained list of cleaned text passages is parsed with the Diff Match and Patch library,4 which identifies the parts of each article where the text was modified. With this process, we obtain a list of 4,356 sentence pairs, where the differences between source and target sentence are marked with deletion and insertion tags (see Figure 1).
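The selection and diffing steps above can be sketched in a few lines. The following is a minimal illustration, not the authors' tool (which is written in Java and relies on the Bliki engine and the Diff Match and Patch library): `is_candidate` and `diff_pair` are hypothetical helper names, and Python's difflib stands in for Diff Match and Patch.

```python
import difflib
import re

# Keywords used to spot candidate simplifications in edit comments
# (the paper lists "semplificato", "semplice", "semplificazione" and similar).
SIMPLIFICATION_KEYWORDS = re.compile(r"semplific|semplice", re.IGNORECASE)

# Technical tags whose presence means the edit touched the page template
# or page status rather than the textual content.
EXCLUDED_TAGS = {"Template", "Protected page", "New page"}


def is_candidate(comment: str, tags: set) -> bool:
    """Keep a revision only if its comment mentions simplification
    and it carries none of the excluded technical tags."""
    if not comment or not SIMPLIFICATION_KEYWORDS.search(comment):
        return False
    return not (tags & EXCLUDED_TAGS)


def diff_pair(source: str, target: str) -> str:
    """Mark the differences between source and target sentence with
    deletion/insertion tags, analogous to the Diff Match and Patch output."""
    parts = []
    matcher = difflib.SequenceMatcher(a=source, b=target, autojunk=False)
    for op, a0, a1, b0, b1 in matcher.get_opcodes():
        if op == "equal":
            parts.append(source[a0:a1])
        else:
            if a1 > a0:
                parts.append(f"<del>{source[a0:a1]}</del>")
            if b1 > b0:
                parts.append(f"<ins>{target[b0:b1]}</ins>")
    return "".join(parts)
```

For instance, diff_pair("abc def", "abc xyz") yields "abc <del>def</del><ins>xyz</ins>", i.e. a pair in which the differing segments are explicitly tagged, as in Figure 1.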
1 https://dumps.wikimedia.org/
2 https://github.com/dhfbk/simpitiki
3 http://bit.ly/bliki
4 http://bit.ly/diffmatchpatch

4 Corpus annotation

As for Italian, the only available corpus containing parallel pairs of simplified sentences is the one presented in Brunato et al. (2015). We borrow from this study the annotation scheme for our corpus, so that we can make a comparison between the two resources. We include in the comparison also another novel corpus, made of manually simplified sentences in the administrative domain, which we release together with the Wikipedia-based one.

We manually annotate pairs of sentences through a web interface developed for the purpose and freely available for download.2 Differently from corpora specifically created for text simplification, in which modifications are almost always simplifications, annotating Wikipedia edits is challenging because the source sentence may undergo several modifications, being partly simplifications and partly other types of changes. Therefore, the interface includes the possibility to select only the text segments in the source and in the target sentence that correspond to simplification pairs, and to assign a label only to these specific segments. It also gives the possibility to skip the pair if it does not contain any simplification.

A screenshot of the annotation tool is displayed in Figure 1. On the left, the source sentence(s) are reported, with the modified parts marked in red (as given by the Diff Match and Patch library). On the right, the target sentence(s) are displayed, with segments marked in green to show which parts were introduced during editing. A tickbox next to each red/green segment can be selected to align the source and target segments that correspond to a modification. The annotation interface provides the possibility to choose one of the simplification types proposed in a dropdown menu ('Conferma'), or to skip the pair ('Vai Avanti'). The second option is used to mark the sentences where a modification does not correspond to a proper simplification. For example, the last edit shown in Figure 1 reports in the original version 'Contando esclusivamente sulla capacità del mare', which was modified into 'Contando soprattutto sulla capacità del mare'. Since this change affects the meaning of the sentence, turning exclusively into mainly, but not its readability, the pair was not annotated.

Figure 1: Annotation interface used to mark simplification phenomena in the SIMPITIKI corpus.

In order to develop a corpus which is compliant with the annotation scheme already used in previous works on simplification, we followed the simplification types described in Brunato et al. (2015). The tagset is reported in Table 1 and comprises 6 main classes (Split, Merge, Reordering, Insert, Delete and Transformation) and some subclasses to better specify the Insert, Delete and Transformation operations. The labels are available in the dropdown menu of the annotation interface and can be used to tag selected pairs of sentences.

Class            Subclass
Split            -
Merge            -
Reordering       -
Insert           Verb
Insert           Subject
Insert           Other
Delete           Verb
Delete           Subject
Delete           Other
Transformation   Lexical substitution (word)
Transformation   Lexical substitution (phrase)
Transformation   Anaphoric replacement
Transformation   Verb to Noun (nominalization)
Transformation   Noun to Verb
Transformation   Verbal voice
Transformation   Verbal features

Table 1: Simplification classes and subclasses. For details see Brunato et al. (2015).

5 Corpus analysis

So far, annotators have viewed 2,671 sentence pairs, 2,326 of which were skipped because the target sentence was not a simplified version of the source one. 345 sentence pairs with 575 annotations are currently part of the SIMPITIKI corpus, and all phenomena presented in the annotation scheme proposed by Brunato et al. (2015) are currently covered.

As a comparison, we also analyse the content of the annotated corpora described in Brunato et al. (2015), which represent the only existing corpora for Italian simplification. These include the Terence corpus of children's stories, which was specifically created to address the needs of poor comprehenders and contains 1,036 parallel sentence pairs, and the Teacher corpus, a set of documents simplified by teachers for educational purposes, containing 357 sentence pairs.

Class            Subclass                               # wiki   # PA   Total
Split            -                                          20     18      38
Merge            -                                          22      0      22
Reordering       -                                          14     20      34
Insert           Verb                                       11      5      16
Insert           Subject                                     5      1       6
Insert           Other                                      58     21      79
Delete           Verb                                       12      1      13
Delete           Subject                                    17      1      18
Delete           Other                                     146     31     177
Transformation   Lexical Substitution (word level)          96    253     349
Transformation   Lexical Substitution (phrase level)       143    184     327
Transformation   Anaphoric replacement                      14      3      17
Transformation   Noun to Verb                                3     32      35
Transformation   Verb to Noun (nominalization)               2      0       2
Transformation   Verbal Voice                                2      1       3
Transformation   Verbal Features                            10     20      30
Total                                                      575    591    1166

Table 2: Number of simplification phenomena annotated in the Wikipedia-based and the public administration (PA) corpus.
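The counts in Table 2 can be double-checked programmatically. The following minimal sketch (figures copied from the table; the data structure and variable names are ours, not part of the released resource) recomputes the column totals and the share of Transformation annotations per corpus:

```python
# Annotation counts copied from Table 2: (class, subclass) -> (wiki, PA).
COUNTS = {
    ("Split", None): (20, 18),
    ("Merge", None): (22, 0),
    ("Reordering", None): (14, 20),
    ("Insert", "Verb"): (11, 5),
    ("Insert", "Subject"): (5, 1),
    ("Insert", "Other"): (58, 21),
    ("Delete", "Verb"): (12, 1),
    ("Delete", "Subject"): (17, 1),
    ("Delete", "Other"): (146, 31),
    ("Transformation", "Lexical Substitution (word level)"): (96, 253),
    ("Transformation", "Lexical Substitution (phrase level)"): (143, 184),
    ("Transformation", "Anaphoric replacement"): (14, 3),
    ("Transformation", "Noun to Verb"): (3, 32),
    ("Transformation", "Verb to Noun (nominalization)"): (2, 0),
    ("Transformation", "Verbal Voice"): (2, 1),
    ("Transformation", "Verbal Features"): (10, 20),
}

# Column totals, which reproduce the last row of Table 2.
wiki_total = sum(w for w, _ in COUNTS.values())  # 575
pa_total = sum(p for _, p in COUNTS.values())    # 591

# Transformation annotations per corpus.
transf_wiki = sum(w for (c, _), (w, _) in COUNTS.items() if c == "Transformation")  # 270
transf_pa = sum(p for (c, _), (_, p) in COUNTS.items() if c == "Transformation")    # 493
```

Here transf_pa / pa_total is roughly 0.83, versus roughly 0.47 for transf_wiki / wiki_total, consistent with the prevalence of lexical substitution in the PA data discussed below.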
Besides, we include in the comparison also another corpus, which we manually created by simplifying documents issued by the Trento Municipality to regulate building permits and kindergarten admittance. This corpus was simplified following the instructions in Brunato et al. (2015), but pertains to a different domain, i.e. public administration (PA). The Wikipedia-based and the PA corpus have a comparable size (575 vs. 591 annotations), but the simplification phenomena have a different frequency, as shown in Table 2.

While the Terence corpus contains on average 2.1 annotated phenomena per sentence pair, Teacher 2.8 and the PA corpus 2.9, the Wikipedia-based corpus includes only 1.6 simplifications per parallel pair. As expected, corpora that were explicitly created for simplification tend to have a higher concentration of simplification phenomena than corpora developed in less controlled settings.

In Fig. 2 we compare the distribution of the different simplification types across the four corpora. The graph shows that the same phenomena, such as subject deletion, nominalizations and transfer of verbal voice, tend to be rare across the four datasets. Similarly, the three top-frequent simplification types, i.e. delete-other, word transformation and phrase transformation, are the same across the four datasets. However, in the Wikipedia-based corpus, word transformation is less frequent than in the other document types, while phrase transformation is much more present. This may show that the 'controlled' setting in which the Terence and the Teacher corpora were created may lead educators to put more emphasis on word-based transformations to teach synonyms, while in a more 'ecological' setting like Wikipedia the performed simplifications are not guided or constrained, and phrase-based transformations may sound more natural.

As for the PA documents, transformation phenomena are probably very frequent because of the technical language characterised by domain-specific words, which tend to be replaced by more common ones during manual simplification. In this corpus, noun-to-verb transformations are particularly frequent, since nominalizations are typical phenomena of the administrative language affecting its readability (Cortelazzo and Pellegrino, 2003).

Figure 2: Distribution of the simplification phenomena covered in the Terence, Teacher, Wikipedia-based and Public Administration corpora.

As for the non-simplifications discarded during the creation of the Wikipedia-based corpus, they include generalizations, specifications, entailments, deletions, edits changing the meaning, error corrections, capitalizations, etc. (see some examples in Table 3). These types of modifications are very important because they may represent negative examples for training machine learning systems that recognize simplification pairs.

1. Lo psicodramma è stato il precursore di tutte le forme di psicoterapia di gruppo
2. Lo psicodramma è in relazione con altre forme di psicoterapia di gruppo

1. Partigiani non comunisti e giornalisti democratici furono uccisi per il loro coraggio
2. Partigiani non comunisti e giornalisti furono uccisi per il loro coraggio

1. Il dispositivo di memoria di massa utilizza memoria allo stato solido, ovvero basata su un semiconduttore
2. Il dispositivo di memoria di massa basata su semiconduttore utilizza memoria allo stato solido

Table 3: Examples of parallel pairs which were not annotated as simplifications.

6 Conclusions and Future work

We presented a study aimed at the extraction and annotation of a corpus for Italian text simplification based on Wikipedia. The work has highlighted the challenges and the advantages related to the use of Wikipedia edits. Our goal is to propose this resource as a testbed for the evaluation of Italian simplification systems, as an alternative to other existing corpora created in a more 'controlled' setting. The corpus is made available to the research community together with the tools used to create it. The SIMPITIKI resource also contains a second corpus, of comparable size, which was created by manually simplifying a set of documents in the administrative domain. This allows cross-domain comparisons of simplification phenomena.

In the future, this work can be extended in several directions. We plan to use the simplification pairs in this corpus to train a classifier with the goal of distinguishing between simplified and non-simplified pairs. This could extend the gold standard with a larger set of "silver" data by labelling all the remaining candidate pairs extracted from Wikipedia. Besides, the SIMPITIKI methodology is currently being used to create a similar corpus for Spanish, using the same annotation interface. The outcome of this effort will allow multilingual studies on simplification.

Finally, we plan to evaluate the ERNESTA system for Italian simplification (Barlacchi and Tonelli, 2013) using this corpus. Specifically, since different simplification phenomena are annotated, it would be interesting to perform a separate evaluation on each class, as suggested in Xu et al. (2015).

Acknowledgments

The research leading to this paper was partially supported by the EU Horizon 2020 Programme via the SIMPATICO Project (H2020-EURO-6-2015, n. 692819).

References

Gianni Barlacchi and Sara Tonelli. 2013. ERNESTA: A Sentence Simplification Tool for Children's Stories in Italian. In Alexander Gelbukh, editor, Computational Linguistics and Intelligent Text Processing: 14th International Conference, CICLing 2013, Samos, Greece, March 24-30, 2013, Proceedings, Part II, pages 476–487, Berlin, Heidelberg. Springer Berlin Heidelberg.

Stefan Bott and Horacio Saggion. 2011. An unsupervised alignment algorithm for text simplification corpus construction. In Proceedings of the Workshop on Monolingual Text-To-Text Generation, MTTG '11, pages 20–26, Stroudsburg, PA, USA. Association for Computational Linguistics.

Dominique Brunato, Felice Dell'Orletta, Giulia Venturi, and Simonetta Montemagni. 2015. Design and Annotation of the First Italian Corpus for Text Simplification. In Proceedings of The 9th Linguistic Annotation Workshop, pages 31–41, Denver, Colorado, USA, June. Association for Computational Linguistics.

M. Cortelazzo and F. Pellegrino. 2003. Guida alla scrittura istituzionale. Laterza.

David Klaper, Sarah Ebling, and Martin Volk. 2013. Building a German/Simple German Parallel Corpus for Automatic Text Simplification. In Proceedings of the 2nd Workshop on Predicting and Improving Text Readability for Target Reader Populations, pages 11–19, Sofia, Bulgaria.

Tiago F. Pereira, Lucia Specia, Thiago A. S. Pardo, Caroline Gasperin, and Sandra M. Aluisio. 2009. Building a Brazilian Portuguese parallel corpus of original and simplified texts. In 10th Conference on Intelligent Text Processing and Computational Linguistics, pages 59–70, Mexico City.

Kristian Woodsend and Mirella Lapata. 2011. Learning to simplify sentences with quasi-synchronous grammar and integer programming. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 409–420, Edinburgh, Scotland, UK, July. Association for Computational Linguistics.

Wei Xu, Chris Callison-Burch, and Courtney Napoles. 2015. Problems in current text simplification research: New data can help. Transactions of the Association for Computational Linguistics, 3:283–297.

Mark Yatskar, Bo Pang, Cristian Danescu-Niculescu-Mizil, and Lillian Lee. 2010. For the Sake of Simplicity: Unsupervised Extraction of Lexical Simplifications from Wikipedia. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT '10, pages 365–368, Stroudsburg, PA, USA. Association for Computational Linguistics.

Zhemin Zhu, Delphine Bernhard, and Iryna Gurevych. 2010. A monolingual tree-based translation model for sentence simplification. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 1353–1361, Beijing, China, August. Coling 2010 Organizing Committee.