SIMPITIKI: a Simplification corpus for Italian

Sara Tonelli, Alessio Palmero Aprosio, Francesca Saltori
Fondazione Bruno Kessler
satonelli@fbk.eu, aprosio@fbk.eu, fsaltori@fbk.eu

Abstract

English. In this work, we analyse whether Wikipedia can be used to leverage simplification pairs instead of Simple Wikipedia, which has proved unreliable for assessing automatic simplification systems and is available only in English. We focus on sentence pairs in which the target sentence is the outcome of a Wikipedia edit marked as 'simplified', and manually annotate simplification phenomena following an existing scheme proposed for previous simplification corpora in Italian. The outcome of this work is the SIMPITIKI corpus, which we make freely available, with pairs of sentences extracted from Wikipedia edits and annotated with simplification types. The resource also contains another corpus with roughly the same number of simplifications, which was manually created by simplifying documents in the administrative domain.

Italiano. In questo lavoro si analizza la possibilità di utilizzare Wikipedia per selezionare coppie di frasi semplificate. Si propone questa soluzione come un'alternativa a Simple Wikipedia, che si è dimostrata inattendibile per studiare la semplificazione automatica ed è disponibile solo in inglese. Ci concentriamo soltanto su coppie di frasi in cui la frase target è il frutto di una modifica in Wikipedia indicata dagli editor come un caso di semplificazione. Tali coppie sono annotate manualmente secondo una classificazione delle tipologie di semplificazione già utilizzata in altri studi, e vengono rese liberamente disponibili nel corpus SIMPITIKI. La risorsa include anche un secondo corpus, contenente circa lo stesso numero di semplificazioni, realizzato intervenendo manualmente su alcuni documenti nel dominio amministrativo.

1 Introduction

In recent years, the shift of interest from rule-based to data-driven automated simplification has led to new research related to the creation of simplification corpora. These are parallel monolingual corpora, possibly aligned at sentence level, in which source and target are an original and a simplified version of the same sentence. Such corpora are needed both for training automatic simplification systems and for their evaluation.

For English, several approaches have been evaluated on the Parallel Wikipedia Simplification corpus (Zhu et al., 2010), containing around 108,000 automatically aligned sentence pairs from cross-linked articles between Simple and Normal English Wikipedia. Although this resource has boosted research on data-driven simplification, it has some major drawbacks: it is available only in English, the automatic alignment between Simple and Normal versions shows poor quality, and only around 50% of the sentence pairs correspond to real simplifications (according to a sample analysis performed on 200 pairs by Xu et al. (2015)).

In this work, we present a study aimed at assessing the possibility of leveraging a simplification corpus from Wikipedia in a semi-automated way, starting from Wikipedia edits. The study is inspired by the work presented in Woodsend and Lapata (2011), in which a set of parallel sentences was extracted from the Simple Wikipedia revision history. However, the present work differs in that: (i) we use the Italian Wikipedia revision history, demonstrating that the approach can be applied also to languages other than English and to Wikipedia edits that were not created for educational purposes, and (ii) we manually select the actual simplifications and label them following the annotation scheme already applied to other Italian corpora.
This makes possible the comparison with other resources for text simplification, and allows a seamless integration between different corpora.

Our methodology can be summarised as follows: we first select the edited sentence pairs which were commented as 'simplified' in Wikipedia edits, filtering out some specific simplification types (Section 3). Then, we manually check the extracted pairs and, in case of simplification, we annotate the types in compliance with the existing annotation scheme for Italian (Section 4). Finally, we analyse the annotated pairs and compare their characteristics with the other corpora available for Italian (Section 5).

2 Related work

Given the increasing relevance of large corpora with parallel simplification pairs, several efforts have been devoted to developing them. The most widely used corpus of this kind is the Parallel Wikipedia Simplification corpus (Zhu et al., 2010), which was automatically leveraged by extracting Normal and Simple Wikipedia sentence pairs. However, Xu et al. (2015) have recently presented a position paper in which they describe several shortcomings of this resource and recommend that the research community drop it as the standard benchmark for simplification. Alternative approaches, suggesting to further refine the selection of Normal – Simple parallel sentences to target specific phenomena like lexical simplification, have also been proposed (Yatskar et al., 2010), but have had limited application. The fact that Simple Wikipedia is not available for languages other than English has proved beneficial to the development of alternative resources: manually or automatically created corpora have been proposed, among others, for Brazilian Portuguese (Pereira et al., 2009), German (Klaper et al., 2013) and Spanish (Bott and Saggion, 2011).

3 Corpus extraction

The extraction of the pairs has been performed using the dump of the Italian Wikipedia available on a dedicated website.1 This huge XML file (more than 1 TB uncompressed) contains the history of every editing operation on every page of Wikipedia since it was first published. In particular, the Italian edition of Wikipedia contains 1.3M pages and is maintained by around 2,500 active editors, who have made more than 60M edits in 15 years of activity. The Italian language is spoken by 70M people, therefore there are on average 35 active editors per million speakers, giving the Italian Wikipedia the highest ratio among the 25 most spoken languages in the world.

We parse the 60M edits using a tool written in Java, developed internally and freely available on the SIMPITIKI website.2 The user who edits a Wikipedia page can insert a comment explaining why he or she has modified a particular part of the article. This comment is not mandatory, but it is included most of the time. We first select the edits whose comment includes words such as "semplificato" (simplified), "semplice" (simple), "semplificazione" (simplification), and similar. Then, the obtained set is further filtered by removing edits marked with technical tags such as "Template", "Protected page" and "New page". This eliminates, for instance, simplifications involving the page template rather than the textual content. The text in Wikipedia pages is written in the Wiki Markup Language, therefore it needs to be cleaned; we use the Bliki engine3 for this task. Finally, the obtained list of cleaned text passages is parsed with the Diff Match and Patch library,4 which identifies the parts of each article where the text was modified. With this process, we obtain a list of 4,356 sentence pairs, where the differences between source and target sentence are marked with deletion and insertion tags (see Figure 1).
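The selection and diffing steps above can be sketched in a few lines. The following is a minimal illustration, not the authors' tool (which is written in Java and relies on the Bliki engine and the Diff Match and Patch library): `is_candidate` and `diff_pair` are hypothetical helper names, and Python's difflib stands in for Diff Match and Patch.

```python
import difflib
import re

# Keywords used to spot candidate simplifications in edit comments
# (the paper lists "semplificato", "semplice", "semplificazione" and similar).
SIMPLIFICATION_KEYWORDS = re.compile(r"semplific|semplice", re.IGNORECASE)

# Technical tags whose presence means the edit touched the page template
# or page status rather than the textual content.
EXCLUDED_TAGS = {"Template", "Protected page", "New page"}


def is_candidate(comment: str, tags: set) -> bool:
    """Keep a revision only if its comment mentions simplification
    and it carries none of the excluded technical tags."""
    if not comment or not SIMPLIFICATION_KEYWORDS.search(comment):
        return False
    return not (tags & EXCLUDED_TAGS)


def diff_pair(source: str, target: str) -> str:
    """Mark the differences between source and target sentence with
    deletion/insertion tags, analogous to the Diff Match and Patch output."""
    parts = []
    matcher = difflib.SequenceMatcher(a=source, b=target, autojunk=False)
    for op, a0, a1, b0, b1 in matcher.get_opcodes():
        if op == "equal":
            parts.append(source[a0:a1])
        else:
            if a1 > a0:
                parts.append(f"<del>{source[a0:a1]}</del>")
            if b1 > b0:
                parts.append(f"<ins>{target[b0:b1]}</ins>")
    return "".join(parts)
```

For instance, diff_pair("abc def", "abc xyz") yields "abc <del>def</del><ins>xyz</ins>", i.e. a pair in which the differing segments are explicitly tagged, as in Figure 1.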
1 https://dumps.wikimedia.org/
2 https://github.com/dhfbk/simpitiki
3 http://bit.ly/bliki
4 http://bit.ly/diffmatchpatch

4 Corpus annotation

As for Italian, the only available corpus containing parallel pairs of simplified sentences is the one presented in Brunato et al. (2015). We borrow from this study the annotation scheme for our corpus, so that we can make a comparison between the two resources. We include in the comparison also another novel corpus, made of manually simplified sentences in the administrative domain, which we release together with the Wikipedia-based one.

We manually annotate pairs of sentences through a web interface developed for the purpose and freely available for download.2 Differently from corpora specifically created for text simplification, in which modifications are almost always simplifications, annotating Wikipedia edits is challenging because the source sentence may undergo several modifications, being partly simplifications and partly other types of changes. Therefore, the interface includes the possibility to select only the text segments in the source and in the target sentence that correspond to simplification pairs, and to assign a label only to these specific segments. It also gives the possibility to skip the pair if it does not contain any simplification.

A screenshot of the annotation tool is displayed in Figure 1. On the left, the source sentence(s) are reported, with the modified parts marked in red (as given by the Diff Match and Patch library). On the right, the target sentence(s) are displayed, with segments marked in green to show which parts were introduced during editing. A tickbox next to each red/green segment can be selected to align the source and target segments that correspond to a modification. The annotation interface provides the possibility to choose one of the simplification types proposed in a dropdown menu ('Conferma'), or to skip the pair ('Vai Avanti'). The second option is used to mark the sentences where a modification does not correspond to a proper simplification. For example, the last edit shown in Figure 1 reports in the original version 'Contando esclusivamente sulla capacità del mare', which was modified into 'Contando soprattutto sulla capacità del mare'. Since this change affects the meaning of the sentence, turning exclusively into mainly, but not its readability, the pair was not annotated.

Figure 1: Annotation interface used to mark simplification phenomena in the SIMPITIKI corpus.

In order to develop a corpus which is compliant with the annotation scheme already used in previous works on simplification, we followed the simplification types described in Brunato et al. (2015). The tagset is reported in Table 1 and comprises 6 main classes (Split, Merge, Reordering, Insert, Delete and Transformation) and some subclasses to better specify the Insert, Delete and Transformation operations. The labels are available in the dropdown menu of the annotation interface and can be used to tag selected pairs of sentences.

Class            Subclass
Split            -
Merge            -
Reordering       -
Insert           Verb
Insert           Subject
Insert           Other
Delete           Verb
Delete           Subject
Delete           Other
Transformation   Lexical substitution (word)
Transformation   Lexical substitution (phrase)
Transformation   Anaphoric replacement
Transformation   Verb to Noun (nominalization)
Transformation   Noun to Verb
Transformation   Verbal voice
Transformation   Verbal features

Table 1: Simplification classes and subclasses. For details see Brunato et al. (2015).

5 Corpus analysis

So far, annotators have viewed 2,671 sentence pairs, 2,326 of which were skipped because the target sentence was not a simplified version of the source one. 345 sentence pairs with 575 annotations are currently part of the SIMPITIKI corpus, and all phenomena presented in the annotation scheme proposed by Brunato et al. (2015) are currently covered.

As a comparison, we also analyse the content of the annotated corpora described in Brunato et al. (2015), which represent the only existing corpora for Italian simplification. These include the Terence corpus of children's stories, which was specifically created to address the needs of poor comprehenders and contains 1,036 parallel sentence pairs, and the Teacher corpus, a set of documents simplified by teachers for educational purposes, containing 357 sentence pairs.

Class            Subclass                               # wiki   # PA   Total
Split            -                                          20     18      38
Merge            -                                          22      0      22
Reordering       -                                          14     20      34
Insert           Verb                                       11      5      16
Insert           Subject                                     5      1       6
Insert           Other                                      58     21      79
Delete           Verb                                       12      1      13
Delete           Subject                                    17      1      18
Delete           Other                                     146     31     177
Transformation   Lexical Substitution (word level)          96    253     349
Transformation   Lexical Substitution (phrase level)       143    184     327
Transformation   Anaphoric replacement                      14      3      17
Transformation   Noun to Verb                                3     32      35
Transformation   Verb to Noun (nominalization)               2      0       2
Transformation   Verbal Voice                                2      1       3
Transformation   Verbal Features                            10     20      30
Total                                                      575    591    1166

Table 2: Number of simplification phenomena annotated in the Wikipedia-based and the public administration (PA) corpus.
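The counts in Table 2 can be double-checked programmatically. The following minimal sketch (figures copied from the table; the data structure and variable names are ours, not part of the released resource) recomputes the column totals and the share of Transformation annotations per corpus:

```python
# Annotation counts copied from Table 2: (class, subclass) -> (wiki, PA).
COUNTS = {
    ("Split", None): (20, 18),
    ("Merge", None): (22, 0),
    ("Reordering", None): (14, 20),
    ("Insert", "Verb"): (11, 5),
    ("Insert", "Subject"): (5, 1),
    ("Insert", "Other"): (58, 21),
    ("Delete", "Verb"): (12, 1),
    ("Delete", "Subject"): (17, 1),
    ("Delete", "Other"): (146, 31),
    ("Transformation", "Lexical Substitution (word level)"): (96, 253),
    ("Transformation", "Lexical Substitution (phrase level)"): (143, 184),
    ("Transformation", "Anaphoric replacement"): (14, 3),
    ("Transformation", "Noun to Verb"): (3, 32),
    ("Transformation", "Verb to Noun (nominalization)"): (2, 0),
    ("Transformation", "Verbal Voice"): (2, 1),
    ("Transformation", "Verbal Features"): (10, 20),
}

# Column totals, which reproduce the last row of Table 2.
wiki_total = sum(w for w, _ in COUNTS.values())  # 575
pa_total = sum(p for _, p in COUNTS.values())    # 591

# Transformation annotations per corpus.
transf_wiki = sum(w for (c, _), (w, _) in COUNTS.items() if c == "Transformation")  # 270
transf_pa = sum(p for (c, _), (_, p) in COUNTS.items() if c == "Transformation")    # 493
```

Here transf_pa / pa_total is roughly 0.83, versus roughly 0.47 for transf_wiki / wiki_total, consistent with the prevalence of lexical substitution in the PA data discussed below.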
Besides, we include in the comparison also another corpus, which we manually created by simplifying documents issued by the Trento Municipality to regulate building permits and kindergarten admittance. This corpus was simplified following the instructions in Brunato et al. (2015), but pertains to a different domain, i.e. public administration (PA). The Wikipedia-based and the PA corpus have a comparable size (575 vs. 591 annotations), but the simplification phenomena have a different frequency, as shown in Table 2.

While the Terence corpus contains on average 2.1 annotated phenomena per sentence pair, Teacher 2.8 and the PA corpus 2.9, the Wikipedia-based corpus includes only 1.6 simplifications per parallel pair. As expected, corpora that were explicitly created for simplification tend to have a higher concentration of simplification phenomena than corpora developed in less controlled settings.

In Fig. 2 we compare the distribution of the different simplification types across the four corpora. The graph shows that the same phenomena, such as subject deletion, nominalizations and transfer of verbal voice, tend to be rare across the four datasets. Similarly, the three top-frequent simplification types, i.e. delete-other, word transformation and phrase transformation, are the same across the four datasets. However, in the Wikipedia-based corpus, word transformation is less frequent than in the other document types, while phrase transformation is much more present. This may show that the 'controlled' setting in which the Terence and the Teacher corpora were created may lead educators to put more emphasis on word-based transformations to teach synonyms, while in a more 'ecological' setting like Wikipedia the performed simplifications are not guided or constrained, and phrase-based transformations may sound more natural.

As for the PA documents, transformation phenomena are probably very frequent because of the technical language characterised by domain-specific words, which tend to be replaced by more common ones during manual simplification. In this corpus, noun-to-verb transformations are particularly frequent, since nominalizations are typical phenomena of the administrative language affecting its readability (Cortelazzo and Pellegrino, 2003).

Figure 2: Distribution of the simplification phenomena covered in the Terence, Teacher, Wikipedia-based and Public Administration corpora.

As for the non-simplifications discarded during the creation of the Wikipedia-based corpus, they include generalizations, specifications, entailments, deletions, edits changing the meaning, error corrections, capitalizations, etc. (see some examples in Table 3). These types of modifications are very important because they may represent negative examples for training machine learning systems that recognize simplification pairs.

1. Lo psicodramma è stato il precursore di tutte le forme di psicoterapia di gruppo
2. Lo psicodramma è in relazione con altre forme di psicoterapia di gruppo

1. Partigiani non comunisti e giornalisti democratici furono uccisi per il loro coraggio
2. Partigiani non comunisti e giornalisti furono uccisi per il loro coraggio

1. Il dispositivo di memoria di massa utilizza memoria allo stato solido, ovvero basata su un semiconduttore
2. Il dispositivo di memoria di massa basata su semiconduttore utilizza memoria allo stato solido

Table 3: Examples of parallel pairs which were not annotated as simplifications.

6 Conclusions and Future work

We presented a study aimed at the extraction and annotation of a corpus for Italian text simplification based on Wikipedia. The work has highlighted the challenges and the advantages related to the use of Wikipedia edits. Our goal is to propose this resource as a testbed for the evaluation of Italian simplification systems, as an alternative to other existing corpora created in a more 'controlled' setting. The corpus is made available to the research community together with the tools used to create it. The SIMPITIKI resource also contains a second corpus, of comparable size, which was created by manually simplifying a set of documents in the administrative domain. This allows cross-domain comparisons of simplification phenomena.

In the future, this work can be extended in several directions. We plan to use the simplification pairs in this corpus to train a classifier with the goal of distinguishing between simplified and non-simplified pairs. This could extend the gold standard with a larger set of "silver" data by labelling all the remaining candidate pairs extracted from Wikipedia. Besides, the SIMPITIKI methodology is currently being used to create a similar corpus for Spanish, using the same annotation interface. The outcome of this effort will allow multilingual studies on simplification.

Finally, we plan to evaluate the ERNESTA system for Italian simplification (Barlacchi and Tonelli, 2013) using this corpus. Specifically, since different simplification phenomena are annotated, it would be interesting to perform a separate evaluation on each class, as suggested in Xu et al. (2015).

Acknowledgments

The research leading to this paper was partially supported by the EU Horizon 2020 Programme via the SIMPATICO Project (H2020-EURO-6-2015, n. 692819).

References

Gianni Barlacchi and Sara Tonelli. 2013. ERNESTA: A Sentence Simplification Tool for Children's Stories in Italian. In Alexander Gelbukh, editor, Computational Linguistics and Intelligent Text Processing: 14th International Conference, CICLing 2013, Samos, Greece, March 24-30, 2013, Proceedings, Part II, pages 476–487, Berlin, Heidelberg. Springer Berlin Heidelberg.

Stefan Bott and Horacio Saggion. 2011. An unsupervised alignment algorithm for text simplification corpus construction. In Proceedings of the Workshop on Monolingual Text-To-Text Generation, MTTG '11, pages 20–26, Stroudsburg, PA, USA. Association for Computational Linguistics.

Dominique Brunato, Felice Dell'Orletta, Giulia Venturi, and Simonetta Montemagni. 2015. Design and Annotation of the First Italian Corpus for Text Simplification. In Proceedings of The 9th Linguistic Annotation Workshop, pages 31–41, Denver, Colorado, USA, June. Association for Computational Linguistics.

M. Cortelazzo and F. Pellegrino. 2003. Guida alla scrittura istituzionale. Laterza.

David Klaper, Sarah Ebling, and Martin Volk. 2013. Building a German/Simple German Parallel Corpus for Automatic Text Simplification. In Proceedings of the 2nd Workshop on Predicting and Improving Text Readability for Target Reader Populations, pages 11–19, Sofia, Bulgaria.

Tiago F. Pereira, Lucia Specia, Thiago A. S. Pardo, Caroline Gasperin, and Sandra M. Aluisio. 2009. Building a Brazilian Portuguese parallel corpus of original and simplified texts. In 10th Conference on Intelligent Text Processing and Computational Linguistics, pages 59–70, Mexico City.

Kristian Woodsend and Mirella Lapata. 2011. Learning to simplify sentences with quasi-synchronous grammar and integer programming. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 409–420, Edinburgh, Scotland, UK, July. Association for Computational Linguistics.

Wei Xu, Chris Callison-Burch, and Courtney Napoles. 2015. Problems in current text simplification research: New data can help. Transactions of the Association for Computational Linguistics, 3:283–297.

Mark Yatskar, Bo Pang, Cristian Danescu-Niculescu-Mizil, and Lillian Lee. 2010. For the Sake of Simplicity: Unsupervised Extraction of Lexical Simplifications from Wikipedia. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT '10, pages 365–368, Stroudsburg, PA, USA. Association for Computational Linguistics.

Zhemin Zhu, Delphine Bernhard, and Iryna Gurevych. 2010. A monolingual tree-based translation model for sentence simplification. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 1353–1361, Beijing, China, August. Coling 2010 Organizing Committee.