The impact of phrases on Italian lexical simplification

Sara Tonelli, Alessio Palmero Aprosio
Fondazione Bruno Kessler, Trento, Italy
{satonelli,aprosio}@fbk.eu

Marco Mazzon
Dept. of Psychology and Cognitive Science, University of Trento
marco.mazzon@studenti.unitn.it

Abstract

Automated lexical simplification has been performed so far focusing only on the replacement of single tokens with single tokens, and this choice has affected both the development of systems and the creation of benchmarks. In this paper, we argue that lexical simplification in real settings should deal with both single-token and multi-token terms, and we present a benchmark created for the task. Besides, we describe how a freely available system can be tuned to also cover the simplification of phrases, and we perform an evaluation comparing different experimental settings.

1 Introduction

Lexical simplification is a well-studied topic within the NLP community, dealing with the automatic replacement of complex terms with simpler ones in a sentence, in order to improve its clarity and readability. Thanks to the development of benchmarks (Paetzold and Specia, 2016a) and of freely available tools for lexical simplification (Paetzold and Specia, 2015), a number of works have focused on this challenge; see for example the systems participating in the simplification shared task at SemEval-2012 (Specia et al., 2012). However, the task has been designed as an exercise in replacing complex single tokens with simpler single tokens, and the most widely used benchmarks and systems all follow this paradigm. We believe, however, that this setting covers only a limited share of the lexical simplifications that would be performed in a real scenario. In particular, we advocate the need to shift the lexical simplification paradigm from single tokens to phrases, and to develop datasets and tools that also deal with these cases. This is the main contribution of this work, which covers four points:

• We analyse existing corpora of simplified texts, not specifically developed for a shared task or for system evaluation, and we measure the impact of phrases on lexical simplification
• We modify a state-of-the-art tool for lexical simplification so that it supports phrases
• We compare different strategies for phrase extraction and evaluate them over a benchmark
• We perform all the above on Italian, for which no lexical simplification system was available.

Besides, we make freely available the first benchmark for the evaluation of Italian lexical simplification, with the goal of supporting research on this task and fostering the development of Italian simplification systems.
2 Corpus analysis and benchmark creation

We first analyse existing simplification corpora in Italian to study the impact of phrases on lexical simplification. There are only two such manually created corpora, which contain different types of data but have been annotated following the same scheme: the Simpitiki corpus (Tonelli et al., 2016) and the one developed by the ItaNLP Lab in Pisa (Brunato et al., 2015). The former contains 1,163 sentence pairs [1], where one is the original sentence and the other is the simplified one. The pairs were created starting from Wikipedia edits and from documents in the public administration domain. The ItaNLP corpus, instead, contains 1,393 pairs extracted from children's stories and from educational material. Both corpora were annotated following the scheme proposed in (Brunato et al., 2015), in which simplifications were classified as Split, Merge, Reordering, Insert, Delete and Transformation (plus a set of subclasses for the Insert, Delete and Transformation cases). Since our goal was to isolate a benchmark of pairs containing only the lexical cases, we discarded the classes not compatible with lexical simplification (e.g. Delete, Reordering) and then manually checked the others to identify the cases of interest. When, as in the majority of cases, a lexical simplification was present together with other simplification types, we rewrote the target sentence in order to retain only the lexical cases. For example, in the examples below, a) is the original sentence and b) is the simplified one in the Simpitiki corpus, which contains a lexical simplification of 'include' and a shift in the position of 'per convenzione'. We created version c), so that only the lexical simplification is present:

a) Eurasia è il termine con cui per convenzione si definisce la zona geografica che include l'Europa e l'Asia.

b) Eurasia è, per convenzione, il termine con cui si definisce la zona geografica che comprende l'Europa e l'Asia.

c) Eurasia è il termine con cui per convenzione si definisce la zona geografica che comprende l'Europa e l'Asia.

(Versions a) and c) differ only in the replacement of 'include' with 'comprende', both roughly meaning 'includes/comprises'.)

This revision process led to the creation of a benchmark with pairs extracted from the two original corpora, where only cases of lexical simplification are present [2]. Some statistics on the benchmark are reported in Table 1. We identify four possible lexical simplification types: a single token is replaced by a single token (ST→ST), a single token is simplified through a phrase (ST→P), a phrase is simplified through a single token (P→ST), and a phrase is replaced by another phrase (P→P).

            ST→ST   ST→P   P→ST   P→P   Total
ItaNLP        369    112    139    87     707
Simpitiki     112     24     30    28     194
Total         481    136    169   115     901

Table 1: Statistics on the lexical simplification benchmark (ST = single token, P = phrase)

We observe that the most frequent lexical simplification type is ST→ST, on which most systems and shared tasks are based. However, this simplification type covers only about half of the cases included in our benchmark (481 out of 901). This confirms the need to include cases of phrase-based simplification in the creation of benchmarks. It also corroborates the importance of developing lexical simplification systems that support phrase replacement, so as to make them work in real settings and not only on ad-hoc test sets. Another interesting remark is that single tokens are not necessarily simpler than phrases, or vice versa: in our data there are 136 ST→P and 169 P→ST cases, showing that no general rule can be applied to favour (or demote) phrases over single tokens.

We use the final benchmark [3], containing 901 sentence pairs, to evaluate a system for lexical simplification that takes phrases into account, as described in the following section.

[1] The number is slightly different from what was reported in the original paper because the corpus was revised after the first release.
[2] In Simpitiki we focused only on the pairs in the public administration domain due to project constraints. We plan to include the pairs from Wikipedia in the next benchmark version.
[3] Available at https://drive.google.com/file/d/0B4QAWZllD-egYS0yNWZ5dTdYQVE/view?usp=sharing
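The four-way typology of Table 1 depends only on whether each side of a gold pair is a single token or a multi-token phrase. As a minimal illustration, the following Python sketch assigns each pair to one of the four classes; the input format (one tab-separated source/target term pair per line in a file named benchmark_pairs.tsv) is a hypothetical assumption for this example, not the layout of the released benchmark.

```python
# Minimal sketch: classify gold simplification pairs into the four types
# (ST->ST, ST->P, P->ST, P->P) defined in Section 2. The tab-separated
# input format and the file name are illustrative assumptions, not the
# actual layout of the released benchmark.
from collections import Counter

def simplification_type(source_term: str, target_term: str) -> str:
    """Label a pair by whether each side is a single token (ST) or a phrase (P)."""
    src = "ST" if len(source_term.split()) == 1 else "P"
    tgt = "ST" if len(target_term.split()) == 1 else "P"
    return f"{src}->{tgt}"

def count_types(path: str) -> Counter:
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            source, target = line.rstrip("\n").split("\t")
            counts[simplification_type(source, target)] += 1
    return counts

if __name__ == "__main__":
    # e.g. the pair ("include", "comprende") from example a)/c) counts as ST->ST
    print(count_types("benchmark_pairs.tsv"))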
3 Automated lexical simplification

In this section we describe the experiments we carried out to perform automated lexical simplification using the benchmark presented in Section 2. We describe the tool used and how it was modified to deal with phrases. We also detail the resources (language model and word embeddings) created for the task.

3.1 The Lexenstein system

We use Lexenstein (Paetzold and Specia, 2015), an open source tool for lexical simplification, to collect a list of candidates that could replace a given word in the text. In particular, the Paetzold generator (Paetzold and Specia, 2016b) is based on an unsupervised approach that produces simplification candidates using a context-aware word embeddings model: the features used for candidate selection include word2vec vectors (Mikolov et al., 2013), a language model created with SRILM (Stolcke, 2002), and the conditional probability of a candidate given the PoS tag of the target word. So far, no evaluation of Lexenstein for Italian is available.

For each complex word, five candidate replacements are first retrieved, ranked according to several features, such as n-gram frequencies and word vector similarity with the target word, and then re-ranked according to their average rankings (Glavaš and Štajner, 2015).

Since we wanted to test different strategies for creating the embeddings (i.e. with and without phrases), we created the word/phrase vectors and the language model starting from freely available corpora (1.3 billion words in total): the Italian Wikipedia [4], OpenSubtitles2016 (Lison and Tiedemann, 2016) [5], PAISÀ [6], and the Gazzetta Ufficiale [7], a collection of Italian laws. Due to the size of the data, both the corpus and the model are available upon request from the authors.

[4] https://it.wikipedia.org/wiki/Pagina_principale
[5] http://www.opensubtitles.org/
[6] http://www.corpusitaliano.it/
[7] http://www.gazzettaufficiale.it/

3.2 Experimental setup

We conduct several experiments to evaluate the quality of lexical simplification when taking phrases into account (or not), and we compare different strategies for phrase recognition. We compare different variants of creating the embeddings and the language model (LM) that were then used by Lexenstein.

The baseline model relies on the standard Lexenstein setting: word embeddings are created using the word2vec package, and the LM considers each token separately.

The first system variant (word2phrase) includes phrase recognition, i.e. before extracting the embeddings and creating the LM, the documents are analysed by the word2phrase module in the word2vec package. This is an implementation of the algorithm presented in (Mikolov et al., 2013), which identifies words that appear frequently together, and infrequently in other contexts, and treats them as single tokens (connected by an underscore); a minimal sketch of this step is given at the end of this section.

The second system variant (word2phrase+LemmaPos) adds another information layer: each document is first lemmatized and PoS tagged using the Tint NLP Suite (Aprosio and Moretti, 2016), which works at token level; then word2phrase is run, and finally the embeddings and the LM are created. In this way we obtain so-called 'context-aware' embeddings, which is the recommended setting in (Paetzold and Specia, 2016b).
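As a concrete illustration of the word2phrase step, the following Python sketch uses gensim's Phrases module, which implements the same frequency-based collocation scoring as word2phrase in (Mikolov et al., 2013). This is a minimal sketch under stated assumptions: the corpus file name and the min_count/threshold values are illustrative, not the settings used in our experiments.

```python
# Minimal sketch of the word2phrase step using gensim's Phrases (same
# collocation scoring as Mikolov et al., 2013). The corpus path and the
# hyperparameters are illustrative assumptions, not our exact settings.
from gensim.models.phrases import Phrases
from gensim.models.word2vec import LineSentence, Word2Vec

sentences = LineSentence("corpus.tokenized.txt")  # one tokenized sentence per line

# Detect frequent bigrams, e.g. "gazzetta ufficiale" -> "gazzetta_ufficiale"
bigrams = Phrases(sentences, min_count=5, threshold=10.0)

# A second pass over the transformed stream also captures longer phrases
trigrams = Phrases(bigrams[sentences], min_count=5, threshold=10.0)

# Train embeddings on the phrase-merged corpus, as in the word2phrase variant
model = Word2Vec(trigrams[bigrams[sentences]], vector_size=300, window=5, min_count=5)
model.wv.save("phrase_embeddings.kv")
```

In the word2phrase+LemmaPos variant, the same pipeline applies; the only difference is that each token in the input file is first replaced by its lemmatized and PoS-tagged form before phrase detection, and the LM is built over the same transformed stream.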
4 Evaluation

The evaluation of automated simplification is an open issue since, as in machine translation, there may be several acceptable simplifications for a term, while a benchmark usually provides only one solution. Therefore, we perform two evaluations: the first is based on an automated comparison between the Lexenstein output and the gold simplifications in the benchmark; the second is a manual evaluation aimed at scoring the fluency, adequacy and simplicity of the output.

For the first evaluation, we compute the Mean Reciprocal Rank (MRR), which is usually adopted to evaluate a list of possible responses ordered by probability of correctness against a gold answer. We use this metric because Lexenstein returns 5 possible simplifications, ranked by relevance, and with MRR it is possible to weight a response matching the gold simplification according to its rank (a short implementation sketch is given at the end of this section). In particular, MRR is computed as:

MRR = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{rank_i}

where |Q| is the number of simplifications to be performed (901) and rank_i is the position of the correct simplification in the ranking returned by Lexenstein.

We run the system in the three configurations described in Section 3.2 on each source sentence in the benchmark. The single- or multi-token term to be simplified is given. If it is found in the LM, the system suggests 5 ranked simplification candidates; otherwise, no output is given.

Results show that the baseline model, i.e. the standard Lexenstein configuration replacing only single tokens with single tokens, yields MRR = 0.036. The one using word2phrase achieves MRR = 0.042, while the version also including lemma and PoS information yields MRR = 0.050. A detailed evaluation is reported in Table 2: for each of the three experimental settings, we report the number of cases in which the gold simplification matches the first-ranked replacement returned by Lexenstein (1st), the second, the third, and so on. In the last column, we report how many times (out of 901) the ranked list returned by Lexenstein does not contain the gold simplification present in the benchmark.

                        1st   2nd   3rd   4th   5th   none
Baseline                 23    12     7     3     2    854
word2phrase              30     8     8     4     1    850
word2phrase+LemmaPos     32    16    11     4     4    834

Table 2: Rank of correct simplifications returned by Lexenstein

This evaluation shows that, although the improvement is limited, using word2phrase in combination with lemma and PoS information outperforms the baseline. However, the informativeness of this automated evaluation is limited, because the cases labeled as 'none' include both wrong simplifications and correct simplifications that are not present in the benchmark. Besides, they also include cases in which the word to be simplified was not found in the LM.

In order to better understand where the approach fails, we also perform a manual evaluation. Following the standard scheme for human evaluation of automatic text simplification (Saggion and Hirst, 2017), we judge the Fluency (grammaticality), Adequacy (meaning preservation) and Simplicity of lexical simplifications on a five-point Likert scale (the higher the score, the better the output). For the setting using lemma and PoS, we do not judge Fluency, since the output is lemmatized and not converted back into the original inflected form of the source term (we plan to add this in the near future). The evaluation is performed on a set of 150 sentence pairs randomly extracted from the benchmark. We introduce this kind of evaluation in order to obtain a fine-grained analysis of the system output. For example, in the original sentence d) below, 'tempestivamente' ('promptly') was simplified with 'periodicamente' ('periodically'), which is grammatically correct (high Fluency) but does not preserve the meaning of the original sentence (low Adequacy).

d) Il richiedente dovrà comunicare tempestivamente l'esattezza dei recapiti forniti.

When using word2phrase without lemmatization, the average Fluency is 3.72, Adequacy is 2.60 and Simplicity is 2.95. This shows that, while the PoS and form of a simplified term are generally correct even without any preprocessing, the preservation of meaning is a critical issue. Simplicity achieves better scores than Adequacy, but it still needs improvement. Results obtained using lemma and PoS in combination with word2phrase are slightly better, with 2.64 Adequacy and 3.01 Simplicity.
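To make the automated metric above concrete, here is a minimal MRR sketch. The data layout (a gold term paired with the ranked candidate list returned for it) is an assumption for illustration, not code taken from Lexenstein.

```python
# Minimal MRR sketch for the automated evaluation in Section 4. Each item
# pairs a gold simplification with the ranked candidate list returned by
# the system (at most 5 candidates; an empty list means the term was not
# found in the LM). The data layout is an illustrative assumption.
from typing import List, Tuple

def mean_reciprocal_rank(items: List[Tuple[str, List[str]]]) -> float:
    """MRR = (1/|Q|) * sum of 1/rank of the gold candidate over all items.

    Items whose ranked list does not contain the gold term contribute 0,
    which also covers the 'none' cases of Table 2.
    """
    total = 0.0
    for gold, ranked_candidates in items:
        if gold in ranked_candidates:
            total += 1.0 / (ranked_candidates.index(gold) + 1)  # ranks start at 1
    return total / len(items) if items else 0.0

# Toy usage: the gold term 'comprende' is ranked second -> (1/2 + 0) / 2 = 0.25
print(mean_reciprocal_rank([
    ("comprende", ["contiene", "comprende", "copre"]),
    ("tempestivamente", []),  # no output: term not found in the LM
]))
```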
In general, the above evaluations show that using word2phrase with lemma and PoS information is a promising approach to improving the performance of lexical simplification in real settings. The performance of Lexenstein could be further improved by adding other corpora to the LM and by post-processing the output of the system, so as to discard inconsistent simplifications, for example when a verb is simplified through an adverb. However, some linguistic phenomena, such as non-local dependencies, cannot be addressed with this approach, and a separate strategy to simplify them should be taken into account.

5 Conclusions

In this work, we presented a first analysis of the role of phrases in Italian lexical simplification. We also introduced an adaptation of Lexenstein, an existing lexical simplification system, so as to take phrases into account. In the future, we plan to test other approaches to the extraction of phrases, for example by applying algorithms for recognising multiword expressions. We also plan to integrate our best model for phrase simplification into ERNESTA (Barlacchi and Tonelli, 2013), a system for the syntactic simplification of Italian documents. Furthermore, within the H2020 SIMPATICO project, we will integrate our phrase simplification approach into the existing services of the Trento Municipality and perform a pilot study with real users.

Acknowledgments

The research leading to this paper was supported by the EU Horizon 2020 Programme via the SIMPATICO Project (H2020-EURO-6-2015, n. 692819).

References
Alessio Palmero Aprosio and Giovanni Moretti. 2016. Italy goes to Stanford: A collection of CoreNLP modules for Italian. CoRR, abs/1609.06204.

Gianni Barlacchi and Sara Tonelli. 2013. ERNESTA: A sentence simplification tool for children's stories in Italian. In Computational Linguistics and Intelligent Text Processing: 14th International Conference, CICLing 2013, Samos, Greece, Proceedings, Part II, pages 476–487. Springer, Berlin/Heidelberg.

Dominique Brunato, Felice Dell'Orletta, Giulia Venturi, and Simonetta Montemagni. 2015. Design and annotation of the first Italian corpus for text simplification. In Proceedings of the 9th Linguistic Annotation Workshop, pages 31–41, Denver, Colorado, USA. Association for Computational Linguistics.

Goran Glavaš and Sanja Štajner. 2015. Simplifying lexical simplification: Do we need simplified corpora? In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 63–68, Beijing, China. Association for Computational Linguistics.

Pierre Lison and Jörg Tiedemann. 2016. OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119, Lake Tahoe, Nevada, USA.

Gustavo Paetzold and Lucia Specia. 2015. LEXenstein: A framework for lexical simplification. In Proceedings of ACL-IJCNLP 2015 System Demonstrations, pages 85–90, Beijing, China. Association for Computational Linguistics.

Gustavo Paetzold and Lucia Specia. 2016a. Benchmarking lexical simplification systems. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Paris, France. European Language Resources Association (ELRA).

Gustavo H. Paetzold and Lucia Specia. 2016b. Unsupervised lexical simplification for non-native speakers. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pages 3761–3767, Phoenix, Arizona, USA. AAAI Press.

H. Saggion and G. Hirst. 2017. Automatic Text Simplification. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers.

Lucia Specia, Sujay Kumar Jauhar, and Rada Mihalcea. 2012. SemEval-2012 Task 1: English lexical simplification. In Proceedings of the First Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), pages 347–355. Association for Computational Linguistics.

Andreas Stolcke. 2002. SRILM — an extensible language modeling toolkit. In Proceedings of the International Conference on Spoken Language Processing (ICSLP 2002), pages 901–904.

Sara Tonelli, Alessio Palmero Aprosio, and Francesca Saltori. 2016. SIMPITIKI: A simplification corpus for Italian. In Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016), volume 1749 of CEUR Workshop Proceedings.