Towards Personalised Simplification based on L2 Learners' Native Language

Alessio Palmero Aprosio†, Stefano Menini†, Sara Tonelli†, Luca Ducceschi‡, Leonardo Herzog‡
†FBK, ‡University of Trento
{aprosio,menini,satonelli}@fbk.eu, luca.ducceschi@unitn.it, leonardo.herzog@studenti.unitn.it

Abstract

English. We present an approach to improve the selection of complex words for automatic text simplification, addressing the need to take the native language of L2 learners into account during simplification. In particular, we develop a methodology that automatically identifies terms that are 'difficult' for L2 learners (i.e. false friends) in order to simplify them. We evaluate not only the quality of the detected false friends, but also the impact of this methodology on text simplification compared with a standard frequency-based approach.

Italiano. In this paper we present an approach for automatically selecting the complex words to be simplified, taking the user's mother tongue into account. Specifically, our methodology identifies the terms that are 'difficult' (false friends) for the user in order to propose their simplification. In this context, we evaluate not only the quality of the detected false friends, but also the impact of this personalised simplification compared with standard approaches based on word frequency.

1 Introduction

The task of automated text simplification has been investigated within the NLP community for several years with a number of different approaches, from rule-based ones (Siddharthan, 2010; Barlacchi and Tonelli, 2013; Scarton et al., 2017) to supervised (Bingel and Søgaard, 2016; Alva-Manchego et al., 2017) and unsupervised ones (Paetzold and Specia, 2016), including recent studies using deep learning (Zhang and Lapata, 2017; Nisioi et al., 2017). Nevertheless, only recently have researchers started to build simplification systems that can adapt to users, based on the observation that the perceived simplicity of a document depends strongly on the user profile, including not only specific disabilities but also language proficiency, age, profession, etc. Therefore, in the last few months the first approaches to personalised text simplification have been proposed at major conferences, with the goal of simplifying a document for different language proficiency levels (Scarton and Specia, 2018; Bingel et al., 2018; Lee and Yeung, 2018).

Along this research line, we present in this paper an approach to automated lexical simplification for L2 learners that adapts to the user's mother tongue. To our knowledge, this is the first work taking this aspect into account and presenting a solution that, given an Italian document and the user's mother tongue as input, selects only the words that the user may find difficult given his/her knowledge of another language. Specifically, we automatically detect and simplify the terms that may be misleading for the user because they are false friends, while we do not simplify those that have an orthographically and semantically similar translation in the user's native language (so-called cognates). In multilingual settings, for instance while teaching, learning or translating a foreign language, these two phenomena have proven to be very relevant (Ringbom, 1986), because lexical similarities between the two languages in contact create interference, favouring or hindering the course of learning.
We compare our approach to the selection of words to be simplified with a standard frequency-based one, in which only the terms that are not listed in De Mauro's Dictionary of Basic Italian[1] are simplified, regardless of the user's native language. Our experiments are evaluated on the Italian-French pair, but the approach is generic.

[1] https://dizionario.internazionale.it/nuovovocabolariodibase

2 Approach description

Given a document Di to be simplified, and a native language L1 spoken by the user, our approach consists of the following steps:

1. Candidate selection: for each content word[2] wi in Di, we automatically generate a list of words W1 ⊂ L1 which are orthographically similar to wi. In this phase, several orthographic similarity metrics are evaluated. We keep the 5 terms most similar to wi.

2. False friend and cognate detection: for each of the 5 most similar words in W1, we classify whether or not it is a false friend of wi.

3. Simplification choice: based on the output of the previous steps, the system marks wi as difficult to understand for the user if there are corresponding false friends in L1. Otherwise, wi is left in its original form. When a word is marked as difficult, a subsequent simplification module (not included in this work) should try to find an alternative form (such as a synonym or a description) to make the term more understandable to the user.

[2] Content words are words that carry meaning, such as nouns, adjectives, verbs and adverbs. To extract this information, we use the POS tagger included in the Tint pipeline (Aprosio and Moretti, 2018).

2.1 Candidate Selection

A number of similarity metrics have been presented in the past to identify candidate cognates and false friends; see for example the evaluation in Inkpen and Frunza (2005). We choose three of them, motivated by the fact that we want to have at least one ngram-based metric (XXDICE) and one non ngram-based metric (Jaro/Winkler). To these, we add a more standard metric, Normalized Edit Distance (NED). The three metrics are explained below:

• XXDICE (Brew et al., 1996). It takes into consideration the number of shared extended bigrams[3] and their positions relative to the two strings S1 and S2. The formula is:

    XX(S_1, S_2) = \frac{2 \sum_{(x,y) \in B} \frac{1}{1 + (pos(x) - pos(y))^2}}{xb(S_1) + xb(S_2)}

where B is the set of pairs of shared extended bigrams (x, y), with x in S1 and y in S2. The functions pos(x) and xb(S) return the position of extended bigram x and the number of extended bigrams in string S, respectively.

[3] An extended bigram is an ordered letter pair formed by deleting the middle letter from any three-letter substring of the word.

• NED, Normalized Edit Distance (Wagner and Fischer, 1974). A regular Edit Distance calculates the orthographic difference between two strings by assigning a cost to the minimum number of edit operations (deletion, substitution and insertion, all with cost 1) needed to make them equal. NED is obtained by dividing the edit cost by the length of the longest string.

• Jaro/Winkler (Winkler, 1990). The Jaro similarity metric for two strings S1 and S2 is computed as follows:

    J(S_1, S_2) = \frac{1}{3}\left(\frac{m}{|S_1|} + \frac{m}{|S_2|} + \frac{m - T}{m}\right)

where m is the number of characters in common, provided that they occur in the same (uninterrupted) sequence, and T is the number of transpositions of characters in S1 needed to obtain S2. The Winkler variation of the metric adds a bias if the two strings share a prefix:

    JW(S_1, S_2) = J(S_1, S_2) + (1 - J(S_1, S_2)) \cdot l \cdot p

where l is the number of characters of the common prefix of the two strings, up to four, and p is a scaling factor, usually set to 0.1.
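To make the less standard definitions concrete, here is a minimal Python sketch of XXDICE and NED as described above; the function names are ours, and as a simplification xxdice matches each shared bigram type once, at its first occurrence in either string (an off-the-shelf implementation can be used for Jaro/Winkler).

```python
def extended_bigrams(s):
    # Extended bigrams (footnote 3): the ordered letter pair obtained by
    # deleting the middle letter of every three-letter substring of s,
    # returned together with its position in the string.
    return [(s[i] + s[i + 2], i) for i in range(len(s) - 2)]

def xxdice(s1, s2):
    # XXDICE (Brew et al., 1996) as given in Section 2.1.
    xb1, xb2 = extended_bigrams(s1), extended_bigrams(s2)
    if not xb1 or not xb2:
        return 0.0
    first_pos = {}
    for bg, p in xb2:
        first_pos.setdefault(bg, p)
    total, seen = 0.0, set()
    for bg, p1 in xb1:
        if bg in first_pos and bg not in seen:
            seen.add(bg)
            total += 1.0 / (1.0 + (p1 - first_pos[bg]) ** 2)
    return 2.0 * total / (len(xb1) + len(xb2))

def ned(s1, s2):
    # Normalized Edit Distance: Levenshtein distance with unit costs,
    # divided by the length of the longer string.
    m, n = len(s1), len(s2)
    if max(m, n) == 0:
        return 0.0
    row = list(range(n + 1))
    for i in range(1, m + 1):
        prev, row[0] = row[0], i
        for j in range(1, n + 1):
            cur = min(row[j] + 1,                       # deletion
                      row[j - 1] + 1,                   # insertion
                      prev + (s1[i - 1] != s2[j - 1]))  # substitution
            prev, row[j] = row[j], cur
    return row[n] / max(m, n)

print(ned("général", "generale"))  # 0.375, as in footnote 5 of Section 3
print(ned("general", "generale"))  # 0.125 after stripping accents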
Each of these three measures has some disadvantages. For example, we found that the Jaro/Winkler metric boosts the similarity of words with the same root. On the other hand, applying NED leads to several pairs of words having the same similarity score. As a result, two words that are close according to one metric can be far apart according to another. To overcome this limitation, we balance the three metrics by computing a weighted average of the three scores, tuned on a training set. For details, see Section 3.
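A minimal sketch of this combination step, assuming the xxdice and ned functions above and a jaro_winkler implementation from a string-metrics library; the default weights are the best configuration found in Section 3, and since NED is a distance we fold it in as 1 - NED (our assumption, as the paper does not spell this out).

```python
def combined_similarity(s1, s2, w_xx=0.2, w_jw=0.4, w_ned=0.4):
    # Weighted average of the three orthographic scores; the weights
    # are tuned on the training set (Section 3). NED measures distance,
    # so we assume it enters the average as the similarity 1 - NED.
    return (w_xx * xxdice(s1, s2)
            + w_jw * jaro_winkler(s1, s2)
            + w_ned * (1.0 - ned(s1, s2)))

def top_candidates(word, lexicon, k=5):
    # Rank every lemma of the L1 lexicon and keep the k most similar;
    # ties with the k-th score extend the list (see Section 3).
    scored = sorted(((combined_similarity(word, w), w) for w in lexicon),
                    reverse=True)
    if len(scored) <= k:
        return [w for _, w in scored]
    cutoff = scored[k - 1][0]
    return [w for s, w in scored if s >= cutoff]
```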
2.2 False Friend and Cognate Detection

As for false friend and cognate detection, we rely on an SVM-based classifier and train it on a single feature obtained from a multilingual embedding space (Mikolov et al., 2013), where the user language L1 and the language of the document to be simplified L2 are aligned. In particular, the feature is the cosine distance between the embedding of a given content word wi in the language L2 and the embedding of its candidate false friends or cognates in L1. The intuition behind this approach is that two cognates share their semantics and therefore have a high cosine similarity, as opposed to false friends, whose meanings are generally unrelated. While past approaches to false friend and cognate detection have already exploited monolingual word embeddings (St Arnaud et al., 2017), we employ a multilingual setting for our experiments, so that the semantic distance between the candidate pairs can be measured in their original language without a preliminary translation.
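A sketch of this detection step with scikit-learn, under our own assumptions about the data layout (embeddings as word-to-vector dicts already mapped into a shared space; label 1 for false friend, 0 for cognate); the single cosine-distance feature follows this section, and the radial (RBF) kernel follows Section 3.

```python
import numpy as np
from sklearn.svm import SVC

def cosine_distance(u, v):
    # 1 - cosine similarity: small for cognates (shared semantics),
    # large for false friends (unrelated meanings).
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def train_detector(pairs, labels, emb_it, emb_fr):
    # pairs: (Italian word, French candidate) tuples from the gold
    # standard; emb_it / emb_fr: word -> vector dicts in a shared space.
    X = [[cosine_distance(emb_it[wi], emb_fr[cand])] for wi, cand in pairs]
    clf = SVC(kernel="rbf")  # radial kernel, as in Section 3
    return clf.fit(np.array(X), labels)

# Usage sketch:
# clf = train_detector(train_pairs, train_labels, emb_it, emb_fr)
# clf.predict([[cosine_distance(emb_it["vedere"], emb_fr["vider"])]])
```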
3 Experimental Setup

In our experiments, we consider a setting in which French speakers would like to make Italian documents easier for them to read. Nevertheless, the approach can be applied to any language pair, given that it requires minimal adaptation.

In order to tune the best combination of similarity metrics and to train the SVM classifier, a linguist manually created an Italian-French gold standard containing pairs of words marked as either cognates or false friends. These terms were collected from several lists available on the web. Overall, the Ita-Fr dataset contains a training set of 1,531 pairs (940 cognates and 591 false friends) and a test set of 108 pairs (51 cognates and 57 false friends).

For the candidate selection step, the goal is to obtain, for each term wi in Italian, the 5 French terms with the highest orthographic similarity. Therefore, given wi, we compute its similarity with each term in a French online dictionary[4] (New, 2006) using the three scores described in the previous section. The lemmas were normalized for accents and diacritics, in order to avoid poor results of the metrics in cases like général and generale, where the accented é character would be considered different from e.[5]

[4] http://www.lexique.org/
[5] For example, NED between général and generale returns 0.375 when the two strings are not normalized and 0.125 when they are.

In order to identify the best way to combine the three similarity metrics detailed in Section 2.1, we compute all the possible combinations of weights on 10 groups of 200 word pairs randomly extracted from the 1,531 pairs in the training set, and then keep the combination that scores the highest average similarity.

In Table 1 we report the percentage of times in which the cognate or false friend of wi in the training set would appear among the 5 most similar terms extracted from the French online dictionary according to the three different scores in isolation: XX for XXDICE, JW for Jaro/Winkler and NED for Normalized Edit Distance. We also report the best configuration of the three metrics with the corresponding weights to maximise the presence of a cognate or false friend among the 5 most similar terms. We observe that, while the three metrics in isolation yield similar results, combining them effectively increases the presence of cognates and false friends among the top candidates. This confirms that the metrics capture three different types of similarity, and that it is recommended to take them all into account when performing candidate selection: an approach where every metric contributes to detecting false friend / cognate candidates outperforms the single metrics.

XX    JW    NED   % Top 5
1.0   -     -     64.6
-     1.0   -     65.6
-     -     1.0   65.9
0.2   0.4   0.4   77.3

Table 1: Analysis of the candidate selection strategy using different metrics in isolation and in combination.

For false friend and cognate detection, we proceed as follows. Given a word wi in Italian, we identify the 5 most similar words in French using the 0.2-0.4-0.4 score introduced above. In case of ties in the 5th position, we extend the selection to all the candidates sharing the same similarity value. Each word pair including wi and one of the 5 most similar words is then classified as false friend or cognate with an SVM using a radial kernel, trained on the 1,531 word pairs in the training set. For the multilingual embeddings used to compute the semantic similarity between the Italian words and their candidates, we use the vectors from Bojanowski et al. (2016)[6] trained on Wikipedia data with fastText (Joulin et al., 2016). We chose these resources since they are available both for Italian and French (and several other languages). For the alignment of the semantic spaces of the two languages we use 22,767 Italian-French word pairs collected from an online dictionary.[7]

[6] https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md
[7] http://dizionari.corriere.it/
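The paper cites Mikolov et al. (2013) for the alignment but does not detail the procedure; a common realisation with a seed dictionary is a linear map fitted by least squares, sketched here under that assumption (orthogonality-constrained Procrustes variants are a frequent refinement).

```python
import numpy as np

def learn_alignment(seed_pairs, emb_src, emb_tgt):
    # seed_pairs: (source word, target word) translations, playing the
    # role of the 22,767 Italian-French pairs used in the paper.
    X = np.array([emb_src[s] for s, t in seed_pairs])
    Y = np.array([emb_tgt[t] for s, t in seed_pairs])
    # Least-squares fit of X @ W ≈ Y (Mikolov et al., 2013): W maps
    # source-space vectors into the target space.
    W, _, _, _ = np.linalg.lstsq(X, Y, rcond=None)
    return W

# After alignment, cosine distances can be computed directly between
# emb_src[word] @ W and the target-language vectors.
```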
4 Evaluation

We perform two types of evaluation. In the first one, the goal is to assess whether the system can correctly identify false friends and cognates in a text. In the second one, we want to check the difference between the terms simplified by a system following our approach and those simplified by a standard frequency-based simplification system.

For the first evaluation, we manually create a set of 108 Italian sentences, each containing one false friend or cognate for French speakers taken from the test set. On each term, we run our algorithm and consider the term a false friend according to two strategies: a) if all 5 most similar words in French are classified as false friends, or b) if the majority of them are classified as false friends. Results are reported in Table 2.

                    P     R     F1
false friends (a)   0.75  0.44  0.55
false friends (b)   0.57  0.88  0.69

Table 2: False friend classification using settings (a) and (b).

The evaluation shows that the two settings lead to two different outcomes. In general terms, the first strategy is more conservative and favours Precision, while the second boosts Recall and F1.

As for the second evaluation, on the same set of sentences, we run our algorithm again, this time trying to classify every content word as being a false friend for French speakers or not. We evaluate this component as part of a simplification system that simplifies only false friends, and we compare this choice with a more standard approach in which only 'unusual' or 'infrequent' terms are simplified. This second choice is implemented by comparing each content word with De Mauro's Dictionary of Basic Italian and simplifying only those that are not listed among the 7,000 entries of the basic vocabulary.

This evaluation shows that, out of 1,035 content words in the test sentences, our simplification approach would simplify 367 words with strategy a), and 823 with strategy b). Based on De Mauro's dictionary, instead, 240 terms would be simplified. Furthermore, only 76 terms would be simplified using both strategy a) and De Mauro's list, with 154 overlaps for strategy b). This shows that the two approaches are rather complementary and based on different principles. This is also evident when looking at the evaluated sentences: with frequency lists like De Mauro's, terms such as accademico and speleologo should be simplified because they are not frequently used in Italian, whereas our approach would not simplify them because they have very similar French translations (académique and spéléologue respectively) and are not classified as false friends by the system. On the other hand, vedere would not be simplified in a standard frequency-based system because it is listed among the 2,000 fundamental words in Italian. However, our approach would identify it as a false friend to be simplified, because vider in French (transl. svuotare) is orthographically very similar to vedere but has a completely different meaning.
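For clarity, a minimal sketch of the two marking strategies compared above; is_false_friend stands in for the SVM decision of Section 2.2 applied to a (wi, candidate) pair, and the names are ours.

```python
def mark_difficult(wi, candidates, is_false_friend, strategy="a"):
    # candidates: the (up to) 5 most similar French words for wi.
    flags = [is_false_friend(wi, c) for c in candidates]
    if not flags:
        return False
    if strategy == "a":
        return all(flags)               # conservative: favours Precision
    return sum(flags) > len(flags) / 2  # majority vote: favours Recall
```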
5 Conclusions

In this work, we have presented an approach supporting personalised simplification, in that it makes it possible to adapt the selection of difficult words for lexical simplification to the native language of L2 learners. To our knowledge, this is the first attempt to deal with this kind of adaptation. The approach is relatively easy to apply to new languages, provided that they have a similar alphabet, since multilingual embeddings are already available and lists of cognates and false friends, although of limited size, can easily be retrieved online.[8]

[8] See for example the Wiktionary entries at https://en.wiktionary.org/wiki/Category:False_cognates_and_false_friends

The work will be extended along different research directions: first, we will evaluate the approach on other language pairs. Then, we will add a lexical simplification module selecting only the words identified as complex by our approach. For this, we can rely on existing simplification tools (Paetzold and Specia, 2015), which could be tuned to adapt the simplification choices to the user's native language as well, for example by changing the candidate ranking algorithm. Finally, it would be interesting to involve L2 learners in the evaluation, with the goal of measuring the effectiveness of different simplification strategies in a real setting.

Acknowledgments

This work has been supported by the European Commission project SIMPATICO (H2020-EURO-6-2015, grant number 692819). We would like to thank Francesca Fedrizzi for her help in creating the gold standard.

References

Fernando Alva-Manchego, Joachim Bingel, Gustavo Paetzold, Carolina Scarton, and Lucia Specia. 2017. Learning how to simplify from explicit labeling of complex-simplified text pairs. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (IJCNLP 2017), Volume 1: Long Papers, pages 295–305, Taipei, Taiwan. Asian Federation of Natural Language Processing.

Gianni Barlacchi and Sara Tonelli. 2013. ERNESTA: A sentence simplification tool for children's stories in Italian. In Computational Linguistics and Intelligent Text Processing: 14th International Conference, CICLing 2013, Proceedings, Part II, pages 476–487, Samos, Greece. Springer, Berlin, Heidelberg.

Joachim Bingel and Anders Søgaard. 2016. Text simplification as tree labeling. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 337–343. Association for Computational Linguistics.

Joachim Bingel, Gustavo Paetzold, and Anders Søgaard. 2018. Lexi: A tool for adaptive, personalized text simplification. In Proceedings of COLING 2018. Association for Computational Linguistics.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information. CoRR, abs/1607.04606.

Chris Brew, David McKelvie, et al. 1996. Word-pair extraction for lexicography. In Proceedings of the 2nd International Conference on New Methods in Language Processing, pages 45–55.

Diana Inkpen and Oana Frunza. 2005. Automatic identification of cognates and false friends in French and English. In Proceedings of RANLP, pages 251–257.

Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hervé Jégou, and Tomas Mikolov. 2016. FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651.

John Lee and Chak Yan Yeung. 2018. Personalizing lexical simplification. In Proceedings of the 27th International Conference on Computational Linguistics, pages 224–232. Association for Computational Linguistics.

Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. 2013. Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168.

Boris New. 2006. Lexique 3: Une nouvelle base de données lexicales. In Actes de la Conférence Traitement Automatique des Langues Naturelles (TALN 2006).

Sergiu Nisioi, Sanja Stajner, Simone Paolo Ponzetto, and Liviu P. Dinu. 2017. Exploring neural text simplification models. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 85–91, Vancouver, Canada. Association for Computational Linguistics.

Gustavo Paetzold and Lucia Specia. 2015. LEXenstein: A framework for lexical simplification. In ACL-IJCNLP 2015 System Demonstrations, pages 85–90, Beijing, China.

Gustavo H. Paetzold and Lucia Specia. 2016. Unsupervised lexical simplification for non-native speakers. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pages 3761–3767, Phoenix, Arizona. AAAI Press.

Alessio Palmero Aprosio and Giovanni Moretti. 2018. Tint 2.0: An all-inclusive suite for NLP in Italian. In Proceedings of the Sixth Italian Conference on Computational Linguistics (CLiC-it 2018), Torino, Italy.

H. Ringbom. 1986. Crosslinguistic influence and the foreign language learning process. In E. Kellerman and M. Sharwood Smith, editors, Crosslinguistic Influence in Second Language Acquisition. Pergamon Press, New York.

Carolina Scarton and Lucia Specia. 2018. Learning simplifications for specific target audiences. In Proceedings of ACL 2018 (Volume 2: Short Papers), pages 712–718. Association for Computational Linguistics.

Carolina Scarton, Alessio Palmero Aprosio, Sara Tonelli, Tamara Martín Wanton, and Lucia Specia. 2017. MUSST: A multilingual syntactic simplification tool. In Proceedings of IJCNLP 2017, System Demonstrations, pages 25–28. Association for Computational Linguistics.

Advaith Siddharthan. 2010. Complex lexico-syntactic reformulation of sentences using typed dependency representations. In Proceedings of the 6th International Natural Language Generation Conference (INLG 2010), Dublin, Ireland.

Adam St Arnaud, David Beck, and Grzegorz Kondrak. 2017. Identifying cognate sets across dictionaries of related languages. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2519–2528, Copenhagen, Denmark. Association for Computational Linguistics.

Robert A. Wagner and Michael J. Fischer. 1974. The string-to-string correction problem. Journal of the ACM, 21(1):168–173.

William E. Winkler. 1990. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage.

Xingxing Zhang and Mirella Lapata. 2017. Sentence simplification with deep reinforcement learning. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 584–594, Copenhagen, Denmark. Association for Computational Linguistics.