Towards Personalised Simplification based on L2 Learners' Native Language

Alessio Palmero Aprosio†, Stefano Menini†, Sara Tonelli†, Luca Ducceschi‡, Leonardo Herzog‡
†FBK, ‡University of Trento
{aprosio,menini,satonelli}@fbk.eu, luca.ducceschi@unitn.it, leonardo.herzog@studenti.unitn.it

Abstract

English. We present an approach to improve the selection of complex words for automatic text simplification, addressing the need to take the native language of L2 learners into account during simplification. In particular, we develop a methodology that automatically identifies terms that are 'difficult' for L2 learners (i.e. false friends) in order to simplify them. We evaluate not only the quality of the detected false friends, but also the impact of this methodology on text simplification compared with a standard frequency-based approach.

Italiano. In this paper we present an approach for automatically selecting the complex words to be simplified, taking the user's mother tongue into account. Specifically, our methodology identifies the terms that are 'difficult' (false friends) for the user in order to propose their simplification. In this context, we evaluate not only the quality of the detected false friends, but also the impact of this personalised simplification compared with standard approaches based on word frequency.

1 Introduction

The task of automated text simplification has been investigated within the NLP community for several years with a number of different approaches, from rule-based ones (Siddharthan, 2010; Barlacchi and Tonelli, 2013; Scarton et al., 2017) to supervised (Bingel and Søgaard, 2016; Alva-Manchego et al., 2017) and unsupervised ones (Paetzold and Specia, 2016), including recent studies using deep learning (Zhang and Lapata, 2017; Nisioi et al., 2017). Nevertheless, only recently have researchers started to build simplification systems that can adapt to users, based on the observation that the perceived simplicity of a document depends strongly on the user profile, including not only specific disabilities but also language proficiency, age, profession, etc. Therefore, in the last few months the first approaches to personalised text simplification have been proposed at major conferences, with the goal of simplifying a document for different language proficiency levels (Scarton and Specia, 2018; Bingel et al., 2018; Lee and Yeung, 2018).

Along this research line, we present in this paper an approach to automated lexical simplification for L2 learners that adapts to the user's mother tongue. To our knowledge, this is the first work taking this aspect into account and presenting a solution that, given an Italian document and the user's mother tongue as input, selects only the words that the user may find difficult given his/her knowledge of another language. Specifically, we automatically detect and simplify the terms that may be misleading for the user because they are false friends, while we do not simplify those that have an orthographically and semantically similar translation in the user's native language (so-called cognates). In multilingual settings, for instance while teaching, learning or translating a foreign language, these two phenomena have proven to be very relevant (Ringbom, 1986), because lexical similarities between the two languages in contact create interference, favouring or hindering the course of learning.
We compare our approach to the selection of words to be simplified with a standard frequency-based one, in which only the terms that are not listed in De Mauro's Dictionary of Basic Italian[1] are simplified, regardless of the user's native language. Our experiments are evaluated on the Italian-French pair, but the approach is generic.

[1] https://dizionario.internazionale.it/nuovovocabolariodibase

2 Approach description

Given a document Di to be simplified, and a native language L1 spoken by the user, our approach consists of the following steps:

1. Candidate selection: for each content word[2] wi in Di, we automatically generate a list of words W1 ⊂ L1 which are orthographically similar to wi. In this phase, several orthographic similarity metrics are evaluated. We keep the 5 terms most similar to wi.

2. False friend and cognate detection: for each of the 5 most similar words in W1, we classify whether or not it is a false friend of wi.

3. Simplification choice: based on the output of the previous steps, the system marks wi as difficult to understand for the user if there are corresponding false friends in L1. Otherwise, wi is left in its original form. When a word is marked as difficult, a subsequent simplification module (not included in this work) should try to find an alternative form (such as a synonym or a description) to make the term more understandable to the user.

[2] Content words are words that carry meaning, such as nouns, adjectives, verbs and adverbs. To extract this information, we use the POS tagger included in the Tint pipeline (Aprosio and Moretti, 2018).

2.1 Candidate Selection

A number of similarity metrics have been presented in the past to identify candidate cognates and false friends; see for example the evaluation in Inkpen and Frunza (2005). We choose three of them, motivated by the fact that we want to have at least one ngram-based metric (XXDICE) and one non ngram-based metric (Jaro/Winkler). To these, we add a more standard metric, Normalized Edit Distance (NED). The three metrics are explained below:

• XXDICE (Brew et al., 1996). It takes into consideration the number of shared extended bigrams[3] and their positions relative to the two strings S1 and S2. The formula is:

    XX(S_1, S_2) = \frac{2 \sum_{(x,y) \in B} \frac{1}{1 + (pos(x) - pos(y))^2}}{xb(S_1) + xb(S_2)}

where B is the set of pairs of shared extended bigrams (x, y), with x in S1 and y in S2. The functions pos(x) and xb(S) return the position of extended bigram x and the number of extended bigrams in string S, respectively.

[3] An extended bigram is an ordered letter pair formed by deleting the middle letter from any three-letter substring of the word.

• NED, Normalized Edit Distance (Wagner and Fischer, 1974). A regular Edit Distance calculates the orthographic difference between two strings by assigning a cost to the minimum number of edit operations (deletion, substitution and insertion, all with cost 1) needed to make them equal. NED is obtained by dividing the edit cost by the length of the longest string.

• Jaro/Winkler (Winkler, 1990). The Jaro similarity metric for two strings S1 and S2 is computed as follows:

    J(S_1, S_2) = \frac{1}{3}\left(\frac{m}{|S_1|} + \frac{m}{|S_2|} + \frac{m - T}{m}\right)

where m is the number of characters in common, provided that they occur in the same (uninterrupted) sequence, and T is the number of transpositions of characters in S1 needed to obtain S2. The Winkler variation of the metric adds a bias if the two strings share a prefix:

    JW(S_1, S_2) = J(S_1, S_2) + (1 - J(S_1, S_2)) \cdot l \cdot p

where l is the number of characters of the common prefix of the two strings, up to four, and p is a scaling factor, usually set to 0.1.
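To make the less standard definitions concrete, here is a minimal Python sketch of XXDICE and NED as described above; the function names are ours, and as a simplification xxdice matches each shared bigram type once, at its first occurrence in either string (an off-the-shelf implementation can be used for Jaro/Winkler).

```python
def extended_bigrams(s):
    # Extended bigrams (footnote 3): the ordered letter pair obtained by
    # deleting the middle letter of every three-letter substring of s,
    # returned together with its position in the string.
    return [(s[i] + s[i + 2], i) for i in range(len(s) - 2)]

def xxdice(s1, s2):
    # XXDICE (Brew et al., 1996) as given in Section 2.1.
    xb1, xb2 = extended_bigrams(s1), extended_bigrams(s2)
    if not xb1 or not xb2:
        return 0.0
    first_pos = {}
    for bg, p in xb2:
        first_pos.setdefault(bg, p)
    total, seen = 0.0, set()
    for bg, p1 in xb1:
        if bg in first_pos and bg not in seen:
            seen.add(bg)
            total += 1.0 / (1.0 + (p1 - first_pos[bg]) ** 2)
    return 2.0 * total / (len(xb1) + len(xb2))

def ned(s1, s2):
    # Normalized Edit Distance: Levenshtein distance with unit costs,
    # divided by the length of the longer string.
    m, n = len(s1), len(s2)
    if max(m, n) == 0:
        return 0.0
    row = list(range(n + 1))
    for i in range(1, m + 1):
        prev, row[0] = row[0], i
        for j in range(1, n + 1):
            cur = min(row[j] + 1,                       # deletion
                      row[j - 1] + 1,                   # insertion
                      prev + (s1[i - 1] != s2[j - 1]))  # substitution
            prev, row[j] = row[j], cur
    return row[n] / max(m, n)

print(ned("général", "generale"))  # 0.375, as in footnote 5 of Section 3
print(ned("general", "generale"))  # 0.125 after stripping accents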
Each of these three measures has some disadvantages. For example, we found that the Jaro/Winkler metric boosts the similarity of words with the same root. On the other hand, applying NED leads to several pairs of words having the same similarity score. As a result, two words that are close according to one metric can be far apart according to another. To overcome this limitation, we balance the three metrics by computing a weighted average of the three scores, tuned on a training set. For details, see Section 3.
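A minimal sketch of this combination step, assuming the xxdice and ned functions above and a jaro_winkler implementation from a string-metrics library; the default weights are the best configuration found in Section 3, and since NED is a distance we fold it in as 1 - NED (our assumption, as the paper does not spell this out).

```python
def combined_similarity(s1, s2, w_xx=0.2, w_jw=0.4, w_ned=0.4):
    # Weighted average of the three orthographic scores; the weights
    # are tuned on the training set (Section 3). NED measures distance,
    # so we assume it enters the average as the similarity 1 - NED.
    return (w_xx * xxdice(s1, s2)
            + w_jw * jaro_winkler(s1, s2)
            + w_ned * (1.0 - ned(s1, s2)))

def top_candidates(word, lexicon, k=5):
    # Rank every lemma of the L1 lexicon and keep the k most similar;
    # ties with the k-th score extend the list (see Section 3).
    scored = sorted(((combined_similarity(word, w), w) for w in lexicon),
                    reverse=True)
    if len(scored) <= k:
        return [w for _, w in scored]
    cutoff = scored[k - 1][0]
    return [w for s, w in scored if s >= cutoff]
```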
2.2 False Friend and Cognate Detection

As for false friend and cognate detection, we rely on an SVM-based classifier and train it on a single feature obtained from a multilingual embedding space (Mikolov et al., 2013), where the user language L1 and the language of the document to be simplified L2 are aligned. In particular, the feature is the cosine distance between the embedding of a given content word wi in the language L2 and the embedding of its candidate false friends or cognates in L1. The intuition behind this approach is that two cognates share their semantics and therefore have a high cosine similarity, as opposed to false friends, whose meanings are generally unrelated. While past approaches to false friend and cognate detection have already exploited monolingual word embeddings (St Arnaud et al., 2017), we employ a multilingual setting for our experiments, so that the semantic distance between the candidate pairs can be measured in their original language without a preliminary translation.
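A sketch of this detection step with scikit-learn, under our own assumptions about the data layout (embeddings as word-to-vector dicts already mapped into a shared space; label 1 for false friend, 0 for cognate); the single cosine-distance feature follows this section, and the radial (RBF) kernel follows Section 3.

```python
import numpy as np
from sklearn.svm import SVC

def cosine_distance(u, v):
    # 1 - cosine similarity: small for cognates (shared semantics),
    # large for false friends (unrelated meanings).
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def train_detector(pairs, labels, emb_it, emb_fr):
    # pairs: (Italian word, French candidate) tuples from the gold
    # standard; emb_it / emb_fr: word -> vector dicts in a shared space.
    X = [[cosine_distance(emb_it[wi], emb_fr[cand])] for wi, cand in pairs]
    clf = SVC(kernel="rbf")  # radial kernel, as in Section 3
    return clf.fit(np.array(X), labels)

# Usage sketch:
# clf = train_detector(train_pairs, train_labels, emb_it, emb_fr)
# clf.predict([[cosine_distance(emb_it["vedere"], emb_fr["vider"])]])
```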
3 Experimental Setup

In our experiments, we consider a setting in which French speakers would like to make Italian documents easier for them to read. Nevertheless, the approach can be applied to any language pair, given that it requires minimal adaptation.

In order to tune the best combination of similarity metrics and to train the SVM classifier, a linguist manually created an Italian-French gold standard containing pairs of words marked as either cognates or false friends. These terms were collected from several lists available on the web. Overall, the Ita-Fr dataset contains a training set of 1,531 pairs (940 cognates and 591 false friends) and a test set of 108 pairs (51 cognates and 57 false friends).

For the candidate selection step, the goal is to obtain, for each term wi in Italian, the 5 French terms with the highest orthographic similarity. Therefore, given wi, we compute its similarity with each term in a French online dictionary[4] (New, 2006) using the three scores described in the previous section. The lemmas were normalized for accents and diacritics, in order to avoid poor results of the metrics in cases like général and generale, where the accented é character would be considered different from e.[5]

[4] http://www.lexique.org/
[5] For example, NED between général and generale returns 0.375 when the two strings are not normalized and 0.125 when they are.

In order to identify the best way to combine the three similarity metrics detailed in Section 2.1, we compute all the possible combinations of weights on 10 groups of 200 word pairs randomly extracted from the 1,531 pairs in the training set, and then keep the combination that scores the highest average similarity.

In Table 1 we report the percentage of times in which the cognate or false friend of wi in the training set would appear among the 5 most similar terms extracted from the French online dictionary according to the three different scores in isolation: XX for XXDICE, JW for Jaro/Winkler and NED for Normalized Edit Distance. We also report the best configuration of the three metrics with the corresponding weights to maximise the presence of a cognate or false friend among the 5 most similar terms. We observe that, while the three metrics in isolation yield similar results, combining them effectively increases the presence of cognates and false friends among the top candidates. This confirms that the metrics capture three different types of similarity, and that it is recommended to take them all into account when performing candidate selection: an approach where every metric contributes to detecting false friend / cognate candidates outperforms the single metrics.

XX    JW    NED   % Top 5
1.0   -     -     64.6
-     1.0   -     65.6
-     -     1.0   65.9
0.2   0.4   0.4   77.3

Table 1: Analysis of the candidate selection strategy using different metrics in isolation and in combination.

For false friend and cognate detection, we proceed as follows. Given a word wi in Italian, we identify the 5 most similar words in French using the 0.2-0.4-0.4 score introduced above. In case of ties in the 5th position, we extend the selection to all the candidates sharing the same similarity value. Each word pair including wi and one of the 5 most similar words is then classified as false friend or cognate with an SVM using a radial kernel, trained on the 1,531 word pairs in the training set. For the multilingual embeddings used to compute the semantic similarity between the Italian words and their candidates, we use the vectors from Bojanowski et al. (2016)[6] trained on Wikipedia data with fastText (Joulin et al., 2016). We chose these resources since they are available both for Italian and French (and several other languages). For the alignment of the semantic spaces of the two languages we use 22,767 Italian-French word pairs collected from an online dictionary.[7]

[6] https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md
[7] http://dizionari.corriere.it/
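The paper cites Mikolov et al. (2013) for the alignment but does not detail the procedure; a common realisation with a seed dictionary is a linear map fitted by least squares, sketched here under that assumption (orthogonality-constrained Procrustes variants are a frequent refinement).

```python
import numpy as np

def learn_alignment(seed_pairs, emb_src, emb_tgt):
    # seed_pairs: (source word, target word) translations, playing the
    # role of the 22,767 Italian-French pairs used in the paper.
    X = np.array([emb_src[s] for s, t in seed_pairs])
    Y = np.array([emb_tgt[t] for s, t in seed_pairs])
    # Least-squares fit of X @ W ≈ Y (Mikolov et al., 2013): W maps
    # source-space vectors into the target space.
    W, _, _, _ = np.linalg.lstsq(X, Y, rcond=None)
    return W

# After alignment, cosine distances can be computed directly between
# emb_src[word] @ W and the target-language vectors.
```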
4 Evaluation

We perform two types of evaluation. In the first one, the goal is to assess whether the system can correctly identify false friends and cognates in a text. In the second one, we want to check the difference between the terms simplified by a system following our approach and those simplified by a standard frequency-based simplification system.

For the first evaluation, we manually create a set of 108 Italian sentences, each containing one false friend or cognate for French speakers taken from the test set. On each term, we run our algorithm and consider the term a false friend according to two strategies: a) if all 5 most similar words in French are classified as false friends, or b) if the majority of them are classified as false friends. Results are reported in Table 2.

                    P     R     F1
false friends (a)   0.75  0.44  0.55
false friends (b)   0.57  0.88  0.69

Table 2: False friend classification using settings (a) and (b).

The evaluation shows that the two settings lead to two different outcomes. In general terms, the first strategy is more conservative and favours Precision, while the second boosts Recall and F1.

As for the second evaluation, on the same set of sentences, we run our algorithm again, this time trying to classify every content word as being a false friend for French speakers or not. We evaluate this component as part of a simplification system that simplifies only false friends, and we compare this choice with a more standard approach in which only 'unusual' or 'infrequent' terms are simplified. This second choice is implemented by comparing each content word with De Mauro's Dictionary of Basic Italian and simplifying only those that are not listed among the 7,000 entries of the basic vocabulary.

This evaluation shows that, out of 1,035 content words in the test sentences, our simplification approach would simplify 367 words with strategy a), and 823 with strategy b). Based on De Mauro's dictionary, instead, 240 terms would be simplified. Furthermore, only 76 terms would be simplified using both strategy a) and De Mauro's list, with 154 overlaps for strategy b). This shows that the two approaches are rather complementary and based on different principles. This is also evident when looking at the evaluated sentences: with frequency lists like De Mauro's, terms such as accademico and speleologo should be simplified because they are not frequently used in Italian, whereas our approach would not simplify them because they have very similar French translations (académique and spéléologue respectively) and are not classified as false friends by the system. On the other hand, vedere would not be simplified in a standard frequency-based system because it is listed among the 2,000 fundamental words in Italian. However, our approach would identify it as a false friend to be simplified, because vider in French (transl. svuotare) is orthographically very similar to vedere but has a completely different meaning.
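For clarity, a minimal sketch of the two marking strategies compared above; is_false_friend stands in for the SVM decision of Section 2.2 applied to a (wi, candidate) pair, and the names are ours.

```python
def mark_difficult(wi, candidates, is_false_friend, strategy="a"):
    # candidates: the (up to) 5 most similar French words for wi.
    flags = [is_false_friend(wi, c) for c in candidates]
    if not flags:
        return False
    if strategy == "a":
        return all(flags)               # conservative: favours Precision
    return sum(flags) > len(flags) / 2  # majority vote: favours Recall
```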
5 Conclusions

In this work, we have presented an approach supporting personalised simplification, in that it makes it possible to adapt the selection of difficult words for lexical simplification to the native language of L2 learners. To our knowledge, this is the first attempt to deal with this kind of adaptation. The approach is relatively easy to apply to new languages, provided that they have a similar alphabet, since multilingual embeddings are already available and lists of cognates and false friends, although of limited size, can easily be retrieved online.[8]

[8] See for example the Wiktionary entries at https://en.wiktionary.org/wiki/Category:False_cognates_and_false_friends

The work will be extended along different research directions: first, we will evaluate the approach on other language pairs. Then, we will add a lexical simplification module selecting only the words identified as complex by our approach. For this, we can rely on existing simplification tools (Paetzold and Specia, 2015), which could be tuned to adapt the simplification choices to the user's native language as well, for example by changing the candidate ranking algorithm. Finally, it would be interesting to involve L2 learners in the evaluation, with the goal of measuring the effectiveness of different simplification strategies in a real setting.

Acknowledgments

This work has been supported by the European Commission project SIMPATICO (H2020-EURO-6-2015, grant number 692819). We would like to thank Francesca Fedrizzi for her help in creating the gold standard.

References

Fernando Alva-Manchego, Joachim Bingel, Gustavo Paetzold, Carolina Scarton, and Lucia Specia. 2017. Learning how to simplify from explicit labeling of complex-simplified text pairs. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (IJCNLP 2017), Volume 1: Long Papers, pages 295–305, Taipei, Taiwan. Asian Federation of Natural Language Processing.

Gianni Barlacchi and Sara Tonelli. 2013. ERNESTA: A sentence simplification tool for children's stories in Italian. In Computational Linguistics and Intelligent Text Processing: 14th International Conference, CICLing 2013, Proceedings, Part II, pages 476–487, Samos, Greece. Springer, Berlin, Heidelberg.

Joachim Bingel and Anders Søgaard. 2016. Text simplification as tree labeling. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 337–343. Association for Computational Linguistics.

Joachim Bingel, Gustavo Paetzold, and Anders Søgaard. 2018. Lexi: A tool for adaptive, personalized text simplification. In Proceedings of COLING 2018. Association for Computational Linguistics.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information. CoRR, abs/1607.04606.

Chris Brew, David McKelvie, et al. 1996. Word-pair extraction for lexicography. In Proceedings of the 2nd International Conference on New Methods in Language Processing, pages 45–55.

Diana Inkpen and Oana Frunza. 2005. Automatic identification of cognates and false friends in French and English. In Proceedings of RANLP, pages 251–257.

Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hervé Jégou, and Tomas Mikolov. 2016. FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651.

John Lee and Chak Yan Yeung. 2018. Personalizing lexical simplification. In Proceedings of the 27th International Conference on Computational Linguistics, pages 224–232. Association for Computational Linguistics.

Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. 2013. Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168.

Boris New. 2006. Lexique 3: Une nouvelle base de données lexicales. In Actes de la Conférence Traitement Automatique des Langues Naturelles (TALN 2006).

Sergiu Nisioi, Sanja Stajner, Simone Paolo Ponzetto, and Liviu P. Dinu. 2017. Exploring neural text simplification models. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 85–91, Vancouver, Canada. Association for Computational Linguistics.

Gustavo Paetzold and Lucia Specia. 2015. LEXenstein: A framework for lexical simplification. In ACL-IJCNLP 2015 System Demonstrations, pages 85–90, Beijing, China.

Gustavo H. Paetzold and Lucia Specia. 2016. Unsupervised lexical simplification for non-native speakers. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pages 3761–3767, Phoenix, Arizona. AAAI Press.

Alessio Palmero Aprosio and Giovanni Moretti. 2018. Tint 2.0: An all-inclusive suite for NLP in Italian. In Proceedings of the Sixth Italian Conference on Computational Linguistics (CLiC-it 2018), Torino, Italy.

H. Ringbom. 1986. Crosslinguistic influence and the foreign language learning process. In E. Kellerman and M. Sharwood Smith, editors, Crosslinguistic Influence in Second Language Acquisition. Pergamon Press, New York.

Carolina Scarton and Lucia Specia. 2018. Learning simplifications for specific target audiences. In Proceedings of ACL 2018 (Volume 2: Short Papers), pages 712–718. Association for Computational Linguistics.

Carolina Scarton, Alessio Palmero Aprosio, Sara Tonelli, Tamara Martín Wanton, and Lucia Specia. 2017. MUSST: A multilingual syntactic simplification tool. In Proceedings of IJCNLP 2017, System Demonstrations, pages 25–28. Association for Computational Linguistics.

Advaith Siddharthan. 2010. Complex lexico-syntactic reformulation of sentences using typed dependency representations. In Proceedings of the 6th International Natural Language Generation Conference (INLG 2010), Dublin, Ireland.

Adam St Arnaud, David Beck, and Grzegorz Kondrak. 2017. Identifying cognate sets across dictionaries of related languages. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2519–2528, Copenhagen, Denmark. Association for Computational Linguistics.

Robert A. Wagner and Michael J. Fischer. 1974. The string-to-string correction problem. Journal of the ACM, 21(1):168–173.

William E. Winkler. 1990. String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage.

Xingxing Zhang and Mirella Lapata. 2017. Sentence simplification with deep reinforcement learning. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 584–594, Copenhagen, Denmark. Association for Computational Linguistics.