Efficient Identification of Formal Analogies

Philippe Langlais
RALI/DIRO, Université de Montréal
Montréal, Canada, H3C 3J7
felipe@iro.umontreal.ca
http://www.iro.umontreal.ca/~felipe

Abstract. Analogical learning is a lazy learning mechanism which maps input forms (e.g. strings) to output ones, thanks to analogies identified in the training material. It has proven effective in a number of Natural Language Processing (NLP) tasks such as machine translation. One challenge with this approach is the identification of analogies in the training material. In this study, we revisit an offline algorithm that has been proposed for enumerating analogies in a corpus. We extend it in order to scale to larger datasets, as well as to deal with new forms. On a task of translating infrequent words of Wikipedia, we observe that our approach is much more efficient at identifying analogies than previously published methods, and that the resulting engine competes with a state-of-the-art phrase-based statistical machine translation (SMT) system.

Keywords: Analogical Learning, Natural Language Processing, Formal Analogy, Machine Translation, Rare Word Translation.

Copyright © 2016 for this paper by its authors. Copying permitted for private and academic purposes. In Proceedings of the ICCBR 2016 Workshops. Atlanta, Georgia, United States of America.

1 Introduction

Mapping forms of an input space into forms of an output one has many applications in NLP, including machine translation, transliteration, grapheme-to-phoneme conversion, etc. Most solutions to this canonical problem consist in learning a mapping function from the input space into the output one, making use of training examples, that is, pairs of input and output forms. This function typically learns correspondences between sequences of symbols in both spaces and uses a so-called decoder for producing an output for an unknown input form.

Analogical learning is a lazy learning technique that relies on the concept of proportional analogy (or analogy for short) for producing a solution. An analogy, noted [x : y :: z : t], is a 4-tuple of entities which reads "x is to y as z is to t", as in [silane : silicon :: methane : carbon]. In this work, we concentrate on formal analogies, that is, analogies at the formal or graphemic level, as in [weak : weaker :: clean : cleaner].

Given a training set of pairs of input and output forms {(x, x')}, and an unknown input form u, analogical learning produces output entities u' by searching for input elements in the training set that define with u an analogy [x : y :: z : u]. Those analogies are assumed to carry over to the output space; that is, u' should be a solution of a so-called analogical equation [x' : y' :: z' : ?] built thanks to the training material. This learning paradigm has been tested in a number of NLP tasks, including grapheme-to-phoneme conversion [14], machine translation [9, 6], transliteration [2, 5], unsupervised morphology acquisition [12], as well as parsing [10, 1].

Building an analogical device requires in the first place identifying (input) analogies for a given form u. This is a key challenge: a brute force approach, which enumerates all 3-tuples of forms and checks which ones define with u an analogy, is an operation cubic in the number of forms in the input space.
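To make the cubic cost concrete, here is a minimal sketch of the brute-force search, where is_analogy stands for a 4-tuple checker (for instance the polynomial algorithm of [11] used later in the paper) and is simply assumed given; the function name is ours.

  def brute_force_analogies(forms, u, is_analogy):
      # Enumerate every (x, y, z) in forms^3 and keep those defining an
      # analogy [x : y :: z : u] with the form u: cubic in len(forms).
      hits = []
      for x in forms:
          for y in forms:
              for z in forms:
                  if is_analogy(x, y, z, u):
                      hits.append((x, y, z))
      return hits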
A more practical solution has been proposed in [7], but it is essentially an online method, which means that the response time at test time (which matters in practice) is impacted. In this paper, we revisit and adapt an offline algorithm presented in [8] which enumerates all the formal analogies in a training set. We show that this algorithm is only applicable to small datasets (a few thousand forms) and propose a simple yet effective heuristic to scale to much larger datasets. We further adapt this algorithm to the task of identifying analogies involving a form unseen at training time. We evaluate our proposal on a task of translating rare words of English Wikipedia into French.

In Section 2, we discuss the issue of efficiently identifying analogies and propose our solution to the problem. We present the datasets we used in Section 3 and report in Section 4 calibration experiments that show the effectiveness of our approach. We applied it to translate rare words of Wikipedia; the results are reported in Section 5. We conclude in Section 6.

2 Identifying Analogies

We first introduce our notations. A form w is a sequence of symbols (a string) over an alphabet A = {a_1, ..., a_|A|} on which an arbitrary order is assumed. We note |w|_c the number of occurrences of the symbol c in w. The Parikh vector of a form w, noted π(w), is given by the function π : A* → N^|A| with π(w) = (|w|_{a_1}, ..., |w|_{a_|A|}). We use the notation [x : y :: z : t] to denote a formal analogy. An analogical equation is an analogy where the last form is missing, and we note [x : y :: z : ?] the set of its solutions (analogical equation solvers typically produce several solutions to an equation). Last, we call I the set of input forms available at training time.

2.1 Online Searching

A brute force enumeration of analogies is cubic in the size of I, which is manageable for tiny input spaces (a few hundred forms) only. In [7], the authors developed a strategy for speeding up the search, whose main idea is to exploit the following property of formal analogies [10]:

  [x : y :: z : u]  ⇒  |x|_c + |u|_c = |y|_c + |z|_c,  ∀c ∈ A   (1)

They describe a data structure which maps any vector v to the set of forms in I that have v as Parikh vector, f_I(v) = {w ∈ I : π(w) = v}, and an algorithm for computing τ_{I,u}(x), the set of pairs of forms in the input space that verify the count constraints imposed by the forms x and u in the right-hand side of Equation 1:

  τ_{I,u}(x) = {(y, z) ∈ I² : π(x) + π(u) = π(y) + π(z)}   (2)

Their search strategy consists in computing τ_{I,u}(x) for all forms x in I, then checking that the resulting tuples (x, y, z, u) are true analogies (analogy checking is carried out by the polynomial algorithm described in [11]). For input spaces of more than several thousand elements, it is necessary to sample the forms x, thus risking missing useful analogies.

2.2 Offline Searching

In [8], Lepage describes an algorithm for enumerating all the analogies in a set of forms. A tree structure is proposed which encodes the set Σ of all the differences of two Parikh vectors one might find in this set of forms (we call them signatures), and which associates to each signature the set of pairs of forms that have this difference (we call them stems). The details of the data structure and of the algorithm for building it are rather intricate, but for the sake of our study, it suffices to think of this data structure as a function g_I mapping a signature σ ∈ Z^|A| to the set of its stems:

  g_I(σ) = {(x, y) ∈ I² : π(y) − π(x) = σ}

Any two stems (x, y) and (z, t) sharing a signature verify the right-hand side of Equation 1, so the tuple (x, y, z, t) is a candidate analogy.
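A minimal sketch of these two ingredients, with Parikh vectors stored as Counter objects and g materialized as a plain dictionary from signatures to lists of stems (this is not the tree structure of [8], only its naive dictionary equivalent; names are ours):

  from collections import Counter, defaultdict

  def signature(x, y):
      # pi(y) - pi(x), kept as a hashable, canonical tuple of (symbol, count)
      diff = Counter(y)
      diff.subtract(Counter(x))
      return tuple(sorted((c, n) for c, n in diff.items() if n != 0))

  def build_g(forms):
      # g(sigma) = all ordered pairs (x, y) of training forms with pi(y) - pi(x) = sigma
      g = defaultdict(list)
      for x in forms:
          for y in forms:
              if x != y:
                  g[signature(x, y)].append((x, y))
      return g

For instance, signature("fixity", "fixative") yields the counts +1 for a, e, v and -1 for y. This naive dictionary version also makes the memory issue discussed below easy to see: the number of stems, and hence of keys, grows roughly quadratically with the number of forms.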
We qualify a signature as productive if at least two of its stems define an analogy. We write x --σ--> y to denote the fact that π(y) − π(x) = σ. Figure 1 illustrates two productive signatures and a few of their stems; only symbols with a non-null count are represented. For instance, the signature σ = e(1).a(1).y(-1).v(1) "encodes" how adjectives in English may be derived from nouns. We have alternity --σ--> alternative because we can transform the first form into the second one by removing the symbol y and adding (in a specific order not prescribed by the signature) the symbols e, a and v. Note that some stems may be characterized as noisy, as for instance (tiny, native), since native is not an adjective derived from tiny.

  e(1).a(1).y(-1).v(1)          i(1).a(1).t(1).v(1)
  (fixity, fixative)            (purge, purgative)
  (lubricity, lubricative)      (compare, comparative)
  (alternity, alternative)      (administer, administrative)
  (tiny, native)                (cure, curative)

Fig. 1. Two productive signatures extracted from a set of English words, and a few of their associated stems.

There are at least two issues with the offline strategy proposed by Lepage. The first one is scaling: as we shall see in Section 4, on a computer with 8 GB of memory, we found the construction of g practical only for input spaces of around 5 000 forms. The second issue is that in applications such as machine translation, we are interested in treating new (unknown) forms, and this requires running the algorithm each time a new form has to be treated. We describe in the sequel our solutions to both issues.

2.3 Scaling to larger Datasets

We tried two strategies for the scaling issue. The first one restrains the signatures acquired, for instance by limiting the number of symbols of the alphabet on which the two forms of a stem may disagree. A limit of 3 would for instance reject the two signatures exemplified in Figure 1, since they both involve 4 symbols. We show in Section 4 that this heuristic allows the approach of Lepage to scale nicely to larger datasets. We found however, in the application we study here (machine translation), that by doing so we lose useful signatures, and therefore useful analogies. Also, by relaxing the constraints on the signatures considered, we rapidly face scaling issues again. Thus, we tested a simple yet practical two-pass approach: we first collect the signatures observed on a subset of the input forms; in a second pass, we consider the entire input space and collect only the stems associated with those signatures. Of course, a mix of both strategies can be considered, and we found such a mix adequate in our translation experiments (see Section 5).

2.4 Dealing with new Forms

In many applications of interest, we want to identify analogies [x : y :: z : u] for a form u unseen at training time. We propose an online strategy for doing so which assumes the existence of a precomputed function g, that is, a set of signatures Σ and their associated stems computed on the training set. We collect all the forms z in I that can be transformed by one signature into u. Formally, we compute:

  S_I(u) = {(σ, z) : σ ∈ Σ, z ∈ I, z --σ--> u}   (3)

For any pair (σ, z) in S_I(u), g_I(σ) gathers the stems (x, y) in I such that the tuples (x, y, z, u) verify the right-hand side of Equation 1. This is illustrated in Figure 2, where for instance (probity, probative, adversity, adversative) is a (candidate) tuple identified for the form adversative, unseen at training time.
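A naive, hash-based realisation of Equation (3) is a single pass over the training forms, assuming Σ has been materialized as a Python set of signatures (for instance the keys of the dictionary sketched above); this is only the simplest equivalent of the computation, not the efficient procedure described next.

  from collections import Counter

  def candidate_pairs(u, forms, signatures):
      # S(u): all (sigma, z) with z a training form and sigma a known signature
      # such that pi(u) - pi(z) = sigma, i.e. z can be turned into u by sigma.
      S = []
      pi_u = Counter(u)
      for z in forms:
          diff = pi_u.copy()
          diff.subtract(Counter(z))          # pi(u) - pi(z)
          sigma = tuple(sorted((c, n) for c, n in diff.items() if n != 0))
          if sigma and sigma in signatures:
              S.append((sigma, z))
      return S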
Thanks to the data structure involved in Lepage's algorithm, computing S_I(u) is time efficient (a few milliseconds in our experiments; describing the computation would require too many new notations, but conceptually we compute the intersection of two rational languages: one that recognizes all the forms in I, the other that recognizes all the forms reachable from u by any signature in Σ). Most of the online computation for dealing with a new form is thus spent checking, among the candidates, those that are analogies.

  S(adversative) = { (adversity, e(1).a(1).y(-1).v(1)), (adverse, i(1).a(1).t(1).v(1)),
                     (derivative, i(1).a(-1).s(-1)), (revised, a(2).t(1).v(1)), ... }

  * z = adversity, σ = e(1).a(1).y(-1).v(1)
    g(σ) = {(probity, probative), (multiplicity, multiplicative), (fixity, fixative), ...}
  * z = adverse, σ = i(1).a(1).t(1).v(1)
    g(σ) = {(lucre, lucrative), (preserve, preservative), ...}
  ...

Fig. 2. Illustration of the online identification of analogies involving the form adversative, absent from the training material.

3 Datasets

We downloaded the June 2013 Wikipedia dump in English (4 262 946 articles) and French (1 398 932 articles). We concentrated on infrequent English words, that is, words seen at most 25 times (6.8 million words, i.e. 92% of the words in English Wikipedia, occur fewer than 26 times). Those words are typically harder to translate by automatic methods, and are more difficult to find in bilingual lexicons and in parallel or comparable resources. Many rare words in Wikipedia are named entities (persons, places, etc.) that are not necessarily interesting from a translation perspective (at least for the English-French language pair). We intersected the less frequent words in Wikipedia with a large in-house bilingual lexicon (107 799 entries), which led us to 35 209 English words with at least one (but possibly several) reference translations. We split them into two disjoint parts: a training set of 32 245 English words (80 311 English-French pairs), and a test set of 1 000 English words seen fewer than 25 times in English Wikipedia, along with their translations (1 219 pairs). Some English words are close to their translation (e.g. hexametric/hexamètre), while others are not, such as unloose, which translates, according to our bilingual lexicon, into desserrer, détendre, and relâcher.

4 Identifying Analogies in a Corpus

We first applied our implementation of Lepage's algorithm to identify analogies in a set of forms I. We soon realized that the number of signatures found by the algorithm increases very fast with the number of forms considered in the input space. Table 1 (left part) reports the main characteristics of the algorithm as a function of the number of words considered. On a computer with 8 GB of memory, we could only treat 5 000 words (line 3). The data structure we built for encoding the function g_I described in Section 2.2 contains over 166 million nodes, and no fewer than 11 967 571 signatures were collected. Most of them (96.4%) involve only one stem, which means that they cannot yield any analogy. We found only 9 856 (0.08%) productive signatures, leading to 762 259 candidate analogies, 15 814 (2.1%) of which are formal analogies.

            pass-1                                        pass-2
  words   (s)   nodes (M)   |Σ| (M)   prod.     ana.     (s)   nodes (k)   ana. (M)
  1 000    29       8.42      0.50       25       26      33        0.5       0.37
  2 000   105      31.07      1.98      320      396     282        5.9       2.42
  5 000   510     166.33     11.97    9 856   15 814     909      155.4       4.74

Table 1. Statistics of our implementation of Lepage's algorithm for identifying analogies in a corpus, as a function of the number of words treated. (s) stands for the time in seconds; for better readability, figures are sometimes given in millions (M) or thousands (k). See the text for details.
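The candidate and analogy counts of Table 1 correspond to an enumeration of this kind: every signature with at least two stems contributes all pairs of its stems as candidate analogies, each of which is then confirmed or rejected by the analogy test. A minimal sketch, in which is_analogy again stands for the polynomial checker of [11] and is assumed given:

  from itertools import combinations

  def enumerate_analogies(g, is_analogy):
      # g maps a signature to its list of stems (x, y). Signatures with a
      # single stem cannot yield any analogy; among the others, those for
      # which at least one pair of stems passes the test are the productive ones.
      found = []
      for sigma, stems in g.items():
          if len(stems) < 2:
              continue
          for (x, y), (z, t) in combinations(stems, 2):
              if is_analogy(x, y, z, t):     # candidate [x : y :: z : t]
                  found.append((x, y, z, t))
      return found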
Having a two-pass strategy allows us to visit the full training set, keeping fixed the productive signatures identified in the first pass. The impact of this is characterized in the right part of Table 1. For the largest configuration, the number of analogies identified increases from 15 814 to over 4.7 million. Most of the time of the second pass is spent checking that pairs of stems are formal analogies. It is also worth noting that the second pass has a much smaller memory footprint (155k nodes compared to almost 12M nodes in the first pass), since only productive signatures are encoded.

In order to digest more of the training material while collecting the set of signatures during the first pass, we must apply some filters. We explored two simple ones. First, we control the size of the signatures, that is, the number of symbols of the alphabet on which two forms may disagree in their number of occurrences. We also impose a limit on the maximum difference in the count of a single symbol within a stem. Table 2 reports statistics when limiting the number of symbols in a signature to 4, and the maximum difference between counts to 3 (relaxing these constraints increases the number of signatures and analogies identified, but at the price of an increased computation time).

  words    (s)   nodes (M)   |Σ| (M)     prod.    analogies
  1 000      0       0.35      0.00         11           12
  5 000      8       3.05      0.03      3 016        8 412
 10 000     21       6.65      0.10     19 265      118 455
 20 000     53      12.85      0.27     78 834    1 426 711
 30 000     92      18.14      0.44    150 393    5 307 632

Table 2. Influence of limiting the size of the signatures to 4 during the first pass of our algorithm.

Comparing with Table 1, we observe that filtering drastically reduces computation time. For instance, on a corpus of 5 000 forms (line 2), the processing time drops from 510 s to 8 s, leading to fewer analogies in the first pass (8 412 against 15 814). But since the full corpus can now be parsed, this leads to configurations that identify a larger set of signatures, and thus more analogies in the end (5.3 versus 4.7 million analogies).

5 Translation Experiments

So far we have discussed the problem of identifying analogies in an input space. In this section, we compare a number of analogical engines for translating words unseen at training time. For this, we use the training set to project each input analogy into a target equation, which we solve. The solutions generated are simply sorted in decreasing order of frequency (a solution may be produced by several equations). This is illustrated in Figure 3. This is very similar to the approach described in [5], with the notable exception that we aggregate solutions by frequency instead of resorting to a classifier for distinguishing good from bad solutions. Our main motivation is to compare strategies for identifying input analogies, and training a classifier would have been too cumbersome here. Nevertheless, we also compare the analogical devices to a strong baseline, a phrase-based translation engine trained at the character level.

  [probité : probatoire :: adversité : ?]          → adversatoire, ...
  [multiplicité : multiplicatif :: adversité : ?]  → adversatif, ...
  [lucre : lucratif :: adverse : ?]                → adversatif, ...
  [conserve : conservateur :: déficitaire : ?]     → déficitairateur, ...
  ...
  (adversatif, 257) (adversante, 73) (contrairateur, 72) (déficitairateur, 71) (défavorablateur, 65) ...

Fig. 3. Equations solved for the analogies identified in Figure 2 when translating the English word adversative into French.
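A sketch of the frequency aggregation illustrated in Figure 3, assuming a solver solve(x, y, z) that returns the solutions of the equation [x : y :: z : ?]; the solver itself, as well as the projection of input analogies into target equations, is not shown, and the names are ours.

  from collections import Counter

  def rank_translations(target_equations, solve):
      # target_equations: (x', y', z') triples projected from the input
      # analogies; each equation may produce several solutions.
      votes = Counter()
      for x, y, z in target_equations:
          votes.update(solve(x, y, z))
      # solutions produced by many equations come first
      return [form for form, _ in votes.most_common()]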
5.1 Metrics

Each translation engine we tested was asked to produce a ranked list of translations. We measure precision and recall at ranks 1 and 100. Given a test set {(e_i, r_i)}_{i ∈ [1,N]} of pairs of source words e_i and their reference translations r_i (a term may have several translations), and c_i a ranked list of candidate translations for each e_i, those metrics are computed as (δ being the Kronecker symbol):

  n_k = Σ_i δ(|r_i ∩ c_i^k| > 0)      P@k = n_k / Σ_i δ(|c_i| > 0)      R@k = n_k / N   (4)

where c_i^k designates the k top candidate translations produced for e_i. It should be clear that recall at rank 1 is, in our case, the so-called accuracy measure. Also, for translation engines that always propose (at least) one candidate translation per test form, precision equals recall, while systems that are sometimes silent (such as our analogical devices) have a precision higher than their recall.

The performance of the translation engines we compared is measured in two settings: one called Trans, in which no filtering of the candidate list is performed, and one called Align, in which we keep only the candidates that belong to the French vocabulary of Wikipedia (we counted 2 942 067 different tokens in French Wikipedia). This second setting evaluates alignment technology, since the task is not to generate new words, but to identify the words of the target language that are the most likely translations of a test form. It should be noted that correct translations may be filtered out in this scenario.
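Equation (4) can be computed directly from the ranked candidate lists; a minimal sketch, with variable names of our own:

  def precision_recall_at_k(references, candidates, k):
      # references[i]: set of reference translations r_i of test form e_i
      # candidates[i]: ranked list of candidate translations c_i (empty if silent)
      n_k = sum(1 for r, c in zip(references, candidates) if r & set(c[:k]))
      answered = sum(1 for c in candidates if c)
      p_at_k = n_k / answered if answered else 0.0
      r_at_k = n_k / len(references)
      return p_at_k, r_at_k

With k = 1, R@1 is the accuracy mentioned above.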
5.2 Statistical Machine Translation

In a nutshell, SMT seeks the best translation ê of a sentence f using a log-linear combination of models h_i, including a language model p(e), which scores how likely a hypothesis is in the target language, and a translation model p(f|e), which predicts the likelihood that two sentences are translations of each other:

  ê = argmax_e p(f|e) × p(e) ≈ argmax_e exp( Σ_i λ_i h_i(e, f) )   (5)

We trained such a system at the character level, very similarly to the approach described in [3]. Such systems have been massively used as a key component of tasks such as transliteration (see for instance the NEWS evaluation campaigns) or machine translation between closely related language pairs [13].

We used the Moses toolkit [4] for training and testing the SMT engine. The trigram language model (we tried other n-gram orders without better results) was trained on the target part of the training material. We tuned the coefficients λ_i given to each model embedded in Equation 5 on a development corpus of 500 entries of our bilingual lexicon (disjoint from the training set), using the MERT method implemented in Moses. This development corpus is not exploited by analogical learning (to be fairer to the analogical approach, we could have added it to its training set, since no tuning is performed there).

5.3 Results

The performance of the translation engines we tested is reported in Table 3 for the two settings described above (Trans and Align). We varied the size of the training material used for collecting signatures (see Section 2.3). We observe overall (bottom part) that it is preferable to consume as much of the training material as possible. Differences between the 20k and 30k variants are however not very marked; therefore, we did not train larger configurations.

We implemented and tested a number of variants of the approach described in [7]. We report (line LY) the best configuration we obtained in terms of accuracy. It compares globally to the largest configurations of our approach, the 30k variant being at a slight advantage. More importantly, the time required for identifying analogies at runtime is 0.03 s per test form for the 20k configuration, compared to 7 seconds for LY, a clear advantage of the batch procedure of Lepage we adapted. We tested faster variants of [7], but they lead to much lower performance. It is worth noting that LY identifies slightly more analogies than the variants of our proposed strategy; that this does not translate into better performance indicates that the signature information used by our solution is useful.

We observe that the statistical translation engine delivers better performance at rank 1. The analogical devices seem to deliver more diversified translations, since their precision and recall at rank 100 outperform those of the SMT system. Recall that in this study we did not attempt to classify good from bad candidates, which has been shown, for instance in [5], to produce better results than the aggregation by frequency we conducted here.

Results in the Align setting overall confirm these observations and reinforce the fact that the analogical device produces more diversified translations than the statistical engine. In fact, for around 60% of our test forms, the analogical device could produce a useful translation, while this rate peaks at 49% for the statistical engine. Because of this, it is likely that combining the statistical and analogical engines would lead to better performance overall. To illustrate this, we tried one simplistic combination in the Align setting, in which Moses is trusted whenever it proposes a solution, and the analogical engine (the 20k variant) is used otherwise. This increases accuracy (R@1) to 50.9%, an absolute gain of over 5 points over the SMT engine, and 8 points over the best analogical one.
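The "SMT + 20k" line of Table 3 below amounts to this simple back-off; a sketch, with names of our own:

  def combine(smt_outputs, analogical_outputs):
      # One ranked candidate list per test form for each engine; an empty
      # list means the engine is silent on that form. Moses is trusted
      # whenever it proposes something, the analogical engine otherwise.
      return [smt if smt else ana
              for smt, ana in zip(smt_outputs, analogical_outputs)]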
              Trans                                  Align
         P@1   R@1   P@100  R@100  sil.        P@1   R@1   P@100  R@100  sil.
  5k     23.6  19.0   53.5   43.1   194        46.5  32.8   70.7   49.9   294
  10k    27.7  25.2   55.8   50.9    90        51.0  40.7   70.2   56.0   202
  20k    27.0  25.7   57.1   54.4    47        50.4  42.8   72.6   61.7   150
  30k    28.1  26.9   56.8   54.3    44        51.0  43.3   71.3   60.5   151
  LY     27.2  26.2   56.2   54.1    37        51.5  41.0   67.9   54.1   204
  SMT    34.0  34.0   49.4   49.4     0        62.7  45.6   67.9   49.4   273
  SMT + 20k                                    55.1  50.9   64.4   59.5    76

Table 3. Translation (Trans) and alignment (Align) performance on the test set of 1 000 rare English words of Wikipedia. sil. is the number of test forms without a solution.

6 Conclusion

We proposed an adaptation of the algorithm of [8] for listing all the analogies in a corpus. We investigated ways to reduce the computation time and memory footprint of this algorithm. We also proposed an extension of this algorithm for listing, online, the analogies involving an unseen form. We applied our solution to the task of translating rare words of Wikipedia. It leads to slightly better performance than the approach described in [7], at an order of magnitude faster running time. We also show that while a statistical engine delivers better translations at rank 1, a simple combination of both systems outperforms each system individually.

Acknowledgments. This work has been funded by the Natural Sciences and Engineering Research Council of Canada.

References

1. Ben Hassena, A.: Apprentissage analogique par analogie de structures d'arbres. Ph.D. thesis, Univ. de Rennes I, France (2011)
2. Dandapat, S., Morrissey, S., Naskar, S.K., Somers, H.: Mitigating problems in analogy-based EBMT with SMT and vice versa: a case study with named entity transliteration. In: PACLIC. Sendai, Japan (2010)
3. Finch, A., Sumita, E.: Transliteration using a phrase-based statistical machine translation system to re-score the output of a joint multigram model. In: 2nd Named Entities Workshop (NEWS'10). pp. 48-52. Uppsala, Sweden (2010)
4. Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A., Herbst, E.: Moses: open source toolkit for statistical machine translation. In: 45th ACL, Interactive Poster and Demonstration Sessions. pp. 177-180 (2007)
5. Langlais, P.: Mapping source to target strings without alignment by analogical learning: a case study with transliteration. In: 51st ACL. pp. 684-689 (2013)
6. Langlais, P., Patry, A.: Translating unknown words by analogical learning. In: EMNLP. pp. 877-886. Prague, Czech Republic (2007)
7. Langlais, P., Yvon, F.: Scaling up analogical learning. In: 22nd COLING (posters). pp. 51-54. Manchester, United Kingdom (2008)
8. Lepage, Y.: Analogies between binary images: application to Chinese characters. In: Prade, H., Richard, G. (eds.) Computational Approaches to Analogical Reasoning: Current Trends, pp. 25-57. Studies in Computational Intelligence, Springer-Verlag (2014)
9. Lepage, Y., Denoual, E.: Purest ever example-based machine translation: detailed presentation and assessment. Mach. Translat. 19, 25-252 (2005)
10. Lepage, Y., Shin-ichi, A.: Saussurian analogy: a theoretical account and its application. In: 7th COLING. pp. 717-722 (1996)
11. Stroppa, N.: Définitions et caractérisations de modèles à base d'analogies pour l'apprentissage automatique des langues naturelles. Ph.D. thesis, Telecom Paris, ENST, Paris, France (2005)
12. Stroppa, N., Yvon, F.: An analogical learner for morphological analysis. In: 9th CoNLL. pp. 120-127. Ann Arbor, USA (2005)
13. Tiedemann, J.: Character-based PSMT for closely related languages. In: 13th EAMT. pp. 12-19 (2009)
14. Yvon, F.: Paradigmatic cascades: a linguistically sound model of pronunciation by analogy. In: 35th ACL. pp. 429-435 (1997)