Character–position arithmetic for analogy questions between word forms

Yves Lepage*
IPS, Waseda University
2-7 Hibikino, Wakamatsu-ku, Kitakyushu-shi, 808-0135 Fukuoka-ken, Japan
yves.lepage@waseda.jp

* This work was supported by a JSPS Grant, Number 15K00317 (Kakenhi C), entitled Language productivity: efficient extraction of productive analogical clusters and their evaluation using statistical machine translation.

Copyright © 2017 for this paper by its authors. Copying permitted for private and academic purposes. In Proceedings of the ICCBR 2017 Workshops, Trondheim, Norway.

Abstract. We show how to answer analogy questions A : B :: C : D of unknown D between word forms, by essentially relying on the basic arithmetic equality D[iB − iA + iC] = B[iB] − A[iA] + C[iC] on characters and positions at the same time. We decompose the problem into two steps: specification and decoding. We examine several techniques to implement each of these two steps. We perform experiments on a set of positive and negative examples and assess the accuracy of combinations of techniques. We then evaluate the performance of the best combination of techniques on a large set of more than 40 million analogy questions from the training data of a shared task in morphology. We obtain the correct answer in 94 % of the cases.

Keywords: Formal analogy, analogy questions, character–position arithmetic.

1 Introduction

In this paper, we address the problem of answering analogy questions of the type A : B :: C : D between word forms where the unknown is D. Our proposal consists in relying essentially on the intuitive basic arithmetic equality D = B − A + C. We propose to write this arithmetic equality using characters and positions at the same time:

    D[iB − iA + iC] = B[iB] − A[iA] + C[iC]    (1)

The use of this arithmetic equality is directly inspired by the famous equality between vectors proposed in [7] to answer analogy questions between words in the framework of distributional semantics. This is now referred to as vector arithmetic and is always exemplified with:

    queen ≈ king − man + woman    (man : king :: woman : queen)    (2)

where each word stands for its vector representation.

This paper gives empirical support for the use of Equality (1) to answer analogy questions (which only involve commutation¹) between word forms (not meaning). For relevance to morphology, no knowledge other than equality of characters is used; in particular, the order of letters in an alphabet, or the like, is not used.

¹ See [5] or [6, middle of p. 161]. We exclude analogies of repetition, e.g., Indonesian guru : guru-guru :: pelajar : pelajar-pelajar; reduplication, e.g., Ancient Greek λύω : λέλυκα :: παύω : πέπαυκα; and mirroring, e.g., abc : wxyz :: cba : zyxw.

The rest of the paper is structured as follows. Section 2 shows how the answer to an analogy question can be specified by a character–position matrix computed using Equality (1). Specification refinements are introduced to solve problematic cases. Section 3 shows that decoding the answer from the character–position matrix can be viewed as an assignment problem and thus solved using a standard algorithm. Section 4 summarises the two previous sections by giving an algorithm for the proposed method. This proposed method is validated by two series of experiments in Section 5.

2 Specifying the answer of an analogy question

2.1 Known features of the answer to an analogy question

From previous research [4,10,2,3], it is known that the answer D to A : B :: C : D is partially determined. In particular, its length, which characters it contains, and their number of occurrences, are known. In mathematical notation:

    A : B :: C : D ⇒  |D| = |B| − |A| + |C|
                      |D|c = |B|c − |A|c + |C|c, ∀c    (3)

where |X| stands for the length of string X and |X|c for the number of occurrences of character c in string X. The above equations are yet another instance of the general arithmetic equality D = B − A + C. In addition, some work [6] states that the LCS distances, noted d below, between pairs of terms are equal:

    A : B :: C : D ⇒  d(A, B) = d(C, D)
                      d(A, C) = d(B, D)    (4)
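For concreteness, Equation (3) translates into a few lines of code. The following Python sketch is only an illustration (the function name known_features is an arbitrary choice): it computes the length of D and its multiset of characters, and signals an unsolvable question when a count would become negative.

    def known_features(A: str, B: str, C: str):
        """Features that the answer D of A : B :: C : D must have, according
        to Equation (3): its length and, for every character, its number of
        occurrences.  Returns None when Equation (3) cannot be satisfied."""
        length_D = len(B) - len(A) + len(C)
        if length_D < 0:
            return None
        counts_D = {}
        for ch in set(A) | set(B) | set(C):
            n = B.count(ch) - A.count(ch) + C.count(ch)
            if n < 0:             # a negative count: the question has no solution
                return None
            if n > 0:
                counts_D[ch] = n
        return length_D, counts_D

    # known_features('hearty', 'unheartily', 'lucky')
    # -> length 9 and the multiset of characters of 'unluckily'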
2.2 Character–position arithmetic

In analogies of commutation, pieces are combined and exchanged to compose the four different terms of the analogy A, B, C and D. It is always possible to put some pieces common to A and B in correspondence with some pieces common to C and D (or the same by exchanging B and C in this statement). For instance, in

    hearty : unheartily :: lucky : unluckily,    (5)

the pieces 'heart', common to A and B, and 'luck', common to C and D, correspond. They are exchanged on each side of the symbol ::. We can write A[1 . . . 5] = B[3 . . . 7] = heart and C[1 . . . 4] = D[3 . . . 6] = luck; and further D[3 . . . 6] = luck = B[3 . . . 7] − A[1 . . . 5] + C[1 . . . 4] = heart − heart + luck, where 3 = 3 − 1 + 1 and 6 = 7 − 5 + 4. This equality combines several instances of Equation (1) for several instances of indices (iA, iB, iC, iD): (1, 3, 1, 3), (2, 4, 2, 4), . . . , (5, 7, 4, 6).

To summarise the previous remarks, Proposition (6), which embeds Equality (1), can be laid down as a hypothesis to be tested. Importantly, it only makes sense if either A[iA] = B[iB] or A[iA] = C[iC] holds, i.e., if either (iA, iB) or (iA, iC) are match points in the matrices representing A : B and A : C (see Figure 1).

    A : B :: C : D ⇒ ∀iD ∈ ℕ such that 1 ≤ iD ≤ |D|,
                     ∃(iA, iB, iC) ∈ ℕ³ such that
                         1 ≤ iA ≤ |A|, 1 ≤ iB ≤ |B|, 1 ≤ iC ≤ |C|,
                         iD = iB − iA + iC,
                         D[iD] = B[iB] − A[iA] + C[iC]    (6)

2.3 Specification of the answer using a character–position matrix

Enforcing Proposition (6) allows us to specify the solution of an analogy question, as illustrated in Figure 1. For each index in the string D, all possible combinations of indices in A, B and C corresponding to match points in matrices A : B and A : C are examined, and the number of instances of Equality (1) for that index in D is memorised in matrices B : D and C : D. By adding up these values, for each character that we know will appear in D and for each index in D, a character–position matrix can be built, from which the answer can ultimately be decoded.

[Figure 1: on the left, the matrices A : B and A : C with their match points and the derived matrices B : D and C : D; in the middle, the character–position matrix; on the right, the decoded answer.]

Fig. 1. Specification of the answer of the analogy question ab : ac :: db : x using character–position arithmetic. The characters in the answer and their number of occurrences are known: {c : 1, d : 1}. The black squares (■) in the upper part on the left visualise the match points in the matrices A : B and A : C. The lower part gives the number of instances of Equality (1) in each cell of matrices B : D and C : D using the match points and character–position arithmetic. In the middle, the character–position matrix summarises the evidence for each character and position (. stands for zero). This character–position matrix can be decoded into the answer dc on the right.
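The counting step just described can be sketched in a few lines. The sketch below is a minimal illustration only: it uses every match point and leaves out the refinements introduced in Sections 2.4 to 2.6 (virtual markers, diagonal bands, EM re-estimation), and its function name is an arbitrary choice.

    def character_position_matrix(A: str, B: str, C: str):
        """Tally, for each position iD and each character expected in D, the
        number of instances of Equality (1) supporting that character at that
        position (minimal version: every match point votes)."""
        # Known features of D (Equation (3)); a solvable question is assumed.
        length_D = len(B) - len(A) + len(C)
        counts_D = {ch: B.count(ch) - A.count(ch) + C.count(ch) for ch in set(A + B + C)}
        chars_D = {ch for ch, n in counts_D.items() if n > 0}

        votes = {(i, ch): 0 for i in range(1, length_D + 1) for ch in chars_D}

        def vote(i, ch):
            if (i, ch) in votes:      # ignore positions outside D and characters not in D
                votes[(i, ch)] += 1

        # Match points in A : B make the characters of C vote ...
        for iA, a in enumerate(A, 1):
            for iB, b in enumerate(B, 1):
                if a == b:
                    for iC, c in enumerate(C, 1):
                        vote(iB - iA + iC, c)
        # ... and match points in A : C make the characters of B vote.
        for iA, a in enumerate(A, 1):
            for iC, c in enumerate(C, 1):
                if a == c:
                    for iB, b in enumerate(B, 1):
                        vote(iB - iA + iC, b)
        return votes

    # For ab : ac :: db : x (Figure 1), the only cells with a non-zero count
    # are (1, 'd') and (2, 'c'), which decode into the answer dc.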
2.4 Virtual beginning and end match points

We now turn to a first problematic case where Proposition (6) is not verified, although it should be: work : sing :: you work : you sing. In this case, no triple of indices in A, B and C can be found for the first position in D, which corresponds to the character y. The character y in D can only come from the same character in the first position of C, as it does not appear in B. However, no character in A is equal to any character in B, so that there is no match point. Proposition (6) therefore does not hold.

Now, Proposition (6) does hold for ⊢work⊣ : ⊢sing⊣ :: ⊢you work⊣ : ⊢you sing⊣, where beginning and end markers (noted ⊢ and ⊣ here) are added to the previous example. In that case, for position iD = 1 in D, corresponding to the character y, the triple of indices (iA = 0, iB = 0, iC = 1) in A, B and C is such that: D[1 = 0 − 0 + 1] = C[1] = y and A[0] = B[0] = ⊢. The addition of such markers is tantamount to the insertion of virtual match points in the matrices representing A : B and A : C. On our set of positive examples (see Section 5), Proposition (6) holds for all examples when such virtual match points are added.

2.5 Match points inside diagonal bands

We turn to a second problematic case for Proposition (6). Taking all possible match points into account may give too much weight to some of them, leading to wrong answers. This is the case for leaf : leaves :: wolf : D. Taking into account all possible match points in the matrices A : B and A : C leads to 10 instances of Equation (1) voting in favour of D[1] = l, and only 7 in favour of D[3] = l (while the situation is balanced for D[1] = w and D[3] = w, with 8 equations each). This leads to the incorrect answer 'lowves' instead of 'wolves'.

The work in [4] considers only match points lying on the edit distance traces in A : B and A : C to create the answers to analogy questions. We follow this idea, but not as strictly. In [9, p. 106, illustrated on Figure 3], it is proven that the match points on edit distance traces between strings X and Y lie inside a diagonal band in the edit distance matrix. This diagonal band can be equivalently defined by Inequality (7), which uses the notion of similarity between two strings, i.e., the length of their longest common subsequence, noted s below, instead of the notion of edit distance.

    −|X| + s(X, Y) ≤ iY − iX ≤ |Y| − s(X, Y)    (7)

On the previous example, restricting to match points inside the diagonal bands delimited by Inequality (7) in all matrices yields more instances of Equality (1) favouring D[3] = l (4 equalities) over D[1] = l (1 equality only), and no more equalities to support D[3] = w; this leads to the correct answer 'wolves'.
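Inequality (7) is direct to implement once the similarity s, i.e. the length of the longest common subsequence, is available, as the following illustrative sketch shows; the LCS length is computed by the textbook dynamic programming recurrence, and the function names are arbitrary choices.

    def lcs_length(X: str, Y: str) -> int:
        """Similarity s(X, Y): length of the longest common subsequence of X
        and Y, by the standard dynamic programming recurrence."""
        prev = [0] * (len(Y) + 1)
        for x in X:
            curr = [0]
            for j, y in enumerate(Y, 1):
                curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
            prev = curr
        return prev[-1]

    def inside_diagonal(iX: int, iY: int, X: str, Y: str) -> bool:
        """Inequality (7): does the cell (iX, iY) lie inside the diagonal band
        that contains all match points on edit distance traces between X and Y?
        (In practice, s(X, Y) would of course be computed once per pair.)"""
        s = lcs_length(X, Y)
        return -len(X) + s <= iY - iX <= len(Y) - s

    # Example: for X = 'leaf' and Y = 'leaves', s(X, Y) = 4, so the band keeps
    # only the cells with 0 <= iY - iX <= 2.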
2.6 Re-estimating values in the character–position matrix

We turn to a third and last problematic case for Proposition (6). For the analogy question in German setzen : setzte :: lachen : D, the character–position matrix built as described above somehow hesitates, for the last-but-one position, between t and e: 2 instances of Equality (1) support D[5] = t, 6 support D[5] = e, 1 supports D[6] = t, and 4 support D[6] = e. This leads to the incorrect answer 'lachet' instead of 'lachte'.

The situation is similar to the one encountered in statistical machine translation, where word-to-word correspondence probabilities should be re-estimated from the mere evidence that the words appear in corresponding sentences. There, the answer consists in using the expectation–maximisation (EM) algorithm to estimate a probability distribution that will maximise the entropy over all possible word-to-word correspondences. The problem here is similar: words in the source and target languages in machine translation correspond to characters and positions in the character–position matrix. When applying the EM algorithm to the character–position matrix on the previous example, the probabilities for the previous characters and positions are re-estimated as follows: p(D[5] = t) = 0.29 now exceeds p(D[5] = e) = 0.25, and p(D[6] = e) = 0.32 exceeds p(D[6] = t) = 0.28. This leads to the correct solution 'lachte'.

3 Decoding the answer of an analogy question

3.1 Solving an assignment problem

In the previous section, we showed how to build a character–position matrix which assigns a probability to each character and position in the answer D of an analogy question A : B :: C : D. The final problem is thus an optimal assignment problem where each position should receive a character and each character should go to a position in D without conflict. This problem can be solved by the Hungarian method, or Kuhn's algorithm [1]. In our setting, we look for a solution of the assignment problem with maximal cost.

It has also been shown that the Hungarian method is in fact the limit of an entropy maximisation problem [8]. Based on this connection with entropy, we also implement a naive and imperfect algorithm which works as follows. We assign each (possibly repeated) character to a position by scanning all characters in increasing order of entropy over all available positions. We assign a character to the position where it gets its highest probability. As several characters may have the same entropy simultaneously, we choose positions so as to avoid conflicts. If conflicts cannot be avoided, we simply stop the process and output no solution for the analogy question. Otherwise, the characters and positions just assigned are removed, the entropies are computed again for the remaining characters and positions, and the process is repeated until all characters have been assigned to a position. If some remaining character cannot be assigned to any position, no solution is output. This strategy is more prone to fail than the Hungarian method, and should thus be considered as a loose baseline.
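For illustration, a maximal-cost assignment on a small character–position matrix can be obtained with an off-the-shelf solver such as scipy.optimize.linear_sum_assignment run in maximisation mode. The matrix layout and the function name decode in the sketch below are illustrative choices, not a description of the actual implementation.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def decode(prob, characters):
        """Assign one character to each position of D with maximal total score.

        prob       -- |D| x |D| array; prob[i][k] is the score of putting
                      characters[k] at position i + 1 (a character occurring
                      several times in D occupies several columns).
        characters -- list of length |D|, e.g. ['d', 'c'].
        """
        rows, cols = linear_sum_assignment(np.asarray(prob), maximize=True)
        answer = [''] * len(characters)
        for i, k in zip(rows, cols):
            answer[i] = characters[k]
        return ''.join(answer)

    # decode([[0.9, 0.1],
    #         [0.2, 0.8]], ['d', 'c'])  ->  'dc'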
3.2 Plurality of answers

There may be no answer, one answer or several answers to an analogy question. For instance, the analogy question abcabc : gh :: mnkl : D has no solution; the analogy question easy : uneasy :: known : D has only one possible answer: D = unknown; and the analogy question aa : ab :: aaa : x has only two possible answers, aab or aba, if considered as an analogy of commutation (see Footnote 1). From the solution delivered by the Hungarian method, it is possible to look in the character–position matrix and enumerate all other solutions of the same cost by performing all possible exchanges between characters and positions.

4 Overview of the proposed method

4.1 Sketch of the method

Algorithm 1 Solving an analogy question A : B :: C : x.

    def Solve(A, B, C):
        # 1. Specify the answer by a character–position matrix M.
        ComputeKnownFeatures(A, B, C)
        M[iD][cD] = 0 for all cD ∈ D and iD ∈ {1, . . . , |D|}
        for each (iA, iB) such that InsideDiagonal(iA, iB) and A[iA] == B[iB]:
            for each (iC, iD = iB − iA + iC) such that InsideAllDiagonals(iA, iB, iC, iD):
                M[iD][C[iC]] += 1
        # Do the same thing as above by exchanging B and C.
        ...
        M = ExpectationMaximisation(M)
        # 2. Decode the character–position matrix.
        list_of_pairs (iD, cD) = HungarianMethod(M)
        return EnumerateAllSolutions(M, list_of_pairs (iD, cD))

    def ComputeKnownFeatures(A, B, C):
        s(A, B), s(A, C) = similarity(A, B), similarity(A, C)
        s(B, D), s(C, D) = s(A, C) − |A| + |B|, s(A, B) − |A| + |C|
        |D| = |B| − |A| + |C|
        for each character c:
            nb_occ_in_D[c] = nb_occ_in_B[c] − nb_occ_in_A[c] + nb_occ_in_C[c]

    def InsideDiagonal(iX, iY):
        return −|X| + s(X, Y) ≤ iY − iX ≤ |Y| − s(X, Y)

    def InsideAllDiagonals(iA, iB, iC, iD):
        return all(InsideDiagonal(iX, iY) for (X, Y) in [(A, C), (B, D), (C, D)])

Algorithm 1 sketches the method as already illustrated in Figure 1. After computing the features of the answer, a character–position matrix is built and its cells are filled using character–position arithmetic. The values in the character–position matrix are then re-estimated using the expectation–maximisation algorithm. Decoding is performed using the Hungarian method. This outputs one answer. An enumeration of all character–position exchanges of the same cost yields all possible answers.
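The ExpectationMaximisation step of Algorithm 1 is not spelled out above. As a purely illustrative stand-in, the sketch below re-estimates the matrix by iterative proportional fitting (Sinkhorn-style diagonal scaling), which [8] relates to entropy-maximising views of the assignment problem; it is not claimed to reproduce the probabilities reported in Section 2.6.

    def rescale(M, counts_D, n_iter=20):
        """Illustrative stand-in for the ExpectationMaximisation step of
        Algorithm 1: iterative proportional fitting (Sinkhorn-style diagonal
        scaling) of the character-position matrix M, so that every position
        carries a total mass of 1 and every character a total mass equal to
        its number of occurrences in D (counts_D)."""
        # M: dict mapping (position, character) to a non-negative count.
        positions = sorted({i for i, _ in M})
        chars = sorted({ch for _, ch in M})
        P = {cell: float(v) for cell, v in M.items()}
        for _ in range(n_iter):
            for i in positions:                                   # row step
                row = sum(P[(i, ch)] for ch in chars if (i, ch) in P)
                if row > 0:
                    for ch in chars:
                        if (i, ch) in P:
                            P[(i, ch)] /= row
            for ch in chars:                                      # column step
                col = sum(P[(i, ch)] for i in positions if (i, ch) in P)
                if col > 0:
                    for i in positions:
                        if (i, ch) in P:
                            P[(i, ch)] *= counts_D[ch] / col
        return P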
4.2 Complexity analysis of the method

We give a very rough analysis of the complexity of the method. The computation of similarities and the enumeration of match points are basically quadratic in the length of the strings, so that the most costly component in the algorithm is solving the assignment problem by the Hungarian method, known to be cubic in the size of the square matrix, i.e., cubic in the length of the solution in its best implementation. The convergence of the EM algorithm is difficult to estimate.²

² We set the convergence threshold to the reciprocal of the number of cells in the character–position matrix, i.e., 1/|D|². In general we observe convergence after a very small number of steps.

It is interesting to observe that the method is linear in the size of D in the best case, i.e., when A = B. In that case, the diagonal band is reduced to the main diagonal in the matrix A : B. Consequently, the character–position matrix exhibits a degenerate form where each character in C is assigned the same position in D as in C. Such a matrix is a degenerate case for the Hungarian method, which returns a solution in one pass over the matrix.

5 Experiments

To inspect the accuracy of our proposed method, we use an in-house data set of 160 examples: 113 positive examples and 47 negative examples. Most of the positive examples are from various languages: Arabic, Chinese, English, German, etc. They address complex phenomena, like parallel infixing, as in:

    (German) sprechen : ihr aussprächet :: nehmen : ihr ausnähmet

In addition, some formal positive examples address incrementing problems:

    abc : abcabc :: abcabcabc : abcabcabcabc
    ab : aabb :: aaaaaabbbbbb : aaaaaaabbbbbbb

The purpose of the negative examples is to test the ability of our method not to deliver an incorrect answer. The negative examples are of the type:

    ab : aabb :: aaabbb : aaabbbba

where the answer proposed, aaabbbba, is incorrect; the only correct answer, when restricting to analogies of commutation, is aaaabbbb. In this case, an algorithm that would blindly output all possible combinations of four a's and four b's in any order would have the incorrect answer in its set of solutions; it would thus fail the test.

We test different combinations of components: for the specification of the answer, use of all possible match points vs. only those inside diagonal bands (Sect. 2.5), and use of the EM algorithm to re-estimate values in the character–position matrix vs. no use (Sect. 2.6); for decoding, use of the Hungarian method vs. our loose baseline (increasing entropies) (Sect. 3.1).

The results in Table 1 show that each of the components contributes to accuracy. Considering only match points inside diagonal bands allows accuracy to jump from 66 % and below to almost 80 % and above. The EM algorithm may be of no utility or may add around 5 % in accuracy. As expected, the Hungarian method always beats our loose baseline by at least 5 % in accuracy. The best accuracy obtained is 91 %, when decoding using the Hungarian method.

    Match points inside diagonal bands | EM algorithm | Decoding method      | Recall | Precision | Accuracy
    No                                 | No           | Increasing entropies |   98   |    41     |    57
    No                                 | No           | Hungarian method     |   98   |    50     |    64
    No                                 | Yes          | Increasing entropies |   98   |    41     |    57
    No                                 | Yes          | Hungarian method     |   98   |    53     |    66
    Yes                                | No           | Increasing entropies |   99   |    70     |    78
    Yes                                | No           | Hungarian method     |   99   |    82     |    87
    Yes                                | Yes          | Increasing entropies |   99   |    81     |    86
    Yes                                | Yes          | Hungarian method     |   99   |    88     |    91

Table 1. Testing 8 different configurations to output the answer specified by character–position arithmetic (recall, precision and accuracy in %). The best configuration is the last one.

Table 2 details the confusion matrix for the best configuration. Recall, precision and accuracy are computed in the standard way, and were reported in Table 1. This confusion matrix shows that the weakness of the method lies in too many negative predictions on positive examples. This is measured by a relatively low precision of 88 %.

                        | Positive predictions | Negative predictions | Total
    Positive examples   | TP = 99 (62 %)       | FN = 14 (9 %)        | 113 (71 %)
    Negative examples   | FP = 1 (0 %)         | TN = 46 (29 %)       | 47 (29 %)
    Total               | 100 (63 %)           | 60 (37 %)            | 160 (100 %)

Table 2. Confusion matrix for the verification of analogies on 113 positive examples of analogies and 47 negative examples in the best configuration (see Table 1, last line).

This relatively low precision will now be nuanced by results on a much larger dataset, which supposedly better reflects real analogy questions. This dataset is from Task 1 of Track 1 of the SIGMORPHON 2016 Shared Task: Morphological Reinflection. We use all the offered languages: Arabic, Finnish, Georgian, German, Hungarian, Maltese, Navajo, Russian, Spanish and Turkish.³ We extract analogy questions from such data by considering all analogies between word forms, filtered by morphological features. For each analogy, four different analogy questions can be asked, each of the four terms becoming the answer.⁴ As an example in Spanish, the four following questions correspond to the same analogy:

    alterado : alterada :: adeudados : x ⇒ x = adeudadas
    alterada : alterado :: adeudadas : x ⇒ x = adeudados
    adeudadas : adeudados :: alterada : x ⇒ x = alterado
    adeudados : adeudadas :: alterado : x ⇒ x = alterada

³ https://github.com/ryancotterell/sigmorphon2016/tree/master/data/. We use all the files of the type -task1-train.
⁴ This is not the task proposed in the SIGMORPHON Shared Task, which consists in a machine learning task: predicting a word form given a lemma and morphological features, after having learnt from the training data.

The number of analogy questions obtained in each language is given in Table 3. The total number over all languages exceeds 40 million analogy questions. For half of the languages, the percentage of correct answers is equal to or higher than 95 %. The overall percentage of correct answers over all questions in all languages reaches 94 %.

    Language  | Number of analogy questions | % of correct answers
    Arabic    |        381,132              | 94
    Finnish   |          3,076              | 95
    Georgian  |      7,256,156              | 87
    German    |        349,796              | 91
    Hungarian |     15,157,368              | 94
    Maltese   |         10,000              | 97
    Navajo    |     18,588,020              | 97
    Russian   |         66,672              | 99
    Spanish   |         95,564              | 95
    Turkish   |        729,092              | 86
    Total     |     42,636,876              | 94
Table 3. Solving analogy questions extracted from all training data of Task 1 of Track 1 of the SIGMORPHON 2016 Shared Task, for the best configuration of our proposed method (last line in Table 1).

6 Conclusion and future work

We showed how to answer analogy questions A : B :: C : D of unknown D between strings of characters, by essentially relying on an intuitive basic arithmetic equality: D[iB − iA + iC] = B[iB] − A[iA] + C[iC]. We decomposed the problem into two steps: specification and decoding. We performed experiments on a set of positive and negative examples and measured the contribution of each of the components to accuracy. We further assessed the precision of the method on a very large set of more than 40 million analogy questions from the dataset of a shared task in morphology. We obtained the correct answer in 94 % of the cases.

As a future direction, we want to carry on testing the efficiency of character–position arithmetic. For instance, it remains to inspect whether restricting further to those match points lying on edit distance traces helps or harms, and whether we can dispense with the EM algorithm.

References

1. Kuhn, H.W.: The Hungarian method for the assignment problem. Naval Research Logistics Quarterly 2, 83–97 (1955), http://onlinelibrary.wiley.com/doi/10.1002/nav.3800020109/epdf
2. Langlais, P., Patry, A.: Translating unknown words by analogical learning. In: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL). pp. 877–886 (2007), https://www.aclweb.org/anthology/D/D07/D07-1092.pdf
3. Langlais, P., Yvon, F.: Scaling up analogical learning. In: Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008). vol. Posters, pp. 51–54. Coling 2008 Organizing Committee, Manchester, UK (August 2008)
4. Lepage, Y.: Solving analogies on words: an algorithm. In: Proceedings of the Association for Computational Linguistics (ACL'98) and 17th International Conference on Computational Linguistics (COLING'98). vol. I, pp. 728–735. Montreal (Aug 1998), https://aclweb.org/anthology/P/P98/P98-1120.pdf
5. Lepage, Y.: De l'analogie rendant compte de la commutation en linguistique. Mémoire d'habilitation à diriger les recherches, Université de Grenoble (May 2003), https://tel.archives-ouvertes.fr/tel-00004372/document
6. Lepage, Y.: Proportional analogy in written language data. In: Gala, N., Rapp, R., Bel-Enguix, G. (eds.) Language, Production, Cognition and the Lexicon, pp. 1–23. Text, Speech and Language Technology 48, Springer International Publishing Switzerland (2014), http://link.springer.com/chapter/10.1007%2F978-3-319-08043-7_10
7. Mikolov, T., Yih, W.T., Zweig, G.: Linguistic regularities in continuous space word representations. In: Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2013). pp. 746–751. Association for Computational Linguistics, Atlanta, Georgia (June 2013), http://www.aclweb.org/anthology/N13-1090
8. Sharify, M., Gaubert, S., Grigori, L.: Solution of the optimal assignment problem by diagonal scaling algorithms. ArXiv e-prints (Apr 2011), https://arxiv.org/pdf/1104.3830.pdf
9. Ukkonen, E.: Algorithms for approximate string matching. Information and Control 64, 100–118 (1985), http://www.sciencedirect.com/science/article/pii/S0019995885800462
10. Yvon, F., Stroppa, N., Miclet, L., Delhay, A.: Solving analogical equations on words. Technical report ENST2004D005, ENST (Jul 2004)