Solving Analogical Equations Between Strings of Symbols Using Neural Networks

Vivatchai Kaveeta and Yves Lepage*

IPS, Waseda University
2-7 Hibikino, Wakamatsu-ku, Kitakyushu-shi, 808-0135 Fukuoka-ken, Japan
vivatchai@fuji.waseda.jp, yves.lepage@waseda.jp

* This work was supported by a JSPS Grant, Number 15K00317 (Kakenhi C), entitled "Language productivity: efficient extraction of productive analogical clusters and their evaluation using statistical machine translation."

Copyright © 2016 for this paper by its authors. Copying permitted for private and academic purposes. In Proceedings of the ICCBR 2016 Workshops. Atlanta, Georgia, United States of America.

Abstract. A neural network model to solve analogical equations between strings of symbols is proposed. The method transforms the input strings into two fixed-size alignment matrices. These matrices act as the input of a neural network which predicts two output matrices. Finally, a string decoder transforms the predicted matrices into the final output string. By design, the neural network is constrained by several properties of analogy. The experimental results show a fast learning rate and a high prediction accuracy that beats a baseline algorithm.

Keywords: Proportional Analogy, Neural Networks

1 Introduction

We design a neural network model to solve analogical equations between strings of symbols. A matrix representation is proposed which transforms the input strings into matrices. Re-sampling and filtering are then applied to these matrices. Next, the neural network design is introduced. We exploit properties of proportional analogy between strings of symbols to introduce constraints on the neural network. Finally, the output prediction matrices are decoded into the final output string. Experiments are performed to evaluate the effects of the configurable parameters. The method is compared with an algorithm based on a formal characterization of proportional analogy between strings of symbols [7].

This paper is organized as follows. Section 2 describes the background on proportional analogy between strings of symbols and previous related work. In Sect. 3, the proposed methods and neural network models are explained. Experiments are detailed in Sect. 4. Section 5 shows the experimental results.

2 Background

Proportional analogy between sequences of symbols, be they phonemes or characters, is stated as the relationship between four strings in the form 'A is to B as C is to D', denoted by A : B :: C : D. Analogical equations are the following problems: given three strings A, B and C, how do we coin the fourth string? Proportional analogies are seen at work in the coining of new words or new sentences. In this work, we focus on a type of analogy called analogies of commutation.¹ We do not deal with semantic analogies like queen : king :: woman : man. Rather, the computational analogy works directly on the symbolic level. [7] gives an algorithm to solve analogies of commutation on strings. [10] proposes a similar algorithm. Both algorithms are based on the notion of edit distance. In [9], this formalization was successfully applied in the development of an analogy-based machine translation system. In this work we refer to these previous formalizations in designing an appropriate structure for a neural network.

¹ One can distinguish between four types of analogies between strings of symbols: repetition (e.g., A : A.A :: B : B.B), reduplication (e.g., cat : caat :: dog : doog), mirror (e.g., abc : wxyz :: cba : zyxw) and commutation (examples in the text).

Neural networks have been successfully applied to many tasks. Their main advantage is their ability to learn from examples without predefined knowledge of the problem.
Assuming that an appropriate model structure is used, the network can estimate the underlying structure of the problem. Although many neural networks have been proposed for different tasks, no specific neural network seems to have ever been proposed to solve analogical equations on strings of symbols. [1, 3] propose networks that generate new images from previous image samples for the training of classification models. However, these problems are not expressed in the form of analogical equations. In [13], neural networks generate new images by solving analogy equations between images. This is similar to our problem of analogical equations on strings. These successful implementations point to the possibility of developing a neural network to solve analogical equations on strings.

3 Proposed Method

In Sect. 3.1, a method to transform input strings into matrices is introduced. The matrices are re-sampled into fixed-size matrices in Sect. 3.2. Two filtering methods are introduced in Sect. 3.3. The neural network is explained in Sect. 3.4. The output matrices are decoded into a final string by the decoder in Sect. 3.5.

3.1 Alignment Matrices

The usual approach for processing strings of characters with neural networks is vector encoding. Dictionary-based one-hot vectors are used in [4, 6, 15]. For the analogy resolution task, such vectors could be built at the character level. Unfortunately, this vector representation presents some problems. First, strings vary in length. Second, dictionary-based vectors are language-specific. These limitations restrict the use of one-hot vectors to fixed-length strings in a specific language.

Proportional analogy can be processed by computing similarities through edit distances. Consequently, a representation of the similarity of strings seems appropriate. Alignment matrices are widely adopted in the genetic sequence alignment task [2]. They encode a pair of sequences into a matrix where each cell represents a local matching point. Figure 1a (left) shows the alignment between the strings 'harder' and 'hard'. Local matching positions, with the value 1.0, are shown as black cells. Unmatched positions, with the value 0.0, are shown as white cells.

Fig. 1. (a) An example of an alignment matrix: original (left), re-sampled (right). (b) Comparison between re-sampling methods: original matrix (1), proposed (2), nearest neighbor (3), bilinear (4), bicubic (5). (c) An example of the filtering methods.

3.2 Matrix Re-sampling

The alignment matrices vary in dimension depending on the lengths of the input strings. To feed the alignment data to a neural network, the matrices need to be re-sampled into matrices of a fixed dimension.

\begin{pmatrix} a_1 b_1 & \cdots & a_m b_1 \\ \vdots & \ddots & \vdots \\ a_1 b_n & \cdots & a_m b_n \end{pmatrix} \Rightarrow \begin{pmatrix} AB_{11} & \cdots & AB_{z1} \\ \vdots & \ddots & \vdots \\ AB_{1z} & \cdots & AB_{zz} \end{pmatrix}    (1)

AB_{xy} = \sum_{i=\lfloor I(y,m) \rfloor}^{\lceil I(y+1,m) \rceil} \sum_{j=\lfloor I(x,n) \rfloor}^{\lceil I(x+1,n) \rceil} a_i b_j \times f(i,x,n) \times f(j,y,m)    (2)

where f(s,t,u) = \min(I(s+1,u), t+1) - \max(I(s,u), t+1) and I(w,n) = \frac{w \times n}{z}.

Fig. 2. Equivalent prediction flows.

Input strings A and B, with lengths m and n, are denoted by a_1 a_2 ... a_m and b_1 b_2 ... b_n respectively. The original alignment matrix (Fig. 1, left), of dimension m × n, is re-sampled into AB of dimension z × z (Fig. 1, right).
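To make the two representation steps concrete, the sketch below builds the binary alignment matrix of Sect. 3.1 and re-samples it to a fixed size in the spirit of (2). It is an illustration only, not the authors' implementation: the function names are ours, and the normalisation of each output cell by the overlap area is an assumption made to keep the values in [0, 1].

```python
import numpy as np

def alignment_matrix(a: str, b: str) -> np.ndarray:
    """Binary alignment matrix: cell (i, j) is 1.0 iff a[i] == b[j] (Sect. 3.1)."""
    return np.array([[1.0 if ca == cb else 0.0 for cb in b] for ca in a])

def resample(m: np.ndarray, z: int) -> np.ndarray:
    """Re-sample an arbitrary-size matrix into z x z by non-uniform averaging:
    each output cell covers a 1/z-th of each axis and averages the input cells
    that overlap it, weighted by the overlap (cf. Eq. (2))."""
    rows, cols = m.shape
    out = np.zeros((z, z))
    for x in range(z):
        for y in range(z):
            r0, r1 = x * rows / z, (x + 1) * rows / z   # input rows covered by (x, y)
            c0, c1 = y * cols / z, (y + 1) * cols / z   # input columns covered by (x, y)
            acc, area = 0.0, 0.0
            for i in range(int(np.floor(r0)), int(np.ceil(r1))):
                for j in range(int(np.floor(c0)), int(np.ceil(c1))):
                    w = (min(i + 1, r1) - max(i, r0)) * (min(j + 1, c1) - max(j, c0))
                    acc += m[i, j] * w
                    area += w
            out[x, y] = acc / area if area > 0 else 0.0
    return out

AB = resample(alignment_matrix("harder", "hard"), z=16)  # 6 x 4 matrix -> 16 x 16
```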
The formula is given in (2): the matrix AB results from a non-uniform linear transformation on both axes. Figure 1b illustrates this. The difference with other image re-sampling methods is shown in Fig. 1c. Our method generates sharp edges while keeping the diagonal line visible.

3.3 Filtering

Anomalous black cells appear because of duplicate characters. As they do not belong to any valid alignment, these cells degrade the prediction quality. We thus introduce two filtering methods to remedy this.

Mathematical Morphology. Originally proposed in [14], mathematical morphology enhances an input image by means of specified filters and operations. Alignment noise usually appears in positions off the main diagonal. An example is the cell on the right of Fig. 1a which matches the character 'r' in 'hard' with the last character of 'harder'. Two 3 × 3 filters are shown in Fig. 1b (2, top). The original matrix is filtered by both filters using grey-scale erosion, and the two filtered matrices are combined by taking their maximum values. With this procedure, black cells appearing off a diagonal line are filtered out, while diagonal lines keep sharp endings. Figure 1b (2, bottom) shows the result.

Diagonal Weight. Usually, the valid alignments are located on the main diagonal line, so we apply a linear weighting scheme along this diagonal. Cells further away from the line receive gradually lower weights. Figure 1b (4, top) shows the weighting filter. Equation (3) gives the value at position (x, y) of the weight-filtered matrix AB, where z is the size of the matrix. Figure 1b (4, bottom) shows the result of this filtering method.

AB_{xy} = a_x b_y \times \left(1 - \frac{|x - y|}{z}\right)    (3)

3.4 Neural Network Model

To design an appropriate neural network structure, we rely on the properties of analogies. An important property is the equivalence between the different forms of an analogy. As described in [8], a single form A : B :: C : D has 7 other equivalent forms: A : C :: B : D, B : A :: D : C, B : D :: A : C, C : A :: D : B, C : D :: A : B, D : B :: C : A, D : C :: B : A. Another property is the mirroring of strings: if A : B :: C : D is an analogy, then Ā : B̄ :: C̄ : D̄ holds too, where Ā denotes the mirror of A, i.e., the mirror of the string a_1 a_2 ... a_m is a_m a_{m-1} ... a_1. As a result, we get eight additional equivalent equations. The equivalent prediction flows are shown in Fig. 2. The four boxes in the corners represent the alignment matrices generated from the input strings (mirrored versions on the right). Matrix AB is the fixed-dimension alignment matrix built from strings A and B using the methods explained in Sects. 3.1 to 3.3. The two input matrices are flattened and concatenated to form the input vector; the concatenations are represented by the circle connections. The merged data are fed into the neural network. The dashed line represents the neural network structure, whose parameters are shared across all equivalent prediction flows. The network is trained on all equivalent data flows. Note that the input alignment matrices need to be given in the correct orientation.

Fig. 3. Neural network structure.

The neural network (AB, AC) → BD is detailed in Fig. 3. The network predicts the matrix BD. The input data is the flattened and concatenated representation of the matrices AB and AC. The total number of input nodes is thus 2 × z². The flow goes through p fully connected hidden layers, where each layer has q nodes. The output layer has z² nodes. Pairs of predicted output matrices from the network are decoded into a final string by the decoder algorithm explained below.
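The paper does not name a particular framework, so the following is only a minimal sketch of such a network in Keras, with the default settings of Sect. 4.2 (one hidden layer of 128 ReLU nodes, MSE loss, Adam optimizer, z = 16). The function names, the linear output layer and the way training pairs are assembled are our own assumptions.

```python
import numpy as np
from tensorflow import keras

z, p, q = 16, 1, 128  # matrix size, hidden layers, nodes per layer (defaults of Sect. 4.2)

def build_model(z: int, p: int, q: int) -> keras.Model:
    """MLP from the concatenated, flattened AB and AC matrices (2*z*z inputs)
    to the flattened BD matrix (z*z outputs)."""
    model = keras.Sequential()
    model.add(keras.layers.Dense(q, activation="relu", input_shape=(2 * z * z,)))
    for _ in range(p - 1):
        model.add(keras.layers.Dense(q, activation="relu"))
    model.add(keras.layers.Dense(z * z))  # output activation is not specified in the paper
    model.compile(optimizer="adam", loss="mse")
    return model

def flow_input(AB: np.ndarray, AC: np.ndarray) -> np.ndarray:
    """One prediction flow: flatten and concatenate the two input matrices.
    Each training analogy contributes one such (input, target) pair per
    equivalent prediction flow of Fig. 2, all sharing the same model."""
    return np.concatenate([AB.reshape(-1), AC.reshape(-1)])

model = build_model(z, p, q)
# model.fit(X, Y, epochs=200)   # X: stacked flow inputs, Y: flattened target BD matrices
```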
3.5 Decoder

From the alignment matrices AB and AC, the neural network produces a pair of matrices that stand for the alignments BD and CD. The decoder converts this pair of matrices into a final string, which is a hypothesis for the solution D of the analogical equation A : B :: C : D. In the decoder, we rely on a set of properties of analogies. The first property is the relationship between the length of the output string and the lengths of the input strings, given in (4): the length of D is entirely determined by the lengths of A, B and C. Another piece of information is the number of occurrences of each symbol in the output string. In (5), |A|_a stands for the number of occurrences of symbol a in string A. Equation (5), applied to all symbols, implies (4). Our decoder is based on the following three pieces of information: the two predicted alignment matrices, the length of the output string, and the number of occurrences of each symbol.

|D| = |B| + |C| - |A|    (4)

|D|_a = |B|_a + |C|_a - |A|_a, for all symbols a    (5)

Algorithm 1: Decoder
  Input: BD, CD: the two predicted re-sampled matrices
  Input: A, B, C: the input strings
  Input: z: the length of the output string
  Data: N: number of occurrences of each character
  Data: V[c, i]: likelihood of character c at position i
  Output: D: the output string

  N, V, D ← []                                      // Initialize.
  foreach c ∈ C ∪ B ∪ A do
      N[c] ← |B|_c + |C|_c − |A|_c                  // Enforce property (5).
  foreach c ∈ A ∪ B ∪ C do
      for j ← 1 to z do
          V[c, j] ← Σ_{i : B[i]=c} BD[i, j]
                  + Σ_{i : C[i]=c} CD[i, j]         // Compute likelihoods.
  foreach c ∈ A ∪ B ∪ C do
      for i ← 1 to N[c] do
          D[i] ← argmax_c V[c, i]
  return D
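As an illustration only, a loose Python reading of Algorithm 1 is sketched below. It is not the authors' code: it assumes that the input strings are no longer than z, so that character positions can be used directly as matrix row indices (ignoring the re-sampling of rows), and it places each character greedily at its most likely free positions, which is one possible interpretation of the last loop of the algorithm.

```python
from collections import Counter
import numpy as np

def decode(BD: np.ndarray, CD: np.ndarray, A: str, B: str, C: str, z: int) -> str:
    """Decode the predicted matrices BD and CD into a hypothesis for D."""
    # Number of occurrences of each character in D, enforcing property (5).
    counts = Counter(B) + Counter(C)
    counts.subtract(Counter(A))
    n = {c: max(k, 0) for c, k in counts.items()}

    # Likelihood of character c at output position j, accumulated over the rows
    # of BD (resp. CD) whose characters in B (resp. C) are equal to c.
    V = {c: np.zeros(z) for c in n}
    for i, c in enumerate(B):
        if i < BD.shape[0]:
            V[c] += BD[i, :z]
    for i, c in enumerate(C):
        if i < CD.shape[0]:
            V[c] += CD[i, :z]

    # Place each character at its most likely free positions; the length of D
    # is fixed by property (4). Unfilled positions are simply dropped.
    length = max(len(B) + len(C) - len(A), 0)
    D = [""] * length
    for c, k in n.items():
        for j in np.argsort(-V[c])[:k]:
            if j < length and D[j] == "":
                D[j] = c
    return "".join(D)
```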
4 Experiments

4.1 Dataset

We construct a data set of analogical equations on strings. The data set is a combination of all 3,370 formal analogies in the Google test set [11] (e.g., bright : brightest :: sweet : sweetest) and 2,423 formal analogies in multiple languages as in [7] (e.g., wolf : wolves :: leaf : leaves, or (Malay) kawan : mengawani :: keliling : mengelilingi, together with Japanese and Chinese examples). We randomly selected 10% of the data as our test set; the rest is used for training. The statistics of the data set are: # of training samples = 5,214, # of test samples = 579, average edit distance = 1.78, average length of string = 7.04 ± 2.54.

Table 1. Experiment results

                               # of hyper-   Train time   Train loss   Test loss   Accuracy
                               parameters    (m:s)        (MSE)        (MSE)       (%)
  Alignment       2 × 2            1,668      4.07         0.009        0.005        1.73
  matrices        4 × 4            6,288      4.53         0.013        0.008       16.75
  size            8 × 8           24,768      5.34         0.017        0.010       67.18
                 16 × 16          98,688      7.12         0.024        0.017       79.10
                 32 × 32         394,368     14.03         0.035        0.026       84.11
  Re-sampling    NN               98,688      6.56         0.039        0.031       67.88
  methods        Bilinear         98,688      7.10         0.015        0.010       72.71
                 Bicubic          98,688      6.40         0.019        0.012       78.24
                 Proposed         98,688      7.12         0.024        0.017       79.10
  Filtering      None             98,688      6.28         0.056        0.044       77.72
  methods        Morph            98,688      6.39         0.040        0.031       76.68
                 Weight           98,688      7.32         0.034        0.025       79.45
                 Both             98,688      7.12         0.024        0.017       80.48
  Number of      128              98,688      7.12         0.024        0.017       80.83
  hidden         256             197,120      8.21         0.022        0.015       82.38
  nodes          512             393,984      9.51         0.021        0.017       83.07
                 1024            787,712     13.28         0.019        0.012       85.84
  Number of      1                98,688      7.12         0.024        0.017       80.66
  hidden         2               115,200      9.55         0.023        0.015       84.44
  layers         3               131,712     11.22         0.023        0.014       86.36
                 4               148,224     11.54         0.023        0.014       87.56

4.2 Experimental Procedure

We performed a series of experiments to evaluate the influence of each parameter (see the subsections below). Another experiment determines the highest achievable accuracy rate. For each experiment, all parameters are kept constant except the ones being tested. The basic parameter settings are: size of alignment matrices = 16 × 16, re-sampling method = proposed, filtering method = both, number of hidden nodes = 128, number of hidden layers = 1, loss function = MSE, activation = ReLU [12], optimizer = Adam [5], and number of epochs = 200. Any alteration to these basic parameters is clearly stated in the description of each experiment.

4.3 Evaluation

Training Time. A significant advantage of our design is its ability to learn at high speed. We ran experiments to test the influence of the various hyper-parameters and measured the training times after 200 epochs.

Fig. 4. Loss against the number of epochs (left) and accuracy against output length (right) for alignment matrix sizes from 2 × 2 to 32 × 32.

Loss. We measured the values of a loss function (or objective function). These values reflect the ability of the neural network to predict the correct alignment matrices, by comparing the output with the reference ground truth. The Mean Square Error (MSE) is given in (6), where P is the predicted matrix from the neural network and T is the ground-truth matrix. Lower values reflect a more precise prediction, hence a better model configuration.

Accuracy. We measured the accuracy of our network as the percentage of correct answers over the total number of test samples (see (7)). If one or more characters in the output string differ from the reference string, the prediction is counted as a failure.

MSE = \frac{1}{z^2} \sum_{i=1}^{z} \sum_{j=1}^{z} (P_{ij} - T_{ij})^2    (6)

Accuracy = \frac{\text{# of correct answers}}{\text{total # of test samples}} \times 100    (7)
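For concreteness, the two measures can be computed as in the short sketch below; the function names are ours and this is an illustration, not the authors' evaluation code.

```python
import numpy as np

def mse(P: np.ndarray, T: np.ndarray) -> float:
    """Mean square error between a predicted matrix P and its ground truth T, Eq. (6)."""
    return float(np.mean((P - T) ** 2))

def accuracy(predicted: list, reference: list) -> float:
    """Exact-match accuracy over the test set, Eq. (7): a single differing
    character makes the whole prediction count as a failure."""
    correct = sum(p == r for p, r in zip(predicted, reference))
    return 100.0 * correct / len(reference)
```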
5 Experiment Results

Size of Alignment Matrices. In this experiment, the size of the alignment matrices is varied from 2 to 32 by successive powers of 2. The results in Fig. 4 show the expected behaviour: the bigger the alignment matrices, the higher the accuracy. The highest rate of 84.11% is obtained with 32 × 32 matrices. Figure 4 (right) shows that the size of the alignment matrices directly contributes to the ability to output longer solutions. The downside of bigger alignment matrices is the number of network connections, which causes an exponential explosion in the number of hyper-parameters. Table 1 shows that training times increase with the number of hyper-parameters.

Matrix Re-sampling Methods. We compared our re-sampling method with the nearest neighbor, bilinear and bicubic methods. The results show that our re-sampling method achieves the highest accuracy.

Filtering Methods. As we propose two filtering methods, we test all combinations: no filtering, only one of the two, or both, i.e., four combinations. The results in Table 1 show that the use of both filtering methods yields the highest accuracy.

Fig. 5. Accuracy against output length.

We observe that the use of mathematical morphology alone yields a lower accuracy rate than no filtering at all. This may indicate some limitation of the selected morphological filters.

Number of Hidden Nodes. We set the number of hidden layers to one and vary the number of nodes. A higher number of hidden nodes reflects the ability to recognize more complex patterns. The results show that networks with more hidden nodes yield higher accuracy rates, at the expense of training time.

Number of Hidden Layers. Deeper networks are usually favored for more complicated patterns, a disadvantage being longer learning times. In this experiment, each network has the same number of hidden nodes (128) per layer, but a different number of layers. As expected, a deeper structure yields better recognition rates. Interestingly, with four hidden layers, the number of hyper-parameters is much lower than for the one large single-layer network of the previous experiment; yet the deeper network achieves a better accuracy rate with a shorter training time.

Benchmark. We select an extreme configuration (alignment matrices = 32 × 32, number of hidden nodes = 1024, number of hidden layers = 2) and compare the results with the baseline algorithm given in [7]. Our proposed neural network achieves a higher accuracy rate than the baseline: 95.68% against 94.47%. The baseline does not need any training, while our network needed to be trained for 36 minutes. The advantage in accuracy may come from the fact that our dataset contains some samples which do not comply with the formalization behind the baseline algorithm. Figure 5 shows that our network still has some limitation in solving equations with longer strings. Nevertheless, these results prove the effectiveness of our neural network model for the task.

6 Conclusion

In this work, a neural network design to solve analogical equations of commutation on strings of symbols has been proposed. We presented several methods to transform the input strings into a matrix representation and to transform the predicted matrices back into an output string. Two filtering methods to reduce the alignment noise were introduced. The model parameters were tested in a number of experiments. They show promising results, as an accuracy of more than 95% was achieved. The comparison with a baseline system showed a higher accuracy rate on the test set.

We intend to further improve our neural network in the future. From the reported experiments, the accuracy degrades with the length of the strings. Improvements in the network structure may help to improve the prediction accuracy. Also, the decoding algorithm impacts the accuracy of the final string, and the decoding scheme can be further improved.

References

1. Dosovitskiy, A., Tobias Springenberg, J., Brox, T.: Learning to generate chairs with convolutional neural networks. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 1538–1546 (2015)
2. Gibbs, A.J., McIntyre, G.A.: The diagram, a method for comparing sequences. European Journal of Biochemistry 16(1), 1–11 (1970)
3. Gregor, K., Danihelka, I., Graves, A., Rezende, D.J., Wierstra, D.: DRAW: A recurrent neural network for image generation. In: Proceedings of the 32nd International Conference on Machine Learning (ICML-15) (2015)
4. Johnson, R., Zhang, T.: Semi-supervised convolutional neural networks for text categorization via region embedding. In: Advances in Neural Information Processing Systems. pp. 919–927 (2015)
5. Kingma, D., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of the 3rd International Conference on Learning Representations (2014)
6. Lai, S., Xu, L., Liu, K., Zhao, J.: Recurrent convolutional neural networks for text classification. In: Proceedings of the 29th AAAI Conference on Artificial Intelligence. pp. 2267–2273 (2015)
7. Lepage, Y.: Solving analogies on words: an algorithm. In: Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics (1998)
8. Lepage, Y.: Languages of analogical strings. In: Proceedings of the 18th Conference on Computational Linguistics, Volume 1. pp. 488–494. Association for Computational Linguistics (2000)
9. Lepage, Y., Denoual, E.: Purest ever example-based machine translation: Detailed presentation and assessment. Machine Translation 19(3–4), 251–282 (2005)
10. Miclet, L., Delhay, A.: Analogy on sequences: a definition and an algorithm. Research report, INRIA (2003)
11. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. In: International Conference on Learning Representations (2013)
12. Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10). pp. 807–814 (2010)
13. Reed, S.E., Zhang, Y., Zhang, Y., Lee, H.: Deep visual analogy-making. In: Advances in Neural Information Processing Systems. pp. 1252–1260 (2015)
14. Serra, J.: Image Analysis and Mathematical Morphology. Academic Press (1982)
15. Zhang, X., Zhao, J., LeCun, Y.: Character-level convolutional networks for text classification. In: Advances in Neural Information Processing Systems. pp. 649–657 (2015)