Solving Analogical Equations Between Strings of Symbols Using Neural Networks

Vivatchai Kaveeta and Yves Lepage*

IPS, Waseda University
2-7 Hibikino, Wakamatsu-ku, Kitakyushu-shi, 808-0135 Fukuoka-ken, Japan
vivatchai@fuji.waseda.jp, yves.lepage@waseda.jp

* This work was supported by a JSPS Grant, Number 15K00317 (Kakenhi C), entitled "Language productivity: efficient extraction of productive analogical clusters and their evaluation using statistical machine translation."

Copyright © 2016 for this paper by its authors. Copying permitted for private and academic purposes. In Proceedings of the ICCBR 2016 Workshops. Atlanta, Georgia, United States of America.

Abstract. A neural network model to solve analogical equations between strings of symbols is proposed. The method transforms the input strings into two fixed-size alignment matrices. These matrices act as the input of a neural network which predicts two output matrices. Finally, a string decoder transforms the predicted matrices into the final output string. By design, the neural network is constrained by several properties of analogy. The experimental results show a fast learning rate and a high prediction accuracy that beats a baseline algorithm.

Keywords: Proportional Analogy, Neural Networks

1 Introduction

We design a neural network model to solve analogical equations between strings of symbols. A matrix representation is proposed which transforms the input strings into matrices. Re-sampling and filtering are then applied to these matrices. Next, the neural network design is introduced. We exploit properties of proportional analogy between strings of symbols to introduce constraints on the neural network. Finally, the output prediction matrices are decoded into the final output string. Experiments are performed to evaluate the effects of the configurable parameters. The method is compared with an algorithm based on a formal characterization of proportional analogy between strings of symbols [7].

This paper is organized as follows. Section 2 describes the background on proportional analogy between strings of symbols and previous related work. In Sect. 3, the proposed methods and neural network models are explained. Experiments are detailed in Sect. 4. Section 5 shows the experimental results.

2 Background

Proportional analogy between sequences of symbols, be they phonemes or characters, is stated as the relationship between four strings in the form 'A is to B as C is to D', denoted by A : B :: C : D. Analogical equations are the following problems: given three strings A, B and C, how do we coin the fourth string? Proportional analogies are seen at work in the coining of new words or new sentences. In this work, we focus on a type of analogy called analogies of commutation.¹ We do not deal with semantic analogies like queen : king :: woman : man. Rather, the computational analogy works directly on the symbolic level. [7] gives an algorithm to solve analogies of commutation on strings. [10] proposes a similar algorithm. Both algorithms are based on the notion of edit distance. In [9], this formalization was successfully applied in the development of an analogy-based machine translation system. In this work we refer to these previous formalizations in designing an appropriate structure for a neural network.

¹ One can distinguish between four types of analogies between strings of symbols: repetition (e.g., A : A.A :: B : B.B), reduplication (e.g., cat : caat :: dog : doog), mirror (e.g., abc : wxyz :: cba : zyxw) and commutation (examples in the text).

Neural networks have been successfully applied to many tasks. Their main advantage is their ability to learn from examples without predefined knowledge of the problem.
Assuming that an appropriate model structure is used, the network can estimate the underlying structure of the problem. Although many neural networks have been proposed for different tasks, no specific neural network seems to have ever been proposed to solve analogical equations on strings of symbols. [1, 3] propose networks that generate new images from previous image samples for the training of classification models. However, these problems are not expressed in the form of analogical equations. In [13], neural networks generate new images by solving analogy equations between images. This is similar to our problem of analogical equations on strings. These successful implementations point to the possibility of developing a neural network to solve analogical equations on strings.

3 Proposed Method

In Sect. 3.1, a method to transform input strings into matrices is introduced. The matrices are re-sampled into fixed-size matrices in Sect. 3.2. Two filtering methods are introduced in Sect. 3.3. The neural network is explained in Sect. 3.4. The output matrices are decoded into a final string by the decoder in Sect. 3.5.

3.1 Alignment Matrices

The usual approach for processing strings of characters with neural networks is vector encoding. Dictionary-based one-hot vectors are used in [4, 6, 15]. For the analogy resolution task, such vectors could be built at the character level. Unfortunately, this vector representation presents some problems. First, strings vary in length. Second, dictionary-based vectors are language-specific. These limitations restrict the use of one-hot vectors to fixed-length strings in a specific language.

Proportional analogy can be processed by computing similarities through edit distances. Consequently, a representation of the similarity of strings seems appropriate. Alignment matrices are widely adopted in the genetic sequence alignment task [2]. They encode a pair of sequences into a matrix where each cell represents a local matching point. Figure 1a (left) shows the alignment between the strings 'harder' and 'hard'. Local matching positions, with the value 1.0, are shown as black cells. Unmatched positions, with the value 0.0, are shown as white cells.

Fig. 1. (a) An example of an alignment matrix: original (left), re-sampled (right). (b) Comparison between re-sampling methods: original matrix (1), proposed (2), nearest neighbor (3), bilinear (4), bicubic (5). (c) An example of the filtering methods.

3.2 Matrix Re-sampling

The alignment matrices vary in dimension depending on the lengths of the input strings. To feed the alignment data to a neural network, the matrices need to be re-sampled into matrices of a fixed dimension.

\begin{pmatrix} a_1 b_1 & \cdots & a_m b_1 \\ \vdots & \ddots & \vdots \\ a_1 b_n & \cdots & a_m b_n \end{pmatrix} \Rightarrow \begin{pmatrix} AB_{11} & \cdots & AB_{z1} \\ \vdots & \ddots & \vdots \\ AB_{1z} & \cdots & AB_{zz} \end{pmatrix}    (1)

AB_{xy} = \sum_{i=\lfloor I(y,m) \rfloor}^{\lceil I(y+1,m) \rceil} \sum_{j=\lfloor I(x,n) \rfloor}^{\lceil I(x+1,n) \rceil} a_i b_j \times f(i,x,n) \times f(j,y,m)    (2)

where f(s,t,u) = \min(I(s+1,u), t+1) - \max(I(s,u), t+1) and I(w,n) = \frac{w \times n}{z}.

Fig. 2. Equivalent prediction flows.

Input strings A and B, with lengths m and n, are denoted by a_1 a_2 ... a_m and b_1 b_2 ... b_n respectively. The original alignment matrix (Fig. 1, left), of dimension m × n, is re-sampled into AB of dimension z × z (Fig. 1, right).
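To make the two representation steps concrete, the sketch below builds the binary alignment matrix of Sect. 3.1 and re-samples it to a fixed size in the spirit of (2). It is an illustration only, not the authors' implementation: the function names are ours, and the normalisation of each output cell by the overlap area is an assumption made to keep the values in [0, 1].

```python
import numpy as np

def alignment_matrix(a: str, b: str) -> np.ndarray:
    """Binary alignment matrix: cell (i, j) is 1.0 iff a[i] == b[j] (Sect. 3.1)."""
    return np.array([[1.0 if ca == cb else 0.0 for cb in b] for ca in a])

def resample(m: np.ndarray, z: int) -> np.ndarray:
    """Re-sample an arbitrary-size matrix into z x z by non-uniform averaging:
    each output cell covers a 1/z-th of each axis and averages the input cells
    that overlap it, weighted by the overlap (cf. Eq. (2))."""
    rows, cols = m.shape
    out = np.zeros((z, z))
    for x in range(z):
        for y in range(z):
            r0, r1 = x * rows / z, (x + 1) * rows / z   # input rows covered by (x, y)
            c0, c1 = y * cols / z, (y + 1) * cols / z   # input columns covered by (x, y)
            acc, area = 0.0, 0.0
            for i in range(int(np.floor(r0)), int(np.ceil(r1))):
                for j in range(int(np.floor(c0)), int(np.ceil(c1))):
                    w = (min(i + 1, r1) - max(i, r0)) * (min(j + 1, c1) - max(j, c0))
                    acc += m[i, j] * w
                    area += w
            out[x, y] = acc / area if area > 0 else 0.0
    return out

AB = resample(alignment_matrix("harder", "hard"), z=16)  # 6 x 4 matrix -> 16 x 16
```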
The formula is given in (2): the matrix AB results from a non-uniform linear transformation on both axes. Figure 1b illustrates this. The difference with other image re-sampling methods is shown in Fig. 1c. Our method generates sharp edges while keeping the diagonal line visible.

3.3 Filtering

Anomalous black cells appear because of duplicate characters. As they do not belong to any valid alignment, these cells degrade the prediction quality. We thus introduce two filtering methods to remedy this.

Mathematical Morphology. Originally proposed in [14], mathematical morphology enhances an input image by means of specified filters and operations. Alignment noise usually appears in positions off the main diagonal. An example is the cell on the right of Fig. 1a which matches the character 'r' in 'hard' with the last character of 'harder'. Two 3 × 3 filters are shown in Fig. 1b (2, top). The original matrix is filtered by both filters using grey-scale erosion, and the two filtered matrices are combined by taking their maximum values. With this procedure, black cells appearing off a diagonal line are filtered out, while diagonal lines keep sharp endings. Figure 1b (2, bottom) shows the result.

Diagonal Weight. Usually, the valid alignments are located on the main diagonal line, so we apply a linear weighting scheme along this diagonal. Cells further away from the line receive gradually lower weights. Figure 1b (4, top) shows the weighting filter. Equation (3) gives the value at position (x, y) of the weight-filtered matrix AB, where z is the size of the matrix. Figure 1b (4, bottom) shows the result of this filtering method.

AB_{xy} = a_x b_y \times \left(1 - \frac{|x - y|}{z}\right)    (3)

3.4 Neural Network Model

To design an appropriate neural network structure, we rely on the properties of analogies. An important property is the equivalence between the different forms of an analogy. As described in [8], a single form A : B :: C : D has 7 other equivalent forms: A : C :: B : D, B : A :: D : C, B : D :: A : C, C : A :: D : B, C : D :: A : B, D : B :: C : A, D : C :: B : A. Another property is the mirroring of strings: if A : B :: C : D is an analogy, then Ā : B̄ :: C̄ : D̄ holds too, where Ā denotes the mirror of A, i.e., the mirror of the string a_1 a_2 ... a_m is a_m a_{m-1} ... a_1. As a result, we get eight additional equivalent equations. The equivalent prediction flows are shown in Fig. 2. The four boxes in the corners represent the alignment matrices generated from the input strings (mirrored versions on the right). Matrix AB is the fixed-dimension alignment matrix built from strings A and B using the methods explained in Sects. 3.1 to 3.3. The two input matrices are flattened and concatenated to form the input vector; the concatenations are represented by the circle connections. The merged data are fed into the neural network. The dashed line represents the neural network structure, whose parameters are shared across all equivalent prediction flows. The network is trained on all equivalent data flows. Note that the input alignment matrices need to be given in the correct orientation.

Fig. 3. Neural network structure.

The neural network (AB, AC) → BD is detailed in Fig. 3. The network predicts the matrix BD. The input data is the flattened and concatenated representation of the matrices AB and AC. The total number of input nodes is thus 2 × z². The flow goes through p fully connected hidden layers, where each layer has q nodes. The output layer has z² nodes. Pairs of predicted output matrices from the network are decoded into a final string by the decoder algorithm explained below.
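The paper does not name a particular framework, so the following is only a minimal sketch of such a network in Keras, with the default settings of Sect. 4.2 (one hidden layer of 128 ReLU nodes, MSE loss, Adam optimizer, z = 16). The function names, the linear output layer and the way training pairs are assembled are our own assumptions.

```python
import numpy as np
from tensorflow import keras

z, p, q = 16, 1, 128  # matrix size, hidden layers, nodes per layer (defaults of Sect. 4.2)

def build_model(z: int, p: int, q: int) -> keras.Model:
    """MLP from the concatenated, flattened AB and AC matrices (2*z*z inputs)
    to the flattened BD matrix (z*z outputs)."""
    model = keras.Sequential()
    model.add(keras.layers.Dense(q, activation="relu", input_shape=(2 * z * z,)))
    for _ in range(p - 1):
        model.add(keras.layers.Dense(q, activation="relu"))
    model.add(keras.layers.Dense(z * z))  # output activation is not specified in the paper
    model.compile(optimizer="adam", loss="mse")
    return model

def flow_input(AB: np.ndarray, AC: np.ndarray) -> np.ndarray:
    """One prediction flow: flatten and concatenate the two input matrices.
    Each training analogy contributes one such (input, target) pair per
    equivalent prediction flow of Fig. 2, all sharing the same model."""
    return np.concatenate([AB.reshape(-1), AC.reshape(-1)])

model = build_model(z, p, q)
# model.fit(X, Y, epochs=200)   # X: stacked flow inputs, Y: flattened target BD matrices
```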
3.5 Decoder

From the alignment matrices AB and AC, the neural network produces a pair of matrices that stand for the alignments BD and CD. The decoder converts this pair of matrices into a final string, which is a hypothesis for the solution D of the analogical equation A : B :: C : D. In the decoder, we rely on a set of properties of analogies. The first property is the relationship between the length of the output string and the lengths of the input strings, given in (4): the length of D is entirely determined by the lengths of A, B and C. Another piece of information is the number of occurrences of each symbol in the output string. In (5), |A|_a stands for the number of occurrences of symbol a in string A. Equation (5), applied to all symbols, implies (4). Our decoder is based on the following three pieces of information: the two predicted alignment matrices, the length of the output string, and the number of occurrences of each symbol.

|D| = |B| + |C| - |A|    (4)

|D|_a = |B|_a + |C|_a - |A|_a, for all symbols a    (5)

Algorithm 1: Decoder
  Input: BD, CD: the two predicted re-sampled matrices
  Input: A, B, C: the input strings
  Input: z: the length of the output string
  Data: N: number of occurrences of each character
  Data: V[c, i]: likelihood of character c at position i
  Output: D: the output string

  N, V, D ← []                                      // Initialize.
  foreach c ∈ C ∪ B ∪ A do
      N[c] ← |B|_c + |C|_c − |A|_c                  // Enforce property (5).
  foreach c ∈ A ∪ B ∪ C do
      for j ← 1 to z do
          V[c, j] ← Σ_{i : B[i]=c} BD[i, j]
                  + Σ_{i : C[i]=c} CD[i, j]         // Compute likelihoods.
  foreach c ∈ A ∪ B ∪ C do
      for i ← 1 to N[c] do
          D[i] ← argmax_c V[c, i]
  return D
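As an illustration only, a loose Python reading of Algorithm 1 is sketched below. It is not the authors' code: it assumes that the input strings are no longer than z, so that character positions can be used directly as matrix row indices (ignoring the re-sampling of rows), and it places each character greedily at its most likely free positions, which is one possible interpretation of the last loop of the algorithm.

```python
from collections import Counter
import numpy as np

def decode(BD: np.ndarray, CD: np.ndarray, A: str, B: str, C: str, z: int) -> str:
    """Decode the predicted matrices BD and CD into a hypothesis for D."""
    # Number of occurrences of each character in D, enforcing property (5).
    counts = Counter(B) + Counter(C)
    counts.subtract(Counter(A))
    n = {c: max(k, 0) for c, k in counts.items()}

    # Likelihood of character c at output position j, accumulated over the rows
    # of BD (resp. CD) whose characters in B (resp. C) are equal to c.
    V = {c: np.zeros(z) for c in n}
    for i, c in enumerate(B):
        if i < BD.shape[0]:
            V[c] += BD[i, :z]
    for i, c in enumerate(C):
        if i < CD.shape[0]:
            V[c] += CD[i, :z]

    # Place each character at its most likely free positions; the length of D
    # is fixed by property (4). Unfilled positions are simply dropped.
    length = max(len(B) + len(C) - len(A), 0)
    D = [""] * length
    for c, k in n.items():
        for j in np.argsort(-V[c])[:k]:
            if j < length and D[j] == "":
                D[j] = c
    return "".join(D)
```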
4 Experiments

4.1 Dataset

We construct a data set of analogical equations on strings. The data set is a combination of all 3,370 formal analogies in the Google test set [11] (e.g., bright : brightest :: sweet : sweetest) and 2,423 formal analogies in multiple languages as in [7] (e.g., wolf : wolves :: leaf : leaves, or (Malay) kawan : mengawani :: keliling : mengelilingi, together with Japanese and Chinese examples). We randomly selected 10% of the data as our test set; the rest is used for training. The statistics of the data set are: # of training samples = 5,214, # of test samples = 579, average edit distance = 1.78, average length of string = 7.04 ± 2.54.

Table 1. Experiment results

                               # of hyper-   Train time   Train loss   Test loss   Accuracy
                               parameters    (m:s)        (MSE)        (MSE)       (%)
  Alignment       2 × 2            1,668      4.07         0.009        0.005        1.73
  matrices        4 × 4            6,288      4.53         0.013        0.008       16.75
  size            8 × 8           24,768      5.34         0.017        0.010       67.18
                 16 × 16          98,688      7.12         0.024        0.017       79.10
                 32 × 32         394,368     14.03         0.035        0.026       84.11
  Re-sampling    NN               98,688      6.56         0.039        0.031       67.88
  methods        Bilinear         98,688      7.10         0.015        0.010       72.71
                 Bicubic          98,688      6.40         0.019        0.012       78.24
                 Proposed         98,688      7.12         0.024        0.017       79.10
  Filtering      None             98,688      6.28         0.056        0.044       77.72
  methods        Morph            98,688      6.39         0.040        0.031       76.68
                 Weight           98,688      7.32         0.034        0.025       79.45
                 Both             98,688      7.12         0.024        0.017       80.48
  Number of      128              98,688      7.12         0.024        0.017       80.83
  hidden         256             197,120      8.21         0.022        0.015       82.38
  nodes          512             393,984      9.51         0.021        0.017       83.07
                 1024            787,712     13.28         0.019        0.012       85.84
  Number of      1                98,688      7.12         0.024        0.017       80.66
  hidden         2               115,200      9.55         0.023        0.015       84.44
  layers         3               131,712     11.22         0.023        0.014       86.36
                 4               148,224     11.54         0.023        0.014       87.56

4.2 Experimental Procedure

We performed a series of experiments to evaluate the influence of each parameter (see the subsections below). Another experiment determines the highest achievable accuracy rate. For each experiment, all parameters are kept constant except the ones being tested. The basic parameter settings are: size of alignment matrices = 16 × 16, re-sampling method = proposed, filtering method = both, number of hidden nodes = 128, number of hidden layers = 1, loss function = MSE, activation = ReLU [12], optimizer = Adam [5], and number of epochs = 200. Any alteration to these basic parameters is clearly stated in the description of each experiment.

4.3 Evaluation

Training Time. A significant advantage of our design is its ability to learn at high speed. We ran experiments to test the influence of the various hyper-parameters and measured the training times after 200 epochs.

Fig. 4. Loss against the number of epochs (left) and accuracy against output length (right) for alignment matrix sizes from 2 × 2 to 32 × 32.

Loss. We measured the values of a loss function (or objective function). These values reflect the ability of the neural network to predict the correct alignment matrices, by comparing the output with the reference ground truth. The Mean Square Error (MSE) is given in (6), where P is the predicted matrix from the neural network and T is the ground-truth matrix. Lower values reflect a more precise prediction, hence a better model configuration.

Accuracy. We measured the accuracy of our network as the percentage of correct answers over the total number of test samples (see (7)). If one or more characters in the output string differ from the reference string, the prediction is counted as a failure.

MSE = \frac{1}{z^2} \sum_{i=1}^{z} \sum_{j=1}^{z} (P_{ij} - T_{ij})^2    (6)

Accuracy = \frac{\text{# of correct answers}}{\text{total # of test samples}} \times 100    (7)
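For concreteness, the two measures can be computed as in the short sketch below; the function names are ours and this is an illustration, not the authors' evaluation code.

```python
import numpy as np

def mse(P: np.ndarray, T: np.ndarray) -> float:
    """Mean square error between a predicted matrix P and its ground truth T, Eq. (6)."""
    return float(np.mean((P - T) ** 2))

def accuracy(predicted: list, reference: list) -> float:
    """Exact-match accuracy over the test set, Eq. (7): a single differing
    character makes the whole prediction count as a failure."""
    correct = sum(p == r for p, r in zip(predicted, reference))
    return 100.0 * correct / len(reference)
```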
5 Experiment Results

Size of Alignment Matrices. In this experiment, the size of the alignment matrices is varied from 2 to 32 by successive powers of 2. The results in Fig. 4 show the expected behaviour: the bigger the alignment matrices, the higher the accuracy. The highest rate of 84.11% is obtained with 32 × 32 matrices. Figure 4 (right) shows that the size of the alignment matrices directly contributes to the ability to output longer solutions. The downside of bigger alignment matrices is the number of network connections, which causes an exponential explosion in the number of hyper-parameters. Table 1 shows that training times increase with the number of hyper-parameters.

Matrix Re-sampling Methods. We compared our re-sampling method with the nearest neighbor, bilinear and bicubic methods. The results show that our re-sampling method achieves the highest accuracy.

Filtering Methods. As we propose two filtering methods, we test all combinations: no filtering, only one of the two, or both, i.e., four combinations. The results in Table 1 show that the use of both filtering methods yields the highest accuracy.

Fig. 5. Accuracy against output length.

We observe that the use of mathematical morphology alone yields a lower accuracy rate than no filtering at all. This may indicate some limitation of the selected morphological filters.

Number of Hidden Nodes. We set the number of hidden layers to one and vary the number of nodes. A higher number of hidden nodes reflects the ability to recognize more complex patterns. The results show that networks with more hidden nodes yield higher accuracy rates, at the expense of training time.

Number of Hidden Layers. Deeper networks are usually favored for more complicated patterns, a disadvantage being longer learning times. In this experiment, each network has the same number of hidden nodes (128) per layer, but a different number of layers. As expected, a deeper structure yields better recognition rates. Interestingly, with four hidden layers, the number of hyper-parameters is much lower than for the one large single-layer network of the previous experiment; yet the deeper network achieves a better accuracy rate with a shorter training time.

Benchmark. We select an extreme configuration (alignment matrices = 32 × 32, number of hidden nodes = 1024, number of hidden layers = 2) and compare the results with the baseline algorithm given in [7]. Our proposed neural network achieves a higher accuracy rate than the baseline: 95.68% against 94.47%. The baseline does not need any training, while our network needed to be trained for 36 minutes. The advantage in accuracy may come from the fact that our dataset contains some samples which do not comply with the formalization behind the baseline algorithm. Figure 5 shows that our network still has some limitation in solving equations with longer strings. Nevertheless, these results prove the effectiveness of our neural network model for the task.

6 Conclusion

In this work, a neural network design to solve analogical equations of commutation on strings of symbols has been proposed. We presented several methods to transform the input strings into a matrix representation and to transform the predicted matrices back into an output string. Two filtering methods to reduce the alignment noise were introduced. The model parameters were tested in a number of experiments. They show promising results, as an accuracy of more than 95% was achieved. The comparison with a baseline system showed a higher accuracy rate on the test set.

We intend to further improve our neural network in the future. From the reported experiments, the accuracy degrades with the length of the strings. Improvements in the network structure may help to improve the prediction accuracy. Also, the decoding algorithm impacts the accuracy of the final string, and the decoding scheme can be further improved.

References

1. Dosovitskiy, A., Tobias Springenberg, J., Brox, T.: Learning to generate chairs with convolutional neural networks. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 1538–1546 (2015)
2. Gibbs, A.J., McIntyre, G.A.: The diagram, a method for comparing sequences. European Journal of Biochemistry 16(1), 1–11 (1970)
3. Gregor, K., Danihelka, I., Graves, A., Rezende, D.J., Wierstra, D.: DRAW: A recurrent neural network for image generation. In: Proceedings of the 32nd International Conference on Machine Learning (ICML-15) (2015)
4. Johnson, R., Zhang, T.: Semi-supervised convolutional neural networks for text categorization via region embedding. In: Advances in Neural Information Processing Systems. pp. 919–927 (2015)
5. Kingma, D., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of the 3rd International Conference on Learning Representations (2014)
6. Lai, S., Xu, L., Liu, K., Zhao, J.: Recurrent convolutional neural networks for text classification. In: Proceedings of the 29th AAAI Conference on Artificial Intelligence. pp. 2267–2273 (2015)
7. Lepage, Y.: Solving analogies on words: an algorithm. In: Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics (1998)
8. Lepage, Y.: Languages of analogical strings. In: Proceedings of the 18th Conference on Computational Linguistics, Volume 1. pp. 488–494. Association for Computational Linguistics (2000)
9. Lepage, Y., Denoual, E.: Purest ever example-based machine translation: Detailed presentation and assessment. Machine Translation 19(3–4), 251–282 (2005)
10. Miclet, L., Delhay, A.: Analogy on sequences: a definition and an algorithm. Research report, INRIA (2003)
11. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. In: International Conference on Learning Representations (2013)
12. Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10). pp. 807–814 (2010)
13. Reed, S.E., Zhang, Y., Zhang, Y., Lee, H.: Deep visual analogy-making. In: Advances in Neural Information Processing Systems. pp. 1252–1260 (2015)
14. Serra, J.: Image Analysis and Mathematical Morphology. Academic Press (1982)
15. Zhang, X., Zhao, J., LeCun, Y.: Character-level convolutional networks for text classification. In: Advances in Neural Information Processing Systems. pp. 649–657 (2015)