13 A study of the saturation of analogical grids agnostically extracted from texts Rashel Fam and Yves Lepage ? IPS, Waseda University 2-7 Hibikino, Wakamatsu-ku, Kitakyushu-shi, 808-0135 Fukuoka-ken, Japan fam.rashel@fuji.waseda.jp, yves.lepage@waseda.jp Abstract. Analogical grids aim to capture the organization of the lexi- con of a language. We conduct experiments on analogical grids extracted in four different languages with different morphological richness. We study the saturation of analogical grids against their size. We observe that the logarithm of the saturation of an analogical grid is linear in the logarithm of its size. More surprisingly, the coefficients of this log-log linear relation are extremely close across all four languages, even when the size or the genre of the corpus vary. Keywords: analogical grids, saturation, organization of lexicon. 1 Introduction and background show : shows : showing : showed makan : dimakan : memakan : makanan walk : walks : walking : walked minum : diminum : meminum : minuman open : opens : opening : main : : : mainan study : : studying : beli : dibeli : : read : reads : reading : Fig. 1. Analogical grids in English (left) and Indonesian (right). Figure 1 shows two examples of analogical grids, one in English, the other one in Indonesian. Such analogical grids may be automatically constructed from the set of words contained in a text. Each cell in an analogical grid either con- tains a word form or is empty. As exemplified in Figure 1 (left), a column (or a row) in an analogical grid usually exhibits similar word forms for different words: e.g., infinitive, present 3rd person singular, present participle, etc. for different English verbs on the left of Figure 1. Analogical grids are not paradigm tables, ? This work was supported by a JSPS Grant, Number 15K00317 (Kakenhi C), entitled Language productivity: efficient extraction of productive analogical clusters and their evaluation using statistical machine translation. Copyright © 2017 for this paper by its authors. Copying permitted for private and academic purpose. In Proceedings of the ICCBR 2017 Workshops. Trondheim, Norway 14 i.e., they are not the result of a linguistic formalization with explicit lexemes and exponents as in standard works in morphology, but they constitute a pre- liminary step in that direction. Analogical grids too give a compact view of the organization of the lexicon, but they are the output of an empirical procedure, e.g., the one introduced in [4]. Analogical grids can be used to study word productivity in a given language as in [12, 9, 6]. They can also be used to make comparisons across languages as in [4], where the goal is to explain unseen words by using analogical grids automatically built from the set of all words contained in texts in 12 different languages. In this paper, we report an interesting phenomenon observed when building analogical grids in various different languages using the method in [4]. This phe- nomenon relates the saturation of the obtained analogical grids to their size. The experimental results show that the coefficients which characterize the relation would not be influenced by the size, the genre or the language of the texts used. The paper is organized as follows: Section 2 introduces basic notions related to analogical grids. Section 3 presents our experiments on four languages with different richness in morphology. It analyzes the results and explores the relation- ship between the saturation and the size of analogical grids. Section 4 presents further experiments to inquire the relation. Section 5 gives conclusion. 2 Basic notions In this section, we mathematically define the basic notions related to analogical grids. The method to extract such analogical grids has already been presented elsewhere [8, 4]. 2.1 Illustration with toy data Anto memakan nasi dan meminum air. Nasi itu dibeli di pasar. Di pasar, Anto melihat mainan. Anto senang main bola. Setelah main, Anto suka minum es dan makan cilok. Makanan dan minuman itu juga dia beli di pasar. Es dan cilok memang enak dimakan dan diminum selesai olahraga. air anto beli bola cilok dan di dia dibeli dimakan diminum enak es itu juga main mainan makan makanan melihat memakan memang meminum minum minuman nasi olahraga pasar selesai senang setelah suka Fig. 2. A text in Indonesian (above) and the list of words extracted from it (below ). Words appearing in Figure 1 (right) are boldfaced. The top of Figure 2 is a forged example text in Indonesian, a language which is known for its relative richness in derivational morphology. We intentionally do 15 not give its translation into English to place the reader in the agnostic position of the computer in front of such data. The list of words, sorted in lexicographic order, that can be extracted from this text, is given at the bottom of Figure 2. From this word list, some commonalities between words can be identified at a glance. An example is the word makan and the word makanan. Another is the words bola and beli which share the same consonants in the same order: b and l. However, the existence of only one pair is not enough to support the evidence that two words are actually in relation one with the other. On the contrary, for the words makan and the word makanan, the same ratio is seen to hold between several other word pairs from the same text, like minum and minuman, or main and mainan. These actually reflect a phenomenon in Indonesian morphology by using the suffix -an which builds a noun from active verb. In standard linguistics, a systematization of these relationships between word forms is given by paradigm tables, which is the result of linguistic formalisation. Here, we agnostically extract analogical grids relying on a formal relationship between words, proportional analogy. The right part of Figure 1 shows the ana- logical grid extracted from the set of words given in Figure 2. 2.2 Analogical grids An analogical grid is a table of dimension M × N as defined by Formula (1). As illustrated by Figure 1, analogical grids extracted from texts usually contain empty cells. (Caution: there is no importance in the order of lines or rows.) P11 : P12 : · · · : P1m P21 : P22 : · · · : P2m ∀(i, k) ∈ {1, . . . , n}2 , ∆ .. .. .. ⇐⇒ ∀(j, l) ∈ {1, . . . , m}2 , (1) . . . Pij : Pil :: Pkj : Pkl Pn1 : Pn2 : · · · : Pnm The definition of analogical grids in Formula (1) implies that for any four word forms at the intersection of two rows and two columns form a proportional analogy between sequences of characters [7, 13]. A proportional analogy is defined as a relationship between four objects where two properties are met: (a) equality of ratios (defined hereafter) between the first and the second terms on one hand, and the third and the fourth terms on the other hand, and (b) exchange of the means (the second and the third terms can always be ex- changed).  ∆ A:B=C:D A : B :: C : D ⇐⇒ (2) A:C =B:D According to Formula (1), we can get many analogies from analogical grids in Figure 1. Figure 3 shows three of them. We define the ratio between two words in Formula (3) as a vector of features made up of all the differences in number of occurrences in the two words, for all 16 makan : makanan :: main : mainan makan : memakan :: minum : meminum minum : diminum :: beli : dibeli Fig. 3. Some analogies extracted from analogical grid in Figure 1 (right). the characters, whatever the writing system, plus, the distance between the two words.     |A|a − |B|a −1  |A|b − |B|b  0     ∆  .. ∆  . A:B =  makan : makanan =  .. (3)   .     |A|z − |B|z  0 d(A, B) 2 In Formula (3), the notation |S|c stands for the number of occurrences of char- acter c in string S. The last dimension, written as d(A, B), is the edit distance between the two strings. This indirectly gives the number of common characters appearing in the same order in A and B.1 The above definition of ratios captures prefixing and suffixing. Although we do not show it here, this definition also captures parallel infixing or interdigita- tion, well-known phenomena in semitic languages [1, 14]. However, reduplication or repetition (e.g. consonant spreading) are not captured by this definition. makan : makanan main : mainan makan : main makanan : mainan −1 −1         1 1 0 0 0 0          .  .  .  .  .. =  .. &  .. =  ..         0 0 0 0 2 2 3 3 ⇒ makan : makanan :: main : mainan Fig. 4. The two ratios between pairs of words for the first analogy in Figure 3. This formal definition of word ratio in Formula (3) gives the same vector for the ratios makan : makanan , makan : namakan , and makan : mnaakan . This is due to the use of insertion and deletion as the only edit operations. The purpose of working with analogical grid, and not only with individual analogies, is that Formula (1) imposes more constraints for a word form to enter 1 The only two edit operations used are insertion and deletion, hence, d(A, B) = |A| + |B| − 2 × s(A, B). |S| denotes the length of a string S and s(A, B) is the length of the longest common sub-sequence (LCS) between A and B. 17 a grid: a word form in a grid must satisfy all analogy relationship with all sur- rounding word forms in the grid. The word form makanan in the analogical grid of Figure 1 (right) is the only word form which fits in, among makanan, namakan, or mnaakan. For example, as proved below, using the words main and mainan from the analogical grid, the inequality between the ratios makan : main and namakan : mainan implies that there is no analogy between these four words. The same holds for the word form mnaakan. In all these cases, the inequality comes from different edit distance values. makan  : main  namakan  : mainan  1 1 0 0      ..  ..  . 6=  . ⇒ makan : main 6:: namakan : mainan     0 0 3 5 The above discussion shows that there should be a relationship between the size of the analogical grids and the freedom in filling an empty cell in an analog- ical grid. 2.3 Size and saturation of analogical grids We simply define the size of an analogical grid as its number of rows multiplied by its number of columns. The analogical grids in Figure 1 has a size of 4×5 = 20 (left) and 4 × 4 = 16 (right) respectively. Let us now turn to the number of empty cells of an analogical grid, or rather the number of non-empty cells which we call its saturation 2 . We compute it using Formula (4) which will give a saturation of 80 % (left) and 75 % (right) for Figure 1. Number of empty cells ×100 Saturation = 100 − (4) Total number of cells 3 Experiments 3.1 Data used We carried out experiments on a multilingual parallel corpus created from the translation of the Bible collected by Christodoulopoulos3 [10]. We selected four languages with different richness in morphology: English, Russian, Modern Greek, and Indonesian. The reason for using a multilingual parallel corpus is the need to draw conclusions across different languages in a reliable way. Table 1 presents statistics on the corpus. For each text in each language, we first extracted the list of all words, and finally built all analogical grids. 2 In [2, p. 79], saturation is the maximal proportion of word forms attested for any one lemma of a given paradigm. Here we use the term for each entire grid. 3 http://homepages.inf.ed.ac.uk/s0787820/bible/ 18 # tokens # types Length of types Time Language # grids (N ) (V ) avg±std. dev. (h:min) English 792,074 12,498 7.03 ± 2.18 12,855 45 Indonesian 648,606 15,641 7.84 ± 2.63 25,752 2:04 Modern Greek 706,771 36,786 8.49 ± 2.49 69,173 11:03 Russian 560,524 47,226 8.26 ± 2.73 60,035 10:34 Table 1. Statistics on the Bible corpus and number of analogical clusters and number of analogical grids produced in each language with the time needed to produce them 3.2 Analogical grids obtained English Indonesian Modern Greek Russian 106 106 106 106 5 5 5 5 10 10 10 10 Number of analogical grids Number of analogical grids Number of analogical grids Number of analogical grids 104 104 104 104 103 103 103 103 102 102 102 102 10 10 10 10 1 1 1 1 1 103 106 10 9 1 103 10 6 9 10 1 103 106 9 10 1 103 106 109 Analogical grid size Analogical grid size Analogical grid size Analogical grid size Fig. 5. Number of analogical grids with the same size in each language. Logarithmic scale on both axes. From left to right: English, Indonesian, Modern Greek and Russian. Same ranges along the axes for all languages. Table 1 shows the number of analogical grids produced in each language. These numbers show that English produced the lowest number of analogical grids. Indonesian produced twice as many tables as English. Modern Greek and Russian produced five times more tables than English. Modern Greek produced a larger amount of analogical grids than Russian despite its lesser number of analogical clusters. To summarize, languages with poorer morphology tend to produce less analogical grids than languages with richer morphology, which meets intuition. Let us recall that, by construction, on the contrary to many previous works in morphological induction [11, 5, 3, etc.], our analogical grids do not contain in any way information about word frequency, word context, nor the frequency or distribution of morphemes or the like. 3.3 Size and saturation of analogical grids The graphs at the bottom of Figure 5 show the number of analogical grids with the same sizes in each language. Most of the analogical grids have a small size. The number of analogical grids with the same size decreases gradually as the size increases. Languages with a richer morphology produce bigger analogical 19 grids in average and also more analogical grids for a given size. All of this meets intuition. English Indonesian Modern Greek Russian 100% 100% 100% 100% 10% 10% 10% 10% Saturation Saturation Saturation Saturation 1% 1% 1% 1% 0.1% 0.1% 0.1% 0.1% 0.01% 0.01% 0.01% 0.01% 1 1000 1M 1B 1 1000 1M 1B 1 1000 1M 1B 1 1000 1M 1B Size Size Size Size English Indonesian Modern Greek Russian 100% 100% 100% 100% 90% 90% 90% 90% 80% 80% 80% 80% Saturation Saturation Saturation Saturation 70% 70% 70% 70% 60% 60% 60% 60% 50% 50% 50% 50% 6 10 20 50 100 6 10 20 50 100 6 10 20 50 100 6 10 20 50 100 Size Size Size Size Fig. 6. Saturation of analogical grids against size in each language. From left to right: English, Indonesian, Modern Greek and Russian. Algorithmic scale on both horizontal (size) and vertical (saturation) axes. Saturation (in ordinates) in the range [0 %, 100 %] (top) and in the range [50 %, 100 %] (bottom). Same ranges along the horizontal axes for all languages for the same range of saturation. We now turn to the study of the saturation of analogical grids compared to their size. The top of Figure 6 shows saturation against size for analogical grids in each language. Analogical grids with smaller sizes tend to have higher saturation. Some tables are extremely sparse. Because of the logarithmic scale on the y-axis, the bottom half is for tables with a saturation less than 1 %. In all cases, the plots exhibit a similar linear shape in logarithmic scale across all languages. This would correspond to Formula (5). We confirmed the similarity by the computation of the coefficients a and b for each language, as obtained by the least squares method. These coefficients are presented in Table 2. They are almost the same in all languages. log(saturation) = a × log(size) + b (5) As mentioned in Section 2.2, intuitively, analogical grids with higher satura- tion are more reliable to fill in because there are more word forms around the empty cells as supporting evidence. However, it may not always be the case. For instance, an analogical grid for regular English verbs extracted from any text is very hollow but empty cells can be filled in a reliable way. 4 Discussion and further experiments Let us make a first remark on the type of the observed relation. This is not yet another instance of a Zipfian law, because, in the present case, the objects are 20 not ranked individually according to their frequency (number of occurrences). In a Zipfian law, the x-axis stands for the list of individual objects ranked by frequency. Recall also that our analogical grids do not encapsulate any infor- mation about the frequency of individual words whatsoever. In our graphs, two analogical grids with the same size have the same abscissa. If they also have the same saturation, they have the same ordinate and are thus plotted as the same point. Range for saturation Language Data and size [0%,100 %] [50%,100 %] a b a b English Bible 100.0 % -0.480 0.510 -0.366 0.332 50.0 % -0.479 0.507 -0.372 0.343 25.0 % -0.476 0.499 -0.368 0.336 12.5 % -0.474 0.491 -0.361 0.323 Europarl (same size as Bible) -0.481 0.516 -0.365 0.333 Indonesian Bible 100.0 % -0.481 0.518 -0.371 0.343 Modern Greek ” -0.479 0.514 -0.369 0.342 Russian ” -0.482 0.520 -0.370 0.342 Table 2. Linear coefficients for each language; and for different sizes and different genres in English. The interesting fact that comes into light is not so much the fact that the relation between size and saturation of analogical grids be a log–log relation, but the fact that it exhibits very similar slopes in all four languages. A reasonable explanation is that these coefficients are independent of the language because they characterize the corpus used. The corpus is defined by its size and its genre. We first inquired whether the coefficients depend on the size of the corpus used. We performed the same experiment in English and let the size of the corpus vary: a half, a quarter, an eighth of the original size. The computation of the coefficients led to very similar results as shown in Table 2. We then inquired the influence of the genre and performed the same experi- ment with the same size of text in English again. We chose the Europarl corpus for this experiment. Again, the computation of the linear coefficients led to very similar results, as shown in Table 2. Further experiments with more parameters varying are required to confirm that the coefficients of the relationship between saturation and the size are al- ways very similar. However, for the time being, we observe that the parameters are relatively close at least for these four languages whith different richness in morphology. 21 5 Conclusion We studied analogical grids in different languages with different morphological richness. These analogical grids were automatically built from actual texts, us- ing a technique which has been presented in previous work. Without surprise, languages known to be richer in morphology produce bigger and more analogi- cal grids than languages less rich in morphology. Empty cells in such analogical grids are interesting because they could be filled by words that should then be tested against the actual language. We studied the relation between size and saturation in analogical grids. Ex- perimental results clearly showed that the logarithm of the saturation of an analogical grids linearly depends on the logarithm of its size. This is not so surprising. More interestingly, the computation of the coefficients characterizing this log-log linear relation led to the result that, across all the four languages used, and even when having size and genre varying in one language, these coef- ficients are almost always the same: the relation between the saturation and the size of an analogical grid would be almost independent of the size, the genre and the language of a text. References 1. Beesley, K.R.: Consonant spreading in Arabic stems. In: Proceedings of COLING- ACL’98. vol. I, pp. 117–123. Montréal (Aug 1998), http://www.aclweb.org/ anthology/P98-1018 2. Chan, E.: Structures and distributions in morphology learning. Ph.D. thesis, University of Pennsylvania. (2008), http://nlp.cs.swarthmore.edu/~richardw/ papers/chan2008-structures.pdf 3. Dryer, M., Eisner, J.: Discovering morphological paradigms from plain text us- ing a dirichlet process mixture model. In: Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP’2011). pp. 616– 627. Association for Computational Linguistics, Edinburgh, Scotland, UK (2011), https://www.cs.jhu.edu/~jason/papers/dreyer+eisner.emnlp11.pdf 4. Fam, R., Lepage, Y.: Morphological predictability of unseen words using compu- tational analogy. In: Proceedings of the Computational Analogy Workshop at the 24th International Conference on Case-Based Reasoning (ICCBR-CA-16). pp. 51– 60. Atlanta, Georgia (2016) 5. Goldsmith, J.: Unsupervised learning of the morphology of a natural language. Computational Linguistics 27, 153–198 (2001) 6. Hathout, N.: Acquisition of the morphological structure of the lexicon based on lexical similarity and formal analogy. In: Proceedings of the 3rd Textgraphs workshop on Graph-based Algorithms for Natural Language Processing. pp. 1– 8. Coling 2008 Organizing Committee, Manchester, UK (August 2008), http: //www.aclweb.org/anthology/W08-2001 7. Langlais, P., Yvon, F.: Scaling up analogical learning. In: Coling 2008: Companion volume: Posters. pp. 51–54. Coling 2008 Organizing Committee, Manchester, UK (August 2008), http://www.aclweb.org/anthology/C08-2013 22 8. Lepage, Y.: Analogies between binary images: Application to Chinese charac- ters. In: Prade, H., Richard, G. (eds.) Computational Approaches to Analogi- cal Reasoning: Current Trends, pp. 25–57. Springer, Berlin, Heidelberg (2014), http://dx.doi.org/10.1007/978-3-642-54516-0_2 9. Neuvel, S., Fulop, S.A.: Unsupervised learning of morphology without morphemes. In: Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning. pp. 31–40. Association for Computational Linguistics (July 2002), http: //www.aclweb.org/anthology/W02-0604 10. Resnik, P., Olsen, M.B., Diab, M.: The Bible as a parallel corpus: Annotating the ‘book of 2000 tongues’. Computers and the Humanities 33(1), 129–153 (1999), http://dx.doi.org/10.1023/A:1001798929185 11. Schone, P., Jurafsky, D.: Knowledge-free induction of morphology using latent se- mantic analysis. In: Proceedings of CoNLL-2000 and LLL-2000. pp. 67–72. Lisbon, Portugal (2000), http://web.stanford.edu/~jurafsky/W00-0712.pdf 12. Singh, R., Ford, A.: In praise of Sakatayana: some remarks on whole word morphol- ogy. In: Singh, R. (ed.) The Yearbook of South Asian Languages and Linguistics- 200. Sage, Thousand Oaks (2000) 13. Stroppa, N., Yvon, F.: An analogical learner for morphological analysis. In: Pro- ceedings of the Ninth Conference on Computational Natural Language Learning (CoNLL-2005). pp. 120–127. Association for Computational Linguistics, Ann Ar- bor, Michigan (June 2005), http://www.aclweb.org/anthology/W/W05/W05-0616 14. Wintner, S.: Natural Language Processing of Semitic Languages, chap. Morpho- logical Processing of Semitic Languages, pp. 43–66. Springer, Berlin, Heidelberg (2014), http://dx.doi.org/10.1007/978-3-642-45358-8_2