A study of the saturation of analogical grids agnostically extracted from texts

Rashel Fam

fam.rashel@fuji.waseda.jp

Yves Lepage*

yves.lepage@waseda.jp

IPS, Waseda University, 2-7 Hibikino, Wakamatsu-ku, Kitakyushu-shi, 808-0135 Fukuoka-ken, Japan

Analogical grids aim to capture the organization of the lexicon of a language. We conduct experiments on analogical grids extracted in four different languages with different morphological richness. We study the saturation of analogical grids against their size. We observe that the logarithm of the saturation of an analogical grid is linear in the logarithm of its size. More surprisingly, the coefficients of this log-log linear relation are extremely close across all four languages, even when the size or the genre of the corpus varies.

Keywords: analogical grids, saturation, organization of lexicon

1 Introduction and background

English (left):

show  : shows  : showing  : showed
walk  : walks  : walking  : walked
open  : opens  : opening  :
study :        : studying :
read  : reads  : reading  :

Indonesian (right):

makan : dimakan : memakan : makanan
minum : diminum : meminum : minuman
main  :         :         : mainan
beli  : dibeli  :         :

Figure 1. Analogical grids in English (left) and Indonesian (right).

Figure 1 shows two examples of analogical grids, one in English, the other one in Indonesian. Such analogical grids may be automatically constructed from the set of words contained in a text. Each cell in an analogical grid either contains a word form or is empty. As exemplified in Figure 1 (left), a column (or a row) in an analogical grid usually exhibits similar word forms for different words: e.g., infinitive, present 3rd person singular, present participle, etc. for different English verbs on the left of Figure 1. Analogical grids are not paradigm tables, i.e., they are not the result of a linguistic formalization with explicit lexemes and exponents as in standard works in morphology, but they constitute a preliminary step in that direction. Analogical grids too give a compact view of the organization of the lexicon, but they are the output of an empirical procedure, e.g., the one introduced in [4].

* This work was supported by a JSPS Grant, Number 15K00317 (Kakenhi C), entitled Language productivity: efficient extraction of productive analogical clusters and their evaluation using statistical machine translation.

Copyright © 2017 for this paper by its authors. Copying permitted for private and academic purpose. In Proceedings of the ICCBR 2017 Workshops. Trondheim, Norway

Analogical grids can be used to study word productivity in a given language as in [12, 9, 6]. They can also be used to make comparisons across languages as in [4], where the goal is to explain unseen words by using analogical grids automatically built from the set of all words contained in texts in 12 different languages.

In this paper, we report an interesting phenomenon observed when building analogical grids in several different languages using the method in [4]. This phenomenon relates the saturation of the obtained analogical grids to their size. The experimental results suggest that the coefficients which characterize the relation are not influenced by the size, the genre or the language of the texts used.

The paper is organized as follows: Section 2 introduces basic notions related to analogical grids. Section 3 presents our experiments on four languages with different richness in morphology. It analyzes the results and explores the relationship between the saturation and the size of analogical grids. Section 4 presents further experiments to investigate this relation. Section 5 concludes.

2 Basic notions

In this section, we mathematically define the basic notions related to analogical grids. The method to extract such analogical grids has already been presented elsewhere [8, 4].

2.1 Illustration with toy data

Anto memakan nasi dan meminum air. Nasi itu dibeli di pasar. Di pasar, Anto melihat mainan. Anto senang main bola. Setelah main, Anto suka minum es dan makan cilok. Makanan dan minuman itu juga dia beli di pasar. Es dan cilok memang enak dimakan dan diminum selesai olahraga.

air anto beli bola cilok dan di dia dibeli dimakan diminum enak es itu juga main mainan makan makanan melihat memakan memang meminum minum minuman nasi olahraga pasar selesai senang setelah suka

Figure 2. An example text in Indonesian (top) and the list of words extracted from it, sorted in lexicographic order (bottom).

The top of Figure 2 is a forged example text in Indonesian, a language which is known for its relative richness in derivational morphology. We intentionally do not give its translation into English to place the reader in the agnostic position of the computer in front of such data. The list of words, sorted in lexicographic order, that can be extracted from this text, is given at the bottom of Figure 2.
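The extraction of the word list of Figure 2 can be sketched as follows. This is only an illustration under assumptions of ours: the function name `word_list`, the lowercasing, and the regular-expression tokenization are not taken from the paper, although they do reproduce the list shown in Figure 2.

```python
import re

# Toy text of Figure 2 (top).
TEXT = (
    "Anto memakan nasi dan meminum air. Nasi itu dibeli di pasar. "
    "Di pasar, Anto melihat mainan. Anto senang main bola. "
    "Setelah main, Anto suka minum es dan makan cilok. "
    "Makanan dan minuman itu juga dia beli di pasar. "
    "Es dan cilok memang enak dimakan dan diminum selesai olahraga."
)

def word_list(text):
    """Return the sorted list of distinct, lowercased word forms in text.

    Hypothetical helper: tokens are maximal runs of letters, so the
    procedure needs no knowledge of the language (agnostic setting).
    """
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return sorted(set(tokens))

WORDS = word_list(TEXT)
print(len(WORDS), "word forms:", " ".join(WORDS))
```

The result is a set of word forms: frequencies are discarded at this stage, which matches the remark made later that the grids carry no frequency information.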

From this word list, some commonalities between words can be identified at a glance. An example is the word makan and the word makanan. Another is the words bola and beli, which share the same consonants in the same order: b and l. However, a single pair is not enough evidence that two words are actually related to one another. By contrast, for the words makan and makanan, the same ratio is seen to hold between several other word pairs from the same text, like minum and minuman, or main and mainan. These pairs reflect a phenomenon in Indonesian morphology: the suffix -an, which builds a noun from an active verb.

In standard linguistics, a systematization of these relationships between word forms is given by paradigm tables, which are the result of linguistic formalisation. Here, we agnostically extract analogical grids relying on a formal relationship between words, proportional analogy. The right part of Figure 1 shows the analogical grid extracted from the set of words given in Figure 2.

2.2 Analogical grids

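An analogical grid is a table of dimension $n \times m$ ($n$ rows and $m$ columns) as defined by Formula (1). As illustrated by Figure 1, analogical grids extracted from texts usually contain empty cells. (Caution: the order of the rows and the order of the columns do not matter.)

$$
\begin{array}{ccccccc}
P_{1,1} & : & P_{1,2} & : & \cdots & : & P_{1,m} \\
P_{2,1} & : & P_{2,2} & : & \cdots & : & P_{2,m} \\
\vdots  &   & \vdots  &   &        &   & \vdots  \\
P_{n,1} & : & P_{n,2} & : & \cdots & : & P_{n,m}
\end{array}
\iff
\begin{array}{l}
\forall (i,k) \in \{1,\ldots,n\}^2,\ \forall (j,l) \in \{1,\ldots,m\}^2, \\
P_{i,j} : P_{i,l} :: P_{k,j} : P_{k,l}
\end{array}
\qquad (1)
$$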

The definition of analogical grids in Formula (1) implies that any four word forms at the intersection of two rows and two columns form a proportional analogy between sequences of characters [7, 13]. A proportional analogy is defined as a relationship between four objects where two properties are met: (a) equality of ratios (defined hereafter) between the first and the second terms on one hand, and the third and the fourth terms on the other hand, and (b) exchange of the means (the second and the third terms can always be exchanged).

$$
A : B :: C : D \iff
\begin{cases}
A : B = C : D \\
A : C = B : D
\end{cases}
\qquad (2)
$$

According to Formula ( 1 ), we can get many analogies from analogical grids in Figure 1. Figure 3 shows three of them.
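As an illustration of how analogies are read off a grid according to Formula (1), the sketch below enumerates them for the Indonesian grid of Figure 1. The representation of a grid as a list of rows with `None` for empty cells and the function name `analogies` are assumptions of ours. Only pairs of distinct rows and distinct columns with four non-empty cells are listed, so trivial analogies (same row or same column twice) and their permutations are left out.

```python
from itertools import combinations

# Indonesian grid of Figure 1 (right); None marks an empty cell.
GRID = [
    ["makan", "dimakan", "memakan", "makanan"],
    ["minum", "diminum", "meminum", "minuman"],
    ["main",  None,      None,      "mainan"],
    ["beli",  "dibeli",  None,      None],
]

def analogies(grid):
    """List the analogies A : B :: C : D found at the intersection
    of two distinct rows and two distinct columns of the grid."""
    found = []
    for i, k in combinations(range(len(grid)), 2):          # pairs of rows
        for j, l in combinations(range(len(grid[0])), 2):   # pairs of columns
            cells = (grid[i][j], grid[i][l], grid[k][j], grid[k][l])
            if None not in cells:                            # skip empty cells
                found.append(cells)
    return found

FOUND = analogies(GRID)
for a, b, c, d in FOUND:
    print(f"{a} : {b} :: {c} : {d}")
```

The three analogies quoted in Figure 3 are among those printed.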

makan : makanan :: main : mainan
makan : memakan :: minum : meminum
minum : diminum :: beli : dibeli

Figure 3. Three of the analogies contained in the analogical grids of Figure 1.

We define the ratio between two words in Formula (3) as a vector of features made up of all the differences in number of occurrences in the two words, for all the characters, whatever the writing system, plus the distance between the two words.

$$
A : B =
\begin{pmatrix}
|A|_a - |B|_a \\
|A|_b - |B|_b \\
\vdots \\
|A|_z - |B|_z \\
d(A, B)
\end{pmatrix}
\qquad (3)
$$

In Formula (3), the notation $|S|_c$ stands for the number of occurrences of character $c$ in string $S$. The last dimension, written as $d(A, B)$, is the edit distance between the two strings. This indirectly gives the number of common characters appearing in the same order in A and B.¹

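The above definition of ratios captures prefixing and suffixing. Although we do not show it here, this definition also captures parallel infixing or interdigitation, well-known phenomena in Semitic languages [1, 14]. However, reduplication or repetition (e.g. consonant spreading) are not captured by this definition.

Applying Formula (3) to the pairs makan : makanan and main : mainan gives the same vector. Only the rows for a and n and the final distance row are non-zero:

$$
\text{makan} : \text{makanan} =
\begin{pmatrix} -1 \\ 0 \\ \vdots \\ -1 \\ \vdots \\ 0 \\ 2 \end{pmatrix}
= \text{main} : \text{mainan}
\quad\Rightarrow\quad
\text{makan} : \text{makanan} :: \text{main} : \text{mainan}
$$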

This formal definition of word ratio in Formula (3) gives the same vector for the ratios makan : makanan, makan : namakan, and makan : mnaakan. This is due to the use of insertion and deletion as the only edit operations.
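The ratio of Formula (3) can be sketched in a few lines. This is an illustration under assumptions of ours, not the implementation used in the experiments: the names `lcs_length` and `ratio` are hypothetical, and the vector is stored sparsely (characters with a zero difference are dropped), which does not change equality tests and avoids fixing an alphabet in advance.

```python
from collections import Counter

def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def ratio(a, b):
    """Ratio A : B as a pair (character-count differences, distance).

    The first component maps each character c to |A|_c - |B|_c, keeping
    only non-zero entries; the second is the edit distance with
    insertion and deletion only, |A| + |B| - 2 * LCS(A, B).
    """
    count_a, count_b = Counter(a), Counter(b)
    diffs = {c: count_a[c] - count_b[c] for c in set(a) | set(b)}
    diffs = {c: v for c, v in diffs.items() if v != 0}
    distance = len(a) + len(b) - 2 * lcs_length(a, b)
    return diffs, distance

for word in ("makanan", "namakan", "mnaakan"):
    print("makan :", word, "=", ratio("makan", word))
```

All three ratios come out identical, which is why a single analogy is not enough to single out makanan.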

The purpose of working with analogical grids, and not only with individual analogies, is that Formula (1) imposes more constraints for a word form to enter a grid: a word form in a grid must satisfy all analogy relationships with all surrounding word forms in the grid. The word form makanan in the analogical grid of Figure 1 (right) is the only word form which fits in, among makanan, namakan, or mnaakan. For example, as shown below, using the words main and mainan from the analogical grid, the inequality between the ratios makan : main and namakan : mainan implies that there is no analogy between these four words. The same holds for the word form mnaakan. In all these cases, the inequality comes from different edit distance values.

In the vectors below, only the rows for a, i and k and the final distance row are shown; all other rows are 0.

$$
\text{makan} : \text{main} =
\begin{pmatrix} 1 \\ \vdots \\ -1 \\ \vdots \\ 1 \\ \vdots \\ 3 \end{pmatrix}
\neq
\begin{pmatrix} 1 \\ \vdots \\ -1 \\ \vdots \\ 1 \\ \vdots \\ 5 \end{pmatrix}
= \text{namakan} : \text{mainan}
\quad\Rightarrow\quad
\text{makan} : \text{main} \not:: \text{namakan} : \text{mainan}
$$

¹ The only two edit operations used are insertion and deletion, hence, $d(A, B) = |A| + |B| - 2 \times s(A, B)$. $|S|$ denotes the length of a string $S$ and $s(A, B)$ is the length of the longest common sub-sequence (LCS) between A and B.
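The filtering effect of the grid can be checked with a small self-contained sketch of Formula (2). The helper names (`lcs_length`, `ratio`, `is_analogy`) and the compact tuple representation of the ratio are assumptions made for illustration. The candidates are tested in the position of makanan against the row main : mainan of Figure 1 (right).

```python
from collections import Counter

def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ca == cb else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def ratio(a, b):
    """Ratio A : B: sorted non-zero character-count differences, then distance."""
    counts = Counter(a)
    counts.subtract(Counter(b))            # |A|_c - |B|_c for every character c
    diffs = tuple(sorted((c, v) for c, v in counts.items() if v != 0))
    return diffs, len(a) + len(b) - 2 * lcs_length(a, b)

def is_analogy(a, b, c, d):
    """A : B :: C : D holds iff A : B = C : D and A : C = B : D (Formula (2))."""
    return ratio(a, b) == ratio(c, d) and ratio(a, c) == ratio(b, d)

CANDIDATES = ["makanan", "namakan", "mnaakan"]
# Keep the candidates X such that makan : X :: main : mainan.
FITTING = [w for w in CANDIDATES if is_analogy("makan", w, "main", "mainan")]
print(FITTING)
```

The first equality of Formula (2) holds for all three candidates; it is the second one, obtained by exchange of the means, that rejects namakan and mnaakan through the distance component.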

The above discussion shows that there should be a relationship between the size of the analogical grids and the freedom in filling an empty cell in an analogical grid.

2.3 Size and saturation of analogical grids

We simply define the size of an analogical grid as its number of rows multiplied by its number of columns. The analogical grids in Figure 1 have sizes of 4 × 5 = 20 (left) and 4 × 4 = 16 (right) respectively.

Let us now turn to the number of empty cells of an analogical grid, or rather the proportion of non-empty cells, which we call its saturation². We compute it using Formula (4), which gives a saturation of 80 % (left) and 75 % (right) for Figure 1.

$$
\text{Saturation} = 100 - \frac{\text{Number of empty cells}}{\text{Total number of cells}} \times 100
\qquad (4)
$$
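Size and saturation as defined above can be computed directly from a grid. The sketch below applies the definitions to the two grids of Figure 1; the list-of-rows representation with `None` for empty cells and the function names are assumptions made for illustration.

```python
# Grids of Figure 1; None marks an empty cell.
ENGLISH = [
    ["show",  "shows", "showing",  "showed"],
    ["walk",  "walks", "walking",  "walked"],
    ["open",  "opens", "opening",  None],
    ["study", None,    "studying", None],
    ["read",  "reads", "reading",  None],
]
INDONESIAN = [
    ["makan", "dimakan", "memakan", "makanan"],
    ["minum", "diminum", "meminum", "minuman"],
    ["main",  None,      None,      "mainan"],
    ["beli",  "dibeli",  None,      None],
]

def size(grid):
    """Number of rows multiplied by number of columns."""
    return len(grid) * len(grid[0])

def saturation(grid):
    """Saturation in percent, following Formula (4)."""
    empty = sum(cell is None for row in grid for cell in row)
    return 100 - 100 * empty / size(grid)

for name, grid in (("English", ENGLISH), ("Indonesian", INDONESIAN)):
    print(name, size(grid), saturation(grid))
```

The values obtained agree with those quoted in the text for Figure 1: sizes 20 and 16, saturations 80 % and 75 %.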

3 Experiments

3.1 Data used

We carried out experiments on a multilingual parallel corpus created from the translation of the Bible collected by Christodoulopoulos³ [10]. We selected four languages with different richness in morphology: English, Russian, Modern Greek, and Indonesian. The reason for using a multilingual parallel corpus is the need to draw conclusions across different languages in a reliable way. Table 1 presents statistics on the corpus. For each text in each language, we first extracted the list of all words, and finally built all analogical grids.

² In [2, p. 79], saturation is the maximal proportion of word forms attested for any one lemma of a given paradigm. Here we use the term for each entire grid.

³ http://homepages.inf.ed.ac.uk/s0787820/bible/

[Figure 5 (bottom): number of analogical grids (y-axis) against analogical grid size (x-axis), both on logarithmic scales, one panel per language.]

The graphs at the bottom of Figure 5 show the number of analogical grids with the same sizes in each language. Most of the analogical grids have a small size. The number of analogical grids with the same size decreases gradually as the size increases. Languages with a richer morphology produce bigger analogical grids on average and also more analogical grids for a given size. All of this meets intuition.

[Figure 6: saturation (y-axis, in percent) against analogical grid size (x-axis) for each language; recovered panel title: English.]

We now turn to the study of the saturation of analogical grids compared to their size. The top of Figure 6 shows saturation against size for analogical grids in each language. Analogical grids with smaller sizes tend to have higher saturation. Some tables are extremely sparse. Because of the logarithmic scale on the y-axis, the bottom half is for tables with a saturation less than 1 %.

In all cases, the plots exhibit a similar linear shape in logarithmic scale across all languages. This would correspond to Formula (5). We confirmed the similarity by the computation of the coefficients a and b for each language, as obtained by the least squares method. These coefficients are presented in Table 2. They are almost the same in all languages.

$$
\log(\text{saturation}) = a \times \log(\text{size}) + b \qquad (5)
$$
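The estimation of a and b in Formula (5) by the least squares method can be sketched as below. The function name `fit_log_log`, the use of natural logarithms, and above all the data are assumptions of ours: the sizes and saturations are synthetic values generated from a known power law, not measurements from the paper, so the fit simply recovers the coefficients that were put in.

```python
import math

def fit_log_log(sizes, saturations):
    """Least-squares fit of log(saturation) = a * log(size) + b.

    Returns the pair (a, b). Natural logarithms are used; another base
    would leave the slope a unchanged and only rescale b.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(s) for s in saturations]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

# Synthetic data (made up for illustration): saturation = 100 * size ** -0.5
SIZES = [4, 16, 64, 256, 1024]
SATURATIONS = [100 * s ** -0.5 for s in SIZES]

A, B = fit_log_log(SIZES, SATURATIONS)
print(round(A, 3), round(B, 3))
```

In an actual experiment, each analogical grid would contribute one (size, saturation) point, and the fit would be run once per language or per corpus variant.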

As mentioned in Section 2.2, intuitively, analogical grids with higher saturation are more reliable to fill in because there are more word forms around the empty cells as supporting evidence. However, this may not always be the case. For instance, an analogical grid for regular English verbs extracted from any text is very hollow but empty cells can be filled in a reliable way.

4 Discussion and further experiments

Let us make a first remark on the type of the observed relation. This is not yet another instance of a Zipfian law, because, in the present case, the objects are not ranked individually according to their frequency (number of occurrences). In a Zipfian law, the x-axis stands for the list of individual objects ranked by frequency. Recall also that our analogical grids do not encapsulate any information about the frequency of individual words whatsoever. In our graphs, two analogical grids with the same size have the same abscissa. If they also have the same saturation, they have the same ordinate and are thus plotted as the same point.

[Table 2: coefficients a and b of Formula (5). Language rows: English, Indonesian, Modern Greek, Russian.]

The interesting fact that comes to light is not so much that the relation between size and saturation of analogical grids is a log-log relation, but that it exhibits very similar slopes in all four languages. A reasonable explanation is that these coefficients are independent of the language because they characterize the corpus used. The corpus is defined by its size and its genre.

We first inquired whether the coefficients depend on the size of the corpus used. We performed the same experiment in English and let the size of the corpus vary: a half, a quarter, an eighth of the original size. The computation of the coefficients led to very similar results as shown in Table 2.

We then investigated the influence of the genre and performed the same experiment with the same size of text, again in English. We chose the Europarl corpus for this experiment. Again, the computation of the linear coefficients led to very similar results, as shown in Table 2.

Further experiments with more parameters varying are required to confirm that the coefficients of the relationship between saturation and size are always very similar. However, for the time being, we observe that the parameters are relatively close at least for these four languages with different richness in morphology.

5 Conclusion

We studied analogical grids in different languages with different morphological richness. These analogical grids were automatically built from actual texts, using a technique which has been presented in previous work. Unsurprisingly, languages known to be richer in morphology produce bigger and more analogical grids than languages less rich in morphology. Empty cells in such analogical grids are interesting because they could be filled by words that should then be tested against the actual language.

We studied the relation between size and saturation in analogical grids. Experimental results clearly showed that the logarithm of the saturation of an analogical grid depends linearly on the logarithm of its size. This is not so surprising. More interestingly, the computation of the coefficients characterizing this log-log linear relation led to the result that, across all four languages used, and even when size and genre vary within one language, these coefficients are almost always the same: the relation between the saturation and the size of an analogical grid would be almost independent of the size, the genre and the language of a text.

References

1. Beesley, K.R.: Consonant spreading in Arabic stems. In: Proceedings of COLING-ACL'98. vol. I, pp. 117–123. Montreal (Aug 1998), http://www.aclweb.org/anthology/P98-1018

2. Chan, E.: Structures and distributions in morphology learning. Ph.D. thesis, University of Pennsylvania (2008), http://nlp.cs.swarthmore.edu/~richardw/papers/chan2008-structures.pdf

3. Dreyer, M., Eisner, J.: Discovering morphological paradigms from plain text using a Dirichlet process mixture model. In: Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP'2011). pp. 616–627. Association for Computational Linguistics, Edinburgh, Scotland, UK (2011), https://www.cs.jhu.edu/~jason/papers/dreyer+eisner.emnlp11.pdf

4. Fam, R., Lepage, Y.: Morphological predictability of unseen words using computational analogy. In: Proceedings of the Computational Analogy Workshop at the 24th International Conference on Case-Based Reasoning (ICCBR-CA-16). pp. 51–60. Atlanta, Georgia (2016)

5. Goldsmith, J.: Unsupervised learning of the morphology of a natural language. Computational Linguistics 27, 153–198 (2001)

6. Hathout, N.: Acquisition of the morphological structure of the lexicon based on lexical similarity and formal analogy. In: Proceedings of the 3rd Textgraphs workshop on Graph-based Algorithms for Natural Language Processing. pp. 1–8. Coling 2008 Organizing Committee, Manchester, UK (August 2008), http://www.aclweb.org/anthology/W08-2001

7. Langlais, P., Yvon, F.: Scaling up analogical learning. In: Coling 2008: Companion volume: Posters. pp. 51–54. Coling 2008 Organizing Committee, Manchester, UK (August 2008), http://www.aclweb.org/anthology/C08-2013

8. Lepage, Y.: Analogies between binary images: Application to Chinese characters. In: Prade, H., Richard, G. (eds.) Computational Approaches to Analogical Reasoning: Current Trends, pp. 25–57. Springer, Berlin, Heidelberg (2014), http://dx.doi.org/10.1007/978-3-642-54516-0_2

9. Neuvel, S., Fulop, S.A.: Unsupervised learning of morphology without morphemes. In: Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning. pp. 31–40. Association for Computational Linguistics (July 2002), http://www.aclweb.org/anthology/W02-0604

10. Resnik, P., Olsen, M.B., Diab, M.: The Bible as a parallel corpus: Annotating the `book of 2000 tongues'. Computers and the Humanities 33(1), 129–153 (1999), http://dx.doi.org/10.1023/A:1001798929185

11. Schone, P., Jurafsky, D.: Knowledge-free induction of morphology using latent semantic analysis. In: Proceedings of CoNLL-2000 and LLL-2000. pp. 67–72. Lisbon, Portugal (2000), http://web.stanford.edu/~jurafsky/W00-0712.pdf

12. Singh, R., Ford, A.: In praise of Sakatayana: some remarks on whole word morphology. In: Singh, R. (ed.) The Yearbook of South Asian Languages and Linguistics 2000. Sage, Thousand Oaks (2000)

13. Stroppa, N., Yvon, F.: An analogical learner for morphological analysis. In: Proceedings of the Ninth Conference on Computational Natural Language Learning (CoNLL-2005). pp. 120–127. Association for Computational Linguistics, Ann Arbor, Michigan (June 2005), http://www.aclweb.org/anthology/W/W05/W05-0616

14. Wintner, S.: Natural Language Processing of Semitic Languages, chap. Morphological Processing of Semitic Languages, pp. 43–66. Springer, Berlin, Heidelberg (2014), http://dx.doi.org/10.1007/978-3-642-45358-8_2