Testing a Statistical Word Stemmer based on Affixality Measurements in INEX 2012 Tweet Contextualization Track

Carlos-Francisco Méndez-Cruz 1, Edmundo-Pavel Soriano-Morales 1, and Alfonso Medina-Urrea 2

1 GIL-Instituto de Ingeniería UNAM, México
  cmendezc@ii.unam.mx, sorianopavel@gmail.com
2 El Colegio de México A.C., México
  amedinau@colmex.mx

Abstract. This paper presents an experiment on statistical word stemming based on affixality measurements. These measurements quantify three characteristics of language. In this experiment we tested one stemming strategy with three different sizes of training data. The resulting stemmer was used by the automatic summarization system Cortex to preprocess input texts and produce readable summaries. All summaries were evaluated as part of the INEX 2012 Tweet Contextualization Track. We present the evaluation results and a discussion of our stemming strategy.

Key words: INEX, Automatic summarization system, Affixality Measurements, Morphological Segmentation, Statistical Stemming, CORTEX, Tweet Contextualization.

1 Introduction

The task proposed in the INEX 2012 Tweet Contextualization Track consists in obtaining textual context from the English Wikipedia about the subject of a tweet. The final contextualization of the tweet should take the form of a readable summary of 500 words. A set of 1133 documents, tweets contextualized with text from the November 2011 Wikipedia, was processed in order to obtain summaries. Bibliographic references and empty Wikipedia pages were omitted.

The evaluation of summaries was done by the INEX organizers taking into account informativeness and readability. The former was obtained using Kullback-Leibler divergence with Dirichlet smoothing by comparing n-gram distributions. The latter was carried out by the participants in the track, who evaluated the summaries taking into account syntax, anaphora resolution, and redundancy. More details on the evaluation system and the INEX 2012 Tweet Contextualization Track can be found in [1].

For this track we developed a stemmer based on morphological segmentation. The stemmer was coupled with Cortex, an automatic summarization system, in order to generate the summaries. We tested three sizes of training corpora to determine the best option for statistical stemming of English.

The organization of this paper is as follows: in Section 2 we review some approaches to morphological segmentation; in Section 3 we present word stemming; in Section 4 we describe the affixality measurements; Section 5 presents the stemming strategy; the evaluation obtained in the INEX track is presented in Section 6; and finally, in Section 7, we briefly present our conclusions and future work.

2 Morphological Segmentation

The first work on unsupervised discovery of the morphological units of language is due to Zellig Harris [2]. His method, commonly known as successor frequency, consists in counting the different letters or symbols that occur before and after a possible morphological boundary: the more different symbols appear, the higher the probability of a true morphological cut. This approach showed, among other things, that uncertainty is a good clue for morphological segmentation.
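To make the idea concrete, here is a minimal sketch of the successor-count computation in Python; the function name, the toy lexicon, and the type-based counting are illustrative assumptions of ours, not part of Harris's original procedure.

def successor_variety(prefix, lexicon):
    """Harris-style successor count: the number of distinct letters
    that can follow `prefix` across the words of a lexicon."""
    successors = {w[len(prefix)] for w in lexicon
                  if w.startswith(prefix) and len(w) > len(prefix)}
    return len(successors)

# Toy lexicon; a peak in the counts suggests a morphological boundary.
words = ["reading", "reader", "reads", "real", "red"]
for i in range(1, len("reads") + 1):
    prefix = "reads"[:i]
    print(prefix, successor_variety(prefix, words))
# "read" shows the highest variety (e, i, s can follow it),
# hinting at the cut read|s.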
Nowadays, one of the most widely used methods for unsupervised learning of morphology is based on the Minimum Description Length (MDL) approach, developed as a computational system called Linguistica [3, 4] (http://linguistica.uchicago.edu). This method tries to infer a lexicon of morphs from a corpus. The best lexicon is the one with the least redundancy, i.e. the one for which the description length of the data is lowest. The method also exploits combinatorial structures called signatures in order to improve segmentation. It has been employed for stemming in [5]; in that paper the resulting stemmer was used for an information retrieval task instead of summarization.

Preprocessing documents for NLP tasks such as Question Answering, Information Retrieval, or Automatic Text Summarization is more complex for agglutinative languages. This is due to the fact that agglutinative languages exhibit numerous combinations of morphs rather than a simple prefix-stem-suffix pattern. A method of unsupervised morphological segmentation for these kinds of languages is Morfessor [6-9] (http://www.cis.hut.fi/projects/morpho/). This approach combines MDL with a Maximum a Posteriori framework. It also integrates a morphotactic analysis to represent each word by a Hidden Markov Model (HMM). We are not aware of this method having been used for word stemming.

3 Word stemming

The majority of NLP systems preprocess documents in order to reduce the size of the Vector Space Model representation. This is the case of Cortex, which will be explained below. A well-known strategy for that purpose is word stemming, i.e. truncating words by eliminating inflection. It is also possible to remove derivational affixes.

The most widely used stemming methods are built from hand-made rules, like [10, 11]. These kinds of stemmers have been successfully applied to European languages. However, languages with a more complex morphology than English, such as agglutinative ones, need unsupervised morphological strategies in order to deal with their complexity.

A review of stemming methods is presented in [12]. The variety of stemming approaches includes: distance functions that measure orthographic similarity [13], directed graphs [14, 5], and frequencies of letter n-grams [15]. Moreover, there are some works on the evaluation of stemming in information retrieval tasks, for example [16, 17].

4 Affixality Measurements

The affixality measurements used for morphological segmentation were proposed for Spanish in [18, 19]. These measurements have also been applied to Czech [20] and to the Amerindian languages Chuj and Tarahumara [21]. This approach relies on the linguistic idea that there is a force between the segments of a word (morphs) called affixality. If we can quantify this affixality, we can expect peaks where morphological cuts are plausible. In the next sections we present the way these measurements are calculated.

4.1 Entropy

As we said above, Harris's approach revealed that uncertainty helps morphological segmentation. This uncertainty can be seen as Shannon's concept of information content (entropy) [22]. To calculate the entropy of a possible segmentation, given a word segmentation a_{i,j}::b_{i,j} and B_{i,j} the set of all segments combined with a_{i,j}, we use the formula:

    H(a_{i,j} :: B_{i,j}) = -\sum_{k=1}^{|B_{i,j}|} p(b_{k,j}) \log_2 p(b_{k,j})    (1)

where each b_{k,j} ∈ B_{i,j}. For our purpose, we examined peaks of entropy from right to left in order to discover suffixes.
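As an illustration, the following is a minimal sketch of how equation (1) can be evaluated over a raw training corpus, assuming p(b_{k,j}) is estimated from token frequencies; the function name and the toy corpus are our own, not taken from the paper's implementation.

import math
from collections import Counter

def segmentation_entropy(a, corpus_tokens):
    """Entropy H(a :: B) of equation (1): the uncertainty over the
    right-hand segments b that combine with the left segment `a`,
    with p(b) estimated from token frequencies in the corpus."""
    b_counts = Counter(w[len(a):] for w in corpus_tokens
                       if w.startswith(a) and len(w) > len(a))
    total = sum(b_counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in b_counts.values())

# Toy corpus: the frequent distractor "wall" keeps the entropy low
# inside the stem, so the boundary walk| stands out as a peak.
corpus = ["walked"] * 5 + ["walking"] * 5 + ["walks"] * 5 + ["wall"] * 100
for i in range(1, len("walking")):
    a = "walking"[:i]
    print(a, round(segmentation_entropy(a, corpus), 3))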
4.2 Economy Principle

The Economy Principle can be understood as follows: a small number of units at one level of language combine to create a great number of other units at the next level. Taking advantage of this principle, we can define a stem as a word segment that belongs to a big set of relatively infrequent units, and affixes as word segments that belong to a small set of frequent ones. A quantification of this economy was suggested in [23]; however, here we present a reformulation. Given a word segmentation a_{i,j}::b_{i,j}, the economy of a segmentation is calculated depending on the type of morph hypothesized:

    K^p_{i,j} = 1 - \frac{|A_{i,j}| - |A^p_{i,j}|}{|B_{i,j}|} ; \quad K^s_{i,j} = 1 - \frac{|B_{i,j}| - |B^s_{i,j}|}{|A_{i,j}|}    (2)

where A_{i,j} is the set of segments which alternate with b_{i,j} (a_{i,j} ∈ A_{i,j}), and B_{i,j} the set of segments which alternate with a_{i,j} (b_{i,j} ∈ B_{i,j}). Also, let A^p_{i,j} be the set of segments which are likely prefixes, and B^s_{i,j} the set of segments which are likely suffixes.

4.3 Numbers of Squares

Joseph Greenberg [24] proposed the concept of a square: four expressions of language, say A, B, C, D, combine to form AC, BC, AD, and BD. For example, walk, talk, -ed, and -ing form the square walked, talked, walking, talking. Hence, we set c_{i,j} as the number of squares found in segment j of word i.

5 Stemming Strategy

The affixality of all possible segmentations within a word is estimated by the average of the normalized values of the three measurements explained above:

    AF_n(s_x) = \frac{c_x / \max c_i + k_x / \max k_i + h_x / \max h_i}{3}    (3)

where c, k, and h are the squares, economy, and entropy values of a segmentation. To calculate this affixality, a training corpus of raw text is required. In this track we used three different sizes: 100k, 200k, and 500k word tokens.

With an affixality index calculated for each possible word segment, it is possible to choose a strategy for morphological segmentation; for example, [19] proposed four strategies. In this experiment we use a peak-valley strategy for segmentation. Given the set of affixality indexes af_1, ..., af_k inside a word, a peak of affixality from left to right is a position i such that af_{i-1} < af_i > af_{i+1}, where k is the length of the word plus one (the last index corresponds to the ending of the word). The main disadvantage of this approach is that small peaks are also taken into account, generating oversegmentation.

Regarding stemming, we truncate words at the leftmost peak of affixality, as sketched below. For a language with scarce morphology like English, one might expect the rightmost peak of affixality to be sufficient for stemming. However, in order to improve Cortex summarization, we decided to conflate words aggressively with a leftmost-peak strategy. The next section explains CORTEX's approach.
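Here is a minimal sketch of that leftmost-peak truncation, assuming the affixality indexes of equation (3) have already been computed for each internal boundary of a word; the indexing convention and the illustrative index values are ours, not taken from the trained model.

def leftmost_peak_stem(word, af):
    """Truncate `word` at the leftmost affixality peak.
    `af[i]` is the affixality index of the cut after word[:i+1],
    so `af` holds one value per internal boundary of the word."""
    for i in range(1, len(af) - 1):
        if af[i - 1] < af[i] > af[i + 1]:  # peak-valley condition
            return word[:i + 1]
    return word  # no peak: leave the word untouched

# Hypothetical indexes with a peak after "hope":
print(leftmost_peak_stem("hopefully",
                         [0.1, 0.2, 0.3, 0.6, 0.4, 0.2, 0.3, 0.2]))
# -> "hope" (hope|fully is conflated with hope|ful, hope|less, ...)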
5.1 Cortex Summarizer

As mentioned before, Cortex is an automatic text summarization system. A broad explanation of this summarizer can be found in [25-29]. Here, we briefly describe some relevant aspects. First, Cortex represents the input documents in the Vector Space Model. To do that, the documents are preprocessed; it is at this step that we incorporate our stemmer.

After preprocessing, a frequency matrix γ is generated, representing the presence and absence of words (terms) in the sentences:

    \gamma = \begin{pmatrix}
      \gamma^1_1 & \gamma^1_2 & \cdots & \gamma^1_i & \cdots & \gamma^1_M \\
      \gamma^2_1 & \gamma^2_2 & \cdots & \gamma^2_i & \cdots & \gamma^2_M \\
      \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\
      \gamma^P_1 & \gamma^P_2 & \cdots & \gamma^P_i & \cdots & \gamma^P_M
    \end{pmatrix}, \quad \gamma^\mu_i \in \{0, 1, 2, \ldots\}    (4)

Each element γ^µ_i of this matrix represents the number of occurrences of word i in sentence µ; 1 ≤ i ≤ M words, 1 ≤ µ ≤ P sentences. In the metrics below, ξ^µ_i denotes the presence of word i in sentence µ, i.e. the binarized γ^µ_i.

Then, statistical information is extracted from the matrix by calculating some metrics. More information about these metrics can be found in [30]. A summary of these metrics is offered here; they are based on frequencies, entropy, Hamming measures, and hybrid values.

1. Frequency measures.
   (a) Term Frequency: F^\mu = \sum_{i=1}^{M} \gamma^\mu_i
   (b) Interactivity of segments: I^\mu = \sum_{i=1,\ \xi^\mu_i \neq 0}^{M} \sum_{j=1,\ j \neq \mu}^{P} \xi^j_i
   (c) Sum of probability frequencies: \Delta^\mu = \sum_{i=1}^{M} p_i \gamma^\mu_i, where p_i is the probability of word i
2. Entropy. E^\mu = -\sum_{i=1,\ \xi^\mu_i \neq 0}^{M} p_i \log_2 p_i
3. Measures of Hamming. These metrics use a Hamming matrix H, a square matrix M × M:

       H_{nm} = \sum_{j=1}^{P} \begin{cases} 1 & \text{if } \xi^j_m \neq \xi^j_n \\ 0 & \text{elsewhere} \end{cases} \quad \text{for } m \in [2, M],\ n \in [1, m]    (5)

   (a) Hamming distances: \Psi^\mu = \sum_{m=2,\ \xi^\mu_m \neq 0}^{M} \sum_{n=1,\ \xi^\mu_n \neq 0}^{m} H_{nm}
   (b) Hamming weight of segments: \phi^\mu = \sum_{i=1}^{M} \xi^\mu_i
   (c) Sum of Hamming weights of words per segment: \Theta^\mu = \sum_{i=1,\ \xi^\mu_i \neq 0}^{M} \psi_i, where for every word \psi_i = \sum_{\mu=1}^{P} \xi^\mu_i
   (d) Hamming heavy weight: \Pi^\mu = \phi^\mu \Theta^\mu
   (e) Sum of Hamming weights of words by frequency: \Omega^\mu = \sum_{i=1}^{M} \psi_i \gamma^\mu_i
4. Titles. \theta^\mu = \frac{\gamma^\mu \cdot \mathrm{Title}}{\|\gamma^\mu\|\,\|\mathrm{Title}\|}, the cosine between sentence µ and the title vector.

Finally, a decision algorithm combines those metrics to score sentences. For each sentence µ, two sums are calculated over the Γ normalized metric values, one for λ^µ_ν > 0.5 and one for λ^µ_ν < 0.5 (λ^µ_ν = 0.5 is ignored):

    \alpha = \sum_{\nu=1,\ \lambda^\mu_\nu > 0.5}^{\Gamma} (\lambda^\mu_\nu - 0.5) ; \quad \beta = \sum_{\nu=1,\ \lambda^\mu_\nu < 0.5}^{\Gamma} (0.5 - \lambda^\mu_\nu)    (6)

The next expression is used to calculate the score of each sentence:

    \text{if } \alpha > \beta \text{ then } \Lambda^\mu = 0.5 + \frac{\alpha}{\Gamma}, \text{ else } \Lambda^\mu = 0.5 - \frac{\beta}{\Gamma}

Cortex sorts the final sentences by Λ^µ, µ = 1, ..., P. Additionally, Cortex lets us set a compression rate, which was fixed at 500 words.
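To make the decision algorithm concrete, the following sketch scores sentences with equation (6) and fills the 500-word budget; the helper names and the greedy word-budget loop are our own assumptions, and the computation of the normalized metrics themselves (see the equations above and [30]) is omitted.

def cortex_score(metric_values):
    """Decision algorithm of equation (6): each normalized metric
    value above 0.5 votes to keep the sentence, each value below
    0.5 votes to drop it; values equal to 0.5 are ignored."""
    gamma = len(metric_values)                      # Γ metrics
    alpha = sum(v - 0.5 for v in metric_values if v > 0.5)
    beta = sum(0.5 - v for v in metric_values if v < 0.5)
    if alpha > beta:
        return 0.5 + alpha / gamma
    return 0.5 - beta / gamma

def summarize(sentences, metric_table, max_words=500):
    """Rank sentences by their Λ score and keep the best ones
    until the 500-word compression rate is reached."""
    scored = sorted(((cortex_score(m), s)
                     for s, m in zip(sentences, metric_table)),
                    reverse=True)
    summary, budget = [], max_words
    for _, sent in scored:
        n = len(sent.split())
        if n <= budget:
            summary.append(sent)
            budget -= n
    return " ".join(summary)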
6 Experiments and Results

6.1 Design of Experiments

We used three sizes of training corpora, 100K, 200K, and 500K word tokens, to test our stemmer. With these sizes we performed the three runs for the INEX track. The assigned run numbers were 153 (100K), 154 (200K), and 155 (500K). The evaluation corpus consisted of the 1133 tweets contextualized with text from the November 2011 Wikipedia. For the training corpora, we selected 24 documents from the same set of contextualized tweets.

6.2 Results

For informativeness, Cortex coupled with our stemmer obtained ranks 12, 14, and 15. Average informativeness scores are shown in Table 1. The best run in this evaluation was run 154 (200K).

Table 1. Average scores of informativeness

Rank  Run  Unigrams  Bigrams  Skip
12    154  0.8233    0.9254   0.9251
14    155  0.8253    0.9280   0.9274
15    153  0.8266    0.9291   0.9290

Those scores were computed by the organizers using a Perl script (inexqa-eval.pl); for details about this script see [1]. On the other hand, the best results in the readability evaluation were obtained by run 155 (500K); see Table 2. Comparing our results with the other runs, run 155 (500K) obtained rank 4 in relevance, rank 6 in syntax, and rank 9 in structure. The worst run in our experiment was run 153 (100K) in both evaluations.

Table 2. Scores of readability

Run  Relevance  Syntax  Structure
155  0.6968     0.6161  0.5315
154  0.5352     0.5305  0.4748
153  0.4984     0.4576  0.3784

7 Conclusions and Future Work

In this paper we reported an experiment with a stemmer based on morphological segmentation. We used affixality measurements in order to segment words. This stemmer was coupled with Cortex, an automatic summarization system. We applied the following stemming strategy: given the affixality peaks of a word, we truncated it at the leftmost peak. We also tested three training corpus sizes for estimating the affixality indexes: 100K, 200K, and 500K word tokens.

Our two goals were to determine whether our stemming strategy can produce readable summaries, and whether different sizes of training corpora can improve Cortex's performance.

According to the evaluation results, our stemming strategy produces not only readable summaries but also competitive ones. With an average of relevance, syntax, and structure of 0.6148, run 155 obtained rank 7 among 27 runs. What is more, concerning informativeness, run 154 obtained rank 12 among 33 participants.

Regarding corpus sizes, it is not clear whether 200K or 500K word tokens is the best size for English. However, it is clear that increasing the corpus size is a good strategy, because 100K obtained the worst results. Additionally, a bigger training corpus gives a better position in the ranking: by average of relevance, syntax, and structure, run 155 (500K) obtained rank 7 while run 153 (100K) obtained rank 15.

In future experiments we will test different strategies for morphological segmentation and stemming. Additionally, we can compare against different stemming approaches, such as Porter's stemmer.

References

1. SanJuan, E., Moriceau, V., Tannier, X., Bellot, P., Mothe, J.: Overview of the INEX 2011 Question Answering Track (QA@INEX). In: INEX 2011 Workshop Pre-Proceedings, IR Publications, Hofgut Imsbach, Saarbrücken, Germany (2011) 145–153
2. Harris, Z.S.: From Phoneme to Morpheme. Language 31 (1955) 190–222
3. Goldsmith, J.: Unsupervised Learning of the Morphology of a Natural Language. Computational Linguistics 27 (2001) 153–198
4. Goldsmith, J.: An Algorithm for the Unsupervised Learning of Morphology. Natural Language Engineering 12 (2006) 353–371
5. Paik, J.H., Mitra, M., Parui, S.K., Järvelin, K.: GRAS: An effective and efficient stemming algorithm for information retrieval. ACM Trans. Inf. Syst. 29 (2011)
6. Creutz, M., Lagus, K.: Unsupervised Discovery of Morphemes. In: Proc. of the Workshop on Morphological and Phonological Learning of ACL-02, Philadelphia, SIGPHON-ACL (2002) 21–30
7. Creutz, M.: Unsupervised segmentation of words using prior distributions of morph length and frequency. In Hinrichs, E., Roth, D., eds.: 41st Annual Meeting of the ACL, Sapporo, Japan (2003) 280–287
8. Creutz, M., Lagus, K.: Induction of a Simple Morphology for Highly-Inflecting Languages. In: Proc. of 7th Meeting of the ACL Special Interest Group in Computational Phonology SIGPHON-ACL (2004) 43–51
9. Creutz, M., Lagus, K.: Inducing the Morphological Lexicon of a Natural Language from Unannotated Text. In: Int. and Interdisciplinary Conf. on Adaptive Knowledge Representation and Reasoning (AKRR05) (2005) 106–113
10. Lovins, J.B.: Development of a Stemming Algorithm. Mechanical Translation and Computational Linguistics 11 (1968) 23–31
11. Porter, M.F.: An algorithm for Suffix Stripping. Program 14 (1980) 130–137
12. Lennon, M., Pierce, D., Tarry, B., Willett, P.: An evaluation of some conflation algorithms for information retrieval. J. of Information Science 3 (1981) 177–183
13. Majumder, P., Mitra, M., Pal, D.: Bulgarian, Hungarian and Czech stemming using YASS. In: Proceedings of Advances in Multilingual and Multimodal Information Retrieval, Springer-Verlag, Berlin (2008) 49–56
14. Bacchin, M., Ferro, N., Melucci, M.: A probabilistic model for stemmer generation. Information Processing & Management 41 (2005) 121–137
15. McNamee, P., Mayfield, J.: Character n-gram tokenization for European language text retrieval. Information Retrieval 7 (2004) 73–97
16. Krovetz, R.: Viewing Morphology as an Inference Process. In: Proceedings of the 16th ACM/SIGIR Conference (1993) 191–202
17. Hull, D.A.: Stemming algorithms - A case study for detailed evaluation. Journal of the American Society for Information Science 47 (1996) 70–84
18. Medina-Urrea, A.: Investigación cuantitativa de afijos y clíticos del español de México. Glutinometría en el Corpus del Español Mexicano Contemporáneo. PhD thesis, El Colegio de México, México (2003)
19. Medina-Urrea, A.: Automatic Discovery of Affixes by means of a Corpus: A Catalog of Spanish Affixes. Journal of Quantitative Linguistics 7 (2000) 97–114
20. Medina-Urrea, A., Hlaváčová, J.: Automatic Recognition of Czech Derivational Prefixes. In: Proceedings of CICLing 2005. Volume 3406, LNCS, Springer, Berlin/Heidelberg/New York (2005) 189–197
21. Medina-Urrea, A.: Affix Discovery based on Entropy and Economy Measurements. Texas Linguistics Society 10 (2008) 99–112
22. Shannon, C., Weaver, W.: The Mathematical Theory of Communication. University of Illinois Press, Urbana (1949)
23. de Kock, J., Bossaert, W.: Introducción a la lingüística automática en las lenguas románicas. Gredos, Madrid (1974)
24. Greenberg, J.H.: Essays in Linguistics. The Univ. of Chicago Press, Chicago (1957)
25. Torres-Moreno, J.M.: Résumé automatique de documents. Lavoisier, Paris (2011)
26. Torres-Moreno, J.M., Saggion, H., da Cunha, I., SanJuan, E., Velázquez-Morales, P.: Summary Evaluation with and without References. Polibits 42 (2010) 13–19
27. Saggion, H., Torres-Moreno, J.M., da Cunha, I., SanJuan, E.: Multilingual summarization evaluation without human models. In: 23rd Int. Conf. on Computational Linguistics. COLING '10, Beijing, China, ACL (2010) 1059–1067
28. Torres-Moreno, J.M., Velázquez-Morales, P., Meunier, J.: CORTEX, un algorithme pour la condensation automatique de textes. In: ARCo. Volume 2 (2005) 365
29. Torres-Moreno, J.M., Velázquez-Morales, P., Meunier, J.: Condensés de textes par des méthodes numériques. JADT 2 (2002) 723–734
30. Torres-Moreno, J.M., St-Onge, P.L., Gagnon, M., El-Bèze, M., Bellot, P.: Automatic Summarization System coupled with a Question-Answering System (QAAS). CoRR abs/0905.2990 (2009)