Testing a Statistical Word Stemmer based on Affixality Measurements in INEX 2012 Tweet Contextualization Track

Carlos-Francisco Méndez-Cruz 1, Edmundo-Pavel Soriano-Morales 1, and Alfonso Medina-Urrea 2

1 GIL-Instituto de Ingeniería UNAM, México
  cmendezc@ii.unam.mx, sorianopavel@gmail.com
2 El Colegio de México A.C., México
  amedinau@colmex.mx

Abstract. This paper presents an experiment on statistical word stemming based on affixality measurements. These measurements quantify three characteristics of language. In this experiment we tested one stemming strategy with three different sizes of training data. The resulting stemmer was used by the automatic summarization system Cortex to preprocess input texts and produce readable summaries. All summaries were evaluated as part of the INEX 2012 Tweet Contextualization Track. We present the evaluation results and a discussion of our stemming strategy.

Key words: INEX, Automatic summarization system, Affixality Measurements, Morphological Segmentation, Statistical Stemming, CORTEX, Tweet Contextualization.

1 Introduction

The task proposed in the INEX 2012 Tweet Contextualization Track consists in obtaining textual context from the English Wikipedia about the subject of a tweet. The final contextualization of the tweet should take the form of a readable summary of 500 words. A set of 1133 documents, tweets contextualized with text from the November 2011 Wikipedia, was processed in order to obtain summaries. Bibliographic references and empty Wikipedia pages were omitted.

The evaluation of summaries was done by the INEX organizers taking into account informativeness and readability. The former was obtained using Kullback-Leibler divergence with Dirichlet smoothing by comparing n-gram distributions. The latter was carried out by the participants in the track, who evaluated the summaries taking into account syntax, anaphora resolution, and redundancy. More details on the evaluation system and the INEX 2012 Tweet Contextualization Track can be found in [1].

For this track we developed a stemmer based on morphological segmentation. The stemmer was coupled with Cortex, an automatic summarization system, in order to generate the summaries. We tested three sizes of training corpora to determine the best option for statistical stemming of English.

The organization of this paper is as follows: in Section 2 we review some approaches to morphological segmentation; in Section 3 we present word stemming; in Section 4 we describe the affixality measurements; Section 5 presents the stemming strategy; the evaluation obtained in the INEX track is presented in Section 6; and finally, in Section 7, we briefly present our conclusions and future work.

2 Morphological Segmentation

The first work on unsupervised discovery of the morphological units of language is due to Zellig Harris [2]. His method, commonly known as successor frequency, consists in counting the different letters or symbols that occur before and after a possible morphological boundary: the more different symbols appear, the higher the probability of a true morphological cut. This approach showed, among other things, that uncertainty is a good clue for morphological segmentation.
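To make the idea concrete, here is a minimal sketch of the successor-count computation in Python; the function name, the toy lexicon, and the type-based counting are illustrative assumptions of ours, not part of Harris's original procedure.

def successor_variety(prefix, lexicon):
    """Harris-style successor count: the number of distinct letters
    that can follow `prefix` across the words of a lexicon."""
    successors = {w[len(prefix)] for w in lexicon
                  if w.startswith(prefix) and len(w) > len(prefix)}
    return len(successors)

# Toy lexicon; a peak in the counts suggests a morphological boundary.
words = ["reading", "reader", "reads", "real", "red"]
for i in range(1, len("reads") + 1):
    prefix = "reads"[:i]
    print(prefix, successor_variety(prefix, words))
# "read" shows the highest variety (e, i, s can follow it),
# hinting at the cut read|s.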
Nowadays, one of the most widely used methods for unsupervised learning of morphology is based on the Minimum Description Length (MDL) approach, developed as a computational system called Linguistica [3, 4] (http://linguistica.uchicago.edu). This method tries to infer a lexicon of morphs from a corpus. The best lexicon is the one with the least redundancy, i.e. the one for which the description length of the data is lowest. The method also exploits combinatorial structures called signatures in order to improve segmentation. It has been employed for stemming in [5]; in that paper the resulting stemmer was used for an information retrieval task instead of summarization.

Preprocessing documents for NLP tasks such as Question Answering, Information Retrieval, or Automatic Text Summarization is more complex for agglutinative languages. This is due to the fact that agglutinative languages exhibit numerous combinations of morphs rather than a simple prefix-stem-suffix pattern. A method of unsupervised morphological segmentation for these kinds of languages is Morfessor [6-9] (http://www.cis.hut.fi/projects/morpho/). This approach combines MDL with a Maximum a Posteriori framework. It also integrates a morphotactic analysis to represent each word by a Hidden Markov Model (HMM). We are not aware of this method having been used for word stemming.

3 Word stemming

The majority of NLP systems preprocess documents in order to reduce the size of the Vector Space Model representation. This is the case of Cortex, which will be explained below. A well-known strategy for that purpose is word stemming, i.e. truncating words by eliminating inflection. It is also possible to remove derivational affixes.

The most widely used stemming methods are built from hand-made rules, like [10, 11]. These kinds of stemmers have been successfully applied to European languages. However, languages with a more complex morphology than English, such as agglutinative ones, need unsupervised morphological strategies in order to deal with their complexity.

A review of stemming methods is presented in [12]. The variety of stemming approaches includes: distance functions that measure orthographic similarity [13], directed graphs [14, 5], and frequencies of letter n-grams [15]. Moreover, there are some works on the evaluation of stemming in information retrieval tasks, for example [16, 17].

4 Affixality Measurements

The affixality measurements used for morphological segmentation were proposed for Spanish in [18, 19]. These measurements have also been applied to Czech [20] and to the Amerindian languages Chuj and Tarahumara [21]. This approach relies on the linguistic idea that there is a force between the segments of a word (morphs) called affixality. If we can quantify this affixality, we can expect peaks where morphological cuts are plausible. In the next sections we present the way these measurements are calculated.

4.1 Entropy

As we said above, Harris's approach revealed that uncertainty helps morphological segmentation. This uncertainty can be seen as Shannon's concept of information content (entropy) [22]. To calculate the entropy of a possible segmentation, given a word segmentation a_{i,j}::b_{i,j} and B_{i,j} the set of all segments combined with a_{i,j}, we use the formula:

    H(a_{i,j} :: B_{i,j}) = -\sum_{k=1}^{|B_{i,j}|} p(b_{k,j}) \log_2 p(b_{k,j})    (1)

where each b_{k,j} ∈ B_{i,j}. For our purpose, we examined peaks of entropy from right to left in order to discover suffixes.
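As an illustration, the following is a minimal sketch of how equation (1) can be evaluated over a raw training corpus, assuming p(b_{k,j}) is estimated from token frequencies; the function name and the toy corpus are our own, not taken from the paper's implementation.

import math
from collections import Counter

def segmentation_entropy(a, corpus_tokens):
    """Entropy H(a :: B) of equation (1): the uncertainty over the
    right-hand segments b that combine with the left segment `a`,
    with p(b) estimated from token frequencies in the corpus."""
    b_counts = Counter(w[len(a):] for w in corpus_tokens
                       if w.startswith(a) and len(w) > len(a))
    total = sum(b_counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in b_counts.values())

# Toy corpus: the frequent distractor "wall" keeps the entropy low
# inside the stem, so the boundary walk| stands out as a peak.
corpus = ["walked"] * 5 + ["walking"] * 5 + ["walks"] * 5 + ["wall"] * 100
for i in range(1, len("walking")):
    a = "walking"[:i]
    print(a, round(segmentation_entropy(a, corpus), 3))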
4.2 Economy Principle

The Economy Principle can be understood as follows: a small number of units at one level of language combine to create a great number of other units at the next level. Taking advantage of this principle, we can define a stem as a word segment that belongs to a big set of relatively infrequent units, and affixes as word segments that belong to a small set of frequent ones. A quantification of this economy was suggested in [23]; however, here we present a reformulation. Given a word segmentation a_{i,j}::b_{i,j}, the economy of a segmentation is calculated depending on the type of morph hypothesized:

    K^p_{i,j} = 1 - \frac{|A_{i,j}| - |A^p_{i,j}|}{|B_{i,j}|} ; \quad K^s_{i,j} = 1 - \frac{|B_{i,j}| - |B^s_{i,j}|}{|A_{i,j}|}    (2)

where A_{i,j} is the set of segments which alternate with b_{i,j} (a_{i,j} ∈ A_{i,j}), and B_{i,j} the set of segments which alternate with a_{i,j} (b_{i,j} ∈ B_{i,j}). Also, let A^p_{i,j} be the set of segments which are likely prefixes, and B^s_{i,j} the set of segments which are likely suffixes.

4.3 Numbers of Squares

Joseph Greenberg [24] proposed the concept of a square: four expressions of language, say A, B, C, D, combine to form AC, BC, AD, and BD. For example, walk, talk, -ed, and -ing form the square walked, talked, walking, talking. Hence, we set c_{i,j} as the number of squares found in segment j of word i.

5 Stemming Strategy

The affixality of all possible segmentations within a word is estimated by the average of the normalized values of the three measurements explained above:

    AF_n(s_x) = \frac{c_x / \max c_i + k_x / \max k_i + h_x / \max h_i}{3}    (3)

where c, k, and h are the squares, economy, and entropy values of a segmentation. To calculate this affixality, a training corpus of raw text is required. In this track we used three different sizes: 100k, 200k, and 500k word tokens.

With an affixality index calculated for each possible word segment, it is possible to choose a strategy for morphological segmentation; for example, [19] proposed four strategies. In this experiment we use a peak-valley strategy for segmentation. Given the set of affixality indexes af_1, ..., af_k inside a word, a peak of affixality from left to right is a position i such that af_{i-1} < af_i > af_{i+1}, where k is the length of the word plus one (the last index corresponds to the ending of the word). The main disadvantage of this approach is that small peaks are also taken into account, generating oversegmentation.

Regarding stemming, we truncate words at the leftmost peak of affixality, as sketched below. For a language with scarce morphology like English, one might expect the rightmost peak of affixality to be sufficient for stemming. However, in order to improve Cortex summarization, we decided to conflate words aggressively with a leftmost-peak strategy. The next section explains CORTEX's approach.
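Here is a minimal sketch of that leftmost-peak truncation, assuming the affixality indexes of equation (3) have already been computed for each internal boundary of a word; the indexing convention and the illustrative index values are ours, not taken from the trained model.

def leftmost_peak_stem(word, af):
    """Truncate `word` at the leftmost affixality peak.
    `af[i]` is the affixality index of the cut after word[:i+1],
    so `af` holds one value per internal boundary of the word."""
    for i in range(1, len(af) - 1):
        if af[i - 1] < af[i] > af[i + 1]:  # peak-valley condition
            return word[:i + 1]
    return word  # no peak: leave the word untouched

# Hypothetical indexes with a peak after "hope":
print(leftmost_peak_stem("hopefully",
                         [0.1, 0.2, 0.3, 0.6, 0.4, 0.2, 0.3, 0.2]))
# -> "hope" (hope|fully is conflated with hope|ful, hope|less, ...)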
5.1 Cortex Summarizer

As mentioned before, Cortex is an automatic text summarization system. A broad explanation of this summarizer can be found in [25-29]. Here, we briefly describe some relevant aspects. First, Cortex represents the input documents in the Vector Space Model. To do that, the documents are preprocessed; it is at this step that we incorporate our stemmer.

After preprocessing, a frequency matrix γ is generated, representing the presence and absence of words (terms) in the sentences:

    \gamma = \begin{pmatrix}
      \gamma^1_1 & \gamma^1_2 & \cdots & \gamma^1_i & \cdots & \gamma^1_M \\
      \gamma^2_1 & \gamma^2_2 & \cdots & \gamma^2_i & \cdots & \gamma^2_M \\
      \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\
      \gamma^P_1 & \gamma^P_2 & \cdots & \gamma^P_i & \cdots & \gamma^P_M
    \end{pmatrix}, \quad \gamma^\mu_i \in \{0, 1, 2, \ldots\}    (4)

Each element γ^µ_i of this matrix represents the number of occurrences of word i in sentence µ; 1 ≤ i ≤ M words, 1 ≤ µ ≤ P sentences. In the metrics below, ξ^µ_i denotes the presence of word i in sentence µ, i.e. the binarized γ^µ_i.

Then, statistical information is extracted from the matrix by calculating some metrics. More information about these metrics can be found in [30]. A summary of these metrics is offered here; they are based on frequencies, entropy, Hamming measures, and hybrid values.

1. Frequency measures.
   (a) Term Frequency: F^\mu = \sum_{i=1}^{M} \gamma^\mu_i
   (b) Interactivity of segments: I^\mu = \sum_{i=1,\ \xi^\mu_i \neq 0}^{M} \sum_{j=1,\ j \neq \mu}^{P} \xi^j_i
   (c) Sum of probability frequencies: \Delta^\mu = \sum_{i=1}^{M} p_i \gamma^\mu_i, where p_i is the probability of word i
2. Entropy. E^\mu = -\sum_{i=1,\ \xi^\mu_i \neq 0}^{M} p_i \log_2 p_i
3. Measures of Hamming. These metrics use a Hamming matrix H, a square matrix M × M:

       H_{nm} = \sum_{j=1}^{P} \begin{cases} 1 & \text{if } \xi^j_m \neq \xi^j_n \\ 0 & \text{elsewhere} \end{cases} \quad \text{for } m \in [2, M],\ n \in [1, m]    (5)

   (a) Hamming distances: \Psi^\mu = \sum_{m=2,\ \xi^\mu_m \neq 0}^{M} \sum_{n=1,\ \xi^\mu_n \neq 0}^{m} H_{nm}
   (b) Hamming weight of segments: \phi^\mu = \sum_{i=1}^{M} \xi^\mu_i
   (c) Sum of Hamming weights of words per segment: \Theta^\mu = \sum_{i=1,\ \xi^\mu_i \neq 0}^{M} \psi_i, where for every word \psi_i = \sum_{\mu=1}^{P} \xi^\mu_i
   (d) Hamming heavy weight: \Pi^\mu = \phi^\mu \Theta^\mu
   (e) Sum of Hamming weights of words by frequency: \Omega^\mu = \sum_{i=1}^{M} \psi_i \gamma^\mu_i
4. Titles. \theta^\mu = \frac{\gamma^\mu \cdot \mathrm{Title}}{\|\gamma^\mu\|\,\|\mathrm{Title}\|}, the cosine between sentence µ and the title vector.

Finally, a decision algorithm combines those metrics to score sentences. For each sentence µ, two sums are calculated over the Γ normalized metric values, one for λ^µ_ν > 0.5 and one for λ^µ_ν < 0.5 (λ^µ_ν = 0.5 is ignored):

    \alpha = \sum_{\nu=1,\ \lambda^\mu_\nu > 0.5}^{\Gamma} (\lambda^\mu_\nu - 0.5) ; \quad \beta = \sum_{\nu=1,\ \lambda^\mu_\nu < 0.5}^{\Gamma} (0.5 - \lambda^\mu_\nu)    (6)

The next expression is used to calculate the score of each sentence:

    \text{if } \alpha > \beta \text{ then } \Lambda^\mu = 0.5 + \frac{\alpha}{\Gamma}, \text{ else } \Lambda^\mu = 0.5 - \frac{\beta}{\Gamma}

Cortex sorts the final sentences by Λ^µ, µ = 1, ..., P. Additionally, Cortex lets us set a compression rate, which was fixed at 500 words.
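To make the decision algorithm concrete, the following sketch scores sentences with equation (6) and fills the 500-word budget; the helper names and the greedy word-budget loop are our own assumptions, and the computation of the normalized metrics themselves (see the equations above and [30]) is omitted.

def cortex_score(metric_values):
    """Decision algorithm of equation (6): each normalized metric
    value above 0.5 votes to keep the sentence, each value below
    0.5 votes to drop it; values equal to 0.5 are ignored."""
    gamma = len(metric_values)                      # Γ metrics
    alpha = sum(v - 0.5 for v in metric_values if v > 0.5)
    beta = sum(0.5 - v for v in metric_values if v < 0.5)
    if alpha > beta:
        return 0.5 + alpha / gamma
    return 0.5 - beta / gamma

def summarize(sentences, metric_table, max_words=500):
    """Rank sentences by their Λ score and keep the best ones
    until the 500-word compression rate is reached."""
    scored = sorted(((cortex_score(m), s)
                     for s, m in zip(sentences, metric_table)),
                    reverse=True)
    summary, budget = [], max_words
    for _, sent in scored:
        n = len(sent.split())
        if n <= budget:
            summary.append(sent)
            budget -= n
    return " ".join(summary)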
6 Experiments and Results

6.1 Design of Experiments

We used three sizes of training corpora, 100K, 200K, and 500K word tokens, to test our stemmer. With these sizes we performed the three runs for the INEX track. The assigned run numbers were 153 (100K), 154 (200K), and 155 (500K). The evaluation corpus consisted of the 1133 tweets contextualized with text from the November 2011 Wikipedia. For the training corpora, we selected 24 documents from the same set of contextualized tweets.

6.2 Results

For informativeness, Cortex coupled with our stemmer obtained ranks 12, 14, and 15. Average informativeness scores are shown in Table 1. The best run in this evaluation was run 154 (200K).

Table 1. Average scores of informativeness

Rank  Run  Unigrams  Bigrams  Skip
12    154  0.8233    0.9254   0.9251
14    155  0.8253    0.9280   0.9274
15    153  0.8266    0.9291   0.9290

Those scores were computed by the organizers using a Perl script (inexqa-eval.pl); for details about this script see [1]. On the other hand, the best results in the readability evaluation were obtained by run 155 (500K); see Table 2. Comparing our results with the other runs, run 155 (500K) obtained rank 4 in relevance, rank 6 in syntax, and rank 9 in structure. The worst run in our experiment was run 153 (100K) in both evaluations.

Table 2. Scores of readability

Run  Relevance  Syntax  Structure
155  0.6968     0.6161  0.5315
154  0.5352     0.5305  0.4748
153  0.4984     0.4576  0.3784

7 Conclusions and Future Work

In this paper we reported an experiment with a stemmer based on morphological segmentation. We used affixality measurements in order to segment words. This stemmer was coupled with Cortex, an automatic summarization system. We applied the following stemming strategy: given the affixality peaks of a word, we truncated it at the leftmost peak. We also tested three training corpus sizes for estimating the affixality indexes: 100K, 200K, and 500K word tokens.

Our two goals were to determine whether our stemming strategy can produce readable summaries, and whether different sizes of training corpora can improve Cortex's performance.

According to the evaluation results, our stemming strategy produces not only readable summaries but also competitive ones. With an average of relevance, syntax, and structure of 0.6148, run 155 obtained rank 7 among 27 runs. What is more, concerning informativeness, run 154 obtained rank 12 among 33 participants.

Regarding corpus sizes, it is not clear whether 200K or 500K word tokens is the best size for English. However, it is clear that increasing the corpus size is a good strategy, because 100K obtained the worst results. Additionally, a bigger training corpus gives a better position in the ranking: by average of relevance, syntax, and structure, run 155 (500K) obtained rank 7 while run 153 (100K) obtained rank 15.

In future experiments we will test different strategies for morphological segmentation and stemming. Additionally, we can compare against different stemming approaches, such as Porter's stemmer.

References

1. SanJuan, E., Moriceau, V., Tannier, X., Bellot, P., Mothe, J.: Overview of the INEX 2011 Question Answering Track (QA@INEX). In: INEX 2011 Workshop Pre-Proceedings, IR Publications, Hofgut Imsbach, Saarbrücken, Germany (2011) 145–153
2. Harris, Z.S.: From Phoneme to Morpheme. Language 31 (1955) 190–222
3. Goldsmith, J.: Unsupervised Learning of the Morphology of a Natural Language. Computational Linguistics 27 (2001) 153–198
4. Goldsmith, J.: An Algorithm for the Unsupervised Learning of Morphology. Natural Language Engineering 12 (2006) 353–371
5. Paik, J.H., Mitra, M., Parui, S.K., Järvelin, K.: GRAS: An effective and efficient stemming algorithm for information retrieval. ACM Trans. Inf. Syst. 29 (2011)
6. Creutz, M., Lagus, K.: Unsupervised Discovery of Morphemes. In: Proc. of the Workshop on Morphological and Phonological Learning of ACL-02, Philadelphia, SIGPHON-ACL (2002) 21–30
7. Creutz, M.: Unsupervised segmentation of words using prior distributions of morph length and frequency. In Hinrichs, E., Roth, D., eds.: 41st Annual Meeting of the ACL, Sapporo, Japan (2003) 280–287
8. Creutz, M., Lagus, K.: Induction of a Simple Morphology for Highly-Inflecting Languages. In: Proc. of 7th Meeting of the ACL Special Interest Group in Computational Phonology SIGPHON-ACL (2004) 43–51
9. Creutz, M., Lagus, K.: Inducing the Morphological Lexicon of a Natural Language from Unannotated Text. In: Int. and Interdisciplinary Conf. on Adaptive Knowledge Representation and Reasoning (AKRR05) (2005) 106–113
10. Lovins, J.B.: Development of a Stemming Algorithm. Mechanical Translation and Computational Linguistics 11 (1968) 23–31
11. Porter, M.F.: An algorithm for Suffix Stripping. Program 14 (1980) 130–137
12. Lennon, M., Pierce, D., Tarry, B., Willett, P.: An evaluation of some conflation algorithms for information retrieval. J. of Information Science 3 (1981) 177–183
13. Majumder, P., Mitra, M., Pal, D.: Bulgarian, Hungarian and Czech stemming using YASS. In: Proceedings of Advances in Multilingual and Multimodal Information Retrieval, Springer-Verlag, Berlin (2008) 49–56
14. Bacchin, M., Ferro, N., Melucci, M.: A probabilistic model for stemmer generation. Information Processing & Management 41 (2005) 121–137
15. McNamee, P., Mayfield, J.: Character n-gram tokenization for European language text retrieval. Information Retrieval 7 (2004) 73–97
16. Krovetz, R.: Viewing Morphology as an Inference Process. In: Proceedings of the 16th ACM/SIGIR Conference (1993) 191–202
17. Hull, D.A.: Stemming algorithms - A case study for detailed evaluation. Journal of the American Society for Information Science 47 (1996) 70–84
18. Medina-Urrea, A.: Investigación cuantitativa de afijos y clíticos del español de México. Glutinometría en el Corpus del Español Mexicano Contemporáneo. PhD thesis, El Colegio de México, México (2003)
19. Medina-Urrea, A.: Automatic Discovery of Affixes by means of a Corpus: A Catalog of Spanish Affixes. Journal of Quantitative Linguistics 7 (2000) 97–114
20. Medina-Urrea, A., Hlaváčová, J.: Automatic Recognition of Czech Derivational Prefixes. In: Proceedings of CICLing 2005. Volume 3406, LNCS, Springer, Berlin/Heidelberg/New York (2005) 189–197
21. Medina-Urrea, A.: Affix Discovery based on Entropy and Economy Measurements. Texas Linguistics Society 10 (2008) 99–112
22. Shannon, C., Weaver, W.: The Mathematical Theory of Communication. University of Illinois Press, Urbana (1949)
23. de Kock, J., Bossaert, W.: Introducción a la lingüística automática en las lenguas románicas. Gredos, Madrid (1974)
24. Greenberg, J.H.: Essays in Linguistics. The Univ. of Chicago Press, Chicago (1957)
25. Torres-Moreno, J.M.: Résumé automatique de documents. Lavoisier, Paris (2011)
26. Torres-Moreno, J.M., Saggion, H., da Cunha, I., SanJuan, E., Velázquez-Morales, P.: Summary Evaluation with and without References. Polibits 42 (2010) 13–19
27. Saggion, H., Torres-Moreno, J.M., da Cunha, I., SanJuan, E.: Multilingual summarization evaluation without human models. In: 23rd Int. Conf. on Computational Linguistics. COLING '10, Beijing, China, ACL (2010) 1059–1067
28. Torres-Moreno, J.M., Velázquez-Morales, P., Meunier, J.: CORTEX, un algorithme pour la condensation automatique de textes. In: ARCo. Volume 2 (2005) 365
29. Torres-Moreno, J.M., Velázquez-Morales, P., Meunier, J.: Condensés de textes par des méthodes numériques. JADT 2 (2002) 723–734
30. Torres-Moreno, J.M., St-Onge, P.L., Gagnon, M., El-Bèze, M., Bellot, P.: Automatic Summarization System coupled with a Question-Answering System (QAAS). CoRR abs/0905.2990 (2009)