N-Gram Morphemes for Retrieval

Paul McNamee and James Mayfield
JHU Applied Physics Laboratory
{paul.mcnamee,james.mayfield}@jhuapl.edu

Abstract

Stemming, an approximation to morphological analysis, is a commonly used technique to improve performance in information retrieval systems. In the MorphoChallenge 2007 evaluation we applied a simple zero-knowledge technique that is based on frequency counts rather than machine learning. Our method substitutes a single fixed-length substring for each word that appears in documents or queries. We hope to discover whether this method, which has been used to good effect in previous IR evaluations, will be as effective for the information retrieval task as the unsupervised methods used by other participants. It should be emphasized that our submission was not a credible attempt to learn morphology and thus is not expected to perform well in the morphology induction task.

Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval

General Terms
Experimentation

Keywords
Morphological Analysis, Character N-gram Tokenization, Text Retrieval

1 Introduction

Our aim in the MorphoChallenge 2007 evaluation was to explore how a simple method for approximating word stemming would compare against state-of-the-art morphology induction on the information retrieval task. Our method is based on character n-gram tokenization, which we have used successfully in the Cross Language Evaluation Forum (CLEF) competitions. N-grams have the advantages of simplicity and accuracy, but incur penalties of increased disk space requirements and longer run times. In an attempt to achieve the benefits of n-gram indexing without incurring a substantial cost in run-time performance, we devised a method of 'n-gram stemming' that replaces each word with a single selected n-gram.
The method presumes only the availability of text, which is clearly reasonable for IR applications. In Section 2 we explain n-gram indexing and describe how synthetic morphemes are obtained. We evaluate the effectiveness of the technique in Section 3 using previous CLEF data sets. We summarize our findings in Section 4 and give a complete listing of our source code for producing morphemes in an appendix.

2 Character N-gramming

2.1 N-gram Indexing

Commonly, IR systems adopt a bag-of-words model in which the words in a document are treated as the basic indexing units. Often the tokenization process additionally transforms words into stems or root forms, so that words that differ only by inflectional, or perhaps even derivational, morphology are treated as the same. With character n-gram indexing, words are not the basic indexing unit; instead, character substrings are used, typically of a fixed length. For example, for the text 'Johns Hopkins' the character 4-grams generated are: #joh, john, ohns, hns#, ns#h, s#ho, #hop, hopk, opki, pkin, kins, and ins#.^1 While short n-grams can have higher individual ambiguity than words (e.g., ins# could come from the words fins, skins, and dolphins, among others), this collection of twelve n-grams is unlikely to occur unless the phrase 'Johns Hopkins' was present in the original text.

Storing 12 entries in an inverted file instead of just two (one for each word) is why n-gram indexing suffers from performance issues. However, this expense also enables flexible matching. Web log analysis shows that roughly 30% of users mistakenly^2 type "john hopkins", and yet 8 of the 12 n-grams still match this string.

N-grams have been used in many Asian languages because of inherent difficulties with word segmentation; however, they have not been very popular in Western languages. Damashek [1] promoted their use at early TREC evaluations, but his claims generated controversy and were received with skepticism.
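As an illustration (not part of the original system), the character n-gram tokenization described above can be sketched in a few lines of Python, assuming lowercasing and '#' padding at word boundaries as in the example:

```python
def char_ngrams(text, n=4):
    """Split text into overlapping character n-grams.

    '#' marks word boundaries, as in the 'Johns Hopkins' example.
    """
    padded = "#" + "#".join(text.lower().split()) + "#"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

# 'Johns Hopkins' yields the twelve 4-grams listed above:
# ['#joh', 'john', 'ohns', 'hns#', 'ns#h', 's#ho',
#  '#hop', 'hopk', 'opki', 'pkin', 'kins', 'ins#']
print(char_ngrams("Johns Hopkins"))
```

Note that interior whitespace is collapsed to a single '#', so a word-boundary n-gram such as ns#h spans the two words, which is what enables partial matching of the misspelled "john hopkins".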
McNamee and Mayfield [3] demonstrated that n-grams are effective in European languages, presenting evidence from eight languages in the CLEF 2002 evaluation. They found that lengths of n=4 and n=5 worked about equally well and significantly outperformed unnormalized words.

2.2 N-gram Stemming

In their article on n-gram indexing, McNamee and Mayfield left unaddressed the question of whether n-grams achieve higher performance than stemmed words. In other work, however, they made a direct comparison of n-grams and stems and found that 4-grams performed on par with Porter's SNOWBALL stemmer, and in three of the eight CLEF 2002 languages performed significantly better. The tendency was that n-grams held an advantage in the more morphologically complex languages (i.e., Finnish, Swedish, German). Still, the question remained whether the advantage in accuracy was worth the degradation in run-time performance. The key factor in increased disk space and run times is that each word generates multiple n-grams. If only there were some way to prune away most of the n-grams and yet preserve enough of their benefits to still confer an advantage. The selection method for each word must be efficient, as it is an operation that will be performed on every word in the corpus; billions of operations will be required.

Mayfield and McNamee [2] introduced n-gram stemming, in which a single n-gram is used to represent each word. The n-gram is chosen based on the relative document frequencies of the n-grams spanning the word under consideration. The least frequent n-gram is used; this makes sense under the hypothesis that frequent n-grams are more likely to come from the morphologically variable part of a word than from its root. As an example, consider the words jugglers and juggling. Using statistics from 110,282 newspaper articles, the constituent n-grams of each word are shown in Table 1.
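The selection rule can be sketched as follows (an illustrative Python re-implementation, not the Perl program from the appendix; the `df` table below uses the frequencies reported in Table 1):

```python
def ngram_stem(word, df, n=4):
    """Replace a word with its least-frequent constituent n-gram.

    df maps n-grams to document frequencies; n-grams unseen in the
    corpus are skipped, and the word itself is returned when no
    constituent n-gram is known.
    """
    padded = "#" + word.lower() + "#"
    grams = [padded[i:i + n] for i in range(len(padded) - n + 1)]
    known = [g for g in grams if g in df]
    return min(known, key=df.get) if known else word

# Document frequencies from Table 1:
df = {"#jug": 681, "jugg": 495, "uggl": 6775, "ggle": 5377,
      "gler": 890, "lers": 9650, "ers#": 89701,
      "ggli": 3003, "glin": 4567, "ling": 55210, "ing#": 106463}

print(ngram_stem("jugglers", df))  # jugg
print(ngram_stem("juggling", df))  # jugg
```

Because both words share the rare interior n-gram jugg, they are conflated to the same synthetic morpheme, which is exactly the behavior a stemmer is meant to provide.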
It can be clearly seen that the rarest n-gram, 'jugg', occurs much less often than the suffix n-grams 'ers#' and 'ing#'. Both words would be replaced with 'jugg', which intuitively seems like a good choice.

Other methods for selecting a single n-gram, or a small number of n-grams, are possible. For example, the examples we have given use n=4, but since three runs were permitted in this year's evaluation we also created lists using n=3 and n=5. One could consider variants such as using two disjoint 3-grams, or never ending an n-gram on a vowel, or any number of other permutations that might have linguistic merit.

^1 The # symbol is used to denote whitespace.
^2 See http://en.wikipedia.org/wiki/Johns_Hopkins for an explanation of this unusual given name.

Table 1: Frequency of n-grams in jugglers and juggling

  n-gram   df(jugglers)     n-gram   df(juggling)
  #jug            681       #jug            681
  jugg            495       jugg            495
  uggl           6775       uggl           6775
  ggle           5377       ggli           3003
  gler            890       glin           4567
  lers           9650       ling          55210
  ers#          89701       ing#         106463

Table 2: Efficacy of N-gram Stemming and Full N-gram Indexing (CLEF)

             words     4-stems            4-grams
  Bulgarian  0.2096    0.2583 (+23.2%)    0.2928 (+39.7%)
  English    0.3897    0.3796 (-2.6%)     0.4099 (+5.2%)
  Finnish    0.2278    0.2532 (+11.1%)    0.3754 (+64.8%)
  German     0.3335    0.3404 (+2.1%)     0.3994 (+19.8%)
  Hungarian  0.1983    0.2939 (+48.2%)    0.3568 (+80.0%)

3 Evaluation

3.1 Task 1: Morphology Induction

For Task 1 we simply reported, for each word, the 3-, 4-, or 5-gram with the lowest collection frequency based on the training data provided for each language. We did not predict any attributes such as +PL or +3 PER. Since we did not set out to truly learn morphological rules, we were not surprised by our results: our runs had a lower F-score than all other runs in each of the four languages. We note that our 5-gram runs did have the highest precision in each of the four languages, albeit with very low recall.

3.2 Prior Results in Information Retrieval

We have experimented with n-gram morphemes on previous CLEF data sets.
In Table 2 we present results in five languages, using data from CLEF 2002 (English, German, and Finnish) and CLEF 2005 (Bulgarian and Hungarian), that illustrate the relative effectiveness of n-gram stemming and full n-gram indexing compared to processing unnormalized words. We report performance using mean average precision. Comparisons should not be made across languages in Table 2, but the results are meaningful across the different conditions within each language. N-gram stemming usually helps, though performance was slightly worse for English. The largest gains are in the more morphologically complex languages. N-gram stemming does not perform as well as ordinary n-gram indexing, where the set of all n-grams is used instead of a single selected one.

3.3 Task 2: Information Retrieval

In this competition the organizers took lists of normalized word forms and replaced words in the CLEF source documents with a canonical form. The test sets used were: CLEF-2005 (English), CLEF-2004 (Finnish), and CLEF-2003 (German). The Lemur retrieval engine was used with TF/IDF or Okapi BM25 term weighting. No automated relevance feedback was performed on queries.

Table 3: Task 2 Results for N-gram Stemming and Reference Conditions

                        English-05   Finnish-04   German-03
  No Analysis (dummy)   0.2783       0.3496       0.3559
  Porter Stemmer        0.3052       0.3725       0.3566
  4-Stems               0.2838       0.3049       0.3257
  5-Stems               0.2885       0.3327       0.3618
  Baseline Morfessor    0.2863       0.3874       0.4105
  Gold Standard         0.2602       0.3128       0.3734

Table 4: Results from Prior CLEF Evaluations

               English-05   Finnish-04   German-03
  All 4-grams  0.3692       0.5064       0.5056
  All 5-grams  0.3873       0.5236       0.4869

In Table 3 we compare our performance using 4-stems and 5-stems against several reference conditions. These figures are all based on the withnew condition, the scenario in which all words in the IR collection were transformed, with TF/IDF weighting. The 4-stems do not score as high as the 5-stems.
The 5-stems outperform the dummy condition and the gold standard analysis in two of three languages. The Morfessor baseline roundly beats the 5-stems in Finnish and German, though the two methods are comparable in English. The results in Table 2, using CLEF 2002 data (which was not used in MorphoChallenge 2007), also show little difference between the dummy condition (labeled 'words' in Table 2) and n-gram morphemes. However, in Table 2, the use of all n-grams that span a word yields sizable improvements. This is also the case on the test sets used in this year's competition. In Table 4 we list results obtained using the JHU/APL HAIRCUT retrieval engine in the CLEF evaluations of 2003-2005. These runs also made use of relevance feedback and are not directly comparable to the results in Table 3. Nonetheless, these figures suggest that the use of all n-grams yields better performance than a single selected one.

4 Summary

We set out to participate in the information retrieval evaluation, not the morphology induction task. Our method for producing morphemes is unorthodox, but we believe that it is suitable for use in information retrieval. We provided results, based on our own information retrieval experiments, that support this claim and that offer a means of obtaining improvement with n-grams without a run-time penalty. The technique is also well motivated when working in a language with no available morphological toolkit. The results from the Task 2 evaluation suggest that n-gram morphemes perform better than unnormalized words and similarly to the gold standard analysis, but not as well as the Morfessor algorithm.

References

[1] Marc Damashek. Gauging similarity with n-grams: Language-independent categorization of text. Science, 267:843-848, 10 February 1995.
[2] James Mayfield and Paul McNamee. Single n-gram stemming. In SIGIR, pages 415-416, 2003.
[3] Paul McNamee and James Mayfield. Character n-gram tokenization for European language text retrieval. Inf.
Retr., 7(1-2):73-97, 2004.

Program Listing

#!/usr/bin/perl
# Take a corpus of training data and an input word list, compute n-gram
# frequencies, and output a transformed word list that reduces word forms
# to their least frequent n-gram.
# Usage: CreateStems corpus.txt wordlist.txt

package main;

my %cnt = ();
my $nlen = 5;
my $corpfile = $ARGV[0];
my $wordfile = $ARGV[1];

open FH, "<$corpfile";
while (<FH>) {
    my $line = $_;
    chomp($line);
    next if ($line =~ /^0\t/);
    $line =~ s/\d+\t//;          # remove leading junk
    $line =~ s/\s+/_/g;          # replace spaces with underscores
    $line = "_" . $line . "_";   # padding

    # split line into n-grams
    my $lenlim = length($line) - $nlen + 1;
    for (my $i = 0; $i < $lenlim; $i++) {
        my $tok = substr($line, $i, $nlen);
        # adjust counts for each observed n-gram
        $cnt{$tok}++;
    }
}
close FH;

open FH, "<$wordfile";
while (<FH>) {
    my $line = $_;
    chomp($line);
    $line =~ s/\d+\s+//;         # remove leading junk (i.e., the frequency), keep the word
    my $origword = $line;
    $line =~ s/\s+/_/g;          # replace spaces with underscores
    $line = "_" . $line . "_";   # padding

    # split line (really a word) into n-grams
    my $lenlim = length($line) - $nlen + 1;
    my $besttok = "";
    my $lowestfreq = 100000000;
    for (my $i = 0; $i < $lenlim; $i++) {
        my $tok = substr($line, $i, $nlen);
        if (exists($cnt{$tok}) && $cnt{$tok} < $lowestfreq) {
            $besttok = $tok;
            $lowestfreq = $cnt{$tok};
        }
    }
    if ($besttok eq "") {
        # Word wasn't in the corpus, or was too short to produce an
        # n-gram. Use the original word.
        $besttok = $origword;
    }
    # Now output word and stem
    print "$origword\t$besttok\n";
}
close FH;
1;