<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>LCN</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Retrieval Experiments at Morpho Challenge 2008</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Paul McNamee JHU Human Language Technology Center of Excellence</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2008</year>
      </pub-date>
      <volume>5</volume>
      <issue>0</issue>
      <abstract>
        <p>Morpho Challenge 2008 hosted an extrinsic evaluation of morphological analysis that explored whether unsupervised morphology induction could bene t information retrieval. This paper presents results in alternative methods for word normalization using test sets from the Cross-Language Evaluation Forum (CLEF) ad-hoc collections. Preliminary results for the Morpho Challenge 2008 evaluation are consistent with these data. We found that: (1) rule-based stemming is e ective in less morphologically complicated languages; (2) alternative methods for stemming such as unsupervised learning of morphemes and least common n-gram stemming are helpful; and, (3) full character n-gram indexing is the most e ective form of tokenization in more morphologically complex languages.</p>
      </abstract>
      <kwd-group>
        <kwd>Multilingual text retrieval</kwd>
        <kwd>Character n-grams</kwd>
        <kwd>Morphological normalization</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>Experimentation</p>
    </sec>
    <sec id="sec-2">
      <title>Introduction</title>
      <p>Over the past year we conducted experiments using alternative methods of lexical normalization.
Using test sets in 13 languages used in Cross-Language Evaluation Forum (CLEF) evaluations
between 2002 and 2007, we compared the use of a rule-based stemmer (Snowball ), an unsupervised
morphological segmenter (Morfessor ), a synthetic form of stemming based on selecting a single
character n-gram from each word, and the use of all n-grams for a given word. Character n-grams
of length n = 5 were the most e ective technique, performing 18% better than unnormalized
words, averaged across the set of languages.</p>
      <p>N-grams possess many advantages. They work in every language, require no training, and are
more e ective than plain words. Their main shortcoming is increased disk space requirements
and slower query response times. To address the performance penalty we introduced a method
of n-gram stemming that selects for each word its least common n-gram. By selecting only one
n-gram per word, the inverted le becomes no larger than when words or stemmed words are used.
And query responses times also improve since the number of query terms becomes equal to the
number of words, not a function of the number of letters in the query string.</p>
      <p>
        While our n-gram stems perform better than unlemmatized words, it was clear from the Morpho
Challenge 2007 evaluation [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] that other techniques were even more e ective. This year we wanted
to compare full (i.e., traditional) n-gram indexing against the more linguistically sophisticated
approaches of other participants. For the Information Retrieval subtask at Morpho Challenge
2008 we provided o cial submissions using three techniques in each of English, Finnish, and
German. The methods were:
      </p>
      <p>Character n-grams with length n = 5
Character n-grams with length n = 4</p>
      <p>The rarest 5-gram from each word (measured using document frequency statistics)
In Section 2 we describe our tokenization experiments on the CLEF data sets. In Section 3 we
analyze our Morpho Challenge 2008 results.
2</p>
    </sec>
    <sec id="sec-3">
      <title>Analysis on Past CLEF Collections</title>
      <p>
        We compared plain words, stems, induced morphemes, n-gram stems, and character n-grams using
test sets from the CLEF ad hoc tasks between 2002 and 2007. In each of the 13 languages we
used two years worth of data except for Czech where only one year was available. The number of
test queries per language varied from 50 (Czech) to 107 (Spanish). The document collections for
each language were indexed using the various methods of tokenization. Common to each method
was conversion to lower case letters, removal of punctuation, and truncation of long numbers to
6 digits. The HAIRCUT retrieval engine was used with a statistical language model retrieval
metric. The similarity calculation combines document term frequencies and corpus frequencies
(for smoothing) using linear interpolation with a smoothing constant of 0.5 [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The methods of
tokenization examined were:
words: space-delimited tokens.
snow: output of the Snowball stemmer1, if available in the given language.
morf : the set of morphemes produced by Morfessor 2 for each input word. A model was
trained using the document collection's lexicon with digit-containing tokens omitted. The
default parameters for the Morfessor algorithm were used [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
lcn4: the least common word-internal character 4-gram for each input word [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
lcn5: the least common word-internal character 5-gram for each input word.
4-gram: overlapping, word-spanning character 4-grams produced from the stream of words
encountered in the document or query.
      </p>
      <p>5-gram: length n = 5 n-grams created in the same fashion as the character 4-grams.
In Table 1 results are presented using mean average precision to compare performance. The score
for the highest performing technique in each language is emboldened.
2.1</p>
      <sec id="sec-3-1">
        <title>Unnormalized words</title>
        <p>Not attempting to control for morphological processes can have harmful e ects. In Bulgarian,
Czech, Finnish, and Hungarian, more than a 30% loss is observed compared to the use of 4-grams
as indexing terms.</p>
        <p>1Available from http://snowball.tartarus.org/
2Available from http://www.cis.hut. /projects/morpho/
5-gram
0.2916
0.3223
0.4443
0.4612
0.4960
0.4399
0.4321
0.3438
0.4220
0.3515
0.3330
0.4376
0.4271
0.3979
0.3956</p>
        <p>Word
authored
authorized
authorship</p>
        <p>afoot
footballs
footloose
footprint
feet
juggle
juggled
jugglers</p>
        <p>
          Snowball
author
author
authorship
afoot
footbal
footloos
footprint
feet
juggl
juggl
juggler
As it may be di cult to nd a rule-based stemmer for every language, a language-independent
approach can be quite attractive. The Morfessor algorithm only requires a lexicon (i.e., wordlist)
for a language to learn to identify morpheme boundaries, even for previously unseen words. Such
automatically detected segments can be an e ective form of tokenization [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. Examples of the
algorithm's output are presented in Table 2, along with results for Snowball and character
5grams.
        </p>
        <p>Compared to plain words the induced morphemes produced by Morfessor led to gains in 9 of 13
languages; 8 of these were signi cant improvements with p &lt; 0.05 (Wilcoxon test). The languages
where words outperformed segments were English (dramatically), French, Italian, and Spanish {
each is relatively low in morphological complexity. The di erences in French and Spanish were
less than 0.004 in absolute terms. Segments achieved more than a 20% relative improvement in
Bulgarian, Finnish, and Russian, and over 40% in Czech and Hungarian.
2.4</p>
      </sec>
      <sec id="sec-3-2">
        <title>Least Common N-gram Stems</title>
        <p>Another language-neutral approach to stemming is to select for each word, its least common
ngram. This requires advance knowledge of n-gram frequencies, but this is easily obtainable by
constructing a regular n-gram index, or even by scanning a corpus and counting. Lengths of
n = 4 and n = 5 appear about equally e ective with a slight advantage for lcn4, but this is
in uenced primarily by the languages with greater morphological complexity, which see larger
changes. An 8% relative improvement in mean average precision over words is obtained. As can
be seen from Table 1, in languages where rule-based stemming is available its use is preferable.
N-gram stemming achieves comparable performance with Morfessor segments..
2.5</p>
      </sec>
      <sec id="sec-3-3">
        <title>Overlapping Character N-grams</title>
        <p>N-grams achieve morphological regularization indirectly due to the fact that subsequences that
touch on word roots will match. For example, \juggling" and \juggler" will share the 5-grams jugg
and juggl. While n-gram's redundancy enables useful matches, other matches are less valuable,
for example, every word ending in `tion' will share 5-gram tion with all of the others. In practice
these morphological false alarms are almost completely discounted because term weighting
deemphasizes them. In fact, such a xes can be so common, that ignoring them entirely by treating
them as \stop n-grams" is a reasonable thing to do.</p>
        <p>
          Character n-grams are the most e ective technique studied here, giving a relative improvement
of 18%. Consistent with earlier work [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] lengths of n = 4 and n = 5 are equally e ective averaged
across the 13 languages; however there are some noticeable di erences in particular languages. The
data is suggestive of a trend that the most morphologically variable languages (i.e., Bulgarian,
Czech, Hungarian, and Russian) gain more from 4-grams than 5-grams, while 5-grams have a
slight advantage in medium complexity languages.
        </p>
        <p>Snowball stems are roughly as e ective as n-grams, on average, but only available in certain
languages (i.e., 8 of 13 in this study). The other \alternative" stemming approaches, segments
and least common n-grams, appear to gain about half of the bene t that full n-gram indexing sees
compared to unnormalized word forms.
3</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Morpho Challenge 2008 Submissions</title>
      <p>This year's evaluation was conducted in English, Finnish, and German, as was the case in
Morpho Challenge 2007. The contest was conducted by having participants submit an analysis for
each word form, which would be used to create replacement indexing terms in an IR system.
According to the contest website the LEMUR IR engine was used with Okapi BM 25 term
weighting. Preliminary results reported by the organizers are presented in Table 3 using mean average
precision.</p>
      <p>An examination of the preliminary results shows the same basic trend identi ed in Section 2.
Rule-based stemming is most e ective in English, and in a morphologically rich language such as
Finnish, n-grams have a strong advantage.
4</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>We examined a variety of methods for lexical normalization, nding that the most e ective
technique was character n-gram indexing which achieved a relative gain of 18% in mean average
precision over unlemmatized words. In Czech, Bulgarian, Finnish, and Hungarian gains of over
40% were observed. While rule-based stemming can be quite e ective, such tools are not available
in every language and even when present, require additional work to integrate with an IR system.
When language-neutral methods are able to achieve the same, or better performance, their use
should be seriously considered.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Mathias</given-names>
            <surname>Creutz</surname>
          </string-name>
          and
          <string-name>
            <given-names>Krista</given-names>
            <surname>Lagus</surname>
          </string-name>
          .
          <article-title>Unsupervised morpheme segmentation and morphology induction from text corpora using Morfessor 1.0</article-title>
          .
          <string-name>
            <surname>Technical</surname>
            <given-names>report</given-names>
          </string-name>
          , Helsinki University of Technology,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Mikko</given-names>
            <surname>Kurimo</surname>
          </string-name>
          , Mathias Creutz, and
          <string-name>
            <given-names>Ville</given-names>
            <surname>Turunen</surname>
          </string-name>
          .
          <article-title>Overview of Morpho Challenge in CLEF 2007</article-title>
          .
          <article-title>In Alessandro Nardi</article-title>
          and Carol Peters, editors,
          <source>Working Notes of the CLEF 2007 Workshop</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <article-title>[3] James May eld and Paul McNamee</article-title>
          .
          <article-title>Single n-gram stemming</article-title>
          .
          <source>In SIGIR</source>
          , pages
          <volume>415</volume>
          {
          <fpage>416</fpage>
          . ACM,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Paul</given-names>
            <surname>McNamee</surname>
          </string-name>
          and
          <article-title>James May eld. Character N-gram tokenization for european language text retrieval</article-title>
          .
          <source>Information Retrieval</source>
          ,
          <volume>7</volume>
          (
          <issue>1</issue>
          -2):
          <volume>73</volume>
          {
          <fpage>97</fpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Paul</given-names>
            <surname>McNamee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Charles</given-names>
            <surname>Nicholas</surname>
          </string-name>
          , and
          <article-title>James May eld</article-title>
          . Don'
          <article-title>t have a stemmer?: un+concern+ed</article-title>
          .
          <source>In SIGIR '08</source>
          , pages
          <fpage>813</fpage>
          {
          <fpage>814</fpage>
          , New York, NY, USA,
          <year>2008</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Jay</surname>
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Ponte</surname>
            and
            <given-names>W. Bruce</given-names>
          </string-name>
          <string-name>
            <surname>Croft</surname>
          </string-name>
          .
          <article-title>A language modeling approach to information retrieval</article-title>
          .
          <source>In SIGIR</source>
          , pages
          <volume>275</volume>
          {
          <fpage>281</fpage>
          . ACM,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>