<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>JHU Experiments in Monolingual Farsi Document Retrieval at CLEF 2009</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Paul McNamee</string-name>
          <email>paul.mcnamee@jhuapl.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>JHU Human Language Technology Center of Excellence</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>At CLEF 2009 JHU submitted runs in the ad hoc track for the monolingual Persian evaluation. Variants of character n-gram tokenization provided a 10% relative gain over unnormalized words. A run based on skip n-grams, which allow internal skipped letters, achieved a mean average precision of 0.4938. Using traditional 5-grams resulted in a score of 0.4868 while plain words had a score of 0.4463.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Experimentation</title>
    </sec>
    <sec id="sec-2">
      <title>Farsi document retrieval</title>
      <sec id="sec-2-1">
        <title>Introduction</title>
        <p>
          For CLEF 2009 we participated in the ad hoc Persian task, submitting results only for the
monolingual condition. Similar to our experiments at CLEF 2008, these experiments were based on
comparing different tokenization methods. The JHU HAIRCUT retrieval system was used with a
statistical language model similarity metric [
          <xref ref-type="bibr" rid="ref2 ref5">2, 5</xref>
          ]:
        </p>
        <p>
          <disp-formula id="eq-1">
            <label>(1)</label>
            <tex-math>P(D \mid Q) \propto \prod_{t \in Q} \left[ \lambda P(t \mid D) + (1 - \lambda) P(t \mid C) \right]</tex-math>
          </disp-formula>
        </p>
        <p>Though performance might be improved slightly by optimizing the choice of λ as a function of
tokenization, for simplicity a smoothing constant of 0.5 was used throughout these experiments.
Automated relevance feedback was used for all submitted runs. From an initial pass of retrieval
the 20 top-ranked documents were considered, and depending on the method of tokenization a
different number of expansion terms was used. We submitted five runs based on: (a) plain words; (b)
words that were truncated to at most 5 characters; (c) overlapping character n-grams of lengths
4 &amp; 5; and (d) a variant of character n-gram indexing allowing some letters to be skipped.
Common to each tokenization method was conversion to lower case letters, removal of punctuation,
and truncation of long numbers to 6 digits.</p>
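        <p>As an illustration of the ranking formula above, the following Python sketch scores documents by the smoothed query likelihood. It is not the HAIRCUT implementation: the function name log_likelihood, the toy documents, and the choice to skip query terms that never occur in the collection are assumptions made for the example. Only the interpolation itself and the smoothing constant of 0.5 come from the text.</p>
        <preformat>
```python
import math
from collections import Counter


def log_likelihood(query_terms, doc_terms, coll_counts, coll_size, lam=0.5):
    """Log of the product in Eq. (1) for a single document.

    lam weights the document model; (1 - lam) weights the collection model.
    """
    doc_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    total = 0.0
    for t in query_terms:
        # Maximum-likelihood estimates of P(t|D) and P(t|C).
        p_doc = doc_counts[t] / doc_len if doc_len else 0.0
        p_coll = coll_counts[t] / coll_size
        mixed = lam * p_doc + (1.0 - lam) * p_coll
        if mixed == 0.0:
            # Assumption: a term absent from the whole collection is ignored.
            continue
        total += math.log(mixed)
    return total


# Hypothetical toy collection, already tokenized.
docs = {
    "d1": ["farsi", "retrieval", "ngram"],
    "d2": ["arabic", "retrieval", "word"],
}
coll_counts = Counter(t for terms in docs.values() for t in terms)
coll_size = sum(coll_counts.values())

query = ["farsi", "retrieval"]
scores = {
    name: log_likelihood(query, terms, coll_counts, coll_size)
    for name, terms in docs.items()
}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```
        </preformat>
        <p>Working in log space avoids numeric underflow on long queries and leaves the ranking unchanged, since the logarithm is monotonic.</p>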
        <p>The tokenization methods examined were:</p>
        <list list-type="bullet">
          <list-item>
            <p>trun5: truncation of words to at most the first five letters.</p>
          </list-item>
          <list-item>
            <p>4-grams: overlapping, word-spanning character 4-grams produced from the stream of words
encountered in the document or query.</p>
          </list-item>
          <list-item>
            <p>5-grams: length n = 5 n-grams created in the same fashion as the character 4-grams.</p>
          </list-item>
          <list-item>
            <p>sk41: skip n-grams of length n = 5 that contain one internal skipped letter. The skipped
letter is replaced with a special symbol to indicate the position of the deleted letter. Skipgram
tokenization of length four for the word kayak would include the regular n-grams kaya and
ayak in addition to k_yak, ka_ak, and kay_k (the skip symbol is shown here as an underscore).</p>
          </list-item>
          <list-item>
            <p>sk51: skip n-grams of length n = 6 that contain one internal skipped letter.</p>
          </list-item>
          <list-item>
            <p>4-grams+sk41: both regular 4-grams and sk41 skip n-grams.</p>
          </list-item>
          <list-item>
            <p>5-grams+sk51: both regular 5-grams and sk51 skip n-grams.</p>
          </list-item>
        </list>
        <p>[Table: only the "Terms" column survives, with rows words, trun5, 4-grams, 4-grams, 5-grams, 5-grams, sk41, sk51, 4-grams + sk41, and 5-grams + sk51; the remaining columns are not recoverable.]</p>
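        <p>The tokenization schemes described above can be sketched in Python as follows. This is a minimal illustration rather than the code used for the submitted runs. The function names, the underscore used as the skip symbol, the single space used to join words into a stream, and the regular expression used to strip punctuation are all assumptions.</p>
        <preformat>
```python
import re

SKIP = "_"  # assumption: stand-in for the special skip symbol


def normalize(text):
    """Lower-case, drop punctuation, and truncate long numbers to 6 digits."""
    cleaned = re.sub(r"[^\w\s]", " ", text.lower())
    return [w[:6] if w.isdigit() else w for w in cleaned.split()]


def trun(words, k=5):
    """Keep at most the first k letters of every word (trun5 when k=5)."""
    return [w[:k] for w in words]


def char_ngrams(words, n):
    """Overlapping, word-spanning character n-grams over the word stream."""
    stream = " ".join(words)  # assumption: words are joined by one space
    return [stream[i:i + n] for i in range(len(stream) - n + 1)]


def skipgrams(words, n, skip=SKIP):
    """Windows of n + 1 characters with one internal letter replaced.

    skipgrams(words, 4) corresponds to sk41 and skipgrams(words, 5) to sk51.
    """
    out = []
    for gram in char_ngrams(words, n + 1):
        for j in range(1, n):  # internal positions only
            out.append(gram[:j] + skip + gram[j + 1:])
    return out


def ngrams_plus_skipgrams(words, n):
    """Regular n-grams together with the matching skip n-grams."""
    return char_ngrams(words, n) + skipgrams(words, n)


words = normalize("Kayak!")
print(char_ngrams(words, 4))
print(skipgrams(words, 4))
```
        </preformat>
        <p>For the word kayak this yields the regular 4-grams kaya and ayak and the skip forms k_yak, ka_ak, and kay_k, matching the example in the list. One caveat of this sketch is that the punctuation pattern treats any character that is neither a word character nor whitespace as a separator, which may split Farsi words joined by a zero-width non-joiner.</p>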
        <p>
          N-grams are effective at addressing morphological variation, particularly in languages where
words have many related surface forms. In recent experiments we have shown that in many
languages more than 25% relative gains can be realized compared to unnormalized words when
n-grams are used [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. N-grams have been used previously in Middle Eastern languages. McNamee
et al. used them for Arabic retrieval at the TREC-2001 and TREC-2002 evaluations [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] and
AleAhmad et al. found character 4-grams to be effective in Farsi [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
        <p>The results of our official runs are presented in Table 1. Compared to the 2008 evaluation
we observe somewhat smaller gains with n-grams, about a 10% relative improvement over plain
words. The best condition involved the use of character skipgrams, which were also found to be
effective in our CLEF 2008 experiments. Despite their computational expense, we like skipgrams
because of the slightly fuzzier matching they allow, which we think may be helpful with Farsi
morphology. In Farsi some morphemes can be either bound or free, and thus there may or may
not be an intervening space character. There is also extensive derivational compounding where
words from different parts of speech are combined.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Conclusions</title>
        <p>Based on our experiments using the 50 queries from the CLEF 2008 evaluation, we expected
n-gram-based techniques to prove effective, and they did. 4-grams, 5-grams, and skipgrams all
provided about a 10% relative gain over plain words.</p>
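        <p>The size of the relative gain can be checked from the mean average precision values reported in the abstract (0.4938 for the skip n-gram run, 0.4868 for 5-grams, and 0.4463 for plain words). The short Python sketch below does the arithmetic; the helper name relative_gain is introduced only for this illustration.</p>
        <preformat>
```python
def relative_gain(score, baseline):
    """Relative improvement of score over baseline, as a fraction."""
    return (score - baseline) / baseline


words_map = 0.4463  # plain words, from the abstract
run_map = {"skipgrams": 0.4938, "5-grams": 0.4868}  # from the abstract

gains_pct = {
    name: round(100.0 * relative_gain(value, words_map), 1)
    for name, value in run_map.items()
}
print(gains_pct)
```
        </preformat>
        <p>This gives roughly 10.6% for the skip n-gram run and 9.1% for 5-grams, consistent with the figure of about 10% quoted in the text.</p>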
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Abolfazl</given-names>
            <surname>AleAhmad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Parsia</given-names>
            <surname>Hakimian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Farzad</given-names>
            <surname>Mehdikhani</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Farhad</given-names>
            <surname>Oroumchian</surname>
          </string-name>
          .
          <article-title>N-gram and Local Context Analysis for Persian Text Retrieval</article-title>
          . In
          <source>Proceedings of International Symposium on Signal Processing and its Applications</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Djoerd</given-names>
            <surname>Hiemstra</surname>
          </string-name>
          .
          <article-title>Using Language Models for Information Retrieval</article-title>
          .
          <source>PhD thesis</source>
          , University of Twente,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Paul</given-names>
            <surname>McNamee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Charles</given-names>
            <surname>Nicholas</surname>
          </string-name>
          , and
          <string-name>
            <given-names>James</given-names>
            <surname>Mayfield</surname>
          </string-name>
          .
          <article-title>Addressing Morphological Variation in Alphabetic Languages</article-title>
          . In
          <source>SIGIR '09: Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , pages
          <fpage>75</fpage>
          -
          <lpage>82</lpage>
          , New York, NY, USA,
          <year>2009</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Paul</given-names>
            <surname>McNamee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Christine</given-names>
            <surname>Piatko</surname>
          </string-name>
          , and
          <string-name>
            <given-names>James</given-names>
            <surname>Mayfield</surname>
          </string-name>
          .
          <article-title>JHU/APL at TREC 2002: Experiments in Filtering and Arabic Retrieval</article-title>
          . In
          <source>Proceedings of the Eleventh Text REtrieval Conference (TREC 2002)</source>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>David R. H.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Tim</given-names>
            <surname>Leek</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Richard M.</given-names>
            <surname>Schwartz</surname>
          </string-name>
          .
          <article-title>A hidden Markov model information retrieval system</article-title>
          . In
          <source>SIGIR '99: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , pages
          <fpage>214</fpage>
          -
          <lpage>221</lpage>
          , New York, NY, USA,
          <year>1999</year>
          . ACM.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>