JHU Experiments in Monolingual Farsi Document Retrieval at CLEF 2009

Paul McNamee
JHU Human Language Technology Center of Excellence
paul.mcnamee@jhuapl.edu

Abstract

At CLEF 2009 JHU submitted runs in the ad hoc track for the monolingual Persian evaluation. Variants of character n-gram tokenization provided a 10% relative gain over unnormalized words. A run based on skip n-grams, which permit internal skipped letters, achieved a mean average precision of 0.4938; traditional 5-grams scored 0.4868, and plain words 0.4463.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval

General Terms

Experimentation

Keywords

Farsi document retrieval

1 Introduction

For CLEF 2009 we participated in the ad hoc Persian task, submitting results only for the monolingual condition. As in our experiments at CLEF 2008, these experiments compared different tokenization methods. The JHU HAIRCUT retrieval system was used with a statistical language model similarity metric [2, 5]:

    P(D|Q) ∝ ∏_{t∈Q} ( λ P(t|D) + (1 − λ) P(t|C) )    (1)

Though performance might be improved slightly by optimizing the choice of λ for each tokenization, for simplicity a smoothing constant of λ = 0.5 was used throughout these experiments. Automated relevance feedback was used for all submitted runs: from an initial retrieval pass the 20 top-ranked documents were considered, and the number of expansion terms varied with the method of tokenization. We submitted five runs based on: (a) plain words; (b) words truncated to at most 5 characters; (c) overlapping character n-grams of lengths 4 and 5; and (d) a variant of character n-gram indexing that allows some letters to be skipped. Common to every tokenization method was conversion to lower case, removal of punctuation, and truncation of long numbers to 6 digits. The tokenization methods examined were:

• words: space-delimited tokens.
• trun5: words truncated to at most their first five letters.
• 4-grams: overlapping, word-spanning character 4-grams produced from the stream of words encountered in the document or query.
• 5-grams: character n-grams of length n = 5, created in the same fashion as the 4-grams.
• sk41: skip n-grams of length n = 5 that contain one internal skipped letter. The skipped letter is replaced with a special symbol that marks the position of the deleted letter. Skipgram tokenization of length four for the word kayak would include the regular n-grams kaya and ayak in addition to k•yak, ka•ak, and kay•k.
• sk51: skip n-grams of length n = 6 that contain one internal skipped letter.
• 4-grams+sk41: both regular 4-grams and sk41 skip n-grams.
• 5-grams+sk51: both regular 5-grams and sk51 skip n-grams.

Sketches illustrating these token types and the retrieval model of Equation (1) appear below.
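To make the definitions concrete, the following is a minimal Python sketch of the n-gram and skipgram tokenizers; the function names and the • marker are illustrative choices, not taken from HAIRCUT.

    # Illustrative sketch; names and the marker symbol are not from HAIRCUT.
    def char_ngrams(text, n):
        """Overlapping character n-grams. Passing the whole space-joined
        token stream, rather than single words, yields the word-spanning
        behavior described above."""
        return [text[i:i + n] for i in range(len(text) - n + 1)]

    def skip_ngrams(text, n, marker="\u2022"):
        """Length-n windows with one internal letter replaced by a marker;
        n = 5 corresponds to the sk41 condition, n = 6 to sk51."""
        grams = []
        for w in char_ngrams(text, n):
            for j in range(1, n - 1):  # only internal letters may be skipped
                grams.append(w[:j] + marker + w[j + 1:])
        return grams

    print(char_ngrams("kayak", 4))   # ['kaya', 'ayak']
    print(skip_ngrams("kayak", 5))   # ['k•yak', 'ka•ak', 'kay•k']

This reproduces the kayak example given in the list above.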
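Similarly, the next sketch shows how Equation (1) could be scored over such tokens, reusing the char_ngrams helper above; the toy document and collection are invented purely for illustration, and the smoothing of query terms absent from the collection model is omitted.

    import math
    from collections import Counter

    def lm_score(query_terms, doc_terms, coll_terms, lam=0.5):
        """Log form of Equation (1): sum over query terms of
        log(lam * P(t|D) + (1 - lam) * P(t|C)). Assumes every query
        term occurs in the collection; a real system smooths further."""
        d, c = Counter(doc_terms), Counter(coll_terms)
        score = 0.0
        for t in query_terms:
            p_d = d[t] / len(doc_terms)    # document language model
            p_c = c[t] / len(coll_terms)   # collection (background) model
            score += math.log(lam * p_d + (1 - lam) * p_c)
        return score

    # Invented toy data; lam = 0.5 matches the paper's smoothing constant.
    coll = char_ngrams("persian retrieval experiments in farsi", 4)
    doc = char_ngrams("persian retrieval", 4)
    query = char_ngrams("retrieval", 4)
    print(lm_score(query, doc, coll))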
N-grams are effective at addressing morphological variation, particularly in languages where words have many related surface forms. In recent experiments we have shown that in many languages relative gains of more than 25% over unnormalized words can be realized when n-grams are used [3]. N-grams have also been applied to Middle Eastern languages before: McNamee et al. used them for Arabic retrieval at the TREC-2001 and TREC-2002 evaluations [4], and AleAhmad et al. found character 4-grams to be effective for Farsi [1].

The results of our official runs are presented in Table 1.

Table 1: Monolingual Persian Performance

Terms            Query Fields   RF Terms   MAP      P@5      Rel. Found   Run ID
words            TD             50         0.4463   0.6560   4072         -
trun5            TD             50         0.4511   0.6280   4100         jhufatr5r50td
4-grams          TD             100        0.4853   0.6560   4119         jhufa4r100td
4-grams          TDN            100        0.4825   0.6480   4082         jhufa4r100tdn
5-grams          TD             100        0.4868   0.7000   4121         jhufa5r100td
5-grams          TDN            100        0.4784   0.6840   4029         -
sk41             TD             400        0.4951   0.6760   4102         -
sk51             TD             400        0.4583   0.6480   4072         -
4-grams + sk41   TD             400        0.4938   0.6800   4108         jhufask41r400td
5-grams + sk51   TD             400        0.4597   0.6600   4048         -

Compared to the 2008 evaluation we observe somewhat smaller gains from n-grams, about a 10% relative improvement over plain words. The best submitted condition used character skipgrams, which were also found to be effective in our CLEF 2008 experiments. Despite their computational expense, we like skipgrams for the slightly fuzzier matching they permit, which we believe may help with Farsi morphology: some Farsi morphemes can be either bound or free, so there may or may not be an intervening space character, and there is also extensive derivational compounding in which words from different parts of speech are combined.

2 Conclusions

Based on our experiments with the 50 queries from the CLEF 2008 evaluation, we expected n-gram based techniques to prove effective, and they did: 4-grams, 5-grams, and skipgrams all provided roughly a 10% relative gain over plain words.

References

[1] Abolfazl AleAhmad, Parsia Hakimian, Farzad Mehdikhani, and Farhad Oroumchian. N-gram and Local Context Analysis for Persian Text Retrieval. In Proceedings of the International Symposium on Signal Processing and its Applications, 2007.

[2] Djoerd Hiemstra. Using Language Models for Information Retrieval. PhD thesis, University of Twente, 2001.

[3] Paul McNamee, Charles Nicholas, and James Mayfield. Addressing Morphological Variation in Alphabetic Languages. In SIGIR '09: Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 75–82, New York, NY, USA, 2009. ACM.

[4] Paul McNamee, Christine Piatko, and James Mayfield. JHU/APL at TREC 2002: Experiments in Filtering and Arabic Retrieval. In Proceedings of the Eleventh Text REtrieval Conference (TREC 2002), 2002.

[5] David R. H. Miller, Tim Leek, and Richard M. Schwartz. A Hidden Markov Model Information Retrieval System. In SIGIR '99: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 214–221, New York, NY, USA, 1999. ACM.