    JHU/APL Experiments in Tokenization and Non-Word Translation
                                   Paul McNamee and James Mayfield
                           Johns Hopkins University Applied Physics Laboratory
                                       11100 Johns Hopkins Road
                                     Laurel, MD 20723-6099 USA
                                   {mcnamee, mayfield}@jhuapl.edu

       In the past we have conducted experiments that investigate the benefits and peculiarities
       attendant to alternative methods for tokenization, particularly overlapping character n-grams.
       This year we continued this line of work and report new findings reaffirming that the judicious
       use of n-grams can lead to performance surpassing that of word-based tokenization. In
       particular we examined: the relative performance of n-grams and a popular suffix stemmer; a
       novel form of n-gram indexing that approximates stemming and achieves fast run-time
       performance; various lengths of n-grams; and the use of n-grams for robust translation of
       queries using an aligned parallel text. For the CLEF 2003 evaluation we submitted
       monolingual and bilingual runs for all languages and language pairs, multilingual runs using
       English as a source language, and a first attempt at cross-language spoken document
       retrieval. Our key findings are that shorter n-grams (n=4 and n=5) outperform a popular
       stemmer in non-Romance languages, that direct translation of n-grams is feasible using an
       aligned corpus, that translated 5-grams yield superior performance to words, stems, or 4-
       grams, and that a combination of indexing methods is best of all.


Introduction
In the past we have examined a number of issues pertaining to how documents and queries are represented.
This has been a particular interest in our work with the HAIRCUT retrieval system due to the consistent
success we have observed with the use of overlapping character n-grams. Simple measures that can be
uniformly applied to text processing, regardless of language, reduce developer effort and appear to be at least
as effective as approaches that rely on language-specific processing, and perhaps more so. They are
increasingly used when linguistic resources are unavailable [11][14][15], but in general have not been widely
adopted. We believe that this may be due in part to a belief that n-grams are not as effective as competing
approaches (an idea that we attempt to refute here), and also due to a fear of increased index-time and run-
time costs. We do not focus on the second concern here; few studies addressing the performance implications
of n-gram processing have been undertaken (but see [10]), and we hope this gap is soon filled.

Over this past year we investigated several issues in tokenization. Using the CLEF 2002 and 2003 test suites
as an experimental framework, we attempt to answer the following questions:
             o Should diacritical marks be retained?
             o What length of character n-grams results in the best performance?
             o Does the optimal length vary by language?
             o Are n-grams as effective as stemmed words?
             o Can n-gram processing be sped up?
             o What peculiarities arise when n-grams are used for bilingual retrieval?
             o Are n-grams effective for cross-language spoken document retrieval?

We submitted official runs for the monolingual, bilingual, and multilingual tracks and participated in the first
cross-language spoken document benchmark. For all of our runs we used the HAIRCUT system and a
statistical language model similarity calculation. Many of our official runs were based on n-gram processing,
though we found that better performance can be obtained by combining n-grams and stemmed words.
For our bilingual runs we relied on pre-translation query expansion. We also developed a new
method of translating queries, using n-grams rather than words as the elements to be translated. This method
does not suffer from several key obstacles in dictionary-based translation, such as word lemmatization,
matching of multiple word expressions, and out-of-vocabulary words such as common surnames [12].
Methods
HAIRCUT supports a variety of indexing terms and represents documents using a bag-of-terms model. Our
general method is to process the text for each document, reducing all terms to lower-case. Generally words
were deemed to be white-space delimited tokens in the text; however, we preserve only the first 4 digits of a
number and we truncate any particularly long tokens (those greater than 35 characters in length). Once
words are identified we optionally perform transformations on the words to create indexing terms (e.g.,
stemming). So-called stopwords are retained in our index and the dictionary is created from all words present
in the corpus.
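
To make these rules concrete, the following minimal Python sketch illustrates the word tokenization described
above (the function is illustrative only and is not the actual HAIRCUT code):

    def tokenize(text, max_len=35):
        # Lower-case the text and split on white space; keep only the first
        # 4 digits of purely numeric tokens and truncate very long tokens.
        terms = []
        for token in text.lower().split():
            if token.isdigit():
                token = token[:4]
            terms.append(token[:max_len])
        return terms

    # tokenize("Treaty 19951231 signed in Maastricht")
    #   -> ['treaty', '1995', 'signed', 'in', 'maastricht']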

We have wondered whether diacritical marks have much effect upon retrieval performance; for a long time we
have retained them as part of our ordinary lexical processing, in keeping with a keep-it-simple approach. One
principled argument for retaining diacritical marks is that they have a deconflationary effect when content
words that differ only in diacritics have different meanings. For example, the English words resume (to
continue) and résumé (a summary of one’s professional life) can be distinguished by their diacritics. On the
other hand, such marks are not always applied uniformly, and furthermore, if retained, they might distinguish
two semantically related words. Stephen Tomlinson investigated preservation of diacritics using the CLEF
2002 collection and reported that it was helpful in some cases (Finnish) and harmful in others (Italian and
French) [16]. We found similar results (see Table 1), though the effect is seen only for words, not n-grams.
As there is practically no net effect, we opted to remove such marks routinely; intuitively, we thought that
removing the distinction might improve corpus statistics when n-grams are used. Whenever stemming was
used, words were first stemmed and then any remaining marks were removed; this enabled the stemmer to
take advantage of marks when present. N-grams were produced from the same sequence of words; however,
we attempt to detect sentence boundaries and do not generate n-grams that span them.
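
As an illustration, diacritic removal of the kind described above can be implemented with Unicode
decomposition; the sketch below is one way to do it and is not necessarily the exact normalization we applied:

    import unicodedata

    def strip_diacritics(term):
        # Decompose accented characters (NFKD) and drop the combining marks,
        # so that, e.g., 'résumé' becomes 'resume'. When stemming is used,
        # the term is stemmed first and the marks are stripped afterwards.
        decomposed = unicodedata.normalize('NFKD', term)
        return ''.join(c for c in decomposed if not unicodedata.combining(c))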

         language    DE        EN        ES        FI        FR        IT        NL        SV
         words      -0.0002    0.0028    0.0146   -0.0363    0.0139    0.0076   -0.0005    0.0045
         4-grams    -0.0028   -0.0093    0.0019    0.0075    0.0077   -0.0090    0.0009   -0.0056
Table 1. Absolute difference in mean average precision when accented marks were removed.

HAIRCUT uses gamma compression to reduce the size of the inverted file. Within-document positional
information is not retained, but both document-id and term frequencies are compressed. We also produce a
‘dual file’ that is a document-indexed collection of term-ids and counts. Construction of this data structure
doubles our on-disk space requirements, but confers advantages such as being able to quickly examine
individual document representations. This is particularly useful for automated (local) query expansion. Our
lexicon is stored as a B-tree but nodes are compressed in memory to maximize the number of in-memory
terms subject to physical memory limitations. For the indexes created for CLEF 2003 memory was not an
issue as only O(10^6) distinct terms were found in each collection.
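
For readers unfamiliar with gamma compression, the sketch below shows Elias gamma coding of positive
integers (e.g., document-id gaps and term frequencies); it is illustrative only and is not the HAIRCUT encoder:

    def gamma_encode(n):
        # Elias gamma code: floor(log2 n) leading zeros followed by the
        # binary representation of n; e.g., 9 -> '0001001'.
        assert n >= 1
        binary = bin(n)[2:]
        return '0' * (len(binary) - 1) + binary

    def gamma_decode(bits):
        values, i = [], 0
        while i < len(bits):
            zeros = 0
            while bits[i] == '0':
                zeros += 1
                i += 1
            values.append(int(bits[i:i + zeros + 1], 2))
            i += zeros + 1
        return values

    # gamma_decode(gamma_encode(9) + gamma_encode(1))  ->  [9, 1]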

We use a statistical language model for retrieval akin to those presented by Miller et al. [9] and Hiemstra [2]
with Jelinek-Mercer smoothing [3]. In this model, relevance is defined as

                              P(D | Q) = ∏_{q ∈ Q} [ α P(q | D) + (1 − α) P(q | C) ],


where Q is a query, D is a document, C is the collection as a whole, and α is a smoothing parameter. The
probabilities on the right side of the equation are replaced by their maximum likelihood estimates when
scoring a document. The language model has the advantage that term weights are mediated by the corpus.
Our experience has been that this type of probabilistic model outperforms a vector-based cosine model or a
binary independence model with Okapi BM25 weighting.
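
A minimal scoring routine for this model, assuming precomputed document and collection term frequencies
(the names are illustrative and do not reflect HAIRCUT's internals), might look as follows:

    import math

    def score(query_terms, doc_tf, doc_len, coll_tf, coll_len, alpha):
        # Jelinek-Mercer smoothed language model score, computed in the log
        # domain; doc_tf and coll_tf map a term to its frequency in the
        # document and in the collection, respectively.
        log_p = 0.0
        for q in query_terms:
            p_doc = doc_tf.get(q, 0) / doc_len        # ML estimate of P(q|D)
            p_coll = coll_tf.get(q, 0) / coll_len     # ML estimate of P(q|C)
            smoothed = alpha * p_doc + (1 - alpha) * p_coll
            if smoothed > 0.0:
                log_p += math.log(smoothed)
        return log_p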

For the monolingual, bilingual, and multilingual tasks, all of our submitted runs were based on a combination
of several base runs. Our method for combination was to normalize scores by probability mass and to then
merge documents by score. All of our runs were automatic runs and used only the title and description topic
fields.
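
A sketch of this combination step, assuming each base run supplies non-negative document scores (a
simplification of our actual merging code), is:

    def combine_runs(runs):
        # Each run maps document-id -> score. Normalize each run by its total
        # score mass, sum the normalized scores per document, and rank.
        combined = {}
        for run in runs:
            mass = sum(run.values()) or 1.0
            for doc_id, s in run.items():
                combined[doc_id] = combined.get(doc_id, 0.0) + s / mass
        return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)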
Monolingual Experiments
For our monolingual work we created several indexes for each language using the permissible document
fields appropriate to each collection. Our four basic methods for tokenization were unnormalized words,
stemmed words obtained through the use of the Snowball stemmer, 4-grams, and 5-grams. Information about
each index is shown in Table 2.
      language   #docs     %docs   #rel    %rel           index size (MB) / unique terms (1000s)
                                                      words         stems        4-grams      5-grams
      DE         294805    18.3    1825    18.2      265 / 1188    219 / 860    705 / 219    1109 / 1230
      EN         166754    10.3    1006    10.0      143 / 302     123 / 235    504 / 166     827 / 917
      ES         454041    28.2    2368    23.6      303 / 525     251 / 347    990 / 217    1538 / 1144
      FI          55344     3.4     483     4.8       89 / 977      60 / 520    136 / 138     229 / 709
      FR         129804     8.1     946     9.4       91 / 262      76 / 178    277 / 144     440 / 724
      IT         157558     9.7     809     8.0      115 / 374      92 / 224    329 / 144     529 / 721
      NL         190605    11.8    1577    15.7      161 / 683     147 / 575    469 / 191     759 / 1061
      RU          16715     1.0     151     1.5       25 / 253      25 / 253     44 / 136      86 / 569
      SV         142819     8.9     889     8.8       94 / 505      80 / 361    258 / 162     404 / 863
      total     1608445           10054             1286 MB       1073 MB      3712 MB      5921 MB
Table 2. Summary information about the test collection and index data structures

From the table above it can be seen that the percentage of relevant documents for each subcollection is
closely related to its contribution to the overall number of documents. This would suggest that collection size
might be a useful factor for multilingual merging. We also note that n-gram indexing results in increased disk
storage costs. This cost is driven by the increased number of postings in the inverted file when n-gram
indexing is performed.

Our use of 4-grams and 5-grams as indexing terms represents a departure from previous work using 6-grams
[6]. We conducted tests using various lengths of n-grams for all eight CLEF 2002 languages and found that
choices of n=4 or n=5 performed best. Figure 1 charts performance using six different term indexing
strategies; a value of α=0.5 was used throughout and no relevance feedback was attempted.


[Figure 1 (chart): 'Effect of Differing Tokenization'. Mean average precision by language (DE, EN, ES, FI,
FR, IT, NL, SV) for words and for 3-grams through 7-grams.]


Figure 1. Relative efficacy of different tokenization methods using the CLEF 2002 test set. Note that blind
relevance feedback was not used for these runs.
We determined that n=4 or n=5 is best in all eight languages, though it is hard to distinguish between the
two; 6-grams are clearly not as effective in these languages. There are differences in performance
depending on the value of the smoothing constant α, though we have yet to test whether these differences
are significant or merely represent overtraining on the 2002 test set. The effect of smoothing parameter
selection in language model-based retrieval was investigated by Zhai and Lafferty [17]. We report our
results on the effect of n-gram length, with additional detail and further experiments, in a forthcoming
manuscript [8].

In addition to determining good values for n, we also wanted to see whether n-grams remained an attractive
technique in comparison to stemmed words. Having no substantive experience with stemming, we were
pleased to discover that the Snowball stemmer [13], Porter's extension of his original stemmer to many
languages, provides rules for all of the CLEF 2003 languages. Furthermore, the software provides Java
bindings, so it fit seamlessly into the HAIRCUT system. We decided to compare raw words, stems, 4-grams,
5-grams, and a surrogate technique based on n-grams that might approximate stems. Our n-gram
approximation to stemming picks, for each word, the word-internal n-gram with the lowest document
frequency (i.e., the least common n-gram for that word). As an example, consider the words ‘juggle’,
‘juggles’, and ‘juggler’. The least common 5-gram for the first two is ‘juggl’; however, the least common
5-gram for ‘juggler’ is ‘ggler’¹. The least common 4-gram for all three words is ‘jugg’. We hypothesize that
such low-frequency (high IDF) n-grams tend to span the portions of words that exhibit little morphological
variation.
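
The selection of a single pseudo-stem per word can be sketched as follows, assuming a precomputed table of
n-gram document frequencies (illustrative code, not the production implementation):

    def pseudo_stem(word, doc_freq, n=5):
        # Return the word-internal n-gram with the lowest document frequency
        # (the least common n-gram); short words are returned unchanged.
        if len(word) <= n:
            return word
        grams = [word[i:i + n] for i in range(len(word) - n + 1)]
        return min(grams, key=lambda g: doc_freq.get(g, 0))

With document frequencies as observed in our collections, this yields ‘juggl’ for ‘juggle’ and ‘juggles’ but
‘ggler’ for ‘juggler’.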

This method has the advantage of providing some morphological normalization, but it does not increase the
number of postings in an inverted file. This can be viewed either as a way to approximate stems or a way of
lowering the computational cost of using n-grams. We found that n-grams did outperform stems, and that our
pseudo stems based on n-grams were better than raw words, but not as effective as a rule-based stemmer (see
Figure 2). Details about this work can be found in Mayfield and McNamee [5].



[Figure 2 (horizontal bar chart): mean average precision by language (de, en, es, fi, fr, it, nl, sv) for Words,
Stems, 4-grams, Pseudo-4, and Pseudo-5.]

Figure 2. Comparing words, stemmed words, 4-grams, and approximate stemming (2002 collection).


¹ The Snowball stemmer also fails to transform ‘juggler’ to a canonical form.
On the 2003 test collection we produced base runs for words, stems (using the Snowball stemmer), 4-grams,
and 5-grams. Performance (based on average precision) for each is reported in Table 3. All of these runs used
blind relevance feedback and used an α value of 0.3 with words and stems, or 0.8 with n-grams. None of
these runs were submitted as official runs; instead, we created hybrid runs using multiple methods. In the
past we have found that combination of multiple runs can confer a nearly 10% improvement in
performance. Savoy has also reported improvements from multiple term types [15].
                 DE      EN      ES      FI      FR      IT      NL      RU      SV
     words      0.4175  0.4988  0.4773  0.3355  0.4590  0.4856  0.4615  0.2550  0.3189
     stems      0.4604  0.4679  0.5277  0.4357  0.4780  0.5053  0.4594  0.2550  0.3698
     4-grams    0.5056  0.4692  0.5011  0.5396  0.5244  0.4313  0.4974  0.3276  0.4163
     5-grams    0.4869  0.4610  0.4695  0.5468  0.4895  0.4568  0.4618  0.3271  0.4137
Table 3. Mean average precision for CLEF 2003 base runs (best value per language: words for EN; stems for
ES and IT; 4-grams for DE, FR, NL, RU, and SV; 5-grams for FI).

To produce our official monolingual runs we decided to combine runs based on the Snowball stemmer with
runs using n-grams as indexing terms. Runs named aplmoxxa used 4-grams and stems while runs named
aplmoxxb used 5-grams and stems. However, due to a mistake while creating the scripts used to produce all
of our runs, we inadvertently failed to perform blind relevance feedback for our monolingual submissions.
Routinely we expand queries to 60 terms using additional terms ranked after examining the top 20 and
bottom 75 (of 1000) documents. Failing to use blind relevance feedback had a detrimental effect on our
official runs. Our official monolingual runs are described in Table 4 and corrected scores are presented on
the far right.

     run id      MAP     =Best  >=Median  Rel. Found  Relevant  # topics   MAP'    % change
     aplmodea    0.4852    2       31        1721       1825       56      0.5210    7.39%
     aplmodeb    0.4834    2       27        1732                          0.5050    4.46%
     aplmoena    0.4943                       977       1006       54      0.5040    1.96%
     aplmoenb    0.5127                       980                          0.5074   -1.03%
     aplmoesa    0.4679    3       32        2226       2368       57      0.5311   13.50%
     aplmoesb    0.4538    3       32        2215                          0.5165   13.82%
     aplmofia    0.5514   12       31         475        483       45      0.5571    1.03%
     aplmofib    0.5459    9       31         475                          0.5649    3.49%
     aplmofra    0.5228    9       35         924        946       52      0.5415    3.58%
     aplmofrb    0.5148    9       37         920                          0.5168    0.39%
     aplmoita    0.4620    7       21         776        809       51      0.4784    3.54%
     aplmoitb    0.4744    8       22         771                          0.4982    5.02%
     aplmonla    0.4817    3       42        1485       1577       56      0.5088    5.63%
     aplmonlb    0.4709    2       40        1487                          0.4841    2.86%
     aplmorua    0.3389    2       17         115        151       28      0.3728   10.00%
     aplmorub    0.3282    4       16         113                          0.3610   10.00%
     aplmosva    0.4515    7       36         840        889       53      0.4358   -3.47%
     aplmosvb    0.4498    6       38         838                          0.4310   -4.18%
Table 4. Official results for the monolingual task. The rows for English (aplmoena and aplmoenb) contain
results for comparable, unofficial runs. The two columns at the far right report a corrected value of mean
average precision (MAP') when blind relevance feedback is applied, and the relative difference compared to
the corresponding official run.

It appears that several of our runs would have improved substantially had we correctly used blind relevance
feedback. Relative improvements of more than 5% were seen in German, Russian, and Spanish, although
performance would have dropped slightly in Swedish. The German and Spanish document collections are the
two largest in the entire test suite. We wonder if relevance feedback may be more beneficial when larger
collections are available, a conjecture partially explored by Kwok and Chan [4].
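
For completeness, a schematic version of the expansion step (60 terms drawn after contrasting the top 20 and
bottom 75 of 1000 retrieved documents) is given below; the term-scoring statistic shown is a simple stand-in
rather than the actual HAIRCUT formula:

    def expand_query(query_terms, ranked_docs, doc_terms,
                     n_top=20, n_bottom=75, target_size=60):
        # Count term occurrences in the top-ranked documents, penalize terms
        # that also occur near the bottom of the retrieved set, and append the
        # best-scoring terms until the query holds target_size terms.
        top, bottom = ranked_docs[:n_top], ranked_docs[-n_bottom:]
        gains, penalties = {}, {}
        for d in top:
            for t in doc_terms[d]:
                gains[t] = gains.get(t, 0) + 1
        for d in bottom:
            for t in doc_terms[d]:
                penalties[t] = penalties.get(t, 0) + 1
        ranked_terms = sorted(gains,
                              key=lambda t: gains[t] - penalties.get(t, 0),
                              reverse=True)
        expanded = list(query_terms)
        for t in ranked_terms:
            if len(expanded) >= target_size:
                break
            if t not in expanded:
                expanded.append(t)
        return expanded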

Bilingual Experiments
This year the Bilingual task focused on retrieval involving four language pairs, which notably did not contain
English as a source or target language. This is only significant because of the difficulty in locating direct
translation resources for some language pairs and the fact that many translation resources are available when
English is one of the languages involved. The four language pairs are German to Italian, Finnish to German,
French to Dutch, and Italian to Spanish.
For the 2002 campaign we relied on a single translation resource: bilingual wordlists extracted from parallel
corpora. We built a large alignable collection from a single source, the Official Journal of the EU [18], and
we again used this resource as our only source of translations for 2003. The parallel corpus grew by about
50% this year, so a somewhat larger resource was available. First we describe the construction of the parallel
corpus and the extraction of our bilingual wordlists, then we discuss our overall strategy for bilingual
retrieval, and finally we report on our official results.

Our collection was obtained through a nightly crawl of the Europa web site where we targeted the Official
Journal of the European Union [18]. The Journal is available in each of the E.U. languages and consists
mainly of governmental topics, for example, trade and foreign relations. We had data available from
December 2000 through May 2003. Though focused on European topics, this time span falls 5 to 8 years
after that of the CLEF 2002 document collection. The Journal is published electronically in PDF format and we wanted
to create an aligned collection. We started with 33.4 GB of PDF documents and converted them to plain text
using the publicly available pdftotext software (version 1.0). Once converted to text, documents were split
into pieces using conservative rules for page breaks and paragraph breaks. Many of the documents are
written in outline form, or contain large tables, so this pre-alignment processing is not easy. We ended up
with about 300MB of text, per language, that could be aligned. Alignment was carried out using the
char_align program [1]. In this way we created an aligned collection of approximately 1.2 million passages;
these ‘documents’ were each about 2 or 3 sentences in length.
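
The passage-splitting step can be sketched as follows; the rules shown (form-feed page breaks from pdftotext
and blank-line paragraph breaks, with very short fragments merged into their neighbors) are a simplification
of the conservative rules we actually applied:

    def split_passages(text, min_chars=200):
        # pdftotext separates pages with form feeds ('\f'); split on those and
        # on blank lines, normalize whitespace, and merge short fragments.
        passages = []
        for page in text.split('\f'):
            for block in page.split('\n\n'):
                block = ' '.join(block.split())
                if not block:
                    continue
                if passages and len(passages[-1]) < min_chars:
                    passages[-1] += ' ' + block
                else:
                    passages.append(block)
        return passages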

We performed pairwise alignments between language pairs, for example, between German and Italian. Once
aligned, we indexed each pairwise-aligned collection using the technique described for the CLEF-2003
document collections. Again, we created four indexes per sub-collection, per language – one each of words,
stems, 4-grams and 5-grams. Our goal was to support query term translation, so for each source language
term occurring in at least 4 documents, we attempted to determine a translation of the same token type in the
target language. At this point we should mention that the ‘proper’ translation of an n-gram is decidedly
slippery – clearly there can be no single correct answer. Nonetheless, we simply relied on the large volume
of n-grams to smooth topic translation. For example, the central 5-grams of the English phrase ‘prime
minister’ include ‘ime_m’, ‘me_mi’, and ‘e_min’. The derived ‘translations’ of these English 5-grams into
French are ‘er_mi’, ‘_mini’, and ‘er_mi’, respectively. This seems to work as expected for the French phrase
‘premier ministre’, although the method is not foolproof. Consider n-gram translations from the phrase
‘communist party’ (parti communiste): ‘_commu’ (mmuna), ‘commu’ (munau), ‘ommun’ (munau), ‘mmuni’
(munau), ‘munis’ (munis), ‘unist’ (unist), ‘nist_’ (unist), ‘ist_p’ (ist_p), ‘st_pa’ (1_re_), ‘t_par’ (rtie_),
‘_part’ (_part), ‘party’ (rtie_), and ‘arty_’ (rtie_). The lexical coverage of translation resources is a critical
factor for good CLIR performance, so the fact that almost any n-gram has a ‘translation’ should improve
performance. The direct translation of n-grams may offer a solution to several key obstacles in dictionary-
based translation. Word normalization is not essential since sub-word strings will be compared. Translation
of multiword expressions can be approximated by translation of word-spanning n-grams. Out-of-vocabulary
words, particularly proper nouns, can be partially translated by common n-gram fragments or left
untranslated in closely related languages.
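
The n-grams to be translated are generated directly from the (lower-cased, space-normalized) query text,
with '_' standing in for the blank so that word-spanning n-grams are produced; a minimal sketch:

    def char_ngrams(phrase, n=5):
        # e.g. char_ngrams('prime minister') yields '_prim', 'prime', 'rime_',
        # 'ime_m', 'me_mi', 'e_min', '_mini', ..., 'ster_'
        text = '_' + '_'.join(phrase.lower().split()) + '_'
        return [text[i:i + n] for i in range(len(text) - n + 1)]

Whether leading and trailing boundary marks are attached is an implementation detail; the sketch above
includes them.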

We extracted candidate translations as follows. First, we would take a candidate term as input and identify
documents containing this term in the source language subset of the aligned collection. Up to 5000
documents were considered; we bounded the number for reasons of efficiency and because we felt that
performance was not enhanced appreciably when a greater number of documents was used. If no document
contained this term, then it was left untranslated. Second, we would identify the corresponding documents in
the target language. Third, using a statistic that is similar to mutual information, we would extract a single
potential translation. Our statistic is a function of the frequency of occurrence in the whole collection and the
frequency in the subset of aligned documents. In this way we extracted the single-best target language term
for each source language term in our lexicon (not just the query terms in the CLEF topics). When 5-grams
were used this process took several days.
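
A schematic version of this extraction loop, with a mutual-information-like statistic standing in for the exact
formula we used, is shown below; the data structures (source-side postings, target-side document terms, and
target collection frequencies) are assumed to be precomputed:

    import math

    def translate_term(term, src_postings, tgt_doc_terms, tgt_coll_freq,
                       n_tgt_docs, max_docs=5000):
        # 1. Find aligned passages whose source side contains the term.
        doc_ids = src_postings.get(term, [])[:max_docs]
        if not doc_ids:
            return term                      # no evidence: leave untranslated
        # 2. Collect candidate terms from the corresponding target passages.
        subset_freq = {}
        for d in doc_ids:
            for t in tgt_doc_terms[d]:
                subset_freq[t] = subset_freq.get(t, 0) + 1
        # 3. Score candidates by contrasting subset frequency with collection
        #    frequency and keep the single best candidate.
        def mi_like(t):
            p_subset = subset_freq[t] / len(doc_ids)
            p_coll = tgt_coll_freq.get(t, 1) / n_tgt_docs
            return p_subset * math.log(p_subset / p_coll)
        return max(subset_freq, key=mi_like)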

Table 5 lists examples of translating within the designated language pairs using each type of tokenization.
Mistakes are evident; however, especially when pre-translation expansion is used, the overall effectiveness is
quite high. We believe the redundancy afforded by translating multiple n-grams for each query word also
reduces loss due to erroneous translations. Finally, incorrect translations may still prove helpful if they are a
collocation rather than an actual translation.
                                       DE→IT                    FI→DE                        FR→NL                   IT→ES
            Desired mapping            DE          IT           FI              DE           FR          NL          IT         ES
words       milk                       milch       latte        maidon          milch        lait        melk        latte      leche
            olympic                    olympische  olimpico     olympialaisiin  olympischen  olympique   olympisch   olimpico   olimpico
stems       milk                       milch       latt         maido           milch        lait        melk        latt       lech
            olympic                    olymp       olimp        olymp           olymp        olymp       olympisch   olimp      olimp
4-grams     first 4-gram (milk)        milc        latt         maid            land         lait        melk        latt       lech
            last 4-gram (milk)         ilch        latt         idon            milc         lait        melk        atte       acte
            first 4-gram (olympic)     olym        olim         olym            olym         olym        olym        olim       olim
            last 4-gram (olympic)      sche        rope         siin            n_au         ique        isch        pico       pico
5-grams     first 5-gram (milk)        milch       _latt        maido           milch        _lait       _melk       latte      leche
            last 5-gram (milk)         milch       _latt        aidon           milch        lait_       _melk       latte      leche
            first 5-gram (olympic)     olymp       olimp        olymp           olymp        olymp       _olym       olimp      olimp
            last 5-gram (olympic)      ische       urope        isiin           ichen        pique       pisch       mpico      _olim
Table 5. Examples of term-to-term translation.

We remain convinced that pre-translation query expansion is a tremendously effective method to improve
bilingual performance. Therefore we used each CLEF 2003 document collection as an expansion collection
for the source language queries. Queries were expanded to a list of 60 terms, and then we attempted to
translate each using our corpus-derived resource. In the past we have been interested in using n-grams as
indexing terms; for translation, however, we have worked with bilingual wordlists. This year we decided to create
translingual mappings using the same tokenization in both the source and target languages. Thus for each of
the four language pairs, we created four different lists (for a total of 16): one list per type of indexing term
(i.e., word, stem, 4-gram, or 5-gram). Again using experiments on the CLEF 2002 collection, we determined
that mappings between n-grams were more efficacious than use of word-to-word or stem-to-stem mappings.
Thus different tokenization can be used for initial search, pre-translation expansion, query translation, and
target language retrieval. In testing we found the best results using both n-grams and stems for an initial
source-language search, then we extracted ordinary words as ‘expansion’ terms, and finally we translated
each n-gram contained in the expanded source language word list into n-grams in the target language (or
stems into stems, as appropriate). The process is depicted in Figure 3:

[Figure 3 (diagram): the Italian query 'Ribellioni in Sierra Leone e i ...' is expanded against the Italian
collection into terms such as combattimenti, militare, ribelli, rivoluzionario, guerriglieri, leone, diamanti,
sierra, diamantifero; after tokenization and translation these yield Spanish words (combates, militares,
ribeldes, rivolucionario, leona, diamantes, sierra, ...) and Spanish n-grams (_comb, comba, ebate, _sier,
sierra, erra_, erril, milit, itari, _diam, diama, ...), which are run against the ES documents.]


Figure 3. Illustration of bilingual processing. The initial input to translation is an expanded list of plain
words extracted from a set of documents obtained by retrieval in the source language collection. These words
are optionally tokenized (e.g., to stems or n-grams), and the constituent query terms are then translated using
the mappings derived from the parallel texts. Multiple base runs are combined to create a final ranked list.
The performance of APL’s official bilingual runs is described in Table 6.
          run id         MAP       % mono =Best >=Median Rel. Found            Relevant    # topics
          aplbideita 0.4264         89.88     11       38           789          809          51
          aplbideitb 0.4603         97.03     12       45           780
          aplbifidea 0.3454         71.19     16       39           1554           1825       56
          aplbifideb 0.3430         70.69     16       42           1504
          aplbifrnla 0.4045         83.97     15       33           1493           1577       56
          aplbifrnlb 0.4365         90.62     13       33           1442
          aplbiitesa 0.4242         90.66      5       32           2174           2368       57
          aplbiitesb 0.4261         91.07      4        38          2189
Table 6. Official results for bilingual task.

Our runs named aplbixxyya are bilingual runs translated directly from the source language to the target
language; each was a combination of four base runs using words, stems, 4-grams, or 5-grams, with
(post-translation) relevance feedback. The runs named aplbixxyyb were combined in the same way;
however, the four constituent base runs did not make use of post-translation feedback. When words or
stems were used, a value of 0.3 was used for α; when n-grams were used, the value was 0.5. The base runs
are compared in Figure 4.

[Figure 4 (chart): 'Tokenization and Translation'. Mean average precision by language pair (DE→IT, FI→DE,
FR→NL, IT→ES) for words, stems, 4-grams, and 5-grams, each with and without relevance feedback (noRF),
together with the best APL run.]


Figure 4. Analysis of the base-runs used for bilingual retrieval. The best APL run was achieved in each
instance through run combination.

From observing the data in Table 6 and Figure 4, it would appear that the use of post-translation feedback did
not enhance performance when multiple runs were combined. The two types of runs seemed to perform
similarly in two language pairs (Finnish to German and Italian to Spanish); however, the merged runs
without relevance feedback did better for the German to Italian and French to Dutch runs.

Combination of methods resulted in a gain of between 3% and 10%, depending on the language pair. We have not yet
had the opportunity to retrospectively analyze the contribution to our overall performance of pre-translation
expansion.
Multilingual Experiments
We initially planned to create runs for the multilingual task in exactly the same way as for the bilingual task.
However, we decided to use English as our source language and we had to create translation lists for seven
languages using four tokenization types (a total of 28 mappings). Construction of the 5-gram lists took
longer than expected and so we had to modify our plans for our official submission. We decided to submit a
hybrid run based on words, stems, and 4-grams; merging was again accomplished using normalized scores.
As with the bilingual task, runs ending in ‘a’ denote the use of post-translation relevance feedback, while
runs ending in ‘b’ did not use feedback (see Table 7).

           run id        Task MAP =Best            >=Median     Rel. Found    Relevant    # topics
           aplmuen4a      4      0.2926     3         33          4377         6145          60
           aplmuen4b      4      0.2747     0         34          4419
           aplmuen8a      8      0.2377     4         28          5939          9902        60
           aplmuen8b      8      0.2406     1         41          5820
Table 7. APL results for multilingual task.

Spoken Document Evaluation
This was our first time using the TREC-8 and TREC-9 spoken document dataset. Our submissions were
created in very short order, in one day. We pre-processed the data to have SGML markup similar to that of
the ad hoc TREC collections and then indexed the English text using only 5-grams. The index took 33 minutes to
build. We did not make use of any collection expansion for these runs. Our processing was similar to the
work we did for the bilingual track, except that we used only 5-grams as translation terms and did not use
pre-translation expansion (which was not permitted for ‘primary’ submissions).

The runs we submitted for the spoken document evaluation are summarized in Table 8.
                               run id           Task / Condition MAP
                               aplspenena EN      Monolingual     0.3184
                               aplspfrena FR        Primary       0.1904
                               aplspdeena DE        Primary       0.2206
                               aplspnlena NL       Secondary      0.2269
                               aplspitena  IT      Secondary      0.2046
                               aplspesena ES       Secondary      0.2395
Table 8. Submissions for the Cross-Language Spoken Document Evaluation


Conclusions
For the first time we were able to directly compare words, various lengths of character n-grams, a suffix
stemmer, and an n-gram alternative to stemming, all using the same retrieval engine. We found that n-grams
of shorter lengths (n=4 or n=5) were preferable across the CLEF 2003 languages and that n-grams generally
outperformed use of the Snowball stemmer: 4-grams had an 8% mean relative advantage over stems across
the 9 languages; however, stemming was better in Italian and Spanish (by 17% and 5%, respectively). We
found that the best performance can be obtained using a combination of methods. If emphasis is
placed on accuracy over storage requirements or response time, this approach is reasonable. For bilingual
retrieval we identified a method for direct translation of n-grams instead of word-based translation. Without
the use of relevance feedback, 5-grams outperformed stems by an average of 17% over the four bilingual
pairs, though 4-grams appeared to lose much of their monolingual superiority. When feedback was used, the
gap narrowed substantially.

This work should not be taken as an argument against language resources, but rather as further evidence that
knowledge-light methods can be quite effective when optimized. We are particularly excited about the use
of non-word translation (i.e., using direct n-gram translation) as this appears to have the potential to avoid
several pitfalls that plague dictionary-based translation of words.

We are still analyzing our results from the multilingual and spoken-document tracks and hope to report on
them more fully in our revised paper.
References
[1] K.W. Church, ‘Char_align: A program for aligning parallel texts at the character level’. Proceedings of the 31st
Annual Meeting of the Association for Computational Linguistics, pp. 1-8, 1993.
[2] D. Hiemstra, Using Language Models for Information Retrieval. Ph. D. Thesis, Center for Telematics and
Information Technology, The Netherlands, 2000.
[3] F. Jelinek and R. Mercer, ‘Interpolated Estimation of Markov Source Parameters from Sparse Data’. In Gelsema ES
and Kanal LN eds., Pattern Recognition in Practice, North Holland, pp. 381-402, 1980.
[4] K. L. Kwok and M. Chan, ‘Improving Two-Stage Ad-Hoc Retrieval for Short Queries.’ In the Proceedings of the
21st International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR-98), pp. 250-
256, 1998.
[5] J. Mayfield and P. McNamee, ‘Single N-gram Stemming’, To appear in the Proceedings of the Twenty-Sixth Annual
International ACM SIGIR Conference on Research and Development in Information Retrieval, 2003.
[6] P. McNamee and J. Mayfield, ‘Scalable Multilingual Information Access’. To appear in the Proceedings of the
CLEF 2002 Workshop.
[7] P. McNamee and J. Mayfield, ‘Comparing Cross-Language Query Expansion Techniques by Degrading Translation
Resources’. In the Proceedings of the 25th Annual International Conference on Research and Development in
Information Retrieval, Tampere, Finland, pp. 159-166, 2002.
[8] P. McNamee and J. Mayfield, ‘Character N-gram Tokenization for European Language Text Retrieval’. To appear
in Information Retrieval.
[9] D. Miller, T. Leek, and R. Schwartz, ‘A hidden Markov model information retrieval system’. In Proceedings of the
22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Berkeley,
California, pp. 214-221, 1999.
[10] E. Miller, D. Shen, J. Liu, and C. Nicholas, ‘Performance and Scalability of a Large-Scale N-gram Based
Information Retrieval System.’ In the Journal of Digital Information, 1(5), January 2000.
[11] C. Monz, J. Kamps, and M. de Rijke, ‘The University of Amsterdam at CLEF 2002’, Working Notes of the CLEF
2002 Workshop, pp. 73-84, 2002.
[12] A. Pirkola, T. Hedlund, H. Keskusalo, and K. Järvelin, ‘Dictionary-Based Cross-Language Information Retrieval:
Problems, Methods, and Research Findings’, Information Retrieval, 4:209-230, 2001.
[13] M. Porter, ‘Snowball: A Language for Stemming Algorithms’, http://snowball.tartarus.org/texts/introduction.html,
(visited 13 March 2003).
[14] D. Reidsma, D. Hiemstra, F. de Jong, and W. Kraaij, ‘Cross-language Retrieval at Twente and TNO’, Working
Notes of the CLEF 2002 Workshop, pp. 111-114, 2002.
[15] J. Savoy, ‘Cross-language information retrieval: experiments based on CLEF 2000 corpora’. Information Processing
and Management, 39(1):75-115, 2003.
[16] S. Tomlinson, ‘Experiments in 8 European Languages with Hummingbird SearchServer at CLEF 2002’, Working
Notes of the CLEF 2002 Workshop, pp. 203-214, 2002.
[17] C. Zhai and J. Lafferty, ‘A Study of Smoothing Methods for Language Models Applied to Ad Hoc Information
Retrieval’ Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in
Information Retrieval, pp. 334-342, 2001.
[18] http://europa.eu.int/