<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Experiments with N-Gram Prefixes on a Multinomial Language Model versus Lucene's off-the-shelf ranking scheme and Rocchio Query Expansion (TEL@CLEF Monolingual Task)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jorge Machado</string-name>
          <email>jorge.r.machado@ist.utl.pt</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bruno Martins</string-name>
          <email>bruno.martins@ist.utl.pt</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>José Borbinha</string-name>
          <email>jose.borbinha@ist.utl.pt</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Departmento de Engenharia Informática, Technical University of Lisbon</institution>
          ,
          <country country="PT">Portugal</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We describe our participation in the TEL@CLEF task of the CLEF 2009 ad-hoc track, where we measured the retrieval performance of LGTE, an index engine for Geo-Temporal collection which is mostly based on Lucene, together with extensions for query expansion and multinomial language modelling. We experiment an N-Gram stemming model to improve our last year experiments which consisted in combinations of query expansion, Lucene's off-the-shelf ranking scheme and the ranking scheme based on multinomial language modeling. The N-Gram stemming model was based in a linear combination of N-Gram, with n between 2 and 5, using weight factors obtained by learning from last year topics and assessments. The rochio ranking function was also adapted to implement this N-Gram model. Results show that this stemming technique together with query expansion and multinomial language modeling both result in increased performance.</p>
      </abstract>
      <kwd-group>
        <kwd>Language Model</kwd>
        <kwd>Vector Space Model</kwd>
        <kwd>Lucene</kwd>
        <kwd>Rocchio QE</kwd>
        <kwd>Stemming</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        This paper describes the participation of the Technical University of Lisbon at the
TEL@CLEF task. Our experiments aimed at measuring the retrieval performance of
the LGTE1 tool which is implementing the IR service of DIGMAP2, an EU-funded
project which addresses the development of services for virtual digital libraries of
materials related to historical cartography [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. DIGMAP collects bibliographic
metadata from European national libraries and other relevant third-party providers
(e.g. collections with descriptions available through OAI-PMH), aiming to provide
advanced searching and browsing mechanisms that combine thematic, geographic and
temporal aspects. In case of success, the ultimate goal of the project is to become fully
integrated into The European Library. The LGTE is the DIGMAP text retrieval
service which is mostly based on Lucene, together with extensions for using query
expansion and multinomial language modeling. A previous version of the system was
described in the MSc thesis of Machado [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and we are now in the process of
developing extensions for geo-temporal information retrieval [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        Like last year in CLEF, we experimented combinations query expansion, Lucene’s
off-the-shelf ranking scheme and the ranking scheme based on multinomial language
modeling, but this year we include an N-Gram model for degraded collections
proposed by Parapar in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. We adapt this model to our records collections using only
N-Gram prefixes instead of the usual sliding window N-Grams. We also perform
several experiments on how to use this model in Rochio selection formula for query
expansion with encourage results.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2 The experimental environment</title>
      <p>The underlying IR system used in our submissions is based on Lucene3, together
with a multinomial language modeling extension developed at the University of
Amsterdam and a query expansion extension developed by Neil Rubens. The
following subsections detail these components. We adapt our model to use a linear
combination of scores using several N-Gram indexes. We also adapt the ranking
function defined by Rochio to make use of the N-Gram indexes and the weights
assigned to each one of those indexes.</p>
      <sec id="sec-2-1">
        <title>2.1 Lucene’s off-the-shelf retrieval model</title>
        <p>We started with Lucene’s off-the-shelf retrieval model. For a collection D,
document d and query q, the ranking score is given by the formula bellow:
ranking(q, d) = ∑</p>
        <p>tft,q ⋅ idft ⋅ tft,d ⋅ idft ⋅ coordq,d ⋅ weightt
t ∈q normq normd
(1)</p>
        <sec id="sec-2-1-1">
          <title>1 http://code.google.com/p/digmap/wiki/LuceneGeoTemporal 2 http://www.dgmap.eu 3 http://lucene.apache.org where:</title>
          <p>Lucene has been extensively used in previous editions of the CLEF, NTCIR and
TREC joint evaluation experiments.</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>2.2 Lucene extension based on multinomial language modeling</title>
        <p>
          We experimented with a Lucene extension that implements a retrieval scheme
based on estimating a language model (LM) for each document, using the formula
described by Hiemstra [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. This extension was developed at the Informatics Institute
of the University of Amsterdam4. For any given query, it ranks the documents with
respect to the likelihood that the document’s LM generated the query:
ranking(d , q) = P(d | q) ∝ P(d ) ⋅ ∏ P(t | d )
        </p>
        <p>t∈q
P(d | q) = P(d) ⋅ ∏((1- λ) ⋅ P(t | D) + λ ⋅ P(t | d))</p>
        <p>t∈q</p>
        <p>In the formula, d is a document and t is a term in query q. The probabilities are
reduced to rank-equivalent logs of probabilities. To account for data sparseness, the
likelihood P(t|d) is interpolated using Jelinek-Mercer smoothing:</p>
        <p>
          In the formula, D is the collection and λ is a smoothing parameter (in our
experiments set to the default value of 0.15). The model needs to estimate three
probabilities: the prior probability of the document, P(d); the probability of observing
a term in a document, P(t|d) and the probability of observing the term in the
collection, P(t|D). Assuming the query terms to be independent, and using a linear
interpolation of a document model and a collection model to estimate the probability
of a query term, the probabilities can be estimated using maximum likelihood
estimates:
(2)
(3)
(4)
This language modeling approach has been used in past experiments within the
CLEF, NTCIR and TREC joint evaluation campaigns – see for example Ahn et. Al
[
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3 N-Gram ranking scheme</title>
        <p>The N-Grams stemming technique is very used in corpus resultant from OCR
processes because many times the text brings OCR errors. This technique consists in
tokenizing the words with a sliding window into tokens of size N, with N assuming
several sizes. This process is applied both in documents and queries to increase
retrieval performance.</p>
        <p>The original N-Grams stemming does not fit very well in our problem because our
records were not obtained from OCR processes. On other hand using this technique
turns the stemming phase a language independent process, which was our main focus.
For that reason, we used a simplistic approach for the N-Grams model which consist
in suffixes removal starting in character N+1 and use the prefix for indexing purposes.</p>
        <p>
          Recent experiments related in [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] by Parapar demonstrate that using independent
N-Grams indexes, for example from 2 to 5 grams, and combining the individual ranks
in a linear combination can improve the results when we find good parameter values
to weight each independent index. Our objective was to use this technique. We
tokenize our terms in five different ways each of which to create a different inverted
file. We create four files with prefixes of N-Grams (2 to 5 grams) and one file with
the original terms. The formula to calculate the final score is illustrated by the formula
6 introduced in [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
(6)
        </p>
        <p>In formula d is the document α, β, γ, δ and ε are the weights assigned to each
independent score. To implement this feature in Lucene we re-implement the term
scorers of the text models (off-the-shelf and language model) to calculate the score.</p>
        <p>The system was trained through experiments with 2008 AdHoc topics and
relevance judgments. We found a set of optimal parameter values to weight each
inverted file independently. Table 1 shows the optimal values found for each index in
each collection. We found that bi-grams worsen the results so we set their weight to
zero in the three collections.</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4 Rocchio query expansion</title>
        <p>
          The fact that there are frequently occurring spelling variations and synonyms for
any query term degrades the performance of standard techniques for ad-hoc retrieval.
To overcome this problem, we experimented with the method for pseudo feedback
query expansion proposed by Rocchio [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. The Lucene extension from the LucQE
project5 implements this approach. On test data from the 2004 TREC Robust
Retrieval Track, LucQE achieved a MAP score of 0.2433 using Rocchio query
expansion.
        </p>
        <p>Assuming that the top D documents returned for an original query qi are relevant, a
better query qi+1 can be given by the terms resulting from the formula bellow:
qi+1 = α ⋅ qi +
β</p>
        <p>⋅ ∑ termWeight (dr )
| D | dr ∈D
(7)</p>
        <p>In the formula, α and β are tuning parameters. In our experiments, they were set to
the default values of 1.0 and 0.75. The system was trained through experiments with
2008 AdHoc topics and relevance judgments. We found an optimal value of 64 terms
for English topics and 40 terms for French and German topics. The terms were
extracted from the highest ranked documents (i.e. the |D| parameter) from the original
query qi. With the training we obtain optimal values using 7 documents in English
and French topics and 8 documents in German topics.</p>
      </sec>
      <sec id="sec-2-5">
        <title>2.5 N-Grams and Rocchio query expansion</title>
        <p>In order to deal with N-Gram prefix stemming we need to adapt the Rochio formula.
Three techniques were experimented but only the third one improves the results:
• First of all we try to use a list of expansion tokens with a fixed size of tokens
of each inverted file (2,3,4,5 Grams and terms), let’s say the 15 most relevant
tokens of each inverted file. The boosting factors were calculated using the
original formula of Rochio to rank terms independently in the different
indexes (Inverted File for terms and N-Grams from 2 to 5). To smooth the
boosting factor in the expanded query we used the weight of each inverted
file (2,3,4,5 Grams and terms) found in training experiments (see Table 1).</p>
        <p>This doesn’t work.
• Second of all we picked up all terms of each top document, we tokenize them
to obtain the 2,3,4,5 Grams tokens and we calculate the term relevance using
as ranking function of Rochio formula in each inverted file and then we
apply the linear combination introduced in section 2.3. This will give the
rank of the term. The expanded query was build with the 5 projections of the
term, 2-5 Grams tokens and the term, using the ranking calculated with the
linear combination as boosting factor. Didn’t work at all.
• Third and our best approach which really improves the results was in first
place calculate, independently in each inverted file, the score for each</p>
        <sec id="sec-2-5-1">
          <title>5 http://lucene-qe.sourceforge.net/</title>
          <p>possible N-Grams tokens and terms present in top documents. In second
place we order them independently of their source file (2-5Grams or term)
and pick the most relevant ones. We calculate the score using Rochio
formula for the pairs, source inverted file and token, and we smooth it with
the respective weight presented in Table 1 depending on the inverted file.
Finally the tokens were ordered by score ignoring their source file and
finally the highest scored tokens were used. The score was used as boost
factor in final query (e.g.: absolute:information^0.53
index5grams:retrie^0.02, etc).</p>
          <p>Our third experiment method turns weak the tokens from less weighted indexes
like 2-Grams, and 3-Grams. This fact makes that tokens from weak indexes only
were picked if they were very relevant. Expanded queries were mainly composed
by tokens of 4-5 Grams and terms. On other hand we presence that all queries
had tokens from all indexes. With this technique we deal with all indexes in the
same way taking into account that terms from less weighted indexes should be
penalized and putted in the same bag.</p>
        </sec>
      </sec>
      <sec id="sec-2-6">
        <title>2.4 Processing the topics and the document collections</title>
        <p>
          Before the actual indexing, the document collections (i.e. the bibliographic
records) were passed through the following pre-processing operations:
• Field Weighting - The bibliographic records composing the collections from the
TEL@CLEF experiment contain structured information in the form of document
fields such as title or subject. We use the scheme proposed by Robertson et. al [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]
to weight the different document field according to their importance. Instead of
changing the ranking formulas in order to introduce boosting factors, we generate
virtual documents in which the content of some specific fields is repeated. The
combination used in our experiments is based on repeating the title field three
times, the subject field twice and keeping the other document fields unchanged.
• Normalization – The structured documents were converted to unstructured
documents for the process of indexing, removing the XML tags and putting the
element’s contents in separate sentences.
        </p>
        <p>Topic processing was fully automatic and the queries submitted to the IR engine
were generated using all parts of the topics (i.e. title, description and narrative). The
generation of the actual queries from the query topics was based on the following
sequence of processing operations:
• Parsing and Normalisation - All characters were reduced to the lowercase
unaccented equivalents (i.e. “Ö” reduced to “o” and “É” to “e” etc.) in order to
maximise matching.
• Stop Word Removal - Stopword lists were used to remove terms that carry little
meaning and would otherwise introduce noise. The considered stop words came
from the minimized lists distributed with Lucene, containing words such as</p>
        <p>Lucene internally normalizes documents and queries to lower case, also removing
stop-words. However, explicitly introducing these operations when processing the
topics, has the advantage of facilitating the development of more advanced topic
processing (e.g. adding query expansion methods).</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3 The experimental story</title>
      <p>We submitted 12 official runs to the CLEF evaluation process, a total of 4 runs for
each of the languages/collections under consideration in the monolingual task. The
runs were selected from those whose obtain best results with the 2008 topics. The
conditions under test for each of the submitted runs are as follows:</p>
    </sec>
    <sec id="sec-4">
      <title>4 Results</title>
      <sec id="sec-4-1">
        <title>6 http://snowball.tartarus.org/</title>
        <p>improves the results significantly. The weight N-Grams model was better than porter
stemming in all situations.
We present now the complete set of experiments using both text models, vector space
and language model. We combine all possible situations using rochio query expansion
and our different stemming approaches. We demonstrate that these two techniques,
stemming and query expansion, improve the results when used alone and even more
when combined. We demonstrate that the linear combination of N-Grams is many
times better then porter stemming and can be used with rochio query expansion using
our term selection method what improves the results even more. Table 4 resumes the
obtained results in terms of MAP (Mean Average Precision), P@5 (Precision in first 5
results) and P@10 (Precision in first 10 results) for all possible combinations in the
three languages. In French collection the experiment of the rochio query expansion
with porter stemming is worst than using just porter stemming, the same is not true
with the N-Grams thechnique which inclusively outperfoms all other experiments
except language model with porter stemming that is a very strong run, also one of our
best runs in 2008 experiments.
0.6760
0.5720
0.4760
0.3880
0.5080
0.4240</p>
        <p>The obtained results support the hypotheses that using Rocchio query expansion
together with N-Grams weighted model and a ranking scheme based on language
modeling can be beneficial to the CLEF ad-hoc task. The N-Grams prefix stemming
linearly combined using tokens of different grams and terms outperform the Porter
stemming technique in most scenarios especially when the linguistic stemmers are not
appropriate. Using this technique with different text models appear to be independent
from those models if the terms score is used independently in the formulas. Unlike
last year where our experiments result in poor results both in French and German
collections, this year we could obtain very encourage results. Like last year we
presence that multinomial language model is almost equal to vector space model in
majority of situations. On other hand the multinomial language model has the
advantage that we could train it very easily tuning the language model parameters,
which was not our objective in this experiment, so we believe that language model
has potential to return even better results than vector space model.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Porter</surname>
            ,
            <given-names>M. F.</given-names>
          </string-name>
          :
          <article-title>An algorithm for suffix stripping</article-title>
          : In: Sparck Jones,
          <string-name>
            <given-names>K.</given-names>
            &amp;
            <surname>Willett</surname>
          </string-name>
          ,
          <string-name>
            <surname>P</surname>
          </string-name>
          . (eds.), (
          <year>1997</year>
          )
          <article-title>Readings in Information Retrieval</article-title>
          ., pp.
          <fpage>313</fpage>
          -
          <lpage>316</lpage>
          . San Francisco: Morgan Kaufmann. (
          <year>1980</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Hiemstra</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Using Language Models for Information Retrieval:</article-title>
          <source>Ph.D. Thesis</source>
          , Centre for Telematics and Information Technology, University of Twente. (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Rocchio</surname>
            ,
            <given-names>J. J.</given-names>
          </string-name>
          :
          <article-title>Relevance Feedback in Information Retrieval: In: The SMART Retrieval System</article-title>
          .
          <source>Experiments in Automatic Document Processing:</source>
          pp
          <fpage>313</fpage>
          -
          <lpage>323</lpage>
          . Prentice Hall. (
          <year>1971</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Machado</surname>
            , J.: Mitra:
            <given-names>A Metadata</given-names>
          </string-name>
          <string-name>
            <surname>Aware Web Search Engine for Digital Libraries: M.Sc</surname>
          </string-name>
          . Thesis, Departamento de Engenharia Informática, Technical University of Lisbon. (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Robertson</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zaragoza</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Taylor</surname>
          </string-name>
          , M.:
          <article-title>Simple BM25 extension to multiple weighted fields</article-title>
          :
          <source>In Proceedings of the Thirteenth ACM international Conference on information and Knowledge</source>
          Management (Washington,
          <string-name>
            <surname>D.C.</surname>
          </string-name>
          , USA, November
          <volume>08</volume>
          -
          <issue>13</issue>
          ,
          <year>2004</year>
          ).
          <source>CIKM '04. ACM</source>
          , New York, NY,
          <fpage>42</fpage>
          -
          <lpage>49</lpage>
          . (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Ahn</surname>
            ,
            <given-names>D. D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Azzopardi</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Balog</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fissaha</surname>
            ,
            <given-names>A. S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jijkoun</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kamps</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Müller</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>de Rijke</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>Erik</given-names>
            <surname>Tjong</surname>
          </string-name>
          Kim Sang: The University of Amsterdam at TREC 2005:
          <article-title>Working Notes for the 2005 Text Retrieval Conference</article-title>
          . (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Pedrosa</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luzio</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manguinhas</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Martins</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>DIGMAP: A service for searching and browsing old maps</article-title>
          :
          <source>In Proceedings of the 8th ACM/IEEE-CS Joint Conference on Digital Libraries</source>
          (Pittsburgh PA, PA, USA, June 16 - 20,
          <year>2008</year>
          ).
          <source>JCDL '08. ACM</source>
          , New York, NY,
          <fpage>431</fpage>
          -
          <lpage>431</lpage>
          . (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Machado</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martins</surname>
            <given-names>B</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Borbinha</surname>
            <given-names>J</given-names>
          </string-name>
          . (
          <year>2009</year>
          ), “LGTE:
          <article-title>Lucene Extensions for Geo-Temporal Information Retrieval”</article-title>
          ,
          <source>paper will be presented at the European Conference on Information Retrieval</source>
          , at Workshop on Geographic Information on Internet, Toulouse,
          <year>April 2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Parapar</surname>
          </string-name>
          , Javier; Freire, Ana; Barreiro,
          <string-name>
            <surname>Álvaro</surname>
          </string-name>
          (
          <year>2009</year>
          ).
          <article-title>“Revisiting N-gram Based Models for Retrieval in Degraded Large Collections”</article-title>
          ,
          <source>European Conference on Information Retrieval</source>
          , Toulouse, April 2009
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>