1 Introduction

Experiments with N-Gram Prefixes on a Multinomial Language Model versus Lucene's off-the-shelf ranking scheme and Rocchio Query Expansion (TEL@CLEF Monolingual Task)

Jorge Machado

jorge.r.machado@ist.utl.pt 0

Bruno Martins

bruno.martins@ist.utl.pt 0

José Borbinha

jose.borbinha@ist.utl.pt 0 0 Departmento de Engenharia Informática, Technical University of Lisbon , Portugal

We describe our participation in the TEL@CLEF task of the CLEF 2009 ad-hoc track, where we measured the retrieval performance of LGTE, an index engine for Geo-Temporal collection which is mostly based on Lucene, together with extensions for query expansion and multinomial language modelling. We experiment an N-Gram stemming model to improve our last year experiments which consisted in combinations of query expansion, Lucene's off-the-shelf ranking scheme and the ranking scheme based on multinomial language modeling. The N-Gram stemming model was based in a linear combination of N-Gram, with n between 2 and 5, using weight factors obtained by learning from last year topics and assessments. The rochio ranking function was also adapted to implement this N-Gram model. Results show that this stemming technique together with query expansion and multinomial language modeling both result in increased performance.

Language Model Vector Space Model Lucene Rocchio QE Stemming

1 Introduction

This paper describes the participation of the Technical University of Lisbon at the TEL@CLEF task. Our experiments aimed at measuring the retrieval performance of the LGTE1 tool which is implementing the IR service of DIGMAP2, an EU-funded project which addresses the development of services for virtual digital libraries of materials related to historical cartography [ 7 ]. DIGMAP collects bibliographic metadata from European national libraries and other relevant third-party providers (e.g. collections with descriptions available through OAI-PMH), aiming to provide advanced searching and browsing mechanisms that combine thematic, geographic and temporal aspects. In case of success, the ultimate goal of the project is to become fully integrated into The European Library. The LGTE is the DIGMAP text retrieval service which is mostly based on Lucene, together with extensions for using query expansion and multinomial language modeling. A previous version of the system was described in the MSc thesis of Machado [ 4 ] and we are now in the process of developing extensions for geo-temporal information retrieval [ 8 ].

Like last year in CLEF, we experimented combinations query expansion, Lucene’s off-the-shelf ranking scheme and the ranking scheme based on multinomial language modeling, but this year we include an N-Gram model for degraded collections proposed by Parapar in [ 9 ]. We adapt this model to our records collections using only N-Gram prefixes instead of the usual sliding window N-Grams. We also perform several experiments on how to use this model in Rochio selection formula for query expansion with encourage results.

2 The experimental environment

The underlying IR system used in our submissions is based on Lucene3, together with a multinomial language modeling extension developed at the University of Amsterdam and a query expansion extension developed by Neil Rubens. The following subsections detail these components. We adapt our model to use a linear combination of scores using several N-Gram indexes. We also adapt the ranking function defined by Rochio to make use of the N-Gram indexes and the weights assigned to each one of those indexes.

2.1 Lucene’s off-the-shelf retrieval model

We started with Lucene’s off-the-shelf retrieval model. For a collection D, document d and query q, the ranking score is given by the formula bellow: ranking(q, d) = ∑

tft,q ⋅ idft ⋅ tft,d ⋅ idft ⋅ coordq,d ⋅ weightt t ∈q normq normd (1)

1 http://code.google.com/p/digmap/wiki/LuceneGeoTemporal 2 http://www.dgmap.eu 3 http://lucene.apache.org where:

Lucene has been extensively used in previous editions of the CLEF, NTCIR and TREC joint evaluation experiments.

2.2 Lucene extension based on multinomial language modeling

We experimented with a Lucene extension that implements a retrieval scheme based on estimating a language model (LM) for each document, using the formula described by Hiemstra [ 2 ]. This extension was developed at the Informatics Institute of the University of Amsterdam4. For any given query, it ranks the documents with respect to the likelihood that the document’s LM generated the query: ranking(d , q) = P(d | q) ∝ P(d ) ⋅ ∏ P(t | d )

t∈q P(d | q) = P(d) ⋅ ∏((1- λ) ⋅ P(t | D) + λ ⋅ P(t | d))

t∈q

In the formula, d is a document and t is a term in query q. The probabilities are reduced to rank-equivalent logs of probabilities. To account for data sparseness, the likelihood P(t|d) is interpolated using Jelinek-Mercer smoothing:

In the formula, D is the collection and λ is a smoothing parameter (in our experiments set to the default value of 0.15). The model needs to estimate three probabilities: the prior probability of the document, P(d); the probability of observing a term in a document, P(t|d) and the probability of observing the term in the collection, P(t|D). Assuming the query terms to be independent, and using a linear interpolation of a document model and a collection model to estimate the probability of a query term, the probabilities can be estimated using maximum likelihood estimates: (2) (3) (4) This language modeling approach has been used in past experiments within the CLEF, NTCIR and TREC joint evaluation campaigns – see for example Ahn et. Al [ 6 ].

2.3 N-Gram ranking scheme

The N-Grams stemming technique is very used in corpus resultant from OCR processes because many times the text brings OCR errors. This technique consists in tokenizing the words with a sliding window into tokens of size N, with N assuming several sizes. This process is applied both in documents and queries to increase retrieval performance.

The original N-Grams stemming does not fit very well in our problem because our records were not obtained from OCR processes. On other hand using this technique turns the stemming phase a language independent process, which was our main focus. For that reason, we used a simplistic approach for the N-Grams model which consist in suffixes removal starting in character N+1 and use the prefix for indexing purposes.

Recent experiments related in [ 9 ] by Parapar demonstrate that using independent N-Grams indexes, for example from 2 to 5 grams, and combining the individual ranks in a linear combination can improve the results when we find good parameter values to weight each independent index. Our objective was to use this technique. We tokenize our terms in five different ways each of which to create a different inverted file. We create four files with prefixes of N-Grams (2 to 5 grams) and one file with the original terms. The formula to calculate the final score is illustrated by the formula 6 introduced in [ 9 ]. (6)

In formula d is the document α, β, γ, δ and ε are the weights assigned to each independent score. To implement this feature in Lucene we re-implement the term scorers of the text models (off-the-shelf and language model) to calculate the score.

The system was trained through experiments with 2008 AdHoc topics and relevance judgments. We found a set of optimal parameter values to weight each inverted file independently. Table 1 shows the optimal values found for each index in each collection. We found that bi-grams worsen the results so we set their weight to zero in the three collections.

2.4 Rocchio query expansion

The fact that there are frequently occurring spelling variations and synonyms for any query term degrades the performance of standard techniques for ad-hoc retrieval. To overcome this problem, we experimented with the method for pseudo feedback query expansion proposed by Rocchio [ 3 ]. The Lucene extension from the LucQE project5 implements this approach. On test data from the 2004 TREC Robust Retrieval Track, LucQE achieved a MAP score of 0.2433 using Rocchio query expansion.

Assuming that the top D documents returned for an original query qi are relevant, a better query qi+1 can be given by the terms resulting from the formula bellow: qi+1 = α ⋅ qi + β

⋅ ∑ termWeight (dr ) | D | dr ∈D (7)

In the formula, α and β are tuning parameters. In our experiments, they were set to the default values of 1.0 and 0.75. The system was trained through experiments with 2008 AdHoc topics and relevance judgments. We found an optimal value of 64 terms for English topics and 40 terms for French and German topics. The terms were extracted from the highest ranked documents (i.e. the |D| parameter) from the original query qi. With the training we obtain optimal values using 7 documents in English and French topics and 8 documents in German topics.

2.5 N-Grams and Rocchio query expansion

In order to deal with N-Gram prefix stemming we need to adapt the Rochio formula. Three techniques were experimented but only the third one improves the results: • First of all we try to use a list of expansion tokens with a fixed size of tokens of each inverted file (2,3,4,5 Grams and terms), let’s say the 15 most relevant tokens of each inverted file. The boosting factors were calculated using the original formula of Rochio to rank terms independently in the different indexes (Inverted File for terms and N-Grams from 2 to 5). To smooth the boosting factor in the expanded query we used the weight of each inverted file (2,3,4,5 Grams and terms) found in training experiments (see Table 1).

This doesn’t work. • Second of all we picked up all terms of each top document, we tokenize them to obtain the 2,3,4,5 Grams tokens and we calculate the term relevance using as ranking function of Rochio formula in each inverted file and then we apply the linear combination introduced in section 2.3. This will give the rank of the term. The expanded query was build with the 5 projections of the term, 2-5 Grams tokens and the term, using the ranking calculated with the linear combination as boosting factor. Didn’t work at all. • Third and our best approach which really improves the results was in first place calculate, independently in each inverted file, the score for each

5 http://lucene-qe.sourceforge.net/

possible N-Grams tokens and terms present in top documents. In second place we order them independently of their source file (2-5Grams or term) and pick the most relevant ones. We calculate the score using Rochio formula for the pairs, source inverted file and token, and we smooth it with the respective weight presented in Table 1 depending on the inverted file. Finally the tokens were ordered by score ignoring their source file and finally the highest scored tokens were used. The score was used as boost factor in final query (e.g.: absolute:information^0.53 index5grams:retrie^0.02, etc).

Our third experiment method turns weak the tokens from less weighted indexes like 2-Grams, and 3-Grams. This fact makes that tokens from weak indexes only were picked if they were very relevant. Expanded queries were mainly composed by tokens of 4-5 Grams and terms. On other hand we presence that all queries had tokens from all indexes. With this technique we deal with all indexes in the same way taking into account that terms from less weighted indexes should be penalized and putted in the same bag.

2.4 Processing the topics and the document collections

Before the actual indexing, the document collections (i.e. the bibliographic records) were passed through the following pre-processing operations: • Field Weighting - The bibliographic records composing the collections from the TEL@CLEF experiment contain structured information in the form of document fields such as title or subject. We use the scheme proposed by Robertson et. al [ 5 ] to weight the different document field according to their importance. Instead of changing the ranking formulas in order to introduce boosting factors, we generate virtual documents in which the content of some specific fields is repeated. The combination used in our experiments is based on repeating the title field three times, the subject field twice and keeping the other document fields unchanged. • Normalization – The structured documents were converted to unstructured documents for the process of indexing, removing the XML tags and putting the element’s contents in separate sentences.

Topic processing was fully automatic and the queries submitted to the IR engine were generated using all parts of the topics (i.e. title, description and narrative). The generation of the actual queries from the query topics was based on the following sequence of processing operations: • Parsing and Normalisation - All characters were reduced to the lowercase unaccented equivalents (i.e. “Ö” reduced to “o” and “É” to “e” etc.) in order to maximise matching. • Stop Word Removal - Stopword lists were used to remove terms that carry little meaning and would otherwise introduce noise. The considered stop words came from the minimized lists distributed with Lucene, containing words such as

Lucene internally normalizes documents and queries to lower case, also removing stop-words. However, explicitly introducing these operations when processing the topics, has the advantage of facilitating the development of more advanced topic processing (e.g. adding query expansion methods).

3 The experimental story

We submitted 12 official runs to the CLEF evaluation process, a total of 4 runs for each of the languages/collections under consideration in the monolingual task. The runs were selected from those whose obtain best results with the 2008 topics. The conditions under test for each of the submitted runs are as follows:

4 Results 6 http://snowball.tartarus.org/

improves the results significantly. The weight N-Grams model was better than porter stemming in all situations. We present now the complete set of experiments using both text models, vector space and language model. We combine all possible situations using rochio query expansion and our different stemming approaches. We demonstrate that these two techniques, stemming and query expansion, improve the results when used alone and even more when combined. We demonstrate that the linear combination of N-Grams is many times better then porter stemming and can be used with rochio query expansion using our term selection method what improves the results even more. Table 4 resumes the obtained results in terms of MAP (Mean Average Precision), P@5 (Precision in first 5 results) and P@10 (Precision in first 10 results) for all possible combinations in the three languages. In French collection the experiment of the rochio query expansion with porter stemming is worst than using just porter stemming, the same is not true with the N-Grams thechnique which inclusively outperfoms all other experiments except language model with porter stemming that is a very strong run, also one of our best runs in 2008 experiments. 0.6760 0.5720 0.4760 0.3880 0.5080 0.4240

The obtained results support the hypotheses that using Rocchio query expansion together with N-Grams weighted model and a ranking scheme based on language modeling can be beneficial to the CLEF ad-hoc task. The N-Grams prefix stemming linearly combined using tokens of different grams and terms outperform the Porter stemming technique in most scenarios especially when the linguistic stemmers are not appropriate. Using this technique with different text models appear to be independent from those models if the terms score is used independently in the formulas. Unlike last year where our experiments result in poor results both in French and German collections, this year we could obtain very encourage results. Like last year we presence that multinomial language model is almost equal to vector space model in majority of situations. On other hand the multinomial language model has the advantage that we could train it very easily tuning the language model parameters, which was not our objective in this experiment, so we believe that language model has potential to return even better results than vector space model.

1. Porter , M. F. : An algorithm for suffix stripping : In: Sparck Jones, K. & Willett , P . (eds.), ( 1997 ) Readings in Information Retrieval ., pp. 313 - 316 . San Francisco: Morgan Kaufmann. ( 1980 )

2. Hiemstra , D. : Using Language Models for Information Retrieval: Ph.D. Thesis , Centre for Telematics and Information Technology, University of Twente. ( 2001 )

3. Rocchio , J. J. : Relevance Feedback in Information Retrieval: In: The SMART Retrieval System . Experiments in Automatic Document Processing: pp 313 - 323 . Prentice Hall. ( 1971 )

4. Machado , J.: Mitra: A Metadata Aware Web Search Engine for Digital Libraries: M.Sc . Thesis, Departamento de Engenharia Informática, Technical University of Lisbon. ( 2008 )

5. Robertson , S. , Zaragoza , H. , and Taylor , M.: Simple BM25 extension to multiple weighted fields : In Proceedings of the Thirteenth ACM international Conference on information and Knowledge Management (Washington, D.C. , USA, November 08 - 13 , 2004 ). CIKM '04. ACM , New York, NY, 42 - 49 . ( 2004 )

6. Ahn , D. D. , Azzopardi , L. , Balog , K. , Fissaha , A. S. , Jijkoun , V. , Kamps , J. , Müller , K. , de Rijke , M. and

Erik

Tjong Kim Sang: The University of Amsterdam at TREC 2005: Working Notes for the 2005 Text Retrieval Conference . ( 2005 )

7. Pedrosa , G. , Luzio , J. , Manguinhas , H. , and Martins , B. : DIGMAP: A service for searching and browsing old maps : In Proceedings of the 8th ACM/IEEE-CS Joint Conference on Digital Libraries (Pittsburgh PA, PA, USA, June 16 - 20, 2008 ). JCDL '08. ACM , New York, NY, 431 - 431 . ( 2008 )

8. Machado

, Martins

, Borbinha

. ( 2009 ), “LGTE: Lucene Extensions for Geo-Temporal Information Retrieval” , paper will be presented at the European Conference on Information Retrieval , at Workshop on Geographic Information on Internet, Toulouse, April 2009 .

9. Parapar , Javier; Freire, Ana; Barreiro, Álvaro ( 2009 ). “Revisiting N-gram Based Models for Retrieval in Degraded Large Collections” , European Conference on Information Retrieval , Toulouse, April 2009