SINAI at CLEF Ad-Hoc Robust Track 2007:
applying the Google search engine for robust
cross-lingual retrieval
Fernando Martínez-Santiago, Arturo Montejo-Ráez, Miguel A. García-Cumbreras
Department of Computer Science, University of Jaén, Jaén, Spain
{dofer,amontejo,magc}@ujaen.es
Abstract
We report on our experimentation for the CLEF Ad-Hoc Robust track concerning
web-based query generation for English and French collections. We have continued
last year's approach, although the model has been modified. Last year we used Google
to expand the original query. This year we do not expand the query; rather, we build
a new query to be executed alongside the original one. Thus, we have to deal with
two lists of relevant documents, one from each query. In order to integrate both lists
of documents we have applied a merging solution based on logistic regression. The
results obtained are discouraging.
Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Infor-
mation Search and Retrieval; H.3.4 Systems and Software; H.3.7 Digital Libraries
General Terms
Algorithms, Languages, Performance, Experimentation
Keywords
Information Retrieval, Multilingual Information Retrieval, Robust Information Retrieval
1 Introduction
Expanding user queries by means of web search engines such as Google has been used successfully
to improve the robustness of retrieval systems over English collections [2]. Given the multilinguality
of the web, we have assumed that this approach could be extended to additional languages, although
the smaller number of non-English web pages could be a major drawback. Therefore, we have
used Google to expand the query in a similar way [3], but instead of replacing the original
query with the expanded query, we have executed both queries (the original and the expanded one).
For each query we obtain a list of relevant documents, so we need to combine the
retrieval results from these two independent lists. This is similar to the so-called
collection fusion problem [8], except that we do not have several collections: there is only one
collection, but two lists of relevant documents. The question is how to calculate the score
of each document in the final merged list. Given a query, in order to integrate the information
available about the relevance of every retrieved document, we have applied a model based on
logistic regression, which has been used successfully in multilingual scenarios [5, 7].
2 Query expansion with the Google search engine
This section describes the process of generating a new query through expansion with the Google
search engine. To this end, we have selected a random sample document. The following fields
correspond to the document with identification number 10.2452/252-ah from the English collection.
pension schemes in europe
find documents that give information about current pension
systems and retirement benefits in any european country.
relevant documents will contain information on current pension
schemes and benefits in single european states. information of
interest includes minimum and maximum ages for retirement and the way
in which the retirement income is calculated. plans for future
pension reform are not relevant.
These fields have been concatenated into one single text, and all contained nouns, noun phrases
and prepositional phrases have been extracted by means of TreeTagger, a tool for annotating text
with part-of-speech and lemma information developed at the Institute for Computational
Linguistics of the University of Stuttgart1.
Once nouns and phrases are identified, they are combined to compose the query, preserving phrases
thanks to Google's query syntax.
documents "pension schemes" benefits retirement information
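The composition step just shown, quoting multi-word phrases so that Google keeps them intact, can be sketched as follows. This is an illustrative reconstruction, not the authors' code, and `compose_google_query` is a hypothetical helper name.

```python
def compose_google_query(terms):
    """Join extracted nouns and phrases into a single Google query,
    wrapping multi-word phrases in double quotes so Google treats
    them as exact phrases (hypothetical helper, not the authors' code)."""
    return " ".join('"%s"' % t if " " in t else t for t in terms)

query = compose_google_query(
    ["documents", "pension schemes", "benefits", "retirement", "information"])
# -> documents "pension schemes" benefits retirement information
```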
This string is passed to Google, and the snippets (the small fragments of text from the
associated web page results) of the top 100 results are joined into one single text from which,
again, phrases are extracted together with their frequencies in order to generate a final expanded
query. The 20 most frequent nouns, noun phrases and prepositional phrases in this generated text
are replicated according to their frequencies in the snippet-based text, normalized to the minimal
frequency among those 20 items (i.e. each frequency is divided by that of the least frequent phrase
among the top ones). The resulting query is shown below:
pension pension pension pension pension pension pension pension
pension pension pension pension pension pension pension pension
pension pension pension benefits benefits benefits benefits benefits
benefits benefits benefits benefits benefits retirement retirement
retirement retirement retirement retirement retirement retirement
retirement retirement retirement age age pensions occupational
occupational occupational occupational schemes schemes schemes
schemes schemes schemes schemes schemes schemes schemes schemes
schemes schemes schemes schemes schemes regulations information
information information information information scheme scheme
disclosure disclosure pension schemes pension schemes pension
schemes pension schemes pension schemes pension schemes pension
schemes pension schemes pension schemes pension schemes pension
schemes pension schemes retirement benefits schemes members members
occupational pension schemes occupational pension schemes
occupational pension schemes retirement benefits retirement benefits
disclosure of information
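The replication step could be sketched as below: each of the 20 most frequent phrases is repeated in proportion to its snippet frequency divided by the smallest frequency among those top phrases. This is our reading of the normalization; the rounding behavior is an assumption, and `expand_query` is a hypothetical name.

```python
from collections import Counter

def expand_query(phrases, top_n=20):
    """Replicate the top_n most frequent phrases proportionally to
    their frequency in the joined snippet text, normalized by the
    minimal frequency among those top phrases (rounding is assumed)."""
    top = Counter(phrases).most_common(top_n)
    min_freq = top[-1][1]
    terms = []
    for phrase, freq in top:
        terms.extend([phrase] * max(1, round(freq / min_freq)))
    return " ".join(terms)

# A phrase seen 6 times is repeated twice when the least frequent
# top phrase was seen 3 times:
print(expand_query(["a"] * 6 + ["b"] * 3 + ["c"] * 3, top_n=3))
# -> a a b c
```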
French documents have been processed in a similar way, but using the OR operator to join the
extracted phrases in the generated Google query. This has been done because of the smaller number
of indexed web pages in the French language: since we expect to recover 100 snippets, we have found
that the OR operator makes this feasible, even though lower-quality texts are then taken into
account when producing the final expanded query.
The next step is to execute both the original and the Google-generated queries on the Lemur
information retrieval system. The collection dataset has been indexed using the Lemur IR system2.
Lemur is a toolkit that supports indexing of large-scale text databases, the construction of simple
language models for documents, queries, or subcollections, and the implementation of retrieval
systems based on language models as well as a variety of other retrieval models. The toolkit is
being developed as part of the Lemur Project, a collaboration between the Computer Science
Department at the University of Massachusetts and the School of Computer Science at Carnegie
Mellon University.
1 Available at http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/
2 http://www.lemurproject.org/
In these experiments we have used Okapi as the weighting function [4].
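Lemur's Okapi weighting follows the BM25 scheme of Robertson and Walker [4]. A minimal sketch of the per-term weight is given below; the parameter values k1 = 1.2 and b = 0.75 are conventional defaults, not values reported in these experiments.

```python
import math

def bm25_term_weight(tf, df, doc_len, avg_doc_len, n_docs, k1=1.2, b=0.75):
    """Okapi BM25 weight of one query term in one document.
    tf: term frequency in the document, df: number of documents
    containing the term, n_docs: collection size. k1 and b are
    conventional defaults, not the values used in the experiments."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5))
    tf_norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * tf_norm
```

A document's score is the sum of this weight over the query terms; in the full Okapi formula, repeating a term in the query, as the expansion step above does, further increases its contribution through the query-term-frequency component.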
Finally, we have to merge both lists of relevant documents. [1, 6] propose a merging approach
based on logistic regression, a statistical methodology for predicting the probability of a binary
outcome variable according to a set of independent explanatory variables. The probability of
relevance of each document Di is estimated according to four parameters: the score and the rank
obtained by using the original query, and the score and the rank obtained by means of the
Google-based query (see equation 1). Based on these estimated probabilities of relevance, the
lists of documents are interleaved to make up a single final list.
Prob[Di is relevant | rank_org_i, rsv_org_i, rank_google_i, rsv_google_i] =
    e^(α + β1·ln(rank_org_i) + β2·rsv_org_i + β3·ln(rank_google_i) + β4·rsv_google_i)
    / (1 + e^(α + β1·ln(rank_org_i) + β2·rsv_org_i + β3·ln(rank_google_i) + β4·rsv_google_i))    (1)
The coefficients α, β1, β2, β3 and β4 are unknown parameters of the model. When fitting
the model, the usual methods for estimating these parameters are maximum likelihood and
iteratively re-weighted least squares.
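Under equation (1), merging reduces to scoring every document with the fitted logistic model and sorting by the estimated probability. A sketch follows, using illustrative (not fitted) coefficients:

```python
import math

def relevance_prob(rank_org, rsv_org, rank_google, rsv_google, coef):
    """Equation (1): logistic model over the log-rank and score (RSV)
    of a document in each of the two result lists. coef holds
    (alpha, beta1, beta2, beta3, beta4); any values used here are
    illustrative, not the fitted coefficients."""
    alpha, b1, b2, b3, b4 = coef
    z = (alpha + b1 * math.log(rank_org) + b2 * rsv_org
         + b3 * math.log(rank_google) + b4 * rsv_google)
    return 1.0 / (1.0 + math.exp(-z))

def merge_lists(features, coef):
    """features: {doc_id: (rank_org, rsv_org, rank_google, rsv_google)}.
    Returns document ids sorted by estimated probability of relevance."""
    probs = {d: relevance_prob(*f, coef) for d, f in features.items()}
    return sorted(probs, key=probs.get, reverse=True)
```

In practice α and the β's would be fitted (e.g. by maximum likelihood) on the training topics' relevance assessments, and a document missing from one of the two lists would need a default rank and score, a detail the paper does not specify.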
Since the underlying model needs to be fitted, a training set (topics and their relevance
assessments) must be available for each monolingual collection. As relevance assessments exist
for English and French, we have carried out the experiments for these two languages only. For
Portuguese we report only the base case (we have not used Google-based queries for that language).
3 Results
As Tables 1 and 2 show, the results are disappointing. On the training data, Google-based queries
improve both the mean average precision (MAP) and the geometric average precision (GMAP) in both
languages, English and French. But this good behavior disappears when we apply our approach to
the test data. Of course, we expect precision on test data to be somewhat lower than on training
data, but we think that the difference in precision is excessive. This issue demands further
analysis.
Approach   Collection   MAP    GMAP
Google     training     0.29   0.12
Base       training     0.26   0.10
Google     test         0.34   0.12
Base       test         0.38   0.14
Table 1: Results for English data. The Google approach is the result obtained by merging original
queries and Google-based queries. Base results are those obtained by means of the original queries only.
Approach   Collection   MAP    GMAP
Google     training     0.28   0.10
Base       training     0.26   0.12
Google     test         0.30   0.11
Base       test         0.31   0.13
Table 2: Results for French data. The Google approach is the result obtained by merging original
queries and Google-based queries. Base results are those obtained by means of the original queries only.
4 Conclusions and future work
We have reported on our experimentation for the CLEF Ad-Hoc Robust Multilingual track involving
web-based query generation for English and French collections. We have studied the generation of a
final list of results by merging the search results obtained from two different queries: the
original one and a new one generated from Google results. Both lists are joined by means of
logistic regression, instead of using an expanded query as we did last year.
The results are disappointing. While the results on training data are very promising, there is no
improvement on test data. This question must be investigated further, and we hope to understand
why the performance is so poor on test data by analyzing, for instance, side effects of the
regression approach.
5 Acknowledgments
This work has been partially supported by a grant from the Spanish Government, project TIMOM
(TIN2006-15265-C06-03), and by project RFC/PP2006/Id_514, granted by the University of Jaén.
References
[1] A. Calvé and J. Savoy. Database merging strategy based on logistic regression. Information
Processing & Management, 36:341-359, 2000.
[2] K. L. Kwok, L. Grunfeld, and D. D. Lewis. TREC-3 ad-hoc, routing retrieval and thresh-
olding experiments using PIRCS. In Proceedings of TREC-3, volume 500, pages 247-255,
Gaithersburg, 1995. NIST.
[3] F. Martínez-Santiago, A. Montejo-Ráez, M. A. García-Cumbreras, and L. A. Ureña-López.
SINAI at CLEF 2006 Ad Hoc Robust Multilingual Track: Query Expansion using the Google
Search Engine. In Evaluation of Multilingual and Multi-modal Information Retrieval: 7th
Workshop of the Cross-Language Evaluation Forum, CLEF 2006, LNCS 4730. Springer,
September 2007.
[4] S. E. Robertson and S. Walker. Okapi/Keenbow at TREC-8. In Proceedings of the 8th Text
Retrieval Conference TREC-8, NIST Special Publication 500-246, pages 151-162, 1999.
[5] F. Martínez-Santiago, L. A. Ureña-López, and M. T. Martín-Valdivia. A merging strategy
proposal: The 2-step retrieval status value method. Information Retrieval, 9(1):71-93, 2006.
[6] J. Savoy. Cross-language information retrieval: experiments based on CLEF 2000 corpora.
Information Processing & Management, 39:75-115, 2003.
[7] J. Savoy. Combining multiple strategies for effective cross-language retrieval. Information
Retrieval, 7(1-2):121-148, 2004.
[8] E. Voorhees, N. K. Gupta, and B. Johnson-Laird. The collection fusion problem. In D. K.
Harman, editor, Proceedings of the 3rd Text Retrieval Conference TREC-3, NIST Special
Publication 500-225, pages 95-104, Gaithersburg, 1995.