SINAI at CLEF Ad-Hoc Robust Track 2007:
applying the Google search engine for robust
cross-lingual retrieval
Fernando Martínez-Santiago, Arturo Montejo-Ráez, Miguel A. García-Cumbreras
Department of Computer Science, University of Jaén, Jaén, Spain
{dofer,amontejo,magc}@ujaen.es
Abstract
We report on our experimentation for the CLEF Ad-Hoc Robust track concerning
web-based query generation for English and French collections. We have continued
last year's approach, although the model has been modified. Last year we used Google
to expand the original query. This year we do not expand the query; rather, we build
a new query to be executed alongside the original one. Thus, we have to deal with
two lists of relevant documents, one from each query. In order to integrate both lists
of documents we have applied a merging solution based on logistic regression. The
results obtained are discouraging.
Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Infor-
mation Search and Retrieval; H.3.4 Systems and Software; H.3.7 Digital Libraries
General Terms
Algorithms, Languages, Performance, Experimentation
Keywords
Information Retrieval, Multilingual Information Retrieval, Robust Information Retrieval
1 Introduction
Expanding user queries by means of web search engines such as Google has been used successfully
to improve the robustness of retrieval systems over English collections [2]. Given the multilinguality
of the web, we have assumed that this approach could be extended to additional languages, although
the smaller number of non-English web pages could be a major drawback. Therefore, we have
used Google to expand the query in a similar way [3], but instead of replacing the original
query with the expanded query, we have executed both queries (the original and the expanded one).
For each query we obtain a list of relevant documents, so we need to combine the
retrieval results from these two independent lists. This is similar to the so-called
collection fusion problem [8], except that we do not have several collections: there is only one
collection, but two lists of relevant documents. The question is how to calculate the score
of each document in the final merged list. Given a query, in order to integrate the information
available about the relevance of every retrieved document, we have applied a model based on
logistic regression, which has been used successfully in multilingual scenarios [5, 7].
2 Query expansion with the Google search engine
This section describes the process of generating a new query through expansion with the Google
search engine. To this end, we have selected a random sample document. The following fields
correspond to the document with identification number 10.2452/252-ah from the English collection.
pension schemes in europe
find documents that give information about current pension
systems and retirement benefits in any european country.
relevant documents will contain information on current pension
schemes and benefits in single european states. information of
interest includes minimum and maximum ages for retirement and the way
in which the retirement income is calculated. plans for future
pension reform are not relevant.
These fields have been concatenated into one single text, and all contained nouns, noun phrases
and prepositional phrases have been extracted by means of TreeTagger, a tool for annotating text
with part-of-speech and lemma information developed at the Institute for Computational
Linguistics of the University of Stuttgart1.
Once nouns and phrases are identified, they are combined to compose the query, preserving phrases
thanks to Google's query syntax.
documents "pension schemes" benefits retirement information
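The composition step just shown, quoting multi-word phrases so that Google keeps them intact, can be sketched as follows. This is an illustrative reconstruction, not the authors' code, and `compose_google_query` is a hypothetical helper name.

```python
def compose_google_query(terms):
    """Join extracted nouns and phrases into a single Google query,
    wrapping multi-word phrases in double quotes so Google treats
    them as exact phrases (hypothetical helper, not the authors' code)."""
    return " ".join('"%s"' % t if " " in t else t for t in terms)

query = compose_google_query(
    ["documents", "pension schemes", "benefits", "retirement", "information"])
# -> documents "pension schemes" benefits retirement information
```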
This string is passed to Google, and the snippets (the small fragments of text from the
associated web page results) of the top 100 results are joined into one single text from which,
again, phrases are extracted together with their frequencies in order to generate a final expanded
query. The 20 most frequent nouns, noun phrases and prepositional phrases in this generated text
are replicated according to their frequencies in the snippet-based text, normalized to the minimal
frequency among those 20 items (i.e. each frequency is divided by that of the least frequent phrase
among the top ones). The resulting query is shown below:
pension pension pension pension pension pension pension pension
pension pension pension pension pension pension pension pension
pension pension pension benefits benefits benefits benefits benefits
benefits benefits benefits benefits benefits retirement retirement
retirement retirement retirement retirement retirement retirement
retirement retirement retirement age age pensions occupational
occupational occupational occupational schemes schemes schemes
schemes schemes schemes schemes schemes schemes schemes schemes
schemes schemes schemes schemes schemes regulations information
information information information information scheme scheme
disclosure disclosure pension schemes pension schemes pension
schemes pension schemes pension schemes pension schemes pension
schemes pension schemes pension schemes pension schemes pension
schemes pension schemes retirement benefits schemes members members
occupational pension schemes occupational pension schemes
occupational pension schemes retirement benefits retirement benefits
disclosure of information
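The replication step could be sketched as below: each of the 20 most frequent phrases is repeated in proportion to its snippet frequency divided by the smallest frequency among those top phrases. This is our reading of the normalization; the rounding behavior is an assumption, and `expand_query` is a hypothetical name.

```python
from collections import Counter

def expand_query(phrases, top_n=20):
    """Replicate the top_n most frequent phrases proportionally to
    their frequency in the joined snippet text, normalized by the
    minimal frequency among those top phrases (rounding is assumed)."""
    top = Counter(phrases).most_common(top_n)
    min_freq = top[-1][1]
    terms = []
    for phrase, freq in top:
        terms.extend([phrase] * max(1, round(freq / min_freq)))
    return " ".join(terms)

# A phrase seen 6 times is repeated twice when the least frequent
# top phrase was seen 3 times:
print(expand_query(["a"] * 6 + ["b"] * 3 + ["c"] * 3, top_n=3))
# -> a a b c
```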
French documents have been processed in a similar way, but using the OR operator to join the
extracted phrases in the generated Google query. This has been done because of the smaller number
of indexed web pages in the French language: since we expect to recover 100 snippets, we have found
that the OR operator makes this feasible, even though lower-quality texts are then taken into
account when producing the final expanded query.
The next step is to execute both the original and the Google-generated queries on the Lemur
information retrieval system. The collection dataset has been indexed using the Lemur IR system2.
Lemur is a toolkit that supports indexing of large-scale text databases, the construction of simple
language models for documents, queries, or subcollections, and the implementation of retrieval
systems based on language models as well as a variety of other retrieval models. The toolkit is
being developed as part of the Lemur Project, a collaboration between the Computer Science
Department at the University of Massachusetts and the School of Computer Science at Carnegie
Mellon University.
1 Available at http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/
2 http://www.lemurproject.org/
In these experiments we have used Okapi as the weighting function [4].
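Lemur's Okapi weighting follows the BM25 scheme of Robertson and Walker [4]. A minimal sketch of the per-term weight is given below; the parameter values k1 = 1.2 and b = 0.75 are conventional defaults, not values reported in these experiments.

```python
import math

def bm25_term_weight(tf, df, doc_len, avg_doc_len, n_docs, k1=1.2, b=0.75):
    """Okapi BM25 weight of one query term in one document.
    tf: term frequency in the document, df: number of documents
    containing the term, n_docs: collection size. k1 and b are
    conventional defaults, not the values used in the experiments."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5))
    tf_norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * tf_norm
```

A document's score is the sum of this weight over the query terms; in the full Okapi formula, repeating a term in the query, as the expansion step above does, further increases its contribution through the query-term-frequency component.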
Finally, we have to merge both lists of relevant documents. [1, 6] propose a merging approach
based on logistic regression, a statistical methodology for predicting the probability of a binary
outcome variable according to a set of independent explanatory variables. The probability of
relevance of each document Di is estimated according to four parameters: the score and the rank
obtained by using the original query, and the score and the rank obtained by means of the
Google-based query (see equation 1). Based on these estimated probabilities of relevance, the
lists of documents are interleaved to make up a single final list.
Prob[Di is relevant | rank_org_i, rsv_org_i, rank_google_i, rsv_google_i] =
    e^(α + β1·ln(rank_org_i) + β2·rsv_org_i + β3·ln(rank_google_i) + β4·rsv_google_i)
    / (1 + e^(α + β1·ln(rank_org_i) + β2·rsv_org_i + β3·ln(rank_google_i) + β4·rsv_google_i))    (1)
The coefficients α, β1, β2, β3 and β4 are unknown parameters of the model. When fitting
the model, the usual methods for estimating these parameters are maximum likelihood and
iteratively re-weighted least squares.
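Under equation (1), merging reduces to scoring every document with the fitted logistic model and sorting by the estimated probability. A sketch follows, using illustrative (not fitted) coefficients:

```python
import math

def relevance_prob(rank_org, rsv_org, rank_google, rsv_google, coef):
    """Equation (1): logistic model over the log-rank and score (RSV)
    of a document in each of the two result lists. coef holds
    (alpha, beta1, beta2, beta3, beta4); any values used here are
    illustrative, not the fitted coefficients."""
    alpha, b1, b2, b3, b4 = coef
    z = (alpha + b1 * math.log(rank_org) + b2 * rsv_org
         + b3 * math.log(rank_google) + b4 * rsv_google)
    return 1.0 / (1.0 + math.exp(-z))

def merge_lists(features, coef):
    """features: {doc_id: (rank_org, rsv_org, rank_google, rsv_google)}.
    Returns document ids sorted by estimated probability of relevance."""
    probs = {d: relevance_prob(*f, coef) for d, f in features.items()}
    return sorted(probs, key=probs.get, reverse=True)
```

In practice α and the β's would be fitted (e.g. by maximum likelihood) on the training topics' relevance assessments, and a document missing from one of the two lists would need a default rank and score, a detail the paper does not specify.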
Since the underlying model needs to be fitted, a training set (topics and their relevance
assessments) must be available for each monolingual collection. As relevance assessments exist
for English and French, we have carried out the experiments for these two languages only. For
Portuguese we report only the base case (we have not used Google-based queries for that language).
3 Results
As Tables 1 and 2 show, the results are disappointing. On the training data, Google-based queries
improve both the mean average precision (MAP) and the geometric average precision (GMAP) in both
languages, English and French. But this good behavior disappears when we apply our approach to
the test data. Of course, we expect precision on test data to be somewhat lower than on training
data, but we think that the difference in precision is excessive. This issue demands further
analysis.
Approach   Collection   MAP    GMAP
Google     training     0.29   0.12
Base       training     0.26   0.10
Google     test         0.34   0.12
Base       test         0.38   0.14
Table 1: Results for English data. The Google approach is the result obtained by merging original
queries and Google-based queries. Base results are those obtained by means of the original queries only.
Approach   Collection   MAP    GMAP
Google     training     0.28   0.10
Base       training     0.26   0.12
Google     test         0.30   0.11
Base       test         0.31   0.13
Table 2: Results for French data. The Google approach is the result obtained by merging original
queries and Google-based queries. Base results are those obtained by means of the original queries only.
4 Conclusions and future work
We have reported on our experimentation for the CLEF Ad-Hoc Robust Multilingual track involving
web-based query generation for English and French collections. We have studied the generation of a
final list of results by merging the search results obtained from two different queries: the
original one and a new one generated from Google results. Both lists are joined by means of
logistic regression, instead of using an expanded query as we did last year.
The results are disappointing. While the results on training data are very promising, there is no
improvement on test data. This question must be investigated further, and we hope to understand
why the performance is so poor on test data by analyzing, for instance, side effects of the
regression approach.
5 Acknowledgments
This work has been partially supported by a grant from the Spanish Government, project TIMOM
(TIN2006-15265-C06-03), and by project RFC/PP2006/Id_514, granted by the University of Jaén.
References
[1] A. Calvé and J. Savoy. Database merging strategy based on logistic regression. Information
Processing & Management, 36:341-359, 2000.
[2] K. L. Kwok, L. Grunfeld, and D. D. Lewis. TREC-3 ad-hoc, routing retrieval and thresh-
olding experiments using PIRCS. In Proceedings of TREC-3, volume 500, pages 247-255,
Gaithersburg, 1995. NIST.
[3] F. Martínez-Santiago, A. Montejo-Ráez, M. A. García-Cumbreras, and L. A. Ureña-López.
SINAI at CLEF 2006 Ad Hoc Robust Multilingual Track: Query Expansion using the Google
Search Engine. In Evaluation of Multilingual and Multi-modal Information Retrieval: 7th
Workshop of the Cross-Language Evaluation Forum, CLEF 2006, LNCS 4730. Springer,
September 2007.
[4] S. E. Robertson and S. Walker. Okapi/Keenbow at TREC-8. In Proceedings of the 8th Text
Retrieval Conference TREC-8, NIST Special Publication 500-246, pages 151-162, 1999.
[5] F. Martínez-Santiago, L. A. Ureña-López, and M. T. Martín-Valdivia. A merging strategy
proposal: The 2-step retrieval status value method. Information Retrieval, 9(1):71-93, 2006.
[6] J. Savoy. Cross-language information retrieval: experiments based on CLEF 2000 corpora.
Information Processing & Management, 39:75-115, 2003.
[7] J. Savoy. Combining multiple strategies for effective cross-language retrieval. Information
Retrieval, 7(1-2):121-148, 2004.
[8] E. Voorhees, N. K. Gupta, and B. Johnson-Laird. The collection fusion problem. In D. K.
Harman, editor, Proceedings of the 3rd Text Retrieval Conference TREC-3, NIST Special
Publication 500-225, pages 95-104, Gaithersburg, 1995.