TALP at GikiCLEF 2009

Daniel Ferrés and Horacio Rodríguez, TALP Research Center, Software Department, Universitat Politècnica de Catalunya

This paper describes our experiments in Geographical Information Retrieval with the Wikipedia collection, carried out as part of our participation in the GikiCLEF 2009 Multilingual task in English and Spanish. Our system, called gikiTALP, follows a very simple approach: it uses standard Information Retrieval with the Sphinx full-text search engine and some Natural Language Processing techniques, without Geographical Knowledge.

Introduction

GikiCLEF collections

The Wikipedia collections for all GikiCLEF languages are available in three formats: HTML dump, SQL dump, and XML version. Most of the collections are from June 2008. We used the SQL dump version of the English and Spanish collections.

The system architecture has three phases that are performed sequentially: Collection Indexing, Topic Analysis, and Information Retrieval. Collection Indexing was applied to the textual collections with MySQL and the Sphinx full-text engine, using the Wikipedia SQL dumps.

Sphinx is a full-text search engine that provides fast, size-efficient and relevant full-text search functions to other applications. The indexes created with Sphinx do not involve any language processing. Sphinx has two types of weighting functions:
• Phrase rank: based on the length of the longest common subsequence (LCS) of search words between the document body and the query phrase.
• Statistical rank: based on the classic BM25 function, which only takes word frequencies into account.
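As a rough illustration of these two signals, the sketch below computes an LCS length and a textbook Okapi BM25 score over lists of tokens. It is not Sphinx's implementation: the function names, the whitespace tokenization, the `k1` and `b` defaults, and the IDF variant are assumptions made for the example, and Sphinx's internal computation may differ.

```python
import math
from collections import Counter


def lcs_length(query_tokens, doc_tokens):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(doc_tokens) + 1)
    for q in query_tokens:
        curr = [0]
        for j, d in enumerate(doc_tokens, start=1):
            if q == d:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def bm25(query_tokens, doc_tokens, corpus, k1=1.2, b=0.75):
    """Textbook Okapi BM25 score of one document against a query.

    `corpus` is a list of token lists, used for document frequencies
    and for the average document length.
    """
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    tf = Counter(doc_tokens)
    score = 0.0
    for term in set(query_tokens):
        df = sum(1 for d in corpus if term in d)
        if df == 0 or tf[term] == 0:
            continue
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        norm = tf[term] + k1 * (1 - b + b * len(doc_tokens) / avg_len)
        score += idf * tf[term] * (k1 + 1) / norm
    return score


# Hypothetical query and documents, for illustration only.
query = "rivers in spain".split()
doc_a = "the longest rivers found in northern spain".split()
doc_b = "mountains of italy".split()
print(lcs_length(query, doc_a))            # 3: "rivers", "in", "spain" appear in order
print(bm25(query, doc_a, [doc_a, doc_b]))  # positive: all query terms occur in doc_a
print(bm25(query, doc_b, [doc_a, doc_b]))  # 0.0: no query term occurs in doc_b
```

The phrase rank rewards documents that preserve the order of the query words, while BM25 ignores word order and looks only at frequencies.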

We used two types of search modes in Sphinx:
• MATCH ALL: the final weight is a sum of weighted phrase ranks.
• MATCH EXTENDED: the final weight is a sum of weighted phrase ranks and BM25 weight, multiplied by 1000 and rounded to integer.
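The way the two modes combine their signals can be sketched as follows. This is one literal reading of the description above, not Sphinx code: it assumes that each field has a phrase rank and a user-assigned weight, that BM25 is normalized to the range [0, 1], and that the whole sum is scaled by 1000 before rounding. The function names and the field names `title` and `body` are hypothetical.

```python
def match_all_weight(phrase_ranks, field_weights):
    """Sum of per-field phrase ranks, each multiplied by its field weight.

    Fields without an explicit weight default to 1 (an assumption).
    """
    return sum(field_weights.get(name, 1) * rank
               for name, rank in phrase_ranks.items())


def match_extended_weight(phrase_ranks, field_weights, bm25):
    """Weighted phrase ranks plus BM25, scaled by 1000 and rounded.

    `bm25` is assumed to be normalized to [0, 1].
    """
    total = match_all_weight(phrase_ranks, field_weights) + bm25
    return round(total * 1000)


# Hypothetical values for one document.
ranks = {"title": 2, "body": 1}
weights = {"title": 3, "body": 1}
print(match_all_weight(ranks, weights))             # 7
print(match_extended_weight(ranks, weights, 0.25))  # 7250
```

Under this reading the phrase ranks dominate the final weight, and BM25 mainly orders documents whose phrase ranks are equal.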

The Topic Analysis phase extracts relevant keywords (with their analysis) from the topics. These keywords are then used by the Information Retrieval phase. This process extracts lexico-semantic information using the following Natural Language Processing tools: the TnT POS tagger [2] and the WordNet lemmatizer (version 2.0) for English, and FreeLing [1] for Spanish.
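A minimal sketch of this kind of keyword extraction is shown below. It does not call TnT, WordNet, or FreeLing: the tiny stopword set and lemma dictionary are hand-made stand-ins, and the function name `analyze_topic` and the sample topic are hypothetical.

```python
import re

# Toy resources standing in for a real stopword list and lemmatizer.
STOPWORDS = {"which", "in", "the", "of", "are", "is", "a", "an", "that", "with"}
LEMMAS = {"rivers": "river", "cities": "city", "castles": "castle"}


def analyze_topic(topic, use_nlp=True):
    """Turn a topic string into a list of query keywords.

    With use_nlp=False the lowercased tokens are returned unchanged.
    With use_nlp=True stopwords are dropped and the remaining words
    are replaced by their lemmas when the toy dictionary knows them.
    """
    tokens = re.findall(r"\w+", topic.lower())
    if not use_nlp:
        return tokens
    return [LEMMAS.get(t, t) for t in tokens if t not in STOPWORDS]


topic = "Which rivers flow in the cities of Spain?"
print(analyze_topic(topic, use_nlp=False))
# ['which', 'rivers', 'flow', 'in', 'the', 'cities', 'of', 'spain']
print(analyze_topic(topic))
# ['river', 'flow', 'city', 'spain']
```

A real lemmatizer would normally rely on the part-of-speech tag to choose the lemma, which is why a tagger runs first in the toolchain described above.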

In the Information Retrieval phase, retrieval is done with Sphinx and the final results are then filtered: Wikipedia entries without categories are discarded.
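The filtering step can be sketched as below. The `Hit` record and its fields are hypothetical, since the text does not describe how results are represented. The only rule taken from the description is that entries with no categories are dropped.

```python
from dataclasses import dataclass, field


@dataclass
class Hit:
    """One retrieved Wikipedia entry (hypothetical record layout)."""
    title: str
    weight: int
    categories: list = field(default_factory=list)


def filter_hits(hits):
    """Keep only entries with at least one category, preserving order."""
    return [h for h in hits if h.categories]


# Hypothetical search results.
hits = [
    Hit("Ebro", 7250, ["Rivers of Spain"]),
    Hit("Ebro (disambiguation)", 7100),
    Hit("Tagus", 6900, ["Rivers of Spain", "Rivers of Portugal"]),
]
print([h.title for h in filter_hits(hits)])
# ['Ebro', 'Tagus']
```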

Experiments

For the GikiCLEF 2009 evaluation we designed a set of three experiments that apply different baseline configurations (see Table 2) to retrieve Wikipedia entries (answers) for 50 geographically challenging topics.

The three baseline runs were designed by changing two parameters of the system: the Sphinx search mode and the Natural Language Processing techniques applied to the query. The first run (gikiTALP1) does not apply any NLP technique to the query and uses the Sphinx match mode MATCH ALL. The second run (gikiTALP2) filters stopwords, uses the lemmas of the remaining words as the query, and uses the match mode MATCH ALL. The third run (gikiTALP3) filters stopwords, uses the lemmas of the remaining words as the query, and uses the match mode MATCH EXTENDED.

The results of the gikiTALP system at the GikiCLEF 2009 Monolingual English and Spanish task are summarized in Table 3. This table has the following IR measures for each run: number of correct answers (#Correct Answers), Precision, and Score.
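The measures are not defined in this paper. The sketch below assumes the usual GikiCLEF definitions: precision is C/N, the score for one language is C·C/N, and the global score is the sum of the per-language scores, where C is the number of correct answers and N the number of answers returned. The function names and the counts in the example are hypothetical and are not taken from Table 3.

```python
def precision(correct, returned):
    """Fraction of returned answers that are correct (0.0 if none returned)."""
    return correct / returned if returned else 0.0


def language_score(correct, returned):
    """Assumed per-language score: C * C / N."""
    return correct * precision(correct, returned)


def global_score(per_language):
    """Sum of per-language scores.

    `per_language` maps a language code to a (correct, returned) pair.
    """
    return sum(language_score(c, n) for c, n in per_language.values())


# Hypothetical counts, for illustration only.
counts = {"en": (5, 20), "es": (2, 8)}
print(precision(5, 20))       # 0.25
print(language_score(5, 20))  # 1.25
print(global_score(counts))   # 1.75
```

Under this definition a run that returns no answers for a language contributes 0 to the global score, so the global score then equals the score of the remaining language.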

Run gikiTALP1 obtained scores of 0.6684 for English, 0.0280 for Spanish, and 0.6964 globally. Due to an unexpected error, run gikiTALP2 produced no answers for the Spanish topics, so its English and global scores are both 1.3559. Run gikiTALP3 obtained scores of 1.635 for English, 0.2667 for Spanish, and 1.9018 globally.

This is our first approach to a Wikipedia-based retrieval task. We used the Sphinx full-text search engine with limited Natural Language Processing and without Geographical Knowledge. We obtained the best results when we used all the NLP techniques (stopword filtering and lemmas in the queries) together with the Sphinx MATCH EXTENDED mode. As future work we plan to: 1) detect the Expected Answer Type and use WordNet synsets to improve the results, 2) use Geographical Knowledge in the Topic Analysis, and 3) make greater use of the Wikipedia links.

Acknowledgments

This work has been supported by the Spanish Research Dept. (TEXT-MESS, TIN2006-15265C06-05). Daniel Ferrés is supported by a UPC-Recerca grant from Universitat Politècnica de Catalunya (UPC). TALP Research Center is recognized as a Quality Research Group (2001 SGR 00254) by DURSI, the Research Department of the Catalan Government.

[1] Jordi Atserias, Bernardino Casas, Elisabet Comelles, Meritxell González, Lluís Padró, and Muntsa Padró. FreeLing 1.3: Syntactic and Semantic Services in an Open-Source NLP Library. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC'06), pages 48-55, 2006.

[2] Thorsten Brants. TnT - A Statistical Part-Of-Speech Tagger. In Proceedings of the 6th Applied NLP Conference (ANLP-2000), Seattle, WA, United States, 2000.

[3] Diana Santos, Nuno Cardoso, Paula Carvalho, Iustin Dornescu, Sven Hartrumpf, Johannes Leveling, and Yvonne Skalban. Getting Geographical Answers from Wikipedia: the GikiP pilot at CLEF. In Francesca Borri, Alessandro Nardi, and Carol Peters, editors, Working Notes for the CLEF 2008 Workshop, Aarhus, Denmark, September 2008. CLEF 2008 Organizing Committee.