TALP at GikiCLEF 2009

Daniel Ferrés and Horacio Rodríguez
TALP Research Center, Software Department
Universitat Politècnica de Catalunya
{dferres,horacio}@lsi.upc.edu

Abstract

This paper describes our experiments in Geographical Information Retrieval with the Wikipedia collections in the context of our participation in the GikiCLEF 2009 Multilingual task in English and Spanish. Our system, called gikiTALP, follows a very simple approach that uses standard Information Retrieval with the Sphinx full-text search engine and some Natural Language Processing techniques, without Geographical Knowledge.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software

General Terms

Design, Performance, Experimentation

Keywords

Information Retrieval, Wikipedia, Geographical Information Retrieval, Natural Language Processing

1 Introduction

In this paper we present the overall architecture of our gikiTALP IR system and briefly describe its main components. We also present the experiments, results, and initial conclusions in the context of the GikiCLEF 2009 Monolingual English and Spanish task.

GikiCLEF 2009 is an evaluation task under the scope of CLEF. Its aim is to evaluate systems that find Wikipedia entries/documents answering a particular information need which requires geographical reasoning of some sort. GikiCLEF is the successor of the GikiP pilot task [3], which ran in 2008 under GeoCLEF. For GikiCLEF, systems must answer geographically challenging topics over the Wikipedia collections, returning Wikipedia document titles as lists of answers in all languages in which they can find answers. The following Wikipedia languages are available in GikiCLEF: Bulgarian, Dutch, English, German, Italian, Norwegian, Portuguese, Romanian, and Spanish.

1.1 GikiCLEF collections

The Wikipedia collections for all GikiCLEF languages are available in three formats: HTML dump, SQL dump, and XML version. Most of the collections are from June 2008. We used the SQL dump version of the English and Spanish collections (see Table 1).

Table 1: Description of the collections we used at GikiCLEF 2009.

Language   #Total      #Pages      #Templates   #Categories   #Images
en         6,587,912   5,255,077   154,788      365,210       812,837
es         714,294     641,852     11,885       60,556        1

2 System Description

The system architecture has three phases that are performed sequentially: Collection Indexing, Topic Analysis, and Information Retrieval.

Collection Indexing has been applied over the textual collections with MySQL and the Sphinx full-text engine (http://www.sphinxsearch.com/), using the Wikipedia SQL dumps. Sphinx is a full-text search engine that provides fast, size-efficient, and relevant full-text search functions to other applications. The indexes created with Sphinx do not involve any language processing. Sphinx has two types of weighting functions:

• Phrase rank: based on the length of the longest common subsequence (LCS) of search words between the document body and the query phrase.

• Statistical rank: based on the classic BM25 function, which only takes word frequencies into account.

We used two Sphinx search modes:

• MATCH ALL: the final weight is a sum of weighted phrase ranks.

• MATCH EXTENDED: the final weight is a sum of weighted phrase ranks and the BM25 weight, multiplied by 1000 and rounded to an integer.

The Topic Analysis phase extracts relevant keywords (with their analysis) from the topics. These keywords are then used by the Document Retrieval phase. This process extracts lexico-semantic information using the following Natural Language Processing tools: the TnT POS tagger [2] and the WordNet lemmatizer (version 2.0) for English, and FreeLing [1] for Spanish. The retrieval is done with Sphinx and the final results are then filtered: Wikipedia entries without categories are discarded.
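The sketch below illustrates the kind of pipeline described above: stopword filtering and lemmatization of a topic, followed by a Sphinx query in either match mode. It is only an illustration under stated assumptions, not the actual GikiCLEF code: NLTK's stopword list and WordNet lemmatizer stand in for TnT, the WordNet 2.0 lemmatizer, and FreeLing, and the index name, searchd address, and result limit are hypothetical.

# Illustrative sketch only; requires the legacy sphinxapi Python client and the
# NLTK 'stopwords' and 'wordnet' data packages. Index name, host, port, and
# result limit are assumptions, not the configuration used in the runs.
import re
import sphinxapi
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def analyze_topic(topic, lang='english'):
    """Lowercase, drop stopwords, and lemmatize the remaining words (runs 2 and 3)."""
    stop = set(stopwords.words(lang))
    lemmatizer = WordNetLemmatizer()
    tokens = [t for t in re.findall(r"\w+", topic.lower()) if t not in stop]
    return ' '.join(lemmatizer.lemmatize(t) for t in tokens)

def search(query, index='wikipedia_en', extended=False):
    """Query Sphinx with MATCH ALL (phrase rank) or MATCH EXTENDED (adds BM25)."""
    client = sphinxapi.SphinxClient()
    client.SetServer('localhost', 9312)          # assumed searchd host/port
    client.SetMatchMode(sphinxapi.SPH_MATCH_EXTENDED if extended
                        else sphinxapi.SPH_MATCH_ALL)
    client.SetLimits(0, 100)                     # assumed per-topic answer cut-off
    result = client.Query(query, index)
    if not result:
        raise RuntimeError(client.GetLastError())
    # Each match carries the document id and its Sphinx weight; category-based
    # filtering of the returned entries would be applied afterwards.
    return [(m['id'], m['weight']) for m in result['matches']]

# Example: run-3 style processing of a GikiCLEF-like topic.
query = analyze_topic("Which Swiss cantons border Germany?")
for doc_id, weight in search(query, extended=True):
    print(doc_id, weight)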
3 Experiments

For the GikiCLEF 2009 evaluation we designed a set of three experiments that consist in applying different baseline configurations (see Table 2) to retrieve Wikipedia entries (answers) for 50 geographically challenging topics.

The three baseline runs were designed by changing two parameters of the system: the Sphinx search mode and the Natural Language Processing techniques applied to the query. The first run (gikiTALP1) does not apply any NLP technique to the query and uses the Sphinx MATCH ALL mode. The second run (gikiTALP2) uses stopword filtering and the lemmas of the remaining words as the query, with the Sphinx MATCH ALL mode. The third run (gikiTALP3) uses stopword filtering and the lemmas of the remaining words as the query, with the Sphinx MATCH EXTENDED mode.

Table 2: Description of the experiments at GikiCLEF 2009.

Automatic Runs   NLP in Query                    Sphinx Match
gikiTALP1        -                               MATCH ALL (phrase rank)
gikiTALP2        lemmas + stopword filtering     MATCH ALL (phrase rank)
gikiTALP3        lemmas + stopword filtering     MATCH EXTENDED (BM25)

4 Results

The results of the gikiTALP system at the GikiCLEF 2009 Monolingual English and Spanish task are summarized in Table 3. The table reports the following IR measures for each run: number of answers, number of correct answers (#Correct answers), Precision, and Score. Run gikiTALP1 obtained scores of 0.6684 for English, 0.0280 for Spanish, and 0.6964 globally. Due to an unexpected error we did not produce answers for the Spanish topics in run 2 (gikiTALP2), so both the English and the global score were 1.3559. The scores of run gikiTALP3 for English, Spanish, and Global were 1.6351, 0.2667, and 1.9018, respectively.

Table 3: TALP gikiTALP results.

Run     Measure            English (EN)   Spanish (ES)   Total
run 1   #Answers           383            143            526
        #Correct answers   16             2              18
        Precision          0.0418         0.0140         0.0342
        Score              0.6684         0.0280         0.6964
run 2   #Answers           295            --             295
        #Correct answers   20             --             20
        Precision          0.0678         --             0.0678
        Score              1.3559         --             1.3559
run 3   #Answers           296            60             356
        #Correct answers   22             4              26
        Precision          0.0743         0.0667         0.0730
        Score              1.6351         0.2667         1.9018

5 Conclusions

This is our first approach to a Wikipedia-based retrieval task. We have used the Sphinx full-text search engine with limited Natural Language Processing and without Geographical Knowledge as baseline algorithms. We obtained the best results when using all the NLP techniques (lemmas in the queries and stopword filtering) together with the Sphinx MATCH EXTENDED mode.

As future work we plan to: 1) detect the Expected Answer Type and use WordNet synsets to improve the results, 2) use Geographical Knowledge in the Topic Analysis, and 3) increase the use of the Wikipedia links.

Acknowledgments

This work has been supported by the Spanish Research Dept. (TEXT-MESS, TIN2006-15265-C06-05). Daniel Ferrés is supported by a UPC-Recerca grant from the Universitat Politècnica de Catalunya (UPC). The TALP Research Center is recognized as a Quality Research Group (2001 SGR 00254) by DURSI, the Research Department of the Catalan Government.
References

[1] Jordi Atserias, Bernardino Casas, Elisabet Comelles, Meritxell González, Lluís Padró, and Muntsa Padró. FreeLing 1.3: Syntactic and Semantic Services in an Open-Source NLP Library. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC'06), pages 48-55, 2006.

[2] Thorsten Brants. TnT: A Statistical Part-of-Speech Tagger. In Proceedings of the 6th Applied NLP Conference (ANLP-2000), Seattle, WA, United States, 2000.

[3] Diana Santos, Nuno Cardoso, Paula Carvalho, Iustin Dornescu, Sven Hartrumpf, Johannes Leveling, and Yvonne Skalban. Getting Geographical Answers from Wikipedia: the GikiP pilot at CLEF. In Francesca Borri, Alessandro Nardi, and Carol Peters, editors, Working Notes for the CLEF 2008 Workshop, Aarhus, Denmark, September 2008. CLEF 2008 Organizing Committee.