UAIC at GikiCLEF 2009

Adrian Iftene, Andrei-Cristian Prodan, Ion-Cătălin Condrea

UAIC: Faculty of Computer Science, "Alexandru Ioan Cuza" University, Romania
{adiftene, cristian.prodan, catalin.condrea}@info.uaic.ro

Abstract. This year marked UAIC1's first participation in the GikiCLEF competition. For GikiCLEF 2009, systems had to answer geographically challenging topics over the Wikipedia collections, returning Wikipedia document titles as lists of answers. The UAIC team's debut in this year's competition gave us the experience of developing a first system for the GikiCLEF task, while also setting the scene for participation in the coming years. A brief description of our system is given in this paper.

1 Introduction

GikiCLEF2 is an evaluation task under the scope of CLEF. Its aim is to evaluate systems that find Wikipedia entries/documents answering a particular information need which requires geographical reasoning of some sort. GikiCLEF is the successor of the GikiP3 pilot task, which ran in 2008 under GeoCLEF. A system participating in GikiCLEF 2009 received a set of topics in all GikiCLEF languages (Bulgarian, Dutch, English, German, Italian, Norwegian, Portuguese, Romanian and Spanish) and had to produce a list of answers in all languages in which it could find answers. The motivation for this kind of system behaviour is that in a real environment a user prefers to read answers in his native language, but is also happy with answers (answers are titles of Wikipedia entries) in other languages that he knows or even just slightly understands.

The GikiCLEF collections were the Wikipedia collections for all GikiCLEF languages and were available in three formats: HTML, SQL and XML. Participating systems used one of the versions of the collections and had to offer answers to 50 topics prepared by the organizers. In the end, their answers had to point to valid HTML or XML files in the GikiCLEF collection.

The general system architecture is described in Section 2, while Section 3 is concerned with the presentation of results. The last section presents conclusions regarding our participation in GikiCLEF 2009.

1 "Al. I. Cuza" University
2 GikiCLEF: http://www.linguateca.pt/GikiCLEF/
3 GikiP: http://www.linguateca.pt/GikiCLEF/index.php/Main_Page#GikiP_2008_pilot_task

[Figure 1: UAIC system used in GikiCLEF 2009. The diagram shows the initial RO and ES topics going through topic analysis (tokenizing and lemmatization; focus, keyword and NE identification; question classification), the RO and ES Wikipedia collections being pre-processed on a P2P network and indexed with Lucene, Lucene queries retrieving XML titles, and a final NE identification and answer ranking step producing the final answers.]

2 Architecture of the GikiCLEF System

The system contains four main modules that deal with corpus pre-processing, topic analysis, information retrieval and answer ranking (see Figure 1). For the pre-processing part we used a peer-to-peer network: on separate computers we unzip the initial XML files, pre-process them, and afterwards unify them into one common file. The files obtained on the separate computers are then sent to the indexing module. In what follows we give a few details and examples in order to better explain how our system works.

2.1 Corpus Pre-processing

From the collections provided by the organizers we used the XML version. Because many of the tags in the XML files were useless, we decided to eliminate them and to keep only the relevant tags.
This pre-processing was done in two steps: in the first step we extract the relevant tags, and in the second step we eliminate the formatting tags from the content of the tags identified in step 1. The useful tags identified in step 1 were those containing paragraphs, titles, and table rows or columns. In step 2 we eliminate text-formatting tags such as bold, italic, underline, size and color, and also the hyperlink tags. For example, for the file "Active_Directory_3275.xml" from the English XML collection, one of the paragraph tags extracted in the first step was:

Table 1: Example of a Paragraph Extracted after the First Pre-Processing Step
Active Directory (AD) is an implementation of LDAP directory services by Microsoft for use primarily in Windows environments. Its main purpose is to provide central authentication and authorization services for Windows-based computers. ...

After the second step the same paragraph looks as in Table 2.

Table 2: Example of a Paragraph Obtained after the Second Pre-Processing Step

Active Directory (AD) is an implementation of LDAP directory services by Microsoft for use primarily in Windows environments. Its main purpose is to provide central authentication and authorization services for Windows-based computers. ...

In this way we keep only the relevant text in the new XML files.
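For illustration, a minimal Java sketch of the second pre-processing step is given below; the class name, the exact list of formatting tags and the regular expression are assumptions made for this sketch rather than the actual implementation.

import java.util.regex.Pattern;

// Sketch of step 2: strip formatting and hyperlink tags from the content
// kept in step 1, preserving the inner text. The tag list is an assumption.
public class FormattingTagStripper {

    // Opening and closing variants of common formatting/hyperlink tags
    // (bold, italic, underline, emphasis, font size/color, links).
    private static final Pattern FORMATTING_TAGS = Pattern.compile(
            "</?(b|i|u|em|strong|font|span|a)(\\s[^>]*)?>",
            Pattern.CASE_INSENSITIVE);

    public static String strip(String tagContent) {
        return FORMATTING_TAGS.matcher(tagContent).replaceAll("");
    }

    public static void main(String[] args) {
        String paragraph = "<b>Active Directory</b> (AD) is an implementation of "
                + "<a href=\"/wiki/LDAP\">LDAP</a> directory services by Microsoft ...";
        // Prints the paragraph with only its plain text kept.
        System.out.println(strip(paragraph));
    }
}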
2.2 Index Creation

The purpose of this module is to prepare the index necessary for retrieving the relevant snippets of text for every topic. For this task we used the Lucene4 indexing component. The index was created on the basis of the XML files obtained in the previous step. We created one index at document level in which, as fields, we insert the document title and all the relevant text from a given XML file.
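To make the indexing step concrete, a minimal sketch is given below, assuming the Lucene 2.x Java API that was current in 2009; the class name, the index path, the analyzer choice, the "text" field name and the sample entry are assumptions of this sketch, while "title" is the field referred to in Section 2.3.

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

// Sketch of the document-level index: one Lucene document per Wikipedia
// entry, with the title and the relevant text kept as separate fields.
public class WikipediaIndexer {

    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter(new File("giki-index"),
                new StandardAnalyzer(), true,
                IndexWriter.MaxFieldLength.UNLIMITED);

        // In the real system this would loop over all pre-processed XML files.
        addEntry(writer, "Active Directory",
                "Active Directory (AD) is an implementation of LDAP directory "
                + "services by Microsoft for use primarily in Windows environments.");

        writer.optimize();
        writer.close();
    }

    private static void addEntry(IndexWriter writer, String title, String text)
            throws Exception {
        Document doc = new Document();
        doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("text", text, Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
    }
}

Queries of the form described in Section 2.3 (keywords with the "+" and "^" operators, optionally restricted to the "title" field) can then be run against such an index with Lucene's QueryParser and IndexSearcher.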
2.3 Topic Analysis and Lucene Query Creation

This step is mainly concerned with building the Lucene query needed in the retrieval part. Queries are created using the sequences of keywords, the Lucene mandatory operator "+", the relevance (boost) operator "^", and the "title" field. In this manner we obtain a query expression for every question, which is then used in the search phase. In addition, this module also provides the answer type, the question focus and the question type. The topic analyzer performs the following steps (similar to [1]): i) NP-chunking and named entity extraction, ii) identification of the question focus, iii) identification of the answer type, iv) inferring the question type, v) keyword generation, and vi) building of the Lucene query. For example, for the first topic, "GC-2009-01", the output of this module is presented in Table 3. In tag