    BBK-UFRGS@CLEF2009: Query Expansion of Geographic Place Names


                  Richard Flemmings1, Joana Barros1, André P. Geraldo 2, Viviane P. Moreira 2
                       1
                         Department of Geography, Environment and Development Studies
                    Birkbeck, University of London, Malet Street, London WC1E 7HX, UK
                                 j.barros@bbk.ac.uk, richflemmings@gmail.com
               2
                 Instituto de Informática – Universidade Federal do Rio Grande do Sul (UFRGS)
                          Caixa Postal 15.064 – 91.501-970 – Porto Alegre – RS – Brazil
                                         [apgeraldo, viviane]@inf.ufrgs.br


                                                  Abstract
             For our first participation in CLEF, our aim was to compare plain information
             retrieval with query expansion and emphasis of geographic terms. ANNIE was
             used to recognise geographic entities, which were then expanded using Google's
             Hierarchical List of Geographical Place Names. The idea was that the expansion
             would produce more accurate answers; the results showed the opposite. Our best
             performing run was the baseline. Future work will include further experiments
             and a deeper analysis of our results in order to design a better performing
             strategy.


Categories and Subject Descriptors
H.3.1 [Content Analysis and Indexing]: Linguistic processing. H.3.4 [Systems and Software]: Performance
evaluation
Free Keywords
Experimentation, performance measurement, placenames


1     Introduction
          This paper reports on monolingual question answering experiments performed for GikiCLEF. Our aim
was to compare the performance of straightforward Information Retrieval techniques with query expansion using
geographic terms. Fifty topics were issued in multiple languages, and the GikiCLEF task required them to be
used to query Wikipedia articles. Each topic contains a geographically significant reference. By identifying
and emphasising this geographic reference within a topic, we anticipated that the queries could provide more
focused and accurate answers.


2     Experiments

2.1    Description of Runs and Resources
The text collection used was the Portuguese version of Wikipedia. Some details of this collection are given in
Table 1.
                             Table 1 - Details of the Portuguese data collection
                             Number of documents                      1,630,303
                             Total number of terms                 193,623,264
                             Number of unique terms                   1,120,786
                             Average document length                      1,173
We worked on the html files of the Wikipedia articles. The files were pre-processed to extract the textual
contents and to remove redirects. We also removed stop-words according to the lists available from Snowball1.
The IR system we used was Zettair [4], which is a compact and fast search engine developed by RMIT
University (Australia) distributed under a BSD-style license. Zettair implements a series of IR metrics for
comparing queries and documents. We used Okapi BM25 as some preliminary tests we performed on other data
collections showed it achieved the best results.
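The pre-processing step described above (extracting textual content from the HTML files and removing stop-words) can be sketched as follows. This is an illustrative simplification, not our actual tool; the stop-word set shown is only a tiny sample of the Snowball Portuguese list.

```python
import re

# A small sample of the Snowball Portuguese stop-word list (illustrative only;
# the full lists are available from http://snowball.tartarus.org/).
STOP_WORDS = {"de", "a", "o", "que", "e", "do", "da", "em", "um", "para"}

def preprocess(html):
    """Strip HTML tags, tokenise, and remove stop-words."""
    text = re.sub(r"<[^>]+>", " ", html)       # drop tags, keep textual content
    tokens = re.findall(r"\w+", text.lower())  # simple lowercase tokenisation
    return [t for t in tokens if t not in STOP_WORDS]
```

For instance, `preprocess("<p>A universidade de Hamburgo</p>")` keeps only the content words `universidade` and `hamburgo`.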
The top 15 retrieved documents for each query were taken as answers. We did not perform any filtering or
classification of the answers; answer selection was based solely on the similarity score between the article
and the query.
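As a rough sketch (not Zettair's actual implementation), the Okapi BM25 ranking and the top-15 answer selection can be illustrated as follows; k1 and b are the standard BM25 free parameters:

```python
import math
from collections import Counter

def bm25(query_terms, doc_terms, doc_freq, num_docs, avg_len, k1=1.2, b=0.75):
    """Okapi BM25 similarity between a query and a single document."""
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        if term not in tf or term not in doc_freq:
            continue
        # Inverse document frequency: rarer terms contribute more.
        idf = math.log((num_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
        # Term frequency, normalised by document length.
        norm = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(doc_terms) / avg_len))
        score += idf * norm
    return score

def top_answers(query_terms, docs, n=15):
    """Rank all documents by BM25 score and keep the n best, as in our runs."""
    doc_freq = Counter(t for terms in docs.values() for t in set(terms))
    avg_len = sum(len(t) for t in docs.values()) / len(docs)
    ranked = sorted(docs, key=lambda d: bm25(query_terms, docs[d], doc_freq,
                                             len(docs), avg_len), reverse=True)
    return ranked[:n]
```

Documents sharing no terms with the query score zero and fall to the bottom of the ranking, which is why answer selection here depends only on the similarity score.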
Three different runs were performed. Below we describe each one:
       •    Run 1 - The unchanged topic sentences were used for searching. This was our baseline.
       •    Run 2 – The original topic sentences were submitted to ANNIE [2], the information extraction
            system of GATE (General Architecture for Text Engineering) version 4.0. GATE allows a body of
            text to be input and searched for significant words. ANNIE (A Nearly New Information Extraction
            System) has a built-in gazetteer of place names, which allowed place names within the GikiCLEF
            topic list to be highlighted (“annotated”). All terms identified as geographic entities by
            ANNIE had their weights increased when performing Run 2. In order to emphasise geographic terms,
            we used a modified version of BM25 [3] that was originally designed to promote rare terms.
       •    Run 3 - In order to place a geographical word or phrase into context, and to ensure that a more
            diverse range of geographical words could be searched for, the geographic terms identified by
            ANNIE were submitted to Google's Hierarchical List of Geographical Place Names [1]. By replacing
            individual geographical words within a GikiCLEF topic with multiple words, we anticipated that a
            greater number of documents would be returned. A tool was built to automatically retrieve the
            words associated with the geographic word(s) from each topic: each geographic word was sent as a
            query in English, and the tool returned, in Portuguese, all geographic place names from the
            hierarchical level below it.
For example, Topic 47 contains the word Germany. This topic was submitted using the word Germany, but also
using the geographical level below Germany: the German Bundesländer (Figure 1). Using an OR operator, each
original topic was expanded with the original geographic term along with the terms from the geographical
level below it.
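The OR-expansion described above can be sketched as follows. This is a minimal illustration, not the actual tool: the `children` lookup stands in for the call to Google's hierarchical list, and the place names used in the example are purely illustrative.

```python
def expand_topic(topic, geo_terms, children):
    """Expand each geographic term in a topic with the place names one
    hierarchical level below it, joined by OR (as in Run 3).

    children maps a place name to the list of its sub-regions; in the
    experiments this lookup was answered by Google's Hierarchical List
    of Geographical Place Names.
    """
    expanded = []
    for word in topic.split():
        if word in geo_terms and word in children:
            alternatives = [word] + children[word]
            expanded.append("(" + " OR ".join(alternatives) + ")")
        else:
            expanded.append(word)
    return " ".join(expanded)
```

With an assumed lookup `{"Germany": ["Bavaria", "Hamburg", "Saxony"]}`, the query "Which cities in Germany have universities" becomes "Which cities in (Germany OR Bavaria OR Hamburg OR Saxony) have universities".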




                                    Figure 1 – Example Hierarchical search result



2.2        Results
Our results are presented in Table 2. The scores are very low. The best result was achieved by Run 1 (the
baseline), which returned eight out of 85 correct answers, spread over three topics (12, 23, and 34). Run 2
retrieved three correct answers for the same topics. Run 3 was able to


1
    http://snowball.tartarus.org/
retrieve only one correct answer, but for a different topic (47): "Which cities in Germany have more than
one university?". The expansion of the original query with the term "Hamburgo" was able to retrieve one
correct article.
                                      Table 2 - Results of the three runs


                                    Run ID     Score    #Correct Answers
                                    Run 1      0.088             8
                                    Run 2      0.012             3
                                    Run 3      0.000             1


     Compared to other participants, our best run was ranked in 11th place out of 17 runs.


3    Conclusions
This paper reported on monolingual question answering experiments performed for GikiCLEF. The aim was to
compare traditional Information Retrieval with term weighting and query expansion.
The results have shown that our strategies for improving the results did not produce the expected outcome.
We anticipated that Runs 2 and 3 would return a greater number of correct answers, since the geographical
words within each topic were expanded. However, our experiments did not take into account that many words
within the topics that can be classed as geographic may not be contained in the Google Hierarchical List of
Geographical Place Names. For example, Topic 08 refers to the "Bohemian Forest". The simplistic word
substitution approach used in Run 3 requires further development to handle the vague geographies that exist
within some of the topics. Further work will include a deeper analysis of the experimental results and the
study and development of new techniques for Geographical Question Answering.



Acknowledgements
This work was partially supported by CNPq (Brazil).


References

1.      Google's Hierarchical List of Geographical Place Names. [cited 10-May-2009]; Available from:
        http://code.google.com/intl/en/apis/maps/documentation/geocoding/index.html.
2.      Cunningham, H., et al., Developing Language Processing Components with GATE Version 4 (a User
        Guide). 2007: The University of Sheffield. http://gate.ac.uk/sale/tao/.
3.      Geraldo, A.P. and V.M. Orengo, UFRGS@CLEF2008: Using Association Rules for Cross-Language
        Information Retrieval, in Working Notes of CLEF 2008, F. Borri, A. Nardi, and C. Peters, Editors.
        2008: Aarhus, Denmark.
4.      Zettair. [cited 2007]; Available from: http://www.seg.rmit.edu.au/zettair/.