University of Hagen at GeoCLEF 2008: Combining IR and QA for Geographic Information Retrieval

Johannes Leveling and Sven Hartrumpf
Intelligent Information and Communication Systems (IICS)
University of Hagen (FernUniversität in Hagen)
58084 Hagen, Germany
firstname.lastname@fernuni-hagen.de

Abstract

This paper describes the participation of GIRSA at GeoCLEF 2008, the geographic information retrieval task at CLEF. GIRSA is a modified and improved variant of the system which participated at GeoCLEF 2007. It combines results retrieved with methods from information retrieval (IR) on geographically annotated data and question answering (QA) employing query decomposition. For the monolingual German experiments, several parameter settings were varied: using a single index or a separate index for content and geographic annotation, using complex term weighting, adding location names from the narrative part of the topics, and merging results from IR and QA. The best mean average precision (MAP) was obtained by combining IR and QA results (0.2608 MAP).

For bilingual (English-German and Portuguese-German) experiments, topics were translated via various machine translation web services: Applied Language Solutions, Google Translate, and Promt Online Translator. Performance for these experiments is generally lower than for monolingual experiments. For both source languages, Google Translate seems to return the best translations. For English topics, 60% (0.1571 MAP) of the maximum MAP for monolingual German experiments is achieved. For bilingual Portuguese-German experiments, 80% (0.2085 MAP) of the maximum MAP for monolingual German experiments is achieved.

Categories and Subject Descriptors

H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing—Indexing methods; Linguistic processing; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Query formulation; Search process; H.3.4 [Information Storage and Retrieval]: Systems and Software—Performance evaluation (efficiency and effectiveness)

General Terms

Experimentation, Measurement, Performance

Keywords

Geographic Information Retrieval, Question Answering, Cross-language Information Retrieval

1 Introduction

GeoCLEF is the geographic information retrieval (GIR) task at CLEF, the cross-language evaluation campaign. In recent years, we have developed GIRSA (Geographic Information Retrieval by Semantic Annotation), a system for exploring novel approaches to GIR. GIRSA supports methods to improve precision (e.g. annotation of metonymic location names [5]) and methods to improve recall (e.g. normalization of location name synsets [4]). For GeoCLEF 2008, the major improvement lies in the combination of results from information retrieval (IR) on geographically annotated documents with methods from question answering (QA).

2 System Description

GIRSA is a system for the evaluation of novel indexing and retrieval methods for GIR. Basically, the GIRSA setup introduced at GeoCLEF 2007 is used for the GIR experiments. This setup involves the identification and normalization of location indicators, i.e. text segments from which a geographic scope can be inferred. Location adjectives, names for inhabitants of a place, geographic codes, orthographic variants, acronyms, and abbreviations are mapped to location names.
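The normalization of location indicators can be pictured as a lookup step that reduces a surface form to a canonical location name. The following is only a minimal sketch under assumed interfaces, not GIRSA's actual implementation; the function name normalize_location and the toy indicator table are hypothetical and merely illustrate how adjectives, inhabitant names, variants, and abbreviations might be mapped to location names.

```python
# Minimal, illustrative sketch of location-indicator normalization
# (hypothetical names and data; not taken from GIRSA).
INDICATOR_TABLE = {
    "genfer": "Genf",        # location adjective (e.g. "Genfer See")
    "geneva": "Genf",        # orthographic/translation variant
    "münchner": "München",   # name for inhabitants of a place
    "nyc": "New York",       # acronym/abbreviation
}

def normalize_location(token: str) -> str | None:
    """Map a location indicator to its canonical location name, if known."""
    return INDICATOR_TABLE.get(token.lower())

if __name__ == "__main__":
    for token in ["Genfer", "Geneva", "NYC", "Hagen"]:
        print(token, "->", normalize_location(token))
```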
For its participation in GeoCLEF 2008, selected aspects of the IR subsystem have been improved:

• The IR indexing methods utilize an improved version of the German stemmer (in the old version, adjectives were often stemmed incorrectly due to an incorrectly implemented stemming rule).

• The resources for the identification of location indicators have been expanded. Additional lists of synonymous location names were extracted from Wikipedia articles and added to the geographic annotation data. For the normalization of multi-word names, missing inflectional variants of names were automatically generated and added. Furthermore, an automatic consistency check to find circular normalizations and other data inconsistencies was integrated, and inconsistencies in the annotation data were removed (e.g. if the data contains entries to normalize “Geneva” to “Genf” and vice versa, this will be detected).

• The retrieval was modified to include a weighting scheme already used in our QA system [3]. The term weighting is meant to achieve a higher initial MAP by assigning weights according to the semantic contribution of words from the topic. Terms receive weights corresponding to their importance as follows (in order of increasing weights): lower case words (e.g. adjectives and adverbs), numeric expressions (e.g. temporal expressions), the answer subtype (similar to the expected answer type known from QA, typically the first noun from a question), nouns, and proper nouns.

The QA subsystem of GIRSA is InSicht, which also participates in QA@CLEF (see for example [2]). For the specific requirements in an IR setting, the QA system has been modified in the following ways:

• The normal processing of queries or questions stops after matching semantic representations of the query with semantic representations of documents. Answer generation is skipped because typical IR queries are not asking for answers, but for relevant documents.

• Semantic decomposition of queries, which was pioneered in the previous GeoCLEF [4], was extended by developing six decomposition methods aimed at improving recall for QA and/or IR (see [1] for details on the application of this approach to QA). For this year's experiments, only two decomposition methods were activated in order to reduce runtime and to avoid finding irrelevant documents. For the title of topic 91-GC (“Waldbrände auf spanischen Inseln”/‘Forest fires on Spanish islands’), description decomposition produces the subquestion “Nenne spanische Inseln.”/‘Name Spanish islands.’ The 14 subanswers found (e.g. “Gran Canaria”) are substituted, at the level of semantic representations, into the original question, leading to 14 revised queries, e.g. “Waldbrände auf Gran Canaria”. For the title of topic 96-GC (“Wirtschaftsaufschwung in Südostasien”/‘Economic boom in Southeast Asia’), meronymy decomposition leads to subquestions like “Welche Region/Welcher Staat/Welche Stadt liegt in Südostasien?”/‘Which region/country/city is located in Southeast Asia?’. As these examples indicate, subquestions produce background knowledge (often of a geographic type) on the fly. Some pieces of knowledge are to be found in gazetteers, but there are many cases (“Mittelmeeranrainerstaaten”/‘Mediterranean countries’ in topic 81-GC, “Nordafrika”/‘Northern Africa’ in topic 83-GC, “Südpazifik”/‘South Pacific’ in topic 85-GC, etc.) where it is unlikely to find the relevant information in static, general-purpose gazetteers. To improve the answers for subquestions, these subquestions (in contrast to the original GeoCLEF queries) are also answered on the Wikipedia corpus used in QA@CLEF. With decomposition, 1238 documents (232 assessed as relevant) were retrieved; without decomposition, only 125 documents (77 assessed as relevant) were retrieved. (A schematic sketch of this decomposition step is given after this list.)

• The semantic network for a query can be split into two semantic networks at certain relations, e.g. splitting off temporal or local restrictions. In GeoCLEF 2007, these two parts had to be matched in the same document; this year, a NEAR operator (matching within 2000 characters) was applied instead of the AND operator in order to improve precision for these cases.
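The decomposition step for topic 91-GC can be illustrated schematically. The code below is only a sketch under assumed interfaces: answer_subquestion is a hypothetical stand-in for InSicht, the canned subanswers beyond “Gran Canaria” are illustrative, and the substitution is shown on surface strings although GIRSA substitutes at the level of semantic representations.

```python
# Schematic sketch of query decomposition and subanswer substitution
# (hypothetical interfaces; not the actual InSicht/GIRSA code).

def answer_subquestion(subquestion: str) -> list[str]:
    """Stand-in for the QA subsystem; returns subanswers for a subquestion."""
    # Illustrative canned answers; only "Gran Canaria" is cited in the paper.
    canned = {
        "Nenne spanische Inseln.": ["Gran Canaria", "Teneriffa", "Mallorca"],
    }
    return canned.get(subquestion, [])

def decompose(query: str, restriction: str, subquestion: str) -> list[str]:
    """Replace the geographic restriction by each subanswer, yielding revised queries.

    Note: GIRSA performs this substitution on semantic representations,
    not on surface strings as shown here.
    """
    return [query.replace(restriction, answer)
            for answer in answer_subquestion(subquestion)]

if __name__ == "__main__":
    revised = decompose(
        query="Waldbrände auf spanischen Inseln",
        restriction="spanischen Inseln",
        subquestion="Nenne spanische Inseln.",
    )
    print(revised)  # e.g. ['Waldbrände auf Gran Canaria', 'Waldbrände auf Teneriffa', ...]
```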
3 Experiments

We formulate our expectations regarding the MAP for different parameter settings in our experiments as hypotheses:

H1 Experiments using additional location names from the narrative part of the topics will achieve a higher MAP than experiments that do not (to confirm results from GeoCLEF 2007).

H2 The MAP for experiments adding results from the QA subsystem will be somewhat higher than for experiments with pure GIR.

H3 Topic translations with the Promt Online Translator web service will be better (e.g. containing fewer untranslated words) than those from the other web services tested. The corresponding results will therefore have a higher MAP.

H4 Applying the weighting from QA (for all experiments), merging results from IR and QA, and combining indexes for location names and content words will result in a higher initial MAP.

GIRSA was employed to produce results for a number of monolingual and bilingual experiments. The following parameter settings were varied in different retrieval experiments (see Table 1):

• language (lang.): German (DE), English (EN), or Portuguese (PT) serves as topic source language.

• translation (transl.): Applied Language Solutions (http://www.appliedlanguage.com/free_translation.shtml) (A), Google Translate (http://translate.google.com/) (G), or Promt Online Translator (http://www.online-translator.com/) (O) was used to translate topics.

• fields: Content keywords and location indicators are extracted from the topic title and description: with location names from the topic narrative (TDN) or without (TD).

• index:
  – All words are stemmed; a single index is produced (A).
  – Content words are decompounded (if possible) and stemmed; location names are identified; both are indexed separately (B).
  – Content words are decompounded (if possible) and stemmed; location indicators are normalized; both are indexed separately (C).

• combination (comb.): Results from IR and QA are combined (Y) or not (N). To merge, the maximum score is chosen for duplicate results, and the top-1000 documents are returned.

Three metrics are employed to measure retrieval performance (see Table 1); formal definitions of MAP and P@N are given after this list:

• MAP: mean average precision,

• rel ret: the number of relevant and retrieved documents (a total of 1417 documents was assessed as relevant for the GeoCLEF 2008 topics), and

• P@N: precision at N documents.
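For reference, the two rank-based measures can be stated formally. These are the standard TREC-style definitions (with the top-1000 cutoff used in our runs), not anything specific to GIRSA.

```latex
% P@N and (mean) average precision over a ranked result list.
% R_q = number of relevant documents for topic q,
% rel(k) = 1 if the document at rank k is relevant, 0 otherwise.
\[
  P@N = \frac{1}{N} \sum_{k=1}^{N} \mathrm{rel}(k), \qquad
  \mathrm{AP}(q) = \frac{1}{R_q} \sum_{k=1}^{1000} P@k \cdot \mathrm{rel}(k), \qquad
  \mathrm{MAP} = \frac{1}{|Q|} \sum_{q \in Q} \mathrm{AP}(q)
\]
```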
Table 1: Results for monolingual and bilingual retrieval experiments on German GeoCLEF documents.

Run            lang.  transl.  fields  index  comb.   MAP     rel ret  P@5   P@10  P@20
FUHtd01        DE     -        TD      A      N       0.2420   977     0.39  0.37  0.31
FUHtd01m       DE     -        TD      A      Y       0.2608  1028     0.38  0.37  0.35
FUHtd20        DE     -        TD      B      N       0.1719   914     0.20  0.29  0.27
FUHtd20m       DE     -        TD      B      Y       0.2211   998     0.36  0.35  0.34
FUHtdn20       DE     -        TDN     B      N       0.1478   834     0.17  0.24  0.20
FUHENAtd20     EN     A        TD      B      N       0.1076   644     0.18  0.17  0.17
FUHENAtdn20    EN     A        TDN     B      N       0.0962   610     0.14  0.15  0.13
FUHENGtdn20    EN     G        TDN     B      N       0.1571   800     0.21  0.21  0.21
FUHENOtd20     EN     O        TD      B      N       0.1179   703     0.23  0.23  0.21
FUHENOtdn20    EN     O        TDN     B      N       0.1146   699     0.21  0.21  0.19
FUHPTGtd01     PT     G        TD      A      N       0.2085   903     0.41  0.38  0.33
FUHPTGtd20     PT     G        TD      B      N       0.1776   907     0.29  0.30  0.27
FUHPTGtdn20    PT     G        TDN     B      N       0.1571   800     0.21  0.21  0.21
FUHPTGtd21     PT     G        TD      C      N       0.2002   913     0.34  0.34  0.31
FUHPTGtdn21    PT     G        TDN     C      N       0.1567   793     0.22  0.21  0.22

4 Results and Discussion

Let us revisit the hypotheses from Section 3.

H1 Experiments using additional location names from the narrative part of the topics will achieve a higher MAP than experiments that do not (to confirm results from GeoCLEF 2007). This turned out to be false. The MAP for experiments with additional location names from the topic narrative is lower than for the experiments using title and description only (e.g. FUHtd20 vs. FUHtdn20). Maybe additional location names from the topic narrative do not match the names in documents as exactly as in old topics; maybe too many additional location names are added, causing a topic shift. A solution would require a more elaborate weighting algorithm.

H2 The MAP for experiments adding results from the QA subsystem will be somewhat higher than for experiments with pure GIR. This is also not true: performance is not just somewhat but considerably higher, due to the improvements in the QA subsystem (query decomposition, less strict matching). The MAP for merged runs is higher in all cases. FUHtd01m shows a relative improvement of 7.8% in MAP compared to FUHtd01, and FUHtd20m shows an improvement of 28.6% compared to FUHtd20; also, more relevant documents are retrieved in both cases. InSicht found documents for 13 (of the 25) topics, which is much better than last year. These results alone are not sufficient for GIR, but due to their high complementarity, merging these results improves GIRSA significantly. A sketch of the merging step is given below.
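The merging of IR and QA results keeps the maximum score for documents retrieved by both streams and returns the top-1000 documents. The following is a minimal sketch of that step, assuming both streams deliver per-document scores on comparable scales; function names and the document identifiers in the usage example are illustrative, not GIRSA's actual code or data.

```python
# Minimal sketch of merging IR and QA result lists for one topic
# (illustrative only; assumes the scores of both streams are comparable).

def merge_results(ir_results: dict[str, float],
                  qa_results: dict[str, float],
                  cutoff: int = 1000) -> list[tuple[str, float]]:
    """Keep the maximum score for duplicate documents and return the top-ranked documents."""
    merged: dict[str, float] = dict(ir_results)
    for doc_id, score in qa_results.items():
        merged[doc_id] = max(score, merged.get(doc_id, float("-inf")))
    ranked = sorted(merged.items(), key=lambda item: item[1], reverse=True)
    return ranked[:cutoff]

if __name__ == "__main__":
    # Hypothetical document identifiers and scores.
    ir = {"DOC-0001": 0.42, "DOC-0007": 0.31}
    qa = {"DOC-0001": 0.55, "DOC-0012": 0.27}
    print(merge_results(ir, qa))
```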
H3 Topic translations with the Promt Online Translator web service will be better (e.g. containing fewer untranslated words) than those from the other web services tested. The corresponding results will therefore have a higher MAP. The MAP for the best bilingual English-German experiment is 0.1571 (about 60% of the best MAP for monolingual German); the MAP for the best bilingual Portuguese-German experiment is 0.2085 (about 80% compared to monolingual German). In both cases, the highest MAP was achieved with topics translated by Google Translate (FUHENGtdn20 vs. FUHENOtdn20 vs. FUHENAtdn20), so H3 was not confirmed. Promt offers a web service (in beta status) different from previous years, which may be a reason why topics could not be translated well enough.

H4 Applying the weighting from QA (for all experiments), merging results from IR and QA, and combining indexes for location names and content words will result in a higher initial MAP. In comparison with results from the Berkeley group, the initial MAP was considerably higher: GIRSA returned 69% MAP at 0% recall for monolingual German experiments (experiment FUHtd01m), while other participants achieved 43% and 16%, respectively (cf. the GeoCLEF overview paper in this volume); GIRSA achieved 63% MAP at 0% recall for bilingual experiments (experiment FUHPTGtd01), while other participants achieved 47% and 16%, respectively.

To test GIRSA, experiments with the same parameter settings were conducted for the GeoCLEF 2007 topics before the 2008 campaign. The test experiments for topics from 2007 showed different results, e.g. hypothesis H1 is true for the GeoCLEF 2007 topics, but not for the GeoCLEF 2008 topics (see also results for official experiments described in [4]). Future work will include a more thorough, per-topic analysis of errors.

References

[1] Sven Hartrumpf. Semantic decomposition for question answering. In Malik Ghallab, Constantine D. Spyropoulos, Nikos Fakotakis, and Nikos Avouris, editors, Proceedings of the 18th European Conference on Artificial Intelligence (ECAI), pages 313–317, Patras, Greece, July 2008.

[2] Sven Hartrumpf, Ingo Glöckner, and Johannes Leveling. University of Hagen at QA@CLEF 2008: Efficient question answering with question decomposition and multiple answer streams. In Results of the CLEF 2008 Cross-Language System Evaluation Campaign, Working Notes for the CLEF 2008 Workshop, Aarhus, Denmark, September 2008.

[3] Johannes Leveling. On the role of information retrieval in the question answering system IRSAW. In Proceedings of the LWA 2006 (Learning, Knowledge, and Adaptability), Workshop Information Retrieval, pages 119–125. Universität Hildesheim, Hildesheim, Germany, 2006.

[4] Johannes Leveling and Sven Hartrumpf. Inferring location names for geographic information retrieval. In Carol Peters, Valentin Jijkoun, Thomas Mandl, Henning Müller, Douglas W. Oard, Anselmo Peñas, Vivien Petras, and Diana Santos, editors, Advances in Multilingual and Multimodal Information Retrieval: 8th Workshop of the Cross-Language Evaluation Forum, CLEF 2007, volume 5152 of Lecture Notes in Computer Science (LNCS), pages 773–780, Berlin, 2008. Springer.

[5] Johannes Leveling and Sven Hartrumpf. On metonymy recognition for geographic information retrieval. International Journal of Geographical Information Science, 22(3):289–299, 2008.