 Geographic IR Helped by Structured Geospatial Knowledge
                        Resources
         A. Toral, O. Ferrández, E. Noguera, Z. Kozareva, A. Montoyo and R. Muñoz
                Natural Language Processing and Information Systems Group
                       Department of Software and Computing Systems
                                 University of Alicante, Spain
               {atoral,ofe,elisa,zkozareva,montoyo,rafael}@dlsi.ua.es


                                            Abstract
     For the participation of the University of Alicante in the second edition of GeoCLEF,
     we have researched the incorporation of geographic knowledge into Geographic Infor-
     mation Retrieval (GIR). Our system is made up of an IR module used for several years
     in the CLEF competitions (IR-n) and a Geographic Knowledge module (Geonames).
     The latter is used to carry out an expansion of the initial topic by adding geographic
     items. Geographic items and relations are extracted from the topics, and queries to
     the Geonames database are built from them. The information returned by this
     geographic resource is incorporated into the topics, which are finally processed by
     IR-n. We have submitted several runs in order to compare the performance of
     classic IR with that of IR using geographic knowledge. The results show that
     the addition of geographic knowledge has a negative impact on the obtained precision.
     However, the fact that the results are better for some topics makes us conclude
     that the addition of this knowledge could be useful, but that considerable research
     effort is needed to determine how this knowledge should be correctly applied.

Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: H.3.3 Information Search and Retrieval

General Terms
Algorithms, Geographic database, Experimentation, Measurement, Performance

Keywords
Information Retrieval, Geographic Information Retrieval, Geographic Database


1    Introduction
GeoCLEF is a track of the Cross-Language Evaluation Forum (CLEF) whose aim is to provide
the necessary framework in which to evaluate Geographic Information Retrieval (GIR) Systems
for search tasks involving both spatial and multilingual aspects.
    The underlying and basic technology of GIR, Information Retrieval (IR), deals with the selec-
tion of the most relevant documents from a document collection given a query. Thus, GIR is a
specialization of IR which introduces geospatial restrictions to the retrieval task.
    Several approaches were followed to perform GIR in the first edition of GeoCLEF.
Several systems used Named Entity Recognition [10] [4] [5] [8] [3] specialised to the geographic
domain. Some used geographic knowledge resources [10] [5] [8] [3]. Some systems used Natural
Language Processing tools such as Part-of-Speech tagging [5] or Text Mining [3]. Another approach
was to perform query expansion [10] [2]. Finally, there were approaches based on classic IR
without any treatment of geography [7] [6] [11].
    Three of the top four systems for the English monolingual run were based only on IR (the
remaining one [4] also used geographic NER). This may be due to the fact that the systems which
tried to apply some kind of geographic reasoning did not do so effectively. Thus, the
best IR-based results from GeoCLEF 2005 may be used as a baseline against which to test the
performance of geographic reasoning. This supports the claim that research in GIR is still
in its early stages and that there is a long way to go.
    In our participation in GeoCLEF 2005 [4], we identified as an important issue the lack of
adequate ready-to-use structured knowledge resources of geographic items for our specific purpose.
This is why, for our participation in the second edition of this forum, we have centered our efforts
on studying geographic resources and trying to determine how to use them within GIR. In a
nutshell, we have researched the application of available geospatial resources to GIR.
    The GIR system developed for this purpose exploits geographic resources in order to expand
queries with geographic knowledge. We also want to evaluate the impact of adding these
geographic items on our IR module.
    The rest of this paper is organized as follows. The next section presents a detailed description
of our system and the modules it is made of. Section 3 describes the experiments carried out and
the results obtained, by means of an example that shows the functioning of our system. Finally,
Section 4 outlines our conclusions and future work proposals.


2      System Description
Our approach is based on IR with the application of geographic knowledge extracted from
a structured knowledge resource. Figure 1 depicts an overview of our system and how its
modules interact with each other.


          [Figure: the topics are combined with geographical knowledge into
          enriched topics (topics & geo-knowledge); the IR module matches these
          against the document collection and returns the relevant documents.]

                                    Figure 1: System architecture

    The topics are processed and enriched with related geographic information obtained
by exploiting the Geonames1 resource. A SQL query is generated from the geographic information
provided by the topic and is then run against the Geonames database in order to obtain
the related geographic items.
    Once all the geographic information is collected, we apply the IR module to the enriched topics
in order to retrieve the documents relevant to the given geospatial restrictions.
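    The following toy sketch (in Python, not our actual implementation) summarizes this flow;
the Geonames lookup is reduced to a hard-coded stand-in for the SQL query construction
described in Section 2.2:

      # Toy sketch of the flow in Figure 1; geonames_lookup stands in for the
      # SQL query against the Geonames database (Section 2.2)
      def geonames_lookup(entity, relation):
          toy_db = {("Iraq", "north-of"): ["Zakho", "Mosul", "Arbil"]}
          return toy_db.get((entity, relation), [])

      def enrich(topic, entity, relation):
          # Append the returned geographic items to the topic before IR
          return topic + " " + " ".join(geonames_lookup(entity, relation))

      print(enrich("Combats and embargo in the northern part of Iraq",
                   "Iraq", "north-of"))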
    The next subsections describe in detail the two main modules of our system.
    1 www.geonames.org
2.1    IR Module
The IR module we used is called IR-n [12]. IR-n is a Passage Retrieval (PR) system. These systems
[9] study the appearance of query terms in contiguous fragments of the documents (also called
passages). One of the main advantages of these systems is that they allow us not only to determine
whether a document is relevant, but also to detect the relevant part of the document.
    The passages are usually composed of a fixed number of sentences. This number depends on
a measure obtained from the document collection used. To determine this value, the system has
been trained on the GeoCLEF 2005 data collections. The number of sentences that obtains the
best results is 8, both for English and Spanish. Furthermore, IR-n uses overlapping passages in
order to avoid discarding documents as non-relevant when query words appear in adjacent
passages.
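    A minimal sketch of overlapping passage generation follows, assuming the passage size of
8 sentences reported above; the overlap step of half a passage is an assumption for illustration,
not the exact value used by IR-n:

      def split_into_passages(sentences, size=8, step=4):
          # Yield overlapping passages of `size` sentences, advancing by `step`
          for start in range(0, max(len(sentences) - size + 1, 1), step):
              yield sentences[start:start + size]

      # A 12-sentence document yields passages [0..7] and [4..11], so query
      # words split across a passage boundary still co-occur in some passage
      doc = ["sentence %d" % i for i in range(12)]
      for passage in split_into_passages(doc):
          print(passage[0], "...", passage[-1])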
    IR-n allows the use of distinct similarity measures. In order to select the most appropriate
one, we have trained the system on the English and Spanish collections. For both collections,
the similarity measure which obtains the best results is dfr [1].
    We have specifically adapted the IR-n system to incorporate geographic knowledge. In order
to do this, we need to take into account two kinds of restrictions: required words and geographical
items.

Required words These words are marked with '#'. Passages which do not contain at least one
    of these words are not included in the ranked list.
Geographical places In addition, a query expansion is performed using the Geonames data-
    base (this is studied in depth in Section 2.2). The returned geographic items are added to
    the topic under a new label, <geo-knowledge>.

    As required words, we consider all the nouns of the topic (title, description and narrative)
except geographic names, stop words and other common words appearing in topic definitions (e.g.
document, relevant). That is, we consider the words that define the main concept of the topic.
The reason for doing this is to lessen the noise that the incorporation of large lists of geographic
items into the IR query could introduce.
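    A minimal sketch of this restriction, assuming required words are marked with a trailing '#'
as in the enriched topics of Section 3:

      def parse_query(query):
          # Split a query into required words (marked '#') and ordinary terms
          required, ordinary = set(), set()
          for token in query.split():
              target = required if token.endswith("#") else ordinary
              target.add(token.rstrip("#").lower())
          return required, ordinary

      def keep_passage(passage, required):
          # A passage must contain at least one required word to be ranked
          return bool(required & {w.lower() for w in passage.split()})

      required, _ = parse_query("combats# and embargo# in the northern part of Iraq")
      print(keep_passage("UN embargo hits northern Iraq", required))  # True
      print(keep_passage("weather report for Baghdad", required))     # False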

2.2    Geographic Knowledge Module
Geonames is a geographic database which contains more than 6 million entries for geographical
names, of which 2.2 million are cities and villages. Geonames is built from different sources, the
most important being nga2 , gnis3 and wikipedia4 . Its data is freely available and may
be accessed through web services or through periodically provided database dumps.
    The information that Geonames provides for each entry is structured in several fields,
of which we have used the following:

   • Name: name of the geographical entry
   • Alternate names: alternative names (different names for a geographical point that may
     include translations)
   • Latitude: latitude in decimal degrees (wgs84)
   • Longitude: longitude in decimal degrees (wgs84)
   • Feature class: type of the entry according to the Geonames taxonomy5
   • Country code: ISO-3166 2-letter country code
   • Population: number of inhabitants (only if the entry is a populated place type)

  2 http://gnswww.nga.mil/geonames/GNS/index.jsp
  3 http://geonames.usgs.gov/index.html
  4 http://www.wikipedia.org
  5 http://www.geonames.org/export/codes.html
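    As an illustration, the following sketch reads these fields from one tab-separated line of a
Geonames dump; the column positions follow the current allCountries.txt layout and are an
assumption, since the dump format may have changed since the time of writing:

      from dataclasses import dataclass

      @dataclass
      class GeoEntry:
          name: str              # name of the geographical entry
          alternate_names: list  # may include translations
          latitude: float        # decimal degrees (wgs84)
          longitude: float       # decimal degrees (wgs84)
          feature_class: str     # e.g. 'P' for populated places
          country_code: str      # ISO-3166 2-letter code
          population: int

      def parse_dump_line(line):
          # Column positions assumed from the current dump documentation
          f = line.rstrip("\n").split("\t")
          return GeoEntry(name=f[1],
                          alternate_names=f[3].split(",") if f[3] else [],
                          latitude=float(f[4]), longitude=float(f[5]),
                          feature_class=f[6], country_code=f[8],
                          population=int(f[14] or 0))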

   Our approach regarding Geonames consists of building a query to the Geonames database for
each topic in a methodical way. For each topic we extract the geographic entities and relations,
and we enrich the topic with the information returned by the corresponding query. Appendix A
lists the geographic queries for all topics.
   Due to the large size of Geonames, noise may be incorporated into the topics. Therefore, we
place some restrictions on the extracted data. From the entries returned by a query, we only
consider those whose population is greater than 10,000 inhabitants and those that belong to a
first-order administrative division (ADM1).
   It should be noted that for some topics (26, 40 and 41) our method to build a query could not
be applied, because these topics do not contain any geographic restriction that can be expressed
in Geonames terms.
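   A minimal sketch of this query construction follows; the mapping from the relation north-of
to a latitude bound reproduces the query shown for topic 31 in Section 3, while the helper itself
is an illustrative assumption rather than our exact implementation:

      POPULATED = "(feature_class='P' AND population > 10000)"
      ADM1 = "feature_code='ADM1'"

      def build_geonames_sql(country_code, relation=None, avg_latitude=None):
          # Assemble the WHERE clause from the extracted entities/relations
          constraints = ["country_code='%s'" % country_code]
          if relation == "north-of" and avg_latitude is not None:
              constraints.append("latitude>%s" % avg_latitude)
          constraints.append("(%s OR %s)" % (POPULATED, ADM1))
          return ("SELECT name, alternatenames FROM geonames WHERE "
                  + " AND ".join(constraints) + ";")

      # Topic 31: places in the northern part of Iraq
      print(build_geonames_sql("IQ", relation="north-of", avg_latitude=33))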


3     Experiments and Results
The organizers of GeoCLEF provide 25 topics in four languages (English, German, Portuguese
and Spanish) to all participants, as well as different document collections for each target language
(e.g. EFE 94 and EFE 95 for Spanish). In this edition of GeoCLEF we have evaluated our system
on the English and Spanish monolingual tasks.
    For each task, we have carried out three experiments. The first two apply classic IR to the
provided queries; their motivation is to provide baselines against which to evaluate our approach,
which is applied in the third experiment. The only difference between these two experiments is
that the first (called uaTD) uses only the topic title and description to retrieve the documents,
whereas the second (called uaTDN) also uses the topic narrative section.
    The third experiment (called uaTDNGeo) also consists of applying the IR module, but the
queries passed to the system are previously enriched with geographic information obtained from
the Geonames database. The following paragraphs show the whole process carried out by our
system in this experiment.
    We illustrate this process with topic 31.

    1. Extract from the topic the required words and the geographic entities and relations:
       required words: combat, embargo, effect, fact
       geographic entities: Iraq
       geographic relations: north-of
    2. Build the Geonames query

      SELECT name, alternatenames FROM geonames
       WHERE
        latitude>33 AND  #average latitude of Iraq is 33 N
        country_code='IQ' AND
        ((feature_class='P' AND population > 10000) OR feature_code='ADM1');

    3. Assemble a new IR query incorporating the extracted geographic knowledge (a sketch of
       this step follows the list)

      <top>
       <num>GC031</num>
       <title>Combats# and embargo# in the northern part of Iraq</title>
       <desc>Documents telling about combats# or embargo# in the northern
       part of Iraq</desc>
       <narr>Relevant documents are about combats# and effects# of the 90s
       embargo# in the northern part of Iraq.
       Documents about these facts# happening in other parts of Iraq are
       not relevant</narr>
       <geo-knowledge>Zakho Tozkhurmato Khurmati Touz Hourmato [...]</geo-knowledge>
      </top>

  4. Retrieve the relevant documents using the IR-n system
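    As referenced in step 3, a minimal sketch of the topic enrichment follows; the top and
geo-knowledge tags are those of the example above:

      def enrich_topic(topic_xml, places):
          # Insert a <geo-knowledge> element before the closing </top> tag
          geo = " <geo-knowledge>" + " ".join(places) + "</geo-knowledge>\n"
          return topic_xml.replace("</top>", geo + "</top>")

      topic = "<top>\n <num>GC031</num>\n</top>"
      print(enrich_topic(topic, ["Zakho", "Tozkhurmato", "Khurmati"]))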

                               Language    Run                AvgP
                                           CLEF Average      0.1975
                                English    uaTD              0.2723
                                           uaTDN             0.2985
                                           uaTDNGeo          0.1201
                                           CLEF Average      0.19096
                                Spanish    uaTD              0.3508
                                           uaTDN             0.3237
                                           uaTDNGeo          0.1525


             Table 1: Overall GeoCLEF 2006 official results for the Monolingual tasks


               Topic           English AvgP                    Spanish AvgP
                       uaTD    uaTDN uaTDNGeo          uaTD    uaTDN uaTDNGeo
                026    49.07    50.09       48.34      16.67    15.18      15.18
                027     0.22     0.66        0.04       2.56     3.89      10.35
                028    16.85     3.93        0.59      30.56    37.65       0.47
                029     9.07    12.84        4.91      57.58    68.63       0.07
                030    91.67    95.83        0.00      43.05    37.18       0.00
                031    43.10    35.57        2.03      61.18    69.22      41.28
                032    88.41    90.05       64.59      90.00    95.91      90.16
                033     0.30     0.50        2.59       6.00     2.31      54.64
                034    37.68    51.52        1.44      13.51    20.67       0.00
                035     5.07     2.40        0.00      21.05     8.17       0.15
                036     0.00     0.00        0.00      60.91    21.43       0.00
                037     9.16     0.06        1.30      20.69     0.00       0.00
                038     1.03     0.57        0.00      20.00    11.78       0.00
                039     3.94    36.53        6.28      38.81    30.75       4.34
                040    36.93    32.11       26.93      70.50    77.24      77.55
                041     0.17     0.18        0.67      40.00    38.12      22.45
                042    45.00    100.00       6.55      32.08    45.30       4.58
                043     1.13     0.95        0.05      12.50     8.54       0.11
                044    11.34     8.38        0.00      46.60     5.69       0.01
                045    10.04    29.89       34.89       8.33     1.66       0.11
                046    71.43    69.05        4.08      60.71    77.26       6.47
                047     5.48     1.69        0.00       3.39     0.00       0.00
                048    80.82    82.66       81.69      66.04    81.18      52.18
                049    36.11    35.00        0.00      51.15    38.69       0.06
                050    26.79    15.85       13.23      22.00    12.85       1.04


              Table 2: Results topic by topic for the GeoCLEF 2006 Monolingual tasks

   The addition of geographical information has drastically decreased the precision. For English,
the best run (uaTDN) obtains 29.85 while the geographic run (uaTDNGeo) achieves 12.01 (see
Table 1). In the case of Spanish, the best run (uaTD) reaches 35.08 and the geographic one
(uaTDNGeo) 15.25 (see Table 1). Although we implemented the required-words model in order
to lessen the noise introduced by the large lists of geographic items added to the IR queries, this
seems to be insufficient.
    However, for both English and Spanish, the run with geographic information obtains the best
results for three topics (see Table 2): 33, 41 and 45 for English, and 27, 33 and 40 for Spanish.
Therefore, a more in-depth analysis should be carried out in order to achieve a better understanding
of how geographic information should be incorporated.
    It should be noted that the results for Spanish are consistently better than those for English.
This holds for every run we have submitted (uaTD, uaTDN and uaTDNGeo). This may be because
the IR module was initially designed for Spanish and, moreover, has been used for this language
for several years.


4    Conclusions and Future Work
For our participation in GeoCLEF 2006 we have proposed the expansion of IR queries with geo-
graphic information related to the topics. For this purpose we have studied geographic knowledge
resources and we have used Geonames.
    The proposal has obtained poor results compared to our simpler model in which we only use
an Information Retrieval system. This is a paradigmatic example of the state of the art of the
GIR field: it is just at its beginning, and more effort is needed in order to figure out how to
introduce geographic knowledge in a way that basic IR systems can benefit from it.
    Therefore, as future work we plan to research different ways of providing geographic knowledge
to basic IR and to evaluate the impact of each approach. Thus, our aim is to improve GIR results
by applying existing geographic knowledge from structured resources.


Acknowledgements
This research has been partially funded by the Spanish Government under project CICyT number
TIC2003-07158-C04-01 and by the Valencia Government under project number GV06-161.


References
 [1] G. Amati and C. J. Van Rijsbergen. Probabilistic Models of information retrieval based on
     measuring the divergence from randomness. ACM TOIS, 20(4):357–389, 2002.
 [2] Davide Buscaldi, Paolo Rosso, and Emilio Sanchis Arnal. A WordNet-based Query Expansion
     method for Geographical Information Retrieval. Working Notes in Cross-Language Evaluation
     Forum (CLEF) 2005, 2005.
 [3] Nuno Cardoso, Bruno Martins, Marcirio Silveira Chaves, Leonardo Andrade, and Mario J.
     Silva. The XLDB Group at GeoCLEF 2005. Working Notes in Cross-Language Evaluation
     Forum (CLEF) 2005, 2005.
 [4] Oscar Ferrández, Zornitsa Kozareva, Antonio Toral, Elisa Noguera, Andrés Montoyo, Rafael
     Muñoz, and Fernando Llopis. The University of Alicante at GeoCLEF 2005. Working Notes
     in Cross-Language Evaluation Forum (CLEF) 2005, 2005.
 [5] Daniel Ferrés, Alicia Ageno, and Horacio Rodríguez. The GeoTALP-IR System at GeoCLEF-
     2005: Experiments Using a QA-based IR System, Linguistic Analysis and a Geographical
     Thesaurus. Working Notes in Cross-Language Evaluation Forum (CLEF) 2005, 2005.
 [6] Fredric Gey and Vivien Petras. Berkeley2 at GeoCLEF: Cross-Language Geographic In-
     formation Retrieval of German and English Documents. Working Notes in Cross-Language
     Evaluation Forum (CLEF) 2005, 2005.
 [7] Rocio Guillén. CSUSM Experiments in GeoCLEF 2005: Monolingual and Bilingual Tasks.
     Working Notes in Cross-Language Evaluation Forum (CLEF) 2005, 2005.
 [8] Baden Hughes. NICTA i2d2 at GeoCLEF 2005. Working Notes in Cross-Language Evaluation
     Forum (CLEF) 2005, 2005.
 [9] M. Kaszkiel and J. Zobel. Passage retrieval revisited. In Proceedings of the 20th Annual
     International ACM SIGIR Conference, Philadelphia, pages 178–185, 1997.
[10] Sara Lana-Serrano and Jose M. Goñi-Menoyo. MIRACLE's 2005 Approach to Geographical
     Information Retrieval. Working Notes in Cross-Language Evaluation Forum (CLEF) 2005,
     2005.
[11] Ray R. Larson. Cheshire II at GeoCLEF: Fusion and Query Expansion for GIR. Working
     Notes in Cross-Language Evaluation Forum (CLEF) 2005, 2005.
[12] Fernando Llopis. IR-n: un Sistema de Recuperación de Información Basado en Pasajes. Ph.D.
     thesis, Universidad de Alicante, 2003. Procesamiento del Lenguaje Natural, 30:127–128.


A     SQL queries

26. no geographic SQL query was implemented;

27. (longitude>7.98 AND longitude<9.38 AND latitude>49.21 AND
latitude<51.01);

28. (country_code='CA' OR country_code='US' OR country_code='MX');

29. (country_code='AO' OR country_code='ZA');

30. (longitude>-6.32 AND longitude<-1.04 AND latitude>38.40 AND
latitude<42.40);

31. latitude>33.20 AND country_code='IQ';

32. country_code='CA' AND admin1_code=10;

33. country_code='DE' AND admin1_code=7;

34. (latitude>-23.51 AND latitude<23.51) AND ((feature_class='P' AND
population > 50000) OR feature_code='ADM1');

35. (country_code='BG' OR country_code='HU' OR country_code='CZ' OR
country_code='SK' OR country_code='PL' OR country_code='RO');

36. (country_code='JP' OR country_code='KP' OR country_code='KR' OR
country_code='RU');

37. (country_code='IR' OR country_code='IQ' OR country_code='TK' OR
country_code='EG' OR country_code='LB' OR country_code='SA' OR
country_code='JO' OR country_code='YE' OR country_code='QA' OR
country_code='KW' OR country_code='BH' OR country_code='IL' OR
country_code='OM' OR country_code='SY' OR country_code='AE' OR
country_code='CY' OR country_code='PS');

38. (country_code='BN' OR country_code='KH' OR country_code='TL' OR
country_code='ID' OR country_code='LA' OR country_code='MY' OR
country_code='MM' OR country_code='PH' OR country_code='SG' OR
country_code='TH' OR country_code='VN');

39. (country_code='AZ' OR country_code='AM' OR country_code='GE');

40. no geographic SQL query was implemented;

41. no geographic SQL query was implemented;

42. country_code='DE' AND (admin1_code='03' OR admin1_code='04' OR
admin1_code='06' OR admin1_code='12' OR admin1_code='10');

43. country_code='US' AND (admin1_code='CT' OR admin1_code='RI' OR
admin1_code='MA' OR admin1_code='VT' OR admin1_code='NH' OR
admin1_code='ME');

44. (country_code='SI' OR country_code='MK' OR country_code='HR' OR
country_code='YI' OR country_code='BK');

45. country_code='BR' AND (admin1_code='02' OR admin1_code='05' OR
admin1_code='06' OR admin1_code='13' OR admin1_code='17' OR
admin1_code='19' OR admin1_code='20' OR admin1_code='22' OR
admin1_code='28');

46. country_code='PT' AND (admin1_code='21' OR admin1_code='17' OR
admin1_code='04' OR admin1_code='05');

47. (country_code='FR' OR country_code='SP' OR country_code='MC' OR
country_code='IT' OR country_code='MT' OR country_code='SI' OR
country_code='HR' OR country_code='BA' OR country_code='CS' OR
country_code='AL' OR country_code='GR' OR country_code='TR' OR
country_code='CY');

48. (country_code='GL');

49. (country_code='FR');

50. (country_code='DE' OR country_code='AT' OR country_code='SK' OR
country_code='HU' OR country_code='HR' OR country_code='CS' OR
country_code='BG' OR country_code='RO' OR country_code='UA' OR
country_code='LI' OR country_code='FR' OR country_code='NL' OR
country_code='CH');