WordNet-based Index Terms Expansion for
Geographical Information Retrieval
Davide Buscaldi and Paolo Rosso and Emilio Sanchis
Dpto. de Sistemas Informticos y Computación (DSIC),
Universidad Politcnica de Valencia, Spain
{dbuscaldi, prosso, esanchis}@dsic.upv.es
August 20, 2006
Abstract
This paper presents the results obtained by our group at the GeoCLEF 2006. Our
system used a method based on the expansion of index terms, which exploits WordNet
synonyms and holonyms. This may help in finding implicit geographic information
from text, particularly in the cases in which the indication of the containing geograph-
ical entity is omitted. The system is based on the Lucene search engine. We submitted
two kind of runs, one using WordNet to expand the index terms, the other without
any expansion. Results show that expansion can improve recall in some cases, al-
though a specific ranking function is needed in order to obtain better results in terms
of precision.
Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Infor-
mation Search and Retrieval; H.3.4 Systems and Software
General Terms
Measurement, Algorithms, Performance, Experimentation
Keywords
Geographical Information Retrieval, Index Term Expansion, WordNet
1 Introduction
The application of Natural Language Processing techniques to geographical names presents various
issues. Although most of the information available in electronic format involves some kind of spatial
awareness, correctly identifying the locations to which a document refers to is not a trivial task.
Explicit information about areas including the cited geographical entities is usually missing from
texts (e.g. usually France is not named in a news related to Paris). Moreover, using text strings
in order to identify a geographical entity creates problems related to ambiguity, synonymy and
names changing over time.
Ambiguity and synonymy are well-known problems in the field of Information Retrieval. The
use of semantic knowledge may help to solve these problems, even if no strong experimental results
are yet available in support of this hypothesis. Some results [1] show improvements by the use
of semantic knowledge; others do not [7]. The most common approaches make use of standard
keyword-based techniques, improved through the use of additional mechanisms such as document
structure analysis and automatic query expansion.
In our 2005 participation to the GeoCLEF, the use of automatic query expansion did not obtain
good results [5]. Although currently are available some effective query expansion techniques [4]
applied to the geographical domain, we think that the expansion of the queries with synonyms
and meronyms does not fit with the characteristics of the GeoCLEF task. Other methods using
thesauri with synonyms for general domain IR also did not achieve promising results [8].
In our work for GeoCLEF 2006 we focused on the use of WordNet [6] for the expansion of
index terms by means of synonyms and holonyms, a technique we described last year even if we
were not able to send runs due to the time needed to index the collection [3]. We used the subset
of the WordNet ontology related to the geographical domain. It is quite difficult to calculate the
number of geographical entities stored in WordNet, due to the lack of an explicit annotation of
the synsets, however we retrieved some figures by means the has instance relationship, resulting
in 654 cities, 280 towns, 184 capitals and national capitals, 196 rivers, 44 lakes, 68 mountains.
Geographical resources like gazetteers usually contains a much greater quantity of information.
For instance, the Geonet Names Server1 (GNS) contains more than 5 million of place names.
In the following section we describe in detail our technique for the expansion of index terms;
in section 3 we present and discuss the results obtained.
2 Expansion of Index Terms with WordNet
The expansion of index terms is a method that exploits the holonymy relationship of the WordNet
ontology. A concept A is holonym of another concept B if A contains B, or viceversa B is part
of A (B is also said to be meronym of A). Therefore, our idea is to add to the geographical index
terms the informations about their holonyms, such that a user looking information about Spain
will find documents containing Valencia, Madrid or Barcelona even if the document itself does
not contain any reference to Spain.
We used the Lucene2 search engine, an open source project freely available from Apache
Jakarta. A Porter stemmer was used during the indexing phase, particularly the Snowball3 imple-
mentation. The indexing process is performed by means of the Lucene search engine, generating
two index for each text: a geo index, containing all the geographical terms included in the text
and also those obtained through WordNet, and a text index, containing the stems of text words
that are not related to geographical entities. Thanks to the separation of the indices, a docu-
ment containing “John Houston” will not be retrieved if the query contains “Houston”, the city
in Texas. The adopted weighting scheme is the usual tf · idf . The geographical terms in the text
are identified by means of a Named Entity (NE) recognizer based on maximum entropy4 , and put
into the geo index, together with all its synonyms and holonyms obtained from WordNet.
For instance, consider the following text:
“A federal judge in Detroit struck down the National Security Agency’s domestic surveillance
program yesterday, calling it unconstitutional and an illegal abuse of presidential power.”
The NE recognizer identifies Detroit as a geographical entity. A search for Detroit synonyms
in WordNet returns {Detroit, Motor city, Motown}, while its holonyms are:
-> Michigan, Wolverine State, Great Lakes State, MI
-> Midwest, middle west, midwestern United States
-> United States, United States of America, U.S.A., USA, U.S., America
-> North America
-> northern hemisphere
1 http://earth-info.nga.mil/gns/html/index.html
2 http://lucene.jakarta.org
3 http://snowball.tartarus.org/
4 Freely available from the OpenNLP project: http://opennlp.sourceforge.net
-> western hemisphere, occident, New World
-> America
Therefore, the following index terms are put into the geo index: { Michigan, Wolverine State,
Great Lakes State, MI, Midwest, middle west, midwestern United States, United States, United
States of America, U.S.A., USA, U.S., America, North America, northern hemisphere, western
hemisphere, occident, New World }. The result of the expansion of index terms is that the above
text will be indexed also by terms like Michigan, North America that were not explicitly mentioned
in it.
3 Experimental Results
This year we submitted four runs, two generated using the WordNet-based system and two with
the system without the index terms expansion. The runs were the mandatory “title-description”
and “title-description-narrative” for each of the two systems. For every query the top 1000 ranked
documents have been returned. In both systems the topic fields are analyzed in search of colloca-
tions (e.g. pairs “noun-noun” or “adjective-noun”). In 2005 we observed that this lead to worse
results [3], however we consider this step necessary in order to identify correctly geographical
entities such as North America, or northern hemisphere which are compound.
In Figure 1 we show the interpolated precision/recall graphs obtained for the “title and descrip-
tion” runs with the system that did not use the WordNet-based index term expansion (rfiaUPV01 )
and the WordNet-enhanced one (rfiaUPV03 ). Figure 2 contains the precision/recall graphs for
the runs which included also the topic narrative. In this case rfiaUPV04 is the run obtained with
the WordNet-based system.
Figure 1: Interpolated precision/recall graph for the two “title and description” runs: rfiaUPV01,
obtained using the system without WordNet, and rfiaUPV03, obtained using the system with
WordNet
Figure 2: Interpolated precision/recall graph for the two “title, description and narrative” runs:
rfiaUPV02, obtained using the system without WordNet, and rfiaUPV04, obtained using the
system with WordNet.
In table 1 we show the recall and average precision values obtained. Recall has been calculated
for each run as the number of relevant documents retrieved divided by the number of relevant
documents in the collection (378). The average precision is the non-interpolated average precision
calculated for all relevant documents, averaged over queries.
The results obtained in term of precision show that non-WordNet runs are better than the
other ones, particularly for the all-fields run rfiaUPV02. However, as we expected, we obtained
an improvement in recall for the WordNet-based system, although the improvement was not so
significant as we hoped (about 1%).
Table 1: Average precision and recall values obtained for the four runs. WN: tells whether the
run uses WordNet or not.
run WN avg. precision recall
rfiaUPV01 no 25.07% 78.83%
rfiaUPV02 no 27.35% 80.15%
rfiaUPV03 yes 23.35% 79.89%
rfiaUPV04 yes 26.60% 81.21%
In order to better understand the obtained results, we analyzed the topics in which the two
systems differ more (in terms of recall). Topics 40 and 48 resulted the worst ones for the WordNet
based system. Topic 40 does not contain any name of geographical place:
Cities near active volcanoes
Cities, towns or villages threatened by the eruption of a volcano
Topic 48 contains references to places (Greenland and Newfoundland ) for which WordNet does
not provide many informations.
On the other hand, the system based on index term expansion performed particularly well for
topics 27, 37 and 44. These topics contain references to countries and regions (Western Germany
for topic 27, Middle East in the case of 37 and Yugoslavia for 44) for which WordNet provides a
rich information in terms of meronyms. It was interesting to note that in topic 44 the difference
between “title-desc” runs was os 6 documents retrieved in favour of the WordNet-based run,
whereas the runs using all the topic fields obtained the same recall. This can be explained with
the fact that the narrative of this topic contains a list of states that are meronyms of Yugoslavia
(therefore they were indexed together with the holonym Yugoslavia).
4 Conclusions and Further Work
The obtained results show that the expansion of index terms by means of WordNet holonyms can
improve slightly the recall. However, a better ranking function needs to be implemented in order
to obtain an improvement in precision. Our next work directions will be the implementation of
the same method with a richer (in terms of coverage of geographical places) resource such as the
Getty Thesaurus of Geographical Names, or an ontology we are currently developing using the
GNS and GNIS gazetteers together with WordNet itself and Wikipedia [2].
Acknowledgments
We would like to thank R2D2 CICYT (TIC2003-07158-C04-03) and ICT EU-India (ALA/95/23/2003/077-
054) research projects for partially supporting this work.
References
[1] K. Bo-Yeong, K. Hae-Jung, and L. Sang-Lo. Performance analysis of semantic indexing in
text retrieval. In CICLing 2004, Lecture Notes in Computer Science, Vol. 2945, Mexico City,
Mexico, 2004.
[2] D. Buscaldi, P. Rosso, and P. Peris. Inferring geographical ontologies from multiple resources
for geographical information retrieval. In Proceedings of the 3rd GIR Workshop, SIGIR 2006,
Seattle, WA, 2006.
[3] D. Buscaldi, P. Rosso, and E. Sanchis. Using the wordnet ontology in the geoclef geographical
information retrieval task. In Proceedings of the CLEF 2005 Workshop, Vienna, Austria, 2005.
[4] G. Fu, C.B. Jones, and A.I. Abdelmoty. Ontology-based spatial query expansion in information
retrieval. In Proceedings of the ODBASE 2005 conference, 2005.
[5] Fredric Gey, Ray Larson, Mark Sanderson, Hideo Joho, and Paul Clough. Geoclef: the clef
2005 cross-language geographic information retrieval track. In Working notes for the CLEF
2005 Workshop (C.Peters Ed.), Vienna, Austria, 2005.
[6] G. A. Miller. Wordnet: A lexical database for english. In Communications of the ACM,
volume 38, pages 39–41, 1995.
[7] Paolo Rosso, Edgardo Ferretti, D. Jiménez, and Vicente Vidal. Text categorization and infor-
mation retrieval using wordnet senses. In CICLing 2004, Lecture Notes in Computer Science,
Vol. 2945, Mexico City, Mexico, 2004.
[8] Ellen Voorhees. Query expansion using lexical-semantic relations. In Proceedings of the ACM
SIGIR 1994, 1994.