<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A WordNet-based Query Expansion method for Geographical Information Retrieval</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Davide Buscaldi</string-name>
          <email>dbuscaldi@dsic.upv.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Rosso</string-name>
          <email>prosso@dsic.upv.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Emilio Sanchis Arnal</string-name>
          <email>esanchis@dsic.upv.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>General Terms</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Query Expansion, WordNet, Geographical Information Retrieval</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dpto. de Sistemas Informáticos y Computación</institution>
          ,
          <addr-line>DSIC</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Measurement</institution>
          ,
          <addr-line>Performance, Experimentation</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Universidad Politécnica de Valencia</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This report describes a query expansion method that expands geographical terms with WordNet synonyms and meronyms. We used this method for our participation in the GeoCLEF 2005 English monolingual task, with the well-known Lucene search engine for indexing and retrieval. The results show that the proposed method was not suitable for the GeoCLEF track, whereas WordNet can be used more effectively during the indexing phase, by adding synonyms and holonyms to the index terms.</p>
      </abstract>
      <kwd-group>
        <kwd>H.3 [Information Storage and Retrieval]</kwd>
        <kwd>H.3.1 Content Analysis and Indexing</kwd>
        <kwd>H.3.3 Information Search and Retrieval</kwd>
        <kwd>I.2 [Artificial Intelligence]</kwd>
        <kwd>I.2.7 Natural Language Processing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Geographical entities can appear in very different forms in text collections, such as when a foreign
name is used instead of the English one, or when the citation of some region or place omits
the name of a larger geographical entity containing them. This is a well-known problem in the
field of Information Retrieval. The use of semantic knowledge may help to solve this problem,
even if no strong experimental results are yet available in support of this hypothesis. Some
results [7] show improvements by the use of semantic knowledge, others do not [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The most
common approaches make use of standard keyword-based techniques, improved through the use
of additional mechanisms such as document structure analysis and automatic query expansion.
      </p>
      <p>
        Automatic query expansion is used to add terms to the user’s query. In the field of IR, the
expansion techniques based on statistically derived associations have proven useful [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], while other
methods using thesauri with synonyms obtained less promising results [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. This is due to the
ambiguity of the query terms and its propagation to their synonyms. The resolution of term
ambiguity (Word Sense Disambiguation) is still an open problem in Natural Language Processing.
Nevertheless, in the case of geographical terms, the resolution of ambiguity is usually less difficult
and therefore better results can be obtained by the use of effective query expansion techniques
based on ontologies, as demonstrated by the query expansion techniques developed for the SPIRIT
project [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>In our work we used the WordNet ontology only in the geographical domain, by applying a
query expansion method, based on the synonymy and meronymy relationships, to geographical
terms. The method is based on a similar one we previously developed for use with the TREC-8
ad hoc task.</p>
    </sec>
    <sec id="sec-2">
      <title>Query Reformulation</title>
      <p>There can be many different ways to refer to a geographical entity. This occurs especially with
foreign names (e.g. Rome can also be indicated with its original Italian name, Roma), acronyms
(e.g. U.K. or G.B. used instead of the extended form United Kingdom of Great Britain and
Northern Ireland), or even some popular names (for instance, Paris is also known as the ville
lumière, i.e., the city of light). Each one of these cases can be reduced to the synonymy problem.
Moreover, sometimes the rhetorical figure of metonymy (i.e., the substitution of one word for another
with which it is associated) is used to indicate a greater geographical entity (e.g. Washington for
U.S.A.), or the indication of the including entity is omitted because it is supposed to be well-known
to the readers (e.g. Paris and France).</p>
      <p>WordNet can help in solving these problems. In fact, WordNet provides synonyms ({U.S.,
U.S.A., United States of America, America, United States, US, USA} is the synset corresponding
to the “North American republic containing 50 states”), and meronyms (e.g. France has Paris
among its meronyms), i.e., concepts associated through the “part of” relationship.</p>
      <p>Taking these observations into account, we developed a query expansion method that exploits
these relationships. First, the query is tagged with POS labels. The query
expansion is then carried out according to the following algorithm:</p>
      <list list-type="order">
        <list-item><p>Select from the query the next word (w) tagged as proper noun.</p></list-item>
        <list-item><p>Check in WordNet whether w has the {country, state, land} synset among its hypernyms; if not,
return to 1; otherwise add to the query all the synonyms, with the exception of stop-words and
of the word w itself, if present; then go to 3.</p></list-item>
        <list-item><p>Retrieve the meronyms of w and add to the query all the words in each synset containing the
word capital in its gloss or synset, except the word capital itself. If there are more words in
the query, return to 1; otherwise end.</p></list-item>
      </list>
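      <p>The three steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the GEO_SYNSETS table is a small hand-written stand-in for WordNet lookups (it is not queried from WordNet, its glosses are abbreviated, and it covers only two place names), and the names expand_query, mentions_capital, GEO_SYNSETS and STOP_WORDS are hypothetical, not taken from the original system. Step 3 is read here as dropping every synset member that contains the word capital.</p>
      <preformat preformat-type="code"><![CDATA[
```python
# Hand-written stand-in for WordNet (assumption): only entries whose hypernyms
# would include the {country, state, land} synset are listed, so a failed lookup
# plays the role of the hypernym check in step 2. Glosses are abbreviated.
GEO_SYNSETS = {
    "Australia": {
        "synonyms": ["Australia", "Commonwealth of Australia"],
        "meronyms": [
            (["Canberra", "Australian capital", "capital of Australia"],
             "the capital of Australia; located in southeastern Australia"),
        ],
    },
    "California": {
        "synonyms": ["California", "Golden State", "CA", "Calif."],
        "meronyms": [
            (["Los Angeles", "City of the Angels"],
             "a city in southern California; motion picture capital of the world"),
            (["Sacramento", "capital of California"],
             "a city in north central California; capital of California"),
            (["San Francisco"], "a port in western California"),
        ],
    },
}
STOP_WORDS = {"the", "of", "and", "in", "off"}


def mentions_capital(synset_words, gloss):
    """True if 'capital' occurs in the gloss or in any member of the synset."""
    texts = [gloss] + list(synset_words)
    return any("capital" in text.lower() for text in texts)


def expand_query(tagged_query):
    """Return the terms added to a query given as (POS tag, word) pairs."""
    added = []
    for tag, word in tagged_query:
        # Step 1: consider only words tagged as proper nouns.
        if tag != "NNP":
            continue
        # Step 2: skip words that are not countries/states, else add synonyms.
        entry = GEO_SYNSETS.get(word)
        if entry is None:
            continue
        for synonym in entry["synonyms"]:
            if synonym != word and synonym.lower() not in STOP_WORDS:
                added.append(synonym)
        # Step 3: add the words of every meronym synset that mentions "capital",
        # leaving out the members that contain the word "capital" themselves.
        for synset_words, gloss in entry["meronyms"]:
            if mentions_capital(synset_words, gloss):
                added.extend(w for w in synset_words if "capital" not in w.lower())
    return added
```
]]></preformat>
      <p>For the tagged query NN/shark, NNS/attacks, PRP/off, NNP/Australia, CC/and, NNP/California, this sketch returns Commonwealth of Australia, Canberra, Golden State, CA, Calif., Los Angeles, City of the Angels and Sacramento, in that order. A real implementation would replace the table with lookups against WordNet itself.</p>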
      <p>For example, the query Shark Attacks off Australia and California is POS-tagged as follows:
NN/shark, NNS/attacks, PRP/off, NNP/Australia, CC/and, NNP/California. “Shark” and
“Attacks” do not have the {country, state, land} synset among their hypernyms, therefore Australia
is selected as the next w. The corresponding WordNet synset is {Australia, Commonwealth of
Australia}, so Commonwealth of Australia is added to the expanded query.
Moreover, the following meronym contains the word “capital” in its synset or gloss: Canberra, Australian
capital, capital of Australia – (the capital of Australia; located in southeastern Australia), and the
result is that Canberra is included in the expanded query. The next w is California. In this case
the WordNet synset is {California, Golden State, CA, Calif.}, and the words added to the query
are Golden State, CA and Calif. Two meronyms contain the word “capital”:</p>
      <list list-type="bullet">
        <list-item><p>Los Angeles, City of the Angels – (a city in southern California; motion picture capital of
the world; most populous city of California and second largest in the United States)</p></list-item>
        <list-item><p>Sacramento, capital of California – (a city in north central California 75 miles northeast of
San Francisco on the Sacramento River; capital of California)</p></list-item>
      </list>
      <p>Moreover, during the POS tagging phase, the system looks for word pairs of the kind “adjective
noun” or “noun noun” and keeps each pair together as a quoted phrase. The aim of this step is to
imitate the search strategy that a human would attempt. Stopwords are also removed from the query
during this phase. Therefore, the expanded query handed over to the search engine is: “shark attacks”
Australia California “Commonwealth of Australia” Canberra “Golden State” CA Calif. “Los Angeles”
“City of the Angels” Sacramento.</p>
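      <p>A minimal sketch of how the final query string could be assembled is given below. It rests on several assumptions that the text does not confirm: Penn Treebank style tags, a tiny hand-written stop-word list, and pairing restricted to adjectives and common nouns so that adjacent place names are not merged. The names build_query, PAIR_FIRST_TAGS and PAIR_SECOND_TAGS are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
# Assumptions for this sketch: Penn Treebank tags, a tiny stop-word list, and
# pairs restricted to common nouns so that adjacent place names stay separate.
STOP_WORDS = {"off", "and", "in", "of", "the"}
PAIR_FIRST_TAGS = {"JJ", "NN", "NNS"}   # adjective or common noun
PAIR_SECOND_TAGS = {"NN", "NNS"}        # common noun


def build_query(tagged_query, expansions):
    """Join phrase pairs, remaining query words and expansion terms into one string."""
    terms = []
    skip_next = False
    for position, (tag, word) in enumerate(tagged_query):
        if skip_next:
            # This word was already consumed as the second half of a pair.
            skip_next = False
            continue
        if word.lower() in STOP_WORDS:
            continue
        following = tagged_query[position + 1:position + 2]
        if tag in PAIR_FIRST_TAGS and following and following[0][0] in PAIR_SECOND_TAGS:
            # "adjective noun" or "noun noun": keep the two words as one phrase.
            terms.append('"%s %s"' % (word, following[0][1]))
            skip_next = True
        else:
            terms.append(word)
    for term in expansions:
        # Multi-word expansion terms are quoted so they are searched as phrases.
        terms.append('"%s"' % term if " " in term else term)
    return " ".join(terms)
```
]]></preformat>
      <p>Called with the tagged example query and the eight expansion terms listed above, the function reproduces the expanded query string of the example, with straight double quotes around each phrase.</p>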
      <p>
        For this work we used the Lucene search engine, an open source project available for free
download from Apache Jakarta. The Porter stemmer [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] was used during the indexing phase, and
for this reason the expanded queries are also stemmed by the SnowballAnalyzer (provided with
the Lucene API) before being submitted to the search engine itself.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>We submitted only the two mandatory runs, one using the topic title and description fields, and
the second including the “concept” and “location” fields.</p>
      <p>For every query the top 1000 ranked documents have been returned by the system. We
performed two runs, one with the unexpanded queries, the other one with the expansion. For both
runs we plotted the precision/recall graph (see Figure 1) which displays the precision values
obtained at each of the 10 recall levels.</p>
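      <p>As an illustration of how such a curve can be computed, the sketch below derives precision values at a fixed number of recall levels from one ranked result list. It assumes interpolated precision (the highest precision at any rank whose recall reaches the level) and evenly spaced levels 0.1, 0.2, ..., 1.0; the text does not state which levels or which interpolation were used, and the function name precision_at_recall_levels is hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def precision_at_recall_levels(ranked_ids, relevant_ids, levels=10):
    """Interpolated precision at recall 1/levels, 2/levels, ..., 1.0."""
    relevant = set(relevant_ids)
    # One point per retrieved relevant document:
    # (relevant documents seen so far, precision at that rank).
    points = []
    hits = 0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            points.append((hits, hits / rank))
    curve = []
    for step in range(1, levels + 1):
        # Recall hits/len(relevant) reaches step/levels; compared with integers
        # to avoid floating-point rounding issues.
        reached = [p for h, p in points if h * levels >= step * len(relevant)]
        curve.append(max(reached) if reached else 0.0)
    return curve
```
]]></preformat>
      <p>For a ranking d1, d2, d3, d4, d5 in which only d1 and d4 are relevant, the function yields 1.0 for the recall levels up to 0.5 and 0.5 for the levels from 0.6 to 1.0. Averaging such curves over all topics gives one precision/recall graph per run.</p>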
      <p>It is clear that the results obtained by taking into account the “concept” and “location” fields
are generally better than those obtained working only on the topic title and definition, as can
also be observed from the results obtained over the single topics in Table 1.</p>
      <p>The results show that the performance of our system is around the average of
the participants in the exercise. From these results the advantage of the query expansion method
is not clear, even if it proved effective for some topics (particularly 16 and 7). We suppose this
is because the expansion may introduce unnecessary information. For example, if
the user is asking about “shark attacks in California”, we have seen that Sacramento is added
to the query. Therefore, documents containing “shark attacks” and “Sacramento” will obtain a
higher rank, with the result that documents that contain “shark attacks” but not “Sacramento”
are placed lower in the ranking. Since it is unlikely to observe a shark attack in Sacramento, the
number of relevant documents in the top positions will be reduced with respect to the
unexpanded query, thus lowering precision.</p>
      <table-wrap>
        <label>Table 1</label>
        <caption>
          <p>Results obtained over the single topics.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Topic</th><th>dsic051gc</th></tr>
          </thead>
          <tbody>
            <tr><td>001</td><td>9.23%</td></tr>
            <tr><td>002</td><td>5.24%</td></tr>
            <tr><td>003</td><td>14.71%</td></tr>
            <tr><td>004</td><td>0.03%</td></tr>
            <tr><td>005</td><td>0.31%</td></tr>
            <tr><td>006</td><td>11.17%</td></tr>
            <tr><td>007</td><td>4.45%</td></tr>
            <tr><td>008</td><td>0.14%</td></tr>
            <tr><td>009</td><td>30.68%</td></tr>
            <tr><td>010</td><td>7.94%</td></tr>
            <tr><td>011</td><td>0.00%</td></tr>
            <tr><td>012</td><td>0.18%</td></tr>
            <tr><td>013</td><td>12.54%</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>In order to evaluate the query expansion more precisely, we compared the results
with two baselines: the first obtained by submitting to the Lucene search engine the
query without the synonyms and meronyms, and the second by using only the tokenized fields
from the topic. For instance, the query “shark attacks” Australia California “Commonwealth of
Australia” Canberra “Golden State” CA Calif. “Los Angeles” “City of the Angels” Sacramento
becomes “shark attacks” Australia California for the first baseline (without WN) and shark attacks
Australia California for the second.</p>
      <p>[Figure: plot not recovered; legend entries: dsic052gc, without query expansion, clean system.]</p>
      <p>Due to the slowness of the process, we completed only the indexing of the Glasgow Herald
1995 collection in time for this paper. The topics (all fields) were submitted to Lucene as in the
simplest search strategy, but using the usual Lucene syntax for multi-field queries (e.g. all the
geographical terms were labelled with “geo:”). The results are displayed in Figure 4.</p>
      <p>In this case too we compared the results with the standard search (i.e., no term
was searched in the geo index), as with the baseline obtained by using only the tokenized fields
from the topic. To clarify the difference, the following string was submitted to Lucene for
the WordNet-enhanced search: “text:shark text:attacks geo:california geo:australia”, while for the
standard search the submitted string was: “text:shark text:attacks text:california text:australia”.
This method clearly gives much better results than the query expansion, and even better
than the baseline.</p>
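      <p>The difference between the two searches can be illustrated with a toy in-memory index. The sketch below is not the Lucene-based system: it assumes a hand-written table GEO_EXPANSIONS in place of WordNet synonym and holonym lookups, single-word terms only, and a score that simply counts matching clauses instead of Lucene's ranking. All function names are hypothetical. Only the two query strings follow the text directly.</p>
      <preformat preformat-type="code"><![CDATA[
```python
# Hand-written synonym/holonym table (assumption): a stand-in for WordNet lookups.
GEO_EXPANSIONS = {
    "monterey": ["california", "usa"],
}


def index_document(doc_id, tokens, index):
    """Add a document to a toy inverted index keyed by (field, term)."""
    for token in tokens:
        term = token.lower()
        index.setdefault(("text", term), set()).add(doc_id)
        if term in GEO_EXPANSIONS:
            # Geographical term: store it, with its synonyms and holonyms,
            # in the separate "geo" field.
            for geo_term in [term] + GEO_EXPANSIONS[term]:
                index.setdefault(("geo", geo_term), set()).add(doc_id)


def build_fielded_query(content_terms, geo_terms, use_geo_index=True):
    """Label content terms with "text:" and place names with "geo:" or "text:"."""
    geo_field = "geo" if use_geo_index else "text"
    clauses = ["text:" + term.lower() for term in content_terms]
    clauses += [geo_field + ":" + term.lower() for term in geo_terms]
    return " ".join(clauses)


def search(query, index):
    """Score each document by its number of matching clauses (not Lucene's scoring)."""
    scores = {}
    for clause in query.split():
        field, term = clause.split(":", 1)
        for doc_id in index.get((field, term), ()):
            scores[doc_id] = scores.get(doc_id, 0) + 1
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```
]]></preformat>
      <p>With one document mentioning shark attacks near Monterey and another mentioning shark attacks near Florida, the query with geo: clauses scores the first document higher, because california was added to its geo field at indexing time. The query with only text: clauses scores both documents equally, since neither contains the word california.</p>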
    </sec>
    <sec id="sec-4">
      <title>Conclusions and Further Work</title>
      <p>
        Our query expansion method had previously been tested only on a set of topics from the TREC-8 collection,
demonstrating that a small improvement could be obtained in recall, but with a deterioration of
the average precision [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The results obtained for our participation in GeoCLEF do not
confirm the previous results. We believe that this is due to the different nature of the searches in
the two exercises; more precisely, in the TREC-8 queries the geographical names usually represent
political entities: “U.S.A.”, “Germany”, “Israel”, for instance, are used to indicate the American,
German or Israeli government (therefore the proposed query expansion method, which added
Washington, Berlin or Jerusalem to the query, proved effective), while in GeoCLEF the geographical
names just represent a location constraint for the user’s information needs. In such a context the
use of WordNet during the indexing phase proved to be more effective, by adding the synonyms and
the holonyms of the encountered geographical entities to each document’s index terms. We plan
to perform more experiments on both the whole GeoCLEF and TREC-8 collections in order to
verify the effectiveness of this indexing method.
      </p>
      <p>[Figure 4: plot not recovered; legend entries: clean system, GH95 corpus; WN-enhanced indexing, GH95 corpus.]</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments and References</title>
      <p>We would like to thank the R2D2 CICYT (TIC2003-07158-C04-03) and ICT EU-India
(ALA/95/23/2003/077054) research projects for partially supporting this work.</p>
      <p>[7] Bo-Yeong, K., Hae-Jung, K., Sang-Jo, L., “Performance Analysis of Semantic Indexing in
Text Retrieval”, CICLing 2004, Lecture Notes in Computer Science, Vol. 2945.
Springer-Verlag, 2004.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Baeza-Yates</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ribeiro-Neto</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , “Modern Information Retrieval”,
          <publisher-name>Addison-Wesley</publisher-name>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Fu</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>C.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abdelmoty</surname>
            ,
            <given-names>A.I.</given-names>
          </string-name>
          , “
          <article-title>Ontology-based Spatial Query Expansion in Information Retrieval”</article-title>
          ,
          <source>ODBASE</source>
          <year>2005</year>
          ,
          accepted
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferretti</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jiménez</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vidal</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          , “
          <article-title>Text Categorization and Information Retrieval Using WordNet Senses”</article-title>
          ,
          <source>CICLing 2004, Lecture Notes in Computer Science</source>
          , Vol.
          <volume>2945</volume>
          . Springer-Verlag,
          <year>2004</year>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Calcagno</surname>
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Buscaldi</surname>
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gómez</surname>
            <given-names>J. M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Masulli</surname>
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rovetta</surname>
            <given-names>S.</given-names>
          </string-name>
          , “
          <article-title>Comparison of Indexing Techniques based on Stems, Synsets, Lemmas and Term Frequency”</article-title>
          . In: Workshop “
          <source>Red Temática en Tecnología del Habla</source>
          ”
          , Valencia, Spain (
          <year>2004</year>
          ), pp.
          <fpage>171</fpage>
          -
          <lpage>176</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Voorhees</surname>
            ,
            <given-names>E.M.</given-names>
          </string-name>
          , “
          <article-title>Query Expansion using lexical-semantic relations”</article-title>
          ,
          <source>ACM SIGIR</source>
          <year>1994</year>
          : pp.
          <fpage>61</fpage>
          -
          <lpage>70</lpage>
          , ACM,
          <year>1994</year>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Croft</surname>
            ,
            <given-names>W.B.</given-names>
          </string-name>
          , “
          <article-title>Query Expansion using Local and Global Document Analysis”</article-title>
          ,
          <source>ACM SIGIR</source>
          <year>1996</year>
          : pp.
          <fpage>4</fpage>
          -
          <lpage>11</lpage>
          , ACM,
          <year>1996</year>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>