<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multi-lingual Geographical Information Retrieval</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rocio Guillen</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>California State University San Marcos</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper reports on the results of our experiments in the Monolingual English, German and Portuguese tasks and in the Bilingual tasks with German topics on English collections, English topics on German collections, and English topics on Portuguese collections. Seven runs were submitted as official runs, four for the monolingual task and three for the bilingual task. We used the Terrier (TERabyte RetRIEveR) Information Retrieval Platform version 2.1 to index and query the collections. Experiments were performed for both tasks using the Inverse Document Frequency model with Laplace after-effect and normalization 2. Topics were processed automatically and the only fields considered were the title and the description. For one experiment with the Portuguese collection, only the title field was used. The stopword list provided by Terrier was used to index all the collections. Results for both the monolingual and bilingual tasks were low in terms of precision and recall, mainly for the following reasons: 1) no manual processing was done; 2) no query expansion based on automated relevance feedback was added; 3) no experiments including the narrative field were run; 4) no terms were translated for the bilingual task; 5) no German or Portuguese stopword lists were used in place of the default stopword list; and 6) no pre-processing or removal of diacritic marks was performed. We are running new experiments to address some of the aforementioned issues and determine the impact they have on retrieval performance.</p>
      </abstract>
      <kwd-group>
        <kwd>H.3 [Information Storage and Retrieval]</kwd>
        <kwd>H.3.1 Content Analysis and Indexing</kwd>
        <kwd>Linguistic Processing</kwd>
        <kwd>H.3.3 Information Search and Retrieval</kwd>
      </kwd-group>
      <kwd-group kwd-group-type="general-terms">
        <title>General Terms</title>
        <kwd>Measurement</kwd>
        <kwd>Performance</kwd>
        <kwd>Experimentation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Research efforts on GIR are addressing issues such as access to multilingual documents, techniques
for information mining (i.e., extraction, exploration and visualization of geo-referenced
information), investigation of spatial representations and ranking methods for different representations,
application of machine learning techniques for place name recognition, and development of datasets
containing annotated geographic entities, among others [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Other researchers are exploring the
usage of the World Wide Web as the largest collection of geospatial data.
      </p>
      <p>The focus of one of the tasks was on experimenting with and evaluating the performance of GIR
systems when topics include geographic references. Collections of documents and topics in different
languages were available to carry out monolingual and bilingual experiments. We ran monolingual
experiments in English, German, and Portuguese; for bilingual retrieval, we worked with topics in
German and English and collections in English, German and Portuguese.</p>
      <p>In this paper we describe experiments in the cross-language monolingual and bilingual tasks. We
used the Terrier Information Retrieval (IR) platform version 2.1 to run our experiments. This
platform has performed successfully in monolingual information retrieval tasks in CLEF and TREC.
The paper is organized as follows. In Section 2 we present our work in the monolingual task,
including an overview of Terrier. Section 3 describes our setting and experiments in the bilingual
task. Finally, we present conclusions and current work in Section 4.</p>
    </sec>
    <sec id="sec-1a">
      <title>2 Cross-lingual Geographical IR Task</title>
      <p>In this section we present Terrier (TERabyte RetRIEveR), the information retrieval (IR) platform
used in all the experiments. Then we describe experiments and results for monolingual GIR in
English, German, and Portuguese. The final subsection includes the experiments and results for
bilingual GIR with topics in German and English.</p>
      <p>
        Terrier is a high-performance, scalable search engine platform for the rapid development of
large-scale retrieval applications. It offers a variety of IR models based on the Divergence from
Randomness (DFR) framework ([
        <xref ref-type="bibr" rid="ref4">4</xref>
        ],[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]) and supports classic retrieval models like the Ponte-Croft
language model ([
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]).
      </p>
      <p>The components of the DFR models are the following: 1) a randomness model; 2) an information
gain model; and 3) a term frequency normalization model. The last component adjusts the
frequency of a term in a document based on the length of the document and the average document
length in the entire collection. For example, the Normalization 2 term frequency normalization
model assumes a decreasing density function of the normalized term frequency with respect to the
document length.</p>
      <p>
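        The normalized term frequency tfn is calculated as follows:
        <disp-formula><tex-math><![CDATA[\mathit{tfn} = \mathit{tf} \cdot \log_2\left(1 + c \cdot \frac{\mathit{avg\_len}}{\mathit{len}}\right)]]></tex-math></disp-formula>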
where tf is the term frequency, avg_len is the average document length in the collection, len is the
document length, and c is a hyper-parameter. We used c = 1.5, which is the default value, for short
queries and c = 3.0 for long queries. Short queries in our context are those which use only the
topic title field and the topic description field. We chose these values based on the results
of the experiments on tuning the BM25 and DFR models done by He and Ounis [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. They carried
out experiments for TREC (Text REtrieval Conference) with three types of queries depending on
the different fields included in the topics given. Queries were defined as follows: 1) short queries
are those where the title and the description fields are used; and 2) long queries are those where
title, description and narrative are used.
      </p>
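      <p>The Normalization 2 formula above can be illustrated with a few lines of Python. This is only a minimal sketch and not Terrier's implementation: the function name and argument names are hypothetical, and the default c = 1.5 simply mirrors the value reported in the text for short queries.</p>
      <preformat><![CDATA[
```python
import math


def normalization2_tfn(tf: float, doc_len: float, avg_len: float, c: float = 1.5) -> float:
    """Normalised term frequency: tfn = tf * log2(1 + c * avg_len / doc_len).

    Hypothetical helper for illustration; the names are assumptions.
    """
    if doc_len <= 0 or avg_len <= 0:
        raise ValueError("document lengths must be positive")
    return tf * math.log2(1.0 + c * avg_len / doc_len)


if __name__ == "__main__":
    # With c = 1.0 and a document of exactly average length, log2(1 + 1) = 1,
    # so the raw term frequency is left unchanged.
    print(normalization2_tfn(3, doc_len=100, avg_len=100, c=1.0))  # 3.0
    # The same raw frequency counts for less in a longer document.
    print(normalization2_tfn(3, doc_len=200, avg_len=100))
    print(normalization2_tfn(3, doc_len=50, avg_len=100))
```
]]></preformat>
      <p>A larger c weakens the length penalty for long documents, which is one way to read the choice of a larger value for longer queries.</p>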
      <p>Each query term in a document is assigned a weight depending on how important the term is to the
document. Term weights are then used to match documents to a query. Documents are ranked
according to their estimated relevance to the query. The formula to estimate the probability of
producing the query for a given document is the sum of the probability of producing the terms in
the query plus the probability of not producing other terms.</p>
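      <p>The term weighting and ranking steps described above can be sketched for the Inverse Document Frequency model with Laplace after-effect and normalization 2 (InL2) used in this work. The paper does not spell out the scoring formula, so the form below (normalised term frequency, multiplied by the after-effect factor 1 / (tfn + 1) and by a log2 inverse document frequency) is an assumption based on the commonly cited definition of InL2. The function names and the toy collection are hypothetical.</p>
      <preformat><![CDATA[
```python
import math
from collections import Counter


def inl2_term_score(tf, doc_len, avg_len, n_docs, doc_freq, qtf=1.0, c=1.5):
    """Score of one query term in one document under the assumed InL2 form."""
    if tf <= 0:
        return 0.0  # a term absent from the document contributes nothing
    # Normalization 2: length-normalised term frequency.
    tfn = tf * math.log2(1.0 + c * avg_len / doc_len)
    # Inverse document frequency component (the randomness model).
    idf = math.log2((n_docs + 1.0) / (doc_freq + 0.5))
    # Laplace after-effect: the factor 1 / (tfn + 1) scales the information gain.
    return qtf * tfn / (tfn + 1.0) * idf


def rank(query_terms, docs, c=1.5):
    """Rank documents (doc id -> token list) by summed term scores, best first."""
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs
    counts = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}
    # Number of documents in which each term occurs at least once.
    doc_freq = Counter(term for cnt in counts.values() for term in cnt)
    query_tf = Counter(query_terms)
    scores = {
        doc_id: sum(
            inl2_term_score(cnt[t], len(docs[doc_id]), avg_len, n_docs, doc_freq[t], q, c)
            for t, q in query_tf.items()
        )
        for doc_id, cnt in counts.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy, already tokenised collection (made up for illustration only).
    toy_docs = {
        "d1": ["flood", "europe", "river"],
        "d2": ["wine", "france"],
        "d3": ["flood", "flood", "asia", "monsoon"],
    }
    for doc_id, score in rank(["flood", "europe"], toy_docs):
        print(doc_id, round(score, 3))
```
]]></preformat>
      <p>In this toy collection the document that matches both query terms is ranked first, and a document matching neither term receives a score of zero.</p>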
    </sec>
    <sec id="sec-2">
      <p>Both indexing and querying of the documents in English, German, and Portuguese were done with Terrier using the InL2 term weighting model. This model is the Inverse Document Frequency model with Laplace after-effect and normalization 2. The InL2 model has been used successfully in past experiments at GeoCLEF2005, GeoCLEF2006 and GeoCLEF2007 [<xref ref-type="bibr" rid="ref6 ref7 ref8">6, 7, 8</xref>].</p>
    </sec>
    <sec id="sec-3">
      <sec id="sec-3-1">
        <title>2.1 Data</title>
      </sec>
    </sec>
    <sec id="sec-4">
      <p>The document collections indexed were the LA Times (American) 1994 and the Glasgow Herald (British) 1995 for English; publico94, publico95, folha94 and folha95 for Portuguese; and
der spiegel, frankfurter and fr rundschau for German. There were 25 topics for each of the
languages tested. Documents and topics were processed using the English stopword list and the Porter stemmer provided by Terrier. No stopword lists for German and Portuguese were used.</p>
    </sec>
    <sec id="sec-5">
      <sec id="sec-5-1">
        <title>2.2 Experimental Results for Monolingual Task</title>
      </sec>
    </sec>
    <sec id="sec-6">
      <p>We submitted 1 run for English, 1 run for German, and 2 runs for Portuguese. Queries were automatically constructed for all the runs. Results for the monolingual task in English, German and Portuguese are shown in Table 1, Table 2 and Table 3, respectively.</p>
      <p>Table 1: English Monolingual Retrieval Performance. Column headings: Run Id, Topic Fields, MAP, Recall Prec., Mean Rel. Ret. Row: monen1; title, desc.; 0.16; 18.4.</p>
      <p>Table 2: German Monolingual Retrieval Performance. Column headings: Run Id, Topic Fields, MAP, Recall Prec., Mean Rel. Ret. Row: monde1; title, desc.; 0.22; 25.12.</p>
    </sec>
    <sec id="sec-19">
      <p>Three runs were submitted as official runs for the GeoCLEF2008 bilingual task. In Table 4 we report the results of runs with topics in German and documents in English (de2en) and of runs with English topics and documents in German (en2de) and Portuguese (en2pt).</p>
      <p>Unlike the monolingual runs and the Spanish→English run, relevance feedback did not improve
retrieval performance. No querying was done with the language model option.</p>
    </sec>
    <sec id="sec-20">
      <title>4 Conclusions</title>
      <p>
        In this paper we presented work on monolingual and bilingual geographical information retrieval.
We used Terrier to run our experiments using the InL2 parameter-based model. Comparing our results
with those obtained in the past three years (see [
        <xref ref-type="bibr" rid="ref6 ref7 ref8">6, 7, 8</xref>
        ]) shows that precision and recall are likely
affected by the following factors: 1) not carrying out manual processing; 2) excluding query
expansion; 3) not including the narrative field content to generate the query; 4) leaving out the
translation module for the bilingual task; and 5) not removing diacritic marks in the collection and
the topics. We are running more experiments to determine the impact each of the above factors
has on retrieval performance.
      </p>
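      <p>As an illustration of the last factor, one possible way to remove diacritic marks from collections and topics before indexing is Unicode decomposition. This is a minimal sketch under stated assumptions, not the pre-processing used or planned in this work: the function name is hypothetical, and simply dropping the marks (for example mapping ö to o rather than to oe) is only one of several conventions.</p>
      <preformat><![CDATA[
```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks, e.g. turn 'S\u00e3o' into 'Sao' (hypothetical helper)."""
    # NFD splits each accented character into a base character plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only the characters that are not combining marks.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose whatever remains into the usual composed form.
    return unicodedata.normalize("NFC", stripped)


if __name__ == "__main__":
    print(strip_diacritics("S\u00e3o Paulo"))  # Sao Paulo
    print(strip_diacritics("K\u00f6ln"))       # Koln
```
]]></preformat>
      <p>Applying the same function to both documents and topics keeps place names such as São Paulo and Sao Paulo matching each other at query time.</p>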
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>C.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Purves</surname>
            ,
            <given-names>R.S.</given-names>
          </string-name>
          :
          <article-title>Geographical information retrieval</article-title>
          .
          <source>International Journal of Geographical Information Science</source>
          , Vol.
          <volume>22</volume>
          (
          <issue>3</issue>
          ), pp.
          <fpage>219</fpage>
          -
          <lpage>228</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>He</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ounis</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>A study of parameter tuning for the frequency normalization</article-title>
          .
          <source>Proceedings of the twelfth international conference on Information and knowledge management</source>
          , New Orleans, LA, USA,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Ponte</surname>
            ,
            <given-names>J.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Croft</surname>
            ,
            <given-names>W.B.</given-names>
          </string-name>
          :
          <article-title>A Language Modeling Approach to Information Retrieval</article-title>
          .
          <source>SIGIR'98</source>
          , Melbourne, Australia,
          <year>1998</year>
          , pp.
          <fpage>275</fpage>
          -
          <lpage>281</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Amati</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van Rijsbergen</surname>
            ,
            <given-names>C.J.</given-names>
          </string-name>
          :
          <article-title>Probabilistic Models of Information Retrieval Based on Measuring the Divergence from Randomness</article-title>
          .
          <source>ACM Transactions on Information Systems</source>
          , Vol.
          <volume>20</volume>
          (
          <issue>4</issue>
          ), pp.
          <fpage>357</fpage>
          -
          <lpage>389</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Ounis</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lioma</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Macdonald</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Plachouras</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Research Directions in Terrier: A Search Engine for Advanced Retrieval on the Web</article-title>
          .
          <source>UPGRADE, The European Journal for the Informatics Professional</source>
          , at http://www.upgrade-cepis.org, Vol.
          <volume>VII</volume>
          (
          <issue>1</issue>
          ), February
          <year>2007</year>
          , pp.
          <fpage>49</fpage>
          -
          <lpage>56</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Guillen</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>CSUSM Experiments at GeoCLEF2005</article-title>
          . In: Peters, C., Gey, F., Gonzalo, J., Mueller, H., Jones, G., Kluck, M., Magnini, B., de Rijke, M. (Eds.):
          <source>6th Workshop of the Cross-Language Evaluation Forum, CLEF 2005, Revised Selected Papers</source>
          , Vienna, Austria,
          <year>2005</year>
          . Lecture Notes in Computer Science, vol.
          <volume>4022</volume>
          . Springer-Verlag.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Guillen</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Monolingual and Bilingual Experiments in GeoCLEF2006</article-title>
          . In: Peters, C., Clough, P., Gey, F.C., Karlgren, J., Magnini, B., Oard, D.W., de Rijke, M., Stempfhuber, M. (Eds.):
          <source>Evaluation of Multilingual and Multi-modal Information Retrieval, Cross-Language Information Forum, CLEF 2006, Revised Selected Papers</source>
          ,
          <year>2006</year>
          . Lecture Notes in Computer Science, vol.
          <volume>4730</volume>
          . Springer-Verlag.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Guillen</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>GeoCLEF2007 Experiments in Query Parsing and Cross-language GIR</article-title>
          .
          <source>CLEF 2007 Working Notes</source>
          . Alessandro Nardi and Carol Peters (Eds.). ISSN per Working Notes and CD: 1818-8044. ISBN Abstracts: 2-912335-31-0.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>