<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>University of Glasgow at WebCLEF 2005: Experiments in per-field normalisation and language specific stemming</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Craig Macdonald</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vassilis Plachouras</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ben He</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christina Lioma</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Iadh Ounis</string-name>
          <email>ounis@dcs.gla.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>General Terms</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computing Science</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Measurement</institution>
          ,
          <addr-line>Performance, Experimentation</addr-line>
        </aff>
      </contrib-group>
      <fpage>2</fpage>
      <lpage>6</lpage>
      <abstract>
        <p>We investigated the use of appropriate web retrieval techniques and language specific stemming techniques in a multi-lingual environment. The aim of the University of Glasgow's participation in the mono-lingual task of WebCLEF 2005 was to test the application of Web Information Retrieval (IR) techniques on a collection with many different languages. We based our CLEF 2005 participation on our IR platform, Terrier [6]. Our experiments were focused on the combination of evidence in a Web IR setting as well as the application of appropriate stemming in a multi-lingual environment. The outline of this paper is as follows: Section 2 details the stemming techniques we applied in this work, while Section 3 details our experimental setup. We describe the official runs submitted in Section 4, and finally analyse our results in Section 5. Section 6 provides concluding remarks.</p>
      </abstract>
      <kwd-group>
        <kwd>H.3 [Information Storage and Retrieval]</kwd>
        <kwd>H.3.1 Content Analysis and Indexing</kwd>
        <kwd>H.3.3 Information Search and Retrieval</kwd>
        <kwd>H.3.4 Systems and Software</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Our main research hypothesis for our participation in WebCLEF 2005 was that applying
the correct stemmer to a document and a topic would increase the performance of the search
engine. To test this hypothesis, we created three indices of the EuroGOV collection. In the first
index, the collection was indexed without applying any stemming techniques. In the second index,
the collection was indexed by applying Porter's English stemmer to all documents, regardless of
their domain and language. In the last index, we stemmed each document according to its
language. The language of the document was primarily determined by the
language identification data provided by the WebCLEF organisers. However, as the language
identification data is not precise - often giving multiple language choices for documents - we chose
to supplement this with a few heuristics. For each document, we examined the suggested languages
provided by the WebCLEF organisers, and looked for evidence to support these languages in the
URL of the document, the metadata of the document, and in a list of "normal languages" for each
domain.</p>
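      <p>The following Python sketch illustrates one way such supporting evidence could be combined to settle on a single language. It is not the implementation used in the experiments: the scoring scheme, the tie-breaking rule, the function name resolve_language and the small NORMAL_LANGUAGES table are all illustrative assumptions.</p>
      <preformat preformat-type="code">
```python
from urllib.parse import urlparse

# Hypothetical table of "normal languages" per top-level domain (illustrative only).
NORMAL_LANGUAGES = {"de": ["de"], "fi": ["fi", "sv"], "uk": ["en"]}


def resolve_language(candidates, url, meta_language=None):
    """Pick one language from the suggested candidates using simple supporting evidence."""
    if not candidates:
        return None
    parsed = urlparse(url)
    host = parsed.hostname or ""
    tld = host.rsplit(".", 1)[-1]
    path_parts = [part.lower() for part in parsed.path.split("/") if part]
    scores = {}
    for lang in candidates:
        score = 0
        if lang in path_parts:
            score += 1  # evidence in the URL, e.g. a /de/ path segment
        if meta_language and meta_language.lower().startswith(lang):
            score += 1  # evidence in the document metadata, e.g. "de-DE"
        if lang in NORMAL_LANGUAGES.get(tld, []):
            score += 1  # language is a normal one for this domain
        scores[lang] = score
    # Assumption: ties are broken by the order in which candidates were suggested.
    return max(candidates, key=lambda lang: scores[lang])
```
      </preformat>
      <p>For instance, a document at http://www.example.de/de/index.html with candidates French and German would be resolved to German, since the URL path and the domain both support it.</p>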
      <p>
        Having identified one language for each document, we applied the stemmer for that
language to the terms from that document. We mainly used the Snowball [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] stemmers to stem
the documents. The exceptions to this were: English, where we used Porter's English stemmer;
Icelandic, where we used the Danish Snowball stemmer; Hungarian, where we used Hunstem [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ];
and Greek, where we did not apply any stemmer.
      </p>
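      <p>The per-language stemmer choice described above can be summarised as a small lookup, sketched here in Python. The language codes, the label strings, the set of Snowball languages listed and the decision to leave unknown languages unstemmed are assumptions made for illustration only.</p>
      <preformat preformat-type="code">
```python
# Exceptions to the default Snowball choice, as described in the text.
# Language codes and label strings are illustrative assumptions.
STEMMER_EXCEPTIONS = {
    "en": "porter",           # Porter's English stemmer
    "is": "snowball-da",      # Icelandic handled with the Danish Snowball stemmer
    "hu": "hunstem",          # Hungarian handled with Hunstem
    "el": None,               # Greek: no stemming applied
}

# Assumed set of languages with a Snowball stemmer available.
SNOWBALL_LANGUAGES = {"da", "de", "es", "fi", "fr", "it", "nl", "no", "pt", "ru", "sv"}


def choose_stemmer(lang):
    """Return a stemmer label for a language code, or None for no stemming."""
    if lang in STEMMER_EXCEPTIONS:
        return STEMMER_EXCEPTIONS[lang]
    if lang in SNOWBALL_LANGUAGES:
        return "snowball-" + lang
    return None  # assumption: unknown languages are left unstemmed
```
      </preformat>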
    </sec>
    <sec id="sec-2">
      <title>Terrier Setup</title>
      <p>To support multi-lingual retrieval, it is essential that the IR system accurately and uniquely
represent each term in the corpus. To meet this requirement, we used a version of Terrier that
supports UTF-8 encoding, ensuring that we had a robust representation of the collection.</p>
      <p>During the parsing of the collection, we used heuristics based on the HTTP headers, the META
tags and the language identifier to determine the correct content encoding for each document. Once
the correct encoding for each document was determined, the collection was parsed, each term being
read and converted to its UTF-8 representation.</p>
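      <p>A minimal Python sketch of this kind of encoding heuristic is given below. It only covers the HTTP header and META tag evidence and omits the language identifier; the order of precedence, the fallback charset and the function names are assumptions, not a description of the actual Terrier parser.</p>
      <preformat preformat-type="code">
```python
import re

# Matches charset declarations in a Content-Type header or a META tag.
CHARSET_RE = re.compile(rb"charset=[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE)


def detect_encoding(content_type_header, body, fallback="iso-8859-1"):
    """Return a charset name from the HTTP header, then the META tag, then a fallback."""
    if content_type_header:
        match = CHARSET_RE.search(content_type_header.encode("ascii", "ignore"))
        if match:
            return match.group(1).decode("ascii").lower()
    match = CHARSET_RE.search(body[:2048])  # META tags normally sit near the top
    if match:
        return match.group(1).decode("ascii").lower()
    return fallback


def to_utf8(content_type_header, body):
    """Decode raw bytes with the detected charset and re-encode them as UTF-8."""
    encoding = detect_encoding(content_type_header, body)
    try:
        text = body.decode(encoding, errors="replace")
    except LookupError:  # charset name not known to Python
        text = body.decode("iso-8859-1")
    return text.encode("utf-8")
```
      </preformat>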
      <p>As described above, we applied several stemming combinations to index the EuroGOV
collection. In each index, all terms, including stop words, were indexed, and positional information was
kept for each term in the collection. Three indices were built: one which applied no stemming
during indexing, one which applied only the Porter stemmer, and one which applied the stemmer
deemed appropriate for each document.</p>
      <p>
        We used different sources of evidence (or fields) available in a web corpus: the body of the
document, the title of the document and the anchor text information. We used a new technique
for combining sources of evidence during retrieval, which we call per-field normalisation and which is
an alternative to [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. The weighting model used is a per-field derivative of the following PL2 DFR
model:
      </p>
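      <p>
        <disp-formula>
          <label>(1)</label>
          <tex-math><![CDATA[\mathrm{score}(d,Q) = \sum_{t \in Q} \frac{qtfn}{tfn + 1} \left( tfn \cdot \log_2 \frac{tfn}{\lambda} + (\lambda - tfn) \cdot \log_2 e + 0.5 \cdot \log_2 (2\pi \cdot tfn) \right)]]></tex-math>
        </disp-formula>
where score(d, Q) is the relevance score of a document d for a query Q, t is a query term in Q,
λ is the mean and variance of a Poisson distribution, and qtfn is the normalised query term frequency.</p>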
      <p>
        It is given by qtfn = qtf / qtf<sub>max</sub>, where qtf is the query term frequency and qtf<sub>max</sub> is the maximum
query term frequency among the query terms. The normalised term frequency tfn is given by the
so-called Normalisation 2:
        <disp-formula>
          <label>(2)</label>
          <tex-math><![CDATA[tfn = tf \cdot \log_2 \left( 1 + c \cdot \frac{avg\_l}{l} \right), \quad (c > 0)]]></tex-math>
        </disp-formula>
where l is the document length and avg_l is the average document length in the whole collection.
tf is the original term frequency. c is the free parameter of the normalisation method. The c
parameter values were set automatically using a technique that extends our previous work [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] to
take fields into account.
      </p>
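      <p>To make the PL2 model and Normalisation 2 concrete, the Python sketch below computes the plain, single-field score for one document. It is only an illustration of the two formulas as stated: the per-field derivative used in the runs is not shown, the Poisson parameter is supplied by the caller per term (lam_by_term is a hypothetical input), and the function names and the default value of c are assumptions.</p>
      <preformat preformat-type="code">
```python
import math


def normalisation2(tf, doc_len, avg_doc_len, c):
    """Normalised term frequency: tfn = tf * log2(1 + c * avg_l / l), with c > 0."""
    return tf * math.log2(1.0 + c * avg_doc_len / doc_len)


def pl2_term_score(tfn, lam, qtfn=1.0):
    """Contribution of one query term to score(d, Q) under PL2."""
    if tfn == 0:  # term absent from the document
        return 0.0
    gain = (tfn * math.log2(tfn / lam)
            + (lam - tfn) * math.log2(math.e)
            + 0.5 * math.log2(2.0 * math.pi * tfn))
    return qtfn * gain / (tfn + 1.0)


def pl2_score(query_tf, doc_tf, doc_len, avg_doc_len, lam_by_term, c=1.0):
    """Sum the PL2 contributions of all query terms for one document."""
    qtf_max = max(query_tf.values())
    total = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0)
        tfn = normalisation2(tf, doc_len, avg_doc_len, c)
        total += pl2_term_score(tfn, lam_by_term[term], qtf / qtf_max)
    return total
```
      </preformat>
      <p>With c = 1 and a document of average length, Normalisation 2 leaves the term frequency unchanged, since log2(1 + 1) = 1.</p>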
      <p>
        In our previous work [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], we found that taking the length of the URL of a document into
account is particularly effective in homepage finding tasks. Following this work, we also used
URL length as evidence here.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Runs</title>
      <p>
        We submitted five runs to the mono-lingual task of WebCLEF 2005, four of which used topic
metadata of some form. For all metadata runs, we used the domain topic metadata to limit the URL
domain of the returned results. For example, if the topic stated &lt;domain domain="eu.int"/&gt;,
only results with URLs in the eu.int domain were returned. The official runs we submitted are
detailed below:
      </p>
      <list list-type="bullet">
        <list-item>
          <p><bold>uogSelStem</bold>: This run did not use any metadata. Instead, we used the language identifier
(textcat [<xref ref-type="bibr" rid="ref3 ref8">3, 8</xref>]) to identify the language of each topic. The topic was then stemmed using
the appropriate stemmer for that language. We used the index with all stemmers. This run
tested the accuracy of the language identifier in determining which stemmer to apply to each
topic.</p>
        </list-item>
        <list-item>
          <p><bold>uogNoStemNLP</bold>: This run used only the domain metadata described above. No stemming was
applied to the topics, and no document stemming was used. Additionally, we used a language
processing technique to deal with acronyms. This run tested the retrieval effectiveness of
not applying stemming in this multi-lingual Web IR setting.</p>
        </list-item>
        <list-item>
          <p><bold>uogPorStem</bold>: This run used only the domain metadata described above. Porter's English
stemmer was applied to all topics and the index with Porter's stemmer was used. This run
tested the retrieval effectiveness of applying Porter's stemmer to all languages in the
EuroGOV collection.</p>
        </list-item>
        <list-item>
          <p><bold>uogAllStem</bold>: This run used both the domain metadata described above and the topic
language metadata, which allowed the use of the correct stemmer to stem the topic. We used the
index with all stemmers. This run tested the hypothesis that applying the correct stemmer
to both documents and topics would improve results overall.</p>
        </list-item>
        <list-item>
          <p><bold>uogAllStemNP</bold>: This run is identical to uogAllStem except that term order in the topics
was presumed to be important. We applied a strategy where query terms are
weighted to reflect the order of terms in a query. The underlying hypothesis is that in web
search, the user will typically enter the most important keywords first, then add further
terms to narrow the focus of the search.</p>
        </list-item>
      </list>
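      <p>Two of the ideas used in these runs, restricting results to the topic's domain and weighting query terms by their position, can be sketched in Python as follows. The domain test follows the eu.int example given above. The position weighting is purely hypothetical: the text does not state the weighting function, so the geometric decay and its default value are assumptions chosen only to illustrate the idea that earlier terms count for more.</p>
      <preformat preformat-type="code">
```python
from urllib.parse import urlparse


def in_domain(url, domain):
    """True if the URL's host is the given domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    domain = domain.lower()
    return host == domain or host.endswith("." + domain)


def filter_by_domain(ranked_urls, domain):
    """Keep only results from the topic's domain, preserving rank order."""
    return [url for url in ranked_urls if in_domain(url, domain)]


def order_weights(terms, decay=0.9):
    """Hypothetical position weights: the i-th query term gets decay ** i."""
    return [(term, decay ** i) for i, term in enumerate(terms)]
```
      </preformat>
      <p>Matching on a leading dot avoids accepting hosts that merely end in the same characters, such as noteu.int for the domain eu.int.</p>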
    </sec>
    <sec id="sec-4">
      <title>Results &amp; Analysis</title>
      <p>From initial inspection of the overall performance of the runs, it would appear that the
run which did not apply any stemmers performed best overall. Indeed, the uogNoStemNLP
run gives the best results over the baseline tasks, closely followed by the uogPorStem run.
However, the runs with the correct stemming applied (uogAllStem and uogAllStemNP) have
comparable results to the uogNoStemNLP and uogPorStem runs, even though their performance
on the Hungarian queries is considerably reduced. In particular, the last line of Table 1 shows
the MRR of all runs with all Hungarian topics removed. This shows that stemming makes only a
minor difference: the uogAllStem and uogAllStemNP runs achieve approximately the same
MRR as the uogPorStem run, and are very comparable to the uogNoStemNLP run.
It would appear that Porter's English stemmer is more effective than either no stemming
or the appropriate Snowball stemmer for Dutch and Russian. English and Portuguese topics
give the best performance without any stemming applied.</p>
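      <p>For reference, the mean reciprocal rank (MRR) used in this comparison can be computed as in the Python sketch below, including the variant that leaves out a set of topics (as done for the Hungarian topics). The data layout, with runs and relevance judgements held in dictionaries keyed by topic identifier, and the function names are assumptions for illustration.</p>
      <preformat preformat-type="code">
```python
def reciprocal_rank(ranked_docs, relevant_docs):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant_docs:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs, qrels, skip_topics=()):
    """Average reciprocal rank over topics, optionally leaving some topics out."""
    topics = [topic for topic in qrels if topic not in skip_topics]
    if not topics:
        return 0.0
    total = sum(reciprocal_rank(runs.get(topic, []), qrels[topic]) for topic in topics)
    return total / len(topics)
```
      </preformat>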
      <p>By comparing the uogAllStem and the uogSelStem runs, we can see that the
language identifier is deficient at recognising the language of topics in many languages, and
poor at Greek and Danish in particular. For Hungarian topics, when the language identifier
misclassifies the language of the topics, performance is actually improved. In the final
version of this paper, we will assess the accuracy of the language classification for the topics,
to detect correlations between correct classification and increased performance.
Query term ordering showed little overall difference in retrieval performance when applied.
Its application was particularly effective for the Greek topics (0.3586 to 0.4003),
but showed very little positive or negative change for most languages.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>We experimented with the correct application of stemmers in a multi-lingual
setting. We found that applying the correct stemmer for the language of the document
and topic worked in most cases; however, if the language classification of the documents was incorrect,
retrieval performance could be harmed. The bare-system approach of applying no stemming at
all is a very safe and stable option, where the results will not be very different from those produced
by the best approach for that language.</p>
      <p>While it is clear that stemming with respect to a language can assist in retrieval performance,
determining the language of a topic or document is still an active research problem for reliable
application in a multi-lingual document collection.</p>
      <p>
        For this paper, we investigated the average topic length, in particular for the German, Spanish
and English topic sets, and found these to be 3.3, 6.3 and 5.7 terms respectively. In contrast,
a recent study by Jansen &amp; Spink [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] found an average length of 1.9, 2.6 and 5.0 terms for
German, Spanish and English queries, suggesting that the topics used in WebCLEF 2005 were not
representative of real European user queries on a multi-lingual collection. For a future WebCLEF
participation, we would like to see queries gathered from commercial European search engines, as
these are definitive of real user needs.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          Hunspell &amp; Hunstem: Hungarian version of Ispell, http://magyarispell.sourceforge.net/.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          Snowball stemmers, http://snowball.tartarus.org/.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>W.B.</given-names>
            <surname>Cavnar</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.M.</given-names>
            <surname>Trenkle</surname>
          </string-name>
          .
          <article-title>N-gram-based text categorization</article-title>
          .
          <source>In Proceedings of SDAIR94, 3rd Annual Symposium on Document Analysis and Information Retrieval</source>
          , pages
          <fpage>161</fpage>-<lpage>175</lpage>, Las Vegas, US,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B.</given-names>
            <surname>He</surname>
          </string-name>
          and
          <string-name>
            <given-names>I.</given-names>
            <surname>Ounis</surname>
          </string-name>
          .
          <article-title>A study of the Dirichlet Priors for term frequency normalisation</article-title>
          .
          <source>In SIGIR '05: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval</source>
          , pages
          <fpage>465</fpage>-<lpage>471</lpage>, New York, NY, USA,
          <year>2005</year>
          . ACM Press.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B.J.</given-names>
            <surname>Jansen</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Spink</surname>
          </string-name>
          .
          <article-title>An analysis of Web searching by European AlltheWeb.com users</article-title>
          .
          <source>Inf. Process. Manage.</source>,
          <volume>41</volume>(<issue>2</issue>):<fpage>361</fpage>-<lpage>381</lpage>,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>I.</given-names>
            <surname>Ounis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Amati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Plachouras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Macdonald</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Johnson</surname>
          </string-name>
          .
          <article-title>Terrier Information Retrieval Platform</article-title>
          .
          <source>In Proceedings of the 27th European Conference on IR Research (ECIR 2005)</source>
          , volume
          <volume>3408</volume>
          of Lecture Notes in Computer Science, pages
          <fpage>517</fpage>-<lpage>519</lpage>. Springer,
          <year>2005</year>
          . URL: http://ir.dcs.gla.ac.uk/terrier/
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>V.</given-names>
            <surname>Plachouras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>He</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I.</given-names>
            <surname>Ounis</surname>
          </string-name>
          .
          <article-title>University of Glasgow at TREC2004: Experiments in Web, Robust and Terabyte tracks with Terrier</article-title>
          .
          <source>In Notebook of 13th Text REtrieval Conference (TREC2004)</source>
          , NIST, MD, USA, October
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>G.</given-names>
            <surname>van Noord</surname>
          </string-name>
          .
          <article-title>Textcat language guesser</article-title>
          , http://odur.let.rug.nl/~vannoord/textcat/.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zaragoza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Craswell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Taylor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Saria</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Robertson</surname>
          </string-name>
          .
          <article-title>Microsoft Cambridge at TREC-13: Web and HARD tracks</article-title>
          .
          <source>In Notebook of 13th Text REtrieval Conference (TREC2004)</source>
          , NIST, MD, USA, October
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>