<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Domain-Specific Russian Retrieval: A Baseline Approach</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Fredric C. Gey</string-name>
          <email>gey@berkeley.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>UC Data Archive &amp; Technical Assistance (UC DATA), University of California</institution>
          ,
          <addr-line>Berkeley, CA 94720-5100</addr-line>
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Berkeley group 2 chose to perform some very straightforward experiments in retrieval of Russian documents using queries derived from topics in all three languages. Thus we performed two runs with monolingual Russian retrieval and one cross-lingual run each with German topics and English topics. Query translation was done using the online PROMT translator (www.translate.ru). Monolingual results were substantially better than the overall median performance of all Russian runs, and cross-language results were encouraging, with German→Russian retrieval doing substantially better than English→Russian.</p>
      </abstract>
      <kwd-group>
        <kwd>Cross-language information retrieval</kwd>
        <kwd>Russian retrieval</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Domain-specific retrieval has been a track in CLEF since the beginning with the GIRT collections [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. For the
CLEF 2005 campaign, the domain-specific track included a Russian social science abstract collection, and the topics
were available in German, English and Russian for experiments with all DS collections. Berkeley group 2
performed some very straightforward experiments in retrieval of Russian documents using queries derived from
topics in all three languages. Thus we performed two runs with monolingual Russian retrieval and one
cross-lingual run each with German topics and English topics. Query translation was done using the online PROMT
translator (www.translate.ru), which prior experience had shown to produce more useful translations than the
SYSTRAN translation system (http://babelfish.altavista.com).
      </p>
      <p>
        The Russian collection of the CLEF 2005 domain-specific track consists of 94,581 documents containing titles (for
all documents) and abstracts (for 47,130 documents, or 50% of the collection). Unfortunately for this collection, only
12% of the collection (11,403 documents) have controlled-vocabulary thesaurus terms assigned. The GIRT
thesaurus terms are assigned from the Thesaurus for the Social Sciences [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] which has been made available in
German, English and Russian.
In all its CLEF submissions, the Berkeley 2 group used a document ranking algorithm based on logistic
regression first used in the TREC-2 conference [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. For all runs, we used a 256 word Russian stopword list
developed for the CLEF 2003 Izvestia collection to remove very common words [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. For stemming we used
the Russian SNOWBALL Stemmer available from www.tartarus.org/snowball. As a general procedure, we also
use Aitao Chen’s blind feedback algorithm [
        <xref ref-type="bibr" rid="ref2 ref3">2,3</xref>
        ] in every run. It selects the top 30 ranked terms from the top 20
ranked documents returned by the initial search and merges them with the original query. Thus the sequence of processing for
retrieval is: query → stopword removal → (decompounding) → stemming → ranking → blind feedback.
The only translation performed was query translation from English and German into Russian using the PROMT
translator found online at www.translate.ru.
      </p>
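      <p>As a rough illustration, the processing sequence above, including the blind-feedback step that merges the top 30 terms from the top 20 initially retrieved documents into the query, can be sketched as follows. The toy documents, the placeholder stemmer, and the simple overlap score are invented stand-ins; the actual system used Aitao Chen's logistic-regression ranking and the Russian Snowball stemmer.</p>

```python
from collections import Counter

# Toy collection; the real runs used the 94,581-document CLEF collection.
docs = {
    1: "отказ курение здоровье",
    2: "курение табак запрет",
    3: "телевидение поведение время",
}

stopwords = {"и", "в", "не", "от"}  # stand-in for the 256-word Izvestia list

def preprocess(text):
    # stopword removal followed by stemming; lowercasing here is only a
    # placeholder for the Russian Snowball stemmer
    return [t.lower() for t in text.split() if t.lower() not in stopwords]

def score(query_terms, doc_terms):
    # crude term-overlap count standing in for logistic-regression ranking
    return sum(doc_terms.count(q) for q in query_terms)

def blind_feedback(query, docs, n_docs=20, n_terms=30):
    q = preprocess(query)
    # initial search: rank all documents against the query
    ranked = sorted(docs, key=lambda d: score(q, preprocess(docs[d])), reverse=True)
    # pool terms from the top-ranked documents and keep the most frequent
    pool = Counter()
    for d in ranked[:n_docs]:
        pool.update(preprocess(docs[d]))
    expansion = [t for t, _ in pool.most_common(n_terms)]
    # merge the expansion terms with the original query
    return q + [t for t in expansion if t not in q]

expanded = blind_feedback("отказ от курения", docs)
```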
    </sec>
    <sec id="sec-2">
      <title>3 Runs and Results</title>
      <p>Our results are summarized by topic in the following table with comparison to overall precision. The highlighted
columns are the median performances for monolingual and cross-language IR, while the final row is precision
averaged over all 25 topics.
The first monolingual Russian run (BK2MLRU1) and the two bilingual runs (BK2BLER1, BK2BLER2) were
made using the required Title and Description (T-D) fields. The second monolingual run (BK2MLRU2) used
the Title, Description and Narrative (T-D-N) fields. The T-D run (BK2MLRU1) achieved an overall mean average
precision of 0.304, with 9 best-of-topic results out of the 25 topics. Interestingly, the T-D run performed 30
percent better than the T-D-N monolingual run (BK2MLRU2), which had an average precision of only 0.235.
We speculate that this is because over half the documents in the collection have only a &lt;TITLE&gt; field and no
&lt;TEXT&gt; field. Topic 150, Поведение во время телепередач (Television Behaviour), retrieved zero relevant
documents in all DS monolingual runs, while bilingual runs against the Russian collection found only two relevant
documents, with a best average precision of 0.0178.</p>
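      <p>Mean average precision, the figure reported for each run above, is the arithmetic mean over topics of per-topic average precision. A minimal sketch of its computation (the ranked lists and relevance sets below are invented for illustration):</p>

```python
def average_precision(ranked, relevant):
    # AP: average of the precision values at the ranks where relevant
    # documents appear, divided by the total number of relevant documents
    hits, precisions = 0, []
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(topic_runs):
    # MAP: mean of average precision over all topics
    aps = [average_precision(ranked, rel) for ranked, rel in topic_runs]
    return sum(aps) / len(aps)

# one topic where relevant docs 1 and 3 are retrieved at ranks 1 and 3
ap = average_precision([1, 2, 3], {1, 3})  # (1/1 + 2/3) / 2
```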
      <p>
        The German-Russian bilingual run BK2BLGR1 (MAP of 0.233) performed twenty-nine percent better than the
English-Russian run BK2BLER1 (MAP of 0.181). Much of this difference can be attributed to topic 143, Отказ
от курения (Giving up Smoking), where the German translation seems to have been more accurate than the
English one. The German→Russian precision for topic 143 was 1.0, while the English→Russian precision was 0.0094.
We believe we achieved our goal of providing a baseline performance for the Russian domain-specific
collection of CLEF. Our results provide a foundation from which more sophisticated experiments
can be developed that leverage the controlled-vocabulary indexing of the CLEF DS collections. For the future
of CLEF domain-specific Russian retrieval to be interesting and successful, substantially more documents will need to
have indexing keywords assigned; 12 percent is simply not enough to perform meaningful
experiments on the utility of controlled vocabulary. In addition, document abstracts provide a richer set of
textual clues from which to mine associations to controlled-vocabulary terms, as the work by Petras shows [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
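      <p>In its simplest form, mining such associations between abstract text and controlled-vocabulary terms could look like the following co-occurrence sketch. The documents and descriptors are invented, and this naive counting is only an illustration of the idea, not the actual method of [5]:</p>

```python
from collections import defaultdict

# Invented toy data: (abstract text, assigned thesaurus descriptors)
docs = [
    ("курение табак здоровье", {"SMOKING"}),
    ("курение запрет реклама", {"SMOKING", "ADVERTISING"}),
    ("телевидение поведение", {"TELEVISION"}),
]

assoc = defaultdict(lambda: defaultdict(int))
for text, descriptors in docs:
    for term in text.split():
        for d in descriptors:
            assoc[term][d] += 1  # count term/descriptor co-occurrences

def best_descriptor(term):
    # map a free-text term to its most frequently co-occurring descriptor
    candidates = assoc.get(term)
    return max(candidates, key=candidates.get) if candidates else None
```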
    </sec>
    <sec id="sec-3">
      <title>5 Acknowledgement</title>
      <p>Thanks to Aitao Chen for implementing and permitting the use of the logistic regression formula for probabilistic
information retrieval as well as German decompounding and blind feedback in his MULIR retrieval system.
Thanks also to Vivien Petras who performed the actual indexing and running of the experiments.</p>
    </sec>
    <sec id="sec-4">
      <title>6 References</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>A</given-names>
          </string-name>
          , W. Cooper, and
          <string-name>
            <given-names>F</given-names>
            <surname>Gey</surname>
          </string-name>
          .
          <year>1994</year>
          .
          <article-title>Full text retrieval based on probabilistic equations with coefficients fitted by logistic regression</article-title>
          .
          <source>In The Second Text Retrieval Conference (TREC-2)</source>
          , edited by D. K. Harman.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Chen</surname>
          </string-name>
          , Aitao.
          <year>2003</year>
          .
          <article-title>Cross-Language Retrieval Experiments at CLEF 2002</article-title>
          . Lecture Notes in Computer Science 2785, Springer
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>A</given-names>
          </string-name>
          , and
          <string-name>
            <given-names>F</given-names>
            <surname>Gey</surname>
          </string-name>
          .
          <year>2004</year>
          .
          <article-title>Multilingual Information Retrieval Using Machine Translation, Relevance Feedback and Decompounding</article-title>
          .
          <source>Information Retrieval</source>
          <volume>7</volume>
          (
          <issue>1</issue>
          -2):
          <fpage>149</fpage>
          -
          <lpage>182</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Kluck</surname>
          </string-name>
          , Michael.
          <year>2003</year>
          .
          <article-title>The GIRT Data in the Evaluation of CLIR Systems - from 1997 Until 2003</article-title>
          .
          <source>In Comparative Evaluation of Multilingual Information Access Systems: 4th Workshop of the Cross-Language Evaluation Forum, CLEF 2003</source>
          , edited by C. A.
          <string-name>
            <surname>Peters</surname>
          </string-name>
          . Trondheim, Norway, August 21-22,
          <year>2003</year>
          : Lecture Notes in Computer Science 3237, Springer
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Petras</surname>
            ,
            <given-names>V</given-names>
          </string-name>
          ,
          <article-title>How One Word Can Make all the Difference - Using Subject Metadata for Automatic Query Expansion and Reformulation</article-title>
          , in this volume,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Petras</surname>
            ,
            <given-names>V</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>N</given-names>
            <surname>Perelman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F</given-names>
            <surname>Gey</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>UC Berkeley at CLEF 2003 - Russian Language Experiments and Domain-Specific Cross-Language Retrieval</article-title>
          .
          <source>In Comparative Evaluation of Multilingual Information Access Systems: 4th Workshop of the Cross-Language Evaluation Forum, CLEF 2003</source>
          . Trondheim, Norway, August 21-22,
          <year>2003</year>
          : Lecture Notes in Computer Science 3237, Springer
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Schott</surname>
          </string-name>
          , Hannelore.
          <year>2000</year>
          .
          <article-title>Thesaurus for the Social Sciences. 2 vols</article-title>
          . Vol. 1: German - English; Vol. 2: English - German. Bonn:
          <institution>Informationszentrum Sozialwissenschaften</institution>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>