<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>TCD-DCU at TEL@CLEF 2009: Document Expansion, Query Translation and Language Modeling</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Johannes Leveling</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dong Zhou</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gareth F. Jones</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vincent Wade</string-name>
          <email>vincent.wade@cs.tcd.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>General Terms</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Dublin, Ireland</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Centre for Next Generation Localisation</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Experimentation</institution>
          ,
          <addr-line>Measurement, Performance</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Trinity College Dublin and Dublin City University participated jointly in the multilingual ad-hoc document retrieval track (TEL@CLEF) at the Cross-Language Evaluation Forum (CLEF). Our retrieval experiments focus on i) investigating document expansion using an entry vocabulary module, ii) translating queries with Google translate and a statistical MT system, and iii) investigating language modeling as a retrieval method. The major results are that the document expansion approach did not increase mean average precision (MAP); topic translation using the statistical MT system achieved about 70% of the MAP obtained with Google translate; and language modeling performs as well as or better than BM25. The bilingual French-to-English and German-to-English experiments obtained 89% and 90%, respectively, of the best MAP for monolingual English.</p>
      </abstract>
      <kwd-group>
        <kwd>H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing</kwd>
        <kwd>Indexing methods</kwd>
        <kwd>Linguistic processing</kwd>
        <kwd>H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval</kwd>
        <kwd>Query formulation</kwd>
        <kwd>Search process</kwd>
        <kwd>H.3.4 [Information Storage and Retrieval]: Systems and Software</kwd>
        <kwd>Performance evaluation (efficiency and effectiveness)</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The TEL (The European Library) task at CLEF is concerned with ad-hoc information retrieval (IR) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
The TEL document subcollections in English, German, and French consist of about 1 million bibliographic
records. The data is provided by the archives of the British Library (English), the Austrian National Library
(German), and the Bibliothèque nationale de France (French) of The European Library. TEL documents follow
the Dublin Core metadata standard and contain multiple fields including title, contributors, language, and
subject terms.
      </p>
      <p>
        Our IR experiments for the ad-hoc task at CLEF 2009 investigate three aspects of retrieval:
(1) employing and evaluating an Entry Vocabulary Module (EVM) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] for document expansion (DE) to obtain longer documents for the
TEL collection (see [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] for a comparison of query and document expansion); (2) applying a statistical machine translation (MT)
system [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] for topic translation and comparing it to Google translate; and (3) comparing language modeling
(LM) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] as a retrieval method to Okapi BM25 [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>Retrieval Experiments</title>
      <sec id="sec-2-1">
        <title>Topic Processing</title>
        <p>
          The Lemur toolkit<sup>1</sup> was employed to index and retrieve documents. Two different retrieval models were
employed: BM25 [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] with default parameters (b = 1.2, k1 = 2.0, k3 = 7) and language modeling with
Jelinek-Mercer smoothing [
          <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
          ]. The text of different fields was extracted and processed to produce a
single flat index. The following field sets were used:
          <list list-type="simple">
            <list-item><p>all: all fields</p></list-item>
            <list-item><p>set1: dc:title, dc:description, dcterms:alternative, and dc:subject</p></list-item>
            <list-item><p>set2: dc:language, dc:identifier, dc:rights, dc:type, dc:creator, dc:publisher, dc:date, dc:relation, dc:contributor, dcterms:issued, dcterms:extent, mods:location</p></list-item>
            <list-item><p>set3: dc:language, dc:identifier, dc:rights, dc:type, dc:creator, dc:publisher, dc:date, dc:contributor, mods:location</p></list-item>
            <list-item><p>set4: dc:language, dc:identifier, dc:rights, dc:type, dc:creator, dc:publisher, dc:date, dc:contributor, dcterms:spatial, dcterms:isPartOf, dcterms:edition, dcterms:issued, dcterms:available, mods:location</p></list-item>
          </list>
All other document fields were discarded. Prior to indexing, the document contents were
preprocessed with the Snowball stemmer<sup>2</sup> for the corresponding language, and stopwords were removed.
        </p>
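        <p>As an illustration of the language modeling approach with Jelinek-Mercer smoothing, the following minimal Python sketch ranks documents by smoothed query likelihood. It is not the Lemur implementation: the function name jm_score, the smoothing weight lam = 0.5, and the toy documents are assumptions made for this example only.</p>
        <preformat><![CDATA[
```python
import math
from collections import Counter


def jm_score(query_terms, doc_terms, collection_tf, collection_len, lam=0.5):
    """Log query likelihood with Jelinek-Mercer smoothing.

    P(t|d) = (1 - lam) * tf(t, d) / |d| + lam * cf(t) / |C|
    lam is a hypothetical setting chosen for this sketch.
    """
    doc_tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        p_doc = doc_tf[term] / doc_len if doc_len else 0.0
        p_coll = collection_tf.get(term, 0) / collection_len
        p = (1.0 - lam) * p_doc + lam * p_coll
        if p == 0.0:
            # Term unseen in the whole collection: skip it instead of log(0).
            continue
        score += math.log(p)
    return score


# Toy collection of already stemmed, stopword-free documents (illustrative).
docs = {
    "d1": ["hydraul", "engin", "river"],
    "d2": ["librari", "catalogu", "histori"],
}
collection_tf = Counter(t for terms in docs.values() for t in terms)
collection_len = sum(collection_tf.values())

query = ["hydraul", "engin"]
ranking = sorted(
    docs,
    key=lambda d: jm_score(query, docs[d], collection_tf, collection_len),
    reverse=True,
)
```
]]></preformat>
        <p>In this sketch the collection model keeps the probability of a query term above zero for documents that do not contain it, so such documents are penalised rather than excluded from the ranking.</p>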
        <p>
          For most runs, pseudo-relevance feedback was applied for query expansion (QE): the top ten ranked
documents and 30 added terms were used for BM25, and the top five documents and 20 added terms for LM. A
variant of query expansion using information from an external resource (QE2) was also explored for bilingual
retrieval. The top ten results for the query in the source language were extracted and translated with Google
translate. Highly co-occurring terms were extracted for query expansion [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], using mutual information
to measure co-occurrence and selecting the candidate with the highest score as the target translation.
        </p>
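        <p>The pseudo-relevance feedback step can be sketched in Python as follows. The text does not state how expansion terms were weighted, so selecting terms by raw frequency in the top-ranked documents is an assumption of this sketch, as are the function name expand_query and the toy data; the defaults mirror the BM25 setting of ten documents and 30 terms.</p>
        <preformat><![CDATA[
```python
from collections import Counter


def expand_query(query_terms, ranked_docs, num_docs=10, num_terms=30):
    """Add the most frequent terms of the top-ranked documents to the query.

    ranked_docs: list of token lists, best document first.
    Selecting by raw frequency is an assumption made for this sketch.
    """
    counts = Counter()
    for doc in ranked_docs[:num_docs]:
        counts.update(doc)
    original = set(query_terms)
    # Keep only terms that are not already part of the query.
    candidates = [t for t, _ in counts.most_common() if t not in original]
    return list(query_terms) + candidates[:num_terms]


# Toy initial ranking for the query "river" (illustrative tokens).
ranked = [
    ["dam", "river", "flood"],
    ["river", "flood", "canal"],
    ["poetri", "river"],
]
expanded = expand_query(["river"], ranked, num_docs=2, num_terms=2)
```
]]></preformat>
        <p>With num_docs=5 and num_terms=20 the same function would correspond to the setting described for the LM runs.</p>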
        <p>
          For the bilingual retrieval experiments, topics were translated using either Google translate (GT)<sup>3</sup> or a
statistical machine translation system (MT) [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Document Expansion using EVM</title>
        <p>The TEL collection contains documents of widely varying length, including very short ones, because the
presence or absence of fields with bibliographic information differs from record to record. Furthermore,
some fields contain information that is not in natural language (e.g. alphanumeric codes or classifications). The
main idea behind some of our experiments was to apply a document expansion method to obtain longer
documents.</p>
        <p><sup>1</sup> http://www.lemurproject.org/ <sup>2</sup> http://snowball.tartarus.org/ <sup>3</sup> http://translate.google.com/</p>
        <p>In the English documents, there are more than 585,000 fields containing a valid Dewey Decimal Code
(DDC), and about 50% of all documents contain a corresponding field. The percentage of documents with a
Library of Congress Classification (LCC) code in the English collection is considerably lower, and DDC and LCC
codes are absent from the German and French collections or occur only sparsely, so the experiments focused on the DDC and the English
document collection.</p>
        <p>Before conducting the official experiments, we performed a test experiment with the English TEL
documents. We divided the document collection into short (less than 80 characters) and long documents
and into documents with and without a DDC. Results for the test run based on CLEF 2008 data for the
different sets of documents are shown in Table 1. For short and long documents, retrieval performance is
very similar, but fewer short documents were assessed as relevant. In contrast, documents with a DDC make
up a large portion of the relevant documents (72%), although only about half of all English documents are associated
with a DDC. Possible explanations are that, in previous experiments, the DDC was treated as
a separate index term that could be used in relevance feedback, or that longer documents provide more
context for relevance assessment. Both the relative and the absolute performance for documents without a DDC
are lower. As a result of this analysis, we tried to expand documents via automatic DDC
classification to create documents with a more evenly distributed length.</p>
        <p>The DDC is a hierarchical library classification. The classification system defines ten main classes, 100
divisions, and 1000 sections, each denoted by digits. For example, the DDC 627 represents the main class
“technology”, division “engineering and applied operations”, section “hydraulic engineering”.</p>
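        <p>The hierarchy encoded in the digits can be made explicit with a small Python sketch. The function name ddc_levels and the handling of decimal extensions are assumptions made for illustration; the example reproduces the decomposition of DDC 627 given above.</p>
        <preformat><![CDATA[
```python
def ddc_levels(code):
    """Split a three-digit DDC into main class, division, and section codes.

    Decimal extensions such as "627.8" are truncated to the first three
    digits, which is a simplification made for this sketch.
    """
    digits = code.strip().split(".")[0].zfill(3)
    if len(digits) != 3 or not digits.isdigit():
        raise ValueError(f"not a three-digit DDC: {code!r}")
    return {
        "main_class": digits[0] + "00",   # e.g. 600 (technology)
        "division": digits[:2] + "0",     # e.g. 620 (engineering)
        "section": digits,                # e.g. 627 (hydraulic engineering)
    }


levels = ddc_levels("627")
```
]]></preformat>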
        <p>The main idea for document expansion was to train a classifier on documents containing a DDC and
apply it to obtain classification codes for all other documents. All classification codes are then replaced
with their natural language description, which is preprocessed and added to the index. The natural language
descriptions are available in English only and originate from the OCLC web site, from which they were
compiled into a machine-readable format.
The resulting list contained all 1110 DDC entries (10 main classes, 100 divisions, and 1000 sections), of which 933 were actually used in the
document collection. The documents were modified as follows: documents with a DDC are expanded by
appending the natural language description of the DDC to their content; documents without a DDC are first
classified using an EVM and then processed in the same way.</p>
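        <p>The two expansion cases can be sketched in Python as follows. This is a simplified illustration, not the code used in the experiments: the dictionary layout of a document, the function name expand_document, the classify callable standing in for the trained classifier, and the two description strings are all assumptions of this sketch.</p>
        <preformat><![CDATA[
```python
def expand_document(doc, descriptions, classify):
    """Return the document text with a DDC description appended.

    doc: dict with "text" and an optional "ddc" entry.
    descriptions: mapping from DDC code to its natural language description.
    classify: callable that predicts a DDC code for a text (stands in for
        the trained classifier; hypothetical interface).
    """
    # Use the assigned DDC if present, otherwise fall back to the classifier.
    code = doc.get("ddc") or classify(doc["text"])
    description = descriptions.get(code, "")
    return (doc["text"] + " " + description).strip()


# Illustrative description table and documents.
descriptions = {
    "627": "Hydraulic engineering",
    "020": "Library and information sciences",
}
docs = [
    {"text": "Dams and flood control", "ddc": "627"},
    {"text": "Cataloguing rules for libraries"},
]
# A constant classifier keeps the example self-contained.
expanded = [expand_document(d, descriptions, lambda text: "020") for d in docs]
```
]]></preformat>
        <p>In a full pipeline the expanded text would then be stemmed and stopped like any other field content before indexing.</p>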
        <p>
          Entry Vocabulary Modules (EVM, [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]) have been successfully employed to map uncontrolled
vocabulary (free text) to a controlled vocabulary or classification for query expansion [
          <xref ref-type="bibr" rid="ref3 ref9">9, 3</xref>
          ]. An EVM determines a
ranking of the most likely classifications. The
EVM used for our experiments was trained on all documents with a DDC assigned to them. As the EVM
returns a ranking of classifications, only the top-ranked DDC was considered and its description used to
expand the documents.
        </p>
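        <p>To make the idea of ranking classifications for free text concrete, the following Python sketch trains simple term-to-class association counts on documents with a DDC and ranks classes for a new document. It is a deliberately simplified stand-in, not the EVM of the cited work: the scoring by relative co-occurrence frequency, the function names, and the toy training data are assumptions made for this example.</p>
        <preformat><![CDATA[
```python
from collections import Counter, defaultdict


def train_associations(training_docs):
    """Count in how many documents of each class a term occurs."""
    term_class = defaultdict(Counter)
    for tokens, ddc in training_docs:
        for term in set(tokens):
            term_class[term][ddc] += 1
    return term_class


def rank_classes(tokens, term_class):
    """Rank classes by summed relative association with the text's terms."""
    scores = Counter()
    for term in set(tokens):
        counts = term_class.get(term)
        if not counts:
            continue  # term never seen in training
        total = sum(counts.values())
        for ddc, n in counts.items():
            scores[ddc] += n / total
    return [ddc for ddc, _ in scores.most_common()]


# Toy training set of (tokens, DDC) pairs (illustrative).
training = [
    (["dam", "river", "flood"], "627"),
    (["canal", "river", "lock"], "627"),
    (["librari", "catalogu"], "020"),
    (["river", "poem"], "821"),
]
model = train_associations(training)
ranking = rank_classes(["flood", "river", "bridg"], model)
# Only the top-ranked class is kept for document expansion.
top_ddc = ranking[0] if ranking else None
```
]]></preformat>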
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
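      <table-wrap id="tab2">
        <label>Table 2</label>
        <caption>
          <p>Runs for the ad-hoc IR experiments: run ID and topic language.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Run ID</th><th>Topic language</th></tr>
          </thead>
          <tbody>
            <tr><td>TCDDCU EN1F</td><td>EN</td></tr>
            <tr><td>TCDDCU EN2F</td><td>EN</td></tr>
            <tr><td>TCDDCU EN3</td><td>EN</td></tr>
            <tr><td>TCDDCU EN4</td><td>EN</td></tr>
            <tr><td>TCDDCU FR1</td><td>FR</td></tr>
            <tr><td>TCDDCU FR1F</td><td>FR</td></tr>
            <tr><td>TCDDCU FR3</td><td>FR</td></tr>
            <tr><td>TCDDCU FR4</td><td>FR</td></tr>
            <tr><td>TCDDCU DE1</td><td>DE</td></tr>
            <tr><td>TCDDCU DE1F</td><td>DE</td></tr>
            <tr><td>TCDDCU DE3</td><td>DE</td></tr>
            <tr><td>TCDDCU DE4</td><td>DE</td></tr>
            <tr><td>TCDDCU DEEN1</td><td>DE</td></tr>
            <tr><td>TCDDCU DEEN3</td><td>DE</td></tr>
            <tr><td>TCDDCU FREN1F</td><td>FR</td></tr>
            <tr><td>TCDDCU FREN2</td><td>FR</td></tr>
            <tr><td>TCDDCU FREN2F</td><td>FR</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>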
Results for the ad-hoc IR experiments are shown in Table 2. Some experiments achieved a performance
among the top five participants in the TEL track at CLEF 2009: run TCDDCU DEEN1 was 4th in
bilingual English (0.3333 MAP), run TCDDCU DE3 was 4th in monolingual German (0.2686 MAP), and
run TCDDCU EN3 was 5th in monolingual English (0.3696 MAP).</p>
      <p>In all cases, runs with blind relevance feedback to expand queries yield a higher MAP than
the corresponding runs without blind feedback. The query expansion variant based on external
information from web pages found by Google web search did not produce the expected improvement; it degraded
performance (TCDDCU DEEN3 vs. TCDDCU DEEN1).</p>
      <p>Using only a subset of the document fields yields a slightly higher precision (e.g.
TCDDCU DE3 vs. TCDDCU DE4).</p>
      <p>BM25 and language modeling perform similarly in the retrieval experiments in all languages. Because
of small differences in the experimental setup (e.g. the fields indexed), additional experiments will
have to be conducted before testing for significant differences.</p>
      <p>For the bilingual runs with target language English, 89.9% and 90.1% of the MAP of the best
monolingual English run were achieved for French and German, respectively. Using the MaTrEx system for
topic translation achieves 70.1% of the MAP obtained with topic translation by Google translate
(TCDDCU FREN2 vs. TCDDCU FREN1).</p>
    </sec>
    <sec id="sec-4">
      <title>Future Work</title>
      <p>Future work will include an analysis of the accuracy of the DDC classification based on a manually
extracted and annotated sample of the English document collection. Also, blind relevance feedback using
external resources will be further investigated.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>Thanks to Andy Way’s group at DCU for providing topic translations with the MaTrEx statistical MT
system.</p>
      <p>This research is supported by the Science Foundation Ireland (Grant 07/CE/I1142) as part of the Centre for
Next Generation Localisation (CNGL) project.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Eneko</given-names>
            <surname>Agirre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Giorgio M.</given-names>
            <surname>Di Nunzio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Nicola</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Mandl</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Carol</given-names>
            <surname>Peters</surname>
          </string-name>
          .
          <article-title>CLEF 2008: Ad hoc track overview</article-title>
          .
          <source>In Working Notes for the CLEF 2008 Workshop</source>
          , 17-19 September, Aarhus, Denmark,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Lisa</given-names>
            <surname>Ballesteros</surname>
          </string-name>
          and
          <string-name>
            <given-names>W. Bruce</given-names>
            <surname>Croft</surname>
          </string-name>
          .
          <article-title>Resolving ambiguity for cross-language retrieval</article-title>
          .
          <source>In SIGIR '98: Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval</source>
          , pages
          <fpage>64</fpage>
          -
          <lpage>71</lpage>
          , Melbourne, Australia,
          <year>1998</year>
          . ACM, New York, USA.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Benjamin</given-names>
            <surname>Berghaus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Christa</given-names>
            <surname>Womser-Hacker</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Michael</given-names>
            <surname>Kluck</surname>
          </string-name>
          .
          <article-title>An entry vocabulary module for a political science test collection</article-title>
          .
          In Witold Abramowicz and Dieter Fensel, editors,
          <source>Business Information Systems, 11th International Conference, BIS 2008</source>
          , volume
          <volume>7</volume>
          of Lecture Notes in Business Information Processing
          , pages
          <fpage>1</fpage>
          -
          <lpage>11</lpage>
          . Springer, Berlin,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Bodo</given-names>
            <surname>Billerbeck</surname>
          </string-name>
          and
          <string-name>
            <given-names>Justin</given-names>
            <surname>Zobel</surname>
          </string-name>
          .
          <article-title>Document expansion versus query expansion for ad-hoc retrieval</article-title>
          .
          In Andrew Turpin and Ross Wilkinson, editors,
          <source>Proceedings of the Tenth Australasian Document Computing Symposium</source>
          , pages
          <fpage>34</fpage>
          -
          <lpage>41</lpage>
          , Sydney, Australia, December
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
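          <string-name>
            <given-names>Peter F.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>John</given-names>
            <surname>Cocke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Stephen A.</given-names>
            <surname>Della Pietra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Vincent J.</given-names>
            <surname>Della Pietra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Fredrick</given-names>
            <surname>Jelinek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>John D.</given-names>
            <surname>Lafferty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Robert L.</given-names>
            <surname>Mercer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Paul S.</given-names>
            <surname>Roossin</surname>
          </string-name>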
          .
          <article-title>A statistical approach to machine translation</article-title>
          .
          <source>Computational Linguistics</source>
          ,
          <volume>16</volume>
          (
          <issue>2</issue>
          ):
          <fpage>79</fpage>
          -
          <lpage>85</lpage>
          ,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Stanley F.</given-names>
            <surname>Chen</surname>
          </string-name>
          and
          <string-name>
            <given-names>Joshua</given-names>
            <surname>Goodman</surname>
          </string-name>
          .
          <article-title>An empirical study of smoothing techniques for language modeling</article-title>
          .
          <source>In 34th Annual Meeting of the Association for Computational Linguistics (ACL)</source>
          , pages
          <fpage>310</fpage>
          -
          <lpage>318</lpage>
          , Santa Cruz, USA,
          <year>1996</year>
          . Morgan Kaufmann/ACL.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
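          <string-name>
            <given-names>Jinhua</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yifan</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Sergio</given-names>
            <surname>Penkale</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Andy</given-names>
            <surname>Way</surname>
          </string-name>
          .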
          <article-title>MaTrEx: the DCU MT system for WMT 2009</article-title>
          .
          <source>In Proceedings of the Fourth Workshop on Statistical Machine Translation</source>
          , pages
          <fpage>95</fpage>
          -
          <lpage>99</lpage>
          , Athens, Greece,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Fredric C.</given-names>
            <surname>Gey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Buckland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Aitao</given-names>
            <surname>Chen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Ray R.</given-names>
            <surname>Larson</surname>
          </string-name>
          .
          <article-title>Entry vocabulary - a technology to enhance digital search</article-title>
          .
          <source>In Proceedings of the First International Conference on Human Language Technology</source>
          , San Diego, USA, March
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Vivien</given-names>
            <surname>Petras</surname>
          </string-name>
          .
          <article-title>GIRT and the use of subject metadata for retrieval</article-title>
          . In Carol Peters, Paul Clough, Julio Gonzalo, Gareth J. F. Jones, Michael Kluck, and Bernardo Magnini, editors,
          <source>Multilingual Information Access for Text, Speech and Images, 5th Workshop of the Cross-Language Evaluation Forum, CLEF 2004, Bath, UK, September 15-17, 2004, Revised Selected Papers</source>
          , volume
          <volume>3491</volume>
          of LNCS
          , pages
          <fpage>298</fpage>
          -
          <lpage>309</lpage>
          . Springer, Berlin,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Stephen E.</given-names>
            <surname>Robertson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Steve</given-names>
            <surname>Walker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Susan</given-names>
            <surname>Jones</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Micheline</given-names>
            <surname>Hancock-Beaulieu</surname>
          </string-name>
          .
          <article-title>Okapi at TREC-3</article-title>
          .
          <source>In Proceedings of the Third Text REtrieval Conference (TREC 1994)</source>
          , Gaithersburg, USA,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>