<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Dictionary-based CLIR for the CLEF Multilingual Track</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mirna Adriani</string-name>
          <xref ref-type="aff" rid="aff0" />
        </contrib>
        <aff id="aff0">
          <institution>Department of Computing Science, University of Glasgow</institution>
          ,
          <addr-line>Glasgow G12 8QQ, Scotland</addr-line>
        </aff>
      </contrib-group>
      <fpage>2</fpage>
      <lpage>5</lpage>
      <abstract>
        <p>This report describes the work done for our participation in the multilingual track of the Cross-Language Evaluation Forum (CLEF). We use a dictionary-based approach to translate English queries into German, French, and Italian queries. We then apply a term disambiguation technique to select the best translation terms from the terms found in the dictionary entries, and a query expansion technique to enhance the queries' retrieval performance.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Term Disambiguation Technique</title>
      <p>
        In order to select the best translation terms from a dictionary entry, we apply our term
disambiguation technique, which is based on statistical similarity values among terms. The similarity
value is measured using the Dice similarity measure, based on the co-occurrences of terms in documents
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Basically, given a set of original query terms, we select for each term the best sense such that
the resulting set of selected senses contains senses that are mutually related, or statistically similar, to
one another. To limit the computational cost, this is done using an approximate algorithm. Given a
set of n original query terms {t1, t2, ..., tn}, a set of translation terms, T, is obtained using the following
algorithm:
1. For each ti (i = 1 to n), retrieve a set of senses Si from the dictionary.
2. For each set Si (i = 1 to n), do steps 2.1, 2.2, and 2.3.
2.1 For each sense tj’ (j = 1 to |Si|) in Si, do step 2.1.1.
2.1.1 For each set Sk (k = 1 to n and k &lt;&gt; i), get the maximum similarity, Mj,k, between tj’ and the senses in Sk.
2.2 Compute the score of sense tj’ as the sum of Mj,k (k = 1 to n and k &lt;&gt; i).
2.3 Select the sense in Si with the highest score, and add the selected sense to the set T.
      </p>
      <p>
        Query terms that are not found in the dictionary are included in the translation set T as-is. This is
typically the case for proper names, technical terms, and acronyms. A complete explanation of our
technique can be found in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
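      <p>As a concrete illustration, the algorithm above can be sketched in Python. This is a minimal sketch, not the authors' implementation; the posting-set and sense-dictionary structures (senses, postings) are assumptions made for the example:</p>

```python
def dice(docs_a, docs_b):
    """Dice coefficient over the sets of documents containing each term."""
    if not docs_a or not docs_b:
        return 0.0
    overlap = len(docs_a.intersection(docs_b))
    return 2.0 * overlap / (len(docs_a) + len(docs_b))


def disambiguate(query_terms, senses, postings):
    """Approximate sense selection following steps 1 to 2.3 above.

    senses maps each query term ti to its list of candidate senses Si;
    postings maps each sense to the set of ids of documents it occurs in.
    """
    selected = []
    for ti in query_terms:
        best_sense, best_score = None, -1.0
        for tj in senses[ti]:
            # Score tj as the sum, over every other query term tk, of the
            # maximum Dice similarity between tj and any sense of tk (Mj,k).
            score = 0.0
            for tk in query_terms:
                if tk != ti:
                    score += max(
                        (dice(postings.get(tj, set()), postings.get(s, set()))
                         for s in senses[tk]),
                        default=0.0,
                    )
            if score > best_score:
                best_sense, best_score = tj, score
        selected.append(best_sense)
    return selected
```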
    </sec>
    <sec id="sec-2">
      <title>Query Expansion Technique</title>
      <p>The resulting translated queries are, of course, less accurate and less effective for retrieval than the
original queries. We therefore expand the translated queries by adding related terms in order to further
improve their retrieval performance. Our query expansion technique uses a Dice similarity matrix
computed from the co-occurrences of terms in document passages. We built a database containing
passages of 200 terms each from every collection. We then ran each query set to obtain the relevant
passages, and used the top 20 passages to create the term similarity matrix. Next, we computed the sum
of the similarity values between each term in the passages and all terms in the query. Finally, we added
the top 10 such terms from the relevant passages to the query.</p>
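      <p>The expansion step described above can be sketched as follows. This is an illustrative Python sketch under our own assumptions about the data structures (a list of top-ranked passages, each a list of terms), not the system's actual code:</p>

```python
from collections import defaultdict


def expand_query(query_terms, top_passages, n_terms=10):
    """Passage-based expansion sketch: score each candidate term by the sum
    of its Dice similarities to all query terms, computed from co-occurrence
    in the top-ranked passages, then append the best n_terms candidates.

    top_passages is a list of term lists (the paper uses 200-term passages
    and the top 20 retrieved passages).
    """
    occurs = defaultdict(set)  # term -> set of passage ids containing it
    for pid, passage in enumerate(top_passages):
        for term in passage:
            occurs[term].add(pid)

    def dice(a, b):
        overlap = len(occurs[a].intersection(occurs[b]))
        total = len(occurs[a]) + len(occurs[b])
        return 2.0 * overlap / total if total else 0.0

    candidates = set(occurs) - set(query_terms)
    ranked = sorted(
        candidates,
        key=lambda t: sum(dice(t, q) for q in query_terms),
        reverse=True,
    )
    return list(query_terms) + ranked[:n_terms]
```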
    </sec>
    <sec id="sec-3">
      <title>Rank Merging</title>
      <p>A rank merging technique is required because we run the query set for each language collection
independently of the other language collections. The results from the four language collections are then
merged into a single ranked list. We employ a simple method based on the assumption that the highest-ranked
document in one language is comparable, in terms of relevance to the query, to the highest-ranked
document in another language. We realize that this assumption is not always true but, for lack of
time to experiment with other techniques, we considered it a reasonable one. Under this assumption,
we normalize the relevance scores by the highest score in each ranked list, and then merge and sort
them into a single ranking.</p>
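      <p>The normalization-and-merge scheme above amounts to a few lines of code. The following Python sketch assumes each per-language result list is a list of (document id, score) pairs sorted by descending score; the function and variable names are illustrative:</p>

```python
def merge_ranked_lists(per_language_results):
    """Merge per-language rankings into one list by dividing every
    relevance score by the highest score in its own list.

    per_language_results maps a language code to a list of (doc_id, score)
    pairs sorted by descending score.
    """
    merged = []
    for results in per_language_results.values():
        if not results:
            continue
        top_score = results[0][1]  # highest score in this language's list
        for doc_id, score in results:
            merged.append((doc_id, score / top_score if top_score else 0.0))
    merged.sort(key=lambda pair: pair[1], reverse=True)
    return merged
```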
    </sec>
    <sec id="sec-4">
      <title>Experiment</title>
      <p>In the multilingual track, the document collections are in four languages, namely English, German,
French, and Italian. We chose to run the English queries, which were then translated using online
dictionaries.</p>
      <p>First, we eliminated all stop words from the English queries and stemmed the remaining terms using the
Porter stemmer. Each term was then translated into its translation, or translations if more than one was
possible, according to the dictionary. We also included translations for terms that are part of phrases in
the query. The translation terms were stemmed using the French and German stemmers from the PRISE
retrieval system obtained from NIST, and stopwords in the translations were also removed. We then
applied our term disambiguation technique to choose the best translation terms. The resulting queries
were then enhanced by applying the query expansion technique, which adds the 10 terms from a set of
20 relevant passages that are most closely related to the query terms. The values of 10 and 20 were
obtained through a brief preliminary experiment.</p>
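      <p>The dictionary-lookup stage of this pipeline can be sketched as follows. The dictionary's shape here is an assumption made for illustration (a mapping from an English stem to a list of candidate translations), standing in for the online dictionaries actually used:</p>

```python
def translate_terms(stemmed_query, bilingual_dict):
    """Dictionary lookup step: each stemmed English term maps to its
    candidate translations; terms missing from the dictionary (proper
    names, technical terms, acronyms) are kept as-is, as described above.
    """
    candidates = {}
    for term in stemmed_query:
        # Fall back to the untranslated term when no entry exists.
        candidates[term] = bilingual_dict.get(term, [term])
    return candidates
```

The output, one candidate list per query term, is what the disambiguation step then reduces to a single sense per term.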
      <p>Finally, we ran each query set on its respective document collection, including the original English
queries on the English collection, and the retrieval results from the sets were then combined into a single
document ranking.</p>
      <p>In this experiment, we ran two query formats, namely the title-only and the long (full) query formats.
Each query in the long query set contains the title, the description, and the explanation texts of the CLEF
query. We ran both query sets to see whether the results are consistent across them. All the
steps in the multilingual task were performed fully automatically.</p>
    </sec>
    <sec id="sec-5">
      <title>Results</title>
      <p>We participated in the multilingual task with the title-only (glatitle) and the long (glalong) query
formats. However, only the title-only run was included in the CLEF relevance assessment pool.</p>
      <table-wrap id="table-1">
        <label>Table 1</label>
        <caption>
          <p>Retrieval results of the title-only (glatitle) and long (glalong) runs on the monolingual and cross-language tasks.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Run</th><th>Task</th><th>English</th><th>German</th><th>French</th><th>Italian</th></tr>
          </thead>
          <tbody>
            <tr><td>glatitle</td><td>Monolingual</td><td>0.2705</td><td>0.2075</td><td>0.2260</td><td>0.0347</td></tr>
            <tr><td>glatitle</td><td>Cross Language</td><td/><td>0.0810</td><td>0.1097</td><td>0.0569</td></tr>
            <tr><td>glalong</td><td>Monolingual</td><td>0.3804</td><td>0.2790</td><td>0.2682</td><td>0.1279</td></tr>
            <tr><td>glalong</td><td>Cross Language</td><td/><td>0.0932</td><td>0.1012</td><td>0.1050</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <sec id="sec-5-7">
        <title>Discussion</title>
        <p>As can be seen in Table 1, we obtained good results for the Italian translation queries, followed by the
French translation queries and, lastly, the German translation queries, which performed the poorest. Our
investigation of the title-only query format revealed that the retrieval performance of each translation
query set is proportional to the number of English terms that could not be translated into the target
language using the bilingual dictionary. Specifically, for the title-only query format, our German query
set contains 3 untranslated English terms, as well as stand-alone German terms that should have been
translated as 19 German compound nouns. The French query set contains 19 untranslated English
terms, and the Italian query set contains 8 untranslated English terms.</p>
        <p>
          In our previous work [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], we demonstrated that our German queries performed better than the equivalent
Spanish queries in retrieving documents from an English collection. The reason is that German
compound words have exact English equivalents, whereas Spanish phrases do not when they are
translated word by word using bilingual dictionaries. In other words, the degree of ambiguity of the
German queries is lower than that of the Spanish queries. However, from this experiment, we learned
that translating English queries into German, which involves constructing compound words, is a
difficult task.
        </p>
        <p>Lastly, we learned that our rank merging technique also contributed to our poor overall retrieval
performance. We hope to improve it in the future, and we also hope to be able to use better
machine-readable dictionaries next time.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Adriani</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>C. J.</given-names>
            <surname>van Rijsbergen</surname>
          </string-name>
          .
          <article-title>Term Similarity Based Query Expansion for Cross Language Information Retrieval</article-title>
          .
          <source>In Proceedings of Research and Advanced Technology for Digital Libraries, Third European Conference (ECDL'99)</source>
          , p.
          <fpage>311</fpage>
          -
          <lpage>322</lpage>
          . Springer Verlag: Paris,
          <year>September 1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Adriani</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <article-title>Using Statistical Term Similarity for Sense Disambiguation in Cross Language Information Retrieval</article-title>
          .
          <source>Information Retrieval</source>
          <volume>2</volume>
          (
          <issue>1</issue>
          ), p.
          <fpage>67</fpage>
          -
          <lpage>78</lpage>
          . Kluwer: February,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>van Rijsbergen</surname>
            ,
            <given-names>C. J.</given-names>
          </string-name>
          .
          <source>Information Retrieval</source>
          . Second ed. London, UK: Butterworths,
          <year>1979</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>