<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>SINAI at CLEF 2004: Using Machine Translation resources with mixed 2-step RSV merging algorithm</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Fernando Martínez-Santiago</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Miguel A. García-Cumbreras</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Manuel C. Díaz-Galiano</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>L. Alfonso Ureña</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Jaén</institution>
          ,
          <addr-line>Jaén</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This year we participated in the CLEF multilingual task. Our main interest was testing Machine Translation (MT) with the mixed 2-step RSV merging algorithm. 2-step RSV requires grouping together the document frequency of each term and of its translations, whereas MT translates a whole phrase better than it translates word by word. MT therefore cannot be used directly with the 2-step RSV merging algorithm, which needs to know, for each word of the original query, its translation into each of the other languages. We thus propose a straightforward and effective algorithm to align the original query and its translation at term level.</p>
      </abstract>
      <kwd-group>
        <kwd>Mixed 2-step RSV merging algorithm and Machine Translation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The basic idea of 2-step RSV is straightforward: given a query term and its translations
into the other languages, their document frequencies are grouped together [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The method therefore
requires recalculating the document score by changing the document frequency of each
query term. Given a query term, the new document frequency is the sum of the
monolingual document frequencies of the term and of its translations.
In the first step, the query is translated and searched in each monolingual collection. This phase
produces a vocabulary T' made up of “concepts”. A concept consists of a term together with its
corresponding translations. Moreover, we obtain a single multilingual collection D' of preselected
documents as the union of the first 1000 documents retrieved for each language. The
second step consists of re-indexing the multilingual collection D', but considering solely the T'
vocabulary. Finally, a new query formed by the concepts in T' is generated and run
against the new index.
      </p>
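      <p>The grouping of document frequencies described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name, the data layout (one dictionary of document frequencies per language) and the toy figures are assumptions made only for illustration.</p>
      <preformat>
```python
from collections import Counter

def group_document_frequencies(concepts, monolingual_df):
    """Return one document frequency per concept.

    concepts: maps a concept id to {language: term} (hypothetical layout).
    monolingual_df: maps a language to {term: document frequency}.
    """
    grouped = Counter()
    for concept, terms_by_language in concepts.items():
        for language, term in terms_by_language.items():
            # A term missing from a collection contributes a frequency of 0.
            grouped[concept] += monolingual_df.get(language, {}).get(term, 0)
    return dict(grouped)

# Toy data, invented for the example (stemmed terms, made-up frequencies).
concepts = {"food": {"en": "food", "fr": "nourritur", "fi": "ruoka"}}
monolingual_df = {"en": {"food": 120}, "fr": {"nourritur": 45}, "fi": {"ruoka": 30}}
concept_df = group_document_frequencies(concepts, monolingual_df)
```
      </preformat>
      <p>With these toy figures the concept receives a single document frequency, the sum of the three monolingual ones, which is the value the second step would use when re-weighting documents.</p>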
      <sec id="sec-1-1">
        <title>An algorithm to align a phrase and its translation at term level by using Machine Translation</title>
        <p>
          Since 2-step RSV requires grouping together the document frequency of each term and of its
translations, and MT translates a whole phrase better than it translates word by word, the 2-step
RSV merging algorithm is not directly feasible with MT (given a word of the original query, its
translation into each of the other languages must be known). Thus, we propose a straightforward and
effective algorithm to align the original query and its translation at term level. In this
paper, machine translation is treated as a black box which receives English phrases and generates
translations of these phrases into the other languages. Briefly, for each translation the algorithm
works as follows (a more detailed description is available in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]):
1. Let the original phrase be in English. Translate the phrase into the target language with
an MT resource.
2. Extract unigrams and bigrams from the English phrase. Translate both with
the same MT resource used in step 1.
3. Remove stopwords and stem the remaining words.
4. Test the alignment of terms by matching the terms of the translated phrase against the
unigram-based translation. The unigram-based translation is fully aligned, so
if a word of the translated phrase is also produced by the word-for-word
translation, we know which source word it translates, and
this word is aligned.
5. After the unigram-based alignment is finished, if any term in the
translated phrase is still not aligned, use the bigrams with exactly one aligned term to
align the other term of the bigram.
        </p>
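        <p>The steps above can be sketched as follows. This is only an illustration under stated assumptions: <italic>translate</italic> is a hypothetical callable standing in for the black-box MT system, stopword removal and stemming (step 3) are omitted, and the toy translations are invented, accent-stripped strings rather than real MT output.</p>
        <preformat>
```python
def align_terms(phrase, translate):
    """Align source terms with target terms of the MT output.

    translate: callable standing in for the black-box MT system;
    it maps a source string to a target string (an assumption).
    """
    source = phrase.split()
    target = translate(phrase).split()
    alignment = {}

    # Step 4: a target word that is also the word-by-word translation
    # of a source word is considered aligned with that source word.
    for word in source:
        candidate = translate(word)
        if candidate in target:
            alignment[word] = candidate

    # Step 5: use bigrams with exactly one aligned term to align the other.
    for first, second in zip(source, source[1:]):
        translated = translate(first + " " + second).split()
        if len(translated) != 2:
            continue
        for known, unknown in ((first, second), (second, first)):
            if known in alignment and unknown not in alignment:
                rest = [t for t in translated if t != alignment[known]]
                if len(rest) == 1 and rest[0] in target:
                    alignment[unknown] = rest[0]
    return alignment

# Toy "MT system": in isolation, "light" is translated differently
# from its translation inside the phrase.
toy_mt = {"light beer": "biere legere", "light": "lumiere", "beer": "biere"}

def toy_translate(text):
    return toy_mt[text]

aligned = align_terms("light beer", toy_translate)
```
        </preformat>
        <p>In this toy case the unigram pass aligns only “beer”; the bigram pass then aligns “light” with the remaining target word. A bigram with no aligned term would be left unaligned, which corresponds to the failure case discussed in the text.</p>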
        <p>
          This algorithm fails if there are bigrams without any aligned term after step 3. In addition, in
order to improve the matching process, words are stemmed by removing at least gender and number.
Finally, agglutinative languages, such as German, usually translate (adjective, noun) bigrams by
using a compound word. For example, “baby food” is translated as “säuglingsnahrung” instead
of “säugling nahrung” (Babelfish translation). We decompound compound words where possible with
the algorithm described in [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
        <p>We have tested the proposed algorithm with previous CLEF query sets (Title+Description). It
aligns about 85-90% of non-empty words (Table 1).</p>
        <p>
          This year, we have used MT resources to translate the original English query into
French and Russian. However, we have not found a good-quality free MT system for Finnish, so we have
used a Machine Dictionary Readable (MDR) approach (see Section 3.1 for more details about
translation strategies). The percentage of aligned words is shown in Table 2.
Although the proposed algorithm to align phrases and translations at term level works well, it
does not obtain fully aligned queries. In order to improve the system performance when some
terms of the query are not aligned, we make two subqueries. The first one is made up of the
aligned terms only and the other one is formed by the non-aligned terms. Thus, for each query
every retrieved document obtains two scores. The first score is obtained by using the 2-step RSV
merging algorithm over the first subquery. In contrast, the second subquery is used in a traditional
monolingual system with the respective monolingual list of documents. Therefore, we have two
scores per document for each query, one global for all languages and the other local to each language, and
we have to integrate both values. As a way to deal with partially aligned queries (i.e. queries
with some terms not aligned), last year we proposed several approaches that mix evidence from
aligned and non-aligned terms [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. This year we have used raw mixed 2-step RSV and logistic
regression:
• Raw mixed 2-step RSV method:
        </p>
        <p>
          $$RSV'_i = \alpha \cdot RSV_i^{align} + (1 - \alpha) \cdot RSV_i^{nonalign} \qquad (1)$$
where $RSV_i^{align}$ is the score calculated by means of the aligned terms, as in the original 2-step RSV
method. On the other hand, $RSV_i^{nonalign}$ is calculated locally. Finally, $\alpha$ is a constant
(usually fixed to $\alpha = 0.75$).
• Logistic regression: [
          <xref ref-type="bibr" rid="ref1 ref10">1, 10</xref>
          ] propose a merging approach based on logistic regression. Logistic
regression is a statistical methodology for predicting the probability of a binary outcome
variable according to a set of independent explanatory variables. The probability of relevance
of the corresponding document $D_i$ is estimated from both the original score
and the logarithm of the rank. Based on these estimated probabilities of relevance, the
monolingual lists of documents are interleaved to form a single list:
        </p>
        <p>$$Prob[D_i \text{ is rel} \mid rank_i, rsv_i] = \frac{e^{\alpha + \beta_1 \cdot \ln(rank_i) + \beta_2 \cdot rsv_i}}{1 + e^{\alpha + \beta_1 \cdot \ln(rank_i) + \beta_2 \cdot rsv_i}} \qquad (2)$$
The coefficients $\alpha$, $\beta_1$ and $\beta_2$ are unknown parameters of the model. The usual methods
for fitting the model are maximum likelihood and iteratively re-weighted least squares.
Because this approach requires fitting the underlying model, the training set
(topics and their relevance assessments) must be available for each monolingual collection.
In the same way that the score and ln(rank) evidence was integrated by using logistic
regression (Formula 2), we are able to integrate the $RSV^{align}$ and $RSV^{nonalign}$ values:
$$Prob[D_i \text{ is rel} \mid rsv_i^{align}, rsv_i^{nonalign}] = \frac{e^{\alpha + \beta_1 \cdot rsv_i^{align} + \beta_2 \cdot rsv_i^{nonalign}}}{1 + e^{\alpha + \beta_1 \cdot rsv_i^{align} + \beta_2 \cdot rsv_i^{nonalign}}} \qquad (3)$$
where $rsv_i^{align}$ and $rsv_i^{nonalign}$ are calculated as in Formula 1. Again, training data must
be available in order to fit the model. This is a serious drawback, but this approach
allows integrating not only aligned and non-aligned scores but also the original rank of the
document:</p>
        <p>$$Prob[D_i \text{ is rel} \mid rank_i, rsv_i^{align}, rsv_i^{nonalign}] = \frac{e^{\alpha + \beta_1 \cdot \ln(rank_i) + \beta_2 \cdot rsv_i^{align} + \beta_3 \cdot rsv_i^{nonalign}}}{1 + e^{\alpha + \beta_1 \cdot \ln(rank_i) + \beta_2 \cdot rsv_i^{align} + \beta_3 \cdot rsv_i^{nonalign}}} \qquad (4)$$
where $rank_i$ is the local rank reached by $D_i$ at the end of the first step.</p>
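        <p>As an illustration of how the two ways of mixing scores differ, the following minimal sketch computes the raw mixed score (Formula 1) and the rank-aware logistic estimate (Formula 4). The function names and all numeric values are assumptions chosen for the example; in particular the coefficients are not fitted values from the paper, which would have to be estimated from training data.</p>
        <preformat>
```python
import math

def raw_mixed_rsv(rsv_align, rsv_nonalign, alpha=0.75):
    """Formula 1: linear mix of the aligned and non-aligned scores."""
    return alpha * rsv_align + (1 - alpha) * rsv_nonalign

def relevance_probability(rank, rsv_align, rsv_nonalign,
                          alpha, beta1, beta2, beta3):
    """Formula 4: logistic model over ln(rank) and both scores."""
    z = (alpha + beta1 * math.log(rank)
         + beta2 * rsv_align + beta3 * rsv_nonalign)
    return math.exp(z) / (1 + math.exp(z))

# Invented scores and coefficients, for illustration only.
mixed = raw_mixed_rsv(rsv_align=8.0, rsv_nonalign=4.0)
prob = relevance_probability(rank=1, rsv_align=0.0, rsv_nonalign=0.0,
                             alpha=0.0, beta1=-1.0, beta2=0.5, beta3=0.5)
```
        </preformat>
        <p>The linear mix needs only the constant α, whereas the logistic model needs one fitted coefficient per source of evidence, which is why relevance assessments are required for every collection. The merged list would then be obtained by sorting all documents by the chosen score.</p>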
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Experiments and Results</title>
      <p>Our Multilingual Information Retrieval System uses English as the topic language, and
the goal is to retrieve relevant documents for all languages in the collection, listing the results
in a single ranked list. This list contains documents written in different languages,
retrieved as an answer to a query in a given language, English in our case. There are several
approaches to this task, such as translating the whole document collection into an intermediate
language or translating the query into every language found in the collection. Our approach is
the latter: we translate the query into each language present in the multilingual collection. Thus,
every monolingual collection must be preprocessed and indexed separately. The preprocessing and
indexing tasks are described below.</p>
      <sec id="sec-2-1">
        <title>Language-dependent features</title>
        <p>
          In CLEF 2004 the multilingual task covers four languages: English, Finnish, French
and Russian. These languages are very heterogeneous: the agglutinative nature of Finnish,
the Cyrillic alphabet of Russian and the morphological complexity of French make it difficult
to apply a homogeneous strategy for preprocessing and translation:
• English has been preprocessed as in previous years. Stopwords have been eliminated and
we have used the Porter algorithm [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] as implemented in the ZPrise system.
• Finnish is an agglutinative language. Thus, we have used the same decompounding algorithm
as last year [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. The stopword list and stemming algorithm have been obtained from the Snowball site (footnote 1).
Since we have not found any good free machine translation for Finnish, we use the FinnPlace
online dictionary (footnote 2).
• The resources for French have been updated by using the stopword list and French stemmer
from http://www.unine.ch/info/clef. The translation from English has been carried out by
using the Reverso software (footnote 3).
• For Russian, the stopword list and stemming algorithm have been obtained from the Snowball
site. The Cyrillic alphabet has been transliterated into ASCII characters, following the standard
Library of Congress transliteration scheme. We have used Prompt MT (footnote 4) to
translate the queries from English into Russian.
Footnote 1: Snowball is a small string-handling language in which stemming algorithms can be easily represented. Its name
was chosen as a tribute to SNOBOL. Available at http://www.snowball.tartarus.org
Footnote 2: FinnPlace is available on-line at http://www.tracetech.net/db.htm
Footnote 3: Reverso is available on-line at translation2.paralink.com
Footnote 4: Prompt is available on-line at http://www.online-translator.com/text.asp?lang=en
        </p>
        <p>
          Once the collections have been preprocessed, they are indexed with the ZPrise IR system (footnote 5), using the
OKAPI probabilistic model (fixed at b = 0.75 and k1 = 1.2) [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. The OKAPI model has also been used
for the on-line re-indexing process required by the calculation of 2-step RSV. This year, we have
not used blind feedback because the improvement is very poor for these collections, and the precision
is even worse for some languages (English and Russian).
        </p>
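        <p>To make the role of the parameters b and k1 concrete, the sketch below computes a term weight with the commonly cited form of Okapi BM25. The exact variant implemented in ZPrise is not given in this paper, so the formula, the function name and the toy figures should be read as assumptions.</p>
        <preformat>
```python
import math

def bm25_term_weight(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Textbook BM25 weight of one term in one document (assumed variant).

    tf: term frequency in the document; df: document frequency;
    n_docs: collection size; doc_len / avg_doc_len: length normalisation.
    """
    idf = math.log((n_docs - df + 0.5) / (df + 0.5))
    norm = k1 * ((1 - b) + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + norm)

# Invented figures: the same term weighted with a monolingual document
# frequency and with a larger, grouped (concept-level) document frequency.
w_local = bm25_term_weight(tf=1, df=120, n_docs=1000, doc_len=100, avg_doc_len=100)
w_grouped = bm25_term_weight(tf=1, df=195, n_docs=1000, doc_len=100, avg_doc_len=100)
```
        </preformat>
        <p>Because the document frequency enters only through the idf factor, replacing a monolingual document frequency by a grouped one changes the weight of the term, which is the effect that the re-indexing step of 2-step RSV relies on.</p>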
      </sec>
      <sec id="sec-2-2">
        <title>Results</title>
        <p>the poor performance achieved by logistic regression. The reason for this result could be that this
merging approach requires relevance assessments for each collection in order to fit the underlying
model. However, we have no relevance assessments for the 1995 Le Monde document collection
(this collection is available for the first time this year). Thus, we have trained the model with the
rest of the French collections, and for this reason we think that the model has been trained poorly.
This would explain why the best result is obtained by the most straightforward mixed
2-step RSV approach (UJAMLRSV2), since the rest of the approaches are based on the combination
of logistic regression with 2-step RSV.</p>
        <p>Footnote 5: ZPrise, developed by Darrin Dimmick (NIST). Available on demand at
http://www.itl.nist.gov/iad/894.02/works/papers/zp2/zp2.html</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Global relevance blind feedback</title>
      <p>This year, we have not used blind feedback because the improvement obtained is poor. We
have nevertheless tested a new way to apply blind feedback, globally rather than locally. Local relevance blind
feedback is the expansion of the query applied by every monolingual IR system. Global relevance
blind feedback is the expansion of the query applied by the multilingual IR system. In this way, we
analyze the top-N documents ranked in the multilingual list of documents. This idea is applied
to the 2-step RSV merging algorithm as follows:
1. Merge the document rankings using 2-step RSV.
2. Apply blind relevance feedback to the top-N documents ranked in the multilingual list of
documents.
3. Add the N most meaningful terms to the query. Since the documents are written in
very different languages, the list of selected terms will be multilingual.
4. Expand the concept query (footnote 6) with the selected terms.
5. Apply 2-step RSV again over the ranked lists of documents, but using the expanded
query instead of the original query.</p>
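      <p>Steps 2 to 4 above can be sketched as follows. The sketch is deliberately simplified and rests on assumptions: terms are selected by counting in how many top-ranked documents they occur, as a stand-in for the term-selection weight of a real feedback method, and the function names and toy ranking are invented.</p>
      <preformat>
```python
from collections import Counter

def select_expansion_terms(ranked_docs, top_docs, top_terms):
    """Pick the terms occurring in most of the top-ranked documents.

    ranked_docs: merged multilingual ranking, best first; each entry is
    a list of (already stemmed) terms. Counting documents is a simple
    stand-in for a real term-selection weight.
    """
    counts = Counter()
    for terms in ranked_docs[:top_docs]:
        counts.update(set(terms))
    return [term for term, _ in counts.most_common(top_terms)]

def expand_query(query_terms, expansion_terms):
    """Append the selected terms that the query does not contain yet."""
    return query_terms + [t for t in expansion_terms if t not in query_terms]

# Toy multilingual ranking (English and French terms mixed).
ranking = [
    ["wine", "vin", "price"],
    ["wine", "vin"],
    ["wine", "harvest"],
    ["price", "price"],  # outside the top 3, ignored
]
selected = select_expansion_terms(ranking, top_docs=3, top_terms=2)
expanded = expand_query(["wine"], selected)
```
      </preformat>
      <p>In this toy ranking one selected term is already in the query and the other comes from another language, which mirrors the observation that the selected list is multilingual and that part of it is already aligned.</p>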
      <p>Note that blind relevance feedback (we have used Okapi BM25 in this experiment) usually selects
terms that are in the initial query. Thus, such terms will probably be aligned. The rest of the
selected terms are integrated by using mixed 2-step RSV. Two points should be noted:
1. Usually, blind relevance feedback is poorly suited to CLEF document collections.
2. We use the expanded query to apply 2-step RSV, re-weighting the documents retrieved for
each language, but the list of retrieved documents does not change (only the score
of such documents changes). We could also test whether the results improve when the expanded
query is sent to each monolingual collection. In that case, the monolingual lists of documents would be
modified. Then, we could apply 2-step RSV with the expanded query by recalculating the
score of these modified monolingual lists of documents instead of the lists retrieved by means
of the non-expanded query. In this way, new documents would be retrieved and evaluated.</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusions and future work</title>
      <p>In past years, we have used a merging approach called 2-step RSV with translations based on
MDR. This year we have used the proposed method with several Machine Translation resources.
In addition, the multilingual task requires working with very different languages (very different
alphabets and morphological structures). In other years we have tested the performance of 2-step
RSV with MDR, blind feedback and other languages and collections. In every experiment, the
proposed merging algorithm works well. It outperforms traditional merging approaches by about
20-40%. Thus, 2-step RSV is a very stable and scalable merging strategy. Another aim for this year
was the integration of learning-based algorithms such as logistic regression with 2-step RSV. The
results obtained have not been so good. We think that the idea is sound but the model may have been
trained poorly because we have no relevance assessments for one document collection (Le Monde
1995). A study in progress is evaluating this approach while filtering the CLEF 2004 relevance
assessments by eliminating the relevant documents of Le Monde 1995. Thus, the whole of the multilingual
collection would be covered by the relevance assessments used for training.</p>
      <p>Footnote 6: The concept query is the query used by 2-step RSV with aligned terms. A concept represents a term
independently of the language.</p>
      <p>
        In spite of the bad results, we think that the idea of global blind relevance feedback should improve
the performance of our CLIR model, so we will continue working on this point.
Finally, we are interested in the application of other learning algorithms instead of logistic
regression, such as Support Vector Machines (SVM) [
        <xref ref-type="bibr" rid="ref11 ref3">11, 3</xref>
        ] and Perceptron Learning Algorithm with
Uneven Margins (PLAUM)[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work has been supported by the Spanish Government (MCYT) with grant FIT-150500-2003-412.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Calvé</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Savoy</surname>
          </string-name>
          .
          <article-title>Database merging strategy based on logistic regression</article-title>
          .
          <source>Information Processing &amp; Management</source>
          ,
          <volume>36</volume>
          :
          <fpage>341</fpage>
          -
          <lpage>359</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <article-title>Cross-language retrieval experiments at CLEF-2002</article-title>
          . In Carol Peters, Martin Braschler, Julio Gonzalo, and Michael Kluck, editors,
          <source>Advances in Cross-Language Information Retrieval, Third Workshop of the Cross-Language Evaluation Forum, CLEF 2002. Rome, Italy, September 19-20, 2002. Revised Papers</source>
          , volume
          <volume>2785</volume>
          of Lecture Notes in Computer Science, pages
          <fpage>26</fpage>
          -
          <lpage>48</lpage>
          . Springer Verlag,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Cortes</surname>
          </string-name>
          and
          <string-name>
            <given-names>V.</given-names>
            <surname>Vapnik</surname>
          </string-name>
          .
          <article-title>Support-vector networks</article-title>
          .
          <source>Machine Learning</source>
          ,
          <volume>20</volume>
          :
          <fpage>273</fpage>
          -
          <lpage>297</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zaragoza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Herbrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shawe-Taylor</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Kandola</surname>
          </string-name>
          .
          <article-title>The perceptron algorithm with uneven margins</article-title>
          . In
          <source>Proceedings of the International Conference of Machine Learning (ICML'2002)</source>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>F.</given-names>
            <surname>Martínez-Santiago</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Martín</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.A.</given-names>
            <surname>Ureña</surname>
          </string-name>
          .
          <article-title>SINAI at CLEF 2002: Experiments with merging strategies</article-title>
          . In Carol Peters, Martin Braschler, Julio Gonzalo, and Michael Kluck, editors,
          <source>Advances in Cross-Language Information Retrieval, Third Workshop of the Cross-Language Evaluation Forum, CLEF 2002. Rome, Italy, September 19-20, 2002. Revised Papers</source>
          , volume
          <volume>2785</volume>
          of Lecture Notes in Computer Science, pages
          <fpage>103</fpage>
          -
          <lpage>110</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>F.</given-names>
            <surname>Martínez-Santiago</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Martín</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.A.</given-names>
            <surname>Ureña</surname>
          </string-name>
          .
          <article-title>A merging strategy proposal: the 2-step retrieval status value method</article-title>
          .
          <source>Technical Report</source>
          . Department of Computer Science of University of Jaén,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>F.</given-names>
            <surname>Martínez-Santiago</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Montejo-Ráez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.A.</given-names>
            <surname>Ureña</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.C.</given-names>
            <surname>Diaz</surname>
          </string-name>
          .
          <article-title>SINAI at CLEF 2003: Merging and decompounding</article-title>
          . In Carol Peters, Martin Braschler, Julio Gonzalo, and Michael Kluck, editors,
          <source>Proceedings of the CLEF 2003 Cross-Language Text Retrieval System Evaluation Campaign</source>
          , pages
          <fpage>99</fpage>
          -
          <lpage>109</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.F.</given-names>
            <surname>Porter</surname>
          </string-name>
          .
          <article-title>An algorithm for suffix stripping</article-title>
          .
          <source>Program</source>
          ,
          <volume>14</volume>
          , pages
          <fpage>130</fpage>
          -
          <lpage>137</lpage>
          ,
          <year>1980</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.E.</given-names>
            <surname>Robertson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Walker</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Beaulieu</surname>
          </string-name>
          .
          <article-title>Experimentation as a way of life: Okapi at TREC</article-title>
          .
          <source>Information Processing and Management</source>
          ,
          <volume>36</volume>
          (
          <issue>1</issue>
          ):
          <fpage>95</fpage>
          -
          <lpage>108</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Savoy</surname>
          </string-name>
          .
          <article-title>Cross-Language information retrieval: experiments based on CLEF 2000 corpora</article-title>
          .
          <source>Information Processing &amp; Management</source>
          ,
          <volume>39</volume>
          :
          <fpage>75</fpage>
          -
          <lpage>115</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>V.</given-names>
            <surname>Vapnik</surname>
          </string-name>
          .
          <source>The Nature of Statistical Learning Theory</source>
          . Springer, New York,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>