<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Retrieval of bilingual Spanish-English information by means of a standard automatic translation system</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Universidad de Salamanca, Facultad de Documentación</institution>
          <addr-line>c/ Fco. de Vitoria 6-16, 37008 Salamanca</addr-line>
          <country>ESPAÑA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes our participation in bilingual retrieval (queries in Spanish against documents in English) by means of an information retrieval system based on the vector model. The queries, formulated in Spanish, were translated into English with a commercial automatic translation system; the terms extracted from the resulting translations were filtered to remove empty words (stop words) and then normalised by stemming. Results are slightly more than 15% poorer than those obtained through monolingual retrieval with the original queries in English.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Our participation in CLEF 2000 is centred mainly on bilingual retrieval, by which we mean queries in
Spanish against a collection of documents in English. We have also run the very same
queries as originally formulated in English, in order to obtain a baseline of comparable results.</p>
      <p>When one tries to solve queries in a given language against documents written in a different one, the
problem is to get a homogeneous representation of both queries and documents, so that they can be
compared and a measure of similarity between them can be established [OARD96]. Once this
homogeneous representation has been obtained, the similarity between a query and each of the documents
of the collection may be computed by means of any of the systems usually employed for monolingual
retrieval.</p>
      <p>For term-based information-retrieval systems, as is the case of the vector model [SALTON83], the
question is to ensure that the terms that represent documents and queries are in the same language. One way
or another, this involves some sort of translation; at least in principle, translating queries seems to be
less expensive than translating whole documents. In either case, the problem is the translation of individual
terms, which seems less complex than translating a syntactically structured text. The main problem,
beyond obtaining a machine-readable bilingual dictionary, is to disambiguate those terms: each of them may
have several meanings, and each meaning may have a different equivalent in the other language. It is not
easy to determine the proper equivalent in each case, and several methods have been proposed for this
purpose [AGIRRE2000]; final results depend, to a large extent, on the quantity and quality of the semantic
knowledge contained in the lexicons and dictionaries employed.</p>
      <p>An obvious alternative approach to the problem of bilingual retrieval is to use an automatic
translation system; quite a number of such systems are commercially available. However, these systems are
not highly regarded, since in general the translations they produce contain many mistakes and,
occasionally, are not acceptable from a linguistic point of view. It must be noted, however, that the
linguistic requirements of retrieval systems are rather lower than those of people who must read and
understand the translations [HULL96]. In fact, many information-retrieval systems do not use or consider
syntactic constructs and, when terms undergo some kind of normalisation process, they also ignore
morphology.</p>
      <p>Using one of these commercial automatic translation systems poses no difficulties and, in
our case, given our lack of experience in bilingual retrieval, seems to be a good way to start on this subject.</p>
    </sec>
    <sec id="sec-2">
      <title>Experiment</title>
      <p>The retrieval engine we have used is our own software, which we call Karpanta
[FIGUEROLA2000]. It is a simple program, based on the vector model, and it was designed for
educational rather than production purposes. It works, although it is rather slow for large numbers of
documents. In any case, the goal of our work is to check the effectiveness of a standard automatic
translation system when it is applied to information retrieval, rather than to evaluate Karpanta as a monolingual retrieval
technique.</p>
      <p>Hence, we used Karpanta to index the whole collection of documents (in English), keeping all of their fields.
Empty words (stop words) were eliminated first, using a standard list of
approximately 200 English empty words.</p>
      <p>Non-empty words were stemmed by means of Porter’s well-known algorithm [PORTER80], using
a Perl script that implements it and is widely distributed through
CPAN [PHILLIPS95]. The weights of the resulting terms or stems were calculated by means of the
usual scheme: term frequency in the document × IDF (inverse document frequency).</p>
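      <p>The indexing steps described above (elimination of empty words, stemming, term frequency × IDF weighting) can be sketched as follows. This is a minimal illustration and not Karpanta's actual code: the tiny stop list, the placeholder stemmer <monospace>identity_stem</monospace>, the function names and the choice of IDF as log(N/df) are all assumptions made for the example.</p>
      <preformat>
```python
import math
import re
from collections import Counter

# Tiny illustrative stop list (assumption); the paper used a standard
# list of approximately 200 English empty words.
STOP_WORDS = {"the", "of", "a", "an", "in", "and", "to", "is", "are", "on"}


def identity_stem(word):
    """Placeholder stemmer (assumption); a Porter stemmer would be plugged in here."""
    return word


def tokenize(text, stem=identity_stem):
    """Lowercase, split on non-letters, drop empty words, then stem."""
    words = re.findall(r"[a-z]+", text.lower())
    return [stem(w) for w in words if w not in STOP_WORDS]


def build_index(docs, stem=identity_stem):
    """Return (weights, idf): per-document tf x idf weights and the idf table."""
    term_counts = {doc_id: Counter(tokenize(text, stem)) for doc_id, text in docs.items()}
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())
    n_docs = len(docs)
    # IDF taken as log(N / df); the paper does not state which variant was used.
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    weights = {
        doc_id: {term: tf * idf[term] for term, tf in counts.items()}
        for doc_id, counts in term_counts.items()
    }
    return weights, idf
```
      </preformat>
      <p>With this variant, a term that occurs in every document receives a weight of zero, so it does not contribute to the ranking.</p>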
      <p>The original queries, in English, were dealt with in the same way. We used all of their fields, eliminated
empty words and stemmed the remaining ones, weighting the resulting stems as before. Query
resolution, that is, the computation of the similarity between each query and each document, was carried out by
means of the usual cosine formula; the results thus obtained serve as the baseline for
comparison with the results we obtained afterwards in bilingual retrieval.</p>
      <p>Spanish queries were translated into English by means of an automatic translation system. Actually,
we tried various commercial systems: Systran, Power Translation Pro and Spanish Assistant. Although most
automatic translation systems allow for some kind of adaptation to context and training, using such things as
specific lexicons, translation memories, etc., we did not use any of these possibilities. In the case of
Systran, we used the web-accessible version [SYSTRAN2000]. All the systems we tried
produced rather similar translations; they also tended to produce remarkably similar errors. We finally
chose Systran, since it seems to be better at recognising proper nouns and
at translating them, whenever that is possible.</p>
      <p>The translations thus obtained were processed in the same way as the original queries in English:
elimination of empty words, stemming, computation of weights and calculation of similarity with each
document.</p>
      <p>A comparison between the stems produced from these translations and those produced from the
original queries in English shows how much they diverge: on average, 28% of the stems
are different. This does not mean they are necessarily incorrect, since in some cases the translation may
have used synonymous or semantically equivalent terms.</p>
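      <p>The paper does not spell out how the percentage of differing stems was computed. One plausible reading, assumed here purely for illustration, is the share of distinct stems of the original English query that do not appear among the stems of the translated query, averaged over all queries. The function names below are hypothetical.</p>
      <preformat>
```python
def stem_divergence(original_stems, translated_stems):
    """Share of distinct stems of the original query missing from the translation."""
    original = set(original_stems)
    if not original:
        return 0.0
    missing = original - set(translated_stems)
    return len(missing) / len(original)


def mean_divergence(pairs):
    """Average divergence over (original_stems, translated_stems) pairs."""
    values = [stem_divergence(orig, trans) for orig, trans in pairs]
    return sum(values) / len(values) if values else 0.0
```
      </preformat>
      <p>A stem counted as missing under this measure may still correspond to a valid synonym in the translation, so the figure is an upper bound on actual translation errors.</p>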
      <p>[Figure: Bilingual Retrieval Spanish-English. Precision-recall graph; the vertical axis is precision, and both axes run from 0 to 1.]</p>
      <p>The results we have obtained with the queries translated from Spanish into English give an average precision of
0.2273; they are shown in the previous graph. However, results vary considerably
between queries (standard deviation = 0.23).</p>
      <p>On the other hand, if we compare these results with those obtained from the original queries in
English (with an average precision of 0.27), they are clearly inferior. The precision-recall curves are almost
parallel. Moreover, if we examine each individual query, it can be seen that the ones that produce the best
results in English are also the ones that work best in the Spanish-to-English translation. Similarly, the
queries that produce the worst results are the same, both for the original (English) queries and for the
queries translated from Spanish.</p>
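      <p>The summary figures quoted above can be related to one another with a few lines of code. The sketch below assumes that the per-query average precision values are available as a list; it uses the population standard deviation, which is an assumption, since the paper does not say which estimator was used. The function names are hypothetical.</p>
      <preformat>
```python
from statistics import mean, pstdev


def summarise(per_query_precision):
    """Mean and population standard deviation of per-query average precision."""
    return mean(per_query_precision), pstdev(per_query_precision)


def relative_drop(baseline, other):
    """Fractional loss of `other` with respect to `baseline`."""
    return (baseline - other) / baseline


# Averages reported in the text: 0.27 (English) and 0.2273 (translated).
drop = relative_drop(0.27, 0.2273)
print(f"relative drop: {drop:.1%}")
```
      </preformat>
      <p>With the two averages reported in the text, the relative drop comes out at about 15.8%, which is consistent with the loss of slightly more than 15% stated in the abstract.</p>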
    </sec>
    <sec id="sec-3">
      <title>Conclusions</title>
      <p>The use of a standard automatic translation system to solve bilingual retrieval tasks is an easy and
fast solution, although the retrieval effectiveness we achieved is clearly lower than that obtained by
means of monolingual queries. The reduction is about 15%, although it is smaller at low levels of
recall (that is, taking into account just the first few documents retrieved).</p>
      <p>[AGIRRE2000] Agirre, E., Atserias, J., Padró, L. and Rigau, G.: "Combining Supervised and Unsupervised Lexical Knowledge Methods for Word Sense Disambiguation", Computers and the Humanities, Special Double Issue on SensEval, eds. Martha Palmer and Adam Kilgarriff, 34:1,2, 2000 [http://www.lsi.upc.es/~nlp/papers/chum99-arpa.ps.gz]</p>
      <p>[FIGUEROLA2000] Figuerola, C.G.; Alonso Berrocal, J.L. &amp; Zazo Rodríguez, A.F.: "Diseño de un motor de recuperación para uso experimental y educativo", BiD: textos universitaris de biblioteconomia i documentació, 4 [http://www.ub.es/biblio/bid/04figue2.htm]</p>
      <p>[HULL96] Hull, D.A. &amp; Grefenstette, G.: "Querying Across Languages: A Dictionary-Based Approach to Multilingual Information Retrieval", SIGIR 96, 49-57</p>
      <p>[OARD96] Oard, D. &amp; Dorr, B.J.: "A Survey of Multilingual Retrieval" [http://www.clis.umd.edu/dlrg/filter/papers/mlir.ps]</p>
      <p>[PHILLIPS95] Phillips, I. [http://www.perl.com/CPAN-local/authors/Ian_Phillipps/Stem-0.1.tar.gz]</p>
      <p>[SALTON83] Salton, G. &amp; McGill, M.: Introduction to Modern Information Retrieval, New York, McGraw-Hill, 1983</p>
      <p>[SYSTRAN2000] Systran Software: SYSTRAN - Translation Technologies, Language Translator, Online dictionary, Translate English [http://www.systransoft.com]</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>