                         ITC-irst at CLEF 2003:
      Monolingual, Bilingual, and Multilingual Information Retrieval
                                 Nicola Bertoldi and Marcello Federico
                             ITC-irst - Centro per la Ricerca Scientifica e Tecnologica
                                            I-38050 Povo, Trento, Italy.

                                                        Abstract
         This paper reports on the participation of ITC-irst in the Cross Language Evaluation Forum 2003; in
         particular, in the monolingual, bilingual, small multilingual, and spoken document retrieval tracks. Con-
         sidered languages were English, French, German, Italian, and Spanish. With respect to our CLEF 2002
         system, the statistical models for bilingual document retrieval have been improved, more languages
         have been considered, and a novel multilingual information retrieval system has been developed, which
         combines several bilingual retrieval models into a statistical framework. As in the last CLEF, bilingual
         models integrate retrieval and translation scores over the set of N-best translations of the source query.


                 1. Introduction

This paper reports on the participation of ITC-irst in the Cross Language Evaluation Forum (CLEF) 2003. Several tracks were addressed: monolingual document retrieval in Italian, French, German, and Spanish; bilingual document retrieval from German to Italian and from Italian to Spanish; small multilingual document retrieval from English to English, German, French, and Spanish; and, finally, cross-language spoken document retrieval from French, German, Italian, and Spanish to English.

The statistical cross-language information retrieval (CLIR) model presented at the 2002 CLEF evaluation (Federico and Bertoldi, 2002) was extended in order to cope with a multilingual target collection. Moreover, better query-translation probabilities were obtained by exploiting bilingual dictionaries and statistics from monolingual corpora. Basically, the ITC-irst system presented at the 2002 CLEF evaluation was expanded with a module for merging the document rankings of different document collections generated by different bilingual systems.

Each bilingual system features a statistical model, which generates a list of the N-best query translations, and a basic IR engine, which integrates scores, computed by a standard Okapi model and a statistical language model, over multiple translations. Remarkably, training of the system's parameters just requires a bilingual dictionary, the target document collection, and a document collection in the source language.

This paper is organized as follows. Section 2 introduces the statistical approach to multilingual IR. Section 3 briefly summarizes the main features of our system and describes the retrieval procedure. Sections 4 and 5 present experimental results for each track we participated in. Section 6 closes the paper.

     2. Multilingual Information Retrieval: Statistical approach

Multilingual information retrieval can be defined as the task of finding and ranking documents, which are relevant for a given topic, within a collection of texts in several languages. As we know the language of each document, we may view the multilingual target collection as the union of distinct monolingual collections.

2.1. Multilingual retrieval model

Let a multilingual collection D contain documents in L different languages, where D results from the union of L monolingual sub-collections D_1, ..., D_L. Let f be a query in a given source language, possibly different from any of the L languages. One would like to rank documents d within the multilingual collection D according to the posterior probability:

    Pr(d | f) \propto Pr(f, d)                                            (1)

where the right term of formula (1) follows from the constancy of Pr(f) with respect to the ranking of documents.

A hidden variable l is introduced, which represents the language of either a sub-collection or a document:

    Pr(f, d) = \sum_l Pr(l, f, d) = \sum_l Pr(l) Pr(f, d | l)             (2)

where Pr(l) is an a-priori distribution over languages, which can be estimated from the multilingual collection or taken uniform. Formula (2) shows a weighted mixture of bilingual IR models depending on the sub-collection. However, given that we know the language each document is written in, we can assume that the probability Pr(f, d | l) is larger than zero only if d belongs to the sub-collection D_l.
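In practice, formula (2) amounts to scoring each document with the bilingual model of its own sub-collection, weighted by the language prior. The following Python sketch illustrates this mixture; the bilingual_score callback, which stands for Pr(f, d | l), is a hypothetical placeholder:

    def multilingual_scores(query, collections, bilingual_score, prior=None):
        # Mixture of formula (2): Pr(f, d) = Pr(l) Pr(f, d | l), where
        # Pr(f, d | l) > 0 only for documents of the sub-collection D_l.
        # `collections` maps a language to its documents; `bilingual_score`
        # is a hypothetical callback returning Pr(f, d | l).
        if prior is None:  # a-priori language distribution, taken uniform here
            prior = {lang: 1.0 / len(collections) for lang in collections}
        scored = []
        for lang, docs in collections.items():
            for doc in docs:
                scored.append((prior[lang] * bilingual_score(query, doc, lang),
                               lang, doc))
        # one global ranking over the union of the monolingual sub-collections
        return sorted(scored, key=lambda t: t[0], reverse=True)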
Next, a hidden variable e is introduced, which represents a (term-by-term) translation of f into one of the L languages. Hence, we derive the following decomposition:

    Pr(f, d | l) = \sum_e Pr(f, e, d | l) \approx \sum_e Pr(f, e | l) Pr(d | e, l)    (3)

In deriving formula (3), we make the assumption (or approximation) that the probability of document d, given query f, translation e, and language l, does not depend on f. Formula (3) puts in evidence a language-dependent query-translation model, Pr(f, e | l), and a collection-dependent query-document model, Pr(d | e, l).

The language-dependent query-translation model is defined as follows:

    Pr(f, e | l) = Pr(f | l) Pr_l(e | f)
                 \propto \begin{cases} Pr_l(f, e) / \sum_{e' \in T_l(f)} Pr_l(f, e') & \text{if } e \in T_l(f) \\ 0 & \text{otherwise} \end{cases}

where T_l(f) is the set of all translations of f into language l. For practical reasons, this set is approximated with the set of the N most probable translations computed by the basic query-translation model Pr_l(f, e). The term Pr(f | l) can be considered independent of l and hence be discarded. The normalization introduced above is needed in order to obtain ranking scores which are comparable among different languages.
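The renormalization over the N-best set is straightforward; a minimal sketch, assuming the basic query-translation model has already produced (translation, score) pairs:

    def nbest_translation_posteriors(nbest):
        # Renormalize the joint scores Pr_l(f, e) over the N-best set T_l(f);
        # translations outside the list are treated as having probability zero.
        total = sum(score for _, score in nbest)
        return [(e, score / total) for e, score in nbest]

    # e.g. nbest_translation_posteriors([("e1", 0.03), ("e2", 0.01)])
    # returns [("e1", 0.75), ("e2", 0.25)]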
The collection-dependent query-document model is derived from a basic query-document model Pr_l(d | e) as follows:

    Pr(d | e, l) = \begin{cases} Pr_l(d, e) / \sum_{d' \in I(e, l)} Pr_l(d', e) & \text{if } d \in I(e, l) \\ 0 & \text{otherwise} \end{cases}

where I(e, l) is the set of documents in D_l containing at least a word of e.

The basic query-document and query-translation models are now briefly described; more details can be found in (Bertoldi and Federico, 2002). The subscript l, which refers to the specific language or collection the models are estimated on, will be omitted without loss of generality.

2.2. Basic Query-Document Model

The query-document model computes the joint probability of a query e and a document d, written in the same language. The query-document model considered in the experiments results from the combination of two different models: a language model and an Okapi-based scoring function.

Language Model. The joint probability can be factored as follows:

    Pr(e, d) = Pr(e | d) Pr(d)                                            (4)

where the a-priori probability of d, Pr(d), is assumed to be uniform, and the probability of e given d to be an order-free multinomial (bag-of-words) model:

    Pr(e = e_1, ..., e_n | d) = \prod_{k=1}^{n} p(e_k | d)                (5)

Okapi. The joint probability can be obtained through the normalization over queries and documents of a generic scoring function s(e, d):

    Pr(e, d) = s(e, d) / \sum_{e', d'} s(e', d')                          (6)

The denominator is considered only for the sake of normalization, but can be disregarded in the computation of equation (3). A scoring function derived from the standard Okapi formula is used:

    s(e = e_1, ..., e_n, d) = \prod_{k=1}^{n} idf(e_k) W_d(e_k)           (7)

Combination. Previous work (Bertoldi and Federico, 2001) showed that the two models rank documents almost independently. Hence, information about the relevant documents can be gained by integrating the scores of both methods. Combination of the two models is implemented by just taking the sum of scores, after a suitable normalization.
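A compact sketch of the two scoring functions and their combination follows; the smoothing of p(e_k | d), the min-max normalization, and the idf/W_d callbacks are our assumptions, as the paper does not spell them out:

    import math
    from collections import Counter

    def lm_logscore(query_terms, doc_terms, vocab_size, lam=0.5):
        # Bag-of-words model of formula (5), in log space; p(e_k | d) is
        # smoothed here by interpolation with a uniform distribution
        # (the smoothing scheme is our assumption, not the paper's).
        counts, n = Counter(doc_terms), len(doc_terms)
        return sum(math.log(lam * counts[t] / n + (1.0 - lam) / vocab_size)
                   for t in query_terms)

    def okapi_logscore(query_terms, doc_terms, idf, w_d):
        # Product of formula (7), in log space; idf(t) and w_d(t, doc) are
        # callbacks supplied by the underlying retrieval engine.
        return sum(math.log(idf(t) * w_d(t, doc_terms)) for t in query_terms)

    def combine(scores_lm, scores_okapi):
        # Sum of the two scores after min-max normalization over each ranked
        # list (one plausible reading of "a suitable normalization").
        def norm(scores):
            lo, hi = min(scores.values()), max(scores.values())
            span = (hi - lo) or 1.0
            return {d: (v - lo) / span for d, v in scores.items()}
        a, b = norm(scores_lm), norm(scores_okapi)
        return {d: a[d] + b[d] for d in a}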
2.3. Basic Query-Translation Model

The query-translation model computes the probability of any query-translation pair. This probability is modeled by an HMM (Rabiner, 1990) in which the observable variable is the query f in the source language, and the hidden variable is its translation e in the target language. According to the HMM, the joint probability of a pair (f, e) is decomposed as follows:

    Pr(f = f_1, ..., f_n, e = e_1, ..., e_n) = p(e_1) \prod_{k=2}^{n} p(e_k | e_{k-1}) \prod_{k=1}^{n} p(f_k | e_k)    (8)
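The following sketch evaluates formula (8) and computes the best translation by dynamic programming (Viterbi); the system actually extracts the N best translations, for which the search would keep several back-pointers per state:

    def hmm_joint(f, e, p_init, p_trans, p_emit):
        # Joint probability of formula (8) for a query f = f_1..f_n and a
        # candidate term-by-term translation e = e_1..e_n.
        p = p_init(e[0])
        for k in range(1, len(e)):
            p *= p_trans(e[k], e[k - 1])
        for f_k, e_k in zip(f, e):
            p *= p_emit(f_k, e_k)
        return p

    def viterbi_best(f, candidates, p_init, p_trans, p_emit):
        # 1-best translation under the HMM; candidates[k] is the dictionary's
        # translation set for f_k. This sketch keeps only the single best path.
        best = {e0: (p_init(e0) * p_emit(f[0], e0), [e0]) for e0 in candidates[0]}
        for k in range(1, len(f)):
            new = {}
            for e_k in candidates[k]:
                p, path = max(((p * p_trans(e_k, prev), path)
                               for prev, (p, path) in best.items()),
                              key=lambda t: t[0])
                new[e_k] = (p * p_emit(f[k], e_k), path + [e_k])
            best = new
        return max(best.values(), key=lambda t: t[0])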
The term translation probabilities p(f | e) are estimated from a bilingual dictionary as follows:

    Pr(f | e) = \delta(f, e) / \sum_{f'} \delta(f', e)                    (9)

where \delta(f, e) = 1 if the term e is one of the translations of term f, and \delta(f, e) = 0 otherwise. This flat distribution can be refined through the EM algorithm (Dempster et al., 1977) by exploiting a large corpus in the source language.
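A sketch of this flat estimate, with the dictionary represented as a map from source terms to their translation sets:

    def flat_translation_probs(dictionary):
        # Flat distribution of formula (9): delta(f, e) = 1 iff e is a
        # translation of f. `dictionary` maps a source term f to its set of
        # target translations; the result maps (f, e) to Pr(f | e).
        sources_of = {}
        for f, translations in dictionary.items():
            for e in translations:
                sources_of.setdefault(e, set()).add(f)
        return {(f, e): 1.0 / len(fs)
                for e, fs in sources_of.items() for f in fs}

    # e.g. flat_translation_probs({"house": {"casa"}, "home": {"casa", "focolare"}})
    # gives Pr(house | casa) = Pr(home | casa) = 0.5 and Pr(home | focolare) = 1.0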
The target LM probabilities p(e | e') are estimated on the target document collection through an order-free bigram LM, which tries to compensate for the different word positions induced by the source and target languages. Let

    p(e | e') = p(e, e') / \sum_{e''} p(e'', e')                          (10)

where p(e, e') is the probability of e co-occurring with e', regardless of the order, within a text window of fixed size. Smoothing of this probability is performed through absolute discounting and interpolation.

[Figure 1: Architecture of the multilingual IR system. The source query is preprocessed and fed to one bilingual IR system per language, each with its own bilingual dictionary and document collection; the resulting rankings are merged into a single list of ranked documents.]
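A sketch of the order-free bigram estimate of formula (10) with absolute discounting; the window size and discount value shown here are illustrative assumptions:

    from collections import Counter, defaultdict

    def cooccurrence_bigram_lm(docs, window=5, discount=0.7):
        # Order-free bigram LM of formula (10): p(e | e') is estimated from
        # co-occurrence counts within a fixed-size window, smoothed through
        # absolute discounting and interpolation with the unigram distribution.
        pair = defaultdict(Counter)   # pair[e_prev][e] = co-occurrence count
        unigram = Counter()
        for doc in docs:
            unigram.update(doc)
            for i, w in enumerate(doc):
                for v in doc[max(0, i - window):i]:
                    pair[v][w] += 1   # count both directions: order-free
                    pair[w][v] += 1
        total = float(sum(unigram.values()))

        def p(e, e_prev):
            ctx = pair[e_prev]
            n_ctx = sum(ctx.values())
            p_uni = unigram[e] / total          # interpolation distribution
            if n_ctx == 0:
                return p_uni
            lam = discount * len(ctx) / n_ctx   # mass freed by discounting
            return max(ctx[e] - discount, 0.0) / n_ctx + lam * p_uni
        return p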

             3. System architecture

As shown in Section 2, the ITC-irst multilingual IR system features several independent bilingual retrieval systems, which return collection-dependent rankings, and a module for merging these results into a global ranking with respect to the whole multilingual collection. Moreover, language-dependent text preprocessing modules have been implemented to process documents and queries. Figure 1 shows the architecture of the system.

Two merging criteria were developed. The first, which we call the stat method, implements the statistical model introduced in Section 2: for each language, the language-dependent relevance scores of documents computed by the bilingual IR systems are normalized in order to obtain language-independent scores, and, hence, a global ranking is created.

The second criterion, which we call the rank method, exploits the document rank positions only, i.e., all the collection-dependent rank lists are joined and documents are globally sorted according to the inverse of their original rank position.
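The two criteria can be sketched as follows; each run is a ranked list of (document, score) pairs, and the sum-to-one normalization in merge_stat is one plausible reading of the normalization described above:

    def merge_stat(runs):
        # "stat" merging: normalize the language-dependent relevance scores
        # so that they are comparable across languages, then rank globally.
        pooled = []
        for run in runs:
            z = sum(score for _, score in run) or 1.0
            pooled += [(score / z, doc) for doc, score in run]
        return [doc for _, doc in sorted(pooled, reverse=True)]

    def merge_rank(runs):
        # "rank" merging: join the collection-dependent lists and sort the
        # documents by the inverse of their original rank position.
        pooled = [(1.0 / (pos + 1), doc)
                  for run in runs for pos, (doc, _) in enumerate(run)]
        return [doc for _, doc in sorted(pooled, reverse=True)]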
Monolingual and bilingual versions of the system trivially follow by omitting the query-translation model and by limiting the collection to one language, respectively.

3.1. Preprocessing

In order to homogenize the preparation of data and, hence, to reduce workload, a standard procedure was defined. More specifically, the following preprocessing steps were applied both to documents and queries in every language (a minimal sketch follows the list):

  • Tokenization was performed to separate words from punctuation marks, to recognize abbreviations and acronyms, to correct possible word splits across lines, and to discriminate between accents and quotation marks.

  • Stemming was performed by using a language-dependent Porter-like algorithm (Frakes and Baeza-Yates, 1992), freely available at snowball.tartarus.org.

  • Stop-term removal was applied to the documents by removing terms included in a language-dependent public list (www.unine.ch/info/clef).

  • Proper names and numbers in queries were recognized in order to improve the coverage of the dictionary.

  • Out-of-dictionary terms which had not been recognized as proper names or numbers were removed.
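A minimal sketch of this chain, using the Snowball stemmers named above (here through their NLTK packaging) and a supplied stop list; the more delicate steps are only hinted at in the comments:

    import re
    from nltk.stem.snowball import SnowballStemmer  # snowball.tartarus.org stemmers

    def preprocess(text, language="italian", stopwords=frozenset()):
        # Minimal sketch of the per-language chain: tokenization, stemming,
        # stop-term removal. The real pipeline also handles abbreviations,
        # word splits across lines, accents, proper names, and numbers.
        stemmer = SnowballStemmer(language)
        tokens = re.findall(r"\w+", text.lower())   # crude tokenizer
        return [stemmer.stem(t) for t in tokens if t not in stopwords]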
3.2. Blind Relevance Feedback

After document ranking, the following Blind Relevance Feedback (BRF) technique was applied: first, the documents matching the source query e are ranked, then the B best-ranked documents are taken, the R most relevant terms in them are added to the query, and the retrieval phase is repeated. In the CLIR framework, the R terms are added to each single translation of the N-best list and the retrieval algorithm is repeated once again. In this work, 15 new search terms are selected from the top 5 documents according to the Offer Weight proposed in (Johnson et al., 1999).
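A sketch of the expansion step, assuming the usual form of the Offer Weight, i.e. r times the Robertson/Sparck-Jones relevance weight:

    import math

    def offer_weight(r, n, R, N):
        # Offer Weight in its usual form (Johnson et al., 1999): r * RW,
        # with RW the Robertson/Sparck-Jones relevance weight; r = top-ranked
        # documents containing the term, n = collection documents containing
        # it, R = number of top-ranked documents, N = collection size.
        rw = math.log((r + 0.5) * (N - n - R + r + 0.5) /
                      ((n - r + 0.5) * (R - r + 0.5)))
        return r * rw

    def expand_query(query, top_docs, doc_freq, N, n_new=15):
        # Add the n_new highest-weighted terms from the top-ranked documents
        # (15 terms from the top 5 documents in this work); `top_docs` is a
        # list of term sets, `doc_freq` maps a term to its document frequency.
        R = len(top_docs)
        candidates = {t for d in top_docs for t in d} - set(query)
        ranked = sorted(candidates, reverse=True,
                        key=lambda t: offer_weight(sum(t in d for d in top_docs),
                                                   doc_freq[t], R, N))
        return list(query) + ranked[:n_new]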
          4. Experimental Evaluation

ITC-irst submitted 4 monolingual runs in French, German, Italian, and Spanish, 4 Italian-Spanish bilingual runs, 2 German-Italian bilingual runs, and 4 small multilingual runs using queries in English to search documents in English, French, German, and Spanish. Moreover, some unofficial experiments were performed for the sake of comparison.

4.1. Data

In Table 1, statistics about the target collections for the five considered languages are reported.

       Language         #docs          #words
       English         166,754      100,971,969
       French          129,809       52,275,689
       German          294,809       99,461,570
       Italian         153,208       54,434,345
       Spanish         454,045      171,971,487
       Multi-4        1,045,417     424,680,715

     Table 1: Statistics about target collections.

Table 2 reports statistics about the topics and the corresponding relevant documents in each collection (topics with no relevant document are not considered).

          Language      #queries     #rel.docs
          English         54           1006
          French          52            946
          German          56           1825
          Italian         51            809
          Spanish         57           2368
          Multi-4         60           6145

           Table 2: Statistics about queries.

Bilingual dictionaries from English to the other languages were gathered from publicly available resources. Unfortunately, German-Italian and Italian-Spanish dictionaries were not available. Hence, the missing dictionaries were built from other available dictionaries using English as a pivot language. For example, an Italian-Spanish dictionary was derived by exploiting the Spanish-English and Italian-English dictionaries as follows: the translation alternatives of an Italian term are all Spanish translations of all English translations of that term. Table 3 reports some statistics of the bilingual dictionaries. It is worth noticing that for the generated dictionaries the average number of translation alternatives is about twice that of the original dictionaries. This would suggest that they contain two wrong translations per entry, on average.

    Dictionary          #entries     avg. #translations
    English-French       44,728             1.97
    English-German      131,429             1.88
    English-Italian      44,195             1.95
    English-Spanish      47,305             1.83
    Italian-Spanish      66,059             3.94
    German-Italian      103,618             3.91

         Table 3: Statistics about dictionaries.
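The pivot construction can be sketched as follows; the toy example also shows how spurious alternatives creep in through ambiguous pivot terms:

    def pivot_dictionary(src_to_en, en_to_tgt):
        # Build a missing dictionary through English as a pivot: the
        # translation alternatives of a source term are all target
        # translations of all of its English translations.
        pivoted = {}
        for src, en_terms in src_to_en.items():
            targets = {t for en in en_terms for t in en_to_tgt.get(en, ())}
            if targets:
                pivoted[src] = targets
        return pivoted

    # e.g. pivot_dictionary({"casa": {"house", "home"}},
    #                       {"house": {"casa"}, "home": {"casa", "hogar"}})
    # gives {"casa": {"casa", "hogar"}} for Italian-Spanish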
Moreover, all term translation probabilities, except the German-Italian ones, were estimated through the EM algorithm by using the corresponding document collections.

4.2. Results

Table 4 reports the main settings and the official mAvPr scores for each run. In particular, the number of N-best translations (1 vs. 10), the type of bilingual dictionary (flat vs. estimated through the EM algorithm), and the merging policy (rank vs. stat) are indicated. Source and target languages are indicated in the run name.

Monolingual results. As shown in Table 4, our monolingual retrieval system achieves good results for all languages. More than 70% of the queries have a mAvPr greater than or equal to the median values. It is worth noticing that the mAvPrs are almost the same for all languages.

Bilingual results. The Italian-Spanish results show that the estimation of translation probabilities through the EM algorithm is quite effective, especially in combination with the 10-best translations.

  Language      monolingual     bilingual from English
  French          .5339                 .4297
  German          .5173                 .4378
  Italian         .5397                 .4184
  Spanish         .5375                 .4298

Table 5: Comparison of monolingual and bilingual performance.

Table 5 reports the mAvPr of monolingual and bilingual runs for every language; the 10-best translations were obtained with EM-estimated translation probabilities. A relative degradation between 15% and 22% is always observed. This means that the translation process causes almost equal losses in performance for each language pair.
  Official Run      Setting                 mAvPr    <mdn    =mdn    >mdn    =bst
  IRSTfr 1                                  .5339     15      10      27      11
  IRSTde 1                                  .5173     16       5      35       6
  IRSTit 1                                  .5397     11       8      32      10
  IRSTes 1                                  .5375     17       3      37       5
  IRSTit2es 1       10-best, EM             .4262     31       1      25       2
  IRSTit2es 2       10-best, flat           .4006     36       1      20       2
  IRSTit2es 3       1-best, EM              .4053     33       1      23       2
  IRSTit2es 4       1-best, flat            .4009     35       1      21       2
  IRSTde2it 1       10-best, flat           .2291     38       0      18       0
  IRSTde2it 2       1-best, flat            .2437     36       0      20       0
  IRSTen2xx 1       10-best, EM, rank       .3147     23       1      36       0
  IRSTen2xx 2       10-best, EM, stat       .3089     22       2      36       1
  IRSTen2xx 3       10-best, flat, rank     .3084     25       2      33       0
  IRSTen2xx 4       10-best, flat, stat     .3036     25       1      34       1

Table 4: Main settings and results of the official runs. Comparison against the median and best values: the last four columns report, for each run, the number of queries scoring below, at, and above the median, and the number achieving the best result.



Multilingual results. As shown in Table 4, about 60% of the queries have a mAvPr greater than or equal to the median values. The merging method based on the rank is a little more effective, but the differences are very small. Again, the EM estimation of term probabilities slightly improves performance.

The merging criteria were also applied to the monolingual runs, in order to obtain an upper bound for our multilingual retrieval system. The achieved mAvPrs for this virtual experiment were .3754 and .3667 for the "rank" and "stat" criteria, respectively. The relative degradation is very similar to that observed for the bilingual experiments.
      5. Cross-Language Spoken Document Retrieval

ITC-irst also participated in the Cross-Language Spoken Document Retrieval (CLSDR) track, which consists in searching for relevant stories within a collection of automatically transcribed English broadcast news. Topics consist of 50 short queries manually translated from English into French, German, Italian, and Spanish. For the CLSDR track, the bilingual version of the ITC-irst IR system was applied, with small changes in the BRF expansion of queries. Moreover, German texts were also processed to split compound words, by using a dynamic-programming (DP) based algorithm.

5.1. Query expansion on parallel corpora

As the number of stories in the SDR target collection was quite small, a double query expansion policy was chosen: new terms are added which are extracted not only from the target collection, but also from a large corpus of written texts, consisting of newspapers and news wires. As a parallel corpus for query expansion, newspaper articles of the North American News Text corpus were used (www.nist.gov/speech/tests/sdr). In particular, 313K documents were extracted from the Los Angeles Times, Washington Post, New York Times, and Associated Press Worldstream, issued between September 1997 and April 1998. Unfortunately, the available texts do not entirely cover the test period. The following strategy was chosen: query expansion was performed first on the parallel texts, and then on the target collection.
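The resulting double expansion is simply the composition of two BRF steps; all four arguments of this sketch are hypothetical callbacks:

    def double_expansion(query, search_parallel, search_target, expand):
        # Two-stage blind relevance feedback: expand the query first against
        # the large parallel newspaper corpus, then against the small target
        # collection of broadcast transcripts. `search_*` return ranked
        # document lists; `expand` is an offer-weight expansion such as the
        # one sketched in Section 3.2.
        expanded = expand(query, search_parallel(query))   # on parallel texts
        return expand(expanded, search_target(expanded))   # on target collection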
5.2. Results

Table 6 reports the official submitted runs, together with some unofficial runs (in italics) used for comparison.

  Official run              Query    mAvPr
  mono-brf                   EN      .3944
  mono-brf-brf               EN      .4244
  fr-en-1bst-brf-bfr         FR      .2281
  fr-en-sys-brf-bfr          FR      .3064
  de-en-dec-1bst-brf-bfr     DE      .2676
  de-en-1bst-brf-bfr         DE      .2523
  de-en-sys-brf-bfr          DE      .2880
  it-en-1bst-brf-bfr         IT      .2347
  it-en-sys-brf-bfr          IT      .3218
  es-en-1bst-brf-bfr         ES      .2746
  es-en-sys-brf-bfr          ES      .3555

  Table 6: mAvPr results of the CLSDR track at CLEF 2003.

The official English monolingual run was performed in order to evaluate the quality of the retrieval system. ITC-irst performance is about 10% above that of the other participants. In this run, query expansion on the parallel corpus was not applied; when it is applied (run mono-brf-brf), a relative improvement of 7% is observed. As the double query expansion policy is quite effective, it was applied in all the other experiments.
In the bilingual experiments, queries were translated either through our 1-best translation approach or by the Babelfish translation service, powered by Systran, which is available on the Internet (world.altavista.com). Run names indicate the two settings with 1bst and sys, respectively. The commercial translations outperform our approach.

German word decompounding seems to be slightly effective, as shown by comparing the runs without decompounding (de-en-1bst-brf-bfr) and with decompounding (de-en-dec-1bst-brf-bfr).

                 6. Conclusion

This paper presented the multilingual IR system developed at ITC-irst. A complete statistical model was defined which combines several bilingual retrieval models. The system was evaluated in the CLEF 2003 campaign in the monolingual, bilingual, and multilingual tracks. The basic monolingual IR model proved very competitive for every language. The multilingual IR system also achieves performance above the median. Experiments in the Cross-Language Spoken Document Retrieval track, which uses very short queries, showed that significantly better results are still achieved by using translations produced by a commercial system.

                 7. References

Bertoldi, N. and M. Federico, 2001. ITC-irst at CLEF 2000: Italian monolingual track. In Carol Peters (ed.), Cross-Language Information Retrieval and Evaluation, volume 2069 of Lecture Notes in Computer Science. Heidelberg, Germany: Springer Verlag.

Bertoldi, N. and M. Federico, 2002. ITC-irst at CLEF 2001: Monolingual and bilingual tracks. In Carol Peters, Martin Braschler, Julio Gonzalo, and Michael Kluck (eds.), Cross-Language Information Retrieval and Evaluation, volume 2406 of Lecture Notes in Computer Science. Heidelberg, Germany: Springer Verlag.

Dempster, A. P., N. M. Laird, and D. B. Rubin, 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B, 39:1-38.

Federico, Marcello and Nicola Bertoldi, 2002. Statistical cross-language information retrieval using n-best query translations. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Tampere, Finland.

Frakes, William B. and Ricardo Baeza-Yates (eds.), 1992. Information Retrieval: Data Structures and Algorithms. Englewood Cliffs, NJ: Prentice Hall.

Johnson, S.E., P. Jourlin, K. Spärck Jones, and P.C. Woodland, 1999. Spoken document retrieval for TREC-8 at Cambridge University. In Proceedings of the 8th Text REtrieval Conference. Gaithersburg, MD.

Rabiner, Lawrence R., 1990. A tutorial on hidden Markov models and selected applications in speech recognition. In Alex Waibel and Kai-Fu Lee (eds.), Readings in Speech Recognition. Los Altos, CA: Morgan Kaufmann, pages 267-296.