<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>ITC-irst at CLEF 2001: Monolingual and Bilingual Tracks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nicola Bertoldi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marcello Federico</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ITC-irst - Centro per la Ricerca Scientifica e Tecnologica I-38050 Povo</institution>
          ,
          <addr-line>Trento</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2001</year>
      </pub-date>
      <fpage>2</fpage>
      <lpage>5</lpage>
      <abstract>
        <p>This paper reports on the participation of ITC-irst in the Cross Language Evaluation Forum (CLEF) of 2001. ITC-irst took part in two tracks: the monolingual retrieval task and the bilingual retrieval task. In both cases, Italian was chosen as the query language, while English was chosen as the document language of the bilingual task. The retrieval engine combines scores computed by an Okapi model and a statistical language model. The cross-language system employs a statistical query translation model, which is estimated on the target document collection and on a translation dictionary.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        This paper reports on the participation of
ITC-irst in two Information Retrieval (IR) tracks of the
Cross Language Evaluation Forum (CLEF) 2001: the
monolingual retrieval task and the bilingual retrieval
task. Italian was always the query language,
and English documents were searched in the
bilingual task. With respect to the CLEF 2000
evaluation
        <xref ref-type="bibr" rid="ref1">(Bertoldi and Federico, 2000)</xref>
        , the monolingual
IR system was only slightly refined, while most of
the effort was dedicated to developing an original
cross-language IR system.
      </p>
      <p>The basic IR engine, used for both evaluations,
combines scores of a standard Okapi model and of a
statistical language model. For cross-language IR, a
light-weight statistical model for translating queries
was developed, which does not need any parallel or
comparable corpora to be trained, but just the target
document collection and a bilingual dictionary.</p>
      <p>This paper is organized as follows. In Section 2, the
employed text pre-processing modules are presented.
Section 3 describes the employed IR models.
Section 4 introduces the cross-language specific models,
namely the query translation model and the retrieval
model. Section 5 presents the official evaluation
results. Finally, Section 6 gives some conclusions.</p>
      <p>Text pre-processing is performed in several stages,
which may differ according to the task and language.
The modules used to pre-process documents and queries
are listed in the following, together with the languages
to which each of them applies.</p>
      <sec id="sec-1-1">
        <title>2.1. Tokenization - IT+EN</title>
        <p>Text tokenization is performed in order to isolate
words from punctuation marks, recognize
abbreviations and acronyms, correct possible word splits
across lines, and discriminate between accents and
quotation marks.</p>
      </sec>
      <sec id="sec-1-2">
        <title>2.2. Morphological analysis - IT</title>
        <p>A morphological analyzer decomposes each Italian
inflected word into its morphemes, and suggests all
possible parts of speech and base forms for each valid
decomposition. By base forms we mean the usual
uninflected entries of a dictionary.</p>
      </sec>
      <sec id="sec-1-3">
        <title>2.3. POS tagging - IT</title>
        <p>Words in a text are tagged with parts-of-speech (POS)
by computing the best text-POS alignment through a
statistical model. The employed tagger works with
57 tag classes and has an accuracy of around 96%.</p>
      </sec>
      <sec id="sec-1-4">
        <title>2.4. Base form extraction - IT</title>
        <p>Once the POS and the morphological analysis of each
word in the text are computed, a base form can be
assigned to each word.</p>
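        <p>The selection step described above can be sketched as follows. This is a minimal illustration, assuming that the morphological analyzer returns (POS, base form) pairs and that tagger and analyzer share one tag set. The function name, the tag labels, and the fallback to the lower-cased surface form are hypothetical and are not taken from the ITC-irst system.</p>
        <preformat><![CDATA[
```python
from typing import List, Tuple


def extract_base_form(word: str, pos_tag: str,
                      analyses: List[Tuple[str, str]]) -> str:
    """Return the base form whose POS agrees with the tagger's choice.

    `analyses` holds the (POS, base form) pairs suggested by a
    morphological analyzer for `word`; `pos_tag` is the tag chosen
    by the POS tagger. Both inputs are illustrative stand-ins.
    """
    for pos, base_form in analyses:
        if pos == pos_tag:
            return base_form
    # Assumption: when no analysis matches the tag (e.g. an unknown
    # word), fall back to the lower-cased surface form.
    return word.lower()


# "porta" is ambiguous in Italian: noun ("door") or a form of "portare".
candidates = [("VERB", "portare"), ("NOUN", "porta")]
base = extract_base_form("porta", "VERB", candidates)
```
]]></preformat>
        <p>Here the tag assigned in context disambiguates between the two analyses, so the verb reading yields the base form of the verb rather than the noun.</p>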
      </sec>
      <sec id="sec-1-5">
        <title>2.5. Stemming - EN</title>
        <p>Word stemming is performed only on English words,
by using Porter’s algorithm.</p>
      </sec>
      <sec id="sec-1-6">
        <title>2.6. Stop-terms removal - IT+EN</title>
        <p>Words that are not considered relevant for IR are
discarded in order to save index space. Words are
filtered out on the basis of either their POS (if available)
or their inverse document frequency.</p>
      </sec>
      <sec id="sec-1-7">
        <title>2.7. Multi-word recognition - EN</title>
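        <p>A possible reading of this filter is sketched below. It assumes that tokens arrive as (term, POS) pairs with POS set to None when no tag is available, that closed-class tags mark stop-terms, and that terms with a low inverse document frequency are dropped otherwise. The tag names and the threshold are hypothetical values chosen for illustration only.</p>
        <preformat><![CDATA[
```python
import math
from typing import Dict, FrozenSet, Iterable, List, Optional, Tuple


def inverse_document_frequency(docs: Iterable[List[str]]) -> Dict[str, float]:
    """Compute log(N / N_w) for every term of a tokenized collection."""
    docs = list(docs)
    doc_freq: Dict[str, int] = {}
    for doc in docs:
        for term in set(doc):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {t: math.log(len(docs) / n) for t, n in doc_freq.items()}


def remove_stop_terms(
    tokens: List[Tuple[str, Optional[str]]],
    idf: Dict[str, float],
    stop_pos: FrozenSet[str] = frozenset({"DET", "PREP", "CONJ"}),  # hypothetical tags
    min_idf: float = 0.2,  # hypothetical threshold
) -> List[str]:
    """Drop terms by POS when a tag is available, by IDF otherwise."""
    kept = []
    for term, pos in tokens:
        if pos is not None:
            if pos in stop_pos:
                continue
        elif idf.get(term, float("inf")) < min_idf:
            # Terms unseen in the collection get infinite IDF and are kept.
            continue
        kept.append(term)
    return kept
```
]]></preformat>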
        <p>Multi-words are used only for query
translation. Hence, the statistics used by the
translation models do contain multi-words. After
translation, multi-words are split into single words.</p>
      </sec>
      <sec id="sec-1-8">
        <title>3.1. Okapi Model</title>
        <p>The information retrieval models rely on notation for
the following quantities: the random variables of query, translation, and document;
instances of query, query translation, and document;
a generic term, an Italian term, and an English term;
the collection of documents;
the set of terms occurring in the collection, and in a document;
the number of term occurrences in the collection, and in a document;
the frequency of a term in the collection, in a document, and in a query;
the number of documents in the collection which contain a term;
and the size of a set.</p>
        <p>
          To score the relevance of a document versus a query,
an Okapi weighting function is applied (Equation 1).
In it, one factor (Equation 2) scores the relevance of a term
in the document, and the inverse
document frequency (Equation 3)
evaluates the relevance of the term
inside the collection. The parameters of the model have to
be empirically estimated over a development sample.
In this work, fixed settings of these parameters
were used. An explanation of the involved terms can
be found in
          <xref ref-type="bibr" rid="ref7">(Robertson et al., 1994)</xref>
          and in the other papers referred to therein.
        </p>
      </sec>
      <sec id="sec-1-9">
        <title>3.2. Language Model</title>
        <p>According to this model, the match between a query
random variable and a document random
variable is expressed through a conditional
probability distribution (Equation 4), which involves
the likelihood of the query given the document. Assuming
a uniform a-priori probability distribution over the
documents, and disregarding the normalization
factor, documents can be ranked, with respect to
the query, just by this likelihood, which is computed
under an order-free multinomial model.</p>
        <p>
          The probability that a term
is generated by a document can be
estimated by a statistical language model (LM).
Previous work on statistical information retrieval
          <xref ref-type="bibr" rid="ref3 ref5">(Miller
et al., 1998; Ng, 1999)</xref>
          proposed to interpolate
relative frequencies of each document with those of the
whole collection, with interpolation weights
empirically estimated from the data.
In this work we use an interpolation formula which
applies the smoothing method proposed by
          <xref ref-type="bibr" rid="ref8">(Witten
and Bell, 1991)</xref>
          . This method linearly smoothes the word
frequencies of a document, and the amount of
probability assigned to never observed terms is
proportional to the number of different words contained in
the document (Equation 6). The word probability over the
collection is estimated by interpolating the smoothed
relative frequency with the uniform distribution over the
vocabulary (Equation 7).
        </p>
      </sec>
      <sec id="sec-1-10">
        <title>3.3. Combined model</title>
        <p>
          Previous work
          <xref ref-type="bibr" rid="ref1">(Bertoldi and Federico, 2000)</xref>
          showed
that Okapi and the statistical model rank documents
almost independently. Hence, information about the
relevant documents can be gained by integrating the
scores of both methods.
        </p>
        <p>Combination of the two
models is implemented by just taking the sum of
scores. Actually, in order to adjust scale differences,
the scores of each model are normalized to a common range
before they are summed.</p>
      </sec>
      <sec id="sec-1-11">
        <title>3.4. Blind Relevance Feedback</title>
        <p>
          Blind relevance feedback (BRF) is a well known
technique that improves retrieval
performance. The basic idea is to perform retrieval in two
steps. First, the documents matching the original
query are ranked. Then, a fixed number of
best ranked documents are taken, and the
most relevant terms in them are
added to the query. Hence, the retrieval phase is
repeated with the augmented query. In this work, new
search terms are extracted by sorting all the terms of
the top documents according to
          <xref ref-type="bibr" rid="ref2">(Johnson et al.,
1999)</xref>
          (Equation 8). The sorting criterion depends on
the number of documents, among the top
documents, which contain a given word. In all the
performed experiments, the same number of top documents
and the same number of added terms
were used.
        </p>
      </sec>
      <sec id="sec-1-13">
        <title>4.1. Query Translation Model</title>
        <p>
          Query translation is based on a hidden Markov model
(HMM)
          <xref ref-type="bibr" rid="ref6">(Rabiner, 1990)</xref>
          , in which the observable part
is the query in the source language (Italian), and
the hidden part is the corresponding query in the
target language (English). Hence, the joint
probability of a query-translation pair can be decomposed
into factors over the single terms. The decomposition
relies on the probability of one target-language term co-occurring
with another, regardless of the order, within a text
window of fixed size. Smoothing of this probability is
performed through absolute discounting and
interpolation. The interpolation uses the word probability of
Equation (7), and the absolute discounting term is equal to
the estimate proposed in
          <xref ref-type="bibr" rid="ref4">(Ney et al., 1994)</xref>
          .
        </p>
      </sec>
      <sec id="sec-1-12">
        <title>4.2. Cross-Language IR Model</title>
        <p>As a first method to perform cross-language retrieval,
a simple plug-in method was devised, which
decouples the translation and retrieval phases. Hence, given
a query in the source language, the Viterbi
decoding algorithm is applied to compute the most
probable translation in the target language, according
to the statistical query translation model explained
above. Then, the document collection is searched by
applying a conventional monolingual IR method, which
orders the documents by using the translation.</p>
      </sec>
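        <p>To make the structure of the Okapi score concrete, the sketch below implements the widely used BM25 form of the weighting function: a sum over matching query terms of the query frequency, a saturating term-frequency weight, and an inverse document frequency. It is an assumption that this textbook form matches Equations (1)-(3) term by term; the function names and the default values of k1 and b are illustrative and are not the settings used in the evaluation.</p>
        <preformat><![CDATA[
```python
import math
from collections import Counter
from typing import Dict, List


def document_frequencies(docs: List[List[str]]) -> Dict[str, int]:
    """Number of documents of the collection that contain each term."""
    doc_freq: Counter = Counter()
    for doc in docs:
        doc_freq.update(set(doc))
    return dict(doc_freq)


def okapi_score(query: List[str], doc: List[str], doc_freq: Dict[str, int],
                num_docs: int, avg_doc_len: float,
                k1: float = 1.2, b: float = 0.75) -> float:
    """BM25-style relevance of `doc` for `query` (k1, b are placeholders)."""
    query_tf = Counter(query)
    doc_tf = Counter(doc)
    score = 0.0
    for term, q_freq in query_tf.items():
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue  # only terms shared by query and document contribute
        n_w = doc_freq[term]
        # Relevance of the term inside the collection.
        idf = math.log((num_docs - n_w + 0.5) / (n_w + 0.5))
        # Relevance of the term in the document, length-normalized.
        length_norm = k1 * (1.0 - b) + k1 * b * len(doc) / avg_doc_len
        tf_weight = tf * (k1 + 1.0) / (length_norm + tf)
        score += q_freq * tf_weight * idf
    return score
```
]]></preformat>
        <p>Note that with this form of the inverse document frequency, a term occurring in more than half of the documents receives a negative weight, which is one reason why stop-terms are removed beforehand.</p>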
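        <p>The language model scoring can be illustrated with the sketch below. It assumes the usual Witten-Bell interpolation, in which the weight given to unseen terms is the number of distinct terms of the document divided by the document length plus that number, and it scores a query by the sum of log-probabilities of its terms, which corresponds to an order-free multinomial likelihood up to a constant. All function and variable names are hypothetical, and documents are assumed to be non-empty.</p>
        <preformat><![CDATA[
```python
import math
from collections import Counter
from typing import Dict, List


def collection_prob(term: str, coll_tf: Dict[str, int],
                    coll_len: int, vocab_size: int) -> float:
    """Smoothed collection frequency interpolated with a uniform
    distribution over the vocabulary."""
    lam = coll_len / (coll_len + vocab_size)
    return lam * coll_tf.get(term, 0) / coll_len + (1.0 - lam) / vocab_size


def document_prob(term: str, doc_tf: Dict[str, int],
                  doc_len: int, p_coll: float) -> float:
    """Witten-Bell smoothing: the mass reserved for unseen terms grows
    with the number of distinct terms in the document."""
    distinct = len(doc_tf)
    return (doc_tf.get(term, 0) + distinct * p_coll) / (doc_len + distinct)


def query_log_likelihood(query: List[str], doc: List[str],
                         coll_tf: Dict[str, int], coll_len: int,
                         vocab_size: int) -> float:
    """Log-likelihood of the query given the document, summing one
    log-probability per query term occurrence."""
    doc_tf = Counter(doc)
    total = 0.0
    for term in query:
        p_coll = collection_prob(term, coll_tf, coll_len, vocab_size)
        total += math.log(document_prob(term, doc_tf, len(doc), p_coll))
    return total
```
]]></preformat>
        <p>Because the collection probability is never zero, every query term receives a positive probability in every document, so the logarithm is always defined.</p>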
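        <p>The two-step feedback loop can be sketched as follows. The term ranking below uses the Robertson/Sparck Jones offer weight, which is common in Okapi-style feedback; it is an assumption that this is the criterion of Equation (8). The function name, the default numbers of top documents and added terms, and the choice to skip terms already in the query are likewise illustrative.</p>
        <preformat><![CDATA[
```python
import math
from collections import Counter
from typing import Dict, List


def expand_query(query: List[str], ranked_docs: List[List[str]],
                 doc_freq: Dict[str, int], num_docs: int,
                 num_top: int = 5, num_terms: int = 15) -> List[str]:
    """Add the best-scoring terms of the top-ranked documents to the query.

    `ranked_docs` is the output of a first retrieval pass, best first;
    `doc_freq` maps each term to its document frequency in the whole
    collection of `num_docs` documents.
    """
    top = ranked_docs[:num_top]
    top_size = len(top)
    # r[w]: number of top documents that contain term w.
    r: Counter = Counter()
    for doc in top:
        r.update(set(doc))

    def offer_weight(term: str) -> float:
        r_w, n_w = r[term], doc_freq[term]
        num = (r_w + 0.5) * (num_docs - n_w - top_size + r_w + 0.5)
        den = (n_w - r_w + 0.5) * (top_size - r_w + 0.5)
        return r_w * math.log(num / den)

    # Assumption: only terms not already in the query are candidates.
    candidates = [t for t in r if t not in query]
    candidates.sort(key=lambda t: (-offer_weight(t), t))
    return list(query) + candidates[:num_terms]
```
]]></preformat>
        <p>The expanded query returned by this function would then be submitted to the same retrieval model for the second pass.</p>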
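        <p>The translation step of the plug-in method can be illustrated with a small Viterbi decoder. The sketch assumes a first-order HMM in which each Italian query term is emitted by one English term taken from the dictionary, with caller-supplied, strictly positive probability functions. The names viterbi_translate, initial, transition, and emission are hypothetical, and the toy probabilities in the example are invented for illustration.</p>
        <preformat><![CDATA[
```python
import math
from typing import Callable, Dict, List


def viterbi_translate(
    query: List[str],
    dictionary: Dict[str, List[str]],
    initial: Callable[[str], float],
    transition: Callable[[str, str], float],
    emission: Callable[[str, str], float],
) -> List[str]:
    """Most probable English term sequence for a non-empty Italian query.

    dictionary[i]      English candidates of the Italian term i
    initial(e)         probability that e starts the translation
    transition(e, p)   probability of e given the previous term p
    emission(i, e)     probability of Italian i given English e
    All probabilities must be strictly positive (i.e. smoothed).
    """
    scores = {e: math.log(initial(e)) + math.log(emission(query[0], e))
              for e in dictionary[query[0]]}
    backpointers = []
    for term in query[1:]:
        new_scores, pointers = {}, {}
        for e in dictionary[term]:
            # Best predecessor for candidate e at this position.
            prev = max(scores, key=lambda p: scores[p] + math.log(transition(e, p)))
            new_scores[e] = (scores[prev] + math.log(transition(e, prev))
                             + math.log(emission(term, e)))
            pointers[e] = prev
        backpointers.append(pointers)
        scores = new_scores
    # Backtrack from the best final state.
    path = [max(scores, key=scores.get)]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy example: "tasso" is ambiguous, "interesse" selects the right sense.
toy_dictionary = {"tasso": ["rate", "badger", "yew"], "interesse": ["interest"]}
toy_transitions = {("interest", "rate"): 0.5}
translation = viterbi_translate(
    ["tasso", "interesse"],
    toy_dictionary,
    initial=lambda e: 1.0 / 3.0,
    transition=lambda e, p: toy_transitions.get((e, p), 0.01),
    emission=lambda i, e: 1.0,
)
```
]]></preformat>
        <p>In this toy case the context term resolves the ambiguity of the first word, and the resulting English term sequence would be passed unchanged to the monolingual retrieval engine.</p>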
      <sec id="sec-1-14">
        <title>5.1. Monolingual Track</title>
        <p>Two monolingual runs were submitted to the Italian
monolingual track. The first run used all the
information available for the topics, while the second one used only
the title and description parts. The track consisted of
47 topics, for a total of 1,246 documents to be
retrieved from a collection of 108,578 documents.
The system used is described in detail
below:</p>
        <p>Document/query pre-processing: tokenization,
POS tagging, base form extraction, stop-term
removal.</p>
        <p>Retrieval step 1: separate Okapi and LM runs.</p>
        <sec id="sec-1-14-1">
          <title>BRF: performed on each model output.</title>
        </sec>
      </sec>
      <sec id="sec-1-15">
        <title>5.2. Bilingual IR Evaluation</title>
        <p>Two runs were submitted to the Italian-to-English
bilingual track, under the same conditions as in the
monolingual track. The bilingual track consisted of
47 topics, for a total of 856 documents to be retrieved
from a collection of 110,282 documents. The system used
is described in detail below:</p>
        <p>Document pre-processing: tokenization,
stemming, stop-term removal.</p>
        <p>Query pre-processing: tokenization, POS
tagging, base form extraction, stop-term removal,
translation, multi-word splitting, stemming.</p>
        <p>Retrieval step 1: separate Okapi and LM runs.</p>
        <sec id="sec-1-15-1">
          <title>BRF: performed on each model output.</title>
          <p>Retrieval step 2: same as step 1 with the
expanded query.</p>
          <p>Final rank: sum of Okapi and LM normalized
scores.</p>
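          <p>The final ranking step can be sketched as follows. The sketch assumes min-max normalization of each model's scores to the interval from 0 to 1 before summing, which is one plausible way to adjust the scale differences; the function names are hypothetical, and both score tables are assumed to be non-empty.</p>
          <preformat><![CDATA[
```python
from typing import Dict, List


def normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Min-max normalize a non-empty table of document scores."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # Degenerate case: all documents scored alike.
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def combine(okapi_scores: Dict[str, float],
            lm_scores: Dict[str, float]) -> List[str]:
    """Rank documents by the sum of normalized Okapi and LM scores."""
    okapi_norm = normalize(okapi_scores)
    lm_norm = normalize(lm_scores)
    docs = set(okapi_norm) | set(lm_norm)
    # Assumption: a document missing from one run contributes 0 for it.
    total = {d: okapi_norm.get(d, 0.0) + lm_norm.get(d, 0.0) for d in docs}
    return sorted(total, key=lambda d: (-total[d], d))
```
]]></preformat>
          <p>Normalizing first matters because Okapi scores and log-likelihoods live on very different scales, so a plain sum would be dominated by one of the two models.</p>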
          <p>An important issue concerns the use of
multi-words. Multi-words were only used for the target
language, i.e. English, and just for the translation
process. After translation, multi-words in the query are
split again into single words.</p>
          <p>As a term of comparison, our statistical query
translation model was replaced with the Babelfish text
translation service, powered by Systran and available on
the Internet. Cross-language retrieval performance
was measured by keeping all the other components
of the system fixed. Results obtained by the
submitted runs and by the Babelfish translator are shown in
Table 4. The mean average precision achieved with
the commercial translation system is about
5%-10% better, depending on the retrieval mode.
Detailed results of the experiments are shown in Table 4.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Conclusion</title>
      <p>In this work we presented the monolingual and
cross-language information retrieval systems developed at
ITC-irst and evaluated at CLEF 2001. In
particular, the cross-language system uses a statistical query
translation algorithm that requires minimal language
resources: a bilingual dictionary and the target
document collection. Results on the CLEF 2001
evaluation data show that satisfactory performance can be
achieved with this simple translation model.
However, experience gained from the many
experiments performed suggests that a fair comparison between
different systems would require a much larger number
of queries. Retrieval performance is in fact
very sensitive to the translation step.</p>
      <p>Current work is directed at further
developing the statistical model for
cross-language IR proposed here. In particular, significant improvements
have been achieved by closely integrating the
translation and retrieval models.</p>
    </sec>
    <sec id="sec-3">
      <title>Acknowledgements</title>
      <p>The authors would like to thank their colleagues at
ITC-irst, Bernardo Magnini and Emanuele Pianta, for
making an electronic Italian-English
dictionary available.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Bertoldi</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Federico</surname>
          </string-name>
          ,
          <year>2000</year>
          .
          <article-title>Italian text retrieval for CLEF 2000 at ITC-irst</article-title>
          .
          <source>In Working notes of CLEF 2000</source>
          . Lisbon, Portugal.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Johnson</surname>
            ,
            <given-names>S.E.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Jourlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Sparck Jones</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.C.</given-names>
            <surname>Woodland</surname>
          </string-name>
          ,
          <year>1999</year>
          .
          <article-title>Spoken document retrieval for TREC-8 at Cambridge University</article-title>
          .
          <source>In Proc. of 8th TREC</source>
          . Gaithersburg, MD.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Miller</surname>
            ,
            <given-names>David R. H.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Tim</given-names>
            <surname>Leek</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Richard M.</given-names>
            <surname>Schwartz</surname>
          </string-name>
          ,
          <year>1998</year>
          .
          <article-title>BBN at TREC-7: Using hidden Markov models for information retrieval</article-title>
          .
          <source>In Proc. of 7th TREC</source>
          . Gaithersburg, MD.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
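          <string-name>
            <surname>Ney</surname>
            ,
            <given-names>Hermann</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Ute</given-names>
            <surname>Essen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Reinhard</given-names>
            <surname>Kneser</surname>
          </string-name>
          ,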
          <year>1994</year>
          .
          <article-title>On structuring probabilistic dependences in stochastic language modelling</article-title>
          .
          <source>Computer Speech and Language</source>
          ,
          <volume>8</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>38</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>Kenney</given-names>
          </string-name>
          ,
          <year>1999</year>
          .
          <article-title>A maximum likelihood ratio information retrieval model</article-title>
          .
          <source>In Proc. of 8th TREC</source>
          . Gaithersburg, MD.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Rabiner</surname>
            ,
            <given-names>Lawrence R.</given-names>
          </string-name>
          ,
          <year>1990</year>
          .
          <article-title>A tutorial on hidden Markov models and selected applications in speech recognition</article-title>
          . In Alex Waibel and Kai-Fu Lee (eds.),
          <source>Readings in Speech Recognition</source>
          . Los Altos, CA: Morgan Kaufmann, pages
          <fpage>267</fpage>
          -
          <lpage>296</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Robertson</surname>
            ,
            <given-names>S. E.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Walker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Hancock-Beaulieu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Gatford</surname>
          </string-name>
          ,
          <year>1994</year>
          .
          <article-title>Okapi at TREC-3</article-title>
          .
          <source>In Proc. of 3rd TREC</source>
          . Gaithersburg, MD.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Witten</surname>
            ,
            <given-names>Ian H.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>Timothy C.</given-names>
            <surname>Bell</surname>
          </string-name>
          ,
          <year>1991</year>
          .
          <article-title>The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression</article-title>
          .
          <source>IEEE Trans. Inform. Theory</source>
          , IT-
          <volume>37</volume>
          (
          <issue>4</issue>
          ):
          <fpage>1085</fpage>
          -
          <lpage>1094</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>