=Paper=
{{Paper
|id=Vol-1166/CLEF2000wn-adhoc-BertoldiEt2000
|storemode=property
|title=Italian Text Retrieval for CLEF 2000 at ITC-irst
|pdfUrl=https://ceur-ws.org/Vol-1166/CLEF2000wn-adhoc-BertoldiEt2000.pdf
|volume=Vol-1166
|dblpUrl=https://dblp.org/rec/conf/clef/BertoldiF00a
}}
==Italian Text Retrieval for CLEF 2000 at ITC-irst ==
Nicola Bertoldi and Marcello Federico
ITC-irst - Centro per la Ricerca Scientifica e Tecnologica, I-38050 Povo, Trento, Italy

Abstract

This paper presents work on document retrieval for Italian carried out at ITC-irst. Two different approaches to information retrieval were investigated, one based on the Okapi weighting formula and one based on a statistical model. Development experiments were carried out using the Italian sample of the TREC-8 CLIR track. Performance evaluation was done on the Cross Language Evaluation Forum (CLEF) 2000 Italian monolingual track.

1. INTRODUCTION

This paper reports on Italian text retrieval research that has recently started at ITC-irst. Experimental evaluation was carried out in the framework of the Cross Language Evaluation Forum (CLEF), a text retrieval system evaluation activity coordinated in Europe from 2000, in collaboration with the US National Institute of Standards and Technology (NIST) and the TREC conferences.

ITC-irst has recently started to develop monolingual text retrieval systems (Sparck Jones and Willett, 1997), mainly for the purpose of accessing broadcast news audio and video data (Federico, 2000). This paper presents two Italian monolingual text retrieval systems that were submitted to CLEF 2000: a conventional Okapi-derived model and a statistical retrieval model. After the evaluation, a combined model was also developed that simply integrates the scores of the two basic models. This simple and effective model shows a significant improvement over the two single models.

The paper is organized as follows. Section 2 presents the text preprocessing of documents and queries. Sections 3 and 4 introduce the text retrieval models that were officially evaluated at CLEF and present experimental results. Section 5 discusses improvements on the basic models that were made after the CLEF evaluation; in particular, a combined retrieval model is introduced and evaluated on the CLEF test collection. Finally, Section 6 offers some conclusions regarding the research at ITC-irst in the field of text retrieval.

2. TEXT PREPROCESSING

Document and query preprocessing involves several stages: tokenization, morphological analysis of words, part-of-speech (POS) tagging of text, base form extraction, stemming, and stop-terms removal.

Tokenization. Tokenization of the text is performed in order to isolate words from punctuation marks, recognize abbreviations and acronyms, correct possible word splits across lines, and discriminate between accents and quotation marks.

Morphological analysis. A morphological analyzer decomposes each Italian inflected word into its morphemes and suggests all possible POSs and base forms of each valid decomposition. By base forms we mean the usual non-inflected entries of a dictionary.

POS tagging. POS tagging is based on a Viterbi decoder that computes the best text-POS alignment on the basis of a bigram POS language model and a discrete observation model (Merialdo, 1994). The employed tagger works with 57 tag classes and has an accuracy of around 96%.

Base form extraction. Once the POS and the morphological analysis of each word in the text have been computed, a base form can be assigned to each word.

Stemming. Word stemming is applied at the level of tagged base forms. POS-specific rules were developed that remove suffixes from verbs, nouns, and adjectives.

Stop-terms removal. Words in the collection that are considered not relevant for the purpose of information retrieval are discarded in order to save index space. Words are filtered out on the basis of either their POS or their inverted document frequency. In particular, punctuation is eliminated together with articles, determiners, quantifiers, auxiliary verbs, prepositions, conjunctions, interjections, and pronouns. Among the remaining terms, those with a low inverted document frequency, i.e. terms that occur in many different documents, are eliminated.

An example of text preprocessing is presented in Table 8. The effect of the preprocessing steps on the collection statistics is summarized in Table 2.

  Terms        Stop   mean doc. length   V      mean doc. vocab.
  text         no     225                160K   134
  base forms   no     225                126K   129
  stems        no     225                101K   126
  base forms   yes    103                125K   80
  stems        yes    103                100K   77

Table 2: Effect of the text preprocessing steps on the mean document length, the global vocabulary size, and the mean document vocabulary size.
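To make the pipeline concrete, the sketch below strings the above stages together. It is a minimal illustration rather than the ITC-irst implementation: the tokenizer, morphological analyzer, POS tagger, stemming rules, the tag names in STOP_POS, and the low-idf term list are all hypothetical stand-ins supplied by the caller.

```python
# Minimal sketch of the preprocessing chain (tokenization, morphological
# analysis, POS tagging, base-form extraction, stemming, stop-terms removal).
# All linguistic resources here are hypothetical stand-ins, not the real system.

STOP_POS = {"ART", "DET", "QUANT", "AUX", "PREP", "CONJ", "INTERJ", "PRON", "PUNCT"}  # assumed tag names

def preprocess(text, tokenize, analyze, tag, stem_rules, low_idf_terms):
    terms = []
    for token in tokenize(text):                  # isolate words from punctuation, fix line splits, etc.
        readings = analyze(token)                 # [(pos, base_form), ...] for each valid decomposition
        pos, base = tag(token, readings)          # POS tagging disambiguates among the readings
        if pos in STOP_POS:                       # stop-term removal by part of speech
            continue
        stem = stem_rules.get(pos, lambda w: w)(base)   # POS-specific suffix stripping of the base form
        if stem in low_idf_terms:                 # drop terms that occur in too many documents
            continue
        terms.append(stem)
    return terms
```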
3. INFORMATION RETRIEVAL MODELS

The notation used in the following is summarized in Table 1.

  f_d(w)    frequency of word w in document d
  f_q(w)    frequency of w in query q
  f(w)      frequency of w in the collection
  f_d       length of document d
  f         length of the collection
  l         mean document length
  N         number of documents
  N_w       number of documents containing w
  V_d       vocabulary size of document d
  avg V_d   average document vocabulary size
  V         vocabulary size of the collection

Table 1: Notation used in the information retrieval models.

3.1. Okapi Model

Okapi (Robertson et al., 1994) is the name of a retrieval system project that developed a family of weighting functions for evaluating the relevance of a document d with respect to a query q. In this work, the following Okapi weighting function was applied:

  s(d) = \sum_{w \in q \cap d} c_d(w) \, idf(w) \, f_q(w)    (1)

where

  c_d(w) = \frac{f_d(w) (k_1 + 1)}{k_1 (1 - b) + k_1 b \frac{f_d}{l} + f_d(w)}    (2)

scores the relevance of w in d, and the inverted document frequency

  idf(w) = \log \frac{N - N_w + 0.5}{N_w + 0.5}    (3)

evaluates the relevance of w inside the collection. The model has two parameters, k_1 and b, to be empirically estimated on a development sample. An explanation of the involved terms can be found in (Robertson et al., 1994) and in the papers referenced there.
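For illustration, the following sketch restates equations (1)-(3) in code. It is a minimal rendering, not the implementation behind the official runs: the index statistics (term frequencies, document lengths, document frequencies) are assumed to be precomputed, and k_1 and b default to the values tuned in Section 4.2.

```python
import math

def okapi_score(query_tf, doc_tf, doc_len, mean_doc_len, N, doc_freq, k1=1.5, b=0.4):
    """Okapi weighting of equations (1)-(3); all index statistics are assumed given."""
    score = 0.0
    for w, fq in query_tf.items():            # f_q(w): frequency of w in the query
        tf = doc_tf.get(w, 0)                 # f_d(w): frequency of w in the document
        if tf == 0:
            continue                          # the sum in eq. (1) runs over w in q intersect d
        cd = tf * (k1 + 1) / (k1 * (1 - b) + k1 * b * doc_len / mean_doc_len + tf)   # eq. (2)
        idf = math.log((N - doc_freq[w] + 0.5) / (doc_freq[w] + 0.5))                # eq. (3)
        score += cd * idf * fq                                                       # eq. (1)
    return score
```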
3.2. Statistical Model

A statistical retrieval model was developed based on previous work on statistical language modeling (Federico and De Mori, 1998). The match between a query q and a document d can be expressed through the following conditional probability distribution:

  P(d \mid q) = \frac{P(q \mid d) P(d)}{P(q)}    (4)

where P(q | d) represents the likelihood of q given d, P(d) represents the a-priori probability of d, and P(q) is a normalization term. By assuming no a-priori knowledge about the documents, and disregarding the normalization factor, documents can be ranked with respect to q just by the likelihood term. If we interpret the likelihood function as the probability of d generating q and assume an order-free multinomial model, the following log-probability score can be derived:

  \log P(q \mid d) = \sum_{w \in q} f_q(w) \log P(w \mid d)    (5)

The probability that a term w is generated by d can be estimated by applying statistical language modeling techniques. Previous work on statistical information retrieval (Miller et al., 1998; Ng, 1999) proposed to interpolate the relative frequencies of each document with those of the whole collection, with interpolation weights empirically estimated from the data.

In this work we use an interpolation formula which applies the smoothing method proposed by Witten and Bell (1991). This method linearly smooths the word frequencies of a document, and the amount of probability assigned to never-observed terms is proportional to the number of different words contained in the document. Hence, the following probability estimate is applied:

  P(w \mid d) = \frac{f_d(w)}{f_d + V_d} + \frac{V_d}{f_d + V_d} P(w)    (6)

where P(w), the word probability over the collection, is estimated by interpolating the smoothed relative frequency with the uniform distribution over the vocabulary V:

  P(w) = \frac{f(w)}{f + V} + \frac{V}{f + V} \cdot \frac{1}{V}    (7)
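The score of equation (5), with the Witten-Bell smoothing of equations (6) and (7), translates directly into code. The sketch below is a minimal re-statement under the notation of Table 1; the document and collection statistics are assumed to be available from the index, and no claim is made that this matches the ITC-irst implementation in every detail.

```python
import math

def lm_score(query_tf, doc_tf, doc_len, coll_tf, coll_len, vocab_size):
    """log P(q|d) of eq. (5) with the Witten-Bell smoothing of eqs. (6) and (7)."""
    Vd = len(doc_tf)                          # V_d: number of distinct terms in the document
    V = vocab_size                            # V: vocabulary size of the collection
    score = 0.0
    for w, fq in query_tf.items():
        # eq. (7): collection probability, smoothed towards the uniform distribution over V
        p_w = coll_tf.get(w, 0) / (coll_len + V) + (V / (coll_len + V)) * (1.0 / V)
        # eq. (6): document probability, smoothed with the collection probability
        p_wd = doc_tf.get(w, 0) / (doc_len + Vd) + (Vd / (doc_len + Vd)) * p_w
        score += fq * math.log(p_wd)          # eq. (5)
    return score
```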
3.3. Blind Relevance Feedback

Blind relevance feedback (BRF) is a well-known technique for improving retrieval performance. The basic idea is to perform retrieval in two steps. First, the documents matching the original query q are ranked; then the B best-ranked documents are taken and the T most relevant terms in them are added to the query. Finally, the retrieval phase is repeated with the augmented query. In this work, new search terms are extracted by sorting all the terms of the B top documents according to (Johnson et al., 1999):

  r_w \frac{(r_w + 0.5)(N - N_w - B + r_w + 0.5)}{(N_w - r_w + 0.5)(B - r_w + 0.5)}    (8)

where r_w is the frequency of word w inside the B top documents.
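A sketch of the two-pass procedure is given below. It assumes a hypothetical search function that returns a ranked list of document identifiers and precomputed per-document term frequencies. For robustness, r_w is computed here as the number of top-B documents containing w, a common reading of this offer weight; the text above defines it as the frequency of w within those documents, which would be a one-line change.

```python
def brf_expand(query_tf, search, doc_tf, N, doc_freq, B=5, T=15):
    """Blind relevance feedback: retrieve with the original query, score the terms of
    the B top documents with eq. (8), add the T best terms, and retrieve again."""
    top_docs = search(query_tf)[:B]                          # first retrieval pass
    r = {}                                                   # r_w: number of top-B documents containing w
    for d in top_docs:
        for w in doc_tf[d]:
            r[w] = r.get(w, 0) + 1
    def offer_weight(w):                                     # eq. (8)
        rw, Nw = r[w], doc_freq[w]
        return rw * (rw + 0.5) * (N - Nw - B + rw + 0.5) / ((Nw - rw + 0.5) * (B - rw + 0.5))
    expansion = sorted(r, key=offer_weight, reverse=True)[:T]
    expanded = dict(query_tf)
    for w in expansion:                                      # augment the query with the new terms
        expanded[w] = expanded.get(w, 0) + 1
    return search(expanded)                                  # second retrieval pass
```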
4. EXPERIMENTS

This section presents the work done to develop and test the presented models. Development and testing were done on two different Italian document retrieval tasks. Performance was measured in terms of Average Precision (AvPr) and mean Average Precision (mAvPr). Given the document ranking produced for a query q, let r_1, ..., r_k be the ranks of the retrieved relevant documents. The AvPr for q is defined as the average of the precision values achieved at all recall points, i.e.:

  AvPr = 100 \cdot \frac{1}{k} \sum_{i=1}^{k} \frac{i}{r_i}    (9)

The mAvPr of a set of queries corresponds to the mean of the corresponding query AvPr values.

4.1. Development

For the purpose of parameter tuning, development material made available by CLEF was used. The collection consists of the test set used by the 1999 TREC-8 CLIR track and its relevance assessments. The CLIR collection contains topics and documents in four languages: English, German, French, and Italian. The Italian part consists of texts issued by the Swiss News Agency (Schweizerische Depeschenagentur) from 17 November 1989 to 31 December 1990, together with 28 topics, four of which have no corresponding Italian relevant documents (topics 60, 63, 76, and 80). More details about the development collection are provided in Tables 3, 4, and 5. For development and evaluation, queries were generated by using all the available topic fields.

  Data Set                    # docs    Avg. # words/doc
  CLIR - Swiss News Agency    62,359    225
  CLEF - La Stampa            58,051    552

Table 3: Development and test collection sizes.

  # of Words
  Data Set (topic #s)    Min   Max   Avg.   Total
  CLIR (54-81)           41    107   70.4   1690
    title                3     8     5.1    122
    description          8     27    17.1   410
    narrative            25    81    48.3   1158
  CLEF (1-40)            31    96    60.8   2067
    title                3     9     5.3    179
    description          7     35    15.7   532
    narrative            14    84    39.9   1356

Table 4: Topic statistics of the development and test collections.

  # of Relevant Docs
  Data Set (topic #s)    Min   Max   Avg.   Total
  CLIR (54-81)           2     15    7.1    170
  CLEF (1-40)            1     42    9.9    338

Table 5: Document retrieval statistics of the development and test collections.

4.2. Okapi Tuning

Tuning of the parameters in formula (2) was carried out on the development data. Figure 1 plots the mAvPr for different values of the parameters.

[Figure 1: Mean Average Precision versus different settings of the Okapi formula's parameters k_1 and b.]

Finally, the values k_1 = 1.5 and b = 0.4 were chosen, because they also provided consistently good results with other evaluation measures. The achieved mAvPr is 46.07%.

4.3. Blind Relevance Feedback Tuning

Tuning of the BRF parameters B and T was carried out just for the Okapi model. Figure 2 plots the mAvPr for different values of the parameters.

[Figure 2: Mean Average Precision versus different settings of the blind relevance feedback parameters B and T.]

Finally, the number of relevant documents B = 5 and the number of relevant terms T = 15 were chosen; their combination gives a mAvPr of 49.2%, corresponding to a 6.8% improvement over the first step.

Further work was done to optimize the performance of the first retrieval step. Indeed, the performance of the BRF procedure is determined by the precision achieved by the first retrieval phase on the very top-ranking documents. In particular, a higher resolution for documents and queries was considered by using base forms instead of stems. Table 6 shows mAvPr values for different combinations of text preprocessing before and after BRF: base forms before and after BRF, word stems before and after BRF, and base forms before BRF with stems after BRF. The last combination achieved the largest improvement (8.6%) and was adopted for the final system.

              # of relevant terms T
  I    II     5      10     15     20     25     30
  st   st     46.4   47.3   49.2   49.6   48.3   48.5
  ba   ba     46.2   47.6   47.6   47.6   47.7   47.3
  ba   st     46.7   48.7   50.0   48.5   48.6   48.6

Table 6: Mean Average Precision using base forms (ba) or word stems (st) before (I) and after (II) blind relevance feedback (with B = 5).

4.4. Official Evaluation

The two presented models were evaluated on the CLEF 2000 Italian monolingual track. The test collection consists of newspaper articles published by La Stampa during 1994, and 40 topics. Since six of the topics have no corresponding documents in the collection, they are not taken into account (the CLEF topics without Italian relevant documents are 3, 6, 14, 27, 28, and 40). More details about the CLEF collection and topics are given in Tables 3, 4, and 5.

The official results of the Okapi and statistical models are reported in Figure 3 under the names irst1 and irst2, respectively. Figure 3 shows the difference in AvPr between each run and the median reference provided by the CLEF organization. As a further reference, the performance differences between the best result of CLEF and the median are also plotted. The mAvPr of irst1 and irst2 are 49.0% and 47.5%, respectively. Both methods score above the median reference mAvPr, which is 44.5%. The mAvPr of the median reference was computed by taking the average of the median AvPr scores.

[Figure 3: Difference (in mean average precision) from the median for each of the 34 topics in the CLEF 2000 Italian monolingual track. The best AvPr reference is also plotted for each topic.]

5. IMPROVEMENTS

Figure 3 shows that the Okapi and the statistical model have quite different behaviors. This suggests that if the two methods rank documents independently, some information about the relevant documents could be gained by integrating the scores of both methods.

In order to compare the rankings of two models A and B, Spearman's rank correlation can be applied. Given a query, let r(A(d)) and r(B(d)) represent the ranks of document d given by A and B, respectively. Then Spearman's rank correlation (Mood et al., 1974) is defined as:

  S = 1 - \frac{6 \sum_d [r(A(d)) - r(B(d))]^2}{N (N^2 - 1)}    (10)

Under the hypothesis of independence between A and B, S has mean 0 and variance 1/(N-1). On the contrary, in the case of perfect correlation the S statistic has value 1.

By taking the average of S over all the queries (as an approximation, rankings were computed over the union of the 100 top documents retrieved by each model), a rank correlation of 0.4 resulted between the irst1 and irst2 runs. This result confirms some degree of independence between the two information retrieval models. Hence, a combination of the two models was implemented by simply taking the sum of their scores. In order to adjust for scale differences, the scores of each model were normalized to the range [0, 1] before summation. Using the official relevance assessments of CLEF, the combined model achieved a mAvPr of 50.0%.

Figures 4 and 5 provide detailed results of the combined model (merge) for each query, against the CLEF references and against the irst1 and irst2 runs, respectively. The combined model performs better than the median reference on 24 of the 34 topics, while irst1 and irst2 improved on the median AvPr 16 and 17 times, respectively. Finally, the combined model improves on the best reference on two topics (20 and 36).

[Figure 4: Difference (in mean average precision) from the median of the combined model and of the best reference of CLEF 2000.]

[Figure 5: Difference (in mean average precision) of the combined model from each single model.]

  Retrieval Model      Official Run   mAvPr
  Okapi                irst1          49.0
  Statistical model    irst2          47.5
  Combined model       -              50.0

Table 7: Performance of the retrieval models on the CLEF 2000 Italian monolingual track.
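The score combination of this section amounts to a per-query min-max normalization followed by a sum. The sketch below is an illustrative rendering under that reading; details such as the handling of documents retrieved by only one model are assumptions.

```python
def combine_runs(scores_a, scores_b):
    """Sum of per-query min-max normalized scores of two retrieval models."""
    def normalize(scores):                                # rescale scores to the range [0, 1]
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0                           # guard against constant scores
        return {d: (s - lo) / span for d, s in scores.items()}
    na, nb = normalize(scores_a), normalize(scores_b)
    docs = set(na) | set(nb)                              # documents ranked by either model
    merged = {d: na.get(d, 0.0) + nb.get(d, 0.0) for d in docs}
    return sorted(merged, key=merged.get, reverse=True)   # final ranking by combined score
```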
6. CONCLUSION

This paper presents preliminary research results by ITC-irst in the field of text retrieval. Nevertheless, participation in the CLEF evaluation has been considered important in order to gain experience and feedback about our progress. Future work will be done to improve the statistical retrieval model, to develop a statistical blind relevance feedback method, and to extend the text retrieval system to other languages, i.e. English and German.

7. REFERENCES

Federico, Marcello, 2000. A system for the retrieval of Italian broadcast news. Speech Communication, 33(1-2).
Federico, Marcello and Renato De Mori, 1998. Language modelling. In Renato De Mori (ed.), Spoken Dialogues with Computers, chapter 7. London, UK: Academic Press.
Johnson, S.E., P. Jourlin, K. Sparck Jones, and P.C. Woodland, 1999. Spoken document retrieval for TREC-8 at Cambridge University. In Proceedings of the 8th Text REtrieval Conference. Gaithersburg, MD.
Merialdo, Bernard, 1994. Tagging English text with a probabilistic model. Computational Linguistics, 20(2):155-172.
Miller, David R. H., Tim Leek, and Richard M. Schwartz, 1998. BBN at TREC-7: Using hidden Markov models for information retrieval. In Proceedings of the 7th Text REtrieval Conference. Gaithersburg, MD.
Mood, Alexander M., Franklin A. Graybill, and Duane C. Boes, 1974. Introduction to the Theory of Statistics. Singapore: McGraw-Hill.
Ng, Kenney, 1999. A maximum likelihood ratio information retrieval model. In Proceedings of the 8th Text REtrieval Conference. Gaithersburg, MD.
Robertson, S. E., S. Walker, S. Jones, M. M. Hancock-Beaulieu, and M. Gatford, 1994. Okapi at TREC-3. In Proceedings of the 3rd Text REtrieval Conference. Gaithersburg, MD.
Sparck Jones, Karen and Peter Willett (eds.), 1997. Readings in Information Retrieval. San Francisco, CA: Morgan Kaufmann.
Witten, Ian H. and Timothy C. Bell, 1991. The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression. IEEE Transactions on Information Theory, 37(4):1085-1094.

  Text          POS    Base form     Stem        R
  IL            RS     IL            IL          0
  PRIMO         AS     PRIMO         PRIM        1
  MINISTRO      SS     MINISTRO      MINISTR     1
  LITUANO       AS     LITUANO       LITUAN      1
  ,             XPW    ,             ,           0
  SIGNORA       SS     SIGNORA       SIGNOR      1
  KAZIMIERA     SPN    KAZIMIERA     KAZIMIER    1
  PRUNSKIENE    SPN    PRUNSKIENE    PRUNSKIEN   1
  ,             XPW    ,             ,           0
  HA            #VI#   AVERE         AVERE       0
  ANCORA        B      ANCORA        ANCORA      0
  UNA           RS     UNA           UNA         0
  VOLTA         SS     VOLTA         VOLT        1
  SOLLECITATO   VSP    SOLLECITARE   SOLLECIT    1
  OGGI          B      OGGI          OGGI        0
  UN            RS     UN            UN          0
  RAPIDO        #SS#   RAPIDO        RAPID       1
  AVVIO         SS     AVVIO         AVVIO       1
  DEI           EP     DEI           DEI         0
  NEGOZIATI     SP     NEGOZIATO     NEG         1
  CON           E      CON           CON         0
  L'            RS     L'            L'          0
  URSS          YA     URSS          URSS        1
  ,             XPW    ,             ,           0
  RITENENDO     VG     RITENERE      RITEN       0
  FAVOREVOLE    AS     FAVOREVOLE    FAVOR       1
  L'            RS     L'            L'          0
  ATTUALE       AS     ATTUALE       ATTUAL      1
  SITUAZIONE    SS     SITUAZIONE    SIT         1
  NEI           EP     NEI           NEI         0
  RAPPORTI      SP     RAPPORTO      RAPPORT     1
  FRA           E      FRA           FRA         0
  MOSCA         SPN    MOSCA         MOSC        1
  E             C      E             E           0
  VILNIUS       SPN    VILNIUS       VILNIUS     1

Table 8: Example of text preprocessing. The flag in the last column indicates whether the term survives the stop-terms removal. The two POSs marked with # are wrong; nevertheless, they permit correct base forms and stems to be generated.