=Paper=
{{Paper
|id=Vol-1166/CLEF2000wn-adhoc-BertoldiEt2000
|storemode=property
|title=Italian Text Retrieval for CLEF 2000 at ITC-irst
|pdfUrl=https://ceur-ws.org/Vol-1166/CLEF2000wn-adhoc-BertoldiEt2000.pdf
|volume=Vol-1166
|dblpUrl=https://dblp.org/rec/conf/clef/BertoldiF00a
}}
==Italian Text Retrieval for CLEF 2000 at ITC-irst ==
Nicola Bertoldi and Marcello Federico
ITC-irst - Centro per la Ricerca Scientifica e Tecnologica, I-38050 Povo, Trento, Italy

Abstract

This paper presents work on document retrieval for Italian carried out at ITC-irst. Two different approaches to information retrieval were investigated, one based on the Okapi weighting formula and one based on a statistical model. Development experiments were carried out using the Italian sample of the TREC-8 CLIR track. Performance evaluation was done on the Cross Language Evaluation Forum (CLEF) 2000 Italian monolingual track.

1. INTRODUCTION

This paper reports on Italian text retrieval research that has recently started at ITC-irst. Experimental evaluation was carried out in the framework of the Cross Language Evaluation Forum (CLEF), a text retrieval system evaluation activity coordinated in Europe from 2000, in collaboration with the US National Institute of Standards and Technology (NIST) and the TREC conferences.

ITC-irst has recently started to develop monolingual text retrieval systems (Sparck Jones and Willett, 1997), mainly for the purpose of accessing broadcast news audio and video data (Federico, 2000). This paper presents two Italian monolingual text retrieval systems that were submitted to CLEF 2000: a conventional Okapi-derived model and a statistical retrieval model. After the evaluation, a combined model was also developed that simply integrates the scores of the two basic models. This simple and effective model shows a significant improvement over the two single models.

The paper is organized as follows. Section 2 presents the text preprocessing of documents and queries. Sections 3 and 4 introduce the text retrieval models that were officially evaluated at CLEF and present experimental results. Section 5 discusses improvements on the basic models that were made after the CLEF evaluation; in particular, a combined retrieval model is introduced and evaluated on the CLEF test collection. Finally, Section 6 offers some conclusions regarding the research at ITC-irst in the field of text retrieval.

2. TEXT PREPROCESSING

Document and query preprocessing involves several stages: tokenization, morphological analysis of words, part-of-speech (POS) tagging of text, base form extraction, stemming, and stop-terms removal.

Tokenization. Tokenization of the text is performed in order to isolate words from punctuation marks, recognize abbreviations and acronyms, correct possible word splits across lines, and discriminate between accents and quotation marks.

Morphological analysis. A morphological analyzer decomposes each Italian inflected word into its morphemes and suggests all possible POSs and base forms of each valid decomposition. By base forms we mean the usual non-inflected entries of a dictionary.

POS tagging. POS tagging is based on a Viterbi decoder that computes the best text-POS alignment on the basis of a bigram POS language model and a discrete observation model (Merialdo, 1994). The employed tagger works with 57 tag classes and has an accuracy of around 96%.

Base form extraction. Once the POS and the morphological analysis of each word in the text have been computed, a base form can be assigned to each word.

Stemming. Word stemming is applied at the level of tagged base forms. POS-specific rules were developed that remove suffixes from verbs, nouns, and adjectives.

Stop-terms removal. Words in the collection that are considered not relevant for the purpose of information retrieval are discarded in order to save index space. Words are filtered out on the basis of either their POS or their inverted document frequency. In particular, punctuation is eliminated together with articles, determiners, quantifiers, auxiliary verbs, prepositions, conjunctions, interjections, and pronouns. Among the remaining terms, those with a low inverted document frequency, i.e. terms that occur in many different documents, are eliminated.

An example of text preprocessing is presented in Table 8. The effect of the preprocessing steps on the collection statistics is summarized in Table 2.

  Terms        Stop   mean doc. length   V      mean doc. vocab.
  text         no     225                160K   134
  base forms   no     225                126K   129
  stems        no     225                101K   126
  base forms   yes    103                125K   80
  stems        yes    103                100K   77

Table 2: Effect of the text preprocessing steps on the mean document length, the global vocabulary size, and the mean document vocabulary size.
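To make the pipeline concrete, the sketch below strings the above stages together. It is a minimal illustration rather than the ITC-irst implementation: the tokenizer, morphological analyzer, POS tagger, stemming rules, the tag names in STOP_POS, and the low-idf term list are all hypothetical stand-ins supplied by the caller.

```python
# Minimal sketch of the preprocessing chain (tokenization, morphological
# analysis, POS tagging, base-form extraction, stemming, stop-terms removal).
# All linguistic resources here are hypothetical stand-ins, not the real system.

STOP_POS = {"ART", "DET", "QUANT", "AUX", "PREP", "CONJ", "INTERJ", "PRON", "PUNCT"}  # assumed tag names

def preprocess(text, tokenize, analyze, tag, stem_rules, low_idf_terms):
    terms = []
    for token in tokenize(text):                  # isolate words from punctuation, fix line splits, etc.
        readings = analyze(token)                 # [(pos, base_form), ...] for each valid decomposition
        pos, base = tag(token, readings)          # POS tagging disambiguates among the readings
        if pos in STOP_POS:                       # stop-term removal by part of speech
            continue
        stem = stem_rules.get(pos, lambda w: w)(base)   # POS-specific suffix stripping of the base form
        if stem in low_idf_terms:                 # drop terms that occur in too many documents
            continue
        terms.append(stem)
    return terms
```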
3. INFORMATION RETRIEVAL MODELS

The notation used in the following is summarized in Table 1.

  f_d(w)    frequency of word w in document d
  f_q(w)    frequency of w in query q
  f(w)      frequency of w in the collection
  f_d       length of document d
  f         length of the collection
  l         mean document length
  N         number of documents
  N_w       number of documents containing w
  V_d       vocabulary size of document d
  avg V_d   average document vocabulary size
  V         vocabulary size of the collection

Table 1: Notation used in the information retrieval models.

3.1. Okapi Model

Okapi (Robertson et al., 1994) is the name of a retrieval system project that developed a family of weighting functions for evaluating the relevance of a document d with respect to a query q. In this work, the following Okapi weighting function was applied:

  s(d) = \sum_{w \in q \cap d} c_d(w) \, idf(w) \, f_q(w)    (1)

where

  c_d(w) = \frac{f_d(w) (k_1 + 1)}{k_1 (1 - b) + k_1 b \frac{f_d}{l} + f_d(w)}    (2)

scores the relevance of w in d, and the inverted document frequency

  idf(w) = \log \frac{N - N_w + 0.5}{N_w + 0.5}    (3)

evaluates the relevance of w inside the collection. The model has two parameters, k_1 and b, to be empirically estimated on a development sample. An explanation of the involved terms can be found in (Robertson et al., 1994) and in the papers referenced there.
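For illustration, the following sketch restates equations (1)-(3) in code. It is a minimal rendering, not the implementation behind the official runs: the index statistics (term frequencies, document lengths, document frequencies) are assumed to be precomputed, and k_1 and b default to the values tuned in Section 4.2.

```python
import math

def okapi_score(query_tf, doc_tf, doc_len, mean_doc_len, N, doc_freq, k1=1.5, b=0.4):
    """Okapi weighting of equations (1)-(3); all index statistics are assumed given."""
    score = 0.0
    for w, fq in query_tf.items():            # f_q(w): frequency of w in the query
        tf = doc_tf.get(w, 0)                 # f_d(w): frequency of w in the document
        if tf == 0:
            continue                          # the sum in eq. (1) runs over w in q intersect d
        cd = tf * (k1 + 1) / (k1 * (1 - b) + k1 * b * doc_len / mean_doc_len + tf)   # eq. (2)
        idf = math.log((N - doc_freq[w] + 0.5) / (doc_freq[w] + 0.5))                # eq. (3)
        score += cd * idf * fq                                                       # eq. (1)
    return score
```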
3.2. Statistical Model

A statistical retrieval model was developed based on previous work on statistical language modeling (Federico and De Mori, 1998). The match between a query q and a document d can be expressed through the following conditional probability distribution:

  P(d \mid q) = \frac{P(q \mid d) P(d)}{P(q)}    (4)

where P(q | d) represents the likelihood of q given d, P(d) represents the a-priori probability of d, and P(q) is a normalization term. By assuming no a-priori knowledge about the documents, and disregarding the normalization factor, documents can be ranked with respect to q just by the likelihood term. If we interpret the likelihood function as the probability of d generating q and assume an order-free multinomial model, the following log-probability score can be derived:

  \log P(q \mid d) = \sum_{w \in q} f_q(w) \log P(w \mid d)    (5)

The probability that a term w is generated by d can be estimated by applying statistical language modeling techniques. Previous work on statistical information retrieval (Miller et al., 1998; Ng, 1999) proposed to interpolate the relative frequencies of each document with those of the whole collection, with interpolation weights empirically estimated from the data.

In this work we use an interpolation formula which applies the smoothing method proposed by Witten and Bell (1991). This method linearly smooths the word frequencies of a document, and the amount of probability assigned to never-observed terms is proportional to the number of different words contained in the document. Hence, the following probability estimate is applied:

  P(w \mid d) = \frac{f_d(w)}{f_d + V_d} + \frac{V_d}{f_d + V_d} P(w)    (6)

where P(w), the word probability over the collection, is estimated by interpolating the smoothed relative frequency with the uniform distribution over the vocabulary V:

  P(w) = \frac{f(w)}{f + V} + \frac{V}{f + V} \cdot \frac{1}{V}    (7)
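The score of equation (5), with the Witten-Bell smoothing of equations (6) and (7), translates directly into code. The sketch below is a minimal re-statement under the notation of Table 1; the document and collection statistics are assumed to be available from the index, and no claim is made that this matches the ITC-irst implementation in every detail.

```python
import math

def lm_score(query_tf, doc_tf, doc_len, coll_tf, coll_len, vocab_size):
    """log P(q|d) of eq. (5) with the Witten-Bell smoothing of eqs. (6) and (7)."""
    Vd = len(doc_tf)                          # V_d: number of distinct terms in the document
    V = vocab_size                            # V: vocabulary size of the collection
    score = 0.0
    for w, fq in query_tf.items():
        # eq. (7): collection probability, smoothed towards the uniform distribution over V
        p_w = coll_tf.get(w, 0) / (coll_len + V) + (V / (coll_len + V)) * (1.0 / V)
        # eq. (6): document probability, smoothed with the collection probability
        p_wd = doc_tf.get(w, 0) / (doc_len + Vd) + (Vd / (doc_len + Vd)) * p_w
        score += fq * math.log(p_wd)          # eq. (5)
    return score
```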
3.3. Blind Relevance Feedback

Blind relevance feedback (BRF) is a well-known technique for improving retrieval performance. The basic idea is to perform retrieval in two steps. First, the documents matching the original query q are ranked; then the B best-ranked documents are taken and the T most relevant terms in them are added to the query. Finally, the retrieval phase is repeated with the augmented query. In this work, new search terms are extracted by sorting all the terms of the B top documents according to (Johnson et al., 1999):

  r_w \frac{(r_w + 0.5)(N - N_w - B + r_w + 0.5)}{(N_w - r_w + 0.5)(B - r_w + 0.5)}    (8)

where r_w is the frequency of word w inside the B top documents.
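A sketch of the two-pass procedure is given below. It assumes a hypothetical search function that returns a ranked list of document identifiers and precomputed per-document term frequencies. For robustness, r_w is computed here as the number of top-B documents containing w, a common reading of this offer weight; the text above defines it as the frequency of w within those documents, which would be a one-line change.

```python
def brf_expand(query_tf, search, doc_tf, N, doc_freq, B=5, T=15):
    """Blind relevance feedback: retrieve with the original query, score the terms of
    the B top documents with eq. (8), add the T best terms, and retrieve again."""
    top_docs = search(query_tf)[:B]                          # first retrieval pass
    r = {}                                                   # r_w: number of top-B documents containing w
    for d in top_docs:
        for w in doc_tf[d]:
            r[w] = r.get(w, 0) + 1
    def offer_weight(w):                                     # eq. (8)
        rw, Nw = r[w], doc_freq[w]
        return rw * (rw + 0.5) * (N - Nw - B + rw + 0.5) / ((Nw - rw + 0.5) * (B - rw + 0.5))
    expansion = sorted(r, key=offer_weight, reverse=True)[:T]
    expanded = dict(query_tf)
    for w in expansion:                                      # augment the query with the new terms
        expanded[w] = expanded.get(w, 0) + 1
    return search(expanded)                                  # second retrieval pass
```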
4. EXPERIMENTS

This section presents the work done to develop and test the presented models. Development and testing were done on two different Italian document retrieval tasks. Performance was measured in terms of Average Precision (AvPr) and mean Average Precision (mAvPr). Given the document ranking produced for a query q, let r_1, ..., r_k be the ranks of the retrieved relevant documents. The AvPr for q is defined as the average of the precision values achieved at all recall points, i.e.:

  AvPr = 100 \cdot \frac{1}{k} \sum_{i=1}^{k} \frac{i}{r_i}    (9)

The mAvPr of a set of queries corresponds to the mean of the corresponding query AvPr values.

4.1. Development

For the purpose of parameter tuning, development material made available by CLEF was used. The collection consists of the test set used by the 1999 TREC-8 CLIR track and its relevance assessments. The CLIR collection contains topics and documents in four languages: English, German, French, and Italian. The Italian part consists of texts issued by the Swiss News Agency (Schweizerische Depeschenagentur) from 17 November 1989 to 31 December 1990, together with 28 topics, four of which have no corresponding Italian relevant documents (topics 60, 63, 76, and 80). More details about the development collection are provided in Tables 3, 4, and 5. For development and evaluation, queries were generated by using all the available topic fields.

  Data Set                    # docs    Avg. # words/doc
  CLIR - Swiss News Agency    62,359    225
  CLEF - La Stampa            58,051    552

Table 3: Development and test collection sizes.

  # of Words
  Data Set (topic #s)    Min   Max   Avg.   Total
  CLIR (54-81)           41    107   70.4   1690
    title                3     8     5.1    122
    description          8     27    17.1   410
    narrative            25    81    48.3   1158
  CLEF (1-40)            31    96    60.8   2067
    title                3     9     5.3    179
    description          7     35    15.7   532
    narrative            14    84    39.9   1356

Table 4: Topic statistics of the development and test collections.

  # of Relevant Docs
  Data Set (topic #s)    Min   Max   Avg.   Total
  CLIR (54-81)           2     15    7.1    170
  CLEF (1-40)            1     42    9.9    338

Table 5: Document retrieval statistics of the development and test collections.

4.2. Okapi Tuning

Tuning of the parameters in formula (2) was carried out on the development data. Figure 1 plots the mAvPr for different values of the parameters.

[Figure 1: Mean Average Precision versus different settings of the Okapi formula's parameters k_1 and b.]

Finally, the values k_1 = 1.5 and b = 0.4 were chosen, because they also provided consistently good results with other evaluation measures. The achieved mAvPr is 46.07%.

4.3. Blind Relevance Feedback Tuning

Tuning of the BRF parameters B and T was carried out just for the Okapi model. Figure 2 plots the mAvPr for different values of the parameters.

[Figure 2: Mean Average Precision versus different settings of the blind relevance feedback parameters B and T.]

Finally, the number of relevant documents B = 5 and the number of relevant terms T = 15 were chosen; their combination gives a mAvPr of 49.2%, corresponding to a 6.8% improvement over the first step.

Further work was done to optimize the performance of the first retrieval step. Indeed, the performance of the BRF procedure is determined by the precision achieved by the first retrieval phase on the very top-ranking documents. In particular, a higher resolution for documents and queries was considered by using base forms instead of stems. Table 6 shows mAvPr values for different combinations of text preprocessing before and after BRF: base forms before and after BRF, word stems before and after BRF, and base forms before BRF with stems after BRF. The last combination achieved the largest improvement (8.6%) and was adopted for the final system.

              # of relevant terms T
  I    II     5      10     15     20     25     30
  st   st     46.4   47.3   49.2   49.6   48.3   48.5
  ba   ba     46.2   47.6   47.6   47.6   47.7   47.3
  ba   st     46.7   48.7   50.0   48.5   48.6   48.6

Table 6: Mean Average Precision using base forms (ba) or word stems (st) before (I) and after (II) blind relevance feedback (with B = 5).

4.4. Official Evaluation

The two presented models were evaluated on the CLEF 2000 Italian monolingual track. The test collection consists of newspaper articles published by La Stampa during 1994, and 40 topics. Since six of the topics have no corresponding documents in the collection, they are not taken into account (the CLEF topics without Italian relevant documents are 3, 6, 14, 27, 28, and 40). More details about the CLEF collection and topics are given in Tables 3, 4, and 5.

The official results of the Okapi and statistical models are reported in Figure 3 under the names irst1 and irst2, respectively. Figure 3 shows the difference in AvPr between each run and the median reference provided by the CLEF organization. As a further reference, the performance differences between the best result of CLEF and the median are also plotted. The mAvPr of irst1 and irst2 are 49.0% and 47.5%, respectively. Both methods score above the median reference mAvPr, which is 44.5%. The mAvPr of the median reference was computed by taking the average of the median AvPr scores.

[Figure 3: Difference (in mean average precision) from the median for each of the 34 topics in the CLEF 2000 Italian monolingual track. The best AvPr reference is also plotted for each topic.]

5. IMPROVEMENTS

Figure 3 shows that the Okapi and the statistical model have quite different behaviors. This suggests that if the two methods rank documents independently, some information about the relevant documents could be gained by integrating the scores of both methods.

In order to compare the rankings of two models A and B, Spearman's rank correlation can be applied. Given a query, let r(A(d)) and r(B(d)) represent the ranks of document d given by A and B, respectively. Then Spearman's rank correlation (Mood et al., 1974) is defined as:

  S = 1 - \frac{6 \sum_d [r(A(d)) - r(B(d))]^2}{N (N^2 - 1)}    (10)

Under the hypothesis of independence between A and B, S has mean 0 and variance 1/(N-1). On the contrary, in the case of perfect correlation the S statistic has value 1.

By taking the average of S over all the queries (as an approximation, rankings were computed over the union of the 100 top documents retrieved by each model), a rank correlation of 0.4 resulted between the irst1 and irst2 runs. This result confirms some degree of independence between the two information retrieval models. Hence, a combination of the two models was implemented by simply taking the sum of their scores. In order to adjust for scale differences, the scores of each model were normalized to the range [0, 1] before summation. Using the official relevance assessments of CLEF, the combined model achieved a mAvPr of 50.0%.

Figures 4 and 5 provide detailed results of the combined model (merge) for each query, against the CLEF references and against the irst1 and irst2 runs, respectively. The combined model performs better than the median reference on 24 of the 34 topics, while irst1 and irst2 improved on the median AvPr 16 and 17 times, respectively. Finally, the combined model improves on the best reference on two topics (20 and 36).

[Figure 4: Difference (in mean average precision) from the median of the combined model and of the best reference of CLEF 2000.]

[Figure 5: Difference (in mean average precision) of the combined model from each single model.]

  Retrieval Model      Official Run   mAvPr
  Okapi                irst1          49.0
  Statistical model    irst2          47.5
  Combined model       -              50.0

Table 7: Performance of the retrieval models on the CLEF 2000 Italian monolingual track.
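The score combination of this section amounts to a per-query min-max normalization followed by a sum. The sketch below is an illustrative rendering under that reading; details such as the handling of documents retrieved by only one model are assumptions.

```python
def combine_runs(scores_a, scores_b):
    """Sum of per-query min-max normalized scores of two retrieval models."""
    def normalize(scores):                                # rescale scores to the range [0, 1]
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0                           # guard against constant scores
        return {d: (s - lo) / span for d, s in scores.items()}
    na, nb = normalize(scores_a), normalize(scores_b)
    docs = set(na) | set(nb)                              # documents ranked by either model
    merged = {d: na.get(d, 0.0) + nb.get(d, 0.0) for d in docs}
    return sorted(merged, key=merged.get, reverse=True)   # final ranking by combined score
```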
6. CONCLUSION

This paper presents preliminary research results by ITC-irst in the field of text retrieval. Nevertheless, participation in the CLEF evaluation has been considered important in order to gain experience and feedback about our progress. Future work will be done to improve the statistical retrieval model, to develop a statistical blind relevance feedback method, and to extend the text retrieval system to other languages, i.e. English and German.

7. REFERENCES

Federico, Marcello, 2000. A system for the retrieval of Italian broadcast news. Speech Communication, 33(1-2).
Federico, Marcello and Renato De Mori, 1998. Language modelling. In Renato De Mori (ed.), Spoken Dialogues with Computers, chapter 7. London, UK: Academic Press.
Johnson, S.E., P. Jourlin, K. Sparck Jones, and P.C. Woodland, 1999. Spoken document retrieval for TREC-8 at Cambridge University. In Proceedings of the 8th Text REtrieval Conference. Gaithersburg, MD.
Merialdo, Bernard, 1994. Tagging English text with a probabilistic model. Computational Linguistics, 20(2):155-172.
Miller, David R. H., Tim Leek, and Richard M. Schwartz, 1998. BBN at TREC-7: Using hidden Markov models for information retrieval. In Proceedings of the 7th Text REtrieval Conference. Gaithersburg, MD.
Mood, Alexander M., Franklin A. Graybill, and Duane C. Boes, 1974. Introduction to the Theory of Statistics. Singapore: McGraw-Hill.
Ng, Kenney, 1999. A maximum likelihood ratio information retrieval model. In Proceedings of the 8th Text REtrieval Conference. Gaithersburg, MD.
Robertson, S. E., S. Walker, S. Jones, M. M. Hancock-Beaulieu, and M. Gatford, 1994. Okapi at TREC-3. In Proceedings of the 3rd Text REtrieval Conference. Gaithersburg, MD.
Sparck Jones, Karen and Peter Willett (eds.), 1997. Readings in Information Retrieval. San Francisco, CA: Morgan Kaufmann.
Witten, Ian H. and Timothy C. Bell, 1991. The zero-frequency problem: Estimating the probabilities of novel events in adaptive text compression. IEEE Transactions on Information Theory, 37(4):1085-1094.

  Text          POS    Base form     Stem        R
  IL            RS     IL            IL          0
  PRIMO         AS     PRIMO         PRIM        1
  MINISTRO      SS     MINISTRO      MINISTR     1
  LITUANO       AS     LITUANO       LITUAN      1
  ,             XPW    ,             ,           0
  SIGNORA       SS     SIGNORA       SIGNOR      1
  KAZIMIERA     SPN    KAZIMIERA     KAZIMIER    1
  PRUNSKIENE    SPN    PRUNSKIENE    PRUNSKIEN   1
  ,             XPW    ,             ,           0
  HA            #VI#   AVERE         AVERE       0
  ANCORA        B      ANCORA        ANCORA      0
  UNA           RS     UNA           UNA         0
  VOLTA         SS     VOLTA         VOLT        1
  SOLLECITATO   VSP    SOLLECITARE   SOLLECIT    1
  OGGI          B      OGGI          OGGI        0
  UN            RS     UN            UN          0
  RAPIDO        #SS#   RAPIDO        RAPID       1
  AVVIO         SS     AVVIO         AVVIO       1
  DEI           EP     DEI           DEI         0
  NEGOZIATI     SP     NEGOZIATO     NEG         1
  CON           E      CON           CON         0
  L'            RS     L'            L'          0
  URSS          YA     URSS          URSS        1
  ,             XPW    ,             ,           0
  RITENENDO     VG     RITENERE      RITEN       0
  FAVOREVOLE    AS     FAVOREVOLE    FAVOR       1
  L'            RS     L'            L'          0
  ATTUALE       AS     ATTUALE       ATTUAL      1
  SITUAZIONE    SS     SITUAZIONE    SIT         1
  NEI           EP     NEI           NEI         0
  RAPPORTI      SP     RAPPORTO      RAPPORT     1
  FRA           E      FRA           FRA         0
  MOSCA         SPN    MOSCA         MOSC        1
  E             C      E             E           0
  VILNIUS       SPN    VILNIUS       VILNIUS     1

Table 8: Example of text preprocessing. The flag in the last column indicates whether the term survives the stop-terms removal. The two POSs marked with # are wrong; nevertheless, they permit correct base forms and stems to be generated.