<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Text Reduction-Enrichment at WebCLEF∗</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff1">
          <institution>Faculty of Computer Science</institution>
          ,
          <country country="MX">Mexico</country>
        </aff>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Information Systems and Computation</institution>
          ,
          <addr-line>UPV</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper we report the results obtained after submitting one run to the Mixed Monolingual task of WebCLEF 2006. We have used a text reduction process based on the selection of mid-frequency terms. Although our approach enhances precision, its recall must be improved by an enrichment process based on the addition of high co-occurrence terms. We have seen that an improvement of 40% was obtained on the corpus used last year in the BiEnEs task. However, we also observed low Mean Reciprocal Rank (MRR) values compared with those of the mixed monolingual task of WebCLEF 2005. We consider that our low MRR derives from a deficient preprocessing phase, but we must investigate this issue in detail.</p>
      </abstract>
      <kwd-group>
        <kwd>Text reduction</kwd>
        <kwd>Text enrichment</kwd>
        <kwd>Mixed-Monolingual</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>∗This work was partially supported by FCC-BUAP and the BUAP-701 PROMEP/103.5/05/1536 grant.</p>
      <p>Nowadays, WebCLEF has defined one task for the evaluation of search engines: the Mixed Monolingual task. Thus, in this paper we report the results obtained after the submission of one run to this task.</p>
      <p>We have used a text reduction and enrichment process and, therefore, we have organized this document in three sections. The next section describes the components of our search engine. Section 3 presents the evaluation results and, finally, a discussion of our findings is given.</p>
    </sec>
    <sec id="sec-2">
      <title>Description of the search engine</title>
      <p>We used a Boolean model with the Jaccard similarity formula for our system. Our goal was to determine the behaviour of document index reduction in an information retrieval environment. In order to reduce the terms of every document treated, we applied a technique named Transition Point, which is described as follows.</p>
      <sec id="sec-2-1">
        <title>The Transition Point Technique</title>
        <p>
          The Transition Point (TP) is a frequency value that splits the vocabulary of a text into two sets of terms (low and high frequency). This technique is based on Zipf's Law of Word Occurrences [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] and also on the refined studies of Booth [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], as well as of Urbizagástegui [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. These studies demonstrate that mid-frequency terms are closely related to the conceptual content of a document. Therefore, it is possible to form the hypothesis that terms whose frequency is close to TP can be used as indexes of a document. A typical formula used to obtain this value is TP = (√(8 · I1 + 1) − 1) / 2, where I1 represents the number of words with frequency equal to 1; see [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
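        <p>As a minimal sketch (assuming the term frequencies of a document are available as a plain Python dictionary; the function name is illustrative), this value can be computed as follows:</p>
        <preformat>
from collections import Counter
from math import sqrt

def transition_point(frequencies):
    """Compute TP = (sqrt(8*I1 + 1) - 1) / 2, where I1 is the number
    of terms that occur exactly once in the document."""
    i1 = sum(1 for f in frequencies.values() if f == 1)
    return (sqrt(8 * i1 + 1) - 1) / 2

# Toy example: term frequencies of a short document
freqs = Counter("the cat sat on the mat the cat slept".split())
print(transition_point(freqs))
</preformat>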
        <p>
          Alternatively, TP can be located by identifying the lowest of the high frequencies that is not repeated in the document; this characteristic follows from the properties of Booth's law of low-frequency words [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. In our experiments we have used this approach.
        </p>
        <p>Let us consider the frequency-sorted vocabulary of a document, i.e., V_TP = [(t1, f1), ..., (tn, fn)] with fi ≥ fi+1; then TP = fi−1 iff fi = fi+1. The most important words are those whose frequency is closest to TP, i.e.,</p>
        <p>TP_SET = {ti | (ti, fi) ∈ V_TP, U1 ≤ fi ≤ U2},   (1)
where U1 is a lower threshold obtained from a given neighbourhood percentage of TP (NTP), thus U1 = (1 − NTP) · TP. U2 is the upper threshold and is calculated in a similar way (U2 = (1 + NTP) · TP). Both in WebCLEF 2005 and in the current competition we have used NTP = 0.4, considering that the TP technique is language independent.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Term Enrichment</title>
        <p>Certainly, TP reduction may increase precision, but it also decreases recall. Due to this fact, we enriched the selected terms by adding new terms with characteristics similar to the initial ones. Specifically, given a text T with selected terms TP_SET, y is a new term if it co-occurs with some x ∈ TP_SET, i.e.,</p>
        <p>TP′_SET = TP_SET ∪ {y | x ∈ TP_SET ∧ (fr(xy) &gt; 1 ∨ fr(yx) &gt; 1)}.   (2)
Considering the text length, we only selected a window of size 1 around each term of TP_SET, and a minimum frequency of two for each bigram was required as the condition to include new terms.</p>
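        <p>A sketch of this enrichment step, assuming the document is available as a list of tokens (names are illustrative):</p>
        <preformat>
from collections import Counter

def enrich(tokens, tp_terms):
    """Add every term that forms a bigram with a TP term at least twice
    (window of size 1, minimum bigram frequency of 2), as in Equation 2."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    enriched = set(tp_terms)
    for (x, y), freq in bigrams.items():
        if freq &gt;= 2:
            if x in tp_terms:
                enriched.add(y)
            if y in tp_terms:
                enriched.add(x)
    return enriched
</preformat>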
      </sec>
      <sec id="sec-2-3">
        <title>Information Retrieval Model</title>
        <p>Our information retrieval system is based on the Boolean model and, in order to rank the retrieved documents, we used the Jaccard similarity function applied to both the query and every document of the corpus. Previously, each document was preprocessed and its index terms were selected (the preprocessing phase is described in Section 3.1). As we will see in Section 3.2, we represent each text using the selection given by Equation 1; additionally, after reduction, we carried out an enrichment process based on the identification of terms related to those selected (Equation 2).</p>
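        <p>As an illustration, Boolean retrieval with Jaccard ranking over the reduced and enriched index terms can be sketched as follows (a simplified view, not the exact implementation used):</p>
        <preformat>
def jaccard(query_terms, doc_terms):
    """Jaccard similarity between a query and a document, both
    represented as sets of index terms."""
    q, d = set(query_terms), set(doc_terms)
    if not q or not d:
        return 0.0
    return len(q.intersection(d)) / len(q.union(d))

def rank(query_terms, index):
    """index maps a document id to its set of reduced/enriched terms;
    documents are returned sorted by decreasing similarity."""
    scores = {doc: jaccard(query_terms, terms) for doc, terms in index.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
</preformat>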
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Evaluation</title>
      <sec id="sec-3-1">
        <title>Corpus</title>
        <p>
          We used the EuroGOV corpus provided by the WebCLEF forum, which is described in more detail in [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], but we indexed only 20 domains: DE, AT, BE, DK, SI, ES, EE, IE, IT, SK, LU, MT, NL, LV, PT, FR, CY, GR, HU, and UK (we did not index the following domains: EU, RU, FI, PL, SE, CZ, LT). Due to this fact, only 1470 out of 1939 topics were evaluated, which is approximately 75.81% of the total number of topics. Although in Section 3.2 we present the MRR over all 1939 topics, the 469 topics related to the non-indexed domains were not evaluated.
        </p>
        <p>The preprocessing phase of the EuroGOV corpus was carried out by writing two scripts for obtaining the terms to be indexed from each document. The first script uses regular expressions to exclude all the information enclosed by the characters &lt; and &gt;. Although this script obtains very good results, it is very slow and, therefore, we decided to use it with only three domains of the EuroGOV collection, namely Spanish (ES), French (FR), and German (DE).</p>
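        <p>In essence, this first script reduces to a single regular expression (a minimal sketch; the actual script handled further cases):</p>
        <preformat>
import re

# Drop everything enclosed by the characters '&lt;' and '&gt;'
TAG_RE = re.compile(r"&lt;[^&gt;]*&gt;")

def strip_markup(page):
    """Return the text of a page with all markup removed, ready for
    term extraction; script and style bodies are not handled here."""
    return TAG_RE.sub(" ", page)
</preformat>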
        <p>On the other hand, we wrote a second script based on the HTML syntax for obtaining all the terms considered interesting for indexing, i.e., those different from script code (JavaScript, VBScript), cascading style sheets, HTML markup, etc. This script speeded up our indexing process, but it did not take into account that some web pages are incorrectly written and, therefore, we missed important information from those documents.</p>
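        <p>In spirit, this second script resembles the following sketch built on Python's html.parser (our original implementation differed in detail):</p>
        <preformat>
from html.parser import HTMLParser

class TermExtractor(HTMLParser):
    """Collect visible text, skipping the bodies of script/style elements."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)

def extract_text(page):
    """Feed a decoded page and return its visible text."""
    parser = TermExtractor()
    parser.feed(page)
    return " ".join(parser.chunks)
</preformat>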
        <p>
          For every page compiled in the EuroGOV corpus, we also determined its language by using TextCat [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], a widely used language identification program. We constructed our evaluation corpus with those documents identified as belonging to a language of the above list.
        </p>
        <p>Another preprocessing problem was the charset encoding, which makes the analysis even more difficult. Although the EuroGOV corpus is distributed in UTF-8, the documents that make up this corpus do not necessarily keep this charset. We have seen that for some domains the charset is declared in the HTML metadata tag, but we also found that this declaration can be wrong, perhaps because it was filled in without the supervision of the creator of the page, who may neither know nor care about character encodings. We consider this the most difficult problem in the preprocessing phase.</p>
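        <p>One way to cope with this (a hedged sketch, not our original code; the helper name is illustrative) is to try the declared encoding first and fall back gracefully:</p>
        <preformat>
def decode_page(raw_bytes, declared=None):
    """Try the charset declared in the HTML metadata first, then UTF-8;
    fall back to Latin-1, which never fails, as a last resort."""
    for charset in filter(None, (declared, "utf-8")):
        try:
            return raw_bytes.decode(charset)
        except (UnicodeDecodeError, LookupError):
            continue
    return raw_bytes.decode("iso-8859-1")
</preformat>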
        <p>Finally, we eliminated stopwords for each language (except for Greek) as well as punctuation symbols. The same process was applied to the queries.</p>
        <p>For the evaluation of this corpus, a set of queries was provided by WebCLEF 2006.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Indexing reduction</title>
        <p>
          After our first participation in WebCLEF [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], we carried out more experiments using only the Spanish-language documents of the EuroGOV corpus. We observed that a value of NTP = 0.4 with the reduction process shown in Equation 1 was adequate. Therefore, in this test we carried out one run with that value. This run used the evaluation corpus composed by the reduction of every text with the TP technique and a neighbourhood of 40% around TP, and enriched this set of terms with related terms as described by Equation 2.
        </p>
        <p>Table 1 shows the size of every evaluation corpus used: the vocabulary composed by the representation of all texts, |TP′_SET|, as well as the percentage of reduction obtained with respect to the original text. As we can see, the TP technique obtained a percentage of reduction lower than 5%, which also implies a reduction in indexing time for a search engine.</p>
        <p>We have used an index reduction method for our search engine that includes an enrichment step. Our proposal is based on the Transition Point technique, which allows us to obtain mid-frequency terms from every document to be indexed. Our method is linear in computational time and, therefore, it can be used in a wide spectrum of practical tasks.</p>
        <p>After submitting our run, we observed an enhancement when comparing the results obtained with those of the BiEnEs task in WebCLEF 2005. By using the enrichment, an improvement of more than 40% in MRR was achieved. However, using the Vector Space Model, results similar to those of the Boolean model were obtained.</p>
        <p>The TP technique has shown effective use in diverse areas of NLP, and its best features for NLP are mainly two: a high content of semantic information, and the sparseness of the vectors obtained for document representation in models based on the vector space model. On the other hand, its language independence allows this technique to be used in multilingual environments.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Booth</surname>
          </string-name>
          :
          <article-title>A Law of Occurrences for Words of Low Frequency</article-title>
          ,
          <source>Information and Control</source>
          ,
          <year>1967</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>CLEF</surname>
          </string-name>
          <year>2005</year>
          :
          <article-title>Cross-Language Evaluation Forum</article-title>
          , http://www.clef-campaign.org/,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.</given-names>
            <surname>Pinto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jiménez-Salazar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sanchis</surname>
          </string-name>
          :
          <article-title>TPIRS: A System for Document Indexing Reduction on WebCLEF</article-title>
          ,
          <source>Extended abstract in Working Notes of CLEF 2005</source>
          , Vienna,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B.</given-names>
            <surname>Reyes-Aguirre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Moyotl-Hernández</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Jiménez-Salazar</surname>
          </string-name>
          :
          <article-title>Reducción de Términos Índice Usando el Punto de Transición</article-title>
          ,
          <source>In Proceedings of the Facultad de Ciencias de la Computación XX Anniversary Conferences, BUAP</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B.</given-names>
            <surname>Sigurbjörnsson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kamps</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>de Rijke</surname>
          </string-name>
          :
          <article-title>EuroGOV: Engineering a Multilingual Web Corpus</article-title>
          ,
          <source>In Proceedings of CLEF 2005</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>B.</given-names>
            <surname>Sigurbjörnsson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kamps</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>de Rijke</surname>
          </string-name>
          :
          <article-title>WebCLEF 2005: Cross-Lingual Web Retrieval</article-title>
          ,
          <source>In Proceedings of CLEF 2005</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <source>TextCat: Language identification tool</source>
          , http://odur.let.rug.nl/~vannoord/TextCat/,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R.</given-names>
            <surname>Urbizagástegui</surname>
          </string-name>
          :
          <article-title>Las posibilidades de la Ley de Zipf en la indización automática</article-title>
          ,
          <source>Research report of the University of California Riverside</source>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>G. K.</given-names>
            <surname>Zipf</surname>
          </string-name>
          :
          <article-title>Human Behavior and the Principle of Least Effort</article-title>
          ,
          <source>Addison-Wesley</source>
          , Cambridge MA,
          <year>1949</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>