=Paper=
{{Paper
|id=Vol-1169/CLEF2003wn-adhoc-VillaresEt2003
|storemode=property
|title=COLE Experiments at CLEF 2003: Spanish Monolingual Track
|pdfUrl=https://ceur-ws.org/Vol-1169/CLEF2003wn-adhoc-VillaresEt2003.pdf
|volume=Vol-1169
|dblpUrl=https://dblp.org/rec/conf/clef/VilaresAR03a
}}
==COLE Experiments at CLEF 2003: Spanish Monolingual Track==
COLE experiments at CLEF 2003 Spanish monolingual track

Jesús Vilares, Francisco J. Ribadas, Miguel A. Alonso

Departamento de Computación, Universidade da Coruña, Campus de Elviña s/n, 15071 La Coruña (Spain). {jvilares, alonso}@udc.es
Escuela Superior de Ingeniería Informática, Universidade de Vigo, Campus As Lagoas s/n, 32004 Orense (Spain). ribadas@uvigo.es

Abstract

In this, our second participation in the CLEF Spanish monolingual track, we have continued applying Natural Language Processing techniques for single word and multi-word term conflation. Two different conflation approaches have been tested. The first approach is based on the lemmatization of the text in order to avoid inflectional variation. Our second approach consists of the use of syntactic dependencies as complex index terms, in an attempt to solve the problems derived from syntactic variation and, in this way, to obtain more precise terms. Such dependencies are obtained through a shallow parser based on cascades of finite-state transducers.

1 Introduction

In Information Retrieval (IR) systems, the correct representation of a document through an accurate set of index terms is the basis for obtaining good performance. If we are not able both to extract and to weight appropriately the terms which capture the semantics of the text, this shortcoming will affect all subsequent processing. In this context, one of the major limitations we have to deal with is the linguistic variation of natural languages [2], particularly when processing documents written in languages with more complex morphological and syntactic structures than English, as is the case of Spanish. When managing this type of phenomena, the use of Natural Language Processing (NLP) techniques becomes feasible. This has been our working hypothesis since our research group, the COLE Group (http://www.grupocole.org), started its work on Spanish Information Retrieval. It was our working hypothesis in our participation in CLEF 2002 [18], and it remains so in CLEF 2003.

As in our first participation, our main premise is simplicity, motivated by the lack of available linguistic resources for Spanish, such as large tagged corpora, treebanks or advanced lexicons. This work is a continuation and refinement of our previous work, presented at CLEF 2002, but centered this time on the use of lemmatization for dealing with inflectional variation and on the use of syntactic dependencies for dealing with syntactic variation.

This article is outlined as follows. Section 2 describes the techniques used for single word term conflation. Our approach for dealing with syntactic variation through shallow parsing is introduced in Section 3. The tuning process of our system before the official runs is shown in Section 4. Finally, the official runs are presented and discussed in Section 5.

2 Single word term conflation

As in our previous contribution to CLEF 2002 [18], our proposal for single word term conflation is still based on exploiting the lexical level in two phases: firstly, by solving inflectional variation through lemmatization and, secondly, by solving derivational morphology through the use of morphological families. The process followed for single word term conflation starts by tagging the document.
The first step consists of applying our linguistically motivated preprocessor module [9, 3] in order to perform tasks such as format conversion, tokenization, sentence segmentation, morphological pretagging, contraction splitting, separation of enclitic pronouns from verbal stems, expression identification, numeral identification and proper noun recognition. Classical approaches, such as stemming, rarely manage these phenomena, resulting in wrong simplifications during the conflation process.

The output generated by our preprocessor is then taken as input by our tagger-lemmatizer, MrTagoo [6], although any high-performance part-of-speech tagger could be used instead. MrTagoo is based on a second order Hidden Markov Model (HMM), whose elements and parameter estimation procedures follow Brants' work [4], and it also incorporates certain capabilities which motivated its use in our system. Such capabilities include a very efficient structure for storage and search, based on finite-state automata [8], the management of unknown words, the possibility of integrating external dictionaries into the probabilistic frame defined by the HMM [10], and the possibility of managing segmentation ambiguity [7].

Nevertheless, this kind of tool is very sensitive to spelling errors, as, for example, in the case of sentences written completely in uppercase (e.g. news titles and subsection headings), which cannot be correctly managed by the preprocessor and tagger modules. For this reason, the initial output of the tagger is processed by an uppercase-to-lowercase module [18] in order to process uppercase sentences, converting them to lowercase and restoring the spelling signs when necessary.

Once the text has been tagged, the lemmas of the content words (nouns, verbs and adjectives) are extracted to be indexed. In this way we solve the problems derived from inflection in Spanish. With regard to computational cost, the running cost of a lemmatizer-disambiguator is linear with respect to the length of the word, and cubic with respect to the size of the tagset, which is a constant. As we only need to know the grammatical category of the word, the tagset is small and therefore the increase in cost with respect to classical approaches (stemmers) becomes negligible. Our previous experiments at CLEF 2002 showed that lemmatization performs better than stemming, even when using stemmers which also deal with derivational morphology.

Once inflectional variation has been solved, the next logical step consists of solving the problems caused by derivational morphology. For this purpose, we have grouped the words derivable from one another by means of the mechanisms of derivational morphology; each one of these groups is a morphological family. Each of the lemmas belonging to the same morphological family is conflated into the same term, a representative of the family. The set of morphological families is automatically generated from a large lexicon of Spanish words by means of a tool which implements the most common derivational mechanisms of Spanish [20]. Since the set of morphological families is generated statically, there is no increase in running cost. Nevertheless, our previous experiments at CLEF 2002 showed that the use of morphological families for single word term conflation introduces too much noise into the system. For this reason, lemmatization will be the conflation technique used for single word terms, while morphological families will only be used in multi-word term conflation, as shown in Section 3.
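To make the two lexical conflation options concrete, the following minimal sketch (Python; not the code actually used in our system) shows how content word forms could be replaced by their lemmas and, optionally, mapped to a representative of their morphological family. The tagged triples, tag names and family map are invented stand-ins for the output of MrTagoo and of the family generation tool [20].

import sys

# (form, part-of-speech tag, lemma) triples, as a tagger-lemmatizer might return them
tagged = [
    ("cambios", "NOUN", "cambio"),
    ("climáticos", "ADJ", "climático"),
    ("importantes", "ADJ", "importante"),
    ("en", "PREP", "en"),
]

CONTENT_TAGS = {"NOUN", "ADJ", "VERB"}   # only nouns, adjectives and verbs are indexed

# hypothetical morphological families: each lemma maps to a family representative
FAMILIES = {"climático": "clima", "cambio": "cambio", "importante": "importar"}

def conflate(tagged, use_families=False):
    """Return the single word index terms for a tagged sentence."""
    terms = []
    for form, tag, lemma in tagged:
        if tag not in CONTENT_TAGS:
            continue                      # function words are dropped
        term = FAMILIES.get(lemma, lemma) if use_families else lemma
        terms.append(term)
    return terms

print(conflate(tagged))                      # lemmatization only (the option retained here)
print(conflate(tagged, use_families=True))   # family conflation (too noisy for single terms)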
3 Managing the syntactic variation through shallow parsing

Following the same scheme as our previous experiments, once we have established how to process the content of the document at word level, the next step consists of deciding how to process its syntactic content at phrase level in order to manage the syntactic variation of the document. For this purpose, we extract the pairs of words related through syntactic dependencies in order to use them as complex index terms. This process is performed in two steps: firstly, the text is parsed by means of a shallow parser and, secondly, the syntactic dependencies are extracted and conflated into index terms.

3.1 The shallow parser

When dealing with syntactic variation, we have to face the problems derived from the high computational cost of parsing. In order to maintain a linear complexity with respect to the length of the text to be analyzed, we have discarded the use of full parsing techniques [14], opting for shallow parsing techniques, also looking for more robustness.

The theoretical basis for the design of our parser comes from formal language theory, which tells us that, given a context-free grammar and an input string, the syntactic trees of height h generated by a parser can be obtained by means of h layers of finite-state transducers: the first layer obtains the nodes labeled by non-terminals corresponding to left-hand sides of productions that only contain terminals on their right-hand side; the second layer obtains those nodes which only involve terminal symbols and the non-terminal symbols generated in the previous layer; and so on. It can be argued that the parsing capability of the system is, in this way, limited by the height of the parseable trees. Nevertheless, this kind of shallow parsing [1] has shown itself to be useful in several NLP application fields, particularly in Information Extraction. Its application in IR, which has not been deeply studied, was tested by Xerox for English [11], showing its superiority with respect to classical approaches based on contiguous words.
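The following toy example (Python) illustrates this bottom-up layered rewriting. Regular expressions over a flat tag string stand in for the finite-state transducers, and the tags and rules are invented for illustration; they are far simpler than the five-layer grammar described below.

import re

sentence = "DET ADJ NOUN PREP DET NOUN"     # e.g. "los nuevos coches de la empresa"

layers = [
    # layer 1: noun phrases from determiners, adjectives and nouns
    (re.compile(r"DET( ADJ)* NOUN( ADJ)*"), "NP"),
    # layer 2: prepositional phrases from a preposition plus a noun phrase
    (re.compile(r"PREP NP"), "PP"),
    # layer 3: attach a prepositional complement to the preceding noun phrase
    (re.compile(r"NP PP"), "NP"),
]

parsed = sentence
for pattern, label in layers:
    parsed = pattern.sub(label, parsed)
    print(parsed)
# DET ADJ NOUN PREP DET NOUN  ->  NP PREP NP  ->  NP PP  ->  NP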
In this way, we have implemented a shallow parser based on a five-layer architecture whose input is the output of our tagger-lemmatizer. The function of each layer is as follows:

Layer 0: improving the preprocessing. Its function is the management of certain linguistic constructions in order to minimize the noise generated during the subsequent parsing. Such constructions include: numerals in non-numerical format; quantity expressions, i.e. expressions of the type algo más de dos millones (a little more than two million) or unas dos docenas (about two dozen), which denote a number but with a certain vagueness about its concrete value and are identified as numeral phrases; and expressions with a verbal function, since some verbal expressions, such as tener en cuenta (to take into account), must be considered as a unit, in this case a synonym of the verb considerar (to consider), to avoid errors in the upper layers such as identifying en cuenta as a complement of the verb.

Layer 1: adverbial phrases and first level verbal groups. In this layer the system identifies, on the one hand, the adverbial phrases of the text, either those with an adverbial head, e.g. rápidamente (quickly), or those expressions which are not properly adverbial but have an equivalent function, e.g. de forma rápida (in a quick way). On the other hand, non-periphrastic verbal groups, which we name first level verbal groups, are processed, both in their simple and compound forms and in their active and passive forms.

Layer 2: adjectival phrases and second level verbal groups. Adjectival phrases such as azul (blue) or muy alto (very high) are managed here, together with periphrastic verbal groups, such as tengo que ir (I have to go), which we name second level verbal groups. Verbal periphrases are unions of two or more verbal forms working as a unit, attributing shades of meaning, such as obligation or the degree of development of the action, to the semantics of the main verb. Moreover, these shades cannot be expressed by means of the simple and compound forms of the verb.

Layer 3: noun phrases. In the case of noun phrases, together with simple structures such as the attachment of determiners and adjectives to the noun, we have considered more complex phenomena, such as the existence of partitive complements, e.g. alguno de (some of), ninguno de (none of), in order to cover more complex nominal structures, e.g. cualquiera de aquellos coches nuevos (any of those new cars).

Layer 4: prepositional phrases. These are formed by a noun phrase preceded by a preposition. We have considered three different types according to this preposition, in order to make the extraction of dependencies easier: those preceded by the preposition por (by), those preceded by de (of), and the rest of prepositional phrases.

Each of the rules involved in the different stages of the parsing process has been implemented through a finite-state transducer, forming in this way a parser based on a cascade of finite-state transducers. Therefore, our approach maintains a linear complexity.

3.2 Extraction and conflation of dependencies

Once the text has been parsed, the system identifies the syntactic roles of the recognized phrases and extracts the following dependency pairs:

- A noun and each of its modifying adjectives.
- A noun and the head of its prepositional complement.
- The head of the subject and its predicative verb.
- The head of the subject and the head of the attribute. From a semantic point of view, copulative verbs are mere links, so the dependency is established directly between the subject and the attribute.
- An active verb and the head of its direct object.
- A passive verb and the head of its agent.
- A predicative verb and the head of its prepositional complement.
- The head of the subject and the head of a prepositional complement of the verb, but only when the verb is copulative (because of its special behavior).

Once such dependencies have been identified, they are conflated through the following conflation scheme:

1. The simple terms compounding the pair are conflated employing morphological families (see Section 2) in order to improve the management of syntactic variation by covering the appearance of morphosyntactic variants of the original term [19, 12]. In this way, terms such as cambio en el clima (change of the climate) and cambio climático (climatic change), which express the same concept with different but semantically and derivationally related words, can be matched (see the sketch after this list).

2. Conversion to lowercase and elimination of spelling signs, as in the case of stemmers. Previous experiments show that this step eliminates much of the noise introduced by spelling errors [18].
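The following sketch (Python) illustrates this conflation scheme on the example above. The pair representation, the joining convention and the family map are our own invented examples, not the actual implementation: two dependency pairs that express the same concept are conflated into the same complex index term.

import unicodedata

# hypothetical morphological families: lemma -> family representative
FAMILIES = {"climático": "clima", "clima": "clima", "cambio": "cambio"}

def strip_accents(text):
    """Lowercase and remove spelling signs (accents), as a stemmer would."""
    text = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

def conflate_pair(head, modifier):
    """Conflate a dependency pair into a single complex index term."""
    terms = (FAMILIES.get(head, head), FAMILIES.get(modifier, modifier))
    return "_".join(strip_accents(t) for t in terms)

# "cambio climático" and "cambio en el clima" conflate to the same complex term
print(conflate_pair("cambio", "climático"))   # cambio_clima
print(conflate_pair("cambio", "clima"))       # cambio_clima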
4 Tuning the system

Before carrying out the official experiments for CLEF 2003, we tuned our parsing-based approach using the CLEF 2001/2002 corpus [13], formed by 215,738 news reports taking up a total disk space of 509 MB, and a set of 100 queries, numbered 41 to 140. The initial conditions of these training experiments were:

1. Use of the vector-based indexing engine SMART [5], with an atn-ntc weighting scheme [17].
2. Stopword list obtained from the content word lemmas of the Spanish stopword list provided by SMART. In the case of the dependency pairs, a pair is eliminated if any of its compounding words is a stopword.
3. Use of the uppercase-to-lowercase module to recover uppercase sentences during tagging.
4. Elimination of spelling signs and conversion to lowercase after conflation to reduce typographical errors.
5. The three fields of the query (title, description and narrative) were employed, but giving double relevance to the title statement because it summarizes the basic semantics of the query.
6. Combination of simple and complex terms: the former obtained through the lemmatization of the content words of the text, the latter obtained through the conflation of the syntactic dependencies identified by the shallow parser.

For this training phase we used the indexing of the content word lemmas of the text (lem) as our point of reference, since our previous experiments [18, 19], where lemmatization beats stemming as a word-level conflation technique, indicate that this technique is the best starting point for the development of NLP-based conflation methods.

Table 1 shows the results obtained. The first column of the table (lem) shows the results for lemmatization, whereas the next columns (sdi) contain the results obtained by merging lemmatized simple terms and complex terms based on syntactic dependencies (sd), when the weight ratio between simple and complex terms, i to 1, changes. The column opt is formed by the best results obtained with sd for each parameter considered. Finally, the last column (Δ) shows the improvement of opt with respect to lem. Each row contains one of the parameters employed to measure the performance of the system: number of documents retrieved, number of relevant documents retrieved (5548 expected), R-precision, average precision (non-interpolated) for all relevant documents (averaged over queries), average document precision for all relevant documents (averaged over relevant documents), precision at standard levels of recall, and precision at N documents retrieved.

Table 1: Tuning the system with the CLEF 2001/2002 corpus

                            lem    sd1    sd2    sd3    sd4    sd5    sd6    sd7    sd8    sd9    sd10   sd11   sd12   opt    Δ
Documents                   99k    99k    99k    99k    99k    99k    99k    99k    99k    99k    99k    99k    99k    --     --
Relevant (5548 expected)    5220   5214   5250   5252   5252   5248   5249   5244   5242   5241   5240   5239   5239   5252   32
R-precision                 .5131  .4806  .5041  .5137  .5175  .5174  .5200  .5203  .5197  .5182  .5175  .5158  .5167  .5203  .0072
Non-interpolated precision  .5380  .5085  .5368  .5440  .5461  .5462  .5464  .5472  .5463  .5462  .5462  .5459  .5456  .5472  .0092
Document precision          .5924  .5489  .5860  .5974  .6013  .6025  .6028  .6026  .6020  .6017  .6015  .6010  .6007  .6028  .0104
Precision at 0.00 recall    .8754  .8493  .8729  .8716  .8689  .8684  .8686  .8706  .8681  .8654  .8696  .8735  .8760  .8760  .0006
Precision at 0.10 recall    .7934  .7602  .8027  .8019  .8079  .8093  .8082  .8069  .8071  .8090  .8097  .8088  .8076  .8097  .0163
Precision at 0.20 recall    .7340  .6847  .7240  .7394  .7435  .7440  .7465  .7468  .7458  .7456  .7433  .7410  .7387  .7468  .0128
Precision at 0.30 recall    .6697  .6355  .6671  .6777  .6835  .6802  .6826  .6819  .6835  .6825  .6823  .6823  .6820  .6835  .0138
Precision at 0.40 recall    .6256  .5911  .6206  .6297  .6322  .6332  .6348  .6335  .6324  .6324  .6323  .6320  .6322  .6348  .0092
Precision at 0.50 recall    .5749  .5384  .5707  .5825  .5856  .5827  .5843  .5849  .5842  .5840  .5830  .5831  .5825  .5856  .0107
Precision at 0.60 recall    .5146  .4753  .5041  .5137  .5168  .5176  .5187  .5214  .5207  .5208  .5195  .5198  .5195  .5214  .0068
Precision at 0.70 recall    .4402  .4142  .4331  .4408  .4428  .4430  .4445  .4462  .4467  .4457  .4454  .4450  .4451  .4467  .0065
Precision at 0.80 recall    .3652  .3512  .3691  .3714  .3724  .3724  .3722  .3738  .3733  .3729  .3728  .3723  .3727  .3738  .0086
Precision at 0.90 recall    .2723  .2649  .2799  .2834  .2853  .2850  .2830  .2831  .2817  .2813  .2808  .2805  .2792  .2853  .0130
Precision at 1.00 recall    .1619  .1534  .1613  .1645  .1628  .1634  .1630  .1646  .1641  .1641  .1641  .1641  .1638  .1646  .0027
Precision at 5 docs         .6747  .6525  .6909  .6869  .6848  .6788  .6808  .6828  .6808  .6828  .6828  .6808  .6747  .6909  .0162
Precision at 10 docs        .6010  .5859  .6091  .6192  .6202  .6192  .6192  .6172  .6152  .6121  .6131  .6121  .6131  .6202  .0192
Precision at 15 docs        .5623  .5441  .5690  .5737  .5778  .5791  .5791  .5764  .5758  .5758  .5737  .5731  .5710  .5791  .0168
Precision at 20 docs        .5374  .5040  .5298  .5328  .5354  .5343  .5384  .5394  .5384  .5399  .5399  .5399  .5399  .5399  .0025
Precision at 30 docs        .4825  .4549  .4778  .4852  .4892  .4886  .4882  .4896  .4896  .4892  .4886  .4879  .4875  .4896  .0071
Precision at 100 docs       .3067  .2873  .3017  .3070  .3084  .3095  .3087  .3089  .3083  .3083  .3084  .3088  .3088  .3095  .0028
Precision at 200 docs       .2051  .1959  .2033  .2057  .2062  .2063  .2067  .2067  .2065  .2062  .2063  .2063  .2064  .2067  .0016
Precision at 500 docs       .0997  .0980  .0997  .1001  .1004  .1005  .1005  .1005  .1005  .1004  .1004  .1004  .1003  .1005  .0008
Precision at 1000 docs      .0527  .0527  .0530  .0531  .0531  .0530  .0530  .0530  .0529  .0529  .0529  .0529  .0529  .0531  .0004

As shown in column sd1, the direct use of syntactic dependencies as index terms led to a general decrease in the performance of the system. After examining the behavior of the system for each query, we inferred that the problem was caused by an over-balance of the weight of complex terms, which are much less frequent than simple terms and therefore receive a much higher weight. In this way, when a match between a complex term and a relevant document occurred, its assigned score increased substantially, improving its ranking. Nevertheless, in the same way, when an undesired match with a non-relevant document occurred, its computed relevance grew excessively. It could be argued that, according to this, we would expect results similar to those obtained with simple terms only. Nevertheless, it should be noticed that complex term matches are much less frequent than those for simple terms. Therefore, incorrect matches between complex terms and non-relevant documents are much more harmful than those for simple terms, whose effect tends to be weakened by the rest of the matches. It can be deduced that this first attempt led to an increasing instability of the system.

In order to minimize the negative effect of undesired matches, the over-balance of complex terms needed to be solved. Therefore, the balance factor between the weights of simple and complex terms was corrected, decreasing the extra initial relevance assigned to complex terms.
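As a rough illustration of this i-to-1 rebalancing, consider the toy additive scoring sketch below (Python). In the actual experiments the adjustment is applied through the indexing engine's weighting (SMART atn-ntc); the score function, term weights and joining convention here are invented for illustration only.

def score(doc_terms, query_terms, weights, balance=4):
    """Toy document score: simple (single lemma) term weights are multiplied
    by `balance`, complex (dependency pair) term weights are left as they are."""
    total = 0.0
    for term in query_terms:
        if term in doc_terms:
            is_simple = "_" not in term          # pairs are joined with "_" in this sketch
            total += weights[term] * (balance if is_simple else 1)
    return total

# made-up weights: the rarer complex term gets a much larger weight
weights = {"cambio": 0.2, "clima": 0.3, "cambio_clima": 1.5}
doc = {"cambio", "clima", "cambio_clima"}
print(score(doc, weights.keys(), weights, balance=1))   # sd1: complex terms dominate
print(score(doc, weights.keys(), weights, balance=4))   # sd4: simple terms rebalanced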
The results of this correction were immediate, as shown in the remaining sd columns, where the performance of the system gradually improves, particularly with respect to the precision in the first 15 documents retrieved and to the number of relevant documents retrieved (5220 with lem, 5214 with sd1, and 5250 with sd2). As generally happens in IR, we cannot talk about one best method for all situations. From a ranking point of view and with respect to the top N documents retrieved, sd4, in which the weights of simple terms are quadrupled, obtained the best results, also reaching the best recall (5252 relevant documents retrieved). Nevertheless, the best results for global performance measures were obtained with sd7, using a higher balance factor. The performance of the system gets worse, in general, for higher factors, except in the case of precision vs. recall, where we obtain the best results for the lowest levels of recall, although at the expense of sacrificing performance in the remaining aspects. Since our priority was to increase the precision of the top documents retrieved, we decided to use a balance factor of 4, as in the case of sd4, for the official runs.

5 CLEF 2003 official runs

In this new edition of CLEF, the document corpus for the Spanish monolingual track has been enlarged. The new corpus is formed by 215,738 news reports (509 MB) from 1994 plus 238,307 news reports (577 MB) from 1995; that is, 454,045 documents (1086 MB). The set of topics has also been enlarged; this year it consists of 60 queries (141 to 200) instead of 50 as in previous years. Our group submitted four runs to the CLEF 2003 Spanish monolingual track:

coleTDlemZP03 (TDlemZP for short): Conflation of content words via lemmatization, i.e. each form of a content word is replaced by its lemma. This kind of conflation takes only inflectional morphology into account. The resulting conflated document was indexed using the probabilistic engine ZPrise (http://www.itl.nist.gov), employing the Okapi BM25 weighting scheme [15] with the constants defined in [16] for Spanish. The query is formed by the set of content word lemmas present in the title and description fields.

coleTDNlemZP03 (TDNlemZP for short): The same as before, but the query also includes the set of content word lemmas obtained from the narrative field.

coleTDNlemSM03 (TDNlemSM for short): As in the case of coleTDNlemZP03, the three fields of the query are conflated through lemmatization. Nevertheless, this time the indexing engine is the vector-based SMART [5], with an atn-ntc weighting scheme [17]. This run was submitted in order to use it as a point of reference for the rest of the runs.

coleTDNpdsSM03 (TDNpdsSM for short): Text conflated via the combination of simple terms, obtained through the lemmatization of content words, and complex terms, obtained through the conflation of syntactic dependencies, as described in Section 3. The balance factor between the weights of simple and complex terms is 4 to 1, i.e. the weights of simple terms are quadrupled, seeking to increase the precision of the top ranked documents according to the results of Section 4.

There are no experiments indexing syntactic dependencies with the Okapi BM25 weighting scheme, since we are still studying the best way to integrate them into a probabilistic model.
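For reference, the term weight assigned by the Okapi BM25 scheme used in the ZPrise runs can be written in its standard textbook formulation as shown below; the Spanish-tuned values of the constants k1 and b are those given in [16] and are not reproduced here.

\[
w(t,d) \;=\; \log\frac{N - df_t + 0.5}{df_t + 0.5}
\;\cdot\;
\frac{(k_1 + 1)\, tf_{t,d}}{k_1\bigl((1-b) + b\,\frac{dl_d}{avdl}\bigr) + tf_{t,d}}
\]

Here tf_{t,d} is the frequency of term t in document d, df_t its document frequency, N the number of documents in the collection, and dl_d and avdl the document length and the average document length.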
The conditions employed in the official runs were:

1. Stopword list obtained from the content word lemmas of the SMART Spanish stopword list.
2. Use of the uppercase-to-lowercase module to recover uppercase sentences during tagging.
3. Elimination of spelling signs and conversion to lowercase after conflation to reduce typographical errors.
4. Except for the first run, TDlemZP, the terms extracted from the title field of the query are given double relevance with respect to description and narrative.

Table 2: CLEF 2003: performance measures

                                              TDlemZP   TDNlemZP  TDNlemSM  TDNpdsSM
Documents retrieved                            57,000    57,000    57,000    57,000
Relevant documents retrieved (2368 expected)    2,237     2,253     2,221     2,249
R-precision                                    0.4503    0.4935    0.4453    0.4684
Average non-interpolated precision             0.4662    0.5225    0.4684    0.4698
Average document precision                     0.5497    0.5829    0.5438    0.5408
11-points average precision                    0.4788    0.5325    0.4799    0.4861

Table 3: CLEF 2003: average precision at 11 standard recall levels and at seen documents

Precision at 11 standard recall levels:

Recall   TDlemZP  TDNlemZP  TDNlemSM  TDNpdsSM
0.00     0.8014   0.8614    0.7790    0.7897
0.10     0.7063   0.7905    0.6982    0.7165
0.20     0.6553   0.7301    0.6331    0.6570
0.30     0.5969   0.6449    0.5738    0.6044
0.40     0.5485   0.5911    0.5388    0.5562
0.50     0.4969   0.5616    0.5003    0.5092
0.60     0.4544   0.4871    0.4457    0.4391
0.70     0.3781   0.4195    0.3987    0.3780
0.80     0.3083   0.3609    0.3352    0.3191
0.90     0.2093   0.2594    0.2292    0.2248
1.00     0.1111   0.1512    0.1472    0.1525

Precision at N documents retrieved:

Docs     TDlemZP  TDNlemZP  TDNlemSM  TDNpdsSM
5        0.5930   0.6421    0.5930    0.5684
10       0.5070   0.5596    0.5018    0.4965
15       0.4713   0.4971    0.4515    0.4573
20       0.4307   0.4614    0.4281    0.4202
30       0.3719   0.4012    0.3784    0.3678
100      0.2316   0.2393    0.2316    0.2305
200      0.1461   0.1505    0.1455    0.1458
500      0.0726   0.0731    0.0718    0.0719
1000     0.0392   0.0395    0.0390    0.0395

According to Tables 2 and 3, the probabilistic approach based on the BM25 weighting scheme (TDlemZP and TDNlemZP) is clearly superior to the vector-based atn-ntc weighting scheme (TDNlemSM and TDNpdsSM), even when only lemmatizing the text. As we can see, TDlemZP obtains similar or better results than TDNlemSM even though the latter also employs the extra information provided by the narrative field of the topic.

With respect to the main contribution of this work, the use of syntactic dependencies as complex index terms, the results are a little different from what was expected. With respect to global performance measures, the TDNpdsSM run obtains better results than TDNlemSM, except for average document precision. Nevertheless, the behavior of the system with respect to ranking has partially changed, since the results obtained for precision at N documents retrieved when employing complex terms (TDNpdsSM) are worse than those obtained using only simple lemmatized terms (TDNlemSM). On the other hand, the results for precision vs. recall remain better.

Taking into account the possibility that the weight balance factor of 4 employed in the official run was not the most accurate for this set of queries, we have tried different values in the range 1 to 12, as shown in Table 4. The scheme of these extra experiments is the same as that followed during the training phase (see Table 1). The first column, lem, shows the results for lemmatization, i.e. TDNlemSM, whereas the sdi columns contain the results obtained using syntactic dependencies with a weight balance factor of i. Recall that sd4 shows the results for the official run TDNpdsSM, because it was created using a balance factor of 4. The column opt shows the best results obtained with sd and the last column (Δ) shows the improvement of opt with respect to lem.
Table 4: CLEF 2003: re-tuning the system a posteriori

                            lem    sd1    sd2    sd3    sd4    sd5    sd6    sd7    sd8    sd9    sd10   sd11   sd12   opt    Δ
Documents                   57k    57k    57k    57k    57k    57k    57k    57k    57k    57k    57k    57k    57k    --     --
Relevant (2368 expected)    2221   2218   2241   2243   2249   2244   2244   2245   2244   2243   2243   2242   2240   2249   28
R-precision                 .4453  .4121  .4581  .4637  .4684  .4540  .4503  .4490  .4491  .4487  .4493  .4502  .4496  .4684  .0031
Non-interpolated precision  .4684  .4132  .4481  .4627  .4698  .4664  .4683  .4689  .4719  .4723  .4723  .4714  .4717  .4723  .0039
Document precision          .5438  .4664  .5163  .5329  .5408  .5438  .5456  .5471  .5481  .5483  .5485  .5481  .5484  .5485  .0047
Precision at 0.00 recall    .7790  .7926  .8151  .8049  .7897  .7819  .7822  .7806  .7852  .7817  .7820  .7814  .7898  .8151  .0361
Precision at 0.10 recall    .6982  .6645  .6873  .7185  .7165  .7027  .6999  .7030  .7141  .7127  .7127  .7123  .7161  .7185  .0203
Precision at 0.20 recall    .6331  .5803  .6237  .6365  .6570  .6542  .6521  .6509  .6473  .6457  .6431  .6390  .6374  .6570  .0239
Precision at 0.30 recall    .5738  .5288  .5711  .5919  .6044  .5903  .5934  .5928  .5924  .5913  .5904  .5863  .5865  .6044  .0306
Precision at 0.40 recall    .5388  .4864  .5309  .5495  .5562  .5442  .5460  .5459  .5431  .5448  .5444  .5428  .5417  .5562  .0174
Precision at 0.50 recall    .5003  .4376  .4770  .5008  .5092  .4966  .5013  .4992  .5070  .5081  .5088  .5072  .5067  .5092  .0089
Precision at 0.60 recall    .4457  .3698  .4194  .4308  .4391  .4382  .4416  .4416  .4479  .4498  .4476  .4477  .4469  .4498  .0041
Precision at 0.70 recall    .3987  .3137  .3497  .3665  .3780  .3844  .3898  .3921  .3947  .3959  .3972  .3970  .3966  .3972  -.0015
Precision at 0.80 recall    .3352  .2597  .2973  .3092  .3191  .3248  .3288  .3294  .3304  .3312  .3320  .3321  .3320  .3321  -.0031
Precision at 0.90 recall    .2292  .1950  .2168  .2208  .2248  .2268  .2301  .2316  .2328  .2330  .2307  .2308  .2310  .2330  .0008
Precision at 1.00 recall    .1472  .1317  .1508  .1534  .1525  .1501  .1499  .1499  .1505  .1506  .1491  .1490  .1491  .1534  .0062
Precision at 5 docs         .5930  .4947  .5333  .5684  .5684  .5719  .5789  .5860  .5930  .6000  .6000  .5930  .5895  .6000  .0070
Precision at 10 docs        .5018  .4421  .4877  .4912  .4965  .5035  .5053  .5105  .5070  .5070  .5105  .5088  .5088  .5105  .0087
Precision at 15 docs        .4515  .3895  .4421  .4503  .4573  .4526  .4538  .4538  .4538  .4526  .4526  .4503  .4526  .4573  .0018
Precision at 20 docs        .4281  .3640  .4009  .4140  .4202  .4211  .4211  .4219  .4237  .4237  .4246  .4246  .4254  .4254  .0027
Precision at 30 docs        .3784  .3234  .3509  .3667  .3678  .3737  .3754  .3772  .3778  .3807  .3789  .3813  .3819  .3819  .0035
Precision at 100 docs       .2316  .2053  .2221  .2289  .2305  .2330  .2335  .2342  .2340  .2337  .2333  .2332  .2333  .2342  .0026
Precision at 200 docs       .1455  .1367  .1430  .1448  .1458  .1459  .1466  .1465  .1466  .1464  .1463  .1463  .1463  .1466  .0011
Precision at 500 docs       .0718  .0691  .0712  .0716  .0719  .0719  .0720  .0721  .0721  .0721  .0721  .0721  .0720  .0721  .0003
Precision at 1000 docs      .0390  .0389  .0393  .0394  .0395  .0394  .0394  .0394  .0394  .0394  .0394  .0393  .0393  .0395  .0005

The results obtained make it even more difficult to choose a balance factor, since the degree of improvement with respect to lemmatization changes according to the balance factor. It could be considered that the best balance factor for global measures is 10 (sd10), since it obtains the best non-interpolated and document precision, very good recall (2243 relevant documents retrieved, against 2221 for lem and 2249 for opt), and a slight improvement for R-precision. From a ranking point of view, our official run, sd4 (i.e. TDNpdsSM), is the best compromise with regard to precision vs. recall, since it obtains the best results in the range 0.20-0.60, and very good results in the range 0.00-0.20.
Nevertheless, its results for precision at N documents retrieved are not very good, since there is no improvement with respect to lem, which was our goal. In this case, sd10 again shows itself to be the best option, since it obtains the best compromise for the top 15 documents retrieved; however, the improvement achieved is smaller than that obtained during the training phase (see Table 1).

Acknowledgements

The research described in this paper has been supported in part by Ministerio de Ciencia y Tecnología (TIC2000-0370-C02-01, HP2001-0044 and HF2002-81), FPU grants of Secretaría de Estado de Educación y Universidades, Xunta de Galicia (PGIDT01PXI10506PN, PGIDIT02PXIB30501PR and PGIDIT02SIN01E) and Universidade da Coruña. The authors would also like to thank Darrin Dimmick, from NIST, for giving us the opportunity to use the ZPrise system, and Fernando Martínez, from Universidad de Jaén, for helping us to make it operative.

References

[1] S. Abney. Partial parsing via finite-state cascades. Natural Language Engineering, 2(4):337–344, 1997.

[2] A. Arampatzis, T. van der Weide, C. Koster, and P. van Bommel. Linguistically motivated information retrieval. In Encyclopedia of Library and Information Science. Marcel Dekker, Inc., New York and Basel, 2000.

[3] Fco. Mario Barcala, Jesús Vilares, Miguel A. Alonso, Jorge Graña, and Manuel Vilares. Tokenization and proper noun recognition for information retrieval. In A Min Tjoa and Roland R. Wagner, editors, Thirteenth International Workshop on Database and Expert Systems Applications, 2-6 September 2002, Aix-en-Provence, France, pages 246–250, Los Alamitos, California, USA, September 2002. IEEE Computer Society Press.

[4] Thorsten Brants. TnT - a statistical part-of-speech tagger. In Proceedings of the Sixth Applied Natural Language Processing Conference (ANLP'2000), Seattle, 2000.

[5] C. Buckley. Implementation of the SMART information retrieval system. Technical report, Department of Computer Science, Cornell University, 1985. Source code available at ftp://ftp.cs.cornell.edu/pub/smart.

[6] Jorge Graña. Técnicas de Análisis Sintáctico Robusto para la Etiquetación del Lenguaje Natural. PhD thesis, University of La Coruña, La Coruña, Spain, 2000.

[7] Jorge Graña, Miguel A. Alonso, and Manuel Vilares. A common solution for tokenization and part-of-speech tagging: One-pass Viterbi algorithm vs. iterative approaches. In Petr Sojka, Ivan Kopecek, and Karel Pala, editors, Text, Speech and Dialogue, volume 2448 of Lecture Notes in Computer Science, pages 3–10. Springer-Verlag, Berlin-Heidelberg-New York, 2002.

[8] Jorge Graña, Fco. Mario Barcala, and Miguel A. Alonso. Compilation methods of minimal acyclic automata for large dictionaries. In Bruce W. Watson and Derick Wood, editors, Implementation and Application of Automata, volume 2494 of Lecture Notes in Computer Science, pages 135–148. Springer-Verlag, Berlin-Heidelberg-New York, 2002.

[9] Jorge Graña, Fco. Mario Barcala, and Jesús Vilares. Formal methods of tokenization for part-of-speech tagging. In Alexander Gelbukh, editor, Computational Linguistics and Intelligent Text Processing, volume 2276 of Lecture Notes in Computer Science, pages 240–249. Springer-Verlag, Berlin-Heidelberg-New York, 2002.

[10] Jorge Graña, Jean-Cédric Chappelier, and Manuel Vilares. Integrating external dictionaries into stochastic part-of-speech taggers. In Proceedings of the Euroconference Recent Advances in Natural Language Processing (RANLP 2001), pages 122–128, Tzigov Chark, Bulgaria, 2001.
[11] D. A. Hull, G. Grefenstette, B. M. Schulze, E. Gaussier, H. Schütze, and J. O. Pedersen. Xerox TREC-5 site report: routing, filtering, NLP, and Spanish tracks. In Proceedings of the Fifth Text REtrieval Conference (TREC-5), pages 167–180, 1997.

[12] Christian Jacquemin and Evelyne Tzoukermann. NLP for term variant extraction: synergy between morphology, lexicon and syntax. In Tomek Strzalkowski, editor, Natural Language Information Retrieval, volume 7 of Text, Speech and Language Technology, pages 25–74. Kluwer Academic Publishers, Dordrecht/Boston/London, 1999.

[13] C. Peters, editor. Results of the CLEF 2002 Cross-Language System Evaluation Campaign, Working Notes for the CLEF 2002 Workshop, Rome, Italy, September 2002. Official site of CLEF: http://www.clef-campaign.org

[14] Jose Perez-Carballo and Tomek Strzalkowski. Natural language information retrieval: progress report. Information Processing and Management, 36(1):155–178, 2000.

[15] Okapi/Keenbow at TREC-8. In E. Voorhees and D. K. Harman, editors, Proceedings of the Eighth Text REtrieval Conference (TREC-8), NIST Special Publication 500-264, pages 151–161, 2000.

[16] J. Savoy. Report on CLEF-2002 experiments: Combining multiple sources of evidence. In [13], pages 31–46.

[17] J. Savoy, A. Le Calve, and D. Vrajitoru. Report on the TREC-5 experiment: Data fusion and collection fusion. In Proceedings of the Fifth Text REtrieval Conference (TREC-5), NIST Special Publication 500-238, pages 489–502, Gaithersburg, MD, 1997.

[18] Jesús Vilares, Miguel A. Alonso, Francisco J. Ribadas, and Manuel Vilares. COLE experiments at CLEF 2002 Spanish monolingual track. In C. Peters, M. Braschler, J. Gonzalo, and M. Kluck, editors, Advances in Cross-Language Information Retrieval: Results of the CLEF 2002 Evaluation Campaign, volume 2785 of Lecture Notes in Computer Science. Springer-Verlag, Berlin-Heidelberg-New York, 2003.

[19] Jesús Vilares, Fco. Mario Barcala, and Miguel A. Alonso. Using syntactic dependency-pairs conflation to improve retrieval performance in Spanish. In Alexander Gelbukh, editor, Computational Linguistics and Intelligent Text Processing, volume 2276 of Lecture Notes in Computer Science, pages 381–390. Springer-Verlag, Berlin-Heidelberg-New York, 2002.

[20] Jesús Vilares, David Cabrero, and Miguel A. Alonso. Applying productive derivational morphology to term indexing of Spanish texts. In Alexander Gelbukh, editor, Computational Linguistics and Intelligent Text Processing, volume 2004 of Lecture Notes in Computer Science, pages 336–348. Springer-Verlag, Berlin-Heidelberg-New York, 2001.