
COLE experiments at CLEF 2002 Spanish monolingual track ∗

Miguel A. Alonso, Jesús Vilares
Departamento de Computación, Universidade da Coruña
Campus de Elviña s/n
alonso@udc.es

Francisco J. Ribadas
Escuela Superior de Ingeniería Informática, Universidade de Vigo
Campus As Lagoas s/n
ribadas@ei.uvigo.es

2002

In this, our first participation in CLEF, we have applied Natural Language Processing techniques for single-word and multi-word term conflation. We have tested several approaches at different levels of text processing in our experiments: firstly, we have lemmatized the text to avoid inflectional variation; secondly, we have expanded the queries through synonyms according to a fixed threshold of similarity; and thirdly, we have tested a mixed approach based on the use of productive derivational morphology to solve derivational variation and of syntactic dependencies to deal with the syntactic content of the document.

Introduction

with derivational morphology.

In this process, the first step consists of tagging the document. Document processing starts by applying our linguistically motivated preprocessor module [8, 2], which performs tasks such as format conversion, tokenization, sentence segmentation, morphological pretagging, contraction splitting, separation of enclitic pronouns from verbal stems, expression identification, numeral identification and proper noun recognition. It is worth remarking that classical techniques do not deal with many of these phenomena, resulting in wrong simplifications during the conflation process.

The output of the preprocessor is taken as input by the tagger-lemmatizer. Although any kind of tagger could be applied, in our system we have used a second-order Markov model for part-of-speech tagging. The elements of the model and the procedures to estimate its parameters are based on Brants' work [3], incorporating information from external dictionaries [9] which are implemented by means of numbered minimal acyclic finite-state automata [7].

Once text has been tagged, the lemmas of the content words (nouns, verbs, adjectives) are extracted to be indexed. In this way we are solving the problems derived from inflection in Spanish and, as a result, recall is increased. With regard to computational cost, the running cost of a lemmatizer-disambiguator is linear in relation to the length of the word, and cubic in relation to the size of the tagset, which is a constant. As we only need to know the grammatical category of the word, the tagset is small and therefore the increase in cost with respect to classical approaches (stemmers) becomes negligible.

Now that inflectional variation has been solved, the next logical step is to solve the problems caused by derivational morphology. Spanish shows great productivity and flexibility in its word formation mechanisms, using a rich and complex productive morphology and preferring derivation to other mechanisms of word formation. We have considered the derivational morphemes, the allomorphic variants of such morphemes and the phonological conditions they must satisfy, to automatically generate the set of morphological families from a large lexicon of Spanish words [18]. The resulting morphological families can be used as a kind of advanced and linguistically motivated stemmer for Spanish, where every lemma is substituted by a fixed representative of its morphological family. Since the set of morphological families is generated statically, there is no increase in the running cost.
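As a minimal sketch of this kind of conflation, the following Python fragment replaces each lemma by a fixed representative of its family through a lookup table built once, ahead of time. The two families, the choice of the first member as representative, and the names `FAMILIES`, `REPRESENTATIVE` and `conflate` are hypothetical illustrations, not the families or the code actually generated from the lexicon.

```python
from typing import Dict, List

# Hypothetical toy families; the real ones are generated from a large lexicon.
FAMILIES: List[List[str]] = [
    ["caer", "caída"],
    ["vender", "venta", "vendedor"],
]

# Built once (statically): every lemma points to the first member of its family.
REPRESENTATIVE: Dict[str, str] = {
    lemma: family[0] for family in FAMILIES for lemma in family
}


def conflate(lemmas: List[str]) -> List[str]:
    """Replace each lemma by its family representative; unknown lemmas are kept."""
    return [REPRESENTATIVE.get(lemma, lemma) for lemma in lemmas]


print(conflate(["caída", "venta", "inusual"]))
```

Because the table is precomputed, conflating a lemma at indexing time is a single dictionary lookup, which is consistent with the remark that no running cost is added.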

Using synonymy to expand queries

The use of synonymy relations in the task of automatic query expansion is not a new subject, but the approaches presented until now do not assign a weight to the degree of synonymy that exists between the original terms present in the query and those produced by the process of expansion [10]. Nevertheless, our system does have access to this information, so a threshold of synonymy can be set in order to control the degree of query expansion.

The most frequent definition of synonymy conceives it as a relation between two expressions with identical or similar meaning. Whether synonymy should be understood as a precise matter or as an approximate one, i.e. as a question of identity or as a question of similarity, has been debated since the beginning of the study of this semantic relation. In our system, synonymy is understood as a gradual relation between words. In order to calculate the degree of synonymy, we use Jaccard's coefficient as the measure of similarity, applied to the sets of synonyms provided by a dictionary of synonyms for each of its entries [5]. Given two sets $X$ and $Y$, their similarity is measured as:

$$sm(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}$$

Let us consider a word $w$ with possible meanings $m_i$, and another word $w'$ with possible meanings $m_j$, where $dc(w, m_i)$ is the function that gives the set of synonyms provided by the dictionary for the entry $w$ in the concrete meaning $m_i$. The degree of synonymy of $w$ and $w'$ in the meaning $m_i$ of $w$ is calculated as

$$dg(w, m_i, w') = \max_j \, sm[dc(w, m_i), dc(w', m_j)].$$

Furthermore, by calculating

$$k = \arg\max_j \, sm[dc(w, m_i), dc(w', m_j)]$$

we obtain in $m_k$ the meaning of $w'$ closest to the meaning $m_i$ of $w$.
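The computation of the degree of synonymy can be sketched in a few lines of Python. The toy dictionary below (English words, two meanings each) is hypothetical, and so is the assumption that each synonym set contains its own headword; only the functions `sm` and `dg` mirror the definitions given in the text.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy dictionary: word -> one synonym set per meaning.
Dictionary = Dict[str, List[Set[str]]]

DC: Dictionary = {
    "drop": [{"drop", "fall", "decline"}, {"drop", "bead", "droplet"}],
    "fall": [{"fall", "drop", "decline", "descent"}, {"fall", "autumn"}],
}


def sm(x: Set[str], y: Set[str]) -> float:
    """Jaccard coefficient |X ∩ Y| / |X ∪ Y| (0.0 for two empty sets)."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def dg(dc: Dictionary, w: str, i: int, w2: str) -> Tuple[float, int]:
    """Degree of synonymy of w (in its meaning i) and w2, plus the index k
    of the closest meaning of w2; (0.0, -1) if w2 has no entry."""
    scores = [sm(dc[w][i], synonyms) for synonyms in dc.get(w2, [])]
    if not scores:
        return 0.0, -1
    k = max(range(len(scores)), key=scores.__getitem__)
    return scores[k], k


print(dg(DC, "drop", 0, "fall"))  # (0.75, 0)
```

In this toy case the first meaning of "drop" shares three of four distinct synonyms with the first meaning of "fall", so the degree is 0.75; with a threshold of 0.80 the pair would not be treated as synonyms.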

Extracting dependencies between words by means of a shallow parser

Our system is not only able to process the content of the document at word level, it can also process its syntactic structure. For this purpose, a parser module obtains from the tagged document the head-modifier pairs corresponding to the most relevant syntactic dependencies: noun-modifier, relating the head of a noun phrase with the head of a modifier; subject-verb, relating the head of the subject with the main verb of the clause; and verb-complement, relating the main verb of the clause with the head of a complement.

The kernel of the grammar used by this shallow parser is inferred from the basic trees corresponding to noun phrases¹ and their syntactic and morpho-syntactic variants [11, 17]:

• Syntactic variants result from the inflection of individual words and from modifying the syntactic structure of the original noun phrase by means of:

  – Synapsy: a change of preposition or the addition or removal of a determiner.
    una caída de ventas (a drop in sales)

  – Substitution: the use of modifiers to make a term more specific.
    una caída inusual de ventas (an unusual drop in sales)

  – Permutation: the permutation of words around a pivot element.
    una inusual caída de ventas (an unusual drop in sales)

  – Coordination: the use of coordinating constructions (copulative or disjunctive) with the modifier or with the modified term.
    una inusual caída de ventas y de beneficios (an unusual drop in sales and profits)

• Morpho-syntactic variants differ from syntactic variants in that at least one of the content words of the original noun phrase is transformed into another word derived from the same morphological stem.
    las ventas han caído (sales have dropped)

We must remark that syntactic variants involve inflectional morphology but not derivational morphology, whereas morpho-syntactic variants involve both inflectional and derivational morphology. In addition, syntactic variants have a very restricted scope (the noun phrase) whereas morpho-syntactic variants can span a whole sentence, including a verb and its complements.

Once the basic trees of noun phrases and their variants have been established, they are compiled into a set of regular expressions, which are matched against the tagged document in order to extract its dependencies in the form of pairs. These pairs are used as index terms after conflating their components through morphological families, as described in [17]. In this way, we identify dependency pairs through simple pattern matching over the output of the tagger-lemmatizer, solving the problem by means of finite-state techniques, which leads to a considerable reduction of the running cost.
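The idea of extracting pairs by pattern matching over tagged text can be illustrated with a small Python sketch. Everything specific here is an assumption made for illustration: the one-letter tags (D determiner, A adjective, N noun, P preposition), the three patterns, and the function name `dependency_pairs`. The sketch covers only noun-modifier pairs and keeps plain lemmas rather than family representatives; it is not the grammar compiled by the actual system.

```python
import re
from typing import List, Tuple

Token = Tuple[str, str]  # (lemma, one-letter tag)

# (pattern over the string of tags, head group, modifier group).
# Lookaheads make every match zero-width, so overlapping pairs are all found.
PATTERNS = [
    (re.compile(r"(?=(N)(A))"), 1, 2),        # noun + adjective
    (re.compile(r"(?=(A)(N))"), 2, 1),        # adjective + noun
    (re.compile(r"(?=(N)A?PD?(N))"), 1, 2),   # noun (+ adj) + prep (+ det) + noun
]


def dependency_pairs(tokens: List[Token]) -> List[Tuple[str, str]]:
    """Return (head lemma, modifier lemma) pairs found by the toy patterns."""
    tags = "".join(tag for _, tag in tokens)  # one character per token
    pairs = []
    for pattern, head, modifier in PATTERNS:
        for match in pattern.finditer(tags):
            pairs.append((tokens[match.start(head)][0],
                          tokens[match.start(modifier)][0]))
    return pairs


permuted = [("un", "D"), ("inusual", "A"), ("caída", "N"), ("de", "P"), ("venta", "N")]
substituted = [("un", "D"), ("caída", "N"), ("inusual", "A"), ("de", "P"), ("venta", "N")]
print(sorted(dependency_pairs(permuted)) == sorted(dependency_pairs(substituted)))
```

Both word orders of the example noun phrase yield the same two pairs, (caída, inusual) and (caída, venta), which is the effect that makes such pairs usable as index terms across syntactic variants.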

Non-official experiments with CLEF 2001 queries

The Spanish corpus was incorporated in CLEF 2001 [16], but the techniques proposed in this paper have been integrated very recently and so we could not participate in that edition. Nevertheless, we consider it interesting to present the results of some non-official experiments performed with the set of queries of CLEF 2001².

The Spanish CLEF corpus is formed by 215,738 documents corresponding to the news provided by EFE, a Spanish news agency, in 1994. Documents are formatted in SGML, with a total size of 509 Megabytes. After deleting SGML tags, the size of the text corpus is reduced to 438 Megabytes. Each query consists of three fields: a brief title statement, a one-sentence description, and a more complex narrative specifying the relevance assessment criteria. In these experiments, we have employed the three fields to build the final query submitted to the system. For linguistically motivated indexing techniques, the terms contained in the title section are given twice the importance of those in the description and narrative.

The techniques proposed in this article are independent of the indexing engine we choose to use. This is because we first conflate the document to obtain its index terms; then, the engine receives the conflated version of the document as input. So, any standard text indexing engine may be employed, which is a great advantage. Nevertheless, each engine will behave according to its own characteristics³ [19]. The results we show here have been obtained with SMART, using the ltc-lnc weighting scheme [4], without relevance feedback.

We have compared the results obtained by four different indexing methods:

¹ At this point we will take as example the noun phrase una caída de las ventas (a drop in the sales).

² We have also tested some of the techniques proposed in this article over our own, non-standard, corpus, formed by 21,899 news articles (national, international, economy, culture, ...). Results are reported in [19].

³ Indexing model, ranking algorithm, etc.

• Stemming text after eliminating stopwords (stm). In order to apply this technique, we have tested several stemmers for Spanish. The best results were obtained with the stemmer used by the open source search engine Muscat⁴, based on Porter's algorithm [1].

• Conflation of content words via lemmatization (lem), i.e. each form of a content word is replaced by its lemma. This kind of conflation takes into account only inflectional morphology.

• Conflation of content words by means of morphological families (fam), i.e. each form of a content word is replaced by the representative of its morphological family. This kind of conflation takes into account both inflectional and derivational morphology.

• Text conflated by means of the combined use of morphological families and syntactic dependency pairs (f-sdp).

The methods lem, fam, and f-sdp are linguistically motivated. Therefore, they are able to deal with some complex linguistic phenomena such as clitic pronouns, contractions, idioms, and proper name recognition. In contrast, the method stm works simply by removing a given set of suffixes, without taking into account such linguistic phenomena, yielding incorrect conflations that introduce noise in the system. For example, clitic pronouns are simply considered a set of suffixes to be removed. Moreover, the use of finite-state techniques in the implementation of our methods lets us reduce their computational cost, making their application possible in practical environments.

Table 1 shows the statistics of the terms that compose the corpus. The first and second rows show the total number of terms and unique terms obtained for the indexed documents, respectively, both for the source text and for the different conflated texts. Table 2 shows performance measures as defined in the standard trec_eval program. The monolingual Spanish task in 2001 considered a set of 50 queries, but for one query no relevant document exists in the corpus, and so the performance measures are computed over 49 queries. Table 3 shows in its left part the precision attained at the 11 standard recall levels. We can observe that linguistically motivated indexing techniques beat stm at low levels of recall. This means that more highly relevant documents are placed in the top part of the ranking list when these techniques are applied. As a complement, the right part of Table 3 shows the precision computed at N seen documents.
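For readers unfamiliar with these measures, the following Python sketch computes, for a single query, the interpolated precision at the 11 standard recall levels (0.0, 0.1, ..., 1.0) and the precision at N seen documents. It follows the usual textbook definitions and is not the trec_eval source; the function names and the toy ranking are hypothetical.

```python
from typing import List, Set


def eleven_point_precision(ranking: List[str], relevant: Set[str]) -> List[float]:
    """Interpolated precision at recall 0.0, 0.1, ..., 1.0 for one query.

    The value at each level is the highest precision observed at any
    recall greater than or equal to that level (0.0 if never reached).
    """
    if not relevant:
        raise ValueError("queries without relevant documents are excluded")
    points = []  # (recall, precision) after each relevant document retrieved
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    levels = [i / 10 for i in range(11)]
    return [max((p for r, p in points if r >= level), default=0.0)
            for level in levels]


def precision_at(ranking: List[str], relevant: Set[str], n: int) -> float:
    """Precision after n seen documents."""
    return sum(doc in relevant for doc in ranking[:n]) / n


ranking = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d4"}
print(eleven_point_precision(ranking, relevant))
print(precision_at(ranking, relevant, 2))
```

A method that places relevant documents near the top of the ranking obtains high values at the low recall levels and at small N, which is the behaviour highlighted in the paragraph above.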

The results of our experiments seem to be consistent with the results obtained for English and other Germanic languages by other IR systems based on NLP techniques [12, 13, 14, 15]. As in [14], syntax does not improve average precision, but it is the best technique at low levels of recall. A similar conclusion can be drawn from the work of [12] on Dutch texts, where syntactic methods only beat statistical ones at low levels of recall. Our results with respect to syntactic dependency pairs seem to be better than those of Perez-Carballo and Strzalkowski [15]. It is difficult to know if this improvement is due to a more accurate extraction of pairs or due to differences between Spanish and English constructions.

⁴ Currently, Muscat is not an open source project, and the web site http://open.muscat.com used to download the stemmer is not operating. Information about a similar stemmer for Spanish (and other European languages) can be found at http://snowball.sourceforge.net/spanish/stemmer.html.

Experiments with CLEF 2002 queries

The uppercase-to-lowercase module

An important characteristic of IR test collections that may have a considerable impact on the performance of linguistically motivated indexing techniques is the large number of typographical errors present in documents, as has been reported, in the case of the Spanish CLEF corpus, by [6]. In particular, titles of news and subsections are generally written in capital letters without accents. We must take into account that these titles are usually very indicative of the topic of the document.

For the CLEF 2002 experiments, we have incorporated an uppercase-to-lowercase module into our system to process uppercase sentences, converting them to lowercase and restoring their diacritics when necessary. Other approaches, such as [20], deal with documents where absolutely all diacritics have been eliminated. Our situation is different, because the main body of the document is written in lowercase and preserves its diacritics; only some sentences are written in capital letters. Moreover, for our purposes we only need the grammatical category and lemma of the word, not the form.

So, we can employ the lexical context of an uppercase sentence, both forms and lemmas, to recover this lost information. The first step of this process is to identify the uppercase phrases. We consider that a sequence of words forms an uppercase phrase when it consists of three or more words written in capital letters and at least three of them have more than three characters. For each of these uppercase phrases we do the following:

1. We obtain its surrounding context.

2. For each of the words in the phrase:

   (a) We examine the context looking for entries with the same flattened form⁵. Each of these words becomes a candidate.

   (b) If candidates are found, the most numerous is chosen and, in case of a tie, the closest to the phrase is chosen.

   (c) If no candidates are found, the lexicon is examined:

       i. We obtain from the lexicon all entries with the same flattened form, grouping them according to their category and lemma (we are not interested in the form, just in the category and the lemma of the word).

       ii. If no entries are found, we keep the current tag and lemma.

       iii. If only one entry is found, we choose that one.

       iv. If more than one entry is found, we choose the most numerous in the context (according to the category and the lemma). Again, in case of a tie, we choose the closest to the sentence.

Sometimes, some words of the uppercase phrase preserve some of their diacritics, for example the tilde of the Ñ. In these situations, the candidates from the context or the lexicon must observe this restriction.

⁵ That is, after both words have been converted to lowercase and all diacritics have been eliminated from them.
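The first steps of this procedure can be sketched in Python as follows. This is a simplified illustration under several assumptions: candidates are plain word forms rather than category and lemma pairs, the context is given as hypothetical (form, distance) tuples, the lexicon fallback of step 2(c) and the preserved-diacritic restriction are omitted, and the names `flatten`, `is_uppercase_phrase` and `choose_from_context` are invented for the example.

```python
import unicodedata
from collections import Counter
from typing import List, Optional, Tuple


def flatten(word: str) -> str:
    """Flattened form: lowercase with all diacritics removed."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def is_uppercase_phrase(words: List[str]) -> bool:
    """Three or more uppercase words, at least three longer than three characters."""
    if len(words) < 3 or not all(w.isupper() for w in words):
        return False
    return sum(len(w) > 3 for w in words) >= 3


def choose_from_context(word: str, context: List[Tuple[str, int]]) -> Optional[str]:
    """Pick the most frequent context form with the same flattened form;
    ties are broken by the smallest distance to the phrase."""
    flat = flatten(word)
    candidates = [(form.lower(), dist) for form, dist in context
                  if flatten(form) == flat]
    if not candidates:
        return None  # the real module would now consult the lexicon
    counts = Counter(form for form, _ in candidates)
    nearest = {}
    for form, dist in candidates:
        nearest[form] = min(dist, nearest.get(form, dist))
    return max(counts, key=lambda form: (counts[form], -nearest[form]))


print(is_uppercase_phrase(["CAIDA", "INUSUAL", "DE", "VENTAS"]))
print(choose_from_context("CAIDA", [("caída", 2), ("caída", 5), ("caida", 1)]))
```

Note that `flatten` also maps ñ to n, in line with the footnote's definition of the flattened form; a fuller version would filter candidates so that any diacritic preserved in the uppercase word is respected.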

[Table: performance measures reported for each method: documents retrieved; relevant documents retrieved (2854 expected); R-precision; average precision per query; average precision per relevant document; 11-point average precision.]

• TDlem: Conflation of content words via lemmatization, i.e. each form of a content word is replaced by its lemma. This kind of conflation takes into account only inflectional morphology. The query is formed by the set of meaningful lemmas present in the title and description.

• TDNlem: The same as before, but the query also includes the set of meaningful lemmas obtained from the narrative. Both this method and the previous one correspond to the lem indexing method referred to in Section 5.

• TDNsyn: Conflation of content words via lemmatization and expansion of queries by means of synonymy. We have considered that two words are synonyms if their similarity measure is greater than or equal to 0.80. The query is formed by the set of meaningful lemmas present in the title, description and narrative, but only the title and description fields of each query have been expanded using synonyms.

• TDNpds: Text conflated by means of the combined use of morphological families and syntactic dependency pairs. The query is formed by the union of the set of representatives of the morphological families corresponding to the content words and the set of dependency pairs extracted from the title, description and narrative fields. It corresponds to the f-sdp indexing method referred to in Section 5.

Except for the first method, the terms extracted from the title section are given twice the importance of those from the description and narrative.
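One possible way to assemble such a query is sketched below in Python. How the weighting and the expansion are actually realized is not specified in the text, so the numeric boosts, the weight of 1.0 given to expansion terms, the synonym degrees in the example, and the function name `build_query` are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Optional


def build_query(title: List[str], description: List[str], narrative: List[str],
                synonyms: Optional[Dict[str, Dict[str, float]]] = None,
                threshold: float = 0.80) -> Counter:
    """Weight lemmas by field (title counts double) and optionally expand
    title and description lemmas with synonyms whose degree >= threshold."""
    weights: Counter = Counter()
    for lemmas, boost in ((title, 2.0), (description, 1.0), (narrative, 1.0)):
        for lemma in lemmas:
            weights[lemma] += boost
    if synonyms:
        # Only title and description are expanded, as in the TDNsyn method.
        for lemma in set(title) | set(description):
            for synonym, degree in synonyms.get(lemma, {}).items():
                if degree >= threshold:
                    weights[synonym] += 1.0  # assumed weight for expansion terms
    return weights


# Hypothetical degrees of synonymy, for illustration only.
toy_synonyms = {"divorcio": {"separación": 0.85, "ruptura": 0.50}}
query = build_query(["divorcio"], ["estadística", "divorcio"], [], toy_synonyms)
print(dict(query))
```

With the 0.80 threshold only "separación" is added to the query, while "ruptura" is discarded; lowering the threshold widens the expansion and, as discussed next, may also introduce more noise.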

According to Tables 4 and 5, the lemmatization method (TDNlem) seems to be the best option. The expansion through synonymy (TDNsyn) does not improve the results obtained, perhaps because the expansion is total, that is, all synonyms of all terms of the query are employed, introducing too much noise. In the case of syntactic dependency pairs (TDNpds), the results are worse than for the CLEF 2001 queries. This may be simply due to the different set of queries employed, but after comparing the results of each particular query with those of lemmatization, it may be concluded that the more accurate the complex term is with respect to its constituent simple terms, the more the results improve, as in the case of estadísticas de divorcio (divorce statistics) in query 115.

These results, together with the previous ones obtained for the CLEF 2001 queries, suggest that mere lemmatization is a good starting point. It may be investigated whether this initial search should be followed by a relevance feedback process based on the expansion of the synonyms of the most relevant terms of the most relevant documents, in order to minimize the noise. Another alternative to study for postprocessing consists of reranking the results by means of syntactic information obtained in the form of syntactic dependency pairs.

[1] Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern information retrieval. Addison-Wesley, Harlow, England, 1999.

[2] Fco. Mario Barcala, Jesús Vilares, Miguel A. Alonso, Jorge Graña, and Manuel Vilares. Tokenization and proper noun recognition for information retrieval. In 3rd International Workshop on Natural Language and Information Systems (NLIS 2002), September 2-3, 2002, Aix-en-Provence, France. Los Alamitos, California, USA, September 2002. IEEE Computer Society Press.

[3] Thorsten Brants. TNT - a statistical part-of-speech tagger. In Proceedings of the Sixth Applied Natural Language Processing Conference (ANLP'2000), Seattle, 2000.

[4] Chris Buckley, James Allan, and Gerard Salton. Automatic routing and ad-hoc retrieval using SMART: TREC 2. In D. K. Harman, editor, NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2), pages 45-56, Gaithersburg, MD, USA, 1993.

[5] Santiago Fernández, Jorge Graña, and Alejandro Sobrino. A Spanish e-dictionary of synonyms as a fuzzy tool for information retrieval. In Actas de las I Jornadas de Tratamiento y Recuperación de Información (JOTRI 2002), León, Spain, September 2002.

[6] Carlos Figuerola, Raquel Gómez, Angel F. Zazo Rodríguez, and José Luis Alonso Berrocal. Stemming in Spanish: A first approach to its impact on information retrieval. In Carol Peters, editor, Working notes for the CLEF 2001 workshop, Darmstadt, Germany, September 2001.

[7] Jorge Graña, Fco. Mario Barcala, and Miguel A. Alonso. Compilation methods of minimal acyclic automata for large dictionaries. In Bruce W. Watson and Derick Wood, editors, Proc. of the 6th Conference on Implementations and Applications of Automata (CIAA 2001), pages 116-129, Pretoria, South Africa, July 2001.

[8] Jorge Graña, Fco. Mario Barcala, and Jesús Vilares. Formal methods of tokenization for part-of-speech tagging. In Alexander Gelbukh, editor, Computational Linguistics and Intelligent Text Processing, volume 2276 of Lecture Notes in Computer Science, pages 240-249. Springer-Verlag, Berlin-Heidelberg-New York, 2002.

[9] Jorge Graña, Jean-Cédric Chappelier, and Manuel Vilares. Integrating external dictionaries into stochastic part-of-speech taggers. In Proceedings of the Euroconference Recent Advances in Natural Language Processing (RANLP 2001), pages 122-128, Tzigov Chark, Bulgaria, 2001.

[10] Jane Greenberg. Automatic query expansion via lexical-semantic relationships. Journal of the American Society for Information Science and Technology, 52(5):402-415, 2001.

[11] Christian Jacquemin and Evelyne Tzoukermann. NLP for term variant extraction: synergy between morphology, lexicon and syntax. In Tomek Strzalkowski, editor, Natural Language Information Retrieval, volume 7 of Text, Speech and Language Technology, pages 25-74. Kluwer Academic Publishers, Dordrecht/Boston/London, 1999.

[12] Wessel Kraaij and Renée Pohlmann. Comparing the effect of syntactic vs. statistical phrase indexing strategies for Dutch. In Christos Nicolaou and Constantine Stephanidis, editors, Research and Advanced Technology for Digital Libraries, volume 1513 of Lecture Notes in Computer Science, pages 605-614. Springer-Verlag, Berlin/Heidelberg/New York, 1998.

[13] Byung-Kwan Kwak, Jee-Hyub Kim, Geunbae Lee, and Jung Yun Seo. Corpus-based learning of compound noun indexing. In J. Klavans and J. Gonzalo, editors, Proc. of the ACL'2000 workshop on Recent Advances in Natural Language Processing and Information Retrieval, Hong Kong, October 2000.

[14] Markus Mittendorfer and Werner Winiwarter. Exploiting syntactic analysis of queries for information retrieval. Data & Knowledge Engineering, 2002.

[15] Jose Perez-Carballo and Tomek Strzalkowski. Natural language information retrieval: progress report. Information Processing and Management, 36(1):155-178, 2000.

[16] Carol Peters, editor. Results of the CLEF 2001 Cross-Language System Evaluation Campaign. Working Notes for the CLEF 2001 Workshop, Darmstadt, Germany, September 2001.

[17] Jesús Vilares, Fco. Mario Barcala, and Miguel A. Alonso. Using syntactic dependency-pairs conflation to improve retrieval performance in Spanish. In Alexander Gelbukh, editor, Computational Linguistics and Intelligent Text Processing, volume 2276 of Lecture Notes in Computer Science, pages 381-390. Springer-Verlag, Berlin-Heidelberg-New York, 2002.

[18] Jesús Vilares, David Cabrero, and Miguel A. Alonso. Applying productive derivational morphology to term indexing of Spanish texts. In Alexander Gelbukh, editor, Computational Linguistics and Intelligent Text Processing, volume 2004 of Lecture Notes in Computer Science, pages 336-348. Springer-Verlag, Berlin-Heidelberg-New York, 2001.

[19] Jesús Vilares, Manuel Vilares, and Miguel A. Alonso. Towards the development of heuristics for automatic query expansion. In Heinrich C. Mayr, Jiri Lazansky, Gerald Quirchmayr, and Pavel Vogel, editors, Database and Expert Systems Applications, volume 2113 of Lecture Notes in Computer Science, pages 887-896. Springer-Verlag, Berlin-Heidelberg-New York, 2001.

[20] David Yarowsky. A comparison of corpus-based techniques for restoring accents in Spanish and French text. In Natural Language Processing Using Very Large Corpora, pages 99-120. Kluwer Academic Publishers, 1999.