<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>COLE experiments at CLEF 2002 Spanish monolingual track</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Miguel A. Alonso</string-name>
          <email>alonso@udc.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jesús Vilares</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francisco J. Ribadas</string-name>
          <email>ribadas@ei.uvigo.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Campus As Lagoas s/n</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Departamento de Computación</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Escuela Superior de Ingeniería Informática</institution>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Campus de Elviña s/n</institution>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Universidade da Coruña</institution>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>Universidade de Vigo</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2002</year>
      </pub-date>
      <abstract>
        <p>In this, our first participation in CLEF, we have applied Natural Language Processing techniques for single-word and multi-word term conflation. We have tested several approaches at different levels of text processing: firstly, we have lemmatized the text to avoid inflectional variation; secondly, we have expanded the queries through synonyms according to a fixed similarity threshold; and thirdly, we have tested a mixed approach that uses productive derivational morphology to deal with derivational variation and syntactic dependencies to deal with the syntactic content of the document.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>with derivational morphology.</p>
      <p>
        In this process, the first step consists of tagging the document. Document processing starts by applying our
linguistically-motivated preprocessor module [
        <xref ref-type="bibr" rid="ref2 ref8">8, 2</xref>
        ], performing tasks such as format conversion, tokenization,
sentence segmentation, morphological pretagging, contraction splitting, separation of enclitic pronouns from
verbal stems, expression identification, numeral identification and proper noun recognition. It is worth noting
that classical techniques do not deal with many of these phenomena, resulting in wrong simplifications during
the conflation process.
      </p>
      <p>
        The output of the preprocessor is taken as input by the tagger-lemmatizer. Although any kind of tagger could be
applied, in our system we have used a second-order Markov model for part-of-speech tagging. The elements of the
model and the procedures to estimate its parameters are based on Brants’ work [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], incorporating information from
external dictionaries [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] which are implemented by means of numbered minimal acyclic finite-state automata [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>Once text has been tagged, the lemmas of the content words (nouns, verbs, adjectives) are extracted to be
indexed. In this way we are solving the problems derived from inflection in Spanish and, as a result, recall is
increased. With regard to computational cost, the running cost of a lemmatizer-disambiguator is linear in relation
to the length of the word, and cubic in relation to the size of the tagset, which is a constant. As we only need to
know the grammatical category of the word, the tagset is small and therefore the increase in cost with respect to
classical approaches (stemmers) becomes negligible.</p>
      <p>
        Now that inflectional variation has been solved, the next logical step is to solve the problems caused by
derivational morphology. Spanish shows great productivity and flexibility in its word formation mechanisms, relying on
a rich and complex productive morphology and preferring derivation to other mechanisms of word formation. We
have considered the derivational morphemes, the allomorphic variants of such morphemes and the phonological
conditions they must satisfy, to automatically generate the set of morphological families from a large lexicon of
Spanish words [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. The resulting morphological families can be used as a kind of advanced and linguistically
motivated stemmer for Spanish, where every lemma is substituted by a fixed representative of its morphological
family. Since the set of morphological families is generated statically, there is no increase in the running cost.
      </p>
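      <p>The idea of a statically generated family table can be sketched as follows. This is only an illustration under stated assumptions: the families listed and the rule for choosing the representative (shortest lemma, then alphabetical order) are hypothetical, since the text only says that the representative is fixed. The point is that the table is built once, so conflation at indexing time reduces to a dictionary lookup.</p>
      <preformat preformat-type="code"><![CDATA[
```python
# Hypothetical morphological families; in the paper they are generated
# automatically from a large Spanish lexicon.
FAMILIES = [
    ("venta", "vender", "vendedor"),
    ("caer", "caída"),
]


def build_representatives(families):
    """Map every lemma to one fixed representative of its family.

    Assumption: the representative is the shortest lemma, with ties
    broken alphabetically. Any other fixed choice would work equally well.
    """
    table = {}
    for family in families:
        representative = min(family, key=lambda lemma: (len(lemma), lemma))
        for lemma in family:
            table[lemma] = representative
    return table


def conflate(lemmas, table):
    """Replace each lemma by its family representative (unknown lemmas are kept)."""
    return [table.get(lemma, lemma) for lemma in lemmas]


REPRESENTATIVES = build_representatives(FAMILIES)  # computed once, offline
```
]]></preformat>
      <p>With this toy table, the lemmas <italic>vendedor</italic> and <italic>caída</italic> are indexed as <italic>venta</italic> and <italic>caer</italic>, while a lemma outside any family is left unchanged.</p>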
    </sec>
    <sec id="sec-2">
      <title>Using synonymy to expand queries</title>
      <p>
        The use of synonymy relations in the task of automatic query expansion is not a new subject, but the approaches
presented until now do not assign a weight to the degree of synonymy that exists between the original terms
present in the query and those produced by the process of expansion [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Nevertheless, our system does have
access to this information, so a threshold of synonymy can be set in order to control the degree of query expansion.
      </p>
      <p>
        The most frequent definition of synonymy conceives it as a relation between two expressions with identical
or similar meaning. The controversy of understanding synonymy as a precise question or as an approximate
question, i.e. as a question of identity or as a question of similarity, has existed from the beginning of the study
of this semantic relation. In our system, synonymy is understood as a gradual relation between words. In order
to calculate the degree of synonymy, we use Jaccard’s coefficient as a measure of similarity applied to the sets
of synonyms provided by a dictionary of synonyms for each of its entries [
        <xref ref-type="bibr" rid="ref5">5</xref>
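        ]. Given two sets <inline-formula><tex-math>X</tex-math></inline-formula> and <inline-formula><tex-math>Y</tex-math></inline-formula>, their
similarity is measured as:
        <disp-formula><tex-math>sm(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}</tex-math></disp-formula>
Let us consider a word <inline-formula><tex-math>w</tex-math></inline-formula> with <inline-formula><tex-math>m_i</tex-math></inline-formula> possible meanings, and another word <inline-formula><tex-math>w'</tex-math></inline-formula> with <inline-formula><tex-math>m_j</tex-math></inline-formula> possible meanings,
where <inline-formula><tex-math>dc(w, m_i)</tex-math></inline-formula> represents the function that gives us the set of synonyms provided by the dictionary for
every entry <inline-formula><tex-math>w</tex-math></inline-formula> in the concrete meaning <inline-formula><tex-math>m_i</tex-math></inline-formula>. The degree of synonymy of <inline-formula><tex-math>w</tex-math></inline-formula> and <inline-formula><tex-math>w'</tex-math></inline-formula> in the meaning <inline-formula><tex-math>m_i</tex-math></inline-formula>
of <inline-formula><tex-math>w</tex-math></inline-formula> is calculated as:
        <disp-formula><tex-math>dg(w, m_i, w') = \max_j \, sm[dc(w, m_i), dc(w', m_j)]</tex-math></disp-formula>
Furthermore, by calculating
        <disp-formula><tex-math>k = \arg\max_j \, sm[dc(w, m_i), dc(w', m_j)]</tex-math></disp-formula>
we obtain in <inline-formula><tex-math>m_k</tex-math></inline-formula> the meaning of <inline-formula><tex-math>w'</tex-math></inline-formula> closest to the meaning <inline-formula><tex-math>m_i</tex-math></inline-formula>
of <inline-formula><tex-math>w</tex-math></inline-formula>.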
      </p>
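      <p>The computation of the degree of synonymy can be sketched as follows. This is a minimal illustration, not the implementation used in our system: the dictionary entries are invented (and given in English for readability), and we assume that each entry is stored as a list of synonym sets, one per meaning, so that dc(w, m_i) is a simple lookup.</p>
      <preformat preformat-type="code"><![CDATA[
```python
# Hypothetical synonym dictionary: word -> list of synonym sets, one per meaning.
TOY_DICTIONARY = {
    "car": [{"car", "auto", "automobile", "wagon"}],
    "wagon": [
        {"wagon", "cart", "carriage"},
        {"wagon", "car", "auto", "automobile"},
    ],
}


def jaccard(x, y):
    """sm(X, Y) = |X ∩ Y| / |X ∪ Y|; defined as 0.0 when both sets are empty."""
    x, y = set(x), set(y)
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def degree_of_synonymy(dictionary, w, i, w_prime):
    """Return (dg, k) for meaning i of w against every meaning of w_prime.

    dg is the maximum Jaccard similarity and k the index of the meaning
    of w_prime that attains it. Returns (0.0, None) if w_prime has no entry.
    """
    target = dictionary[w][i]
    scores = [jaccard(target, meaning) for meaning in dictionary.get(w_prime, [])]
    if not scores:
        return 0.0, None
    k = max(range(len(scores)), key=scores.__getitem__)
    return scores[k], k
```
]]></preformat>
      <p>In this toy dictionary, the only meaning of "car" shares one of six distinct words with the first meaning of "wagon" and all four words with the second, so the second meaning is selected with a degree of synonymy of 1. A threshold on the returned degree then decides whether the word is used for expansion.</p>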
    </sec>
    <sec id="sec-3">
      <title>Extracting dependencies between words by means of a shallow parser</title>
      <p>Our system is not only able to process the content of the document at word level, it can also process its syntactic
structure. For this purpose, a parser module obtains from the tagged document the head-modifier pairs
corresponding to the most relevant syntactic dependencies: noun-modifier, relating the head of a noun phrase with the head of
a modifier; subject-verb, relating the head of the subject with the main verb of the clause; and verb-complement,
relating the main verb of the clause with the head of a complement.</p>
      <p>
        The kernel of the grammar used by this shallow parser is inferred from the basic trees corresponding to noun
phrases<sup>1</sup> and their syntactic and morpho-syntactic variants [
        <xref ref-type="bibr" rid="ref11 ref17">11, 17</xref>
        ]:
      </p>
      <list list-type="bullet">
        <list-item>
          <p>Syntactic variants result from the inflection of individual words and from modifying the syntactic structure of the original noun phrase by means of:</p>
          <list list-type="bullet">
            <list-item>
              <p>Synapsy: this corresponds to a change of preposition or the addition or removal of a determiner, e.g. <italic>una caída de ventas</italic> (a drop in sales).</p>
            </list-item>
            <list-item>
              <p>Substitution: this consists of employing modifiers to make a term more specific, e.g. <italic>una caída inusual de ventas</italic> (an unusual drop in sales).</p>
            </list-item>
            <list-item>
              <p>Permutation: this refers to the permutation of words around a pivot element, e.g. <italic>una inusual caída de ventas</italic> (an unusual drop in sales).</p>
            </list-item>
            <list-item>
              <p>Coordination: this consists of employing coordinating constructions (copulative or disjunctive) with the modifier or with the modified term, e.g. <italic>una inusual caída de ventas y de beneficios</italic> (an unusual drop in sales and profits).</p>
            </list-item>
          </list>
        </list-item>
        <list-item>
          <p>Morpho-syntactic variants differ from syntactic variants in that at least one of the content words of the original noun phrase is transformed into another word derived from the same morphological stem, e.g. <italic>las ventas han caído</italic> (sales have dropped).</p>
        </list-item>
      </list>
      <p>We must remark that syntactic variants involve inflectional morphology but not derivational morphology,
whereas morpho-syntactic variants involve both inflectional and derivational morphology. In addition,
syntactic variants have a very restricted scope (the noun phrase) whereas morpho-syntactic variants can span a whole
sentence, including a verb and its complements.</p>
      <p>
        Once the basic trees of noun phrases and their variants have been established, they are compiled into a set of
regular expressions, which are matched against the tagged document in order to extract its dependencies in the
form of pairs which are used as index terms after conflating their components through morphological families, as
is described in [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. In this way, we are identifying dependency pairs through simple pattern matching over the
output of the tagger-lemmatizer, solving the problem by means of finite-state techniques, leading to a considerable
reduction of the running cost.
      </p>
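      <p>The extraction of pairs by pattern matching over the tagger output can be illustrated with the sketch below. It is not the grammar used in our system: the one-letter tagset, the string encoding of the tagged text and the single pattern shown (a noun phrase with an optional adjective on either side of the head and a prepositional modifier) are assumptions made for the example. It covers the synapsy, substitution and permutation variants of the sample noun phrase.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import re

# Hypothetical tagset: N noun, A adjective, P preposition, D determiner.
# The tagged text is encoded as "lemma/TAG " tokens so that a regular
# expression can be matched directly against it.
NOUN_MODIFIER = re.compile(
    r"(?<!\S)"                 # start at a token boundary
    r"(?:(?P<pre>\S+)/A )?"    # optional adjective before the head (permutation)
    r"(?P<head>\S+)/N "        # head noun of the noun phrase
    r"(?:(?P<post>\S+)/A )?"   # optional adjective after the head (substitution)
    r"\S+/P "                  # preposition
    r"(?:\S+/D )?"             # optional determiner (synapsy)
    r"(?P<mod>\S+)/N "         # head of the prepositional modifier
)


def encode(tagged):
    """Turn [(lemma, tag), ...] into the flat string the pattern expects."""
    return "".join(f"{lemma}/{tag} " for lemma, tag in tagged)


def noun_modifier_pairs(tagged):
    """Extract (head, modifier) lemma pairs from a tagged, lemmatized sentence."""
    pairs = []
    for match in NOUN_MODIFIER.finditer(encode(tagged)):
        head = match.group("head")
        pairs.append((head, match.group("mod")))
        for adjective in (match.group("pre"), match.group("post")):
            if adjective:
                pairs.append((head, adjective))
    return pairs
```
]]></preformat>
      <p>Under these assumptions, <italic>una caída inusual de ventas</italic> and <italic>una inusual caída de las ventas</italic> yield the same pairs, (caída, venta) and (caída, inusual), whose components would then be conflated through morphological families before indexing.</p>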
    </sec>
    <sec id="sec-4">
      <title>Non-official experiments with CLEF 2001 queries</title>
      <p>
        The Spanish corpus was incorporated in CLEF 2001 [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], but the techniques proposed in this paper have been
integrated very recently and so we could not participate in that edition. Nevertheless, we consider it interesting to
present the results of some non-official experiments performed with the set of queries of CLEF 2001<sup>2</sup>.
      </p>
      <p>The Spanish CLEF corpus is formed by 215,738 documents corresponding to the news provided by EFE, a
Spanish news agency, in 1994. Documents are formatted in SGML, with a total size of 509 Megabytes. After
deleting SGML tags, the size of the text corpus is reduced to 438 Megabytes. Each query consists of three fields: a
brief title statement, a one-sentence description, and a more complex narrative specifying the relevance assessment
criteria. In these experiments, we have employed the three fields to build the final query submitted to the system.
For linguistically-motivated indexing techniques, the terms contained in the title section are given twice the
importance of those in the description and narrative.</p>
      <p>
        The techniques proposed in this article are independent of the indexing engine we choose to use. This is
because we first conflate the document to obtain its index terms; then, the engine receives the conflated version
of the document as input. So, any standard text indexing engine may be employed, which is a great advantage.
Nevertheless, each engine will behave according to its own characteristics<sup>3</sup> [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. The results we show here have
been obtained with SMART, using the ltc-lnc weighting scheme [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], without relevance feedback.
      </p>
      <p>We have compared the results obtained by four different indexing methods:</p>
      <sec id="sec-4-1">
        <title><sup>1</sup> At this point we will take as an example the noun phrase <italic>una caída de las ventas</italic> (a drop in the sales).</title>
        <p>
          <sup>2</sup> We have also tested some of the techniques proposed in this article over our own, non-standard, corpus, formed by 21,899 news articles
(national, international, economy, culture, ...). Results are reported in [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ].
        </p>
        <p><sup>3</sup> Indexing model, ranking algorithm, etc.</p>
        <list list-type="bullet">
          <list-item>
            <p>Stemming text after eliminating stopwords (stm). In order to apply this technique, we have tested several stemmers for Spanish. The best results were obtained with the stemmer used by the open source search engine Muscat<sup>4</sup>, based on Porter’s algorithm [<xref ref-type="bibr" rid="ref1">1</xref>].</p>
          </list-item>
          <list-item>
            <p>Conflation of content words via lemmatization (lem), i.e. each form of a content word is replaced by its lemma. This kind of conflation only takes into account inflectional morphology.</p>
          </list-item>
          <list-item>
            <p>Conflation of content words by means of morphological families (fam), i.e. each form of a content word is replaced by the representative of its morphological family. This kind of conflation takes into account both inflectional and derivational morphology.</p>
          </list-item>
          <list-item>
            <p>Text conflated by means of the combined use of morphological families and syntactic dependency pairs (f-sdp).</p>
          </list-item>
        </list>
        <p>The methods lem, fam, and f-sdp are linguistically motivated. Therefore, they are able to deal with some
complex linguistic phenomena such as clitic pronouns, contractions, idioms, and proper name recognition. In
contrast, the method stm works simply by removing a given set of suffixes, without taking into account such
linguistic phenomena, yielding incorrect conflations that introduce noise in the system. For example, clitic pronouns
are simply considered a set of suffixes to be removed. Moreover, the employment of finite-state techniques in
the implementation of our methods lets us reduce their computational cost, making their application possible in
practical environments.</p>
        <p>Table 1 shows the statistics of the terms that compose the corpus. The first and second rows show the total
number of terms and unique terms obtained for the indexed documents, respectively, both for the source text
and for the different conflated texts. Table 2 shows performance measures as defined in the standard trec_eval
program. The monolingual Spanish task in 2001 considered a set of 50 queries, but for one query no relevant
document exists in the corpus, and so the performance measures are computed over 49 queries. Table 3 shows
in its left part the precision attained at the 11 standard recall levels. We can observe that linguistically motivated
indexing techniques beat stm at low levels of recall. This means that more highly relevant documents are
placed at the top of the ranking when these techniques are applied. As a complement, the right part of Table 3
shows the precision computed at N seen documents.</p>
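        <p>For readers unfamiliar with the measure, the precision at the 11 standard recall levels can be sketched as follows. This is an illustrative reimplementation of the usual interpolated definition (the precision at a recall level is the highest precision observed at any recall equal to or above it), not the code of the evaluation program itself, and the document identifiers in the usage note are invented.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def eleven_point_precision(ranking, relevant):
    """Interpolated precision at recall levels 0.0, 0.1, ..., 1.0 for one query.

    ranking  -- document ids in the order returned by the system
    relevant -- ids of the documents judged relevant for the query
    """
    relevant = set(relevant)
    hits = 0
    points = []  # (recall, precision) after each relevant document retrieved
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    levels = [i / 10 for i in range(11)]
    return [
        max((p for r, p in points if r >= level), default=0.0)
        for level in levels
    ]
```
]]></preformat>
        <p>For a ranking d1, d2, d3, d4 in which only d1 and d3 are relevant, the precision is 1 for recall levels 0.0 to 0.5 and 2/3 for levels 0.6 to 1.0. The per-query values are then averaged over all queries.</p>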
        <p>
          The results of our experiments seem to be consistent with the results obtained for English and Germanic
languages by other IR systems based on NLP techniques [
          <xref ref-type="bibr" rid="ref12 ref13 ref14 ref15">12, 13, 14, 15</xref>
          ]. As in [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], syntax does not improve
average precision, but is the best technique for low levels of recall. A similar conclusion can be extracted from the
work of [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] on Dutch texts, where syntactic methods only beat statistical ones at low levels of recall. Our results
with respect to syntactic dependency pairs seem to be better than those of Perez-Carballo and Strzalkowski [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. It
is difficult to know if this improvement is due to a more accurate extraction of pairs or due to differences between
Spanish and English constructions.
        </p>
        <p><sup>4</sup> Currently, Muscat is not an open source project, and the web site http://open.muscat.com used to download the
stemmer is not operating. Information about a similar stemmer for Spanish (and other European languages) can be found
at http://snowball.sourceforge.net/spanish/stemmer.html.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Experiments with CLEF 2002 queries</title>
      <p><bold>The uppercase-to-lowercase module</bold></p>
      <p>
        An important characteristic of IR test collections that may have a considerable impact on the performance of
linguistically motivated indexing techniques is the large number of typographical errors present in documents, as
has been reported, in the case of the Spanish CLEF corpus, by [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. In particular, titles of news and subsections
are generally written in capital letters without accents. We must take into account that these titles are usually very
indicative of the topic of the document.
      </p>
      <p>
        For CLEF 2002 experiments, we have incorporated an uppercase-to-lowercase module into our system to
process uppercase sentences, converting them to lowercase and restoring diacritics where necessary. Other
approaches, such as [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], deal with documents where absolutely all diacritics have been eliminated. Nevertheless,
our situation is different, because the main body of the document is written in lowercase and preserves its diacritics; only
some sentences are written in capital letters. Moreover, for our purposes we only need the grammatical category
and lemma of the word, not the form.
      </p>
      <p>So, we can employ the lexical context of an uppercase sentence, both forms and lemmas, to recover this lost
information. The first step of this process is to identify the uppercase phrases. We consider that a sequence of
words forms an uppercase phrase when it consists of three or more words written in capital letters and at least
three of them have more than three characters. For each of these uppercase phrases we do the following:</p>
      <list list-type="order">
        <list-item>
          <p>We obtain its surrounding context.</p>
        </list-item>
        <list-item>
          <p>For each of the words in the phrase:</p>
          <list list-type="alpha-lower">
            <list-item>
              <p>We examine the context looking for entries with the same flattened form<sup>5</sup>. Each of these words becomes a candidate.</p>
            </list-item>
            <list-item>
              <p>If candidates are found, the most numerous is chosen and, in case of a tie, the one closest to the phrase is chosen.</p>
            </list-item>
            <list-item>
              <p>If no candidates are found, the lexicon is examined:</p>
              <list list-type="roman-lower">
                <list-item>
                  <p>We obtain from the lexicon all entries with the same flattened form, grouping them according to their category and lemma (we are not interested in the form, just in the category and the lemma of the word).</p>
                </list-item>
                <list-item>
                  <p>If no entries are found, we keep the current tag and lemma.</p>
                </list-item>
                <list-item>
                  <p>If only one entry is found, we choose that one.</p>
                </list-item>
                <list-item>
                  <p>If more than one entry is found, we choose the most numerous in the context (according to the category and the lemma). Again, in case of a tie, we choose the one closest to the sentence.</p>
                </list-item>
              </list>
            </list-item>
          </list>
        </list-item>
      </list>
      <p>Sometimes, some words of the uppercase phrase preserve some of their diacritics, for example the tilde of the Ñ. In
these situations, the candidates from the context or the lexicon must observe this restriction.</p>
      <p><sup>5</sup> That is, after both words have been converted to lowercase and all diacritics have been eliminated from them.</p>
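      <p>The detection of uppercase phrases and the choice of a candidate from the context can be sketched as follows. This is a simplified illustration: the function names are ours, the context is assumed to be a list of word forms ordered from closest to farthest from the phrase, and both the lexicon lookup and the restriction on preserved diacritics are left out.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import unicodedata
from collections import Counter


def flatten(word):
    """Flattened form: lowercase with all diacritics removed."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def uppercase_phrases(tokens):
    """Return (start, end) spans of runs of three or more uppercase words
    in which at least three words have more than three characters."""
    spans, start = [], None
    for i, token in enumerate(tokens + [""]):  # empty sentinel closes a final run
        if token.isupper():
            if start is None:
                start = i
        elif start is not None:
            run = tokens[start:i]
            if len(run) >= 3 and sum(len(t) > 3 for t in run) >= 3:
                spans.append((start, i))
            start = None
    return spans


def choose_candidate(word, context):
    """Pick the context form with the same flattened form as `word`.

    The most frequent candidate wins; ties go to the candidate that appears
    first in `context`, i.e. the closest one under our ordering assumption.
    Returns None when the context has no candidate (the lexicon step).
    """
    target = flatten(word)
    candidates = [form for form in context if flatten(form) == target]
    if not candidates:
        return None
    counts = Counter(candidates)
    return max(counts, key=lambda form: (counts[form], -candidates.index(form)))
```
]]></preformat>
      <p>For instance, in a title such as CAIDA INUSUAL DE LAS VENTAS the five words form an uppercase phrase, and a context containing <italic>caída</italic> twice and <italic>caida</italic> once restores CAIDA as <italic>caída</italic>.</p>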
      <sec id="sec-5-1">
        <title>Documents retrieved; relevant documents retrieved (2854 expected)</title>
      </sec>
      <sec id="sec-5-2">
        <title>R-precision; average precision per query; average precision per relevant docs; 11-points average precision</title>
        <list list-type="bullet">
          <list-item>
            <p>TDlem: Conflation of content words via lemmatization, i.e. each form of a content word is replaced by its lemma. This kind of conflation only takes into account inflectional morphology. The query is formed by the set of meaning lemmas present in title and description.</p>
          </list-item>
          <list-item>
            <p>TDNlem: The same as before, but the query also includes the set of meaning lemmas obtained from the narrative. Both this method and the previous one correspond to the lem indexing method referred to in Section 5.</p>
          </list-item>
          <list-item>
            <p>TDNsyn: Conflation of content words via lemmatization and expansion of queries by means of synonymy. We have considered that two words are synonyms if their similarity measure is greater than or equal to 0.80. The query is formed by the set of meaning lemmas present in title, description and narrative, but only the title and description fields of each query have been expanded using synonyms.</p>
          </list-item>
          <list-item>
            <p>TDNpds: Text conflated by means of the combined use of morphological families and syntactic dependency pairs. The query is formed by the union of the set of representatives of the morphological families corresponding to the content words and the set of dependency pairs extracted from the title, description and narrative fields. It corresponds to the f-sdp indexing method referred to in Section 5.</p>
          </list-item>
        </list>
        <p>Except for the first method, the terms extracted from the title section are given twice the importance of
those from the description and narrative.</p>
        <p>According to Tables 4 and 5, the lemmatization method (TDNlem) seems to be the best option. The expansion
through synonymy (TDNsyn) does not improve the results obtained, perhaps because the expansion is total, that is,
all synonyms of all terms of the query are employed, introducing too much noise. In the case of
syntactic dependency pairs (TDNpds), the results are worse than for CLEF 2001 queries. This may be simply due to
the different set of queries employed, but after comparing the results of each particular query with lemmatization,
it may be concluded that the more accurate the complex term is with respect to its constituent simple terms, the
more the results improve, as in the case of <italic>estadísticas de divorcio</italic> (divorce statistics) in the 115th query.</p>
        <p>These results, together with the previous ones obtained for CLEF 2001 queries, suggest that mere
lemmatization is a good starting point. It may be investigated whether this initial search should be followed by a relevance
feedback process based on the expansion of the synonyms of the most relevant terms of the most relevant
documents to minimize the noise. Another alternative to study for postprocessing consists of reranking the
results by means of syntactic information obtained in the form of syntactic dependency pairs.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Ricardo</given-names>
            <surname>Baeza-Yates</surname>
          </string-name>
          and
          <string-name>
            <given-names>Berthier</given-names>
            <surname>Ribeiro-Neto</surname>
          </string-name>
          .
          <article-title>Modern information retrieval</article-title>
          . Addison-Wesley, Harlow, England,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><given-names>Fco. Mario</given-names> <surname>Barcala</surname></string-name>,
          <string-name><given-names>Jesús</given-names> <surname>Vilares</surname></string-name>,
          <string-name><given-names>Miguel A.</given-names> <surname>Alonso</surname></string-name>,
          <string-name><given-names>Jorge</given-names> <surname>Graña</surname></string-name>, and
          <string-name><given-names>Manuel</given-names> <surname>Vilares</surname></string-name>.
          <article-title>Tokenization and proper noun recognition for information retrieval</article-title>.
          In
          <source>3rd International Workshop on Natural Language and Information Systems (NLIS 2002)</source>,
          September 2-3, 2002, Aix-en-Provence, France, Los Alamitos, California, USA,
          <year>September 2002</year>
          . IEEE Computer Society Press.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><given-names>Thorsten</given-names> <surname>Brants</surname></string-name>.
          <article-title>TnT - a statistical part-of-speech tagger</article-title>.
          In
          <source>Proceedings of the Sixth Applied Natural Language Processing Conference (ANLP'2000)</source>,
          Seattle,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Chris</given-names>
            <surname>Buckley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>James</given-names>
            <surname>Allan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Gerard</given-names>
            <surname>Salton</surname>
          </string-name>
          .
          <article-title>Automatic routing and ad-hoc retrieval using SMART: TREC 2</article-title>
          . In D. K. Harman, editor,
          <source>NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)</source>
          , pages
          <fpage>45</fpage>
          -
          <lpage>56</lpage>
          , Gaithersburg, MD, USA,
          <year>1993</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><given-names>Santiago</given-names> <surname>Fernández</surname></string-name>,
          <string-name><given-names>Jorge</given-names> <surname>Graña</surname></string-name>, and
          <string-name><given-names>Alejandro</given-names> <surname>Sobrino</surname></string-name>.
          <article-title>A Spanish e-dictionary of synonyms as a fuzzy tool for information retrieval</article-title>.
          In
          <source>Actas de las I Jornadas de Tratamiento y Recuperación de Información (JOTRI 2002)</source>,
          León, Spain,
          <year>September 2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name><given-names>Carlos G.</given-names> <surname>Figuerola</surname></string-name>,
          <string-name><given-names>Raquel</given-names> <surname>Gómez</surname></string-name>,
          <string-name><given-names>Angel F.</given-names> <surname>Zazo Rodríguez</surname></string-name>, and
          <string-name><given-names>José Luis</given-names> <surname>Alonso Berrocal</surname></string-name>.
          <article-title>Stemming in Spanish: A first approach to its impact on information retrieval</article-title>
          . In Carol Peters, editor,
          <source>Working notes for the CLEF 2001 workshop</source>
          , Darmstadt, Germany,
          <year>September 2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name><given-names>Jorge</given-names> <surname>Graña</surname></string-name>,
          <string-name><given-names>Fco. Mario</given-names> <surname>Barcala</surname></string-name>, and
          <string-name><given-names>Miguel A.</given-names> <surname>Alonso</surname></string-name>.
          <article-title>Compilation methods of minimal acyclic automata for large dictionaries</article-title>
          . In Bruce W. Watson and Derick Wood, editors,
          <source>Proc. of the 6th Conference on Implementations and Applications of Automata (CIAA 2001)</source>
          , pages
          <fpage>116</fpage>
          -
          <lpage>129</lpage>
          , Pretoria, South Africa,
          <year>July 2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name><given-names>Jorge</given-names> <surname>Graña</surname></string-name>,
          <string-name><given-names>Fco. Mario</given-names> <surname>Barcala</surname></string-name>, and
          <string-name><given-names>Jesús</given-names> <surname>Vilares</surname></string-name>.
          <article-title>Formal methods of tokenization for part-of-speech tagging</article-title>
          . In Alexander Gelbukh, editor,
          <source>Computational Linguistics and Intelligent Text Processing</source>
          , volume
          <volume>2276</volume>
          of Lecture Notes in Computer Science, pages
          <fpage>240</fpage>
          -
          <lpage>249</lpage>
          . Springer-Verlag, Berlin-Heidelberg-New York,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name><given-names>Jorge</given-names> <surname>Graña</surname></string-name>,
          <string-name><given-names>Jean-Cédric</given-names> <surname>Chappelier</surname></string-name>, and
          <string-name><given-names>Manuel</given-names> <surname>Vilares</surname></string-name>.
          <article-title>Integrating external dictionaries into stochastic part-of-speech taggers</article-title>.
          In
          <source>Proceedings of the Euroconference Recent Advances in Natural Language Processing (RANLP 2001)</source>
          , pages
          <fpage>122</fpage>
          -
          <lpage>128</lpage>
          , Tzigov Chark, Bulgaria,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Jane</given-names>
            <surname>Greenberg</surname>
          </string-name>
          .
          <article-title>Automatic query expansion via lexical-semantic relationships</article-title>
          .
          <source>Journal of the American Society for Information Science and Technology</source>
          ,
          <volume>52</volume>
          (
          <issue>5</issue>
          ):
          <fpage>402</fpage>
          -
          <lpage>415</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Christian</given-names>
            <surname>Jacquemin</surname>
          </string-name>
          and
          <string-name>
            <given-names>Evelyne</given-names>
            <surname>Tzoukermann</surname>
          </string-name>
          .
          <article-title>NLP for term variant extraction: synergy between morphology, lexicon and syntax</article-title>
          . In Tomek Strzalkowski, editor,
          <source>Natural Language Information Retrieval</source>
          , volume
          <volume>7</volume>
          of Text, Speech and Language Technology, pages
          <fpage>25</fpage>
          -
          <lpage>74</lpage>
          . Kluwer Academic Publishers, Dordrecht/Boston/London,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Wessel</given-names>
            <surname>Kraaij</surname>
          </string-name>
          and
          <string-name>
            <given-names>Renée</given-names>
            <surname>Pohlmann</surname>
          </string-name>
          .
          <article-title>Comparing the effect of syntactic vs. statistical phrase indexing strategies for Dutch</article-title>
          . In Christos Nicolaou and Constantine Stephanidis, editors,
          <source>Research and Advanced Technology for Digital Libraries</source>
          , volume
          <volume>1513</volume>
          of Lecture Notes in Computer Science, pages
          <fpage>605</fpage>
          -
          <lpage>614</lpage>
          . Springer-Verlag, Berlin/Heidelberg/New York,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Byung-Kwan</given-names>
            <surname>Kwak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jee-Hyub</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Geunbae</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jung Yun</given-names>
            <surname>Seo</surname>
          </string-name>
          .
          <article-title>Corpus-based learning of compound noun indexing</article-title>
          . In J. Klavans and J. Gonzalo, editors,
          <source>Proc. of the ACL'2000 workshop on Recent Advances in Natural Language Processing and Information Retrieval</source>
          , Hong Kong, October
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Markus</given-names>
            <surname>Mittendorfer</surname>
          </string-name>
          and
          <string-name>
            <given-names>Werner</given-names>
            <surname>Winiwarter</surname>
          </string-name>
          .
          <article-title>Exploiting syntactic analysis of queries for information retrieval</article-title>
          .
          <source>Data &amp; Knowledge Engineering</source>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Jose</given-names>
            <surname>Perez-Carballo</surname>
          </string-name>
          and
          <string-name>
            <given-names>Tomek</given-names>
            <surname>Strzalkowski</surname>
          </string-name>
          .
          <article-title>Natural language information retrieval: progress report</article-title>
          .
          <source>Information Processing and Management</source>
          ,
          <volume>36</volume>
          (
          <issue>1</issue>
          ):
          <fpage>155</fpage>
          -
          <lpage>178</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Carol</given-names>
            <surname>Peters</surname>
          </string-name>
          , editor.
          <source>Results of the CLEF 2001 Cross-Language System Evaluation Campaign</source>
          .
          <source>Working Notes for the CLEF 2001 Workshop</source>
          , Darmstadt, Germany, September
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Jesús</given-names>
            <surname>Vilares</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Fco. Mario</given-names>
            <surname>Barcala</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Miguel A.</given-names>
            <surname>Alonso</surname>
          </string-name>
          .
          <article-title>Using syntactic dependency-pairs conflation to improve retrieval performance in Spanish</article-title>
          . In Alexander Gelbukh, editor,
          <source>Computational Linguistics and Intelligent Text Processing</source>
          , volume
          <volume>2276</volume>
          of Lecture Notes in Computer Science, pages
          <fpage>381</fpage>
          -
          <lpage>390</lpage>
          . Springer-Verlag, Berlin-Heidelberg-New York,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
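          <string-name>
            <given-names>Jesús</given-names>
            <surname>Vilares</surname>
          </string-name>
          ,
          <string-name>
            <given-names>David</given-names>
            <surname>Cabrero</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Miguel A.</given-names>
            <surname>Alonso</surname>
          </string-name>
          .
          <article-title>Applying productive derivational morphology to term indexing of Spanish texts</article-title>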
          . In Alexander Gelbukh, editor,
          <source>Computational Linguistics and Intelligent Text Processing</source>
          , volume
          <volume>2004</volume>
          of Lecture Notes in Computer Science
          , pages
          <fpage>336</fpage>
          -
          <lpage>348</lpage>
          . Springer-Verlag, Berlin-Heidelberg-New York,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
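          <string-name>
            <given-names>Jesús</given-names>
            <surname>Vilares</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Manuel</given-names>
            <surname>Vilares</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Miguel A.</given-names>
            <surname>Alonso</surname>
          </string-name>
          .
          <article-title>Towards the development of heuristics for automatic query expansion</article-title>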
          . In Heinrich C. Mayr, Jiri Lazansky, Gerald Quirchmayr, and Pavel Vogel, editors,
          <source>Database and Expert Systems Applications</source>
          , volume
          <volume>2113</volume>
          of Lecture Notes in Computer Science, pages
          <fpage>887</fpage>
          -
          <lpage>896</lpage>
          . Springer-Verlag, Berlin-Heidelberg-New York,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>David</given-names>
            <surname>Yarowsky</surname>
          </string-name>
          .
          <article-title>A comparison of corpus-based techniques for restoring accents in Spanish and French text</article-title>
          . In
          <source>Natural Language Processing Using Very Large Corpora</source>
          , pages
          <fpage>99</fpage>
          -
          <lpage>120</lpage>
          . Kluwer Academic Publishers,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>