                 The LIC2M’s CLEF 2003 System
                       Romaric Besançon, Gaël de Chalendar, Olivier Ferret,
                         Christian Fluhr, Olivier Mesnard and Hubert Naets
                                        CEA/LIST - LIC2M
                        {Romaric.Besancon,Gael.de-Chalendar,Olivier.Ferret,
                       Christian.Fluhr,Olivier.Mesnard, Hubert.Naets}@cea.fr


                                                Abstract
         For its first birthday, the LIC2M participated in the Small Multilingual Track of
      CLEF 2003. Our system is based on a deep linguistic analysis of documents and queries and
      on an original search algorithm inherited from the SPIRIT (EMIR) system. With a partially
      developed system, we obtained average results that will serve as a baseline for our future
      participations in IR evaluations.


1     Introduction
CLEF 2003 is the first participation of the LIC2M (Multimedia and Multilingual Knowledge
Engineering Laboratory) in an international evaluation campaign. The lab was created inside the
CEA (Commissariat à l’Énergie Atomique) at the beginning of 2002 and inherited the
technology developed at the CEA over the last three decades with the EMIR [4] and SPIRIT
[5] systems.
    As we decided to restart the development from scratch, in order to make the best use of this
technological basis while adopting modern software engineering methods and tools, the system that
was used for the Small Multilingual Track of CLEF 2003 was developed in a short period of time
and is still incomplete: some important modules are not included (see section 3) and the overall
system needs tuning and validation. However, we obtained this year results similar to those of the
previous version of the system that participated in TREC-7 (cf. section 7). The future developments
and tuning of the system (section 8) should then lead to better results in future evaluations.
    The paper is organized as follows: we begin with an overview of the system (section 2), followed
by sections 3 to 6, dedicated to the description of its components (linguistic processing, indexing,
query processing, and search and merging). Next, we describe our results (section 7) before
presenting our conclusions and plans for future work (section 8).


2     Outline of the process
The general architecture of the process is presented in figure 1. The system is composed of four
elements:
    • a linguistic analyzer, which is used to process documents and queries;
    • an indexer to store the documents in indexes, as a matrix documents×terms;
    • a query processor that reformulates the query to suit the search (monolingual and multilin-
      gual reformulations);
    • a search engine that searches the indexes for the closest documents to the reformulated
      queries and merges the results obtained for each language.
   The linguistic analyzer is itself composed of a set of modules customized with resources. The
combination of modules and resources (dictionaries, sets of categories, etc.) depends on the language.
              [Figure 1 (diagram): the four corpora (English, French, Spanish, German) are indexed into
              four separate indexes; the original topic goes through query processing, producing one query
              per language; each query is searched against the index of its language; the four result lists
              are then combined by the merging strategy into the global results.]

                               Figure 1: General architecture of the search


    The indexing process can be described as follows: each document from a collection is first
processed through the linguistic analyzer, according to the language it is written in. The output
of the linguistic analyzer is a stream of tokens, each token being a set of three pieces of information:
lemma¹, category and position. The position is the offset of the word relative to the beginning of
the document. The pair (lemma, category) is considered as a term, and each distinct pair is used as
an entry in the index. The indexer stores, for each entry, the documents containing the term and
the positions of the term in each document. Four indexes are built, one for each language.
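
    As a rough illustration of this index layout (not the actual implementation, which relies on the
Lemur toolkit described in section 4), a minimal sketch in Python might look as follows; all names
are hypothetical:

    from collections import defaultdict

    class SimpleIndex:
        """Toy inverted index: a term is the pair (lemma, category) and each
        entry stores, per document, the positions of the term."""

        def __init__(self):
            # (lemma, category) -> {doc_id -> [positions]}
            self.postings = defaultdict(lambda: defaultdict(list))

        def add_document(self, doc_id, tokens):
            # tokens: iterable of (lemma, category, position) triples,
            # as produced by the linguistic analyzer
            for lemma, category, position in tokens:
                self.postings[(lemma, category)][doc_id].append(position)

        def lookup(self, lemma, category):
            # documents containing the term, with the term positions
            return dict(self.postings[(lemma, category)])

    # one index per language, as in the paper
    index_fr = SimpleIndex()
    index_fr.add_document("doc1", [("lettre", "N", 0), ("piéger", "V", 1)])
    print(index_fr.lookup("lettre", "N"))   # {'doc1': [0]}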
    The query process performs the following tasks: the topic is processed through the linguistic
analyzer; pairs (lemma, category) from the stream of tokens are considered at this stage as con-
cepts and concepts are reformulated into four sets of terms, one for each target language, using
monolingual and multilingual reformulation dictionaries.
    The search engine then sends requests to the four indexes to get a list of the documents
containing the largest number of query terms, and merges the four lists obtained. The search
engine takes into account the original concepts and their weights to score the document.


3     Linguistic Processing
The linguistic processing is a fundamental part of the system. We decided to base all the process-
ing of our system on a deep linguistic analysis designed to extract precise information from the
structure of texts. We will see below that this goal is not yet fully reached, since not all modules
are completely implemented or fully functional.
   ¹ Here and in the rest of the paper, a lemma is a normalized form and can group several lemmas (in the
linguistic sense) that are strict synonyms.
    Figure 2 shows the organization of the modules in the linguistic processing chain. The input
is a raw text in a given language² converted to Unicode.

              [Figure 2 (diagram): the processing chain runs Tokenization; Dictionary search or default
              categories (full-form dictionary); Idiomatic expressions; PoS tagging (trigram/bigram matrices)
              and Stopper (stoplist); Named entities (rules); Syntactic analysis (grammar rules); and
              Complex terms.]

                               Figure 2: Overview of the linguistic processing

    The first step is the tokenization of the text in order to find word and sentence breaks. Next,
tokens are looked up in a full-form dictionary. If they are found in the dictionary, they are associated
with their lemmas and are given all the morphosyntactic categories associated with this form.
    It should be noted that our system uses a specific set of morphosyntactic categories that makes
it quite different from other systems: the categories are positional, meaning that the category itself
makes it possible to distinguish which words can appear before or after other words. For example,
for French, there are pre-nominal adjectives and post-nominal adjectives. The positional properties
of this large set of categories (137 for French, 120 for English, 100 for Spanish and currently only
38 for German) allow very effective disambiguation, as we will see below.
    If a token is not found in the dictionary, it is given a default set of morphosyntactic categories
based on some typographical properties: a token beginning with an uppercase letter will obtain
the categories of proper nouns, for example.
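
    A minimal sketch of this fallback categorization (the category names and the generic open-class
fallback are illustrative assumptions, not the system's actual category set):

    def default_categories(token):
        # unknown capitalized tokens get the proper-noun categories,
        # as in the example above
        if token[:1].isupper():
            return {"PROPER_NOUN"}
        # hypothetical generic fallback: any open-class category
        return {"NOUN", "VERB", "ADJECTIVE"}

    print(default_categories("Kiesbauer"))   # {'PROPER_NOUN'}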
    After tokenization, the idiomatic expressions are detected and replaced by a single token.
Idiomatic expressions are phrases or usual compound nouns that are in our dictionary. They can
be non-contiguous, like phrasal verbs: "switch the light on" is replaced by "switch-on the light
(on)" (the information about the preposition is kept because it will be used during syntactic
analysis). The detection of idiomatic expressions is performed by applying a set of rules that are
triggered on specific words and tested on the left and right contexts of the trigger (note that the
same technology is used for identifying named entities and compounds).
    After these steps, each token has several possible morphosyntactic categories. The goal of
the part-of-speech tagger is to drastically reduce the number of possible categories for each word.
We will see below that our technology should be able to obtain results at the level of state-of-the-art
systems without any statistics. Unfortunately, this module was not ready at the time of the
CLEF campaign, and thus we only removed functional and overly frequent words using stoplists.
We then kept all other tokens with content-word categories.
   ² Our language and encoding identifier has not been used for CLEF since the language and encoding of each
given corpus are already known. Even though some texts of one language's corpus are actually in another language
(for example, some LA Times texts are in Spanish), we decided to ignore this point.
    The fifth step uses the same algorithms as the idiomatic expressions step, with a specific set of
rules, to extract named entities such as people or location names. Each extracted group of tokens
is replaced by a single token. This module obtains quite good results, with a precision of around
80% and a recall of around 60%, depending on the language and the entity type. This evaluation
has been done for English (5,000 texts), French (5,000 texts) and Spanish (50 texts) using a set of
manually annotated texts.
    The resulting text should then be used for a syntactic analysis. Again, this module was not
ready at the time of CLEF and thus was not used. We will see in section 8 that we hope to
improve our results in a future evaluation by using all the modules our system is designed to work
with.
    The last step of the linguistic analysis is the extraction of compound nouns. These compounds
are important indicators for information extraction but also for many other tasks. This module
also relies on the rule-based technology used to extract idiomatic expressions and named entities.
Since no syntactic analysis is performed, every compound corresponding to a pattern (like Noun
Preposition Noun) described in a rule is kept without any tense or gender agreement checking.
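
    This pattern matching can be pictured with the following sketch over a stream of (lemma,
category) pairs; the pattern list and the way compounds are normalized are assumptions for
illustration only:

    PATTERNS = [("N", "PREP", "N"), ("ADJ", "N"), ("N", "N")]

    def extract_compounds(tokens):
        # tokens: list of (lemma, category) pairs from the analysis
        compounds = []
        for i in range(len(tokens)):
            for pattern in PATTERNS:
                window = tokens[i:i + len(pattern)]
                if len(window) == len(pattern) and all(
                        cat == wanted for (_, cat), wanted in zip(window, pattern)):
                    # kept as-is, without any agreement checking
                    compounds.append(" ".join(lemma for lemma, _ in window))
        return compounds

    print(extract_compounds([("chaîne", "N"), ("de", "PREP"), ("télévision", "N")]))
    # ['chaîne de télévision']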
    The results of the linguistic processing described above contain linguistic data of various
levels that are used both for document indexing (section 4) and for query processing (section 5).


4     Indexing
The search engine described in section 6 relies on basic functions that give access to raw infor-
mation on the collection. An efficient implementation of these basic functions allows easy use of
the global system. We built inverted files (containing, for each term found in the collection of
documents, the list of all documents containing this term), using the Lemur toolkit (V2.01) [7, 1].
Within this context, our design choices consist in:

    1. considering a term as a pair of two complementary pieces of information: the lemma of the
       word and its grammatical category as found in the document. This was possible thanks
       to the deep linguistic processing described in the previous section. The first benefit is that
       access to index entries and index compression are more efficient because there are fewer
       entries than if we had considered inflected forms of terms. The second benefit is that we
       take the semantics of terms into account at this very basic layer of the system, which leads
       to simpler reformulation computations in the query processing layer. Adding the grammatical
       category should lead to more discriminating inverted files than relying only on lemmas, but
       we did not have the opportunity to evaluate the impact of such a distinction. Only macro
       categories have been considered: Noun, Verb, Adjective and Proper Noun. Index entries are
       built by concatenating the lemma and a number representing the category (the cost is then
       only two characters per entry), as sketched in the example after this list.
    2. indexing every term, without any frequency filtering, using only "stop word" lists. Indexing
       a large number of terms does not cost much when efficient compression schemes are used. In
       future versions, the indexing terms will be chosen based on their part-of-speech (keeping
       only content words such as nouns, verbs and adjectives). Stoplists will then be used only as
       a complement to eliminate some non-relevant content words. In the current version, the
       words filtered by the stoplists represent approximately 60% of the text.
    3. building separate indexes for each language: English, French, Spanish and German. There is
       no drawback because the language is identified for each text and we consider that documents
       are monolingual (only one language per document), and this leads to more manageable
       indexes.
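
    As mentioned in item 1, a minimal sketch of how an index entry can be built from the lemma and
a macro-category code; the separator and the numeric codes are illustrative assumptions, not the
system's actual encoding:

    # hypothetical codes for the four macro categories
    MACRO_CATEGORY_CODES = {"Noun": 1, "Verb": 2, "Adjective": 3, "ProperNoun": 4}

    def index_entry(lemma, macro_category):
        # the entry is the lemma concatenated with the category number,
        # costing only a couple of extra characters per entry
        return "%s#%d" % (lemma, MACRO_CATEGORY_CODES[macro_category])

    print(index_entry("lettre", "Noun"))           # lettre#1
    print(index_entry("Kiesbauer", "ProperNoun"))  # Kiesbauer#4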
    The statistics of the indexing process are given for each corpus in table 1, indicating for each
language the size of the corresponding corpus, the number of documents in the corpus, the total
number of terms that were indexed (including single terms and compounds), the number of
distinct single terms indexed, the number of distinct compounds indexed, the total size of the
index files (as kept on disk), and the memory size used to load the index (including term and
document lists and lookup tables on the index files). Because the indexes are too big to fit in
memory, not all the information is loaded; it stays in files, and lookup tables hold the offset of
each entry.

        language   corpus size   nb docs      nb terms       nb distinct   nb distinct   index size   memory size
                   (MB)                                      uniterms      compounds     (MB)         (MB)
        fre              326     129,806      30,756,020        297,084     2,281,349          476           185
        eng              576     169,477      48,387,519        512,196     1,571,058          593           136
        ger              632     287,670      55,605,358      1,076,714       179,357          603            89
        spa            1,084     454,045     112,603,093        747,755     3,012,136        1,325           261

                               Table 1: Statistics of the indexing process

   It can be noted that the total size of the indexes is larger than the original text (except for
German). Our indexes are not very compact because of the great number of compound terms
(which increases the size of the vocabulary and adds many entries to the indexes). We have noted
that we actually need 800 MB of memory to load the four indexes and run a request.
   In future versions of our system, we will replace the Lemur toolkit with a more efficient im-
plementation, allowing a smaller memory footprint to store the vocabulary, a transaction model
to manage both read and write access (to index new files while executing retrieval requests) and
handling of Unicode text.


5      Query Processing
5.1     Linguistic processing
Each query is first processed through the linguistic analyzer corresponding to the query language,
as described in section 3. The three fields of the query (title, description, narrative) are kept for
this analysis.
    The result is a query composed of a list of elements, each of which can be a lemma, associated
with its part-of-speech (limited to nouns, verbs, and adjectives); a named entity, associated with its
type; or a compound, in a normalized form, associated with its part-of-speech⁴.
    The query is then filtered using a stoplist containing meta-words (words used in the narrative to
describe which documents are relevant, such as "document", "relevant", etc.). These meta-word
stoplists were built on the basis of the CLEF 2002 topics, from a first selection using frequency
information, and revised manually.
    No deep semantic analysis was performed on the narrative to take into account, for instance,
the negative descriptions in the topics ("documents that contain ... are not relevant"). As a result,
some negative descriptive elements were kept in the final queries.
    After this processing, the query is a list of indexing elements, in the original language of the
topic. These elements are called the concepts of the query.

5.2     Query reformulation
The list of query concepts is augmented with additional knowledge, using external resources (such
as monolingual and bilingual reformulation dictionaries) and using the corpus to search as a
reference (the indexes are used to filter out words inferred by the reformulation).
   ⁴ As the system was still under development, compounds were not properly taken into account, especially in
query expansion and translation, and were only used for monolingual search.
    The reformulated query is then a list of the original query terms, called query concepts, and a
list of inferred terms, called search terms. Each search term is linked to a query concept, and a
weight is associated to the link. An example of query reformulation is presented in figure 3 at the
end of this section.
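
    The structure of a reformulated query can be sketched as follows (hypothetical names; the terms
shown come from the example of figure 3):

    from dataclasses import dataclass, field

    @dataclass
    class SearchTerm:
        lemma_cat: str    # inferred search term, e.g. "presenter-N"
        concept: str      # original query concept it is linked to
        weight: float     # confidence of the reformulation link (0..1)

    @dataclass
    class ReformulatedQuery:
        language: str
        concepts: list = field(default_factory=list)      # original query concepts
        search_terms: list = field(default_factory=list)  # inferred search terms

    query_en = ReformulatedQuery(
        language="eng",
        concepts=["lettre-N", "présentatrice-N"],
        search_terms=[
            SearchTerm("letter-N", "lettre-N", 1.0),
            SearchTerm("presenter-N", "présentatrice-N", 1.0),
            SearchTerm("newscaster-N", "présentatrice-N", 1.0),
        ],
    )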

5.2.1    Translation
In order to query a corpus in a language different from the original query language, a translation of
the query terms is performed, using bilingual dictionaries. Each term of the query is translated into
several terms in the target language. The translated words form the search terms of the reformulated
query. The links between the search terms and the query concepts can also be weighted by a
confidence value (between 0 and 1) indicating the confidence in the translation. In this first
version, all translations were assigned the same weight.
    In the small multilingual task, the only language pair for which we did not have a bilingual
dictionary is the Spanish/German pair. For this pair, we used a two-step translation through a pivot
language (terms are first translated from the topic language to the pivot language, and then from
the pivot language to the target language). In this case, the confidence in the translation is the
product of the confidences of the two successive translations. For the submitted runs, we used only
one pivot language for the Spanish/German pair (the chosen language was French), but several
pivot languages could also be used concurrently for this translation.
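
    A sketch of the two-step translation, where the confidence of a chained translation is the product
of the confidences of its two steps; the dictionary contents and confidence values are illustrative:

    def pivot_translate(term, src_to_pivot, pivot_to_tgt):
        # src_to_pivot, pivot_to_tgt: {term -> [(translation, confidence)]}
        results = []
        for pivot_term, conf1 in src_to_pivot.get(term, []):
            for target_term, conf2 in pivot_to_tgt.get(pivot_term, []):
                # confidence of the chain = product of the two confidences
                results.append((target_term, conf1 * conf2))
        return results

    spa_to_fre = {"cadena": [("chaîne", 1.0)]}
    fre_to_ger = {"chaîne": [("Kette", 1.0), ("Sender", 1.0)]}
    print(pivot_translate("cadena", spa_to_fre, fre_to_ger))
    # [('Kette', 1.0), ('Sender', 1.0)]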

5.2.2    Monolingual reformulation
A semantic expansion was also performed to increase the lexical variety of the query concepts,
using monolingual reformulation dictionaries (containing mostly synonymy information).
    In the runs we submitted, this semantic expansion was only performed for monolingual query
expansion. In a more general approach for crosslingual retrieval, several combinations of multi-
step translation and monolingual expansion can be imagined. We plan to test several of them on
the CLEF 2003 data.

5.2.3    Topical expansion
The semantic expansion described in the previous section is mainly based on synonyms. [8] shows
that this kind of expansion reliably improves results if the terms of the queries are semantically
disambiguated. As we did not perform such a disambiguation, we chose to reinforce the represen-
tation of the context of each query by adding to it words that are topically linked to its words
after the semantic expansion step⁵.
     The selection of such words is based on a network of lexical cooccurrences. For French, the
only language for which we tested this kind of expansion, this network was built from a 39 million
word corpus made of 24 months from the Le Monde newspaper (see [3] for more details). After a
filtering procedure was applied [2] to select the cooccurrences that are likely to be supported by a
topical relation, we got a network of 7,200 lemmas and 183,000 cooccurrences.
     This network is used in a three-stage process that relies on a kind of bootstrapping. First, a
set of words from the network that are strongly linked to the considered query is selected. The
strength of this link is given by the number of query words the network word is linked to (3 words
in our experiments). Most of these words, which are called expansion words, are topically close to
the query but some of them also represent noise. The next stage aims at discarding this noise. It
consists in selecting the words of the query that are the most representative of its topic. This
selection is based on the words resulting from the first stage: we assume that a query word is
significant if it has contributed to the selection of a minimum number of expansion words (2 words
in our experiments). The final stage is identical to the first one, except that the expansion is done
from the selected words of the query and not from all its plain words.
   ⁵ According to [6], topical relations are non-systematic semantic relations, such as the ones between tennis,
racket, ball and net, for instance.
Moreover, the number of expansion words is arbitrarily set to 10 to avoid swamping the initial
words of the query.
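
     A minimal sketch of this three-stage process, under the assumption that the cooccurrence network
is represented as a mapping from each word to the set of words it cooccurs with; the thresholds follow
the values given in the text:

    def topical_expansion(query_words, network, k_link=3, k_signif=2, k_final=10):
        # network: {word -> set of cooccurring words}
        query_words = set(query_words)

        def expansion_candidates(words):
            # stage 1 / stage 3: network words linked to at least k_link of `words`
            return {cand: len(neighbours & words)
                    for cand, neighbours in network.items()
                    if len(neighbours & words) >= k_link}

        first_pass = expansion_candidates(query_words)

        # stage 2: keep query words that contributed to at least k_signif
        # of the expansion words selected in stage 1
        significant = {w for w in query_words
                       if sum(1 for cand in first_pass if w in network[cand]) >= k_signif}

        # stage 3: expand again, from the significant query words only,
        # and keep at most k_final expansion words
        second_pass = expansion_candidates(significant)
        return sorted(second_pass, key=second_pass.get, reverse=True)[:k_final]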
     This topical expansion was applied to the 60 French topics of CLEF 2003. A set of expansion
words was produced for 42 of them. This set was empty for the others, which means that it was
not possible in these cases to build from the network of cooccurrences a significant representation
of the topic of the query. As an example, the result of the topical expansion of topic C164,
Les condamnations pour trafic de drogue en Europe (European Drug Sentences), is the following
list of words: {amende, infraction, prison, délit, procès, pénal, crime, juge, cocaïne, sursis}.

           Topic (original, French): "Une lettre piégée pour Arabella Kiesbauer. Trouvez des informations
           sur l'explosion d'une lettre piégée dans le studio de la présentatrice Arabella Kiesbauer de la
           chaîne de télévision PRO 7. Le 9 juin 1995, une lettre piégée envoyée par des extrémistes
           de droite et destinée à la présentatrice noire Arabella Kiesbauer ..."

           Linguistic processing: lettre-N piéger-V Arabella-NP Kiesbauer-NP Arabella Kiesbauer-PER
           explosion-N studio-N présentatrice-N chaîne-N télévision-N professionnel-N
           informations sur l'explosion-N studio de la présentatrice-N chaîne de télévision-N ...

           Monolingual reformulation (French): lettre-N -> lettre-N, correspondance-N, courrier-N,
           billet-N, pli-N; piéger-V -> piéger-V; Arabella-NP -> Arabella-NP; Kiesbauer-NP;
           Arabella Kiesbauer-PER; explosion-N -> explosion-N, manifestation-N, expansion-N;
           studio-N -> studio-N, chambre-N, appartement-N; présentatrice-N -> présentatrice-N;
           chaîne-N -> chaîne-N; télévision-N -> télévision-N; ...

           Translation reformulation (English): lettre-N -> letter-N; piéger-V (no translation);
           Arabella-NP -> Arabella-NP; Kiesbauer-NP -> Kiesbauer-NP; Arabella Kiesbauer-PER ->
           Arabella ...-PER; studio-N -> studio-N; présentatrice-N -> announceur-N, newscaster-N,
           newsreader-N, presenter-N; chaîne-N -> chain-N, channel-N, line-N, program-N, range-N,
           ridge-N, string-N; télévision-N -> television-N; ...

                   Figure 3: Summarization of the query construction on an example



6     Search and Merging
The original topic is transformed, during the query processing, into four different queries, one for
each language. The search is performed independently for each of these queries on the index of
the corresponding language. 1,000 documents are retrieved for each language. The 4,000 retrieved
documents from the four corpora are then merged and sorted by their relevance to the topic. Only
the first 1,000 are then kept. We present in the following sections the search technique and the
merging strategy.

6.1    Search
For each expanded query, the search is performed by retrieving from the index of the corresponding
language, for each search term of the query, the documents containing the term. A term profile
is then associated with each document: this profile is a binary vector whose size is the number of
search terms of the query, each component indicating the presence or absence of the corresponding
search term in the document. The frequency of the term in the document is not used: we consider
that if a document contains relevant information with respect to a topic, then the document is
relevant, even if the document contains other material that is not relevant to the topic (a
visualization step can then select the relevant parts and show them to the user). However, future
versions of the search engine will also be able to take the frequency of the term in the document
into account, to make the search engine more suitable for standard evaluation procedures.
     Since we kept, in the expanded query, the links between the search terms and the query
concepts, we can associate with each document a concept profile, indicating the presence or absence
of each query concept in the document.
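
     A minimal sketch of how these two profiles can be built for one retrieved document (hypothetical
representation; the search terms and concepts follow the example of figure 3):

    def term_profile(doc_terms, search_terms):
        # doc_terms: set of terms found in the document
        # search_terms: ordered list of (term, linked concept) pairs
        return [1 if term in doc_terms else 0 for term, _ in search_terms]

    def concept_profile(profile, search_terms, concepts):
        # a concept is present if at least one of its linked search terms is
        present = {concept for bit, (_, concept) in zip(profile, search_terms) if bit}
        return tuple(1 if concept in present else 0 for concept in concepts)

    search_terms = [("letter-N", "lettre-N"),
                    ("presenter-N", "présentatrice-N"),
                    ("newscaster-N", "présentatrice-N")]
    concepts = ["lettre-N", "présentatrice-N"]
    doc = {"letter-N", "newscaster-N"}

    tp = term_profile(doc, search_terms)              # [1, 0, 1]
    cp = concept_profile(tp, search_terms, concepts)  # (1, 1)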
     The retrieved documents are then classified, grouping in the same cluster the documents that
share the same concept profile. This classification is motivated by at least two reasons: it makes
it easy to merge the results from different languages (see the following section), and it makes the
visualization of the results more efficient: the clustering of the results, and the association of a
concept profile with each cluster, make it easier for the user to browse the results (a cluster
corresponding to a non-relevant subset of query concepts can simply be ignored).

6.2    Merge and sort
Given that a concept profile is associated with each class, the merging strategy is quite straightfor-
ward: since the concepts are in the original query language, the concept profiles associated with the
classes are comparable, and the classes having the same profile are simply merged.
    The classes are then sorted by their relevance to the query. For this purpose, we use the idf
(inverse document frequency) weights of the terms, defined for a term t by the formula
idf(t) = log(N / df(t)), where df(t) is the document frequency of the term (i.e. the number of
documents containing the term) and N is the total number of documents in the corpus.
    The first idea is to use the idf weights of the terms in each language, compute the weight of a
concept by some combination of the weights of the terms derived from the concept, and then
associate a weight with a concept profile derived from the concepts it contains. However, in this
case, the weights computed for the different languages are not comparable (the idf of the terms
depends on the corpora).
    We decided instead to compute a crosslingual pseudo-idf weight of the concepts, using only the
corpus composed of the 4,000 documents kept as the result of the search. The idf weight of the
concepts is computed on this corpus, using only the information contained in the concept profiles
of the classes and the sizes of the classes. A weight can then be associated with each concept profile
by computing the sum of the weights of the concepts present in the profile.
    The classes are then sorted by their weights: all documents in a class are given the weight of
the class (the documents are not sorted inside the classes). The list of the first 1000 documents
from the best classes is then built and used for the evaluation.
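
    A sketch of this weighting and sorting step, assuming each class is given as a (concept profile,
list of document ids) pair and that the pseudo-idf is computed from the profiles and class sizes
only, as described above:

    import math

    def rank_classes(classes, n_concepts):
        # classes: list of (concept_profile, doc_ids); profiles are 0/1 tuples
        n_docs = sum(len(doc_ids) for _, doc_ids in classes)

        # pseudo document frequency of each concept over the merged result set
        df = [sum(len(doc_ids) for profile, doc_ids in classes if profile[i])
              for i in range(n_concepts)]
        idf = [math.log(n_docs / d) if d else 0.0 for d in df]

        # weight of a class = sum of the idf of the concepts present in its profile
        def weight(profile):
            return sum(w for bit, w in zip(profile, idf) if bit)

        ranked = sorted(classes, key=lambda c: weight(c[0]), reverse=True)
        # every document inherits the weight of its class
        return [(doc, weight(profile)) for profile, doc_ids in ranked for doc in doc_ids]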
    We used this simple weighting procedure, but a lot of other procedures can be imagined.
We plan to test several more sophisticated weighting schemes, using in particular the document
frequency of the search terms and the confidence weight in the expansion associations.
7    Results
Figure 4 shows the precision-recall curves of our five runs. There are few differences between them,
except for the Spanish run, which is slightly better than the others. One probable reason is that
the number of relevant documents for Spanish (2,368) is much larger than in the other languages
(1,825 in German, 1,006 in English and 946 in French).

                  [Figure 4 (plot): precision-recall curves (recall on the x-axis, precision on the y-axis)
                  for the five runs lic2mde1, lic2men1, lic2mes1, lic2mfr1 and lic2mfr2.]

                                          Figure 4: Results of the 5 runs

   Table 2 details the numbers and percentages of relevant documents found in each language for
each run. The percentages are computed over all the relevant documents for a given language (cf.
the previous paragraph for these numbers).

                                      number (%) of relevant documents found
                      interro. lang.        fre          eng          spa           ger
                        fre (run 1)     409 (43%)    310 (31%)     204 (9%)     620 (34%)
                        fre (run 2)     631 (67%)    154 (15%)    201 (8.5%)    565 (31%)
                            eng         354 (37%)    546 (54%)    252 (23%)     274 (15%)
                            spa         257 (27%)    303 (30%)    948 (40%)     265 (15%)
                            ger         333 (37%)    139 (14%)     115 (5%)    1,060 (58%)


Table 2: Percentages of relevant documents of each language found depending on the interrogation
language

     As expected, our system exhibits its best results for monolingual retrieval. It shows that we
still have work to do on our reformulation process and its resources. Concerning the resources, we
see that the French-Spanish dictionary, used to reformulate between French and Spanish but also
as a pivot language for Spanish to German and German to Spanish, needs particular attention as
the number of relevant documents retrieved when using it is very low.
     Concerning monolingual reformulation, we submitted two runs for French topics: one with only
semantic expansion of queries and the other one with semantic and topical expansion of queries.
The results for these two runs were globally very similar: in the first case, 1543 relevant documents
were returned with a R-precision of 0.1425 while in the second case, 1551 relevant documents were
returned with a R-precision of 0.1438. However, the two sets of relevant documents are not
identical since their intersection only contains 1305 documents. Topical expansion brings new
relevant documents but also discards relevant documents brought by semantic expansion in the
same proportion. More precisely, Table 2 shows that it favors monolingual retrieval while it has a
negative impact on crosslingual retrieval.
    Considering this observation, we tested whether a combination of the two runs could improve
the results. We adopted a basic strategy: the two lists of documents were interleaved in decreasing
order of their score; only one occurrence of each document was kept and the resulting list was
cut at 1000 documents. 1730 relevant documents were returned with a R-precision of 0.1443.
From the viewpoint of the number of relevant documents, the benefit is low (+179 documents,
i.e. an increase of 11.6%) but not insignificant. Moreover, most of the relevant documents that
could be caught by our combination of runs were actually caught. However, from the viewpoint
of R-precision, the benefit is not significant, which probably means that most of the new relevant
documents appear far down the ranked list.
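
    A sketch of this combination strategy, assuming each run is a list of (document, score) pairs
already sorted by decreasing score:

    def combine_runs(run_a, run_b, cutoff=1000):
        # interleave by decreasing score, keep the first occurrence of each
        # document, and cut the resulting list at `cutoff` documents
        merged = sorted(run_a + run_b, key=lambda pair: pair[1], reverse=True)
        seen, combined = set(), []
        for doc, score in merged:
            if doc not in seen:
                seen.add(doc)
                combined.append((doc, score))
            if len(combined) == cutoff:
                break
        return combined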


8    Conclusion and Future Work
Despite the numerically modest results of our system, our participation in CLEF 2003 is an encour-
agement for the future. Indeed, in a short period we implemented a tokenizer, a morphosyntactic
analyzer, a named entity recognizer, an indexer and a search engine. All this work was done
with a multilingual methodology that required gathering several monolingual and bilingual
resources. However, our system still requires more work on both the technology and the resources.
     Some work has already been done: we have implemented a new method to store our dictionaries.
A first version of the morphosyntactic tagger works but will need to be replaced by a more speed-
efficient one. Other work has started: we are currently working on the syntactic analyzer
that will extract nominal and verbal chunks and find dependencies inside the chunks and between
them. This last part will make it possible to find compound nouns with better precision than our
current method. These compound nouns are a fundamental part of our search algorithm and their
current absence probably accounts for a large part of our currently poor results. They will also
allow a better multilingual search by reducing the noise introduced by word-by-word translation.
In the next year, we will also extend our language coverage to three new languages and introduce
monolingual and crosslingual relevance feedback.


References
[1] The Lemur Toolkit for language modeling and information retrieval.
    http://www-2.cs.cmu.edu/~lemur/.
[2] Olivier Ferret. Filtrage thématique d’un réseau de collocations. In TALN 2003, pages 347–352,
    2003.
[3] Olivier Ferret and Brigitte Grau. A bootstrapping approach for robust topic analysis. Natural
    Language Engineering, 8(2/3):209–233, 2002.
[4] Christian Fluhr, Patrick Mordini, André Moulin, and Erwin Stegentritt. EMIR final report.
    Technical Report ESPRIT project 5312, DG III, Commission of the European Union, CEA,
    October 1994.
[5] Christian Fluhr, Dominique Schmit, Philippe Ortet, Faiza Elkateb, and Karine Gurtner.
    SPIRIT-W3, a distributed crosslingual indexing and retrieval engine. In INET'97, June 1997.
[6] M.A.K. Halliday and R. Hasan. Cohesion in English. Longman, London, 1976.
[7] Paul Ogilvie and James Callan. Experiments using the lemur toolkit. In Proceedings of TREC-
    2001, The Tenth Text REtrieval Conference, pages 103–108, 2001.
[8] Ellen M. Voorhees. Query expansion using lexical-semantic relations. In Proceedings of SIGIR-
    94, 17th ACM International Conference on Research and Development in Information Re-
    trieval, pages 61–69, Dublin, IE, 1994.