CLEF 2001 Experiments Using KCSL's Retrieval System

Ilia Kaufman and Meena Ghanekar
KCSL Inc., 5160 Yonge Street, Suite 1012
Toronto, ON, Canada M2N 6L9

Abstract

We entered the CLEF 2001 Evaluation Forum with our UniFind retrieval system, which was developed at KCSL to satisfy corporate information retrieval needs. We participated in the following three tasks:

• Multilingual Information Retrieval (EN=>FR, DE, ES, IT, EN)
• Bilingual Information Retrieval (FR=>EN)
• Monolingual Information Retrieval (Spanish collection)

This was the first time we entered TREC/CLEF experiments, and we used UniFind essentially without modification, except for the ranking of documents in the Multilingual task and for determining the relevancy cutoff points.

For the Multilingual task we first performed separate runs for each of the five languages: English, French, Italian, German and Spanish. The results of these five runs were then merged and re-ranked based on the similarity values obtained in the individual runs. UniFind's quantitative process of similarity ranking and of extracting relevant documents is identical for all languages. Therefore, since the number of documents processed in each of the five individual runs was substantial (the smallest was French, with more than 87,000 documents), we used the original similarity values obtained separately for each language at the merging and re-ranking stage.

With respect to the relevancy cutoff points, our system selects them automatically and usually succeeds in eliminating irrelevant and marginally relevant documents. Since CLEF expects, by default, 1000 documents in each result set, we decided to relax our cutoff strategy in order to return more documents from each run. Unfortunately, we did not relax it enough: most of our runs still returned well under 50 documents. This clearly contributed to lower recall scores, which we believe in turn lowered UniFind's overall scores.

For query translation we used commercial MT (Machine Translation) software from Lernout & Hauspie.

Our system analyzes the query and all documents in the corpus to determine word usage, word morphology, sentence boundaries, and a detailed topological structure that accounts for the distribution of query words and of the sentences containing those words and their derivatives.

In addition, a sentence analysis is performed to determine both a position-independent and a position-dependent score for each sentence in a document. This step not only helps to improve the accuracy, relevancy, and quality of the results, but also identifies the part of a document that best relates to the query. Thus, our algorithms comprise topological, statistical and linguistic analyses of queries and documents. They deal with concepts both as they are expressed in individual sentences and as they relate to the document as a whole. This process is conceptually identical for all sentence-based languages.

The test data supplied by CLEF 2001 contained 749,877 documents in five languages (we did not process the Dutch corpus) and occupied approximately 2 GB of disk space.

We submitted three sets of runs for each of the three tasks, constructing the queries automatically from the 50 topics in the selected language. Our three runs in each task used the title field only, the title and description fields, and the title, description and narrative fields.

We observed that in all three tasks the runs whose queries consisted of the title and description fields had the best performance.

All of our runs were executed on a Windows 2000 platform with a Pentium III CPU at 800 MHz, 1 GB of RAM and a 40 GB disk.
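To make the merging and re-ranking step concrete, the sketch below pools the per-language result lists and re-ranks them by their original similarity values, truncating at the 1000 documents CLEF expects per topic. It is only an illustration: the run format, the function names and the example scores are our own assumptions, not UniFind internals.

```python
# Illustrative raw-score merging of per-language runs.  The Run format,
# names and example values are assumptions for this sketch only.
from typing import Dict, List, Tuple

Run = List[Tuple[str, float]]  # (document id, similarity value)

def merge_runs(runs_by_language: Dict[str, Run], limit: int = 1000) -> Run:
    """Pool every per-language result list and re-rank by the original scores."""
    pooled: Run = []
    for run in runs_by_language.values():
        pooled.extend(run)
    pooled.sort(key=lambda pair: pair[1], reverse=True)
    return pooled[:limit]

# Toy example with made-up document ids and scores on a shared scale.
merged = merge_runs({
    "EN": [("en-001", 0.92), ("en-002", 0.41)],
    "FR": [("fr-010", 0.87)],
    "DE": [("de-100", 0.55), ("de-101", 0.12)],
})
print(merged[:3])  # [('en-001', 0.92), ('fr-010', 0.87), ('de-100', 0.55)]
```

Merging on raw scores in this way is only sound because the same ranking process is applied to every language, so the similarity values from the five runs are directly comparable.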
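The automatic relevancy cutoff is not described in detail above, so the following sketch only illustrates the general idea with an assumed score-gap heuristic, together with a relaxed variant that keeps more documents per run; none of the names, thresholds or formulas are taken from UniFind.

```python
# Assumed score-gap heuristic for an automatic relevancy cutoff, plus a
# relaxed variant.  Nothing here is taken from UniFind; it is a sketch.

def automatic_cutoff(scores, min_keep=10):
    """Keep results up to the sharpest drop between consecutive scores."""
    if len(scores) <= min_keep:
        return len(scores)
    cut, widest_gap = len(scores), 0.0
    for i in range(min_keep, len(scores)):
        gap = scores[i - 1] - scores[i]
        if gap > widest_gap:
            widest_gap, cut = gap, i
    return cut

def relaxed_cutoff(scores, floor=1000):
    """Relaxed strategy: never truncate below `floor` when results exist."""
    return max(automatic_cutoff(scores), min(floor, len(scores)))

scores = [0.95, 0.93, 0.90, 0.89, 0.88, 0.87, 0.86, 0.85, 0.84, 0.83, 0.30, 0.29]
print(automatic_cutoff(scores))  # 10 -- cut at the sharpest score drop
print(relaxed_cutoff(scores))    # 12 -- the relaxed cutoff keeps everything here
```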
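The sentence analysis can be pictured along the following lines: each sentence receives a position-independent score (here approximated by query-term overlap) and a position-dependent weight (here a simple preference for earlier sentences), and the highest-scoring sentence marks the most relevant part of the document. Both scoring choices are purely illustrative assumptions; the actual UniFind formulas are not published here.

```python
# Purely illustrative combination of a position-independent score
# (query-term overlap) and a position-dependent weight (earlier
# sentences weighted slightly higher).  Both formulas are assumptions.

def sentence_scores(query_terms, sentences):
    query = {t.lower() for t in query_terms}
    scored = []
    for position, sentence in enumerate(sentences):
        terms = {t.lower().strip(".,;:") for t in sentence.split()}
        content = len(query & terms) / max(len(query), 1)  # position independent
        weight = 1.0 / (1.0 + 0.1 * position)               # position dependent
        scored.append((content * weight, sentence))
    return scored

def best_passage(query_terms, sentences):
    """Return the sentence that best relates to the query."""
    return max(sentence_scores(query_terms, sentences), key=lambda s: s[0])[1]
```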
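Finally, the three query variants submitted for each task (title; title and description; title, description and narrative) can be built mechanically from a topic file. The sketch below assumes a simplified <title>/<desc>/<narr> markup; the real CLEF 2001 topic files use slightly different, language-specific tags, so the parsing code and the example topic are illustrative only.

```python
# Building the three query variants from a topic.  The <title>/<desc>/<narr>
# markup and the example topic below are simplified, made-up stand-ins for
# the actual CLEF 2001 topic format.
import re

def extract_field(topic_text: str, tag: str) -> str:
    match = re.search(rf"<{tag}>(.*?)(?=<|\Z)", topic_text, re.DOTALL)
    return match.group(1).strip() if match else ""

def build_queries(topic_text: str) -> dict:
    title = extract_field(topic_text, "title")
    desc = extract_field(topic_text, "desc")
    narr = extract_field(topic_text, "narr")
    return {
        "T": title,                               # title only
        "TD": f"{title} {desc}".strip(),          # title + description
        "TDN": f"{title} {desc} {narr}".strip(),  # title + description + narrative
    }

topic = """<top>
<num> 1
<title> Renewable energy programmes
<desc> Find reports on government programmes that promote renewable energy.
<narr> Relevant documents describe national or regional wind or solar initiatives.
</top>"""
print(build_queries(topic)["TD"])
```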