CLEF 2001 Experiments Using KCSL's Retrieval System

Ilia Kaufman and Meena Ghanekar
KCSL Inc., 5160 Yonge Street, Suite 1012
Toronto, ON, Canada M2N 6L9

Abstract

We entered the CLEF 2001 Evaluation Forum with our UniFind retrieval system, which was developed at KCSL to satisfy corporate information retrieval needs. We participated in the following three tasks:

• Multilingual Information Retrieval (EN=>FR, DE, ES, IT, EN)
• Bilingual Information Retrieval (FR=>EN)
• Monolingual Information Retrieval (Spanish collection)

This was the first time we entered TREC/CLEF experiments, and we used UniFind essentially without modification, except for the ranking of documents in the Multilingual task and for determining the relevancy cutoff points.

For the Multilingual task we first performed separate runs for each of the five languages: English, French, Italian, German and Spanish. The results of these five runs were then merged and re-ranked based on the similarity values obtained in the individual runs. UniFind's quantitative process of similarity ranking and of extracting relevant documents is identical for all languages. Therefore, since the number of documents processed in each of the five individual runs was substantial (the smallest was French, with more than 87,000 documents), we used the original similarity values obtained separately for each language at the merging and re-ranking stage.

With respect to the relevancy cutoff points, our system selects them automatically and usually succeeds in eliminating irrelevant and marginally relevant documents. Since CLEF expects, by default, 1000 documents in each result set, we decided to relax our cutoff strategy in order to return more documents from each run. Unfortunately, we did not relax it enough: most of our runs still returned well under 50 documents. This clearly contributed to lower recall scores, which we believe in turn lowered UniFind's overall scores.

For query translation we used commercial MT (Machine Translation) software from Lernout & Hauspie.

Our system analyzes the query and all documents in the corpus to determine word usage, word morphology, sentence boundaries, and a detailed topological structure that accounts for the distribution of query words and of the sentences containing those words and their derivatives.

In addition, a sentence analysis is performed to determine both a position-independent and a position-dependent score for each sentence in a document. This step not only helps to improve the accuracy, relevancy, and quality of the results, but also identifies the part of a document that best relates to the query. Thus, our algorithms comprise topological, statistical and linguistic analyses of queries and documents. They deal with concepts both as they are expressed in individual sentences and as they relate to the document as a whole. This process is conceptually identical for all sentence-based languages.

The test data supplied by CLEF 2001 contained 749,877 documents in five languages (we did not process the Dutch corpus) and occupied approximately 2 GB of disk space.

We submitted three sets of runs for each of the three tasks, constructing the queries automatically from the 50 topics in the selected language. Our three runs in each task used the title field only, the title and description fields, and the title, description and narrative fields.

We observed that in all three tasks the runs whose queries consisted of the title and description fields had the best performance.

All of our runs were executed on a Windows 2000 platform with a Pentium III CPU at 800 MHz, 1 GB of RAM and a 40 GB disk.
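To make the merging and re-ranking step concrete, the sketch below pools the per-language result lists and re-ranks them by their original similarity values, truncating at the 1000 documents CLEF expects per topic. It is only an illustration: the run format, the function names and the example scores are our own assumptions, not UniFind internals.

```python
# Illustrative raw-score merging of per-language runs.  The Run format,
# names and example values are assumptions for this sketch only.
from typing import Dict, List, Tuple

Run = List[Tuple[str, float]]  # (document id, similarity value)

def merge_runs(runs_by_language: Dict[str, Run], limit: int = 1000) -> Run:
    """Pool every per-language result list and re-rank by the original scores."""
    pooled: Run = []
    for run in runs_by_language.values():
        pooled.extend(run)
    pooled.sort(key=lambda pair: pair[1], reverse=True)
    return pooled[:limit]

# Toy example with made-up document ids and scores on a shared scale.
merged = merge_runs({
    "EN": [("en-001", 0.92), ("en-002", 0.41)],
    "FR": [("fr-010", 0.87)],
    "DE": [("de-100", 0.55), ("de-101", 0.12)],
})
print(merged[:3])  # [('en-001', 0.92), ('fr-010', 0.87), ('de-100', 0.55)]
```

Merging on raw scores in this way is only sound because the same ranking process is applied to every language, so the similarity values from the five runs are directly comparable.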
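The automatic relevancy cutoff is not described in detail above, so the following sketch only illustrates the general idea with an assumed score-gap heuristic, together with a relaxed variant that keeps more documents per run; none of the names, thresholds or formulas are taken from UniFind.

```python
# Assumed score-gap heuristic for an automatic relevancy cutoff, plus a
# relaxed variant.  Nothing here is taken from UniFind; it is a sketch.

def automatic_cutoff(scores, min_keep=10):
    """Keep results up to the sharpest drop between consecutive scores."""
    if len(scores) <= min_keep:
        return len(scores)
    cut, widest_gap = len(scores), 0.0
    for i in range(min_keep, len(scores)):
        gap = scores[i - 1] - scores[i]
        if gap > widest_gap:
            widest_gap, cut = gap, i
    return cut

def relaxed_cutoff(scores, floor=1000):
    """Relaxed strategy: never truncate below `floor` when results exist."""
    return max(automatic_cutoff(scores), min(floor, len(scores)))

scores = [0.95, 0.93, 0.90, 0.89, 0.88, 0.87, 0.86, 0.85, 0.84, 0.83, 0.30, 0.29]
print(automatic_cutoff(scores))  # 10 -- cut at the sharpest score drop
print(relaxed_cutoff(scores))    # 12 -- the relaxed cutoff keeps everything here
```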
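The sentence analysis can be pictured along the following lines: each sentence receives a position-independent score (here approximated by query-term overlap) and a position-dependent weight (here a simple preference for earlier sentences), and the highest-scoring sentence marks the most relevant part of the document. Both scoring choices are purely illustrative assumptions; the actual UniFind formulas are not published here.

```python
# Purely illustrative combination of a position-independent score
# (query-term overlap) and a position-dependent weight (earlier
# sentences weighted slightly higher).  Both formulas are assumptions.

def sentence_scores(query_terms, sentences):
    query = {t.lower() for t in query_terms}
    scored = []
    for position, sentence in enumerate(sentences):
        terms = {t.lower().strip(".,;:") for t in sentence.split()}
        content = len(query & terms) / max(len(query), 1)  # position independent
        weight = 1.0 / (1.0 + 0.1 * position)               # position dependent
        scored.append((content * weight, sentence))
    return scored

def best_passage(query_terms, sentences):
    """Return the sentence that best relates to the query."""
    return max(sentence_scores(query_terms, sentences), key=lambda s: s[0])[1]
```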
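Finally, the three query variants submitted for each task (title; title and description; title, description and narrative) can be built mechanically from a topic file. The sketch below assumes a simplified <title>/<desc>/<narr> markup; the real CLEF 2001 topic files use slightly different, language-specific tags, so the parsing code and the example topic are illustrative only.

```python
# Building the three query variants from a topic.  The <title>/<desc>/<narr>
# markup and the example topic below are simplified, made-up stand-ins for
# the actual CLEF 2001 topic format.
import re

def extract_field(topic_text: str, tag: str) -> str:
    match = re.search(rf"<{tag}>(.*?)(?=<|\Z)", topic_text, re.DOTALL)
    return match.group(1).strip() if match else ""

def build_queries(topic_text: str) -> dict:
    title = extract_field(topic_text, "title")
    desc = extract_field(topic_text, "desc")
    narr = extract_field(topic_text, "narr")
    return {
        "T": title,                               # title only
        "TD": f"{title} {desc}".strip(),          # title + description
        "TDN": f"{title} {desc} {narr}".strip(),  # title + description + narrative
    }

topic = """<top>
<num> 1
<title> Renewable energy programmes
<desc> Find reports on government programmes that promote renewable energy.
<narr> Relevant documents describe national or regional wind or solar initiatives.
</top>"""
print(build_queries(topic)["TD"])
```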