<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Ilia Kaufman and Meena Ghanekar KCSL Inc.</institution>
          ,
          <addr-line>5160 Yonge Street, Suite 1012 Toronto, ON</addr-line>
          ,
          <country country="CA">Canada</country>
          <addr-line>M2N 6L9</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Multilingual Information Retrieval (EN=&gt;FR, DE, ES, IT, EN) Bilingual Information Retrieval (FR=&gt;EN) Monolingual Information Retrieval</institution>
          ,
          <addr-line>Spanish collection</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2001</year>
      </pub-date>
      <abstract>
        <p>We entered CLEF 2001 Evaluation Forum with our UniFind retrieval system that was developed at KCSL to satisfy corporate information retrieval needs. We participated in the following three tasks:</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>•
This is the first time we entered TREC/CLEF
experiments and we used UniFind essentially
without any modifications, except for the ranking
of documents in the Multilingual tasks and for
determining the relevancy cutoff points.</p>
      <p>For the Multilingual tasks we first performed
separate runs for each of the five languages,
namely, English, French, Italian, German and
Spanish. The results of these five runs were then
merged and re-ranked based on the similarity
values obtained in the individual runs. UniFind’s
quantitative process of similarity ranking and of
extracting relevant documents is identical for all
languages. Therefore, since the number of
documents processed in each of the five individual
runs was substantial (the smallest was French with
more than 87,000 documents), at the merging and
re-ranking stage, we used the original similarity
values obtained separately for each language.
With respect to the relevancy cutoff points, our
system selects them automatically and usually
succeeds in eliminating irrelevant and marginally
relevant documents. Since CLEF expects, by
default, 1000 documents in each set, we decided to
relax our cutoff strategy in order to return more
documents from each run. Unfortunately, we
didn’t do it quite right, for most of our runs still
returned well under 50 documents. This clearly
contributed to lower Recall scores and we believe
that in turn it resulted in lower overall scores for
UniFind.</p>
      <p>For query translation we used commercial MT
(Machine Translation) software from Lernout and
Hauspie.</p>
      <p>Our system analyzes the query and all documents
in the corpus to determine word usage, word
morphology, sentence boundaries and a very
detailed topological structure that accounts for the
distribution of query words and sentences
containing these words and their derivatives.
In addition, a sentence analysis is performed to
determine both a position independent and a
position dependent score for each sentence in a
document. This step not only helps to improve the
accuracy, relevancy, and quality of the results but
also determines the most relevant part of a
document that best relates to the query. Thus, our
algorithms comprise Topological, Statistical and
Linguistic analyses of queries and documents. The
algorithms tend to relate to concepts as they are
expressed in sentences, as well as, how they relate
to a document as a whole. This process is
conceptually identical for all sentence-based
languages.</p>
      <p>The test data as supplied by CLEF 2001, contained
749,877 documents, in five languages (we didn’t
process the Dutch corpus) and occupied
approximately 2GB of disk space.</p>
      <p>We submitted three sets of runs for each of the
three tasks by automatically constructing the
queries from the 50 topics in the selected language.
Our three runs in each task were: title field only,
title and description fields, and title with
description and narrative fields.</p>
      <p>We observed that in all three tasks the sets with
queries consisting of title and description fields
seem to have the best performance.</p>
      <p>All of our runs were executed on Windows 2000
platform running on Pentium III CPU at 800 MHz
with 1GB of RAM and 40 GB disk.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>