<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>University of Ottawa's participation in the CL-SR task at CLEF 2006</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Muath Alzghool</string-name>
          <email>alzghool@site.uottawa.ca</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Diana Inkpen</string-name>
          <email>diana@site.uottawa.ca</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Information Technology and Engineering, University of Ottawa</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents the second participation of the University of Ottawa group in CLEF, in the Cross-Language Speech Retrieval (CL-SR) task. We present the results of the submitted runs for the English collection and, very briefly, for the Czech collection, followed by many additional experiments. We used two Information Retrieval systems in our experiments: SMART and Terrier were tested with many different weighting schemes for indexing the documents and the queries, and with several query expansion techniques (including a new method based on log-likelihood scores for collocations). Our experiments showed that query expansion methods do not help much for this collection. We tested whether the new Automatic Speech Recognition (ASR) transcripts improve the retrieval results; we also tested combinations of different automatic transcripts (with different estimated word error rates). The retrieval results did not improve, probably because the speech recognition errors occurred in words that are important for retrieval, even in the newer ASR2006 transcripts. By using different system settings, we improved on our submitted result for the required run (English queries, title and description) on automatic transcripts plus automatic keywords. We present cross-language experiments, where the queries are automatically translated by combining the results of several online machine translation tools. Our experiments showed that high-quality automatic translations (for French) led to results comparable with monolingual English, while the performance decreased for the other languages. Experiments on indexing the manual summaries and keywords gave the best retrieval results.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        This paper presents the second participation of the University of Ottawa group in CLEF, in the Cross-Language
Speech Retrieval (CL-SR) track. We briefly describe the task [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Then, we present our systems, followed by
results for the submitted runs for the English collection and very briefly for the Czech collection. We present
results for many additional runs for the English collection. We experiment with many possible weighting
schemes for indexing the documents and the queries, and with several query expansion techniques. We test with
different speech recognition transcripts to see if the word error rate has an impact on the retrieval performance.
We describe cross-language experiments, where the queries are automatically translated from French, Spanish,
German and Czech into English, by combining the results of several online machine translation (MT) tools. Finally,
we present the best results, obtained when the manual summaries and keywords were indexed.
      </p>
      <p>The CLEF-2006 CL-SR collection includes 8104 English segments and 105 topics (queries). Relevance
judgments were provided for 63 training topics, and later for 33 test topics. In each document (segment), there
are six fields that can be used for the official runs: ASRTEXT2003A, ASRTEXT2004A, ASRTEXT2006A,
ASRTEXT2006B, AUTOKEYWORD2004A1, and AUTOKEYWORD2004A2. The first four fields are
transcripts produced using Automatic Speech Recognition (ASR) systems developed by the IBM T. J. Watson
Research Center in 2003, 2004, and 2006, with estimated mean word error rates of
44%, 38%, and 25%, respectively.</p>
      <p>Among the 8104 segments covered by the test collection, only 7377 segments have the ASRTEXT2006A
field. The content of the ASRTEXT2006B field is identical to the ASRTEXT2006A field if the 2006 system
produced ASR output for the segment, and identical to the ASRTEXT2004A field otherwise. Moreover, just 7034
segments have the ASRTEXT2003A field. The AUTOKEYWORD2004A1 and AUTOKEYWORD2004A2 fields
contain sets of thesaurus keywords that were assigned automatically by two different k-Nearest Neighbor (kNN)
classifiers, using only words from the ASRTEXT2004A field of the segment. Among the 8104 segments covered
by the test collection, 8071 and 8090 segments have the AUTOKEYWORD2004A1 and
AUTOKEYWORD2004A2 fields, respectively.</p>
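      <p>The fallback rule for ASRTEXT2006B, and the way several fields can be joined into one text for indexing, can be sketched as follows. This is only an illustration: the dictionary representation of a segment and the helper names asrtext_2006b and index_text are assumptions, not part of the collection or of any toolkit.</p>
      <preformat><![CDATA[
```python
from typing import Dict, List, Optional

# Hypothetical in-memory view of one segment: field name -> text (or None).
Segment = Dict[str, Optional[str]]


def asrtext_2006b(segment: Segment) -> str:
    """Use the 2006 transcript when it exists, else fall back to the 2004 one."""
    text_2006 = segment.get("ASRTEXT2006A")
    if text_2006:
        return text_2006
    return segment.get("ASRTEXT2004A") or ""


def index_text(segment: Segment, fields: List[str]) -> str:
    """Join the non-empty fields selected for a run into one indexable string."""
    return " ".join(segment[f] for f in fields if segment.get(f))


# Made-up segment that has no 2006 ASR output.
seg = {
    "ASRTEXT2004A": "older transcript",
    "ASRTEXT2006A": None,
    "AUTOKEYWORD2004A1": "kw1",
}
print(asrtext_2006b(seg))
print(index_text(seg, ["ASRTEXT2004A", "AUTOKEYWORD2004A1", "AUTOKEYWORD2004A2"]))
```
]]></preformat>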
      <p>There is also a Czech collection for this year’s CL-SR track; the document collection consists of ASR
transcripts for 354 interviews in Czech, together with some manually assigned metadata and some automatically
generated metadata, and 115 search topics in two languages (Czech and English). The task for this collection is
to return a ranked list of time stamps marking the beginning of sections that are relevant to a topic.</p>
    </sec>
    <sec id="sec-2">
      <title>2 System Overview</title>
      <p>
        The University of Ottawa Cross-Language Information Retrieval (IR) systems were built with off-the-shelf
components. For translating the queries from French, Spanish, German, and Czech into English, several free
online machine translation tools were used. Their output was merged in order to allow for variety in lexical
choices. All the translations of a title made the title of the translated query; the same was done for the description
and narrative fields. For the retrieval part, the SMART [
        <xref ref-type="bibr" rid="ref2 ref9">2,9</xref>
        ] IR system and the Terrier [
        <xref ref-type="bibr" rid="ref1 ref6">1,6</xref>
        ] IR system were
tested with many different weighting schemes for indexing the collection and the queries.
      </p>
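      <p>As a minimal sketch of this merging step, the translations can be concatenated field by field. The function name combine_translations, the dictionary layout of a topic, and the sample strings are illustrative assumptions; the actual scripts used for the experiments are not described here.</p>
      <preformat><![CDATA[
```python
from typing import Dict, List

# Topic fields named in the text.
FIELDS = ("title", "description", "narrative")


def combine_translations(translations: List[Dict[str, str]]) -> Dict[str, str]:
    """Concatenate, field by field, the English output of several MT tools."""
    combined = {}
    for field in FIELDS:
        # Every translation is kept, duplicates included, so a word chosen
        # by several tools is repeated in the merged field.
        parts = [t[field].strip() for t in translations if t.get(field)]
        combined[field] = " ".join(parts)
    return combined


# Made-up outputs standing in for two MT tools translating the same topic.
mt_outputs = [
    {"title": "child survivors in Sweden", "description": "stories of children"},
    {"title": "surviving children in Sweden", "description": "accounts of children"},
]
merged = combine_translations(mt_outputs)
print(merged["title"])
```
]]></preformat>
      <p>Because repeated words are kept, a term that several tools agree on gets a higher term frequency in the merged query, while rarer lexical choices are still present.</p>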
      <p>For translating the topics into English we used several online MT tools. The idea behind using multiple
translations is that they might provide more variety of words and phrases, therefore improving the retrieval
performance. The seven online MT systems that we used for translating from Spanish, French, and German were:
1. http://www.google.com/language_tools?hl=en
2. http://www.babelfish.altavista.com
3. http://freetranslation.com
4. http://www.wordlingo.com/en/products_services/wordlingo_translator.html
5. http://www.systranet.com/systran/net
6. http://www.online-translator.com/srvurl.asp?lang=en
7. http://www.freetranslation.paralink.com</p>
      <p>For translating the Czech-language topics into English, we were able to find only one online MT system:
http://intertran.tranexp.com/Translate/result.shtml.</p>
      <p>We combined the outputs of the MT systems by simply concatenating all the translations. All seven
translations of a title made the title of the translated query; the same was done for the description and narrative fields.
We used the combined topics for all the cross-language experiments reported in this paper.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3 Retrieval</title>
      <p>
        We used two systems in our participation: SMART and Terrier. SMART was originally developed at Cornell
University in the 1960s. SMART is based on the vector space model of information retrieval [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. It generates
weighted term vectors for the document collection. SMART preprocesses the documents by tokenizing the text
into words, removing common words that appear on its stop-list, and performing stemming on the remaining
words to derive a set of terms. When the IR server executes a user query, the query terms are also converted into
weighted term vectors. Vector inner-product similarity computation is then used to rank documents in
decreasing order of their similarity to the user query. The newest version of SMART (version 11) offers many
state-of-the-art options for weighting the terms in the vectors. Each term-weighting scheme is described as a combination
of term frequency, collection frequency, and length normalization components [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        In this paper we employ the notation used in SMART to describe the combined schemes: xxx.xxx. The first
three characters refer to the weighting scheme used to index the document collection and the last three
characters refer to the weighting scheme used to index the query fields. In SMART, we used mainly the lnn.ntn
weighting scheme, which performed very well in the CL-SR task at CLEF 2005 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]; lnn.ntn means that lnn was used for
documents and ntn for queries, according to the following formulas:
        <disp-formula><tex-math>weight_{lnn} = \ln(tf) + 1.0</tex-math></disp-formula>
        <disp-formula><tex-math>weight_{ntn} = tf \times \log \frac{N}{n_t}</tex-math></disp-formula>
      </p>
      <p>where tf denotes the term frequency of a term t in the document or query, N denotes the number of documents
in the collection, and n<sub>t</sub> denotes the number of documents in which the term t occurs.</p>
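      <p>The lnn.ntn scheme and the inner-product ranking can be illustrated with a small sketch. It is not SMART's implementation: the function names and the toy documents are made up, stop-word removal and stemming are omitted, and the natural logarithm is assumed for the idf factor because the base is not stated.</p>
      <preformat><![CDATA[
```python
import math
from collections import Counter


def lnn_weights(tokens):
    """Document side (lnn): ln(tf) + 1.0, no idf factor, no normalization."""
    return {term: math.log(tf) + 1.0 for term, tf in Counter(tokens).items()}


def ntn_weights(tokens, doc_freq, n_docs):
    """Query side (ntn): tf * log(N / n_t), no normalization."""
    weights = {}
    for term, tf in Counter(tokens).items():
        n_t = doc_freq.get(term, 0)
        if n_t > 0:  # a term absent from the collection cannot match anything
            weights[term] = tf * math.log(n_docs / n_t)
    return weights


def inner_product(doc_weights, query_weights):
    """Similarity is the sum of weight products over the shared terms."""
    return sum(w * doc_weights[t] for t, w in query_weights.items() if t in doc_weights)


# Toy collection of three already-tokenized "segments" (made-up data).
docs = [["camp", "liberation", "camp"], ["school", "camp"], ["liberation", "army"]]
doc_freq = Counter(term for doc in docs for term in set(doc))
query = ntn_weights(["liberation", "army"], doc_freq, len(docs))
scores = [inner_product(lnn_weights(doc), query) for doc in docs]
ranking = sorted(range(len(docs)), key=lambda i: -scores[i])
print(ranking, [round(s, 4) for s in scores])
```
]]></preformat>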
      <p>
        We have also used a query expansion mechanism with SMART, which follows the idea of extracting related
words for each word in the topics using the Ngram Statistics Package (NSP) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. We extracted the top 6412 pairs
of related words based on log likelihood ratios (high collocation scores in the corpus of ASR transcripts), using
a window size of 10 words. We chose log-likelihood scores because they are known to work well even when the
text corpus is small. For each word in the topics, we added the related words according to this list. We call this
approach to relevance feedback SMARTnsp.
      </p>
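      <p>The following sketch shows one standard way to compute a log-likelihood ratio for a word pair from a 2x2 contingency table, and how a list of related pairs could be used to expand a query. It is not the NSP code itself: the function names, the count arguments, and the choice to treat pairs as symmetric are assumptions made for illustration.</p>
      <preformat><![CDATA[
```python
import math


def log_likelihood_ratio(n11, n1p, np1, npp):
    """G^2 statistic for a word pair.

    n11: co-occurrences of the pair, n1p: occurrences of the first word,
    np1: occurrences of the second word, npp: total number of pairs counted.
    """
    observed = [n11, n1p - n11, np1 - n11, npp - n1p - np1 + n11]
    expected = [
        n1p * np1 / npp,
        n1p * (npp - np1) / npp,
        (npp - n1p) * np1 / npp,
        (npp - n1p) * (npp - np1) / npp,
    ]
    return 2.0 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


def expand_query(query_terms, related_pairs):
    """Append to the query every word paired with one of its terms."""
    related = {}
    for a, b in related_pairs:
        related.setdefault(a, []).append(b)
        related.setdefault(b, []).append(a)
    expanded = list(query_terms)
    for term in query_terms:
        for other in related.get(term, []):
            if other not in expanded:
                expanded.append(other)
    return expanded


# Made-up pairs standing in for the top-scoring collocations.
pairs = [("ghetto", "warsaw"), ("jewish", "police")]
print(expand_query(["ghetto", "police"], pairs))
print(round(log_likelihood_ratio(50, 100, 100, 1000), 2))
```
]]></preformat>
      <p>The statistic is zero when the two words co-occur exactly as often as independence predicts, and it grows as the observed counts move away from that expectation.</p>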
      <p>
        Terrier was originally developed at the University of Glasgow. It is based on Divergence from Randomness
(DFR) models, where IR is seen as a probabilistic process [
        <xref ref-type="bibr" rid="ref1 ref6">1, 6</xref>
        ]. We experimented with the In(exp)C2
weighting model, one of Terrier’s DFR-based document weighting models. Using the In(exp)C2 model, the relevance
score of a document d for a query q is given by the formula:
        <disp-formula><tex-math>sim(d, q) = \sum_{t \in q} qtf \cdot w(t, d)</tex-math></disp-formula>
      </p>
      <p>where
- qtf is the frequency of term t in the query q,
- w(t,d) is the relevance score of a document d for the query term t, given by:
        <disp-formula><tex-math>w(t, d) = \frac{F + 1}{n_t \times (tfn_e + 1)} \times \left( tfn_e \times \log_2 \frac{N + 1}{n_e + 0.5} \right)</tex-math></disp-formula>
      </p>
      <p>where
- F is the term frequency of t in the whole collection.
- N is the number of documents in the whole collection.
- n<sub>t</sub> is the document frequency of t.
- n<sub>e</sub> is given by:
        <disp-formula><tex-math>n_e = N \times \left( 1 - \left( 1 - \frac{n_t}{N} \right)^{F} \right)</tex-math></disp-formula>
- tfn<sub>e</sub> is the normalized within-document frequency of the term t in the document d. It is given by the
normalization 2 [
        <xref ref-type="bibr" rid="ref1 ref3">1, 3</xref>
        ]:
        <disp-formula><tex-math>tfn_e = tf \times \log_e \left( 1 + c \times \frac{avg\_l}{l} \right)</tex-math></disp-formula>
      </p>
      <p>where
- c is a parameter; for the submitted run, we fixed this parameter to 1.
- tf is the within-document frequency of the term t in the document d.
- l is the document length and avg_l is the average document length in the whole collection.</p>
      <p>We estimated the parameter c of the normalization 2 formula by running experiments on the training
data, to find the best value of c for each combination of topic fields. We obtained the following values: c=0.75
for queries using the Title only, c=1 for queries using the Title and Description fields, and c=1 for queries using
the Title, Description, and Narrative fields. In each case we selected the c value with the best MAP score on the
training data.</p>
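      <p>To make the role of c and of the collection statistics concrete, the In(exp)C2 score can be sketched as below, following the formulas of this section as written. This is not Terrier's code: the function names, the argument layout, and the sample statistics (other than the collection size of 8104 segments) are illustrative assumptions.</p>
      <preformat><![CDATA[
```python
import math


def tfn_e(tf, doc_len, avg_doc_len, c=1.0):
    """Normalization 2: tf * ln(1 + c * avg_l / l)."""
    return tf * math.log(1.0 + c * avg_doc_len / doc_len)


def expected_doc_count(n_docs, n_t, coll_freq):
    """n_e = N * (1 - (1 - n_t / N) ** F), as given in the text."""
    return n_docs * (1.0 - (1.0 - n_t / n_docs) ** coll_freq)


def in_exp_c2_weight(tf, doc_len, avg_doc_len, n_docs, n_t, coll_freq, c=1.0):
    """w(t, d) for one query term that occurs tf times in the document."""
    tfn = tfn_e(tf, doc_len, avg_doc_len, c)
    n_e = expected_doc_count(n_docs, n_t, coll_freq)
    gain = (coll_freq + 1.0) / (n_t * (tfn + 1.0))
    return gain * tfn * math.log2((n_docs + 1.0) / (n_e + 0.5))


def score(query_tf, doc_tf, doc_len, avg_doc_len, n_docs, doc_freq, coll_freq, c=1.0):
    """sim(d, q): sum over query terms of qtf * w(t, d)."""
    total = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue  # terms missing from the document contribute nothing
        total += qtf * in_exp_c2_weight(
            tf, doc_len, avg_doc_len, n_docs, doc_freq[term], coll_freq[term], c
        )
    return total


# Made-up statistics for a single query term in a collection of 8104 segments.
s = score({"camp": 1}, {"camp": 3}, doc_len=400, avg_doc_len=500,
          n_docs=8104, doc_freq={"camp": 50}, coll_freq={"camp": 120}, c=1.0)
print(round(s, 4))
```
]]></preformat>
      <p>A larger c increases the normalized frequency for every document length, which is why its best value can depend on how long the queries are (T, TD, or TDN).</p>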
      <p>
        We have also used a query expansion mechanism in Terrier, which follows the idea of measuring divergence
from randomness. In our experiments, we applied the Kullback-Leibler (KL) model for query expansion [
        <xref ref-type="bibr" rid="ref10 ref4">4, 10</xref>
        ].
It is one of the Terrier DFR-based term weighting models. Using the KL model, the weight of a term t in the
top-ranked documents is given by:
        <disp-formula><tex-math>w(t) = P_x \times \log_2 \frac{P_x}{P_c}</tex-math></disp-formula>
where
        <disp-formula><tex-math>P_x = \frac{tf_x}{l_x} \quad \mathrm{and} \quad P_c = \frac{F}{token_c}</tex-math></disp-formula>
      </p>
      <p>- tf<sub>x</sub> is the frequency of the query term in the top-ranked documents.
- l<sub>x</sub> is the sum of the lengths of the top-ranked documents.
- F is the term frequency of the query term in the whole collection.
- token<sub>c</sub> is the total number of tokens in the whole collection.</p>
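      <p>A minimal sketch of the KL term weight, and of picking expansion terms from the top-ranked documents, is given below. The function names, the number of terms kept, and the toy data are assumptions; Terrier's own query expansion has further settings that are not modelled here.</p>
      <preformat><![CDATA[
```python
import math
from collections import Counter


def kl_weight(tf_x, l_x, coll_freq, token_c):
    """w(t) = P_x * log2(P_x / P_c), with P_x = tf_x / l_x and P_c = F / token_c."""
    p_x = tf_x / l_x
    p_c = coll_freq / token_c
    if p_x == 0.0:
        return 0.0
    return p_x * math.log2(p_x / p_c)


def top_expansion_terms(top_docs, coll_freq, token_c, k):
    """Rank the terms of the top-ranked documents by KL weight, keep the best k."""
    counts = Counter(tok for doc in top_docs for tok in doc)
    l_x = sum(len(doc) for doc in top_docs)
    scored = {t: kl_weight(tf, l_x, coll_freq[t], token_c) for t, tf in counts.items()}
    return sorted(scored, key=lambda t: (-scored[t], t))[:k]


# Made-up top-ranked documents and collection statistics.
top_docs = [["camp", "camp", "the"], ["camp", "the"]]
coll_freq = {"camp": 30, "the": 5000}
print(top_expansion_terms(top_docs, coll_freq, token_c=10000, k=1))
```
]]></preformat>
      <p>A term that is no more frequent in the top-ranked documents than in the whole collection gets a weight of zero or less, so very common words are unlikely to be selected.</p>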
    </sec>
    <sec id="sec-4">
      <title>4 Experimental Results</title>
      <sec id="sec-4-1">
        <title>4.1 Submitted runs</title>
        <p>In the rest of the paper we focus only on the English CL-SR collection.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2 Comparison of systems and query expansion methods</title>
        <p>Table 3 presents results for the best weighting schemes: for SMART we chose lnn.ntn and for Terrier we chose
the In(exp)C2 weighting model, because they achieved the best results on the training data. We present results
with and without relevance feedback.</p>
        <p>According to Table 3, we note that:
• Relevance feedback helps to improve the retrieval results in Terrier for TDN, TD, and T for the training
data; the improvement was high for TD and T, but not for TDN. For the test data there is a small
improvement.
• NSP relevance feedback with SMART does not help to improve the retrieval for the training data (except
for TDN), but it helps for the test data (small improvement).
• SMART results are better than Terrier results for the test data, but not for the training data.</p>
        <p>In order to find the best ASR transcripts to use for indexing the segments, we compared the retrieval results
when using the ASR transcripts from the years 2003, 2004, and 2006 or combinations. We also wanted to find
out if adding the automatic keywords helps to improve the retrieval results. The results of the experiments using
Terrier and SMART are shown in Table 4 and Table 5, respectively.</p>
        <p>We note from the experimental results that:
• Using Terrier, the best field is ASRTEXT2006B, which contains 7377 transcripts produced by the ASR
system in 2006 and 727 transcripts produced by the ASR system in 2004; this improvement over using
only the ASRTEXT2004A field is very. On the other hand, the best ASR field using SMART is
ASRTEXT2004A.
• No combination of two ASRTEXT fields helps to improve the retrieval.
• Using Terrier, adding the automatic keywords to ASRTEXT2004A improved the retrieval for the
training data but not for the test data. For SMART, it helps for both the training and the test data.
• In general, adding the automatic keywords helps. Adding them to ASRTEXT2003A or ASRTEXT2006B
improved the retrieval results for the training and test data.
• For the required submission run (English TD), the maximum MAP score on the training data was obtained
by the combination of ASRTEXT 2004A and 2006A plus autokeywords, using Terrier (0.0952) or SMART (0.0932);
on the test data, the combination of ASRTEXT 2004A and autokeywords using SMART
obtained the highest value, 0.0725, higher than the value we report in Table 1 for the submitted run.</p>
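        <p>The MAP values quoted here are mean average precision scores. As a reminder of what is being compared, a minimal sketch of the computation is given below; the function names and the toy rankings are made up, and details of the official evaluation tool (for example how topics without relevant segments are handled) are not reproduced.</p>
        <preformat><![CDATA[
```python
def average_precision(ranked_ids, relevant_ids):
    """Mean of the precision values at the rank of each relevant item retrieved."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant items that were never retrieved count as zero precision.
    return precision_sum / len(relevant)


def mean_average_precision(runs, qrels):
    """Average the per-topic average precision over all judged topics."""
    topics = list(qrels)
    return sum(average_precision(runs.get(t, []), qrels[t]) for t in topics) / len(topics)


# Made-up rankings and relevance judgments for two topics.
runs = {"t1": ["d1", "d2", "d3", "d4"], "t2": ["d5", "d6"]}
qrels = {"t1": {"d1", "d3"}, "t2": {"d6"}}
print(round(mean_average_precision(runs, qrels), 4))
```
]]></preformat>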
        <sec id="sec-4-2-1">
          <title>Segment fields</title>
          <p>ASRTEXT 2003A
ASRTEXT 2004A
ASRTEXT 2006A
ASRTEXT 2006B
ASRTEXT 2003A + 2004A
ASRTEXT 2004A + 2006A
ASRTEXT 2004A + 2006B
ASRTEXT 2003A + AUTOKEYWORD2004A1,A2
ASRTEXT 2004A + AUTOKEYWORD2004A1,A2
ASRTEXT 2006B + AUTOKEYWORD2004A1,A2
ASRTEXT 2004A + 2006A + AUTOKEYWORD2004A1,A2
ASRTEXT 2004A + 2006B + AUTOKEYWORD2004A1,A2</p>
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>4.4 Cross-language experiments</title>
        <p>Table 6 presents results for the combined translation produced by the seven online MT tools, from French,
Spanish, and German into English, for comparison with monolingual English experiments (the first line in the table).
All the results in the table are from SMART using the lnn.ntn weighting scheme.</p>
        <p>
          Since the result of combined translation for each language was better than when using individual translations
from each MT tool on the CLEF 2005 CL-SR data [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], we used combined translations in our experiments.
        </p>
      </sec>
      <sec id="sec-4-4">
        <title>4.5 Manual summaries and keywords</title>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5 Conclusion</title>
      <p>We experimented with two different systems, Terrier and SMART, with various weighting schemes for indexing
the document and query terms. We proposed a new approach for query expansion that uses collocations with
high log-likelihood ratios. Used with SMART, the method obtained a small improvement on test data (probably
not significant). The KL relevance feedback method produced only small improvements with Terrier on test
data. So, query expansion methods do not seem to help for this collection.</p>
      <p>The lower mean word error rate of the newer ASR transcripts (ASRTEXT2006A relative to
ASRTEXT2004A) did not improve the retrieval results. Also, combining different ASR transcripts (with
different error rates) did not seem to help.</p>
      <p>For some experiments Terrier was better than SMART, for others it was not; therefore we cannot clearly
prefer one IR system over the other for this collection.</p>
      <p>The idea of using multiple translations proved to be good. More variety in the translations would be
beneficial. The online MT systems that we used are rule-based systems. Adding translations by statistical MT tools
might help, since they could produce radically different translations.</p>
      <p>On the manual data, the best MAP score we obtained is around 29%, for the English test topics. On
automatically transcribed data, the best result is around 7.6% MAP. Since the improvement in the ASR word error
rate does not improve the retrieval results, as shown by the experiments in section 4.3, we think that the
gap relative to the manual summaries arises because the summaries contain different words to
represent the content of the segments. In future work we plan to investigate methods of removing or correcting
some of the speech recognition errors in the ASR content and to use speech lattices for indexing.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>G.</given-names>
            <surname>Amati</surname>
          </string-name>
          and
          <string-name>
            <given-names>C. J.</given-names>
            <surname>van Rijsbergen</surname>
          </string-name>
          :
          <article-title>Probabilistic models of information retrieval based on measuring the divergence from randomness</article-title>
          .
          <source>ACM Transactions on Information Systems (TOIS)</source>
          ,
          <volume>20</volume>
          (
          <issue>4</issue>
          ):
          <fpage>357</fpage>
          -
          <lpage>389</lpage>
          ,
          <year>October 2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>C.</given-names>
            <surname>Buckley</surname>
          </string-name>
          , G. Salton, and
          <string-name>
            <given-names>J.</given-names>
            <surname>Allan</surname>
          </string-name>
          :
          <article-title>Automatic retrieval with locality information using SMART</article-title>
          .
          <source>In Proceedings of the First Text REtrieval Conference (TREC-1)</source>
          , pages
          <fpage>59</fpage>
          -
          <lpage>72</lpage>
          . NIST Special Publication 500-207,
          <year>March 1993</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>C.</given-names>
            <surname>Carpineto</surname>
          </string-name>
          , R. de Mori, G. Romano, and
          <string-name>
            <given-names>B.</given-names>
            <surname>Bigi</surname>
          </string-name>
          .
          <article-title>An information-theoretic approach to automatic query expansion</article-title>
          .
          <source>ACM Transactions on Information Systems (TOIS)</source>
          ,
          <volume>19</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>27</lpage>
          ,
          <year>January 2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>D.</given-names>
            <surname>Inkpen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Alzghool</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Islam</surname>
          </string-name>
          :
          <article-title>Using various indexing schemes and multiple translations in the CL-SR task at CLEF 2005</article-title>
          .
          <source>In Proceedings of CLEF 2005, Lecture Notes in Computer Science 4022</source>
          , Springer-Verlag,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>D. W.</given-names>
            <surname>Oard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Soergel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Doermann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. C.</given-names>
            <surname>Murray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ramabhadran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Franz</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Gustman</surname>
          </string-name>
          :
          <article-title>Building an Information Retrieval Test Collection for Spontaneous Conversational Speech</article-title>
          ,
          <source>in Proceedings of SIGIR</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>I.</given-names>
            <surname>Ounis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Amati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Plachouras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Macdonald</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Johnson</surname>
          </string-name>
          :
          <article-title>Terrier Information Retrieval Platform</article-title>
          .
          <source>In Proceedings of the 27th European Conference on Information Retrieval (ECIR 05)</source>
          ,
          <year>2005</year>
          . http://ir.dcs.gla.ac.uk/wiki/Terrier
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>T.</given-names>
            <surname>Pedersen</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Banerjee</surname>
          </string-name>
          :
          <article-title>The design, implementation and use of the Ngram Statistics Package</article-title>
          .
          <source>Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics</source>
          , Mexico City, Mexico,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>G.</given-names>
            <surname>Salton</surname>
          </string-name>
          :
          <article-title>Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer</article-title>
          . Addison-Wesley Publishing Company,
          <year>1989</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>G.</given-names>
            <surname>Salton</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Buckley</surname>
          </string-name>
          :
          <article-title>Term-weighting approaches in automatic retrieval</article-title>
          .
          <source>Information Processing and Management</source>
          <volume>24</volume>
          (
          <issue>5</issue>
          ):
          <fpage>513</fpage>
          -
          <lpage>523</lpage>
          ,
          <year>1988</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>R. W.</given-names>
            <surname>White</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. W.</given-names>
            <surname>Oard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. J. F.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Soergel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Huang</surname>
          </string-name>
          :
          <article-title>Overview of the CLEF-2005 Cross-Language Speech Retrieval Track</article-title>
          .
          <source>In Proceedings of CLEF 2005, Lecture Notes in Computer Science 4022</source>
          , Springer-Verlag,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>