<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>University of Chicago at the CLEF 2007 Cross-language Speech Retrieval Track</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gina-Anne Levow</string-name>
          <email>levow@cs.uchicago.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Chicago</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>The University of Chicago participated in the CLEF 2007 CL-SR track, performing monolingual retrieval for both English and Czech and cross-language French-English retrieval. English experiments considered the impact of automatically generated keywords on retrieval. Czech experiments explored the effect of different stemming approaches on retrieval for this morphologically rich language. The best results for English employed automatically generated keywords, and the best results for Czech employed stemming strategies which significantly outperformed unstemmed techniques.</p>
      </abstract>
      <kwd-group>
        <kwd>H.3 [Information Storage and Retrieval]</kwd>
        <kwd>H.3.1 Content Analysis and Indexing</kwd>
        <kwd>H.3.3 Information Search and Retrieval</kwd>
        <kwd>H.3.4 Systems and Software</kwd>
        <kwd>H.3.7 Digital Libraries</kwd>
      </kwd-group>
      <kwd-group>
        <title>General Terms</title>
        <kwd>Measurement</kwd>
        <kwd>Performance</kwd>
        <kwd>Experimentation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>English Baseline System</title>
      <p>We describe the query formulation, document creation, and retrieval processing for the
monolingual English retrieval system.</p>
      <sec id="sec-2-1">
        <title>Query Formulation</title>
        <p>All retrieval experiments employed the title and description fields of the original topic
specifications. The two fields were simply concatenated under the weighted sum operator
(#sum) of the InQuery system, with default stemming and stopword removal. No additional
removal of stop structure was performed.</p>
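        <p>The query formulation described above can be sketched as follows. This is only an illustration under stated assumptions: the topic is represented as a dictionary with hypothetical <monospace>title</monospace> and <monospace>description</monospace> keys, tokenization is a simple word-character match, and stemming and stopword removal are left to the retrieval engine.</p>
        <preformat><![CDATA[
```python
import re


def build_sum_query(topic):
    """Concatenate title and description terms under one #sum operator.

    The dictionary keys are hypothetical names chosen for this sketch;
    the actual topic files may label the fields differently.
    """
    text = topic["title"] + " " + topic["description"]
    # Keep word characters only so punctuation cannot break the query syntax.
    terms = re.findall(r"\w+", text)
    return "#sum( " + " ".join(terms) + " )"


# Illustrative topic, not taken from the track data.
example_topic = {
    "title": "Child survivors",
    "description": "Accounts of children in hiding.",
}
example_query = build_sum_query(example_topic)
```
]]></preformat>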
      </sec>
      <sec id="sec-2-2">
        <title>Document Creation</title>
        <p>
          Two document formulations formed a primary contrast for monolingual and cross-language
retrieval for English documents: one configuration included automatically generated keywords and
the other did not. In both cases, we employed the manual document segmentation provided by the
track organizers and the ASRTEXT2006B automatic speech recognition output field as the core
document representation, based on its prior assessed effectiveness [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. In the automatic keyword
condition, all automatically generated keywords (AK1 and AK2) were added to the core document
representation by concatenation.
        </p>
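        <p>A minimal sketch of the two document formulations follows. It assumes each segment is available as a dictionary keyed by field name; the keys <monospace>ASRTEXT2006B</monospace>, <monospace>AK1</monospace>, and <monospace>AK2</monospace> mirror the names used in the text, and the dictionary representation itself is an assumption made for illustration.</p>
        <preformat><![CDATA[
```python
def build_document_text(segment, use_keywords):
    """Return the indexed text for one segment.

    segment: dict mapping field names to strings (assumed representation).
    use_keywords: if True, append the automatically generated keyword fields.
    """
    parts = [segment.get("ASRTEXT2006B", "")]
    if use_keywords:
        parts.append(segment.get("AK1", ""))
        parts.append(segment.get("AK2", ""))
    # Concatenate the non-empty fields with single spaces.
    return " ".join(part for part in parts if part)


# Illustrative segment, not taken from the collection.
example_segment = {
    "ASRTEXT2006B": "we were hiding in the cellar",
    "AK1": "hiding",
    "AK2": "cellars",
}
baseline_text = build_document_text(example_segment, use_keywords=False)
keyword_text = build_document_text(example_segment, use_keywords=True)
```
]]></preformat>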
      </sec>
      <sec id="sec-2-3">
        <title>Retrieval Engine</title>
        <p>
          For the experiments reported below, we used the InQuery information retrieval system (version
3.1p1) [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], developed at the University of Massachusetts, Amherst. Its design is motivated by
inference networks: it normalizes the individual term weights when they are computed and
then uses an unnormalized inner product to produce retrieval status values. The documents are
then sorted in order of decreasing retrieval status value, yielding a list that approximates
decreasing relevance to the searcher’s query. For English, stopwords were removed based
on the default stopword list, and stemming was performed with the default kstem algorithm. We
adopt the convention that a difference between a pair of retrieval results is considered
significant when p &lt; 0.05 on a Wilcoxon signed-ranks test.
        </p>
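        <p>The significance convention above can be illustrated with a small, self-contained Wilcoxon signed-ranks test over paired per-topic scores. This is a sketch, not the procedure actually used for the reported results: it uses the normal approximation without tie or continuity corrections, which is only a rough guide for small numbers of topics, and the function name is our own.</p>
        <preformat><![CDATA[
```python
import math


def wilcoxon_signed_rank(xs, ys):
    """Two-sided Wilcoxon signed-ranks test (normal approximation).

    Returns (w_plus, p_value). Zero differences are dropped and tied
    absolute differences receive average ranks.
    """
    diffs = [x - y for x, y in zip(xs, ys) if x != y]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    # Rank the absolute differences, averaging ranks over ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    # Sum of ranks for the positive differences.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return w_plus, p_value


def is_significant(xs, ys, alpha=0.05):
    """Apply the p < 0.05 convention to a pair of per-topic score lists."""
    return wilcoxon_signed_rank(xs, ys)[1] < alpha
```
]]></preformat>
        <p>Here <monospace>xs</monospace> and <monospace>ys</monospace> would hold the per-topic average precision of the two runs being compared, in the same topic order.</p>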
      </sec>
      <sec id="sec-2-4">
        <title>Monolingual English Results</title>
        <p>The baseline using only the ASR transcript yields a mean average precision of 0.0512. When the
transcript is augmented with automatically generated keywords, the result rises to 0.0571. The
difference does not quite reach significance (p &lt;= 0.5068) by the Wilcoxon signed-ranks test. The
automatically generated keywords enrich the document representation with topical terms, which
appear to enhance retrieval and reduce the effects of variance in the noisy output of automatic
recognition of spontaneous speech.</p>
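        <p>For readers unfamiliar with the measure, mean average precision can be computed as in the following sketch. It implements the standard uninterpolated definition; the function names and the list-and-set representation of runs and relevance judgments are assumptions made for illustration, and the official track scores were not produced with this code.</p>
        <preformat><![CDATA[
```python
def average_precision(ranked_ids, relevant_ids):
    """Uninterpolated average precision for one topic."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at the rank of each relevant document retrieved.
            precision_sum += hits / rank
    # Relevant documents never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(topics):
    """topics: list of (ranked_ids, relevant_ids) pairs, one per topic."""
    scores = [average_precision(ranked, rel) for ranked, rel in topics]
    return sum(scores) / len(scores)
```
]]></preformat>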
      </sec>
    </sec>
    <sec id="sec-3">
      <title>French-English Cross-language Retrieval</title>
      <p>For the cross-language retrieval conditions, the official runs employed a publicly available
translation tool, while contrastive runs employed a dictionary-based word-for-word translation strategy
with a freely available dictionary and statistically derived stemming. We discuss the query
translation procedure below; all document processing and retrieval components were identical to the
monolingual English configuration.</p>
      <sec id="sec-3-1">
        <p>[Table 2: columns Query, Mono, FR + G, FR + wd]</p>
        <sec id="sec-3-1-1">
          <title>Query Translation</title>
          <p>For the official cross-language French-English retrieval runs, we employed the publicly available
translation tool provided by Google (http://translate.google.com) to translate the queries. Queries
were created by concatenating the title and description fields of the French topics, analogous to
the procedure used for English query formulation.</p>
          <p>
            For contrastive runs, we employed a dictionary-based word-for-word translation procedure
consistent with [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ]. We obtained a freely available French-English bilingual term list from http://www.freedict.com.
For the word-for-word translation process, all terms are first converted to lowercase and all
accent diacritics are removed for consistency with the translation resource. Next, the translation
procedure applies a backoff stemming strategy [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ] that first seeks the highest-precision match
between the query terms and the dictionary and backs off to stemmed forms to enhance recall.
We attempt initially to match the unstemmed forms in the query with unstemmed forms in the
bilingual term list. Only if no match is found do we perform stemming, attempting in turn to match the
stemmed query term in the unstemmed term list, the surface form of the query term in the stemmed
term list, and finally the stemmed query term in the stemmed term list. The stemming procedure
employed stemming rules derived by a statistical stemming process as in [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ]. After this procedure, 27% of the query
terms remained untranslated; all untranslatable terms were retained in the query.
          </p>
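          <p>The four-stage backoff lookup described above can be sketched as follows. The term list, the suffix-stripping <monospace>toy_stem</monospace> function, and all function names are hypothetical stand-ins: the actual runs used the freedict term list and statistically derived stemming rules, neither of which is reproduced here.</p>
          <preformat><![CDATA[
```python
import unicodedata


def normalize(term):
    """Lowercase a term and strip its accent diacritics."""
    decomposed = unicodedata.normalize("NFD", term.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def toy_stem(term):
    """Hypothetical stand-in for the statistically derived stemming rules."""
    for suffix in ("es", "s", "e"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: len(term) - len(suffix)]
    return term


def backoff_translate(term, term_list, stem=toy_stem):
    """Look a query term up with four-stage backoff.

    term_list maps normalized source terms to lists of translations.
    Untranslatable terms are retained (returned in normalized form).
    """
    surface = normalize(term)
    stemmed = stem(surface)
    # Stemmed view of the term list, rebuilt per call for simplicity.
    stemmed_list = {}
    for source, targets in term_list.items():
        stemmed_list.setdefault(stem(source), []).extend(targets)
    stages = (
        (surface, term_list),     # 1. surface query term, surface term list
        (stemmed, term_list),     # 2. stemmed query term, surface term list
        (surface, stemmed_list),  # 3. surface query term, stemmed term list
        (stemmed, stemmed_list),  # 4. stemmed query term, stemmed term list
    )
    for key, table in stages:
        if key in table:
            return list(table[key])
    return [surface]


# Tiny illustrative term list, not the resource used in the experiments.
TERM_LIST = {"enfant": ["child"], "guerre": ["war"], "caches": ["hidden"]}
```
]]></preformat>
          <p>With this toy data, <monospace>backoff_translate("Enfants", TERM_LIST)</monospace> succeeds at the second stage, <monospace>backoff_translate("caché", TERM_LIST)</monospace> only at the fourth, and a term absent from the list is passed through unchanged.</p>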
          <p>In both cases, the resulting English translations are stemmed and stopwords are removed,
consistent with earlier document processing.</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>Cross-language Retrieval Results &amp; Discussion</title>
          <p>We find a substantial drop in mean average precision from the monolingual to the cross-language
conditions, and from system-based to dictionary-based translation. Results appear in Table 2.
With comparable document representations, effectiveness for the cross-language condition drops
29-37% from monolingual levels. A larger drop is
observed for the baseline condition without automatically generated keywords than for that with
keywords. Furthermore, the drop in retrieval effectiveness is even more pronounced in the
word-for-word case, where effectiveness is low enough that ASR-based retrieval is very similar to
keyword-based retrieval.</p>
          <p>These results indicate overall good effectiveness for the online translation tool for the
French-English language pair and the limitations of the small dictionary-based translation strategy.
However, the degradation from monolingual retrieval is quite large, and alternate strategies will be
needed to overcome it. The less formal and highly variable character of the spontaneous speech
materials limits the effectiveness of retrieval on ASR transcripts alone, while enrichment with
automatically generated keywords appears to provide more useful representations of topical
information, though the differences do not reach significance. Additional strategies for enrichment and
denoising of the query and document content will be necessary to overcome these challenges.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <p>[Table 3: columns Query, Czech TD]</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Monolingual Czech Retrieval: Stemming Strategies</title>
      <p>The Czech language poses special challenges for information retrieval. In contrast with English,
which has a relatively impoverished morphology, Czech employs a very rich morphology. As a
result, to support effective matching between the document content and the specification of the
user’s information need, some means of overcoming the surface variance due to morphology is
required. Here we explore the impact on retrieval effectiveness of three different stemming strategies:
no stemming, light stemming, and aggressive stemming. We employ two freely available Java-based
Czech stemmers from the University of Neuchatel (http://members.unine.ch/jacques.savoy/clef/index.html)
to perform light and aggressive stemming for Czech. “Light” stemming in this case refers to
removing affixes only for nouns and adjectives.</p>
      <p>
        The basic query formulation, indexing, and retrieval processes are consistent with those
described above for monolingual English, with three contrasts: in stemming, stopword removal, and
document formation. First, we apply one of the three stemming strategies (none, light, or
aggressive) to both queries and documents to enhance matching and retrieval in this morphologically
rich language. Second, we incorporate a freely available Czech stopword list from the same source as
the stemmers to support stopword removal.<sup>1</sup> Finally, we employ the playback-point-based
document segmentation provided by the track organizers. Results thus employ the mGAP measure
for Czech retrieval [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
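      <p>The Czech preprocessing steps can be sketched as below. The stemmer is passed in as a function because the Neuchatel stemmers are Java programs that are not reproduced here; the default <monospace>identity_stem</monospace> corresponds to the no-stemming condition. Applying stopword removal before stemming is an assumption of this sketch, while removing diacritics after stemming follows the description in the text.</p>
      <preformat><![CDATA[
```python
import unicodedata


def strip_diacritics(text):
    """Remove accent diacritics, e.g. for compatibility with the indexer."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def identity_stem(token):
    """No-stemming condition: return the token unchanged."""
    return token


def preprocess_czech(text, stopwords, stem=identity_stem):
    """Lowercase, drop stopwords, stem, then strip diacritics.

    stem: any function from token to stem; a light or aggressive Czech
    stemmer would be plugged in here (hypothetical interface).
    """
    tokens = text.lower().split()
    kept = [tok for tok in tokens if tok not in stopwords]
    return [strip_diacritics(stem(tok)) for tok in kept]


# Illustrative input; the one-word stopword list is a placeholder.
example_tokens = preprocess_czech("Děti a válka", {"a"})
```
]]></preformat>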
      <sec id="sec-4-1">
        <title>Results and Discussion</title>
        <p>We find that the best results are obtained with the aggressive stemming strategy, followed by
light stemming, and lastly no stemming. All results appear in Table 3. The results for aggressive
stemming were the second best title+description based results for Czech in this evaluation. Clearly,
the stemming approaches more effectively overcome the high degree of surface form variation of
Czech terms. The unstemmed case is significantly outperformed by both the aggressive stemmer
(p &lt;= 0.002) and the light stemmer (p &lt;= 0.01), although the two stemmers are not significantly
different from each other.</p>
        <p>However, the overall effectiveness of spontaneous speech retrieval in Czech is still quite limited.
We speculate that further gains will require not only better term matching, for example through improved
stemming and through pseudo-relevance feedback query or document expansion,
but also improvements in basic transcription accuracy and in the handling of spontaneous speech
phenomena.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions and Future Work</title>
      <p>Experiments in monolingual English and cross-language French-English speech retrieval obtained
the best results by augmenting ASR transcripts with automatically generated keywords.
Experiments in monolingual Czech speech retrieval demonstrated the importance of stemming to
overcome surface form variation for this morphologically rich language.</p>
      <p>However, retrieval from spontaneous speech in both languages remains a very challenging task
due both to the difficulties of speech recognition and to the challenging structure of spontaneous
speech. In future work we plan to explore approaches to minimize the impact of speech recognition
errors and variations in lexical choice through denoising strategies such as Generalized Latent
Semantic Analysis to enhance document and query similarity even without lexical overlap. Also,
while the current work has only employed the document segmentations provided, we hope to
explore novel approaches to automatic segmentation of spontaneous speech, which is important
for broadening access to lengthy recorded materials and also poses many interesting challenges in
integrating lexical and acoustic evidence of structure in spontaneous speech.</p>
      <p><sup>1</sup> For compatibility with the retrieval system, we also remove all accent diacritics after stemming.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>James P.</given-names>
            <surname>Callan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. Bruce</given-names>
            <surname>Croft</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Stephen M.</given-names>
            <surname>Harding</surname>
          </string-name>
          .
          <article-title>The INQUERY retrieval system</article-title>
          .
          <source>In Proceedings of the Third International Conference on Database and Expert Systems Applications</source>
          , pages
          <fpage>78</fpage>
          -
          <lpage>83</lpage>
          . Springer-Verlag,
          <year>1992</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Gina-Anne</given-names>
            <surname>Levow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Douglas W.</given-names>
            <surname>Oard</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Philip</given-names>
            <surname>Resnik</surname>
          </string-name>
          .
          <article-title>Dictionary-based techniques for cross-language information retrieval</article-title>
          .
          <source>Information Processing and Management: Special Issue on Cross-language Information Retrieval</source>
          ,
          <volume>41</volume>
          (
          <issue>4</issue>
          ),
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D. W.</given-names>
            <surname>Oard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.-A.</given-names>
            <surname>Levow</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C. I.</given-names>
            <surname>Cabezas</surname>
          </string-name>
          .
          <article-title>CLEF Experiments at Maryland: Statistical Stemming and Backoff Translation</article-title>
          . Springer,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
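      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D. W.</given-names>
            <surname>Oard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jianqiang</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Gareth J.F.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ryan W.</given-names>
            <surname>White</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Pavel</given-names>
            <surname>Pecina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Dagobert</given-names>
            <surname>Soergel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Xiaoli</given-names>
            <surname>Huang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Izhak</given-names>
            <surname>Shafran</surname>
          </string-name>
          .
          <article-title>Overview of the CLEF-2006 cross-language speech retrieval track</article-title>
          .
          <source>In CLEF-2006</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>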
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Philip</given-names>
            <surname>Resnik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Douglas W.</given-names>
            <surname>Oard</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Gina-Anne</given-names>
            <surname>Levow</surname>
          </string-name>
          .
          <article-title>Improved cross-language retrieval using backoff translation</article-title>
          .
          <source>In Proceedings of Human Language Technology Conference (HLT) 2001</source>
          , pages
          <fpage>153</fpage>
          -
          <lpage>155</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>