<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of the CLEF-2006 Cross-Language Speech Retrieval Track</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Dagobert Soergel</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiaoli Huang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Douglas W. Oard</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gareth J.F. Jones</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Izhak Shafran</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jianqiang Wang</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pavel Pecina</string-name>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ryen W. White</string-name>
          <xref ref-type="aff" rid="aff6">6</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>College of Information Studies, University of Maryland</institution>
          ,
          <addr-line>College Park, MD 20742</addr-line>
          ,
          <country country="US">U.S.A.</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>College of Information Studies and Institute for Advanced Computer Studies, University of Maryland</institution>
          ,
          <addr-line>College Park, MD 20742</addr-line>
          ,
          <country country="US">U.S.A.</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>School of Computing, Dublin City University</institution>
          ,
          <addr-line>Dublin 9</addr-line>
          ,
          <country country="IE">Ireland</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>OGI School of Science &amp; Engineering, Oregon Health and Sciences University</institution>
          ,
          <addr-line>20000 NW Walker Rd, Portland, OR 97006</addr-line>
          ,
          <country country="US">U.S.A.</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Department of Library and Information Studies, State University of New York at Buffalo</institution>
          ,
          <addr-line>Buffalo, NY 14260</addr-line>
          ,
          <country country="US">U.S.A.</country>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>MFF UK, Charles University</institution>
          ,
          <addr-line>Malostranske namesti 25, Room 422</addr-line>
          ,
          <addr-line>118 00 Praha 1</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
        <aff id="aff6">
          <label>6</label>
          <institution>Microsoft Research</institution>
          ,
          <addr-line>One Microsoft Way, Redmond, WA 98052</addr-line>
          ,
          <country country="US">U.S.A.</country>
        </aff>
      </contrib-group>
      <kwd-group>
        <kwd>Speech Retrieval</kwd>
        <kwd>Evaluation</kwd>
        <kwd>Generalized Average Precision</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Abstract</title>
      <p>The CLEF-2006 Cross-Language Speech Retrieval (CL-SR) track included two tasks:
to identify topically coherent segments of English interviews in a known-boundary
condition, and to identify time stamps marking the beginning of topically relevant passages
in Czech interviews in an unknown-boundary condition. Five teams participated in
the English evaluation, performing both monolingual and cross-language searches of
ASR transcripts, automatically generated metadata, and manually generated
metadata. Results indicate that the 2006 evaluation topics are more challenging than those
used in 2005, but that cross-language searching continued to pose no unusual challenges
when compared with collections of character-coded text. Three teams participated in
the Czech evaluation, but no team achieved results comparable to those obtained with
English interviews. The reasons for this outcome are not yet clear.</p>
    </sec>
    <sec id="sec-2">
      <title>Categories and Subject Descriptors</title>
      <p>H.3 [Information Storage and Retrieval]: H.3.3 Information Search and Retrieval</p>
    </sec>
    <sec id="sec-3">
      <title>General Terms</title>
      <p>Measurement, Performance, Experimentation</p>
      <sec id="sec-3-1">
        <title>Introduction</title>
        <p>The 2006 Cross-Language Evaluation Forum (CLEF) Cross-Language Speech Retrieval (CL-SR)
track continues last year's effort to support research on ranked retrieval from spontaneous
conversational speech. Automatically transcribing spontaneous speech has proven to be considerably
more challenging than transcribing the speech of news anchors for the Automatic Speech
Recognition (ASR) techniques on which fully-automatic content-based search systems are based.</p>
        <p>The CLEF 2005 CL-SR task focused on searching English interviews. For CLEF 2006, 30
new search topics were developed for the same collection, and an improved ASR transcript with
better accuracy for the same set of testimonies was added. This made it possible to validate the
retrieval techniques that were shown to be effective with last year's topics, and to further explore
the influence of ASR accuracy on retrieval effectiveness. The CLEF 2006 CL-SR track also
added a new task of searching Czech interviews.</p>
        <p>Similar to CLEF 2005, the English task is again based on a known-boundary condition for
topically coherent segments. The Czech search task is based on an unknown-boundary condition
where participants are required to identify a time stamp for the beginning of each distinct topically
relevant passage.</p>
        <p>The first part of this paper describes the English language CL-SR task and summarizes the
participants' submitted results. This is followed by a description of the Czech language task with
corresponding details of submitted runs.</p>
      </sec>
      <sec id="sec-3-2">
        <title>English Task</title>
        <p>
          The structure of the CLEF 2006 CL-SR English task was identical to that used in 2005. Two
English collections were released this year. The first release (March 14, 2006) contained all material
that was then available for training
          <xref ref-type="bibr" rid="ref6 ref7">(i.e., both the training and the test topics from last year's CLEF
2005 CL-SR evaluation)</xref>
          . There was one small difference from the original 2005 data release: each
person's last name that appears in the NAME field (or in the associated XML data files) was
reduced to its initial followed by three dots (e.g., "Smith" became "S..."). This collection contains
a total of 63 search topics, 8,104 topically coherent segments (the equivalent of "documents" in a
classic IR evaluation), and 30,497 relevance judgments.
        </p>
        <p>The second release (June 5, 2006) included a re-release of all the training materials (unchanged)
and an additional 42 candidate evaluation topics (30 new topics, plus 12 other topics for which
relevance judgments had not previously been released) and two new fields based on an improved
ASR transcript from the IBM T. J. Watson Research Center.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Segments</title>
      <p>
        Other than the changes described above, the segments used for the CLEF 2006 CL-SR task were
identical to those used for CLEF 2005. Two new fields contain ASR transcripts of higher accuracy
than were available in 2005
        <xref ref-type="bibr" rid="ref10 ref4">(ASRTEXT2006A and ASRTEXT2006B)</xref>
        . The ASRTEXT2006A field
contains a transcript generated using the best presently available ASR system, which has a mean
word error rate of 25% on held-out data. Because of time constraints, however, only 7,378 segments
have text in this field. For the remaining 726 segments, no ASR output was available from the
2006A system at the time the collection was distributed. The ASRTEXT2006B field seeks to
avoid this no-content condition by including content identical to the ASRTEXT2006A field when
available, and content identical to the ASRTEXT2004A field otherwise. Since ASRTEXT2003A,
ASRTEXT2004A, and ASRTEXT2006B contain ASR text that was automatically generated for
all 8,104 segments, any (or all) of them can be used for the required run based on automatic data.
A detailed description of the structure and fields of the English segment collection is given in last
year's track overview paper [11].
      </p>
    </sec>
    <sec id="sec-5">
      <title>Topics</title>
      <p>The limited size of the collection would likely make it impractical to continue to do new topic
development for the same set of segments in future years, so we elected to use every previously
unreleased topic for the CLEF-2006 CL-SR English task. A total of 30 new topics were created for
this year's evaluation from actual requests received by the USC Shoah Foundation Institute for
Visual History and Education.<sup>1</sup> These were combined with 12 topics that had been developed in
previous years, but for which relevance judgments had not been released. This resulted in a set
of 42 topics that were candidates for use in the evaluation.</p>
      <p>All topics were initially prepared in English. Translations into Czech, Dutch, French, German,
and Spanish were created by native speakers of those languages, and the same process was used to
prepare French translations of the narrative field for all topics in the training collection (which
had not been produced in 2005 due to resource constraints). With the exception of Dutch, all
translations were checked for reasonableness by a second native speaker of the language.<sup>2</sup></p>
      <p>
        A total of 33 of the 42 candidate topics were used as a basis for the official 2006 CL-SR
evaluation; the remaining 9 topics were rejected because they had either too few known relevant
segments (fewer than 5) or too high a density of known relevant segments among the available
judgments (over 48%, suggesting that many relevant segments may not have been found).
Participating teams were asked to submit results for all 105 available topics
        <xref ref-type="bibr" rid="ref10 ref4">(the 63 topics in the 2006
training set and the 42 topics in the 2006 evaluation candidate set)</xref>
        so that new pools could be
formed to perform additional judgments on the development set if additional assessment resources
become available.
      </p>
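      <p>The two rejection criteria described above can be sketched as a simple filter. This is a minimal illustration only: the function name, its parameters, and the example counts are assumptions introduced here and are not part of the track's tooling.</p>
      <preformat><![CDATA[
```python
def keep_topic(num_relevant, num_judged, min_relevant=5, max_density=0.48):
    """Return True if a candidate topic passes both selection criteria.

    A topic is rejected when it has fewer than `min_relevant` known relevant
    segments, or when more than `max_density` of its judged segments are
    relevant (which suggests that many relevant segments were never judged).
    """
    if num_judged == 0:
        return False
    density = num_relevant / num_judged
    return num_relevant >= min_relevant and density <= max_density


# Hypothetical counts, for illustration only.
print(keep_topic(4, 100))    # too few relevant segments
print(keep_topic(50, 100))   # density above 48%
print(keep_topic(20, 100))   # passes both checks
```
]]></preformat>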
      <p><sup>1</sup> On January 1, 2006 the University of Southern California (USC) Shoah Foundation Institute for Visual History
and Education was established as the successor to the Survivors of the Shoah Visual History Foundation, which
had originally assembled and manually indexed the collection used in the CLEF CL-SR track.</p>
      <p><sup>2</sup> A subsequent quality assurance check for Dutch revealed only a few minor problems. Both the as-run and the
final corrected topics will therefore be released for Dutch.</p>
      <p>As in the CLEF-2005 CL-SR track, we report Mean uninterpolated Average Precision (MAP) as
the principal measure of retrieval effectiveness. Version 8.0 of the trec_eval program was used to
compute this measure.<sup>3</sup></p>
    </sec>
    <sec id="sec-6">
      <title>Relevance Judgments</title>
      <p>Subject matter experts created multi-scale and multi-level relevance assessments in the same
manner as was done for the CLEF-2005 CL-SR track [11]. These were then conflated into binary
judgments using the same procedure as was used for CLEF-2005: the union of direct and indirect
relevance judgments with scores of 2, 3, or 4 (on a 0–4 scale) was treated as topically relevant,
and any other case as non-relevant. This resulted in a total of 28,223 binary judgments across the
33 topics, among which 2,450 (8.6%) are relevant.</p>
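      <p>One way to read the conflation rule is sketched below. It assumes that each judged segment carries one direct and one indirect relevance score on the 0 to 4 scale; the function name and the score representation are hypothetical, and the procedure actually used by the assessors is documented in the 2005 overview [11].</p>
      <preformat><![CDATA[
```python
def is_topically_relevant(direct_score, indirect_score, threshold=2):
    """Conflate two graded judgments (0-4) into one binary judgment.

    The segment counts as relevant when either the direct or the indirect
    score reaches the threshold, i.e. the union of the two judgment types.
    """
    return direct_score >= threshold or indirect_score >= threshold


# Hypothetical judgments: (direct, indirect) scores for three segments.
judgments = [(0, 1), (3, 0), (1, 2)]
binary = [is_topically_relevant(d, i) for d, i in judgments]
print(binary)
```
]]></preformat>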
    </sec>
    <sec id="sec-7">
      <title>Techniques</title>
      <p>The following gives a brief description of the methods used by the participants in the English task.
Additional details are available in each team's paper.</p>
      <sec id="sec-7-1">
        <title>University of Alicante (UA)</title>
        <p>The University of Alicante used the MINIPAR parser to produce an analysis of syntactic
dependencies in the topic descriptions and in the automatically generated portion of the collection.
They then used these results in combination with their locally developed IR-n system to produce
overlapping passages. Their experiments focused on combining these sources of evidence and on
optimizing search effectiveness using pruning techniques.</p>
      </sec>
      <sec id="sec-7-2">
        <title>Dublin City University (DCU)</title>
        <p>Dublin City University used two systems based on the Okapi retrieval model. One version used
Okapi with their summary-based pseudo-relevance feedback method. The other system explored
the combination of multiple segment fields using the method introduced in [8]. This system also
explored the use of a field-based method for term selection in query expansion with
pseudo-relevance feedback.</p>
      </sec>
      <sec id="sec-7-3">
        <title>University of Maryland (UMD)</title>
        <p>The University of Maryland team tried two techniques, using the InQuery system in both cases [1].
Four fields of automatic data were combined to create a segment index. Retrieval results from this
index were compared with results from indexes based on individual automatic data fields, showing
that combining the four automatic data fields could help slightly, although the observed
improvement is not statistically significant. Manual metadata fields were also combined in the same way, but
no comparative results were reported. In addition, the team applied the so-called "meaning
matching" technique to French-English cross-language retrieval. Although there is some sign
that the technique helps marginally, CLIR effectiveness is significantly worse than monolingual
performance.</p>
      </sec>
      <sec id="sec-7-4">
        <title>Universidad Nacional de Educación a Distancia (UNED)</title>
        <p>The UNED team compared the utility of the 2006 ASR with manually generated summaries and
manually assigned keywords. A CLIR experiment was performed using Spanish queries with the
2006 ASR.</p>
        <p><sup>3</sup> The trec_eval program is available from http://trec.nist.gov/trec_eval/. The DCU results reported in this paper
are based on a subsequent re-submission that corrected a formatting error.</p>
        <p>The University of Ottawa used two information retrieval systems in their experiments: SMART
[2] and Terrier [7]. The two systems were used with many different weighting schemes for indexing
the segments and the queries, and with several query expansion techniques (including a newly
proposed method based on log-likelihood scores for collocations). For the English collection,
different Automatic Speech Recognition transcripts (with different estimated word error rates)
were used for indexing the segments, as were several combinations of automatic transcripts.
Cross-language experiments were run after the topics were automatically translated into English
by combining the results of several online machine translation tools. The manual summaries and
manual keywords were used for indexing in the manual run.</p>
      </sec>
      <sec id="sec-7-5">
        <title>University of Twente (UT)</title>
        <p>The University of Twente employed a locally developed XML retrieval system that supports
Narrowed Extended XPath (NEXI) queries to search the collection. They also prepared Dutch
translations of the topics that they used as a basis for CLIR experiments.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>English evaluation results</title>
      <p>Table 1 summarizes the results for all 30 official runs averaged over the 33 evaluation topics,
listed in descending order of MAP. Required runs are shown in bold. The best result for the
required condition (title plus description queries, automatically generated data, from Dublin City
University) of 0.0747 is considerably below (i.e., just 58% of) last year's best result. A similar
effect was not observed when manually generated metadata were indexed, however, with this year's
best result (0.2902) being 93% of last year's best manually generated metadata result. From this
we conclude that this year's topic set seems somewhat less well matched with the ASR results, but
that the topics are not otherwise generally much harder for information retrieval techniques based
on term matching. CLIR also seemed to pose no unusual challenges with this year's topic set,
with the best CLIR on automatically generated indexing data (a French run from the University
of Ottawa) achieving 83% of the MAP achieved by a comparable monolingual run. Similar effects
were observed with manually generated metadata (at 80% of the corresponding monolingual MAP
for Dutch queries, from the University of Twente).</p>
      <sec id="sec-8-1">
        <title>Czech Task</title>
        <p>The goal of the Czech task was to automatically identify the start points of topically relevant
passages in interviews. Ranked lists for each topic were submitted by each system in the same
form as the CLEF ad hoc task, with the single exception that a system-generated starting point
was specified rather than a document identifier. The format for this was
"VHF[IntCode].[startingtime]," where "IntCode" is the five-digit interview code (with leading zeroes added) and
"startingtime" is the system-suggested replay starting point (in seconds) with reference to the beginning
of the interview. Lists were to be ranked by systems in the order that they would suggest for
listening to passages beginning at the indicated points.</p>
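        <p>The identifier format can be illustrated with a short sketch. The helper names are hypothetical, and the sketch assumes that start times are written as whole seconds, which the description above does not state explicitly.</p>
        <preformat><![CDATA[
```python
def format_start_point(interview_code, start_seconds):
    """Build an identifier of the form VHF[IntCode].[startingtime].

    The interview code is zero-padded to five digits; the start time is the
    offset in seconds from the beginning of the interview (assumed integral).
    """
    return f"VHF{interview_code:05d}.{int(start_seconds)}"


def parse_start_point(identifier):
    """Split an identifier back into (interview_code, start_seconds)."""
    code, seconds = identifier[len("VHF"):].split(".", 1)
    return int(code), int(seconds)


# Hypothetical interview 123, replay starting six minutes in.
identifier = format_start_point(123, 360)
print(identifier)
print(parse_start_point(identifier))
```
]]></preformat>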
      </sec>
    </sec>
    <sec id="sec-9">
      <title>Interviews</title>
      <p>The Czech task was broadly similar to the English task in that the goal was to design systems
that could help searchers identify sections of an interview that they might wish to listen to. The
processing of the Czech interviews was, however, different from that used for English in three
important ways:</p>
      <p>No manual segmentation was performed. This alters the format of the interviews (which for
Czech is time-oriented rather than segment-oriented), it alters the nature of the task (which
for Czech is to identify replay start points rather than to select among predefined segments),
and it alters the nature of the manually assigned metadata (there are no manually written
summaries for Czech and the meaning of a manual thesaurus term assignment for Czech is
that discussion of a topic started at that time).</p>
      <p>
        The two available Czech ASR transcripts were generated using different ASR systems. In
both cases, the acoustic models were trained using 15-minute snippets from 336 speakers,
all of whom are present in the test set as well. However, the language model was created
by interpolating two models: an in-domain model from transcripts, and an out-of-domain
model from selected portions of the Czech National Corpus. For details, see the baseline systems
described in [9, 10]. Apart from the improvement in transcription accuracy, the 2006 system
differs from the 2004 system in that the transcripts are produced in formal Czech, rather
than the colloquial Czech that was produced in 2004. Since the topics were written in formal
Czech, the 2006 ASR transcripts may yield better matching. Interview-specific vocabulary
priming (adding proper names to the recognizer vocabulary based on names present in a
pre-interview questionnaire) was not done for either Czech system. Thus, a somewhat higher
error rate on named entities might be expected for the Czech systems than for the two
English systems
        <xref ref-type="bibr" rid="ref10 ref4 ref9">(2004 and 2006)</xref>
        in which vocabulary priming was included.
      </p>
      <p>ASR is available for both the left and right stereo channels (which usually were recorded
from microphones with different positions and orientations).</p>
      <p>
        Because the task design for Czech is not directly compatible with the design of
document-oriented IR systems, we provided a "quickstart" package containing the following:
A quickstart script for generating overlapping passages directly from the ASR transcripts.
The passage duration (in seconds), the spacing between passage start times (also in seconds),
and the desired ASR system (2004 or 2006) could be specified. The default settings
        <xref ref-type="bibr" rid="ref10 ref4">(180,
60, and 2006)</xref>
        result in 3-minute passages in which one minute on each end overlaps with
the preceding or subsequent passage.
      </p>
      <p>A quickstart collection created by running the quickstart script with the default settings.
This collection contains 11,377 overlapping passages.</p>
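      <p>The passage generation performed by the quickstart script can be approximated as follows. This is a sketch under stated assumptions, not the distributed script: the transcript is assumed to be a time-sorted list of (time in seconds, word) pairs, and the function name and return format are hypothetical. Only the default duration (180) and spacing (60) come from the description above.</p>
      <preformat><![CDATA[
```python
def overlapping_passages(timed_words, duration=180, spacing=60):
    """Cut one interview transcript into overlapping, time-based passages.

    timed_words: list of (time_in_seconds, word) pairs, sorted by time.
    Returns a list of (start_time, words) pairs, one per passage, with a new
    passage starting every `spacing` seconds and covering `duration` seconds.
    """
    if not timed_words:
        return []
    last_time = timed_words[-1][0]
    passages = []
    start = 0
    while start <= last_time:
        end = start + duration
        words = [word for time, word in timed_words if start <= time < end]
        passages.append((start, words))
        start += spacing
    return passages


# Hypothetical four-word transcript, for illustration only.
transcript = [(0, "a"), (59, "b"), (61, "c"), (185, "d")]
for start, words in overlapping_passages(transcript):
    print(start, words)
```
]]></preformat>
      <p>Passages that contain no words are kept in this sketch, since a start time may still be needed as a candidate replay point.</p>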
      <p>The quickstart collection contains the following automatically generated fields:</p>
      <p>DOCNO The DOCNO field contains a unique document number in the same format as the start
times that systems were required to produce in a ranked list. This design allowed the output
of a typical IR system to be used directly as a list of correctly formatted (although perhaps
not very accurate) start times for scoring purposes.</p>
      <p>
        ASRSYSTEM The ASRSYSTEM field specifies the source of the ASR text collection
        <xref ref-type="bibr" rid="ref10 ref4 ref9">(either "2004" for the colloquial
Czech system developed by the University of West Bohemia and Johns Hopkins University
in 2004 or "2006" for an updated and possibly more accurate formal Czech system provided
by the same research groups in 2006)</xref>
        .
      </p>
      <p>CHANNEL The CHANNEL field specifies which recorded channel (left or right) was used to
produce the transcript. The channel that produced the greatest number of total words over
the entire transcript (which is usually the channel that produced the best ASR accuracy
for words spoken by the interviewee) was automatically selected by default. This automatic
selection process was hardcoded in the script, although the script could be modified to
generate either or both channels.</p>
      <p>ASRTEXT The ASRTEXT field contains words in order from the transcript selected by
ASRSYSTEM and CHANNEL for a passage beginning at the start time indicated in DOCNO. When
the selected transcript contains no words at all from that time period, words are drawn from
one alternate source that is chosen in the following priority order: (1) the same
ASRSYSTEM from the other CHANNEL, (2) the same CHANNEL from the other ASRSYSTEM,
or (3) the other CHANNEL from the other ASRSYSTEM.</p>
      <p>ENGLISHAUTOKEYWORD The ENGLISHAUTOKEYWORD field contains a set of
thesaurus terms that were assigned automatically by a k-Nearest Neighbor (kNN) classifier
using only words from the ASRTEXT field of the passage; the top 20 thesaurus terms are
included in best-first order. Thesaurus terms (which may be phrases) are separated with
a vertical bar character. The classifier was trained using English data (manually assigned
thesaurus terms and manually written segment summaries) and run using automatically
produced English translations of the 2006 Czech ASRTEXT [6]. Two types of thesaurus terms
are present, but not distinguished: (1) terms that express a subject or concept; (2) terms
that express a location, often combined with time in one precombined term [5]. Because the
classifier was trained on the English collection, in which thesaurus terms were assigned to
segments, the natural interpretation of an automatically assigned thesaurus term is that the
classifier believes the indicated topic is associated with the words spoken in this passage.
Note that this differs from the way in which the presence of a manually assigned thesaurus
term (described below) should be interpreted.</p>
      <p>CZECHAUTOKEYWORD The CZECHAUTOKEYWORD field contains Czech translations
of the ENGLISHAUTOKEYWORD field. These translations were obtained from three
sources: (1) professional translation of about 3,000 thesaurus terms, (2) volunteer
translation of about 700 thesaurus terms, and (3) a custom-built machine translation system that
reused words and phrases from manually translated thesaurus terms to produce additional
translations. Some words (e.g., foreign place names) remained untranslated when none of
the three sources yielded a usable translation.</p>
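      <p>The fallback order described for the ASRTEXT field can be written out as a small lookup. The data layout (a dictionary keyed by ASR system and channel) and the function name are assumptions made for this sketch; only the priority order is taken from the field description.</p>
      <preformat><![CDATA[
```python
def select_asr_text(texts, asr_system, channel):
    """Pick the words for one passage, falling back when the source is empty.

    texts maps (asr_system, channel) pairs such as ("2006", "left") to the
    list of words recognized in the passage's time period.
    """
    other_system = "2004" if asr_system == "2006" else "2006"
    other_channel = "right" if channel == "left" else "left"
    priority = [
        (asr_system, channel),          # the selected transcript
        (asr_system, other_channel),    # (1) same system, other channel
        (other_system, channel),        # (2) other system, same channel
        (other_system, other_channel),  # (3) other system, other channel
    ]
    for key in priority:
        words = texts.get(key, [])
        if words:
            return words
    return []


# Hypothetical passage in which only the 2004 left channel has output.
passage = {("2006", "left"): [], ("2006", "right"): [], ("2004", "left"): ["ano"]}
print(select_asr_text(passage, "2006", "left"))
```
]]></preformat>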
      <p>Three additional fields containing data produced by human indexers at the Survivors of the
Shoah Visual History Foundation were also available for use in contrastive conditions:</p>
      <p>INTERVIEWDATA The INTERVIEWDATA field contains the first name and last initial for
the person being interviewed. This field is identical for every passage that was generated
from the same interview.</p>
      <p>ENGLISHMANUKEYWORD The ENGLISHMANUALKEYWORD field contains thesaurus
terms that were manually assigned with one-minute granularity from a custom-built
thesaurus by subject matter experts at the Survivors of the Shoah Visual History Foundation
while viewing the interview. The format is the same as that described for the
ENGLISHAUTOKEYWORD field, but the meaning of a keyword assignment is different. In the Czech
collection, manually assigned thesaurus terms are used as onset marks: they appear only
once at the point where the indexer recognized that a discussion of a topic or location-time
pair had started; continuation and completion of discussion are not marked.</p>
      <p>CZECHMANUKEYWORD The CZECHMANUALKEYWORD field contains Czech
translations of the English thesaurus terms that were produced from the
ENGLISHMANUALKEYWORD field using the process described above.</p>
      <p>All three teams used the quickstart collection; no other approaches to segmentation and no
other settings for passage length or passage start time spacing were tried.</p>
    </sec>
    <sec id="sec-10">
      <title>Topics</title>
      <p>At the time the Czech evaluation topics were released, it was not yet clear which of the available
topics were likely to yield a sufficient number of relevant passages in the Czech collection.
Participating teams were therefore asked to run 115 topics, which was every available topic at that time. This
included the full 105 topic set that was available this year for English (including all training and
all evaluation candidate topics) and 10 adaptations of topics from that set in which geographic
restrictions had been removed (as insurance against the possibility that the smaller Czech collection
might not have adequate coverage for exactly the same topics).</p>
      <p>All 115 topics had originally been constructed in English and then translated into Czech by
native speakers. Since translations into languages other than Czech were not available for the
10 adapted topics, only English and Czech topics were distributed with the Czech collection. No
teams used the English topics this year; all official runs this year with the Czech collection were
monolingual.</p>
      <p>Two additional topics were created as part of the process of training relevance assessors, and
those topics were distributed to participants along with a (possibly incomplete) set of relevance
judgments. This distribution occurred too late to influence the design of any participating system.</p>
    </sec>
    <sec id="sec-11">
      <title>Evaluation Measure</title>
      <p>The evaluation measure that we chose for Czech is designed to be sensitive to errors in the start
time, but not in the end time, of system-recommended passages. It is computed in the same
manner as mean average precision, but with one important difference: partial credit is awarded in
a way that rewards system-recommended start times that are close to those chosen by assessors.
After a simulation study, we chose a symmetric linear penalty function that reduces the credit
for a match by 0.1 (absolute) for every 15 seconds of mismatch (either early or late) [4]. This
results in the same computation as the well-known mean Generalized Average Precision (mGAP)
measure that was introduced to deal with human assessments of partial relevance [3]. In our
case, the human assessments are binary; it is the degree of match to those assessments that can
be partial. Relevance judgments are drawn without replacement so that only the highest ranked
match (including partial matches) can be scored for any relevance assessment; other potential
matches receive a score of zero. Differences of 150 seconds or more are treated as a
no-match condition, thus not "using up" a relevance assessment.</p>
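      <p>A minimal sketch of this computation follows. It is an illustration under several assumptions that the description above leaves open, not the scoring script used for the track: credit is taken to fall continuously (1 minus error/150) rather than in 15-second steps, a system start time is matched to the unused judgment that gives it the most credit, and start points are represented as (interview code, seconds) pairs. All function names are hypothetical.</p>
      <preformat><![CDATA[
```python
def credit(system, judged, window=150.0):
    """Partial credit for one system start point against one judged start.

    Both arguments are (interview_code, start_seconds) pairs. Credit falls
    linearly from 1.0 at an exact match to 0.0 at `window` seconds of error,
    which is 0.1 per 15 seconds for the default window.
    """
    if system[0] != judged[0]:
        return 0.0
    error = abs(system[1] - judged[1])
    return max(0.0, 1.0 - error / window)


def generalized_average_precision(ranked, judged):
    """GAP for one topic: average precision with partial credit per match."""
    unused = list(judged)      # judgments are drawn without replacement
    running_credit = 0.0
    total = 0.0
    for rank, system in enumerate(ranked, start=1):
        if not unused:
            break
        best = max(unused, key=lambda j: credit(system, j))
        earned = credit(system, best)
        if earned > 0.0:       # zero credit does not use up a judgment
            unused.remove(best)
            running_credit += earned
            total += running_credit / rank
    return total / len(judged) if judged else 0.0


def mean_gap(per_topic):
    """Mean GAP over a list of (ranked, judged) pairs, one pair per topic."""
    scores = [generalized_average_precision(r, j) for r, j in per_topic]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical topic: two judged start times, three ranked suggestions.
ranked = [(1, 115), (1, 400), (2, 100)]
judged = [(1, 100), (2, 100)]
print(generalized_average_precision(ranked, judged))
```
]]></preformat>
      <p>In this example the first suggestion is 15 seconds late and earns 0.9, the second matches nothing, and the third is an exact match, so the topic score is (0.9/1 + 1.9/3) / 2.</p>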
    </sec>
    <sec id="sec-12">
      <title>Relevance Judgments</title>
      <p>Relevance judgments were completed at Charles University in Prague for a total of 29 Czech
topics by subject matter experts who were native speakers of Czech. All relevance assessors had
good English reading skills. Topic selection was performed by individual assessors, subject to the
following factors:</p>
      <p>At least five relevant start times in the Czech collection were required in order to minimize
the effect of quantization noise on the computation of mGAP.</p>
      <p>The greatest practical degree of overlap with topics for which relevance judgments were
available in the English collection was desirable.</p>
      <p>Once a topic was selected, the assessor iterated between topic research (using external
resources) and searching the collection. A new search system was designed to support this
interactive search process. The best channel of the Czech ASR and the manually assigned English
thesaurus terms were indexed as overlapping passages, and queries could be formed using either
or both. Once a promising interview was found, an interactive search within the interview could
be performed using either type of term, and promising regions were identified using a graphical
depiction of the retrieval status value. Assessors could then scroll through the interview using
these indications, the displayed English thesaurus terms, and the displayed ASR transcript as
cues. They could then replay the audio from any point in order to confirm topical relevance. As
they did this, they could indicate the onset and conclusion of the relevant period by designating
points on the transcript that were then automatically converted to times with 15-second
granularity.<sup>4</sup> Only the start times are used for computation of the mGAP measure, but both start and
end times are available for future research.</p>
      <p>Once that search-guided relevance assessment process was completed, the assessors were
provided with a set of additional points to check for topical relevance that were computed using a
pooling technique similar to that used for English. The top 50 start times from every official
run were pooled, duplicates (at one-minute granularity) were removed, and the results were
inserted into the assessment system as system recommendations. Every system recommendation
was checked, although assessors exercised judgment regarding when it would be worthwhile to
actually listen to the audio in order to limit the cost of this "highly ranked" assessment process.
Relevant passages identified in this way were added to those found using search-guided assessment
to produce the final set of relevance judgments (topic 4000 was generalized from a pre-existing
topic).</p>
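      <p>The pooling step described above can be sketched as follows for a single topic. The data layout (a mapping from run name to a ranked list of interview identifier and start time pairs), the function name <monospace>build_pool</monospace>, and the rule that two start times are duplicates when they fall in the same minute of the same interview are all assumptions made for illustration; only the pool depth of 50 and the one-minute granularity come from the text.</p>
      <preformat>
```python
from typing import Dict, List, Tuple

Passage = Tuple[str, int]  # (interview id, start time in seconds); hypothetical layout

POOL_DEPTH = 50      # top 50 start times per official run, as stated in the text
DEDUP_SECONDS = 60   # duplicates removed at one-minute granularity


def build_pool(runs: Dict[str, List[Passage]],
               depth: int = POOL_DEPTH,
               bucket: int = DEDUP_SECONDS) -> List[Passage]:
    """Pool the top-ranked start times of every run for one topic."""
    seen = set()
    pool: List[Passage] = []
    for run_name in sorted(runs):                  # fixed order for reproducibility
        for interview_id, start in runs[run_name][:depth]:
            key = (interview_id, start // bucket)  # same interview, same minute
            if key not in seen:
                seen.add(key)
                pool.append((interview_id, start))
    return pool


if __name__ == "__main__":
    # Two toy runs; names and times are invented for the example.
    runs = {
        "run_a": [("int01", 120), ("int01", 615), ("int02", 30)],
        "run_b": [("int01", 135), ("int02", 45), ("int03", 900)],
    }
    print(build_pool(runs))
```
      </preformat>
      <p>In this toy example the two entries from the second run that fall in an already pooled minute are dropped, so four system recommendations remain for the assessor to check.</p>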
      <p>A total of 1,322 start times for relevant passages were identified, thus yielding an average of
46 relevant passages per topic (minimum 8, maximum 124). Table 2 shows the number of relevant
start times for each of the 29 topics, 28 of which are the same as topics used in the English test
collection.</p>
      <p><sup>4</sup> Several different types of time spans arise when describing evaluation of speech indexing systems. For clarity, we
have tried to stick to the following terms when appropriate: manually defined segments (for English indexing),
15-minute snippets (for ASR training), 15-second increments (for the start and end time of Czech relevance judgments),
relevant passages (identified by Czech relevance assessors), and automatically generated passages (for the quickstart
collection).</p>
      <p>The participating teams all employed existing information retrieval systems to perform
monolingual searches of the quickstart collection.</p>
      <p>The University of Maryland submitted three official runs in which they tried combining all the
fields (Czech ASR text, Czech manual and automatic keywords, and the English translations of
the keywords) to form a unified passage index using Inquery. They compared the retrieval results
based on this index with those based on ASR alone or on the combination of automatic keywords
and ASR text.</p>
      <sec id="sec-12-1">
        <title>University of Ottawa (UO)</title>
        <p>Three runs were submitted from the University of Ottawa for the Czech task using SMART and
one run was submitted using Terrier.</p>
      </sec>
      <sec id="sec-12-2">
        <title>University of West Bohemia (UWB)</title>
        <p>The University of West Bohemia was the only team to apply morphological normalization and
stopword removal for Czech. A classic TF*IDF model was implemented in Lemur, along with the
Lemur implementation of blind relevance feedback. Five runs were submitted for official scoring,
and one additional run was scored locally.</p>
      </sec>
      <sec id="sec-12-3">
        <title>Results</title>
        <p>With two exceptions, the mean Generalized Average Precision (mGAP) values were between 0.0003
and 0.0005. In a simulation study reported in the UWB paper, random permutation of the possible
start times was found to yield an mGAP of 0.0005. We therefore conclude
that none of those runs demonstrated any useful degree of system support for the task.</p>
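        <p>To give a feel for why a random ordering scores so low, the following sketch averages a simplified GAP-like score over random permutations of candidate start times on toy data. It is not the UWB simulation or the track's scoring script: the linear partial-credit function, the 150-second window, the single-timeline setup, and the names <monospace>gap</monospace> and <monospace>random_baseline</monospace> are all assumptions made for illustration.</p>
        <preformat>
```python
import random
from statistics import mean
from typing import List

WINDOW_SECONDS = 150.0  # hypothetical credit window; not taken from the text


def gap(ranked: List[int], relevant: List[int], window: float = WINDOW_SECONDS) -> float:
    """Simplified GAP-like score with linear partial credit (an assumption)."""
    unused = set(relevant)
    cumulative, score = 0.0, 0.0
    for rank, start in enumerate(ranked, start=1):
        if not unused:
            break
        nearest = min(unused, key=lambda judged: abs(judged - start))
        gained = max(0.0, 1.0 - abs(nearest - start) / window)
        if gained > 0.0:
            unused.discard(nearest)     # each judgment can be credited only once
            cumulative += gained
            score += cumulative / rank  # generalized precision at this rank
    return score / len(relevant) if relevant else 0.0


def random_baseline(candidates: List[int], relevant: List[int],
                    trials: int = 200, seed: int = 0) -> float:
    """Average score of random permutations of the candidate start times."""
    rng = random.Random(seed)
    scores = []
    for _ in range(trials):
        ranking = candidates[:]
        rng.shuffle(ranking)
        scores.append(gap(ranking, relevant))
    return mean(scores)


if __name__ == "__main__":
    # Toy data: one candidate start time every 15 seconds across ten hours of audio.
    candidates = list(range(0, 36000, 15))
    relevant = [900, 12345, 30000]
    print(f"random baseline: {random_baseline(candidates, relevant, trials=20):.4f}")
```
        </preformat>
        <p>A system whose score is indistinguishable from such a baseline is providing no measurable help in locating relevant start times, which is the comparison drawn above.</p>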
        <p>Two runs yielded more interesting results. The best official run, from UO, achieved an mGAP of
0.0039, and a run that was locally scored at UWB achieved an mGAP of 0.0015. Interestingly, these
are two of the three runs in which the ENGLISHMANUALKEYWORD field was used. A
positive influence from that factor would require that untranslated English terms (e.g., place names)
match terms that were present in the topic descriptions (either with or without morphological
normalization). The UWB paper provides an analysis that suggests that the beneficial effect of
using that field may be limited to a single topic.</p>
        <p>Run names: uoCzEnTDNsMan, uoCzTDNsMan, uoCzEnTDt, umd.asr, uoCzTDNs, uoCzTDs, UWB mk aTD,
UWB mk a akTD, UWB mk a akTDN, umd.akey.asr, UWB aTD, UWB a akTD, umd.all</p>
        <p>The use of overlapping passages in the quickstart collection probably reduced mGAP values
substantially because the design of the measure tends to penalize duplication. Specifically, the
start time of the highest-ranking passage that matches a passage start time in the relevance
judgments will "use up" that judgment. Subsequent passages in which the same matching terms
were present would then receive no credit at all (even if they were closer matches). We had
originally intended the quickstart collection to be used only for out-of-the-box sanity checks, with
the idea that teams would either modify the quickstart scripts or create new systems outright
to explore a broader range of possible system designs. Time pressure and a lack of a suitable
training collection precluded that sort of experimentation, however, and the result was that this
undesirable effect of passage overlap affected every system.</p>
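        <p>The way a higher-ranked passage consumes a judgment can be illustrated with a minimal sketch for one topic within a single interview. The linear credit function, the 150-second window, and the names <monospace>credit</monospace> and <monospace>gap</monospace> are assumptions made for illustration; this is not the track's scoring script.</p>
        <preformat>
```python
from typing import List

WINDOW_SECONDS = 150.0  # hypothetical width of the credit window


def credit(error_seconds: float, window: float = WINDOW_SECONDS) -> float:
    """Hypothetical partial credit: 1 at zero error, falling linearly to 0."""
    return max(0.0, 1.0 - abs(error_seconds) / window)


def gap(ranked_starts: List[float], relevant_starts: List[float]) -> float:
    """GAP-like score for one topic within a single interview."""
    unused = list(relevant_starts)
    total_judgments = len(relevant_starts)
    cumulative = 0.0
    score = 0.0
    for rank, start in enumerate(ranked_starts, start=1):
        if not unused:
            break
        nearest = min(unused, key=lambda judged: abs(judged - start))
        gained = credit(nearest - start)
        if gained > 0.0:
            unused.remove(nearest)      # the judgment is now used up
            cumulative += gained
            score += cumulative / rank  # generalized precision at this rank
    return score / total_judgments if total_judgments else 0.0


if __name__ == "__main__":
    judged = [600.0]                    # one relevant start time, in seconds
    print(gap([510.0, 600.0], judged))  # overlapping passage ranked first
    print(gap([600.0, 510.0], judged))  # exact match ranked first
```
        </preformat>
        <p>With a single judged start time at 600 seconds, the ranking that places the passage starting at 510 seconds first scores about 0.4 under these assumptions, whereas the ranking that places the exact match first scores 1.0. The exact match at rank two earns nothing because the judgment has already been consumed.</p>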
        <p>Other possible explanations for the relatively poor results also merit further investigation. This
is the first time that mGAP has been used in this way to evaluate actual system results, so it is
possible that the measure is poorly designed or that there is a bug in the scoring script. Simulation
studies suggest that this is not likely to be the case, however. This is also the first time that Czech
ASR has been used, and it is the first time that relevance assessment has been done in Czech
(using a newly designed system). So there are many possible factors that need to be explored.
This year's Czech collection is exactly what we need for such an investigation, so it should be
possible to make significant progress over the next year.</p>
        <sec id="sec-12-3-1">
          <title>Conclusion and Future Plans</title>
          <p>The CLEF 2006 CL-SR track extended the previous year's work on the English task by adding new
topics, and introduced a new Czech task with a new unknown-boundary evaluation condition. The
results of the English task suggest that the evaluation topics this year posed somewhat greater
difficulty for systems doing fully automatic indexing. Studying what made these topics more
difficult would be an interesting direction for future work. However, the most significant achievement
of this year's track was the development of a CL-SR test collection based on a more realistic
unknown-boundary condition. Now that we have both that collection and an initial set of system
designs, we are in a good position to explore issues of system and evaluation design that clearly
have not yet been adequately resolved.</p>
          <p>We expect that it would be possible to continue the CLEF CL-SR track in 2007 if there is
sufficient interest. For Czech, it may be possible to obtain relevance judgments for additional
topics, perhaps increasing to a total of 50 the number of topics that the track can leave as a legacy
for use by future researchers. Developing additional topics for English seems to be less urgent (and
perhaps less practical), but we do expect to be able to provide additional automatically generated
indexing data (either ASR for additional interviews, word lattices in some form, or both) if there
is interest in further work with the English collection. Some unique characteristics of the CL-SR
collection may also be of interest to other tracks, including domain-specific retrieval and GeoCLEF.
We look forward to discussing these and other issues when we meet in Alicante!</p>
        </sec>
        <sec id="sec-12-3-2">
          <title>Acknowledgments</title>
          <p>This track would not have been possible without the efforts of a great many people. Our
heartfelt thanks go to the dedicated group of relevance assessors in Maryland and Prague, to the
Dutch, French and Spanish teams that helped with topic translation, and to Bill Byrne, Martin
Cetkovsky, Bonnie Dorr, Ayelet Goldin, Sam Gustman, Jan Hajic, Jimmy Lin, Baolong Liu, Craig
Murray, Scott Olsson, Bhuvana Ramabhadran and Deborah Wallace for their help with creating
the techniques, software, and data sets on which we have relied.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Broglio</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Callan</surname>
            ,
            <given-names>J. P.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Croft</surname>
          </string-name>
          , W. B.:
          <article-title>INQUERY System Overview</article-title>
          .
          <source>In Proceedings of the Annual Meeting of the Association for Computational Linguistics</source>
          , pages
          <volume>47</volume>
          –
          <fpage>67</fpage>
          ,
          <year>1993</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Buckley</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Allan</surname>
          </string-name>
          , J.:
          <article-title>Automatic retrieval with locality information using SMART</article-title>
          .
          <source>In Proceedings of the First Text REtrieval Conference (TREC-1)</source>
          , pages
          <fpage>59</fpage>
          –
          <fpage>72</fpage>
          . NIST Special Publication 500-
          <issue>207</issue>
          ,
          <year>1993</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Kekalainen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Jarvelin</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Using graded relevance assessments in IR evaluation</article-title>
          .
          <source>Journal of the American Society for Information Science and Technology</source>
          ,
          <volume>53</volume>
          (
          <issue>13</issue>
          ):1120–
          <fpage>1129</fpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Oard</surname>
            ,
            <given-names>D. W.</given-names>
          </string-name>
          :
          <article-title>One-sided measures for evaluating ranked retrieval effectiveness with spontaneous conversational speech</article-title>
          .
          <source>In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , pages
          <volume>673</volume>
          –
          <fpage>674</fpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Murray</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dorr</surname>
            ,
            <given-names>B. J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hajic</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Pecina</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Leveraging Reusability: Cost-effective Lexical Acquisition for Large-scale Ontology Translation</article-title>
          .
          <source>In Proceedings of the Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Olsson</surname>
            ,
            <given-names>J. S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oard</surname>
            ,
            <given-names>D. W.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Hajic</surname>
          </string-name>
          , J.:
          <article-title>Cross-language text classification</article-title>
          .
          <source>In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval:</source>
          <volume>645</volume>
          –
          <fpage>646</fpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Ounis</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Amati</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Plachouras</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>He</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Macdonald</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Johnson</surname>
          </string-name>
          , D.:
          <article-title>Terrier Information Retrieval Platform</article-title>
          .
          <source>In Proceedings of the 27th European Conference on Information Retrieval (ECIR 05)</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Robertson</surname>
            ,
            <given-names>S. E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zaragoza</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Taylor</surname>
          </string-name>
          , M.:
          <article-title>Simple BM25 Extension to Multiple Weighted Fields</article-title>
          ,
          <source>Proceedings of the 13th ACM International Conference on Information and Knowledge Management</source>
          , pages
          <volume>42</volume>
          –
          <fpage>49</fpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Shafran</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Byrne</surname>
          </string-name>
          , W.:
          <article-title>Task-Specific Minimum Bayes-Risk Decoding using Learned Edit Distance</article-title>
          ,
          <source>In Proceedings of INTERSPEECH2004-ICSLP</source>
          , vol.
          <volume>3</volume>
          , pages
          <year>1945</year>
          –
          <year>1948</year>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Shafran</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Hall</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Corrective Models for Speech Recognition of Inflected Languages</article-title>
          ,
          <source>Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>White</surname>
            ,
            <given-names>R. W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oard</surname>
            ,
            <given-names>D. W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>G. J. F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soergel</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>Overview of the CLEF-2005 Cross-Language Speech Retrieval Track</article-title>
          ,
          <source>Proceedings of the CLEF 2005 Workshop on Cross-Language Information Retrieval and Evaluation</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>