<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The University of West Bohemia at CLEF 2006, the CL-SR track</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pavel Ircing</string-name>
          <email>ircing@kky.zcu.cz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luděk Müller</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of West Bohemia</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>The paper describes the system built by the team from the University of West Bohemia for participation in the CLEF 2006 CL-SR track. We decided to concentrate solely on monolingual searching in the Czech test collection. We employed a Czech morphological analyser and tagger to perform the necessary linguistic preprocessing (lemmatization and stop-word removal). For the actual search system, we used the classical tf.idf approach with blind relevance feedback, as implemented in the LEMUR toolkit. Since the results are currently very close to zero and appear to behave rather randomly, no conclusions can be drawn at this moment. Several hypotheses concerning the possible causes of the system failure are currently under investigation.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>This paper presents the first participation of the University of West Bohemia group in CLEF (and, for that matter, the group's first participation in any IR evaluation campaign). Being novices in the IR field, we therefore decided to concentrate solely on monolingual searching in the Czech test collection, where we tried to exploit the two advantages our team might have over the others: knowledge of the language in question (Czech, our mother tongue) and experience with automatic NLP of that language, together with the necessary tools (a morphological analyzer and a tagger) at our disposal.</p>
      <p>As for the actual search side of the task, various teams experimenting with last year's English test collection have shown that good results can be achieved simply by using a freely available IR system (see, for example, [4]). We decided to adopt the same strategy.</p>
      <p>Although both the English and the Czech CL-SR collections consist of (automatic) transcriptions of interviews with Holocaust survivors (plus some additional metadata; see the description of the collections in the track overview), the Czech collection lacks the manually created topical segmentation that is available for the English data. This obviously makes the retrieval more complicated. Thus, in order to facilitate the initial experiments with the Czech collection, the track organizers also provided a so-called Quickstart collection with artificially defined “documents” created by sliding a 3-minute window over the continuous stream of transcriptions with a 1-minute step. The total number of those “documents” in the collection is 11,377. Given the lack of time for experimentation, the presence of many other system parameters and the absence of training topics, we did not explore any segmentation possibilities beyond this Quickstart collection in our experiments.</p>
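The sliding-window construction described above can be sketched as follows. This is our own minimal illustration, not the organizers' actual code; it assumes the transcript is available as time-aligned tokens, which is consistent with the ASR output described in the track overview.

```python
def quickstart_segments(words, window=180, step=60):
    """Cut a time-aligned transcript into overlapping pseudo-documents.

    words: list of (start_time_seconds, token) pairs for one interview.
    Returns (window_start, text) pairs: 3-minute windows taken every minute,
    mirroring the Quickstart collection construction described above.
    """
    if not words:
        return []
    total = int(max(t for t, _ in words))
    docs = []
    for start in range(0, total + 1, step):
        end = start + window
        # keep tokens whose start time falls inside the current window
        segment = [w for t, w in words if end > t >= start]
        if segment:
            docs.append((start, " ".join(segment)))
    return docs
```

With a 3-minute window and a 1-minute step, each token ends up in at most three consecutive pseudo-documents.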
    </sec>
    <sec id="sec-2">
      <title>System description</title>
      <p>2.1 Linguistic preprocessing</p>
      <p>At least rudimentary linguistic processing of the document collection and topics (stemming, stop-word removal) is considered indispensable in state-of-the-art IR systems. We decided to use rather sophisticated NLP tools for that purpose: the morphological analyzer and tagger developed by the team around Jan Hajič [2], [3]. The serial combination of these two tools assigns a disambiguated lemma (basic word form) and a morphological tag to each input word form and also provides information about the stem-ending partitioning.</p>
      <p>This is an example of the typical system output:</p>
      <p>&lt;f&gt;holokaustem&lt;MDl&gt;holokaust&lt;MDt&gt;NNIS7-----A----&lt;R&gt;holokaust&lt;E&gt;em</p>
      <p>Here &lt;f&gt; introduces the actual word form, &lt;MDl&gt; the corresponding lemma and &lt;MDt&gt; the corresponding morphological tag (in this case the tag correctly describes the word holokaustem as a noun (N in the first position) of masculine inanimate gender (I) in the singular (S) instrumental (7) case). Finally, &lt;R&gt; introduces the stem and &lt;E&gt; the ending of the word form in question. Note that although in this example the stem is identical to the lemma, this is not a general rule, and we believe that lemmatization should be preferred over stemming in IR experiments with highly inflectional languages such as Czech. Therefore all our submitted experiments use the lemmatized version of the collection and topics.</p>
      <p>The information provided by the NLP tools was also exploited for stop-word removal. As we were not able to find any decent stoplist of Czech words, we decided to remove words on the basis of their part of speech (POS). As can be seen from the example above, the POS information occupies the first position of the morphological tag. We removed from indexing all words tagged as prepositions, conjunctions, particles and interjections (note that there are no articles in Czech).</p>
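In code, this POS-based filtering amounts to checking the first character of each morphological tag. A minimal sketch follows; the tag letters R, J, T and I (prepositions, conjunctions, particles, interjections) follow the Prague positional tagset used by the tagger, and the example tokens in the usage note are illustrative.

```python
# First character of a Prague positional morphological tag is the POS.
# R = preposition, J = conjunction, T = particle, I = interjection.
STOP_POS = {"R", "J", "T", "I"}

def remove_stopwords(tagged_tokens):
    """Keep the lemmas of (lemma, tag) pairs whose POS class is not stopped."""
    return [lemma for lemma, tag in tagged_tokens if tag[0] not in STOP_POS]
```

For example, filtering the tagged tokens for "holokaust" (noun), "v" (preposition) and "a" (conjunction) keeps only "holokaust".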
      <p>Here is an example of one of the topics before and after the linguistic preprocessing. The original topic</p>
      <p>&lt;top&gt;
&lt;num&gt;1286&lt;/num&gt;
&lt;title&gt;Hudba v holokaustu&lt;/title&gt;
&lt;desc&gt;Svědectví o tom, zda hudba pomáhala (duševně nebo i jinak) nebo překážela vězňům internovaným v koncentračních táborech.&lt;/desc&gt;
&lt;narr&gt;Popis toho, jakou roli hrála hudba v životě vězňů.&lt;/narr&gt; &lt;/top&gt;</p>
      <p>gets processed into</p>
      <p>&lt;top&gt;
&lt;num&gt;1286&lt;/num&gt;
&lt;title&gt;hudba holokaust&lt;/title&gt;
&lt;desc&gt;svědectví ten hudba pomáhat duševně jinak překážet vězeň internovaný koncentrační tábor&lt;/desc&gt;
&lt;narr&gt;Popis ten jaký role hrát hudba život vězeň&lt;/narr&gt; &lt;/top&gt;</p>
      <p>2.2 Retrieval</p>
      <p>For the actual IR we used the freely available LEMUR toolkit [1], which supports various retrieval strategies, including, among others, the classical vector space model and the language modeling approach.</p>
      <p>We decided to stick to the tf.idf model, where both documents and queries are represented as weighted term vectors d~i = (wi,1, wi,2, · · · , wi,n) and ~qk = (wk,1, wk,2, · · · , wk,n), respectively (n denotes the total number of distinct terms in the collection). The inner product of such weighted term vectors then determines the similarity between individual documents and queries. As there are many ways to compute the weights wi,j, none of them performing consistently better than the others, we employed the very basic formula
wi,j = tfi,j · log (d / dfj)
where tfi,j denotes the number of occurrences of the term tj in the document di (term frequency), d is the total number of documents in the collection and, finally, dfj denotes the number of documents that contain tj.</p>
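A minimal sketch of this weighting scheme and the inner-product similarity is given below. This is plain Python for illustration, not the LEMUR implementation; the natural logarithm is an assumption, as the formula above does not fix the base.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one sparse weight dict per document,
    with weight(i, j) = tf(i, j) * log(d / df(j)) as in the formula above."""
    d = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # document frequency counts each doc once
    vectors = []
    for doc in docs:
        tf = Counter(doc)            # term frequency within this document
        vectors.append({t: tf[t] * math.log(d / df[t]) for t in tf})
    return vectors

def similarity(q, v):
    """Inner product of two sparse weighted term vectors."""
    return sum(w * v.get(t, 0.0) for t, w in q.items())
```

Note that a term occurring in every document gets weight zero, which is the usual behavior of the idf component.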
      <p>
        In order to boost the performance, we also used the simplified version of Rocchio's blind relevance feedback implemented in LEMUR [
        <xref ref-type="bibr" rid="ref2">7</xref>
        ]. The original Rocchio algorithm is defined by the formula
~qnew = ~qold + α · d~R − β · d~R¯
where R and R¯ denote the sets of relevant and non-relevant documents, respectively, and d~R and d~R¯ denote the corresponding centroid vectors of those sets. In other words, the basic idea behind this algorithm is to move the query vector closer to the relevant documents and away from the non-relevant ones. In the case of blind feedback, the top M documents from the first-pass run are simply considered relevant. The LEMUR modification of this algorithm sets β = 0 and keeps only the K top-weighted terms of d~R.
      </p>
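The simplified feedback step might be sketched as follows. This is our own illustration of the scheme described above, not LEMUR code; alpha, M and K are the free parameters, and since beta is fixed to zero the non-relevant centroid is never subtracted.

```python
def blind_feedback(query, ranked_docs, alpha=1.0, top_m=10, keep_k=20):
    """Expand a sparse query vector with the centroid of the top-M
    first-pass document vectors, keeping only the K heaviest feedback terms."""
    feedback = ranked_docs[:top_m]
    if not feedback:
        return dict(query)
    # centroid of the top-M documents, which are assumed to be relevant
    centroid = {}
    for vec in feedback:
        for term, w in vec.items():
            centroid[term] = centroid.get(term, 0.0) + w / len(feedback)
    # keep only the K top-weighted centroid terms (beta = 0: the
    # non-relevant documents are ignored entirely)
    top_terms = sorted(centroid, key=centroid.get, reverse=True)[:keep_k]
    new_query = dict(query)
    for term in top_terms:
        new_query[term] = new_query.get(term, 0.0) + alpha * centroid[term]
    return new_query
```

The expanded query is then simply re-run against the collection for the second pass.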
    </sec>
    <sec id="sec-3">
      <title>Description of the runs</title>
      <p>As we already mentioned in the Introduction, all the experiments were carried out on the Czech Quickstart collection, using only the Czech version of the queries. The linguistic preprocessing and the retrieval method described in Sections 2.1 and 2.2, respectively, were the same for all the runs submitted to the official evaluation. Those runs were the following:</p>
      <p>UWB aTD
Query fields used: &lt;title&gt; (T) and &lt;desc&gt; (D)
Collection fields used: &lt;ASRTEXT&gt; only</p>
      <p>UWB a akTD
Query fields used: TD
Collection fields used: &lt;ASRTEXT&gt; and &lt;CZECHAUTOKEYWORD&gt;</p>
      <p>UWB mk aTD
Query fields used: TD
Collection fields used: &lt;CZECHMANUKEYWORD&gt; and &lt;ASRTEXT&gt;</p>
      <p>UWB mk a akTD
Query fields used: TD
Collection fields used: &lt;CZECHMANUKEYWORD&gt;, &lt;ASRTEXT&gt; and &lt;CZECHAUTOKEYWORD&gt;</p>
      <p>UWB mk a akTDN
Query fields used: TD and &lt;narr&gt; (N)
Collection fields used: &lt;CZECHMANUKEYWORD&gt;, &lt;ASRTEXT&gt; and &lt;CZECHAUTOKEYWORD&gt;</p>
      <p>Since the official results of the other teams participating in the track revealed that using only the manual keywords gives the best results, we generated one additional run:</p>
      <p>UWB mkTD
Query fields used: &lt;title&gt; (T) and &lt;desc&gt; (D)
Collection fields used: &lt;CZECHMANUKEYWORD&gt; only</p>
      <p>A total of six runs may seem too small to assess the behavior of the task; however, the reasons why we refrained from additional runs are explained in detail in the following section.</p>
    </sec>
    <sec id="sec-4">
      <title>Results and their analysis</title>
      <p>There are 115 queries defined for searching in the Czech test collection; however, only 29 of them were manually evaluated by the assessors and used to generate the qrel files. Table 1 summarizes the results for the runs described above (the prefix UWB is omitted due to formatting issues). The mean Generalized Average Precision (GAP) is used as the evaluation metric; details about this measure can be found in [5].</p>
      <p>Table 1: Mean GAP of the submitted runs.
Run          Mean GAP
aTD          0.0003
a akTD       0.0003
mk aTD       0.0004
mk a akTD    0.0004
mk a akTDN   0.0004
mkTD         0.0015</p>
      <p>As can be seen from the table, the achieved results are very close to zero, especially for the runs employing any of the automatic fields. Moreover, when we tried the same runs with the original word forms or the stems instead of the lemmas, we discovered that the mean GAP generally remains unchanged, while at the same time the GAP for individual topics varies quite wildly. This led us to the hypothesis that the results are completely random. In order to confirm or reject this hypothesis, we generated 100 different runs, each putting 1,000 randomly selected documents in the ranked list for each of the topics. The average mean GAP of these runs exceeds 0.0005, which is actually more than the results achieved by the runs involving any of the automatic fields.</p>
      <p>As for the run using only the manual keywords, the GAP is slightly better but behaves no less wildly. In general, the mean GAP result depends on the result achieved on a single topic (number 14312), as the GAPs of the remaining topics are more or less zero. Moreover, when we accidentally ran the non-lemmatized topics against the lemmatized collection, the GAP of that topic jumped from 0.0325 to 0.1000, causing the mean GAP to jump from 0.0015 to 0.0046.</p>
      <p>We are therefore inclined to conclude that the results are indeed currently at the random level. The question is why. The first possibility that comes to mind is some fundamental flaw in the design of the search system itself. This is quite improbable, as all three teams that participated in the track achieved comparable results, and at least two of these teams (that is, all except us) have significant experience with designing IR systems for similar evaluation campaigns.</p>
      <p>
        In our opinion, one of the reasons for the failure is the immense difficulty of the task in question. This difficulty stems mainly from the following factors:
1. The collection lacks topical segmentation; the segments in the Quickstart collection are in most cases not topically coherent.
2. The quality of the ASR transcriptions is rather poor (around 40% WER), but that is a problem shared by both the Czech and the English collections.
3. The quality of the automatic keyword assignment is generally very low; this is probably caused by the fact that the assignment had to be done in a complicated cross-language manner due to the lack of annotated Czech training data (see [
        <xref ref-type="bibr" rid="ref1">6</xref>
        ]).
4. There appears to be a non-negligible vocabulary mismatch between the topics and the collection, or even between the different fields of the collection. For example, just looking at the first two topics evaluated by the assessors, we discovered that in topic 1181 the name of the infamous concentration camp “Auschwitz” was kept untranslated in the topics but was translated into its Czech form (“Osvětim”) in the &lt;CZECHMANUKEYWORD&gt; and &lt;CZECHAUTOKEYWORD&gt; fields (see footnote 2); similarly, the word “Sonderkommando” was written with a double “m” in the topics and in the &lt;ASRTEXT&gt; field but with a single “m” in the keyword fields.
5. Some of the salient words from the topics hardly appear in the collection at all. For example, the most important word of topic 1166 (“Hasidism”, or its variants) appears only 3 times in the entire collection, in all cases in the &lt;CZECHAUTOKEYWORD&gt; field, and in all cases it is placed there incorrectly.
      </p>
      <p>There is also a remote possibility that the problem lies on the evaluation side. In any case, all these issues will be the subject of a more detailed investigation in the near future.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>The Czech CL-SR track represents the first attempt to create and test a collection of Czech spontaneous speech. As such (and given the inherent difficulty of the task in question), it seems to suffer from some initial problems that first have to be precisely identified and then, hopefully, solved.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was supported by the Grant Agency of the Czech Academy of Sciences, project No. 1ET101470416, and by the Ministry of Education of the Czech Republic, project No. LC536.</p>
      <p>Footnote 2: Note that neither of those variants is the original name of the Polish town in question, “Oświęcim”. Consequently, all three forms are routinely used by Czech speakers and therefore appear in the ASR transcripts.</p>
    </sec>
    <sec id="sec-7">
      <title>References</title>
      <p>[1] The Lemur Toolkit. http://www.lemurproject.org/.</p>
      <p>[2] Jan Hajič. Disambiguation of Rich Inflection (Computational Morphology of Czech). Karolinum, Prague, 2004.</p>
      <p>[3] Jan Hajič and Barbora Hladká. Tagging Inflective Languages: Prediction of Morphological Categories for a Rich, Structured Tagset. In Proceedings of the COLING-ACL Conference, pages 483–490, Montreal, Canada, 1998.</p>
      <p>[4] Diana Inkpen, Muath Alzghool, and Aminul Islam. University of Ottawa’s Contribution to CLEF 2005, the CL-SR Track. In Working Notes for the CLEF 2005 Workshop, Vienna, Austria, 2005.</p>
      <p>[5] Baolong Liu and Douglas Oard. One-Sided Measures for Evaluating Ranked Retrieval Effectiveness with Spontaneous Conversational Speech. In Proceedings of SIGIR 2006, pages 673–674, Seattle, Washington, USA, 2006.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Scott</given-names>
            <surname>Olsson</surname>
          </string-name>
          , Douglas Oard, and
          <string-name>
            <given-names>Jan</given-names>
            <surname>Hajič</surname>
          </string-name>
          .
          <article-title>Cross-Language Text Classification</article-title>
          . In
          <source>Proceedings of SIGIR 2005</source>
          , pages
          <fpage>645</fpage>
          -
          <lpage>646</lpage>
          , Salvador, Brazil,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Chengxiang</given-names>
            <surname>Zhai</surname>
          </string-name>
          .
          <source>Notes on the Lemur TFIDF Model</source>
          . Note with Lemur 1.9 documentation, School of CS, CMU,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>