<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Combining Probabilistic and Translation-Based Models for Information Retrieval based on Word Sense Annotations</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Elisabeth Wolf</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Delphine Bernhard</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Iryna Gurevych</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>General Terms</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Information Retrieval, Probabilistic Retrieval Model, Monolingual Translation-based Model, Word
Sense Disambiguation</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Measurement</institution>
          ,
          <addr-line>Performance, Experimentation</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>UKP Lab, Technische Universität Darmstadt</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we describe our experiments carried out for the robust word sense disambiguation (WSD) track of the CLEF 2009 campaign. This track consists of a monolingual and a bilingual task and addresses information retrieval utilizing word sense annotations. We took part in the monolingual task only. Our objective was twofold. On the one hand, we intended to increase the precision of WSD by a heuristic-based combination of the annotations of the two WSD systems. For this, we provide an extrinsic evaluation on different levels of word sense accuracy. On the other hand, we aimed at combining a widely used probabilistic model, namely the Divergence From Randomness BM25 model (DFR BM25), with a monolingual translation-based model. Our best performing systems, both with and without word senses, ranked 1st overall in the monolingual task. However, we could not observe any improvement by applying the sense annotations compared to the retrieval settings based on tokens or lemmas only.</p>
      </abstract>
      <kwd-group>
        <kwd>H.3 [Information Storage and Retrieval]</kwd>
        <kwd>H.3.1 Content Analysis and Indexing</kwd>
        <kwd>H.3.3 [Information Search and Retrieval]</kwd>
        <kwd>Performance evaluation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>http://www.ukp.tu-darmstadt.de
The CLEF robust word sense disambiguation (WSD) track aims at promoting the development
and evaluation of textual document retrieval systems utilizing word sense disambiguated data.
Participants in this task are provided with topics and documents from previous CLEF campaigns
which were annotated by two different WSD systems. The WSD track consists of two independent
tasks: a monolingual task with topics and documents in English and a bilingual task with Spanish
topics and English documents. In both tasks, the goal is to analyze and compare the performance
of retrieval systems on the document collection with and without word sense information. We
took part in the monolingual task only.</p>
      <p>
        The CLEF robust WSD track was first introduced in 2008 and received submissions from eight
groups plus two late submissions. The participants submitted runs by different systems varying
in the pre-processing steps, indexing procedures, ranking functions, the application of query
expansion methods, and the integration of word senses. The best performance was achieved
by a combination of different probabilistic models [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], namely the BM25 model [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], the I(ne)C2
model (a Divergence From Randomness version of the BM25 model), and a statistical language
model introduced by Hiemstra [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. With this combination approach, the authors obtained the
highest mean average precision (MAP) over all submitted retrieval systems. All participants
took only one of the two WSD systems into account when selecting the word sense
annotations. According to the MAP, most submitted systems obtained higher performance on the plain
document collection than on the word sense annotated corpus. Only some participants were able
to slightly improve the performance by utilizing word sense information in their system. However,
it is questionable whether these improvements were significant. The WSD task using the same
document collection was repeated in 2009 in order to further investigate the performance on WSD
annotated data.
      </p>
      <p>
        Our motivation in the robust WSD task is twofold. On the one hand, we intended to increase the
precision of WSD by a heuristic-based combination of the annotations of the two WSD systems.
We provide an extrinsic evaluation on different levels of word sense accuracy. On the other hand,
we aimed at combining a widely used probabilistic model with a monolingual translation-based
model, which was trained on definitions and glosses provided by different lexical semantic
resources, namely WordNet, Wiktionary, Wikipedia, and Simple Wikipedia. This translation-based
model was successfully used for the task of answer finding by Bernhard and Gurevych [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. We
report all index and retrieval settings and the performance on the training and test data.
The paper is organized as follows. In the next section, we describe the provided document
collection and define our indexing and retrieval approach in detail. We define the applied probabilistic
and translation-based models as well as the method to combine both of them. In Section 3, we
report and discuss our evaluation results, and finally, in Section 4 we conclude our experiments.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Experiments</title>
      <sec id="sec-2-1">
        <title>Document collection</title>
        <p>
          The document collection consists of Los Angeles Times 1994 and Glasgow Herald 1995 English
national newspaper articles which were used for CLEF 2001. It comprises around 169,000 documents
(113,000 Los Angeles Times documents; 56,500 Glasgow Herald documents). The documents were
annotated with two different word sense disambiguation systems provided by the IXA NLP Group
at the University of the Basque Country (UBC) [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] and the Department of Computer Science
at the National University of Singapore (NUS) [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. The UBC system is based on a combination
of k-nearest neighbor classifiers. Each classifier learns from a distinct set of features. The set
of features comprises, e.g., syntactic, collocation, and bag-of-words features as well as features
learned from a reduced space via Singular Value Decomposition. The NUS approach extracts
similar features from English-Chinese parallel corpora, the SEMCOR corpus, and the DSO corpus. Based
on the extracted features, an SVM is trained for each open-class word. Both WSD systems were
among the top performing systems in the lexical sample and all-words WSD subtasks of
SemEval-2007 [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. For the fine-grained all-words WSD task, the NUS system obtained an accuracy of
0.587, while the UBC system's accuracy was 0.544. The final collection contains three different
corpora: (i) the plain corpus, (ii) a corpus where each token is annotated with a lemma as well
as multiple senses and probability scores using the UBC system, and (iii) a corpus with the same
annotations from the NUS system. Word sense annotations refer to synsets in WordNet version
1.6.
        </p>
        <p>Training (150) and test (160) topics consist of a combination of CLEF topics from previous
challenges and are annotated with word sense information as well. Each topic consists of a brief title,
a one-sentence description, and a more detailed narrative field specifying the relevance assessment
criteria. Participants were instructed to create their queries only from the title and description
fields. Documents and topics are provided in XML format.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Indexing</title>
        <p>
          We used Terrier (TERabyte RetrIEveR) [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], version 2.1, for indexing the documents. This
framework provides state-of-the-art retrieval and query expansion models, such as the commonly used
Divergence From Randomness (DFR) BM25 probabilistic model. During the training phase, we
determined the best performing combination of retrieval and query expansion method.
Each document is represented by its tokens. Each token is assigned a lemma and multiple word
senses. The accuracy of word sense annotations can strongly influence the retrieval performance
when utilizing word senses (see e.g. Sanderson [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]). Therefore, we extrinsically analyzed the
automatically annotated word senses based on information retrieval experiments. The original
document collection consists of approximately 100 Mil. tokens including stop words. The NUS
annotated corpus comes with around 199 Mil. sense annotations including the sense probability
scores, i.e. on average 2 senses per token. The UBC annotated corpus contains around 275
Mil. sense annotations and probability scores, i.e. on average 2.75 senses per token. Preliminary
experiments on the training topics have shown that restricting the incorporated senses to the
highest scored sense for each token increases the MAP of retrieval.
        </p>
        <p>Further, we hypothesize that combining the NUS and UBC sense assignments increases the
precision of annotated word senses. Therefore, we created several indices for our experiments. Each
index consists of three fields, namely token, lemma, and sense. The indexed senses vary in the
way they are selected. Four different indices were created: (i) an index with the highest scored
UBC sense for each token (UBCBest), (ii) an index with the highest scored NUS sense for each
token (NUSBest), (iii) an index with senses that were assigned by both systems and have the
greatest sum of scores (CombBest), and finally (iv) an index with senses as in (iii), but where we
chose the sense with the highest score from the UBC or NUS corpus when the set of senses that
were assigned by both systems is empty (CombBest+). The construction of CombBest can be
formally described by:
\[
\mathrm{sense}(t) = \operatorname*{argmax}_{s \in S(t)} \bigl( \mathrm{score}_{UBC}(s) + \mathrm{score}_{NUS}(s) \bigr) \tag{1}
\]
with $S(t) = S_{UBC}(t) \cap S_{NUS}(t)$, where $S_{UBC}(t)$ is the set of senses of token $t$ obtained from the
UBC system and $S_{NUS}(t)$ is the sense set accordingly obtained from the NUS system. Thus, $S(t)$
is the intersection of the senses of token $t$ annotated by the UBC and NUS systems. Further,
$\mathrm{score}_{UBC}(s)$ and $\mathrm{score}_{NUS}(s)$ are the probability scores assigned to sense $s$ by the UBC and NUS
system, respectively. Accordingly, CombBest+ is defined as:
\[
\mathrm{sense}(t) =
\begin{cases}
\operatorname*{argmax}_{s \in S(t)} \bigl( \mathrm{score}_{UBC}(s) + \mathrm{score}_{NUS}(s) \bigr) \quad \text{if } S(t) \neq \emptyset \\
\operatorname*{argmax}_{s \in S^{+}(t)} \mathrm{score}_{UBC,NUS}(s) \quad \text{otherwise}
\end{cases} \tag{2}
\]
where $S^{+}(t) = S_{UBC}(t) \cup S_{NUS}(t)$ is the union of the sense sets of token $t$ from the UBC and
NUS systems, and $\mathrm{score}_{UBC,NUS}(s)$ is the score of $s$ in whichever of the two annotations contains it.</p>
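        <p>The two selection rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the per-token annotations are assumed to be available as dictionaries mapping a sense identifier to its probability score, the function names and the alphabetical tie-breaking are ours, and a token with no shared sense is assumed to stay without a sense in CombBest.</p>

```python
from typing import Dict, Optional

Scores = Dict[str, float]  # hypothetical layout: sense id -> probability score


def comb_best(ubc: Scores, nus: Scores) -> Optional[str]:
    """CombBest: among senses assigned by both systems, pick the largest score sum."""
    shared = set(ubc).intersection(nus)
    if not shared:
        return None  # assumption: the token gets no sense in the CombBest index
    # sorted() makes ties deterministic (alphabetical); this is our choice.
    return max(sorted(shared), key=lambda s: ubc[s] + nus[s])


def comb_best_plus(ubc: Scores, nus: Scores) -> Optional[str]:
    """CombBest+: like CombBest, else the single highest score over the union."""
    best = comb_best(ubc, nus)
    if best is not None:
        return best
    # The intersection is empty here, so every sense carries exactly one score.
    union = {**ubc, **nus}
    if not union:
        return None
    return max(sorted(union), key=union.get)
```

        <p>For example, with UBC scores {"s1": 0.6, "s2": 0.4} and NUS scores {"s2": 0.7, "s3": 0.3}, both functions select "s2", the only sense in the intersection; the sense identifiers are made up for the example.</p>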
        <p>We created multifield indices including fields for tokens, lemmas, and word senses. Prior
to indexing, we applied standard stopword removal. Without stopwords, all indices consist of
approximately 40.7 Mil. tokens. As shown in the third column of Table 1, the UBCBest index
contains around 34.1 Mil. senses and the NUSBest index around 34.5 Mil. senses, i.e. 6.6
Mil. and 6.2 Mil. tokens are not annotated with any sense in the UBCBest and NUSBest index,
respectively. The CombBest index contains only 31.7 Mil. senses, while the CombBest+ index
contains 35.1 Mil. senses.</p>
        <p>The queries were automatically constructed from the topic fields title and description. The
stopword list used for queries was the same as the one used for the documents, plus the following
terms: find, describing, discussing, document, and report.</p>
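        <p>A minimal sketch of this query construction is given below. The tokenization rule, the function name, and the document stopword list passed in by the caller are assumptions made for illustration; only the five additional query stopwords come from the text.</p>

```python
import re
from typing import Iterable, List

# The five terms added to the document stopword list for queries.
QUERY_ONLY_STOPWORDS = {"find", "describing", "discussing", "document", "report"}


def build_query(title: str, description: str, stopwords: Iterable[str]) -> List[str]:
    """Join title and description, lowercase, tokenize, and drop stopwords."""
    blocked = set(stopwords) | QUERY_ONLY_STOPWORDS
    # Simple alphanumeric tokenization; the original tokenizer is not specified.
    tokens = re.findall(r"[a-z0-9]+", f"{title} {description}".lower())
    return [tok for tok in tokens if tok not in blocked]
```

        <p>The narrative field is deliberately not an argument, since queries were to be built from title and description only.</p>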
      </sec>
      <sec id="sec-2-3">
        <title>Retrieval Models</title>
        <p>We carried out several retrieval experiments using a probabilistic model, a monolingual
translation-based model, and their combination.</p>
        <sec id="sec-2-3-1">
          <title>Probabilistic Model</title>
          <p>
            Terrier provides a set of different probabilistic ranking models. We report the performance on the
training and test topics applying the term-weighting model I(n)OL2, which is called DFR BM25
in Terrier. The DFR BM25 model is the Divergence From Randomness (DFR) version of the
BM25 model [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ]. According to Ounis et al. [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ] the DFR version of the BM25 model infers the
informativeness of a term t in the document d by the divergence between its within-document
term-frequency and its frequency within the whole collection. Multiple TREC experiments have
shown that this is a competitive model in various retrieval settings. The DFR BM25 model is
defined by
\[
\mathrm{weight}(t|d) = \frac{tfn}{tfn + 1} \cdot \log_2 \frac{|D| - df + 1}{df + 0.5} , \tag{3}
\]
where
\[
tfn = tf(t|d) \cdot \log_2 \left( 1 + c \cdot \frac{adl}{dl} \right) , \tag{4}
\]
$|D|$ is the size of the document collection, $df$ is the document frequency, $tf$ is the term frequency,
$dl$ is the document length, and $adl$ is the average document length. We set the parameter $c$ to the
default value of 1. The probabilistic model can be applied to indexed tokens, lemmas, and senses.
          </p>
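          <p>As a worked illustration, the weighting formula above can be written as a small Python function. The sketch follows the formula as given in the text, not Terrier's source code, and the function and argument names are ours.</p>

```python
import math


def dfr_bm25_weight(tf: float, dl: float, adl: float, df: int,
                    num_docs: int, c: float = 1.0) -> float:
    """Term weight following the DFR BM25 formula as stated in the text."""
    # Length-normalized term frequency: longer documents get a smaller tfn.
    tfn = tf * math.log2(1.0 + c * adl / dl)
    # Collection component: rare terms (small df) receive a larger value.
    idf = math.log2((num_docs - df + 1.0) / (df + 0.5))
    return tfn / (tfn + 1.0) * idf
```

          <p>For a document of average length (dl = adl) and tf = 1, tfn equals 1, so the first factor is 0.5 and the weight is half of the collection component.</p>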
        </sec>
        <sec id="sec-2-3-2">
          <title>Relevance Feedback</title>
          <p>
            It is often observed that probabilistic models have problems dealing with synonymy. This problem,
also called lexical gap, arises from alternative ways of expressing a concept using different terms.
Query expansion models try to overcome the lexical gap problem by reformulating the original
query to increase the retrieval performance. Terrier provides different query expansion models,
namely the Bose-Einstein 1, the Bose-Einstein 2, and the Kullback-Leibler (KL) model [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ]. We
chose the KL query expansion model, since it performed best in preliminary experiments on the
training data. The Kullback-Leibler Divergence term weighting model is defined by:
\[
\mathrm{weight}(t) = P_R(t) \cdot \log \frac{P_R(t)}{P_C(t)} , \tag{5}
\]
where $P_R(t)$ is the probability of the term $t$ in the top ranked documents and $P_C(t)$ is the
probability of the term $t$ in the whole collection. Terms with a high probability in the top ranked
documents and a low probability in the whole collection are likely to be the expansion terms. In
our experiments the original query is expanded by up to 10 most informative (highest weighted)
terms from the 3 top ranked documents.
          </p>
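          <p>The term selection described above can be sketched as follows. This is a simplified illustration, not Terrier's implementation: the top ranked documents are assumed to be given as token lists, the collection probabilities as a dictionary, and all names are hypothetical.</p>

```python
import math
from collections import Counter
from typing import Dict, List


def kl_expansion_terms(top_docs: List[List[str]],
                       collection_prob: Dict[str, float],
                       max_terms: int = 10) -> List[str]:
    """Rank candidate terms by P_R(t) * log(P_R(t) / P_C(t)) and keep the best."""
    counts = Counter(tok for doc in top_docs for tok in doc)
    total = sum(counts.values())
    weights = {}
    for term, freq in counts.items():
        p_r = freq / total               # probability in the top ranked documents
        p_c = collection_prob.get(term)  # probability in the whole collection
        if p_c:                          # skip terms without a collection estimate
            weights[term] = p_r * math.log(p_r / p_c)
    # Highest weight first; ties are broken alphabetically (our choice).
    ranked = sorted(weights, key=lambda t: (-weights[t], t))
    return ranked[:max_terms]
```

          <p>With the settings reported in the text, top_docs would hold the 3 top ranked documents and max_terms would stay at 10.</p>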
        </sec>
        <sec id="sec-2-3-3">
          <title>Translation Model</title>
          <p>
            A further solution to the lexical gap problem is the integration of monolingual statistical
translation models first introduced by Berger and Lafferty [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ]. These models encode statistical word
associations which are trained on parallel monolingual document collections such as
question-answer pairs. Recently, Bernhard and Gurevych [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ] successfully applied monolingual translation
models for the task of answer finding. In order to automatically train the translation models, they
used the definitions and glosses provided for the same term by different lexical semantic resources,
namely WordNet, Wiktionary, Wikipedia, and Simple Wikipedia. The usage of these resources
yields domain-independent monolingual translation models. The authors have shown that their
models perform significantly better than baseline approaches for answer finding.
          </p>
          <p>[Table 1: index type (UBCBest, NUSBest, CombBest, CombBest+); # tokens]</p>
          <p>
            We employed the model defined by Xue et al. [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ] and used by Bernhard and Gurevych [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ] in our
experiments:
\[
P(q|d) = \prod_{w \in q} P(w|d) , \tag{6}
\]
where
\[
P(w|d) = (1 - \lambda) P_{mx}(w|d) + \lambda P(w|D) , \tag{7}
\]
\[
P_{mx}(w|d) = (1 - \beta) P_{ml}(w|d) + \beta \sum_{t \in d} P(w|t) P_{ml}(t|d) , \tag{8}
\]
$q$ is the query, $d$ the document, $\lambda$ the smoothing parameter for the document collection $D$,
$\beta$ the weight of the translation component, and
$P(w|t)$ is the probability of translating a document term $t$ to the query term $w$. The two interpolation
parameters were set to 0.8 and 0.5.
          </p>
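          <p>To make the scoring concrete, the following Python sketch computes the logarithm of the query likelihood for one document. It is an illustration under assumptions: the two interpolation weights are called lam (collection smoothing) and beta (translation component), which is our naming, translation probabilities are assumed to be stored in a dictionary keyed by (query term, document term), and no parameter values are hard-coded.</p>

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def translation_lm_log_score(query: List[str],
                             doc: List[str],
                             collection_prob: Dict[str, float],
                             trans_prob: Dict[Tuple[str, str], float],
                             lam: float,
                             beta: float) -> float:
    """Return log P(q|d); trans_prob maps (w, t) to P(w|t)."""
    counts = Counter(doc)
    # Maximum-likelihood estimate P_ml(t|d) for every term of the document.
    p_ml = {t: n / len(doc) for t, n in counts.items()}
    log_score = 0.0
    for w in query:
        # Probability mass reaching w by translating document terms.
        translated = sum(trans_prob.get((w, t), 0.0) * p for t, p in p_ml.items())
        p_mx = (1 - beta) * p_ml.get(w, 0.0) + beta * translated
        p_w = (1 - lam) * p_mx + lam * collection_prob.get(w, 0.0)
        if p_w == 0.0:
            return float("-inf")  # the query term cannot be generated at all
        log_score += math.log(p_w)
    return log_score
```

          <p>Summing logarithms instead of multiplying probabilities avoids numerical underflow for longer queries and leaves the document ranking unchanged.</p>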
          <p>We applied the translation-based model trained for the answer finding task to the newswire
document collection, although it was not specifically trained for this task. As the translation-based
model was trained on tokens, we apply it to the indexed token field exclusively.</p>
        </sec>
      </sec>
      <sec id="sec-2-4">
        <title>Combination of Retrieval Models</title>
        <p>Our hypothesis is that translation-based models retrieve different documents for some queries than
probabilistic models. Therefore, we compute a combined relevance score to improve the retrieval
performance.</p>
        <p>First, we normalize the scores resulting from each model applying standard normalization:
\[
r_{norm}(i) = \frac{r_{orig}(i) - r_{min}}{r_{max} - r_{min}} , \tag{9}
\]
where $r_{orig}(i)$ is the original score, $r_{min}$ is the minimum, and $r_{max}$ is the maximum occurring
score for a query.</p>
        <p>
          Second, we combine the normalized relevance scores computed for individual models into a final
score using the CombSUM method introduced by Fox and Shaw [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. This method ranks the
documents based on the sum of the (normalized) similarity scores of individual runs. Each run
can be assigned a different weight.
        </p>
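        <p>The two steps, normalization and weighted CombSUM, can be sketched as follows. The data layout (one dictionary of document scores per run and query) and the function names are assumptions; a document missing from a run is assumed to contribute nothing for that run, and a run whose scores are all equal is mapped to zero.</p>

```python
from typing import Dict, List, Tuple

Run = Dict[str, float]  # hypothetical layout: document id -> score for one query


def min_max_normalize(run: Run) -> Run:
    """Rescale the scores of one run to the interval [0, 1]."""
    if not run:
        return {}
    lo, hi = min(run.values()), max(run.values())
    if hi == lo:
        return {doc: 0.0 for doc in run}  # assumption for constant runs
    return {doc: (score - lo) / (hi - lo) for doc, score in run.items()}


def comb_sum(runs: List[Run], weights: List[float]) -> List[Tuple[str, float]]:
    """Weighted CombSUM: sum the weighted normalized scores, best first."""
    fused: Dict[str, float] = {}
    for run, weight in zip(runs, weights):
        for doc, score in min_max_normalize(run).items():
            fused[doc] = fused.get(doc, 0.0) + weight * score
    # Sort by descending fused score; ties are broken by document id.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))
```

        <p>Normalizing first matters because the two models produce scores on different scales; without it, the run with the larger raw scores would dominate the sum regardless of the weights.</p>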
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>In the following subsections we describe the results of our experiments and
discuss them in detail. We chose the five best performing experiments with and without utilizing
word senses for submission based on the MAP values obtained for the training set. In addition to
the officially submitted runs we report some further experiments on the test topics.</p>
      <p>[Table 2: MAP of the retrieval models (translation model, DFR BM25, DFR BM25 + KL) on the training and test data, by index field (token, lemma)]</p>
      <p>As stated in Section 2.2 we created four indices which differ in the way word senses assigned by
the UBC and NUS systems are selected. Table 1 shows the number of indexed word senses and
the MAP values of different retrieval experiments applying the DFR BM25 ranking model with the
Kullback-Leibler query expansion model. Retrieval on the UBCBest index shows a MAP value of
0.2514 for the training and 0.2636 for the test topics. For retrieval based on the NUSBest index the
MAP value increases by 14.2% and 24.1% for training and test topics, respectively. According to
this extrinsic evaluation, the NUS system clearly outperforms the UBC system. While CombBest
does not increase the retrieval performance measured by MAP (0.2921), we were able to increase
the MAP value using the CombBest+ index up to 0.3551.</p>
      <p>In the remainder of this paper, we use the indices CombBest and CombBest+ as our intention
was to analyze the performance of the heuristic-based combination approach. Each index consists
of three fields: token, lemma, and sense. The runs that we officially submitted are based on the
CombBest index only.</p>
      <sec id="sec-3-1">
        <title>Retrieval Experiments</title>
        <p>We report experiments applying the probabilistic retrieval model DFR BM25 (with and without
query expansion), and the monolingual translation-based model on both the training and test
data. The translation-based model is always restricted to the indexed tokens; the probabilistic
model can use all fields. We did not perform any fine-tuning of the parameters.
Table 2 shows the MAP of the different models. For the training data the DFR BM25 model
on tokens outperforms the translation model approach, even without any query expansion. The
translation-based model shows a MAP value of 0.3045, while the DFR BM25 model achieves a
MAP value of 0.3374 without and 0.3760 with query expansion. Retrieval on lemmas increases
the MAP value further to 0.3829. Retrieval on senses shows the lowest MAP values ranging
from 0.2557 up to 0.3011. Applying query expansion on the CombBest+ index outperforms the
corresponding runs on the CombBest index.</p>
        <p>For the test data, the translation model and the DFR BM25 model without any query expansion
show similar MAP values. However, when applying query expansion the DFR BM25 approach
outperforms the translation-based model.</p>
        <p>An interesting aspect is that the difference between the performance on lemmas compared to
tokens is much higher on the test topics than on the training topics. The DFR BM25 model with
query expansion on tokens yields a MAP value of 0.4223 while we get a MAP value of 0.4451
on lemmas, which is an improvement of 5.1%. Again, experiments on senses achieve the lowest
performance, and retrieval on the CombBest+ index performs better than on the CombBest
index.</p>
        <p>For the probabilistic model, we additionally conducted several experiments applying multifield
queries. We submitted one run querying the token, lemma, and sense fields at the same time,
which achieved a MAP value of 0.4380 on the CombBest and 0.4456 on the CombBest+ index,
respectively. However, as the outcomes do not differ significantly, we do not report the
figures here.</p>
        <p>[Table 3: MAP on the test topics for weighted combinations of the translation and probabilistic models]</p>
      </sec>
      <sec id="sec-3-2">
        <title>Combination of Retrieval Models</title>
        <p>We have manually analyzed the documents retrieved for some topics by the probabilistic and the
translation model. We observed that the sets of documents retrieved by the two models are often
different from each other. Therefore, we combined both models in order to improve the
overall performance. We extensively experimented on the training data with different combination
weights for the two retrieval models using the CombSUM method described in Section 2.4. The
conclusion was that the combination achieves the best performance when the probabilistic models
based on tokens and lemmas are assigned a higher weight (due to their higher MAP values) than
the model based on senses or the translation-based model. Table 3 shows the results obtained
on the test topics by different combinations, with and without the integration of word senses. The
weight combinations were determined during the training phase, where they yielded the best performance
on the training data.</p>
        <p>Two aspects of the combinations are of particular interest. The combination of the probabilistic models
based on tokens and lemmas yields no improvement. In contrast, the combination of the
probabilistic model with the translation-based model always leads to an improvement. Even if the
impact of the translation model, i.e. its weight, is low (here: 0.2), the MAP values increase when
compared to the results obtained by the probabilistic model alone, on the token and lemma index
fields. This fact corroborates our hypothesis that the probabilistic and the translation-based
models retrieve different sets of relevant documents for some queries and that those different sets are
effectively combined applying the CombSUM approach. The best performance was obtained
by a combination of the probabilistic model based on lemmas and the translation model based on
tokens with weights 0.8 and 0.2, respectively. This combination yields a MAP value of 0.4509 and
ranked 1st in the official challenge (see Section 3.4).</p>
        <p>The second interesting aspect concerns the integration of word sense information. Retrieval based
on senses from the CombBest index yields a MAP of 0.3313, while retrieval based on senses of the
CombBest+ index shows a MAP of 0.3551. We attribute the difference to the fact that CombBest
loses information about the documents due to the smaller number of indexed senses. However,
all combinations with either the CombBest or the CombBest+ senses end up with very similar
performance. The reason could be that the loss of information when using the CombBest index is
compensated by querying the tokens or lemmas as well.</p>
        <p>In some combinations, the integration of word senses achieved a higher MAP value
than retrieval settings without word senses. For example, the MAP value corresponding to the
retrieval based on tokens alone is 0.4223, while the combination with senses obtains a MAP value
of 0.4303 for the CombBest index and 0.4327 for the CombBest+ index. For the combinations
based on lemmas and senses, the difference is not significant. Among the settings with word senses, the best performance is
obtained by the combination of the translation model and the probabilistic model based on lemmas
and senses, applying weights of 0.1, 0.8, and 0.1, respectively. For the officially submitted run on
the CombBest index a MAP value of 0.4500 was achieved, while the run on the CombBest+ index
achieves a slightly better MAP value of 0.4507.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Discussion</title>
        <p>In the previous section we described all our experiments carried out on the document collection
disambiguated with word senses. We submitted five runs without the integration of word senses
and five further runs utilizing the annotated word senses. According to the MAP values our runs
without word senses ended up in 1st place out of 10 participants. Our highest MAP value
was achieved with the combination of the translation-based model and the probabilistic model based
on lemmas, with assigned weights of 0.2 and 0.8, respectively.</p>
        <p>When utilizing word senses, the combination of the translation model based on tokens and the
probabilistic model based on both lemmas and senses obtained 1st place according to the MAP
in the official challenge. We mistakenly submitted runs on the CombBest index, even though
we planned to focus on the CombBest+ index. However, we have shown that the differences
between the combinational approaches are minimal. Our best performing submitted retrieval setting
achieved a MAP value of 0.4500, whereas the second top scoring system in the official challenge
obtained a MAP value of 0.4346.</p>
        <p>Overall, we could not observe any significant improvement by applying the sense annotations
compared to the retrieval settings based on tokens or lemmas only. This observation is consistent
with the conclusion of last year's challenge. Participants of last year's challenge proposed several
different methods for utilizing word senses, but could not achieve a significant improvement. We
increased the precision of WSD annotations through a heuristic-based combination of the UBC
and NUS annotated senses, which we evaluated extrinsically. This evaluation has shown that the
accuracy of annotated word senses strongly influences the outcome of retrieval systems utilizing
these disambiguated data (see Table 1).</p>
        <p>Regarding the performance of the translation-based model, the results of the combination are
promising given that we merely applied a translation model built for a previous application in
the field of answer finding. The main drawback of the straightforward use is the discrepancy in
the tokenization scheme. The tokenization of the document collection is not always compatible
with the tokenization of the parallel corpora used for training the translation model. In
addition, the translation model we used contains only tokens and thus cannot deal with indexed multi
word expressions. For instance, the phrase "public transport" is indexed as the single term "public transport".
In the translation model the two terms "public" and "transport" appear, but not the phrase
"public transport". We briefly analyzed the number of multi word expressions in the test topic
collection: 61 of the 160 test queries contain at least one multi word
expression. This analysis shows that the translation model was not particularly trained for this task
and motivates further improvements. In addition, further translation models could be trained on
lemmas and senses. The latter option, however, requires a word sense disambiguated monolingual
parallel corpus.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>We have described a combinational approach to information retrieval on word sense disambiguated
data, which combines a probabilistic and a monolingual translation-based model. For the
probabilistic model we have used the Divergence From Randomness (DFR) BM25 model with the
Kullback-Leibler Divergence as the query expansion method. For the translation-based model we
have applied a model which was already trained for an answer finding task.</p>
      <p>Our aim was to assess the benefits of the combination of both models. We have shown that the
combinational approach always achieves better performance than the stand-alone models. Our
second goal was to analyze different methods for selecting word senses from annotated corpora in
order to increase their accuracy. We found that our heuristic-based approach CombBest+
increases the retrieval performance based on word senses by 2.2% when compared to NUSBest and
by 25.8% when compared to UBCBest. The large difference between NUSBest and UBCBest
demonstrates that WSD accuracy is essential for utilizing word sense information. However, the
experiments on the CombBest+ index have shown that we could only increase the retrieval
performance in one specific case: by combining the probabilistic model based on tokens with the same
model based on senses. Nevertheless, other combinations without word senses easily outperformed this
setting.</p>
      <p>
        In conformance with our results, the best run out of all participants of last year's challenge was
conducted without any word sense annotations. However, some participants were able to improve
the performance of their own retrieval system slightly when utilizing word sense annotations, but
it is questionable whether the improvements were significant. The UniNE group [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] manually
analyzed fifty of the test queries provided last year and found that some
disambiguations were incorrect. Therefore, a manual evaluation of the current test queries would be of
particular interest, which we leave for future work.
      </p>
      <p>
        In summary, we agree with Sanderson [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] that, first of all, the accuracy of annotated word senses
has to increase in order to improve the performance of retrieval based on word sense annotations.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <p>This work has been supported by the Emmy Noether Program of the German Research Foundation
(DFG) under the grant No. GU 798/3-1, and by the Volkswagen Foundation as part of the
Lichtenberg-Professorship Program under the grant No. I/82806.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Eneko</given-names>
            <surname>Agirre</surname>
          </string-name>
          and Oier Lopez de Lacalle.
          <article-title>UBC-ALM: Combining k-NN with SVD for WSD</article-title>
          .
          <source>In Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)</source>
          , pages
          <fpage>342</fpage>
          –
          <lpage>345</lpage>
          , Prague, Czech Republic,
          <year>June 2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Adam</given-names>
            <surname>Berger</surname>
          </string-name>
          and John Lafferty.
          <article-title>Information Retrieval as Statistical Translation</article-title>
          .
          <source>In Proceedings of the 1999 ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '99)</source>
          , pages
          <fpage>222</fpage>
          –
          <lpage>229</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Delphine</given-names>
            <surname>Bernhard</surname>
          </string-name>
          and
          <string-name>
            <given-names>Iryna</given-names>
            <surname>Gurevych</surname>
          </string-name>
          .
          <article-title>Combining Lexical Semantic Resources with Question &amp; Answer Archives for Translation-Based Answer Finding</article-title>
          .
          <source>In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP</source>
          , pages
          <fpage>728</fpage>
          –
          <lpage>736</lpage>
          ,
          Suntec, Singapore,
          <year>August 2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Yee Seng</given-names>
            <surname>Chan</surname>
          </string-name>
          , Hwee Tou Ng, and
          <string-name>
            <given-names>Zhi</given-names>
            <surname>Zhong</surname>
          </string-name>
          .
          <article-title>NUS-PT: Exploiting Parallel Texts for Word Sense Disambiguation in the English All-Words Tasks</article-title>
          .
          <source>In Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)</source>
          , pages
          <fpage>253</fpage>
          –
          <lpage>256</lpage>
          , Prague, Czech Republic,
          <year>June 2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Thomas M.</given-names>
            <surname>Cover</surname>
          </string-name>
          and Joy A. Thomas.
          <source>Elements of Information Theory</source>
          . Wiley-Interscience, New York, NY, USA,
          <year>1991</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Ljiljana</given-names>
            <surname>Dolamic</surname>
          </string-name>
          , Claire Fautsch, and Jacques Savoy.
          <article-title>UniNE at CLEF 2008: TEL, Persian and Robust IR</article-title>
          .
          <source>In Working Notes for the CLEF 2008 Workshop, 17-19 September 2008</source>
          , Aarhus, Denmark,
          <year>September 2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Edward A.</given-names>
            <surname>Fox</surname>
          </string-name>
          and
          <string-name>
            <given-names>Joseph A.</given-names>
            <surname>Shaw</surname>
          </string-name>
          .
          <article-title>Combination of Multiple Searches</article-title>
          .
          <source>In Proceedings of the 2nd Text REtrieval Conference (TREC-2)</source>
          , pages
          <fpage>243</fpage>
          –
          <lpage>252</lpage>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Djoerd</given-names>
            <surname>Hiemstra</surname>
          </string-name>
          .
          <article-title>Term-specific Smoothing for the Language Modeling Approach to Information Retrieval: the Importance of a Query Term</article-title>
          .
          <source>In SIGIR '02: Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval</source>
          , pages
          <fpage>35</fpage>
          –
          <lpage>41</lpage>
          , New York, NY, USA,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Iadh</given-names>
            <surname>Ounis</surname>
          </string-name>
          , Gianni Amati, Vassilis Plachouras, Ben He, Craig Macdonald, and
          <string-name>
            <given-names>Christina</given-names>
            <surname>Lioma</surname>
          </string-name>
          .
          <article-title>Terrier: A High Performance and Scalable Information Retrieval Platform</article-title>
          .
          <source>In Proceedings of ACM SIGIR'06 Workshop on Open Source Information Retrieval (OSIR 2006)</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Vassilis</given-names>
            <surname>Plachouras</surname>
          </string-name>
          , Ben He, and Iadh Ounis.
          <article-title>University of Glasgow at TREC2004: Experiments in Web, Robust and Terabyte tracks with Terrier</article-title>
          .
          <source>In Proceedings of the 13th Text REtrieval Conference (TREC 2004)</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Sameer</given-names>
            <surname>Pradhan</surname>
          </string-name>
          , Edward Loper, Dmitriy Dligach, and Martha Palmer.
          <article-title>SemEval-2007 Task17: English Lexical Sample, SRL and All Words</article-title>
          .
          <source>In Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)</source>
          , pages
          <fpage>87</fpage>
          –
          <lpage>92</lpage>
          , Prague, Czech Republic,
          <year>June 2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Stephen E.</given-names>
            <surname>Robertson</surname>
          </string-name>
          , Steve Walker, Micheline Hancock-Beaulieu, Mike Gatford, and
          <string-name>
            <given-names>A.</given-names>
            <surname>Payne</surname>
          </string-name>
          .
          <article-title>Okapi at TREC-4</article-title>
          .
          <source>In NIST Special Publication 500-236: The Fourth Text REtrieval Conference (TREC-4)</source>
          , pages
          <fpage>73</fpage>
          –
          <lpage>96</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Mark</given-names>
            <surname>Sanderson</surname>
          </string-name>
          .
          <article-title>Word Sense Disambiguation and Information Retrieval</article-title>
          .
          <source>In SIGIR '94: Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval</source>
          , pages
          <fpage>142</fpage>
          –
          <lpage>151</lpage>
          , New York, NY, USA,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Xiaobing</given-names>
            <surname>Xue</surname>
          </string-name>
          , Jiwoon Jeon, and
          <string-name>
            <given-names>W. Bruce</given-names>
            <surname>Croft</surname>
          </string-name>
          .
          <article-title>Retrieval Models for Question and Answer Archives</article-title>
          .
          <source>In SIGIR '08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval</source>
          , pages
          <fpage>475</fpage>
          –
          <lpage>482</lpage>
          , New York, NY, USA,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>