Cross-lingual Information Retrieval based on Multiple Indexes

Philipp Sorg, Marlon Braun, David Nicolay (Institut AIFB, Universität Karlsruhe)
sorg@aifb.uni-karlsruhe.de, marlon.braun@t-online.de, davidnicolay85@yahoo.de
Philipp Cimiano (Universität Bielefeld)
cimiano@techfak.uni-bielefeld.de

Abstract

In this paper we present the technical details of the retrieval system with which we participated in the CLEF09 Ad-hoc TEL task. We present a retrieval approach based on multiple indexes for different languages, which is combined with a concept-based retrieval approach based on Explicit Semantic Analysis. In order to create the language-specific indexes, a language detection approach is applied as a preprocessing step. We combine the different indexes through rank aggregation and present our experimental results with different rank aggregation strategies. Our results show that the use of multiple indexes (one for each language) does not improve upon a baseline index containing documents in all languages. The combination with concept-based retrieval, however, results in better retrieval performance in some of the cases considered. For the bi-lingual tasks the final retrieval results of our system were the fifth best on the BL dataset and the second best on the BNF dataset.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval

General Terms

Measurement, Performance, Experimentation

Keywords

Cross-language Information Retrieval, Explicit Semantic Analysis, Rank Aggregation, Machine Translation

1 Introduction

There are two important paradigms that can be applied to the problem of cross-language retrieval: concept-based retrieval approaches and approaches exploiting machine translation (MT). Concept-based methods map documents and queries into a language-independent concept space [5]. MT-based methods translate the queries or documents into the target language or into all target languages [3]. Most machine translation based approaches work for specific language pairs: the topic is given in a specific source language and all documents in the corpus are given in a defined target language. In this paper we extend this model to handle corpora containing documents in multiple languages, and moreover documents containing fields in different languages. Our approach is directly motivated by the CLEF Ad-hoc TEL task, where the target collection contains documents in different languages and the task is to find relevant documents in all languages for given topics. Our hypothesis is that retrieval can be improved by translating topics into all languages of the corpus, performing a language-specific search for each translation and aggregating the results from the single indexes into one final ranking.

Another important question we address in this paper is whether concept- and MT-based techniques can be successfully combined to increase the performance of CLIR compared to concept-based and MT-based techniques alone. For both problems, i.e. retrieval using multiple indexes and the combination of MT-based and concept-based retrieval, relevance measures computed by different models have to be combined into an aggregated relevance score. A common approach to this problem, which we also use in this paper, is rank aggregation. This means that the final scores of each model are used as input values for the aggregation function.
In the following we describe the main techniques used in related work to combine different retrieval approaches. In order to combine concept-based retrieval and term-based retrieval, Müller and Gurevych [4] use Wikipedia and Wiktionary as background knowledge to improve the retrieval performance on a mono-lingual search task. They were able to improve the performance measured by mean average precision by 34% compared to the bag-of-words baseline. Similar to our approach, they use Explicit Semantic Analysis [2] for concept-based retrieval. In this paper we extend this approach to CLIR and investigate different strategies to combine evidence from different retrieval approaches.

Croft [1] describes different strategies to combine IR techniques. He shows that the task of combining the output of different retrieval systems can be modeled as the task of combining the output of multiple classifiers. He also presents different frameworks to combine multiple retrieval systems at different levels, e.g. at the representation level or at the output level. In our approach we use some of the score normalization algorithms presented by Croft. Our combination approaches are also inspired by this work, but we extend it by using machine learning to find optimal parameters of the combination. The results of these different combination approaches show that evidence coming from different sources can be aggregated to achieve better performance of the overall retrieval system.

In the context of our participation in this year's CLEF, we investigate whether these techniques can also be used for the Ad-hoc task on the TEL datasets. Overall, we build on the system we presented at CLEF2008, which achieved reasonable performance using concept-based retrieval based on Explicit Semantic Analysis. Our main contributions in this paper are the following:

• We extend both MT-based and concept-based retrieval to truly multi-lingual settings where not only the document collection can contain multiple languages but a document itself can contain fields in different languages. The main innovation here is that we maintain separate indexes for each language and apply our combination strategies to the retrieval engines for each of these language-specific indexes. Our results show that for the CLEF Ad-hoc TEL task we obtain a performance similar to a baseline system based on a single index, but no significant improvement over it.

• We also present an approach by which MT-based and concept-based retrieval (by ESA) can be combined through rank aggregation. This combination effectively increases the performance of the retrieval system for the bi-lingual task on the BL dataset using French topics and the ONB dataset using English and German topics.

The paper is structured as follows: In Section 2 we describe our retrieval system and define MT-based retrieval, concept-based retrieval as well as different aggregation approaches. In Section 3 we describe the datasets used and the preprocessing of the data. In Section 4 we present experiments on the Ad-hoc TEL task using topics from CLEF2008, and in Section 5 using topics from CLEF2009. We conclude in Section 6.
[Figure 1: Overview of all indices used in our retrieval framework. TEL records pass through language classification; their fields are indexed into language-specific indexes (en, de, ..., fr), language-specific ESA concept indexes and a baseline index. At search time, topics are machine-translated into all languages and matched against the corresponding indexes, followed by two aggregation steps.]

2 Approach

The main idea behind our approach is to use multiple indexes, one for each language under consideration (all the common European languages). These are indexes of the fields of documents in different languages as well as concept indexes of documents. The basic idea is to combine retrieval results based on the different indexes. Figure 1 illustrates the different indexes and processing steps, which will be described in more detail in the following sections. But first we introduce some notation.

2.1 Notation

In the remainder of this article we use the following notation:

• L = {α, β, γ, ...}: A set of languages.

• D = {d_1, ..., d_n}: A text corpus consisting of multi-lingual documents. The function f_α(d) selects all the document fragments of d in language α. D_α = {f_α(d_1), ..., f_α(d_n)} defines a restriction of corpus D in which each document consists of its fragments in language α.

• C = {c_1, ..., c_m}: A set of concepts that define a concept space. Each concept has a textual description. We use c_i both to refer to concept c_i and to the description of c_i; the intended meaning will be clear from the context.

• T_α = {t_{α,1}, t_{α,2}, ...}: A set of topics in language α that will be used to construct queries to the retrieval system. Each topic represents a certain information need. For the translation of a topic t_α to language β we use the notation t_{α→β}.

• Statistics of a term w in document d of corpus D:
  – TF_d(w): Term frequency of w in document d.
  – |d|: Document length of d.
  – DF(w): Document frequency of w in corpus D.
  – TF(w): Term frequency of w in corpus D.
  – n = |D|: Number of documents.
  – |d̂|: Average document length in corpus D.

2.2 Language Detection

In our setting, the document corpus consists of multi-lingual documents which contain content in multiple languages. We assume that the parts of a document which are in different languages are identified and labeled appropriately; this is essentially how the function f_α described above is realized. This makes the application of language detection necessary before indexing the documents, as we rely on different indexes per language. In our setting these parts correspond to the fields of the documents in the TEL dataset, which can be in different languages. In order to identify the language of each field, we exploit a language detection approach based on character n-gram models: probability distributions over character sequences of size n are used to classify text into a set of languages. We used a classifier provided by the LingPipe Language Identification Tool^1, which was trained on corpora in different languages as described in Section 3.

^1 http://alias-i.com/lingpipe/

2.3 Machine Translation based CLIR

In the simplest case, the CLIR problem can be formulated as bilingual retrieval: given a topic t_β in language β and a set of multi-lingual documents D, find relevant documents in D_α. If all document fragments in D are of language α then D = D_α, which is the most common scenario.
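To make the language detection step of Section 2.2 concrete, the following is a minimal sketch of a character n-gram language classifier with add-one smoothing. This is an illustration only, not the LingPipe implementation we actually used; the class, its interface and the training file names are hypothetical.

import math
from collections import Counter

def ngram_profile(text, n=5):
    """Count character n-grams of a text (space-padded at both ends)."""
    padded = " " * (n - 1) + text + " " * (n - 1)
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

class NGramLanguageDetector:
    """Naive character n-gram language model with add-one smoothing."""

    def __init__(self, n=5):
        self.n = n
        self.profiles = {}   # language -> Counter of n-gram counts
        self.totals = {}     # language -> total n-gram count

    def train(self, language, corpus_text):
        profile = ngram_profile(corpus_text, self.n)
        self.profiles[language] = profile
        self.totals[language] = sum(profile.values())

    def log_likelihood(self, text, language):
        profile = self.profiles[language]
        total = self.totals[language]
        vocab = len(profile) + 1  # add-one smoothing denominator
        return sum(
            count * math.log((profile[gram] + 1) / (total + vocab))
            for gram, count in ngram_profile(text, self.n).items()
        )

    def classify(self, text):
        """Return the language whose model assigns the highest likelihood."""
        return max(self.profiles, key=lambda lang: self.log_likelihood(text, lang))

# Usage with hypothetical monolingual training files, one per language:
detector = NGramLanguageDetector(n=5)
detector.train("en", open("leipzig_en.txt").read())
detector.train("de", open("leipzig_de.txt").read())
detector.train("fr", open("leipzig_fr.txt").read())
print(detector.classify("Bibliothèque nationale"))  # expected: "fr"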
In this case an MT system translating text from language β to α can be used to reduce the problem to mono-lingual retrieval by translating topic t_β to a topic t_{β→α} in language α. Mono-lingual retrieval models can then be used to define the relevance of documents in D_α to the translated topic t_{β→α}.

In our approach we extend the bi-lingual setting to multiple languages. As shown in Figure 1, the first step is building indexes for each language α that contain all terms of documents in D_α. This means that index I_α only contains information about text in language α. In the retrieval step, each topic is translated into all languages and each translation of the topic is matched against the corresponding index. This results in a different ranking for each language. An overall ranking is computed through different aggregation approaches over these rankings, which will be described in more detail in Section 2.5.

The matching of the translated topic to the language-specific index is based on a mono-lingual retrieval model. In this paper we use models that have been implemented in the Terrier^2 framework. For mono-lingual IR, we use the following retrieval models:

• DLH13:

Score(t, d) := Σ_{w∈t} TF_t(w) · [ TF_d(w) log( (TF_d(w) · |d̂| · |D|) / (|d| · TF(w)) ) + 0.5 log( 2π TF_d(w) (1 − TF_d(w)/|d|) ) ] / (TF_d(w) + 0.5)

• BB2:

Score(t, d) := Σ_{w∈t} TF_t(w) · (TF(w) + 1) / (DF(w) (NTF_d(w) + 1)) · ( −log(|D| − 1) + Φ(|D| + TF(w) − 1, |D| + TF(w) − NTF_d(w) − 2) − Φ(TF(w), TF(w) − NTF_d(w)) )

with NTF_d(w) = TF_d(w) log(1 + |d̂|/|d|) and Φ(n, m) := (m + 0.5) log(n/m) + (n − m) log n.

• LemurTF_IDF:

Score(t, d) := Σ_{w∈t} TF_t(w) · (1.2 TF_d(w)) / (TF_d(w) + 1.2 (0.25 + 0.75 |d|/|d̂|)) · ( log(|D| / DF(w)) )²

^2 http://ir.dcs.gla.ac.uk/terrier/

2.4 Concept-based CLIR

As an instance of concept-based CLIR we build on the CL-ESA approach previously presented in [6]. For the sake of completeness we first discuss Explicit Semantic Analysis and then the cross-language extension CL-ESA. In our retrieval system each document is mapped by ESA into a conceptual representation (the Wikipedia article space) which can be understood as an interlingua-based representation abstracting from languages, and which is inherently able to represent documents with fields in different languages.

As shown in Figure 1, we follow two different approaches to build the index. One approach maps whole documents to the Wikipedia article space using ESA without considering that documents can contain different languages. The second approach classifies each field of a document into the corresponding language and then maps each field into a concept vector using a language-specific ESA instantiation. We compare the performance of these approaches in our experiments. In both cases we rely on a single index for concept-based retrieval, as the multiple languages are already accounted for in the concept mapping.

2.4.1 Explicit Semantic Analysis (ESA)

ESA classifies a given document d with respect to a set of explicitly given external categories C. Gabrilovich and Markovitch [2] have outlined the general theory behind ESA and in particular described its instantiation using Wikipedia articles as external categories. We build on this instantiation, which we briefly summarize in the following.

In essence, Explicit Semantic Analysis takes as input a document d and maps it to a high-dimensional real-valued vector space. This vector space is spanned by a concept space C_α = {c_1, ..., c_m} in language α such that each dimension corresponds to a concept c_i.
This mapping is given by the following function:

Φ_α : D → R^{|C_α|} with Φ_α(d) := ⟨AS(d, c_1), ..., AS(d, c_m)⟩

The function AS expresses the association strength between d and the concept c_i. In the original ESA model, AS is defined as the sum of the TF.IDF_{c_i} values of all words w_j ∈ d based on the textual description of concept c_i. In previous work we examined the performance of different association strength functions for CLIR tasks [7]. Based on these results we use the following modified function:

AS(d, c_i) := Σ_{w∈d} (TF_{c_i}(w) / |c_i|) · log(|C| / DF(w))

2.4.2 Cross-lingual ESA (CL-ESA)

In this section we present the extension of ESA called CL-ESA (Cross-language Explicit Semantic Analysis). This is a relatively straightforward extension of ESA to a cross-lingual setting, which we presented before in [6]. We also describe how CL-ESA can be used for the semantic analysis of multi-lingual documents.

CL-ESA relies on the principle that concept vectors computed with respect to the Wikipedia database in one language can be translated into concept vectors with respect to another Wikipedia database by relying on Wikipedia's cross-language links^3. This is done by mapping each dimension corresponding to article a in Wikipedia W_α to the dimension corresponding to article b in Wikipedia W_β such that there exists a language link from a to b. This means that articles a and b are textual descriptions of the same concept. Given this mapping it is, for example, possible to compare documents in languages α and β based on the mapped concept vectors.

^3 Cross-language links are those that link a certain article to a corresponding article in the Wikipedia database in another language.

In general, the concept space that is used for CL-ESA needs textual descriptions of all concepts in all supported languages. We refer to the description of concept c_i in language α as c_{i,α}. For a multi-lingual document d, CL-ESA is defined as follows:

AS(d, c_i) := Σ_{α∈L} AS(f_α(d), c_{i,α})   (1)

When CL-ESA is instantiated using the Wikipedia database, the articles have to be restricted to those having cross-language links to articles in all languages in L. Then every concept represented by an article in any language has descriptions in all other languages given by the linked articles, which is needed for our model. In the following, m_{α→β} : W_α → W_β denotes the function mapping articles from W_α to W_β according to the language links. Given a target language α for the concept representation of a multi-lingual document d with respect to Wikipedia W_α = {a_1, a_2, ...}, the association strength defined in Equation 1 can be instantiated for Wikipedia as:

AS_{W_α}(d, a_i) := Σ_{β∈L} AS(f_β(d), m_{α→β}(a_i))

Intuitively, this is the association strength of a multi-lingual document d to a concept c represented by the Wikipedia article a_i in language α. This value is defined as the sum of the association strengths of all fragments f_β(d) in languages β to the concept description of c in language β. This description is given by the article in W_β to which a_i links.

2.4.3 Retrieval using CL-ESA

Using the association strength function defined above, a mapping Φ of documents or topics to concept vectors can be defined as follows:

Φ(d) := d⃗ = ⟨AS(d, c_1), ..., AS(d, c_m)⟩

Given the vector representations of topics and documents, similarity measures in vector space can be used to determine the relevance of documents to topics.
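To illustrate the mapping Φ and the modified association strength defined above, here is a small, self-contained sketch in Python. It operates on pre-tokenized text and loops over all concepts per term for clarity; our actual system uses an inverted concept index (see Section 3.4), and all names below are illustrative.

import math
from collections import Counter

def build_concept_space(concept_texts):
    """Precompute term statistics over the concept descriptions.

    concept_texts: list of token lists, one per concept description c_i.
    Returns per-concept term frequencies TF_{c_i}, description lengths |c_i|,
    and document frequencies DF over the concept space.
    """
    tf = [Counter(tokens) for tokens in concept_texts]
    lengths = [len(tokens) for tokens in concept_texts]
    df = Counter(term for tokens in concept_texts for term in set(tokens))
    return tf, lengths, df

def esa_vector(doc_tokens, tf, lengths, df):
    """Map a document to its ESA concept vector <AS(d, c_1), ..., AS(d, c_m)>.

    Implements AS(d, c_i) = sum_{w in d} TF_{c_i}(w)/|c_i| * log(|C|/DF(w)).
    """
    m = len(tf)
    vector = [0.0] * m
    for w in doc_tokens:
        if df[w] == 0:          # term does not occur in the concept space
            continue
        idf = math.log(m / df[w])
        for i in range(m):
            if tf[i][w]:
                vector[i] += tf[i][w] / lengths[i] * idf
    return vector

# Toy usage with two "concepts" (in practice: aligned Wikipedia articles).
tf, lengths, df = build_concept_space([
    ["library", "catalog", "records"],
    ["machine", "translation", "models"],
])
print(esa_vector(["library", "translation"], tf, lengths, df))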
In our previous work we defined the following relevance function [7]:

rel(t, d) := Γ(Π(Φ(t)), Π(Φ(d))),

where Π is a projection function which reduces the dimensionality of the vector. This is done for performance reasons, as efficient indexing is not possible without the reduction. In our framework we use Π^m_abs(d⃗), which selects the m dimensions with the highest values in d⃗, as this reduction function was shown to achieve good performance in CLIR tasks [7]. Γ defines the vector space similarity. We used the cosine similarity, defined as

Γ_cosine(t⃗, d⃗) := ⟨t⃗, d⃗⟩ / (‖t⃗‖ ‖d⃗‖)

2.5 Rank Aggregation

In our framework we aggregate two different kinds of rankings for a topic t. First, as we deal with multi-lingual documents and due to our separate language-specific indexing approach, for each language α ∈ L there is a ranking that expresses the relevance based on the text parts in language α. Second, we compute a ranking based on the concept representation of topics and documents. We chose a two-step rank aggregation approach: we first combine all text-based rankings and then combine the resulting ranking with the concept-based ranking. In the following we describe the different rank aggregation methods which we used for either the first or the second step of rank aggregation. More details will be presented in Section 4.

2.5.1 Linear Aggregation

As the first approach to aggregating different ranking scores we chose linear aggregation. This means that the final relevance score of a document is computed as the weighted sum of all scores in the different rankings:

score(t, d) := Σ_{r∈R} δ(r) · score_r(t, d)

where R is a set of rankings and δ(r) a weighting function. In our experiments we use the following variations of this weighting function:

• Normalization using the max score: δ(r) := 1 / maxscore(r). Before the aggregation, each ranking is normalized to values in [0, 1] by dividing each ranking score by the maximum score.

• Normalization using the number of retrieved documents: δ(r) := |r| / Σ_{r′∈R} |r′|, where |r| is the number of retrieved documents of ranking r. This weight corresponds to the share of documents retrieved by one ranking relative to the total number of documents retrieved by all rankings.

• A priori weights based on language: δ(r_α) := P(α). This weighting function can be applied in our first step of rank aggregation. In this case each ranking r_α is weighted by the a priori probability of a document being in language α. We use the share of text parts in language α relative to all text parts in the corpus as the a priori probability P(α).

2.5.2 Support Vector Machine Aggregation

As an alternative to linear aggregation we considered rank aggregation based on Support Vector Machines (SVMs). For a given topic and document, a feature vector can be built from the relevance score returned by each index. This is then used as input to an SVM classifier that predicts the relevance of the document on the basis of a combination of the ranking scores. This means that the results of each retrieval step on the different indexes are used as feature values. The classification model is trained using the relevance assessments available for the corpus: each relevant document for a topic defines a positive training example, each non-relevant one a negative example. Using a linear kernel, the model of the classifier corresponds to linear aggregation. By using non-linear kernels this can be extended to non-linear rank aggregation.
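As an illustration of the linear aggregation of Section 2.5.1, the following sketch combines per-index score lists using the weighting functions δ defined above; the same per-index scores would also serve as feature values for the SVM variant. The function and variable names are ours for illustration and not part of any framework.

from collections import defaultdict

def aggregate(rankings, weighting="max_score"):
    """Linear aggregation: score(t, d) = sum_r delta(r) * score_r(t, d).

    rankings: list of dicts mapping document id -> retrieval score,
              one dict per (language-specific) index.
    weighting: "none", "max_score" (normalize each ranking to [0, 1]),
               or "num_retrieved" (weight by share of retrieved documents).
    """
    total_retrieved = sum(len(r) for r in rankings)
    combined = defaultdict(float)
    for r in rankings:
        if weighting == "max_score":
            delta = 1.0 / max(r.values())
        elif weighting == "num_retrieved":
            delta = len(r) / total_retrieved
        else:
            delta = 1.0
        for doc, score in r.items():
            combined[doc] += delta * score
    # Return documents sorted by decreasing aggregated score.
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

# Toy usage: results from an English and a German index for one topic.
ranking_en = {"doc1": 8.2, "doc2": 5.1}
ranking_de = {"doc2": 3.3, "doc3": 2.9}
print(aggregate([ranking_en, ranking_de], weighting="num_retrieved"))

The a priori language weights δ(r_α) := P(α) would be passed in analogously, e.g. as a precomputed dict of per-language probabilities replacing delta.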
In Section 4 we describe experiments with linear kernels and radial basis function kernels.

3 Evaluation

In this section we first introduce the datasets we used for our experiments. Then we describe the evaluation methodology and the evaluation measures. Finally, we briefly present some details about our implementation.

3.1 Datasets

3.1.1 TEL Dataset

The TEL dataset was provided by The European Library in the context of the CLEF 2008/2009 ad-hoc track. This dataset consists of library catalog records of three libraries: the British Library (BL) with 1,000,100 records, the Austrian National Library (ONB) with 869,353 records and the Bibliothèque nationale de France (BNF) with 1,000,100 records. While the BL contains a majority of English records, the ONB dataset of German records and the BNF dataset of French records, all collections also contain records in multiple languages. Each record consists of fields which again may be in different languages. Not all of these fields describe the content of the record; some contain metadata such as the publisher name or year of publication. As the CLEF topics only target the content fields, we first identified all content fields. Table 1 contains a list of the selected fields and the average count of each field per record. Further, we reduced additional noise by removing non-content terms, such as constant prefix or suffix terms, from fields, e.g. the prefix term Summary in abstract fields.

Field        Description                         BL    ONB   BNF
title        The title of the document           1     .95   1.05
subject      Keyword list of contained subjects  2.22  3.06  0.71
alternative  Alternative title                   .11   .50   0
abstract     Abstract of the document            .002  .004  0

Table 1: Average frequency of content fields of the TEL library catalog records.

In order to be able to use the library catalog records as multi-lingual documents as defined in Section 2, we also had to determine the language of each field. Our language detection approach is based on the language tags provided for 100.0% (BL), 89.916% (ONB) and 81.64% (BNF) of all records, as well as on the text-based language detection approach described in Section 2. Our analysis of the datasets showed that relying merely on the language tags introduces many errors in language assignment: first, there are records tagged with the wrong language; second, as there is only one tag per record, tag-based language detection is not adequate for records containing fields in different languages. Our language detection model therefore determines the language of each field based on evidence from both tags and text-based classification. Table 2 shows the language distribution in the TEL datasets based on the tags (Tag) as well as on our detection model (Det). A manual evaluation using a random selection of records showed that the performance of the language detection approach on fields is reasonable.

BL                         ONB                        BNF
Lang      Tag    Det       Lang       Tag    Det      Lang     Tag    Det
English   61.8%  76.7%     German     69.6%  80.9%    French   56.4%  77.6%
French    5.3%   4.0%      English    11.9%  8.0%     English  12.9%  8.2%
German    4.1%   2.9%      French     2.8%   2.1%     German   4.1%   3.8%
Spanish   3.1%   2.0%      Italian    1.8%   1.5%     Italian  2.3%   1.4%
Russian   2.7%   1.7%      Esperanto  1.5%   1.5%     Spanish  2.0%   1.4%

Table 2: Distribution of the 5 most frequent languages in each dataset, based on the language tags (Tag) and on the language detection model (Det).

3.1.2 Wikipedia Database

For concept-based retrieval we used the Wikipedia databases in English, German and French as concept space.
As we rely on bijective mappings between articles across languages for CL-ESA, we selected only those articles that are connected via cross-language links between all three Wikipedia databases. In this case every article is a concept having textual descriptions in English, German and French, namely the article texts. Using the snapshots of 03/12/2008 for English, 06/25/2008 for French and 06/29/2008 for German, we obtained an aligned collection of 166,484 articles in all three languages.

3.1.3 Training Corpora for Language Detection

The language detection framework requires sufficiently large corpora in all languages the classifier is trained for. We rely on the Leipzig Corpora Collection^4, which contains texts collected from the web and newspapers, and the JRC-Acquis Multilingual Parallel Corpus^5, which contains documents published by the European Union translated into various languages.

^4 http://corpora.uni-leipzig.de
^5 http://wt.jrc.it/lt/Acquis/

3.2 Preprocessing

3.2.1 Language Detection

For language detection we used the n-gram language classifier included in the LingPipe software collection^6. The classifier was trained using the Leipzig and JRC-Acquis corpora. When a certain language was available in both corpora we preferred the data of the Leipzig Corpus, as this showed better results in a cross-validation on the training data. We conducted multiple tests to verify the effectiveness of the language detection model. The results showed that a 5-gram model and a 100,000-character training corpus are optimal in our case. Table 3 contains the classification results for test data of different sizes, measured in characters. The results show that the classifier achieves a high accuracy of more than 97% for text containing 32 or more characters. As this holds for most fields in the TEL dataset, the classifier is applicable to the language detection task in our framework.

Test Size (characters)  1       2       4       8       16      32
Accuracy                22.59%  34.82%  58.55%  81.17%  92.45%  97.33%

Test Size (characters)  64      128     256     512     1024    2048
Accuracy                98.99%  99.67%  99.86%  99.97%  99.99%  100%

Table 3: Results of language detection using test data of different character sizes, measured by classification accuracy.

^6 http://alias-i.com/lingpipe/

3.2.2 Document Preprocessing

We used the following methods for the preprocessing of documents:

Tokenizer: We used a standard white-space tokenizer. All non-character tokens were deleted. For Wikipedia articles we also deleted all wiki markup.

Stop-Word Filtering: We used standard stop word lists for English, German, Finnish, French, Italian, Portuguese and Swedish, which were taken from the University of Neuchâtel^7, and for Danish, Spanish, Dutch and Norwegian, which were taken from Ranks.nl^8.

Stemmer: We used the Snowball stemmers^9 to stem terms in English, German, French, Danish, Dutch, Finnish, Italian, Norwegian, Portuguese and Swedish.

Fields in languages other than those mentioned above were not preprocessed using stemmers or stop word lists.

^7 http://members.unine.ch/jacques.savoy/clef/
^8 http://www.ranks.nl/resources
^9 http://snowball.tartarus.org

3.3 Evaluation Measures

The relevance assessments for the search task are provided by CLEF, resulting from a pooled manual evaluation. As evaluation measures we report mean average precision (MAP), precision at a cutoff level of 10 (P@10) and recall at a cutoff level of 100 (R@100).
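For reference, the following sketch shows how these measures can be computed from a ranked result list and a set of relevant documents. These are the standard definitions; the function names are ours.

def average_precision(ranked_docs, relevant):
    """AP for one topic: mean of the precision values at each relevant rank."""
    hits, precisions = 0, []
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def precision_at(ranked_docs, relevant, k=10):
    """P@k: fraction of the top-k results that are relevant."""
    return sum(1 for doc in ranked_docs[:k] if doc in relevant) / k

def recall_at(ranked_docs, relevant, k=100):
    """R@k: fraction of all relevant documents found within the top k."""
    return sum(1 for doc in ranked_docs[:k] if doc in relevant) / len(relevant)

def mean_average_precision(runs, qrels):
    """MAP: mean of per-topic average precision.

    runs: topic -> ranked list of document ids;
    qrels: topic -> set of relevant document ids.
    """
    return sum(average_precision(runs[t], qrels[t]) for t in runs) / len(runs)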
3.4 Implementation

In our implementation we used different third-party software tools as well as our own implementations. For text-based retrieval, including inverted indexes and scoring models, we used the Terrier IR framework. For translating the topics into the various languages we used the machine translation service provided by Google^10. We used our own implementation of CL-ESA for concept-based retrieval^11. We also implemented an inverted concept index that allows efficient retrieval based on the concept representations of topics and documents. For example, for the ONB dataset the inverted concept index has a size of approx. 26 GB and the average processing time per topic is approx. 135 seconds.

^10 http://translate.google.com
^11 http://code.google.com/p/research-esa

4 Experiments on CLEF08 Ad-hoc Topics

In this section we present the results of experiments using the CLEF08 Ad-hoc topics. As relevance assessments are available for these topics, we used this task to optimize our system with respect to the retrieval model and the aggregation functions. In all experiments we relied on the mono-lingual task, i.e. English topics for the BL dataset, German topics for ONB and French topics for BNF. As all of these datasets contain documents in different languages, cross-lingual retrieval can be applied to find relevant documents in other languages. The mono-lingual task can therefore also be used to optimize the multi-lingual setting we propose in our framework.

4.1 Mono-lingual Retrieval Model

First we conducted experiments to optimize the retrieval models for MT-based IR. As this is based on mono-lingual retrieval, we compared the performance of different state-of-the-art retrieval models, relying on the retrieval models provided by the Terrier framework. The hypothesis here was that good performance in mono-lingual retrieval should also result in good performance in cross-lingual retrieval. We selected the best retrieval model for each dataset according to MAP and obtained the following best results on the TEL datasets: a MAP of .34 on the BL dataset using DLH13, a MAP of .22 on the ONB dataset using LemurTF_IDF and a MAP of .30 on the BNF dataset using BB2. In the remainder of this paper we report results relying on the best retrieval model for each dataset.

4.2 Rank Aggregation

As described above, our model comprises two aggregation steps: first, the results of the multiple text-based indexes are aggregated; afterwards, the aggregated score is combined with the concept-based retrieval score. In the following experiments we again used the CLEF2008 topics of the Ad-hoc mono-lingual task. The first aggregation step was evaluated on all three TEL datasets; for the evaluation of the second step we only performed experiments on the BL dataset.

4.2.1 Linear Aggregation for Multiple Indexes

The baseline for the proposed retrieval using multiple indexes is retrieval on a single index of all text in the documents, without language classification. The performance of this baseline is shown in the first row of Table 4. As described in Section 2, we used different normalization and weighting models for the linear aggregation of the multiple indexes. Table 4 contains the results of aggregation without normalization, using max score normalization, using the number of retrieved documents per index, and using a priori weights.
                                    BL                 ONB                BNF
Retrieval Method                    MAP  P@10  R@100   MAP  P@10  R@100   MAP  P@10  R@100
Baseline (single index)             .34  .51   .50     .23  .36   .45     .30  .38   .56
Multiple Indexes (no norm.)         .25  .36   .45     .18  .26   .42     .22  .26   .48
Multiple Indexes (max score norm.)  .07  .08   .14     .08  .14   .22     .12  .16   .26
Multiple Indexes (num ret norm.)    .34  .51   .50     .22  .35   .42     .29  .35   .54
Multiple Indexes (a priori)         .34  .51   .50     .23  .36   .43     .30  .38   .54

Table 4: Results for MT-based retrieval on the CLEF08 mono-lingual task using a single index and using different rank aggregation methods for multiple indexes.

The results clearly show that our approaches to aggregating the results of the multiple indexes are not able to beat the baseline using a single index. Normalization based on the number of retrieved documents as well as a priori weights both achieve comparable performance with respect to MAP, P@10 and R@100. The results indicate that linear aggregation over the multiple indexes does not improve the overall performance on this task.

As an alternative to linear combination, we experimented with Support Vector Machine based aggregation. To balance the training data, we used all relevant documents for all topics as positive samples and randomly selected non-relevant documents as negative samples to achieve a positive/negative ratio of 1/2. As SVM implementation we used LIBSVM^12. Using the SVM type C-SVC (c=1) with a radial basis function kernel, the training data could be classified in a 5-fold cross-validation with a precision of .61 and a recall of .42. However, when using the trained model for the actual retrieval, the MAP was very low at .01. When using a linear kernel, which would lead to a classifier comparable to linear aggregation, we were not able to learn the model as the learning algorithm did not terminate. Our assumption is that with these kernel functions it is not possible to separate the positive and negative samples in the feature space, which would also explain the bad performance of the resulting retrieval system. It might be possible to use SVMs for rank aggregation with other kernels, but within the scope of this paper we did not investigate this idea any further.

^12 http://www.csie.ntu.edu.tw/~cjlin/libsvm/

4.2.2 Linear Aggregation with Concept-Based Retrieval

In last year's technical report we presented results based only on concept-based retrieval using ESA [6]. In the current system we also investigate a modified version of the ESA-based mapping to the Wikipedia article space: the language classification step represents the TEL records as multi-lingual documents, which is used to map the document fragments of each language to the concept space based on the Wikipedia database in the corresponding language. The concept vector representations of the different fragments are then combined into a single concept vector for each document, as described in Section 2. Experiments on the CLEF08 mono-lingual task on the BL dataset showed an improvement of the new concept mapping model over the model used in last year's experiments of 1% MAP, 7% P@10 and 5% R@100. For our experiments on the CLEF09 tasks we therefore used the new model.

In our final experiments using the CLEF08 topics we investigated the combination of MT-based retrieval and concept-based retrieval. As suggested for example in [4], we again chose a linear aggregation function. The problem thereby is to find an optimal weight for each retrieval model; a sketch of such a weight search is shown below.
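A minimal sketch of such a weight search: sweep a weight lam over [0, 1] for the convex combination lam * score_MT + (1 - lam) * score_ESA and keep the value that maximizes MAP on topics with known relevance assessments. This is an illustration under our own naming (average_precision repeats the helper sketched in Section 3.3), not the exact tuning code we ran.

def combined_ranking(mt_scores, esa_scores, lam):
    """Rank documents by lam * MT score + (1 - lam) * concept score."""
    docs = set(mt_scores) | set(esa_scores)
    return sorted(docs,
                  key=lambda d: lam * mt_scores.get(d, 0.0)
                                + (1 - lam) * esa_scores.get(d, 0.0),
                  reverse=True)

def average_precision(ranked_docs, relevant):
    """AP: mean precision at the ranks of the relevant documents."""
    hits, precisions = 0, []
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def grid_search_weight(mt_runs, esa_runs, qrels, steps=20):
    """Sweep lam over [0, 1] in 1/steps increments and keep the best MAP.

    mt_runs / esa_runs: topic -> dict of document id -> score;
    qrels: topic -> set of relevant document ids.
    """
    best = (0.0, -1.0)  # (lam, MAP)
    for i in range(steps + 1):
        lam = i / steps
        mean_ap = sum(
            average_precision(combined_ranking(mt_runs[t], esa_runs[t], lam),
                              qrels[t])
            for t in mt_runs
        ) / len(mt_runs)
        if mean_ap > best[1]:
            best = (lam, mean_ap)
    return best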
We approximated the optimal weight by a brute-force, systematic exploration of the parameter space. The results of this exploration for the BL dataset are presented in Figure 2. The leftmost bar represents the MAP value when giving full weight to MT-based retrieval, while the rightmost bar represents the MAP when giving full weight to concept-based retrieval. The bars in between result from experiments using the combined approach with different weights. For the experiments using the CLEF2009 topics we used the best weightings derived from these experiments.

[Figure 2: Results (MAP) for the CLEF2008 mono-lingual ad-hoc task on the BL dataset using different weightings of MT-based retrieval and concept-based retrieval combined by linear aggregation. The leftmost result corresponds to pure MT-based retrieval, the rightmost to pure concept-based retrieval.]

5 Experiments on CLEF09 Ad-hoc Topics

The CLEF09 Ad-hoc topics are similar to the topics from CLEF08. The 50 topics have the same format consisting of two fields: a short title containing 2-4 keywords and a description of the information need in 1-2 sentences. The objective is to query the selected target collection using topics in the same language (mono-lingual run) or topics in a different language (bi-lingual run) and to submit the results as a ranked list ordered by decreasing relevance. In line with these objectives we submitted the results of six different runs to CLEF09, querying English, German and French topics against the BL, ONB and BNF datasets. The results of our experiments are presented in Table 5.

Topic                              BL                 ONB                BNF
Lang.  Retrieval Method            MAP  P@10  R@100   MAP   P@10  R@100  MAP  P@10  R@100
en     Baseline (single index)     .35  .51   .55     .16   .26   .36    .25  .39   .45
       Multiple Indexes            .33  .50   .52     .15   .24   .35    .22  .34   .45
       Concept + Baseline          .35  .52   .54     .17*  .27   .37    .25  .39   .45
de     Baseline (single index)     .33  .49   .53     .23   .35   .47    .24  .35   .45
       Multiple Indexes            .31  .48   .51     .23   .34   .49    .22  .32   .43
       Concept + Baseline          .33  .49   .53     .24*  .35   .47    .24  .36   .45
fr     Baseline (single index)     .31  .48   .50     .15   .22   .31    .27  .38   .51
       Multiple Indexes            .29  .45   .47     .14   .20   .32    .25  .35   .50
       Concept + Baseline          .32  .51*  .50     .15   .22   .31    .27  .37   .50

Table 5: Results on the CLEF 2009 Ad-hoc Task. Statistically significant improvements according to a paired t-test with confidence level .05 are marked with *.

The results using multiple indexes show that this approach was not able to beat the baseline. Using a single index of the TEL records without language classification, with topics translated only into the main language of each dataset, achieved better performance than our approach based on one index per language and multiple translations of the topic into the matching languages. Another result is that the combination of concept-based retrieval with MT-based retrieval was able to improve the retrieval in some cases. The improvement was significant according to a paired t-test with confidence level .05 for French topics on the BL dataset and for English and German topics on the ONB dataset. However, in many cases the performance was similar to the baseline, without statistical significance of the difference. We could therefore not reproduce the strong improvements presented e.g. in [4].
6 Conclusion

In this paper we have presented a cross-language information retrieval approach based on multiple indexes for different languages and rank aggregation to combine the different partial results. The approach was developed in light of the fact that the CLEF TEL dataset consists of records in different languages which may also contain fragments in more than one language. This approach requires language detection for all document fragments of the dataset as well as translation of the topics into all supported languages. Our results showed that for the CLEF08 and CLEF09 Ad-hoc tasks we were not able to improve retrieval results with this new model: the baseline, consisting of a single index without language classification and a topic translated only into the index language, achieved similar or even better results.

We also combined machine translation based retrieval with concept-based retrieval. The results showed that we were able to improve over the baseline through this combination in some cases. However, the improvements on the CLEF Ad-hoc task were not as strong as those reported in related work.

Acknowledgments

This work was funded by the Multipla project sponsored by the German Research Foundation (DFG) under grant number 38457858.

References

[1] W. B. Croft. Combining approaches to information retrieval. In Advances in Information Retrieval, pages 1–36. 2000.

[2] E. Gabrilovich and S. Markovitch. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pages 1606–1611, 2007.

[3] J. Kürsten, T. Wilhelm, and M. Eibl. CLEF 2008 Ad-Hoc Track: On-line Processing Experiments with Xtrieval. In Working Notes of the Annual CLEF Meeting, 2008.

[4] C. Müller and I. Gurevych. Using Wikipedia and Wiktionary in Domain-Specific Information Retrieval. In Working Notes of the Annual CLEF Meeting, 2008.

[5] G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, Inc., 1986.

[6] P. Sorg and P. Cimiano. Cross-lingual Information Retrieval with Explicit Semantic Analysis. In Working Notes of the Annual CLEF Meeting, 2008.

[7] P. Sorg and P. Cimiano. An experimental comparison of explicit semantic analysis implementations for cross-language retrieval. In Proceedings of the International Conference on Applications of Natural Language to Information Systems (NLDB), Saarbrücken, 2009.