<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>P. Schoenhofen, A. Benzcur, I. Biro, and K. Csalogany. Performing cross-language retrieval with
wikipedia. In Proceedings of CLEF</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Cross-lingual Information Retrieval with Explicit Semantic Analysis</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Philipp Sorg, Philipp Cimiano Institute AIFB, University of Karlsruhe</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>[12] M.L. Littman and Greg A. Keim. Cross-language text retrieval with three languages. Technical report, Department of Computer Science, Duke University</institution>
          ,
          <addr-line>Durham, North Carolina, 1997</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2007</year>
      </pub-date>
      <volume>2007</volume>
      <fpage>49</fpage>
      <lpage>57</lpage>
      <abstract>
        <p>We participated in the monolingual and bilingual CLEF Ad-Hoc Retrieval Tasks, using a novel extension of the by now well-known Explicit Semantic Analysis (ESA) approach. We call this extension Cross-Language Explicit Semantic Analysis (CL-ESA), as it allows ESA to be applied in a cross-lingual information retrieval setting. In essence, ESA represents documents as vectors in the space of Wikipedia articles, using the tf.idf measure to capture how “important” a Wikipedia article is for a specific word. The interesting property of ESA is that arbitrary documents can be represented as vectors with respect to the Wikipedia article space. ESA thus replaces the standard bag-of-words (BOW) model for retrieval. In our cross-lingual extension of ESA, the cross-language links of Wikipedia are used to map the ESA vectors between different languages, thus allowing retrieval across languages. Our results are far behind those of other systems on the monolingual and bilingual ad-hoc retrieval tasks, but our motivation was to assess the potential of the CL-ESA approach using a first, unoptimized implementation.</p>
      </abstract>
      <kwd-group>
        <kwd>Cross-language Information Retrieval</kwd>
        <kwd>Explicit Semantic Analysis</kwd>
        <kwd>Wikipedia</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        When tackling the task of retrieving documents across languages, there seem to be essentially two main
paradigms:
1. Translation-based approaches, which rely on translating either the documents or the queries. For the
translation of queries, one typically relies on bilingual dictionaries (compare [10], [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]).
2. Mapping of queries and documents into a multilingual space in which similarity between queries
and documents can be computed uniformly across languages.
      </p>
      <p>
        The first type of approach is obviously highly dependent on the quality of the translation system or
the bilingual dictionary used. Demner-Fushman et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] have in particular shown that the coverage
of the bilingual dictionary has a crucial impact on the retrieval task. As mentioned by Demner-Fushman et
al., for a successful dictionary-based CLIR model, the following three steps need to be accomplished: (1)
selection of the terms to be translated, (2) generation of a set of candidate translations, and (3) use of that
set of candidate translations in the retrieval process.
      </p>
      <p>
        Concerning the second type of approach, in which queries and documents are mapped into a
multilingual space, there are two crucially different models:
Latent model: Instead of representing documents (and queries) with respect to the bag-of-words
dimensions, some approaches compute “latent” concepts from the data and index documents with
respect to these latent concepts. Latent concepts correspond to certain topics emerging bottom-up
from the document collection. The most prominent technique here is latent semantic analysis (LSA)
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. In fact, LSA has also been applied in cross-lingual IR settings (compare [17]). For this purpose,
parallel texts are needed across languages in order to construct a matrix where the dimensions
correspond to words in all languages considered. Dimensionality reduction is then applied to discover
correlated words across languages. Queries and documents can then be represented in this “latent
space” and retrieval can be performed in a standard fashion by calculating the cosine in this space.
External category model: In contrast to retrieval models which build on latent topics or concepts,
one can also choose a set of external categories, topics or concepts to define the dimensions of the
vectors. These can be categories from existing thesauri, ontologies etc. The advantage is that the
vector space then remains constant across different document collections, in particular also across languages.
Such models presuppose that we are indeed able to index texts in various languages with respect to
the multilingual space spanned by the external categories.
      </p>
      <p>
        The latter approach based on indexing with respect to external categories is interesting in the sense
that i) no parallel texts are required (e.g. in order to compute latent topics grouping words from different
languages), and ii) no bilingual dictionaries are needed. Obviously, this is true only to some extent as
the mapping into the external categories (across languages) might well require cross-lingual dictionaries.
Gabrilovich and Markovitch [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] have for example recently presented an interesting approach in which
Wikipedia articles are used as dimensions of the vectors, i.e. documents are indexed with respect to the
Wikipedia article space. While Gabrilovich and Markovitch have applied this model to calculate semantic
relatedness between words, this model extends straightforwardly to an IR setting, in which query and
documents are mapped to a vector representing the Wikipedia article space (see for instance [9] and [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]).
      </p>
      <p>An interesting characteristic of Wikipedia is that articles are linked across languages by bidirectional
language links. Thus, we can in principle translate a document or query vector indexed with respect to the
Wikipedia of language <inline-formula><tex-math>L_i</tex-math></inline-formula> into the Wikipedia article space of language <inline-formula><tex-math>L_j</tex-math></inline-formula>, which extends the approach straightforwardly to a cross-lingual retrieval
task.</p>
      <p>In this paper we investigate this idea more closely and present an approach for cross-language IR based on
Explicit Semantic Analysis. In particular, we present our system as it has been used on the CLEF
monolingual and multilingual Ad-Hoc retrieval tasks. Further, we also present additional experiments on the
Multext dataset conducted after the submission to the CLEF campaign in order to verify some of the
parameter settings of our approach on another dataset. In order to quantify the influence of the
parameters, we have in particular conducted standard mate retrieval experiments on the Multext dataset.</p>
      <p>The article is structured as follows: in Section 2 we describe the ESA model in more detail
and show how it can be used in a retrieval setting. In Section 3 we discuss how this model can be extended
to a cross-lingual setting relying on the Wikipedia cross-language links. In Section 4 we discuss some
implementation details which are nevertheless important to understand how the overall system works on
the task of cross-language IR. Finally, in Section 5 we present our results on the CLEF datasets as well as
on the Multext corpus.</p>
    </sec>
    <sec id="sec-2">
      <title>Explicit Semantic Analysis (ESA)</title>
      <p>
        Explicit Semantic Analysis (ESA) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] attempts to index or classify a given text t with respect to a set of
explicitly given external categories. It is in this sense that ESA is explicit compared to approaches which
aim at representing texts with respect to latent topics or concepts, as done in Latent Semantic Analysis
(LSA) (see [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], [11]). Gabrilovich and Markovitch have outlined the general theory behind ESA and in
particular described its instantiation to the case of using Wikipedia articles as external categories. We will
basically build on this instantiation as described in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] which we briefly summarize in the following.
      </p>
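      <p>In essence, Explicit Semantic Analysis takes as input a text <inline-formula><tex-math>t</tex-math></inline-formula> and maps it to a high-dimensional
real-valued vector space. This vector space is spanned by a Wikipedia database <inline-formula><tex-math>W_k = \{a_1, \ldots, a_n\}</tex-math></inline-formula> in language
<inline-formula><tex-math>L_k</tex-math></inline-formula> such that each dimension corresponds to an article <inline-formula><tex-math>a_i</tex-math></inline-formula>. This mapping is given by the following function:
        <disp-formula><tex-math>\Phi_k : T \to \mathbb{R}^{|W_k|}</tex-math></disp-formula>
        <disp-formula><tex-math>\Phi_k(t) := \langle v_1, \ldots, v_{|W_k|} \rangle</tex-math></disp-formula>
where <inline-formula><tex-math>|W_k|</tex-math></inline-formula> is the number of articles in Wikipedia <inline-formula><tex-math>W_k</tex-math></inline-formula> corresponding to language <inline-formula><tex-math>L_k</tex-math></inline-formula>. The value <inline-formula><tex-math>v_i</tex-math></inline-formula> in
the ESA vector of <inline-formula><tex-math>t</tex-math></inline-formula> expresses the strength of association between <inline-formula><tex-math>t</tex-math></inline-formula> and the Wikipedia article <inline-formula><tex-math>a_i</tex-math></inline-formula>. Based on
a function <inline-formula><tex-math>\mathit{as}</tex-math></inline-formula> that defines the strength of association between words and Wikipedia articles, the values <inline-formula><tex-math>v_i</tex-math></inline-formula>
can be computed as the sum of the association strengths of all words of <inline-formula><tex-math>t = \langle w_1, \ldots, w_s \rangle</tex-math></inline-formula> to the article <inline-formula><tex-math>a_i</tex-math></inline-formula>:
        <disp-formula><tex-math>v_i := \sum_{w_j \in t} \mathit{as}(w_j, a_i)</tex-math></disp-formula>
      </p>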
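      <p>One approach to defining such an association strength function <inline-formula><tex-math>\mathit{as}</tex-math></inline-formula> is to use a tf.idf function based on the
Bag-of-Words (BOW) model of the Wikipedia articles. The association strength of word <inline-formula><tex-math>w_j</tex-math></inline-formula> to article <inline-formula><tex-math>a_i</tex-math></inline-formula> is
then equal to the tf.idf value of <inline-formula><tex-math>w_j</tex-math></inline-formula> in <inline-formula><tex-math>a_i</tex-math></inline-formula>:</p>
      <p><disp-formula><tex-math>\mathit{as}(w_j, a_i) = \mathit{tf.idf}_{a_i}(w_j)</tex-math></disp-formula></p>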
      <p>
        In the literature, many different definitions of tf.idf functions based on the BOW model have been
proposed (see [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]). The particular function that was used in our experiments is described in Section 4.
      </p>
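      <p>As a minimal sketch of the mapping described above, the following Python snippet computes an ESA vector over a toy "Wikipedia" of three invented articles. The article contents, the function names <monospace>tf_idf</monospace> and <monospace>esa_vector</monospace>, and the plain tf.idf variant (raw term frequency times the logarithm of the inverse document frequency) are assumptions made for illustration; the function actually used in the experiments is the customized Lucene similarity of Section 4.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math
from collections import Counter

# Toy stand-in for a Wikipedia database W_k: article title -> token list.
# The three articles and their contents are invented for illustration.
ARTICLES = {
    "Horror film": ["scary", "movie", "horror", "movie"],
    "Comedy": ["funny", "movie"],
    "Chess": ["chess", "board", "game"],
}


def tf_idf(word, title, articles):
    """as(w_j, a_i): a plain tf.idf weight (one of many possible variants)."""
    tf = Counter(articles[title])[word]
    df = sum(1 for tokens in articles.values() if word in tokens)
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(articles) / df)


def esa_vector(text_tokens, articles):
    """Phi_k(t): v_i is the sum of as(w_j, a_i) over all words w_j of t."""
    return {
        title: sum(tf_idf(w, title, articles) for w in text_tokens)
        for title in articles
    }


vector = esa_vector(["scary", "movie"], ARTICLES)
# Articles sorted by decreasing association strength with the text.
ranking = sorted(vector, key=vector.get, reverse=True)
```
]]></preformat>
      <p>In this toy setting the text is most strongly associated with the article that contains both of its words, and the article sharing no word with the text receives the value zero.</p>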
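      <p>
        Essentially, for each article <inline-formula><tex-math>a_i</tex-math></inline-formula> in Wikipedia, ESA sums up the association strengths of all words <inline-formula><tex-math>w_j</tex-math></inline-formula>
appearing in the document. In this sense, the Semantic Interpreter applying ESA described in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] essentially
computes the function <inline-formula><tex-math>\Phi_k</tex-math></inline-formula>. As output we thus get a vector representing the strength of association of our text
<inline-formula><tex-math>t</tex-math></inline-formula> with respect to the articles in Wikipedia <inline-formula><tex-math>W_k</tex-math></inline-formula>. This vector corresponds to a ranking of the
Wikipedia articles according to their importance or relevance for the text <inline-formula><tex-math>t</tex-math></inline-formula>.
      </p>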
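      <p>Given the ESA framework, we can assess the similarity between two texts <inline-formula><tex-math>t_i, t_j \in T</tex-math></inline-formula>, between a query
<inline-formula><tex-math>q</tex-math></inline-formula> and a text <inline-formula><tex-math>t_i</tex-math></inline-formula>, etc. For example, the standard cosine measure can be used to compare the vectors. In the
remainder of this paper we will simply assume that the cosine is used to compare different vectors.</p>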
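      <p>The cosine comparison of two ESA vectors can be sketched as follows. Representing sparse vectors as Python dictionaries, the function name <monospace>cosine</monospace>, the example values, and the convention of returning zero for an all-zero vector are assumptions of this sketch, not details taken from our implementation.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math


def cosine(u, v):
    """Cosine between two sparse vectors stored as {dimension: value} dicts."""
    dot = sum(value * v.get(dim, 0.0) for dim, value in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


# Invented ESA vectors over Wikipedia article dimensions.
query_vec = {"Horror film": 2.0, "Comedy": 1.0}
doc_vec = {"Horror film": 1.0, "Chess": 1.0}
similarity = cosine(query_vec, doc_vec)
```
]]></preformat>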
      <p>
        In fact, this framework is flexible enough to be applied to a variety of tasks, computing the similarity between:
single words, which can be seen as singleton texts consisting of only one word. This can then be
used to compute semantic relatedness between words as in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Gabrilovich and Markovitch actually
showed that their method performs better than LSI on the task of computing semantic relatedness
between words;
two documents (e.g. in a clustering task);
a query and a document (e.g. in a retrieval task).
      </p>
      <p>In this paper we are concerned with a retrieval task, in which we are given a query q and need to rank the
documents according to relevance. It should be clear from the above discussions that ESA straightforwardly
extends to a retrieval scenario.</p>
      <p>As a running example in this paper, we will use query 10.2452/460-AH (“Scary Movies”) from the
2008 CLEF Ad-hoc retrieval dataset where our system performed remarkably well. In the following table
we indicate the 10 top-ranked Wikipedia articles for the query in the three languages German, English and
French:</p>
      <p>Language: English. Query: Scary Movies.</p>
      <p>The top-10 ranked articles clearly differ between the languages. It is in particular interesting to observe
that many results are actually named entities which clearly differ between languages due to a different
cultural background. Consequently, the ESA vectors for the same query in different languages vary
substantially, which is suboptimal in a cross-language retrieval setting.</p>
      <p>In the following section, we present our own extension to ESA called CL-ESA (Cross-language Explicit
Semantic Analysis)<sup>1</sup>, which represents a relatively straightforward extension of ESA to a cross-lingual
setting. Our main aim in this paper is to determine whether CL-ESA performs well in a cross-lingual retrieval
setting.</p>
    </sec>
    <sec id="sec-3">
      <title>Cross-lingual ESA (CL-ESA)</title>
      <p>A very interesting characteristic of Wikipedia, besides the overwhelming amount of information created
dynamically and in a collaborative way, is the fact that articles are linked across languages. Cross-language
links are those that link a certain article to a corresponding article in the Wikipedia database in another
language. A previous analysis of this cross-lingual link structure between the German and English Wikipedia
showed that 95% of these links are indeed bi-directional (see [16]). The analysis of French-English and
French-German links showed similar results. In the following we therefore assume the existence of a
mapping function <inline-formula><tex-math>m_{i \to j}</tex-math></inline-formula> that maps an article of Wikipedia <inline-formula><tex-math>W_i</tex-math></inline-formula> to its corresponding article in Wikipedia
<inline-formula><tex-math>W_j</tex-math></inline-formula>.</p>
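      <p>In fact, given a text <inline-formula><tex-math>t \in T</tex-math></inline-formula> in language <inline-formula><tex-math>L_i</tex-math></inline-formula>, it turns out that we can simply index this document with
respect to any of the other languages <inline-formula><tex-math>L_1, \ldots, L_n</tex-math></inline-formula> we consider by transforming the vector <inline-formula><tex-math>\Phi_i(t)</tex-math></inline-formula> into a
corresponding vector in the vector space that is spanned by the articles of Wikipedia in the target language.
Thus, given that we consider <inline-formula><tex-math>n</tex-math></inline-formula> languages, we have <inline-formula><tex-math>n^2</tex-math></inline-formula> mapping functions of the type:</p>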
      <sec id="sec-3-1">
        <title>This mapping is calculated as follows:</title>
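        <p>
          <disp-formula><tex-math>\Psi_{i \to j} : \mathbb{R}^{|W_i|} \to \mathbb{R}^{|W_j|}</tex-math></disp-formula>
          <disp-formula><tex-math>\Psi_{i \to j}(\langle v_1, \ldots, v_{|W_i|} \rangle) = \langle v'_1, \ldots, v'_{|W_j|} \rangle</tex-math></disp-formula>
where
          <disp-formula><label>(1)</label><tex-math>v'_q = \sum_{p \,:\, m_{i \to j}(a_p) = a'_q} v_p</tex-math></disp-formula>
with <inline-formula><tex-math>1 \le p \le |W_i|</tex-math></inline-formula>, <inline-formula><tex-math>1 \le q \le |W_j|</tex-math></inline-formula>, where <inline-formula><tex-math>a_p</tex-math></inline-formula> denotes an article of <inline-formula><tex-math>W_i</tex-math></inline-formula> and <inline-formula><tex-math>a'_q</tex-math></inline-formula> an article of <inline-formula><tex-math>W_j</tex-math></inline-formula>. In case that <inline-formula><tex-math>i = j</tex-math></inline-formula> we thus have the identity function.</p>
        <p><sup>1</sup> We would like to point out that we have developed and called our model CL-ESA independently of the CL-ESA approach
described by Potthast et al. [13]. We discovered this work just after finishing our paper, so that CL-ESA is introduced here as a novel
paradigm while it clearly has the CL-ESA approach of Potthast et al. as precedent. We thank the Web Technology &amp; Information
Systems Group of Weimar University (in particular Martin Potthast) for bearing with us in spite of missing their work in the first place
and for the exchange with respect to technical details related to the implementation of the ESA approach.</p>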
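        <p>In order to get the ESA representation of a document <inline-formula><tex-math>t \in T</tex-math></inline-formula> in language <inline-formula><tex-math>L_i</tex-math></inline-formula> with respect to Wikipedia
<inline-formula><tex-math>W_j</tex-math></inline-formula> we simply have to compute the function <inline-formula><tex-math>\Psi_{i \to j}(\Phi_i(t))</tex-math></inline-formula>.</p>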
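        <p>The mapping of an ESA vector along cross-language links can be sketched as follows. The function name <monospace>map_vector</monospace>, the dictionary representation, and the German-English link table are hypothetical and serve only as an illustration: values of source articles that link to the same target article are summed, and dimensions without a cross-language link are dropped.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from collections import defaultdict


def map_vector(vector, links):
    """Map an ESA vector from Wikipedia W_i to Wikipedia W_j.

    vector: {source article: value}
    links:  {source article: target article}, i.e. the mapping m_{i->j}
    Values of source articles that share a target article are summed;
    dimensions without a cross-language link are dropped.
    """
    mapped = defaultdict(float)
    for article, value in vector.items():
        target = links.get(article)
        if target is not None:
            mapped[target] += value
    return dict(mapped)


# Hypothetical German -> English language links (illustrative only).
DE_EN = {
    "Horrorfilm": "Horror film",
    "Gruselfilm": "Horror film",
    "Schach": "Chess",
}
german_vec = {"Horrorfilm": 1.5, "Gruselfilm": 0.5, "Schach": 0.25, "Kino Wien": 0.75}
english_vec = map_vector(german_vec, DE_EN)
```
]]></preformat>
        <p>Here the two German articles linking to the same English article contribute their summed value to one English dimension, while the German article without a language link contributes nothing.</p>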
        <p>In the following table, we give the top-ranked Wikipedia articles for our running example query together
with the result of mapping the German and French vectors into the English Wikipedia space:</p>
        <p>This gives us an elegant retrieval model which is uniform across languages. A prerequisite for
this model is certainly that we know the language of the query and of the different documents in order to
know which mapping should be applied. We describe in the implementation section how we actually
implemented a straightforward component for language detection.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Implementation</title>
      <p>In this section we describe the implementation details we used for our experiments. In particular, we
describe i) the document preprocessing (Section 4.1), ii) the actual ESA implementation that consists of
article preprocessing, ESA vector computation and multi-lingual mapping (Section 4.2), iii) the
method used to identify the language of a document (Section 4.3), and iv) the overall retrieval process
(Section 4.4).</p>
      <sec id="sec-4-1">
        <title>Preprocessing of Documents</title>
        <sec id="sec-4-1-1">
          <title>We used the following methods for the preprocessing of documents:</title>
          <p><bold>Tokenizer.</bold> As tokenizer we used a standard white space tokenizer. All non-character tokens were deleted.</p>
          <p><bold>Stop-Word Filtering.</bold> We used standard stop word lists in the languages English, German and French to
filter out stop words.</p>
          <p><bold>Stemmer.</bold> All terms in the documents were stemmed using Snowball stemmers<sup>2</sup>, which are available for many
different languages including English, German and French.</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>ESA Implementation</title>
        <p>The implementation of Cross-Lingual ESA can be divided into three steps. The first step is the
preprocessing of the Wikipedia articles. This includes preprocessing of the article texts as well as the selection
of articles that will be used for ESA indexing. The next step is the computation of the ESA vector, which
depends on the choice of the association strength (as) function that assigns the strength of association
between words of the documents and Wikipedia articles. The last step is the multi-lingual mapping of the
ESA vector.</p>
        <p>In the following, the implementation of all of these steps including different variations and parameters
will be explained in detail.</p>
        <p><bold>Wikipedia Article Preprocessing.</bold>
The Wikipedia articles were processed using the Wikipedia tokenizer that is included in the
Lucene<sup>3</sup> software package and then applying the same methods for stop word removal and stemming as in
the preprocessing of the documents. The Wikipedia tokenizer removes all Wiki markup from the text, e.g.
syntax for links, headings and font styles.</p>
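        <p>The text preprocessing shared by documents and articles (tokenization, token filtering, stop word removal, stemming) can be sketched as follows. This is a minimal sketch under stated assumptions: the function name <monospace>preprocess</monospace> and the tiny stop word list are invented, lowercasing is added for convenience, "non-character tokens" is read here as tokens that do not consist solely of letters, and the stemmer is passed in as a callable (the identity by default) so that a Snowball stemmer could be plugged in.</p>
        <preformat preformat-type="code"><![CDATA[
```python
# Tiny illustrative stop word list; standard per-language lists were used
# in the experiments.
STOP_WORDS = {"the", "of", "in", "and", "a"}


def preprocess(text, stop_words=STOP_WORDS, stem=lambda token: token):
    """White space tokenization, token filtering, stop word removal, stemming."""
    tokens = text.lower().split()  # white space tokenizer (lowercasing is an assumption)
    tokens = [t for t in tokens if t.isalpha()]  # assumed reading of "non-character tokens"
    tokens = [t for t in tokens if t not in stop_words]
    return [stem(t) for t in tokens]  # identity unless a stemmer is supplied


terms = preprocess("Studies in the anthropology of North American indians series")
```
]]></preformat>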
        <p>The selection of articles that were used as dimensions of the ESA vector was based on different criteria.
First we filtered out all redirect articles and all category articles. Then all articles with fewer than 100
words or fewer than 5 incoming pagelinks were discarded. In our first experiments, we did not perform any
further selection. The results of the CLEF ad hoc retrieval are based on these settings. In the subsequent
experiments on the Multext dataset, we restricted the Wikipedia articles used for ESA indexing to those that
have at least one language link to one of the two other languages we consider. For example, we only consider
an article of the English Wikipedia if it has a cross-language link to the German or the French Wikipedia.
In absolute numbers, we used 536,896 English, 390,027 German and 362,972 French articles for the ESA
indexing (Wikipedia snapshot of March 12, 2008).</p>
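        <p>The article selection criteria can be sketched as a simple filter. The <monospace>Article</monospace> record, its field names and the example articles are hypothetical; only the thresholds (100 words, 5 incoming pagelinks) and the optional language link restriction are taken from the description above.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass


@dataclass
class Article:
    # Hypothetical record; the field names are chosen for this sketch only.
    title: str
    num_words: int
    incoming_links: int
    is_redirect: bool = False
    is_category: bool = False
    has_language_link: bool = False  # link to one of the other two languages


def select_articles(articles, require_language_link=False,
                    min_words=100, min_incoming_links=5):
    """Return the articles that are kept as ESA dimensions."""
    selected = []
    for a in articles:
        if a.is_redirect or a.is_category:
            continue
        if a.num_words < min_words or a.incoming_links < min_incoming_links:
            continue
        if require_language_link and not a.has_language_link:
            continue
        selected.append(a)
    return selected


candidates = [
    Article("Horror film", 1200, 40, has_language_link=True),
    Article("Stub", 50, 10),
    Article("Orphan", 500, 2),
    Article("Category:Films", 300, 20, is_category=True),
    Article("Local topic", 800, 12),
]
default_selection = [a.title for a in select_articles(candidates)]
restricted_selection = [
    a.title for a in select_articles(candidates, require_language_link=True)
]
```
]]></preformat>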
        <p>
          In the original ESA approach, Gabrilovich and Markovitch included more preprocessing and selection
steps [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. For example, they added the anchor text of incoming pagelinks and the titles of redirects
to the text of an article. Some articles, such as articles about years and similar, were discarded. We have not made use
of any similar additional heuristics in our implementation of the ESA/CL-ESA approach. Nevertheless, it
would be interesting to study the influence of such additional heuristics in the future.
        </p>
        <p><bold>ESA Vector Computation.</bold>
The computation of the ESA vector is based on an inverted index of the preprocessed selected Wikipedia
articles. Each document of the dataset can then be treated as a query to this index. The retrieved articles
with their weights can then be used to build the ESA vector.</p>
        <p><sup>2</sup> http://snowball.tartarus.org
<sup>3</sup> http://lucene.apache.org</p>
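        <p>The index was implemented using Lucene. As the function for computing the
association strength between documents and articles, we used a customized implementation of the Lucene
similarity function which computes the following function for a text <inline-formula><tex-math>t = \langle w_1, \ldots, w_l \rangle</tex-math></inline-formula> and a Wikipedia
article <inline-formula><tex-math>a_i</tex-math></inline-formula> of Wikipedia database <inline-formula><tex-math>W</tex-math></inline-formula>:
          <disp-formula><tex-math>\mathit{as}_R(t, a_i) = C_t \cdot \frac{1}{\sqrt{|a_i|}} \sum_{w_j \in t} \mathit{tf}_{a_i}(w_j) \, \mathit{idf}(w_j)</tex-math></disp-formula>
with
          <disp-formula><tex-math>C_t = \frac{1}{\sqrt{\sum_{w_j \in t} \mathit{idf}(w_j)}}</tex-math></disp-formula>
          <disp-formula><tex-math>\mathit{tf}_{a_i}(w_j) = \sqrt{\text{number of occurrences of } w_j \text{ in } a_i}</tex-math></disp-formula>
          <disp-formula><tex-math>\mathit{idf}(w_j) = 1 + \log \frac{|W|}{\text{number of articles containing } w_j + 1}</tex-math></disp-formula>
        </p>
        <p>The choice of <inline-formula><tex-math>\mathit{as}_R</tex-math></inline-formula> is motivated by its good performance on IR tasks. We therefore assume that this
association strength can be used for the computation of the values of the ESA vector. The factor <inline-formula><tex-math>1/\sqrt{|a_i|}</tex-math></inline-formula>
constitutes a normalization by the length of the article. The factor <inline-formula><tex-math>C_t</tex-math></inline-formula> depends only on the query and
therefore does not affect the relevance ranking of articles for the text <inline-formula><tex-math>t</tex-math></inline-formula> or the cosine computation.</p>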
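        <p>The association strength described above can be sketched in plain Python, independently of Lucene. The sketch assumes the following reading of the formula: a query factor of one over the square root of the summed idf values, a length normalization of one over the square root of the article length, the square root of the occurrence count as term frequency, and an idf of one plus the logarithm of the number of articles divided by the document frequency plus one. The toy articles and the function names <monospace>idf</monospace> and <monospace>as_r</monospace> are invented for illustration.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math
from collections import Counter

# Invented toy articles: title -> token list.
ARTICLES = {
    "Horror film": ["scary", "movie", "horror", "movie"],
    "Comedy": ["funny", "movie"],
    "Chess": ["chess", "board", "game"],
}


def idf(word, articles):
    """1 + log(|W| / (number of articles containing the word + 1))."""
    df = sum(1 for tokens in articles.values() if word in tokens)
    return 1.0 + math.log(len(articles) / (df + 1))


def as_r(text_tokens, title, articles):
    """Association strength between a text and one article (assumed reading)."""
    if not text_tokens:
        return 0.0
    tokens = articles[title]
    counts = Counter(tokens)
    c_t = 1.0 / math.sqrt(sum(idf(w, articles) for w in text_tokens))
    length_norm = 1.0 / math.sqrt(len(tokens))
    total = sum(math.sqrt(counts[w]) * idf(w, articles) for w in text_tokens)
    return c_t * length_norm * total


text = ["scary", "movie"]
scores = {title: as_r(text, title, ARTICLES) for title in ARTICLES}
```
]]></preformat>
        <p>Since the query factor is identical for all articles, dropping it would leave the ranking of the articles for a given text unchanged.</p>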
        <p>
          In the experiments on the Multext dataset, we also used a different function that computes a bit-valued
ESA vector. This function <inline-formula><tex-math>\mathit{as}_{BIT}</tex-math></inline-formula> is defined as follows:
          <disp-formula><tex-math><![CDATA[\mathit{as}_{BIT}(t, a_i) = \begin{cases} 1 & \text{if } a_i \text{ contains any } w_j \in t \\ 0 & \text{else} \end{cases}]]></tex-math></disp-formula>
For both functions, the number <inline-formula><tex-math>k</tex-math></inline-formula> of articles (dimensions) considered in order to compute the ESA vector is
used as a parameter. In fact, it seems that for the computation of the ESA vector “less is more” as conveyed
by the experiments described in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. However, this is only the case provided that we have a reasonable way
of determining which articles are most suitable. In our approach we only set those values in the ESA vector
corresponding to the <inline-formula><tex-math>k</tex-math></inline-formula> articles with the highest association strength to a document <inline-formula><tex-math>t</tex-math></inline-formula>. Thus, the vectors we
consider are relatively sparse with <inline-formula><tex-math>|W| - k</tex-math></inline-formula> dimensions having zero values.
        </p>
        <p>When using asBIT to compute the ESA vector, the ranking of relevant articles for a text is still based
on asR. As this ranking is used to select k articles, asBIT is not independent from asR. The objective
of using asBIT however is to flatten the differences between the associated Wikipedia articles in the ESA
vector.</p>
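        <p>The truncation to the k strongest dimensions and the bit-valued variant can be sketched together. The function name <monospace>top_k_vector</monospace>, the dictionary representation and the example scores are assumptions of this sketch; the input is taken to be the association strengths that are also used for ranking.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import heapq


def top_k_vector(scores, k, bit_valued=False):
    """Keep the k articles with the highest score; all others stay zero.

    scores: {article: association strength used for ranking}
    With bit_valued=True the surviving entries are flattened to 1.0,
    which corresponds to the bit-valued ESA vector.
    """
    positive = [(s, a) for a, s in scores.items() if s > 0.0]
    best = heapq.nlargest(k, positive)  # ties are broken by article title
    return {a: (1.0 if bit_valued else s) for s, a in best}


# Invented association strengths for four articles.
example_scores = {"A": 0.9, "B": 0.1, "C": 0.5, "D": 0.0}
weighted = top_k_vector(example_scores, 2)
flattened = top_k_vector(example_scores, 2, bit_valued=True)
```
]]></preformat>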
        <p>Gabrilovich and Markovitch weighted the association strength by exploiting the pagelink structure of
Wikipedia. It remains future work to adapt this method to our implementation.</p>
        <p><bold>Multi-lingual Mapping.</bold>
As described above, the multi-lingual mapping was done by using the cross-language links of Wikipedia.
To use these links in an efficient way, some preprocessing is necessary. First we normalized the
target page titles of all cross-language links, as this is not done automatically in the Wikipedia database.
Then we identified all cross-language links pointing to redirect pages and replaced them with language
links to the article to which the redirect was leading.</p>
        <p>In order to map the vectors from language Li to language Lj we only use the cross-language links of
Wikipedia Wi pointing to Wj . As our statistics showed that most of these links are bi-directional (95%)
we did not include the links from Wj to Wi.</p>
        <p>In some cases, two or more articles in <inline-formula><tex-math>W_i</tex-math></inline-formula> contain a cross-language link to the same article <inline-formula><tex-math>a \in W_j</tex-math></inline-formula>.
In this case, the new value of the ESA dimension corresponding to <inline-formula><tex-math>a</tex-math></inline-formula> was set to the sum of the values of all
dimensions that correspond to the source articles in the original ESA vector (see Equation 1).</p>
        <preformat>ESA-RETRIEVAL(Topics T, Language k, Documents D)
  for t ∈ T do
      t̃ := Φ_k(t)
  for d ∈ D do
      l := lang(d)
      d̃ := Ψ_{l→k}(Φ_l(d))
      for t ∈ T do
          score[t, d] := cos(t̃, d̃)</preformat>
        <p>Figure 1: Pseudo code of the multi-lingual retrieval procedure.</p>
        <p>In order to be able to compute the ESA vector for a document, the language of this document must be
known, as the computation is based on an index of a Wikipedia database in the document’s language.
Many document collections only contain documents in one language and thus no language identification
is needed. In other cases, such as in the CLEF ad hoc retrieval task, the dataset contains documents in
different languages.</p>
        <p>In our implementation we first try to determine the language by using properties of the documents such
as language annotations. If these are not available, we apply a simple heuristic to determine the language
of document <inline-formula><tex-math>t</tex-math></inline-formula> as follows:
          <disp-formula><tex-math>\mathit{lang}(t) := \operatorname*{argmax}_{L_k \in \{L_1, \ldots, L_n\}} \frac{\mathit{minDim}(\Phi_k(t))}{\mathit{maxDim}(\Phi_k(t))}</tex-math></disp-formula>
where <inline-formula><tex-math>\mathit{minDim}(\vec{v})</tex-math></inline-formula> returns the value of the lowest dimension in vector <inline-formula><tex-math>\vec{v}</tex-math></inline-formula> and <inline-formula><tex-math>\mathit{maxDim}(\vec{v})</tex-math></inline-formula> returns the
highest correspondingly. The intuition behind this heuristic is that a small difference between the values of
the lowest and highest dimension, measured by the ratio of these values, means that the document
matches many Wikipedia articles well, and it can therefore be assumed that the document is of the same
language as the Wikipedia articles used. When comparing a document to Wikipedia articles in another language,
there will be some matches, but the value of the lowest dimension will most probably be very small.</p>
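        <p>A possible reading of this heuristic is sketched below. The function names <monospace>dim_ratio</monospace> and <monospace>detect_language</monospace> and the example vectors are invented. As an assumption of the sketch, the ESA vectors are the sparse vectors restricted to the k strongest articles, so a vector with fewer than k non-zero entries has a lowest dimension of zero.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def dim_ratio(vector, k):
    """minDim / maxDim over the k strongest dimensions of an ESA vector."""
    values = list(vector.values())
    if len(values) < k or max(values) <= 0.0:
        return 0.0  # fewer than k matches: the lowest dimension is zero
    return min(values) / max(values)


def detect_language(esa_vectors, k):
    """esa_vectors: {language: ESA vector of the same text in that Wikipedia}."""
    return max(esa_vectors, key=lambda lang: dim_ratio(esa_vectors[lang], k))


# Invented ESA vectors of one English text indexed in three Wikipedias.
vectors = {
    "en": {"A": 0.9, "B": 0.6, "C": 0.5},
    "de": {"X": 0.8, "Y": 0.05, "Z": 0.01},
    "fr": {"P": 0.4},
}
language = detect_language(vectors, k=3)
```
]]></preformat>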
        <p>While we have not done an extensive evaluation of this heuristic, a check showed that the quality of
this heuristic is reasonable and sufficient for our purposes.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Retrieval</title>
        <p>The implementation of the multi-lingual retrieval task is described in Figure 1 using pseudo code. In
summary we first compute the ESA vector of all topics and then iterate over all documents in the dataset.
The described workflow reduces the number of ESA vector computations substantially.</p>
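        <p>The workflow can be sketched in Python as follows. The topic vectors are computed once and reused for every document, which is where the reduction in ESA vector computations comes from. The ESA function, the mapping function and the language detector are passed in as parameters; the toy versions used here (a one-word-to-one-article index and a hand-written link table) are invented stand-ins and not our Lucene-based components.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math


def cosine(u, v):
    """Cosine between two sparse {dimension: value} vectors."""
    dot = sum(x * v.get(dim, 0.0) for dim, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def esa_retrieval(topics, k, documents, esa, mapping, lang):
    """Score every (topic, document) pair in the article space of language k."""
    topic_vecs = {tid: esa(text, k) for tid, text in topics.items()}
    scores = {}
    for did, text in documents.items():
        l = lang(text)
        doc_vec = mapping(esa(text, l), l, k)  # one ESA computation per document
        for tid, topic_vec in topic_vecs.items():
            scores[(tid, did)] = cosine(topic_vec, doc_vec)
    return scores


# --- invented toy components, for illustration only ---
TOY_INDEX = {
    "en": {"scary": "Horror film", "chess": "Chess"},
    "de": {"grusel": "Horrorfilm", "schach": "Schach"},
}
TOY_LINKS = {("de", "en"): {"Horrorfilm": "Horror film", "Schach": "Chess"}}


def toy_esa(text, language):
    vec = {}
    for word in text.lower().split():
        article = TOY_INDEX[language].get(word)
        if article:
            vec[article] = vec.get(article, 0.0) + 1.0
    return vec


def toy_map(vec, src, dst):
    if src == dst:
        return dict(vec)
    links = TOY_LINKS[(src, dst)]
    out = {}
    for article, value in vec.items():
        if article in links:
            out[links[article]] = out.get(links[article], 0.0) + value
    return out


def toy_lang(text):
    words = text.lower().split()
    return max(TOY_INDEX, key=lambda l: sum(w in TOY_INDEX[l] for w in words))


result = esa_retrieval({"T1": "scary"}, "en",
                       {"d1": "Grusel", "d2": "Schach"},
                       toy_esa, toy_map, toy_lang)
```
]]></preformat>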
        <p>For the CLEF ad hoc retrieval task we were able to process the ONB dataset using all English, German
and French topics in about 40 hours. The same task on the BL dataset had a runtime of approximately 60
hours.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Evaluation</title>
      <p>In this section, we describe the datasets used for the evaluation. Then we present the experiments together
with the different parameter settings applied. Finally, we also analyze the results of our approach with
respect to different parameters using alternative measures such as the overlap of retrieved documents for
the same query in different languages.</p>
      <p>The first dataset we used was the TEL dataset that was provided by The European Library in the context
of the CLEF 2008 ad-hoc track. This dataset consists of library catalog records mainly in English, German
and French but also some records in other languages. In our experiments, we used two parts of this dataset:
the TEL English data provided by the British Library (BL) with mainly English records and the TEL German
data provided by the Austrian National Library (ONB) with mainly German records. All of these records consist of
content information together with meta information about the publication. The title of the record is the only
content information that is available for all records. Some records additionally contain some annotation
terms. In our experiments we only used the available content information.</p>
      <p>This dataset is challenging for IR tasks in different ways. First the text of the records is very short, only
a few words for most records. Second, the dataset consists of records in different languages and retrieval
methods need to consider relevant documents in all of these languages. The following examples show the
complete content information of some records of the TEL English dataset:</p>
      <table-wrap>
        <table>
          <thead>
            <tr><th>Title or Subject</th><th>Annotation Terms</th></tr>
          </thead>
          <tbody>
            <tr><td>Strength, fracture and complexity : an international journal.</td><td>Fracture mechanics, Strength of materials</td></tr>
            <tr><td>Studies in the anthropology of North American indians series.</td><td/></tr>
            <tr><td>Lehrbuch des Schachspiels und Einfuehrung in die Problemkunst.</td><td>Chess</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>The TEL English dataset contains 1,000,100 records, the TEL German dataset 869,353.</p>
      <p>As a second dataset we used the Multext JOC corpus<sup>4</sup>. The original data of this corpus is composed of
written questions asked by members of the European Parliament on a wide variety of topics and
corresponding answers from the European Commission in 9 parallel versions, published as one section of the
C Series of the Official Journal of the European Community of the year 1993. The parts corresponding
to the languages of the Multext project (English, French, German, Italian and Spanish) were collected and
prepared in collaboration with the MLCC project. For our experiments we used the English, German and
French parts. This dataset contains 3126 question/answer pairs in each language which are aligned across
the languages.</p>
      <sec id="sec-5-1">
        <title>CLEF Ad-hoc Experiments</title>
        <p>The CLEF ad-hoc TEL task was divided into mono-lingual and bi-lingual tasks. 50 topics in the main
languages English, German and French were provided. The topics consist of two fields, a short title containing
2-4 keywords and a description of the information item of interest in terms of 1-2 sentences.</p>
        <p>The objective is to query the selected target collection using topics in the same language (mono-lingual
run) or topics in a different language (bi-lingual run) and to submit the results in a list ranked with respect
to decreasing relevance. In line with these objectives we submitted results of six different runs to CLEF
2008. These are the results of querying English, German and French topics to the TEL English dataset and
English, German and French topics to the TEL German dataset.</p>
        <p>The following parameter settings as described in the implementation section were used for these
experiments:</p>
        <p><bold>ESA vector length.</bold> We used different lengths of the ESA vector to represent topics and records. For the
topics we used <inline-formula><tex-math>k = 10{,}000</tex-math></inline-formula>, which means that the 10,000 Wikipedia articles with the strongest association
to a specific topic were used to build the ESA vector for this topic. For the records, we used <inline-formula><tex-math>k = 1000</tex-math></inline-formula>.
The difference between the lengths is mainly due to performance issues. We were only able
to process the huge amount of records by limiting the length of the ESA vectors for records to 1000
non-zero entries. As only 50 topics were provided, we were able to use more entries for the ESA
vectors for topics. Our intention thereby was to improve recall of the retrieval by using more ESA
dimensions.</p>
        <p><bold>Article selection.</bold> In the experiments submitted to CLEF, we only used the default article
selection as described in the implementation section. One problem of this setting is the loss of many
dimensions in the mapping process, as not all of the articles corresponding to a non-zero ESA vector
entry have a corresponding cross-language link to the Wikipedia in the target language. In this case,
the information about this dimension is lost in the mapping process.</p>
        <p><sup>4</sup> http://aune.lpl.univ-aix.fr/projects/multext/</p>
        <p>The following table contains the CLEF 2008 results of our submitted experiments measured by the
Mean Average Precision (MAP) quality measure:</p>
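        <table-wrap>
          <table>
            <thead>
              <tr><th>Dataset</th><th>Topic language</th></tr>
            </thead>
            <tbody>
              <tr><td>TEL English (BL)</td><td>English</td></tr>
              <tr><td>TEL English (BL)</td><td>German</td></tr>
              <tr><td>TEL English (BL)</td><td>French</td></tr>
              <tr><td>TEL German (ONB)</td><td>English</td></tr>
              <tr><td>TEL German (ONB)</td><td>German</td></tr>
              <tr><td>TEL German (ONB)</td><td>French</td></tr>
            </tbody>
          </table>
        </table-wrap>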
        <p>Beyond the submitted runs, we conducted additional experiments on the TEL dataset
to better quantify and understand the impact of certain parameters on the result quality. As we were not
able to evaluate any results apart from the submitted ones, we decided to examine the result overlap for
queries in different languages on the same dataset. This measure can be seen as a quality measure for the
capability of retrieving relevant documents across languages. Ideally, queries in different languages should
result in the same set of retrieved records. We computed the result overlap for two different settings. First,
we used the same settings as in the submitted runs. For the second set of experiments we further
restricted the Wikipedia articles used for ESA indexing to articles with at least one language link
to one of the two other languages considered. The following table contains the result overlaps for topic
pairs in different languages on the TEL English dataset:</p>
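        <table-wrap>
          <table>
            <thead>
              <tr><th>Article restriction</th><th>Topic language pair</th><th>Average result overlap</th></tr>
            </thead>
            <tbody>
              <tr><td>No restriction</td><td>English - German</td><td>21%</td></tr>
              <tr><td>No restriction</td><td>English - French</td><td>19%</td></tr>
              <tr><td>No restriction</td><td>German - French</td><td>28%</td></tr>
              <tr><td>Articles with existing cross-language link</td><td>English - German</td><td>39%</td></tr>
              <tr><td>Articles with existing cross-language link</td><td>English - French</td><td>51%</td></tr>
              <tr><td>Articles with existing cross-language link</td><td>German - French</td><td>39%</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>The results show that restricting the Wikipedia articles substantially increases the result overlap
for every language pair. Our assumption is that the results on the retrieval
task would also improve, but we did not manage to submit an additional run in time for CLEF.</p>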
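        <p>As an illustration of the result overlap measure, the following sketch computes the average overlap between two runs. The exact definition is an assumption made for this example: the overlap of a topic is taken as the share of record ids that the two result lists have in common, and the per-topic values are averaged. The function names and record ids are hypothetical.</p>
        <preformat>
```python
from typing import Dict, List


def result_overlap(results_a: List[str], results_b: List[str]) -> float:
    """Share of common records, normalized by the longer of the two lists."""
    if not results_a or not results_b:
        return 0.0
    common = set(results_a).intersection(results_b)
    return len(common) / max(len(results_a), len(results_b))


def average_overlap(runs_a: Dict[str, List[str]], runs_b: Dict[str, List[str]]) -> float:
    """Average the per-topic overlap over all topics present in both runs."""
    topics = sorted(set(runs_a).intersection(runs_b))
    if not topics:
        return 0.0
    return sum(result_overlap(runs_a[t], runs_b[t]) for t in topics) / len(topics)


# Hypothetical record ids retrieved for two topics in English and German.
english = {"t1": ["r1", "r2", "r3", "r4"], "t2": ["r5", "r6", "r7", "r8"]}
german = {"t1": ["r2", "r1", "r9", "r4"], "t2": ["r5", "r0", "r3", "r2"]}

print(average_overlap(english, german))  # 0.5
```
        </preformat>
        <p>Note that this measure ignores the ranking within the result lists; it only counts how many records are shared.</p>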
      </sec>
      <sec id="sec-5-2">
        <title>Mate Retrieval on Multext JOC Corpus</title>
        <p>As described above, the part of the Multext JOC Corpus we used consists of 3126 question/answer pairs
in English, German and French. All of these documents are aligned across languages in the sense that for
every document there exists a corresponding document in each of the other languages. This dataset can therefore be used
for mate retrieval experiments, which allow a direct assessment of different parameters. Mate retrieval is
the task of using a document as a query with the objective of identifying its translated counterpart in a set of
documents in another language. In this case the counterpart is known in advance, which enables an automatic
evaluation of the mate retrieval results.</p>
        <p>The main goal of our mate retrieval experiments was to optimize the parameter settings for CL-ESA.
We ran the experiments for various parameter settings:</p>
        <p>ESA vector length: We used different values of k for the maximal number of non-zero dimensions of the ESA
vector, namely k ∈ {1,000, 10,000, 100,000}.</p>
        <p>Article selection: We only used articles with existing cross-language links for the ESA vector computation,
as described in the implementation section.</p>
        <p>Text selection: We used different text parts of the question/answer pairs in our experiments, namely
the subject, the question, and all text consisting of subject, question and response. We always compared identical
parts of queries and documents, e.g. if we used the subject as query, we only matched it against the subjects
of the documents in the retrieval process.</p>
        <p>Real vs. bit vectors: In the experiments we examined the effect of using real-valued ESA vectors versus
bit-valued ESA vectors.</p>
        <p>As evaluation measures we used TOP-1 and TOP-10 precision, that is, the share of input documents for
which the mate was retrieved at position 1 or among the 10 best-ranked results. The results for the different
text selections, ESA vector models and ESA vector lengths, obtained using German queries on English
documents, are presented in the following table:</p>
        <table-wrap>
          <table>
            <thead>
              <tr><th>Text</th><th>Vector model</th></tr>
            </thead>
            <tbody>
              <tr><td>Subject</td><td>real values</td></tr>
              <tr><td>Subject</td><td>bit values</td></tr>
              <tr><td>Question</td><td>real values</td></tr>
              <tr><td>Question</td><td>bit values</td></tr>
              <tr><td>All text</td><td>real values</td></tr>
              <tr><td>All text</td><td>bit values</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>The results show that using bit-valued ESA vectors leads to a substantial loss in performance on the mate
retrieval task, independently of the text parts that were used. It therefore seems important to use the
relevance of articles to the queries that is encoded in the real values of the ESA vector representation of
queries.</p>
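        <p>The evaluation procedure and the difference between the two vector models can be sketched as follows. This is a toy example under stated assumptions, not our experimental code: cosine similarity is assumed as the ranking function, queries are assumed to be already mapped into the document language, a query and its mate share the same id, and all vectors and names are hypothetical.</p>
        <preformat>
```python
import math
from typing import Dict

EsaVector = Dict[str, float]


def binarize(vector: EsaVector) -> EsaVector:
    """Replace every non-zero weight by 1.0 (bit-valued vector)."""
    return {dim: 1.0 for dim, weight in vector.items() if weight != 0.0}


def cosine(a: EsaVector, b: EsaVector) -> float:
    """Cosine similarity of two sparse vectors."""
    dot = sum(weight * b.get(dim, 0.0) for dim, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_precision(queries: Dict[str, EsaVector],
                    documents: Dict[str, EsaVector], k: int) -> float:
    """Share of queries whose mate (same id) is among the k best-ranked documents."""
    hits = 0
    for query_id, query in queries.items():
        ranked = sorted(documents, key=lambda d: cosine(query, documents[d]), reverse=True)
        if query_id in ranked[:k]:
            hits += 1
    return hits / len(queries)


# Hypothetical mapped query vectors and document vectors.
queries = {"d1": {"Dog": 0.9, "Cat": 0.1}, "d2": {"Dog": 0.1, "Cat": 0.9}}
documents = {"d1": {"Dog": 0.8, "Cat": 0.2},
             "d2": {"Cat": 0.8, "Bird": 0.3, "Dog": 0.2}}

bit_queries = {qid: binarize(v) for qid, v in queries.items()}
bit_documents = {did: binarize(v) for did, v in documents.items()}

print(top_k_precision(queries, documents, k=1))          # 1.0
print(top_k_precision(bit_queries, bit_documents, k=1))  # 0.5
```
        </preformat>
        <p>In this toy example both queries activate the same dimensions and differ only in their weights, so the bit-valued vectors can no longer tell them apart and the mate of the second query is missed at position 1.</p>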
        <p>Regarding the number of dimensions of the ESA vector, 10,000 seems to be a good
value for this parameter. Using more dimensions does not yield better precision. For queries consisting of
the question part of the documents or of all of the text, the results are even worse.</p>
        <p>When comparing the results for different text parts used as queries, the differences are not significant.
As subjects only consist of a few words while the whole documents contain several sentences, this is an
unexpected result. It seems that the method works well for short queries, but that longer queries also add more
noise, so that the retrieval performance does not improve much.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Related Work</title>
      <p>
        The first approaches to Cross-lingual Information Retrieval (CLIR) were based on the translation of the
query into the language of the target documents. Hull and Grefenstette presented a system that uses the
term vector translation model [10]. All terms of the query are translated by looking them up in a bilingual
dictionary. A problem of this approach is that many terms have multiple translations which are all added to
the translated query. This leads to a loss of precision in the retrieval process. Demner-Fushman and Oard
studied the effect of the size of the bilingual term list in dictionary based CLIR [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. One of their results
is that term lists with more than 30,000 entries optimize the coverage of general vocabulary in their
experiments. Additionally, they showed that the translation of named entities is very important and substantially
influences the retrieval quality. They therefore suggest that supplemental techniques for named entity
translation are useful even with large lexicons.
      </p>
      <p>
        Another approach to CLIR is based on Latent Semantic Indexing (LSI). LSI applied to text
documents is a technique to reduce the dimensionality of the vector representation [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Based on a training corpus, Principal
Component Analysis (PCA) on the co-occurrence matrix of words can be used to identify relevant dimensions
and to construct a mapping of the original Bag-of-Words vector space to these new dimensions. For CLIR,
LSI can be applied by using a parallel corpus with documents in two languages for training. Parallel
documents are merged so that co-occurrences are computed across languages. The learned model can then
be used for CLIR [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] [17]. If a training corpus in multiple languages is available, containing versions of all
documents in all languages, LSI can also be used for CLIR in many languages [12].
      </p>
      <p>Recently emerging approaches to CLIR use the Wikipedia database as background knowledge.
Schoenhofen et al. [15] presented a system that translates queries based on a small dictionary and cross-language
links in Wikipedia. Afterwards the terms of the translated query are mapped to Wikipedia articles.
Different features of these articles are then used to filter the query terms that are used for retrieval. This approach
differs from ours in that it uses cross-language links to translate single query terms. In
our approach these links are used to define a mapping between high-dimensional vector spaces, which is used to
map the ESA vector representation of the whole query.</p>
      <p>
        Egozi et al. presented a system for monolingual IR using Wikipedia as background knowledge [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
This work is highly relevant for this paper as they apply Explicit Semantic Analysis [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] to IR. Additionally
they propose a method to improve the ESA mapping in regards to IR tasks based on Pseudo Relevance
Feedback (PSF). This is done first performing standard Bag-of-Words retrieval with a query and then using
these results to select relevant dimensions of the ESA vector representation of the same query. A future
challenge will be to apply these techniques as well to multi-lingual IR based on the cross-lingual ESA
approach we presented in this paper.
      </p>
      <p>Another approach to using PRF in multi-lingual retrieval is described by Qu et al. [14]. They examined
the effects of pre-translation feedback versus post-translation feedback and identified different errors that
were induced through the query expansion.</p>
      <p>After developing our approach and submitting this paper, our literature search uncovered the paper by
Potthast et al. [13], who developed and presented the CL-ESA model independently of and prior to our work. In their
paper, they perform extensive evaluations on two datasets: Wikipedia and the JRC Acquis dataset<sup>5</sup>. We
also intend to use this dataset in future experimental evaluations. The approaches also differ in the way
the association between a text and a Wikipedia article is computed. While Potthast et al. use the cosine
similarity between a document and a Wikipedia article as weight, we have simply used the tf.idf values for
this purpose.</p>
    </sec>
    <sec id="sec-7">
      <title>Conclusion</title>
      <p>
        In this paper, we have presented our CL-ESA approach and the corresponding implementation with which
we have participated in this year’s CLEF campaign on the monolingual and bilingual Ad-Hoc retrieval
tasks. In particular, we have presented a cross-lingual extension to the Explicit Semantic Analysis (ESA)
approach of Gabrilovich and Markovitch. While the results are far from satisfactory, we think that there is
still a lot of potential to improve the approach in future research. Questions that seem very important to
us are to what extent the measure used to calculate the association strength between a word (or text) and a
Wikipedia article, as well as the selection of Wikipedia articles, influence the overall results. The
experiments presented in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] show that “less is more” in the sense that considering a small number of
articles can be enough, provided that they are selected appropriately. In immediate future work, we plan to
compare our method with LSI-based cross-lingual retrieval methods to learn more about the
performance of our approach and to better quantify the weaknesses of the current implementation.
      </p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>This work was funded by the Multipla project sponsored by the German Research Foundation (DFG) under
grant number 38457858.</p>
      <p><sup>5</sup> http://langtech.jrc.it/JRC-Acquis.html</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Baeza-Yates</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Ribeiro-Neto</surname>
          </string-name>
          .
          <article-title>Modern Information Retrieval</article-title>
          . Addison-Wesley,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.W.</given-names>
            <surname>Berry</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.G.</given-names>
            <surname>Young</surname>
          </string-name>
          .
          <article-title>Using latent semantic indexing for multilanguage information retrieval</article-title>
          .
          <source>Computers and the Humanities</source>
          ,
          <volume>29</volume>
          (
          <issue>6</issue>
          ):
          <fpage>413</fpage>
          -
          <lpage>429</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Scott</given-names>
            <surname>Deerwester</surname>
          </string-name>
          , Susan T. Dumais, George W. Furnas,
          <string-name>
            <given-names>Thomas K.</given-names>
            <surname>Landauer</surname>
          </string-name>
          , and Richard Harshman.
          <article-title>Indexing by latent semantic analysis</article-title>
          .
          <source>Journal of the American Society for Information Science</source>
          ,
          <volume>41</volume>
          (
          <issue>6</issue>
          ):
          <fpage>391</fpage>
          -
          <lpage>407</lpage>
          ,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Scott C.</given-names>
            <surname>Deerwester</surname>
          </string-name>
          , Susan T. Dumais, Thomas K. Landauer, George W. Furnas, and Richard A. Harshman.
          <article-title>Indexing by latent semantic analysis</article-title>
          .
          <source>Journal of the American Society of Information Science</source>
          ,
          <volume>41</volume>
          (
          <issue>6</issue>
          ):
          <fpage>391</fpage>
          -
          <lpage>407</lpage>
          ,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.</given-names>
            <surname>Demner-Fushman</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.W.</given-names>
            <surname>Oard</surname>
          </string-name>
          .
          <article-title>The effect of bilingual term list size on dictionary-based cross-language information retrieval</article-title>
          .
          <source>In Proceedings of the 36th Annual Hawaii International Conference on System Sciences (HICSS'03)</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Ofer</given-names>
            <surname>Egozi</surname>
          </string-name>
          , Evgeniy Gabrilovich, and
          <string-name>
            <given-names>S.</given-names>
            <surname>Markovitch</surname>
          </string-name>
          .
          <article-title>Concept-based feature generation and selection for information retrieval</article-title>
          .
          <source>In Proceedings of the Twenty-Third Conference on Artificial Intelligence (AAAI)</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>E.</given-names>
            <surname>Gabrilovich</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Markovitch</surname>
          </string-name>
          .
          <article-title>Computing semantic relatedness using Wikipedia-based explicit semantic analysis</article-title>
          .
          <source>In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI)</source>
          , pages
          <fpage>1606</fpage>
          -
          <lpage>1611</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Evgeniy</given-names>
            <surname>Gabrilovich</surname>
          </string-name>
          .
          <article-title>Feature Generation for Textual Information Retrieval using World Knowledge</article-title>
          .
          <source>PhD thesis</source>
          , Israel Institute of Technology, Kislev, 5767 Haifa, Israel,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>