 Evaluating Cross-Language Explicit Semantic
Analysis and Cross Querying at TEL@CLEF 2009
                      Maik Anderka         Nedim Lipka         Benno Stein
                               Faculty of Media, Media Systems
                                 Bauhaus University Weimar
                                   99421 Weimar, Germany
                          .@uni-weimar.de


                                             Abstract
     This paper describes our participation in the TEL@CLEF task of the CLEF 2009 ad-
     hoc track. The task is to retrieve items that are relevant to a user’s query from various
     multilingual collections of library catalog records. Two different strategies are employed:
     (i) Cross-Language Explicit Semantic Analysis, CL-ESA, where the library catalog records
     and the queries are represented in a multilingual concept space that is spanned by aligned
     Wikipedia articles, and (ii) a Cross Querying approach, where a query is translated into
     all target languages using Google Translate and the obtained rankings are combined. The
     evaluation shows that both strategies outperform the monolingual baseline and achieve
     comparable results.
         Furthermore, inspired by the Generalized Vector Space Model, we present a formal
     definition and an alternative interpretation of the CL-ESA model. This interpretation
     is interesting for real-world retrieval applications since it reveals how the computational
     effort for CL-ESA can be shifted from the query phase to a preprocessing phase.

Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Infor-
mation Search and Retrieval; H.3.4 Systems and Software; H.3.7 Digital Libraries

General Terms
Measurement, Performance, Experimentation

Keywords
Cross-Language Information Retrieval, Cross-Language Explicit Semantic Analysis, Wikipedia,
Cross Querying


1    Introduction
Cross-language information retrieval, CLIR, is the task of retrieving documents from a target
collection written in a language different from the language of a user’s query. CLIR systems allow
multilingual users to express queries in any language, e.g., their native language, and to obtain
result documents in all languages they are familiar with. Since CLIR is not restricted to collections
in the query language, more sources can be included in the retrieval process, and the chance of
fulfilling a particular information need of a multilingual user is higher. Another use
case for CLIR techniques is cross-language plagiarism detection, where the query corresponds to
a suspicious document and the target collection is a reference corpus with original documents [3].
    The Cross-Language Evaluation Forum, CLEF, provides an infrastructure for the evaluation
of information retrieval systems, both monolingual and cross-lingual. We participated in the
TEL@CLEF task of the CLEF 2009 ad-hoc track, which aims at evaluating systems that retrieve
relevant items from multilingual collections of library catalog records. The main challenges
of this task are the multilinguality and the sparsity of the dataset. We used two different CLIR
approaches to tackle this task; the paper at hand outlines and discusses these approaches and the
results achieved.
    The first approach is Cross-Language Explicit Semantic Analysis, CL-ESA, a multilingual
retrieval model that assesses the cross-language similarity between text documents [3]. The CL-ESA
model exploits a document-aligned comparable corpus such as Wikipedia in order to map the
query and the documents into a common multilingual concept space [3, 4]. We also present a
formal definition and an alternative interpretation for the CL-ESA model, which is inspired by the
Generalized Vector Space Model, GVSM. Our view is mathematically equivalent to the original
idea of the CL-ESA model; it reveals how the computational effort for CL-ESA can be shifted
from the query phase to a preprocessing phase.
    In the second approach, called Cross Querying, each query is translated into all target lan-
guages. The individual rankings are combined, taking into account the most likely language of
each document. The evaluation on the TEL@CLEF collections shows that both CLIR
approaches are able to outperform the monolingual baseline. In the bilingual subtask, querying
with a foreign language, Cross Querying achieves nearly the same or even higher results compared
to the monolingual subtask; the performance of the CL-ESA is lower compared to the monolingual
results.
    The paper is organized as follows. Section 2 describes the target collection used in the
TEL@CLEF task along with the evaluation procedure. Section 3 defines the general CL-ESA
model, our formalization, and details of the CL-ESA implementation employed in the experi-
ments. Section 4 presents the Cross Querying approach, Section 5 discusses the evaluation, and
Section 6 concludes with an outlook.


2    TEL@CLEF Dataset and Evaluation Procedure
In this year’s TEL@CLEF task, three target collections, provided by The European Library,1
TEL, are used. The collections are labeled BL, ONB, and BNF, and mainly contain information
in English, German, and French, respectively (see Table 1). The collections comprise library
catalog records referring to different types of items such as articles, books, or videos. The data is
provided in structured form and represented in XML. Each library catalog record has several fields
containing meta information and content information that describe the particular item. Typical
meta information fields are author, rights, or publisher, and typical content information fields
are title, description, subject, or alternative. In our experiments we focus on the content
information fields. A major difficulty is the sparsity of the available information: for many records
only a few fields are given.
    The user’s information need is specified by 50 topics that are provided by CLEF in the three
main languages of the target collections, namely English, German, and French. A topic consists of
two fields: a title, containing 2-4 keywords, and a description, containing 1-2 sentences that
specify the item of interest in greater detail. The topics are used to construct the queries.
    The TEL@CLEF task is divided into a monolingual and a bilingual subtask. The aim in both
subtasks is to retrieve documents (library catalog records) from the target collections, which are
most relevant to a query; for each query the results are submitted as a ranked list of documents.
In the monolingual subtask the language of the query and the main language of the collection
are the same, while in the bilingual subtask the language of the query is different from the main
language of the collection. We submitted runs for both subtasks and for all three languages.
    1 The European Library: http://www.theeuropeanlibrary.org/.
    Table 1: Statistics of the three target collections BL, ONB, and BNF used in the TEL@CLEF task.

                                                    BL          ONB          BNF
        main language                            English       German       French
        # documents                            1 000 100      869 353    1 000 100
        # documents with title                 1 000 042      829 675    1 000 095
        average length of title                    8.033        5.500       17.124
        # documents with description             518 493            0    1 000 100
        average length of description              6.222            0       10.095
        # documents with subject                 671 544      602 580      368 788
        average length of subject                  7.032        8.373       10.833
        # documents with alternative              78 679      404 415            0
        average length of alternative              5.491        8.158            0
        # documents without content information       20       37 564            0


3     Cross-Language Explicit Semantic Analysis
Cross-Language Explicit Semantic Analysis, CL-ESA, is a generalization of the Explicit Semantic
Analysis, ESA [2], and was proposed by Potthast et al. [3]. This section presents a formal definition
of the CL-ESA model that reveals its close connection to the Generalized Vector Space Model,
GVSM [5]: the ESA model and the GVSM can be transformed into each other [1]. It follows
immediately that this is also true for the CL-ESA model and the cross-lingual extension of the
Generalized Vector Space Model, CL-GVSM [6].

3.1    Formal Definition
Let d_i be a real-world document written in language L_i, and let d_i be a bag-of-words-based
representation of d_i, encoded as a vector of normalized term frequency weights over a universal
term vocabulary V_i, which contains all terms used in language L_i. A set D_i of document
representations defines a term-document matrix A_{D_i}, where each column of A_{D_i} corresponds
to a vector d_i ∈ D_i.

Definition 1 (ESA Representation [1]) Let D_i^* be a collection of index documents written in
language L_i. The ESA representation d_i^{ESA} of a document d_i with representation d_i is defined
as follows:

    d_i^{ESA} = A^T_{D_i^*} · d_i,                                (1)

where A^T designates the matrix transpose of A.
   The rationale of this definition becomes clear if one considers that the weight vectors d_i^* ∈ D_i^*
and d_i are normalized: ||d_i^*|| = ||d_i|| = 1 for each d_i^* ∈ D_i^*. Hence, each entry in the ESA
representation d_i^{ESA} of a document d_i is the cosine similarity between d_i and some vector d_i^* ∈ D_i^*.
Put another way, d_i is compared to each index document in D_i^*, and d_i^{ESA} is comprised of the
respective cosine similarities.
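
To make Definition 1 concrete, consider the following minimal sketch in Python with NumPy;
the toy matrix, the vectors, and all variable names are illustrative and not part of the original
implementation:

    import numpy as np

    # Toy term-document matrix A_{D_i^*}: rows are terms of V_i, columns are
    # index documents; each column is a unit-length weight vector.
    A = np.array([[0.8, 0.0, 0.6],
                  [0.6, 1.0, 0.0],
                  [0.0, 0.0, 0.8]])
    A = A / np.linalg.norm(A, axis=0)   # enforce ||d_i^*|| = 1 per column

    d = np.array([0.6, 0.8, 0.0])       # term weight vector of a document d_i
    d = d / np.linalg.norm(d)           # enforce ||d_i|| = 1

    # ESA representation (Equation 1): one cosine similarity per index document.
    d_esa = A.T @ d
    print(d_esa)                        # [0.96 0.8  0.36]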

Definition 2 (CL-ESA Similarity) Let L = {L_1, ..., L_k} denote a set of natural languages,
and let D^* = {D_1^*, ..., D_k^*} be a set of index collections where each D_i^* ∈ D^* is a list of index
documents written in language L_i ∈ L. D^* is a document-aligned comparable corpus, i.e., for each
language L_i ∈ L the n-th index document in D_i^* ∈ D^* describes the same concept. The CL-ESA
similarity ϕ_{CL-ESA}(q_j, d_i) between a query q_j in language L_j and a document d_i in language L_i
is computed as the cosine similarity ϕ of the ESA representations of q_j and d_i:

    ϕ_{CL-ESA}(q_j, d_i) = ϕ(q_j^{ESA}, d_i^{ESA}) = ϕ(A^T_{D_j^*} · q_j, A^T_{D_i^*} · d_i)                (2)
                      Table 2: The different interpretations of the CL-ESA model.

                              Original interpretation                    Alternative interpretation
                                                                         View (i)                  View (ii)

    ϕ_{CL-ESA}(q_j, d_i) =   ϕ(A^T_{D_j^*} · q_j, A^T_{D_i^*} · d_i)    (q_j^T · G_{j,i}) · d_i   q_j^T · (G_{j,i} · d_i)
    Runtime complexity       O(l · |D^*| + |D^*|)                        O(l · |V_i| + l)          O(l)


    Due to the alignment of the index collections D_j^* and D_i^*, the ESA representations of q_j
and d_i are comparable. Definition 2 is equivalent to the definition of the CL-GVSM similar-
ity ϕ_{CL-GVSM}(q_j, d_i) given in [6], which means that, in analogy to [1], the CL-ESA model and
the CL-GVSM can be directly transformed into each other:

    ϕ_{CL-ESA}(q_j, d_i) = ϕ(A^T_{D_j^*} · q_j, A^T_{D_i^*} · d_i) = ϕ_{CL-GVSM}(q_j, d_i)                (3)
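
A direct implementation of Equation 2 might look as follows; this is a sketch under the assumption
that A_j and A_i hold the term-document matrices of the aligned index collections D_j^* and D_i^*
(all names are illustrative):

    import numpy as np

    def cosine(x, y):
        n = np.linalg.norm(x) * np.linalg.norm(y)
        return float(x @ y / n) if n > 0 else 0.0

    def cl_esa_similarity(A_j, A_i, q_j, d_i):
        # Map query and document into the common concept space spanned by
        # the aligned index collections, then compare by cosine similarity.
        return cosine(A_j.T @ q_j, A_i.T @ d_i)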

3.2    Alternative Interpretation
The original idea of the CL-ESA model is to map both query and documents into a multilingual
concept space, as it is expressed in Equation 2. Note that Equation 2 can be rearranged as follows:

    ϕ_{CL-ESA}(q_j, d_i) = ϕ(A^T_{D_j^*} · q_j, A^T_{D_i^*} · d_i) = q_j^T · A_{D_j^*} · A^T_{D_i^*} · d_i                (4)


   In particular, the matrix A_{D_j^*} · A^T_{D_i^*} = G_{j,i} can be computed in advance since it is independent
of a particular q_j or d_i. Hence:

    ϕ_{CL-ESA}(q_j, d_i) = q_j^T · G_{j,i} · d_i                (5)
    The rationale of Equation 5 becomes apparent if one recognizes G_{j,i} = A_{D_j^*} · A^T_{D_i^*} as a |V_j| × |V_i|
term co-occurrence matrix. The n-th row of A_{D_j^*} corresponds to the distribution of the n-th
term t_n ∈ V_j over the index documents in D_j^*; likewise, the m-th row of A_{D_i^*} corresponds to the
distribution of the m-th term t_m ∈ V_i over the index documents in D_i^*. Recall that the index
documents in D_j^* and D_i^* are aligned. That is, the value in the n-th row and the m-th column of G_{j,i}
quantifies the similarity between the distributions of t_n and t_m given the concepts described by the
index documents in D_j^* and D_i^*.
   The CL-ESA similarity computation of Equation 5 can be viewed in two ways:
 (i) as a translation of the query representation q_j into the space of the document representa-
     tion d_i: ϕ_{CL-ESA}(q_j, d_i) = (q_j^T · G_{j,i}) · d_i, or,
(ii) as a translation of the document representation d_i into the space of the query representa-
     tion q_j: ϕ_{CL-ESA}(q_j, d_i) = q_j^T · (G_{j,i} · d_i).
    These views are different from the original idea of the CL-ESA model, where both the query
representation and the document representation are mapped into a common multilingual concept
space (see Equation 2). From a mathematical standpoint Equation 2 and Equation 5 are equiva-
lent; however, implementing CL-ESA based on the alternative interpretation yields a considerable
runtime improvement in practical retrieval applications. Table 2 contrasts the interpretations and
the related runtime complexities. Here, we assume a closed retrieval situation where, given a
target collection D_i in language L_i, the documents most similar to a query q_j in language L_j are
desired. CLIR with CL-ESA is then straightforward: compute ϕ_{CL-ESA}(q_j, d_i) for each d_i ∈ D_i
and rank by decreasing CL-ESA similarity.
    Under the original interpretation the ESA representations d_i^{ESA} of the documents d_i ∈ D_i
can be computed in advance. At retrieval time the query is mapped into the concept space in
O(l · |D^*|), where l denotes the number of query terms; the computation of the cosine similarity
between the ESA representations q_j^{ESA} and d_i^{ESA} requires O(|D^*|). Under the alternative
interpretation the matrix G_{j,i} can be computed in advance. Note that in practical applications
l ≪ |D^*|, since a reasonable index collection size |D^*| is 10 000, which shows the substantial
performance improvement under the alternative interpretation, in particular under View (ii).
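
The following sketch illustrates how the effort shifts between the phases under View (ii); it uses
dense NumPy arrays for brevity, whereas a practical implementation would use sparse matrices,
and all names are illustrative:

    import numpy as np

    def preprocess(A_j, A_i, docs_i):
        # Preprocessing phase, independent of any query: compute the
        # |V_j| x |V_i| term co-occurrence matrix G_{j,i} (Equation 5) and,
        # following View (ii), translate every document representation
        # into the term space of the query language.
        G = A_j @ A_i.T
        return [G @ d for d in docs_i]

    def score(q_j, translated_docs):
        # Query phase: with l non-zero query terms, each score touches
        # only l entries of a precomputed vector -- O(l) per document.
        nz = np.nonzero(q_j)[0]
        return [float(q_j[nz] @ t[nz]) for t in translated_docs]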

3.3    Usage in TEL@CLEF
In this subsection we describe implementation details of the CL-ESA model used in our sub-
mission. The best parameter setting was determined through unofficial experiments on the
TEL@CLEF 2008 dataset.
Query and Document Construction. We use the original words of both topic fields, title and
description, as queries. The documents are constructed by merging the text of the three record
fields title, subject, and alternative. We assume that the language of these fields is the same
within one record; however, this assumption may be violated in some cases since the collections
contain multilingual records. Records without these fields are omitted from the experiments (see
Table 1).
Index Collection. As index collection, Wikipedia is employed. We restrict the multilinguality of our
model to the three main languages of the target collections: English, German, and French. Based
on a Wikipedia snapshot from March 2009, about 169 000 articles per language can be aligned and
fulfill several filter criteria, e.g., containing more than 100 words and not being a disambiguation
or redirect page. All of these articles are used as index documents.
    As term weighting scheme, tf · idf is used. Query and document words are stemmed using the
Snowball stemmers. To speed up the CL-ESA similarity computation, all values below a threshold
of ε = 0.025 are discarded.
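
Such a pruning step might be realized as in the following sketch; the function name is ours, not
from the original implementation:

    import numpy as np

    def prune(d_esa, eps=0.025):
        # Discard all concept weights below the threshold; the resulting
        # sparse vectors make the similarity computations considerably faster.
        pruned = d_esa.copy()
        pruned[pruned < eps] = 0.0
        return pruned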
Language Detection. While the language of the queries is determined by the corresponding topics,
the language of the documents is unknown, since the collections are multilingual and no language
meta information is provided. In the experiments we resort to a simple “detection by stop words”
approach for the three main languages; if the detection fails, the main language of the collection is
assumed.
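
A detection by stop words can be as simple as counting stop-word hits per language, as in the
following sketch; the word lists shown are illustrative samples, not the ones used in our experiments:

    STOP_WORDS = {
        "en": {"the", "and", "of", "to", "is", "in"},
        "de": {"der", "die", "das", "und", "ist", "nicht"},
        "fr": {"le", "la", "les", "et", "est", "dans"},
    }

    def detect_language(text, default):
        # Count stop-word hits per language; if no list matches,
        # fall back to the main language of the collection.
        tokens = set(text.lower().split())
        hits = {lang: len(tokens & words) for lang, words in STOP_WORDS.items()}
        best = max(hits, key=hits.get)
        return best if hits[best] > 0 else default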


4     Cross Querying
Cross Querying is a straightforward approach for CLIR systems. We subsume the fields of a topic
in one query, which is translated into the other languages. With each of the translations we compute
a set of rankings by retrieving against each document field. The rankings are merged with respect
to their cosine similarities; additionally, the scores obtained with the query in the document’s
language are multiplied by a boosting constant.

Definition 3 (Cross Querying) Let L = {L_1, ..., L_k} denote a set of natural languages and
let F = {F_1, ..., F_k} denote a set of document fields. lang : D → L, d ↦ L_i, estimates the
language of a document d. d, q, and q_{L_i} are the representations of a document d, a query q, and
the translation of q into language L_i. Then the cross querying similarity ϕ_{CQ}(q, d) of a query q
and a document d is defined as follows:

    ϕ_{CQ}(q, d) = Σ_{F_i ∈ F} ( b · ϕ(q_{lang(d)}, d_{F_i}) + Σ_{L_i ∈ L, L_i ≠ lang(d)} ϕ(q_{L_i}, d_{F_i}) ),                (6)

where ϕ is the cosine similarity and b the boosting constant.
    The name “Cross Querying” reflects the fact that |L| × |F | rankings are merged by querying
in each language in each field. The applied parameters are as follows:
Query and Document Construction. The words of both topic fields, title and description, are
used as queries and translated into each L_i ∈ L, with L = {German, French, English}. The
selected document fields are title and subject.
   As term weighting scheme, tf · idf is used. Query and document words are stemmed using
the Snowball stemmers, and stop words are removed. The queries are translated with Google
Translate; the boosting constant b is based on the unofficial evaluation on the TEL@CLEF 2008
dataset.
Language Detection. In order to estimate the language of d with lang(d), we take the corpus
language of the associated evaluation run.
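
A sketch of Equation 6 is given below; it assumes that every translated query and every document
field has already been mapped to a term weight vector, and both the names and the value of the
boosting constant are illustrative:

    import numpy as np

    def cosine(x, y):
        n = np.linalg.norm(x) * np.linalg.norm(y)
        return float(x @ y / n) if n > 0 else 0.0

    def cross_querying_similarity(queries, doc_fields, doc_lang, b=2.0):
        # queries:    language -> (translated) query vector
        # doc_fields: field name (e.g. "title", "subject") -> field vector
        # doc_lang:   estimated language of the document, lang(d)
        score = 0.0
        for d_f in doc_fields.values():
            score += b * cosine(queries[doc_lang], d_f)   # boosted term
            score += sum(cosine(q, d_f)
                         for lang, q in queries.items() if lang != doc_lang)
        return score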


5    Evaluation Results

The results of the monolingual subtask and the bilingual subtask are shown in Figure 1 and
Figure 2, respectively.

[Figure 1 comprises three precision-recall plots, “Monolingual English”, “Monolingual German”,
and “Monolingual French”, each showing interpolated precision in % over the standard recall
levels in % for the runs Baseline, Cross Querying, CL-ESA, and CL-ESA-LD, together with the
following MAP table.]

                         English    German    French
    Baseline              0.158      0.100     0.110
    Cross Querying        0.200      0.164     0.145
    CL-ESA                0.215      0.137     0.142
    CL-ESA-LD             0.195      0.134     0.163

Figure 1: Evaluation results of the monolingual runs. The plots show the standard recall levels vs.
interpolated precision. The table shows the results in terms of mean average precision, MAP.

    We submitted an additional baseline to the monolingual subtask using state-of-the-art retrieval
technology: since in this subtask the language of the topics is equal to the main language of the
target collection, the ranking is based on the cosine similarities of the tf ·idf -weighted bag-of-words
representations of the topics and the documents.
    Each plot in Figure 1 corresponds to one target collection and shows the baseline along with the
results achieved under Cross Querying, CL-ESA, and CL-ESA with automatic language detection,
CL-ESA-LD. Both Cross Querying and CL-ESA clearly outperform the baseline. The variation
between the two approaches is small, except for the German collection, where Cross Querying
outperforms CL-ESA at low recall levels. At higher recall levels CL-ESA is better, which explains
a slightly higher mean average precision on the English and the French collections. Using CL-
ESA along with the automatic language detection improves the performance only for the French
collection, which indicates that this collection contains a larger fraction of non-French documents.
    In the bilingual subtask the language of the queries is different from the main language of
the target collection. Each plot in Figure 2 corresponds to one target collection that is queried
in the two other languages, using both Cross Querying and CL-ESA.
[Figure 2 comprises three precision-recall plots, “Bilingual English”, “Bilingual German”, and
“Bilingual French”, each showing interpolated precision in % over the standard recall levels in %
for the Cross Querying and CL-ESA runs with the two respective query languages, together with
the following MAP table; columns denote the target collection.]

                             English    German    French
    Cross Querying-en           -        0.129     0.132
    Cross Querying-de         0.215        -       0.087
    Cross Querying-fr         0.225      0.158       -
    CL-ESA-en                   -        0.124     0.145
    CL-ESA-de                 0.144        -       0.104
    CL-ESA-fr                 0.139      0.108       -

Figure 2: Evaluation results of the bilingual runs. The plots show the standard recall levels vs.
interpolated precision. The table shows the results in terms of mean average precision, MAP.


    For example, in the plot “Bilingual English” the graph for “CL-ESA-de” shows the results of
querying the English collection with German topics using the CL-ESA. Cross Querying achieves
nearly the same or even higher results compared to the monolingual situation, whereas the
performance of the CL-ESA is lower than in the monolingual runs.


6    Conclusion and Future Work
The evaluation results for the TEL@CLEF task show that both CLIR approaches, CL-ESA and
Cross Querying, are able to outperform the monolingual baseline, though the absolute results
still leave room for improvement. Furthermore, we have presented a formal definition and an
alternative interpretation of the CL-ESA model, which is interesting for real-world retrieval
applications since it reveals how the computational effort for CL-ESA can be shifted from the
query phase to a preprocessing phase.
     As for future work, CL-ESA and Cross Querying will benefit if more languages are taken into
account. Currently German, English, and French are used, but the target collections comprise
more languages; for documents in other languages an inconsistent CL-ESA representation is
computed. CL-ESA therefore needs a reliable language detection mechanism in order to compute
consistent representations; note that we used a rather simple approach in our experiments.


References
[1] Maik Anderka and Benno Stein. The ESA Retrieval Model Revisited. In Mark Sanderson,
    James Allan, ChengXiang Zhai, Justin Zobel, and Javed A. Aslam, editors, 32nd Annual
    International ACM SIGIR Conference, pages 670–671. ACM, July 2009.
[2] Evgeniy Gabrilovich and Shaul Markovitch. Computing Semantic Relatedness using
    Wikipedia-based Explicit Semantic Analysis. In Proceedings of the 20th International Joint
    Conference on Artificial Intelligence, Hyderabad, India, 2007.

[3] Martin Potthast, Benno Stein, and Maik Anderka. A Wikipedia-Based Multilingual Retrieval
    Model. In Craig Macdonald, Iadh Ounis, Vassilis Plachouras, Ian Ruthven, and Ryen W.
    White, editors, 30th European Conference on IR Research, ECIR 2008, Glasgow, volume
    4956 of Lecture Notes in Computer Science, pages 522–530. Springer, 2008.
[4] Philipp Sorg and Philipp Cimiano. Cross-lingual information retrieval with explicit semantic
    analysis. In Working Notes for the CLEF 2008 Workshop, 2008.
[5] S. K. M. Wong, Wojciech Ziarko, and Patrick C. N. Wong. Generalized vector spaces model
    in information retrieval. In SIGIR ’85: Proceedings of the 8th annual international ACM
    SIGIR conference on Research and development in information retrieval, pages 18–25, New
    York, NY, USA, 1985. ACM.
[6] Yiming Yang, Jaime G. Carbonell, Ralf D. Brown, and Robert E. Frederking. Translingual
    information retrieval: learning from bilingual corpora. Artificial Intelligence, 103(1-2):323–345,
    1998.