=Paper=
{{Paper
|id=Vol-1175/CLEF2009wn-adhoc-AnderkaEt2009
|storemode=property
|title=Evaluating Cross-Language Explicit Semantic Analysis and Cross Querying at TEL@CLEF 2009
|pdfUrl=https://ceur-ws.org/Vol-1175/CLEF2009wn-adhoc-AnderkaEt2009.pdf
|volume=Vol-1175
|dblpUrl=https://dblp.org/rec/conf/clef/AnderkaLS09a
}}
==Evaluating Cross-Language Explicit Semantic Analysis and Cross Querying at TEL@CLEF 2009==
Evaluating Cross-Language Explicit Semantic Analysis and Cross Querying at TEL@CLEF 2009

Maik Anderka, Nedim Lipka, Benno Stein
Faculty of Media, Media Systems
Bauhaus University Weimar
99421 Weimar, Germany
@uni-weimar.de

Abstract

This paper describes our participation in the TEL@CLEF task of the CLEF 2009 ad-hoc track. The task is to retrieve items that are relevant to a user's query from various multilingual collections of library catalog records. Two different strategies are employed: (i) Cross-Language Explicit Semantic Analysis, CL-ESA, where the library catalog records and the queries are represented in a multilingual concept space that is spanned by aligned Wikipedia articles, and (ii) a Cross Querying approach, where a query is translated into all target languages using Google Translate and the obtained rankings are combined. The evaluation shows that both strategies outperform the monolingual baseline and achieve comparable results. Furthermore, inspired by the Generalized Vector Space Model, we present a formal definition and an alternative interpretation of the CL-ESA model. This interpretation is interesting for real-world retrieval applications since it reveals how the computational effort for CL-ESA can be shifted from the query phase to a preprocessing phase.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software; H.3.7 Digital Libraries

General Terms

Measurement, Performance, Experimentation

Keywords

Cross-Language Information Retrieval, Cross-Language Explicit Semantic Analysis, Wikipedia, Cross Querying

1 Introduction

Cross-language information retrieval, CLIR, is the task of retrieving documents from a target collection written in a language different from the language of a user's query. CLIR systems give multilingual users the possibility to express queries in any language, e.g., their native language, and to obtain result documents in all languages they are familiar with. Since CLIR is not restricted to collections in the query language, more sources can be included in the retrieval process, and the chance to fulfill a particular information need of a multilingual user is higher. Another use case for CLIR techniques is cross-language plagiarism detection, where the query corresponds to a suspicious document and the target collection is a reference corpus with original documents [3].

The Cross-Language Evaluation Forum, CLEF, provides an infrastructure for the evaluation of information retrieval systems, both monolingual and cross-lingual. We participated in the TEL@CLEF task of the CLEF 2009 ad-hoc track, which aims at the evaluation of systems that retrieve relevant items from multilingual collections of library catalog records. The main challenges of this task are the multilinguality and the sparsity of the dataset. We used two different CLIR approaches to tackle this task; the paper in hand outlines and discusses these approaches and the achieved results.

The first approach is Cross-Language Explicit Semantic Analysis, CL-ESA, a multilingual retrieval model to assess the cross-language similarity between text documents [3]. The CL-ESA model exploits a document-aligned comparable corpus such as Wikipedia in order to map the query and the documents into a common multilingual concept space [3, 4].
We also present a formal definition and an alternative interpretation of the CL-ESA model, which is inspired by the Generalized Vector Space Model, GVSM. Our view is mathematically equivalent to the original idea of the CL-ESA model; it reveals how the computational effort for CL-ESA can be shifted from the query phase to a preprocessing phase.

In the second approach, called Cross Querying, each query is translated into all target languages. The particular rankings are used in a combined fashion, considering the most likely language of the documents.

The evaluation on the TEL@CLEF collections shows that both CLIR approaches are able to outperform the monolingual baseline. In the bilingual subtask, querying with a foreign language, Cross Querying achieves nearly the same or even better results compared to the monolingual subtask; the performance of CL-ESA is lower compared to the monolingual results.

The paper is organized as follows. Section 2 describes the target collections used in the TEL@CLEF task along with the evaluation procedure. Section 3 defines the general CL-ESA model, our formalization, and details of the CL-ESA implementation employed in the experiments. Section 4 presents the Cross Querying approach, Section 5 discusses the evaluation, and Section 6 concludes with an outlook.

2 TEL@CLEF Dataset and Evaluation Procedure

In this year's TEL@CLEF task three target collections, provided by The European Library¹, TEL, are used. The collections are labeled BL, ONB, and BNF, and mainly contain information in English, German, and French respectively (see Table 1). The collections are comprised of library catalog records referring to different types of items such as articles, books, or videos. The data is provided in structured form and represented in XML. Each library catalog record has several fields containing meta information and content information that describe the particular item. Typical meta information fields are author, rights, or publisher; typical content information fields are title, description, subject, or alternative. In our experiments we focus on the content information fields. A major difficulty is the sparsity of the available information: for many records only a few fields are given.

¹ The European Library: http://www.theeuropeanlibrary.org/

The user's information need is specified by 50 topics that are provided by CLEF in the three main languages of the target collections, namely English, German, and French. A topic consists of two fields: a title, containing 2-4 keywords, and a description, containing 1-2 sentences that specify the item of interest in greater detail. The topics are used to construct the queries.

The TEL@CLEF task is divided into a monolingual and a bilingual subtask. The aim in both subtasks is to retrieve the documents (library catalog records) from the target collections that are most relevant to a query; for each query the results are submitted as a ranked list of documents. In the monolingual subtask the language of the query and the main language of the collection are the same, while in the bilingual subtask the language of the query is different from the main language of the collection. We submitted runs for both subtasks and for all three languages.
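To illustrate the record structure described above, the following minimal sketch (ours, not part of the paper) extracts the content information fields from a single catalog record. The flat tag names mirror the field names mentioned in the text, but the exact XML schema is an assumption; real TEL records use a richer, namespaced format.

```python
import xml.etree.ElementTree as ET

# Content information fields named in the paper; meta fields are ignored.
CONTENT_FIELDS = ("title", "description", "subject", "alternative")

def content_text(record_xml):
    """Concatenate the content information fields of one catalog record.
    Records are sparse: any of the fields may be missing entirely."""
    root = ET.fromstring(record_xml)
    parts = []
    for field in CONTENT_FIELDS:
        for elem in root.iter(field):  # a field may occur multiple times
            if elem.text:
                parts.append(elem.text.strip())
    return " ".join(parts)

# Hypothetical record for demonstration purposes.
record = "<record><title>Mozart piano sonatas</title><subject>Music</subject></record>"
print(content_text(record))  # -> "Mozart piano sonatas Music"
```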
Table 1: Statistics of the three target collections BL, ONB, and BNF used in the TEL@CLEF task.

                                            BL          ONB       BNF
  main language                             English     German    French
  # documents                               1 000 100   869 353   1 000 100
  # documents with title                    1 000 042   829 675   1 000 095
  average length of title                   8.033       5.500     17.124
  # documents with description              518 493     0         1 000 100
  average length of description             6.222       0         10.095
  # documents with subject                  671 544     602 580   368 788
  average length of subject                 7.032       8.373     10.833
  # documents with alternative              78 679      404 415   0
  average length of alternative             5.491       8.158     0
  # documents without content information   20          37 564    0

3 Cross-Language Explicit Semantic Analysis

Cross-Language Explicit Semantic Analysis, CL-ESA, is a generalization of Explicit Semantic Analysis, ESA [2], and was proposed by Potthast et al. [3]. This section presents a formal definition of the CL-ESA model that reveals its close connection to the Generalized Vector Space Model, GVSM [5]: the ESA model and the GVSM can be transformed into each other [1]. It follows immediately that this is also true for the CL-ESA model and the cross-lingual extension of the Generalized Vector Space Model, CL-GVSM [6].

3.1 Formal Definition

Let $d_i$ be a real-world document written in language $L_i$, and let $\mathbf{d}_i$ be a bag-of-words-based representation of $d_i$, encoded as a vector of normalized term frequency weights over a universal term vocabulary $V_i$. $V_i$ contains all used terms for language $L_i$. A set $D_i$ of document representations defines a term-document matrix $A_{D_i}$, where each column in $A_{D_i}$ corresponds to a vector $\mathbf{d}_i \in D_i$.

Definition 1 (ESA Representation [1]) Let $D_i^*$ be a collection of index documents written in language $L_i$. The ESA representation $\mathbf{d}_i^{ESA}$ of a document $d_i$ with representation $\mathbf{d}_i$ is defined as follows:

$$\mathbf{d}_i^{ESA} = A_{D_i^*}^T \cdot \mathbf{d}_i, \qquad (1)$$

where $A^T$ designates the matrix transpose of $A$.

The rationale of this definition becomes clear if one considers that the weight vectors $\mathbf{d}_i^* \in D_i^*$ and $\mathbf{d}_i$ are normalized: $\|\mathbf{d}_i^*\| = \|\mathbf{d}_i\| = 1$, for each $\mathbf{d}_i^* \in D_i^*$. Hence, each entry in the ESA representation $\mathbf{d}_i^{ESA}$ of a document $d_i$ is the cosine similarity between $\mathbf{d}_i$ and some vector $\mathbf{d}_i^* \in D_i^*$. Put another way, $\mathbf{d}_i$ is compared to each index document in $D_i^*$, and $\mathbf{d}_i^{ESA}$ is comprised of the respective cosine similarities.

Definition 2 (CL-ESA Similarity) Let $L = \{L_1, \ldots, L_k\}$ denote a set of natural languages, and let $D^* = \{D_1^*, \ldots, D_k^*\}$ be a set of index collections where each $D_i^* \in D^*$ is a list of index documents written in language $L_i \in L$. $D^*$ is a document-aligned comparable corpus, i.e., for each language $L_i \in L$ the $n$-th index document in $D_i^* \in D^*$ describes the same concept. The CL-ESA similarity, $\varphi_{CL\text{-}ESA}(\mathbf{q}_j, \mathbf{d}_i)$, between a query $q_j$ in language $L_j$ and a document $d_i$ in language $L_i$ is computed as the cosine similarity $\varphi$ of the ESA representations of $\mathbf{q}_j$ and $\mathbf{d}_i$:

$$\varphi_{CL\text{-}ESA}(\mathbf{q}_j, \mathbf{d}_i) = \varphi(\mathbf{q}_j^{ESA}, \mathbf{d}_i^{ESA}) = \varphi(A_{D_j^*}^T \cdot \mathbf{q}_j,\; A_{D_i^*}^T \cdot \mathbf{d}_i) \qquad (2)$$

Due to the alignment of the index collections $D_j^*$ and $D_i^*$, the ESA representations of $\mathbf{q}_j$ and $\mathbf{d}_i$ are comparable.
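For illustration, here is a minimal numpy sketch (ours, not the authors' code) of Definitions 1 and 2 for a two-language case. The dense term-document matrices with L2-normalized columns, the toy dimensions, and the variable names are assumptions; in practice the matrices would be sparse and far larger.

```python
import numpy as np

def esa_representation(A_index, d):
    """Definition 1: map a normalized term-weight vector d into the
    concept space spanned by the index collection, d_ESA = A^T * d.
    With L2-normalized columns and a normalized d, each entry is the
    cosine similarity between d and one index document."""
    return A_index.T @ d

def cl_esa_similarity(A_q, q, A_d, d):
    """Definition 2 (Equation 2): cosine similarity of the two ESA
    representations, which are comparable because the index
    collections are document-aligned."""
    q_esa = esa_representation(A_q, q)
    d_esa = esa_representation(A_d, d)
    denom = np.linalg.norm(q_esa) * np.linalg.norm(d_esa) + 1e-12  # guard against zero vectors
    return float(q_esa @ d_esa / denom)

# Toy example: |V_en| = 5 and |V_de| = 4 terms, |D*| = 3 aligned index documents.
rng = np.random.default_rng(0)
A_en = rng.random((5, 3)); A_en /= np.linalg.norm(A_en, axis=0)  # normalize columns
A_de = rng.random((4, 3)); A_de /= np.linalg.norm(A_de, axis=0)
q = rng.random(5); q /= np.linalg.norm(q)  # English query vector
d = rng.random(4); d /= np.linalg.norm(d)  # German document vector
print(cl_esa_similarity(A_en, q, A_de, d))
```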
Definition 2 is equivalent to the definition of the CL-GVSM similarity $\varphi_{CL\text{-}GVSM}(\mathbf{q}_j, \mathbf{d}_i)$ given in [6], which means that, in analogy to [1], the CL-ESA model and the CL-GVSM can be directly transformed into each other:

$$\varphi_{CL\text{-}ESA}(\mathbf{q}_j, \mathbf{d}_i) = \varphi(A_{D_j^*}^T \cdot \mathbf{q}_j,\; A_{D_i^*}^T \cdot \mathbf{d}_i) = \varphi_{CL\text{-}GVSM}(\mathbf{q}_j, \mathbf{d}_i) \qquad (3)$$

3.2 Alternative Interpretation

The original idea of the CL-ESA model is to map both query and documents into a multilingual concept space, as expressed in Equation 2. Note that Equation 2 can be rearranged as follows:

$$\varphi_{CL\text{-}ESA}(\mathbf{q}_j, \mathbf{d}_i) = \varphi(A_{D_j^*}^T \cdot \mathbf{q}_j,\; A_{D_i^*}^T \cdot \mathbf{d}_i) = \mathbf{q}_j^T \cdot A_{D_j^*} \cdot A_{D_i^*}^T \cdot \mathbf{d}_i \qquad (4)$$

In particular, the matrix $A_{D_j^*} \cdot A_{D_i^*}^T = G_{j,i}$ can be computed in advance since it is independent of a particular $\mathbf{q}_j$ or $\mathbf{d}_i$. Hence:

$$\varphi_{CL\text{-}ESA}(\mathbf{q}_j, \mathbf{d}_i) = \mathbf{q}_j^T \cdot G_{j,i} \cdot \mathbf{d}_i \qquad (5)$$

The rationale of Equation 5 becomes apparent if one recognizes $G_{j,i} = A_{D_j^*} \cdot A_{D_i^*}^T$ as a $|V_j| \times |V_i|$ term co-occurrence matrix. The $n$-th row in $A_{D_j^*}$ corresponds to the distribution of the $n$-th term $t_n \in V_j$ over the index documents in $D_j^*$; likewise, the $m$-th row in $A_{D_i^*}$ corresponds to the distribution of the $m$-th term $t_m \in V_i$ over the index documents in $D_i^*$. Recall that the index documents in $D_j^*$ and $D_i^*$ are aligned. That is, the value in the $n$-th row and the $m$-th column of $G_{j,i}$ quantifies the similarity between the distributions of $t_n$ and $t_m$ given the concepts described by the index documents in $D_j^*$ and $D_i^*$.

The CL-ESA similarity computation of Equation 5 can be viewed in two ways:

(i) as a translation of the query representation $\mathbf{q}_j$ into the space of the document representation $\mathbf{d}_i$: $\varphi_{CL\text{-}ESA}(\mathbf{q}_j, \mathbf{d}_i) = (\mathbf{q}_j^T \cdot G_{j,i}) \cdot \mathbf{d}_i$, or,

(ii) as a translation of the document representation $\mathbf{d}_i$ into the space of the query representation $\mathbf{q}_j$: $\varphi_{CL\text{-}ESA}(\mathbf{q}_j, \mathbf{d}_i) = \mathbf{q}_j^T \cdot (G_{j,i} \cdot \mathbf{d}_i)$.

These views are different from the original idea of the CL-ESA model, where both the query representation and the document representation are mapped into a common multilingual concept space (see Equation 2). From a mathematical standpoint Equation 2 and Equation 5 are equivalent; however, implementing CL-ESA based on the alternative interpretation yields a considerable runtime improvement in practical retrieval applications. Table 2 contrasts the interpretations and the related runtime complexities.

Table 2: The different interpretations of the CL-ESA model.

                          Original interpretation                   Alternative interpretation
                                                                    View (i)                   View (ii)
  ϕ_CL-ESA(q_j, d_i) =    ϕ(A^T_{D_j*} · q_j, A^T_{D_i*} · d_i)     (q_j^T · G_{j,i}) · d_i    q_j^T · (G_{j,i} · d_i)
  Runtime complexity      O(l · |D*| + |D*|)                        O(l · |V_j| + l)           O(l)

Here, we assume a closed retrieval situation where, from a given target collection $D_i$ in language $L_i$, the documents most similar to a query $q_j$ in language $L_j$ are desired. CLIR with CL-ESA is then straightforward: compute $\varphi_{CL\text{-}ESA}(\mathbf{q}_j, \mathbf{d}_i)$ for each $\mathbf{d}_i \in D_i$ and rank by decreasing CL-ESA similarity. Under the original interpretation the ESA representations $\mathbf{d}_i^{ESA}$ of the documents $d_i \in D_i$ can be computed in advance. At retrieval time the query is mapped into the concept space in $O(l \cdot |D^*|)$, where $l$ denotes the number of query terms; the computation of the cosine similarity between the ESA representations $\mathbf{q}_j^{ESA}$ and $\mathbf{d}_i^{ESA}$ requires $O(|D^*|)$. Under the alternative interpretation the matrix $G_{j,i}$ can be computed in advance. Note that in practical applications $l \ll |D^*|$, since a reasonable index collection size $|D^*|$ is 10 000, which shows the substantial performance improvement under the alternative interpretation, in particular under View (ii).
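The following minimal numpy sketch (ours, not the authors' implementation) shows how View (ii) splits the work between a preprocessing phase and a query phase. The function names and the dense-matrix representation are assumptions; the point is that everything except the final sparse dot products can be computed before any query arrives.

```python
import numpy as np

def precompute_G(A_j, A_i):
    """Preprocessing phase: G_{j,i} = A_{D_j*} · A_{D_i*}^T is a
    |V_j| x |V_i| term co-occurrence matrix that depends only on the
    aligned index collections, so it is computed once and stored."""
    return A_j @ A_i.T

def translate_documents(G, docs):
    """View (ii), also preprocessing: translate every target document
    into the query's term space (one |V_j|-vector per document)."""
    return [G @ d for d in docs]

def rank(q, translated_docs):
    """Query phase: q has only l non-zero entries, so each score
    q^T · (G · d_i) is an O(l) sparse dot product. Returns document
    indices ranked by decreasing CL-ESA similarity (Equation 5)."""
    scores = np.array([q @ t for t in translated_docs])
    return np.argsort(-scores)
```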
3.3 Usage in TEL@CLEF

In this subsection we describe implementation details of the CL-ESA model we used in our submission. The best parameter setting was determined by unofficial experiments on the TEL@CLEF 2008 dataset.

Query and Document Construction. We use the original words of both topic fields, title and description, as queries. The documents are constructed by merging the text of the three record fields title, subject, and alternative. We assume that the language of these fields is the same within one record; however, this assumption may be violated in some cases since the collections contain multilingual records. Records without these fields are omitted in the experiments (see Table 1).

Index Collection. Wikipedia is employed as index collection. We restrict the multilinguality of our model to the three main languages of the target collections: English, German, and French. Based on a Wikipedia snapshot from March 2009, about 169 000 articles per language can be aligned and fulfill several filter criteria, e.g., to contain more than 100 words and not to be a disambiguation or redirection page. All of these articles are used as index documents. As term weighting scheme tf·idf is used. Query and document words are stemmed using the Snowball stemmers. To speed up the CL-ESA similarity computation, all values below a threshold of ε = 0.025 are discarded.

Language Detection. While the language of the queries is determined by the corresponding topics, the language of the documents is unknown since the collections are multilingual and no language meta information is provided. In the experiments we resort to a simple "detection by stop words" approach for the three main languages; if the detection fails, the main language of the collection is assumed.

4 Cross Querying

Cross Querying is a straightforward approach for CLIR systems. We subsume the fields of a topic in one query, which is translated into the other languages. With each of the translations we compute a set of rankings by retrieving against each document field. The rankings are merged with respect to their cosine similarities. Additionally, the scores for the query translation in the document's language are multiplied by a boosting constant.

Definition 3 (Cross Querying) Let $L = \{L_1, \ldots, L_k\}$ denote a set of natural languages and let $F = \{F_1, \ldots, F_k\}$ denote a set of document fields. The function $lang: D \to L$, $d \mapsto L_i$, estimates the language of a document $d$. $\mathbf{d}$, $\mathbf{q}$, and $\mathbf{q}_{L_i}$ are the representations of a document $d$, a query $q$, and the translation of $q$ into language $L_i$. Then the Cross Querying similarity, $\varphi_{CQ}(\mathbf{q}, \mathbf{d})$, of a query $q$ and a document $d$ is defined as follows:

$$\varphi_{CQ}(\mathbf{q}, \mathbf{d}) = \sum_{F_i \in F} \Big( b \cdot \varphi(\mathbf{q}_{lang(d)}, \mathbf{d}_{F_i}) + \sum_{\substack{L_i \in L \\ L_i \neq lang(d)}} \varphi(\mathbf{q}_{L_i}, \mathbf{d}_{F_i}) \Big), \qquad (6)$$

where $\varphi$ is the cosine similarity and $b$ the boosting constant.

The name "Cross Querying" reflects the fact that $|L| \times |F|$ rankings are merged by querying in each language against each field. The applied parameters are as follows:

Query and Document Construction. The words of both topic fields, title and description, are used as queries and translated to each $L_i \in L$, with $L = \{$German, French, English$\}$. The selection of the document fields corresponds to title and subject. As term weighting scheme tf·idf is used. Query and document words are stemmed using the Snowball stemmers, and stop words are removed. The queries are translated with Google Translate; the boosting constant $b$ is based on the unofficial evaluation on the TEL@CLEF 2008 dataset.

Language Detection. In order to estimate the language of $d$ with $lang(d)$, we take the corpus language of the associated evaluation run.
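To make Equation 6 concrete, here is a minimal sketch (ours; the default value of the boosting constant below is purely illustrative, since the paper does not report the value of b). For simplicity, all query translations and document fields are assumed to be materialized as tf·idf vectors over one shared index, so that the cosine similarity reduces to a plain dot-product computation.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity; 0.0 for empty (all-zero) fields."""
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return float(a @ b / (na * nb)) if na > 0 and nb > 0 else 0.0

def cross_querying_similarity(q_by_lang, d_by_field, d_lang, b=2.0):
    """Equation 6: for every document field, score the query translation
    in the document's (estimated) language with boost b, plus the query
    translations into all other languages.

    q_by_lang  -- dict: language -> tf-idf query vector (one per translation)
    d_by_field -- dict: field ('title', 'subject') -> tf-idf field vector
    d_lang     -- estimated language of the document, lang(d)
    b          -- boosting constant (the value here is hypothetical)
    """
    score = 0.0
    for d_f in d_by_field.values():
        score += b * cosine(q_by_lang[d_lang], d_f)
        score += sum(cosine(q_v, d_f)
                     for lang, q_v in q_by_lang.items() if lang != d_lang)
    return score
```

Documents are then ranked by decreasing $\varphi_{CQ}$, merging the $|L| \times |F|$ individual rankings in a single pass.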
5 Evaluation Results

The results of the monolingual subtask and the bilingual subtask are shown in Figure 1 and Figure 2 respectively.

[Figure 1: Evaluation results of the monolingual runs. Three plots (Monolingual English, Monolingual German, Monolingual French) show the standard recall levels vs. interpolated precision for Baseline, Cross Querying, CL-ESA, and CL-ESA-LD. The table below shows the results in terms of mean average precision, MAP.]

                   English   German   French
  Baseline         0.158     0.100    0.110
  Cross Querying   0.200     0.164    0.145
  CL-ESA           0.215     0.137    0.142
  CL-ESA-LD        0.195     0.134    0.163

We submitted an additional baseline to the monolingual subtask using state-of-the-art retrieval technology: since in this subtask the language of the topics is equal to the main language of the target collection, the ranking is based on the cosine similarities of the tf·idf-weighted bag-of-words representations of the topics and the documents. Each plot in Figure 1 corresponds to one target collection and shows the baseline along with the results achieved under Cross Querying, CL-ESA, and CL-ESA with automatic language detection, CL-ESA-LD.

Both Cross Querying and CL-ESA clearly outperform the baseline. The variation between the two approaches is small, except for the German collection, where Cross Querying outperforms CL-ESA at low recall levels. At higher recall levels CL-ESA is better, which explains its slightly higher mean average precision on the English and the French collections. Using CL-ESA along with the automatic language detection improves the performance only for the French collection, which indicates that this collection contains a larger fraction of non-French documents.

In the bilingual subtask the language of the queries is different from the main language of the target collection. Each plot in Figure 2 corresponds to one target collection that is queried in the two other languages, using both Cross Querying and CL-ESA. For example, in the plot "Bilingual English" the graph for "CL-ESA-de" shows the results of querying the English collection with German topics using CL-ESA.

[Figure 2: Evaluation results of the bilingual runs. Three plots (Bilingual English, Bilingual German, Bilingual French) show the standard recall levels vs. interpolated precision. The table below shows the results in terms of mean average precision, MAP.]

                      English   German   French
  Cross Querying-en   -         0.129    0.132
  Cross Querying-de   0.215     -        0.087
  Cross Querying-fr   0.225     0.158    -
  CL-ESA-en           -         0.124    0.145
  CL-ESA-de           0.144     -        0.104
  CL-ESA-fr           0.139     0.108    -

Cross Querying achieves nearly the same or even better results compared to the monolingual situation, whereas the performance of CL-ESA is lower in contrast to the monolingual results.

6 Conclusion and Future Work

The evaluation results for the TEL@CLEF task show that both CLIR approaches, CL-ESA and Cross Querying, are able to outperform the monolingual baseline, though the absolute results are still improvable.
Furthermore, we have presented a formal definition and an alternative interpretation of the CL-ESA model, which is interesting for real-world retrieval applications since it reveals how the computational effort for CL-ESA can be shifted from the query phase to a preprocessing phase.

As for future work, CL-ESA and Cross Querying will benefit if more languages are taken into account. Currently, German, English, and French are used, but the target collections comprise more languages. For documents in other languages an inconsistent CL-ESA representation is computed. CL-ESA also needs a reliable language detection mechanism in order to compute a consistent representation; note that we used a rather simple approach in our experiments.

References

[1] Maik Anderka and Benno Stein. The ESA Retrieval Model Revisited. In Mark Sanderson, James Allan, ChengXiang Zhai, Justin Zobel, and Javed A. Aslam, editors, 32nd Annual International ACM SIGIR Conference, pages 670–671. ACM, July 2009.

[2] Evgeniy Gabrilovich and Shaul Markovitch. Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, Hyderabad, India, 2007.

[3] Martin Potthast, Benno Stein, and Maik Anderka. A Wikipedia-Based Multilingual Retrieval Model. In Craig Macdonald, Iadh Ounis, Vassilis Plachouras, Ian Ruthven, and Ryen W. White, editors, 30th European Conference on IR Research, ECIR 2008, Glasgow, volume 4956 of Lecture Notes in Computer Science, pages 522–530, Berlin Heidelberg New York, 2008. Springer.

[4] Philipp Sorg and Philipp Cimiano. Cross-lingual Information Retrieval with Explicit Semantic Analysis. In Working Notes for the CLEF 2008 Workshop, 2008.

[5] S. K. M. Wong, Wojciech Ziarko, and Patrick C. N. Wong. Generalized Vector Spaces Model in Information Retrieval. In SIGIR '85: Proceedings of the 8th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 18–25, New York, NY, USA, 1985. ACM.

[6] Yiming Yang, Jaime G. Carbonell, Ralf D. Brown, and Robert E. Frederking. Translingual Information Retrieval: Learning from Bilingual Corpora. Artificial Intelligence, 103(1-2):323–345, 1998.