Introduction

Replicating an Experiment in Cross-lingual Information Retrieval with Explicit Semantic Analysis

0 Institute of Information Systems Engineering , TU Wien, Favoritenstra e 9-11/194, 1040 Vienna , Austria

We have participated in the Replicability Track of the CENTRE@CLEF 2018 conference [1{4]. This paper reintroduces Explicit Semantic Analysis (ESA) and its extension for cross-lingual document retrieval tasks, called Cross-lingual Explicit Semantic Analysis (CL-ESA), for the rst time introduced by Sorg and Cimiano in 2008. The goal is to replicate an experiment from Sorg and Cimiano, who participated in the CLEF conference in 2008 and report on the results as well as to point out mistakes and problems along the way. This work should be read in conjunction with the original work done by Sorg and Cimiano [7].

replicability Explicit Semantic Analysis cross-lingual

Introduction

Explicit Semantic Analysis (ESA) uses chosen external categories to represent a given text t. We will introduce the main ideas of a Wikipedia based approach, so the reader is able to understand the notions in the implementation section. A more detailed description can be found either in the paper by Sorg and Cimiano [ 7 ] or by Markovitch and Gabrilovich [ 5 ], who introduced this Wikipedia-based approach and the more general theory behind it. The main idea in Wikipedia-based Explicit Semantic Analysis is to map a text t into a high-dimensional real-valued vector space. Given a set of Wikipedia articles Wk = fa1; : : : ; ang in language Lk, each article ai is an external category and corresponds to a dimension in the vector space. The following function describes this mapping:

k : T ! RjWkj k(t) := hv1; : : : ; vjWkji vi := X as(wj ; ai)

wj2t where jWkj is the number of articles in Wk. Each vi is computed by summing up the results of a function as, which de nes the strength of association between a Wikipedia article ai and a word wj , for each word in a given text t = hw1; : : : ; wli. In this regard k(t) is called ESA-vector and expresses the strength of association of a given text t with each article ai in Wk.

There are many di erent ways to de ne the as function, e.g. here we use a tf:idf function, based on a Bag-of-Words model of the Wikipedia articles. In this sense, Explicit Semantic Analysis is very exible, as it can be adapted to di erent tasks and contexts, simply by choosing a di erent as function.

as(wj ; ai) = tf:idfai (wj )

The function we used for the experiment is described in Section 4. Computing the ESA-vector, means we compute the strength of association for each Wikipedia article ai, hence after sorting the ESA-vector by value, it corresponds to a ranking of Wikipedia articles according to relevance for a given text t. Essentially, Explicit Semantic Analysis transforms a given text t into a vector representation according to external categories. This means one can simply assess the similarity between two arbitrary texts t1 and t2 by computing their ESA-vectors and for example using the standard cosine similarity to compare the vectors. This is another reason for which Explicit Semantic Analysis is very exible, as it can be used on arbitrary texts | we can simply adapt it to di erent tasks, among other things a retrieval task (query and document) or a clustering task (two documents). 3

Cross Lingual Explicit Semantic Analysis

Cross Lingual Explicit Semantic Analysis (CL-ESA) is an extension to ESA, which can handle multi-lingual retrieval tasks. This approach uses the fact that Wikipedia articles are linked across di erent languages. Therefore one can assume that there exists a mapping function mi!j , which maps an article ai from Wikipedia Wi to its corresponding article in Wikipedia Wj . Suppose there are n languages L1; : : : ; Ln. Transforming a given text t from language Li to language Lj is as simple as transforming i(t) to j(t) using a map, which is de ned over the cross language links in Wikipedia Wi. Since we consider n languages, we de ne an n2 mapping function of the type:

This mapping is computed as follows: where

i!j : RjWij ! RjWjj i!jhv1; : : : ; vjWiji = hv10; : : : ; vj0Wjji with 1 p jWij; 1 q jWjj. Given a text t in language Li, obtaining an ESA-vector from Wikipedia Wj, is as simple as computing i!j( i(t)). Using the above setting we can now de ne the cosine between a query qi in language Li and a document dj in language Lj in a straightforward manner as follows: cos(qi; dj) := cos( i(qi); j!i( j(dj)))

Now we obtained a uniform approach across multiple languages. Nevertheless, it is important to note that CL-ESA works under the assumption that the language of the document is known.

4 Implementation

In this section we will describe the implementation details used for the experiment. Unfortunately the overall setup di ers from the original experiment, because we were not able to obtain database dumps from 2008. Instead we downloaded static HTML dumps1 in English, German and French from the year 2008 and extracted them on the disk [1{4]. 4.1

Preprocessing of the documents

For the actual indexing step we used the same methods as Sorg and Cimiano, namely a standard white space tokenizer, standard stop word lists for English, German and French and a Snowball2 Stemmer for English, German and French respectively. 4.2

ESA Implementation

In this section we will give a detailed description of our implementation and preprocessing of the documents and Wikipedia articles. Moreover we will compare our implementation with the implementation described in the original paper by Sorg and Cimiano. 1 https://dumps.wikimedia.org/other/static_html_dumps/2008-06/ 2 Snowball Stemmers are included in Lucene Wikipedia Article Preprocessing After the extraction we realized quickly, that the static HTML dump has much more pages than just the article pages we wanted to index. Hence the rst step for us was to write a python script which ignored Wikipedia speci c pages. Fortunately these pages were easy to locate as their purpose was encoded in the lename of the page, namely pages which started with Category, Image, Portal, Help, Template, User or Wikipedia3 were ignored and the corresponding discussion pages4 were ignored as well. The python script simply changed the le extension from .html to .ign, which stands for ignore. Moreover every page with a lesize of less than 1KB was ignored, due to the fact that those les were redirect pages. The HTML markup for an actual article would already exceed this limit, hence there was no room for erroneously ignoring an actual article. Now the indexer would only index pages whose le extension is not .ign.

The processing of the Wikipedia articles is vastly di erent to Sorg's approach, due to using a di erent source. First we needed to extract the relevant text from the HTML mark up, using a library called JSoup5. This library empowered us to query the HTML markup using CSS-style queries. With this approach we selected the div-element with id equal to content. Then we removed the table of contents which is a table tag with id equal to toc. Moreover we removed any category links, which is a div-element with id equal to catlinks and we removed all edit sections, which were span tags with class equal to editsection. After these steps we simply selected the text between all remaining tags and used this as a document for the index. We randomly sampled about 30 articles and looked at the result of this process to convince ourselves that there are no more unnecessary texts which might skew the index [1{4]. Nevertheless there were cases where this approach threw exceptions and we printed each of them into a log le. However, the number was less than 200, when compared to the number of documents in the corpus which are more than half a million documents, we decided that it was not worth it to investigate those articles further at this stage.

For indexing the documents using Lucene6, it was not necessary to use the WikipediaAnalyzer, because our text was just plain text without Wiki markup. Therefore we used the same methods as in the preprocessing of the documents. Sorg and Cimiano mention two kinds of restrictions on the article selection, \Then all articles with less than 100 words or less than 5 incoming pagelinks were discarded." We implemented them adapted to the static HTML dump. The length is checked before adding a document to the index by counting the tokens generated from Lucene. The incoming page links were more complex to obtain. We generated a page link map by parsing all Wikipedia articles and counting the number of a href tags in the div-element with id equal to content. The 3 The pre xes were in the same language as the corresponding Wikipedia, e.g. Benutzer for User in the German Wikipedia 4 for each pre x there was a discussion page encoded as <pre x> talk 5 https://jsoup.org/ 6 https://lucene.apache.org/core/ link7 in these tags is a path to the corresponding article in the lesystem and we used them as keys for the aforementioned page link map and counted the occurrences. Then instead of parsing the whole Wikipedia directory with the indexer, we simply looped over all keys in the map and only parsed the articles with 5 or more page links. Unfortunately, there are no exact document counts from Sorg and Cimiano after applying these restrictions. Nevertheless, they reported document counts after restricting the documents to \at least a language link to one of the two other languages we consider [. . . ] we used 536.896 English, 390.027 German and 362.972 French articles for the ESA indexing" [ 7 ] We ended up with more documents for the ESA indexing than Sorg and Cimiano. The English Wikipedia index consists of 1.517.398 documents, the German Wikipedia index consists of 520.433 documents and the French Wikipedia index consists of 431.245 documents. Considering the additional restriction Sorg and Cimiano applied to obtain their counts, we think that the discrepancy between our numbers is reasonable [1{4]. The English Wikipedia is much larger than the German or French Wikipedia. Therefore there are a lot more pages which do not have a link to German or French. The smaller discrepancy in the French and German Wikipedia are probably pages unique to their cultural heritage and therefore are not likely to have an English equivalent. Nevertheless, we did not check any of the aforementioned reasons, because the additional restriction is unrelated to replicating the experiment at hand. That being said, having a vastly di erent preprocessing and wikipedia source is probably a solid reason, why we were not able to obtain results similar to Sorg and Cimiano.

ESA Vector Computation The computation of the ESA vector uses an inverted index of the selected Wikipedia articles. Each document will be queried against this index and the retrieved articles will be used to build the ESA vector. Similar to Sorg and Cimiano, we used Lucene for indexing and the association strength was implemented using a customized Lucene similarity function. The function takes a text t = hw1; : : : ; wli and a Wikipedia article ai of Wikipedia corpus jW j and computes the following function:

asR(t; ai) = (Ct)pjaij 1 X tfai (wj )idf (wj ) with

Ct = wj2t 1 r P idf (wj )

wj2t tfai (wi) = p#occurrences ofwiinai idf (wj ) = 1 + log

jW j + 1 #articles containingwj 7 removed anchorpoints

The following idf is described in the original paper by Sorg and Cimiano: idf (wj) = 1 + log #articles containingwj jW j + 1

First we need to point out an error in de ning the idf function the way it was de ned by Sorg and Cimiano [1{4]. The result of the idf function would be negative because the fraction is de nitely less than 1 and taking the log of a value less than 1 yields a negative result. We realized, that this is probably an error because in the literature (e.g. [ 6 ]), variations of the idf are de ned di erently and result in a positive value greater than 1. Therefore we swapped the numerator with the denominator and used this variant for our experiment. Multi-lingual Mapping Similar to Sorg and Cimiano some preprocessing was needed to obtain the multi-lingual mapping. Due to using the static HTML dump instead of a database dump, the cross language links were embedded in the HTML and pointed to the actual lename on the lesystem. The replicated experiment only involved English topic titles, hence we only computed the mapping from German to English and from French to English. A normalization of the page title was not needed, because we computed the mapping using the document ids from the index, which correspond to the index of the dimension in the ESA-vector. Nevertheless we needed to deal with redirect pages [1{4], therefore we used the following steps to compute the mapping:

1. For each document in the German (resp. French) index

2. Find the le in the le system and check8 if a link to English is available 3. If an English link is available nd the le in the le system. 4. Recursively determine if the le is a redirect page until the actual document is reached. 5. Look up the document id in the English index and add it to the mapping.

Similar to Sorg and Cimiano we summed up the scores, in case of multiple language links pointed to the same article in the English Wikipedia. 4.3

Language Identi cation

The computation of the ESA vector is based on the assumption that we know the language in which the document is written. Unfortunately this is not always the case. Even in the TEL German dataset used for the replication, there are records without any knowledge about the language. Hence Sorg and Cimiano presented the following function to determine the language of a document t: 8 We selected the div-element with id equal to p-lang with JSoup where \minDim( k(t)) returns the value of the lowest dimension in vector k and maxDim( k(t)) returns the highest correspondingly." [ 7 ] To us this description of lowest and highest dimension of a vector does not make sense. We thought of multiple possibilities to interpret it, e.g. index of the dimension with the lowest and highest value of the vector or the actual minimal and maximal values of the vector. However, none of this made sense. Fortunately, Sorg and Cimiano give an intuition about their heuristic as \The intuition behind this heuristic is that a small di erence between the values of the lowest and highest dimension, which is computed by the share of these values, means that the document matches good to many Wikipedia articles and it can therefore be assumed that the document is of the same language as the used Wikipedia articles. Comparing a document to Wikipedia articles in another language, there will be some matches but the value of lowest dimension will most probably be very small." Following this intuition lead us to interpret it as follows: where jWkj is the number of articles used from Wikipedia in language k [1{4]. This way we get the percentage of Wikipedia articles a document t is matched to. Then the language, in which the document should be written in, is the language of the Wikipedia base with the highest percentual article match. We have implemented our interpretation of the language identi cation, however when trying to identify the language of records without language tag, the run would have taken too long and we would not have been able to submit our results in time. Therefore we chose to try and match a document without language tag with the German Wikipedia by default. The retrieval algorithm we used, presented in Algorithm 1, is generally the same as Sorg and Cimiano presented in their paper. The only change here is that our language identi cation solely relies on language tags in the records to identify the correct language of the document. 5

Results

5.1

Dataset

In this section we present additional information about the dataset, its language distribution, additional settings of the experiment and the results. The TEL German dataset has in total 869353 records in over 100 di erent languages. The data can be split in records with a language tag, which is about 90%, and without a language tag, which is the rest. German, English and French are the main languages of those records with language tag and make up about Input: Topics T , Language k, Documents D for t 2 T do

t = k(t ); end for d 2 D do l := lang(d); d = l!k l(d ); for t 2 T do

score [t; d] = cos(t, d ); end end

Algorithm 1: Retrieval-Algorithm 88%. In our experiment we only use the title information to build queries for the index and according to Sorg and Cimiano \The title of the record is the only content information that is available for all records". We cannot con rm this statement, because we have found that about 4,5% of the records in the dataset do not contain title information [1{4]. In our experiment we simply ignore the records that do not contain this information. 5.2

CLEF Replicability Experiment

The objective of this experiment is to query the 50 given topics in English on the multi-lingual TEL German dataset [1{4]. The topics consist of a title and a short description to build a query. Sorg and Cimiano do not mention what they used to build the query. We chose to use only the title as a query. As for the ESA-vector length k we ended up trying two di erent settings. Sorg and Cimiano used k = 10:000 for topics and k = 1:000 for the records. We ran one experiment with the same settings and additionaly we ran another experiment with values of a magnitude smaller, namely for the topics we used k = 1:000 and for the records we used k = 100. The result obtained by Sorg and Cimiano was a mean average precision (MAP) of 6,7% [ 7 ]. Unfortunately, they did not explicitly mention a requirement for a record to count as a relevant document for a certain topic. Therefore we assumed, that every score greater than zero is a relevant document. This means that as soon as there is one overlapping dimension in the ESA-vectors of a record and a topic it would yield a relevant document. This assumption lead us to the problem, that the full list of relevant documents, obtained from the experiment with the larger ESA-vector length, matched more than two thirds of all the records in the dataset to nearly every topic. Looking at the list we gured out, that the score of the higher ranking documents decreased at a faster pace and the relevant documents after rank 1000 decreased in a much slower pace and yielded only a small fraction of the score in comparison to the higher ranking documents. We concluded, that it would be meaningful to cut o the results at a certain rank and look at the MAP of the reduced lists, because the relevant documents with a very low score might just have been accidently connected by a single Wikipedia article, which does not necessarily convey a semantical connection between a record and a topic. We used the top 10, the top 100 and the top 1000 results for each topic, and we calculated the MAP using trec eval9. The results are shown in Table 1. After comparing our results with the 6,7% obtained by Sorg and Cimiano, we conclude, that we were not able to reproduce the result. In this paper, we described the CL-ESA approach presented by Sorg and Cimiano and we attempted to replicate an experiment, submitted to the CLEF conference in the year 2008. In the end we were not able to reproduce the result. Most parts of the experimental setup were replicated accurately, but especially the index might be very di erent in comparison with the index of the original experiment, because we were not able to obtain a Wikipedia database dump from 2008 and therefore worked with a static HTML dump. Since the index is at the core of the experiment, it can lead to subsequent di erences in every other part. Other than that, some missing details, e.g. the way the query is built from the topics and a detailed explanation about what elds from the records of the dataset were used, make it hard to replicate the experiment in a more detailed manner. Moreover we ran into some problems based on our own assumptions. We are refering to the fact that the full result list of the experiment with the bigger ESA-vector lengths matched two thirds of all the records in the dataset to almost every topic. We think, that the cause of this problem lies in the fact, that while there are Wikipedia articles, which might very accurately describe a semantical category, there are certainly some articles, which have the opposite e ect. To give a short example, suppose an article of a famous actress will have a lot of di erent words with di erent semantical meanings on her page (e.g. overview of her career and life), while also having acted in some horror movies. This article would then match any kind of record of the dataset, which somehow was able to obtain a positive score through words which are not semantically connected to horror movies, to the topic horror movies, even though there is probably no semantic connection whatsoever. Therefore we think, that for Wikipedia-based CL-ESA to yield better results it is essential to have a good article selection or to introduce certain restrictions on what ends up being a relevant document for a certain topic, e.g. at least 10 dimensions need to overlap in the ESA-vectors. 9 https://github.com/usnistgov/trec_eval

1. Bellot , P. , Trabelsi , C. , Mothe , J. , Murtagh , F. , Nie , J.Y. , Soulier , L. , SanJuan , E., Cappellato , L. , Ferro , N. (eds.): Experimental IR Meets Multilinguality, Multimodality, and Interaction . Proceedings of the Nineth International Conference of the CLEF Association (CLEF 2018 ). Lecture Notes in Computer Science (LNCS) , Springer, Heidelberg, Germany ( 2018 )

2. Cappellato , L. , Ferro , N. , Nie , J.Y. , Soulier , L . (eds.): CLEF 2018 Working Notes. CEUR Workshop Proceedings (CEUR-WS.org) , ISSN 1613-0073 ( 2018 )

3. Ferro , N. , Maistro , M. , Sakai , T. , Soboro , I. : CENTRE@CLEF2018: Overview of the Replicability Task . In: Cappellato et al. [ 2 ]

4. Ferro , N. , Maistro , M. , Sakai , T. , Soboro , I. : Overview of CENTRE@CLEF 2018: a First Tale in the Systematic Reproducibility Realm . In: Bellot et al. [ 1 ]

5. Gabrilovich , E. , Markovitch , S. : Computing semantic relatedness using wikipediabased explicit semantic analysis . In: Proceedings of The Twentieth International Joint Conference for Arti cial Intelligence . pp. 1606 { 1611 . Hyderabad , India ( 2007 ), http://www.cs.technion.ac.il/~shaulm/papers/pdf/ Gabrilovich-Markovitch-ijcai2007.pdf

6. Manning , C.D. , Raghavan , P. , Schutze, H.: Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA ( 2008 )

7. Sorg , P. , Cimiano , P. : Cross-lingual Information Retrieval with Explicit Semantic Analysis . In: Working Notes for the CLEF 2008 Workshop ( 2008 )