Replicating an Experiment in Cross-lingual Information Retrieval with Explicit Semantic Analysis

Marco Jungwirth and Allan Hanbury[0000-0002-7149-5843]
Institute of Information Systems Engineering, TU Wien, Favoritenstraße 9-11/194, 1040 Vienna, Austria
e1326215@student.tuwien.ac.at, allan.hanbury@tuwien.ac.at

Abstract. We participated in the Replicability Track of CENTRE@CLEF 2018 [1–4]. This paper reintroduces Explicit Semantic Analysis (ESA) and its extension for cross-lingual document retrieval tasks, Cross-lingual Explicit Semantic Analysis (CL-ESA), first introduced by Sorg and Cimiano in 2008. The goal is to replicate an experiment by Sorg and Cimiano, who participated in the CLEF conference in 2008, to report on the results, and to point out mistakes and problems along the way. This work should be read in conjunction with the original work by Sorg and Cimiano [7].

Keywords: replicability · Explicit Semantic Analysis · cross-lingual

1 Introduction

The goal of this paper is to replicate an experiment and its results as reported in the paper Cross-lingual Information Retrieval with Explicit Semantic Analysis by Philipp Sorg and Philipp Cimiano. This paper is structured as follows: in Section 2, we concisely introduce Explicit Semantic Analysis [5]. Section 3 describes the extension of this approach to cross-lingual document retrieval using cross-language links from Wikipedia. In Section 4 the experimental setup and the differences to the original experiment are described in detail. Finally, in Section 5 we present the results of the replicated experiment.

2 Explicit Semantic Analysis

Explicit Semantic Analysis (ESA) represents a given text t in terms of chosen external categories. We introduce the main ideas of the Wikipedia-based approach so that the reader is able to follow the notions in the implementation section. A more detailed description can be found either in the paper by Sorg and Cimiano [7] or in the paper by Gabrilovich and Markovitch [5], who introduced this Wikipedia-based approach and the more general theory behind it.

The main idea of Wikipedia-based Explicit Semantic Analysis is to map a text t into a high-dimensional real-valued vector space. Given a set of Wikipedia articles W_k = {a_1, ..., a_n} in language L_k, each article a_i is an external category and corresponds to a dimension in the vector space. The following function describes this mapping:

\Phi_k : T \rightarrow \mathbb{R}^{|W_k|}, \quad \Phi_k(t) := \langle v_1, \ldots, v_{|W_k|} \rangle

where |W_k| is the number of articles in W_k. Each v_i is computed by summing up, over all words in a given text t = ⟨w_1, ..., w_l⟩, the values of a function as, which defines the strength of association between a Wikipedia article a_i and a word w_j:

v_i := \sum_{w_j \in t} as(w_j, a_i)

In this regard Φ_k(t) is called the ESA-vector and expresses the strength of association of a given text t with each article a_i in W_k.

There are many different ways to define the as function; here we use a tf.idf function based on a bag-of-words model of the Wikipedia articles. In this sense, Explicit Semantic Analysis is very flexible, as it can be adapted to different tasks and contexts simply by choosing a different as function:

as(w_j, a_i) = tf.idf_{a_i}(w_j)

The exact function we used for the experiment is described in Section 4.

Computing the ESA-vector means computing the strength of association for each Wikipedia article a_i; hence, after sorting the ESA-vector by value, it corresponds to a ranking of Wikipedia articles according to their relevance for a given text t. Essentially, Explicit Semantic Analysis transforms a given text t into a vector representation according to external categories. This means one can assess the similarity between two arbitrary texts t_1 and t_2 by computing their ESA-vectors and, for example, using the standard cosine similarity to compare the vectors. This is another reason why Explicit Semantic Analysis is very flexible: since it can be applied to arbitrary texts, it can be adapted to different tasks, among other things a retrieval task (query and document) or a clustering task (two documents).
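To make the mapping concrete, the following minimal Python sketch computes sparse ESA-vectors with a plain tf.idf association over a three-article toy corpus and compares two texts with the cosine similarity. The corpus, the tokenizer and the exact idf variant are illustrative assumptions, not the Lucene-based implementation described in Section 4.

```python
import math
from collections import Counter

# Toy "Wikipedia" in one language: article title -> article text (illustrative only).
articles = {
    "Vienna": "vienna is the capital of austria on the danube",
    "Danube": "the danube is a river flowing through vienna and budapest",
    "Jazz":   "jazz is a music genre that originated in new orleans",
}

def tokenize(text):
    return text.lower().split()          # stand-in for the whitespace tokenizer

# Document frequencies for the idf term.
df = Counter()
for text in articles.values():
    df.update(set(tokenize(text)))
N = len(articles)

def esa_vector(t):
    """Phi_k(t): one dimension per article, v_i = sum over words of tf.idf_{a_i}(w_j)."""
    vec = {}
    words = tokenize(t)
    for title, text in articles.items():
        tf = Counter(tokenize(text))
        # One common idf variant; the weighting used in the experiment is given in Section 4.
        v = sum(tf[w] * (1.0 + math.log((N + 1) / df[w])) for w in words if w in tf)
        if v > 0:
            vec[title] = v
    return vec

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

print(esa_vector("river cruise on the danube"))
print(cosine(esa_vector("capital of austria"), esa_vector("city on the danube")))
```

Sorting the resulting dictionary by value yields the ranking of Wikipedia articles mentioned above.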
3 Cross-lingual Explicit Semantic Analysis

Cross-lingual Explicit Semantic Analysis (CL-ESA) is an extension of ESA that can handle multi-lingual retrieval tasks. The approach exploits the fact that Wikipedia articles are linked across different languages. Therefore one can assume that there exists a mapping function m_{i→j}, which maps an article a_i from Wikipedia W_i to its corresponding article in Wikipedia W_j. Suppose there are n languages L_1, ..., L_n. Transforming a given text t from language L_i to language L_j amounts to transforming Φ_i(t) into Φ_j(t) using a map defined over the cross-language links in Wikipedia W_i. Since we consider n languages, we define n² mapping functions of the type:

\Psi_{i \to j} : \mathbb{R}^{|W_i|} \rightarrow \mathbb{R}^{|W_j|}

This mapping is computed as follows:

\Psi_{i \to j}(\langle v_1, \ldots, v_{|W_i|} \rangle) = \langle v'_1, \ldots, v'_{|W_j|} \rangle \quad \text{where} \quad v'_p = \sum_{q \in \{q^* \mid m_{i \to j}(a_{q^*}) = a_p\}} v_q

with 1 ≤ p ≤ |W_j| and 1 ≤ q ≤ |W_i|. Given a text t in language L_i, obtaining an ESA-vector with respect to Wikipedia W_j is as simple as computing Ψ_{i→j}(Φ_i(t)). Using the above setting we can now define the cosine between a query q_i in language L_i and a document d_j in language L_j in a straightforward manner as follows:

\cos(q_i, d_j) := \cos(\Phi_i(q_i), \Psi_{j \to i}(\Phi_j(d_j)))

We have thus obtained a uniform approach across multiple languages. Nevertheless, it is important to note that CL-ESA works under the assumption that the language of the document is known.
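As a rough illustration of the mapping Ψ_{i→j}, the following sketch projects a sparse ESA-vector from one article space into another using a hand-made cross-language link table. The article names and the link table are invented for illustration and do not come from the original experiment.

```python
# Sparse ESA-vectors as {article: weight} dictionaries, as in the previous sketch.
# Hypothetical cross-language links: German article -> English article.
cross_links_de_en = {
    "Wien": "Vienna",
    "Donau": "Danube",
    # articles without a language link simply have no entry
}

def psi(vec_src, links):
    """Psi_{i->j}: sum the weights of all source articles that map to the same target article."""
    mapped = {}
    for article, weight in vec_src.items():
        target = links.get(article)
        if target is not None:
            mapped[target] = mapped.get(target, 0.0) + weight
    return mapped

# A made-up German ESA-vector and its projection into the English article space.
phi_de = {"Wien": 2.3, "Donau": 1.1, "Jazz (Musikrichtung)": 0.4}
phi_en = psi(phi_de, cross_links_de_en)   # {'Vienna': 2.3, 'Danube': 1.1}

# Cross-lingual score: cos(Phi_en(query), Psi_de->en(Phi_de(document))),
# using the cosine helper from the previous sketch.
```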
4 Implementation

In this section we describe the implementation details of the experiment. Unfortunately, the overall setup differs from the original experiment, because we were not able to obtain database dumps from 2008. Instead we downloaded static HTML dumps (https://dumps.wikimedia.org/other/static_html_dumps/2008-06/) in English, German and French from the year 2008 and extracted them to disk [1–4].

4.1 Preprocessing of the Documents

For the actual indexing step we used the same methods as Sorg and Cimiano, namely a standard whitespace tokenizer, standard stop word lists for English, German and French, and Snowball stemmers (included in Lucene) for English, German and French respectively.

4.2 ESA Implementation

In this section we give a detailed description of our implementation and preprocessing of the documents and Wikipedia articles. Moreover, we compare our implementation with the implementation described in the original paper by Sorg and Cimiano.

Wikipedia Article Preprocessing. After the extraction we quickly realized that the static HTML dump contains many more pages than just the article pages we wanted to index. Hence our first step was to write a Python script that marks Wikipedia-specific pages to be ignored. Fortunately, these pages were easy to locate, as their purpose is encoded in the filename: pages starting with Category, Image, Portal, Help, Template, User or Wikipedia (the prefixes are in the language of the corresponding Wikipedia, e.g. Benutzer instead of User in the German Wikipedia) were ignored, and the corresponding discussion pages (encoded as talk pages for each of these prefixes) were ignored as well. The Python script simply changed the file extension from .html to .ign, which stands for ignore. Moreover, every page with a file size of less than 1 KB was ignored, because those files are redirect pages; the HTML markup of an actual article already exceeds this limit, so there was no risk of erroneously ignoring an actual article. The indexer then only indexes pages whose file extension is not .ign.

The processing of the Wikipedia articles is vastly different from Sorg and Cimiano's approach because we use a different source. First we needed to extract the relevant text from the HTML markup using the JSoup library (https://jsoup.org/), which allowed us to query the HTML markup with CSS-style selectors. With this approach we selected the div element with id content. We then removed the table of contents (a table tag with id toc), any category links (a div element with id catlinks), and all edit sections (span tags with class editsection). After these steps we simply selected the text between all remaining tags and used it as a document for the index. We randomly sampled about 30 articles and inspected the result of this process to convince ourselves that no unnecessary text remained that might skew the index [1–4]. Nevertheless, there were cases where this approach threw exceptions, which we wrote to a logfile. Since there were fewer than 200 such cases, compared to more than half a million documents in the corpus, we decided that it was not worth investigating those articles further at this stage.

For indexing the documents with Lucene (https://lucene.apache.org/core/) it was not necessary to use the WikipediaAnalyzer, because our text is plain text without wiki markup. We therefore used the same methods as in the preprocessing of the documents.

Sorg and Cimiano mention two kinds of restrictions on the article selection: "Then all articles with less than 100 words or less than 5 incoming pagelinks were discarded." We implemented these restrictions adapted to the static HTML dump. The length is checked before adding a document to the index by counting the tokens generated by Lucene. The incoming page links were more complex to obtain. We generated a page link map by parsing all Wikipedia articles and counting the number of a href tags in the div element with id content. The link in these tags (after removing anchor points) is a path to the corresponding article in the filesystem; we used these paths as keys of the aforementioned page link map and counted the occurrences. Then, instead of letting the indexer parse the whole Wikipedia directory, we simply looped over all keys in the map and only parsed the articles with 5 or more page links.
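As an illustration of this filtering step, the sketch below builds such a page-link map from an extracted HTML dump. It uses BeautifulSoup in Python as a stand-in for the JSoup/Java pipeline we actually used; the directory name, prefix list and thresholds are simplified assumptions.

```python
import os
from collections import Counter
from urllib.parse import unquote
from bs4 import BeautifulSoup  # stand-in for JSoup, which was used in the actual implementation

DUMP_DIR = "wikipedia-en"      # hypothetical extraction directory
IGNORED_PREFIXES = ("Category", "Image", "Portal", "Help", "Template", "User", "Wikipedia", "Talk")

def article_files(root):
    """Yield paths of candidate article pages, skipping special pages and tiny redirect pages."""
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".html") or name.startswith(IGNORED_PREFIXES):
                continue
            path = os.path.join(dirpath, name)
            if os.path.getsize(path) < 1024:   # < 1 KB: almost certainly a redirect page
                continue
            yield path

def incoming_link_counts(root):
    """Count, for every linked article path, how many a-href tags inside div#content point to it."""
    counts = Counter()
    for path in article_files(root):
        with open(path, encoding="utf-8", errors="ignore") as f:
            soup = BeautifulSoup(f.read(), "html.parser")
        content = soup.find("div", id="content")
        if content is None:
            continue
        for a in content.find_all("a", href=True):
            target = unquote(a["href"].split("#", 1)[0])   # drop anchor points
            if target:
                counts[target] += 1
    return counts

if __name__ == "__main__":
    link_counts = incoming_link_counts(DUMP_DIR)
    keep = [p for p, c in link_counts.items() if c >= 5]   # only index articles with >= 5 incoming links
    print(f"{len(keep)} articles pass the page-link restriction")
```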
Unfortunately, there are no exact document counts from Sorg and Cimiano after applying these restrictions. Nevertheless, they report document counts after additionally restricting the documents to those with "at least a language link to one of the two other languages we consider [...] we used 536.896 English, 390.027 German and 362.972 French articles for the ESA indexing" [7]. We ended up with more documents for the ESA indexing than Sorg and Cimiano: our English Wikipedia index consists of 1.517.398 documents, the German Wikipedia index of 520.433 documents and the French Wikipedia index of 431.245 documents. Considering the additional restriction Sorg and Cimiano applied to obtain their counts, we think that the discrepancy between the numbers is reasonable [1–4]. The English Wikipedia is much larger than the German or French Wikipedia, so there are many more English pages without a link to German or French. The smaller discrepancies for the French and German Wikipedia are probably due to pages unique to their cultural heritage, which are unlikely to have an English equivalent. Nevertheless, we did not verify these explanations, because the additional restriction is unrelated to replicating the experiment at hand. That being said, the vastly different preprocessing and Wikipedia source is probably a large part of the reason why we were not able to obtain results similar to Sorg and Cimiano.

ESA Vector Computation. The computation of the ESA vector uses an inverted index of the selected Wikipedia articles. Each document is queried against this index and the retrieved articles are used to build the ESA vector. Similar to Sorg and Cimiano, we used Lucene for indexing, and the association strength was implemented as a customized Lucene similarity function. The function takes a text t = ⟨w_1, ..., w_l⟩ and a Wikipedia article a_i of the Wikipedia corpus W and computes:

as_R(t, a_i) = C_t \cdot \frac{1}{\sqrt{|a_i|}} \sum_{w_j \in t} tf_{a_i}(w_j) \, idf(w_j)

with

C_t = \frac{1}{\sqrt{\sum_{w_j \in t} idf(w_j)}}, \quad tf_{a_i}(w_j) = \sqrt{\#\text{occurrences of } w_j \text{ in } a_i}, \quad idf(w_j) = 1 + \log \frac{|W| + 1}{\#\text{articles containing } w_j}

The following idf is described in the original paper by Sorg and Cimiano:

idf(w_j) = 1 + \log \frac{\#\text{articles containing } w_j}{|W| + 1}

First we need to point out an error in the idf function as defined by Sorg and Cimiano [1–4]: its result is always negative, because the fraction is necessarily less than 1 and the logarithm of a value less than 1 is negative. We concluded that this is probably an error, because variations of the idf in the literature (e.g. [6]) are defined differently and yield a positive value greater than 1. We therefore swapped the numerator with the denominator and used this variant for our experiment.
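The corrected association strength can be written down as a plain function over precomputed term statistics, as in the following sketch. In our system this logic lives inside a customized Lucene similarity, so the dictionaries and the toy numbers used here are illustrative assumptions.

```python
import math

def idf(df_w, num_articles):
    """Corrected idf: 1 + log((|W| + 1) / #articles containing w). Positive for any df_w >= 1."""
    return 1.0 + math.log((num_articles + 1) / df_w)

def association_strength(text_tokens, article_tf, article_length, df, num_articles):
    """as_R(t, a_i) = C_t * |a_i|^(-1/2) * sum_j tf_ai(w_j) * idf(w_j), with tf_ai = sqrt(raw count)."""
    idfs = [idf(df[w], num_articles) for w in text_tokens if w in df]
    if not idfs:
        return 0.0
    c_t = 1.0 / math.sqrt(sum(idfs))               # query normalisation factor C_t
    length_norm = 1.0 / math.sqrt(article_length)  # 1 / sqrt(|a_i|), article length in tokens
    score = sum(
        math.sqrt(article_tf.get(w, 0)) * idf(df[w], num_articles)
        for w in text_tokens if w in df
    )
    return c_t * length_norm * score

# Tiny made-up example: a 40-token article containing "danube" three times.
df = {"danube": 120, "river": 800}   # document frequencies in a hypothetical corpus
print(association_strength(["danube", "river"], {"danube": 3, "river": 1}, 40, df, 500000))
```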
Multi-lingual Mapping. Similar to Sorg and Cimiano, some preprocessing was needed to obtain the multi-lingual mapping. Because we used the static HTML dump instead of a database dump, the cross-language links are embedded in the HTML and point to the actual filename on the filesystem. The replicated experiment only involves English topic titles, hence we only computed the mapping from German to English and from French to English. A normalization of the page titles was not needed, because we computed the mapping using the document ids from the index, which correspond to the indices of the dimensions in the ESA-vector. Nevertheless, we needed to deal with redirect pages [1–4], so we used the following steps to compute the mapping:

1. For each document in the German (resp. French) index,
2. find the file in the filesystem and check whether a link to English is available (we selected the div element with id p-lang using JSoup);
3. if an English link is available, find the linked file in the filesystem;
4. recursively resolve redirect pages until the actual document is reached;
5. look up the document id in the English index and add it to the mapping.

Similar to Sorg and Cimiano, we summed up the scores in case multiple language links pointed to the same article in the English Wikipedia.

4.3 Language Identification

The computation of the ESA vector is based on the assumption that we know the language in which a document is written. Unfortunately this is not always the case; even in the TEL German dataset used for the replication, there are records without any information about their language. Hence Sorg and Cimiano present the following function to determine the language of a document t:

lang(t) := \max_{L_k \in \{L_1, \ldots, L_n\}} \frac{minDim(\Phi_k(t))}{maxDim(\Phi_k(t))}

where "minDim(Φ_k(t)) returns the value of the lowest dimension in vector Φ_k and maxDim(Φ_k(t)) returns the highest correspondingly." [7] To us this description of the lowest and highest dimension of a vector does not make sense. We considered multiple possible interpretations, e.g. the index of the dimension with the lowest and highest value of the vector, or the actual minimal and maximal values of the vector; however, none of these made sense either. Fortunately, Sorg and Cimiano give an intuition for their heuristic: "The intuition behind this heuristic is that a small difference between the values of the lowest and highest dimension, which is computed by the share of these values, means that the document matches good to many Wikipedia articles and it can therefore be assumed that the document is of the same language as the used Wikipedia articles. Comparing a document to Wikipedia articles in another language, there will be some matches but the value of lowest dimension will most probably be very small." Following this intuition led us to interpret the heuristic as follows:

lang(t) := \max_{L_k \in \{L_1, \ldots, L_n\}} \frac{\#\text{non-zero elements of } \Phi_k(t)}{|W_k|}

where |W_k| is the number of articles used from the Wikipedia in language L_k [1–4]. This gives the percentage of Wikipedia articles that a document t is matched to; the document is then assumed to be written in the language of the Wikipedia with the highest percentual article match. We implemented our interpretation of the language identification; however, when we tried to identify the language of records without a language tag, the run would have taken too long and we would not have been able to submit our results in time. Therefore we chose to match documents without a language tag against the German Wikipedia by default.
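A minimal sketch of our interpretation of the heuristic, assuming sparse ESA-vectors are available as dictionaries (as in the earlier sketches) and that the number of indexed articles per Wikipedia is known; the names and figures in the usage comment are placeholders.

```python
def identify_language(text, esa_by_language, wiki_sizes):
    """Pick the language whose Wikipedia matches the largest fraction of its articles.

    esa_by_language: language -> function mapping a text to a sparse ESA-vector (dict).
    wiki_sizes:      language -> number of articles |W_k| used for that Wikipedia.
    """
    best_lang, best_share = None, -1.0
    for lang, esa in esa_by_language.items():
        vec = esa(text)
        share = len([v for v in vec.values() if v > 0]) / wiki_sizes[lang]  # non-zero dims / |W_k|
        if share > best_share:
            best_lang, best_share = lang, share
    return best_lang

# Hypothetical usage, with one esa_vector function per language standing in for the indexes:
# lang = identify_language("die Donau fließt durch Wien",
#                          {"de": esa_de, "en": esa_en},
#                          {"de": 520433, "en": 1517398})
```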
4.4 Retrieval

The retrieval algorithm we used, presented in Algorithm 1, is essentially the same as the one presented by Sorg and Cimiano in their paper. The only change is that our language identification relies solely on the language tags in the records to identify the correct language of a document.

Algorithm 1: Retrieval Algorithm
  Input: Topics T, Language k, Documents D
  for t ∈ T do
      t := Φ_k(t)
  end
  for d ∈ D do
      l := lang(d)
      d := Ψ_{l→k}(Φ_l(d))
      for t ∈ T do
          score[t, d] := cos(t, d)
      end
  end

5 Results

In this section we present additional information about the dataset and its language distribution, further settings of the experiment, and the results.

5.1 Dataset

The TEL German dataset contains 869353 records in over 100 different languages. The data can be split into records with a language tag, which make up about 90%, and records without a language tag, which make up the rest. German, English and French are the main languages of the records with a language tag and together account for about 88%. In our experiment we only use the title information to build queries for the index; according to Sorg and Cimiano, "The title of the record is the only content information that is available for all records". We cannot confirm this statement, because we found that about 4,5% of the records in the dataset do not contain title information [1–4]. In our experiment we simply ignore the records that do not contain this information.

5.2 CLEF Replicability Experiment

The objective of this experiment is to query the 50 given topics in English against the multi-lingual TEL German dataset [1–4]. The topics consist of a title and a short description from which a query can be built. Sorg and Cimiano do not mention what they used to build the query; we chose to use only the title. As for the ESA-vector length k, we tried two different settings. Sorg and Cimiano used k = 10.000 for the topics and k = 1.000 for the records. We ran one experiment with the same settings and, additionally, an experiment with values an order of magnitude smaller, namely k = 1.000 for the topics and k = 100 for the records.

The result obtained by Sorg and Cimiano was a mean average precision (MAP) of 6,7% [7]. Unfortunately, they do not explicitly state when a record counts as a relevant document for a certain topic. We therefore assumed that every score greater than zero indicates a relevant document, meaning that a single overlapping dimension in the ESA-vectors of a record and a topic is enough to make the record relevant. This assumption led to the problem that the full list of relevant documents obtained from the experiment with the larger ESA-vector lengths matched more than two thirds of all records in the dataset to nearly every topic. Looking at these lists we observed that the scores of the higher-ranking documents decrease quickly, while the scores after rank 1000 decrease much more slowly and amount to only a small fraction of the scores of the higher-ranking documents. We concluded that it would be meaningful to cut off the results at a certain rank and compute the MAP of the reduced lists, because relevant documents with a very low score might have been connected merely by a single Wikipedia article, which does not necessarily convey a semantic connection between a record and a topic. We used the top 10, top 100 and top 1.000 results for each topic and calculated the MAP using trec_eval (https://github.com/usnistgov/trec_eval). The results are shown in Table 1. Comparing our results with the 6,7% obtained by Sorg and Cimiano, we conclude that we were not able to reproduce the result.

Table 1. Mean average precision of the experiments (values in %)

  ESA length (Topic/Record)   Top 10   Top 100   Top 1.000
  1.000/100                   0,65*    0,1       0,01
  10.000/1.000                0,63*    0,09*     0,01

  * submitted as run to the replicability track
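For completeness, the following sketch shows how the cut-off result lists could be written in the standard TREC run format and scored with trec_eval. The topic id, file names and run tag are placeholders, not the actual identifiers used in our runs.

```python
import subprocess

def write_run(scores, path, tag="clesa", cutoff=1000):
    """Write a TREC-format run file ('topic Q0 docid rank score tag'), keeping only the top-ranked records."""
    with open(path, "w") as out:
        for topic_id, ranking in scores.items():
            ranked = sorted(ranking.items(), key=lambda kv: kv[1], reverse=True)[:cutoff]
            for rank, (doc_id, score) in enumerate(ranked, start=1):
                out.write(f"{topic_id} Q0 {doc_id} {rank} {score:.6f} {tag}\n")

# scores: topic id -> {record id: cosine score}, produced by the retrieval algorithm above.
scores = {"451-AH": {"rec-001": 0.42, "rec-002": 0.17}}
write_run(scores, "run_top1000.txt", cutoff=1000)

# MAP via trec_eval (assuming a qrels file for the CLEF 2008 TEL task is available locally):
# subprocess.run(["trec_eval", "-m", "map", "qrels_tel_de.txt", "run_top1000.txt"])
```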
6 Conclusion

In this paper we described the CL-ESA approach presented by Sorg and Cimiano and attempted to replicate an experiment they submitted to the CLEF conference in 2008. In the end we were not able to reproduce the result. Most parts of the experimental setup were replicated accurately, but the index in particular may differ considerably from the index of the original experiment, because we were not able to obtain a Wikipedia database dump from 2008 and therefore worked with a static HTML dump. Since the index is at the core of the experiment, this can lead to subsequent differences in every other part. Beyond that, some missing details, e.g. the way the query is built from the topics and a precise description of which fields of the dataset records were used, make it hard to replicate the experiment more faithfully.

Moreover, we ran into some problems caused by our own assumptions. We are referring to the fact that the full result list of the experiment with the larger ESA-vector lengths matched two thirds of all records in the dataset to almost every topic. We think the cause of this problem is that, while some Wikipedia articles describe a semantic category very accurately, there are certainly articles that have the opposite effect. As a short example, the article about a famous actress will contain many words with different semantic meanings (e.g. an overview of her career and life), while she may also have acted in some horror movies. This article would then connect any record of the dataset that obtains a positive score through words not semantically related to horror movies to the topic horror movies, even though there is probably no semantic connection whatsoever. We therefore think that, for Wikipedia-based CL-ESA to yield better results, it is essential to have a good article selection or to introduce certain restrictions on what counts as a relevant document for a certain topic, e.g. requiring at least 10 overlapping dimensions in the ESA-vectors.

References

1. Bellot, P., Trabelsi, C., Mothe, J., Murtagh, F., Nie, J.Y., Soulier, L., SanJuan, E., Cappellato, L., Ferro, N. (eds.): Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Ninth International Conference of the CLEF Association (CLEF 2018). Lecture Notes in Computer Science (LNCS), Springer, Heidelberg, Germany (2018)
2. Cappellato, L., Ferro, N., Nie, J.Y., Soulier, L. (eds.): CLEF 2018 Working Notes. CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073 (2018)
3. Ferro, N., Maistro, M., Sakai, T., Soboroff, I.: CENTRE@CLEF2018: Overview of the Replicability Task. In: Cappellato et al. [2]
4. Ferro, N., Maistro, M., Sakai, T., Soboroff, I.: Overview of CENTRE@CLEF 2018: A First Tale in the Systematic Reproducibility Realm. In: Bellot et al. [1]
5. Gabrilovich, E., Markovitch, S.: Computing Semantic Relatedness Using Wikipedia-based Explicit Semantic Analysis. In: Proceedings of the Twentieth International Joint Conference on Artificial Intelligence, pp. 1606–1611. Hyderabad, India (2007), http://www.cs.technion.ac.il/~shaulm/papers/pdf/Gabrilovich-Markovitch-ijcai2007.pdf
6. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA (2008)
7. Sorg, P., Cimiano, P.: Cross-lingual Information Retrieval with Explicit Semantic Analysis. In: Working Notes for the CLEF 2008 Workshop (2008)