<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Replicating an Experiment in Cross-lingual Information Retrieval with Explicit Semantic Analysis</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Information Systems Engineering</institution>
          ,
          <addr-line>TU Wien, Favoritenstraße 9-11/194, 1040 Vienna</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We participated in the Replicability Track of CENTRE@CLEF 2018 [1-4]. This paper reintroduces Explicit Semantic Analysis (ESA) and its extension for cross-lingual document retrieval, Cross-lingual Explicit Semantic Analysis (CL-ESA), first introduced by Sorg and Cimiano in 2008. The goal is to replicate an experiment by Sorg and Cimiano, who participated in CLEF 2008, to report on the results, and to point out mistakes and problems encountered along the way. This work should be read in conjunction with the original work by Sorg and Cimiano [7].</p>
      </abstract>
      <kwd-group>
        <kwd>replicability</kwd>
        <kwd>Explicit Semantic Analysis</kwd>
        <kwd>cross-lingual</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Explicit Semantic Analysis (ESA) uses chosen external categories to represent
a given text t. We introduce the main ideas of the Wikipedia-based
approach so that the reader can follow the notions used in the implementation
section. A more detailed description can be found in the paper by Sorg
and Cimiano [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] or in the paper by Markovitch and Gabrilovich [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], who introduced the
Wikipedia-based approach and the more general theory behind it. The main
idea of Wikipedia-based Explicit Semantic Analysis is to map a text t into
a high-dimensional real-valued vector space. Given a set of Wikipedia articles
<inline-formula><tex-math>W_k = \{a_1, \ldots, a_n\}</tex-math></inline-formula> in language <inline-formula><tex-math>L_k</tex-math></inline-formula>, each article <inline-formula><tex-math>a_i</tex-math></inline-formula> is an external category and
corresponds to one dimension of the vector space. The following function describes
this mapping:
      </p>
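      <p>
        <disp-formula><tex-math>\Phi_k : T \to \mathbb{R}^{|W_k|}</tex-math></disp-formula>
        <disp-formula><tex-math>\Phi_k(t) := \langle v_1, \ldots, v_{|W_k|} \rangle</tex-math></disp-formula>
        <disp-formula><tex-math>v_i := \sum_{w_j \in t} \mathit{as}(w_j, a_i)</tex-math></disp-formula>
      </p>
      <p>where <inline-formula><tex-math>|W_k|</tex-math></inline-formula> is the number of articles in <inline-formula><tex-math>W_k</tex-math></inline-formula>. Each <inline-formula><tex-math>v_i</tex-math></inline-formula> is computed by summing
the results of a function <inline-formula><tex-math>\mathit{as}</tex-math></inline-formula>, which defines the strength of association between a
Wikipedia article <inline-formula><tex-math>a_i</tex-math></inline-formula> and a word <inline-formula><tex-math>w_j</tex-math></inline-formula>, over every word of a given text <inline-formula><tex-math>t = \langle w_1, \ldots, w_l \rangle</tex-math></inline-formula>.
<inline-formula><tex-math>\Phi_k(t)</tex-math></inline-formula> is called the ESA-vector; it expresses the strength of association
of a given text t with each article <inline-formula><tex-math>a_i</tex-math></inline-formula> in <inline-formula><tex-math>W_k</tex-math></inline-formula>.</p>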
      <p>There are many different ways to define the <inline-formula><tex-math>\mathit{as}</tex-math></inline-formula> function; here we use a
tf.idf function based on a bag-of-words model of the Wikipedia articles. In
this sense, Explicit Semantic Analysis is very flexible, as it can be adapted to
different tasks and contexts simply by choosing a different <inline-formula><tex-math>\mathit{as}</tex-math></inline-formula> function.</p>
      <p>
        <disp-formula><tex-math>\mathit{as}(w_j, a_i) = \mathrm{tf.idf}_{a_i}(w_j)</tex-math></disp-formula>
      </p>
      <p>The function we used for the experiment is described in the Implementation section.
Computing the ESA-vector means computing the strength of association for each
Wikipedia article <inline-formula><tex-math>a_i</tex-math></inline-formula>; hence, after sorting the ESA-vector by value, it corresponds
to a ranking of Wikipedia articles by relevance to a given text t.
Essentially, Explicit Semantic Analysis transforms a given text t into a vector
representation over external categories. One can therefore
assess the similarity of two arbitrary texts <inline-formula><tex-math>t_1</tex-math></inline-formula> and <inline-formula><tex-math>t_2</tex-math></inline-formula> by computing their
ESA-vectors and comparing them with, for example, the standard cosine similarity.
This is another reason why Explicit Semantic Analysis is very
flexible: it works on arbitrary texts and can be adapted to different
tasks, among them retrieval (query and document) and clustering
(two documents).</p>
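      <p>The mapping and the cosine comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the experiment: the three toy articles, the plain logarithmic idf and all function names are assumptions made for the example.</p>
      <preformat>
```python
import math
from collections import Counter

# Toy "Wikipedia": article title -> bag of words (hypothetical data).
ARTICLES = {
    "Cat": "cat animal pet cat fur".split(),
    "Dog": "dog animal pet bark".split(),
    "Car": "car engine road wheel".split(),
}


def idf(word):
    """Plain log-based idf over the toy corpus (an illustrative choice)."""
    df = sum(1 for words in ARTICLES.values() if word in words)
    return math.log(len(ARTICLES) / df) if df else 0.0


def association(word, title):
    """as(w_j, a_i): tf.idf of one word in one article."""
    return Counter(ARTICLES[title])[word] * idf(word)


def esa_vector(text):
    """Phi_k(t): one entry per article, summing as(w_j, a_i) over the words of t."""
    words = text.lower().split()
    return [sum(association(w, title) for w in words) for title in ARTICLES]


def cosine(u, v):
    """Standard cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```
      </preformat>
      <p>With this toy corpus, cosine(esa_vector("cat pet"), esa_vector("dog pet")) is positive because both texts are associated with the articles Cat and Dog, whereas a text about cats and a text about engines share no dimension and score zero.</p>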
    </sec>
    <sec id="sec-2">
      <title>Cross Lingual Explicit Semantic Analysis</title>
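      <p>Cross-lingual Explicit Semantic Analysis (CL-ESA) is an extension of ESA
that can handle multi-lingual retrieval tasks. The approach exploits the fact that
Wikipedia articles are linked across different languages. One can therefore
assume that there exists a mapping function <inline-formula><tex-math>m_{i \to j}</tex-math></inline-formula> that maps an article of
Wikipedia <inline-formula><tex-math>W_i</tex-math></inline-formula> to its corresponding article in Wikipedia <inline-formula><tex-math>W_j</tex-math></inline-formula>. Suppose there are n
languages <inline-formula><tex-math>L_1, \ldots, L_n</tex-math></inline-formula>. Transforming a given text t from language <inline-formula><tex-math>L_i</tex-math></inline-formula> to language
<inline-formula><tex-math>L_j</tex-math></inline-formula> then amounts to transforming <inline-formula><tex-math>\Phi_i(t)</tex-math></inline-formula> into <inline-formula><tex-math>\Phi_j(t)</tex-math></inline-formula> using a map defined over
the cross-language links in Wikipedia <inline-formula><tex-math>W_i</tex-math></inline-formula>. Since we consider n languages, we
define <inline-formula><tex-math>n^2</tex-math></inline-formula> mapping functions of the type:</p>
      <p>
        <disp-formula><tex-math>\Psi_{i \to j} : \mathbb{R}^{|W_i|} \to \mathbb{R}^{|W_j|}</tex-math></disp-formula>
      </p>
      <sec id="sec-2-1">
        <p>This mapping is computed as follows:</p>
        <p>
          <disp-formula><tex-math>\Psi_{i \to j} \langle v_1, \ldots, v_{|W_i|} \rangle = \langle v'_1, \ldots, v'_{|W_j|} \rangle</tex-math></disp-formula>
        </p>
        <p>where</p>
        <p>
          <disp-formula><tex-math>v'_q = \sum_{p \,:\, m_{i \to j}(a_p) = a_q} v_p</tex-math></disp-formula>
        </p>
        <p>with <inline-formula><tex-math>1 \le p \le |W_i|</tex-math></inline-formula> and <inline-formula><tex-math>1 \le q \le |W_j|</tex-math></inline-formula>. Given a text t in language <inline-formula><tex-math>L_i</tex-math></inline-formula>, obtaining an
ESA-vector for Wikipedia <inline-formula><tex-math>W_j</tex-math></inline-formula> amounts to computing <inline-formula><tex-math>\Psi_{i \to j}(\Phi_i(t))</tex-math></inline-formula>. With
this setting we can define the cosine between a query <inline-formula><tex-math>q_i</tex-math></inline-formula> in language
<inline-formula><tex-math>L_i</tex-math></inline-formula> and a document <inline-formula><tex-math>d_j</tex-math></inline-formula> in language <inline-formula><tex-math>L_j</tex-math></inline-formula> in a straightforward manner:</p>
        <p>
          <disp-formula><tex-math>\cos(q_i, d_j) := \cos(\Phi_i(q_i), \Psi_{j \to i}(\Phi_j(d_j)))</tex-math></disp-formula>
        </p>
        <p>This gives a uniform approach across multiple languages.
Nevertheless, it is important to note that CL-ESA works under the assumption that the
language of the document is known.</p>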
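        <p>As a rough sketch of the cross-lingual mapping and the resulting cosine, the following Python fragment moves the scores of a German ESA-vector onto English dimensions. The link table, the vectors and the function names are hypothetical; the fragment assumes that scores of several source articles linked to the same target article are added up.</p>
        <preformat>
```python
import math


def map_vector(vector, links, target_size):
    """Psi_{i->j}: move each score v_p to its linked dimension q, summing collisions.

    links maps a source dimension p to a target dimension q; source articles
    without a cross-language link are dropped.
    """
    mapped = [0.0] * target_size
    for p, score in enumerate(vector):
        q = links.get(p)
        if q is not None:
            mapped[q] += score
    return mapped


def cosine(u, v):
    """Standard cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical German-to-English links: German articles 0 and 1 both point to
# English article 2, German article 3 points to English article 0.
de_to_en = {0: 2, 1: 2, 3: 0}

doc_de = [0.5, 0.25, 0.9, 1.0]   # ESA-vector of a German document
query_en = [1.0, 0.0, 1.0]       # ESA-vector of an English query

doc_en = map_vector(doc_de, de_to_en, target_size=3)
score = cosine(query_en, doc_en)
```
        </preformat>
        <p>Here doc_en becomes [1.0, 0.0, 0.75]: the German dimension without a language link is lost, which is one way in which the choice of articles influences the final score.</p>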
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Implementation</title>
      <p>In this section we describe the implementation details of the
experiment. Unfortunately the overall setup differs from the original experiment,
because we were not able to obtain database dumps from 2008. Instead we
downloaded static HTML dumps<sup>1</sup> in English, German and French from the year 2008
and extracted them to disk [1-4].</p>
      <sec id="sec-3-1">
        <title>Preprocessing of the documents</title>
        <p>For the actual indexing step we used the same methods as Sorg and Cimiano,
namely a standard whitespace tokenizer, standard stop word lists and a
Snowball<sup>2</sup> stemmer for English, German and French, respectively.</p>
      </sec>
      <sec id="sec-3-2">
        <title>ESA Implementation</title>
        <p>In this section we give a detailed description of our implementation and of
the preprocessing of the documents and Wikipedia articles. Moreover, we compare
our implementation with the implementation described in the original paper by
Sorg and Cimiano.</p>
        <p><sup>1</sup> https://dumps.wikimedia.org/other/static_html_dumps/2008-06/</p>
        <p><sup>2</sup> Snowball stemmers are included in Lucene.</p>
        <p><bold>Wikipedia Article Preprocessing.</bold> After the extraction we quickly realized
that the static HTML dump contains many more pages than just the article pages
we wanted to index. Hence our first step was to write a Python script
that marks Wikipedia-specific pages to be ignored. Fortunately these pages were easy to
locate, as their purpose is encoded in the filename of the page: pages
whose names start with Category, Image, Portal, Help, Template, User or Wikipedia<sup>3</sup>
were ignored, as were the corresponding discussion pages<sup>4</sup>. The
Python script simply changed the file extension from .html to .ign, which stands
for ignore. Moreover, every page with a file size of less than 1 KB was ignored,
because those files were redirect pages. The HTML markup of an actual
article already exceeds this limit, so no actual article could be ignored
erroneously. The indexer then indexed only pages whose file
extension is not .ign.</p>
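        <p>A filter of this kind might look as follows. The sketch is not the script used in the experiment: the function names are invented, only the English prefixes are listed, and the plain prefix test is an assumption, since the exact separator between namespace and title in the dump filenames is not stated here.</p>
        <preformat>
```python
from pathlib import Path

# English namespace prefixes named in the text; the German and French dumps use
# localized prefixes (e.g. "Benutzer" for "User"), which would have to be added.
IGNORED_PREFIXES = ("Category", "Image", "Portal", "Help", "Template", "User", "Wikipedia")
MIN_SIZE_BYTES = 1024  # smaller files are treated as redirect pages


def should_ignore(path):
    """True for namespace pages, their talk pages, and tiny redirect files."""
    # Assumption: a simple prefix test. It also catches talk pages such as
    # "User_talk...", but it would hit article titles starting with these words.
    if path.name.startswith(IGNORED_PREFIXES):
        return True
    return MIN_SIZE_BYTES > path.stat().st_size


def mark_ignored(dump_root):
    """Rename every ignored .html file to .ign and return how many were renamed."""
    renamed = 0
    # sorted() materializes the listing before any file is renamed.
    for path in sorted(Path(dump_root).rglob("*.html")):
        if should_ignore(path):
            path.rename(path.with_suffix(".ign"))
            renamed += 1
    return renamed
```
        </preformat>
        <p>Renaming instead of deleting keeps the dump intact, so the selection can be revised later without extracting the archive again.</p>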
        <p>The processing of the Wikipedia articles differs substantially from Sorg and Cimiano's approach,
because we used a different source. First we needed to extract the relevant text from
the HTML markup, using a library called JSoup<sup>5</sup>, which allowed us
to query the HTML markup with CSS-style queries. With this approach we
selected the div element with id equal to content. Then we removed the table
of contents, which is a table tag with id equal to toc. We also removed any
category links, which are in a div element with id equal to catlinks, and
all edit sections, which were span tags with class equal to editsection. After
these steps we selected the text between all remaining tags and used
it as the document for the index. We randomly sampled about 30 articles and
inspected the result of this process to convince ourselves that no
unnecessary text remained that might skew the index [1-4]. There were
cases where this approach threw exceptions, each of which we wrote to a
log file. However, there were fewer than 200 such cases; compared with a
corpus of more than half a million documents, we
decided that it was not worth investigating those articles further at this stage.</p>
        <p>For indexing the documents with Lucene<sup>6</sup> it was not necessary to use the
WikipediaAnalyzer, because our text was plain text without Wiki markup.
Therefore we used the same methods as in the preprocessing of the documents.
Sorg and Cimiano mention two restrictions on the article selection:
"Then all articles with less than 100 words or less than 5 incoming pagelinks
were discarded." We implemented both, adapted to the static HTML dump. The
length is checked before a document is added to the index by counting the tokens
generated by Lucene. The incoming page links were more complex to obtain.
We generated a page link map by parsing all Wikipedia articles and collecting
the a href tags in the div element with id equal to content. The
link<sup>7</sup> in each of these tags is a path to the corresponding article in the filesystem;
we used these paths as keys of the page link map and counted their
occurrences. Then, instead of parsing the whole Wikipedia directory with the
indexer, we looped over all keys in the map and parsed only the articles
with 5 or more page links.</p>
        <p><sup>3</sup> The prefixes were in the same language as the corresponding Wikipedia, e.g. Benutzer
for User in the German Wikipedia.</p>
        <p><sup>4</sup> For each prefix there was a discussion page encoded as &lt;prefix&gt; talk.</p>
        <p><sup>5</sup> https://jsoup.org/</p>
        <p><sup>6</sup> https://lucene.apache.org/core/</p>
        <p>Unfortunately, Sorg and Cimiano give no exact document counts
after applying these restrictions. They do, however,
report document counts after additionally restricting the documents to "at least a language
link to one of the two other languages we consider [. . . ] we used 536.896 English,
390.027 German and 362.972 French articles for the ESA indexing" [<xref ref-type="bibr" rid="ref7">7</xref>]. We ended
up with more documents for the ESA indexing than Sorg and Cimiano. The
English Wikipedia index consists of 1,517,398 documents, the German Wikipedia
index of 520,433 documents and the French Wikipedia index
of 431,245 documents. Considering the additional restriction Sorg and Cimiano
applied to obtain their counts, we think that the discrepancy between the
numbers is reasonable [1-4]. The English Wikipedia is much larger than the German
or French Wikipedia, so it contains many more pages that do not have a
link to German or French. The smaller discrepancy for the French and German
Wikipedia is probably due to pages specific to their cultural heritage, which
are unlikely to have an English equivalent. We did not verify any
of these explanations, because the additional restriction is unrelated to
replicating the experiment at hand. That being said, the substantially different
preprocessing and Wikipedia source are probably a major reason why we were not
able to obtain results similar to those of Sorg and Cimiano.</p>
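        <p>The construction of the page link map can be illustrated with Python's standard HTML parser. The sketch below is an approximation under stated assumptions: pages are passed in as strings rather than read from the dump, link targets are used as keys exactly as they appear in the href attribute, and all names are hypothetical.</p>
        <preformat>
```python
from collections import Counter
from html.parser import HTMLParser


class ContentLinkParser(HTMLParser):
    """Collects href targets of a-tags nested inside the div with id "content"."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # nesting depth of div tags, counted from the content div
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.depth > 0 or attrs.get("id") == "content":
                self.depth += 1
        elif tag == "a" and self.depth > 0 and attrs.get("href"):
            # Drop the anchor part so that page.html#section counts for page.html.
            self.links.append(attrs["href"].split("#", 1)[0])

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1


def count_incoming_links(pages):
    """pages maps a file path to its HTML; returns link target -> occurrence count."""
    counts = Counter()
    for html in pages.values():
        parser = ContentLinkParser()
        parser.feed(html)
        counts.update(link for link in parser.links if link)
    return counts


def select_articles(pages, min_links=5):
    """Keys of the page link map that have at least min_links incoming links."""
    counts = count_incoming_links(pages)
    return [path for path, n in counts.items() if n >= min_links]
```
        </preformat>
        <p>Looping over the keys of the map, as select_articles does, means that pages which are never linked to are not visited at all.</p>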
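        <p><bold>ESA Vector Computation.</bold> The computation of the ESA vector uses an
inverted index of the selected Wikipedia articles. Each document is queried
against this index, and the retrieved articles are used to build the ESA vector.
Similar to Sorg and Cimiano, we used Lucene for indexing and implemented the association
strength as a customized Lucene similarity function. The
function takes a text <inline-formula><tex-math>t = \langle w_1, \ldots, w_l \rangle</tex-math></inline-formula> and a Wikipedia article <inline-formula><tex-math>a_i</tex-math></inline-formula> of the Wikipedia
corpus <inline-formula><tex-math>W</tex-math></inline-formula> and computes the following function:</p>
        <p>
          <disp-formula><tex-math>\mathit{as}_R(t, a_i) = C_t \, \frac{1}{\sqrt{|a_i|}} \sum_{w_j \in t} \mathit{tf}_{a_i}(w_j) \, \mathit{idf}(w_j)</tex-math></disp-formula>
        </p>
        <p>with</p>
        <p>
          <disp-formula><tex-math>C_t = \frac{1}{\sqrt{\sum_{w_j \in t} \mathit{idf}(w_j)}}</tex-math></disp-formula>
          <disp-formula><tex-math>\mathit{tf}_{a_i}(w_j) = \sqrt{\#\text{occurrences of } w_j \text{ in } a_i}</tex-math></disp-formula>
          <disp-formula><tex-math>\mathit{idf}(w_j) = 1 + \log \frac{|W| + 1}{\#\text{articles containing } w_j}</tex-math></disp-formula>
        </p>
        <p><sup>7</sup> Anchor points were removed.</p>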
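        <p>For illustration, the association strength can be written out in plain Python, independently of Lucene. The sketch assumes that the normalization factor of the text is the inverse square root of the summed idf values, that the article length is its number of tokens, and that words unknown to the index are skipped; the function and parameter names are hypothetical.</p>
        <preformat>
```python
import math
from collections import Counter


def idf(word, doc_freq, num_articles):
    """idf(w_j) = 1 + log((|W| + 1) / #articles containing w_j)."""
    return 1.0 + math.log((num_articles + 1) / doc_freq[word])


def association_strength(text_words, article_words, doc_freq, num_articles):
    """as_R(t, a_i) = C_t * 1/sqrt(|a_i|) * sum_j tf_{a_i}(w_j) * idf(w_j)."""
    # Assumption: words that occur in no article contribute nothing.
    known = [w for w in text_words if doc_freq.get(w)]
    if not known or not article_words:
        return 0.0
    c_t = 1.0 / math.sqrt(sum(idf(w, doc_freq, num_articles) for w in known))
    length_norm = 1.0 / math.sqrt(len(article_words))
    counts = Counter(article_words)
    total = sum(math.sqrt(counts[w]) * idf(w, doc_freq, num_articles) for w in known)
    return c_t * length_norm * total
```
        </preformat>
        <p>Because the term frequency enters as a square root and is divided by the square root of the article length, long articles do not dominate the ranking merely by repeating a word.</p>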
        <p>The following idf is described in the original paper by Sorg and Cimiano:</p>
        <p>
          <disp-formula><tex-math>\mathit{idf}(w_j) = 1 + \log \frac{\#\text{articles containing } w_j}{|W| + 1}</tex-math></disp-formula>
        </p>
        <p>First we need to point out an error in the idf function as
defined by Sorg and Cimiano [1-4]. With this definition the logarithm is always
negative, because the fraction is less than 1 and the logarithm of
a value less than 1 is negative. The idf would thus be smaller than 1 and would
become negative for any term that occurs in only a small share of the articles.
We concluded that this is probably
an error, because in the literature (e.g. [<xref ref-type="bibr" rid="ref6">6</xref>]) variations of the idf are defined
differently and yield a positive value greater than 1. Therefore we swapped
the numerator and the denominator and used this variant in our experiment.</p>
        <p><bold>Multi-lingual Mapping.</bold> Similar to Sorg and Cimiano, some preprocessing was
needed to obtain the multi-lingual mapping. Because we used the static HTML
dump instead of a database dump, the cross-language links were embedded in
the HTML and pointed to the actual filename in the filesystem. The replicated
experiment only involved English topic titles, hence we only computed the
mapping from German to English and from French to English. A normalization of
the page title was not needed, because we computed the mapping using the
document ids from the index, which correspond to the index of the dimension in the
ESA-vector. Nevertheless, we needed to deal with redirect pages [1-4]. We therefore
used the following steps to compute the mapping:</p>
        <sec id="sec-3-2-1">
          <list list-type="order">
            <list-item><p>For each document in the German (resp. French) index:</p></list-item>
            <list-item><p>Find the file in the filesystem and check<sup>8</sup> whether a link to English is available.</p></list-item>
            <list-item><p>If an English link is available, find the linked file in the filesystem.</p></list-item>
            <list-item><p>Recursively determine whether the file is a redirect page until the actual document is reached.</p></list-item>
            <list-item><p>Look up the document id in the English index and add it to the mapping.</p></list-item>
          </list>
          <p>Similar to Sorg and Cimiano, we summed up the scores in case multiple
language links pointed to the same article in the English Wikipedia.</p>
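          <p>The steps above can be sketched as follows. The dictionaries stand in for the index and the filesystem and are purely hypothetical, and redirects are followed in a loop with a hop limit instead of recursively, which is a choice made for this example to guard against redirect cycles.</p>
          <preformat>
```python
def resolve_redirect(path, redirects, max_hops=10):
    """Follow redirect pages until an actual article file is reached."""
    hops = 0
    while path in redirects:
        if hops >= max_hops:
            return None  # give up on redirect loops
        path = redirects[path]
        hops += 1
    return path


def build_mapping(source_paths, english_links, redirects, english_ids):
    """Map source document ids to English document ids (steps 1 to 5).

    source_paths:  source doc id -> file path of the article
    english_links: file path -> path of the linked English page, if any
    redirects:     English redirect page -> its target
    english_ids:   English file path -> doc id in the English index
    """
    mapping = {}
    for doc_id, path in source_paths.items():
        target = english_links.get(path)
        if target is None:
            continue
        target = resolve_redirect(target, redirects)
        if target in english_ids:
            mapping[doc_id] = english_ids[target]
    return mapping


def map_scores(vector, mapping):
    """Sum the scores of source dimensions that share one English article."""
    mapped = {}
    for doc_id, score in vector.items():
        if doc_id in mapping:
            en_id = mapping[doc_id]
            mapped[en_id] = mapped.get(en_id, 0.0) + score
    return mapped
```
          </preformat>
          <p>Working on document ids rather than page titles is what makes a title normalization unnecessary: the ids are the dimension indices of the ESA-vector on both sides.</p>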
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>Language Identification</title>
        <p>The computation of the ESA vector is based on the assumption that we know
the language in which the document is written. Unfortunately this is not always
the case. Even in the TEL German dataset used for the replication there are
records without any language information. Hence Sorg and Cimiano
presented a function to determine the language of a document t,
in which "minDim(<inline-formula><tex-math>\Phi_k(t)</tex-math></inline-formula>) returns the value of the lowest dimension in vector <inline-formula><tex-math>\Phi_k</tex-math></inline-formula>
and maxDim(<inline-formula><tex-math>\Phi_k(t)</tex-math></inline-formula>) returns the highest correspondingly." [<xref ref-type="bibr" rid="ref7">7</xref>] To us this
description of the lowest and highest dimension of a vector does not make sense. We
considered several interpretations, e.g. the indices of the dimensions with
the lowest and highest values of the vector, or the actual minimal and maximal
values of the vector. However, none of these made sense. Fortunately, Sorg and
Cimiano give an intuition for their heuristic: "The intuition behind this
heuristic is that a small difference between the values of the lowest and highest
dimension, which is computed by the share of these values, means that the
document matches good to many Wikipedia articles and it can therefore be assumed
that the document is of the same language as the used Wikipedia articles.
Comparing a document to Wikipedia articles in another language, there will be some
matches but the value of lowest dimension will most probably be very small."</p>
        <p><sup>8</sup> We selected the div element with id equal to p-lang with JSoup.</p>
        <p>Following this intuition led us to interpret the heuristic as the share of the
<inline-formula><tex-math>|W_k|</tex-math></inline-formula> articles used from Wikipedia in language k that a document t is matched to [1-4].
This gives the percentage of Wikipedia articles matched by a document t.
The language of the document is then taken to be the language
of the Wikipedia base with the highest percentage of matched articles. We
implemented our interpretation of the language identification; however, when trying
to identify the language of records without a language tag, the run would have
taken too long and we would not have been able to submit our results in time.
Therefore we chose to match documents without a language tag against the
German Wikipedia by default.</p>
        <p>The retrieval algorithm we used, presented in Algorithm 1, is generally the same
as the one Sorg and Cimiano presented in their paper. The only change is that our
language identification relies solely on the language tags in the records to identify
the language of a document.</p>
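        <p>Our reading of the language identification heuristic, the share of matched Wikipedia articles per language, could be sketched as below. The function names, the language codes and the fallback to German when nothing matches are assumptions of this example and are not taken from the original paper.</p>
        <preformat>
```python
def match_share(esa_vector):
    """Share of Wikipedia articles (dimensions) with a positive association score."""
    if not esa_vector:
        return 0.0
    matched = sum(1 for value in esa_vector if value > 0)
    return matched / len(esa_vector)


def identify_language(vectors_by_language, default="de"):
    """Pick the language whose Wikipedia matches the largest share of its articles.

    vectors_by_language maps a language code to the ESA-vector of the same
    document computed against that language's Wikipedia.
    """
    shares = {lang: match_share(vec) for lang, vec in vectors_by_language.items()}
    best = max(shares, key=shares.get, default=default)
    return best if shares.get(best, 0.0) > 0 else default
```
        </preformat>
        <p>The cost is visible in the signature: every untagged record needs one ESA-vector per candidate language, which is why this step was too slow for the submitted run.</p>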
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <sec id="sec-4-1">
        <title>Dataset</title>
        <p>In this section we present additional information about the dataset, its language
distribution, additional settings of the experiment and the results.
The TEL German dataset has 869,353 records in total, in over 100 different
languages. The data can be split into records with a language tag, about
90%, and records without a language tag, the rest. German, English and French
are the main languages of the records with a language tag and make up about
88%. In our experiment we use only the title information to build queries for the
index, and according to Sorg and Cimiano "The title of the record is the only
content information that is available for all records". We cannot confirm this
statement, because we found that about 4.5% of the records in the dataset
do not contain title information [1-4]. In our experiment we simply ignore the
records that do not contain this information.</p>
        <preformat>
Input: Topics T, Language k, Documents D
for t ∈ T do
    vec(t) = Φ_k(t);
end
for d ∈ D do
    l := lang(d);
    vec(d) = Ψ_{l→k}(Φ_l(d));
    for t ∈ T do
        score[t, d] = cos(vec(t), vec(d));
    end
end
        </preformat>
        <p>Algorithm 1: Retrieval algorithm</p>
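        <p>Algorithm 1 translates almost line by line into Python. In the sketch below the ESA computation, the cross-lingual mapping and the language lookup are passed in as callables; these three parameters and the sparse dictionary representation of the vectors are assumptions of the example.</p>
        <preformat>
```python
import math


def cosine(u, v):
    """Cosine of two sparse vectors stored as dimension -> value dicts."""
    dot = sum(value * v.get(dim, 0.0) for dim, value in u.items())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def retrieve(topics, documents, esa, mapping, lang, k):
    """Score every (topic, document) pair as in Algorithm 1.

    esa(text, language) returns a sparse ESA-vector, mapping(vector, source,
    target) maps it between languages, and lang(document) returns the language
    of a document. All three are hypothetical callables supplied by the caller.
    """
    # The topic vectors are computed once, in the topic language k.
    topic_vectors = {t: esa(t, k) for t in topics}
    scores = {}
    for d in documents:
        doc_lang = lang(d)
        d_vec = mapping(esa(d, doc_lang), doc_lang, k)
        for t, t_vec in topic_vectors.items():
            scores[(t, d)] = cosine(t_vec, d_vec)
    return scores
```
        </preformat>
        <p>Each document vector is computed and mapped only once and then compared with all topic vectors, which matters because the documents vastly outnumber the topics.</p>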
      </sec>
      <sec id="sec-4-2">
        <title>CLEF Replicability Experiment</title>
        <p>The objective of this experiment is to query the 50 given topics in English on
the multi-lingual TEL German dataset [1-4]. The topics consist of a title and
a short description from which a query can be built. Sorg and Cimiano do not mention what
they used to build the query. We chose to use only the title as the query. As for
the ESA-vector length k, we tried two different settings. Sorg and
Cimiano used k = 10,000 for topics and k = 1,000 for the records. We ran one
experiment with the same settings and additionally ran another experiment
with values an order of magnitude smaller, namely k = 1,000 for the topics and
k = 100 for the records. The result obtained by Sorg and Cimiano was a
mean average precision (MAP) of 6.7% [<xref ref-type="bibr" rid="ref7">7</xref>]. Unfortunately, they did not explicitly
state what is required for a record to count as a relevant document for a given
topic. Therefore we assumed that every record with a score greater than zero is a relevant
document. This means that a single overlapping dimension in
the ESA-vectors of a record and a topic is enough to yield a relevant document.</p>
        <p>This assumption led to the problem that the full list of relevant documents
obtained from the experiment with the larger ESA-vector length matched more
than two thirds of all the records in the dataset to nearly every topic. Looking at
the list we found that the scores of the higher-ranking documents decreased
at a faster pace, whereas the scores of the documents after rank 1000 decreased much
more slowly and amounted to only a small fraction of the scores of the
higher-ranking documents. We concluded that it would be meaningful to cut off
the results at a certain rank and look at the MAP of the reduced lists, because
relevant documents with a very low score might have been connected only accidentally
through a single Wikipedia article, which does not necessarily convey a
semantic connection between a record and a topic. We used the top 10, the
top 100 and the top 1000 results for each topic and calculated the MAP
using trec_eval<sup>9</sup>. The results are shown in Table 1. After comparing our results
with the 6.7% obtained by Sorg and Cimiano, we conclude that we were not
able to reproduce the result.</p>
        <p><sup>9</sup> https://github.com/usnistgov/trec_eval</p>
        <p>In this paper, we described the CL-ESA approach presented by Sorg and Cimiano
and attempted to replicate an experiment submitted to the CLEF conference
in 2008. In the end we were not able to reproduce the result. Most parts
of the experimental setup were replicated accurately, but the index in particular
might differ considerably from the index of the original experiment,
because we were not able to obtain a Wikipedia database dump from 2008 and
therefore worked with a static HTML dump. Since the index is at the core of the
experiment, this can lead to subsequent differences in every other part. Beyond
that, some missing details, e.g. how the query is built from the topics and
which fields of the records in the dataset were used,
make it hard to replicate the experiment more closely.</p>
        <p>Moreover, we ran into some problems caused by our own assumptions. We are referring to
the fact that the full result list of the experiment with the larger ESA-vector
lengths matched two thirds of all the records in the dataset to almost every topic.
We think that the cause of this problem is that, while there are
Wikipedia articles that describe a semantic category very accurately,
there are certainly some articles that have the opposite effect. As a short
example, the article about a famous actress who has also acted in some horror movies
contains many different words
with different meanings (e.g. in the overview of her career and
life). Such an article connects the topic horror movies to any record of the dataset
that obtains a positive score for the article through words which are not
semantically related to horror movies, even though there is probably no semantic
connection whatsoever. Therefore we think that, for Wikipedia-based CL-ESA to
yield better results, it is essential to have a good article selection or to introduce
certain restrictions on what counts as a relevant document for a given
topic, e.g. that at least 10 dimensions need to overlap in the ESA-vectors.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bellot</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Trabelsi</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mothe</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Murtagh</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nie</surname>
            ,
            <given-names>J.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soulier</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>SanJuan</surname>
          </string-name>
          , E.,
          <string-name>
            <surname>Cappellato</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferro</surname>
          </string-name>
          , N. (eds.):
          <article-title>Experimental IR Meets Multilinguality, Multimodality, and Interaction</article-title>
          .
          <source>Proceedings of the Ninth International Conference of the CLEF Association (CLEF</source>
          <year>2018</year>
          ).
          <source>Lecture Notes in Computer Science (LNCS)</source>
          , Springer, Heidelberg, Germany (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Cappellato</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nie</surname>
            ,
            <given-names>J.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soulier</surname>
            ,
            <given-names>L</given-names>
          </string-name>
          . (eds.):
          <source>CLEF 2018 Working Notes. CEUR Workshop Proceedings (CEUR-WS.org)</source>
          ,
          <source>ISSN 1613-0073</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maistro</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sakai</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soboroff</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>CENTRE@CLEF2018: Overview of the Replicability Task</article-title>
          . In: Cappellato et al. [
          <volume>2</volume>
          ]
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maistro</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sakai</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soboroff</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Overview of CENTRE@CLEF 2018: a First Tale in the Systematic Reproducibility Realm</article-title>
          . In: Bellot et al. [
          <volume>1</volume>
          ]
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Gabrilovich</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Markovitch</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Computing semantic relatedness using Wikipedia-based explicit semantic analysis</article-title>
          .
          <source>In: Proceedings of The Twentieth International Joint Conference for Artificial Intelligence</source>
          . pp.
          <volume>1606</volume>
          -
          <fpage>1611</fpage>
          .
          <string-name>
            <surname>Hyderabad</surname>
          </string-name>
          ,
          <string-name>
            <surname>India</surname>
          </string-name>
          (
          <year>2007</year>
          ), http://www.cs.technion.ac.il/~shaulm/papers/pdf/Gabrilovich-Markovitch-ijcai2007.pdf
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raghavan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , Schutze, H.: Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Sorg</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cimiano</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Cross-lingual Information Retrieval with Explicit Semantic Analysis</article-title>
          .
          <source>In: Working Notes for the CLEF 2008 Workshop</source>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>