<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A statistical approach to crosslingual natural language tasks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>David Pinto</string-name>
          <email>dpinto@cs.buap.mx</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jorge Civera</string-name>
          <email>jcivera@dsic.upv.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alfons Juan</string-name>
          <email>ajuan@dsic.upv.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Rosso</string-name>
          <email>prosso@dsic.upv.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alberto Barrón-Cedeño</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Departamento de Sistemas Informáticos y Computación, Universidad Politécnica de Valencia</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Facultad de Ciencias de la Computación, Benemérita Universidad Autónoma de Puebla</institution>
          ,
          <country country="MX">Mexico</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The existence of huge volumes of documents written in multiple languages on the Internet leads us to investigate novel approaches to deal with information of this kind. We propose a statistical approach to tackle crosslingual natural language tasks. In particular, we apply the IBM alignment model 1 with the aim of obtaining a statistical bilingual dictionary, which may further be used to approximate the relatedness probability of two given documents (written in different languages). The experimental results successfully obtained in three different tasks (text classification, information retrieval and plagiarism analysis) highlight the benefits of the presented statistical approach.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The fast growth of the Internet and the increasing presence of multilinguality on
the web pose new challenges for Natural Language Processing (NLP)
technology. This fact leads us to the necessity of developing novel techniques to manage
multilingual data. Indeed, the growing demand for NLP systems that deal with
multilingual information drives the development and evaluation of multilingual
systems in international events such as the Cross Language Evaluation Forum
(CLEF)3 and the Text Analysis Conference (TAC)4.</p>
      <p>
        It is easy to find examples of NLP applications in which more than one
language is involved. In this paper, we focus on three specific multilingual
applications: bilingual Text Classification (TC), crosslingual Information Retrieval
(IR) and crosslingual plagiarism. The proliferation and categorisation of
multilingual documentation has become a common phenomenon in many official
institutions and private companies. The most significant case is the EU
parliament and commission, in which most official documents are translated into
more than 20 languages, and categorised according to the Eurovoc thesaurus [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>-</title>
      <p>3 http://www.clef-campaign.org/
4 http://www.nist.gov/tac/</p>
      <p>
In this scenario, we could take advantage of redundancy and word correlation
across languages to improve the accuracy of text classifiers. Moreover, users may
be interested in information which is in a language other than their own native
one. A common scenario is one where a user has some comprehension
ability in a given language but is not sufficiently proficient to confidently
formulate a search query in that language. Thus, a search engine that can deal
with this crosslingual problem would be of great benefit. Finally, another
multilingual application is crosslingual plagiarism. This latter
application is a real problem which occurs very frequently, for instance, in
academic environments. It consists of the detection of text fragments which have
been translated or partially rewritten from an original text in another language without
adequate reference to the original text. The crosslingual component added to
traditional NLP tasks introduces a higher level of complexity which must be
studied adequately.</p>
      <p>
        Most of the current approaches to crosslingual NLP use conventional
monolingual NLP techniques that usually incorporate a decoupled translation process
as a preprocessing step to bridge the crosslingual gap. However, this two-step
approach is too sensitive to translation errors, and in general to the accumulative
effect of errors. In fact, even if we have a highly accurate NLP system, translation
errors may prevent us from obtaining the desired performance. To overcome this
drawback, we propose to bring together source and target documents written
in two different languages as input to a direct probabilistic crosslingual NLP
system which integrates both steps, translation and the specific NLP task, into
a single one. In order to carry out this integrated approach to crosslingual
applications, we propose to employ the IBM alignment model 1 (M1), which was
first introduced for statistical Machine Translation (MT) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        The M1 model, the first of the IBM models, is basically defined as a statistical
bilingual dictionary that captures word correlation across languages. In
statistical MT, the M1 model has traditionally been an important component in
applications such as the alignment of bilingual sentences [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], the alignment of
syntactic tree fragments [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], the segmentation of bilingual long sentences for
improved word alignment [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], the extraction of parallel sentences from comparable
corpora [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], the estimation of word-level confidence measures [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and, it has been
an inspiration for lexicalised phrase scoring in phrase-based systems [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. In
contrast to these statistical MT applications, the M1 model has recently been applied
to other NLP areas such as bilingual TC, crosslingual IR and plagiarism analysis, with
promising results [
        <xref ref-type="bibr" rid="ref10 ref11 ref9">9–11</xref>
        ]. To our knowledge, the employment of the M1 model
outside statistical MT tasks has barely been investigated.
      </p>
      <p>The rest of this paper is structured as follows. Section 2 presents the M1
model together with the maximum likelihood estimation of its parameters
using the Expectation-Maximisation (EM) algorithm. In Section 3 we introduce
the three crosslingual applications in which the M1 model has been applied and
we describe how the M1 model has been integrated. Section 4 shows the
experimental results obtained on actual tasks related to the proposed crosslingual
applications. Finally, conclusions and further work are discussed in Section 5.</p>
      <sec id="sec-2-1">
        <title>The M1 model</title>
        <sec id="sec-2-1-1">
          <title>The model</title>
          <p>Let x = x_1 ... x_j ... x_|x| be a sentence in a certain source language of known
length |x|, and let y = y_1 ... y_i ... y_|y| be its corresponding translation in a different
target language of known length |y|. Let X and Y denote the source and target
vocabularies, respectively.</p>
          <p>To derive the M1 model we start from the target-conditional probability
distribution p(x | y), for which we define the alignment hidden variable a =
a_1 ... a_j ... a_|x|. The alignment variable connects each source word to exactly
one target word, a_j ∈ {0, ..., i, ..., |y|}, where 0 is the position of the NULL5
word:</p>
          <p>p(x | y) = Σ_{a ∈ A(x,y)} p(x, a | y)  (1)</p>
          <p>where A(x, y) denotes the set of all possible alignments from x to y. Now, we
can factorise the term p(x, a | y) at the word level from left to right:</p>
          <p>p(x, a | y) = Π_{j=1}^{|x|} p(x_j, a_j | x_1^{j-1}, a_1^{j-1}, y)
= Π_{j=1}^{|x|} p(a_j | x_1^{j-1}, a_1^{j-1}, y) p(x_j | x_1^{j-1}, a_1^{j}, y)  (2)</p>
          <p>where p(a_j | x_1^{j-1}, a_1^{j-1}, y) is an alignment probability function (p.f.) and
p(x_j | x_1^{j-1}, a_1^{j}, y) is a lexical p.f. or statistical dictionary.</p>
          <p>The well-known M1 model is defined by making the following two
assumptions. First, we assume that the probability of aligning a source position to a
target position is uniform:</p>
          <p>p(a_j | x_1^{j-1}, a_1^{j-1}, y) := 1 / (|y| + 1)  (3)</p>
          <p>Then, we also assume that the probability of translating a source word depends
only on the target word to which it is aligned:</p>
          <p>p(x_j | x_1^{j-1}, a_1^{j}, y) := p(x_j | y_{a_j})  (4)</p>
          <p>where p(x_j | y_{a_j}) is a statistical bilingual dictionary. Thus, we can rewrite Eq. (2)
under the assumptions in Eqs. (3) and (4) as</p>
          <p>p(x, a | y; Θ) = Π_{j=1}^{|x|} (1 / (|y| + 1)) p(x_j | y_{a_j})  (5)</p>
          <p>where the parameter vector</p>
          <p>Θ = ( p(u | v) ),  u ∈ X, v ∈ Y  (6)</p>
          <p>is a statistical bilingual dictionary.</p>
          <p>5 The NULL word represents the target word to which those source words with no
direct translation are connected.</p>
          <p>The model using indicator vectors. In order to ease the presentation of the
parameter estimation of the M1 model, we now change the nature of the original
alignment variable a_j ∈ {0, ..., |y|} from an integer value into an indicator vector</p>
          <p>a_j = (a_{j0}, a_{j1}, ..., a_{j|y|})^t  (7)</p>
          <p>which takes value one in the ith position and zeros elsewhere if the source
position j is aligned to the target position i. Equivalently to Eq. (5), we have</p>
          <p>p(x, a | y; Θ) = Π_{j=1}^{|x|} Π_{i=0}^{|y|} [ (1 / (|y| + 1)) p(x_j | y_i) ]^{a_{ji}}  (8)</p>
          <p>According to this notation, the initial model in Eq. (1) can be rewritten as</p>
          <p>p(x | y; Θ) = Π_{j=1}^{|x|} Σ_{i=0}^{|y|} (1 / (|y| + 1)) p(x_j | y_i)  (9)</p>
          <p>Eq. (9) is the usual form of the M1 model. The M1 model makes the naive
assumption that source words are conditionally independent given y,</p>
          <p>p(x | y; Θ) = Π_{j=1}^{|x|} p(x_j | y)  (10)</p>
          <p>where</p>
          <p>p(x_j | y) = Σ_{i=0}^{|y|} (1 / (|y| + 1)) p(x_j | y_i)  (11)</p>
          <p>is the average probability of x_j being translated into a target word in y.</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>Parameter estimation</title>
          <p>In this section we present the maximum likelihood estimation of the parameter
vector Θ for the M1 model with respect to a set of N independent bilingual
samples (X, Y ) = ((x_1, y_1), ..., (x_n, y_n), ..., (x_N, y_N))^t, where the sequences of
source and target words of the nth sample are x_n = x_n1, ..., x_nj, ..., x_n|x_n| and
y_n = y_n1, ..., y_ni, ..., y_n|y_n|, respectively.</p>
          <p>The log-likelihood function of Θ, which we would like to maximise, is</p>
          <p>L(Θ; X, Y ) = Σ_{n=1}^{N} Σ_{j=1}^{|x_n|} log Σ_{i=0}^{|y_n|} (1 / (|y_n| + 1)) p(x_nj | y_ni)  (12)</p>
          <p>Now, let A be the set of alignment indicator vectors associated with the bilingual
pairs (X, Y ), with</p>
          <p>A = (a_1, ..., a_n, ..., a_N)^t  (13)</p>
          <p>The variable A is the alignment missing data in the M1 model, since this
information is not present in the bilingual samples (X, Y ). Indeed, if the alignment
information were available, the estimation of the parameter p(u | v) would be as
easy as counting how many times the source word u is aligned to the target word
v in (X, Y ) and normalising adequately. However, we do not know how the
bilingual samples are aligned, and the maximisation of Eq. (12) in order to estimate
Θ is troublesome.</p>
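          <p>For concreteness, once a dictionary p(u | v) has been fixed, the inner term of Eq. (12), i.e., the M1 log-probability of a source sentence given a target sentence in the form of Eq. (9), can be computed directly. The following minimal Python sketch uses a hypothetical toy dictionary; its entries and sentences are illustrative, not taken from the experiments:</p>

```python
import math

# Hypothetical toy bilingual dictionary p(u | v): source word given target word.
# "NULL" stands for the empty target word at position 0.
DICT = {
    ("casa", "house"): 0.9, ("casa", "NULL"): 0.1,
    ("verde", "green"): 0.8, ("verde", "NULL"): 0.2,
}

def m1_log_prob(src_words, tgt_words, dictionary):
    """Eq. (9): sum over source positions j of the log of the average
    translation probability of src_words[j] given the target words
    (NULL included), each weighted by 1 / (|y| + 1)."""
    targets = ["NULL"] + list(tgt_words)
    norm = 1.0 / len(targets)          # 1 / (|y| + 1)
    total = 0.0
    for u in src_words:
        avg = sum(norm * dictionary.get((u, v), 0.0) for v in targets)
        total += math.log(avg)
    return total

lp = m1_log_prob(["casa", "verde"], ["green", "house"], DICT)
```

          <p>Summing such per-sentence terms over the N bilingual samples yields the log-likelihood of Eq. (12).</p>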
          <p>For this reason, we need to resort to the well-known EM algorithm, which
performs maximum likelihood estimation of statistical models with missing
data. The idea behind the EM algorithm is to estimate the parameter vector Θ
in two iterative steps. First, the so-called E-step computes the expected value of
the missing data, in our case, an estimation of the actual value of the alignment
data. Then, in the so-called M-step, given that we have an estimation of the
missing data, we can compute Θ, in the case of the M1 model, an estimation of the
bilingual dictionary. This two-step process is repeated to refine the estimation
of the missing data, and then improve the estimation of the parameter vector.</p>
          <p>Formally, the E-step computes the expected value of the logarithm of the
term p(X, A | Y ), given the (incomplete) data samples (X, Y ) and a current
estimate of Θ at iteration k, Θ^(k). Given that the alignment variables in A are
independent from each other, we can compute the E-step as</p>
          <p>Q(Θ | Θ^(k)) = Σ_{n=1}^{N} Σ_{j=1}^{|x_n|} Σ_{i=0}^{|y_n|} a_{nji}^(k) [ log (1 / (|y_n| + 1)) + log p(x_nj | y_ni) ]  (14)</p>
          <p>with</p>
          <p>a_{nji}^(k) = p(x_nj | y_ni)^(k) / Σ_{i'=0}^{|y_n|} p(x_nj | y_ni')^(k)  (15)</p>
          <p>That is, the expectation of word x_nj being aligned to y_ni is our current estimate
of the probability of x_nj being translated into y_ni, rather than into any other word in
y_n (including the NULL word).</p>
          <p>In the M-step, we maximise Eq. (14) in order to obtain the standard update
formula for the M1 model,</p>
          <p>p(u | v)^(k+1) = N(u, v) / Σ_{u' ∈ X} N(u', v),  ∀u ∈ X, v ∈ Y  (16)</p>
          <p>where</p>
          <p>N(u, v) = Σ_{n=1}^{N} Σ_{j=1}^{|x_n|} Σ_{i=0}^{|y_n|} δ(x_nj = u) δ(y_ni = v) a_{nji}^(k)  (17)</p>
          <p>The estimation of p(u | v) can be seen as a normalised partial count of how many
times the source word u is aligned to the target word v.</p>
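          <p>The E- and M-steps of Eqs. (15)-(17) can be sketched in a few lines of Python. This is an illustrative implementation under common simplifications (uniform initialisation, fixed number of iterations), not the exact code used in the experiments:</p>

```python
from collections import defaultdict

def train_m1(pairs, iterations=10):
    """EM estimation of the M1 dictionary p(u | v) (Eqs. 15-17).
    `pairs` is a list of (source_words, target_words) tuples; a NULL
    word is prepended to every target sentence."""
    # Uniform initialisation of p(u | v) over the source vocabulary.
    src_vocab = {u for x, _ in pairs for u in x}
    prob = defaultdict(lambda: 1.0 / len(src_vocab))
    for _ in range(iterations):
        counts = defaultdict(float)   # partial counts N(u, v), Eq. (17)
        totals = defaultdict(float)   # sum over u' of N(u', v)
        for x, y in pairs:
            targets = ["NULL"] + list(y)
            for u in x:
                denom = sum(prob[(u, v)] for v in targets)
                for v in targets:
                    a = prob[(u, v)] / denom          # E-step, Eq. (15)
                    counts[(u, v)] += a
                    totals[v] += a
        # M-step, Eq. (16): normalise the partial counts per target word.
        prob = defaultdict(float,
                           {uv: c / totals[uv[1]] for uv, c in counts.items()})
    return prob
```

          <p>On a toy corpus such as ("la casa", "the house") plus ("casa", "house"), the mass of p(· | house) concentrates after a few iterations on the consistently co-occurring source word.</p>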
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>Crosslingual applications based on the M1 model</title>
        <p>In this section we introduce three different natural language tasks which could
benefit from the application of the previously presented M1 probabilistic
model. The three tasks (text classification, information retrieval and plagiarism
analysis) are considered in the crosslingual scenario, i.e., some texts are
written in one language, whereas other texts of the same collection are written
in another one. The following sections explain in detail how we have
used the M1 model in each of these tasks.</p>
        <sec id="sec-2-2-1">
          <title>Bilingual text classification</title>
          <p>
            The purpose of TC is to convert an unstructured repository of documents into a
structured one by automatically assigning documents to a predefined number of
groups, in the case of text clustering, or to a set of predefined categories, in the
case of text categorisation. Doing so, the task of storing, searching and browsing
documents in these repositories is significantly simplified [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ].
          </p>
          <p>
            Among the diverse approaches to TC, the well-known naive Bayes
classifier [
            <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
            ] is one of the most popular. Being so, there have been several
instantiations and generalisations of this classifier, from Bernoulli mixtures [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ]
to multinomial mixtures [
            <xref ref-type="bibr" rid="ref16 ref17">16, 17</xref>
            ]. Both generalisations seek to relax the naive
Bayes feature independence assumption made when using a single Bernoulli or
multinomial distribution per category.
          </p>
          <p>
            The unrealistic assumption of the naive Bayes classifier is one of the main
reasons explaining its comparatively poor results in contrast to other techniques
such as boosting-based classifier committees (boosting) [
            <xref ref-type="bibr" rid="ref18">18</xref>
            ] and support vector
machines (SVM) [
            <xref ref-type="bibr" rid="ref19">19</xref>
            ]. However, the performance of the naive Bayes classifier is
significantly improved by using the generalisations mentioned above. Moreover,
there are other recent generalisations (and corrections) that also overcome the
weaknesses of the naive Bayes classifier and achieve competitive results [
            <xref ref-type="bibr" rid="ref20 ref21 ref22 ref23">20–23</xref>
            ].
          </p>
          <p>Bilingual TC is a novel application strongly characterised by word correlation
across languages. This word correlation comes from the fact that the bilingual
texts to be classified are mutual parallel translations. Given the latter scenario,
we propose two main approaches to tackle bilingual TC. First, we may naively
consider that bilingual texts were generated independently and that, therefore,
no crosslingual relation exists between words found in mutual translations.
Alternatively, we may realistically assume that an underlying crosslingual word
mapping exists and can be exploited to boost the performance of a bilingual
classifier. Undoubtedly, the latter approach is significantly more complex than
the former; however, the crosslingual structure apprehended by the latter is
valuable information that cannot be neglected.</p>
          <p>
            The M1 model in bilingual text classification. Formally, our goal is to
classify a bilingual parallel text (x, y) into one of the C supervised categories,
so that we minimise the classification error. According to the optimal Bayes
decision (classification) rule [
            <xref ref-type="bibr" rid="ref24">24</xref>
            ], this can be achieved classifying the bilingual
document (x, y) in the class with maximum posterior probability:
          </p>
          <p>ĉ(x, y) = argmax_{c=1,...,C} p(c | x, y) = argmax_{c=1,...,C} p(c) p(x, y | c)  (18)</p>
          <p>where</p>
          <p>p(x, y | c) = p(y | c) p(x | y, c)  (19)</p>
          <p>can be factorised into a language p.f., p(y | c), and a translation p.f., p(x | y, c).</p>
          <p>Given the bilingual classification rule stated in Eqs. (18) and (19), we can
derive three different classification rules depending on the assumptions we make:
1. The monolingual rule only considers the contribution of one of the two
languages,
p(x, y | c) ≈ p(x | c)  (20)
where p(x | c) is modelled as a unigram model,
p(x | c) := Π_{j=1}^{|x|} p(x_j | c)  (21)
2. The bilingual naive factorisation rule unrealistically assumes that the
bilingual parallel texts are independent from each other,
p(x, y | c) ≈ p(x | c) p(y | c)  (22)
where p(x | c) and p(y | c) are modelled as source and target unigram models,
respectively. This rule incorporates a second source of information into the
classifier, which leads us to believe in its superiority over the monolingual
rule.
3. The general rule, as presented in Eqs. (18) and (19), models the language
p.f., p(y | c), as a target unigram model and the translation p.f., p(x | y, c),
as an M1 model. The integration of the M1 model captures word
correlation across languages, enriching the structure of the bilingual text
classifier and making it theoretically superior to the monolingual and naive rules.</p>
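          <p>The general rule can be sketched as follows; the class structure, smoothing floors and dictionary format are illustrative assumptions, not the setup of Section 4. The monolingual and naive rules are recovered by dropping, respectively, both target-side terms or only the M1 term:</p>

```python
import math

def m1_log_prob(x, y, d, floor=1e-9):
    # Average translation probability per source word (Eqs. 9 and 11),
    # NULL word included; `floor` avoids log(0) for unseen word pairs.
    targets = ["NULL"] + list(y)
    return sum(math.log(max(floor,
                            sum(d.get((u, v), 0.0) for v in targets) / len(targets)))
               for u in x)

def unigram_logprob(words, model, smooth=1e-6):
    # log p(. | c) under a unigram model (Eq. 21); `smooth` is an
    # illustrative floor standing in for the tuned smoothing of Section 4.
    return sum(math.log(model.get(w, smooth)) for w in words)

def classify_general(x, y, classes):
    """General rule (Eqs. 18 and 19): argmax_c p(c) p(y | c) p(x | y, c),
    with p(y | c) a target unigram model and p(x | y, c) an M1 model per
    class. Each entry of `classes` is a dict with the (illustrative) keys
    'prior', 'tgt_unigram' and 'm1_dict'."""
    best, best_score = None, -math.inf
    for c, m in classes.items():
        score = (math.log(m["prior"])
                 + unigram_logprob(y, m["tgt_unigram"])
                 + m1_log_prob(x, y, m["m1_dict"]))
        if score > best_score:
            best, best_score = c, score
    return best
```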
          <p>The maximum likelihood estimation of the source and target models is
trivially computed by relative word frequency. The estimation of the M1 model
involved in the general rule was already introduced in Section 2.</p>
        </sec>
        <sec id="sec-2-2-2">
          <title>Crosslingual information retrieval</title>
          <p>
            In CrossLingual Information Retrieval (CLIR), the usual approach consists of
firstly translating the query into the target language and then retrieving
documents in this language by using a conventional monolingual information
retrieval system. The translation system might be of any type, rule-based,
statistical or hybrid. In [
            <xref ref-type="bibr" rid="ref25 ref26">25, 26</xref>
            ], a statistical MT system is used, but it had to be
previously trained with parallel texts. See [
            <xref ref-type="bibr" rid="ref27 ref28">27, 28</xref>
            ] for a survey on CLIR. As
previously mentioned, the above two-step approach is too sensitive to translation
errors and, therefore, even if an information retrieval system performs well in
a monolingual environment, its performance may be highly degraded in a
multilingual scenario.
          </p>
          <p>
            Probabilistic approaches which use parallel corpora in order to translate the
input queries by means of a statistical dictionary in CLIR have been used in
previous work [
            <xref ref-type="bibr" rid="ref26">26</xref>
            ]. However, our aim is not to translate queries but to obtain a
set of associated words for a given query. In Figure 1 we may see the components
of this novel approach that has just recently been explored in the literature [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ].
The training phase is done by applying the M1 model to a set of pairs of query
vs. relevant webpages. The obtained statistical dictionary is used in conjunction
with the set of target webpages in order to show the most relevant ones given a
query which is written in a different language from that of the webpages.
          </p>
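          <p>The retrieval step just described can be sketched as follows, assuming a dictionary p(page word | query word) previously estimated with the M1 model from query/relevant-page pairs; all names and the unseen-pair floor are illustrative:</p>

```python
import heapq
import math

def rank_pages(query, pages, dictionary, k=10):
    """Return the k web pages with the highest M1 score given the query.
    Each page is scored by the log of the average translation probability
    of its words given the query words (NULL included), as in Eq. (9),
    with source and target chosen so that p(y | x) is modelled."""
    def score(page):
        sources = ["NULL"] + list(query)
        norm = 1.0 / len(sources)          # 1 / (|query| + 1)
        return sum(math.log(sum(norm * dictionary.get((w, q), 1e-9)
                                for q in sources))
                   for w in page)
    return heapq.nlargest(k, pages, key=score)
```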
        </sec>
        <sec id="sec-2-2-3">
          <title>The M1 model in crosslingual information retrieval</title>
          <p>Let x be a query text in a certain (source) language, and let y1, y2, ..., yn be a collection of n web
pages in a different (target) language. Given a number k &lt; n, we are interested
in finding the k most relevant web pages with respect to the source query x. For
this purpose, we have employed a probabilistic approach in which the k most
relevant web pages are computed as those most probable ones given x, i.e.,
Sˆk(x) =</p>
          <p>
            arg max arg min p(y | x)
S⊂{y1,...,yn} y∈S
|S|=k
(23)
In this work, p(y | x) is modelled by using the M1 model. The M1 model assumes
that the order of the words in the query is not important and, therefore, each
position in a document is equally likely to be connected to each position in the
query. Although this assumption is unrealistic in MT, we consider the M1 model
to be particularly well-suited for CLIR.</p>
          <p>Plagiarism is the practice of rewriting someone else’s creative work, in whole
or in part, without the adequate credit of the original authorship. Plagiarism
may be carried out in the same language or across different languages (i.e.,
crosslingual plagiarism). In some way, crosslingual plagiarism analysis is related
to the crosslingual information retrieval field [
            <xref ref-type="bibr" rid="ref10 ref29">10, 29</xref>
            ]. In fact, the aim is to retrieve
those fragments that have been plagiarised from a source text originally written
in another language.
          </p>
          <p>
            Whereas some research works have been carried out for the automatic
plagiarism analysis [
            <xref ref-type="bibr" rid="ref30 ref31">30, 31</xref>
            ], to our knowledge, Cross-Lingual Plagiarism Analysis
(CLiPA) is an NLP task that has scarcely been studied in the literature. In [
            <xref ref-type="bibr" rid="ref32">32</xref>
            ], an
automatic method is proposed to assign descriptors (keywords) drawn from
the multilingual Eurovoc thesaurus to documents that can be found in different
languages. Given the multilingual nature of these descriptors (but with a unique
descriptor id) the authors suggest the possibility of automatically identifying
document translations on the basis of common descriptors. This approach could
be useful in the plagiarism analysis but it has not been investigated any further.
In [
            <xref ref-type="bibr" rid="ref33">33</xref>
            ], the authors propose a preliminary method based on semantic analysis in
order to identify documents that may be plagiarised in a different language.
          </p>
        </sec>
        <sec id="sec-2-2-4">
          <title>The M1 model in crosslingual plagiarism</title>
          <p>Let x be a text fragment drawn from a suspicious document in a given (source) language, and let y1, y2, ..., yn
be a set of original text fragments that may be the source of plagiarised texts in
a different (target) language (the reference corpus).</p>
          <p>Given the suspicious fragment x, the aim is to find the most probable original
text fragment ŷ obtained as a plagiarised translation of x:</p>
          <p>ŷ(x) = argmax_{y ∈ {y_1,...,y_n}} p(y | x)  (24)</p>
          <p>Again, p(y | x) is modelled using the M1 model described in Section 2.</p>
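          <p>A minimal sketch of this decision combines Eq. (24) with a detection threshold, as in the threshold sweep of Section 4; the length normalisation and all names are illustrative assumptions:</p>

```python
import math

def detect_plagiarism(fragment, originals, dictionary, threshold):
    """Sketch of Eq. (24): score every original fragment y with the M1
    model p(y | x) and flag plagiarism when the best length-normalised
    log-score clears `threshold`. `dictionary` holds probabilities
    p(original word | suspicious word); the 1e-9 floor avoids log(0)."""
    def log_score(y):
        sources = ["NULL"] + list(fragment)
        norm = 1.0 / len(sources)
        total = sum(math.log(sum(norm * dictionary.get((w, u), 1e-9)
                                 for u in sources))
                    for w in y)
        return total / len(y)          # normalise by fragment length
    best = max(originals, key=log_score)
    return best, log_score(best) > threshold
```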
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>Experimental results</title>
        <p>In this section we present the results obtained by applying the statistical
approach, based on the M1 model, to three different crosslingual natural language
tasks. Bilingual text classification is presented first, crosslingual information
retrieval follows and, finally, the experiments with crosslingual plagiarism are
shown.
The three bilingual text classifiers introduced in Section 3.1 were assessed in
terms of classification error rate on two categorised parallel corpora. First, we
describe these two corpora and then, we present the experimental setting
employed to evaluate the proposed monolingual and bilingual text classifiers.</p>
        <p>The INTERSECT corpus is a collection of sentence-aligned parallel texts in
English, French and German drawn from different subjects. The English-French
partition contains extracts coming from the Bible, the Canadian Hansard, fiction
books, user manuals, news, scientific-technical reports and official documents
from international organisations. These seven subjects constitute the categories
in which bilingual parallel sentences are classified. The statistics of this corpus
can be found in Table 1.</p>
        <p>
          OPUS [
          <xref ref-type="bibr" rid="ref34">34</xref>
          ] is a growing sentence-aligned multilingual corpus of translated
open source documents freely available on the Internet6. The collections
extracted from OPUS for experimental purposes were:
– OpenOffice.org documentation7.
– KDE manuals including KDE system messages8.
– PHP manuals9.
        </p>
        <p>– European constitution.</p>
        <p>These four collections were considered as independent categories in which
bilingual parallel sentences had to be classified. Their corresponding joint statistics
are presented in Table 1.</p>
        <p>For experimental purposes, these corpora were partitioned into three sets,
devoting 80% to training, 5% to development and 15% to test. This partitioning
process was randomly carried out 30 times to compute confidence intervals on
test error. The parameters of the statistical models proposed were
automatically learnt on the training set, additional smoothing parameters were
manually tuned on the development set, and the accuracy of the different text
classifiers was assessed on the test set.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>-</title>
      <p>6 http://urd.let.rug.nl/tiedeman/OPUS/
7 OpenOffice.org is an open source office suite.
8 The K Desktop Environment (KDE) is a free graphical desktop environment.
9 Hypertext Preprocessor (PHP) is a widely-used general purpose scripting language.</p>
      <p>
        Table 2 presents the results for the monolingual, naive and general
classifiers on the test sets of the INTERSECT and OPUS corpora. As observed in
both corpora, the monolingual classifier is outperformed by the naive and
general classifiers, which incorporate additional information obtained from the second
language. Remarkably, the general classifier, which captures word
correlation across languages using the M1 model, is superior to the naive classifier, in
which each language is modelled independently. This demonstrates the benefit of the
M1 model in improving the accuracy of bilingual text classifiers. These results are
consistent with those presented in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
The presented approach (CLIR Model) was 10-fold cross-validated on the
EuroGOV corpus [
        <xref ref-type="bibr" rid="ref35">35</xref>
        ]. In the training process we used its 134 supervised English
queries. The obtained results were compared against the three best results
reported at the bilingual “English to Spanish” subtrack of WebCLEF 200510. A
complete explanation of the systems/runs evaluated at WebCLEF 2005 may be
found in [
        <xref ref-type="bibr" rid="ref36">36</xref>
        ]. The performance of each system is evaluated by using the Mean
Reciprocal Rank (MRR). The reciprocal rank of a query response is the
multiplicative inverse of the rank of the correct answer. The MRR is the average of
the reciprocal ranks of the results for a sample of queries [
        <xref ref-type="bibr" rid="ref37">37</xref>
        ].
      </p>
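      <p>The MRR described above is straightforward to compute; a brief sketch, where a query whose correct answer is not retrieved contributes a reciprocal rank of zero:</p>

```python
def mean_reciprocal_rank(ranks):
    """MRR over a sample of queries: `ranks` holds the 1-based rank of
    the correct answer for each query (None when it was not retrieved)."""
    return sum(0.0 if r is None else 1.0 / r for r in ranks) / len(ranks)
```

      <p>For example, correct answers at ranks 1 and 2 plus one miss give an MRR of (1 + 1/2 + 0) / 3 = 0.5.</p>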
      <p>Figure 2 presents the name of each run together with its MRR.
The Average Success At (ASA) different numbers of retrieved documents (1, 5,
10, 20 and 50) is also shown in this figure. The improvement that may be obtained
by using the presented M1 model instead of traditional ones, such as those which
represent documents by using the vector space model, is evident.
The main contribution of the M1 model in CLIR is its direct approach
(translation and indexing/searching) over crosslingual data.</p>
      <p>10 http://www.clef-campaign.org/</p>
      <p>We have carried out some preliminary experiments by selecting five document
fragments, y1, ..., y5, from one author of the information retrieval area (e.g.
y5: Intrinsic plagiarism analysis deals with the detection of plagiarised sections
within a document d, without comparing d to extraneous sources). The aim of
this experiment was to obtain an author-based bilingual statistical dictionary
which can be used to perform an author-focused CLiPA.</p>
      <p>For each original text fragment, we have constructed plagiarised cases by
using both, machine and human translators. In the former approach, we have used
five popular online translators11, whereas for the latter five different people have
“plagiarised” each original fragment written in English to fragments in Italian.
In total, the complete corpus is made up of the following text fragments:
i. Five original fragments written in English by a unique author
ii. Five human-simulated plagiarisms of each original fragment (in Italian)
iii. Five automatic machine translations of each original fragment (in Italian)
iv. Five unplagiarised versions of each original fragment (in Italian), obtained
by rewriting the same original concept but mostly with other words
v. Twenty unplagiarised (independent) fragments about the plagiarism topic
originally written in Italian</p>
      <p>We have split the complete corpus into two datasets: training (60%) and
test (40%). The training dataset, used to construct the statistical bilingual
dictionary, is made up of 40 pairs of original fragments and their corresponding
plagiarised versions. The test dataset contains 10 plagiarised and 45
unplagiarised text fragments.</p>
      <p>
        Figure 3(a) shows the performance of the proposed statistical system in
identifying plagiarised documents at different thresholds. The maximum value
obtained under the M1 model indicates the correct association between each
original and its corresponding plagiarised text fragment. The results obtained
on Italian-written plagiarisms are consistent with those reported in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] for
documents written in English and plagiarised in Spanish (see Figure
3(b)). Table 3 presents the CER results for the crosslingual classifiers on the
test sets of the English-Italian and English-Spanish corpora.
11 Freetranslation (www.freetranslation.com), Systran (www.systransoft.com),
Google (www.google.com/language_tools), Worldlingo (www.worldlingo.com), and
Reverso (www.reverso.net)
      </p>
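      <p>One plausible reading of the detection rule above is: flag a suspicious fragment when its best length-normalised M1 log-score against any original fragment exceeds a threshold tuned on held-out data. A minimal sketch, where the word-translation table lex and all probabilities are hypothetical:</p>

```python
import math

def m1_log_score(src_words, tgt_words, lex):
    """Log IBM model 1 probability of a target fragment given a source
    fragment: uniform alignments over the source words plus a NULL word.
    lex[(src, tgt)] is a (hypothetical) learned bilingual dictionary."""
    src = [None] + list(src_words)            # None plays the NULL word
    score = 0.0
    for t in tgt_words:
        p = sum(lex.get((s, t), 1e-6) for s in src) / len(src)
        score += math.log(p)
    return score

def is_plagiarised(suspicious, originals, lex, threshold):
    """Flag `suspicious` if its best length-normalised score against
    any original fragment exceeds the tuned threshold."""
    best = max(m1_log_score(o, suspicious, lex) / len(suspicious)
               for o in originals)
    return best > threshold

lex = {("the", "il"): 0.9, ("cat", "gatto"): 0.9}      # toy dictionary
print(is_plagiarised(["il", "gatto"], [["the", "cat"]], lex, -5.0))  # True
```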
      <p>[Figure 3 placeholder: plot of F-measure (y-axis, 0.5 to 0.75) at different thresholds]</p>
      <sec id="sec-3-1">
        <title>Conclusions and future work</title>
        <p>In this paper we have presented the application of the M1 statistical model to
the bilingual TC and the crosslingual IR and plagiarism tasks. The M1
translation model has been widely employed in statistical machine translation, but
remains unexplored in many other crosslingual NLP tasks.</p>
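        <p>For reference, the word-translation probabilities behind such a statistical bilingual dictionary can be estimated with a few lines of EM. This is a minimal sketch on toy data, without the NULL word or smoothing that a full implementation would add:</p>

```python
from collections import defaultdict

def train_ibm1(pairs, iterations=10):
    """EM estimation of IBM model 1 translation probabilities t(f|e)
    from sentence pairs [(source_words, target_words), ...]."""
    t = defaultdict(lambda: 1e-3)                 # near-uniform start
    for _ in range(iterations):
        count = defaultdict(float)                # expected counts c(f|e)
        total = defaultdict(float)
        for e_sent, f_sent in pairs:
            for f in f_sent:
                z = sum(t[(e, f)] for e in e_sent)    # alignment normaliser
                for e in e_sent:
                    c = t[(e, f)] / z                 # P(f aligned to e)
                    count[(e, f)] += c
                    total[e] += c
        for (e, f), c in count.items():           # M-step: renormalise
            t[(e, f)] = c / total[e]
    return t

pairs = [(["the", "cat"], ["il", "gatto"]),
         (["the", "dog"], ["il", "cane"])]
t = train_ibm1(pairs, iterations=20)
# "the" co-occurs with "il" in both pairs, so t[("the", "il")] ends up
# larger than t[("the", "gatto")] or t[("the", "cane")]
```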
        <p>The aim of the presented approach is to directly capture word correlations
across languages, in contrast to current approaches that ignore, or do not take
full advantage of, multilinguality. The experimental results obtained in different
NLP tasks highlight the benefits of the M1 model and the usefulness of learning
crosslingual information in multilingual applications.</p>
        <p>As future work, we plan to apply the M1 model to the bilingual TC task
using the challenging JRC-Acquis corpus. Moreover, extending the bilingual
text classifier to the multilingual case is yet another appealing idea
that we would like to study.</p>
        <p>It would also be worth exploring higher-order IBM translation models, such as
IBM model 2, which refines the M1 model by learning a crosslingual source-target
position mapping. This refinement should also be analysed in other NLP tasks
such as summarization, headline generation and word sense disambiguation.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Acknowledgments</title>
        <p>The authors would like to thank Raphael Salkie of the University of Brighton
for providing access to the INTERSECT corpus. This work has been partially
supported by the MCyT TIN2006-15265-C06-04, TIN2006-15694-CO2-01 and
CSD2007-00018 research projects, the BUAP-701 PROMEP/103.5/05/1536 grant
and the FPU fellowship AP2003-0342.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. EC: Thesaurus eurovoc - volume
          <volume>2</volume>
          :
          <article-title>Subject-oriented version. Annex to the index of the Official Journal of the EC</article-title>
          . Office for Official Publications of the EC (
          <year>1995</year>
          ) http://europa.eu.int/celex/eurovoc.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Brown</surname>
          </string-name>
          , P.F., et al.:
          <article-title>The Mathematics of Statistical Machine Translation: Parameter Estimation</article-title>
          .
          <source>Computational Linguistics</source>
          <volume>19</volume>
          (
          <issue>2</issue>
          ) (
          <year>1993</year>
          )
          <fpage>263</fpage>
          -
          <lpage>311</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Moore</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Fast and accurate sentence alignment of bilingual corpora</article-title>
          .
          <source>In: Proc. of AMTA'02</source>
          . (
          <year>2002</year>
          )
          <fpage>135</fpage>
          -
          <lpage>244</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Ding</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gildea</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Palmer</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>An algorithm for word-level alignment of parallel dependency trees</article-title>
          .
          <source>In: Proc. of MT Summit IX</source>
          . (
          <year>2003</year>
          )
          <fpage>95</fpage>
          -
          <lpage>101</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Nevado</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Casacuberta</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vidal</surname>
          </string-name>
          , E.:
          <article-title>Parallel corpora segmentation using anchor words</article-title>
          .
          <source>In: Proc. of EAMT/CLAW'03</source>
          . (
          <year>2003</year>
          )
          <fpage>33</fpage>
          -
          <lpage>40</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Munteanu</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fraser</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marcu</surname>
            ,
            <given-names>D.:</given-names>
          </string-name>
          <article-title>Improved machine translation performance via parallel sentence extraction from comparable corpora</article-title>
          .
          <source>In: Proc. of HLTNAACL'04</source>
          . (
          <year>2004</year>
          )
          <fpage>265</fpage>
          -
          <lpage>272</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Ueffing</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ney</surname>
          </string-name>
          , H.:
          <article-title>Word-level confidence estimation for machine translation</article-title>
          .
          <source>Computational Linguistics</source>
          <volume>33</volume>
          (
          <issue>1</issue>
          ) (
          <year>2007</year>
          )
          <fpage>9</fpage>
          -
          <lpage>40</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Koehn</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Och</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marcu</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Statistical phrase-based translation</article-title>
          .
          <source>In: Proc. of NAACL'03</source>
          . (
          <year>2003</year>
          )
          <fpage>48</fpage>
          -
          <lpage>54</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Civera</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Juan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Unigram-IBM Model 1 Mixtures for Bilingual Text Classification</article-title>
          .
          <source>In: Proc. of LREC'08</source>
          . (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Pinto</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Juan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Using query-relevant documents pairs for crosslingual information retrieval</article-title>
          .
          <source>In: Proc. of TSD'07</source>
          . (
          <year>2007</year>
          )
          <fpage>630</fpage>
          -
          <lpage>637</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Barrón-Cedeño</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pinto</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Juan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>On cross-lingual plagiarism analysis using a statistical model</article-title>
          .
          <source>In: Proc. of PAN-08</source>
          . (
          <year>2008</year>
          ) in print
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Sebastiani</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Classification of text, automatic</article-title>
          . In Brown, K., ed.:
          <source>The Encyclopedia of Language and Linguistics</source>
          . Volume
          <volume>2</volume>
          . Second edn. Elsevier Science Publishers, Amsterdam, NL (
          <year>2006</year>
          )
          <fpage>457</fpage>
          -
          <lpage>463</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Lewis</surname>
          </string-name>
          , D.D.:
          <article-title>Naive Bayes at Forty: The Independence Assumption in Information Retrieval</article-title>
          .
          <source>In: Proc. of ECML'98</source>
          . (
          <year>1998</year>
          )
          <fpage>4</fpage>
          -
          <lpage>15</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>McCallum</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nigam</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>A Comparison of Event Models for Naive Bayes Text Classification</article-title>
          .
          <source>In: Proc. of AAAI/ICML-98: Workshop on Learning for Text Categorization</source>
          . (
          <year>1998</year>
          )
          <fpage>41</fpage>
          -
          <lpage>48</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Juan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vidal</surname>
          </string-name>
          , E.:
          <article-title>On the use of Bernoulli mixture models for text classification</article-title>
          .
          <source>Pattern Recognition</source>
          <volume>35</volume>
          (
          <issue>12</issue>
          ) (
          <year>2002</year>
          )
          <fpage>2705</fpage>
          -
          <lpage>2710</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Nigam</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , et al.:
          <article-title>Text Classification from Labeled and Unlabeled Documents using EM</article-title>
          .
          <source>Machine Learning</source>
          <volume>39</volume>
          (
          <issue>2</issue>
          /3) (
          <year>2000</year>
          )
          <fpage>103</fpage>
          -
          <lpage>134</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17. Novovičová, J., Malík, A.:
          <article-title>Application of Multinomial Mixture Model to Text Classification</article-title>
          .
          <source>In: Proc. of IbPRIA 2003</source>
          . Volume
          <volume>2652</volume>
          of Lecture Notes in Computer Science., Springer-Verlag (
          <year>2003</year>
          )
          <fpage>646</fpage>
          -
          <lpage>653</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Schapire</surname>
            ,
            <given-names>R.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Singer</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>BoosTexter: A boosting-based system for text categorization</article-title>
          .
          <source>Machine Learning</source>
          <volume>39</volume>
          (
          <issue>2-3</issue>
          ) (
          <year>2000</year>
          )
          <fpage>135</fpage>
          -
          <lpage>168</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Joachims</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Text Categorization with Support Vector Machines: Learning with Many Relevant Features</article-title>
          .
          <source>In: Proc. of ECML'98</source>
          . (
          <year>1998</year>
          )
          <fpage>137</fpage>
          -
          <lpage>142</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Scheffer</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wrobel</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Text Classification Beyond the Bag-of-Words Representation</article-title>
          .
          <source>In: Proc. of ICML'02: Workshop on Text Learning</source>
          . (
          <year>2002</year>
          )
          <fpage>28</fpage>
          -
          <lpage>35</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Rennie</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , et al.:
          <article-title>Tackling the Poor Assumptions of Naive Bayes Text Classifiers</article-title>
          .
          <source>In: Proc. of ICML'03</source>
          . (
          <year>2003</year>
          )
          <fpage>616</fpage>
          -
          <lpage>623</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Pavlov</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , et al.:
          <article-title>Document Preprocessing For Naive Bayes Classification and Clustering with Mixture of Multinomials</article-title>
          .
          <source>In: Proc. of KDD'04</source>
          . (
          <year>2004</year>
          )
          <fpage>829</fpage>
          -
          <lpage>834</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Peng</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          , et al.:
          <article-title>Augmenting Naive Bayes classifiers with statistical language models</article-title>
          .
          <source>Information Retrieval</source>
          <volume>7</volume>
          (
          <issue>3</issue>
          ) (
          <year>2004</year>
          )
          <fpage>317</fpage>
          -
          <lpage>345</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Duda</surname>
            ,
            <given-names>R.O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hart</surname>
            ,
            <given-names>P.E.</given-names>
          </string-name>
          :
          <source>Pattern Classification and Scene Analysis</source>
          . Wiley (
          <year>1973</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Franz</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McCarley</surname>
            ,
            <given-names>J.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roukos</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Ad-hoc and multilingual information retrieval at IBM</article-title>
          .
          <source>In: Proc. of the TREC-7 Conference</source>
          . (
          <year>1998</year>
          )
          <fpage>157</fpage>
          -
          <lpage>168</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Kraaij</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nie</surname>
            ,
            <given-names>J.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Simard</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Embedding web-based statistical translation models in cross-language information retrieval</article-title>
          .
          <source>Computational Linguistics</source>
          <volume>29</volume>
          (
          <issue>3</issue>
          ) (
          <year>2003</year>
          )
          <fpage>381</fpage>
          -
          <lpage>419</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Fuhr</surname>
          </string-name>
          , N.:
          <article-title>Probabilistic models in information retrieval</article-title>
          .
          <source>The Computer Journal</source>
          <volume>35</volume>
          (
          <issue>3</issue>
          ) (
          <year>1992</year>
          )
          <fpage>243</fpage>
          -
          <lpage>255</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <surname>Rijsbergen</surname>
            ,
            <given-names>C.J.V.</given-names>
          </string-name>
          :
          <article-title>Information Retrieval, 2nd edition</article-title>
          . Dept. of Computer Science, University of Glasgow (
          <year>1979</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <string-name>
            <surname>Kraaij</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nie</surname>
            ,
            <given-names>J.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Simard</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Embedding web-based statistical translation models in cross-language information retrieval</article-title>
          .
          <source>Computational Linguistics</source>
          <volume>29</volume>
          (
          <issue>3</issue>
          ) (
          <year>2003</year>
          )
          <fpage>381</fpage>
          -
          <lpage>419</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30.
          <string-name>
            <surname>Si</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leong</surname>
            ,
            <given-names>H.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lau</surname>
            ,
            <given-names>R.W.H.</given-names>
          </string-name>
          :
          <article-title>Check: a document plagiarism detection system</article-title>
          .
          <source>In: Proc. of the 1997 ACM Symposium on Applied Computing</source>
          , ACM (
          <year>1997</year>
          )
          <fpage>70</fpage>
          -
          <lpage>77</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          31.
          <string-name>
            <surname>Stein</surname>
          </string-name>
          , B., Meyer zu Eissen, S.:
          <article-title>Intrinsic plagiarism analysis with meta learning</article-title>
          .
          <source>In: Proc. of PAN-07</source>
          . (
          <year>2007</year>
          )
          <fpage>45</fpage>
          -
          <lpage>50</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          32.
          <string-name>
            <surname>Pouliquen</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Steinberger</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ignat</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Automatic annotation of multilingual text collections with a conceptual thesaurus</article-title>
          .
          <source>In: Proc. of EUROLAN'03</source>
          . (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          33.
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Anderka</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>A wikipedia-based multilingual retrieval model</article-title>
          .
          <source>In: Proc. of ECIR'08</source>
          . Volume
          <volume>4956</volume>
          of Lecture Notes in Computer Science., Springer-Verlag (
          <year>2008</year>
          )
          <fpage>522</fpage>
          -
          <lpage>530</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          34.
          <string-name>
            <surname>Tiedemann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nygaard</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>The OPUS corpus - parallel &amp; free</article-title>
          .
          <source>In: Proc. of LREC'04</source>
          , Lisbon, Portugal (
          <year>2004</year>
          )
          <fpage>1183</fpage>
          -
          <lpage>1186</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          35.
          <string-name>
            <surname>Sigurbjörnsson</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kamps</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , de Rijke, M.:
          <article-title>EuroGOV: Engineering a multilingual web corpus</article-title>
          .
          <source>In: Proc. of WebCLEF'06</source>
          . Volume
          <volume>4022</volume>
          of Lecture Notes in Computer Science., Springer-Verlag (
          <year>2006</year>
          )
          <fpage>825</fpage>
          -
          <lpage>836</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          36.
          <string-name>
            <surname>Sigurbjörnsson</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kamps</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , de Rijke, M.:
          <article-title>Overview of WebCLEF 2005</article-title>
          .
          <source>In: Proc. of WebCLEF'06</source>
          . Volume
          <volume>4022</volume>
          of Lecture Notes in Computer Science., Springer-Verlag (
          <year>2006</year>
          )
          <fpage>810</fpage>
          -
          <lpage>824</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          37.
          <string-name>
            <surname>Voorhees</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>The TREC-8 question answering track report</article-title>
          .
          <source>In: Proc. of the 8th Text Retrieval Conference</source>
          . (
          <year>1999</year>
          )
          <fpage>77</fpage>
          -
          <lpage>82</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>