<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Using sentence similarity measure for plagiarism source retrieval</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Denis Zubarev</string-name>
          <email>zubarev@isa.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ilya Sochenkov</string-name>
          <email>isochenkov@sci.pfu.edu.ru</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute for Systems Analysis of Russian Academy of Sciences</institution>
          ,
          <addr-line>Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Peoples' Friendship University of Russia</institution>
          ,
          <addr-line>Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <fpage>1027</fpage>
      <lpage>1034</lpage>
      <abstract>
        <p>This paper describes the method implemented in the software we submitted to the source retrieval task of the PAN 2014 competition. To generate queries, we use the most important noun phrases and words of sentences selected from a given suspicious document. To decide which documents are likely to be sources of plagiarism and should be downloaded, we employ a sentence similarity measure.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        The plagiarism detection track at PAN [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is divided into two subtasks: source retrieval
and text alignment. Detailed information about these tasks is provided in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. In this
paper we present a method for the former task. Search engines
(Indri, ChatNoir) are used to retrieve candidate sources for a suspicious document.
We formulate a set of queries for the search engines based on sentences of the suspicious
document. Query formulation is very important because it determines
the maximum recall that a source retrieval method can achieve. One
goal of this paper is to explore how phrasal search influences recall. We
also need to filter out incorrect source candidates to keep precision
high enough, which saves computational resources in the subsequent text alignment task.
We employ a sentence similarity measure for filtering source candidates: if a candidate
contains a sentence that is sufficiently similar to some suspicious sentence, we consider the
candidate a source; otherwise the candidate is discarded. Another goal is to minimize
the number of queries as far as possible. To achieve this, we use only a small
number of sentences from the suspicious document to generate queries, and we actively
filter the queries based on the sources already downloaded.
      </p>
      <p>The rest of this paper is organized as follows. Section 2 provides the details of the
source retrieval method. Section 3 describes the sentence similarity measure we use. Section 4
describes the experiments we conducted. Section 5 presents the performance
of the software in the PAN 2014 competition. Section 6 concludes the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Source retrieval method</title>
      <sec id="sec-2-1">
        <title>The source retrieval method consists of several steps:</title>
        <p>
          <list list-type="order">
            <list-item><p>suspicious document chunking</p></list-item>
            <list-item><p>query formulation</p></list-item>
            <list-item><p>download filtering</p></list-item>
            <list-item><p>search control</p></list-item>
          </list>
These steps are almost identical to those described in [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Algorithm 1 shows
the pseudocode of our source retrieval algorithm, which we describe in detail in the
sections below.
        </p>
        <p>
          Our method uses linguistic information, namely word forms and the syntactic
dependencies between words. We use the dependency tree of a sentence to construct two-word
noun phrases: if a noun is linked to another word and there are no words between them,
we treat the pair as a phrase. Word forms are used for measuring
sentence similarity. To obtain linguistic information we use Freeling [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], which performs
document splitting, part-of-speech tagging, and dependency parsing.
        </p>
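        <p>The phrase construction rule above can be illustrated with a minimal sketch. The <monospace>Token</monospace> structure, the tag name <monospace>"NOUN"</monospace>, and the function name are assumptions made for this illustration; they are not Freeling's actual output format or API. The sketch assumes that each token stores the index of its syntactic head and that a token's index equals its position in the list.</p>
        <preformat preformat-type="code">
```python
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    index: int   # position of the word in the sentence (also its list index)
    lemma: str
    pos: str     # coarse part-of-speech tag; "NOUN" is a hypothetical tag name
    head: int    # index of the syntactic head, or -1 for the root


def two_word_noun_phrases(tokens: List[Token]) -> List[Tuple[str, str]]:
    """Return (head lemma, dependent lemma) pairs for adjacent, linked words
    in which at least one of the two words is a noun."""
    phrases = []
    for tok in tokens:
        if tok.head == -1:
            continue  # the root has no head to pair with
        head = tokens[tok.head]
        adjacent = abs(tok.index - head.index) == 1  # no words in between
        has_noun = "NOUN" in (tok.pos, head.pos)
        if adjacent and has_noun:
            phrases.append((head.lemma, tok.lemma))
    return phrases


# "plagiarism detection often fails", with made-up dependency links
sentence = [
    Token(0, "plagiarism", "NOUN", 1),
    Token(1, "detection", "NOUN", 3),
    Token(2, "often", "ADV", 3),
    Token(3, "fail", "VERB", -1),
]
print(two_word_noun_phrases(sentence))
```
        </preformat>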
        <p>Because our method uses TF-IDF weighting, we built a reference collection for
calculating the IDF weights of words. The collection consists of about 400,000 randomly
chosen English Wikipedia articles. We assume that this collection
is sufficient for our task, since it allows us to distinguish words that occur regularly
in any text from more important ones.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Algorithm 1 General overview of source retrieval method</title>
        <p>[Algorithm 1 listing not recoverable; surviving comments: "snippets and urls are returned", "pairs of similar sentences".]</p>
        <sec id="sec-2-2-1">
          <title>2.1 Suspicious document chunking</title>
          <p>First, the suspicious document is split into sentences. The weight of a
sentence is the sum of the weights of all words and phrases in it. The weight of a word is
obtained using TF-IDF weighting: the TF weight is calculated using equation (4)
from section 3, and the IDF weight using equation (2) from section 3.
Words that form a two-word noun phrase do not contribute individually to the overall sentence
weight. A phrase weight is calculated as <inline-formula><tex-math>W_{phr} = 2(W_h + W_d)</tex-math></inline-formula>, where <inline-formula><tex-math>W_h</tex-math></inline-formula> is
the weight of the head word and <inline-formula><tex-math>W_d</tex-math></inline-formula> is the weight of the dependent word.</p>
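          <p>The weighting scheme can be sketched as follows. This is an illustration under stated assumptions rather than the submitted implementation: the function name and the example weights are hypothetical, word weights are assumed to be precomputed TF-IDF values, and each word occurrence in the list is counted once.</p>
          <preformat preformat-type="code">
```python
from typing import Dict, List, Tuple


def sentence_weight(
    words: List[str],
    phrases: List[Tuple[str, str]],
    word_weight: Dict[str, float],
) -> float:
    """Sum word and phrase weights; words inside a phrase count only via the phrase."""
    in_phrase = {w for pair in phrases for w in pair}
    total = sum(word_weight[w] for w in words if w not in in_phrase)
    # W_phr = 2 * (W_h + W_d) for each (head, dependent) pair
    total += sum(2 * (word_weight[h] + word_weight[d]) for h, d in phrases)
    return total


# Made-up TF-IDF weights, for illustration only
weights = {"source": 0.2, "retrieval": 0.3, "method": 0.1}
w = sentence_weight(
    ["source", "retrieval", "method"],
    [("retrieval", "source")],
    weights,
)
print(round(w, 6))
```
          </preformat>
          <p>In this example the phrase contributes 2 × (0.3 + 0.2) = 1.0 and the remaining word contributes 0.1.</p>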
          <p>After calculating the weights of all sentences, we select sentences using
several filters. First, the sentence weight must exceed 0.45, since a low weight indicates
that the sentence carries little significant information. Second, we bound
the minimum and maximum number of words (excluding prepositions, conjunctions, and
articles) per sentence. The lower bound excludes short sentences, which may have
rather high similarity to some other sentence merely because they share the highest-weighted
words. Very long sentences are typically either sentence-splitting errors or long lists, and a
snippet is unlikely to contain the whole of such a sentence, so we omit them. The N
highest-scoring sentences that satisfy these criteria are selected for further
analysis. We call them suspicious sentences.</p>
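          <p>A minimal sketch of this selection step is given below. Only the weight threshold of 0.45 comes from the text; the <monospace>Scored</monospace> structure, the function name, and the values used for N and for the word-count bounds in the example are hypothetical.</p>
          <preformat preformat-type="code">
```python
from typing import List, NamedTuple


class Scored(NamedTuple):
    text: str
    weight: float
    n_content_words: int  # words excluding prepositions, conjunctions, articles


def select_suspicious(
    sentences: List[Scored],
    n: int,
    min_words: int,
    max_words: int,
    min_weight: float = 0.45,
) -> List[Scored]:
    """Keep sentences that pass all filters, then take the n highest-weighted."""
    eligible = [
        s for s in sentences
        if s.weight > min_weight
        and s.n_content_words >= min_words
        and max_words >= s.n_content_words
    ]
    eligible.sort(key=lambda s: s.weight, reverse=True)
    return eligible[:n]


candidates = [
    Scored("a", 0.9, 10),
    Scored("b", 0.3, 10),   # weight too low
    Scored("c", 0.7, 2),    # too short
    Scored("d", 0.6, 12),
    Scored("e", 0.8, 60),   # too long
]
# n, min_words and max_words are illustrative values, not the tuned ones
selected = select_suspicious(candidates, n=2, min_words=5, max_words=40)
print([s.text for s in selected])
```
          </preformat>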
        </sec>
        <sec id="sec-2-2-2">
          <title>2.2 Query formulation</title>
          <p>
            Before formulating queries, we delete articles, pronouns, and prepositions, as well as
duplicate words or phrases, from each selected sentence. We use the two available search engines,
Indri [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ] and ChatNoir [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ], for submitting queries. Sentences that contain phrases are
submitted to Indri: we take the 6 most important (highest-weighted) entities
(phrases or words) from the sentence and wrap the phrases in a special
operator to leverage the phrasal search supported by Indri. If the sentence does not contain
any phrases, its 6 highest-weighted words are submitted to ChatNoir as a query. The
queries are submitted to the search engines sequentially. This scheme is similar
to the approach used by Elizalde [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ] in PAN 2013.
          </p>
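          <p>The routing rule can be sketched as follows. The text does not name the Indri operator that was used; this sketch assumes the ordered-window operator <monospace>#1(...)</monospace> of the Indri query language, which matches its terms as an exact phrase. The entity representation and the function name are likewise hypothetical.</p>
          <preformat preformat-type="code">
```python
from typing import List, Tuple

# (weight, words): a one-element tuple is a word, a two-element tuple a phrase
Entity = Tuple[float, Tuple[str, ...]]


def build_query(entities: List[Entity], top_k: int = 6) -> Tuple[str, str]:
    """Return (engine name, query string) for one suspicious sentence."""
    top = sorted(entities, key=lambda e: e[0], reverse=True)[:top_k]
    has_phrase = any(len(words) == 2 for _, words in entities)
    if has_phrase:
        # Assumed operator: #1(w1 w2) requests the two words as an exact phrase
        parts = [
            "#1(" + " ".join(words) + ")" if len(words) == 2 else words[0]
            for _, words in top
        ]
        return "indri", " ".join(parts)
    return "chatnoir", " ".join(words[0] for _, words in top)


with_phrase = [
    (0.9, ("source", "retrieval")),
    (0.5, ("plagiarism",)),
    (0.7, ("query",)),
]
words_only = [(0.2, ("measure",)), (0.4, ("sentence",))]
print(build_query(with_phrase))
print(build_query(words_only))
```
          </preformat>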
        </sec>
        <sec id="sec-2-2-3">
          <title>2.3 Download filtering</title>
          <p>The first seven snippets returned by a search engine are downloaded and preprocessed with
Freeling. We then calculate the similarity between the suspicious sentences and the
sentences extracted from the snippets, using the measure described in section 3.
If the similarity between any suspicious sentence and a snippet
sentence exceeds <inline-formula><tex-math>MinSim</tex-math></inline-formula>, the document to which the snippet belongs is scheduled
for downloading. Such a document is considered to be a source document.</p>
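          <p>The filtering decision can be sketched as below. The similarity function is passed in as a parameter; <monospace>toy_similarity</monospace> is only a stand-in for the measure of section 3, and the URLs, the threshold value, and all names are hypothetical.</p>
          <preformat preformat-type="code">
```python
from typing import Callable, List, Tuple

Snippet = Tuple[str, List[str]]  # (document URL, sentences of the snippet)


def urls_to_download(
    snippets: List[Snippet],
    suspicious: List[str],
    similarity: Callable[[str, str], float],
    min_sim: float,
    top_n: int = 7,
) -> List[str]:
    """Return URLs whose snippet has a sentence similar to a suspicious one."""
    selected: List[str] = []
    for url, snippet_sentences in snippets[:top_n]:
        hit = any(
            similarity(s, t) > min_sim
            for s in suspicious
            for t in snippet_sentences
        )
        if hit and url not in selected:
            selected.append(url)
    return selected


def toy_similarity(a: str, b: str) -> float:
    """Stand-in for Sim: fraction of the words of `a` that also occur in `b`."""
    words_a, words_b = set(a.split()), set(b.split())
    return len(words_a.intersection(words_b)) / len(words_a) if words_a else 0.0


results = [
    ("http://example.org/a", ["the cat sat on the mat"]),
    ("http://example.org/b", ["completely unrelated text"]),
]
print(urls_to_download(results, ["the cat sat"], toy_similarity, min_sim=0.5))
```
          </preformat>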
        </sec>
        <sec id="sec-2-2-4">
          <title>2.4 Search control</title>
          <p>
            We use the downloaded documents to filter the remaining queries. After downloading,
the sources are preprocessed with Freeling. We then again calculate the
similarity between the suspicious sentences and the sentences extracted from the downloaded
documents. If a downloaded document contains a sentence similar to a suspicious one, the
suspicious sentence is marked as a misuse.
Misuses are not used in the query formulation process, since their sources have
already been found. This approach is similar to the one used by Haggag [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ].
          </p>
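          <p>A minimal sketch of this bookkeeping follows. It assumes that misuses are tracked as a set of sentence strings and that queries are generated only from sentences not yet in that set; the function names and the exact-match stand-in for the similarity measure are hypothetical.</p>
          <preformat preformat-type="code">
```python
from typing import Callable, List, Set


def mark_misuses(
    suspicious: List[str],
    source_sentences: List[str],
    similarity: Callable[[str, str], float],
    min_sim: float,
    misuses: Set[str],
) -> Set[str]:
    """Add to `misuses` every suspicious sentence matched in a downloaded source."""
    for s in suspicious:
        if s in misuses:
            continue  # its source has already been found
        if any(similarity(s, t) > min_sim for t in source_sentences):
            misuses.add(s)
    return misuses


def remaining_queries(suspicious: List[str], misuses: Set[str]) -> List[str]:
    """Sentences that still need a query, in their original order."""
    return [s for s in suspicious if s not in misuses]


def exact_match(a: str, b: str) -> float:
    """Stand-in for the sentence similarity measure."""
    return 1.0 if a == b else 0.0


suspicious = ["first sentence", "second sentence", "third sentence"]
found = mark_misuses(suspicious, ["second sentence"], exact_match, 0.5, set())
print(remaining_queries(suspicious, found))
```
          </preformat>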
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3 Sentence similarity measure</title>
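      <p>We now introduce the method for measuring sentence similarity. Given two arbitrary
sentences <inline-formula><tex-math>s_e</tex-math></inline-formula> and <inline-formula><tex-math>s_t</tex-math></inline-formula> from documents <inline-formula><tex-math>d_e</tex-math></inline-formula> and <inline-formula><tex-math>d_t</tex-math></inline-formula> respectively, let <inline-formula><tex-math>N(s_e, s_t)</tex-math></inline-formula> denote the set
of word pairs with the same lemma, where the first element is taken from <inline-formula><tex-math>s_e</tex-math></inline-formula> and the
second from <inline-formula><tex-math>s_t</tex-math></inline-formula>. We compare two sentences by considering the word pairs in
<inline-formula><tex-math>N(s_e, s_t)</tex-math></inline-formula>. To calculate the overall similarity of two sentences, we compute
several similarity measures and then combine their values. The measures
are described below.</p>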
      <sec id="sec-3-1">
        <title>3.1 IDF overlap measure</title>
        <sec id="sec-3-1-1">
          <title>Similar to [<xref ref-type="bibr" rid="ref6">6</xref>] we define IDF overlap as follows:</title>
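          <p>
            <disp-formula><label>(1)</label><tex-math>I_1(s_e, s_t) = \sum_{(w_e, w_t) \in N(s_e, s_t)} \mathrm{IDF}(w_e)</tex-math></disp-formula>
          </p>
          <p>
            <disp-formula><label>(2)</label><tex-math>\mathrm{IDF}(w_e) = \log_{|D|} \frac{|D|}{m(w_e, D)},</tex-math></disp-formula>
where <inline-formula><tex-math>D</tex-math></inline-formula> is a set of documents and <inline-formula><tex-math>m(w_e, D)</tex-math></inline-formula> is the number of documents that contain
the word <inline-formula><tex-math>w_e</tex-math></inline-formula>. For correctness, we assume that if <inline-formula><tex-math>m(w_e, D) = 0</tex-math></inline-formula>, then <inline-formula><tex-math>\mathrm{IDF}(w_e) = 1</tex-math></inline-formula>.</p>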
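          <p>Equations (1) and (2) can be sketched in code as follows. The sketch assumes that a word is represented as a (surface form, lemma) pair, that document frequencies are available as a dictionary, and that the collection contains more than one document; it also pairs every occurrence in one sentence with every same-lemma occurrence in the other, which is one possible reading of the definition of the pair set. All names and counts are hypothetical.</p>
          <preformat preformat-type="code">
```python
import math
from typing import Dict, List, Tuple

Word = Tuple[str, str]  # (surface form, lemma)


def idf(lemma: str, doc_freq: Dict[str, int], n_docs: int) -> float:
    """IDF(w) = log base |D| of (|D| / m(w, D)), with IDF = 1 when m(w, D) = 0."""
    m = doc_freq.get(lemma, 0)
    if m == 0:
        return 1.0
    return math.log(n_docs / m, n_docs)


def matched_pairs(s_e: List[Word], s_t: List[Word]) -> List[Tuple[Word, Word]]:
    """N(s_e, s_t): pairs of words, one from each sentence, sharing a lemma."""
    return [(w_e, w_t) for w_e in s_e for w_t in s_t if w_e[1] == w_t[1]]


def idf_overlap(
    s_e: List[Word], s_t: List[Word], doc_freq: Dict[str, int], n_docs: int
) -> float:
    """I_1: sum of IDF(w_e) over all matched pairs."""
    return sum(idf(w_e[1], doc_freq, n_docs) for w_e, _ in matched_pairs(s_e, s_t))


doc_freq = {"the": 1000, "plagiarism": 10}  # made-up document frequencies
s_e = [("the", "the"), ("plagiarism", "plagiarism"), ("detected", "detect")]
s_t = [("the", "the"), ("plagiarisms", "plagiarism"), ("detects", "detect")]
print(round(idf_overlap(s_e, s_t, doc_freq, n_docs=1000), 4))
```
          </preformat>
          <p>Here a word that occurs in every document gets an IDF of 0, a word found in 10 of 1000 documents gets 2/3, and the unseen lemma gets 1.</p>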
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2 TF-IDF-based similarity measure</title>
        <sec id="sec-3-2-1">
          <title>Let us define TF-IDF-based measure in the following way:</title>
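          <p>
            <disp-formula><label>(3)</label><tex-math>I_2(s_e, s_t) = \sum_{(w_e, w_t) \in N(s_e, s_t)} f(w_e, w_t)\, \mathrm{IDF}(w_e)\, \mathrm{TF}(w_t, d_t)</tex-math></disp-formula>
          </p>
          <p>
            <disp-formula><label>(4)</label><tex-math>\mathrm{TF}(w_t, d_t) = \log_{|d_t|} k(w_t, d_t),</tex-math></disp-formula>
where <inline-formula><tex-math>|d_t|</tex-math></inline-formula> is the number of words in the document <inline-formula><tex-math>d_t</tex-math></inline-formula> and <inline-formula><tex-math>k(w_t, d_t)</tex-math></inline-formula> is the number of occurrences of the
word <inline-formula><tex-math>w_t</tex-math></inline-formula> in <inline-formula><tex-math>d_t</tex-math></inline-formula>. <inline-formula><tex-math>f(w_e, w_t)</tex-math></inline-formula> is a penalty for a mismatch between the forms of <inline-formula><tex-math>w_e</tex-math></inline-formula> and <inline-formula><tex-math>w_t</tex-math></inline-formula>:
            <disp-formula><label>(5)</label><tex-math>f(w_e, w_t) = \begin{cases} 1.0, \text{ if the forms of the words are the same} \\ 0.8, \text{ otherwise} \end{cases}</tex-math></disp-formula>
          </p>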
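          <p>A sketch of equations (3) to (5) is given below. It assumes that IDF weights are precomputed and passed in as a dictionary (with 1 as the default for unseen lemmas), that the document is given as a list of lemmas, and that occurrences are counted by lemma; these choices and all names are assumptions of this illustration. Note that equation (4), taken literally, yields a TF of 0 for a word that occurs only once in the document.</p>
          <preformat preformat-type="code">
```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Word = Tuple[str, str]  # (surface form, lemma)


def tf(lemma: str, counts: Counter, doc_len: int) -> float:
    """TF(w, d) = log base |d| of k(w, d); this is 0 when the word occurs once."""
    return math.log(counts[lemma], doc_len)


def form_penalty(w_e: Word, w_t: Word) -> float:
    """f(w_e, w_t): 1.0 for identical surface forms, 0.8 otherwise."""
    return 1.0 if w_e[0] == w_t[0] else 0.8


def tfidf_similarity(
    s_e: List[Word],
    s_t: List[Word],
    idf_of: Dict[str, float],
    d_t_lemmas: List[str],
) -> float:
    """I_2: sum of f * IDF(w_e) * TF(w_t, d_t) over same-lemma word pairs."""
    counts = Counter(d_t_lemmas)
    doc_len = len(d_t_lemmas)
    return sum(
        form_penalty(w_e, w_t) * idf_of.get(w_e[1], 1.0) * tf(w_t[1], counts, doc_len)
        for w_e in s_e
        for w_t in s_t
        if w_e[1] == w_t[1]
    )


# A made-up 100-word document in which the lemma "detect" occurs 10 times
d_t = ["detect"] * 10 + ["filler"] * 90
idf_of = {"detect": 0.6}  # made-up IDF weight
different_forms = tfidf_similarity(
    [("detects", "detect")], [("detected", "detect")], idf_of, d_t
)
same_forms = tfidf_similarity(
    [("detected", "detect")], [("detected", "detect")], idf_of, d_t
)
print(round(different_forms, 6), round(same_forms, 6))
```
          </preformat>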
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>3.3 Sentence syntactic similarity measure</title>
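        <p>To measure syntactic similarity, we need to generate a syntactic dependency tree
for each sentence. We define <inline-formula><tex-math>\mathrm{Syn}(s_e)</tex-math></inline-formula> as the set of triplets <inline-formula><tex-math>(w_h, \sigma, w_d)</tex-math></inline-formula>, where
<inline-formula><tex-math>w_h</tex-math></inline-formula> and <inline-formula><tex-math>w_d</tex-math></inline-formula> are the normalized head and dependent words respectively, and <inline-formula><tex-math>\sigma</tex-math></inline-formula> is the type of the syntactic
relation. We define syntactic similarity in the following way:</p>
        <p>
          <disp-formula><label>(6)</label><tex-math>I_3(s_e, s_t) = \frac{\sum_{(w_h, \sigma, w_d) \in \mathrm{Syn}(s_e) \cap \mathrm{Syn}(s_t)} \mathrm{IDF}(w_h)}{\sum_{(w_h, \sigma, w_d) \in \mathrm{Syn}(s_e)} \mathrm{IDF}(w_h)}</tex-math></disp-formula>
        </p>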
        <p>
          The rationale for using syntactic information is to treat sentences not as bags of words
but as syntactically linked text. The syntactic similarity will be low for sentences that
use the same words but link them in a different way. This measure is
quite similar to the one described in [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
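        <p>Equation (6) can be sketched as a set operation on dependency triplets. The triplet contents, the relation labels, and the IDF values below are made up for illustration, and returning 0 for a sentence without triplets is an assumption that the text does not specify.</p>
        <preformat preformat-type="code">
```python
from typing import Dict, Set, Tuple

Triplet = Tuple[str, str, str]  # (head lemma, relation type, dependent lemma)


def syntactic_similarity(
    syn_e: Set[Triplet], syn_t: Set[Triplet], idf_of: Dict[str, float]
) -> float:
    """I_3: IDF mass of the shared triplets relative to all triplets of s_e."""
    denominator = sum(idf_of.get(head, 1.0) for head, _, _ in syn_e)
    if denominator == 0:
        return 0.0  # assumption: no usable triplets means no syntactic similarity
    shared = syn_e.intersection(syn_t)
    numerator = sum(idf_of.get(head, 1.0) for head, _, _ in shared)
    return numerator / denominator


syn_e = {("detect", "dobj", "plagiarism"), ("plagiarism", "amod", "severe")}
syn_t = {("detect", "dobj", "plagiarism"), ("plagiarism", "amod", "minor")}
idf_of = {"detect": 0.5, "plagiarism": 0.75}  # made-up IDF weights
print(round(syntactic_similarity(syn_e, syn_t, idf_of), 6))
```
        </preformat>
        <p>Only the first triplet is shared, so the result is 0.5 / (0.5 + 0.75) = 0.4.</p>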
      </sec>
      <sec id="sec-3-4">
        <title>3.4 Overall sentence similarity</title>
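        <p>We define the overall sentence similarity as a linear combination of the measures described above:
          <disp-formula><label>(7)</label><tex-math>Sim(s_e, s_t) = W_{Idf}\, I_1(s_e, s_t) + W_{TfIdf}\, I_2(s_e, s_t) + W_{Synt}\, I_3(s_e, s_t),</tex-math></disp-formula>
where <inline-formula><tex-math>W_{Idf}</tex-math></inline-formula>, <inline-formula><tex-math>W_{TfIdf}</tex-math></inline-formula>, and <inline-formula><tex-math>W_{Synt}</tex-math></inline-formula> determine the relative contribution of each measure. If
<inline-formula><tex-math>Sim(s_e, s_t) &gt; MinSim</tex-math></inline-formula>, then the suspicious sentence is considered plagiarised.
The process of tuning these four parameters is described in section 4.</p>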
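        <p>The combination in equation (7) and the threshold test can be sketched as follows. The weights, the threshold, and the three measure values used in the example are made up for illustration; they are not the tuned parameters.</p>
        <preformat preformat-type="code">
```python
def overall_similarity(
    i1: float, i2: float, i3: float,
    w_idf: float, w_tfidf: float, w_synt: float,
) -> float:
    """Sim = W_Idf * I_1 + W_TfIdf * I_2 + W_Synt * I_3."""
    return w_idf * i1 + w_tfidf * i2 + w_synt * i3


def is_plagiarised(sim: float, min_sim: float) -> bool:
    """A suspicious sentence is flagged when Sim exceeds MinSim."""
    return sim > min_sim


# Illustrative values only
sim = overall_similarity(0.5, 0.2, 0.4, w_idf=0.4, w_tfidf=0.4, w_synt=0.2)
print(round(sim, 6), is_plagiarised(sim, min_sim=0.3))
```
        </preformat>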
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Experiments</title>
      <p>
        Experiments were run on the training corpus provided by the PAN organizers (about 100
documents of Webis-TRC-12 [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]). The source retrieval method described in section 2
is highly tunable and has multiple parameters. The parameters involved in query generation
(the number of words and phrases per query and the document chunking parameters) were tuned
separately: one parameter was varied while the others were fixed. The values that performed well in terms
of the F-measure were chosen for further experiments. The ChatNoir oracle was used
to evaluate source retrieval. The resulting parameters are shown in Table 1. The
experiments showed that using only words for candidate retrieval decreased recall by
over 30%.
      </p>
      <p>The method for measuring sentence similarity also has several tunable parameters.
To tune them, we fixed the parameters involved in query
generation. All possible snippets and full document texts were fetched for the queries
formulated with the fixed parameters. The downloaded data were
preprocessed with Freeling and prepared for loading by our software. Using these preprocessed
data, we tried multiple combinations of the <inline-formula><tex-math>MinSim</tex-math></inline-formula>, <inline-formula><tex-math>W_{Idf}</tex-math></inline-formula>, <inline-formula><tex-math>W_{TfIdf}</tex-math></inline-formula>, and <inline-formula><tex-math>W_{Synt}</tex-math></inline-formula> parameters
and chose the combination that gave the best F-measure. The resulting
parameters are shown in Table 2.</p>
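      <p>Trying multiple parameter combinations over cached data can be organized as an exhaustive grid search, sketched below. The text does not state how the combinations were enumerated, so the grid search itself, the candidate values, and the toy scoring function are assumptions; in practice <monospace>evaluate</monospace> would compute the F-measure on the preprocessed data.</p>
      <preformat preformat-type="code">
```python
from itertools import product
from typing import Callable, Dict, List, Tuple

Params = Dict[str, float]


def grid_search(
    evaluate: Callable[[Params], float], grid: Dict[str, List[float]]
) -> Tuple[float, Params]:
    """Evaluate every combination in the grid and return the best (score, params)."""
    names = list(grid)
    best_score, best_params = float("-inf"), {}
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_score, best_params = score, params
    return best_score, best_params


def toy_f_measure(p: Params) -> float:
    """Stand-in for a real evaluation; peaks at min_sim=0.5, w_idf=0.4."""
    return 1.0 - abs(p["min_sim"] - 0.5) - abs(p["w_idf"] - 0.4)


grid = {"min_sim": [0.3, 0.5, 0.7], "w_idf": [0.2, 0.4]}  # illustrative values
best_score, best_params = grid_search(toy_f_measure, grid)
print(best_score, best_params)
```
      </preformat>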
      <p>As can be seen from Table 3, our software achieved an F1 score of 0.45, the
second highest among all participants.</p>
      <p>On average, 37.03 queries were submitted per suspicious document and 18.62
results were downloaded. Both numbers were relatively
low in comparison with the other software that participated in PAN. On average, only 37 of
the 83 selected sentences were transformed into queries. Such heavy
filtering was crucial for achieving relatively high precision: we found experimentally
that query filtering decreased recall but significantly increased
precision. According to the results, our software downloaded 18.62 × 0.54 ≈ 10.05 true
positives per suspicious document on average, so roughly 37.03 / 10.05 ≈ 3.7 queries
were needed per retrieved true positive. This result suggests that using phrases
for candidate retrieval is reasonable and effective. However, the number of queries
until the first detection was 5.4, so ranking sentences by weight was apparently
not the best strategy. Nevertheless, the indicators that measure time to first detection
were also low in comparison with the other participants' results. On average, 2.25 full
texts were downloaded until the first correct source document was found, which suggests that
snippet filtering based on the sentence similarity measure worked relatively well.</p>
      <p>No plagiarism sources were detected for 3 suspicious documents, about
3% of the suspicious documents in the test corpus. This result shows that the software
is able to retrieve sources of plagiarism for the majority of documents.</p>
    </sec>
    <sec id="sec-5">
      <title>6 Conclusions</title>
      <p>This paper describes a source retrieval method that achieves relatively high
retrieval performance while minimizing the workload. This is made possible by
two techniques: phrasal search for retrieving candidates and a
sentence similarity measure for filtering them.</p>
      <p>Many search engines support phrasal search to some extent. Some of them (e.g.
Yandex, Yahoo) provide proximity search, which makes it possible to search for phrases,
while others (e.g. Google) provide exact phrase search.</p>
      <p>Filtering based on the sentence similarity measure works well when a
snippet is an original sentence (or part of one). If a snippet consists of heavily
overlapping fragments of one sentence separated by a delimiter, sentence comparison is
hardly useful. We occasionally encountered this issue during experiments
with ChatNoir snippets. We believe, however, that snippets provided by popular
search engines (e.g. Google, Yandex) contain the original sentences, so we expect the results
of our method to be reproducible in a real-world environment.</p>
      <p>There is still room for improvement. It is probably worth
restricting the set of phrases to collocations. The rationale is that
an ordinary phrase is easily altered by replacing its head or dependent word with a
synonym, whereas a single word in a collocation cannot simply be replaced, because the phrase
would lose its meaning. One therefore has to either replace the whole collocation with a
synonymous expression or leave it as it is.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>We gratefully acknowledge the support of the Russian Foundation for Basic Research under
grant No. 14-07-31149, and we thank the PAN Lab organizers for their effort in organizing PAN
2014.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Atserias</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Casas</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Comelles</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>González</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Padró</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Padró</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>FreeLing 1.3: Syntactic and semantic services in an open-source NLP library</article-title>
          .
          <source>In: Proceedings of LREC</source>
          . vol.
          <volume>6</volume>
          , pp.
          <fpage>48</fpage>
          -
          <lpage>55</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Elizalde</surname>
          </string-name>
          , V.:
          <article-title>Using statistic and semantic analysis to detect plagiarism</article-title>
          . In: CLEF (Online Working Notes/Labs/Workshop) (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Gollub</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Beyer</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Busse</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Recent trends in digital text forensics and its evaluation</article-title>
          .
          <source>In: Information Access Evaluation</source>
          . Multilinguality, Multimodality, and Visualization, pp.
          <fpage>282</fpage>
          -
          <lpage>302</lpage>
          . Springer (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Haggag</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>El-Beltagy</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Plagiarism candidate retrieval using selective query formulation and discriminative query scoring</article-title>
          . In: CLEF (Online Working Notes/Labs/Workshop) (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dong</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          :
          <article-title>A dependency grammar and WordNet based sentence similarity measure</article-title>
          .
          <source>Journal of Computational Information Systems</source>
          <volume>8</volume>
          (
          <issue>3</issue>
          ),
          <fpage>1027</fpage>
          -
          <lpage>1035</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Metzler</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bernstein</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Croft</surname>
            ,
            <given-names>W.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moffat</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zobel</surname>
          </string-name>
          , J.:
          <article-title>Similarity measures for tracking information flow</article-title>
          .
          <source>In: Proceedings of the 14th ACM international conference on Information and knowledge management</source>
          . pp.
          <fpage>517</fpage>
          -
          <lpage>524</lpage>
          . ACM (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gollub</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hagen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tippmann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kiesel</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Overview of the 5th International Competition on Plagiarism Detection</article-title>
          . In: Forner,
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Navigli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            ,
            <surname>Tufis</surname>
          </string-name>
          ,
          <string-name>
            <surname>D</surname>
          </string-name>
          . (eds.)
          <source>Working Notes Papers of the CLEF 2013 Evaluation Labs (Sep</source>
          <year>2013</year>
          ), http://www.clef-initiative.eu/publication/working-notes
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hagen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Graßegger</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Michel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tippmann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Welsch</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>ChatNoir: a search engine for the ClueWeb09 corpus</article-title>
          .
          <source>In: Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval</source>
          . pp.
          <fpage>1004</fpage>
          -
          <lpage>1004</lpage>
          . ACM (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hagen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Völske</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Crowdsourcing interaction logs to understand text reuse from the web</article-title>
          . In: Fung,
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Poesio</surname>
          </string-name>
          , M. (eds.)
          <article-title>Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL 13)</article-title>
          . pp.
          <fpage>1212</fpage>
          -
          <lpage>1221</lpage>
          . ACL (Aug
          <year>2013</year>
          ), http://www.aclweb.org/anthology/P13-1119
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Strohman</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Metzler</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Turtle</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Croft</surname>
          </string-name>
          , W.B.:
          <article-title>Indri: A language model-based search engine for complex queries</article-title>
          .
          <source>In: Proceedings of the International Conference on Intelligent Analysis</source>
          .
          <source>vol. 2</source>
          , pp.
          <fpage>2</fpage>
          -
          <lpage>6</lpage>
          . Citeseer (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>