<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Improving Synoptic Querying for Source Retrieval</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Informatics, Masaryk University</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <abstract>
        <p>Source retrieval is a part of the plagiarism discovery process in which only a selected set of candidate documents is retrieved from a large corpus of potential source documents and passed on for detailed document comparison in order to highlight potential plagiarism. This paper describes the methodology and the architecture of a source retrieval system developed for the PAN 2015 lab on uncovering plagiarism, authorship and social software misuse. The system is based on our previous systems used at PAN since 2012. The paper also discusses query performance and explains many of the implementation settings. During the official test run, the proposed methodology achieved the highest recall while using the fewest queries among the PAN 2015 software submissions. The source retrieval subsystem forms an integral part of a modern system for plagiarism discovery.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In plagiarism detection, source retrieval is the task in which an automated system
retrieves candidate documents that may have served as a source for plagiarism from a
large document collection and passes them on for further inspection [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
Further inspection means evaluating document similarities in detail, which,
however, can be done only for documents that are known to the system. The similarities
of interest are mainly textual, so the text of the suspicious document is usually aligned
with each document in the system’s document base. If the collection of potential
source documents is too large, it is infeasible to calculate detailed similarities for
every document pair from that collection; therefore, only a small selected subset of
potential sources is retrieved for each input suspicious document. The whole collection
of possible source documents is an unknown environment for the plagiarism detection
system, so the retrieval is carried out by means of a search engine.
      </p>
      <p>
        This paper describes the key aspects and the main changes in the source retrieval
methodology from the system used at PAN 2014 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], which is described in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. An analysis
of query performance is also provided.
      </p>
      <p>The task for the PAN lab was to retrieve all plagiarized documents, based on a
suspicious document, from the reference document collection by utilizing a web search
engine, while minimizing the retrieval costs.</p>
      <p>
        For the source retrieval, based on each suspicious document the prepared queries
were passed to search engines according to their type. The synoptic queries were passed
to ChatNoir [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and the phrasal queries to Indri [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Both search engines index ClueWeb09<sup>1</sup>,
which constituted the corpus of potential source documents. Afterwards, the search
engine results were examined, and if a passage similar to the suspicious document was
found, the result was reported as a candidate source of plagiarism.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Building of Queries</title>
      <p>
        The system prepared several types of queries for each suspicious document, which we
divide into two main groups: the whole-document keywords-based queries and the
paragraph-based queries. For query construction, a weight <inline-formula><tex-math>w_i</tex-math></inline-formula> was assigned to each
term <inline-formula><tex-math>k_i</tex-math></inline-formula> from the input document <inline-formula><tex-math>d_j</tex-math></inline-formula>. A term is a word extracted from the
input text using blank spaces as the separator between two words, and cleaned of all
punctuation. The weights follow the TF-IDF weighting scheme [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. As a reference corpus
for the weight calculation, an English web-based corpus containing more than 4
billion tokens was used<sup>2</sup>. For each term, its lemma <inline-formula><tex-math>l_i</tex-math></inline-formula> was also extracted using the Python
NLTK<sup>3</sup> lemmatizer, which was run on the extracted sentences.
Ambiguous terms were left in their original form. Each document <inline-formula><tex-math>d_j</tex-math></inline-formula> was then represented as
a sequence of triples <inline-formula><tex-math>(k_i, w_i, l_i)</tex-math></inline-formula>, where <inline-formula><tex-math>i \in [1, |d_j|]</tex-math></inline-formula>.
      </p>
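      <p>The term weighting and the triple representation described above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the system's actual implementation: the text does not give the exact TF-IDF variant, so the smoothed formula tf · log((1 + N) / (1 + df)) is an assumption, the reference-corpus statistics <monospace>doc_freq</monospace> and <monospace>n_docs</monospace> are hypothetical inputs, and the lemmatizer is a pluggable callable that defaults to the identity.</p>
      <preformat><![CDATA[
```python
import math
import string
from collections import Counter

# Translation table that deletes ASCII punctuation (typographic quotes
# would need to be added for real-world text).
_PUNCT = str.maketrans("", "", string.punctuation)


def tokenize(text):
    """Split on whitespace and remove punctuation from every word."""
    words = (w.translate(_PUNCT) for w in text.split())
    return [w for w in words if w]


def tfidf_weights(tokens, doc_freq, n_docs):
    """Return {term: weight} with weight = tf * log((1 + N) / (1 + df)).

    doc_freq and n_docs are hypothetical reference-corpus statistics;
    the smoothing is an assumption made for this sketch.
    """
    tf = Counter(t.lower() for t in tokens)
    return {
        term: count * math.log((1 + n_docs) / (1 + doc_freq.get(term, 0)))
        for term, count in tf.items()
    }


def represent(text, doc_freq, n_docs, lemmatize=lambda w: w):
    """Represent a document as a list of (k_i, w_i, l_i) triples,
    one triple per token position."""
    tokens = tokenize(text)
    weights = tfidf_weights(tokens, doc_freq, n_docs)
    return [(t, weights[t.lower()], lemmatize(t)) for t in tokens]
```
]]></preformat>
      <p>A real lemmatizer, such as the one shipped with NLTK, could be passed in place of the identity default; terms it cannot resolve would simply be returned unchanged.</p>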
      <sec id="sec-2-1">
        <title>Paragraph-based Queries</title>
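        <p>The scored suspicious document was also divided into paragraph-like chunks, using an
empty line (occurring in its plaintext format) as the chunk separator. From each
chunk <inline-formula><tex-math>c_i</tex-math></inline-formula> a single query was prepared. Let <inline-formula><tex-math>\mathrm{beg}(c_i)</tex-math></inline-formula> denote the position of the first word
of the chunk <inline-formula><tex-math>c_i</tex-math></inline-formula> in the input file and let <inline-formula><tex-math>\mathrm{end}(c_i)</tex-math></inline-formula> denote the position of the last word of the
chunk <inline-formula><tex-math>c_i</tex-math></inline-formula> in the input file. The paragraph-based query for <inline-formula><tex-math>c_i</tex-math></inline-formula> consisted of 10 words
<inline-formula><tex-math>k_i \in s_i</tex-math></inline-formula> with maximal <inline-formula><tex-math>\sum_{j=1}^{10} w_{i_j}</tex-math></inline-formula>, where <inline-formula><tex-math>s_i</tex-math></inline-formula> denotes the interval <inline-formula><tex-math>[\mathrm{beg}(c_i), \mathrm{end}(c_i)]</tex-math></inline-formula>.
Ten tokens is the maximum length of a query for the ChatNoir search engine. The
maximum length was chosen in order to produce the most specific query for the given
paragraph, which should maximize the probability of retrieving texts containing similar
paragraphs. The query was constructed from tokens that might be scattered over the
whole chunk; therefore, it cannot be used as a phrase query. Conversely, due to its
specificity, it is hardly usable as a synoptic query or a theme-related keywords-based
query.</p>
        <p><sup>1</sup> http://www.lemurproject.org/clueweb09.php/</p>
        <p><sup>2</sup> http://www.sketchengine.co.uk/documentation/wiki/Corpora/TenTen/enTenTen</p>
        <p><sup>3</sup> http://www.nltk.org/</p>
        <p><sup>4</sup> Phrasal queries were an exception with respect to query length.</p>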
        <p>Paragraph-based queries were passed to ChatNoir. Each paragraph-based query was
associated with its interval <inline-formula><tex-math>s_i</tex-math></inline-formula>, which denotes the position of the query in the file. The query is
said to characterize the text within its interval.</p>
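        <p>The construction of paragraph-based queries can be illustrated with the following Python sketch. It rests on several assumptions that the text does not settle: positions are taken as character offsets, the ten words are required to be distinct, ties are broken alphabetically, and <monospace>weight</monospace> is a hypothetical callable returning the score of a word.</p>
        <preformat><![CDATA[
```python
import re

MAX_QUERY_TOKENS = 10  # maximum ChatNoir query length stated in the text

# A chunk is a run of consecutive non-empty lines.
_CHUNK = re.compile(r"[^\n]+(?:\n[^\n]+)*")


def paragraph_queries(text, weight):
    """Return a list of (query_terms, (beg, end)), one per chunk.

    weight: callable mapping a word to its score (hypothetical helper).
    beg/end are character offsets of the first and last word of the
    chunk; using characters rather than word indices is an assumption.
    """
    queries = []
    for chunk in _CHUNK.finditer(text):
        words = list(re.finditer(r"\S+", chunk.group()))
        if not words:
            continue  # the chunk contained only whitespace
        beg = chunk.start() + words[0].start()
        end = chunk.start() + words[-1].end()
        # Distinct words of the chunk, stripped of surrounding punctuation.
        distinct = {w.group().strip(".,;:!?\"'()") for w in words}
        distinct.discard("")
        # Taking the highest-weighted words maximizes the sum of weights.
        top = sorted(distinct, key=lambda t: (-weight(t), t))
        queries.append((top[:MAX_QUERY_TOKENS], (beg, end)))
    return queries
```
]]></preformat>
        <p>Because the selected words may come from anywhere in the chunk, the resulting term list is order-free, which matches the remark that such a query cannot serve as a phrase query.</p>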
      </sec>
      <sec id="sec-2-2">
        <title>Queries Scheduling</title>
        <p>In order to acquire maximum information from the top-scored keywords, they were
combined into several different types of queries. They appeared in the pilot query and
in the collocational queries, and they may also have appeared in paragraph-based queries.
Apart from this appearance of the top-scored keywords in differently formulated
queries, no further query reformulation, such as reformulation based on the results, was
applied.</p>
        <p>Queries were scheduled for execution sorted by their priority, starting with the
pilot query, next the collocational phrase queries, the collocational synoptic queries,
afterwards the queries constructed from remaining keywords if any, and lastly all the
paragraph-based queries.</p>
        <p>The paragraph-based queries were executed on demand according to their position,
if and only if the query position interval did not intersect any of the
intervals of the similarities found so far.</p>
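        <p>The scheduling order and the on-demand execution of paragraph-based queries can be sketched as follows. The query-type names and the <monospace>execute</monospace> callback are hypothetical; the callback stands in for the whole search, download and alignment pipeline and is assumed to return the similarity intervals it found in the suspicious document.</p>
        <preformat><![CDATA[
```python
# Priority order taken from the text; the key names are illustrative.
PRIORITY = {
    "pilot": 0,
    "collocational_phrase": 1,
    "collocational_synoptic": 2,
    "other_keywords": 3,
    "paragraph": 4,
}


def intersects(a, b):
    """True if the closed intervals a=(beg, end) and b=(beg, end) overlap."""
    return a[0] <= b[1] and b[0] <= a[1]


def run_schedule(queries, execute):
    """Execute queries by priority, skipping covered paragraph queries.

    queries: list of dicts with keys "type" and "terms" and, for
             paragraph-based queries, "interval".
    execute: callable(query) -> list of similarity intervals
             (a stand-in for search + download + text alignment).
    Returns the list of queries that were actually executed.
    """
    found = []     # similarity intervals discovered so far
    executed = []
    # sorted() is stable, so queries of one type keep their input order.
    for q in sorted(queries, key=lambda q: PRIORITY[q["type"]]):
        if q["type"] == "paragraph" and any(
            intersects(q["interval"], s) for s in found
        ):
            continue  # paragraph already lies in a detected similarity
        found.extend(execute(q))
        executed.append(q)
    return executed
```
]]></preformat>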
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results Downloading and Assessing</title>
      <p>
        At most 100 results returned by the search engines were processed for each
query. Only selected results were downloaded and textually aligned with the suspicious
document. The decision whether to download a result was based on the result’s
snippets, each 500 characters long, which were generated for each token of the query and
concatenated into one text chunk. If this chunk showed promising similarity with the
suspicious document, the result was downloaded. This decision procedure was adopted
from our implementations of previous years; for more information see [
        <xref ref-type="bibr" rid="ref10 ref12">10,12</xref>
        ].
      </p>
      <p>
        Each downloaded document was thoroughly compared with its suspicious
document by calculating common features [
        <xref ref-type="bibr" rid="ref11 ref12">11,12</xref>
        ] based on word n-grams and stop-word
m-grams [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The common features formed valid intervals, that is, intervals covered "densely
enough" by the features. Two valid intervals were merged if they were closer than 81
characters, which approximates the length of a text line. Each resulting valid interval
represented one plagiarism case. Such an interval <inline-formula><tex-math>s_{\mathit{res}}</tex-math></inline-formula> was marked in the suspicious
document, and all waiting paragraph-based queries for which <inline-formula><tex-math>s_i \cap s_{\mathit{res}} \neq \emptyset</tex-math></inline-formula> were excluded
from the queue of prepared queries.
      </p>
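      <p>The merging of valid intervals can be illustrated with a short Python sketch. It assumes that intervals are (beg, end) pairs of character offsets and that "closer than 81 characters" refers to the gap between the end of one interval and the start of the next; the function name is illustrative only.</p>
      <preformat><![CDATA[
```python
MAX_GAP = 81  # characters; the text uses it as an estimate of a line length


def merge_intervals(intervals, max_gap=MAX_GAP):
    """Merge (beg, end) intervals whose gap is smaller than max_gap.

    Each interval in the returned list would represent one plagiarism case.
    """
    merged = []
    for beg, end in sorted(intervals):
        if merged and beg - merged[-1][1] < max_gap:
            # Close enough to the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((beg, end))
    return merged
```
]]></preformat>
      <p>For instance, intervals ending at offset 100 and starting at offset 180 are 80 characters apart and would be merged, whereas a gap of exactly 81 characters would keep them separate.</p>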
    </sec>
    <sec id="sec-4">
      <title>Method Assessment</title>
      <p>
        As a training corpus for the source retrieval task, the task organizers provided 98
documents written in English, which contained plagiarized passages from web pages retrieved
from the ClueWeb09 document collection. Most of the documents were highly
plagiarized; only one document in the corpus was free of plagiarism, and it was a short
document containing only a single paragraph of 204 words. Each document was about
a specific topic and the documents were created manually [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The plagiarism cases for
each document were annotated. The plaintexts were 30 KB in size on average, and
each document contained around five thousand words.
      </p>
      <p>During the training phase evaluation, 32.9 queries per document were executed on
average over all 98 documents in the training corpus, of which 18.8% were
directed to Indri and 81.2% to ChatNoir. In total, 134247 unique URLs were retrieved,
given that each query asked the search engine for 100 results.</p>
      <p>The assumption behind query preparation was that the paragraph-based queries were
more specific and would thus lead to fewer returned URLs. On the other hand,
synoptic queries, such as the pilot query and other keywords-based queries, should retrieve
more results, provided the corpus is large enough. We can confirm this assumption by
measuring the number of results returned by the search engine per query type.
However, the query generality is limited by the maximum number of results that we
requested from the search engine. The specificity and generality of each query type are shown in
Tab. 1. The column Scope Usage shows the average over all queries of one type,
expressed as a percentage of the potential maximum number of retrieved results. The
table also shows the portion of queries for which the search engines retrieved the
maximum allowed number of results (Top Retrieval), which indicates that these queries were
general enough under the given conditions. The last column shows the portion of queries for
which the search engine returned zero results. Table 1 supports the assumption that
paragraph-based queries were more biased in retrieval towards their paragraph text.</p>
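      <p>The three per-type measures described above (Scope Usage, Top Retrieval and the share of queries with zero results) can be computed as in the following sketch. The function name and the input format, a list with the number of results returned for each executed query of one type, are assumptions made for illustration.</p>
      <preformat><![CDATA[
```python
def query_type_stats(result_counts, max_results=100):
    """Summarize the result counts of one query type.

    result_counts: number of results returned for each executed query.
    Returns (scope_usage, top_retrieval, zero_results) as percentages:
      scope_usage   - returned results relative to the potential maximum,
      top_retrieval - queries that returned the maximum allowed number,
      zero_results  - queries that returned nothing.
    """
    n = len(result_counts)
    if n == 0:
        raise ValueError("no queries of this type were executed")
    scope_usage = 100.0 * sum(result_counts) / (n * max_results)
    top_retrieval = 100.0 * sum(c >= max_results for c in result_counts) / n
    zero_results = 100.0 * sum(c == 0 for c in result_counts) / n
    return scope_usage, top_retrieval, zero_results
```
]]></preformat>
      <p>A highly specific query type would be expected to show low Scope Usage and a larger share of zero-result queries, while a general type would approach 100% in the first two measures.</p>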
      <p>The number of results for general queries has little informational value. If the query is
too general, the search engine returns a huge number of results. Therefore, the synoptic
query construction must lead to a large number of relevant results.
Relevant results can be defined by their purpose, for example, as documents following
a specific topic or documents similar to the suspicious document to some extent. We define a result
as relevant if its text alignment with the suspicious document produces one or more valid
intervals, meaning that the two documents contain a textually similar passage.</p>
      <p>Of all the retrieved results, 6392 were found to be relevant. Note that not
all results were downloaded; therefore, some similarities might have been missed
at the download decision. Of all the discovered URLs, 32538 were actually
downloaded and textually aligned with the corresponding suspicious document. Relevant
documents were found for all but one of the suspicious documents that contained plagiarism.</p>
      <p>Table 2 shows the performance of queries by their type<sup>5</sup>. The third and fourth
columns show the number of successful results and the coverage of results for the given
query type, respectively. Each relevant URL was counted in the total number of
successful results only once, but some queries led to the retrieval of already discovered
results. Therefore, in order to make the evaluation of the coverage independent
of the query execution sequence, each successful result is credited to all queries that
retrieved it. Table 2 shows that the portion of relevant URLs retrieved by each of the
pilot, phrasal and paragraph-based queries approaches one half of all
relevant retrieved URLs, but phrasal and paragraph-based queries needed nearly 3 times
and 11 times more searches than the pilot queries, respectively. The fifth column of
Tab. 2 shows the average number of hits per query, which supports the assumption that the
pilot query is the most important and is the best choice to start the synoptic search
with, in order to cover the majority of plagiarism as quickly as possible.</p>
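      <p>The crediting scheme used for the coverage, in which each relevant URL counts once in the total but is credited to every query type that retrieved it, can be sketched as follows. The function name and the input format of (query type, URL) pairs are assumptions for illustration.</p>
      <preformat><![CDATA[
```python
from collections import defaultdict


def coverage_by_type(hits):
    """Compute per-type counts and coverage of relevant URLs.

    hits: iterable of (query_type, url) pairs, one for every time a
          query retrieved a relevant URL (repeats are expected).
    Returns {query_type: (n_urls, share_of_all_relevant_urls)}.
    """
    urls_by_type = defaultdict(set)
    all_urls = set()
    for query_type, url in hits:
        urls_by_type[query_type].add(url)  # credit every type that found it
        all_urls.add(url)                  # but count each URL once overall
    total = len(all_urls)
    return {t: (len(u), len(u) / total) for t, u in urls_by_type.items()}
```
]]></preformat>
      <p>Because a URL may be credited to several types, the shares can sum to more than one, which is consistent with several query types each covering close to half of the relevant URLs.</p>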
      <p>The paragraph-based queries have a relatively low yield of relevant results per
query, which is due to their specificity (Tab. 1), but they can cover a large portion of
successful results. Therefore, it may be beneficial to skip these queries for some parts
of input documents, e.g. parts where plagiarism was already discovered, and use them
to aim the search at more suspicious parts, for example, parts selected using
intrinsic plagiarism detection methods. Both tables show that 2109 paragraph-based queries were
executed; however, the total number of prepared paragraph-based queries was 6693,
meaning that 68.5% of such queries were omitted due to their position inside an
already discovered interval of textual similarity.</p>
      <p><sup>5</sup> For all 98 suspicious documents, only 183 pilot queries were executed, despite
the fact that the pilot query should have been processed by both the ChatNoir and Indri search
engines. For 13 suspicious documents, the pilot query was processed by only one search engine:
12 queries were missed in Indri and one query in ChatNoir because of a timeout. The
search engines were used in standard operation over the network with the timeout set to
8 minutes.</p>
      <p>Since the pilot query yields 15.4 relevant results per query, it is clear that a search
engine should be asked to retrieve at least tens of results for this query type.
However, the number of results influences not only the recall of the system, but also its time and
space requirements. Relevant URLs were also retrieved from very high
rank positions on the Search Engine Result Page (SERP). Similarities
were found even among results ranked beyond the 100th position for a single query. The limit of 100
results was set because of the time consumed by checking the URLs. One of the master hits (see
below) was retrieved from the SERP’s 100th position. Figure 1 depicts the total number of
relevant URLs retrieved at the first 20 positions<sup>6</sup> of the SERP over all queries.</p>
      <p>Surprisingly poor performance compared to the other types can be observed for the other
keywords-based queries (see both Tab. 1 and Tab. 2), which indicates that extracting
fewer than 10 quality keywords is sufficient for such texts. Table 1 shows that their scope
was around half of the requested results, despite the fact that they were intended for synoptic
theme-related searches. Table 2 indicates that they covered only 6.3% of all discovered
similarities; on the other hand, they were the least numerous type
of queries.</p>
      <p>
        The discovered relevant URLs contained some portion of text similar to the
suspicious document. In the ClueWeb09 corpus, many texts are reused, as on the real web;
therefore, many web pages may be classified as near-duplicates [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and many
documents just contain smaller or larger passages identical to other web pages. The retrieved
relevant results are among those cases. However, the source of each plagiarism case in the
corpus of suspicious documents was annotated with the original web page from which
the text was reused. We call the retrieval of such an original document a master hit.
      </p>
      <p><sup>6</sup> The 20 best-ranked URLs for each query.</p>
      <p>[Figure 2: two plots of detection progress over the scheduled queries; the x axis is labelled Queries, and vertical separators are labelled by query type (Pilot, Collocational Phrasal, Collocational, Other Keywords-based, Paragraph-based).]</p>
      <p>
Taking into account only master hits<sup>7</sup>, the overall recall for the whole input corpus
was 0.45, with 5 documents having 100% recall and 12 documents without a master hit.
We consider this performance a very good result, given that no near-duplicates
were taken into account. Figure 2 shows the progress of detection during the scheduled
querying for two selected<sup>8</sup> documents. The y axis shows the percentage of retrieved
relevant documents. The portion of queries covered by each query type is delimited
by vertical separator lines labelled with the type. The number of queries to the first detection
is also evaluated in PAN; this is the job of the pilot queries, which should lead to
positive results within the very first two queries. From the right plot of Fig. 2, it can be
seen that paragraph-based queries can strongly support the detection if few similarities
were discovered using the previous types of queries. In a real-world situation, where
the documents are expected to contain less plagiarism, we would try to lower the number of
executed paragraph-based queries by using methods that detect suspicious parts of the input
documents, and schedule only the paragraph-based queries located in those parts. The
left plot of Fig. 2 shows pilot and phrasal queries as the most profitable, which was the case
for most documents.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>
        This paper described the architecture of the PAN 2015 software for source retrieval in a
plagiarism detection task. The key settings were discussed and analyses of the settings
were provided. The software was based on previous versions used at PAN since 2012 [
        <xref ref-type="bibr" rid="ref10 ref11 ref12">11,12,10</xref>
        ];
this paper also described the changes made for the 2015 PAN lab.
      </p>
      <p>For the source retrieval, queries of several types were prepared for each suspicious
document: keywords-based queries, divided further into the pilot, phrasal collocational,
collocational, and other keywords-based queries; and paragraph-based queries, which were
associated with the position in the suspicious document of the paragraph they characterized.
Queries were executed sequentially and all results from each query were evaluated, in
order to skip the paragraph-based queries for whose paragraphs a similarity
was already detected. The final results, which contained valid intervals shared with the suspicious
document, were reported. The pilot queries proved to be the best choice for synoptic
search, and the paragraph-based queries performed well in positional
retrieval, which is biased towards searching for specific short texts.</p>
      <p><sup>7</sup> The master hits analysis is included because of the low precision that the system achieved
in the test phase of the lab. In the real world, an anti-plagiarism system must give the user
the possibility of examining relatively small textual similarities.</p>
      <p><sup>8</sup> The documents for which the highest number of distinct relevant URLs was retrieved.</p>
      <p>
        The retrieval recall in the lab’s official test run has increased compared to the
previous year, but so has the total number of queries used. However, the proposed
methodology achieved the highest recall while using the fewest queries among the
PAN 2015 software submissions during the official test run. The discussion and evaluation of PAN
can be found in the overview paper by the lab organizers [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Hagen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Source Retrieval for Plagiarism Detection from Large Web Corpora: Recent Approaches</article-title>
          .
          <source>In: Working Notes Papers of the CLEF 2015 Evaluation Labs. CEUR Workshop Proceedings, CLEF and CEUR-WS.org (Sep</source>
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gollub</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hagen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kiesel</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Michel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oberländer</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tippmann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barrón-Cedeño</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gupta</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Overview of the 4th International Competition on Plagiarism Detection</article-title>
          .
          <source>In: CLEF 2012 Evaluation Labs and Workshop</source>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hagen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Beyer</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Busse</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tippmann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Overview of the 6th International Competition on Plagiarism Detection</article-title>
          . In: Working Notes for CLEF 2014 Conference, Sheffield, UK,
          <source>September 15-18</source>
          ,
          <year>2014</year>
          . pp.
          <fpage>845</fpage>
          -
          <lpage>876</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hagen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Graßegger</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Michel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tippmann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Welsch</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>ChatNoir: A Search Engine for the ClueWeb09 Corpus</article-title>
          . In:
          <string-name>
            <surname>Hersh</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Callan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maarek</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sanderson</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (eds.) 35th
          <source>International ACM Conference on Research and Development in Information Retrieval (SIGIR 12)</source>
          . p.
          <fpage>1004</fpage>
          .
          <publisher-name>ACM</publisher-name>
          (Aug
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hagen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Völske</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Crowdsourcing Interaction Logs to Understand Text Reuse from the Web</article-title>
          .
          <source>In: ACL (1)</source>
          . pp.
          <fpage>1212</fpage>
          -
          <lpage>1221</lpage>
          . Association for Computational Linguistics (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Salton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>On the Specification of Term Values in Automatic Indexing</article-title>
          .
          <source>Tech. Rep. TR-73-173</source>
          , Cornell University (Ithaca, NY, US) (
          <year>1973</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Plagiarism detection using stopword n-grams</article-title>
          .
          <source>JASIST</source>
          <volume>62</volume>
          (
          <issue>12</issue>
          ),
          <fpage>2512</fpage>
          -
          <lpage>2527</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Strohman</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Metzler</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Turtle</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Croft</surname>
            ,
            <given-names>W.B.</given-names>
          </string-name>
          :
          <article-title>Indri: A Language-model Based Search Engine for Complex Queries</article-title>
          .
          <source>Tech. rep., in Proceedings of the International Conference on Intelligent Analysis</source>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Suchomel</surname>
            ,
            <given-names>Š.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brandejs</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Approaches for Candidate Document Retrieval</article-title>
          .
          <source>In: Information and Communication Systems (ICICS)</source>
          ,
          <year>2014</year>
          5th International Conference on. pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          . IEEE, Irbid
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Suchomel</surname>
            ,
            <given-names>Š.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brandejs</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Heterogeneous Queries for Synoptic and Phrasal Search</article-title>
          . In: Working Notes for CLEF 2014 Conference, Sheffield, UK,
          <source>September 15-18</source>
          ,
          <year>2014</year>
          . pp.
          <fpage>1017</fpage>
          -
          <lpage>1020</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Suchomel</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kasprzak</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brandejs</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Three way search engine queries with multi-feature document comparison for plagiarism detection</article-title>
          .
          <source>In: CLEF 2012 Evaluation Labs and Workshop</source>
          , Online Working Notes, Rome, Italy,
          <source>September 17-20</source>
          ,
          <year>2012</year>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Suchomel</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kasprzak</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brandejs</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Diverse Queries and Feature Type Selection for Plagiarism Discovery</article-title>
          . In: Working Notes for CLEF 2013 Conference , Valencia, Spain,
          <source>September 23-26</source>
          ,
          <year>2013</year>
          . (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>