        Overview of the 6th International Competition on
                      Plagiarism Detection

               Martin Potthast,1 Matthias Hagen,1 Anna Beyer,1 Matthias Busse,1
                      Martin Tippmann,1 Paolo Rosso,2 and Benno Stein1
         1
             Web Technology & Information Systems, Bauhaus-Universität Weimar, Germany
             2
               Natural Language Engineering Lab, Universitat Politècnica de València, Spain

                              pan@webis.de         http://pan.webis.de



             Abstract This paper overviews 17 plagiarism detectors that have been evaluated
             within the sixth international competition on plagiarism detection at PAN 2014.
             We report on their performances for the two tasks of external plagiarism detection,
             source retrieval and text alignment. For the third year in a row, we invite
             software submissions instead of run submissions, which allows for
             cross-year evaluations. Moreover, we introduce new performance measures for
             text alignment to shed light on new aspects of detection performance.


1      Introduction
Algorithms for the retrieval and extraction of text reuse from large document collections
are central to applications such as plagiarism detection, copyright protection, and infor-
mation flow analysis. They have to be able to deal with all kinds of text reuse ranging
from verbatim copies and quotations to paraphrases and translations to summaries [21].
Particularly the latter kinds of text reuse still present a formidable challenge to both en-
gineering and evaluation of retrieval and extraction algorithms. Until recently, one of the
primary obstacles to the development of new algorithms has been a lack of evaluation
resources. To rectify this, we have built a variety of high-quality, large-scale evaluation resources [29, 27], which have been employed within our annual shared tasks on plagiarism detection since 2009; this paper reports on the results of the shared task's sixth edition.1
     Since the plagiarism detection task has been running for six years in a row, we ob-
serve a multi-year life cycle within this shared task. It can be divided into three phases,
namely an innovation phase, a consolidation phase, and a production phase. In the in-
novation phase, new evaluation resources are being developed and introduced for the
first time, such as new corpora, new performance measures, and new technologies. The
introduction of new evaluation resources typically stirs up a lot of dust and is prone to
errors and inconsistencies that may spoil evaluation results to some extent. This cannot
be avoided, since only the use of new evaluation resources by many different parties
 1
     Some of the concepts found in this paper have been described earlier, so that, because of the
     inherently incremental nature of shared tasks, and in order for this paper to be self-contained,
     we reuse text from previous overview papers.




                    Figure 1. Generic retrieval process to detect plagiarism [36].

will reveal their shortcomings. Therefore, the evaluation resources are released only
sparingly so they last for the remainder of a cycle. In the consolidation phase, based
on the feedback and results obtained from the first phase, the new evaluation resources
are developed to maturity by making adjustments and fixing errors. In the production
phase, the task is repeated with few changes to allow participants to build upon and optimize against what has been accomplished, and to make the most of the prior investment in developing the new evaluation resources. Meanwhile, new ideas are being
developed to introduce further innovation.

1.1     Terminology and Related Work
Terminology. Figure 1 shows a generic retrieval process to detect plagiarism in a given
suspicious document dplg , when also given a (very large) document collection D of
potential source documents. This process is also referred to as external plagiarism de-
tection since plagiarism in dplg is detected by searching for text passages in D that are
highly similar to text passages in dplg .2 The process is divided into three basic steps,
which are typically implemented in most plagiarism detectors. First, source retrieval,
which identifies a small set of candidate source documents Dsrc ⊆ D that are likely
sources for plagiarism regarding dplg . Second, text alignment, where each candidate
source document dsrc ∈ Dsrc is compared to dplg , extracting all passages of text that
are highly similar. Third, knowledge-based post-processing, where the extracted pas-
sage pairs are cleaned, filtered, and possibly visualized for later inspection.
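The following sketch illustrates how the three steps might be wired together; the function signatures and the passage representation are our own illustrative assumptions, not part of any particular detector.

from typing import Callable, Iterable, List, Tuple

Passage = Tuple[slice, slice, str]  # (span in d_plg, span in d_src, source id)

def detect_plagiarism(
    d_plg: str,
    source_retrieval: Callable[[str], Iterable[Tuple[str, str]]],
    text_alignment: Callable[[str, str], List[Tuple[slice, slice]]],
    post_process: Callable[[List[Passage]], List[Passage]],
) -> List[Passage]:
    """Generic three-step external plagiarism detection process (cf. Figure 1)."""
    detections: List[Passage] = []
    # Step 1: retrieve a small set of likely source documents (id, text).
    for src_id, d_src in source_retrieval(d_plg):
        # Step 2: align the suspicious document with each candidate source.
        for plg_span, src_span in text_alignment(d_plg, d_src):
            detections.append((plg_span, src_span, src_id))
    # Step 3: clean, filter, and possibly merge the extracted passage pairs.
    return post_process(detections)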
Shared Tasks on Plagiarism Detection. We have organized shared tasks on plagia-
rism detection annually since 2009. In the innovation phase of our shared task at
PAN 2009 [30], we developed the first standardized evaluation framework for plagia-
rism detection [29]. This framework was consolidated in the second and third task at
PAN 2010 and 2011 [22, 23], and it has since entered the production phase while be-
ing adopted by the community. Our initial goal with this framework was to evaluate
 2
     Another approach to detect plagiarism is called intrinsic plagiarism detection, where detectors
     are given only a suspicious document and are supposed to identify text passages in it which
     deviate in style from the remainder of the document.




the process of plagiarism detection depicted in Figure 1 as a whole. We expected that
participants would implement source retrieval algorithms as well as text alignment al-
gorithms and use them as modules in their plagiarism detectors. However, the results of the innovation phase proved otherwise: participants implemented only text alignment algorithms and resorted to exhaustively comparing all pairs of documents within our evaluation corpora, even when the corpora comprised tens of thousands of documents. Therefore, upon entering the production phase after the third edition of our shared task, and because of continued interest from the research community, we rechristened the shared task text alignment and continued to offer it in the three following years at PAN 2012 to 2014.
    To establish source retrieval as a shared task of its own, we introduced it at
PAN 2012 next to the text alignment task [24], thus entering a new task life cycle for
this task. We developed a new, large-scale evaluation corpus of essay-length plagiarism
cases that have been written manually, and whose sources have been retrieved manually
from the ClueWeb corpus [27]. Given our above observation from the text alignment
task, the ClueWeb was deemed too large to be exhaustively compared to a given suspi-
cious document in a reasonable time. Furthermore, we developed a new search engine
for the ClueWeb called ChatNoir [26], which offers participants who do not wish to develop their own ClueWeb search engine a means of participation. We then offered source retrieval as an individual task based on the new evaluation resources [24, 25]; this year marks the third time we do so, and the transition of the source retrieval task into the production phase.

1.2   Contributions
Both source retrieval and text alignment are now in the production phase of their life
cycles. Therefore, we refrain from changing the existing evaluation resources too much while continuing to maintain them. Our contributions this year consist of (1) a survey of submitted approaches, which reveals new trends among participants at solving the respective tasks, and (2) an analysis of the participants' retrieval performances on a per-obfuscation basis, using new performance measures, and in direct comparison to participants from previous years.
    In this connection, our goal with both shared tasks is to automate them further. Hence, we continue to develop the TIRA evaluation platform [10, 11], which enables software submissions with minimal organizational overhead. Last year, we focused on TIRA's infrastructure, which employs virtualization technology to allow participants to use their preferred development environment, and to secure the execution of untrusted software while making the release of the test corpora unnecessary [9]. This year, we introduce a fully fledged web service as a user interface that enables participants to remotely control their evaluations on the test corpora under our supervision.


2     Source Retrieval
In source retrieval, given a suspicious document and a web search engine, the task is
to retrieve all source documents from which text has been reused whilst minimizing




Figure 2. Overview of the building blocks used in the evaluation of the source retrieval subtask.
The components are organized by the two activities corpus construction and evaluation runs (top
two rows). Both activities are based on a static evaluation infrastructure (bottom row) consisting
of an experimentation platform, web search engines, and a web corpus.

retrieval costs. The cost-effectiveness of plagiarism detectors in this task is important
since using existing search engines is perhaps the only feasible way for researchers as
well as small and medium-sized businesses to implement plagiarism detection against
the web, whereas search companies charge considerable fees for automatic usage.
    In what follows, we briefly describe the building blocks of our evaluation setup and the evaluation corpus, and we discuss the performance measures (see last year's task overview for more details on these three points [25]). We also survey the submitted softwares and, finally, report on their results in this year's setup.

2.1     Evaluation Setup
For the evaluation of source retrieval from the web, we consider the real-world scenario
of an author who uses a web search engine to retrieve documents in order to reuse
text from them in a document. A plagiarism detector typically uses a search engine,
too, to find reused sources of a given document. Over the past years, we assembled
the necessary building blocks to allow for a meaningful evaluation of source retrieval
algorithms; Figure 2 shows how they are connected. The setup was described in much
more detail in last year’s task overview [25].
    Two main components are the TIRA experimentation platform and the ClueWeb09
with two associated search engines. TIRA [10] itself consists of a number of building




blocks; one of them, depicted at the bottom left of Figure 2, facilitates both platform-independent software development and software submissions through its capability to create and remotely control virtual machines on which our lab's participants deploy their plagiarism detectors.
    The ClueWeb corpus 2009 (ClueWeb09)3 is one of the most widely adopted web crawls and is regularly used for large-scale web search-related evaluations. It consists of about one billion web pages, half of which are English. Although an updated
version of the corpus has been released,4 our evaluation is still based on the 2009 version
since our corpus of suspicious documents was built on top of ClueWeb09. Indri5 and
ChatNoir [26] are currently the only publicly available search engines that index the
ClueWeb09 corpus; their retrieval models are based on language modeling and BM25F,
respectively. For developer convenience, we also provide a proxy server which unifies
the APIs of the search engines. At the same time, the proxy server logs all accesses to
the search engines for later performance analysis.

2.2   Evaluation Corpus
The evaluation corpus employed for source retrieval is based on the Webis text reuse
corpus 2012 (Webis-TRC-2012) [28, 27]. The corpus consists of 297 documents that
have been written by 27 writers who worked with our setup as shown in the first row
of Figure 2: given a topic, a writer used ChatNoir to search for source material on that
topic while preparing a document of 5700 words on average, reusing text from the sources found.
    In the previous years, we sampled 98 documents from the Webis-TRC-2012 as training
and test documents. This year, these documents were provided for training, and another
99 documents were sampled as test documents. The remainder of the corpus will be
used within future labs on this task.

2.3   Performance Measures
Given a suspicious document dplg that contains passages of text that have been reused
from a set of source documents Dsrc , we measure the retrieval performance of a source
retrieval algorithm in terms of precision and recall of the retrieved documents Dret
taking into account the effect of near-duplicate web documents as follows (cf. last year’s
task overview [25] for more details).
    For any dret ∈ Dret , we employ a near-duplicate detector to judge whether it is
a true positive detection; i.e., whether there is a dsrc ∈ Dsrc of dplg that is a near-
duplicate of dret . We say that dret is a true positive detection for a given pair of dsrc
and dplg iff (1) dret = dsrc (equality), or (2) the Jaccard similarity of the word n-grams
in dret and dsrc is above 0.8 for n = 3, above 0.5 for n = 5, and above 0 for n = 8
(similarity), or (3) the passages in dplg known to be reused from dsrc are contained in
dret (containment). Here, containment is measured as asymmetrical set overlap of the
 3
   http://lemurproject.org/clueweb09
 4
   http://lemurproject.org/clueweb12
 5
   http://lemurproject.org/clueweb09/index.php#Services




Figure 3. Effect of near-duplicates on computing precision (left) and recall (right) of retrieved
source documents. Without taking near-duplicates into account, a lot of potentially correct
sources might be missed.

passages’ set of word n-grams regarding that of dret , so that the overlap is above 0.8
for n = 3, above 0.5 for n = 5, and above 0 for n = 8. This three-way approach of
determining true positive detections inherently entails inaccuracies. While there is no
straightforward way to solve this problem, this error source affects all detectors, still
allowing for relative comparisons.
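The following sketch illustrates the three-way check, assuming simple whitespace tokenization; the helper names are ours, and the actual evaluation relies on a more careful near-duplicate detector.

def ngrams(words, n):
    """Set of word n-grams of a token sequence."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def containment(a, b):
    """Asymmetrical overlap of n-gram set a in n-gram set b."""
    return len(a & b) / len(a) if a else 0.0

# Minimum similarity per n-gram size, as defined above.
THRESHOLDS = {3: 0.8, 5: 0.5, 8: 0.0}

def is_true_positive(d_ret, d_src, reused_passages):
    """Three-way check whether d_ret counts as a detection of d_src.

    reused_passages are the passages of the suspicious document known to be
    reused from d_src (condition 3, containment)."""
    ret_words, src_words = d_ret.split(), d_src.split()
    # (1) equality
    if d_ret == d_src:
        return True
    # (2) n-gram similarity between d_ret and d_src
    if all(jaccard(ngrams(ret_words, n), ngrams(src_words, n)) > t
           for n, t in THRESHOLDS.items()):
        return True
    # (3) containment of the reused passages in d_ret
    psg_words = " ".join(reused_passages).split()
    return all(containment(ngrams(psg_words, n), ngrams(ret_words, n)) > t
               for n, t in THRESHOLDS.items())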
    Let ddup denote a near-duplicate of a given dsrc that would be considered a true
positive detection according to the above conditions. Note that every dsrc may have
more than one such near-duplicate and every ddup may be a near-duplicate of more
than one source document. Further, let D′src denote the set of all near-duplicates of a given set of source documents Dsrc of dplg, and let D′ret denote the subset of Dsrc that have at least one corresponding true positive detection in Dret:

   D′src = {ddup | ddup ∈ D and ∃dsrc ∈ Dsrc : ddup is a true positive detection of dsrc},
   D′ret = {dsrc | dsrc ∈ Dsrc and ∃dret ∈ Dret : dret is a true positive detection of dsrc}.

Based on these sets, we define precision and recall of Dret regarding Dsrc and dplg as follows:

   prec = |Dret ∩ D′src| / |Dret|,        rec = |D′ret ∩ Dsrc| / |Dsrc|.
Rationale for this definition is that retrieving more than one near-duplicate of a source
document does not decrease precision, but it does not increase recall, either, since no
additional information is obtained. A further graphical explanation of how we take near-
duplicates into account for precision and recall is given in Figure 3. Note that D′ret as defined above does not actually contain all duplicates of the retrieved documents, but only those that are already part of Dsrc.
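Given a true positive predicate such as the one sketched above, precision and recall can be computed directly from the set definitions; the following is only an illustrative sketch in our own notation, not the evaluation code used at PAN.

def source_retrieval_scores(D_ret, D_src, is_tp):
    """Precision and recall of retrieved documents against the true sources.

    is_tp(d_ret, d_src) decides whether d_ret is a true positive detection
    of d_src (e.g., the three-way check sketched above)."""
    if not D_ret or not D_src:
        return 0.0, 0.0
    # A retrieved document counts towards precision if it detects some source,
    # i.e., it lies in the duplicate hull D'_src.
    true_pos = [d for d in D_ret if any(is_tp(d, s) for s in D_src)]
    # A source counts towards recall if at least one retrieved document
    # detects it, i.e., it lies in D'_ret.
    detected = [s for s in D_src if any(is_tp(d, s) for d in D_ret)]
    prec = len(true_pos) / len(D_ret)
    rec = len(detected) / len(D_src)
    return prec, rec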
    Finally, to measure the cost-effectiveness of a source retrieval algorithm in retriev-
ing Dret , we count the numbers of queries and downloads made and compute the work-
load in terms of queries and downloads until the first true positive detection is made.
The Source Oracle To allow for participation in the source retrieval task without the
need of having a text alignment component at hand, we provide a source oracle that
automatically enriches a downloaded document with information about whether or not
it is considered a true positive source for the given suspicious document. Note that




the oracle employs the aforementioned conditions to determine whether a document
is a true positive detection. However, the oracle does not, yet, tell for which part of
a suspicious document a downloaded document is a true positive detection. Hence,
applying a custom text alignment strategy can still be beneficial.

2.4   Survey of Retrieval Approaches
Six of the 16 participants submitted softwares for the source retrieval task, all of whom
also submitted a notebook describing their approach. An analysis of these descriptions
reveals the same building blocks that were commonly used in last years’ source retrieval
algorithms: (1) chunking, (2) keyphrase extraction, (3) query formulation, (4) search
control, and (5) download filtering. Some participants only slightly changed their ap-
proach from the previous year; in what follows, we describe the employed ideas in
detail.
Chunking Given a suspicious document, it is divided into (possibly overlapping) pas-
sages of text. Each chunk of text is then processed individually. Rationale for chunking
the suspicious document is to evenly distribute “attention” over a suspicious document
so that algorithms employed in subsequent steps are less susceptible to unexpected char-
acteristics of the suspicious document.
    The chunking strategies employed by the participants are no chunking (i.e.,
the whole document as one chunk) [37], 50-line chunks [5], headings as separate
chunks [37], headings to split documents into chunks [37], 100-word chunks based on
heading detection [31], 200-word chunks [31], 5-sentence chunks [41, 16], and combi-
nations thereof.
    Note that the chunks typically are non-overlapping. The potentially interesting question of whether overlapping chunks might help was not really tackled by any approach. However, typical plagiarism cases have no fixed length, and overlapping chunks might reduce the risk of, for instance, having more than one source in one chunk of 50 lines. Furthermore, relying on the given document structure (e.g., chunking by lines or paragraphs) bears the risk of failing for unseen documents that are not as well-formatted as the ones in our evaluation corpus. Mixed chunking strategies, as seen in the approach of Suchomel and Brandejs [37], may be an interesting future direction. Notably, their document-level queries also seem to guarantee an early recall (cf. Section 2.5).
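As an illustration of the chunking strategies discussed above, the following sketch produces (optionally overlapping) sentence-based chunks; the naive sentence splitter and the parameter values are simplifying assumptions, not any participant's implementation.

import re

def sentence_chunks(document, sentences_per_chunk=5, overlap=2):
    """Split a document into (optionally overlapping) sentence-based chunks.

    A naive regex serves as sentence splitter; participants typically rely on
    proper sentence segmentation."""
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', document) if s.strip()]
    step = max(1, sentences_per_chunk - overlap)
    chunks = []
    for start in range(0, len(sentences), step):
        chunk = sentences[start:start + sentences_per_chunk]
        if chunk:
            chunks.append(" ".join(chunk))
        if start + sentences_per_chunk >= len(sentences):
            break
    return chunks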
Keyphrase Extraction Given a chunk, keyphrases are extracted from it in order to
formulate queries with them. Rationale for keyphrase extraction is to select only those
phrases (or words) which maximize the chance of retrieving source documents match-
ing the suspicious document. Keyphrase extraction may also serve as a means to limit
the amount of queries formulated, thus reducing the overall costs of using a search en-
gine. This step is perhaps the most important one of a source retrieval algorithm since
the decisions made here directly affect the overall performance: the fewer keywords are
extracted, the better the choice must be or recall is irrevocably lost.
    Some participants use single keywords while others extract whole phrases. Most of
the participants preprocessed the suspicious document by removing stop words before




the actual keyphrase extraction. Phrasal search is provided by the Indri search engine. All participants used Indri when submitting phrasal queries; some of them also combine phrases with non-phrasal queries to ChatNoir, the search engine that the original essay authors had used. In particular, Elizalde [5] applies three different keyphrase extraction strategies very similar to her last year's approach: (1) one query per 50-line chunk containing the top 10 words scored by tf ·idf values, (2) the first 8-gram with three words from strategy (1) per chunk, and (3) 15 phrases based on head noun clusters [3]. Prakash and Saha [31] use the top 5 document-level tf -ranked terms, five paragraph-level tf -ranked terms, and the nouns from sentence subgroups to form queries. Kong et al. [16] choose the ten best phrases per chunk according to their own keyphrase extraction
based on BM25 and tf · idf weighting. Williams et al. [41] use their very simplistic
keyphrase extraction strategy from last year: only nouns, adjectives, and verbs form
the keyphrases. Zubarev and Sochenkov [42] follow a similar strategy; for 83 high-
weighting sentences (weighting according to overlap with other sentences) they form
queries by ignoring articles, pronouns, prepositions, and repeated words. One problem with sentence weighting might be that it does not distribute the selected sentences over the entire document, so that for some parts of the document no keywords might be used.
     Suchomel and Brandejs [37] apply three different strategies very similar to their last
year’s approach. First, from the whole document, they use the top 6 words ranked by
tf·idf values. These top 6 keywords are then also combined with their most frequent two
or three term collocations. Second, they use the longest sentence from each paragraph.
Third, they detect headers in the text and use 6-term phrases from these headers.
     Altogether, the participants’ approaches to keyphrase extraction can still basically
be divided into four different categories. (1) Rather simplistic strategies that identify
keyphrases by chunking the whole document into some longer n-grams. This proba-
bly conforms with the folklore human strategy of identifying some suspicious n-gram
in a suspicious document and submitting this n-gram to a search engine. Using all
longer n-grams probably also “hits” parts of the n-grams a human would have cho-
sen. Thus, it is interesting to analyze the final performance of approaches that use this
kind of keyphrases (cf. Section 2.5). (2) Another very common strategy is to use the
tf ·idf -ranked top scoring words or phrases relying on some background collection for
document frequencies. (3) Notably, established keyphrase extraction schemes devel-
oped from the respective research community are only used in one approach. (4) Some
participants do not rely on one strategy alone but combine the other three approaches
for keyphrase extraction. This way, just as with chunking, the risk of algorithm error
is further diminished and it becomes possible to exploit potentially different sources of
information that complement each other.
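A minimal sketch of the common tf·idf-based strategy (category 2), assuming that document frequencies from some background collection are available; the parameter values and function names are illustrative.

import math
import re
from collections import Counter

def tfidf_keywords(chunk, doc_freq, n_docs, k=10, stopwords=frozenset()):
    """Return the top-k tf·idf-scored words of a chunk.

    doc_freq maps a word to its document frequency in a background collection
    of n_docs documents; unknown words are treated as rare."""
    words = [w for w in re.findall(r"[a-z]+", chunk.lower()) if w not in stopwords]
    tf = Counter(words)
    def idf(word):
        return math.log(n_docs / (1 + doc_freq.get(word, 0)))
    scored = sorted(tf, key=lambda w: tf[w] * idf(w), reverse=True)
    return scored[:k]

# Example: one query per chunk built from its top-10 tf·idf words (hypothetical data).
# query = " ".join(tfidf_keywords(chunk, doc_freq, n_docs=1000000))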
Query Formulation Interestingly, most of the participants hardly combine keyphrases into one query beyond merging, for instance, the top k tf · idf -ranked terms, then the next k terms, and so on. This way, most participants implicitly formulate non-overlapping queries (i.e., they do not explicitly use the same keyword in more than one query), except for those participants who basically use all the longer n-grams in the suspicious document or who do not mind the same keywords appearing in different queries rather “by accident” than by intention. This non-overlapping approach is in line
with many query-by-document strategies but in contrast to previous source retrieval




strategies that were shown to better identify highly related documents using overlap-
ping queries [14]. Also note that hardly any of the participants made use of advanced
search operators offered by Indri or ChatNoir, such as the facet to search for web pages
of at least 300 words of text, and the facet to filter search results by readability.
Search Control Given sets of keywords or keyphrases extracted from chunks, queries
are formulated which are tailored to the API of the search engine used. Rationale for
this is to adhere to restrictions imposed by the search engine and to exploit search fea-
tures that go beyond basic keyword search (e.g., Indri’s phrasal search). The maximum
number of search terms enforced by ChatNoir is 10 keywords per query while Indri
allows for longer queries.
    Given a set of queries, the search controller schedules their submission to the search
engine and directs the download of search results. Rationale for this is to dynami-
cally adjust the search based on the results of each query, which may include dropping
queries, reformulating existing ones, or formulating new ones based on the relevance
feedback obtained from search results. Some participants do not describe a search control strategy. The ones who do basically schedule queries and drop some of the previously generated ones. Prakash and Saha [31] drop a query when more than 60% of its terms are contained in another query. Suchomel and Brandejs [37] schedule queries depending on the keyphrase extractor which extracted the words: the order of precedence corresponds to the order in which the extractors have been explained above. Whenever later queries are formulated for portions of the suspicious document that have already been mapped to a source, these queries are not submitted and are discarded from the list of open queries. Similarly, Zubarev and Sochenkov [42] remove queries for sentences that are already mapped to a potentially retrieved source.
    Note that, just as last year, none of the teams tried to reformulate existing queries or to formulate new ones based on the available number of search results, the search snippets, or the downloaded documents, which leaves significant room for improvement. Another interesting aspect might be the scheduling of the queries themselves. The experimental results (cf. Section 2.5) seem to suggest that submitting some document-level queries first guarantees an early recall (e.g., Suchomel and Brandejs [37]). Simply scheduling queries in the order of the chunks in the document instead might run into problems with early recall, as there may not be much reused text at the beginning of a document. This might also be an interesting point for future research.
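As an example of such a search controller rule, the 60% query-dropping heuristic of Prakash and Saha can be sketched as a simple filter over the scheduled queries; everything beyond the threshold itself is our simplification.

def drop_redundant_queries(queries, max_overlap=0.6):
    """Keep a query only if at most max_overlap of its terms are already
    covered by an earlier, kept query (cf. Prakash and Saha's 60% rule)."""
    kept = []
    for query in queries:
        terms = set(query.split())
        if not terms:
            continue
        redundant = any(
            len(terms & set(k.split())) / len(terms) > max_overlap for k in kept
        )
        if not redundant:
            kept.append(query)
    return kept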
Download Filtering Given a set of search engine results, a download filter removes
all documents that are probably not worthwhile being compared in detail with the sus-
picious document. Rationale for this is to further reduce the set of candidates and to
save invocations of the subsequent detailed comparison step.
    In particular, Elizalde [5] focuses on the top 10 results of a query and downloads a
result document when at least 90% of the 4-grams in a 500-character snippet are con-
tained in the suspicious document. Prakash and Saha [31] download a document from
the top 10 when at least one 5-gram in a 500-character snippet is contained in the suspi-
cious document; they explicitly avoid double downloads of the same URL. Zubarev and
Sochenkov [42] base their strategy on the top 7 results and compute similarities of snip-




  Table 1. Source retrieval results with respect to retrieval performance and cost-effectiveness.
Software Submission         Downloaded Sources     Total Workload     Workload to          No       Runtime
Team (alphabetical   Year    F1    Prec.   Rec.    Queries   Dwlds    1st Detection        Detect.
order)                                                                Queries   Dwlds
Elizalde          2013   0.16   0.12   0.37     41.6   83.9        18.0     18.2    4 11:18:50
Elizalde          2014   0.34   0.40   0.39     54.5   33.2        16.4      3.9    7 04:02:00
Foltynek          2013   0.11   0.08   0.26    166.8   72.7       180.4      4.3   32 152:26:23
Gillam            2013   0.06   0.04   0.15     15.7   86.8        16.1     28.6   34 02:24:59
Haggag            2013   0.38   0.67   0.31     41.7    5.2        13.9      1.4   12 46:09:21
Kong              2013   0.01   0.01   0.59     47.9 5185.3         2.5    210.2    0 106:13:46
Kong              2014   0.12   0.08   0.48     83.5 207.1         85.7     24.9    6 24:03:31
Lee               2013   0.40   0.58   0.37     48.4   10.9         6.5      2.0    9 09:17:10
Prakash           2014   0.39   0.38   0.51     60.0   38.8         8.1      3.8    7 19:47:45
Suchomel          2013   0.05   0.04   0.23     17.8 283.0          3.4     64.9   18 75:12:56
Suchomel          2014   0.11   0.08   0.40     19.5 237.3          3.1     38.6    2 45:42:06
Williams          2013   0.47   0.60   0.47    117.1   12.4        23.3      2.2    7 76:58:22
Williams          2014   0.47   0.57   0.48    117.1   14.4        18.8      2.3    4 39:44:11
Zubarev           2014   0.45   0.54   0.45     37.0   18.6         5.4      2.3    3 40:42:18


pet sentences to the suspicious document's sentences; a download is scheduled when the similarity is high. Suchomel and Brandejs [37] obtain snippets for each individual query term and download documents (no information on the number of results per
query is given) when more than 20% of the word 2-grams in the concatenated snippets
also appear in the suspicious document. Williams et al. [41] try to train a classifier for
download scheduling using a lot of search engine and snippet features. However, com-
paring their approach from last year with a more simplistic download filtering and this
year’s classifier idea, not much improvement was achieved (basically the same number
of downloads and hardly any improvements in overall or early recall). Kong et al. [16]
simply download the top 3 results per query.
    Interestingly, the participants heavily rely on the retrieval models of the search engines by focusing on at most the top 10 results per query. It is probably not much more expensive to request more results per query in order to select from a wider range of results. For instance, Elizalde [5] requests 30 results per query but then immediately focuses on the top 10 documents without any further consideration of the lower ranks. Other participants restrict themselves to only a few documents per query, although comparing against perhaps a hundred results might not be much more costly than selecting from only 3 results. Considering more results per query might be an interesting option
for future research based on the User-over-Ranking hypothesis [35, 13].
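Most of the download filters described above boil down to checking how much of a result snippet reappears in the suspicious document. The following sketch mirrors, for instance, Elizalde's 90% 4-gram containment rule, with our own tokenization and helper names.

def snippet_ngrams(text, n=4):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def should_download(snippet, suspicious_document, n=4, threshold=0.9):
    """Download a result iff at least `threshold` of the snippet's word
    n-grams also occur in the suspicious document."""
    snip = snippet_ngrams(snippet, n)
    if not snip:
        return False
    doc = snippet_ngrams(suspicious_document, n)
    return len(snip & doc) / len(snip) >= threshold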

2.5     Evaluation Results
Table 1 shows the performances of the six plagiarism detectors that took part in this
year’s source retrieval subtask as well as those of the last year’s participants whose
approaches were re-evaluated on this year’s test corpus using the TIRA experimentation
platform. Since there is currently no single formula to organize retrieval performance
and cost-effectiveness into an absolute order, the detectors are ordered alphabetically,
whereas the best performance value for each metric is highlighted. As can be seen, there





         Figure 4. Recall at a specific number of downloads per participant averaged over all topics.

is no single detector that performs best on all accounts. Rather, different detectors have
different characteristics.
    Arguably, the highest possible recall at a reasonable workload (queries and downloads) is the goal of source retrieval. One might discuss, for instance, whether recall should be given more weight in the F-measure. Still, when sorting the participants by recall, it
appears that only two of the top 8 participants are not from 2014. This indicates some
progress in the “right” direction for this task. However, the 2013-approach of Kong
et al. [17] still achieves the best recall—at the cost of poor precision. To further shed
some light on the recall of the different approaches, Figure 4 shows the recall against
the number of downloaded documents. It can be seen that recall is typically gained
over the whole process of downloading documents and not with the very first down-
loads (the plateau effect at the upper right end of each plot is due to the averaging).
Unsurprisingly, some of the low-workload approaches achieve higher recall levels with
fewer downloads while approaches with more downloads typically achieve their bet-
ter final recall levels only at a much higher number of downloads. Thus, focusing on
download filtering strategies at document-level (when all queries are submitted) might
improve their early recall at the download-level—but this would probably harm recall
when measured against the submitted queries.
    Interestingly, an ensemble of all submitted approaches would achieve an average recall of 0.85, retrieving all sources for 48 topics. Only for 14 topics is the recall below 0.6 (which is the best individual average recall).
    Not just focusing on recall, a per-participant analysis also reveals some interesting
observations when comparing the approaches from different years. For instance, Su-
chomel and Brandejs [37] almost doubled their recall without increased query load.




Also Elizalde [5] managed to save a lot of downloads and improve precision over last
year without harming recall. By saving a lot of downloads, Kong et al. [16] sacrificed
their good recall from last year. A little disappointing is the almost negligible effect of the efforts of Williams et al. [41] to improve over last year: the better scheduling of the downloads yields no real change in their overall performance; still, a lot of queries are invested for only a tenth as many actual downloads. Prakash and Saha [31] and Zubarev and Sochenkov [42] achieve very good results in their first participation in source retrieval, contributing a lot to the better average recall of this year's approaches compared to last year's.
     The detectors of Williams et al. [40, 41] achieve the best trade-off between precision and recall and therefore the best F1 value. They are followed closely by the detector of Zubarev and Sochenkov [42], which achieves almost the same balanced
precision and recall with a much smaller workload. It is not easy to decide which of
the participating detectors solves the task best, since each of them may have their jus-
tification in practice. For example, the detector of Haggag and El-Beltagy downloads
only about five documents on average per suspicious document and minimizes the time
to first detection; however, it also has no detection at all for 12 documents. Despite the
excellent precision/recall trade-off of Williams et al.'s detectors, they incur the second-highest cost in terms of queries on average, much more than some other participants that achieve better or only slightly worse recall. Kong et al.'s detector has the highest
download costs, but one may argue that downloads are much cheaper than queries, and
that in the source retrieval task recall is more important than precision.
     Altogether, the current strategies might be a little too focused on saving downloads (and queries) rather than, for instance, increasing recall. Also, runtime should probably not be the key metric to optimize (e.g., using threads instead of sequential processing does not decrease the workload on the search engines). A reasonable assumption probably is that recall is most important to the end user of a source retrieval system. Investing a couple of queries and a number (maybe even hundreds) of downloads to achieve a recall above 0.8 might be a very important research direction. In the end, whatever source the source retrieval step misses cannot be found by a later text alignment step. This probably is a key argument for a recall-oriented source retrieval strategy that also takes into account basic considerations on the total workload of query submissions and downloads. It would be interesting to see future approaches aiming at significantly improved recall at a moderate cost increase.


3   Text Alignment

In text alignment, given a pair of documents, the task is to identify all contiguous pas-
sages of reused text between them. The challenge with this task is to identify passages
of text that have been obfuscated, sometimes to the extent that, apart from stop words,
little lexical similarity remains between an original passage and its plagiarized coun-
terpart. To provide a challenging evaluation corpus, we resort to our previous corpus
construction efforts. The performance of a plagiarism detector is measured based on
the traditionally employed measures plagdet, precision, recall, and granularity; in addition, we introduce new measures that shed light on different performance aspects of




plagiarism detection. Finally, we conduct a cross-year evaluation and compare the per-
formance of this year’s detectors with those of last year.

3.1   Evaluation Corpus

As an evaluation corpus for this year, we reuse that of last year [25]. The corpus is
based on the Webis-TRC-13 [27]. Instead of employing the documents of that corpus
directly, pairs of documents that comprise reused passages have been constructed au-
tomatically, similarly to previous years [29]. The corpus comprises plagiarism cases
whose reused portions of text have been subject to four obfuscation strategies, namely
verbatim copies, random obfuscation, cyclic translation obfuscation, and summary ob-
fuscation. To date, the latter strategy has been found to be the most challenging kind of obfuscation to detect.
    While the best performing approach was determined based on last year's test corpus, we have also compiled a variant of the above corpus, using the same methods as before, which comprises only unobfuscated and randomly obfuscated plagiarism. This supplemental corpus serves as a baseline corpus.
Discussion Reusing a previously constructed evaluation corpus another time has been
a compromise on our part to free up time for working on the front end of the TIRA ex-
perimentation platform. This strategy, however, bears the risk of approaches being over-
fitted to a given corpus and participants cheating. While reusing a previously released
corpus would not be a problem if participants used only the training data to fine-tune
their approach, it cannot be entirely ruled out that the publicly available test data has
been used as well. Therefore, the evaluation results for text alignment of this year must
be taken with a grain of salt.
     In an attempt to alleviate this problem, we constructed a supplemental corpus using the same corpus construction process that was used for the reused corpus; however, the supplemental corpus comprises only basic obfuscation strategies that can be detected much more easily than, for example, summary obfuscation. Therefore, this corpus was not chosen to be the reference for this year's ranking among participants.
     In general, we are thinking of new ways to organize text alignment as a shared task.
Throughout the years, we have created a new evaluation corpus every year, except for
this year, adding new kinds of obfuscation strategies each time. However, we feel that
this level of output cannot be sustained much longer; should there still be considerable interest in continuing this task, we will involve participants in the corpus construction efforts by having them submit corpora of their own design, while cross-evaluating the submitted corpora using all submitted approaches. We have invited data submissions on a voluntary basis before, however, without much success. Perhaps, by making data submissions a mandatory part of this shared task, there will be more participant engagement.

3.2   Performance Measures Revisited
To assess the performance of the submitted text alignment softwares, we employ the
performance measures used in previous evaluations [29]: precision, recall, and granu-
larity, which are combined into the plagdet score. While these measures are not beyond




criticism, they have served as a reliable means to rank plagiarism detectors; these mea-
sures are generally perceived as very strict.
     This year, we revisit performance measurement of plagiarism detectors by shedding
light on more abstract levels of detection performance. Until now, plagiarism detection
performance has been measured at the character level under the model of a user who
expects her plagiarism detector to retrieve and extract contiguous plagiarized passages
of text from a given pair of documents. For example, many current plagiarism detectors
extract only overlapping substrings of a given plagiarism case which makes reviewing
their detections cumbersome, whereas extracting contiguous passages has turned out
to be much more convenient. Therefore, the current measures have been specifically
developed to capture the “completeness” of detection of a given plagiarism case: they
measure the precision and recall of detecting a plagiarism case at character level and
combine that with the granularity measure, which counts the number of times a given
plagiarism case has been detected.
     However, there is more than one relevant user model, and performance can be mea-
sured at different levels of abstraction, which may even lead to contrary results as to
which particular plagiarism detection approach is best suited for a particular user. In
what follows, we review the previous performance measures, and add two new levels
of abstraction. They incorporate different assumptions about the detection characteris-
tics preferred by users. Besides the character level, we propose to measure detection
performance at the case level, which fixes the minimum precision and recall at which a
plagiarism case has to be detected, and, at the document level, which disregards whether
all plagiarism cases present in a document are detected as long as a significant portion
of one of them is detected. The case level measures assume a user is interested in cover-
ing all plagiarism cases which are present in a given collection, whereas the document
level measures assume a user wants to determine if a document is suspicious and worth
further analysis.
Character level performance measures Let S denote the set of plagiarism cases in
the corpus, and let R denote the set of detections reported by a plagiarism detector
for the suspicious documents. A plagiarism case s = ⟨splg, dplg, ssrc, dsrc⟩, s ∈ S, is represented as a set s of references to the characters of dplg and dsrc, specifying the passages splg and ssrc. Likewise, a plagiarism detection r ∈ R is represented as r. We say that r detects s iff s ∩ r ≠ ∅, splg overlaps with rplg, and ssrc overlaps with rsrc. Based on this notation, precision and recall of R under S can be measured as follows:

   prec(S, R) = (1/|R|) Σ_{r∈R} |⋃_{s∈S} (s ⊓ r)| / |r|,     rec(S, R) = (1/|S|) Σ_{s∈S} |⋃_{r∈R} (s ⊓ r)| / |s|,

   where s ⊓ r = s ∩ r if r detects s, and s ⊓ r = ∅ otherwise.
Observe that neither precision nor recall account for the fact that plagiarism detectors
sometimes report overlapping or multiple detections for a single plagiarism case. This
is undesirable, and to address this deficit, a detector's granularity is additionally quantified as follows:

   gran(S, R) = (1/|SR|) Σ_{s∈SR} |Rs|,




where SR ⊆ S are cases detected by detections in R, and Rs ⊆ R are detections of s;
i.e., SR = {s | s ∈ S and ∃r ∈ R : r detects s} and Rs = {r | r ∈ R and r detects s}.
Note further that the above three measures alone do not allow for a unique ranking
among detection approaches. Therefore, the measures are combined into a single overall
score as follows:
   plagdet(S, R) = F1 / log2(1 + gran(S, R)),
where F1 is the equally weighted harmonic mean of precision and recall.
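Representing each case s and each detection r as a set of character references, the character level measures can be sketched as follows; this is an illustration of the formulas above (with a simplified detects-check), not the official evaluation script.

import math

def char_measures(cases, detections):
    """Character level precision, recall, granularity, and plagdet.

    cases and detections are non-empty sets of character references
    (e.g., (document id, offset) pairs covering both d_plg and d_src)."""
    def detects(r, s):
        # Simplified: the actual definition also requires overlap on both
        # the plagiarized and the source side.
        return bool(r & s)

    def covered(x, others):
        # Union of x ⊓ y over all counterparts y; under the simplified
        # check, s ⊓ r reduces to s ∩ r.
        hit = set()
        for y in others:
            hit |= x & y
        return hit

    prec = (sum(len(covered(r, cases)) / len(r) for r in detections)
            / len(detections)) if detections else 0.0
    rec = (sum(len(covered(s, detections)) / len(s) for s in cases)
           / len(cases)) if cases else 0.0

    detected = [s for s in cases if any(detects(r, s) for r in detections)]
    gran = (sum(sum(1 for r in detections if detects(r, s)) for s in detected)
            / len(detected)) if detected else 1.0

    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, gran, f1 / math.log2(1 + gran)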
Case level performance measures Let S and R be defined as above. Further, let

   S′ = {s | s ∈ S and recchar(s, R) > τ1 and ∃r ∈ R : r detects s and precchar(S, r) > τ2}

denote the subset of all plagiarism cases S which have been detected with more than a threshold τ1 in terms of character recall recchar and more than a threshold τ2 in terms of character precision precchar. Likewise, let

   R′ = {r | r ∈ R and precchar(S, r) > τ2 and ∃s ∈ S : r detects s and recchar(s, R) > τ1}

denote the subset of all detections R which contribute to detecting plagiarism cases with more than a threshold τ1 in terms of character recall recchar and more than a threshold τ2 in terms of character precision precchar. Here, character recall and precision derive from the character level performance measures defined above:

   precchar(S, r) = |⋃_{s∈S} (s ⊓ r)| / |r|,     recchar(s, R) = |⋃_{r∈R} (s ⊓ r)| / |s|.

Based on this notation, we compute case level precision and recall as follows:

   preccase(S, R) = |R′| / |R|,     reccase(S, R) = |S′| / |S|.
The thresholds τ1 and τ2 can be used to adjust the minimal detection accuracy with
regard to passage boundaries. Threshold τ1 adjusts how accurately a plagiarism case has to be detected, whereas threshold τ2 adjusts how accurate a plagiarism detection has to be. Beyond the minimal detection accuracy imposed by these thresholds, however,
a higher detection accuracy does not contribute to case level precision and recall. If
τ1 → 1 and τ2 → 1, the minimal required detection accuracy approaches perfection,
whereas if τ1 → 0 and τ2 → 0, it is sufficient to report an entire document as plagiarized
to achieve perfect case level precision and recall. In between these extremes, it is an
open question which threshold settings are valid with regard to capturing the minimally
required detection quality beyond which most users of a plagiarism detection system
will not perceive improvements anymore. Hence, we choose τ1 = τ2 = 0.5 as a reasonable trade-off for the time being: for case level precision, a plagiarism detection r counts as a true positive detection if it contributes to detecting at least τ1 = 0.5, i.e., 50%, of a plagiarism case s, and if at least τ2 = 0.5, i.e., 50%, of r contributes to detecting plagiarism cases. Likewise, for case level recall, a plagiarism case s counts as detected if at least 50% of s is detected and a plagiarism detection r contributes to detecting s while at least 50% of r contributes to detecting plagiarism cases in general.
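Under the same simplified representation as in the character level sketch above, case level precision and recall reduce to counting the detections in R′ and the cases in S′; again, this is only an illustrative sketch.

def case_measures(cases, detections, tau1=0.5, tau2=0.5):
    """Case level precision and recall with thresholds tau1 (case recall)
    and tau2 (detection precision), over sets of character references."""
    def prec_char(r):
        hit = set()
        for s in cases:
            hit |= s & r
        return len(hit) / len(r)

    def rec_char(s):
        hit = set()
        for r in detections:
            hit |= s & r
        return len(hit) / len(s)

    # R': detections that are precise enough and hit a well-detected case.
    R_good = [r for r in detections if prec_char(r) > tau2
              and any(s & r and rec_char(s) > tau1 for s in cases)]
    # S': cases that are detected well enough by at least one precise detection.
    S_good = [s for s in cases if rec_char(s) > tau1
              and any(s & r and prec_char(r) > tau2 for r in detections)]

    prec = len(R_good) / len(detections) if detections else 0.0
    rec = len(S_good) / len(cases) if cases else 0.0
    return prec, rec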




Document level performance measures Let S, R, and R′ be defined as above. Further, let Dplg be the set of suspicious documents and Dsrc the set of potential source documents. Then Dpairs = Dplg × Dsrc denotes the set of possible pairs of documents
that a plagiarism detector may analyze, whereas

 Dpairs|S = {(dplg , dsrc ) | (dplg , dsrc ) ∈ Dpairs and ∃s ∈ S : dplg ∈ s and dsrc ∈ s}

denotes the subset of Dpairs whose document pairs contain the plagiarism cases S, and

 Dpairs|R = {(dplg , dsrc ) | (dplg , dsrc ) ∈ Dpairs and ∃r ∈ R : dplg ∈ r and dsrc ∈ r}

denotes the corresponding subset of Dpairs for which plagiarism was detected in R. Likewise, Dpairs|R′ denotes the subset of Dpairs for which plagiarism was detected when requiring a minimal detection accuracy as per R′ defined above. Based on this notation, we compute document level precision and recall as follows:

   precdoc(S, R) = |Dpairs|S ∩ Dpairs|R′| / |Dpairs|R|,     recdoc(S, R) = |Dpairs|S ∩ Dpairs|R′| / |Dpairs|S|.

Again, the thresholds τ1 and τ2 allow for adjusting the minimal required detection accuracy for R′, but for document level recall, it is sufficient that at least one plagiarism case
case is detected beyond that accuracy in order for the corresponding document pair
(dplg , dsrc ) to be counted as true positive detection. If none of the plagiarism cases
present in (dplg , dsrc ) is detected beyond the minimal detection accuracy, it is counted
as false negative, whereas if detections are made for a pair of documents in which no
plagiarism case is present, it is counted as false positive.
Discussion Compared to the character level measures, the case level measures relax
the fine-grained measurement of plagiarism detection quality to allow for judging a de-
tection algorithm by its capability of “spotting” plagiarism cases reasonably well with
respect to the minimum detection accuracy fixed by the thresholds τ1 and τ2 . For exam-
ple, a user who is interested in maximizing case level performance may put emphasis on
the coverage of all plagiarism cases rather than the precise extraction of each individual
plagiarized pair of passages. The document level measures further relax the require-
ments to allow for judging a detection algorithm by its capability “to raise a flag” for
a given pair of documents, disregarding whether it finds all plagiarism cases contained.
For example, a user who is interested in maximizing these measures puts emphasis on
being alerted to suspicious documents, which might lead to further, more detailed investigations. In this regard, the three levels of performance measurement complement each other. To rank plagiarism detectors with regard to their case level performance and their document
level performance, we currently use the Fα -Measure. While the best setting of α is also
still unclear, we resort to α = 1.

3.3   Survey of Text Alignment Approaches
Eleven of the 16 participants submitted softwares that implement text alignment, and for
ten of them also a notebook describing their approach has been submitted. An analysis




of these notebooks reveals that a number of building blocks are commonly used to build
text alignment algorithms: (1) seeding, (2) extension, and (3) filtering. Text alignment is
closely related to gene sequence alignment in bioinformatics, from which the terminology
is borrowed: most of this year’s approaches to text alignment implement the so-called
seed and extend-paradigm which is frequently applied in gene sequence alignment.
However, a new trend can also be observed, namely methods to predict the obfuscation type at hand and to dynamically choose or adjust the text alignment approach based on the prediction. In what follows, we survey the approaches in detail.
Seeding Given a suspicious document and a source document, matches (so-called
“seeds”) between the two documents are identified using some seed heuristic. Seed
heuristics either identify exact matches or create matches by changing the underlying
texts in a domain-specific or linguistically motivated way. Rationale for this is to pin-
point substrings that altogether make up for the perceived similarity between a suspi-
cious and a source document. By coming up with as many reasonable seeds as possible,
the subsequent step of extending them into aligned passages of text becomes a lot easier.
A number of seed heuristics have been applied by this year’s participants:
 – Alvi et al. [2] use character 20-grams and the Rabin-Karp algorithm for string
   matching between suspicious and source document.
 – Glinos [8] use word 1-grams and exact matching. In a supplementary approach,
   they use only the top 30 most frequent words longer than 5 characters.
 – Sanchez-Perez et al. [32] use sentences, whereas short sentences of less than 4
   words are joined with their respective succeeding sentence; they apply a sentence
   similarity measure where each sentence is represented as a tf · idf -weighted vector,
   and vectors are compared using cosine similarity and the Dice coefficient. Sentences
   match if their similarities under both similarity measures exceed a threshold of 0.33.
 – Gross and Modaresi [12] use skip word 2-grams with skips ranging from 1 to 4 and
   exact matching, whereas seeds appearing more than four times are discarded.
 – Torrejón and Ramos [39] reuse their previous approach which is based on sorted
   3-grams and sorted 1-skip 3-grams and exact matching.
 – Abnar et al. [1] use word 2-grams to word 5-grams. Matching is based on a simi-
   larity measure that computes pairwise word similarities between a pair of n-grams,
   incorporating knowledge about how likely a given word is exchanged by another
   (e.g., because it is a synonym). This approach differs from others, where, for exam-
   ple, synonyms are normalized and exact matching is applied.
 – Palkovskii and Belov [20] use word n-grams, stop word n-grams, named entity
   n-grams, frequent word n-grams, stemmed n-grams, sorted n-grams, and skip n-grams
   with exact matching; however, it is not clear which n is used.
 – Gillam and Notley [7] apply a custom fingerprinting approach and compute
   similarity-sensitive hash values over portions of the input documents, although
   algorithm details and parameter settings remain unclear.
 – Kong et al. [16] reuse their previous year’s approach using sentences as seeds which
   match if they exceed a similarity threshold.
 – Shrestha et al. [34] also use sentences which are matched if their TER-p score
   exceeds a threshold, but they also use word 2-grams and exact matching.




Before computing seeds, many participants preprocess the documents by collapsing
whitespace, lowercasing, removing non-alphanumeric characters, removing stop words,
and stemming the remaining words, as far as applicable to their respective seed heuristics.
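To illustrate the seeding step, the sketch below combines this kind of preprocessing with
a simple word n-gram seed heuristic and exact matching; the parameters (n = 3, the stop
word list) are placeholders and do not correspond to any particular submission.

    import re
    from collections import defaultdict

    STOP_WORDS = {"the", "a", "an", "of", "and", "or", "in", "to", "is"}  # placeholder list

    def preprocess(text):
        """Lowercase, keep only alphanumeric tokens, and drop stop words."""
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        return [t for t in tokens if t not in STOP_WORDS]

    def seed_matches(susp_text, src_text, n=3):
        """Exact-matching word n-gram seeds: pairs (i, j) of token positions at
        which the suspicious and the source document share an n-gram."""
        susp, src = preprocess(susp_text), preprocess(src_text)
        index = defaultdict(list)
        for j in range(len(src) - n + 1):
            index[tuple(src[j:j + n])].append(j)
        return [(i, j)
                for i in range(len(susp) - n + 1)
                for j in index[tuple(susp[i:i + n])]]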
Extension Given the seed matches identified between a suspicious document and a
source document, they are extended into aligned text passages of maximal length
between the two documents, which are then reported as plagiarism detections. The
rationale for merging seed matches is to determine whether a document contains
plagiarized passages at all, rather than just seeds that match by chance, and to identify
a plagiarized passage as a whole rather than only its fragments.
    The extension algorithms applied this year have become more diverse, including
rule-based approaches, dynamic programming, and clustering-based approaches. In
previous years, rule-based approaches have been used by almost everyone; they merge
seeds into aligned passages if they are adjacent in both the suspicious and the source
document and the size of the gap between them is below some threshold. The exact
rules depend on the seeds used, and instead of using just one rule, many participants
develop sets of constraints that have to be fulfilled by aligned passages in order for them
to be reported as plagiarism detections. The complexity of the rule sets and their
interdependencies has outgrown a casual description, although this year’s rules are
comparably simple. For example, Alvi et al. [2] merge seeds into aligned passages if
they are less than 200 chars apart, and Gillam and Notley [7] continue to develop their
previous year’s approach, which used to merge seeds that are less than 900 chars apart.
However, given the success of alternative, more dynamic extension algorithms, crafting
rule sets by hand no longer appears to be a competitive approach.
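A minimal sketch of such a rule-based extension is given below; seeds are merged into
passages whenever consecutive seeds are close enough in both documents. The
200-character gap mirrors the order of magnitude mentioned above but is otherwise
chosen arbitrarily, as is the fixed seed length.

    def merge_seeds(seeds, max_gap=200, seed_len=20):
        """Rule-based extension: seeds are (susp_offset, src_offset) character
        positions of matches of length seed_len. Sorted by suspicious offset,
        a seed is merged into the current passage if the gap to the passage end
        is below max_gap characters in both documents."""
        passages = []
        for s, r in sorted(seeds):
            if passages:
                (s0, s1), (r0, r1) = passages[-1]
                if s - s1 < max_gap and abs(r - r1) < max_gap:
                    passages[-1] = ((s0, max(s1, s + seed_len)),
                                    (min(r0, r), max(r1, r + seed_len)))
                    continue
            passages.append(((s, s + seed_len), (r, r + seed_len)))
        return passages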
    One of the classical approaches to extension is dynamic programming, and two
participants make use of corresponding bioinformatics algorithms: Glinos [8] employs
a variant of the Smith-Waterman algorithm, tailored to pairs of texts instead of pairs
of gene sequences, with improvements in terms of runtime and multiple detections.
Oberreuter and Eiselt [19] employ an algorithm from the BLAST family of local gene
sequence alignment algorithms. These algorithms can handle noise, which is why
Glinos [8] is at liberty to use word 1-grams as seeds, which may help to uncover text
similarities that are lost with longer seeds. However, these algorithms may have
difficulty handling heavy reordering of phrases and words.
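To make the dynamic programming idea concrete, a plain Smith-Waterman local
alignment over word tokens is sketched below; the scoring parameters are placeholders,
and the refinements of the submitted systems (runtime improvements, reporting multiple
detections) are not shown.

    def smith_waterman(susp_tokens, src_tokens, match=2, mismatch=-1, gap=-1):
        """Plain Smith-Waterman local alignment over two token sequences.
        Returns the best local alignment score and the end positions of the
        aligned passage in both sequences (placeholder scoring parameters)."""
        rows, cols = len(susp_tokens) + 1, len(src_tokens) + 1
        H = [[0] * cols for _ in range(rows)]
        best_score, best_end = 0, (0, 0)
        for i in range(1, rows):
            for j in range(1, cols):
                diagonal = H[i - 1][j - 1] + (match if susp_tokens[i - 1] == src_tokens[j - 1] else mismatch)
                H[i][j] = max(0, diagonal, H[i - 1][j] + gap, H[i][j - 1] + gap)
                if H[i][j] > best_score:
                    best_score, best_end = H[i][j], (i, j)
        return best_score, best_end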
    Many participants apply some kind of clustering algorithm, or at least algorithms
related to clustering. In their secondary approach, Glinos [8] try to identify clusters
of frequent topic-related words in order to better pinpoint summaries and highly
obfuscated plagiarism, though they do not employ one of the standard clustering
algorithms but a handcrafted set of rules. Similarly, Sanchez-Perez et al. [32] apply an
approach related to divisive clustering. They first merge all subsequent seeds into what
they call fragments using a broad gap threshold, and then divide the merged fragments
until all resulting fragments exceed a similarity threshold, in which case they are output
as aligned pairs of passages. By contrast, Gross and Modaresi [12] apply agglomerative
single-linkage clustering, in which pairs of seeds are merged in order of least distance,
and the stopping criterion is a distance threshold that may not be exceeded. Abnar et
al. [1] apply the density-based clustering algorithm DBSCAN,
and Palkovskii and Belov [20] apply what they call “angled ellipse-based graphical
clustering,” although it remains unclear exactly which clustering algorithm is used,
since neither a reference nor a description is given.
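A common way to frame the clustering view is to treat each seed as a point in the plane
spanned by its offsets in the suspicious and the source document. The hedged sketch
below uses scikit-learn’s DBSCAN with placeholder parameters and is not a
reconstruction of any particular submission.

    import numpy as np
    from sklearn.cluster import DBSCAN  # assumes scikit-learn is available

    def cluster_seeds(seeds, eps=400, min_samples=3):
        """Density-based clustering of seeds, each seed being a point
        (susp_offset, src_offset). Every cluster becomes a candidate aligned
        passage spanning the minimal and maximal offsets of its members."""
        points = np.asarray(seeds)
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
        passages = []
        for label in set(labels) - {-1}:          # -1 marks noise points
            members = points[labels == label]
            passages.append(((int(members[:, 0].min()), int(members[:, 0].max())),
                             (int(members[:, 1].min()), int(members[:, 1].max()))))
        return passages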
Filtering Given a set of aligned passages, a passage filter removes all aligned passages
that do not meet certain criteria. The rationale is mainly to deal with overlapping
passages and to discard extremely short ones; a real-world plagiarism detector might
additionally attempt to discern reused text that has been properly acknowledged from
reused text for which an acknowledgement is missing. Participants, however, use
filtering mainly to optimize against the evaluation corpus in order to maximize the
measured performances, which is of course impractical, but lies in the nature of a
competition.
     Alvi et al. [2] discard alignments of less than 200 chars length in the source doc-
ument and less than 100 chars length in the suspicious document; Glinos [8] discard
alignments that contain less than 40 words on both sides; Sanchez-Perez et al. [32]
attempt to disambiguate and merge overlapping alignments and discard alignments of
less than 150 chars length; Gross and Modaresi [12] discard alignments of less than 15
words length; Abnar et al. [1] and Shrestha et al. [34] discard short alignments based
on an unclear threshold; and other participants either do not give details, or they do not
filter alignments.
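A minimal sketch of the filtering step is given below: passages below a length threshold
are discarded, and among passages overlapping in the suspicious document only the
longest one is kept. The thresholds echo the magnitudes reported above and are
otherwise arbitrary.

    def filter_passages(passages, min_susp_len=150, min_src_len=150):
        """Discard aligned passages that are too short and, among passages
        overlapping in the suspicious document, keep only the longest one."""
        long_enough = [p for p in passages
                       if p[0][1] - p[0][0] >= min_susp_len
                       and p[1][1] - p[1][0] >= min_src_len]
        kept = []
        # Longest passages first; keep a candidate only if it does not overlap
        # an already kept passage on the suspicious side.
        for cand in sorted(long_enough, key=lambda p: p[0][1] - p[0][0], reverse=True):
            (s0, s1), _ = cand
            if all(s1 <= k0 or s0 >= k1 for (k0, k1), _ in kept):
                kept.append(cand)
        return kept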
Remarks and New Trends Given the fact that eleven teams participated in text align-
ment this year, five of them for the first time, we conclude that there is still a lot of
interest in this shared task, and that there is still a lot to be accomplished. One of the
new trends that many participants have picked up simultaneously is that of tailoring
their approaches to specific kinds of obfuscation and then dynamically selecting the
appropriate approach, either based on a prediction of which kind of obfuscation is at
hand in a given pair of documents, or based on an a posteriori decision rule when
applying more than one variant at the same time. This development is encouraging, as
it opens new avenues of research around text alignment, not to mention the opportunity
to improve significantly over one-size-fits-all approaches:
 – Glinos [8] distinguishes word order-preserving plagiarism from all other kinds and
   applies the dynamic programming approach and the clustering-based approach at
   the same time. It is not entirely clear what the decision rule is, or which approach
   takes precedence over the other. The clustering approach, however, turns out to aim
   at summaries in particular, so it may be that the dynamic programming algorithm’s
   output takes precedence.
 – Sanchez-Perez et al. [32] distinguish summaries from all other kinds and employ
   two parameter settings for their approach at the same time, one conservative, the
   other progressive. Afterwards, the decision about which of the two outputs is
   returned is based on the length imbalance between the suspicious passage and the
   source passage. If the source passage is more than 3 times longer than the suspicious
   passage, the progressive settings take precedence.
 – Torrejón and Ramos [39] attempted to tune their approach to the evaluation corpus
   based on hundreds of runs under different parameter settings. They discuss three
   parameter sets that are supposedly applicable in different situations, but they do not
   go so far as to combine the three settings in a dynamic manner.
      Table 2. Text alignment performances of the 2014 participants on the 2013 test data.
          Team             PlagDet     Recall   Precision Granularity Runtime
          Sanchez-Perez    0.87818    0.87904      0.88168     1.00344      00:25:35
          Oberreuter       0.86933    0.85779      0.88595     1.00369      00:05:31
          Palkovskii       0.86806    0.82637      0.92227     1.00580      01:10:04
          Glinos           0.85930    0.79331      0.96253     1.01695      00:23:13
          Shrestha         0.84404    0.83782      0.85906     1.00701      69:51:15
          R. Torrejón      0.82952    0.76903      0.90427     1.00278      00:00:42
          Gross            0.82642    0.76622      0.93272     1.02514      00:03:00
          Kong             0.82161    0.80746      0.84006     1.00309      00:05:26
          Abnar            0.67220    0.61163      0.77330     1.02245      01:27:00
          Alvi             0.65954    0.55068      0.93375     1.07111      00:04:57
          Baseline         0.42191 0.34223         0.92939     1.27473      00:30:30
          Gillam           0.28302 0.16840         0.88630     1.00000      00:00:55


 – Palkovskii and Belov [20] attempt to predict which of four kinds of obfuscation is
   at hand, namely no obfuscation, random obfuscation, summaries, and “undefined.”
   Based on the prediction, a choice is made among four corresponding parameter sets.
   Only few hints are given about the features used for prediction, and it remains
   entirely unclear how exactly the prediction pipeline works, what classifiers are
   used, and how well the prediction performs.
 – Kong et al. [16] attempt to predict whether obfuscated or unobfuscated plagiarism
   is at hand. This distinction corresponds to that of Glinos [8] mentioned above. They
   employ logistic regression to train a classifier based on lexical similarity features
   using the training data set of the evaluation corpus. However, it remains unclear
   whether the entire document pairs are compared using the similarity measures and
   how the prediction is incorporated into their text alignment approach. Moreover, no
   analysis of prediction performance is conducted.
While this development is encouraging, it must be noted that the attempts made are still
in their infancy and not yet analyzed well enough to conclude whether they yield useful
overall performance improvements.
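As an illustration of the prediction idea (and not a reconstruction of any particular
submission), the sketch below trains a logistic regression classifier on two simple
document-pair similarity features; the feature choice, the labels, and the use of
scikit-learn are assumptions made for this example.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    def similarity_features(susp_texts, src_texts):
        """Two simple lexical features per document pair: the cosine similarity
        of tf-idf vectors (rows are l2-normalized by default, so the row-wise
        dot product is the cosine) and the ratio of the documents' lengths."""
        vectorizer = TfidfVectorizer().fit(susp_texts + src_texts)
        susp, src = vectorizer.transform(susp_texts), vectorizer.transform(src_texts)
        cosine = susp.multiply(src).sum(axis=1).A.ravel()
        length_ratio = [len(a) / max(len(b), 1) for a, b in zip(susp_texts, src_texts)]
        return [[c, r] for c, r in zip(cosine, length_ratio)]

    def train_obfuscation_predictor(susp_texts, src_texts, labels):
        """labels: one obfuscation type per training pair, e.g. 'none',
        'random', or 'summary'. The trained classifier can then select a
        parameter set for each test pair."""
        X = similarity_features(susp_texts, src_texts)
        return LogisticRegression(max_iter=1000).fit(X, labels)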

3.4   Evaluation Results
In this section, we report on the evaluation of this year’s submissions on the afore-
mentioned evaluation corpora. Moreover, we conduct a cross-year evaluation of all
software submitted since 2012 on the current evaluation corpus. We further differentiate
performance with regard to obfuscation strategies to provide insights into how the
software deals with different strengths of obfuscation. Finally, we compute the
performances of all software using the newly proposed performance measures and
contrast them with the traditionally applied measures.
Overall Results of 2014 Table 2 shows the overall performances of the eleven plagiarism
detectors that implement text alignment and were submitted this year on the 2013 test
data.
 Table 3. Text alignment performances of the 2014 participants on the supplemental test data.
          Team            PlagDet     Recall   Precision Granularity Runtime
          Palkovskii       0.90779   0.88916      0.92757     1.00027      00:57:15
          Oberreuter       0.89268   0.91539      0.87171     1.00051      00:05:37
          Sanchez-Perez    0.89197   0.91984      0.86606     1.00026      00:22:10
          Glinos           0.88770   0.84511      0.96007     1.01761      00:19:32
          Shrestha         0.86806   0.89839      0.84418     1.00381      74:52:47
          Gross            0.85500   0.81819      0.92522     1.02187      00:02:49
          R. Torrejón      0.84870   0.80267      0.90032     1.00000      00:00:31
          Kong             0.83514   0.84156      0.82882     1.00000      00:05:18
          Alvi             0.73416   0.67283      0.90081     1.06943      00:04:17
          Abnar            0.66377   0.84779      0.54833     1.00455      20:14:51
          Baseline         0.64740 0.52838        0.90024     1.04005      00:15:15
          Gillam           0.44076 0.29661        0.85744     1.00000      00:00:56


The overall best performing approach is that of Sanchez-Perez et al. [32], followed by
those of Oberreuter and Eiselt [19] and Palkovskii and Belov [20]. The former two
detectors have balanced precision and recall, while the latter does not. None of the
detectors achieves perfect granularity, yet the scores obtained are very reasonable. While
the best performing approach this year comes from a first-time participant, the
performances of the five newcomers range from very good to poor; one detector’s
performance does not exceed the baseline. In terms of precision and granularity, the
lower-ranked detectors have some shortcomings, and performance decreases more or
less steadily toward the lower ranks. In terms of runtime, three detectors took more than
one hour to finish, one of which took almost three days. The fastest detector, that of
Torrejón and Ramos [39], finishes the 5185 document pairs in less than a minute, and
the authors claim even faster runtimes on multi-core machines.
    Table 3 shows the overall performances of the eleven plagiarism detectors on the
supplemental test data, which comprises only unobfuscated and randomly obfuscated
plagiarism. These obfuscation strategies have been found to be easier to detect, which
explains the generally higher performances. Interestingly, Sanchez-Perez et al. [32] and
Palkovskii and Belov [20] switch places, which may be an artifact of the former’s focus
on detecting summary plagiarism and the latter’s focus on detecting verbatim plagia-
rism. The performance differences, however, are not very large, and the global ranking
does not change much compared to Table 2. Note that the results of Table 2 determine
the best performing approach of 2014, since the 2013 test data pose a much bigger
challenge.
Cross-Year Evaluation between 2012 and 2014 Tables 4 to 7 show the perfor-
mances of all 29 plagiarism detectors submitted since 2012 that implement text align-
ment on the 2013 test data. The overall performance of the detectors with regard to the
plagdet score can be found in Table 4. As can be seen, many of the approaches submitted
in 2014 significantly improve over the best performing detectors of previous years.
Sanchez-Perez et al. [32] take the lead across all years on the 2013 test data. The best
performing detector from previous years, that of Kong et al. [18], is ranked sixth, so the
2014 participants seem to have raised the bar for future participants significantly. Again,
these results must be taken with a grain of salt, since the 2013 test data has been available
to participants beforehand: although no participant mentions in their notebook paper that
they also used this corpus to train their approach, part of the good performance of the
2014 participants may be attributable to their a priori access to the test data.




Table 4. Cross-year evaluation of text alignment software submissions from 2012 to 2014 with
respect to plagdet. The darker a cell, the better the performance compared to the entire column.

Software Submission          Obfuscation Strategies of the 2013 Evaluation Corpus       Entire Corpus
Team            Year     None            Random        Cyclic translation     Summary

Sanchez-Perez   2014    0.90032          0.88417            0.88659           0.56070      0.87818
Oberreuter      2014    0.91976          0.86775            0.88118           0.36804      0.86933
Palkovskii      2014    0.96004          0.86495            0.85750           0.27645      0.86806
Glinos          2014    0.96236          0.80623            0.84722           0.62359      0.85930
Shrestha        2014    0.89174          0.86556            0.84384           0.15550      0.84404
Kong            2012    0.87249          0.83242            0.85212           0.43635      0.83679
R. Torrejón     2014    0.93184          0.75378            0.85899           0.35298      0.82952
Oberreuter      2012    0.94170          0.74955            0.84618           0.13208      0.82678
Gross           2014    0.89950          0.80293            0.83825           0.31869      0.82642
R. Torrejón     2013    0.92586          0.74711            0.85113           0.34131      0.82220
Kong            2014    0.83777          0.82300            0.85162           0.43135      0.82161
Kong            2013    0.82740          0.82281            0.85181           0.43399      0.81896
Palkovskii      2012    0.88161          0.79692            0.74032           0.27507      0.79155
R. Torrejón     2012    0.88222          0.70151            0.80112           0.44184      0.78767
Suchomel        2013    0.81761          0.75276            0.67544           0.61011      0.74482
Suchomel        2012    0.89848          0.65213            0.63088           0.50087      0.73224
Saremi          2013    0.84963          0.65668            0.70903           0.11116      0.69913
Shrestha        2013    0.89369          0.66714            0.62719           0.11860      0.69551
Abnar           2014    0.85124          0.49058            0.67370           0.17148      0.67220
Alvi            2014    0.92693          0.50247            0.54506           0.09032      0.65954
Kueppers        2012    0.81977          0.51602            0.56932           0.13848      0.62772
Palkovskii      2013    0.82431          0.49959            0.60694           0.09943      0.61523
Nourian         2013    0.90136          0.35076            0.43864           0.11535      0.57716
Sánchez-Vega    2012    0.52179          0.45598            0.44323           0.28807      0.45923

Baseline                0.93404          0.07123            0.10630           0.04462      0.42191
Gillam          2012    0.87655          0.04723            0.01225           0.00218      0.41373
Gillam          2013    0.85884          0.04191            0.01224           0.00218      0.40059
Gillam          2014    0.66329          0.05500            0.00403           0.00000      0.28302
Jayapal         2013    0.38780          0.18148            0.18181           0.05940      0.27081
Jayapal         2012    0.34758          0.12049            0.10504           0.04541      0.20169




Table 5. Cross-year evaluation of text alignment software submissions from 2012 to 2014 with
respect to precision. The darker a cell, the better the performance compared to the entire column.

Software Submission           Obfuscation Strategies of the 2013 Evaluation Corpus       Entire Corpus
Team            Year      None            Random        Cyclic translation     Summary

Glinos          2014     0.96445          0.96951            0.96165           0.96451      0.96253
Nourian         2013     0.92921          0.96274            0.95856           0.99972      0.94707
Jayapal         2012     0.98542          0.95984            0.89590           0.83259      0.94507
Alvi            2014     0.91875          0.94785            0.95984           0.88036      0.93375
Gross           2014     0.91761          0.96000            0.92105           0.94876      0.93272

Baseline                 0.88741          0.98101            0.97825           0.91147      0.92939
Palkovskii      2014     0.95584          0.91453            0.89941           0.91315      0.92227
R. Torrejón     2014     0.89901          0.93843            0.90088           0.89793      0.90427
R. Torrejón     2013     0.90060          0.90996            0.89514           0.90750      0.89484
Oberreuter      2012     0.89037          0.87921            0.90328           0.98983      0.89443
Gillam          2014     0.88097          0.95157            1.00000           0.00000      0.88630
Oberreuter      2014     0.85231          0.90608            0.89977           0.93581      0.88595
Gillam          2012     0.88128          0.95572            0.97273           0.99591      0.88532
Gillam          2013     0.88088          0.95968            0.97273           0.99591      0.88487
Sanchez-Perez   2014     0.83369          0.91015            0.88465           0.99910      0.88168
Jayapal         2013     0.91989          0.92314            0.85653           0.68832      0.87901
Shrestha        2013     0.80933          0.92335            0.88008           0.90455      0.87461
Kueppers        2012     0.83258          0.89889            0.89985           0.86239      0.86923
Saremi          2013     0.82676          0.91810            0.84819           0.94600      0.86509
Shrestha        2014     0.82202          0.91098            0.84604           0.93862      0.85906
Kong            2012     0.80786          0.89367            0.85423           0.96399      0.85297
Suchomel        2012     0.81678          0.87581            0.85151           0.87478      0.84437
Kong            2014     0.78726          0.87003            0.85822           0.96381      0.84006
Kong            2013     0.76077          0.86224            0.85744           0.96384      0.82859
R. Torrejón     2012     0.81313          0.83881            0.81159           0.92666      0.82540
Palkovskii      2012     0.79219          0.84844            0.83218           0.94736      0.82371
Palkovskii      2013     0.79971          0.93137            0.82207           0.67604      0.81699
Abnar           2014     0.74910          0.82988            0.76575           0.92946      0.77330
Suchomel        2013     0.69323          0.82973            0.68494           0.67088      0.72514
Sánchez-Vega    2012     0.40340          0.49524            0.37300           0.45184      0.39857




Table 6. Cross-year evaluation of text alignment software submissions from 2012 to 2014 with
respect to recall. The darker a cell, the better the performance compared to the entire column.

Software Submission          Obfuscation Strategies of the 2013 Evaluation Corpus       Entire Corpus
Team            Year     None            Random        Cyclic translation     Summary

Sanchez-Perez   2014    0.97853          0.86067            0.88959           0.41274      0.87904
Oberreuter      2014    0.99881          0.83254            0.86335           0.24455      0.85779
Shrestha        2014    0.97438          0.83161            0.85318           0.08875      0.83782
Palkovskii      2014    0.96428          0.82244            0.82031           0.17672      0.82637
Kong            2012    0.94836          0.77903            0.85003           0.29892      0.82449
Kong            2013    0.90682          0.78682            0.84626           0.30017      0.81344
Kong            2014    0.89521          0.78079            0.84512           0.29636      0.80746
Glinos          2014    0.96028          0.72478            0.76248           0.48605      0.79331
Saremi          2013    0.95416          0.68877            0.80473           0.10209      0.77123
R. Torrejón     2014    0.96715          0.62985            0.82082           0.23149      0.76903
Oberreuter      2012    0.99932          0.65322            0.79587           0.07076      0.76864
Gross           2014    0.90724          0.71884            0.78410           0.20577      0.76622
Suchomel        2013    0.99637          0.68886            0.66621           0.56296      0.76593
R. Torrejón     2013    0.95256          0.63370            0.81124           0.21593      0.76190
Palkovskii      2012    0.99379          0.75130            0.66672           0.16089      0.76181
R. Torrejón     2012    0.96414          0.60283            0.79092           0.29007      0.75324
Shrestha        2013    0.99902          0.71461            0.63618           0.09897      0.73814
Suchomel        2012    0.99835          0.51946            0.50106           0.35305      0.64667
Abnar           2014    0.99110          0.35360            0.60498           0.12200      0.61163
Sánchez-Vega    2012    0.74452          0.43502            0.58133           0.22161      0.56225
Alvi            2014    0.98701          0.36603            0.41988           0.05685      0.55068
Palkovskii      2013    0.85048          0.36420            0.49667           0.08082      0.53561
Kueppers        2012    0.83854          0.36865            0.42427           0.09265      0.51074
Nourian         2013    0.87626          0.23609            0.28568           0.07622      0.43381
Jayapal         2013    0.86040          0.18182            0.19411           0.07236      0.38187

Baseline                0.99960          0.04181            0.08804           0.03649      0.34223
Gillam          2012    0.87187          0.02422            0.00616           0.00109      0.26994
Gillam          2013    0.83788          0.02142            0.00616           0.00109      0.25890
Jayapal         2012    0.51885          0.11148            0.09195           0.04574      0.22287
Gillam          2014    0.53187          0.02832            0.00202           0.00000      0.16840




Table 7. Cross-year evaluation of text alignment software submissions from 2012 to 2014 with
respect to granularity. The darker a cell, the better the performance compared to the entire column.

Software Submission           Obfuscation Strategies of the 2013 Evaluation Corpus       Entire Corpus
Team            Year      None            Random        Cyclic translation     Summary

Gillam          2012     1.00000          1.00000            1.00000           1.00000      1.00000
Gillam          2013     1.00000          1.00000            1.00000           1.00000      1.00000
Gillam          2014     1.00000          1.00000            1.00000           1.00000      1.00000
Oberreuter      2012     1.00000          1.00000            1.00000           1.00000      1.00000
Palkovskii      2012     1.00000          1.00000            1.00000           1.00000      1.00000
R. Torrejón     2012     1.00000          1.00000            1.00000           1.00000      1.00000
Suchomel        2013     1.00000          1.00000            1.00000           1.00476      1.00028
Suchomel        2012     1.00000          1.00000            1.00000           1.00610      1.00032
R. Torrejón     2013     1.00000          1.00000            1.00000           1.03086      1.00141
R. Torrejón     2014     1.00000          1.00000            1.00000           1.06024      1.00278
Kong            2012     1.00000          1.00000            1.00000           1.06452      1.00282
Kong            2014     1.00000          1.00000            1.00000           1.07190      1.00309
Kong            2013     1.00000          1.00000            1.00000           1.07742      1.00336
Sanchez-Perez   2014     1.00000          1.00086            1.00081           1.05882      1.00344
Oberreuter      2014     1.00000          1.00000            1.00000           1.07568      1.00369
Palkovskii      2014     1.00000          1.00176            1.00088           1.10112      1.00580
Shrestha        2014     1.00000          1.00630            1.00948           1.06034      1.00701
Glinos          2014     1.00000          1.04037            1.00547           1.05128      1.01695
Sánchez-Vega    2012     1.00394          1.02200            1.03533           1.04523      1.02196
Abnar           2014     1.00332          1.01509            1.00462           1.39130      1.02245
Gross           2014     1.01998          1.03336            1.01466           1.08667      1.02514
Kueppers        2012     1.02687          1.01847            1.01794           1.31061      1.03497
Nourian         2013     1.00092          1.11558            1.00485           1.34234      1.04343
Alvi            2014     1.03731          1.07203            1.10207           1.26966      1.07111
Palkovskii      2013     1.00000          1.06785            1.02825           1.73596      1.07295
Shrestha        2013     1.00083          1.30962            1.26184           1.83696      1.22084
Saremi          2013     1.06007          1.29511            1.24204           2.15556      1.24450

Baseline                 1.00912          1.18239            1.86726           1.97436      1.27473
Jayapal         2012     2.87916          2.15530            2.00578           2.75743      2.45403
Jayapal         2013     3.90017          2.19096            2.34218           3.60987      2.90698




     Regarding the obfuscation strategies, the detectors’ performances correlate with
their overall performance. On unobfuscated plagiarism (column “None” in the tables),
Palkovskii and Belov [20] and Glinos [8] perform best, beating the performance of the
overall first-ranked Sanchez-Perez et al. [32] by far. On random obfuscation and cyclic
translation, however, Sanchez-Perez et al. maintain their lead. On summary obfuscation,
the approaches of Glinos [8] and Suchomel et al. [38] perform best, again outperforming
Sanchez-Perez et al. [32] by far. This hints that the clustering approach of Glinos [8]
works rather well, and that its combination with the Smith-Waterman algorithm provides
a competitive trade-off.
     Table 5 shows the detectors’ performances with regard to precision. In general,
achieving a high precision appears to be less of a problem compared to achieving a
high recall. This is underpinned by the fact that our basic baseline approach outper-
forms almost all detectors in precision. However, the detectors that perform best in
precision typically have deficiencies in terms of recall, but not the other way around:
the aforementioned overall best performing detectors achieve mid-range precision. The
only exception to the rule is the detector of Glinos [8] which, for the first time, achieves
both best overall precision and a competitive overall ranking with regard to plagdet
performance.
     Table 6 shows the detectors’ performances with regard to recall. Four 2014 par-
ticipants now outperform the formerly best performing pair of detectors submitted by
Kong et al. [18, 17]. Some participants achieve 0.99 recall on unobfuscated plagiarism,
and the recall on such plagiarism is generally high, even among the low-ranked detectors.
Recall performances on random obfuscation and cyclic translation correlate with those
on the entire corpus. The best performing detector on summary obfuscation is still that
of Suchomel et al. [38], while even Glinos [8] performs significantly worse in terms of
recall.
     Table 7 shows the detectors’ performances with regard to granularity. The top fifth
of the table entries have uniformly perfect granularity, so that these approaches are
ranked alphabetically. Despite their perfect granularity scores, these detectors do not
perform well with regard to the other measures, which may hint that they emphasize
granularity too much at the expense of recall, precision, and therefore plagdet. With the
exception of summary obfuscation, and therefore the performance on the entire corpus,
the top half of the table shows near-perfect scores. Only summary obfuscation still poses
a slight challenge, while granularity otherwise appears to be mostly under control. We
repeat our concern, however, that participants often resort to post-retrieval filtering in
order to optimize granularity only for the sake of achieving a good ranking, while some
admit that they would not do this in practice.
New Performance Measures Table 8 contrasts the character level performance mea-
sures, which are traditionally applied to measure the performance of a plagiarism de-
tector, with the new measures introduced above, which measure performance at case
level and at document level. The table shows the performances of all detectors that have
been evaluated for text alignment since 2012.
Table 8. Cross-year evaluation of text alignment software submissions from 2012 to 2014 with
respect to performance measures at character level, case level, and document level.
Software Submission           Character Level             Case Level        Document Level
Team           Year   plagdet   prec    rec    gran     prec    rec     F1    prec   rec    F1
Sanchez-Perez 2014     0.88      0.88   0.88    1.00   0.90   0.91   0.90   0.92   0.91   0.91
Oberreuter    2014     0.87      0.89   0.86    1.00   0.84   0.89   0.87   0.89   0.89   0.89
Palkovskii    2014     0.87      0.92   0.83    1.01   0.90   0.85   0.87   0.90   0.84   0.87
Glinos        2014     0.86      0.96   0.79    1.02   0.90   0.83   0.87   0.93   0.88   0.91
Kong          2012     0.84      0.85   0.82    1.00   0.86   0.85   0.85   0.89   0.85   0.87
Shrestha      2014     0.84      0.86   0.84    1.01   0.91   0.85   0.88   0.94   0.85   0.89
Gross         2014     0.83      0.93   0.77    1.03   0.90   0.86   0.88   0.93   0.85   0.89
Oberreuter    2012     0.83      0.89   0.77    1.00   0.81   0.79   0.80   0.83   0.80   0.81
R. Torrejón   2014     0.83      0.90   0.77    1.00   0.84   0.83   0.83   0.89   0.84   0.86
R. Torrejón   2013     0.83      0.90   0.77    1.00   0.83   0.83   0.83   0.87   0.84   0.85
Kong          2013     0.82      0.83   0.81    1.00   0.85   0.86   0.85   0.89   0.86   0.87
Kong          2014     0.82      0.84   0.81    1.00   0.86   0.85   0.85   0.89   0.85   0.87
Palkovskii    2012     0.79      0.82   0.76    1.00   0.80   0.80   0.80   0.82   0.80   0.81
R. Torrejón   2012     0.79      0.83   0.75    1.00   0.65   0.79   0.72   0.65   0.78   0.71
Suchomel      2013     0.74      0.73   0.77    1.00   0.66   0.83   0.73   0.67   0.82   0.74
Suchomel      2012     0.73      0.84   0.65    1.00   0.76   0.70   0.73   0.77   0.69   0.73
Saremi        2013     0.70      0.87   0.77    1.24   0.59   0.80   0.68   0.82   0.82   0.82
Shrestha      2013     0.70      0.87   0.74    1.22   0.57   0.76   0.65   0.77   0.78   0.77
Abnar         2014     0.67      0.77   0.61    1.02   0.76   0.63   0.69   0.88   0.65   0.75
Alvi          2014     0.66      0.93   0.55    1.07   0.77   0.59   0.67   0.88   0.63   0.73
Kueppers      2012     0.63      0.87   0.51    1.03   0.76   0.59   0.67   0.83   0.64   0.72
Palkovskii    2013     0.62      0.82   0.54    1.07   0.62   0.56   0.59   0.76   0.59   0.66
Nourian       2013     0.58      0.95   0.43    1.04   0.84   0.44   0.58   0.88   0.45   0.59
Sánchez-Vega 2012      0.46      0.40   0.56    1.02   0.35   0.63   0.45   0.64   0.67   0.66
Baseline               0.42      0.93   0.34    1.27   0.42   0.31   0.36   0.55   0.32   0.41
Gillam         2012    0.41      0.89   0.27    1.00   0.92   0.28   0.43   0.93   0.30   0.46
Gillam         2013    0.40      0.88   0.26    1.00   0.92   0.27   0.42   0.93   0.29   0.44
Gillam         2014    0.28      0.89   0.17    1.00   0.91   0.18   0.31   0.92   0.19   0.31
Jayapal        2013    0.27      0.88   0.38    2.91   0.04   0.33   0.07   0.14   0.34   0.20
Jayapal        2012    0.20      0.95   0.22    2.45   0.01   0.17   0.01   0.02   0.19   0.03




    When comparing the plagdet performances with the F1 performances of the case
level and the document level measures, they turn out to be highly correlated. This is in
the nature of things, since it is unlikely that a plagiarism detector that performs poorly
at character level performs excellently at case level or at document level. However, the
rankings still differ. For example, at case level, the detectors of Shrestha et al. [34] and
Gross and Modaresi [12] are ranked second and third behind that of Sanchez-Perez et
al. [32], whereas the detector of Oberreuter and Eiselt [19] loses some ranks. Most of
the other detectors maintain their rank relative to the others. It follows that the detectors
whose ranks improve do a sensible job of “spotting” at least 50% of each plagiarism
case, whereas, at character level, they are outperformed by other detectors.
    At document level, the detector of Glinos [8] catches up with that of Sanchez-Perez
et al. [32], and those of Shrestha et al. [34] and Gross and Modaresi [12] also gain a
few ranks. Interestingly, detectors that are ranked lower at character level, such as those
of Kong et al. [17, 16] and Saremi and Yaghmaee [33], also perform significantly better,
which hints that these detectors, along with the other top-ranked ones, are useful for
raising suspicions about a given pair of documents.
    Nevertheless, the detector of Sanchez-Perez et al. [32] dominates all other detectors
at all three levels.
    Assessing new performance measures is a difficult task, since at the beginning it
remains unclear whether the intuitions that guided their definition are captured well,
and whether they actually reveal performance aspects which other measures do not
capture well enough. In our case, all of the current plagiarism detectors have been op-
timized against the character level performance measures, so that it cannot yet be told
whether it is possible to build a plagiarism detector which outperforms all others, say,
at document level, but not at the other levels. Only time will tell. Moreover, the
hyper-parameters τ1 and τ2 as well as the weight α used in the Fα -Measure are not yet
fixed and are subject to ongoing research. Therefore, it would be premature to give the
new sets of performance measures precedence over the existing performance measures,
which have already been adopted by the community.


4   Conclusion and Outlook
Altogether, the sixth international competition on plagiarism detection at PAN 2014
has been a success: despite being organized for the sixth time in a row, we see steady
interest from the community to further study this task, as is evidenced by the fact
that many participants have returned to make new submissions. Moreover, the two
tasks source retrieval and text alignment are picked up by new participants each year,
which makes for a steady stream of new input and inspiration for solving these tasks.
In total, 16 teams submitted plagiarism detectors, 6 for source retrieval and 11 for text
alignment.
    Since both source retrieval and text alignment are in the production phase of their
shared task life cycles—they are well-defined and all evaluation resources are set up
and provide for a challenging testbed—we have refrained from introducing too many
changes to the two tasks. Nevertheless, we continuously work to make the maintenance
of the evaluation resources for both tasks easier. This pertains particularly to the re-
sources required for source retrieval, which include a fully-fledged search engine for
the ClueWeb corpus.
    Moreover, we continue to pursue our goal of automating evaluations within shared
tasks by developing the TIRA experimentation platform [11]. TIRA facilitates software
submissions, where participants submit their plagiarism detection software to be evalu-
ated at our site [9]. As of this year, the newly introduced web front end for TIRA allows
participants to conduct self-service evaluations on the test data of both our shared tasks
under our supervision and guidance, while the test data remains hidden from direct
access by participants.6 This has allowed us to put participants back in charge of ex-
ecuting their software while the software itself remains in a running state within virtual
machines managed by TIRA. Based on this technology, we conduct cross-year evalua-
tions of all plagiarism detectors that have been submitted to our tasks since 2012.
This year, we place emphasis on analyzing the detection performances of the pla-
giarism detectors by developing new means of visualizing their performance as well as
new performance measures that shed light on performance aspects of plagiarism detec-
tion different from those captured by the traditionally applied measures. This is ongoing
research, and our goal is to provide a more in-depth analysis of each plagiarism detector
that enters our evaluations.

Acknowledgements
We thank the participating teams of this task for their devoted work. This paper was
partially supported by the WIQ-EI IRSES project (Grant No. 269180) within the FP7
Marie Curie action.

Bibliography
 1. Abnar, S., Dehghani, M., Zamani, H., Shakery, A.: Expanded N-Grams for Semantic Text
    Alignment—Notebook for PAN at CLEF 2014. In: [4]
 2. Alvi, F., Stevenson, M., Clough, P.: Hashing and Merging Heuristics for Text Reuse
    Detection—Notebook for PAN at CLEF 2014. In: [4]
 3. Barker, K., Cornacchia, N.: Using Noun Phrase Heads to Extract Document Keyphrases.
    In: Hamilton, H.J. (ed.) Advances in Artificial Intelligence, 13th Biennial Conference of the
    Canadian Society for Computational Studies of Intelligence, AI 2000, Montréal, Quebec,
    Canada, May 14-17, 2000, Proceedings. Lecture Notes in Computer Science, vol. 1822, pp.
    40–52. Springer (2000)
 4. Cappellato, L., Ferro, N., Halvey, M., Kraaij, W. (eds.): CLEF 2014 Evaluation Labs and
    Workshop – Working Notes Papers, 15-18 September, Sheffield, UK. CEUR Workshop
    Proceedings, CEUR-WS.org (2014),
    http://www.clef-initiative.eu/publication/working-notes
 5. Elizalde, V.: Using Noun Phrases and tf-idf for Plagiarized Document Retrieval—Notebook
    for PAN at CLEF 2014. In: [4]
 6. Forner, P., Navigli, R., Tufis, D. (eds.): CLEF 2013 Evaluation Labs and Workshop –
    Working Notes Papers, 23-26 September, Valencia, Spain (2013),
    http://www.clef-initiative.eu/publication/working-notes
 6
     www.tira.io




 7. Gillam, L., Notley, S.: Evaluating Robustness for ’IPCRESS’: Surrey’s Text Alignment for
    Plagiarism Detection—Notebook for PAN at CLEF 2014. In: [4]
 8. Glinos, D.: A Hybrid Architecture for Plagiarism Detection—Notebook for PAN at CLEF
    2014. In: [4]
 9. Gollub, T., Potthast, M., Beyer, A., Busse, M., Rangel, F., Rosso, P., Stamatatos, E., Stein,
    B.: Recent Trends in Digital Text Forensics and its Evaluation. In: Forner, P., Müller, H.,
    Paredes, R., Rosso, P., Stein, B. (eds.) Information Access Evaluation meets Multilinguality,
    Multimodality, and Visualization. 4th International Conference of the CLEF Initiative
    (CLEF 13). pp. 282–302. Springer, Berlin Heidelberg New York (Sep 2013)
10. Gollub, T., Stein, B., Burrows, S.: Ousting Ivory Tower Research: Towards a Web
    Framework for Providing Experiments as a Service. In: Hersh, B., Callan, J., Maarek, Y.,
    Sanderson, M. (eds.) 35th International ACM Conference on Research and Development in
    Information Retrieval (SIGIR 12). pp. 1125–1126. ACM (Aug 2012)
11. Gollub, T., Stein, B., Burrows, S., Hoppe, D.: TIRA: Configuring, Executing, and
    Disseminating Information Retrieval Experiments. In: Tjoa, A.M., Liddle, S., Schewe,
    K.D., Zhou, X. (eds.) 9th International Workshop on Text-based Information Retrieval (TIR
    12) at DEXA. pp. 151–155. IEEE, Los Alamitos, California (Sep 2012)
12. Gross, P., Modaresi, P.: Plagiarism Alignment Detection by Merging Context
    Seeds—Notebook for PAN at CLEF 2014. In: [4]
13. Hagen, M., Stein, B.: Applying the User-over-Ranking Hypothesis to Query Formulation.
    In: Advances in Information Retrieval Theory. 3rd International Conference on the Theory
    of Information Retrieval (ICTIR 11). Lecture Notes in Computer Science, vol. 6931, pp.
    225–237. Springer, Berlin Heidelberg New York (2011)
14. Hagen, M., Stein, B.: Candidate Document Retrieval for Web-Scale Text Reuse Detection.
    In: 18th International Symposium on String Processing and Information Retrieval (SPIRE
    11). Lecture Notes in Computer Science, vol. 7024, pp. 356–367. Springer, Berlin
    Heidelberg New York (2011)
15. Haggag, O., El-Beltagy, S.: Plagiarism Candidate Retrieval Using Selective Query
    Formulation and Discriminative Query Scoring—Notebook for PAN at CLEF 2013. In: [6]
16. Kong, L., Han, Y., Han, Z., Yu, H., Wang, Q., Zhang, T., Qi, H.: Source Retrieval Based on
    Learning to Rank and Text Alignment Based on Plagiarism Type Recognition for
    Plagiarism Detection—Notebook for PAN at CLEF 2014. In: [4]
17. Kong, L., Qi, H., Du, C., Wang, M., Han, Z.: Approaches for Source Retrieval and Text
    Alignment of Plagiarism Detection—Notebook for PAN at CLEF 2013. In: [6]
18. Kong, L., Qi, H., Wang, S., Du, C., Wang, S., Han, Y.: Approaches for Candidate Document
    Retrieval and Detailed Comparison of Plagiarism Detection—Notebook for PAN at CLEF
    2012. In: Forner, P., Karlgren, J., Womser-Hacker, C. (eds.) CLEF 2012 Evaluation Labs
    and Workshop – Working Notes Papers, 17-20 September, Rome, Italy (Sep 2012),
    http://www.clef-initiative.eu/publication/working-notes
19. Oberreuter, G., Eiselt, A.: Submission to the 6th International Competition on Plagiarism
    Detection. http://www.webis.de/research/events/pan-14 (2014),
    http://www.clef-initiative.eu/publication/working-notes, From Innovand.io, Chile
20. Palkovskii, Y., Belov, A.: Developing High-Resolution Universal Multi-Type N-Gram
    Plagiarism Detector—Notebook for PAN at CLEF 2014. In: [4]
21. Potthast, M.: Technologies for Reusing Text from the Web. Dissertation,
    Bauhaus-Universität Weimar (Dec 2011),
    http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:gbv:wim2-20120217-15663
22. Potthast, M., Barrón-Cedeño, A., Eiselt, A., Stein, B., Rosso, P.: Overview of the 2nd
    International Competition on Plagiarism Detection. In: Braschler, M., Harman, D., Pianta,
    E. (eds.) Working Notes Papers of the CLEF 2010 Evaluation Labs (Sep 2010),
    http://www.clef-initiative.eu/publication/working-notes




23. Potthast, M., Eiselt, A., Barrón-Cedeño, A., Stein, B., Rosso, P.: Overview of the 3rd
    International Competition on Plagiarism Detection. In: Petras, V., Forner, P., Clough, P.
    (eds.) Working Notes Papers of the CLEF 2011 Evaluation Labs (Sep 2011),
    http://www.clef-initiative.eu/publication/working-notes
24. Potthast, M., Gollub, T., Hagen, M., Graßegger, J., Kiesel, J., Michel, M., Oberländer, A.,
    Tippmann, M., Barrón-Cedeño, A., Gupta, P., Rosso, P., Stein, B.: Overview of the 4th
    International Competition on Plagiarism Detection. In: Forner, P., Karlgren, J.,
    Womser-Hacker, C. (eds.) Working Notes Papers of the CLEF 2012 Evaluation Labs (Sep
    2012), http://www.clef-initiative.eu/publication/working-notes
25. Potthast, M., Gollub, T., Hagen, M., Tippmann, M., Kiesel, J., Rosso, P., Stamatatos, E.,
    Stein, B.: Overview of the 5th International Competition on Plagiarism Detection. In:
    Forner, P., Navigli, R., Tufis, D. (eds.) Working Notes Papers of the CLEF 2013 Evaluation
    Labs (Sep 2013), http://www.clef-initiative.eu/publication/working-notes
26. Potthast, M., Hagen, M., Stein, B., Graßegger, J., Michel, M., Tippmann, M., Welsch, C.:
    ChatNoir: A Search Engine for the ClueWeb09 Corpus. In: Hersh, B., Callan, J., Maarek,
    Y., Sanderson, M. (eds.) 35th International ACM Conference on Research and Development
    in Information Retrieval (SIGIR 12). p. 1004. ACM (Aug 2012)
27. Potthast, M., Hagen, M., Völske, M., Stein, B.: Crowdsourcing Interaction Logs to
    Understand Text Reuse from the Web. In: Fung, P., Poesio, M. (eds.) Proceedings of the
    51st Annual Meeting of the Association for Computational Linguistics (ACL 13). pp.
    1212–1221. ACL (Aug 2013), http://www.aclweb.org/anthology/P13-1119
28. Potthast, M., Hagen, M., Völske, M., Stein, B.: Exploratory Search Missions for TREC
    Topics. In: Wilson, M.L., Russell-Rose, T., Larsen, B., Hansen, P., Norling, K. (eds.) 3rd
    European Workshop on Human-Computer Interaction and Information Retrieval
    (EuroHCIR 2013). pp. 11–14. CEUR-WS.org (Aug 2013),
    http://www.cs.nott.ac.uk/~mlw/euroHCIR2013/proceedings/paper3.pdf
29. Potthast, M., Stein, B., Barrón-Cedeño, A., Rosso, P.: An Evaluation Framework for
    Plagiarism Detection. In: Huang, C.R., Jurafsky, D. (eds.) 23rd International Conference on
    Computational Linguistics (COLING 10). pp. 997–1005. Association for Computational
    Linguistics, Stroudsburg, Pennsylvania (Aug 2010)
30. Potthast, M., Stein, B., Eiselt, A., Barrón-Cedeño, A., Rosso, P.: Overview of the 1st
    International Competition on Plagiarism Detection. In: Stein, B., Rosso, P., Stamatatos, E.,
    Koppel, M., Agirre, E. (eds.) SEPLN 09 Workshop on Uncovering Plagiarism, Authorship,
    and Social Software Misuse (PAN 09). pp. 1–9. CEUR-WS.org (Sep 2009),
    http://ceur-ws.org/Vol-502
31. Prakash, A., Saha, S.: Experiments on Document Chunking and Query Formation for
    Plagiarism Source Retrieval—Notebook for PAN at CLEF 2014. In: [4]
32. Sanchez-Perez, M., Sidorov, G., Gelbukh, A.: A Winning Approach to Text Alignment for
    Text Reuse Detection at PAN 2014—Notebook for PAN at CLEF 2014. In: [4]
33. Saremi, M., Yaghmaee, F.: Submission to the 5th International Competition on Plagiarism
    Detection. http://www.webis.de/research/events/pan-13 (2013),
    http://www.clef-initiative.eu/publication/working-notes, From Semnan University, Iran
34. Shrestha, P., Maharjan, S., Solorio, T.: Machine Translation Evaluation Metric for Text
    Alignment—Notebook for PAN at CLEF 2014. In: [4]
35. Stein, B., Hagen, M.: Introducing the User-over-Ranking Hypothesis. In: Advances in
    Information Retrieval. 33rd European Conference on IR Resarch (ECIR 11). Lecture Notes
    in Computer Science, vol. 6611, pp. 503–509. Springer, Berlin Heidelberg New York (Apr
    2011)
36. Stein, B., Meyer zu Eißen, S., Potthast, M.: Strategies for Retrieving Plagiarized
    Documents. In: Clarke, C., Fuhr, N., Kando, N., Kraaij, W., de Vries, A. (eds.) 30th
    International ACM Conference on Research and Development in Information Retrieval
    (SIGIR 07). pp. 825–826. ACM, New York (Jul 2007)
37. Suchomel, Š., Brandejs, M.: Heterogeneous Queries for Synoptic and Phrasal
    Search—Notebook for PAN at CLEF 2014. In: [4]
38. Suchomel, Š., Kasprzak, J., Brandejs, M.: Diverse Queries and Feature Type Selection
    for Plagiarism Discovery—Notebook for PAN at CLEF 2013. In: [6]
39. Torrejón, D.A.R., Ramos, J.M.M.: CoReMo 2.3 Plagiarism Detector Text Alignment
    Module—Notebook for PAN at CLEF 2014. In: [4]
40. Williams, K., Chen, H.H., Chowdhury, S., Giles, C.: Unsupervised Ranking for Plagiarism
    Source Retrieval—Notebook for PAN at CLEF 2013. In: [6]
41. Williams, K., Chen, H.H., Giles, C.: Supervised Ranking for Plagiarism Source
    Retrieval—Notebook for PAN at CLEF 2014. In: [4]
42. Zubarev, D., Sochenkov, I.: Using Sentence Similarity Measure for Plagiarism Source
    Retrieval—Notebook for PAN at CLEF 2014. In: [4]



