       Towards Data Submissions for Shared Tasks:
     First Experiences for the Task of Text Alignment*

             Martin Potthast,1 Steve Göring,1 Paolo Rosso,2 and Benno Stein1
      1
          Web Technology & Information Systems, Bauhaus-Universität Weimar, Germany
          2
            Natural Language Engineering Lab, Universitat Politècnica de València, Spain

                           pan@webis.de          http://pan.webis.de



          Abstract This paper reports on the organization of a new kind of shared task
          that outsources the creation of evaluation resources to its participants. We intro-
          duce the concept of data submissions for shared tasks, and we use our previous
          shared task on text alignment as a testbed. A total of eight evaluation datasets
          have been submitted by as many participating teams. To validate the submitted
          datasets, they have been manually peer-reviewed by the participants. Moreover,
          the submitted datasets have been fed to 31 text alignment approaches in order to
          learn about the datasets’ difficulty. The text alignment implementations were
          submitted to our shared task in previous years and have since been kept operational
          on the evaluation-as-a-service platform TIRA.


1   Introduction

The term “shared task” refers to a certain kind of computer science event, where re-
searchers working on a specific problem of interest, the task, convene to compare their
latest algorithmic approaches to solving it in a controlled laboratory experiment.1 The
organizers of a shared task usually take care of the lab setup by compiling evaluation
resources, by selecting performance measures, and sometimes even by raising the task
itself for the first time.
    The way shared tasks are organized at present may strongly influence future evalu-
ations: for instance, in case the event has sufficient success within its target community,
subsequent research on the task may be compelled to follow the guidelines proposed
by the shared task organizers, or else risk rejection from reviewers who are aware of
the shared task. The fact that a shared task has been organized may amplify the impor-
tance of its experimental setup over others, stifling contributions off the beaten track.
However, there is only anecdotal evidence in support of this narrowing effect of shared
tasks. Nevertheless, it has been frequently pointed out in the position papers submitted
to a workshop organized by the natural language generation community on the pros and
cons of adopting shared tasks for evaluation [1].
*
   A summary of this report has been published as part of [62], and some of the descriptions are
   borrowed from earlier reports on the shared task of text alignment at PAN.
 1
   The etymology of the term “shared task” is unclear; conceivably, it was coined to describe a
   special kind of conference track and was picked up into general use from there.
     One of the main contributions of organizing a shared task is that of creating a
reusable experimental setup for future research, allowing for comparative evaluations
even after the shared task has passed. Currently, however, this goal is only partially
achieved: participants and researchers following up on a shared task may only compare
their own approach to those of others, whereas other aspects of a shared task remain
fixed, such as the evaluation datasets, the ground truth annotations, and the performance
measures used. As time passes, these fixtures limit future research to using evaluation
resources that may quickly become outdated in order to compare new approaches with
the state of the art. Moreover, given that shared tasks are often organized by only a few
dedicated people, this further limits the attainable diversity of evaluation resources.
     To overcome the outlined shortcomings, we propose that shared task organizers at-
tempt to remove as many fixtures from their shared tasks as possible, relinquishing
control over the choice of evaluation resources to their community. We believe that, in
fact, only data formats and interfaces between evaluation resources need to be fixed a
priori to ensure compatibility of contributions submitted by community members. As a
first step in this direction, we investigate for the first time the feasibility of data submis-
sions to a well-known shared task by posing the construction of evaluation datasets as
a shared task of its own.
     As a testbed for data submissions, we use our established shared task on plagiarism
detection at PAN [62], and in particular the task on text alignment. Instead of invit-
ing text alignment algorithms, we ask participants to submit datasets of their own de-
sign, validating the submitted datasets both via peer-review and by running the text
alignment softwares submitted in previous editions of the text alignment task against
the submitted corpora. Altogether, eight datasets have been submitted by participants,
ranging from automatically constructed ones to manually created ones, including vari-
ous languages.
     In what follows, after a brief discussion of related work in Section 2, we outline our
approach to data submissions in Section 3, survey the submitted datasets in Section 4,
and report on their validation and evaluation in Section 5.


2   Related Work
Research on plagiarism detection has a long history, both within PAN and without. We
have been the first to organize shared tasks on plagiarism detection [51], and since
then, we have introduced a number of variations of the task as well as new evaluation
resources: the first shared task was organized in 2009, studying two sub-problems of
plagiarism detection, namely the traditional external plagiarism detection [64], where
a reference collection is used to identify plagiarized passages, and intrinsic plagiarism
detection [31, 63], where no such reference collection is at hand and plagiarism has to
be identified from writing style changes within a document. For this shared task, we
have created the first standardized, large-scale evaluation corpus for plagiarism detec-
tion [50]. As part of this effort, we have devised novel performance measures which,
for the first time, took into account task-specific characteristics of plagiarism detec-
tion, such as detection granularity. Finally, in the first three years of PAN, we have
introduced cross-language plagiarism detection as a sub-task of plagiarism detection
for the first time [42], adding corresponding problem instances into the evaluation cor-
pus. Altogether, in the first three years of our shared task, we successfully acquired and
evaluated plagiarism detection approaches of 42 research teams from around the world,
some participating more than once. Many insights came out of this endeavor which
informed our subsequent activities [51, 41, 43].
     Starting in 2012, we have completely overhauled our evaluation approach to pla-
giarism detection [44]. Since then, we have separated external plagiarism detection into
the two tasks source retrieval and text alignment. The former task deals with informa-
tion retrieval approaches to retrieve potential sources for a suspicious document from
a large text collection, such as the web, that is indexed with traditional retrieval
models. The latter task of text alignment focuses on the problem of extracting matching
passages from pairs of documents, if there are any. Neither task had been studied
in this way before.
     For source retrieval, we went to considerable lengths to set up a realistic evaluation
environment: we indexed the entire English portion of the ClueWeb09 corpus, build-
ing the research search engine ChatNoir [48]. ChatNoir served two purposes, namely
as an API for plagiarism detectors for those who cannot afford to index the ClueWeb
themselves, but also as an end user search engine for the authors who were hired to con-
struct a new, realistic evaluation resource for source retrieval. We have hired 18 semi-
professional authors from the crowdsourcing platform oDesk (now Upwork) and asked
them to write essays of length at least 5000 words on pre-defined TREC web track
topics. To write their essays, the authors were asked to conduct their research using
ChatNoir, reusing text from the web pages they found. This way, we have created re-
alistic information needs which in turn led the authors to use our search engine in a
realistic way to fulfill their task. As part of this activity, we gained new insights into
the nature of how humans reuse text, some building up a text as they go, whereas others
first collect a lot of text and then boil it down to the final essay [49]. Finally, we have
devised and developed new evaluation measures for source retrieval that, for the first
time, take into account the retrieval of near-duplicate results when calculating precision
and recall [45, 47]. We report on the latest results of the source retrieval subtask in [20].
     Regarding text alignment, we focus on the text reuse aspects of the task by stripping
down the problem to its very core, namely comparing two text documents to identify
reused passages of text. In this task, we have started in 2012 to experiment with software
submissions for the first time, which led to the development of the TIRA experimenta-
tion platform [18]. TIRA is an implementation of the emerging evaluation-as-a-service
paradigm [22]. We have since scaled TIRA in order to also collect participant software
for source retrieval and for the entire PAN evaluation lab as of 2013, thus improving the
reproducibility of PAN’s shared tasks for the foreseeable future [17, 46]. Altogether,
in the second three-year cycle of this task, we have acquired and evaluated plagiarism
detection approaches of 20 research teams on source retrieval and 31 research teams on
text alignment [44, 45, 47].
3     Data Submissions: Crowdsourcing Evaluation Resources
Data submissions for shared tasks have not been systematically studied until now, so
that no best practices have been established, yet. Asking shared task participants to
submit data is nothing short of crowdsourcing, albeit the task of creating an evaluation
resource is by comparison much more complex than average crowdsourcing tasks found
in the literature. In what follows, we outline the rationale of data submissions, review
important aspects of defining a data submissions task that may inform instructions to be
handed out to participants, and detail two methods for evaluating submitted datasets.

3.1     Rationale of Data Submissions to Shared Tasks
Traditionally, the evaluation resources required to run a shared task are created by its
organizers—but the question remains: why? The following reasons can be identified:
    – Quality control. The success of a shared task rests with the quality of its evaluation
      resources. A poorly built evaluation dataset may invalidate evaluation results, which
      is one of the risks of organizing shared tasks. This is why organizers have a vested
      interest in maintaining close control over evaluation resources, and how they are
      constructed.
    – Seniority. Senior community members may have the best vantage point in order to
      create representative evaluation resources.
    – Access to proprietary data. Having access to an otherwise closed data source (e.g.,
      from a company) gives some community members an advantage over others in
      creating evaluation resources with a strong connection to the real world.
    – Task inventorship. The inventor of a new task (i.e., tasks that have not been con-
      sidered before), is in a unique position to create normative evaluation resources,
      shaping future evaluations.
    – Being first to the table. The first one to pick up the opportunity may take the lead
      in constructing evaluation resources (e.g., when a known task has never been orga-
      nized as a shared task before, or, to mitigate a lack of evaluation resources).
All of the above are good reasons for an individual or a small group of researchers to
organize a shared task, and, to create corresponding evaluation resources themselves.
However, from reviewing dozens of shared tasks that have been organized in the human
language technologies, it turns out that none of these reasons is a necessary requirement [46]: shared tasks
are being organized using less-than-optimal datasets, by newcomers to a given research
field, without involving special or proprietary data, and without inventing the task in
the first place. Hence, we question the traditional connection of shared task organization
and evaluation resource construction. This connection limits the scale and diversity, and
therefore the representativeness of the evaluation resources that can be created:
    – Scale. The number of man-hours that can be invested in the construction of eval-
      uation resources is limited by the number of organizers and their personal com-
      mitment. This limits the scale of the evaluation resources. Crowdsourcing may be
    employed as a means to increase scale in many situations; however, this is mostly
    not the case when task-specific expertise is required.
 – Diversity. The combined task-specific capabilities of all task organizers may be
   limited regarding the task’s domain. For example, the number of languages spoken
   by task organizers is often fairly small, whereas true representativeness across lan-
   guages would require evaluation resources from at least all major language families
   spoken today.
By involving participants in a structured way into the construction of evaluation re-
sources, task organizers may build on their combined expertise, man-power, and diver-
sity. However, there is no free lunch, and outsourcing the construction of evaluation
resources introduces the new organizational problem that the datasets created and sub-
mitted by third parties must be validated and evaluated for quality.


3.2   Defining a Data Submission Task
When casting a data submission task, there are a number of desiderata that participants
should meet:

 – Data format compliance. The organizers should agree on a specific data format suit-
   able for the task in question. The format should be defined with the utmost care,
   since it may be impossible to fix mistakes discovered later on. Experience shows
   that the format of the evaluation datasets has a major effect on how participants
   implement their softwares for a task, which is especially true when inviting soft-
   ware submissions for a shared task. Regarding data submissions, a dataset should
   comprise a set of problem instances with respect to the task, where each problem
   instance shall be formatted according to the specifications handed out by the or-
   ganizers. To ensure compliance, the organizers should prepare a format validation
   tool, which allows participants to check the format of their dataset in progress, and
    whether it complies with the format specifications. This way, participants move in
   the right direction from the start, and less back and forth will be necessary after a
   dataset has been submitted. The format validation tool should check every aspect
   of the required data format in order to foreclose any unintended deviation.
 – Annotation validity. All problem instances of a dataset should comprise annotations
    that reveal their true solution with regard to the task in question. It goes without say-
    ing that all annotations should be valid. Datasets that do not comprise annotations
   are of course useless for evaluation purposes, whereas annotation validity as well as
   the quality and representativeness of the problem instances selected by participants
   determines the usefulness of a submitted dataset.
 – Representative size. The datasets submitted should be of sufficient size, so that
   dividing them into training and test datasets can be done without sacrificing repre-
   sentativeness, and so that evaluations conducted based on the resulting test datasets
   are meaningful and not prone to noise.
 – Choice of data source. The choice of a data source should be left up to participants,
   and should open the possibility of using manually created data either from the real
   world or by asking human test subjects to emulate problem instances, as well as
   automatically generated data based on a computer simulation of problem instances
   for the task at hand.
 – Copyright and sensitive data. Participants must ensure that they hold the usage
   rights to the data, that they may transfer usage rights to the organizers of the shared
   task, and that they may allow the organizers to transfer usage rights to other partici-
   pants. The data must further be compliant with privacy laws and ethically innocuous.
   Depending on the task at hand and what the organizers of a shared task desire, accepting
   confidential or otherwise sensitive data may still be possible: in case the shared task
   also invites software submissions, the organizers may promise participants that the
   sensitive data does not leak to participants by running submitted software at their
   site against the submitted datasets. Nevertheless, special security precautions must
   be taken to ensure that sensitive data does not leak when feeding it to untrusted
   software.

3.3    Evaluating Submitted Datasets: Peer-Review and Software Submissions
The construction of new evaluation datasets must be done with the utmost care, since
datasets are rarely double-checked or questioned again once they have been accepted
as authoritative. This presents the organizers of a dataset construction task with the new
challenge of evaluating submitted datasets, where the evaluation of a dataset should aim
at establishing its validity. In general, the organizers of data submission tasks should
ensure not to advertise submitted datasets as valid unless they are, since such an en-
dorsement may carry a lot of weight in a shared task’s community.
    Unlike with shared tasks that invite algorithmic contributions, the validity of a
dataset typically cannot be established via an automatically computed performance
measure, but requires manual reviewing effort. Therefore, as part of their participa-
tion, all participants who submit a dataset should be compelled to also peer-review the
datasets submitted by other participants. Moreover, inviting other community mem-
bers to conduct independent reviews may ensure impartial results. Reviewers may be
instructed as follows:
      The peer-review is about dataset validity, i.e. the quality and realism of the
      problem instances. Conducting the peer-review includes:
         – Manually review as many examples as possible from all datasets
        – Make observations about how the dataset has been constructed
        – Make observations about potential quality problems or errors
        – Make observations on the realism of each dataset’s problem instances
        – Write about your observations in your notebook (make sure to refer to
          examples from the datasets for your findings).
Handing out the complete submitted datasets for peer-review, however, is out of the
question, since this would defeat the purpose of subsequent shared task evaluations by
revealing the ground truth prematurely. Here, the organizers of a dataset construction
task serve as mediators, splitting submitted datasets into training and test datasets, and
handing out only the training portion for peer-review. The participants who submitted
a given dataset, however, may never be reliably evaluated based on their own dataset.
Also, collusion among participants cannot be ruled out entirely.
    Finally, when a shared task has previously invited software submissions, this creates
ample opportunity to re-evaluate the existing softwares on the submitted datasets. This
allows for evaluating submitted datasets in terms of their difficulty: the performances
of existing software on submitted datasets, when compared to their respective perfor-
mances on established datasets, allow for a relative assessment of dataset difficulty. If a
shared task has not invited software submissions so far, then the organizers should set up
a baseline software for the shared task and run that against submitted datasets to allow
for a relative comparison among them.

3.4    Data Submissions for Text Alignment
In text alignment, given a pair of documents, the task is to identify all contiguous pas-
sages of reused text between them. The challenge with this task is to identify passages
of text that have been obfuscated, sometimes to the extent that, apart from stop words,
little lexical similarity remains between an original passage and its reused counterpart.
Consequently, for task organizers, the challenge is to provide a representative dataset
of documents that emulate this situation. For the previous editions of PAN, we have
created such datasets ourselves, whereas obfuscated text passages have been generated
automatically, semi-automatically via crowdsourcing [6], and by collecting real cases.
Until now, however, we neglected participants of our shared task as potential assistants
in creating evaluation resources. Given that a stable community has formed around our
task in previous years, and that the data format has not changed in the past three years,
we felt confident to experiment with this task and to switch from algorithm development
to data submissions. We cast the task to construct an evaluation dataset as follows:
    – Dataset collection. Gather real-world instances of text reuse or plagiarism, and an-
      notate them.
    – Dataset generation. Given pairs of documents, generate passages of reused or pla-
      giarized text between them. Apply a means of obfuscation of your choosing.
The task definition is kept as open as possible, imposing no particular restrictions on
the way in which participants approach this task, which languages they consider, or
which kinds of obfuscation they collect or generate. In particular, the task definition
highlights the two possible avenues of dataset construction, namely manual collection,
and automatic construction. To ensure compatibility among each other and with previ-
ous datasets, however, the format of all submitted datasets had to conform with that of
the existing datasets used in previous years. By fixing the dataset format, future editions
of the text alignment task may build on the evaluation resources created within this task
without further effort, and the pieces of software that have been submitted in previous
editions of the text alignment task, which are available on the TIRA platform for eval-
uation as a service, may be re-evaluated on the new datasets. In our case, more than 30
text alignment approaches have been submitted since 2012. To ensure compatibility, we
handed out a dataset validation tool that checked all format restrictions.


4     Survey of Submitted Text Alignment Datasets
A total of eight datasets have been submitted to the PAN 2015 text alignment task on
dataset construction. The datasets are of varying sizes and have been built with a variety
of methods. In what follows, after a brief discussion of the dataset format, we survey the
datasets with regard to their source of documents and languages, and the construction pro-
cess employed by their authors, paying special attention to the obfuscation approaches.
The section closes with an overview of dataset statistics.

4.1   Dataset Format
We asked participants to comply with the dataset format of the PAN plagiarism corpora
that have been used for the text alignment task at PAN since 2012. A compliant dataset
consists of documents in the form of plain text files encoded in UTF-8 in which cases
of plagiarism are found, plus XML files comprising meta data about them. The doc-
uments are divided into so-called source documents and suspicious documents, where
suspicious documents are supposed to be analyzed for evidence of plagiarism. Each
suspicious document is a priori paired with one or more source documents, i.e., the task
in text alignment is to extract passages of plagiarized text from a given pair of docu-
ments, if there are any. Text alignment does not involve retrieval of source documents, a
different problem studied in the accompanying source retrieval task [20]. The meta data
for each pair of documents reveals if and where plagiarism cases are found within, in
the form of character offsets and lengths for each plagiarism case. These ground truth
annotations are used to measure the performance of a text alignment algorithm in ex-
tracting these plagiarism cases. While the problem is trivial for situations where text
has been lifted verbatim from a source document into a suspicious document, the prob-
lem gets a lot more difficult in case the plagiarized text passage has been obfuscated,
e.g., by being paraphrased, translated, or summarized. There are many ways to obfus-
cate a plagiarized test passage, both in the real world as well as using (semi-)automatic
emulations of the real thing. Therefore, each dataset is supposed to contain an addi-
tional folder for each obfuscation strategy applied; the XML meta data files revealing
the ground truth annotations are divided accordingly into these folders. To assist par-
ticipants in getting the format of their datasets right, we supplied them with a format
validation tool that checks formatting details and that performs basic sanity checks. This
tool, of course, cannot ascertain whether the text passages annotated as plagiarized are
actually meaningful or not.
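To illustrate the kind of checks involved, below is a minimal sketch of an offset sanity check over a PAN-style dataset. It assumes ground truth XML files whose root element carries a reference attribute naming the suspicious document, and feature elements with this_offset, this_length, source_reference, source_offset, and source_length attributes; these names and the susp/src directory layout follow the convention of previous PAN corpora but should be read as assumptions here, and the sketch is not the validation tool handed out to participants.

# Minimal sketch of an offset sanity check for a PAN-style text alignment
# dataset. Directory layout ("susp", "src") and XML attribute names are
# assumptions based on previous PAN corpora, not a normative specification.
import sys
import xml.etree.ElementTree as ET
from pathlib import Path


def check_annotation(xml_file: Path, susp_dir: Path, src_dir: Path) -> list:
    """Return error messages for one ground truth XML file."""
    errors = []
    root = ET.parse(xml_file).getroot()
    susp_text = (susp_dir / root.attrib["reference"]).read_text(encoding="utf-8")
    for feature in root.iter("feature"):
        if feature.attrib.get("name") != "plagiarism":
            continue
        offset = int(feature.attrib["this_offset"])
        length = int(feature.attrib["this_length"])
        if offset + length > len(susp_text):
            errors.append(f"{xml_file.name}: suspicious span exceeds document length")
        src_text = (src_dir / feature.attrib["source_reference"]).read_text(encoding="utf-8")
        src_off = int(feature.attrib["source_offset"])
        src_len = int(feature.attrib["source_length"])
        if src_off + src_len > len(src_text):
            errors.append(f"{xml_file.name}: source span exceeds document length")
    return errors


if __name__ == "__main__":
    dataset = Path(sys.argv[1])
    problems = []
    for xml_file in dataset.glob("**/*.xml"):
        problems += check_annotation(xml_file, dataset / "susp", dataset / "src")
    print("\n".join(problems) if problems else "no offset errors found")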

4.2   Dataset Overview
Table 1 compiles an overview of the submitted datasets. The table shows the sizes of
each corpus in terms of documents and plagiarism cases within. The sizes vary greatly
from around 160 documents and about the same number of plagiarism cases to more
than 27000 documents and more than 11000 plagiarism cases. Most of the datasets com-
prise English documents, whereas two feature cross-language plagiarism from Urdu
and Persian to English. Two datasets contain only non-English documents in Persian
and Chinese. Almost all datasets also contain a portion of suspicious documents that do
not contain any plagiarism, whereas the datasets of Alvi et al. [4], Mohtaj et al. [32], and
Palkovskii and Belov [36] contain only few such documents, and that of Kong et al. [27]
none. The documents are mostly short (up to 10 pages), where a page is measured as
1500 chars (a norm page in print publishing), which corresponds to about 288 words in
                   Table 1. Overview of dataset statistics for the eight submitted datasets
Dataset Statistics        Alvi [4]  Asghari [5]  Cheema [9]  Hanif [21]  Khoshnavataher [25]  Kong [27]  Mohtaj [32]  Palkovskii [36]
Generic
documents                             272       27115      1000     1000     2111      160     4261     5057
plagiarism cases                      150       11200       250      270      823      152     2781     4185
languages                              en       en-fa        en    en-ur       fa       zh       en       en
Document purpose
source documents                     37%         74%       50%      50%      50%      95%       78%      64%
suspicious documents
  - with plagiarism                  55%         13%       25%      27%      25%       5%       15%      33%
  - w/o plagiarism                    8%         13%       25%      23%      25%       0%        5%       3%
Document length
short (<10 pp.)                      92%         68%      100%      95%      95%      42%       46%      93%
medium (10-100 pp.)                   8%         32%        0%       5%       5%      57%       54%       7%
long (>100 pp.)                       0%          0%        0%       0%       0%       0%        0%       0%
Plagiarism per document
hardly (<20%)                        48%         68%       97%      89%      27%      75%       68%      80%
medium (20%-50%)                     19%         32%        3%      10%      72%      25%       32%      19%
much (50%-80%)                       33%          0%        0%       1%       0%       0%        0%       1%
entirely (>80%)                       0%          0%        0%       0%       0%       0%        0%       0%
Case length
short (<1k characters)               99%         99%       90%      99%      41%      85%     100%       97%
medium (1k-3k characters)             1%          1%       10%       1%      59%      11%       0%        3%
long (>3k characters)                 0%          0%        0%       0%       0%       4%       0%        0%
Obfuscation synthesis approaches
character-substitution               33%          –         –        –        –        –         –        –
human-retelling                      33%          –         –        –        –        –         –        –
synonym-replacement                  33%          –         –        –        –        –         –        –
translation-obfuscation               –         100%        –        –        –        –         –       24%
no-obfuscation                        –           –         –        –       31%       –        10%      24%
random-obfuscation                    –           –         –        –       69%       –        87%      24%
simulated-obfuscation                 –           –         –        –        –        –         3%       –
summary-obfuscation                   –           –         –        –        –        –         –       28%
undergrad                             –           –        61%       –        –        –         –        –
phd                                   –           –        16%       –        –        –         –        –
masters                               –           –         6%       –        –        –         –        –
undergrad-in-progress                 –           –        17%       –        –        –         –        –
plagiarism                            –           –         –      100%       –        –         –        –
real-plagiarism                       –           –         –        –        –      100%        –        –



English. Some datasets also contain medium-length documents of about 10-100 pages,
however, only the datasets of Kong et al. [27] and Mohtaj et al. [32] have more medium
documents than short ones. No datasets have long documents; for comparison, the PAN
plagiarism corpora 2009-2012 contain about 15% long documents.
     The portion of plagiarism per document is below 50% of a given document in almost
all cases, suggesting that the plagiarism cases are mostly short, too. This is corroborated
when looking at the distributions of case length: almost all cases are below
1000 characters, except for the dataset of Khoshnavataher et al. [25] and, to a lesser
extent, that of Kong et al. [27] and Cheema et al. [9]. Only the dataset of Alvi et al.
[4] contains documents with much (up to 80%) plagiarism. Again, no dataset contains
documents that are entirely plagiarized, and only Kong et al. [27] has a small percentage
of long plagiarism cases.
     With regard to the obfuscation synthesis approaches, Table 1 reports the names used
by the dataset authors, for consistency with the respective datasets’ folder names. The ap-
proaches are discussed in more detail below; there, however, we depart from the names
used by the dataset authors, since they are inconsistent with the literature.
4.3   Document Sources

The first building block of every text alignment dataset is the set of documents used,
which is divided into suspicious documents and source documents. One of the main
obstacles in this connection is to pair suspicious documents with source documents that
are roughly about the same topic, or that partially share a topic. Ideally, one would
choose a set of documents that naturally possess such a relation, however, such docu-
ments are not readily available at scale. Although it is not a strict necessity to ensure
topical relations for pairs of suspicious and source documents, doing so adds to the
realism of an evaluation dataset for text alignment, since, in the real world, spurious
similarities between topically related documents that are not plagiarism are much more
likely than otherwise.
    For the PAN plagiarism corpora 2009-2012, we employed documents obtained from
Project Gutenberg for the most part [50], pairing documents at random and disre-
garding their topical relation. We have experimented with clustering algorithms to select
document pairs with at least a basic topical relation, but this had only limited success.
For PAN 2013, we switched from Project Gutenberg documents to using ClueWeb09
documents, based on the Webis text reuse corpus 2012, Webis-TRC-12 [49]. In this
corpus, as well as the text alignment corpus that we derived from it, a strong topical
relation between documents can be assumed, since the source documents have been
manually retrieved for a predefined TREC topic.
    Regarding the submitted datasets, those of Asghari et al. [5], Cheema et al. [9],
Hanif et al. [21], Khoshnavataher et al. [25], and Mohtaj et al. [32] employ docu-
ments drawn from Wikipedia. Cheema et al. [9] also employ documents from Project
Gutenberg, but it is unclear whether pairs of suspicious and source documents are se-
lected from both, or whether they are always from the same document source. In all
cases, the authors make an effort to pair documents that are topically related. In this
regard, Khoshnavataher et al. [25] and Mohtaj et al. [32] both employ the same strategy
of clustering Wikipedia articles using the bipartite document-category graph and the
graph-based clustering algorithm of Rosvall and Bergstrom [55]. Asghari et al. [5] rely
on cross-language links between Wikipedia articles in different languages to identify
documents about the same topic.
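As a rough illustration of such topical pairing, the sketch below groups articles that share categories using a simple union-find over the article-category relation; this is only a stand-in for the graph clustering of Rosvall and Bergstrom that the dataset authors actually use, and the input data is invented for the example.

# Sketch of topical pairing via the article-category relation: articles that
# share categories end up in the same group and can then be paired as
# suspicious and source documents. Union-find stands in for the graph
# clustering actually used by the dataset authors.
from collections import defaultdict


def group_by_categories(article_categories: dict) -> list:
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    # Link every article to its categories in one shared structure.
    for article, cats in article_categories.items():
        for cat in cats:
            union(article, "cat:" + cat)
    groups = defaultdict(list)
    for article in article_categories:
        groups[find(article)].append(article)
    return [g for g in groups.values() if len(g) > 1]


articles = {"A": {"Physics"}, "B": {"Physics", "Optics"}, "C": {"Poetry"}}
print(group_by_categories(articles))   # [['A', 'B']] -- a candidate document pair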
    Alvi et al. [4] employ translations of Grimm’s fairy tales into English, obtained
from Project Gutenberg, pairing documents which have been translated by different
authors. This dataset is therefore limited to a very specific genre, comprising some-
times rather old forms of usage and style. Nevertheless, pairs of documents that tell the
same fairy tale are bound to have strong topical relation.
    Kong et al. [27] follow our strategy to construct the Webis-TRC-12 [49], namely
they asked 10 volunteers to genuinely write essays on predefined topics. The
volunteers were asked to use a web search engine to manually retrieve topic-related
sources and to reuse text passages found on the web pages to compose their essays. This
way, the resulting suspicious documents and their corresponding source documents also
possess a strong topical relation.
    Palkovskii and Belov [36] took a shortcut by simply reusing the training dataset of
the PAN 2013 shared task on text alignment [45] and applying an additional means
of obfuscation across the corpus, as detailed below.
4.4   Obfuscation Synthesis
The second building block of every text alignment dataset is the set of obfuscation
approaches used to emulate human plagiarists who try to hide their plagiarism. Obfus-
cation of plagiarized text passages is what makes the task of text alignment difficult
for detection algorithms as well as for constructing datasets. The difficulty for the latter
arises from the fact that real plagiarism is hard to find at scale, especially when pla-
giarists invest a lot of effort in hiding it. It can be assumed that the plagiarism cases
that make the news are biased toward cases that are easier to detect. There-
fore, approaches have to be devised that will yield obfuscated plagiarism that comes
close to the real thing, but can be created at scale with reasonable effort.
    There are basically two alternatives to obfuscation synthesis, namely within context
and without: within context, the entire document surrounding a plagiarized passage
is created simultaneously, either manually or automatically, whereas without context,
a plagiarized passage is created independently and afterwards embedded into a host
document. The latter is easier to accomplish, but lacks realism since plagiarized
passages are not interleaved with the surrounding host document or other plagiarized
passages around it. Moreover, when embedding an independently created plagiarized
passage into a host document, the selected host document should be topically related,
or else a basic topic drift analysis will reveal a plagiarized passage.
    For the PAN plagiarism corpora 2009-2013, we devised a number of obfuscation
approaches ranging from automatic obfuscation to manual obfuscation. This was done
without context, embedding plagiarized passages into host documents after obfuscation.
In particular, the obfuscation approaches are the following:
 – Random text operations. Random shuffling, insertion, replacement, or removal of
   phrases and sentences; insertions and replacements are obtained from context doc-
   uments [51] (a minimal sketch of this approach follows this list).
 – Semantic word variation. Random replacement of words with synonyms, antonyms,
   hyponyms, or hypernyms [51].
 – Part-of-speech-preserving word shuffling. Shuffling of phrases while maintaining
   the original POS sequence [51].
 – Machine translation. Automatic translation from one language to another [51].
 – Machine translation and manual copyediting. Manually corrected output of ma-
   chine translation [43].
 – Manual translation from parallel corpus. Usage of translated passages from an ex-
   isting parallel corpus [44].
 – Manual paraphrasing via crowdsourcing. Asking human volunteers to paraphrase
   a given passage of text, possibly on crowdsourcing platforms, such as Amazon’s
   Mechanical Turk [6, 41, 50].
 – Cyclic translation. Automatic translation of a text passage from one language via a
    sequence of other languages back to the original language [45].
 – Summarization. Summaries of long text passages or complete documents obtained
    from corpora of summaries, such as the DUC corpora [45].
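To make the first of these approaches concrete, the following sketch applies random text operations to a passage; the operation mix and parameters are illustrative assumptions of our own, not the exact procedure used for the PAN corpora.

# Sketch of "random text operations" obfuscation: shuffle short windows of
# words, drop some words, and insert phrases from a context document.
# The operation mix and parameters are illustrative assumptions.
import random


def random_text_operations(passage: str, context: str, seed: int = 0) -> str:
    rng = random.Random(seed)
    words = passage.split()
    # Shuffle short windows of words to break up exact n-gram matches.
    for start in range(0, len(words) - 5, 5):
        window = words[start:start + 5]
        rng.shuffle(window)
        words[start:start + 5] = window
    # Randomly drop roughly 10% of the words.
    words = [w for w in words if rng.random() > 0.1]
    # Insert a few short phrases drawn from the surrounding context document.
    context_words = context.split()
    for _ in range(max(1, len(words) // 20)):
        pos = rng.randrange(len(words) + 1)
        start = rng.randrange(max(1, len(context_words) - 2))
        words[pos:pos] = context_words[start:start + 3]
    return " ".join(words)


source = "the quick brown fox jumps over the lazy dog and runs into the forest"
print(random_text_operations(source, "unrelated filler text for random insertions"))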
   Regarding the submitted datasets, Kong et al. [27] recreate our previous work on
generating completely manual text reuse cases for the Webis-TRC-12 [49] on a small
scale. They asked volunteers to write essays on predefined topics, reusing text passages
from the web pages they found during manual retrieval. From all submitted datasets,
Kong et al. [27] present the only one where obfuscation has been synthesized within
context. They also introduce an interesting twist: to maximize the obfuscation of the
plagiarized text passages, the student essays have been submitted to a plagiarism detec-
tion system widely used at Chinese universities, and the volunteers have paraphrased
their essays until the system could no longer detect the plagiarism.
    For all other datasets, obfuscated plagiarized passages have been synthesized with-
out context, and then embedded into host documents. Here, Alvi et al. [4] employ man-
ual translations from a pseudo-parallel corpus, namely different editions of translations
of Grimm’s fairy tales. These translation pairs are then embedded into other, unrelated
fairy tales, assuming that the genre of fairy tales in general will, to some extent, provide
context with a more or less matching topic. It remains to be seen whether a topic drift
analysis may reveal the plagiarized passages. At any rate, independent translations of
fairy tales will provide for an interesting challenge for text alignment algorithms. In
addition to that, Alvi et al. [4] also employ the above obfuscation approach of semantic
word variation, and a new approach which we call UTF character substitution. Here,
characters are replaced by look-alike characters from the UTF table, which makes it
more difficult, though not impossible, for text alignment algorithms to match words at
a lexical level. Note in this connection that Palkovskii and Belov [36] have also applied
UTF character substitution on top of the reused PAN 2013 training dataset; they have
already pointed out back at PAN 2009 that students sometimes apply this approach in
practice [37].
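For illustration, a minimal sketch of such a substitution follows; the homoglyph mapping below is a small, invented sample rather than the mapping used in either dataset.

# Sketch of UTF (Unicode) look-alike character substitution: replace Latin
# letters with visually similar characters so that lexical matching fails
# while the text still reads the same to a human. The mapping is a small
# illustrative sample, not an exhaustive table.
HOMOGLYPHS = {
    "a": "\u0430",  # Cyrillic small a
    "c": "\u0441",  # Cyrillic small es
    "e": "\u0435",  # Cyrillic small ie
    "o": "\u043e",  # Cyrillic small o
    "p": "\u0440",  # Cyrillic small er
}


def substitute_homoglyphs(text: str) -> str:
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)


original = "a copy of a source passage"
obfuscated = substitute_homoglyphs(original)
print(obfuscated)               # looks like the original when rendered
print(obfuscated == original)   # False: several characters were replaced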
    Cheema et al. [9] employ manual paraphrasing via crowdsourcing; they have re-
cruited colleagues, friends, and students at different stages of their education, namely
undergrads, bachelors, masters, and PhDs, and asked them to paraphrase a total of
250 text passages selected from their respective study domain (e.g., technology, life
sciences, and humanities). These paraphrased text passages have then been embedded
into documents drawn from Wikipedia and Project Gutenberg which were selected ac-
cording to topical similarity to the paraphrased text passages.
    Hanif et al. [21] employ machine translation with and without manual copyedit-
ing, and machine translation with random text operations to obfuscate text passages
obtained from the Urdu Wikipedia. The translated passages are then embedded into host
documents selected so that they match the topic of the translated passages.
    Since the datasets of Asghari et al. [5], Mohtaj et al. [32], and Khoshnavataher et al.
[25] have been compiled by more or less the same people, their construction process
if very similar. In all cases, obfuscated text passages obtained from Wikipedia articles
are embedded into other Wikipedia articles that serve as suspicious documents. For the
monolingual datasets, Mohtaj et al. [32] and Khoshnavataher et al. [25] employ ran-
dom text operations as an obfuscation approach. In addition, for both monolingual and
cross-language datasets, a new way of creating obfuscation is devised: Asghari et al.
[5] and Mohtaj et al. [32] employ what we call “sentence stitching” with sentence pairs
obtained from parallel corpora when creating cross-language plagiarism, or paraphrase
corpora for monolingual plagiarism. To create a plagiarized passage and its correspond-
ing source passage, sentence pairs from such corpora are selected and then simply ap-
pended to each other to form aligned passages of text. Various degrees of obfuscation
difficulty can be introduced by measuring the similarity of sentence pairs with an ap-
propriate similarity measure, and by combining sentences with high similarity to create
low obfuscation, and vice versa. The authors try to ensure that combined sentence pairs
have at least some similarity to other sentences found in a generated pair of passages
by first clustering the sentences in the corpora used by their similarity. However, the
success of the latter depends on how many semantically related sentence pairs are ac-
tually found in the corpora used, since clustering algorithms will even find clusters of
sentences when there are only unrelated sentence pairs.
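The following sketch illustrates the sentence stitching idea under simplifying assumptions of our own: the paraphrase corpus is given as aligned sentence pairs, similarity is a bag-of-words cosine, and a similarity band controls the obfuscation difficulty; the clustering step described above is omitted.

# Sketch of "sentence stitching": build an aligned pair of passages by
# appending sentence pairs drawn from a paraphrase (or parallel) corpus,
# selecting pairs whose similarity falls into a band that controls the
# obfuscation difficulty. Similarity measure and thresholds are assumptions.
from collections import Counter
from math import sqrt


def cosine(a: str, b: str) -> float:
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = sqrt(sum(v * v for v in va.values())) * sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0


def stitch_passages(sentence_pairs, low=0.2, high=0.7, target_len=5):
    """Append sentence pairs with similarity in [low, high] to form a source
    passage and its obfuscated (plagiarized) counterpart."""
    src_passage, plg_passage = [], []
    for src_sent, para_sent in sentence_pairs:
        if low <= cosine(src_sent, para_sent) <= high:
            src_passage.append(src_sent)
            plg_passage.append(para_sent)
        if len(src_passage) == target_len:
            break
    return " ".join(src_passage), " ".join(plg_passage)


pairs = [("the cat sat on the mat", "a cat was sitting on the mat"),
         ("it rained all day yesterday", "rain fell throughout the day")]
source_passage, plagiarized_passage = stitch_passages(pairs, target_len=2)
print(source_passage)
print(plagiarized_passage)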
    In summary, the authors of the submitted datasets propose the following new obfus-
cation synthesis approaches:
    – UTF character substitution. Replacement of characters with look-alike UTF char-
      acters [4, 36].
    – Sentence stitching using parallel corpora or paraphrase corpora. Generation of pairs
      of text passages from a selection of translated or paraphrased passages [5, 32].
    – Manual paraphrasing against plagiarism detection. Paraphrasing a text passage until
      a plagiarism detector fails to detect the text passage as plagiarism [27].
Discussion In the first years of PAN and the plagiarism detection task, we have har-
vested a lot of the low-hanging fruit in terms of constructing evaluation resources, and in
particular in devising obfuscation synthesis approaches. In this regard, it is not surpris-
ing that, despite the fact that eight datasets have been submitted, only three completely
new approaches have been proposed. If things were different and we were to start a task
from scratch in this way, participants who decided to construct datasets would certainly
have come up with most of these approaches themselves. Perhaps, by having proposed
so many of the existing obfuscation synthesis approaches ourselves, we may be stifling
creativity by anchoring the thoughts of participants to what is already there instead of
what is missing. For example, it is interesting to note that none of the participants have
actually implemented and employed automatic paraphrasing algorithms or any other
form of text generation, e.g., based on language models.


5     Dataset Validation and Evaluation
Our approach to validating data submissions for shared tasks is twofold: (1) all partic-
ipants who submit a dataset have been asked to peer-review the datasets of all other
participants, and (2) all 31 pieces of software that have been submitted to previous
editions of our shared task on text alignment have been run against the submitted datasets. In
what follows, we review the reports of the participants’ peer-reviews that have been
submitted as part of their notebook papers to this year’s shared task, we introduce the
performance measures used to evaluate text alignment software, and we report on the
evaluation results obtained from running the text alignment software against the sub-
mitted datasets using the TIRA experimentation platform.
5.1   Dataset Peer-Review
Peer-review is one of the traditional means of the scientific community to check and
ensure quality. Data submissions introduce new obstacles to the successful organization
of a peer-review for the following reasons:
 – Dataset size. Datasets for shared tasks tend to be huge, which renders individual
    reviewers incapable of reviewing them all. Here, the selection of a statistically repre-
    sentative subset may alleviate the problem, allowing for an estimation of the total
   amount of errors or other quality issues in a given dataset.
 – Assessment difficulty. Even if the ground truth of a dataset is revealed, it may not
   be enough to easily understand and follow up on the construction principles of
   a dataset. Additional tools may be required to review problem instances at scale;
   in some cases, these tools need to solve the task’s underlying problem themselves,
   e.g., to properly visualize problem instances, whereas, without visualization, the
   review time per problem instance may be prohibitively long.
 – Reviewer bias. Given a certain assessment difficulty for problem instances, even if
   the ground truth is revealed, reviewers may be biased to favor easy decisions over
   difficult ones.
 – Curse of variety. While shared tasks typically tackle very clear-cut problems, the
   number of application domains where the task in question occurs may be huge. In
   these situations, it is unlikely that the reviewers available possess all the required
   knowledge, abilities, and experience to review and judge a given dataset with con-
   fidence.
 – Lack of motivation. While it is fun and motivating to create a new evaluation re-
   source, reviewing those of others is less so. Reviewers in shared tasks
   that invite data submissions may therefore feel less inclined to invest their time into
   reviewing other participants’ contributions.
 – Privacy concerns. Some reviewers may feel uncomfortable when passing open
   judgment on their peers’ work for fear of repercussions, especially when they
   find datasets to be sub-standard. However, an open discussion of the quality of
   evaluation resources of all kinds is an important prerequisite for progress.
All of the above obstacles have been observed in our case: some submitted datasets
are huge, comprising thousands of generated plagiarism cases; reviewing pairs of entire
text documents up to dozens of pages long, and comparing plagiarism cases that may be
extremely obfuscated is a laborious task, especially when no tools are around to help;
some submitted datasets have been constructed in languages that none of the reviewers
speak, except for those who constructed the dataset; and some of the invited reviewers
apparently lacked the motivation to actually conduct a review in a useful manner.
    The most comprehensive review has been submitted by volunteer reviewers, who
did not submit a dataset of their own: Franco-Salvador et al. [12] systematically an-
alyzed the submitted datasets both by computing dataset statistics and by manual in-
spection. The dataset statistics computed are mostly consistent with those we show in
Table 1. Since most of the datasets have been submitted without further explanation by
their authors, Franco-Salvador et al. [12] suggest asking participants, in the future, for
short descriptions of how their datasets have been constructed. Altogether, the review-
ers reverse-engineer the datasets in their review, making educated guesses at how they
have been constructed and what their document sources are. Regarding the datasets
of Asghari et al. [5], Cheema et al. [9], Mohtaj et al. [32], and Palkovskii and Belov
[36], the reviewers find unusual synonym replacements as well as garbled text, which
is probably due to the automatic obfuscation synthesis approaches used. Here, the au-
tomatic construction of datasets has its limits. Regarding datasets that are partially or
completely non-English, the reviewers went to great lengths to study them, despite not
being proficient in the datasets’ languages: the reviewers translated non-English pla-
giarized passages to English using Google Translate in order to get an idea of whether
paired text passages actually match with regard to their topic. This approach to over-
coming the language barrier is highly commendable, and shows that improvisation can
greatly enhance a reviewer’s abilities. Even if the finer details of the obfuscation syn-
thesis strategies applied in the non-English datasets are lost or skewed using Google
Translate, the reviewers at least get an impression of the plagiarism cases. Altogether,
the reviewers did not identify any extreme errors that invalidate any of the datasets for
use in an evaluation.
    The review submitted by Alvi et al. [4] has been conducted similarly to that of
Franco-Salvador et al. [12]. The reviewers note inconsistencies of the annotations where
character offsets and lengths do not appear to match the plagiarism cases in the datasets
of Hanif et al. [21] and Mohtaj et al. [32]. Moreover, the reviewers also employ Google
Translate to double-check the cross-language and non-English plagiarism cases for top-
ical similarity. Altogether, the reviewers sometimes have difficulties in discerning the
meaning of certain obfuscation synthesis names used by dataset authors, which is due
to the fact that no explanation about them has been provided by the dataset authors.
Again, they did not identify any detrimental errors. Furthermore, to help other review-
ers in their task, Alvi et al. [4] shared a visual tool for reviewing plagiarism cases in
the datasets.
    The author lists of the datasets of Asghari et al. [5], Mohtaj et al. [32], and Khosh-
navataher et al. [25] overlap, so that they decided to submit a joint review, written by
Zarrabi et al. [67]. The reviewers compile some dataset statistics and report on manu-
ally reviewing 20 plagiarism cases per dataset. Despite some remarks on small errors
identified, the authors do not find any systematic errors.
    Finally, Cheema et al. [9] provide only a very short and superficial review, Kong
et al. [27] compile only some corpus statistics without any remarks, and Palkovskii and
Belov [36] did not take part in the review phase.
Discussion The outcome of the review phase of our shared task is a mixed bag. While
some reviewers made an honest attempt to conduct thorough reviews, most did so only
superficially. From what we learned, the datasets can be used for evaluation with some
confidence; they are not systematically compromised. In hindsight, data submissions
should still involve a review phase; however, there should be more time for peer-review
than only one or two weeks. Also, the authors of submitted datasets should have a
chance of seeing their reviews before the final submission deadline, so that they have a
chance of improving their datasets. Reviewers should also be allowed to provide anony-
mous feedback. Nevertheless, the reviews should be published to allow later users of the
datasets to get an impartial idea of their quality. Finally, once the datasets are actually used
for evaluation purposes either in another shared task, or by independent researchers, the
researchers using them have a much higher motivation to actually look deep into the
datasets they are using.

5.2   Plagiarism Detection Performance Measures
To assess the submitted datasets, we run the text alignment software that has been
submitted in previous years to the shared task of text alignment at PAN and measure
its performance with the measures of [50]: precision, recall, and granularity, which are
combined into the plagdet score. Moreover, as of last year, we also compute case-level
and document-level precision, recall, and F1. In what follows, we recap these
performance measures.
Character level performance measures Let S denote the set of plagiarism cases in
the corpus, and let R denote the set of detections reported by a plagiarism detector
for the suspicious documents. A plagiarism case $s = \langle s_{plg}, d_{plg}, s_{src}, d_{src} \rangle$, $s \in S$, is
represented as a set $s$ of references to the characters of $d_{plg}$ and $d_{src}$, specifying the
passages $s_{plg}$ and $s_{src}$. Likewise, a plagiarism detection $r \in R$ is represented as $r$. We
say that $r$ detects $s$ iff $s \cap r \neq \emptyset$, $s_{plg}$ overlaps with $r_{plg}$, and $s_{src}$ overlaps with $r_{src}$.
Based on this notation, precision and recall of R under S can be measured as follows:
$$
\mathrm{prec}(S, R) = \frac{1}{|R|} \sum_{r \in R} \frac{|\bigcup_{s \in S} (s \sqcap r)|}{|r|}, \qquad
\mathrm{rec}(S, R) = \frac{1}{|S|} \sum_{s \in S} \frac{|\bigcup_{r \in R} (s \sqcap r)|}{|s|},
$$
$$
\text{where}\quad s \sqcap r =
\begin{cases}
  s \cap r & \text{if } r \text{ detects } s, \\
  \emptyset & \text{otherwise.}
\end{cases}
$$
Observe that neither precision nor recall account for the fact that plagiarism detectors
sometimes report overlapping or multiple detections for a single plagiarism case. This
is undesirable, and to address this deficit also a detector's granularity is quantified as
follows:
$$
\mathrm{gran}(S, R) = \frac{1}{|S_R|} \sum_{s \in S_R} |R_s|,
$$
where $S_R \subseteq S$ are cases detected by detections in $R$, and $R_s \subseteq R$ are detections of $s$;
i.e., $S_R = \{s \mid s \in S \text{ and } \exists r \in R : r \text{ detects } s\}$ and $R_s = \{r \mid r \in R \text{ and } r \text{ detects } s\}$.
Note further that the above three measures alone do not allow for a unique ranking
among detection approaches. Therefore, the measures are combined into a single overall
score as follows:
$$
\mathrm{plagdet}(S, R) = \frac{F_1}{\log_2(1 + \mathrm{gran}(S, R))},
$$
where $F_1$ is the equally weighted harmonic mean of precision and recall.
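A minimal re-implementation of these character level measures is sketched below; plagiarism cases and detections are encoded simply as sets of (document, character index) pairs covering both sides, and the detects test is simplified to non-empty overlap, so it should be read as an illustration of the definitions rather than as the official evaluation code.

# Sketch of the character level measures: cases and detections are sets of
# (document, character index) pairs covering both the suspicious and the
# source side. The "detects" test is simplified to non-empty overlap; the
# full definition also requires overlap on both sides separately.
from math import log2


def detects(r: set, s: set) -> bool:
    return bool(r & s)


def sqcap(s: set, r: set) -> set:
    """s ⊓ r: the shared characters if r detects s, the empty set otherwise."""
    return s & r if detects(r, s) else set()


def prec_rec_gran(cases, detections):
    prec = (sum(len(set().union(*(sqcap(s, r) for s in cases))) / len(r)
                for r in detections) / len(detections)) if detections else 0.0
    rec = (sum(len(set().union(*(sqcap(s, r) for r in detections))) / len(s)
               for s in cases) / len(cases)) if cases else 0.0
    detected = [s for s in cases if any(detects(r, s) for r in detections)]
    gran = (sum(sum(1 for r in detections if detects(r, s)) for s in detected)
            / len(detected)) if detected else 1.0  # assumed fallback if nothing is detected
    return prec, rec, gran


def plagdet(cases, detections):
    prec, rec, gran = prec_rec_gran(cases, detections)
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return f1 / log2(1 + gran)


case = {("susp1", i) for i in range(100, 200)} | {("src1", i) for i in range(0, 100)}
detection = {("susp1", i) for i in range(150, 200)} | {("src1", i) for i in range(0, 50)}
print(plagdet([case], [detection]))  # one case, half of it detected, granularity 1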
Case level performance measures Let S and R be defined as above. Further, let
S' = {s | s ∈ S and rec_char(s, R) > τ1 and ∃r ∈ R : r detects s and prec_char(S, r) > τ2}

denote the subset of all plagiarism cases S which have been detected with more than a threshold τ1 in terms of character recall rec_char and more than a threshold τ2 in terms of character precision prec_char. Likewise, let

R' = {r | r ∈ R and prec_char(S, r) > τ2 and ∃s ∈ S : r detects s and rec_char(s, R) > τ1}

denote the subset of all detections R which contribute to detecting plagiarism cases with more than a threshold τ1 in terms of character recall rec_char and more than a threshold τ2 in terms of character precision prec_char. Here, character recall and precision derive from the character level performance measures defined above:

$$\mathrm{prec}_{\mathrm{char}}(S,r) \;=\; \frac{\bigl|\bigcup_{s \in S}(s \sqcap r)\bigr|}{|r|}\,, \qquad
\mathrm{rec}_{\mathrm{char}}(s,R) \;=\; \frac{\bigl|\bigcup_{r \in R}(s \sqcap r)\bigr|}{|s|}\,.$$
Based on this notation, we compute case level precision and recall as follows:
$$\mathrm{prec}_{\mathrm{case}}(S,R) \;=\; \frac{|R'|}{|R|}\,, \qquad \mathrm{rec}_{\mathrm{case}}(S,R) \;=\; \frac{|S'|}{|S|}\,.$$
The thresholds τ1 and τ2 can be used to adjust the minimal detection accuracy with regard to passage boundaries. Threshold τ1 adjusts how accurately a plagiarism case has to be detected, whereas threshold τ2 adjusts how accurate a plagiarism detection has to be. Beyond the minimal detection accuracy imposed by these thresholds, however, a higher detection accuracy does not contribute to case level precision and recall. If τ1 → 1 and τ2 → 1, the minimally required detection accuracy approaches perfection, whereas if τ1 → 0 and τ2 → 0, it is sufficient to report an entire document as plagiarized to achieve perfect case level precision and recall. In between these extremes, it is an open question which threshold settings best capture the minimally required detection quality beyond which most users of a plagiarism detection system will not perceive further improvements. Hence, we choose τ1 = τ2 = 0.5 as a reasonable trade-off for the time being: for case level precision, a plagiarism detection r counts as a true positive if it contributes to detecting at least τ1 = 0.5, i.e., 50%, of a plagiarism case s, and if at least τ2 = 0.5, i.e., 50%, of r contributes to detecting plagiarism cases. Likewise, for case level recall, a plagiarism case s counts as detected if at least 50% of s is detected and if a plagiarism detection r contributes to detecting s while at least 50% of r contributes to detecting plagiarism cases in general.
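Continuing the sketch above (again our own illustration under the same simplifying assumptions, with τ1 = τ2 = 0.5 as defaults), case level precision and recall can be computed as follows:

def rec_char(s, detections):
    # Character recall of a single case s under all detections.
    return len(set().union(*(overlap(s, r) for r in detections))) / len(s)

def prec_char(cases, r):
    # Character precision of a single detection r under all cases.
    return len(set().union(*(overlap(s, r) for s in cases))) / len(r)

def case_precision_recall(cases, detections, tau1=0.5, tau2=0.5):
    # S': cases detected accurately enough; R': detections contributing to them.
    s_prime = [s for s in cases
               if rec_char(s, detections) > tau1
               and any(detects(r, s) and prec_char(cases, r) > tau2 for r in detections)]
    r_prime = [r for r in detections
               if prec_char(cases, r) > tau2
               and any(detects(r, s) and rec_char(s, detections) > tau1 for s in cases)]
    prec = len(r_prime) / len(detections) if detections else 0.0
    rec = len(s_prime) / len(cases) if cases else 0.0
    return prec, rec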
Document level performance measures Let S, R, and R' be defined as above. Further, let D_plg be the set of suspicious documents and D_src be the set of potential source documents. Then D_pairs = D_plg × D_src denotes the set of possible pairs of documents that a plagiarism detector may analyze, whereas

D_pairs|S = {(d_plg, d_src) | (d_plg, d_src) ∈ D_pairs and ∃s ∈ S : d_plg ∈ s and d_src ∈ s}

denotes the subset of D_pairs whose document pairs contain the plagiarism cases S, and

D_pairs|R = {(d_plg, d_src) | (d_plg, d_src) ∈ D_pairs and ∃r ∈ R : d_plg ∈ r and d_src ∈ r}

denotes the corresponding subset of D_pairs for which plagiarism was detected in R. Likewise, D_pairs|R' denotes the subset of D_pairs for which plagiarism was detected when requiring a minimal detection accuracy as per R' defined above. Based on this notation, we compute document level precision and recall as follows:

$$\mathrm{prec}_{\mathrm{doc}}(S,R) \;=\; \frac{|D_{\mathrm{pairs}|S} \cap D_{\mathrm{pairs}|R'}|}{|D_{\mathrm{pairs}|R}|}\,, \qquad
\mathrm{rec}_{\mathrm{doc}}(S,R) \;=\; \frac{|D_{\mathrm{pairs}|S} \cap D_{\mathrm{pairs}|R'}|}{|D_{\mathrm{pairs}|S}|}\,.$$
Again, the thresholds τ1 and τ2 allow for adjusting the minimally required detection accuracy for R', but for document level recall, it is sufficient that at least one plagiarism case is detected beyond that accuracy in order for the corresponding document pair (d_plg, d_src) to be counted as a true positive detection. If none of the plagiarism cases present in (d_plg, d_src) is detected beyond the minimal detection accuracy, the pair is counted as a false negative, whereas if detections are made for a pair of documents in which no plagiarism case is present, it is counted as a false positive.
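The document level measures then reduce to simple set operations over document pairs. In the following sketch (again our own illustration), each case and detection is assumed to additionally carry the identifiers of its suspicious and source document, and r_prime denotes the thresholded set R' from the case level sketch:

def document_precision_recall(cases, detections, r_prime):
    # Each element is assumed to be a triple (d_plg, d_src, character_set).
    pairs_s = {(d_plg, d_src) for d_plg, d_src, _ in cases}
    pairs_r = {(d_plg, d_src) for d_plg, d_src, _ in detections}
    pairs_r_prime = {(d_plg, d_src) for d_plg, d_src, _ in r_prime}
    prec = len(pairs_s & pairs_r_prime) / len(pairs_r) if pairs_r else 0.0
    rec = len(pairs_s & pairs_r_prime) / len(pairs_s) if pairs_s else 0.0
    return prec, rec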
Discussion Compared to the character level measures, the case level measures relax
the fine-grained measurement of plagiarism detection quality to allow for judging a de-
tection algorithm by its capability of “spotting” plagiarism cases reasonably well with
respect to the minimum detection accuracy fixed by the thresholds τ1 and τ2 . For exam-
ple, a user who is interested in maximizing case level performance may put emphasis
on the coverage of all plagiarism cases rather than the precise extraction of each in-
dividual plagiarized pair of passages. The document level measures further relax the
requirements to allow for judging a detection algorithm by its capability “to raise a
flag” for a given pair of documents, disregarding whether it finds all plagiarism cases
contained therein. For example, a user who is interested in maximizing these measures puts emphasis on being alerted to suspicious document pairs, which might lead to further, more detailed investigations. In this regard, the three levels of performance measurement complement each other. To rank plagiarism detectors with regard to their case level and their document level performance, we currently use the F_α measure. Since the best setting of α is also still unclear, we resort to α = 1.
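For reference, the F_α measure combines precision and recall as

$$F_\alpha \;=\; \frac{(1+\alpha^2)\cdot \mathrm{prec}\cdot \mathrm{rec}}{\alpha^2\cdot \mathrm{prec} + \mathrm{rec}}\,,$$

which, for α = 1, reduces to the equally weighted harmonic mean used above.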


5.3   Evaluation Results per Dataset
This section reports on the detection performances of 31 text alignment approaches that
have been submitted to the corresponding shared task at PAN 2012-2015, when run
against the eight datasets submitted to this year’s PAN shared task on text alignment
dataset construction. To cut a long story short, we distinguish three kinds of datasets
among the submitted ones: (1) datasets that yield typical detection performance results
with state-of-the-art text alignment approaches, (2) datasets that yield poor detection
performance results because state-of-the-art text alignment approaches are not prepared
for them, and (3) datasets that are entirely solved by at least one of the state-of-the-art
text alignment approaches.
Datasets with typical results The datasets submitted by Alvi et al. [4], Cheema et al.
[9], and Mohtaj et al. [32] yield typical detection performances among state-of-the-art
text alignment approaches; Tables 2, 3, and 4 show the results. In all cases, the top
plagdet detection performance is around 0.8, whereas F1 at case level is around 0.85–0.88, and F1 at document level around 0.86–0.90. However, the top-performing text alignment
approach differs: the approach of Glinos [16] performs best on the dataset of Alvi et al.
[4], whereas the approach of Oberreuter and Eiselt [35] performs best on the datasets
of Cheema et al. [9] and Mohtaj et al. [32]. The latter approach ranks among the top
text alignment approaches on all three datasets, including its preceding version from
2012 [34].
Table 2. Cross-year evaluation of text alignment software submissions from 2012 to 2015 with
respect to performance measures at character level, case level, and document level for the sub-
mitted dataset from Alvi et al. [4].
Software Submission               Character Level               Case Level         Document Level
Team               Year   plagdet prec      rec     gran   prec    rec       F1   prec   rec    F1

Glinos [16]        2014    0.80      0.85   0.76    1.01   0.84    0.87   0.85    0.86   0.87   0.87
Oberreuter [35]    2014    0.76      0.87   0.67    1.00   0.90    0.69   0.78    0.91   0.69   0.79
Oberreuter [34]    2012    0.74      0.90   0.63    1.00   0.89    0.67   0.76    0.91   0.67   0.77
Palkovskii [40]    2014    0.67      0.67   0.66    1.00   0.69    0.75   0.72    0.76   0.75   0.75
Sanchez-Perez [56] 2015    0.66      0.86   0.53    1.00   0.78    0.53   0.63    0.78   0.53   0.63
Sanchez-Perez [57] 2014    0.62      0.90   0.48    1.00   0.85    0.49   0.62    0.85   0.49   0.62
Kong [26]          2014    0.57      0.86   0.44    1.00   0.86    0.48   0.62    0.86   0.48   0.62
Kong [28]          2013    0.57      0.80   0.46    1.00   0.88    0.49   0.63    0.88   0.49   0.63
Palkovskii [38]    2012    0.56      0.85   0.43    1.00   0.80    0.48   0.60    0.80   0.48   0.60
Gross [19]         2014    0.56      0.98   0.40    1.00   0.92    0.47   0.62    0.92   0.47   0.62
Suchomel [66]      2013    0.45      0.46   0.44    1.00   0.33    0.49   0.40    0.33   0.49   0.40
Palkovskii [39]    2013    0.44      0.70   0.34    1.05   0.53    0.40   0.45    0.62   0.40   0.49
Suchomel [65]      2012    0.41      0.82   0.27    1.00   0.71    0.32   0.44    0.71   0.32   0.44
Nourian [33]       2013    0.39      0.91   0.25    1.00   0.75    0.24   0.36    0.75   0.24   0.36
Saremi [59]        2013    0.36      0.70   0.28    1.19   0.33    0.28   0.31    0.43   0.28   0.35
R. Torrejón [52]   2012    0.28      0.96   0.17    1.00   0.43    0.16   0.23    0.43   0.16   0.23
Kueppers [30]      2012    0.26      0.92   0.15    1.00   0.65    0.15   0.24    0.65   0.15   0.24
Alvi [3]           2014    0.24      1.00   0.13    1.00   0.93    0.17   0.28    0.93   0.17   0.28
Jayapal [24]       2013    0.10      0.47   0.09     2.0   0.00    0.00   0.00    0.00   0.00   0.00

Baseline           2015    0.03      1.00   0.02    1.00   0.00    0.00   0.00    0.00   0.00   0.00
Abnar [2]          2014    0.03      0.94   0.02    1.00   1.00    0.03   0.05    1.00   0.03   0.05
Gillam [13]        2013    0.00      0.00   0.00    1.00   0.00    0.00   0.00    0.00   0.00   0.00
Gillam [15]        2014    0.00      0.00   0.00    1.00   0.00    0.00   0.00    0.00   0.00   0.00
Gillam [14]        2012     –         –      –       –      –       –        –     –      –      –
Jayapal [23]       2012     –         –      –       –      –       –        –     –      –      –
Kong [29]          2012     –         –      –       –      –       –        –     –      –      –
R. Torrejón [53]   2013     –         –      –       –      –       –        –     –      –      –
R. Torrejón [54]   2014     –         –      –       –      –       –        –     –      –      –
Sánchez-Vega [58] 2012      –         –      –       –      –       –        –     –      –      –
Shrestha [61]      2013     –         –      –       –      –       –        –     –      –      –
Shrestha [60]      2014     –         –      –       –      –       –        –     –      –      –
Table 3. Cross-year evaluation of text alignment software submissions from 2012 to 2015 with
respect to performance measures at character level, case level, and document level for the sub-
mitted dataset from Cheema et al. [9].
Software Submission               Character Level               Case Level         Document Level
Team               Year   plagdet prec      rec     gran   prec    rec       F1   prec   rec    F1

Oberreuter [35]    2014    0.83      0.99   0.72    1.00   0.97    0.77   0.86    0.98   0.77   0.86
Palkovskii [40]    2014    0.75      0.96   0.61    1.00   0.96    0.69   0.80    0.96   0.69   0.80
Oberreuter [34]    2012    0.74      0.99   0.59    1.00   0.94    0.61   0.74    0.94   0.61   0.74
Glinos [16]        2014    0.72      0.92   0.59    1.00   0.92    0.62   0.74    0.95   0.62   0.75
Kong [28]          2013    0.61      0.87   0.46    1.00   0.97    0.54   0.70    0.97   0.54   0.70
Kong [26]          2014    0.59      0.93   0.43    1.00   0.85    0.47   0.61    0.85   0.47   0.61
Suchomel [66]      2013    0.59      0.96   0.42    1.00   0.90    0.50   0.64    0.90   0.50   0.64
Palkovskii [38]    2012    0.56      0.98   0.40    1.00   0.90    0.43   0.59    0.90   0.43   0.59
Suchomel [65]      2012    0.55      0.98   0.38    1.00   0.93    0.44   0.60    0.93   0.44   0.60
Palkovskii [39]    2013    0.48      1.00   0.33    1.06   0.66    0.39   0.49    0.70   0.39   0.50
Sanchez-Perez [56] 2015    0.48      0.89   0.33    1.00   0.78    0.35   0.49    0.78   0.35   0.49
Gross [19]         2014    0.45      1.00   0.30    1.02   0.84    0.34   0.48    0.86   0.34   0.49
Sanchez-Perez [57] 2014    0.45      0.89   0.30    1.00   0.81    0.33   0.47    0.81   0.33   0.47
Saremi [59]        2013    0.44      0.97   0.32    1.11   0.59    0.34   0.43    0.66   0.34   0.45
R. Torrejón [52]   2012    0.41      1.00   0.26    1.00   0.60    0.26   0.36    0.60   0.26   0.36
Alvi [3]           2014    0.32      1.00   0.20    1.05   0.56    0.18   0.28    0.61   0.18   0.28
Kueppers [30]      2012    0.30      0.88   0.18    1.00   0.56    0.18   0.27    0.56   0.18   0.28
Abnar [2]          2014    0.27      0.90   0.16    1.00   1.00    0.17   0.30    1.00   0.17   0.30
Nourian [33]       2013    0.17      0.92   0.09    1.00   0.87    0.10   0.18    0.87   0.10   0.18
Jayapal [24]       2013    0.13      0.97   0.12     2.2   0.01    0.07   0.01    0.01   0.07   0.02
Gillam [15]        2014    0.12      0.98   0.06    1.00   0.90    0.07   0.13    0.90   0.07   0.13

Baseline           2015    0.11      1.00   0.08    1.50   0.19    0.06   0.10    0.28   0.06   0.10
Gillam [14]        2012     –         –      –       –      –       –        –     –      –      –
Gillam [13]        2013     –         –      –       –      –       –        –     –      –      –
Jayapal [23]       2012     –         –      –       –      –       –        –     –      –      –
Kong [29]          2012     –         –      –       –      –       –        –     –      –      –
R. Torrejón [53]   2013     –         –      –       –      –       –        –     –      –      –
R. Torrejón [54]   2014     –         –      –       –      –       –        –     –      –      –
Sánchez-Vega [58] 2012      –         –      –       –      –       –        –     –      –      –
Shrestha [61]      2013     –         –      –       –      –       –        –     –      –      –
Shrestha [60]      2014     –         –      –       –      –       –        –     –      –      –
Table 4. Cross-year evaluation of text alignment software submissions from 2012 to 2015 with
respect to performance measures at character level, case level, and document level for the sub-
mitted dataset from Mohtaj et al. [32].
Software Submission               Character Level               Case Level         Document Level
Team               Year   plagdet prec      rec     gran   prec    rec       F1   prec   rec    F1

Oberreuter [35]    2014    0.80      0.83   0.78    1.00   0.90    0.85   0.88    0.95   0.85   0.90
Oberreuter [34]    2012    0.76      0.79   0.74    1.00   0.82    0.80   0.81    0.85   0.82   0.83
Sanchez-Perez [57] 2014    0.76      0.81   0.71    1.00   0.95    0.78   0.86    0.97   0.82   0.89
Sanchez-Perez [56] 2015    0.75      0.80   0.71    1.00   0.94    0.78   0.86    0.97   0.84   0.90
Gross [19]         2014    0.73      0.91   0.61    1.01   0.96    0.73   0.83    0.98   0.78   0.87
Palkovskii [40]    2014    0.73      0.86   0.62    1.00   0.95    0.70   0.81    0.97   0.78   0.86
Kong [26]          2014    0.66      0.76   0.59    1.00   0.86    0.68   0.76    0.92   0.75   0.83
Kong [28]          2013    0.66      0.76   0.59    1.00   0.86    0.68   0.76    0.92   0.75   0.83
R. Torrejón [52]   2012    0.65      0.59   0.74    1.00   0.49    0.80   0.61    0.49   0.77   0.60
Suchomel [66]      2013    0.65      0.73   0.59    1.00   0.66    0.67   0.66    0.70   0.70   0.70
Kueppers [30]      2012    0.63      0.82   0.51    1.00   0.85    0.60   0.71    0.91   0.71   0.80
Saremi [59]        2013    0.61      0.86   0.60    1.24   0.62    0.70   0.66    0.87   0.76   0.81
Palkovskii [38]    2012    0.59      0.76   0.48    1.00   0.77    0.54   0.63    0.81   0.62   0.71
Suchomel [65]      2012    0.59      0.82   0.46    1.00   0.72    0.52   0.60    0.76   0.56   0.65
Abnar [2]          2014    0.56      0.92   0.42    1.01   0.92    0.44   0.60    0.97   0.56   0.72
Palkovskii [39]    2013    0.56      0.88   0.43    1.04   0.62    0.47   0.53    0.78   0.57   0.67
Glinos [16]        2014    0.54      0.95   0.38    1.01   0.97    0.39   0.55    0.99   0.47   0.64
Alvi [3]           2014    0.46      0.94   0.31    1.03   0.81    0.36   0.50    0.89   0.48   0.62
Nourian [33]       2013    0.36      0.88   0.23    1.00   0.90    0.25   0.39    0.91   0.32   0.47
Jayapal [24]       2013    0.22      0.86   0.23    2.14   0.03    0.12   0.05    0.11   0.15   0.12

Baseline           2015    0.18      0.93   0.12    1.24   0.21    0.09   0.12    0.22   0.10   0.14
Gillam [15]        2014    0.06      0.84   0.03    1.00   0.94    0.03   0.07    0.97   0.05   0.09
Gillam [13]        2013    0.00      0.00   0.00    1.00   0.00    0.00   0.00    0.00   0.00   0.00
Gillam [14]        2012     –         –      –       –      –       –        –     –      –      –
Jayapal [23]       2012     –         –      –       –      –       –        –     –      –      –
Kong [29]          2012     –         –      –       –      –       –        –     –      –      –
R. Torrejón [53]   2013     –         –      –       –      –       –        –     –      –      –
R. Torrejón [54]   2014     –         –      –       –      –       –        –     –      –      –
Sánchez-Vega [58] 2012      –         –      –       –      –       –        –     –      –      –
Shrestha [61]      2013     –         –      –       –      –       –        –     –      –      –
Shrestha [60]      2014     –         –      –       –      –       –        –     –      –      –
For comparison, the winning text alignment approach of PAN 2014 from Sanchez-Perez et al. [57], as well as its 2015 successor [56], achieves mid-range performance on the dataset of Alvi et al. [4], low performance on that of Cheema et al. [9], and ranks second, behind the approaches of Oberreuter, on the dataset of Mohtaj et al. [32]. Some of these performance differences may be attributed to the fact that the approach of Sanchez-Perez et al. [57] has been optimized to work well on the previous year's PAN plagiarism corpus, whereas it has not yet been optimized against the submitted datasets, nor against all of them in combination.
Apparently, the obfuscation synthesis approaches applied during construction of the three aforementioned datasets are comparable in difficulty to the ones applied during construction of the PAN plagiarism corpora. The detection performances on the submitted datasets are not perfect, so further algorithmic improvements are required. These datasets complement the ones already in use; since they have been constructed independently, yet still allow the existing text alignment approaches to work well, they also verify that the previous datasets, which have been constructed exclusively by ourselves, are fit for purpose.
Datasets with perfect results For one of the submitted datasets, the text alignment approaches exhibit an odd performance characteristic: either they detect almost all plagiarism, or none at all. Table 5 shows the performances obtained on the dataset of Khoshnavataher et al. [25]. The plagdet performances of the eight top-performing text alignment approaches range from 0.89 to 0.98, the top-performing one being the approach of Glinos [16]. At case level and at document level, all of them achieve F1 scores above 0.9. Only four approaches achieve mid-range performances, including the Baseline, whereas the performances of all other approaches are negligible. The Baseline implements a basic text alignment approach using 4-grams for seeding and rule-based merging.
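To illustrate what such a baseline amounts to, the following Python sketch implements a minimal seed-and-merge scheme. It is our own illustration rather than the actual baseline code; the word-level 4-gram seeding and the 200-character merge gap are assumptions made for the example.

from collections import defaultdict

def word_ngrams(text, n=4):
    # Yield lower-cased word n-grams together with their character span.
    tokens, pos = [], 0
    for tok in text.split():
        start = text.index(tok, pos)
        tokens.append((tok.lower(), start, start + len(tok)))
        pos = start + len(tok)
    for i in range(len(tokens) - n + 1):
        gram = tuple(t[0] for t in tokens[i:i + n])
        yield gram, tokens[i][1], tokens[i + n - 1][2]

def seed(susp, src, n=4):
    # Seeds are pairs of character spans whose word n-grams match exactly.
    index = defaultdict(list)
    for gram, start, end in word_ngrams(src, n):
        index[gram].append((start, end))
    return sorted(((s_start, s_end), span)
                  for gram, s_start, s_end in word_ngrams(susp, n)
                  for span in index.get(gram, ()))

def merge(seeds, max_gap=200):
    # Rule-based merging: join consecutive seeds that lie close together on
    # both the suspicious and the source side.
    passages = []
    for (s_start, s_end), (r_start, r_end) in seeds:
        if passages and s_start - passages[-1][0][1] <= max_gap \
                    and r_start - passages[-1][1][1] <= max_gap:
            (ps, pe), (pr, pre) = passages[-1]
            passages[-1] = ((ps, max(pe, s_end)), (pr, max(pre, r_end)))
        else:
            passages.append(((s_start, s_end), (r_start, r_end)))
    return passages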
The dataset of Khoshnavataher et al. [25] comprises Persian documents only, so this performance characteristic demonstrates which of the approaches can cope with this language and which cannot. An explanation for the fact that not all approaches work may be that some operate at the lexical level, whereas others employ sophisticated linguistic processing pipelines that may not be adapted to processing Persian text. Moreover, the fact that those approaches which are capable of detecting plagiarism within Persian documents do so almost flawlessly hints that the obfuscation synthesis approaches applied by Khoshnavataher et al. [25] do not yield notable obfuscation.
From reviewing the notebooks of the respective approaches, however, it is not entirely clear whether they indeed work at the lexical level. Glinos [16] mentions some pre-processing at the word level, but applies the character-based Smith-Waterman algorithm to align text passages between a pair of documents. Palkovskii and Belov [38], who provide the second-best performing approach on the dataset of Khoshnavataher et al. [25], report employing a basic Euclidean distance-based clustering approach, which also hints that no linguistic pre-processing is applied. Interestingly, the best-performing approach of PAN 2014 from Sanchez-Perez et al. [57] does not work well, whereas this year's refined version by Sanchez-Perez et al. [56] is ranked third. This suggests that the authors added a fallback solution for situations where only lexical matching applies, e.g., based on their approach to predict on-the-fly what kind of
Table 5. Cross-year evaluation of text alignment software submissions from 2012 to 2015 with
respect to performance measures at character level, case level, and document level for the sub-
mitted dataset from Khoshnavataher et al. [25].
Software Submission               Character Level               Case Level         Document Level
Team               Year   plagdet prec      rec     gran   prec    rec       F1   prec   rec    F1

Glinos [16]        2014    0.98      0.97   0.99    1.00   0.98    0.99   0.99    0.98   0.99   0.99
Palkovskii [38]    2012    0.95      0.94   0.95    1.00   0.98    0.96   0.97    0.98   0.96   0.97
Sanchez-Perez [56] 2015    0.94      0.91   0.96    1.00   0.97    0.98   0.98    0.98   0.98   0.98
Suchomel [65]      2012    0.93      0.93   0.92    1.00   0.93    0.97   0.95    0.93   0.97   0.95
Palkovskii [40]    2014    0.92      0.88   0.95    1.00   0.94    0.97   0.96    0.94   0.97   0.96
Palkovskii [39]    2013    0.91      0.89   0.92    1.00   0.89    0.95   0.92    0.96   0.95   0.96
Suchomel [66]      2013    0.91      0.86   0.95    1.00   0.88    0.99   0.93    0.88   0.99   0.93
Alvi [3]           2014    0.89      0.95   0.90    1.05   0.91    0.95   0.93    0.99   0.95   0.97
Gross [19]         2014    0.80      0.76   0.98    1.09   0.72    1.00   0.84    0.78   1.00   0.88
Abnar [2]          2014    0.55      0.73   0.45    1.01   0.57    0.46   0.52    0.63   0.46   0.53
Oberreuter [35]    2014    0.40      0.25   1.00    1.00   0.13    1.00   0.23    0.15   1.00   0.25

Baseline           2015    0.37      0.97   0.50    2.41   0.14    0.37   0.20    0.34   0.37   0.35
Kong [26]          2014    0.07      0.04   0.80    1.00   0.00    0.80   0.00    0.00   0.80   0.00
Kong [28]          2013    0.07      0.04   0.80    1.00   0.00    0.80   0.00    0.00   0.80   0.00
Oberreuter [34]    2012    0.06      0.33   0.03    1.00   0.14    0.03   0.05    0.14   0.03   0.05
Sanchez-Perez [57] 2014    0.01      0.62   0.01    1.00   0.14    0.00   0.01    0.14   0.00   0.01
Saremi [59]        2013    0.00      0.09   0.00    1.00   0.00    0.00   0.00    0.00   0.00   0.00
R. Torrejón [52]   2012    0.00      1.00   0.00    1.00   0.00    0.00   0.00    0.00   0.00   0.00
Jayapal [24]       2013    0.00      1.00   0.00    1.00   0.00    0.00   0.00    0.00   0.00   0.00
Kueppers [30]      2012    0.00      0.00   0.00    1.00   0.00    0.00   0.00    0.00   0.00   0.00
Nourian [33]       2013    0.00      0.00   0.00    1.00   0.00    0.00   0.00    0.00   0.00   0.00
Gillam [14]        2012     –         –      –       –      –       –        –     –      –      –
Gillam [13]        2013     –         –      –       –      –       –        –     –      –      –
Gillam [15]        2014     –         –      –       –      –       –        –     –      –      –
Jayapal [23]       2012     –         –      –       –      –       –        –     –      –      –
Kong [29]          2012     –         –      –       –      –       –        –     –      –      –
R. Torrejón [53]   2013     –         –      –       –      –       –        –     –      –      –
R. Torrejón [54]   2014     –         –      –       –      –       –        –     –      –      –
Sánchez-Vega [58] 2012      –         –      –       –      –       –        –     –      –      –
Shrestha [61]      2013     –         –      –       –      –       –        –     –      –      –
Shrestha [60]      2014     –         –      –       –      –       –        –     –      –      –
obfuscation is at hand in order to adjust their detection approach accordingly. Since the dataset of Khoshnavataher et al. [25] apparently does not comprise noteworthy obfuscation, resulting in mostly verbatim overlap of text passages between a given pair of suspicious and source document, this may trigger the classification approach used by Sanchez-Perez et al. [56], which then applies a basic algorithm that handles such situations at the lexical level.
Another noteworthy issue about this dataset is that some of its reviewers note its high quality, which may hint at a reviewer bias toward plagiarism cases that are easier to detect, compared to ones that are difficult to identify even for a human. In this regard, implementing obfuscation synthesis approaches that are supposed to construct difficult plagiarism cases also makes the task of reviewing their results a lot more difficult, so that reviewers may tend to favor easy decisions over difficult ones.
Datasets with poor results On four of the submitted datasets from Kong et al. [27],
Asghari et al. [5], Palkovskii and Belov [36], and Hanif et al. [21], the text alignment
approaches perform poorly, detecting almost none of the plagiarism cases. The best
performances are obtained by the two versions of the approach of Oberreuter and Eiselt
[35] and the two versions of the approach of Suchomel et al. [66] on the dataset com-
prising Chinese plagiarism cases from Kong et al. [27]. However, the top plagdet score
achieved is still only 0.18. Most of the text alignment approaches detect almost none
of the plagiarism cases on this dataset, whereas all approaches fail on the remaining
datasets from Asghari et al. [5], Palkovskii and Belov [36], and Hanif et al. [21].
     This does not hint at any flaws in the datasets, but is testimony to the datasets’
difficulty. The datasets of Asghari et al. [5] and Hanif et al. [21] are the only ones
comprising cross-language plagiarism from Persian and Urdu to English, respectively.
Since no lexical similarity between these languages can be expected, apart from perhaps
a few named entities, and since apparently none of the text alignment approaches feature
any kind of translation module or cross-language similarity measure, they cannot cope
with these kinds of plagiarism cases.
Regarding the dataset submitted by Kong et al. [27], which comprises monolingual Chinese plagiarism cases, a subset of the text alignment approaches, corresponding to those that work on the dataset of Khoshnavataher et al. [25], seems to work to a small extent, probably for similar reasons as outlined above. However, for their dataset, Kong et al. [27] have optimized the obfuscation of the plagiarism cases until a Chinese plagiarism detector was unable to detect them, which makes detecting these plagiarism cases very difficult, since lexical similarities are not to be expected. Moreover, most of the existing approaches are probably not optimized to process Chinese text, where each character may carry far more semantics than a letter from the Latin alphabet. Therefore, shorter character sequences may already hint at a significant semantic similarity.
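As a simple illustration of this point (our own example; the n-gram size is a hypothetical choice), a lexical matcher can be made sensitive to such short sequences by comparing overlapping character n-grams with a small n:

def char_ngrams(text, n=2):
    # For Chinese text, even character bigrams can be meaningful matching units,
    # whereas matchers tuned to Latin-script text often rely on longer word n-grams.
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b):
    # Overlap of two n-gram sets as a rough lexical similarity score.
    return len(a & b) / len(a | b) if a | b else 0.0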
Regarding the dataset of Palkovskii and Belov [36], the obfuscation approach of substituting characters with look-alike Unicode characters seems to successfully confound all of the existing text alignment approaches. The dataset of Alvi et al. [4], where the performances have otherwise been typical, also contains plagiarism cases to which this kind of obfuscation has been applied.
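To see why this obfuscation defeats lexical matching so thoroughly, consider the following sketch (our own illustration; the actual substitution table used in the submitted dataset is not known to us), which replaces selected Latin letters with visually similar Cyrillic code points, so that obfuscated passages no longer share character n-grams with their source:

# A small, illustrative homoglyph table; not the mapping used in the dataset.
HOMOGLYPHS = {
    'a': '\u0430',  # Cyrillic small a
    'c': '\u0441',  # Cyrillic small es
    'e': '\u0435',  # Cyrillic small ie
    'o': '\u043e',  # Cyrillic small o
    'p': '\u0440',  # Cyrillic small er
    'x': '\u0445',  # Cyrillic small ha
}

def obfuscate(text):
    # The result looks near-identical to a reader but shares no code points for
    # the substituted letters, so exact character or word n-gram matching fails.
    return ''.join(HOMOGLYPHS.get(ch, ch) for ch in text)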
Table 6. Cross-year evaluation of text alignment software submissions from 2012 to 2015 with
respect to performance measures at character level, case level, and document level for the sub-
mitted dataset from Kong et al. [27].
Software Submission               Character Level               Case Level         Document Level
Team               Year   plagdet prec      rec     gran   prec    rec       F1   prec   rec    F1

Oberreuter [35]    2014    0.18      0.16   0.21    1.00   0.16    0.23   0.19    0.21   0.23   0.22
Suchomel [66]      2013    0.17      0.16   0.18    1.00   0.10    0.19   0.13    0.12   0.18   0.14
Suchomel [65]      2012    0.16      0.17   0.15    1.00   0.11    0.16   0.13    0.14   0.15   0.15
Oberreuter [34]    2012    0.12      0.14   0.11    1.00   0.11    0.11   0.11    0.14   0.11   0.12
Gross [19]         2014    0.11      0.13   0.10    1.08   0.13    0.11   0.12    0.23   0.12   0.16
Abnar [2]          2014    0.07      0.21   0.08    2.17   0.10    0.11   0.10    0.25   0.14   0.18
Alvi [3]           2014    0.07      0.15   0.08     2.0   0.04    0.09   0.06    0.17   0.14   0.15
Palkovskii [38]    2012    0.06      0.10   0.05    1.00   0.07    0.05   0.06    0.07   0.05   0.06
Glinos [16]        2014    0.06      0.07   0.05    1.00   0.04    0.04   0.04    0.11   0.07   0.08
Kong [26]          2014    0.05      0.03   0.56    1.01   0.00    0.57   0.01    0.01   0.56   0.03

Baseline           2015    0.05      0.18   0.06    2.68   0.00    0.05   0.01    0.02   0.08   0.03
Kong [28]          2013    0.05      0.03   0.56    1.01   0.00    0.56   0.01    0.01   0.55   0.03
Sanchez-Perez [57] 2014    0.04      0.16   0.02    1.00   0.20    0.03   0.05    0.25   0.01   0.03
R. Torrejón [52]   2012    0.03      0.04   0.02    1.00   0.06    0.03   0.04    0.06   0.04   0.05
Palkovskii [39]    2013    0.02      0.10   0.01    1.00   0.00    0.02   0.00    0.00   0.01   0.00
Saremi [59]        2013    0.02      0.04   0.02    2.62   0.00    0.02   0.00    0.04   0.03   0.03
Palkovskii [40]    2014    0.01      0.15   0.01    1.00   0.17    0.01   0.02    0.20   0.01   0.03
Sanchez-Perez [56] 2015    0.01      0.50   0.01    1.00   0.50    0.01   0.02    1.00   0.01   0.03
Jayapal [24]       2013    0.00      0.08   0.00    1.00   0.00    0.00   0.00    0.00   0.00   0.00
Gillam [15]        2014    0.00      0.00   0.00    1.00   0.00    0.00   0.00    0.00   0.00   0.00
Kueppers [30]      2012    0.00      0.00   0.00    1.00   0.00    0.00   0.00    0.00   0.00   0.00
Nourian [33]       2013    0.00      0.00   0.00    1.00   0.00    0.00   0.00    0.00   0.00   0.00
Gillam [14]        2012     –         –      –       –      –       –        –     –      –      –
Gillam [13]        2013     –         –      –       –      –       –        –     –      –      –
Jayapal [23]       2012     –         –      –       –      –       –        –     –      –      –
Kong [29]          2012     –         –      –       –      –       –        –     –      –      –
R. Torrejón [53]   2013     –         –      –       –      –       –        –     –      –      –
R. Torrejón [54]   2014     –         –      –       –      –       –        –     –      –      –
Sánchez-Vega [58] 2012      –         –      –       –      –       –        –     –      –      –
Shrestha [61]      2013     –         –      –       –      –       –        –     –      –      –
Shrestha [60]      2014     –         –      –       –      –       –        –     –      –      –
Table 7. Cross-year evaluation of text alignment software submissions from 2012 to 2015 with
respect to performance measures at character level, case level, and document level for the sub-
mitted dataset from Asghari et al. [5].
Software Submission               Character Level               Case Level         Document Level
Team               Year   plagdet prec      rec     gran   prec    rec       F1   prec   rec    F1

Oberreuter [34]    2012    0.03      0.18   0.02    1.00   0.08    0.02   0.03    0.08   0.02   0.03
Sanchez-Perez [57] 2014    0.00      0.71   0.00    1.00   0.28    0.00   0.00    0.28   0.00   0.00
Kong [28]          2013    0.00      0.01   0.00    1.00   0.00    0.00   0.00    0.00   0.00   0.00
Kong [26]          2014    0.00      0.01   0.00    1.00   0.00    0.00   0.00    0.00   0.00   0.00
Jayapal [24]       2013    0.00      0.89   0.00    1.12   0.00    0.00   0.00    0.00   0.00   0.00
Palkovskii [38]    2012    0.00      1.00   0.00    1.00   1.00    0.00   0.00    1.00   0.00   0.00
Suchomel [65]      2012    0.00      1.00   0.00    1.00   1.00    0.00   0.00    1.00   0.00   0.00
Suchomel [66]      2013    0.00      1.00   0.00    1.00   1.00    0.00   0.00    1.00   0.00   0.00
Alvi [3]           2014    0.00      1.00   0.00    1.00   1.00    0.00   0.00    1.00   0.00   0.00
Palkovskii [40]    2014    0.00      1.00   0.00    1.00   1.00    0.00   0.00    1.00   0.00   0.00
Sanchez-Perez [56] 2015    0.00      1.00   0.00    1.00   1.00    0.00   0.00    1.00   0.00   0.00
Abnar [2]          2014    0.00      1.00   0.00    1.00   1.00    0.00   0.00    1.00   0.00   0.00
Saremi [59]        2013    0.00      0.06   0.00    1.00   0.00    0.00   0.00    0.00   0.00   0.00
Palkovskii [39]    2013    0.00      1.00   0.00    1.00   0.00    0.00   0.00    0.00   0.00   0.00

Baseline           2015    0.00      1.00   0.00    1.00   0.00    0.00   0.00    0.00   0.00   0.00
Glinos [16]        2014    0.00      0.00   0.00    1.00   0.00    0.00   0.00    0.00   0.00   0.00
Kueppers [30]      2012    0.00      0.00   0.00    1.00   0.00    0.00   0.00    0.00   0.00   0.00
Gross [19]         2014    0.00      0.00   0.00    1.00   0.00    0.00   0.00    0.00   0.00   0.00
Nourian [33]       2013    0.00      0.00   0.00    1.00   0.00    0.00   0.00    0.00   0.00   0.00
Oberreuter [35]    2014    0.00      0.00   0.00    1.00   0.00    0.00   0.00    0.00   0.00   0.00
R. Torrejón [52]   2012    0.00      0.00   0.00    1.00   0.00    0.00   0.00    0.00   0.00   0.00
Gillam [14]        2012     –         –      –       –      –       –        –     –      –      –
Gillam [13]        2013     –         –      –       –      –       –        –     –      –      –
Gillam [15]        2014     –         –      –       –      –       –        –     –      –      –
Jayapal [23]       2012     –         –      –       –      –       –        –     –      –      –
Kong [29]          2012     –         –      –       –      –       –        –     –      –      –
R. Torrejón [53]   2013     –         –      –       –      –       –        –     –      –      –
R. Torrejón [54]   2014     –         –      –       –      –       –        –     –      –      –
Sánchez-Vega [58] 2012      –         –      –       –      –       –        –     –      –      –
Shrestha [61]      2013     –         –      –       –      –       –        –     –      –      –
Shrestha [60]      2014     –         –      –       –      –       –        –     –      –      –
Table 8. Cross-year evaluation of text alignment software submissions from 2012 to 2015 with
respect to performance measures at character level, case level, and document level for the sub-
mitted dataset from Palkovskii and Belov [36].
Software Submission               Character Level               Case Level         Document Level
Team               Year   plagdet prec      rec     gran   prec    rec       F1   prec   rec    F1

Oberreuter [35]    2014    0.03      0.99   0.01    1.00   0.97    0.01   0.03    0.96   0.02   0.03
Oberreuter [34]    2012    0.03      0.96   0.01    1.00   0.90    0.01   0.03    0.93   0.02   0.03
Glinos [16]        2014    0.01      0.95   0.01    1.00   0.41    0.00   0.01    0.40   0.00   0.01
Palkovskii [40]    2014    0.01      0.87   0.01    1.07   0.62    0.01   0.01    0.71   0.01   0.01
Gross [19]         2014    0.00      1.00   0.00    1.00   0.71    0.00   0.00    0.83   0.00   0.01
Palkovskii [39]    2013    0.00      1.00   0.00    1.00   0.44    0.00   0.00    0.43   0.00   0.00
Palkovskii [38]    2012    0.00      0.97   0.00    1.00   0.67    0.00   0.00    0.67   0.00   0.00
Saremi [59]        2013    0.00      0.24   0.00    1.11   0.00    0.00   0.00    0.00   0.00   0.00
Sanchez-Perez [57] 2014    0.00      0.97   0.00    1.00   0.67    0.00   0.00    0.67   0.00   0.00
R. Torrejón [54]   2014    0.00      0.99   0.00    1.00   0.67    0.00   0.00    0.67   0.00   0.00
Suchomel [66]      2013    0.00      1.00   0.00    1.00   0.50    0.00   0.00    0.50   0.00   0.00
Kueppers [30]      2012    0.00      0.60   0.00    1.33   0.50    0.00   0.00    0.25   0.00   0.00
R. Torrejón [53]   2013    0.00      0.99   0.00    1.00   1.00    0.00   0.00    1.00   0.00   0.00
Suchomel [65]      2012    0.00      1.00   0.00    1.00   0.67    0.00   0.00    0.67   0.00   0.00
Alvi [3]           2014    0.00      0.83   0.00    1.25   0.17    0.00   0.00    0.25   0.00   0.00

Baseline           2015    0.00      0.89   0.00    1.33   0.11    0.00   0.00    0.20   0.00   0.00
R. Torrejón [52]   2012    0.00      1.00   0.00    1.00   1.00    0.00   0.00    1.00   0.00   0.00
Sanchez-Perez [56] 2015    0.00      0.94   0.00    1.00   1.00    0.00   0.00    1.00   0.00   0.00
Jayapal [24]       2013    0.00      1.00   0.00    1.25   0.00    0.00   0.00    0.00   0.00   0.00
Gillam [13]        2013    0.00      0.00   0.00    1.00   0.00    0.00   0.00    0.00   0.00   0.00
Gillam [15]        2014    0.00      0.00   0.00    1.00   0.00    0.00   0.00    0.00   0.00   0.00
Kong [28]          2013    0.00      0.00   0.00    1.00   0.00    0.00   0.00    0.00   0.00   0.00
Kong [26]          2014    0.00      0.00   0.00    1.00   0.00    0.00   0.00    0.00   0.00   0.00
Nourian [33]       2013    0.00      0.00   0.00    1.00   0.00    0.00   0.00    0.00   0.00   0.00
Abnar [2]          2014    0.00      0.00   0.00    1.00   0.00    0.00   0.00    0.00   0.00   0.00
Gillam [14]        2012     –         –      –       –      –       –        –     –      –      –
Jayapal [23]       2012     –         –      –       –      –       –        –     –      –      –
Kong [29]          2012     –         –      –       –      –       –        –     –      –      –
Sánchez-Vega [58] 2012      –         –      –       –      –       –        –     –      –      –
Shrestha [61]      2013     –         –      –       –      –       –        –     –      –      –
Shrestha [60]      2014     –         –      –       –      –       –        –     –      –      –
Table 9. Cross-year evaluation of text alignment software submissions from 2012 to 2015 with
respect to performance measures at character level, case level, and document level for the sub-
mitted dataset from Hanif et al. [21].
Software Submission               Character Level               Case Level         Document Level
Team               Year   plagdet prec      rec     gran   prec    rec       F1   prec   rec    F1

Saremi [59]        2013    0.02      0.28   0.02    1.60   0.00    0.02   0.00    0.00   0.02   0.00
Sanchez-Perez [57] 2014    0.02      0.30   0.01    1.00   0.00    0.01   0.00    0.00   0.01   0.00
R. Torrejón [52]   2012    0.02      1.00   0.01    1.00   0.00    0.00   0.00    0.00   0.00   0.00
Oberreuter [35]    2014    0.01      1.00   0.00    1.00   0.00    0.00   0.00    0.00   0.00   0.00
Jayapal [24]       2013    0.00      1.00   0.00    1.00   0.00    0.00   0.00    0.00   0.00   0.00
Alvi [3]           2014    0.00      0.00   0.00    1.00   0.00    0.00   0.00    0.00   0.00   0.00
Glinos [16]        2014    0.00      0.00   0.00    1.00   0.00    0.00   0.00    0.00   0.00   0.00
Suchomel [65]      2012    0.00      0.00   0.00    1.00   0.00    0.00   0.00    0.00   0.00   0.00
Kong [28]          2013    0.00      0.00   0.00    1.00   0.00    0.00   0.00    0.00   0.00   0.00
Kong [26]          2014    0.00      0.00   0.00    1.00   0.00    0.00   0.00    0.00   0.00   0.00
Kueppers [30]      2012    0.00      0.00   0.00    1.00   0.00    0.00   0.00    0.00   0.00   0.00
Gross [19]         2014    0.00      0.00   0.00    1.00   0.00    0.00   0.00    0.00   0.00   0.00
Nourian [33]       2013    0.00      0.00   0.00    1.00   0.00    0.00   0.00    0.00   0.00   0.00
Oberreuter [34]    2012    0.00      0.00   0.00    1.00   0.00    0.00   0.00    0.00   0.00   0.00
Palkovskii [38]    2012    0.00      0.00   0.00    1.00   0.00    0.00   0.00    0.00   0.00   0.00
Palkovskii [39]    2013    0.00      0.00   0.00    1.00   0.00    0.00   0.00    0.00   0.00   0.00
Palkovskii [40]    2014    0.00      0.00   0.00    1.00   0.00    0.00   0.00    0.00   0.00   0.00

Baseline           2015    0.00      0.00   0.00    1.00   0.00    0.00   0.00    0.00   0.00   0.00
Sanchez-Perez [56] 2015    0.00      0.00   0.00    1.00   0.00    0.00   0.00    0.00   0.00   0.00
Suchomel [66]      2013    0.00      0.00   0.00    1.00   0.00    0.00   0.00    0.00   0.00   0.00
Abnar [2]          2014    0.00      0.00   0.00    1.00   0.00    0.00   0.00    0.00   0.00   0.00
Gillam [14]        2012     –         –      –       –      –       –        –     –      –      –
Gillam [13]        2013     –         –      –       –      –       –        –     –      –      –
Gillam [15]        2014     –         –      –       –      –       –        –     –      –      –
Jayapal [23]       2012     –         –      –       –      –       –        –     –      –      –
Kong [29]          2012     –         –      –       –      –       –        –     –      –      –
R. Torrejón [53]   2013     –         –      –       –      –       –        –     –      –      –
R. Torrejón [54]   2014     –         –      –       –      –       –        –     –      –      –
Sánchez-Vega [58] 2012      –         –      –       –      –       –        –     –      –      –
Shrestha [61]      2013     –         –      –       –      –       –        –     –      –      –
Shrestha [60]      2014     –         –      –       –      –       –        –     –      –      –
Discussion The evaluation of the existing text alignment approaches on the submitted datasets leaves us with more confidence in their quality. The datasets on which the approaches achieve results comparable to their performances on the PAN plagiarism corpora mutually verify that none of these datasets is too far off from the others. Since the datasets have been constructed independently, this suggests that the intuition of the authors who constructed them corresponds to ours, albeit we may have strongly influenced them with our prior work.
Regarding the datasets on which the existing text alignment approaches fail, we cannot deduce that the datasets are flawed. Rather, the characteristics of the datasets suggest that either tailored detection modules are required, or a better abstraction of the problem domain that allows for generic approaches. Finally, the dataset on which a number of text alignment approaches achieve almost perfect performance has been beaten, which means that it may be used to confirm basic capabilities of a text alignment approach, but it does not inform further research.

5.4   Analysis of Execution Errors

Not all of the existing text alignment approaches work without execution errors on the
submitted datasets. On some datasets, approaches fail or print error output for various
reasons. Table 10 gives an overview of common errors as reported by the output of fail-
ing approaches. Most of the errors observed hint at internal software errors, whereas
others are opaque because hardly any error messages are printed by the software. De-
spite printing error messages, some pieces of software still generate output that can be
evaluated, whereas others do not. Many of the errors observed occur on the datasets containing non-English documents, but also on datasets that make use of letters from non-Latin alphabets. Moreover, we have excluded a number of approaches for being too slow.
We have repeatedly tried to get the respective approaches to work; however, the errors prevailed. We have also considered inviting the original authors to fix the errors in their software, but refrained from doing so, since the performance results already obtained on other evaluation corpora would then be invalidated. Changing a piece of software after it has been submitted to a shared task, even if only a small execution error is fixed, may have side effects on the software's performance when it is re-executed on a given dataset compared to its original performance. Arguably, a submitted software should be kept as a fixture for the future, and participants should rather be invited to submit a new version of their software in which the errors are fixed and in which they are free to make other improvements as they see fit.


6     Conclusion and Outlook

In conclusion, we can say that data submissions to shared tasks are a viable option
of creating evaluation resources. The datasets that have been submitted to our shared
task have a variety that we would not have been able to create on our own. Despite the
important role that the organizers of a shared task play in keeping things together, and
Table 10. Overview of execution errors observed on the submitted datasets.
Software Submission
Team                Year    Alvi [4]   Asghari [5]   Cheema [9]   Hanif [21]   Khoshnavataher [25]   Kong [27]   Mohtaj [32]   Palkovskii [36]
Alvi [3]            2014       X      internal     X      internal    X         X       internal    X
Gillam [14]         2012    runtime   runtime runtime runtime runtime runtime runtime runtime
Gillam [13]         2013    internal  internal  internal internal internal internal internal internal
Gillam [15]         2014   no output no output no output no output no output no output no output no output
Glinos [16]         2014       X      internal     X      internal    X         X       internal internal
Jayapal [23]        2012    runtime   runtime runtime runtime runtime runtime runtime runtime
Kong [29]           2012    runtime   runtime runtime runtime runtime runtime runtime runtime
Kueppers [30]       2012       X      internal     X      internal internal internal       X        X
Oberreuter [35]     2014       X     memory        X         X        X         X          X        X
R. Torrejón [52]    2012       X      internal     X      internal internal internal       X        X
R. Torrejón [53]    2013    internal  internal  internal internal internal internal internal no output
R. Torrejón [54]    2014    internal  internal  internal internal internal internal internal        X
Sánchez-Vega [58]   2012    runtime   runtime runtime runtime runtime runtime runtime runtime
Shrestha [61]       2013    runtime   runtime runtime runtime runtime runtime runtime runtime
Shrestha [60]       2014    runtime   runtime runtime runtime runtime runtime runtime runtime


making sure that all moving parts fit, it is curious that data submissions are so uncommon.
     In our case, we have been able to validate and evaluate the submitted datasets not
only by manual peer-review, but also by executing all software submissions to previous
editions of our shared task on the submitted datasets. Perhaps asking for data submissions in a pilot task, where no software has been developed yet, is not so attractive. It remains to be seen whether data submissions can only be successful in connection with software submissions. The organizational procedures that we have outlined in this paper may serve as a first draft of a recipe for successful data submissions. However, in other contexts and other research fields, data submissions may not be as straightforward to implement. For example, if the data is sensitive, if it raises privacy concerns, or if it is difficult to obtain, data submissions may not be possible.
With software submissions and data submissions, we are only one step short of involving participants in all aspects of a shared task; what is still missing are the performance measures. Here, the organizers of a shared task often decide on a set of performance measures, whereas the community around a shared task may have different ideas as to how to measure performance. Involving participants in performance measure development and result analysis seems to be an obvious next step. Here, for example, it may well be possible to invite theoretical contributions to a shared task, since the development of performance measures for a given shared task is closely related to developing a theory around it.

Acknowledgements
We thank the participating teams of this shared task as well as those of previous editions
for their devoted work.

Bibliography
 1. Proceedings of the Workshop on Shared Tasks and Comparative Evaluation in Natural
    Language Generation, Position Papers, April 20-21, 2007. The Ohio State University,
    Arlington, Virginia, USA (2007)
 2. Abnar, S., Dehghani, M., Zamani, H., Shakery, A.: Expanded N-Grams for Semantic Text
    Alignment—Notebook for PAN at CLEF 2014. In: [7]
 3. Alvi, F., Stevenson, M., Clough, P.: Hashing and Merging Heuristics for Text Reuse
    Detection—Notebook for PAN at CLEF 2014. In: [7]
 4. Alvi, F., Stevenson, M., Clough, P.: The Short Stories Corpus—Notebook for PAN at CLEF
    2015. In: [8]
 5. Asghari, H., Khoshnavataher, K., Fatemi, O., Faili, H.: Developing Bilingual Plagiarism
    Detection Corpus Using Sentence Aligned Parallel Corpus—Notebook for PAN at CLEF
    2015. In: [8]
 6. Burrows, S., Potthast, M., Stein, B.: Paraphrase Acquisition via Crowdsourcing and
    Machine Learning. Transactions on Intelligent Systems and Technology (ACM TIST) 4(3),
    43:1–43:21 (Jun 2013), http://dl.acm.org/citation.cfm?id=2483676
 7. Cappellato, L., Ferro, N., Halvey, M., Kraaij, W. (eds.): CLEF 2014 Evaluation Labs and
    Workshop – Working Notes Papers, 15-18 September, Sheffield, UK. CEUR Workshop
    Proceedings, CEUR-WS.org (2014),
    http://www.clef-initiative.eu/publication/working-notes
 8. Cappellato, L., Ferro, N., Jones, G., San Juan, E. (eds.): CLEF 2015 Evaluation Labs and
    Workshop – Working Notes Papers, 8-11 September, Toulouse, France. CEUR Workshop
    Proceedings, CEUR-WS.org (2015),
    http://www.clef-initiative.eu/publication/working-notes
 9. Cheema, W., Najib, F., Ahmed, S., Bukhari, S., Sittar, A., Nawab, R.: A Corpus for
    Analyzing Text Reuse by People of Different Groups—Notebook for PAN at CLEF 2015.
    In: [8]
10. Forner, P., Karlgren, J., Womser-Hacker, C. (eds.): CLEF 2012 Evaluation Labs and
    Workshop – Working Notes Papers, 17-20 September, Rome, Italy (2012),
    http://www.clef-initiative.eu/publication/working-notes
11. Forner, P., Navigli, R., Tufis, D. (eds.): CLEF 2013 Evaluation Labs and Workshop –
    Working Notes Papers, 23-26 September, Valencia, Spain (2013),
    http://www.clef-initiative.eu/publication/working-notes
12. Franco-Salvador, M., Bensalem, I., Flores, E., Gupta, P., Rosso, P.: PAN 2015 Shared Task
    on Plagiarism Detection: Evaluation of Corpora for Text Alignment—Notebook for PAN at
    CLEF 2015. In: [8]
13. Gillam, L.: Guess Again and See if They Line Up: Surrey’s Runs at Plagiarism
    Detection—Notebook for PAN at CLEF 2013. In: [11]
14. Gillam, L., Newbold, N., Cooke, N.: Educated Guesses and Equality Judgements: Using
    Search Engines and Pairwise Match for External Plagiarism Detection—Notebook for PAN
    at CLEF 2012. In: [10], http://www.clef-initiative.eu/publication/working-notes
15. Gillam, L., Notley, S.: Evaluating Robustness for ’IPCRESS’: Surrey’s Text Alignment for
    Plagiarism Detection—Notebook for PAN at CLEF 2014. In: [7]
16. Glinos, D.: A Hybrid Architecture for Plagiarism Detection—Notebook for PAN at CLEF
    2014. In: [7]
17. Gollub, T., Potthast, M., Beyer, A., Busse, M., Rangel, F., Rosso, P., Stamatatos, E., Stein,
    B.: Recent Trends in Digital Text Forensics and its Evaluation. In: Forner, P., Müller, H.,
    Paredes, R., Rosso, P., Stein, B. (eds.) Information Access Evaluation meets Multilinguality,
    Multimodality, and Visualization. 4th International Conference of the CLEF Initiative
    (CLEF 13). pp. 282–302. Springer, Berlin Heidelberg New York (Sep 2013)
18. Gollub, T., Stein, B., Burrows, S.: Ousting Ivory Tower Research: Towards a Web
    Framework for Providing Experiments as a Service. In: Hersh, B., Callan, J., Maarek, Y.,
    Sanderson, M. (eds.) 35th International ACM Conference on Research and Development in
    Information Retrieval (SIGIR 12). pp. 1125–1126. ACM (Aug 2012)
19. Gross, P., Modaresi, P.: Plagiarism Alignment Detection by Merging Context
    Seeds—Notebook for PAN at CLEF 2014. In: [7]
20. Hagen, M., Potthast, M., Stein, B.: Source Retrieval for Plagiarism Detection from Large
    Web Corpora: Recent Approaches. In: Working Notes Papers of the CLEF 2015 Evaluation
    Labs. CEUR Workshop Proceedings, CLEF and CEUR-WS.org (Sep 2015),
    http://www.clef-initiative.eu/publication/working-notes
21. Hanif, I., Nawab, A., Arbab, A., Jamshed, H., Riaz, S., Munir, E.: Cross-Language
    Urdu-English (CLUE) Text Alignment Corpus—Notebook for PAN at CLEF 2015. In: [8]
22. Hopfgartner, F., Hanbury, A., Müller, H., Kando, N., Mercer, S., Kalpathy-Cramer, J.,
    Potthast, M., Gollub, T., Krithara, A., Lin, J., Balog, K., Eggel, I.: Report on the
    Evaluation-as-a-Service (EaaS) Expert Workshop. SIGIR Forum 49(1), 57–65 (Jun 2015),
    http://sigir.org/forum/issues/june-2015/
23. Jayapal, A.: Similarity Overlap Metric and Greedy String Tiling at PAN 2012: Plagiarism
    Detection—Notebook for PAN at CLEF 2012. In: [10],
    http://www.clef-initiative.eu/publication/working-notes
24. Jayapal, A., Goswami, B.: Submission to the 5th International Competition on Plagiarism
    Detection. http://www.uni-weimar.de/medien/webis/events/pan-13 (2013),
    http://www.clef-initiative.eu/publication/working-notes, From Nuance Communications,
    USA
25. Khoshnavataher, K., Zarrabi, V., Mohtaj, S., Asghari, H.: Developing Monolingual Persian
    Corpus for Extrinsic Plagiarism Detection Using Artificial Obfuscation—Notebook for
    PAN at CLEF 2015. In: [8]
26. Kong, L., Han, Y., Han, Z., Yu, H., Wang, Q., Zhang, T., Qi, H.: Source Retrieval Based on
    Learning to Rank and Text Alignment Based on Plagiarism Type Recognition for
    Plagiarism Detection—Notebook for PAN at CLEF 2014. In: [7]
27. Kong, L., Lu, Z., Han, Y., Qi, H., Han, Z., Wang, Q., Hao, Z., Zhang, J.: Source Retrieval
    and Text Alignment Corpus Construction for Plagiarism Detection—Notebook for PAN at
    CLEF 2015. In: [8]
28. Kong, L., Qi, H., Du, C., Wang, M., Han, Z.: Approaches for Source Retrieval and Text
    Alignment of Plagiarism Detection—Notebook for PAN at CLEF 2013. In: [11]
29. Kong, L., Qi, H., Wang, S., Du, C., Wang, S., Han, Y.: Approaches for Candidate Document
    Retrieval and Detailed Comparison of Plagiarism Detection—Notebook for PAN at CLEF
    2012. In: [10], http://www.clef-initiative.eu/publication/working-notes
30. Küppers, R., Conrad, S.: A Set-Based Approach to Plagiarism Detection—Notebook for
    PAN at CLEF 2012. In: [10], http://www.clef-initiative.eu/publication/working-notes
31. Meyer zu Eißen, S., Stein, B.: Intrinsic Plagiarism Detection. In: Lalmas, M., MacFarlane,
    A., Rüger, S., Tombros, A., Tsikrika, T., Yavlinsky, A. (eds.) Advances in Information
    Retrieval. 28th European Conference on IR Research (ECIR 06). Lecture Notes in
    Computer Science, vol. 3936 LNCS, pp. 565–569. Springer, Berlin Heidelberg New York
    (2006)
32. Mohtaj, S., Asghari, H., Zarrabi, V.: Developing Monolingual English Corpus for
    Plagiarism Detection using Human Annotated Paraphrase Corpus—Notebook for PAN at
    CLEF 2015. In: [8]
33. Nourian, A.: Submission to the 5th International Competition on Plagiarism Detection.
    http://www.uni-weimar.de/medien/webis/events/pan-13 (2013),
    http://www.clef-initiative.eu/publication/working-notes, From the Iran University of
    Science and Technology
34. Oberreuter, G., Carrillo-Cisneros, D., Scherson, I., Velásquez, J.: Submission to the 4th
    International Competition on Plagiarism Detection.
    http://www.uni-weimar.de/medien/webis/events/pan-12 (2012),
    http://www.clef-initiative.eu/publication/working-notes, From the University of Chile,
    Chile, and the University of California, USA
35. Oberreuter, G., Eiselt, A.: Submission to the 6th International Competition on Plagiarism
    Detection. http://www.uni-weimar.de/medien/webis/events/pan-14 (2014),
    http://www.clef-initiative.eu/publication/working-notes, From Innovand.io, Chile
36. Palkovskii, Y., Belov, A.: Submission to the 7th International Competition on Plagiarism
    Detection. http://www.uni-weimar.de/medien/webis/events/pan-15 (2015),
    http://www.clef-initiative.eu/publication/working-notes, From the Zhytomyr State
    University and SkyLine LLC
37. Palkovskii, Y.: “Counter Plagiarism Detection Software” and “Counter Counter Plagiarism
    Detection” Methods. In: Stein, B., Rosso, P., Stamatatos, E., Koppel, M., Agirre, E. (eds.)
    SEPLN 09 Workshop on Uncovering Plagiarism, Authorship, and Social Software
    Misuse (PAN 09). pp. 67–68. Universidad Politécnica de Valencia and CEUR-WS.org (Sep
    2009), http://ceur-ws.org/Vol-502
38. Palkovskii, Y., Belov, A.: Applying Specific Clusterization and Fingerprint Density
    Distribution with Genetic Algorithm Overall Tuning in External Plagiarism
    Detection—Notebook for PAN at CLEF 2012. In: [10],
    http://www.clef-initiative.eu/publication/working-notes
39. Palkovskii, Y., Belov, A.: Using Hybrid Similarity Methods for Plagiarism
    Detection—Notebook for PAN at CLEF 2013. In: [11]
40. Palkovskii, Y., Belov, A.: Developing High-Resolution Universal Multi-Type N-Gram
    Plagiarism Detector—Notebook for PAN at CLEF 2014. In: [7]
41. Potthast, M., Barrón-Cedeño, A., Eiselt, A., Stein, B., Rosso, P.: Overview of the 2nd
    International Competition on Plagiarism Detection. In: Braschler, M., Harman, D., Pianta,
    E. (eds.) Working Notes Papers of the CLEF 2010 Evaluation Labs (Sep 2010),
    http://www.clef-initiative.eu/publication/working-notes
42. Potthast, M., Barrón-Cedeño, A., Stein, B., Rosso, P.: Cross-Language Plagiarism
    Detection. Language Resources and Evaluation (LRE) 45(1), 45–62 (Mar 2011)
43. Potthast, M., Eiselt, A., Barrón-Cedeño, A., Stein, B., Rosso, P.: Overview of the 3rd
    International Competition on Plagiarism Detection. In: Petras, V., Forner, P., Clough, P.
    (eds.) Working Notes Papers of the CLEF 2011 Evaluation Labs (Sep 2011),
    http://www.clef-initiative.eu/publication/working-notes
44. Potthast, M., Gollub, T., Hagen, M., Graßegger, J., Kiesel, J., Michel, M., Oberländer, A.,
    Tippmann, M., Barrón-Cedeño, A., Gupta, P., Rosso, P., Stein, B.: Overview of the 4th
    International Competition on Plagiarism Detection. In: Forner, P., Karlgren, J.,
    Womser-Hacker, C. (eds.) Working Notes Papers of the CLEF 2012 Evaluation Labs (Sep
    2012), http://www.clef-initiative.eu/publication/working-notes
45. Potthast, M., Gollub, T., Hagen, M., Tippmann, M., Kiesel, J., Rosso, P., Stamatatos, E.,
    Stein, B.: Overview of the 5th International Competition on Plagiarism Detection. In:
    Forner, P., Navigli, R., Tufis, D. (eds.) Working Notes Papers of the CLEF 2013 Evaluation
    Labs (Sep 2013), http://www.clef-initiative.eu/publication/working-notes
46. Potthast, M., Gollub, T., Rangel, F., Rosso, P., Stamatatos, E., Stein, B.: Improving the
    Reproducibility of PAN’s Shared Tasks: Plagiarism Detection, Author Identification, and
    Author Profiling. In: Kanoulas, E., Lupu, M., Clough, P., Sanderson, M., Hall, M.,
    Hanbury, A., Toms, E. (eds.) Information Access Evaluation meets Multilinguality,
    Multimodality, and Visualization. 5th International Conference of the CLEF Initiative
    (CLEF 14). pp. 268–299. Springer, Berlin Heidelberg New York (Sep 2014)
47. Potthast, M., Hagen, M., Beyer, A., Busse, M., Tippmann, M., Rosso, P., Stein, B.:
    Overview of the 6th International Competition on Plagiarism Detection. In: Cappellato, L.,
    Ferro, N., Halvey, M., Kraaij, W. (eds.) Working Notes Papers of the CLEF 2014
    Evaluation Labs. CEUR Workshop Proceedings, CLEF and CEUR-WS.org (Sep 2014),
    http://www.clef-initiative.eu/publication/working-notes
48. Potthast, M., Hagen, M., Stein, B., Graßegger, J., Michel, M., Tippmann, M., Welsch, C.:
    ChatNoir: A Search Engine for the ClueWeb09 Corpus. In: Hersh, B., Callan, J., Maarek,
    Y., Sanderson, M. (eds.) 35th International ACM SIGIR Conference on Research and Development
    in Information Retrieval (SIGIR 12). p. 1004. ACM (Aug 2012)
49. Potthast, M., Hagen, M., Völske, M., Stein, B.: Crowdsourcing Interaction Logs to
    Understand Text Reuse from the Web. In: Fung, P., Poesio, M. (eds.) Proceedings of the
    51st Annual Meeting of the Association for Computational Linguistics (ACL 13). pp.
    1212–1221. Association for Computational Linguistics (Aug 2013),
    http://www.aclweb.org/anthology/P13-1119
50. Potthast, M., Stein, B., Barrón-Cedeño, A., Rosso, P.: An Evaluation Framework for
    Plagiarism Detection. In: Huang, C.R., Jurafsky, D. (eds.) 23rd International Conference on
    Computational Linguistics (COLING 10). pp. 997–1005. Association for Computational
    Linguistics, Stroudsburg, Pennsylvania (Aug 2010)
51. Potthast, M., Stein, B., Eiselt, A., Barrón-Cedeño, A., Rosso, P.: Overview of the 1st
    International Competition on Plagiarism Detection. In: Stein, B., Rosso, P., Stamatatos, E.,
    Koppel, M., Agirre, E. (eds.) SEPLN 09 Workshop on Uncovering Plagiarism, Authorship,
    and Social Software Misuse (PAN 09). pp. 1–9. CEUR-WS.org (Sep 2009),
    http://ceur-ws.org/Vol-502
52. Rodríguez Torrejón, D., Martín Ramos, J.: Detailed Comparison Module In CoReMo 1.9
    Plagiarism Detector—Notebook for PAN at CLEF 2012. In: [10],
    http://www.clef-initiative.eu/publication/working-notes
53. Rodríguez Torrejón, D., Martín Ramos, J.: Text Alignment Module in CoReMo 2.1
    Plagiarism Detector—Notebook for PAN at CLEF 2013. In: [11]
54. Rodríguez Torrejón, D., Martín Ramos, J.: CoReMo 2.3 Plagiarism Detector Text
    Alignment Module—Notebook for PAN at CLEF 2014. In: [7]
55. Rosvall, M., Bergstrom, C.: Maps of Random Walks on Complex Networks Reveal
    Community Structure. Proceedings of the National Academy of Sciences 105(4),
    1118–1123 (2008)
56. Sanchez-Perez, M., Gelbukh, A., Sidorov, G.: Dynamically Adjustable Approach through
    Obfuscation Type Recognition—Notebook for PAN at CLEF 2015. In: [8]
57. Sanchez-Perez, M., Sidorov, G., Gelbukh, A.: A Winning Approach to Text Alignment for
    Text Reuse Detection at PAN 2014—Notebook for PAN at CLEF 2014. In: [7]
58. Sánchez-Vega, F., Montes-y-Gómez, M., Villaseñor-Pineda, L.: Optimized Fuzzy Text Alignment
    for Plagiarism Detection—Notebook for PAN at CLEF 2012. In: [10],
    http://www.clef-initiative.eu/publication/working-notes
59. Saremi, M., Yaghmaee, F.: Submission to the 5th International Competition on Plagiarism
    Detection. http://www.uni-weimar.de/medien/webis/events/pan-13 (2013),
    http://www.clef-initiative.eu/publication/working-notes, From Semnan University, Iran
60. Shrestha, P., Maharjan, S., Solorio, T.: Machine Translation Evaluation Metric for Text
    Alignment—Notebook for PAN at CLEF 2014. In: [7]
61. Shrestha, P., Solorio, T.: Using a Variety of n-Grams for the Detection of Different Kinds of
    Plagiarism—Notebook for PAN at CLEF 2013. In: [11]
62. Stamatatos, E., Potthast, M., Rangel, F., Rosso, P., Stein, B.: Overview of the PAN/CLEF
    2015 Evaluation Lab. In: Information Access Evaluation meets Multilinguality,
    Multimodality, and Visualization. 6th International Conference of the CLEF Initiative
    (CLEF 15). Springer, Berlin Heidelberg New York (Sep 2015)
63. Stein, B., Lipka, N., Prettenhofer, P.: Intrinsic Plagiarism Analysis. Language Resources
    and Evaluation (LRE) 45(1), 63–82 (Mar 2011)
64. Stein, B., Meyer zu Eißen, S.: Near Similarity Search and Plagiarism Analysis. In:
    Spiliopoulou, M., Kruse, R., Borgelt, C., Nürnberger, A., Gaul, W. (eds.) From Data and
    Information Analysis to Knowledge Engineering. Selected papers from the 29th Annual
    Conference of the German Classification Society (GFKL 05). pp. 430–437. Studies in
    Classification, Data Analysis, and Knowledge Organization, Springer, Berlin Heidelberg
    New York (2006)
65. Suchomel, Š., Kasprzak, J., Brandejs, M.: Three Way Search Engine Queries with
    Multi-feature Document Comparison for Plagiarism Detection—Notebook for PAN at
    CLEF 2012. In: [10], http://www.clef-initiative.eu/publication/working-notes
66. Suchomel, Š., Kasprzak, J., Brandejs, M.: Diverse Queries and Feature Type Selection
    for Plagiarism Discovery—Notebook for PAN at CLEF 2013. In: [11]
67. Zarrabi, V., Rafiei, J., Khoshnava, K., Asghari, H., Mohtaj, S.: Evaluation of Text Reuse
    Corpora for Text Alignment Task of Plagiarism Detection—Notebook for PAN at CLEF
    2015. In: [8]