<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Approaches for Intrinsic and External Plagiarism Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gabriel Oberreuter</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gaston L'Huillier</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sebastián A. Ríos</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Juan D. Velásquez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Industrial Engineering, University of Chile</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2011</year>
      </pub-date>
      <abstract>
        <p>Plagiarism detection has been treated as a classification problem that can be approached with intrinsic strategies, which use only information from the document itself, and external strategies, which compare a suspicious document against different sources. In this work, both intrinsic and external approaches for plagiarism detection are presented. The main contribution for intrinsic plagiarism detection is an outlier detection approach for detecting changes in the author's style. The main contribution for external plagiarism detection is a search space reduction technique that lowers the complexity of the detection task. Results show that our approach is highly competitive with respect to the leading research teams in plagiarism detection.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Plagiarism in academia is rising, and multiple authors have worked to describe this
phenomenon [
        <xref ref-type="bibr" rid="ref5 ref8">5,8</xref>
        ]. As noted by Hunt [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], “Internet plagiarism” is sometimes described
as a consequence of the “Information Technology revolution”, and it has proven to be
a big problem in academia. According to Park [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], plagiarism has been analyzed from
various perspectives and is considered a problem that is growing over time. To tackle this
problem, the most common approach so far is to detect plagiarism using automated
algorithms based on rules and string matching.
      </p>
      <p>
        Two main strategies for plagiarism detection have been considered by researchers
[
        <xref ref-type="bibr" rid="ref4 ref9">9,4</xref>
        ]: intrinsic and external plagiarism detection. Intrinsic plagiarism detection aims at
discovering plagiarism by examining only the input document, deciding whether parts
of the input document were not written by the same author. External plagiarism detection is
the approach where suspicious documents are compared against a set of possible
references. From exact document copies to paraphrasing, different levels of plagiarism
techniques can be used in several contexts, according to Meyer zu Eissen [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>The main contribution of this work is the use of outlier detection techniques on
text-based data to enhance two plagiarism detection strategies: one for intrinsic
plagiarism detection, using deviation parameters with respect to the writing style of a given
document, and another to reduce the search space for external plagiarism detection,
based on the generation of segments of n-grams for an approximate plagiarism decision
in which unrelated documents are discarded efficiently.</p>
      <p>This paper is structured as follows: First, in Section 2, a short summary on
plagiarism detection is introduced. In Section 3 the proposed external plagiarism detection
method is described. Afterwards, in Section 4, the proposed intrinsic plagiarism
detection method is described. In Section 5 results are presented. Finally, in Section 6
conclusions are discussed.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        According to Schleimer et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], copy prevention and detection methods can be
combined to reduce plagiarism. While copy detection methods can only minimize it,
prevention methods can decrease it or even fully eliminate it. However, prevention
methods need the whole society to take part, so putting them into practice is non-trivial. Copy
detection methods, on the other hand, are easier to implement and operate at different
levels, from simple manual comparison to complex automatic algorithms [
        <xref ref-type="bibr" rid="ref10 ref9">9,10</xref>
        ]. A short
discussion of plagiarism detection strategies follows.
      </p>
      <sec id="sec-2-1">
        <title>2.1 Intrinsic Plagiarism Detection</title>
        <p>
          When texts are compared against a reference set of possible sources, the difficulty lies in
choosing the right set of documents to compare against. Now more than ever,
with the possibilities that the Internet offers plagiarists, this task becomes harder
to accomplish. As an alternative, the writing style can be analyzed within the document itself and
examined for incongruities. The complexity and style of each text can be
analyzed based on parameters such as text statistics, syntactic features,
part-of-speech features, closed-class word sets, and structural features, as stated by Meyer zu
Eissen [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. The main idea is to define a criterion to determine if the style has changed
enough to indicate plagiarism.
        </p>
        <p>
          Stamatatos [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] presented a method for intrinsic plagiarism detection. As described
by its author, this approach attempts to quantify the style variation within a document
using character n-gram profiles and a style change function based on an appropriate
dissimilarity measure originally proposed for author identification. Style profiles are
first constructed using a sliding window, with character n-grams capturing information
about the writer’s style. The method then analyzes changes in the profiles to determine whether
a change is significant enough to indicate another author’s style.
        </p>
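        <p>The sliding-window profile comparison described above can be sketched as follows. This is a minimal illustration, not Stamatatos’ implementation: the function names, the window and step sizes, and the exact form of the dissimilarity (a normalized squared relative difference of n-gram frequencies) are assumptions made for the example.</p>
        <preformat><![CDATA[
```python
from collections import Counter


def char_ngram_profile(text, n=3):
    """Relative frequencies of the character n-grams found in text."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not grams:
        return {}
    counts = Counter(grams)
    return {g: c / len(grams) for g, c in counts.items()}


def dissimilarity(window_profile, doc_profile):
    """Assumed measure in [0, 1]; 0 means identical n-gram frequencies.

    Only the n-grams present in the window are compared.
    """
    if not window_profile:
        return 0.0
    total = 0.0
    for gram, f_win in window_profile.items():
        f_doc = doc_profile.get(gram, 0.0)
        total += ((f_win - f_doc) / (f_win + f_doc)) ** 2
    return total / len(window_profile)


def style_change(text, window=1000, step=200, n=3):
    """(start offset, dissimilarity to the whole document) per window."""
    doc_profile = char_ngram_profile(text, n)
    last_start = max(len(text) - window, 0)
    values = []
    for start in range(0, last_start + 1, step):
        win_profile = char_ngram_profile(text[start:start + window], n)
        values.append((start, dissimilarity(win_profile, doc_profile)))
    return values
```
]]></preformat>
        <p>Windows whose value stands out from the rest of the document would be the candidates for a style change; how large a change counts as significant is left open in this sketch.</p>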
        <p>
          Other approaches have been proposed, such as the one presented by Seaward &amp;
Matwin [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. They introduced Kolmogorov complexity measures as a way of extracting
structural information from texts for intrinsic plagiarism detection. They experimented
with complexity features based on the Lempel-Ziv compression algorithm for detecting
style shifts within a single document, thus revealing possibly plagiarized passages.
        </p>
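        <p>As a rough illustration of a compression-based complexity feature, the sketch below computes the compression ratio of each window of a document. It is not the feature set of Seaward &amp; Matwin: using Python's zlib module (whose DEFLATE format combines LZ77, a Lempel-Ziv variant, with Huffman coding) as a stand-in, as well as the function names and window sizes, are assumptions of this example.</p>
        <preformat><![CDATA[
```python
import zlib


def compression_ratio(text):
    """Compressed size divided by raw size; lower means more redundant text."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(zlib.compress(raw, 9)) / len(raw)


def complexity_profile(text, window=2000, step=1000):
    """(start offset, compression ratio) for each window of characters."""
    last_start = max(len(text) - window, 0)
    return [(start, compression_ratio(text[start:start + window]))
            for start in range(0, last_start + 1, step)]
```
]]></preformat>
        <p>A window whose ratio differs markedly from its neighbours would hint at a shift in structure, which is the kind of signal such complexity features are meant to expose.</p>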
      </sec>
      <sec id="sec-2-2">
        <title>2.2 External Plagiarism Detection</title>
        <p>
          In terms of external plagiarism detection algorithms, the use of n-grams has been shown
to give some flexibility to the detection task, as reworded text fragments can still be
detected [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Other approaches treat plagiarism detection as a
traditional classification problem from the machine learning community [
          <xref ref-type="bibr" rid="ref1 ref3">1,3</xref>
          ]. Bao et
al. [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] proposed a Semantic Sequence Kernel (SSK) and used it in a
traditional Support Vector Machine (SVM) formulation based on the Structural Risk
Minimization (SRM) principle from statistical learning theory [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], where the general
objective is to find the optimal classification hyperplane for the binary classification
problem (plagiarized, not plagiarized).
        </p>
        <p>
          Kasprzak &amp; Brandejs [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] introduced their model for automatic external plagiarism
detection. It consists of two main phases: in the first, an index of the documents is built,
and in the second, the similarities are computed. This approach uses word n-grams,
with n ranging from 4 to 6, and takes into account the number of matches of those
n-grams between the suspicious documents and the source documents for computing
the detections. With this algorithm the authors won first place at the PAN@2010
competition [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Proposed Method for External Plagiarism Detection</title>
      <p>
        The proposed algorithm consists of two phases: first, it executes a plagiarism search
space reduction method, and then it executes an exhaustive search to find plagiarized
passages. The search space reduction method aims to quickly identify those pairs of
documents that potentially have some text in common, possibly because one of them was
plagiarized from the other. For this, the method removes
stopwords and considers word 4-grams. If two documents have at least two word 4-gram
coincidences close enough to be in the same paragraph, the pair is passed to
the next phase. Otherwise the pair is discarded. For more details, please refer to [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
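      <p>The filtering rule described above can be sketched as follows. This is only an illustration under stated assumptions: the tiny stopword list, the function names, and the max_gap parameter (used here as a stand-in for "close enough to be in the same paragraph") are hypothetical, and the actual parameters of the system are not given in this report.</p>
      <preformat><![CDATA[
```python
import re

# Tiny illustrative stopword list; a real system would use a full one.
STOPWORDS = frozenset({"a", "an", "the", "of", "and", "to", "in", "is",
                       "that", "it", "for", "on", "with", "as"})


def tokens(text):
    """Lowercased words with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower())
            if w not in STOPWORDS]


def ngram_positions(words, n=4):
    """Map each word n-gram to the list of positions where it starts."""
    positions = {}
    for i in range(len(words) - n + 1):
        positions.setdefault(tuple(words[i:i + n]), []).append(i)
    return positions


def is_candidate_pair(suspicious, source, n=4, max_gap=50):
    """True if two distinct shared n-grams lie close together in both texts."""
    susp = ngram_positions(tokens(suspicious), n)
    src = ngram_positions(tokens(source), n)
    # (position in suspicious, position in source) of every coincidence
    hits = sorted((i, j)
                  for gram in susp.keys() & src.keys()
                  for i in susp[gram]
                  for j in src[gram])
    for k, (i1, j1) in enumerate(hits):
        for i2, j2 in hits[k + 1:]:
            if i2 - i1 > max_gap:
                break  # hits are sorted by position in the suspicious text
            if i2 != i1 and j2 != j1 and abs(j2 - j1) <= max_gap:
                return True
    return False
```
]]></preformat>
      <p>Pairs for which is_candidate_pair returns False would be discarded, so that the more expensive exhaustive search only runs on the remaining pairs.</p>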
      <p>
        For the exhaustive search, word tri-grams are used (whereas both word
bi-grams and word tri-grams were used in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]), and stopwords are not removed. Figure 1 shows an
example of the algorithm: two documents are being compared, and dots
represent coincidences of the tokens used to characterize the documents.
      </p>
      <p>
        The system does not consider plagiarism detection between different languages.
The overall mechanism for finding the plagiarized passages is described in Oberreuter
et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. More details and other parameters of the algorithm are not revealed due to
copyright.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Proposed Method for Intrinsic Plagiarism Detection</title>
      <p>
        For intrinsic plagiarism detection, we first considered ideas that other authors had
investigated. Different features can be considered to characterize the writing style of an
author. Stein et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] tested multiple writing style characteristics
in order to detect plagiarism. Likewise, Stamatatos [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] experimented with
character tri-grams in combination with “n-gram profiles” for the same purpose. For this, it
is fundamental to choose carefully one or more of the language resources an author
uses in their writing, so that it can be differentiated from that of others.
      </p>
      <p>In the following, some of the core ideas developed in this research are presented:</p>
      <p>– To be able to distinguish different authors within the same document, one must
characterize the writing style present in the text.</p>
      <p>– The use of “n-gram profiles” compares segments of the document against the whole
document. This approach is based on the assumption that the document has a
main author, who wrote most, if not all, of the text. Therefore, it is logical that
comparing the style of a particular segment with the style of the whole document
could reveal important variations, meaning that other authors
are involved.</p>
      <p>– Based on reading and reflection, one characteristic that proved to be of
interest is the author’s use of words. Different authors tend to use different words
to express their ideas, whether or not they write on the same topic.</p>
      <p>These ideas lead to the following intuition for the development of the algorithm: if
some of the words used in the document are author-specific, one can expect those
words to be concentrated in the paragraphs (or, more generally, in the segments) that
the corresponding author wrote.</p>
      <sec id="sec-4-1">
        <title>4.1 The method</title>
        <p>First, the document is preprocessed by removing numbers and all other characters that
do not belong to the a–z group. All characters are converted to lowercase. Second, the
method uses word uni-grams and considers all non-numerical words; stopwords are
not removed. Next, a frequency-based algorithm to test the self-similarity of the document is
proposed. A hard (not normalized) frequency vector <italic>v</italic> is built for all words in the
given document. Then, the complete document is clustered into groups <italic>C</italic>. As a first
approach, these groups or segments <italic>c</italic> ∈ <italic>C</italic> are created using a sliding window of length
<italic>m</italic> over the complete document. Afterwards, for each segment <italic>c</italic> ∈ <italic>C</italic>, a new frequency
vector <italic>v</italic><sub><italic>c</italic></sub> is computed, which is used in further steps to decide whether a segment
deviates from the footprint of the complete document. This is performed by
using Algorithm 1.</p>
        <p>Algorithm 1 Intrinsic plagiarism evaluation</p>
        <p>As presented in Algorithm 1, the general footprint or style of the document is
represented by the average of all differences computed between each segment and the complete
document. Note that every segment is compared against the whole document only in
terms of the words present in the segment. This algorithm also reflects the
intuition stated earlier: if certain words are used only in a certain segment, the comparison of that
segment against the whole document leads to a low value, because the frequency
of those words is the same in both the whole document and the segment.
Finally, all segments are classified according to their distance from the
document’s style. As an example, Figure 2 shows a graphical representation of this
evaluation.</p>
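        <p>The steps described in this section can be sketched in Python as follows. This is a minimal sketch and not a reproduction of Algorithm 1: the per-word measure (absolute difference of the two frequencies divided by their sum, averaged over the words of the segment), the non-overlapping windows, and all function names are assumptions of this example. The defaults of 400 words and 0.075 are the window length and threshold reported for the experiments.</p>
        <preformat><![CDATA[
```python
import re
from collections import Counter


def preprocess(text):
    """Lowercase, drop every character outside a-z, split into words."""
    return re.sub(r"[^a-z\s]", "", text.lower()).split()


def segments(words, m):
    """Consecutive windows of m words (assumed to be non-overlapping)."""
    return [words[i:i + m] for i in range(0, len(words), m)]


def segment_score(seg_counts, doc_counts):
    """Assumed style function: mean relative frequency difference,
    computed only over the words present in the segment."""
    total = sum(abs(doc_counts[w] - f) / (doc_counts[w] + f)
                for w, f in seg_counts.items())
    return total / len(seg_counts)


def suspicious_segments(text, m=400, threshold=0.075):
    """Indices of segments scoring below (average score - threshold)."""
    words = preprocess(text)
    if not words:
        return [], [], 0.0
    doc_counts = Counter(words)  # raw, not normalized, frequency vector v
    scores = [segment_score(Counter(seg), doc_counts)
              for seg in segments(words, m)]
    style = sum(scores) / len(scores)  # the document's "main" style
    flagged = [i for i, d in enumerate(scores) if d < style - threshold]
    return flagged, scores, style
```
]]></preformat>
        <p>In this sketch, a segment made up only of words that occur nowhere else in the document gets a score of 0, since each of its words has the same frequency in the segment and in the whole document, and it is flagged whenever the document average exceeds the threshold.</p>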
        <p>In this case, the average value of the comparison of all segments with the whole
document represents the document’s “main” style. This value is roughly computed from the
difference in the frequency of words between the vectors <italic>v</italic> and <italic>v</italic><sub><italic>c</italic></sub>, ∀<italic>c</italic> ∈ <italic>C</italic>. If the
variation is significant, that is, if the style function is lower than the average value minus the
threshold, the segment is classified as suspicious. In this example, real plagiarized
annotations are presented along with the style function value of each segment. Five
cases of plagiarism could be discovered; the value of the style function in those cases
is lower than the threshold.</p>
        <p>The evaluation results will now be presented. Three experiments were conducted. First,
we participated with our external plagiarism detector applied to the external corpus
provided by the competition. Second, our intrinsic plagiarism detector was applied to
the intrinsic corpus. Last, we applied the intrinsic plagiarism detector to the external
plagiarism corpus.</p>
        <p>The parameters of the algorithms were tuned on the PAN@2010 corpus in the case
of the external approach, and on the PAN@2009 corpus in the case of the intrinsic one.</p>
        <p>
          In the PAN@2010 competition, as shown in Table 1, the best results were achieved
by the approach of Kasprzak &amp; Brandejs [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Its overall score was 0.80, and the method
achieved good results on all three metrics: precision, recall and granularity. The next
top results show similar characteristics, being well balanced across the three metrics. Our
model, in its 2010 version, took fifth place, with an overall score of 0.61, a precision of
0.85 and a recall of 0.48. The granularities of the top performers were all close to 1.
        </p>
        <p>Our proposed model, applied to the PAN@2010 corpus, achieves better results than its 2010 version. The
new method is more precise (0.94), thus reducing false-positive detections. The recall
also improved, reaching a score of 0.6. This value is acceptable considering that, at the
moment, we do not detect plagiarism between different languages, which is present
in the corpus. The corpus also includes intrinsic plagiarism, which is not considered in
this particular case.</p>
        <p>Table 2 shows the results of the PAN@2011 competition. The revised method
achieves third place, obtaining high precision (0.91) but a low recall score (0.22). This
could be because we do not consider translated plagiarism, and possibly because
our search space reduction technique filters out many of the source
documents that were actually used. The best team, Grman &amp; Ravas, gets slightly better precision and a recall score
of 0.39.</p>
      </sec>
      <sec id="sec-4-2">
        <title>5.2 Intrinsic Plagiarism Detection</title>
        <p>A sliding window of 400 words was used, with a threshold parameter of 0.075. These
were iteratively adjusted depending on text length. A sensitivity analysis for all
parameters was intentionally excluded by the authors due to lack of space.</p>
        <p>The results for the intrinsic task at PAN@2009 are shown in Table 3 and for PAN@2011
in Table 4. The results are based on the quality of the detection, which only considers
the information on each document itself.</p>
        <p>
          The winner was Stamatatos’ approach [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], with a recall of 0.4607, a precision of
0.2321 and a granularity of 1.3839. This method achieved a good combination of
precision and recall, although its granularity was not among the best.
        </p>
        <p>The proposed method gets an overall score of 0.3457, greater than that of any other
approach and 0.0995 higher than that of the winner’s approach. Our model
gets the best results for F-measure, precision and granularity.</p>
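        <p>How precision, recall and granularity combine into one overall score is not spelled out in this report. The sketch below assumes the measure used by the PAN competitions, namely the F-measure divided by the base-2 logarithm of one plus the granularity; the function names are hypothetical.</p>
        <preformat><![CDATA[
```python
import math


def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def overall_score(precision, recall, granularity):
    """Assumed overall score: F-measure / log2(1 + granularity)."""
    return f_measure(precision, recall) / math.log2(1 + granularity)
```
]]></preformat>
        <p>Under this assumption, the values quoted for Stamatatos’ approach (precision 0.2321, recall 0.4607, granularity 1.3839) give an overall score of about 0.246, which agrees with the reported difference of 0.0995 from 0.3457. A granularity of 1 leaves the F-measure unchanged.</p>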
        <p>These results are confirmed by similar results in the PAN@2011 competition,
presented in Table 4: the proposed model gets roughly the same overall score, 0.3254,
with comparable precision (0.34) and a lower, but not significantly different, recall (0.31).
We obtain the best results in the competition, almost doubling the score of the runner-up,
Luyckx et al., whose overall score was 0.17.</p>
      </sec>
      <sec id="sec-4-3">
        <title>5.3 Intrinsic detector with external corpus</title>
        <p>We used the same intrinsic plagiarism detection algorithm with the same parameters on
the external plagiarism corpus. The results are presented in Table 5. The recall score is
the third lowest; this can be explained by the fact that the intrinsic detector provides no information
on the source of the copied passages, which considerably reduces the metric itself. Also,
the algorithm achieves a precision of 0.36, comparable to the precision obtained when
applied to the intrinsic corpus (0.34). Overall, the intrinsic detector would have ranked
eighth out of 10 teams had it participated in the external competition.</p>
        <p>In this lab report two approaches for plagiarism detection were described. The first
method compares suspicious documents against a collection of possible sources, while
the second one compares the writing style within a particular document to determine whether
the text was written by one or more authors.</p>
        <p>Third place in the external plagiarism detection competition PAN@2011 was
obtained, out of 9 participating teams. The precision of the proposed method, of particular
importance in plagiarism detection, is close to perfect, with a score of 0.94. Future work
on this task would be to integrate an automatic translator into the system, thus providing
a way to detect plagiarism in cross-language tasks, and to investigate new ways to
improve the total number of detections, or recall.</p>
        <p>The proposed intrinsic algorithm, which introduces a new variant to compute
writing style differences, achieves remarkable results, obtaining the first place at the PAN@2011
competition, almost doubling the score of the second team. The method does not utilize
language-dependent features such as verbs or stopwords, thus providing a starting point
to experiment with other languages. Nevertheless, it is important to note that in this
task, of significant difficulty, much work is still needed. The best score so far has been
0.325, indicating that in the field of writing style modeling new approaches need to be
developed.</p>
        <p>Last, the proposed intrinsic model was applied to the external corpus. This
provided results for a real-case scenario where one has no prior information on the suspect
documents. The results indicate that the precision is still low in this case, and that a
significant part of the plagiarized passages is left undetected. Nevertheless, they show that
the method can still be useful, as it may be the only way to perform plagiarism detection when
no reference collection is available.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgment</title>
      <p>The authors would like to thank the continuous support of “Instituto Sistemas Complejos de
Ingeniería” (ICM: P-05-004- F, CONICYT: FBO16; www.isci.cl) and the FONDEF project
(DO8I-1015) entitled DOCODE: Document Copy Detection (www.docode.cl). Gabriel
Oberreuter is currently “Becario CONICYT”. Finally, the authors would like to thank the PAN
competition organizers for putting together such a great workshop and motivating the
development of plagiarism detection techniques.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bao</surname>
            ,
            <given-names>J.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shen</surname>
            ,
            <given-names>J.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>X.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>H.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.D.</given-names>
          </string-name>
          :
          <article-title>Semantic sequence kin: A method of document copy detection</article-title>
          . In: Dai, H., Srikant, R., Zhang, C. (eds.)
          <source>PAKDD. Lecture Notes in Computer Science</source>
          , vol.
          <volume>3056</volume>
          , pp.
          <fpage>529</fpage>
          -
          <lpage>538</lpage>
          . Springer Berlin / Heidelberg (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Braschler</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pianta</surname>
          </string-name>
          , E. (eds.):
          <article-title>CLEF 2010 LABs and Workshops</article-title>
          , Notebook Papers, 22-23 September 2010, Padua, Italy (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Chow</surname>
            ,
            <given-names>T.W.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rahman</surname>
            ,
            <given-names>M.K.M.</given-names>
          </string-name>
          :
          <article-title>Multilayer SOM with tree-structured data for efficient document retrieval and plagiarism detection</article-title>
          .
          <source>Trans. Neur. Netw</source>
          .
          <volume>20</volume>
          (
          <issue>9</issue>
          ),
          <fpage>1385</fpage>
          -
          <lpage>1402</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Meyer zu Eissen</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kulig</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Plagiarism detection without reference collections</article-title>
          . In: Decker,
          <string-name>
            <given-names>R.</given-names>
            ,
            <surname>Lenz</surname>
          </string-name>
          ,
          <string-name>
            <surname>H.J</surname>
          </string-name>
          . (eds.)
          <source>GfKl</source>
          . pp.
          <fpage>359</fpage>
          -
          <lpage>366</lpage>
          . Studies in Classification,
          <source>Data Analysis, and Knowledge Organization</source>
          , Springer Berlin / Heidelberg (
          <year>2006</year>
          ), http://dblp.uni-trier.de/db/conf/gfkl/gfkl2006.html
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Hunt</surname>
          </string-name>
          , R.:
          <article-title>Let's hear it for internet plagiarism</article-title>
          .
          <source>Teaching Learning Bridges</source>
          <volume>2</volume>
          (
          <issue>3</issue>
          ),
          <fpage>2</fpage>
          -
          <lpage>5</lpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Kasprzak</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brandejs</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Improving the reliability of the plagiarism detection system: Lab report for pan at clef 2010</article-title>
          . In: Braschler et al. [
          <volume>2</volume>
          ]
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Oberreuter</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>L'Huillier</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ríos</surname>
            ,
            <given-names>S.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Velásquez</surname>
            ,
            <given-names>J.D.</given-names>
          </string-name>
          :
          <article-title>Fastdocode: Finding approximated segments of n-grams for document copy detection: Lab report for pan at clef 2010</article-title>
          . In: Braschler et al. [
          <volume>2</volume>
          ]
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Park</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>In other (people's) words: plagiarism by university students - literature and lessons</article-title>
          . In:
          <source>Assessment and Evaluation in Higher Education</source>
          . pp.
          <fpage>471</fpage>
          -
          <lpage>488</lpage>
          . No.
          <issue>5</issue>
          , Carfax Publishing
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barrón-Cedeño</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eiselt</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Overview of the 2nd international competition on plagiarism detection</article-title>
          . In: Braschler, M., Harman, D. (eds.)
          <source>Notebook Papers of CLEF 2010 LABs and Workshops</source>
          , 22-23 September, Padua, Italy (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eiselt</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barrón-Cedeño</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Overview of the 1st international competition on plagiarism detection</article-title>
          . In:
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koppel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Agirre</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          (eds.)
          <source>SEPLN 2009 Workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse (PAN 09)</source>
          . pp.
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          . CEUR-WS.org (Sep
          <year>2009</year>
          ), http://ceur-ws.org/Vol-502
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Schleimer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wilkerson</surname>
            ,
            <given-names>D.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aiken</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Winnowing: local algorithms for document fingerprinting</article-title>
          . In:
          <source>SIGMOD '03: Proceedings of the 2003 ACM SIGMOD international conference on Management of data</source>
          . pp.
          <fpage>76</fpage>
          -
          <lpage>85</lpage>
          . ACM, New York, NY, USA (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Seaward</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Matwin</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Intrinsic plagiarism detection using complexity analysis</article-title>
          . In:
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koppel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Agirre</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          (eds.)
          <source>SEPLN 2009 Workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse (PAN 09)</source>
          . pp.
          <fpage>56</fpage>
          -
          <lpage>61</lpage>
          . CEUR-WS.org (Sep
          <year>2009</year>
          ), http://ceur-ws.org/Vol-502
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Intrinsic plagiarism detection using character n-gram profiles</article-title>
          . In:
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koppel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Agirre</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          (eds.)
          <source>SEPLN 2009 Workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse (PAN 09)</source>
          . pp.
          <fpage>38</fpage>
          -
          <lpage>46</lpage>
          . CEUR-WS.org (Sep
          <year>2009</year>
          ), http://ceur-ws.org/Vol-502
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lipka</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prettenhofer</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Intrinsic plagiarism analysis</article-title>
          .
          <source>Language Resources and Evaluation</source>
          <volume>45</volume>
          (
          <issue>1</issue>
          ),
          <fpage>63</fpage>
          -
          <lpage>82</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Vapnik</surname>
            ,
            <given-names>V.N.</given-names>
          </string-name>
          :
          <source>The Nature of Statistical Learning Theory (Information Science and Statistics)</source>
          . Springer Berlin / Heidelberg (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>