<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A plagiarism detector for intrinsic plagiarism</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pablo Suárez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>José Carlos González</string-name>
          <email>josecarlos.gonzalez@upm.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Julio Villena-Román</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DAEDALUS - Data, Decisions and Language, S.A.</institution>
          ,
          <addr-line>Avda. de la Albufera, 321, 28031 Madrid</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>ETSI Telecomunicación, Universidad Politécnica de Madrid</institution>
          ,
          <addr-line>28040 Madrid</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Telematic Engineering Department, Universidad Carlos III de Madrid</institution>
          ,
          <addr-line>28911 Leganés</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2010</year>
      </pub-date>
      <abstract>
        <p>In this paper, we describe the algorithm used by our plagiarism detection system in the context of the PAN10 competition. Our system is based on the Lempel-Ziv distance, which is applied to extract structural information from texts. The algorithm then tries to find outliers in the vector of distances between each fragment of the text and the whole document.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <sec id="sec-1-1">
        <title>2.1 Global architecture</title>
        <p>The following figure shows the global architecture of our intrinsic plagiarism detection algorithm.</p>
      </sec>
      <sec id="sec-1-2">
        <title>2.2 Fragmenter</title>
        <p>This module splits the original text into blocks. Our software offers two
options: 1) fragmentation by sentences, and 2) fragmentation by
paragraphs. The minimum size allowed for the fragments, or text blocks, is a
configurable parameter of our system. A minimum is necessary because a very small fragment
does not contain enough text to detect the presence of plagiarism reliably.</p>
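        <p>As an illustration only, the fragmentation step could be sketched as follows. The function name, the boundary patterns and the policy of merging short pieces with the following ones are assumptions of this sketch, not a description of our actual implementation.</p>
        <preformat preformat-type="code">
```python
import re

# Hypothetical boundary patterns; the rules of the original system are not specified.
BOUNDARIES = {
    "paragraph": re.compile(r"\n\s*\n"),  # one or more blank lines
    "sentence": re.compile(r"[.!?]\s+"),  # naive end-of-sentence rule
}


def fragment_text(text, level="paragraph", min_length=200):
    """Return (start, end) character spans that cover the whole text.

    A span is closed at the first boundary where it has reached min_length
    characters, so shorter pieces are merged with the pieces that follow
    (an assumption of this sketch).
    """
    cuts = sorted({m.end() for m in BOUNDARIES[level].finditer(text)} | {len(text)})
    spans = []
    start = 0
    for cut in cuts:
        if cut > start and cut - start >= min_length:
            spans.append((start, cut))
            start = cut
    if start != len(text):
        # Trailing piece shorter than min_length: attach it to the last span,
        # or keep it alone when the whole text is short.
        if spans:
            spans[-1] = (spans[-1][0], len(text))
        else:
            spans.append((start, len(text)))
    return spans
```
        </preformat>
        <p>Returning character offsets rather than substrings keeps each fragment traceable to its position in the document, which is what a detection report ultimately needs.</p>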
      </sec>
      <sec id="sec-1-3">
        <title>2.3 Detection distances</title>
        <p>
          The current version of our algorithms implements, among others, the following
distance definitions:
        </p>
        <p>
          Basile distance: proposed by Basile et al., who define a distance between two
texts x and y from their n-grams ([
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]).
        </p>
        <p>
          Lempel-Ziv distance: a distance based on Kolmogorov complexity, implemented by means of the
Lempel-Ziv compression algorithm, as described in [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
        <p>
          RHonore distance: as described in [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
        <p>
          Our algorithms can use one or a subset of the available distances, selected by means of a
configurable parameter. For intrinsic plagiarism detection in PAN10 we
used only the Lempel-Ziv distance, since measures based on Kolmogorov
complexity (approximated with a lossless compression algorithm) have been shown to be
a good way to extract structural information from texts for intrinsic plagiarism
detection [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
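        <p>A compression-based distance of this kind can be sketched as shown below. This is a hedged example, not our exact formula: it assumes the normalized compression distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), and uses zlib (DEFLATE, which builds on LZ77) as the compressor C. The function names are hypothetical.</p>
        <preformat preformat-type="code">
```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (approximates complexity)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two texts (assumed formula)."""
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


def distance_vector(text, spans):
    """Distance between each fragment text[start:end] and the whole document."""
    return [ncd(text[start:end], text) for start, end in spans]
```
        </preformat>
        <p>Note that zlib only looks back over a 32 KB window, so for long documents a compressor with a larger window may give more meaningful values.</p>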
      </sec>
      <sec id="sec-1-4">
        <title>2.4 Outlier detection</title>
        <p>
          The next step consists of detecting which distances can be considered outliers in
the vector of distances between each fragment of the text and the whole document.
Our software implements three classical ways of detecting outliers in a list of
data [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]: standard deviation (Chebyshev), percentiles and MAD (Median
Absolute Deviation). The threshold t selected in each case is: t = x̄ + α·σ
(for standard deviation, where x̄ is the mean and σ the standard deviation of the distances),
t = Q3 + β·(Q3 − Q1) (for percentiles, where Q1 and Q3 are the first and third quartiles) and
t = x̃ + γ·MAD (for MAD, where x̃ is the median). Here α, β and γ are configurable weights; we used α = 0.9,
β = 1.5 and γ = 3.0. Either one or a subset of these outlier thresholds can be used, selected by means of
a configurable parameter. We used only MAD for PAN10.
        </p>
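        <p>The three thresholds can be illustrated with the following sketch. It makes several assumptions that are not fixed by the text: the population standard deviation, the default quartile method of the Python statistics module, the mean and the median as the central values for the standard deviation and MAD rules respectively, and that a distance is an outlier when it exceeds the threshold. All function names are hypothetical.</p>
        <preformat preformat-type="code">
```python
import statistics


def mad(values):
    """Median absolute deviation: median of |v - median(values)|."""
    med = statistics.median(values)
    return statistics.median(abs(v - med) for v in values)


def thresholds(values, alpha=0.9, beta=1.5, gamma=3.0):
    """Upper thresholds for the three outlier rules (assumed definitions)."""
    mean = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population standard deviation (assumption)
    q1, _, q3 = statistics.quantiles(values, n=4)  # default quartile method (assumption)
    return {
        "stddev": mean + alpha * sigma,
        "percentiles": q3 + beta * (q3 - q1),
        "mad": statistics.median(values) + gamma * mad(values),
    }


def outliers(values, method="mad", **weights):
    """Indices of the values that lie above the threshold of the chosen rule."""
    t = thresholds(values, **weights)[method]
    return [i for i, v in enumerate(values) if v > t]
```
        </preformat>
        <p>For example, for the values 1, 2, 3, 4, 5 and 100 the median is 3.5 and the MAD is 1.5, so the MAD threshold with γ = 3.0 is 8.0 and only the last value is flagged.</p>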
      </sec>
      <sec id="sec-1-5">
        <title>2.5 Interval aggregation</title>
        <p>Interval aggregation is an optional module that can be applied to the output of our
system. It merges a group of separate detected plagiarism intervals into a single
interval when the separation between them is smaller than a configurable threshold. This makes it possible
to detect as a single plagiarized block several nearby blocks that were split by the
fragmenter. For PAN10 we did not use this interval aggregation module.</p>
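        <p>A minimal sketch of such an aggregation step is given below, assuming that intervals are (start, end) character offsets and that two intervals are merged when the gap between them is strictly smaller than the threshold. The names aggregate_intervals and max_gap are hypothetical.</p>
        <preformat preformat-type="code">
```python
def aggregate_intervals(intervals, max_gap=50):
    """Merge detected intervals whose separation is smaller than max_gap."""
    merged = []
    for start, end in sorted(intervals):
        if merged and max_gap > start - merged[-1][1]:
            # Close enough to the previous interval: extend it.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged
```
        </preformat>
        <p>With a threshold of 50, the intervals (0, 100) and (120, 200) would be reported as the single interval (0, 200), while an interval starting at 400 would remain separate.</p>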
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3 Evaluation</title>
      <p>With respect to the PAN10 competition, as stated above, we have only participated in
the intrinsic plagiarism detection task, because our system performed poorly (owing to software or
hardware limitations) on external plagiarism detection. In this case, the configurable
parameters of our plagiarism detector are: fragmentation level (sentence, paragraph),
minimum length of interval (minimum length for a sentence or paragraph to be considered
valid), use of interval aggregation (true, false), aggregation interval (separation
between intervals below which they are aggregated), minimum fragment length (minimum
fragment length for plagiarism detection), active comparison distances (Basile,
Lempel-Ziv, RHonore), outlier detection method (standard deviation, percentiles,
MAD), and the α, β and γ weights for outlier detection. Our settings, chosen after several tests
on the PAN-PC-09 training corpus, were: fragmentation level = paragraph, minimum
length of interval = 200, use of interval aggregation = false, aggregation interval = 50,
minimum fragment length = 200, active comparison distances = only Lempel-Ziv,
outlier detection method = MAD, weight for outlier detection γ = 3.0.</p>
      <p>The detection performance of our system on the PAN-PC-09 training corpus,
using the PAN evaluation measures, was: recall = 0.185225576213,
precision = 0.075230788299, overall = 0.0743645119788, granularity = 1.71111111111.</p>
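      <p>For reference, the PAN overall score combines the other three measures as F1 / log2(1 + granularity), where F1 is the harmonic mean of precision and recall. The short sketch below (the function name is hypothetical) applies this formula to the figures above and yields approximately 0.0744, in agreement with the reported overall value.</p>
      <preformat preformat-type="code">
```python
import math


def overall_score(precision, recall, granularity):
    """PAN overall score: F1 divided by log2(1 + granularity)."""
    if precision + recall == 0:
        return 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return f1 / math.log2(1 + granularity)


score = overall_score(
    precision=0.075230788299,
    recall=0.185225576213,
    granularity=1.71111111111,
)
print(round(score, 4))
```
      </preformat>
      <p>A granularity of 1 leaves F1 unchanged, so reporting each plagiarized passage as a single detection rather than several pieces would raise the overall score even at the same precision and recall.</p>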
    </sec>
    <sec id="sec-3">
      <title>4 Conclusion</title>
      <p>As noted earlier, we have only participated in the intrinsic plagiarism detection
task. Because the competition results cover the detection of intrinsic and
external plagiarism jointly rather than separately, our overall results were
necessarily worse. In any case, our results so far have not been good, and we expect
to improve the current system substantially in future work. This future work will
include the following tasks: 1) Improve
intrinsic and external plagiarism performance; 2) Combine intrinsic and external
plagiarism; 3) Develop the Internet module; 4) Implement new detection distances; 5)
Implement new outlier detection methods; 6) Implement 'obfuscation' detection
algorithms; 7) Implement a report generator module.</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgements</title>
      <p>This work has been partially supported by the Spanish Center for Industry
Technological Development (CDTI, Ministry of Industry, Tourism and Trade),
through the CONTENIDOS A LA CARTA project, INGENIO 2010 Programme,
AVANZA I+D 2008.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>BASILE</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          et al.
          <year>2008</year>
          :
          <article-title>“An example of mathematical authorship attribution”</article-title>
          .
          In:
          <source>Journal of Mathematical Physics</source>
          ,
          <volume>49</volume>
          :
          <fpage>125211</fpage>
          -
          <lpage>125230</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>BASILE</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          et al.
          <year>2009</year>
          :
          <article-title>“A plagiarism detection procedure in three steps: selection, matches and 'squares'”</article-title>
          . In: PAN-09 Competition.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>BELABBES</surname>
            ,
            <given-names>Sigem</given-names>
          </string-name>
          et al.
          <year>2008</year>
          :
          <article-title>“On Using SVM and Kolmogorov Complexity for Spam Filtering”</article-title>
          .
          In:
          <source>Proceedings of the Twenty-First International FLAIRS Conference</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>BARRÓN</surname>
            ,
            <given-names>Luis Alberto</given-names>
          </string-name>
          <year>2008</year>
          :
          <article-title>“Detección automática de plagio en texto”</article-title>
          . In: &lt;http://mavir2006.mavir.net/docs/Barron-DeteccionPlagioTexto.pdf&gt;.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>IRANZO PÉREZ</surname>
            ,
            <given-names>David</given-names>
          </string-name>
          <year>2007</year>
          :
          <article-title>Análisis de Outliers: un caso a estudio</article-title>
          .
          <source>PhD Thesis</source>
          . Universitat de València. Servei de publicacions. In: &lt;http://www.tesisenxarxa.net/TESIS_UV/AVAILABLE/TDX-1007108-124618//iranzo.pdf&gt;.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
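          6.
          <string-name>
            <surname>SEAWARD</surname>
            ,
            <given-names>Leane</given-names>
          </string-name>
          and
          <string-name>
            <surname>MATWIN</surname>
            ,
            <given-names>Stan</given-names>
          </string-name>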
          <year>2009</year>
          :
          <article-title>“Intrinsic Plagiarism Detection using Complexity Analysis”</article-title>
          . In: Stein, Rosso, Stamatatos, Koppel, Agirre (Eds.).
          <source>PAN'09</source>
          , pp.
          <fpage>56</fpage>
          -
          <lpage>61</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>