<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Applying Specific Clusterization and Fingerprint Density Distribution with Genetic Algorithm Overall Tuning in External Plagiarism Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yurii Palkovskii</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexei Belov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Zhytomyr State University, MARS p.e., Plagiarism Detector Accumulator Project</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2012</year>
      </pub-date>
      <abstract>
        <p>One of the biggest challenges encountered at PAN'11 External Plagiarism Detection was the need for different clusterization methods for different types of plagiarism within the corpus. The existence of sparse sections of highly obfuscated, low obfuscated and translated plagiarism sections alongside with verbatim plagiarism parts, made single pass clusterization inefficient as it produced negative effects in one of the above cases. At PAN'11 we used a single pass fixed length clusterization algorithm with a fixed value defining the maximum distance for cluster formation. The main issue with the fixed cauterization value is that large numbers (1600-1800) perform best for high obfuscation, medium (900) for translated and low (40) for verbatim sections. We decided to develop the system that will be able to either dynamically adjust the clusterization distance depending on the type of detected sections or try out multi-pass clusterization with different distance value with the exclusion of already detected clusters and heuristic post processing. For each detected cluster in several clusterization runs we measured Diagonal Density Distribution (DDD) and Mean Average Diagonal Fingerprint Distance (MADFD). These two values reflect the relative distribution of detected equal fingerprints within the cluster diagonal and allows to effectively tell which type of plagiarism is actually there. One more important role that these values play is the negation of cluster merging if the resulting DDD is less than any of the two clusters merged. This was particularly effective preventing accidental fingerprints merging the resulting clusters. Additionally we discovered that the total number of parameters that affect the system performance is already large and decided to apply the genetic algorithm in order to tackle the best possible meta values instead of picking them by hand. In PAN'12 prototype application we employed a dot plot visualization with both detected clusters and master clusters overlay that allowed us to efficiently control the training process and to measure the overall progress for each separate document pair.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>In response to the challenges that have been brought forward by the PAN 2012
competition conditions we decided to focus our research on the detailed document
comparison task as our previous experience at CLEF 2011 proved the existence of
more effective plagiarism detection methods. One of the main issues we faced during
the construction of the PAN 2011 prototype application for detailed document
compare stage was the difficulty selecting the most efficient clusterization algorithm.
After carefully investigating the possible options used by our competitive colleagues
in PAN 2010 and PAN 2012, we come to the conclusion that using custom
multilayered clustering algorithm at clusterization stage may bring better efficiency than
using generic clusterization engines such as WEKA, due to the exact nature of
plagiarism distribution within the plagiarized passages.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Methods</title>
      <p>As a basis for the algorithm we took generic Euclidian distance clusterization
implementation by Emilio Arp:</p>
      <p>We included a number of "filtering layers" to affect the 8-th step that is responsible
for cluster merging. During the analysis of the visualized results we discovered that
the exact distribution of the shared fingerprints within any detected cluster is located
within the cluster diagonal. Most problematic cases that has already been mentioned
are the "Over clustering" and "Under clustering" effects produced by different
Clusterization Distance for different types of Plagiarism.</p>
      <p>"Over clustering" and "Under clustering:"
CD is too big:</p>
      <p>CD is too low:</p>
      <p>Thus one of the hypothesis to use was the ability to build density diagonal
distribution histogram and use it is as an indicator of plagiarism type encountered and
the cluster merging efficiency validator. So we ended up with two values that best
describe these relations Diagonal Density Distribution and Diagonal Minimal Density
Distribution Percent.</p>
      <p>During each cluster formation its Diagonal Minimal Density Distribution Percent
value is built automatically. During the cluster merging stage the above mentioned
measure of the clusters that are going to be merged is compared to the resulting
cluster Diagonal Minimal Density Distribution Percent. Is the resulting value is lower
than any of the two merged clusters then this merge is not performed. This particular
mechanism was used to fight "over clustering" and it proved to be most efficient. One
more benefit of such approach is that it allowed 2 stages of clusterization process
first with a relatively small Clustering Distance of 40 targeting the cases of verbatim
plagiarism and another clusterization pass with the Clustering Distance of 1500 that
was targeting the detection of low and high obfuscated cases of plagiarism.</p>
      <p>One more specific layer of cluster merging used is the relation of the newly
detected cluster width to height named Cluster Dimensions Maximum Allowed Skew.
This filtering was aimed at removing accidental clusters that affected the total
clusterization in a negative way via "over clustering".</p>
      <p>Diagonal Density Distribution at 99% in
suspicious-document01712-sourcedocument03867.xml PAN 2012 document pair:
Diagonal Density Distribution - Cluster negatively affecting the cluster formation
in suspicious-document01801-source-document04208.xml document pair:</p>
    </sec>
    <sec id="sec-3">
      <title>3 Evaluation</title>
      <p>Used as a basis of the compare mechanism we decided to more efficiently tackle
the meta parameters for the developed system and instead manual adjustment we tried
to run a genetic algorithm over these parameters in order to adjust the best possible
values. We ran the multi-staged evaluation process over a limited pre-selected group
of files visualizing the "dot-plot" like graphs, then trying to figure out why the
investigated case does not produce the desired result and namely, why the
clusterization algorithm failed to achieved any better performance.</p>
      <sec id="sec-3-1">
        <title>Genetic Algorithm Genome Structure and Fittest Values:</title>
        <sec id="sec-3-1-1">
          <title>Gene Name:</title>
          <p>DoStemming
RemovePuvctuatuion
SortFingerprintBeforeHashing
1StageClusteringDistance
2StageClusteringDistance
2StageClusterMinLength
1StageClusterMinimalFpsCount
2StageClusterMinimalFpsCount
2StageClusterMinimalLengthChars
ClusterDimentionsMaximumSkew
ExcludeMinimalAverageDensity
MinimalDensityDistributionPercent
FingerprintLength
FingerprintStep</p>
          <p>We were not able to exhaustively run the complete corpus genetic search for the
most optimal values due to the extreme runtime overhead. Instead we used "effective
range" by educated guess and a separate sub-corpus for each individual type of
plagiarism. The idea behind was twofold - to get the generation run data, visualize it
thus tuning the algorithms appropriately and to get most effective p-det over the
mixed corpus that represented the low-scaled tuning corpus of PAN 2012.</p>
          <p>When evaluating our final performance at PAN 2012 in comparison to the
previously achieved results it must be noted, that this year competition has lots of new
conditions and completely new environment, thus the comparison is not
straightforward - CDC and CR stages do not influence each other and the resulting
number of false positives thus is much lower. This particular detail makes such
comparison not feasible. Still we consider the achieved result reflects our efforts
directed onto the project development.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>PAN 2012 Performance:</title>
        <sec id="sec-3-2-1">
          <title>PlagDet</title>
          <p>0.5382163</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>Precision</title>
          <p>0.5748453</p>
        </sec>
        <sec id="sec-3-2-3">
          <title>Recall</title>
          <p>0.5230450</p>
        </sec>
        <sec id="sec-3-2-4">
          <title>Granularity</title>
          <p>1.0246376</p>
        </sec>
        <sec id="sec-3-2-5">
          <title>Runtime</title>
          <p>4.5162973</p>
          <p>Things to be noted - our best subcorpus result is p-det 0,74. Our current research
focus is trying to determine why the achieved p-det is much lower than the one
achieved during our tests. Secondly - as our later tests showed, a bug in the software
implementation PAN 2012 prototype application failed to map the exact offsets of the
translated plagiarism thus accumulating delta that negatively affects the results of
translated plagiarism sections. Thirdly - the training corpus for PAN2012 is different
from the test corpus in its structure, plagiarism distribution and some other
characteristics we are not aware at the moment.</p>
          <p>In this paper we introduced a new approach to the clusterization algorithm guided
specifically to tackle the "over clustering" and "under clustering" negative effects that
are usually produced by generic type clusterization. We investigated the benefits of
different filtering layers within the cluster merging stage of the main clusterization
algorithm and the usage of two staged clusterization in order to effectively handle the
different types of plagiarism - namely highly obfuscated and translated ones. We
applied genetic algorithm to effectively tune in the meta parameters that affect the
total efficiency of plagiarism detection system.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Braschler</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pianta</surname>
          </string-name>
          , E. (eds.):
          <article-title>CLEF 2010 LABs and Workshops</article-title>
          , Notebook Papers,
          <fpage>22</fpage>
          -
          <lpage>23</lpage>
          September 2010, Padua, Italy (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Clough</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Old and new challenges in automatic plagiarism detection</article-title>
          .
          <source>National Plagiarism Advisory Service</source>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Grozea</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gehl</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Popescu</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>ENCOPLOT: Pairwise Sequence Matching in Linear Time Applied to Plagiarism Detection</article-title>
          .
          <source>In: 3rd PAN WORKSHOP</source>
          .
          <article-title>UNCOVERING PLAGIARISM, AUTHORSHIP AND SOCIAL SOFTWARE MISUSE</article-title>
          . p.
          <volume>10</volume>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Grozea</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Popescu</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>The Encoplot Similarity Measure for Automatic Detection of Plagiarism - Extended Technical Report</article-title>
          . http://brainsignals.de/encsimTR.pdf (
          <year>Aug 2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Grozea</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Popescu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Encoplot - Performance in the Second International Plagiarism Detection Challenge - Lab Report for PAN at CLEF 2010</article-title>
          . In: Braschler et al. [
          <volume>1</volume>
          ]
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Grozea</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Popescu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Who's the Thief? Automatic Detection of the Direction of Plagiarism</article-title>
          . In: Gelbukh,
          <string-name>
            <surname>A.F</surname>
          </string-name>
          . (ed.)
          <source>CICLing. Lecture Notes in Computer Science</source>
          , vol.
          <volume>6008</volume>
          , pp.
          <fpage>700</fpage>
          -
          <lpage>710</lpage>
          . Springer (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Planas</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Badia</surname>
            ,
            <given-names>R.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ayguadé</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Labarta</surname>
          </string-name>
          , J.:
          <article-title>Hierarchical task based programming with StarSs</article-title>
          .
          <source>International Journal of High Performance Computing</source>
          <volume>23</volume>
          (
          <issue>3</issue>
          ),
          <fpage>284</fpage>
          -
          <lpage>299</lpage>
          (
          <year>August 2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barrón-Cedeño</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.:</given-names>
          </string-name>
          <article-title>An evaluation framework for plagiarism detection</article-title>
          .
          <source>In: Proceedings of the 23rd International Conference on Computational Linguistics: Posters</source>
          . pp.
          <fpage>997</fpage>
          -
          <lpage>1005</lpage>
          . Association for Computational Linguistics (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barrón-Cedeño</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eiselt</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Overview of the 2nd International Competition on Plagiarism Detection</article-title>
          . In: Braschler et al. [
          <volume>1</volume>
          ]
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>