<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Detailed Comparison Module In CoReMo 1.9 Plagiarism Detector Notebook for PAN at CLEF 2012</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Diego A. Rodríguez Torrejón</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>José Manuel Martín Ramos</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Universidad de Huelva</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes the process and fundamentals of the Detailed Comparison Module of the CoReMo 1.9 Plagiarism Detector, which received special mention at PAN 2012 for its running speed (at least 10 times faster than any other competitor) while achieving very good detections. Its high detection efficacy is due to the special features of the contextual and surrounding-context n-grams, which, working together, increase the chances of a match, especially when translation or paraphrasing is involved, while remaining highly discriminative, which simplifies the accurate location of plagiarized sections. Its independence from external translation systems, together with a process optimized through high-performance C/C++ programming techniques, yields high speed even though it is not yet optimized for multi-core systems.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Plagiarism detection is one of the fields awakening interest in Natural Language
Processing and Information Retrieval. The various PAN1 editions continuously
promote the improvement of existing techniques, compiling corpora with cases that
are more realistic and harder to detect, and developing systems, work plans and
tasks to design and analyze the individual impact of each proposal on the different
subtasks, in terms of the performance obtained, the hardware resources required and
the time spent, thus facilitating the subsequent combination and improvement of
proposals in search of the ultimate plagiarism detector. CoReMo [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is a
Plagiarism Detection System that was initially designed for participation in the PAN
editions, obtaining very acceptable performance results while standing out for its low
hardware requirements and processing speed (one of its developers' main goals),
which this year it has had the opportunity to demonstrate. However, CoReMo uses
pruning techniques to avoid comparing the suspicious document against a source
document unless evidence of plagiarism has been detected by its High Precision
Information Retrieval System (HAIRS), with the Reference Monotony (RM) pruning
strategy delimiting the suspected plagiarized section before any comparison is made with the
      </p>
      <sec id="sec-1-1">
        <title>1 http://pan.webis.de</title>
        <p>
          suspicious document. Therefore, CoReMo did not perform exhaustive document-pair
comparisons until this PAN edition, whose task forced a change of design
to meet the characteristics of the new edition, which calls for a comprehensive
comparison of document pairs. However, this is not the only new feature included in this
CoReMo release: compared to the previous edition, the detection capability
was greatly improved by extending the n-gram model used (Contextual N-grams,
CTnG) to Surrounding Context N-grams (SCnG) [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] and by adding a post-processing step
that joins nearby detections (Granularity Filter).
        </p>
        <p>The new Detailed Comparison capability was designed for the maximal
computational efficiency usual in former CoReMo versions, by applying
high-efficiency programming techniques to the new task's algorithms.</p>
        <p>Furthermore, it was found that earlier CoReMo versions generated the XML
detection files delimiting the offset and length of detections in bytes instead of UTF-8
characters, which caused discrepancies between the detections and the annotations used in
the gold-standard corpus and penalized the evaluation by up to 10%. This
new version can annotate detections in either byte (faster) or UTF-8 mode.</p>
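        <p>The discrepancy is easy to reproduce. The following Python sketch (illustrative only; CoReMo itself is written in C/C++) locates a fragment and reports its position in both annotation modes; the sample text is an invented example:</p>

```python
def byte_and_char_offsets(text: str, fragment: str):
    """Locate `fragment` in `text` and return its (offset, length) twice:
    once counted in UTF-8 bytes, once in characters. The two diverge as
    soon as a multi-byte character precedes (or lies inside) the fragment."""
    char_off = text.index(fragment)
    char_len = len(fragment)
    byte_off = len(text[:char_off].encode("utf-8"))
    byte_len = len(fragment.encode("utf-8"))
    return (byte_off, byte_len), (char_off, char_len)

# 'ó' occupies 2 bytes in UTF-8, so the byte offset drifts by one position:
doc = "Detección de plagio: the quick brown fox"
(b_off, b_len), (c_off, c_len) = byte_and_char_offsets(doc, "quick brown")
```

        <p>With byte-mode annotations the detection starts at offset 26, while a character-based gold standard expects 25: exactly the kind of mismatch described above.</p>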
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2 Surrounding Context N-grams</title>
      <p>
        One of the most important innovations in CoReMo with respect to last year's version is
that documents are modeled with an extension of the former Contextual
N-grams [
        <xref ref-type="bibr" rid="ref1 ref2">1-2</xref>
        ] (CTnG: case folding, removal of stopwords and short words,
stemming, and internal sorting of the n-gram components) to Surrounding Context
N-grams (SCnG) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], which triple the number of n-grams by adding to each CTnG a
special type of skip n-gram, obtained by excluding either the second or the last-but-one
term from a group of n+1 relevant terms, prior to all the CTnG processing
explained above.
      </p>
      <p>For instance, modeling “The quick brown fox jumps over the lazy dog” to SC3G:
1. quick brown fox → brown_fox_quick (1st direct CT3G way)
2. quick brown jumps → brown_jump_quick (1st left-hand SC3G way)
3. quick fox jumps → fox_jump_quick (1st right-hand SC3G way)
4. brown fox jumps → brown_fox_jump (2nd direct CT3G way)
5. brown fox lazy → brown_fox_laz (2nd left-hand SC3G way)
6. brown jumps lazy → brown_jump_laz (2nd right-hand SC3G way)
7. fox jumps lazy → fox_jump_laz (3rd direct CT3G way)
8. fox jumps dog → dog_fox_jump (3rd left-hand SC3G way)
9. fox lazy dog → dog_fox_laz (3rd right-hand SC3G way)
10. jumps lazy dog → dog_jump_laz (4th direct CT3G way)</p>
      <sec id="sec-2-4">
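        <p>The ten n-grams above can be generated mechanically. The following Python sketch (illustrative only; CoReMo itself is C/C++) assumes that stopword removal, case folding and stemming have already reduced the sentence to its relevant stemmed terms:</p>

```python
def scng(terms, n=3):
    """For each window: the direct contextual n-gram, then (while n+1 terms
    remain) the left-hand skip n-gram (drop the last-but-one term) and the
    right-hand skip n-gram (drop the second term), each internally sorted."""
    grams = []
    for i in range(len(terms) - n + 1):
        grams.append("_".join(sorted(terms[i:i + n])))        # direct CTnG
        if i + n < len(terms):
            w = terms[i:i + n + 1]                            # group of n+1 terms
            grams.append("_".join(sorted(w[:-2] + w[-1:])))   # left-hand SCnG
            grams.append("_".join(sorted(w[:1] + w[2:])))     # right-hand SCnG
    return grams

# Relevant stems of "The quick brown fox jumps over the lazy dog":
grams = scng(["quick", "brown", "fox", "jump", "laz", "dog"])
```

        <p>Six terms yield 4 direct CT3G plus 3 left-hand and 3 right-hand SC3G, i.e. the 10 n-grams enumerated above.</p>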
        <p>
          Using SCnG ultimately yields 3 times as many n-grams as CTnG alone, which
means more chances to tackle obfuscated cases while keeping almost the same high
precision in practice. The larger number of terms acts as a magnifying effect in the
analysis. Memory requirements are obviously tripled and processing time almost
doubled, but performance improves dramatically. Including these skip n-grams barely
decreases precision: an n-gram frequency study on the PAN-PC-2009/2010 (table 1) / 2011 corpora [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] shows their
exclusivity ratio almost unaltered.
        </p>
        <p>All n-grams are compared without distinguishing how they were created. The
SCnG are especially useful for improving the CTnG effectiveness when words change
(synonyms, negated antonyms, given names, translation or orthographic errors,
characters replaced by other UTF codes with the same appearance, ...), when new words
are inserted (enriched sentences) or removed (summarized sentences). Sentence
reordering due to translation, or switching between passive and active voice,
is also handled.</p>
        <p>
          This approach yields more matches, especially for paraphrased or translated cases, with
which to identify possible plagiarism (almost as many as when using lower-order n-grams, but with
higher-precision disambiguation). However, it also produces more disconnected short
detections that need to be joined. A distance-based joining step, named Granularity
Filter (GF), improves the scores. SCnG and GF combined achieve a Plagdet score
about 45% better than the direct CTnG mode. To facilitate
locating the CTnGs or SCnGs, the model records the offset and length of
each one. The benefit of this extended n-gram model over the
former, based only on Contextual N-grams, was shown in [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], improving the
performance of a former CoReMo version, as can be seen in fig. 1 and fig. 2.
        </p>
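        <p>A minimal sketch of such a distance-joining step (illustrative Python; the span-pair representation and the single gap threshold are simplifying assumptions, not CoReMo's exact criteria):</p>

```python
def granularity_filter(detections, max_gap):
    """Join nearby detections. Each detection is a pair of spans,
    ((susp_start, susp_end), (src_start, src_end)); two consecutive
    detections (sorted by suspicious offset) merge when the gap between
    them is at most max_gap on BOTH the suspicious and the source side."""
    merged = []
    for det in sorted(detections):
        if merged:
            (ps, pe), (qs, qe) = merged[-1]
            (cs, ce), (ds, de) = det
            if cs - pe <= max_gap and ds - qe <= max_gap:
                merged[-1] = ((ps, max(pe, ce)), (qs, max(qe, de)))
                continue
        merged.append(det)
    return merged

# Two short neighbouring detections collapse into one; the distant one survives:
joined = granularity_filter(
    [((0, 100), (0, 110)), ((120, 200), (130, 210)), ((1000, 1100), (900, 1000))],
    max_gap=50)
```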
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3 Detailed Comparison</title>
      <p>
        Since, with the extended SCnG model, matches are both highly discriminative
and more frequent, it is possible to get enough matching n-grams with very low noise,
which makes the comparison task easier. For this detailed pair-comparison task,
alphabetically sorted versions of both SCnG-modeled documents, annotated and
linked with their inner matches, are compared in the manner of a modified mergesort
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] algorithm to speed up the job, linking every SCnG to an external matching list.
      </p>
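      <p>The merge-style pass can be sketched as follows (illustrative Python; CoReMo's records carry offsets, lengths and frequency links rather than the bare positions used here):</p>

```python
def merge_matches(src_grams, susp_grams):
    """Single linear pass over two ALPHABETICALLY SORTED lists of
    (ngram, position) pairs, in the spirit of mergesort's merge step:
    advance the pointer holding the smaller n-gram, and on equality
    pair every co-occurrence of that n-gram in both documents."""
    i = j = 0
    matches = []
    while i < len(src_grams) and j < len(susp_grams):
        a, b = src_grams[i][0], susp_grams[j][0]
        if a < b:
            i += 1
        elif a > b:
            j += 1
        else:
            i2 = i
            while i2 < len(src_grams) and src_grams[i2][0] == a:
                j2 = j
                while j2 < len(susp_grams) and susp_grams[j2][0] == a:
                    matches.append((src_grams[i2][1], susp_grams[j2][1]))
                    j2 += 1
                i2 += 1
            i = i2
            while j < len(susp_grams) and susp_grams[j][0] == a:
                j += 1
    return matches

pairs = merge_matches([("aaa", 0), ("bbb", 1), ("ccc", 2)],
                      [("bbb", 5), ("ccc", 7), ("ddd", 9)])
```

      <p>Since both lists are already sorted, the pass is linear in their combined length (plus the co-occurrence pairs emitted), which is what makes the exhaustive pair comparison fast.</p>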
      <p>Minimum lengths and maximum distances between matches (to belong to the same
detection) are adjusted on the basis of document length, number of n-grams and the
user settings for minimal monotony and chunk length in n-grams (the classical
CoReMo adjustments), which differ between crosslingual and monolingual comparison.</p>
      <p>The distances are measured in n-grams for the suspicious documents and in characters for the sources:
maxNgramDist = 2 · chunkLength (1)
maxCharDist = chunkLength · wordLengthAverage (2)
minNgramLength = (monotony − 1.5) · chunkLength (3)
minCharLength = minNgramLength · wordLengthAverage (4)
The reliability of the matching n-grams is weighted by their inner matching frequency
in both the suspicious and the source documents, in order to confirm or reject the detected
continuous matching sections and to create preliminary XML documents (direct
detections). After a detection ends, a roll-back to the next n-gram takes place,
starting the next possible detection (bear in mind that a detection finishes when no
new reliable match has been found after several n-grams).</p>
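      <p>Equations (1)-(4) translate directly into code; the sample values for chunkLength, monotony and wordLengthAverage below are illustrative assumptions, not CoReMo's tuned defaults:</p>

```python
def comparison_thresholds(chunk_length, monotony, word_length_average):
    """Thresholds (1)-(4): distances and minimum lengths, in n-gram units
    on the suspicious side and character units on the source side."""
    max_ngram_dist = 2 * chunk_length                          # (1)
    max_char_dist = chunk_length * word_length_average         # (2)
    min_ngram_length = (monotony - 1.5) * chunk_length         # (3)
    min_char_length = min_ngram_length * word_length_average   # (4)
    return max_ngram_dist, max_char_dist, min_ngram_length, min_char_length

thresholds = comparison_thresholds(chunk_length=12, monotony=2,
                                   word_length_average=5)
```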
      <p>
        The direct detections are post-processed by the Granularity Filter, which joins
nearby detections simultaneously on the suspicious and source sides, producing the final
XML detection documents. Both XML documents can then be combined into a more
readable HTML comparison document that highlights the direct detections in color within the
final zones.</p>
      <p>The use of external translation systems (e.g. Google Translator2) is a drawback for
response time, availability and cost. Because of that, CoReMo
performs its own translation locally whenever it detects a non-English
document. The crosslingual analysis is arranged locally after a direct mapping from
every non-English word (or its stem) to a translated English stemmed word, using
two special dictionaries [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ][
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]: direct2stem (first chance) and stem2stem (second
chance, when the first fails). If no plausible translation is found, the non-English
word is replaced by its English stem.
      </p>
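      <p>The two-dictionary cascade can be sketched as follows (illustrative Python; only the direct2stem/stem2stem order comes from the text, while the dictionary entries and the fallback stemmer are toy assumptions):</p>

```python
def map_to_english(word, stem, direct2stem, stem2stem, english_stem):
    """First chance: look up the full non-English word in direct2stem.
    Second chance: look up its stem in stem2stem.
    Fallback: replace the word by its English stem."""
    if word in direct2stem:
        return direct2stem[word]
    if stem in stem2stem:
        return stem2stem[stem]
    return english_stem(word)

# Toy dictionaries (invented entries) and a trivial fallback "stemmer":
direct2stem = {"perro": "dog"}
stem2stem = {"gat": "cat"}
fallback = lambda w: w.rstrip("s")

first = map_to_english("perro", "perr", direct2stem, stem2stem, fallback)
second = map_to_english("gatos", "gat", direct2stem, stem2stem, fallback)
third = map_to_english("internets", "internet", direct2stem, stem2stem, fallback)
```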
      <p>For every new English n-gram, the original offset and length in the non-English
document are registered, making it easy to locate the plagiarized source sections
precisely.</p>
      <p>
        On the crosslingual training subcorpus, the Plagdet score achieved by
CoReMo was 0.70176: a good result, bearing in mind that it is not biased by
the same Google Translator process being used in both the obfuscation and detection phases. The
lower score obtained in the test phase (0.2577) is due to the fact that only simulated
human-translated cases were used there; it is in line with expectations after last year's
report [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. The CoReMo mapped-translation process is, however, in its infancy, and
it is expected to improve in newer versions through several modifications and
better crosslingual stem dictionaries.
      </p>
    </sec>
    <sec id="sec-4">
      <title>5 Speed-up Methodology</title>
      <p>
        As one of CoReMo's main goals is obtaining reliable detection results at high
speed, the execution environment and programming techniques were chosen for
maximal computational efficiency from the early design:
• C++ 64-bit programming.
• GNU/Linux 64-bit OS on an ext4 file system.
• The internal sort of each n-gram's terms uses the bubble sort algorithm.
• The quicksort algorithm orders the n-grams of the modeled document.
• The n-gram comparison between both documents uses a modified mergesort
algorithm [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
• Local translation when crosslingual comparisons happen.
• When comparing a pair list ordered by suspicious document (the most usual case
after locating source-document candidates), the n-gram modeling and the inner
matching frequencies of the suspicious document are reused across
consecutive comparisons.
      </p>
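      <p>The mix of sorting algorithms follows the input sizes: the internal sort touches only the 3-4 stems inside one n-gram, a size at which bubble sort's simplicity beats asymptotically better algorithms (Python sketch; CoReMo's implementation is C/C++):</p>

```python
def bubble_sort_terms(terms):
    """Plain bubble sort: fine for the handful of stems inside one
    n-gram, where constant factors dominate asymptotic complexity."""
    t = list(terms)
    for end in range(len(t) - 1, 0, -1):   # shrink the unsorted prefix
        for k in range(end):
            if t[k] > t[k + 1]:
                t[k], t[k + 1] = t[k + 1], t[k]
    return t

ordered = bubble_sort_terms(["quick", "brown", "fox"])
```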
      <p>This enabled an average analysis time of 0.19 seconds per pair: 13.6 times
faster than the second-fastest system, and 31 times faster than the winner. However,
in this version no optimization was arranged to take advantage of the</p>
      <sec id="sec-4-1">
        <title>2 http://translate.google.com</title>
        <p>multicore features of current processors; this is expected to be included in the next
version.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6 Tuning Parameters and Evaluation</title>
      <p>The best parameter settings were obtained using the PAN-PC-2012 training
corpus. The training results (Plagdet 0.6754) are displayed and compared with those
achieved in the competition phase (0.6252) in table 2. For both cases, the
parameters were:
• chunk length: 4 n-grams (internally becomes 12 when using SCnGs).
• crosslingual chunk length: 47 n-grams (also 3 times bigger when using SCnGs).
• minimum monotony: 2 chunks (same for monolingual and crosslingual modes).</p>
    </sec>
    <sec id="sec-6">
      <title>7 Conclusions and Future Work</title>
      <p>CoReMo is currently the fastest detector, but it should be optimized to take
advantage of multi-core systems.</p>
      <p>
        The translated-subcorpus analysis achieved better results than last year's
(comparing human translation only) thanks to the new n-gram modeling, but it still
uses the same old dictionaries, and only about 50% of the words are translated.
Larger and better dictionaries [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] would benefit this local technique. Other local
translation methods could also be explored.
      </p>
      <p>Mixing this n-gram modeling with other NLP resources (WordNet synsets,
odd/even skip n-grams, ...) could improve detections under hard obfuscation
conditions.</p>
      <p>The detailed comparison method achieved better Plagdet performance on the same
corpus than the former method used in CoReMo. This suggests changing the
traditional full process for local source collections.</p>
      <p>
        Comparisons of the Plagdet progress with respect to PAN 2011 must be made with
caution: since prior source-document detection is no longer necessary, even a LEAP3
detector could directly obtain reasonable results, as shown in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] for the former Intrinsic
Detection task.
      </p>
      <p>Acknowledgments. To the PAN team, whose development aids, hard work and
encouragement have been crucial for our work, and to all the competing PAN teams,
whose effort and papers have always been a motivational challenge for us and a source
of new ideas to improve our detection system.</p>
      <p>
        3 LEAP = Labeling Everything As Plagiarized
Fig. 1. Plagdet/chunk_length comparison of CoReMo 1.6 using CT3G or SC3G,
with and without the Granularity Filter, on the PAN-PC-2011 English-only subcorpus [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
      </p>
      <p>[Fig. 1 plot data omitted: Plagdet (0 to 0.5) vs. chunk length (4 to 95) for the series SC3G+Gran. Filter, CT3G+Gran. Filter, SC3G and CT3G.]</p>
      <p>[Fig. 2: PAN-PC-2011 Crosslingual only]</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Rodríguez-Torrejón</surname>
            ,
            <given-names>D.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martín-Ramos</surname>
            ,
            <given-names>J.M.</given-names>
          </string-name>
          : “
          <article-title>Detección de plagio en documentos: sistema externo monolingüe de altas prestaciones basado en n-gramas contextuales” (Plagiarism Detection in Documents: High Performance Monolingual External Plagiarism Detector System Based on Contextual N-grams)</article-title>
          .
          <source>Procesamiento del Lenguaje Natural. N. 45</source>
          (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Rodríguez-Torrejón</surname>
            ,
            <given-names>D.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martín-Ramos</surname>
            ,
            <given-names>J.M.</given-names>
          </string-name>
          :
          <article-title>CoReMo System (Contextual Reference Monotony): A Fast, Low Cost and High Performance Plagiarism Analyzer System: Lab Report for PAN at CLEF 2010</article-title>
          . In Braschler M.,
          <string-name>
            <surname>Harman</surname>
            <given-names>D.</given-names>
          </string-name>
          , Pianta E., editors.
          <source>Notebook Papers of CLEF 2010 LABs and Workshops</source>
          ,
          <volume>22</volume>
          -
          <fpage>23</fpage>
          September, Padua, Italy,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Rodríguez-Torrejón</surname>
            ,
            <given-names>D.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martín-Ramos</surname>
            ,
            <given-names>J.M.</given-names>
          </string-name>
          :
          <article-title>Crosslingual CoReMo System: Notebook for PAN at CLEF 2011</article-title>
          . In [10].
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Rodríguez-Torrejón</surname>
            ,
            <given-names>D.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martín-Ramos</surname>
            ,
            <given-names>J.M.:</given-names>
          </string-name>
          <article-title>N-gramas de contexto cercano para mejorar la detección de plagio (Surrounding Context N-grams to Improve the Plagiarism Detection)</article-title>
          . In [11].
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Martin</given-names>
            <surname>Potthast</surname>
          </string-name>
          , Benno Stein, Alberto Barrón-Cedeño, and
          <string-name>
            <given-names>Paolo</given-names>
            <surname>Rosso</surname>
          </string-name>
          .
          <article-title>An Evaluation Framework for Plagiarism Detection</article-title>
          .
          <source>In 23rd International Conference on Computational Linguistics (COLING 10)</source>
          ,
          <year>August 2010</year>
          .
          Association for Computational Linguistics
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Chiara</given-names>
            <surname>Basile</surname>
          </string-name>
          , Dario Benedetto, Giampaolo Caglioti, and Mirko Degli Esposti.
          <year>2009</year>
          .
          <article-title>A Plagiarism Detection Procedure in Three Steps: Selection, Matches and Squares</article-title>
          .
          <source>In SEPLN 2009 Workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse (PAN 09)</source>
          , pages
          <fpage>19</fpage>
          -
          <lpage>23</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Rodríguez-Torrejón</surname>
            ,
            <given-names>D.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barrón-Cedeño</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sidorov</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martín-Ramos</surname>
            ,
            <given-names>J.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
          </string-name>
          , P.: “
          <article-title>Influencia del diccionario en la traducción para la detección de plagio translingüe”. (Dictionary Influence in Crosslingual Plagiarism Detection)</article-title>
          .
          <source>in [11]</source>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Rodríguez-Torrejón</surname>
            ,
            <given-names>D.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martín-Ramos</surname>
            ,
            <given-names>J.M.</given-names>
          </string-name>
          <article-title>: “LEAP: una referencia para la evaluación de sistemas de detección de plagio con enfoque intrínseco” (LEAP: a Baseline for Intrinsic Focusing Plagiarism Detectors)</article-title>
          .
          <source>In [11]</source>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eiselt</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barrón-Cedeño</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Overview of the 3rd International Competition on Plagiarism Detection</article-title>
          . In [10]
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>Vivien</given-names>
            <surname>Petras</surname>
          </string-name>
          and Paul Clough (Eds.):
          <source>Notebook Papers of CLEF 2011 Labs and Workshops</source>
          ,
          <volume>19</volume>
          -
          <fpage>22</fpage>
          September, Amsterdam, The Netherlands (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <source>II Congreso Español de Recuperación de Información (CERI 2012)</source>
          ,
          <fpage>17</fpage>
          -18 June, Valencia (
          <year>2012</year>
          ). http://users.dsic.upv.es/grupos/nle/ceri/index.html
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>