<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>External and Intrinsic Plagiarism Detection using a Cross-Lingual Retrieval and Segmentation System</article-title>
      </title-group>
      <contrib-group>
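        <contrib contrib-type="author">
          <name><surname>Muhr</surname><given-names>Markus</given-names></name>
        </contrib>
        <contrib contrib-type="author">
          <name><surname>Kern</surname><given-names>Roman</given-names></name>
        </contrib>
        <contrib contrib-type="author">
          <name><surname>Zechner</surname><given-names>Mario</given-names></name>
        </contrib>
        <contrib contrib-type="author">
          <name><surname>Granitzer</surname><given-names>Michael</given-names></name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Know-Center Graz</institution>
        </aff>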
      </contrib-group>
      <pub-date>
        <year>2010</year>
      </pub-date>
      <abstract>
        <p>We present our hybrid system for the PAN challenge at CLEF 2010. Our system performs plagiarism detection for translated and non-translated externally as well as intrinsically plagiarized document passages. Our external plagiarism detection approach is formulated as an information retrieval problem, using heuristic post-processing to arrive at the final detection results. For the retrieval step, source documents are split into overlapping blocks which are indexed via a Lucene instance. Suspicious documents are similarly split into consecutive overlapping boolean queries which are performed on the Lucene index to retrieve an initial set of potentially plagiarized passages. For performance reasons, queries might get rejected via a heuristic before actually being executed. Candidate hits gathered via the retrieval step are further post-processed by performing sequence analysis on the passages retrieved from the index with respect to the passages used for querying the index. By applying several merge heuristics, bigger blocks are formed from matching sequences. German and Spanish source documents are first translated using word alignment on the Europarl corpus before entering the above detection process. For each word in a translated document several translations are produced. Intrinsic plagiarism detection is done by finding major changes in style, measured via word suffixes, after the documents have been partitioned by a linear text segmentation algorithm. Our approach led us to the third overall rank with an overall score of 0.6948.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Plagiarism detection has gained increased interest in research as well as in industry [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] over the last couple of years. The PAN challenge accommodated this fact and provided
researchers with a basis to compare different approaches.
      </p>
      <p>
        We refrain from giving an introduction to plagiarism detection, as Grozea et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
gave an excellent overview of the matter in last year’s lab report. Instead we will
discuss the motivation for our approach to this year’s challenge.
      </p>
      <p>
        The first three ranked participants in the first PAN competition all used a
document-centric approach for external plagiarism detection as presented in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Our approach
in last year’s competition was based on a block-level comparison of source and
suspicious documents. Although our approach yielded acceptable results, it was clear that the
chosen block granularity, non-overlapping sentences, did not perform exceptionally
well. To identify similar suspicious and source blocks we used a simple cluster pruning
technique which, while easy to implement, also introduced several problems.
      </p>
      <p>To improve on our last approach we reformulated our problem solution slightly.
Instead of comparing non-overlapping blocks of sentences we used overlapping blocks of
tokens with fixed sizes. The cluster pruning technique was replaced by an open-source
document search engine called Lucene<sup>1</sup>. Source documents are first split into
overlapping blocks. Each block is then indexed by a Lucene instance. Suspicious documents
are similarly split into overlapping blocks which get transformed to boolean Lucene
queries. Each query results in a ranked list of potentially plagiarized source blocks.</p>
      <p>
        We also reworked our post-processing step. We adapted an approach based on
sequence analysis similar to dot-plot [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] with further heuristic merging and filtering steps
to increase the overall precision of the system.
      </p>
      <p>
        In last year’s competition none of the participants tried to solve the cross-lingual
plagiarism subtask. We decided to give it a try in this year’s challenge, building upon
techniques developed in the machine translation community. We performed word alignment
with the BerkeleyAligner software package [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] using the Europarl corpus [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] to provide
us with potential translations for each word in German or Spanish source documents.
      </p>
      <p>
        Intrinsic plagiarism detection [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] seems to be a much harder task, which is
supported by the fact that less related work is available. Last year’s competition supports
this notion, as only one competitor (Stamatatos [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]) could beat the baseline.
Previous approaches used stylometric or semi-stylometric features, such as character
n-grams, on sliding windows over the text to form a mean vector. A major deviation of
a certain block from this mean vector is expected to mark a style change, which is
interpreted as an author change and therefore as plagiarism. This year we tried to detect intrinsic
plagiarism by adapting the text segmentation algorithm from Kern et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] to segment
a document into stylometrically coherent segments, instead of topically
coherent segments, in order to identify plagiarism.
      </p>
      <p>To sum up, our system consists of the following basic ideas, which are outlined by a
flowchart in Figure 1:</p>
      <list list-type="bullet">
        <list-item>
          <p>The external task is interpreted as a retrieval task on a sub-document level</p>
        </list-item>
        <list-item>
          <p>Post-processing based on sequence analysis with merge and filter heuristics</p>
        </list-item>
        <list-item>
          <p>Translation on a word level by using word alignment of Europarl as translations</p>
        </list-item>
        <list-item>
          <p>Intrinsic plagiarism detection using text segmentation by detecting ad hoc flows in the text due to style changes</p>
        </list-item>
        <list-item>
          <p>Merging intrinsic and external plagiarism (intrinsic blocks are only taken if there are no external ones for a specific document)</p>
        </list-item>
      </list>
    </sec>
    <sec id="sec-2">
      <title>External Plagiarism Detection</title>
      <p>Our external plagiarism detection approach consists of two main steps. In the first step
we search for potentially matching suspicious document blocks within an inverted index
of overlapping source document blocks. In the second step we apply heuristic
post-processing on the potential matches to arrive at the final detection result. For
non-English source documents we have an additional pre-processing step to get translations
for each word in the Spanish or German source documents. Furthermore, the
post-processing step has to be slightly modified for translated plagiarism detection.</p>
      <p>[Figure 1: Flowchart of the system. Source documents: for each source document, translate words if it is not English, segment into overlapping blocks, add blocks to the block index. External detection: segment the suspicious document into small overlapping blocks, use block terms as queries and apply heuristics for fast retrieval, search the block index, and, if blocks are found, apply token-based sequence matching, merge neighboring sequences, and apply similarity and heuristics filtering. Intrinsic detection (if no external passages are detected): segment the document into coherent segments and filter on stylometric features. Output: detected passages.]</p>
      <p><sup>1</sup> http://lucene.apache.org/java/docs/index.html</p>
      <sec id="sec-2-1">
        <title>Translating Non-English Documents using Word Alignment</title>
        <p>
          To detect cross-lingual
plagiarism we build upon techniques developed in the field of machine translation.
Instead of applying a complete machine translation solution to translate whole documents
or sentences, we took the output of a word alignment algorithm. Such algorithms
try to find pairs of words that can serve as translation candidates, and they are a main
component of many state-of-the-art machine translation systems. Word
alignment algorithms operate on a set of documents that are aligned on the sentence level. We
used the Europarl aligned corpus (Release 5) [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. To calculate the aligned words we
employed the BerkeleyAligner<sup>2</sup> software package [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. The output of the word
alignment algorithm is a list of English translation candidates for the German and Spanish
words present in the Europarl corpus. For each source document that is not written in
English we replaced each word with up to 5 translation candidates. If no translation
candidate is available, the word is not replaced. After the words have been replaced, the
documents are treated like the English source documents.
        </p>
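        <p>The word replacement step described above can be sketched as follows. This is a minimal illustration and not our actual implementation: the function name <monospace>translate_tokens</monospace>, the lower-cased dictionary keys, the best-first ordering of the candidates and the toy table are assumptions made for the example; only the limit of 5 candidates and the rule to keep untranslatable words come from the text.</p>
        <preformat><![CDATA[
```python
from typing import Dict, List

MAX_CANDIDATES = 5  # "up to 5 translation candidates" per word, as in the text


def translate_tokens(tokens: List[str],
                     candidates: Dict[str, List[str]],
                     max_candidates: int = MAX_CANDIDATES) -> List[str]:
    """Replace every token by its translation candidates, if any are known."""
    translated: List[str] = []
    for token in tokens:
        # Assumption: the table is keyed by lower-cased source words and
        # lists the candidates best-first.
        options = candidates.get(token.lower(), [])[:max_candidates]
        if options:
            translated.extend(options)
        else:
            translated.append(token)  # no candidate: keep the word as it is
    return translated


# Hypothetical toy table, not real aligner output.
toy_table = {"haus": ["house", "home"], "grün": ["green"]}
print(translate_tokens(["Das", "Haus", "ist", "grün"], toy_table))
# prints ['Das', 'house', 'home', 'ist', 'green']
```
]]></preformat>
        <p>Because every source word may expand into several English words, a translated block can contain more tokens than the original one.</p>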
        <p><sup>2</sup> http://code.google.com/p/berkeleyaligner/</p>
      </sec>
      <sec id="sec-2-2">
        <title>Overlapping Source Blocks Indexing</title>
        <p>The translated as well as non-translated source
documents were each transformed into overlapping blocks of 40 tokens. For translated
passages, 40 refers to the number of words in the original text, so the translated passage may
contain more words because each word is replaced by up to five translations. These blocks are
indexed via a Lucene instance. Besides the text of each block we also stored additional
information such as the offset and length of each block in the source document as well
as the ID of the source document the block originated from.</p>
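        <p>A minimal sketch of the block splitting, under stated assumptions: the block size of 40 tokens is taken from the text, whereas the step of 20 tokens, the whitespace tokenization and the dictionary layout are hypothetical choices for illustration. The actual indexing in Lucene is not shown.</p>
        <preformat><![CDATA[
```python
import re
from typing import Dict, List


def split_into_blocks(doc_id: str, text: str,
                      block_size: int = 40, step: int = 20) -> List[Dict]:
    """Split a document into overlapping token blocks with character spans."""
    # Assumption: tokens are whitespace-separated; keep their character spans.
    tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    blocks: List[Dict] = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + block_size]
        offset = window[0][1]
        blocks.append({
            "doc_id": doc_id,                 # ID of the source document
            "offset": offset,                 # character offset of the block
            "length": window[-1][2] - offset, # character length of the block
            "terms": [term for term, _, _ in window],
        })
        if start + block_size >= len(tokens):
            break  # the last block already reaches the end of the document
    return blocks
```
]]></preformat>
        <p>Each returned dictionary holds the fields that would be stored with a block in the index: the terms, the offset and length in the source document, and the source document ID.</p>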
        <sec id="sec-2-2-1">
          <title>Heuristics to Limit Executed Suspicious Queries</title>
          <p>Similarly to the source documents,
the suspicious documents were tokenized and overlapping blocks of tokens were
transformed into boolean queries. As this results in a massive number of potential queries, we
applied several heuristics to limit the overall complexity of our approach while trying
to keep the recall of correct hits reasonably high.</p>
          <p>The first heuristic employed was the selection of a window step size of 6 and a
window size of 16 tokens for each block. We arrived at these settings after testing various
combinations on a small development corpus. Each query is only executed if at least
one of the tokens in it has a normalized document frequency below a given threshold
(0.004 for the evaluation corpus). Note that a "document" here is an indexed block rather than a whole source document.</p>
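          <p>The windowing and the rejection heuristic can be illustrated with the following sketch. The window size (16), the step size (6) and the threshold (0.004) are the values given above; the function names, the precomputed <monospace>doc_freq</monospace> table and the decision to ignore terms that do not occur in the index are assumptions of this example.</p>
          <preformat><![CDATA[
```python
from typing import Dict, Iterator, List

WINDOW_SIZE = 16   # tokens per query window
STEP_SIZE = 6      # tokens between the starts of consecutive windows
MAX_DF = 0.004     # normalized document frequency threshold


def query_windows(tokens: List[str], size: int = WINDOW_SIZE,
                  step: int = STEP_SIZE) -> Iterator[List[str]]:
    """Yield overlapping token windows of a suspicious document."""
    # A document shorter than one window yields a single, shorter window.
    for start in range(0, max(len(tokens) - size, 0) + 1, step):
        yield tokens[start:start + size]


def should_execute(window: List[str], doc_freq: Dict[str, float],
                   max_df: float = MAX_DF) -> bool:
    """Run a query only if it contains at least one sufficiently rare term."""
    # doc_freq maps a term to the fraction of indexed blocks containing it.
    # Assumption: terms missing from the index cannot match and are ignored.
    return any(0.0 < doc_freq.get(term, 0.0) < max_df for term in window)
```
]]></preformat>
          <p>For a suspicious document of 30 tokens this produces windows starting at tokens 0, 6 and 12; a window consisting only of frequent terms is skipped without querying the index.</p>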
          <p>The terms of the boolean query are first sorted by their corpus frequency in
increasing order. The first four terms of this sorted list must be included (AND query); the
other 12 terms (OR query) determine the final rank of a hit in the list of ranked
query results. This technique eases the workload of the search engine, as it
can prune many documents due to the restriction that the four least frequent terms must
be contained in every query result. We further prune the query result by only using blocks
which have a score above a certain threshold. After some testing on the development
set we arrived at a threshold of 8.0. By adjusting these parameters, one can
alter the trade-off between the number of potential hits (recall) and the number of query
invocations (runtime).</p>
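          <p>The query construction can be sketched as below. The sketch emits a string in the classic Lucene query syntax, where a leading <monospace>+</monospace> marks a required term; whether our system used the query parser or built the query programmatically is not stated here, so this is only one possible realization. The function names and the <monospace>corpus_freq</monospace> table are assumptions, and the tokens are assumed to be plain words that need no escaping.</p>
          <preformat><![CDATA[
```python
from typing import Dict, List, Tuple

REQUIRED_TERMS = 4   # least frequent terms that must occur (AND part)
MIN_SCORE = 8.0      # score threshold found on the development set


def build_boolean_query(window: List[str], corpus_freq: Dict[str, int],
                        required: int = REQUIRED_TERMS) -> str:
    """Build a query string with the rarest terms marked as required."""
    # Assumption: corpus_freq has an entry for every term of the window.
    # Ties are broken alphabetically to keep the result deterministic.
    terms = sorted(set(window), key=lambda term: (corpus_freq[term], term))
    must, should = terms[:required], terms[required:]
    return " ".join(["+" + term for term in must] + should)


def filter_hits(hits: List[Tuple[str, float]],
                min_score: float = MIN_SCORE) -> List[Tuple[str, float]]:
    """Keep only blocks whose retrieval score is above the threshold."""
    return [(block_id, score) for block_id, score in hits if score > min_score]


# Hypothetical frequencies for a shortened window.
freq = {"the": 1000, "of": 900, "a": 800, "plagiarism": 5,
        "detection": 7, "lucene": 2, "index": 20}
print(build_boolean_query(
    ["the", "plagiarism", "of", "detection", "lucene", "index", "a"], freq))
# prints +lucene +plagiarism +detection +index a of the
```
]]></preformat>
          <p>The required terms let the search engine discard most blocks early, while the optional terms only influence the ranking score that is later compared with the threshold.</p>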
          <p>Although translations have more words in common due to multiple translations for
each word, we did not see major changes for the results when using the same parameter
settings. Finally, we store the offset and length in the suspicious document for each
executed query as well as the source document ID with offset and length for each block
found for a query.</p>
        </sec>
        <sec id="sec-2-2-2">
          <title>Post-Processing using Word Sequence Analysis with Merge Heuristics and Similarity Filtering</title>
          <p>The retrieved potentially plagiarized blocks must now be refined and
filtered. The advantage of our approach is that the locations in the source and
suspicious documents are roughly known after the retrieval part, so that we can skip
detailed post-processing of whole document pairs.</p>
          <p>So far we have taken a suspicious document, split it into queries and
searched for potentially matching blocks for each query. As a first step we generate lists
of query-block pairs. Note that a single query can have multiple matching blocks and
thus generate several query-block pairs.</p>
          <p>Given a query-block pair we extend the text around the query in the suspicious
document as well as the text around the block in the source document by a number
of characters (2000 for the evaluation set). Given the offsets of both the query and the
source document block we can align these two extended text passages. The alignment
is given in the form of a token-by-token matrix on which we apply the sequence analysis. A
bigger window around the query-block pair leads to a higher runtime, but can detect the
passages more accurately. A sequence of tokens in both texts is a match if the sequence
is composed of at least 3 consecutive tokens and has a length of at least 10 characters.
As with other settings, we arrived at these values by experimenting on the development set. For
translated source documents the sequence analysis is slightly less strict to compensate
for the incomplete translation we used. In this case the minimum length of a sequence must
be 6, but gaps are allowed between consecutive words in the suspicious document.
Furthermore, the order of consecutive words in the suspicious document need not be the
same in the source document; a word only has to be found within a window of 10
tokens in the source document to count as a match.</p>
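          <p>The strict variant of the sequence analysis (for non-translated documents) can be sketched as a search for diagonal runs in the token-by-token matrix. This is an illustrative reimplementation, not our production code: the function name is hypothetical, the character length is measured on the tokens joined by single spaces (an assumption), and the relaxed variant for translated documents is not covered.</p>
          <preformat><![CDATA[
```python
from typing import List, Tuple


def matching_sequences(susp: List[str], src: List[str],
                       min_tokens: int = 3,
                       min_chars: int = 10) -> List[Tuple[int, int, int]]:
    """Return maximal runs of identical consecutive tokens.

    Each result is (start in susp, start in src, number of tokens).
    """
    n, m = len(susp), len(src)
    # run[i + 1][j + 1] = length of the diagonal run ending at susp[i], src[j]
    run = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(m):
            if susp[i] == src[j]:
                run[i + 1][j + 1] = run[i][j] + 1
    matches = []
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            length = run[i][j]
            # A run is maximal if the next cell on the diagonal is no match.
            is_end = i == n or j == m or run[i + 1][j + 1] == 0
            if is_end and length >= min_tokens:
                text = " ".join(susp[i - length:i])
                if len(text) >= min_chars:
                    matches.append((i - length, j - length, length))
    return matches


susp = "we present a hybrid system for detection".split()
src = "they present a hybrid system too".split()
print(matching_sequences(susp, src))  # prints [(1, 1, 4)]
```
]]></preformat>
          <p>The matrix is only built for the extended passages around a query-block pair, which keeps the quadratic cost of this step bounded by the chosen window size.</p>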
          <p>The result is a list of sequence matches grouped by the source document they arose
from. These matches are potentially small and we thus try to further merge them in a
follow-up step. First we inspect all sequence matches originating from the same source
document and eliminate those that are smaller than 50 characters and do not have other
sequence matches in their neighborhood (defined as the surrounding 3000 characters)
in said source document.</p>
          <p>Next we merge all sequence matches in the suspicious document that point to the
same source document. Sequences are again matched via a neighborhood criterion.
However, this time the neighborhoods in both the suspicious document and the
source document are considered. For sequence matches to be merged they have to be
within a 500 character neighborhood in the suspicious document and within a 3000
character neighborhood in the source document. This can result in asymmetrically sized
spans in the source and suspicious document. After this merging step we remove any
matches smaller than 200 characters.</p>
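          <p>A minimal sketch of the merge heuristic, assuming that all given matches point to the same source document. The neighborhood sizes (500 and 3000 characters) and the minimum size (200 characters) are the values from the text. How the neighborhood is measured is not specified above, so the sketch makes assumptions: it uses the gap between two spans, merges greedily in suspicious-document order, and applies the minimum size to the span in the suspicious document.</p>
          <preformat><![CDATA[
```python
from typing import List, Tuple

# (susp_start, susp_end, src_start, src_end), all character offsets
Span = Tuple[int, int, int, int]


def gap(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Distance between two intervals, 0 if they touch or overlap."""
    return max(0, max(a_start, b_start) - min(a_end, b_end))


def merge_matches(matches: List[Span], susp_gap: int = 500,
                  src_gap: int = 3000, min_len: int = 200) -> List[Span]:
    """Greedily merge neighboring matches, then drop small ones."""
    merged: List[Span] = []
    for match in sorted(matches):
        if merged:
            last = merged[-1]
            close_in_susp = gap(last[0], last[1], match[0], match[1]) <= susp_gap
            close_in_src = gap(last[2], last[3], match[2], match[3]) <= src_gap
            if close_in_susp and close_in_src:
                merged[-1] = (min(last[0], match[0]), max(last[1], match[1]),
                              min(last[2], match[2]), max(last[3], match[3]))
                continue
        merged.append(match)
    return [span for span in merged if span[1] - span[0] >= min_len]


matches = [(0, 100, 1000, 1100), (300, 450, 1500, 1650), (5000, 5050, 9000, 9050)]
print(merge_matches(matches))  # prints [(0, 450, 1000, 1650)]
```
]]></preformat>
          <p>In the example the first two matches are merged into one span of 450 characters in the suspicious document and 650 characters in the source document, which illustrates the asymmetric span sizes mentioned above; the isolated third match is removed because it is shorter than 200 characters.</p>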
          <p>The final list of merged sequence matches is filtered one more time by calculating
the Jaccard similarity between the suspicious document sequence text and the source
document sequence text. Sequence matches smaller than 5000 characters are eliminated
if their similarity is smaller than 0.55; bigger matches must have a minimum similarity
of 0.7 in order to be accepted.</p>
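          <p>The similarity filter can be written down compactly. The Jaccard similarity of two sets A and B is |A ∩ B| / |A ∪ B|. The thresholds (0.55 and 0.7) and the size limit (5000 characters) are the values from the text; computing the similarity on sets of lower-cased whitespace tokens and measuring the size on the suspicious passage are assumptions of this sketch.</p>
          <preformat><![CDATA[
```python
def jaccard(text_a: str, text_b: str) -> float:
    """Jaccard similarity of the token sets of two passages."""
    # Assumption: tokens are lower-cased and whitespace-separated.
    set_a, set_b = set(text_a.lower().split()), set(text_b.lower().split())
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def accept_match(susp_text: str, src_text: str) -> bool:
    """Apply the size-dependent similarity threshold to a merged match."""
    # Assumption: the size of a match is the length of the suspicious passage.
    threshold = 0.55 if len(susp_text) < 5000 else 0.7
    return jaccard(susp_text, src_text) >= threshold


print(accept_match("a b c d", "a b c e"))  # 3 shared of 5 distinct tokens
# prints True
```
]]></preformat>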
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Intrinsic Plagiarism Detection</title>
      <p>The main idea of the intrinsic plagiarism detection algorithm is to detect changes in
style within a document. The basis of our approach is a function that transforms a
sequence of tokens (words, punctuation, ...) into a set of features that should represent
the style of the author. Many different stylometric features have been proposed in the
past. Among them are feature transformation functions based on parts of words, for
example character n-grams. Other stylometric features are constructed by using just a
subset of words or tokens, for example pronouns. The usage and frequency of
punctuation marks have also been investigated as a proxy for specific writing styles.</p>
      <p>For our intrinsic plagiarism detection system we experimented with two different
stylometric feature functions:</p>
      <list list-type="bullet">
        <list-item>
          <p>Stop-words: This feature transformation is motivated by the intuition that
different authors tend to resort to different stop words to construct grammatically correct
sentences. For this feature transformation function to work, all words need to be
annotated as either function word or content word. This is accomplished by looking
up all words in a manually crafted stop word list. All words that have been
identified as stop words are added to the feature set and their frequency is additionally
recorded. The remaining content words are ignored by this feature transformation
function. As stop word list we took the stop word lists from the Snowball stemmer
project. These lists are available in a number of languages and also contain the set
of pronouns.</p>
        </list-item>
        <list-item>
          <p>Stem-Suffix: The last characters of words have already been used to identify
specific author styles. The motivation for this feature transformation is the assumption
that different authors may differ in their use of inflections. One possible approach
for this kind of function is to pick the last n characters of each word, where n is
usually set to 3. For our system we used a flexible number of suffix characters. The
suffix was determined by the number of characters a stemming algorithm would cut
off or replace (we utilized the Snowball stemmer for this task). Finally, this function
produces a set of word suffixes.</p>
        </list-item>
      </list>
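      <p>The stem-suffix feature can be illustrated as follows. Note loudly that <monospace>toy_stem</monospace> is a hypothetical stand-in with a handful of English rules so that the example is self-contained; it is not the Snowball stemmer used in our system. The suffix of a word is taken to be everything after the longest common prefix of the word and its stem, which covers both cut-off and replaced characters. Counting the suffixes (instead of collecting a plain set) is an assumption that makes the features usable for a frequency-based comparison.</p>
      <preformat><![CDATA[
```python
import os
from collections import Counter
from typing import Callable, Iterable


def toy_stem(word: str) -> str:
    """Hypothetical stand-in stemmer with a few English rules (NOT Snowball)."""
    for suffix, replacement in (("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)] + replacement
    return word


def stem_suffix(word: str, stem: Callable[[str], str] = toy_stem) -> str:
    """Return the characters the stemmer would cut off or replace."""
    common = len(os.path.commonprefix([word, stem(word)]))
    return word[common:]


def suffix_features(tokens: Iterable[str],
                    stem: Callable[[str], str] = toy_stem) -> Counter:
    """Count the non-empty stem suffixes of a token sequence."""
    suffixes = (stem_suffix(token.lower(), stem) for token in tokens)
    return Counter(suffix for suffix in suffixes if suffix)


print(stem_suffix("walking"))  # prints ing
print(stem_suffix("studies"))  # prints ies ("studies" -> "study")
```
]]></preformat>
      <p>Any stemmer with a word-to-stem function can be passed as the <monospace>stem</monospace> argument in place of the toy rules.</p>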
      <p>
        To detect areas within a document that are written in a different style we first created
a feature set out of the complete document, which can be seen as the centroid of the whole
document. This procedure assumes that the document is mostly written by a single
author and that the plagiarized sections do not cover the majority of the document. Next the
document is split into blocks. This step is needed because the set of features generated
by the stylometric transformation function tends to be sparse if applied on the sentence
level. Partitioning the document into blocks of equal size (number of sentences) would
be the most straightforward way to achieve this. For our system we have chosen to
make use of a linear text segmentation algorithm [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Instead of producing blocks of
equal size, the output of this algorithm is a list of topically coherent blocks of multiple
sentences. To identify changes in topics, all words are first filtered for stop-words
and then stemmed. For each of the identified blocks a stylometric feature representation
is generated. This set of features is then compared with the document-wide feature set
by calculating the cosine similarity. If the difference between the two feature sets exceeds
a certain threshold, the block is considered to be written by another author than the
majority of the document.
      </p>
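      <p>The comparison of each block with the document-wide feature set can be sketched as below. The sketch assumes that the features are frequency counts, so that the document centroid is the sum of the block counts, and that the "difference" is one minus the cosine similarity. The threshold value is not given in the text, so it is a parameter here and the value in the example is purely illustrative; the function names are hypothetical.</p>
      <preformat><![CDATA[
```python
import math
from collections import Counter
from typing import List


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def flag_blocks(block_features: List[Counter], threshold: float) -> List[int]:
    """Return the indices of blocks that deviate from the document style."""
    centroid: Counter = Counter()
    for features in block_features:
        centroid.update(features)  # document-wide feature counts
    return [index for index, features in enumerate(block_features)
            if 1.0 - cosine(features, centroid) > threshold]


# Hypothetical suffix counts for three segments of one document.
blocks = [Counter(ing=5, ed=3), Counter(ing=4, ed=4), Counter(ly=6)]
print(flag_blocks(blocks, threshold=0.3))  # prints [2]
```
]]></preformat>
      <p>In the example the third segment shares no suffix with the other two and is therefore the only one whose distance to the centroid exceeds the threshold.</p>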
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <p>This section starts with a short summary of the multiple parameters of our approach.
Afterwards, we present and discuss detailed results on the development and on the
evaluation corpus. We evaluate the quality of the retrieval step and provide
performance measures for different obfuscation levels (none, low, high), for translated and
non-translated plagiarism. Furthermore, we compare our performance on external
and intrinsic plagiarism.</p>
      <sec id="sec-4-1">
        <title>Parameter Settings</title>
        <p>
          Each step of our approach has multiple parameters. Since they have all
been mentioned in the detailed description of our method, we just summarize
them in Table 4.1. We have to admit that our approach has many parameters, but as a
matter of fact we did not really optimize them, with the exception of the merge and
filter parameters. The parameters for the index and search step affect only the trade-off
between precision and runtime, so a setting was used that gives a good precision with a
reasonable runtime.
        </p>
        <p>
          The development corpus was the same one as in the first PAN competition from 2009
(a detailed description can be found in [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]). Since our approach is not very fast, we
optimized it on a subset of 500 suspicious documents, so the following
performance measures are computed only for this subset. Note, however, that the
retrieval step still used the complete set of source documents.
        </p>
        <p>Table 4.2 shows the results of the retrieval step. The table lists
the number of all real plagiarism blocks in the data set and the ones which are hit by at
least one query-block pair. In other words, these are the blocks our post-processing
step can extract at best (an upper bound on the recall). These results show that we only
lose a considerable amount of the highly obfuscated blocks, whereas a very high percentage
(above 90 %) of the low- or non-obfuscated as well as the translated plagiarism has been hit
after the retrieval step. The difference for highly obfuscated plagiarism can be explained
by the fact that we do not handle, for example, synonyms, which account for a big part of
the changes in highly obfuscated plagiarism. The retrieval step delivered a total of
6461076 query-block pairs, of which 4614485 are correct (they partially overlap
with a real plagiarism block), so 71.42 % of the pairs are correct ones.</p>
        <p>Furthermore, Table 4.2 shows a detailed evaluation of our detected plagiarism after the
post-processing steps (sequence analysis, merge heuristics, similarity thresholding).
Surprisingly, our cross-lingual approach performs better than the plain English
one in terms of recall and precision, although with a worse granularity. However, a
closer look shows that this is mostly because of our rather poor performance on highly obfuscated
plagiarism. High obfuscation means a high level of paraphrasing, exchanging words by
synonyms, etc., which we do not deal with specifically. On the other hand, for low and
no obfuscation our results are better than those of the cross-lingual approach.</p>
        <p>In contrast to the development corpus, the evaluation corpus does not explicitly
separate external and intrinsic plagiarism, since in this year’s competition the winning system should
be a hybrid system that can handle both types of plagiarism. Nevertheless, the
following evaluation distinguishes between the different kinds of plagiarism, considered
as sub-tasks of the global problem. A detailed description of this corpus can be found
in the overview paper.</p>
        <p>Table 4.3 shows the results of the retrieval step for the evaluation corpus.
In contrast to the results on the development corpus, we hit most of the blocks even for
highly obfuscated plagiarism, and the difference to low- and non-obfuscated passages
is much less significant. The results for low and no obfuscation are also
higher compared to the development corpus. Surprisingly, the results for translated blocks are slightly
worse, but since we only used a subset of 500 documents of the development corpus, the number
of translated passages there is most likely smaller, so that the difference lies within
the limits of statistical variance. The retrieval step delivered a total of 17642292 query-block
pairs, of which 14431375 are correct (they partially overlap with a real
plagiarism block), so the ratio increased as well, from 71.42 % to 81.8 %. This shows that heuristics
like the ranking score seem to be reasonably good at achieving very high recall values with a
very good precision in this initial step.</p>
        <p>Furthermore, Table 4.3 shows a detailed evaluation of the final detected plagiarism blocks.
In contrast to the results on the development corpus, our cross-lingual
approach performs worse than the plain English plagiarism detection. This can be
explained by the fact that on the evaluation corpus the recall on highly obfuscated
plagiarism increased to 0.8122, compared to 0.4706 on the development corpus. For
low and no obfuscation the recall values are in similar ranges, so that the overall
result on non-translated plagiarism detection increased considerably. Again,
our performance on the translated blocks is especially poor
concerning granularity, so it seems that there are several holes in the detected
passages. As expected, our performance on the intrinsic plagiarism detection sub-task is
very poor and might have deteriorated our overall result more than expected.</p>
        <p>Despite the good results we have to admit that our current implementation is not
very fast. The whole process takes about a week. However, we did not optimize our
approach in any way. There are many possible ways to improve it. For
example, we could apply some pre-filtering on a document basis or try to distribute the
computation.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>We scored 3rd place in the overall ranking of all systems, with our recall values
being very close to those of the winning system. The precision of our system is still an area
to be improved in future iterations. We attribute the performance of our system with
respect to precision mainly to the poor results of the intrinsic plagiarism detection
system, which lowers the overall precision considerably. The post-processing step also
needs some more tuning, as evidenced by the poor granularity achieved.</p>
      <p>We plan on transforming our approach into a web service, seeded by the articles of
Wikipedia as a source corpus. Handling other types of plagiarism, such as stealth
approaches or missing citations, is also on our agenda. We would also like to increase the
scalability and performance of our system by employing distributed indices along with
document-level cluster pruning for large datasets.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>zu Eissen</surname>
            ,
            <given-names>S.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Intrinsic plagiarism detection</article-title>
          .
          <source>In: ECIR. Lecture Notes in Computer Science</source>
          , vol.
          <volume>3936</volume>
          , pp.
          <fpage>565</fpage>
          -
          <lpage>569</lpage>
          . Springer (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Grozea</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gehl</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Popescu</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>ENCOPLOT: Pairwise Sequence Matching in Linear Time Applied to Plagiarism Detection</article-title>
          .
          <source>In: 3rd PAN Workshop. Uncovering Plagiarism, Authorship and Social Software Misuse</source>
          . p.
          <fpage>10</fpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Kern</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Granitzer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Efficient linear text segmentation based on information retrieval techniques</article-title>
          .
          <source>In: MEDES '09</source>
          . pp.
          <fpage>167</fpage>
          -
          <lpage>171</lpage>
          . ACM (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Koehn</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Europarl: A parallel corpus for statistical machine translation</article-title>
          .
          <source>MT summit 5</source>
          ,
          <fpage>12</fpage>
          -
          <lpage>16</lpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Liang</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Taskar</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klein</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Alignment by agreement</article-title>
          .
          <source>In: Proceedings of the Human Language Technology Conference of the NAACL</source>
          . pp.
          <fpage>104</fpage>
          -
          <lpage>111</lpage>
          (
          <year>June 2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Lyon</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barrett</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Malcolm</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          :
          <article-title>A theoretical basis to the automated detection of copying between texts, and its practical implementation in the Ferret plagiarism and collusion detector</article-title>
          .
          <source>In: JISC (UK) Conference on Plagiarism: Prevention, Practice and Policies Conference</source>
          . pp.
          <fpage>10</fpage>
          -
          <lpage>18</lpage>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Maizel</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lenk</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Enhanced graphic matrix analysis of nucleic acid and protein sequences</article-title>
          .
          <source>In: Proceedings of the National Academy of Sciences</source>
          . pp.
          <fpage>7665</fpage>
          -
          <lpage>7669</lpage>
          (
          <year>1981</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Maurer</surname>
            ,
            <given-names>H.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kappe</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zaka</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Plagiarism - a survey</article-title>
          .
          <source>J. UCS</source>
          <volume>12</volume>
          (
          <issue>8</issue>
          ),
          <fpage>1050</fpage>
          -
          <lpage>1084</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eiselt</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barron-Cedeno</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Overview of the 1st International Competition on Plagiarism Detection</article-title>
          .
          <source>In: 3rd PAN Workshop. Uncovering Plagiarism, Authorship and Social Software Misuse</source>
          . pp.
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Intrinsic Plagiarism Detection Using Character n-gram Profiles</article-title>
          .
          <source>In: 3rd PAN Workshop. Uncovering Plagiarism, Authorship and Social Software Misuse</source>
          . pp.
          <fpage>38</fpage>
          -
          <lpage>46</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>