A plagiarism detection procedure in three steps: selection, matches and "squares"∗

Chiara Basile, Dip. Matematica, Univ. Bologna, P.zza di Porta S. Donato 5, 40126, Bologna, Italy, basile@dm.unibo.it
Dario Benedetto, Dip. Matematica, Sapienza, Univ. Roma, P.le Aldo Moro 2, 00185, Roma, Italy, benedetto@mat.uniroma1.it
Emanuele Caglioti, Dip. Matematica, Sapienza, Univ. Roma, P.le Aldo Moro 2, 00185, Roma, Italy, caglioti@mat.uniroma1.it
Giampaolo Cristadoro, Dip. Matematica, Univ. Bologna, P.zza di Porta S. Donato 5, 40126, Bologna, Italy, cristadoro@dm.unibo.it
Mirko Degli Esposti, Dip. Matematica, Univ. Bologna, P.zza di Porta S. Donato 5, 40126, Bologna, Italy, desposti@dm.unibo.it

∗ We thank the organizers of the PAN-09 competition for the stimulating challenges, and C. Cattuto and V. Loreto for bringing the contest to our attention.

Abstract: We present a detailed description of an algorithm tailored to detect external plagiarism in the PAN-09 competition. The algorithm is divided into three steps: a first reduction of the size of the problem, in which ten candidate source documents are selected for each suspicious document by means of an n-gram distance on suitably recoded texts; a search for matches after a T9-like recoding; and a "joining algorithm" that merges the selected matches and is able to detect obfuscated plagiarism. The results are briefly discussed.

Keywords: n-grams, plagiarism, coding, string matching

1 Introduction

In this short paper we describe our approach to automatic plagiarism detection. In particular, we discuss how we were able to borrow and adapt techniques and ideas recently developed by some of us for different, but related, problems such as Authorship Attribution (AA) (Benedetto, Caglioti, and Loreto, 2002; Basile et al., 2008). While some of the authors have gained a certain expertise in the field of AA over the last years, this is the first time we face plagiarism detection. The algorithm we describe has been tailored to the "1st International Competition on Plagiarism Detection" (Stein et al., 2009) and does not pretend to be optimal for generic situations. Indeed, we joined the competition while being completely unaware of the relevant literature, and thus the main aim of this paper is to contribute to a detailed comparison of the different approaches to the contest, which, we hope, will enlarge the applicability of the methods and foster a deeper integration and combination of different ideas.

2 The problem and the datasets

The contest was divided into two different challenges: external and internal plagiarism. We concentrated on external plagiarism only. For the sake of completeness we recall the main goal (see (PAN-09-Organizers, 2009) for more details): given a set of suspicious documents and a set of source documents, the task is to find all text passages in the suspicious documents which have been plagiarized and the corresponding text passages in the source documents.

The organizers provided a training corpus, composed of 7214 source documents and 7214 suspicious documents, each with an associated XML file containing the information about the plagiarized passages. A first statistical analysis shows that text lengths vary from a few hundred to 2.5 million characters, while the total number of plagiarized passages is 37046. Moreover, exactly half of the suspicious texts contain no plagiarism, and about 25% of the source documents are not used to plagiarize any suspicious document. The length of the plagiarized passages has a very peculiar distribution, see Figure 1: there are no passages with length in the window 6000-12000 characters, and even for long texts there is no plagiarism longer than 30000 characters. A remarkable fact is that about 13% of the plagiarized passages consist of translated plagiarism.

[Figure 1: Distribution of plagiarized passage lengths in the training corpus (percentage of plagiarized passages, logarithmic scale, versus length in characters).]

Similarly, the competition corpus is composed of 7215 source documents and 7214 suspicious documents (obviously without any associated XML files). The length statistics are very close to those of the training corpus, see Figure 2.

[Figure 2: Text length distribution for the training corpus and for the competition corpus (percentage of texts, logarithmic scale, versus text length in characters, for source and suspicious texts).]

The overall score is calculated as the ratio between F-measure and granularity over the whole set of detected characters (see (Stein et al., 2009) for more details).

3 The method

3.1 A first selection

For each suspicious document, we first identify a small subset of source documents for a second, deeper (and computationally demanding) analysis. We start by calculating a suitable distance between each suspicious and each source text. Then, for each suspicious text, the 10 source neighbors closest according to this distance are kept for further analysis.

In order to bring the computational time to a practical scale, the texts were first transformed into sequences of word lengths, so that for example the sentence "To be, or not to be: that is the question" becomes simply 2223224238. All word lengths greater than 9 were cut to 9, so that the new alphabet consists of the nine symbols {1, . . . , 9}. These coded versions of the texts are on average 82.5% shorter than the original ones, and so computation times are greatly reduced.
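As an illustration, the following is a minimal Python sketch of this word-length coding; the exact tokenization used by the authors is not specified, so here a "word" is simply assumed to be a maximal run of alphabetic characters.

import re

def word_length_coding(text):
    # Recode a text as the sequence of its word lengths, capped at 9.
    # Assumption: a word is a maximal run of alphabetic characters.
    words = re.findall(r"[A-Za-z]+", text)
    return "".join(str(min(len(w), 9)) for w in words)

# Example from the text:
print(word_length_coding("To be, or not to be: that is the question"))  # 2223224238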
The distance between each suspicious and each source text has been computed by comparing in a suitable way the frequency vectors of the 8-grams of the coded versions. The n-gram distance used here has already proved successful in Authorship Attribution problems (Basile et al., 2008). To be more precise, after a first experiment based on bigram frequencies presented in 1976 by Bennett (Bennett, 1976), Kešelj et al. published in 2003 a paper in which n-gram frequencies were used to define a similarity distance between texts (Kešelj et al., 2003). The distance introduced in (Basile et al., 2008) and used here should be considered a natural development of the previous ones: once the value of n has been fixed, usually between 4 and 8, n-gram frequencies are calculated for a given text x. If we denote by ω an arbitrary n-gram, by f_x(ω) the relative frequency with which ω appears in the text x, and by D_n(x) the so-called n-gram dictionary of x, that is, the set of all n-grams which have nonzero frequency in x, then for any pair of texts x and y we can define:

d_n(x, y) := \frac{1}{|D_n(x)| + |D_n(y)|} \sum_{\omega \in D_n(x) \cup D_n(y)} \frac{|f_y(\omega) - f_x(\omega)|}{f_y(\omega) + f_x(\omega)}

This is exactly the distance that has been used in (Basile et al., 2008), together with a suitable voting procedure, for approaching a two-class classification problem.
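The following is a minimal sketch of how d_n can be computed from two coded texts; the function names and the use of Python dictionaries for the frequency vectors are ours, but the quantity computed is the distance defined above.

from collections import Counter

def ngram_freqs(coded_text, n=8):
    # Relative frequencies of the character n-grams of a (coded) text.
    grams = Counter(coded_text[i:i + n] for i in range(len(coded_text) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.items()}

def ngram_distance(fx, fy):
    # The distance d_n(x, y) defined above; fx and fy hold the n-gram
    # dictionaries D_n(x) and D_n(y) with their relative frequencies.
    norm = len(fx) + len(fy)
    dictionary = set(fx) | set(fy)
    return sum(abs(fy.get(w, 0.0) - fx.get(w, 0.0)) /
               (fy.get(w, 0.0) + fx.get(w, 0.0)) for w in dictionary) / norm

# For each suspicious text, the 10 source texts with the smallest distance
# are kept for the second phase, e.g.:
# d = ngram_distance(ngram_freqs(word_length_coding(susp)),
#                    ngram_freqs(word_length_coding(src)))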
The parameter n = 8 for the n-gram distance was chosen here as a compromise between a good recall (the fraction of plagiarized characters coming from the first 10 nearest neighbors of each suspicious text is 81%) and acceptable computation times (about 2.3 days for the whole corpus). Since it was impossible to do repeated computations on the whole corpus, the optimization was done using a small subset of 160 suspicious texts and 300 source texts, suitably selected by imposing total and plagiarism length distributions comparable to those of the whole corpus.

Note that a recall of 81% is a very good result for the 8-gram distance, since the method just described has basically no hope of recognizing the 13% of translated plagiarism. Therefore, at least on the small subset of the training corpus, only about 6% of the (non-translated) plagiarized passages are lost in this phase.

3.2 T9 and matches

After identifying the 10 "most probable plagiarist sources" for each suspicious document, we perform a more detailed analysis to detect the plagiarized passages. The idea is to look for common subsequences (matches) longer than a fixed threshold. To this goal we need to recover some of the information lost in the first phase, by first rewriting the texts in the original alphabet and then using a different (and less "lossy") coding. We perform a T9-like coding: this system is typically used for assisted writing on most mobile phones. The idea is to translate three or four different letters into the same character, for example {a,b,c} → 2, {d,e,f} → 3 and so on. The symbol 0 is used for newlines and blank spaces, and 1 for all symbols other than these two and the letters of the alphabet. The new alphabet for the coded texts is therefore made up of 10 symbols: {0, 1, 2, . . . , 9}. Note that the use of the T9 "compression", which could seem strange at first sight, can be justified by observing that a "long" T9 sequence (10-15 characters) has in most cases an "almost unique" translation into a sentence which makes sense in the original language.
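A minimal sketch of such a coding follows; the text only spells out the first two groups ({a,b,c} → 2 and {d,e,f} → 3), so the grouping of the remaining letters is assumed here to follow the standard phone keypad.

def t9_coding(text):
    # T9-like recoding on a 10-symbol alphabet: 0 for blanks and newlines,
    # 1 for any other non-letter symbol, 2-9 for groups of letters.
    # Assumption: standard keypad grouping abc/def/ghi/jkl/mno/pqrs/tuv/wxyz.
    groups = ["abc", "def", "ghi", "jkl", "mno", "pqrs", "tuv", "wxyz"]
    table = {ch: str(d) for d, letters in enumerate(groups, start=2) for ch in letters}
    coded = []
    for ch in text.lower():
        if ch in " \n":
            coded.append("0")
        else:
            coded.append(table.get(ch, "1"))
    return "".join(coded)

# Example:
print(t9_coding("To be, or not to be"))  # 8602310670668086023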
The "true" matches between a suspicious and a source document are now found: from any possible starting position in the suspicious document, the longest match in the source document is calculated (possibly more than one with the same length). If the length is larger than a fixed threshold and the match is not a submatch of a previously detected one, it is stored in a list. Here we take advantage of the choice of the T9 encoding, which uses ten symbols: for any starting position in the source document the algorithm stores the last previous position of the same string of length 7, and for any possible string of length 7 the last occurrence in the source file is memorized in a vector of size 10^7. With respect to other methods (suffix trees or sorting, for instance), in this way we can search for the maximum match in the source document while avoiding the comparison of the shorter ones.

Running this part of the algorithm on a common PC for the whole corpus of 7214 texts took about 40 hours. The threshold for the match length was fixed, somewhat arbitrarily, to 15 for texts shorter than 500000 characters and to 25 for longer texts.
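The sketch below illustrates this match search on the T9-coded texts. It uses a Python dictionary of length-7 substrings in place of the 10^7 vector, and a simplified submatch filter, so it should be read as our reconstruction of the procedure described above rather than the authors' exact implementation.

def find_matches(susp, src, min_len=15, k=7):
    # Chain together, for every position of the source, the previous
    # occurrence of the same length-k string; then, for every position of
    # the suspicious text, walk the chain and keep the longest extension.
    last = {}                      # last occurrence of each k-gram seen so far
    prev = [-1] * len(src)         # previous occurrence of the k-gram starting here
    for j in range(len(src) - k + 1):
        gram = src[j:j + k]
        prev[j] = last.get(gram, -1)
        last[gram] = j

    matches, covered_until = [], 0
    for i in range(len(susp) - k + 1):
        j = last.get(susp[i:i + k], -1)
        best_len, best_j = 0, -1
        while j >= 0:              # all source positions sharing this k-gram
            l = k
            while i + l < len(susp) and j + l < len(src) and susp[i + l] == src[j + l]:
                l += 1
            if l > best_len:
                best_len, best_j = l, j
            j = prev[j]
        # keep sufficiently long matches only, discarding submatches of the last one
        if best_len >= min_len and i + best_len > covered_until:
            matches.append((i, best_j, best_len))
            covered_until = i + best_len
    return matches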
3.3 Looking for "squares"

The previous phase gives a long list of matches for each suspicious-source pair of documents. Since the plagiarized passages had undergone various levels of obfuscation, typically the matches are "close" to each other in the suspicious text but are taken from not necessarily subsequent places in the source text. By representing the pair of texts in a bidimensional plot, with the suspicious text on the x axis and the source text on the y axis, each match of length l, starting at x in the suspicious document and at y in the source document, draws a line from (x, y) to (x + l, y + l). The result is often something similar to Figure 3: there are some random (short) matches all around the plane, but there are places where matches accumulate, forming lines or something similar to a square. Non-obfuscated plagiarism corresponds to lines, i.e. a single long match or many short matches in close succession both in the suspicious and in the source text, whereas, intuitively, obfuscated plagiarism corresponds to "squares": here the matching sequences appear in a different order in the source and in the suspicious document.

[Figure 3: suspiciousdocument00814.txt vs. sourcedocument04005.txt. Orange points correspond to the positions of matching chars (measured in number of chars from the beginning) between a suspicious and a source document of the training corpus. A "square" of matches corresponds to an obfuscated plagiarism.]

Figure 4 is an example of what can happen when there is no plagiarism: matches are uniformly spread around the plane and do not accumulate anywhere. Obviously these are just two of the many possible settings: longer texts or the presence of "noise" (e.g. long sequences of blanks, tables of numbers, ...) can give rise to a much higher density of matches, substantially increasing the difficulty of identifying the correct plagiarized passages.

[Figure 4: suspiciousdocument00814.txt vs. sourcedocument03464.txt. Orange points correspond to the positions of matching chars (measured in number of chars from the beginning) between a suspicious and a source document of the training corpus. No plagiarism is present in this case.]

In order to provide a single quadruple (x, y, lx, ly) of starting points and lengths for each detected plagiarized passage, we need an algorithm that joins the "cloud" of matches of each "square". Note that the algorithm performing this task needs to be scalable with the plagiarism lengths, which can vary from a few hundred up to tens of thousands of characters.

The algorithm used here joins two matches if the following conditions hold simultaneously (a sketch of this step is given at the end of this section):

1. the matches are subsequent in the x coordinate;

2. the distance between the projections of the matches on the x axis is greater than or equal to zero (no superimposed plagiarism) but smaller than or equal to the lx of the longer of the two sequences, scaled by a certain δx;

3. the distance between the projections of the matches on the y axis is greater than or equal to zero but smaller than or equal to the ly of the longer of the two sequences, scaled by a certain δy.

Merging now repeatedly the segments which are superimposed either in x or in y, we obtain some quadruples which correspond roughly to the "diagonals" of the "squares" in Figure 3. We finally run the algorithm once again, using smaller parameters δx and δy, in order to reduce the granularity from 2 to approximately the optimal value of 1. Figure 5 shows the result of the algorithm for the pair of texts of Figure 3 (blue), and Figure 6 shows the very good superimposition with the actual plagiarized passages (black), as derived from the XML file.

[Figure 5: suspiciousdocument00814.txt vs. sourcedocument04005.txt. Detected plagiarism for the pair of texts of the training corpus indicated at the top of the plot. Single matches in orange, joined matches in blue.]

[Figure 6: suspiciousdocument00814.txt vs. sourcedocument04005.txt. Plagiarized passages for the pair of texts of the training corpus indicated at the top of the plot. Single matches in orange, actual plagiarism in black. Note the perfect superimposition between the blue lines in Figure 5 and the black lines here.]

The procedure just described has been developed and tuned for the competition on a restricted time schedule, but it would be quite interesting to compare its efficiency with what can be achieved by standard clustering algorithms. We plan to do this in the future.
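Putting the three conditions above and the repeated merging together, the following is a minimal sketch of one joining pass over the match list of a suspicious-source pair; the handling of superimposed segments and the ordering of the operations are our own reading of the description, not the original Mathematica code.

def join_matches(matches, delta_x=3.0, delta_y=3.0):
    # One pass of the "joining algorithm". Each match is a quadruple
    # (x, y, lx, ly); raw matches from the previous phase have lx == ly.
    joined = sorted(matches)           # subsequent in the x coordinate
    changed = True
    while changed:
        changed, out = False, []
        for m in joined:
            if not out:
                out.append(m)
                continue
            x0, y0, lx0, ly0 = out[-1]
            x1, y1, lx1, ly1 = m
            gap_x = x1 - (x0 + lx0)    # gap between the x projections
            gap_y = y1 - (y0 + ly0)    # gap between the y projections
            if (0 <= gap_x <= delta_x * max(lx0, lx1) and
                    0 <= gap_y <= delta_y * max(ly0, ly1)):
                # replace the two matches with their bounding quadruple
                out[-1] = (x0, y0, x1 + lx1 - x0, y1 + ly1 - y0)
                changed = True
            else:
                out.append(m)
        joined = out
    return joined

# The detection runs this step twice, first with the larger deltas and then,
# on the result, with the smaller ones (see Section 3.4 for the values used).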
3.4 Tuning the parameters

The "joining algorithm" described above depends on four parameters: δx and δy for the first joining phase, and the rescaled δx and δy for the second joining phase. Our choice of the actual values has been dictated essentially by the lack of time, and no rigorous and efficient optimization has been performed. Driven by very few trials and some heuristics, we decided to use the following values: δx = δy = 3 for the first phase and δx = δy = 0.5 for the second.

It is important to remark that different choices of the δ values yield different detection results. For example, increasing their values typically results in a larger recall and a better granularity, but also in a smaller precision. A further analysis of these dependencies could provide (in future developments) a controllable way of trading off precision, recall and granularity, depending on the plagiarism detection task under consideration. A promising strategy that we plan to explore in the future consists in a dynamical tuning of these parameters according, for example, to the density of matches or to the lengths of the two texts under consideration.

The match-joining algorithm was developed using Mathematica 7, and it ran on a common PC for about 20 hours on the whole data set of the competition.

4 Results and comments

The algorithm described gave the following results on the competition corpus (Stein et al., 2009):

• Precision: 0.6727
• Recall: 0.6272
• F-measure: 0.6491
• Granularity: 1.0745
• Overall score: 0.6041

The overall score is the third best result, after the 0.6957 and 0.6093 of the first two participants. We stress that the overall score drops considerably starting from the fourth position (0.3045), the fifth (0.1885), and so on. Moreover, while the winner has better results in precision, recall and granularity alike, our precision and recall are better than those of the second-placed participant, while our granularity is worse.

The competition had a very tight schedule, therefore many improvements are possible. In particular, it may be that the match-joining problem can be advantageously formulated in the framework of Hidden Markov Models. Also, it would be worth seeing how standard clustering algorithms perform in this case. We are eager to compare techniques and ideas with the other participants of the contest.

References

Basile, C., D. Benedetto, E. Caglioti, and M. Degli Esposti. 2008. An example of mathematical authorship attribution. Journal of Mathematical Physics, 49:125211-125230.

Benedetto, D., E. Caglioti, and V. Loreto. 2002. Language Trees and Zipping. Physical Review Letters, 88(4):048702-1-048702-4.

Bennett, W.R. 1976. Prentice-Hall, Englewood Cliffs, NJ.

PAN-09-Organizers. 2009. Proceedings of PAN-09.

Stein, B., M. Potthast, A. Eiselt, P. Rosso, and A. Barrón-Cedeño. 2009. http://www.webis.de/pan-09/competition.php.

Kešelj, V., F. Peng, N. Cercone, and C. Thomas. 2003. N-gram-based author profiles for authorship attribution.