=Paper=
{{Paper
|id=Vol-502/paper-3
|storemode=property
|title=A Plagiarism Detection Procedure in Three Steps: Selection, Matches and "Squares"
|pdfUrl=https://ceur-ws.org/Vol-502/paper3.pdf
|volume=Vol-502
}}
==A Plagiarism Detection Procedure in Three Steps: Selection, Matches and "Squares"==
A plagiarism detection procedure in three steps: selection, matches and "squares"*

Chiara Basile (Dip. Matematica, Univ. Bologna, P.zza di Porta S. Donato 5, 40126 Bologna, Italy; basile@dm.unibo.it)
Dario Benedetto (Dip. Matematica, Sapienza Univ. Roma, P.le Aldo Moro 2, 00185 Roma, Italy; benedetto@mat.uniroma1.it)
Emanuele Caglioti (Dip. Matematica, Sapienza Univ. Roma, P.le Aldo Moro 2, 00185 Roma, Italy; caglioti@mat.uniroma1.it)
Giampaolo Cristadoro (Dip. Matematica, Univ. Bologna, P.zza di Porta S. Donato 5, 40126 Bologna, Italy; cristadoro@dm.unibo.it)
Mirko Degli Esposti (Dip. Matematica, Univ. Bologna, P.zza di Porta S. Donato 5, 40126 Bologna, Italy; desposti@dm.unibo.it)

* We thank the organizers of the PAN-09 competition for the stimulating challenges and C. Cattuto and V. Loreto for bringing the contest to our attention.
Abstract: We present a detailed description of an algorithm tailored to detect external plagiarism in the PAN-09 competition. The algorithm is divided into three steps: a first reduction of the size of the problem, in which the ten most suspicious source documents for each suspicious document are selected using an n-gram distance on properly recoded texts; a search for matches after a T9-like recoding; and a "joining algorithm" that merges the selected matches and is able to detect obfuscated plagiarism. The results are briefly discussed.

Keywords: n-grams, plagiarism, coding, string matching
1 Introduction

In this short paper we aim to describe our approach to automatic plagiarism detection. In particular, we discuss how we were able to borrow and adapt some techniques and ideas recently developed by some of us for different, but related, problems such as Authorship Attribution (AA) (Benedetto, Caglioti, and Loreto, 2002; Basile et al., 2008). While some of the authors have gained over the last years a certain expertise in the field of AA, this is the first time we face plagiarism detection. The algorithm we are going to describe has been tailored to the "1st International Competition on Plagiarism Detection" (Stein et al., 2009) and does not pretend to be optimal for generic situations. Indeed, we joined the competition while being completely unaware of the relevant literature, and thus the main aim of this paper is to contribute to a detailed comparison of the different approaches to the contest which, we hope, will enlarge the applicability of the methods and generate a profound integration and combination of different ideas.

2 The problem and the datasets

The contest was divided into two different challenges: external and internal plagiarism. We concentrated on the external plagiarism only. For the sake of completeness we recall the main goal (see (PAN-09-Organizers, 2009) for more details): given a set of suspicious documents and a set of source documents, the task is to find all text passages in the suspicious documents which have been plagiarized and the corresponding text passages in the source documents.

The organizers provided a training corpus, composed of 7214 source documents and 7214 suspicious documents, each with an associated XML file containing the information about the plagiarized passages. A first statistical analysis shows that text lengths vary from a few hundred to 2.5 million characters, while the total number of plagiarized passages is 37046. Moreover, exactly half of the suspicious texts contain no plagiarism, and about 25% of the source documents are not used to plagiarize any suspicious document. The length of the plagiarized passages has a very peculiar distribution, see Figure 1: there are no passages with length in the window 6000-12000 characters, and even for long texts there is no plagiarism longer than 30000 characters. A remarkable fact is that about 13% of the plagiarized passages consist of translated plagiarism.

Figure 1: Distribution of plagiarized passage lengths in the training corpus (percentage of plagiarized passages vs. length in characters, logarithmic scale).

Similarly, the competition corpus is composed of 7215 source documents and 7214 suspicious documents (obviously without any associated XML files). The length statistics are very close to those for the training corpus, see Figure 2.

Figure 2: Text length distribution for the training corpus and for the competition corpus (percentage of texts vs. text length in characters, logarithmic scale; separate curves for source and suspicious texts).

The overall score is then calculated as the ratio between F-measure and granularity over the whole set of detected chars (see (Stein et al., 2009) for more details).
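To make the combination concrete, the display below spells it out, assuming the standard harmonic-mean form of the F-measure from precision P and recall R; the precise character-level definitions of P, R and granularity g are those of (Stein et al., 2009), not reproduced here.

% Sketch of the score combination described above (our notation; the official
% character-level definitions of P, R and g are in Stein et al., 2009):
\[
  F = \frac{2\,P\,R}{P + R},
  \qquad
  \text{overall score} = \frac{F}{g}.
\]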
3 The method

3.1 A first selection

For each suspicious document, we first identify a small subset of source documents for a second and deeper (computationally demanding) analysis.

We start by calculating a suitable distance between each suspicious and source text. Then, for each suspicious text, the first 10 source neighbors ordered according to this distance are kept for further analysis.

In order to bring computational time to a practical scale, the texts were first transformed into sequences of word lengths, so that for example the sentence "To be, or not to be: that is the question" becomes simply 2223224238. All word lengths greater than 9 were cut to 9, so that the new alphabet consists of the nine symbols {1, . . . , 9}. These coded versions of the texts are on average 82.5% shorter than the original ones, and so computation times are greatly reduced.
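As an illustration of this recoding step, the following minimal Python sketch (ours, the paper does not show its code) maps a text to its word-length sequence, capping lengths at 9 as described above; the tokenization into maximal alphanumeric runs is our assumption.

import re

def word_length_code(text: str) -> str:
    """Recode a text as the sequence of its word lengths, capped at 9.

    Words are taken here to be maximal runs of alphanumeric characters;
    this tokenization is an assumption, since the paper does not specify it.
    """
    words = re.findall(r"\w+", text)
    return "".join(str(min(len(w), 9)) for w in words)

print(word_length_code("To be, or not to be: that is the question"))  # -> 2223224238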
The distance between each suspicious and source text has been computed by comparing in a suitable way the frequency vectors of the 8-grams of the coded versions. The n-gram distance used here has proved successful in Authorship Attribution problems (Basile et al., 2008). To be more precise, after a first experiment based on bigram frequencies presented in 1976 by Bennett (Bennett, 1976), Kešelj et al. published in 2003 a paper in which n-gram frequencies were used to define a similarity distance between texts (Kešelj et al., 2003). The distance introduced in (Basile et al., 2008) and used here should be considered a natural development of the previous ones: once the value of n has been fixed, usually between 4 and 8, n-gram frequencies are calculated for a given text x. If we denote by ω an arbitrary n-gram, by f_x(ω) the relative frequency with which ω appears in the text x, and by D_n(x) the so-called n-gram dictionary of x, that is, the set of all n-grams which have nonzero frequency in x, then for any pair of texts x and y we can define:

\[
  d_n(x, y) := \frac{1}{|D_n(x)| + |D_n(y)|}
  \sum_{\omega \in D_n(x) \cup D_n(y)}
  \frac{|f_y(\omega) - f_x(\omega)|}{f_y(\omega) + f_x(\omega)}
\]

This is exactly the distance that has been used in (Basile et al., 2008), together with a suitable voting procedure, for approaching a two-class classification problem.
The parameter n = 8 for the n-gram distance was chosen here as a compromise between a good recall (the fraction of plagiarized characters coming from the first 10 nearest neighbors of each suspicious text is 81%) and acceptable computation times (about 2.3 days for the whole corpus). Since it was impossible to do repeated computations on the whole corpus, the optimization was done using a small subset of 160 suspicious texts and 300 source texts, suitably selected by imposing total and plagiarism length distributions comparable to those of the whole corpus.

Note that a recall of 81% is a very good result for the 8-gram distance, since the method just described has basically no hope of recognizing the 13% of translated plagiarism.
Therefore, at least on the small subset of the training corpus, only about 6% of the (non-translated) plagiarized passages are lost in this phase.
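A compact Python sketch of this selection step is given below; the function names and the brute-force all-pairs loop are ours (the paper does not show its code), and a real run would of course use the word-length recoding above and has to cope with the roughly 2.3 days of computation mentioned earlier.

from collections import Counter

def ngram_freqs(coded_text: str, n: int = 8) -> dict:
    """Relative frequencies of the n-grams of a (word-length coded) text."""
    counts = Counter(coded_text[i:i + n] for i in range(len(coded_text) - n + 1))
    total = sum(counts.values())
    return {ngram: c / total for ngram, c in counts.items()}

def ngram_distance(fx: dict, fy: dict) -> float:
    """The n-gram distance d_n(x, y) defined above (dictionaries of frequencies)."""
    union = set(fx) | set(fy)
    s = sum(abs(fy.get(w, 0.0) - fx.get(w, 0.0)) / (fy.get(w, 0.0) + fx.get(w, 0.0))
            for w in union)
    return s / (len(fx) + len(fy))

def ten_nearest_sources(suspicious: dict, sources: dict, k: int = 10) -> list:
    """Keep, for one suspicious text, the k source texts closest in d_8."""
    dists = [(ngram_distance(suspicious, f), name) for name, f in sources.items()]
    return [name for _, name in sorted(dists)[:k]]

Here `sources` would map each source document name to its precomputed frequency dictionary, so that the frequency vectors are built once and reused across all suspicious documents.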
3.2 T9 and matches

After identifying the 10 "most probable plagiarist sources" we now perform a more detailed analysis to detect the plagiarized passages. The idea is to look for common subsequences (matches) longer than a fixed threshold. To this goal we need to recover some of the information lost in the first recoding step, by first rewriting the text in the original alphabet and then using a different (and less "lossy") coding. We perform a T9-like coding: this system is typically used for assisted writing on most mobile phones. The idea is to translate three or four different letters into the same character, for example {a,b,c} → 2, {d,e,f} → 3 and so on. The symbol 0 is used for newline and blank space, and 1 for all symbols other than these two and the letters of the alphabet. The new alphabet for the coded texts is therefore made up of 10 symbols: {0, 1, 2, . . . , 9}. Note that the use of T9 "compression", which could seem strange at first sight, can be justified by observing that a "long" T9 sequence (10-15 characters) has in most cases an "almost unique" translation into a sentence which makes sense in the original language.
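The following Python sketch shows one possible T9-like table of the kind just described; the exact grouping of letters into digits is our assumption (the paper only fixes the roles of the symbols 0 and 1), and the handling of accented characters or digits is left to the catch-all symbol 1.

# Standard phone-keypad grouping; the paper does not spell out the exact table,
# so this particular assignment of letters to digits is an assumption.
T9_GROUPS = {"abc": "2", "def": "3", "ghi": "4", "jkl": "5",
             "mno": "6", "pqrs": "7", "tuv": "8", "wxyz": "9"}
T9_TABLE = {ch: digit for letters, digit in T9_GROUPS.items() for ch in letters}

def t9_code(text: str) -> str:
    """T9-like recoding: letters -> 2..9, space/newline -> 0, anything else -> 1."""
    out = []
    for ch in text.lower():
        if ch in (" ", "\n"):
            out.append("0")
        elif ch in T9_TABLE:
            out.append(T9_TABLE[ch])
        else:
            out.append("1")
    return "".join(out)

print(t9_code("To be, or not to be"))  # -> '8602310670668086023'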
The "true" matches between a suspicious and a source document are now found: from any possible starting position in the suspicious document, the longest match in the source document is calculated (possibly more than one with the same length). If the length is larger than a fixed threshold and the match is not a submatch of a previously detected one, it is stored in a list.

Here we take advantage of the choice of the T9 encoding, which uses ten symbols: for any starting position in the source document the algorithm stores the last previous position of the same string of length 7, and for any possible string of length 7 the last occurrence in the source file is memorized in a vector of size 10^7. With respect to other methods (suffix trees or sorting, for instance), in this way we can search for the maximum match in the source document while avoiding comparisons with the shorter ones.

Running this part of the algorithm on a common PC for the whole corpus of 7214 texts took about 40 hours. The threshold for the match length was fixed arbitrarily to 15 for texts shorter than 500000 characters, and to 25 for longer texts.
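A minimal Python sketch of this match search follows; it implements the idea of chaining previous occurrences of each 7-gram, but the data structures (a dictionary instead of the 10^7 vector) and the tie handling are our choices, not the paper's code.

def build_index(source: str, k: int = 7):
    # For every position j in the T9-coded source, remember the previous
    # occurrence of the same length-k string. The paper stores the last
    # occurrence in a vector of size 10^7 indexed by the k-gram itself;
    # a dictionary is used here instead, which is an implementation choice.
    last = {}                                 # k-gram -> last position seen so far
    prev = [-1] * max(len(source) - k + 1, 0)
    for j in range(len(source) - k + 1):
        key = source[j:j + k]
        prev[j] = last.get(key, -1)
        last[key] = j
    return last, prev

def longest_match(susp: str, i: int, source: str, last, prev, k: int = 7):
    # Longest common substring starting at position i of the suspicious text,
    # found by walking the chain of occurrences of its leading k-gram in the
    # source; only matches of length >= k can be found, which is harmless since
    # the thresholds used in the paper (15 and 25) are larger than 7.
    best_len, best_pos = 0, -1
    j = last.get(susp[i:i + k], -1)
    while j != -1:
        l = 0
        while i + l < len(susp) and j + l < len(source) and susp[i + l] == source[j + l]:
            l += 1
        if l > best_len:
            best_len, best_pos = l, j
        j = prev[j]
    return best_pos, best_len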
3.3 Looking for "squares"

The previous phase gives a long list of matches for each suspicious-source pair of documents. Since the plagiarized passages had undergone various levels of obfuscation, typically the matches are "close" to each other in the suspicious texts but taken from not necessarily subsequent places in the source texts. By representing the pair of texts in a bidimensional plot, with the suspicious text on the x axis and the source text on the y axis, each match of length l, starting at x in the suspicious document and at y in the source document, draws a line from (x, y) to (x + l, y + l). The result is often something similar to Figure 3: there are some random (short) matches all around the plane, but there are places where matches accumulate, forming lines or something similar to a square. Non-obfuscated plagiarism corresponds to lines, i.e. a single long match or many short close matches which are in succession both in the suspicious and in the source texts, whereas, intuitively, obfuscated plagiarism corresponds to "squares": here the matching sequences are in a different order in the source and suspicious documents.

Figure 3 (suspiciousdocument00814.txt vs. sourcedocument04005.txt): Orange points correspond to the position of matching chars (measured in number of chars from the beginning) between a suspicious and a source document of the training corpus. A "square" of matches corresponds to an obfuscated plagiarism.

Figure 4 is an example of what can happen when there is no plagiarism: matches are uniformly spread around the plane and do not accumulate anywhere. Obviously these are just two of the many possible settings: longer texts or the presence of "noise" (e.g. long sequences of blanks, tables of numbers, ...) can give rise to a much higher density of matches, substantially increasing the difficulties in identifying the correct plagiarized passages.

Figure 4 (suspiciousdocument00814.txt vs. sourcedocument03464.txt): Orange points correspond to the position of matching chars (measured in number of chars from the beginning) between a suspicious and a source document of the training corpus. No plagiarism is present in this case.
In order to provide a single quadruple (x, y, lx, ly) of starting points and lengths for each detected plagiarized passage, we need to implement an algorithm that joins the "cloud" of matches of each "square". Note that the algorithm that performs this task needs to be scalable with the plagiarism lengths, which can vary from a few hundred up to tens of thousands of characters.

The algorithm used here joins two matches if the following conditions hold simultaneously (a sketch of this joining step is given below):

1. the matches are subsequent in the x coordinate;

2. the distance between the projections of the matches on the x axis is greater than or equal to zero (no superimposed plagiarism) but smaller than or equal to the lx of the longer of the two sequences, scaled by a certain factor δx;

3. the distance between the projections of the matches on the y axis is greater than or equal to zero but smaller than or equal to the ly of the longer of the two sequences, scaled by a certain factor δy.

By now repeatedly merging the segments which are superimposed either in x or in y, we obtain some quadruples which correspond roughly to the "diagonals" of the "squares" in Figure 3. We finally run the algorithm once again, using the smaller rescaled parameters δ'x and δ'y, in order to reduce the granularity from 2 to approximately the optimal value of 1. Figure 5 shows the result of the algorithm for the pair of texts of Figure 3 (blue), and Figure 6 shows the very good superimposition with the actual plagiarized passages (black), as derived from the XML file.
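One possible reading of this joining rule is sketched below in Python; the quadruple representation, the greedy left-to-right pass and the helper names are our own conventions, since the actual match-joining code was written in Mathematica 7 and is not reproduced in the paper.

def can_join(a, b, dx, dy):
    # a and b are matches or partial passages as quadruples (x, y, lx, ly),
    # with a preceding b along the x axis (the suspicious document).
    gap_x = b[0] - (a[0] + a[2])                               # gap of the x projections
    gap_y = max(b[1] - (a[1] + a[3]), a[1] - (b[1] + b[3]))    # gap of the y projections
    return 0 <= gap_x <= dx * max(a[2], b[2]) and 0 <= gap_y <= dy * max(a[3], b[3])

def join_pass(matches, dx, dy):
    # One greedy pass over the matches, ordered along x, joining pairs that
    # satisfy the conditions above into a single covering quadruple.
    if not matches:
        return []
    matches = sorted(matches)
    out, cur = [], matches[0]
    for m in matches[1:]:
        if can_join(cur, m, dx, dy):
            x, y = cur[0], min(cur[1], m[1])
            lx = max(cur[0] + cur[2], m[0] + m[2]) - x
            ly = max(cur[1] + cur[3], m[1] + m[3]) - y
            cur = (x, y, lx, ly)
        else:
            out.append(cur)
            cur = m
    out.append(cur)
    return out

def join_matches(matches, dx=3.0, dy=3.0):
    # Iterate until the list of quadruples is stable; the further merging of
    # superimposed quadruples and the second run with the rescaled parameters
    # (0.5 in Section 3.4) are omitted here for brevity.
    prev = None
    while matches != prev:
        prev, matches = matches, join_pass(matches, dx, dy)
    return matches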
The procedure just described has been developed and tuned for the competition on a restricted time schedule, but it would be quite interesting to compare its efficiency with what can be achieved by using standard clustering algorithms. We plan to do this in the future.

Figure 5 (suspiciousdocument00814.txt vs. sourcedocument04005.txt): Detected plagiarism for the pair of texts of the training corpus. Single matches in orange, joined matches in blue.

Figure 6 (suspiciousdocument00814.txt vs. sourcedocument04005.txt): Plagiarized passages for the pair of texts of the training corpus. Single matches in orange, actual plagiarism in black. Note the perfect superimposition between the blue lines in Figure 5 and the black lines here.

3.4 Tuning the parameters

The "joining algorithm" described above depends on 4 parameters: δx and δy for the first joining phase, and the rescaled δ'x and δ'y for the second joining phase. Our choice of the actual values in use has been dictated essentially by the lack of time, and no rigorous and efficient optimization has been performed. Driven by very few trials and some heuristics, we decided to use the following values: δx = δy = 3 and δ'x = δ'y = 0.5.
It is important to remark that different choices of the δ values yield different detection results. For example, increasing their values typically results in a larger recall and in a better granularity, but also in a smaller precision. A further analysis of these dependencies could provide (in future developments) a controllable way of modifying the precision, the recall and the granularity, depending on the plagiarism detection task under consideration. A promising strategy that we plan to explore in the future consists in a dynamical tuning of these parameters according, for example, to the density of matches or to the lengths of the two texts under consideration.

The match-joining algorithm was developed using Mathematica 7, and it ran on a common PC for about 20 hours on the whole data set of the competition.

4 Results and comments

The algorithm described gave the following results on the competition corpus (Stein et al., 2009):

• Precision: 0.6727
• Recall: 0.6272
• F-measure: 0.6491
• Granularity: 1.0745
• Overall score: 0.6041
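Plugging the reported figures into the score combination recalled in Section 2 gives a consistent picture (our arithmetic, not part of the official evaluation report):

% Consistency check of the reported figures:
\[
  F = \frac{2 \cdot 0.6727 \cdot 0.6272}{0.6727 + 0.6272} \approx 0.6491,
  \qquad
  \text{overall score} = \frac{0.6491}{1.0745} \approx 0.6041 .
\]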
The overall score is the third best result, after the 0.6957 and 0.6093 of the first two participants. We stress that the overall score drops considerably starting from the fourth position (0.3045), the fifth (0.1885), and so on. Moreover, while the winner has better results in precision, recall and granularity alike, our precision and recall are better than those of the second participant, while our granularity is worse.

The competition had a very tight schedule, therefore many improvements are possible. In particular, it may be that the match-joining problem can be advantageously formulated in the framework of Hidden Markov Models. Also, it would be worthwhile to see how standard clustering algorithms perform in this case.

We are eager to compare techniques and ideas with the other participants of the contest.
References

Basile, C., D. Benedetto, E. Caglioti, and M. Degli Esposti. 2008. An example of mathematical authorship attribution. Journal of Mathematical Physics, 49:125211-125230.

Benedetto, D., E. Caglioti, and V. Loreto. 2002. Language Trees and Zipping. Physical Review Letters, 88(4):048702.

Bennett, W.R. 1976. Prentice-Hall, Englewood Cliffs, NJ.

Kešelj, V., F. Peng, N. Cercone, and C. Thomas. 2003. N-gram-based author profiles for authorship attribution.

PAN-09-Organizers. 2009. Proceedings of PAN-09.

Stein, B., M. Potthast, A. Eiselt, P. Rosso, and A. Barrón-Cedeño. 2009. http://www.webis.de/pan-09/competition.php.