ENCOPLOT: Pairwise Sequence Matching in Linear Time Applied to Plagiarism Detection ∗

Cristian Grozea, Christian Gehl
Fraunhofer FIRST, IDA Group
Kekulestrasse 7, 12489 Berlin, Germany
cristian.grozea@first.fraunhofer.de, christian.gehl@first.fraunhofer.de

Marius Popescu
University of Bucharest, Faculty of Mathematics and Computer Science
Academiei 14, Sect. 1, Bucharest, Romania
popescunmarius@gmail.com

Abstract: In this paper we describe a new general plagiarism detection method, which we used in our winning entry to the external plagiarism detection task of the 1st International Competition on Plagiarism Detection, a task which assumes that the source documents are available. In the first phase of our method, a matrix of kernel values is computed, giving an n-gram-based similarity value between each source and each suspicious document. In the second phase, each promising pair is investigated further, in order to extract the precise positions and lengths of the subtexts that have been copied and perhaps obfuscated, using encoplot, a novel linear-time pairwise sequence matching technique. We solved the significant computational challenges arising from having to compare millions of document pairs by using a library developed by our group mainly for use in network security tools. The performance achieved amounts to comparing more than 49 million pairs of documents in 12 hours on a single computer. The results in the challenge were very good: we outperformed all other methods.

Keywords: n-gram, plagiarism detection, network security, challenge

Stein, Rosso, Stamatatos, Koppel, Agirre (Eds.): PAN'09, pp. 10-18, 2009.

∗ Partly supported by the BMBF project ReMIND (KZ 01-IS07007A) and the CNCSIS PN2-Idei project 228.

1 Introduction

Many methods have been developed for plagiarism detection, especially for external plagiarism analysis, which consists in finding the passages in the suspicious documents that have been plagiarized, together with the corresponding text passages in the source documents. Almost all these methods handle the text at word level, and various comparison units have been employed. Entire documents are compared in (Lyon, Barrett, and Malcolm, 2004). Sentences from suspicious documents are compared to sentences from reference documents in (Kang, Gelbukh, and Han, 2006). Mixed-length comparisons, in which suspicious sentences are compared with entire reference documents, were used in (Barrón-Cedeño and Rosso, 2009; Barrón-Cedeño, Rosso, and Benedí, 2009). Irrespective of the comparison unit used, all plagiarism detection methods need a similarity measure to compare the text fragments corresponding to the comparison unit.

Most similarity measures used in plagiarism detection are based on estimating the amount of common configurations of words. They differ in the configurations considered (n-grams, subsequences, etc.) or in which words are used in the comparisons (only words from the text fragments, stemmed or not, synonyms from WordNet, etc.). In (Lyon, Barrett, and Malcolm, 2004), word trigrams are used to measure the similarity between texts. The authors based their choice of word trigrams for plagiarism detection on the fact that the number of common word trigrams in two independently written texts (even texts on the same topic) must be low, given the Zipfian distribution of words. Also, (Barrón-Cedeño and Rosso, 2009) report that using word bigrams and trigrams led to the best results in their experiments.
In order to address the problem of rewording in plagiarism, PPChecker (Kang, Gelbukh, and Han, 2006) is based on a specially designed similarity measure that also takes into account the synonyms (obtained from WordNet) of the words in the suspicious sentences. Some of the most elaborate similarity measures used in plagiarism detection are described in (Bao et al., 2003; Bao et al., 2004a; Bao et al., 2004b). These measures are derived from the string kernel, a kernel type successfully used in text categorization (Lodhi et al., 2002). The string kernel works at character level, although in (Bao et al., 2003; Bao et al., 2004a; Bao et al., 2004b) it is extended to work at word level, comparing two semantic sequences according to their common words and position information.

Using words is natural in text analysis tasks like text categorization (by topic), authorship identification and plagiarism detection. Perhaps surprisingly, recent results proved that methods handling the text at character level can also be very effective in text analysis tasks. In (Lodhi et al., 2002), string kernels were used for document categorization with very good results. Trying to explain why treating documents as symbol sequences and using string kernels obtained such good results, the authors suppose that "the [string] kernel is performing something similar to stemming, hence providing semantic links between words that the word kernel must view as distinct". String kernels were also successfully used in authorship identification (Sanderson and Guenter, 2006; Popescu and Dinu, 2007). A possible reason for the success of string kernels in authorship identification is given in (Popescu and Dinu, 2007): "the similarity of two strings as it is measured by string kernels reflects the similarity of the two texts as it is given by the short words (2-5 characters) which usually are function words, but also takes into account other morphemes like suffixes ('ing' for example) which also can be good indicators of the author's style".¹ For plagiarism detection, the only approach handling the text at character level that we are aware of is (Bao, Lyon, and Lane, 2006), for Chinese, and there it is justified by the difficulties of the Chinese language (word segmentation).

¹ The string kernel used in (Popescu and Dinu, 2007) takes into account substrings of length up to 5 characters.

There is a strong connection between research in NLP and research in computer network security. In recent years, network security research has started to approach the problem of automatically detecting unknown attacks as soon as they reach the targeted system. These attacks may follow the syntax but try to exploit the semantics of the network communication between the client and the server applications, in order to gain access to the attacked computer or at least to prevent it from working normally. The communication process defined by the application layer protocols – e.g. HTTP, FTP, RPC or IMAP – can also be regarded as text-based communication in an artificial language. The idea of payload analysis, which treats the data as sequences of bytes, has been explored in detail (Kruegel, Toth, and Kirda, 2002; Wang and Stolfo, 2004; Rieck and Laskov, 2006; Wang, Parekh, and Stolfo, 2006; Rieck and Laskov, 2007). As the focus in this field shifted towards applying more advanced machine learning methods, generalizing the extraction and representation of the features has greatly increased the flexibility in defining similarity measures between sequential data in a security context. The work (Rieck and Laskov, 2008) presents an efficient way to combine features extracted from byte sequences, e.g. words or n-grams with arbitrary n, for a wide range of linear and non-linear similarity measures.

Graphical methods for comparing sequences have been used in many fields, mostly under the name dotplot – see (Maizel and Lenk, 1981) for one of the first uses in biology and (Church and Helfman, 1993) for uses in source text comparison. While very attractive for exploratory data analysis, building this graphic is potentially quadratic in time and space. It also tends to be noisy, showing many irrelevant coincidences between the sequences compared. Even with these limitations, the method has been applied to source code, videos, music, protein and other biological sequences, with various ways to filter the noisy graphics and to handle the problem of the potentially quadratic size. We improve on this technique by deriving our own linear-space, linear-time technique, which we named encoplot, short for "eN-gram COincidence PLOT". It is fully described in Section 2.3, with code in Appendix 1.

Our plagiarism detection method can be described as a combination of techniques from many fields: it is character n-gram based; it leverages very efficient network security software to compute the matrices of kernel values; and it uses the very fast encoplot algorithm, processing the encoplot data in a quantitative fashion to solve what can be seen as a rudimentary machine vision task, or a specialized 2-dimensional data clustering task, in order to identify the matching text passages for a given document pair, as explained thoroughly below.

In what follows, the dataset specifics and the time performance figures refer to the dataset of the 1st International Competition on Plagiarism Detection, external plagiarism detection task (Webis at Bauhaus-Universität Weimar and NLEL at Universidad Politécnica de Valencia, 2009). The development corpus of this dataset contained about 7000 source documents and 7000 suspicious ones, with the plagiarism generated automatically with various degrees of obfuscation (permutations, words deleted, inserted or replaced by synonyms or antonyms) and annotated. The competition corpus had the same characteristics (on different documents), with the annotation withheld.
2 Methods

Our approach consists of two main phases. In the first phase, a matrix of string kernel values is computed, which gives a similarity value between each source and each suspicious document. Then, for each source, the possible "destinations" (suspicious documents) are ranked by their similarity to the current source, in decreasing order. In the second phase, each promising pair is investigated further, in order to extract the precise positions and lengths of the subtexts that have been copied and perhaps obfuscated by the random plagiarist. In the end we apply a supplementary filtering that increases the precision at the price of decreasing the recall.

2.1 Selecting a kernel and computing the matrix of kernel values for a large set of documents

Based on the work of (Rieck and Laskov, 2008), a C library for sequential data, libmindy, has been implemented by our network security research group. It has been developed mainly for use in building real-time network analysis tools at packet level, as part of network intrusion detection and prevention systems. It can map byte sequences to a vectorial n-gram representation, such that the similarity between two byte sequences can be expressed in terms of distance and kernel functions on those representations. The n-gram extraction set of feasible byte sequences is given by A_n = Σ^n, where Σ is the alphabet (in our case the whole 8-bit ASCII set). The n-gram embedding function φ for a byte sequence x is then defined as φ(x) = (φ_ng(x))_{ng ∈ A_n} with φ_ng(x) = emb(x, ng), where the dimension of the vector φ(x) is |A_n|. The function emb(x, ng) returns either the frequency, the count or the presence bit of the n-gram ng in x. With the embedding function φ fixed, one can compute a pairwise similarity value for the vectorial representations of two byte sequences. Table 1 presents a selection of the implemented distances and similarity measures that we could have used (x and y being arbitrary byte sequences).

  Minkowski distance:  d(x, y) = (Σ_{ng ∈ A_n} |φ_ng(x) − φ_ng(y)|^k)^{1/k}
  Canberra distance:   d(x, y) = Σ_{ng ∈ A_n} |φ_ng(x) − φ_ng(y)| / (φ_ng(x) + φ_ng(y))
  Linear kernel:       k(x, y) = Σ_{ng ∈ A_n} φ_ng(x) · φ_ng(y)
  RBF kernel:          k(x, y) = exp(−‖φ(x) − φ(y)‖² / (2σ²))

Table 1: Distance and kernel functions for sequential data.

Experiments with a very small subset of only 5 documents, together with our previous experience with string kernels, led us to use the linear kernel over a representation where every n-gram present is marked by 1 and every other by 0 (thus ignoring the frequencies of the n-grams). The kernel was normalized, such that K(x, x) = 1 for any string x. For the length of the n-grams we used 16 characters. Although by our estimations 18 should have been better (closer to three times the average word length plus two separators), the speed-up of the software used can only be obtained up to n-grams of length 16 – see below and Appendix 1 for details. Using windows of two to three words was found to be the best choice in plagiarism detection by (Lyon, Barrett, and Malcolm, 2004) and (Barrón-Cedeño and Rosso, 2009).
The computation of a matrix of kernel values with sides as large as 7000 is computationally intensive. There are more than 49 million pairs of documents for which the kernel value has to be computed in each of the two datasets, the development and the competition corpus, accounting for a total of more than 98 million pairs to consider. libmindy already had a tool for building a (symmetric) kernel matrix for a set of documents. We extended this tool to handle asymmetric matrices of kernel values, computed for each x ∈ X and y ∈ Y, where X and Y are two independent finite sets of files, not necessarily of the same cardinality. While the new tool could in principle perform the task fast enough, it would have needed about 400 GB of RAM for a kernel based on length-16 n-grams. To avoid this issue, we partitioned the matrix of kernel values into blocks of sizes up to 1000×1000 (1 million pairs in most blocks), which required only 8 to 10 GB of RAM for processing. The 64 blocks per dataset were processed one after the other, but the processing of each block was fully parallelized on the 8 cores of the machine, by internally distributing the tasks by means of OpenMP programming.
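The blocking scheme can be sketched as follows. The names `kernel_matrix_blocked` and `kernel_fn` are hypothetical, the kernel itself is abstracted behind a function pointer, and the OpenMP parallelism of the real tool is only indicated as a comment; this is not the actual libmindy tool.

```c
#include <stddef.h>

/* Sketch of the blocking scheme: the kernel matrix is computed block
   by block to bound memory, and inside a block the pairs can be
   evaluated in parallel (the real tool used OpenMP there). */
typedef double (*kernel_fn)(int src_id, int susp_id);

void kernel_matrix_blocked(int n_src, int n_susp, int block,
                           kernel_fn k, double *K /* n_src x n_susp */) {
    for (int bi = 0; bi < n_src; bi += block)
        for (int bj = 0; bj < n_susp; bj += block) {
            int ei = bi + block < n_src ? bi + block : n_src;
            int ej = bj + block < n_susp ? bj + block : n_susp;
            /* real tool: #pragma omp parallel for over this loop */
            for (int i = bi; i < ei; i++)
                for (int j = bj; j < ej; j++)
                    K[(size_t)i * n_susp + j] = k(i, j);
        }
}

/* toy kernel for demonstration only: k(i, j) = i + j */
double toy_kernel(int src_id, int susp_id) {
    return (double)(src_id + susp_id);
}
```

Only the documents of the current block need to be held in their n-gram representation, which is what brings the memory requirement down from hundreds of gigabytes to a few.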
Processing a full dataset took 12 hours on the machine we used (a Dell Precision T7400). Although we had access to a cluster, it offered only a 32-bit environment. This would have slowed the whole processing by a factor that would have almost completely eliminated the advantage of having 8 to 12 times more computing cores, which is why we decided to use a single multi-core computer.

2.2 Pruning of the pairs

If the total processing for one pair of documents (up to book length) took only one second, this would still lead to a total computation time of more than three years! Even by successfully parallelizing this task and dividing the time by hopefully 8 (the number of computing cores), the time needed would have been more than 4 months. It was obvious that, even with the matrix of kernel values computed, there is too much work in comparing the documents in each pair. Pruning was seen from the start as a requirement; the question was what effect it would have on limiting the performance that can be achieved. We considered ranking the pairs such that the ones with the best chances of corresponding to plagiarism come first. Ranking on the absolute values of the kernel proved to work worst. Ranking the suspicious documents for each source proved to provide a consistent 10% advantage over ranking the sources for each suspicious document. Therefore, given also the values that can be seen in Figure 1, we decided to limit our effort to the 51 most promising suspicious documents for each given source.

Figure 1: Maximum achievable recall for different pruning thresholds. Ranking the suspicious documents for each source leads consistently to better values than ranking the sources for each suspicious document.

2.3 Comparing two documents – the encoplot

With the maximum effort down to an estimate of about 100 hours – assuming on average one second per exhaustive document comparison, with the hope of reducing it to 12 hours by multicore parallelism – we proceeded to search for a way to identify what the documents have in common, if anything. Essential to this was the visualization of the coincidence pattern of n-grams between two documents: a scatter plot of a sublist of the positions where both texts have the same n-gram. We call this plot encoplot. Plots computed for pairs in the development corpus can be seen in Figures 2 and 3.

Related ideas (the "dotplot" graphs) exist for visualizing the n-grams that two texts (or sequences) share.
The problem with those is that the number of pairs can be quadratic in the size of the documents. For megabyte-long texts, this can easily become computationally intractable. We solve this issue by limiting ourselves to a sublist that is guaranteed to be no longer than the shorter of the documents, and that can be computed in linear time. The precise procedure we employed starts by virtually sorting the sets of n-grams of both documents to be compared. Then these ordered sets of n-grams are compared with a procedure derived from the procedure for merging two sorted lists. Every time the smallest elements of the two lists differ, the smaller of them is dropped, without producing any output. Every time the smallest elements of the lists are equal, the pair of positions at which this identical n-gram occurs is collected, by outputting it to the standard output. Code for this core procedure is given in Appendix 1. Please note that encoplot pairs the first instance of an n-gram in one document with the first instance of the same n-gram in the other document, the second with the second, and so on – as opposed to the dotplot, which pairs each instance with each instance.

Figure 2: Encoplot for source #3094 and suspicious #9. Many plagiarism instances for the same document pair; the shattered look of some comes from higher obfuscation. In red, the local contiguity score, scaled.

Figure 3: Encoplot for source #134 and suspicious #2499 – a real case of human (self-)plagiarism.
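The merge at the core of encoplot can be illustrated on already-sorted lists. The names `encoplot_merge` and `entry` are hypothetical, and in the real program (Appendix 1) the sorted order comes from the virtual radix sort rather than from precomputed arrays; the merge logic, however, is the one described above.

```c
#include <stddef.h>
#include <stdint.h>

/* Both inputs are (n-gram value, position) lists sorted by value,
   with ties sorted by position, so the k-th occurrence of an n-gram
   in one document gets paired with the k-th occurrence in the other,
   unlike the dotplot, which pairs every occurrence with every
   occurrence. Runs in linear time, emits at most min(na, nb) pairs. */
typedef struct { uint64_t val; int pos; } entry;

size_t encoplot_merge(const entry *a, size_t na,
                      const entry *b, size_t nb,
                      int *px, int *py /* output position pairs */) {
    size_t i = 0, j = 0, out = 0;
    while (i < na && j < nb) {
        if (a[i].val == b[j].val) {        /* coincidence: emit the pair */
            px[out] = a[i].pos; py[out] = b[j].pos; out++;
            i++; j++;
        } else if (a[i].val < b[j].val) {  /* drop the smaller, no output */
            i++;
        } else {
            j++;
        }
    }
    return out;
}
```

Because each step consumes at least one element of one list and emits at most one pair, both the running time and the output size are linear in the shorter document.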
2.4 Heuristics used for separating the copied subtexts

Once the encoplot data (the list of pairs of indexes) is obtained, it is sorted by the value of the first index in each pair, which corresponds to the position in the source of the common n-gram. From this list a local "contiguity" score is derived, by computing whether there is simultaneously a small jump on both indexes (sum of absolute jumps less than 4) when going from one pair to the next, followed by a smoothing by convolution with a constant vector of length 16. The contiguity score of an encoplot is displayed in red in Figures 2 and 3. Then a Monte Carlo optimization procedure is called, at most 30 times for each document pair, which in 10 attempts tries to find the largest group in the current encoplot data. The start of the group is decided randomly, with uniform distribution over the list of available pairs; the group is then extended to the left and to the right such that the average contiguity score stays above 0.5 and there are no jumps (skipped portions) longer than 512 in any 16 steps. After a group is obtained, it is checked to have an average contiguity score over 0.75 and a length of at least 256 characters. If not, it is rejected as insignificant.
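The contiguity scoring just described can be sketched as follows. The name `contiguity_score` is hypothetical, and the centered moving average is our reading of the smoothing "by convolution with a constant vector of length 16"; the jump threshold (sum of absolute jumps below 4) is as specified above.

```c
#include <stdlib.h>

/* sx and sy are the encoplot pairs sorted by source position; a pair
   scores 1 when the jump to the next pair is small on both axes
   (sum of the absolute jumps below 4), and the resulting 0/1 signal
   is then smoothed with a constant window of length 16. */
void contiguity_score(const int *sx, const int *sy, size_t n,
                      double *score /* length n */) {
    const size_t W = 16;
    double *raw = calloc(n ? n : 1, sizeof *raw);
    for (size_t i = 0; i + 1 < n; i++) {
        int jump = abs(sx[i + 1] - sx[i]) + abs(sy[i + 1] - sy[i]);
        raw[i] = (jump < 4) ? 1.0 : 0.0;
    }
    for (size_t i = 0; i < n; i++) {      /* centered moving average */
        double s = 0.0; size_t cnt = 0;
        for (size_t j = (i >= W / 2) ? i - W / 2 : 0;
             j < n && j < i + W / 2; j++) { s += raw[j]; cnt++; }
        score[i] = cnt ? s / cnt : 0.0;
    }
    free(raw);
}
```

On a perfectly contiguous diagonal run the score approaches 1, while scattered coincidences between unrelated documents stay near 0, which is what makes the 0.5 and 0.75 thresholds above meaningful.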
If kept, it is projected onto the dimension of the indexes that correspond to the suspicious document, and only its compact core is preserved. The compact core is obtained by sorting on the suspicious document axis and eliminating the outliers: starting from the group center, the group is extended to the left and to the right as long as the skips are less than 256 positions. What remains is projected back onto the source document axis, obtaining an estimate of the indexes whose convex hull defines the two subtexts corresponding to each other. This candidate plagiarism instance is checked once more: for a final length of at least 256, for not having shrunk to less than half of the initial group length, and for the two subtexts not having sizes too different (an absolute difference of more than half of the mean of the two lengths). This subset of the encoplot data is then removed, the plagiarism instance is output if all tests succeeded, and the procedure is repeated in the search for more groups. If the group found fails to satisfy the checks, it is counted as a failure. After three consecutive failures, the search is abandoned and the treatment of the pair of documents is considered completed. This decision may be risky, but it substantially accelerates this phase, as on very complicated document pairs it can take minutes to examine an involved pair completely. On the other hand, for the actually unrelated documents this ends the investigation rapidly. Technically, we accelerated this processing phase even more by running up to 10 detailed examinations of document pairs simultaneously, trying to balance the processing power required and the disk latency.

3 Results

We combined the best F-measure – the harmonic mean of precision and recall – of 0.6976 (the next competitor had 0.6192) with the best granularity – lack of fragmentation in the detection of the plagiarized passages – of 1.0027 (the next best value was 1.0164), thus winning the competition.

4 Discussion and Conclusions

The first question is whether our choice to compare the documents in pairs was optimal. Indexing-based methods could be faster, by eliminating the need for exhaustive pairwise comparison of the documents in a large corpus. They function by first indexing the collection of source documents and then searching for parts of the suspicious documents in the index, as the system MOSS (Schleimer, Wilkerson, and Aiken, 2003) does. Such an inflexible approach cannot handle obfuscation well, as opposed to ours. On the other hand, flexible matching is an ever-current research topic in information retrieval systems (Navarro, 2001), and progress there eventually improves plagiarism detection as well. We think that, while needing more computational effort, our approach had the chance of producing better results. And, as a consequence of using highly optimized network analysis code, it did so in a reasonable time, even when run on a single contemporary computer, as opposed to a full cluster. One could say that it was closer to optimal in terms of quality of the results, while still being acceptable in terms of running time.

A second question of interest is whether our values for the hyperparameters of the method are optimal for this dataset. The answer is probably no, but maybe not far from it. They have been chosen by educated guess, guided by exploratory data analysis, as opposed to blindly optimizing a cross-validation score towards the best (over)fitting.

The third interesting issue is the claim of some experts that only humans can have very good results at spotting plagiarism (Weber-Wulff, 2008). We think that, as far as ethics is concerned, a human must look at the evidence before claiming a case as one of plagiarism. And of course, text understanding is still not within the reach of artificial intelligence. On the other hand, the claim that automatization in plagiarism detection should be limited to using one's favorite search engine to search for paragraphs selected by intuition is questionable. How would such an expert deal with 7000 documents of up to book length? How long would it take to process those by hand, even using a public search engine? How long does it take one to read 7000 works/books? The need for automatization seems evident, as it was to (Grozea, 2004) when he had to grade 400 projects from 60 students in less than 24 hours. Crowdsourcing could also be a possibility, but one needs very big crowds for that (optimally of quadratic size, if using the same choice in the trade-off between speed and quality as we did). Time is the key factor in plagiarism detection.

Given the very good results obtained by our method, it is worth asking – and further investigating – whether using character n-grams offers any advantage over using word n-grams. First, let us note that our method uses n-grams of 16 characters, which on average² correspond to word trigrams (the standard approach in plagiarism detection). It may seem that (on average) the same information is carried by 16-character n-grams and word trigrams. What differentiates the two types of n-grams is, in our opinion, the fact that character n-grams favor long words over short ones; when people copy text, they do it for the content words of the copied text, which tend to be longer than the functional words (stop words), which are short. For example, a common syntagmatic expression³ like "as far as" will contribute one word trigram, but no character 16-gram. On the other hand, a sequence of content words (worth being copied) like "educated guess guided" will again contribute only one word trigram, but 6 character 16-grams.

² The average word length in the corpus is 5.2.
³ Frequently and systematically co-occurring lexical items.

Another item to discuss is how to balance precision and recall in automatic plagiarism detection systems. Given that a human is in many cases the final link in the chain that leads to the proof of plagiarism, the effort of that human must be spared as much as possible. Moreover, the accusation of plagiarism is so strong that it needs strong evidence. Both aspects recommend balancing precision and recall towards high precision, even at the expense of lowering the recall. This is how we tuned our system's parameters, including but not limited to the last checking phase. Of course, an accurate comparison of systems should take into account the entire precision-recall curve. By plotting these curves for several systems on the same graph, one could easily see where the best performance region of each system is, and whether or not one of the systems is overall better than another.

Related to the maximum achievable precision while keeping a fair recall is the issue of the independence of the documents and of the automatic plagiarism. The dataset contains plagiarism built automatically and randomly, and only these borrowings between the source documents and the suspicious documents had to be found. But the documents were not independent enough: there are pairs of documents with the same or almost the same content, such as independent translations of "One Thousand and One Nights" or several Bible editions, and authors doing heavy reuse of their previous works (the so-called self-plagiarism). These are interesting in two ways: they are better examples of what human plagiarism looks like, so spotting them as related is very good. On the other hand, they can be seen as plagiarism unintended by the organizers, so any such pair reported will actually lower the precision score.

A very interesting issue is the asymmetry of the ranking quality. Why is it 10% better to rank all suspicious documents for any fixed source than to rank all possible sources for every fixed suspicious document, as clearly seen in Figure 1? A possible source of this asymmetry is that, while it was guaranteed for each suspicious document that the plagiarized areas do not overlap, this was not the case for the source documents, where the plagiarized areas could overlap. This asymmetry deserves more investigation, being one of the few glints of hope so far towards tackling what could be the biggest open problem in automatic plagiarism detection: determining the direction of plagiarism in a pair of documents – being able to indicate with confidence which is the copy and which is the original.

To conclude, by combining advanced software engineering with effort-sparing heuristics tuned using the novel visualization technique encoplot, we have been able to achieve the top placement in the final results, proving that the interaction of NLP researchers with network security researchers can lead to high-performance NLP systems.

4.1 Acknowledgment

We thank Dr. Andreas Ziehe and the anonymous reviewers for the thorough review of our paper and for the useful suggestions.
References

Bao, Jun-Peng, Caroline Lyon, and Peter C. R. Lane. 2006. Copy detection in Chinese documents using Ferret. Language Resources and Evaluation, 40(3-4):357–365.

Bao, Jun-Peng, Jun-Yi Shen, Xiao-Dong Liu, Hai-Yan Liu, and Xiao-Di Zhang. 2003. Document copy detection based on kernel method. In Proceedings of the Natural Language Processing and Knowledge Engineering Conference (IEEE), pages 250–255.

Bao, Jun-Peng, Jun-Yi Shen, Xiao-Dong Liu, Hai-Yan Liu, and Xiao-Di Zhang. 2004a. Finding plagiarism based on common semantic sequence model. In Qing Li, Guoren Wang, and Ling Feng, editors, WAIM, volume 3129 of Lecture Notes in Computer Science, pages 640–645. Springer.

Bao, Jun-Peng, Jun-Yi Shen, Xiao-Dong Liu, Hai-Yan Liu, and Xiao-Di Zhang. 2004b. Semantic sequence kin: A method of document copy detection. In Honghua Dai, Ramakrishnan Srikant, and Chengqi Zhang, editors, PAKDD, volume 3056 of Lecture Notes in Computer Science, pages 529–538. Springer.

Barrón-Cedeño, Alberto and Paolo Rosso. 2009. On automatic plagiarism detection based on n-grams comparison. In Mohand Boughanem, Catherine Berrut, Josiane Mothe, and Chantal Soulé-Dupuy, editors, ECIR 2009, volume 5478 of LNCS, pages 696–700, Toulouse, France. Springer.

Barrón-Cedeño, Alberto, Paolo Rosso, and José-Miguel Benedí. 2009. Reducing the plagiarism detection search space on the basis of the Kullback-Leibler distance. In Alexander F. Gelbukh, editor, CICLing 2009, volume 5449 of Lecture Notes in Computer Science, pages 523–534, Mexico City, Mexico. Springer.

Church, K. W. and J. I. Helfman. 1993. Dotplot: A program for exploring self-similarity in millions of lines of text and code. Journal of Computational and Graphical Statistics, pages 153–174.

Grozea, C. 2004. Plagiarism detection with state of the art compression programs. Report CDMTCS-247, Centre for Discrete Mathematics and Theoretical Computer Science, University of Auckland, Auckland, New Zealand, August.

Kang, NamOh, Alexander F. Gelbukh, and Sang-Yong Han. 2006. PPChecker: Plagiarism pattern checker in document copy detection. In Petr Sojka, Ivan Kopecek, and Karel Pala, editors, TSD, volume 4188 of Lecture Notes in Computer Science, pages 661–667. Springer.

Kruegel, C., T. Toth, and E. Kirda. 2002. Service specific anomaly detection for network intrusion detection. In Proc. of ACM Symposium on Applied Computing, pages 201–208.

Lodhi, Huma, Craig Saunders, John Shawe-Taylor, Nello Cristianini, and Christopher J. C. H. Watkins. 2002. Text classification using string kernels. Journal of Machine Learning Research, 2:419–444.

Lyon, Caroline, Ruth Barrett, and James Malcolm. 2004. A theoretical basis to the automated detection of copying between texts, and its practical implementation in the Ferret plagiarism and collusion detector. In Plagiarism: Prevention, Practice and Policies Conference.

Maizel, J. V. and R. P. Lenk. 1981. Enhanced graphic matrix analysis of nucleic acid and protein sequences. Proceedings of the National Academy of Sciences, 78(12):7665–7669.

Navarro, G. 2001. A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31–88.

Popescu, Marius and Liviu P. Dinu. 2007. Kernel methods and string kernels for authorship identification: The Federalist Papers case. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-07), Borovets, Bulgaria, September.

Rieck, Konrad and Pavel Laskov. 2008. Linear-time computation of similarity measures for sequential data. Journal of Machine Learning Research, 9:23–48.

Rieck, Konrad and Pavel Laskov. 2006. Detecting unknown network attacks using language models. In Detection of Intrusions and Malware, and Vulnerability Assessment, Proc. of 3rd DIMVA Conference, LNCS, pages 74–90, July.

Rieck, Konrad and Pavel Laskov. 2007. Language models for detection of unknown attacks in network traffic. Journal in Computer Virology, 2(4):243–256.

Sanderson, Conrad and Simon Guenter. 2006. Short text authorship attribution via sequence kernels, Markov chains and author unmasking: An investigation. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 482–491, Sydney, Australia.

Schleimer, S., D. S. Wilkerson, and A. Aiken. 2003. Winnowing: Local algorithms for document fingerprinting. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pages 76–85. ACM, New York, NY, USA.

Wang, K., J. J. Parekh, and S. J. Stolfo. 2006. Anagram: A content anomaly detector resistant to mimicry attack. Pages 226–248.

Wang, K. and S. J. Stolfo. 2004. Anomalous payload-based network intrusion detection. In Recent Advances in Intrusion Detection (RAID).

Appendix 1

The key element of the implementation is an incremental version of the radix sort algorithm for virtually sorting the n-grams of a text without swapping any memory blocks; it is a specialization of the general radix sort algorithm. The key part is avoiding the recomputation of the byte frequencies at each step of the radix sort, relying instead on updating them incrementally. Another key technical aspect is the use of the 128-bit unsigned integer type __uint128_t, available with the gcc compiler on certain platforms, which allows for very good speed up to n-grams of length 16 on 64-bit architectures such as the common x86-64. The main code uses this virtual sorting of the n-gram sets to compute the encoplot data of two given files, a central part of our plagiarism detection method, as explained above.

// computes the encoplot data of a pair of files
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef __uint128_t tngram;   /* one n-gram, up to 16 bytes (gcc, 64-bit) */
#define DEPTH 16              /* n-gram length */
#define fr(x, y) for (int x = 0; x < (y); x++)

/* virtually sort the NN n-grams of x (NN + DEPTH bytes accessible)
   without swapping any memory block; ix receives the positions in
   sorted order; the counters are updated incrementally between the
   digit passes instead of being recomputed */
void rsort(const unsigned char *x, int NN, int *ix) {
    const int RANGE = 256;
    int counters[RANGE], startpos[RANGE];
    int *ox = (int *)malloc(NN * sizeof(int));
    fr(i, NN) ix[i] = i;
    fr(k, RANGE) counters[k] = 0;
    fr(i, NN) counters[x[i]]++;
    fr(j, DEPTH) {            /* byte 0 is the least significant: low endian */
        int ofs = j, sp = 0;
        fr(k, RANGE) { startpos[k] = sp; sp += counters[k]; }
        fr(i, NN) { unsigned char c = x[ofs + ix[i]]; ox[startpos[c]++] = ix[i]; }
        memcpy(ix, ox, NN * sizeof(ix[0]));
        if (j < DEPTH - 1) {  /* update the counters incrementally */
            counters[x[j]]--; counters[x[j + NN]]++;
        }
    }
    free(ox);
}

tngram ng(const unsigned char *p) {  /* load one n-gram */
    tngram v; memcpy(&v, p, sizeof(v)); return v;
}

unsigned char *load(const char *fn, int *n) {
    FILE *f = fopen(fn, "rb");
    if (!f) { perror(fn); exit(1); }
    fseek(f, 0, SEEK_END); long sz = ftell(f); fseek(f, 0, SEEK_SET);
    unsigned char *x = (unsigned char *)calloc(sz + DEPTH, 1);
    if (fread(x, 1, sz, f) != (size_t)sz) { perror(fn); exit(1); }
    fclose(f);
    *n = (int)sz;
    return x;
}

int main(int argc, char **argv) {
    if (argc < 3) { fprintf(stderr, "usage: %s source suspicious\n", argv[0]); return 1; }
    int n1, n2;
    unsigned char *x1 = load(argv[1], &n1), *x2 = load(argv[2], &n2);
    int *ix1 = (int *)malloc(n1 * sizeof(int)), *ix2 = (int *)malloc(n2 * sizeof(int));
    rsort(x1, n1, ix1); rsort(x2, n2, ix2);
    /* merge the two sorted n-gram lists: on a match output the pair of
       positions, otherwise drop the smaller element (Section 2.3) */
    int i = 0, j = 0;
    while (i < n1 && j < n2) {
        tngram a = ng(x1 + ix1[i]), b = ng(x2 + ix2[j]);
        if (a == b) { printf("%d %d\n", ix1[i], ix2[j]); i++; j++; }
        else if (a < b) i++; else j++;
    }
    return 0;
}