ENCOPLOT: Pairwise Sequence Matching in Linear Time Applied to Plagiarism Detection ∗

Cristian Grozea, Christian Gehl
Fraunhofer FIRST, IDA Group
Kekulestrasse 7, 12489 Berlin, Germany
cristian.grozea@first.fraunhofer.de, christian.gehl@first.fraunhofer.de

Marius Popescu
University of Bucharest, Faculty of Mathematics and Computer Science
Academiei 14, Sect. 1, Bucharest, Romania
popescunmarius@gmail.com

Abstract: In this paper we describe a new general plagiarism detection method, which we used in our winning entry to the external plagiarism detection task of the 1st International Competition on Plagiarism Detection, a task which assumes that the source documents are available. In the first phase of our method, a matrix of kernel values is computed, giving an n-gram-based similarity value between each source and each suspicious document. In the second phase, each promising pair is investigated further, in order to extract the precise positions and lengths of the subtexts that have been copied and perhaps obfuscated, using encoplot, a novel linear-time pairwise sequence matching technique. We solved the significant computational challenges arising from having to compare millions of document pairs by using a library developed by our group mainly for use in network security tools. The performance achieved amounts to comparing more than 49 million pairs of documents in 12 hours on a single computer. The results in the challenge were very good: we outperformed all other methods.

Keywords: n-gram, plagiarism detection, network security, challenge

Stein, Rosso, Stamatatos, Koppel, Agirre (Eds.): PAN'09, pp. 10-18, 2009.

∗ Partly supported by the BMBF project ReMIND (KZ 01-IS07007A) and the CNCSIS PN2-Idei project 228.

1 Introduction

Many methods have been developed for plagiarism detection, especially for external plagiarism analysis, which consists in finding the passages in the suspicious documents that have been plagiarized, together with the corresponding text passages in the source documents. Almost all these methods handle the text at word level, and various comparison units have been employed. Entire documents are compared in (Lyon, Barrett, and Malcolm, 2004). Sentences from suspicious documents are compared to sentences from reference documents in (Kang, Gelbukh, and Han, 2006). Mixed-length comparisons, in which suspicious sentences are compared with entire reference documents, were used in (Barrón-Cedeño and Rosso, 2009; Barrón-Cedeño, Rosso, and Benedí, 2009). Irrespective of the comparison unit used, all plagiarism detection methods need a similarity measure to compare the text fragments corresponding to the comparison unit.

Most similarity measures used in plagiarism detection are based on estimating the amount of common configurations of words. They differ in the configurations considered (n-grams, subsequences, etc.) or in which words are used in the comparisons (only words from the text fragments, stemmed or not, synonyms from WordNet, etc.). In (Lyon, Barrett, and Malcolm, 2004), word trigrams are used to measure the similarity between texts. The authors based their choice of word trigrams for plagiarism detection on the fact that the number of common word trigrams in two independently written texts (even texts on the same topic) must be low, given the Zipfian distribution of words. Also, (Barrón-Cedeño and Rosso, 2009) report that using word bigrams and trigrams led to the best results in their experiments.
In order to address the problem of rewording in plagiarism, PPChecker (Kang, Gelbukh, and Han, 2006) is based on a specially designed similarity measure that also takes into account the synonyms (obtained from WordNet) of the words in the suspicious sentences. Some of the most elaborate similarity measures used in plagiarism detection are described in (Bao et al., 2003; Bao et al., 2004a; Bao et al., 2004b). These measures are derived from the string kernel, a kernel type successfully used in text categorization (Lodhi et al., 2002). The string kernel works at character level, although in (Bao et al., 2003; Bao et al., 2004a; Bao et al., 2004b) it is extended to work at word level, comparing two semantic sequences according to their common words and position information.

Using words is natural in text analysis tasks like text categorization (by topic), authorship identification and plagiarism detection. Perhaps surprisingly, recent results proved that methods handling the text at character level can also be very effective in text analysis tasks. In (Lodhi et al., 2002), string kernels were used for document categorization with very good results. Trying to explain why treating documents as symbol sequences and using string kernels obtained such good results, the authors suppose that "the [string] kernel is performing something similar to stemming, hence providing semantic links between words that the word kernel must view as distinct". String kernels were also successfully used in authorship identification (Sanderson and Guenter, 2006; Popescu and Dinu, 2007). A possible reason for the success of string kernels in authorship identification is given in (Popescu and Dinu, 2007): "the similarity of two strings as it is measured by string kernels reflects the similarity of the two texts as it is given by the short words (2-5 characters) which usually are function words, but also takes into account other morphemes like suffixes ('ing' for example) which also can be good indicators of the author's style".¹ For plagiarism detection, the only approach handling the text at character level that we are aware of is (Bao, Lyon, and Lane, 2006), for Chinese, and there it is justified by the difficulties of the Chinese language (word segmentation).

¹ The string kernel used in (Popescu and Dinu, 2007) takes into account substrings of length up to 5 characters.

There is a strong connection between research in NLP and research in computer network security. In recent years, network security research has started to approach the problem of automatically detecting unknown attacks as soon as they reach the targeted system. These attacks may follow the syntax but try to exploit the semantics of the network communication between the client and the server applications, in order to gain access to the attacked computer or at least to prevent it from working normally. The communication process defined by the application layer protocols – e.g. HTTP, FTP, RPC or IMAP – can also be regarded as text-based communication in an artificial language. The idea of payload analysis, which treats the data as sequences of bytes, has been explored in detail (Kruegel, Toth, and Kirda, 2002; Wang and Stolfo, 2004; Rieck and Laskov, 2006; Wang, Parekh, and Stolfo, 2006; Rieck and Laskov, 2007). As the focus in this field shifted towards applying more advanced machine learning methods, generalizing the extraction and representation of the features has greatly increased the flexibility in defining similarity measures between sequential data in a security context. The work (Rieck and Laskov, 2008) presents an efficient way to combine features extracted from byte sequences, e.g. words or n-grams with arbitrary n, for a wide range of linear and non-linear similarity measures.

Graphical methods for comparing sequences have been used in many fields, mostly under the name dotplot – see (Maizel and Lenk, 1981) for one of the first uses in biology and (Church and Helfman, 1993) for uses in source text comparison. While very attractive for exploratory data analysis, building this graphic is potentially quadratic in time and space. It also tends to be noisy, showing many irrelevant coincidences between the sequences compared. Even with these limitations, the method has been applied to source code, videos, music, protein and other biological sequences, with various ways to filter the noisy graphics and to handle the problem of the potentially quadratic size. We improve on this technique by deriving our own linear-space, linear-time technique, which we named encoplot, short for "eN-gram COincidence PLOT". It is fully described in Section 2.3, with code in Appendix 1.

Our plagiarism detection method can be described as a combination of techniques from many fields: it is character n-gram based; it leverages very efficient network security software to compute the matrices of kernel values; and it uses the very fast encoplot algorithm, processing the encoplot data in a quantitative fashion to solve what can be seen as a rudimentary machine vision task, or a specialized 2-dimensional data clustering task, in order to identify the matching text passages for a given document pair, as explained thoroughly below.

In what follows, the dataset specifics and the time performance figures refer to the dataset of the 1st International Competition on Plagiarism Detection, external plagiarism detection task (Webis at Bauhaus-Universität Weimar and NLEL at Universidad Politécnica de Valencia, 2009). The development corpus of this dataset contained about 7000 source documents and 7000 suspicious ones, with the plagiarism generated automatically with various degrees of obfuscation (permutations, words deleted, inserted or replaced by synonyms or antonyms) and annotated. The competition corpus had the same characteristics (on different documents), with the annotation withheld.
2 Methods

Our approach consists of two main phases. In the first phase, a matrix of string kernel values is computed, which gives a similarity value between each source and each suspicious document. Then, for each source, the possible "destinations" (suspicious documents) are ranked by their similarity to the current source, in decreasing order. In the second phase, each promising pair is investigated further, in order to extract the precise positions and lengths of the subtexts that have been copied and perhaps obfuscated by the random plagiarist. In the end we apply a supplementary filtering that increases the precision at the price of decreasing the recall.

2.1 Selecting a kernel and computing the matrix of kernel values for a large set of documents

Based on the work of (Rieck and Laskov, 2008), a C library for sequential data, libmindy, has been implemented by our network security research group. It has been developed mainly for use in building real-time network analysis tools at packet level, as part of network intrusion detection and prevention systems. It can map byte sequences to a vectorial n-gram representation, such that the similarity between two byte sequences can be expressed in terms of distance and kernel functions on those representations. The n-gram extraction set of feasible byte sequences is given by A_n = Σ^n, where Σ is the alphabet (in our case the whole 8-bit ASCII set). The n-gram embedding function φ for a byte sequence x is then defined as φ(x) = (φ_ng(x))_{ng ∈ A_n} with φ_ng(x) = emb(x, ng), where the dimension of the vector φ(x) is |A_n|. The function emb(x, ng) returns either the frequency, the count or the presence bit of the n-gram ng in x. With the embedding function φ fixed, one can compute a pairwise similarity value for the vectorial representations of two byte sequences. Table 1 presents a selection of the implemented distances and similarity measures that we could have used (x and y being arbitrary byte sequences).

  Minkowski distance:  d(x, y) = (Σ_{ng ∈ A_n} |φ_ng(x) − φ_ng(y)|^k)^{1/k}
  Canberra distance:   d(x, y) = Σ_{ng ∈ A_n} |φ_ng(x) − φ_ng(y)| / (φ_ng(x) + φ_ng(y))
  Linear kernel:       k(x, y) = Σ_{ng ∈ A_n} φ_ng(x) · φ_ng(y)
  RBF kernel:          k(x, y) = exp(−‖φ(x) − φ(y)‖² / (2σ²))

Table 1: Distance and kernel functions for sequential data.

Experiments with a very small subset of only 5 documents, together with our previous experience with string kernels, led us to use the linear kernel over a representation where every n-gram present is marked by 1 and every other by 0 (thus ignoring the frequencies of the n-grams). The kernel was normalized, such that K(x, x) = 1 for any string x. For the length of the n-grams we used 16 characters. Although by our estimations 18 should have been better (closer to three times the average word length plus two separators), the speed-up of the software used can only be obtained up to n-grams of length 16 – see below and Appendix 1 for details. Using windows of two to three words was found to be the best choice in plagiarism detection by (Lyon, Barrett, and Malcolm, 2004) and (Barrón-Cedeño and Rosso, 2009).
The computation of a matrix of kernel values with sides as large as 7000 is computationally intensive. There are more than 49 million pairs of documents for which the kernel value has to be computed in each of the two datasets, the development and the competition corpus, accounting for a total of more than 98 million pairs to consider. libmindy already had a tool for building a (symmetric) kernel matrix for a set of documents. We extended this tool to handle asymmetric matrices of kernel values, computed for each x ∈ X and y ∈ Y, where X and Y are two independent finite sets of files, not necessarily of the same cardinality. While the new tool could in principle perform the task fast enough, it would have needed about 400 GB of RAM for a kernel based on length-16 n-grams. To avoid this issue, we partitioned the matrix of kernel values into blocks of sizes up to 1000×1000 (1 million pairs in most blocks), which required only 8 to 10 GB of RAM for processing. The 64 blocks per dataset were processed one after the other, but the processing of each block was fully parallelized on the 8 cores of the machine, by internally distributing the tasks by means of OpenMP programming.
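The blocking scheme can be sketched as follows. The names `kernel_matrix_blocked` and `kernel_fn` are hypothetical, the kernel itself is abstracted behind a function pointer, and the OpenMP parallelism of the real tool is only indicated as a comment; this is not the actual libmindy tool.

```c
#include <stddef.h>

/* Sketch of the blocking scheme: the kernel matrix is computed block
   by block to bound memory, and inside a block the pairs can be
   evaluated in parallel (the real tool used OpenMP there). */
typedef double (*kernel_fn)(int src_id, int susp_id);

void kernel_matrix_blocked(int n_src, int n_susp, int block,
                           kernel_fn k, double *K /* n_src x n_susp */) {
    for (int bi = 0; bi < n_src; bi += block)
        for (int bj = 0; bj < n_susp; bj += block) {
            int ei = bi + block < n_src ? bi + block : n_src;
            int ej = bj + block < n_susp ? bj + block : n_susp;
            /* real tool: #pragma omp parallel for over this loop */
            for (int i = bi; i < ei; i++)
                for (int j = bj; j < ej; j++)
                    K[(size_t)i * n_susp + j] = k(i, j);
        }
}

/* toy kernel for demonstration only: k(i, j) = i + j */
double toy_kernel(int src_id, int susp_id) {
    return (double)(src_id + susp_id);
}
```

Only the documents of the current block need to be held in their n-gram representation, which is what brings the memory requirement down from hundreds of gigabytes to a few.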
Processing a full dataset took 12 hours on the machine we used (a Dell Precision T7400). Although we had access to a cluster, it offered only a 32-bit environment. This would have slowed the whole processing by a factor that would have almost completely eliminated the advantage of having 8 to 12 times more computing cores, which is why we decided to use a single multi-core computer.

2.2 Pruning of the pairs

If the total processing for one pair of documents (up to book length) took only one second, this would still lead to a total computation time of more than three years! Even by successfully parallelizing this task and dividing the time by hopefully 8 (the number of computing cores), the time needed would have been more than 4 months. It was obvious that, even with the matrix of kernel values computed, there is too much work in comparing the documents in each pair. Pruning was seen from the start as a requirement; the question was what effect it would have on limiting the performance that can be achieved. We considered ranking the pairs such that the ones with the best chances of corresponding to plagiarism come first. Ranking on the absolute values of the kernel proved to work worst. Ranking the suspicious documents for each source proved to provide a consistent 10% advantage over ranking the sources for each suspicious document. Therefore, given also the values that can be seen in Figure 1, we decided to limit our effort to the 51 most promising suspicious documents for each given source.

Figure 1: Maximum achievable recall for different pruning thresholds. Ranking the suspicious documents for each source leads consistently to better values than ranking the sources for each suspicious document.

2.3 Comparing two documents – the encoplot

With the maximum effort down to an estimate of about 100 hours – assuming on average one second per exhaustive document comparison, with the hope of reducing it to 12 hours by multicore parallelism – we proceeded to search for a way to identify what the documents have in common, if anything. Essential to this was the visualization of the coincidence pattern of n-grams between two documents: a scatter plot of a sublist of the positions where both texts have the same n-gram. We call this plot encoplot. Plots computed for pairs in the development corpus can be seen in Figures 2 and 3.

Related ideas (the "dotplot" graphs) exist for visualizing the n-grams that two texts (or sequences) share.
The problem with those is that the number of pairs can be quadratic in the size of the documents. For megabyte-long texts, this can easily become computationally intractable. We solve this issue by limiting ourselves to a sublist that is guaranteed to be no longer than the shorter of the documents, and that can be computed in linear time. The precise procedure we employed starts by virtually sorting the sets of n-grams of both documents to be compared. Then these ordered sets of n-grams are compared with a procedure derived from the procedure for merging two sorted lists. Every time the smallest elements of the two lists differ, the smaller of them is dropped, without producing any output. Every time the smallest elements of the lists are equal, the pair of positions at which this identical n-gram occurs is collected, by outputting it to the standard output. Code for this core procedure is given in Appendix 1. Please note that encoplot pairs the first instance of an n-gram in one document with the first instance of the same n-gram in the other document, the second with the second, and so on – as opposed to the dotplot, which pairs each instance with each instance.

Figure 2: Encoplot for source #3094 and suspicious #9. Many plagiarism instances for the same document pair; the shattered look of some comes from higher obfuscation. In red, the local contiguity score, scaled.

Figure 3: Encoplot for source #134 and suspicious #2499 – a real case of human (self-)plagiarism.
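The merge at the core of encoplot can be illustrated on already-sorted lists. The names `encoplot_merge` and `entry` are hypothetical, and in the real program (Appendix 1) the sorted order comes from the virtual radix sort rather than from precomputed arrays; the merge logic, however, is the one described above.

```c
#include <stddef.h>
#include <stdint.h>

/* Both inputs are (n-gram value, position) lists sorted by value,
   with ties sorted by position, so the k-th occurrence of an n-gram
   in one document gets paired with the k-th occurrence in the other,
   unlike the dotplot, which pairs every occurrence with every
   occurrence. Runs in linear time, emits at most min(na, nb) pairs. */
typedef struct { uint64_t val; int pos; } entry;

size_t encoplot_merge(const entry *a, size_t na,
                      const entry *b, size_t nb,
                      int *px, int *py /* output position pairs */) {
    size_t i = 0, j = 0, out = 0;
    while (i < na && j < nb) {
        if (a[i].val == b[j].val) {        /* coincidence: emit the pair */
            px[out] = a[i].pos; py[out] = b[j].pos; out++;
            i++; j++;
        } else if (a[i].val < b[j].val) {  /* drop the smaller, no output */
            i++;
        } else {
            j++;
        }
    }
    return out;
}
```

Because each step consumes at least one element of one list and emits at most one pair, both the running time and the output size are linear in the shorter document.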
2.4 Heuristics used for separating the copied subtexts

Once the encoplot data (the list of pairs of indexes) is obtained, it is sorted by the value of the first index in each pair, which corresponds to the position in the source of the common n-gram. From this list a local "contiguity" score is derived, by computing whether there is simultaneously a small jump on both indexes (sum of absolute jumps less than 4) when going from one pair to the next, followed by a smoothing by convolution with a constant vector of length 16. The contiguity score of an encoplot is displayed in red in Figures 2 and 3. Then a Monte Carlo optimization procedure is called, at most 30 times for each document pair, which in 10 attempts tries to find the largest group in the current encoplot data. The start of the group is decided randomly, with uniform distribution over the list of available pairs; the group is then extended to the left and to the right such that the average contiguity score stays above 0.5 and there are no jumps (skipped portions) longer than 512 in any 16 steps. After a group is obtained, it is checked to have an average contiguity score over 0.75 and a length of at least 256 characters. If not, it is rejected as insignificant.
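The contiguity scoring just described can be sketched as follows. The name `contiguity_score` is hypothetical, and the centered moving average is our reading of the smoothing "by convolution with a constant vector of length 16"; the jump threshold (sum of absolute jumps below 4) is as specified above.

```c
#include <stdlib.h>

/* sx and sy are the encoplot pairs sorted by source position; a pair
   scores 1 when the jump to the next pair is small on both axes
   (sum of the absolute jumps below 4), and the resulting 0/1 signal
   is then smoothed with a constant window of length 16. */
void contiguity_score(const int *sx, const int *sy, size_t n,
                      double *score /* length n */) {
    const size_t W = 16;
    double *raw = calloc(n ? n : 1, sizeof *raw);
    for (size_t i = 0; i + 1 < n; i++) {
        int jump = abs(sx[i + 1] - sx[i]) + abs(sy[i + 1] - sy[i]);
        raw[i] = (jump < 4) ? 1.0 : 0.0;
    }
    for (size_t i = 0; i < n; i++) {      /* centered moving average */
        double s = 0.0; size_t cnt = 0;
        for (size_t j = (i >= W / 2) ? i - W / 2 : 0;
             j < n && j < i + W / 2; j++) { s += raw[j]; cnt++; }
        score[i] = cnt ? s / cnt : 0.0;
    }
    free(raw);
}
```

On a perfectly contiguous diagonal run the score approaches 1, while scattered coincidences between unrelated documents stay near 0, which is what makes the 0.5 and 0.75 thresholds above meaningful.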
If kept, it is projected onto the dimension of the indexes that correspond to the suspicious document, and only its compact core is preserved. The compact core is obtained by sorting on the suspicious document axis and eliminating the outliers: starting from the group center, the group is extended to the left and to the right as long as the skips are less than 256 positions. What remains is projected back onto the source document axis, obtaining an estimate of the indexes whose convex hull defines the two subtexts corresponding to each other. This candidate plagiarism instance is checked once more: for a final length of at least 256, for not having shrunk to less than half of the initial group length, and for the two subtexts not having sizes too different (an absolute difference of more than half of the mean of the two lengths). This subset of the encoplot data is then removed, the plagiarism instance is output if all tests succeeded, and the procedure is repeated in the search for more groups. If the group found fails to satisfy the checks, it is counted as a failure. After three consecutive failures, the search is abandoned and the treatment of the pair of documents is considered completed. This decision may be risky, but it substantially accelerates this phase, as on very complicated document pairs it can take minutes to examine an involved pair completely. On the other hand, for the actually unrelated documents this ends the investigation rapidly. Technically, we accelerated this processing phase even more by running up to 10 detailed examinations of document pairs simultaneously, trying to balance the processing power required and the disk latency.

3 Results

We combined the best F-measure – the harmonic mean of precision and recall – of 0.6976 (the next competitor had 0.6192) with the best granularity – lack of fragmentation in the detection of the plagiarized passages – of 1.0027 (the next best value was 1.0164), thus winning the competition.

4 Discussion and Conclusions

The first question is whether our choice to compare the documents in pairs was optimal. Indexing-based methods could be faster, by eliminating the need for exhaustive pairwise comparison of the documents in a large corpus. They function by first indexing the collection of source documents and then searching for parts of the suspicious documents in the index, as the system MOSS (Schleimer, Wilkerson, and Aiken, 2003) does. Such an inflexible approach cannot handle obfuscation well, as opposed to ours. On the other hand, flexible matching is an ever-current research topic in information retrieval systems (Navarro, 2001), and progress there eventually improves plagiarism detection as well. We think that, while needing more computational effort, our approach had the chance of producing better results. And, as a consequence of using highly optimized network analysis code, it did so in a reasonable time, even when run on a single contemporary computer, as opposed to a full cluster. One could say that it was closer to optimal in terms of quality of the results, while still being acceptable in terms of running time.

A second question of interest is whether our values for the hyperparameters of the method are optimal for this dataset. The answer is probably no, but maybe not far from it. They have been chosen by educated guess, guided by exploratory data analysis, as opposed to blindly optimizing a cross-validation score towards the best (over)fitting.

The third interesting issue is the claim of some experts that only humans can have very good results at spotting plagiarism (Weber-Wulff, 2008). We think that, as far as ethics is concerned, a human must look at the evidence before claiming a case as one of plagiarism. And of course, text understanding is still not within the reach of artificial intelligence. On the other hand, the claim that automatization in plagiarism detection should be limited to using one's favorite search engine to search for paragraphs selected by intuition is questionable. How would such an expert deal with 7000 documents of up to book length? How long would it take to process those by hand, even using a public search engine? How long does it take one to read 7000 works/books? The need for automatization seems evident, as it was to (Grozea, 2004) when he had to grade 400 projects from 60 students in less than 24 hours. Crowdsourcing could also be a possibility, but one needs very big crowds for that (optimally of quadratic size, if using the same choice in the trade-off between speed and quality as we did). Time is the key factor in plagiarism detection.

Given the very good results obtained by our method, it is worth asking – and further investigating – whether using character n-grams offers any advantage over using word n-grams. First, let us note that our method uses n-grams of 16 characters, which on average² correspond to word trigrams (the standard approach in plagiarism detection). It may seem that (on average) the same information is carried by 16-character n-grams and word trigrams. What differentiates the two types of n-grams is, in our opinion, the fact that character n-grams favor long words over short ones; when people copy text, they do it for the content words of the copied text, which tend to be longer than the functional words (stop words), which are short. For example, a common syntagmatic expression³ like "as far as" will contribute one word trigram, but no character 16-gram. On the other hand, a sequence of content words (worth being copied) like "educated guess guided" will again contribute only one word trigram, but 6 character 16-grams.

² The average word length in the corpus is 5.2.
³ Frequently and systematically co-occurring lexical items.

Another item to discuss is how to balance precision and recall in automatic plagiarism detection systems. Given that a human is in many cases the final link in the chain that leads to the proof of plagiarism, the effort of that human must be spared as much as possible. Moreover, the accusation of plagiarism is so strong that it needs strong evidence. Both aspects recommend balancing precision and recall towards high precision, even at the expense of lowering the recall. This is how we tuned our system's parameters, including but not limited to the last checking phase. Of course, an accurate comparison of systems should take into account the entire precision-recall curve. By plotting these curves for several systems on the same graph, one could easily see where the best performance region of each system is, and whether or not one of the systems is overall better than another.

Related to the maximum achievable precision while keeping a fair recall is the issue of the independence of the documents and of the automatic plagiarism. The dataset contains plagiarism built automatically and randomly, and only these borrowings between the source documents and the suspicious documents had to be found. But the documents were not independent enough: there are pairs of documents with the same or almost the same content, such as independent translations of "One Thousand and One Nights" or several Bible editions, and authors doing heavy reuse of their previous works (the so-called self-plagiarism). These are interesting in two ways: they are better examples of what human plagiarism looks like, so spotting them as related is very good. On the other hand, they can be seen as plagiarism unintended by the organizers, so any such pair reported will actually lower the precision score.

A very interesting issue is the asymmetry of the ranking quality. Why is it 10% better to rank all suspicious documents for any fixed source than to rank all possible sources for every fixed suspicious document, as clearly seen in Figure 1? A possible source of this asymmetry is that, while it was guaranteed for each suspicious document that the plagiarized areas do not overlap, this was not the case for the source documents, where the plagiarized areas could overlap. This asymmetry deserves more investigation, being one of the few glints of hope so far towards tackling what could be the biggest open problem in automatic plagiarism detection: determining the direction of plagiarism in a pair of documents – being able to indicate with confidence which is the copy and which is the original.

To conclude, by combining advanced software engineering with effort-sparing heuristics tuned using the novel visualization technique encoplot, we have been able to achieve the top placement in the final results, proving that the interaction of NLP researchers with network security researchers can lead to high-performance NLP systems.

4.1 Acknowledgment

We thank Dr. Andreas Ziehe and the anonymous reviewers for the thorough review of our paper and for the useful suggestions.
References

Bao, Jun-Peng, Caroline Lyon, and Peter C. R. Lane. 2006. Copy detection in Chinese documents using Ferret. Language Resources and Evaluation, 40(3-4):357–365.

Bao, Jun-Peng, Jun-Yi Shen, Xiao-Dong Liu, Hai-Yan Liu, and Xiao-Di Zhang. 2003. Document copy detection based on kernel method. In Proceedings of the Natural Language Processing and Knowledge Engineering Conference (IEEE), pages 250–255.

Bao, Jun-Peng, Jun-Yi Shen, Xiao-Dong Liu, Hai-Yan Liu, and Xiao-Di Zhang. 2004a. Finding plagiarism based on common semantic sequence model. In Qing Li, Guoren Wang, and Ling Feng, editors, WAIM, volume 3129 of Lecture Notes in Computer Science, pages 640–645. Springer.

Bao, Jun-Peng, Jun-Yi Shen, Xiao-Dong Liu, Hai-Yan Liu, and Xiao-Di Zhang. 2004b. Semantic sequence kin: A method of document copy detection. In Honghua Dai, Ramakrishnan Srikant, and Chengqi Zhang, editors, PAKDD, volume 3056 of Lecture Notes in Computer Science, pages 529–538. Springer.

Barrón-Cedeño, Alberto and Paolo Rosso. 2009. On automatic plagiarism detection based on n-grams comparison. In Mohand Boughanem, Catherine Berrut, Josiane Mothe, and Chantal Soulé-Dupuy, editors, ECIR 2009, volume 5478 of LNCS, pages 696–700, Toulouse, France. Springer.

Barrón-Cedeño, Alberto, Paolo Rosso, and José-Miguel Benedí. 2009. Reducing the plagiarism detection search space on the basis of the Kullback-Leibler distance. In Alexander F. Gelbukh, editor, CICLing 2009, volume 5449 of Lecture Notes in Computer Science, pages 523–534, Mexico City, Mexico. Springer.

Church, K. W. and J. I. Helfman. 1993. Dotplot: A program for exploring self-similarity in millions of lines of text and code. Journal of Computational and Graphical Statistics, pages 153–174.

Grozea, C. 2004. Plagiarism detection with state of the art compression programs. Report CDMTCS-247, Centre for Discrete Mathematics and Theoretical Computer Science, University of Auckland, Auckland, New Zealand, August.

Kang, NamOh, Alexander F. Gelbukh, and Sang-Yong Han. 2006. PPChecker: Plagiarism pattern checker in document copy detection. In Petr Sojka, Ivan Kopecek, and Karel Pala, editors, TSD, volume 4188 of Lecture Notes in Computer Science, pages 661–667. Springer.

Kruegel, C., T. Toth, and E. Kirda. 2002. Service specific anomaly detection for network intrusion detection. In Proc. of ACM Symposium on Applied Computing, pages 201–208.

Lodhi, Huma, Craig Saunders, John Shawe-Taylor, Nello Cristianini, and Christopher J. C. H. Watkins. 2002. Text classification using string kernels. Journal of Machine Learning Research, 2:419–444.

Lyon, Caroline, Ruth Barrett, and James Malcolm. 2004. A theoretical basis to the automated detection of copying between texts, and its practical implementation in the Ferret plagiarism and collusion detector. In Plagiarism: Prevention, Practice and Policies Conference.

Maizel, J. V. and R. P. Lenk. 1981. Enhanced graphic matrix analysis of nucleic acid and protein sequences. Proceedings of the National Academy of Sciences, 78(12):7665–7669.

Navarro, G. 2001. A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31–88.

Popescu, Marius and Liviu P. Dinu. 2007. Kernel methods and string kernels for authorship identification: The Federalist Papers case. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-07), Borovets, Bulgaria, September.

Rieck, Konrad and Pavel Laskov. 2008. Linear-time computation of similarity measures for sequential data. Journal of Machine Learning Research, 9:23–48.

Rieck, Konrad and Pavel Laskov. 2006. Detecting unknown network attacks using language models. In Detection of Intrusions and Malware, and Vulnerability Assessment, Proc. of 3rd DIMVA Conference, LNCS, pages 74–90, July.

Rieck, Konrad and Pavel Laskov. 2007. Language models for detection of unknown attacks in network traffic. Journal in Computer Virology, 2(4):243–256.

Sanderson, Conrad and Simon Guenter. 2006. Short text authorship attribution via sequence kernels, Markov chains and author unmasking: An investigation. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 482–491, Sydney, Australia.

Schleimer, S., D. S. Wilkerson, and A. Aiken. 2003. Winnowing: Local algorithms for document fingerprinting. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pages 76–85. ACM, New York, NY, USA.

Wang, K., J. J. Parekh, and S. J. Stolfo. 2006. Anagram: A content anomaly detector resistant to mimicry attack. Pages 226–248.

Wang, K. and S. J. Stolfo. 2004. Anomalous payload-based network intrusion detection. In Recent Advances in Intrusion Detection (RAID).

Appendix 1

The key element of the implementation is an incremental version of the radix sort algorithm for virtually sorting the n-grams of a text without swapping any memory blocks; it is a specialization of the general radix sort algorithm. The key part is avoiding the recomputation of the byte frequencies at each step of the radix sort, relying instead on updating them incrementally. Another key technical aspect is the use of the 128-bit unsigned integer type __uint128_t, available with the gcc compiler on certain platforms, which allows for very good speed up to n-grams of length 16 on 64-bit architectures such as the common x86-64. The main code uses this virtual sorting of the n-gram sets to compute the encoplot data of two given files, a central part of our plagiarism detection method, as explained above.

// computes the encoplot data of a pair of files
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef __uint128_t tngram;   /* one n-gram, up to 16 bytes (gcc, 64-bit) */
#define DEPTH 16              /* n-gram length */
#define fr(x, y) for (int x = 0; x < (y); x++)

/* virtually sort the NN n-grams of x (NN + DEPTH bytes accessible)
   without swapping any memory block; ix receives the positions in
   sorted order; the counters are updated incrementally between the
   digit passes instead of being recomputed */
void rsort(const unsigned char *x, int NN, int *ix) {
    const int RANGE = 256;
    int counters[RANGE], startpos[RANGE];
    int *ox = (int *)malloc(NN * sizeof(int));
    fr(i, NN) ix[i] = i;
    fr(k, RANGE) counters[k] = 0;
    fr(i, NN) counters[x[i]]++;
    fr(j, DEPTH) {            /* byte 0 is the least significant: low endian */
        int ofs = j, sp = 0;
        fr(k, RANGE) { startpos[k] = sp; sp += counters[k]; }
        fr(i, NN) { unsigned char c = x[ofs + ix[i]]; ox[startpos[c]++] = ix[i]; }
        memcpy(ix, ox, NN * sizeof(ix[0]));
        if (j < DEPTH - 1) {  /* update the counters incrementally */
            counters[x[j]]--; counters[x[j + NN]]++;
        }
    }
    free(ox);
}

tngram ng(const unsigned char *p) {  /* load one n-gram */
    tngram v; memcpy(&v, p, sizeof(v)); return v;
}

unsigned char *load(const char *fn, int *n) {
    FILE *f = fopen(fn, "rb");
    if (!f) { perror(fn); exit(1); }
    fseek(f, 0, SEEK_END); long sz = ftell(f); fseek(f, 0, SEEK_SET);
    unsigned char *x = (unsigned char *)calloc(sz + DEPTH, 1);
    if (fread(x, 1, sz, f) != (size_t)sz) { perror(fn); exit(1); }
    fclose(f);
    *n = (int)sz;
    return x;
}

int main(int argc, char **argv) {
    if (argc < 3) { fprintf(stderr, "usage: %s source suspicious\n", argv[0]); return 1; }
    int n1, n2;
    unsigned char *x1 = load(argv[1], &n1), *x2 = load(argv[2], &n2);
    int *ix1 = (int *)malloc(n1 * sizeof(int)), *ix2 = (int *)malloc(n2 * sizeof(int));
    rsort(x1, n1, ix1); rsort(x2, n2, ix2);
    /* merge the two sorted n-gram lists: on a match output the pair of
       positions, otherwise drop the smaller element (Section 2.3) */
    int i = 0, j = 0;
    while (i < n1 && j < n2) {
        tngram a = ng(x1 + ix1[i]), b = ng(x2 + ix2[j]);
        if (a == b) { printf("%d %d\n", ix1[i], ix2[j]); i++; j++; }
        else if (a < b) i++; else j++;
    }
    return 0;
}