Plagiarism Alignment Detection by Merging Context Seeds
Notebook for PAN at CLEF 2014

Philipp Gross¹ and Pashutan Modaresi¹,²

¹ pressrelations GmbH, Düsseldorf, Germany
{philipp.gross, pashutan.modaresi}@pressrelations.de
² Heinrich-Heine-University of Düsseldorf, Institute of Computer Science, Düsseldorf, Germany
modaresi@cs.uni-duesseldorf.de

Abstract. We describe our algorithm submitted to the text alignment sub-task of the plagiarism detection task in the PAN 2014 challenge, where it achieved a plagdet score of 0.855. By extracting contextual features for each document character and grouping those that are relevant for a given pair of documents, we generate seeds of atomic plagiarism cases. These are then merged by an agglomerative single-linkage strategy using a defined distance measure.

1 Introduction

Given a pair of text documents, text alignment is the task of identifying all pairs of contiguous passages that are equal up to obfuscation. The various obfuscation strategies pose a challenge for this task: they range from simple randomized operations like word or sentence permutations to more sophisticated transformations like semantic word variation or translation cycles, or even manual paraphrasing.

Text alignment is a sub-task of the PAN 2014 plagiarism detection task, which also contains the sub-task source retrieval, concerning the retrieval of source documents for a given suspicious document. Our submission only deals with the problem of text alignment. We follow the common strategy explained in [2]:

1. Seed generation: Given a suspicious document X and a source document Y, select a set of likely plagiarism cases, which are pairs of small passages in X and Y that are very similar with respect to a defined measure.
2. Merging: Merge two plagiarism cases whose passages in X and Y are close to each other. Repeat this step until there are no adjacent plagiarism cases left.
3. Extraction and filtering: Postprocess the remaining plagiarism cases, filter outliers, and generate the output plagiarism cases.

In the following we describe our submitted algorithm in detail and give the evaluation results for various corpora.

2 Problem Statement

In this section we formally define the problem of text alignment as part of the PAN 2014 plagiarism detection task. The precise objective of this task is, for a given set of plagiarism cases S, to find a set of detections R such that the score plagdet(S, R) is high. We follow a slightly different terminology compared to [3] and recall all definitions for the reader's convenience.

A document of length nX is a finite totally ordered set X = {xi : i = 0, . . . , nX} of (positioned) characters xi = (c, i), c ∈ C, where C denotes some finite set of symbols. A passage P ⊆ X is a connected subset, thus either empty or of the form

    P = {xi : 0 ≤ a ≤ i < b ≤ nX}.                                    (1)

For brevity we use the notation of closed intervals and write P = [xa, xb] for non-empty passages.

Given a pair of documents (X, Y) we define a passage reference r as a rectangular subset of the Cartesian product set, r ⊆ X × Y. Every non-empty passage reference is of the form

    r = [xa, xb] × [yc, yd] = {(xi, yj) : 1 ≤ a ≤ i < b ≤ nX, 1 ≤ c ≤ j < d ≤ nY}    (2)

for passages [xa, xb] ⊆ X and [yc, yd] ⊆ Y. Each pair (x, y) ∈ X × Y gives rise to a passage reference by taking the singleton set {(x, y)}, called a seed. It is a minimal passage reference. Note that every non-empty passage reference is the linear span of finitely many seeds.
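To make this interval arithmetic concrete, the following minimal Python sketch (our own illustration, not part of the submitted system) represents a passage reference as a pair of half-open index intervals and implements seeds, intersections, and the span of a set of references.

```python
from typing import List, Optional, Tuple

# A passage reference r = [x_a, x_b] x [y_c, y_d] is stored as a pair of
# half-open index intervals ((a, b), (c, d)) into documents X and Y.
PassageRef = Tuple[Tuple[int, int], Tuple[int, int]]

def seed(i: int, j: int) -> PassageRef:
    """The minimal passage reference {(x_i, y_j)}: a single character pair."""
    return ((i, i + 1), (j, j + 1))

def intersect(r: PassageRef, s: PassageRef) -> Optional[PassageRef]:
    """Intersection of two passage references; None if it is empty."""
    (a1, b1), (c1, d1) = r
    (a2, b2), (c2, d2) = s
    a, b = max(a1, a2), min(b1, b2)
    c, d = max(c1, c2), min(d1, d2)
    return ((a, b), (c, d)) if a < b and c < d else None

def span(refs: List[PassageRef]) -> PassageRef:
    """Smallest passage reference containing all given ones (their "linear
    span"): the bounding rectangle of the underlying intervals."""
    a = min(r[0][0] for r in refs)
    b = max(r[0][1] for r in refs)
    c = min(r[1][0] for r in refs)
    d = max(r[1][1] for r in refs)
    return ((a, b), (c, d))

# Example: the span of two seeds is the rectangle [3, 7] x [10, 21]
# (in the closed-interval notation of the text).
assert span([seed(3, 10), seed(7, 21)]) == ((3, 8), (10, 22))
```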
We say that a passage reference r detects another passage reference s if both belong to the same product space X × Y and have non-empty intersection r ∩ s. The latter is again a passage reference. By embedding each document into the disjoint union of all documents, the definition of r ∩ s extends naturally to passage references with different pairs of parent documents (namely, by the empty intersection).

We define the perimeter of a passage reference r as

    π(r) = 2(b − a) + 2(d − c) if r = [xa, xb] × [yc, yd], and π(∅) = 0.    (3)

The union of passage references is in general not a passage reference, but the perimeter extends in a natural way to such sets by taking the (one-dimensional) volume of the boundary. The upshot of the perimeter is that a passage reference r detects another passage reference s if and only if π(r ∩ s) > 0.

A set (or corpus) of plagiarism cases S is just a set of passage references for varying document pairs (X, Y), X, Y ∈ D, for some set of documents D. The quality of a set of detections R is evaluated by the numerical plagdet score. It is a composition of the micro precision, micro recall and granularity. For the sake of completeness, we recall their definitions. With S · R = {s ∩ r | s ∈ S, r ∈ R} let

    prec(S, R) = π(S · R) / π(R),   rec(S, R) = π(S · R) / π(S).    (4)

Precision and recall give rise to the classical F1-measure, i.e. the harmonic mean of precision and recall. In order to penalize fragmented passage references, one weights the F1-measure with the granularity

    gran(S, R) = ( Σ_{s ∈ SR} |Rs| ) / |SR| ∈ [1, |R|],    (5)

where SR = {s | s ∈ S, ∃ r ∈ R that detects s} and Rs = {r | r ∈ R, r detects s}. Then the plagdet score is defined as the weighted F1-measure

    plagdet(S, R) = F1(prec(S, R), rec(S, R)) / log2(1 + gran(S, R)) ∈ [0, 1].    (6)

3 Feature Extraction and Seed Generation

In this section we describe our approach to extract seeds from X × Y for some pair of documents X and Y. We decided to apply feature extraction on a per-document basis. Hence, this step can be done in a preprocessing phase, before actually considering pairs of documents.

3.1 Extraction of contextual features for documents

Throughout the paper, F denotes the set of all features. As seen below, we use skip word n-grams as features, but for the sake of clarity we keep the exposition general at first. Given a document X = {xi} we map each character to a finite set of binary features xi ↦ ϕ(xi) = {fi1, . . . , fid}, d = d(i), fij ∈ F. Recall that the power set P(F) is the set of all subsets of F. Therefore, we have defined a feature map

    fX : X → P(F).    (7)

In return, by mapping F ∋ f ↦ {xi | f ∈ ϕ(xi)} ∈ P(X) we get an index map

    ιX : F → P(X).    (8)

It tells us at which character positions in the document a feature f is present.

In our approach we first tokenized the document X into a sequence of lowercased, stemmed words w0, w1, . . . by omitting whitespace, non-alphanumeric characters and stop words, and used skip-bigrams with skip distances 1 to 4:

    ϕ(xi) = {wβ wα : β = α − 4, . . . , α − 1}   if xi is the beginning character of a word wα,
    ϕ(xi) = ∅                                   otherwise.    (9)

Table 1 illustrates the feature extraction for the example phrase "The quick brown fox jumps over the lazy dog."

Table 1. Features extracted for the characters x0, . . . , x43 of the example string "The quick brown fox jumps over the lazy dog." Characters that are not at the beginning of a contiguous alphanumeric substring have no features and are omitted from the table. The symbol * is a placeholder for the empty word.

Offset  Token  f1           f2          f3          f4
4       quick  *_quick      *_quick     *_quick     *_quick
10      brown  quick_brown  *_brown     *_brown     *_brown
16      fox    brown_fox    quick_fox   *_fox       *_fox
20      jump   fox_jump     brown_jump  quick_jump  *_jump
35      lazy   jump_lazy    fox_lazy    brown_lazy  quick_lazy
40      dog    lazy_dog     jump_dog    fox_dog     brown_dog
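As an illustration of Eq. (9) and Table 1, the following sketch computes the skip-bigram features per token-initial character offset. It is a simplified stand-in for the actual pipeline: the stop word list is a toy one covering only this example, and the suffix rule merely imitates stemming.

```python
import re

STOP_WORDS = {"the", "a", "an", "over"}  # toy list, just enough for the example

def extract_features(text: str) -> dict:
    """Map each token-initial character offset to its skip-bigram features
    f1..f4 (Eq. (9)), i.e. the pairs w_beta w_alpha for beta = alpha-1..alpha-4."""
    tokens = []  # pairs (character offset, normalized token)
    for m in re.finditer(r"[A-Za-z0-9]+", text):
        word = m.group().lower()
        if word in STOP_WORDS:
            continue
        word = word[:-1] if word.endswith("s") else word  # crude stemming stand-in
        tokens.append((m.start(), word))
    features = {}
    for alpha, (offset, word) in enumerate(tokens):
        grams = []
        for beta in range(alpha - 1, alpha - 5, -1):  # skip distances 1..4
            prev = tokens[beta][1] if beta >= 0 else "*"  # '*' pads the left edge
            grams.append(prev + "_" + word)
        features[offset] = grams
    return features

# Reproduces the rows of Table 1, e.g. the features of "fox" at offset 16.
feats = extract_features("The quick brown fox jumps over the lazy dog.")
assert feats[16] == ["brown_fox", "quick_fox", "*_fox", "*_fox"]
```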
3.2 Selection of relevant features

Clearly, features which are not present at all, or which appear at almost every character, are useless for the text alignment task. In order to reduce the number of generated features, we apply a simple feature selection strategy: if ιX(f) is non-empty and has low cardinality, then we consider f meaningful. We say that f ∈ F is a relevant feature (for X) if its cardinality satisfies the bound

    1 ≤ |ιX(f)| ≤ ϱ    (10)

for some given threshold parameter ϱ, also called the relevance threshold. The subset of all relevant features is denoted by FX ⊆ F. In our approach we use a constant relevance threshold ϱ = 4. Due to time constraints we kept this simple approach, because it already worked quite well.

3.3 Feature extraction of document pairs and seed generation

The feature extraction for documents carries over to pairs of documents. Given a suspicious document X and a source document Y, their index maps give rise to a natural index map of X × Y:

    ιXY : F → P(X × Y),   f ↦ {(xi, yj) | f ∈ ϕ(xi) and f ∈ ϕ(yj)}.    (11)

It maps a feature f to the set of all pairs (xi, yj) of characters such that f is simultaneously present at xi and yj. Now let FX ∩ FY ⊆ F be the subset of features which are relevant for both X and Y. The union

    σ(X, Y) = ∪_{f ∈ FX ∩ FY} ιXY(f) ⊆ X × Y    (12)

is the seed set of plagiarism cases between X and Y. These atomic plagiarism cases are the starting point of a merge process, which is explained in the next section.

We can deduce an estimate of the cardinality in terms of the relevance threshold. Namely, for each f ∈ FX ∩ FY we have

    1 ≤ |ιXY(f)| = |ιX(f)| · |ιY(f)| ≤ ϱ².    (13)

Hence, |σ(X, Y)| ≤ ϱ²|FX||FY|. Consequently, the number of seeds can be bounded in terms of X and Y alone, without having to inspect the pair (X, Y).

4 Merging

In this section we describe the process of merging passage references in the product space X × Y for two documents X and Y. The merge criterion is given in terms of a distance function, and the merge process is then agglomerative single-linkage clustering with an additional termination condition.

4.1 Merge criteria

In order to define a distance between two passage references, let us first introduce further notation. For two non-empty passages P1 = [xa1, xb1] and P2 = [xa2, xb2] in X let their distance be dist(P1, P2) = min{|u1 − u2| : a1 ≤ u1 ≤ b1, a2 ≤ u2 ≤ b2}. It is positive if and only if P1 and P2 are disjoint. Now, for two non-empty passage references P1 × Q1, P2 × Q2 ⊆ X × Y their distance is

    dist(P1 × Q1, P2 × Q2) = (2 dist(P1, P2) + 2 dist(Q1, Q2)) / (σ + π(P1 × Q1) + π(P2 × Q2)),    (14)

where σ > 0 denotes a constant smoothing parameter that is chosen empirically. The distance is zero if and only if P1 ∩ P2 ≠ ∅ and Q1 ∩ Q2 ≠ ∅, or equivalently (P1 × Q1) ∩ (P2 × Q2) ≠ ∅, and thus reflects the fact that the two passage references are directly adjacent. If the distance is positive but less than a given threshold τ, the passage references have empty intersection but are relatively close.

The merge of two passage references in X × Y is the smallest passage reference that contains both. It always contains their union but is in general strictly larger.
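The following sketch implements the distance of Eq. (14), the merge operation, and the single-linkage loop of the next subsection. It is again only an illustration, not the submitted implementation: the paper fixes σ and τ only empirically, so their values must be supplied by the caller.

```python
from typing import List, Tuple

Interval = Tuple[int, int]              # half-open character interval [a, b)
PassageRef = Tuple[Interval, Interval]  # rectangle in X x Y

def interval_dist(p: Interval, q: Interval) -> int:
    """min |u1 - u2| over positions u1 in p, u2 in q; zero iff they overlap."""
    (a1, b1), (a2, b2) = p, q
    return max(a2 - b1 + 1, a1 - b2 + 1, 0)

def perimeter(r: PassageRef) -> int:
    """Eq. (3): pi(r) = 2(b - a) + 2(d - c)."""
    (a, b), (c, d) = r
    return 2 * (b - a) + 2 * (d - c)

def ref_dist(r: PassageRef, s: PassageRef, sigma: float) -> float:
    """Eq. (14): gap between the rectangles, smoothed by their perimeters."""
    num = 2 * interval_dist(r[0], s[0]) + 2 * interval_dist(r[1], s[1])
    return num / (sigma + perimeter(r) + perimeter(s))

def merge(r: PassageRef, s: PassageRef) -> PassageRef:
    """Smallest passage reference containing both: the bounding rectangle."""
    (a1, b1), (c1, d1) = r
    (a2, b2), (c2, d2) = s
    return ((min(a1, a2), max(b1, b2)), (min(c1, c2), max(d1, d2)))

def cluster(refs: List[PassageRef], tau: float, sigma: float) -> List[PassageRef]:
    """Naive single-linkage merging (Sect. 4.2): repeatedly merge the closest
    pair until every pairwise distance exceeds tau. O(n^3), for clarity only."""
    refs = list(refs)
    while len(refs) > 1:
        pairs = [(ref_dist(refs[i], refs[j], sigma), i, j)
                 for i in range(len(refs)) for j in range(i + 1, len(refs))]
        d, i, j = min(pairs)
        if d > tau:
            break
        merged = merge(refs[i], refs[j])
        refs = [r for k, r in enumerate(refs) if k not in (i, j)]
        refs.append(merged)
    return refs
```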
4.2 Agglomerative clustering

Having defined the criteria for merging two passage references, we apply agglomerative single-linkage clustering. That is, in each step we merge a pair of passage references with minimal distance. If there is no pair whose distance is less than or equal to a given constant τ > 0, the process terminates.

5 Filtering and passage extraction

At the end of the merge process we remove all passage references whose suspicious passage has fewer than 15 words. The remaining passage references are the detected plagiarism cases.

6 Evaluation Results

We evaluated the algorithm on the data sets of the previous competition, PAN 2013 [3], and the current competition, PAN 2014, using Tira [1]. For completeness, we also state the runtime, although it does not affect the ranking of the algorithm in the competition. See Table 2 for the development results and Table 3 for the final results. We also ran smaller experiments restricted to a fixed obfuscation strategy (Table 4). Unsurprisingly, the algorithm underperforms on summary-obfuscated plagiarism cases, because we use no synonym dictionaries.

Table 2. Text alignment retrieval performance and runtime for the data sets of PAN 2013. The experiments were executed on an 8-core system.

Corpus                 Pairs  PlagDet  Precision  Recall  Granularity  Runtime
pan13-training-corpus  4007   0.825    0.935      0.743   1.00485      48s
pan13-test-corpus1     399    0.837    0.930      0.760   1.00000      5s
pan13-test-corpus2     4042   0.831    0.939      0.750   1.00421      41s

Table 3. Text alignment retrieval performance and runtime for the data sets of PAN 2014. The experiments were executed on a 1-core system provided by the host.

Corpus                 Pairs  PlagDet  Precision  Recall  Granularity  Runtime
pan14-training-corpus  5164   0.821    0.928      0.763   1.02805      183s
pan14-test-corpus2     ?      0.826    0.933      0.766   1.02514      180s
pan14-test-corpus3     ?      0.855    0.925      0.818   1.02187      169s

Table 4. Text alignment plagdet scores with respect to obfuscation strategies.

Corpus                      None   Random  Cyclic translation  Summary
training-corpus-2013-01-21  0.903  0.812   0.824               0.299
test-corpus1-2013-03-08     0.901  0.791   0.854               0.412
test-corpus2-2013-01-21     0.913  0.811   0.835               0.316

7 Conclusion

The simple heuristics used in our approach already work quite well, comparably to state-of-the-art text alignment algorithms for random and translation-cycle obfuscations. In order to tackle the problem of summary obfuscation in the future, we intend to incorporate more semantic knowledge into the feature extraction stage.

References

1. Gollub, T., Potthast, M., Beyer, A., Busse, M., Rangel, F., Rosso, P., Stamatatos, E., Stein, B.: Recent trends in digital text forensics and its evaluation. In: Forner, P., Müller, H., Paredes, R., Rosso, P., Stein, B. (eds.) Information Access Evaluation. Multilinguality, Multimodality, and Visualization, Lecture Notes in Computer Science, vol. 8138, pp. 282–302. Springer Berlin Heidelberg (2013)
2. Potthast, M., Hagen, M., Gollub, T., Tippmann, M., Kiesel, J., Rosso, P., Stamatatos, E., Stein, B.: Overview of the 5th international competition on plagiarism detection. In: CLEF 2013 Evaluation Labs and Workshop – Working Notes Papers (2013)
3. Potthast, M., Stein, B., Barrón-Cedeño, A., Rosso, P.: An evaluation framework for plagiarism detection. In: Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pp. 997–1005. COLING '10, Association for Computational Linguistics, Stroudsburg, PA, USA (2010), http://dl.acm.org/citation.cfm?id=1944566.1944681