Plagiarism Alignment Detection by Merging Context Seeds
Notebook for PAN at CLEF 2014

Philipp Gross¹ and Pashutan Modaresi¹,²

¹ pressrelations GmbH, Düsseldorf, Germany
{philipp.gross, pashutan.modaresi}@pressrelations.de
² Heinrich-Heine-University of Düsseldorf, Institute of Computer Science, Düsseldorf, Germany
modaresi@cs.uni-duesseldorf.de

Abstract. We describe our algorithm submitted to the text alignment sub-task of the plagiarism detection task in the PAN 2014 challenge, where it achieved a plagdet score of 0.855. By extracting contextual features for each document character and grouping those that are relevant for a given pair of documents, we generate seeds of atomic plagiarism cases. These are then merged by an agglomerative single-linkage strategy using a defined distance measure.

1 Introduction

Given a pair of text documents, text alignment is the task of identifying all pairs of contiguous passages that are equal up to obfuscation. The various obfuscation strategies pose a challenge for this task: they range from simple randomized operations like word or sentence permutations to more sophisticated transformations like semantic word variation or translation cycles, or even manual paraphrasing.

Text alignment is a sub-task of the PAN 2014 plagiarism detection task, which also contains the sub-task source retrieval, concerning the retrieval of source documents for a given suspicious document. Our submission only deals with the problem of text alignment. We follow the common strategy explained in [2]:

1. Seed generation: Given a suspicious document X and a source document Y, select a set of likely plagiarism cases, which are pairs of small passages in X and Y that are very similar with respect to a defined measure.
2. Merging: Merge two plagiarism cases whose passages in X and Y are close to each other. Repeat this step until there are no adjacent plagiarism cases left.
3. Extraction and filtering: Postprocess the remaining plagiarism cases, filter outliers, and generate the output plagiarism cases.

In the following we describe our submitted algorithm in detail and give the evaluation results for various corpora.

2 Problem Statement

In this section we formally define the problem of text alignment as part of the PAN 2014 plagiarism detection task. The precise objective of this task is, for a given set of plagiarism cases S, to find a set of detections R such that the score plagdet(S, R) is high. We follow a slightly different terminology compared to [3] and recall all definitions for the reader's convenience.

A document of length nX is a finite totally ordered set X = {xi : i = 0, . . . , nX} of (positioned) characters xi = (c, i), c ∈ C, where C denotes some finite set of symbols. A passage P ⊆ X is a connected subset, thus either empty or of the form

    P = {xi : 0 ≤ a ≤ i < b ≤ nX}.                                    (1)

For brevity we use the notation of closed intervals and write P = [xa, xb] for non-empty passages.

Given a pair of documents (X, Y) we define a passage reference r as a rectangular subset of the Cartesian product set, r ⊆ X × Y. Every non-empty passage reference is of the form

    r = [xa, xb] × [yc, yd] = {(xi, yj) : 1 ≤ a ≤ i < b ≤ nX, 1 ≤ c ≤ j < d ≤ nY}    (2)

for passages [xa, xb] ⊆ X and [yc, yd] ⊆ Y. Each pair (x, y) ∈ X × Y gives rise to a passage reference by taking the singleton set {(x, y)}, called a seed. It is a minimal passage reference. Note that every non-empty passage reference is the linear span of finitely many seeds.
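To make this interval arithmetic concrete, the following minimal Python sketch (our own illustration, not part of the submitted system) represents a passage reference as a pair of half-open index intervals and implements seeds, intersections, and the span of a set of references.

```python
from typing import List, Optional, Tuple

# A passage reference r = [x_a, x_b] x [y_c, y_d] is stored as a pair of
# half-open index intervals ((a, b), (c, d)) into documents X and Y.
PassageRef = Tuple[Tuple[int, int], Tuple[int, int]]

def seed(i: int, j: int) -> PassageRef:
    """The minimal passage reference {(x_i, y_j)}: a single character pair."""
    return ((i, i + 1), (j, j + 1))

def intersect(r: PassageRef, s: PassageRef) -> Optional[PassageRef]:
    """Intersection of two passage references; None if it is empty."""
    (a1, b1), (c1, d1) = r
    (a2, b2), (c2, d2) = s
    a, b = max(a1, a2), min(b1, b2)
    c, d = max(c1, c2), min(d1, d2)
    return ((a, b), (c, d)) if a < b and c < d else None

def span(refs: List[PassageRef]) -> PassageRef:
    """Smallest passage reference containing all given ones (their "linear
    span"): the bounding rectangle of the underlying intervals."""
    a = min(r[0][0] for r in refs)
    b = max(r[0][1] for r in refs)
    c = min(r[1][0] for r in refs)
    d = max(r[1][1] for r in refs)
    return ((a, b), (c, d))

# Example: the span of two seeds is the rectangle [3, 7] x [10, 21]
# (in the closed-interval notation of the text).
assert span([seed(3, 10), seed(7, 21)]) == ((3, 8), (10, 22))
```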
We say that a passage reference r detects another passage reference s if both belong to the same product space X × Y and have non-empty intersection r ∩ s. The latter is again a passage reference. By embedding each document into the disjoint union of all documents, the definition of r ∩ s extends naturally to passage references with different pairs of parent documents (namely, by the empty intersection).

We define the perimeter of a passage reference r as

    π(r) = 2(b − a) + 2(d − c) if r = [xa, xb] × [yc, yd], and π(∅) = 0.    (3)

The union of passage references is in general not a passage reference, but the perimeter extends in a natural way to such sets by taking the (one-dimensional) volume of the boundary. The upshot of the perimeter is that a passage reference r detects another passage reference s if and only if π(r ∩ s) > 0.

A set (or corpus) of plagiarism cases S is just a set of passage references for varying document pairs (X, Y), X, Y ∈ D, for some set of documents D. The quality of a set of detections R is evaluated by the numerical plagdet score. It is a composition of the micro precision, micro recall and granularity. For the sake of completeness, we recall their definitions. With S · R = {s ∩ r | s ∈ S, r ∈ R} let

    prec(S, R) = π(S · R) / π(R),   rec(S, R) = π(S · R) / π(S).    (4)

Precision and recall give rise to the classical F1-measure, i.e. the harmonic mean of precision and recall. In order to penalize fragmented passage references, one weights the F1-measure with the granularity

    gran(S, R) = ( Σ_{s ∈ SR} |Rs| ) / |SR| ∈ [1, |R|],    (5)

where SR = {s | s ∈ S, ∃ r ∈ R that detects s} and Rs = {r | r ∈ R, r detects s}. Then the plagdet score is defined as the weighted F1-measure

    plagdet(S, R) = F1(prec(S, R), rec(S, R)) / log2(1 + gran(S, R)) ∈ [0, 1].    (6)

3 Feature Extraction and Seed Generation

In this section we describe our approach to extract seeds from X × Y for some pair of documents X and Y. We decided to apply feature extraction on a per-document basis. Hence, this step can be done in a preprocessing phase, before actually considering pairs of documents.

3.1 Extraction of contextual features for documents

Throughout the paper, F denotes the set of all features. As seen below, we use skip word n-grams as features, but for the sake of clarity we keep the exposition general at first. Given a document X = {xi} we map each character to a finite set of binary features xi ↦ ϕ(xi) = {fi1, . . . , fid}, d = d(i), fij ∈ F. Recall that the power set P(F) is the set of all subsets of F. Therefore, we have defined a feature map

    fX : X → P(F).    (7)

In return, by mapping F ∋ f ↦ {xi | f ∈ ϕ(xi)} ∈ P(X) we get an index map

    ιX : F → P(X).    (8)

It tells us at which character positions in the document a feature f is present.

In our approach we first tokenized the document X into a sequence of lowercased, stemmed words w0, w1, . . . by omitting whitespace, non-alphanumeric characters and stop words, and used skip-bigrams with skip distances 1 to 4:

    ϕ(xi) = {wβ wα : β = α − 4, . . . , α − 1}   if xi is the beginning character of a word wα,
    ϕ(xi) = ∅                                   otherwise.    (9)

Table 1 illustrates the feature extraction for the example phrase "The quick brown fox jumps over the lazy dog."

Table 1. Features extracted for the characters x0, . . . , x43 of the example string "The quick brown fox jumps over the lazy dog." Characters that are not at the beginning of a contiguous alphanumeric substring have no features and are omitted from the table. The symbol * is a placeholder for the empty word.

Offset  Token  f1           f2          f3          f4
4       quick  *_quick      *_quick     *_quick     *_quick
10      brown  quick_brown  *_brown     *_brown     *_brown
16      fox    brown_fox    quick_fox   *_fox       *_fox
20      jump   fox_jump     brown_jump  quick_jump  *_jump
35      lazy   jump_lazy    fox_lazy    brown_lazy  quick_lazy
40      dog    lazy_dog     jump_dog    fox_dog     brown_dog
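As an illustration of Eq. (9) and Table 1, the following sketch computes the skip-bigram features per token-initial character offset. It is a simplified stand-in for the actual pipeline: the stop word list is a toy one covering only this example, and the suffix rule merely imitates stemming.

```python
import re

STOP_WORDS = {"the", "a", "an", "over"}  # toy list, just enough for the example

def extract_features(text: str) -> dict:
    """Map each token-initial character offset to its skip-bigram features
    f1..f4 (Eq. (9)), i.e. the pairs w_beta w_alpha for beta = alpha-1..alpha-4."""
    tokens = []  # pairs (character offset, normalized token)
    for m in re.finditer(r"[A-Za-z0-9]+", text):
        word = m.group().lower()
        if word in STOP_WORDS:
            continue
        word = word[:-1] if word.endswith("s") else word  # crude stemming stand-in
        tokens.append((m.start(), word))
    features = {}
    for alpha, (offset, word) in enumerate(tokens):
        grams = []
        for beta in range(alpha - 1, alpha - 5, -1):  # skip distances 1..4
            prev = tokens[beta][1] if beta >= 0 else "*"  # '*' pads the left edge
            grams.append(prev + "_" + word)
        features[offset] = grams
    return features

# Reproduces the rows of Table 1, e.g. the features of "fox" at offset 16.
feats = extract_features("The quick brown fox jumps over the lazy dog.")
assert feats[16] == ["brown_fox", "quick_fox", "*_fox", "*_fox"]
```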
3.2 Selection of relevant features

Clearly, features which are not present at all, or which appear at almost every character, are useless for the text alignment task. In order to reduce the number of generated features, we apply a simple feature selection strategy: if ιX(f) is non-empty and has low cardinality, then we consider f meaningful. We say that f ∈ F is a relevant feature (for X) if its cardinality satisfies the bound

    1 ≤ |ιX(f)| ≤ ϱ    (10)

for some given threshold parameter ϱ, also called the relevance threshold. The subset of all relevant features is denoted by FX ⊆ F. In our approach we use a constant relevance threshold ϱ = 4. Due to time constraints we kept this simple approach, because it already worked quite well.

3.3 Feature extraction of document pairs and seed generation

The feature extraction for documents carries over to pairs of documents. Given a suspicious document X and a source document Y, their index maps give rise to a natural index map of X × Y:

    ιXY : F → P(X × Y),   f ↦ {(xi, yj) | f ∈ ϕ(xi) and f ∈ ϕ(yj)}.    (11)

It maps a feature f to the set of all pairs (xi, yj) of characters such that f is simultaneously present at xi and yj. Now let FX ∩ FY ⊆ F be the subset of features which are relevant for both X and Y. The union

    σ(X, Y) = ∪_{f ∈ FX ∩ FY} ιXY(f) ⊆ X × Y    (12)

is the seed set of plagiarism cases between X and Y. These atomic plagiarism cases are the starting point of a merge process, which is explained in the next section.

We can deduce an estimate of the cardinality in terms of the relevance threshold. Namely, for each f ∈ FX ∩ FY we have

    1 ≤ |ιXY(f)| = |ιX(f)| · |ιY(f)| ≤ ϱ².    (13)

Hence, |σ(X, Y)| ≤ ϱ²|FX||FY|. Consequently, the number of seeds can be bounded in terms of X and Y alone, without having to inspect the pair (X, Y).

4 Merging

In this section we describe the process of merging passage references in the product space X × Y for two documents X and Y. The merge criterion is given in terms of a distance function, and the merge process is then agglomerative single-linkage clustering with an additional termination condition.

4.1 Merge criteria

In order to define a distance between two passage references, let us first introduce further notation. For two non-empty passages P1 = [xa1, xb1] and P2 = [xa2, xb2] in X let their distance be dist(P1, P2) = min{|u1 − u2| : a1 ≤ u1 ≤ b1, a2 ≤ u2 ≤ b2}. It is positive if and only if P1 and P2 are disjoint. Now, for two non-empty passage references P1 × Q1, P2 × Q2 ⊆ X × Y their distance is

    dist(P1 × Q1, P2 × Q2) = (2 dist(P1, P2) + 2 dist(Q1, Q2)) / (σ + π(P1 × Q1) + π(P2 × Q2)),    (14)

where σ > 0 denotes a constant smoothing parameter that is chosen empirically. The distance is zero if and only if P1 ∩ P2 ≠ ∅ and Q1 ∩ Q2 ≠ ∅, or equivalently (P1 × Q1) ∩ (P2 × Q2) ≠ ∅, and thus reflects the fact that the two passage references are directly adjacent. If the distance is positive but less than a given threshold τ, the passage references have empty intersection but are relatively close.

The merge of two passage references in X × Y is the smallest passage reference that contains both. It always contains their union but is in general strictly larger.
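The following sketch implements the distance of Eq. (14), the merge operation, and the single-linkage loop of the next subsection. It is again only an illustration, not the submitted implementation: the paper fixes σ and τ only empirically, so their values must be supplied by the caller.

```python
from typing import List, Tuple

Interval = Tuple[int, int]              # half-open character interval [a, b)
PassageRef = Tuple[Interval, Interval]  # rectangle in X x Y

def interval_dist(p: Interval, q: Interval) -> int:
    """min |u1 - u2| over positions u1 in p, u2 in q; zero iff they overlap."""
    (a1, b1), (a2, b2) = p, q
    return max(a2 - b1 + 1, a1 - b2 + 1, 0)

def perimeter(r: PassageRef) -> int:
    """Eq. (3): pi(r) = 2(b - a) + 2(d - c)."""
    (a, b), (c, d) = r
    return 2 * (b - a) + 2 * (d - c)

def ref_dist(r: PassageRef, s: PassageRef, sigma: float) -> float:
    """Eq. (14): gap between the rectangles, smoothed by their perimeters."""
    num = 2 * interval_dist(r[0], s[0]) + 2 * interval_dist(r[1], s[1])
    return num / (sigma + perimeter(r) + perimeter(s))

def merge(r: PassageRef, s: PassageRef) -> PassageRef:
    """Smallest passage reference containing both: the bounding rectangle."""
    (a1, b1), (c1, d1) = r
    (a2, b2), (c2, d2) = s
    return ((min(a1, a2), max(b1, b2)), (min(c1, c2), max(d1, d2)))

def cluster(refs: List[PassageRef], tau: float, sigma: float) -> List[PassageRef]:
    """Naive single-linkage merging (Sect. 4.2): repeatedly merge the closest
    pair until every pairwise distance exceeds tau. O(n^3), for clarity only."""
    refs = list(refs)
    while len(refs) > 1:
        pairs = [(ref_dist(refs[i], refs[j], sigma), i, j)
                 for i in range(len(refs)) for j in range(i + 1, len(refs))]
        d, i, j = min(pairs)
        if d > tau:
            break
        merged = merge(refs[i], refs[j])
        refs = [r for k, r in enumerate(refs) if k not in (i, j)]
        refs.append(merged)
    return refs
```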
4.2 Agglomerative clustering

Having defined the criteria for merging two passage references, we apply agglomerative single-linkage clustering. That is, in each step we merge a pair of passage references with minimal distance. If there is no pair whose distance is less than or equal to a given constant τ > 0, the process terminates.

5 Filtering and passage extraction

At the end of the merge process we remove all passage references whose suspicious passage has fewer than 15 words. The remaining passage references are the detected plagiarism cases.

6 Evaluation Results

We evaluated the algorithm on the data sets of the previous competition, PAN 2013 [3], and the current competition, PAN 2014, using Tira [1]. For completeness, we also state the runtime, although it does not affect the ranking of the algorithm in the competition. See Table 2 for the development results and Table 3 for the final results. We also ran smaller experiments restricted to a fixed obfuscation strategy (Table 4). Unsurprisingly, the algorithm underperforms on summary-obfuscated plagiarism cases, because we use no synonym dictionaries.

Table 2. Text alignment retrieval performance and runtime for the data sets of PAN 2013. The experiments were executed on an 8-core system.

Corpus                 Pairs  PlagDet  Precision  Recall  Granularity  Runtime
pan13-training-corpus  4007   0.825    0.935      0.743   1.00485      48s
pan13-test-corpus1     399    0.837    0.930      0.760   1.00000      5s
pan13-test-corpus2     4042   0.831    0.939      0.750   1.00421      41s

Table 3. Text alignment retrieval performance and runtime for the data sets of PAN 2014. The experiments were executed on a 1-core system provided by the host.

Corpus                 Pairs  PlagDet  Precision  Recall  Granularity  Runtime
pan14-training-corpus  5164   0.821    0.928      0.763   1.02805      183s
pan14-test-corpus2     ?      0.826    0.933      0.766   1.02514      180s
pan14-test-corpus3     ?      0.855    0.925      0.818   1.02187      169s

Table 4. Text alignment plagdet scores with respect to obfuscation strategies.

Corpus                      None   Random  Cyclic translation  Summary
training-corpus-2013-01-21  0.903  0.812   0.824               0.299
test-corpus1-2013-03-08     0.901  0.791   0.854               0.412
test-corpus2-2013-01-21     0.913  0.811   0.835               0.316

7 Conclusion

The simple heuristics used in our approach already work quite well, comparably to state-of-the-art text alignment algorithms for random and translation-cycle obfuscations. In order to tackle the problem of summary obfuscation in the future, we intend to incorporate more semantic knowledge into the feature extraction stage.

References

1. Gollub, T., Potthast, M., Beyer, A., Busse, M., Rangel, F., Rosso, P., Stamatatos, E., Stein, B.: Recent trends in digital text forensics and its evaluation. In: Forner, P., Müller, H., Paredes, R., Rosso, P., Stein, B. (eds.) Information Access Evaluation. Multilinguality, Multimodality, and Visualization, Lecture Notes in Computer Science, vol. 8138, pp. 282–302. Springer Berlin Heidelberg (2013)
2. Potthast, M., Hagen, M., Gollub, T., Tippmann, M., Kiesel, J., Rosso, P., Stamatatos, E., Stein, B.: Overview of the 5th international competition on plagiarism detection. In: CLEF 2013 Evaluation Labs and Workshop – Working Notes Papers (2013)
3. Potthast, M., Stein, B., Barrón-Cedeño, A., Rosso, P.: An evaluation framework for plagiarism detection. In: Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pp. 997–1005. COLING '10, Association for Computational Linguistics, Stroudsburg, PA, USA (2010), http://dl.acm.org/citation.cfm?id=1944566.1944681