      Simplified Approach for Authorship and Plagiarism
             Detection Using Binary BOW Features


                         Notebook for PAN at CLEF 2012

                                          Navot Akiva

                               The Computer Science Department
                               Bar Ilan University, Israel
                                     navot.akiva@gmail.com



       Abstract. Identifying shifts and variations in writing style is a fundamental
       capability when addressing authorship-related tasks. In this work we examine
       a simplified approach to unsupervised authorship and plagiarism detection
       based on a binary bag-of-words representation. We evaluate our approach on
       the PAN-2012 Authorship Attribution challenge data, which includes both
       open- and closed-class authorship identification and intrinsic plagiarism
       tasks. Our approach proved successful, achieving an overall average accuracy
       of 84% and a 2nd-place rank in the competition.

       Keywords: Intrinsic plagiarism, authorship attribution, outlier detection.




1      Introduction

   Identifying authorship and stylistic variation in documents has a broad range
of applications, from characterizing a specific author's writing style to author
verification and plagiarism detection. Considerable research effort has been
invested in these areas in recent years, applying a variety of feature
representations and algorithmic approaches [1][2]. In this work we evaluate the
extent to which a straightforward feature representation method can be
successfully utilized for authorship and plagiarism identification.
   We consider the problems of authorship identification, both open- and
closed-class, and intrinsic plagiarism detection, all included in the PAN-2012
Authorship Identification competition. For each problem type, we first represent
each document as a binary vector that encodes the presence or absence of common
words in the text.
2         Our Approach

   For all the tasks in this competition we use a single vector representation
that captures the presence or absence of common words in a text. We have
previously demonstrated the power of this representation elsewhere [3][4].
   In both task families we are given the boundaries of the authorial chunks,
either at the document level (authorship) or at the paragraph level (plagiarism).
The challenges lie in the number of known authors, the document length and the
open/closed-class setting (authorship), as well as in short texts,
ordered/unordered paragraph sequences and a varying number of authors
(plagiarism).
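
   As an illustration, the following is a minimal sketch of this representation
in Python. The vocabulary cutoff of 500 common words is an assumption made for
the sketch; the exact word list is not specified here.

from collections import Counter
import re

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def build_vocabulary(documents, size=500):
    # Collect the most frequent words across all documents.
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return [word for word, _ in counts.most_common(size)]

def binary_bow(text, vocabulary):
    # Encode presence (1) or absence (0) of each common word.
    present = set(tokenize(text))
    return [1 if word in present else 0 for word in vocabulary]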

2.1       PAN-2012 Authorship Identification Competition
   The competition includes 6 tasks for authorship attribution and 2 tasks for
intrinsic plagiarism. The task descriptions and notation are listed in Table 1.

Table 1. PAN-2012 Authorship Identification competition tasks.

Task Name              Type               Number of      Description
                                           Authors
      A        Authorship Attribution         3          Short texts, closed class
      B        Authorship Attribution         3          Short texts, open class (of task A)
      C        Authorship Attribution         8          Short texts, closed class
      D        Authorship Attribution         8          Short texts, open class (of task C)
      I        Authorship Attribution        12          Novel length texts, closed class
      J        Authorship Attribution        12          Novel length texts, open class (of task I)
      E         Intrinsic Plagiarism        2-4          Mixed set of paragraphs by individual
                                                         authors. Paragraphs by any individual author
                                                         are not guaranteed to be in original order.
      F         Intrinsic Plagiarism          2          Consecutive intrusive paragraphs by a single
                                                         (intrusive) author.


2.2       Supervised Authorship Attribution
    For the traditional authorship problems (tasks A-D, I-J) we use a supervised
learning approach over the known authors' documents, with support vector
machines as our learning method. We create two separate vector collections, each
containing a single file per known author, and then use each in turn as training
data and the other as test data, with linear SVM-Light (default settings) as the
model.
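
   A minimal sketch of this two-way procedure, with scikit-learn's LinearSVC as
a stand-in for SVM-Light (default settings in both cases); collection_a and
collection_b are hypothetical lists of (binary_vector, author_label) pairs:

from sklearn.svm import LinearSVC

def cross_evaluate(collection_a, collection_b):
    # Train on each collection in turn and test on the other.
    accuracies = []
    for train, test in ((collection_a, collection_b),
                        (collection_b, collection_a)):
        X_train, y_train = zip(*train)
        X_test, y_test = zip(*test)
        model = LinearSVC().fit(list(X_train), list(y_train))
        accuracies.append(model.score(list(X_test), list(y_test)))
    return accuracies
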
    For the open-class task J, where the training examples are novel-length, we
use the unmasking method of [5], in which different feature representations are
applied to evaluate how robustly candidate authors can be separated. Unmasking
distinguishes between two authors by iteratively training a model and then
deliberately impairing it by removing the most discriminative features between
the two texts. [5] identified "degradation curve" patterns representing the
accuracy drop through a model's iterations, which is faster for same-author
pairs than for different-author pairs.
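
   A condensed sketch of the unmasking loop follows. The iteration count and the
number of features removed per step are assumptions; the exact parameters of [5]
are not reproduced here.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def unmasking_curve(X, y, iterations=10, drop_per_step=6):
    # X: chunk vectors from the two texts; y: which text each chunk is from.
    X = np.array(X, dtype=float)
    curve = []
    for _ in range(iterations):
        curve.append(cross_val_score(LinearSVC(), X, y, cv=5).mean())
        weights = LinearSVC().fit(X, y).coef_[0]
        # Impair the model: zero out the most discriminative features.
        top = np.argsort(np.abs(weights))[-drop_per_step:]
        X[:, top] = 0.0
    return curve  # degrades faster for same-author pairs [5]
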
    For the open-class tasks B and D, where the texts are short, we use the
impostors method described in [6], in which the similarity of a snippet to
random impostor texts is used as a baseline. The method produces artificial
impostors for the known authors and iteratively tests the similarity of a
suspect text's chunks against all impostors, using a different feature set each
time. The author of the suspect text is determined by the majority of the
author labels assigned to its chunks.
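
   A schematic sketch of this chunk-level majority vote; cosine similarity over
the binary vectors and random feature subsets are illustrative assumptions, not
a reconstruction of the exact procedure in [6]:

import random
import numpy as np

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

def majority_author(chunks, references, trials=100, keep_ratio=0.5):
    # references: dict mapping author/impostor label -> binary vector.
    votes = {label: 0 for label in references}
    dims = len(next(iter(references.values())))
    for chunk in chunks:
        chunk = np.asarray(chunk, dtype=float)
        for _ in range(trials):
            idx = random.sample(range(dims), int(dims * keep_ratio))
            best = max(references, key=lambda author: cosine(
                chunk[idx], np.asarray(references[author], dtype=float)[idx]))
            votes[best] += 1
    return max(votes, key=votes.get)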


2.3       Plagiarism Detection
   For clustering/plagiarism problems (tasks E and F), we treat each paragraph as a
separate document and apply the n-cut clustering algorithm described in [7].
   In task E the number of authors varies between 2 and 4 and there is no
assumption that the plagiarized paragraphs are consecutive. We therefore cluster
the data with K=2,3,4 clusters and select the optimal cluster count K by
applying inner-cluster similarity and centroid dissimilarity measures as the
selection criterion.
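
   A sketch of this selection step follows; KMeans stands in for the n-cut
algorithm of [7], and the additive score combining the two criteria is an
assumption:

import numpy as np
from sklearn.cluster import KMeans

def select_k(X, candidates=(2, 3, 4)):
    X = np.asarray(X, dtype=float)
    best_k, best_score = None, -np.inf
    for k in candidates:
        km = KMeans(n_clusters=k, n_init=10).fit(X)
        # Inner-cluster similarity: low mean distance to own centroid.
        cohesion = -km.inertia_ / len(X)
        # Centroid dissimilarity: mean pairwise distance between centroids.
        c = km.cluster_centers_
        separation = np.mean([np.linalg.norm(c[i] - c[j])
                              for i in range(k) for j in range(i + 1, k)])
        if cohesion + separation > best_score:
            best_k, best_score = k, cohesion + separation
    return best_k
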
   In task F there is a single intrusive author, and the additional assumption
that the plagiarized paragraphs form a continuous sequence is provided. We
cluster the data into K=2 and K=3 clusters and look for the maximal original
sequence grouped into a single cluster in the 2-cluster result that remains
grouped in the 3-cluster result. The maximal consistent sequence (which belongs
to the minority cluster at K=2) is presumed to be the plagiarized one.
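
   The following sketch implements this heuristic; labels2 and labels3 are
hypothetical per-paragraph cluster labels from the K=2 and K=3 runs:

from collections import Counter

def plagiarized_span(labels2, labels3):
    # The minority cluster at K=2 is assumed to hold the intrusive paragraphs.
    minority = Counter(labels2).most_common()[-1][0]
    best, start = (0, 0), None
    for i in range(len(labels2) + 1):  # one extra step closes a final run
        in_run = (i < len(labels2) and labels2[i] == minority and
                  (start is None or labels3[i] == labels3[start]))
        if in_run and start is None:
            start = i
        elif not in_run and start is not None:
            if i - start > best[1] - best[0]:
                best = (start, i)
            start = i if i < len(labels2) and labels2[i] == minority else None
    return best  # half-open paragraph range [start, end)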


3         Training and Evaluation Results


3.1       Authorship Attribution


Closed Class

Table 2. Closed Class authorship results.

Task Name        Average Training           Evaluation
                    Accuracy                 Results
      A               100%                    100%
      C               65%                      75%
      I                97%                     86%

    In task C there are more candidate authors to "confuse" than in task A, the
training is based on a single example per author, and the train/test documents
of the same author differ in length. As a result, one of the training
experiments yielded 87% accuracy while the other achieved only 50%. Nonetheless,
when measuring the gap (in %) between the SVM score of the top class and that of
the second-ranked class, all misclassified test examples had gaps below the
average of all the gaps.
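
   For reference, this gap test can be sketched as follows; the per-class score
matrix is a hypothetical input:

import numpy as np

def low_confidence(decision_scores):
    # decision_scores: (n_examples, n_classes) array of SVM scores.
    top2 = np.sort(decision_scores, axis=1)[:, -2:]
    gaps = top2[:, 1] - top2[:, 0]  # top class minus runner-up
    return gaps < gaps.mean()       # True marks a below-average gap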


Open Class

Table 3. Open Class authorship results.

 Task Name              Evaluation Results
      B                        90%
      D                        65%
 J without Unmasking           88%
 J with Unmasking             100%

    For tasks B and D we initially apply a learned classifier to the test
examples, based on the assumption that test examples actually written by a known
author would receive the correct label. To eliminate examples not authored by
any of the candidates, we then apply the impostors method introduced in [6].
    Instead of producing auxiliary artificial impostor data, we utilize all 11
known authors (of both tasks) as impostors, for identifying outliers only, while
ignoring the non-outlier labels.
     We chunk each author's texts into 1000-word chunks, giving the test
examples' chunks the opportunity to "confuse" their labelled class and reveal
inconsistency. Taking into account that there is no guarantee that all 11
authors are unique, as well as the low number of potential "impostors" (at least
for task C), we decided the following:
    1. Use this method only for validating the existence of "outliers", and
ignore the resulting coherent labels of the other examples.
    2. Repeat the impostor experiment per task using its original known authors,
and look for labelling consistency when expanding the experiment with all 11
authors. If all chunks of a test example are coherently labelled by both
experiments as a single training author (of its original task), the labelling is
considered valid; otherwise the example is considered an outlier, as sketched
below.
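
   A schematic version of this rule, assuming per-chunk predicted labels from
both experiments:

def resolve_label(per_task_labels, expanded_labels, task_authors):
    # Keep a label only if every chunk agrees, in both experiments,
    # on a single author from the example's original task.
    coherent = (len(set(per_task_labels)) == 1 and
                len(set(expanded_labels)) == 1 and
                per_task_labels[0] == expanded_labels[0] and
                per_task_labels[0] in task_authors)
    return per_task_labels[0] if coherent else "outlier"
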
    While applying the methodology above, we discovered a couple of test
examples (one per task) that received one coherent label in the experiment with
the task's own known authors and a different coherent label in the expanded
experiment, in both cases from the other task's known authors. This suggested
that these "switching" authors are potentially the same person; to validate this
(and to avoid eliminating these examples) we re-ran both tasks' experiments,
each time eliminating one of these authors. This time the labels were
consistent, which confirmed our assumption, and we did not consider these
examples outliers.
    For task J we apply the unmasking method to each classification result where
the difference between the top two SVM scores is below average. The unmasking
algorithm revealed 2 cases where the predicted class was not the correct author,
and these items were therefore marked as 'None'. The overall accuracy is thus
raised from 88% to 100%.


3.2     Intrinsic Plagiarism


Table 4. Intrinsic plagiarism results.

Task Name          Evaluation Results
      E                  76%
      F                  89%

   For task F we create a training set of example documents with properties
similar to the provided test files (~200 words per paragraph, 20 paragraphs per
file), drawn from 15 Project Gutenberg books written by different authors. Each
example is assigned a plagiarism rate of 0-40%. For each of these documents we
measure the consistency of its plagiarized sequence between the K=2 and K=3
clustering results.

Table 5. Consistent chunk mapping in the training set (purity above 60%).

                          Plagiarism Rate
                    10%      20%      30%      40%
  % of documents    57%      70%      69%      77%

    Table 5 shows the percentage of documents, grouped by their plagiarism rate,
in which the original plagiarized chunks are consistently mapped to a single K=3
cluster with a purity level above 60%. At all plagiarism levels, a purity above
50% holds for 95% of the documents. The test evaluation results presented in
Table 4 show the robustness of our method on task F, achieving a high accuracy
level of 89%.
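
   The purity measure itself can be sketched as follows; plagiarized_idx is the
set of known plagiarized paragraph positions in the training document:

from collections import Counter

def plagiarized_purity(labels3, plagiarized_idx):
    # Fraction of plagiarized paragraphs falling in their single most
    # common K=3 cluster.
    hits = Counter(labels3[i] for i in plagiarized_idx)
    return hits.most_common(1)[0][1] / len(plagiarized_idx)
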
   In task E the number of authors is larger, varying between 2 and 4, and in
addition there is no assumption that the plagiarized paragraphs are consecutive.
Our approach of seeking the optimal cluster number among K=2,3,4 by applying an
inner-cluster similarity criterion obtained an overall accuracy of 76%. The
obtained accuracy is lower than in task F due to the larger number of
participating authors as well as the lack of the sequential assumption.
Nonetheless, this result is surprisingly high considering the task's complexity
and the straightforward representation and algorithmic approach we applied for
capturing each individual paragraph's authorship.
4      Conclusion and Future Work

   In this paper we present an evaluation of the accuracy obtained on
authorship-related tasks using a simplified binary bag-of-words representation.
Our evaluation on the PAN-2012 Authorship Identification challenge data spans
authorship identification (open and closed class) and author clustering/intrinsic
plagiarism. We show that the binary BOW representation works quite well for
capturing authorship in all the tasks. Two factors affect the accuracy levels:
the number of participating authors and the length of the examined documents.
Our method appears to be effective for both long and short texts, but is more
sensitive to a high number of authors. For this reason the authorship tasks C
and D and plagiarism task E have somewhat lower accuracy levels than the others,
though still solid. An immediate direction for future work is therefore to
improve the accuracy achieved with a high number of authors in both task types.


References

1.   Argamon, S., Juola, P.: Overview of the International Authorship Identification
     Competition at PAN-2011. CLEF (Notebook Papers/Labs/Workshop) (2011).
2.   Stamatatos, E.: A survey of modern authorship attribution methods. Journal of the
     American Society for Information Science and Technology, 60(3):538-556 (2009).
3.   Koppel, M., Akiva, N., Dershowitz, I., Dershowitz, N.: Unsupervised Decomposition
     of a Document into Authorial Components. Proceedings of ACL, Portland OR,
     pp. 1356-1364 (2011).
4.   Koppel, M., Akiva, N.: Identifying Distinct Components of a Multi-Author Document.
     Proceedings of EISIC 2012, to appear (2012).
5.   Koppel, M., Schler, J., Bonchek-Dokow, E.: Measuring Differentiability: Unmasking
     Pseudonymous Authors. JMLR 8, pp. 1261-1276 (2007).
6.   Koppel, M., Schler, J., Argamon, S., Winter, Y.: The "Fundamental Problem" of
     Authorship Attribution. English Studies, 93:3, 284-291 (2012).
7.   Dhillon, I. S., Guan, Y., Kulis, B.: Weighted graph cuts without eigenvectors: a
     multilevel approach. IEEE Transactions on Pattern Analysis and Machine
     Intelligence (PAMI), 29(11):1944-1957 (2007).