Automatic Decomposition of Multi-Author Documents Using Grammar Analysis. CEUR-WS Vol-1313, paper 4: https://ceur-ws.org/Vol-1313/paper_4.pdf (dblp: https://dblp.org/rec/conf/gvd/TschuggnallS14)
      Automatic Decomposition of Multi-Author Documents
                   Using Grammar Analysis

                                          Michael Tschuggnall and Günther Specht
                                                Databases and Information Systems
                                  Institute of Computer Science, University of Innsbruck, Austria
                                   {michael.tschuggnall, guenther.specht}@uibk.ac.at


ABSTRACT

The task of text segmentation is to automatically split a text document into individual subparts, which differ according to specific measures. In this paper, an approach is presented that attempts to separate text sections of a collaboratively written document based on the grammar syntax of authors. The main idea is thereby to quantify differences in the grammatical writing style of authors and to use this information to build paragraph clusters, whereby each cluster is assigned to a different author. In order to analyze the style of a writer, text is split into single sentences, and for each sentence a full parse tree is calculated. Using the latter, a profile is subsequently computed that represents the main characteristics of each paragraph. Finally, the profiles serve as input for common clustering algorithms. An extensive evaluation using different English data sets reveals promising results, whereby a supplementary analysis indicates that in general common classification algorithms perform better than clustering approaches.

Keywords

Text Segmentation, Multi-Author Decomposition, Parse Trees, pq-grams, Clustering

1. INTRODUCTION

The growing amount of currently available data is hardly manageable without the use of specific tools and algorithms that provide relevant portions of that data to the user. While this problem is generally addressed with information retrieval approaches, another possibility to significantly reduce the amount of data is to build clusters. Within each cluster, the data is similar according to some predefined features. Thereby many approaches exist that propose algorithms to cluster plain text documents (e.g. [16], [22]) or specific web documents (e.g. [33]) by utilizing various features.

Approaches which attempt to divide a single text document into distinguishable units like different topics, for example, are usually referred to as text segmentation approaches. Here, also, many features including statistical models, similarities between words or other semantic analyses are used. Moreover, text clusters are also used in recent plagiarism detection algorithms (e.g. [34]) which try to build a cluster for the main author and one or more clusters for intrusive paragraphs. Another scenario where the clustering of text is applicable is the analysis of multi-author academic papers: especially the verification of collaborative student works such as bachelor or master theses can be useful in order to determine the amount of work done by each student.

Building on results of previous work in the fields of intrinsic plagiarism detection [31] and authorship attribution [32], this work reuses the assumption that individual authors have significantly different writing styles in terms of the syntax that is used to construct sentences. For example, the following sentence (extracted from a web blog): "My chair started squeaking a few days ago and it's driving me nuts." (S1) could also be formulated as "Since a few days my chair is squeaking - it's simply annoying." (S2), which is semantically equivalent but differs significantly in its syntax, as can be seen in Figure 1. The main idea of this work is to quantify those differences by calculating grammar profiles and to use this information to decompose a collaboratively written document, i.e., to assign each paragraph of a document to an author.

The rest of this paper is organized as follows: Section 2 at first recapitulates the principle of pq-grams, which represent a core concept of the approach. Subsequently the algorithm is presented in detail, which is then evaluated in Section 3 by using different clustering algorithms and data sets. A comparison of clustering and classification approaches is discussed in Section 4, while Section 5 depicts related work. Finally, a conclusion and future work directions are given in Section 6.

Copyright © by the paper's authors. Copying permitted only for private and academic purposes.
In: G. Specht, H. Gamper, F. Klan (eds.): Proceedings of the 26th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 21.10.2014 - 24.10.2014, Bozen, Italy, published at http://ceur-ws.org.

2. ALGORITHM

In the following the concept of pq-grams is explained, which serves as the basic stylistic measure in this approach to distinguish between authors. Subsequently, the concrete steps performed by the algorithm are discussed in detail.

2.1 Preliminaries: pq-grams

Similar to n-grams, which represent subparts of given length n of a string, pq-grams extract substructures of an ordered, labeled tree [4]. The size of a pq-gram is determined by a stem (p) and a base (q), as shown in Figure 2. Thereby p defines how many nodes are included vertically, and q defines the number of nodes to be considered horizontally. For example, a valid pq-gram with p = 2 and q = 3 starting from PP at the left side of tree (S2) shown in Figure 1 would be [PP-NP-DT-JJ-NNS] (the concrete words are omitted).

The pq-gram index then consists of all possible pq-grams of a tree. In order to obtain all pq-grams, the base is additionally shifted left and right: if fewer than q nodes exist horizontally, the corresponding place in the pq-gram is filled with *, indicating a missing node. Applying this idea to the previous example, the pq-gram [PP-IN-*-*-*] (no nodes in the base) is also valid, and [PP-NP-*-*-DT] (base shifted left by two), [PP-NP-*-DT-JJ] (base shifted left by one), [PP-NP-JJ-NNS-*] (base shifted right by one) and [PP-NP-NNS-*-*] (base shifted right by two) have to be considered as well. As a last example, all leaves have the pq-gram pattern [leaf_label-*-*-*-*]. Finally, the pq-gram index is the set of all valid pq-grams of a tree, whereby multiple occurrences of the same pq-gram are also present multiple times in the index.

[Figure 1: Grammar Trees of the Semantically Equivalent Sentences (S1) and (S2).]

[Figure 2: Structure of a pq-gram Consisting of Stem p = 2 and Base q = 3.]

2.2 Clustering by Authors

The number of choices an author has to formulate a sentence in terms of grammar structure is rather high, and the assumption in this approach is that the concrete choice is made mostly intuitively and unconsciously. On that basis the grammar of authors is analyzed, which serves as input for common state-of-the-art clustering algorithms to build clusters of text documents or paragraphs. The decision of the clustering algorithms is thereby based on the frequencies of occurring pq-grams, i.e., on pq-gram profiles. In detail, given a text document, the algorithm consists of the following steps:

1. At first the document is preprocessed by eliminating unnecessary whitespace and non-parsable characters. For example, many data sets are based on novels and articles of various authors, whereby OCR text recognition is frequently used due to the lack of digital data. Additionally, such documents contain problem sources like chapter numbers and titles or incorrectly parsed picture frames that result in non-alphanumeric characters.

2. Subsequently, the document is partitioned into single paragraphs. For simplification reasons this is currently done by only detecting multiple line breaks.

3. Each paragraph is then split into single sentences by utilizing a sentence boundary detection algorithm implemented within the OpenNLP framework^1. Then for each sentence a full grammar tree is calculated using the Stanford Parser [19]. For example, Figure 1 depicts the grammar trees resulting from analyzing sentences (S1) and (S2), respectively. The labels of each tree correspond to a part-of-speech (POS) tag of the Penn Treebank set [23], where e.g. NP corresponds to a noun phrase, DT to a determiner or JJS to a superlative adjective. In order to examine only the building structure of sentences, as intended by this work, the concrete words, i.e., the leaves of the tree, are omitted.

4. Using the grammar trees of all sentences of the document, the pq-gram index is calculated. As shown in Section 2.1, all valid pq-grams of a sentence are extracted and stored in a pq-gram index. By combining the pq-gram indices of all sentences, a pq-gram profile is computed which contains a list of all pq-grams and their corresponding frequency of appearance in the text. Thereby the frequency is normalized by the total number of all appearing pq-grams. As an example, the five most frequently used pq-grams with p = 2 and q = 3 of a sample document are shown in Table 1. The profile is sorted in descending order by the normalized occurrence, and an additional rank value is introduced that simply defines a natural order, which is used in the evaluation (see Section 3).

Table 1: Example of the Five Most Frequently Used pq-grams of a Sample Document.

   pq-gram         Occurrence [%]   Rank
   NP-NN-*-*-*          2.68          1
   PP-IN-*-*-*          2.25          2
   NP-DT-*-*-*          1.99          3
   NP-NNP-*-*-*         1.44          4
   S-VP-*-*-VBD         1.08          5

5. Finally, each paragraph profile is provided as input for clustering algorithms, which are asked to build clusters based on the pq-grams contained. Concretely, three different feature sets have been evaluated: (1) the frequency of occurrence of each pq-gram, (2) the rank of each pq-gram and (3) a union of the latter sets.

^1 Apache OpenNLP, http://incubator.apache.org/opennlp, visited July 2014
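The pq-gram extraction of Section 2.1 can be sketched as follows. This is a minimal illustration, not the authors' implementation: it follows the ancestor/children formulation of pq-grams from Augsten et al. [4] and assumes a simple nested-tuple tree representation (label, children) with the concrete words already omitted.

```python
def pq_grams(tree, p=2, q=3):
    """Collect all pq-grams of an ordered, labeled tree.

    A tree is a tuple (label, [children]); leaves have empty child lists.
    The stem holds a node plus its p-1 ancestors (padded with '*'); the
    base is a sliding window of size q over the node's children, extended
    with q-1 '*' placeholders on each side. Leaves get an all-'*' base.
    """
    grams = []

    def visit(node, ancestors):
        label, children = node
        stem = (['*'] * p + ancestors + [label])[-p:]
        if not children:
            grams.append(tuple(stem + ['*'] * q))  # leaf: empty base
        else:
            labels = [child[0] for child in children]
            base = ['*'] * (q - 1) + labels + ['*'] * (q - 1)
            for i in range(len(base) - q + 1):  # shift the base window
                grams.append(tuple(stem + base[i:i + q]))
            for child in children:
                visit(child, ancestors + [label])

    visit(tree, [])
    return grams


# PP subtree of sentence (S2): PP -> IN, NP; NP -> DT, JJ, NNS
pp = ('PP', [('IN', []), ('NP', [('DT', []), ('JJ', []), ('NNS', [])])])
index = pq_grams(pp)
```

Applied to the PP subtree of (S2) with p = 2 and q = 3, this yields, among others, [PP-NP-DT-JJ-NNS], its shifted variants such as [PP-NP-*-*-DT], and the leaf pq-gram [PP-IN-*-*-*].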
2.3 Utilized Algorithms

Using the WEKA framework [15], the following clustering algorithms have been evaluated: K-Means [3], Cascaded K-Means (the number of clusters is cascaded and automatically chosen) [5], X-Means [26], Agglomerative Hierarchical Clustering [25], and Farthest First [9].

For the clustering algorithms K-Means, Hierarchical Clustering and Farthest First the number of clusters has been predefined according to the respective test data. This means that if the test document has been written collaboratively by three authors, the number of clusters has also been set to three. On the other hand, the algorithms Cascaded K-Means and X-Means implicitly decide which number of clusters is optimal. Therefore these algorithms have only been limited in range, i.e., the minimum and maximum number of clusters has been set to two and six, respectively.

3. EVALUATION

The utilization of pq-gram profiles as input features for modern clustering algorithms has been extensively evaluated using different documents and data sets. As clustering and classification problems are closely related, the global aim was to experiment on the accuracy of automatic text clustering using solely the proposed grammar feature, and furthermore to compare it to that of current classification techniques.

3.1 Test Data and Experimental Setup

In order to evaluate the idea, different documents and test data sets have been used, which are explained in more detail in the following. Thereby single documents have been created which contain paragraphs written by different authors, as well as multiple documents, whereby each document is written by one author. In the latter case, every document is treated as one (large) paragraph for simplification reasons.

For the experiment, different parameter settings have been evaluated, i.e., the pq-gram values p and q have been varied from 2 to 4, in combination with the three different feature sets. Concretely, the following data sets have been used:

• Twain-Wells (T-W): This document has been specifically created for the evaluation of in-document clustering. It contains 50 paragraphs of the book "The Adventures of Huckleberry Finn" by Mark Twain, and 50 paragraphs of "The Time Machine" by H. G. Wells^2. All paragraphs have been randomly shuffled, whereby the size of each paragraph varies from approximately 25 words up to 280 words.

• Twain-Wells-Shelley (T-W-S): In a similar fashion a three-author document has been created. It again uses (different) paragraphs of the same books by Twain and Wells, appended by paragraphs of the book "Frankenstein; Or, The Modern Prometheus" by Mary Wollstonecraft Shelley. Summarizing, the document contains 50 paragraphs by Mark Twain, 50 paragraphs by H. G. Wells and another 50 paragraphs by Mary Shelley, whereby the paragraph sizes are similar to the Twain-Wells document.

• The Federalist Papers (FED): Probably the most frequently cited text corpus in the field of authorship attribution is a series of 85 political essays called "The Federalist Papers", written by John Jay, Alexander Hamilton and James Madison in the 18th century. While most of the authorships are undoubted, many works have studied and questioned the correct authorship of 12 disputed essays [24], which have been excluded in the experiment.

• The PAN'12 competition corpus (PAN12): As a well-known, state-of-the-art corpus originally created for use in authorship identification, parts^3 of the PAN 2012 corpus [18] have been integrated. The corpus is composed of several fiction texts and split into several subtasks that cover small- and common-length documents (1,800-6,060 words) as well as larger documents (up to 13,000 words) and novel-length documents (up to 170,000 words). Finally, the test set used in this evaluation contains 14 documents (paragraphs) written by three authors that are distributed equally.

3.2 Results

The best results of the evaluation are presented in Table 2, where the best performance of each clusterer over all data sets is shown in subtable (a), and the best configuration for each data set is shown in subtable (b), respectively. With an accuracy of 63.7% the K-Means algorithm worked best by using p = 2, q = 3 and by utilizing all available features. Interestingly, the X-Means algorithm also achieved good results considering the fact that in this case the number of clusters has been assigned automatically by the algorithm. Finally, the hierarchical clusterer performed worst, with an accuracy nearly 10% below that of K-Means.

Regarding the best performances for each test data set, the results for the manually created data sets from novel literature are generally poor. For example, the best result for the two-author document Twain-Wells is only 59.6%, i.e., the accuracy is only slightly better than the baseline of 50%, which can be achieved by randomly assigning paragraphs to two clusters.^4 On the other hand, the data sets reused from authorship attribution, namely the FED and the PAN12 data sets, achieved very good results with accuracies of about 89% and 83%, respectively. Nevertheless, as the other data sets have been specifically created for the clustering evaluation, those results may be more expressive. Therefore a comparison between clustering and classification approaches is discussed in the following, showing that the latter achieve significantly better results on those data sets when using the same features.

Table 2: Best Evaluation Results for Each Clustering Algorithm and Test Data Set in Percent.

(a) Clustering Algorithms

   Method                p   q   Feature Set       Accuracy
   K-Means               3   2   All                 63.7
   X-Means               2   4   Rank                61.7
   Farthest First        4   2   Occurrence-Rate     58.7
   Cascaded K-Means      2   2   Rank                55.3
   Hierarchical Clust.   4   3   Occurrence-Rate     54.7

(b) Test Data Sets

   Data Set     Method         p   q   Feat. Set   Accuracy
   T-W          X-Means        3   2   All           59.6
   T-W-S        X-Means        3   4   All           49.0
   FED          Farth. First   4   3   Rank          89.4
   PAN12-A/B    K-Means        3   3   All           83.3

^2 The books have been obtained from the Project Gutenberg library, http://www.gutenberg.org, visited July 2014
^3 The subtasks A and B, respectively.
^4 In this case X-Means dynamically created two clusters, but the result is still better than that of other algorithms using a fixed number of clusters.
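The profile construction of step 4 and the fixed-k clustering of Section 2.3 can be sketched as follows. This is a self-contained illustration that uses a plain Lloyd's K-Means in place of the WEKA implementations evaluated above; the paragraph profiles are toy examples, not real pq-gram data.

```python
import random
from collections import Counter


def pq_profile(grams):
    """Normalized pq-gram profile: relative frequency of each pq-gram."""
    counts = Counter(grams)
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}


def to_vector(profile, vocabulary):
    """Dense feature vector over a fixed, shared pq-gram vocabulary."""
    return [profile.get(g, 0.0) for g in vocabulary]


def k_means(vectors, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns one cluster id per input vector."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        for i, v in enumerate(vectors):
            assignment[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])))
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Toy paragraph profiles: two "authors" with different pq-gram habits
paragraphs = [
    ['NP-NN-*-*-*'] * 9 + ['PP-IN-*-*-*'],
    ['NP-NN-*-*-*'] * 8 + ['PP-IN-*-*-*'] * 2,
    ['PP-IN-*-*-*'] * 9 + ['NP-NN-*-*-*'],
    ['PP-IN-*-*-*'] * 8 + ['NP-NN-*-*-*'] * 2,
]
profiles = [pq_profile(p) for p in paragraphs]
vocabulary = sorted(set(g for p in profiles for g in p))
labels = k_means([to_vector(p, vocabulary) for p in profiles], k=2)
```

With well-separated grammar habits, the first two and the last two paragraphs end up in different clusters, mirroring the per-author paragraph clusters described above.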
4. COMPARISON OF CLUSTERING AND CLASSIFICATION APPROACHES

For the given data sets, any clustering problem can be rewritten as a classification problem, with the exception that the latter needs training data. Although a direct comparison should be treated with caution, it still gives an insight into how the two different approaches perform using the same data sets. Therefore an additional evaluation is shown in the following, which compares the performance of the clustering algorithms to the performance of the following classification algorithms: Naive Bayes classifier [17], Bayes Network using the K2 classifier [8], Large Linear Classification using LibLinear [12], Support vector machine using LIBSVM with nu-

(a) Twain-Wells

   p   q   Algorithm      Max     N-Bay   Bay-Net   LibLin   LibSVM   kNN    J48
   2   2   X-Means        57.6     77.8     82.3     85.2     86.9    62.6   85.5
   2   3   X-Means        56.6     79.8     80.8     81.8     83.3    60.6   80.8
   2   4   X-Means        57.6     76.8     79.8     82.2     83.8    58.6   81.0
   3   2   X-Means        59.6     78.8     80.8     81.8     83.6    59.6   80.8
   3   3   X-Means        53.5     76.8     77.8     80.5     82.3    61.6   79.8
   3   4   X-Means        52.5     81.8     79.8     81.8     83.8    63.6   82.0
   4   2   K-Means        52.5     86.9     83.3     83.5     84.3    62.6   81.8
   4   3   X-Means        52.5     79.8     79.8     80.1     80.3    59.6   77.4
   4   4   Farth. First   51.5     72.7     74.7     75.8     77.0    60.6   75.8
   average improvement             24.1     25.0     26.5     27.9     6.2   25.7

(b)

   p   q   Algorithm   Max     N-Bay   Bay-Net   LibLin   LibSVM   kNN    J48
   2   2   K-Means     44.3     67.8     70.8     74.0     75.2    51.0   73.3
   2   3   X-Means     38.3     65.1     67.1     70.7     72.3    48.3   70.2
   2   4   X-Means     45.6     63.1     68.1     70.5     71.8    49.0   69.3
   3   2   X-Means     45.0     51.7     64.1     67.3     68.8    45.6   65.4
   3   3   X-Means     47.0     57.7     64.8     67.3     68.5    47.0   65.9
   3   4   X-Means     49.0     67.8     67.8     70.5     72.5    46.3   68.3
   4   2   X-Means     36.2     61.1     67.1     69.1     69.5    50.3   65.1
SVC classification [6], k-nearest-neighbors classifier (kNN) using          4   3    K-Means       35.6      53.0    63.8      67.6      70.0   47.0    66.6
k = 1 [1], and a pruned C4.5 decision tree (J48) [28]. To compen-           4   4    X-Means       35.6      57.7    66.1      68.5      69.3   42.3    66.8
                                                                                    average improvement      18.7    24.8      27.7      29.0    5.6    26.0
sate the missing training data, a 10-fold cross-validation has been
used for each classifier.                                                                            (b) Twain-Wells-Shelley
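The 10-fold cross-validation setup described above can be illustrated with a small, self-contained sketch: a nearest-neighbor classifier with k = 1 scored by 10-fold cross-validation. This is a simplified stand-in (on synthetic two-dimensional vectors) for the WEKA-based classifiers used in the evaluation; the function names and the data are illustrative only.

```python
import random

def knn1_predict(train, query):
    """Predict the label of the nearest training vector (k = 1, squared Euclidean)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(train, key=lambda item: dist(item[0], query))[1]

def cross_validate(data, folds=10):
    """Mean accuracy over `folds` disjoint train/test splits."""
    data = data[:]
    random.Random(0).shuffle(data)          # fixed seed for reproducibility
    accuracies = []
    for i in range(folds):
        test = data[i::folds]               # indices j with j % folds == i
        train = [d for j, d in enumerate(data) if j % folds != i]
        hits = sum(knn1_predict(train, vec) == label for vec, label in test)
        accuracies.append(hits / len(test))
    return sum(accuracies) / folds

# Synthetic feature vectors standing in for pq-gram profiles of two "authors".
rng = random.Random(42)
data = [([rng.gauss(0, 1), rng.gauss(0, 1)], "A") for _ in range(50)] + \
       [([rng.gauss(3, 1), rng.gauss(3, 1)], "B") for _ in range(50)]
print(f"10-fold accuracy: {cross_validate(data):.2f}")
```

Because every item serves as test data exactly once, the procedure yields an accuracy estimate without a dedicated training set, which is the motivation for using it here.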
   Table 3 shows the performance of each classifier compared to the best clustering result using the same data and pq-setting. It can be seen that the classifiers significantly outperform the clustering results for the Twain-Wells and Twain-Wells-Shelley documents. The support vector machine framework (LibSVM) and the linear classifier (LibLinear) performed best, reaching a maximum accuracy of nearly 87% for the Twain-Wells document. Moreover, the average improvement is given in the bottom line of each subtable, showing that most of the classifiers outperform the best clustering result by over 20% on average. Solely the kNN algorithm achieves only minor improvements, as it attributed the two-author document with a poor accuracy of only about 60%.
   A similar general improvement could be achieved on the three-author document Twain-Wells-Shelley, as can be seen in subtable (b). Again, LibSVM could achieve an accuracy of about 75%, whereas the best clustering configuration could only reach 49%. Except for the kNN algorithm, all classifiers significantly outperform the best clustering results for every configuration.
   Quite different comparison results have been obtained for the Federalist Papers and PAN12 data sets, respectively. Here, the improvements gained from the classifiers are only minor, and in some cases even negative, i.e., the classification algorithms perform worse than the clustering algorithms. A general explanation is the good performance of the clustering algorithms on these data sets, especially by utilizing the Farthest First and K-Means algorithms.
   In the case of the Federalist Papers data set shown in subtable (c), all algorithms except kNN could achieve at least some improvement. Although the LibLinear classifier could reach an outstanding accuracy of 97%, the global improvement is below 10% for all classifiers. Finally, subtable (d) shows the results for PAN12, where the outcome is quite diverse, as some classifiers could improve on the clusterers significantly, whereas others worsen the accuracy even more drastically. A possible explanation might be the small data set (only the subproblems A and B have been used), which may not be very well suited for a reliable evaluation of the clustering approaches.

p  q  Algorithm     Max   N-Bay  Bay-Net  LibLin  LibSVM  kNN   J48
2  2  Farth. First  77.3   81.1     86.4    90.9    84.2  74.2  81.8
2  3  Farth. First  78.8   85.6     87.4    92.4    89.0  78.8  82.8
2  4  X-Means       78.8   89.4     92.4    90.9    87.3  89.4  85.9
3  2  K-Means       81.8   82.6     87.9    92.4    85.5  80.3  83.8
3  3  K-Means       78.8   92.4     92.4    92.4    86.4  81.8  83.8
3  4  Farth. First  86.4   84.8     90.9    97.0    85.8  81.8  85.6
4  2  Farth. First  86.6   81.8     89.4    87.9    83.3  77.3  84.1
4  3  Farth. First  89.4   85.6     92.4    89.4    85.8  80.3  83.3
4  4  Farth. First  84.8   86.4     90.9    89.4    85.8  84.8  83.6
      average improvement   3.0      7.5     8.9     3.4  -1.6   1.3

                     (c) Federalist Papers

p  q  Algorithm     Max   N-Bay  Bay-Net  LibLin  LibSVM  kNN    J48
2  2  K-Means       83.3   83.3     33.3   100.0   100.0  100.0  33.3
2  3  K-Means       83.3   83.3     33.3   100.0   100.0  100.0  33.3
2  4  K-Means       83.3   83.4     33.3   100.0   100.0  100.0  33.3
3  2  K-Means       83.3   75.0     33.3    91.7    91.7  100.0  33.3
3  3  K-Means       83.3  100.0     33.3   100.0    91.7  100.0  33.3
3  4  Farth. First  75.0   66.7     33.3   100.0   100.0   91.7  33.3
4  2  K-Means       83.3   91.7     33.3    91.7    75.0   91.7  33.3
4  3  K-Means       83.3   75.0     33.3   100.0    75.0   91.7  33.3
4  4  K-Means       83.3   75.0     33.3   100.0    83.4   83.4  33.3
      average improvement  -0.9    -49.1    15.8     8.4   13.0  -49.1

                     (d) PAN12-A/B

Table 3: Best Evaluation Results for each Clustering Algorithm and Test Data Set in Percent.

   Summarizing, the comparison of the different algorithms reveals that, in general, classification algorithms perform better than clustering algorithms when provided with the same (pq-gram) feature set. Nevertheless, the results of the PAN12 experiment are very diverse and indicate that there might be a problem with the data set itself, and that this comparison should be treated carefully.

5.    RELATED WORK
   Most of the traditional document clustering approaches are based on occurrences of words, i.e., inverted indices are built and used to group documents. Thereby a unit to be clustered conforms exactly to one document. The main idea is often to compute topically related document clusters and to assist web search engines in providing better results to the user, whereby the proposed algorithms are frequently also patented (e.g. [2]). Regularly applied concepts in the feature extraction phase are the term frequency tf, which measures how often a word occurs in a document, and the term frequency-inverse document frequency tf-idf, which measures the significance of a word compared to the whole document collection. An example of a classical approach using these techniques is published in [21].
   The literature on cluster analysis within a single document to discriminate the authorships in a multi-author document, as it is done in this paper, is surprisingly sparse. On the other hand, many approaches exist to separate a document into paragraphs of different topics, which are generally called text segmentation problems. In this domain, the algorithms often perform vocabulary analysis in various forms like word stem repetitions [27] or word frequency models [29], whereby "methods for finding the topic boundaries include sliding window, lexical chains, dynamic programming, agglomerative clustering and divisive clustering" [7]. Despite the given possibility to modify these techniques to also cluster by authors instead of topics, this is rarely done. In the following, some of the existing methods are briefly summarized.
   Probably one of the first approaches that uses stylometry to automatically detect boundaries of authors of collaboratively written text is proposed in [13]. Thereby the main intention was not to expose authors or to gain insight into the work distribution, but to provide a methodology for collaborative authors to equalize their style in order to achieve better readability. To extract the style of separated paragraphs, common stylometric features such as word/sentence lengths, POS tag distributions, or frequencies of POS classes at sentence-initial and sentence-final positions are considered. An extensive experiment revealed that stylometric features can be used to find authorship boundaries, but that additional research has to be done in order to increase the accuracy and informativeness.
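Surface features of this kind are straightforward to compute. The following sketch derives average word length, average sentence length, and the sentence-initial tokens for a paragraph; it is a simplified illustration using whitespace tokenization and "." as sentence delimiter rather than the POS-based feature set of [13], and all names are illustrative.

```python
def stylometric_features(paragraph):
    """Extract simple stylometric measurements from a paragraph of text."""
    # Split into sentences on '.', then into whitespace-separated tokens.
    sentences = [s.split() for s in paragraph.split(".") if s.strip()]
    words = [w for sent in sentences for w in sent]
    return {
        "avg_word_len": sum(len(w) for w in words) / len(words),
        "avg_sent_len": sum(len(s) for s in sentences) / len(sentences),
        "first_tokens": [s[0].lower() for s in sentences],  # sentence-initial words
    }

para = "The results were clear. However, the sample was small."
feats = stylometric_features(para)
print(feats["avg_sent_len"])  # 4.5 (two sentences of 4 and 5 tokens)
```

Per-paragraph vectors of such measurements can then be compared or clustered to locate stylistic breaks between authors.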
Figure 3: Best Evaluation Results Over All Data Sets For All Utilized Clustering and Classification Algorithms.

   In [14] the authors also tried to divide a collaborative text into different single-author paragraphs. In contrast to the previously described handmade corpus, a large data set has been computationally created by using (well-written) articles of an internet forum. At first, different neural networks have been utilized using several stylometric features. By using 90% of the data for training, the best network could achieve an F-score of 53% for multi-author documents on the remaining 10% of test data. In a second experiment, only letter-bigram frequencies are used as distinguishing features. Thereby an authorship boundary between paragraphs was marked if the cosine distance exceeded a certain threshold. This method reached an F-score of only 42%, and it is suspected that letter-bigrams are not suitable for the (short) paragraphs used in the evaluation.
Figure 4: Best Clustering and Classification Results For Each Data Set.

   A two-stage process to cluster Hebrew Bible texts by authorship is proposed in [20]. Because a first attempt to represent chapters only by bag-of-words led to negative results, the authors additionally incorporated sets of synonyms (which could be generated by comparing the original Hebrew texts with an English translation). With a modified cosine measure comparing these sets for given chapters, two core clusters are compiled by using the ncut algorithm [10]. In the second step, the resulting clusters are used as training data for a support vector machine, which finally assigns
every chapter to one of the two core clusters by using the simple bag-of-words features tested earlier. Thereby it can be the case that units originally assigned to one cluster are moved to the other one, depending on the prediction of the support vector machine. With this two-stage approach the authors report a good accuracy of about 80%, whereby it should be considered that the number of potential authors has been fixed to two in the experiment. Nevertheless, the authors state that their approach could be extended to more authors with little effort.

6.    CONCLUSION AND FUTURE WORK
   In this paper, the automatic creation of paragraph clusters based on the grammar of authors has been evaluated. Different state-of-the-art clustering algorithms have been utilized with different input features and tested on different data sets. The best working algorithm, K-Means, could achieve an accuracy of about 63% over all test sets, whereby good individual results of up to 89% could be reached for some configurations. On the contrary, the specifically created documents incorporating two and three authors could only be clustered with a maximum accuracy of 59%.
   A comparison between clustering and classification algorithms using the same input features has also been implemented. Disregarding the missing training data, it could be observed that classifiers generally produce higher accuracies, with improvements of up to 29%. On the other hand, some classifiers perform worse on average than clustering algorithms over individual data sets when using some pq-gram configurations. Nevertheless, if the maximum accuracy for each algorithm is considered, all classifiers perform significantly better, as can be seen in Figure 3, which illustrates the best performances of all utilized classification and clustering algorithms. The linear classification algorithm LibLinear could reach nearly 88%, outperforming K-Means by 25% over all data sets.
   Finally, the best classification and clustering results for each data set are shown in Figure 4. Consequently, the classifiers achieve higher accuracies, whereby the PAN12 subsets could be classified 100% correctly. As can be seen, a major improvement can be gained for the novel literature documents. For example, the best classifier reached 87% on the Twain-Wells document, whereas the best clustering approach achieved only 59%.
   As shown in this paper, paragraphs of documents can be split and clustered based on grammar features, but the accuracy is below that of classification algorithms. Although the two algorithm types should not be compared directly, as they are designed to manage different problems, the significant differences in accuracy indicate that classifiers can handle the grammar features better. Nevertheless, future work should focus on evaluating the same features on larger data sets, as clustering algorithms may produce better results with an increasing amount of sample data.
   Another possible application could be the creation of whole-document clusters, where documents with similar grammar are grouped together. Despite the fact that such huge clusters are very difficult to evaluate (due to the lack of ground truth data), a navigation through thousands of documents based on grammar may be interesting, as it has been done for music genres (e.g. [30]) or images (e.g. [11]). Moreover, grammar clusters may also be utilized for modern recommendation algorithms once they have been calculated for large data sets. For example, by analyzing all freely available books from libraries like Project Gutenberg, a system could recommend other books with a similar style based on the user's reading history. Also, an enhancement of current commercial recommender systems such as those used in large online stores like Amazon is conceivable.

7.    REFERENCES
 [1] D. Aha and D. Kibler. Instance-Based Learning Algorithms. Machine Learning, 6:37–66, 1991.
 [2] C. Apte, S. M. Weiss, and B. F. White. Lightweight Document Clustering, Nov. 25 2003. US Patent 6,654,739.
 [3] D. Arthur and S. Vassilvitskii. K-means++: The Advantages of Careful Seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '07, pages 1027–1035, Philadelphia, PA, USA, 2007. Society for Industrial and Applied Mathematics.
 [4] N. Augsten, M. Böhlen, and J. Gamper. The pq-Gram Distance between Ordered Labeled Trees. ACM Transactions on Database Systems (TODS), 2010.
 [5] T. Caliński and J. Harabasz. A Dendrite Method for Cluster Analysis. Communications in Statistics - Theory and Methods, 3(1):1–27, 1974.
 [6] C.-C. Chang and C.-J. Lin. LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.
 [7] F. Y. Choi. Advances in Domain Independent Linear Text Segmentation. In Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference, pages 26–33. Association for Computational Linguistics, 2000.
 [8] G. F. Cooper and E. Herskovits. A Bayesian Method for the Induction of Probabilistic Networks From Data. Machine Learning, 9(4):309–347, 1992.
 [9] S. Dasgupta. Performance Guarantees for Hierarchical Clustering. In Computational Learning Theory, pages 351–363. Springer, 2002.
[10] I. S. Dhillon, Y. Guan, and B. Kulis. Kernel k-means: Spectral Clustering and Normalized Cuts. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 551–556. ACM, 2004.
[11] A. Faktor and M. Irani. "Clustering by Composition" - Unsupervised Discovery of Image Categories. In Computer Vision - ECCV 2012, pages 474–487. Springer, 2012.
[12] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A Library for Large Linear Classification. The Journal of Machine Learning Research, 9:1871–1874, 2008.
[13] A. Glover and G. Hirst. Detecting Stylistic Inconsistencies in Collaborative Writing. In The New Writing Environment, pages 147–168. Springer, 1996.
[14] N. Graham, G. Hirst, and B. Marthi. Segmenting Documents by Stylistic Character. Natural Language Engineering, 11(04):397–415, 2005.
[15] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The WEKA Data Mining Software: An Update. ACM SIGKDD Explorations Newsletter, 11(1):10–18, 2009.
[16] A. Hotho, S. Staab, and G. Stumme. Ontologies Improve Text Document Clustering. In Data Mining, 2003. ICDM 2003. Third IEEE International Conference on, pages 541–544. IEEE, 2003.
[17] G. H. John and P. Langley. Estimating Continuous Distributions in Bayesian Classifiers. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pages 338–345. Morgan Kaufmann Publishers Inc., 1995.
[18] P. Juola. An Overview of the Traditional Authorship Attribution Subtask. In CLEF (Online Working Notes/Labs/Workshop), 2012.
[19] D. Klein and C. D. Manning. Accurate Unlexicalized Parsing. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1, ACL '03, pages 423–430, Stroudsburg, PA, USA, 2003.
[20] M. Koppel, N. Akiva, I. Dershowitz, and N. Dershowitz. Unsupervised Decomposition of a Document into Authorial Components. In Proc. of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT '11, pages 1356–1364, Stroudsburg, PA, USA, 2011.
[21] B. Larsen and C. Aone. Fast and Effective Text Mining Using Linear-Time Document Clustering. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 16–22. ACM, 1999.
[22] Y. Li, S. M. Chung, and J. D. Holt. Text Document Clustering Based on Frequent Word Meaning Sequences. Data & Knowledge Engineering, 64(1):381–404, 2008.
[23] M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini. Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, 19:313–330, June 1993.
[24] F. Mosteller and D. Wallace. Inference and Disputed Authorship: The Federalist. Addison-Wesley, 1964.
[25] F. Murtagh. A Survey of Recent Advances in Hierarchical Clustering Algorithms. The Computer Journal, 26(4):354–359, 1983.
[26] D. Pelleg, A. W. Moore, et al. X-means: Extending K-means with Efficient Estimation of the Number of Clusters. In ICML, pages 727–734, 2000.
[27] J. M. Ponte and W. B. Croft. Text Segmentation by Topic. In Research and Advanced Technology for Digital Libraries, pages 113–125. Springer, 1997.
[28] J. R. Quinlan. C4.5: Programs for Machine Learning, volume 1. Morgan Kaufmann, 1993.
[29] J. C. Reynar. Statistical Models for Topic Segmentation. In Proc. of the 37th Annual Meeting of the Association for Computational Linguistics, pages 357–364, 1999.
[30] N. Scaringella, G. Zoia, and D. Mlynek. Automatic Genre Classification of Music Content: A Survey. Signal Processing Magazine, IEEE, 23(2):133–141, 2006.
[31] M. Tschuggnall and G. Specht. Using Grammar-Profiles to Intrinsically Expose Plagiarism in Text Documents. In Proc. of the 18th Conf. of Natural Language Processing and Information Systems (NLDB), pages 297–302, 2013.
[32] M. Tschuggnall and G. Specht. Enhancing Authorship Attribution By Utilizing Syntax Tree Profiles. In Proc. of the 14th Conf. of the European Chapter of the Assoc. for Computational Ling. (EACL), pages 195–199, 2014.
[33] O. Zamir and O. Etzioni. Web Document Clustering: A Feasibility Demonstration. In Proc. of the 21st Annual International ACM Conference on Research and Development in Information Retrieval (SIGIR), pages 46–54. ACM, 1998.
[34] D. Zou, W.-J. Long, and Z. Ling. A Cluster-Based Plagiarism Detection Method. In Notebook Papers of CLEF 2010 LABs and Workshops, 22-23 September, 2010.