<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Automatic Decomposition of Multi-Author Documents Using Grammar Analysis</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Michael Tschuggnall and Günther Specht, Databases and Information Systems, Institute of Computer Science, University of Innsbruck</institution>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <abstract>
        <p>The task of text segmentation is to automatically split a text document into individual subparts, which differ according to specific measures. In this paper, an approach is presented that attempts to separate text sections of a collaboratively written document based on the grammar syntax of its authors. The main idea is to quantify differences in the grammatical writing style of authors and to use this information to build paragraph clusters, where each cluster is assigned to a different author. In order to analyze the style of a writer, the text is split into single sentences, and for each sentence a full parse tree is calculated. From these trees, a profile is then computed that represents the main characteristics of each paragraph. Finally, the profiles serve as input for common clustering algorithms. An extensive evaluation using different English data sets reveals promising results, while a supplementary analysis indicates that, in general, common classification algorithms perform better than clustering approaches.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        The growing amount of available data is hardly manageable
without specific tools and algorithms that provide relevant portions
of that data to the user. While this problem is generally addressed
with information retrieval approaches, another way to significantly
reduce the amount of data is to build clusters: within each cluster,
the data is similar according to some predefined features. Many
approaches exist that propose algorithms to cluster plain text
documents (e.g. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]) or
specific web documents (e.g. [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ]) by utilizing various features.
      </p>
      <p>
        Approaches which attempt to divide a single text document into
distinguishable units, such as different topics, are usually referred
to as text segmentation approaches. Here, many features are used as
well, including statistical models, similarities between words, and
other semantic analyses. Moreover, text clusters are also
used in recent plagiarism detection algorithms (e.g. [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ]) which
try to build one cluster for the main author and one or more clusters
for intrusive paragraphs. Another scenario where the clustering of
text is applicable is the analysis of multi-author academic papers:
especially the verification of collaborative student works such as
bachelor or master theses can be useful in order to determine the
amount of work done by each student.
      </p>
      <p>
        Using results of previous work in the field of intrinsic
plagiarism detection [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ] and authorship attribution [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ], the assumption is reused
that individual authors have significantly different writing styles in
terms of the syntax used to construct sentences. For example, the
following sentence (extracted from a web blog): ”My chair started
squeaking a few days ago and it’s driving me nuts.” (S1) could also
be formulated as ”Since a few days my chair is squeaking - it’s
simply annoying.” (S2), which is semantically equivalent but differs
significantly in syntax, as can be seen in Figure 1. The main idea
of this work is to quantify those differences by calculating grammar
profiles and to use this information to decompose a collaboratively
written document, i.e., to assign each paragraph of a document to
an author.
      </p>
      <p>The rest of this paper is organized as follows: Section 2 at first
recapitulates the principle of pq-grams, which represent a core
concept of the approach. Subsequently the algorithm is presented in
detail, which is then evaluated in Section 3 by using different
clustering algorithms and data sets. A comparison of clustering and
classification approaches is discussed in Section 4, while Section 5
depicts related work. Finally, a conclusion and future work
directions are given in Section 6.
</p>
    </sec>
    <sec id="sec-2">
      <title>ALGORITHM</title>
      <p>In the following the concept of pq-grams is explained, which
serves as the basic stylistic measure in this approach to distinguish
between authors. Subsequently, the concrete steps performed by
the algorithm are discussed in detail.
</p>
    </sec>
    <sec id="sec-3">
      <title>Preliminaries: pq-grams</title>
      <p>
        Similar to n-grams that represent subparts of given length n of
a string, pq-grams extract substructures of an ordered, labeled tree
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The size of a pq-gram is determined by a stem (p) and a base
(q), as shown in Figure 2. Here p defines how many nodes
are included vertically, and q defines the number of nodes to be
considered horizontally. For example, a valid pq-gram with p = 2
and q = 3 starting from PP at the left side of tree (S2) shown in
Figure 1 would be [PP-NP-DT-JJ-NNS] (the concrete words
are omitted).
      </p>
      <p>The pq-gram index then consists of all possible pq-grams of
a tree. In order to obtain all pq-grams, the base is additionally
shifted left and right: if fewer than q nodes exist horizontally,
the corresponding place in the pq-gram is filled with *, indicating
a missing node. Applying this idea to the previous example, the
pq-gram [PP-IN-*-*-*] (no nodes in the base) is also valid, and
[PP-NP-*-*-DT] (base shifted left by two), [PP-NP-*-DT-JJ] (base
shifted left by one), [PP-NP-JJ-NNS-*] (base shifted right by one)
and [PP-NP-NNS-*-*] (base shifted right by two) have to be
considered as well. As a last example, all leaves have the pq-gram
pattern [leaf_label-*-*-*-*].</p>
      <p>Finally, the pq-gram index is the set of all valid pq-grams of a
tree, whereby multiple occurrences of the same pq-grams are also
present multiple times in the index.</p>
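      <p>To make the index construction concrete, the following sketch (an
illustrative Python re-implementation, not code from the paper) extracts
all valid pq-grams from a parse tree given as nested (label, children)
tuples; leaves yield exactly one pq-gram, matching the pattern above:</p>

```python
def pq_grams(tree, p=2, q=3, star="*"):
    """Compute the pq-gram index of an ordered, labeled tree given as
    (label, [children...]) tuples. Duplicates are kept on purpose,
    since the index is a multiset."""
    index = []

    def walk(node, ancestors):
        label, children = node
        # stem: the node plus its p-1 nearest ancestors, padded with stars
        stem = ([star] * (p - 1) + ancestors + [label])[-p:]
        if not children:
            # leaves get exactly one pq-gram with an all-star base
            index.append(tuple(stem + [star] * q))
        else:
            # base: slide a window of size q over the children,
            # padded with q-1 stars on both sides
            labels = [star] * (q - 1) + [c[0] for c in children] + [star] * (q - 1)
            for i in range(len(labels) - q + 1):
                index.append(tuple(stem + labels[i : i + q]))
            for child in children:
                walk(child, ancestors + [label])

    walk(tree, [])
    return index
```

      <p>Applied with p = 2 and q = 3 to the PP subtree of (S2), the sketch
reproduces the pq-grams discussed above, e.g. [PP-NP-DT-JJ-NNS],
[PP-NP-*-*-DT] and [PP-IN-*-*-*].</p>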
    </sec>
    <sec id="sec-4">
      <title>Clustering by Authors</title>
      <p>
        The number of choices an author has to formulate a sentence
in terms of grammar structure is rather high, and the assumption
in this approach is that the concrete choice is made mostly
intuitively and unconsciously. On that basis the grammar of authors is
analyzed, which serves as input for common state-of-the-art
clustering algorithms to build clusters of text documents or paragraphs.
The decision of the clustering algorithms is thereby based on the
frequencies of occurring pq-grams, i.e., on pq-gram profiles. In
detail, given a text document, the algorithm consists of the following
steps:
1. At first the document is preprocessed by eliminating
unnecessary whitespace and non-parsable characters. For
example, many data sets are based on novels and articles of
various authors, where OCR text recognition is frequently
used due to the lack of digital data. Additionally, such
documents contain problem sources like chapter numbers and
titles or incorrectly parsed picture frames that result in
non-alphanumeric characters.
2. Subsequently, the document is partitioned into single
paragraphs. For simplification reasons this is currently done by
only detecting multiple line breaks.
3. Each paragraph is then split into single sentences by
utilizing a sentence boundary detection algorithm implemented
within the OpenNLP framework1. Then for each sentence
a full grammar tree is calculated using the Stanford Parser
[
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. For example, Figure 1 depicts the grammar trees
resulting from analyzing sentences (S1) and (S2), respectively.
The labels of each tree correspond to a part-of-speech (POS)
tag of the Penn Treebank set [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], where e.g. NP corresponds
to a noun phrase, DT to a determiner and JJS to a
superlative adjective. In order to examine only the building structure of
sentences, as intended by this work, the concrete
words, i.e., the leaves of the tree, are omitted.
4. Using the grammar trees of all sentences of the document,
the pq-gram index is calculated. As shown in Section 2.1,
all valid pq-grams of a sentence are extracted and stored in
a pq-gram index. By combining the pq-gram indices of all
sentences, a pq-gram profile is computed which contains a
list of all pq-grams and their corresponding frequency of
appearance in the text. The frequency is normalized by
the total number of all appearing pq-grams. As an example,
the five most frequently used pq-grams with p = 2 and q = 3 of a
sample document are shown in Table 1. The profile is sorted
descending by the normalized occurrence, and an additional
rank value is introduced that simply defines a natural order,
which is used in the evaluation (see Section 3).
      </p>
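      <p>Step 4 can be sketched as follows, assuming the pq-gram index of
each sentence is already available as a list of pq-grams (a simplified
illustration, not the original implementation):</p>

```python
from collections import Counter

def pq_gram_profile(sentence_indices):
    """Combine the pq-gram indices of all sentences into a profile:
    each pq-gram with its frequency normalized by the total number of
    appearing pq-grams, sorted descending and annotated with a rank
    that defines a natural order."""
    counts = Counter()
    for index in sentence_indices:
        counts.update(index)
    total = sum(counts.values())
    return [(rank, gram, count / total)
            for rank, (gram, count) in enumerate(counts.most_common(), start=1)]
```

      <p>The resulting (rank, pq-gram, frequency) triples correspond to the
kind of rows shown in Table 1.</p>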
    </sec>
    <sec id="sec-5">
      <title>Utilized Algorithms</title>
      <p>
        Using the WEKA framework [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], the following clustering
algorithms have been evaluated: K-Means [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], Cascaded K-Means (the
number of clusters is cascaded and automatically chosen) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ],
X-Means [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ], Agglomerative Hierarchical Clustering [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ], and
Farthest First [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>For the clustering algorithms K-Means, Hierarchical Clustering
and Farthest First, the number of clusters has been predefined
according to the respective test data. This means that if the test document
was written collaboratively by three authors, the number of clusters has
also been set to three. On the other hand, the algorithms Cascaded
K-Means and X-Means implicitly decide which number of clusters
is optimal. Therefore these algorithms have only been constrained
to a range, i.e., the minimum and maximum numbers of clusters have
been set to two and six, respectively.</p>
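      <p>As a rough, plain-Python stand-in for the WEKA clusterers (not the
evaluated implementation), k-means over dense pq-gram frequency vectors
with a predefined number of clusters can be sketched as:</p>

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(vectors, k, iterations=50, seed=0):
    """Minimal k-means: assign each vector to the nearest centroid and
    recompute centroids until the assignment is stable."""
    rnd = random.Random(seed)
    centroids = [list(v) for v in rnd.sample(vectors, k)]
    assignment = []
    for _ in range(iterations):
        new_assignment = [min(range(k), key=lambda c: dist2(v, centroids[c]))
                          for v in vectors]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment
```

      <p>Here k is fixed in advance, mirroring the setup above; Cascaded
K-Means and X-Means would instead search over a range of cluster counts.</p>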
    </sec>
    <sec id="sec-6">
      <title>EVALUATION</title>
      <p>The utilization of pq-gram profiles as input features for
modern clustering algorithms has been extensively evaluated using
different documents and data sets. As clustering and classification
problems are closely related, the global aim was to experiment on
the accuracy of automatic text clustering using solely the proposed
grammar feature, and furthermore to compare it to those of current
classification techniques.
</p>
    </sec>
    <sec id="sec-7">
      <title>Test Data and Experimental Setup</title>
      <p>In order to evaluate the idea, different documents and test data
sets have been used, which are explained in more detail in the
following. Thereby single documents have been created which
contain paragraphs written by different authors, as well as multiple
documents, whereby each document is written by one author. In
the latter case, every document is treated as one (large) paragraph
for simplification reasons.</p>
      <p>For the experiment, different parameter settings have been
evaluated, i.e., the pq-gram values p and q have been varied from 2 to
4, in combination with the three different feature sets. Concretely,
the following data sets have been used:</p>
      <p>Twain-Wells (T-W): This document has been specifically
created for the evaluation of in-document clustering. It
contains 50 paragraphs of the book ”The Adventures of
Huckleberry Finn” by Mark Twain, and 50 paragraphs of ”The
Time Machine” by H. G. Wells2. All paragraphs have been
randomly shuffled, whereby the size of each paragraph varies
from approximately 25 words up to 280 words.</p>
      <sec id="sec-7-1">
        <title>Twain-Wells-Shelley (T-W-S)</title>
        <p>In a similar fashion, a three-author document has been created.
It again uses (different) paragraphs of the same books by Twain and
Wells, extended by paragraphs of the book ”Frankenstein; Or, The
Modern Prometheus” by Mary Wollstonecraft Shelley.
Summarizing, the document contains 50 paragraphs by Mark
Twain, 50 paragraphs by H. G. Wells and another 50
paragraphs by Mary Shelley, whereby the paragraph sizes are
similar to the Twain-Wells document.</p>
        <p>
          The Federalist Papers (FED): Probably the most frequently cited
text corpus in the field of authorship attribution is a series
of 85 political essays called ”The Federalist Papers”, written
by John Jay, Alexander Hamilton and James Madison in the
18th century. While most of the authorships are undisputed,
many works have studied and questioned the correct
authorship of 12 disputed essays [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ], which have been excluded from
the experiment. (The Twain, Wells and Shelley books have been
obtained from the Project Gutenberg library, http://www.gutenberg.org,
visited July 2014.)
        </p>
        <p>
          The PAN’12 competition corpus (PAN12): As a well-known,
state-of-the-art corpus originally created for use in
authorship identification, parts of the PAN2012 corpus [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]
have been integrated. The corpus is composed of several
fiction texts and split into several subtasks that cover small-
and common-length documents (1,800-6,060 words) as well
as larger documents (up to 13,000 words) and novel-length
documents (up to 170,000 words). Finally, the test set used in
this evaluation contains 14 documents (paragraphs) written
by three authors that are distributed equally.
        </p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>Results</title>
      <p>The best results of the evaluation are presented in Table 2, where
the best performance for each clusterer over all data sets is shown in
subtable (a), and the best configuration for each data set is shown
in subtable (b). With an accuracy of 63.7%, the
K-Means algorithm worked best, using p = 2, q = 3 and
utilizing all available features. Interestingly, the X-Means algorithm
also achieved good results considering the fact that in this case the
number of clusters has been assigned automatically by the
algorithm. Finally, the hierarchical clusterer performed worst, with an
accuracy nearly 10% below that of K-Means.</p>
      <p>Regarding the best performances for each test data set, the
results for the manually created data sets from novel literature are
generally poor. For example, the best result for the two-author
document Twain-Wells is only 59.6%, i.e., the accuracy is only slightly
better than the baseline of 50%, which can be achieved
by randomly assigning paragraphs to two clusters. On the other
hand, the data sets reused from authorship attribution, namely the
FED and the PAN12 data sets, achieved very good results with
accuracies of about 89% and 83%, respectively. Nevertheless, as the
other data sets have been specifically created for the clustering
evaluation, these results may be more meaningful. Therefore a
comparison between clustering and classification approaches is discussed
in the following, showing that the latter achieve significantly better
results on those data sets when using the same features.</p>
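      <p>Note that such accuracy figures presuppose a mapping from
anonymous cluster IDs to authors. One plausible way to compute such a
value (a hypothetical helper; the paper does not spell out its exact
procedure) is to choose, among all assignments of clusters to authors,
the one that maximizes the number of correctly grouped paragraphs,
assuming at most as many clusters as authors:</p>

```python
from itertools import permutations

def clustering_accuracy(true_authors, cluster_labels):
    """Fraction of paragraphs grouped correctly under the best possible
    cluster-to-author assignment (feasible for small cluster counts)."""
    clusters = sorted(set(cluster_labels))
    authors = sorted(set(true_authors))
    best = 0
    for perm in permutations(authors, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(mapping[c] == a
                   for c, a in zip(cluster_labels, true_authors))
        best = max(best, hits)
    return best / len(true_authors)
```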
    </sec>
    <sec id="sec-9">
      <title>4. COMPARISON OF CLUSTERING AND CLASSIFICATION APPROACHES</title>
      <p>
        For the given data sets, any clustering problem can be
rewritten as a classification problem, with the exception that the
latter needs training data. Although a direct comparison should be
treated with caution, it still gives an insight into how the two
different approaches perform using the same data sets. Therefore an
additional evaluation is shown in the following, which compares the
performance of the clustering algorithms to that of the following
classification algorithms: Naive Bayes classifier [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], Bayes
Network using the K2 classifier [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], Large Linear Classification using
LibLinear [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], Support vector machine using LIBSVM with
nuSVC classification [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], k-nearest-neighbors classifier (kNN) using
k = 1 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], and a pruned C4.5 decision tree (J48) [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ]. To
compensate for the missing training data, a 10-fold cross-validation has been
used for each classifier.
      </p>
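      <p>The cross-validation splits can be sketched generically (an
illustrative helper, not tied to the WEKA setup): every paragraph lands
in exactly one test fold, and each classifier is trained on the
remaining nine folds:</p>

```python
def k_fold_splits(n_items, folds=10):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_items))
    for f in range(folds):
        test = indices[f::folds]          # every folds-th item, offset f
        test_set = set(test)
        train = [i for i in indices if i not in test_set]
        yield train, test
```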
      <p>Table 3 shows the performance of each classifier compared to the
best clustering result using the same data and pq-setting. It can be
seen that the classifiers significantly outperform the clustering
results for the Twain-Wells and Twain-Wells-Shelley documents. The
support vector machine framework (LibSVM) and the linear
classifier (LibLinear) performed best, reaching a maximum accuracy of
nearly 87% for the Twain-Wells document. Moreover, the average
improvement is given in the bottom line, showing that most of the
classifiers outperform the best clustering result by over 20% on
average. Solely the kNN algorithm achieves only minor improvements,
as it classified the two-author document with a poor accuracy of only
about 60%.</p>
      <p>A similar general improvement could be achieved on the
three-author document Twain-Wells-Shelley, as can be seen in subtable
(b). Again, LibSVM could achieve an accuracy of about 75%,
whereas the best clustering configuration could only reach 49%.
Except for the kNN algorithm, all classifiers significantly
outperform the best clustering results for every configuration.</p>
      <p>Quite different comparison results have been obtained for the
Federalist Papers and PAN12 data sets, respectively. Here, the
improvements gained from the classifiers are only minor, and in some
cases are even negative, i.e., the classification algorithms perform
worse than the clustering algorithms. A general explanation is the
good performance of the clustering algorithms on these data sets,
especially by utilizing the Farthest First and K-Means algorithms.</p>
      <p>In the case of the Federalist Papers data set shown in subtable (c),
all algorithms except kNN could achieve at least some
improvement. Although the LibLinear classifier could reach an outstanding
accuracy of 97%, the global improvement is below 10% for all
classifiers. Finally, subtable (d) shows the results for PAN12, where the
outcome is quite diverse, as some classifiers could improve on the
clusterers significantly, whereas others worsened the accuracy even more
drastically. A possible explanation might be the small data set (only
the subproblems A and B have been used), which may not be suited
very well for a reliable evaluation of the clustering approaches.</p>
      <p>Summarizing, the comparison of the different algorithms reveals
that in general classification algorithms perform better than
clustering algorithms when provided with the same (pq-gram) feature set.
Nevertheless, the results of the PAN12 experiment are very diverse
and indicate that there might be a problem with the data set itself,
and that this comparison should be treated carefully.</p>
    </sec>
    <sec id="sec-11">
      <title>5. RELATED WORK</title>
      <p>Most of the traditional document clustering approaches are based
on occurrences of words, i.e., inverted indices are built and used to
group documents. Thereby a unit to be clustered conforms exactly
to one document.</p>
      <p>[Table 3: accuracies of each classifier compared to the best
clustering result ("Max") for (a) Twain-Wells, (b) Twain-Wells-Shelley,
(c) Federalist Papers and (d) PAN12, with the average improvement
per classifier.]</p>
      <p>
        The main idea is often to compute topically
related document clusters and to assist web search engines to be able
to provide better results to the user, whereby the algorithms
proposed frequently are also patented (e.g. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]). Regularly applied
concepts in the feature extraction phase are the term frequency tf,
which measures how often a word occurs in a document, and the
term frequency-inverse document frequency tf-idf, which
measures the significance of a word compared to the whole document
collection. An example of a classical approach using these
techniques is published in [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ].
      </p>
      <p>
        The literature on cluster analysis within a single document, to
discriminate the authorships in a multi-author document as is
done in this paper, is surprisingly sparse. On the other hand, many
approaches exist to separate a document into paragraphs of
different topics, which are generally called text segmentation problems.
In this domain, the algorithms often perform vocabulary analysis
in various forms like word stem repetitions [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] or word frequency
models [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ], whereby ”methods for finding the topic boundaries
include sliding window, lexical chains, dynamic programming,
agglomerative clustering and divisive clustering” [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Although these
techniques could be modified to also cluster by
authors instead of topics, this is rarely done. In the following, some of
the existing methods are briefly summarized.
      </p>
      <p>
        Probably one of the first approaches that uses stylometry to
automatically detect boundaries of authors of collaboratively written
text is proposed in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Thereby the main intention was not to
expose authors or to gain insight into the work distribution, but to
provide a methodology for collaborative authors to equalize their style
in order to achieve better readability. To extract the style of
separated paragraphs, common stylometric features such as word/sentence
lengths, POS tag distributions or frequencies of POS classes at
sentence-initial and sentence-final positions are considered. An
extensive experiment revealed that stylometric features can be used to
find authorship boundaries, but that additional research is needed
in order to increase the accuracy and informativeness.
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] the authors also tried to divide a collaborative text into
different single-author paragraphs. In contrast to the previously
described handmade corpus, a large data set has been
computationally created by using (well-written) articles of an internet forum. At
first, different neural networks have been utilized using several
stylometric features. By using 90% of the data for training, the best
network could achieve an F-score of 53% for multi-author
documents on the remaining 10% of test data. In a second experiment,
only letter-bigram frequencies are used as distinguishing features.
Thereby an authorship boundary between paragraphs was marked
if the cosine distance exceeded a certain threshold. This method
reached an F-score of only 42%, and it is suspected that
letter-bigrams are not suitable for the (short) paragraphs used in the
evaluation.
      </p>
      <p>
        A two-stage process to cluster Hebrew Bible texts by authorship
is proposed in [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. Because a first attempt to represent chapters
only by bag-of-words led to negative results, the authors
additionally incorporated sets of synonyms (which could be generated by
comparing the original Hebrew texts with an English translation).
With a modified cosine-measure comparing these sets for given
chapters, two core clusters are compiled by using the ncut
algorithm [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. In the second step, the resulting clusters are used as
training data for a support vector machine, which finally assigns
every chapter to one of the two core clusters by using the simple
bag-of-words features tested earlier. It can thus happen
that units originally assigned to one cluster are moved to the other
one, depending on the prediction of the support vector machine.
With this two-stage approach the authors report a good accuracy of
about 80%, whereby it should be considered that the number of
potential authors has been fixed to two in the experiment. Nevertheless,
the authors state that their approach could be extended to more
authors with little effort.
      </p>
    </sec>
    <sec id="sec-12">
      <title>CONCLUSION AND FUTURE WORK</title>
      <p>In this paper, the automatic creation of paragraph clusters based
on the grammar of authors has been evaluated. Different
state-of-the-art clustering algorithms have been utilized with different input
features and tested on different data sets. The best-working
algorithm, K-Means, could achieve an accuracy of about 63% over all
test sets, whereby good individual results of up to 89% could be
reached for some configurations. In contrast, the specifically
created documents incorporating two and three authors could only
be clustered with a maximum accuracy of 59%.</p>
      <p>A comparison between clustering and classification algorithms
using the same input features has also been carried out. Disregarding
the missing training data, it could be observed that classifiers
generally produce higher accuracies, with improvements of up to 29%.
On the other hand, some classifiers perform worse on average than
clustering algorithms on individual data sets when using some
pq-gram configurations. Nevertheless, if the maximum accuracy for
each algorithm is considered, all classifiers perform significantly
better, as can be seen in Figure 3, which illustrates the best
performances of all utilized classification and clustering
algorithms.</p>
      <p>The
linear classification algorithm LibLinear could reach nearly 88%,
outperforming K-Means by 25% over all data sets.</p>
      <p>Finally, the best classification and clustering results for each data
set are shown in Figure 4. Again the classifiers achieve
higher accuracies, whereby the PAN12 subsets could be classified
100% correctly. As can be seen, a major improvement can be
gained for the novel literature documents. For example, the best
classifier reached 87% on the Twain-Wells document, whereas the
best clustering approach achieved only 59%.</p>
      <p>As shown in this paper, paragraphs of documents can be split
and clustered based on grammar features, but the accuracy is below
that of classification algorithms. Although the two algorithm types
should not be compared directly, as they are designed for
different problems, the significant differences in accuracy
indicate that classifiers can handle the grammar features better.
Nevertheless, future work should focus on evaluating the same features on
larger data sets, as clustering algorithms may produce better results
with an increasing amount of sample data.</p>
      <p>
        Another possible application could be the creation of whole
document clusters, where documents with similar grammar are grouped
together. Although such huge clusters are very difficult to
evaluate, due to the lack of ground-truth data, navigating through
thousands of documents based on grammar may be interesting, as
has been done for music genres (e.g. [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ]) or images (e.g. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]).
Moreover, grammar clusters may also be utilized for modern
recommendation algorithms once they have been calculated for large
data sets. For example, by analyzing all freely available books from
libraries like Project Gutenberg, a system could recommend other
books with a similar style based on the user’s reading history. Also,
an enhancement of current commercial recommender systems that
are used in large online stores like Amazon is conceivable.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Aha</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Kibler</surname>
          </string-name>
          .
          <article-title>Instance-Based Learning Algorithms</article-title>
          .
          <source>Machine Learning</source>
          ,
          <volume>6</volume>
          :
          <fpage>37</fpage>
          -
          <lpage>66</lpage>
          ,
          <year>1991</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Apte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Weiss</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B. F.</given-names>
            <surname>White</surname>
          </string-name>
          .
          <article-title>Lightweight Document Clustering</article-title>
          .
          <source>US Patent 6,654,739, Nov. 25, 2003</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.</given-names>
            <surname>Arthur</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Vassilvitskii</surname>
          </string-name>
          .
          <article-title>K-means++: The Advantages of Careful Seeding</article-title>
          .
          <source>In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '07</source>
          , pages
          <fpage>1027</fpage>
          -
          <lpage>1035</lpage>
          , Philadelphia, PA, USA,
          <year>2007</year>
          . Society for Industrial and Applied Mathematics.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>N.</given-names>
            <surname>Augsten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Böhlen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Gamper</surname>
          </string-name>
          <article-title>The pq-Gram Distance between Ordered Labeled Trees</article-title>
          .
          <source>ACM Transactions on Database Systems (TODS)</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Caliński</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Harabasz</surname>
          </string-name>
          .
          <article-title>A Dendrite Method for Cluster Analysis</article-title>
          .
          <source>Communications in Statistics - Theory and Methods</source>
          ,
          <volume>3</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>27</lpage>
          ,
          <year>1974</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.-C.</given-names>
            <surname>Chang</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.-J.</given-names>
            <surname>Lin</surname>
          </string-name>
          .
          <article-title>LIBSVM: A Library for Support Vector Machines</article-title>
          .
          <source>ACM Transactions on Intelligent Systems and Technology (TIST)</source>
          ,
          <volume>2</volume>
          (
          <issue>3</issue>
          ):
          <fpage>27</fpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>F. Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          .
          <article-title>Advances in Domain Independent Linear Text Segmentation</article-title>
          .
          <source>In Proceedings of the 1st North American chapter of the Association for Computational Linguistics conference</source>
          , pages
          <fpage>26</fpage>
          -
          <lpage>33</lpage>
          . Association for Computational Linguistics,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>G. F.</given-names>
            <surname>Cooper</surname>
          </string-name>
          and
          <string-name>
            <given-names>E.</given-names>
            <surname>Herskovits</surname>
          </string-name>
          .
          <article-title>A Bayesian Method for the Induction of Probabilistic Networks From Data</article-title>
          .
          <source>Machine learning</source>
          ,
          <volume>9</volume>
          (
          <issue>4</issue>
          ):
          <fpage>309</fpage>
          -
          <lpage>347</lpage>
          ,
          <year>1992</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Dasgupta</surname>
          </string-name>
          .
          <article-title>Performance Guarantees for Hierarchical Clustering</article-title>
          .
          <source>In Computational Learning Theory</source>
          , pages
          <fpage>351</fpage>
          -
          <lpage>363</lpage>
          . Springer,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>I. S.</given-names>
            <surname>Dhillon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Guan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Kulis</surname>
          </string-name>
          .
          <article-title>Kernel k-means: Spectral Clustering and Normalized Cuts</article-title>
          .
          <source>In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          , pages
          <fpage>551</fpage>
          -
          <lpage>556</lpage>
          . ACM,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Faktor</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Irani</surname>
          </string-name>
          .
          <article-title>“Clustering by Composition” - Unsupervised Discovery of Image Categories</article-title>
          .
          <source>In Computer Vision-ECCV 2012</source>
          , pages
          <fpage>474</fpage>
          -
          <lpage>487</lpage>
          . Springer,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>R.-E.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-J.</given-names>
            <surname>Hsieh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.-R.</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.-J.</given-names>
            <surname>Lin</surname>
          </string-name>
          .
          <article-title>LIBLINEAR: A Library for Large Linear Classification</article-title>
          .
          <source>The Journal of Machine Learning Research</source>
          ,
          <volume>9</volume>
          :
          <fpage>1871</fpage>
          -
          <lpage>1874</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Glover</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Hirst</surname>
          </string-name>
          .
          <article-title>Detecting Stylistic Inconsistencies in Collaborative Writing</article-title>
          .
          <source>In The New Writing Environment</source>
          , pages
          <fpage>147</fpage>
          -
          <lpage>168</lpage>
          . Springer,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>N.</given-names>
            <surname>Graham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Hirst</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Marthi</surname>
          </string-name>
          .
          <article-title>Segmenting Documents by Stylistic Character</article-title>
          .
          <source>Natural Language Engineering</source>
          ,
          <volume>11</volume>
          (
          <issue>04</issue>
          ):
          <fpage>397</fpage>
          -
          <lpage>415</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Frank</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Holmes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Pfahringer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Reutemann</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I. H.</given-names>
            <surname>Witten</surname>
          </string-name>
          .
          <article-title>The WEKA Data Mining Software: An Update</article-title>
          .
          <source>ACM SIGKDD Explorations Newsletter</source>
          ,
          <volume>11</volume>
          (
          <issue>1</issue>
          ):
          <fpage>10</fpage>
          -
          <lpage>18</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hotho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Staab</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Stumme</surname>
          </string-name>
          .
          <article-title>Ontologies Improve Text Document Clustering</article-title>
          .
          <source>In Data Mining, 2003. ICDM 2003. Third IEEE International Conference on</source>
          , pages
          <fpage>541</fpage>
          -
          <lpage>544</lpage>
          . IEEE,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>G. H.</given-names>
            <surname>John</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Langley</surname>
          </string-name>
          .
          <article-title>Estimating Continuous Distributions in Bayesian Classifiers</article-title>
          .
          <source>In Proceedings of the Eleventh conference on Uncertainty in artificial intelligence</source>
          , pages
          <fpage>338</fpage>
          -
          <lpage>345</lpage>
          . Morgan Kaufmann Publishers Inc.,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>P.</given-names>
            <surname>Juola</surname>
          </string-name>
          .
          <article-title>An Overview of the Traditional Authorship Attribution Subtask</article-title>
          . In CLEF (Online Working Notes/Labs/Workshop),
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>D.</given-names>
            <surname>Klein</surname>
          </string-name>
          and
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <article-title>Accurate Unlexicalized Parsing</article-title>
          .
          <source>In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1, ACL '03</source>
          , pages
          <fpage>423</fpage>
          -
          <lpage>430</lpage>
          , Stroudsburg, PA, USA,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>M.</given-names>
            <surname>Koppel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Akiva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Dershowitz</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Dershowitz</surname>
          </string-name>
          .
          <article-title>Unsupervised Decomposition of a Document into Authorial Components</article-title>
          .
          <source>In Proc. of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT '11</source>
          , pages
          <fpage>1356</fpage>
          -
          <lpage>1364</lpage>
          , Stroudsburg, PA, USA,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>B.</given-names>
            <surname>Larsen</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Aone</surname>
          </string-name>
          .
          <article-title>Fast and Effective Text Mining Using Linear-Time Document Clustering</article-title>
          .
          <source>In Proceedings of the 5th ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          , pages
          <fpage>16</fpage>
          -
          <lpage>22</lpage>
          . ACM,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Chung</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Holt</surname>
          </string-name>
          .
          <article-title>Text Document Clustering Based on Frequent Word Meaning Sequences</article-title>
          .
          <source>Data &amp; Knowledge Engineering</source>
          ,
          <volume>64</volume>
          (
          <issue>1</issue>
          ):
          <fpage>381</fpage>
          -
          <lpage>404</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>M. P.</given-names>
            <surname>Marcus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Marcinkiewicz</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Santorini</surname>
          </string-name>
          .
          <article-title>Building a large annotated corpus of English: The Penn Treebank</article-title>
          .
          <source>Computational Linguistics</source>
          ,
          <volume>19</volume>
          :
          <fpage>313</fpage>
          -
          <lpage>330</lpage>
          ,
          <year>June 1993</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>F.</given-names>
            <surname>Mosteller</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Wallace</surname>
          </string-name>
          .
          <article-title>Inference and Disputed Authorship: The Federalist</article-title>
          . Addison-Wesley,
          <year>1964</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>F.</given-names>
            <surname>Murtagh</surname>
          </string-name>
          .
          <article-title>A Survey of Recent Advances in Hierarchical Clustering Algorithms</article-title>
          .
          <source>The Computer Journal</source>
          ,
          <volume>26</volume>
          (
          <issue>4</issue>
          ):
          <fpage>354</fpage>
          -
          <lpage>359</lpage>
          ,
          <year>1983</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>D.</given-names>
            <surname>Pelleg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. W.</given-names>
            <surname>Moore</surname>
          </string-name>
          .
          <article-title>X-means: Extending K-means with Efficient Estimation of the Number of Clusters</article-title>
          . In ICML, pages
          <fpage>727</fpage>
          -
          <lpage>734</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Ponte</surname>
          </string-name>
          and
          <string-name>
            <given-names>W. B.</given-names>
            <surname>Croft</surname>
          </string-name>
          .
          <article-title>Text Segmentation by Topic</article-title>
          .
          <source>In Research and Advanced Technology for Digital Libraries</source>
          , pages
          <fpage>113</fpage>
          -
          <lpage>125</lpage>
          . Springer,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Quinlan</surname>
          </string-name>
          .
          <source>C4.5: Programs for Machine Learning</source>
          , volume
          <volume>1</volume>
          . Morgan Kaufmann,
          <year>1993</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Reynar</surname>
          </string-name>
          .
          <article-title>Statistical Models for Topic Segmentation</article-title>
          .
          <source>In Proc. of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics</source>
          , pages
          <fpage>357</fpage>
          -
          <lpage>364</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>N.</given-names>
            <surname>Scaringella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zoia</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Mlynek</surname>
          </string-name>
          .
          <article-title>Automatic Genre Classification of Music Content: a Survey</article-title>
          .
          <source>IEEE Signal Processing Magazine</source>
          ,
          <volume>23</volume>
          (
          <issue>2</issue>
          ):
          <fpage>133</fpage>
          -
          <lpage>141</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>M.</given-names>
            <surname>Tschuggnall</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Specht</surname>
          </string-name>
          .
          <article-title>Using Grammar-Profiles to Intrinsically Expose Plagiarism in Text Documents</article-title>
          .
          <source>In Proc. of the 18th Conf. of Natural Language Processing and Information Systems (NLDB)</source>
          , pages
          <fpage>297</fpage>
          -
          <lpage>302</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>M.</given-names>
            <surname>Tschuggnall</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Specht</surname>
          </string-name>
          .
          <article-title>Enhancing Authorship Attribution By Utilizing Syntax Tree Profiles</article-title>
          .
          <source>In Proc. of the 14th Conf. of the European Chapter of the Assoc. for Computational Ling. (EACL)</source>
          , pages
          <fpage>195</fpage>
          -
          <lpage>199</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>O.</given-names>
            <surname>Zamir</surname>
          </string-name>
          and
          <string-name>
            <given-names>O.</given-names>
            <surname>Etzioni</surname>
          </string-name>
          .
          <article-title>Web Document Clustering: A Feasibility Demonstration</article-title>
          .
          <source>In Proc. of the 21st annual international ACM conference on Research and development in information retrieval (SIGIR)</source>
          , pages
          <fpage>46</fpage>
          -
          <lpage>54</lpage>
          . ACM,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>D.</given-names>
            <surname>Zou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-J.</given-names>
            <surname>Long</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ling</surname>
          </string-name>
          .
          <article-title>A Cluster-Based Plagiarism Detection Method</article-title>
          .
          <source>In Notebook Papers of CLEF 2010 LABs and Workshops</source>
          , 22-23 September
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>