=Paper=
{{Paper
|id=Vol-1313/paper_04
|storemode=property
|title=Automatic Decomposition of Multi-Author Documents Using Grammar Analysis
|pdfUrl=https://ceur-ws.org/Vol-1313/paper_4.pdf
|volume=Vol-1313
|dblpUrl=https://dblp.org/rec/conf/gvd/TschuggnallS14
}}
==Automatic Decomposition of Multi-Author Documents Using Grammar Analysis==
Michael Tschuggnall and Günther Specht
Databases and Information Systems, Institute of Computer Science, University of Innsbruck, Austria
{michael.tschuggnall, guenther.specht}@uibk.ac.at

ABSTRACT

The task of text segmentation is to automatically split a text document into individual subparts, which differ according to specific measures. In this paper, an approach is presented that attempts to separate text sections of a collaboratively written document based on the grammar syntax of authors. The main idea is thereby to quantify differences in the grammatical writing style of authors and to use this information to build paragraph clusters, whereby each cluster is assigned to a different author. In order to analyze the style of a writer, the text is split into single sentences, and for each sentence a full parse tree is calculated. Using the latter, a profile is subsequently computed that represents the main characteristics of each paragraph. Finally, the profiles serve as input for common clustering algorithms. An extensive evaluation using different English data sets reveals promising results, whereby a supplementary analysis indicates that in general common classification algorithms perform better than clustering approaches.

Keywords: Text Segmentation, Multi-Author Decomposition, Parse Trees, pq-grams, Clustering

Copyright (c) by the paper's authors. Copying permitted only for private and academic purposes. In: G. Specht, H. Gamper, F. Klan (eds.): Proceedings of the 26th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 21.10.2014 - 24.10.2014, Bozen, Italy, published at http://ceur-ws.org.

1. INTRODUCTION

The growing amount of currently available data is hardly manageable without the use of specific tools and algorithms that provide relevant portions of that data to the user. While this problem is generally addressed with information retrieval approaches, another possibility to significantly reduce the amount of data is to build clusters.
Within each cluster, the data is similar according to some predefined features. Many approaches exist that propose algorithms to cluster plain text documents (e.g. [16], [22]) or specific web documents (e.g. [33]) by utilizing various features. Approaches which attempt to divide a single text document into distinguishable units such as different topics are usually referred to as text segmentation approaches. Here, many features including statistical models, similarities between words or other semantic analyses are also used. Moreover, text clusters are used in recent plagiarism detection algorithms (e.g. [34]) which try to build a cluster for the main author and one or more clusters for intrusive paragraphs. Another scenario where the clustering of text is applicable is the analysis of multi-author academic papers: especially the verification of collaborative student works such as bachelor or master theses can be useful in order to determine the amount of work done by each student.

Using results of previous work in the fields of intrinsic plagiarism detection [31] and authorship attribution [32], the assumption is reused that individual authors have significantly different writing styles in terms of the syntax that is used to construct sentences. For example, the following sentence (extracted from a web blog): "My chair started squeaking a few days ago and it's driving me nuts." (S1) could also be formulated as "Since a few days my chair is squeaking - it's simply annoying." (S2), which is semantically equivalent but differs significantly in its syntax, as can be seen in Figure 1. The main idea of this work is to quantify those differences by calculating grammar profiles and to use this information to decompose a collaboratively written document, i.e., to assign each paragraph of a document to an author.

Figure 1: Grammar Trees of the Semantically Equivalent Sentences (S1) and (S2). (parse trees omitted)

The rest of this paper is organized as follows: Section 2 first recapitulates the principle of pq-grams, which represent a core concept of the approach, and subsequently presents the algorithm in detail. The algorithm is then evaluated in Section 3 using different clustering algorithms and data sets. A comparison of clustering and classification approaches is discussed in Section 4, while Section 5 depicts related work. Finally, conclusions and future work directions are given in Section 6.

2. ALGORITHM

In the following, the concept of pq-grams is explained, which serves as the basic stylistic measure in this approach to distinguish between authors. Subsequently, the concrete steps performed by the algorithm are discussed in detail.

2.1 Preliminaries: pq-grams

Similar to n-grams that represent subparts of given length n of a string, pq-grams extract substructures of an ordered, labeled tree [4]. The size of a pq-gram is determined by a stem (p) and a base (q) as shown in Figure 2. Thereby p defines how many nodes are included vertically, and q defines the number of nodes to be considered horizontally. For example, a valid pq-gram with p = 2 and q = 3 starting from PP at the left side of tree (S2) shown in Figure 1 would be [PP-NP-DT-JJ-NNS] (the concrete words are omitted).

Figure 2: Structure of a pq-gram Consisting of Stem p = 2 and Base q = 3. (figure omitted)

The pq-gram index then consists of all possible pq-grams of a tree. In order to obtain all pq-grams, the base is additionally shifted left and right: if fewer than q nodes exist horizontally, the corresponding places in the pq-gram are filled with *, indicating missing nodes. Applying this idea to the previous example, the pq-gram [PP-IN-*-*-*] (no nodes in the base) is also valid, and [PP-NP-*-*-DT] (base shifted left by two), [PP-NP-*-DT-JJ] (base shifted left by one), [PP-NP-JJ-NNS-*] (base shifted right by one) and [PP-NP-NNS-*-*] (base shifted right by two) have to be considered as well. As a last example, all leaves have the pq-gram pattern [leaf_label-*-*-*-*]. Finally, the pq-gram index is the set of all valid pq-grams of a tree, whereby multiple occurrences of the same pq-grams are also present multiple times in the index.
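To make the extraction concrete, the following minimal Python sketch computes the pq-gram index of a small labeled tree using the shifting scheme just described. The Node class and the example subtree are illustrative assumptions; the paper itself does not publish an implementation.

```python
# Minimal sketch of pq-gram extraction for ordered, labeled trees
# (default p = 2, q = 3), following the shifting scheme described above.
# The Node class and the example tree are illustrative assumptions.

class Node:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

def pq_grams(root, p=2, q=3):
    """Return the pq-gram index of the tree (a list; duplicates are kept)."""
    grams = []

    def visit(node, ancestors):
        # stem: the node's label plus its p-1 nearest ancestors, '*'-padded
        stem = (ancestors + [node.label])[-p:]
        stem = ['*'] * (p - len(stem)) + stem
        if not node.children:
            grams.append(stem + ['*'] * q)  # leaf pattern [label-*-...-*]
        else:
            # slide a window of width q over the children, padded with '*'
            padded = (['*'] * (q - 1)
                      + [child.label for child in node.children]
                      + ['*'] * (q - 1))
            for i in range(len(padded) - q + 1):
                grams.append(stem + padded[i:i + q])
            for child in node.children:
                visit(child, stem)

    visit(root, [])
    return ['-'.join(g) for g in grams]

# The PP subtree of sentence (S2), words omitted:
pp = Node('PP', [Node('IN'),
                 Node('NP', [Node('DT'), Node('JJ'), Node('NNS')])])
print(pq_grams(pp))
```

Applied to the PP subtree of (S2), the sketch reproduces the pq-grams listed above, e.g. [PP-IN-*-*-*] and [PP-NP-DT-JJ-NNS] (plus grams with a *-padded stem for PP itself, since PP is treated as a root here).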
2.2 Clustering by Authors

The number of choices an author has to formulate a sentence in terms of grammar structure is rather high, and the assumption in this approach is that the concrete choice is made mostly intuitively and unconsciously. On that basis the grammar of authors is analyzed, which serves as input for common state-of-the-art clustering algorithms to build clusters of text documents or paragraphs. The decision of the clustering algorithms is thereby based on the frequencies of occurring pq-grams, i.e., on pq-gram profiles. In detail, given a text document the algorithm consists of the following steps:

1. At first the document is preprocessed by eliminating unnecessary whitespaces or non-parsable characters. For example, many data sets are based on novels and articles of various authors, whereby OCR text recognition is frequently used due to the lack of digital data. Additionally, such documents contain problem sources like chapter numbers and titles or incorrectly parsed picture frames that result in non-alphanumeric characters.

2. Subsequently, the document is partitioned into single paragraphs. For simplification reasons this is currently done by only detecting multiple line breaks.

3. Each paragraph is then split into single sentences by utilizing a sentence boundary detection algorithm implemented within the OpenNLP framework (Apache OpenNLP, http://incubator.apache.org/opennlp, visited July 2014). Then for each sentence a full grammar tree is calculated using the Stanford Parser [19]. For example, Figure 1 depicts the grammar trees resulting from analyzing sentences (S1) and (S2), respectively. The labels of each tree correspond to a part-of-speech (POS) tag of the Penn Treebank set [23], where e.g. NP corresponds to a noun phrase, DT to a determiner or JJS to a superlative adjective. In order to examine only the building structure of sentences, as is intended by this work, the concrete words, i.e., the leaves of the tree, are omitted.

4. Using the grammar trees of all sentences of the document, the pq-gram index is calculated. As shown in Section 2.1, all valid pq-grams of a sentence are extracted and stored into a pq-gram index. By combining the pq-gram indices of all sentences, a pq-gram profile is computed which contains a list of all pq-grams and their corresponding frequency of appearance in the text. Thereby the frequency is normalized by the total number of all appearing pq-grams. As an example, the five most frequently used pq-grams (p = 2, q = 3) of a sample document are shown in Table 1. The profile is sorted descending by the normalized occurrence, and an additional rank value is introduced that simply defines a natural order, which is used in the evaluation (see Section 3). A sketch of this profile computation follows the list.

5. Finally, each paragraph profile is provided as input for clustering algorithms, which are asked to build clusters based on the contained pq-grams. Concretely, three different feature sets have been evaluated: (1) the frequencies of occurrence of each pq-gram, (2) the rank of each pq-gram, and (3) a union of the latter sets.

Table 1: Example of the Five Most Frequently Used pq-grams of a Sample Document.

  pq-gram        Occurrence [%]  Rank
  NP-NN-*-*-*    2.68            1
  PP-IN-*-*-*    2.25            2
  NP-DT-*-*-*    1.99            3
  NP-NNP-*-*-*   1.44            4
  S-VP-*-*-VBD   1.08            5
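As referenced in step 4 above, the profile computation can be sketched as follows. pq_grams() is the extraction sketch from Section 2.1; the profile layout {pq-gram: (frequency, rank)} is an assumption chosen to carry both feature sets of step 5.

```python
# Sketch of step 4: combine the pq-gram indices of all sentence trees of
# a paragraph into a normalized pq-gram profile with rank values.

from collections import Counter

def pq_gram_profile(sentence_trees, p=2, q=3):
    counts = Counter()
    for tree in sentence_trees:
        counts.update(pq_grams(tree, p, q))
    total = sum(counts.values())
    profile = {}
    # most_common() yields pq-grams sorted descending by occurrence,
    # which directly defines the rank values used in the evaluation
    for rank, (gram, n) in enumerate(counts.most_common(), start=1):
        profile[gram] = (n / total, rank)
    return profile
```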
2.3 Utilized Algorithms

Using the WEKA framework [15], the following clustering algorithms have been evaluated: K-Means [3], Cascaded K-Means (the number of clusters is cascaded and automatically chosen) [5], X-Means [26], Agglomerative Hierarchical Clustering [25], and Farthest First [9].

For the clustering algorithms K-Means, Hierarchical Clustering and Farthest First, the number of clusters has been predefined according to the respective test data. This means that if the test document has been written collaboratively by three authors, the number of clusters has also been set to three. On the other hand, the algorithms Cascaded K-Means and X-Means implicitly decide which number of clusters is optimal. Therefore these algorithms have only been limited in ranges, i.e., the minimum and maximum number of clusters has been set to two and six, respectively.
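The paper runs the clustering inside WEKA; as a library-swapped sketch, the same setup (a fixed cluster count and occurrence-rate features over the union of all observed pq-grams) might look as follows with scikit-learn's KMeans. The vectorization details are assumptions.

```python
# Library-swapped sketch of step 5: cluster the paragraph profiles.
# The paper uses WEKA; scikit-learn's KMeans stands in here.

import numpy as np
from sklearn.cluster import KMeans

def profile_matrix(profiles):
    """Occurrence-rate feature vectors over the union of all pq-grams.

    profiles: one {pq-gram: (frequency, rank)} dict per paragraph.
    """
    vocabulary = sorted({g for prof in profiles for g in prof})
    return np.array([[prof.get(g, (0.0, 0))[0] for g in vocabulary]
                     for prof in profiles])

def cluster_paragraphs(profiles, n_clusters):
    # the cluster count is fixed to the known number of authors, as done
    # for K-Means, Hierarchical Clustering and Farthest First (Section 2.3)
    km = KMeans(n_clusters=n_clusters, n_init=10)
    return km.fit_predict(profile_matrix(profiles))
```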
3. EVALUATION

The utilization of pq-gram profiles as input features for modern clustering algorithms has been extensively evaluated using different documents and data sets. As clustering and classification problems are closely related, the global aim was to experiment on the accuracy of automatic text clustering using solely the proposed grammar feature, and furthermore to compare it to that of current classification techniques.

3.1 Test Data and Experimental Setup

In order to evaluate the idea, different documents and test data sets have been used, which are explained in more detail in the following. Thereby single documents have been created which contain paragraphs written by different authors, as well as multiple documents, whereby each document is written by one author. In the latter case, every document is treated as one (large) paragraph for simplification reasons. For the experiment, different parameter settings have been evaluated, i.e., the pq-gram values p and q have been varied from 2 to 4, in combination with the three different feature sets. Concretely, the following data sets have been used:

• Twain-Wells (T-W): This document has been specifically created for the evaluation of in-document clustering. It contains 50 paragraphs of the book "The Adventures of Huckleberry Finn" by Mark Twain, and 50 paragraphs of "The Time Machine" by H. G. Wells (both obtained from the Project Gutenberg library, http://www.gutenberg.org, visited July 2014). All paragraphs have been randomly shuffled, whereby the size of each paragraph varies from approximately 25 words up to 280 words.

• Twain-Wells-Shelley (T-W-S): In a similar fashion a three-author document has been created. It again uses (different) paragraphs of the same books by Twain and Wells, and appends paragraphs of the book "Frankenstein; Or, The Modern Prometheus" by Mary Wollstonecraft Shelley. Summarizing, the document contains 50 paragraphs by Mark Twain, 50 paragraphs by H. G. Wells and another 50 paragraphs by Mary Shelley, whereby the paragraph sizes are similar to the Twain-Wells document.

• The Federalist Papers (FED): Probably the most frequently cited text corpus in the field of authorship attribution is a series of 85 political essays called "The Federalist Papers", written by John Jay, Alexander Hamilton and James Madison in the 18th century. While most of the authorships are undoubted, many works have studied and questioned the correct authorship of 12 disputed essays [24], which have been excluded in the experiment.

• The PAN'12 competition corpus (PAN12): As a well-known, state-of-the-art corpus originally created for use in authorship identification, parts of the PAN2012 corpus [18] (the subtasks A and B) have been integrated. The corpus is composed of several fiction texts and split into several subtasks that cover small and common-length documents (1800-6060 words) as well as larger documents (up to 13,000 words) and novel-length documents (up to 170,000 words). The test set used in this evaluation contains 14 documents (paragraphs) written by three authors that are distributed equally.

3.2 Results

The best results of the evaluation are presented in Table 2, where the best performance of each clusterer over all data sets is shown in subtable (a), and the best configuration for each data set is shown in subtable (b). With an accuracy of 63.7%, the K-Means algorithm worked best when utilizing all available features. Interestingly, the X-Means algorithm also achieved good results considering the fact that in this case the number of clusters has been assigned automatically by the algorithm. Finally, the hierarchical clusterer performed worst, gaining an accuracy nearly 10% below that of K-Means.

Table 2: Best Evaluation Results for Each Clustering Algorithm and Test Data Set in Percent.

(a) Clustering Algorithms

  Method               p  q  Feature Set      Accuracy
  K-Means              3  2  All              63.7
  X-Means              2  4  Rank             61.7
  Farthest First       4  2  Occurrence-Rate  58.7
  Cascaded K-Means     2  2  Rank             55.3
  Hierarchical Clust.  4  3  Occurrence-Rate  54.7

(b) Test Data Sets

  Data Set   Method        p  q  Feat. Set  Accuracy
  T-W        X-Means       3  2  All        59.6
  T-W-S      X-Means       3  4  All        49.0
  FED        Farth. First  4  3  Rank       89.4
  PAN12-A/B  K-Means       3  3  All        83.3

Regarding the best performances for each test data set, the results for the manually created data sets from novel literature are generally poor. For example, the best result for the two-author document Twain-Wells is only 59.6%, i.e., the accuracy is only slightly better than the baseline of 50%, which can be achieved by randomly assigning paragraphs into two clusters (in this case X-Means dynamically created two clusters, but the result is still better than that of other algorithms using a fixed number of clusters). On the other hand, the data sets reused from authorship attribution, namely the FED and the PAN12 data sets, achieved very good results with accuracies of about 89% and 83%, respectively. Nevertheless, as the other data sets have been specifically created for the clustering evaluation, those results may be more expressive. Therefore a comparison between clustering and classification approaches is discussed in the following, showing that the latter achieve significantly better results on those data sets when using the same features.
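The accuracies above compare cluster assignments with the known authors, but the text does not spell out how cluster labels are matched to authors. Assumed here is the standard best-match evaluation that picks the cluster-to-author assignment maximizing agreement, computed with the Hungarian method:

```python
# Sketch of a cluster-accuracy computation (an assumption; the paper does
# not describe its exact procedure): find the cluster-to-author assignment
# that maximizes the number of correctly placed paragraphs.

import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(true_authors, cluster_labels):
    authors = sorted(set(true_authors))
    clusters = sorted(set(cluster_labels))
    # contingency[i, j] = number of paragraphs of author i in cluster j
    contingency = np.zeros((len(authors), len(clusters)), dtype=int)
    for a, c in zip(true_authors, cluster_labels):
        contingency[authors.index(a), clusters.index(c)] += 1
    rows, cols = linear_sum_assignment(-contingency)  # maximize matches
    return contingency[rows, cols].sum() / len(true_authors)
```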
4. COMPARISON OF CLUSTERING AND CLASSIFICATION APPROACHES

For the given data sets, any clustering problem can be rewritten as a classification problem, with the exception that the latter needs training data. Although a direct comparison should be treated with caution, it still gives an insight into how the two different approaches perform on the same data sets. Therefore an additional evaluation is shown in the following, which compares the performance of the clustering algorithms to the performance of the following classification algorithms: Naive Bayes classifier [17], Bayes Network using the K2 classifier [8], Large Linear Classification using LibLinear [12], support vector machine using LIBSVM with nu-SVC classification [6], k-nearest-neighbors classifier (kNN) using k = 1 [1], and a pruned C4.5 decision tree (J48) [28]. To compensate for the missing training data, a 10-fold cross-validation has been used for each classifier.

Table 3: Classification Results Compared to the Best Clustering Result for Each pq-Configuration and Data Set in Percent (Max denotes the accuracy of the best clustering algorithm, which is named in the Algorithm column).

(a) Twain-Wells

  p  q  Algorithm     Max   N-Bay  Bay-Net  LibLin  LibSVM  kNN   J48
  2  2  X-Means       57.6  77.8   82.3     85.2    86.9    62.6  85.5
  2  3  X-Means       56.6  79.8   80.8     81.8    83.3    60.6  80.8
  2  4  X-Means       57.6  76.8   79.8     82.2    83.8    58.6  81.0
  3  2  X-Means       59.6  78.8   80.8     81.8    83.6    59.6  80.8
  3  3  X-Means       53.5  76.8   77.8     80.5    82.3    61.6  79.8
  3  4  X-Means       52.5  81.8   79.8     81.8    83.8    63.6  82.0
  4  2  K-Means       52.5  86.9   83.3     83.5    84.3    62.6  81.8
  4  3  X-Means       52.5  79.8   79.8     80.1    80.3    59.6  77.4
  4  4  Farth. First  51.5  72.7   74.7     75.8    77.0    60.6  75.8
  average improvement       24.1   25.0     26.5    27.9    6.2   25.7

(b) Twain-Wells-Shelley

  p  q  Algorithm     Max   N-Bay  Bay-Net  LibLin  LibSVM  kNN   J48
  2  2  K-Means       44.3  67.8   70.8     74.0    75.2    51.0  73.3
  2  3  X-Means       38.3  65.1   67.1     70.7    72.3    48.3  70.2
  2  4  X-Means       45.6  63.1   68.1     70.5    71.8    49.0  69.3
  3  2  X-Means       45.0  51.7   64.1     67.3    68.8    45.6  65.4
  3  3  X-Means       47.0  57.7   64.8     67.3    68.5    47.0  65.9
  3  4  X-Means       49.0  67.8   67.8     70.5    72.5    46.3  68.3
  4  2  X-Means       36.2  61.1   67.1     69.1    69.5    50.3  65.1
  4  3  K-Means       35.6  53.0   63.8     67.6    70.0    47.0  66.6
  4  4  X-Means       35.6  57.7   66.1     68.5    69.3    42.3  66.8
  average improvement       18.7   24.8     27.7    29.0    5.6   26.0

(c) Federalist Papers

  p  q  Algorithm     Max   N-Bay  Bay-Net  LibLin  LibSVM  kNN   J48
  2  2  Farth. First  77.3  81.1   86.4     90.9    84.2    74.2  81.8
  2  3  Farth. First  78.8  85.6   87.4     92.4    89.0    78.8  82.8
  2  4  X-Means       78.8  89.4   92.4     90.9    87.3    89.4  85.9
  3  2  K-Means       81.8  82.6   87.9     92.4    85.5    80.3  83.8
  3  3  K-Means       78.8  92.4   92.4     92.4    86.4    81.8  83.8
  3  4  Farth. First  86.4  84.8   90.9     97.0    85.8    81.8  85.6
  4  2  Farth. First  86.6  81.8   89.4     87.9    83.3    77.3  84.1
  4  3  Farth. First  89.4  85.6   92.4     89.4    85.8    80.3  83.3
  4  4  Farth. First  84.8  86.4   90.9     89.4    85.8    84.8  83.6
  average improvement       3.0    7.5      8.9     3.4     -1.6  1.3

(d) PAN12-A/B

  p  q  Algorithm     Max   N-Bay  Bay-Net  LibLin  LibSVM  kNN    J48
  2  2  K-Means       83.3  83.3   33.3     100.0   100.0   100.0  33.3
  2  3  K-Means       83.3  83.3   33.3     100.0   100.0   100.0  33.3
  2  4  K-Means       83.3  83.4   33.3     100.0   100.0   100.0  33.3
  3  2  K-Means       83.3  75.0   33.3     91.7    91.7    100.0  33.3
  3  3  K-Means       83.3  100.0  33.3     100.0   91.7    100.0  33.3
  3  4  Farth. First  75.0  66.7   33.3     100.0   100.0   91.7   33.3
  4  2  K-Means       83.3  91.7   33.3     91.7    75.0    91.7   33.3
  4  3  K-Means       83.3  75.0   33.3     100.0   75.0    91.7   33.3
  4  4  K-Means       83.3  75.0   33.3     100.0   83.4    83.4   33.3
  average improvement       -0.9   -49.1    15.8    8.4     13.0   -49.1

Table 3 shows the performance of each classifier compared to the best clustering result using the same data and pq-setting. It can be seen that the classifiers significantly outperform the clustering results for the Twain-Wells and Twain-Wells-Shelley documents. The support vector machine framework (LibSVM) and the linear classifier (LibLinear) performed best, reaching a maximum accuracy of nearly 87% for the Twain-Wells document. Moreover, the average improvement is given in the bottom line, showing that most of the classifiers outperform the best clustering result by over 20% on average. Solely the kNN algorithm achieves minor improvements, as it attributed the two-author document with a poor accuracy of only about 60%.

A similar general improvement could be achieved on the three-author document Twain-Wells-Shelley, as can be seen in subtable (b). Again, LibSVM could achieve an accuracy of about 75%, whereas the best clustering configuration could only reach 49%. Except for the kNN algorithm, all classifiers significantly outperform the best clustering results for every configuration.
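As a sketch of the classification setup described above: the paper evaluates a nu-SVC support vector machine (LIBSVM) with 10-fold cross-validation, and scikit-learn's NuSVC wraps the same LIBSVM implementation. X and y stand for the pq-gram feature matrix and the author label per paragraph from the earlier sketches; using them here is an assumption.

```python
# Sketch of the classification counterpart: nu-SVC with 10-fold
# cross-validation, mirroring the setup described in Section 4.

from sklearn.model_selection import cross_val_score
from sklearn.svm import NuSVC

def classification_accuracy(X, y):
    # 10-fold cross-validation compensates for the missing training data
    return cross_val_score(NuSVC(), X, y, cv=10).mean()
```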
Quite different comparison results have been obtained for the Federalist Papers and PAN12 data sets, respectively. Here, the improvements gained by the classifiers are only minor, and in some cases even negative, i.e., the classification algorithms perform worse than the clustering algorithms. A general explanation is the good performance of the clustering algorithms on these data sets, especially when utilizing the Farthest First and K-Means algorithms. In the case of the Federalist Papers data set shown in subtable (c), all algorithms except kNN could achieve at least some improvement. Although the LibLinear classifier could reach an outstanding accuracy of 97%, the global improvement is below 10% for all classifiers. Finally, subtable (d) shows the results for PAN12, where the outcome is quite diverse, as some classifiers could improve on the clusterers significantly, whereas others worsen the accuracy even more drastically. A possible explanation might be the small data set (only the subproblems A and B have been used), which may not be suited very well for a reliable evaluation of the clustering approaches.

Summarizing, the comparison of the different algorithms reveals that in general classification algorithms perform better than clustering algorithms when provided with the same (pq-gram) feature set. Nevertheless, the results of the PAN12 experiment are very diverse and indicate that there might be a problem with the data set itself, and that this comparison should be treated carefully.

5. RELATED WORK

Most of the traditional document clustering approaches are based on occurrences of words, i.e., inverted indices are built and used to group documents. Thereby a unit to be clustered conforms exactly to one document. The main idea is often to compute topically related document clusters and to assist web search engines in providing better results to the user, whereby the proposed algorithms are frequently also patented (e.g. [2]). Regularly applied concepts in the feature extraction phase are the term frequency tf, which measures how often a word occurs in a document, and the term frequency-inverse document frequency tf-idf, which measures the significance of a word compared to the whole document collection. An example of a classical approach using these techniques is published in [21].
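For reference, one standard formulation of these two measures is the following (the cited approaches may use different weighting variants):

```latex
\mathrm{tf}(t,d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}},
\qquad
\text{tf-idf}(t,d,D) = \mathrm{tf}(t,d) \cdot \log \frac{|D|}{|\{d' \in D : t \in d'\}|}
```

Here f_{t,d} is the raw count of term t in document d, and D is the document collection.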
The literature on cluster analysis within a single document to discriminate the authorships in a multi-author document, as is done in this paper, is surprisingly sparse. On the other hand, many approaches exist to separate a document into paragraphs of different topics, which are generally called text segmentation problems. In this domain, the algorithms often perform vocabulary analysis in various forms like word stem repetitions [27] or word frequency models [29], whereby "methods for finding the topic boundaries include sliding window, lexical chains, dynamic programming, agglomerative clustering and divisive clustering" [7]. Despite the given possibility to modify these techniques to also cluster by authors instead of topics, this is rarely done. In the following, some of the existing methods are shortly summarized.

Probably one of the first approaches that uses stylometry to automatically detect boundaries of authors in collaboratively written text is proposed in [13]. Thereby the main intention was not to expose authors or to gain insight into the work distribution, but to provide a methodology for collaborative authors to equalize their style in order to achieve better readability. To extract the style of separated paragraphs, common stylometric features such as word/sentence lengths, POS tag distributions or frequencies of POS classes at sentence-initial and sentence-final positions are considered. An extensive experiment revealed that stylometric features can be used to find authorship boundaries, but that additional research has to be done in order to increase the accuracy and informativeness.

In [14] the authors also tried to divide a collaborative text into different single-author paragraphs. In contrast to the previously described handmade corpus, a large data set has been computationally created by using (well-written) articles of an internet forum. At first, different neural networks have been utilized using several stylometric features. By using 90% of the data for training, the best network could achieve an F-score of 53% for multi-author documents on the remaining 10% of test data. In a second experiment, only letter-bigram frequencies are used as distinguishing features. Thereby an authorship boundary between paragraphs was marked if the cosine distance exceeded a certain threshold. This method reached an F-score of only 42%, and it is suspected that letter-bigrams are not suitable for the (short) paragraphs used in the evaluation.
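From that description alone, the boundary criterion of [14] can be approximated as follows; this is a reconstruction, not the authors' code, and the threshold value is a placeholder.

```python
# Approximation of the boundary criterion described for [14]: adjacent
# paragraphs are separated by an authorship boundary when the cosine
# distance between their letter-bigram frequency vectors exceeds a
# threshold. The threshold here is illustrative, not the value from [14].

import math
from collections import Counter

def letter_bigrams(text):
    letters = [c for c in text.lower() if c.isalpha()]
    return Counter(a + b for a, b in zip(letters, letters[1:]))

def cosine_distance(u, v):
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return 1.0 - dot / norm if norm else 1.0

def authorship_boundary(par_a, par_b, threshold=0.3):
    return cosine_distance(letter_bigrams(par_a),
                           letter_bigrams(par_b)) > threshold
```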
A two-stage process to cluster Hebrew Bible texts by authorship is proposed in [20]. Because a first attempt to represent chapters only by bag-of-words led to negative results, the authors additionally incorporated sets of synonyms (which could be generated by comparing the original Hebrew texts with an English translation). With a modified cosine measure comparing these sets for given chapters, two core clusters are compiled by using the ncut algorithm [10]. In the second step, the resulting clusters are used as training data for a support vector machine, which finally assigns every chapter to one of the two core clusters by using the simple bag-of-words features tested earlier. Thereby it can be the case that units originally assigned to one cluster are moved to the other one, depending on the prediction of the support vector machine. With this two-stage approach the authors report a good accuracy of about 80%, whereby it should be considered that the number of potential authors has been fixed to two in the experiment. Nevertheless, the authors state that their approach could be extended for more authors with little effort.

6. CONCLUSION AND FUTURE WORK

In this paper, the automatic creation of paragraph clusters based on the grammar of authors has been evaluated. Different state-of-the-art clustering algorithms have been utilized with different input features and tested on different data sets. The best working algorithm, K-Means, could achieve an accuracy of about 63% over all test sets, whereby good individual results of up to 89% could be reached for some configurations. On the contrary, the specifically created documents incorporating two and three authors could only be clustered with a maximum accuracy of 59%.

A comparison between clustering and classification algorithms using the same input features has been implemented. Disregarding the missing training data, it could be observed that classifiers generally produce higher accuracies, with improvements of up to 29%. On the other hand, some classifiers perform worse on average than clustering algorithms over individual data sets when using some pq-gram configurations. Nevertheless, if the maximum accuracy for each algorithm is considered, all classifiers perform significantly better, as can be seen in Figure 3, where the best performances of all utilized classification and clustering algorithms are illustrated. The linear classification algorithm LibLinear could reach nearly 88%, outperforming K-Means by 25% over all data sets.

Figure 3: Best Evaluation Results Over All Data Sets For All Utilized Clustering and Classification Algorithms. (bar chart omitted)

Finally, the best classification and clustering results for each data set are shown in Figure 4. Again the classifiers achieve higher accuracies, whereby the PAN12 subsets could be classified 100% correctly. As can be seen, a major improvement can be gained for the novel literature documents. For example, the best classifier reached 87% on the Twain-Wells document, whereas the best clustering approach achieved only 59%.

Figure 4: Best Clustering and Classification Results For Each Data Set. (bar chart omitted)

As shown in this paper, paragraphs of documents can be split and clustered based on grammar features, but the accuracy is below that of classification algorithms. Although the two algorithm types should not be compared directly as they are designed to manage different problems, the significant differences in accuracy indicate that classifiers can handle the grammar features better. Nevertheless, future work should focus on evaluating the same features on larger data sets, as clustering algorithms may produce better results with an increasing amount of sample data.

Another possible application could be the creation of whole document clusters, where documents with similar grammar are grouped together. Despite the fact that such huge clusters are very difficult to evaluate - due to the lack of ground truth data - a navigation through thousands of documents based on grammar may be interesting, as has been done for music genres (e.g. [30]) or images (e.g. [11]). Moreover, grammar clusters may also be utilized for modern recommendation algorithms once they have been calculated for large data sets. For example, by analyzing all freely available books from libraries like Project Gutenberg, a system could recommend other books with a similar style based on the user's reading history. Also, an enhancement of current commercial recommender systems that are used in large online stores like Amazon is conceivable.

7. REFERENCES

[1] D. Aha and D. Kibler. Instance-Based Learning Algorithms. Machine Learning, 6:37–66, 1991.
[2] C. Apte, S. M. Weiss, and B. F. White. Lightweight Document Clustering, Nov. 25 2003. US Patent 6,654,739.
[3] D. Arthur and S. Vassilvitskii. K-means++: The Advantages of Careful Seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '07, pages 1027–1035, Philadelphia, PA, USA, 2007. Society for Industrial and Applied Mathematics.
[4] N. Augsten, M. Böhlen, and J. Gamper. The pq-Gram Distance between Ordered Labeled Trees. ACM Transactions on Database Systems (TODS), 2010.
[5] T. Caliński and J. Harabasz. A Dendrite Method for Cluster Analysis. Communications in Statistics - Theory and Methods, 3(1):1–27, 1974.
[6] C.-C. Chang and C.-J. Lin. LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.
[7] F. Y. Choi. Advances in Domain Independent Linear Text Segmentation. In Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference, pages 26–33. Association for Computational Linguistics, 2000.
[8] G. F. Cooper and E. Herskovits. A Bayesian Method for the Induction of Probabilistic Networks From Data. Machine Learning, 9(4):309–347, 1992.
[9] S. Dasgupta. Performance Guarantees for Hierarchical Clustering. In Computational Learning Theory, pages 351–363. Springer, 2002.
[10] I. S. Dhillon, Y. Guan, and B. Kulis. Kernel k-means: Spectral Clustering and Normalized Cuts. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 551–556. ACM, 2004.
[11] A. Faktor and M. Irani. "Clustering by Composition" - Unsupervised Discovery of Image Categories. In Computer Vision - ECCV 2012, pages 474–487. Springer, 2012.
[12] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A Library for Large Linear Classification. The Journal of Machine Learning Research, 9:1871–1874, 2008.
[13] A. Glover and G. Hirst. Detecting Stylistic Inconsistencies in Collaborative Writing. In The New Writing Environment, pages 147–168. Springer, 1996.
[14] N. Graham, G. Hirst, and B. Marthi. Segmenting Documents by Stylistic Character. Natural Language Engineering, 11(4):397–415, 2005.
[15] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The WEKA Data Mining Software: An Update. ACM SIGKDD Explorations Newsletter, 11(1):10–18, 2009.
[16] A. Hotho, S. Staab, and G. Stumme. Ontologies Improve Text Document Clustering. In Proceedings of the Third IEEE International Conference on Data Mining (ICDM 2003), pages 541–544. IEEE, 2003.
[17] G. H. John and P. Langley. Estimating Continuous Distributions in Bayesian Classifiers. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pages 338–345. Morgan Kaufmann Publishers Inc., 1995.
[18] P. Juola. An Overview of the Traditional Authorship Attribution Subtask. In CLEF (Online Working Notes/Labs/Workshop), 2012.
[19] D. Klein and C. D. Manning. Accurate Unlexicalized Parsing. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1, ACL '03, pages 423–430, Stroudsburg, PA, USA, 2003.
[20] M. Koppel, N. Akiva, I. Dershowitz, and N. Dershowitz. Unsupervised Decomposition of a Document into Authorial Components. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT '11, pages 1356–1364, Stroudsburg, PA, USA, 2011.
[21] B. Larsen and C. Aone. Fast and Effective Text Mining Using Linear-Time Document Clustering. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 16–22. ACM, 1999.
[22] Y. Li, S. M. Chung, and J. D. Holt. Text Document Clustering Based on Frequent Word Meaning Sequences. Data & Knowledge Engineering, 64(1):381–404, 2008.
[23] M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini. Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, 19:313–330, June 1993.
[24] F. Mosteller and D. Wallace. Inference and Disputed Authorship: The Federalist. Addison-Wesley, 1964.
[25] F. Murtagh. A Survey of Recent Advances in Hierarchical Clustering Algorithms. The Computer Journal, 26(4):354–359, 1983.
[26] D. Pelleg, A. W. Moore, et al. X-means: Extending K-means with Efficient Estimation of the Number of Clusters. In ICML, pages 727–734, 2000.
[27] J. M. Ponte and W. B. Croft. Text Segmentation by Topic. In Research and Advanced Technology for Digital Libraries, pages 113–125. Springer, 1997.
[28] J. R. Quinlan. C4.5: Programs for Machine Learning, volume 1. Morgan Kaufmann, 1993.
[29] J. C. Reynar. Statistical Models for Topic Segmentation. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 357–364, 1999.
[30] N. Scaringella, G. Zoia, and D. Mlynek. Automatic Genre Classification of Music Content: A Survey. IEEE Signal Processing Magazine, 23(2):133–141, 2006.
[31] M. Tschuggnall and G. Specht. Using Grammar-Profiles to Intrinsically Expose Plagiarism in Text Documents. In Proceedings of the 18th Conference on Natural Language Processing and Information Systems (NLDB), pages 297–302, 2013.
[32] M. Tschuggnall and G. Specht. Enhancing Authorship Attribution By Utilizing Syntax Tree Profiles. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 195–199, 2014.
[33] O. Zamir and O. Etzioni. Web Document Clustering: A Feasibility Demonstration. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 46–54. ACM, 1998.
[34] D. Zou, W.-J. Long, and Z. Ling. A Cluster-Based Plagiarism Detection Method. In Notebook Papers of CLEF 2010 Labs and Workshops, 22-23 September, 2010.