Author Diarization Using a Cluster-Distance Approach
Notebook for PAN at CLEF 2016

Abdul Sittar, Hafiz Rizwan Iqbal, and Rao Muhammad Adeel Nawab

COMSATS Institute of Information Technology, M.A. Jinnah Campus, Off-Raiwind Road, Lahore, Pakistan
abdulsittar72@yahoo.com, {rizwan.iqbal,adeelnawab}@ciitlahore.edu.pk

Abstract. Author diarization is a new task introduced at PAN'16: identifying the portion(s) of text written by each author within a document produced by multiple authors. This paper presents our proposed approach to the author diarization task. Stylistic features, in particular lexical features, were used to uniquely identify an author, and the ClustDist method was used to find anomalous text within a single document. Clusters were then generated with the simple k-means clustering algorithm. Experiments were performed on both the training and the testing data sets. We observed that promising results can be achieved by varying the length of the text fragments.

1 Introduction

Plagiarism detection is the task of determining text reuse in a document or a collection of documents. It comes in two forms: 1) extrinsic plagiarism detection focuses on finding the source documents that were used to generate a suspicious document; it is a complex task because reuse may occur at various levels, ranging from words and sentences to paraphrased text and plagiarism of ideas; 2) intrinsic plagiarism detection, or author diarization, focuses on identifying whether a document was written by a single author or by multiple authors.

The term author diarization comes from the domain of speaker diarization, which clusters and identifies the different speakers in a single audio speech signal by analysing the frequency range of the speakers' voices, e.g. a class discussion on a particular topic or a conference conversation on mobile phones. Author diarization, in contrast, deals with written documents instead of audio conversations [4]. Such documents may result from collaborative work or from plagiarism. The objective is to identify and cluster the different authors within a single document on the basis of the written text. In the literature, stylometric features such as function words, part-of-speech tags, spelling mistakes, average sentence length and average word length have been used for the author diarization task [7]. David Guthrie [2] introduced the term authorship anomalies for portions of text that deviate from the surrounding context because they were written by another author.

PAN (http://pan.webis.de/, last visited 25-05-2016) is a competition held as part of the CLEF conference. The PAN'16 [8] competition comprises three tasks: Author Identification, Author Profiling and Author Obfuscation. Author Identification is further divided into author clustering and author diarization. The author diarization task is to identify and cluster the portions of text belonging to each individual author of a given document. To explore different variants of this parent task, it is divided into three sub-problems: 1) traditional intrinsic plagiarism detection, 2) diarization with a given number (n) of authors, and 3) diarization with an unknown number of authors. Participants had to develop software for the given task and deploy it on TIRA, an engine for the evaluation of software [5]. All training and testing documents are provided in English only.

For each subtask, all text files were read from the respective training data set. Each file was split into sentences, and for each sentence 15 lexical features (see Table 1) were computed.
Using the ClustDist [2] anomaly detection technique, a feature vector was then generated that contains the average distance of each sentence from all other sentences. On the basis of these distances, training models were created using WEKA and saved for use at prediction time. Our software was deployed on TIRA and evaluated on the testing corpora provided by PAN, producing promising results on both the training and the testing corpora.

The rest of this paper is organized as follows: the methodology is explained in Section 2, while the experimental setup and the results of the training and testing phases are discussed in Sections 3 and 4 respectively. Section 5 presents conclusions and future work.

2 Proposed Approach

A wide variety of approaches to authorship identification have been reported in the literature [7, 2], including cluster distance, word counting, stylometric features, syntactic features, lexical features, and content-specific and content-free features.

2.1 Lexical Features

A language-independent approach to author identification considers any text (i.e. sentences, paragraphs, documents) as a sequence of tokens, where a token can be a word, a character, a punctuation mark or a number. On the basis of these tokens, various statistics (e.g. average word length, average sentence length, character count) can be drawn from any text in any language [7]. Table 1 shows the lexical features used in our software to capture the unique writing style of an author.

Table 1. Lexical Features

No.  Feature Name
1    Characters Count
2    Digits Count
3    Uppercase Letters Count
4    Spaces Count
5    Letters Count
6    Tabs Count
7    Words Count
8    Ratio of Interrogative Sentences
9    Average Word Length
10   Average Sentence Length
11   Ratio of Digits
12   Ratio of Uppercase Letters
13   Ratio of Spaces
14   Ratio of Letters
15   Ratio of Tabs
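To make the feature set concrete, the following Python sketch computes a subset of the features in Table 1 for a single sentence. It is only an illustration under our own assumptions: the whitespace tokenizer, the dictionary keys and the exact ratio definitions (here relative to the character count) are ours, not necessarily those of the submitted software, and the document-level features (average sentence length, ratio of interrogative sentences) are omitted.

```python
def lexical_features(sentence):
    """Compute a subset of the Table 1 lexical features for one sentence.

    Tokenization and ratio definitions are illustrative assumptions; they may
    differ from the implementation used in the submitted software.
    """
    n_chars = len(sentence)
    words = sentence.split()                       # naive whitespace tokenizer
    n_digits = sum(c.isdigit() for c in sentence)
    n_upper = sum(c.isupper() for c in sentence)
    n_spaces = sentence.count(" ")
    n_letters = sum(c.isalpha() for c in sentence)
    n_tabs = sentence.count("\t")
    n_words = len(words)
    avg_word_len = sum(len(w) for w in words) / n_words if n_words else 0.0

    return {
        "characters_count": n_chars,
        "digits_count": n_digits,
        "uppercase_count": n_upper,
        "spaces_count": n_spaces,
        "letters_count": n_letters,
        "tabs_count": n_tabs,
        "words_count": n_words,
        "average_word_length": avg_word_len,
        "ratio_digits": n_digits / n_chars if n_chars else 0.0,
        "ratio_uppercase": n_upper / n_chars if n_chars else 0.0,
        "ratio_spaces": n_spaces / n_chars if n_chars else 0.0,
        "ratio_letters": n_letters / n_chars if n_chars else 0.0,
        "ratio_tabs": n_tabs / n_chars if n_chars else 0.0,
    }
```

Applying such a function to every sentence of a document yields the rows of the feature matrix V used in Section 2.2.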
2.2 Clustering using ClustDist

ClustDist [2] is a straightforward technique that computes the average distance from one piece of text (here, a sentence) to all other pieces of text. Consider a document D with n sentences. First, each sentence i is characterized by computing the p lexical features of Table 1, giving a feature vector Vi. For our experiments, a matrix V of order n × p was created, in which each row is the feature vector of one sentence. ClustDist is computed using Equation 1, where d is the distance between a pair of vectors:

    ClustDist(x, V) = ( \sum_{k=1}^{n} d(x, v_k) ) / n        (1)

The resulting score of each sentence generates a ranking that describes how different a sentence is from all other sentences in the given document. To generate clusters, we used the simple k-means algorithm [3]. A detailed, step-by-step example of the proposed approach is given below.

Example: Step-by-Step Author Diarization with the ClustDist Approach

This example demonstrates the author diarization process from an input text to the output clusters.

– Step 1: Read Raw Input Text
"In what follows, we give a detailed overview of Barack Obama's Family. We shed light on himself, his immediate and extended family, including maternal and paternal relations. Moreover, we give insights into the relations of Michelle Obama, Barack Obama's wife, as well as some distant relations of both. Barack Obama Barack Hussein Obama II is the 44th and current President of the United States. He is the first African American to hold the office. Obama was the junior United States Senator from Illinois from 2005 until he resigned following his election to the presidency. Obama is a graduate of Columbia University and Harvard Law School."

– Step 2: Break the Text Down into Sentences
1. In what follows, we give a detailed overview of Barack Obama's Family.
2. We shed light on himself, his immediate and extended family, including maternal and paternal relations.
3. Moreover, we give insights into the relations of Michelle Obama, Barack Obama's wife, as well as some distant relations of both.
4. Barack Obama Barack Hussein Obama II is the 44th and current President of the United States.
5. He is the first African American to hold the office.
6. Obama was the junior United States Senator from Illinois from 2005 until he resigned following his election to the presidency.
7. Obama is a graduate of Columbia University and Harvard Law School.

– Step 3: Lexical Feature Computation
Table 2 shows the lexical feature computations. For each sentence, all of the features of Table 1 are computed by our software; for the sake of demonstration only four features are shown here.

Table 2. Lexical Feature Computations

Sentence  Letters Ratio  Average Word Length  Spaces Ratio  Character Count
1         0.91           4.35                 0.19          61
2         0.97           5.68                 0.16          91
3         0.93           4.73                 0.19          109
4         0.92           4.93                 0.18          79
5         0.91           4.09                 0.22          45
6         0.92           5.19                 0.18          109
7         0.94           4.83                 0.18          58

– Step 4: Distance Calculation
Table 3 shows the distance of each sentence from every other sentence in the text.

Table 3. Distance Calculations

Sentence   1      2      3      4      5      6      7
1          0      30     48     18.0   16     48     3.03
2          30.0   0      18.02  12.02  46.02  18.0   33.01
3          48     18.02  0      30     64     0.46   51
4          18.0   12.02  30.0   0      34.01  30.0   21
5          16     46.02  64     34.01  0      64.0   13.02
6          48     18.0   0.46   30.0   64.0   0      51.0
7          3.03   33.01  51.0   21.0   13.02  51.0   0

– Step 5: ClustDist Computation
The ClustDist score of each sentence is as follows (for simplicity, the scores below are the raw sums of the distances in Table 3; dividing by n = 7 as in Equation 1 would not change the ranking or the resulting clusters):
ClustDist(1) = 0 + 30 + 48 + 18 + 16 + 48 + 3.03 = 163.03
ClustDist(2) = 30 + 0 + 18.02 + 12.02 + 46.02 + 18 + 33.01 = 157.07
ClustDist(3) = 48 + 18.02 + 0 + 30 + 64 + 0.46 + 51 = 211.48
ClustDist(4) = 18.0 + 12.02 + 30.0 + 0 + 34.01 + 30.0 + 21 = 145.03
ClustDist(5) = 16 + 46.02 + 64 + 34.01 + 0 + 64.0 + 13.02 = 237.05
ClustDist(6) = 48 + 18 + 0.46 + 30 + 64 + 0 + 51 = 211.46
ClustDist(7) = 3.03 + 33.01 + 51 + 21 + 13.02 + 51 + 0 = 172.06

– Step 6: Cluster Generation
On the basis of the ClustDist scores, we applied the simple k-means clustering algorithm and obtained the following clusters:
Cluster 1: [237.05, 211.46, 211.48]
Cluster 2: [163.03, 157.07, 172.06]
Cluster 3: [145.03]
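The following Python sketch reproduces Steps 5 and 6 on the distance matrix of Table 3: it computes the ClustDist score of every sentence and then clusters the one-dimensional scores with k-means. It is a minimal illustration rather than the submitted software: scikit-learn's KMeans stands in for WEKA's simple k-means, the distances are copied from Table 3, and, as in Step 5, the raw sums of distances are used. Note that the final grouping depends on the k-means initialization and may therefore differ from the clusters shown in Step 6.

```python
import numpy as np
from sklearn.cluster import KMeans  # stand-in for WEKA's simple k-means

# Pairwise sentence distances copied from Table 3 (7 sentences).
D = np.array([
    [ 0.00, 30.00, 48.00, 18.00, 16.00, 48.00,  3.03],
    [30.00,  0.00, 18.02, 12.02, 46.02, 18.00, 33.01],
    [48.00, 18.02,  0.00, 30.00, 64.00,  0.46, 51.00],
    [18.00, 12.02, 30.00,  0.00, 34.01, 30.00, 21.00],
    [16.00, 46.02, 64.00, 34.01,  0.00, 64.00, 13.02],
    [48.00, 18.00,  0.46, 30.00, 64.00,  0.00, 51.00],
    [ 3.03, 33.01, 51.00, 21.00, 13.02, 51.00,  0.00],
])

# Step 5: ClustDist score of each sentence (row sums, as in the example;
# use D.mean(axis=1) for the averaged form of Equation 1).
clust_dist = D.sum(axis=1)
print(clust_dist)  # approximately [163.03 157.07 211.48 145.03 237.05 211.46 172.06]

# Step 6: cluster the one-dimensional scores with k-means (k = 3 in this example).
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(
    clust_dist.reshape(-1, 1))
for cluster_id in sorted(set(labels)):
    members = [round(float(s), 2) for s, l in zip(clust_dist, labels) if l == cluster_id]
    print(f"Cluster {cluster_id + 1}: {members}")
```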
3 Experimental Setup

This section provides a detailed insight into the experimental environment and the setup used for the development and evaluation of the author diarization software.

3.1 Fabrication: Training and Classification Models

For each of the three subtasks (see Section 1), separate training data sets are provided in PAN'16. Each training problem consists of three files: a ".txt" file containing the actual text written by multiple authors, a ".meta" file containing a JSON object that specifies the language of the text, the problem type (plagiarism or diarization) and the number of authors, and a ".truth" file that provides the expected output for the given text file. To generate the trained models, all text files were read from the corpus in sequence and each file was split into sentences. All of the lexical features (see Section 2.1) were computed for each sentence, and a feature vector containing the distances of this sentence from all other sentences was created using the ClustDist technique (see Section 2.2). WEKA (http://www.cs.waikato.ac.nz/ml/weka/, last visited 25-05-2016), a machine learning toolkit, was used to generate and save the training models by producing an ".arff" file from the resulting distance vector. For cluster generation, we used the simple k-means algorithm.

3.2 Software Evaluation

PAN'16 also provided a data set for evaluating the developed software; this data set is not publicly available and will only be released after the competition. After training our software on the training data set, we deployed it on TIRA and ran the evaluation there. In this phase the software takes its input files from the testing corpus. For each input file, a "test.arff" file is generated from the ClustDist feature vector. Finally, clusters are produced for each subtask as required by PAN'16: for subtask 1, only two clusters are generated, one for the main author and one for the remaining authors; for subtask 2, clusters are created according to the number given in the ".meta" file; and for subtask 3, a random number of clusters is created, because the number of authors is not given. As a final step, the clusters generated from the evaluation corpus are compared with the existing trained models to predict new instances.

3.3 Evaluation Measures

PAN'16 recommends different evaluation measures for subtasks 1, 2 and 3. For subtask 1, micro-recall, micro-precision, macro-recall, macro-precision, micro-F and macro-F [6] are used to evaluate the performance of the system. For subtasks 2 and 3, BCubed recall, BCubed precision and BCubed F-measure are used [1].
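As an illustration of the pipeline described in Sections 3.1 and 3.2, the sketch below reads one problem document, computes a ClustDist score per sentence from a small set of stand-in lexical features, and writes the scores to an ".arff" file that WEKA can load. The file names, the naive sentence splitter, the reduced feature set and the use of Euclidean distance are assumptions made for this example; only the general ARFF layout (relation, attribute and data sections) follows WEKA's standard format, and the attribute layout used by the submitted software may differ.

```python
import numpy as np

def clustdist_scores(sentences):
    """ClustDist score per sentence: mean Euclidean distance between its
    feature vector and the feature vectors of all sentences (Equation 1)."""
    # Reduced stand-in feature set; the full system uses the 15 features of Table 1.
    feats = np.array([
        [len(s),
         sum(c.isalpha() for c in s) / max(len(s), 1),
         sum(len(w) for w in s.split()) / max(len(s.split()), 1)]
        for s in sentences
    ])
    dists = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
    return dists.mean(axis=1)

def write_arff(scores, path):
    """Write the ClustDist scores as a single numeric attribute in ARFF format."""
    with open(path, "w") as f:
        f.write("@RELATION clustdist\n\n@ATTRIBUTE clustdist NUMERIC\n\n@DATA\n")
        for s in scores:
            f.write(f"{s:.4f}\n")

if __name__ == "__main__":
    # Hypothetical file names; each PAN'16 problem provides a .txt file with the
    # document, a .meta file (JSON) and a .truth file with the expected output.
    with open("problem-1.txt", encoding="utf-8") as f:
        text = f.read()
    sentences = [s.strip() for s in text.split(".") if s.strip()]  # naive splitter
    write_arff(clustdist_scores(sentences), "train.arff")
```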
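The BCubed measures used for subtasks 2 and 3 can also be made concrete for the special case of flat, non-overlapping clusterings. In the sketch below, which follows the definitions of Amigó et al. [1] for this special case, the precision of an item is the fraction of items in its predicted cluster that share its true author, its recall is the fraction of items of its true author that fall into its predicted cluster, and both are averaged over all items; the F-measure is their harmonic mean. This is provided only to illustrate the definitions; the reported results in Section 4 come from the PAN evaluation run on TIRA.

```python
from collections import Counter

def bcubed(gold, pred):
    """BCubed precision, recall and F for flat (non-overlapping) clusterings.

    gold and pred are equal-length sequences of labels, one per text unit."""
    gold_sizes = Counter(gold)
    pred_sizes = Counter(pred)
    pair_counts = Counter(zip(gold, pred))  # items sharing both labels

    precision = recall = 0.0
    for g, p in zip(gold, pred):
        correct = pair_counts[(g, p)]  # same predicted cluster and same true author
        precision += correct / pred_sizes[p]
        recall += correct / gold_sizes[g]
    precision /= len(gold)
    recall /= len(gold)
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

# Toy example: three units by author A and two by author B, clustered into 1 and 2.
print(bcubed(["A", "A", "A", "B", "B"], [1, 1, 2, 2, 2]))
```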
4 Results and Analysis

This section presents the results obtained on the training and testing data. For each of the three subtasks, the size of the text fragments was increased to obtain better results; the experiments showed an improvement up to a certain fragment size. Section 4.1 discusses the results on the training data and Section 4.2 the results on the testing data for each of the three subtasks.

4.1 Results: Training Phase

For subtask 1, the results are shown in Table 4. The highest results for all evaluation measures are obtained for sentences of length 7. Table 5 and Table 6 show the results for subtask 2 and subtask 3 respectively. Sentences of length 5 gave the best results for subtask 2, and for subtask 3 sentences of the same length also yield the highest values of the required evaluation measures.

4.2 Results: Testing Phase

Based on the results on the training data set, we used only those sentence lengths that gave the highest results for each subtask, because the training and testing data sets contain very similar types of text. Therefore, for subtask 1 we used sentence length 7, and for subtasks 2 and 3 we used sentence length 5. Table 7 shows the results for subtask 1, and the results for subtasks 2 and 3 are shown in Table 8.

Table 4. Training Data Set: Subtask 1 Results

Sentence Length  Micro-Recall  Micro-Precision  Micro-F  Macro-Recall  Macro-Precision  Macro-F
2                0.1338        0.2006           0.1605   0.1216        0.2006           0.1514
3                0.1045        0.1828           0.1330   0.1109        0.1823           0.1379
4                0.1291        0.2492           0.1701   0.1178        0.2492           0.1600
5                0.1392        0.2599           0.1813   0.1337        0.2599           0.1766
6                0.1461        0.2572           0.1864   0.1421        0.2572           0.1830
7                0.1493        0.2655           0.1911   0.1648        0.2664           0.2036
8                0.1130        0.1998           0.1444   0.1129        0.1995           0.1442
9                0.1280        0.2323           0.1651   0.1304        0.2322           0.1670
10               0.1045        0.1828           0.1330   0.1109        0.1823           0.1379
11               0.1379        0.2547           0.1875   0.1481        0.2523           0.1866
12               0.1165        0.2315           0.1550   0.1103        0.2307           0.1492
15               0.1301        0.2573           0.1728   0.1242        0.2565           0.1674

Table 5. Training Data Set: Subtask 2 Results

Sentence Length  BCubed-Recall  BCubed-Precision  BCubed-F
5                0.4823         0.2861            0.3591
10               0.5951         0.1315            0.2154
12               0.6143         0.1080            0.1838
13               0.6260         0.1051            0.1800
14               0.6376         0.0887            0.1558

Table 6. Training Data Set: Subtask 3 Results

Sentence Length  BCubed-Recall  BCubed-Precision  BCubed-F
5                0.5464         0.2822            0.3722
10               0.6253         0.1339            0.2206
12               0.6386         0.1076            0.1842

Table 7. Testing Data Set: Subtask 1 Results

Sentence Length  Micro-Recall  Micro-Precision  Micro-F  Macro-Recall  Macro-Precision  Macro-F
7                0.0672        0.1427           0.0914   0.0951        0.1427           0.1141

Table 8. Testing Data Set: Subtasks 2 and 3 Results

Sentence Length  Task  BCubed-Recall  BCubed-Precision  BCubed-F
5                2     0.4700         0.2791            0.3502
5                3     0.4676         0.3140            0.3757

5 Conclusion

In this paper we have described our participation in the PAN'16 author diarization task. We developed software for all three subtasks. Our proposed approach is based on a language-independent technique, lexical features, to uniquely identify an author from his or her written text. Fifteen lexical features were used in combination with the ClustDist method, which we applied to detect anomalous text within a document. Experiments were performed on the PAN'16 training and testing data sets. Different text fragment sizes were used to improve the results in the training phase, while in the testing phase only the fragment sizes that gave the best results during training were used. We observed that changing the fragment size of the text can improve the results. As future work, content-based, topic-based and stylistic features will be explored in combination with the ClustDist method.

References

1. Amigó, E., Gonzalo, J., Artiles, J., Verdejo, F.: A comparison of extrinsic clustering evaluation metrics based on formal constraints. Information Retrieval 12(4), 461–486 (2009)
2. Guthrie, D.: Unsupervised Detection of Anomalous Text. Ph.D. thesis, University of Sheffield (2008)
3. MacQueen, J., et al.: Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281–297. Oakland, CA, USA (1967)
4. Miro, X.A., Bozonnet, S., Evans, N., Fredouille, C., Friedland, G., Vinyals, O.: Speaker diarization: A review of recent research. IEEE Transactions on Audio, Speech, and Language Processing 20(2), 356–370 (2012)
5. Potthast, M., Gollub, T., Rangel, F., Rosso, P., Stamatatos, E., Stein, B.: Improving the reproducibility of PAN's shared tasks: Plagiarism detection, author identification, and author profiling. In: Kanoulas, E., Lupu, M., Clough, P., Sanderson, M., Hall, M., Hanbury, A., Toms, E. (eds.) Information Access Evaluation meets Multilinguality, Multimodality, and Visualization. 5th International Conference of the CLEF Initiative (CLEF 14),
pp. 268–299. Springer, Berlin Heidelberg New York (Sep 2014)
6. Potthast, M., Stein, B., Barrón-Cedeño, A., Rosso, P.: An evaluation framework for plagiarism detection. In: Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pp. 997–1005. Association for Computational Linguistics (2010)
7. Stamatatos, E.: A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology 60(3), 538–556 (2009)
8. Stamatatos, E., Tschuggnall, M., Verhoeven, B., Daelemans, W., Specht, G., Stein, B., Potthast, M.: Clustering by authorship within and across documents. In: Working Notes Papers of the CLEF 2016 Evaluation Labs. CEUR Workshop Proceedings, CLEF and CEUR-WS.org (Sep 2016)