<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Author Diarization Using Cluster-Distance Approach Notebook for PAN at CLEF 2016</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Abdul Sittar</string-name>
          <email>abdulsittar72@yahoo.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
<string-name>Hafiz Rizwan Iqbal</string-name>
          <email>rizwan.iqbal@ciitlahore.edu.pk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rao Muhammad Adeel Nawab</string-name>
          <email>adeelnawab@ciitlahore.edu.pk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>COMSATS Institute of Information Technology, M.A. Jinnah Campus</institution>
          ,
<addr-line>Off-Raiwind Road, Lahore</addr-line>
          ,
          <country country="PK">Pakistan</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Author Diarization is a new task introduced at PAN'16 to identify the portion(s) of text within a document that were written by different authors. This paper presents our proposed approach for the author diarization task. Various stylistic features, including lexical features, were used to uniquely identify an author. Furthermore, the ClustDist method was used to find anomalous text within a single document. Finally, clusters were generated using the simple k-means clustering algorithm. Experiments were performed on both the training and testing data sets. We observed that promising results can be achieved by varying the length of the text fragments.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
<p>Plagiarism detection is the task of determining text reuse in a document or a collection of documents. It comes in two types: 1) Extrinsic Plagiarism Detection, which focuses on finding the source documents that were used to generate a suspicious document. It is a complex task because reuse may occur at various levels, ranging from words to sentences, paraphrasing of text and plagiarism of ideas; and 2) Intrinsic Plagiarism Detection, or Author Diarization, which focuses on identifying whether a document was written by a single author or by multiple authors.</p>
      <p>
        The term Author Diarization comes from the domain of speaker diarization, which focuses on clustering and identifying the various speakers in a single audio speech signal by analyzing the frequency range of the speakers' voices, e.g. a class discussion on a particular topic or a conference call on mobile phones. Similarly, author diarization deals with a written document instead of an audio conversation [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Such documents may result from collaborative work or from plagiarism. The objective is to identify and cluster the different authors within a single document on the basis of the written text.
      </p>
      <p>
        In the literature, stylometric features such as function words, part-of-speech tags, spelling mistakes, average sentence length and average word length have been used for the Author Diarization task [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. David Guthrie [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] introduced the term authorship anomalies to denote portions of text that deviate from the original context, as if written by another author.
      </p>
      <p>
        PAN1 is a competition held as part of the CLEF conference. The PAN'16 [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] competition comprises three different tasks, namely Author Identification, Author Profiling and Author Obfuscation. Author Identification is further divided into Author Clustering and Author Diarization. The Author Diarization task is to identify and cluster the portions of text written by each individual author in a given document. Moreover, to explore different variants of the parent task, it is divided into three sub-problems: 1) traditional intrinsic plagiarism detection, 2) diarization with a given number (n) of authors and 3) diarization with an unknown number of authors. Participants had to develop a software for the given task and deploy it on TIRA (an engine for the evaluation of software) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. All training and testing documents are provided in English only.
      </p>
      <p>
        For each subtask, all text files were read from the respective training data set. Each file's text was split into sentences, and 15 lexical features (see Table 1) were computed for each sentence. Using the ClustDist [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] anomaly detection technique, a feature vector was generated which contains the average distance of each sentence from all the others. On the basis of these distances, training models were created using WEKA and saved for use at prediction time. Our software was deployed on TIRA and evaluated on the testing corpora provided by PAN; it produced promising results on both the training and testing corpora.
      </p>
      <p>The rest of this paper is organized as follows: the methodology is explained in Section 2, the experimental setup and the results of the training and testing phases are discussed in Sections 3 and 4 respectively, and Section 5 provides the conclusion and future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Proposed Approach</title>
      <p>
        A wide variety of algorithms has been reported in the literature for authorship identification [
        <xref ref-type="bibr" rid="ref2 ref7">7, 2</xref>
        ], including cluster distance, counting words, stylometric features, syntactic features, lexical features, and content-specific and content-free features.
      </p>
      <sec id="sec-2-1">
        <title>Lexical Features</title>
        <p>
          A language-independent approach to author identification considers any text (i.e. sentences, paragraphs, documents) as a sequence of tokens2. On the basis of these tokens, various statistics (e.g. average word length, average sentence length, character count) can be drawn from any text in any language [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Table 1 shows the lexical features that our software uses to capture the unique writing style of an author.
1 http://pan.webis.de/ Last visited: 25-05-2016
2 A token could be a word, character, punctuation mark or number.
ClustDist [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] is a straightforward technique to compute the average distance from one piece of text (i.e. a sentence) to all other pieces of text. Consider a document D with n sentences. First, each sentence i is characterized by computing the p lexical features (see Table 1), generating a feature vector Vi for that sentence. For our experiments, a matrix V of order n x p was created, in which each row is the feature vector of one sentence. ClustDist is computed using Equation 1, where d is the distance between any pair of vectors. The resulting score for each sentence generates a ranking describing how different the sentence is from all other sentences in the given document. To generate clusters, we used the simple k-means algorithm [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. For a detailed insight into the proposed approach, see Section 2.2.
        </p>
        <p>ClustDist(x, V) = (1/n) ∑k d(x, vk), where the sum runs over the n feature vectors vk of V (1)</p>
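        <p>The ClustDist computation in Equation 1 can be sketched with NumPy as follows; Euclidean distance is assumed for d, since the description leaves the choice of distance open:

```python
import numpy as np

def clust_dist(V):
    """For each row (sentence feature vector) of V, return the average
    Euclidean distance to all rows, as in Equation 1."""
    n = V.shape[0]
    # Pairwise differences via broadcasting, then Euclidean distances.
    diffs = V[:, None, :] - V[None, :, :]
    d = np.sqrt((diffs ** 2).sum(axis=2))
    # Average each sentence's distance to the others (self-distance is 0).
    return d.sum(axis=1) / n

# Toy matrix: 3 sentences, 2 features each.
V = np.array([[0.0, 0.0], [3.0, 4.0], [0.0, 0.0]])
scores = clust_dist(V)
# The middle sentence is farthest from the rest on average.
assert scores[1] == max(scores)
```

The resulting scores give the ranking described above: the higher a sentence's score, the more it deviates from the rest of the document.</p>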
      </sec>
      <sec id="sec-2-2">
        <title>Example: Step-by-Step Author Diarization by ClustDist Approach</title>
        <p>This example demonstrates author diarization process from an input text to
output clusters.</p>
        <p>- Step 1: Read Raw Input Text
"In what follows, we give a detailed overview of Barack Obama's Family. We shed light on himself, his immediate and extended family, including maternal and paternal relations. Moreover, we give insights into the relations of Michelle Obama, Barack Obama's wife, as well as some distant relations of both. Barack Obama Barack Hussein Obama II is the 44th and current President of the United States. He is the first African American to hold the office. Obama was the junior United States Senator from Illinois from 2005 until he resigned following his election to the presidency. Obama is a graduate of Columbia University and Harvard Law School."
- Step 2: Break Down Text into Sentences
1. In what follows, we give a detailed overview of Barack Obama's Family.
2. We shed light on himself, his immediate and extended family, including maternal and paternal relations.
3. Moreover, we give insights into the relations of Michelle Obama Barack</p>
        <p>Obama's wife, as well as some distant relations of both.
4. Barack Obama Barack Hussein Obama II is the 44th and current President of the United States.
5. He is the first African American to hold the office.
6. Obama was the junior United States Senator from Illinois from 2005 until he resigned following his election to the presidency.</p>
        <p>7. Obama is a graduate of Columbia University and Harvard Law School.
- Step 3: Lexical Features Computation
- Step 4: Distance Calculation
For each of the three subtasks (see Section 1), different training data sets are provided in PAN'16. Each training data set contains three files: a ".txt" file containing the actual text written by multiple authors, a ".meta" file containing a JSON object which gives the text language, the problem type (plagiarism or diarization) and the number of authors, and a ".truth" file which provides the required output for the given text file.</p>
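        <p>Steps 2 and 3 above can be sketched as follows. The naive punctuation-based sentence splitter and the four feature names below are illustrative stand-ins, not the paper's actual splitter or its 15 features from Table 1:

```python
import re

def lexical_features(sentence):
    """Compute a few illustrative lexical features for one sentence
    (stand-ins for the paper's 15 features from Table 1)."""
    words = re.findall(r"[A-Za-z']+", sentence)
    avg_word_len = sum(len(w) for w in words) / len(words) if words else 0.0
    return [
        len(sentence),                        # characters per sentence
        len(words),                           # words per sentence
        avg_word_len,                         # average word length
        sum(c.isdigit() for c in sentence),   # digit count
    ]

text = ("In what follows, we give a detailed overview of Barack Obama's "
        "Family. We shed light on himself, his immediate and extended family.")
# Step 2: naive split on sentence-terminal punctuation.
sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
# Step 3: one feature vector per sentence.
vectors = [lexical_features(s) for s in sentences]
assert len(vectors) == 2 and len(vectors[0]) == 4
```

Each vector then becomes one row of the matrix V fed to the distance calculation in Step 4.</p>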
        <p>To generate the trained models, all text files were read from the corpus in sequence. Each file was broken down into sentences, all of the lexical features (see Section 2.1) were computed for each sentence, and a feature vector containing the distances of each sentence from all other sentences was created using the ClustDist technique (see Section 2.2). WEKA3 (a machine learning toolkit) was used to generate and save the training models by producing an ".arff" file from the resultant distance vector. For cluster generation, we used the simple k-means algorithm.
3 http://www.cs.waikato.ac.nz/ml/weka/ Last visited: 25-05-2016</p>
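        <p>The hand-off to WEKA happens through an ARFF file. A minimal sketch of writing the ClustDist scores in that format follows; the relation and attribute names are illustrative, not taken from the paper:

```python
def write_arff(path, scores, relation="clustdist"):
    """Write the ClustDist scores as a one-attribute ARFF file so that
    WEKA can consume them (relation/attribute names are illustrative)."""
    lines = ["@RELATION " + relation,
             "",
             "@ATTRIBUTE avg_distance NUMERIC",
             "",
             "@DATA"]
    # One instance per sentence: its average distance to all others.
    lines += ["%.6f" % s for s in scores]
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")

write_arff("train.arff", [1.667, 3.333, 1.667])
```

WEKA's SimpleKMeans can then be run directly on the resulting file.</p>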
      </sec>
      <sec id="sec-2-3">
        <title>Software Evaluation</title>
        <p>PAN'16 also provided a data set to evaluate the developed software, but this data set is not publicly available because it will be released after the competition. After training our software on the training data set, we deployed it on TIRA for the evaluation phase and executed the evaluation software on it. This time the software takes its input files from the testing corpus. For each input file, a "test.arff" file was generated using the ClustDist feature vector. Finally, we obtained clusters with respect to each subtask as per the requirements of PAN'16: for subtask 1, only two clusters were generated, one for the main author and a second for the rest of the authors; for subtask 2, clusters were created as per the number required in the ".meta" file; and for subtask 3, a random number of clusters was created because the number of authors was not given. As a final step, the clusters generated from the evaluation corpus were compared with the existing trained models to predict new instances.</p>
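        <p>The per-subtask choice of cluster count described above can be sketched as follows; the function name and the random range for subtask 3 are assumptions, as the paper does not state the range it used:

```python
import random

def clusters_for_subtask(problem_type, num_authors=None, seed=0):
    """Pick the number of k-means clusters per PAN'16 subtask, following
    the rules described above (range for subtask 3 is an assumption)."""
    if problem_type == "plagiarism":      # subtask 1: main author vs. rest
        return 2
    if num_authors is not None:           # subtask 2: n given in ".meta"
        return num_authors
    # Subtask 3: number of authors unknown, so pick a random cluster count.
    return random.Random(seed).randint(2, 10)

assert clusters_for_subtask("plagiarism") == 2
assert clusters_for_subtask("diarization", num_authors=4) == 4
```

The chosen k is then passed to the simple k-means clustering step.</p>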
      </sec>
      <sec id="sec-2-4">
        <title>Evaluation Measures</title>
        <p>
          PAN'16 recommended different evaluation measures for subtasks 1, 2 and 3. For subtask 1, the micro-recall, micro-precision, macro-recall, macro-precision, micro-F and macro-F [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] measures were used to evaluate the performance of the system. For subtasks 2 and 3, the BCubed recall, BCubed precision and BCubed F measures were used [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
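        <p>A minimal sketch of the BCubed measures used for subtasks 2 and 3, for flat clusterings given as item-to-cluster mappings; note that PAN's actual evaluation also handles overlapping author segments, which this sketch does not:

```python
def bcubed(pred, truth):
    """BCubed precision, recall and F for flat clusterings given as
    dicts mapping each item to a cluster id."""
    items = list(pred)
    def avg(score_of):
        return sum(score_of(e) for e in items) / len(items)
    def prec(e):
        # Of the items clustered with e, how many truly share e's author?
        same_pred = [o for o in items if pred[o] == pred[e]]
        return sum(truth[o] == truth[e] for o in same_pred) / len(same_pred)
    def rec(e):
        # Of the items truly sharing e's author, how many are clustered with e?
        same_true = [o for o in items if truth[o] == truth[e]]
        return sum(pred[o] == pred[e] for o in same_true) / len(same_true)
    p, r = avg(prec), avg(rec)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# A perfect clustering scores 1.0 on all three measures.
p, r, f = bcubed({"a": 0, "b": 0, "c": 1}, {"a": 0, "b": 0, "c": 1})
assert (p, r, f) == (1.0, 1.0, 1.0)
```

Merging all items into a single cluster keeps BCubed recall at 1.0 but lowers BCubed precision, which is why both measures are reported together.</p>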
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results and Analysis</title>
      <p>This section presents the results generated using the training and testing data. For each of the three subtasks, the size of the text fragments was increased to obtain better results; the experiments showed an improvement in results up to a point. Section 4.1 discusses the results on the training data and Section 4.2 elaborates the results on the testing data for each of the three subtasks.</p>
      <sec id="sec-3-1">
        <title>Results: Training Phase</title>
        <p>For subtask 1, the results are shown in Table 4. The highest results for all evaluation measures are obtained for sentences of length 7. Table 5 and Table 6 show the results for subtask 2 and subtask 3 respectively. Sentences of length 5 gave the best results for subtask 2, and in subtask 3 sentences of the same length showed the highest values of the required evaluation measures.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Results: Testing Phase</title>
        <p>Based upon the results on the training data set, we used only those sentence lengths which gave the highest results for each subtask, because the training and testing data sets contain very similar types of text. Therefore, for subtask 1 we used sentence length 7, and for subtasks 2 and 3 sentence length 5 was used. Table 7 shows the results for subtask 1, while the results for subtasks 2 and 3 are shown in Table 8.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion and Future Work</title>
      <p>In this paper we have discussed our participation in the PAN'16 Author Diarization task. We developed a software for all three subtasks. Our proposed approach is based on a language-independent technique, lexical features, to uniquely identify an author from his/her written text. Fifteen lexical features were used in combination with the ClustDist approach, which we used to detect anomalous text within a document. Experiments were performed on the training and testing data sets of PAN'16. Different text fragment sizes were used to improve the results in the training phase, while in the testing phase only those fragment sizes were used which gave the best results in the training phase. We observed that an improvement in results can be obtained by changing the fragment sizes. As future work, content-based, topic-based and stylistic features in combination with the ClustDist method will be explored.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Amigo</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Artiles</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verdejo</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>A comparison of extrinsic clustering evaluation metrics based on formal constraints</article-title>
          .
          <source>Information retrieval 12</source>
          (
          <issue>4</issue>
          ),
          <volume>461</volume>
          –
          <fpage>486</fpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Guthrie</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Unsupervised Detection of Anomalous Text</article-title>
          .
          <source>Ph.D. thesis</source>
          , University of Sheffield (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>MacQueen</surname>
          </string-name>
          , J., et al.:
          <article-title>Some methods for classification and analysis of multivariate observations</article-title>
          .
          <source>In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability</source>
          . vol.
          <volume>1</volume>
          , pp.
          <volume>281</volume>
          –
          <fpage>297</fpage>
          . Oakland, CA, USA. (
          <year>1967</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Miro</surname>
            ,
            <given-names>X.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bozonnet</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Evans</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fredouille</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Friedland</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vinyals</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Speaker diarization: A review of recent research</article-title>
          . Audio, Speech, and Language Processing,
          <source>IEEE Transactions on 20(2)</source>
          ,
          <volume>356</volume>
          –
          <fpage>370</fpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gollub</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Improving the Reproducibility of PAN's Shared Tasks: Plagiarism Detection, Author Identification, and Author Profiling</article-title>
          . In: Kanoulas,
          <string-name>
            <given-names>E.</given-names>
            ,
            <surname>Lupu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Clough</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Sanderson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Hall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Hanbury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Toms</surname>
          </string-name>
          , E. (eds.)
          <article-title>Information Access Evaluation meets Multilinguality, Multimodality, and Visualization</article-title>
          .
          <source>5th International Conference of the CLEF Initiative (CLEF 14)</source>
          . pp.
          <volume>268</volume>
          –
          <fpage>299</fpage>
          . Springer, Berlin Heidelberg New York (
          <year>Sep 2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , Barrón-Cedeño,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <surname>P.:</surname>
          </string-name>
          <article-title>An evaluation framework for plagiarism detection</article-title>
          .
          <source>In: Proceedings of the 23rd international conference on computational linguistics: Posters</source>
          . pp.
          <volume>997</volume>
          –
          <fpage>1005</fpage>
          .
          <article-title>Association for Computational Linguistics (</article-title>
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Stamatatos</surname>
          </string-name>
          , E.:
          <article-title>A survey of modern authorship attribution methods</article-title>
          .
          <source>Journal of the American Society for information Science and Technology</source>
          <volume>60</volume>
          (
          <issue>3</issue>
          ),
          <volume>538</volume>
          –
          <fpage>556</fpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tschuggnall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verhoeven</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daelemans</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Specht</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Clustering by Authorship Within and Across Documents</article-title>
          .
          <source>In: Working Notes Papers of the CLEF 2016 Evaluation Labs. CEUR Workshop Proceedings, CLEF and CEUR-WS.org (Sep</source>
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>