A Language Independent Author Verifier Using Fuzzy C-Means Clustering

Notebook for PAN at CLEF 2014

Pashutan Modaresi1,2 and Philipp Gross1

1 pressrelations GmbH, Düsseldorf, Germany
{pashutan.modaresi, philipp.gross}@pressrelations.de
2 Heinrich-Heine-University of Düsseldorf, Institute of Computer Science, Düsseldorf, Germany
modaresi@cs.uni-duesseldorf.de

Abstract. In this work we describe our approach to solving the author verification problem introduced in the PAN 2014 Author Identification task. The author verification task presents participants with a set of problems, where each problem consists of a set of documents written by the same author and a questioned document with an unknown author. The task is then to decide whether the questioned document has the same author as the other documents or not. Inspired by a psychological personality model, our approach uses basic lexical feature extraction and fuzzy clustering. Using the created fuzzy clusters, the membership values of documents to the clusters can be computed. Finally, the distribution of the cluster membership values is used to solve the verification problem.

1 Introduction

Given a set of documents with known authors, authorship attribution is the task of identifying the author of an unseen document. With a small number of candidate authors, this task can be solved well using state-of-the-art approaches [1]. A realistic and common scenario for authorship attribution is the author verification problem. Given a set of documents written by a single author, the task here is to determine whether a questioned document is written by the same author or not.

The PAN 2014 Author Identification task focuses on the author verification problem. To be more specific, in this task a multi-lingual corpus is provided which consists of several problems. Each problem contains a maximum of 5 documents written by a single author and a questioned document by an unknown author. The task is then to determine whether the questioned document is written by the same author or not. The fact that an author may consciously or unconsciously vary his or her writing style makes author verification a hard problem [7]. In this paper we introduce a novel approach for solving the task of author verification. For this we extract language independent features from our training corpus and use a fuzzy clustering algorithm to construct our models. Finally, using the membership distribution of documents over the clusters, we solve the verification task.

In Section 2 we define the problem of author verification formally and introduce some notation. Section 3 addresses the process of feature extraction and normalization. The process of clustering and model construction is discussed in Section 4. Section 5 covers the process of verification and scoring. An overview of the evaluation results is given in Section 6. Finally, Section 7 concludes the work.

2 Problem Statement

In this section we formally define the problem of author verification in the context of the PAN 2014 Author Identification task. Let P = {D, d_u} be a problem consisting of a set of documents D = {d_1, ..., d_n} with 1 ≤ n ≤ 5 written by a single author, and a questioned document d_u with an unknown author. The task in author verification is to determine whether the questioned document d_u is written by the same author or not. We denote the author of a document d_i by A(d_i). In other words, an author verifier \varphi is a binary classification function of the following form:

    \varphi(d_u, D) = \begin{cases} 1, & \text{if } A(d_u) = A(d_i) \;\forall d_i \in D \\ 0, & \text{otherwise} \end{cases} \qquad (1)
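To fix this notation in code as well, the following minimal Python sketch models a problem P and the verifier interface. The class and function names are ours, not taken from the authors' implementation, and the decision logic is the one developed later in Section 5.

```python
from dataclasses import dataclass

@dataclass
class Problem:
    """One verification problem P = {D, d_u} (illustrative names)."""
    known_docs: list[str]   # D = {d_1, ..., d_n}: texts written by one author
    questioned_doc: str     # d_u: text with an unknown author

def verify_problem(problem: Problem) -> int:
    """Author verifier phi from Eq. 1: returns 1 if d_u is attributed to
    the author of the known documents, 0 otherwise."""
    raise NotImplementedError  # realized by the pipeline sketched in Sections 3-5
```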
In the PAN 2014 Author Identification task, the problems come from 4 different languages, namely Dutch, English, Greek and Spanish. The author verification algorithm has to be able to deal with documents from all of the specified languages. The performance of the author verifier will be evaluated according to the area under the ROC curve (AUC) of its probability scores and also based on the c@1 measure [8]. The evaluation process will be discussed in more detail in Section 6. In the following section, we start the description of our algorithm by discussing the feature extraction and normalization step.

3 Feature Extraction and Normalization

Feature extraction is considered one of the most important steps in author verification [9]. Different kinds of stylometric features, such as lexical, syntactic or semantic features, have been used for solving the author verification task. In order to design an efficient author verification algorithm that can deal with large numbers of documents, we only consider a limited number of lexical features and construct our learning algorithm in a way that yields acceptable performance even with a small number of features. Lexical features have the advantage over syntactic and semantic features that they can be computed very efficiently and without the use of any external knowledge or training.

We represent documents as vectors in R^4. Each component of these 4-dimensional vectors is computed by a feature extraction function. Independent of the document language, we use the following functions to compute the feature vector components of documents (a code sketch of the extraction and normalization steps follows at the end of this section):

Average Sentence Length (f_sl): Using a sentence detector, the sentence boundaries of the document are detected (in our case we use a regular expression based sentence detector to optimize the performance). For each sentence s in the document, its length l(s) is computed. We denote the set of all sentences inside a document by S. Finally, the average sentence length of the document is computed as follows:

    f_{sl}(d) = \frac{\sum_{s \in S} l(s)}{|S|} \qquad (2)

Punctuation Marks Usage (f_pm): Using a predefined set of punctuation marks T = { ( ) , : ; ! ? }, the frequency of the elements of T inside the document is computed and finally normalized by the length of the document. With f(t, d) we denote the frequency of the punctuation mark t in document d.

    f_{pm}(d) = \frac{\sum_{t \in T} f(t, d)}{|d|} \qquad (3)

Space After Comma (f_sac): Our experimental results show that whether a space is used after a comma or not can be a good discriminating feature in the author verification task. Let \alpha denote the number of times a comma is followed by a space and \beta the number of times a comma is not followed by a space. In this way f_sac can be defined as follows:

    f_{sac}(d) = \frac{\alpha - \beta}{|d|} \qquad (4)

Analogous to f_sac, we define f_sbc, the Space Before Comma feature. Through this feature, authors that use a space before a comma can be discriminated from those who do not.

As the extracted features may exhibit significant differences in their range and distribution, our learning algorithm could be more sensitive to features with a wider range (e.g. Average Sentence Length). In order to avoid this behavior we use feature normalization, through which we can modify the mean and variance of the features using a transformation function. The transformation function that we use in this work is the min-max function. Given a feature f, the min-max transformation is defined as follows:

    f' = \frac{f - \min(f)}{\max(f) - \min(f)} \qquad (5)

In the above formula f denotes the feature vector and f' is the transformed feature vector.
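The four feature functions are simple enough to sketch directly in Python. The following is an illustrative reading of this section, under two assumptions not stated in the paper: the document length |d| is counted in characters, and a naive regex splitter stands in for the authors' unspecified sentence detector.

```python
import re

PUNCTUATION = set("(),:;!?")  # the set T from Eq. 3

def extract_features(text: str) -> list[float]:
    """4-dimensional lexical feature vector (f_sl, f_pm, f_sac, f_sbc)."""
    n = max(len(text), 1)  # document length |d|, assumed to be in characters

    # f_sl (Eq. 2): average sentence length over the detected sentences.
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s]
    f_sl = sum(len(s) for s in sentences) / max(len(sentences), 1)

    # f_pm (Eq. 3): frequency of the punctuation marks in T, normalized by |d|.
    f_pm = sum(1 for ch in text if ch in PUNCTUATION) / n

    # f_sac (Eq. 4): commas followed by a space (alpha) vs. not (beta).
    alpha = text.count(", ")
    beta = text.count(",") - alpha
    f_sac = (alpha - beta) / n

    # f_sbc: analogous, counting the space before the comma.
    alpha_b = text.count(" ,")
    beta_b = text.count(",") - alpha_b
    f_sbc = (alpha_b - beta_b) / n

    return [f_sl, f_pm, f_sac, f_sbc]

def min_max(column: list[float]) -> list[float]:
    """Min-max normalization of one feature across the corpus (Eq. 5)."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in column]
```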
4 Fuzzy Clustering and Model Construction

In this section we illustrate the main idea behind our learning algorithm. We believe that different personality dimensions have a close relationship with the writing style of authors. In psychology, the Big Five Personality Traits are 5 dimensions of personality that are used to describe the personality of humans [3]. Openness, Conscientiousness, Extraversion, Agreeableness and Neuroticism are the personality dimensions described as the factors of the Big Five model. Based on these dimensions, each person's personality can be described as a combination of the above dimensions. Inspired by the Big Five model, we construct c clusters, where each cluster represents a personality dimension. An author's personality can then be determined by computing his or her membership to these clusters. Finally, two authors that have the same (or similar) membership distribution over the clusters are considered to be the same.

For this we collect all the documents in our training set that are known to be written by the same author and extract their features (see Section 3). This results in a matrix Z = [z_1^T, z_2^T, ..., z_N^T] \in R^{4 \times N}, where N is the number of collected documents and z_i^T denotes the transpose of the vector z_i. As already mentioned, the personality of an author can be determined using his or her membership values to the available clusters. For this reason, we use the Fuzzy C-Means [2] clustering algorithm to construct fuzzy clusters. For constructing c clusters, we assign initial cluster membership values to each document in the collection (the collection of these values constitutes the partition matrix U = [\mu_{ik}] \in R^{c \times N}). The partition matrix is updated after each iteration of the algorithm until no significant changes are observable. After initializing the partition matrix randomly, the Fuzzy C-Means algorithm can be summarized as follows:

Repeat for l = 1, 2, ...

Step 1: Compute the cluster centers with m \in [1, \infty):

    v_i^{(l)} = \frac{\sum_{k=1}^{N} (\mu_{ik}^{(l-1)})^m z_k}{\sum_{k=1}^{N} (\mu_{ik}^{(l-1)})^m}, \quad 1 \le i \le c \qquad (6)

Step 2: Compute the distances:

    D_{ik}^{(l)\,2} = \left\| z_k - v_i^{(l)} \right\|^2 = (z_k - v_i^{(l)})^T (z_k - v_i^{(l)}), \quad 1 \le i \le c, \; 1 \le k \le N \qquad (7)

Step 3: Update the partition matrix: for 1 \le k \le N, if D_{ik} > 0 for all i = 1, 2, ..., c,

    \mu_{ik}^{(l)} = \frac{1}{\sum_{j=1}^{c} (D_{ik}/D_{jk})^{2/(m-1)}} \qquad (8)

otherwise

    \mu_{ik}^{(l)} = 0 \text{ if } D_{ik} > 0, \quad \text{and } \mu_{ik}^{(l)} \in [0, 1] \text{ with } \sum_{i=1}^{c} \mu_{ik}^{(l)} = 1 \qquad (9)

Until \| U^{(l)} - U^{(l-1)} \| < \epsilon
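The update loop above translates almost line by line into NumPy. The following compact sketch makes two simplifications that are our own: squared distances are floored by a small constant instead of implementing the exact zero-distance case of Eq. 9, and the random initialization and parameter defaults are assumed rather than taken from the paper.

```python
import numpy as np

def fuzzy_c_means(Z, c, m=3.0, eps=1e-5, max_iter=100, seed=None):
    """Fuzzy C-Means on a 4 x N feature matrix Z. Returns cluster centers
    V (4 x c) and partition matrix U (c x N)."""
    rng = np.random.default_rng(seed)
    N = Z.shape[1]
    U = rng.random((c, N))
    U /= U.sum(axis=0)                        # each column sums to 1
    for _ in range(max_iter):
        W = U ** m
        V = (Z @ W.T) / W.sum(axis=1)         # Eq. 6: weighted cluster centers
        # Eq. 7: squared distances D_ik^2, floored to avoid division by zero
        # (this sidesteps the explicit zero-distance case of Eq. 9).
        D2 = ((Z[:, None, :] - V[:, :, None]) ** 2).sum(axis=0)
        D2 = np.maximum(D2, 1e-12)
        # Eq. 8: on squared distances the exponent 2/(m-1) becomes 1/(m-1).
        ratio = (D2[:, None, :] / D2[None, :, :]) ** (1.0 / (m - 1))
        U_new = 1.0 / ratio.sum(axis=1)
        if np.linalg.norm(U_new - U) < eps:   # stopping criterion
            U = U_new
            break
        U = U_new
    return V, U
```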
We use the cluster information produced by the clustering algorithm to verify whether two documents are written by the same author or not. The process of author verification is discussed in the following section.

5 Verification and Scoring

In order to find an answer to an author verification problem P, we compute the cluster membership values for the documents with known authors and the document with an unknown author. Then, using the membership values, we decide whether the documents have the same author or not. Given a problem P = {D = {d_1, ..., d_n}, d_u} and c cluster prototypes (centroids) V = {v_1, ..., v_c}, we compute the membership values of the documents with known authors to the constructed clusters. In this way, for each document d_i, 1 \le i \le n, a cluster membership vector \mu_i = (\mu_{i1}, ..., \mu_{ic}) is computed, where the j-th element represents the membership value of the document d_i to the cluster j. In the same way we compute the cluster membership values of the document with unknown author d_u, which results in the membership vector \mu_u.

At this step the cluster membership values for all documents in the problem P are known. Notice that the documents d_1, ..., d_n are assumed to be written by the same author. In theory, we would expect the cluster membership vectors of these documents to look very similar to each other. Experimental results show that this is usually not the case, which may be explained by the fact that authors write in different psychological states.

In order to address this problem, we compute a mean cluster membership vector for the documents with known authors. Through this vector a more stable estimation of the membership to the available personality dimensions can be made. The mean cluster membership vector of a set of documents d_1, ..., d_n with known authors is computed as follows:

    \tilde{\mu} = \frac{\sum_{i=1}^{n} \mu_i}{n} \qquad (10)

Now the similarity between the mean cluster membership vector of the documents with known authors and the membership vector of the questioned document can be computed using the cosine similarity, which is defined as follows [6]:

    S_{\tilde{\mu}, \mu_u} = \frac{\tilde{\mu} \cdot \mu_u}{\|\tilde{\mu}\| \, \|\mu_u\|} \qquad (11)

The cosine similarity measures the angle between the vectors: a cosine value of 0 means that the vectors are orthogonal to each other, and a cosine value of 1 means that the vectors point in the same direction. Through the cosine similarity measure we assign a score to each problem. Additionally, we need a transformation function that returns binary values for the author verification problem. In Section 2 we defined the function \varphi(d_u, D). Here we modify this definition and redefine the function:

    \varphi(d_u, D) = \begin{cases} 1, & \text{if } S_{\tilde{\mu}, \mu_u} \ge 0.5 \\ 0, & \text{otherwise} \end{cases} \qquad (12)

Using the above definition, for each problem P it can be decided whether the documents inside P belong to the same author or not. A value of 1 means that the documents inside P have the same author, and a value of 0 means that the questioned document has a different author than the documents with a known author.
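Eqs. 10-12 combine into a short scoring routine. Again a sketch with our own names: we assume the membership of a document to the fixed centers is computed with the FCM membership formula (Eq. 8), which is the natural reading of this section, and the zero-distance case is handled with a small floor as before.

```python
import numpy as np

def membership(z, V, m=3.0):
    """Membership vector of one feature vector z (R^4) to the cluster
    centers V (4 x c), via the FCM membership formula (Eq. 8)."""
    d2 = ((V - z[:, None]) ** 2).sum(axis=0)   # squared distances to centers
    d2 = np.maximum(d2, 1e-12)
    ratio = (d2[:, None] / d2[None, :]) ** (1.0 / (m - 1))
    return 1.0 / ratio.sum(axis=1)

def verify(known_vecs, questioned_vec, V, m=3.0, threshold=0.5):
    """Decision rule of Eqs. 10-12. Returns (binary answer, score)."""
    mu_tilde = np.mean([membership(z, V, m) for z in known_vecs], axis=0)  # Eq. 10
    mu_u = membership(questioned_vec, V, m)
    score = float(mu_tilde @ mu_u /
                  (np.linalg.norm(mu_tilde) * np.linalg.norm(mu_u)))       # Eq. 11
    return int(score >= threshold), score                                  # Eq. 12
```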
6 Evaluation Results

In order to evaluate our approach we used the training set provided by the PAN 2014 Author Identification task. The training set consists of documents in 4 different languages, namely Dutch, English, Greek and Spanish. Dutch documents are divided into essays and reviews, and English documents into essays and novels. Greek and Spanish documents belong only to the genre articles. In total we constructed 6 models, where each model corresponds to a specific language and a specific genre.

For constructing the clusters of language L and genre G, we randomly selected 20% of the available training data. The experiments were repeated 1000 times and the average c@1 measure over the iterations was computed. The c@1 measure of a single iteration is computed as follows [8]:

    c@1 = \frac{1}{n} \left( n_c + \frac{n_c}{n} \, n_u \right) \qquad (13)

where n is the number of problems, n_c the number of correct answers, and n_u the number of unanswered problems (a code sketch of this measure is given at the end of this section). The results are summarized in Table 1.

Table 1: Evaluation results for the training set

    Language  Genre     #Clusters  m  c@1    AUC    c@1 · AUC
    Dutch     Essays    4          4  0.731  0.752  0.549
    Dutch     Reviews   3          4  0.680  0.763  0.518
    English   Essays    3          3  0.664  0.651  0.432
    English   Novels    4          3  0.852  0.852  0.725
    Greek     Articles  4          3  0.671  0.697  0.467
    Spanish   Articles  3          5  0.684  0.712  0.487

In Table 1 the number of created clusters and the parameter m are also specified. These parameter values are the ones that returned the best results during our experiments. As can be seen, the algorithm returns the best results for the English novels, with a c@1 value of 0.852. The worst results are also obtained for English documents, but for the genre essays. Nevertheless, the c@1 values for all languages and genres are greater than 0.66.

Besides the c@1 measure, we also evaluate the performance of our algorithm according to the area under the ROC curve (AUC) [4] of its returned probability scores. Table 1 summarizes these results as well. As can be seen in the table, the AUC values are consistent and comparable with the c@1 values. The reason for this is that the verification algorithm outputs very high probability scores for the positive cases and very low probability scores for the negative cases.

For ranking the performance of the participants in the competition, a test corpus was provided. We evaluated our algorithm using Tira [5], a service for running experiments in computer science. Table 2 reports the performance results and the run-time of our algorithm on the test corpus.

Table 2: Evaluation results for the test set

    Language  Genre     c@1    AUC    c@1 · AUC  Runtime (in seconds)
    Dutch     Essays    0.635  0.594  0.377      4
    Dutch     Reviews   0.500  0.493  0.246      6
    English   Essays    0.580  0.602  0.349      6
    English   Novels    0.715  0.711  0.508      7
    Greek     Articles  0.540  0.543  0.293      4
    Spanish   Articles  0.650  0.640  0.416      7

The performance results on the test corpus show that our algorithm performs very well for English novels and essays, reaching final scores of 0.508 and 0.349, respectively. For the other languages, however, the results are not as satisfactory as expected. This difference indicates that for languages other than English, deeper feature engineering is needed.

The run-time of our algorithm on the different data sets also shows that the introduced algorithm can be used efficiently for large collections of author verification problems. This is due to the small number of features that we extract from the documents. On the one hand, this has the advantage that author verification problems can be solved very efficiently; on the other hand, it results in lower performance for specific languages.
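For completeness, here is a short sketch of the c@1 computation of Eq. 13 as we read it, where None marks an unanswered problem; the function name and calling convention are illustrative.

```python
def c_at_1(answers, truth):
    """c@1 (Eq. 13): unanswered problems (None) are credited in proportion
    to the accuracy achieved on the answered ones."""
    n = len(answers)
    n_u = sum(1 for a in answers if a is None)
    n_c = sum(1 for a, t in zip(answers, truth) if a is not None and a == t)
    return (n_c + (n_c / n) * n_u) / n

# Example: 3 correct, 1 wrong, 1 unanswered out of 5 problems.
# c@1 = (3 + (3/5) * 1) / 5 = 0.72
print(c_at_1([1, 0, 1, 1, None], [1, 1, 1, 1, 0]))
```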
7 Conclusion

In this work we have described our approach to solving the author verification problem introduced in the PAN 2014 Author Identification task. Using the fuzzy c-means clustering algorithm, we partitioned the provided training set into several clusters (Section 4). Given an author verification problem, we used the membership values of the documents inside the problem to verify whether two documents have the same author or not.

In order to design an efficient algorithm, we only considered a limited number of features for each language. This resulted in very low run-times for our algorithm. Accordingly, we achieved the 1st place among the participants with respect to the run-time of the algorithms.

Our approach also yielded sound results for the English language, achieving the 1st place for English novels and the 5th place for English essays among the 13 participating teams. For the other languages we did not obtain the expected satisfactory results. One reason for this lies in the small amount of training data that we used for constructing our fuzzy clusters. We also used the same set of features for all available languages, which is probably the main reason for the insufficient results for languages other than English.

References

1. Argamon, S.: Scalability issues in authorship attribution (by Kim Luyckx). LLC 27(1), 95–97 (2012)
2. Bezdek, J., Ehrlich, R., Full, W.: FCM: The fuzzy c-means clustering algorithm. Computers & Geosciences 10(2-3), 191–203 (1984)
3. Digman, J.M.: Personality structure: Emergence of the five-factor model. Annual Review of Psychology 41(1), 417–440 (1990)
4. Fawcett, T.: ROC graphs: Notes and practical considerations for researchers. ReCALL 31(HPL-2003-4), 1–38 (2004)
5. Gollub, T., Potthast, M., Beyer, A., Busse, M., Rangel, F., Rosso, P., Stamatatos, E., Stein, B.: Recent trends in digital text forensics and its evaluation. In: Forner, P., Müller, H., Paredes, R., Rosso, P., Stein, B. (eds.) Information Access Evaluation. Multilinguality, Multimodality, and Visualization, Lecture Notes in Computer Science, vol. 8138, pp. 282–302. Springer Berlin Heidelberg (2013)
6. Han, J.: Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (2005)
7. Koppel, M., Schler, J.: Authorship verification as a one-class classification problem. In: Proceedings of the Twenty-first International Conference on Machine Learning. p. 62. ICML '04, ACM, New York, NY, USA (2004)
8. Peñas, A., Rodrigo, A.: A simple measure to assess non-response. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1. pp. 1415–1424. HLT '11, Association for Computational Linguistics, Stroudsburg, PA, USA (2011)
9. Stamatatos, E.: A survey of modern authorship attribution methods. J. Am. Soc. Inf. Sci. Technol. 60(3), 538–556 (Mar 2009)