Unsupervised method for the authorship identification task
Notebook for PAN at CLEF 2014

Esteban Castillo1, Ofelia Cervantes1, Darnes Vilariño2, David Pinto2, and Saul León2

1 Universidad de las Américas Puebla, Department of Computer Science, Electronics and Mechatronics, Mexico
{esteban.castillojz, ofelia.cervantes}@udlap.mx
2 Benemérita Universidad Autónoma de Puebla, Faculty of Computer Science, Mexico
{darnes, dpinto}@cs.buap.mx, saul.ls@live.com

Abstract. This paper presents an approach for tackling the authorship identification task. The approach compares a given unknown document against the known documents of an author using a number of different phrase-level and lexical-syntactic features, so that the unknown document is classified as having been written by the same author if the similarity scores obtained reach a predetermined threshold for each language in the task. The method has shown competitive results, achieving overall 6th place in the competition ranking.

Keywords: Authorship verification, features, similarity measures, unsupervised learning, threshold

1 Introduction

Discovering the right features in a raw text in order to unambiguously attribute the authorship of a given anonymous document is a very hard problem that has recently (driven by the continuous growth of information on the Internet) become of high interest in areas such as information retrieval (IR), natural language processing (NLP), and computational linguistics. The most common framework for mapping candidate authors to unknown documents is the authorship attribution problem, where, given texts of uncertain authorship and sample documents from a small, finite set of candidate authors, the task consists of mapping the uncertain texts onto their true authors among the candidates. This closed-set problem is considered an unreasonably easy task, while a more demanding problem (often found in documents on the web) is the author verification task: given a set of documents by a single author and a questioned document, determine whether or not the questioned document was written by that particular author. In this sense, finding the right features for characterizing the signature, or particular writing style, of a given author is fundamental for solving the authorship problem.

The results reported in this paper were obtained in the framework of the 8th International Workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse (PAN 14), in particular in the task named Author Identification. For this purpose, we propose an approach for representing the features taken into account in the process of authorship verification using an unsupervised learning method.

The rest of this paper is structured as follows. Section 2 describes the features and similarity measures used in the task. Section 3 presents the proposed approach (an unsupervised algorithm) used in the experiments. The experimental setting and a discussion of the obtained results are given in Section 4. Finally, the conclusions of this research work are presented in Section 5.

2 Description of the features and similarity measures used in the task

In this work we explore a combination of different types of features in a text frequency vector [7] to represent the writing style of the authors.
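As a rough illustration of this representation (our own sketch, not the authors' code), the following Python fragment merges the output of several feature extractors into a single frequency vector; the extractor callables are assumptions, standing in for the extractors described in Section 2.1.

```python
# Minimal sketch of a combined feature-frequency vector. Each extractor is
# assumed to be a callable that maps a text to a list of feature strings.
from collections import Counter

def frequency_vector(text, extractors):
    """Concatenate the features produced by every extractor and count
    how often each one occurs in the text."""
    features = []
    for extract in extractors:
        features.extend(extract(text))
    return Counter(features)
```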
For all languages in the author verification task, we consider using lexical-syntactic features because they are relatively easy to quantify and they capture the syntactic order in which words are used to form ideas. We also consider (in an unsupervised classification) the use of different metrics to determine the similarity between the feature vectors of the documents of known origin and those of the documents of unknown origin. Both types of elements are described below.

2.1 Lexical-syntactic features

In this approach, the following features are considered for representing the particular writing style of a given author (a sketch of some of these extractors is given after the list):

– Phrase-level features
  1. Word prefixes. A group of letters added before a word or base to alter its meaning and form a new word.
  2. Word suffixes. A group of letters added after a word or base to alter its meaning and form a new word.
  3. Stopwords. A group of words that bear no content or relevant semantics in the text.
  4. Punctuation marks. Conventional signs and certain typographical devices that aid the understanding and correct reading, both silently and aloud, of handwritten and printed texts.
  5. Word n-grams. A contiguous sequence [5] of n words from a text or speech.
  6. Skip-grams. A technique [4] in which n-grams are still formed over sequences of words, but tokens are allowed to be skipped.

– Character-level features
  1. Vowel combination. Word consonants are removed and, thereafter, the remaining vowels are combined. Each vowel combination is considered to be a feature. Adjacent repetitions of a vowel are treated as a single vowel.
  2. Vowel permutation. Word consonants are removed and, thereafter, the vowel permutation is considered to be a feature.
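The sketch below (ours, not the paper's) illustrates three of the extractors above; the function names and the parameters n and k are assumptions, since the paper does not fix an implementation.

```python
# Hedged sketch of three lexical-syntactic feature extractors.
import itertools

VOWELS = set("aeiou")

def word_ngrams(words, n):
    """Contiguous sequences of n words [5]."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def skip_bigrams(words, k):
    """Word pairs that allow up to k tokens to be skipped between them [4]."""
    return [(words[i], words[j])
            for i in range(len(words))
            for j in range(i + 1, min(i + k + 2, len(words)))]

def vowel_combination(word):
    """Remove consonants, then collapse adjacent repeated vowels."""
    vowels = [c for c in word.lower() if c in VOWELS]
    return "".join(key for key, _ in itertools.groupby(vowels))
```

For example, word_ngrams("the quick brown fox".split(), 2) yields ["the quick", "quick brown", "brown fox"], and vowel_combination("been") yields "e".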
2.2 Similarity measures

In this approach, the following metrics are considered to determine the similarity between documents:

– Latent semantic analysis (LSA). A technique [6] that analyzes relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words which are close in meaning will occur in similar pieces of text.

– Jaccard similarity. A metric used for comparing the similarity and diversity of sample sets. The Jaccard coefficient measures similarity between finite sample sets and is defined as the size of the intersection divided by the size of the union of the sample sets:

  $JS(A, B) = \frac{|A \cap B|}{|A \cup B|}$  (1)

– Euclidean distance. A metric used for comparing the similarity between two vectors p and q, where p = (p_1, p_2, ..., p_n) and q = (q_1, q_2, ..., q_n); the distance from p to q (or from q to p) is given by:

  $d(q, p) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$  (2)

– Chebyshev distance. A metric defined on a vector space where the distance between two vectors p and q is the greatest of their differences along any coordinate dimension:

  $D_{\mathrm{Chebyshev}}(q, p) := \max_{i} |q_i - p_i|$  (3)

– Cosine similarity. A measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them. Given two vectors of attributes p and q, the cosine similarity is represented using a dot product and magnitudes:

  $\cos(\theta) = \frac{p \cdot q}{\|p\| \, \|q\|} = \frac{\sum_{i=1}^{n} p_i q_i}{\sqrt{\sum_{i=1}^{n} p_i^2} \, \sqrt{\sum_{i=1}^{n} q_i^2}}$  (4)

Data: X: unknown documents and Y = {Y_1, Y_2, ..., Y_n}: known documents.
Result: "same author" or "different author" for each unknown document.

for each unknown document X in the test data set do
    index ← 0; vector1, vector2 ← [ ]
    for <X, Y_i> in the document set do
        // frequency vectors of all the combined lexical-syntactic features
        X_v ← feature representation of X
        Y_v ← feature representation of Y_i
        vector1[index] ← S_Cosine(X_v, Y_v)        // cosine similarity
        // frequency vectors of the words
        X_v ← feature representation of X
        Y_v ← feature representation of Y_i
        vector2[index] ← S_Jaccard(X_v, Y_v)       // Jaccard similarity
        index ← index + 1
    end
    Cosine ← max(vector1[0], ..., vector1[n])
    Jaccard ← max(vector2[0], ..., vector2[n])
    index ← 0; vector1, vector2, vector3 ← [ ]
    for <X, Y_i> in the document set do
        // LSA word vectors
        X_v ← feature representation of X
        Y_v ← feature representation of Y_i
        vector1[index] ← S_Cosine(X_v, Y_v)        // cosine similarity
        vector2[index] ← S_Chebyshev(X_v, Y_v)     // Chebyshev distance
        vector3[index] ← S_Euclidean(X_v, Y_v)     // Euclidean distance
        index ← index + 1
    end
    LSA1 ← max(vector1[0], ..., vector1[n])
    LSA2 ← max(vector2[0], ..., vector2[n])
    LSA3 ← max(vector3[0], ..., vector3[n])
    MaxSimilarity ← max(Cosine, Jaccard, LSA1, LSA2, LSA3)
    if MaxSimilarity ≥ Δ then
        result ← same author
    else
        result ← different author
    end
end

Algorithm 1: Proposed approach using an unsupervised learning method

3 Proposed approach

For tackling the authorship identification task we propose a methodology (see Algorithm 1) based on an unsupervised learning method. We evaluate different feature vectors with different similarity measures in order to identify whether an unknown document belongs to an author. For each language, three types of text representation are evaluated:

– A frequency vector of all the vocabulary in the test documents.
– A frequency vector of all the combined lexical-syntactic features of the test documents:
  1. Word prefixes
  2. Word suffixes
  3. Stopwords
  4. Punctuation marks
  5. Word n-grams (from 1 to 5)
  6. Skip-grams (from 1 to 5)
  7. Vowel combination
  8. Vowel permutation
– A similarity vector using the LSA algorithm for each word in the test documents.

Different distance/similarity measures were tested: the Jaccard similarity for the vocabulary feature vector, the cosine similarity for the frequency vector of all the combined lexical-syntactic features, and the Chebyshev distance, Euclidean distance, and cosine similarity for the LSA vectors. The best similarity for each document of unknown origin was obtained and compared against a threshold (see Table 1) to determine whether the document belongs to the author; a simplified sketch of this decision rule is given after Table 1. The threshold was obtained using a classification tree [2], [9] with the maximum similarity values (see Section 2.2) as features in a vector representation over the training data set.

Table 1. Thresholds used in the authorship identification task

Language   Genre      Threshold Δ
Dutch      essays     0.82
Dutch      reviews    0.93
English    reviews    0.67
English    novels     0.80
Greek      articles   0.76
Spanish    articles   0.64
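The following Python sketch condenses the decision rule of Algorithm 1 to two of the five scores; it is our reading of the method, not the authors' code. Vectors are assumed to be dictionaries or Counters as built in Section 2.1, and the threshold comes from Table 1.

```python
# Simplified sketch of Algorithm 1's decision rule (two of the five scores).
import math

def jaccard(a, b):
    """Eq. (1), over the feature sets of the two documents."""
    return len(a & b) / len(a | b)

def cosine(p, q):
    """Eq. (4), over sparse frequency vectors stored as dicts."""
    dot = sum(count * q.get(feature, 0) for feature, count in p.items())
    norm = math.sqrt(sum(v * v for v in p.values())) * \
           math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

def verify(unknown, knowns, threshold):
    """Accept authorship if the best similarity against any known
    document of the candidate author reaches the threshold Δ."""
    best = max(max(cosine(unknown, k) for k in knowns),
               max(jaccard(set(unknown), set(k)) for k in knowns))
    return "same author" if best >= threshold else "different author"
```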
4 Experimental results

The results obtained with the approach are discussed in this section. First, we describe the data sets used in the experiments and, thereafter, the results obtained.

4.1 Data sets

The description of the six text collections used in the experiments is shown in Table 2. As can be seen, the data set is made up of different language/genre combinations within the author verification task, where each problem consists of some (up to five) known documents written by a single person and exactly one questioned document. All documents within a single problem instance are in the same language, and the documents are matched for register, theme, and date of writing.

Table 2. Data set used in the authorship identification task

Language   Genre      Training documents   Test documents
Dutch      essays     169                  96
Dutch      reviews    105                  100
English    reviews    529                  200
English    novels     100                  100
Greek      articles   100                  100
Spanish    articles   500                  100

4.2 Results obtained in the task

In Table 3 we present the results obtained by our approach, evaluated using the TIRA tool [3] on each one of the data sets considered in the competition. The results were assessed according to the area under the ROC curve (AUC) measure [1] and the c@1 measure [8], the final score being the product of both. Three different languages (each one with different genres) were tackled. The best performance was obtained with the Dutch language in the essays genre, followed by the Greek language in the articles genre and the English language in the essays genre. However, a low performance was obtained (with respect to the other teams in the competition) in the Dutch, Spanish, and English languages in the reviews, articles, and novels genres. We attribute these results to the use of empirical thresholds in the final classification of the unknown documents; further analysis will investigate this issue. It is worth noticing that we obtained the sixth place among 13 teams and that our approach always performed better than the competition baselines.

Table 3. Results obtained in different languages

Language   Genre      AUC       c@1       Final score   Runtime    Ranking based on the final score
Dutch      essays     0.86068   0.86133   0.74133       00:01:56   5
Dutch      reviews    0.6692    0.3696    0.24734       00:01:00   11
English    essays     0.54855   0.58      0.31816       01:31:53   11
English    novels     0.6278    0.615     0.3861        26:14:11   5
Greek      articles   0.686     0.73      0.50078       00:03:14   4
Spanish    articles   0.734     0.76      0.55784       00:06:48   5
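Since the final-score column in Table 3 equals the product of the AUC and c@1 columns (e.g., 0.86068 × 0.86133 ≈ 0.74133), the ranking measure can be reproduced as sketched below. The AUC itself can be computed with, e.g., sklearn.metrics.roc_auc_score; the 'Y'/'N'/None answer encoding is our assumption for illustration, since the competition actually collected real-valued scores.

```python
# Sketch of the evaluation measures used in Table 3 (our reconstruction).
def c_at_1(answers, truth):
    """c@1 [8]: each unanswered problem (None) is credited with the
    observed accuracy instead of counting as wrong."""
    n = len(truth)
    n_correct = sum(1 for pid, label in truth.items()
                    if answers.get(pid) == label)
    n_unanswered = sum(1 for pid in truth if answers.get(pid) is None)
    return (n_correct + n_unanswered * n_correct / n) / n

def final_score(auc, c1):
    """Final ranking measure: the product of AUC and c@1."""
    return auc * c1
```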
5 Conclusions

We have presented an approach that uses an unsupervised method with lexical-syntactic features. Even though its runtime is greater than that of most approaches in this competition, its performance is good. It was surprising that, being a team of native Spanish speakers, we performed better in the Dutch and Greek languages; this is a good opportunity to analyze the texts in more depth in order to determine the reason for this outcome. As mentioned before, we executed the same methodology across the different languages, varying essentially only the thresholds applied in the final classification. However, more experiments continue to be performed to analyze whether or not these changes introduce significant variations across the data sets. Future work is planned to observe the performance of the proposed methodology using different similarity measures.

References

1. Agarwal, S., Graepel, T., Herbrich, R., Har-Peled, S., Roth, D.: Generalization bounds for the area under the ROC curve. Journal of Machine Learning Research 6, 393–425 (2005)
2. Garner, S.R.: WEKA: The Waikato environment for knowledge analysis. In: Proc. of the New Zealand Computer Science Research Students Conference, pp. 57–64 (1995)
3. Gollub, T., Potthast, M., Beyer, A., Busse, M., Pardo, F.M.R., Rosso, P., Stamatatos, E., Stein, B.: Recent trends in digital text forensics and its evaluation – plagiarism detection, author identification, and author profiling. In: Forner, P., Müller, H., Paredes, R., Rosso, P., Stein, B. (eds.) CLEF. Lecture Notes in Computer Science, vol. 8138, pp. 282–302. Springer (2013)
4. Guthrie, D., Allison, B., Liu, W., Guthrie, L., Wilks, Y.: A closer look at skip-gram modelling. In: Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC-2006). Genoa, Italy (2006)
5. Houvardas, J., Stamatatos, E.: N-gram feature selection for authorship identification. In: Euzenat, J., Domingue, J. (eds.) AIMSA. Lecture Notes in Computer Science, vol. 4183, pp. 77–86. Springer (2006)
6. Landauer, T.K., Foltz, P.W., Laham, D.: An introduction to latent semantic analysis. Discourse Processes 25(2), 259–284 (1998)
7. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA (2008)
8. Peñas, A., Rodrigo, A.: A simple measure to assess non-response. In: Lin, D., Matsumoto, Y., Mihalcea, R. (eds.) ACL. pp. 1415–1424. The Association for Computer Linguistics (2011)
9. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques, Second Edition. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (2005)