=Paper=
{{Paper
|id=Vol-1609/16090932
|storemode=property
|title=Multi Feature Space Combination for Authorship Clustering
|pdfUrl=https://ceur-ws.org/Vol-1609/16090932.pdf
|volume=Vol-1609
|authors=Muharram Mansoorizadeh,Mohammad Aminian,Taher Rahgooy,Mehdy Eskandari
|dblpUrl=https://dblp.org/rec/conf/clef/MansoorizadehAR16
}}
==Multi Feature Space Combination for Authorship Clustering==
Multi Feature Space Combination for Authorship Clustering: Notebook for PAN at CLEF 2016

Muharram Mansoorizadeh (mansoorm@basu.ac.ir), Mohammad Aminiyan (M.Aminiyan@Gmail.com), Taher Rahgooy, Mahdy Eskandari
Computer Engineering Department, Bu-Ali Sina University, Hamedan, Iran

Abstract. The Author Identification task at PAN 2016 consisted of three sub-tasks: authorship clustering, authorship links, and author diarization. We developed machine learning approaches for two of these three tasks. For the two authorship-related tasks we created several feature spaces; the challenge was to combine these feature spaces so that the learning algorithms can distinguish different authors across them. For authorship clustering we combine the feature spaces and apply a two-step clustering approach. We then use the clustering results, together with a new feature space, to determine links between documents in a given problem.

Keywords: authorship clustering, authorship link, tf-idf, feature space combination

1 Introduction

In the following we provide a detailed description of our approaches to the two subtasks of the Author Identification track of PAN 2016. A problem instance is a tuple (K, U, L), where K is a set of documents written by different authors, U is the genre of the documents, and L is an enumerated value specifying their language: English, Dutch, or Greek. All documents in a problem instance share the same language and the same genre.

This lab report is structured as follows. In Section 2 we propose a number of features that characterize documents from widely different points of view: characters, words, part-of-speech tags, sentence lengths, and punctuation. We group them into non-overlapping sets of homogeneous features. In Section 3 we present a two-step unsupervised method for the authorship clustering task, which combines a graph-based approach with the standard k-means++ algorithm and then employs a new feature space to determine links between documents. Finally, in Section 4 we report our results on the training corpus and the final evaluation corpus of PAN 2016.

2 Preprocessing

We extract a number of different features from each document. For ease of presentation, we group homogeneous features together, as described below, and give an illustrative feature-extraction sketch at the end of this section.

2.1 Features

Word ngrams (WG): We convert all characters to lowercase and transform the document into a sequence of words, treating white spaces, punctuation characters, and digits as word separators. We count all word ngrams with n ≤ 3 and obtain a feature for each distinct word ngram that occurs in the training documents of a given language [1]. Note that word unigrams and 2-grams are used in the clustering task and its preprocessing, whereas word 3-grams are used only in the link computation phase. To normalize these features we apply term frequency-inverse document frequency (tf-idf) weighting within each set of documents, i.e., within each problem [2].

POS (part-of-speech) tag ngrams (PG): We apply a part-of-speech (POS) tagger to each document, which assigns words with similar syntactic properties to the same POS tag. We count all POS ngrams with n ≤ 2 and obtain a feature for each distinct POS ngram that occurs in the training documents of a given language [2].

Sentence lengths (SL): We transform the document into a sequence of tokens, a token being a sequence of characters separated by one or more blank spaces. Next, we transform the sequence of tokens into a sequence of sentences, a sentence being a sequence of tokens separated by any of the following characters: ., ;, :, !, ?. We count the number of sentences whose length in tokens is n, with n ∈ {1, ..., 15}, obtaining one feature for each value of n [2].

Punctuation ngrams (MG): We transform the document by removing all characters not included in the set {,, ., ;, :, !, ?, "}; the resulting document thus consists of a (possibly empty) sequence of characters from that set. We then count all character ngrams of the resulting document with n ≥ 2 and obtain a feature for each distinct punctuation ngram that occurs in the training documents of a given language [2].

For preprocessing we use the Python NLTK 3.0 package [3]. After creating the feature space we set aside the word 3-grams for the authorship link task and use the remaining features for clustering. We assume that word 3-grams capture very specific relations between texts, which makes them better suited for measuring the similarity between documents inside each cluster.

2.2 Data normalization

After feature extraction, we normalize the value of each feature using min-max normalization, in order to remove the impact of the different scales of the feature spaces:

    X_{new} = \frac{X_{old} - \mathrm{Min}}{\mathrm{Max} - \mathrm{Min}}    (1)

where X_{old} is the original value of feature X, and Max and Min are the maximum and minimum values of feature X, respectively.
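As an illustration of Sections 2.1 and 2.2, the following is a minimal sketch of such a feature pipeline using scikit-learn and NLTK. It is our reconstruction rather than the authors' code: the function names, vectorizer parameters, and the choice to densify before scaling are assumptions, and only the WG and PG feature groups are shown.

```python
# Sketch of the WG/PG feature extraction and min-max normalization.
# Requires NLTK data: 'punkt' and 'averaged_perceptron_tagger'.
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.preprocessing import MinMaxScaler

def pos_tags(text):
    """Map a document to its sequence of POS tags (English tagger)."""
    return " ".join(t for _, t in nltk.pos_tag(nltk.word_tokenize(text)))

def extract_features(documents):
    # Word 1-/2-grams with tf-idf weighting, used for clustering (WG).
    wg = TfidfVectorizer(lowercase=True, ngram_range=(1, 2))
    X_wg = wg.fit_transform(documents)

    # Word 3-grams, set aside for the authorship-link phase.
    w3 = TfidfVectorizer(lowercase=True, ngram_range=(3, 3))
    X_w3 = w3.fit_transform(documents)

    # POS 1-/2-grams (PG), counted over the tag sequences.
    pg = CountVectorizer(ngram_range=(1, 2))
    X_pg = pg.fit_transform([pos_tags(d) for d in documents])

    # Min-max normalization per feature, as in Eq. (1).
    X_pg = MinMaxScaler().fit_transform(X_pg.toarray())
    return X_wg, X_w3, X_pg
```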
3 Two-step unsupervised method

To solve the task, we use a two-step method.

3.1 Step 1: determining the number of authors

Since the number of authors is unknown, we first have to determine it for each problem; that is, we have to determine the number of clusters for the clustering algorithm. This number would normally be set by the developer based on the specifics of the problem, and assigning a proper value is a challenging task. We use a document similarity graph (DSG), an undirected graph that represents similarity relations between documents based on their contents [4]. The nodes of the graph are the documents, and the edges between documents are defined by their similarities using (2):

    Z_{i,j} = \frac{X_i \cdot Y_j}{\lVert X_i \rVert \, \lVert Y_j \rVert}
            = \frac{\sum_{k=1}^{n} x_k y_k}{\sqrt{\sum_{k=1}^{n} x_k^2}\,\sqrt{\sum_{k=1}^{n} y_k^2}},
    \qquad
    S_{mat}(i,j) = \begin{cases} 1 & Z_{i,j} \ge \delta \\ 0 & Z_{i,j} < \delta \end{cases}    (2)

where x_k and y_k are the features of documents X_i and Y_j, respectively; Z_{i,j} is the cosine similarity between the two documents [5]; and δ is the threshold that defines whether a similarity edge exists between two documents. In this paper, δ is set to 0.5.

The number of clusters is then determined from the resulting DSG: we count the nodes whose degree exceeds 65 percent of the total number of documents. For example, if a problem contains 100 documents, we count the nodes with more than 65 incident edges.

3.2 Step 2: clustering and computing links

Having determined the number of clusters, we perform the clustering with the k-means++ implementation [6] of the scikit-learn Python package. Once clustering is complete, we apply a simple similarity computation within each cluster, ranking document pairs by the cosine similarity (2) of their word 3-gram features.
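The following sketch illustrates Step 1 under the same caveat as before: the threshold δ = 0.5 and the 65-percent degree rule come from the paper, while the function name and the use of scikit-learn's cosine_similarity are our assumptions.

```python
# Sketch of Step 1: estimating the number of authors via the DSG.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def estimate_num_clusters(X, delta=0.5, degree_ratio=0.65):
    """X: (n_docs, n_features) matrix of normalized clustering features."""
    Z = cosine_similarity(X)          # Z[i, j], as in Eq. (2)
    S = (Z >= delta).astype(int)      # thresholded adjacency S_mat
    np.fill_diagonal(S, 0)            # drop self-similarity edges
    degrees = S.sum(axis=1)           # edges incident to each node
    n_docs = X.shape[0]
    # Count the nodes connected to more than 65% of all documents.
    k = int(np.sum(degrees > degree_ratio * n_docs))
    return max(k, 1)                  # need at least one cluster
```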
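Step 2 can be sketched in the same hypothetical fashion: k-means++ clustering followed by ranking within-cluster document pairs by cosine similarity over the word 3-gram features. Parameters such as n_init and random_state are illustrative only, and X_links is assumed to be a sparse tf-idf matrix.

```python
# Sketch of Step 2: clustering, then ranking within-cluster links.
from itertools import combinations
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

def cluster_and_rank_links(X_cluster, X_links, n_clusters):
    """X_cluster: clustering features; X_links: word 3-gram tf-idf."""
    km = KMeans(n_clusters=n_clusters, init="k-means++", n_init=10,
                random_state=0)
    labels = km.fit_predict(X_cluster)

    # Rank candidate authorship links within each cluster by Eq. (2).
    links = []
    for c in set(labels):
        members = [i for i, lab in enumerate(labels) if lab == c]
        for i, j in combinations(members, 2):
            score = cosine_similarity(X_links[i:i + 1], X_links[j:j + 1])[0, 0]
            links.append((i, j, score))
    links.sort(key=lambda t: t[2], reverse=True)  # strongest links first
    return labels, links
```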
4 Results

We evaluate our approach on the training corpus and the final evaluation corpus of PAN 2016. These datasets consist of a set of problems; each problem contains a different number of documents in a specific language (English, Dutch, or Greek) and one of two genres (newspaper articles and reviews). The clustering output is evaluated with the BCubed F-score [7], and the ranking of authorship links is evaluated with Mean Average Precision (MAP) [8]. The software was executed on the TIRA platform [9].

Table 1 shows the results on the training dataset. Our method achieves high BCubed recall, so documents by the same author are mostly grouped together; however, the low BCubed precision indicates that the number of clusters, or the way we measure similarity, is not tuned well.

Table 1. Results on the training dataset

Problem     Language  Genre     BCubed F  BCubed R  BCubed P  Avg. Precision
problem01   English   Articles  0.71947   0.71333   0.72571   0.00083612
problem02   English   Articles  0.58281   0.50444   0.69      0.00022599
problem03   English   Articles  0.58665   0.87333   0.44167   0.00052301
problem04   English   Reviews   0.76012   0.69583   0.8375    0.0015432
problem05   English   Reviews   0.2648    0.97083   0.15331   0.0028001
problem06   English   Reviews   0.24887   0.89667   0.14449   0.017832
problem07   Dutch     Articles  0.45478   0.96491   0.2975    0.0084819
problem08   Dutch     Articles  0.68125   0.59223   0.80175   0.030246
problem09   Dutch     Articles  0.42888   0.8538    0.28636   0.016233
problem10   Dutch     Reviews   0.36209   0.61333   0.25687   0.0063669
problem11   Dutch     Reviews   0.33539   0.71167   0.21939   0.001594
problem12   Dutch     Reviews   0.33242   0.92      0.20286   0.0008547
problem13   Greek     Articles  0.1365    0.89697   0.073869  0.023333
problem14   Greek     Articles  0.53793   0.77939   0.4107    0.0095962
problem15   Greek     Articles  0.56034   0.93939   0.39924   0.007211
problem16   Greek     Reviews   0.56877   1         0.3974    0.013893
problem17   Greek     Reviews   0.5674    0.74697   0.45743   0.045313
problem18   Greek     Reviews   0.57607   0.90303   0.42294   0.028917

Like Table 1, Table 2 (the results on the test dataset) also shows high BCubed recall on most problem sets, in contrast with a BCubed precision that is not high. The results on the test dataset are better than on the training data, which shows the ability of the system to generalize to new problems; however, the main defect of the system, a BCubed precision lower than its recall, still remains. The complete evaluation can be found in the overview paper [10].

Table 2. Results of test dataset

Problem     Language  Genre     BCubed F  BCubed R  BCubed P  Avg. Precision
problem01   English   Articles  0.4492    0.77619   0.31605   0.0045282
problem02   English   Articles  0.51302   0.6165    0.43929   0.018634
problem03   English   Articles  0.45086   0.9381    0.29674   0.0022228
problem04   English   Reviews   0.46821   0.80833   0.32955   0.0033492
problem05   English   Reviews   0.54696   0.92083   0.38902   0.00076857
problem06   English   Reviews   0.4896    0.64458   0.3947    0.0046202
problem07   Dutch     Articles  0.06261   1         0.032318  0.017739
problem08   Dutch     Articles  0.33159   0.95906   0.2004    0.0053791
problem09   Dutch     Articles  0.15954   0.94987   0.087     0.032996
problem10   Dutch     Reviews   0.33115   0.89667   0.20308   0.00097662
problem11   Dutch     Reviews   0.31324   0.56167   0.21718   0.0022789
problem12   Dutch     Reviews   0.3371    0.73167   0.219     0.00074413
problem13   Greek     Articles  0.43173   0.76429   0.3008    0.02234
problem14   Greek     Articles  0.46847   0.7119    0.34909   0.015947
problem15   Greek     Articles  0.43579   0.9       0.2875    0.00087489
problem16   Greek     Reviews   0.48623   0.83571   0.34286   0.007296
problem17   Greek     Reviews   0.46259   0.98095   0.30266   0.003199
problem18   Greek     Reviews   0.47588   0.79524   0.33953   0.0095474
5 Conclusion and future works

In this research we proposed a two-step unsupervised method for author clustering. In our approach we combine different feature spaces and use them to cluster documents by author; we then rank documents by their cosine similarity using a set of features different from the one used for clustering. The results show that our approach achieves good BCubed recall, but its main weakness is BCubed precision. The problem may stem from the selection of the number of clusters or from the feature space. As future work, we suggest investigating better cluster-parameter selection, as well as testing the task with more elaborate clustering methods that do not require the number of clusters as a parameter, such as self-organizing maps (SOM).

References

1. Tuarob, S., Pouchard, L. C., Mitra, P., Giles, C. L.: A generalized topic modeling approach for automatic document annotation. International Journal on Digital Libraries 16(2), 111-128 (2015)
2. Bartoli, A., Dagri, A., De Lorenzo, A., Medvet, E., Tarlao, F.: An Author Verification Approach Based on Differential Features—Notebook for PAN at CLEF 2015. In: Cappellato, L., Ferro, N., Jones, G., San Juan, E. (eds.) CLEF 2015 Evaluation Labs and Workshop – Working Notes Papers, 8-11 September, Toulouse, France. CEUR-WS.org (2015). ISSN 1613-0073
3. Bird, S., Klein, E., Loper, E.: Natural Language Processing with Python. O'Reilly Media, Inc. (2009)
4. Seah, B., Bhowmick, S. S., Sun, A.: PRISM: Concept-preserving Social Image Search Results Summarization. In: Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval (SIGIR '14), pp. 737-746 (2014)
5. Deza, M. M., Deza, E.: Encyclopedia of Distances. Springer, Berlin Heidelberg (2009)
6. Arthur, D., Vassilvitskii, S.: k-means++: The Advantages of Careful Seeding. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1027-1035 (2007)
7. Amigó, E., Gonzalo, J., Artiles, J., Verdejo, F.: A comparison of extrinsic clustering evaluation metrics based on formal constraints. Information Retrieval 12(4), 461-486 (2009)
8. Manning, C. D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, Cambridge (2008)
9. Potthast, M., Gollub, T., Rangel, F., Rosso, P., Stamatatos, E., Stein, B.: Improving the Reproducibility of PAN's Shared Tasks: Plagiarism Detection, Author Identification, and Author Profiling. In: Kanoulas, E., Lupu, M., Clough, P., Sanderson, M., Hall, M., Hanbury, A., Toms, E. (eds.) Information Access Evaluation meets Multilinguality, Multimodality, and Visualization. 5th International Conference of the CLEF Initiative (CLEF 14), pp. 268-299. Springer, Berlin Heidelberg New York (Sep 2014)
10. Stamatatos, E., Tschuggnall, M., Verhoeven, B., Daelemans, W., Specht, G., Stein, B., Potthast, M.: Clustering by Authorship Within and Across Documents. In: Working Notes Papers of the CLEF 2016 Evaluation Labs. CEUR Workshop Proceedings, CLEF and CEUR-WS.org (Sep 2016)