
Cross-domain Authorship Attribution: Author Identification using Char Sequences, Word Uni-grams, and POS-tags Features

Yaakov HaCohen-Kerner (kerner@jct.ac.il)

Daniel Miller

Yair Yigal (yigalyairn@gmail.com)

Elyashiv Shayovitz (elyashiv12@gmail.com)

Department of Computer Science, Jerusalem College of Technology, Lev Academic Center, 21 Havaad Haleumi St., P.O.B. 16031, 9116001 Jerusalem, Israel

2018

Authorship attribution deals with identifying the author of an anonymous text, i.e., attributing each test text of unknown authorship to one of a set of known authors whose training texts are given. In this paper, we describe the participation of our teams (miller18 and yigal18; both teams contain the same people, but in a different order) in the PAN 2018 shared task on cross-domain author identification. Given a set of documents authored by known authors, the task is to identify the authors of documents from another set of documents. All documents are in the same language, which may be one of the following five languages: English, French, Italian, Polish, or Spanish. We describe our pre-processing, feature sets, the applied machine learning methods, and the average F1 scores of three submitted models. For the evaluation corpus, we sent the top three models according to their results on the development corpus using PCA and Linear SVC. The first model scored an average of 0.582. Its features consist of the frequencies of all char 6-gram sequences, the frequencies of POS-tag sequences, orthographic features, quantitative features, and lexical richness features. The second model scored an average of 0.598. Its features consist of all char sequences of length 3 to 8, all word uni-grams, POS-tag features, and all stylistic features from the first model. The third model scored an average of 0.611. Its features consist of the content-based features of the second model and POS-tag features.

Keywords: Authorship Attribution, Author Identification, Content-based Features, Style-based Features, Supervised Machine Learning, Text Classification

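1 Introduction

Authorship attribution (AA) deals with identifying the author of an anonymous text, i.e., attributing each test text of unknown authorship to one of a set of known authors whose training texts are given. This task is usually performed as a text classification (TC) process using supervised machine learning (ML) method(s). “The main idea behind statistically or computationally supported authorship attribution is that by measuring some textual features, we can distinguish between texts written by different authors” [35].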

In this paper, we describe the participation of our teams (miller18 and yigal18; both teams contain the same people, but in a different order) in the PAN 2018 shared task on cross-domain authorship attribution. More specifically, the shared task was set up as follows. Given a set of documents (known fanfics) by a small number (up to 20) of candidate authors, identify the authors of another set of documents (unknown fanfics). Each candidate author has contributed at least one of the unknown fanfics, which all belong to the same target fandom. The known fanfics belong to several fandoms (excluding the target fandom), although not necessarily the same ones for all candidate authors. An equal number of known fanfics per candidate author is provided. In contrast, the unknown fanfics are not equally distributed over the authors. The text length of the fanfics varies from 500 to 1,000 tokens. All documents are in the same language, which may be English, French, Italian, Polish, or Spanish.

The rest of the paper is organized as follows. Section 2 provides background and presents some related work on text classification in general and authorship attribution in particular. Section 3 introduces the feature sets that we have implemented and used in our extensive experiments. Section 4 presents the experimental setup, the results and their analysis. Finally, Section 5 summarizes and suggests ideas for future research.

2 Related Work

2.1 Text classification

TC is the supervised learning task of assigning natural language text documents to one or more predefined categories [26]. There are two main types of TC: topic-based classification and style-based classification. An example of a topic-based classification application is classifying news articles downloaded from three well-known news websites (BBC, Reuters, and TheGuardian) into categories such as Business-Finance, Lifestyle-Leisure, Science-Technology, and Sports [24]. An example of a style-based classification application is classification by literary genre, e.g., action, comedy, crime, fantasy, historical, political, saga, and science fiction [21, 9].

These two classification types often require different types of feature sets to achieve the best performance. Topic-based classification is typically performed using word uni-grams and/or n-grams (n > 2) [1, 12]. Style-based classification is typically performed using linguistic features such as quantitative features, orthographic features, part of speech (POS) tags, function words, and vocabulary richness features [21, 13, 14, 9].

2.2 Authorship attribution

AA can be viewed as a sub-task of TC. Early in its history, AA had only limited application, mainly to literary works of unknown or disputed authorship [27]. However, during the last two decades, AA has grown and developed impressively due to its many applications in a wide variety of domains, e.g., criminal law, computer forensics, humanities research, and military intelligence [35].

Current research in AA focuses mainly on the extraction of the best features (both style-based and content-based) for identification of document authors and suitable ML methods.

3 Feature Sets

In this section, we present the content-based and style-based features that we applied for the authorship attribution task. Each combination of those features is defined by the following template: “number k-skip-n-type [style]”. All the features were scaled using Scikit-Learn’s MaxAbsScaler¹.

3.1 Content-based features

Our content-based features include various n-gram feature sets according to the pattern “number k-skip-n-type”, where ‘number’ is the number of wanted n-grams in the set (the size of our vocabulary, e.g., 1000 or 5000, or ‘all’, which means that we used all the tokens in the dataset), ‘k’ is the size of the wanted skip (0 for no skip, 1 for a skip of one unit, 2 for a skip of two units, …), ‘n’ is the code of the gram type (1 for uni-grams, 2 for bi-grams, 3 for tri-grams, …), and ‘type’ is ‘word’ for words or ‘char’ for characters. All values are represented by TF-IDF values. The specific n-gram feature sets that were applied are presented later, together with the experiments.
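The template above can be illustrated with a minimal sketch, assuming Scikit-Learn's TfidfVectorizer as the TF-IDF implementation (the paper does not name the vectorizer it used). The function name `build_ngram_features` and the toy texts are hypothetical, and the sketch covers only the no-skip case (k = 0), since skip-grams are not built into TfidfVectorizer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MaxAbsScaler


def build_ngram_features(train_texts, test_texts, number="all", n=6, unit="char"):
    """TF-IDF values for a 'number 0-skip-n-unit' feature set (hypothetical helper)."""
    vectorizer = TfidfVectorizer(
        analyzer=unit,                 # 'char' or 'word'
        ngram_range=(n, n),            # only n-grams of exactly length n
        max_features=None if number == "all" else number,  # vocabulary size
        lowercase=True,                # lowercasing is the only preprocessing kept
    )
    scaler = MaxAbsScaler()            # divides each feature by its max absolute value
    x_train = scaler.fit_transform(vectorizer.fit_transform(train_texts))
    x_test = scaler.transform(vectorizer.transform(test_texts))
    return x_train, x_test, vectorizer


# Toy data, for illustration only.
train = ["the cat sat on the mat", "a dog barked at the cat"]
test = ["the dog sat on the mat"]
x_train, x_test, vec = build_ngram_features(train, test, number=1000, n=3, unit="char")
print(x_train.shape, x_test.shape)
```

Fitting the vectorizer and the scaler on the training texts only, and then reusing them on the test texts, keeps the test documents from influencing the vocabulary or the scaling.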

3.2 Style-based features

Our style-based features include the following feature sets: POS-tags, Quantitative features, Orthographic features, and Lexical-Richness features. The ‘[style]’ pattern in the template mentioned above is a list of style features separated by ‘_’, where ‘pos’ stands for POS-tags, ‘quan’ for quantitative features, ‘orth’ for orthographic features, and ‘rich’ for lexical richness features.

The POS-tags features are the frequencies of POS-tag sequences of length 1 to 10 (we tried various sequence lengths and obtained the best results with a maximum length of 10; perhaps some authors tend to use specific POS sequences of this length, possibly spanning more than one sentence). The POS-tags features for English and Spanish were created using the Stanford POS-Tagger² and for French, Italian, and Polish using the RDRPOSTagger³. While most of the POS-taggers we used have fine-grained tags, the Polish POS-tagger has only UniPOS-tags.

The Quantitative set includes 12 features: the average and median number of characters of a word (with and without punctuation) or a token, the average and median number of words or tokens in a sentence, and the average and median number of characters in a sentence.

The Orthographic set includes 32 features. Each one represents the frequency of a specific character (normalized by the number of characters in the tested document) from the following list of characters: !"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~.

The Lexical-Richness set includes two features: the percentage of unique words in the text, and the percentage of hapaxes in the text.

¹ http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html
² https://nlp.stanford.edu/software/tagger.shtml
³ http://rdrpostagger.sourceforge.net/
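Three of the style-based feature sets described in this section can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, both lexical richness ratios are assumed to be normalized by the total number of words, the POS-sequence counts are assumed to be normalized by the total number of sequences, and the tag list stands in for the output of a real POS-tagger.

```python
import string
from collections import Counter


def orthographic_features(text):
    """Frequency of each of the 32 punctuation characters, normalized by text length."""
    counts = Counter(text)
    total = max(len(text), 1)          # avoid division by zero on empty input
    return [counts[ch] / total for ch in string.punctuation]


def lexical_richness_features(words):
    """Share of unique words and share of hapaxes (words occurring exactly once)."""
    counts = Counter(words)
    total = max(len(words), 1)
    unique_ratio = len(counts) / total
    hapax_ratio = sum(1 for c in counts.values() if c == 1) / total
    return [unique_ratio, hapax_ratio]


def pos_sequence_frequencies(tags, max_len=10):
    """Relative frequency of every POS-tag sequence of length 1..max_len."""
    counts = Counter(
        tuple(tags[i:i + n])
        for n in range(1, max_len + 1)
        for i in range(len(tags) - n + 1)
    )
    total = max(sum(counts.values()), 1)
    return {seq: c / total for seq, c in counts.items()}


# Toy inputs, for illustration only.
text = "wait, what? no, no!"
words = ["no", "no", "wait", "what", "no"]
tags = ["VB", ",", "WP", ".", "DT", ",", "DT", "."]   # hypothetical tagger output

orth = orthographic_features(text)
rich = lexical_richness_features(words)
pos = pos_sequence_frequencies(tags, max_len=3)
print(len(orth), rich, len(pos))
```

Python's `string.punctuation` contains exactly the 32 characters listed above, which is why it is used here as the orthographic character list.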

The tokenization was performed by NLTK's word tokenizer⁴. Sentence separation was performed by NLTK's sentence tokenizer⁵. Word separation was performed using the single space character.

4 Experimental Setup and Results

PAN at CLEF 2018 [36] launched an evaluation campaign in which ten teams participated. Each team proposed its own algorithm, which was evaluated using the TIRA platform [29]. The algorithms and the results of the participating teams are overviewed in [22].

General approach: Our approach to authorship attribution is to apply supervised ML methods to TC, as suggested by Sebastiani [32]. The process is as follows. First, given a corpus of training documents, where each document is labeled by its author, we process each document to produce values for various combinations of features from different types of feature sets: content-based features and style-based features. Second, we apply several popular ML methods to the generated combinations of features. Third, we try additional combinations of features and parameter tuning. Finally, the best model(s) are tested on out-of-training data (i.e., evaluation data).

Preprocessing: There is a wide variety of text preprocessing types, such as: conversion of uppercase letters into lowercase letters, HTML object removal, stopword removal, punctuation mark removal, reduction of different sets of emoticon labels to a reduced set of wildcard characters, replacement of HTTP links with wildcard characters, word stemming, word lemmatization, correction of commonly misspelled words, and reduction of replicated characters. Not all the preprocessing types are considered effective by all TC researchers. Many systems use only a small number of simple preprocessing types (e.g., conversion of all the uppercase letters into lowercase letters and/or stopword removal).

In our classification experiments, we found that most of the preprocessing types are irrelevant, because the data we have is already processed and is mostly free of errors and mistakes. We mainly used characters for the content-based features, so stopword removal did not improve the TC results. We also tried stemming, but the stems hurt the results. Therefore, the only preprocessing type we used was converting uppercase letters into lowercase letters.

⁴ http://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.punkt.PunktLanguageVars.word_tokenize
⁵ http://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.sent_tokenize

ML methods: We applied five popular ML methods: MLP - Multilayer Perceptron⁶, SVC⁷, LinearSVC⁸, LR - Logistic Regression⁹, and RF - Random Forest¹⁰. A brief description of these ML methods follows. MLP is a feedforward artificial neural network (ANN) method [17]; an ANN can be viewed as a weighted directed graph in which nodes are artificial neurons and directed edges (with weights) are connections from the outputs of neurons to the inputs of neurons. A support vector machine (SVM, also called a support vector network) [6] is a model that assigns examples to one of two categories, making it a non-probabilistic binary linear classifier. SVC is a type of SVM with an RBF kernel implemented using LibSVM [5]. LinearSVC is a type of SVM with a linear kernel implemented using LibLinear [7]; it is recommended for TC because most TC problems are linearly separable [18] and training an SVM with a linear kernel is faster than with other kernels. Logistic regression (LR) is a variant of a statistical model that tries to predict the outcome of a categorical dependent variable (i.e., a class label) [4, 16]. Random Forest (RF) is an ensemble learning method for classification and regression [3]. RF operates by constructing a multitude of decision trees at training time and outputting the classification for the case at hand. RF combines the “bagging” idea presented by Breiman [2] and the random selection of features introduced by Ho [15] to construct a forest of decision trees.
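The five ML methods can be compared with a short sketch such as the one below, which uses the Scikit-Learn classes named in the paper with their default parameters. The toy texts, the author labels, the char 3-gram features, and the `random_state` values are illustrative assumptions and are not taken from the paper's experiments.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC, LinearSVC

classifiers = {
    "MLP": MLPClassifier(random_state=0),
    "SVC": SVC(),                                  # RBF kernel by default
    "LinearSVC": LinearSVC(),
    "LR": LogisticRegression(),
    "RF": RandomForestClassifier(random_state=0),
}

# Toy corpus with two hypothetical authors, for illustration only.
train_texts = [
    "oh no!!! run now!!!", "stop!!! do not go!!!", "go!!! hurry up!!!",
    "well... maybe later...", "hmm... i am not sure...", "perhaps... we shall see...",
]
train_authors = ["A", "A", "A", "B", "B", "B"]
test_texts = ["wait!!! come back!!!", "so... what now..."]
test_authors = ["A", "B"]

vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(3, 3))
x_train = vectorizer.fit_transform(train_texts)
x_test = vectorizer.transform(test_texts)

scores = {}
for name, clf in classifiers.items():
    clf.fit(x_train, train_authors)                # no parameter tuning
    predictions = clf.predict(x_test)
    scores[name] = f1_score(test_authors, predictions, average="macro")
print(scores)
```

On such a small corpus the MLP may report that it has not converged; that warning does not affect the structure of the comparison.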

Tools and information sources: We used the following three tools and information sources:
1. Scikit-Learn¹¹ - a Python library for ML methods and algorithms [28]
2. NLTK¹² - a suite of libraries and programs for symbolic and statistical natural language processing [25]
3. NumPy¹³ - a library that performs fast mathematical processing on arrays and matrices [37].

Development corpus: The development corpora cover five languages (English, French, Italian, Polish, and Spanish). For each language, we have two corpora, each of which has both train and test sub-corpora.

4.1 Experimental results using content-based feature sets

The first TC experiments were performed on the development corpus using five ML methods without any parameter tuning. The average F1 scores for all the examined feature sets on the development corpus are shown in Table 1. The best result for each tested ML method (a column) is shown in bold.

⁶ http://scikit-learn.org/stable/modules/neural_networks_supervised.html
⁷ The default SVC is with RBF kernel. http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
⁸ http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html
⁹ http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
¹⁰ http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
¹¹ http://scikit-learn.org/stable/index.html
¹² https://www.nltk.org/
¹³ http://www.numpy.org/

Table 1. Average F1 results using sets of word multi-grams.

[Table 1 rows (Features): 1000, 2000, 3000, 4000, 5000, 7500, and 10000 word uni-grams, word bi-grams, and word 3-grams.]

The results shown in Table 1 are quite low, and the vocabulary size does not seem to have any noticeable effect on them. We also noticed that the best results were achieved by the word uni-gram feature sets, which is not a surprise given many previous TC studies. It is also interesting to note that although the LinearSVC and SVC methods are both based on the SVM method, the differences in the kernel and in the implementation have a big effect on their scores.
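For readers who want to reproduce the scoring, the sketch below shows one way to compute an average F1 score. It assumes that the measure is the macro-averaged F1 over authors within each attribution problem, averaged over problems; the function name `macro_f1` and the toy labels are hypothetical.

```python
def macro_f1(true_labels, predicted_labels):
    """Unweighted mean of the per-author F1 scores for one attribution problem."""
    authors = sorted(set(true_labels) | set(predicted_labels))
    pairs = list(zip(true_labels, predicted_labels))
    per_author = []
    for author in authors:
        tp = sum(1 for t, p in pairs if t == author and p == author)
        fp = sum(1 for t, p in pairs if t != author and p == author)
        fn = sum(1 for t, p in pairs if t == author and p != author)
        denominator = 2 * tp + fp + fn
        per_author.append(2 * tp / denominator if denominator else 0.0)
    return sum(per_author) / len(per_author)


# Two toy attribution problems, for illustration only.
problem_scores = [
    macro_f1(["A", "A", "B", "C"], ["A", "B", "B", "C"]),   # F1: A = 2/3, B = 2/3, C = 1
    macro_f1(["X", "Y"], ["X", "Y"]),                       # perfect attribution
]
average_f1 = sum(problem_scores) / len(problem_scores)
print(problem_scores, average_f1)
```

Because every author contributes equally to the macro average, a model that ignores authors with few unknown texts is penalized more than it would be under plain accuracy.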

The next experiment we conducted was similar, but with characters instead of words. The average F1 results are shown in Table 2. The best result for each tested ML method (a column) is shown in bold.

Table 2. Average F1 results using sets of char 1-4-grams.

We also tried many combinations that include various skip character/word n-gram sets. However, their results were quite low.

Next, we tried to reduce overfitting by removing hapaxes (i.e., words occurring only once in the text) or by removing any words that appear fewer than four times in the corpus. The results of these experiments were, again, quite low.
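The rare-word filtering described above can be sketched as follows. The helper `drop_rare_words` is hypothetical, and it counts word occurrences over the whole corpus for both thresholds, which is an assumption about how the counts were taken.

```python
from collections import Counter


def drop_rare_words(tokenized_docs, min_count=4):
    """Remove every word whose total corpus frequency is below `min_count`."""
    corpus_counts = Counter(word for doc in tokenized_docs for word in doc)
    return [
        [word for word in doc if corpus_counts[word] >= min_count]
        for doc in tokenized_docs
    ]


# Toy corpus, for illustration only; words are separated by single spaces.
docs = [
    "the ship left the harbor".split(" "),
    "the captain left the ship".split(" "),
]
no_hapaxes = drop_rare_words(docs, min_count=2)      # drops words seen only once
frequent_only = drop_rare_words(docs, min_count=4)   # keeps words seen four or more times
print(no_hapaxes)
print(frequent_only)
```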

Since the best results we have seen so far were obtained by using char 3-gram and 4-gram features using Linear SVC and Logistic Regression, we decided to try n-gram features with higher numbers of characters in the next experiments. In addition, we noticed that the results were significantly higher when we used a big vocabulary size (i.e., more features), therefore we also tried to use all the char-grams in the corpus. The results of this experiment are shown in Table 3. The best result for each tested ML method (a column) is shown in bold.

Table 3. Average F1 results using sets of char 5- to 8-grams.

4.2 Experimental results using style-based feature sets

We also tried to add various style-based feature sets (POS-tags, Quantitative features, Orthographic features, and Lexical richness features) to our feature combinations. However, most of these feature sets did not improve the average F1 score, except for POS-tags. We used the frequencies of POS-tag sequences as features in addition to the char 6-gram features, which had produced the best result so far. By using POS-tags we increased our average F1 score, but after a closer look we noticed that the results for the datasets in Spanish and Polish got worse. We suspect that this was caused by inaccurate tagging of the text (the Polish POS-tagger was just a UniPOS-tagger and not a fine-grained one). Therefore, we ran our tests again, using the POS tagger only for the datasets in English, French, and Italian. The results are shown in Table 4.

We also applied the Principal Component Analysis (PCA) [20] algorithm using Scikit-Learn’s PCA module¹⁴. Using PCA, we filtered out the least significant features. The results improved, but again, on a closer look we noticed that the results got worse in Spanish and Polish, perhaps due to the lack of the POS-tags features. Therefore, we decided not to use PCA for those languages. The results are shown in Table 5.

¹⁴ http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
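A PCA step of this kind could be wired in front of the classifier roughly as follows. This is a minimal sketch, assuming a Scikit-Learn pipeline with char 6-gram TF-IDF features; the toy texts and the value `n_components=3` are illustrative and are not the paper's settings. The feature matrix is converted to a dense array before PCA, which works across Scikit-Learn versions.

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, MaxAbsScaler
from sklearn.svm import LinearSVC

pipeline = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(6, 6)),   # all char 6-grams
    MaxAbsScaler(),
    FunctionTransformer(lambda x: x.toarray(), accept_sparse=True),  # densify for PCA
    PCA(n_components=3),               # illustrative value, not the paper's setting
    LinearSVC(),
)

# Toy corpus with two hypothetical authors, for illustration only.
train_texts = [
    "oh no!!! run now!!!", "stop!!! do not go!!!", "go!!! hurry up!!!",
    "well... maybe later...", "hmm... i am not sure...", "perhaps... we shall see...",
]
train_authors = ["A", "A", "A", "B", "B", "B"]

pipeline.fit(train_texts, train_authors)
predictions = pipeline.predict(["wait!!! come back!!!", "so... what now..."])
print(predictions)
```

Note that PCA projects the features onto a smaller number of components rather than selecting a subset of the original features, so the retained dimensions are combinations of the original n-gram and style features.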

Table 4. Average F1 results for style features combined with 6-chars.

[Table 4 rows (Features): all_6-char_linear_svc, all_6-char_orth_linear_svc, all_6-char_orth_quan_linear_svc, all_6-char_orth_quan_rich_linear_svc, all_6-char_orth_rich_linear_svc, all_6-char_pos_linear_svc, all_6-char_pos_orth_linear_svc, all_6-char_pos_orth_quan_linear_svc, all_6-char_pos_orth_quan_rich_linear_svc, all_6-char_pos_orth_rich_linear_svc, all_6-char_pos_quan_linear_svc, all_6-char_pos_quan_rich_linear_svc, all_6-char_pos_rich_linear_svc, all_6-char_quan_linear_svc, all_6-char_quan_rich_linear_svc, all_6-char_rich_linear_svc.]

Table 5. Average F1 results for 6-chars with style-features and PCA applied.

[Table 5 rows (Features): all_6-char_pca_linear_svc, all_6-char_orth_pca_linear_svc, all_6-char_orth_quan_pca_linear_svc, all_6-char_orth_quan_rich_pca_linear_svc, all_6-char_orth_rich_pca_linear_svc, all_6-char_pos_orth_pca_linear_svc, all_6-char_pos_orth_quan_pca_linear_svc, all_6-char_pos_orth_quan_rich_pca_linear_svc, all_6-char_pos_orth_rich_pca_linear_svc, all_6-char_pos_pca_linear_svc, all_6-char_pos_quan_pca_linear_svc, all_6-char_pos_quan_rich_pca_linear_svc, all_6-char_pos_rich_pca_linear_svc, all_6-char_quan_pca_linear_svc, all_6-char_quan_rich_pca_linear_svc, all_6-char_rich_pca_linear_svc.]

4.3 Experimental results for the final three models

For the final competition we sent three models, two of which were sent by miller18 and another one by yigal18.

The first model, sent by miller18, was based on the results we achieved on the development corpus. Its features consist of the frequencies of all char 6-gram sequences, the frequencies of POS-tag sequences, Orthographic features, Quantitative features, and Lexical richness features. Then, we applied PCA and Linear SVC, which was the best ML method on the development corpus. We tried several parameter combinations for the classifier using GridSearchCV¹⁵, including ‘C’ (penalty parameter of the error term) values from 0.001 to 10 and ‘tol’ (tolerance for the stopping criteria) values from 1e-7 to 1, but the results did not improve, so we sent the model with the default parameters. This model scored an average of 0.582 on the evaluation corpus.
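A parameter search of this kind might look like the sketch below. The ranges for ‘C’ and ‘tol’ follow the text, but the specific grid points, the macro-F1 scoring, the two-fold cross-validation, the char 3-gram features, and the toy data are all assumptions made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char", ngram_range=(3, 3))),
    ("clf", LinearSVC()),
])

# Grid points are assumed; the paper only states the ranges.
param_grid = {
    "clf__C": [0.001, 0.01, 0.1, 1, 10],        # penalty parameter of the error term
    "clf__tol": [1e-7, 1e-5, 1e-3, 1e-1, 1],    # tolerance for the stopping criteria
}

# Toy corpus with two hypothetical authors, for illustration only.
texts = [
    "oh no!!! run now!!!", "stop!!! do not go!!!", "go!!! hurry up!!!",
    "well... maybe later...", "hmm... i am not sure...", "perhaps... we shall see...",
]
authors = ["A", "A", "A", "B", "B", "B"]

search = GridSearchCV(pipeline, param_grid, scoring="f1_macro", cv=2)
search.fit(texts, authors)
print(search.best_params_, search.best_score_)
```

If the best cross-validated score is no better than the score of the default parameters, keeping the defaults, as was done for the submitted model, is the simpler choice.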

For the second model, sent by yigal18, we used all the char sequences of length 3 to 8. We also added all the uni-gram frequencies to the feature sets, in addition to the stylistic features from the first model. Again, we applied PCA and Linear SVC as our classifier with the default parameters. This model scored an average of 0.598 on the evaluation corpus.

For the third model, which is the second model sent by miller18, we used the same content features as the second model, but this time we did not use any style-based features at all except for POS-tags. The PCA and the classifier were the same as in the other models. This model scored an average of 0.611 on the evaluation corpus. It is important to note that the organizers of this PAN task did not accept this model as a formal submission because the results of the competition had already been published. However, they allowed us to write about this model and its results in our notebook paper. In Table 6, we present a comparison of these three models, including the average F1 results for the development and evaluation corpora.

The most striking and surprising finding in Table 6 is that there is no direct connection between the results of the three models on the development corpus and their results on the evaluation corpus. Furthermore, the results on the evaluation corpus are significantly lower, which implies that the evaluation corpus is quite different in its nature from the development corpus. Another possible explanation is that the three models have overfitted and are not optimal for general corpora of this type.

5 Summary and Future Work

In this paper, we describe our participation in the PAN 2018 shared task on author identification. We tried various pre-processing types, a widespread variety of feature sets, and five ML methods.

For the evaluation corpus, we sent the top three models according to their results on the development corpus. The first model was sent by miller18. Its features consist of the frequencies of all char 6-gram sequences, the frequencies of POS-tag sequences, Orthographic features, Quantitative features, and Lexical richness features. Then we applied PCA. The best ML method on the development corpus was the Linear SVC. This model scored an average of 0.582 on the evaluation corpus.

¹⁵ http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

The second model was sent by yigal18. Its features consist of all the char sequences of length 3 to 8. We also added all the uni-gram frequencies to the feature sets, in addition to the stylistic features from the first model. Again, we applied PCA and Linear SVC as our classifier with the default parameters. This model scored an average of 0.598 on the evaluation corpus (an improvement of 0.016 compared to the first model).

The third model, which is the second model sent by miller18, was composed of the same content-based features as the second model, but this time we did not use any style-based features at all except for POS-tags. The PCA and the classifier were the same as in the other models. This model scored an average of 0.611 (an improvement of 0.029 compared to the first model).

The main findings of our experiments are: (1) there is no direct connection between the results of the three models on the development corpus and their results on the evaluation corpus, and (2) the results on the evaluation corpus are significantly lower, which implies that the evaluation corpus is quite different in its nature from the development corpus. A possible explanation is that the three models have overfitted and are not optimal for general corpora of this type.

Future research proposals include: (1) applying additional combinations of feature sets in order to find a sufficiently general model; (2) tuning the models’ parameters; (3) applying various deep neural models; (4) defining and applying skip-POS sequences; and (5) building a model that will perform authorship attribution using key phrases [10, 11] that distinguish each of the text authors.

Acknowledgments. This work was partially funded by the Jerusalem College of Technology (Lev Academic Center) and we gratefully acknowledge its support.

References

1. Argamon, S., Whitelaw, C., Chase, P., Hota, S. R., Garg, N., Levitan, S.: Stylistic text classification using functional lexical features: Research articles. Journal of the Amer. Society for Info. Science and Technology, 58(6), 802-822 (2007).
2. Breiman, L.: Bagging predictors. Univ. California Technical Report No. 421.
3. Breiman, L.: Random forests. Machine learning, 45(1), 5-32 (2001).
4. Le Cessie, S., Van Houwelingen, J. C.: Ridge estimators in logistic regression. Applied statistics, pp. 191-201 (1992).
5. Chang, C. C., Lin, C. J.: LIBSVM: a library for support vector machines. ACM transactions on intelligent systems and technology (TIST), 2(3), 27 (2011).
6. Cortes, C., Vapnik, V.: Support-vector networks. Machine learning, 20(3), 273-297 (1995).
7. Fan, R. E., Chang, K. W., Hsieh, C. J., Wang, X. R., Lin, C. J.: LIBLINEAR: A library for large linear classification. Journal of machine learning research, 9(Aug), 1871-1874 (2008).
8. Fox, C.: A stop list for general text. In ACM SIGIR Forum (Vol. 24, No. 1-2, pp. 19-21). ACM (1989).
9. Gianfortoni, P., Adamson, D., Rosé, C. P.: Modeling of stylistic variation in social media with stretchy patterns. In Proceedings of the First Workshop on Algorithms and Resources for Modelling of Dialects and Language Varieties (pp. 49-59). Association for Computational Linguistics (2011).
10. HaCohen-Kerner, Y., Gross, Z., Masa, A.: Automatic extraction and learning of keyphrases from scientific articles. In Int. Conf. on Intelligent Text Processing and Computational Linguistics, Springer, Berlin, Heidelberg, pp. 657-669 (2005).
11. HaCohen-Kerner, Y., Stern, I., Korkus, D., Fredj, E.: Automatic machine learning of keyphrase extraction from short html documents written in Hebrew. Cybernetics and Systems: An International Journal, 38(1), 1-21 (2007).
12. HaCohen-Kerner, Y., Mughaz, D., Beck, H., Yehudai, E.: Words as classifiers of documents according to their historical period and the ethnic origin of their authors. Cybernetics and Systems: An International Journal, 39(3), 213-228 (2008).
13. HaCohen-Kerner, Y., Beck, H., Yehudai, E., Rosenstein, M., Mughaz, D.: Cuisine: Classification using stylistic feature sets &/or name-based feature sets. Journal of the American Society for Information Science and Technology 61 (8), 1644–57 (2010A).
14. HaCohen-Kerner, Y., Beck, H., Yehudai, E., Mughaz, D.: Stylistic feature sets as classifiers of documents according to their historical period and ethnic origin. Applied Artificial Intelligence 24 (9), 847–62 (2010B).
15. Ho, T. K.: Random Decision Forests. Proc. of the 3rd Int. Conf. on Document Analysis and Recognition, Montreal, QC, 14–16 August 1995. 278–282 (1995).
16. Hosmer Jr, D. W., Lemeshow, S., Sturdivant, R. X.: Applied logistic regression (Vol. 398). John Wiley & Sons (2013).
17. Jain, A. K., Mao, J., Mohiuddin, K. M.: Artificial neural networks: A tutorial. Computer, 29(3), 31-44 (1996).
18. Joachims, T.: Text categorization with support vector machines: Learning with many relevant features. In European conference on machine learning (pp. 137-142). Springer, Berlin, Heidelberg (1998).
19. Jockers, M. L., Witten, D. M.: A comparative study of machine learning methods for authorship attribution. Literary and Linguistic Computing, 25(2), 215-223 (2010).
20. Jolliffe, I. T.: Principal component analysis and factor analysis. In Principal component analysis (pp. 115-128). Springer, New York, NY (1986).
21. Kessler, B., Nunberg, G., Schütze, H.: Automatic detection of text genre. In P. R. Cohen and W. Wahlster (Eds.), Proc. of the 35th annual meeting of the ACL and 8th conf. of the Eur. chapter of the As. for Comp. Linguistics. 32-38, Somerset, New Jersey: Association for Computational Linguistics (1997).
22. Kestemont, M., Tschuggnall, M., Stamatatos, E., Daelemans, W., Specht, G., Stein, B., Potthast, M.: Overview of the Author Identification Task at PAN-2018: Cross-domain Authorship Attribution and Style Change Detection. In: Cappellato, L., Ferro, N., Nie, J.Y., Soulier, L. (eds.) Working Notes Papers of the CLEF 2018 Evaluation Labs. CEUR Workshop Proceedings, CLEF and CEUR-WS.org (2018).
23. Koppel, M., Argamon, S., Shimoni, A. R.: Automatically categorizing written texts by author gender. Literary and linguistic computing, 17(4), 401-412 (2002).
24. Liparas, D., HaCohen-Kerner, Y., Moumtzidou, A., Vrochidis, S., Kompatsiaris, I.: News articles classification using Random Forests and weighted multimodal features. In Information Retrieval Facility Conf. (pp. 63-75). Springer, Cham (2014).
25. Loper, E., Bird, S.: NLTK: The natural language toolkit. In Proceedings of the ACL-02 Workshop on Effective tools and methodologies for teaching natural language processing and computational linguistics-Volume 1 (pp. 63-70). Association for Computational Linguistics (2002).
26. Meretakis, D., Wuthrich, B.: Extending naive Bayes classifiers using long itemsets. Proc. of the 5th ACM-SIGKDD Int. Conf. Knowledge Discovery, Data Mining (KDD'99), San Diego, USA, 165-174 (1999).
27. Mosteller, F., Wallace, D. L.: Applied Bayesian and classical inference: the case of the Federalist papers. Addison-Wesley: Reading, MA (1984).
28. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... Vanderplas, J.: Scikit-learn: Machine learning in Python. Journal of machine learning research, 12(Oct), 2825-2830 (2011).
29. Potthast, M., Gollub, T., Rangel, F., Rosso, P., Stamatatos, E., Stein, B.: Improving the Reproducibility of PAN's Shared Tasks: Plagiarism Detection, Author Identification, and Author Profiling. In: Kanoulas, E., Lupu, M., Clough, P., Sanderson, M., Hall, M., Hanbury, A., Toms, E. (eds.) Information Access Evaluation meets Multilinguality, Multimodality, and Visualization. 5th International Conference of the CLEF Initiative (CLEF 14). pp. 268-299. Springer, Berlin Heidelberg New York (2014).
30. Rangel, F., Rosso, P., Potthast, M., Stein, B.: Overview of the 5th author profiling task at pan 2017: Gender and language variety identification in twitter. Working Notes Papers of the CLEF (2017).
31. Schler, J., Koppel, M., Argamon, S., Pennebaker, J. W.: Effects of age and gender on blogging. In AAAI spring symposium: Computational approaches to analyzing weblogs (Vol. 6, pp. 199-205) (2006).
32. Sebastiani, F.: Machine learning in automated text categorization. ACM computing surveys (CSUR), 34(1), 1-47 (2002).
33. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
34. Stamatatos, E.: Authorship attribution based on feature set subspacing ensembles. International Journal on Artificial Intelligence Tools, 15(05), 823-838 (2006).
35. Stamatatos, E.: A survey of modern authorship attribution methods. Journal of the Association for Information Science and Technology, 60(3), 538-556 (2009).
36. Stamatatos, E., Rangel, F., Tschuggnall, M., Kestemont, M., Rosso, P., Stein, B., Potthast, M.: Overview of PAN-2018: Author Identification, Author Profiling, and Author Obfuscation. In: Bellot, P., Trabelsi, C., Mothe, J., Murtagh, F., Nie, J., Soulier, L., Sanjuan, E., Cappellato, L., Ferro, N. (eds.) Experimental IR Meets Multilinguality, Multimodality, and Interaction. 9th Int. Conference of the CLEF Initiative (CLEF 18). Springer, Berlin Heidelberg New York (2018).
37. Walt, S. V. D., Colbert, S. C., Varoquaux, G.: The NumPy array: a structure for efficient numerical computation. Computing in Science & Engineering, 13(2), 22-30 (2011).