<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Cross-domain Authorship Attribution: Author Identification using Char Sequences, Word Uni-grams, and POS-tags Features</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yaakov HaCohen-Kerner</string-name>
          <email>kerner@jct.ac.il</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniel Miller</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yair Yigal</string-name>
          <email>yigalyairn@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elyashiv Shayovitz</string-name>
          <email>elyashiv12@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>21 Havaad Haleumi St.</institution>
          ,
          <addr-line>P.O.B. 16031, 9116001 Jerusalem</addr-line>
          ,
          <country country="IL">Israel</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science, Jerusalem College of Technology, Lev Academic Center</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <abstract>
        <p>Authorship attribution deals with identifying the author of an anonymous text, i.e., attributing each test text of unknown authorship to one of a set of known authors whose training texts are given. In this paper, we describe the participation of our teams (miller18 and yigal18; both teams contain the same people, but in a different order) in the PAN 2018 shared task on cross-domain author identification. Given a set of documents written by known authors, the task is to identify the authors of documents from another set. All documents are in the same language, which may be English, French, Italian, Polish, or Spanish. We describe our pre-processing, feature sets, the applied machine learning methods, and the average F1 scores of three submitted models. For the evaluation corpus, we sent the top three models according to their results on the development corpus using PCA and Linear SVC. The first model scored an average of 0.582. Its features consist of the frequencies of all char 6-gram sequences, POS-tag sequence frequencies, orthographic features, quantitative features, and lexical richness features. The second model scored an average of 0.598. Its features consist of all the char sequences of length 3 to 8, all word uni-grams, POS-tag features, and all stylistic features from the first model. The third model scored an average of 0.611. Its features consist of the content-based features of the second model and POS-tag features.</p>
      </abstract>
      <kwd-group>
        <kwd>Authorship Attribution</kwd>
        <kwd>Author Identification</kwd>
        <kwd>Content-based Features</kwd>
        <kwd>Style-based Features</kwd>
        <kwd>Supervised Machine Learning</kwd>
        <kwd>Text Classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
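        Authorship attribution (AA) deals with identifying the author of an anonymous text, i.e., attributing each test text of unknown authorship to one of a set of known authors whose training texts are given. This task is usually performed as a text classification (TC) process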
using supervised machine learning (ML) method(s). “The main idea behind statistically
or computationally supported authorship attribution is that by measuring some textual
features, we can distinguish between texts written by different authors” [
        <xref ref-type="bibr" rid="ref19">35</xref>
        ].
      </p>
      <p>In this paper, we describe the participation of our teams (miller18 and yigal18; both
teams contain the same people, but in a different order) in the PAN 2018 shared task on
cross-domain authorship attribution. More specifically, the shared task has been set up
as follows. Given a set of documents (known fanfics) by a small number (up to 20) of
candidate authors, identify the authors of another set of documents (unknown fanfics).
Each candidate author has contributed at least one of the unknown fanfics, which all
belong to the same target fandom. The known fanfics belong to several fandoms
(excluding the target fandom), although not necessarily the same for all candidate
authors. An equal number of fanfics per candidate author is provided. In contrast, the
unknown fanfics are not equally distributed over the authors. The length of the fanfics
varies from 500 to 1,000 tokens. All documents are in the same language, which may be
English, French, Italian, Polish, or Spanish.</p>
      <p>The rest of the paper is organized as follows. Section 2 provides background and
presents some related work on text classification in general and authorship attribution
in particular. Section 3 introduces the feature sets that we have implemented and used
in our extensive experiments. Section 4 presents the experimental setup, the results and
their analysis. Finally, Section 5 summarizes and suggests ideas for future research.
</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <sec id="sec-2-1">
        <title>Text classification</title>
        <p>
          Text classification (TC) is the supervised learning task of assigning natural language text documents to one
or more predefined categories [
          <xref ref-type="bibr" rid="ref10">26</xref>
          ]. There are two main types of TC: topic-based
classification and style-based classification. An example of a topic-based classification
application is classifying news articles, downloaded from three well-known news websites
(BBC, Reuters, and The Guardian), into categories such as Business-Finance,
Lifestyle-Leisure, Science-Technology, and Sports [
          <xref ref-type="bibr" rid="ref8">24</xref>
          ]. An example of a
style-based classification application is classification based on different literary genres,
e.g., action, comedy, crime, fantasy, historical, political, saga, and science fiction [21,
9].
        </p>
        <p>These two classification types often require different types of feature sets to achieve
the best performance. Topic-based classification is typically performed using word
uni-grams and/or n-grams (n &gt; 2) [1, 12]. Style-based classification is typically
performed using linguistic features such as quantitative features, orthographic features,
part-of-speech (POS) tags, function words, and vocabulary richness features [21, 13,
14, 9].</p>
      </sec>
      <sec id="sec-2-2">
        <title>Authorship attribution</title>
        <p>
          Authorship attribution (AA) can be viewed as a sub-task of TC. In its early history, AA had
only limited application, mainly to literary works of unknown or disputed authorship
[
          <xref ref-type="bibr" rid="ref11">27</xref>
          ]. However, during the last two decades, AA has grown and developed impressively due
to its many applications in a wide variety of domains, e.g., criminal law,
computer forensics, humanities research, and military intelligence [
          <xref ref-type="bibr" rid="ref19">35</xref>
          ].
        </p>
        <p>Current research in AA focuses mainly on finding the best features (both
style-based and content-based) for identifying document authors, and on suitable ML
methods.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Feature Sets</title>
      <p>In this section, we present the content-based and style-based features that we applied
for the authorship attribution task. Each combination of those features is defined by the
following template: “number k-skip-n-type [style]”. All the features were scaled using
scikit-learn’s MaxAbsScaler<sup>1</sup>.</p>
      <sec id="sec-3-1">
        <title>Content-based features</title>
        <p>Our content-based features include various n-gram feature sets named according to the
pattern “number k-skip-n-type”, where ‘number’ is the number of n-grams kept
in the set (the size of our vocabulary, e.g., 1000 or 5000, or ‘all’, which means that we
used all the tokens in the dataset), ‘k’ is the size of the skip (0 for no skip, 1 for a
skip of one unit, 2 for a skip of two units, …), ‘n’ is the length of the n-gram (1 for uni-grams,
2 for bigrams, 3 for trigrams, …), and ‘type’ is ‘word’ for words or ‘char’ for characters.
All values are represented by TF-IDF values. The specific n-gram feature sets
that were applied are presented later, together with the experiments.</p>
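        <p>The “number k-skip-n-type” naming pattern can be illustrated with a minimal sketch. It assumes that a skip of k means that exactly k units are skipped between consecutive items of an n-gram, and it returns raw counts rather than the TF-IDF values used in our experiments. The function names skip_ngrams and top_ngram_counts are hypothetical and are not taken from our code.</p>
        <preformat>
```python
from collections import Counter


def skip_ngrams(units, n, k=0):
    """Return n-grams whose consecutive items are k units apart (k=0 gives plain n-grams)."""
    step = k + 1
    span = (n - 1) * step + 1  # number of units covered by one skip-gram
    return [
        tuple(units[i + j * step] for j in range(n))
        for i in range(len(units) - span + 1)
    ]


def top_ngram_counts(text, number, k, n, unit_type):
    """Count the n-grams of one text and keep the `number` most frequent ones (or all of them)."""
    # 'char' works on characters; 'word' splits on the single space character.
    units = list(text) if unit_type == "char" else text.split(" ")
    counts = Counter(skip_ngrams(units, n, k))
    if number == "all":
        return dict(counts)
    return dict(counts.most_common(number))


# "all 0-skip-2-char": every character bigram of the text, without skips.
char_bigrams = top_ngram_counts("abab", "all", 0, 2, "char")
```
        </preformat>
        <p>In this sketch the vocabulary is selected per text for brevity; in practice the vocabulary would be selected over the whole training corpus.</p>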
      </sec>
      <sec id="sec-3-2">
        <title>Style-based features</title>
        <p>Our style-based features include the following feature sets: POS-tags, quantitative
features, orthographic features, and lexical richness features. The ‘[style]’ part of
the template mentioned above is a list of style features separated by ‘_’, where ‘pos’
stands for POS-tags, ‘quan’ for quantitative features, ‘orth’ for orthographic features,
and ‘rich’ for lexical richness features. The POS-tag features are the frequencies of
POS-tag sequences of length 1 to 10. We tried various maximum sequence lengths and
obtained the best results with 10; possibly some authors tend to use
specific POS sequences of this length, perhaps spanning more than one sentence. The
POS-tag features for English and Spanish were created using the Stanford POS-Tagger<sup>2</sup> and
for French, Italian, and Polish using the RDRPOSTagger<sup>3</sup>. While most of the
POS-taggers we used have fine-grained tags, the Polish POS-tagger has only UniPOS-tags.
The quantitative set includes 12 features: the average and median number of characters in
a word (with and without punctuation) or a token, the average and median number of
words or tokens in a sentence, and the average and median number of characters in a
sentence. The orthographic set includes 32 features. Each one represents the frequency of
a specific character (normalized by the number of characters in the tested document)
from the following list of characters: !"#$%&amp;'()*+,-./:;&lt;=&gt;?@[\]^_`{|}~. The
lexical richness set includes two features: the percentage of unique words in the text and the
percentage of hapaxes in the text.</p>
        <p><sup>1</sup> http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html
<sup>2</sup> https://nlp.stanford.edu/software/tagger.shtml
<sup>3</sup> http://rdrpostagger.sourceforge.net/</p>
        <p>Tokenization was performed by NLTK's word tokenizer<sup>4</sup>. Sentence
separation was performed by NLTK's sentence tokenizer<sup>5</sup>. Word separation was performed
using the single space character.</p>
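        <p>Three of the style-based feature sets can be sketched with the Python standard library alone. The sketch assumes that the 32 orthographic characters are exactly those in Python's string.punctuation, that both lexical richness values are divided by the total number of words, and that POS tags are already available as a list (the taggers themselves are not called here). All function names are hypothetical.</p>
        <preformat>
```python
import string
from collections import Counter


def orthographic_features(text):
    """Frequency of each of the 32 punctuation characters, normalized by text length."""
    total = len(text) or 1  # avoid division by zero on empty input
    counts = Counter(text)
    return [counts[ch] / total for ch in string.punctuation]


def lexical_richness_features(text):
    """Share of unique words and share of hapaxes; words are split on single spaces."""
    words = text.lower().split(" ")
    counts = Counter(words)
    unique_share = len(counts) / len(words)
    hapax_share = sum(1 for c in counts.values() if c == 1) / len(words)
    return [unique_share, hapax_share]


def pos_sequence_counts(tags, max_len=10):
    """Count every POS-tag sequence of length 1 to max_len in a list of tags."""
    counts = Counter()
    for length in range(1, max_len + 1):
        for start in range(len(tags) - length + 1):
            counts[tuple(tags[start:start + length])] += 1
    return counts


richness = lexical_richness_features("the cat saw the dog")
```
        </preformat>
        <p>For the five-word example, four of the words are distinct and three of them occur once, so the two richness values are 0.8 and 0.6.</p>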
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experimental Setup and Results</title>
      <p>
        PAN at CLEF 2018 [
        <xref ref-type="bibr" rid="ref20">36</xref>
        ] launched an evaluation campaign in which ten teams
participated. Each team proposed its own algorithm, which was evaluated
using the TIRA platform [
        <xref ref-type="bibr" rid="ref8">24</xref>
        ]. The algorithms and the results of the participating teams
are surveyed in [22].
      </p>
      <p>
        General approach: Our approach to authorship attribution is to apply supervised
ML methods to TC, as suggested by Sebastiani [
        <xref ref-type="bibr" rid="ref16">32</xref>
        ]. The process is as follows.
First, given a corpus of training documents, where each document is labeled with its
author, we process each document to produce values for various combinations of
features from different types of feature sets: content-based features and style-based
features. Second, we apply several popular ML methods to the generated combinations of
features. Third, we try additional combinations of features and parameter tuning.
Finally, the best model(s) are tested on out-of-training data (i.e., evaluation data).
      </p>
      <p>Preprocessing: There is a wide variety of text preprocessing types, such as
conversion of uppercase letters into lowercase letters, HTML object removal, stopword
removal, punctuation mark removal, reduction of different sets of emoticon labels to a
reduced set of wildcard characters, replacement of HTTP links with wildcard characters,
word stemming, word lemmatization, correction of commonly misspelled words, and
reduction of replicated characters. Not all preprocessing types are considered
effective by all TC researchers. Many systems use only a small number of simple
preprocessing types (e.g., conversion of all uppercase letters into lowercase letters and/or
stopword removal).</p>
      <p>In our classification experiments, we found that most of the preprocessing types are
irrelevant, because the data we have is already processed and is mostly free of errors
and mistakes. We mainly used characters for the content-based features, so stopword
removal did not improve the TC results. We also tried stemming, but it hurt the
results. Therefore, the only preprocessing type we used was converting uppercase
letters into lowercase letters.</p>
      <p><sup>4</sup> http://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.punkt.PunktLanguageVars.word_tokenize
<sup>5</sup> http://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.sent_tokenize</p>
      <p>ML methods: We applied five popular ML methods: Multilayer Perceptron (MLP)<sup>6</sup>,
SVC<sup>7</sup>, LinearSVC<sup>8</sup>, Logistic Regression (LR)<sup>9</sup>, and Random Forest (RF)<sup>10</sup>.
A brief description of these ML methods follows. MLP is a feedforward artificial neural
network (ANN) [17]; an ANN can be viewed as a weighted
directed graph in which nodes are artificial neurons and directed edges (with weights)
are connections from the outputs of neurons to the inputs of neurons. A support vector
machine (SVM, also called a support vector network) [6] is a model that assigns
examples to one of two categories, making it a non-probabilistic binary linear classifier. SVC
is an SVM with an RBF kernel implemented using LibSVM [5]. LinearSVC is an
SVM with a linear kernel implemented using LibLinear [7]; it is
recommended for TC because most TC problems are linearly separable [18] and training
an SVM with a linear kernel is faster than with other kernels. Logistic regression
(LR) is a statistical model that predicts the outcome of
a categorical dependent variable (i.e., a class label) [4, 16]. Random Forest (RF) is an
ensemble learning method for classification and regression [3]. RF operates by
constructing a multitude of decision trees at training time and outputting a classification for
the case at hand. RF combines the “bagging” idea presented by Breiman [2] and the random
selection of features introduced by Ho [15] to construct a forest of decision trees.</p>
      <p>
        Tools and information sources: We used the following three tools and information
sources:
1. Scikit-Learn<sup>11</sup> - a Python library for ML methods and algorithms [
        <xref ref-type="bibr" rid="ref9">25</xref>
        ]
2. NLTK<sup>12</sup> - a suite of libraries and programs for symbolic and statistical natural
language processing [
        <xref ref-type="bibr" rid="ref12">28</xref>
        ]
3. NumPy<sup>13</sup> - a library that performs fast mathematical processing on arrays and
matrices [37].
      </p>
      <p>Development corpus: The development corpora cover five languages (English, French,
Italian, Polish, and Spanish). For each language, we have two corpora, each of
which has both train and test sub-corpora.</p>
      <sec id="sec-4-1">
        <title>Experimental results using content-based feature sets</title>
        <p>The first TC experiments were performed on the development corpus using five ML
methods without any parameter tuning. The average F1 scores of all the
examined feature sets on the development corpus are shown in Table 1. The best result
for each tested ML method (a column) is shown in bold.</p>
        <p><sup>6</sup> http://scikit-learn.org/stable/modules/neural_networks_supervised.html
<sup>7</sup> The default SVC uses an RBF kernel. http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
<sup>8</sup> http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html
<sup>9</sup> http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
<sup>10</sup> http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
<sup>11</sup> http://scikit-learn.org/stable/index.html
<sup>12</sup> https://www.nltk.org/
<sup>13</sup> http://www.numpy.org/</p>
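        <p>The average F1 score reported throughout this section can be sketched as follows, assuming that the score of one attribution problem is the macro-averaged F1 over the authors that appear in the ground truth and that the reported average is the plain mean over problems. Both assumptions, and the function names, are illustrative only.</p>
        <preformat>
```python
def macro_f1(true_labels, predicted_labels):
    """Unweighted mean of the per-author F1 scores for one attribution problem."""
    pairs = list(zip(true_labels, predicted_labels))
    scores = []
    for label in sorted(set(true_labels)):
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denominator = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the author is never hit.
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


def average_over_problems(problem_scores):
    """Mean of the per-problem macro-F1 scores."""
    return sum(problem_scores) / len(problem_scores)


score = macro_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"])
```
        </preformat>
        <p>In the example, author a has an F1 of 2/3 and author b an F1 of 4/5, so the macro-averaged score is 11/15.</p>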
        <p>Table 1. Average F1 results using sets of word multi-grams. Rows: feature sets of 1000, 2000, 3000, 4000, 5000, 7500, and 10000 word uni-grams, word bi-grams, and word 3-grams; columns: the tested ML methods.</p>
        <p>The results shown in Table 1 are quite poor, and the vocabulary size does not seem to
have any noticeable effect on them. We also noticed that the best results were
achieved by the word uni-gram feature sets, which is not a surprise according to many
previous TC studies. It is also interesting to note that although the LinearSVC and SVC
methods are both based on the SVM method, the differences in the kernel and in the
implementation have a big effect on their scores.</p>
        <p>The next experiment we conducted was similar, but with characters instead of words.
The average F1 results are shown in Table 2. The best result for each
tested ML method (a column) is shown in bold.</p>
        <p>Table 2. Average F1 results using sets of char 1-4-grams.</p>
        <p>We also tried many combinations that include various skip character/word
n-gram sets. However, their results were quite low.</p>
        <p>Next, we tried to reduce overfitting by removing hapaxes (i.e., words
occurring only once in the text) or by removing any words that appear fewer than four times
in the corpus. The results of these experiments were, again, quite low.</p>
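        <p>The corpus-level frequency filter described above can be sketched as follows. The sketch assumes that words are split on the single space character and that counts are taken over the whole training corpus; the function name filter_rare_words is hypothetical.</p>
        <preformat>
```python
from collections import Counter


def filter_rare_words(documents, min_count=4):
    """Drop every word that occurs fewer than min_count times in the whole corpus."""
    counts = Counter(word for doc in documents for word in doc.split(" "))
    return [
        " ".join(word for word in doc.split(" ") if counts[word] >= min_count)
        for doc in documents
    ]


# 'a' occurs four times in the corpus and is kept; 'b' and 'c' occur once and are removed.
filtered = filter_rare_words(["a a b", "a a c"], min_count=4)
```
        </preformat>
        <p>Setting min_count to 2 removes the words that occur only once in the corpus.</p>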
        <p>Since the best results so far were obtained by char 3-gram and
4-gram features with Linear SVC and Logistic Regression, we decided to try n-gram
features with more characters in the next experiments. In addition, we
noticed that the results were considerably higher when we used a larger vocabulary
(i.e., more features); therefore, we also tried using all the char n-grams in the corpus. The
results of this experiment are shown in Table 3. The best result for each tested ML
method (a column) is shown in bold.</p>
        <p>Table 3. Average F1 results using sets of char 5- to 8-grams.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Experimental results using style-based feature sets</title>
        <p>We also tried adding various style-based feature sets (POS-tags, quantitative features,
orthographic features, and lexical richness features) to our feature combinations.
However, most of these feature sets did not improve the average F1 score, except for
POS-tags. We used the frequencies of POS-tag sequences as features in addition to the char
6-gram features, which had produced the best result so far. By using POS-tags we increased
our average F1 score, but on closer inspection we noticed that the results for the datasets
in Spanish and Polish got worse. We suspect that this was caused by inaccurate
tagging of the text (the Polish POS-tagger was just a UniPOS-tagger and not a
fine-grained one). Therefore, we ran our tests again, using the POS tagger only for the datasets
in English, French, and Italian. The results are shown in Table 4.</p>
        <p>We also applied the principal component analysis (PCA) [20] algorithm using
Scikit-Learn’s PCA module<sup>14</sup>. With this model, we filtered out the least
significant features. The results improved, but again, on closer inspection we noticed that the results
got worse in Spanish and Polish, possibly due to the lack of the POS-tag features.
Therefore, we decided not to use PCA for those languages. The results are shown in Table 5.</p>
        <p><sup>14</sup> http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html</p>
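        <p>The combination of char 6-gram TF-IDF features, MaxAbsScaler, PCA, and Linear SVC can be sketched as a scikit-learn pipeline. This is a minimal sketch on invented toy data, not our experimental code: the number of PCA components, the densifying step, and the helper names to_dense and build_model are assumptions, and the style-based features are omitted.</p>
        <preformat>
```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, MaxAbsScaler
from sklearn.svm import LinearSVC


def to_dense(matrix):
    """Convert the sparse TF-IDF matrix to a dense array before PCA."""
    return matrix.toarray()


def build_model(n_components):
    """Char 6-gram TF-IDF, max-abs scaling, PCA, and a linear SVM with default parameters."""
    return make_pipeline(
        TfidfVectorizer(analyzer="char", ngram_range=(6, 6), lowercase=True),
        MaxAbsScaler(),
        FunctionTransformer(to_dense),
        PCA(n_components=n_components),
        LinearSVC(),
    )


# Invented toy corpus: three short texts for each of two hypothetical authors.
train_texts = [
    "the dragon flew over the mountain at dawn",
    "the dragon slept under the mountain all night",
    "a dragon guarded the mountain pass",
    "she programmed the robot in the laboratory",
    "the robot walked across the laboratory floor",
    "a robot repaired the laboratory door",
]
train_authors = ["A", "A", "A", "B", "B", "B"]

model = build_model(n_components=2)
model.fit(train_texts, train_authors)
predictions = model.predict(["the dragon circled the mountain",
                             "the robot left the laboratory"])
```
        </preformat>
        <p>Wrapping all steps in one pipeline ensures that the vocabulary, the scaling, and the PCA projection are learned from the training texts only and then reused unchanged on the unknown texts.</p>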
        <p>Table 4. Average F1 results for style features combined with char 6-grams. Rows: all char 6-grams with Linear SVC, alone (all_6-char_linear_svc) and combined with each of the 15 combinations of the style feature sets pos, orth, quan, and rich (e.g., all_6-char_pos_orth_quan_rich_linear_svc).</p>
        <p>Table 5. Average F1 results for char 6-grams with style features and PCA applied. Rows: the same 16 feature combinations as in Table 4, with PCA applied (e.g., all_6-char_pos_orth_quan_rich_pca_linear_svc).</p>
      </sec>
      <sec id="sec-4-3">
        <title>Experimental results for the final three models</title>
        <p>For the final competition we sent three models, two of which were sent by miller18 and
another one by yigal18.</p>
        <p>The first model, sent by miller18, was based on the results we achieved on the
development corpus. Its features consist of the frequencies of all char 6-gram
sequences, POS-tag sequence frequencies, orthographic features, quantitative features,
and lexical richness features. Then, we applied PCA and Linear SVC, which was the
best ML method on the development corpus. We tried several parameter combinations
for the classifier using GridSearchCV<sup>15</sup>, including ‘C’ (penalty parameter of the error
term) values from 0.001 to 10 and ‘tol’ (tolerance for the stopping criterion) values
from 1e-7 to 1, but the results did not improve, so we sent the model with the default
parameters. This model scored an average of 0.582 on the evaluation corpus.</p>
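        <p>A parameter search of this kind can be sketched with GridSearchCV. Only the ranges of ‘C’ and ‘tol’ are given above, so the individual grid values, the synthetic data, the macro-F1 scoring, and the three-fold cross-validation in this sketch are assumptions.</p>
        <preformat>
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Assumed grid points inside the stated ranges (C: 0.001 to 10, tol: 1e-7 to 1).
param_grid = {
    "C": [0.001, 0.01, 0.1, 1, 10],
    "tol": [1e-7, 1e-5, 1e-3, 1e-1, 1],
}

# Synthetic stand-in for a feature matrix with three candidate authors.
features, authors = make_classification(
    n_samples=60, n_features=20, n_informative=5, n_classes=3, random_state=0
)

search = GridSearchCV(
    LinearSVC(max_iter=5000), param_grid, scoring="f1_macro", cv=3
)
search.fit(features, authors)
best_params = search.best_params_
```
        </preformat>
        <p>After fitting, best_params holds the combination with the highest cross-validated score, which can then be compared with the score of the default parameters.</p>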
        <p>For the second model, sent by yigal18, we used all the char sequences of
length 3 to 8. We also added all the unigram frequencies to the feature
sets, in addition to the stylistic features from the first model. Again, we applied PCA
and Linear SVC as our classifier with the default parameters. This model scored an
average of 0.598 on the evaluation corpus.</p>
        <p>For the third model, which is the second model sent by miller18, we used the same
content features as in the second model, but this time we did not use any style-based
features at all except for POS-tags. The PCA and the classifier were the same as in the
other models. This model scored an average of 0.611 on the evaluation corpus. It is
important to note that the organizers of this PAN task did not accept this model as a
formal one because the results of the competition had already been published. However, they
allowed us to write about this model and its results in our notebook paper. In Table 6,
we present a comparison of these three models, including the average F1 results
for the development and evaluation corpora.</p>
        <p>The most striking and surprising finding in Table 6 is that there is no direct
connection between the results of the three models on the development corpus and their
results on the evaluation corpus. Furthermore, the results on the evaluation corpus are
considerably lower, which implies that the evaluation corpus is quite different in its
nature from the development corpus. Another possible explanation is that the three
models have overfitted and are not optimal for general corpora of this type.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Summary and Future Work</title>
      <p>In this paper, we described our participation in the PAN 2018 shared task on author
identification. We tried various pre-processing types, a wide variety of feature
sets, and five ML methods.</p>
      <p>For the evaluation corpus, we sent the top three models according to their results on
the development corpus. The first model was sent by miller18. Its features consist of
the frequencies of all char 6-gram sequences, POS-tag sequence frequencies,
orthographic features, quantitative features, and lexical richness features. Then we applied
PCA. The best ML method on the development corpus was the Linear SVC. This model
scored an average of 0.582 on the evaluation corpus.</p>
      <p><sup>15</sup> http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html</p>
      <p>The second model was sent by yigal18. Its features consist of all the char sequences
of length 3 to 8. We also added all the unigram frequencies to the feature
sets, in addition to the stylistic features from the first model. Again, we applied PCA
and Linear SVC as our classifier with the default parameters. This model scored an
average of 0.598 on the evaluation corpus (an improvement of 0.016 compared to the first
model).</p>
      <p>The third model, which is the second model sent by miller18, was composed of the
same content-based features as the second model, but this time we did not use any
style-based features at all except for POS-tags. The PCA and the classifier were the same as in
the other models. This model scored an average of 0.611 (an improvement of
0.029 compared to the first model).</p>
      <p>The main findings of our experiments are: (1) there is no direct connection between
the results of the three models on the development corpus and their results on the
evaluation corpus and (2) the results on the evaluation corpus are considerably lower,
which implies that the evaluation corpus is quite different in its nature from the
development corpus. A possible explanation is that the three models have overfitted
and are not optimal for general corpora of this type.</p>
      <p>Future research proposals include: (1) applying additional combinations of feature
sets in order to find a sufficiently general model; (2) tuning the models’ parameters; (3)
applying various deep neural models; (4) defining and applying skip-POS sequences;
and (5) building a model that will perform authorship attribution using key phrases [10,
11] that distinguish each of the text authors.</p>
      <p>Acknowledgments. This work was partially funded by the Jerusalem College of
Technology (Lev Academic Center) and we gratefully acknowledge its support.</p>
      <p>7. Fan, R. E., Chang, K. W., Hsieh, C. J., Wang, X. R., Lin, C. J.: LIBLINEAR: A
library for large linear classification. Journal of Machine Learning
Research, 9(Aug), 1871-1874 (2008).
8. Fox, C.: A stop list for general text. In ACM SIGIR Forum (Vol. 24, No. 1-2, pp.
19-21). ACM (1989).
9. Gianfortoni, P., Adamson, D., Rosé, C. P.: Modeling of stylistic variation in social
media with stretchy patterns. In Proceedings of the First Workshop on Algorithms
and Resources for Modelling of Dialects and Language Varieties (pp. 49-59).
Association for Computational Linguistics (2011).
10. HaCohen-Kerner, Y., Gross, Z., Masa, A.: Automatic extraction and learning of
keyphrases from scientific articles. In Int. Conf. on Intelligent Text Processing and
Computational Linguistics, Springer, Berlin, Heidelberg, pp. 657-669 (2005).
11. HaCohen-Kerner, Y., Stern, I., Korkus, D., Fredj, E.: Automatic machine learning
of keyphrase extraction from short html documents written in Hebrew.
Cybernetics and Systems: An International Journal, 38(1), 1-21 (2007).
12. HaCohen-Kerner, Y., Mughaz, D., Beck, H., Yehudai, E.: Words as classifiers of
documents according to their historical period and the ethnic origin of their
authors. Cybernetics and Systems: An International Journal, 39(3), 213-228 (2008).
13. HaCohen-Kerner, Y., Beck, H., Yehudai, E., Rosenstein, M., Mughaz, D.: Cuisine:
Classification using stylistic feature sets &amp;/or name-based feature sets. Journal of
the American Society for Information Science and Technology 61 (8), 1644-1657
(2010A).
14. HaCohen-Kerner, Y., Beck, H., Yehudai, E., Mughaz, D.: Stylistic feature sets as
classifiers of documents according to their historical period and ethnic origin.
Applied Artificial Intelligence 24 (9), 847-862 (2010B).
15. Ho, T. K.: Random Decision Forests. Proc. of the 3rd Int. Conf. on Document
Analysis and Recognition, Montreal, QC, 14-16 August 1995. 278-282 (1995).
16. Hosmer Jr, D. W., Lemeshow, S., Sturdivant, R. X.: Applied logistic regression
(Vol. 398). John Wiley &amp; Sons (2013).
17. Jain, A. K., Mao, J., Mohiuddin, K. M.: Artificial neural networks: A
tutorial. Computer, 29(3), 31-44 (1996).
18. Joachims, T.: Text categorization with support vector machines: Learning with
many relevant features. In European Conference on Machine Learning (pp.
137-142). Springer, Berlin, Heidelberg (1998).
19. Jockers, M. L., Witten, D. M.: A comparative study of machine learning methods
for authorship attribution. Literary and Linguistic Computing, 25(2), 215-223
(2010).
20. Jolliffe, I. T.: Principal component analysis and factor analysis. In Principal
component analysis (pp. 115-128). Springer, New York, NY (1986).
21. Kessler, B., Nunberg, G., Schütze, H.: Automatic detection of text
genre. In P. R. Cohen and W. Wahlster (Eds.), Proc. of the 35th annual meeting
of the ACL and 8th conf. of the Eur. chapter of the As. for Comp. Linguistics.
32-38, Somerset, New Jersey: Association for Computational Linguistics (1997).
22. Kestemont, M., Tschuggnall, M., Stamatatos, E., Daelemans, W., Specht, G., Stein,
B., Potthast, M.: Overview of the Author Identification Task at PAN-2018:
Cross[...] Soulier, L., Sanjuan, E., Cappellato, L., Ferro, N. (eds.) Experimental IR Meets
Multilinguality, Multimodality, and Interaction. 9th Int. Conference of the CLEF
Initiative (CLEF 18). Springer, Berlin Heidelberg New York (2018).
37. Walt, S. V. D., Colbert, S. C., Varoquaux, G.: The NumPy array: a structure for
efficient numerical computation. Computing in Science &amp; Engineering, 13(2),
22-30 (2011).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Argamon</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Whitelaw</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chase</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hota</surname>
            ,
            <given-names>S. R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garg</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Levitan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Stylistic text classification using functional lexical features: Research articles</article-title>
          .
          <source>Journal of the Amer. Society for Info. Science and Technology</source>
          .
          <volume>58</volume>
          ,
          <issue>6</issue>
          ,
          <fpage>802</fpage>
          -
          <lpage>822</lpage>
          (
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Breiman</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Bagging predictors</article-title>
          .
          <source>Univ. California Technical Report No. 421.</source>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Breiman</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Random forests</article-title>
          .
          <source>Machine learning</source>
          ,
          <volume>45</volume>
          (
          <issue>1</issue>
          ),
          <fpage>5</fpage>
          -
          <lpage>32</lpage>
          (
          <year>2001</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Le Cessie</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van Houwelingen</surname>
            ,
            <given-names>J. C.</given-names>
          </string-name>
          :
          <article-title>Ridge estimators in logistic regression</article-title>
          .
          <source>Applied statistics</source>
          , pp.
          <fpage>191</fpage>
          -
          <lpage>201</lpage>
          (
          <year>1992</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>C. C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>C. J.</given-names>
          </string-name>
          :
          <article-title>LIBSVM: a library for support vector machines</article-title>
          .
          <source>ACM transactions on intelligent systems and technology (TIST)</source>
          ,
          <volume>2</volume>
          (
          <issue>3</issue>
          ),
          <elocation-id>27</elocation-id>
          (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Cortes</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vapnik</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Support-vector networks</article-title>
          .
          <source>Machine learning</source>
          ,
          <volume>20</volume>
          (
          <issue>3</issue>
          ),
          <fpage>273</fpage>
          -
          <lpage>297</lpage>
          (
          <year>1995</year>
          ).
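        </mixed-citation>
      </ref>
      <ref>
        <mixed-citation>
          22.
          <string-name>
            <surname>Kestemont</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tschuggnall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daelemans</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Specht</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Overview of the Author Identification Task at PAN-2018: Cross-domain Authorship Attribution and Style Change Detection</article-title>
          . In:
          <string-name>
            <surname>Cappellato</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nie</surname>
            ,
            <given-names>J.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soulier</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          (eds.)
          <source>Working Notes Papers of the CLEF 2018 Evaluation Labs</source>
          . CEUR Workshop Proceedings, CLEF and CEUR-WS.org (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>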
      <ref id="ref7">
        <mixed-citation>
          23.
          <string-name>
            <surname>Koppel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Argamon</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shimoni</surname>
            ,
            <given-names>A. R.</given-names>
          </string-name>
          :
          <article-title>Automatically categorizing written texts by author gender</article-title>
          .
          <source>Literary and linguistic computing</source>
          ,
          <volume>17</volume>
          (
          <issue>4</issue>
          ),
          <fpage>401</fpage>
          -
          <lpage>412</lpage>
          (
          <year>2002</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          24.
          <string-name>
            <surname>Liparas</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>HaCohen-Kerner</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moumtzidou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vrochidis</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kompatsiaris</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>News articles classification using Random Forests and weighted multimodal features</article-title>
          . In:
          <source>Information Retrieval Facility Conf.</source>
          (pp.
          <fpage>63</fpage>
          -
          <lpage>75</lpage>
          ). Springer, Cham. (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          25.
          <string-name>
            <surname>Loper</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bird</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>NLTK: The natural language toolkit</article-title>
          . In:
          <source>Proceedings of the ACL-02 Workshop on Effective tools and methodologies for teaching natural language processing and computational linguistics</source>
          , Volume
          <volume>1</volume>
          (pp.
          <fpage>63</fpage>
          -
          <lpage>70</lpage>
          ).
          <publisher-name>Association for Computational Linguistics</publisher-name>
          (
          <year>2002</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          26.
          <string-name>
            <surname>Meretakis</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wuthrich</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Extending naive Bayes classifiers using long itemsets</article-title>
          ,
          <source>Proc. of the 5th ACM-SIGKDD Int. Conf. Knowledge Discovery, Data Mining (KDD'99)</source>
          , San Diego, USA,
          <fpage>165</fpage>
          -
          <lpage>174</lpage>
          (
          <year>1999</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          27.
          <string-name>
            <surname>Mosteller</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Wallace</surname>
            ,
            <given-names>D. L.</given-names>
          </string-name>
          :
          <article-title>Applied Bayesian and classical inference: the case of the Federalist papers</article-title>
          . Addison-Wesley: Reading, MA (
          <year>1984</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          28.
          <string-name>
            <surname>Pedregosa</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Varoquaux</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gramfort</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Michel</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thirion</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grisel</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          , ...
          <string-name>
            <surname>Vanderplas</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Scikit-learn: Machine learning in Python</article-title>
          .
          <source>Journal of machine learning research</source>
          ,
          <volume>12</volume>
          (Oct),
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          29.
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gollub</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Improving the Reproducibility of PAN's Shared Tasks: Plagiarism Detection, Author Identification, and Author Profiling</article-title>
          . In:
          <string-name>
            <surname>Kanoulas</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lupu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clough</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sanderson</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hanbury</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toms</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          (eds.)
          <article-title>Information Access Evaluation meets Multilinguality, Multimodality, and Visualization</article-title>
          .
          <source>5th International Conference of the CLEF Initiative (CLEF 14)</source>
          . pp.
          <fpage>268</fpage>
          -
          <lpage>299</lpage>
          . Springer, Berlin Heidelberg New York (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          30.
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Overview of the 5th author profiling task at PAN 2017: Gender and language variety identification in Twitter</article-title>
          .
          <source>Working Notes Papers of the CLEF</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          31.
          <string-name>
            <surname>Schler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koppel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Argamon</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pennebaker</surname>
            ,
            <given-names>J. W.</given-names>
          </string-name>
          :
          <article-title>Effects of age and gender on blogging</article-title>
          . In:
          <source>AAAI spring symposium: Computational approaches to analyzing weblogs</source>
          (Vol.
          <volume>6</volume>
          , pp.
          <fpage>199</fpage>
          -
          <lpage>205</lpage>
          ) (
          <year>2006</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          32.
          <string-name>
            <surname>Sebastiani</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Machine learning in automated text categorization</article-title>
          .
          <source>ACM computing surveys (CSUR)</source>
          ,
          <volume>34</volume>
          (
          <issue>1</issue>
          ),
          <fpage>1</fpage>
          -
          <lpage>47</lpage>
          (
          <year>2002</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          33.
          <string-name>
            <surname>Simonyan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zisserman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Very deep convolutional networks for large-scale image recognition</article-title>
          .
          <source>arXiv preprint arXiv:1409.1556</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          34.
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Authorship attribution based on feature set subspacing ensembles</article-title>
          .
          <source>International Journal on Artificial Intelligence Tools</source>
          ,
          <volume>15</volume>
          (
          <issue>05</issue>
          ),
          <fpage>823</fpage>
          -
          <lpage>838</lpage>
          (
          <year>2006</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          35.
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>A survey of modern authorship attribution methods</article-title>
          .
          <source>Journal of the Association for Information Science and Technology</source>
          ,
          <volume>60</volume>
          (
          <issue>3</issue>
          ),
          <fpage>538</fpage>
          -
          <lpage>556</lpage>
          (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          36.
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tschuggnall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kestemont</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Overview of PAN-2018: Author Identification, Author Profiling, and Author Obfuscation</article-title>
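          . In:
          <string-name>
            <surname>Bellot</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Trabelsi</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mothe</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Murtagh</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nie</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soulier</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sanjuan</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cappellato</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          (eds.)
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction</source>
          . 9th Int. Conference of the CLEF Initiative (CLEF 18). Springer, Berlin Heidelberg New York (
          <year>2018</year>
          ).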
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>