<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Investigating the Influence of Selected Linguistic Features on Authorship Attribution using German News Articles</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Manuel Sage</string-name>
          <email>manuel.sage@mail</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pietro Cruciata</string-name>
          <email>pietro.cruciata@polymtl.ca</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Raed Abdo</string-name>
          <email>raed.abdo@mail</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jackie Chi Kit Cheung</string-name>
          <email>jcheung@cs</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yaoyao Fiona Zhao</string-name>
          <email>yaoyao.zhao@mcgill.ca</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Electrical Engineering, McGill University</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Mathematics and Industrial Engineering</institution>
          ,
          <addr-line>Polytechnic de Montreal</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Department of Mechanical Engineering, McGill University</institution>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>School of Computer Science, McGill University</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p />
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Abstract</title>
      <p>In this work, we perform authorship
attribution on a new dataset of German news
articles. We seek to assign over 3,700
articles to their five respective authors,
using four conventional machine learning
approaches (naïve Bayes, logistic
regression, SVM and kNN) and a convolutional
neural network (CNN). We analyze the effect of
character and word n-grams on the
prediction accuracy, as well as the influence
of stop words, punctuation, numbers, and
lowercasing when preprocessing raw text.
The experiments show that higher-order
character n-grams (n = 5, 6) perform
better than lower orders, and that word n-grams
slightly outperform character n-grams.
Combining both in fusion models further
improves results, up to 92% for SVM. A
multilayer convolutional structure allows
the CNN to achieve 90.5% accuracy. We
found stop words and punctuation to be
important features for author
identification; removing them leads to a
measurable decrease in performance. Finally, we
evaluate the topic dependency of the
algorithms by gradually replacing named
entities, nouns, verbs and eventually all
tokens in the dataset with their
POS tags.</p>
    </sec>
    <sec id="sec-2">
      <title>Introduction</title>
      <p>
        When the author of a text is of
particular interest, three main approaches exist
in the field of natural language processing (NLP):
author profiling, authorship verification, and
authorship attribution. Respectively, they aim to
detect characteristics of the author such as age
or gender, to measure the similarity between an
author’s work and a text in question, and
to identify the author of a text from a group of
candidate authors. All three approaches rest on
the assumption that individuals have unique
writing styles and habits
        <xref ref-type="bibr" rid="ref15">(Stamatatos, 2009)</xref>
        . In this
project, we focus on authorship attribution (AA),
a task popular in many areas such as literary
studies, history and forensic linguistics
        <xref ref-type="bibr" rid="ref5">(Evert et al.,
2017)</xref>
        . Anonymity and the potential for content creation
under false names on the internet have recently
increased interest in AA
        <xref ref-type="bibr" rid="ref1">(Aborisade and Anwar,
2018)</xref>
        ,
        <xref ref-type="bibr" rid="ref10">(Rocha et al., 2017)</xref>
        .
      </p>
      <p>
        Working on a new dataset of more than 3,700 German
news articles written by five authors, we carry out
a multiclass classification using different machine
learning (ML) models and linguistic features. As
ML models, we test multinomial naïve Bayes
(NB), logistic regression (LR), support vector
machines (SVM) and k-nearest neighbors (kNN)
using scikit-learn implementations, and a
convolutional neural network (CNN) for text
classification using PyTorch. We experiment with word
n-grams, character n-grams and fusions of both,
as well as punctuation, numbers, stop words, and
lowercasing. We further seek to evaluate the
topic dependency of the algorithms by gradually
replacing named entities, nouns, verbs and all tokens in
the dataset with the help of their part-of-speech
(POS) tags. This study aims to quantify the effect
of linguistic features on AA using a new source of
German news articles.
      </p>
      <p>
        Previous approaches to AA vary greatly in the
linguistic features applied, the ML models
implemented and the languages investigated. The combination
of character n-grams and traditional ML models
such as naïve Bayes has been used in many
publications on AA, for example by Amasyalı and Diri
(2006) on Turkish news articles, Markov et al.
(2017) on Portuguese news articles, and Oppliger
(2016) on instant messages in Swiss German.
Punctuation n-grams have been used as
stylometric features for English, French, Italian and
Spanish
        <xref ref-type="bibr" rid="ref8">(Martín-Del-Campo-Rodríguez et al., 2019)</xref>
        .
In addition, Khan (2018) and Schwartz (2016)
demonstrated the importance of stop words for AA
with English language data.
      </p>
      <p>Originally developed for computer vision tasks,
CNNs have recently been applied successfully to
various text classification tasks, including AA.
Ruder et al. (2016) presented state-of-the-art
results in large-scale AA on various social media
datasets. The authors implemented
different multi-channel word and character CNNs
and created hybrid models of both. On average, the
character CNN outperformed not only previous
models, but also word CNNs and hybrids. In the study
of Shrestha et al. (2017), a CNN with three
convolutional layers was trained on character n-grams
and outperformed traditional models as well as
recurrent neural networks (RNN) in recognizing the
authors of tweets.</p>
    </sec>
    <sec id="sec-3">
      <title>Dataset</title>
      <p>We collect a new dataset of newspaper articles
published by Main-Post, a German newspaper.
In cooperation with the newspaper, we select
the articles of five journalists from the
regional department in Schweinfurt. In their daily
work, the journalists cover similar topics, mostly
local news in and around the city. With this choice,
we hope to reduce the likelihood of classifying
authors by topic-specific vocabulary. The
newspaper reaches a daily circulation of around 40,000
in Schweinfurt. All articles (mostly
behind a paywall) can be accessed via the
company’s online presence.<sup>1</sup> In a first step, we collect
the web links to all articles for each author. Then,
a second script opens each link and extracts the
corresponding text into a CSV file. We clean the
collected articles by removing:</p>
      <p>Author names in the text where indicative of
the writer (e.g. comments);</p>
      <sec id="sec-3-1">
        <title>Articles written by multiple authors;</title>
        <p>Articles listed more than once (e.g. regional
and trans-regional versions);</p>
        <p>Non-text elements such as image or video
boxes that were downloaded due to variations
in the webpages’ HTML structures.</p>
        <p>The final dataset consists of 3,717 articles by
five different journalists, written between May
2013 and October 2019. The number of
articles per author is imbalanced and varies from 331
to 972. On average, an article is 455
words long and comprises 24 sentences and 7.5
paragraphs. The shortest article has 26 words,
the longest 2,299. The overall corpus size is 1.6
million words. Compared to other authorship
attribution datasets in the literature, our dataset contains
fewer authors but more data per author.
This makes the prediction task easier, but it also allows us to
draw more meaningful conclusions about the
effect of the analyzed linguistic features. The dataset is
not publicly available but can be obtained from the
authors of this paper on reasonable request.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Methodology</title>
      <sec id="sec-4-1">
        <title>Preprocessing &amp; Model Implementations</title>
        <p>As a first step, we split off a stratified test set
containing 20% of each author’s available articles. All
final models are evaluated by their prediction
accuracy on this test set. Since this is the first work
on a new dataset, we start by establishing two
baseline models. Because of their solid performance
in many applications, we choose naïve Bayes and
logistic regression, a generative and a
discriminative approach, respectively. We process the raw
text into word unigrams obtained by
splitting on whitespace and test the following
linguistic features:</p>
        <sec id="sec-4-1-1">
          <title>Punctuation (keep/remove);</title>
        </sec>
        <sec id="sec-4-1-2">
          <title>Numbers (keep/remove);</title>
          <p>Stop words (keep/remove), using NLTK’s list
of 232 German stop words;</p>
          <p>Lemmatization, using spaCy’s
implementation for German;</p>
          <p>Stemming, using NLTK’s German
SnowballStemmer;</p>
          <p><sup>1</sup> https://www.mainpost.de/regional/schweinfurt/</p>
        </sec>
        <sec id="sec-4-1-3">
          <title>Lowercasing.</title>
          <p>We further set a minimum document frequency
of 5 and tune the model-specific hyperparameters
via grid search. For NB, LR, SVM and kNN,
we vectorize the text by counting raw token frequencies
with scikit-learn’s CountVectorizer. In
the baseline experiments, the best parameter
combinations achieved an accuracy of 81.85% on the test set for naïve
Bayes and 90.59% for logistic regression. During
our work on the baselines, we observed an average
performance decrease of 2.8 percentage points for
lemmatization and 3.9 percentage points for
stemming, along with a drastic increase in runtime.
Thus, both techniques were excluded from the
following experiments. Instead, we focus on the
effect of punctuation, numbers, stop words, and
lowercasing on the prediction accuracy.</p>
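          <p>The baseline setup described above (stratified hold-out, whitespace unigrams, four keep/remove switches, minimum document frequency) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: all function names, the five-word stop-word set and the ASCII-only punctuation table are hypothetical stand-ins for NLTK’s stop-word list and scikit-learn’s CountVectorizer.</p>
          <preformat><![CDATA[
```python
import random
import re
import string
from collections import Counter

# Tiny illustrative stop-word set (assumption); the study uses NLTK's German list.
STOP_WORDS = {"der", "die", "das", "und", "in"}


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Hold out `test_fraction` of each author's articles; returns index lists."""
    rng = random.Random(seed)
    by_author = {}
    for idx, author in enumerate(labels):
        by_author.setdefault(author, []).append(idx)
    train, test = [], []
    for indices in by_author.values():
        rng.shuffle(indices)
        cut = round(len(indices) * test_fraction)
        test.extend(indices[:cut])
        train.extend(indices[cut:])
    return sorted(train), sorted(test)


def preprocess(text, keep_punctuation=True, keep_numbers=True,
               keep_stop_words=True, lowercase=False):
    """Apply the four keep/remove switches, then split on whitespace."""
    if lowercase:
        text = text.lower()
    if not keep_punctuation:
        # string.punctuation covers ASCII marks only (simplifying assumption).
        text = text.translate(str.maketrans("", "", string.punctuation))
    if not keep_numbers:
        text = re.sub(r"\d+", " ", text)
    tokens = text.split()
    if not keep_stop_words:
        tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    return tokens


def build_vocabulary(tokenized_docs, min_df=5):
    """Keep tokens that occur in at least `min_df` documents."""
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))
    return sorted(t for t, df in doc_freq.items() if df >= min_df)


def count_vector(tokens, vocabulary):
    """Raw token counts in vocabulary order (bag of words)."""
    counts = Counter(tokens)
    return [counts[t] for t in vocabulary]
```
]]></preformat>
          <p>In this sketch, the vocabulary would be built on the training articles only and then reused to vectorize the held-out articles, so that no information from the test set enters the features.</p>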
          <p>We expand our work with SVM and kNN
models, and add bigrams and trigrams, as well as
character n-grams of different lengths (n = 3 to 6).
Finally, for each model, we combine the best word
n-gram vectorizer with the best character n-gram
vectorizer in a fusion model. For every
combination of model and word/character n-gram, a
random search with 100 iterations tests the
aforementioned linguistic features and the
model-specific hyperparameters by averaging the results of
a 5-fold cross-validation. Then, each model’s best
configuration is trained on the whole training set
and its performance on the test set is reported.</p>
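          <p>To make the notion of a fusion model concrete, the following sketch extracts word and character n-grams and merges their counts into a single feature dictionary. It is an illustration only: the function names, the default n-gram ranges and the prefixing scheme are assumptions, and the random search over hyperparameters is not shown. In scikit-learn, comparable features are typically obtained through the analyzer and ngram_range parameters of CountVectorizer.</p>
          <preformat><![CDATA[
```python
from collections import Counter


def word_ngrams(tokens, n_min=1, n_max=3):
    """All word n-grams with n_min <= n <= n_max, joined by single spaces."""
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n_min=5, n_max=6):
    """All character n-grams (spaces included) with n_min <= n <= n_max."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


def fused_features(text):
    """Fusion: one count dictionary holding both feature families.

    The prefixes 'w:' and 'c:' keep a word n-gram and an identical
    character n-gram from being counted as the same feature.
    """
    features = Counter("w:" + gram for gram in word_ngrams(text.split()))
    features.update("c:" + gram for gram in char_ngrams(text))
    return features
```
]]></preformat>
          <p>A classifier trained on such fused counts sees both families of features at once, which is the idea behind combining the best word and character vectorizers.</p>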
          <p>
            The implementation of the CNN is based on the
work of
            <xref ref-type="bibr" rid="ref16">(Trevett, 2019)</xref>
            on CNNs for multi-class
sentiment analysis, which we adapt for our
experiments. In all setups, the network consists of an
embedding layer, at least one convolutional layer,
and a fully connected output layer. The model is
fed with pretrained German word embeddings,
trained on two million Wikipedia articles, with a
disk size of 6.4 GB
            <xref ref-type="bibr" rid="ref4">(Cieliebak et al., 2017)</xref>
            .<sup>2</sup> To
obtain results comparable to the other ML models
trained on word/character n-grams, we use
one-layer convolutional filters of size n after
tokenizing the text on word and character level (e.g. a filter
of size 2 for bigrams). Instead of fusions/hybrids
between words and characters, we optimize the
CNN by adding multiple convolutional layers to
achieve higher accuracies. In addition to testing
the linguistic features described above, we
experiment with different values for filter sizes,
number of filters and layers, dropout regularization,
max-pooling, and the size of the vocabulary. We
select Adam as the optimizer and cross-entropy as the loss
function. After splitting off the test set, 20% of
the remaining data is used for validation. During
training, the performance is validated after every
epoch and the overall best model parameters are
saved. For testing, we load the best parameters
and run the algorithm on the test set.
          </p>
          <p><sup>2</sup> https://www.spinningbytes.com/resources/</p>
          <p>
The five authors in the dataset work in the same
regional department. Nevertheless, each author has
special topics, for example certain cultural events
or news from particular villages. With this
information in the training data, the models might
learn to predict authors based on specific words
appearing in an article and not through each
author’s writing style. We seek to quantify this
assumption and evaluate how much the performance
of the established models depends on the
vocabulary used. Therefore, we replace all tokens from
different part-of-speech categories with
abbreviations and create the following four variations of
the dataset (including test set):
          </p>
        </sec>
        <sec id="sec-4-1-4">
          <title>Replace all named entities by ‘NE’;</title>
        </sec>
        <sec id="sec-4-1-5">
          <title>Additionally, replace all nouns by ‘NN’;</title>
          <p>Additionally, replace all verbs (including
auxiliary verbs) by ‘VB’;</p>
          <p>Finally, replace all tokens by their
corresponding Treebank POS tag.</p>
          <p>Then, we run the best-performing version of each
machine learning model on these variations of the
dataset and report the accuracy on the held-out test
set. We use spaCy’s pretrained German
tagger, which has a reported accuracy of 96.3% for
POS tagging.</p>
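          <p>The four dataset variations can be illustrated with a small replacement function. The sketch below assumes that every token has already been annotated by a tagger with a coarse word class, a fine-grained treebank tag and a named-entity flag; the tuple layout, the coarse labels NOUN, VERB and AUX, and the function name are hypothetical. Whether multi-token entities are collapsed into a single ‘NE’ is not specified in the text, so each token is replaced individually here.</p>
          <preformat><![CDATA[
```python
def reduce_topic(tagged_tokens, level):
    """Return the text with tokens replaced according to `level`.

    tagged_tokens: list of (text, coarse_pos, fine_tag, is_entity) tuples.
    level 1: named entities -> 'NE'
    level 2: additionally nouns -> 'NN'
    level 3: additionally verbs and auxiliaries -> 'VB'
    level 4: every token -> its fine-grained POS tag
    """
    if level not in (1, 2, 3, 4):
        raise ValueError("level must be 1, 2, 3 or 4")
    out = []
    for text, coarse, fine, is_entity in tagged_tokens:
        if level == 4:
            out.append(fine)
        elif is_entity:
            out.append("NE")
        elif level >= 2 and coarse == "NOUN":
            out.append("NN")
        elif level >= 3 and coarse in ("VERB", "AUX"):
            out.append("VB")
        else:
            out.append(text)
    return " ".join(out)


# Hand-annotated toy sentence (illustrative tags, not tagger output).
SENTENCE = [
    ("Anna", "PROPN", "NE", True),
    ("Schmidt", "PROPN", "NE", True),
    ("besucht", "VERB", "VVFIN", False),
    ("das", "DET", "ART", False),
    ("Fest", "NOUN", "NN", False),
    ("in", "ADP", "APPR", False),
    ("Schweinfurt", "PROPN", "NE", True),
    (".", "PUNCT", "$.", False),
]
```
]]></preformat>
          <p>Applied to the toy sentence, level 3 keeps only function words and punctuation as surface forms, which shows how little topical vocabulary remains for the classifiers at that stage.</p>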
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Results</title>
      <sec id="sec-5-1">
        <title>Preprocessing &amp; Model Implementations</title>
        <p>The left side of Table 1 shows the best results
obtained for the five implemented ML models.
Overall, we note high accuracies above 80% for all
models except the kNN approach. Logistic
regression and SVM performed best and achieved
almost the same results in all tested variations.
Except for kNN, word n-grams slightly outperformed
character n-grams. On the word level, adding
bigrams and trigrams outperformed unigrams alone, with
only marginal differences between the models. On
the character level, higher-order n-grams improved the
results. Combining the best character and word
n-grams in fusion models led to higher
accuracies for LR and SVM. Here, SVM delivered the
best predictions with 91.9% (macro-averaged
F1-score = 0.898). In all variations, SVMs with
linear kernels were more accurate than those with
polynomial kernels. On average, the accuracy
of the CNN was two percentage points below that of the
SVM. Both character and word n-gram features
achieved accuracies in the high 80s. The best
performance of the CNN was 90.5% (macro-averaged
F1-score = 0.855), obtained with two convolutional
layers of filter sizes 2 and 3, with 500 filters each,
after word-level tokenization. Smaller
vocabulary sizes of 5k and 10k improved results, as
did high dropout values of 0.5.</p>
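        <p>The results above report both accuracy and macro-averaged F1-score. The sketch below shows how the two metrics are computed and why the macro-averaged F1-score can fall below the accuracy when the classes are imbalanced, as the authors are in this dataset. The labels and predictions are toy values chosen for illustration, not results from the experiments, and the function names are hypothetical.</p>
        <preformat><![CDATA[
```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (classes taken from y_true)."""
    scores = []
    for label in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example: author "A" has four times as many articles as author "B".
y_true = ["A"] * 8 + ["B"] * 2
y_pred = ["A"] * 9 + ["B"]  # one article by "B" is misattributed to "A"
```
]]></preformat>
        <p>In the toy example a single error on the minority author lowers that author’s F1-score considerably, and since every author counts equally in the macro average, the macro-averaged F1-score ends up below the accuracy.</p>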
        <p>Besides word and character n-grams, the
conducted experiments allow us to quantify the effect
of stop words, punctuation, numbers and
lowercasing on the performance for this dataset. The
corresponding values are displayed in Table 2. They are
obtained by taking each model’s best
implementation, retraining it either with or without the feature
in question, and finally averaging the performance
difference over all models for each feature. For
all 28 random searches and 8 CNN configurations,
removing stop words or punctuation decreased the
prediction accuracies. For the most accurate
implementations, this resulted in an average decrease
of 1.23 and 1.06 percentage points, respectively.
Lowercasing and removing numbers, on the other
hand, barely influenced the results.</p>
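        <p>The averaging step described above amounts to a mean of per-model accuracy differences. The following sketch spells out that arithmetic; the function name, the model names and the accuracy values are placeholders invented for illustration and do not correspond to any result in this paper.</p>
        <preformat><![CDATA[
```python
def average_effect(acc_with, acc_without):
    """Mean accuracy change, in percentage points, caused by removing a feature.

    Both arguments map a model name to its accuracy in percent; a negative
    result means that removing the feature hurts on average.
    """
    diffs = [acc_without[model] - acc_with[model] for model in acc_with]
    return sum(diffs) / len(diffs)


# Placeholder accuracies (hypothetical, for illustration only).
with_feature = {"model_a": 90.0, "model_b": 88.0}
without_feature = {"model_a": 89.0, "model_b": 86.0}
effect = average_effect(with_feature, without_feature)
```
]]></preformat>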
      </sec>
      <sec id="sec-5-2">
        <title>Experiments with reduced topic-dependency</title>
        <p>The results of the experiments with reduced
topic dependency are presented in Table 1. Contrary to
our expectation, replacing named entities by ‘NE’ did not
affect the performance negatively.</p>
      </sec>
      <sec id="sec-5-3">
        <table-wrap id="tab-2">
          <label>Table 2</label>
          <table>
            <thead>
              <tr><th>Feature</th><th>Average effect on performance (percentage points)</th></tr>
            </thead>
            <tbody>
              <tr><td>Removing stop words</td><td>-1.23</td></tr>
              <tr><td>Removing punctuation</td><td>-1.06</td></tr>
              <tr><td>Removing numbers</td><td>-0.09</td></tr>
              <tr><td>Lowercasing</td><td>+0.09</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-5-4">
        <p>Instead, kNN, SVM and naïve Bayes slightly improved their
performance, and the SVM reached 92.2%
(macro-averaged F1-score = 0.901), the highest accuracy
in the project. Replacing nouns and then verbs
decreased the performance of all models yet still
allowed accuracies in the high 80s (except for kNN). Finally,
replacing the whole text with the corresponding
Treebank POS tags led to poorer yet reasonable results
above 76%, again excluding kNN.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Discussion &amp; Conclusion</title>
      <p>With logistic regression, SVM and CNN, three
models reached prediction accuracies over 90%,
while kNN proved less suitable for this
task. The CNN did not outperform traditional
approaches in this work. However, the many possible
network structures and tunable (hyper)parameters
of this model leave room for improvement. Our results
are in line with Sanchez-Perez et al. (2017), who
showed that higher orders of character n-grams
outperform lower orders. Combining word and character
n-grams also improved results. Exploring
longer character n-grams (n = 7, 8, ...) could therefore
extend this study.</p>
      <p>
        Regarding text preprocessing, we conclude that
most changes to the raw text, despite being
useful in other NLP domains, decrease the
performance on this task. Removing stop words
reduces the accuracy measurably, which confirms the
consensus in the literature. In the study of
        <xref ref-type="bibr" rid="ref3">Arun et al. (2009)</xref>
        , stop words play an essential role in
authorship attribution on English text documents.
We found punctuation to be similarly important,
whereas numbers and lowercasing barely affected
performance. Because the accuracy of NB and
LR dropped after lemmatization and stemming, we assume
that both techniques disguise characteristics of an
author’s writing style. However, this assumption
requires further evaluation. In the experiments
with reduced topic dependency, the models did
not depend on certain keywords, but rather on the
overall structure and writing style of an author’s
work. Replacing named entities improved
predictions. This contradicts the initial hypothesis and
other researchers’ work, such as Sanchez-Perez
et al. (2017), where accuracies dropped by
approximately 2-3 percentage points. We assume that the
overlap between the authors’ work is too large
to allow models to classify based on named
entities. Instead, removing them could reduce the
variance of the vectorizer and help the models focus on
more meaningful writing patterns. More
experiments, e.g. with POS n-grams, could further
improve results.
      </p>
      <p>Overall, this work demonstrated the importance
of stop words, punctuation, and fusions of word
and character n-grams for AA on German news
articles. It further revealed the potential of
POS tags as meaningful features for this task.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>We would like to thank Main-Post GmbH for their
cooperation in finding adequate authors and free
access to all articles.</p>
      <p>This research is supported by the Natural
Sciences and Engineering Research Council
of Canada Discovery Accelerator Supplements
RGPAS-2018-522708.</p>
      <p>The fourth author is supported in part by the
Canada CIFAR AI Chair program.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>O.</given-names>
            <surname>Aborisade</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Anwar</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Classification for Authorship of Tweets by Comparing Logistic Regression and Naive Bayes Classifiers</article-title>
          .
          <source>In 2018 IEEE International Conference on Information Reuse and Integration (IRI)</source>
          , pages
          <fpage>269</fpage>
          -
          <lpage>276</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Fatih</given-names>
            <surname>Amasyalı</surname>
          </string-name>
          and
          <string-name>
            <given-names>Banu</given-names>
            <surname>Diri</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>Automatic Turkish text categorization in terms of author, genre and gender</article-title>
          .
          <source>In International Conference on Application of Natural Language to Information Systems</source>
          , pages
          <fpage>221</fpage>
          -
          <lpage>226</lpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>R.</given-names>
            <surname>Arun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Suresh</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C. E. V.</given-names>
            <surname>Madhavan</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Stopword Graphs and Authorship Attribution in Text Corpora</article-title>
          .
          <source>In 2009 IEEE International Conference on Semantic Computing</source>
          , pages
          <fpage>192</fpage>
          -
          <lpage>196</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Mark</given-names>
            <surname>Cieliebak</surname>
          </string-name>
          , Jan Milan Deriu, Dominic Egger, and
          <string-name>
            <given-names>Fatih</given-names>
            <surname>Uzdilli</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>A twitter corpus and benchmark resources for German sentiment analysis</article-title>
          .
          <source>In Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media</source>
          , pages
          <fpage>45</fpage>
          -
          <lpage>51</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Stefan</given-names>
            <surname>Evert</surname>
          </string-name>
          , Thomas Proisl, Fotis Jannidis, Isabella Reger, Steffen Pielström, Christof Schöch, and
          <string-name>
            <given-names>Thorsten</given-names>
            <surname>Vitt</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Understanding and explaining Delta measures for authorship attribution</article-title>
          .
          <source>Digital Scholarship in the Humanities</source>
          ,
          <volume>32</volume>
          (
          <issue>suppl 2</issue>
          ):
          <fpage>ii4</fpage>
          -
          <lpage>ii16</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Jamal</given-names>
            <surname>Ahmad Khan</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>A Model for Style Breach Detection at a Glance: Notebook for PAN at CLEF 2018</article-title>
          . In CLEF (Working Notes).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Ilia</given-names>
            <surname>Markov</surname>
          </string-name>
          , Jorge Baptista, and Obdulia Pichardo-Lagunas.
          <year>2017</year>
          .
          <article-title>Authorship attribution in Portuguese using character n-grams</article-title>
          .
          <source>Acta Polytechnica Hungarica</source>
          ,
          <volume>14</volume>
          (
          <issue>3</issue>
          ):
          <fpage>59</fpage>
          -
          <lpage>78</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Carolina</given-names>
            <surname>Martín-Del-Campo-Rodríguez</surname>
          </string-name>
          , Daniel Alejandro Pérez Alvarez, Christian Efraín Maldonado Sifuentes, Grigori Sidorov, Ildar Batyrshin, and
          <string-name>
            <given-names>Alexander</given-names>
            <surname>Gelbukh</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Authorship attribution through punctuation n-grams and averaged combination of SVM notebook for PAN at CLEF 2019</article-title>
          .
          <source>In CEUR Workshop Proceedings</source>
          , volume
          <volume>2380</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Rahel</given-names>
            <surname>Oppliger</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Automatic authorship attribution based on character n-grams in Swiss German</article-title>
          . Bochumer Linguistische Arbeitsberichte, (
          <volume>16</volume>
          ):
          <fpage>177</fpage>
          -
          <lpage>185</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>A.</given-names>
            <surname>Rocha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. J.</given-names>
            <surname>Scheirer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. W.</given-names>
            <surname>Forstall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Cavalcante</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Theophilo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. R. B.</given-names>
            <surname>Carvalho</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Authorship Attribution for Social Media Forensics</article-title>
          .
          <source>IEEE Transactions on Information Forensics and Security</source>
          ,
          <volume>12</volume>
          (
          <issue>1</issue>
          ):
          <fpage>5</fpage>
          -
          <lpage>33</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Sebastian</given-names>
            <surname>Ruder</surname>
          </string-name>
          , Parsa Ghaffari, and John G Breslin.
          <year>2016</year>
          .
          <article-title>Character-level and multi-channel convolutional neural networks for large-scale authorship attribution</article-title>
          .
          <source>arXiv preprint arXiv:1609.06686</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Miguel A</given-names>
            <surname>Sanchez-Perez</surname>
          </string-name>
          , Ilia Markov, Helena Gómez-Adorno, and
          <string-name>
            <given-names>Grigori</given-names>
            <surname>Sidorov</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Comparison of character n-grams and lexical features on author, gender, and language variety identification on the same Spanish news corpus</article-title>
          .
          <source>In International Conference of the Cross-Language Evaluation Forum for European Languages</source>
          , pages
          <fpage>145</fpage>
          -
          <lpage>151</lpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Maxwell B</given-names>
            <surname>Schwartz</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>An examination of cross-domain authorship attribution techniques</article-title>
          . CUNY Academic Works. https://academicworks.cuny.edu/gc_etds/1573.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>Prasha</given-names>
            <surname>Shrestha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Sebastian</given-names>
            <surname>Sierra</surname>
          </string-name>
          , Fabio González, Manuel Montes, Paolo Rosso, and
          <string-name>
            <given-names>Thamar</given-names>
            <surname>Solorio</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Convolutional Neural Networks for Authorship Attribution of Short Texts</article-title>
          .
          <source>In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers</source>
          , pages
          <fpage>669</fpage>
          -
          <lpage>674</lpage>
          , Valencia, Spain. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>Efstathios</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>A survey of modern authorship attribution methods</article-title>
          .
          <source>Journal of the American Society for Information Science and Technology</source>
          ,
          <volume>60</volume>
          (
          <issue>3</issue>
          ):
          <fpage>538</fpage>
          -
          <lpage>556</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>Ben</given-names>
            <surname>Trevett</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>PyTorch sentiment analysis</article-title>
          . https://github.com/bentrevett/pytorch-sentiment-analysis.
          Accessed: 2019-12-04.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>