<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Improving Cross-domain Authorship Attribution by Combining Lexical and Syntactic Features</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Martijn Bartelds</string-name>
          <email>m.bartelds.2@student.rug.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wietse de Vries</string-name>
          <email>w.de.vries.21@student.rug.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Groningen</institution>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <abstract>
        <p>Authorship attribution is a problem in information retrieval and computational linguistics that involves attributing an unknown document to an author within a set of candidate authors. PAN-CLEF 2019 organized a shared task on this problem that involves creating a computational model that can determine the author of a fanfiction story. The task is cross-domain because of the open set of fandoms to which the documents belong. Additionally, the set of candidate authors is open, since the actual author of a document may not be among the candidate authors. We extracted character-level, word-level and syntactic information from the documents in order to train a support vector machine. Our approach yields an overall macro-averaged F1 score of 0.687 on the development data of the shared task. This is an improvement of 18.7% over the character-level lexical baseline. On the test data, our model achieves an overall macro-averaged F1 score of 0.644. We compare different feature types and find that character n-grams are the most informative feature type, though all tested feature types contribute to the performance of the model.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Authorship attribution is an established research area in computational linguistics that
aims to determine the author of a document by taking the writing style of the author
into account. Typically, a system assigns a candidate author to an anonymous text by
comparing the anonymous text to a set of writing samples from possible authors. Currently,
authorship attribution is a topic of great interest, as the
authenticity of information presented in the media is often questioned. Consequently,
any successful attempt at revealing the author behind a text will improve
transparency and ideally remove uncertainty about the validity of the
information presented. As a consequence, authorship attribution is
closely related to research in privacy, law, cyber-security, and social
media analysis.</p>
      <p>In the PAN-CLEF 2019 shared task on authorship attribution, a cross-domain
authorship attribution task was proposed based on fanfiction texts. More specifically,
fanfiction texts are written by admirers of a certain author and these fanfiction texts are
known to substantially borrow characteristics from the original work. This task can be
considered cross-domain, since the documents of known authorship are not necessarily
collected within the same thematic domain or genre. Moreover, this task extends
beyond closed-set attribution conditions, as the true author of a given text in the target
domain is not necessarily included in the set of candidate authors.</p>
      <p>
        In this work, we present the methodology and results of our submission to the
PAN-CLEF 2019 cross-domain authorship attribution task. We developed our approach on
the documents of all four languages provided in the PAN-CLEF 2019 data
set: English, French, Italian, and Spanish. Previous research
showed the effectiveness of textual features such as character-level n-grams for
authorship attribution problems, since these are capable of representing more abstract
writing style characteristics rather than generating a representation that is purely
related to the content of the document [
        <xref ref-type="bibr" rid="ref18 ref19 ref21">18, 19, 21</xref>
        ]. However, these features are limited in that they often result in sparse representations for
documents of insufficient size. Therefore, we aimed to create an approach that
builds textual representations at different levels of the training documents.
In this way, we intended to prevent overfitting by developing a model that yields robust
performance across the different genres.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>In this section, we will describe some of the most successful modern authorship
attribution methods. Furthermore, a brief overview of the subfield of cross-domain authorship
attribution will be provided.</p>
      <p>
        Generally, there is a distinction between the use of profile-based and
instance-based approaches in solving authorship identification problems [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. In profile-based
approaches, all available training texts per author are concatenated to create a
cumulative representation of the author’s writing style. In contrast, instance-based approaches
treat each training text as an individual representation of the author’s writing style.
When these approaches are compared, it has been shown that the profile-based approach may be
advantageous when only training documents of limited size are available [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. In that
case, the concatenation of the training documents may lead to a more reliable
representation of the author’s writing style. In contrast, the implementation of an
instance-based approach ensures that interactions between several stylometric features can be
captured, even when the distributions of these features differ between documents that
are written by the same author. As an extension to both the profile-based and
instance-based approaches, hybrid approaches have been proposed that integrate aspects of both the
profile-based and instance-based approaches [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. In such approaches, a single vector per author is
produced by averaging the vectors of the individually represented training texts. We
argue that these hybrid approaches might be superior to the profile-based and
instance-based approaches, since they could capture more reliable characteristics of writing style
across multiple documents of the same author.
      </p>
      <p>
        The results of the PAN-CLEF 2018 shared task on
cross-domain authorship attribution show that the best performance was obtained with
character-level and word-level n-grams as textual features [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
These were normalized by a tf-idf weighting scheme and used in combination with
support vector machines. Previous research on cross-domain authorship attribution
endorses the effectiveness of support vector machines applied to character-level
n-grams [
        <xref ref-type="bibr" rid="ref18 ref19 ref21">18, 19, 21</xref>
        ]. This suggests that these methods remain a
valuable strategy for solving cross-domain authorship attribution problems.
      </p>
      <p>
        In previous research, thirty-nine different types of textual features that are often
used in modern authorship attribution studies were compared [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. In this study, a token-normalized
version of the punctuation frequency was the most successful feature
for discriminating between the different authors. Moreover, character-level bi-grams and
tri-grams were also among the most promising textual features, and this
result is substantiated by numerous other studies [
        <xref ref-type="bibr" rid="ref13 ref2 ref22 ref6 ref7">2, 6, 7, 13, 22</xref>
        ].
Following this, we decided to include these features in our own approach by creating
a feature representation that is tailored to the data set we had available.
      </p>
      <p>
        Furthermore, research on cross-domain authorship attribution showed that
topic-dependent information can be discarded from documents by carrying out several
pre-processing steps [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. It was suggested to replace all digits by zero, separate punctuation
marks from their adjacent words, and replace named entities with a dummy symbol. The
latter pre-processing step is effective, since named entities are often strongly related to
the topic of a document. After these steps, character-level n-grams were extracted from
the documents, and an increase in performance was reported when these pre-processed
character-level n-grams were used. Moreover, attribution performance can be improved
by applying a frequency threshold to the extracted character-level n-gram
representations [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. As such, the least frequently occurring n-grams associated with topic-specific
information should be removed from the model. In our work, we extend
this approach by implementing an adaptation of the suggested frequency threshold.
Moreover, we attempt to improve model performance by applying these pre-processing
steps not solely to character-level n-grams, but to multiple textual feature levels inside
the training documents.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Data and Resources</title>
      <p>A development data set is provided by the organizers of the authorship attribution task.
The goal of the shared task is not to train a model on known training data and to test it on
unknown test data, but rather to design a model that can be trained on unknown training
data and then be tested on unknown test data. The development data set is not called a
training data set, because there is no overlap in candidate authors in the development
data and the undisclosed data that the models are trained and evaluated on for the shared
task. Instead, the development data contains twenty separate problem sets, each with a
training part and a test part. The final evaluation will be performed on a similar set of
problem sets.</p>
      <p>The development data set contains twenty problem sets in four different languages,
resulting in five problem sets per language. The set of languages consists of English,
French, Italian and Spanish. Each problem set contains nine candidate authors with
seven known documents each. The task is to assign each document in a set of unknown
documents to a candidate author within the problem set, if the author of the unknown
document is actually in the candidate set. It is also possible that the actual author
is unknown, in which case the unknown document should be given an unknown label.
For evaluation, a separate corpus with similar characteristics to the
development corpus is held back. The evaluation data contains problem sets in the same languages as the
development data, but there is no overlap in authors between the development and test sets.
Therefore, no features can be learned for specific authors before testing.</p>
      <p>The documents that are to be classified consist of a set of fanfiction stories with a
length of 500 to 1000 tokens each. The stories were scraped from an online fanfiction
website. Candidate authors write stories in different fandoms, so it is important that the
model will learn author-specific textual features and not features that are inherent to
the fandom or the universe that the story takes place in. Because of the large content
differences between fandoms, the fandoms are considered to be different domains.</p>
      <p>The number of unknown documents per problem set is highly variable: it
ranges between 46 and 561 documents per problem set. Moreover, occurrences of
candidate authors in the unknown documents are not uniformly distributed either. This is
not inherently important for development and methodology design choices. However,
very rare occurrences of certain candidate authors may have large effects on evaluation
scores during development, since scores are macro-averaged per candidate within the
problem sets. The fraction of unknown documents that are not written by any known
author is also variable, although overall a third of all
documents within the development set are written by an unknown author.</p>
      <p>The imbalance of the testing parts of the problem sets in the development data does
not influence model training, since the distribution is unknown at training time. During
the development of our methodology, however, the average result may be strongly
influenced by unbalanced problem sets. Our model should be able to work with
problem sets with unknown distributions, so the development data should be balanced
if we want to trust the results. Our algorithmic and hyper-parameter choices are
therefore not based on the given configuration of the development data set. Instead, we
merged the problem sets per language and evaluated on random permutations of
candidate authors. Our new shuffled problem sets contained nine random candidate authors
in the same language with seven random known documents for each author.
These known documents may include documents that were originally meant for
model testing. The test data for our new shuffled problem sets contained up to 100
documents per candidate author that were not used for training. The set of test documents
was extended with documents that were not written by the candidate authors. These
documents served as the documents with an unknown author, and a third of each generated
problem set consists of such documents. This method enabled
us to generate a large number of unique permutations of problem sets. For the
development of our methodology, we evaluated our choices by using a fixed set of twenty
shuffled problem sets per language, as opposed to the original five problem sets per
language. The original development problem sets are used for validation, and the results
in this paper are evaluated on the original problem sets.</p>
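      <p>The construction of a shuffled problem set can be sketched as follows. This is a minimal illustration and not the code from our repository: the function name, the dictionary layout of the input, the label UNKNOWN and the reading that unknown-author documents make up a third of the test part are all assumptions made for the example.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import random


def make_shuffled_problem(docs_by_author, n_candidates=9, n_known=7,
                          max_test_per_author=100, unknown_fraction=1 / 3,
                          unknown_label="UNKNOWN", seed=0):
    """Build one shuffled problem set from the merged documents of one language.

    docs_by_author maps an author id to a list of documents (hypothetical layout).
    Returns (train, test), each a list of (document, label) pairs.
    """
    rng = random.Random(seed)
    # Only authors with more than n_known documents can contribute test documents.
    eligible = sorted(a for a, docs in docs_by_author.items() if len(docs) > n_known)
    candidates = rng.sample(eligible, n_candidates)

    train, test = [], []
    for author in candidates:
        docs = list(docs_by_author[author])
        rng.shuffle(docs)
        train += [(doc, author) for doc in docs[:n_known]]
        test += [(doc, author) for doc in docs[n_known:n_known + max_test_per_author]]

    # Documents by non-candidate authors play the role of the unknown class.
    others = [doc for a in sorted(docs_by_author) if a not in candidates
              for doc in docs_by_author[a]]
    rng.shuffle(others)
    # Pick the count so that unknowns form `unknown_fraction` of the test part
    # (assumed reading of "a third of each generated problem set").
    n_unknown = round(len(test) * unknown_fraction / (1 - unknown_fraction))
    test += [(doc, unknown_label) for doc in others[:n_unknown]]
    return train, test


# Toy corpus: 12 authors with 11 documents each.
corpus = {f"author{i}": [f"a{i}-doc{j}" for j in range(11)] for i in range(12)}
train, test = make_shuffled_problem(corpus, seed=42)
```
]]></preformat>
      <p>Calling the function with different seeds yields different permutations of candidate authors, which is how a fixed set of twenty shuffled problem sets per language could be drawn.</p>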
    </sec>
    <sec id="sec-4">
      <title>Methods</title>
      <p>The code that is used for our authorship attribution approach is fully open-source and
available at https://github.com/wietsedv/pan19-cross-domain-authorship-attribution.</p>
      <sec id="sec-4-1">
        <title>Classification approach</title>
        <p>
          Because of their success in previous authorship attribution approaches, we chose to
use a support vector machine (SVM) with document-level features. We extract different types
of textual features from the known documents, which are used to train an
SVM model for each problem set. We used the SVM implementation that is
available in the Scikit-learn Python package [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. Hyper-parameters are tuned globally
using a grid search method with our randomly permuted problem sets. Therefore, the
hyper-parameters are equal for all languages. The specific values of the hyper-parameters will
be discussed in Section 4.3. Hyper-parameters that are specific to feature types are
tuned separately from each other using separate grid searches within intuitively
plausible parameter ranges. This constrained grid search was chosen because of computational
limitations, but also to prevent overfitting on the hyper-parameters.
        </p>
        <p>The classifier that is used is a support vector machine classifier with a linear kernel.
Multiple classes are handled using the one-vs-rest scheme. We also tried other
SVM kernels as well as a random forest classifier, but preliminary results
indicated that these options are unlikely to lead to better classification accuracy
in this task.</p>
        <p>
          The support vector machine classifier has to be reasonably certain in its candidate
author choice, since there are also test documents that have an unknown author. This
is achieved by setting a probability threshold for the support vector machine
classifications. Probabilities are calculated in the SVM model by scaling the distance of the
sample to each hyper-plane between zero and one. SVM predictions and probability values
can be heavily influenced by single training documents because of the small number
of documents for each author and the high risk of learning fandom-specific features.
Additionally, SVM outputs may not be reliable estimators of probability. Therefore,
we obtain more reliable probabilities by using probability
calibration [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. In this process, five-fold cross validation is applied on the training data
to train five separate classifiers. In each of these classifiers, probability estimations are
calibrated using Platt scaling [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. Probability estimations of the five classifiers are
averaged to get the final probabilities.
        </p>
        <p>Each test document is attributed to the most probable author if and only if the
difference between the maximum probability and the second highest probability is
at least 0.1. As opposed to an absolute minimum probability, this minimum difference
threshold is less sensitive to different probability distributions. Contrasting distributions
in different languages or problem sets may result in very different maximum
probabilities. However, we are only interested in cases where the most probable choice is
clearly more likely than the second most probable choice. The minimum
probability difference of 0.1 was selected by using a grid search with values between
0.01 and 0.3 with intervals of 0.05.</p>
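        <p>The rejection rule described above can be written in a few lines. The sketch below is illustrative only: the function name, the dictionary input and the label UNKNOWN are assumptions, and the probabilities are taken to be the calibrated, averaged values for a single test document.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def attribute(probabilities, min_difference=0.1, unknown_label="UNKNOWN"):
    """Return the most probable author, or unknown_label if the margin is too small.

    probabilities maps each candidate author to a calibrated probability
    for one test document (hypothetical input format).
    """
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    (best_author, best_p), (_, second_p) = ranked[0], ranked[1]
    # Accept only when the best candidate leads the runner-up by the margin.
    if best_p - second_p >= min_difference:
        return best_author
    return unknown_label


confident = attribute({"author_a": 0.6, "author_b": 0.3, "author_c": 0.1})
uncertain = attribute({"author_a": 0.40, "author_b": 0.35, "author_c": 0.25})
```
]]></preformat>
        <p>In the first call the margin is 0.3, so the document is attributed to the top candidate. In the second call the margin is only 0.05, so the document receives the unknown label even though one candidate has the highest probability.</p>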
        <p>The features that are used with the SVM consist of a union of six different feature
types that will be described in Section 4.3. The different types of features rely on
different representations of the documents, for which pre-processing is required. In the next
section (Section 4.2), the pre-processing steps that are needed to extract
these linguistic features are described.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Pre-processing</title>
        <p>
          The first pre-processing step is to tokenize the documents. Tokenization is done using the
UDPipe [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] tokenizer to get tokens in a format that can be used by both the
part-of-speech tagger and the dependency parser. For part-of-speech tagging, Structbilty [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ],
a Bi-LSTM sequence tagger, is trained on Universal Dependencies data sets [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
Validation set accuracy scores after training are all between 0.95 and 0.98 for
the four different languages. More precisely, the part-of-speech tagger is trained on
the UD_English-EWT, UD_French-GSD,
UD_Italian-ISDT and UD_Spanish-GSD data sets.
        </p>
        <p>
          Dependency parses are provided by the UUParser [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. UUParser was trained using
the same data sets as were used by the part-of-speech tagger. When using the document
tokens as input, the parser achieves validation LAS scores between 0.83 and 0.89 on
the Universal Dependencies development treebanks. To improve performance, we also
trained the parser by using ELMo embeddings [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] of the documents instead of the
tokens. The ELMo embeddings were extracted using pre-trained ELMo representations
[
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. As a result, the validation scores after training the parser ranged between 0.86 and
0.90, which is a considerable improvement over the original model. Calculating ELMo
representations is, however, an expensive process, and because of hardware limitations in
the shared task setup, we decided to use the token-based dependency parses instead of
the ELMo-based parses. This compromise does not have large negative effects on
the classifier performance, as will be discussed in Section 5.
        </p>
      </sec>
      <sec id="sec-4-3">
        <title>Features</title>
        <p>Different textual feature types are extracted from the documents independently of
each other. Each feature type yields numeric features that are linearly scaled between
zero and one. Subsequently, the dimensionality of each feature type is reduced to 150 by
applying truncated singular value decomposition. After this dimensionality reduction
for each feature type, all features are combined into a single feature set. The
dimensionality is reduced before combining the features to make sure that all feature types
are fairly represented in the feature set. As discussed before, the hyper-parameters for
feature extraction are tuned per feature type using a grid search approach. The tuned
hyper-parameters for features include the n-gram range, the use of tf-idf, maximum document
frequencies and minimum group document frequencies. The minimum group
document frequency is a threshold that we introduced to ensure that any feature is
present in at least n documents with the same target label during training. This
hyper-parameter eliminates features that are only present in a few documents written by
an author, which suggests that the feature is domain-related instead of author-related.</p>
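        <p>The minimum group document frequency can be illustrated with a small filter. The sketch below is one possible reading, not the implementation in our repository: it assumes that a feature is kept when it occurs in at least the threshold number of training documents of at least one author, and the function name and input format are hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from collections import defaultdict


def group_df_filter(doc_features, labels, min_group_df):
    """Return the features that pass the minimum group document frequency.

    doc_features is a list with one set of features per training document,
    labels holds the author label of each document (hypothetical layout).
    Assumed rule: keep a feature if some author has it in >= min_group_df documents.
    """
    counts = defaultdict(lambda: defaultdict(int))  # feature -> author -> doc count
    for features, author in zip(doc_features, labels):
        for feature in set(features):
            counts[feature][author] += 1
    return {feature for feature, per_author in counts.items()
            if max(per_author.values()) >= min_group_df}


docs = [{"x", "y"}, {"x"}, {"y", "z"}, {"z"}]
authors = ["a", "a", "b", "b"]
kept = group_df_filter(docs, authors, min_group_df=2)
```
]]></preformat>
        <p>In this toy example the feature y occurs in two documents overall, but only once per author, so it is removed, whereas x and z each occur twice within the documents of a single author and are kept. A plain document frequency threshold of two would have kept all three.</p>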
        <p>For each of the feature types, we explored n-gram ranges between one and five.
Subsequently, the following features were extracted from the documents:</p>
        <p>Character n-grams The first feature type that is included in our model is based on the
tf-idf scores of character n-grams in the raw document text. The value of n after tuning
ranges between two and four.</p>
        <p>Punctuation n-grams This feature type consists of n-grams of consecutive
punctuation tokens where non-punctuation tokens are skipped. For example: this feature type
contains the bi-gram ",." if a sentence contains a comma and ends with a dot.
Occurrences of uni- and bi-grams are counted and used as features.</p>
        <p>Token n-grams This feature type consists of the counts of token n-grams in the
tokenized text. Only bi-grams are counted for this feature type, and only bi-grams that
occur in at least five documents are included.</p>
        <p>Part-of-speech n-grams This feature type consists of the counts of n-grams of the
part-of-speech tags corresponding to each of the tokens in the document text. The value
of n after tuning ranges between one and four.</p>
        <p>Dependency relations syntactic n-grams This feature type consists of sequences of
dependency relations. These sequences are created by chaining the syntactic relations
between words. For instance, a bi-gram consists of the relation between a word and its
head, and the relation between the head and its head. Note that this chaining procedure
is different from the positional ordering of the words. For dependency relation syntactic
n-grams, only uni-grams and bi-grams are included that occur at least thrice in the
training documents of a candidate author. The dependency relation syntactic bi-grams,
for instance, include nsubj ROOT if a sentence contains a nominal subject that is
connected to the root of the sentence.</p>
        <p>Token syntactic n-grams This feature type consists of actual words in syntactic
n-gram relations. The same chaining procedure is used, but actual tokens are chained
instead of relation labels. Token syntactic n-grams also have a minimum group document
frequency of three. The n-gram range for this feature type is two to three.</p>
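        <p>Two of the feature types described above, punctuation n-grams and dependency relation syntactic n-grams, can be illustrated with the following sketch. It is a simplified illustration under stated assumptions rather than our actual feature extractors: punctuation is detected with the ASCII set in string.punctuation, the parse is given as a list of 1-based head indices with 0 for the root, and both function names are hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import string
from collections import Counter


def punctuation_ngrams(tokens, n_values=(1, 2)):
    """Count n-grams over the punctuation tokens only, skipping all other tokens."""
    punct = [t for t in tokens if t and all(ch in string.punctuation for ch in t)]
    counts = Counter()
    for n in n_values:
        for i in range(len(punct) - n + 1):
            counts["".join(punct[i:i + n])] += 1
    return counts


def dependency_ngrams(heads, deprels, n_values=(1, 2)):
    """Count chains of dependency labels that follow head links upwards.

    heads[i] is the 1-based index of the head of token i + 1 (0 marks the root);
    deprels[i] is the relation label of token i + 1.
    """
    counts = Counter()
    for start in range(len(heads)):
        chain, node = [], start
        for _ in range(max(n_values)):
            chain.append(deprels[node])
            if len(chain) in n_values:
                counts[" ".join(chain)] += 1
            if heads[node] == 0:  # reached the root, the chain cannot grow further
                break
            node = heads[node] - 1
    return counts


punct_counts = punctuation_ngrams(["Well", ",", "this", "works", "."])
# "She reads books": "She" and "books" both attach to the root verb "reads".
dep_counts = dependency_ngrams(heads=[2, 0, 2], deprels=["nsubj", "ROOT", "obj"])
```
]]></preformat>
        <p>For the first example sentence, the comma and the final dot produce the punctuation bi-gram ",." although they are not adjacent in the text. For the second, the chain from the subject to the root gives the bi-gram nsubj ROOT, which depends on the tree structure and not on word order. Token syntactic n-grams follow the same chaining with word forms in place of relation labels.</p>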
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Results</title>
      <p>Following the description of the PAN-CLEF 2019 shared task, we aimed to determine
the author of a fanfiction text among a list of candidate authors. To this end, we
created a support vector machine approach with multiple features derived from different
textual levels inside the documents.</p>
      <sec id="sec-5-1">
        <title>Individual feature types</title>
        <p>
          In Figure 1, both the performance of the individual feature types and the overall
performance of our approach are visualised. Just like the evaluation of the shared task, the
performance was evaluated by calculating macro-averaged F1 scores that were
calculated when the unknown target labels were excluded [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. From these results, we observe
that the character-level n-grams yielded the best performance among the tested
feature types. In contrast, the lowest performance was obtained with the
textual representation that counted occurrences of syntactic token n-grams. These
observations apply to all the languages that were included in our data set. Examining the
individual languages in particular, we note that our model performed well for the
Italian language. This observation applies to both the performance of the individual feature
types and the performance when all feature types are combined. Furthermore,
the performance per feature type is the lowest for the English and
French languages, suggesting that the features might have less predictive
power for these languages than for the other languages. However, the
difference may also be an artifact of the data set.
        </p>
        <p>When we compare the performance of our character-level n-gram feature
representation to the performance of the baseline approach that was provided by PAN-CLEF
2019, we observe that we outperformed the baseline approach by 6.7%. More
specifically, the PAN-CLEF 2019 baseline approach obtained a macro-averaged F1 score of
0.579 across all languages. This approach consisted of a character-level tri-gram
representation in combination with a linear support vector machine, and a simple probability
threshold rejection option was included to assign an unknown document to the unknown
class. Our performance gain was calculated across all four languages that were included
in the data set, and an even larger performance gain with respect to the baseline is
obtained when we compare the performance of our system with all features included:
in that case, we outperformed the baseline approach by 18.7%.</p>
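        <p>As a quick check on the arithmetic, the gain of the full model is consistent with a relative improvement of its development score (0.687) over the baseline score (0.579), assuming that the percentage is relative to the baseline rather than an absolute difference in F1:
          <disp-formula><tex-math><![CDATA[\frac{0.687 - 0.579}{0.579} = \frac{0.108}{0.579} \approx 0.187]]></tex-math></disp-formula>
        </p>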
        <p>In order to clarify the contributions of the individual feature types to the overall
performance of our approach, we performed an ablation study. We started with the
complete feature set and then removed each feature type from the feature set
in turn. Table 1 shows that the largest decrease in performance
(13.7%) was obtained when the character-level n-gram feature was omitted from the
feature set. This finding corresponds well to the effect that was previously
described and visualised in Figure 1. The remaining results of the ablation study can be
compared with their counterparts in Figure 1 in the same way.
We observed higher scores for the individual
performance of the token-level n-grams than for the individual performance of
the part-of-speech-level n-grams. In the
ablation study in Table 1, however, we observe the opposite effect. This suggests that solely
using token-level n-grams achieves better performance than solely using part-of-speech
based n-grams, but that the information that is captured by token-level n-grams seems
to be captured by other feature types as well, whereas part-of-speech based n-grams provide
additional information. This observation confirms the power of combining different
feature types that may not be good predictors individually.</p>
      </sec>
      <sec id="sec-5-2">
        <title>ELMo embeddings</title>
        <p>
          Table 2 compares the performance of the dependency relation
syntactic n-grams with the performance of the complete
feature set. With this comparison, we wanted to examine whether the use of ELMo
embeddings improved the general performance of our approach, and to observe
the effect of ELMo embeddings on the results produced by the dependency parser.
Table 2 reports the performance per language and
distinguishes between F1 scores that were obtained when we trained the parser using
ELMo embeddings and F1 scores that were obtained when we trained the dependency
parser using regular tokens. We observe that the inclusion of ELMo
embeddings had the largest advantageous effect on the problem sets in the
English language, for both the dependency relation syntactic n-gram feature type and the
complete feature set. An additional increase in performance was observed for the
dependency relation syntactic n-gram feature type for the French language. In all
other cases, the use of ELMo embeddings did not have any effect or even resulted in a
decrease in performance. Given that the calculation of ELMo embeddings is
an expensive process, we argue that including ELMo embeddings for the derivation of
syntactic information is not beneficial for this task.
        </p>
        <p>In conclusion, with the best feature settings included in our approach, we obtained an
average macro F1 score of 0.687 on the development data. This
score is the average of the F1 scores of the four individual
languages. Per language, our model performed best for the Italian problem sets (0.777),
followed by the Spanish (0.730), French (0.684) and English (0.556) problem
sets.</p>
        <p>
          Based on the previously described results, we submitted the full model with all feature types to the TIRA
submission platform for the shared task testing phase [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. The
overall macro F1 score of our model on the test data is 0.644, which is slightly lower than our macro
F1 score on the development data. The English and French testing scores are 0.558 and
0.687, respectively. These two scores are marginally higher than the scores on the
development data, which indicates that our methodology appears to be robust for these
languages. The Italian and Spanish scores were highest during development, but these
scores dropped to 0.700 and 0.629, respectively. The Italian and Spanish
testing scores are more similar to the English and French results, which indicates that our
model may perform more consistently across languages than it seemed during
development. Therefore, the decrease in performance for Italian and Spanish may be a
positive result.
        </p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>In this paper, we presented a support vector machine approach with multiple features
derived from different textual levels of the documents. The support vector machine
used a linear kernel, and the multiple classes presented to the classifier were handled
with the one-vs-rest scheme. To deal with the open-set attribution conditions, we
implemented a probability threshold that was taken into account when computing the
support vector machine classifications. The textual features consisted of a union of six
feature types, each corresponding to a distinct textual level of representation:
character-level n-grams, punctuation-level n-grams, token-level n-grams,
part-of-speech n-grams, dependency relation syntactic n-grams, and token syntactic
n-grams. After hyper-parameter tuning for these features, we obtained an average
macro F1 score of 0.687 on the development data and an average macro F1 score of
0.644 on the test data.</p>
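      <p>The classification set-up summarized above can be sketched with scikit-learn. The following is a minimal sketch under stated assumptions, not the implementation used for the submission: it combines only two of the six feature types (character and word n-grams, because the syntactic feature types require a tagger and a parser), and the TF-IDF weighting, the n-gram ranges, the toy data, the <monospace>UNK</monospace> label, the function names, and the rule for applying the threshold (reject a document when its highest class probability falls below the threshold) are all assumptions made for illustration.</p>
      <preformat>
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import SVC

UNKNOWN = "UNK"  # hypothetical label for "author not among the candidates"

# Toy training data, invented for illustration only.
TRAIN_TEXTS = [
    "the dragon slept under the quiet mountain",
    "the dragon woke and the mountain trembled",
    "under the mountain the old dragon waited",
    "the quiet dragon guarded the mountain gold",
    "she ran; she laughed; she never looked back!",
    "he ran; he shouted; he never came home!",
    "they ran; they sang; they never stopped!",
    "we ran; we danced; we never grew tired!",
    "rain fell softly on silver streets tonight",
    "snow fell softly on silver roofs tonight",
    "ash fell softly on silver fields tonight",
    "mist fell softly on silver lakes tonight",
]
TRAIN_AUTHORS = ["a"] * 4 + ["b"] * 4 + ["c"] * 4


def build_model():
    """Union of two n-gram feature types feeding a linear one-vs-rest SVM."""
    features = FeatureUnion([
        ("char", TfidfVectorizer(analyzer="char", ngram_range=(1, 3))),
        ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
    ])
    # probability=True makes the SVM return Platt-scaled probability estimates.
    svm = OneVsRestClassifier(
        SVC(kernel="linear", probability=True, random_state=0)
    )
    return Pipeline([("features", features), ("svm", svm)])


def predict_open_set(model, texts, threshold):
    """Most probable author per text, or UNKNOWN if that probability is too low."""
    probabilities = model.predict_proba(texts)
    classes = model.named_steps["svm"].classes_
    predictions = []
    for row in probabilities:
        best = row.argmax()
        if threshold > row[best]:
            predictions.append(UNKNOWN)  # assumed rejection rule
        else:
            predictions.append(str(classes[best]))
    return predictions


if __name__ == "__main__":
    model = build_model().fit(TRAIN_TEXTS, TRAIN_AUTHORS)
    print(predict_open_set(model, ["the dragon guarded the gold"], threshold=0.5))
```
      </preformat>
      <p>In such a set-up the threshold trades off wrongly attributing a document by an unseen author against wrongly rejecting a document by a candidate author, so it would be tuned on development data like the other hyper-parameters.</p>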
      <p>Even though we outperformed the baseline by 18.7%, we note that cross-domain
authorship attribution remains challenging. We have demonstrated that more
sophisticated features, such as dependency tag syntactic n-grams, are capable of
capturing stylometric elements of texts, but that their predictive power is limited
without the help of other feature types. In combination with simpler lexical n-grams,
however, they do improve model performance. Further research should aim to determine
whether these more elaborate textual features provide an accurate and reliable basis
for capturing valuable elements of an author’s writing style.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Che</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zheng</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Towards better UD parsing: Deep contextualized word embeddings, ensemble, and treebank concatenation</article-title>
          .
          <source>In: Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies</source>
          . pp.
          <fpage>55</fpage>
          -
          <lpage>64</lpage>
          . Association for Computational Linguistics, Brussels, Belgium (October
          <year>2018</year>
          ), http://www.aclweb.org/anthology/K18-2005
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Escalante</surname>
            ,
            <given-names>H.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Solorio</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montes-y Gómez</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Local histograms of character n-grams for authorship attribution</article-title>
          .
          <source>In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume</source>
          <volume>1</volume>
          . pp.
          <fpage>288</fpage>
          -
          <lpage>298</lpage>
          . Association for Computational Linguistics (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Grieve</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Quantitative authorship attribution: An evaluation of techniques</article-title>
          .
          <source>Literary and Linguistic Computing</source>
          <volume>22</volume>
          (
          <issue>3</issue>
          ),
          <fpage>251</fpage>
          -
          <lpage>270</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Halteren</surname>
            ,
            <given-names>H.V.</given-names>
          </string-name>
          :
          <article-title>Author verification by linguistic profiling: An exploration of the parameter space</article-title>
          .
          <source>ACM Transactions on Speech and Language Processing (TSLP)</source>
          <volume>4</volume>
          (
          <issue>1</issue>
          ),
          <fpage>1</fpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Júnior</surname>
            ,
            <given-names>P.R.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>de Souza</surname>
            ,
            <given-names>R.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Werneck</surname>
            ,
            <given-names>R.d.O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pazinato</surname>
            ,
            <given-names>D.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>de Almeida</surname>
            ,
            <given-names>W.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Penatti</surname>
            ,
            <given-names>O.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Torres</surname>
            ,
            <given-names>R.d.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rocha</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Nearest neighbors distance ratio open-set classifier</article-title>
          .
          <source>Machine Learning</source>
          <volume>106</volume>
          (
          <issue>3</issue>
          ),
          <fpage>359</fpage>
          -
          <lpage>386</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Kešelj</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peng</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cercone</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thomas</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>N-gram-based author profiles for authorship attribution</article-title>
          .
          <source>In: Proceedings of the Conference Pacific Association for Computational Linguistics, PACLING</source>
          . vol.
          <volume>3</volume>
          , pp.
          <fpage>255</fpage>
          -
          <lpage>264</lpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Kestemont</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tschuggnall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daelemans</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Specht</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Overview of the author identification task at PAN-2018: Cross-domain authorship attribution and style change detection</article-title>
          . In: Cappellato, L., et al. (eds.)
          <source>Working Notes Papers of the CLEF 2018 Evaluation Labs</source>
          . Avignon, France, September 10-14, 2018. pp.
          <fpage>1</fpage>
          -
          <lpage>25</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>de Lhoneux</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stymne</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nivre</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Arc-hybrid non-projective dependency parsing with a static-dynamic oracle</article-title>
          .
          <source>In: Proceedings of the 15th International Conference on Parsing Technologies (IWPT)</source>
          . Pisa, Italy
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Markov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sidorov</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Improving cross-topic authorship attribution: The role of pre-processing</article-title>
          .
          <source>In: International Conference on Computational Linguistics and Intelligent Text Processing</source>
          . pp.
          <fpage>289</fpage>
          -
          <lpage>302</lpage>
          . Springer (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Niculescu-Mizil</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Caruana</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Predicting good probabilities with supervised learning</article-title>
          .
          <source>In: Proceedings of the 22nd international conference on Machine learning</source>
          . pp.
          <fpage>625</fpage>
          -
          <lpage>632</lpage>
          . ACM (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Nivre</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abrams</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Agić</surname>
            ,
            <given-names>Ž.</given-names>
          </string-name>
          , et al.:
          <source>Universal Dependencies 2.3</source>
          (
          <year>2018</year>
          ), http://hdl.handle.net/11234/1-2895, LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Pedregosa</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Varoquaux</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gramfort</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Michel</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thirion</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grisel</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blondel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prettenhofer</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weiss</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dubourg</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vanderplas</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Passos</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cournapeau</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brucher</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perrot</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duchesnay</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Scikit-learn: Machine learning in Python</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          ,
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Peng</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schuurmans</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Keselj</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Language independent authorship attribution using character level language models</article-title>
          .
          <source>In: Proceedings of the tenth conference on European chapter of the Association for Computational Linguistics-Volume</source>
          <volume>1</volume>
          . pp.
          <fpage>267</fpage>
          -
          <lpage>274</lpage>
          . Association for Computational Linguistics (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Peters</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neumann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Iyyer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gardner</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zettlemoyer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Deep contextualized word representations</article-title>
          .
          <source>In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long Papers). pp.
          <fpage>2227</fpage>
          -
          <lpage>2237</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Plank</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Agić</surname>
            ,
            <given-names>Ž.</given-names>
          </string-name>
          :
          <article-title>Distant supervision from disparate sources for low-resource part-of-speech tagging</article-title>
          .
          <source>In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing</source>
          . pp.
          <fpage>614</fpage>
          -
          <lpage>620</lpage>
          . Association for Computational Linguistics (
          <year>2018</year>
          ), http://aclweb.org/anthology/D18-1061
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Platt</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , et al.:
          <article-title>Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods</article-title>
          .
          <source>Advances in Large Margin Classifiers</source>
          <volume>10</volume>
          (
          <issue>3</issue>
          )
          ,
          <fpage>61</fpage>
          -
          <lpage>74</lpage>
          (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gollub</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wiegmann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>TIRA Integrated Research Architecture</article-title>
          . In:
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peters</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          (eds.)
          <source>Information Retrieval Evaluation in a Changing World - Lessons Learned from 20 Years of CLEF</source>
          . Springer (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Sapkota</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bethard</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montes</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Solorio</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Not all character n-grams are created equal: A study in authorship attribution</article-title>
          .
          <source>In: Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          . pp.
          <fpage>93</fpage>
          -
          <lpage>102</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Sapkota</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Solorio</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montes</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bethard</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Cross-topic authorship attribution: Will out-of-topic data help?</article-title>
          <source>In: Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers</source>
          . pp.
          <fpage>1228</fpage>
          -
          <lpage>1237</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>A survey of modern authorship attribution methods</article-title>
          .
          <source>Journal of the American Society for Information Science and Technology</source>
          <volume>60</volume>
          (
          <issue>3</issue>
          ),
          <fpage>538</fpage>
          -
          <lpage>556</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Authorship attribution using text distortion</article-title>
          .
          <source>In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers</source>
          . vol.
          <volume>1</volume>
          , pp.
          <fpage>1138</fpage>
          -
          <lpage>1149</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          , et al.:
          <article-title>Ensemble-based author identification using character n-grams</article-title>
          .
          <source>In: Proceedings of the 3rd International Workshop on Text-based Information Retrieval</source>
          . vol.
          <volume>36</volume>
          , pp.
          <fpage>41</fpage>
          -
          <lpage>46</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Straka</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Straková</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Tokenizing, POS tagging, lemmatizing and parsing UD 2.0 with UDPipe</article-title>
          .
          <source>In: Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies</source>
          . pp.
          <fpage>88</fpage>
          -
          <lpage>99</lpage>
          . Association for Computational Linguistics, Vancouver, Canada (August
          <year>2017</year>
          ), http://www.aclweb.org/anthology/K/K17/K17-3009.pdf
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>