<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Cross-Domain Authorship Attribution with Federales</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Centre for Language Studies, Radboud University Nijmegen</institution>
          <addr-line>P.O. Box 9103, NL-6500HD Nijmegen</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <abstract>
        <p>This paper describes the system with which I participated in the Cross-Domain Authorship Attribution task at PAN2019, which entailed attributing fanfiction texts from a specific “fandom” (texts in a given fictional world) on the basis of training material from other fandoms. As the underlying system, I used a combination of five or three feature sets (depending on the language) and two author verification methods. In reaction to the genre differences, I added a second round of attribution, this time in-genre, for those authors for whom enough target fandom texts could be identified in the first round. On the training dataset, attribution quality was well over the baseline scores, but with ample room for further improvement. On the test dataset, the gain over the baseline scores was lower, indicating some kind of overtraining. In general, performance showed that more work is needed on (automated) hyperparameter selection.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The Cross-Domain Authorship Attribution task in PAN2019 is the attribution of
so-called “fan-fiction” texts in four languages (English, French, Italian and Spanish; see
[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] for a full description of the task).<sup>1</sup> Fans of popular fiction books or series, e.g.
Sherlock Holmes or Harry Potter, write their own stories in the corresponding fictional
worlds. The stories in such a world are dubbed a “fandom”. The training material for
the PAN2019 task was divided into 20 “problems”, five for each language, in which
a number of texts from a specific fandom are to be attributed to nine known authors
or “none of these”, in all cases on the basis of seven known texts from each author,
stemming from other fandoms. The problems were to be solved in isolation, i.e. it was
not allowed to use information from one problem in solving the others. This severely
limited the compilation of a background corpus. However, three baseline systems were
also provided, one of which (the Impostor baseline [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]) was supported with a set of
5,000 texts per language, from various (unknown) fandoms and authors, which could
serve as a background corpus. The unknown texts for each problem also contained texts
from other authors than the nine given ones. The number of such texts was 20%, 40%,
60%, 80% and 100% of the number of texts by the target authors. Although it seemed
likely that the test material would have a similar composition, this was not specified.<sup>2</sup>
      </p>
      <p><sup>1</sup> In this paper, I will focus on my own approach. I refer the reader to the overview paper and
the other papers on the PAN2019 Cross-Domain Authorship Attribution task for related work.
Not only will this prevent overlap between the papers, but most of the other papers, and hence
information on the current state of the art, are not available at the time of writing of this paper.</p>
      <p>Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons
License Attribution 4.0 International (CC BY 4.0). CLEF 2019, 9-12 September 2019, Lugano,
Switzerland.</p>
      <p>Authorial style might or might not have been adapted to the style of the emulated
book(s), but it is likely that the language use was influenced by topic and general genre.
In authorship attribution, it is well known that the task is much more difficult
if the unknown texts are from a different genre than the known texts. In the current task,
we seemed to be somewhere in between, as all texts were fiction. However, the fact that
all unknown texts were from the same fandom might have pushed the language use in
a similar direction for all authors, introducing an additional confounding factor.</p>
      <p>My approach for this task<sup>3</sup> built on earlier work on authorship and other text
classification tasks. That work used to be published under the name Linguistic Profiling;
because of the ambiguity of that term, the name has now been replaced by the working title “Feature
Deviation Rating Learning System” (henceforth Federales). Although the full name
implies a specific learning technique, the acronym already indicates my predilection for
combination approaches. Which form of combination was used in this task is described
below (Section 3).</p>
      <p>
        On top of the basic verification system, I wanted to add some way to deal with the
genre differences. A first idea was to model each fandom and learn how a specific
author behaved in relation to these fandom models. If relative behaviour were consistent
between fandoms, correction factors could be applied to target fandom measurements,
hopefully leading to a better attribution. However, there was insufficient material
outside the known texts to follow this strategy.<sup>4</sup> A second option was to ignore all features
that appeared to be affected by genre (i.e. fandom). In a small pilot study on English,
it turned out that so many features were removed that attribution quality went down
rather than up.<sup>5</sup> In the end, I did not attempt to apply corrections to the feature
measurements. Instead, I took the authors for which a sufficient number of texts was identified
with (relatively) high confidence, and used those texts as known texts for a verification
model within the target fandom. For those authors, the in-genre models were then used,
whereas the other authors had to be attributed with the cross-genre model, albeit with
the additional knowledge that some competing authors could be ruled out on the basis
of the in-genre verification.
      </p>
      <p><sup>2</sup> And therefore I did not use this expectation in building my system.</p>
      <p><sup>3</sup> I also participated in the Author Profiling task. The differences in handling the two tasks were
such that I preferred to describe the other task in a separate paper [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The main difference is
that here I used verification, whereas for profiling I used comparison. Despite that, there will obviously
be some overlap between the papers.</p>
      <p><sup>4</sup> As it is very unlikely that the same authors will be active in the same combinations of fandoms,
the known texts were usually from various fandoms. The only other source was the set of
5,000 texts by unknown authors per language, but this as well was too fragmented for proper
modeling. I hope the organisers will make more data available after the workshop, so that I
can investigate author behaviour over various fandoms.</p>
      <p><sup>5</sup> It must be said that this pilot was in the initial phases of the work on the task. It may well be
that the approach would be viable in the circumstances provided by the final system. Again,
future work is a definite option.</p>
      <p>Figure 1 (left; English, POS tagging by Stanford CoreNLP, shown as token/tag):
Stiles/NNP had/VBD been/VBN the/DT one/CD to/TO say/VB they/PRP needed/VBD to/TO get/VB out/IN of/IN town/NN first/RB ./.</p>
      <p>Figure 1 (right; French, tokenization by FreeLing, POS tags not preserved):
Je sais que le jour venu les_autres maisons seront derrière toi , professeurs et élèves .</p>
      <p>
        An additional complication of the shared task was that the final evaluation was in
the form of a blind test using TIRA[
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. My main approach on earlier projects was
a careful investigation of (the known texts of) each individual problem, followed by
a manual combination of the most promising components and system settings. Given
that there was insufficient time to fully automate this procedure, I was forced here to
choose components (and hyperparameters) for all the (probably varied) test data at once,
without having access to the known texts or even knowledge of the involved genres.
Moreover, in the development of the current software, I had not yet reached the state of
an integrated system, which was obviously needed for TIRA. In the end, complications
in building the integrated system took away much time that could have been used on
better system tuning.
      </p>
      <p>In the next sections, I first describe the features I used (Section 2), the verification
techniques (Section 3) and how all this fitted together into a full system (Section 4).
After this I continue with performance measurements (Section 5) and a short discussion
to conclude the paper (Section 6).</p>
    </sec>
    <sec id="sec-2">
      <title>Feature Extraction</title>
      <p>
        For English, I used both surface and syntactic features. The simplest type was that of
the character n-grams, with n from 1 to 6, which were counted directly on the raw text.
For all other types, I analysed all texts with the Stanford CoreNLP system[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>From the POS tagging, of which an example is shown on the left side of Figure 1,
I extracted token n-grams, with n from 1 to 3. Each position in the n-gram was taken
by the word itself, the POS tag or the word’s group. The latter was looked up in a list
of word groups deemed relevant for authorship attribution, which I had available only
for English and which contained reporting verbs and various types of adverbs.<sup>6</sup> The
token n-grams were generated in two forms. The lexical form (lex) included the words
themselves. In the abstract form (abs) the words were replaced by an indication of their
IDF value (low, middle or high) if their IDF exceeded a certain value.<sup>7</sup> In this way, the
abstract features were supposed to reflect the language use of the author rather than the
topics discussed.</p>
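      <p>The difference between the lexical and the abstract form can be illustrated with a minimal sketch. It is not the Federales implementation: the IDF cut-offs, the band labels and the example IDF values are hypothetical (the actual values are not reported here), and POS tags and word groups as n-gram positions are left out for brevity.</p>
      <preformat><![CDATA[
```python
from typing import Dict, List, Tuple

# Hypothetical cut-offs: the paper does not report the actual IDF values.
IDF_KEEP_AT_OR_BELOW = 3.0                        # frequent words stay as they are
IDF_BANDS = [(5.0, "IDF_LOW"), (8.0, "IDF_MID")]  # anything above 8.0 is IDF_HIGH


def abstract_word(word: str, idf: Dict[str, float]) -> str:
    """Replace a word by an IDF band label if its IDF exceeds the cut-off."""
    # Assumption: words without an IDF entry are treated as very rare.
    value = idf.get(word.lower(), float("inf"))
    if value <= IDF_KEEP_AT_OR_BELOW:
        return word
    for upper, label in IDF_BANDS:
        if value <= upper:
            return label
    return "IDF_HIGH"


def token_ngrams(tokens: List[str], max_n: int = 3) -> List[Tuple[str, ...]]:
    """All contiguous token n-grams with n from 1 to max_n."""
    return [tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


words = ["they", "needed", "to", "get", "out", "of", "town"]
# Invented IDF values, for illustration only.
idf = {"they": 1.2, "needed": 4.1, "to": 0.3, "get": 2.5,
       "out": 1.9, "of": 0.2, "town": 6.4}

lexical = token_ngrams(words)                                    # lex form
abstract = token_ngrams([abstract_word(w, idf) for w in words])  # abs form
```
]]></preformat>
      <p>In this sketch the topical word “town” disappears from the abstract features, while the function words around it are kept.</p>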
      <p>
        The CoreNLP system also yielded a dependency analysis. However, as the
dependency structure was less amenable to variation studies than a constituency structure, I
first transformed the trees, aiming for a representation similar to that in the TOSCA
Project [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].<sup>8</sup> Apart from restructuring the tree, the transformation program lexicalized
the analysis by percolating the head words upwards. As an example, the parse of the
(English) sentence in Figure 1 is shown in Figure 2. From these transformed trees,
syntactic features were derived, namely slices from the trees representing presence of
constituents, dominance and linear precedence, as well as full rewrites. Again a
lexical and an abstract form were generated, this time with the word (or IDF indication)
concatenated with the POS tag.<sup>9</sup> An overview of the feature types, with examples for
English, is given in Table 1.
      </p>
      <p><sup>6</sup> The list was created during work on author recognition on texts from the British National
Corpus [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and was still under development.</p>
      <p><sup>7</sup> IDFs were calculated on the texts in the British National Corpus.</p>
      <p><sup>8</sup> Unfortunately, the parser(s) from that project are (currently) unavailable.</p>
      <p><sup>9</sup> Unfortunately, there is no space for a more extensive description, but this will follow in future
work.</p>
      <p>Figure 2: Transformed parse of the English example sentence from Figure 1.</p>
      <preformat>
ROOT:ROOT(&lt;UNHEAD&gt;) -&gt; [ UTT NOFUpunc ]
  UTT:S(be) -&gt; [ SU V CS ]
    SU:NP-NPRPNP00(Stiles) -&gt; [ NPHD ]
      NPHD:NNP(Stiles) -&gt; Stiles
    V:VP(be) -&gt; [ AVB MVB ]
      AVB:VBD(have) -&gt; had
      MVB:VBN(be) -&gt; been
    CS:NP-NPRPNP01(one) -&gt; [ NPDT NPHD NPPO ]
      NPDT:DT(the) -&gt; the
      NPHD:CD(one) -&gt; one
      NPPO:SBAR(say|GrpVrepd) -&gt; [ S ]
        S:S(say|GrpVrepd) -&gt; [ A V A ]
          A:TO(to) -&gt; to
          V:VP(say|GrpVrepd) -&gt; [ MVB ]
            MVB:VB(say|GrpVrepd) -&gt; say
          A:SBARc(need) -&gt; [ S ]
            S:S(need) -&gt; [ SU V A ]
              SU:NP-PRPNP00(they) -&gt; [ NPHD ]
                NPHD:PRP(they) -&gt; they
              V:VP(need) -&gt; [ MVB ]
                MVB:VBD(need) -&gt; needed
              A:Sx(get) -&gt; [ A V A A ]
                A:TO(to) -&gt; to
                V:VP(get) -&gt; [ MVB ]
                  MVB:VB(get) -&gt; get
                A:AVP(out) -&gt; [ AVHD AVPO ]
                  AVHD:IN(out) -&gt; out
                  AVPO:PP(of|town) -&gt; [ PREP PCOMP ]
                    PREP:IN(of) -&gt; of
                    PCOMP:NP-NPRPNP00(town) -&gt; [ NPHD ]
                      NPHD:NN(town) -&gt; town
                A:AVP(first) -&gt; [ AVHD ]
                  AVHD:RB(first) -&gt; first
  NOFUpunc:.(.) -&gt; .
</preformat>
      <p>
        For all three Romance languages, no appropriate syntactic parser was available.<sup>10</sup>
Instead, I POS-tagged the texts with FreeLing[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. An example for French is shown on
the right side of Figure 1. From this output I extracted only surface features, leading
to three different feature sets. Furthermore, the token n-grams were less involved, as
no word groups were present, nor IDF statistics, so that the abstract form replaced all
words of open word classes.
      </p>
      <p>For each problem, the system took all known and unknown texts, plus the
5,000-text background corpus, and extracted all features which occurred in at least three texts.
The numbers of features for each problem and feature type (for the training data) are
indicated in Table 2. The large ranges within each language were caused mostly by the
greatly varying sizes of the unknown text sets, from 38 texts for problem 10 (French) to
561 for problem 1 (English).</p>
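      <p>The selection of features that occur in at least three texts amounts to a document-frequency filter. The following minimal sketch shows it for character n-grams; the function names and the toy texts are illustrative assumptions and not part of the actual system.</p>
      <preformat><![CDATA[
```python
from collections import Counter
from typing import List, Set


def char_ngrams(text: str, max_n: int = 6) -> Counter:
    """Count all character n-grams with n from 1 to max_n on the raw text."""
    counts: Counter = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def select_features(docs: List[Counter], min_docs: int = 3) -> Set[str]:
    """Keep the features that occur in at least min_docs different texts."""
    doc_freq: Counter = Counter()
    for doc in docs:
        doc_freq.update(doc.keys())  # every text counts once per feature
    return {feat for feat, n in doc_freq.items() if n >= min_docs}


# Toy collection standing in for known, unknown and background texts.
texts = ["abc", "abd", "abe", "xyz"]
docs = [char_ngrams(t, max_n=2) for t in texts]
kept = select_features(docs)
```
]]></preformat>
      <p>Only “a”, “b” and “ab” occur in three of the four toy texts, so only these survive the filter.</p>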
    </sec>
    <sec id="sec-3">
      <title>Learning Techniques</title>
      <p>Apart from a combination of feature types, I also used a combination of learning
techniques.</p>
      <p>3.1 Feature Value versus Profile Range Comparison</p>
      <p>
        The Federales system built on the Linguistic Profiling system, which had been used in
various studies, such as authorship recognition [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ][
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], language proficiency [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], source
language recognition [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], and gender recognition [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. The approach is based on the
assumption that (relative) counts for each specific feature typically move within a specific
range for a class of texts and that deviations from this typical behaviour indicate that the
deviating text does not belong to the class in question. If the frequency range for a
feature is very large, the design of the scoring mechanism ensures that the system mostly
ignores that feature.
      </p>
      <p><sup>10</sup> This was not because no parsers existed, but because there was no time to build transformers
from the parser output to the structure I wanted for the syntactic features.</p>
      <p>For each feature, the relative counts<sup>11</sup> for all samples in the class are used to
calculate a mean and a standard deviation.<sup>12</sup> The deviation of the feature count for a specific
test sample is simply the z-score with respect to this mean and standard deviation, and
is viewed as a penalty value. Hyperparameters enable the user to set a threshold below
which deviations are not taken into account (the smoothing threshold), a power to apply
to the z-score in order to give more or less weight to larger or smaller deviations
(deviation power), and a penalty ceiling to limit the impact of extreme deviations. When
comparing two classes, a further hyperparameter sets a power value for the difference
between the two distributions (difference power), the result of which is then multiplied
with the deviation value. The optimal behaviour in cases where a feature is seen in the
training texts for the class but not in the test sample, or vice versa, is still under
consideration. In the current task, features only seen in the training texts were counted as they
are, namely with a count of 0 in the test sample; features only seen in the test sample
were compared against the lowest mean and standard deviation from among the training
features, which should correspond more or less to the scores for hapaxes in the training
texts.</p>
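      <p>The per-feature penalty described above can be sketched as follows. This is a simplified illustration rather than the Federales code: the order in which threshold, power and ceiling are applied, the use of the absolute z-score and the population standard deviation, and the handling of zero variance are assumptions. Features seen only in the test sample and the difference power for two-class comparison are left out. The defaults correspond to the basic setting of no threshold, power 1 and a ceiling of 10.</p>
      <preformat><![CDATA[
```python
from statistics import mean, pstdev
from typing import Dict, List


def feature_penalty(value: float, class_values: List[float],
                    threshold: float = 0.0, power: float = 1.0,
                    ceiling: float = 10.0) -> float:
    """Penalty for one feature: thresholded, powered and capped |z-score|."""
    mu = mean(class_values)
    sigma = pstdev(class_values)
    if sigma == 0.0:
        # Assumption: zero variance is not covered in the text; any
        # deviation from a constant training value gets the full ceiling.
        return 0.0 if value == mu else ceiling
    z = abs(value - mu) / sigma
    if z < threshold:              # smoothing threshold
        return 0.0
    return min(z ** power, ceiling)  # deviation power, then penalty ceiling


def total_penalty(sample: Dict[str, float],
                  profile: Dict[str, List[float]], **hyper: float) -> float:
    """Sum the penalties over all features seen in the training texts.

    Features missing from the test sample enter with a relative count of 0.
    """
    return sum(feature_penalty(sample.get(feat, 0.0), values, **hyper)
               for feat, values in profile.items())


# Relative counts of two features in three training texts (invented).
profile = {"the": [0.25, 0.5, 0.75], "of_the": [0.125, 0.125, 0.125]}
sample = {"the": 1.0}  # "of_the" does not occur in the test sample
penalty = total_penalty(sample, profile)
```
]]></preformat>
      <p>A feature with a wide training range yields a large standard deviation and therefore small z-scores, which is how such features end up being mostly ignored.</p>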
      <p>The penalties for all features are added. A set of benchmark texts is used to
calculate a mean and standard deviation for the penalty totals, to allow comparison between
different models. For verification, the z-score for the penalty total is an outcome by
itself; for comparison between two models, the difference of the z-scores can be taken;
for attribution within larger candidate sets (as in the current task), the z-scores can be
compared. In all cases, a threshold can be chosen for the final decision.</p>
      <p>Even though optimal settings for one author were often bad settings for another
author, I started with the fallback strategy of using a single, basic choice for the
hyperparameters, namely no smoothing threshold, no deviation or difference power (i.e.
power 1), and a penalty ceiling of 10. The next step, automated tuning, could no longer
be carried out for lack of time.</p>
      <p>3.2 Support Vector Regression</p>
      <p>
        As a second learning method to assign feature vectors to classes, I used Support Vector
Regression as provided by the libsvm package [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. For each author, vectors for known
texts of the author were given class 1 and vectors for texts by other authors class
-1. svm-scale was used with its default settings. Here too, a single, simple list of
hyperparameters was used: ν-regression with an RBF kernel, and the defaults for cost (c = 1)
and gamma (γ = 1/number_of_features); only ν was set other than the default (0.1
versus a default of 0.5). To correct for a bias towards positive or negative examples
because of different numbers of texts in the two classes, the hyperparameter w was
used, with weights that exactly compensated for class size. By choosing regression, I
received a score rather than a decision on the class, which could then be used in further
processing.
      </p>
      <p><sup>11</sup> I.e. the absolute count divided by the corresponding number of items, e.g. the count of a token in
a text divided by that of all tokens within the text, or a character n-gram count divided by the
number of characters in the text.</p>
      <p><sup>12</sup> Theoretically, this is questionable, as most counts will not be distributed normally, but the
system appears quite robust against this theoretical objection.</p>
      <p>After all ten (English) or six (Romance languages) individual component runs, the component
scores for each run were normalized by factoring in a linear model which predicted the component score on the
basis of the average component scores for the models and test samples in question,
plus the number of feature comparisons made during scoring. The adjusted component
score was the deviation from the predicted value. Finally, all adjusted component scores
were normalized to z-scores with regard to all observed adjusted component scores for
a model and with regard to all observed adjusted component scores for a test sample.
These normalized component scores allow for an intuitive interpretation and provide
comparability between different author models. The latter goal was needed here both
for selecting a common threshold for verification and for score combination of the
various components. In addition to the two individual normalized component scores (with regard to
model and with regard to sample), their sum was also calculated. Which of these three normalized
scores was used depended on the phase in the attribution (cf. Section 4).</p>
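      <p>The final z-score step of the normalization can be illustrated with a small sketch. It assumes that the adjusted component scores are already available as a model-by-sample table; the preceding linear-model adjustment is not reproduced, and the data layout, the function names and the treatment of zero variance are assumptions made for illustration.</p>
      <preformat><![CDATA[
```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple

Table = Dict[str, Dict[str, float]]  # table[model][sample] -> score


def zscores(values: List[float]) -> List[float]:
    """Standardize a list of scores; constant lists map to all zeros."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma > 0 else 0.0 for v in values]


def normalize(scores: Table) -> Tuple[Table, Table, Table]:
    """Return z-scores wrt model, wrt sample, and the sum of the two."""
    models = list(scores)
    samples = list(scores[models[0]])
    # Per model: standardize over all samples scored by that model.
    wrt_model = {m: dict(zip(samples, zscores([scores[m][s] for s in samples])))
                 for m in models}
    # Per sample: standardize over all models that scored that sample.
    wrt_sample: Table = {m: {} for m in models}
    for s in samples:
        column = zscores([scores[m][s] for m in models])
        for m, z in zip(models, column):
            wrt_sample[m][s] = z
    combined = {m: {s: wrt_model[m][s] + wrt_sample[m][s] for s in samples}
                for m in models}
    return wrt_model, wrt_sample, combined


# Invented adjusted component scores for two author models and two texts.
adjusted = {"A": {"t1": 1.0, "t2": 3.0}, "B": {"t1": 2.0, "t2": 2.0}}
wrt_model, wrt_sample, combined = normalize(adjusted)
```
]]></preformat>
      <p>Standardizing per model makes thresholds comparable across author models, while standardizing per sample makes the candidate authors comparable for a single text.</p>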
    </sec>
    <sec id="sec-4">
      <title>Complete System</title>
      <p>After the normalized scores (henceforth simply “scores”) for each feature set and learner
became available, the system needed to combine these to determine an attribution for
each text. This was done in several phases.</p>
      <p>4.1 Phase 1: Cross-Genre Attribution</p>
      <p>In the first phase, the system examined the scores produced by the various models so
far. On the basis of the sum of all individual scores,<sup>13</sup> a ranking of potential authors was
produced. If the attribution task were a closed one, the first ranked author could now
be selected. However, the test texts also included texts by other than the nine known
authors. This meant that a decision had to be made whether or not to attribute to a
known author at all.</p>
      <p>For this, a number of statistics were calculated per test sample. For each potential
author, these were: the mean of the reciprocal rank, the mean of the score, the mean
of the distance to the score of the top-ranked author, and the highest and lowest score
reached. For the sample as a whole, these were: the sum of the various best scores and
distance in the individual models, plus the highest values for the five author statistics.
Finally, the first principal component of a PCA of the seven whole-sample statistics was
added.<sup>14</sup></p>
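      <p>The ranking and the five per-author statistics can be sketched as follows for a single test sample. This is an illustration under assumptions: each component run is represented as a mapping from author to normalized score, the top-ranked author is determined per run, and the whole-sample statistics and the PCA step are omitted.</p>
      <preformat><![CDATA[
```python
from statistics import mean
from typing import Dict, List


def rank_authors(runs: List[Dict[str, float]]) -> List[str]:
    """Rank candidate authors by the sum of their scores over all runs."""
    totals = {author: sum(run[author] for run in runs) for author in runs[0]}
    return sorted(totals, key=totals.get, reverse=True)


def author_statistics(runs: List[Dict[str, float]]) -> Dict[str, Dict[str, float]]:
    """Five per-author statistics over the individual component runs."""
    stats = {}
    for author in runs[0]:
        reciprocal_ranks, scores, distances = [], [], []
        for run in runs:
            ranked = sorted(run, key=run.get, reverse=True)
            reciprocal_ranks.append(1.0 / (ranked.index(author) + 1))
            scores.append(run[author])
            top_score = run[ranked[0]]  # assumption: top author per run
            distances.append(top_score - run[author])
        stats[author] = {
            "mean_reciprocal_rank": mean(reciprocal_ranks),
            "mean_score": mean(scores),
            "mean_distance_to_top": mean(distances),
            "highest_score": max(scores),
            "lowest_score": min(scores),
        }
    return stats


# Invented scores of three candidate authors in two component runs.
runs = [{"A": 2.0, "B": 1.0, "C": -1.0},
        {"A": 1.0, "B": 1.5, "C": -0.5}]
ranking = rank_authors(runs)
stats = author_statistics(runs)
```
]]></preformat>
      <p>Thresholds on statistics of this kind then decide whether the top-ranked author is accepted or the text is left unattributed.</p>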
      <p>In the various phases, all of the whole-sample statistics, apart from the
reciprocal rank, were used for comparisons against thresholds to determine whether or not
to accept the top-ranked author for attribution. The thresholds were set manually per
language, in such a way that the average evaluation score over all 45 authors was
optimized.<sup>15</sup> My hope was that possible overtraining in this threshold selection process
would be of an acceptable level, seeing that the selected thresholds were chosen for 45
different authors each in five different fandoms.<sup>16</sup></p>
      <p><sup>13</sup> To be exact, the normalized scores with regard to both model and sample.</p>
      <p><sup>14</sup> The rotation to calculate this value was based on the known samples and applied to the test
samples.</p>
      <p><sup>15</sup> This manual tuning process too should be, but has not yet been, automated.</p>
      <p>4.2 Phase 2: Selecting Training Texts for In-Genre Attribution</p>
      <p>The next step was to build an in-genre attribution process for those authors for which
sufficient in-genre texts had been identified. If at least four texts were attributed to an
author in phase 1, these texts were used to build in-genre models, with which the same
set of suggested texts was scored.<sup>17</sup></p>
      <p>These scores were then transformed to z-scores with regard to all text scores for the
author in question and then sorted from high to low. Texts were removed from the set if
a) they had a z-score lower than -2 and/or b) they were in the second half of the ranking
and the gap to the next higher z-score was greater than 1. The second criterion was to
correct for the situation where another author (known or unknown) was being falsely
accepted as well. This correction fails if the authors are too much alike, but also if more
samples from the unwanted author were included than from the wanted author.<sup>18</sup></p>
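      <p>The two removal criteria can be sketched as below. The sketch follows a literal per-text reading of the rule; whether texts ranked below a removed text are dropped as well is not specified, and the exact boundary of “the second half of the ranking” and the use of the population standard deviation are assumptions.</p>
      <preformat><![CDATA[
```python
from statistics import mean, pstdev
from typing import Dict, List


def filter_texts(scores: Dict[str, float]) -> List[str]:
    """Return the texts kept for in-genre training, ranked high to low."""
    values = list(scores.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0.0:
        return sorted(scores)  # identical scores: nothing to remove
    z = {text: (score - mu) / sigma for text, score in scores.items()}
    ranked = sorted(z, key=z.get, reverse=True)
    kept = []
    for i, text in enumerate(ranked):
        too_low = z[text] < -2.0                       # criterion a
        in_second_half = i >= len(ranked) / 2          # assumed boundary
        gap = z[ranked[i - 1]] - z[text] if i > 0 else 0.0
        after_gap = in_second_half and gap > 1.0       # criterion b
        if not (too_low or after_gap):
            kept.append(text)
    return kept


# Invented in-genre scores for five texts attributed to one author.
scores = {"t1": 5.0, "t2": 4.8, "t3": 4.6, "t4": 4.5, "t5": 1.0}
kept = filter_texts(scores)
```
]]></preformat>
      <p>In this example the last text is not quite two standard deviations below the mean, but it follows a large gap in the lower half of the ranking and is therefore removed by the second criterion.</p>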
      <p>The texts which were not removed were next used to build full in-genre attribution
models. In addition, they were marked for final attribution to their respective authors.</p>
      <p>4.3 Phase 3: In-Genre Attribution</p>
      <p>On the basis of the texts selected in phase 2, the system now created an attribution model
for each author for whom at least three in-genre texts were left. For the other authors,
the cross-genre models were reused. With these new models, all unknown texts were
processed in the same way as in phase 1, except that now only the normalized scores
with regard to the sample were used (and no longer adding those for the models), and
other thresholds were chosen.</p>
      <p>The texts attributed to any of the in-genre models were marked for final attribution
to the corresponding authors. All other texts were marked as NOT belonging to any of
these authors and referred to phase 4, where they could be attributed to one of the other
authors or remain unattributed.</p>
      <p>4.4 Phase 4: Cross-Genre Attribution of Underrepresented Authors</p>
      <p>For this final attribution phase, the results of phase 1 were reused, but this time also
using the normalized scores with regard to the sample.</p>
      <p>
        All authors processed in phase 3 were struck from the rankings and the top-ranking
remaining author was examined. If his/her mean and/or lowest scores over the
individual runs were above selected (language-dependent) thresholds, he/she was selected for
final attribution of the text in question. If not, the text remained unattributed.
      </p>
      <p><sup>16</sup> After receiving the test results, I must conclude that this was not the case, and that
author-dependent hyperparameter and threshold selection is indeed needed (cf. Section 6).
Furthermore, in the circumstances, reporting on the manually selected thresholds does not seem
worthwhile.</p>
      <p><sup>17</sup> For reasons of processing time, support vector regression was not included in this phase.</p>
      <p><sup>18</sup> I hoped the task organisers were not too cruel in their text selection, in which case random
selection should be on my side.</p>
      <p>
        The evaluation measure for the PAN2019 attribution task was the open-set
macro-averaged F1 score, which was calculated over the training classes without the
“unknown” class, but counting false accepts of unknown-class texts against precision [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].<sup>19</sup>
Evaluation with the provided Python script of my system’s results on the training set
yielded the scores listed in Table 3. Also in Table 3 are the Macro-F1 scores for three
baseline systems, for the description of which I refer the reader to [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].<sup>20</sup>
      </p>
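      <p>The open-set macro-averaged F1 as described above can be illustrated with a minimal sketch. It is not the official PAN evaluation script; the label used for the unknown class and the data layout (mappings from text identifier to author label) are assumptions.</p>
      <preformat><![CDATA[
```python
from typing import Dict


def open_set_macro_f1(gold: Dict[str, str], pred: Dict[str, str],
                      unknown: str = "<UNK>") -> float:
    """Average the per-author F1 over the known authors only.

    Texts of unknown authors never count as a class of their own, but a
    known author predicted for such a text is a false positive for that
    author and so lowers precision.
    """
    known = sorted({label for label in gold.values() if label != unknown})
    f1_scores = []
    for author in known:
        tp = sum(1 for t in gold
                 if gold[t] == author and pred.get(t, unknown) == author)
        fp = sum(1 for t in gold
                 if gold[t] != author and pred.get(t, unknown) == author)
        fn = sum(1 for t in gold
                 if gold[t] == author and pred.get(t, unknown) != author)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        f1_scores.append(f1)
    return sum(f1_scores) / len(f1_scores)


# Invented example: text 4 is by an unknown author but attributed to A.
gold = {"1": "A", "2": "A", "3": "B", "4": "<UNK>"}
pred = {"1": "A", "2": "B", "3": "B", "4": "A"}
score = open_set_macro_f1(gold, pred)
```
]]></preformat>
      <p>Because the per-author F1 scores are averaged directly, the result is not derived from the macro-averaged precision and recall.</p>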
      <p>For the training material, my system outperformed the three baselines, but with
substantial variation per problem (and even more so per author, but those results are
not shown here for lack of space). It is encouraging that the largest improvement with
regard to the baselines was for English, where overall the system reached a 25% gain
in Macro-F1 over the best baseline (SVM). This is the language I have mostly been
working on so far and this is the language for which the syntactic features could be
used. For the Romance languages, my system was much less evolved. For Spanish, the
relative performance (a 19% gain, again over SVM) was therefore rather
satisfying. Apparently, the syntactic structure used for English is reflected sufficiently in
the morphology here. Italian was somewhere in between, with a 9% gain, this time over
compression. But French was apparently a serious problem, with a gain of only 1.5%
(over compression) and a best score only for problem 8.</p>
      <p><sup>19</sup> As the Macro-F1 was taken as an average over all per-class F1 scores, and not calculated from the
Macro-Precision and the Macro-Recall, the Macro-F1 could be lower than the two others.</p>
      <p><sup>20</sup> I did not actually run the baseline systems myself, but these measurements were provided by
Mike Kestemont, for which I thank him.</p>
      <p>For the test material, the system performed much worse. For English, the Macro-F1
of 0.532 was only 8% over the compression baseline (0.493). For French, my 0.554
was even lower than compression’s 0.595. For Italian, the test run went better than the
training run, with 0.653 and a 12.5% improvement over the best baseline (0.580, again
compression). For all three languages, the best scores were much higher: English 0.665,
French 0.705 and Italian 0.717, all by “muttenthaler”. Spanish was an exception, with
SVM as the best baseline (0.577) and “neri” the best participating system (0.679), so
that my 0.652 ended up on rank 3 with 13% over SVM, still lower than the training
performance.</p>
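      <p>The relative gains over the best baseline on the test set follow directly from the Macro-F1 scores quoted above. The short sketch below recomputes them; the function name and the dictionary layout are illustrative only.</p>
      <preformat><![CDATA[
```python
def relative_gain(system: float, reference: float) -> float:
    """Gain of the system score over a reference score, in percent."""
    return 100.0 * (system / reference - 1.0)


# (my Macro-F1, best baseline Macro-F1) on the test set, as quoted in the text.
test_scores = {
    "English": (0.532, 0.493),  # baseline: compression
    "French": (0.554, 0.595),   # baseline: compression
    "Italian": (0.653, 0.580),  # baseline: compression
    "Spanish": (0.652, 0.577),  # baseline: SVM
}

gains = {language: relative_gain(system, baseline)
         for language, (system, baseline) in test_scores.items()}
```
]]></preformat>
      <p>The recomputed gains agree with the reported percentages to within half a percentage point, the remainder being due to rounding of the published scores.</p>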
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>For the PAN2019 Cross-Domain Authorship Attribution task, I built a system
combining various feature types and two classification methods, and furthermore attempting to
apply in-genre models when enough training texts could be identified in the cross-genre
recognition. Under pressure of the shared task deadline, after losing too much time on
building an integrated system that could run on TIRA, automation of the tuning
procedure for hyperparameters and thresholds was no longer possible, and suboptimal values
had to be used.</p>
      <p>On the training set, the system outperformed three baselines, but with varying
margins. The largest margin was for English, for which my software was the most
developed and the features included ones derived from full syntactic analysis trees. On the
test set, performance varied. In relation to the baselines, there was a drop for English
(+25% for training to +8% for test), French (+1.5% to -7%) and Spanish (+19% to
+13%), but an improvement for Italian (+9% to +12.5%). In relation to the best
systems (for each language), however, we see bad scores for English (-20%) and French
(-21.5%), slightly better scores for Italian (-9%), and acceptable scores for Spanish
(-3%).</p>
      <p>Given the large variation in relative scores, it would seem that success or failure
with the single settings option is largely a matter of luck. The settings may work or
they may not. This is clearly not an acceptable situation and automated hyperparameter
tuning is a matter of utmost urgency. Once this is in place, I can also turn my attention
to the relative contributions of the various feature types and components, as well as the
partial in-genre attribution, which would be rather meaningless at the moment.</p>
      <p>Obviously, careful analysis of the results is only possible with access to the test data.
But I would like to go further and express the hope that the organisers will release even
more fandom data, as this would allow a more detailed mapping of the behaviour of
authors across (fiction) genres, which was not possible with the current selection. Once
cross-genre regularities and irregularities are understood better, cross-genre authorship
attribution should become much more viable.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Aarts</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van Halteren</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oostdijk</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>The linguistic annotation of corpora: The TOSCA analysis system</article-title>
          .
          <source>International journal of corpus linguistics 3</source>
          (
          <issue>2</issue>
          ),
          <fpage>189</fpage>
          -
          <lpage>210</lpage>
          (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <collab>BNC Consortium</collab>
          :
          <article-title>The British National Corpus, version 3 (BNC XML Edition)</article-title>
          . URL: http://www.natcorp.ox.ac.uk/ (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>C.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>C.J.</given-names>
          </string-name>
          :
          <article-title>LIBSVM: A library for support vector machines</article-title>
          .
          <source>ACM Transactions on Intelligent Systems and Technology</source>
          <volume>2</volume>
          (
          <issue>27</issue>
          ),
          <fpage>1</fpage>
          -
          <lpage>27</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>van Halteren</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Linguistic Profiling for authorship recognition and verification</article-title>
          .
          In:
          <source>Proceedings ACL 2004</source>
          . pp.
          <fpage>199</fpage>
          -
          <lpage>206</lpage>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>van Halteren</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Author verification by Linguistic Profiling: An exploration of the parameter space</article-title>
          .
          <source>ACM Transactions on Speech and Language Processing (TSLP)</source>
          <volume>4</volume>
          (
          <issue>1</issue>
          ),
          <fpage>1</fpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>van Halteren</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Source language markers in Europarl translations</article-title>
          .
          In:
          <source>Proceedings of COLING 2008, 22nd International Conference on Computational Linguistics</source>
          . pp.
          <fpage>937</fpage>
          -
          <lpage>944</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>van Halteren</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Bot and gender recognition on tweets using feature count deviations, notebook for PAN at CLEF2019</article-title>
          . In:
          <string-name>
            <surname>Cappellato</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Müller</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Losada</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (eds.)
          <source>CLEF 2019 Labs and Workshops, Notebook Papers</source>
          . CEUR Workshop Proceedings. CEUR-WS.org
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>van Halteren</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oostdijk</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>Linguistic Profiling of texts for the purpose of language verification</article-title>
          .
          In:
          <source>Proceedings of the 20th International Conference on Computational Linguistics</source>
          . p.
          <fpage>966</fpage>
          . Association for Computational Linguistics (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>van Halteren</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Speerstra</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>Gender recognition of Dutch tweets</article-title>
          .
          <source>Computational Linguistics in the Netherlands Journal</source>
          <volume>4</volume>
          ,
          <fpage>171</fpage>
          -
          <lpage>190</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Kestemont</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manjavacas</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daelemans</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Overview of the Cross-domain Authorship Attribution Task at PAN 2019</article-title>
          . In:
          <string-name>
            <surname>Cappellato</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Losada</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Müller</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          (eds.)
          <source>CLEF 2019 Labs and Workshops, Notebook Papers</source>
          . CEUR-WS.org (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Koppel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Winter</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Determining if two documents are written by the same author</article-title>
          .
          <source>Journal of the Association for Information Science and Technology</source>
          <volume>65</volume>
          (
          <issue>1</issue>
          ),
          <fpage>178</fpage>
          -
          <lpage>187</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Surdeanu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bauer</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Finkel</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bethard</surname>
            ,
            <given-names>S.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McClosky</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>The Stanford CoreNLP natural language processing toolkit</article-title>
          . In:
          <source>Association for Computational Linguistics (ACL) System Demonstrations</source>
          . pp.
          <fpage>55</fpage>
          -
          <lpage>60</lpage>
          (
          <year>2014</year>
          ), http://www.aclweb.org/anthology/P/P14/P14-5010
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Mendes Júnior</surname>
            ,
            <given-names>P.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>de Souza</surname>
            ,
            <given-names>R.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Werneck</surname>
            ,
            <given-names>R.d.O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pazinato</surname>
            ,
            <given-names>D.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>de Almeida</surname>
            ,
            <given-names>W.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Penatti</surname>
            ,
            <given-names>O.A.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Torres</surname>
            ,
            <given-names>R.d.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rocha</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Nearest neighbors distance ratio open-set classifier</article-title>
          .
          <source>Machine Learning</source>
          <volume>106</volume>
          (
          <issue>3</issue>
          ),
          <fpage>359</fpage>
          -
          <lpage>386</lpage>
          (
          Mar
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Padró</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stanilovsky</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>FreeLing 3.0: Towards wider multilinguality</article-title>
          .
          In:
          <source>Proceedings of the Language Resources and Evaluation Conference (LREC 2012)</source>
          . ELRA, Istanbul, Turkey (May
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gollub</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wiegmann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>TIRA Integrated Research Architecture</article-title>
          . In:
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peters</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          (eds.)
          <source>Information Retrieval Evaluation in a Changing World - Lessons Learned from 20 Years of CLEF</source>
          . Springer (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>