<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>EACH-USP Ensemble Cross-Domain Authorship Attribution</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <name>
            <surname>Custódio</surname>
            <given-names>José Eleandro</given-names>
          </name>
          <xref ref-type="aff" rid="aff1" />
        </contrib>
        <contrib contrib-type="author">
          <name>
            <surname>Paraboni</surname>
            <given-names>Ivandré</given-names>
          </name>
          <xref ref-type="aff" rid="aff1" />
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Arts, Sciences and Humanities (EACH), University of São Paulo (USP), São Paulo</institution>
          ,
          <country country="BR">Brazil</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <abstract>
        <p>We present an ensemble approach to cross-domain authorship attribution that combines predictions made by three independent classifiers, namely, standard char n-grams, char n-grams with non-diacritic distortion, and word n-grams. Our proposal relies on variable-length n-gram models and multinomial logistic regression, and selects the prediction of highest probability among the three models as the output for the task. Results generally outperform the PAN-CLEF 2018 baseline system, which makes use of fixed-length char n-grams and linear SVM classification.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Authorship attribution (AA) is the computational task of determining the author of a
given document from a number of possible candidates [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Systems of this kind have a
wide range of possible applications, from on-line fraud detection to plagiarism and/or
copyright protection. AA is presently a well-established research field, and a recurrent
topic in the PAN-CLEF shared task series [
        <xref ref-type="bibr" rid="ref5 ref7">7,5</xref>
        ].
      </p>
      <p>At PAN-CLEF 2018, a cross-domain authorship attribution task applied to fan
fiction texts was proposed. In this task, texts written by the same authors in multiple
domains were put together, creating a cross-domain setting. The task consists of
identifying the author of a given document based on texts of a different genre.</p>
      <p>
        The present work describes the results of our own entry in the PAN-CLEF 2018
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] AA shared task - hereby called the EACH-USP model - using both the baseline
system and the data provided by the event 1. The shared task comprises ten individual AA
tasks in five languages (English, French, Italian, Polish and Spanish), with two tasks
(of 5 or 20 candidate authors) per language.
      </p>
      <p>The rest of this paper is structured as follows. Section 2 briefly reviews related
work. Section 3 describes our main AA approach, and Section 4 describes its evaluation
over the PAN-CLEF 2018 AA dataset. Section 5 presents our results and those provided
by relevant baseline methods. Finally, Section 6 discusses these results and suggests
future work.
1 Available from https://pan.webis.de/clef18/pan18-web/author-identification.html</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>The present work shares similarities with a number of AA studies. Some of these are
briefly discussed below.</p>
      <p>
        The work in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] makes use of text distortion methods intended to preserve only the
text structure and style in a cross-domain AA setting. The work focused on the use
of word-level information, whereas our current proposal will focus on character-level
information.
      </p>
      <p>
        The work in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] investigates the role of affixes in the AA task by using char n-gram
models for the English language. Similarly, the work in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] addresses the use of char
n-grams models for the Portuguese language, and discusses the role of affix information
in the AA task. This is in principle relevant to our current work since the Portuguese
language shares a great deal of its structure with Spanish and Italian, which are two of
the target languages for the PAN-CLEF 2018 AA task.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Method</title>
      <p>Central to our approach is the idea that the AA task may rely on the combination of
different knowledge sources such as lexical preferences, morphological inflection,
uppercase usage, and text structure, and that different kinds of knowledge may be obtained
either from character-based or word-based text models. These alternatives are discussed
as follows.</p>
      <p>
        Word or content-based models may indicate word usage preferences, and may help
distinguish one author from another. However, we notice that a single author may favour
different words in different domains (e.g., fictional versus dialogue text). Moreover,
word-based models will usually discard punctuation and spaces, which may represent a
valuable knowledge source for AA. Character-based models, on the other hand, are known
for their ability to capture tense or gender inflection, among others, as well as
punctuation and spacing [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>Based on these observations, our approach to cross-domain authorship attribution
consists of a number of improvements over the standard PAN-CLEF 2018 baseline
system, organised as an ensemble method. In particular, we replace the original fixed-length
n-grams and linear SVM classification with variable-length n-grams and multinomial
logistic regression, and we combine predictions made by three independent classifiers to
determine the most likely author of a given document, as illustrated in Figure 1.</p>
      <p>Our proposal - hereby called the EACH-USP Ensemble model - combines the following
three classifiers:
– Std.charN: a variable-length char n-gram model
– Dist.charN: a variable-length char n-gram model in which non-diacritic characters were
distorted
– Std.wordN: a variable-length word n-gram model</p>
      <p>
        Both the Std.charN and Dist.charN models are intended to capture syntactic and
morphological clues for authorship attribution in a language-independent fashion. In the
latter, however, all alphabetical characters that do not bear diacritics are replaced by an
asterisk symbol beforehand, therefore focusing on the effects of punctuation, spacing and
the use of diacritics, numbers and other non-alphabetical symbols. This form of text
distortion [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]
is illustrated by the example in Table 1.
      </p>
      <p>Table 1. Example of non-diacritic text distortion.
Original text: -¿Y cómo sabes que no lo ama? -Inglaterra se preguntó a su vez si habría un muñeco del esposo también.
Transformed text: -¿* *ó** ***** *** ** ** ***? -********** ** *******ó * ** *** ** ****í* ** **ñ*** *** ****** *****é*.</p>
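      <p>As a rough sketch of this distortion step (our own illustration, not the authors’ published implementation; the helper name distort is ours), alphabetical characters without diacritics can be detected via Unicode decomposition and masked with asterisks:</p>
```python
import unicodedata

def distort(text: str) -> str:
    """Replace non-diacritic letters by '*', keeping punctuation,
    spacing, digits and diacritic-bearing letters intact."""
    out = []
    for ch in text:
        # A letter bears a diacritic if its NFD decomposition
        # yields more than one code point (base letter + combining mark)
        if ch.isalpha() and len(unicodedata.normalize("NFD", ch)) == 1:
            out.append("*")
        else:
            out.append(ch)
    return "".join(out)

print(distort("cómo sabes"))  # → *ó** *****
```
      <p>Applied to the Spanish example above, this reproduces the transformed text of Table 1: the diacritic-bearing ‘ó’, ‘í’, ‘ñ’ and ‘é’ survive, along with punctuation and spacing.</p>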
      <p>A major motivation for this approach is the observation that, in languages that make
use of diacritics, some authors may consistently use the correct spelling (as in ‘é’, which
is Portuguese for ‘is’) whereas others tend to ignore the need for diacritics by producing
the incorrect spelling (e.g., ‘e’) for the same purpose. In addition to these two
character-based models, we also consider a third model that is intended to capture lexical
preferences, hereby called Std.wordN.</p>
      <p>Predictions made by the three classifiers are combined into our Ensemble model by
selecting the most likely outcome for a given authorship attribution task. To this end,
the output probabilities of the three classifiers are concatenated and taken as input
features to a fourth, soft voting (ensemble) classifier. This in turn performs multinomial
logistic regression to select the winning strategy.</p>
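      <p>A minimal sketch of this combination scheme (a toy illustration on synthetic features standing in for the three n-gram views; the helper meta_features and all parameter values are our own assumptions, not the actual PAN-CLEF configuration):</p>
```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for three feature views (char, distorted char, word n-grams)
X, y = make_classification(n_samples=300, n_features=30, n_informative=10,
                           n_classes=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

views = [slice(0, 10), slice(10, 20), slice(20, 30)]  # one column slice per view
base = [LogisticRegression(max_iter=1000).fit(X_train[:, v], y_train)
        for v in views]

# Concatenate the three probability vectors as input to a meta-classifier
def meta_features(X_):
    return np.hstack([clf.predict_proba(X_[:, v]) for clf, v in zip(base, views)])

meta = LogisticRegression(max_iter=1000).fit(meta_features(X_train), y_train)
pred = meta.predict(meta_features(X_test))
print(pred.shape)  # (75,) — one predicted author per test document
```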
    </sec>
    <sec id="sec-4">
      <title>Experiment</title>
      <p>
        The models introduced in the previous section had their parameters set by using the
PAN-CLEF development dataset as follows. Features were scaled using scikit-learn’s
MaxAbsScaler transformer [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], and dimensionality was reduced using a standard PCA
implementation. PCA also helps remove correlated features, which is useful in the present
case since our models make use of variable-length feature concatenation.
      </p>
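      <p>A minimal sketch of such a preprocessing pipeline in scikit-learn (toy documents; the densifying step is our own addition, since scikit-learn’s PCA requires dense input, and all parameter values are illustrative):</p>
```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, MaxAbsScaler

docs = ["she pulled him closer", "an odd accent, apparently",
        "he grinned and said", "across the ancient hall"] * 10
authors = [0, 1, 2, 3] * 10

pipe = Pipeline([
    # variable-length char n-grams: lengths 2 through 5 concatenated
    ("ngrams", TfidfVectorizer(analyzer="char", ngram_range=(2, 5))),
    ("scale", MaxAbsScaler()),
    # PCA needs dense input, so densify the sparse tf-idf matrix first
    ("dense", FunctionTransformer(lambda X: X.toarray(), accept_sparse=True)),
    ("pca", PCA(n_components=3)),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(docs, authors)
print(pipe.predict(["she pulled him closer"]))
```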
      <p>The resulting feature sets were submitted to multinomial logistic regression by
considering a range of values, as summarised in Table 2.</p>
      <p>Optimal values for the regression task were determined by making use of grid search
and 5-fold cross-validation using an ensemble method. The optimal values that were
selected for subsequently training our actual models are illustrated in Table 3, in which
Start/End values denote the range of subsequences that were concatenated. For instance,
Start = 2 and End = 5 represents the concatenation of the subsequences [(2, 2), (2, 3),
..., (4, 5)].</p>
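      <p>The tuning procedure can be sketched as follows (toy data; the candidate n-gram ranges shown here are illustrative, not the actual grid of Table 2):</p>
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

docs = ["she pulled him closer", "an odd accent, apparently",
        "he grinned and said", "across the ancient hall"] * 5
authors = [0, 1, 2, 3] * 5

pipe = Pipeline([
    ("ngrams", TfidfVectorizer(analyzer="char")),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Each candidate ngram_range concatenates all n-gram lengths from Start to End
grid = GridSearchCV(
    pipe,
    param_grid={"ngrams__ngram_range": [(2, 3), (2, 4), (2, 5)]},
    cv=5,
    scoring="f1_macro",
)
grid.fit(docs, authors)
print(grid.best_params_)
```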
      <p>Tables 4, 5 and 6 show the ten most relevant features for AA Problem00002, which
comprises texts written by five candidate authors in the English language. In this
representation, blank spaces were encoded as underscore symbols, and relevance is represented
by the weights of the multinomial logistic regression model. These weights were estimated
after scaling the features to a mean of 0 and a standard deviation of 1.</p>
      <p>Being a language-independent approach, information regarding function words was
not taken into account, although this might have been helpful since function words
usually play a rather prominent role in AA (as opposed to, e.g., content words, which
may arguably be more relevant to other text categorisation tasks). We notice, however,
that function words were made explicit by the Std.wordN model. Moreover, we notice
that all models also made explicit (to some extent) a number of individual preferences
regarding word usage, punctuation and spacing, and that Dist.charN provides some
evidence of the role of punctuation marks, spacing and hyphenation.</p>
      <table-wrap>
        <table>
          <thead>
            <tr><th>candidate00001</th><th>candidate00002</th><th>candidate00003</th><th>candidate00004</th><th>candidate00005</th></tr>
          </thead>
          <tbody>
            <tr><td>about_what</td><td>against_his</td><td>an_odd</td><td>although</td><td>and_pulled_him</td></tr>
            <tr><td>and_practically</td><td>and_it_was</td><td>and_then_he</td><td>an_eye</td><td>and_pulling</td></tr>
            <tr><td>any_of</td><td>and_so</td><td>acknowledged</td><td>and_said</td><td>across_his</td></tr>
            <tr><td>any_more</td><td>and_already</td><td>and_he_had</td><td>and_takes</td><td>across_the</td></tr>
            <tr><td>and_nearly</td><td>and_steve</td><td>are_your</td><td>and_just</td><td>and_all</td></tr>
            <tr><td>and_pulled</td><td>and_say</td><td>again_to</td><td>ancient</td><td>against_her</td></tr>
            <tr><td>agree</td><td>accent</td><td>and_tell</td><td>amount_of</td><td>among</td></tr>
            <tr><td>all_tony</td><td>and_wet</td><td>and_forth</td><td>always</td><td>about_what_to</td></tr>
            <tr><td>ah</td><td>apparently</td><td>are_just</td><td>and_grinned</td><td>acting</td></tr>
            <tr><td>and_wet_and</td><td>after</td><td>and_grabbing</td><td>about_the</td><td>about_their</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-5">
      <title>Results</title>
      <p>Table 7 presents macro F-measure results for the original PAN-CLEF 2018 baseline
system, our three individual classifiers and the Ensemble model for the ten PAN-CLEF
2018 authorship attribution tasks over the development data. To this end, the baseline
was optimised using 4-grams, minimum document frequency of 5 and One-vs-Rest as
the SVM multi-class strategy. Our models were optimised individually using the
parameters described in Table 2, and output probabilities were combined by using multinomial
logistic regression in a soft voting ensemble fashion. Best results are highlighted.
From these results, a number of observations are warranted. First, we notice that
Std.charN generally obtained the best results among the three individual classifiers.
We also notice that Dist.charN performs worse than Std.charN. This was to be
expected since Dist.charN conveys less information, that is, it may be seen as a subset of
Std.charN.</p>
      <p>Our Ensemble model consistently outperformed the alternatives by using soft voting.
In our experiments, we noticed that combining the three knowledge sources obtained the
best results. In all cases, the relevant features turned out to be of variable length, ranging
from 1- to 5-grams.</p>
    </sec>
    <sec id="sec-6">
      <title>Final remarks</title>
      <p>This paper presented an ensemble approach to cross-domain authorship attribution that
combines predictions made by a standard char n-gram model, a char n-gram model with
non-diacritic distortion and a word n-gram model using variable-length n-gram models
and multinomial logistic regression. Results generally outperform the PAN-CLEF 2018
baseline system that makes use of fixed-length char n-grams and linear SVM
classification. As future work, we intend to investigate alternative text models and distortion
methods for prefixes, suffixes and other text components.</p>
      <p>Acknowledgements. The second author received financial support from FAPESP grant
no. 2016/14223-0.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Gollub</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Beyer</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Busse</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Recent trends in digital text forensics and its evaluation: Plagiarism detection, author identification, and author profiling</article-title>
          .
          <source>In: LNCS 8138</source>
          . pp.
          <fpage>282</fpage>
          -
          <lpage>302</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Kestemont</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tschugnall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daelemans</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Specht</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Overview of the Author Identification Task at PAN-2018: Cross-domain Authorship Attribution and Style Change Detection</article-title>
          . In:
          <string-name>
            <surname>Cappellato</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nie</surname>
            ,
            <given-names>J.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soulier</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          (eds.)
          <article-title>Working Notes Papers of the CLEF 2018 Evaluation Labs</article-title>
          .
          <source>CEUR Workshop Proceedings, CLEF and CEUR-WS.org (Sep</source>
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Markov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baptista</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pichardo-Lagunas</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Authorship attribution in Portuguese using character N-grams</article-title>
          .
          <source>Acta Polytechnica Hungarica</source>
          <volume>14</volume>
          (
          <issue>3</issue>
          ),
          <fpage>59</fpage>
          -
          <lpage>78</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Pedregosa</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Varoquaux</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gramfort</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Michel</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thirion</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grisel</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blondel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prettenhofer</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weiss</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dubourg</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          , et al.:
          <article-title>Scikit-learn: Machine learning in Python</article-title>
          .
          <source>Journal of machine learning research 12</source>
          ,
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tschuggnall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Overview of PAN 17: Author identification, author profiling, and author obfuscation</article-title>
          .
          <source>In: LNCS 10456</source>
          . pp.
          <fpage>275</fpage>
          -
          <lpage>290</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Rocha</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scheirer</surname>
            ,
            <given-names>W.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Forstall</surname>
            ,
            <given-names>C.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cavalcante</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Theophilo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shen</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carvalho</surname>
            ,
            <given-names>A.R.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Authorship Attribution for Social Media Forensics</article-title>
          .
          <source>IEEE Transactions on Information Forensics and Security</source>
          <volume>12</volume>
          (
          <issue>1</issue>
          ),
          <fpage>5</fpage>
          -
          <lpage>33</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tschuggnall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Overview of PAN 16: New challenges for authorship analysis: Cross-genre profiling, clustering, diarization, and obfuscation</article-title>
          .
          <source>In: LNCS 9822</source>
          . pp.
          <fpage>332</fpage>
          -
          <lpage>350</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Sapkota</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bethard</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montes-Y-Gómez</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Solorio</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Not all character n-grams are created equal: A study in authorship attribution</article-title>
          .
          <source>In: Proceedings of NAACL HLT</source>
          <year>2015</year>
          . pp.
          <fpage>93</fpage>
          -
          <lpage>102</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Authorship attribution using text distortion</article-title>
          . In:
          <source>Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics (EACL-2017)</source>
          . Association for Computational Linguistics, Valencia, Spain (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>