<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multi-channel Open-set Cross-domain Authorship Attribution</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>José Eleandro Custódio and Ivandré Paraboni</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Arts, Sciences and Humanities (EACH) University of São Paulo (USP) São Paulo</institution>
          ,
          <country country="BR">Brazil</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <volume>2</volume>
      <fpage>393</fpage>
      <lpage>407</lpage>
      <abstract>
        <p>This paper describes a multi-channel approach to open-set cross-domain authorship attribution (AA) for the PAN-CLEF 2019 AA shared task. The present work adapts the EACH-USP ensemble method presented at PAN-CLEF 2018 to an open-set scenario by defining a threshold value for unknown authors, and extends the previous architecture with an additional character ranking model built with the aid of the PageRank algorithm. Results are superior to a number of baseline systems, and remain generally comparable to those in the original closed-set ensemble approach.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Authorship attribution (AA) is the computational task of identifying the author of a
given text by examining samples of texts written by a number of candidate authors
[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Practical applications include, for instance, the detection of internet misuse, text
forensics for copyright protection, and many others [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>AA may be based on single- or cross-domain settings. In this paper we discuss the
latter, that is, situations in which we would like to identify the author of a text in a
certain genre based on samples of text written in another genre.</p>
      <p>
        From a computational perspective, we may distinguish two AA problem definitions:
closed- and open-set AA. Closed-set AA assumes that the author of a disputed text
necessarily belongs to a pre-defined set of possible candidates. This subtask was the
theme of the PAN-CLEF 2018 shared task in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Open-set AA, by contrast, assumes
that the disputed text may not necessarily belong to any known candidate [18]. This
subtask was the theme of the PAN-CLEF 2019 shared task, and it is also the focus of
the present work.
      </p>
      <p>
        In the context of closed-set AA, the work in [
        <xref ref-type="bibr" rid="ref2 ref3">2,3</xref>
        ] presented an ensemble approach
that combines predictions made by three knowledge channels, namely, standard
character n-grams, character n-grams with non-diacritic distortion and word n-grams. In the
present work, this method is adapted to an open-set scenario by defining a threshold
value for unknown authors, and further extended with the inclusion of a fourth
channel based on a character ranking model built with the aid of the PageRank algorithm
[10,15].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2 Related Work</title>
      <p>
        The present work consists of an extension of the ensemble AA approach in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. This,
and a number of related studies, are briefly discussed below.
      </p>
      <p>
        The work in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] presented an ensemble approach to cross-domain AA called
EACH-USP, which combines predictions made by three independent classifiers based on word
n-grams (Std.wordN), standard character n-grams (Std.charN), and character n-grams
with non-diacritic distortion (Dist.charN). The method relies on variable-length n-gram
models and multinomial logistic regression, and selects the prediction of highest
probability among the three models as the output for the task by soft voting.
      </p>
      <p>
        The word-based Std.wordN model in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] is intended to help distinguish an author
from another based on word usage. However, given that a single author may favour
different words across domains (e.g., fictional versus dialogue text), and that
word-based models will usually discard punctuation and blank spaces that may represent
a valuable knowledge source for AA [13], the character-based models Std.charN and
Dist.charN were added as a means to capture tense and gender inflection, punctuation
and spacing.
      </p>
      <p>
        Both Std.charN and Dist.charN models in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] are intended to capture
language-independent syntactic and morphological clues for AA. In the latter, all characters that
do not represent diacritics are removed from the text beforehand, therefore focusing on
the effects of punctuation, spacing and the use of diacritics, numbers and other
non-alphabetical symbols.
      </p>
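      <p>As a rough illustration of the kind of distortion used by Dist.charN, the following Python sketch masks every plain letter and leaves accented letters, digits, spaces and punctuation in place. The function names and the use of a '*' placeholder (rather than outright deletion) are assumptions made for this example, not details taken from [2].</p>

```python
import unicodedata


def has_diacritic(ch):
    """True if the character decomposes into a base symbol plus a combining mark."""
    decomposed = unicodedata.normalize("NFD", ch)
    return any(unicodedata.combining(c) for c in decomposed)


def distort(text, mask="*"):
    """Mask plain letters; keep accented letters, digits, spaces and punctuation.

    The mask symbol is a hypothetical choice made for readability.
    """
    out = []
    for ch in text:
        if ch.isalpha() and not has_diacritic(ch):
            out.append(mask)
        else:
            out.append(ch)
    return "".join(out)


print(distort("S\u00e3o Paulo, 2019!"))
```

      <p>Keeping a placeholder preserves word lengths and the position of each retained symbol, which may or may not match the behaviour of the original model.</p>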
      <p>
        For further details regarding the ensemble method, we refer the reader to [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Character
models are extensively discussed in [14], with details regarding the role of affixes and
prefixes in the task. Function words and word n-gram models are discussed in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Text
distortion models for removing noise information from text are discussed in [17].
      </p>
      <p>Finally, the work in [19] creates word-adjacency graphs and extracts weighted
clustering coefficients and weighted degrees from certain nodes in the word-adjacency
network. An AA knowledge channel along these lines will be addressed in our own work
as discussed in Section 4.</p>
    </sec>
    <sec id="sec-3">
      <title>3 Corpus and Baseline Analysis</title>
      <p>We started our investigation by examining the PAN-CLEF 2019 cross-domain AA
dataset (https://pan.webis.de/clef19/pan19-web/author-identification.html), and by comparing
the results obtained by the baseline systems provided. This
analysis is described as follows.</p>
      <p>
        The PAN-CLEF 2019 AA development dataset comprises 20 problems written in four
languages (English, French, Italian and Spanish), with nine candidate authors per
problem, seven documents per candidate and an average of 4500 characters per document.
The shared task organisers also provided three baseline systems, namely, compression
models [20,11], the Impostors method [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], and an SVM classifier based on character
trigrams. Further details are provided in [13]. Figure 1 presents a comparison between
macro F1 scores obtained from the three baseline systems for each target language.
      </p>
      <p>From Figure 1 we notice that the SVM classifier has the best overall performance
among the three baseline systems. Moreover, we notice that the three systems obtained
similar results in the case of the English dataset.</p>
      <p>Figure 2 presents a comparison among the same baseline methods according to the
number of unknown documents under consideration.</p>
      <p>From Figure 2 we notice that the proportion of unknown texts in each dataset, or
openness of the AA task, has a considerable impact on the performance of all models.
This confirms the general intuition that open-set AA is more challenging than closed-set
AA.
      </p>
    </sec>
    <sec id="sec-4">
      <title>4 Current Work</title>
      <p>
        As in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], our current approach to AA assumes that evidence of an author’s identity may
be found in multiple layers of morphological, syntactic and semantic knowledge. These
layers may be modelled as knowledge channels that use character- and word-based
n-grams as their main source for feature extraction [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Channels of this kind tend to be
relatively independent from each other, that is, the information captured by one channel
may not necessarily be captured by another.
      </p>
      <p>
        Based on these observations, we follow the work in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and address the AA task by
making use of multiple models combined as an ensemble of classifiers. More
specifically, our current approach extends the ensemble method in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] by adding a fourth
module to the existing set of channels (Std.wordN, Std.charN and Dist.charN, cf.
Section 2), and proposes further adjustments for the open-set AA setting.
      </p>
      <p>4.1 A Character Ranking Model for AA</p>
      <p>
        Language models are central to a wide range of natural language processing tasks.
Accordingly, many studies have attempted to estimate the probability of a word (or
character) appearing after a given symbol [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. N-grams and recurrent neural networks
[
        <xref ref-type="bibr" rid="ref1">1,16</xref>
        ] are the most well-known methods of this kind.
      </p>
      <p>Of particular interest to the present work, language models may be represented as a
character adjacency graph, in which the degree of influence of each node may help
capture the (most influential) character sequences that denote a particular author. Influence
may be measured, for instance, by using the PageRank algorithm [10,15]. In this case,
the influence of a node is defined by Equation 1, in which N is the number of nodes,
α is the alpha (damping) factor of the original algorithm, M(pi) is the set of nodes that
link to pi, and L(pj) is the number of outgoing edges of node pj.</p>
      <disp-formula id="eq-1">
        <label>(1)</label>
        <tex-math>PR(p_i) = \frac{1 - \alpha}{N} + \alpha \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)}</tex-math>
      </disp-formula>
      <p>Using this method as a basis, we envisaged a character ranking model for AA,
hereby called Rank.char, that computes character adjacency graphs and uses PageRank
to select the most influential characters of a set of documents of a given author. For
instance, the word ‘the’ gives rise to three nodes t, h and e, and two edges, t → h and
h → e.</p>
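      <p>The construction above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the Rank.char implementation: the helper names char_graph and pagerank are hypothetical, edges are unweighted, the convergence tolerance is an arbitrary choice, and the rank of nodes without outgoing edges is spread uniformly over all nodes, a common convention that Equation 1 does not spell out.</p>

```python
def char_graph(text):
    """Directed graph linking each character to the characters that follow it."""
    graph = {ch: set() for ch in text}
    for src, dst in zip(text, text[1:]):
        graph[src].add(dst)
    return graph


def pagerank(graph, alpha=0.85, max_iter=500, tol=1e-10):
    """Power-iteration PageRank over a dict mapping node to its successors."""
    nodes = sorted(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Assumption: rank held by nodes with no outgoing edges is shared by all.
        dangling = sum(rank[node] for node in nodes if not graph[node])
        new_rank = {node: (1.0 - alpha) / n + alpha * dangling / n for node in nodes}
        for src in nodes:
            for dst in graph[src]:
                new_rank[dst] += alpha * rank[src] / len(graph[src])
        delta = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if tol > delta:
            break
    return rank


ranks = pagerank(char_graph("the"))
for node in sorted(ranks, key=ranks.get, reverse=True):
    print(node, round(ranks[node], 3))
```

      <p>For the word ‘the’, this sketch ranks e above h and h above t, since rank flows along the chain t → h → e.</p>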
      <p>Once the adjacency graph is computed, symbols of frequency lower than five are
removed, and the resulting structure is submitted to the PageRank algorithm to determine
its most influential nodes. The algorithm is executed with a maximum of 500 iterations,
and an alpha value set to 0.85. The output, a matrix of size |d| × |v| where d is the set of
documents and v is the corpus vocabulary, is then fed into the AA pipeline.</p>
      <p>There are many possible strategies for combining the outputs of a set of classifiers.
Among these, the most common are averaging, soft voting and hard voting. Averaging
simply averages the predictions made by each classifier and chooses the class with
the highest probability. In hard voting, the majority vote is used as the final decision and, in
soft voting, a weighted vote is considered.</p>
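      <p>To make the difference between these strategies concrete, the following Python sketch applies each of them to the class probabilities of three hypothetical classifiers. The function names, the example probabilities and the weights are invented for illustration only.</p>

```python
from collections import Counter


def average_vote(probas):
    """Average the per-classifier probability vectors and pick the best class."""
    n_classes = len(probas[0])
    mean = [sum(p[c] for p in probas) / len(probas) for c in range(n_classes)]
    return max(range(n_classes), key=mean.__getitem__)


def hard_vote(probas):
    """Each classifier votes for its top class; the majority wins."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in probas]
    return Counter(votes).most_common(1)[0][0]


def soft_vote(probas, weights):
    """Weighted sum of probability vectors; the weights are illustrative."""
    n_classes = len(probas[0])
    score = [sum(w * p[c] for w, p in zip(weights, probas)) for c in range(n_classes)]
    return max(range(n_classes), key=score.__getitem__)


# Rows are classifiers, columns are candidate authors (made-up values).
probas = [
    [0.90, 0.10, 0.00],
    [0.40, 0.60, 0.00],
    [0.45, 0.55, 0.00],
]
print(average_vote(probas))
print(hard_vote(probas))
print(soft_vote(probas, [0.10, 0.45, 0.45]))
```

      <p>In this toy case, averaging selects the first author because one classifier is very confident, whereas hard voting and the weighted vote follow the other two classifiers and select the second author.</p>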
      <p>
        In the present work we follow [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and consider the use of a soft voting method in
which the probabilities produced by a set of classifiers are concatenated and taken as
an input to a softmax logistic regression model. This strategy is motivated by similar
methods commonly applied in convolutional neural network learning, in which multiple
filters are applied to a stream of text, and subsequently combined by using a softmax
layer. In the present AA setting, this method allows full filter (or channel) optimisation
with the benefits of soft voting, which may be particularly suitable to scenarios with
a restricted number of text samples per author.
      </p>
      <p>
        Our resulting architecture is illustrated in Figure 3. The first three channels are
similar to those in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], whereas the last channel (Rank.char) represents our current extension.
      </p>
      <p>The output of the ensemble method is a matrix of probabilities with d rows
representing documents and a columns representing authors, in which dij is the
probability that document di belongs to author aj. The openness aspect of the AA task
at PAN-2019 (i.e., the fact that an input text may not belong to any of the candidate
authors) is dealt with by assigning the unknown author (&lt;UNK&gt;) label to the input text
when the standard deviation of the corresponding row is below a threshold of 0.05.</p>
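      <p>The open-set decision rule can be sketched as follows. This is only an illustration under stated assumptions: the function name attribute and the author identifiers are hypothetical, the string "UNK" stands in for the &lt;UNK&gt; label, and the population standard deviation is used because the text does not say whether the population or the sample estimate was intended.</p>

```python
from statistics import pstdev

UNKNOWN_LABEL = "UNK"  # stand-in for the unknown-author label


def attribute(prob_row, authors, threshold=0.05):
    """Return the most probable author, or UNKNOWN_LABEL when the row is too flat."""
    # Assumption: population standard deviation over the candidate probabilities.
    if threshold > pstdev(prob_row):
        return UNKNOWN_LABEL
    best = max(range(len(authors)), key=prob_row.__getitem__)
    return authors[best]


authors = ["a1", "a2", "a3"]
print(attribute([0.80, 0.15, 0.05], authors))  # peaked row
print(attribute([0.35, 0.33, 0.32], authors))  # nearly uniform row
```

      <p>A nearly uniform row means that no candidate stands out, which is why a low standard deviation is read as evidence of an unknown author.</p>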
    </sec>
    <sec id="sec-5">
      <title>5 Evaluation</title>
      <p>Model parameters were set by using the PAN-CLEF 2019 development dataset as
follows. Features were scaled using the scikit-learn MaxAbsScaler transformer, and
dimensionality reduction was performed by using a standard PCA implementation. PCA also helps
remove correlated features, which is particularly useful in the present case because our
models make use of variable length feature concatenation. The resulting feature sets
were submitted to multinomial logistic regression by considering a range of possible
alternative values as summarised in Figure 4.</p>
      <p>Optimal values for each pipeline were determined by making use of grid search
and 3-fold cross validation using an ensemble method. The optimal values that were
selected for training of our actual models are summarised in Figure 5. In this summary,
a sequence as in, e.g., Start=2 and End=5 is intended to represent the concatenation of
subsequences [(2, 2), (2, 3), ..., (4, 3), (4, 5)], assuming that Start is not greater than End.</p>
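      <p>One possible reading of the Start and End notation is sketched below in Python. The helper names ngram_ranges and char_ngrams are hypothetical, and the enumeration rule (lower bounds from Start to End - 1, upper bounds from the lower bound to End) is a guess chosen only because it begins with (2, 2), (2, 3) and ends with (4, 5) as in the sequence above; the actual rule used in the experiments may differ.</p>

```python
def ngram_ranges(start, end):
    """Enumerate (low, high) n-gram ranges under the assumed reading of Start/End."""
    if start > end:
        raise ValueError("Start must not be greater than End")
    return [(low, high) for low in range(start, end) for high in range(low, end + 1)]


def char_ngrams(text, low, high):
    """All character n-grams of length low..high, shortest first."""
    return [
        text[i:i + n]
        for n in range(low, high + 1)
        for i in range(len(text) - n + 1)
    ]


print(ngram_ranges(2, 5))
print(char_ngrams("the", 2, 3))
```

      <p>Under this reading, each (low, high) pair yields one feature block, and the blocks are concatenated into a single feature vector per document.</p>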
      <p>In addition to the main experiments presently reported, a large number of
alternatives were considered as well. These included the use of BM25 and one-hot
representation for feature extraction, and the use of bagging, boosting, multi-layer perceptron,
decision tree induction and other learning methods. All these results were however
below those obtained by the present approach, and were therefore discarded.
      </p>
    </sec>
    <sec id="sec-6">
      <title>6 Results</title>
      <p>
        Table 1 presents macro F1 results based on the PAN-CLEF 2019 test dataset and
evaluation software [12] as obtained by the original baseline systems, our four individual
classifiers, the ensemble approach EACH-USP taken from [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], and by the current method.
Baseline systems were trained with their default parameters, and all models were
individually optimised by using the parameters described in Figure 5. Best results for each
problem are highlighted.
      </p>
      <p>From these results we notice that the current approach keeps a relatively good
performance overall. Figure 6 presents a comparison between macro F1 scores obtained
from the SVM baseline, the char n-gram model with variable range Std.charN, and the
EACH-USP and current ensemble methods for each target language.</p>
      <p>From these results we notice that the use of Rank.char was more effective for the
Italian language dataset. Moreover, the task seems to be more challenging in the case
of the English dataset than for the other languages.</p>
      <p>Finally, Figure 7 presents a comparison among the same methods according to the
number of unknown documents under consideration.</p>
      <p>Once again, we notice that the percentage of documents of unknown authors had a
great impact on all systems under evaluation regardless of other factors.</p>
    </sec>
    <sec id="sec-7">
      <title>7 Final Remarks</title>
      <p>
        This paper has proposed an extension to the work in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] by presenting an approach
for open-set cross-domain authorship attribution that relies on fully optimised char
n-grams, word n-grams and char-ranking models. To this end, results obtained from the
individual models as probability vectors were combined by making use of a soft voting
ensemble method, and unknown authors were classified by considering the standard
deviation of the final probability vector.
      </p>
      <p>
        Our current results are generally superior to those obtained by the PAN-CLEF 2019
baseline systems, but were not generally superior to the work in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Although the
compact text representation provided by the current Rank.char model does help improve
some of our results, the Dist.charN model from [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] remains the most useful knowledge
source within this ensemble approach even in the present open-set AA setting.
      </p>
      <p>As future work, we intend to experiment with other kinds of network influence
methods, and further customise the PageRank algorithm [10,15] for the AA problem.
The use of part-of-speech and embedding channels for AA is also to be investigated.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgements</title>
      <p>The second author was supported by FAPESP grant no. 2016/14223-0.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bagnall</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Author identification using multi-headed recurrent neural networks</article-title>
          . In: Cappellato, L., Ferro, N., Jones, G.J.F., San Juan, E. (eds.)
          <source>CEUR Workshop Proceedings</source>
          . vol.
          <volume>1391</volume>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          . CEUR-WS (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Custódio</surname>
            ,
            <given-names>J.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paraboni</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>EACH-USP Ensemble Cross-domain Authorship Attribution: Notebook for PAN at CLEF 2018</article-title>
          . In: Cappellato,
          <string-name>
            <given-names>L.</given-names>
            ,
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            ,
            <surname>Nie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.Y.</given-names>
            ,
            <surname>Soulier</surname>
          </string-name>
          ,
          <string-name>
            <surname>L</surname>
          </string-name>
          . (eds.)
          <article-title>Working Notes Papers of the CLEF 2018 Evaluation Labs</article-title>
          .
          <source>CEUR Workshop Proceedings, CLEF and CEUR-WS.org (Sep</source>
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Custódio</surname>
            ,
            <given-names>J.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paraboni</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Multi-channel Open-set Cross-domain Authorship Attribution</article-title>
          . In:
          <source>Working Notes Papers of the Conference and Labs of the Evaluation Forum (CLEF-2019)</source>
          (to appear). Lugano, Switzerland (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Goldberg</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Neural Network Methods in Natural Language Processing</article-title>
          . Morgan &amp; Claypool Publishers (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Juola</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>An overview of the traditional authorship attribution subtask</article-title>
          . In:
          <source>CLEF 2012 Evaluation Labs and Workshop, Online Working Notes</source>
          , Rome, Italy, September 17-20 (
          <year>2012</year>
          ),
          <uri>http://ceur-ws.org/Vol-1178/CLEF2012wn-PAN-Juola2012.pdf</uri>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Kestemont</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Function Words in Authorship Attribution From Black Magic to Theory?</article-title>
          <source>In: 3rd Workshop on Computational Linguistics for Literature (CLfL</source>
          <year>2014</year>
          ). pp.
          <fpage>59</fpage>
          -
          <lpage>66</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Kestemont</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tschuggnall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daelemans</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Specht</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Overview of the Author Identification Task at PAN-2018: Cross-domain Authorship Attribution and Style Change Detection</article-title>
          . In: Cappellato,
          <string-name>
            <given-names>L.</given-names>
            ,
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            ,
            <surname>Nie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.Y.</given-names>
            ,
            <surname>Soulier</surname>
          </string-name>
          ,
          <string-name>
            <surname>L</surname>
          </string-name>
          . (eds.)
          <article-title>Working Notes Papers of the CLEF 2018 Evaluation Labs</article-title>
          .
          <source>CEUR Workshop Proceedings, CLEF and CEUR-WS.org (Sep</source>
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Koppel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Argamon</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Computational Methods in Authorship Attribution</article-title>
          .
          <source>Journal of the Association for Information Science and Technology</source>
          <volume>60</volume>
          (
          <issue>1</issue>
          ),
          <fpage>9</fpage>
          -
          <lpage>26</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Koppel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Seidman</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Detecting pseudepigraphic texts using novel similarity measures</article-title>
          .
          <source>Digital Scholarship in the Humanities</source>
          <volume>33</volume>
          (
          <issue>1</issue>
          ),
          <fpage>72</fpage>
          -
          <lpage>81</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>