EACH-USP Ensemble Cross-Domain Authorship Attribution
Notebook for PAN at CLEF 2018

José Eleandro Custódio and Ivandré Paraboni
School of Arts, Sciences and Humanities (EACH), University of São Paulo (USP)
São Paulo, Brazil
{eleandro,ivandre}@usp.br

Abstract. We present an ensemble approach to cross-domain authorship attribution that combines predictions made by three independent classifiers, namely, standard char n-grams, char n-grams with non-diacritic distortion, and word n-grams. Our proposal relies on variable-length n-gram models and multinomial logistic regression, and selects the prediction of highest probability among the three models as the output for the task. Results generally outperform the PAN-CLEF 2018 baseline system, which makes use of fixed-length char n-grams and linear SVM classification.

1 Introduction

Authorship attribution (AA) is the computational task of determining the author of a given document from a number of possible candidates [1]. Systems of this kind have a wide range of possible applications, from on-line fraud detection to plagiarism detection and copyright protection. AA is presently a well-established research field, and a recurrent topic in the PAN-CLEF shared task series [7,5].

At PAN-CLEF 2018, a cross-domain authorship attribution task applied to fan fiction texts was proposed. In this task, texts written by the same authors in multiple domains were put together, creating a cross-domain setting. The task consists of identifying the author of a given document based on texts of a different genre.

The present work describes the results of our own entry in the PAN-CLEF 2018 [2] AA shared task, hereby called the EACH-USP model, using both the baseline system and the data provided by the event¹. The shared task consists of ten individual AA problems in five languages (English, French, Italian, Polish and Spanish), with two problems per language (one with 5 and one with 20 candidate authors).

The rest of this paper is structured as follows. Section 2 reviews related work, Section 3 describes our main AA approach, and Section 4 describes its evaluation over the PAN-CLEF 2018 AA dataset. Section 5 presents our results and those of relevant baseline methods. Finally, Section 6 discusses these results and suggests future work.

¹ Available from https://pan.webis.de/clef18/pan18-web/author-identification.html

2 Related Work

The present work shares similarities with a number of AA studies, some of which are briefly discussed below.

The work in [9] makes use of text distortion methods intended to preserve only the text structure and style in a cross-domain AA setting. That work focused on the use of word-level information, whereas our current proposal focuses on character-level information.

The work in [8] investigates the role of affixes in the AA task by using char n-gram models for the English language. Similarly, the work in [3] addresses the use of char n-gram models for the Portuguese language, and discusses the role of affix information in the AA task. This is in principle relevant to our current work, since Portuguese shares a great deal of its structure with Spanish and Italian, two of the target languages of the PAN-CLEF 2018 AA task.

3 Method

Central to our approach is the idea that the AA task may rely on the combination of different knowledge sources, such as lexical preferences, morphological inflection, uppercase usage and text structure, and that different kinds of knowledge may be obtained either from character-based or word-based text models.
These alternatives are discussed as follows.

Word or content-based models may indicate word usage preferences, and may help distinguish one author from another. However, we notice that a single author may favour different words in different domains (e.g., fictional versus dialogue text). Moreover, word-based models will usually discard punctuation and spaces, which may represent a valuable knowledge source for AA. Character-based models, on the other hand, are known for their ability to capture tense or gender inflection, among others, as well as punctuation and spacing [6].

Based on these observations, our approach to cross-domain authorship attribution consists of a number of improvements over the standard PAN-CLEF 2018 baseline system, organised as an ensemble method. In particular, we replace the original fixed-length n-grams and linear SVM classification with variable-length n-grams and multinomial logistic regression, and we combine the predictions made by three independent classifiers to determine the most likely author of a given document, as illustrated in Figure 1. Our proposal, hereby called the EACH-USP Ensemble model, combines the following three classifiers:

– Std.charN: a variable-length char n-gram model
– Dist.charN: a variable-length char n-gram model in which non-diacritic letters are distorted
– Std.wordN: a variable-length word n-gram model

Fig. 1. Ensemble cross-domain AA architecture

Both the Std.charN and Dist.charN models are intended to capture syntactic and morphological clues for authorship attribution in a language-independent fashion. In the latter, however, all letters that do not bear diacritics are masked beforehand, therefore focusing on the effects of punctuation, spacing and the use of diacritics, numbers and other non-alphabetical symbols. This form of text distortion [9] is illustrated by the example in Table 1.

Table 1. Example of text distortion using the first document of the 9th training dataset.

Original text:    -¿Y cómo sabes que no lo ama? -Inglaterra se preguntó a su vez si habría un muñeco del esposo también.
Transformed text: -¿* *ó** ***** *** ** ** ***? -********** ** *******ó * ** *** ** ****í* ** **ñ*** *** ****** *****é*.

A major motivation for this approach is the observation that, in languages that make use of diacritics, some authors may consistently use the correct spelling (as in ‘é’, which is Portuguese for ‘is’) whereas others tend to ignore the need for diacritics by producing the incorrect spelling (e.g., ‘e’) for the same purpose. In addition to these two character-based models, we also consider a third model, hereby called Std.wordN, which is intended to capture lexical preferences.

Predictions made by the three classifiers are combined into our Ensemble model by selecting the most likely outcome for a given authorship attribution task. To this end, the three individual outputs are concatenated and taken as input features to a fourth, soft voting (ensemble) classifier. This in turn performs multinomial logistic regression to select the winning prediction.

4 Experiment

The models introduced in the previous section had their parameters set using the PAN-CLEF development dataset as follows. Features were scaled using scikit-learn's MaxAbsScaler transformer [4], and dimensionality was reduced using a standard PCA implementation. PCA also helps remove correlated features, which is useful in the present case since our models make use of variable-length feature concatenation.
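To make one such base model concrete, the sketch below shows how it could be assembled with scikit-learn [4]. This is a minimal illustration under our own assumptions, not the authors' actual code: the function names, the regular expression used for distortion, and the densifying step before PCA are ours.

```python
import re

from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, MaxAbsScaler


def distort_non_diacritics(text):
    """Mask plain ASCII letters with '*', keeping diacritics, digits,
    punctuation and spacing intact (cf. Table 1)."""
    return re.sub(r"[A-Za-z]", "*", text)


def make_base_model(distort=False):
    """Build one base classifier: TF-IDF over variable-length char
    n-grams, MaxAbsScaler, PCA and multinomial logistic regression."""
    return Pipeline([
        ("tfidf", TfidfVectorizer(
            analyzer="char",
            ngram_range=(2, 5),   # n-gram lengths 2..5 (Start=2, End=5)
            sublinear_tf=True, smooth_idf=True, norm="l2",
            min_df=0.05, max_df=1.0,
            preprocessor=distort_non_diacritics if distort else None)),
        ("scale", MaxAbsScaler()),
        # PCA needs dense input, so the sparse TF-IDF matrix is
        # densified first (our assumption about this step).
        ("densify", FunctionTransformer(lambda X: X.toarray())),
        ("pca", PCA(n_components=0.99)),  # keep 99% explained variance
        ("clf", LogisticRegression(multi_class="multinomial",
                                   solver="lbfgs", max_iter=1000)),
    ])
```

Under this reading, make_base_model() and make_base_model(distort=True) would correspond to Std.charN and Dist.charN, respectively, while a Std.wordN counterpart would use analyzer="word" with ngram_range=(1, 3).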
The resulting feature sets were submitted to multinomial logistic regression by considering a range of parameter values, as summarised in Table 2.

Table 2. Pipeline - multinomial logistic regression parameters

Module              Parameter                                Possible values
Feature extraction  N-gram range                             Start = 1 to 3; End = 1 to 5
                    Min document frequency                   0.01, 0.05, 0.1, 0.5
                    Max document frequency                   0.25, 0.50, 0.90, 1.0
                    TF                                       normal, sublinear
                    IDF                                      normal, smoothed
                    Document normalisation                   L1, L2
Scaling             MaxAbsScaler                             -
Transformation      PCA (percentage of explained variance)   0.10, 0.25, 0.50, 0.75, 0.90, 0.99
Classifier          Logistic regression                      multinomial (softmax)

Optimal values were determined by grid search with 5-fold cross-validation over the ensemble. The values selected for subsequently training our actual models are shown in Table 3, in which Start/End denote the range of n-gram subsequences that were concatenated. For instance, Start=2 and End=5 represents the concatenation of the subsequences [(2, 2), (2, 3), ..., (4, 5), (5, 5)].

Table 3. Pipeline - multinomial logistic regression optimal values

Module              Parameter               Optimal values
Feature extraction  N-gram range            Std.charN: Start=2, End=5
                                            Dist.charN: Start=2, End=5
                                            Std.wordN: Start=1, End=3
                    Min document frequency  0.05
                    Max document frequency  1.0
                    TF                      sublinear
                    IDF                     smoothed
                    Document normalisation  L2
Transformation      PCA                     0.99

Tables 4, 5 and 6 show the ten most relevant features for AA Problem00002, which comprises texts written in English by five candidate authors. In this representation, blank spaces are encoded as underscore symbols, and relevance is represented by the weights of the multinomial logistic regression. These were estimated by scaling the features to zero mean and unit standard deviation.

Table 4. Most relevant features for Std.charN

candidate00001  candidate00002  candidate00003  candidate00004  candidate00005
_as_l           _Sti            _sub            _joi            _day,
_’              _"Can           _suc            _gh             _dev
_prec           _"Ca            _I_fi           _er             _dete
_I’d            _"Be            _succ           _glow           _plac
_"Are           _but            _subs           _Is             _mut
_Re             _but_           _I_f            _sta            _must
_smel           _Ofte           _"T             _gor            _Dro
_leak           _posi           _a_t            _sorr           _day_
_is_s           _For            _"St            _eat_           _she_
_spu            _Ri             _a_sw           _If_t           _chi

Table 5. Most relevant features for Dist.charN

candidate00001  candidate00002  candidate00003  candidate00004  candidate00005
*_‘**           _**_-           "*’             *_~_            *_–_*
_**-            _**_(           "*_**           *_~             ’*,_*
’               _**_*           !),_*           *_._*           "_~
*).             *!              *!!             ’***            *_–
*),_            _**_’           *’*_*           ’****           *_–_
_-_*            *!_*            **_*’           "_**’           ’*.
_-_             *_“**           **_**           _É***           _“*
_’**            _~_             **_*’           _“*’            _-
!),             _~_*            _**!            _**..           _-_
_’***           _“*’            *!_*_           *_**é           _"***

Being a language-independent approach, information regarding function words was not taken into account, although this might have been helpful, since function words usually play a rather prominent role in AA (as opposed to, e.g., content words, which may arguably be more relevant to other text categorisation tasks). We notice, however, that function words were made explicit by the Std.wordN model. Moreover, we notice that all models also made explicit, to some extent, a number of individual preferences regarding word usage, punctuation and spacing, and that Dist.charN provides some evidence of the role of punctuation marks, spacing and hyphenation.
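As an illustration of how rankings such as those in Tables 4-6 can be obtained, the sketch below standardizes the TF-IDF features and reads the largest per-class weights off a fitted multinomial logistic regression. The helper name is hypothetical, and we assume the PCA step is omitted for this analysis, since it would otherwise obscure the identity of the individual n-grams.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler


def top_features_per_author(texts, authors, k=10):
    """Return the k n-grams with the largest multinomial logistic
    regression weight for each candidate author, with features
    standardized to zero mean and unit standard deviation."""
    vec = TfidfVectorizer(analyzer="char", ngram_range=(2, 5),
                          sublinear_tf=True, min_df=0.05)
    X = StandardScaler().fit_transform(vec.fit_transform(texts).toarray())
    clf = LogisticRegression(multi_class="multinomial", max_iter=1000)
    clf.fit(X, authors)
    names = vec.get_feature_names_out()
    # clf.coef_ has one row of weights per author (class); the largest
    # weights point to the n-grams most indicative of that author.
    return {author: [names[i].replace(" ", "_")  # underscores, as in the tables
                     for i in np.argsort(weights)[::-1][:k]]
            for author, weights in zip(clf.classes_, clf.coef_)}
```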
Table 6. Most relevant features for Std.wordN

candidate00001   candidate00002  candidate00003  candidate00004  candidate00005
about_what       against_his     an_odd          although        and_pulled_him
and_practically  and_it_was      and_then_he     an_eye          and_pulling
any_of           and_so          acknowledged    and_said        across_his
any_more         and_already     and_he_had      and_takes       across_the
and_nearly       and_steve       are_your        and_just        and_all
and_pulled       and_say         again_to        ancient         against_her
agree            accent          and_tell        amount_of       among
all_tony         and_wet         and_forth       always          about_what_to
ah               apparently      are_just        and_grinned     acting
and_wet_and      after           and_grabbing    about_the       about_their

5 Results

Table 7 presents macro F-measure results for the original PAN-CLEF 2018 baseline system, our three individual classifiers and the Ensemble model over the ten PAN-CLEF 2018 authorship attribution tasks on the development data. To this end, the baseline was optimised using 4-grams, a minimum document frequency of 5 and a One-vs-Rest SVM multi-class strategy. Our models were optimised individually using the parameters described in Table 2, and output probabilities were combined using multinomial logistic regression in a soft-voting ensemble fashion. The best result for each task is the highest value in its row.

Table 7. Macro-F1 results for the PAN-CLEF 2018 AA development corpus

Problem  Language  Authors  Baseline  Std.charN  Dist.charN  Std.wordN  Ensemble
001      English   20       0.514     0.609      0.479       0.444      0.625
002      English   5        0.626     0.535      0.333       0.577      0.673
003      French    20       0.631     0.681      0.568       0.418      0.776
004      French    5        0.747     0.719      0.586       0.572      0.820
005      Italian   20       0.529     0.597      0.491       0.497      0.578
006      Italian   5        0.614     0.623      0.595       0.520      0.663
007      Polish    20       0.455     0.470      0.496       0.475      0.554
008      Polish    5        0.703     0.948      0.570       0.922      0.922
009      Spanish   20       0.709     0.774      0.589       0.616      0.701
010      Spanish   5        0.593     0.778      0.802       0.588      0.830
Mean                        0.612     0.673      0.551       0.563      0.714

From these results, a number of observations are warranted. First, we notice that Std.charN generally obtained the best results among the three individual classifiers. We also notice that Dist.charN performs worse than Std.charN. This was to be expected, since Dist.charN conveys less information; that is, it may be seen as a subset of Std.charN. Our soft-voting Ensemble model outperformed the alternatives in seven of the ten tasks and on average, indicating that combining the three knowledge sources obtains the best overall results. In all cases, the relevant features turned out to be of variable length, ranging from 1- to 5-grams.
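The soft-voting combination can be pictured as a simple stacking step: the class probabilities of the three fitted base models are concatenated and passed to a fourth multinomial logistic regression. The class below is a minimal sketch under that reading, not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


class SoftVotingEnsemble:
    """Combine base models (e.g., the Std.charN, Dist.charN and
    Std.wordN pipelines) with a fourth multinomial logistic regression
    trained on their concatenated class probabilities (cf. Figure 1)."""

    def __init__(self, base_models):
        self.base_models = base_models
        self.meta = LogisticRegression(multi_class="multinomial",
                                       max_iter=1000)

    def _probas(self, texts):
        # One (n_documents, n_authors) probability block per base model,
        # concatenated column-wise.
        return np.hstack([m.predict_proba(texts) for m in self.base_models])

    def fit(self, texts, authors):
        for m in self.base_models:
            m.fit(texts, authors)
        self.meta.fit(self._probas(texts), authors)
        return self

    def predict(self, texts):
        return self.meta.predict(self._probas(texts))
```

Training the meta classifier on the same texts as the base models can overfit; out-of-fold probabilities (as used by scikit-learn's StackingClassifier) would be a safer variant, though the paper does not specify how this was handled.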
6 Final remarks

This paper presented an ensemble approach to cross-domain authorship attribution that combines predictions made by a standard char n-gram model, a char n-gram model with non-diacritic distortion and a word n-gram model, using variable-length n-grams and multinomial logistic regression. Results generally outperform the PAN-CLEF 2018 baseline system, which makes use of fixed-length char n-grams and linear SVM classification. As future work, we intend to investigate alternative text models and distortion methods for prefixes, suffixes and other text components.

Acknowledgements. The second author received financial support from FAPESP grant no. 2016/14223-0.

References

1. Gollub, T., Potthast, M., Beyer, A., Busse, M., Rangel, F., Rosso, P., Stamatatos, E., Stein, B.: Recent trends in digital text forensics and its evaluation: Plagiarism detection, author identification, and author profiling. In: LNCS 8138, pp. 282–302 (2013)
2. Kestemont, M., Tschuggnall, M., Stamatatos, E., Daelemans, W., Specht, G., Stein, B., Potthast, M.: Overview of the author identification task at PAN-2018: Cross-domain authorship attribution and style change detection. In: Cappellato, L., Ferro, N., Nie, J.Y., Soulier, L. (eds.) Working Notes Papers of the CLEF 2018 Evaluation Labs. CEUR Workshop Proceedings, CLEF and CEUR-WS.org (Sep 2018)
3. Markov, I., Baptista, J., Pichardo-Lagunas, O.: Authorship attribution in Portuguese using character n-grams. Acta Polytechnica Hungarica 14(3), 59–78 (2017)
4. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011)
5. Potthast, M., Rangel, F., Tschuggnall, M., Stamatatos, E., Rosso, P., Stein, B.: Overview of PAN 17: Author identification, author profiling, and author obfuscation. In: LNCS 10456, pp. 275–290 (2017)
6. Rocha, A., Scheirer, W.J., Forstall, C.W., Cavalcante, T., Theophilo, A., Shen, B., Carvalho, A.R.B., Stamatatos, E.: Authorship attribution for social media forensics. IEEE Transactions on Information Forensics and Security 12(1), 5–33 (2017)
7. Rosso, P., Rangel, F., Potthast, M., Stamatatos, E., Tschuggnall, M., Stein, B.: Overview of PAN 16: New challenges for authorship analysis: Cross-genre profiling, clustering, diarization, and obfuscation. In: LNCS 9822, pp. 332–350 (2016)
8. Sapkota, U., Bethard, S., Montes-y-Gómez, M., Solorio, T.: Not all character n-grams are created equal: A study in authorship attribution. In: Proceedings of NAACL HLT 2015, pp. 93–102 (2015)
9. Stamatatos, E.: Authorship attribution using text distortion. In: Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics (EACL 2017). Association for Computational Linguistics, Valencia, Spain (2017)