Authorship Attribution through Punctuation n-grams and Averaged Combination of SVM
Notebook for PAN at CLEF 2019

Carolina Martín-del-Campo-Rodríguez, Daniel Alejandro Pérez Alvarez, Christian Efraín Maldonado Sifuentes, Grigori Sidorov, Ildar Batyrshin, and Alexander Gelbukh

Instituto Politécnico Nacional (IPN), Center for Computing Research (CIC), Mexico City, Mexico
cm.del.cr@gmail.com, daperezalvarez@gmail.com, chrismaldonado@gmail.com, sidorov@cic.ipn.mx, batyr1@gmail.com, gelbukh@gelbukh.com

Abstract. This work explores pre-processing, feature extraction, and the averaged combination of Support Vector Machine (SVM) outputs for the open-set Cross-Domain Authorship Attribution task. The use of punctuation n-grams as a feature representation of a document is introduced for Authorship Attribution, in combination with traditional character n-grams. Starting from different feature representations of a document, several SVMs are trained to model the probability of membership for each candidate author, and the outputs of all the SVMs are then averaged. This approach obtained a Macro F1-score of 0.642 in the PAN 2019 open-set Cross-Domain Authorship Attribution task.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2019, 9-12 September 2019, Lugano, Switzerland.

1 Introduction

The problem of authorship attribution in cross-domain conditions arises when documents of known authors that come from different writing domains (different genres or themes) are used to gather information that enables the classification of documents of unknown authorship from a list of possible candidates. When no candidate matches the style of an unattributed document, it is possible that the actual author was not included in the candidate list; this case is known as an open-set attribution problem.

The 2019 edition of PAN [2] focuses on open-set Cross-Domain Authorship Attribution in fanfiction. Fanfiction is literary work in which a fan seeks to imitate as closely as possible the writing style of an admired author; a fandom refers to the genre or original work of a certain writer. In this edition of PAN, a set of documents from known fans writing in several fandoms is provided, and the task is then to classify documents of unknown authorship written in a single fandom, where the true author may not belong to the set of known writers [4].

2 Related work

Among the main factors for improving Cross-Domain Authorship Attribution, the pre-processing stage has been identified as a key tool to increase the effectiveness of the classifiers, and it is therefore a principal concern in the development of this work. In 2017, Markov et al. [6] showed that the elimination of topic-dependent information from texts improves the performance of authorship attribution classifiers. By replacing digits, splitting off punctuation marks, and replacing named entities before the extraction of character n-grams, the rate of correct assignments rises for cross-domain authorship attribution. Besides these findings, the appropriate selection of the dimensionality of the character n-gram representation was also identified as crucial in pre-processing for the cross-domain task.
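The topic-masking pre-processing of [6] can be illustrated with a minimal Python sketch. The concrete replacement rules below (a single digit placeholder, whitespace-separated punctuation) are our own illustrative assumptions, not necessarily the exact rules of [6]; named-entity replacement is omitted since it requires an external NER tool:

    import re

    def topic_mask(text):
        # Replace every digit with a placeholder so numbers carry no topic information.
        text = re.sub(r'\d', '0', text)
        # Split punctuation marks off from adjacent words so they become separate tokens.
        text = re.sub(r'([^\w\s])', r' \1 ', text)
        # Normalize the whitespace introduced by the previous step.
        return re.sub(r'\s+', ' ', text).strip()

    print(topic_mask("She paid $25.50, then left!"))
    # -> "She paid $ 00 . 00 , then left !"

Character n-grams would then be extracted from the masked text.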
The authors of [1] included a character n-gram model of variable length in which non-diacritic characters were distorted, focusing on punctuation and other non-alphabetical symbols to represent the structure of the text. On the other hand, [5] experimented with a text representation based purely on punctuation n-grams for the task of native language identification. In [3], syntactic n-grams are proposed, that is, n-grams constructed in a non-linear manner. This type of n-gram allows syntactic information to be used in automatic text processing methods related to classification or clustering.

3 Method

3.1 Feature extraction

The following are the principal features that were considered for the development of our approach:

– Character n-grams.
– Pure-punctuation n-grams.
– Typed character n-grams.
– Bag of Words (BoW).

In what follows, we describe the pure-punctuation n-grams and the typed character n-grams. The overall procedure described in this section is summarized in Figure 1.

n-grams based on punctuation (pure-punctuation n-grams). The style of an author can be determined, to some extent, through the use of punctuation. We determined that, beyond the mere counts of punctuation marks, an important factor is the way the author uses them. Hence, we propose to extract n-grams based only on punctuation. We considered as punctuation all characters in the training corpus that are not letters, digits or spaces, plus the characters in the string.punctuation constant of Python 3 (https://docs.python.org/3/). All other characters were removed, yielding for each text a representation based only on punctuation. For the text in Table 1, the punctuation representation is “’.,,’.,–.”. Afterwards, we obtained the character n-grams of each new text representation.

Table 1: Fragment of text extracted from the training corpus

    ’t speak to anyone. I saw her at the funeral, and she said a few words, but
    that’s it. I went to see her afterwards, to pick up your stuff – they let me
    have it after the forensic team did their thing.

Typed character n-grams. Typed character n-grams were introduced in [8]; basically, they are subgroups of character n-grams. These subgroups are called super categories (SC), and each SC is divided into different categories:

– Affix n-grams: Consider morphosyntactic aspects. This SC captures morphology to some extent. It is divided into prefix, suffix, space-prefix and space-suffix.
– Word n-grams: Consider thematic content aspects. This SC captures partial words and other word-relevant tokens (whole-word, mid-word, multi-word).
– Punctuation n-grams: Consider style aspects. This SC captures patterns of punctuation (beg-punct, mid-punct, end-punct). To avoid confusion with the pure-punctuation n-grams proposed above, we refer to this SC as typed-punctuation n-grams.

The extracted features were filtered by document frequency (df): a term is ignored if its df is lower than a threshold th, i.e., if a term appears in fewer than th documents, it is not considered for the vectorization. For feature weighting, tf-idf was applied. The vectorization was done with the scikit-learn library (https://scikit-learn.org/stable/), using the TfidfVectorizer function, which performs the vectorization, the df-based filtering, and the feature weighting in a single step; a sketch of this pipeline is given below.
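The pure-punctuation pipeline can be sketched as follows. The n-gram size and the min_df value used here are illustrative only; the final configuration is the one reported in Table 2:

    import string
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Punctuation set: string.punctuation, extended in our setup with any
    # corpus-specific non-alphanumeric, non-space symbols (e.g. typographic
    # quotes and dashes, which the condition below also keeps).
    PUNCT = set(string.punctuation)

    def punctuation_only(text):
        # Keep a character only if it is punctuation, i.e. neither a letter,
        # a digit, nor whitespace.
        return ''.join(c for c in text
                       if c in PUNCT or not (c.isalnum() or c.isspace()))

    docs = ["'t speak to anyone. I saw her at the funeral, and she said a few "
            "words, but that's it."]
    print(punctuation_only(docs[0]))  # -> "'.,,'."

    # Character n-grams over the punctuation-only representation;
    # TfidfVectorizer performs the df filtering (min_df) and the tf-idf
    # weighting with Euclidean normalization in one step.
    vectorizer = TfidfVectorizer(analyzer='char', ngram_range=(3, 3), min_df=1)
    X = vectorizer.fit_transform([punctuation_only(d) for d in docs])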
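Typed character n-grams can be approximated with the following simplified sketch. The category assignment follows the spirit of [8], but the rules there are finer-grained (for instance, whole-word is not distinguished from prefix here):

    import string

    PUNCT = set(string.punctuation)

    def typed_char_ngrams(text, n=3):
        # Slide a window of size n over the space-padded text and tag each
        # n-gram with a positional category; punctuation categories take
        # precedence, then affix/word categories.
        grams = []
        padded = ' ' + text + ' '
        for i in range(len(padded) - n + 1):
            g = padded[i:i + n]
            if g[0] in PUNCT:
                cat = 'beg-punct'
            elif g[-1] in PUNCT:
                cat = 'end-punct'
            elif any(c in PUNCT for c in g[1:-1]):
                cat = 'mid-punct'
            elif g[0] == ' ':
                cat = 'space-prefix'
            elif g[-1] == ' ':
                cat = 'space-suffix'
            elif ' ' in g:
                cat = 'multi-word'
            elif padded[i - 1] == ' ':
                cat = 'prefix'   # [8] would further separate whole-word
            elif padded[i + n] == ' ':
                cat = 'suffix'
            else:
                cat = 'mid-word'
            grams.append((cat, g))
        return grams

Tagging each n-gram with its category makes the same character string in different positional roles count as distinct features.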
3.2 Evaluation of features with SVM

SVM was the algorithm selected to solve the open-set Authorship Attribution task. We used the same configuration as the baseline, applying CalibratedClassifierCV to obtain, per document, the probability of belonging to each class. For each of the five feature representations, we trained a different SVM and obtained a membership probability model (unknown document, class) for that representation.

3.3 Point-to-point average of the probability models

Given the probability models, a point-to-point average was computed (the averaged probability model); this is the idea behind the VotingClassifier with a soft-voting approach. Weighting of the probabilities was discarded to avoid possible overfitting.

With the averaged probability model, the following was applied to each unknown document: the class membership probabilities were sorted from highest to lowest and the difference (diff) between the two highest values was taken; if diff was smaller than a threshold, the document was considered not to have been written by any of the candidates; otherwise, the unknown document was assigned to the class with the highest probability.

Figure 1. Flow chart of the methodology applied to the open-set cross-fandom attribution task at PAN 2019. (The chart depicts the pipeline: feature-set extraction and weighting for the known-author documents, SVM training, evaluation of the unknown document, point-to-point averaging of the probability models, and the threshold decision that either assigns the most probable author or leaves the document unattributed.)
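Sections 3.2 and 3.3 together amount to the following decision procedure. This is a condensed sketch: the choice of LinearSVC as the base classifier and the threshold value are illustrative assumptions, since the text only states that the baseline SVM configuration was reused:

    import numpy as np
    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.svm import LinearSVC

    def attribute_open_set(train_matrices, y_train, test_matrices, th=0.1):
        # One calibrated SVM per feature representation; each yields a
        # probability model of shape (n_unknown_docs, n_candidate_authors).
        prob_models = []
        for X_train, X_test in zip(train_matrices, test_matrices):
            clf = CalibratedClassifierCV(LinearSVC())
            clf.fit(X_train, y_train)
            prob_models.append(clf.predict_proba(X_test))

        # Point-to-point (unweighted) average of all probability models,
        # as in soft voting.
        avg = np.mean(prob_models, axis=0)

        predictions = []
        for row in avg:
            ordered = np.sort(row)[::-1]
            # If the two best candidates are too close, reject: the document
            # is considered not to be written by any known candidate.
            if ordered[0] - ordered[1] < th:
                predictions.append('<UNK>')
            else:
                predictions.append(clf.classes_[np.argmax(row)])
        return predictions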
4 Experiments

For each feature type, different experiments were performed on the size of the n-grams: n was varied from 1 to 10. Concatenations of features (per type) were also tried, with ranges varying from 1 to 8. For each feature type, different values of the df threshold were considered: values from 1 to 5 were tested to determine, per feature type, which filter was the best.

The weighting of the features was done with tf-idf. Two different implementations were considered: TfidfModel from gensim (https://radimrehurek.com/gensim/) and TfidfVectorizer from scikit-learn (which applies a normalization, by the Euclidean norm, after the weighting). Considering that TfidfModel requires converting the data into a corpus type, the convenience of TfidfVectorizer for filtering and weighting, and preliminary tests, TfidfVectorizer was selected for the weighting. Table 2 shows the features considered in the final configuration of our approach.

Table 2: Final configuration of features. [n, m] denotes the concatenation of features in the range n to m

    Features                     n        df threshold
    character n-grams            [1, 6]   5
                                 6        2
                                 5        2
                                 4        5
                                 3        5
                                 2        1
    pure-punctuation n-grams     3        1
                                 4        1
                                 5        1
                                 [1, 5]   2
    typed character n-grams      1        4
    bag of words                 1        5
    typed-punctuation n-grams    1        5

5 Results

The Macro F1-score was the measure used for the evaluation. The results obtained with our approach on the development corpus are shown in Table 3. The system was executed on TIRA [7]. Table 4 shows the competition scores: our approach, delcamporodriguez19, obtained a Macro F1-score of 0.642. The best approach, proposed by muttenthaler19, reached a Macro F1-score of 0.690, i.e., 0.048 above ours.

Table 3: Results of our approach on the development corpus

    Problem        Language  Macro F1-score
    problem00001   English   0.800
    problem00002   English   0.549
    problem00003   English   0.649
    problem00004   English   0.551
    problem00005   English   0.572
    problem00006   French    0.732
    problem00007   French    0.680
    problem00008   French    0.629
    problem00009   French    0.732
    problem00010   French    0.732
    problem00011   Italian   0.752
    problem00012   Italian   0.644
    problem00013   Italian   0.800
    problem00014   Italian   0.729
    problem00015   Italian   0.785
    problem00016   Spanish   0.843
    problem00017   Spanish   0.703
    problem00018   Spanish   0.816
    problem00019   Spanish   0.667
    problem00020   Spanish   0.582
    Average                  0.697

Table 4: Results on the competition corpus

    Participant           Macro F1-score
    muttenthaler19        0.690
    neri19                0.680
    eleandrocustodio19    0.650
    devries19             0.644
    delcamporodriguez19   0.642
    isbister19            0.622
    johansson19           0.616
    basile19              0.613
    vanhalteren19         0.598
    rahgouy19             0.580
    gagala19              0.576
    kipnis19              0.259

6 Conclusions

The application of several feature representations and the inclusion of punctuation-based features contribute to improving the classification of authorship in open-set Cross-Domain Authorship Attribution. Besides the pre-processing benefits presented in this work, the probability models of several SVMs are combined, selecting the author of the fanfiction by averaging their outputs. This approach obtained a Macro F1-score of 0.642 in the PAN 2019 open-set Cross-Domain Authorship Attribution in fanfiction task.

References

1. Custódio, J.E., Paraboni, I.: EACH-USP ensemble cross-domain authorship attribution. Working Notes Papers of the CLEF (2018)
2. Daelemans, W., Kestemont, M., Manjavacas, E., Potthast, M., Rangel, F., Rosso, P., Specht, G., Stamatatos, E., Stein, B., Tschuggnall, M., Wiegmann, M., Zangerle, E.: Overview of PAN 2019: Author Profiling, Celebrity Profiling, Cross-domain Authorship Attribution and Style Change Detection. In: Crestani, F., Braschler, M., Savoy, J., Rauber, A., Müller, H., Losada, D., Heinatz, G., Cappellato, L., Ferro, N. (eds.) Proceedings of the Tenth International Conference of the CLEF Association (CLEF 2019). Springer (Sep 2019)
3. Sidorov, G.: Syntactic n-grams in Computational Linguistics. Springer (2019)
4. Kestemont, M., Stamatatos, E., Manjavacas, E., Daelemans, W., Potthast, M., Stein, B.: Overview of the Cross-domain Authorship Attribution Task at PAN 2019. In: Cappellato, L., Ferro, N., Losada, D., Müller, H. (eds.) CLEF 2019 Labs and Workshops, Notebook Papers. CEUR-WS.org (Sep 2019)
5. Markov, I.: Automatic Native Language Identification. Ph.D. thesis, Instituto Politécnico Nacional (2018)
6. Markov, I., Stamatatos, E., Sidorov, G.: Improving cross-topic authorship attribution: The role of pre-processing. In: Proceedings of the 18th International Conference on Computational Linguistics and Intelligent Text Processing. CICLing 2017, Springer (2017)
7. Potthast, M., Gollub, T., Wiegmann, M., Stein, B.: TIRA Integrated Research Architecture. In: Ferro, N., Peters, C. (eds.) Information Retrieval Evaluation in a Changing World - Lessons Learned from 20 Years of CLEF. Springer (2019)
8. Sapkota, U., Bethard, S., Montes-y-Gómez, M., Solorio, T.: Not all character n-grams are created equal: A study in authorship attribution.
In: Proceedings of the 2015 Annual Conference of the North American Chapter of the ACL: Human Language Technologies. pp. 93–102. NAACL-HLT’15, Association for Computational Linguistics (2015)