<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Authorship Attribution through Punctuation n-grams and Averaged Combination of SVM</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Carolina Martín-del-Campo-Rodríguez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniel Alejandro Pérez Alvarez</string-name>
          <email>daperezalvarez@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christian Efraín Maldonado Sifuentes</string-name>
          <email>chrismaldonado@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Grigori Sidorov</string-name>
          <email>sidorov@cic.ipn.mx</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ildar Batyrshin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexander Gelbukh</string-name>
          <email>gelbukh@gelbukh.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Instituto Politécnico Nacional (IPN), Center for Computing Research (CIC)</institution>
          ,
          <addr-line>Mexico City</addr-line>
          ,
          <country country="MX">Mexico</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <abstract>
<p>This work explores pre-processing, feature extraction, and the averaged combination of Support Vector Machine (SVM) outputs for the open-set Cross-Domain Authorship Attribution task. The use of punctuation n-grams as a feature representation of a document is introduced for Authorship Attribution, in combination with traditional character n-grams. Starting from different feature representations of a document, several SVMs are trained to estimate the probability of membership for each candidate author, and an average of all the SVM outputs is then computed. This approach obtained a Macro F1-score of 0.642 in the PAN 2019 open-set Cross-Domain Authorship Attribution task.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
<p>The problem of authorship attribution in cross-domain conditions arises when
documents of known authors coming from different writing domains (different genres
or themes) are used to gather the information that enables the classification of documents
of unknown authorship from a list of possible candidates. When no candidate
matches the style of an unattributed document, it is possible that the actual author was
not included in the candidate list; this case is known as an open-set attribution
problem.</p>
      <p>
        The 2019 edition of PAN [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] focuses on open-set Cross-Domain Authorship
Attribution in fanfiction. Fanfiction is a literary work in which a fan seeks to imitate as
closely as possible the writing style of an admired author; a fandom refers to
the genre or original work of a certain writer. In this edition of PAN, a set of
documents from known fans writing in several fandoms is provided, and the task is
to classify documents from unknown authors writing in a single fandom, where it is
possible that the author of a document is not part of the set of known writers
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>Related work</title>
      <p>
        Among the main factors for improving Cross-Domain Authorship
Attribution, the pre-processing stage has been identified as a key tool for increasing the
effectiveness of classifiers, and it is therefore a principal concern in the development
of this work. In 2017, Markov et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] showed that eliminating topic-dependent
information from texts improves the performance of authorship attribution
classifiers. Replacing digits, splitting punctuation marks, and replacing
named entities before the extraction of character n-grams raised the score of correct
assignments for cross-domain authorship attribution. Besides these findings, they also
identified the appropriate selection of the dimensionality of the character n-gram
representation as a crucial pre-processing choice for the cross-domain task.
      </p>
      <p>
        The authors of [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] included a character n-gram model of variable length in which
the text was distorted, keeping punctuation and other non-alphabetical
symbols to represent the structure of the text. On the other hand, [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] experimented with
a text representation based purely on punctuation n-grams for the task of native language
identification. In [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], syntactic n-grams are proposed, that is, n-grams constructed in a non-linear
manner. This type of n-gram allows using syntactic information in automatic text
processing methods related to classification or clustering.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Method</title>
      <sec id="sec-3-1">
        <title>Features extraction</title>
        <p>The following are the principal features that were considered for the development of
our approach:</p>
        <p>– Character n-grams.
– Pure-punctuation n-grams.
– Typed character n-grams.
– Bag of Words (BoW).</p>
        <p>In what follows, we describe the pure-punctuation n-grams and the typed character
n-grams. The overall procedure described in this section is summarized in Figure 1.</p>
        <p>n-grams based on punctuation (pure-punctuation n-grams) The style of an author
can be determined, to some extent, through the use of punctuation. We consider that,
beyond the frequency of punctuation, an important factor is the way the author uses it.
So, we propose to extract n-grams based only on punctuation. We considered as punctuation
all characters in the training corpus that are not letters, numbers, or spaces,
plus the characters in the string.punctuation constant of Python 3. All other
characters were removed, obtaining for each text a representation based only
on punctuation.</p>
        <p>So, considering the text in (1), the punctuation representation is: ’.„’.,–.. We then
obtained the character n-grams for each new text representation.</p>
        <p>
          Typed character n-grams In [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], typed character n-grams were introduced;
basically, these are subgroups of character n-grams. These subgroups are called super
categories (SC). Each SC is divided into different categories:
– Affix n-grams: consider morphosyntactic aspects. This SC captures morphology to
some extent. It is divided into prefix, suffix, space-prefix, and space-suffix.
– Word n-grams: consider thematic content aspects. This SC captures partial words
and other word-relevant tokens (whole-word, mid-word, multi-word).
– Punctuation n-grams (typed-punctuation n-grams): consider style aspects. This
SC captures patterns of punctuation (beg-punct, mid-punct, end-punct).
        </p>
        <p>The features obtained were filtered by document frequency (df): a term
is ignored if its df is lower than a threshold (th), that is, if a term appears in
fewer than th documents, it is not considered for the vectorization. For
feature weighting, tf-idf was applied. The vectorization was done with the
scikit-learn library, using the TfidfVectorizer class. This class allows us to perform
the vectorization, the df-based filtering, and the feature weighting at the same time.</p>
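<p>The combined filtering and weighting step can be sketched with TfidfVectorizer as follows; the concrete min_df, ngram_range, and toy documents are illustrative, not the values tuned in the paper.</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Character n-grams, tf-idf weighting, and df filtering in one step:
# min_df=2 (an illustrative threshold) drops any n-gram that appears in
# fewer than 2 documents before the tf-idf weights are computed.
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(2, 3), min_df=2)
docs = ["the cat sat.", "the dog sat.", "a bird flew!"]
X = vectorizer.fit_transform(docs)
print(X.shape)  # (3 documents, number of surviving n-grams)
```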
      </sec>
      <sec id="sec-3-2">
        <title>Evaluation of features with SVM</title>
        <p>SVM was the algorithm selected to solve the open-set Authorship
Attribution task. We used the same configuration as the baseline, applying
CalibratedClassifierCV to obtain the class-membership probabilities per document. For each feature
representation (5 different representations) we trained a different SVM and obtained a
probability model (unknown document, class) for each representation.</p>
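<p>A minimal sketch of this training step, with random data standing in for the real feature matrices (the sizes and the cv value are illustrative, not the baseline's exact settings):</p>

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X_train = rng.rand(40, 10)        # 40 known documents, 10 features
y_train = rng.randint(0, 4, 40)   # 4 candidate authors

# Wrap a linear SVM so that predict_proba returns calibrated
# class-membership probabilities instead of raw decision scores.
clf = CalibratedClassifierCV(LinearSVC(), cv=3)
clf.fit(X_train, y_train)

proba = clf.predict_proba(rng.rand(5, 10))  # 5 unknown documents
print(proba.shape)  # one probability per (document, candidate author)
```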
        <p>Note: to avoid confusion with the proposed pure-punctuation n-grams, we refer to the
punctuation SC n-grams as typed-punctuation n-grams. The punctuation constant comes from
Python 3 (https://docs.python.org/3/) and the vectorization from scikit-learn
(https://scikit-learn.org/stable/).</p>
      </sec>
      <sec id="sec-3-3">
        <title>Point-to-point average of the probability models</title>
        <p>Given the probability models, a point-to-point average was computed (the averaged
probability model); this is the idea behind a VotingClassifier with a soft voting approach.
Weighting of the probabilities was discarded to avoid possible overfitting.</p>
        <p>With the averaged probability model, the following was done for each unknown
document: the class-membership probabilities were sorted from highest to
lowest and the difference (diff) of the two highest values was taken; if diff was smaller than a
threshold, it was considered that the document was not written by any of the candidates;
otherwise, the unknown document was assigned to the class with the highest probability.</p>
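<p>The averaging and threshold decision can be sketched as follows. The function name and the threshold value 0.1 are illustrative, not the paper's tuned settings, and the unknown label is written UNK here.</p>

```python
import numpy as np

def attribute(prob_models, threshold=0.1):
    """Point-to-point (element-wise) average of the per-representation
    probability vectors for one document, followed by the threshold rule:
    if the two highest averaged probabilities are too close, the document
    is attributed to no candidate (the unknown-author label)."""
    avg = np.mean(prob_models, axis=0)   # soft-voting average
    top_two = np.sort(avg)[::-1][:2]     # two highest probabilities
    if top_two[0] - top_two[1] > threshold:
        return int(np.argmax(avg))       # index of the predicted author
    return "UNK"                         # no candidate matches
```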
        <p>[Figure 1: Flowchart of the method. Feature sets are extracted and weighted for all
known-author documents and used to train the SVMs. For an unknown document, its feature
sets are extracted and evaluated with the SVMs, the probability models are averaged
point-to-point, and the difference between the two authors with the highest average
probability is calculated. If this difference is not higher than the threshold, the
&lt;UNK&gt; (unknown author) label is assigned to the document; otherwise, the author
with the highest probability is assigned.]</p>
        <p>
For each feature type, different experiments were made regarding the size of the n-grams:
n was varied from 1 to 10. Also, concatenations of the features (per
type) were made, with variations from 1 to 8.</p>
        <p>For each feature type, different values of df were considered; variations from
1 to 5 were made to determine, per feature type, which was the best filter to use.</p>
        <p>
          The weighting of the features was done with tf-idf. Two different implementations were
considered: TfidfModel from gensim (https://radimrehurek.com/gensim/) and
TfidfVectorizer from scikit-learn (which applies a normalization, by the Euclidean norm,
after the weighting). Considering that TfidfModel requires converting
the data into its corpus type, the convenience of TfidfVectorizer for filtering and
weighting, and preliminary tests, TfidfVectorizer was selected for the weighting.
The Macro F1-score was the measure used for the evaluation. The results obtained with
our approach on the development corpus are shown in Table 3. The system was executed
on TIRA [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
        <sec id="sec-3-3-1">
          <title>Conclusions</title>
          <p>The application of several feature representations, and the inclusion of features based
on punctuation, is a factor in improving the classification of authorship
in open-set Cross-Domain Authorship Attribution. Besides the pre-processing
benefits presented in this work, the probability models of several SVMs are averaged
to select the author of a fanfiction from the combined outputs. This approach
obtained a Macro F1-score of 0.642 in the PAN 2019 open-set
Cross-Domain Authorship Attribution in fanfiction task.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Custódio</surname>
            ,
            <given-names>J.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paraboni</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Each-usp ensemble cross-domain authorship attribution</article-title>
          .
          <source>Working Notes Papers of the CLEF</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Daelemans</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kestemont</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manjavancas</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Specht</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tschuggnall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wiegmann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zangerle</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          : Overview of PAN 2019:
          <article-title>Author Profiling, Celebrity Profiling, Cross-domain Authorship Attribution and Style Change Detection</article-title>
          . In: Crestani,
          <string-name>
            <given-names>F.</given-names>
            ,
            <surname>Braschler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Savoy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Rauber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            ,
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            ,
            <surname>Heinatz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            ,
            <surname>Cappellato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            ,
            <surname>Ferro</surname>
          </string-name>
          , N. (eds.)
          <source>Proceedings of the Tenth International Conference of the CLEF Association (CLEF</source>
          <year>2019</year>
          ). Springer (Sep
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Sidorov</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Syntactic n-grams in computational linguistics</article-title>
          . Springer (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Kestemont</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manjavacas</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daelemans</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Overview of the Cross-domain Authorship Attribution Task at PAN 2019</article-title>
          . In: Cappellato,
          <string-name>
            <given-names>L.</given-names>
            ,
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            ,
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            ,
            <surname>Müller</surname>
          </string-name>
          , H. (eds.)
          <article-title>CLEF 2019 Labs and Workshops, Notebook Papers</article-title>
          .
          <source>CEUR-WS.org (Sep</source>
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Markov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Automatic Native Language Identification</article-title>
          .
          <source>Ph.D. thesis</source>
          , Instituto Politecnico Nacional (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Markov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sidorov</surname>
          </string-name>
          , G.:
          <article-title>Improving cross-topic authorship attribution: The role of pre-processing</article-title>
          .
          <source>In: Proceedings of the 18th International Conference on Computational Linguistics and Intelligent Text Processing. CICLing 2017</source>
          , Springer (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gollub</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wiegmann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>TIRA Integrated Research Architecture</article-title>
          . In: Ferro,
          <string-name>
            <given-names>N.</given-names>
            ,
            <surname>Peters</surname>
          </string-name>
          ,
          <string-name>
            <surname>C</surname>
          </string-name>
          . (eds.)
          <article-title>Information Retrieval Evaluation in a Changing World - Lessons Learned from 20 Years of</article-title>
          CLEF. Springer (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Sapkota</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bethard</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montes-y Gómez</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Solorio</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Not all character n-grams are created equal: A study in authorship attribution</article-title>
          .
          <source>In: Proceedings of the 2015 Annual Conference of the North American Chapter of the ACL: Human Language Technologies</source>
          . pp.
          <fpage>93</fpage>
          -
          <lpage>102</lpage>
          . NAACL-HLT'
          <fpage>15</fpage>
          ,
          <string-name>
            <surname>Association for Computational Linguistics</surname>
          </string-name>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>