<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CIC-GIL Approach to Cross-domain Authorship Attribution</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Carolina Martín-del-Campo-Rodríguez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Helena Gómez-Adorno</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Grigori Sidorov</string-name>
          <email>sidorov@cic.ipn.mx</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ildar Batyrshin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Instituto Politécnico Nacional (IPN), Center for Computing Research (CIC)</institution>
          ,
          <addr-line>Mexico City</addr-line>
          ,
          <country country="MX">Mexico</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universidad Nacional Autónoma de México (UNAM), Engineering Institute (II)</institution>
          ,
          <addr-line>Mexico City</addr-line>
          ,
          <country country="MX">Mexico</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <abstract>
        <p>We present the CIC-GIL approach to the cross-domain authorship attribution task at PAN 2018. This year's evaluation lab focuses on the closed-set attribution task applied to a fanfiction corpus in five languages: English, French, Italian, Polish, and Spanish. We followed a traditional machine learning approach and selected different feature sets depending on the language. We evaluated document features such as typed and untyped character n-grams, word n-grams, and function word n-grams. Our final system uses the log-entropy weighting scheme and an SVM classifier.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The authorship attribution (AA) task consists of identifying the author of a given
document among a list of candidates. There are several subtasks within the authorship
attribution field, such as author identification [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], author obfuscation [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and author
profiling [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. AA methods are used in many practical applications, such as electronic
commerce, forensics, and humanities research [
        <xref ref-type="bibr" rid="ref2 ref5">2,5</xref>
        ]. The AA task
is viewed as a multi-class, single-label classification problem, i.e., an automatic method
has to assign a single class label (the author) to each document of unknown authorship.
      </p>
      <p>
        Character n-grams are considered among the best feature representations for
authorship attribution problems [16]. In [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], the authors introduced a categorization of
character n-grams and showed that some categories have better performance than others
in an AA task. Furthermore, several studies indicate that the combination of different
types of n-grams introduces useful information to the classification algorithm,
providing a robust model [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>
        This paper describes our approach to the cross-domain authorship attribution task at
PAN 2018 [
        <xref ref-type="bibr" rid="ref4">4,17</xref>
        ]. We examined different document features (typed and untyped
character n-grams, word n-grams, and function word n-grams), weighting schemes (tf-idf and
log-entropy), and machine learning algorithms (support vector machines, multinomial
naive Bayes, and multi-layer perceptron).
      </p>
    </sec>
    <sec id="sec-2">
      <title>Corpus for Development Phase</title>
      <p>The corpus of the authorship attribution shared task at PAN 2018 focuses on
cross-domain attribution. This setting is more challenging than the classical single-topic
AA setting because the training and testing documents can belong to different domains (e.g.,
thematic area, genre). The documents in the corpus are fanfics, i.e., fictional literature
based on the theme, atmosphere, style, characters, story world, etc. of a certain known
author.</p>
      <p>The development phase corpus (DPC), like the test phase corpus (TPC), is composed
of a training set and a test set. Although the candidate authors of the DPC and the TPC
have similar characteristics, they do not overlap.</p>
      <p>
        The development phase corpus is composed of 10 problems divided among five
languages (two problems per language): English, French, Italian, Polish, and Spanish.
The specifications of the problems are defined in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>Methodology</title>
      <p>In this section, we first cover the concept of typed character n-grams, then the
log-entropy weighting scheme, and finally the experimental settings of the methodology.</p>
      <sec id="sec-3-1">
        <title>Typed character n-grams</title>
        <p>
          Typed character n-grams, introduced by [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], are subgroups of character n-grams that
correspond to three distinct linguistic aspects: morphosyntax (represented by affix
n-grams), thematic content (represented by word n-grams), and style (represented by
punctuation n-grams). These subgroups are called super categories (SC). Each SC
is divided into categories:
– Affix n-grams: Capture morphology to some extent (prefix, suffix, space-prefix,
space-suffix).
– Word n-grams: Capture partial words and other word-relevant tokens
(whole-word, mid-word, multi-word).
– Punctuation n-grams: Capture patterns of punctuation (beg-punct, mid-punct,
end-punct).
        </p>
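        <p>The categories listed above can be illustrated with a minimal sketch. It is not the implementation used in this work: the rule for each category follows a common reading of the category names, and the order in which the rules are tested (punctuation first, then spaces, then word position) is an assumption made here so that every n-gram receives exactly one label. The names typed_char_ngrams and is_boundary are hypothetical.</p>
        <preformat>
```python
import string

PUNCTUATION = set(string.punctuation)


def is_boundary(ch):
    """A word boundary is whitespace or punctuation (assumed definition)."""
    return ch.isspace() or ch in PUNCTUATION


def typed_char_ngrams(text, n=3):
    """Return (category, n-gram) pairs for every character n-gram in text."""
    typed = []
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        # Characters just outside the n-gram; text edges count as boundaries.
        before = text[i - 1] if i > 0 else " "
        after = text[i + n] if i + n != len(text) else " "
        # Assumed precedence: punctuation rules, then space rules, then word position.
        if any(ch in PUNCTUATION for ch in gram[1:-1]):
            category = "mid-punct"
        elif gram[0] in PUNCTUATION:
            category = "beg-punct"
        elif gram[-1] in PUNCTUATION:
            category = "end-punct"
        elif gram[0].isspace():
            category = "space-prefix"
        elif gram[-1].isspace():
            category = "space-suffix"
        elif any(ch.isspace() for ch in gram):
            category = "multi-word"
        elif is_boundary(before) and is_boundary(after):
            category = "whole-word"
        elif is_boundary(before):
            category = "prefix"
        elif is_boundary(after):
            category = "suffix"
        else:
            category = "mid-word"
        typed.append((category, gram))
    return typed
```
        </preformat>
        <p>Under these assumed rules, typed_char_ngrams("hello") labels "hel" as prefix, "ell" as mid-word, and "llo" as suffix. Restricting a feature set to selected categories then amounts to filtering the returned pairs by their label.</p>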
        <p>
          Some categories of character n-grams showed higher predictive capabilities in the
AA task [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] than using all possible n-grams (categorized and uncategorized). The
redefinition of these categories stated by [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] unambiguously assigns each 3-gram to exactly
one category and does not exclude any n-gram (as happened with consecutive punctuation
marks in the original proposal). Also, the authors showed that some categories perform
better than others for AA.
        </p>
        <p>
Global weighting functions measure the importance of a term across the entire
collection of documents [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. Previous research on document similarity judgments [
          <xref ref-type="bibr" rid="ref6 ref9">6,9</xref>
          ] has
shown that entropy-based global weighting is generally better than the TF-IDF model.
The log-entropy (le) weight is calculated with Equations 1 and 2:
        </p>
        <p>
          <disp-formula id="eq-1"><label>(1)</label><tex-math>le_{ij} = e_i \times \log(tf_{ij} + 1)</tex-math></disp-formula>
        </p>
        <p>
          <disp-formula id="eq-2"><label>(2)</label><tex-math>e_i = 1 + \sum_j \frac{p_{ij} \log p_{ij}}{\log n}, \quad \text{where } p_{ij} = \frac{tf_{ij}}{gf_i}</tex-math></disp-formula>
        </p>
        <p>where n is the number of documents, tf<sub>ij</sub> is the frequency of term i in document j,
and gf<sub>i</sub> is the frequency of term i in the whole collection. A term that appears once in
every document will have a global weight e<sub>i</sub> of zero. A term that appears once in one document
will have a global weight of one. Any other combination of frequencies will assign a given
term a global weight between zero and one.</p>
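        <p>A minimal sketch of the log-entropy weighting may make the definition concrete. It is written in plain Python and is not the code used in our system. The function name log_entropy_weights and the list-of-lists input layout are hypothetical, and the base-2 logarithm for the local term is an assumption: the text does not state the base, and base 2 is chosen here only because it makes a term that occurs once in a single document receive a weight of exactly one.</p>
        <preformat>
```python
import math


def log_entropy_weights(tf):
    """Compute log-entropy weights.

    tf[j][i] is the frequency of term i in document j.
    Returns (le, e): le[j][i] is the weighted value and e[i] the global weight.
    """
    n_docs = len(tf)
    if n_docs in (0, 1):
        raise ValueError("at least two documents are needed (log n must be non-zero)")
    n_terms = len(tf[0])
    # gf[i]: frequency of term i in the whole collection.
    gf = [sum(doc[i] for doc in tf) for i in range(n_terms)]
    e = []
    for i in range(n_terms):
        entropy_sum = 0.0
        for doc in tf:
            if doc[i] > 0:  # terms with zero frequency contribute nothing
                p = doc[i] / gf[i]
                entropy_sum += p * math.log(p)
        # The ratio of two logarithms does not depend on the base.
        e.append(1.0 + entropy_sum / math.log(n_docs))
    # Local weight: log(tf + 1), with base 2 as an assumption.
    le = [[e[i] * math.log2(doc[i] + 1) for i in range(n_terms)] for doc in tf]
    return le, e
```
        </preformat>
        <p>For a collection of three documents with term frequencies [[1, 1, 0], [1, 0, 0], [1, 0, 3]], the first term occurs once in every document and receives a global weight of zero (up to floating-point rounding), while the second term occurs once in one document and receives a global weight of one.</p>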
      </sec>
      <sec id="sec-3-2">
        <title>Experimental Settings</title>
        <p>
          After evaluating several classification algorithms, we chose a
Support Vector Machine (SVM) for our final approach, since this algorithm is recommended when the number
of dimensions is greater than the number of samples (as in this case) [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. We used the
SVM implementation of scikit-learn [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], with the one-against-all strategy and the default
parameter settings.
        </p>
        <p>
          We analyzed several text representation schemes: typed character n-grams (with
n varying from 2 to 8), untyped character n-grams (with n between 3 and 4), word
n-grams (with n varying from 1 to 5) and function word n-grams proposed by
Stamatatos [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
        </p>
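        <p>The untyped character n-gram and word n-gram representations can be sketched as follows. This is an illustration only: the function names are hypothetical, and splitting on whitespace is an assumed tokenization that the text does not specify.</p>
        <preformat>
```python
def char_ngrams(text, n_min, n_max):
    """All character n-grams of text for n_min to n_max (inclusive)."""
    return [
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    ]


def word_ngrams(text, n_min, n_max):
    """All word n-grams of text; tokens come from whitespace splitting (assumed)."""
    tokens = text.split()
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]
```
        </preformat>
        <p>With the ranges given above, the untyped character features would correspond to char_ngrams(text, 3, 4) and the word features to word_ngrams(text, 1, 5).</p>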
        <p>
          We implemented the character n-gram types introduced by Sapkota et al. [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], but
with the redefinitions of Markov et al. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], which make them more accurate and
complete.
        </p>
        <p>
          For function word n-grams we used the 50 most frequent stop-words, as described
in [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], to form the n-grams (with a value of n equal to 8). For English, the 50 most
frequent stop-words mentioned in [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] were used. For the other languages (French,
Italian, Polish, and Spanish), the 50 most frequent stop-words were extracted from the
training set of the development corpus.
        </p>
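        <p>A short sketch of function word n-grams follows. It keeps only the stop-word occurrences of a text, in their original order, and forms n-grams over that reduced sequence. The tokenizer, the function names, and the helper that takes the k most frequent tokens of a collection as its stop-word list are assumptions made for illustration, not a description of our exact procedure.</p>
        <preformat>
```python
import re
from collections import Counter


def tokenize(text):
    """Lowercased word tokens (assumed tokenization)."""
    return re.findall(r"\w+", text.lower())


def most_frequent_words(documents, k=50):
    """The k most frequent tokens in a collection (assumed stop-word proxy)."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return [word for word, _ in counts.most_common(k)]


def function_word_ngrams(text, stop_words, n=8):
    """N-grams over the sequence of stop-word occurrences in text."""
    stop_words = set(stop_words)
    sequence = [tok for tok in tokenize(text) if tok in stop_words]
    return [" ".join(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]
```
        </preformat>
        <p>A text that contains fewer than n stop-word occurrences yields no features, which is worth keeping in mind for short documents when n = 8.</p>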
        <p>We evaluated different combinations of features for the different languages in the
corpus. We also performed an evaluation study in order to identify the most useful
typed character n-gram categories for each language. Table 1 shows the combination of
features as well as the types of character n-grams used in our final submission.</p>
        <p>Moreover, we experimented with different feature document frequency thresholds.
We considered thresholds between 1 and 3, i.e. features that occur in at least 1, 2,
or 3 documents in each problem. We found that the features that occur in at least 2
documents achieved the best classification performance in our experiments.</p>
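        <p>The document frequency threshold can be sketched as a simple filter. The function name and the representation of each document as a list of feature strings are hypothetical choices for this example.</p>
        <preformat>
```python
from collections import Counter


def filter_by_document_frequency(docs_features, min_df=2):
    """Keep only features that occur in at least min_df documents."""
    document_frequency = Counter()
    for features in docs_features:
        # Count each feature once per document, however often it repeats.
        document_frequency.update(set(features))
    kept = {f for f, df in document_frequency.items() if df >= min_df}
    return [[f for f in features if f in kept] for features in docs_features]
```
        </preformat>
        <p>With min_df=1 the filter keeps every feature, so the three thresholds we compared correspond to min_df values of 1, 2, and 3.</p>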
        <p>
          Following the experimental settings presented in [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], we examined two feature
representations based on a global weighting scheme: log-entropy and tf-idf. Global
weighting functions measure the importance of a word across the entire collection of
documents. Previous research on document similarity judgments [
          <xref ref-type="bibr" rid="ref6 ref9">6,9</xref>
          ] and authorship
attribution [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] has shown that entropy-based global weighting is generally better than the
tf-idf model. We use log-entropy as the weighting function for our final version.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Evaluation Measure and Results</title>
      <p>
        The macro-averaged F1 score is used for evaluating the performance of the systems
participating in the authorship attribution shared task at PAN CLEF 2018 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
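      <p>For reference, the macro-averaged F1 score can be computed as in the sketch below: an F1 score is calculated for each candidate author and the scores are averaged with equal weight. The function name is hypothetical, and taking the label set as the union of true and predicted labels is an assumption; the official evaluation script may treat labels differently.</p>
      <preformat>
```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        denominator = 2 * tp + fp + fn
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)
```
      </preformat>
      <p>Because every author contributes equally to the average, a system that performs poorly on a few candidate authors is penalized even if most documents are attributed correctly.</p>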
      <p>
        The final configuration of our approach was selected based on the classification
performance on the test set of the development phase corpus (DPC). Table 2 shows the
results obtained on the DPC with the above-specified configuration evaluated on the
TIRA platform [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>The results achieved on the test phase corpus (TPC) are shown in Table 3. It can be
observed that the performance on the TPC is much lower than on the DPC. This
behavior can be explained by our decision to tune our system based on the classification
performance over the test set of the DPC.</p>
      <p>We presented the system that was submitted to the Cross-domain Authorship
Attribution task at PAN 2018. Our experiments were performed using different features, finding
that a specific set of features per language is the best approach to improve performance.</p>
      <p>Our approach performed well on the development phase corpus
(macro-averaged F1: 0.747), but its performance dropped severely on the test phase
corpus (macro-averaged F1: 0.588). The current technique therefore leaves
room for further improvement.</p>
      <p>In future research, we would like to consider a cross-validation approach for the
development phase corpus to make the system more robust.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work was partially supported by the Mexican Government (CONACYT projects
240844, SNI, COFAA-IPN, SIP-IPN 20181849, 20171813) and Honeywell Grant.</p>
      <p>16. Stamatatos, E.: A survey of modern authorship attribution methods. Journal of the
American Society for Information Science and Technology 60(3), 538–556 (2009)</p>
      <p>17. Stamatatos, E., Rangel, F., Tschuggnall, M., Kestemont, M., Rosso, P., Stein, B., Potthast,
M.: Overview of PAN-2018: Author Identification, Author Profiling, and Author
Obfuscation. In: Bellot, P., Trabelsi, C., Mothe, J., Murtagh, F., Nie, J., Soulier, L., Sanjuan,
E., Cappellato, L., Ferro, N. (eds.) Experimental IR Meets Multilinguality, Multimodality,
and Interaction. 9th International Conference of the CLEF Initiative (CLEF 18). Springer,
Berlin Heidelberg New York (Sep 2018)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Buitinck</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Louppe</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blondel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pedregosa</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mueller</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grisel</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Niculae</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prettenhofer</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gramfort</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grobler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Layton</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , VanderPlas, J.,
          <string-name>
            <surname>Joly</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Holt</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Varoquaux</surname>
          </string-name>
          , G.:
          <article-title>API design for machine learning software: experiences from the scikit-learn project</article-title>
          .
          <source>In: ECML PKDD Workshop: Languages for Data Mining and Machine Learning</source>
          . pp.
          <fpage>108</fpage>
          -
          <lpage>122</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Coulthard</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>On admissible linguistic evidence</article-title>
          .
          <source>Journal of Law &amp; Policy</source>
          <volume>21</volume>
          ,
          <issue>441</issue>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Gómez-Adorno</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aleman</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vilariño</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sanchez-Perez</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pinto</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sidorov</surname>
          </string-name>
          , G.:
          <article-title>Author clustering using hierarchical clustering analysis</article-title>
          .
          <source>In: CLEF 2017 Working Notes. CEUR Workshop Proceedings</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Kestemont</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tschuggnall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daelemans</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Specht</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Overview of the Author Identification Task at PAN-2018: Cross-domain Authorship Attribution and Style Change Detection</article-title>
          . In: Cappellato,
          <string-name>
            <given-names>L.</given-names>
            ,
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            ,
            <surname>Nie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.Y.</given-names>
            ,
            <surname>Soulier</surname>
          </string-name>
          ,
          <string-name>
            <surname>L</surname>
          </string-name>
          . (eds.)
          <article-title>Working Notes Papers of the CLEF 2018 Evaluation Labs</article-title>
          .
          <source>CEUR Workshop Proceedings, CLEF and CEUR-WS.org (Sep</source>
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Koppel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Seidman</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Automatically identifying pseudepigraphic texts</article-title>
          .
          <source>In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing</source>
          . pp.
          <fpage>1449</fpage>
          -
          <lpage>1454</lpage>
          . EMNLP '
          <volume>13</volume>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>M.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Navarro</surname>
            ,
            <given-names>D.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nikkerud</surname>
            ,
            <given-names>H.:</given-names>
          </string-name>
          <article-title>An empirical evaluation of models of text document similarity</article-title>
          .
          <source>In: Proceedings of the Cognitive Science Society</source>
          . vol.
          <volume>27</volume>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Markov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sidorov</surname>
          </string-name>
          , G.:
          <article-title>Improving cross-topic authorship attribution: The role of pre-processing</article-title>
          .
          <source>In: Proceedings of the 18th International Conference on Computational Linguistics and Intelligent Text Processing. CICLing 2017</source>
          , Springer (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Pedregosa</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Varoquaux</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gramfort</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Michel</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thirion</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grisel</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blondel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prettenhofer</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weiss</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dubourg</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vanderplas</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Passos</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cournapeau</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brucher</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perrot</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duchesnay</surname>
          </string-name>
          , E.:
          <article-title>Scikit-learn: Machine learning in Python</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          ,
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Pincombe</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Comparison of human and latent semantic analysis (lsa) judgements of pairwise document similarities for a news corpus</article-title>
          .
          <source>Tech. rep., DEFENCE SCIENCE AND TECHNOLOGY ORGANISATION SALISBURY (AUSTRALIA) INFO SCIENCES LAB</source>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gollub</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Improving the Reproducibility of PAN's Shared Tasks: Plagiarism Detection, Author Identification, and Author Profiling</article-title>
          . In: Kanoulas,
          <string-name>
            <given-names>E.</given-names>
            ,
            <surname>Lupu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Clough</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Sanderson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Hall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Hanbury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Toms</surname>
          </string-name>
          , E. (eds.)
          <article-title>Information Access Evaluation meets Multilinguality, Multimodality, and Visualization</article-title>
          .
          <source>5th International Conference of the CLEF Initiative (CLEF 14)</source>
          . pp.
          <fpage>268</fpage>
          -
          <lpage>299</lpage>
          . Springer, Berlin Heidelberg New York (
          <year>Sep 2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hagen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schremmer</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Overview of the Author Obfuscation Task at PAN 2018: A New Approach to Measuring Safety</article-title>
          . In: Cappellato,
          <string-name>
            <given-names>L.</given-names>
            ,
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            ,
            <surname>Nie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.Y.</given-names>
            ,
            <surname>Soulier</surname>
          </string-name>
          ,
          <string-name>
            <surname>L</surname>
          </string-name>
          . (eds.)
          <article-title>Working Notes Papers of the CLEF 2018 Evaluation Labs</article-title>
          .
          <source>CEUR Workshop Proceedings, CLEF and CEUR-WS.org (Sep</source>
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montes-</surname>
            y-Gómez,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Overview of the 6th Author Profiling Task at PAN 2018: Multimodal Gender Identification in Twitter</article-title>
          . In: Cappellato,
          <string-name>
            <given-names>L.</given-names>
            ,
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            ,
            <surname>Nie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.Y.</given-names>
            ,
            <surname>Soulier</surname>
          </string-name>
          ,
          <string-name>
            <surname>L</surname>
          </string-name>
          . (eds.)
          <article-title>Working Notes Papers of the CLEF 2018 Evaluation Labs</article-title>
          .
          <source>CEUR Workshop Proceedings, CLEF and CEUR-WS.org (Sep</source>
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Sanchez-Perez</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Markov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gómez-Adorno</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sidorov</surname>
          </string-name>
          , G.:
          <article-title>Comparison of character n-grams and lexical features on author, gender, and language variety identification on the same Spanish news corpus</article-title>
          .
          <source>In: International Conference of the Cross-Language Evaluation Forum for European Languages</source>
          . pp.
          <fpage>145</fpage>
          -
          <lpage>151</lpage>
          . Springer (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Sapkota</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bethard</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montes-y Gómez</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Solorio</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Not all character n-grams are created equal: A study in authorship attribution</article-title>
          .
          <source>In: Proceedings of the 2015 Annual Conference of the North American Chapter of the ACL: Human Language Technologies</source>
          . pp.
          <fpage>93</fpage>
          -
          <lpage>102</lpage>
          . NAACL-HLT'
          <fpage>15</fpage>
          ,
          <string-name>
            <surname>Association for Computational Linguistics</surname>
          </string-name>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Stamatatos</surname>
          </string-name>
          , E.:
          <article-title>Plagiarism detection using stopword n-grams</article-title>
          .
          <source>Journal of the American Society for Information Science and Technology</source>
          <volume>62</volume>
          (
          <issue>12</issue>
          ),
          <fpage>2512</fpage>
          -
          <lpage>2527</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>