<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Know-Center at PAN 2015 author identification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Oliver Pimas</string-name>
          <email>opimas@know-center.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mark Kröll</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roman Kern</string-name>
          <email>rkern@know-center.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Know-Center GmbH Graz</institution>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <abstract>
        <p>Our system for the PAN 2015 authorship verification challenge is based upon a two-step pre-processing pipeline. In the first step, we extract features that capture stylometric properties, grammatical characteristics and purely statistical properties. In the second step, we merge all those features into a single meta feature space. We train an SVM classifier on the generated meta features to verify the authorship of an unseen text document. We report the results from the final evaluation as well as on the training datasets.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>This paper presents a detailed description of our approach to the author
identification task at PAN 2015. The problem can be formulated as follows:
Given a set of documents by a single known author as well as a document of unknown
authorship, determine whether this unknown document was written by that particular
author or not. This problem is also called authorship verification. For PAN 2015,
the training set for a single author consisted of text documents from different genres
and different topics. Therefore, the task can be seen as cross-genre and cross-topic
authorship verification. This resembles real-world applications more closely but also
makes the task more challenging.</p>
      <p>
        This notebook paper is organized as follows: Section 2 describes our
classification approach, Section 3 presents the results, and Section 4 concludes.
      </p>
      <p>
We based our work for the PAN 2015 author identification challenge upon Know-Center
submissions of previous years (see [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]). We consider authorship verification a
supervised classification problem. For each author, we pre-process each document in
two steps. In step one, we extract different features. These include statistical
features such as term frequencies and character or word n-grams. We also extract grammar
features such as possibly wrong quotes, unpaired brackets or sentences starting with
upper-case letters. Stylometric features, including Hapax Legomena [11], Brunet’s W
[11], Simpson’s D [11], Sichel’s S [11], Honoré’s H [11] and sentence length n-grams,
try to capture the writing style of an author. We also extract topic features in order to
model the topics an author tends to write about. However, as the PAN 2015 challenge
was cross-topic, we deactivated the topic features for our final evaluation.
      </p>
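      <p>
        As an illustration of the stylometric measures named above, the following is a minimal Python sketch based on the standard definitions surveyed in [11]. It is not the implementation used in our system: the function name lexical_richness, the whitespace tokenisation in the usage note and the Brunet exponent of 0.17 are assumptions made for this example, and the input is assumed to contain at least two tokens.
      </p>
      <preformat>
```python
import math
from collections import Counter


def lexical_richness(tokens, brunet_a=0.17):
    """Return a few vocabulary richness measures for a list of word tokens.

    Hypothetical helper for illustration; expects at least two tokens.
    """
    n = len(tokens)                      # N: number of tokens
    counts = Counter(tokens)
    v = len(counts)                      # V: number of distinct word types
    spectrum = Counter(counts.values())  # V(m): number of types occurring m times
    v1 = spectrum[1]                     # hapax legomena (types seen once)
    v2 = spectrum[2]                     # dis legomena (types seen twice)

    # Honore's H is undefined when every type occurs exactly once.
    honore_h = 100 * math.log(n) / (1 - v1 / v) if v1 != v else float("inf")

    # Simpson's D: chance that two randomly drawn tokens share the same type.
    simpson_d = sum(vm * (m / n) * ((m - 1) / (n - 1)) for m, vm in spectrum.items())

    return {
        "hapax_legomena": v1,
        "honore_h": honore_h,
        "sichel_s": v2 / v,
        "brunet_w": n ** (v ** -brunet_a),
        "simpson_d": simpson_d,
    }
```
      </preformat>
      <p>
        For example, lexical_richness("a a b b c d".split()) reports two hapax legomena and a Sichel's S of 0.5, since two of the four word types occur exactly twice.
      </p>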
      <p>
        For most of the features used, please refer to the original paper [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] as well as the
follow-up submissions [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. One of the new features is sentence length n-grams.
We define sentences consisting of up to 7 words as short sentences. Sentences
consisting of more than 13 words are considered long sentences. Sentences that
qualify as neither short nor long are considered medium sentences. Using these definitions,
we move a window of size n over an author’s text and store n-grams as features,
substituting each sentence with a length indication character. A short sentence is substituted
by s, a medium sentence by m and a long sentence by l. Thus, a sentence length
bigram describing a short sentence followed by a long sentence would simply be "sl". We
store sentence length n-grams as frequency vectors. The intuition behind this feature is
to model an author’s tendency to write longer or shorter sentences or to mix sentence
lengths.
      </p>
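      <p>
        The sentence length n-gram feature can be sketched in a few lines of Python. This is an illustrative sketch rather than our actual implementation: the function names are hypothetical, sentence splitting is assumed to have happened already, and words are approximated by whitespace-separated tokens.
      </p>
      <preformat>
```python
from collections import Counter


def length_class(sentence, short_max=7, medium_max=13):
    """Map a sentence to 's', 'm' or 'l' based on its word count."""
    n_words = len(sentence.split())
    if n_words > medium_max:   # more than 13 words: long
        return "l"
    if n_words > short_max:    # 8 to 13 words: medium
        return "m"
    return "s"                 # up to 7 words: short


def sentence_length_ngrams(sentences, n=2):
    """Count n-grams over the sequence of sentence length classes."""
    classes = "".join(length_class(s) for s in sentences)
    return Counter(classes[i:i + n] for i in range(len(classes) - n + 1))
```
      </preformat>
      <p>
        For a text consisting of a 3-word, a 14-word and a 10-word sentence, the class sequence is "slm", so the bigram counts contain "sl" and "lm" once each.
      </p>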
      <p>
        Another new feature is topic distribution. We extract the topic distribution of the
training corpus using MALLET’s [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] implementation of LDA [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. LDA is a generative
model that tries to uncover latent topics from a given text corpus. MALLET’s
ParallelTopicModel<sup>1</sup> is a parallel, threaded implementation of LDA building upon the work
of [9] and [12]. If multiple languages are present in the training set, we generate a
separate topic model for each language. In the feature extraction step, we then store the topic
distribution vector of every document for each author. However, since the authorship
verification challenge at PAN 2015 was cross-topic, we deactivated this feature set for
the final evaluation.
      </p>
      <p>
        The grammar features are extracted using the open-source style and grammar checker
LanguageTool<sup>2</sup>. Please refer to [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] for more details.
      </p>
      <p>The PAN 2015 authorship verification challenge includes datasets in the
languages English, Spanish, Dutch and Greek. While our pre-processing pipeline generally
supports all four languages, not all features can be extracted for each language. We
currently do not support stemming, stop or function word annotation or part-of-speech
annotation for Dutch and Greek. Therefore, we cannot extract all features for those two
languages and results may differ.</p>
      <p>
        After extracting those features in step one, we face a number of different feature
spaces with different ranges of values. This poses a problem for many machine
learning algorithms. In step two of our pre-processing pipeline, we tackle this problem
by generating meta features from the extracted features. These meta features all
reside in a single meta feature space. To generate the meta features, we aggregate the
extracted feature spaces and compare them to the unseen document using the
Kolmogorov-Smirnov test. For more details on the comparison and the meta feature generation
process, please refer to [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
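      <p>
        To illustrate the idea of mapping heterogeneous feature spaces onto one common scale, the following Python sketch computes the two-sample Kolmogorov-Smirnov statistic per feature space. It is a simplified assumption of how such a comparison could look, not the procedure described in [5]: the function names, the use of the raw statistic as the meta feature and the dictionary layout are all hypothetical, and both samples are assumed to be non-empty.
      </p>
      <preformat>
```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic.

    This is the largest absolute gap between the empirical cumulative
    distribution functions of the two samples, a value between 0 and 1.
    """
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    # Both functions are step functions, so checking the observed values suffices.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def meta_features(known, unknown):
    """Map each named feature space to a single KS statistic.

    `known` and `unknown` are dicts from a feature space name to a list of
    values, for the known author's documents and the unseen document.
    """
    return {name: ks_statistic(known[name], unknown[name]) for name in sorted(known)}
```
      </preformat>
      <p>
        Because every statistic lies between 0 and 1 regardless of the scale of the underlying feature space, the resulting values can be passed to a single classifier without further normalisation.
      </p>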
      <p>
        Finally, we train a classifier on our meta features. For the evaluation we trained
an SVM, using the machine learning framework WEKA [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. We used WEKA’s class
SMO<sup>3</sup>, which is an implementation of John Platt’s sequential minimal optimization
algorithm [10] for training a support vector classifier, building upon the work of [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. We did not evaluate different settings for the support vector classifier but used the
default parameter settings of WEKA instead.
      </p>
      <p><sup>1</sup> http://mallet.cs.umass.edu/api/cc/mallet/topics/ParallelTopicModel.html</p>
      <p><sup>2</sup> https://languagetool.org/</p>
      <p>
<sup>3</sup> http://weka.sourceforge.net/doc.dev/weka/classifiers/functions/SMO.html
      </p>
    </sec>
    <sec id="sec-2">
      <title>Results</title>
      <p>We report the results on the training and test datasets provided by the PAN 2015
authorship verification challenge. To evaluate the performance on the training dataset,
we used the method crossValidateModel provided by the WEKA class Evaluation
(http://weka.sourceforge.net/doc.dev/weka/classifiers/Evaluation.html) to
perform 10-fold cross-validation. The results on the final training sets
are shown in Table 1. Evaluations on previously released training datasets yielded
similar results. The performance on the English dataset, for which all feature sets
are supported by our pre-processing pipeline (see Section 2), looks especially promising. The
results of the PAN 2015 authorship verification evaluation are shown in Table 2.
Comparing the final evaluation results to those on the training datasets, our approach
appears prone to overfitting the training dataset.</p>
    </sec>
    <sec>
      <title>Conclusion</title>
      <p>meta feature space in a two-step preprocessing pipeline. Our system’s performance in
the final evaluation did not meet our expectations. Comparing the results on the training
and test datasets, our system seems to overfit the training data.</p>
      <sec>
        <title>Future Work</title>
        <p>In the future, we aim to try different combinations of features and
to tune the creation of the classification model. We also plan to try different supervised
classification algorithms and compare the results. In order to validate our meta feature
generation approach, we plan to use a classification algorithm that is able to deal with
different feature spaces and compare the results of this algorithm, using the features
extracted in step one of our pre-processing pipeline, to those of our classifier trained on
the meta features generated in pre-processing step two.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Acknowledgements</title>
      <p>This work is funded by the KIRAS program of the Austrian Research Promotion Agency
(FFG) (project number 840824). The Know-Center is funded within the Austrian COMET
Program under the auspices of the Austrian Ministry of Transport, Innovation and
Technology, the Austrian Ministry of Economics and Labor and by the State of Styria.
COMET is managed by the Austrian Research Promotion Agency FFG.</p>
      <p>9. Newman, D., Asuncion, A., Smyth, P., Welling, M.: Distributed Algorithms for Topic
Models. The Journal of Machine Learning Research 10, 1801–1828 (2009),
http://dl.acm.org/citation.cfm?id=1577069.1755845</p>
      <p>10. Platt, J.: Fast training of support vector machines using sequential minimal optimization. In:
Schoelkopf, B., Burges, C., Smola, A. (eds.) Advances in Kernel Methods - Support Vector
Learning. MIT Press (1998), http://research.microsoft.com/~jplatt/smo.html</p>
      <p>11. Tweedie, F.J., Baayen, R.H.: How Variable May a Constant be? Measures of Lexical
Richness in Perspective. Computers and the Humanities 32(5), 323–352 (1998),
http://link.springer.com/article/10.1023/A%3A1001749303137</p>
      <p>12. Yao, L., Mimno, D., McCallum, A.: Efficient methods for topic model inference on
streaming document collections. Proceedings of the 15th ACM SIGKDD international
conference on Knowledge discovery and data mining - KDD ’09 p. 937 (2009),
http://portal.acm.org/citation.cfm?doid=1557019.1557121</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Blei</surname>
            ,
            <given-names>D.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>A.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jordan</surname>
            ,
            <given-names>M.I.</given-names>
          </string-name>
          :
          <article-title>Latent Dirichlet Allocation</article-title>
          .
          <source>The Journal of Machine Learning</source>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Hall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frank</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Holmes</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pfahringer</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reutemann</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Witten</surname>
            ,
            <given-names>I.H.</given-names>
          </string-name>
          :
          <article-title>The WEKA Data Mining Software: An Update</article-title>
          .
          <source>SIGKDD Explorations</source>
          <volume>11</volume>
          (
          <issue>1</issue>
          ),
          <fpage>10</fpage>
          -
          <lpage>18</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Hastie</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tibshirani</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Classification by pairwise coupling</article-title>
          . In:
          <string-name>
            <surname>Jordan</surname>
            ,
            <given-names>M.I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kearns</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Solla</surname>
            ,
            <given-names>S.A.</given-names>
          </string-name>
          (eds.)
          <source>Advances in Neural Information Processing Systems</source>
          . vol.
          <volume>10</volume>
          . MIT Press (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Keerthi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shevade</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bhattacharyya</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Murthy</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Improvements to Platt's SMO algorithm for SVM classifier design</article-title>
          .
          <source>Neural Computation</source>
          <volume>13</volume>
          (
          <issue>3</issue>
          ),
          <fpage>637</fpage>
          -
          <lpage>649</lpage>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Kern</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Grammar Checker Features for Author Identification and Author Profiling</article-title>
          .
          <source>CLEF 2013 Evaluation Labs and Workshop - Working Notes Papers</source>
          (
          <year>2013</year>
          ), http://ims-sites.dei.unipd.it/documents/71612/430938/CLEF2013wn-PAN-Kern2013.pdf
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Kern</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klampfl</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zechner</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Vote/Veto Classification, Ensemble Clustering and Sequence Classification for Author Identification - Notebook of PAN at CLEF 2012</article-title>
          .
          <source>Working Notes Papers of the CLEF 2012 Evaluation Labs</source>
          pp.
          <fpage>1</fpage>
          -
          <lpage>15</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Kern</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Seifert</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zechner</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Granitzer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Vote/Veto Meta-Classifier for Authorship Identification - Notebook for PAN at CLEF 2011</article-title>
          .
          <source>CLEF 2011: Proceedings of the 2011 Conference on Multilingual and Multimodal Information Access Evaluation (Lab and Workshop Notebook Papers)</source>
          , Amsterdam, The Netherlands (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>McCallum</surname>
            ,
            <given-names>A.K.</given-names>
          </string-name>
          :
          <article-title>Mallet: A machine learning for language toolkit</article-title>
          (
          <year>2002</year>
          ), http://mallet.cs.umass.edu
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>