<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Single Author Style Representation for the Author Verification Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Cristhian Mayor</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Josue Gutierrez</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Angel Toledo</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rodrigo Martinez</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paola Ledesma</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gibran Fuentes</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ivan Meza</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Escuela Nacional de Antropologia e Historia</institution>
          ,
          <addr-line>ENAH</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institut National des Sciences Appliquees de Lyon</institution>
          ,
          <addr-line>INSA Lyon</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Instituto de Investigaciones en Matematicas Aplicadas y en Sistemas (IIMAS) Universidad Nacional Autonoma de Mexico</institution>
          ,
          <addr-line>UNAM</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <fpage>1079</fpage>
      <lpage>1083</lpage>
      <abstract>
        <p>This paper presents our experience implementing three approaches for the 'PAN 2014 Author Identification' [3,1] task using the same representation for the author's style. Two of our approaches extend previous successful approaches: naive Bayes [4] and impostor [8] methods. The third approach is based on original research on sparse representation for text documents. We present results with the official development and test corpora.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Author verification has applications in several areas, including information
retrieval and computational linguistics, and has an impact on fields such as law and
journalism [
        <xref ref-type="bibr" rid="ref2 ref5 ref9">2,5,9</xref>
        ]. For the PAN 2014 Author Identification edition, the task was
formally defined as follows<sup>1</sup>:
      </p>
      <p>Given a small set (no more than 5, possibly as few as one) of “known”
documents by a single person and a “questioned” document, the task is to
determine whether the questioned document was written by the same person
who wrote the known document set.</p>
      <p>This year the documents were in four languages and four genres with the following
combinations: essays and reviews in Dutch, essays and novels in English and articles in
Greek and Spanish.</p>
      <p>
        In this work we present the implementation of three approaches to the
authorship verification task, all based on the same document representation. In particular, we
reuse the vector space representation of documents described in our approach for
the previous edition [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], but we aggregate the document vectors to obtain a single representation of the author’s
style. We applied two well-known methods for verifying the author: naive Bayes [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and
impostor [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Additionally, we implemented a novel approach using sparse
representation [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p><sup>1</sup> As described on the official website of the competition, http://pan.webis.de/ (2014).</p>
    </sec>
    <sec id="sec-2">
      <title>2 Author’s Style Representation</title>
      <p>
        The representation for an author’s style is generated in two stages. First, we represent
the documents from an author using the vector space model [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]:
      </p>
      <p><disp-formula><tex-math>d = (w_1, w_2, \ldots, w_m)</tex-math></disp-formula>
where <italic>w<sub>i</sub></italic> is the frequency or weight of the <italic>i</italic>-th word of a vocabulary of size <italic>m</italic>.
A common instance of the vector space model is the bag of words model, in which
the vector components correspond to actual words of the document and the weights count the occurrences
of those words in the represented document.</p>
      <p>We use the bag of words together with several other feature frequencies to
represent the documents. The full set is:</p>
      <sec id="sec-2-1">
        <p>Bag of words: frequencies of words in the document.</p>
        <p>Bigram: frequencies of two consecutive words.</p>
        <p>Trigram: frequencies of three consecutive words.</p>
        <p>Prefix: frequencies of prefixes of words.</p>
        <p>Suffix: frequencies of suffixes of words.</p>
        <p>Prefix bigram: frequencies of two consecutive prefixes of words.</p>
        <p>Suffix bigram: frequencies of two consecutive suffixes of words.</p>
        <p>Stop words: frequencies of stop words.</p>
        <p>Stop words bigram: frequencies of two consecutive stop words.</p>
        <p>Punctuation: frequencies of punctuation marks.</p>
        <p>Words per sentence: frequencies of sentence lengths, measured in words.</p>
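        <p>As an illustration only, the feature frequencies listed above could be extracted along the following lines. This is a minimal Python sketch and not the implementation used in our experiments: the function names extract_features and ngrams, the three-character affix length, the regular-expression tokenisation and the tiny English stop-word list are all assumptions introduced for the example.</p>
        <preformat>
```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real system would
# need one list per language.
STOP_WORDS = {"the", "a", "of", "and", "to", "in"}


def ngrams(items, n):
    """Return every run of n consecutive items as a tuple."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


def extract_features(text, affix_len=3):
    """Map a document to {feature type: Counter of frequencies}.

    The affix length and the tokenisation rules are assumptions.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text.lower())
    prefixes = [w[:affix_len] for w in words]
    suffixes = [w[-affix_len:] for w in words]
    # Assumption: stop-word bigrams are built over the sequence of stop
    # words that remains after all other words are dropped.
    stops = [w for w in words if w in STOP_WORDS]
    return {
        "bag_of_words": Counter(words),
        "bigram": Counter(ngrams(words, 2)),
        "trigram": Counter(ngrams(words, 3)),
        "prefix": Counter(prefixes),
        "suffix": Counter(suffixes),
        "prefix_bigram": Counter(ngrams(prefixes, 2)),
        "suffix_bigram": Counter(ngrams(suffixes, 2)),
        "stop_words": Counter(stops),
        "stop_words_bigram": Counter(ngrams(stops, 2)),
        "punctuation": Counter(re.findall(r"[^\w\s]", text)),
        "words_per_sentence": Counter(
            len(re.findall(r"\w+", s)) for s in sentences
        ),
    }


features = extract_features("The cat sat. The cat ran to the mat!")
print(features["bag_of_words"]["the"])      # 3
print(features["bigram"][("the", "cat")])   # 2
```
        </preformat>
        <p>Each feature type yields its own sparse frequency vector, so a document is described by eleven such vectors in this sketch.</p>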
        <p>In the second stage, an author is represented by the sum of the vectors
representing the documents written by her or him. This cumulative vector is divided by the
number of documents by the author, and the resulting normalized vector represents the
style of the author. This representation is not novel. However, whereas many approaches to
author verification tune the representation to the domain, in our experiments we keep
the same representation regardless of the domain. Some of the implemented
approaches require an instance of a document by a given author; to obtain one, we
sample a document from the author’s style representation.</p>
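        <p>The second stage can be sketched as follows. The aggregation (sum of document vectors divided by the number of documents) follows the description above; the sampling step is only one plausible reading, since the text does not fix a sampling scheme. Here we assume terms are drawn with probability proportional to their weight in the style vector, and the names author_style and sample_document are hypothetical.</p>
        <preformat>
```python
import random
from collections import Counter


def author_style(doc_vectors):
    """Sum the document vectors and divide by the number of documents."""
    total = Counter()
    for vec in doc_vectors:
        total.update(vec)
    n_docs = len(doc_vectors)
    return {term: count / n_docs for term, count in total.items()}


def sample_document(style, length, rng):
    """Draw a pseudo-document of `length` terms from a style vector.

    Assumption: terms are drawn independently, with probability
    proportional to their weight in the style vector.
    """
    terms = sorted(style)
    weights = [style[t] for t in terms]
    return Counter(rng.choices(terms, weights=weights, k=length))


docs = [Counter({"the": 4, "cat": 2}), Counter({"the": 2, "dog": 2})]
style = author_style(docs)
print(style)  # {'the': 3.0, 'cat': 1.0, 'dog': 1.0}

sample = sample_document(style, length=10, rng=random.Random(0))
print(sum(sample.values()))  # 10
```
        </preformat>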
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3 Approaches</title>
      <sec id="sec-3-1">
        <p>The approaches implemented for the task are described next.</p>
        <sec id="sec-3-1-1">
          <title>3.1 Impostors</title>
          <p>
            This method iteratively compares the vector distance between the author’s
document and the questioned document with the distances between several impostor
documents and the questioned document. From these distances, a score is built based
on how many times the author’s document is closer to the questioned document than the impostor
documents are. For this approach we follow the description of the method
by Seidman (2013) [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ]. We modified the method to work on more than one set of
features and, instead of using impostors from the web, we used the training corpus as the source
of impostors. Additionally, we extended the approach to produce a probability as
output by repeating the algorithm, since the document instances are randomly
sampled and each repetition can therefore yield a different outcome.
          </p>
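          <p>A simplified version of this scoring loop is sketched below. It is not the full method of Seidman [8] nor our exact implementation: the cosine distance, the rule that a round counts only when the author’s document is closer than every drawn impostor, and the names cosine_distance and impostor_score are assumptions. The sketch also keeps the author’s document fixed and randomises only the impostors, whereas our system re-samples the document instances as well.</p>
          <preformat>
```python
import math
import random


def cosine_distance(u, v):
    """1 - cosine similarity between two sparse vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def impostor_score(author_doc, questioned, impostors, rounds, k, rng):
    """Fraction of rounds in which the author's document is closer to the
    questioned document than each of k randomly drawn impostors."""
    d_author = cosine_distance(author_doc, questioned)
    wins = 0
    for _ in range(rounds):
        drawn = rng.sample(impostors, k)
        if all(cosine_distance(imp, questioned) > d_author for imp in drawn):
            wins += 1
    return wins / rounds


author_doc = {"the": 3.0, "cat": 1.0}
questioned = {"the": 2.0, "cat": 1.0}
impostors = [
    {"dog": 2.0, "the": 1.0},
    {"fish": 3.0},
    {"bird": 1.0, "the": 1.0},
]
score = impostor_score(author_doc, questioned, impostors,
                       rounds=20, k=2, rng=random.Random(0))
print(score)  # 1.0: every impostor is farther away than the author
```
          </preformat>
          <p>The returned fraction plays the role of the probability mentioned above; running the same loop once per feature set and averaging the fractions would be one way to combine several feature sets.</p>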
        </sec>
        <sec id="sec-3-1-2">
          <title>3.2 Naive Bayes</title>
          <p>This method samples two document instances from each of the author and the impostor style
representations. A probability score is then calculated using the
terms shared by the questioned document and the author’s documents. An alternative score is
calculated in the same way between the questioned document and the
impostor documents. These scores are derived using Bayes’ rule. The purpose of the score is
to capture the probability that the questioned document was created by the same author: if the score
for the author is higher than the score for the impostor, we consider it evidence of authorship. We
repeat this procedure <italic>n</italic> times to calculate the probability of authorship.</p>
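          <p>The following sketch shows one way such a comparison could be set up. The text above does not give the exact scoring formula, so the model here is an assumption: a multinomial likelihood with add-one smoothing, equal priors for author and impostor (so that Bayes’ rule reduces to comparing likelihoods), and the two sampled instances pooled into a single count vector. The function names and the proportional sampling scheme are likewise hypothetical.</p>
          <preformat>
```python
import math
import random
from collections import Counter


def sample_document(style, length, rng):
    """Draw `length` terms with probability proportional to their weight
    (an assumed sampling scheme)."""
    terms = sorted(style)
    weights = [style[t] for t in terms]
    return Counter(rng.choices(terms, weights=weights, k=length))


def log_likelihood(questioned, reference, vocab_size, alpha=1.0):
    """Smoothed multinomial log-likelihood of the questioned document.

    Terms shared with `reference` receive high probability; unseen terms
    only receive the smoothing mass `alpha`.
    """
    total = sum(reference.values()) + alpha * vocab_size
    return sum(
        count * math.log((reference.get(term, 0) + alpha) / total)
        for term, count in questioned.items()
    )


def authorship_probability(questioned, author_style, impostor_style,
                           n_iter=50, doc_len=20, rng=None):
    """Fraction of iterations in which the author's sampled documents
    explain the questioned document better than the impostor's."""
    rng = rng or random.Random()
    wins = 0
    for _ in range(n_iter):
        # Two sampled instances per style, pooled into one count vector.
        author_docs = (sample_document(author_style, doc_len, rng)
                       + sample_document(author_style, doc_len, rng))
        impostor_docs = (sample_document(impostor_style, doc_len, rng)
                         + sample_document(impostor_style, doc_len, rng))
        vocab = set(questioned).union(author_docs, impostor_docs)
        # With equal priors, Bayes' rule reduces to comparing likelihoods.
        if (log_likelihood(questioned, author_docs, len(vocab))
                > log_likelihood(questioned, impostor_docs, len(vocab))):
            wins += 1
    return wins / n_iter


questioned = Counter({"the": 3, "cat": 1})
author = {"the": 3.0, "cat": 2.0}
impostor = {"a": 3.0, "dog": 2.0}
prob = authorship_probability(questioned, author, impostor,
                              rng=random.Random(0))
print(prob)  # 1.0: the impostor shares no terms with the questioned text
```
          </preformat>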
        </sec>
        <sec id="sec-3-1-3">
          <title>3.3 Sparse</title>
          <p>
            This methodology has been successfully applied to the face recognition task, in which
the identity of a face image has to be determined from a set of known faces [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ]. We
adapted this methodology to the authorship verification task. The method consists of
identifying the components that contribute to the questioned document from samples of
documents from a set of authors. The rationale is that the largest contributions
should come from a single author. To identify the components,
the method proposes the following <italic>l</italic><sub>1</sub>-minimization:
            <disp-formula><tex-math>\hat{x} = \arg\min_{x} \|x\|_1 \quad \text{subject to} \quad Ax = y \qquad (1)</tex-math></disp-formula>
where <italic>y</italic> is the questioned document, <italic>A</italic> is the matrix of <italic>n</italic> samples from <italic>m</italic> different
candidate authors (impostors), and <italic>x</italic> is the variable to minimize, which represents the
contribution of each candidate, so that multiplying the samples by the contributions
reproduces the questioned document. From the resulting solution <inline-formula><tex-math>\hat{x}</tex-math></inline-formula> we can
quantify the residuals between <inline-formula><tex-math>A\hat{x}</tex-math></inline-formula> and <italic>y</italic> and decide which author contributes
more components. We adapted this method to produce a probability as its result by iterating
<italic>k</italic> times over the full method.</p>
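          <p>To make the residual-based decision concrete, the sketch below uses a swapped-in solver: instead of solving the equality-constrained problem (1) exactly, it runs iterative soft-thresholding (ISTA) on the lasso relaxation, which penalises the reconstruction error and the l1 norm together. The per-author residual follows the idea of [10] of keeping only one author’s coefficients at a time. The function names, the penalty lam, the iteration count and the toy data are assumptions, and this is not the solver used in our experiments.</p>
          <preformat>
```python
import math


def soft_threshold(value, threshold):
    """Shrink a coefficient towards zero by `threshold`."""
    if value > threshold:
        return value - threshold
    if -value > threshold:
        return value + threshold
    return 0.0


def combine(columns, coeffs):
    """Compute A x, with A given as a list of column vectors."""
    dim = len(columns[0])
    return [sum(c * col[i] for c, col in zip(coeffs, columns))
            for i in range(dim)]


def sparse_code(columns, y, lam=0.01, n_iter=2000):
    """ISTA on the lasso relaxation 0.5*||A x - y||^2 + lam*||x||_1."""
    # Squared Frobenius norm: a safe upper bound on the Lipschitz constant.
    lipschitz = sum(v * v for col in columns for v in col)
    step = 1.0 / lipschitz
    x = [0.0] * len(columns)
    for _ in range(n_iter):
        residual = [p - t for p, t in zip(combine(columns, x), y)]
        grad = [sum(r * v for r, v in zip(residual, col)) for col in columns]
        x = [soft_threshold(xi - step * g, step * lam)
             for xi, g in zip(x, grad)]
    return x


def class_residuals(columns, labels, x, y):
    """Residual of y when only one author's coefficients are kept."""
    out = {}
    for author in sorted(set(labels)):
        kept = [xi if lab == author else 0.0 for xi, lab in zip(x, labels)]
        approx = combine(columns, kept)
        out[author] = math.sqrt(sum((a - t) ** 2 for a, t in zip(approx, y)))
    return out


# Toy data: two sampled documents for each of two candidate authors.
columns = [
    [1.0, 0.0, 0.0, 0.0], [0.8, 0.2, 0.0, 0.0],   # author "A"
    [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.7, 0.3],   # author "B"
]
labels = ["A", "A", "B", "B"]
y = [0.9, 0.1, 0.0, 0.0]  # questioned document

x = sparse_code(columns, y)
residuals = class_residuals(columns, labels, x, y)
best = min(residuals, key=residuals.get)
print(best, residuals)
```
          </preformat>
          <p>The candidate with the smallest residual is taken as the author whose samples best explain the questioned document; repeating the procedure with freshly sampled columns and counting how often the known author wins would give a probability in the sense described above.</p>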
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Results</title>
      <p>In preparing the authorship verification systems we implemented three
approaches: the impostor, naive Bayes and sparse methodologies. During development we tested
all of them with the same representation of the author’s style and the same parameters for
the four languages and four genres of the task. During development and testing the best
results were achieved with the sparse methodology, which is of interest to us since it
is the first time such a method has been applied to the task of authorship verification.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Gollub</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Beyer</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Busse</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
          </string-name>
          , B.:
          <article-title>Recent trends in digital text forensics and its evaluation</article-title>
          .
          <source>In: Information Access Evaluation meets Multilinguality, Multimodality, and Visualization. 4th International Conference of the CLEF Initiative (CLEF 13)</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Juola</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Authorship attribution</article-title>
          .
          <source>Found. Trends Inf. Retr</source>
          .
          <volume>1</volume>
          (
          <issue>3</issue>
          ),
          <fpage>233</fpage>
          -
          <lpage>334</lpage>
          (
          <year>Dec 2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Juola</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
          </string-name>
          , E.:
          <article-title>Overview of the author identification task at PAN 2013</article-title>
          . In:
          <source>CLEF 2013 Evaluation Labs and Workshop - Online Working Notes</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Kešelj</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peng</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cercone</surname>
          </string-name>
          , N.,
          <string-name>
            <surname>Thomas</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>N-gram-based author profiles for authorship attribution</article-title>
          .
          <source>In: Proceedings of the Conference Pacific Association for Computational Linguistics</source>
          ,
          <source>PACLING</source>
          . vol.
          <volume>3</volume>
          , pp.
          <fpage>255</fpage>
          -
          <lpage>264</lpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Koppel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Argamon</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Computational methods in authorship attribution</article-title>
          .
          <source>Journal of the American Society for Information Science and Technology</source>
          <volume>60</volume>
          (
          <issue>1</issue>
          ),
          <fpage>9</fpage>
          -
          <lpage>26</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Ledesma</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fuentes</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jasso</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toledo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Meza</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Distance learning for author verification</article-title>
          .
          In:
          <source>CLEF 2013 Evaluation Labs and Workshop - Online Working Notes</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Salton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wong</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>C.S.</given-names>
          </string-name>
          :
          <article-title>A vector space model for automatic indexing</article-title>
          .
          <source>Commun. ACM</source>
          <volume>18</volume>
          (
          <issue>11</issue>
          ),
          <fpage>613</fpage>
          -
          <lpage>620</lpage>
          (
          <year>Nov 1975</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Seidman</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Authorship verification using the impostors method</article-title>
          . In:
          <source>CLEF 2013 Evaluation Labs and Workshop - Online Working Notes</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Stamatatos</surname>
          </string-name>
          , E.:
          <article-title>A survey of modern authorship attribution methods</article-title>
          .
          <source>Journal of the American Society for Information Science and Technology</source>
          <volume>60</volume>
          (
          <issue>3</issue>
          ),
          <fpage>538</fpage>
          -
          <lpage>556</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Wright</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>A.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ganesh</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sastry</surname>
            ,
            <given-names>S.S.</given-names>
          </string-name>
          , Ma, Y.:
          <article-title>Robust face recognition via sparse representation</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>31</volume>
          (
          <issue>2</issue>
          )
          ,
          <fpage>210</fpage>
          -
          <lpage>227</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>