<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Distance learning for Author Verification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Paola Ledesma</string-name>
          <email>paola@turing.iimas.unam.mx</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gibran Fuentes</string-name>
          <email>gibranfp@turing.iimas.unam.mx</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gabriela Jasso</string-name>
          <email>gabrielajassolopez@gmail.mx</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Angel Toledo</string-name>
          <email>angeltc@ciencias.unam.mx</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ivan Meza</string-name>
          <email>ivanvladmir@turing.iimas.unam.mx</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Escuela Nacional de Antropología e Historia</institution>
          ,
          <addr-line>ENAH</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Instituto de Investigaciones en Matemáticas aplicadas y en Sistemas (IIMAS) Universidad Nacional Autónoma de México</institution>
          ,
          <addr-line>UNAM</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <abstract>
        <p>This paper presents a distance metric learning method for the 'PAN 2013 Author Identification' challenge. Our approach extracts multiple distance metrics from different document representations, from which our system learns to tune each of these distances and representations to form a combined distance metric. We learn these distances by means of linear programming, Support Vector Regression and Neural Network models. As specified by the task description, our system can be applied to English, Greek and Spanish, and can be configured to be language independent. We achieved moderately successful results in this PAN competition, with better results on the development test set. We present results with the official test set and with our own corpus based on web documents.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Authorship attribution [
        <xref ref-type="bibr" rid="ref4 ref5 ref8">4,5,8</xref>
        ] is an important task for several areas, including
information retrieval and computational linguistics, and has an impact in fields such as law
and journalism. A common scenario for this task is the author verification problem. In
this setting we have access to a set of documents by a single author and a questioned
document; the problem is to determine whether the questioned document was written by that
particular author or not.
      </p>
      <p>This year the PAN 2013 Author Identification task was defined as follows1:</p>
      <p>Given a small set (no more than 10, possibly as few as one) of “known”
documents by a single person and a “questioned” document, the task is to
determine whether the questioned document was written by the same person
who wrote the known document set.</p>
      <p>There are multiple features which could characterise an author's writing: lexical,
syntactic and stylistic. In our approach, we focus on capturing these features in different
vector space representations of the known documents, for instance bag of words and
n-grams. A reasonable hypothesis is that two documents by the same author should be
“closer”, while documents by different authors should be more “distant”. We aim to capture
this intuition by calculating a distance metric between the vector space representations
of the documents. Documents by the same author would share similar characteristics
and therefore have a distance close to zero, while documents by different
authors would have a distance close to one.</p>
      <p>1 As described in the official website of the competition: http://pan.webis.de/</p>
      <p>
        We frame the author verification problem as a problem of learning a distance[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]
from multiple distances over multiple vector space representations. To achieve
this we tried linear programming [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] (LP), a Support Vector Regression [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] (SVR)
and Neural Network [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] (NN) models. The intuition is to identify
the contribution of each of the distances in forming a new distance by means of a linear
combination of distances, support vectors or weights. This setting allows us to verify
whether two documents were written by the same author. To infer the author for a set of documents, as
stated in the task, we calculate the distance between the questioned document and each
of the known documents, and use a voting scheme in which we count how many times
the system decided yes or no for the questioned document.
      </p>
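      <p>The voting step just described can be sketched as follows. This is an illustrative sketch, not the system's actual code: the combined_distance argument and the 0.5 decision threshold are our assumptions, since the actual combined distance is learned as described later.</p>
      <preformat>
```python
# Illustrative sketch of the voting scheme described above. The
# combined_distance argument and the 0.5 threshold are assumptions for
# illustration; the actual combined distance is learned by the system.

def verify_author(questioned, known_docs, combined_distance, threshold=0.5):
    """Vote yes when the questioned document is close to a majority of
    the known documents."""
    votes_yes = sum(
        1 for known in known_docs
        if threshold >= combined_distance(questioned, known)  # close pair
    )
    return votes_yes * 2 >= len(known_docs)  # majority decides (ties: yes)
```
      </preformat>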
      <p>The outline of this work is the following. Section 2 reviews the vector space
representation of documents and lists the representations used in our approach. Section 3
presents the metrics used to calculate the distance between documents. Section 4
introduces the different machine learning methods we experimented with. Section 6.2 lists
some specifications of our final system. Section 5 shows the characteristics of the
corpora used during development. Section 6 shows our development and final results using
this approach and our system. Finally, Section 7 discusses future work and presents some
conclusions.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Vector Space Representation</title>
      <p>
        A vector space model[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] represents a document as a vector space:
      </p>
      <p>d = (w_1, w_2, ..., w_m)</p>
      <p>in which w_i is the frequency or weight of word i of a vocabulary of size m. A common
instance of the vector space model is the bag of words model, in which the components
represent actual words of the document and the frequencies are counts of the occurrences
of those words in the represented document.</p>
      <p>In our approach we use the following document representations:
Bag of words Frequencies of words in the document.</p>
      <p>Bigram Frequencies of two consecutive words.</p>
      <p>Trigram Frequencies of three consecutive words.</p>
      <p>Comma Frequencies of comma signs.</p>
      <p>Dots Frequencies of period signs.</p>
      <p>Numbers Frequencies of whole numbers.</p>
      <p>Capitals Frequencies of capitalized words.</p>
      <p>Words per paragraph Frequencies of words per paragraph.</p>
      <p>Sentences per paragraph Frequencies of sentences per paragraph.</p>
      <p>Square brackets Frequencies of square brackets.</p>
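      <p>As an illustration, two of the representations above can be computed as term-frequency dictionaries. This is a sketch under our own naming, not the system's actual code.</p>
      <preformat>
```python
# A sketch of two of the listed representations as frequency
# dictionaries (the function names are ours, for illustration only).
from collections import Counter

def bag_of_words(text):
    """Frequencies of words in the document."""
    return Counter(text.lower().split())

def bigrams(text):
    """Frequencies of two consecutive words."""
    words = text.lower().split()
    return Counter(zip(words, words[1:]))
```
      </preformat>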
    </sec>
    <sec id="sec-3">
      <title>3 Distance metrics</title>
      <p>A distance function calculates the dissimilarity between two elements. We can
calculate distances between the vector space representations of our documents presented in
the previous section. Formally, a distance function D and a vector space d form a metric
space when the following holds:</p>
      <p>D(d_i, d_j) ≥ 0</p>
      <p>D(d_i, d_i) = 0</p>
      <p>D(d_i, d_j) = D(d_j, d_i), i ≠ j</p>
      <p>D(d_i, d_k) ≤ D(d_i, d_j) + D(d_j, d_k), i ≠ j, j ≠ k</p>
      <p>In particular we focus on bounded distances, which additionally satisfy:</p>
      <p>D(d_i, d_j) ≤ 1.0, i ≠ j</p>
      <p>This means that the distances are bounded by one: they range from zero, when the
two elements are the same, to one, when they are completely dissimilar.</p>
      <p>There are multiple distance functions to use with the vector space representation.
We divided the metrics into three categories: binary, weighted and Euclidean-based
distances. Next we describe each of these categories.</p>
      <sec id="sec-3-1">
        <title>3.1 Binary distances</title>
        <p>This type of distance is based on pointwise logical operations on the vector spaces.
Examples of these distances are Jaccard, Massi and Sorensen. However, in our approach
we only used:
Jaccard This metric measures the common type vocabulary between two sets.
The weighted distances are based on the binary distances, but are modified to take into
account the frequencies or weights of the vector space representation. In our approach
we use the following weighted distances:
Weighted Jaccard This metric measures the common token vocabulary between two
sets:</p>
        <p>D(d_1, d_2) = Σ min(d_1, d_2) / Σ max(d_1, d_2)</p>
        <p>
          Weighted Sorensen This metric measures twice the common token vocabulary
between two sets, where the normalizing term is the sum of the vector components with
values greater than zero.
Weighted Massi This metric measures the common token vocabulary compared with
the vector with the larger frequency mass.
Weighted h0 [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] This metric measures the minimum token vocabulary in common
versus the maximum vocabulary in common.
The Euclidean-based distances rely on the Euclidean representation of the vector space; in
order to bound them by 1 we normalize them.
        </p>
        <p>Euclidean This metric measures the Euclidean distance between two vectors.
In addition to the standard distances presented above, we found an additional coefficient
quite helpful during the learning stage.</p>
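        <p>As a sketch, two of the bounded distances can be computed over frequency dictionaries as follows. Since our convention gives identical documents a distance of zero, we take one minus the min/max ratio for the weighted Jaccard; that complement form, and the normalization constant for the Euclidean case, are our assumptions for illustration.</p>
        <preformat>
```python
# Sketch of two bounded distances over frequency dictionaries. Taking
# one minus the min/max ratio (so identical documents get 0) and the
# division by 2 for the Euclidean case are our assumptions.
import math

def weighted_jaccard(d1, d2):
    keys = set(d1) | set(d2)
    num = sum(min(d1.get(k, 0), d2.get(k, 0)) for k in keys)
    den = sum(max(d1.get(k, 0), d2.get(k, 0)) for k in keys)
    return 1.0 - (num / den if den else 1.0)

def normalized_euclidean(d1, d2):
    """Euclidean distance between unit-normalized vectors, scaled to [0, 1]."""
    keys = set(d1) | set(d2)
    n1 = math.sqrt(sum(v * v for v in d1.values())) or 1.0
    n2 = math.sqrt(sum(v * v for v in d2.values())) or 1.0
    dist = math.sqrt(sum((d1.get(k, 0) / n1 - d2.get(k, 0) / n2) ** 2
                         for k in keys))
    return dist / 2.0  # two unit vectors are at most 2 apart
```
        </preformat>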
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Linear programming model</title>
      <p>Let (V, D) be a metric space bounded by 1, in which V is a vector space and D a
distance function; then for all d_1, d_2, d_3 ∈ V, D(d_n, d_n) = 0, D(d_1, d_2) = D(d_2, d_1),
D(d_1, d_3) ≤ D(d_1, d_2) + D(d_2, d_3) and D(d_1, d_2) ≤ 1 hold. With these properties in
mind we propose a linear combination of distance functions D such that for
D_1, D_2, ..., D_n ∈ D, d_l = α_1 D_1 + α_2 D_2 + ... + α_n D_n is a distance function
bounded by 1. In order to find such a linear combination we propose the following linear
program:</p>
      <p>minimize α_1 + α_2 + ... + α_n</p>
      <p>subject to α_1 ≤ 1, α_2 ≤ 1, ..., α_n ≤ 1</p>
      <p>This linear program aims to find the weights α such that the combined distance separates
document pairs by the same author from pairs by different authors. During the
training stage the inequality constraints are expanded with actual examples of distance values,
with the direction of each inequality depending on whether the example pairs documents
from the same author or not. The linear program is then run and the solution gives
the weights α.</p>
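      <p>The expansion of constraints from training examples can be sketched as follows. The margin value and the standard-form encoding are our assumptions for illustration; the resulting matrices could be handed to any LP solver (e.g. scipy.optimize.linprog).</p>
      <preformat>
```python
# Sketch of assembling LP constraints from training pairs, in the
# standard form A_ub · alpha ≤ b_ub. The margin value is our assumption,
# not taken from the paper; a solver such as scipy.optimize.linprog
# would consume these matrices.

def build_lp(examples, margin=0.5):
    """examples: list of (distance_vector, same_author) pairs."""
    a_ub, b_ub = [], []
    for distances, same_author in examples:
        if same_author:
            # combined distance should be small: sum(alpha_i * d_i) ≤ margin
            a_ub.append(list(distances))
            b_ub.append(margin)
        else:
            # combined distance should be large: negate to keep the ≤ form
            a_ub.append([-d for d in distances])
            b_ub.append(-margin)
    n = len(examples[0][0])
    bounds = [(0.0, 1.0)] * n  # each weight alpha_i stays in [0, 1]
    return a_ub, b_ub, bounds
```
      </preformat>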
    </sec>
    <sec id="sec-5">
      <title>5 Corpora</title>
      <p>In addition to the official training corpus provided by the PAN organization, we collected
an extra corpus based on documents from the web. Table 3 summarises the main
characteristics of both corpora.</p>
      <p>Table 3. Number of known documents, number of problems, vocabulary types and
vocabulary tokens for English, Greek and Spanish in the official training corpus and in
our Web corpus.</p>
    </sec>
    <sec id="sec-6">
      <title>6 Experimental Results</title>
      <p>For our development stage we performed a leave-one-out cross-validation using the
official training corpus, but we tested the model created by our approach using the Web
corpus we collected, described in the previous section. Table ?? presents our
development and final results for the official training and testing sets and for the web
documents corpus we extracted.</p>
      <p>In addition to the linear programming approach, we tried a linear regression setting
with Support Vector Regression and Neural Networks. Table ?? shows the results
reached; we do not have results for the official test corpus since these configurations were
not part of the final system. The distance learning setting was not fully compatible with
linear regression since it was difficult to constrain the sum of the weights between zero
and one. This situation made it hard to reach good results with these settings.</p>
      <p>In addition to the distance learning setting, we incorporated several pre-processing
strategies into our final system2. The following explains these strategies and lists the
parameters used for the final version of the system.</p>
      <p>Weighting frequencies For the vector space representation of the documents we tried
the following weightings:
– Term frequency tf
– Log term frequency tflog
– Term frequency–inverse document frequency tf-idf
From experimentation we settled on log term frequency, which is only a change to a
logarithmic scale for the frequencies of the terms in the vector space.</p>
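      <p>The log weighting can be sketched as follows, assuming the common 1 + log(tf) form for nonzero counts; the exact variant used by the system is not specified in the text.</p>
      <preformat>
```python
# Sketch of log term-frequency weighting, assuming the common
# 1 + log(tf) form for nonzero counts (the exact variant used by the
# system is not specified here).
import math

def tf_log(counts):
    return {term: 1.0 + math.log(tf) for term, tf in counts.items() if tf}
```
      </preformat>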
      <p>Stop words Some of the representations were preprocessed by eliminating stop words.
For this, we used a standard list of words for each of the languages3. The
representations which were preprocessed with stop words were: Comma, Dots and Capitals.
Cut-off frequencies The vector space representations were also preprocessed by
eliminating terms with few counts, that is, uncommon terms. For each representation
a cut-off value was defined; the following lists the values for the representations
for which the cut-off was different from zero:</p>
      <p>Bag of words 5</p>
      <p>Numbers 1</p>
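      <p>The cut-off step can be sketched as follows; whether the threshold is inclusive is not stated in the text, so keeping only terms with frequency strictly above the cut-off is our assumption.</p>
      <preformat>
```python
# Sketch of the cut-off preprocessing: drop uncommon terms. Treating
# the cut-off as a strict lower bound is our assumption.

def apply_cutoff(counts, cutoff):
    return {term: tf for term, tf in counts.items() if tf > cutoff}
```
      </preformat>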
    </sec>
    <sec id="sec-7">
      <title>7 Conclusions and Future work</title>
      <p>In this paper we have presented a distance metric learning method for the ‘PAN 2013
Author Identification’ challenge. Our approach extracts multiple distance metrics from
multiple document representations, from which our system learns to tune each of
these distances and representations to form a combined distance metric. Our main
approach uses linear programming to find a linear combination of distances; however, we
also present preliminary results with support vector regression and neural network
settings. Our final system achieved moderately successful results in this PAN
competition, with better results on the development test set. In particular, English was the
hardest language in which to verify the author: our system did worse than the baseline
system, while for Greek and Spanish we were 4th place in performance. The current results
point to more work on the side of representations; we are investigating syntactic and
semantic feature representations.</p>
      <p>We are also investigating improvements to our system. In particular, the current
version verifies an author by a voting scheme over a set of distances from the known
documents to the unknown document. Our intuition is that even for a single author there
will be documents which are closer to the unknown document and others not so close;
if a cluster of documents is very close, it should be considered good evidence of
authorship. We are studying mechanisms to add this intuition to our inference process.</p>
      <p>2 https://github.com/ivanvladimir/authorid</p>
      <p>3 The stop word lists can be found here: https://raw.github.com/ivanvladimir/authorid/4e3bd5b968135380ce839340541892c30d13cf5d/data/stopwords.txt</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Chum</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Philbin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zisserman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Near duplicate image detection: min-hash and tf-idf weighting</article-title>
          .
          <source>In: Proceedings of BMVC</source>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Drucker</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Burges</surname>
            ,
            <given-names>C.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaufman</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smola</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vapnik</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Support vector regression machines</article-title>
          (
          <year>1996</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Haykin</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <source>Neural Networks: A Comprehensive Foundation</source>
          . Prentice Hall PTR, Upper Saddle River, NJ, USA, 2nd edn. (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Juola</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Authorship attribution</article-title>
          .
          <source>Found. Trends Inf. Retr</source>
          .
          <volume>1</volume>
          (
          <issue>3</issue>
          ),
          <fpage>233</fpage>
          -
          <lpage>334</lpage>
          (
          <year>Dec 2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Koppel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Argamon</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Computational methods in authorship attribution</article-title>
          .
          <source>Journal of the American Society for Information Science and Technology</source>
          <volume>60</volume>
          (
          <issue>1</issue>
          ),
          <fpage>9</fpage>
          -
          <lpage>26</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Mitchell</surname>
            ,
            <given-names>J.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Farwell</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ramsden</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Interior point methods for large-scale linear programming</article-title>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Salton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wong</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>C.S.</given-names>
          </string-name>
          :
          <article-title>A vector space model for automatic indexing</article-title>
          .
          <source>Commun. ACM</source>
          <volume>18</volume>
          (
          <issue>11</issue>
          ),
          <fpage>613</fpage>
          -
          <lpage>620</lpage>
          (
          <year>Nov 1975</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>A survey of modern authorship attribution methods</article-title>
          .
          <source>Journal of the American Society for Information Science and Technology</source>
          <volume>60</volume>
          (
          <issue>3</issue>
          ),
          <fpage>538</fpage>
          -
          <lpage>556</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Distance metric learning: A comprehensive survey</article-title>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>