<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Lexical-Syntactic and Graph-Based Features for Authorship Verification</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Benemérita Universidad Autónoma de Puebla Faculty of Computer Science</institution>
          ,
          <country country="MX">Mexico</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Darnes Vilariño, David Pinto, Helena Gómez Saúl León</institution>
          ,
          <addr-line>and Esteban Castillo</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <abstract>
        <p>In this paper we present the results obtained by an approach submitted to the author identification task of PAN 2013 which uses lexical, syntactic and graph-based features for constructing a representation model of document authors. In particular, the features extracted from the graph representation were obtained by means of the SubDue mining tool. As a classification model we have employed Support Vector Machines (SVM). The overall results have ranked our approach in the fifth place from around 17 teams.</p>
      </abstract>
      <kwd-group>
        <kwd>Authorship verification</kwd>
        <kwd>graph-based representation</kwd>
        <kwd>phrase-level lexicalsyntactic features</kwd>
        <kwd>support vector machines</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Authorship verification is the task of determinining if a document has been written
by a given author or not. This task is particularly important for forensic linguists who
are often called upon to answer this kind of question. This task has been empowered
by the continuous growing of information in Internet, thus, the importance of finding
the correct features for characterizing the particular writing style of a given author is
fundamental for solving the problem of authorship verification.</p>
      <p>The results reported in this paper were obtained in the framework of the
International Workshop on Plagiarism detection, Author Identification, and Author Profilling
(PAN’13). In particular, in the task named “Author Identification” which has focused
this year in the problem of authorship verification which may be described as follows:
“Given a small set (no more than 10, possibly as few as one) of “known” documents
by a single person and a “questioned” document, the task is to determine whether the
questioned document was written by the same person who wrote the known document
set”.</p>
      <p>In order to tackle this problem, we propose to extract a set of lexical syntactic level
features from each target document, and up to 100 words which are representative of
each author. These representative words are selected through the tool “SubDue”
(described in Section 2.2) in order to construct a representation of the whole documents
written by the given author using a graph structure.</p>
      <p>The rest of this paper is structured as follows. In Section 2 it is presented the
description of the features used in the task to be tackled. Section 3 shows the SVM
classification method employed in the experiments. The experimental setting and a discussion
of the obtained results are given in Section 4. Finally, the conclusions of this research
work is presented in Section 5.
2</p>
    </sec>
    <sec id="sec-2">
      <title>The proposed approach</title>
      <p>In this work we combine two different types of feature extraction. The first one
considers lexical-syntactic features, whereas the second uses a data mining based process for
extracting the most relevant terms of the target documents. The two features extraction
process are described first, and thereafter, we present the manner we construct the final
text representation combining the two types of features.
2.1</p>
      <sec id="sec-2-1">
        <title>Lexical-syntactic features</title>
        <p>This approach considers the following lexical-syntactic features for representing the
particular writing style of a given author:
– Phrase level features
• Word sufixes. A group of letters added after a word base to alter its meaning
and form a new word.
• Stopwords. A group of words that bear no content or relevant semantics which
are filtered out from the texts.
• Punctuation marks.
• Trigrams of PoS. Sequences of three PoS tags appearing in the document. Each
word text is tagged with its corresponding PoS tag according to the target
language. For Spanish language we used the TreeTagger1, for the English
language we employed the Stanford PosTagger2, whereas for the Greek language
we used the Greek POS tagger3.
– Character level features
• Vowel combination. Consonants are removed from words and, thereafter, the
remaining vowels are combined. Each vowel combination is considered to be
a feature. Adjacent repetition of vowels are merged together, considering them
as only one vowel.
• Vowel permutation. Word consonants are removed and, thereafter, the vowel
permutation is considered to be a feature.</p>
        <p>The text representation schema using the above mentioned features is described in
Section 2.3.
1 http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/
2 http://nlp.stanford.edu/software/tagger.shtml
3 http://nlp.cs.aueb.gr/software_and_datasets/AUEB_POS_tagger_2_1_alpha.tar.gz
2.2</p>
      </sec>
      <sec id="sec-2-2">
        <title>Graph-based features</title>
        <p>A graph based representation is considered in this approach. Formally, given a graph
G = (V, E, L, f ) with V being the non-empty set of vertices, E ⊆ V × V the edges, L
the tag set, and f : E → L, a function that assigns a tag to a pair of associated vertices.</p>
        <p>This graph-based representation attempt to capture the sequence among the
sentence words, so as the sequence among their PoS tags with the aim of feeding a graph
mining tool which may extract relevant features that may be further used for
representing the texts. Thus, the set V is constructed from the different words and PoS of the
target document.</p>
        <p>In order to demonstrate the way we construct the graph for each phrase, consider
the following text phrase: “second qualifier long road leading 1998 world cup”. The
associated graph representation is shown in Figure 1.
The process for extracting relevant features from the constructed graph is given as
follows.</p>
        <p>
          The Subdue tool Once each paragraph is represented by means of a graph, we apply a
data mining algorithm in order to find subgraphs. Subdue is a data mining tool widely
used in structured domains. This tool has been used for discovering structured patterns
in texts represented by means of graphs [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Subdue uses an evaluation model named
“Minimum encoding”, a technique derived from the minimum description length
principle [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], in which the best graph sub-structures are chosen. The best subgraphs are
those that minimize the number of bits that represent the graph. In this case, the number
of bits is calculated considering the size of the graph adjancency matrix. Thus, the best
substructure is the one that minimizes I(S) + I(G|S), where I(S) is the number of bits
where C is the class manually associated to the document, in this case, the author Name
or ID.
        </p>
        <p>For the testing stage, we use the feature vector as follows:</p>
        <p>D = (x1, x2, x3, . . . , xn, C)
| {z }</p>
        <p>Document features
D = (x1, x2, x3, . . . , xn)
| {z }</p>
        <p>Document features
required to describe the substructure S, and I(G|S) is the number of bits required to
describe graph G after it has been compacted by the substructure S.
2.3</p>
      </sec>
      <sec id="sec-2-3">
        <title>Text representation schema</title>
        <p>Let (x1, x2, x3, · · · , xn) be the set of features selected for representing the documents,
combining the lexical-syntactic and the graph-based features. Each document D is
represented considering the feature frequency. Thus, the training stage uses the following
feature vector:</p>
        <p>In this case, there is not a classification attribute (class name) due to the anonymous
source of the document.
3</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Description of the classifier used in the task</title>
      <p>We have used a Support Vector Machine (SVM) classifier for the task. SVM is a
learning method based on the use of a hypothesis space of lineal functions in a higher
dimensional space induced by a kernel, in which the hypotheses are trained by one
algorithm that uses elements of the generalization theory and taken from the optimization
theory.</p>
      <p>The linear learning machines are barely used in major real world applications due to
their computational limitations. Kernel based representations are an alternative for this
problem proyecting the information to a feature space of higher dimensionality which
increases the computational capacity of the linear learning machines. The input space
X is mapped to a new feature space as follows:
x = {x1, x2, . . . , xn} → φ(x) = {φ(x)1, φ(x)2, . . . , φ(x)n}
(3)</p>
      <p>By employing the kernel function, it is not necessary to explicitly calculate the
mapping φ : X → F in order to learn in the feature space.</p>
      <p>In this research work, we employed as kernel the polynomial mapping, which is a
very popular method for modeling non-linear functions:</p>
      <p>K(x, x) = (hx, xi + c)d
(4)
where c ∈ R.</p>
      <p>
        For the experiments carried out in this paper, we used the Weka data mining
platform[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] for executing the implementation of the SVM classifier.
(1)
(2)
      </p>
    </sec>
    <sec id="sec-4">
      <title>Experimental results</title>
      <p>The results obtained with the approach presented is discussed in this section. First, we
describe the dataset used in the experiments and, thereafter, the obtained results.
4.1</p>
      <sec id="sec-4-1">
        <title>Data sets</title>
        <p>For each author, a set of documents of his authorship is given, together with one
document which needs to be verified whether or not it has been written by this author. Thus,
for our system, all the “known” documents (verified authorship) written by different
authors are part of the training set, whereas all the “unknown” documents form the test
data. With the training set we generated a model using the SVM classifier, which was
further used for classifying the test documents. If the classifier assigns the author that
corresponds to each unknown document, the answer is positive, otherwise it is negative.
4.2</p>
      </sec>
      <sec id="sec-4-2">
        <title>Results obtained in the task</title>
        <p>The same methodology is applied to the three different languages, considering only
some language particularities such as the Greek vowels and the PoS taggers. In Table 1
it can be seen the overall results obtained for each one of the teams that have participated
in this edition of the author identification task of PAN 2013. The system proposed by
our team (ayala13) obtained the fifth place from 17 teams. Given that this approach uses
graph mining techniques through the SubDue tool, it can be observed that the runtime
is greater than most of the runs submitted to the competition.
0.500
0.700
0.800
0.633
0.753 65476823
0.718 8362
0.671 9483
0.659 240335
0.659 5577420
0.647 1713966
0.612 32608
0.553 125655
0.600 9461
0.600 7798010
0.576 7008
0.553 406755
0.541 419495
0.529 624366</p>
        <p>In Table 2 we present the results obtained by our approach with each one of the
three datasets considered in the competition. Three different languages were tackle out.
The best performance was obtained with the English language (F1 = 0.733), followed
by an F1 = 0.667 in the corpus of Greek documents. However, we obtained a low
performance in the Spanish corpus (F1 = 0.560), a result we consider obtained because
of the PoS tagger used in the experiments. Further analysis will investigate this issue. It
is worth to notice that we always performed better than the competition baselines.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>We have presented an approach that uses two types of features: lexical-syntactic and
graph-based. Even if the runtime is greater than the most approaches of this
competition, the performance is good. It was surprising that being a Spanish native language
team, we performed better in English and Greek languages, but it is a good
oportunity for analyzing the text into more deep for determining the reason of this issue. As
we mentioned before, we have executed the same methodology across the different
languages, varying basically only the PoS taggers. As future work, we would like to
observe the performance of the proposed methodology using the FreeLing PoS tagger
instead of TreeTagger.</p>
      <p>When the graph-based features were selected, we empirically determined to extract
at most 100 relevant terms using the SubDue graph mining tool. However, more
experiments should be performed to analyze whether or not this number introduces significant
changes in the obtained results.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Olmos</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalez</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Osorio</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Subgraph isomorphism detection using a code based representation</article-title>
          .
          <source>In: FLAIRS Conference</source>
          . pp.
          <fpage>474</fpage>
          -
          <lpage>479</lpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Rissanen</surname>
          </string-name>
          , J.:
          <article-title>Stochastic Complexity in Statistical Inquiry Theory</article-title>
          . World Scientific Publishing Co., Inc.,
          <string-name>
            <surname>River</surname>
            <given-names>Edge</given-names>
          </string-name>
          , NJ, USA (
          <year>1989</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Witten</surname>
            ,
            <given-names>I.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frank</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Data Mining: Practical Machine Learning Tools and Techniques</article-title>
          . Morgan Kaufmann Series in Data Management Sys, Morgan Kaufmann, second edn.
          <source>(June</source>
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>