<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Features groups EN DU GR SP
All-WG</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>An Author Verification Approach Based on Differential Features</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Alberto Bartoli</institution>
          ,
          <addr-line>Alex Dagri, Andrea De Lorenzo, Eric Medvet, and Fabiano Tarlao</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>DIA - University of Trieste</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <volume>0</volume>
      <issue>69</issue>
      <abstract>
        <p>We describe the approach that we submitted to the 2015 PAN competition [7] for the author identification task1. The task consists in determining if an unknown document was authored by the same author of a set of documents with the same author. We propose a machine learning approach based on a number of different features that characterize documents from widely different points of view. We construct non-overlapping groups of homogeneous features, use a random forest regressor for each features group, and combine the output of all regressors by their arithmetic mean. We train a different regressor for each language. Our approach achieved the first position in the final rank for the Spanish language.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>The effectiveness of a method for author identification is assessed using a testing
set of solved problem instances, as follows. The answers generated by the method for
the problem instances in the testing set are compared against the actual values and the
comparison outcome is expressed in terms of two indexes: area under the ROC curve
(AUC) and c@1. AUC is computed basing on the ROC curve plotted by comparing
the generated answers against a threshold moving between 0 and 1, hence obtaining a
binary classification task. The latter index is computed as c@1 = nnc + nnun2c , where n is
the size of the testing set, nu is the number of unanswered problem instances (i.e., those
for which the generated answer was exactly 0:5), nc is the number of correct answers
(i.e., those for which the generated answer &gt; 0:5 and the actual answer is 1 and those
for which the generated answer &lt; 0:5 and the actual answer is 0).
2</p>
    </sec>
    <sec id="sec-2">
      <title>Our approach</title>
      <p>We propose a machine learning approach based on a number of different features that
characterize documents from widely different points of view: character, word,
part-ofspeech, sentence length, punctuation. We construct non-overlapping groups of
homogeneous features and use a random forest regressor for each features group. The output
of the resulting ensemble of regressors is the arithmetic mean of the output generated
by each random forest.</p>
      <p>We train a different regressor for each language. Based on extensive experimention
on the training set, we decided to use the same features for problem instances in Dutch,
Greek, Spanish but a different set of features for problem instances in English.
2.1</p>
      <sec id="sec-2-1">
        <title>Features</title>
        <p>We extract a number of different features from each document. For ease of presentation,
we group homogeneous features together, as described below.</p>
        <p>Word ngrams (WG) We convert all characters to lowercase and then we transform the
document to a sequence of words. We consider white spaces, punctuation characters
and digits as word separators. We count all word ngrams, with n 3, and we obtain
a feature for each different word ngram which occurs in the training set documents
of a given language.</p>
        <p>Character ngrams (CG) We replace punctuation characters and digits with blank spaces
and then sequences of blank spaces with a single blank space. We count all
character ngrams, with n 3, and we obtain a feature for each different character ngram
which occurs in the training set documents of a given language.</p>
      </sec>
      <sec id="sec-2-2">
        <title>POS (part-of-speech) tag ngrams (PG) We apply a part of speech (POS) tagger on</title>
        <p>each document, which assigns words with similar syntactic properties to the same
POS tag. For English and Dutch we use the Apache OpenNLP Tools2, for Greek
we use the tagger developed by the Department of Informatics at Athens University</p>
        <sec id="sec-2-2-1">
          <title>2 http://opennlp.apache.org</title>
          <p>
            of Economics and Business3 while for Spanish we use TreeTagger4 [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ]. We count
all POS ngrams, with n 3, and we obtain a feature for each different POS ngram
which occurs in the training set documents of a given language.
          </p>
          <p>Word lengths (WL) We convert all characters to lowercase and then we transform the
document to a sequence of words. We consider white spaces, punctuation
characters and digits as word separators. We count the number of words whose length in
characters is n, with n 2 f1; : : : ; 16g: we obtain a feature for each value of n.
Sentence lengths (SL) We transform the document to a sequence of tokens, a token
being a sequence of characters separated by one or more blank spaces. Next, we
transform the sequence of tokens to a sequence of sentences, a sentence being a
sequence of tokens separated by any of the following characters: .,;,:,!,?. We
count the number of sentences whose length in tokens is n, with n 2 f1; : : : ; 40g:
we obtain a feature for each value of n.</p>
          <p>Sentence length ngrams (SG) We transform each document to a sequence of labels,
where each label represents a full sentence and is chosen based on the sentence
length (as described in the following). Next, we compute the ngrams of the
resulting labels, with n 2. In detail, we execute a preliminary analysis of all
documents of a given language in the training set, as follows. For each document, we
transform the document to a sequence of sentences as illustrated for the SL features
group. Next, we compute the distribution of sentence length across all sentences in
the training set and determine the length values associated with the following
percentile values: 10%, 25%, 75%, and 90%. In other words, we divide the range of
sentence lengths observed in the training set in 5 intervals, with boundaries between
intervals determined by the specified percentiles. The label we assign to each
sentence corresponds to one of the 5 lenght intervals, i.e., ]0%; 10%], ]10%; 25%], and
so on: we obtain a feature for each label ngrams which occurs in the training set
documents of a given language.</p>
          <p>Word richness (WR) We transform the document to a sequence of words as for the
WG features group. Then we compute the ratio between the number of distinct
words and the number of total words in the document—this features group contains
only one feature.</p>
          <p>Punctuation ngrams (MG) We transform the document by removing all characters
not included in the following set: f,; .; ;; :; !; ?; "g—the resulting document thus
consists of a (possibly empty) sequence of characters in that set. We then count all
character ngrams of the resulting document, with n 3, and we obtain a feature
for each different punctuation ngram which occurs in the training set documents of
a given language.</p>
          <p>Text shape ngrams (TG) We transform the document as follows: sequences of digits
are replaced by the single character n; sequences of alphabetic characters are
replaced by a single character: l if all the characters in the sequence are lowercase,
u if only the first character is uppercase, w if at least two characters are uppercase;
sequences of blank spaces are replaced by a single blank space; other characters are
left unchanged. We then count all character ngrams of the resulting document, with</p>
        </sec>
        <sec id="sec-2-2-2">
          <title>3 http://nlp.cs.aueb.gr/software.html</title>
        </sec>
        <sec id="sec-2-2-3">
          <title>4 http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger</title>
          <p>n 3, and we obtain a feature for each different character ngram which occurs in
the training set documents of a given language.
2.2</p>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>Feature selection, normalization, and aggregation</title>
        <p>We perform a simple feature selection for features in groups WG, CG, PG, and TG. To
this end, we apply the following procedure to each of the 4 partitions of the training set
for which the language of the documents was the same—in other words, we select
different features for each language. We compute the feature values for all the documents
in the training set partition. Next, among each group, we sort the features according
to their average values on the documents of the partition—greater values coming first.
Finally, for each group, we keep the nsel top features. We set nsel = 500 for WG, CG
and PG and nsel = 100 for TG.</p>
        <p>After the feature selection, we perform a normalization of the features values, as
follows. Let fi(d) be the value of the ith feature for the document d and let G be
the group of features (as defined in Section 2.1) which includes the feature fi, we set
fi(d) := Pfjf2iG(df)j(d) . We execute this procedure for all the groups of features, except
for WR, which consists of a single feature.</p>
        <p>Finally, for the purpose of obtaining a single feature vector for each problem
instance, rather than one feature vector for each document, we build a new feature fi0
whose value is obtained from the values of the corresponding fi for the documents in
K and the document u, as follows:
(1)
(2)
fi0(hK; u; Li) = abs fi(u)</p>
        <p>Pk2K fi(k)
jKj
In other words, we consider the absolute difference between the feature value for the
unknown document u and the average of the feature values for the known documents in
K. We also consider a variant of our approach in which the difference is divided by the
feature value for u:
fi00(hK; u; Li) =
fi0(hK; u; Li)
fi(u)
2.3</p>
      </sec>
      <sec id="sec-2-4">
        <title>Regressor</title>
        <p>
          We explored three different regressor algorithms: trees (Tree), random forests (RF), and
support vector machines (SVM). In particular, we use the algorithm proposed in [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] for
Tree, we use the gaussian kernel and C = 1 for SVM [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], and we use the algorithm for
regression proposed in [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] with ntree = 500 for RF.
        </p>
        <p>We apply each regressor, both in training and actual regression phase, only to the
feature values of the same group. For obtaining an answer in [0; 1] for a problem
instance, we average the predictions obtained by the trained regressors on the features
groups. In other words, we built an ensemble of group regressors.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Analysis</title>
      <p>As described in the previous section, we considered two set of features (f 0 and f 00) and
3 regressors. We systematically assessed the effectiveness of all the 6 resulting
combinations by means of a leave-one-out procedure applied on the training set, separately
for each language. That is, for each language, type of feature, and regressor, (i) we built
the subset T of the problem instances of the training with that language, (ii) we removed
one element t0 from T , (iii) we computed the feature values for the problem instances
in T and trained the regressor, (iv) we applied the trained regressor to the problem
instance t0 and compared the generated answer against the known one. We repeated all
but first steps jT j = 100 times, i.e., by removing each time a different element, and
computed the performance of the method in terms of the indexes defined in Section 1:
c@1 and AUC.</p>
      <p>
        The results are in Table 1: the table shows c@1 and AUC values for each method,
the method name being composed by the regressor acronym and one among abs or rel
indicating the use of f 0 or f 00 features, respectively. It can be seen that RF provides
in general better results than the other regressors; moreover, RF-abs appears to be the
best performing method. In order to further validate the latter finding, we performed a
Wilcoxon signed-rank test [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] with a significance level of 5% and Bonferroni
correction [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]: the outcome is that RF-abs is significantly better than all the other methods,
except Tree-rel, for a little gap, and RF-rel; RF-rel is not better than the other methods
except SVM-rel; Tree-rel is not better than all the other methods.
      </p>
      <p>In order to gain insights about which features group appeared to be more suitable
for accomplishing the considered task, we applied the RF-abs method (with the
leaveone-out procedure described above) 9 times, each time removing one of the 9 features
groups—i.e., we performed a features group ablation analysis. The results (in terms of
c@1) are reported in Table 2. It can be seen, by comparing results of method RF-abs
with those of Table 1, that the largest decrease of c@1 occurs by removing features
group MG, while the smallest one occurs by removing features group WR—on the
average around 3% and 1%, respectively. It can also be observed that feature ablation
may actually lead to some improvements: for English, we obtain 0:69, rather than 0:67,
by removing WG; for Spanish, we obtain 0:96, rather than 0:94, by removing either
WG or WR.</p>
      <p>Then, we analyzed the performance of RF-abs in terms of feature addition. We
considered RF-abs using only features group MG (which showed to be the most
relevant, according to the ablation analysis) and then using only MG and each of the 8
other features groups in isolation. The results are reported in Table 3. It can be seen
that, for English, there are combinations that improve c@1 with respect to the
baseline value 0:67: MG+CG, MG+SL, and MG+SG reach 0:73. Since such improvement
is not negligible, we inspected the mutual effect of these features groups more closely
by analyzing the c@1 values resulting from all their combinations. The results are:
0:78 with MG+CG+SL, 0:65 with CG+SG+SL, 0:71 with MG+SL+SG, and 0:73 with
MG+CG+SL+SG. Based on these results, which improved the 0:67 baseline (all feature
groups), we chose to use RF-abs with only 3 features groups (MG+CG+SL), only for
the English language. On the other hand, we did not notice significant improvements
for specific sets of features groups for the other languages: hence, for Dutch, Greek,
and Spanish, we chose to use RF-abs with all the features groups.</p>
      <p>We observed that the results for the Spanish language tend to be much better than for
the other languages. We believe that such good results depend more on the peculiarity of
this dataset rather than to the quality of our method: indeed the training set for Spanish
contained 100 problem instances with 5 documents each, but the number of distinct
documents, though, was only 42.
3.1</p>
      <sec id="sec-3-1">
        <title>Final results</title>
        <sec id="sec-3-1-1">
          <title>5 http://www.tira.io/task/authorship-verification/</title>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bauer</surname>
            ,
            <given-names>D.F.</given-names>
          </string-name>
          :
          <article-title>Constructing confidence sets using rank statistics</article-title>
          .
          <source>Journal of the American Statistical Association</source>
          <volume>67</volume>
          (
          <issue>339</issue>
          ),
          <fpage>687</fpage>
          -
          <lpage>690</lpage>
          (
          <year>1972</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Benjamini</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hochberg</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Controlling the false discovery rate: a practical and powerful approach to multiple testing</article-title>
          .
          <source>Journal of the Royal Statistical Society</source>
          . Series B (Methodological) pp.
          <fpage>289</fpage>
          -
          <lpage>300</lpage>
          (
          <year>1995</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Breiman</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Random forests</article-title>
          .
          <source>Machine learning 45(1)</source>
          ,
          <fpage>5</fpage>
          -
          <lpage>32</lpage>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <issue>4</issue>
          .
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>C.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>C.J.:</given-names>
          </string-name>
          <article-title>Libsvm: a library for support vector machines</article-title>
          .
          <source>ACM Transactions on Intelligent Systems and Technology (TIST) 2</source>
          (
          <issue>3</issue>
          ),
          <volume>27</volume>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Loh</surname>
          </string-name>
          , W.Y.:
          <article-title>Classification and regression trees</article-title>
          .
          <source>Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery</source>
          <volume>1</volume>
          (
          <issue>1</issue>
          ),
          <fpage>14</fpage>
          -
          <lpage>23</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Schmid</surname>
          </string-name>
          , H.:
          <article-title>Probabilistic part-of-speech tagging using decision trees</article-title>
          .
          <source>In: Proceedings of the international conference on new methods in language processing</source>
          . vol.
          <volume>12</volume>
          , pp.
          <fpage>44</fpage>
          -
          <lpage>49</lpage>
          .
          <string-name>
            <surname>Citeseer</surname>
          </string-name>
          (
          <year>1994</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daelemans</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verhoeven</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Juola</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lopez</surname>
            <given-names>Lopez</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <surname>B.</surname>
          </string-name>
          :
          <article-title>Overview of the Author Identification Task at PAN 2015</article-title>
          .
          <article-title>In: Working Notes Papers of the CLEF 2015 Evaluation Labs</article-title>
          .
          <source>CEUR Workshop Proceedings, CLEF and CEUR-WS.org (Sep</source>
          <year>2015</year>
          ), http://www.clef-initiative.eu/publication/working-notes
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>