<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Slightly-modified GI-based Author-verifier with Lots of Features (ASGALF)</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Khalifa University</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <fpage>977</fpage>
      <lpage>983</lpage>
      <abstract>
        <p>This paper presents the performance evaluation of an authorship verification technique that is based on a modified version of the General Impostors (GI) framework [2]. The novelties of this implementation are (1) a modified way of combining the min-max similarity scores and (2) a relatively large set of diverse features that spans letter-level, word-level, function word-level, word shape-level, and part-of-speech tag-level features. The technique ranked highly overall in the author identification task of PAN 2014.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
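      <p>
        <disp-formula id="eq2"><label>(2)</label><tex-math>\begin{cases} y \in [0, 0.5) &amp; \text{if } h \text{ predicts that the authors of } f_u \text{ and } P_i \setminus \{f_u\} \text{ are different} \\ y = 0.5 &amp; \text{if } h \text{ does not predict} \\ y \in (0.5, 1] &amp; \text{if } h \text{ predicts that the authors of } f_u \text{ and } P_i \setminus \{f_u\} \text{ are the same} \end{cases}</tex-math></disp-formula>
      </p>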
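      <p>
        <disp-formula id="eq3"><label>(3)</label><tex-math>\text{C@1} = \frac{N_c + \frac{N_c}{|P|} N_u}{|P|}</tex-math></disp-formula>
        where <inline-formula><tex-math>|P|</tex-math></inline-formula> is the total number of problems, <inline-formula><tex-math>N_c</tex-math></inline-formula> is the total number of problems that the model <inline-formula><tex-math>h</tex-math></inline-formula> has solved correctly, and <inline-formula><tex-math>N_u</tex-math></inline-formula> is the total number of problems that the model <inline-formula><tex-math>h</tex-math></inline-formula> has decided not to solve.
      </p>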
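      <p>As a rough illustration of the C@1 measure, the following Python sketch computes it either from the three counts or from a list of scores. The function names, and the convention that a score of exactly 0.5 counts as "not solved" (taken from the score semantics in (2)), are assumptions of this sketch and not the evaluation code used by PAN.</p>
      <preformatted>
```python
def c_at_1(n_correct: int, n_unanswered: int, n_problems: int) -> float:
    """C@1 = (N_c + N_u * N_c / |P|) / |P|."""
    if not n_problems > 0:
        raise ValueError("n_problems must be positive")
    return (n_correct + n_unanswered * n_correct / n_problems) / n_problems


def c_at_1_from_scores(scores, truths) -> float:
    """Compute C@1 from scores in [0, 1] and ground truth (True = same author).

    A score of exactly 0.5 is treated as an unanswered problem.
    """
    n_unanswered = sum(1 for y in scores if y == 0.5)
    n_correct = sum(
        1 for y, same in zip(scores, truths)
        if y != 0.5 and (y > 0.5) == same
    )
    return c_at_1(n_correct, n_unanswered, len(scores))
```
      </preformatted>
      <p>For example, with 10 problems of which 6 are solved correctly and 2 are left unanswered, C@1 is (6 + 2 · 6/10)/10 = 0.72, whereas plain accuracy would be 0.6. Unanswered problems are thus credited at the rate of the model's accuracy on the whole collection.</p>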
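      <p>Additionally, in order to improve the quality of the models <inline-formula><tex-math>h \in H</tex-math></inline-formula> that are found, the organizers of PAN also provided us with another problem collection <inline-formula><tex-math>L</tex-math></inline-formula>, whose structure is identical to that of the collection <inline-formula><tex-math>P</tex-math></inline-formula>. <inline-formula><tex-math>L</tex-math></inline-formula> is intended to be used as a training set for finding better models <inline-formula><tex-math>h \in H</tex-math></inline-formula>, and thus PAN also provided its ground truth in the form of a function <inline-formula><tex-math>G : L \to \{\text{same authors}, \text{different authors}\}</tex-math></inline-formula>.</p>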
    </sec>
    <sec id="sec-2">
      <title>Method description</title>
      <p>Our classifier, A Slightly-modified GI-based Author-verifier with Lots of Features (ASGALF), is, as the name implies, based on a modified version of the General Impostors (GI) framework [<xref ref-type="bibr" rid="ref2">2</xref>]. It differs from other GI implementations in a couple of ways. We first discuss why we chose GI as the starting point for ASGALF, and then discuss the differences between the GI implementation in [<xref ref-type="bibr" rid="ref3">3</xref>] and ASGALF.</p>
      <p>2.1 Reasons for adopting the GI framework</p>
      <p>Based on our preliminary tests, previous PAN results, and our own intuition, we are convinced that the set of impostors (or distractors) contains information that helps to answer the question of how similar two vectors are, and that taking this information into account in the decision-making process solves (1) better than ignoring it.</p>
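      <p>For example, consider two feature vectors <inline-formula><tex-math>v_1 = k(f_1)</tex-math></inline-formula> and <inline-formula><tex-math>v_2 = k(f_2)</tex-math></inline-formula>, where <inline-formula><tex-math>k : P_i \to \mathbb{R}^d</tex-math></inline-formula> is a function that extracts the features from any input text file <inline-formula><tex-math>f_j \in P_i</tex-math></inline-formula>, for any <inline-formula><tex-math>j \in J</tex-math></inline-formula>, and returns the feature vector <inline-formula><tex-math>v_j \in \mathbb{R}^d</tex-math></inline-formula> that corresponds to <inline-formula><tex-math>f_j</tex-math></inline-formula>.</p>
      <p>As far as we know, simply knowing the value of <inline-formula><tex-math>q(v_1, v_2)</tex-math></inline-formula> is not enough in practice, where <inline-formula><tex-math>q : \mathbb{R}^d \times \mathbb{R}^d \to Y</tex-math></inline-formula> is a function that returns the similarity of its input vectors.</p>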
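      <p>Our justification is that the output of every <inline-formula><tex-math>q</tex-math></inline-formula> function known to date depends to some extent on the topic or context of the evaluated vectors. In some topics or contexts authors tend to be similar in general, while in others they tend to be dissimilar.</p>
      <p>For instance, in some contexts <inline-formula><tex-math>q(v_1, v_2) = y</tex-math></inline-formula> indicates that <inline-formula><tex-math>v_1</tex-math></inline-formula> and <inline-formula><tex-math>v_2</tex-math></inline-formula> represent documents authored by the same individual only if <inline-formula><tex-math>y</tex-math></inline-formula> is greater than, say, 0.9. This can be the case if the topic or context is a restrictive one that causes authors to write very similarly to each other (e.g. reports). On the other hand, another topic or context may allow high variability among authors, so that <inline-formula><tex-math>v_1</tex-math></inline-formula> and <inline-formula><tex-math>v_2</tex-math></inline-formula> can be assumed to represent documents authored by the same individual even if <inline-formula><tex-math>q &lt; 0.9</tex-math></inline-formula>; a value such as <inline-formula><tex-math>q = 0.4</tex-math></inline-formula> may still indicate that the authors are the same.</p>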
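      <p>Thus fetching a set of impostor text files <inline-formula><tex-math>M = \{m_w : \forall w \in W\}</tex-math></inline-formula>, where <inline-formula><tex-math>W = \{1, 2, \ldots, n\}</tex-math></inline-formula> is the index set of <inline-formula><tex-math>M</tex-math></inline-formula>, is in our view a solid step towards a better optimization of problem (1).</p>
      <p>We believe that measuring the distance to the impostor vectors <inline-formula><tex-math>\{x_w = k(m_w) : \forall w \in W\}</tex-math></inline-formula> allows the model to see how close <inline-formula><tex-math>v_1</tex-math></inline-formula> and <inline-formula><tex-math>v_2</tex-math></inline-formula> are to each other relative to other impostor vectors in the same topic.</p>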
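      <p>The idea of judging a similarity value relative to impostors can be sketched in Python as follows. The sketch assumes that q is the min-max similarity (sum of element-wise minima divided by sum of element-wise maxima, for non-negative feature vectors); the helper name rank_among_impostors and the handling of all-zero vectors are illustrative assumptions and are not taken from the ASGALF code.</p>
      <preformatted>
```python
def min_max(a, b):
    """Min-max similarity of two non-negative vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    denom = sum(max(x, y) for x, y in zip(a, b))
    if denom == 0:
        # Assumption of this sketch: two all-zero vectors count as identical.
        return 1.0
    return sum(min(x, y) for x, y in zip(a, b)) / denom


def rank_among_impostors(v1, v2, impostors):
    """Fraction of impostor vectors that are less similar to v1 than v2 is."""
    if not impostors:
        raise ValueError("at least one impostor vector is required")
    target = min_max(v1, v2)
    beaten = sum(1 for x in impostors if target > min_max(v1, x))
    return beaten / len(impostors)
```
      </preformatted>
      <p>The same raw similarity can therefore be read differently depending on the topic: a moderate min-max value is strong evidence when it beats nearly all impostors, and weak evidence when most impostors are at least as close.</p>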
      <p>Additionally, the GI framework is based on ensembling randomized models. Although it might not be obvious at first glance, we believe that GI essentially creates a set of randomized models, each built by choosing a different subset of features and a different subset of impostors. Relying on ensembles of models is another strength of the GI framework; ensembles are appreciated by the machine learning community and have been successful in competitions such as the Netflix Prize<fn id="fn1"><label>1</label><p>http://en.wikipedia.org/wiki/Netflix_Prize</p></fn>, where most of the top techniques were composed of some form of ensemble.</p>
      <p>2.2 Differences between ASGALF and [<xref ref-type="bibr" rid="ref3">3</xref>]</p>
      <p>– Instead of using the original Impostors score update (4), we adopted the modified one presented in (5). The advantage of (5) is that it allows us to measure how similar the input vectors are, as opposed to whether they are similar enough. On the other hand, a possible disadvantage is overfitting to the collection of training problems <inline-formula><tex-math>L</tex-math></inline-formula>.</p>
      <p>
        <disp-formula id="eq4"><label>(4)</label><tex-math>\text{score} \leftarrow \text{score} + 1 \quad \text{if } \text{min-max}(v_1, v_2)^2 &gt; \text{min-max}(v_1, x_1) \cdot \text{min-max}(v_2, x_2)</tex-math></disp-formula>
        <disp-formula id="eq5"><label>(5)</label><tex-math>\text{score} \leftarrow \text{score} + \frac{\text{min-max}(v_1, v_2)^2}{\text{min-max}(v_1, x_1) \cdot \text{min-max}(v_2, x_2)}</tex-math></disp-formula>
        where <inline-formula><tex-math>x_1</tex-math></inline-formula> is the most similar impostor to <inline-formula><tex-math>v_1</tex-math></inline-formula> and <inline-formula><tex-math>x_2</tex-math></inline-formula> is the most similar impostor to <inline-formula><tex-math>v_2</tex-math></inline-formula>.
      </p>
      <p>– Using a large set of diverse features. We extracted letter-level, word-level, function word-level, word shape-level, and part-of-speech tag-level features as follows:</p>
      <p>N-grams with various combinations of n values and gram types. The n values are <inline-formula><tex-math>\{1, 2, \ldots, 10\}</tex-math></inline-formula> and the gram types are {letters, words, function words, word shapes<fn id="fn2"><label>2</label><p>The word shapes are based on three properties: character case (lower or upper case), character type (letter or number), and word length. For example, the word “School” is represented as the gram “Cccccc”, “2014” is represented as the gram “NNNN”, and “x86” is represented as the gram “cNN”.</p></fn>, POS tags<sup>3</sup>, POS-words<sup>4</sup>}. The resulting features were based on all combinations of n values and gram types. This produced a large number of features, most of which were too infrequent to be reliable, so we only considered features that occurred at least 5 times in any single document. Even so, the number of features generally remained large after the infrequent ones were removed. The total number of extracted features after removing the infrequent ones varied with the language-genre combination, as shown in Table 1.</p>
      <p>Body richness: the total number of unique words in a given text file <inline-formula><tex-math>f_j</tex-math></inline-formula>, for any <inline-formula><tex-math>j \in J</tex-math></inline-formula>, normalized by the total number of words in the same file <inline-formula><tex-math>f_j</tex-math></inline-formula>.</p>
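      <p>A minimal Python sketch of three of the feature-extraction steps described above: word shapes, n-gram counting with the "at least 5 times in any single document" filter, and body richness. All function names are hypothetical, tokens are assumed to be produced by some tokenizer that is not shown, and the treatment of characters that are neither letters nor digits is an assumption, since the text does not specify it.</p>
      <preformatted>
```python
from collections import Counter


def word_shape(word):
    """Map a word to its shape: 'C' upper-case letter, 'c' other letter, 'N' digit."""
    shape = []
    for ch in word:
        if ch.isdigit():
            shape.append("N")
        elif ch.isalpha():
            shape.append("C" if ch.isupper() else "c")
        else:
            shape.append(ch)  # assumption: any other character is kept as-is
    return "".join(shape)


def ngram_counts(tokens, n):
    """Count contiguous n-grams over a sequence of grams (letters, words, tags, ...)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def frequent_features(per_document_counts, min_count=5):
    """Keep features that occur at least min_count times in any single document."""
    return {
        gram
        for counts in per_document_counts
        for gram, count in counts.items()
        if count >= min_count
    }


def body_richness(tokens):
    """Number of unique words divided by the total number of words."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0
```
      </preformatted>
      <p>Word-shape n-grams can then be obtained by mapping every token through word_shape before calling ngram_counts, and the same counting function can be reused for the other gram types.</p>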
      <p>Other details of ASGALF that are not mentioned in this notebook are the same as suggested in [<xref ref-type="bibr" rid="ref3">3</xref>]. For example, as in [<xref ref-type="bibr" rid="ref3">3</xref>], we fetch the set of impostors of a given document from a search engine: we submit search queries made of 5 randomly chosen words from the subject document, download the 10 highest-ranked HTML pages, strip all HTML markup from them, and use their first 1500 words. In ASGALF, we fetched the first 1500 words of 10 impostors for each document in the training set, and then grouped them by language-genre combination. The output of this process is one set of impostors for each language-genre combination.</p>
    </sec>
    <sec id="sec-3">
      <title>Parameter tuning</title>
      <p>The parameters were tuned based on our preliminary tests against the problems in <inline-formula><tex-math>L</tex-math></inline-formula>, as follows:</p>
      <p>– Score correction offsets: -0.4585, -0.62950, -0.43850, -0.24850, -0.478, and 0.56600 for English essays, English novels, Dutch essays, Dutch reviews, Spanish articles, and Greek articles, respectively. These offsets allowed the final score to be centered around 0.5 in order to satisfy the semantics in (2).</p>
      <p>– Total number of Impostors rounds: 50, which also matches the optimal value for the Spanish set in [<xref ref-type="bibr" rid="ref1">1</xref>].</p>
      <p>– Total number of impostor documents: 8,614, 1,257, 2,728, 2,073, 5,347, and 3,104 for English essays, English novels, Dutch essays, Dutch reviews, Spanish articles, and Greek articles, respectively.</p>
      <p>– Total number of randomly chosen impostors per round: 20.</p>
      <p>– Total number of randomly chosen features per round: 40% of the total features, which also matches the optimal value for the English and Spanish sets in [<xref ref-type="bibr" rid="ref3">3</xref>].</p>
      <p>
        <fn id="fn3"><label>3</label><p>Part-of-speech tags such as NN, NNS, NNP, VB, VBD, VBG, etc. For example, if the word “school” occurred in a text as a noun, it would be represented as the gram “NN”. A comprehensive list of such tags can be found at https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html.</p></fn>
        <fn id="fn4"><label>4</label><p>Combinations of words and their respective POS tags. For example, if the word “saw” was a noun, it would be represented as the gram “saw-NN”, and if it was a verb, it would be represented as the gram “saw-VBD”.</p></fn>
      </p>
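      <p>To show how these parameters fit together, here is a hedged Python sketch of the randomized rounds with the modified score update (5). The defaults (50 rounds, 20 impostors per round, 40% of the features) come from the list above; everything else is an assumption of the sketch, in particular the function names, averaging the accumulated score over the rounds before adding the offset, clipping the result to [0, 1], and skipping a round whose denominator is zero. It should not be read as the actual ASGALF implementation.</p>
      <preformatted>
```python
import random


def min_max(a, b):
    """Min-max similarity: sum of element-wise minima over sum of maxima."""
    denom = sum(max(x, y) for x, y in zip(a, b))
    return sum(min(x, y) for x, y in zip(a, b)) / denom if denom else 1.0


def project(v, feature_ids):
    """Restrict a vector to the sampled feature indices."""
    return [v[i] for i in feature_ids]


def asgalf_score(v1, v2, impostors, offset=0.0, rounds=50,
                 impostors_per_round=20, feature_fraction=0.4, seed=0):
    """Accumulate the modified update (5) over randomized rounds."""
    if not impostors:
        raise ValueError("at least one impostor vector is required")
    rng = random.Random(seed)
    n_features = max(1, round(feature_fraction * len(v1)))
    n_impostors = min(impostors_per_round, len(impostors))
    total = 0.0
    for _ in range(rounds):
        # Each round uses its own random subset of features and impostors.
        feature_ids = rng.sample(range(len(v1)), n_features)
        pool = [project(x, feature_ids) for x in rng.sample(impostors, n_impostors)]
        a, b = project(v1, feature_ids), project(v2, feature_ids)
        # Similarity of each document to its most similar impostor in this round.
        best1 = max(min_max(a, x) for x in pool)
        best2 = max(min_max(b, x) for x in pool)
        if best1 * best2 > 0:
            total += min_max(a, b) ** 2 / (best1 * best2)
    # Assumption: average over rounds, add the offset, then clip to [0, 1].
    return min(1.0, max(0.0, total / rounds + offset))
```
      </preformatted>
      <p>Because the per-round increment in (5) is a ratio rather than a 0/1 vote, the accumulated score is not naturally centered at 0.5, which is one way to read the need for a per-corpus correction offset.</p>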
    </sec>
    <sec id="sec-4">
      <title>Evaluation results</title>
      <p>Evaluation results are presented in Table 2. We believe that the score of the model can improve if the following limitations are addressed:</p>
      <p>– Our classifier is implemented such that it always attempts to predict an answer, i.e. it never outputs a score <inline-formula><tex-math>y = 0.5</tex-math></inline-formula>, which limits its ability to take advantage of the C@1 metric.</p>
      <p>– The parameters were tuned in a preliminary testing phase against the training datasets. The outcome was that all language-genre combinations had similar optimal parameter values. However, a more rigorous optimization process could have revealed better language-genre-specific parameter values.</p>
      <p>– At the core of Impostors as described in [<xref ref-type="bibr" rid="ref3">3</xref>] is the min-max similarity measure, which is also the measure that we used as the implementation of the <inline-formula><tex-math>q</tex-math></inline-formula> function. It is possible that a more sophisticated model could have made better use of our diverse set of features.</p>
      <p>– No feature subset selection was performed other than removing features that did not occur frequently enough, according to the simple criterion described in the previous sections. Using sounder feature subset selection methods (e.g. information gain (IG), wrapper methods, etc.) could reduce CPU time requirements as well as improve accuracy.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>This paper describes an authorship verification classifier that is based on the General Impostors (GI) framework, except that it uses a modified method of combining the scores as well as a diverse set of features. The features were of various types, namely letter-level, word-level, function word-level, word shape-level, and part-of-speech tag-level.</p>
      <p>The evaluation on the testing set showed high classification accuracy in general. This also confirms that impostor/distractor-based methodologies are indeed a step forward. Although the selection of the set of impostors/distractors is a known limitation, it seems that in practice it is not a major issue and that such a set of impostors can be obtained with relative ease (e.g. by using search engines as in [<xref ref-type="bibr" rid="ref3">3</xref>]).</p>
      <p>On the downside, our technique generally had a slow runtime. However, we believe that this issue can be reduced in a number of ways, such as:
– Our code made excessive use of automatic execution of external commands via the shell. Some of these external commands were themselves very slow, such as the software used to extract the part-of-speech tags. If such an external dependency is replaced by a faster one (or removed completely), the speed of the feature extraction phase can improve dramatically.
– We believe that techniques based on the GI framework can be easily distributed should multiple cores or machines be available. This is because all GI rounds are independent of each other, which allows the rounds to be distributed across multiple cores or even different machines, ultimately leading to a much reduced runtime.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgement</title>
      <p>Thanks to Shachar Seidman for answering our questions about General Impostors (GI).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Juola</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Overview of the Author Identification Task at PAN 2013</article-title>
          . In: Conference and Labs of the Evaluation Forum (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Koppel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Seidman</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          : Automatically Identifying Pseudepigraphic Texts. In: EMNLP. pp.
          <fpage>1449</fpage>
          -
          <lpage>1454</lpage>
          . ACL (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Seidman</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Authorship Verification Using the Impostors Method</article-title>
          . In: Conference and Labs of the Evaluation Forum (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>