=Paper=
{{Paper
|id=Vol-1391/129-CR
|storemode=property
|title=Authorship Verification by combining SVMs with Kernels Optimized for Different Feature Categories
|pdfUrl=https://ceur-ws.org/Vol-1391/129-CR.pdf
|volume=Vol-1391
|dblpUrl=https://dblp.org/rec/conf/clef/SolorzanoMPLMS15
}}
==Authorship Verification by combining SVMs with Kernels Optimized for Different Feature Categories==
<pdf width="1500px">https://ceur-ws.org/Vol-1391/129-CR.pdf</pdf>
<pre>
         Authorship Verification by combining SVMs with
         kernels optimized for different feature categories
                           Notebook for PAN at CLEF 2015

            Julián Solórzano1 , Víctor Mijangos1 , Alejandro Pimentel1 , Fernanda
                 López-Escobedo2 , Azucena Montes1 , and Gerardo Sierra1
     1
       Grupo de Ingeniería Lingüística, Instituto de Ingeniería, UNAM, Mexico City, Mexico
    {jsolorzanos, vmijangosc, amontesr, gsierram}@iingen.unam.mx,
                                   pi.p15@hotmail.com
     2
       Licenciatura de Ciencia Forense, Facultad de Medicina, UNAM, Mexico City, Mexico
                                     flopeze@unam.mx


          Abstract We present our approach to the PAN-2015 authorship verification task.
          We combine one-class SVM classifiers under the hypothesis that different cate-
          gories of features a) are better suited for different authors and b) have different
          underlying topologies. Thus, we have each classifier operate in a different feature
          subset with a different kernel function, and the output is used to train a logistic
          regression model which assigns a different weight to each category of features.
          Results show that further improvement of the method is needed, and we discuss
          its shortcomings.

          Keywords: authorship verification, one-class SVM, kernel selection, logistic re-
          gression


1        Introduction
This paper presents our approach to the Author Verification task in PAN at CLEF 2015.
Author Verification is one several of authorship analysis tasks, in which it must be
determined whether a given text was written or not by a certain author [8].
    In the present task, a single problem consists of a set of documents, one of which is
labeled as unknown and the rest are labeled as known. There can be a total of up to 6
documents in a single problem. The task consists in determining whether the document
labeled as unknown is written by the same author as the rest of the documents. There
are four different sets of problems, one for each of the following languages: Spanish,
English, Greek and Dutch.


2        Methodology
The main idea behind the methodology we present is to train classifiers that are to
become experts at analyzing features from a specific category of features, and then
combine the knowledge of all these classifiers. According to [1], this is one of the two
approaches to multi feature-set techniques that have been used in stylistic analysis. We
hypothesize that an ensemble classifier will automatically assign more weight to each
author’s distinctive feature subsets without having to try various feature combinations.
    Finally, there is also the observation that not all feature spaces are necessarily equal
and that different feature subspaces may have different underlying topologies. Learning
the best distance metric leads to improvements in distance-based classification schemes
[9]. In this work we experiment with different kernel functions instead of distance func-
tions.

2.1    Document Representation
Tagging Documents are tokenized and the Part of Speech (POS) tag of each token is
obtained. Table 1 indicates the softwares we used to do this processing according to
each language.


                                     Table 1. Taggers

                               Language      Software
                               Spanish      Freeling [5]
                               English       Freeling
                               Dutch     TreeTagger [6][7]
                               Greek    Greek POS Tagger1


Style Features For each document we obtain features in the following categories:

 – Punctuation marks - tokens recognized as punctuation marks by the correspond-
   ing tagger
 – Multi word terms - tokens made up of more than one word (as determined by the
   tagger, in this case only obtained for Spanish and English)
 – Lexical features - Class intervals of word and sentence lengths
 – Start of sentence word profile - the grammatical category of the word at the be-
   ginning of sentences
 – End of sentence word profile - the grammatical category of the word at the end of
   sentences
 – Function words n-grams - 1-grams, 2-grams and 3-grams
 – Function words skip-grams - 2-grams and 3-grams, with up to 2 gaps
 – POS full tags n-grams - 1-grams, 2-grams and 3-grams
 – POS abbreviated tags n-grams - 1-grams, 2-grams and 3-grams. This only ap-
   plies for languages in which the tagger outputs detailed POS tags in the first place
   (Spanish and English in this case). The abbreviated POS tag contains the grammat-
   ical category without further details such as number, gender, tense etc.
 – Character ngrams 2-grams and 3-grams
 1
     http://php-nlp-tools.com/blog/category/greek-pos-tagger/
    All these features are used to create the vector representation of the document, i.e.
a vector whose elements are the relative frequencies of each feature in that document
(the frequency is relative to the total number of features in the category). A maximum
of 200 features of each category was taken into account for a single document.


Distance-to-the-average An additional representation for each document is created
as follows. First we compute a vector vavg which contains the average frequency of
all the features among all documents in the dataset. This vector is broken down into n
subvectors such that each one contains the features of one of the n feature categories.
Then for each document we similarly break down its frequency vector into n subvectors
and obtain their distance to the corresponding subvectors of vavg .
     The resulting matrix encodes information about how a document deviates from the
mean in each feature category, similar to Burrow’s Delta [2] (only he used z-scores to
normalize).


2.2   Classification

The classification is done by an ensemble that has its votes combined by means of a
logistic regression model. The ensemble is comprised by n one-class SVM classifiers,
each one using the features from one of the n feature categories, as well as an additional
classifier that works with the distance-to-average representation of the documents. So,
there is a total of n + 1 classifiers.
    We separate each problem matrix into n feature category matrices (plus the distance-
to-average matrix). In these matrices, each document represents a point on a space X.
So, d1 , ..., dn ∈ X, where di , i ∈ {1, ..., n} are row vectors of each feature matrix
representing a document. We assume points in this space are close to each other when
written by the same author. Given a new point in this space, we want to determine if it
is part of the cluster or it lies outside.


Dimensionality Reduction For the case of the feature frequency matrices, their high-
dimensionality makes them difficult to process. We perform dimensionality reduction
by taking their first two eigenvectors, which correspond to the two highest eigenvalues;
i. e. the eigenvectors with highest variance. To create the new matrices we use these
eigenvectors as columns; so our new data set then is in R2 .
     This not only reduced data dimensionality, but also filtered noise. We empirically
noted that the performance of the experiments increased by considering only these
eigenvectors.


Novelty Detection through One-Class SVM To apply novelty detection using a one
class SVM, we first think of a map φ : X → H, where H is a dot product space such
that we can evaluate the dot product in the image of φ by a kernel function:

                               k(xi , xj ) = φ(xi ) · φ(xj )                          (1)
    We need to detect a neighborhood of the data points such that given a new docu-
ment of the same author it lies inside this neighborhood. To do this, the one-class SVM
finds a function f that returns +1 if the point lies into the neighborhood and returns -1
otherwise.
    The value of the function evaluated at a new point y is obtained considering in which
part of the hyperplane it falls on. So, we need to separate the data set from the origin,
solving:

                                     1          1 X
                             minn      ||w||2 + n   ζi − c                            (2)
                        w∈H,ζ∈R ,c∈R 2         v i

   Where v ∈ (0, 1) and w · φ(xi ) ≤ c − ζi , ζi ≤ 0. This way, we can turn it into a
decision function:

                              f (x) = sgn((w · φ(x)) − c)                             (3)

   such that it will be positive for the points in the data set; here the term ||w|| is a
support vector type regularization. Deriving the dual problem with the kernel function
showed in (1), the solution for a the new point y has a support vector expansion:
                                        X
                            f (y) = sgn(  αi k(xi , y) − c)                           (4)
                                           i

    Where xi with αi 6= 0 are support vectors. We calculate the one-class SVM for each
feature matrix, obtaining a total of n outputs or judges. For each feature matrix we use
different kernel functions for equation 4. One of the simplest is the linear kernel:

                                 k(xi , xj ) = xTi xj + c                             (5)

   The gaussian kernel is defined as:

                                                  ||xi − xj ||2
                            k(xi , xj ) = exp(−                 )                     (6)
                                                       2σ 2
   A sigmoid kernel is defined as:

                             k(xi , xj ) = tanh(αxTi xj + c)                          (7)

   Finally, the polynomial kernel:

                              k(xi , xj ) = (αxTi xj + c)m                            (8)

    These different kernels functions correspond to each data distribution and were se-
lected in the training process. We select one kernel for each feature matrix by evaluating
every kernel over the training set and selecting the one with highest evaluation for each
feature matrix.
Logistic Regression The final classification is done by means of a logistic regression.
We take the n classifier outputs as judges voting for Yes or No (written by the same
author or not). For each problem, we create a vector with these votes (1 if the judge
thinks the document is written by the same author and 0 if not) and use the resulting
matrix to train a logistic regression model to obtain the weight of each feature category.
Intuitively these weights describe the relevance of each feature category in the author’s
style.
    To do this, we use the training data set points xi , i = {1, ..., n} and the set Y of the
training data labels (the output of the classifiers), such that Y = {y : y = {0, 1}}. We
want to obtain weights wi for each point xi solving the equation (9):
                                           n
                                           X
                                      y=         wi · xi                                 (9)
                                           i=1

    With these weights for each judge we then calculate the probability of a class in the
evaluation set by the equation:

                                                ew·z
                                       p(z)                                             (10)
                                              1 + ew·z
   Where z is the vector of judges for the unknown author and w is the weights vector
obtained by solving (9) for {wi }. So we take this probability to determinate if the new
point is part of the given set or if it is not.


3     Results and Discussion

3.1   Results

Results of the evaluation are shown in Table 2.

                                      Table 2. Results

                             Language AUC c@1 Combined
                             Dutch    0.396 0.384 0.152
                             English 0.517 0.5 0.258
                             Greek    0.589 0.56 0.33
                             Spanish 0.454 0.48 0.217


3.2   Discussion

Various problems exist in the methodology. First, the model that the logistic regression
learns is a single one-fit-all model for all problems. This is not ideal because one of the
hypotheses is that the effectiveness of each feature category is dependent on the author
we want to identify. Thus a different linear regression model should be generated for
each author.
    Second, the only training data we provide to each SVM classifier is the unknown
texts of a single problem. Evidently, this is less than ideal, specially in cases where there
is only one unkown document. In these cases we split the unknown documents into three
shorter documents, which does allow the program to run the SVM algorithm but it is not
sufficient information to create a general model. A better use of the distance-to-average
matrix was needed in order to account for the little information each problem presented,
since we opted to not use the more traditional Impostors approach.
    Finally, linear regression is not necessarily the best approach for combining the
classifiers. There is much literature on ensemble classifiers and the possible ways on
generating the weights or scores of each one. Some examples of factors that the com-
bining function can take into account include assigning more weight to classifiers that
correctly classified hard instances, where hard refers to the fact that none or almost none
of the other classifiers correctly classified them [4].
    Also, we consider that for our hypothesis to be true the combining function should
always find the best features of the author no matter which feature categories were
used in the first place. Yet we observe that the method performs differently depending
on which feature categories were included in the experiment. For example, preliminar
runs where not all feature categories had been added (specifically "Multi word terms",
"End of sentence word profile" and "Character n-grams"), tended to performed better
(reaching a training c@1 score of 0.8 in the Spanish dataset). This is most likely due
to the logistic regression not being able to handle a large number of features at least
without some selection.


4    Conclusions and Future Work
Improvement of the method is needed, especially on the way the classifiers are com-
bined. Ideally it could use a relatively large feature set without losing performance, as
the Writeprints [1] method suggests, or as previous algorithms in this same task such as
[3] have successfully shown.
    Also, regarding the way instances are to be compared against control examples,
either the Impostors approach must be adopted or else further experimentation must
be done with the distance-to-average matrix in order to truly take advantage of the
information it tells us about the corpus.

Acknowledgments This work is funded by the project PAPIIT-UNAM IN400312
“Análisis estilométrico para la detección de similitud textual”, as well as CONACYT
CB2012/178248 “Detección y medición automática de similitud textual”


References
1. Abbasi, A., Chen, H.: Writeprints: A stylometric approach to identity-level identification and
   similarity detection in cyberspace. ACM Transactions on Information Systems (TOIS) 26(2),
   7 (2008)
2. Burrows, J.: Delta: A measure of stylistic difference and a guide to likely authorship. Literary
   and Linguistic Computing 17(3), 267–287 (2002)
3. Khonji, M., Iraqi, Y.: A slightly-modified gi-based author-verifier with lots of features
   (asgalf). In: Notebook for PAN at CLEF 2014
4. Kim, H., Kim, H., Moon, H., Ahn, H.: A weight-adjusted voting algorithm for ensembles of
   classifiers. Journal of the Korean Statistical Society 40(4), 437–449 (2011)
5. Padró, L., Stanilovsky, E.: Freeling 3.0: Towards wider multilinguality. In: Proceedings of
   the Language Resources and Evaluation Conference (LREC 2012). ELRA, Istanbul, Turkey
   (May 2012)
6. Schmid, H.: Probabilistic part-of-speech tagging using decision trees. In: Proceedings of the
   international conference on new methods in language processing. vol. 12, pp. 44–49.
   Citeseer (1994)
7. Schmid, H.: Improvements in part-of-speech tagging with an application to german. In:
   Proceedings of the ACL SIGDAT-Workshop. Citeseer (1995)
8. Stamatatos, E.: A survey of modern authorship attribution methods. Journal of the American
   Society for information Science and Technology 60(3), 538–556 (2009)
9. Weinberger, K.Q., Blitzer, J., Saul, L.K.: Distance metric learning for large margin nearest
   neighbor classification. In: Advances in neural information processing systems. pp.
   1473–1480 (2005)

</pre>