=Paper=
{{Paper
|id=Vol-1180/CLEF2014wn-Pan-CastilloEt2014
|storemode=property
|title=Unsupervised Method for the Authorship Identification Task
|pdfUrl=https://ceur-ws.org/Vol-1180/CLEF2014wn-Pan-CastilloEt2014.pdf
|volume=Vol-1180
|dblpUrl=https://dblp.org/rec/conf/clef/CastilloCAPL14
}}
==Unsupervised Method for the Authorship Identification Task==
Notebook for PAN at CLEF 2014

Esteban Castillo¹, Ofelia Cervantes¹, Darnes Vilariño², David Pinto², and Saul León²

¹ Universidad de las Américas Puebla, Department of Computer Science, Electronics and Mechatronics, Mexico
{esteban.castillojz, ofelia.cervantes}@udlap.mx
² Benemérita Universidad Autónoma de Puebla, Faculty of Computer Science, Mexico
{darnes, dpinto}@cs.buap.mx, saul.ls@live.com

Abstract. This paper presents an approach for tackling the authorship identification task. The approach compares the similarity between a given unknown document and the known documents using a number of different phrase-level and lexical-syntactic features, so that an unknown document is classified as having been written by the same author if the similarity measures obtained reach a predetermined threshold for each language in the task. The method has shown competitive results, achieving overall sixth place in the competition ranking.

Keywords: Authorship verification, features, similarity measures, unsupervised learning, threshold

1 Introduction

Discovering the right features in a raw text that unambiguously attribute the authorship of a given anonymous document is a very hard problem that has recently (driven by the continuous growth of information on the Internet) become of high interest in areas such as information retrieval (IR), natural language processing (NLP) and computational linguistics. The most common framework for mapping candidate authors to unknown documents is the authorship attribution problem, where, given texts of uncertain authorship and sample documents from a small, finite set of candidate authors, the task consists of mapping the uncertain texts onto their true authors among the candidates. This is considered a reasonably easy task, while a more demanding problem (often found in documents on the web) is the author verification task, where, given a set of documents by a single author and a questioned document, the problem is to determine whether the questioned document was written by that particular author. In this sense, finding the right features for characterizing the signature, or particular writing style, of a given author is fundamental for solving the authorship problem.

The results reported in this paper were obtained in the framework of the 8th International Workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse (PAN 14), in particular in the task named Author Identification. For this purpose, we propose a way of representing the features taken into account in the process of authorship verification, together with an unsupervised learning method. The proposed approach is discussed in the following sections.

The rest of this paper is structured as follows. Section 2 presents the description of the features used in the task. Section 3 describes the proposed approach (an unsupervised algorithm) used in the experiments. The experimental setting and a discussion of the obtained results are given in Section 4. Finally, the conclusions of this research work are presented in Section 5.

2 Description of the features and similarity measures used in the task

In this work we explore a combination of different types of features in a term frequency vector [7] to represent the writing style of the authors. For all languages in the author verification task, we use lexical-syntactic features because they are relatively easy to quantify and because they capture the syntactic order in which words are used to form ideas. We also consider (in an unsupervised classification) the use of different metrics to determine the similarity between the feature vectors of the documents of known origin and the document of unknown origin. Both types of elements are described below.

2.1 Lexical-syntactic features

The following features are considered for representing the particular writing style of a given author (a small extraction sketch is given after the lists):

– Phrase level features
  1. Word prefixes. A group of letters added before a word or base to alter its meaning and form a new word.
  2. Word suffixes. A group of letters added after a word or base to alter its meaning and form a new word.
  3. Stopwords. A group of words that bear no content or relevant semantics in the text.
  4. Punctuation marks. Conventional signs and typographical devices that aid the understanding and correct reading, both silently and aloud, of handwritten and printed texts.
  5. Word N-grams. A contiguous sequence [5] of n words from a text or speech.
  6. Skip-grams. A technique [4] in which n-grams are used to store sequences of words, but tokens are allowed to be skipped.

– Character level features
  1. Vowel combination. Word consonants are removed and, thereafter, the remaining vowels are combined. Each vowel combination is considered to be a feature. Adjacent repetitions of a vowel are collapsed into a single vowel.
  2. Vowel permutation. Word consonants are removed and, thereafter, the vowel permutation is considered to be a feature.
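The paper itself contains no code, so the following Python sketch is only a rough illustration of how a few of the features above could be computed. The function names, the fixed prefix/suffix length, and the tokenization are assumptions made for this example, not part of the original approach.

```python
import re

VOWELS = set("aeiou")

def word_skip_grams(tokens, n=2, k=1):
    """All n-grams over `tokens` where up to `k` tokens may be skipped
    between consecutive chosen words (a simple k-skip-n-gram sketch)."""
    grams = set()

    def build(start, gram):
        if len(gram) == n:
            grams.add(tuple(gram))
            return
        # the next word is either adjacent or skips up to k positions
        for nxt in range(start, min(start + k + 1, len(tokens))):
            build(nxt + 1, gram + [tokens[nxt]])

    for i in range(len(tokens)):
        build(i + 1, [tokens[i]])
    return grams

def vowel_combination(word):
    """Remove consonants and collapse adjacent repeated vowels (Sec. 2.1)."""
    vowels = [c for c in word.lower() if c in VOWELS]
    collapsed = [c for i, c in enumerate(vowels) if i == 0 or c != vowels[i - 1]]
    return "".join(collapsed)

def prefix_and_suffix(word, size=3):
    """Fixed-length word prefix and suffix features (the length is an assumption)."""
    return word[:size], word[-size:]

tokens = re.findall(r"\w+", "the quick brown fox jumps over the lazy dog")
print(sorted(word_skip_grams(tokens, n=2, k=1)))
print(vowel_combination("beautiful"))   # -> "eauiu"
print(prefix_and_suffix("beautiful"))   # -> ("bea", "ful")
```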
2.2 Similarity measures

In this approach the following metrics are considered to determine the similarity of the documents (a short numerical sketch follows the list):

– Latent semantic analysis (LSA). A technique [6] which analyzes relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words which are close in meaning will occur in similar pieces of text.

– Jaccard similarity. A metric used for comparing the similarity and diversity of sample sets. The Jaccard coefficient measures similarity between finite sample sets, and it is defined as the size of the intersection divided by the size of the union of the sample sets:

  JS(A, B) = \frac{|A \cap B|}{|A \cup B|}    (1)

– Euclidean distance. A metric used for comparing the similarity between two vectors p and q, where p = (p_1, p_2, ..., p_n) and q = (q_1, q_2, ..., q_n); the distance from p to q (or from q to p) is given by:

  d(q, p) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}    (2)

– Chebyshev distance. A metric defined on a vector space where the distance between two vectors p and q is the greatest of their differences along any coordinate dimension:

  D_{\mathrm{Chebyshev}}(q, p) := \max_i |q_i - p_i|    (3)

– Cosine similarity. A measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them. Given two vectors of attributes p and q, the cosine similarity is expressed using the dot product and magnitudes:

  \cos(\theta) = \frac{p \cdot q}{\|p\| \, \|q\|} = \frac{\sum_{i=1}^{n} p_i q_i}{\sqrt{\sum_{i=1}^{n} p_i^2} \, \sqrt{\sum_{i=1}^{n} q_i^2}}    (4)
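As a concrete reference, here is a minimal NumPy sketch of the four vector measures defined in equations (1) to (4); the LSA projection itself is not shown, and the helper names are illustrative rather than taken from the paper.

```python
import numpy as np

def jaccard_similarity(a, b):
    """Eq. (1): |A intersection B| / |A union B| over two feature sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def euclidean_distance(p, q):
    """Eq. (2): Euclidean distance between two frequency vectors."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sqrt(np.sum((q - p) ** 2)))

def chebyshev_distance(p, q):
    """Eq. (3): largest coordinate-wise difference."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.max(np.abs(q - p)))

def cosine_similarity(p, q):
    """Eq. (4): cosine of the angle between the two vectors."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    denom = np.linalg.norm(p) * np.linalg.norm(q)
    return float(np.dot(p, q) / denom) if denom else 0.0

# Toy term-frequency vectors for a known and an unknown document
known, unknown = [3, 0, 1, 2], [2, 1, 0, 2]
print(cosine_similarity(known, unknown), chebyshev_distance(known, unknown))
print(jaccard_similarity({"the", "of", "and"}, {"the", "and", "or"}))
```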
3 Proposed approach

For tackling the authorship identification task we propose a methodology based on an unsupervised learning method (see Algorithm 1 at the end of this section). We evaluate different feature vectors with different similarity measures in order to decide whether an unknown document belongs to an author. For each language, three types of text representation are evaluated:

– A frequency vector of all the vocabulary in the test documents.
– A frequency vector of all the combined lexical-syntactic features of the test documents:
  1. Word prefixes
  2. Word suffixes
  3. Stopwords
  4. Punctuation marks
  5. Word N-grams (from 1 to 5)
  6. Skip-grams (from 1 to 5)
  7. Vowel combination
  8. Vowel permutation
– A similarity vector using the LSA algorithm for each word in the test documents.

Different distance/similarity measures were tested: the Jaccard similarity for the vocabulary feature vector, the cosine similarity for the frequency vector of all the combined lexical-syntactic features, and the Chebyshev distance, Euclidean distance and cosine similarity for the LSA vectors. The best similarity of each document of unknown origin was obtained and compared against a threshold (see Table 1) to determine whether the document belongs to the author. The threshold was obtained using a classification tree [2,9], with the maximum similarity values (see Section 2.2) as features in a vector representation built from the training data set.

Table 1. Threshold used in the authorship identification task

Language  Genre     Threshold Δ
Dutch     essays    0.82
Dutch     reviews   0.93
English   reviews   0.67
English   novels    0.80
Greek     articles  0.76
Spanish   articles  0.64

Data: X: unknown documents and Y = {Y1, Y2, ..., Yn}: known documents
Result: same author or different author for each unknown document

for each unknown document X in the test data set do
    index <- 0; vector1, vector2 <- [ ]
    for <X, Yi> in the document set do
        /* frequency vectors of all the combined lexical-syntactic features */
        Xv <- feature representation of X
        Yv <- feature representation of Yi
        vector1[index] <- S_Cosine(Xv, Yv)        /* cosine similarity */
        /* frequency vectors of the words */
        Xv <- feature representation of X
        Yv <- feature representation of Yi
        vector2[index] <- S_Jaccard(Xv, Yv)       /* Jaccard similarity */
        index <- index + 1
    end
    Cosine  <- max(vector1[0], ..., vector1[n])
    Jaccard <- max(vector2[0], ..., vector2[n])

    index <- 0; vector1, vector2, vector3 <- [ ]
    for <X, Yi> in the document set do
        /* LSA word vectors */
        Xv <- feature representation of X
        Yv <- feature representation of Yi
        vector1[index] <- S_Cosine(Xv, Yv)        /* cosine similarity */
        vector2[index] <- S_Chebyshev(Xv, Yv)     /* Chebyshev distance */
        vector3[index] <- S_Euclidean(Xv, Yv)     /* Euclidean distance */
        index <- index + 1
    end
    LSA1 <- max(vector1[0], ..., vector1[n])
    LSA2 <- max(vector2[0], ..., vector2[n])
    LSA3 <- max(vector3[0], ..., vector3[n])

    MaxSimilarity <- max(Cosine, Jaccard, LSA1, LSA2, LSA3)
    if MaxSimilarity >= Δ then
        result <- same author
    else
        result <- different author
    end
end

Algorithm 1: Proposed approach using an unsupervised learning method
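The following Python sketch mirrors the decision step of Algorithm 1. It assumes the similarity helpers from the previous sketch are in scope and that the three representations of every document (combined lexical-syntactic vector, word set, LSA vector) have already been built; the data layout, function names and dictionary keys are assumptions made for the illustration.

```python
THRESHOLDS = {                        # (language, genre) -> threshold Δ from Table 1
    ("Dutch", "essays"): 0.82,   ("Dutch", "reviews"): 0.93,
    ("English", "reviews"): 0.67, ("English", "novels"): 0.80,
    ("Greek", "articles"): 0.76,  ("Spanish", "articles"): 0.64,
}

def verify(unknown, knowns, language, genre):
    """unknown / each known: dict with 'lexsyn', 'words' and 'lsa' representations."""
    cosine_best = max(cosine_similarity(unknown["lexsyn"], k["lexsyn"]) for k in knowns)
    jaccard_best = max(jaccard_similarity(unknown["words"], k["words"]) for k in knowns)
    lsa_best = max(f(unknown["lsa"], k["lsa"])
                   for k in knowns
                   for f in (cosine_similarity, chebyshev_distance, euclidean_distance))
    # Following Algorithm 1 as written, the maximum over all measures is compared with Δ
    max_similarity = max(cosine_best, jaccard_best, lsa_best)
    delta = THRESHOLDS[(language, genre)]
    return "same author" if max_similarity >= delta else "different author"
```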
4 Experimental results

The results obtained with the approach are discussed in this section. First, we describe the data sets used in the experiments and, thereafter, the results obtained.

4.1 Data sets

The description of the six text collections used in the experiments is shown in Table 2. As can be seen, the data set is made up of different languages/genres within the author verification task, where each problem consists of some (up to five) known documents written by a single person and exactly one questioned document. All documents within a single problem instance are in the same language, and the documents are matched for register, theme, and date of writing.

Table 2. Data set used in the authorship identification task

Language  Genre     Training documents  Test documents
Dutch     essays    169                 96
Dutch     reviews   105                 100
English   reviews   529                 200
English   novels    100                 100
Greek     articles  100                 100
Spanish   articles  500                 100

4.2 Results obtained in the task

Table 3 presents the results obtained by our approach using the TIRA tool [3] on each of the data sets considered in the competition. The results were evaluated according to the area under the ROC curve (AUC) [1] and the c@1 measure [8] (a brief sketch of these measures is given after Table 3). Three different languages, each with different genres, were tackled. The best performance was obtained for the Dutch language in the essays genre, followed by the Greek language in the articles genre and the English language in the essays genre. However, a low performance (with respect to the other teams in the competition) was obtained for the Dutch, Spanish and English languages in the reviews, articles and novels genres, respectively. We consider that these results are due to the use of empirical thresholds in the final classification of the unknown documents; further analysis will investigate this issue. It is worth noticing that we obtained sixth place out of 13 teams and that our approach always performed better than the competition baselines.

Table 3. Results obtained in different languages

Language  Genre     AUC      c@1      Final score  Runtime   Ranking based on the final score
Dutch     essays    0.86068  0.86133  0.74133      00:01:56  5
Dutch     reviews   0.6692   0.3696   0.24734      00:01:00  11
English   essays    0.54855  0.58     0.31816      01:31:53  11
English   novels    0.6278   0.615    0.3861       26:14:11  5
Greek     articles  0.686    0.73     0.50078      00:03:14  4
Spanish   articles  0.734    0.76     0.55784      00:06:48  5
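As a reading aid, the sketch below spells out the c@1 measure [8] and shows how the final score in Table 3 can be reproduced; combining AUC and c@1 by multiplication is inferred from the reported values rather than stated explicitly in this paper, and the handling of undecided problems is an assumption.

```python
# Sketch of the evaluation measures reported in Table 3. The c@1 formula
# follows Peñas & Rodrigo [8]; treating undecided problems as unanswered and
# multiplying AUC by c@1 are assumptions inferred from the "Final score" column.

def c_at_1(answers, truth):
    """answers: list of 'Y', 'N' or None (None = problem left unanswered)."""
    n = len(truth)
    n_correct = sum(1 for a, t in zip(answers, truth) if a is not None and a == t)
    n_unanswered = sum(1 for a in answers if a is None)
    return (n_correct + n_unanswered * n_correct / n) / n

def final_score(auc, c1):
    """Final score as the product AUC * c@1 (matches the Table 3 rows)."""
    return auc * c1

print(final_score(0.86068, 0.86133))   # ~0.74133, the Dutch essays row
```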
5 Conclusions

We have presented an approach that uses an unsupervised method with lexical-syntactic features. Even though its runtime is greater than that of most approaches in this competition, its performance is good. It was surprising that, being a native Spanish-speaking team, we performed better in the Dutch and Greek languages; this is a good opportunity to analyze the texts in more depth and determine the reason for this behavior. As mentioned before, we executed the same methodology across the different languages, varying essentially only the thresholds applied in the final classification. However, more experiments continue to be performed to analyze whether or not these changes introduce significant variations across the data sets. Future work is planned to observe the performance of the proposed methodology with different similarity measures.

References

1. Agarwal, S., Graepel, T., Herbrich, R., Har-Peled, S., Roth, D.: Generalization bounds for the area under the ROC curve. Journal of Machine Learning Research 6, 393–425 (2005)
2. Garner, S.R.: WEKA: The Waikato environment for knowledge analysis. In: Proc. of the New Zealand Computer Science Research Students Conference, pp. 57–64 (1995)
3. Gollub, T., Potthast, M., Beyer, A., Busse, M., Pardo, F.M.R., Rosso, P., Stamatatos, E., Stein, B.: Recent trends in digital text forensics and its evaluation - plagiarism detection, author identification, and author profiling. In: Forner, P., Müller, H., Paredes, R., Rosso, P., Stein, B. (eds.) CLEF 2013. Lecture Notes in Computer Science, vol. 8138, pp. 282–302. Springer (2013)
4. Guthrie, D., Allison, B., Liu, W., Guthrie, L., Wilks, Y.: A closer look at skip-gram modelling. In: Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC 2006). Genoa, Italy (2006)
5. Houvardas, J., Stamatatos, E.: N-gram feature selection for authorship identification. In: Euzenat, J., Domingue, J. (eds.) AIMSA 2006. Lecture Notes in Computer Science, vol. 4183, pp. 77–86. Springer (2006)
6. Landauer, T.K., Foltz, P.W., Laham, D.: An introduction to latent semantic analysis. Discourse Processes 25(2), 259–284 (1998)
7. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA (2008)
8. Peñas, A., Rodrigo, A.: A simple measure to assess non-response. In: Lin, D., Matsumoto, Y., Mihalcea, R. (eds.) ACL 2011. pp. 1415–1424. The Association for Computational Linguistics (2011)
9. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques, Second Edition (Morgan Kaufmann Series in Data Management Systems). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (2005)