=Paper= {{Paper |id=Vol-1391/74-CR |storemode=property |title=Homotopy Based Classification for Author Verification Task: Notebook for PAN at CLEF 2015 |pdfUrl=https://ceur-ws.org/Vol-1391/74-CR.pdf |volume=Vol-1391 |dblpUrl=https://dblp.org/rec/conf/clef/HernandezCLPR15 }} ==Homotopy Based Classification for Author Verification Task: Notebook for PAN at CLEF 2015== https://ceur-ws.org/Vol-1391/74-CR.pdf

Homotopy Based Classification for Author Verification
Task
Notebook for PAN at CLEF 2015

Josue Gutierrez¹, Jose Casillas², Paola Ledesma³, Gibran Fuentes¹, and Ivan Meza¹
¹ Instituto de Investigaciones en Matematicas Aplicadas y en Sistemas (IIMAS)
² Facultad de Ciencias (FC)
Universidad Nacional Autonoma de Mexico (UNAM)
³ Escuela Nacional de Antropologia e Historia (ENAH)
http://www.enah.edu.mx

Abstract. This paper presents our experience implementing a homotopy-based
classification (HBC) system for the PAN 2015 Author Identification task [20]. Known
documents from a specific author and randomly selected impostor documents are
used as a dictionary to generate a contested document. The contribution of
the known documents to the contested document lets us verify the authorship of
that document. This classification is embedded into the General Impostor method,
resulting in an ensemble of the SBC model.

1 Introduction

Author verification has multiple applications in several areas, including information
retrieval and computational linguistics, and has an impact on fields such as law and
journalism [8,10,18]. In the PAN 2015 edition of the Author Identification task, the
problem was formally defined as follows¹:

Given a small set (no more than 5, possibly as few as one) of "known"
documents by a single person and a "questioned" document, the task is to de-
termine whether the questioned document was written by the same person who
wrote the known document set. The genre and/or topic may differ significantly
between the known and unknown documents.

This edition had documents in English, Spanish, Dutch and Greek.
In this work we present our approach to author verification, which is based on
sparse-based classification. Homotopy-based Classification (HBC) was first proposed
for face recognition; in that setting, the goal is to measure the contribution of known
faces to the generation of an unknown face. The amount of contribution determines the
identity of the person with the unknown face [21]. This work is a continuation of the
previous version of our system [15]. In this version we have normalized the extraction
of the document representation; additionally, we have added character-level features.
¹ As described on the official website of the competition, http://pan.webis.de/ (2015).

2 Previous work

Author verification is considered a cornerstone of authorship analysis, together
with the authorship attribution, author profiling and plagiarism detection tasks [19].
Current work in the field depends on similarity metrics between texts, such as the
Jaccard, cosine, Euclidean and min-max similarities. The general impostor
method has been successful at using similarity measures relative to documents in the
domain [17,9]. Clustering approaches likewise depend heavily on similarity
measures [7].
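
For reference, the four similarity measures named above can be sketched as follows. This is a minimal illustration over sparse frequency vectors stored as Python dictionaries; the function names and the dictionary representation are our own assumptions and do not come from any of the cited systems.

```python
import math

def jaccard(a, b):
    """Jaccard similarity over the sets of features present in each vector."""
    sa = {k for k, v in a.items() if v > 0}
    sb = {k for k, v in b.items() if v > 0}
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def cosine(a, b):
    """Cosine of the angle between two sparse frequency vectors."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def euclidean_distance(a, b):
    """Euclidean distance (a dissimilarity: 0 means identical vectors)."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0) - b.get(k, 0)) ** 2 for k in keys))

def min_max(a, b):
    """Min-max similarity: sum of feature-wise minima over sum of maxima."""
    keys = set(a) | set(b)
    den = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    num = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    return num / den if den else 0.0
```

Note that the Euclidean measure is a distance rather than a similarity, so a system that mixes it with the other three has to invert or rescale it first.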
Alternative methods for combining distances have also been proposed [2]. Even the
Common N-gram (CNG) method, which was originally proposed for author profiling,
has been adapted to author verification, and it can be interpreted as a particular
similarity metric [4,13]. Distance metrics have been important in the field since they
facilitate an unsupervised framework for the task. To overcome some of the limitations
of similarity metrics, supervised approaches have been explored [7,16]. Hybrid
approaches, in which a model is built on a feature space based on similarity metrics,
have also been proposed, with mixed results [5,14].

3 Document representation

We use the vector space model to represent the documents. In this edition we use the
following features:

1. Bag of words: frequencies of words in the document.
2. Bigram of words: frequencies of two consecutive words.
3. Punctuation: frequencies of punctuation marks.
4. Trigram of characters: frequencies of sequences of up to three consecutive letters.

Table 1 presents the final feature configuration per language.

Table 1. Features used per language.

Feat Dutch English Greek Spanish
1 * * * *
2 * * * *
3 * *
4 * * *
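
As an illustration of the four feature types listed in this section, the sketch below extracts each of them as a frequency counter. The tokenization rule (lowercased runs of word characters), the Unicode-category test for punctuation, and all function names are assumptions made for this example; the paper does not specify these details.

```python
import re
import unicodedata
from collections import Counter

def tokenize(text):
    # Assumption: tokens are lowercased runs of word characters.
    return re.findall(r"\w+", text.lower())

def bag_of_words(text):
    """Feature 1: frequency of each word."""
    return Counter(tokenize(text))

def word_bigrams(text):
    """Feature 2: frequency of each pair of consecutive words."""
    tokens = tokenize(text)
    return Counter(zip(tokens, tokens[1:]))

def punctuation(text):
    """Feature 3: frequency of each punctuation mark (Unicode category P*)."""
    return Counter(ch for ch in text if unicodedata.category(ch).startswith("P"))

def char_ngrams(text, max_n=3):
    """Feature 4: frequency of character sequences of length 1 to max_n."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts
```

Using the Unicode category instead of an ASCII punctuation list keeps marks such as the Spanish "¿" in the counts, which matters for a multilingual corpus.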

4 Methods

To present our proposal, we first review the GI method and homotopy-based
classification, and then describe the proposal itself.
4.1 The general impostor method
The GI method is a second-order binary similarity metric for collections of
documents [11,12]. It uses two functions: the similarity metric sim, which compares
pairs of documents, and the aggregate function agg, which allows collections of
documents to be compared. The aggregate function does not work directly with the
similarity function; rather, it aggregates the scores calculated by the original impostor
method, which is also a pairwise metric. This method has been described as an ensemble
of random models, since several comparisons are performed with randomly selected impostors [9].
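
The relation between sim, agg and the pairwise impostor score can be sketched as follows. This is only an illustration under stated assumptions: we assume the pairwise score is the fraction of random trials in which the known document is more similar to the questioned document than every sampled impostor, measured on a random subset of features. The parameter names, their defaults and the choice of the mean as agg are hypothetical and are not taken from the paper.

```python
import random
from statistics import mean

def min_max(a, b):
    """Example sim: min-max similarity between sparse frequency vectors."""
    keys = set(a) | set(b)
    den = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    return sum(min(a.get(k, 0), b.get(k, 0)) for k in keys) / den if den else 0.0

def impostor_score(known, questioned, impostors, sim,
                   n_trials=100, n_impostors=5, feature_fraction=0.5, rng=None):
    """Pairwise score: share of trials won by the known document (assumed form)."""
    rng = rng or random.Random(0)
    features = sorted(set(known) | set(questioned))
    if not features or not impostors:
        return 0.0
    size = max(1, int(len(features) * feature_fraction))
    wins = 0
    for _ in range(n_trials):
        subset = set(rng.sample(features, size))  # random feature subset

        def project(doc):
            return {k: v for k, v in doc.items() if k in subset}

        q = project(questioned)
        target = sim(project(known), q)
        sampled = rng.sample(impostors, min(n_impostors, len(impostors)))
        if all(target > sim(project(imp), q) for imp in sampled):
            wins += 1
    return wins / n_trials

def general_impostor(known_docs, questioned, impostors, sim, agg=mean, **kwargs):
    """Second-order score: agg over the pairwise scores of all known documents."""
    scores = [impostor_score(k, questioned, impostors, sim, **kwargs)
              for k in known_docs]
    return agg(scores)
```

A call such as `general_impostor(known_docs, questioned, impostors, min_max)` returns a value in [0, 1]; a threshold on that value would then yield the binary same-author decision.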
The procedure to generate the impostor collection is the following. First, randomly
select n terms of a document and make a query to a search engine. Second, keep the
first m results. Third, for each result, take only the first k words. Finally,
repeat this m times.
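
The steps above can be sketched as a short routine. The search engine is represented by a caller-supplied function `search(query)` that returns a list of result texts; this function, the whitespace tokenization and all parameter names and defaults are hypothetical. The text uses m both for the number of results kept and for the number of repetitions, so the sketch exposes them as two separate parameters that can be set to the same value.

```python
import random

def generate_impostors(document, search, n_terms=5, m_results=3,
                       k_words=500, repetitions=3, rng=None):
    """Build an impostor collection from search results (illustrative sketch)."""
    rng = rng or random.Random(0)
    vocabulary = sorted(set(document.split()))
    impostors = []
    for _ in range(repetitions):
        # Step 1: query the engine with n randomly selected terms.
        terms = rng.sample(vocabulary, min(n_terms, len(vocabulary)))
        # Step 2: keep only the first m results.
        results = search(" ".join(terms))[:m_results]
        # Step 3: truncate each result to its first k words.
        for text in results:
            impostors.append(" ".join(text.split()[:k_words]))
    return impostors
```

Passing the search function as an argument keeps the sketch independent of any particular search API and makes it easy to test with a stub.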

4.2 Homotopy-based Classification
At the core of the proposal is performing variable selection over the equation system
represented in Figure 1 by optimizing the following objective: $x_0 = \arg\min \|x\|_1$.
A ∈