Homotopy Based Classification for Author Verification Task
Notebook for PAN at CLEF 2015

Josue Gutierrez1, Jose Casillas2, Paola Ledesma3, Gibran Fuentes1, and Ivan Meza1

1 Instituto de Investigaciones en Matematicas Aplicadas y en Sistemas (IIMAS), Universidad Nacional Autonoma de Mexico (UNAM)
2 Facultad de Ciencias (FC), Universidad Nacional Autonoma de Mexico (UNAM)
3 Escuela Nacional de Antropologia e Historia (ENAH), http://www.enah.edu.mx

Abstract This paper presents our experience implementing a homotopy-based classification (HBC) system for the PAN 2015 Author Identification task [20]. Known documents from a specific author and randomly selected impostor documents are used as a dictionary to generate a contested document. Given the contribution of the known documents to the contested document, we can verify the authorship of the document. This classification is embedded into the General Impostor method, resulting in an ensemble of sparse-based classification (SBC) models.

1 Introduction

Author verification has multiple applications in several areas, including information retrieval and computational linguistics, and has an impact on fields such as law and journalism [8,10,18]. In this edition of the PAN 2015 Author Identification track, the task was formally defined as follows1:

Given a small set (no more than 5, possibly as few as one) of "known" documents by a single person and a "questioned" document, the task is to determine whether the questioned document was written by the same person who wrote the known document set. The genre and/or topic may differ significantly between the known and unknown documents.

This edition had documents in English, Spanish, Dutch and Greek.

In this work we present our approach for author verification based on sparse-based classification. Homotopy-based Classification (HBC) was first proposed for face recognition; in that setting, the goal is to measure the contribution of known faces to the generation of an unknown face, and the amount of contribution determines the identity of the person in the unknown face [21]. This work is a continuation of the previous version of our system [15]. In this version we have normalized the extraction of the document representation; additionally, we have added character-level features.

1 As described in the official website of the competition http://pan.webis.de/ (2015).

2 Previous work

Author verification is considered a cornerstone of authorship analysis, together with the authorship attribution, author profiling and plagiarism detection tasks [19]. Current work in the field depends on similarity metrics between texts, such as the Jaccard, cosine, Euclidean and min-max similarities. As mentioned above, the general impostor method has been successful at using similarity measures relative to documents in the domain [17,9]. On the other hand, clustering approaches depend heavily on similarity measures [7]. Alternative methods for combining distances have also been proposed [2]. Even the Common N-gram (CNG) method, which was originally proposed for author profiling, has been adapted to author verification, and it can be interpreted as a particular similarity metric [4,13]. Metric distances have been important in the field since they facilitate an unsupervised framework for the task. In order to surpass some of the limitations of similarity metrics, supervised approaches have been explored [7,16]. Hybrid approaches, in which a model is built on a feature space based on similarity metrics, have also been proposed, with mixed results [5,14].

3 Document representation

We use the vector space model to represent the documents. In this edition we use the following features:

1. Bag of words: frequencies of the words in the document.
2. Bigrams of words: frequencies of two consecutive words.
3. Punctuation: frequencies of punctuation marks.
4. Character trigrams: frequencies of up to three consecutive characters.

Table 1 presents the final configuration of features per language; a sketch of the extraction is given after the table.

Table 1. Features used per language.

Feature                 Dutch   English   Greek   Spanish
1 Bag of words            *        *        *        *
2 Bigrams of words        *        *        *        *
3 Punctuation             *        *
4 Character trigrams      *        *        *
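As an illustration, the feature families above could be extracted with scikit-learn's CountVectorizer as in the following minimal sketch. The vectorizer settings, the punctuation set, and the concatenation step are our assumptions rather than the system's exact configuration, and per Table 1 not every family is active for every language.

import string
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer

# Features 1 and 2: bag of words and word bigrams.
words = CountVectorizer(analyzer="word", ngram_range=(1, 2))
# Feature 3: punctuation frequencies (punctuation set is assumed).
punct = CountVectorizer(analyzer="char", vocabulary=list(string.punctuation))
# Feature 4: character n-grams up to trigrams.
chars = CountVectorizer(analyzer="char", ngram_range=(1, 3))

def represent(documents):
    # Concatenate the feature families into one document-term matrix;
    # per Table 1, some families would be switched off per language.
    return hstack([v.fit_transform(documents)
                   for v in (words, punct, chars)])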
4 Methods

To present our proposal, we first review the GI method and homotopy-based classification, and then describe our proposal.

4.1 The general impostor method

The GI method is a second-order binary similarity metric for collections of documents [11,12]. It uses two functions: the similarity metric sim, which compares documents pairwise, and the aggregation function agg, which allows collections of documents to be compared. The aggregation function does not work directly with the similarity function; rather, it aggregates the scores calculated by the original impostor method, which is also a pairwise metric. This method has been described as an ensemble of random models, since several comparisons are performed with randomly selected impostors [9].

The procedure to generate the impostor collection is the following (a sketch is given at the end of Section 4): first, randomly select n terms from a document and submit them as a query to a search engine; second, keep the first m results; third, for each result, take only the first k words; finally, repeat this m times.

4.2 Homotopy-based Classification

At the core of the proposal is performing variable selection over the equation system represented in Figure 1 by optimizing the following objective:

x0 = argmin ||x||_1 subject to Ax = y,

where A is the dictionary matrix whose columns are the vector representations of the known and impostor documents, and y is the vector representation of the questioned document.
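The classification step can be illustrated with a minimal sketch. It assumes an overcomplete dictionary A stacking the known and impostor document vectors as columns; the ℓ1 objective is solved here through the standard linear-programming reformulation rather than a dedicated homotopy solver, and the decision threshold is an assumption.

import numpy as np
from scipy.optimize import linprog

def l1_minimization(A, y):
    # Solve x0 = argmin ||x||_1 subject to Ax = y via the standard
    # LP reformulation x = u - v with u >= 0, v >= 0.
    # (The actual system uses a homotopy solver for this objective;
    # in practice the equality constraint is often relaxed.)
    d, n = A.shape
    c = np.ones(2 * n)                 # minimize sum(u) + sum(v) = ||x||_1
    res = linprog(c, A_eq=np.hstack([A, -A]), b_eq=y, bounds=(0, None))
    return res.x[:n] - res.x[n:]

def contribution(A, y, n_known):
    # Share of the reconstruction coming from the known-author columns;
    # a high share (e.g. above 0.5, an assumed threshold) supports
    # attributing y to the author.
    x = l1_minimization(A, y)
    return np.abs(x[:n_known]).sum() / np.abs(x).sum()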
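For completeness, the impostor-collection procedure from Section 4.1 can be summarized as below. The search_engine callable is a hypothetical placeholder for whatever search API is used, and the default values of n, m and k are illustrative assumptions.

import random

def build_impostors(document, search_engine, n=5, m=10, k=500):
    # search_engine is a hypothetical callable: query string -> list of texts.
    impostors = []
    terms = document.split()
    for _ in range(m):                               # repeat the procedure m times
        query = " ".join(random.sample(terms, n))    # randomly select n terms
        for result in search_engine(query)[:m]:      # keep the first m results
            impostors.append(" ".join(result.split()[:k]))  # take the first k words
    return impostors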