CIC-GIL Approach to Cross-domain Authorship Attribution Notebook for PAN at CLEF 2018

Carolina Martín-Del-Campo-Rodríguez Center for Computing Research (CIC) Instituto Politécnico Nacional (IPN)

Mexico City Mexico

Helena Gómez-Adorno Center for Computing Research (CIC) Instituto Politécnico Nacional (IPN)

Mexico City Mexico

Engineering Institute (II) Universidad Nacional Autónoma de México (UNAM)

Mexico City Mexico

Grigori Sidorov sidorov@cic.ipn.mx Center for Computing Research (CIC) Instituto Politécnico Nacional (IPN)

Mexico City Mexico

Ildar Batyrshin Center for Computing Research (CIC) Instituto Politécnico Nacional (IPN)

Mexico City Mexico


We present the CIC-GIL approach to the cross-domain authorship attribution task at PAN 2018. This year's evaluation lab focuses on the closed-set attribution task applied to a Fanfiction corpus in five languages: English, French, Italian, Polish, and Spanish. We followed a traditional machine learning approach and selected different feature sets depending on the language. We evaluated document features such as typed and untyped character n-grams, word n-grams, and function word n-grams. Our final system uses the log-entropy weighting scheme and SVM as classifier.

Introduction

The authorship attribution (AA) task consists in identifying the author of a given document among a list of candidates. There are several subtasks within the authorship attribution field such as author identification [4], author obfuscation [11] and author profiling [12]. The AA methods are used for many practical applications like electronic commerce, forensics, and humanities research [2,5]. The Authorship Attribution task is viewed as a multi-class, single-label classification problem, i.e. an automatic method has to assign a single class label (the author) to the unknown authorship documents.

Character n-grams are considered among the best feature representations for authorship attribution problems [16]. In [14], the authors introduced a categorization of character n-grams and showed that some categories perform better than others in an AA task. Furthermore, several studies indicate that combining different types of n-grams provides useful information to the classification algorithm, yielding a more robust model [13].

This paper describes our approach to the cross-domain authorship attribution task at PAN 2018 [4,17]. We examined different document features (typed and untyped character n-grams, word n-grams, and function word n-grams), weighting schemes (tf-idf and log-entropy), and machine learning algorithms (support vector machines, multinomial naive Bayes, and multi-layer perceptron).

The corpus of the authorship attribution shared task at PAN 2018 is focused on cross-domain attribution. It is more challenging than the classical AA setting (single-topic AA), because the training and testing documents can belong to different domains (e.g., thematic area, genre). The documents in the corpus are fanfics, i.e., fictional literature based on the theme, atmosphere, style, characters, story world, etc. of a certain known author.

The development phase corpus (DPC), like the test phase corpus (TPC), is composed of a training corpus and a test corpus. Although the candidate authors of the DPC and the TPC have similar characteristics, the two sets of candidate authors do not overlap.

The development phase corpus is composed of 10 problems in five languages (two problems per language): English, French, Italian, Polish and Spanish. The specifications of the problems are defined in [4].

Methodology

In this section, we first cover the concept of typed character n-grams, then the log-entropy weighting scheme, and finally the experimental settings of the methodology.

Typed character n-grams

Typed character n-grams, introduced by [14], are subgroups of character n-grams that correspond to three distinct linguistic aspects: morphosyntax (represented by affix n-grams), thematic content (represented by word n-grams) and style (represented by punctuation n-grams). These subgroups are called super categories (SC). Each SC is divided into different categories:

- Affix n-grams: Capture morphology to some extent (prefix, suffix, space-prefix, space-suffix).
- Word n-grams: Capture partial words and other word-relevant tokens (whole-word, mid-word, multi-word).
- Punctuation n-grams: Capture patterns of punctuation (beg-punct, mid-punct, end-punct).

Some categories of character n-grams showed higher predictive capabilities in the AA task [14] than using all possible n-grams (categorized and uncategorized). The redefinition of these categories stated by [7] unambiguously assigns each 3-gram to exactly one category and does not exclude any n-gram (as happened with consecutive punctuation marks in the original proposal). The authors of [7] also showed that some categories perform better than others for AA.
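To make the categorization concrete, the following Python sketch assigns each character n-gram of a text to one of the ten categories listed above. The rules in it are a simplified assumption made for illustration only: they are neither the exact definitions of [14] nor the redefinitions of [7], and the names `categorize` and `typed_char_ngrams` are hypothetical.

```python
import string
from collections import Counter

PUNCT = set(string.punctuation)


def is_boundary(ch):
    """True at the start/end of the text, or on whitespace or punctuation."""
    return ch is None or ch.isspace() or ch in PUNCT


def categorize(text, i, n):
    """Assign the n-gram text[i:i+n] to one category (simplified, assumed rules)."""
    gram = text[i:i + n]
    before = text[i - 1] if i > 0 else None
    after = text[i + n] if i + n < len(text) else None

    # Punctuation n-grams: decided by where the punctuation marks sit.
    punct_pos = [k for k, ch in enumerate(gram) if ch in PUNCT]
    if punct_pos:
        if punct_pos == [0]:
            return "beg-punct"
        if punct_pos == [n - 1]:
            return "end-punct"
        return "mid-punct"

    # N-grams containing whitespace: space-prefix, space-suffix or multi-word.
    if any(ch.isspace() for ch in gram):
        if gram[0].isspace():
            return "space-prefix"
        if gram[-1].isspace():
            return "space-suffix"
        return "multi-word"

    # N-grams inside a single word: decided by the neighbouring characters.
    starts_word = is_boundary(before)
    ends_word = is_boundary(after)
    if starts_word and ends_word:
        return "whole-word"
    if starts_word:
        return "prefix"
    if ends_word:
        return "suffix"
    return "mid-word"


def typed_char_ngrams(text, n):
    """Count (category, n-gram) pairs over all positions of the text."""
    return Counter(
        (categorize(text, i, n), text[i:i + n])
        for i in range(len(text) - n + 1)
    )
```

Keeping the category as part of the feature key means that the same character sequence can produce different features depending on its context, and that a subset of categories can be selected per language simply by filtering on the first element of the key.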

Log-entropy

Global weighting functions measure the importance of a term across the entire collection of documents [3]. Previous research on document similarity judgments [6,9] has shown that entropy-based global weighting is generally better than the tf-idf model. The log-entropy (le) weight is calculated with Equations 1 and 2:

$$le_{ij} = e_i \times \log(tf_{ij} + 1), \tag{1}$$

$$e_i = 1 + \sum_j \frac{p_{ij} \log p_{ij}}{\log n}, \quad \text{where } p_{ij} = \frac{tf_{ij}}{gf_i}, \tag{2}$$

where n is the number of documents, tf_ij is the frequency of term i in document j, and gf_i is the frequency of term i in the whole collection. A term that appears once in every document has a global weight e_i of zero. A term that appears once in exactly one document has a global weight of one. Any other combination of frequencies gives the term a global weight between zero and one.
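The two equations can be checked numerically with a short Python sketch. It is an illustration under stated assumptions rather than the code of our system: the function name `log_entropy_weights` is hypothetical, the input is a plain term-by-document list of raw frequencies, natural logarithms are used, the collection is assumed to contain more than one document (so that log n is non-zero), and 0 log 0 is treated as 0.

```python
import math


def log_entropy_weights(tf):
    """Apply Equations 1 and 2 to tf, where tf[i][j] is the raw frequency
    of term i in document j. Returns (weighted matrix, global weights)."""
    n = len(tf[0])  # number of documents, assumed to be greater than 1
    weighted = []
    global_weights = []
    for row in tf:
        gf = sum(row)  # gf_i: frequency of term i in the whole collection
        entropy_sum = 0.0
        for f in row:
            if f > 0:  # convention: 0 * log 0 = 0
                p = f / gf
                entropy_sum += p * math.log(p)
        e = 1.0 + entropy_sum / math.log(n)  # Equation 2
        global_weights.append(e)
        weighted.append([e * math.log(f + 1) for f in row])  # Equation 1
    return weighted, global_weights
```

For a collection of four documents, the row [1, 1, 1, 1] gets a global weight of zero, the row [1, 0, 0, 0] gets a global weight of one, and the row [2, 1, 0, 0] falls between the two, which matches the boundary cases described above.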

Experimental Settings

After evaluating several classification algorithms, we chose Support Vector Machine (SVM) for our final approach, since this algorithm is recommended when the number of dimensions is greater than the number of samples (as in this case) [8]. We used the SVM implementation of sklearn [1] with the one-against-all strategy and the default parameter settings. We analyzed several text representation schemes: typed character n-grams (with n varying from 2 to 8), untyped character n-grams (with n between 3 and 4), word n-grams (with n varying from 1 to 5) and function word n-grams proposed by Stamatatos [15].
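As a rough illustration of the untyped representations mentioned above, the sketch below counts character and word n-grams for several values of n at once. The helper names are hypothetical, and the preprocessing choices (no normalization for characters, lowercasing and whitespace tokenization for words) are assumptions, since the preprocessing is not specified here.

```python
from collections import Counter


def char_ngrams(text, n_values=(3, 4)):
    """Count untyped character n-grams for every n in n_values."""
    return Counter(
        text[i:i + n]
        for n in n_values
        for i in range(len(text) - n + 1)
    )


def word_ngrams(text, n_values=(1, 2)):
    """Count word n-grams (as tuples) for every n in n_values.
    Tokens are lowercased and split on whitespace (an assumption)."""
    tokens = text.lower().split()
    return Counter(
        tuple(tokens[i:i + n])
        for n in n_values
        for i in range(len(tokens) - n + 1)
    )
```

The resulting counters can be merged into a single feature dictionary per document before weighting and classification.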

We implemented the character n-gram types introduced by Sapkota et al. [14], but with the redefinitions of Markov et al. [7], which make them more accurate and complete.

Following [15], we formed function word n-grams (with n equal to 8) from the 50 most frequent stop-words. For English, we used the 50 most frequent stop-words listed in [15]. For the other languages (French, Italian, Polish and Spanish), the 50 most frequent stop-words were extracted from the training part of the development corpus.
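One way to read this step, in the spirit of the stop-word n-grams of [15], is to discard every token that is not in the stop-word list and then form n-grams over the remaining sequence. The sketch below follows that reading; the function name, the regular-expression tokenizer and the lowercasing are assumptions made for illustration, and the stop-word list is supplied by the caller.

```python
import re
from collections import Counter


def function_word_ngrams(text, stopwords, n=8):
    """Keep only the stop-words of the text, in their original order,
    and count the n-grams of that reduced sequence."""
    tokens = re.findall(r"\w+", text.lower())
    kept = [t for t in tokens if t in stopwords]
    return Counter(
        tuple(kept[i:i + n])
        for i in range(len(kept) - n + 1)
    )
```

With the toy list {"the", "on", "and"} and n = 3, the sentence "The cat sat on the mat and the dog" is reduced to the sequence the, on, the, and, the, which yields three 3-grams. Because content words are removed, the features reflect the arrangement of function words rather than the topic of the text.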

We evaluated different combinations of features for the different languages in the corpus. We also performed an evaluation study in order to identify the most useful typed character n-gram categories for each language. Table 1 shows the combination of features as well as the types of character n-grams used in our final submission.

Moreover, we experimented with different feature document frequency thresholds. We considered thresholds between 1 and 3, i.e., features that occur in at least 1, 2, or 3 documents in each problem. We found that the features that occur in at least 2 documents achieved the best classification performance in our experiments. Following the experimental settings presented in [3], we examined two feature representations based on a global weighting scheme: log-entropy and tf-idf. Global weighting functions measure the importance of a word across the entire collection of documents. Previous research on document similarity judgments [6,9] and authorship attribution [3] has shown that entropy-based global weighting is generally better than the tf-idf model. We use log-entropy as the weighting function for our final version.
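The document frequency threshold can be sketched as follows, assuming that each document of a problem is represented as a dictionary that maps features to counts. The function name and the data layout are hypothetical.

```python
from collections import Counter


def filter_by_document_frequency(doc_features, min_df=2):
    """Keep only the features that occur in at least min_df documents.
    doc_features is a list with one {feature: count} dictionary per document."""
    df = Counter()
    for feats in doc_features:
        df.update(set(feats))  # each document counts once per feature
    keep = {f for f, c in df.items() if c >= min_df}
    return [
        {f: c for f, c in feats.items() if f in keep}
        for feats in doc_features
    ]
```

With `min_df=1` every feature is kept, so the three thresholds considered above correspond to `min_df` values of 1, 2 and 3.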

Evaluation Measure and Results

The macro-averaged F1 score is used for evaluating the performance of the systems participating in the authorship attribution shared task at PAN CLEF 2018 [4].
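For reference, the macro-averaged F1 score computes an F1 value per class (here, per candidate author) and then takes the unweighted mean, so every author counts equally regardless of how many documents they have. The sketch below is a minimal illustration and not the official evaluation script of the task; averaging over every label that appears in either list, and scoring a class with no true or predicted documents as zero, are assumptions.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); zero when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

For example, with true authors [a, a, b, b, c] and predictions [a, b, b, b, c], the per-class scores are 2/3, 4/5 and 1, and their mean is 37/45.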

The final configuration of our approach was selected based on the classification performance on the test set of the development phase corpus (DPC). Table 2 shows the results obtained on the DPC with the above-specified configuration evaluated on the TIRA platform [10]. The results achieved on the test phase corpus (TPC) are shown in Table 3. It can be observed that the performance on the TPC is much lower than on the DPC. This behavior can be explained by our decision to tune our system based on the classification performance over the test set of the DPC.

Conclusions

We presented the system that was submitted to the Cross-domain Authorship Attribution task at PAN 2018. Our experiments were performed using different features, and we found that selecting a specific set of features per language is the best approach to improve performance. Our approach performed well on the development phase corpus (Macro-Average F1: 0.747), but this performance was severely diminished on the test phase corpus (Macro-Average F1: 0.588). Based on the current technique, there are still opportunities for further enhancements.

In future research, we would like to consider a cross-validation approach for the development phase corpus to make the system more robust.

Table 1. Features included for each language in our final submission.

| Language | Features | Typed character n-grams categories |
|---|---|---|
| English | typed character n-grams (2, 3, 5) | whole-word, mid-word, multi-word, beg-punct, mid-punct, end-punct |
| French | typed character n-grams (2, 4, 5) | prefix, mid-word, multi-word, beg-punct, end-punct |
| Italian | word n-grams (1, 2, 3, 5) | |
| Polish | word n-grams (2, 5) | |
| Spanish | character n-grams (3), typed character n-grams (4) and word n-grams (1, 2) | beg-punct |

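Table 2. Results of the Cross Domain Authorship-Attribution on the Development Phase Corpus

| Language | Problem | Macro-Average F1 |
|---|---|---|
| English | problem 1 | 0.582 |
| English | problem 2 | 0.783 |
| French | problem 3 | 0.659 |
| French | problem 4 | 0.938 |
| Italian | problem 5 | 0.702 |
| Italian | problem 6 | 0.637 |
| Polish | problem 7 | 0.589 |
| Polish | problem 8 | 0.893 |
| Spanish | problem 9 | 0.804 |
| Spanish | problem 10 | 0.879 |
| Overall score | | 0.747 |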

Table 3. Results of the Cross Domain Authorship-Attribution Task on the Test Phase Corpus

| User | Macro-Average F1 | Runtime |
|---|---|---|
| custodio18 | 0.685 | 00:04:27 |
| murauer18 | 0.643 | 00:19:15 |
| halvani18 | 0.629 | 00:42:50 |
| mosavat18 | 0.613 | 00:03:34 |
| yigal18 | 0.598 | 00:24:09 |
| delcamporodriguez18 | 0.588 | 00:11:01 |
| pan18-baseline | 0.584 | 00:01:18 |
| miller18 | 0.582 | 00:30:58 |
| schaetti18 | 0.387 | 01:17:57 |
| gagala18 | 0.267 | 01:37:56 |
| garciacumbreras18 | 0.139 | 00:38:46 |
| tabealhoje18 | 0.028 | 02:19:14 |

Acknowledgments

This work was partially supported by the Mexican Government (CONACYT projects 240844, SNI, COFAA-IPN, SIP-IPN 20181849, 20171813) and Honeywell Grant.

References

1. Buitinck, L., Louppe, G., Blondel, M., Pedregosa, F., Mueller, A., Grisel, O., Niculae, V., Prettenhofer, P., Gramfort, A., Grobler, J., Layton, R., Vanderplas, J., Joly, A., Holt, B., Varoquaux, G.: API design for machine learning software: experiences from the scikit-learn project. In: ECML PKDD Workshop: Languages for Data Mining and Machine Learning (2013)

2. Coulthard, M.: On admissible linguistic evidence. Journal of Law & Policy 21, 441 (2012)

3. Gómez-Adorno, H., Aleman, Y., Vilariño, D., Sanchez-Perez, M.A., Pinto, D., Sidorov, G.: Author clustering using hierarchical clustering analysis. In: CLEF 2017 Working Notes. CEUR Workshop Proceedings (2017)

4. Kestemont, M., Tschuggnall, M., Stamatatos, E., Daelemans, W., Specht, G., Stein, B., Potthast, M.: Overview of the Author Identification Task at PAN-2018: Cross-domain Authorship Attribution and Style Change Detection. In: Cappellato, L., Ferro, N., Nie, J.Y., Soulier, L. (eds.) Working Notes Papers of the CLEF 2018 Evaluation Labs. CEUR Workshop Proceedings, CLEF and CEUR-WS (Sep 2018)

5. Koppel, M., Seidman, S.: Automatically identifying pseudepigraphic texts. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. EMNLP '13 (2013)

6. Lee, M.D., Navarro, D.J., Nikkerud, H.: An empirical evaluation of models of text document similarity. In: Proceedings of the Cognitive Science Society, vol. 27 (2005)

7. Markov, I., Stamatatos, E., Sidorov, G.: Improving cross-topic authorship attribution: The role of pre-processing. In: Proceedings of the 18th International Conference on Computational Linguistics and Intelligent Text Processing. Springer (2017)

8. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12 (2011)

9. Pincombe, B.: Comparison of human and latent semantic analysis (LSA) judgements of pairwise document similarities for a news corpus. Tech. rep., Defence Science and Technology Organisation Salisbury (Australia) Info Sciences Lab (2004)

10. Potthast, M., Gollub, T., Rangel, F., Rosso, P., Stamatatos, E., Stein, B.: Improving the Reproducibility of PAN's Shared Tasks: Plagiarism Detection, Author Identification, and Author Profiling. In: Kanoulas, E., Lupu, M., Clough, P., Sanderson, M., Hall, M., Hanbury, A., Toms, E. (eds.) Information Access Evaluation meets Multilinguality, Multimodality, and Visualization. 5th International Conference of the CLEF Initiative (CLEF 14). Springer, Berlin Heidelberg New York (Sep 2014)

11. Potthast, M., Hagen, M., Schremmer, F., Stein, B.: Overview of the Author Obfuscation Task at PAN 2018: A New Approach to Measuring Safety. In: Cappellato, L., Ferro, N., Nie, J.Y., Soulier, L. (eds.) Working Notes Papers of the CLEF 2018 Evaluation Labs. CEUR Workshop Proceedings, CLEF and CEUR-WS (Sep 2018)

12. Rangel, F., Rosso, P., Montes-y-Gómez, M., Potthast, M., Stein, B.: Overview of the 6th Author Profiling Task at PAN 2018: Multimodal Gender Identification in Twitter. In: Cappellato, L., Ferro, N., Nie, J.Y., Soulier, L. (eds.) Working Notes Papers of the CLEF 2018 Evaluation Labs. CEUR Workshop Proceedings, CLEF and CEUR-WS (Sep 2018)

13. Sanchez-Perez, M.A., Markov, I., Gómez-Adorno, H., Sidorov, G.: Comparison of character n-grams and lexical features on author, gender, and language variety identification on the same Spanish news corpus. In: International Conference of the Cross-Language Evaluation Forum for European Languages. Springer (2017)

14. Sapkota, U., Bethard, S., Montes-y-Gómez, M., Solorio, T.: Not all character n-grams are created equal: A study in authorship attribution. In: Proceedings of the 2015 Annual Conference of the North American Chapter of the ACL: Human Language Technologies. NAACL-HLT '15, Association for Computational Linguistics (2015)

15. Stamatatos, E.: Plagiarism detection using stopword n-grams. Journal of the American Society for Information Science and Technology 62(12) (2011)

16. Stamatatos, E.: A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology 60(3) (2009)

17. Stamatatos, E., Rangel, F., Tschuggnall, M., Kestemont, M., Rosso, P., Stein, B., Potthast, M.: Overview of PAN-2018: Author Identification, Author Profiling, and Author Obfuscation. In: Bellot, P., Trabelsi, C., Mothe, J., Murtagh, F., Nie, J., Soulier, L., Sanjuan, E., Cappellato, L., Ferro, N. (eds.) Experimental IR Meets Multilinguality, Multimodality, and Interaction. 9th International Conference of the CLEF Initiative (CLEF 18). Springer, Berlin Heidelberg New York (Sep 2018)