Multi-channel Open-set Cross-domain Authorship Attribution

José Eleandro Custódio and Ivandré Paraboni
School of Arts, Sciences and Humanities (EACH), University of São Paulo (USP), São Paulo, Brazil

2019

This paper describes a multi-channel approach to open-set cross-domain authorship attribution (AA) for the PAN-CLEF 2019 AA shared task. The present work adapts the EACH-USP ensemble method presented at PAN-CLEF 2018 to an open-set scenario by defining a threshold value for unknown authors, and extends the previous architecture with an additional character ranking model built with the aid of the PageRank algorithm. Results are superior to a number of baseline systems, and remain generally comparable to those in the original closed-set ensemble approach.

1 Introduction

Authorship attribution (AA) is the computational task of identifying the author of a given text by examining samples of texts written by a number of candidate authors [8]. Practical applications include the detection of internet misuse and text forensics for copyright protection, among others [5].

AA may be based on single- or cross-domain settings. In this paper we discuss the latter, that is, situations in which we would like to identify the author of a text in a certain genre based on samples of text written in another genre.

From a computational perspective, we may distinguish two AA problem definitions: closed- and open-set AA. Closed-set AA assumes that the author of a disputed text necessarily belongs to a pre-defined set of possible candidates. This subtask was the theme of the PAN-CLEF 2018 shared task [7]. Open-set AA, by contrast, assumes that the disputed text may not have been written by any of the known candidates [18]. This subtask was the theme of the PAN-CLEF 2019 shared task, and it is also the focus of the present work.

In the context of closed-set AA, the work in [2,3] presented an ensemble approach that combines predictions made by three knowledge channels, namely, standard character n-grams, character n-grams with non-diacritic distortion and word n-grams. In the present work, this method is adapted to an open-set scenario by defining a threshold value for unknown authors, and further extended with the inclusion of a fourth channel based on a character ranking model built with the aid of the PageRank algorithm [10,15].

2 Related Work

The present work extends the ensemble AA approach in [2]. This, and a number of related studies, are briefly discussed below.

The work in [2] presented an ensemble approach to cross-domain AA called EACH-USP, which combines predictions made by three independent classifiers based on word n-grams (Std.wordN), standard character n-grams (Std.charN), and character n-grams with non-diacritic distortion (Dist.charN). The method relies on variable-length n-gram models and multinomial logistic regression, and selects the prediction of highest probability among the three models as the output for the task by soft voting.
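As an illustration of what a variable-length n-gram model counts, the sketch below extracts all character or word n-grams within a length range. It is a minimal pure-Python sketch and not the implementation used in [2]; the function names and the whitespace tokenisation are assumptions made for the example.

```python
from collections import Counter


def char_ngrams(text: str, start: int, end: int) -> Counter:
    """Count every character n-gram of length n, for start <= n <= end."""
    if start < 1 or start > end:
        raise ValueError("expected 1 <= start <= end")
    counts = Counter()
    for n in range(start, end + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def word_ngrams(text: str, start: int, end: int) -> Counter:
    """Count every word n-gram of length n, for start <= n <= end.

    Assumption: tokens are obtained by splitting on whitespace.
    """
    if start < 1 or start > end:
        raise ValueError("expected 1 <= start <= end")
    tokens = text.split()
    counts = Counter()
    for n in range(start, end + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts
```

In a full pipeline, counts of this kind would typically be turned into weighted feature vectors before being passed to a classifier.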

The word-based Std.wordN model in [2] is intended to help distinguish one author from another based on word usage. However, given that a single author may favour different words across domains (e.g., fictional versus dialogue text), and that word-based models will usually discard punctuation and blank spaces that may represent a valuable knowledge source for AA [13], the character-based models Std.charN and Dist.charN were added as a means to capture tense and gender inflection, punctuation and spacing.

Both Std.charN and Dist.charN models in [2] are intended to capture language-independent syntactic and morphological clues for AA. In the latter, all characters that do not represent diacritics are removed from the text beforehand, therefore focusing on the effects of punctuation, spacing and the use of diacritics, numbers and other non-alphabetical symbols.
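One possible reading of this distortion step is sketched below. The sketch assumes that the characters to be suppressed are the unaccented Latin letters, so that accented letters, digits, punctuation and spaces survive. It masks each suppressed letter with a placeholder by default, which is an assumption of this example; passing an empty mask removes the letters instead. The function name is hypothetical.

```python
import re
import unicodedata

# Assumption: "non-diacritic characters" means the plain letters a-z and A-Z.
_PLAIN_LETTERS = re.compile(r"[A-Za-z]")


def distort(text: str, mask: str = "*") -> str:
    """Suppress unaccented letters, keeping diacritics, digits and punctuation.

    The text is first normalised to NFC so that an accented letter is a
    single code point and is not split into base letter plus accent.
    """
    composed = unicodedata.normalize("NFC", text)
    return _PLAIN_LETTERS.sub(mask, composed)
```

Character n-grams would then be extracted from the distorted text rather than from the original.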

For further details regarding the ensemble method, we refer the reader to [2]. Character models are extensively discussed in [14], with details regarding the role of affixes and prefixes in the task. Function words and word n-gram models are discussed in [6]. Text distortion models for removing noise information from text are discussed in [17].

Finally, the work in [19] creates word-adjacency graphs and extracts weighted clustering coefficients and weighted degrees from certain nodes in the word-adjacency network. An AA knowledge channel along these lines will be addressed in our own work as discussed in Section 4.

3 Corpus and Baseline Analysis

We started our investigation by examining the PAN-CLEF 2019 cross-domain AA dataset (available at https://pan.webis.de/clef19/pan19-web/author-identification.html), and by comparing the results obtained by the baseline systems provided. This analysis is described as follows.

The PAN-CLEF 2019 AA development dataset comprises 20 problems written in four languages (English, French, Italian and Spanish), with nine candidate authors per problem, seven documents per candidate and an average of 4500 characters per document. The shared task organisers also provided three baseline systems, namely, compression models [20,11], the Impostors method [9], and an SVM classifier based on character trigrams. Further details are provided in [13]. Figure 1 presents a comparison between macro F1 scores obtained from the three baseline systems for each target language.

From Figure 1 we notice that the SVM classifier has the best overall performance among the three baseline systems. Moreover, we notice that the three systems obtained similar results in the case of the English dataset.

Figure 2 presents a comparison among the same baseline methods according to the number of unknown documents under consideration.

From Figure 2 we notice that the proportion of unknown texts in each dataset, or openness of the AA task, has a considerable impact on the performance of all models. This confirms the general intuition that open-set AA is more challenging than closed-set AA.

4 Current Work

As in [2], our current approach to AA assumes that evidence of an author's identity may be found in multiple layers of morphological, syntactic and semantic knowledge. These layers may be modelled as knowledge channels that use character- and word-based n-grams as their main source for feature extraction [4]. Channels of this kind tend to be relatively independent from each other, that is, the information captured by one channel may not necessarily be captured by another.

Based on these observations, we follow the work in [2] and address the AA task by making use of multiple models combined as an ensemble of classifiers. More specifically, our current approach extends the ensemble method in [2] by adding a fourth module to the existing set of channels (Std.wordN, Std.charN and Dist.charN, cf. Section 2) and proposes further adjustments for the open-set AA setting.

4.1 A Character Ranking Model for AA

Language models are central to a wide range of natural language processing tasks. Accordingly, many studies have attempted to estimate the probability of a word (or character) appearing after a given symbol [4]. N-grams and recurrent neural networks [1,16] are the best-known methods of this kind.

Of particular interest to the present work, language models may be represented as a character adjacency graph, in which the degree of influence of each node may help capture the most influential character sequences that characterise a particular author. Influence may be measured, for instance, by using the PageRank algorithm [10,15]. In this case, the influence of a node $p_i$ is defined by Equation 1, in which $N$ is the number of nodes, $\alpha$ is the original alpha (damping) factor, $M(p_i)$ is the set of nodes that point to $p_i$, and $L(p_j)$ is the number of outgoing edges of node $p_j$.

$$PR(p_i) = \frac{1 - \alpha}{N} + \alpha \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)} \qquad (1)$$
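The PageRank update can be computed by simple power iteration, as in the sketch below. This is a minimal pure-Python illustration on an unweighted directed graph, not the implementation used in our experiments. Two choices are assumptions of the example: nodes without outgoing edges spread their score uniformly over all nodes (a common convention that keeps the scores summing to one), and iteration stops early once the scores change by less than a small tolerance.

```python
from typing import Dict, Hashable, Iterable, Tuple


def pagerank(edges: Iterable[Tuple[Hashable, Hashable]],
             alpha: float = 0.85,
             max_iter: int = 500,
             tol: float = 1e-10) -> Dict[Hashable, float]:
    """Power-iteration PageRank over directed (source, target) edges."""
    out_links: Dict[Hashable, set] = {}
    for src, dst in edges:
        out_links.setdefault(src, set()).add(dst)
        out_links.setdefault(dst, set())
    n = len(out_links)
    if n == 0:
        return {}
    rank = {node: 1.0 / n for node in out_links}
    for _ in range(max_iter):
        # Assumption: score held by nodes with no outgoing edge is shared equally.
        dangling = sum(rank[v] for v, targets in out_links.items() if not targets)
        new_rank = {v: (1.0 - alpha) / n + alpha * dangling / n for v in out_links}
        for src, targets in out_links.items():
            if targets:
                share = alpha * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
        delta = sum(abs(new_rank[v] - rank[v]) for v in out_links)
        rank = new_rank
        if delta < tol:
            break
    return rank
```

On a small chain such as t → h → e, the final node receives the highest score because it accumulates influence from both of its predecessors.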

Using this method as a basis, we envisaged a character ranking model for AA, hereby called Rank.char, that computes character adjacency graphs and uses PageRank to select the most influential characters of a set of documents of a given author. For instance, the word 'the' gives rise to three nodes t, h and e, and two edges t → h and h → e.
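The graph construction in the example above can be sketched as follows. The function name is hypothetical, and the sketch makes two assumptions: edges link each character to the one immediately following it, and edges are never drawn across document boundaries.

```python
from collections import Counter
from typing import Iterable, Tuple


def char_adjacency_graph(documents: Iterable[str]) -> Tuple[Counter, Counter]:
    """Build node and edge counts from consecutive characters.

    Returns (nodes, edges): nodes maps each character to its frequency,
    edges maps each (previous, next) character pair to its frequency.
    """
    nodes: Counter = Counter()
    edges: Counter = Counter()
    for text in documents:
        nodes.update(text)
        # zip pairs each character with its successor in the same document.
        edges.update(zip(text, text[1:]))
    return nodes, edges
```

For the single document 'the', this yields the nodes t, h and e and the edges (t, h) and (h, e), each with count one.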

Once the adjacency graph is computed, symbols of frequency lower than five are removed, and the resulting structure is submitted to the PageRank algorithm to determine its most influential nodes. The algorithm is executed with a maximum of 500 iterations and an alpha value of 0.85. The output, a matrix of size |d| × |v| where d is the set of documents and v is the corpus vocabulary, is then fed into the AA pipeline.

There are many possible strategies for combining the outputs of a set of classifiers. Among these, the most common are averaging, soft voting and hard voting. Averaging simply averages the predictions made by each classifier and chooses the class with the highest probability. In hard voting, the majority vote is used as the final decision and, in soft voting, a weighted vote is considered.
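The three combination strategies can be contrasted on a toy example, as in the sketch below. It is a generic pure-Python illustration in which each classifier outputs one probability vector over the same classes; the function names are hypothetical, and ties are resolved in favour of the first class encountered, which is an arbitrary choice of this example.

```python
from collections import Counter
from typing import Sequence


def _argmax(values: Sequence[float]) -> int:
    return max(range(len(values)), key=values.__getitem__)


def average_vote(probas: Sequence[Sequence[float]]) -> int:
    """Average the probability vectors and return the most probable class."""
    n_classes = len(probas[0])
    mean = [sum(p[c] for p in probas) / len(probas) for c in range(n_classes)]
    return _argmax(mean)


def hard_vote(probas: Sequence[Sequence[float]]) -> int:
    """Each classifier votes for its own top class; the majority wins."""
    votes = Counter(_argmax(p) for p in probas)
    return votes.most_common(1)[0][0]


def soft_vote(probas: Sequence[Sequence[float]],
              weights: Sequence[float]) -> int:
    """Weighted sum of the probability vectors, then the top class."""
    n_classes = len(probas[0])
    score = [sum(w * p[c] for w, p in zip(weights, probas))
             for c in range(n_classes)]
    return _argmax(score)
```

With the outputs [0.9, 0.1], [0.4, 0.6] and [0.45, 0.55], averaging selects the first class because one classifier is very confident, whereas hard voting selects the second class because two of the three classifiers prefer it.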

In the present work we follow [2] and consider the use of a soft voting method in which the probabilities produced by a set of classifiers are concatenated and taken as an input to a softmax logistic regression model. This strategy is motivated by similar methods commonly applied in convolutional neural network learning, in which multiple filters are applied to a stream of text, and subsequently combined by using a softmax layer. In the present AA setting, this method allows full filter (or channel) optimisation with the benefits of soft voting, which may be particularly suitable to scenarios with a restricted number of text samples per author.
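The combination step can be sketched as the forward pass of a softmax regression over the concatenated channel outputs. The sketch below is illustrative only: the weights and bias are hypothetical placeholders that would in practice be learned by fitting a multinomial logistic regression on held-out channel predictions, and the function names are not taken from our implementation.

```python
import math
from typing import List, Sequence


def concatenate_channels(channel_probas: Sequence[Sequence[float]]) -> List[float]:
    """Flatten the per-channel probability vectors into one feature vector."""
    return [p for probas in channel_probas for p in probas]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_proba(channel_probas: Sequence[Sequence[float]],
                   weights: Sequence[Sequence[float]],
                   bias: Sequence[float]) -> List[float]:
    """Softmax regression forward pass: one weight row and bias per author."""
    features = concatenate_channels(channel_probas)
    scores = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, bias)]
    return softmax(scores)
```

With k channels and a candidate authors, the feature vector has k × a entries, so each weight row also has k × a entries.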

Our resulting architecture is illustrated in Figure 3. The first three channels are similar to those in [2], whereas the last channel (Rank.char) represents our current extension.

The output of the ensemble method is a matrix of probabilities with d rows representing documents and a columns representing authors, in which d_ij is the probability that document d_i belongs to author a_j. The openness aspect of the AA task at PAN-2019 (i.e., the fact that an input text may not belong to any of the candidate authors) is dealt with by assigning the unknown author (<UNK>) label to the input text when the standard deviation of the corresponding row is below a threshold of 0.05.

5 Evaluation
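The thresholding rule can be sketched for a single row of the probability matrix as follows. The function name is hypothetical, and the use of the population standard deviation (dividing by the number of authors) is an assumption of this example, since the variant is not specified in the text.

```python
from statistics import pstdev
from typing import Sequence

UNKNOWN = "<UNK>"


def attribute(proba_row: Sequence[float],
              authors: Sequence[str],
              threshold: float = 0.05) -> str:
    """Return the most probable author, or <UNK> if the row is too flat."""
    # A nearly uniform row means no candidate stands out from the others.
    if pstdev(proba_row) < threshold:
        return UNKNOWN
    best = max(range(len(authors)), key=proba_row.__getitem__)
    return authors[best]
```

A row such as [0.34, 0.33, 0.33] has a standard deviation well below 0.05 and is therefore labelled <UNK>, whereas [0.8, 0.1, 0.1] is attributed to the first candidate.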

Model parameters were set by using the PAN-CLEF 2019 development dataset as follows. Features were scaled using the MaxAbsScaler transformer from the Python scikit-learn library, and dimensionality reduction was performed by using a standard PCA implementation. PCA also helps remove correlated features, which is particularly useful in the present case because our models make use of variable-length feature concatenation. The resulting feature sets were submitted to multinomial logistic regression by considering a range of possible alternative values as summarised in Figure 4.

Optimal values for each pipeline were determined by grid search with 3-fold cross validation using an ensemble method. The optimal values that were selected for training our actual models are summarised in Figure 5. In this summary, a sequence as in, e.g., Start=2 and End=5 is intended to represent the concatenation of subsequences [(2, 2), (2, 3), …, (4, 3), (4, 5)], assuming that Start is not greater than End.

In addition to the main experiments presently reported, a large number of alternatives were considered as well. These included the use of BM25 and one-hot representation for feature extraction, and the use of bagging, boosting, multi-layer perceptron, decision tree induction and other learning methods. All these results were however below those obtained by the present approach, and were therefore discarded.

6 Results

Table 1 presents macro F1 results based on the PAN-CLEF 2019 test dataset and evaluation software [12] as obtained by the original baseline systems, our four individual classifiers, the ensemble approach EACH-USP taken from [2], and by the current method. Baseline systems were trained with their default parameters, and all models were individually optimised by using the parameters described in Figure 5. Best results for each problem are highlighted.
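For reference, the macro F1 measure averages the per-class F1 scores with equal weight for every class. The sketch below is a generic pure-Python definition and not the official evaluation software [12], which may treat some labels differently; the optional labels argument, a hypothetical parameter of this example, restricts the average to a chosen subset of classes.

```python
from typing import Hashable, Optional, Sequence


def macro_f1(y_true: Sequence[Hashable],
             y_pred: Sequence[Hashable],
             labels: Optional[Sequence[Hashable]] = None) -> float:
    """Unweighted mean of per-class F1 scores."""
    if labels is None:
        labels = list(dict.fromkeys(list(y_true) + list(y_pred)))
    if not labels:
        return 0.0
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted instances scores zero here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Because every class contributes equally, a system that performs poorly on a few authors is penalised even if most documents are attributed correctly.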

From these results we notice that the current approach keeps a relatively good performance overall. Figure 6 presents a comparison between macro F1 scores obtained from the SVM baseline, the char n-gram model with variable range Std.charN, and the EACH-USP and current ensemble methods for each target language.

From these results we notice that the use of Rank.char was more effective for the Italian language dataset. Moreover, the task seems to be more challenging in the case of the English dataset than for the other languages.

Finally, Figure 7 presents a comparison among the same methods according to the number of unknown documents under consideration.

Once again, we notice that the percentage of documents of unknown authors had a great impact on all systems under evaluation regardless of other factors.

7 Final Remarks

This paper has proposed an extension to the work in [2] by presenting an approach for open-set cross-domain authorship attribution that relies on fully optimised char n-grams, word n-grams and char-ranking models. To this end, results obtained from the individual models as probability vectors were combined by making use of a soft voting ensemble method, and unknown authors were classified by considering the standard deviation of the final probability vector.

Our current results are generally superior to those obtained by the PAN-CLEF 2019 baseline systems, but were not generally superior to the work in [2]. Although the compact text representation provided by the current Rank.char model does help improve some of our results, the Dist.charN model from [2] remains the most useful knowledge source within this ensemble approach even in the present open-set AA setting.

As future work, we intend to experiment with other kinds of network influence methods, and further customise the PageRank algorithm [10,15] for the AA problem. The use of part-of-speech and embedding channels for AA is also to be investigated.

Acknowledgements

The second author received support from FAPESP grant no. 2016/14223-0.

References

1. Bagnall, D.: Author identification using multi-headed recurrent neural networks. In: Cappellato, L., Ferro, N., Jones, G.J.F., San Juan, E. (eds.) CEUR Workshop Proceedings, vol. 1391, pp. 1-9. CEUR-WS (2015)

2. Custódio, J.E., Paraboni, I.: EACH-USP Ensemble Cross-domain Authorship Attribution: Notebook for PAN at CLEF 2018. In: Cappellato, L., Ferro, N., Nie, J.Y., Soulier, L. (eds.) Working Notes Papers of the CLEF 2018 Evaluation Labs. CEUR Workshop Proceedings, CLEF and CEUR-WS.org (Sep 2018)

3. Custódio, J.E., Paraboni, I.: Multi-channel Open-set Cross-domain Authorship Attribution. In: Working Notes Papers of the Conference and Labs of the Evaluation Forum (CLEF-2019) (to appear). Lugano, Switzerland (2019)

4. Goldberg, Y.: Neural Network Methods in Natural Language Processing. Morgan & Claypool Publishers (2017)

5. Juola, P.: An overview of the traditional authorship attribution subtask. In: CLEF 2012 Evaluation Labs and Workshop, Online Working Notes, Rome, Italy, September 17-20, 2012 (2012), http://ceur-ws.org/Vol-1178/CLEF2012wn-PAN-Juola2012.pdf

6. Kestemont, M.: Function Words in Authorship Attribution. From Black Magic to Theory? In: 3rd Workshop on Computational Linguistics for Literature (CLfL 2014). pp. 59-66 (2014)

7. Kestemont, M., Tschuggnall, M., Stamatatos, E., Daelemans, W., Specht, G., Stein, B., Potthast, M.: Overview of the Author Identification Task at PAN-2018: Cross-domain Authorship Attribution and Style Change Detection. In: Cappellato, L., Ferro, N., Nie, J.Y., Soulier, L. (eds.) Working Notes Papers of the CLEF 2018 Evaluation Labs. CEUR Workshop Proceedings, CLEF and CEUR-WS.org (Sep 2018)

8. Koppel, M., Schler, J., Argamon, S.: Computational Methods in Authorship Attribution. Journal of the Association for Information Science and Technology 60(1), 9-26 (2009)

9. Koppel, M., Seidman, S.: Detecting pseudepigraphic texts using novel similarity measures. Digital Scholarship in the Humanities 33(1), 72-81 (2018)