<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>UniNE at CLEF 2017: Author Clustering</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science Dept., University of Neuchâtel</institution>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <abstract>
        <p>This paper describes and evaluates an effective unsupervised author clustering and authorship linking model called SPATIUM. The suggested strategy can be adapted without any difficulty to different languages (such as Dutch, English, and Greek) and to different text genres (e.g., newspaper articles and reviews). As features, we suggest using the m most frequent terms (isolated words and punctuation symbols) or the m most frequent character n-grams of each text. Applying a simple distance measure, we determine whether there is enough indication that two texts were written by the same author. The evaluations are based on 60 training and 120 test problems (PAN AUTHOR CLUSTERING task at CLEF 2017). Using the most frequent terms results in a higher clustering precision, while using the most frequent character n-grams of letters gives a higher clustering recall. An analysis of the variability of the performance measures indicates that our system performs stably, independently of the underlying text collection, and that our parameter choices did not over-fit the training data.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        The authorship attribution problem is an interesting problem in computational
linguistics but also in applied areas such as criminal investigation and historical studies
where knowing the author of a document (such as a ransom note) may be able to save
lives [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. With the Web 2.0 technologies, the number of anonymous or pseudonymous
texts is increasing and in many cases one person writes in different places about
different topics (e.g., multiple blog posts written by the same author). Therefore,
proposing an effective algorithm for the authorship problem is of real interest. In
this case, the system must regroup all texts by the same author (possibly written about
different topics) into the same group or cluster. A justification supporting the
proposed answer and a probability that the given answer is correct can be given to
improve the confidence attached to the response [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>This author clustering task is more demanding than the classical authorship
attribution problem. Given a document collection, the task is to group documents
written by the same author such that each cluster corresponds to a different author. The
number of distinct authors whose documents are included is not given. For example,
based on a set of passages extracted from larger documents, we should first determine
the number of authors k and then regroup the texts into k clusters according to their real
author. This task can also be viewed as establishing authorship links between texts and
is related to the PAN 2015 task of authorship verification.</p>
      <p>This paper is organized as follows. The next section presents the test collections and
the evaluation methodology used in the experiments. The third section explains our
proposed algorithm called SPATIUM. Then, we evaluate the proposed scheme on 60
training problems and compare it to the best performing schemes using 120 different
test problems. The last section explains our parameter choices and provides a
sensitivity assessment. A conclusion summarizes the main findings of this study.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Test Collections and Evaluation Methodology</title>
      <p>
        The evaluation was performed using the TIRA platform, an automated tool for the
deployment and evaluation of software [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Data access is restricted such that, during a software run, the system is
encapsulated, ensuring that no data leaks back to the task participants [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. This evaluation procedure also offers a fair
evaluation of the time needed to produce an answer.
      </p>
      <p>
        During the PAN CLEF 2017 evaluation campaign, six corpora (or test collections)
were built, each containing 30 problems (10 for training and 20 for testing). In each
problem, all texts are in the same language and the same text genre, and are
single-authored, but they may differ in length and can be cross-topic [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. The
number of distinct authors is not given. In this context, the task is defined as:
Given a problem of up to 50 short documents, identify
authorship links and groups of documents by the same author.
      </p>
      <p>The six corpora are a combination of one of three languages (English, Dutch, or Greek)
and one of two genres (newspaper articles or reviews). An overview of these corpora
is depicted in Table 1. Considering the six benchmarks we have 120 problems to test
and 60 problems to train (pre-evaluate) our system. The training set was used to
evaluate our approach and the test set was used to compare our results with other
participants of the PAN CLEF 2017 campaign. This year, every participant was allowed
two runs on the test data: one can train and test a basic approach, improve it or submit
a different approach, and then test it again in the second and final run.
For each corpus, we have 10 problems in the training dataset containing the average
number of texts as given under the label “Texts”. The number of distinct authors on
average together with the range for each corpus is indicated in the column “Authors”,
and the average with the minimum and maximum number of authors with only a single
document is presented under the label “Single”. Finally, the average number of terms
(isolated words and punctuation symbols) is given in the column “Terms”. For
example, in the English newspaper collection (training set), 20 texts are written, on
average, by 5.6 authors, and on average 1.8 authors wrote only a single article.
These metrics are not available for the test corpora because the datasets remain
undisclosed thanks to the TIRA system. We only know that the same combinations of
language and genre are present.</p>
      <p>In Table 1 we see that the number of words is rather small. In Figure 1 we show
three texts extracted from a problem containing articles written in the English language.
The represented texts are the full unmodified documents as available in problem001.
Notice that document0014 and document0017 each consist of a single sentence, and the
latter is so short that it would fit in a single tweet (it contains fewer than 140 characters).
When analyzing the texts, we should detect a shared authorship between document0017
and document0018, but not with document0014 as this was written by someone else.
The limited length of those documents is the main difficulty of this year’s author
clustering task.</p>
      <p>document0014:
But the more fastidious are also sensitive
about their reputations – and the risk that
others with shadier professional pasts, alleged
or real, may damage their fundraising.
document0017:
“Can we keep this brief?” enquired
Vaz, now beginning to get a bit
twitchy that the Gambaccini show
was overrunning.
document0018:
“It is absolutely vital that every decision we take, every policy we
pursue, every programme we start, is about giving everyone in our
country the best chance of living a fulfilling and good life,” Dave said.
“And now it’s time for the cameras to leave.” And for the cuts to begin.</p>
      <p>When inspecting the training corpora, the number of words available is rather small
(on average 82 terms per text). Since some authors wrote only a single text, two texts
should only be clustered together if there is enough evidence of shared
authorship.</p>
      <p>During the PAN CLEF 2017 campaign, a system must return two outputs in a JSON
structure for each problem. First, the detected groups should be written to a file
indicating the author clustering. Each text must belong to exactly one cluster; thus, the
clusters must be non-overlapping. Second, a list of text pairs with a probability of
having the same author should be written to another file representing the authorship
links.</p>
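      <p>As an illustration, the two required JSON files might be produced as in the following sketch. The field names (a list of clusters of single-key entries, and a ranked list of scored pairs) are our assumption based on the PAN campaign format and should be checked against the official task description; the document names are hypothetical.</p>
      <p>
```python
import json

# Hypothetical clustering result: each inner list is one author cluster.
clusters = [["document0017.txt", "document0018.txt"], ["document0014.txt"]]

# Hypothetical authorship links with an estimated probability of shared authorship.
links = [("document0017.txt", "document0018.txt", 0.83)]

# Clustering output: a list of clusters, each a list of single-document entries.
clustering = [[{"document": d} for d in cluster] for cluster in clusters]
with open("clustering.json", "w") as f:
    json.dump(clustering, f, indent=2)

# Ranked list of authorship links with their probabilities.
ranking = [{"document1": a, "document2": b, "score": s} for a, b, s in links]
with open("ranking.json", "w") as f:
    json.dump(ranking, f, indent=2)
```
      </p>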
      <p>
        As performance measures, two metrics were used during the PAN CLEF
campaign. The first is the BCubed F1, which evaluates the clustering
output [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This value is the harmonic mean of the precision and recall associated with
each document. The document precision represents how many documents in the same
cluster were written by the same author and therefore measures the purity of the cluster.
Symmetrically, the recall associated with one document represents how many documents
by that author appear in its cluster and therefore measures the completeness of the
cluster.
      </p>
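      <p>A minimal sketch of the BCubed computation, assuming gold authorship labels are available (this is an illustration, not the official evaluation script):</p>
      <p>
```python
def bcubed_f1(clusters, author_of):
    """BCubed F1: clusters is a list of lists of document ids,
    author_of maps each document id to its true author."""
    cluster_of = {d: i for i, c in enumerate(clusters) for d in c}
    docs = list(cluster_of)
    prec_sum = rec_sum = 0.0
    for d in docs:
        same_cluster = clusters[cluster_of[d]]
        same_author = [e for e in docs if author_of[e] == author_of[d]]
        correct = sum(1 for e in same_cluster if author_of[e] == author_of[d])
        prec_sum += correct / len(same_cluster)  # purity of d's cluster
        rec_sum += correct / len(same_author)    # completeness for d's author
    p, r = prec_sum / len(docs), rec_sum / len(docs)
    return 2 * p * r / (p + r)
```
      </p>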
      <p>
        As another measure, the PAN CLEF campaign adopts the mean average precision
(MAP) measure for the authorship links between document pairs [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. This evaluation
measure provides a single-figure measure of quality across recall levels. The MAP is
roughly the average area under the precision-recall curve over a set of problems.
It therefore places more emphasis on the first positions, and a
misclassification with a low probability is penalized less. MAP does not punish
verbosity, i.e., every true link counts even when it appears near the end of the ranked
list. Therefore, by providing all possible authorship links, one can attempt to maximize
MAP [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
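      <p>For one problem, the average precision over a ranked link list can be sketched as follows (MAP is then the mean of this value over all problems; the function and argument names are ours):</p>
      <p>
```python
def average_precision(ranked_links, true_links):
    """ranked_links: document pairs sorted by decreasing probability;
    true_links: set of frozensets holding the truly co-authored pairs."""
    hits, precision_sum = 0, 0.0
    for rank, pair in enumerate(ranked_links, start=1):
        if frozenset(pair) in true_links:
            hits += 1
            precision_sum += hits / rank  # precision at this recall point
    return precision_sum / max(len(true_links), 1)
```
      </p>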
    </sec>
    <sec id="sec-3">
      <title>3 Simple Clustering Algorithm</title>
      <p>
        We suggest an unsupervised approach based on a simple feature extraction and distance
measure called SPATIUM (Latin word meaning distance). The selected stylistic features
correspond to the top m most frequent terms (isolated words, without stemming but with
punctuation symbols) in the first run, as in last year's system [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], and additionally the m
most frequent character n-grams in the second run. The features are selected solely
based on the frequency in the query text. For determining the value of m, previous
studies have shown that a value between 200 and 300 tends to provide the best
performance [
        <xref ref-type="bibr" rid="ref10 ref2">2, 10</xref>
        ]. Since the texts are only paragraphs, the effective number of features
m was capped at 200 but in most cases stayed well below this limit. The length of the n-grams
was set to n=6 characters to ease the analysis of the most pertinent features. Unlike in
the previous year [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], we did not remove the words appearing only once (hapax
legomena) in the text, due to the limited size of each document (see Table 1). For
instance, in document0017 depicted in Figure 1, every term would be deleted if
hapax legomena were ignored.
      </p>
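      <p>The feature extraction above can be sketched as follows. The exact tokenizer is our assumption; the paper only specifies that isolated words and punctuation symbols count as separate terms:</p>
      <p>
```python
import re
from collections import Counter

def top_terms(text, m=200):
    # Words and punctuation symbols count as separate terms (no stemming).
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    return Counter(tokens).most_common(m)

def top_char_ngrams(text, n=6, m=200):
    # Overlapping character n-grams of the raw text.
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    return Counter(grams).most_common(m)
```
      </p>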
      <p>To measure the distance between a Text A and another Text B, SPATIUM uses a
variant of the L1-norm called the Canberra distance. In this measure, the absolute
difference of each feature is normalized by the sum of the two corresponding values, as
indicated in Equation 1.</p>
      <p>∆(A, B) = ∑_{i=1}^{m} |PA[fi] − PB[fi]| / (PA[fi] + PB[fi]) (1)
where m indicates the number of features (words and punctuation symbols, or character
n-grams), and PA[fi] and PB[fi] represent the estimated occurrence probability of the
feature fi in the first Text A and in the other Text B respectively. To estimate these
probabilities, we divide the feature occurrence frequency (ffi) by the sum of all features
of the corresponding text (n), Prob[fi] = ffi / n, without smoothing and therefore
accepting a probability of 0.0 in Text B. This distance measure is not symmetric due
to the choice of the features to be included in the computation.</p>
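      <p>Equation 1 can be implemented directly. A minimal sketch, in which the tokenized texts and the m=200 cap follow the description above while the function name is ours:</p>
      <p>
```python
from collections import Counter

def spatium_distance(text_a_tokens, text_b_tokens, m=200):
    """Canberra-style distance of Eq. 1; the features are the m most
    frequent terms of Text A, probabilities are relative frequencies."""
    freq_a, freq_b = Counter(text_a_tokens), Counter(text_b_tokens)
    n_a, n_b = len(text_a_tokens), len(text_b_tokens)
    features = [t for t, _ in freq_a.most_common(m)]
    dist = 0.0
    for f in features:
        pa = freq_a[f] / n_a
        pb = freq_b[f] / n_b  # may be 0.0: no smoothing
        dist += abs(pa - pb) / (pa + pb)  # pa is always positive here
    return dist
```
      </p>
      <p>Because the features are drawn from Text A only, swapping the two arguments can change the result, which is the asymmetry mentioned above.</p>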
      <p>
        Observing a small value for ∆ provides evidence that both documents were written
by the same author. On the other hand, a large value suggests the opposite. The real
problem consists in defining precisely what a “small distance value” is. To verify
whether the resulting ∆ value is small or rather large, a comparison basis must be
determined.
      </p>
      <p>To achieve this within a specific problem, the distance from Text A to all other texts
is computed (denoted ∆(A, ·)). From this distribution, the mean (denoted µ(A, ·)) and the
standard deviation (σ(A, ·)) are estimated. Moreover, the distribution of distance
values to Text B (∆(·, B)) can be computed to provide the mean µ(·, B) and the
standard deviation σ(·, B) of the intertextual distances to Text B.</p>
      <p>
        As a first definition of a “small” distance, we can assume that a small distance value
from Text A must respect Eq. 2, in which γ is a parameter to be fixed:
Rule 1: ∆(A, B) ≤ δ(A, ·) = µ(A, ·) − γ · σ(A, ·) (2)
Similarly, a small distance to Text B can be defined as:
Rule 2: ∆(A, B) ≤ δ(·, B) = µ(·, B) − γ · σ(·, B) (3)
With these two decision rules, one can verify whether a distance ∆ is small in comparison
with all distances from Text A (Eq. 2) or all distances to Text B (Eq. 3). In the same
way, we propose to create two additional decision rules with Eq. 4 (based on the distribution
of distance values from Text B) and Eq. 5 (for distances to Text A) as follows:
Rule 3: ∆(A, B) ≤ δ(B, ·) = µ(B, ·) − γ · σ(B, ·) (4)
Rule 4: ∆(A, B) ≤ δ(·, A) = µ(·, A) − γ · σ(·, A) (5)
An authorship link between Text A and Text B is expected if at least two of the four hints
are satisfied. For the clustering output, we use the single linkage strategy. For the list
of links, we must rank each pair of texts by the certainty that they share the same
authorship. To determine the probability of a correct author linking, we include both
the number of satisfied hints h and the absolute distance between the two texts in the
computation [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. A link with h hints fulfilled receives a probability between h/5 and
(h + 1)/5, where the final score depends on the other text pairs that also satisfy h hints.
      </p>
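      <p>As we read the description of the four decision rules (Eqs. 2 to 5), hint counting might be sketched as follows; the pairwise distance matrix D and the helper names are hypothetical:</p>
      <p>
```python
import statistics

def small_enough(delta, values, gamma):
    # delta counts as "small" when it lies gamma standard
    # deviations below the mean of the comparison distances.
    mu, sigma = statistics.mean(values), statistics.pstdev(values)
    return mu - gamma * sigma >= delta

def hints(D, i, j, gamma=1.64):
    """Count how many of the four decision rules support a shared
    authorship for texts i and j; D[a][b] holds Delta(a, b)."""
    texts = range(len(D))
    count = 0
    # Distances from Text i (Eq. 2) and from Text j (Eq. 4).
    for row in (i, j):
        values = [D[row][k] for k in texts if k != row]
        if small_enough(D[i][j], values, gamma):
            count += 1
    # Distances to Text j (Eq. 3) and to Text i (Eq. 5).
    for col in (j, i):
        values = [D[k][col] for k in texts if k != col]
        if small_enough(D[i][j], values, gamma):
            count += 1
    return count

# An authorship link between i and j is proposed when at least
# two of the four hints are satisfied.
```
      </p>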
    </sec>
    <sec id="sec-4">
      <title>4 Evaluation</title>
      <p>Since our system is based on an unsupervised approach we could directly evaluate it
using the training set. Table 2a reports the performance measures applied during the
PAN CLEF campaign, namely the BCubed F1 (with the clustering precision and recall)
and the AP, using the most frequent terms from our first run; Table 2b reports the same
measures with the most frequent character 6-grams used in the second run. Each
corpus consists of 10 problems, and the last row reports their average. The
final score is the arithmetic mean between the BCubed F1 and the MAP.</p>
      <p>The algorithm returns similar results over all corpora and seems to work stably,
independently of the text genre and language. But we can see, from the first to the
second approach (from Table 2a to Table 2b), that the precision drops significantly while
the recall increases notably. Overall, the approach with 6-grams results in a slightly
higher performance for the clustering output (BCubed F1), the authorship linking
(MAP), and the Final score (+2.4% difference, +5.3% change).</p>
      <p>The test set is then used to rank the performance of all 6 participants in this task. Based
on the same evaluation methodology, we achieve the results depicted in Table 3a and
Table 3b corresponding to the six test corpora. These results are in line with those
obtained on the training set. Therefore, the system seems to perform stably,
independently of the underlying text collection, and is not over-fitted to the data.</p>
      <p>To put these values in perspective, Table 4 compares our result with those of the other
participants, using macro-averaging for the effectiveness measures and showing the
total runtime, sorted by the final score. Overall, we are ranked 2nd out of 6
approaches.
Generally, there are only small differences in the BCubed F1 between the participants.
Conversely, the MAP shows substantial variations and impacts the final score the most.
The runtime only covers the actual time spent classifying the test set. On TIRA
(http://www.tira.io/task/author-clustering/), it was possible to first train the system
using the training set, which had no influence on the final runtime. Since our system is
unsupervised, it did not need to train any parameters, but this possibility might have
been used by other participants.</p>
      <p>Overall, we achieve excellent results using a rather simple and fast approach in
comparison with the other solutions.</p>
      <p>We are convinced that, in text categorization studies, a deeper analysis of the
evaluation results is important to obtain a better understanding of the advantages and
drawbacks of a suggested scheme. By focusing only on overall performance measures,
we observe a general behavior or trend without being able to explain the proposed
assignments. To achieve this deeper understanding, we
analyzed some problems extracted from the English corpus. The relative
frequency (or probability) differences with very frequent tokens such as the, (comma),
to, or and can explain the decision. The confirmation of an authorship link is in many
cases based on topical words and names that two texts share, like labour, party, people,
Cameron, or work.</p>
    </sec>
    <sec id="sec-5">
      <title>5 Parameter Choices</title>
      <p>Our approach uses a few parameters to solve the clustering task. The main influences
on the performance are the choice of the distance measure, the threshold value γ, and
the feature selection scheme. Basing a decision solely on the outcome on the training
data could lead to over-fitting. A leave-one-out or k-fold cross-validation is not
possible in this task; instead, the bootstrap approach can be used. In this perspective,
for each problem, the system generates S new random bootstrap samples. More
precisely, for each text, we create S = 200 new copies having the same text length.
For each copy the probability of choosing one given feature (word and punctuation
symbol, or n-gram) depends on its relative frequency in the original text. This drawing
is done with replacement; thus, the underlying probabilities are stable. Each resulting
text must be viewed as a bag-of-words. As the syntax is not respected, each bootstrap
text is not readable but still reflects the stylistic aspects as analyzed by the SPATIUM
approach.</p>
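      <p>The bootstrap generation can be sketched as follows. Drawing with replacement reproduces each token's relative frequency in expectation; the function name and seed are ours:</p>
      <p>
```python
import random

def bootstrap_samples(tokens, S=200, seed=42):
    """Create S bootstrap copies of a text: each copy draws the same
    number of tokens with replacement, so a token's chance of being
    drawn equals its relative frequency in the original text."""
    rng = random.Random(seed)
    return [rng.choices(tokens, k=len(tokens)) for _ in range(S)]
```
      </p>
      <p>Each resulting copy is a bag of tokens with scrambled syntax, unreadable as text but preserving the frequency profile that SPATIUM analyzes.</p>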
      <p>
        For each of the original 60 training problems (Table 1) we now have 200 generated
problems of bootstrap samples and can compare different parameter choices. In Table 5
we analyze several distance measures and report the mean of the Final score achieved
with the 200*60 new problems together with the limit of ±2 standard deviations σ
corresponding to a confidence interval of 95.4%. Furthermore, the last two columns
show the mean of the BCubed F1 and MAP over the 200 bootstrap samples and 60
problems.
In a previous study [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] we found that Canberra and Clark work better on average in
author profiling tasks than Cosine and Euclidean. We can again see the same
distinction for this clustering task. In Table 5 we can also see that there is no significant
difference between Canberra, Clark, Matusita, and KLD. For instance, between
Canberra and Clark we only observe a relative change of +1.4% in the mean final score
with the bootstrap approach, which is not a substantial enough improvement to justify
changing our model.
      </p>
      <p>
        The next parameter to optimize is the threshold value γ, which controls how
strict the assignments are. A smaller value for γ generates more
potential links between texts and thus increases the risk of observing incorrect
assignments. For a Gaussian distribution, common choices are γ = 1.96 to take
account of 95%, γ = 1.64 which covers 90%, γ = 1.28 to include 80%, and γ = 1.0
to take in 68.3%. If a corpus is composed of many authors, with each cluster containing
only a few items, the parameter γ should be fixed at a relatively high level. In our
system from 2016, we set γ = 2.0 because of the small average cluster size [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. With
the 2017 datasets, the number of authors with only a single text is lower and there
are more grouped documents. Therefore, we decreased the threshold parameter in
the current system to γ = 1.64.
      </p>
      <p>Figure 2 shows the mean of the BCubed F1, the MAP, and the Final score for
different γ values when using the bootstrap approach. We can see that there was a lot
of potential to improve the clustering outcome (highest line on top, BCubed F1). This
analysis was performed after the completion of the testing stage, where we had fixed γ =
1.64 (shown with red squares in Figure 2). Setting γ = 1.28 would have enhanced the
clustering output by 10% and therefore increased the final performance by 5%. The
benefit of a higher threshold is greater certainty that a given authorship link is
correct, leading to a higher clustering precision. On the other hand, a less
restrictive threshold gives a higher clustering recall. We propose to be more cautious,
mainly because proposing an incorrect assignment must be viewed as more problematic
in many systems (especially legal and law-related ones) than missing a link
between two documents written by the same author.</p>
      <p>Interestingly, the authorship linking seems to produce a constant result (dashed line
on the bottom, MAP) independent of the threshold value γ used.</p>
      <p>Finally, we can evaluate the performance variation on the training data to determine the
optimal length of the character n-grams for our second run. Figure 3 shows the mean
clustering precision, recall, and BCubed F1 for n-gram lengths from n=1
(unigrams) to n=12, based on the bootstrap approach. We can see a convergence from
n=2 to n=9 between the recall (increasing) and the precision (decreasing) before they
diverge again. In our second run, we used 6-grams (shown with red squares in
Figure 3). The highest harmonic mean between precision and recall is achieved using
7-grams, which is only slightly better than the neighboring 6-grams and 8-grams (less
than a 0.5% change).</p>
      <p>Overall, the analysis has shown that the chosen parameters are reasonable but could
have been optimized. On the one hand, choosing Clark instead of Canberra as the
distance measure, or taking n-grams of length n=7 instead of n=6 characters, would
likely not have improved the result noticeably. On the other hand, using a lower
threshold value such as γ = 1.28 instead of γ = 1.64 would have significantly enhanced
the overall clustering performance.</p>
    </sec>
    <sec id="sec-6">
      <title>6 Conclusion</title>
      <p>
        This paper proposes a simple unsupervised technique to solve the author clustering
problem. As features to discriminate between candidate authors, we propose using the
top m most frequent terms (words and punctuation symbols)
or the most frequent character n-grams. This choice was found effective for related tasks such as
authorship attribution [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Moreover, compared to various feature selection strategies
used in text categorization [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], the most frequent terms tend to select the most
discriminative features when applied to stylistic studies [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. To take the author linking
decision, we propose using a simple distance measure called SPATIUM based on a
variant of the L1 norm called Canberra.
      </p>
      <p>
        The proposed approach tends to perform very well in three different languages
(Dutch, English, and Greek) and in two text genres (newspaper articles and reviews).
Such a classifier strategy can be described as having a high bias but a low variance [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
Changing the training data does not drastically change the decision. However, the
suggested approach ignores other significant information such as mean sentence length,
POS (part of speech) distribution, or topical terms. Even if the proposed system cannot
capture all possible stylistic features (bias), changing the available data does not modify
significantly the overall performance (variance).
      </p>
      <p>It is common to fix some parameters (such as time period, size, genre, or length of
the data) to minimize the possible sources of variation in the corpus. However, our
main goal was to present a simple and unsupervised approach without too many
predefined parameters.</p>
      <p>With SPATIUM the proposed clustering decision could be clearly explained because
it is based on a reduced set of features on the one hand and, on the other, those features
are words, punctuation symbols, or long n-grams. Thus, the interpretation for the final
user is clearer than when working with a huge number of features, when dealing with
short n-grams of letters, or when combining several similarity measures. The SPATIUM
decision can be explained by large differences in relative frequencies of frequent words,
corresponding to either functional terms or overused topical words.</p>
      <p>To improve the current classifier, we will investigate the consequences of other
cluster linking strategies. Changing the single linkage strategy to a complete, average,
or centroid linkage strategy could improve the outcome, because a single link could then
no longer merge two bigger clusters and thereby drastically lower the precision.</p>
      <p>Acknowledgments. The author wants to thank the task coordinators for their
valuable efforts to promote test collections in author clustering. This research was
supported, in part, by the NSF under Grant #200021_149665/1.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Amigo</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Artiles</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Verdejo</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <year>2009</year>
          .
          <article-title>A comparison of Extrinsic Clustering Evaluation Metrics based on Formal Constraints</article-title>
          .
          <source>Information Retrieval</source>
          ,
          <volume>12</volume>
          (
          <issue>4</issue>
          ),
          <fpage>461</fpage>
          -
          <lpage>486</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Burrows</surname>
            ,
            <given-names>J.F.</given-names>
          </string-name>
          <year>2002</year>
          .
          <article-title>Delta: A Measure of Stylistic Difference and a Guide to Likely Authorship</article-title>
          .
          <source>Literary and Linguistic Computing</source>
          ,
          <volume>17</volume>
          (
          <issue>3</issue>
          ),
          <fpage>267</fpage>
          -
          <lpage>287</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Gollub</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Burrows</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <year>2012</year>
          . Ousting Ivory Tower Research:
          <article-title>Towards a Web Framework for Providing Experiments as a Service</article-title>
          . In: Hersh,
          <string-name>
            <given-names>B.</given-names>
            ,
            <surname>Callan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Maarek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            , &amp;
            <surname>Sanderson</surname>
          </string-name>
          , M. (eds.)
          <source>SIGIR. The 35th International ACM</source>
          ,
          <volume>1125</volume>
          -
          <fpage>1126</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Hastie</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tibshirani</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Friedman</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2009</year>
          .
          <article-title>The Elements of Statistical Learning</article-title>
          .
          <source>Data Mining, Inference, and Prediction</source>
          . Springer-Verlag: New York (NY).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Kocher</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Savoy</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>UniNE at CLEF 2016 Author Clustering: Notebook for PAN at CLEF 2016</article-title>
          . In
          <string-name>
            <surname>Balog</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cappellato</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Macdonald</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          (Eds.),
          <source>CLEF 2016 Labs Working Notes, Évora, Portugal, September 5-8</source>
          ,
          <year>2016</year>
          , Aachen: CEUR.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Kocher</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Savoy</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Author Clustering with an Adaptive Threshold</article-title>
          . In
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>G. J. F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lawless</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kelly</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goeuriot</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mandl</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cappellato</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction, 8th International Conference of the CLEF Association, CLEF</source>
          <year>2017</year>
          , Dublin, Ireland, September 11-14, 2017, Proceedings. (to appear)
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Kocher</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Savoy</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Distance Measures in Author Profiling</article-title>
          .
          <source>Information Processing &amp; Management</source>
          ,
          <volume>53</volume>
          (
          <issue>5</issue>
          ),
          <fpage>1103</fpage>
          -
          <lpage>1119</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raghavan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Schütze</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <year>2008</year>
          .
          <source>Introduction to Information Retrieval</source>
          . Cambridge University Press.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gollub</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>Improving the Reproducibility of PAN's Shared Tasks: Plagiarism Detection, Author Identification, and Author Profiling</article-title>
          . In
          <string-name>
            <surname>Kanoulas</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lupu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clough</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sanderson</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hanbury</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Toms</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          (Eds.),
          <source>CLEF. Lecture Notes in Computer Science</source>
          , vol.
          <volume>8685</volume>
          ,
          <fpage>268</fpage>
          -
          <lpage>299</lpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Savoy</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>Estimating the Probability of an Authorship Attribution</article-title>
          .
          <source>Journal of the Association for Information Science and Technology</source>
          ,
          <volume>67</volume>
          (
          <issue>6</issue>
          ),
          <fpage>1462</fpage>
          -
          <lpage>1472</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Savoy</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>Comparative Evaluation of Term Selection Functions for Authorship Attribution</article-title>
          .
          <source>Digital Scholarship in the Humanities</source>
          ,
          <volume>30</volume>
          (
          <issue>2</issue>
          ),
          <fpage>246</fpage>
          -
          <lpage>261</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Sebastiani</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <year>2002</year>
          .
          <article-title>Machine Learning in Automatic Text Categorization</article-title>
          .
          <source>ACM Computing Surveys</source>
          ,
          <volume>34</volume>
          (
          <issue>1</issue>
          ),
          <fpage>1</fpage>
          -
          <lpage>27</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tschuggnall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verhoeven</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daelemans</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Specht</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>Clustering by Authorship Within and Across Documents</article-title>
          . In
          <source>Working Notes of the CLEF 2016 Evaluation Labs, CEUR Workshop Proceedings</source>
          , CEUR-WS.org.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Tschuggnall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verhoeven</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daelemans</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Specht</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Overview of the Author Identification Task at PAN 2017: Style Breach Detection and Author Clustering</article-title>
          . In
          <source>Working Notes Papers of the CLEF 2017 Evaluation Labs, CEUR Workshop Proceedings</source>
          , CEUR-WS.org.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>