=Paper=
{{Paper
|id=Vol-2696/paper_256
|storemode=property
|title=Overview of the Style Change Detection Task at PAN 2020
|pdfUrl=https://ceur-ws.org/Vol-2696/paper_256.pdf
|volume=Vol-2696
|authors=Eva Zangerle,Maximilian Mayerl,Günther Specht,Martin Potthast,Benno Stein
|dblpUrl=https://dblp.org/rec/conf/clef/ZangerleMSP020
}}
==Overview of the Style Change Detection Task at PAN 2020==
Eva Zangerle<sup>1</sup>, Maximilian Mayerl<sup>1</sup>, Günther Specht<sup>1</sup>, Martin Potthast<sup>2</sup>, and Benno Stein<sup>3</sup>

<sup>1</sup> University of Innsbruck, <sup>2</sup> Leipzig University, <sup>3</sup> Bauhaus-Universität Weimar

pan@webis.de, http://pan.webis.de

===Abstract===
The goal of style change detection is to identify text positions within a multi-author document at which the author switches. Detecting these positions is a crucial part of processing multi-author documents for purposes of authorship identification. In this year’s PAN style change detection task, we asked the participants to answer the following questions for a given document: (1) Given a document, was it written by multiple authors? (2) For each pair of consecutive paragraphs in a given document, is there a style change between these paragraphs? The task is performed and evaluated on two datasets compiled from an English Q&A platform, which differ in their topical breadth (i.e., the number of different topics that are covered in the documents contained). The paper in hand introduces style change detection as a task and its underlying dataset, surveys the participants’ submissions, and analyzes their performance.

===1 Introduction===
The task of style change detection aims at detecting positions of author changes within a collaboratively written text. Previous PAN editions paved the way for PAN’20 by analyzing multi-authored documents for style changes. This includes the identification and clustering of text segments by author in 2016 [25]. In 2017, participants were asked to detect whether a given document has been authored by multiple authors, and in that case, to determine the boundaries at which authorship changes [34]. The results showed that accurately determining such boundaries is still beyond current capabilities.
Hence, in 2018, the task was relaxed by formulating it as a binary classification problem, where the goal was to predict whether a given document is written by a single author or multiple authors [15]. At PAN 2019, this classification task was extended to also predict the number of authors for multi-author documents [35]. In 2020, the task was steered back into its original direction: participants were asked to detect whether a document was authored by one or multiple authors, and the positions of style changes at the paragraph level.

The remainder of this paper is structured as follows. Section 2 presents previous style change detection approaches. Section 3 introduces the style change detection task as part of PAN 2020, along with the datasets and evaluation measures employed. Section 4 summarizes the received submissions, Section 5 analyzes and compares the achieved results, and Section 6 concludes the paper.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September, Thessaloniki, Greece.

===2 Related Work===
Style change detection is closely related to the fields of stylometry, plagiarism detection, and text segmentation. All of them have in common that they rely on intrinsic stylometric analyses of documents, without referring to external documents or corpora for comparison. Hence, stylistic profiles are created that are based on lexical features like character n-grams (e.g., [29, 20]), word frequencies (e.g., [11]) and average word/sentence lengths (e.g., [36]), syntactic features like part-of-speech tag frequencies/structures (e.g., [32]), and structural features such as indentation usage (e.g., [36]). One of the earliest works on style change detection by analyzing stylometric features to detect author boundaries is by Glover and Hirst [9], which aims at identifying inconsistencies of writing style in collaborative documents.
Meyer zu Eißen and Stein [21, 31, 30] were the first to investigate intrinsic plagiarism detection based on style change detection using word frequency classes. Koppel et al. [18, 19] and Akiva and Koppel [1, 2] propose an unsupervised method to decompose multi-author documents into authorial threads by applying clustering methods on lexical features. Tschuggnall et al. [33] proposed an unsupervised decomposition approach based on grammar tree representations, whereas Rexha et al. [24] use stylistic features to predict the number of authors who wrote a text. Bensalem et al. [3] rely on n-grams to identify author style changes. Giannella [8] employs Bayesian modeling to split a document by authorship. Further approaches include that of Graham et al. [10], who utilize neural networks with several stylometric features.

At PAN 2017 [34], the goal was to find the exact positions of authorship changes. This task was mostly tackled by using stylometric features to characterize sentences and paragraphs and detecting boundaries by computing similarities [14, 16], or by applying outlier detection [26]. For the binary classification task of deciding whether a document is single-author or multi-author at PAN 2018 [15], the best-performing system is a stacking ensemble classifier based on lexical and syntactic features extracted via multiple sliding window approaches [37]. Alternatively, deep learning approaches such as convolutional neural networks that operate on a character input [28] and recurrent neural networks operating on parse tree features [12] have been proposed. Other participants used stylometric features to compute the similarity of sentences and paragraphs to find homogeneous text segments that correspond to an individual author [17], or as input to a binary ensemble classifier [27]. In addition, to predict the number of authors of a multi-author document, at PAN 2019, Nath [22] uses two clustering approaches based on token frequencies, whereas Zuo et al.
[38] use a classification ensemble based on lexical, syntactic and word frequency features.

Figure 1. Exemplary documents that illustrate different style change situations and the expected solution for Task 1 (single- or multi-authored document) and Task 2 (position of style changes). The figure shows three example documents made of placeholder paragraphs, each paragraph annotated with its author; the annotations and expected solutions are summarized below.

{| class="wikitable"
! !! Example Document A !! Example Document B !! Example Document C
|-
| Paragraph authors || Author 1, Author 1 || Author 1, Author 2, Author 2 || Author 1, Author 2, Author 2, Author 3
|-
| Task 1 || no (0) || yes (1) || yes (1)
|-
| Task 2 || [0] || [1,0] || [1,0,1]
|}

===3 Style Change Detection Task===
This section introduces the style change detection task, the dataset constructed for it, the performance measures employed, and our evaluation platform on which the task has been deployed.

====3.1 Task Definition====
The goal of style change detection is to segment documents into stylistically homogeneous passages, which can subsequently be utilized for authorship identification and attribution. Hence, style change detection aims at identifying text positions within a given multi-author document at which the author changes. Beforehand, it must be determined whether such a segmentation is necessary at all by checking whether the document in question is indeed a multi-author document. We therefore study the following two tasks:

– Task 1. Given a document, is the document written by multiple authors?

– Task 2. Given a sequence of paragraphs of a (supposedly) multi-author document, is there a style change between any of the paragraphs?

Figure 1 illustrates the possible scenarios and the expected output for the two tasks. Document A does not contain any style changes and hence was authored by a single author; Document B contains a single style change between Paragraphs 1 and 2, and Document C contains two style changes.

In order to render the task more feasible, we ensure that all documents contained in our evaluation dataset are written in the same language (English), and that style changes occur only between paragraphs, not within them (i.e., a single paragraph is always authored by a single author and does not contain any style changes). Moreover, the documents may contain between zero and ten style changes, resulting from at most three different authors.
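The relation between paragraph authorship and the two expected outputs can be illustrated with a minimal sketch. The function name `task_labels` and the integer author ids are hypothetical choices made for illustration; the author sequences are assumptions picked so that the results match the solutions listed for Figure 1.

```python
from typing import List, Tuple


def task_labels(paragraph_authors: List[int]) -> Tuple[int, List[int]]:
    """Derive (Task 1 label, Task 2 labels) from one author id per paragraph."""
    # Task 2: one binary label per pair of consecutive paragraphs,
    # 1 if the author differs across the boundary, 0 otherwise.
    changes = [
        int(a != b)
        for a, b in zip(paragraph_authors, paragraph_authors[1:])
    ]
    # Task 1: 1 if more than one distinct author contributed, 0 otherwise.
    multi_author = int(len(set(paragraph_authors)) > 1)
    return multi_author, changes


# Hypothetical author sequences chosen to match the outputs of Figure 1.
print(task_labels([1, 1]))        # Document A: (0, [0])
print(task_labels([1, 2, 2]))     # Document B: (1, [1, 0])
print(task_labels([1, 2, 2, 3]))  # Document C: (1, [1, 0, 1])
```

Because style changes occur only between paragraphs, a document with n paragraphs always yields exactly n - 1 Task 2 labels.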
====3.2 Dataset Construction====
As with last year’s task [35], the datasets are based on data taken from StackExchange. StackExchange is a popular network of Q&A sites, covering a wide range of topics. We used a dump of the questions and answers on the various StackExchange sites<sup>1</sup> as the basis for our datasets. As a first step, we cleaned this data as follows:

– Removal of questions and answers that contain fewer than 30 characters.

– Removal of questions and answers that were edited by a user different from the one who originally wrote them.

– Removal of the following items within all questions and answers: images, URLs, code snippets, block quotes, and bullet lists along with their contents.

After cleaning the data, we constructed two datasets differing in the number of topics covered, which also enables an investigation into how well approaches are able to deal with topical diversity. For the first dataset, called dataset-narrow, we used only questions and answers belonging to a subset of the StackExchange sites that deal with topics related to computer technology.<sup>2</sup> For the second dataset, called dataset-wide, we used a subset of sites covering a wide range of different topics,<sup>3</sup> including technology, economics, literature, philosophy, and mathematics.

For every site in the subset of sites that was used for the creation of a given dataset, we grouped together all questions and answers written by the same user and split them into paragraphs, removing paragraphs with fewer than 100 characters. This yielded a list of paragraphs for every user on a particular site.

In a next step, we constructed the documents making up the dataset. Each dataset contains an equal number of single-author and multi-author documents. For single-author documents, we selected a random user and drew paragraphs from the paragraph list of that user until the document had a sufficient length (randomly chosen to be between 1,000 and 3,000 words). For multi-author documents, we first randomly chose whether the document should have two or three authors. Then, we randomly constructed a structure for the document (i.e., a sequence of author changes for a set of paragraphs). Based on that, we randomly chose distinct authors from our list of users and drew paragraphs from their paragraph lists until the document had the predetermined structure and the chosen length (again, randomly chosen to be between 1,000 and 3,000 words).

The resulting documents were then split into training, validation, and test sets with approximately 50% of the documents being assigned to the training set, and 25% each being assigned to the validation and test sets. The procedure we used for splitting ensured that every subset contains approximately the same number of single-author and multi-author documents. Finally, we filtered all documents based on their language. As we want our datasets to consist only of English documents, we removed all documents where at least one paragraph was identified as being written in a language other than English. For this, we used the Python library langdetect.<sup>4</sup>

A summary of the parameters for both datasets is given in Table 1. Table 2 shows an overview of the created datasets, including the number of contained documents as well as the average document lengths, partitioned by the number of authors. For development, participants are provided with the documents and ground truth information. For each training and validation document, we provided the number of authors, the StackExchange site the texts were gathered from, the order of the authors within the document, the positions of the style changes, and whether the document was indeed multi-authored.

{| class="wikitable"
|+ Table 1. Parameters for constructing the style change detection datasets.
! Parameter !! Configurations
|-
| Number of collaborating authors || 1-3
|-
| Number of style changes || 0-10
|-
| Document length || 1,000-3,000
|-
| Change positions || between paragraphs
|-
| Document language || English
|}

{| class="wikitable"
|+ Table 2. Dataset overview. Text length is measured as average number of tokens per document.
! Dataset !! Documents !! Documents, 1 author !! Documents, 2 authors !! Documents, 3 authors !! Length, 1 author !! Length, 2 authors !! Length, 3 authors
|-
| Narrow-Train || 3,418 || 1,709 (50.00%) || 854 (24.99%) || 855 (25.01%) || 11,872 || 11,659 || 11,717
|-
| Narrow-Valid. || 1,713 || 855 (49.91%) || 415 (24.23%) || 443 (25.86%) || 11,931 || 11,996 || 11,605
|-
| Narrow-Test || 1,701 || 852 (50.09%) || 426 (25.04%) || 423 (24.87%) || 11,715 || 11,637 || 11,708
|-
| Wide-Train || 8,030 || 4,025 (50.12%) || 1,990 (24.78%) || 2,015 (25.09%) || 11,751 || 12,191 || 12,095
|-
| Wide-Valid. || 4,019 || 2,018 (50.21%) || 969 (24.11%) || 1,032 (25.68%) || 12,113 || 12,113 || 12,069
|-
| Wide-Test || 3,995 || 2,004 (50.16%) || 987 (24.70%) || 1,004 (25.13%) || 12,242 || 12,015 || 11,729
|}

<sup>1</sup> See https://archive.org/details/stackexchange

<sup>2</sup> dataset-narrow contains questions and answers from the following sites: Code Review, Computer Graphics, CS Educators, CS Theory, Data Science, DBA, DevOps, Game Dev, Network Engineering, Raspberry Pi, Serverfault.com, Superuser.com.

<sup>3</sup> dataset-wide contains questions and answers from the following sites: Academia, Astronomy, Bicycles, Biology, Buddhism, Code Review, Coffee, DBA, Earth Science, Economics, Engineering, Fitness, History, Interpersonal, Linguistics, Literature, Mathoverflow.net, Outdoors, Philosophy, Serverfault.com, Skeptics, Sports, Travel, Workplace, Worldbuilding.

<sup>4</sup> https://pypi.org/project/langdetect/

====3.3 Performance Measures====
To evaluate and compare the submitted approaches, we report both the achieved performance for the individual subtasks and their combination as a staged task. Furthermore, we evaluate the approaches on both datasets individually. Submissions are evaluated using the F<sub>α</sub> measure for each document; with α = 1, the measure is the harmonic mean of precision and recall, weighting both equally. For Task 1, we compute the average F<sub>1</sub> measure across all documents, and for Task 2, we use the micro-averaged F<sub>1</sub> measure across all documents.
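To make the measure concrete, the following minimal sketch computes F1 as the harmonic mean of precision and recall and micro-averages it for Task 2 by pooling the counts over all paragraph boundaries of all documents. It is an illustration under stated assumptions, not the official evaluator: the function names are hypothetical, and treating a style change (label 1) as the positive class is an assumption.

```python
from typing import List


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 as the harmonic mean of precision and recall (alpha = 1)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def micro_f1(truth_docs: List[List[int]], pred_docs: List[List[int]]) -> float:
    """Micro-averaged F1: pool the counts of all documents, then compute F1.

    Assumption: label 1 (style change) is the positive class.
    """
    tp = fp = fn = 0
    for truth, pred in zip(truth_docs, pred_docs):
        if len(truth) != len(pred):
            raise ValueError("each document needs one prediction per boundary")
        for t, p in zip(truth, pred):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    return f1_from_counts(tp, fp, fn)


# Two documents with ground truth [1, 0] and [1, 0, 1]; one change is missed.
score = micro_f1([[1, 0], [1, 0, 1]], [[1, 0], [0, 0, 1]])
print(round(score, 4))  # 0.8 (precision 1.0, recall 2/3)
```

Micro-averaging weights every paragraph boundary equally, so documents with many paragraphs contribute more to the score than short ones.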
The submissions for the two datasets are evaluated independently and the resulting F<sub>1</sub> measures for the two tasks are averaged across datasets.

====3.4 Evaluation Framework====
To ensure the reproducibility of the submitted solutions, participants were asked to deploy their software on our TIRA platform [23]. Each participant was assigned a virtual machine on TIRA, where the software had to be set up with the only constraint of being executable via a POSIX command. The web frontend of TIRA allows for configuring pieces of software that are deployed within a participant’s virtual machine, and for remotely executing them via an appropriate command. This enabled participants both to test their software on the freely available training and validation datasets and to self-evaluate their software on the test dataset, which is not freely accessible. TIRA prevents direct access by participants by moving the virtual machine into a secure sandbox before enabling a deployed software to process a test dataset. This way, TIRA enables blind evaluation, thus foreclosing optimization against the test data. Runs resulting from processing the training, validation, or test data can be evaluated using the aforementioned evaluation measure at the click of a button.

===4 Survey of Submissions===
For this year’s edition of the style change detection task, we received three submissions. However, only two participating teams submitted a working notes paper. In the following, we describe the approaches used in those submissions.

====4.1 Mixed Style Feature Representation and B-maximal Clustering====
The approach developed by Castro-Castro et al. [6] makes use of a variant of B0-maximal clustering to solve the style change detection task. First, they formulate a representation for a paragraph as a set of 185 stylometric features, consisting of character-based, lexical, and syntactic features, but excluding features which explicitly capture the semantics of the given text.
The features are divided into three different categories: boolean features, features consisting of a single floating-point number, and features consisting of vectors of numbers. For each of these categories, a comparison criterion is defined which expresses whether a given feature of two different paragraphs is “similar” or not. Then, the similarity between two paragraphs is defined to be the number of similar features between them. Based on this, B0-maximal clustering is performed to group the paragraphs in a document into clusters, where every cluster is regarded as one author. This clustering approach assigns all paragraphs with a similarity larger than a defined threshold to the same cluster.

Since this makes it possible for a paragraph to be assigned to multiple clusters, and hence to multiple authors, a basic approach for deciding to which cluster a paragraph will be assigned is proposed. In such a case, from all possible candidate clusters, the one which contains the paragraph that occurs earliest in the document is chosen. Thus, all the paragraphs in a document are assigned to authors. From this, the tasks posed in this year’s style change detection task are solved as follows: For the first task, it is simply checked whether the clustering has produced more than one cluster. For the second task, the positions in the document are identified where consecutive paragraphs were assigned to different clusters.

{| class="wikitable"
|+ Table 3. Overall results for the style change detection task, ranked by average F<sub>1</sub>.
! Participant !! Task 1 F<sub>1</sub> !! Task 2 F<sub>1</sub> !! Avg. F<sub>1</sub>
|-
| Iyer and Vosoughi || 0.6401 || 0.8567 || 0.7484
|-
| Castro-Castro et al. || 0.5399 || 0.7579 || 0.6489
|-
| Nath || 0.5204 || 0.7526 || 0.6365
|-
| Baseline (random) || 0.5007 || 0.5001 || 0.5004
|}

====4.2 Style Change Detection Using BERT====
The approach of Iyer and Vosoughi [13] is based on using Google’s BERT language model [7] as a feature extractor, and random forests as a classifier.
First, the documents contained in the dataset are split into sentences, and every sentence is fed to BERT, taking the outputs of the last four BERT layers to represent a given sentence. Since the size of the feature matrix produced by this depends on the number of tokens in a sentence, the values along the length dimension are summed to obtain a feature matrix of a fixed length. After this, representations are formulated for consecutive pairs of paragraphs (to solve the second task), and the whole document (to solve the first task), based on the representations of sentences, by summing (paragraphs) or averaging (whole documents) the feature values of the sentences that make up the paragraph or document. These feature representations are then used to train random forest models for both tasks.

===5 Evaluation Results===
The results for the participants’ submissions as well as a random baseline are given in Table 3. The table shows the F<sub>1</sub> score for both tasks, as well as the overall average scores. Both participants’ submissions significantly outperformed the baseline with respect to individual and overall score. The best-performing submission is the one by Iyer and Vosoughi, which achieved the best scores in both tasks as well as in the overall average score. The approach developed by Castro-Castro et al. performs significantly better than the random baseline, but also significantly worse than the approach by Iyer and Vosoughi, forming a middle ground. The approach of Nath<sup>5</sup> performs only slightly worse than that of Castro-Castro et al.

<sup>5</sup> The participant did not submit their working notes and was hence omitted from further analysis.

Figure 2. Overall performance of the submitted approaches regarding (a) single-author documents, (b) multi-author documents, and (c) dependent on the number of authors per document. (d) The table shows the F<sub>1</sub> scores achieved dependent on topic diversity. Panels (a) to (c) plot the F<sub>1</sub> score (0.0 to 1.0) of Castro-Castro et al. and Iyer and Vosoughi; panels (a) and (b) cover Task 1 and Task 2, panel (c) covers documents with 1, 2, and 3 authors. The plots are not reproduced here.

{| class="wikitable"
|+ (d) Topic Diversity
! Participant !! Task 1 Narrow !! Task 1 Wide !! Task 2 Narrow !! Task 2 Wide
|-
| Iyer and Vosoughi || 0.7042 || 0.5760 || 0.8823 || 0.8310
|-
| Castro-Castro et al. || 0.5379 || 0.5419 || 0.8242 || 0.6915
|}

We further analyzed the performance of both approaches with regard to the specific properties of the documents in our datasets. First, we compared both approaches with respect to single-author versus multi-author documents. The results for this analysis are shown in Figures 2a and b. The approach by Iyer and Vosoughi reaches an F<sub>1</sub> score of almost 1.0 for Task 2 on single-author documents. This suggests that it may be beneficial for them to reduce their approach to one model predicting style changes between paragraphs, and then calculating predictions for Task 1 based on the output of that model (i.e., predicting a document to be multi-author if and only if there was at least one style change predicted between the paragraphs of that document). Another point to note is that the approach by Castro-Castro et al. performed best for Task 1 on multi-author documents. This suggests that their model is especially well-suited for detecting documents that have been written by more than one author.

Moreover, we analyzed how the performance for both submitted approaches changes depending on the number of authors. The results for this analysis are shown in Figure 2c, confirming that the approach of Castro-Castro et al. performs better for multi-author documents, regardless of whether the number of authors is two or three.
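The reduction suggested earlier in this section, in which Task 1 predictions are derived from the output of a single Task 2 model, amounts to a one-line rule. The sketch below is only an illustration; the function name `task1_from_task2` and the document ids are hypothetical.

```python
from typing import Dict, List


def task1_from_task2(changes: List[int]) -> int:
    """Predict multi-author (1) iff at least one style change was predicted."""
    return int(any(changes))


# Hypothetical Task 2 predictions, one list of boundary labels per document.
task2_predictions: Dict[str, List[int]] = {
    "doc-1": [0, 0, 0],
    "doc-2": [0, 1, 0, 0],
}
task1_predictions = {
    doc_id: task1_from_task2(changes)
    for doc_id, changes in task2_predictions.items()
}
print(task1_predictions)  # {'doc-1': 0, 'doc-2': 1}
```

Under this rule the two tasks can never contradict each other, at the cost of making every false positive of the Task 2 model on a single-author document a Task 1 error as well.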
It is interesting that Castro-Castro et al.’s approach improves for multi-author documents, whereas that of Iyer and Vosoughi performs best for single-author documents, exhibiting a sharp drop in performance when a document is written by multiple authors.

Finally, we analyzed the performance of the participants’ approaches dependent on topic diversity (see Figure 2d). In most cases, we found a significant difference in performance between both datasets. The exception to this is the approach by Castro-Castro et al. on Task 1, where the performance on the narrow and wide datasets is almost identical. In all other cases, the performance differs significantly, with performance on the narrow dataset being higher than on the wide dataset, implying that dealing with documents of a diverse topical variety renders the task more difficult.

===6 Conclusion===
In the 2020 edition of the PAN style change detection task, we asked participants to answer the following questions for a given document: (1) Given a document, was it written by multiple authors? (2) For each pair of consecutive paragraphs in a given document, is there a style change between these paragraphs? Three participants submitted their systems and two participants submitted a working notes paper. The two approaches differed fundamentally, the best-performing system relying on semantic features (i.e., BERT embeddings), while the second-best approach focused on syntactic ones. Future challenges include finding the exact position of authorship changes beyond the paragraph level, and assigning paragraphs to individual authors.

===Bibliography===
[1] Akiva, N., Koppel, M.: Identifying distinct components of a multi-author document. In: Memon, N., Zeng, D. (eds.) 2012 European Intelligence and Security Informatics Conference, EISIC 2012, Odense, Denmark, August 22-24, 2012, pp.
205–209, IEEE Computer Society (2012), https://doi.org/10.1109/EISIC.2012.16, URL https://doi.org/10.1109/EISIC.2012.16 [2] Akiva, N., Koppel, M.: A generic unsupervised method for decomposing multi-author documents. JASIST 64(11), 2256–2264 (2013), https://doi.org/10.1002/asi.22924, URL https://doi.org/10.1002/asi.22924 [3] Bensalem, I., Rosso, P., Chikhi, S.: Intrinsic plagiarism detection using n-gram classes. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1459–1464, Association for Computational Linguistics, Doha, Qatar (Oct 2014), https://doi.org/10.3115/v1/D14-1153, URL https://www.aclweb.org/anthology/D14-1153 [4] Cappellato, L., Ferro, N., Goeuriot, L., Mandl, T. (eds.): CLEF 2017 Evaluation Labs and Workshop – Working Notes Papers, 11-14 September, Dublin, Ireland, CEUR Workshop Proceedings, CEUR-WS.org (2017), ISSN 1613-0073, URL http://ceur-ws.org/Vol-1866/ [5] Cappellato, L., Ferro, N., Nie, J.Y., Soulier, L. (eds.): CLEF 2018 Evaluation Labs and Workshop – Working Notes Papers, 11-14 September, Avignon, France, CEUR Workshop Proceedings, CEUR-WS.org (2018), ISSN 1613-0073 [6] Castro-Castro, D., Rodríguez-Lozada, C.A., noz, R.M.: Mixed Style Feature Representation and B-maximal Clustering for Style Change Detection. In: Cappellato, L., Ferro, N., Névéol, A., Eickhoff, C. (eds.) CLEF 2020 Labs and Workshops, Notebook Papers, CEUR-WS.org (Sep 2020) [7] Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018) [8] Giannella, C.: An improved algorithm for unsupervised decomposition of a multi-author document. JASIST 67(2), 400–411 (2016), https://doi.org/10.1002/asi.23375, URL https://doi.org/10.1002/asi.23375 [9] Glover, A., Hirst, G.: Detecting stylistic inconsistencies in collaborative writing. In: The New Writing Environment, pp. 
147–168, Springer (1996) [10] Graham, N., Hirst, G., Marthi, B.: Segmenting documents by stylistic character. Natural Language Engineering 11(4), 397–416 (2005) [11] Holmes, D.I.: The evolution of stylometry in humanities scholarship. Literary and Linguistic Computing 13(3), 111–117 (1998) [12] Hosseinia, M., Mukherjee, A.: A Parallel Hierarchical Attention Network for Style Change Detection—Notebook for PAN at CLEF 2018. In: [5] [13] Iyer, A., Vosoughi, S.: Style Change Detection Using BERT. In: Cappellato, L., Ferro, N., Névéol, A., Eickhoff, C. (eds.) CLEF 2020 Labs and Workshops, Notebook Papers, CEUR-WS.org (Sep 2020) [14] Karaś, D., Śpiewak, M., Sobecki, P.: OPI-JSA at CLEF 2017: Author Clustering and Style Breach Detection—Notebook for PAN at CLEF 2017. In: [4], URL http://ceur-ws.org/Vol-1866/ [15] Kestemont, M., Tschuggnall, M., Stamatatos, E., Daelemans, W., Specht, G., Stein, B., Potthast, M.: Overview of the Author Identification Task at PAN-2018: Cross-Domain Authorship Attribution and Style Change Detection. In: Working Notes Papers of the CLEF 2018 Evaluation Labs. Avignon, France, September 10-14, 2018/Cappellato, Linda [edit.]; et al., pp. 1–25 (2018) [16] Khan, J.: Style Breach Detection: An Unsupervised Detection Model—Notebook for PAN at CLEF 2017. In: [4], URL http://ceur-ws.org/Vol-1866/ [17] Khan, J.: A Model for Style Change Detection at a Glance—Notebook for PAN at CLEF 2018. In: [5] [18] Koppel, M., Akiva, N., Dershowitz, I., Dershowitz, N.: Unsupervised decomposition of a document into authorial components. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 1356–1364, Association for Computational Linguistics, Portland, Oregon, USA (Jun 2011), URL https://www.aclweb.org/anthology/P11-1136 [19] Koppel, M., Akiva, N., Dershowitz, I., Dershowitz, N.: Unsupervised decomposition of a document into authorial components. In: Lin, D., Matsumoto, Y., Mihalcea, R. (eds.) 
The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference, 19-24 June, 2011, Portland, Oregon, USA, pp. 1356–1364, The Association for Computer Linguistics (2011), URL http://www.aclweb.org/anthology/P11-1136 [20] Koppel, M., Schler, J., Argamon, S.: Computational methods in authorship attribution. Journal of the American Society for Information Science and Technology 60(1), 9–26 (2009) [21] Meyer zu Eißen, S., Stein, B.: Intrinsic Plagiarism Detection. In: Lalmas, M., MacFarlane, A., Rüger, S., Tombros, A., Tsikrika, T., Yavlinsky, A. (eds.) Advances in Information Retrieval. 28th European Conference on IR Research (ECIR 2006), Lecture Notes in Computer Science, vol. 3936, pp. 565–569, Springer, Berlin Heidelberg New York (2006), ISBN 3-540-33347-9, ISSN 0302-9743, https://doi.org/10.1007/11735106_66 [22] Nath, S.: UniNE at PAN-CLEF 2019: Style Change Detection by Threshold Based and Window Merge Clustering Methods. In: Cappellato, L., Ferro, N., Losada, D., Müller, H. (eds.) CLEF 2019 Labs and Workshops, Notebook Papers, CEUR-WS.org (Sep 2019) [23] Potthast, M., Gollub, T., Wiegmann, M., Stein, B.: TIRA Integrated Research Architecture. In: Ferro, N., Peters, C. (eds.) Information Retrieval Evaluation in a Changing World, The Information Retrieval Series, Springer, Berlin Heidelberg New York (Sep 2019), ISBN 978-3-030-22948-1, https://doi.org/10.1007/978-3-030-22948-1_5 [24] Rexha, A., Klampfl, S., Kröll, M., Kern, R.: Towards a more fine grained analysis of scientific authorship: Predicting the number of authors using stylometric features. In: Mayr, P., Frommholz, I., Cabanac, G. (eds.) Proceedings of the Third Workshop on Bibliometric-enhanced Information Retrieval co-located with the 38th European Conference on Information Retrieval (ECIR 2016), Padova, Italy, March 20, 2016., CEUR Workshop Proceedings, vol. 1567, pp. 
26–31, CEUR-WS.org (2016), URL http://ceur-ws.org/Vol-1567/paper3.pdf [25] Rosso, P., Rangel, F., Potthast, M., Stamatatos, E., Tschuggnall, M., Stein, B.: Overview of PAN’16—New Challenges for Authorship Analysis: Cross-genre Profiling, Clustering, Diarization, and Obfuscation. In: Fuhr, N., Quaresma, P., Larsen, B., Gonçalves, T., Balog, K., Macdonald, C., Cappellato, L., Ferro, N. (eds.) Experimental IR Meets Multilinguality, Multimodality, and Interaction. 7th International Conference of the CLEF Initiative (CLEF 16), Springer, Berlin Heidelberg New York (Sep 2016), ISBN 978-3-319-44564-9, https://doi.org/10.1007/978-3-319-44564-9_28 [26] Safin, K., Kuznetsova, R.: Style Breach Detection with Neural Sentence Embeddings—Notebook for PAN at CLEF 2017. In: [4], URL http://ceur-ws.org/Vol-1866/ [27] Safin, K., Ogaltsov, A.: Detecting a Change of Style Using Text Statistics—Notebook for PAN at CLEF 2018. In: [5] [28] Schaetti, N.: Character-based Convolutional Neural Network for Style Change Detection—Notebook for PAN at CLEF 2018. In: [5], URL http://ceur-ws.org/Vol-2125/ [29] Stamatatos, E.: Intrinsic Plagiarism Detection Using Character n-gram Profiles. In: Notebook Papers of the 5th Evaluation Lab on Uncovering Plagiarism, Authorship and Social Software Misuse (PAN), Amsterdam, The Netherlands (Sep 2011) [30] Stein, B., Lipka, N., Prettenhofer, P.: Intrinsic Plagiarism Analysis. Language Resources and Evaluation (LRE) 45(1), 63–82 (Mar 2011), ISSN 1574-020X, https://doi.org/10.1007/s10579-010-9115-y [31] Stein, B., Meyer zu Eißen, S.: Intrinsic Plagiarism Analysis with Meta Learning. In: Stein, B., Koppel, M., Stamatatos, E. (eds.) 1st Workshop on Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection (PAN 2007) at SIGIR, pp. 45–50 (Jul 2007), ISSN 1613-0073, URL http://ceur-ws.org/Vol-276 [32] Tschuggnall, M., Specht, G.: Countering Plagiarism by Exposing Irregularities in Authors’ Grammar. 
In: Proceedings of the European Intelligence and Security Informatics Conference (EISIC), pp. 15–22, IEEE, Uppsala, Sweden (Aug 2013) [33] Tschuggnall, M., Specht, G.: Automatic decomposition of multi-author documents using grammar analysis. In: Klan, F., Specht, G., Gamper, H. (eds.) Proceedings of the 26th GI-Workshop Grundlagen von Datenbanken, Bozen-Bolzano, Italy, October 21st to 24th, 2014, CEUR Workshop Proceedings, vol. 1313, pp. 17–22, CEUR-WS.org (2014), URL http://ceur-ws.org/Vol-1313/paper_4.pdf [34] Tschuggnall, M., Stamatatos, E., Verhoeven, B., Daelemans, W., Specht, G., Stein, B., Potthast, M.: Overview of the Author Identification Task at PAN-2017: Style Breach Detection and Author Clustering. In: Cappellato, L., et al. (eds.) Working Notes Papers of the CLEF 2017 Evaluation Labs, pp. 1–22 (2017) [35] Zangerle, E., Tschuggnall, M., Specht, G., Potthast, M., Stein, B.: Overview of the Style Change Detection Task at PAN 2019. In: Cappellato, L., Ferro, N., Losada, D., Müller, H. (eds.) CLEF 2019 Labs and Workshops, Notebook Papers, CEUR-WS.org (Sep 2019), URL http://ceur-ws.org/Vol-2380/ [36] Zheng, R., Li, J., Chen, H., Huang, Z.: A framework for authorship identification of online messages: Writing-style features and classification techniques. Journal of the American Society for Information Science and Technology 57(3), 378–393 (2006) [37] Zlatkova, D., Kopev, D., Mitov, K., Atanasov, A., Hardalov, M., Koychev, I., Nakov, P.: An Ensemble-Rich Multi-Aspect Approach for Robust Style Change Detection—Notebook for PAN at CLEF 2018. In: [5], URL http://ceur-ws.org/Vol-2125/ [38] Zuo, C., Zhao, Y., Banerjee, R.: Style Change Detection with Feedforward Neural Networks Notebook for PAN at CLEF 2019. In: Cappellato, L., Ferro, N., Losada, D., Müller, H. (eds.) CLEF 2019 Labs and Workshops, Notebook Papers, CEUR-WS.org (Sep 2019)