Mixed Style Feature Representation and B0 -maximal Clustering for Style Change Detection Notebook for PAN at CLEF 2020 Daniel Castro-Castro1 , Carlos Alberto Rodríguez-Losada1 , and Rafael Muñoz2 1 Oriente University, Santiago de Cuba, Cuba {danielbaldauf, carlosarl1999}@gmail.com 2 Department of Software and Computing systems, Alicante University, Spain rafael@dlsi.ua.es http://www.dlsi.ua.es Abstract The goal of Style Change Detection task in a document is to deter- mine if it was written by more than one author and in such case, to delimit which paragraph (or more generally a portion of text) corresponds to each one of them. The objective of our proposal is to build a paragraph representation based on general Style Feature computed considering characters, lexical and syntactic fea- tures, without the use of semantic words. The paragraphs were grouped employ- ing a non overlapped variant of the B0 -maximal clustering algorithm, where the overlapping was eliminated considering the order of paragraphs in the document. 1 Introduction Authorship detection is important for determining which author or group of authors should get credit for writing a given document. In particular, in our digital modern society, it is a complex task when the objective is to determine who wrote a piece of digital text or if a document could be written by more than one author. Thanks to the research community and in particular to the organizers of PAN 3 evaluation forum [4], in recent years there is a growing interest in sharing methods and algorithms to solve many of the tasks involved in Authorship Attribution (AA). One of these tasks is the Style Change Detection in a document, with the purpose of detection if a document was written by only one author or more than one, and in the last scenario, what piece of text corresponds to each one of the authors [1]. Overviews of the past style change detection task [5][3][6] resumed the description of the task, approaches presented by participants and the results obtained. It is important to highlight, that a priori, there is no information about authors or the numbers of them involved in a problem, that’s why, the tasks are mainly solved considering text clustering solutions. Copyright c 2020 for this paper by its authors. Use permitted under Creative Commons Li- cense Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessa- loniki, Greece. 3 http://pan.webis.de One of the key aspects tackled to solve the task, corresponds to, the representation of textual contents, and in the majority of proposals, it was used the Bag of Word model considering lexical and syntactical linguistic features. For clustering algorithms have been used hierarchical and non-hierarchical traditional methods. Also, due to the nature of the task, the clusters of documents may not be overlapped, because a document or a paragraph (depending of the practical problem proposed) belongs to an unique author, so then, a document or paragraph must be part of just one cluster (or group). In section 2 our proposal is described with emphasis in paragraph representation based on the construction of a Mixed Style Set of Features and the clustering algorithm employed. Section 3 presents the results obtained and the conclusions of the work, and in Section 4 a brief discussion of the main problems. 2 Proposal for Style Change Detection - PAN 2020 Our main goal is to determine clusters of paragraphs in which is considered that all paragraphs in a cluster are written by the same author. If the algorithm obtains more than one cluster, then the document was written by more than one author and the number of distinct authors corresponds to the number of clusters. In the next two sections are described the representation of paragraphs and the clus- tering algorithm. The computational representation is based on the formulations pro- posed in Logical Combinatorial Pattern Recognition 4 and the clustering algorithm is explained in [2]. 2.1 Paragraph Mixed Style Feature Representation and Similarity The objective of our proposal was to build a paragraph representation based on general Style Feature computed considering characters, lexical and syntactic features, without the use of content words. In Figure 1, are illustrated the paragraph representation and similarity functions implemented to compare two of them. The paragraph representation is build considering a finite mixed set of 185 features from three types of data values, Boolean, Float or n-gram Vector. At the left corner in Figure 1 there is a section "Features examples" with three examples of the type of features analyzed. Features were structured in six subsets considering different textual layers on the text. These layers are boolean, character, sentence, paragraph, syntactic and the text. Examples for each of the layers subset of features: 1- Boolean layer: Uses the same word to finish a sentence and to begin the next sentence. 2- Character layer: Average length of words. 3- Sentence layer: Average number of words. Average number of distinct prepositions. 4- Paragraph layer: Average number of sentences. Average number of words. 5- Syntactic layer: Proportion of nouns over adjective. 6- Text layer: Average length of sentence. Bag of Words of conjunctions. In order to compare two representations, one comparison criteria (CC) for each type of 4 https://www.uci.cu/reconocimiento-logico-combinatorio-de-patrones Figure 1. Description of paragraph representation and similarity function data feature value is introduced. The three CC formulas are exposed at the right section in Figure 1. For features of Bool type, two features are similar, if they have the same value (true or false), see CCbool . For features of Float type, two features are similar, if the difference between values are less than a predefined threshold, see CCf loat . For features of Vector type, two features are similar if the similarity between them is greater than a predefined threshold, see CCvector . We used M inM ax5 similarity to compare two vector. Finally, the two paragraphs are similar, if the number of features, in which they are similar, are greater than a percentage defined, see F (Pi , Pj ). 2.2 B0 -maximal Clustering Method The clustering proposal generates all the subsets of paragraphs in order to achieve that the similarity between each of the paragraphs in a cluster should be larger than a prede- fined B0 parameter. The B0 -maximal clustering algorithm obtains compact groups of paragraphs and some overlapped groups. For the task, this overlapping needs to be eliminated and we used an approach based on the order of paragraphs in the document. If a paragraph could be part of two or more clusters, it will be considered only in the cluster where a paragraph with the lower index in the order of appearance in the text exists. This decision is based on the assumption that the style in a document are characterized by the style reflected in the firsts para- graphs and in general the main author tend to write the majority of the paragraphs and the firsts one. To accomplish that, the overlapped paragraphs were sorted by their index of appearance in the document. When the cluster assignment was defined, then all edges from these paragraph to other clusters were eliminated. In Figure 2 and Figure 3 are presented an example based on a graph construction, where the vertices are the paragraphs and the number of the vertices, the order of the paragraphs in the document. The edges that connect two vertices represent that the similarity of vertices is greater or equal than a B0 parameter and in our proposal we use a percentage of similar features. The Figure 2 corresponds to the output of the clustering algorithm, and it can be seen that paragraph 4 and 5 could be part of two clusters. Considering the heuristic explained to eliminate the overlapping, the final clusters will correspond to the two illustrated at Figure 3. 3 Evaluation The data-set distributed contains documents for two problems of Style Change Detec- tion, a narrow data-set and a wide data-set [1]. The description of the data and evaluation measures are discussed on the overview published for the task. In Table 1, are resumed the average results for task1 and task2 considering results in both data-set. For task1 the objective was to answer if a document was written by one author or more than one. In task2 had to be answered, in which paragraph (could be more than one) of the text there was a style change. As an additional data, the or- ganizers informed, that a maximum of three authors could be involved in a document, 5 https://rdrr.io/cran/stylo/man/dist.minmax.html Figure 2. Example of B0 -maximal compact cluster graph representation Figure 3. Example of B0 -maximal compact cluster graph representation, with non overlapped clusters but our proposal is not restricted by a predefined number of clusters. Task1 was eval- uated by F1 measure and task2 using micro − F measure. Using train and validation data-set distributed for the task, it was selected the values for parameters B0 , γ and δ, considering all combinations of these three. Table 1. Style Change Detection results. Team task1 task2 iyer20 0.640 0.856 castro20 0.539 0.757 nath20 0.520 0.752 As a baseline we considered for task1 that the answer was always multi-authored, and as the data-set are balanced in the number of problems for multi-authored and single-authored documents, the result is 0.5. Similar result is obtained if the answer were single-authored for all documents. For task 2 we could not compare the results with a baseline. 4 Discussion Using training and validation data-set, we got better results processing the wide data- set than the narrow one, and this is interesting, considering that we did not use content (topic related) words, concluding that the syntactic and structural style features are used differently when the topics change. At the contrary, we got no significant difference of style between authors when they wrote about the same topic. Several of the features get duplicated values, because they capture the same values, considering that two or more structural layers are fused, for example, when the unit to be analyzed as a document is a paragraph, then the paragraph layer and text layer are considered distinct but they are the same. 5 Conclusion and Future Work It was presented a proposal based on a paragraph representation, considering general Style Features at character, lexical and syntactic layers of analysis, without the use of topic or content words. The paragraphs were grouped employing a non overlapped variant of the B0 -maximal clustering algorithm, where the overlapping was eliminated considering the order of the paragraph in the document. As future work could be interesting to combine semantic and topic vector repre- sentation as features of the mixed model in order to distinguish between paragraph of different topics. Also, the heuristic employed to eliminate the overlapping scenarios can be improved, if some characteristics of the groups are considered, for example: the size, strength of similarity or the adjacency of paragraphs. References 1. Eva Zangerle, Maximilian Mayerl, G.S.M.P.B.S.: Overview of the Style Change Detection Task at PAN 2020. In: Cappellato, L., Eickhoff, C., Ferro, N., Névéol, A. (eds.) CLEF 2020 Labs and Workshops, Notebook Papers. CEUR-WS.org (Sep 2020) 2. Gil-García, R., Badía-Contelles, J.M., Pons-Porrata, A.: A parallel algorithm for incremental compact clustering. In: Euro-Par. Lecture Notes in Computer Science, vol. 2790, pp. 310–317. Springer (2003) 3. Kestemont, M., Tschuggnall, M., Stamatatos, E., Daelemans, W., Specht, G., Stein, B., Potthast, M.: Overview of the author identification task at PAN-2018: cross-domain authorship attribution and style change detection. In: Cappellato, L., Ferro, N., Nie, J., Soulier, L. (eds.) Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum, Avignon, France, September 10-14, 2018. CEUR Workshop Proceedings, vol. 2125. CEUR-WS.org (2018), http://ceur-ws.org/Vol-2125/invited_paper_2.pdf 4. Potthast, M., Gollub, T., Wiegmann, M., Stein, B.: TIRA Integrated Research Architecture. In: Ferro, N., Peters, C. (eds.) Information Retrieval Evaluation in a Changing World. Springer (Sep 2019) 5. Tschuggnall, M., Stamatatos, E., Verhoeven, B., Daelemans, W., Specht, G., Stein, B., Potthast, M.: Overview of the author identification task at PAN-2017: style breach detection and author clustering. In: Cappellato, L., Ferro, N., Goeuriot, L., Mandl, T. (eds.) Working Notes of CLEF 2017 - Conference and Labs of the Evaluation Forum, Dublin, Ireland, September 11-14, 2017. CEUR Workshop Proceedings, vol. 1866. CEUR-WS.org (2017), http://ceur-ws.org/Vol-1866/invited_paper_3.pdf 6. Zangerle, E., Tschuggnall, M., Specht, G., Stein, B., Potthast, M.: Overview of the style change detection task at PAN 2019. In: Cappellato, L., Ferro, N., Losada, D.E., Müller, H. (eds.) Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum, Lugano, Switzerland, September 9-12, 2019. CEUR Workshop Proceedings, vol. 2380. CEUR-WS.org (2019), http://ceur-ws.org/Vol-2380/paper_243.pdf