Comparison of distance and similarity measures for stylometric analysis of Lithuanian texts

Daumantas Stanikūnas (1), Justina Mandravickaitė (2), Tomas Krilavičius (3)
(1) Department of Mathematics and Statistics, Vytautas Magnus University, Lithuania. Email: daumantas.stanikunas@fcis.vdu.lt
(2) Baltic Institute of Advanced Technology, Vilnius University, Lithuania. Email: justina@bpti.lt
(3) Baltic Institute of Advanced Technology, Vytautas Magnus University, Lithuania. Email: t.krilavicius@bpti.lt

Copyright © 2017 held by the authors.

Abstract: Constant developments in information and computer technologies make it possible to handle ever-increasing amounts of data, thereby expanding research possibilities. In this article, we discuss and compare distance and similarity measures used in stylometric analysis that could be applied to Lithuanian texts. As the corpus for the analysis, we chose transcripts of parliamentary debates by two politicians of the Lithuanian Parliament. We compared the distance measures, performed a stylometric analysis and visualized the results. The objective of the experiment was to identify which measures perform better in stylometric analysis of Lithuanian texts and to explore where the differences in performance occur. Summarizing the experimental results, the recommendations are as follows: at least 1000 Most Frequent Words should be used; Eder's Simple Delta can be used for general stylometric analysis of transcriptions of debates of the Lithuanian Parliament; when the Most Frequent Words are limited to 2000, the Binomial Index performs better than Eder's Simple Delta and is therefore more suitable.

Index Terms: stylometry; computational stylistics; parliamentary speech; R; statistical analysis; distance measure; similarity measure; data visualization

I. INTRODUCTION

Stylometry is the study of linguistic style, usually applied to written language. It uses a variety of statistical methods to analyze a text in order to determine its author. A common technique is to calculate distances or similarities between texts and to process the output using different visualization methods. A significant number of experiments have already been performed by various researchers in order to find out which measures show better results in different cases of stylometric analysis. F. Jannidis and S. Evert analyzed three collections of novels in English, French and German and showed that the Cosine Delta measure outperformed all other measures on those collections [2], [3], while J. Mandravickaitė explained in her research that measures such as Burrows' Delta would not work well with highly inflected languages (Latin, Polish) and suggested using Eder's Delta instead [10].

This paper presents ongoing experimental work on identifying the most suitable measures for stylometric analysis of Lithuanian texts. The objective of these experiments is to compare, using the R language, the performance of distance and similarity measures already used in stylometry and other fields of research [4], [5], [11], with the focus on transcripts of speeches of politicians in the Lithuanian Parliament.

These experiments cover the domain of transcriptions of debates of the Lithuanian Parliament, which is only a small fraction of the Lithuanian language. However, the transcriptions represent the richness and variety of the language quite well, and hence we expect that the results could be useful in the analysis of other Lithuanian texts.

II. DATA AND METHODS

A. Data Preparation

The data for the analysis were taken from the corpora collected in the ASTRA project (Automatic extraction of style applied to individual authors and groups of authors, http://dangus.vdu.lt/~jkd/?page_id=2). The data consist of transcriptions of debates of the Lithuanian Parliament. For this investigation we use transcripts of two politicians from the 2008-2012 term. The data for the comparison of measures were selected so that the measures would show the biggest possible difference between texts of different authors (or speakers, in our case). For this reason, the authors chosen for the experiments differ in gender and political standing (position and opposition), which strengthens the dissimilarity. This approach differs from similar research by J. Kapočiūtė-Dzikienė, whose goal was to identify the best methods and features (MFW were not used) for authorship attribution based on machine learning [15]. Her approach was to analyze data sets that are as similar as possible while their authors are actually different.

Table I: Distance/similarity measures and their formulas

Distance/Similarity measure | Formula
Manhattan Distance | $\sum_{i=1}^{n} |x_i - y_i|$
Euclidean Distance | $\sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
Canberra Distance | $\sum_{i=1}^{n} \frac{|x_i - y_i|}{|x_i| + |y_i|}$
Cosine Distance | $\frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$
Burrows' Delta | $\frac{1}{n} \sum_{i=1}^{n} \left| \frac{x_i - \mu_i}{\sigma_i} - \frac{y_i - \mu_i}{\sigma_i} \right|$
Argamon's Linear Delta | $\sqrt{\frac{1}{n} \sum_{i=1}^{n} \frac{(x_i - y_i)^2}{\sigma_i^2}}$
Eder's Delta | $\frac{1}{n} \sum_{i=1}^{n} \left( \left| \frac{x_i - y_i}{\sigma_i} \right| \cdot \frac{n - n_i + 1}{n} \right)$
Eder's Simple Delta [9] | $\sum_{i=1}^{n} \left| \sqrt{x_i} - \sqrt{y_i} \right|$
Argamon's Quadratic Delta | $\frac{1}{n} \sum_{i=1}^{n} \frac{1}{\sigma_i^2} (x_i - y_i)^2$
Bray-Curtis Dissimilarity | $\frac{\sum_{i=1}^{n} |x_i - y_i|}{\sum_{i=1}^{n} (x_i + y_i)}$
Kulczynski Distance | $\frac{\sum_{i=1}^{n} |x_i - y_i|}{\sum_{i=1}^{n} \min(x_i, y_i)}$
Jaccard Index | $\frac{2 \cdot \frac{\sum_{i=1}^{n} |x_i - y_i|}{\sum_{i=1}^{n} (x_i + y_i)}}{1 + \frac{\sum_{i=1}^{n} |x_i - y_i|}{\sum_{i=1}^{n} (x_i + y_i)}}$
Gower Similarity | $\frac{1}{n} \sum_{i=1}^{n} \frac{|x_i - y_i|}{\max_i - \min_i}$
Alternative Gower Similarity | $\frac{1}{n_0} \cdot \sum_{i=1}^{n} |x_i - y_i|$
Horn's modification of Morisita's Overlap Index | $\frac{2 \sum_{i=1}^{n} x_i y_i}{\left( \frac{\sum_{i=1}^{n} x_i^2}{\left(\sum_{i=1}^{n} x_i\right)^2} + \frac{\sum_{i=1}^{n} y_i^2}{\left(\sum_{i=1}^{n} y_i\right)^2} \right) \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}$
Mountford Index | $\frac{1}{\alpha}$, where $\alpha$ is the parameter of Fisher's log-series
Binomial Index [5] | $\sum_{i=1}^{n} \frac{x_i \ln\frac{x_i}{2n} + y_i \ln\frac{y_i}{2n} - 2n \ln\frac{1}{2}}{2n}$

Table II: Notation. $x_i$ and $y_i$ are the corresponding $i$-th values of the vectors $X = (x_1, x_2, \ldots, x_n)$ and $Y = (y_1, y_2, \ldots, y_n)$; $n$ is the size of the compared vectors; $\sigma_i$ and $\mu_i$ are the standard deviation and mean of the $i$-th values of all vectors used in the comparison; $n_i$ is the position of the $i$-th value in a vector (usually $n_i = i$); $\min_i$ and $\max_i$ are the minimum and maximum $i$-th values among all compared vectors; $n_0$ is the number of pairs of corresponding $X$ and $Y$ values in which at least one value is equal to 0.

The selected authors were Virginija Baltraitienė (Labor Party Political Group (Darbo partijos frakcija, DPF), which belonged to the opposition at the given time) and Donatas Jankauskas (Homeland Union - Lithuanian Christian Democrat Political Group (Tėvynės sąjungos-Lietuvos krikščionių demokratų frakcija, TS-LKDF)). The mean word counts per text for the two authors are 190.39 and 203.96, respectively, which means that their texts have to be concatenated so that one text object contains around 5000 words. This is necessary because the Lithuanian language has many different words with many different word forms, and in this experiment we do not convert words to their base form. In addition, we consider 5000 words to be sufficient for a text object, because the documents themselves contain as few as 100 words and M. Eder has shown that beyond 5000 words the quality of authorship attribution barely increases [13]. After this step, 16 text objects were created for further analysis (8 for each politician). The features used in the analysis were the most frequent words (MFW) in the sub-sample of the 2 selected authors, sorted in descending order of frequency. More statistics for the corpora and the selected authors are given in Table III.

Table III: Corpora statistics

Global Parameter | Value
Period | March 1990 – December 2013
Number of authors | 147 (18 women and 129 men)
Minimum word count in one text | 100

Selected Author | Number of Texts | Number of Words | Number of Different Words | Mean Number of Words in One Text
Virginija Baltraitienė | 418 | 79584 | 11711 | 190.39
Donatas Jankauskas | 468 | 95453 | 12981 | 203.96

B. Methods

To evaluate the measures, the Z-index was used, which is the difference between the means of standardized values:

$$ z = |\mu_s - \mu_v| = \left| \frac{1}{m_s} \sum_{l=1}^{m_s} \frac{s_l - \mu}{\sigma} - \frac{1}{m_v} \sum_{k=1}^{m_v} \frac{v_k - \mu}{\sigma} \right|, \qquad (1) $$

where $\mu_s$ and $\mu_v$ are the out-group and in-group means of standardized values, $m_s$ and $m_v$ are the out-group and in-group numbers of comparisons, $s_l$ and $v_k$ are the out-group and in-group distance/similarity measurements for the corresponding comparison, and $\mu$ and $\sigma$ are the overall mean and standard deviation of all measurements, excluding comparisons of a document with itself.

The in-group data set contains comparisons of documents of speeches by the same politician, while the out-group data set contains comparisons of documents of speeches by different politicians. In this experiment we assumed that the bigger the difference between the means of standardized values, the better the performance of a measure.

After inspecting the results of the experiment, Cluster Analysis and Multidimensional Scaling were used to visualize the relations among the speeches of the selected politicians, using the better-performing distance measure and the corresponding quantity of MFW. In addition to the Z-index, the dependency of the results on the quantity of MFW (from 100 MFW to 5000 MFW) was analyzed as well.

For the stylometric analysis (calculating word frequencies and values of the distance measures, and plotting the relations among documents) we used R, a free software environment for statistical computing and graphics [1]. R was chosen because it has all the necessary tools for textual data processing, computation, statistical analysis and visualization, and good performance in general; e.g., one R script can be executed to produce all the required results from raw textual data without any additional software.

All distance/similarity measure computations and stylometric analyses were performed using the "stylo" package [9] and the "vegdist" function of the vegan package [16] for R. To make the process more efficient, both were merged into one script, so that all computations and analyses could be executed with a single script.

III. EXPERIMENTAL RESULTS

The objective of this investigation was to identify which measures perform better with different numbers of MFW in stylometric analysis of Lithuanian texts. All the evaluated measures are presented in Tables I and II. Hierarchical Clustering [6], Multidimensional Scaling [7] and Heat Map [8] were applied with the parameters described in Table IV.

Table IV: Settings used for stylometric analysis

Parameter | Value
Corpus format | Plain text
Corpus language | English (ALL)
Analyzed features | Words
N-gram size | 1
Most Frequent Words (MFW) | Amount that showed better results
Start at freq. rank | 1
Culling min | 0
Culling max | 0
Analysis types | Cluster Analysis, Multidimensional Scaling (MDS), Heat map
Distance/Similarity measure | A measure that showed better results
Sampling | No sampling

A. Experiment 1: Initial Experiments

Initial experiments were performed with 100, 1000 and 5000 MFW. In each case all distance measures were sorted according to the calculated Z-index. Naturally, the distance measure with the highest Z-index should be considered the best, but in our experiments the measures showed no clear consistency in their performance: e.g., a measure that showed better performance (i.e., a higher Z-index) with 100 MFW would perform worse with 1000 and 5000 MFW, and vice versa. To investigate further, a second experiment was executed.

B. Experiment 2: Analysis of Distance Measures

To find out exactly how every distance measure behaves when the number of MFW increases, we plotted the Z-index of every measure against the quantity of MFW used, see Fig. 1. All Z-indexes were calculated for 100, 200, ..., 5000 MFW, so every distance measure was displayed as a function of the MFW quantity. The plot contains a total of 50 points for every measure, which was enough to detect their behavior as the quantity of MFW increased.

Figure 1: Difference between means of standardized values of in-group and out-group distances as a function of the MFW amount used. Indicated for the initial distance and similarity measures on the Lithuanian texts.

In Fig. 1 we can see that most distance measures have high Z-index values throughout the plot, but four of them performed very poorly. These four distance (or similarity) measures are Alternative Gower Similarity, Horn's modification of Morisita's Overlap Index, Cosine Distance and Euclidean Distance (see Tables I and II for details). To investigate further, we removed these distance measures from the experiment and concentrated on the remaining ones.

After removing the poorly performing distance measures, we generated a new graph in which the remaining measures could be compared more precisely, see Fig. 2. We can see that up to around 1000 MFW the Z-index always increases, and after this point it either decreases or stays at a similar level as the number of MFW grows.

Figure 2: Difference between means of standardized values of in-group and out-group distances as a function of the MFW amount used. Indicated for the selected distance and similarity measures on the Lithuanian texts.

Analyzing the plot further, we can observe three main groups of distance (or similarity) measures:

1) Measures which showed very good performance until around 2000 MFW were reached. After that, their performance deteriorated very quickly. The best-performing distance measure was the Binomial Index. Other distance measures which behaved similarly were:
   a) Argamon's Quadratic Delta,
   b) Burrows' Delta,
   c) Argamon's Linear Delta,
   d) Gower Similarity.
2) Measures which slowly reached their performance peak only at around 2000 MFW, after which their performance declined very slowly. In general, these distance measures showed very good performance and stability. In this group, Eder's Simple Delta had the best results. Other measures which behaved similarly were:
   a) Eder's Delta,
   b) Mountford Index,
   c) Canberra Distance.
3) Measures whose performance improved continuously with the increasing MFW quantity, though this improvement was extremely gradual. This group performed worse than the first and the second groups until 3000 MFW were reached; after that it still performed worse than the second group, but better than the first group. The best-performing distance measure in this group was the Jaccard Index. Other measures which behaved similarly were:
   a) Manhattan Distance,
   b) Bray-Curtis Dissimilarity,
   c) Kulczynski Distance.

We noticed that Eder's Simple Delta performed very well in comparison with both the first and the second group: it was only around 2-3% worse than the Binomial Index, the best measure of the first group, and within the second group it showed the highest Z-index, which was 1-3% better than that of Eder's Delta. So it could be used not only for bigger quantities of MFW, but for smaller ones as well. For this reason, in this experiment we considered Eder's Simple Delta to be the best-performing distance (or similarity) measure for analyzing Lithuanian texts in general.

C. Experiment 3: Exemplary Stylometric Analysis

We used Eder's Simple Delta for an exemplary stylometric analysis in order to visualize the difference in lexical style (usage of words) between the selected members of the Lithuanian Parliament. For this analysis 2000 MFW were used, as this quantity showed the best performance for the chosen distance measure. The first method used to map relations among the transcribed speeches of the aforementioned politicians was Hierarchical Cluster Analysis (see Fig. 4b). We used agglomerative hierarchical clustering with Ward linkage, where the criterion for choosing the pair of clusters to merge at each step is based on the optimal value of an objective function [14]. As a result, a hierarchy of clusters was generated and visualized as a dendrogram, where on the right side we have separate documents which are linked into clusters according to their similarity until all documents are merged into one cluster. The results showed a clear differentiation between the authors, which also contributed to our conclusion that Eder's Simple Delta performed well.

Figure 3: Heat map generated using the measurement matrix obtained with Eder's Simple Delta and 2000 Most Frequent Words. The upper left corner shows the color palette used to visualize the data, with a histogram drawn on top of it to show the distribution of matrix values.

Figure 4: (a) Multidimensional Scaling method. (b) Hierarchical Cluster Analysis method. Eder's Simple Delta measure and 2000 Most Frequent Words were used. Different colors represent different text authors.

In Fig. 4a, visualization and relation mapping among documents were performed using Multidimensional Scaling. The colors represent different authors of the parliamentary speeches and correspond to those in Fig. 4b. The speeches of the different politicians were separated well, but in this case the distribution of points was stretched along the vertical axis. This behavior might lead to different results if many more authors were analyzed at once. In general, the results were good, which also contributed to our conclusion that the distance measure we used had good overall performance.

In addition to the two methods already used for relation/position mapping of the documents and for visualization, we tried to display the results with a Heat Map. Fig. 3 shows one of the simple ways to visualize stylistic differences among the documents without any further computations. In this example the color palette used for the plot was a gradient from white to dark red, where white means a complete match. Since the documents were sorted by author on the horizontal and vertical axes, rectangular shapes of lighter and darker colors were formed. A light color indicated that the documents were probably written (or the speeches spoken, in our case) by the same author, while a dark color indicated a different author. Since the brightness of the color matched every pair of documents well enough, the Heat Map could be considered to show positive results with regard to Eder's Simple Delta.

To summarize, none of the methods for visualization and relation/position mapping that we used contradicted our conclusion that Eder's Simple Delta performed well in the analysis of Lithuanian texts. However, it is by no means the only measure that produces very good results; other measures could provide similar performance (as seen in Fig. 2). A different corpus, different parameters for text analysis, or a different selection of Most Frequent Words could still affect the performance of every distance (or similarity) measure. To reach a solid conclusion, further research is needed.

IV. CONCLUSIONS

In this experiment we focused on Lithuanian texts, with a corpus composed of speeches from the Lithuanian Parliament. Hence, the following recommendations apply to experimentation with this corpus.

1) The distance measure can be selected according to the quantity of Most Frequent Words used in the analysis.
2) At least 1000 Most Frequent Words should be used. After this point the Z-index value either decreases or shows similar results.
3) If the quantity of MFW does not exceed 5000 by a wide margin, Eder's Simple Delta performs well.
4) If the Most Frequent Words are limited to 2000, the Binomial Index shows an increase in performance over Eder's Simple Delta and is thus more suitable in this case.

Finally, these recommendations can be applied to stylometric analysis of generic Lithuanian texts, but with caution; we therefore plan to experiment with texts belonging to different domains in the future.

REFERENCES

[1] R Development Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, 2008.
[2] F. Jannidis, S. Pielström, C. Schöch and T. Vitt, Improving Burrows' Delta – An empirical evaluation of text distance measures, Digital Humanities 2015, 2015.
[3] S. Evert, T. Proisl, C. Schöch, F. Jannidis, S. Pielström and T. Vitt, Explaining Delta, or: How do distance measures for authorship attribution work?, 2015.
[4] H. S. Horn, Measurement of "overlap" in comparative ecological studies, 1966.
[5] M. J. Anderson and R. B. Millar, Spatial variation and effects of habitat on temperate reef fish assemblages in northeastern New Zealand, 2004.
[6] L. Rokach and O. Maimon, Clustering methods, 2005.
[7] I. Borg and P. Groenen, Modern Multidimensional Scaling: theory and applications, 2005.
[8] Heat map (heatmap), http://searchbusinessanalytics.techtarget.com/definition/heat-map, accessed: 2017-03-12.
[9] M. Eder, J. Rybicki and M. Kestemont, Stylometry with R: a package for computational text analysis, R Journal, 2016.
[10] J. Mandravickaitė and T. Krilavičius, Language usage of members of the Lithuanian Parliament considering their political orientation, Deeds and Days, 2015.
[11] T. Krilavičius and V. Morkevičius, Mining Social Science Data: a Study of Voting of the members of the Seimas of Lithuania by Using Multidimensional Scaling and Homogeneity Analysis, Intellectual Economics, 2011.
[12] S. Argamon, Interpreting Burrows's Delta: Geometric and Probabilistic Foundations, 2007.
[13] M. Eder, Does size matter? Authorship attribution, small samples, big problem, Digital Humanities 2010, 2010.
[14] J. H. Ward Jr., Hierarchical Grouping to Optimize an Objective Function, Journal of the American Statistical Association, 1963.
[15] J. Kapočiūtė-Dzikienė, A. Utka and L. Šarkutė, Feature Exploration for Authorship Attribution of Lithuanian Parliamentary Speeches, 2014.
[16] J. Oksanen, F. G. Blanchet, R. Kindt, P. Legendre, P. R. Minchin, R. B. O'Hara, G. L. Simpson, P. Solymos, M. Henry, H. Stevens and H. Wagner, vegan: Community Ecology Package, 2016.
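The Z-index of Equation (1) in Section II-B may be easier to follow as code. The following is a minimal Python sketch, not the R script used in the experiments. The function name `z_index`, the toy distance matrix and the use of the population standard deviation are assumptions made for illustration; the paper does not state which standard deviation estimator was used.

```python
from itertools import combinations
from statistics import mean, pstdev


def z_index(dist, labels):
    """Absolute difference between the out-group and in-group means
    of standardized distances (hypothetical helper, see lead-in)."""
    in_group, out_group = [], []
    # combinations() yields only pairs with i < j, so comparisons of a
    # document with itself are excluded, as the definition requires.
    for i, j in combinations(range(len(labels)), 2):
        target = in_group if labels[i] == labels[j] else out_group
        target.append(dist[i][j])
    pooled = in_group + out_group
    mu = mean(pooled)
    sigma = pstdev(pooled)  # assumption: population standard deviation
    mean_out = mean((s - mu) / sigma for s in out_group)
    mean_in = mean((v - mu) / sigma for v in in_group)
    return abs(mean_out - mean_in)


# Toy example: two text objects per author, made-up distances.
labels = ["A", "A", "B", "B"]
dist = [
    [0.0, 1.0, 3.0, 3.0],
    [1.0, 0.0, 3.0, 3.0],
    [3.0, 3.0, 0.0, 1.0],
    [3.0, 3.0, 1.0, 0.0],
]
score = z_index(dist, labels)
print(round(score, 4))  # 2.1213
```

A larger value means the measure separates same-author pairs from different-author pairs more clearly, which is the criterion used to rank the measures in Experiments 1 and 2.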
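Several of the formulas in Table I are short enough to write out directly. The sketch below implements four of them in plain Python under the notation of Table II; it is an illustration only and does not reproduce the "stylo" or "vegdist" implementations. The function names and the example vectors are hypothetical.

```python
from math import sqrt


def manhattan(x, y):
    # Sum of absolute differences between corresponding values.
    return sum(abs(a - b) for a, b in zip(x, y))


def eder_simple_delta(x, y):
    # Manhattan distance computed on square-rooted values.
    return sum(abs(sqrt(a) - sqrt(b)) for a, b in zip(x, y))


def bray_curtis(x, y):
    # Sum of absolute differences divided by the sum of all values.
    return manhattan(x, y) / sum(a + b for a, b in zip(x, y))


def jaccard(x, y):
    # Table I expresses the Jaccard Index through Bray-Curtis: 2B / (1 + B).
    b = bray_curtis(x, y)
    return 2 * b / (1 + b)


# Made-up frequency vectors for two text objects over three words.
x = [4.0, 1.0, 0.0]
y = [1.0, 4.0, 0.0]
for name, func in [("Manhattan", manhattan),
                   ("Eder's Simple Delta", eder_simple_delta),
                   ("Bray-Curtis", bray_curtis),
                   ("Jaccard", jaccard)]:
    print(name, round(func(x, y), 4))
```

The square root in Eder's Simple Delta dampens the influence of the most frequent words relative to the plain Manhattan Distance, which is the only difference between the two formulas.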
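The feature extraction described in Section II-A (the most frequent words of the whole sub-sample, sorted by descending frequency, with no conversion to base forms) can be sketched as follows. This is a hedged illustration in Python, not the authors' pipeline: the tokenizer, the function names `tokenize` and `mfw_table`, the use of relative frequencies as fractions, and the two tiny example texts are all assumptions.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split on non-word characters; \w is Unicode-aware,
    # so Lithuanian letters such as "ė" or "ų" stay inside words.
    return re.findall(r"\w+", text.lower())


def mfw_table(texts, n_mfw):
    """Return the n_mfw most frequent words of the whole sample and,
    for each text object, its relative frequencies of those words."""
    counts = {name: Counter(tokenize(text)) for name, text in texts.items()}
    total = Counter()
    for counter in counts.values():
        total.update(counter)
    # most_common() sorts by descending frequency in the pooled sample.
    mfw = [word for word, _ in total.most_common(n_mfw)]
    table = {
        name: [counter[word] / sum(counter.values()) for word in mfw]
        for name, counter in counts.items()
    }
    return mfw, table


# Hypothetical miniature "text objects"; real ones hold about 5000 words.
texts = {"a": "ir ir ir kad", "b": "Ir kad kad ne"}
mfw, table = mfw_table(texts, 2)
print(mfw)
print(table)
```

Each row of the resulting table is one vector X or Y in the sense of Table II, so varying `n_mfw` from 100 to 5000 corresponds to the sweep over MFW quantities described in Experiment 2.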