Quantitative Analysis of Textual Genres: Comparison of English and Lithuanian

Justina Mandravickaitė
Vilnius University, Lithuania
Baltic Institute of Advanced Technology, Lithuania
Email: justina@bpti.lt

Tomas Krilavičius
Vytautas Magnus University, Lithuania
Baltic Institute of Advanced Technology, Lithuania
Email: t.krilavicius@bpti.lt

Copyright held by the author(s).

Abstract: We report an ongoing study of the quantitative characteristics of texts written in different genres. At this stage, we compare Lithuanian and English texts by genre. We use 16 indices that describe the frequency structure of a text and several other characteristics of written texts. The initial study showed significant differences in the indices calculated for genre pairs within the same language. Hierarchical clustering suggests that the indices could serve as features for text categorization/classification by genre, though better results were achieved for Lithuanian texts.

Index Terms: quantitative genre analysis, frequency structure of text, vocabulary richness, stylometry, English, Lithuanian

I. INTRODUCTION

We report an ongoing study of the quantitative characteristics of texts written in different genres. It has been suggested that genres add to familiarity and the shorthand of communication [1], [2], [3] and therefore resonate with people. Genres also tend to shift with public opinion and reflect the widespread culture of a given time [4]. From the NLP perspective, genres are useful for text classification (e.g. [5]) and categorization (e.g. [6]), natural language generation (e.g. [7], [8]), etc.

At this stage, we present an initial quantitative analysis of Lithuanian and English texts of different genres. Strictly speaking, these are super-genres [9], since the texts were grouped into broad categories or genre groups; for simplicity, we use the term “genre” throughout this paper. As our main interest was the frequency structure of text with respect to genre, we used 16 indices proposed in [10], [11], [12] and implemented in QUITA (Quantitative Index Text Analyzer) [13].

As we study textual genres with respect to style (the style of fiction, press, etc.), we apply computational stylistics, or stylometry. Stylometry is based on two hypotheses:
• the human stylome hypothesis, i.e., each individual has a unique style [14];
• the unique style of an individual can be measured [15], and thus stylometry allows gaining meta-knowledge [16], i.e., what can be learned from the text about the author: gender [17], [18], age [19], psychological characteristics [20], political affiliation [19], etc.
A genre can be considered a certain “style”; we therefore assumed that stylometric analysis could aid our study of the quantitative characteristics of genres.

II. CORPORA AND METHODS

A. Data Sets and Preprocessing

For our initial experiment, we used part of the Corpus of the Contemporary Lithuanian Language [21] (≈ 1.5 million words) and the Freiburg-LOB Corpus of British English (F-LOB) (≈ 1 million words) [22]. The Lithuanian material is composed as follows: Fiction (17%), Documents (21%), Scientific (21%) and Periodicals (31%). The English material consists of Fiction (25%), General Prose (42%), Learned (16%) and Press (18%). The Lithuanian genre category Scientific corresponds to the English category Learned, while Lithuanian Periodicals corresponds to the English category Press. The composition of the F-LOB corpus by genre is described in more detail in Table I. The part of the Corpus of the Contemporary Lithuanian Language that we used did not have such details available, only the genre groups described above.

Table I. F-LOB corpus statistics

| Genre group | Category | Content of category |
|---|---|---|
| Press | A | Reportage |
| Press | B | Editorial |
| Press | C | Review |
| General prose | D | Religion |
| General prose | E | Skills, trades and hobbies |
| General prose | F | Popular lore |
| General prose | G | Belles lettres, biographies, essays |
| General prose | H | Miscellaneous |
| Learned | J | Science |
| Fiction | K | General fiction |
| Fiction | L | Mystery and detective |
| Fiction | M | Science fiction |
| Fiction | N | Adventure and Western |
| Fiction | P | Romance and love story |
| Fiction | R | Humor |

As the English texts were already concatenated by genre, only minimal preprocessing was performed: line numbers and tags marking textual structure were removed. For Lithuanian we had individual texts, so, to suppress the “fingerprint” of individual authorship as much as possible, all the samples were concatenated into 4 large documents by genre group (or super-genre), and each document was then partitioned into 5 parts. In total, the Lithuanian part of the analysis therefore comprised 20 samples.

B. Features for Characterization of Genres

Most frequent words (MFW) are one of the most popular feature choices in stylometric analysis [23], [24], [25], [26], [27], [28] (they usually coincide with function words [29], [30]). They are considered topic-neutral and perform well [31], [32], [33]. As our interest lay in the frequency structure of the text and in vocabulary richness with respect to genre, for our experiment we instead applied 16 indices implemented in QUITA [13]:
• Type-Token Ratio (TTR): the ratio between the number of types and the number of tokens in a text, i.e., it shows vocabulary variation in a text;
• h-point (h): a fuzzy boundary in the word frequency table where the rank is the same as the frequency;
• R1: an indicator of vocabulary richness based on the h-point (h);
• Repeat Rate (RR): the degree of vocabulary concentration in a text, i.e., an inverse measure of vocabulary richness;
• Relative Repeat Rate of McIntosh (RRmc): the relative RR, for better comparison with the other indices;
• Hapax Legomenon Percentage (HP): the ratio of the number of hapax legomena, i.e., words that occur only once in a text, to the number of tokens;
• Lambda (Λ): describes the frequency structure of a text; it is related to vocabulary richness but also considers the relationship between neighbouring frequencies;
• Gini Coefficient (G): a measure of statistical dispersion; in linguistics, G is used as a measure of vocabulary richness;
• R4: the reversed Gini coefficient;
• Curve length (L): as many vocabulary richness measures are based on the curve of the rank-frequency distribution, L is defined as the sum of the Euclidean distances between all neighbouring points on the curve;
• Curve length R Indicator (R): an indicator of vocabulary richness derived from the curve length (L);
• Entropy (E): in linguistics, entropy expresses the degree of vocabulary concentration in the text;
• Adjusted Modulus (AM): a frequency structure indicator that is independent of text length;
• Writer’s View (WV): an indicator defined by the angle between the h-point and the ends of the rank-frequency distribution (related to the golden ratio);
• Average Tokens Length (ATL): the arithmetic mean of the lengths of tokens;
• Token Length Frequency Spectrum (TLFS): a list of all token lengths in a text with their frequencies.

Detailed formulas of the indices (except for TLFS), based on [13] and [10], are presented in Table II. The values of the indices were standardized to make them comparable.

Table II. Indices and their formulas

| Index | Formula |
|---|---|
| Type-Token Ratio (TTR) | $\frac{V}{N}$ |
| h-point (h) | $r = f(r)$ |
| R1 | $1 - \left(F(h) - \frac{h^2}{2N}\right)$ |
| Repeat Rate (RR) | $\sum_{i=1}^{V} p_i^2$ |
| Relative Repeat Rate of McIntosh (RRmc) | $\frac{1 - \sqrt{RR}}{1 - 1/\sqrt{V}}$ |
| Hapax Legomenon Percentage (HP) | $\frac{N_h}{N}$ |
| Lambda (Λ) | $\frac{L \cdot \log_{10} N}{N}$ |
| Gini Coefficient (G) | $\frac{1}{V}\left(V + 1 - 2 m_1'\right)$ |
| R4 | $1 - G$ |
| Curve length (L) | $\sum_{r=1}^{V-1} \sqrt{\left(f(r) - f(r+1)\right)^2 + 1}$ |
| Curve length R Indicator (R) | $1 - \frac{L_h}{L}$ |
| Entropy (E) | $-\sum_{i=1}^{K} p_i \,\mathrm{ld}\, p_i$ |
| Adjusted Modulus (AM) | $\frac{\frac{1}{h}\left(f_1^2 + V^2\right)^{1/2}}{\log_{10} N}$ |
| Writer’s View (WV) | $\frac{-\left[(h-1)(f_1-h) + (h-1)(V-h)\right]}{\left[(h-1)^2 + (f_1-h)^2\right]^{1/2}\left[(h-1)^2 + (V-h)^2\right]^{1/2}}$ |
| Average Tokens Length (ATL) | $\frac{1}{N}\sum_{i=1}^{N} x_i$ |

Where $V$ is the number of types, $N$ the number of tokens, $r$ the rank, $f(r)$ the frequency at rank $r$, $F(h)$ the cumulative relative frequency up to the h-point, $h$ the h-point, $p_i$ the individual probabilities, estimated by means of relative frequencies, $RR$ the Repeat Rate, $N_h$ the number of hapax legomena, $L$ the arc length of the rank-frequency distribution, $m_1'$ the mean of the frequency distribution, $G$ the Gini coefficient, $L_h$ the curve length above the h-point, $K$ the inventory size, $\mathrm{ld}$ the logarithm to the base 2, $f_1 = f(1)$ the highest frequency, and $x_i$ the length of an individual token.

C. Distance Measures

Stylometry is the study of linguistic style, usually in written language. It uses a variety of statistical methods; a common technique is to calculate distances or (dis)similarities between texts and process the output with different visualization methods. Several studies have investigated which distance or similarity measures are more appropriate in different scenarios of stylometric analysis. For example, F. Jannidis, S. Evert and their colleagues found that the Cosine Delta measure outperformed all other measures for novels written in English, French and German [36], [37]. Burrows’s Delta distance is typically used for stylometric analysis, as it proved effective for English and German [33], [26]. However, it was less successful for highly inflected languages, e.g., Latin and Polish [26]. In such cases, especially when the most frequent words are used as features, Eder’s Delta was recommended [38]; it is a modified Burrows’s Delta that gives more weight to the frequent features and rescales the less frequent ones in order to reduce the influence of random infrequent features. Also, given the variety of possible text lengths, Eder’s Simple Delta and the Binomial Index were useful for Lithuanian texts (the experiments were performed on transcripts of plenary sittings of the Lithuanian Parliament) [39]. As we aim to compare the performance of distance or (dis)similarity measures already used in stylometry and in other fields of research, e.g. ecology [40], biology [35] and social sciences [41], we used a variety of them; their formulas [39] are presented in Table III.

Table III. Distance/similarity measures and their formulas

| Distance/similarity measure | Formula |
|---|---|
| Canberra Distance | $\sum_{i=1}^{n} \frac{\lvert x_i - y_i \rvert}{\lvert x_i \rvert + \lvert y_i \rvert}$ |
| Cosine Distance | $\frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$ |
| Burrows’s Delta | $\frac{1}{n}\sum_{i=1}^{n} \left\lvert \frac{x_i - \mu_i}{\sigma_i} - \frac{y_i - \mu_i}{\sigma_i} \right\rvert$ |
| Argamon’s Linear Delta | $\sqrt{\frac{1}{n}\sum_{i=1}^{n} \frac{(x_i - y_i)^2}{\sigma_i^2}}$ |
| Eder’s Delta | $\frac{1}{n}\sum_{i=1}^{n} \left( \left\lvert \frac{x_i - y_i}{\sigma_i} \right\rvert \cdot \frac{n - n_i + 1}{n} \right)$ |
| Eder’s Simple Delta [34] | $\sum_{i=1}^{n} \left\lvert \sqrt{x_i} - \sqrt{y_i} \right\rvert$ |
| Bray-Curtis Dissimilarity | $\frac{\sum_{i=1}^{n} \lvert x_i - y_i \rvert}{\sum_{i=1}^{n} (x_i + y_i)}$ |
| Kulczynski Distance | $\frac{\sum_{i=1}^{n} \lvert x_i - y_i \rvert}{\sum_{i=1}^{n} \min(x_i, y_i)}$ |
| Jaccard Index | $\frac{2B}{1 + B}$, where $B = \frac{\sum_{i=1}^{n} \lvert x_i - y_i \rvert}{\sum_{i=1}^{n} (x_i + y_i)}$ |
| Gower Similarity | $\frac{1}{n}\sum_{i=1}^{n} \frac{\lvert x_i - y_i \rvert}{\max_i - \min_i}$ |
| Mountford Index | $\frac{1}{\alpha}$, where $\alpha$ is the parameter of Fisher’s log-series |
| Binomial Index [35] | $\sum_{i=1}^{n} \frac{x_i \ln\frac{x_i}{2n} + y_i \ln\frac{y_i}{2n} - 2n \ln\frac{1}{2}}{2n}$ |

Where $x_i$ and $y_i$ are the corresponding $i$-th values of the vectors $X = (x_1, x_2, \ldots, x_n)$ and $Y = (y_1, y_2, \ldots, y_n)$, $n$ is the size of the compared vectors, $\sigma_i$ and $\mu_i$ are the standard deviation and mean of the $i$-th values of all vectors used in the comparison, $n_i$ is the ordinal position of the $i$-th value in a vector (usually $n_i = i$), and $\min_i$ and $\max_i$ are the minimum and maximum $i$-th values among all compared vectors.

D. Experimental Setup

For the stylometric analysis (calculating distances or (dis)similarities and plotting the relations among text samples), we used R, a free software environment for statistical computing and graphics [42], and two of its packages, “stylo” [34] and “vegan” [43]. For practical reasons, these packages were merged together in [44].

The indices taken as features for our stylometric analysis of textual genres (Type-Token Ratio (TTR), h-point, Entropy, Average Tokens Length (ATL), R1, Repeat Rate (RR), Relative Repeat Rate of McIntosh (RRmc), Lambda (Λ), Adjusted Modulus (AM), Gini Coefficient (G), R4, Hapax Legomenon Percentage (HP), Curve Length (L), Writer’s View (WV), Curve Length R Indicator (R) and Token Length Frequency Spectrum (TLFS)) were calculated with QUITA [13]. To check the statistical significance of the calculated indices with respect to genre, an asymptotic u-test [45] was performed.

The dissimilarity between the text samples was then calculated using the selected distance or similarity measures, and a distance matrix was generated. Next, hierarchical clustering was applied to group samples by similarity [46], and dendrograms were used to visualize the results.

The goal of this study was to identify stylistic dissimilarities and to map the positions of the text samples in relation to each other, not to classify them by genre. Therefore, hierarchical clustering with Ward linkage (which minimizes the total within-cluster variance [46]) was chosen.

III. RESULTS

A. Statistical Significance of Indicators

The significance (asymptotic u-test) of the calculated indices with respect to genre is reported in Table IV. The suffix “_LT” indicates the Lithuanian part of the experimental material, while “_EN” indicates the English part. Most of the calculated indicators achieved significance for at least some genre pairs. For the Lithuanian part, 3 indices (TTR, HP and R) were significant under all test conditions, and there were no indices that failed to achieve significance under any condition. For the English part, only 1 indicator (ATL) was significant under all test conditions, while 2 indices (Lambda and HP) did not achieve significance under any condition.

Table IV. Results of significance test: genre pairs

| First variable | Second variable | Significant differences in indices |
|---|---|---|
| Scientific_texts_LT | Documents_LT | TTR, h-Point, Entropy, R1, RR, Lambda, AM, G, R4, HP, L, WV, R |
| Scientific_texts_LT | Fiction_LT | TTR, h-Point, Entropy, ATL, RR, Lambda, G, R4, HP, L, WV, R, TLFS |
| Scientific_texts_LT | Periodicals_LT | TTR, h-Point, Entropy, ATL, RR, RRmc, Lambda, G, R4, HP, L, R, TLFS |
| Documents_LT | Fiction_LT | TTR, h-Point, ATL, R1, Lambda, AM, G, R4, HP, R, TLFS |
| Documents_LT | Periodicals_LT | TTR, Entropy, ATL, R1, RR, RRmc, Lambda, AM, G, R4, HP, L, WV, R |
| Fiction_LT | Periodicals_LT | TTR, h-Point, Entropy, ATL, RR, RRmc, AM, HP, L, WV, TLFS |
| Press_EN | Learned_EN | h-Point, Entropy, ATL, RR, RRmc, AM, L, WV |
| Press_EN | Fiction_EN | Entropy, ATL, RR, RRmc, AM, L, WV, TLFS |
| Press_EN | General_prose_EN | ATL, RR, RRmc, R |
| Learned_EN | Fiction_EN | ATL, RR, RRmc, WV, R, TLFS |
| Learned_EN | General_prose_EN | TTR, h-Point, Entropy, ATL, R1, AM, G, R4, L, WV, R |
| Fiction_EN | General_prose_EN | Entropy, ATL, RR, RRmc, AM, L, WV, R, TLFS |

B. Stylometric Analysis

As already mentioned, Burrows’s Delta distance is typically used for stylometric analysis [33], [26], with the most frequent words (MFW) [23], [24], [25], [26], [27], [28] or function words (which usually occur among the MFW [29], [30]) as features. However, we achieved the best results with the Eder’s Delta distance measure for the English dataset and the Argamon’s Linear Delta distance measure for the Lithuanian dataset (for both formulas, see Table III). Though we experimented with all the distance and similarity measures described in Table III, due to the limited space of the paper we present only these best results (see Fig. 1 and 2).

Figure 1. Best clustering results for English data: Eder’s Delta distance measure. The names of the samples in the cluster analysis were constructed as follows: genre-group_genre; see Table I for the details. As there was only one sample for the Learned genre group, it was split into 2 equal samples: J1 and J2.

Figure 2. Best clustering results for Lithuanian data: Argamon’s Linear Delta distance measure. The names of the samples in the cluster analysis were constructed as follows: genre-group_number-of-sample, where D = Documents, G = Fiction, M = Scientific texts, and P = Periodicals.

Hierarchical clustering [47] of the agglomerative type was used. Ward linkage was applied, in which the pair of clusters to merge at each step is chosen based on the optimal value of an objective function [48]. This generated a hierarchy of clusters, which was visualized as a dendrogram: going from the right side, separate documents were linked into clusters by their similarity until all the documents were merged into one cluster.

The results showed a clear differentiation of text samples by genre for Lithuanian (all samples were clustered by genre correctly), while the clustering of the English dataset was somewhat less successful, as some samples were attached to an incorrect cluster. One reason could be language characteristics: the indicators used as features react to the degree of inflection a language possesses [10], and English is an analytic language while Lithuanian is a synthetic one, so comparing texts written in different languages becomes a non-trivial issue. Besides, the outcome might have been influenced by the grouping of texts into genres and genre groups, as this procedure appears to have followed different criteria for our English and Lithuanian datasets; e.g., for the English part a significantly bigger variety of genres was included in the genre groups. Also, the construction of comparable datasets for genre analysis may need to be better optimized in terms of sample lengths (even though some of the indicators used in this study are text-length-independent [13], and unsupervised machine learning, here hierarchical cluster analysis, reduces the impact of the class imbalance problem) and in terms of the samples themselves, so that they represent genre groups and genres as well as possible while still taking authorship into consideration (we need to escape the authorial “fingerprint” and concentrate on the qualities of textual genres and the means to identify them).

To summarize, stylometric analysis combined with quantitative textual indicators that mark the frequency structure or vocabulary richness of the text allowed us to map/position text samples by genre, though the results were more successful for the Lithuanian part of the experiment. Eder’s Delta (for English) and Argamon’s Linear Delta (for Lithuanian) provided the best results; however, this is by no means the only possible configuration. Other measures could provide similar performance in a different experimental setup, e.g. with different corpora, parameters for text analysis, or selection of features. To reach a more solid conclusion, further research is needed.

IV. CONCLUSION AND FUTURE WORK

We presented ongoing work on the quantitative analysis of texts written in different genres for English and Lithuanian. Textual genre in our study was perceived as a certain “style”, and thus stylometric analysis was performed.
1) The features used in this study (frequency structure indicators and measures of vocabulary richness) seemed promising for the characterization of genres, as there were significant differences between genre pairs in terms of the calculated indices.
2) As part of the stylometric analysis, 12 distance or (dis)similarity measures were tested. Of these, Eder’s Delta (for the English dataset) and Argamon’s Linear Delta (for the Lithuanian dataset) provided the best results for our genre study.
3) Cluster analysis allowed grouping of text samples by genre, though the results were more successful for the Lithuanian dataset than for the English one: all Lithuanian samples were grouped correctly.
However, more substantial conclusions require additional research. We therefore plan to extend this work to larger text collections and additional genres. A more extensive study of textual indicators with respect to genre is important as well. We also plan to examine other languages to see whether the effects found in this study persist.

REFERENCES

[1] A. Tereszkiewicz, “Lead, headline, news abstract? Genre conventions of news sections on newspaper websites,” Studia Linguistica Universitatis Iagellonicae Cracoviensis, no. 129, p. 211, 2012.
[2] J. Swales, Genre Analysis: English in Academic and Research Settings. Cambridge University Press, 1990.
[3] A. J. Devitt, “Generalizing about genre: New conceptions of an old concept,” College Composition and Communication, vol. 44, no. 4, pp. 573–586, 1993.
[4] C. R. Miller, “Genre as social action (1984), revisited 30 years later (2014),” Letras & Letras, vol. 31, no. 3, pp. 56–72, 2015.
[5] Y. Kim and S. Ross, “Variation of word frequencies across genre classification tasks,” 2007.
[6] E. Stamatatos, N. Fakotakis, and G. Kokkinakis, “Automatic text categorization in terms of genre and author,” Computational Linguistics, vol. 26, no. 4, pp. 471–495, 2000. [Online]. Available: http://www.aclweb.org/anthology/J00-4001.pdf
[7] O. Stock and C. Strapparava, “The act of creating humorous acronyms,” Applied Artificial Intelligence, vol. 19, no. 2, pp. 137–151, 2005.
[8] C. van der Lee, E. Krahmer, and S. Wubben, “PASS: A Dutch data-to-text system for soccer, targeted towards specific audiences,” in Proceedings of the 10th International Conference on Natural Language Generation, 2017, pp. 95–104.
[9] G. Steen, “Genres of discourse and the definition of literature,” Discourse Processes, vol. 28, no. 2, pp. 109–120, 1999.
[10] I.-I. Popescu, Word Frequency Studies. Walter de Gruyter, 2009, vol. 64.
[11] I.-I. Popescu, J. Mačutek, and G. Altmann, “Word forms, style and typology,” Glottotheory, vol. 3, no. 1, pp. 89–96, 2010.
[12] I.-I. Popescu, R. Čech, and G. Altmann, The Lambda-structure of Texts. RAM-Verlag, Lüdenscheid, 2011.
[13] M. Kubát, V. Matlach, and R. Čech, Studies in Quantitative Linguistics 18: QUITA - Quantitative Index Text Analyzer. RAM-Verlag, 2014.
[14] H. Van Halteren, H. Baayen, F. Tweedie, M. Haverkort, and A. Neijt, “New machine learning methods demonstrate the existence of a human stylome,” Journal of Quantitative Linguistics, vol. 12, no. 1, pp. 65–77, 2005.
[15] E. Stamatatos, “A survey of modern authorship attribution methods,” Journal of the American Society for Information Science and Technology, vol. 60, no. 3, pp. 538–556, 2009. [Online]. Available: http://www.clips.ua.ac.be/stylometry/Lit/Stamatatos_survey2009.pdf
[16] W. Daelemans, “Explanation in computational stylometry,” in Computational Linguistics and Intelligent Text Processing. Springer, 2013, pp. 451–462. [Online]. Available: http://www.clips.ua.ac.be/~walter/papers/2013/d13.pdf
[17] K. Luyckx, W. Daelemans, and E. Vanhoutte, “Stylogenetics: Clustering-based stylistic analysis of literary corpora,” in Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC’06), Genoa, Italy, 2006.
[18] S. Argamon, M. Koppel, J. Fine, and A. R. Shimoni, “Gender, genre, and writing style in formal written texts,” To appear in Text, vol. 23, p. 3, 2003. [Online]. Available: http://www.lingcog.iit.edu/wp-content/papercite-data/pdf/gendertext04.pdf
[19] M. Dahllöf, “Automatic prediction of gender, political affiliation, and age in Swedish politicians from the wording of their speeches - a comparative study of classifiability,” Literary and Linguistic Computing, vol. 27, no. 2, pp. 139–153, 2012.
[20] K. Luyckx and W. Daelemans, “Personae: a corpus for author and personality prediction from text,” in Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08), B. M. J. M. J. O. S. P. D. T. Nicoletta Calzolari (Conference Chair), Khalid Choukri, Ed. Marrakech, Morocco: European Language Resources Association (ELRA), May 2008, http://www.lrec-conf.org/proceedings/lrec2008/.
[21] E. Rimkutė, J. Kovalevskaitė, V. Melninkaitė, A. Utka, and D. Vitkutė-Adžgauskienė, “Corpus of contemporary Lithuanian language–the standardised way,” in Human Language Technologies–The Baltic Perspective: Proceedings of the Fourth International Conference Baltic HLT 2010, vol. 219. IOS Press, 2010, p. 154.
[22] M. Hundt, A. Sand, and R. Siemund, Manual of information to accompany the Freiburg-LOB Corpus of British English (’FLOB’). Albert-Ludwigs-Universität Freiburg, 1998.
[23] J. F. Burrows, “Not unless you ask nicely: The interpretative nexus between analysis and information,” Literary and Linguistic Computing, vol. 7, no. 2, pp. 91–109, 1992.
[24] D. L. Hoover, “Corpus stylistics, stylometry, and the styles of Henry James,” Style, vol. 41, no. 2, p. 174, 2007.
[25] M. Eder, “Mind your corpus: systematic errors in authorship attribution,” Literary and Linguistic Computing, p. fqt039, 2013. [Online]. Available: http://www.dh2012.uni-hamburg.de/conference/programme/abstracts/mind-your-corpus-systematic-errors-in-authorship-attribution.1.html
[26] J. Rybicki and M. Eder, “Deeper delta across genres and languages: do we really need the most frequent words?” Literary and Linguistic Computing, vol. 26, no. 3, pp. 315–321, 2011. [Online]. Available: http://dh2010.cch.kcl.ac.uk/academic-programme/abstracts/papers/pdf/ab-688.pdf
[27] M. Eder and J. Rybicki, “Do birds of a feather really flock together, or how to choose training samples for authorship attribution,” Literary and Linguistic Computing, p. fqs036, 2012.
[28] M. Eder, “Computational stylistics and biblical translation: How reliable can a dendrogram be,” The Translator and the Computer, pp. 155–170, 2013.
[29] J.-R. Hochmann, A. D. Endress, and J. Mehler, “Word frequency as a cue for identifying function words in infancy,” Cognition, vol. 115, no. 3, pp. 444–457, 2010.
[30] B. Sigurd, M. Eeg-Olofsson, and J. Van Weijer, “Word length, sentence length and frequency–Zipf revisited,” Studia Linguistica, vol. 58, no. 1, pp. 37–52, 2004.
[31] P. Juola and R. H. Baayen, “A controlled-corpus experiment in authorship identification by cross-entropy,” Literary and Linguistic Computing, vol. 20, no. Suppl, pp. 59–67, 2005.
[32] D. I. Holmes, L. J. Gordon, and C. Wilson, “A widow and her soldier: Stylometry and the American Civil War,” Literary and Linguistic Computing, vol. 16, no. 4, pp. 403–420, 2001.
[33] J. Burrows, “‘Delta’: A measure of stylistic difference and a guide to likely authorship,” Literary and Linguistic Computing, vol. 17, no. 3, pp. 267–287, 2002.
[34] M. Eder, J. Rybicki, and M. Kestemont, “Stylometry with R: a package for computational text analysis,” R Journal, vol. 16, no. 1, 2016.
[35] M. J. Anderson and R. B. Millar, “Spatial variation and effects of habitat on temperate reef fish assemblages in northeastern New Zealand,” Journal of Experimental Marine Biology and Ecology, vol. 305, no. 2, pp. 191–221, 2004.
[36] F. Jannidis, S. Pielström, C. Schöch, and T. Vitt, “Improving Burrows’ Delta - an empirical evaluation of text distance measures,” in Digital Humanities Conference, 2015.
[37] S. Evert, T. Proisl, C. Schöch, F. Jannidis, S. Pielström, and T. Vitt, “Explaining Delta, or: How do distance measures for authorship attribution work?” 2015.
[38] M. Eder, J. Rybicki, M. Kestemont, and M. M. Eder, “Package ‘stylo’,” 2014.
[39] D. Stanikunas, J. Mandravickaite, and T. Krilavicius, “Comparison of distance and similarity measures for stylometric analysis of Lithuanian texts,” 2017.
[40] H. S. Horn, “Measurement of ‘overlap’ in comparative ecological studies,” The American Naturalist, vol. 100, no. 914, pp. 419–424, 1966.
[41] T. Krilavičius and V. Morkevičius, “Mining social science data: a study of voting of the members of the Seimas of Lithuania by using multidimensional scaling and homogeneity analysis,” Intellectual Economics, vol. 5, no. 2, pp. 224–243, 2011.
[42] R Core Team, “R: A language and environment for statistical computing,” 2013.
[43] J. Oksanen, F. G. Blanchet, R. Kindt, P. Legendre, R. O’Hara, G. L. Simpson, P. Solymos, M. H. H. Stevens, and H. Wagner, “vegan: Community ecology package. R package version 1.17-2,” R Foundation for Statistical Computing, Vienna. Available: CRAN.R-project.org/package=vegan (July 2012), 2011.
[44] D. Stanikūnas, “Matų ir metodų poveikis lietuviškų tekstų stilometrinei analizei,” Master’s thesis, Vytautas Magnus University, 2017.
[45] M. P. Fay and M. A. Proschan, “Wilcoxon-Mann-Whitney or t-test? On assumptions for hypothesis tests and multiple interpretations of decision rules,” Statistics Surveys, vol. 4, p. 1, 2010.
[46] B. S. Everitt, S. Landau, M. Leese, and D. Stahl, “Hierarchical clustering,” Cluster Analysis, 5th Edition, pp. 71–110, 2011.
[47] L. Rokach and O. Maimon, “Clustering methods,” in Data Mining and Knowledge Discovery Handbook. Springer, 2005, pp. 321–352.
[48] J. H. Ward Jr, “Hierarchical grouping to optimize an objective function,” Journal of the American Statistical Association, vol. 58, no. 301, pp. 236–244, 1963.
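
Illustrative sketch for the indices of Table II. To make a few of the simpler definitions concrete, the following Python sketch computes TTR, the h-point, RR, RRmc, HP, E and ATL for a list of tokens. It is not QUITA's implementation and is not guaranteed to reproduce its values: the function names, the whitespace tokenization in the usage lines, the linear interpolation used when no rank satisfies r = f(r) exactly, and the fallback when every frequency exceeds its rank are all assumptions made for this example.

```python
import math
from collections import Counter


def rank_frequencies(tokens):
    """Word frequencies in descending order, so index 0 holds rank 1."""
    return sorted(Counter(tokens).values(), reverse=True)


def h_point(freqs):
    """Rank r where f(r) = r, interpolated when no rank matches exactly."""
    for r, f in enumerate(freqs, start=1):
        if f == r:
            return float(r)
        if f < r:
            # Assumption: intersect the line through the neighbouring points
            # (r - 1, f(r - 1)) and (r, f(r)) with the diagonal f = r.
            r_i, f_i = r - 1, freqs[r - 2]
            return (f_i * r - f * r_i) / (r - r_i + f_i - f)
    # Assumption: if every frequency exceeds its rank, fall back to V.
    return float(len(freqs))


def indices(tokens):
    """Compute a few of the Table II indices for a list of tokens."""
    if not tokens:
        raise ValueError("at least one token is required")
    freqs = rank_frequencies(tokens)
    n_tokens, n_types = len(tokens), len(freqs)
    probs = [f / n_tokens for f in freqs]  # relative frequencies p_i
    rr = sum(p * p for p in probs)
    # RRmc is undefined for a one-word vocabulary (zero denominator).
    if n_types > 1:
        rr_mc = (1 - math.sqrt(rr)) / (1 - 1 / math.sqrt(n_types))
    else:
        rr_mc = float("nan")
    return {
        "N": n_tokens,
        "V": n_types,
        "TTR": n_types / n_tokens,
        "h": h_point(freqs),
        "RR": rr,
        "RRmc": rr_mc,
        "HP": sum(1 for f in freqs if f == 1) / n_tokens,
        "E": -sum(p * math.log2(p) for p in probs),
        "ATL": sum(len(t) for t in tokens) / n_tokens,
    }


if __name__ == "__main__":
    sample = "the cat sat on the mat and the dog sat on the rug".split()
    for name, value in indices(sample).items():
        print(name, value)
```

For real corpora the tokenizer matters a great deal, especially for an inflected language such as Lithuanian, so the whitespace split above should be read only as a placeholder.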
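
Illustrative sketch for the Delta measures of Table III. The three Delta variants discussed in Section II-C differ only in how the per-feature differences are scaled and aggregated, which the following Python sketch tries to make visible. It follows the formulas as printed in Table III and is not the code of the “stylo” package. The function names, the use of the sample standard deviation for σ_i, and the choice n_i = i are assumptions; features with zero variance are not handled.

```python
import math
from statistics import mean, stdev


def feature_stats(samples):
    """Per-feature mean and standard deviation over all sample vectors.

    Assumption: the sample standard deviation is used, so at least two
    samples are needed and no feature may be constant.
    """
    columns = list(zip(*samples))
    return [mean(c) for c in columns], [stdev(c) for c in columns]


def burrows_delta(x, y, mu, sigma):
    """Mean absolute difference of the z-scores of x and y."""
    n = len(x)
    return sum(
        abs((xi - m) / s - (yi - m) / s)
        for xi, yi, m, s in zip(x, y, mu, sigma)
    ) / n


def argamon_linear_delta(x, y, sigma):
    """Square root of the mean squared, variance-scaled difference."""
    n = len(x)
    return math.sqrt(
        sum((xi - yi) ** 2 / s ** 2 for xi, yi, s in zip(x, y, sigma)) / n
    )


def eder_delta(x, y, sigma):
    """Burrows-style differences, down-weighted for later features.

    Assumption: n_i = i, i.e. features are ordered from most to least
    frequent, so the weight (n - i + 1) / n shrinks along the vector.
    """
    n = len(x)
    return sum(
        abs((xi - yi) / s) * (n - i + 1) / n
        for i, (xi, yi, s) in enumerate(zip(x, y, sigma), start=1)
    ) / n


def distance_matrix(samples, dist):
    """Full symmetric matrix of pairwise distances."""
    return [[dist(a, b) for b in samples] for a in samples]
```

A distance matrix for one measure can then be built with, for example, `distance_matrix(samples, lambda a, b: eder_delta(a, b, sigma))`, where `sigma` comes from `feature_stats(samples)`.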
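
Illustrative sketch for the clustering step of Sections II-D and III-B. Agglomerative clustering with Ward linkage can be written compactly with the Lance-Williams update, as in the pure-Python sketch below. It is not the R code used in the study: the function name, the returned merge format, and the decision to apply the update to squared dissimilarities and report the square root as the merge height are assumptions made for this example. The quadratic search for the closest pair is chosen for readability, not speed.

```python
import math


def ward_linkage(dist):
    """Agglomerative clustering with Ward linkage.

    dist: symmetric matrix (list of lists) of pairwise dissimilarities.
    Returns a list of merges (cluster_a, cluster_b, height, new_size);
    the original samples are clusters 0..n-1, and each merge creates a
    new cluster with the next free id.
    """
    n = len(dist)
    # Assumption: work on squared dissimilarities, keyed by (a, b), a < b.
    d = {(i, j): dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)}
    size = {i: 1 for i in range(n)}
    merges = []
    next_id = n

    def pair(a, b):
        return (a, b) if a < b else (b, a)

    while len(size) > 1:
        # Closest pair of clusters under the current dissimilarities.
        (i, j), d_ij = min(d.items(), key=lambda item: item[1])
        n_i, n_j = size[i], size[j]
        updated = {}
        for k, n_k in size.items():
            if k == i or k == j:
                continue
            # Lance-Williams update for Ward's criterion.
            updated[k] = (
                (n_i + n_k) * d[pair(k, i)]
                + (n_j + n_k) * d[pair(k, j)]
                - n_k * d_ij
            ) / (n_i + n_j + n_k)
        # Drop every entry that involves the two merged clusters.
        d = {p: v for p, v in d.items() if i not in p and j not in p}
        for k, value in updated.items():
            d[(k, next_id)] = value  # next_id is larger than any existing id
        del size[i], size[j]
        size[next_id] = n_i + n_j
        merges.append((i, j, math.sqrt(d_ij), n_i + n_j))
        next_id += 1
    return merges
```

Reading the returned merges from first to last corresponds to reading a dendrogram from the leaves towards the root: early merges join the most similar samples, and the final merge joins everything into one cluster.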