Quantitative Analysis of Textual Genres: Comparison of English and Lithuanian

Justina Mandravickaitė
Vilnius University, Lithuania
Baltic Institute of Advanced Technology, Lithuania
Email: justina@bpti.lt

Tomas Krilavičius
Vytautas Magnus University, Lithuania
Baltic Institute of Advanced Technology, Lithuania
Email: t.krilavicius@bpti.lt


Abstract: We report an ongoing study on the quantitative characteristics of texts written in different genres. At this stage, we compared Lithuanian and English texts in terms of genre. We used 16 indices that describe the frequency structure of a text and indicate several other characteristics of written texts. The initial study showed significant differences between the indices calculated for genre pairs of the same language. Hierarchical clustering suggested that the indices could serve as features for text categorization/classification by genre, though better results were achieved for Lithuanian texts.

Index Terms: quantitative genre analysis, frequency structure of text, vocabulary richness, stylometry, English, Lithuanian

I. INTRODUCTION

We report an ongoing study on the quantitative characteristics of texts written in different genres. It has been suggested that genres add to familiarity and the shorthand of communication [1], [2], [3] and therefore resonate with people. Also, genres tend to shift in accordance with public opinion and reflect the widespread culture of a certain time [4]. From an NLP perspective, genres are useful for text classification (e.g., [5]) and categorization (e.g., [6]), natural language generation (e.g., [7], [8]), etc.

At this stage, we present an initial quantitative analysis of Lithuanian and English texts of different genres (or, more precisely, super-genres [9], as the texts were grouped into broad categories or genre groups; for simplicity, the term “genre” is used in this paper). As the main point of interest was the frequency structure of text with respect to genre, we used 16 indices proposed in [10], [11], [12] and implemented in QUITA (Quantitative Index Text Analyzer) [13].

As we study textual genres with respect to style (the style of fiction, press, etc.), we apply computational stylistics, or stylometry. Stylometry is based on two hypotheses:
   • the human stylome hypothesis, i.e., each individual has a unique style [14];
   • the unique style of an individual can be measured [15], and thus stylometry allows gaining meta-knowledge [16], i.e., what can be learned from the text about the author: gender [17], [18], age [19], psychological characteristics [20], political affiliation [19], etc.

A genre can be considered a certain “style”; thus we assumed that stylometric analysis could aid our study of the quantitative characteristics of genres.

II. CORPORA AND METHODS

A. Data Sets and Preprocessing

We used part of the Corpus of the Contemporary Lithuanian Language [21] (≈ 1.5 million words) and the Freiburg-LOB Corpus of British English (F-LOB) (≈ 1 million words) [22] for our initial experiment. The composition of the Lithuanian material is the following: Fiction (17%), Documents (21%), Scientific (21%) and Periodicals (31%). The English material consists of Fiction (25%), General Prose (42%), Learned (16%) and Press (18%). The Lithuanian genre category Scientific corresponds to the English category Learned, while Lithuanian Periodicals corresponds to the English category Press. A more detailed breakdown of the F-LOB corpus by genre is given in Table I. The part of the Corpus of the Contemporary Lithuanian Language that we used for our study did not have such details available, only the genre groups described above.

As the English texts were already concatenated according to their genre, only minimal preprocessing was performed, i.e., line numbers and tags that marked textual structure were removed. For Lithuanian, as we had individual texts, all the samples were concatenated into 4 large documents based on genre group (or super-genre) and then partitioned into 5 parts each, in order to reduce the “fingerprint” of individual authorship as much as possible. Thus, for the Lithuanian part of the analysis we had 20 samples in total.

B. Features for Characterization of Genres

Most frequent words (MFW) are one of the most popular choices of features in stylometric analysis [23], [24], [25], [26], [27], [28] (usually, they coincide with function words [29], [30]). They are considered to be topic-neutral and perform well [31], [32], [33]. As our interest lay in the frequency structure of the text as well as in vocabulary richness with respect to genre, for our experiment we applied 16 indices implemented in QUITA (Quantitative Index Text Analyzer) [13]:
   • Type-Token Ratio (TTR) – ratio between the number of types and the number of tokens in a text, i.e., it shows vocabulary variation in a text;
   • h-point (h) – a fuzzy boundary in the word frequency table where the rank is the same as the frequency;


      Copyright held by the author(s).                                61
Table I
F-LOB CORPUS STATISTICS

Genre group     Category   Content of category
Press           A          Reportage
Press           B          Editorial
Press           C          Review
General prose   D          Religion
General prose   E          Skills, trades and hobbies
General prose   F          Popular lore
General prose   G          Belles lettres, biographies, essays
General prose   H          Miscellaneous
Learned         J          Science
Fiction         K          General fiction
Fiction         L          Mystery and detective fiction
Fiction         M          Science fiction
Fiction         N          Adventure and Western
Fiction         P          Romance and love story
Fiction         R          Humor


   • R1 – an indicator of vocabulary richness based on the h-point (h);
   • Repeat Rate (RR) – shows the degree of vocabulary concentration in a text, i.e., an inverse measure of vocabulary richness;
   • Relative Repeat Rate of McIntosh (RRmc) – the relative RR, for better comparison with the other indices;
   • Hapax Legomenon Percentage (HP) – ratio of the number of hapax legomena, i.e., words that occur only once in a text, to the number of tokens;
   • Lambda (Λ) – describes the frequency structure of a text, i.e., it is related to vocabulary richness, but also considers the relationship between neighbouring frequencies;
   • Gini Coefficient (G) – a measure of statistical dispersion; in linguistics G is used as a measure of vocabulary richness;
   • R4 – the reversed Gini coefficient;
   • Curve length (L) – as many vocabulary richness measures are based on the curve of the rank-frequency distribution, L is defined as the sum of the Euclidean distances between all neighbouring points on the curve;
   • Curve length R Indicator (R) – an indicator of vocabulary richness derived from the curve length (L);
   • Entropy (E) – in linguistics, entropy expresses the degree of vocabulary concentration in the text;
   • Adjusted Modulus (AM) – a frequency structure indicator, independent of text length;
   • Writer’s View (WV) – an indicator defined by the angle between the h-point and the ends of the rank-frequency distribution, i.e., the golden ratio;
   • Average Tokens Length (ATL) – the arithmetic mean of the lengths of tokens;
   • Token Length Frequency Spectrum (TLFS) – a list of all token lengths in a text with their frequencies.

Detailed formulas of the indexes (except for TLFS), based on [13] and [10], are presented in Table II. The values of the indexes were standardized to make them comparable.

C. Distance Measures

Stylometry refers to the study of linguistic style, usually of written language. It uses a variety of statistical methods, although a common technique is to calculate distances or (dis)similarities between texts and process the output with different visualization methods. Studies have been performed to determine which distance or similarity measures are more appropriate in different scenarios of stylometric analysis. For example, F. Jannidis and S. Evert found that the Cosine Delta measure outperformed all other measures for novels written in English, French and German [36], [37]. Burrows’s Delta distance is typically used for stylometric analysis, as it has also proved effective for English and German [33], [26]. However, it was less successful for highly inflected languages, e.g., Latin and Polish [26]. Thus in such cases, especially when the most frequent words were used as features, Eder’s Delta was recommended [38]; it is a modified Burrows’s Delta that gives more weight to the frequent features and rescales less frequent ones in order to avoid random infrequent features. Also, taking into consideration the variety of possible text lengths, Eder’s Simple Delta and the Binomial Index were useful for Lithuanian texts (the experiments were performed on transcripts of plenary sittings of the Lithuanian Parliament) [39]. As we aim to compare the performance of distance or (dis)similarity measures already used in stylometry and other fields of research, e.g., ecology [40], biology [35] and the social sciences [41], we used a variety of them; their formulas [39] are presented in Table III.

D. Experimental Setup

For stylometric analysis (calculation of distance or (dis)similarity and plotting the relations among text samples), R, a free software environment for statistical computing and graphics, was used [42], together with two of its packages, “stylo” [34] and


Table II
INDEXES AND THEIR FORMULAS

Index                                       Formula
Type-Token Ratio (TTR)                      $\frac{V}{N}$
h-point (h)                                 $r = f(r)$
R1                                          $1 - \left(F(h) - \frac{h^2}{2N}\right)$
Repeat Rate (RR)                            $\sum_{i=1}^{V} p_i^2$
Relative Repeat Rate of McIntosh (RRmc)     $\frac{1 - \sqrt{RR}}{1 - 1/\sqrt{V}}$
Hapax Legomenon Percentage (HP)             $\frac{N_h}{N}$
Lambda (Λ)                                  $\frac{L \, (\log_{10} N)}{N}$
Gini Coefficient (G)                        $\frac{1}{V}\left(V + 1 - 2 m_1'\right)$
R4                                          $1 - G$
Curve length (L)                            $\sum_{r=1}^{V-1} \sqrt{(f(r) - f(r+1))^2 + 1}$
Curve length R Indicator (R)                $1 - \frac{L_h}{L}$
Entropy (E)                                 $-\sum_{i=1}^{K} p_i \, \mathrm{ld} \, p_i$
Adjusted Modulus (AM)                       $\frac{\frac{1}{h}\left(f(1)^2 + V^2\right)^{1/2}}{\log_{10} N}$
Writer’s View (WV)                          $\frac{-\left[(h-1)(f_1-h) + (h-1)(V-h)\right]}{\left[(h-1)^2 + (f_1-h)^2\right]^{1/2} \left[(h-1)^2 + (V-h)^2\right]^{1/2}}$
Average Tokens Length (ATL)                 $\frac{1}{N} \sum_{i=1}^{N} x_i$

Where $V$ – number of types, $N$ – number of tokens, $r$ – rank/individual rank, $f(r)$ – frequency of the rank, $F(h)$ – cumulative relative frequency up to the h-point, $h$ – h-point, $p_i$ – individual probabilities, estimated by means of relative frequencies, $RR$ – Repeat Rate, $N_h$ – number of hapax legomena, $L$ – arc length of the rank-frequency distribution, $m_1'$ – average of the frequency distribution, $G$ – Gini coefficient, $f$ – individual frequency, $L_h$ – curve length above the h-point, $K$ – inventory size, $\mathrm{ld}$ – logarithm to the base 2, $f_1$ – the highest frequency, $x_i$ – individual length of the token.
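
To make some of the formulas in Table II concrete, the following is a minimal Python sketch (not QUITA itself) that computes TTR, the h-point, RR, RRmc, HP, entropy, curve length L and ATL from a list of tokens. The function names and the toy sentence are hypothetical. Table II only states $r = f(r)$ for the h-point, so the linear interpolation used when no rank equals its frequency is an assumption based on the usual definition of the h-point.

```python
import math
from collections import Counter


def rank_frequencies(tokens):
    """Word frequencies sorted from most to least frequent (rank 1 first)."""
    return sorted(Counter(tokens).values(), reverse=True)


def h_point(freqs):
    """Rank r with f(r) == r; otherwise interpolate between the two
    neighbouring ranks (assumption: Table II only gives r = f(r))."""
    prev = None  # last (rank, frequency) pair with frequency > rank
    for r, f in enumerate(freqs, start=1):
        if f == r:
            return float(r)
        if f < r:
            if prev is None:
                return float(r)
            ri, fi = prev
            # intersection of the line through (ri, fi), (r, f) with f = r
            return (fi * r - f * ri) / (r - ri + fi - f)
        prev = (r, f)
    return float(len(freqs))  # fallback: the curve never crosses f(r) = r


def indices(tokens):
    """A few Table II indices; assumes at least two distinct types."""
    n = len(tokens)                      # N, number of tokens
    freqs = rank_frequencies(tokens)
    v = len(freqs)                       # V, number of types
    probs = [f / n for f in freqs]       # relative frequencies p_i
    rr = sum(p * p for p in probs)
    return {
        "TTR": v / n,
        "h": h_point(freqs),
        "RR": rr,
        "RRmc": (1 - math.sqrt(rr)) / (1 - 1 / math.sqrt(v)),
        "HP": sum(1 for f in freqs if f == 1) / n,
        "E": -sum(p * math.log2(p) for p in probs),
        "L": sum(math.sqrt((a - b) ** 2 + 1)
                 for a, b in zip(freqs, freqs[1:])),
        "ATL": sum(len(t) for t in tokens) / n,
    }


# Toy example (hypothetical text, whitespace tokenization).
sample = "the cat sat on the mat and the dog sat".split()
for name, value in indices(sample).items():
    print(f"{name}: {value:.4f}")
```

In this toy sentence the rank-frequency sequence is 3, 2, 1, 1, 1, 1, 1, so the second rank has frequency 2 and the h-point is found without interpolation. Real tokenization and lemmatization choices would change the values, which is one reason the indices of different corpora are hard to compare directly.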


“vegan” [43]. For practical reasons, these packages were merged together by [44].

The indexes that were taken as features for our stylometric analysis of textual genres (Type-Token Ratio (TTR), h-Point, Entropy, Average Tokens Length (ATL), R1, Repeat Rate (RR), Relative Repeat Rate of McIntosh (RRmc), Lambda (Λ), Adjusted Modulus (AM), Gini’s coefficient (G), R4, Hapax Legomena Percentage (HP), Curve Length (L), Writer’s View (WV), Curve Length Indicator (R), Token Length Frequency Spectrum (TLFS)) were calculated with QUITA (Quantitative Index Text Analyzer) [13]. Also, to check the statistical significance of the differences in the calculated indices between genres, an asymptotic u-test [45] was performed.

Then the dissimilarity between the text samples was calculated using the selected distance or similarity measures, and a distance matrix was generated. Next, hierarchical clustering was applied to group samples by similarity [46], and dendrograms were used to visualize the results.

The goal of this study was to identify stylistic dissimilarities and map positions of the text samples in relation to each other, not to classify them by genre. Therefore hierarchical clustering with Ward linkage (which minimizes the total within-cluster variance [46]) was chosen.

III. RESULTS

A. Statistical Significance of Indicators

The significance (asymptotic u-test) of the differences in the calculated indices between genres is reported in Table IV. The suffix “_LT” indicates the Lithuanian part of the experimental material, while “_EN” indicates the English part. Most of the calculated indicators achieved significance under at least some conditions. For the Lithuanian part, 3 indices (TTR, HP and R) were significant under all test conditions, and every index reached significance under at least one condition. For the English part, only 1 indicator (ATL) was significant under all test conditions, while 2 indices (Lambda and HP) did not achieve significance under any condition.

B. Stylometric Analysis

As already mentioned, Burrows’s Delta distance is typically used for stylometric analysis [33], [26] with the


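Table III
DISTANCE/SIMILARITY MEASURES AND THEIR FORMULAS

Distance/Similarity measure     Formula
Canberra Distance               $\sum_{i=1}^{n} \frac{|x_i - y_i|}{|x_i| + |y_i|}$
Cosine Distance                 $\frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \, \sqrt{\sum_{i=1}^{n} y_i^2}}$
Burrows’ Delta                  $\frac{1}{n} \sum_{i=1}^{n} \left| \frac{x_i - \mu_i}{\sigma_i} - \frac{y_i - \mu_i}{\sigma_i} \right|$
Argamon’s Linear Delta          $\frac{1}{n} \sum_{i=1}^{n} \sqrt{\frac{(x_i - y_i)^2}{\sigma_i^2}}$
Eder’s Delta                    $\frac{1}{n} \sum_{i=1}^{n} \left( \left| \frac{x_i - y_i}{\sigma_i} \right| \cdot \frac{n - n_i + 1}{n} \right)$
Eder’s Simple Delta [34]        $\sum_{i=1}^{n} \left| \sqrt{x_i} - \sqrt{y_i} \right|$
Bray-Curtis Dissimilarity       $\frac{\sum_{i=1}^{n} |x_i - y_i|}{\sum_{i=1}^{n} (x_i + y_i)}$
Kulczynski Distance             $\frac{\sum_{i=1}^{n} |x_i - y_i|}{\sum_{i=1}^{n} \min(x_i, y_i)}$
Jaccard Index                   $\frac{2 \, \frac{\sum_{i=1}^{n} |x_i - y_i|}{\sum_{i=1}^{n} (x_i + y_i)}}{1 + \frac{\sum_{i=1}^{n} |x_i - y_i|}{\sum_{i=1}^{n} (x_i + y_i)}}$
Gower Similarity                $\frac{1}{n} \sum_{i=1}^{n} \frac{|x_i - y_i|}{\max_i - \min_i}$
Mountford Index                 $\frac{1}{\alpha}$, where $\alpha$ is the parameter of Fisher’s log-series
Binomial Index [35]             $\sum_{i=1}^{n} \frac{x_i \ln\frac{x_i}{2n} + y_i \ln\frac{y_i}{2n} - 2n \ln\frac{1}{2}}{2n}$

Where $x_i$ and $y_i$ are the corresponding $i$-th values of vectors $X = (x_1, x_2, \ldots, x_n)$ and $Y = (y_1, y_2, \ldots, y_n)$, $n$ is the size of the compared vectors, $\sigma_i$ and $\mu_i$ are the standard deviation and mean of the $i$-th values of all vectors used in the comparison, $n_i$ is the queue number of the $i$-th value in a vector (usually $n_i = i$), and $\min_i$ and $\max_i$ are the minimum and maximum $i$-th values among all compared vectors.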
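
As an illustration of how the measures in Table III turn feature vectors into a distance matrix, the following Python sketch implements four of them (Canberra, Bray-Curtis, Eder’s Simple Delta and Burrows’ Delta). It is not the “stylo” or “vegan” implementation; the function names and the three toy vectors are hypothetical, and the use of the sample standard deviation for $\sigma_i$ is an assumption.

```python
import math
import statistics


def canberra(x, y):
    """Canberra distance; terms with |x_i| + |y_i| = 0 are skipped."""
    return sum(abs(a - b) / (abs(a) + abs(b))
               for a, b in zip(x, y) if abs(a) + abs(b) > 0)


def bray_curtis(x, y):
    """Bray-Curtis dissimilarity; assumes the summed values are positive."""
    return (sum(abs(a - b) for a, b in zip(x, y))
            / sum(a + b for a, b in zip(x, y)))


def eder_simple_delta(x, y):
    """Eder's Simple Delta; requires non-negative feature values."""
    return sum(abs(math.sqrt(a) - math.sqrt(b)) for a, b in zip(x, y))


def column_stats(vectors):
    """Per-feature mean and sample standard deviation over all vectors
    (sample rather than population deviation is an assumption)."""
    columns = list(zip(*vectors))
    mu = [statistics.mean(c) for c in columns]
    sigma = [statistics.stdev(c) for c in columns]
    return mu, sigma


def burrows_delta(x, y, mu, sigma):
    """Mean absolute difference of z-scores; assumes every sigma_i > 0."""
    return sum(abs((a - m) / s - (b - m) / s)
               for a, b, m, s in zip(x, y, mu, sigma)) / len(x)


def distance_matrix(vectors, dist):
    """Square matrix of pairwise distances between all vectors."""
    return [[dist(u, v) for v in vectors] for u in vectors]


# Toy feature vectors (hypothetical values, one row per text sample).
samples = [[1, 2, 3], [2, 2, 5], [3, 5, 4]]
mu, sigma = column_stats(samples)
for row in distance_matrix(samples,
                           lambda u, v: burrows_delta(u, v, mu, sigma)):
    print([round(d, 3) for d in row])
```

A matrix of this kind is what a hierarchical clustering routine takes as input. Because Burrows’ Delta depends on the mean and deviation over all compared samples, adding or removing a sample changes every pairwise distance, unlike Canberra or Bray-Curtis.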

Table IV
RESULTS OF SIGNIFICANCE TEST: GENRE PAIRS

First variable        Second variable     Significant differences in indexes
Scientific_texts_LT   Documents_LT        TTR, h-Point, Entropy, R1, RR, Lambda, AM, G, R4, HP, L, WV, R
Scientific_texts_LT   Fiction_LT          TTR, h-Point, Entropy, ATL, RR, Lambda, G, R4, HP, L, WV, R, TLFS
Scientific_texts_LT   Periodicals_LT      TTR, h-Point, Entropy, ATL, RR, RRmc, Lambda, G, R4, HP, L, R, TLFS
Documents_LT          Fiction_LT          TTR, h-Point, ATL, R1, Lambda, AM, G, R4, HP, R, TLFS
Documents_LT          Periodicals_LT      TTR, Entropy, ATL, R1, RR, RRmc, Lambda, AM, G, R4, HP, L, WV, R
Fiction_LT            Periodicals_LT      TTR, h-Point, Entropy, ATL, RR, RRmc, AM, HP, L, WV, TLFS
Press_EN              Learned_EN          h-Point, Entropy, ATL, RR, RRmc, AM, L, WV
Press_EN              Fiction_EN          Entropy, ATL, RR, RRmc, AM, L, WV, TLFS
Press_EN              General_prose_EN    ATL, RR, RRmc, R
Learned_EN            Fiction_EN          ATL, RR, RRmc, WV, R, TLFS
Learned_EN            General_prose_EN    TTR, h-Point, Entropy, ATL, R1, AM, G, R4, L, WV, R
Fiction_EN            General_prose_EN    Entropy, ATL, RR, RRmc, AM, L, WV, R, TLFS

most frequent words (MFW) [23], [24], [25], [26], [27], [28] or function words (which usually occur among the MFW [29], [30]) as features. However, we achieved the best results with the Eder’s Delta distance measure for the English dataset and the Argamon’s Linear Delta distance measure for the Lithuanian dataset (for the formulas, see Table III). Though we experimented with all the distance or similarity measures described in Table III, due to the limited space of the paper we present only the latter results (see Fig. 1 and 2).

Hierarchical clustering [47] of an agglomerative type was used. Ward linkage, in which the pair of clusters to merge at each step is chosen based on the optimal value of an objective function [48], was applied. This generated a hierarchy of clusters, which was visualized as a dendrogram: going from the right side, separate documents were linked into clusters by their similarity until all the documents were merged into one cluster.

The results showed clear differentiation of text samples by genre for Lithuanian (all samples were clustered by genre correctly), while clustering of the English dataset was somewhat less successful: some samples were attached to an incorrect cluster. The reason could be language characteristics (the indicators used as features react to the degree of inflection the language possesses [10]), i.e., English is an analytic language, while


Lithuanian is synthetic, and thus comparison of texts written in different languages becomes a non-trivial issue. Besides, the results might have been influenced by the grouping of texts into genres and genre groups, as it seems that this procedure followed different criteria for our English and Lithuanian datasets, e.g., for the English part a significantly bigger variety of genres was included in the genre groups. Also, the construction of comparable datasets for genre analysis might need to be optimized further in terms of sample lengths (even though part of the indicators we used in this study are text-length-independent [13] and unsupervised machine learning, in this case hierarchical cluster analysis, allows downscaling the class imbalance problem) and in terms of the samples themselves, so that they represent genre groups and genres best while still taking authorship into consideration (we need to escape the authorial “fingerprint” and concentrate on the qualities of textual genres and the means to identify them).

To summarize, stylometric analysis combined with quantitative textual indicators that mark the frequency structure or vocabulary richness of the text allowed us to map/position text samples by genre, though the results were more successful for the Lithuanian part of the experiment. Eder’s Delta (for English) and Argamon’s Linear Delta (for Lithuanian) distance measures provided the best results; however, this is by no means the only possible configuration. Other measures could also provide similar performance in a different experimental setup, e.g., different corpora, parameters for text analysis or selection of features. To reach a more solid conclusion, further research is needed.

IV. CONCLUSION AND FUTURE WORK

We presented ongoing work on the quantitative analysis of texts written in different genres for English and Lithuanian. Textual genre in our study was perceived as a certain “style”, and thus stylometric analysis was performed.

REFERENCES

[1] A. Tereszkiewicz, “Lead, headline, news abstract?-genre conventions of news sections on newspaper websites,” Studia Linguistica Universitatis Iagellonicae Cracoviensis, no. 129, p. 211, 2012.
[2] J. Swales, Genre analysis: English in academic and research settings. Cambridge University Press, 1990.
[3] A. J. Devitt, “Generalizing about genre: New conceptions of an old concept,” College composition and Communication, vol. 44, no. 4, pp. 573–586, 1993.
[4] C. R. Miller, “Genre as social action (1984), revisited 30 years later (2014),” Letras & Letras, vol. 31, no. 3, pp. 56–72, 2015.
[5] Y. Kim and S. Ross, “Variation of word frequencies across genre classification tasks,” 2007.
[6] E. Stamatatos, N. Fakotakis, and G. Kokkinakis, “Automatic text categorization in terms of genre and author,” Computational linguistics, vol. 26, no. 4, pp. 471–495, 2000. [Online]. Available: http://www.aclweb.org/anthology/J00-4001.pdf
[7] O. Stock and C. Strapparava, “The act of creating humorous acronyms,” Applied Artificial Intelligence, vol. 19, no. 2, pp. 137–151, 2005.
[8] C. van der Lee, E. Krahmer, and S. Wubben, “Pass: A dutch data-to-text system for soccer, targeted towards specific audiences,” in Proceedings of the 10th International Conference on Natural Language Generation, 2017, pp. 95–104.
[9] G. Steen, “Genres of discourse and the definition of literature,” Discourse Processes, vol. 28, no. 2, pp. 109–120, 1999.
[10] I.-I. Popescu, Word frequency studies. Walter de Gruyter, 2009, vol. 64.
[11] I.-I. Popescu, J. Mačutek, and G. Altmann, “Word forms, style and typology,” Glottotheory, vol. 3, no. 1, pp. 89–96, 2010.
[12] I.-I. Popescu, R. Čech, and G. Altmann, The lambda-structure of texts. Ram-Verlag Lüdenscheid, 2011.
[13] M. Kubát, V. Matlach, and R. Čech, Studies in Quantitative Linguistics 18: QUITA-Quantitative Index Text Analyzer. RAM-Verlag, 2014.
[14] H. Van Halteren, H. Baayen, F. Tweedie, M. Haverkort, and A. Neijt, “New machine learning methods demonstrate the existence of a human stylome,” Journal of Quantitative Linguistics, vol. 12, no. 1, pp. 65–77, 2005.
[15] E. Stamatatos, “A survey of modern authorship attribution methods,” Journal of the American Society for information Science and Technology, vol. 60, no. 3, pp. 538–556, 2009. [Online]. Available: http://www.clips.ua.ac.be/stylometry/Lit/Stamatatos_survey2009.pdf
[16] W. Daelemans, “Explanation in computational stylometry,” in Computational Linguistics and Intelligent Text Processing. Springer, 2013, pp. 451–462. [Online]. Available: http://www.clips.ua.ac.be/~walter/papers/2013/d13.pdf
                                                                       [17] K. Luyckx, W. Daelemans, and E. Vanhoutte, “Stylogenetics: Clustering-
  1) Features (frequency structure indicators and measures of               based stylistic analysis of literary corpora,” in Proceedings of the
     vocabulary richness) used in this study seemed promis-                 5th International Conference on Language Resources and Evaluation
     ing for characterization of genres as there were signif-               (LREC’06), Genoa, Italy, 2006.
                                                                       [18] S. Argamon, M. Koppel, J. Fine, and A. R. Shimoni, “Gender, genre,
     icant differences for genre pairs in terms of calculated               and writing style in formal written texts,” To appear in Text, vol. 23,
     indices.                                                               p. 3, 2003. [Online]. Available: http://www.lingcog.iit.edu/wp-content/
  2) As a part of stylometric analysis, 12 distance or                      papercite-data/pdf/gendertext04.pdf
                                                                       [19] M. Dahllöf, “Automatic prediction of gender, political affiliation, and age
     (dis)similarity measures were experimented on. Out of                  in swedish politicians from the wording of their speeches - a comparative
     them, Eder’s Delta (for English dataset) and Argamon’s                 study of classifiability,” Literary and linguistic computing, vol. 27, no. 2,
     Linear Delta (for Lithuanian dataset) provided the best                pp. 139–153, 2012.
                                                                       [20] K. Luyckx and W. Daelemans, “Personae: a corpus for author
     results for our genre study.                                           and personality prediction from text,” in Proceedings of the Sixth
  3) Cluster analysis allowed groupings of text samples by                  International Conference on Language Resources and Evaluation
     genre, though results were more successful for Lithua-                 (LREC’08), B. M. J. M. J. O. S. P. D. T. Nicoletta Calzolari (Con-
                                                                            ference Chair), Khalid Choukri, Ed. Marrakech, Morocco: European
     nian dataset in comparison to English one: all Lithuanian              Language Resources Association (ELRA), may 2008, http://www.lrec-
     samples were grouped correctly.                                        conf.org/proceedings/lrec2008/.
However, for more substantial conclusions additional research          [21] E. Rimkutė, J. Kovalevskaitė, V. Melninkaitė, A. Utka, and D. Vitkutė-
                                                                            Adžgauskienė, “Corpus of contemporary lithuanian language–the stan-
is necessary. Thus we plan to extend this work to larger text               dardised way,” in Human Language Technologies–The Baltic Perspec-
collections and additional genres. More extensive study on                  tive: Proceedings of the Fourth International Conference Baltic HLT
textual indicators in terms of genre is important as well. We               2010, vol. 219. IOS Press, 2010, p. 154.
                                                                       [22] M. Hundt, A. Sand, and R. Siemund, Manual of information to accom-
also plan to examine other languages to see whether similar                 pany the Freiburg-LOB Corpus of British English (’FLOB’). Albert-
effects found in this study would persist.                                  Ludwigs-Universität Freiburg, 1998.


     Copyright held by the author(s).                             65
Figure 1. Best clustering results for English data: Eder’s Delta distance measure. The names of the samples in the cluster analysis were constructed as follows:
genre-group_genre; see Table I for the details. As there was only one sample for the Learned genre group, it was split into two equal samples: J1 and J2.


Figure 2. Best clustering results for Lithuanian data: Argamon’s Linear Delta distance measure. The names of the samples in the cluster analysis were
constructed as follows: genre-group_number-of-sample, where D = Documents, G = Fiction, M = Scientific texts, and P = Periodicals.
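
The dendrograms in Figures 1 and 2 come from hierarchical clustering over Delta-type distances between text samples. The following is a minimal, illustrative sketch of that kind of pipeline, not the setup used in this study: classic Burrows’s Delta [33] (mean absolute difference of z-scored word frequencies) stands in for Eder’s Delta and Argamon’s Linear Delta, plain average linkage stands in for the clustering method, and the toy samples, the function names and the `n_mfw` parameter are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def burrows_delta_matrix(samples, n_mfw):
    """Return (names, matrix) of classic Burrows's Delta between samples.

    samples: dict mapping a sample name to its list of tokens.
    n_mfw:   number of most frequent words (pooled over all samples) to use.
    """
    names = list(samples)
    pooled = Counter()
    for tokens in samples.values():
        pooled.update(tokens)
    vocabulary = [word for word, _ in pooled.most_common(n_mfw)]

    # Relative frequency of each selected word in each sample.
    freqs = []
    for name in names:
        counts = Counter(samples[name])
        freqs.append([counts[w] / len(samples[name]) for w in vocabulary])

    # z-score every word (column) across the samples.
    zscores = [[0.0] * len(vocabulary) for _ in names]
    for j in range(len(vocabulary)):
        column = [row[j] for row in freqs]
        mu, sd = mean(column), pstdev(column)
        for i in range(len(names)):
            zscores[i][j] = (column[i] - mu) / sd if sd > 0 else 0.0

    # Delta = mean absolute difference of z-scores over the selected words.
    matrix = [
        [mean(abs(a - b) for a, b in zip(zi, zj)) for zj in zscores]
        for zi in zscores
    ]
    return names, matrix


def average_linkage(names, dist, k):
    """Agglomerative clustering (average linkage) down to k clusters."""
    clusters = [[i] for i in range(len(names))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = mean(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters[b]
        del clusters[b]
    return [sorted(names[i] for i in c) for c in clusters]


# Hypothetical toy samples, named as in Figure 2 (G = Fiction, D = Documents).
samples = {
    "G_1": "she said he said she was there and he was there".split(),
    "G_2": "he said she was here and she said he was gone today".split(),
    "D_1": "the party shall pay the fee and the party shall notify the agency".split(),
    "D_2": "the agency shall notify the party and the agency shall pay the fee".split(),
}

names, dist = burrows_delta_matrix(samples, n_mfw=7)
clusters = average_linkage(names, dist, k=2)
print(clusters)
```

With real corpora the samples would be far longer and the number of most frequent words much larger; the toy data only shows how frequency profiles of function-like words can separate genre groups.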


[23] J. F. Burrows, “Not unless you ask nicely: The interpretative nexus
     between analysis and information,” Literary and Linguistic Computing,
     vol. 7, no. 2, pp. 91–109, 1992.
[24] D. L. Hoover, “Corpus stylistics, stylometry, and the styles of Henry
     James,” Style, vol. 41, no. 2, p. 174, 2007.
[25] M. Eder, “Mind your corpus: systematic errors in authorship attribution,”
     Literary and Linguistic Computing, p. fqt039, 2013. [Online]. Available:
     http://www.dh2012.uni-hamburg.de/conference/programme/abstracts/mind-your-corpus-systematic-errors-in-authorship-attribution.1.html
[26] J. Rybicki and M. Eder, “Deeper Delta across genres and
     languages: do we really need the most frequent words?” Literary
     and Linguistic Computing, vol. 26, no. 3, pp. 315–321, 2011.
     [Online]. Available:
     http://dh2010.cch.kcl.ac.uk/academic-programme/abstracts/papers/pdf/ab-688.pdf
[27] M. Eder and J. Rybicki, “Do birds of a feather really flock together, or
     how to choose training samples for authorship attribution,” Literary and
     Linguistic Computing, p. fqs036, 2012.
[28] M. Eder, “Computational stylistics and biblical translation: How reliable
     can a dendrogram be,” The translator and the computer, pp. 155–170,
     2013.
[29] J.-R. Hochmann, A. D. Endress, and J. Mehler, “Word frequency as
     a cue for identifying function words in infancy,” Cognition, vol. 115,
     no. 3, pp. 444–457, 2010.
[30] B. Sigurd, M. Eeg-Olofsson, and J. Van Weijer, “Word length, sentence
     length and frequency–Zipf revisited,” Studia Linguistica, vol. 58, no. 1,
     pp. 37–52, 2004.
[31] P. Juola and R. H. Baayen, “A controlled-corpus experiment in
     authorship identification by cross-entropy,” Literary and Linguistic
     Computing, vol. 20, no. Suppl, pp. 59–67, 2005.
[32] D. I. Holmes, L. J. Gordon, and C. Wilson, “A widow and her
     soldier: Stylometry and the American Civil War,” Literary and Linguistic
     Computing, vol. 16, no. 4, pp. 403–420, 2001.
[33] J. Burrows, “‘Delta’: A measure of stylistic difference and a guide to
     likely authorship,” Literary and Linguistic Computing, vol. 17, no. 3,
     pp. 267–287, 2002.
[34] M. Eder, J. Rybicki, and M. Kestemont, “Stylometry with R: a package
     for computational text analysis,” R Journal, vol. 16, no. 1, 2016.
[35] M. J. Anderson and R. B. Millar, “Spatial variation and effects of habitat
     on temperate reef fish assemblages in northeastern New Zealand,” Journal
     of Experimental Marine Biology and Ecology, vol. 305, no. 2,
     pp. 191–221, 2004.
[36] F. Jannidis, S. Pielström, C. Schöch, and T. Vitt, “Improving Burrows’
     Delta - an empirical evaluation of text distance measures,” in Digital
     Humanities Conference, 2015.
[37] S. Evert, T. Proisl, C. Schöch, F. Jannidis, S. Pielström, and T. Vitt,
     “Explaining Delta, or: How do distance measures for authorship attribution
     work?” 2015.
[38] M. Eder, J. Rybicki, M. Kestemont, and M. M. Eder, “Package ‘stylo’,”
     2014.
[39] D. Stanikunas, J. Mandravickaite, and T. Krilavicius, “Comparison of
     distance and similarity measures for stylometric analysis of Lithuanian
     texts,” 2017.
[40] H. S. Horn, “Measurement of “overlap” in comparative ecological
     studies,” The American Naturalist, vol. 100, no. 914, pp. 419–424, 1966.
[41] T. Krilavičius and V. Morkevičius, “Mining social science data: a study
     of voting of the members of the Seimas of Lithuania by using
     multidimensional scaling and homogeneity analysis,” Intellectual Economics,
     vol. 5, no. 2, pp. 224–243, 2011.
[42] R. C. Team et al., “R: A language and environment for statistical
     computing,” 2013.
[43] J. Oksanen, F. G. Blanchet, R. Kindt, P. Legendre, R. O’Hara, G. L.
     Simpson, P. Solymos, M. H. H. Stevens, and H. Wagner, “vegan:
     Community ecology package. R package version 1.17-2,” R Foundation
     for Statistical Computing, Vienna. Available:
     CRAN.R-project.org/package=vegan (July 2012), 2011.
[44] D. Stanikūnas, “Matų ir metodų poveikis lietuviškų tekstų stilometrinei
     analizei,” Master’s thesis, Vytautas Magnus University, 2017.
[45] M. P. Fay and M. A. Proschan, “Wilcoxon-Mann-Whitney or t-test? On
     assumptions for hypothesis tests and multiple interpretations of decision
     rules,” Statistics Surveys, vol. 4, p. 1, 2010.
[46] B. S. Everitt, S. Landau, M. Leese, and D. Stahl, “Hierarchical
     clustering,” Cluster Analysis, 5th Edition, pp. 71–110, 2011.
[47] L. Rokach and O. Maimon, “Clustering methods,” in Data mining and
     knowledge discovery handbook. Springer, 2005, pp. 321–352.
[48] J. H. Ward Jr, “Hierarchical grouping to optimize an objective function,”
     Journal of the American Statistical Association, vol. 58, no. 301,
     pp. 236–244, 1963.

