=Paper=
{{Paper
|id=Vol-1852/p01
|storemode=property
|title=Comparison of distance and similarity measures
for stylometric analysis of lithuanian texts
|pdfUrl=https://ceur-ws.org/Vol-1852/p01.pdf
|volume=Vol-1852
|authors=Daumantas Stanikūnas,Justina Mandravickaitė,Tomas Krilavičius
}}
==Comparison of distance and similarity measures
for stylometric analysis of lithuanian texts==
<pdf width="1500px">https://ceur-ws.org/Vol-1852/p01.pdf</pdf>
<pre>
Comparison of distance and similarity measures for
    stylometric analysis of Lithuanian texts
          Daumantas Stanikūnas (1), Justina Mandravickaitė (2), Tomas Krilavičius (3)

          (1) Department of Mathematics and Statistics, Vytautas Magnus University, Lithuania
              Email: daumantas.stanikunas@fcis.vdu.lt
          (2) Baltic Institute of Advanced Technology, Vilnius University, Lithuania
              Email: justina@bpti.lt
          (3) Baltic Institute of Advanced Technology, Vytautas Magnus University, Lithuania
              Email: t.krilavicius@bpti.lt


   Abstract: Constant developments in information and computer technologies make it
possible to handle constantly increasing amounts of data, thereby expanding the research
possibilities. In this article, we discuss and compare distance and similarity measures
used in stylometric analysis which could be applied to analyze Lithuanian texts. As the
corpus for the analysis, transcripts of parliamentary debates by two politicians of the
Lithuanian Parliament were chosen. Furthermore, a comparison of distance measures,
stylometric analysis and visualization were performed. The objective of the experiment was
to identify which measures perform better in stylometric analysis of Lithuanian texts and
to explore where these differences in performance occur. Summarizing the experiment
results, the recommendations are as follows: the number of Most Frequent Words used should
be at least 1000; Eder’s Simple Delta measure can be used in general stylometric analysis
of transcriptions of parliamentary debates of the Lithuanian Parliament; when Most
Frequent Words are limited to 2000, Binomial Index shows an increase in performance over
Eder’s Simple Delta and thus is more suitable.
   Index Terms: stylometry; computational stylistics; parliamentary speech; R; statistical
analysis; distance measure; similarity measure; data visualization

                                   I. INTRODUCTION

   Stylometry refers to the study of linguistic style, usually in written language. It uses
a variety of statistical methods to analyze a text in order to determine its author. A
common technique is to calculate distances or similarities between texts and process the
output using different visualization methods. A significant number of experiments have
already been performed by various researchers in order to find out which measures show
better results in different cases of stylometric analysis. F. Jannidis and S. Evert
analyzed three collections of novels in English, French and German and showed that the
Cosine Delta measure outperforms all other measures on their three collections [2], [3],
while J. Mandravickaitė explained that measures such as Burrows’ Delta would not work
well with highly inflected languages (Latin, Polish) and suggested using Eder’s Delta
measure [10].
   This paper presents on-going experimental work on identifying the most suitable
measures used in stylometry when analyzing Lithuanian texts. The objective of these
experiments is to compare the performance of distance and similarity measures already used
in stylometry and other fields of research [4], [5], [11] using the R language, with a
focus on the transcripts of speeches of politicians in the Lithuanian Parliament.
   These experiments cover the domain of transcriptions of parliamentary debates of the
Lithuanian Parliament, which is only a small fraction of the Lithuanian language; however,
they represent the richness and variety of the language quite well, and hence we expect
that the results could be useful in the analysis of other Lithuanian texts.

                                II. DATA AND METHODS

A. Data Preparation
   Data for the analysis were taken from corpora collected in the ASTRA project*. The data
consist of transcriptions of parliamentary debates of the Lithuanian Parliament. For this
investigation we use transcripts of two politicians from the 2008-2012 term. The criteria
for selecting the data for the comparison of measures were chosen in such a way that the
measures would provide the biggest difference between texts of different authors (or
speakers, in our case). Considering this, the authors for the experiments were chosen to
be of different gender and political standing (position and opposition) to strengthen the
dissimilarity. This approach differs from similar research by J. Kapočiūtė-Dzikienė,
whose goal was to identify the best methods and features (MFW were not used) for
authorship attribution based on machine learning [15]. Her approach was to analyze data
sets that are as similar as possible while the authors of these data sets are actually
different.

   * Automatic extraction of style applied to individual authors and groups of authors,
     http://dangus.vdu.lt/~jkd/?page_id=2

   Copyright © 2017 held by the authors.


                     Table I: Distance/similarity measures and their formulas

  Distance/Similarity measure        Formula (LaTeX notation)

  Manhattan Distance                 \sum_{i=1}^{n} |x_i - y_i|

  Euclidean Distance                 \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}

  Canberra Distance                  \sum_{i=1}^{n} \frac{|x_i - y_i|}{|x_i| + |y_i|}

  Cosine Distance                    \frac{\sum_{i=1}^{n} x_i y_i}
                                          {\sqrt{\sum_{i=1}^{n} x_i^2} \sqrt{\sum_{i=1}^{n} y_i^2}}

  Burrows’ Delta                     \frac{1}{n} \sum_{i=1}^{n}
                                       \left| \frac{x_i - \mu_i}{\sigma_i} - \frac{y_i - \mu_i}{\sigma_i} \right|

  Argamon’s Linear Delta             \frac{1}{n} \sum_{i=1}^{n} \sqrt{\frac{(x_i - y_i)^2}{\sigma_i^2}}

  Eder’s Delta                       \frac{1}{n} \sum_{i=1}^{n}
                                       \left( \left| \frac{x_i - y_i}{\sigma_i} \right| \cdot \frac{n - n_i + 1}{n} \right)

  Eder’s Simple Delta [9]            \sum_{i=1}^{n} \left| \sqrt{x_i} - \sqrt{y_i} \right|

  Argamon’s Quadratic Delta          \frac{1}{n} \sum_{i=1}^{n} \frac{1}{\sigma_i^2} (x_i - y_i)^2

  Bray-Curtis Dissimilarity          \frac{\sum_{i=1}^{n} |x_i - y_i|}{\sum_{i=1}^{n} (x_i + y_i)}

  Kulczynski Distance                \frac{\sum_{i=1}^{n} |x_i - y_i|}{\sum_{i=1}^{n} \min(x_i, y_i)}

  Jaccard Index                      \frac{2 \frac{\sum_{i=1}^{n} |x_i - y_i|}{\sum_{i=1}^{n} (x_i + y_i)}}
                                          {1 + \frac{\sum_{i=1}^{n} |x_i - y_i|}{\sum_{i=1}^{n} (x_i + y_i)}}

  Gower Similarity                   \frac{1}{n} \sum_{i=1}^{n} \frac{|x_i - y_i|}{\max_i - \min_i}

  Alternative Gower Similarity       \frac{1}{n_0} \cdot \sum_{i=1}^{n} |x_i - y_i|

  Horn’s modification of             \frac{2 \sum_{i=1}^{n} x_i y_i}
  Morisita’s Overlap Index                {\left( \frac{\sum_{i=1}^{n} x_i^2}{(\sum_{i=1}^{n} x_i)^2}
                                                + \frac{\sum_{i=1}^{n} y_i^2}{(\sum_{i=1}^{n} y_i)^2} \right)
                                           \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}

  Mountford Index                    \frac{1}{\alpha}, where \alpha is the parameter of Fisher’s log-series

  Binomial Index [5]                 \sum_{i=1}^{n} \frac{x_i \ln\frac{x_i}{2n} + y_i \ln\frac{y_i}{2n}
                                                          - 2n \ln\frac{1}{2}}{2n}

Table II: Where x_i and y_i are the corresponding i-th values of vectors X = (x_1, x_2, ..., x_n)
and Y = (y_1, y_2, ..., y_n), n is the size of the compared vectors, \sigma_i and \mu_i are the
standard deviation and mean of the i-th values of all vectors used in the comparison, n_i is the
rank (ordinal position) of the i-th value in a vector (usually n_i = i), \min_i and \max_i are the
minimum and maximum i-th values among all compared vectors, and n_0 is the number of pairs of
corresponding X and Y vector values in which at least one value is equal to 0.
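
A few of the formulas in Table I can be sketched in code to make the notation concrete. The
following is a minimal Python illustration, not the "stylo" or "vegdist" implementation used
in the paper; the function names are hypothetical, and the absolute values in Burrows’ Delta,
Eder’s Delta and Eder’s Simple Delta follow the formulas as reconstructed above.

```python
import math

def manhattan(x, y):
    # Sum of absolute differences between corresponding frequencies.
    return sum(abs(a - b) for a, b in zip(x, y))

def eders_simple_delta(x, y):
    # Manhattan distance applied to square-root-transformed frequencies.
    return sum(abs(math.sqrt(a) - math.sqrt(b)) for a, b in zip(x, y))

def bray_curtis(x, y):
    # Sum of absolute differences divided by the sum of all frequencies.
    return manhattan(x, y) / sum(a + b for a, b in zip(x, y))

def jaccard(x, y):
    # Jaccard Index expressed through Bray-Curtis: 2B / (1 + B).
    b = bray_curtis(x, y)
    return 2 * b / (1 + b)

def burrows_delta(x, y, mu, sigma):
    # Mean absolute difference of z-scores; mu and sigma are the per-word
    # mean and standard deviation over all compared vectors.
    n = len(x)
    return sum(abs((a - m) / s - (b - m) / s)
               for a, b, m, s in zip(x, y, mu, sigma)) / n

def eders_delta(x, y, sigma):
    # Like Burrows' Delta, but each word is weighted by (n - n_i + 1) / n,
    # where n_i = i is the frequency rank of the word (1 = most frequent).
    n = len(x)
    return sum(abs((a - b) / s) * (n - i + 1) / n
               for i, (a, b, s) in enumerate(zip(x, y, sigma), start=1)) / n

if __name__ == "__main__":
    x, y = [4.0, 1.0, 0.0], [1.0, 1.0, 4.0]
    print(manhattan(x, y), eders_simple_delta(x, y), jaccard(x, y))
```

The rank weighting in `eders_delta` reduces the influence of less frequent words, which is
the property the paper relates to highly inflected languages.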


   The selected authors were Virginija Baltraitienė (Labor Party Political Group (Darbo
partijos frakcija, DPF), who belonged to the opposition at the given time) and Donatas
Jankauskas (Homeland Union - Lithuanian Christian Democrat Political Group (Tėvynės
sąjungos-Lietuvos krikščionių demokratų frakcija, TS-LKDF)). The mean word counts per text
for the two authors are 190.39 and 203.96, respectively, which means that their texts have
to be concatenated so that one text object contains around 5000 words. This is because the
Lithuanian language has many different words with many different word forms, and in this
experiment we do not convert words to their base form. In addition, we consider these 5000
words to be sufficient for a text object, because the documents themselves contain as few
as 100 words and it was shown by M. Eder that beyond 5000 words the quality of author
attribution barely increases [13]. After this step, for further analysis, 16 text objects
were created (8 for each politician). The analysis was performed using as features the most
frequent words (MFW) in the sub-sample of the 2 selected authors, sorted in descending
order by frequency. More statistics for the corpora and selected authors can be seen in
Table III.

B. Methods
   To evaluate the measures, the Z-index was used, which is the difference between the
means of standardized values:

   z = |\mu_s - \mu_v|
     = \left| \frac{1}{m_s} \sum_{l=1}^{m_s} \frac{s_l - \mu}{\sigma}
            - \frac{1}{m_v} \sum_{k=1}^{m_v} \frac{v_k - \mu}{\sigma} \right|,          (1)

where \mu_s and \mu_v are the out-group and in-group means of standardized values, m_s and
m_v are the out-group and in-group numbers of comparisons, s_l and v_k are the out-group and
in-group distance/similarity measurements for the corresponding comparison, \mu and \sigma
are the overall mean and standard deviation


                                                             Table III: Corpora statistics

  Global Parameter                 Value

  Period                           March of 1990 – December of 2013
  Number of authors                147 (18 women and 129 men)
  Minimum word count in one text   100

  Selected Author                  Number of Texts                    Number of Words     Number of Different Words   Mean Number of Words in One Text

  Virginija Baltraitienė          418                                79584               11711                       190.39
  Donatas Jankauskas               468                                95453               12981                       203.96
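
The data preparation described in Section II-A (concatenating short transcripts into text
objects of about 5000 words and describing each object by the relative frequencies of the
most frequent words) can be sketched roughly as below. This is a simplified Python
illustration under stated assumptions, not the procedure actually used by the authors:
whitespace tokenization, lowercasing, dropping the incomplete last object, and all function
names are assumptions made for the example.

```python
from collections import Counter

def build_text_objects(texts, target_words=5000):
    # Concatenate consecutive texts until an object reaches target_words.
    # Assumption: a trailing remainder shorter than target_words is dropped.
    objects, current = [], []
    for text in texts:
        current.extend(text.lower().split())  # assumption: whitespace tokens
        if len(current) >= target_words:
            objects.append(current)
            current = []
    return objects

def most_frequent_words(objects, n_mfw):
    # MFW list over the whole sub-sample, in descending order of frequency.
    counts = Counter(word for obj in objects for word in obj)
    return [word for word, _ in counts.most_common(n_mfw)]

def relative_frequencies(obj, mfw):
    # Feature vector of one text object: relative frequency of each MFW.
    counts = Counter(obj)
    total = len(obj)
    return [counts[word] / total for word in mfw]

if __name__ == "__main__":
    texts = ["a b a", "c a b", "b b"]  # toy stand-ins for transcripts
    objects = build_text_objects(texts, target_words=3)
    mfw = most_frequent_words(objects, n_mfw=2)
    print(mfw, [relative_frequencies(obj, mfw) for obj in objects])
```

Words are kept in their inflected forms here, in line with the paper’s choice not to
convert words to their base form.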


of all measurements for comparisons, excluding comparisons in which the compared documents
are the same.
   The in-group data set contains comparisons of documents of speeches by the same
politician, while the out-group data set contains comparisons of documents of speeches by
different politicians. In this experiment we assumed that the bigger the difference between
the means of standardized values, the better the performance of a given measure.
   After inspecting the results of the experiment, Cluster Analysis and Multidimensional
Scaling were used to visualize relations among the speeches of the selected politicians,
using the distance measure with the better performance and with regard to the quantity of
MFW. In addition to the Z-index, the dependency of the results on the quantity of MFW (from
100 MFW to 5000 MFW) was analyzed as well.
   For stylometric analysis (calculating word frequencies, values of the distance measure,
and plotting the relations among documents) R, a free software environment for statistical
computing and graphics, was used [1]. The R language and its environment were chosen
because they offer all the necessary tools for textual data processing, computation,
statistical analysis and visualization, as well as good performance in general; e.g., one R
script can be executed to provide all the required results from raw textual data without
any additional software.
   All distance/similarity measure computations and stylometric analysis were performed
using the “stylo” [9] and “vegdist” [16] scripts for R. In order to have a more efficient
process, both scripts were merged together. This way all computations and analysis could be
executed with one script.

                              III. EXPERIMENTAL RESULTS

   The objective of this investigation was to identify which measures perform better with
different numbers of MFW in stylometric analysis of Lithuanian texts. All the evaluated
measures are presented in Table II. Hierarchical Clustering [6], Multidimensional Scaling
[7] and Heat Map [8] were applied with the parameters described in Table IV.

A. Experiment 1: Initial Experiments
   Initial experiments were performed with 100, 1000 and 5000 MFW. In each case all
distance measures were sorted according to the calculated Z-index. Naturally, the distance
measure with the highest Z-index should have been considered the best, but in our
experiments there was no clear consistency in the measures regarding their performance;
e.g., a measure that showed better performance, i.e., a higher Z-index, with 100 MFW would
perform worse with 1000 and 5000 MFW and vice versa. To investigate further, a second
experiment was executed.

B. Experiment 2: Analysis of Distance Measures
   To find out exactly how every distance measure behaves as the number of MFW increases, a
graph was plotted showing the Z-index of every measure against the quantity of MFW used,
see Fig. 1. All Z-indexes were calculated for 100, 200, ..., 5000 MFW. In general, every
distance measure was displayed as a function. The plot was generated with a total of 50
points for every measure, which was enough to detect their behavior as the quantity of MFW
increased.
   In Fig. 1 we can see that most distance measures possess high Z-index values throughout
the plot, but four of them performed very poorly. These four distance (or similarity)
measures are Alternative Gower Similarity, Horn’s modification of Morisita’s Overlap Index,
Cosine Distance and Euclidean Distance (see Table II for details). In order to investigate
further, we removed these distance measures from the experiment and concentrated on the
remaining ones.
   After removing the distance measures with bad performance, we generated a new graph
where the remaining measures could be compared more precisely, see Fig. 2. We can see that
up to around 1000 MFW the Z-index is always increasing in value, and after this point it
either decreases or shows similar results when the number of MFW is increased. Analyzing
the plot further, we can observe three main groups of distance (or similarity) measures:
   1) Measures which show very good performance until around 2000 MFW were reached. After
      that, the performance degraded very fast. The distance measure that performed best
      was Binomial Index. Other distance measures which behaved similarly were:
        a) Argamon’s Quadratic Delta,
        b) Burrows’ Delta,
        c) Argamon’s Linear Delta,
        d) Gower Similarity.
   2) Measures which slowly reach their performance peak only at around 2000 MFW, after
      which their performance declined very slowly. In general, these distance measures
      showed very good performance and stability. In this


Figure 1: Difference between means of standardized values of in-group and out-group distances as a function of MFW amount
used. Indicated for initial distance and similarity measures on the Lithuanian texts.


Figure 2: Difference between means of standardized values of in-group and out-group distances as a function of MFW amount
used. Indicated for selected distance and similarity measures on the Lithuanian texts.
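
The Z-index of Eq. (1), which Figs. 1 and 2 plot against the number of MFW, can be sketched
as follows. This is a minimal Python illustration rather than the authors’ R script; the
function name, the use of each unordered document pair once, and the population standard
deviation are assumptions, since the paper does not specify these details.

```python
from statistics import mean, pstdev

def z_index(distances, authors):
    # distances: symmetric matrix (list of lists) of pairwise measurements.
    # authors:   author label of each document, in matrix order.
    in_group, out_group = [], []
    n = len(authors)
    for i in range(n):
        for j in range(i + 1, n):  # i == j (same document) is excluded
            if authors[i] == authors[j]:
                in_group.append(distances[i][j])
            else:
                out_group.append(distances[i][j])
    # Overall mean and standard deviation of all measurements.
    all_values = in_group + out_group
    mu, sigma = mean(all_values), pstdev(all_values)
    # Means of standardized out-group and in-group values.
    mu_s = mean((s - mu) / sigma for s in out_group)
    mu_v = mean((v - mu) / sigma for v in in_group)
    return abs(mu_s - mu_v)

if __name__ == "__main__":
    authors = ["A", "A", "B", "B"]
    distances = [
        [0, 1, 3, 3],
        [1, 0, 3, 3],
        [3, 3, 0, 1],
        [3, 3, 1, 0],
    ]
    print(z_index(distances, authors))
```

Repeating this computation for 100, 200, ..., 5000 MFW with each measure would give the
50 points per curve described in Experiment 2.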


                                      Table IV: Settings used for stylometric analysis

                Parameter                       Value

                Corpus format                   Plain text
                Corpus language                 English (ALL)
                Analyzed features               Words
                N-gram size                     1
                Most Frequent Words (MFW)       Amount that showed better results
                Start at freq. rank             1
                Culling min                     0
                Culling max                     0
                Analysis types                  Cluster Analysis, Multidimensional Scaling (MDS), Heat map
                Distance/Similarity measure     A measure that showed better results
                Sampling                        No sampling


Figure 3: Heat map generated using the measurement matrix obtained with Eder’s Simple Delta
and 2000 Most Frequent Words. In the upper left corner, the color palette used to visualize
the data can be seen. Furthermore, a histogram is drawn on top to show the distribution of
matrix values.

      group, Eder’s Simple Delta had the best results. Other measures which behaved
      similarly were:
        a) Eder’s Delta,
        b) Mountford Index,
        c) Canberra Distance.
   3) Measures which had a continuously improving performance together with the increasing
      MFW quantity, though this improvement was extremely gradual. In comparison to the
      other groups, this group performed worse than the first and the second groups until
      3000 MFW were reached; after that it still performed worse than the second group, but
      it showed better performance than the first group. The distance measure that
      performed best in this group was Jaccard Index. Other measures which behaved
      similarly were:
        a) Manhattan Distance,
        b) Bray-Curtis Dissimilarity,
        c) Kulczynski Distance.
   We noticed that Eder’s Simple Delta measure performed very well: compared with the first
group, it was only around 2-3% worse than the Binomial Index, and within the second group
it showed the highest Z-index, 1-3% better than Eder’s Delta. So it could be used not only
for bigger quantities of MFW, but for smaller ones as well. For this reason, in this
experiment we considered Eder’s Simple Delta to be the best performing distance (or
similarity) measure when analyzing Lithuanian texts in general.

C. Experiment 3: Exemplary Stylometric Analysis
   We used Eder’s Simple Delta measure for an exemplary stylometric analysis in order to
visualize the difference in lexical style (usage of words) between the selected members of
the Lithuanian Parliament. For this analysis 2000 MFW were used, as this showed the best
performance for the chosen distance measure. The first method used to map relations among
the transcribed speeches made by the aforementioned politicians was Hierarchical Cluster
Analysis (see Fig. 4b). In this research an agglomerative hierarchical clustering with Ward
linkage, where the criterion for choosing the pair of clusters to merge at each step is
based on the optimal value of an objective function [14], was used. As a result, clusters’


Figure 4: (a) Multidimensional Scaling method. (b) Hierarchical Cluster Analysis method.
Eder’s Simple Delta measure and 2000 Most Frequent Words were used. Different colors
represent different text authors.


hierarchy was generated and visualized as a dendrogram, in which the
individual documents on the right are successively linked into clusters
according to their similarity until all documents are merged into one
cluster. The results showed clear differentiation between the authors,
which also contributed to our conclusion that Eder’s Simple Delta
performed well.
   In Fig. 4a, visualization and relation mapping among documents were
performed using Multidimensional Scaling. The colors represent different
authors of the parliamentary speeches and correspond to those in Fig. 4b.
The speeches of different politicians were separated well, although in
this case the points were spread out along the vertical axis. This
behavior might lead to different results if many more authors were
analyzed at once. In general, the results were good, which also
contributed to our conclusion that the distance measure we used had an
overall good performance.
   In addition to the two methods already used for relation/position
mapping of the documents and for visualization, we tried to display the
results with a Heat Map. Fig. 3 shows one of the simple ways to visualize
stylistic differences among the documents without any further
computations. In this example the color palette used for the plot was a
gradient from white to dark red, where white means a complete match.
Since the documents were sorted by author on the horizontal and vertical
axes, rectangular blocks of lighter and darker colors formed. A light
color indicated that the documents were probably written (or, in our
case, the speeches spoken) by the same author, and a dark color that they
were by different authors. Since the brightness of the color matched
every pair of documents well enough, the Heat Map could be considered to
show positive results with regard to the Eder’s Simple Delta measure.
   To summarize, none of the methods for visualization and
relation/position mapping that we used for stylometric analysis
contradicted our statement that Eder’s Simple Delta performed well in the
analysis of Lithuanian texts. However, this is by no means the only
measure that produces very good results. Other measures could also
provide similar performance (as seen in Fig. 2). A different corpus,
different text analysis parameters, or a different selection of Most
Frequent Words could still affect the performance of any distance (or
similarity) measure. Further research is therefore needed to reach solid
conclusions.

                             IV. CONCLUSIONS

   In this experiment we focused on Lithuanian texts, with a corpus
composed of parliamentary speeches from the Lithuanian Parliament. Hence,
the following recommendations apply to experiments with this corpus.
  1) The distance measure can be selected according to the number of
     Most Frequent Words (MFW) used in the analysis.
  2) At least 1000 Most Frequent Words should be used. After this point
     the Z-index value either decreases or shows similar results.
  3) If the number of MFW does not exceed 5000 by a wide margin, Eder’s
     Simple Delta measure performs well.
  4) If the Most Frequent Words are limited to 2000, the Binomial Index
     performs better than Eder’s Simple Delta and is thus more suitable
     in this case.
   Finally, these recommendations can be applied to stylometric analysis
of general Lithuanian texts, but with caution; we therefore plan to
experiment with texts belonging to different domains in the future.
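
   The workflow discussed above (choosing a number of Most Frequent Words
and then computing pairwise distances between documents) can be
illustrated with a minimal sketch. It is not the implementation used in
the experiments, which relied on R; it is a dependency-free Python
illustration. It assumes that Eder’s Simple Delta is the Manhattan
distance between square-root-transformed relative word frequencies, as we
understand the definition used in the stylo package [9]. The function
names, the toy documents and the value of n_mfw are hypothetical and
chosen only for illustration.

```python
from collections import Counter
from math import sqrt


def most_frequent_words(documents, n_mfw):
    """Return the n_mfw most frequent words over the whole corpus."""
    corpus_counts = Counter()
    for tokens in documents.values():
        corpus_counts.update(tokens)
    return [word for word, _ in corpus_counts.most_common(n_mfw)]


def relative_frequencies(tokens, vocabulary):
    """Relative frequency of each vocabulary word in one document."""
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[word] / total for word in vocabulary]


def simple_delta(freqs_a, freqs_b):
    """Assumed form of Eder's Simple Delta: Manhattan distance
    between square-root-transformed relative frequencies."""
    return sum(abs(sqrt(a) - sqrt(b)) for a, b in zip(freqs_a, freqs_b))


def distance_matrix(documents, n_mfw):
    """Pairwise distances between all documents, keyed by name pairs."""
    vocabulary = most_frequent_words(documents, n_mfw)
    profiles = {
        name: relative_frequencies(tokens, vocabulary)
        for name, tokens in documents.items()
    }
    return {
        (a, b): simple_delta(profiles[a], profiles[b])
        for a in documents
        for b in documents
    }


# Toy, invented documents (already tokenized); real speeches are far longer
# and the paper recommends at least 1000 MFW rather than 5.
documents = {
    "speaker_A_1": "ir tai yra labai svarbu ir tai yra gerai".split(),
    "speaker_A_2": "ir tai yra svarbu ir tai yra labai gerai".split(),
    "speaker_B_1": "bet mes negalime to daryti nes mes nesutinkame".split(),
}
distances = distance_matrix(documents, n_mfw=5)

for (a, b), value in sorted(distances.items()):
    print(f"{a:12s} {b:12s} {value:.3f}")
```

   The resulting distance matrix is the kind of input that hierarchical
clustering, Multidimensional Scaling or a Heat Map would then visualize;
those steps are omitted here to keep the sketch short.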


                              REFERENCES
[1] R Development Core Team, R: A Language and Environment for Statis-
    tical Computing, R Foundation for Statistical Computing, 2008.
[2] F. Jannidis, S. Pielström, C. Schöch and T. Vitt, Improving Burrows’
    Delta – An empirical evaluation of text distance measures, Digital
    Humanities 2015, 2015.
[3] S. Evert, T. Proisl, C. Schöch, F. Jannidis, S. Pielström and T.
    Vitt, Explaining Delta, or: How do distance measures for authorship
    attribution work?, 2015.
[4] H. S. Horn, Measurement of “overlap” in comparative ecological studies,
    1966.
[5] M. J. Anderson and R. B. Millar, Spatial variation and effects of habitat
    on temperate reef fish assemblages in northeastern New Zealand, 2004.
[6] L. Rokach and O. Maimon, Clustering methods, 2005.
[7] I. Borg and P. Groenen, Modern Multidimensional Scaling: theory and
    applications, 2005.
[8] Heat map (heatmap),
    http://searchbusinessanalytics.techtarget.com/definition/heat-map,
    accessed: 2017-03-12.
[9] M. Eder, J. Rybicki and M. Kestemont, Stylometry with R: a package
    for computational text analysis, The R Journal, 2016.
[10] J. Mandravickaitė and T. Krilavičius, Language usage of members of
    the Lithuanian Parliament considering their political orientation, Deeds
    and Days, 2015.
[11] T. Krilavičius and V. Morkevičius, Mining Social Science Data: a
    Study of Voting of the members of the Seimas of Lithuania by Using Mul-
    tidimensional Scaling and Homogeneity Analysis, Intellectual Economics,
    2011.
[12] S. Argamon, Interpreting Burrows’s Delta: Geometric and Probabilistic
    Foundations, 2007.
[13] M. Eder, Does size matter? Authorship attribution, small samples, big
    problem, Digital Humanities 2010, 2010.
[14] J. H. Ward Jr., Hierarchical Grouping to Optimize an Objective
    Function, Journal of the American Statistical Association, 1963.
[15] J. Kapočiūtė-Dzikienė, A. Utka and L. Šarkutė, Feature Exploration for
    Authorship Attribution of Lithuanian Parliamentary Speeches, 2014.
[16] J. Oksanen, F. G. Blanchet, R. Kindt, P. Legendre, P. R. Minchin,
    R. B. O’Hara, G. L. Simpson, P. Solymos, M. Henry, H. Stevens and
    H. Wagner, vegan: Community Ecology Package, 2016.



</pre>