=Paper=
{{Paper
|id=Vol-1852/p01
|storemode=property
|title=Comparison of distance and similarity measures
for stylometric analysis of lithuanian texts
|pdfUrl=https://ceur-ws.org/Vol-1852/p01.pdf
|volume=Vol-1852
|authors=Daumantas Stanikūnas,Justina Mandravickaitė,Tomas Krilavičius
}}
==Comparison of distance and similarity measures for stylometric analysis of Lithuanian texts==
Comparison of distance and similarity measures for
stylometric analysis of Lithuanian texts
Daumantas Stanikūnas (1), Justina Mandravickaitė (2), Tomas Krilavičius (3)
(1) Department of Mathematics and Statistics, Vytautas Magnus University, Lithuania. Email: daumantas.stanikunas@fcis.vdu.lt
(2) Baltic Institute of Advanced Technology, Vilnius University, Lithuania. Email: justina@bpti.lt
(3) Baltic Institute of Advanced Technology, Vytautas Magnus University, Lithuania. Email: t.krilavicius@bpti.lt
Abstract: Constant developments in information and computer technologies make it possible to handle constantly increasing amounts of data, thereby expanding the research possibilities. In this article, we discuss and compare distance and similarity measures used in stylometric analysis which could be applied to analyze Lithuanian texts. As the corpus for the analysis, transcripts of parliamentary debates by two politicians of the Lithuanian Parliament were chosen. Furthermore, comparison of distance measures, stylometric analysis and visualization were performed. The objective of the experiment was to identify which measures perform better in stylometric analysis of Lithuanian texts and to explore where these differences in performance occur. Summarizing the experiment results, the recommendations are as follows: the number of Most Frequent Words used should be at least 1000; Eder's Simple Delta measure can be used in general stylometric analysis of transcriptions of parliamentary debates of the Lithuanian Parliament; and when Most Frequent Words are limited to 2000, Binomial Index performs better than Eder's Simple Delta and is thus more suitable.
Index Terms: stylometry; computational stylistics; parliamentary speech; R; statistical analysis; distance measure; similarity measure; data visualization

I. INTRODUCTION
Stylometry refers to the study of linguistic style, usually of written language. It uses a variety of statistical methods to analyze a text in order to determine its author. A common technique is to calculate distances or similarities between texts and process the output using different visualization methods. A significant number of experiments have already been performed by various researchers in order to figure out which measures show better results in different cases of stylometric analysis. F. Jannidis and S. Evert analyzed three collections of novels in English, French and German and showed that the Cosine Delta measure outperformed all other measures on those three collections [2], [3], while J. Mandravickaitė explained in her research that measures such as Burrows' Delta would not work well with highly inflected languages (Latin, Polish) and suggested using Eder's Delta measure [10].
This paper presents on-going experimental work on identifying the most suitable measures used in stylometry when analyzing Lithuanian texts. The objective of these experiments is to compare the performance of distance and similarity measures already used in stylometry and other fields of research [4], [5], [11], using the R language, with the focus on the transcripts of speeches of politicians in the Lithuanian Parliament.
These experiments cover the domain of transcriptions of parliamentary debates of the Lithuanian Parliament, which is only a small fraction of the Lithuanian language. However, these transcriptions represent the richness and variety of the language quite well, and hence we expect that the results could be useful in the analysis of other Lithuanian texts.

II. DATA AND METHODS
A. Data Preparation
Data for the analysis are taken from corpora collected in the ASTRA project. The data consist of transcriptions of parliamentary debates of the Lithuanian Parliament. For this investigation we use transcripts of two politicians from the term 2008-2012. The criteria for selecting the data for comparison of measures were chosen in such a way that the measures would provide the biggest difference between texts of different authors (or speakers, in our case). Considering this, authors of different gender and political standing (position and opposition) were chosen for the experiments to strengthen the dissimilarity. This approach differs from similar research by J. Kapočiūtė-Dzikienė, whose goal was to identify the best methods and features (MFW were not used) for authorship attribution based on machine learning [15]. Her approach was to analyze data sets that are as similar as possible while their authors are actually different.
Automatic extraction of style applied to individual authors and groups of authors, http://dangus.vdu.lt/~jkd/?page_id=2
Copyright © 2017 held by the authors.
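Table I: Distance/similarity measures and their formulas
Distance/Similarity measure: Formula
Manhattan Distance: $\sum_{i=1}^{n} |x_i - y_i|$
Euclidean Distance: $\sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
Canberra Distance: $\sum_{i=1}^{n} \frac{|x_i - y_i|}{|x_i| + |y_i|}$
Cosine Distance: $\frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$
Burrows' Delta: $\frac{1}{n} \sum_{i=1}^{n} \left| \frac{x_i - \mu_i}{\sigma_i} - \frac{y_i - \mu_i}{\sigma_i} \right|$
Argamon's Linear Delta: $\sqrt{\frac{1}{n} \sum_{i=1}^{n} \frac{(x_i - y_i)^2}{\sigma_i^2}}$
Eder's Delta: $\frac{1}{n} \sum_{i=1}^{n} \left| \frac{x_i - y_i}{\sigma_i} \right| \cdot \frac{n - n_i + 1}{n}$
Eder's Simple Delta [9]: $\sum_{i=1}^{n} \left| \sqrt{x_i} - \sqrt{y_i} \right|$
Argamon's Quadratic Delta: $\frac{1}{n} \sum_{i=1}^{n} \frac{1}{\sigma_i^2} (x_i - y_i)^2$
Bray-Curtis Dissimilarity: $\frac{\sum_{i=1}^{n} |x_i - y_i|}{\sum_{i=1}^{n} (x_i + y_i)}$
Kulczynski Distance: $\frac{\sum_{i=1}^{n} |x_i - y_i|}{\sum_{i=1}^{n} \min(x_i, y_i)}$
Jaccard Index: $\frac{2 \cdot \frac{\sum_{i=1}^{n} |x_i - y_i|}{\sum_{i=1}^{n} (x_i + y_i)}}{1 + \frac{\sum_{i=1}^{n} |x_i - y_i|}{\sum_{i=1}^{n} (x_i + y_i)}}$
Gower Similarity: $\frac{1}{n} \sum_{i=1}^{n} \frac{|x_i - y_i|}{\max_i - \min_i}$
Alternative Gower Similarity: $\frac{1}{n_0} \cdot \sum_{i=1}^{n} |x_i - y_i|$
Horn's modification of Morisita's Overlap Index: $\frac{2 \sum_{i=1}^{n} x_i y_i}{\left( \frac{\sum_{i=1}^{n} x_i^2}{\left(\sum_{i=1}^{n} x_i\right)^2} + \frac{\sum_{i=1}^{n} y_i^2}{\left(\sum_{i=1}^{n} y_i\right)^2} \right) \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}$
Mountford Index: $\frac{1}{\alpha}$, where $\alpha$ is the parameter of Fisher's log-series
Binomial Index [5]: $\sum_{i=1}^{n} \frac{x_i \ln\frac{x_i}{2n} + y_i \ln\frac{y_i}{2n} - 2n \ln\frac{1}{2}}{2n}$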
Table II: Here $x_i$ and $y_i$ are the corresponding $i$-th values of vectors $X = (x_1, x_2, \ldots, x_n)$ and $Y = (y_1, y_2, \ldots, y_n)$, $n$ is the size of the compared vectors, $\sigma_i$ and $\mu_i$ are the standard deviation and mean of the $i$-th values of all vectors used in the comparison, $n_i$ is the position (rank) of the $i$-th value in a vector (usually $n_i = i$), $\min_i$ and $\max_i$ are the minimum and maximum $i$-th values among all compared vectors, and $n_0$ is the number of pairs of corresponding X and Y values in which at least one value is equal to 0.
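A few of the measures listed above can be sketched in code to make the notation concrete. This is a minimal illustration in Python, not the R implementation used in the experiments. The function names and the toy frequency vectors are hypothetical, and the use of the sample standard deviation for $\sigma_i$ is an assumption, since the paper does not state which estimator is used.

```python
import math
from statistics import mean, stdev


def manhattan(x, y):
    """Manhattan Distance: sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def eder_simple_delta(x, y):
    """Eder's Simple Delta: Manhattan distance on square-rooted values."""
    return sum(abs(math.sqrt(a) - math.sqrt(b)) for a, b in zip(x, y))


def bray_curtis(x, y):
    """Bray-Curtis Dissimilarity: sum |x_i - y_i| / sum (x_i + y_i)."""
    return manhattan(x, y) / sum(a + b for a, b in zip(x, y))


def jaccard(x, y):
    """Jaccard Index as written in the table: 2B / (1 + B), B = Bray-Curtis."""
    b = bray_curtis(x, y)
    return 2 * b / (1 + b)


def burrows_delta(x, y, all_vectors):
    """Burrows' Delta: mean absolute difference of z-scores.

    mu_i and sigma_i are estimated from every vector in `all_vectors`.
    Assumption: sample standard deviation, and no feature is constant
    (sigma_i > 0), otherwise the division below fails.
    """
    n = len(x)
    total = 0.0
    for i in range(n):
        column = [v[i] for v in all_vectors]
        mu, sigma = mean(column), stdev(column)
        total += abs((x[i] - mu) / sigma - (y[i] - mu) / sigma)
    return total / n


# Toy frequency vectors, made up for illustration only.
corpus = {
    "text_a": [9.0, 4.0, 1.0],
    "text_b": [4.0, 9.0, 1.0],
    "text_c": [1.0, 1.0, 4.0],
}
a, b = corpus["text_a"], corpus["text_b"]
vectors = list(corpus.values())

print("Manhattan:", manhattan(a, b))
print("Eder's Simple Delta:", eder_simple_delta(a, b))
print("Bray-Curtis:", bray_curtis(a, b))
print("Jaccard:", jaccard(a, b))
print("Burrows' Delta:", burrows_delta(a, b, vectors))
```

Note that Burrows' Delta needs the whole set of compared vectors, because $\mu_i$ and $\sigma_i$ are computed over all texts, while the other four measures depend only on the two vectors being compared.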
The selected authors were Virginija Baltraitienė (Labor Party Political Group (Darbo partijos frakcija, DPF), who belonged to the opposition at the given time) and Donatas Jankauskas (Homeland Union - Lithuanian Christian Democrat Political Group (Tėvynės sąjungos-Lietuvos krikščionių demokratų frakcija, TS-LKDF)). The mean word counts per text for the two authors are 190.39 and 203.96, respectively, which means that their texts have to be concatenated so that one text object contains around 5000 words. This is because the Lithuanian language has many different words with many different word forms, and in this experiment we do not convert words to their base form. In addition, we consider these 5000 words to be sufficient for a text object, because the documents themselves contain as few as 100 words and it was shown by M. Eder that beyond 5000 words the quality of author attribution barely increases [13]. After this step, 16 text objects were created for further analysis (8 for each politician). The analysis was performed using as features the most frequent words (MFW) in the sub-sample of the 2 selected authors, sorted in descending order by frequency. More statistics for the corpora and selected authors can be seen in Table III.

B. Methods
To evaluate the measures, the Z-index was used, which is the difference between the means of standardized values:

$$z = |\mu_s - \mu_v| = \left| \frac{1}{m_s} \sum_{l=1}^{m_s} \frac{s_l - \mu}{\sigma} - \frac{1}{m_v} \sum_{k=1}^{m_v} \frac{v_k - \mu}{\sigma} \right|, \quad (1)$$

where $\mu_s$ and $\mu_v$ are the out-group and in-group means of standardized values, $m_s$ and $m_v$ are the out-group and in-group numbers of comparisons, $s_l$ and $v_k$ are the out-group and in-group distance/similarity measurements for the corresponding comparison, and $\mu$ and $\sigma$ are the overall mean and standard deviation
Table III: Corpora statistics
Global Parameter: Value
Period: March 1990 to December 2013
Number of authors: 147 (18 women and 129 men)
Minimum word count in one text: 100
Selected Author | Number of Texts | Number of Words | Number of Different Words | Mean Number of Words in One Text
Virginija Baltraitienė | 418 | 79584 | 11711 | 190.39
Donatas Jankauskas | 468 | 95453 | 12981 | 203.96
of all measurements for comparisons, excluding comparisons in which a document is compared with itself.
The in-group data set contains comparisons of documents of speeches by the same politician, while the out-group data set contains comparisons of documents of speeches by different politicians. In this experiment we assumed that the bigger the difference between the means of standardized values, the better the performance of a given measure.
After inspecting the results of the experiment, Cluster Analysis and Multidimensional Scaling were used to visualize relations among speeches of the selected politicians, using a distance measure with better performance and taking the quantity of MFW into account. In addition to the Z-index, the dependency of the results on the quantity of MFW (from 100 MFW to 5000 MFW) was analyzed as well.
For stylometric analysis (calculating word frequencies, values of the distance measure, and plotting the relations among documents) R, a free software environment for statistical computing and graphics, was used [1]. The R language and its environment were chosen because they provide all the necessary tools for textual data processing, computations, statistical analysis and visualization, as well as good performance in general; e.g., one R script can be executed to provide all the required results from raw textual data without any additional software.
All distance/similarity measure computations and stylometric analysis were performed using the "stylo" package [9] and the "vegdist" function from the "vegan" package [16] for R. To make the process more efficient, the two were merged together, so that all computations and analysis could be executed with one script.

III. EXPERIMENTAL RESULTS
The objective of this investigation was to identify which measures perform better with different numbers of MFW in stylometric analysis of Lithuanian texts. All the evaluated measures are presented in Table II. Hierarchical Clustering [6], Multidimensional Scaling [7] and Heat Map [8] were applied with the parameters described in Table IV.

A. Experiment 1: Initial Experiments
Initial experiments were performed with 100, 1000 and 5000 MFW. In each case all distance measures were sorted according to the calculated Z-index. Naturally, the distance measure with the highest Z-index should have been considered the best, but in our experiments there was no clear consistency in the performance of the measures; e.g., a measure that showed better performance, i.e., a higher Z-index, with 100 MFW would perform worse with 1000 and 5000 MFW, and vice versa. To investigate further, a second experiment was executed.

B. Experiment 2: Analysis of Distance Measures
To find out exactly how every distance measure behaves when the number of MFW increases, a graph was plotted showing the Z-index of every measure against the quantity of MFW used, see Fig. 1. All Z-indexes were calculated for 100, 200, ..., 5000 MFW, so every distance measure was displayed as a function of the MFW quantity. The plot contains a total of 50 points for every measure, which was enough to detect their behavior as the quantity of MFW increased.
In Fig. 1 we can see that most distance measures have high Z-index values throughout the plot, but four of them performed very poorly. These four distance (or similarity) measures are Alternative Gower Similarity, Horn's modification of Morisita's Overlap Index, Cosine Distance and Euclidean Distance (see Table II for details). In order to investigate further, we removed these distance measures from the experiment and concentrated on the remaining ones.
After removing the distance measures with bad performance, we generated a new graph where the remaining measures could be compared more precisely, see Fig. 2. We can see that up to around 1000 MFW the Z-index is always increasing, and after this point it either decreases or shows similar results when the MFW number is increased. Analyzing the plot further, we can observe three main groups of distance (or similarity) measures:
1) Measures which showed very good performance until around 2000 MFW were reached. After that, the performance degraded very fast. The distance measure that performed best was Binomial Index. Other distance measures which behaved similarly were:
a) Argamon's Quadratic Delta,
b) Burrows' Delta,
c) Argamon's Linear Delta,
d) Gower Similarity.
2) Measures which slowly reached their performance peak only at around 2000 MFW, after which their performance very slowly declined. In general, these distance measures showed very good performance and stability. In this
Figure 1: Difference between means of standardized values of in-group and out-group distances as a function of MFW amount
used. Indicated for initial distance and similarity measures on the Lithuanian texts.
Figure 2: Difference between means of standardized values of in-group and out-group distances as a function of MFW amount
used. Indicated for selected distance and similarity measures on the Lithuanian texts.
Table IV: Settings used for stylometric analysis
Parameter: Value
Corpus format: Plain text
Corpus language: English (ALL)
Analyzed features: Words
N-gram size: 1
Most Frequent Words (MFW): Amount that showed better results
Start at freq. rank: 1
Culling min: 0
Culling max: 0
Analysis types: Cluster Analysis, Multidimensional Scaling (MDS), Heat map
Distance/Similarity measure: A measure that showed better results
Sampling: No sampling
Figure 3: Heat map generated using the measurement matrix obtained with Eder's Simple Delta and 2000 Most Frequent Words. In the upper left corner, the color palette used to visualize the data can be seen. Furthermore, a histogram is drawn on top to show the distribution of matrix values.
group, Eder's Simple Delta had the best results. Other measures which behaved similarly were:
a) Eder's Delta,
b) Mountford Index,
c) Canberra Distance.
3) Measures whose performance improved continuously with increasing MFW quantity, though this improvement was extremely gradual. In comparison to the other groups, this group performed worse than the first and the second groups until 3000 MFW were reached; after that it still performed worse than the second group, but better than the first group. The distance measure that performed best in this group was Jaccard Index. Other measures which behaved similarly were:
a) Manhattan Distance,
b) Bray-Curtis Dissimilarity,
c) Kulczynski Distance.
We noticed that Eder's Simple Delta measure performed really well: compared with the first group, it was only around 2-3% worse than the Binomial Index, and within the second group it showed the highest Z-index, 1-3% better than Eder's Delta. So it could be used not only for bigger quantities of MFW, but for smaller ones as well. For this reason, in this experiment we considered Eder's Simple Delta the best performing distance (or similarity) measure when analyzing Lithuanian texts in general.

C. Experiment 3: Exemplary Stylometric Analysis
We used Eder's Simple Delta measure for exemplary stylometric analysis in order to visualize the difference in lexical style (usage of words) between the selected members of the Lithuanian Parliament. For this analysis 2000 MFW were used, as this quantity showed the best performance for the chosen distance measure. The first method used to map relations among the transcribed speeches made by the aforementioned politicians was Hierarchical Cluster Analysis (see Fig. 4b). In this research, agglomerative hierarchical clustering with Ward linkage was used, where the criterion for choosing the pair of clusters to merge at each step is based on the optimal value of an objective function [14]. As a result, clusters'
(a) Multidimensional Scaling method.
(b) Hierarchical Cluster Analysis method.
Figure 4: Eder's Simple Delta measure and 2000 Most Frequent Words were used. Different colors represent different text authors.
hierarchy was generated and visualized as a dendrogram, where on the right side we have separate documents which are linked into clusters according to their similarity until all documents are merged into one cluster. The results showed clear differentiation between the authors, which also contributed to our conclusion that Eder's Simple Delta performed well.
In Fig. 4a visualization and relation mapping among documents were performed using Multidimensional Scaling. The colors represent different authors of the parliamentary speeches, corresponding to the previous figure (Fig. 4b). The speeches made by different politicians were separated well, but in this case the distribution of points was stretched along the vertical axis. This behavior might lead to different results if many more authors were analyzed at once. In general, the results were good, which also contributed to our conclusion that the distance measure we used had an overall good performance.
In addition to the two methods already used for relation/position mapping of the documents as well as visualization, we tried to display the results with a Heat Map. Fig. 3 shows one of the simple ways to visualize stylistic differences among the documents without any further computations. In this example the color palette used for the plot was a gradient from white to dark red, where white means a complete match. Since documents were sorted by author along the horizontal and vertical axes, rectangular shapes formed out of lighter and darker colors. A light color showed that the documents were probably written (or the speeches spoken, in our case) by the same author, and a dark color that they were by different authors. Since the brightness of the color matched every pair of documents well enough, the Heat Map could be considered to show positive results in regard to the Eder's Simple Delta measure.
To summarize, none of the methods for visualization and relation/position mapping that we used contradicted our statement that Eder's Simple Delta performed well in the analysis of Lithuanian texts. But this is by no means the only measure that produces very good results. Other measures could also provide similar performance (as was seen in Fig. 2). A different corpus, different parameters for text analysis, or a different selection of Most Frequent Words could still affect the performance of every distance (or similarity) measure. To reach a solid conclusion, further research is needed.

IV. CONCLUSIONS
In this experiment we focused on Lithuanian texts, with a corpus composed of parliamentary speeches from the Lithuanian Parliament. Hence, the following recommendations apply to experimentation with this corpus.
1) The distance measure can be selected according to the quantity of the Most Frequent Words used in the analysis.
2) At least 1000 Most Frequent Words should be used. After this point the Z-index value either decreases or shows similar results.
3) If the quantity of MFW does not exceed 5000 by a wide margin, Eder's Simple Delta measure performs well.
4) If Most Frequent Words are limited to 2000, Binomial Index shows an increase in performance over Eder's Simple Delta and thus it is more suitable in this case.
Finally, these recommendations can be applied to stylometric analysis of generic Lithuanian texts, but precautions must be taken, and therefore we plan to experiment with texts belonging to different domains in the future.
REFERENCES
[1] R Development Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, 2008.
[2] F. Jannidis, S. Pielström, C. Schöch and T. Vitt, Improving Burrows' Delta – An empirical evaluation of text distance measures, Digital Humanities 2015, 2015.
[3] S. Evert, T. Proisl, C. Schöch, F. Jannidis, S. Pielström and T. Vitt, Explaining Delta, or: How do distance measures for authorship attribution work?, 2015.
[4] H. S. Horn, Measurement of "overlap" in comparative ecological studies, 1966.
[5] M. J. Anderson and R. B. Millar, Spatial variation and effects of habitat on temperate reef fish assemblages in northeastern New Zealand, 2004.
[6] L. Rokach and O. Maimon, Clustering methods, 2005.
[7] I. Borg and P. Groenen, Modern Multidimensional Scaling: theory and applications, 2005.
[8] Heat map (heatmap), http://searchbusinessanalytics.techtarget.com/definition/heat-map, accessed: 2017-03-12.
[9] M. Eder, J. Rybicki and M. Kestemont, Stylometry with R: a package for computational text analysis, R Journal, 2016.
[10] J. Mandravickaitė and T. Krilavičius, Language usage of members of the Lithuanian Parliament considering their political orientation, Deeds and Days, 2015.
[11] T. Krilavičius and V. Morkevičius, Mining Social Science Data: a Study of Voting of the members of the Seimas of Lithuania by Using Multidimensional Scaling and Homogeneity Analysis, Intellectual Economics, 2011.
[12] S. Argamon, Interpreting Burrows's Delta: Geometric and Probabilistic Foundations, 2007.
[13] M. Eder, Does size matter? Authorship attribution, small samples, big problem, Digital Humanities 2010, 2010.
[14] J. H. Ward Jr., Hierarchical Grouping to Optimize an Objective Function, Journal of the American Statistical Association, 1963.
[15] J. Kapočiūtė-Dzikienė, A. Utka and L. Šarkutė, Feature Exploration for Authorship Attribution of Lithuanian Parliamentary Speeches, 2014.
[16] J. Oksanen, F. G. Blanchet, R. Kindt, P. Legendre, P. R. Minchin, R. B. O'Hara, G. L. Simpson, P. Solymos, M. Henry, H. Stevens and H. Wagner, vegan: Community Ecology Package, 2016.