=Paper=
{{Paper
|id=Vol-2470/p30
|storemode=property
|title=Lithuanian news clustering using document embeddings
|pdfUrl=https://ceur-ws.org/Vol-2470/p30.pdf
|volume=Vol-2470
|authors=Lukas Stankevičius,Mantas Lukoševičius
|dblpUrl=https://dblp.org/rec/conf/ivus/StankeviciusL19
}}
==Lithuanian news clustering using document embeddings==
Lukas Stankevičius and Mantas Lukoševičius
Faculty of Informatics, Kaunas University of Technology, Kaunas, Lithuania
lukas.stankevicius@ktu.edu, mantas.lukosevicius@ktu.lt

© 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Abstract—A lot of natural language processing research is done and applied on English texts, but relatively little is tried on less popular languages. In this article document embeddings are compared with traditional bag of words methods for Lithuanian news clustering. The results show that, given enough documents, the embeddings greatly outperform simple bag of words representations. In addition, optimal lemmatization, embedding vector size, and number of training epochs were investigated.

Keywords—document clustering; document embedding; lemmatization; Lithuanian news articles.

I. INTRODUCTION

Knowledge and information are an inseparable part of our civilization. For thousands of years, information from news of incoming troops to ordinary know-how could mean the difference between life and death. Knowledge accumulation throughout the centuries led to astonishing improvements in our way of life. Nowadays hardly anyone could last even a day without news or other kinds of information.

Despite the information scarcity of centuries ago, today we have the opposite situation. Demand and technology have greatly increased the amount of information we can acquire, and now one's goal is not to get lost in it. As an example, the most popular Lithuanian news website publishes approximately 80 news articles each day. Add other news websites, not only from Lithuania but from the entire world, and one would end up overwhelmed trying to read most of this information.

The field of text data mining emerged to tackle this kind of problem. It goes "beyond information access to further help users analyze and digest information and facilitate decision making" [1]. Text data mining offers several solutions to better characterize text documents: summarization, classification and clustering [1]. However, when evaluated by people, the best summarization results are currently given only 2-4 points out of 5 [2]. Today the best classification accuracies are 50-94% [3], and clustering reaches an F1 score of about 0.4 [4]. Although the achieved classification results are more accurate, clustering is perceived as more promising, since it is universal and can handle unknown categories, as is the case for diverse news data.

After it was shown that artificial neural networks can be successfully trained and used to reduce dimensionality [5], many new successful data mining models emerged. The aim of this work is to test how one of such models – document to vector (Doc2Vec) – can improve clustering of Lithuanian news.

II. RELATED WORK ON LITHUANIAN LANGUAGE

Articles on Lithuanian document clustering suggest using K-means [4], spherical K-means [6] or Expectation-Maximization (EM) [7] algorithms. It was also observed that K-means is fast and suitable for large corpora [7] and outperforms other popular algorithms [4].

[6] considers Term Frequency / Inverse Document Frequency (TF-IDF) the best weighting scheme. [4] adds that it must be used together with stemming, while [6] advocates minimum and maximum document frequency filtering before applying TF-IDF. These works show that TF-IDF is a significant weighting scheme and that it could optionally be combined with some additional preprocessing steps.

We have not found any research on document embeddings for the Lithuanian language. However, there is some work on word embeddings. In [8] word embeddings produced by different models and training algorithms were compared after training on a 234 million token corpus. It was found that the Continuous Bag of Words (CBOW) architecture significantly outperformed the skip-gram method, while vector dimensionality showed no significant impact on the results. This implies that document embeddings, like word embeddings, should follow the same CBOW architectural pattern. Other work [9] compared traditional and deep learning (using word embeddings) approaches for sentiment analysis and found that deep learning demonstrated good results only when applied on the small datasets; otherwise traditional methods were better. As embeddings may underperform in sentiment analysis, we test whether this is also the case for news clustering.

III. TEXT CLUSTERING PROCESS

To improve clustering quality, some text preprocessing must be done. Every text analytics process consists "of three consecutive phases: Text Preprocessing, Text Representation and Knowledge Discovery" [1] (the last being clustering in our case).

A. Text preprocessing

The purpose of text preprocessing is to make the data more concise and to facilitate text representation. It mainly involves tokenizing text into features and dropping the ones considered less important. Extracted features can be words, characters, or any n-grams (contiguous sequences of n items from a given sample of text) of both. Tokens can also be accompanied by structural or placement aspects of the document [10].
The most and least frequent items are considered uninformative and dropped. Tokens found in every document are not descriptive; they usually include stop words such as "and" or "to". On the other hand, words that are too rare cannot be attributed to any characteristic and, due to the resulting sparse vectors, only complicate the whole process. Existing text features can be further concentrated by these methods:

- stemming;
- lemmatization;
- number normalization;
- allowing only a maximum number of features;
- maximum document frequency – ignore terms that appear in more than a specified share of documents;
- minimum document frequency – ignore terms that appear in fewer than a specified share of documents.

It was shown that the use of stemming in Lithuanian news clustering greatly increases clustering performance [4].
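To make these options concrete, here is a minimal preprocessing sketch in Python. It is not the authors' code: the stop-word excerpt and the lemma map are illustrative stand-ins, while the number normalization and document-frequency pruning mirror the steps listed above.

```python
import re
from collections import Counter

STOP_WORDS = {"ir", "į", "kad", "su"}        # assumed stop-word list (excerpt)
LEMMAS = {"universitetai": "universitetas"}  # assumed word-form -> lemma map

def tokenize(text):
    """Lowercase, extract word features, normalize numbers, drop stop words."""
    features = []
    for token in re.findall(r"\w+", text.lower()):
        if token.isdigit():
            features.append("#NUMBER")                 # number normalization
        elif token in STOP_WORDS:
            continue                                   # stop-word dropping
        else:
            features.append(LEMMAS.get(token, token))  # lemmatize known words
    return features

def prune_by_document_frequency(documents, min_df=0.05, max_df=0.95):
    """Drop tokens whose document frequency falls outside [min_df, max_df]."""
    n = len(documents)
    df = Counter(token for tokens in documents for token in set(tokens))
    kept = {t for t, count in df.items() if min_df <= count / n <= max_df}
    return [[t for t in tokens if t in kept] for tokens in documents]
```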
B. Text representation

For the computer to make any calculations with the text data, it must be represented as numerical vectors. The simplest representation is called "Bag Of Words" (BOW) or "Vector Space Model" (VSM): each document holds counts, or other weights derived from them, for each vocabulary word. This structure ignores the linguistic structure of the text. Surprisingly, the review in [11] found that "unordered methods have been found on many tasks to be extremely well performing, better than several of the more advanced techniques", because "there are only a few likely ways to order any given bag of words".

The most popular weighting for BOW is TF-IDF. A recent study [4] on Lithuanian news clustering has shown that the TF-IDF weight produced the best clustering results. TF-IDF is calculated as:

$\mathrm{tfidf}(w, d) = \mathrm{tf}(w, d) \cdot \log \frac{N}{\mathrm{df}(w)}$

where:

- $\mathrm{tf}(w, d)$ is the term frequency, the number of occurrences of word $w$ in document $d$;
- $\mathrm{df}(w)$ is the document frequency, the number of documents containing word $w$;
- $N$ is the number of documents in the corpus.
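As an illustration, this weighting can be computed directly from token counts. The sketch below follows the formula exactly as written above (no smoothing), which may differ from library defaults such as scikit-learn's `TfidfVectorizer`.

```python
import math
from collections import Counter

def tfidf(documents):
    """documents: list of token lists. Returns a {word: weight} dict per document."""
    n = len(documents)
    df = Counter(token for tokens in documents for token in set(tokens))
    weighted = []
    for tokens in documents:
        tf = Counter(tokens)
        weighted.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return weighted

docs = [["naujienos", "klasteris"], ["naujienos", "vektorius"]]
print(tfidf(docs))  # "naujienos" occurs in every document, so its weight is log(2/2) = 0
```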
One of the newest and widely adopted document representation schemes is Doc2Vec [12]. It is an extension of the word-to-vector (Word2Vec) representation. A word in the Word2Vec representation is regarded as a single vector of real number values. The assumption of Word2Vec is that the element values of a word are affected by those of the other words surrounding the target word. This assumption is encoded as a neural network structure, and the network weights are adjusted by learning from observed examples [13]. Doc2Vec extends Word2Vec from the word level to the document level: each document has its own vector values in the same space as that for words [12].
C. Text clustering

There are tens of clustering algorithms to choose from [14]. One of the simplest and most widely used is the k-means algorithm. During initialization, k-means selects k means, which correspond to k clusters. The algorithm then repeats two steps: (1) for every data point, choose the nearest mean and assign the point to the corresponding cluster; (2) recalculate the means by averaging the data points assigned to the corresponding cluster. The algorithm terminates when the assignment of the data points does not change over several iterations. As the clustering depends on the initially selected centroids, the algorithm is usually run several times to average over random centroid initializations.
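A compact sketch of this procedure with scikit-learn, where `n_init` reruns the algorithm over several random centroid initializations and keeps the best run; the input matrix is a placeholder for the document vectors.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(1500, 200)  # placeholder: one 200-dimensional vector per document

# Steps (1) and (2) from above are iterated until assignments stabilize; n_init=10
# repeats the whole procedure with random initial centroids and keeps the best run.
kmeans = KMeans(n_clusters=10, n_init=10)
cluster_labels = kmeans.fit_predict(X)
```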
IV. THE DATA

A. Articles

Article data for this research was scraped from three Lithuanian news websites: the national lrt.lt and the commercial websites 15min.lt and delfi.lt. Article URLs were collected from the sitemaps listed in the websites' robots.txt files. A total of 82793 articles (26336 from lrt.lt, 31397 from 15min.lt and 25060 from delfi.lt) were retrieved, spanning random release dates in 2017.

The raw dataset contains 30338937 tokens, of which 641697 are unique. The unique token count can be decreased to:

- 641254 by dropping stop words;
- 635257 by normalizing all numbers to a single feature;
- 441178 by applying lemmas and leaving unknown words;
- 41933 by applying lemmas and dropping unknown words;
- 434472 by dropping stop words, normalizing numbers, applying lemmas and leaving unknown words.

Each article has on average 366 tokens and on average 247 unique tokens. The mean token length is 6.51 characters, with a standard deviation of 3.

While analyzing the articles and their accompanying information, it was noticed that some labelling information can be acquired from the article URL: the websites place categorical information between the domain and the article id parts of the URL. A total of 116 distinct categorical descriptions were received and normalized to 12 distinct categories, as described in [4]. The category distribution is:

- Lithuania news (20162 articles);
- World news (21052 articles);
- Crime (7502 articles);
- Business (7280 articles);
- Cars (1557 articles);
- Sports (5913 articles);
- Technologies (1919 articles);
- Opinions (2553 articles);
- Entertainment (769 articles);
- Life (944 articles);
- Culture (3478 articles);
- Other (9664 articles, which do not fall into the previous categories).

It is clearly visible that the category distribution is not uniform. The biggest categories are "Lithuania news" and "World news", together taking up to 49% of all articles.

B. Words

Lithuanian word data was scraped from two semantic information databases: morfologija.lt and tekstynas.vdu.lt/~irena/morfema_search.php. The latter website has more accurate information, including word frequency, while the first is very large but was observed to have some mistakes. Therefore, these two databases were merged, prioritizing words from the second one, as sketched below. The resulting word database contained 2212726 different word forms, including 72587 lemmas.
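A sketch of this merge, assuming each database has been loaded as a word-form to lemma mapping; the example entries are illustrative.

```python
# Each database loaded as an assumed word-form -> lemma mapping (entries illustrative).
forms_morfologija = {"namo": "namas", "vaikui": "vaikas"}
forms_morfema = {"namo": "namas"}  # the more accurate source

# dict.update overwrites duplicate keys, so the second (prioritized) database wins.
merged_forms = dict(forms_morfologija)
merged_forms.update(forms_morfema)
```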
V. CLUSTERING EVALUATION

The main evaluation metrics can be derived from the confusion matrix. For the true and predicted conditions we count occurrences of the following types:

- TP (true positives): the true condition is positive and the predicted condition is positive.
- TN (true negatives): the true condition is negative and the predicted condition is negative.
- FP (false positives): the true condition is negative but the predicted condition is positive.
- FN (false negatives): the true condition is positive but the predicted condition is negative.

If this were a classification task, we would know the real classes and could simply compute the percentage of them predicted accurately. However, in the clustering process we neither know the actual class nor have a meaning for the returned predicted class. We must rely on additional information: the label of the news article category, given by the editor of the news website. This way we make the assumption that the clusters we want to achieve are similar to the categories of the articles. There indeed must be a reason, some similarity between articles, why they were put in the same category. The only drawback of our approach is that a high number of documents requires many pair calculations. Based on the chosen condition, the confusion matrix elements are as follows:

- TP – pairs of articles that have the same category label and are predicted to be in the same cluster.
- TN – pairs of articles that belong to different categories and are predicted to be in different clusters.
- FP – pairs of articles that belong to different categories but are predicted to be in the same cluster.
- FN – pairs of articles that have the same category label but are predicted to be in different clusters.

We will use F1, as the one widely used, and MCC, as a more robust, evaluation score:

$F_1 = 2 \cdot \frac{\mathit{precision} \cdot \mathit{recall}}{\mathit{precision} + \mathit{recall}}$

$\mathit{precision} = \frac{TP}{TP + FP}$

$\mathit{recall} = \frac{TP}{TP + FN}$

$MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$

The MCC score ranges from -1 (total disagreement) to 1 (perfect prediction), while 0 means no better than random prediction. The F1 score varies from 0 (the worst) to 1 (perfect).
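A sketch of this pair-counting evaluation, assuming two equal-length lists: the editor-given category labels and the predicted cluster indices. The quadratic loop over all pairs reflects the drawback mentioned above.

```python
from itertools import combinations
from math import sqrt

def pairwise_scores(categories, clusters):
    """Count all article pairs into TP/TN/FP/FN and derive the F1 and MCC scores."""
    tp = tn = fp = fn = 0
    for i, j in combinations(range(len(categories)), 2):
        same_category = categories[i] == categories[j]
        same_cluster = clusters[i] == clusters[j]
        if same_category and same_cluster:
            tp += 1
        elif same_category:  # same category, different clusters
            fn += 1
        elif same_cluster:   # different categories, same cluster
            fp += 1
        else:
            tn += 1
    # Assumes the non-degenerate case where no denominator is zero.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return f1, mcc
```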
VI. EXPERIMENTS

To ensure that the experiments are as reproducible as possible, each experiment was repeated 50 times and a confidence interval of each resulting clustering score was calculated. In each repetition a random subset of articles of the required size was selected from the dataset anew. However, for the same number of documents this repeated random pickup is kept identical: if another experiment uses the same number of documents, it sees the same 50 samplings of articles. This ensures that we evaluate as much data as possible while keeping the same subsets across different experiments (a minimal sketch of such deterministic sampling is given after the parameter list below).

All experiments were carried out using only articles from the 10 biggest categories, with an equal number of articles sampled from each. Only variables associated with the dataset loading, text preprocessing and representation phases were varied. The actual clustering was done using the k-means algorithm.

In all experiments the following actions and parameters were used, if not specified otherwise:

- 1500 articles used;
- vocabulary pruned to a maximum of 10000 words;
- 0.95 maximum document frequency (BOW);
- 0.05 minimum document frequency (BOW);
- Distributed Bag of Words (DBOW) architecture of the Doc2Vec model used;
- Doc2Vec trained on the same articles that are to be clustered (not the whole corpus);
- window size of 5 words (Doc2Vec models);
- 20 training epochs (Doc2Vec models);
- vector size of 200 (Doc2Vec models);
- minimum word count of 4 (Doc2Vec models);
- all numbers normalized to a single "#NUMBER" feature;
- words with a known lemma lemmatized;
- words in the stop word list dropped from documents;
- unigrams used (a feature is a single word).
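One way to obtain such repeatable sampling is to derive the random seed from the number of documents and the repetition index, so that every experiment with the same document count sees the same 50 subsets. The seeding scheme below is our illustrative assumption, not the authors' code.

```python
import numpy as np

def sample_articles(article_ids, n_docs, repetition):
    """The same (n_docs, repetition) pair always yields the same article subset."""
    rng = np.random.RandomState(seed=n_docs * 100 + repetition)  # assumed seeding scheme
    return rng.choice(article_ids, size=n_docs, replace=False)

# 50 repeatable samplings of 1500 articles out of the 82793 scraped ones:
subsets = [sample_articles(np.arange(82793), 1500, rep) for rep in range(50)]
```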
A. Number of articles and preprocessor method experiment

In this experiment the dataset size and the preprocessing method were varied to determine how the two are related. The tried text representations include BOW and Doc2Vec with the distributed bag of words variation. It was also examined how well Doc2Vec would perform if trained on all 82793 articles.

B. Reducing words to lemmas experiment

This experiment investigated 3 scenarios:
1) lemmas are not used;
2) words for which lemmas could be found were replaced with them, and the other words were discarded;
3) same as 2), but unknown words remained.

Another parameter, namely the maximum number of features, addresses similar issues as lemmatization. For this reason several values of the maximum number of allowed features were tried.

C. Training epochs and embedding vector size experiment

In this experiment two parameters of Doc2Vec were optimized: the number of training epochs (from 5 to 100) and the vector size (from 5 to 400). The distributed bag of words version of Doc2Vec was used.

D. Clustering articles from a defined release interval

In this experiment the best configurations for BOW and Doc2Vec are tried on articles released in one week, from 2017-04-28 to 2017-05-04, covering a total of 1001 articles. Both models are run 50 times on the same articles and the best run is selected. Doc2Vec is trained on the same articles that are used for clustering, with a maximum of 40000 features and a vector size of 52.

The best resulting clusters are then analyzed with the same BOW workflow as the documents, but reducing features only with 0.8 maximum and 0.1 minimum document frequencies. The 10 words with the biggest TF-IDF weights are selected as representative of each cluster (a sketch of this extraction is given below).
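A sketch of this descriptor extraction, assuming raw article texts and their predicted cluster labels. Scikit-learn's `TfidfVectorizer` stands in for the BOW workflow with the 0.8/0.1 document-frequency limits, and summing the weights over a cluster's articles is one plausible reading of "biggest TF-IDF weights".

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def describe_clusters(texts, labels, top_n=10):
    """Return the top_n words with the largest summed TF-IDF weight per cluster."""
    vectorizer = TfidfVectorizer(max_df=0.8, min_df=0.1)
    tfidf = vectorizer.fit_transform(texts)  # documents x vocabulary matrix
    words = np.array(vectorizer.get_feature_names_out())
    labels = np.array(labels)
    descriptors = {}
    for cluster in np.unique(labels):
        weights = np.asarray(tfidf[labels == cluster].sum(axis=0)).ravel()
        descriptors[cluster] = words[np.argsort(weights)[::-1][:top_n]].tolist()
    return descriptors
```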
VII. RESULTS AND ANALYSIS

A. Number of articles and preprocessor method experiment

Experiment results are shown in Fig. 1. The best recorded MCC score is 0.403 (0.464 F1) for Doc2Vec in the distributed bag of words variation, trained on the whole corpus and clustering 3000 articles. It is clearly visible that all text representation models perform better with a higher number of documents. When clustering a small number of documents, we can observe that the BOW model outperforms Doc2Vec if the latter is trained only on the documents that are later clustered. However, starting with 300 documents, Doc2Vec outperforms the BOW model. This shows how much the Doc2Vec model depends on the amount of documents it is trained on, as the model trained on the whole corpus has the biggest MCC score of 0.201 when clustering 100 articles. However, the advantage of training on the whole corpus instead of only the documents to be clustered quickly diminishes as the number of clustered documents approaches 700.

Fig. 1. MCC score dependency on text representation method and number of documents used in clustering.

B. Reducing words to lemmas experiment

Experiment results are depicted in Fig. 2. It was observed that converting known words to lemmas boosts the MCC score for both the BOW and the Doc2Vec models. The highest increase of the MCC score for the BOW representation (from 0.122 to 0.221 for 10000 maximum features) is observed when the non-lemmatized words are dropped after lemmatization. On the other hand, the Doc2Vec representation yields a higher MCC score increase when the non-lemmatized words are left (from 0.356 to 0.401 for a maximum of 40000 features). It is clearly visible that both vectorization methods benefit from lemmatization.

Fig. 2. MCC score dependency on how words are changed to their lemmas, with or without a constraint on the maximum number of features.

C. Training epochs and embedding vector size experiment

Clustering results for several epoch counts and vector sizes are depicted in Fig. 3. The highest average MCC score, 0.381, was recorded for a vector size of 150 and 20 epochs. It is interesting to note that increasing the number of training epochs to 100 reduces the MCC to 0.316. This reduction is observed for all vector sizes and could be explained as overfitting. On the other hand, only 5 epochs give poor results, with a maximum MCC of 0.133 for a vector size of 10, and should be regarded as underfitting. With the optimal number of training epochs being 20, there are many vector sizes (from 20 to 400) yielding very similar MCC results. This shows that vector sizes as small as 20 are enough, when training on the 1500 article dataset for 20 epochs, for a good text representation.
Fig. 3. MCC score dependency on vector size and number of training epochs in Doc2Vec distributed bag of words representation clustering.

D. Clustering articles from a defined release interval

The best Doc2Vec model trained on this small corpus outperformed the best BOW model (MCC 0.318 vs. 0.145, F1 0.415 vs. 0.282). The cluster features and statistics of the Doc2Vec model are depicted in Table I. It shows that the model performs reasonably well and can distinguish:

- a very small (1.9% of all articles) but distinct weather forecast category (cluster Nr. 5);
- classical categories such as culture, sports, and crime (clusters Nr. 3, 9 and 10);
- hot topics such as the university reform, Brexit and current political scandals (clusters Nr. 1, 4 and 8).
TABLE I. CLUSTER STATISTICS

For each cluster: the number of its articles, the distribution of these articles over the ten categories used (the order of the count columns follows the original table), and the ten most descriptive features with their English translations. The ten categories are Lithuania news, World news, Crime, Business, Sports, Technologies, Opinions, Entertainment, Culture and Other.

Cluster Nr. 1, 40 articles; category counts: 11, 0, 0, 24, 0, 3, 0, 0, 0, 2.
Features: universitetas, mokslas, eur, mokykla, studija, pertvarka, akademija, rektorius, vu, kokybė // university, science, eur, school, study, transformation, academy, rector, vu (Vilnius University), quality

Cluster Nr. 2, 87 articles; category counts: 27, 0, 2, 35, 3, 15, 3, 0, 0, 2.
Features: muzika, alkoholis, kultūra, ntv, filmas, visuomenė, maistas, namas, liga, lelkaitis // music, alcohol, culture, ntv, film, society, food, house, illness, lelkaitis (surname of a person)

Cluster Nr. 3, 118 articles; category counts: 29, 1, 40, 18, 4, 1, 4, 16, 2, 3.
Features: koncertas, teatras, muzika, rež, biblioteka, festivalis, džiazas, kultūra, paroda, muziejus // concert, theater, music, dir., library, festival, jazz, culture, exhibition, museum

Cluster Nr. 4, 106 articles; category counts: 8, 0, 0, 16, 0, 1, 80, 0, 0, 1.
Features: es, brexit, derybos, le, pen, may, macronas, partija, th, politinis // eu, brexit, talks, le, pen, may, macron, party, th, political

Cluster Nr. 5, 19 articles; category counts: 0, 0, 0, 16, 0, 0, 2, 0, 0, 1.
Features: laipsnis, šiluma, temperatūra, naktis, debesis, debesuotumas, lietus, įdienojus, pūs, termometrai // degree, heat, temperature, night, cloud, cloudiness, rain, later in the day, will blow, thermometers

Cluster Nr. 6, 184 articles; category counts: 1, 0, 0, 16, 5, 0, 160, 0, 0, 2.
Features: jav, korėtis, raketa, korėja, branduolinis, putinas, jungtinis, pajėgos, karinis, sirijos // usa, korėtis, rocket, korea, nuclear, putin, united, forces, military, syrian

Cluster Nr. 7, 120 articles; category counts: 11, 1, 0, 37, 4, 9, 10, 0, 0, 48.
Features: įmonė, seimas, įstatymas, mokestis, savivaldybė, kaina, šiluma, asmuo, projektas, pajamos // company, parliament, law, tax, municipality, price, heat, person, project, income

Cluster Nr. 8, 79 articles; category counts: 4, 1, 1, 67, 0, 1, 0, 0, 2, 3.
Features: seimas, pūkas, partija, teismas, komisija, konstitucija, pirmininkas, įstatymas, apkalti, taryba // parliament, pūkas (surname of a person), party, court, commission, constitution, chairman, law, impeachment, board

Cluster Nr. 9, 64 articles; category counts: 0, 0, 0, 0, 0, 0, 0, 0, 64, 0.
Features: rungtynės, taškas, žaidėjas, čempionatas, ekipa, rinktinė, įvartis, pelnyti, pergalė, raptors // match, point, player, championship, team, national team, goal, score, victory, raptors (name of a basketball club)

Cluster Nr. 10, 184 articles; category counts: 13, 67, 2, 27, 3, 0, 68, 0, 0, 4.
Features: policija, automobilis, vyras, vairuotojas, pranešti, įtariamas, sulaikyti, žūti, teismas, asmuo // police, car, man, driver, report, suspected, detained, die, court, person
VIII. CONCLUSIONS

In this work the BOW and Doc2Vec text representation methods were compared. Our research shows that Doc2Vec greatly outperforms the BOW model: when clustering a week's worth of data, the highest MCC scores are 0.318 versus 0.145. However, for the Doc2Vec method to outperform BOW when clustering fewer than 300 articles, it must be trained on a much larger dataset. We estimated that embedding vector sizes starting from 20 are already large enough, and that the optimal number of training epochs is around 20. Analysis of converting words to their lemmas showed that lemmatization is beneficial for both the BOW and the Doc2Vec representations.

REFERENCES

[1] Aggarwal C. C., Zhai C., editors. Mining Text Data. Springer Science & Business Media, 2012.
[2] Liu L., Lu Y., Yang M., Qu Q., Zhu J., Li H. Generative adversarial network for abstractive text summarization. In: Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[3] Liu G., Guo J. Bidirectional LSTM with attention mechanism and convolutional layer for text classification. Neurocomputing, 2019.
[4] Pranckaitis V., Lukoševičius M. Clustering of Lithuanian news articles. In: Proceedings of IVUS 2017, pp. 27-32.
[5] Hinton G. E., Salakhutdinov R. R. Reducing the dimensionality of data with neural networks. Science, 2006, 313(5786):504-507.
[6] Mackutė-Varoneckienė A., Krilavičius T. Empirical study on unsupervised feature selection for document clustering. In: Human Language Technologies – The Baltic Perspective, 2014, pp. 107-110.
[7] Ciganaitė G., Mackutė-Varoneckienė A., Krilavičius T. Text documents clustering. In: Informacinės technologijos. XIX tarpuniversitetinė magistrantų ir doktorantų konferencija "Informacinė visuomenė ir universitetinės studijos" (IVUS 2014): konferencijos pranešimų medžiaga, 2014, pp. 90-93.
[8] Kapočiūtė-Dzikienė J., Damaševičius R. Intrinsic evaluation of Lithuanian word embeddings using WordNet. In: Computer Science On-line Conference. Springer, Cham, 2018.
[9] Kapočiūtė-Dzikienė J., Damaševičius R., Woźniak M. Sentiment analysis of Lithuanian texts using traditional and deep learning approaches. Computers, 2019, 8(1):4.
[10] Aker A., Paramita M., Kurtic E., Funk A., Barker E., Hepple M., Gaizauskas R. Automatic label generation for news comment clusters. In: Proceedings of the 9th International Natural Language Generation Conference, 2016, pp. 61-69.
[11] White L., Togneri R., Liu W., Bennamoun M. Sentence representations and beyond. In: Neural Representations of Natural Language. Springer, Singapore, 2019, pp. 93-114.
[12] Le Q., Mikolov T. Distributed representations of sentences and documents. In: International Conference on Machine Learning, 2014, pp. 1188-1196.
[13] Mikolov T., Chen K., Corrado G., Dean J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[14] Aggarwal C. C., Reddy C. K. Data Clustering: Algorithms and Applications. Chapman & Hall/CRC, 2013.