<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Lithuanian news clustering using document embeddings</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Lukas</forename><surname>Stankevičius</surname></persName>
							<email>lukas.stankevicius@ktu.edu</email>
							<affiliation key="aff0">
								<orgName type="department">Faculty of Informatics</orgName>
								<orgName type="institution">Kaunas University of Technology Kaunas</orgName>
								<address>
									<country key="LT">Lithuania</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Mantas</forename><surname>Lukoševičius</surname></persName>
							<email>mantas.lukosevicius@ktu.lt</email>
							<affiliation key="aff1">
								<orgName type="department">Faculty of Informatics</orgName>
								<orgName type="institution">Kaunas University of Technology Kaunas</orgName>
								<address>
									<country key="LT">Lithuania</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Lithuanian news clustering using document embeddings</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">3F12CFBF84305C22D6378136BF56E65D</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-23T22:51+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>document clustering</term>
					<term>document embedding</term>
					<term>lemmatization</term>
					<term>Lithuanian news articles</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Much natural language processing research is carried out and applied on English texts, but relatively little is tried on less popular languages. In this article document embeddings are compared with traditional bag of words methods for Lithuanian news clustering. The results show that, given enough documents, the embeddings greatly outperform simple bag of words representations. In addition, optimal lemmatization, embedding vector size, and number of training epochs were investigated.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I. INTRODUCTION</head><p>Knowledge and information are an inseparable part of our civilization. For thousands of years, anything from news of incoming troops to ordinary know-how could have meant the difference between life and death. Knowledge accumulation throughout the centuries led to astonishing improvements in our way of life. Hardly anyone could now get through even a single day without news or other kinds of information.</p><p>Centuries ago information was scarce; nowadays we have the opposite situation. Demand and technology have greatly increased the amount of information we can acquire. Now one's goal is to not get lost in it. As an example, the most popular Lithuanian news website publishes approximately 80 news articles each day. Add other news websites, not only from Lithuania but from the entire world, and one would be overwhelmed trying to read most of this information.</p><p>The field of text data mining emerged to tackle this kind of problem. It goes "beyond information access to further help users analyze and digest information and facilitate decision making" <ref type="bibr" target="#b0">[1]</ref>. Text data mining offers several solutions to better characterize text documents: summarization, classification and clustering <ref type="bibr" target="#b0">[1]</ref>. However, when evaluated by people, the best summarization results currently are given only 2-4 points out of 5 <ref type="bibr" target="#b1">[2]</ref>. Today the best classification accuracies are 50-94% <ref type="bibr" target="#b2">[3]</ref> and clustering reaches an F1 score of about 0.4 <ref type="bibr" target="#b3">[4]</ref>. 
Although the achieved classification results are more accurate, clustering is perceived as more promising because it is universal and can handle unknown categories, as is the case for diverse news data.</p><p>After it was shown that artificial neural networks can be successfully trained and used to reduce dimensionality <ref type="bibr" target="#b4">[5]</ref>, many new successful data mining models emerged. The aim of this work is to test how one of such models, document to vector (Doc2Vec), can improve clustering of Lithuanian news.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>II. RELATED WORK ON LITHUANIAN LANGUAGE</head><p>Articles on clustering of Lithuanian language documents suggest using K-means <ref type="bibr" target="#b3">[4]</ref>, spherical K-means <ref type="bibr" target="#b5">[6]</ref> or Expectation-Maximization (EM) <ref type="bibr" target="#b6">[7]</ref> algorithms. It was also observed that K-means is fast and suitable for large corpora <ref type="bibr" target="#b6">[7]</ref> and outperforms other popular algorithms <ref type="bibr" target="#b3">[4]</ref>. <ref type="bibr" target="#b5">[6]</ref> considers Term Frequency / Inverse Document Frequency (TF-IDF) the best weighting scheme. <ref type="bibr" target="#b3">[4]</ref> adds that it must be used together with stemming, while <ref type="bibr" target="#b5">[6]</ref> advocates minimum and maximum document frequency filtering before applying TF-IDF. These works show that TF-IDF is an important weighting scheme and that it could optionally be combined with additional preprocessing steps.</p><p>We have not found any research on document embeddings for the Lithuanian language. However, there is some work on word embeddings. In <ref type="bibr" target="#b7">[8]</ref> word embeddings produced by different models and training algorithms were compared after training on a corpus of 234 million tokens. It was found that the Continuous Bag of Words (CBOW) architecture significantly outperformed the skip-gram method, while vector dimensionality showed no significant impact on the results. This implies that document embeddings, like word embeddings, should follow the same CBOW architectural pattern. Another work <ref type="bibr" target="#b8">[9]</ref> compared traditional and deep learning (using word embeddings) approaches for sentiment analysis and found that deep learning demonstrated good results only when applied to small datasets; otherwise traditional methods were better. 
As embeddings may underperform in sentiment analysis, we test whether this is also the case for news clustering.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>III. TEXT CLUSTERING PROCESS</head><p>To improve clustering quality some text preprocessing must be done. Every text analytics process consists "of three consecutive phases: Text Preprocessing, Text Representation and Knowledge Discovery" <ref type="bibr" target="#b0">[1]</ref> (the last being clustering in our case).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Text preprocessing</head><p>The purpose of text preprocessing is to make the data more concise and facilitate text representation. It mainly involves tokenizing text into features and dropping the ones considered less important. Extracted features can be words, characters or any n-gram (contiguous sequence of n items from a given sample of text) of both. Tokens can also be accompanied by structural or placement aspects of the document <ref type="bibr" target="#b9">[10]</ref>.</p><p>The most and least frequent items are considered uninformative and dropped. Tokens found in every document are not descriptive; they usually include stop words such as "and" and "to". On the other hand, words that are too rare cannot be attributed to any characteristic and, because of the sparse vectors they produce, only complicate the whole process.</p><p>Existing text features can be further concentrated by these methods:</p><p>stemming;</p><p>lemmatization;</p><p>number normalization;</p><p>allowing only a maximum number of features;</p><p>maximum document frequency: ignore terms that appear in more than the specified number of documents;</p><p>minimum document frequency: ignore terms that appear in fewer than the specified number of documents.</p><p>It was shown that the use of stemming in Lithuanian news clustering greatly increased clustering performance <ref type="bibr" target="#b3">[4]</ref>.</p></div>
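<div xmlns="http://www.tei-c.org/ns/1.0"><p>The minimum and maximum document frequency filters listed above can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the function name filter_by_document_frequency, the toy documents and the interpretation of both thresholds as shares of documents are assumptions made for illustration.</p>
```python
from collections import Counter


def filter_by_document_frequency(docs, min_df=0.05, max_df=0.95):
    """Drop tokens whose share of containing documents is outside [min_df, max_df]."""
    n_docs = len(docs)
    # Count each token at most once per document.
    df = Counter(token for doc in docs for token in set(doc))
    keep = {t for t, count in df.items() if max_df >= count / n_docs >= min_df}
    return [[t for t in doc if t in keep] for doc in docs]


# Hypothetical toy corpus of already tokenized documents ("ir" means "and").
docs = [
    ["ir", "seimas", "įstatymas"],
    ["ir", "rungtynės", "taškas"],
    ["ir", "seimas", "partija"],
    ["ir", "lietus", "debesis"],
]
filtered = filter_by_document_frequency(docs, min_df=0.5, max_df=0.9)
print(filtered)
```
<p>In this toy corpus the stop word "ir" occurs in every document and is removed by the maximum threshold, while words occurring in a single document are removed by the minimum threshold.</p></div>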
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Text representation</head><p>For a computer to make any calculations with text data, the text must be represented as numerical vectors. The simplest representation is called "Bag Of Words" (BOW) or "Vector Space Model" (VSM), where each document has counts or other derived weights for each vocabulary word. This representation ignores the linguistic structure of the text. Surprisingly, the review in <ref type="bibr" target="#b10">[11]</ref> notes that "unordered methods have been found on many tasks to be extremely well performing, better than several of the more advanced techniques", because "there are only a few likely ways to order any given bag of words".</p><p>The most popular weight for BOW is TF-IDF. A recent study <ref type="bibr" target="#b3">[4]</ref> on Lithuanian news clustering has shown that the TF-IDF weight produced the best clustering results. TF-IDF is calculated as:</p><formula xml:id="formula_0">\mathrm{tfidf}(w, d) = \mathrm{tf}(w, d) \cdot \log \frac{N}{\mathrm{df}(w)}</formula></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head> </head><p>where:</p><p>tf(w, d) is the term frequency, the number of occurrences of word w in document d;</p><p>df(w) is the document frequency, the number of documents containing word w;</p><p>N is the number of documents in the corpus.</p><p>One of the newest and most widely adopted document representation schemes is Doc2Vec <ref type="bibr" target="#b11">[12]</ref>. It is an extension of the word-to-vector (Word2Vec) representation. A word in the Word2Vec representation is regarded as a single vector of real values. The assumption of Word2Vec is that the element values of a word's vector are affected by those of the other words surrounding the target word. This assumption is encoded as a neural network structure and the network weights are adjusted by learning from observed examples <ref type="bibr" target="#b12">[13]</ref>. Doc2Vec extends Word2Vec from the word level to the document level, and each document has its own vector in the same space as that of words <ref type="bibr" target="#b11">[12]</ref>.</p></div>
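<div xmlns="http://www.tei-c.org/ns/1.0"><p>The TF-IDF weight defined above can be sketched in a few lines of Python. This is only an illustration of the formula, not the implementation used in the experiments; the function name tfidf, the toy documents and the choice of the natural logarithm are assumptions, as the logarithm base is not specified here.</p>
```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {word: weight} dict per document, weight = tf(w, d) * log(N / df(w))."""
    n_docs = len(docs)
    # df(w): number of documents containing word w.
    df = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # tf(w, d): occurrences of w in this document
        weights.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return weights


# Hypothetical toy corpus of two tokenized documents.
docs = [["seimas", "seimas", "partija"], ["seimas", "lietus"]]
weights = tfidf(docs)
print(weights)
```
<p>A word that occurs in every document, such as "seimas" in this toy corpus, receives zero weight because log(N / df(w)) = log(1) = 0, which is why TF-IDF suppresses uninformative, ubiquitous terms.</p></div>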
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Text clustering</head><p>There are tens of clustering algorithms to choose from <ref type="bibr" target="#b13">[14]</ref>. One of the simplest and most widely used is the k-means algorithm. During initialization, the k-means algorithm selects k means, which correspond to k clusters. Then the algorithm repeats two steps: (1) for every data point, choose the nearest mean and assign the point to the corresponding cluster; (2) recalculate the means by averaging the data points assigned to the corresponding cluster. The algorithm terminates when the assignment of the data points does not change after several iterations. As the clustering depends on the initially selected centroids, the algorithm is usually run several times to average over random centroid initializations.</p></div>
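<div xmlns="http://www.tei-c.org/ns/1.0"><p>The two repeated steps of k-means can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used in this work: the initial means are k randomly chosen data points, distances are squared Euclidean, and the loop stops at the first iteration in which the assignment does not change. The function name kmeans and the toy points are hypothetical.</p>
```python
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Cluster points (tuples of floats) into k groups; returns (assignment, means)."""
    rng = random.Random(seed)
    means = rng.sample(points, k)  # assumed initialization: k randomly chosen points
    assignment = None
    for _ in range(max_iter):
        # Step (1): assign every point to its nearest mean (squared Euclidean distance).
        new_assignment = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, means[j])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # simplification: stop at the first unchanged assignment
        assignment = new_assignment
        # Step (2): recompute each mean as the average of its assigned points.
        for j in range(k):
            members = [p for p, c in zip(points, assignment) if c == j]
            if members:  # an empty cluster keeps its previous mean
                means[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return assignment, means


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
assignment, means = kmeans(points, k=2)
print(assignment, means)
```
<p>Because the result depends on the seed, a practical run would repeat the call with several seeds, in line with the repeated random initializations described above.</p></div>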
<div xmlns="http://www.tei-c.org/ns/1.0"><head>IV. THE DATA</head></div>
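<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Articles</head><p>Article data for this research was scraped from three Lithuanian news websites: the national lrt.lt and the commercial websites 15min.lt and delfi.lt. Article URLs were scraped from the sitemaps listed in the robots.txt files of the websites. A total of 82793 articles (26336 from lrt.lt, 31397 from 15min.lt and 25060 from delfi.lt) were retrieved, spanning random release dates in 2017.</p><p>The raw dataset contains 30338937 tokens, of which 641697 are unique. The unique token count can be decreased to:</p><p>641254, dropping stop words;</p><p>635257, normalizing all numbers to a single feature;</p><p>441178, applying lemmas and leaving unknown words;</p><p>41933, applying lemmas and dropping unknown words;</p><p>434472, dropping stop words, normalizing numbers, applying lemmas and leaving unknown words.</p><p>Each article has on average 366 tokens and on average 247 unique tokens. Mean token length is 6.51 characters with a standard deviation of 3.</p><p>While analyzing the articles and their accompanying information, it was noticed that some labelling information can be acquired from the article URL. The websites have categorical information between the domain and article id parts of the URL. A total of 116 distinct categorical descriptions were retrieved and normalized to 12 distinct categories as described in <ref type="bibr" target="#b3">[4]</ref>. Category distributions are:</p><p>Lithuania news (20162 articles);</p><p>World news (21052 articles);</p><p>Crime (7502 articles);</p><p>Business (7280 articles);</p><p>Cars (1557 articles);</p><p>Sports (5913 articles);</p><p>Technologies (1919 articles);</p><p>Opinions (2553 articles);</p><p>Entertainment (769 articles);</p><p>Life (944 articles);</p><p>Culture (3478 articles);</p><p>Other (9664 articles, which do not fall into previous categories).</p><p>It is clearly visible that the category distribution is not uniform. The biggest categories are "Lithuania news" and "World news", together taking up about 49 % of all articles.</p><formula xml:id="formula_1"></formula></div>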
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Words</head><p>Lithuanian word data was scraped from two semantic information databases: morfologija.lt and tekstynas.vdu.lt/~irena/morfema_search.php. The latter website has more accurate information, including word frequency while the first is very large and was observed having some mistakes. Therefore, these two databases were merged prioritizing words from the second one. Resulting word database contained 2212726 different word forms including 72587 lemmas.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>V. CLUSTERING EVALUATION</head><p>The main evaluation metrics can be derived from the confusion matrix, depicted in Table <ref type="table" target="#tab_2">I</ref>. Here, for true and predicted conditions, we get counts of the following types:</p><p>TP (true positives). The true condition is positive and the predicted condition is positive.</p><p>TN (true negatives). The true condition is negative and the predicted condition is negative.</p><p>FP (false positives). The true condition is negative but the predicted condition is positive.</p><p>FN (false negatives). The true condition is positive but the predicted condition is negative.</p><p>If this were a classification task, we would know the real classes and could simply take the percentage of them predicted accurately. However, in the clustering process we neither know the actual class nor have a meaning for the returned predicted class. We must rely on additional information: the label of the news article category, given by the editor of the news website. This way we make the assumption that the clusters we want to achieve are similar to the categories of articles. There indeed must be a reason, some similarity between articles, why they were put in the same category. The only drawback of our approach is that a high number of documents requires many pair calculations (n(n-1)/2 pairs for n documents). 
Based on the chosen condition, the confusion matrix elements are as follows:</p><p>TP: pairs of articles that have the same category label and are predicted to be in the same cluster.</p><p>TN: pairs of articles that belong to different categories and are predicted to be in different clusters.</p><p>FP: pairs of articles that belong to different categories but are predicted to be in the same cluster.</p><p>FN: pairs of articles that have the same category label but are predicted to be in different clusters.</p><p>We will use two evaluation scores: F1, as a widely used one, and the Matthews correlation coefficient (MCC), as a more robust one:</p><formula xml:id="formula_2">F1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}, \qquad \mathrm{precision} = \frac{TP}{TP + FP}, \qquad \mathrm{recall} = \frac{TP}{TP + FN}, \qquad MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}</formula><p>The MCC score ranges from -1 (total disagreement) to 1 (perfect prediction), while 0 means no better than random prediction. The F1 score varies from 0 (the worst) to 1 (perfect).</p></div>
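<div xmlns="http://www.tei-c.org/ns/1.0"><p>The pair-based confusion matrix and the two scores can be computed with a short Python sketch. It is an illustration of the definitions above rather than the code used in this work; the function names pair_confusion and f1_and_mcc, the toy labels, and the convention of returning 0 when a denominator is zero are assumptions.</p>
```python
import math
from itertools import combinations


def pair_confusion(labels, clusters):
    """Count TP, TN, FP, FN over all unordered pairs of articles."""
    tp = tn = fp = fn = 0
    for i, j in combinations(range(len(labels)), 2):
        same_label = labels[i] == labels[j]
        same_cluster = clusters[i] == clusters[j]
        if same_label and same_cluster:
            tp += 1
        elif not same_label and not same_cluster:
            tn += 1
        elif same_cluster:
            fp += 1  # different categories, same cluster
        else:
            fn += 1  # same category, different clusters
    return tp, tn, fp, fn


def f1_and_mcc(tp, tn, fp, fn):
    """Return (F1, MCC); zero denominators yield 0.0 (an assumed convention)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return f1, mcc


labels = ["Sports", "Sports", "Crime", "Crime"]
perfect = pair_confusion(labels, [0, 0, 1, 1])
mixed = pair_confusion(labels, [0, 1, 0, 1])
print(perfect, f1_and_mcc(*perfect))
print(mixed, f1_and_mcc(*mixed))
```
<p>Cluster identifiers never need to be matched to category names, because only the agreement of each pair is counted. The loop visits n(n-1)/2 pairs, which is the cost noted above for large numbers of documents.</p></div>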
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VI. EXPERIMENTS</head><p>To ensure that the experiments are as reproducible as possible, each experiment was repeated 50 times and a confidence interval of the resulting clustering scores was calculated. In each repetition the specified number of articles was randomly selected anew from the dataset. However, for the same number of documents this repeated random selection is identical across experiments (if another experiment uses the same number of documents, its 50 samplings of articles are the same). This ensures that we evaluate as much data as possible while keeping the same subsets for different experiments.</p><p>All experiments were carried out using only articles from the 10 biggest categories. An equal number of articles was sampled from each of them. Only variables associated with the dataset loading, text preprocessing and representation phases were varied. The actual clustering was done using the k-means algorithm.</p><p>In all experiments the following actions and parameters were used if not specified otherwise:</p><p>all numbers normalized to a "#NUMBER" feature;</p><p>words with a known lemma lemmatized;</p><p>words in the stop word list dropped from documents;</p><p>unigrams used (a feature is a single word).</p></div>
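<div xmlns="http://www.tei-c.org/ns/1.0"><p>One way to obtain samplings that differ between repetitions but repeat exactly for the same number of documents is to derive the random seed from the sample size and the repetition index. The sketch below illustrates this idea only; the seeding scheme, the function name sample_balanced and the toy corpus are assumptions and are not taken from the original experiments.</p>
```python
import random


def sample_balanced(articles_by_category, per_category, repetition):
    """Pick the same balanced subset whenever sample size and repetition index match."""
    # Assumed scheme: the seed depends only on the sample size and repetition index.
    rng = random.Random(f"{per_category}-{repetition}")
    sample = []
    for category in sorted(articles_by_category):
        # Equal number of articles from every category.
        sample.extend(rng.sample(articles_by_category[category], per_category))
    return sample


# Hypothetical corpus: article ids grouped by category.
corpus = {"Sports": list(range(100)), "Crime": list(range(100, 200))}
first = sample_balanced(corpus, per_category=5, repetition=0)
again = sample_balanced(corpus, per_category=5, repetition=0)
print(first == again, len(first))
```
<p>Under this scheme two experiments that use the same number of documents evaluate exactly the same 50 subsets, so differences in their scores can be attributed to the varied parameters rather than to the sampling.</p></div>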
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Number of articles and preprocessor method experiment</head><p>In this experiment dataset size and preprocessor method were varied to determine how the two are correlated. Tried text representations include BOW and Doc2vec with distributed bag of words variation. It was also examined how well Doc2Vec would perform if trained on all the 82793 articles.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Reducing words to lemmas experiment</head><p>This experiment investigated 3 scenarios:</p><p>1) lemmas are not used;</p><p>2) words for which lemmas could be found were replaced with them and other words discarded;</p><p>3) same as 2 but unknown words remained.</p><p>Another parameter, namely maximum number of features, solves similar issues as lemmatization. Due to this reason several values of maximum number of allowed features were tried.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Training epochs and embedding vector size experiment</head><p>In this experiment two parameters for Doc2Vec were optimized: training epochs (from 5 to 100) and vector size (from 5 to 400). Distributed bag of words version of Doc2Vec was used.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Clustering articles from a defined release interval</head><p>In this experiment the best configurations for BOW and Doc2Vec will be tried on articles released in one week from 2017-04-28 to 2017-05-04 dates, covering total of 1001 articles. Both models with same articles will be run 50 times and the best run selected. Doc2Vec is trained on same articles used for clustering using maximum number of 40000 features and vector size of 52.</p><p>The best resulting clusters will be analyzed with the same BOW workflow as documents but reducing features only with 0.8 maximum and 0.1 minimum document frequencies. 10 words with the biggest TF-IDF weights will be selected as representative of each cluster.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VII. RESULTS AND ANALYSIS</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Number of articles and preprocessor method experiment</head><p>Experiment results are shown in Fig. <ref type="figure" target="#fig_2">1</ref>. The best recorded MCC score is 0.403 (0.464 for F1), achieved by the distributed bag of words variation of Doc2Vec trained on the whole corpus when clustering 3000 articles. It is clearly visible that all text representation models perform better with a higher number of documents. When clustering a small number of documents, the BOW model outperforms Doc2Vec if the latter is trained only on the documents that are later clustered. However, starting with 300 documents, Doc2Vec outperforms the BOW model. This shows that the Doc2Vec model depends on how many documents it is trained on, as the model trained on the whole corpus has the highest MCC score of 0.201 when clustering 100 articles. However, the advantage of training on the whole corpus instead of only the documents to be clustered quickly diminishes as the number of clustered documents approaches 700. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Reducing words to lemmas experiment</head><p>Experiment results are depicted in Fig. <ref type="figure" target="#fig_3">2</ref>. It was observed that converting known words to lemmas boosts the MCC score for both BOW and Doc2Vec models. For the BOW representation, the highest increase in MCC score (from 0.122 to 0.221 for 10000 maximum features) is observed when non-lemmatized words are dropped after lemmatization. On the other hand, the Doc2Vec representation yields a higher MCC score increase when non-lemmatized words are kept (from 0.356 to 0.401 for 40000 maximum features). It is clearly visible that both vectorization methods benefit from lemmatization. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Training epochs and embedding vector size experiment</head><p>Clustering results for several epochs and vector sizes are depicted in Fig. <ref type="figure" target="#fig_4">3</ref>. The highest average MCC score, 0.381, was recorded for a vector size of 150 and 20 epochs. It is interesting to note that increasing the number of training epochs to 100 reduces MCC to 0.316. This reduction is observed for all vector sizes and could be explained by overfitting. On the other hand, only 5 epochs give poor results, with a maximum MCC of 0.133 for a vector size of 10, which should be regarded as underfitting. With the optimal number of training epochs being 20, there are many vector sizes (from 20 to 400) yielding very similar MCC results. This shows that small vector sizes such as 20 are enough to obtain a good text representation when training on a 1500-article dataset for 20 epochs. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Clustering articles from defined release interval</head><p>The best Doc2Vec model trained on a small corpus outperformed the best BOW model (MCC 0.318 and 0.145, F1 0.415 and 0.282). Cluster features and statistics of Doc2vec model are depicted in Table <ref type="table" target="#tab_2">I</ref>. It shows that model performs reasonably well and can distinguish:  very small (1.9 % of all articles) distinct weather forecast category (cluster Nr.  </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VIII. CONCLUSIONS</head><p>In this work BOW and Doc2Vec text representation methods were compared. Our research shows that Doc2Vec greatly outperforms the BOW model. When clustering a week's worth of data, the highest MCC scores are 0.318 versus 0.145. However, for the Doc2Vec method to outperform BOW when clustering fewer than 300 articles, it must be trained on a much larger dataset. We found that embedding vector sizes starting from 20 are large enough and that the optimal number of training epochs is around 20. Analysis of converting words to their lemmas showed that lemmatization is beneficial for both BOW and Doc2Vec representations.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label></label><figDesc>Lithuania news (20162 articles); World news (21052 articles); Crime (7502 articles); Business (7280 articles); Cars (1557 articles); Sports (5913 articles); Technologies (1919 articles); Opinions (2553 articles); Entertainment (769 articles); Life (944 articles); Culture (3478 articles); Other (9664 articles, which do not fall into previous categories).</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head></head><label></label><figDesc>used 1500 articles;  vocabulary pruned to maximum of 10000 words;  0.95 maximum document frequency (BOW);  0.05 minimum document frequency (BOW);  Distributed Bag of Words (DBOW) architecture of Doc2Vec model used;  Doc2Vec method trained on same articles to be clustered (not all corpus);  window size of 5 words (Doc2Vec models);  20 training epochs (Doc2Vec models);  200 vector size (Doc2Vec models);  minimum word count of 4 (Doc2Vec models);</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Fig. 1 .</head><label>1</label><figDesc>Fig. 1. MCC score dependency on text representation method and number of documents used in clustering.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Fig. 2 .</head><label>2</label><figDesc>Fig. 2. MCC score dependency on how words are changed to their lemmas, with or without a constraint on the maximum number of features.</figDesc><graphic coords="4,306.66,216.00,243.96,163.26" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Fig. 3 .</head><label>3</label><figDesc>Fig. 3. MCC score dependency on vector size and number of training epochs in Doc2Vec distributed bag of words representation clustering</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head></head><label></label><figDesc>and commercial websites 15min.lt and delfi.lt. Articles URL's were scraped from sitemaps in robots.txt files in websites. Total of 82793 articles (26336 from lrt.lt, 31397 from 15min.lt and 25060 from delfi.lt) were retrieved spanning random release dates of 2017 year.</figDesc><table><row><cell>Raw dataset contains 30338937 tokens from which</cell></row><row><cell>641697 are unique. Unique token count can be decreased to:</cell></row><row><cell> 641254, dropping stop words;</cell></row><row><cell> 635257, normalizing all numbers to a single feature;</cell></row><row><cell> 441178, applying lemmas and leaving unknown</cell></row><row><cell>words;</cell></row><row><cell> 41933, applying lemmas and dropping unknown</cell></row><row><cell>words;</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>TABLE I.</head><label>I</label><figDesc>CLUSTERS STATISTICS</figDesc><table>
<row><cell cols="2"></cell><cell cols="10">Category label</cell><cell></cell></row>
<row><cell>Cluster Nr.</cell><cell>Number of articles in cluster</cell><cell>Other</cell><cell>Crime</cell><cell>Culture</cell><cell>Lithuania news</cell><cell>Technologies</cell><cell>Opinions</cell><cell>World news</cell><cell>Entertainment</cell><cell>Sports</cell><cell>Business</cell><cell>Most descriptive features and their translation to English</cell></row>
<row><cell>1.</cell><cell>40</cell><cell>11</cell><cell>0</cell><cell>0</cell><cell>24</cell><cell>0</cell><cell>3</cell><cell>0</cell><cell>0</cell><cell>0</cell><cell>2</cell><cell>universitetas, mokslas, eur, mokykla, studija, pertvarka, akademija, rektorius, vu, kokybė // university, science, eur, school, study, transformation, academy, rector, vu (Vilnius University), quality</cell></row>
<row><cell>2.</cell><cell>87</cell><cell>27</cell><cell>0</cell><cell>2</cell><cell>35</cell><cell>3</cell><cell>15</cell><cell>3</cell><cell>0</cell><cell>0</cell><cell>2</cell><cell>muzika, alkoholis, kultūra, ntv, filmas, visuomenė, maistas, namas, liga, lelkaitis // music, alcohol, culture, ntv, film, society, food, house, illness, lelkaitis (surname of a person)</cell></row>
<row><cell>3.</cell><cell>118</cell><cell>29</cell><cell>1</cell><cell>40</cell><cell>18</cell><cell>4</cell><cell>1</cell><cell>4</cell><cell>16</cell><cell>2</cell><cell>3</cell><cell>koncertas, teatras, muzika, rež, biblioteka, festivalis, džiazas, kultūra, paroda, muziejus // concert, theater, music, dir, library, festival, jazz, culture, exhibition, museum</cell></row>
<row><cell>4.</cell><cell>106</cell><cell>8</cell><cell>0</cell><cell>0</cell><cell>16</cell><cell>0</cell><cell>1</cell><cell>80</cell><cell>0</cell><cell>0</cell><cell>1</cell><cell>es, brexit, derybos, le, pen, may, macronas, partija, th, politinis // es, brexit, talks, le, pen, may, macron, party, th, political</cell></row>
<row><cell>5.</cell><cell>19</cell><cell>0</cell><cell>0</cell><cell>0</cell><cell>16</cell><cell>0</cell><cell>0</cell><cell>2</cell><cell>0</cell><cell>0</cell><cell>1</cell><cell>laipsnis, šiluma, temperatūra, naktis, debesis, debesuotumas, lietus, įdienojus, pūs, termometrai // degree, heat, temperature, night, cloud, clouds, rain, be broad daylight, will blow, thermometers</cell></row>
<row><cell>6.</cell><cell>184</cell><cell>1</cell><cell>0</cell><cell>0</cell><cell>16</cell><cell>5</cell><cell>0</cell><cell>160</cell><cell>0</cell><cell>0</cell><cell>2</cell><cell>jav, korėtis, raketa, korėja, branduolinis, putinas, jungtinis, pajėgos, karinis, sirijos // usa, korėtis, rocket, korea, nuclear, putin, united, forces, military, syrian</cell></row>
<row><cell>7.</cell><cell>120</cell><cell>11</cell><cell>1</cell><cell>0</cell><cell>37</cell><cell>4</cell><cell>9</cell><cell>10</cell><cell>0</cell><cell>0</cell><cell>48</cell><cell>įmonė, seimas, įstatymas, mokestis, savivaldybė, kaina, šiluma, asmuo, projektas, pajamos // company, parliament, law, tax, municipality, price, heat, person, project, income</cell></row>
<row><cell>8.</cell><cell>79</cell><cell>4</cell><cell>1</cell><cell>1</cell><cell>67</cell><cell>0</cell><cell>1</cell><cell>0</cell><cell>0</cell><cell>2</cell><cell>3</cell><cell>seimas, pūkas, partija, teismas, komisija, konstitucija, pirmininkas, įstatymas, apkalti, taryba // parliament, pūkas (surname of a person), party, court, commission, constitution, chairman, law, impeachment, board</cell></row>
<row><cell>9.</cell><cell>64</cell><cell>0</cell><cell>0</cell><cell>0</cell><cell>0</cell><cell>0</cell><cell>0</cell><cell>0</cell><cell>0</cell><cell>64</cell><cell>0</cell><cell>rungtynės, taškas, žaidėjas, čempionatas, ekipa, rinktinė, įvartis, pelnyti, pergalė, raptors // match, point, player, championship, team, team, goal, win, victory, raptors (name of basketball club)</cell></row>
<row><cell>10.</cell><cell>184</cell><cell>13</cell><cell>67</cell><cell>2</cell><cell>27</cell><cell>3</cell><cell>0</cell><cell>68</cell><cell>0</cell><cell>0</cell><cell>4</cell><cell>policija, automobilis, vyras, vairuotojas, pranešti, įtariamas, sulaikyti, žūti, teismas, asmuo // police, car, man, driver, report, suspected, detained, die, court, person</cell></row>
</table></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">Mining text data</title>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">C</forename><surname>Aggarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zhai</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2012-02-03">2012 Feb 3</date>
			<publisher>Springer Science &amp; Business Media</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Generative adversarial network for abstractive text summarization</title>
		<author>
			<persName><forename type="first">L</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Qu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Li</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Thirty-Second AAAI Conference on Artificial Intelligence</title>
				<imprint>
			<date type="published" when="2018-04-29">2018 Apr 29</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Bidirectional LSTM with attention mechanism and convolutional layer for text classification</title>
		<author>
			<persName><forename type="first">G</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Guo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Neurocomputing</title>
		<imprint>
			<date type="published" when="2019-02-01">2019 Feb 1</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Clustering of Lithuanian news articles</title>
		<author>
			<persName><forename type="first">V</forename><surname>Pranckaitis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lukoševičius</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IVUS 2017</title>
				<meeting>the IVUS 2017</meeting>
		<imprint>
			<biblScope unit="page" from="27" to="32" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Reducing the dimensionality of data with neural networks</title>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">E</forename><surname>Hinton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">R</forename><surname>Salakhutdinov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Science</title>
		<imprint>
			<biblScope unit="volume">313</biblScope>
			<biblScope unit="issue">5786</biblScope>
			<biblScope unit="page" from="504" to="507" />
			<date type="published" when="2006-07-28">2006 Jul 28</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Empirical study on unsupervised feature selection for document clustering</title>
		<author>
			<persName><forename type="first">Aušra</forename><surname>Mackutė-Varoneckienė</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tomas</forename><surname>Krilavičius</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Human Language Technologies - The Baltic Perspective</title>
				<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="107" to="110" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<author>
			<persName><forename type="first">Greta</forename><surname>Ciganaitė</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Aušra</forename><surname>Mackutė-Varoneckienė</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tomas</forename><surname>Krilavičius</surname></persName>
		</author>
		<title level="m">Text documents clustering. Informacinės technologijos. XIX tarpuniversitetinė magistrantų ir doktorantų konferencija &quot;Informacinė visuomenė ir universitetinės studijos&quot; (IVUS 2014): konferencijos pranešimų medžiaga</title>
				<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="90" to="93" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Intrinsic evaluation of Lithuanian word embeddings using WordNet</title>
		<author>
			<persName><forename type="first">Jurgita</forename><surname>Kapočiūtė-Dzikienė</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Robertas</forename><surname>Damaševičius</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="s">Computer Science On-line Conference</title>
		<imprint>
			<date type="published" when="2018">2018</date>
			<publisher>Springer</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Sentiment analysis of Lithuanian texts using traditional and deep learning approaches</title>
		<author>
			<persName><forename type="first">Jurgita</forename><surname>Kapočiūtė-Dzikienė</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Robertas</forename><surname>Damaševičius</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Marcin</forename><surname>Woźniak</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computers</title>
		<imprint>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page">4</biblScope>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Automatic label generation for news comment clusters</title>
		<author>
			<persName><forename type="first">A</forename><surname>Aker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Paramita</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Kurtic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Funk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Barker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hepple</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Gaizauskas</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 9th International Natural Language Generation Conference</title>
				<meeting>the 9th International Natural Language Generation Conference</meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="61" to="69" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Sentence Representations and Beyond</title>
		<author>
			<persName><forename type="first">L</forename><surname>White</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Togneri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bennamoun</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Neural Representations of Natural Language</title>
				<meeting><address><addrLine>Singapore</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="93" to="114" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Distributed representations of sentences and documents</title>
		<author>
			<persName><forename type="first">Quoc</forename><surname>Le</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tomas</forename><surname>Mikolov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International conference on machine learning</title>
				<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="1188" to="1196" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<title level="m" type="main">Efficient estimation of word representations in vector space</title>
		<author>
			<persName><forename type="first">Tomas</forename><surname>Mikolov</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1301.3781</idno>
		<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<title level="m" type="main">Data Clustering: Algorithms and Applications</title>
		<author>
			<persName><forename type="first">Charu</forename><forename type="middle">C</forename><surname>Aggarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Chandan</forename><forename type="middle">K</forename><surname>Reddy</surname></persName>
		</author>
		<imprint>
			<publisher>Chapman &amp; Hall/CRC</publisher>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
