Testing Pre-trained Transformer Models for Lithuanian News Clustering

Lukas Stankevičius, Mantas Lukoševičius
Faculty of Informatics, Kaunas University of Technology, Kaunas, Lithuania

IVUS 2020: Information Society and University Studies, 23 April 2020, KTU Santaka Valley, Kaunas, Lithuania
lukas.stankevicius@ktu.edu (L. Stankevičius); mantas.lukosevicius@ktu.edu (M. Lukoševičius)
© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).

Abstract
The recent introduction of the Transformer deep learning architecture made breakthroughs in various natural language processing tasks. However, non-English languages could not leverage such new opportunities with models pre-trained on English text. This changed with research focusing on multilingual models, where less-spoken languages are the main beneficiaries. We compare pre-trained multilingual BERT, XLM-R, and older learned text representation methods as encodings for the task of Lithuanian news clustering. Our results indicate that publicly available pre-trained multilingual Transformer models can be fine-tuned to surpass word vectors but still score much lower than specially trained doc2vec embeddings.

Keywords
Document clustering, document embedding, Lithuanian news articles, Transformer model, BERT, XLM-R, multilingual

1. Introduction

The appearance of the novel Transformer deep learning architecture [1] sparked rapid research progress in the Natural Language Processing (NLP) field. Table 1 clearly depicts how quickly models reached human performance on popular NLP evaluation datasets. In less than two years after publication, the SQuAD2.0 and GLUE datasets [2, 3] had human performance outmatched. Currently every top-scoring model is of the Transformer architecture. The situation was not changed by the newer SuperGLUE [4] task set, which has been left with only a tiny gap to human performance. These datasets were among the most popular for evaluating new Transformer models and showed the effectiveness of this new architecture.

There is a need to create NLP models for less-spoken languages. Apart from being less popular than English or Chinese, less-spoken languages also have less content to train the models on. Just the top 10 out of the 6 000 languages in use today make up 76.3 % of the total content on the internet¹. Such a situation encourages not only pursuing the creation of NLP models for other languages but also looking for ways to transfer knowledge from the content-rich language models.

The most common way to satisfy this need for less-spoken languages is to pre-train multilingual models. Examples are LASER [5] (93 languages), multilingual BERT [6] (104 languages), and XLM-R [7] (100 languages). The authors of XLM [8] showed that training a Nepali language model on Wikipedia together with additional data from both English and Hindi decreased the perplexity on Nepali to 109.3, compared to a perplexity of 157 when training on Nepali alone. Transfer learning and zero-shot translation between language pairs never seen explicitly during training were shown to be possible in [9]. Overall, multilingual models can cover many languages, be trained without any cross-lingual supervision, and use the bigger languages to benefit the smaller ones.

The Lithuanian language does not yet have a BERT-scale monolingual NLP model. It is spoken by relatively few people in the world. However, as a national language of one of the European Union member states, Lithuanian is usually included in most of the pre-trained multilingual models.

The aim of this work is to use such Transformer-type models to generate text embeddings and evaluate them on clustering of Lithuanian news articles. Specifically, we will use well-known baselines – multilingual BERT and the recently published XLM-R, trained on more than two terabytes of filtered CommonCrawl data. We chose the clustering task also to try to advance the field of data mining. The surge of information, particularly news data, demands tools that help users to "analyze and digest information and facilitate decision making" [10]. Unlike classification, clustering is universal in that it can handle unknown categories [11, 12]. Therefore it is well suited for the quickly changing news articles data.

¹ https://www.internetworldstats.com/stats7.htm

Table 1
Difficulty of the most popular NLP evaluation datasets

Name            Year   Initial   Current   Human    Score type
RACE [13]       2017   44.1      89.4      94.5     Accuracy
SQuAD2.0 [2]    2018   66.3      92.58     89.542   F1
GLUE [3]        2019   70.0      90.3      87.1     Average
SuperGLUE [4]   2019   71.5      89.3      89.8     Average
2. Literature Review

In this section we review the first two consecutive phases of common natural language processing (NLP) tasks: text preprocessing and text representation [10]. These stages have recently been the subject of the most active research and culminated in the development of the Transformer architecture. We also examine relevant NLP contributions for the Lithuanian language and for our task of news clustering.

2.1. Text Preprocessing

Text preprocessing involves the selection of features that will bear the understanding of the text. The most elementary approach is tokenization into simple characters or words. The finer the tokenization, the smaller the resulting vocabulary and the more challenging the task given to the NLP model. On the other hand, coarser tokens drastically increase the vocabulary size and induce other problems such as sparseness. The middle ground is statistically significant n-grams of both words and characters. Examples of this type of tokenizers are SentencePiece [14], BPE [15], and WordPiece [16]. They are often used in the state-of-the-art (SOTA) Transformer models and are shipped together with the publicly available pre-trained models. This way the manual tokenization step is skipped.

There are a number of methods to filter word-level tokens. These include lowercasing, stemming, lemmatization, and filtering by maximum and minimum document frequencies (ignoring tokens that are too rare or too common throughout the documents). However, it was shown in [17] that such filtering benefits only the classical text representation approaches such as tf-idf, while the shallow neural network model doc2vec [18] benefited from not using any such filtering.
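To make the word-level filtering concrete, the following sketch builds a bag-of-words vocabulary with lowercasing and document-frequency cut-offs. It is a minimal illustration using scikit-learn's CountVectorizer on a toy corpus; the thresholds and the corpus are placeholders, not the preprocessing settings used in this work.

```python
# A minimal sketch of word-level token filtering (lowercasing plus document-
# frequency cut-offs) with scikit-learn; illustrative settings only.
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "Vilnius is the capital of Lithuania",
    "Kaunas is the second largest city of Lithuania",
    "The basketball team of Lithuania won the game",
]

vectorizer = CountVectorizer(
    lowercase=True,  # fold case before counting
    min_df=2,        # drop tokens appearing in fewer than 2 documents (too rare)
    max_df=0.9,      # drop tokens appearing in over 90 % of documents (too common)
)
bow = vectorizer.fit_transform(corpus)     # sparse document-term count matrix
print(vectorizer.get_feature_names_out())  # surviving vocabulary
print(bow.toarray())                       # per-document token counts
```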
2.2. Text Representation

Although tokenized text remains meaningful to us, models still cannot operate on it directly. They need it in a numerical form. The preferable way is to derive a vector representation for each text sample. Cosine similarity is the simplest example of a model operating on (comparing) these embeddings. The classical approach to text representation uses the Bag of Words (BoW) model. As the name suggests, the order of tokens is lost here and each document is represented by the bare counts (histogram) of its tokens. Therefore token weighting such as tf-idf is involved. The higher the tf-idf weight, the more descriptive the token is for a given document. Given the number of occurrences of word w in a document d as tf_{w,d}, the number of documents containing word w as df_w, and the total number of documents N, tf-idf_{w,d} is given by

    tf-idf_{w,d} = tf_{w,d} · log(N / df_w).    (1)

BoW approaches suffer from several problems. The vector length for each document is the same as the size of the vocabulary. Typically, the vocabulary size is huge and this induces major memory constraints. The embedded vectors are also very sparse, as each document uses only a small subset of the vocabulary. Various methods, such as Latent Semantic Analysis (LSA) using Singular Value Decomposition (SVD), are employed to reduce the dimensionality. Nevertheless, SVD has to operate on the same high-dimensional documents × tokens matrix.

The work of [19] revolutionised word embedding calculation. Previous word embeddings, known as co-occurrence vectors, were superseded. They were calculated as direct probabilities of surrounding words in a context window of a given length. The new word2vec [19] algorithm uses the same training inputs, except that the goal is not to calculate the word distribution but to derive such embedding weights that the context words would be predicted with maximum accuracy. Such a setup significantly reduced the word vector size and eliminated the problems of high dimensionality and sparseness.
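As a concrete illustration of Eq. (1), the sketch below computes the weights from raw token counts of a toy tokenized corpus. It is illustrative only; library implementations such as scikit-learn's TfidfVectorizer apply additional smoothing and normalisation.

```python
# Illustrative implementation of Eq. (1): tf-idf_{w,d} = tf_{w,d} * log(N / df_w).
import math
from collections import Counter

documents = [
    ["lietuvos", "krepšinio", "rinktinė", "laimėjo"],
    ["lietuvos", "ekonomika", "augo"],
    ["rinkimai", "lietuvos", "seime"],
]

N = len(documents)
df = Counter()               # document frequency: in how many documents a word occurs
for doc in documents:
    df.update(set(doc))

def tf_idf(word, doc):
    tf = doc.count(word)     # raw term frequency in this document
    return tf * math.log(N / df[word])

for doc in documents:
    # words occurring in every document (here "lietuvos") get a weight of 0
    print({w: round(tf_idf(w, doc), 3) for w in set(doc)})
```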
Later, the next word embedding model, Global Vectors for word representation (GloVe), was presented [20]. It merged the advantages of both the matrix factorization and the shallow window-based methods. Currently it is the most used method for independent word vectors.

The original word2vec word vectors were extended to paragraph vectors by the doc2vec model [18]. Here each sequence of tokens has its own embedding in the same space as that of words. The CBOW and Skip-gram architectures from the original word2vec algorithm were applied to documents correspondingly as PV-DM and PV-DBOW. The Distributed Memory model of Paragraph Vectors (PV-DM) is tasked to predict the next context word given the previous context and the document vector, thus the vector has to sustain the memory of what is missing. The Distributed Bag of Words version of Paragraph Vectors (PV-DBOW) is forced to predict context words randomly, given only the document vector. The original work [18] used a concatenation of both models. However, in [17] it was found that PV-DBOW alone gave the best results.

The most recent text representation models produce contextualised token vectors. Models like ELMO [21] and the various Transformers have each of their inputs interact with the other ones. This leads to each token vector being aware of the others. Contextualisation solved the problem of word polysemy seen in word2vec or GloVe models.

The Transformer architecture presented in 2017 [1] surpasses other contextualised models for several reasons. Firstly, after only a single layer each input representation becomes aware of all the other ones. For Recurrent Neural Networks (RNN) like ELMO it takes n recurrent steps, where n is the sequence length. Another advantage over recurrent architectures is that the Transformer is very parallelizable. It does not need to wait for the hidden state of the previous word, as is the case with RNNs. This particular feature led to the creation of multi-billion-parameter Transformers such as GPT-2 [22], T5 [23], Megatron [24], and T-NLG². Despite the huge success of Transformer models, they cannot process long sequences, as their complexity per layer is O(n² · d), where d is the representation dimension. For example, the maximum input length of the popular BERT model is just 510 tokens. This and other problems of the Transformer architecture are currently researched very actively.

² https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/
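To illustrate where this quadratic cost comes from, the toy sketch below computes single-head scaled dot-product attention with NumPy: the n × n score matrix pairs every token with every other one, so after a single such layer every output row already depends on all inputs. This is an illustrative sketch, not the code of any particular Transformer implementation, and the dimensions are arbitrary.

```python
# Illustrative single-head scaled dot-product attention, showing the source of
# the O(n^2 * d) per-layer cost: the n x n attention score matrix.
import numpy as np

n, d = 6, 8                         # sequence length and representation size
rng = np.random.default_rng(0)
Q = rng.normal(size=(n, d))         # queries
K = rng.normal(size=(n, d))         # keys
V = rng.normal(size=(n, d))         # values

scores = Q @ K.T / np.sqrt(d)       # n x n: every token attends to every token
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)   # softmax over each row
contextualised = weights @ V        # each output row already "sees" all inputs

print(contextualised.shape)         # (n, d)
```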
2.3. Related Work on Lithuanian Language

There are several works on Lithuanian text clustering. [25] used internet daily newspaper articles from the Lrytas.lt news website and information from the largest internet forum for mothers – supermama.lt. The k-means and Expectation Maximization (EM) algorithms were compared on a BoW data representation. It was found that the optimal clustering algorithm is k-means with cosine similarity. Other work analysed unsupervised feature selection for document clustering [26]. The authors found that tf-idf weighting with 3 000 features and the spherical k-means clustering algorithm work best. A similar observation is expressed in [27]. Here k-means was compared to various hierarchical clustering algorithms and was not superseded. The authors found that tf-idf together with stemming is superior to other approaches.

[17] compared BoW and doc2vec (PV-DBOW) representations of Lithuanian news articles for document clustering. It was shown that the PV-DBOW representation, trained on the whole dataset, is superior to the BoW method. The authors also investigated various PV-DBOW hyperparameters and outlined recommendations for the training method weights.

There is also other NLP work on the Lithuanian language. [28] and [29] compared traditional and deep learning approaches for sentiment classification of Lithuanian internet comments. The authors demonstrated that the traditional Naïve Bayes Multinomial and Support Vector Machine methods outperformed LSTM and Convolutional neural networks. Other work [30] compared the CBOW and Skip-gram word embedding architectures and found the first one to be superior. We can add that in the current work we noticed a similar tendency for document vectors: in our initial experiments the equivalent version of PV-DBOW outperformed the PV-DMM architecture.

We have not found any previous work on the Lithuanian language using Transformer models.

3. The Data

We followed the methodology of [17] and expanded their dataset from 82 793 up to 260 146 articles. Although the average number of characters in our text samples is 2 948, several scraped articles were very small and resulted in empty vectors when averaging the GloVe vectors. For this reason we filtered out all articles with fewer than 200 characters, which resulted in a final dataset of 259 996 texts.

The data consist of Lithuanian news articles scraped from the lrt.lt, 15min.lt, and delfi.lt websites. The numbers of texts are correspondingly 26 344, 133 587, and 100 065. Due to the absence of a sitemap on the lrt.lt website, we did not scrape more articles from this site than were already scraped in [17].

Evaluation of clustering requires existing knowledge of the potential clusters. For this task we leveraged article category labels extracted from each article URL. Following the mappings of [27], the labels were unified from over a hundred categorical descriptions down to 12 distinct categories. The resulting categories of the articles are:

• Lithuanian news (60 158 articles);
• World news (68 635 articles);
• Crime (30 967 articles);
• Business (19 964 articles);
• Cars (6 313 articles);
• Sports (14 910 articles);
• Technologies (4 438 articles);
• Opinions (9 728 articles);
• Entertainment (2 462 articles);
• Life (3 811 articles);
• Culture (7 967 articles);
• Other (30 643 articles that do not fall into the previous categories).

During most of the experiments we employed a smaller subset of the dataset described above. We randomly sampled 125 news articles from each of the 12 categories. That results in a total of 1 500 articles equally distributed among the categories and corresponds to the data required for one clustering. We planned to make 50 independent clusterings to average the results and enhance their reproducibility. Thus we repeated the independent sampling of 1 500 equally distributed articles and used a subset of 55 487 unique articles (with repetitions it would be 75 000). During each experiment we calculated embedding vectors only for those 55 487 news articles.

4. Methods

4.1. Clustering

We use the k-means clustering algorithm. Due to its high speed, it is suitable for large corpora [25] and outperforms other clustering algorithms [27]. During experiments we feed the vectorized document representations and the expected number of clusters k to k-means and receive document assignments to clusters. We set each of the 50 k-means initialisations the same.

4.2. Evaluation

First, we calculate the following confusion matrix elements:

• TP – pairs of articles that have the same category label and are predicted to be in the same cluster;
• TN – pairs of articles that belong to different categories and are predicted to be in different clusters;
• FP – pairs of articles that belong to different categories but are predicted to be in the same cluster;
• FN – pairs of articles that have the same category label but are predicted to be in different clusters.

We chose to evaluate clusters by the Matthews Correlation Coefficient (MCC) score due to its reliability, as described in [31]. It is calculated as

    MCC = (TP · TN − FP · FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)).    (2)

The MCC score ranges from -1 to 1. Scores around the 0 value correspond to random clustering, while scores close to 1 indicate a perfect matching. For the sake of completeness, it must be said that there are other correlation coefficients that provide reliable evaluations [32].
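As a sketch of this pairwise evaluation, the code below counts TP, TN, FP, and FN over all article pairs and then applies Eq. (2). It is an illustrative implementation that assumes the category labels and cluster assignments are given as equal-length lists; it is not the exact evaluation script used in our experiments.

```python
# Illustrative pairwise clustering evaluation: count TP, TN, FP, FN over all
# article pairs and compute the Matthews Correlation Coefficient of Eq. (2).
from itertools import combinations
from math import sqrt

def pairwise_mcc(labels, clusters):
    tp = tn = fp = fn = 0
    for i, j in combinations(range(len(labels)), 2):
        same_label = labels[i] == labels[j]
        same_cluster = clusters[i] == clusters[j]
        if same_label and same_cluster:
            tp += 1
        elif not same_label and not same_cluster:
            tn += 1
        elif not same_label and same_cluster:
            fp += 1
        else:
            fn += 1
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# toy example: 6 articles, 2 true categories, 2 predicted clusters
print(pairwise_mcc(["sport"] * 3 + ["crime"] * 3, [0, 0, 1, 1, 1, 1]))
```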
4.3. PV-DBOW

This doc2vec version was trained on our whole dataset – a total of 259 996 Lithuanian news articles. We preprocessed the dataset by lowercasing and tokenizing it into words. The same vector size (100), number of epochs (10), window (12), and minimum count (4) parameters were used as in [17]. PV-DBOW returns a single embedding for each document, so no further aggregation is required.

4.4. GloVe

We performed our own text preprocessing during the experiments with GloVe [20] type Lithuanian word vectors [33]. It combined lowercasing and word-level tokenization. Out of the 1 028 816 unique tokens in our whole dataset, 311 470 were also present in the Lithuanian GloVe vectors. This intersection of unique tokens amounts to 30 % of our tokens and up to 94 % of the GloVe vocabulary. We tried several ways of aggregating the GloVe vectors:

• calculating an average of all the word vectors in the article;
• weighting all tokens with tf-idf and calculating an average of the 20 word vectors with the highest weight;
• weighting all tokens in an article with Softmax(tf-idf) and calculating a weighted average of all the word vectors in the article (see the sketch below).
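A sketch of the last aggregation strategy follows. The names word_vectors (a token-to-GloVe-vector mapping) and tfidf (per-article tf-idf weights) are hypothetical placeholders for illustration; the actual data structures used in our experiments may differ.

```python
# A hedged sketch of the Softmax(tf-idf) weighted average of word vectors.
# `word_vectors` and `tfidf` are placeholder names, not our actual pipeline.
import numpy as np

def softmax_tfidf_average(tokens, word_vectors, tfidf):
    # keep only tokens that have both a vector and a tf-idf weight
    tokens = [t for t in tokens if t in word_vectors and t in tfidf]
    if not tokens:
        return None                              # article too short / out of vocabulary
    weights = np.array([tfidf[t] for t in tokens])
    weights = np.exp(weights - weights.max())    # numerically stable softmax
    weights /= weights.sum()
    vectors = np.stack([word_vectors[t] for t in tokens])
    return weights @ vectors                     # weighted average document vector

# toy usage with random 100-dimensional stand-ins for GloVe vectors
rng = np.random.default_rng(0)
vocab = {"seimas": rng.normal(size=100), "krepšinis": rng.normal(size=100)}
doc = ["seimas", "priėmė", "įstatymą", "krepšinis"]
print(softmax_tfidf_average(doc, vocab, {"seimas": 2.1, "krepšinis": 0.7}).shape)
```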
4.5. Multilingual BERT

BERT [6] outputs the same number of vectors as it is fed inputs. The first is a special [CLS] token, which is designed to be used in sentence-level tasks. The following are the text data tokens, ending with the last [SEP] token. Optionally, one can add a second [SEP] token in the middle of an input sequence to separate two text segments. In our experiments we input only one segment and always separately try the [CLS] token output vector and the average of all token vectors. Pre-trained models like multilingual BERT are supposed to be fine-tuned for the desirable task.

We performed Masked Language Modelling fine-tuning on half of our subset data – 27 743 news articles. We trained with a batch size of 4 for 5 epochs, totalling 68 505 steps for the uncased and 75 985 steps for the cased version of the pre-trained multilingual BERT model.

The maximum number of input tokens to BERT is 512, including the special tokens. Most articles are within the 512 token limit, but some are longer. We tried to estimate the effect of this constraint by feeding even fewer tokens and analyzing how the mean MCC score changes with a longer input sequence.

We carried out our experiments in the Google Colab³ environment. It offers 12 GB of RAM and GPU-accelerated machines, which allows an order of magnitude speed-up of the BERT model compared to a CPU.

³ https://colab.research.google.com/

4.6. XLM-R

XLM-R [7] is one of the recent multilingual language models, much bigger than multilingual BERT. It is trained on 2 terabytes of filtered text, of which 13.7 GB is Lithuanian. The huge size of this model limited our experiments. We only calculated outputs of the first 512 tokens for each news article. It took approximately 40 hours in total.
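The sketch below shows one way to obtain such a document vector with the Hugging Face transformers library: the input is truncated and the output token vectors are averaged, roughly mirroring the best setting reported later (uncased multilingual BERT, first 144 tokens). It is a hedged illustration rather than our exact experiment code; in particular, whether the special tokens are included in the average and which checkpoint is loaded are details that may differ.

```python
# A hedged sketch of mean-pooled document embeddings from multilingual BERT.
# Illustrative only; not the exact code used in the experiments.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-multilingual-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

def document_vector(text: str, max_tokens: int = 144) -> torch.Tensor:
    # truncate to the first `max_tokens` tokens (including special tokens)
    inputs = tokenizer(text, truncation=True, max_length=max_tokens,
                       return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    hidden = outputs.last_hidden_state[0]     # (sequence_length, 768)
    return hidden.mean(dim=0)                 # average of all token vectors

vec = document_vector("Lietuvos rinktinė laimėjo rungtynes Kaune.")
print(vec.shape)                              # torch.Size([768])
```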
5. Results

5.1. GloVe

Results with the aggregation of GloVe vectors are presented in Table 2. It is clearly seen that applying tf-idf weighting to select the best tokens to average can significantly surpass the simple average of all vectors.

Table 2
Methods of combining GloVe vectors

Vector aggregation                        MCC mean   MCC std
Average                                   0.203      0.016
Softmax(tf-idf) weighted average          0.264      0.017
Average of 20 highest tf-idf tokens       0.264      0.024

5.2. Multilingual BERT

The effect of multilingual BERT fine-tuning on Lithuanian news articles is depicted in Fig. 1. One can clearly see that (1) the fine-tuning improves the clustering results, (2) the average of all tokens is much better than only the [CLS] vector, and (3) the uncased model version outperforms the cased one.

Figure 1: MCC score dependence on multilingual BERT model type (cased or uncased), language modelling fine-tuning steps, and token vector aggregation method (average of all first 128 tokens or just the first [CLS] token). Markers show the mean MCC and the shaded areas show the standard deviation.

To our surprise, the best results were obtained by limiting the number of tokens to only the first 144 (see Fig. 2). This can be attributed to the more important information being at the beginning of a news article. We observed the same tendency with the XLM-R model.

Figure 2: MCC score dependence on the number of first tokens used and the token vector aggregation method (average of all selected tokens or just the first [CLS] token) for the multilingual uncased BERT model. Markers show the mean MCC and the shaded areas show the standard deviation.

5.3. The Best Models

We tried four different models to represent Lithuanian news articles. Did the Transformer models score better than PV-DBOW? As can be seen in Table 3, the best Transformer model managed to outperform the GloVe vectors. However, the PV-DBOW model is far ahead with a mean MCC score of 0.442.

Table 3
Comparison of methods

Text representation method                               MCC mean   MCC std
PV-DBOW                                                  0.442      0.028
Uncased fine-tuned BERT, average of first 144 tokens     0.322      0.020
Softmax(tf-idf) weighted average of GloVe vectors        0.264      0.017
XLM-R, average of first 288 tokens, total fed 512        0.251      0.016

6. Conclusions

In this work we compared multilingual BERT, XLM-R, GloVe, and PV-DBOW text representations for Lithuanian news clustering. For BERT we found that the average of only the first 144 token vectors outperforms longer aggregations or the [CLS] token vector. We observed that BERT fine-tuning with Lithuanian news articles also improves the results. The other pre-trained Transformer-type model, XLM-R, was too computationally expensive to optimize, and out of the four methods its initial representations scored the worst. Regarding GloVe vectors, we found that their best Softmax(tf-idf) embeddings (mean MCC score 0.264) are outperformed by BERT. Nevertheless, the best text representation method proved to be PV-DBOW with a mean MCC score of 0.442. Our work on generating representations for Lithuanian news clustering showed that multilingual pre-trained Transformers can be better than independent GloVe vectors but under-perform against the specially trained, simpler PV-DBOW.

The multilingual BERT MCC score kept rising until the last fine-tuning steps and it is not clear how large an improvement could be accomplished by training longer. Our future plan is to clarify this by using more data. We also plan to train a new monolingual BERT model specifically for the Lithuanian language. It would be interesting to know whether these resource-"hungry" approaches could surpass the score of the relatively simple PV-DBOW method.

References

[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[2] P. Rajpurkar, R. Jia, P. Liang, Know what you don't know: Unanswerable questions for SQuAD, arXiv preprint arXiv:1806.03822 (2018).
[3] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, S. R. Bowman, GLUE: A multi-task benchmark and analysis platform for natural language understanding, arXiv preprint arXiv:1804.07461 (2018).
[4] A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, S. Bowman, SuperGLUE: A stickier benchmark for general-purpose language understanding systems, in: Advances in Neural Information Processing Systems, 2019, pp. 3261–3275.
[5] M. Artetxe, H. Schwenk, Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond, Transactions of the Association for Computational Linguistics 7 (2019) 597–610.
[6] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[7] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, arXiv preprint arXiv:1911.02116 (2019).
[8] A. Conneau, G. Lample, Cross-lingual language model pretraining, in: Advances in Neural Information Processing Systems, 2019, pp. 7057–7067.
[9] M. Johnson, M. Schuster, Q. V. Le, M. Krikun, Y. Wu, Z. Chen, N. Thorat, F. Viégas, M. Wattenberg, G. Corrado, et al., Google's multilingual neural machine translation system: Enabling zero-shot translation, Transactions of the Association for Computational Linguistics 5 (2017) 339–351.
[10] C. C. Aggarwal, C. Zhai, Mining text data, Springer Science & Business Media, 2012.
[11] G. Capizzi, G. Lo Sciuto, C. Napoli, E. Tramontana, M. Wozniak, A novel neural networks-based texture image processing algorithm for orange defects classification, International Journal of Computer Science & Applications 13 (2016).
[12] M. Woźniak, D. Połap, R. K. Nowicki, C. Napoli, G. Pappalardo, E. Tramontana, Novel approach toward medical signals classifier, in: 2015 International Joint Conference on Neural Networks (IJCNN), IEEE, 2015, pp. 1–7.
[13] G. Lai, Q. Xie, H. Liu, Y. Yang, E. Hovy, RACE: Large-scale reading comprehension dataset from examinations, arXiv preprint arXiv:1704.04683 (2017).
[14] T. Kudo, J. Richardson, SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing, arXiv preprint arXiv:1808.06226 (2018).
[15] R. Sennrich, B. Haddow, A. Birch, Neural machine translation of rare words with subword units, arXiv preprint arXiv:1508.07909 (2015).
[16] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al., Google's neural machine translation system: Bridging the gap between human and machine translation, arXiv preprint arXiv:1609.08144 (2016).
[17] L. Stankevičius, Clustering of Lithuanian News Articles using Document Embeddings, Master's thesis, Kaunas University of Technology, 2019.
[18] Q. Le, T. Mikolov, Distributed representations of sentences and documents, in: International Conference on Machine Learning, 2014, pp. 1188–1196.
[19] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, arXiv preprint arXiv:1301.3781 (2013).
[20] J. Pennington, R. Socher, C. D. Manning, GloVe: Global vectors for word representation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.
[21] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, L. Zettlemoyer, Deep contextualized word representations, arXiv preprint arXiv:1802.05365 (2018).
[22] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, Language models are unsupervised multitask learners, OpenAI Blog 1 (2019) 9.
[23] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, arXiv preprint arXiv:1910.10683 (2019).
[24] M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, B. Catanzaro, Megatron-LM: Training multi-billion parameter language models using GPU model parallelism, arXiv preprint arXiv:1909.08053 (2019).
[25] G. Ciganaitė, A. Mackutė-Varoneckienė, T. Krilavičius, Text documents clustering, Informacinės technologijos: 19-oji tarpuniversitetinė tarptautinė magistrantų ir doktorantų konferencija "Informacinė visuomenė ir universitetinės studijos" (IVUS 2014): konferencijos pranešimų medžiaga (2014) 90–93.
[26] A. Mackutė-Varoneckienė, T. Krilavičius, Empirical study on unsupervised feature selection for document clustering, Human Language Technologies – the Baltic Perspective: Proceedings of the 6th International Conference, Baltic HLT 2014 (2014) 107–110.
[27] V. Pranckaitis, M. Lukoševičius, Clustering of Lithuanian news articles, in: CEUR Workshop Proceedings, 2017.
[28] J. Kapočiūtė-Dzikienė, R. Damaševičius, M. Wozniak, Sentiment analysis of Lithuanian texts using traditional and deep learning approaches, Computers 8 (2019) 1–16.
[29] J. Kapočiūtė-Dzikienė, R. Damaševičius, M. Woźniak, Sentiment analysis of Lithuanian texts using deep learning methods, Information and Software Technologies: 24th International Conference, ICIST 2018, Vilnius, Lithuania, October 4–6, 2018: Proceedings (2018) 521–532.
[30] J. Kapočiūtė-Dzikienė, R. Damaševičius, Intrinsic evaluation of Lithuanian word embeddings using WordNet, Artificial Intelligence and Algorithms in Intelligent Systems: Proceedings of the 7th Computer Science On-line Conference 2018 (2018) 394–404.
[31] D. Chicco, G. Jurman, The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation, BMC Genomics 21 (2020) 6.
[32] F. Beritelli, G. Capizzi, G. Lo Sciuto, C. Napoli, M. Woźniak, A novel training method to preserve generalization of RBPNN classifiers applied to ECG signals diagnosis, Neural Networks 108 (2018) 331–338.
[33] A. Bielinskienė, L. Boizou, I. Bumbulienė, J. Kovalevskaitė, T. Krilavičius, J. Mandravickaitė, E. Rimkutė, L. Vilkaitė-Lozdienė, Lithuanian word embeddings, 2019. URL: http://hdl.handle.net/20.500.11821/26, CLARIN-LT digital library in the Republic of Lithuania.