Testing Pre-trained Transformer Models for Lithuanian News Clustering

Lukas Stankevičius, Mantas Lukoševičius
Faculty of Informatics, Kaunas University of Technology, Kaunas, Lithuania

IVUS 2020: Information Society and University Studies, 23 April 2020, KTU Santaka Valley, Kaunas, Lithuania
lukas.stankevicius@ktu.edu (L. Stankevičius); mantas.lukosevicius@ktu.edu (M. Lukoševičius)
© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).

Abstract
The recent introduction of the Transformer deep learning architecture made breakthroughs in various natural language processing tasks. However, non-English languages could not leverage such new opportunities with models pre-trained on English text. This changed with research focusing on multilingual models, where less-spoken languages are the main beneficiaries. We compare pre-trained multilingual BERT, XLM-R, and older learned text representation methods as encodings for the task of Lithuanian news clustering. Our results indicate that publicly available pre-trained multilingual Transformer models can be fine-tuned to surpass word vectors but still score much lower than specially trained doc2vec embeddings.

Keywords
Document clustering, document embedding, Lithuanian news articles, Transformer model, BERT, XLM-R, multilingual

1. Introduction

The appearance of the novel Transformer deep learning architecture [1] sparked rapid research progress in the Natural Language Processing (NLP) field. Table 1 clearly depicts how quickly models reached human performance on popular NLP evaluation datasets. In less than two years after publication, the SQuAD2.0 and GLUE datasets [2, 3] had human performance outmatched. Currently every top-scoring model is of the Transformer architecture. The situation was not changed by the newer SuperGLUE [4] task set, which has been left with only a tiny gap to human performance. These datasets were among the most popular for evaluating new Transformer models and showed the effectiveness of this new architecture.

There is a need to create NLP models for less-spoken languages. Apart from being less popular than English or Chinese, less-spoken languages also have less content to train the models on. Just the top 10 out of the 6 000 languages in use today make up 76.3 % of the total content on the internet¹. Such a situation encourages not only pursuing the creation of NLP models for other languages but also looking for ways to transfer knowledge from the content-rich language models.

The most common way to satisfy this need for less-spoken languages is to pre-train multilingual models. Examples are LASER [5] (93 languages), multilingual BERT [6] (104 languages), and XLM-R [7] (100 languages). The authors of XLM [8] showed that training a Nepali language model on Wikipedia together with additional data from both English and Hindi decreased the perplexity on Nepali to 109.3, compared to a perplexity of 157 when training on Nepali alone. Transfer learning and zero-shot translation between language pairs never seen explicitly during training were shown to be possible in [9]. Overall, multilingual models can cover many languages, be trained without any cross-lingual supervision, and use the bigger languages to benefit the smaller ones.

The Lithuanian language does not yet have a BERT-scale monolingual NLP model. It is spoken by relatively few people in the world. However, as a national language of one of the European Union member states, Lithuanian is usually included in most of the pre-trained multilingual models.

The aim of this work is to use such Transformer-type models to generate text embeddings and evaluate them on clustering of Lithuanian news articles. Specifically, we will use well-known baselines – multilingual BERT and the recently published XLM-R, trained on more than two terabytes of filtered CommonCrawl data. We chose the clustering task also to try to advance the field of data mining. The surge of information, particularly news data, demands tools that help users to "analyze and digest information and facilitate decision making" [10]. Unlike classification, clustering is universal in that it can handle unknown categories [11, 12]. Therefore it is well suited for the quickly changing news articles data.

¹ https://www.internetworldstats.com/stats7.htm

Table 1
Difficulty of the most popular NLP evaluation datasets

Name            Year   Initial   Current   Human    Score type
RACE [13]       2017   44.1      89.4      94.5     Accuracy
SQuAD2.0 [2]    2018   66.3      92.58     89.542   F1
GLUE [3]        2019   70.0      90.3      87.1     Average
SuperGLUE [4]   2019   71.5      89.3      89.8     Average
2. Literature Review

In this section we review the first two consecutive phases of common natural language processing (NLP) tasks: text preprocessing and text representation [10]. These stages have recently been the subject of the most active research and culminated in the development of the Transformer architecture. We also examine relevant NLP contributions for the Lithuanian language and for our task of news clustering.

2.1. Text Preprocessing

Text preprocessing involves the selection of features that will bear the understanding of the text. The most elementary approach is tokenization into simple characters or words. The finer the tokenization, the smaller the resulting vocabulary and the more challenging the task given to the NLP model. On the other hand, coarser tokens drastically increase the vocabulary size and induce other problems such as sparseness. The middle ground is statistically significant n-grams of both words and characters. Examples of this type of tokenizers are SentencePiece [14], BPE [15], and WordPiece [16]. They are often used in the state-of-the-art (SOTA) Transformer models and are shipped together with the publicly available pre-trained models. This way the manual tokenization step is skipped.

There are a number of methods to filter word-level tokens. These include lowercasing, stemming, lemmatization, and filtering by maximum and minimum document frequencies (ignoring tokens that are too rare or too common throughout the documents). However, it was shown in [17] that such filtering benefits only the classical text representation approaches such as tf-idf, while the shallow neural network model doc2vec [18] benefited from not using any such filtering.
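To make the word-level filtering concrete, the following sketch builds a bag-of-words vocabulary with lowercasing and document-frequency cut-offs. It is a minimal illustration using scikit-learn's CountVectorizer on a toy corpus; the thresholds and the corpus are placeholders, not the preprocessing settings used in this work.

```python
# A minimal sketch of word-level token filtering (lowercasing plus document-
# frequency cut-offs) with scikit-learn; illustrative settings only.
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "Vilnius is the capital of Lithuania",
    "Kaunas is the second largest city of Lithuania",
    "The basketball team of Lithuania won the game",
]

vectorizer = CountVectorizer(
    lowercase=True,  # fold case before counting
    min_df=2,        # drop tokens appearing in fewer than 2 documents (too rare)
    max_df=0.9,      # drop tokens appearing in over 90 % of documents (too common)
)
bow = vectorizer.fit_transform(corpus)     # sparse document-term count matrix
print(vectorizer.get_feature_names_out())  # surviving vocabulary
print(bow.toarray())                       # per-document token counts
```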
2.2. Text Representation

Although tokenized text remains meaningful to us, models still cannot operate on it directly. They need it in a numerical form. The preferable way is to derive a vector representation for each text sample. Cosine similarity is the simplest example of a model operating on (comparing) these embeddings. The classical approach to text representation uses the Bag of Words (BoW) model. As the name suggests, the order of tokens is lost here and each document is represented by the bare counts (histogram) of its tokens. Therefore token weighting such as tf-idf is involved. The higher the tf-idf weight, the more descriptive the token is for a given document. Given the number of occurrences of word w in a document d as tf_{w,d}, the number of documents containing word w as df_w, and the total number of documents N, tf-idf_{w,d} is given by

    tf-idf_{w,d} = tf_{w,d} · log(N / df_w).    (1)

BoW approaches suffer from several problems. The vector length for each document is the same as the size of the vocabulary. Typically, the vocabulary size is huge and this induces major memory constraints. The embedded vectors are also very sparse, as each document uses only a small subset of the vocabulary. Various methods, such as Latent Semantic Analysis (LSA) using Singular Value Decomposition (SVD), are employed to reduce the dimensionality. Nevertheless, SVD has to operate on the same high-dimensional documents × tokens matrix.

The work of [19] revolutionised word embedding calculation. Previous word embeddings, known as co-occurrence vectors, were superseded. They were calculated as direct probabilities of surrounding words in a context window of a given length. The new word2vec [19] algorithm uses the same training inputs, except that the goal is not to calculate the word distribution but to derive such embedding weights that the context words would be predicted with maximum accuracy. Such a setup significantly reduced the word vector size and eliminated the problems of high dimensionality and sparseness.
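As a concrete illustration of Eq. (1), the sketch below computes the weights from raw token counts of a toy tokenized corpus. It is illustrative only; library implementations such as scikit-learn's TfidfVectorizer apply additional smoothing and normalisation.

```python
# Illustrative implementation of Eq. (1): tf-idf_{w,d} = tf_{w,d} * log(N / df_w).
import math
from collections import Counter

documents = [
    ["lietuvos", "krepšinio", "rinktinė", "laimėjo"],
    ["lietuvos", "ekonomika", "augo"],
    ["rinkimai", "lietuvos", "seime"],
]

N = len(documents)
df = Counter()               # document frequency: in how many documents a word occurs
for doc in documents:
    df.update(set(doc))

def tf_idf(word, doc):
    tf = doc.count(word)     # raw term frequency in this document
    return tf * math.log(N / df[word])

for doc in documents:
    # words occurring in every document (here "lietuvos") get a weight of 0
    print({w: round(tf_idf(w, doc), 3) for w in set(doc)})
```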
Later, the next word embedding model, Global Vectors for word representation (GloVe), was presented [20]. It merged the advantages of both the matrix factorization and the shallow window-based methods. Currently it is the most used method for independent word vectors.

The original word2vec word vectors were extended to paragraph vectors by the doc2vec model [18]. Here each sequence of tokens has its own embedding in the same space as that of words. The CBOW and Skip-gram architectures from the original word2vec algorithm were applied to documents correspondingly as PV-DM and PV-DBOW. The Distributed Memory model of Paragraph Vectors (PV-DM) is tasked to predict the next context word given the previous context and the document vector, thus the vector has to sustain the memory of what is missing. The Distributed Bag of Words version of Paragraph Vectors (PV-DBOW) is forced to predict context words randomly, given only the document vector. The original work [18] used a concatenation of both models. However, in [17] it was found that PV-DBOW alone gave the best results.

The most recent text representation models produce contextualised token vectors. Models like ELMO [21] and the various Transformers have each of their inputs interact with the other ones. This leads to each token vector being aware of the others. Contextualisation solved the problem of word polysemy seen in word2vec or GloVe models.

The Transformer architecture presented in 2017 [1] surpasses other contextualised models for several reasons. Firstly, after only a single layer each input representation becomes aware of all the other ones. For Recurrent Neural Networks (RNN) like ELMO it takes n recurrent steps, where n is the sequence length. Another advantage over recurrent architectures is that the Transformer is very parallelizable. It does not need to wait for the hidden state of the previous word, as is the case with RNNs. This particular feature led to the creation of multi-billion-parameter Transformers such as GPT-2 [22], T5 [23], Megatron [24], and T-NLG². Despite the huge success of Transformer models, they cannot process long sequences, as their complexity per layer is O(n² · d), where d is the representation dimension. For example, the maximum input length of the popular BERT model is just 510 tokens. This and other problems of the Transformer architecture are currently researched very actively.

² https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/
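To illustrate where this quadratic cost comes from, the toy sketch below computes single-head scaled dot-product attention with NumPy: the n × n score matrix pairs every token with every other one, so after a single such layer every output row already depends on all inputs. This is an illustrative sketch, not the code of any particular Transformer implementation, and the dimensions are arbitrary.

```python
# Illustrative single-head scaled dot-product attention, showing the source of
# the O(n^2 * d) per-layer cost: the n x n attention score matrix.
import numpy as np

n, d = 6, 8                         # sequence length and representation size
rng = np.random.default_rng(0)
Q = rng.normal(size=(n, d))         # queries
K = rng.normal(size=(n, d))         # keys
V = rng.normal(size=(n, d))         # values

scores = Q @ K.T / np.sqrt(d)       # n x n: every token attends to every token
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)   # softmax over each row
contextualised = weights @ V        # each output row already "sees" all inputs

print(contextualised.shape)         # (n, d)
```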
2.3. Related Work on Lithuanian Language

There are several works on Lithuanian text clustering. [25] used internet daily newspaper articles from the Lrytas.lt news website and information from the largest internet forum for mothers – supermama.lt. The k-means and Expectation Maximization (EM) algorithms were compared on a BoW data representation. It was found that the optimal clustering algorithm is k-means with cosine similarity. Other work analysed unsupervised feature selection for document clustering [26]. The authors found that tf-idf weighting with 3 000 features and the spherical k-means clustering algorithm work best. A similar observation is expressed in [27]. Here k-means was compared to various hierarchical clustering algorithms and was not superseded. The authors found that tf-idf together with stemming is superior to other approaches.

[17] compared BoW and doc2vec (PV-DBOW) representations of Lithuanian news articles for document clustering. It was shown that the PV-DBOW representation, trained on the whole dataset, is superior to the BoW method. The authors also investigated various PV-DBOW hyperparameters and outlined recommendations for the training method weights.

There is also other NLP work on the Lithuanian language. [28] and [29] compared traditional and deep learning approaches for sentiment classification of Lithuanian internet comments. The authors demonstrated that the traditional Naïve Bayes Multinomial and Support Vector Machine methods outperformed LSTM and Convolutional neural networks. Other work [30] compared the CBOW and Skip-gram word embedding architectures and found the first one to be superior. We can add that in the current work we noticed a similar tendency for document vectors: in our initial experiments the equivalent version of PV-DBOW outperformed the PV-DMM architecture.

We have not found any previous work on the Lithuanian language using Transformer models.

3. The Data

We followed the methodology of [17] and expanded their dataset from 82 793 up to 260 146 articles. Although the average number of characters in our text samples is 2 948, several scraped articles were very small and resulted in empty vectors when averaging the GloVe vectors. For this reason we filtered out all articles with fewer than 200 characters, which resulted in a final dataset of 259 996 texts.

The data consist of Lithuanian news articles scraped from the lrt.lt, 15min.lt, and delfi.lt websites. The numbers of texts are correspondingly 26 344, 133 587, and 100 065. Due to the absence of a sitemap on the lrt.lt website, we did not scrape more articles from this site than were already scraped in [17].

Evaluation of clustering requires existing knowledge of the potential clusters. For this task we leveraged article category labels extracted from each article URL. Following the mappings of [27], the labels were unified from over a hundred categorical descriptions down to 12 distinct categories. The resulting categories of the articles are:

• Lithuanian news (60 158 articles);
• World news (68 635 articles);
• Crime (30 967 articles);
• Business (19 964 articles);
• Cars (6 313 articles);
• Sports (14 910 articles);
• Technologies (4 438 articles);
• Opinions (9 728 articles);
• Entertainment (2 462 articles);
• Life (3 811 articles);
• Culture (7 967 articles);
• Other (30 643 articles that do not fall into the previous categories).

During most of the experiments we employed a smaller subset of the dataset described above. We randomly sampled 125 news articles from each of the 12 categories. That results in a total of 1 500 articles equally distributed among the categories and corresponds to the data required for one clustering. We planned to make 50 independent clusterings to average the results and enhance their reproducibility. Thus we repeated the independent sampling of 1 500 equally distributed articles and used a subset of 55 487 unique articles (with repetitions it would be 75 000). During each experiment we calculated embedding vectors only for those 55 487 news articles.

4. Methods

4.1. Clustering

We use the k-means clustering algorithm. Due to its high speed, it is suitable for large corpora [25] and outperforms other clustering algorithms [27]. During experiments we feed the vectorized document representations and the expected number of clusters k to k-means and receive document assignments to clusters. We set each of the 50 k-means initialisations the same.

4.2. Evaluation

First, we calculate the following confusion matrix elements:

• TP – pairs of articles that have the same category label and are predicted to be in the same cluster;
• TN – pairs of articles that belong to different categories and are predicted to be in different clusters;
• FP – pairs of articles that belong to different categories but are predicted to be in the same cluster;
• FN – pairs of articles that have the same category label but are predicted to be in different clusters.

We chose to evaluate clusters by the Matthews Correlation Coefficient (MCC) score due to its reliability, as described in [31]. It is calculated as

    MCC = (TP · TN − FP · FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)).    (2)

The MCC score ranges from -1 to 1. Scores around the 0 value correspond to random clustering, while scores close to 1 indicate a perfect matching. For the sake of completeness, it must be said that there are other correlation coefficients that provide reliable evaluations [32].
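As a sketch of this pairwise evaluation, the code below counts TP, TN, FP, and FN over all article pairs and then applies Eq. (2). It is an illustrative implementation that assumes the category labels and cluster assignments are given as equal-length lists; it is not the exact evaluation script used in our experiments.

```python
# Illustrative pairwise clustering evaluation: count TP, TN, FP, FN over all
# article pairs and compute the Matthews Correlation Coefficient of Eq. (2).
from itertools import combinations
from math import sqrt

def pairwise_mcc(labels, clusters):
    tp = tn = fp = fn = 0
    for i, j in combinations(range(len(labels)), 2):
        same_label = labels[i] == labels[j]
        same_cluster = clusters[i] == clusters[j]
        if same_label and same_cluster:
            tp += 1
        elif not same_label and not same_cluster:
            tn += 1
        elif not same_label and same_cluster:
            fp += 1
        else:
            fn += 1
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# toy example: 6 articles, 2 true categories, 2 predicted clusters
print(pairwise_mcc(["sport"] * 3 + ["crime"] * 3, [0, 0, 1, 1, 1, 1]))
```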
4.3. PV-DBOW

This doc2vec version was trained on our whole dataset – a total of 259 996 Lithuanian news articles. We preprocessed the dataset by lowercasing and tokenizing it into words. The same vector size (100), number of epochs (10), window (12), and minimum count (4) parameters were used as in [17]. PV-DBOW returns a single embedding for each document, so no further aggregation is required.

4.4. GloVe

We performed our own text preprocessing during the experiments with GloVe [20] type Lithuanian word vectors [33]. It combined lowercasing and word-level tokenization. Out of the 1 028 816 unique tokens in our whole dataset, 311 470 were also present in the Lithuanian GloVe vectors. This intersection of unique tokens amounts to 30 % of our tokens and up to 94 % of the GloVe vocabulary. We tried several ways of aggregating the GloVe vectors:

• calculating an average of all the word vectors in the article;
• weighting all tokens with tf-idf and calculating an average of the 20 word vectors with the highest weight;
• weighting all tokens in an article with Softmax(tf-idf) and calculating a weighted average of all the word vectors in the article (see the sketch below).
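A sketch of the last aggregation strategy follows. The names word_vectors (a token-to-GloVe-vector mapping) and tfidf (per-article tf-idf weights) are hypothetical placeholders for illustration; the actual data structures used in our experiments may differ.

```python
# A hedged sketch of the Softmax(tf-idf) weighted average of word vectors.
# `word_vectors` and `tfidf` are placeholder names, not our actual pipeline.
import numpy as np

def softmax_tfidf_average(tokens, word_vectors, tfidf):
    # keep only tokens that have both a vector and a tf-idf weight
    tokens = [t for t in tokens if t in word_vectors and t in tfidf]
    if not tokens:
        return None                              # article too short / out of vocabulary
    weights = np.array([tfidf[t] for t in tokens])
    weights = np.exp(weights - weights.max())    # numerically stable softmax
    weights /= weights.sum()
    vectors = np.stack([word_vectors[t] for t in tokens])
    return weights @ vectors                     # weighted average document vector

# toy usage with random 100-dimensional stand-ins for GloVe vectors
rng = np.random.default_rng(0)
vocab = {"seimas": rng.normal(size=100), "krepšinis": rng.normal(size=100)}
doc = ["seimas", "priėmė", "įstatymą", "krepšinis"]
print(softmax_tfidf_average(doc, vocab, {"seimas": 2.1, "krepšinis": 0.7}).shape)
```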
4.5. Multilingual BERT

BERT [6] outputs the same number of vectors as it is fed inputs. The first is a special [CLS] token, which is designed to be used in sentence-level tasks. The following are the text data tokens, ending with the last [SEP] token. Optionally, one can add a second [SEP] token in the middle of an input sequence to separate two text segments. In our experiments we input only one segment and always separately try the [CLS] token output vector and the average of all token vectors. Pre-trained models like multilingual BERT are supposed to be fine-tuned for the desirable task.

We performed Masked Language Modelling fine-tuning on half of our subset data – 27 743 news articles. We trained with a batch size of 4 for 5 epochs, totalling 68 505 steps for the uncased and 75 985 steps for the cased version of the pre-trained multilingual BERT model.

The maximum number of input tokens to BERT is 512, including the special tokens. Most articles are within the 512 token limit, but some are longer. We tried to estimate the effect of this constraint by feeding even fewer tokens and analyzing how the mean MCC score changes with a longer input sequence.

We carried out our experiments in the Google Colab³ environment. It offers 12 GB of RAM and GPU-accelerated machines, which allows an order of magnitude speed-up of the BERT model compared to a CPU.

³ https://colab.research.google.com/

4.6. XLM-R

XLM-R [7] is one of the recent multilingual language models, much bigger than multilingual BERT. It is trained on 2 terabytes of filtered text, of which 13.7 GB is Lithuanian. The huge size of this model limited our experiments. We only calculated outputs of the first 512 tokens for each news article. It took approximately 40 hours in total.
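The sketch below shows one way to obtain such a document vector with the Hugging Face transformers library: the input is truncated and the output token vectors are averaged, roughly mirroring the best setting reported later (uncased multilingual BERT, first 144 tokens). It is a hedged illustration rather than our exact experiment code; in particular, whether the special tokens are included in the average and which checkpoint is loaded are details that may differ.

```python
# A hedged sketch of mean-pooled document embeddings from multilingual BERT.
# Illustrative only; not the exact code used in the experiments.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-multilingual-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

def document_vector(text: str, max_tokens: int = 144) -> torch.Tensor:
    # truncate to the first `max_tokens` tokens (including special tokens)
    inputs = tokenizer(text, truncation=True, max_length=max_tokens,
                       return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    hidden = outputs.last_hidden_state[0]     # (sequence_length, 768)
    return hidden.mean(dim=0)                 # average of all token vectors

vec = document_vector("Lietuvos rinktinė laimėjo rungtynes Kaune.")
print(vec.shape)                              # torch.Size([768])
```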
5. Results

5.1. GloVe

Results with the aggregation of GloVe vectors are presented in Table 2. It is clearly seen that applying tf-idf weighting to select the best tokens to average can significantly surpass the simple average of all vectors.

Table 2
Methods of combining GloVe vectors

Vector aggregation                        MCC mean   MCC std
Average                                   0.203      0.016
Softmax(tf-idf) weighted average          0.264      0.017
Average of 20 highest tf-idf tokens       0.264      0.024

5.2. Multilingual BERT

The effect of multilingual BERT fine-tuning on Lithuanian news articles is depicted in Fig. 1. One can clearly see that (1) the fine-tuning improves the clustering results, (2) the average of all tokens is much better than only the [CLS] vector, and (3) the uncased model version outperforms the cased one.

Figure 1: MCC score dependence on multilingual BERT model type (cased or uncased), language modelling fine-tuning steps, and token vector aggregation method (average of all first 128 tokens or just the first [CLS] token). Markers show the mean MCC and the shaded areas show the standard deviation.

To our surprise, the best results were obtained by limiting the number of tokens to only the first 144 (see Fig. 2). This can be attributed to the more important information being at the beginning of a news article. We observed the same tendency with the XLM-R model.

Figure 2: MCC score dependence on the number of first tokens used and the token vector aggregation method (average of all selected tokens or just the first [CLS] token) for the multilingual uncased BERT model. Markers show the mean MCC and the shaded areas show the standard deviation.

5.3. The Best Models

We tried four different models to represent Lithuanian news articles. Did the Transformer models score better than PV-DBOW? As can be seen in Table 3, the best Transformer model managed to outperform the GloVe vectors. However, the PV-DBOW model is far ahead with a mean MCC score of 0.442.

Table 3
Comparison of methods

Text representation method                               MCC mean   MCC std
PV-DBOW                                                  0.442      0.028
Uncased fine-tuned BERT, average of first 144 tokens     0.322      0.020
Softmax(tf-idf) weighted average of GloVe vectors        0.264      0.017
XLM-R, average of first 288 tokens, total fed 512        0.251      0.016

6. Conclusions

In this work we compared multilingual BERT, XLM-R, GloVe, and PV-DBOW text representations for Lithuanian news clustering. For BERT we found that the average of only the first 144 token vectors outperforms longer aggregations or the [CLS] token vector. We observed that BERT fine-tuning with Lithuanian news articles also improves the results. The other pre-trained Transformer-type model, XLM-R, was too computationally expensive to optimize, and out of the four methods its initial representations scored the worst. Regarding GloVe vectors, we found that their best Softmax(tf-idf) embeddings (mean MCC score 0.264) are outperformed by BERT. Nevertheless, the best text representation method proved to be PV-DBOW with a mean MCC score of 0.442. Our work on generating representations for Lithuanian news clustering showed that multilingual pre-trained Transformers can be better than independent GloVe vectors but under-perform against the specially trained, simpler PV-DBOW.

The multilingual BERT MCC score kept rising until the last fine-tuning steps and it is not clear how large an improvement could be accomplished by training longer. Our future plan is to clarify this by using more data. We also plan to train a new monolingual BERT model specifically for the Lithuanian language. It would be interesting to know whether these resource-"hungry" approaches could surpass the score of the relatively simple PV-DBOW method.

References

[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[2] P. Rajpurkar, R. Jia, P. Liang, Know what you don't know: Unanswerable questions for SQuAD, arXiv preprint arXiv:1806.03822 (2018).
[3] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, S. R. Bowman, GLUE: A multi-task benchmark and analysis platform for natural language understanding, arXiv preprint arXiv:1804.07461 (2018).
[4] A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, S. Bowman, SuperGLUE: A stickier benchmark for general-purpose language understanding systems, in: Advances in Neural Information Processing Systems, 2019, pp. 3261–3275.
[5] M. Artetxe, H. Schwenk, Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond, Transactions of the Association for Computational Linguistics 7 (2019) 597–610.
[6] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[7] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, arXiv preprint arXiv:1911.02116 (2019).
[8] A. Conneau, G. Lample, Cross-lingual language model pretraining, in: Advances in Neural Information Processing Systems, 2019, pp. 7057–7067.
[9] M. Johnson, M. Schuster, Q. V. Le, M. Krikun, Y. Wu, Z. Chen, N. Thorat, F. Viégas, M. Wattenberg, G. Corrado, et al., Google's multilingual neural machine translation system: Enabling zero-shot translation, Transactions of the Association for Computational Linguistics 5 (2017) 339–351.
[10] C. C. Aggarwal, C. Zhai, Mining text data, Springer Science & Business Media, 2012.
[11] G. Capizzi, G. Lo Sciuto, C. Napoli, E. Tramontana, M. Wozniak, A novel neural networks-based texture image processing algorithm for orange defects classification, International Journal of Computer Science & Applications 13 (2016).
[12] M. Woźniak, D. Połap, R. K. Nowicki, C. Napoli, G. Pappalardo, E. Tramontana, Novel approach toward medical signals classifier, in: 2015 International Joint Conference on Neural Networks (IJCNN), IEEE, 2015, pp. 1–7.
[13] G. Lai, Q. Xie, H. Liu, Y. Yang, E. Hovy, RACE: Large-scale reading comprehension dataset from examinations, arXiv preprint arXiv:1704.04683 (2017).
[14] T. Kudo, J. Richardson, SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing, arXiv preprint arXiv:1808.06226 (2018).
[15] R. Sennrich, B. Haddow, A. Birch, Neural machine translation of rare words with subword units, arXiv preprint arXiv:1508.07909 (2015).
[16] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al., Google's neural machine translation system: Bridging the gap between human and machine translation, arXiv preprint arXiv:1609.08144 (2016).
[17] L. Stankevičius, Clustering of Lithuanian News Articles using Document Embeddings, Master's thesis, Kaunas University of Technology, 2019.
[18] Q. Le, T. Mikolov, Distributed representations of sentences and documents, in: International Conference on Machine Learning, 2014, pp. 1188–1196.
[19] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, arXiv preprint arXiv:1301.3781 (2013).
[20] J. Pennington, R. Socher, C. D. Manning, GloVe: Global vectors for word representation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.
[21] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, L. Zettlemoyer, Deep contextualized word representations, arXiv preprint arXiv:1802.05365 (2018).
[22] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, Language models are unsupervised multitask learners, OpenAI Blog 1 (2019) 9.
[23] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, arXiv preprint arXiv:1910.10683 (2019).
[24] M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, B. Catanzaro, Megatron-LM: Training multi-billion parameter language models using GPU model parallelism, arXiv preprint arXiv:1909.08053 (2019).
[25] G. Ciganaitė, A. Mackutė-Varoneckienė, T. Krilavičius, Text documents clustering, Informacinės technologijos: 19-oji tarpuniversitetinė tarptautinė magistrantų ir doktorantų konferencija "Informacinė visuomenė ir universitetinės studijos" (IVUS 2014): konferencijos pranešimų medžiaga (2014) 90–93.
[26] A. Mackutė-Varoneckienė, T. Krilavičius, Empirical study on unsupervised feature selection for document clustering, Human Language Technologies – the Baltic Perspective: Proceedings of the 6th International Conference, Baltic HLT 2014 (2014) 107–110.
[27] V. Pranckaitis, M. Lukoševičius, Clustering of Lithuanian news articles, in: CEUR Workshop Proceedings, 2017.
[28] J. Kapočiūtė-Dzikienė, R. Damaševičius, M. Wozniak, Sentiment analysis of Lithuanian texts using traditional and deep learning approaches, Computers 8 (2019) 1–16.
[29] J. Kapočiūtė-Dzikienė, R. Damaševičius, M. Woźniak, Sentiment analysis of Lithuanian texts using deep learning methods, Information and Software Technologies: 24th International Conference, ICIST 2018, Vilnius, Lithuania, October 4–6, 2018: Proceedings (2018) 521–532.
[30] J. Kapočiūtė-Dzikienė, R. Damaševičius, Intrinsic evaluation of Lithuanian word embeddings using WordNet, Artificial Intelligence and Algorithms in Intelligent Systems: Proceedings of the 7th Computer Science On-line Conference 2018 (2018) 394–404.
[31] D. Chicco, G. Jurman, The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation, BMC Genomics 21 (2020) 6.
[32] F. Beritelli, G. Capizzi, G. Lo Sciuto, C. Napoli, M. Woźniak, A novel training method to preserve generalization of RBPNN classifiers applied to ECG signals diagnosis, Neural Networks 108 (2018) 331–338.
[33] A. Bielinskienė, L. Boizou, I. Bumbulienė, J. Kovalevskaitė, T. Krilavičius, J. Mandravickaitė, E. Rimkutė, L. Vilkaitė-Lozdienė, Lithuanian word embeddings, 2019. URL: http://hdl.handle.net/20.500.11821/26, CLARIN-LT digital library in the Republic of Lithuania.