Evaluation of Vector Transformations for Russian Static and Contextualized Embeddings

Olga Korogodina 1, Vladimir Koulichenko 1, Olesya Karpik 2 and Eduard Klyshinsky 1

1 National Research University Higher School of Economics, Myasnitskaya 20, Moscow, 101000, Russia
2 Keldysh Institute of Applied Mathematics, Miusskaya sq., 4, Moscow, 125047, Russia

Abstract
The authors of Word2Vec claimed that their technology could solve the word analogy problem using vector transformations in the introduced vector space. In principle, the same should hold for both static and contextualized models. However, practice demonstrates that such an approach sometimes fails. In this paper, we investigate several static and contextualized models trained for the Russian language and examine the reasons for this inconsistency. We found that words of different categories demonstrate different behavior in the semantic space. Contextualized models tend to find phonological and lexical analogies, while static models are better at finding relations among geographical proper names. In most cases, the average accuracy for contextualized models is higher than for static ones. Our experiments have demonstrated that in some cases the lengths of the vectors can differ by more than a factor of two, while for some categories most of the vectors can be perpendicular to the vector connecting the average beginning and ending points.

Keywords
Word Embeddings, Vector Space, Vector Transformation, Word Analogies

1. Introduction

Vector models, originally introduced in paper [1] in 2003, boosted progress in the NLP field. The main idea of embeddings is the generation of fixed-size vectors from statistical information about word contexts by means of a neural network. This concept was developed in [2], where the authors demonstrated that such pre-trained vectors could be useful for solving various natural language processing problems.
The real revolution was made by the Word2Vec model, introduced in 2013 [3-5], which is based on the distributional hypothesis. The Word2Vec model also uses neural networks, reinforced by several new ideas. First of all, the new approach [4] had lower computational complexity than previous systems. The next article [5] improved its training speed and accuracy. The greatest contribution of the authors was the publication of the source code and pre-trained language models for free use.

The FastText model, introduced in 2017 [6], addresses some drawbacks of Word2Vec using the following ideas. Prefixes and suffixes of words carry semantic information, as do word roots. In this case, the meaning of a word can be composed of the meanings of its parts. By dividing a word into character n-grams, the system collects more information about the same n-gram from the contexts of different words. Thus, the FastText model does not need lemmatization and can use relatively smaller corpora to achieve the same outcome.

Both the Word2Vec and FastText models have a big drawback: they use all words from the context of a considered word. However, the considered word can have several meanings; often, every meaning of a word has its own set of contexts, which intersect only slightly or not at all. Thus, when training a model, one should try to separate those meanings into different vectors.

GraphiCon 2021: 31st International Conference on Computer Graphics and Vision, September 27-30, 2021, Nizhny Novgorod, Russia
EMAIL: eklyshinsky@hse.ru (E. Klyshinsky); parlak@mail.ru (O. Karpik)
ORCID: 0000-0003-3601-4677 (O. Korogodina); 0000-0003-3256-8955 (V. Koulichenko); 0000-0002-0477-1502 (O. Karpik); 0000-0002-4020-488X (E. Klyshinsky)
©️ 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)

Such drawbacks were corrected by the ELMo (Embeddings from Language Models) [7] and BERT (Bidirectional Encoder Representations from Transformers) [8] methods, presented in 2018. ELMo is based on bidirectional LSTM language models, while BERT uses a Transformer network with several self-attention layers. Unlike static models such as Word2Vec and FastText, which always return the same vector, the contextualized models, BERT and ELMo, return a vector according to the meaning of the word in the given context. This creates some difficulties, since a fuzzy comparison of vectors rather than a check for equality must be conducted; however, in most cases contextualized vectors lead to higher performance. The paper [9] provides a good overview and description of vector contextualization.

The authors of [3-5] argued that the new semantic space allows vector arithmetic. Their example, "Queen = King - man + woman", quickly became famous. However, it soon became clear that such operations do not always succeed. One way to test this concept is the word analogy problem. Early experiments demonstrated that another favorite example, countries and their capitals, did not work correctly for every example. The accuracy of this analogy was quite high, but not high enough to state that vector arithmetic worked properly. Thus, the word analogy task does not work as well as it could. Although embeddings allow finding correct semantic neighbors for a given word, which makes them a crucial part of modern natural language processing systems, the authors of [10] demonstrated that results can vary dramatically depending on the task and model used. As demonstrated in [11], the quality of word analogy solutions depends on the considered category and the pre-trained static model.
The authors of [12] investigate the word analogy problem as a reflection task. They demonstrated that for the same analogy there can be several mirrors, each responsible for its own region of the considered semantic space. The reason is that different word groups can express the same analogy differently, i.e. the transition vectors for these groups will differ as well. For example, gender differences in a regular family and in a royal family have different connotations, which in turn differ from professional gender variations.

The aim of this paper is to repeat the experiments of [11] for the contextualized BERT and ELMo models, compare and generalize the results for static and contextualized models, and find out whether contextualized models provide any advantages in comparison with static models.

The rest of the paper is organized as follows. In Section 2, we state the problem of word analogies as a vector transformation problem. Sections 3 and 4 describe the data set used and the numerical evaluation of language models for the Russian language. Section 5 analyzes the achieved results and compares static and contextualized models. Section 6 concludes the article.

2. Formal Statement of the Problem of Word Analogies

In general terms, the main question of the word analogy problem can be stated as "Is there a word c which relates to the word b as the word a' relates to the word a?" To answer this question, semantic embeddings use a vector representation of these words. Let v_a' and v_a be the vectors corresponding to the words a' and a, respectively; in this case, the vector difference v_a' - v_a expresses the semantic relation (in other words, the semantic difference) between the words a and a'. Thus, in order to find an analogue, we should find the word x and its corresponding vector y such that y - v_b = v_a' - v_a, or

y = v_b + v_a' - v_a.    (1)

However, the probability that a word with exactly the vector y exists is extremely small.
That is why we instead find the vector y' closest to the vector y:

y' = argmax_{v ∉ {v_a, v_a', v_b}} cos(v, v_b + v_a' - v_a).    (2)

We can reformulate the question for word groups. Let us consider a set of word pairs (w11:w12), (w21:w22), ..., (wN1:wN2) that have the same semantic or lexical relation, and their corresponding vectors v11, v12, v21, v22, ..., vN1, vN2. In this case, the task of word analogies can be formulated as follows: find a vector x that makes an affine transformation of v11, v21, ..., vN1 into v12, v22, ..., vN2, i.e. such that

v_i2 = argmax_{v'} cos(v', v_i1 + x).    (3)

Let us denote the fact that the word a relates to the word b in the same sense as the word c relates to the word d by the following notation: (a:b) :: (c:d). For example, (apple:fruit) :: (cucumber:vegetable), (apple:apples) :: (cucumber:cucumbers), and, classically, (king:queen) :: (man:woman). In this case, a request to find an analogy looks like (king:?) :: (man:woman) or (man:woman) :: (king:?).

The task of word analogies is very sensitive to noise in the input data. A word can be homonymous; this means that it should be represented as two or more separate vectors reflecting its different meanings. In the case of Word2Vec, such a word is represented by a single vector that is a superposition of all its meanings. Moreover, the resulting vectors of similar entities can reflect differences in their co-occurrence with other words. For example, a dog and a cow are both animals, but a dog is a carnivore and a human's friend, while a cow is an herbivore and gives milk; thus, the analogy is not complete here. Contextualized models improve on static ones by using the context to infer the actual sense of a word and its resulting vector. However, the vectors for the same word in slightly different contexts will not be equal.
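As an illustration, the search in Eq. (2) can be sketched in a few lines of numpy. This is a minimal sketch assuming the vocabulary is available as a plain word-to-vector dictionary; the toy vectors below are hypothetical and chosen so that the analogy holds exactly.

```python
import numpy as np

def three_cos_add(emb, a, a_prime, b):
    """Solve (a : a_prime) :: (b : ?) per Eq. (2): return the word whose
    vector is most cosine-similar to v_b + v_a' - v_a, excluding the
    three query words themselves."""
    target = emb[b] + emb[a_prime] - emb[a]
    target = target / np.linalg.norm(target)
    best_word, best_cos = None, -2.0
    for word, vec in emb.items():
        if word in (a, a_prime, b):
            continue  # Eq. (2) excludes v_a, v_a', v_b from the search
        cos = vec @ target / np.linalg.norm(vec)
        if cos > best_cos:
            best_word, best_cos = word, cos
    return best_word

# Hypothetical toy vectors in which the analogy holds exactly.
emb = {
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([1.0, 1.0]),
    "king":  np.array([3.0, 0.0]),
    "queen": np.array([3.0, 1.0]),
    "apple": np.array([0.0, -2.0]),  # distractor word
}
print(three_cos_add(emb, "man", "woman", "king"))  # -> queen
```

Real vocabularies contain hundreds of thousands of words, so in practice this linear scan is replaced by a single matrix-vector product over a normalized embedding matrix; the scoring rule is the same.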
In order to eliminate such influence, the authors of the 3CosAvg method [13] introduce a new formula that takes into account not just a pair of words but two whole groups having the same analogy:

y' = argmax_{v ∈ V \ {v_b}} cos(v, v_b + (1/n) Σ_{i=1..n} v_a'_i - (1/n) Σ_{i=1..n} v_a_i).    (4)

As shown in [14, 15], the Only-b, Ignore-a, and PairDirection methods give unsatisfactory results. Unlike [10], we will use the 3CosAvg method only, since [11] demonstrated that this method provides more robust results, less dependent on biases in the input vectors and their polysemy. We have not tested the X2Static method [16], which calculates the average vector over all embeddings returned by a contextualized model, because of its novelty. However, we believe that it should not dramatically increase the performance of the word analogy method, since the 3CosAvg method averages such vectors in the same way.

3. Used Data Sets

For static embeddings, we used several pre-trained models from the RusVectōrēs site (http://rusvectores.org/ru/models/): Araneum Upos Skipgram 2018, Ruwikiruscorpora Upos Skipgram 2018, Ruwikiruscorpora Upos Skipgram 2019, Tayga Upos Skipgram 2019, News Upos Skipgram 2019, Ruscorpora Upos CBOW 2019, Araneum Fasttext Skipgram 2018, and Facebook FastText CBOW 2018 [17, 18]. The first six models were trained using Word2Vec, the last two using FastText. For contextualized embeddings, we used RuBERT 2019 [19], Sentence RuBERT 2019, Conversational RuBERT 2019, ELMo Ru Wiki 2019, ELMo Ru News 2019, and ELMo Ru Tw 2019. These models were selected because they tokenize text by words rather than by subword units, as some newer models do. Note that contextualized models need a context; thus, it is not possible to pass merely a word to acquire a resulting vector. That is why we used a collection of news wire texts. Every contextualized vector was calculated as the mean of the vectors obtained from manually selected texts.
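Under the same assumption of a plain word-to-vector dictionary (for contextualized models it would hold the averaged vectors just described), the 3CosAvg scoring of Eq. (4) can be sketched as follows; the toy capital-country vectors are hypothetical.

```python
import numpy as np

def three_cos_avg(emb, pairs, b):
    """Eq. (4): score candidates against v_b plus the averaged offset of a
    whole group of analogy pairs (a_i : a'_i), excluding only v_b itself."""
    offset = (np.mean([emb[ap] for _, ap in pairs], axis=0)
              - np.mean([emb[a] for a, _ in pairs], axis=0))
    target = emb[b] + offset
    target = target / np.linalg.norm(target)
    best_word, best_cos = None, -2.0
    for word, vec in emb.items():
        if word == b:
            continue  # Eq. (4) excludes only v_b from the search
        cos = vec @ target / np.linalg.norm(vec)
        if cos > best_cos:
            best_word, best_cos = word, cos
    return best_word

# Hypothetical toy vectors: each country sits one unit above its capital.
emb = {
    "athens": np.array([1.0, 0.0]), "greece": np.array([1.0, 1.0]),
    "oslo":   np.array([2.0, 0.0]), "norway": np.array([2.0, 1.0]),
    "paris":  np.array([3.0, 0.0]), "france": np.array([3.0, 1.0]),
}
pairs = [("athens", "greece"), ("oslo", "norway")]
print(three_cos_avg(emb, pairs, "paris"))  # -> france
```

Averaging the group offsets cancels the per-pair noise that makes the single-pair rule of Eq. (2) fragile, which is why 3CosAvg is the method used throughout this paper.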
The set of selected examples could influence the resulting vector but could not misrepresent the whole picture.

For semantic analogies, we used the Russian versions of the Google analogy test set [4] and BATS (The Bigger Analogy Test Set) [13]. These data sets were translated and extended by a human expert. For grammatical analogies, we used a morphological dictionary of the Russian language. The list of used categories is presented in Table 1.

Table 1
Used semantic and grammatical categories

Category                                           | ID  | Example                      | Number of Pairs
Famous capital → Country                           | A1  | Афины Греция                 | 23
All capitals → Country                             | A2  | Канберра Австралия           | 115
Country → Currency                                 | A3  | Ангола кванза                | 30
Country → Adjective                                | A4  | Австралия австралийский      | 41
Country → Language                                 | A5  | Аргентина испанский          | 36
Masculine → Feminine                               | A6  | наследник наследница         | 67
Singular → Plural                                  | A7  | улыбка улыбки                | 100
Antonyms with не- (non-, ir-)                      | A8  | определенный неопределенный  | 27
Adjective → Adverb                                 | A9  | спокойный спокойно           | 30
Positive Adjective → Comparative Adjective         | A10 | яркий ярче                   | 24
Verb → Corresponding Noun with -ация (-ation)      | A11 | консультировать консультация | 55
Verb → Corresponding Noun with -ение (-ment, -ion) | A12 | назначать назначение         | 55
Verb → Corresponding Noun with -тель (-er, -or)    | A13 | слушать слушатель            | 56
Verb → Reflexive Verb                              | A14 | откопаю откопаюсь            | 400
Verb → Verb with при-                              | A15 | вязать привязать             | 376

4. Evaluation

For evaluation purposes, we calculated the accuracy metric for all categories as well as for all language models. Figs. 1 and 2 demonstrate the results for the static and contextualized models, respectively, calculated by the 3CosAvg method. Dark blue shows results with higher accuracy, up to 1; light blue shows results close to zero. The results in Fig. 1 were taken from [11]. Note that for contextualized models we have not calculated the last two categories, since they demonstrated low accuracy for static models. Overall, contextualized models demonstrate better accuracy than static ones.
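The per-category accuracy behind this comparison can be sketched as follows. One detail is our assumption: the evaluated pair itself is kept in the averaged offset, as Eq. (4) suggests, rather than held out; the toy vectors are again hypothetical.

```python
import numpy as np

def category_accuracy(emb, pairs):
    """Fraction of pairs (a, a') for which 3CosAvg, using the averaged
    offset of all pairs in the category, ranks a' first among all
    vocabulary words except a itself."""
    normed = {w: v / np.linalg.norm(v) for w, v in emb.items()}
    offset = (np.mean([emb[ap] for _, ap in pairs], axis=0)
              - np.mean([emb[a] for a, _ in pairs], axis=0))
    hits = 0
    for a, a_prime in pairs:
        target = emb[a] + offset
        target = target / np.linalg.norm(target)
        # predicted analogue = highest-cosine word other than the query
        pred = max((w for w in normed if w != a),
                   key=lambda w: normed[w] @ target)
        hits += (pred == a_prime)
    return hits / len(pairs)

# Hypothetical toy category in which every offset is identical.
emb = {
    "athens": np.array([1.0, 0.0]), "greece": np.array([1.0, 1.0]),
    "oslo":   np.array([2.0, 0.0]), "norway": np.array([2.0, 1.0]),
    "paris":  np.array([3.0, 0.0]), "france": np.array([3.0, 1.0]),
}
pairs = [("athens", "greece"), ("oslo", "norway"), ("paris", "france")]
print(category_accuracy(emb, pairs))  # -> 1.0
```

On real embeddings the candidate set is the whole vocabulary, so accuracy drops whenever a near-synonym or a morphological variant outranks the expected answer.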
For static models, there are only five model-category combinations whose accuracy exceeds 0.9, while for contextualized ones about 40% of the combinations surpass this threshold. There are only three categories where static models beat contextualized ones: Capital → Country for famous (A1) and all (A2) countries, and Country → Language (A5). Note that both types of models show low accuracy for the A5 category. The same is true for the category A3, Country → Currency.

We found that Sentence RuBERT exceeds the other BERT models in every category of our task. The performance of the ELMo models depends on the category. For grammatical categories, ELMo Ru Twitter surpasses the other models, while for information about countries ELMo Ru Wiki wins in most cases.

We can hypothesize that the words of such categories as Capital → Country, Country → Language, and Country → Currency have several meanings. For example, the name of the capital of a well-known country can be used as the name of the city proper (an accident in Moscow) and as a synonym of the corresponding country (Moscow flexes its muscles again). There are also several countries which share the same language or currency. Such variety makes analysis more difficult for contextualized models, which have fewer contexts for each separate meaning.

It is easy to see that models trained on news wire or Wikipedia texts demonstrate better results for the country-related categories. Though RuBERT was trained on Wiki and news texts, it demonstrates worse performance for the country categories than the other BERT models. However, the other models are fine-tuned versions of the RuBERT model; thus, they had extra contexts to learn from.

Figure 1: Accuracy by the 3CosAvg method for static models [11]
Figure 2: Accuracy by the 3CosAvg method for contextualized models

5. Data Analysis

In order to find out the reasons for success and failure, we conducted a visual analysis of the resulting vectors.
First of all, we projected the 300-dimensional vectors into 2D space using Principal Component Analysis (PCA). Unlike t-SNE and UMAP, PCA does not create areas with non-linear skew; as a result, parallel vectors keep their parallelism. On the other hand, there is a chance that two arbitrary vectors placed on parallel planes become parallel in the projection; however, such distortions of the data are not very critical. For our experiments, we used two static language models which were trained on the same Araneum corpus: Araneum Upos Skipgram 2018 and Araneum Fasttext Skipgram 2018. For contextualized models, we used RuBERT and ELMo Ru News, which are not the best solutions for the selected category A1, Famous capital → Country. Figs. 3 and 4 present five randomly selected word pairs for these four models.

The main reason for low accuracy in the word analogy task is the bias between the vectors. It is easy to see that the Word2Vec vectors in Fig. 3 are mostly parallel, except for a slight bias for Оттава-Канада (Ottawa-Canada). The lengths of the vectors are also almost the same. Averaging over the start and end points of the vectors helps to adjust these small biases in the Word2Vec model. However, the FastText model is oriented mostly toward word parts; that is why its vectors are almost randomly oriented and have no preferential direction. Contextualized models show almost the same picture (Fig. 4): the vectors are mostly parallel for the ELMo model, but for the RuBERT model this is not true.

Since PCA introduces some distortions and lets us analyze a projection instead of the raw data, we decided to draw the distribution of angles between the vectors. Figs. 5 and 6 demonstrate cosine distances for vectors belonging to the FastText and ELMo Ru Twitter models. We selected the most representative categories, which show two different situations; however, in both cases the results were unsatisfactory.
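The data behind such histograms can be reproduced from the pair offsets themselves; the sketch below assumes the vectors are available as numpy arrays and uses a deliberately inconsistent hypothetical toy category with two perpendicular offsets.

```python
import numpy as np

def offset_cosines(emb, pairs):
    """For each pair (a, a'), the cosine between its offset v_a' - v_a
    and the category's average offset: the quantity whose distribution
    is shown in the histograms."""
    offsets = np.array([emb[ap] - emb[a] for a, ap in pairs])
    avg = offsets.mean(axis=0)
    avg = avg / np.linalg.norm(avg)
    return offsets @ avg / np.linalg.norm(offsets, axis=1)

# Hypothetical toy category with two mutually perpendicular offsets.
emb = {
    "a1": np.array([0.0, 0.0]), "b1": np.array([0.0, 1.0]),
    "a2": np.array([1.0, 0.0]), "b2": np.array([2.0, 0.0]),
}
cosines = offset_cosines(emb, [("a1", "b1"), ("a2", "b2")])
print(cosines)  # both cosines fall to about 0.71
```

Feeding such cosines into a histogram (e.g. matplotlib's `hist`) gives plots of the kind discussed next: a consistent category piles mass near 1, while scattered or opposite offsets spread it toward 0 and below.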
In the left figure, all vectors are non-parallel; the cosine of the angle between the vectors is less than 0.75. A worse situation is drawn in the right figure: some vectors have opposite directions, and their cosine reaches -0.75.

Figure 3: Word2Vec and FastText vectors for the category A1, Famous capital → Country
Figure 4: ELMo Ru News and RuBERT vectors for the category A1, Famous capital → Country
Figure 5: Histogram of cosine similarity between the average vector and the vectors of categories A14 and A15, Araneum FastText model
Figure 6: Histogram of cosine similarity between the average vector and the vectors of categories A5 and A2, ELMo Ru Twitter model

Another reason for such drawbacks in the results is the length of the vectors. Even if the vectors' directions are the same, the word analogy task will fail if their lengths differ significantly. Thus, we visualized both the vector length and its direction relative to the average transition vector; the start and end points of the transition vector were calculated as averages of the start and end points of the original vectors (Fig. 7). We present here only a few of the most representative figures for the Araneum FastText static model which demonstrate the reasons for failures. Pink dots represent correctly resolved analogies, while blue dots represent the opposite situation.

Figure 7: Relation between vector length and its angle with the average vector (panels a-d)

Fig. 7a represents the case when half of the vectors are parallel and have the same length, while the other half have a different length and a different direction, in most cases perpendicular to the average vector. This situation corresponds to an accuracy of 0.03, since there are several vectors parallel to the average vector. Fig. 7b represents the situation when all the vectors are mostly parallel but have different lengths (accuracy = 0.53). Fig.
7c corresponds to the situation when the vectors have almost the same length but are perpendicular (accuracy = 0.02). Finally, Fig. 7d represents perpendicular vectors of different lengths (accuracy = 0.71).

We examined the fluctuations seen in Figs. 6 and 7 but did not find any regularities in the word sets. Probably, the resulting noise can be explained by the fact that several semantic clusters are joined into one. Our results correspond to the idea of parameterized reflections described in [12], but we did not investigate this hypothesis. However, Figs. 6 and 7 show some clusters which can be considered candidates for the application of such reflection planes. In any case, the main idea of [12], that there are several dedicated directions for some categories, seems reasonable in the light of our experiments. Moreover, it is true not only for static but also for contextualized models, though not for all of them, since different models reflect different spatial situations.

6. Discussion and Conclusion

In this paper, we found several reasons why the vector transformation does not work for some categories of word analogies.

1. The language model used should be trained on texts that contain enough occurrences of the words for which the word analogy task is solved. This is true for both static and contextualized models, but static models are more affected by the selected text corpus. On the other hand, contextualized models need larger corpora to achieve better accuracy in more polysemous categories. As we can see in Figs. 1 and 2, categories that include countries, their capitals, and other related words are better analyzed using corpora such as Wikipedia and news wire, since such corpora contain enough information to infer the logical relations among these words.

2. Following [10], we can state that the quality of results depends on the model at hand. Figs. 1 and 2 demonstrate that the result depends on the corpus used for model training and on the selected domain.
However, contextualized models have a great advantage: they can be fine-tuned on the analyzed corpus. As a result, the quality of the solution should increase.

3. In the general case, the word analogy task cannot be solved using a random pair of words taken from the two investigated categories. For example, Moscow and Berlin are respected representatives of their countries in international or cultural affairs; these cities are used as synonyms of the Russian and German governments and cultures. However, Bogota and Kampala name the capital city rather than the government. Likewise, there will be several preferential directions for different prefix or suffix values. If someone misuses just one word in a pair, he or she will get a wrong transition vector. Averaging the start and end points helps to eliminate this problem, but this method has some drawbacks. One should have a list of words in a given category to average their vectors; this is not always possible. On the other hand, sometimes such a list is available, and one could use it to tune the vectors in order to predict out-of-list words. But one should first be sure that this list does not contain several preferred semantic directions.

4. The main idea of an affine transformation is that there is one vector that can be added to the word a to find its analogue b. This means that all vectors for a:b should be equal, i.e. have approximately equal lengths and orientation angles. At the very least, the bias between this transition vector and the correct answer vector should be less than half the distance to the nearest neighbor. As we found out, this is not always true. In the case of homonymous prefixes and suffixes, some vector groups can even be oppositely directed. This means that the word analogy task cannot be solved using only one transition vector. Our experiments have demonstrated that in some cases the lengths of the vectors can differ by more than a factor of two.
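A quick numeric check for this criterion, our own sketch rather than a method from the paper, could report the spread of offset lengths and directions before one commits to a single transition vector. The hypothetical toy category below has consistent directions but lengths differing by a factor of 2.5.

```python
import numpy as np

def analogy_consistency(emb, pairs):
    """Return (max/min offset length ratio, minimum cosine between an
    offset and the average offset). Values near 1 suggest a single
    transition vector may work for the category; a large length ratio
    or low cosines suggest it will not."""
    offsets = np.array([emb[b] - emb[a] for a, b in pairs])
    lengths = np.linalg.norm(offsets, axis=1)
    avg = offsets.mean(axis=0)
    avg = avg / np.linalg.norm(avg)
    cosines = offsets @ avg / lengths
    return lengths.max() / lengths.min(), cosines.min()

# Hypothetical toy category: parallel offsets of very different length.
emb = {
    "a1": np.array([0.0, 0.0]), "b1": np.array([0.0, 1.0]),
    "a2": np.array([1.0, 0.0]), "b2": np.array([1.0, 2.5]),
}
ratio, min_cos = analogy_consistency(emb, [("a1", "b1"), ("a2", "b2")])
print(ratio, min_cos)  # -> 2.5 1.0
```

In this toy category the offsets point the same way, but their lengths differ by 2.5 times: precisely the kind of bias just described.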
Such biases lead to a situation where the software module must generate several outputs, and the user has to apply some extra method to find the correct answer.

5. Contextualized models provide a more robust solution in the case of rich mono-thematic (or general-purpose) corpora. Note that some specific tasks are better solved with static models, but the best default solution is to use contextualized models.

Our method of analyzing affine transformations of embedding vectors can be used for exploratory analysis of a domain. Before using the method of vector analogies, a researcher should check whether such analogies can be successfully applied to the selected domain or with the selected language model, whether pre-trained or trained on domain texts. This will help eliminate some mistakes and evaluate further results in advance.

7. References

[1] Y. Bengio, R. Ducharme, P. Vincent, P.A. Jauvin, A Neural Probabilistic Language Model, Journal of Machine Learning Research 3 (2003) 1137-1155.
[2] R. Collobert, J. Weston, A unified architecture for natural language processing, in: Proceedings of the 25th International Conference on Machine Learning, 2008, vol. 20, pp. 160-167.
[3] T. Mikolov, W.-T. Yih, G. Zweig, Linguistic Regularities in Continuous Space Word Representations, in: Proc. of HLT-NAACL, 2013, pp. 746-751.
[4] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, in: Proc. of International Conference on Learning Representations (ICLR), 2013.
[5] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean, Distributed Representations of Words and Phrases and their Compositionality, in: Proc. of 27th Annual Conference on Neural Information Processing Systems, 2013, pp. 3111-3119.
[6] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information, Transactions of the Association for Computational Linguistics 5 (2017) 135-146.
[7] M.E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, L.
Zettlemoyer, Deep contextualized word representations, arXiv:1802.05365, URL: https://arxiv.org/pdf/1802.05365.pdf.
[8] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, arXiv:1810.04805, URL: https://arxiv.org/pdf/1810.04805.pdf.
[9] K. Ethayarajh, How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings, arXiv:1909.00512v1, URL: https://arxiv.org/pdf/1909.00512.pdf.
[10] B. Wang, A. Wang, F. Chen, Y. Wang, J. Kou, Evaluating word embedding models: methods and experimental results, APSIPA Transactions on Signal and Information Processing 8 (2019). doi: 10.1017/ATSIP.2019.12
[11] O. Korogodina, O. Karpik, E. Klyshinsky, Evaluation of Vector Transformations for Russian Word2Vec and FastText Embeddings, in: Proc. of Graphicon 2020. doi: 10.51130/graphicon-2020-2-3-18
[12] Y. Ishibashi, K. Sudoh, K. Yoshino, S. Nakamura, Reflection-based Word Attribute Transfer, arXiv:2007.02598v2, URL: https://arxiv.org/pdf/2007.02598.pdf.
[13] A. Drozd, A. Gladkova, S. Matsuoka, Word Embeddings, Analogies, and Machine Learning: Beyond King - Man + Woman = Queen, in: Proc. of COLING 2016, pp. 3519-3530.
[14] O. Levy, Y. Goldberg, Linguistic Regularities in Sparse and Explicit Word Representations, in: Proc. of 18th Conference on Computational Natural Language Learning, 2014, pp. 171-180. doi: 10.3115/v1/W14-1618
[15] T. Linzen, Issues in evaluating semantic spaces using word analogies, in: Proc. of 1st Workshop on Evaluating Vector-Space Representations for NLP, 2016, pp. 13-18.
[16] P. Gupta, M. Jaggi, Obtaining Better Static Word Embeddings Using Contextual Embedding Models, arXiv:2106.04302v1, URL: https://arxiv.org/pdf/2106.04302.pdf.
[17] A. Kutuzov, E. Kuzmenko, WebVectors: A Toolkit for Building Web Interfaces for Vector Semantic Models, in: Analysis of Images, Social Networks and Texts (AIST), 2016, vol. 661, pp. 155-161.
doi: 10.1007/978-3-319-52920-2_15
[18] E. Grave, P. Bojanowski, P. Gupta, A. Joulin, T. Mikolov, Learning Word Vectors for 157 Languages, in: Proc. of LREC 2018, 2018, pp. 3483-3487.
[19] Y. Kuratov, M. Arkhipov, Adaptation of Deep Bidirectional Multilingual Transformers for Russian Language, arXiv:1905.07213v1, URL: https://arxiv.org/pdf/1905.07213.pdf.