<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Intrinsic word embedding model evaluation for Lithuanian language using adapted similarity and relatedness benchmark datasets</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Mindaugas</forename><surname>Petkevičius</surname></persName>
							<email>mindaugas.petkevicius@vdu.lt</email>
							<affiliation key="aff1">
								<orgName type="institution">Vytautas Magnus University</orgName>
								<address>
									<addrLine>K. Donelaičio g. 58</addrLine>
									<postCode>44248</postCode>
									<settlement>Kaunas</settlement>
									<country key="LT">Lithuania</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Daiva</forename><surname>Vitkutė-Adžgauskienė</surname></persName>
							<affiliation key="aff1">
								<orgName type="institution">Vytautas Magnus University</orgName>
								<address>
									<addrLine>K. Donelaičio g. 58</addrLine>
									<postCode>44248</postCode>
									<settlement>Kaunas</settlement>
									<country key="LT">Lithuania</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff0">
								<orgName type="department" key="dep1">26th</orgName>
								<orgName type="department" key="dep2">International Conference Information Society</orgName>
								<orgName type="institution">University Studies -IVUS</orgName>
								<address>
									<postCode>2021</postCode>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Intrinsic word embedding model evaluation for Lithuanian language using adapted similarity and relatedness benchmark datasets</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">145B19725FE207063058768957826702</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T21:40+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Word embeddings</term>
					<term>evaluation</term>
					<term>Lithuanian language</term>
					<term>word2vec</term>
					<term>fastText</term>
					<term>GloVe</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Word embeddings are real-valued word representations capable of capturing lexical semantics and trained on natural language corpora. Word embedding models have gained popularity in recent years, but the issue of selecting the most adequate word embedding evaluation methods remains open. This paper presents research on the adaptation of the intrinsic similarity and relatedness task for the Lithuanian language and the evaluation of word embedding models, testing the quality of representations independently of specific natural language processing tasks. Seven evaluation benchmarks were adapted for the Lithuanian language, and 50 word embedding models were trained using the fastText, GloVe, and Word2vec algorithms and evaluated on syntactic and semantic similarity tasks. The obtained results suggest that for the intrinsic similarity and relatedness task, the dimension parameter has a significant impact on the evaluation results, with larger word embedding dimensions yielding better results.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>The development of natural language processing tools has driven a growing need for word embeddings as real-valued representations of words for text analytics, generated by applying distributional semantic models. While word embeddings have become one of the most widely used tools in modern natural language processing (NLP) applications, their limitations have not yet been fully explored. Assessing word embedding consistency and quality is one of the most relevant questions in distributional semantics research.</p><p>The idea of word embeddings is not new, but it gained popularity after Mikolov et al. <ref type="bibr" target="#b0">[1]</ref> presented the Word2vec model in 2013. The fastText model, developed by Facebook AI Research (FAIR), introduced embeddings that use subword information. The next big improvement came from Stanford with GloVe (Global Vectors) <ref type="bibr" target="#b1">[2]</ref>, based on word-word co-occurrence statistics in a corpus.</p><p>There are two types of word embedding evaluation: intrinsic and extrinsic. Intrinsic evaluation tests representation quality independently of specific natural language processing (NLP) tasks, while extrinsic evaluation uses word embeddings as input features to an NLP task and measures changes in the corresponding performance metrics. We focus on intrinsic evaluation methods based on human-annotated datasets, because such datasets can be adapted for different languages by translating the word pairs and re-evaluating the human-annotated scores.</p><p>The method of word semantic similarity, based on correlation with human judgments of how closely words are related, was one of the first intrinsic evaluation metrics for distributional meaning representations. 
According to this method, the words smart and intelligent should be closer in the vector space than smart and dumb, since smart and intelligent are intuitively more closely semantically related.</p><p>There are gold-standard benchmarks for evaluating distributional semantic models, such as SimLex999 <ref type="bibr" target="#b2">[3]</ref> and MEN <ref type="bibr" target="#b3">[4]</ref>, focused on semantic relatedness. These benchmarks consist of word pairs and their relative similarity scores. The similarity scores lie in the interval between 0 and 10; e.g., the score for the words book and paper is 7.46. In evaluation, these scores are compared with the cosine similarity of the word embedding vectors of each pair.</p><p>The word analogy method aims to identify words based on operation prediction in a word vector space. The method tries to predict a missing word in a word pair based on a known relationship in another word pair. Thus, for a dataset of pairs a-b, c-d, the task is to identify the unknown word d based on the known relationship between words a and b. For example, given the words a (brother), b (sister), and c (father), this method should correctly predict the value mother for word d <ref type="bibr" target="#b4">[5]</ref>. The Google analogy dataset <ref type="bibr" target="#b5">[6]</ref> and BATS <ref type="bibr" target="#b6">[7]</ref> are the most popular datasets. The Google test set has become the standard for word embedding analysis. BATS is a newer dataset that is much larger and more balanced.</p><p>The word clustering method evaluates a word embedding space by applying a word clustering approach. It aims to split a given word set into groups of words corresponding to different categories based on word vectors. For example, the words dog and cat belong to one cluster, while the words car and plane belong to another <ref type="bibr" target="#b7">[8]</ref>.</p><p>The situation with word embeddings for the Lithuanian language is influenced by its specifics. 
The Lithuanian language is a morphologically rich Baltic language, being considered one of the most archaic living Indo-European languages <ref type="bibr" target="#b8">[9]</ref>. It has a relatively large vocabulary, containing over 500 000 unique words <ref type="bibr" target="#b9">[10]</ref>. On the other hand, the Lithuanian language lacks textual resources due to the small size of the nation using it. Lithuanian Wikipedia, for example, has 199 567 articles, while better represented languages have over a million each <ref type="bibr" target="#b10">[11]</ref>. Several attempts were made to perform intrinsic and extrinsic evaluation of the Lithuanian language embeddings. However, so far there are no available semantic similarity benchmarks for this purpose.</p><p>The goal of this research was to adapt selected intrinsic similarity benchmarks for the Lithuanian language and to apply them for experimental evaluation of fastText, Word2vec, and GloVe embedding models with different hyperparameters.</p><p>In order to reach this goal, we perform the following tasks: related work analysis (Section 2), corpus building for embedding training (Section 3), methodology for the adaptation of evaluation benchmarks for the Lithuanian language (Section 4), experimental evaluation of different embeddings based on the derived benchmarks (Section 5), conclusions and future plans (Section 6).</p></div>
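The word analogy prediction described in the introduction can be sketched with plain vector arithmetic: predict d for a : b :: c : d by finding the vocabulary word closest to b − a + c. The 3-dimensional vectors below are invented for illustration only and are not from any trained model.

```python
# Toy sketch of the word-analogy method: the vectors are illustrative only.
import numpy as np

vocab = {
    "brother": np.array([1.0, 0.0, 1.0]),
    "sister":  np.array([1.0, 1.0, 1.0]),
    "father":  np.array([2.0, 0.0, 1.0]),
    "mother":  np.array([2.0, 1.0, 1.0]),
}

def cosine(a, b):
    # cosine similarity between two vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c):
    """Return the vocabulary word closest to vec(b) - vec(a) + vec(c)."""
    target = vocab[b] - vocab[a] + vocab[c]
    candidates = [w for w in vocab if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vocab[w], target))

print(analogy("brother", "sister", "father"))  # mother
```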
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Works</head><p>In recent years, there have been several critical articles on intrinsic assessment methods: some researchers point to the subjectivity of human judgments, the vagueness of instructions for particular tasks, and terminological confusion <ref type="bibr" target="#b11">[12]</ref>. However, despite these flaws, these methods are widely used for embedding model evaluation across different languages.</p><p>There have been successful attempts to adapt intrinsic evaluation benchmarks to other languages. Research on monolingual vector space models for German, Russian, and Italian has shown that their predictions did not always correlate well with human judgments collected in the language used for model training <ref type="bibr" target="#b12">[13]</ref>.</p><p>Another study translated the SimLex999 benchmark into Estonian and discovered that, unlike in the original research, computational word embedding models correlate better with noun scores than with adjective scores <ref type="bibr" target="#b13">[14]</ref>.</p><p>A few studies on the evaluation of Lithuanian word embeddings have been carried out. 
In the first study, word embeddings for different models and training algorithms were evaluated against a limited implementation of the Lithuanian WordNet <ref type="bibr" target="#b14">[15]</ref>, showing that the Continuous Bag of Words (CBOW) approach performed significantly better than the skip-gram approach for Word2vec word embeddings, vector dimensions having little effect in this case.</p><p>The second study compared traditional and deep learning approaches for sentiment analysis using word embeddings, finding that deep learning performed well only when applied to small datasets, and that traditional methods performed better in all other contexts <ref type="bibr" target="#b15">[16]</ref>.</p><p>The third study was conducted with Transformer models using GloVe word embeddings <ref type="bibr" target="#b16">[17]</ref>. The study concluded that multilingual transformer models can be fine-tuned to word vectors, but still perform much worse than specifically trained embeddings.</p><p>In conclusion, we see that the Lithuanian language lacks word embedding evaluation benchmarks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Corpus</head><p>Semantic intrinsic similarity benchmarks cover many different test domains, such as geography, languages, currency, etc. Therefore, we need a wide variety of data for embedding training. Research has shown that larger corpora yield better word embeddings <ref type="bibr" target="#b17">[18]</ref>. For this reason, it is important to build an extensive corpus for embedding training, which will later be used for evaluation.</p><p>Wikipedia texts are a typical starting point for building an embedding training corpus. In order to expand our experimental corpus, we used articles from Lithuanian news portals, mainly from the largest one, Delfi.lt; the collected articles cover different topical areas such as news, cars, fitness, culture, food, and so on.</p><p>In order to obtain better word embeddings, we also included texts from the Corpus of Contemporary Lithuanian Language (CCLL) <ref type="bibr" target="#b18">[19]</ref>, which contains texts in a variety of genres and on a variety of topics.</p><p>Statistics for our combined experimental corpus are presented in Table <ref type="table" target="#tab_0">1</ref>. The pre-processing phase consists of three steps: 1) breaking text into tokens, lowercasing it, and removing special symbols, numbers, non-Lithuanian words, and stop-words; 2) removing short documents, less than 50 characters in size; 3) lemmatizing the texts, which is important for morphologically rich languages <ref type="bibr" target="#b19">[20]</ref>. Lemmatization was performed using lexical and morphological analysis tools from the Lithuanian language technology infrastructure built in the Semantika2 <ref type="bibr" target="#b20">[21]</ref> project.</p><p>Alternatively, the texts could have been stemmed instead, but lemmatization was preferred, as all our documents were in normative spelling and punctuation. Stemming is more favorable for social media texts with many out-of-dictionary words. 
Also, stemming has its limitations, e.g. over-stemming and under-stemming problems <ref type="bibr" target="#b21">[22]</ref>.</p><p>The statistics for the final version of our experimental corpus are presented in Table <ref type="table" target="#tab_1">2</ref>. </p></div>
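The pre-processing steps above can be sketched as follows. This is a minimal illustration: the stop-word list is an invented placeholder subset, and the lemmatization step (performed in the paper with the Semantika2 tools) is omitted.

```python
# Minimal sketch of the corpus pre-processing pipeline described above.
# STOP_WORDS is an illustrative placeholder, not the real stop-word list.
import re

STOP_WORDS = {"ir", "bet", "kad"}  # tiny illustrative subset

def preprocess(doc: str) -> list[str]:
    # step 2: drop short documents (less than 50 characters)
    if len(doc) < 50:
        return []
    # step 1: lowercase, tokenize, keep only Lithuanian-alphabet tokens
    tokens = re.findall(r"[a-ząčęėįšųūž]+", doc.lower())
    # remove stop-words; lemmatization would follow here (external tools)
    return [t for t in tokens if t not in STOP_WORDS]
```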
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Methodology</head><p>The methodology part covers the methods applied in this research: (1) benchmark dataset adaptation method for semantic similarity based intrinsic embedding model evaluation; (2) semantic similarity based embedding evaluation using the adapted benchmarks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Adaptation of benchmark datasets</head><p>As a result of a brief analysis, the following English-language benchmarks were selected for adaptation to the Lithuanian language, their popularity being the main criterion:</p><p>1. MEN (Marco, Elia and Nam), 3 000 pairs <ref type="bibr" target="#b22">[23]</ref>. The following algorithm was applied for dataset adaptation to the Lithuanian language: 1. Automated translation of the datasets (by applying the Google Cloud Translation API) <ref type="bibr" target="#b29">[30]</ref>. 2. Inconsistency checking (manual examination), discarding inconsistent word pairs. 3. Word lemmatization. 4. Re-evaluation of the scores initially assigned to the English-language word pairs by two independent persons (manual procedure); an average score was calculated.</p></div>
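Steps 2-4 of the adaptation algorithm can be sketched as follows; the translation step itself is omitted, and the sample rows and per-annotator scores below are illustrative placeholders (only the kepsnys/mėsa result of 8.2 matches Table 3). Multi-word translations are discarded and identical translations receive the maximum score of 10, as described in Section 5.1.

```python
# Sketch of adaptation steps 2-4: discard multi-word translations, give
# identical translations the maximum score 10, average two annotators.
def adapt(pairs):
    """pairs: (lt_word1, lt_word2, annotator1_score, annotator2_score)."""
    adapted = []
    for w1, w2, s1, s2 in pairs:
        if " " in w1 or " " in w2:   # multi-word translation: discard pair
            continue
        # identical translations get maximum similarity, else annotator mean
        score = 10.0 if w1 == w2 else (s1 + s2) / 2
        adapted.append((w1, w2, score))
    return adapted

rows = [
    ("kepsnys", "mėsa", 8.0, 8.4),                   # kept, averaged
    ("kompiuteris", "programinė įranga", 7.0, 7.5),  # discarded (two words)
    ("protingas", "protingas", 6.0, 7.0),            # same word, score 10
]
print(adapt(rows))
```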
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Embedding evaluation using semantic similarity benchmarks</head><p>As mentioned in Section 1, the semantic similarity datasets are based on correlation with human judgments of how closely words are related.</p><p>The similarity benchmark datasets consist of a certain number of word pairs. Each pair is assigned a similarity/relatedness score; the values lie in the range [0, 10], depending on the dataset.</p><p>A word embedding model is represented by a corresponding vector for each word in the dictionary. If a word is missing from the trained word embedding model, its vector is replaced by the mean of all vectors. To calculate the similarity between vectors, we use the cosine similarity formula (see Eq. 1), where a and b are vectors in the word embedding vector space.</p><formula xml:id="formula_0">sim_cos(a_i, b_i) = (a_i · b_i) / (||a_i|| × ||b_i||),<label>(1)</label></formula><p>where a_i and b_i are N-dimensional vectors. The result of cosine similarity is a value in the [-1, 1] interval, where 1 stands for identical vectors and -1 for opposite vectors.</p><p>A human-annotated benchmark dataset consists of n triplets containing pairs of words and their corresponding similarity scores ⟨w_i, w_j, h_ij⟩, where w_i, w_j are dictionary words, and h_ij is the score.</p><p>Let h = (h_1, h_2, ..., h_n) be the vector of human-annotated benchmark scores, and m = (m_1, m_2, ..., m_n), correspondingly, the vector of similarity scores calculated from the word embeddings.</p><p>Then, the evaluation score for the corresponding embedding model, based on the selected benchmark, is calculated as Spearman's correlation ρ (see Eq. 2) between h and m.</p><p>Spearman's ρ can take any value satisfying −1 ≤ ρ ≤ 1: values close to +1 indicate a strong positive relationship, values close to −1 indicate a strong negative relationship, and values near 0 indicate a weak relationship.</p><p>The Spearman correlation formula is:</p><formula xml:id="formula_1">ρ = 1 − (6 ∑ d_i²) / (n(n² − 1)),<label>(2)</label></formula><p>where n is the dataset length and d_i is the difference between the ranks of h and m.</p><p>The aggregated score p_avg of a single word embedding model is calculated (see Eq. 3) as:</p><formula xml:id="formula_2">p_avg = (1/n) ∑_{i=1}^{n} ρ_i,<label>(3)</label></formula><p>where ρ_i is the Spearman correlation value for a specific benchmark and n is the number of benchmarks.</p><p>In order to compare different embedding model types (Word2vec, fastText, GloVe), we calculate the average score over all embedding models of a given type (see <ref type="bibr">Eq. 4</ref>).</p><formula xml:id="formula_3">P_t = (1/n) ∑_{i=1}^{n} p_avg,i,<label>(4)</label></formula><p>where P_t is the average score over all word embeddings of type t, and n is the number of type-t word embedding models.</p></div>
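The evaluation pipeline of Eqs. 1-3 can be sketched in plain Python. The vectors and benchmark rows used here are toy values, not the paper's data; Spearman's ρ is computed with the no-ties form of Eq. 2.

```python
# Sketch of Eqs. 1-3: cosine similarity, Spearman's rho (no-ties form),
# and the per-model average over benchmarks. Toy data only.
import math

def cos_sim(a, b):
    # Eq. 1: cosine similarity between two vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def ranks(values):
    # rank positions 1..n (assumes no ties)
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(h, m):
    # Eq. 2: rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))
    n = len(h)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(h), ranks(m)))
    return 1 - 6 * d2 / (n * (n * n - 1))

def evaluate(embeddings, benchmark):
    """benchmark: (w1, w2, human_score) triples; out-of-vocabulary words
    fall back to the mean vector, as described in the text."""
    dim = len(next(iter(embeddings.values())))
    mean_vec = [sum(v[i] for v in embeddings.values()) / len(embeddings)
                for i in range(dim)]
    human = [h for _, _, h in benchmark]
    model = [cos_sim(embeddings.get(w1, mean_vec),
                     embeddings.get(w2, mean_vec))
             for w1, w2, _ in benchmark]
    return spearman(human, model)

def p_avg(rhos):
    # Eq. 3: average Spearman score over all benchmarks
    return sum(rhos) / len(rhos)
```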
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Experiments and results</head><p>Experiments were carried out in a series of tasks: 1. Firstly, 7 selected evaluation benchmark datasets were adapted for the Lithuanian language. 2. Secondly, 50 word embeddings with different hyperparameter sets were trained on the accumulated experimental corpus.</p><p>3. Thirdly, the obtained word embedding models were evaluated using the adapted intrinsic evaluation benchmarks. 4. Finally, the resulting data were examined in order to determine the effect of different hyperparameters on benchmark evaluation results.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Adaptation of benchmark datasets</head><p>The 7 selected evaluation benchmark datasets (see Section 4.1) were adapted from English to Lithuanian. There were 5610 word pairs at the beginning; after the adaptation process, 5573 word pairs remained, i.e., a total of 37 word pairs were discarded. The following problems were observed during the adaptation process:</p><p>1. Multiple words: in some cases, one-to-one word translation is not possible, because a two-word expression in Lithuanian corresponds to a single word in English. For example, for the word pair "computer – software", the Lithuanian translation would be "kompiuteris – programinė įranga". As we use vector-to-vector comparison, such word pairs were discarded.</p><p>2. The meaning of certain words has been shaped by American culture, e.g. words like soccer, football, and FBI, which are commonly used in the US. Such words were replaced with Lithuanian synonyms. 3. A few older word pairs have undergone semantic change as word meanings evolved. For example, the word pair "Arafat – terror" would have been rated as more similar in the past than it is now. Such pairs were discarded. 4. In some cases, both English words have the same meaning in Lithuanian, for example, the pairs "smart – intelligent", "happy – cheerful", and "fast – rapid". Such word pairs as a result contained two identical words, and their scores were set to 10 (maximum similarity). An excerpt of the adapted SimLex999 dataset for the Lithuanian language is presented in Table <ref type="table" target="#tab_3">3</ref>. The first two columns contain an English word pair in its original form. The third column contains the human-generated similarity score. The fourth and fifth columns contain the Lithuanian translations of the English words and the re-evaluated scores for the Lithuanian word pairs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Word embedding model training</head><p>The following tools were used for word embedding training: the Python gensim implementation of Word2vec<ref type="foot" target="#foot_0">1</ref>, the official fastText Python library<ref type="foot" target="#foot_1">2</ref>, and the official GloVe library<ref type="foot" target="#foot_2">3</ref>. We used similar training parameters in order to be able to compare the different word embeddings (see Table <ref type="table" target="#tab_4">4</ref>). </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.">Word embedding model evaluation</head><p>All the trained embedding models were evaluated using the Spearman ρ correlation coefficient between human benchmark scores and vector space model scores. The results were grouped by vector model type, characterized by different hyperparameter sets (see Table <ref type="table" target="#tab_4">4</ref>).</p><p>The best 4 and the worst 4 models, ranked by the benchmark result average, are presented correspondingly in Table <ref type="table" target="#tab_6">5 and Table 6</ref>. To enable score comparison, only embedding models with the same hyperparameters were used: dimensions (100, 300), window size (5), and minimum count (2, 5) (Figure <ref type="figure" target="#fig_1">1</ref>). GloVe's word embedding model scores P_t were on average lower than those of fastText and Word2vec; the latter two were nearly identical, with a difference of only 0.001 between them.</p><p>Additionally, the experiment results were analyzed to determine whether a particular hyperparameter had a significant effect on the results. Following a thorough examination of all the hyperparameters, we discovered a correlation between the dimension value and the correlation results (Figure <ref type="figure" target="#fig_2">2</ref>). Different dimension values for the various embedding types had a significant effect on the results: the larger the dimension of the word embedding, the better the results. As illustrated in Figure <ref type="figure" target="#fig_3">3</ref>, as vector size increases, the model correlation score also increases. The Pearson correlation (see Eq. 5), where n is the number of models, x the dimension value, and y the Spearman correlation value for a model, is r = 0.918, which indicates a strong relationship between the values.</p></div>
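The Eq. 5 check reported above can be sketched with a small helper. The (dimension, score) pairs below are invented for illustration; the paper reports r = 0.918 on its own 50 models.

```python
# Pearson correlation (Eq. 5) between dimension values x and model
# Spearman scores y; the sample data below is illustrative only.
import math

def pearson(x, y):
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sx2 = sum(a * a for a in x)
    sy2 = sum(b * b for b in y)
    num = n * sxy - sx * sy
    den = math.sqrt((n * sx2 - sx * sx) * (n * sy2 - sy * sy))
    return num / den

dims = [100, 300, 500, 1000]       # invented dimension values
scores = [0.60, 0.63, 0.64, 0.65]  # invented aggregated Spearman scores
print(round(pearson(dims, scores), 3))
```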
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusions</head><p>This was the first attempt to adapt the most popular intrinsic similarity and relatedness benchmark datasets for the Lithuanian language. Despite reported challenges when adapting benchmarks to other languages, we showed that this can be done even for morphologically rich languages like Lithuanian.</p><p>The application of the adapted benchmark datasets for the evaluation of the embedding models trained on an experimental corpus showed that the GloVe model performed worse than fastText and Word2vec, judging by average benchmark results.</p><p>We also conclude that for the intrinsic similarity and relatedness task, the dimension hyperparameter has a significant impact on the evaluation results, with larger word embedding dimensions yielding better results.</p><p>In the future, we plan to adapt other types of embedding evaluation benchmarks, such as categorization and analogy testing, as well as extrinsic evaluation with POS tagging, named entity recognition (NER), and other NLP tasks. This would allow us to compare intrinsic and extrinsic evaluation methods. We will also continue to expand our corpus for future tests.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label></label><figDesc>. The first column in these tables indicates the model name together with its hyperparameters. The following labels are used: N stands for negative sampling, S for SoftMax, CBOW for Continuous Bag of Words, and SKIP for skip-gram; d is the dimension, w the window size, m the minimum count threshold, and i the iteration count. The remaining columns give benchmark names and Spearman ρ correlation scores. The last column shows the aggregated Spearman p_avg correlation score over all the benchmarks.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Aggregated Spearman ρ correlation scores over different model types 𝑃 𝑡 .</figDesc><graphic coords="7,181.02,463.74,232.95,179.40" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: The correlation between vector size (d) hyperparameter and benchmark aggregated scores grouped by embedding model type.</figDesc><graphic coords="8,168.00,72.00,258.85,181.45" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: The correlation between the vector size (dim) hyperparameter and the benchmark aggregated scores. We can use the Pearson correlation score r (see Eq. 5) to check whether the values are correlated: r = [n(∑xy) − (∑x)(∑y)] / √([n∑x² − (∑x)²][n∑y² − (∑y)²])</figDesc><graphic coords="8,184.00,343.55,226.55,179.15" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc></figDesc><table><row><cell cols="2">Experimental corpus: initial version</cell><cell></cell><cell></cell></row><row><cell>Corpora</cell><cell>Documents</cell><cell>Token count</cell><cell>Unique tokens</cell></row><row><cell>Wikipedia</cell><cell>286 089</cell><cell>22 942 951</cell><cell>971 506</cell></row><row><cell>CCLL</cell><cell>8 128</cell><cell>136 279 087</cell><cell>2 329 976</cell></row><row><cell>News articles</cell><cell>118 930</cell><cell>60 456 637</cell><cell>718 051</cell></row><row><cell>Total</cell><cell>413 148</cell><cell>219 678 675</cell><cell>2 894 874</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc></figDesc><table><row><cell>Experimental corpus: final version</cell><cell></cell></row><row><cell>Parameter</cell><cell>Value</cell></row><row><cell>Document count</cell><cell>303 443</cell></row><row><cell>Total tokens</cell><cell>160 174 732</cell></row><row><cell>Unique tokens</cell><cell>1 396 607</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 3</head><label>3</label><figDesc>Excerpt of the adapted SimLex999 benchmark for the Lithuanian language</figDesc><table><row><cell>word1</cell><cell>word2</cell><cell>value</cell><cell>word1</cell><cell>word2</cell><cell>value</cell></row><row><cell>steak</cell><cell>meat</cell><cell>7.47</cell><cell>kepsnys</cell><cell>mėsa</cell><cell>8.2</cell></row><row><cell>nail</cell><cell>thumb</cell><cell>3.55</cell><cell>nagas</cell><cell>nykštys</cell><cell>4.5</cell></row><row><cell>band</cell><cell>orchestra</cell><cell>7.08</cell><cell>grupė</cell><cell>orkestras</cell><cell>7.6</cell></row><row><cell>book</cell><cell>bible</cell><cell>5.00</cell><cell>knyga</cell><cell>biblija</cell><cell>5.6</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 4</head><label>4</label><figDesc>Hyperparameters used for embedding model training</figDesc><table><row><cell></cell><cell>Word2vec</cell><cell>FastText</cell><cell>GloVe</cell></row><row><cell>Architecture Model training</cell><cell>CBOW, Skip-gram Negative Sampling, hierarchical SoftMax</cell><cell>CBOW, Skip-gram Negative Sampling</cell><cell>Global word-word Co-occurrence matrix</cell></row><row><cell>Dimensions</cell><cell>100, 300, 500, 1000</cell><cell>100, 300, 500</cell><cell>100, 300</cell></row><row><cell>Window size</cell><cell>5</cell><cell>5, 10</cell><cell>5, 10</cell></row><row><cell>Minimum count</cell><cell>1, 2, 5</cell><cell>2, 5</cell><cell>2, 5</cell></row><row><cell cols="4">A total of 50 (19 Word2vec, 20 fastText, 11 GloVe) word embeddings were created by applying</cell></row><row><cell cols="2">different hyperparameter sets.</cell><cell></cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 5</head><label>5</label><figDesc>The best 4 word embeddings ranked by 𝒑 𝒂𝒗𝒈 (Spearman aggregated score)</figDesc><table><row><cell>Model</cell><cell>MEN</cell><cell cols="4">WS353 WS353R WS353S SimLex999</cell><cell cols="3">RG65 MTurk 𝑝 𝑎𝑣𝑔</cell></row><row><cell>FastText SKIP 300d 5w 5m 5i</cell><cell>0.718</cell><cell>0.693</cell><cell>0.539</cell><cell>0.779</cell><cell>0.412</cell><cell>0.733</cell><cell cols="2">0.684 0.651</cell></row><row><cell>FastText SKIP 300d 5w 2m 5i</cell><cell>0.717</cell><cell>0.679</cell><cell>0.513</cell><cell>0.771</cell><cell>0.41</cell><cell>0.737</cell><cell cols="2">0.682 0.644</cell></row><row><cell>FastText SKIP 100d 5w 5m 5i</cell><cell>0.712</cell><cell>0.681</cell><cell>0.544</cell><cell>0.785</cell><cell>0.388</cell><cell>0.721</cell><cell cols="2">0.678 0.644</cell></row><row><cell>Word2vec NSKIP 300d 5w 1m 5i</cell><cell>0.711</cell><cell>0.679</cell><cell>0.507</cell><cell>0.749</cell><cell>0.422</cell><cell>0.766</cell><cell>0.66</cell><cell>0.642</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_6"><head>Table 6</head><label>6</label><figDesc>The worst 4 word embeddings ranked by 𝒑 𝒂𝒗𝒈 (Spearman aggregated score)</figDesc><table><row><cell>Model</cell><cell cols="5">MEN WS353 WS353R WS353S SimLex999</cell><cell cols="2">RG65 MTurk 𝑝 𝑎𝑣𝑔</cell></row><row><cell>GloVe 300d 10w 1m 5i</cell><cell>0.657</cell><cell>0.584</cell><cell>0.426</cell><cell>0.689</cell><cell>0.378</cell><cell>0.706</cell><cell>0.614 0.579</cell></row><row><cell>GloVe 100d 10w 2m 5i</cell><cell>0.65</cell><cell>0.584</cell><cell>0.414</cell><cell>0.683</cell><cell>0.363</cell><cell>0.724</cell><cell>0.611 0.575</cell></row><row><cell>FastText CBOW 100d 5w 1m 5i</cell><cell>0.65</cell><cell>0.564</cell><cell>0.389</cell><cell>0.679</cell><cell>0.419</cell><cell>0.761</cell><cell>0.559 0.574</cell></row><row><cell>GloVe 100d 10w 1m 5i</cell><cell>0.647</cell><cell>0.588</cell><cell>0.432</cell><cell>0.673</cell><cell>0.356</cell><cell>0.709</cell><cell>0.609 0.573</cell></row></table><note>Comparison between different types of embeddings (Word2vec, fastText, GloVe) was done by averaging 𝑝 𝑎𝑣𝑔 by embedding type (see Eq. 4).</note></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://radimrehurek.com/gensim/models/word2vec.html</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">https://github.com/facebookresearch/fastText/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">https://github.com/stanfordnlp/GloVe</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">Efficient estimation of word representations in vector space</title>
		<author>
			<persName><forename type="first">Tomas</forename><surname>Mikolov</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1301.3781</idno>
		<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">GloVe: Global vectors for word representation</title>
		<author>
			<persName><forename type="first">J</forename><surname>Pennington</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Socher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Manning</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">EMNLP</title>
		<imprint>
			<biblScope unit="volume">14</biblScope>
			<biblScope unit="page" from="1532" to="1543" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">SimLex-999: Evaluating semantic models with (genuine) similarity estimation</title>
		<author>
			<persName><forename type="first">Felix</forename><surname>Hill</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Roi</forename><surname>Reichart</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Anna</forename><surname>Korhonen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">41</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="665" to="695" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Multimodal distributional semantics</title>
		<author>
			<persName><forename type="first">Elia</forename><surname>Bruni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Nam-Khanh</forename><surname>Tran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Marco</forename><surname>Baroni</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of artificial intelligence research</title>
		<imprint>
			<biblScope unit="volume">49</biblScope>
			<biblScope unit="page" from="1" to="47" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Word representations: a simple and general method for semi-supervised learning</title>
		<author>
			<persName><forename type="first">Joseph</forename><surname>Turian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Lev</forename><surname>Ratinov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yoshua</forename><surname>Bengio</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 48th annual meeting of the association for computational linguistics</title>
		<meeting>the 48th annual meeting of the association for computational linguistics</meeting>
		<imprint>
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Efficient estimation of word representations in vector space</title>
		<author>
			<persName><forename type="first">T</forename><surname>Mikolov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Corrado</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dean</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of International Conference on Learning Representations (ICLR)</title>
				<meeting>International Conference on Learning Representations (ICLR)</meeting>
		<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Analogy-based detection of morphological and semantic relations with word embeddings: what works and what doesn&apos;t</title>
		<author>
			<persName><forename type="first">A</forename><surname>Gladkova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Drozd</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Matsuoka</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the NAACL-HLT SRW</title>
				<meeting>the NAACL-HLT SRW<address><addrLine>San Diego, California</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2016-06-12">June 12-17, 2016</date>
			<biblScope unit="page" from="47" to="54" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Don&apos;t count, predict! a systematic comparison of context-counting vs. context-predicting semantic vectors</title>
		<author>
			<persName><forename type="first">Marco</forename><surname>Baroni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Georgiana</forename><surname>Dinu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Germán</forename><surname>Kruszewski</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics</title>
				<meeting>the 52nd Annual Meeting of the Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">The Indo-Europeans: Archeological Problems</title>
		<author>
			<persName><forename type="first">M</forename><surname>Gimbutas</surname></persName>
		</author>
		<idno type="DOI">10.1525/aa.1963.65.4.02a00030</idno>
		<ptr target="https://doi.org/10.1525/aa.1963.65.4.02a00030" />
	</analytic>
	<monogr>
		<title level="j">American Anthropologist</title>
		<imprint>
			<biblScope unit="volume">65</biblScope>
			<biblScope unit="page" from="815" to="836" />
			<date type="published" when="1963">1963</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title level="m" type="main">Dictionary of the Lithuanian Language</title>
		<imprint>
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
	<note>Archived from the original on 2017-08-11; retrieved April 19, 2018</note>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<ptr target="https://www.pingdom.com/blog/the-biggest-and-busiest-languages-on-wikipedia/" />
		<title level="m">The biggest and busiest languages on Wikipedia</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<title level="m" type="main">Problems with evaluation of word embeddings using word similarity tasks</title>
		<author>
			<persName><forename type="first">Manaal</forename><surname>Faruqui</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1605.02276</idno>
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<title level="m" type="main">Separated by an un-common language: Towards judgment language informed vector space modeling</title>
		<author>
			<persName><forename type="first">Ira</forename><surname>Leviant</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Roi</forename><surname>Reichart</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1508.00106</idno>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Is Similarity Visually Grounded? Computational Model of Similarity for the Estonian language</title>
		<author>
			<persName><forename type="first">Claudia</forename><surname>Kittask</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Eduard</forename><surname>Barbu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the International Conference on Recent Advances in Natural Language Processing</title>
				<meeting>the International Conference on Recent Advances in Natural Language Processing</meeting>
		<imprint>
			<publisher>RANLP</publisher>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Intrinsic evaluation of Lithuanian word embeddings using WordNet</title>
		<author>
			<persName><forename type="first">Jurgita</forename><surname>Kapočiūtė-Dzikienė</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Robertas</forename><surname>Damaševičius</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="s">Computer Science On-line Conference</title>
		<imprint>
			<date type="published" when="2018">2018</date>
			<publisher>Springer</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Sentiment analysis of Lithuanian texts using traditional and deep learning approaches</title>
		<author>
			<persName><forename type="first">Jurgita</forename><surname>Kapočiūtė-Dzikienė</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Robertas</forename><surname>Damaševičius</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Marcin</forename><surname>Woźniak</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computers</title>
		<imprint>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page">4</biblScope>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<title level="m" type="main">Testing pre-trained Transformer models for Lithuanian news clustering</title>
		<author>
			<persName><forename type="first">Lukas</forename><surname>Stankevičius</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mantas</forename><surname>Lukoševičius</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2004.03461</idno>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">How to generate a good word embedding</title>
		<author>
			<persName><forename type="first">Siwei</forename><surname>Lai</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Intelligent Systems</title>
		<imprint>
			<biblScope unit="volume">31</biblScope>
			<biblScope unit="issue">6</biblScope>
			<biblScope unit="page" from="5" to="14" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<idno>1998-2016</idno>
		<ptr target="http://tekstynas.vdu.lt/tekstynas" />
		<title level="m">Kompiuterinės lingvistikos centras. Dabartinės lietuvių kalbos tekstynas</title>
				<imprint/>
	</monogr>
	<note>Vytauto Didžiojo universitetas</note>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<title level="m" type="main">To lemmatize or not to lemmatize: how word normalisation affects ELMo performance in word sense disambiguation</title>
		<author>
			<persName><forename type="first">Andrey</forename><surname>Kutuzov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Elizaveta</forename><surname>Kuzmenko</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1909.03135</idno>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">SEMANTIKA 2</title>
		<ptr target="https://www.vdu.lt/lt/vdu-vykdo-es-finansuojama-projekta-semantika-2/" />
	</analytic>
	<monogr>
		<title level="m">VDU vykdo ES finansuojamą projektą</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">A comparative study of stemming algorithms</title>
		<author>
			<persName><forename type="first">Anjali</forename><forename type="middle">Ganesh</forename><surname>Jivani</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Int. J. Comp. Tech. Appl</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="issue">6</biblScope>
			<biblScope unit="page" from="1930" to="1938" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Multimodal distributional semantics</title>
		<author>
			<persName><forename type="first">Elia</forename><surname>Bruni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Nam-Khanh</forename><surname>Tran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Marco</forename><surname>Baroni</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of artificial intelligence research</title>
		<imprint>
			<biblScope unit="volume">49</biblScope>
			<biblScope unit="page" from="1" to="47" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Placing Search in Context: The Concept Revisited</title>
		<author>
			<persName><forename type="first">Lev</forename><surname>Finkelstein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Evgeniy</forename><surname>Gabrilovich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yossi</forename><surname>Matias</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ehud</forename><surname>Rivlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Zach</forename><surname>Solan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Gadi</forename><surname>Wolfman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Eytan</forename><surname>Ruppin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Transactions on Information Systems</title>
		<imprint>
			<biblScope unit="volume">20</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="116" to="131" />
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<monogr>
		<title level="m" type="main">A study on similarity and relatedness using distributional and wordnet-based approaches</title>
		<author>
			<persName><forename type="first">Eneko</forename><surname>Agirre</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<monogr>
		<title level="m" type="main">A study on similarity and relatedness using distributional and wordnet-based approaches</title>
		<author>
			<persName><forename type="first">Eneko</forename><surname>Agirre</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<monogr>
		<title level="m" type="main">Learning distributed representations of sentences from unlabelled data</title>
		<author>
			<persName><forename type="first">Felix</forename><surname>Hill</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Kyunghyun</forename><surname>Cho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Anna</forename><surname>Korhonen</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1602.03483</idno>
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">A word at a time: computing word relatedness using temporal semantic analysis</title>
		<author>
			<persName><forename type="first">Kira</forename><surname>Radinsky</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 20th international conference on World wide web</title>
				<meeting>the 20th international conference on World wide web</meeting>
		<imprint>
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">Contextual correlates of synonymy</title>
		<author>
			<persName><forename type="first">Herbert</forename><surname>Rubenstein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">John</forename><forename type="middle">B</forename><surname>Goodenough</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Communications of the ACM</title>
		<imprint>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="page" from="627" to="633" />
			<date type="published" when="1965">1965</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<monogr>
		<ptr target="https://cloud.google.com/translate" />
		<title level="m">Fast, dynamic translation tailored to your content needs</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
