Word embedding in form of symmetric and skew-symmetric operator

Ekaterina Koshchenko
National Research University Higher School of Economics in Saint-Petersburg
Saint-Petersburg, Russia
catherine.pths@gmail.com

Igor Kuralenok
JetBrains Research
Saint-Petersburg, Russia
ikuralenok@gmail.com

Abstract—Existing word embedding models represent each word with two real-valued vectors: central and context. This is a consequence of the asymmetric nature of word relations, and it requires more time and data for training. We introduce a new approach to asymmetric relations that uses the advantages of the global vectors model. Due to the reduced impact of asymmetric information on the resulting word representations, our model converges faster and outperforms existing models on word analogies tasks.

Index Terms—SSDE, word embedding, matrix decomposition

I. Introduction

Understanding word relations in natural language is an easy task for a human but not for a computer. We need to teach computers how words are related and what meanings they have, depending on the context. To make it possible for a machine to process words, they have to be presented in a digitized format. This leads to the idea of real-valued vector representations, word embeddings. Most works on word embeddings focus on preserving two word properties in their representations. The first property is that word relations and similarities can be described using distances and angles between word vectors. For example, the closer-further feature: "yellow" is closer to "red" than to "smart". In vector form it can be presented as

    ||Yellow − Red|| < ||Yellow − Smart||.

This property is widely used for synonym search. The second property is word analogies. The corresponding feature was introduced by Mikolov et al. [1] and is designed to capture word similarities. For example, "Paris" and "France" have the same connection as "Budapest" and "Hungary". In vector form we can present it as

    France − Paris = Hungary − Budapest.

This approach benefits models that create meaning-based word vectors, while the closer-further feature is more practical and can be applied to clustering and classification tasks.
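As an illustration of these two properties, here is a minimal Python sketch with made-up toy vectors; the numbers are hypothetical and are not taken from any trained model.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, for illustration only.
vec = {
    "yellow":   np.array([0.9, 0.1, 0.0]),
    "red":      np.array([0.8, 0.2, 0.1]),
    "smart":    np.array([0.0, 0.9, 0.8]),
    "paris":    np.array([0.5, 0.6, 0.1]),
    "france":   np.array([0.9, 0.7, 0.2]),
    "budapest": np.array([0.4, 0.5, 0.6]),
    "hungary":  np.array([0.8, 0.6, 0.7]),
}

# Closer-further property: ||yellow - red|| < ||yellow - smart||.
print(np.linalg.norm(vec["yellow"] - vec["red"])
      < np.linalg.norm(vec["yellow"] - vec["smart"]))   # True for these toy vectors

# Analogy property: france - paris should be close to hungary - budapest.
print(np.linalg.norm((vec["france"] - vec["paris"])
                     - (vec["hungary"] - vec["budapest"])))  # small if the analogy holds
```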
Word embeddings were originally created to be used in Natural Language Processing tasks. For example, one of the feature extraction techniques used for document indexing is latent semantic indexing [2]; it is a precursor of word embeddings that embodies the same principles and ideas. Another task is sentiment analysis. One solution to this problem is the SentProp framework [3], which combines the label propagation method with word embeddings to learn sentiment lexicons on domain-specific corpora. Another way to solve some Natural Language Processing tasks is language models; nowadays the state-of-the-art approaches to language modeling are ELMo [4] and BERT [5]. Each of these methods uses prebuilt word embeddings as input data and can benefit from better embedding models. Therefore, creating better embedding models is still a relevant task.

Three word embedding models are the most popular and widely used. Word2Vec is a local window-based method presented by Mikolov et al. [6]. It preserves the word analogies feature, bringing the vectors of words appearing in similar contexts closer together. Another approach is GloVe [7], which is trained on word-word co-occurrence counts. Its authors noticed that the relation of two words can be understood by examining the ratio of their co-occurrence probabilities with various probe words, which also yields the word analogies feature. The third model, FastText [8], is focused on the distances/angles property. FastText uses character n-grams to enrich word vectors with subword information. This approach allows the model to use morphological information, therefore choosing better vectors for rare words, and makes it possible to learn something even for out-of-vocabulary words.

Word relations are often asymmetric. For example, "New York" is a common combination of words meaning the name of a city in the USA. However, "York New" is quite a rare combination and does not mean anything specific. In all the models mentioned above, word interaction is expressed in terms of the dot product of word vectors, which leads to the generation of two vectors for each word: central and context. For that reason, twice as many parameters have to be computed and, consequently, more time is required for learning. To solve this problem, asymmetric relations between word representations can be used instead of the dot product of central and context vectors.

In this work, we propose a Symmetric Skew-symmetric Decomposition based model. We demonstrate that our method outperforms the GloVe approach on its own word analogies metrics.

II. Related work

Many word embedding models are known from the literature, but most of them are based on three principal approaches: Word2Vec [6], GloVe [7] and FastText [8]. All three models are widely used in language models and Natural Language Processing applications.

A. Word2Vec

Word2Vec is an approach introduced by Mikolov et al. [6] that preserves the word analogies property. It suggests two language models: Skip-gram and CBOW. Both methods represent word relationships with the dot product of word vectors. As described in the introduction, relations can be asymmetric, which leads to the use of two vectors per word: central and context. Skip-gram and CBOW scan the corpus with a sliding window. All words inside the window are considered to be in the same context, i.e. connected to each other. In both models all words inside one window get the same co-occurrence weight, i.e. they are treated as equal. We call this type of window a "constant" window.

Continuous Bag of Words (CBOW) is a model trained on the task "predict the middle word given the surrounding context". The method tries to choose central and context vectors so that the probability of predicting the word in the middle of the sliding window, based on the rest of the window, is high. The second model is called Skip-gram and is trained on the inverse problem: predict the context given just the one word in the center of the sliding window.

On each training step, for each word, both methods have to compute the probability of the middle word co-occurring with every other word in the vocabulary, which makes the computational complexity too high. In a later article [9] this problem was solved for the Skip-gram model with Negative Sampling. Negative Sampling suggests computing the probability of the middle word sharing a context only with a constant number of positive and negative samples. Positive samples are words that often appear in one window with the middle word; they can be found before the training process. Negative samples are words that are unlikely to appear in context with the middle word. Mikolov et al. suggest drawing negative samples from the unigram distribution raised to the 3/4 power. This approach accelerates Skip-gram calculations while keeping the same quality.

Experiments have shown that the Skip-gram method performs better on semantic tasks, while the syntactic results of the two models are very similar. Since Skip-gram is easier to train than CBOW with the same or even better results, later models use Skip-gram.

Skip-gram and CBOW have several drawbacks. First, training time depends on the corpus size. Second, two vectors are generated for each word, which requires more time and input data for training.

B. GloVe

The GloVe (Global Vectors) model, suggested by Pennington et al. [7], also aims to preserve word analogies. The relationship of two words can be learned by examining their relations with other words. In this approach word relationships are represented with a co-occurrence matrix X, where x_ij is the number of times word w_i appeared in the context of word w_j. This matrix is constructed before the training process with one scan of the corpus. On each learning step we iterate through the co-occurrence matrix and, for each non-zero co-occurrence x_ij, update the central and context vectors of the corresponding words according to the value and direction of the target function gradient.

In GloVe each word is presented with two vectors, similar to Word2Vec. A sliding window is also used to scan the corpus when the co-occurrence matrix is constructed. Unlike the Word2Vec "constant" window, GloVe uses a "shrinking" window: the weight of a co-occurrence linearly decreases as the distance within the window increases. The authors did not explore how the window type affects the experimental results and did not give any details on this choice.
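To make the windowing difference concrete, the following is a minimal sketch of weighted co-occurrence counting. The one-sided window, the particular linear weight formula, and the toy corpus are illustrative assumptions rather than the exact GloVe or Word2Vec implementations.

```python
from collections import defaultdict

def cooccurrences(tokens, window=3, shrinking=True):
    """Weighted co-occurrence counts with a one-sided sliding window.

    Only the right context of each word is counted here, so the counts
    keep the directional (asymmetric) information discussed above.
    shrinking=True uses a linearly decreasing weight, as the GloVe-style
    window is described in the text (the exact formula is an illustrative
    assumption); shrinking=False is the Word2Vec-style "constant" window
    where every pair inside the window gets weight 1.
    """
    counts = defaultdict(float)
    for i, center in enumerate(tokens):
        for d in range(1, window + 1):
            j = i + d
            if j >= len(tokens):
                break
            weight = (window - d + 1) / window if shrinking else 1.0
            counts[(center, tokens[j])] += weight
    return counts

corpus = "new york is a city in the state of new york".split()
X = cooccurrences(corpus)
print(X[("new", "york")], X[("york", "new")])  # 2.0 vs 0.0: "new york" occurs, "york new" never does
```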
C. FastText

The FastText model, in contrast to Word2Vec and GloVe, was built to preserve the property of representing word relations by distances and angles between their vectors. This change allows the model to perform better on text classification tasks. Similar to the two previous methods, FastText generates central and context vectors for each word and uses a sliding window to scan the corpus.

The main idea of this approach is to use character n-grams to build central vectors. During vocabulary construction, each word is saved together with its n-grams. For example, for the word "pencil" we also remember the 3-grams "pe", "pen", "enc", "nci", "cil" and "il" in addition to the whole word sequence. The 3-gram "pen" corresponding to the word "pencil" is different from the word "pen". During training, each sequence gets its own vector, and the resulting central vector is the sum of all n-gram vectors and the whole-word vector.

As mentioned above, FastText shows great results on text classification tasks, but Word2Vec and GloVe outperform it on word analogies tasks.

III. The SSDE Model

Word relations have an asymmetric nature; for that reason all three approaches above generate two vectors for each word. The question is how to combine these central and context vectors. In GloVe, for example, there are several modes for what to use as the resulting vector; the default mode is the sum of the central and context vectors. No intuition was given for this choice, although our experiments have shown that the default mode indeed performs best. It is possible that Word2Vec, GloVe and FastText use more parameters than they really need, which means more time and input data are required for training. The subject of our research was to find out whether asymmetric information about words really has to be included in the resulting vector. To do that we introduce the Symmetric Skew-symmetric Decomposition Embedding (SSDE). It is based on the GloVe model, mainly because GloVe is faster than the other existing models and performs better on word analogies metrics.

A. GloVe model analysis

The main idea of the GloVe model is that the relation of words w_i and w_j can be found by studying the ratio of their co-occurrence probabilities with various probe words, P(w_i, w_k) / P(w_j, w_k), where w_k is a probe word. So the general model can be written as

    F((u_i − u_j)^T v_k) = P(w_i, w_k) / P(w_j, w_k).    (1)

The authors argue that, due to the exchangeability of words and context words, the function F should be a homomorphism:

    F((u_i − u_j)^T v_k) = F(u_i^T v_k) / F(u_j^T v_k).    (2)

This formula suggests that the model F is exponential, which in combination with Eqn. (1) leads to

    u_i^T v_k = log P_ik = log X_ik − log X_i.    (3)

After that GloVe introduces biases into the formula: log X_i does not depend on the probe word k and is replaced with the bias b_i^u. For word-context exchange symmetry the context bias b_k^v is also included:

    u_i^T v_k + b_i^u + b_k^v = log X_ik.    (4)

In this equation, the right-hand side is the information the model has to learn and the left-hand side is how GloVe preserves it. This is optimized with a weighted least squares regression model. As a result, the GloVe target function is

    J = Σ_{i,j=1}^{|V|} f(X_ij) · (u_i^T v_j + b_i^u + b_j^v − log X_ij)^2,    (5)

where
• X is the co-occurrence matrix,
• |V| is the vocabulary size,
• u_i and b_i^u are the central vector and bias of word w_i,
• v_j and b_j^v are the context vector and bias of word w_j.
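For concreteness, a minimal sketch of the target function in Eqn. (5). The weighting f(x) = min((x / x_max)^α, 1) with x_max = 100 and α = 0.75 is the one proposed in the GloVe article [7]; the dense-matrix layout and the toy inputs are illustrative assumptions.

```python
import numpy as np

def glove_loss(X, U, V, b_u, b_v, x_max=100.0, alpha=0.75):
    """Weighted least squares target of Eqn. (5), summed over non-zero X_ij.

    X        -- |V| x |V| co-occurrence matrix
    U, V     -- |V| x d central and context vectors
    b_u, b_v -- |V| central and context biases
    """
    loss = 0.0
    rows, cols = np.nonzero(X)
    for i, j in zip(rows, cols):
        f = min((X[i, j] / x_max) ** alpha, 1.0)              # weighting f(X_ij)
        err = U[i] @ V[j] + b_u[i] + b_v[j] - np.log(X[i, j])
        loss += f * err ** 2
    return loss

# Toy usage with random values (illustrative only):
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(6, 6)).astype(float)
U, V = rng.normal(size=(6, 4)), rng.normal(size=(6, 4))
b_u, b_v = np.zeros(6), np.zeros(6)
print(glove_loss(X, U, V, b_u, b_v))
```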
The introduction of encoding and decoding biases is not mathematically justified in the GloVe article, but our experiments have shown that the model does not work without them. We explain this by the similarity of the target function to the mutual information formula:

    D_KL = Σ_{i,j} p(w_i, w_j) · log [ p(w_i, w_j) / (p(w_i) p(w_j)) ].    (6)

We are actually looking for an embedding that preserves the ratio in the logarithmic part of Eqn. (6). The ratio represents how much more often a combination of two words occurs in the corpus than each of them individually. The information that the model encodes is the same quantity conditioned on the model F:

    I = Σ_{i,j} p(w_i, w_j) · log [ p(w_i, w_j | F) / (p(w_i | F) p(w_j | F)) ].    (7)

Rewriting Eqn. (6) and Eqn. (7) in GloVe notation and combining them with the weighted least squares method gives a function very similar to the GloVe target function:

    p(w_i, w_j | F) / (p(w_i | F) p(w_j | F)) ⇒ e^{u_i^T v_j},
    log [ p(w_i, w_j) / (p(w_i) p(w_j)) ] ⇒ log X_ij − b_i^u − b_j^v,

    J* = Σ_{i,j} p(w_i, w_j) · (u_i^T v_j + log p(w_i) + log p(w_j) − log p(w_i, w_j))^2.    (8)

The joint probability of words w_i and w_j is what the GloVe model encodes as the co-occurrence matrix X_ij, and the prior probabilities of the words are encoded as the biases b_i^u and b_j^v. In our experiments we tried both options and obtained similar results for biases and for probabilities. For that reason, we use prior probabilities in SSDE to decrease computational complexity.

B. Our model

From Eqn. (5) we see that GloVe represents word relations with the dot product of their central and context vectors, u^T v. This is done to account for the asymmetry property, which we want to remove. The dot product of a central and a context vector equals the product of the corresponding one-hot vectors with the product of the central and context matrices. The product of the central and context matrices can be considered a linear operator, and any linear operator can be decomposed into a sum of a symmetric and a skew-symmetric matrix [10]:

    u_i^T v_j = h_i^T U^T V h_j,  where  L = U^T V = S + K.    (9)

The symmetric matrix S can then (according to the properties of symmetric matrices) be written as a product of some low-rank matrix and its transpose. The same transformation can be used for the skew-symmetric matrix K after multiplying its lower-triangular part by −1:

    l_ij = s_ij + k_ij = a_i^T a_j + ξ_ij · c_i^T c_j.    (10)

The size of the matrix A is |V| × l, where |V| is the vocabulary size and l is the symmetric representation size of a word. The size of the matrix C is |V| × m, where m is the asymmetric representation size of a word. By balancing the symmetric and skew-symmetric sizes we control the distribution of information the way we need. For example, to reduce the influence of asymmetric information on the resulting word representation, we make the constant m much smaller than l.
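The decomposition in Eqns. (9)-(10) is easy to check numerically; below is a minimal sketch in which the stand-in operator, the sizes and the random values are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
L = rng.normal(size=(5, 5))          # stand-in for the operator U^T V

S = (L + L.T) / 2                    # symmetric part, S == S.T
K = (L - L.T) / 2                    # skew-symmetric part, K == -K.T
assert np.allclose(L, S + K)         # Eqn. (9): L = S + K

# Low-rank parameterisation of the two parts, as in Eqn. (10):
vocab_size, sym_size, asym_size = 5, 3, 2
A = rng.normal(size=(vocab_size, sym_size))    # rows a_i build the symmetric term
C = rng.normal(size=(vocab_size, asym_size))   # rows c_i build the skew-symmetric term
xi = np.where(np.arange(vocab_size)[:, None] > np.arange(vocab_size)[None, :], -1.0, 1.0)
L_hat = A @ A.T + xi * (C @ C.T)     # l_ij = a_i^T a_j + xi_ij * c_i^T c_j
```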
In total, after rewriting the GloVe target function (5) with Eqn. (10) and using the prior probabilities instead of biases, we get the SSDE target function:

    Q = Σ_{i,j=1}^{|V|} f(p_ij) · (a_i^T a_j + ξ_ij · c_i^T c_j + log p_i + log p_j − log p_ij)^2,    (11)

where
• p_ij = p(w_i, w_j) and p_i = p(w_i) are counted from the input corpus before the training process,
• ξ_ij = −1 if i > j, otherwise ξ_ij = 1.

On each training step we iterate through the word-word co-occurrence matrix X, where each co-occurrence x_ij shows how many times word w_i appeared in the context of word w_j. We compute the gradients for the symmetric and skew-symmetric vectors and update them accordingly.

The resulting word embeddings are the vectors (rows) of the symmetric matrix A. Since we want to remove the influence of asymmetric information on the resulting word representations, the vectors c_i are only used during training. However, their properties are worth further study.

There are two ways to optimize function (11): gradient descent and stochastic gradient descent. The advantage of full gradient descent is that it eventually converges to better results, but stochastic gradient descent has several variants that achieve reasonable results much faster. Since we wanted to reduce training time, we followed GloVe and used adaptive stochastic gradient descent. The GloVe authors also noticed that the values change only slightly on each stochastic gradient iteration, which means the computations can be done in parallel.
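A minimal sketch of the SSDE target function Q from Eqn. (11) follows. Since the text does not pin down f(p_ij), the sketch uses p_ij itself as the weight, as in Eqn. (8); the toy inputs are illustrative assumptions.

```python
import numpy as np

def ssde_loss(P, p, A, C):
    """SSDE target function Q of Eqn. (11), summed over non-zero p_ij.

    P -- |V| x |V| joint probabilities p_ij
    p -- |V| prior word probabilities p_i
    A -- |V| x l symmetric vectors a_i (the resulting embeddings)
    C -- |V| x m skew-symmetric vectors c_i (used only during training)

    The weight f(p_ij) is taken to be p_ij itself, as in Eqn. (8);
    the exact f used by SSDE is an assumption here.
    """
    loss = 0.0
    rows, cols = np.nonzero(P)
    for i, j in zip(rows, cols):
        xi = -1.0 if i > j else 1.0
        err = (A[i] @ A[j] + xi * (C[i] @ C[j])
               + np.log(p[i]) + np.log(p[j]) - np.log(P[i, j]))
        loss += P[i, j] * err ** 2
    return loss

# Toy usage (illustrative): probabilities from made-up counts.
rng = np.random.default_rng(0)
counts = rng.integers(1, 6, size=(6, 6)).astype(float)
P = counts / counts.sum()
p = P.sum(axis=1)                      # marginal probabilities p_i
A, C = rng.normal(size=(6, 4)), rng.normal(size=(6, 2))
print(ssde_loss(P, p, A, C))
```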
The GloVe model shuffles the whole co-occurrence matrix on each step of stochastic gradient descent:

    Σ_{i=1}^{|V|} Σ_{j=1}^{|V|} f(X_ij) · (u_i^T v_j + b_i^u + b_j^v − log X_ij)^2
        = E_{i,j ∼ U(X)} f(X_ij) · (u_i^T v_j + b_i^u + b_j^v − log X_ij)^2.    (12)

In the SSDE model we shuffle only the rows of the co-occurrence matrix:

    Σ_{i=1}^{|V|} Σ_{j=1}^{|V|} f(p_ij) · (a_i^T a_j + ξ_ij · c_i^T c_j + log p_i + log p_j − log p_ij)^2
        = E_i Σ_{j=1}^{|V|} f(p_ij) · (a_i^T a_j + ξ_ij · c_i^T c_j + log p_i + log p_j − log p_ij)^2.    (13)

Shuffling rows without shuffling columns makes the computations cache-friendly, reducing the cache-miss rate. This change allowed us to optimize model performance while the quality remained the same.

IV. Experiments

A. Evaluation

To compare SSDE with GloVe we used the metrics suggested in the GloVe article. All the metrics are based on the word analogies property. There are four words w1, w2, w3, w4, all associated with one topic, such that "w1 is related to w2 the same way w3 is related to w4". In vector terms this can be presented as

    w2 − w1 = w4 − w3.

By simple arithmetic this can be rewritten as

    w2 − w1 + w3 = w4.    (∗)

The testing algorithm is: 1) take the first three input words and compute the left-hand side of (∗); 2) among all vectors of our vocabulary find the vector v closest to the result of the previous step (using cosine similarity); 3) if the word corresponding to v equals w4, the experiment is counted as successful, otherwise it fails. A sketch of this procedure is given after Tab. I.

We do not provide a comparison with the CBOW or Skip-gram models because, as shown in the article [7], GloVe performs better than these baselines.

Tab. I lists all the metrics used to evaluate both the GloVe and SSDE models. Five of these metrics are semantic, for example,

    "King" − "Man" + "Woman" = "Queen",

while the other nine are syntactic, for example,

    "Dangerous" − "Danger" + "Beauty" = "Beautiful".

Table I. GloVe word analogies metrics

Type      | Name                     | Example                                 | Size
Semantic  | Capital common countries | Greece to Athens as Iraq to Baghdad     | 506
Semantic  | Capital world            | Nigeria to Abuja as Ghana to Accra      | 4525
Semantic  | City in state            | Illinois to Chicago as Texas to Houston | 2467
Semantic  | Currency                 | Dinar to Algeria as Kwanza to Angola    | 866
Semantic  | Family                   | Brother to Boy as Sister to Girl        | 506
Syntactic | Adjective to adverb      | Calm to Calmly as Happy to Happily      | 992
Syntactic | Opposite                 | Aware to Unaware as Clear to Unclear    | 812
Syntactic | Comparative              | Worse to Bad as Bigger to Big           | 1332
Syntactic | Superlative              | Worst to Bad as Biggest to Big          | 1122
Syntactic | Present Participle       | Coding to Code as Dancing to Dance      | 1056
Syntactic | Nationality Adjective    | China to Chinese as Poland to Polish    | 1599
Syntactic | Past Tense               | Danced to Dancing as Flew to Flying     | 1560
Syntactic | Plural                   | Bananas to Banana as Birds to Bird      | 1332
Syntactic | Plural Verbs             | Eats to Eat as Says to Say              | 870
Total     |                          |                                         | 19545
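The testing procedure of Sec. IV-A can be sketched as follows; the data layout (an embedding matrix with a word-to-row dictionary) and the exclusion of the three query words from the candidates are assumptions made for the illustration, not details taken from the text.

```python
import numpy as np

def solve_analogy(E, vocab, w1, w2, w3):
    """Return the vocabulary word whose vector is closest (by cosine
    similarity) to E[w2] - E[w1] + E[w3].

    E     -- |V| x d matrix of word embeddings
    vocab -- dict mapping word -> row index in E
    The three query words are excluded from the candidates (a common
    convention for this benchmark).
    """
    target = E[vocab[w2]] - E[vocab[w1]] + E[vocab[w3]]
    sims = (E @ target) / (np.linalg.norm(E, axis=1) * np.linalg.norm(target) + 1e-12)
    for w in (w1, w2, w3):
        sims[vocab[w]] = -np.inf
    best = int(np.argmax(sims))
    inverse = {i: w for w, i in vocab.items()}
    return inverse[best]

def analogy_accuracy(E, vocab, questions):
    """questions: list of (w1, w2, w3, w4) tuples; returns the success rate."""
    hits = sum(solve_analogy(E, vocab, w1, w2, w3) == w4
               for w1, w2, w3, w4 in questions)
    return hits / len(questions)
```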
B. Results

We compared the GloVe and SSDE models on a corpus composed of 100 MB of articles from the English Wikipedia. For corpus scanning we used a symmetric shrinking window of size 30. All models were trained to convergence. A study of the constant window and of asymmetric windows is left for future work.

Tab. II shows the performance of GloVe and SSDE models with an equal number of trained parameters. Our approach significantly improves the scores both for semantic and for syntactic tasks.

Tab. III shows the results of GloVe and SSDE models with equal word embedding vector sizes. As mentioned above, the GloVe model uses the sum of central and context vectors as the resulting representation, while the SSDE model uses only the symmetric vector. Similar or even higher scores can be obtained with SSDE at the same representation size as GloVe, but with almost half the number of parameters.

All the results were obtained on an Intel Core i7 processor with 8 GB of DDR4 memory.

Table II. Experiments with an equal number of parameters

Model       | Avg. sem. score | Avg. synt. score | Avg. total score | Sec. per iter.
GloVe-25    | 12.9%           | 12.3%            | 12.5%            | 13
SSDE-40-5   | 19.1%           | 22.5%            | 20.4%            | 11
SSDE-40-10  | 19.6%           | 22.6%            | 20.8%            | 12
GloVe-50    | 20.3%           | 29.4%            | 23.9%            | 18
SSDE-80-5   | 24.3%           | 39.8%            | 30.6%            | 17
SSDE-80-10  | 25.1%           | 39.7%            | 31.0%            | 18
SSDE-80-20  | 25.5%           | 41.4%            | 31.9%            | 20
GloVe-80    | 25.3%           | 40.2%            | 31.3%            | 24
SSDE-120-10 | 25.3%           | 48.1%            | 34.5%            | 25
SSDE-120-20 | 26.3%           | 47.8%            | 35.1%            | 27

"GloVe-n": vector size n. "SSDE-l-m": symmetric size l, asymmetric size m.

Table III. Experiments with equal representation size

Model       | Avg. sem. score | Avg. synt. score | Avg. total score | Sec. per iter.
GloVe-50    | 20.3%           | 29.4%            | 23.9%            | 18
SSDE-50-5   | 22.5%           | 28.7%            | 25.0%            | 12
SSDE-50-10  | 21.8%           | 28.3%            | 24.3%            | 14
SSDE-50-20  | 21.7%           | 28.5%            | 24.5%            | 15
GloVe-80    | 25.3%           | 40.2%            | 31.3%            | 24
SSDE-80-5   | 24.3%           | 39.8%            | 30.6%            | 17
SSDE-80-10  | 25.1%           | 39.7%            | 31.0%            | 18
SSDE-80-20  | 25.5%           | 41.4%            | 31.9%            | 20
GloVe-120   | 27.6%           | 44.0%            | 34.2%            | 34
SSDE-120-10 | 25.3%           | 48.1%            | 34.5%            | 25
SSDE-120-20 | 26.3%           | 47.8%            | 35.1%            | 27
GloVe-150   | 27.8%           | 45.2%            | 34.8%            | 79

"GloVe-n": vector size n. "SSDE-l-m": symmetric size l, asymmetric size m.

V. Conclusion

A. Achievements

In this paper, we studied whether asymmetric information about word relationships is necessary for word embeddings. We demonstrated that our approach outperforms the GloVe model on word analogies metrics while computing half as many parameters. This supports our initial assumption that the influence of asymmetric information on word embeddings can be significantly reduced, thus reducing the time required to train the model.

We showed that it is possible to train high-quality word vectors using only a little information about the asymmetry of relations, compared to GloVe, the popular word embedding model with the highest scores on word analogies tasks. Since our approach computes half as many parameters, it requires less time to train.

We analyzed the GloVe model and introduced a new model, SSDE, that combines the advantages of GloVe with our ideas on asymmetric relations. The comparison of SSDE with GloVe has shown that our model outperforms GloVe on word analogies metrics, while GloVe, according to the article [7], outperforms the CBOW and Skip-gram models.

B. Future work

The SSDE model, like GloVe and Word2Vec, uses a sliding window to scan the corpus. We assume that, depending on the type of window used, the results may differ across metric types. Constant windows might perform better on synonym search tasks, while the shrinking window could be a good choice for word analogies tasks. In future work we will examine the influence of the window type on different metric types.

Currently, we use only the vectors carrying symmetric information for the resulting word embeddings. However, there might be some interesting information encoded in the asymmetric vectors. For example, L1-regularization turns most of the skew-symmetric vectors to zero; there might be some connection between the words whose skew-symmetric vectors remain non-zero. In future work we will study the asymmetric component of SSDE and analyze whether there is any pattern that might increase performance on some tasks.

The influence of window size and window symmetry on model performance is another aspect that was not examined. The importance of asymmetric information might increase for highly asymmetric windows.
References

[1] T. Mikolov, W.-t. Yih, and G. Zweig, "Linguistic regularities in continuous space word representations," in HLT-NAACL, 2013, pp. 746-751.
[2] F. Sebastiani, "Machine learning in automated text categorization," ACM Computing Surveys, vol. 34, no. 1, pp. 1-47, 2002. [Online]. Available: http://nmis.isti.cnr.it/sebastiani/Publications/ACMCS02.pdf
[3] W. L. Hamilton, K. Clark, J. Leskovec, and D. Jurafsky, "Inducing domain-specific sentiment lexicons from unlabeled corpora," in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2016, pp. 595-605. [Online]. Available: http://aclweb.org/anthology/D16-1057
[4] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, "Deep contextualized word representations," in Proc. of NAACL, 2018.
[5] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[6] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," CoRR, vol. abs/1301.3781, 2013. [Online]. Available: http://dblp.uni-trier.de/db/journals/corr/corr1301.html#abs-1301-3781
[7] J. Pennington, R. Socher, and C. D. Manning, "GloVe: Global vectors for word representation," in Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532-1543. [Online]. Available: http://www.aclweb.org/anthology/D14-1162
[8] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, "Enriching word vectors with subword information," Transactions of the Association for Computational Linguistics, vol. 5, pp. 135-146, 2017.
[9] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Advances in Neural Information Processing Systems 26, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2013, pp. 3111-3119.
[10] F. Gantmacher, The Theory of Matrices, vol. 1. Chelsea Pub. Co., 1960. [Online]. Available: https://books.google.ru/books?id=GOdQAAAAMAAJ