Word embedding in form of symmetric and skew-symmetric operator

Ekaterina Koshchenko
National Research University Higher School of Economics in Saint-Petersburg
Saint-Petersburg, Russia
catherine.pths@gmail.com

Igor Kuralenok
JetBrains Research
Saint-Petersburg, Russia
ikuralenok@gmail.com

Abstract—Existing word embedding models represent each word with two real-valued vectors: central and context. This is a consequence of the asymmetric nature of word relations, and it requires more time and data for training. We introduce a new approach to asymmetric relations that uses the advantages of the global vectors model. Due to the reduced impact of asymmetric information on the resulting word representations, our model converges faster and outperforms existing models on word analogies tasks.

Index Terms—SSDE, word embedding, matrix decomposition

I. Introduction

Understanding word relations in natural language is an easy task for a human but not for a computer. We need to teach computers how words are related and what meanings they have, depending on the context. To make it possible for a machine to process words, they have to be presented in a digitized format. This leads to the idea of real-valued vector representations, word embeddings. Most works on word embeddings focus on preserving two word properties in their representations. The first property is that word relations and similarities can be described using distances and angles between word vectors. For example, the closer-further feature: "yellow" is closer to "red" than to "smart". In vector form it can be presented as

    ||Yellow − Red|| < ||Yellow − Smart||.

This property is widely used for synonym search. The second property is word analogies. The corresponding feature was introduced by Mikolov et al. [1] and is designed to capture word similarities. For example, "Paris" and "France" have the same connection as "Budapest" and "Hungary". In vector form we can present it as

    France − Paris = Hungary − Budapest.

This approach benefits models that create meaning-based word vectors, while the closer-further feature is more practical and can be applied to clustering and classification tasks.
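As an illustration of these two properties, here is a minimal Python sketch with made-up toy vectors; the numbers are hypothetical and are not taken from any trained model.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, for illustration only.
vec = {
    "yellow":   np.array([0.9, 0.1, 0.0]),
    "red":      np.array([0.8, 0.2, 0.1]),
    "smart":    np.array([0.0, 0.9, 0.8]),
    "paris":    np.array([0.5, 0.6, 0.1]),
    "france":   np.array([0.9, 0.7, 0.2]),
    "budapest": np.array([0.4, 0.5, 0.6]),
    "hungary":  np.array([0.8, 0.6, 0.7]),
}

# Closer-further property: ||yellow - red|| < ||yellow - smart||.
print(np.linalg.norm(vec["yellow"] - vec["red"])
      < np.linalg.norm(vec["yellow"] - vec["smart"]))   # True for these toy vectors

# Analogy property: france - paris should be close to hungary - budapest.
print(np.linalg.norm((vec["france"] - vec["paris"])
                     - (vec["hungary"] - vec["budapest"])))  # small if the analogy holds
```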
Word embeddings were originally created to be used in Natural Language Processing tasks. For example, one of the feature extraction techniques used for document indexing is latent semantic indexing [2]; it is a precursor of word embeddings that embodies the same principles and ideas. Another task is sentiment analysis. One solution to this problem is the SentProp framework [3], which combines the label propagation method with word embeddings to learn sentiment lexicons on domain-specific corpora. Another way to solve some Natural Language Processing tasks is language models; nowadays the state-of-the-art approaches to language modeling are ELMo [4] and BERT [5]. Each of these methods uses prebuilt word embeddings as input data and can benefit from better embedding models. Therefore, creating better embedding models is still a relevant task.

Three word embedding models are the most popular and widely used. Word2Vec is a local window-based method presented by Mikolov et al. [6]. It preserves the word analogies feature, bringing the vectors of words appearing in similar contexts closer together. Another approach is GloVe [7], which is trained on word-word co-occurrence counts. Its authors noticed that the relation of two words can be understood by examining the ratio of their co-occurrence probabilities with various probe words, which also yields the word analogies feature. The third model, FastText [8], is focused on the distances/angles property. FastText uses character n-grams to enrich word vectors with subword information. This approach allows the model to use morphological information, therefore choosing better vectors for rare words, and makes it possible to learn something even for out-of-vocabulary words.

Word relations are often asymmetric. For example, "New York" is a common combination of words meaning the name of a city in the USA. However, "York New" is quite a rare combination and does not mean anything specific. In all the models mentioned above, word interaction is expressed in terms of the dot product of word vectors, which leads to the generation of two vectors for each word: central and context. For that reason, twice as many parameters have to be computed and, consequently, more time is required for learning. To solve this problem, asymmetric relations between word representations can be used instead of the dot product of central and context vectors.

In this work, we propose a Symmetric Skew-symmetric Decomposition based model. We demonstrate that our method outperforms the GloVe approach on its own word analogies metrics.

II. Related work

Many word embedding models are known from the literature, but most of them are based on three principal approaches: Word2Vec [6], GloVe [7] and FastText [8]. All three models are widely used in language models and Natural Language Processing applications.

A. Word2Vec

Word2Vec is an approach introduced by Mikolov et al. [6] that preserves the word analogies property. It suggests two language models: Skip-gram and CBOW. Both methods represent word relationships with the dot product of word vectors. As described in the introduction, relations can be asymmetric, which leads to the use of two vectors per word: central and context. Skip-gram and CBOW scan the corpus with a sliding window. All words inside the window are considered to be in the same context, i.e. connected to each other. In both models all words inside one window get the same co-occurrence weight, i.e. they are treated as equal. We call this type of window a "constant" window.

Continuous Bag of Words (CBOW) is a model trained on the task "predict the middle word given the surrounding context". The method tries to choose central and context vectors so that the probability of predicting the word in the middle of the sliding window, based on the rest of the window, is high. The second model is called Skip-gram and is trained on the inverse problem: predict the context given just the one word in the center of the sliding window.

On each training step, for each word, both methods have to compute the probability of the middle word co-occurring with every other word in the vocabulary, which makes the computational complexity too high. In a later article [9] this problem was solved for the Skip-gram model with Negative Sampling. Negative Sampling suggests computing the probability of the middle word sharing a context only with a constant number of positive and negative samples. Positive samples are words that often appear in one window with the middle word; they can be found before the training process. Negative samples are words that are unlikely to appear in context with the middle word. Mikolov et al. suggest drawing negative samples from the unigram distribution raised to the 3/4 power. This approach accelerates Skip-gram calculations while keeping the same quality.

Experiments have shown that the Skip-gram method performs better on semantic tasks, while the syntactic results of the two models are very similar. Since Skip-gram is easier to train than CBOW with the same or even better results, later models use Skip-gram.

Skip-gram and CBOW have several drawbacks. First, training time depends on the corpus size. Second, two vectors are generated for each word, which requires more time and input data for training.

B. GloVe

The GloVe (Global Vectors) model, suggested by Pennington et al. [7], also aims to preserve word analogies. The relationship of two words can be learned by examining their relations with other words. In this approach word relationships are represented with a co-occurrence matrix X, where x_ij is the number of times word w_i appeared in the context of word w_j. This matrix is constructed before the training process with one scan of the corpus. On each learning step we iterate through the co-occurrence matrix and, for each non-zero co-occurrence x_ij, update the central and context vectors of the corresponding words according to the value and direction of the target function gradient.

In GloVe each word is presented with two vectors, similar to Word2Vec. A sliding window is also used to scan the corpus when the co-occurrence matrix is constructed. Unlike the Word2Vec "constant" window, GloVe uses a "shrinking" window: the weight of a co-occurrence linearly decreases as the distance within the window increases. The authors did not explore how the window type affects the experimental results and did not give any details on this choice.
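To make the windowing difference concrete, the following is a minimal sketch of weighted co-occurrence counting. The one-sided window, the particular linear weight formula, and the toy corpus are illustrative assumptions rather than the exact GloVe or Word2Vec implementations.

```python
from collections import defaultdict

def cooccurrences(tokens, window=3, shrinking=True):
    """Weighted co-occurrence counts with a one-sided sliding window.

    Only the right context of each word is counted here, so the counts
    keep the directional (asymmetric) information discussed above.
    shrinking=True uses a linearly decreasing weight, as the GloVe-style
    window is described in the text (the exact formula is an illustrative
    assumption); shrinking=False is the Word2Vec-style "constant" window
    where every pair inside the window gets weight 1.
    """
    counts = defaultdict(float)
    for i, center in enumerate(tokens):
        for d in range(1, window + 1):
            j = i + d
            if j >= len(tokens):
                break
            weight = (window - d + 1) / window if shrinking else 1.0
            counts[(center, tokens[j])] += weight
    return counts

corpus = "new york is a city in the state of new york".split()
X = cooccurrences(corpus)
print(X[("new", "york")], X[("york", "new")])  # 2.0 vs 0.0: "new york" occurs, "york new" never does
```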
C. FastText

The FastText model, in contrast to Word2Vec and GloVe, was built to preserve the property of representing word relations by distances and angles between their vectors. This change allows the model to perform better on text classification tasks. Similar to the two previous methods, FastText generates central and context vectors for each word and uses a sliding window to scan the corpus.

The main idea of this approach is to use character n-grams to build central vectors. During vocabulary construction, each word is saved together with its n-grams. For example, for the word "pencil" we also remember the 3-grams "pe", "pen", "enc", "nci", "cil" and "il" in addition to the whole word sequence. The 3-gram "pen" corresponding to the word "pencil" is different from the word "pen". During training, each sequence gets its own vector, and the resulting central vector is the sum of all n-gram vectors and the whole-word vector.

As mentioned above, FastText shows great results on text classification tasks, but Word2Vec and GloVe outperform it on word analogies tasks.

III. The SSDE Model

Word relations have an asymmetric nature; for that reason all three approaches above generate two vectors for each word. The question is how to combine these central and context vectors. In GloVe, for example, there are several modes for what to use as the resulting vector; the default mode is the sum of the central and context vectors. No intuition was given for this choice, although our experiments have shown that the default mode indeed performs best. It is possible that Word2Vec, GloVe and FastText use more parameters than they really need, which means more time and input data are required for training. The subject of our research was to find out whether asymmetric information about words really has to be included in the resulting vector. To do that we introduce the Symmetric Skew-symmetric Decomposition Embedding (SSDE). It is based on the GloVe model, mainly because GloVe is faster than the other existing models and performs better on word analogies metrics.

A. GloVe model analysis

The main idea of the GloVe model is that the relation of words w_i and w_j can be found by studying the ratio of their co-occurrence probabilities with various probe words, P(w_i, w_k) / P(w_j, w_k), where w_k is a probe word. So the general model can be written as

    F((u_i − u_j)^T v_k) = P(w_i, w_k) / P(w_j, w_k).    (1)

The authors argue that, due to the exchangeability of words and context words, the function F should be a homomorphism:

    F((u_i − u_j)^T v_k) = F(u_i^T v_k) / F(u_j^T v_k).    (2)

This formula suggests that the model F is exponential, which in combination with Eqn. (1) leads to

    u_i^T v_k = log P_ik = log X_ik − log X_i.    (3)

After that GloVe introduces biases into the formula: log X_i does not depend on the probe word k and is replaced with the bias b_i^u. For word-context exchange symmetry the context bias b_k^v is also included:

    u_i^T v_k + b_i^u + b_k^v = log X_ik.    (4)

In this equation, the right-hand side is the information the model has to learn and the left-hand side is how GloVe preserves it. This is optimized with a weighted least squares regression model. As a result, the GloVe target function is

    J = Σ_{i,j=1}^{|V|} f(X_ij) · (u_i^T v_j + b_i^u + b_j^v − log X_ij)^2,    (5)

where
• X is the co-occurrence matrix,
• |V| is the vocabulary size,
• u_i and b_i^u are the central vector and bias of word w_i,
• v_j and b_j^v are the context vector and bias of word w_j.
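For concreteness, a minimal sketch of the target function in Eqn. (5). The weighting f(x) = min((x / x_max)^α, 1) with x_max = 100 and α = 0.75 is the one proposed in the GloVe article [7]; the dense-matrix layout and the toy inputs are illustrative assumptions.

```python
import numpy as np

def glove_loss(X, U, V, b_u, b_v, x_max=100.0, alpha=0.75):
    """Weighted least squares target of Eqn. (5), summed over non-zero X_ij.

    X        -- |V| x |V| co-occurrence matrix
    U, V     -- |V| x d central and context vectors
    b_u, b_v -- |V| central and context biases
    """
    loss = 0.0
    rows, cols = np.nonzero(X)
    for i, j in zip(rows, cols):
        f = min((X[i, j] / x_max) ** alpha, 1.0)              # weighting f(X_ij)
        err = U[i] @ V[j] + b_u[i] + b_v[j] - np.log(X[i, j])
        loss += f * err ** 2
    return loss

# Toy usage with random values (illustrative only):
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(6, 6)).astype(float)
U, V = rng.normal(size=(6, 4)), rng.normal(size=(6, 4))
b_u, b_v = np.zeros(6), np.zeros(6)
print(glove_loss(X, U, V, b_u, b_v))
```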
The introduction of encoding and decoding biases is not mathematically justified in the GloVe article, but our experiments have shown that the model does not work without them. We explain this by the similarity of the target function to the mutual information formula:

    D_KL = Σ_{i,j} p(w_i, w_j) · log [ p(w_i, w_j) / (p(w_i) p(w_j)) ].    (6)

We are actually looking for an embedding that preserves the ratio in the logarithmic part of Eqn. (6). The ratio represents how much more often a combination of two words occurs in the corpus than each of them individually. The information that the model encodes is the same quantity conditioned on the model F:

    I = Σ_{i,j} p(w_i, w_j) · log [ p(w_i, w_j | F) / (p(w_i | F) p(w_j | F)) ].    (7)

Rewriting Eqn. (6) and Eqn. (7) in GloVe notation and combining them with the weighted least squares method gives a function very similar to the GloVe target function:

    p(w_i, w_j | F) / (p(w_i | F) p(w_j | F)) ⇒ e^{u_i^T v_j},
    log [ p(w_i, w_j) / (p(w_i) p(w_j)) ] ⇒ log X_ij − b_i^u − b_j^v,

    J* = Σ_{i,j} p(w_i, w_j) · (u_i^T v_j + log p(w_i) + log p(w_j) − log p(w_i, w_j))^2.    (8)

The joint probability of words w_i and w_j is what the GloVe model encodes as the co-occurrence matrix X_ij, and the prior probabilities of the words are encoded as the biases b_i^u and b_j^v. In our experiments we tried both options and obtained similar results for biases and for probabilities. For that reason, we use prior probabilities in SSDE to decrease computational complexity.

B. Our model

From Eqn. (5) we see that GloVe represents word relations with the dot product of their central and context vectors, u^T v. This is done to account for the asymmetry property, which we want to remove. The dot product of a central and a context vector equals the product of the corresponding one-hot vectors with the product of the central and context matrices. The product of the central and context matrices can be considered a linear operator, and any linear operator can be decomposed into a sum of a symmetric and a skew-symmetric matrix [10]:

    u_i^T v_j = h_i^T U^T V h_j,  where  L = U^T V = S + K.    (9)

The symmetric matrix S can then (according to the properties of symmetric matrices) be written as a product of some low-rank matrix and its transpose. The same transformation can be used for the skew-symmetric matrix K after multiplying its lower-triangular part by −1:

    l_ij = s_ij + k_ij = a_i^T a_j + ξ_ij · c_i^T c_j.    (10)

The size of the matrix A is |V| × l, where |V| is the vocabulary size and l is the symmetric representation size of a word. The size of the matrix C is |V| × m, where m is the asymmetric representation size of a word. By balancing the symmetric and skew-symmetric sizes we control the distribution of information the way we need. For example, to reduce the influence of asymmetric information on the resulting word representation, we make the constant m much smaller than l.
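The decomposition in Eqns. (9)-(10) is easy to check numerically; below is a minimal sketch in which the stand-in operator, the sizes and the random values are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
L = rng.normal(size=(5, 5))          # stand-in for the operator U^T V

S = (L + L.T) / 2                    # symmetric part, S == S.T
K = (L - L.T) / 2                    # skew-symmetric part, K == -K.T
assert np.allclose(L, S + K)         # Eqn. (9): L = S + K

# Low-rank parameterisation of the two parts, as in Eqn. (10):
vocab_size, sym_size, asym_size = 5, 3, 2
A = rng.normal(size=(vocab_size, sym_size))    # rows a_i build the symmetric term
C = rng.normal(size=(vocab_size, asym_size))   # rows c_i build the skew-symmetric term
xi = np.where(np.arange(vocab_size)[:, None] > np.arange(vocab_size)[None, :], -1.0, 1.0)
L_hat = A @ A.T + xi * (C @ C.T)     # l_ij = a_i^T a_j + xi_ij * c_i^T c_j
```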
In total, after rewriting the GloVe target function (5) with Eqn. (10) and using the prior probabilities instead of biases, we get the SSDE target function:

    Q = Σ_{i,j=1}^{|V|} f(p_ij) · (a_i^T a_j + ξ_ij · c_i^T c_j + log p_i + log p_j − log p_ij)^2,    (11)

where
• p_ij = p(w_i, w_j) and p_i = p(w_i) are counted from the input corpus before the training process,
• ξ_ij = −1 if i > j, otherwise ξ_ij = 1.

On each training step we iterate through the word-word co-occurrence matrix X, where each co-occurrence x_ij shows how many times word w_i appeared in the context of word w_j. We compute the gradients for the symmetric and skew-symmetric vectors and update them accordingly.

The resulting word embeddings are the vectors (rows) of the symmetric matrix A. Since we want to remove the influence of asymmetric information on the resulting word representations, the vectors c_i are only used during training. However, their properties are worth further study.

There are two ways to optimize function (11): gradient descent and stochastic gradient descent. The advantage of full gradient descent is that it eventually converges to better results, but stochastic gradient descent has several variants that achieve reasonable results much faster. Since we wanted to reduce training time, we followed GloVe and used adaptive stochastic gradient descent. The GloVe authors also noticed that the values change only slightly on each stochastic gradient iteration, which means the computations can be done in parallel.
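A minimal sketch of the SSDE target function Q from Eqn. (11) follows. Since the text does not pin down f(p_ij), the sketch uses p_ij itself as the weight, as in Eqn. (8); the toy inputs are illustrative assumptions.

```python
import numpy as np

def ssde_loss(P, p, A, C):
    """SSDE target function Q of Eqn. (11), summed over non-zero p_ij.

    P -- |V| x |V| joint probabilities p_ij
    p -- |V| prior word probabilities p_i
    A -- |V| x l symmetric vectors a_i (the resulting embeddings)
    C -- |V| x m skew-symmetric vectors c_i (used only during training)

    The weight f(p_ij) is taken to be p_ij itself, as in Eqn. (8);
    the exact f used by SSDE is an assumption here.
    """
    loss = 0.0
    rows, cols = np.nonzero(P)
    for i, j in zip(rows, cols):
        xi = -1.0 if i > j else 1.0
        err = (A[i] @ A[j] + xi * (C[i] @ C[j])
               + np.log(p[i]) + np.log(p[j]) - np.log(P[i, j]))
        loss += P[i, j] * err ** 2
    return loss

# Toy usage (illustrative): probabilities from made-up counts.
rng = np.random.default_rng(0)
counts = rng.integers(1, 6, size=(6, 6)).astype(float)
P = counts / counts.sum()
p = P.sum(axis=1)                      # marginal probabilities p_i
A, C = rng.normal(size=(6, 4)), rng.normal(size=(6, 2))
print(ssde_loss(P, p, A, C))
```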
The GloVe model shuffles the whole co-occurrence matrix on each step of stochastic gradient descent:

    Σ_{i=1}^{|V|} Σ_{j=1}^{|V|} f(X_ij) · (u_i^T v_j + b_i^u + b_j^v − log X_ij)^2
        = E_{i,j ∼ U(X)} f(X_ij) · (u_i^T v_j + b_i^u + b_j^v − log X_ij)^2.    (12)

In the SSDE model we shuffle only the rows of the co-occurrence matrix:

    Σ_{i=1}^{|V|} Σ_{j=1}^{|V|} f(p_ij) · (a_i^T a_j + ξ_ij · c_i^T c_j + log p_i + log p_j − log p_ij)^2
        = E_i Σ_{j=1}^{|V|} f(p_ij) · (a_i^T a_j + ξ_ij · c_i^T c_j + log p_i + log p_j − log p_ij)^2.    (13)

Shuffling rows without shuffling columns makes the computations cache-friendly, reducing the cache-miss rate. This change allowed us to optimize model performance while the quality remained the same.

IV. Experiments

A. Evaluation

To compare SSDE with GloVe we used the metrics suggested in the GloVe article. All the metrics are based on the word analogies property. There are four words w1, w2, w3, w4, all associated with one topic, such that "w1 is related to w2 the same way w3 is related to w4". In vector terms this can be presented as

    w2 − w1 = w4 − w3.

By simple arithmetic this can be rewritten as

    w2 − w1 + w3 = w4.    (∗)

The testing algorithm is: 1) take the first three input words and compute the left-hand side of (∗); 2) among all vectors of our vocabulary find the vector v closest to the result of the previous step (using cosine similarity); 3) if the word corresponding to v equals w4, the experiment is counted as successful, otherwise it fails. A sketch of this procedure is given after Tab. I.

We do not provide a comparison with the CBOW or Skip-gram models because, as shown in the article [7], GloVe performs better than these baselines.

Tab. I lists all the metrics used to evaluate both the GloVe and SSDE models. Five of these metrics are semantic, for example,

    "King" − "Man" + "Woman" = "Queen",

while the other nine are syntactic, for example,

    "Dangerous" − "Danger" + "Beauty" = "Beautiful".

Table I. GloVe word analogies metrics

Type      | Name                     | Example                                 | Size
Semantic  | Capital common countries | Greece to Athens as Iraq to Baghdad     | 506
Semantic  | Capital world            | Nigeria to Abuja as Ghana to Accra      | 4525
Semantic  | City in state            | Illinois to Chicago as Texas to Houston | 2467
Semantic  | Currency                 | Dinar to Algeria as Kwanza to Angola    | 866
Semantic  | Family                   | Brother to Boy as Sister to Girl        | 506
Syntactic | Adjective to adverb      | Calm to Calmly as Happy to Happily      | 992
Syntactic | Opposite                 | Aware to Unaware as Clear to Unclear    | 812
Syntactic | Comparative              | Worse to Bad as Bigger to Big           | 1332
Syntactic | Superlative              | Worst to Bad as Biggest to Big          | 1122
Syntactic | Present Participle       | Coding to Code as Dancing to Dance      | 1056
Syntactic | Nationality Adjective    | China to Chinese as Poland to Polish    | 1599
Syntactic | Past Tense               | Danced to Dancing as Flew to Flying     | 1560
Syntactic | Plural                   | Bananas to Banana as Birds to Bird      | 1332
Syntactic | Plural Verbs             | Eats to Eat as Says to Say              | 870
Total     |                          |                                         | 19545
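The testing procedure of Sec. IV-A can be sketched as follows; the data layout (an embedding matrix with a word-to-row dictionary) and the exclusion of the three query words from the candidates are assumptions made for the illustration, not details taken from the text.

```python
import numpy as np

def solve_analogy(E, vocab, w1, w2, w3):
    """Return the vocabulary word whose vector is closest (by cosine
    similarity) to E[w2] - E[w1] + E[w3].

    E     -- |V| x d matrix of word embeddings
    vocab -- dict mapping word -> row index in E
    The three query words are excluded from the candidates (a common
    convention for this benchmark).
    """
    target = E[vocab[w2]] - E[vocab[w1]] + E[vocab[w3]]
    sims = (E @ target) / (np.linalg.norm(E, axis=1) * np.linalg.norm(target) + 1e-12)
    for w in (w1, w2, w3):
        sims[vocab[w]] = -np.inf
    best = int(np.argmax(sims))
    inverse = {i: w for w, i in vocab.items()}
    return inverse[best]

def analogy_accuracy(E, vocab, questions):
    """questions: list of (w1, w2, w3, w4) tuples; returns the success rate."""
    hits = sum(solve_analogy(E, vocab, w1, w2, w3) == w4
               for w1, w2, w3, w4 in questions)
    return hits / len(questions)
```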
B. Results

We compared the GloVe and SSDE models on a corpus composed of 100 MB of articles from the English Wikipedia. For corpus scanning we used a symmetric shrinking window of size 30. All models were trained to convergence. A study of the constant window and of asymmetric windows is left for future work.

Tab. II shows the performance of GloVe and SSDE models with an equal number of trained parameters. Our approach significantly improves the scores both for semantic and for syntactic tasks.

Tab. III shows the results of GloVe and SSDE models with equal word embedding vector sizes. As mentioned above, the GloVe model uses the sum of central and context vectors as the resulting representation, while the SSDE model uses only the symmetric vector. Similar or even higher scores can be obtained with SSDE at the same representation size as GloVe, but with almost half the number of parameters.

All the results were obtained on an Intel Core i7 processor with 8 GB of DDR4 memory.

Table II. Experiments with an equal number of parameters

Model       | Avg. sem. score | Avg. synt. score | Avg. total score | Sec. per iter.
GloVe-25    | 12.9%           | 12.3%            | 12.5%            | 13
SSDE-40-5   | 19.1%           | 22.5%            | 20.4%            | 11
SSDE-40-10  | 19.6%           | 22.6%            | 20.8%            | 12
GloVe-50    | 20.3%           | 29.4%            | 23.9%            | 18
SSDE-80-5   | 24.3%           | 39.8%            | 30.6%            | 17
SSDE-80-10  | 25.1%           | 39.7%            | 31.0%            | 18
SSDE-80-20  | 25.5%           | 41.4%            | 31.9%            | 20
GloVe-80    | 25.3%           | 40.2%            | 31.3%            | 24
SSDE-120-10 | 25.3%           | 48.1%            | 34.5%            | 25
SSDE-120-20 | 26.3%           | 47.8%            | 35.1%            | 27

"GloVe-n": vector size n. "SSDE-l-m": symmetric size l, asymmetric size m.

Table III. Experiments with equal representation size

Model       | Avg. sem. score | Avg. synt. score | Avg. total score | Sec. per iter.
GloVe-50    | 20.3%           | 29.4%            | 23.9%            | 18
SSDE-50-5   | 22.5%           | 28.7%            | 25.0%            | 12
SSDE-50-10  | 21.8%           | 28.3%            | 24.3%            | 14
SSDE-50-20  | 21.7%           | 28.5%            | 24.5%            | 15
GloVe-80    | 25.3%           | 40.2%            | 31.3%            | 24
SSDE-80-5   | 24.3%           | 39.8%            | 30.6%            | 17
SSDE-80-10  | 25.1%           | 39.7%            | 31.0%            | 18
SSDE-80-20  | 25.5%           | 41.4%            | 31.9%            | 20
GloVe-120   | 27.6%           | 44.0%            | 34.2%            | 34
SSDE-120-10 | 25.3%           | 48.1%            | 34.5%            | 25
SSDE-120-20 | 26.3%           | 47.8%            | 35.1%            | 27
GloVe-150   | 27.8%           | 45.2%            | 34.8%            | 79

"GloVe-n": vector size n. "SSDE-l-m": symmetric size l, asymmetric size m.

V. Conclusion

A. Achievements

In this paper, we studied whether asymmetric information about word relationships is necessary for word embeddings. We demonstrated that our approach outperforms the GloVe model on word analogies metrics while computing half as many parameters. This supports our initial assumption that the influence of asymmetric information on word embeddings can be significantly reduced, thus reducing the time required to train the model.

We showed that it is possible to train high-quality word vectors using only a little information about the asymmetry of relations, compared to GloVe, the popular word embedding model with the highest scores on word analogies tasks. Since our approach computes half as many parameters, it requires less time to train.

We analyzed the GloVe model and introduced a new model, SSDE, that combines the advantages of GloVe with our ideas on asymmetric relations. The comparison of SSDE with GloVe has shown that our model outperforms GloVe on word analogies metrics, while GloVe, according to the article [7], outperforms the CBOW and Skip-gram models.

B. Future work

The SSDE model, like GloVe and Word2Vec, uses a sliding window to scan the corpus. We assume that, depending on the type of window used, the results may differ across metric types. Constant windows might perform better on synonym search tasks, while the shrinking window could be a good choice for word analogies tasks. In future work we will examine the influence of the window type on different metric types.

Currently, we use only the vectors carrying symmetric information for the resulting word embeddings. However, there might be some interesting information encoded in the asymmetric vectors. For example, L1-regularization turns most of the skew-symmetric vectors to zero; there might be some connection between the words whose skew-symmetric vectors remain non-zero. In future work we will study the asymmetric component of SSDE and analyze whether there is any pattern that might increase performance on some tasks.

The influence of window size and window symmetry on model performance is another aspect that was not examined. The importance of asymmetric information might increase for highly asymmetric windows.
References

[1] T. Mikolov, W.-t. Yih, and G. Zweig, "Linguistic regularities in continuous space word representations," in HLT-NAACL, 2013, pp. 746-751.
[2] F. Sebastiani, "Machine learning in automated text categorization," ACM Computing Surveys, vol. 34, no. 1, pp. 1-47, 2002. [Online]. Available: http://nmis.isti.cnr.it/sebastiani/Publications/ACMCS02.pdf
[3] W. L. Hamilton, K. Clark, J. Leskovec, and D. Jurafsky, "Inducing domain-specific sentiment lexicons from unlabeled corpora," in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2016, pp. 595-605. [Online]. Available: http://aclweb.org/anthology/D16-1057
[4] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, "Deep contextualized word representations," in Proc. of NAACL, 2018.
[5] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[6] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," CoRR, vol. abs/1301.3781, 2013. [Online]. Available: http://dblp.uni-trier.de/db/journals/corr/corr1301.html#abs-1301-3781
[7] J. Pennington, R. Socher, and C. D. Manning, "GloVe: Global vectors for word representation," in Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532-1543. [Online]. Available: http://www.aclweb.org/anthology/D14-1162
[8] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, "Enriching word vectors with subword information," Transactions of the Association for Computational Linguistics, vol. 5, pp. 135-146, 2017.
[9] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Advances in Neural Information Processing Systems 26, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2013, pp. 3111-3119.
[10] F. Gantmacher, The Theory of Matrices, vol. 1. Chelsea Pub. Co., 1960. [Online]. Available: https://books.google.ru/books?id=GOdQAAAAMAAJ