
Optimization Task in Equivalent to word2vec Matrix Factorization

Victor Kantor

viktor.kantor@phystech.edu

Moscow Institute of Physics and Technologies; Yandex, Moscow, Russia

Omer Levy and Yoav Goldberg showed in their NIPS 2014 paper that word2vec is equivalent to the factorization of a shifted PMI matrix. A question discussed neither in that work nor in later papers is the right choice of norm for approximating the matrix. The authors also presented experiments with SVD, which approximates the matrix with respect to the Frobenius norm. In this work we show that a weighted Frobenius norm is a reasonable choice, but that the weights should not all be equal to one, as they are in Levy and Goldberg's experiments. We conjecture that the right choice of weights could help to improve matrix factorization results on analogy questions, where skip-gram with negative sampling (SGNS) remains superior to SVD.

word2vec, SGNS, matrix factorization, SVD, distributional semantics

Word2vec is a powerful and popular natural language processing technique proposed by Mikolov et al. in [1]. It produces word representations with useful properties: the dot product of word2vec vectors is a good similarity measure, and arithmetic operations on the vectors help to solve some analogy tasks (a popular example: "queen − woman + man ≈ king").

In [3] it was shown that word2vec is similar to a matrix factorization technique well known in NLP and collaborative filtering (CF). The matrix being factorized is almost the PMI matrix, which is also a common object in NLP and CF.

The main difference between the matrix factorization in [3] and the common use of matrix factorization techniques is that the result of Levy and Goldberg is valid for exact factorization of the matrix, whereas in NLP and CF applications we usually work with approximate factorization.

The default choice of norm for approximating the initial matrix in matrix factorization is the Frobenius norm, which leads to the quadratic loss:

$$\|X - UV\|_F^2 = \sum_{i,j} \left( x_{ij} - u_i^T v_j \right)^2 \qquad (1)$$

Experiments in [3] and [4] were conducted with Singular Value Decomposition (SVD), which gives the best approximation with respect to the Frobenius norm and hence to the quadratic loss. This choice was motivated by the popularity of SVD and of the Frobenius norm as a default option. In this work we obtain a theoretical motivation for quadratic loss using a second-order Taylor expansion of the objective function. An interesting conclusion is that classic SVD, which optimizes the unweighted quadratic loss, is not a good choice in this case. Our future work includes experiments that could complement this conclusion with practical results.
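The link between SVD and the unweighted quadratic loss can be illustrated numerically. The sketch below is only an illustration on a random matrix, not the experimental setup of [3] or [4]; the helper names `truncated_svd` and `quadratic_loss` are hypothetical. It uses the Eckart-Young theorem: the rank-d truncation of the SVD minimizes the Frobenius loss, and the minimum equals the sum of the discarded squared singular values.

```python
import numpy as np

def truncated_svd(X, d):
    """Return U (n x d) and V (d x m) whose product is the best rank-d
    approximation of X in the Frobenius norm."""
    P, s, Qt = np.linalg.svd(X, full_matrices=False)
    U = P[:, :d] * s[:d]  # absorb the singular values into the left factor
    V = Qt[:d, :]
    return U, V

def quadratic_loss(X, U, V):
    """Unweighted quadratic loss sum_ij (x_ij - u_i^T v_j)^2 = ||X - UV||_F^2."""
    R = X - U @ V
    return float(np.sum(R ** 2))

# Toy data (an assumption made purely for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 5))

U, V = truncated_svd(X, d=2)
svd_loss = quadratic_loss(X, U, V)

# Sum of the squared singular values that the rank-2 truncation discards.
s = np.linalg.svd(X, compute_uv=False)
tail = float(np.sum(s[2:] ** 2))
```

Here `svd_loss` and `tail` agree up to rounding, and any other pair of rank-2 factors gives a loss at least as large. Every entry of the residual counts equally in this loss, which is the point that the weighted analysis later in the paper questions.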

2 Skip-Gram with Negative Sampling (SGNS)

In this section we provide a brief review of the result from [3].

2.1 Setting and Notation

The skip-gram model assumes a corpus of words $w \in V_w$ and their contexts $c \in V_c$, where $V_w$ and $V_c$ are the sets of words and contexts. We denote the collection of observed word-context pairs by $D$ and its size (the number of pairs in the collection) by $|D|$. Note that $|D|$ differs from $|V_w|$. We also use $\#(w, c)$ to denote the number of times the pair $(w, c)$ appears in $D$. The notations $\#(w)$ and $\#(c)$ have the analogous meaning.
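The notation above can be made concrete on a toy corpus. The sketch below assumes that contexts are the neighbouring tokens inside a symmetric window, which is one common choice and is not fixed by the text. The function `extract_pairs` and all variable names are hypothetical.

```python
from collections import Counter

def extract_pairs(tokens, window=2):
    """Collect (word, context) pairs from a symmetric window around each token.
    The window-based definition of a context is an assumption of this sketch."""
    pairs = []
    for i, w in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((w, tokens[j]))
    return pairs

tokens = "the cat sat on the mat".split()

D = extract_pairs(tokens, window=1)           # the collection D of observed pairs
pair_count = Counter(D)                       # #(w, c)
word_count = Counter(w for w, _ in D)         # #(w)
context_count = Counter(c for _, c in D)      # #(c)
size_D = len(D)                               # |D|
vocab_size = len(set(tokens))                 # |V_w|, which differs from |D|
```

With this definition $\#(w) = \sum_c \#(w, c)$, so the word counts sum to $|D|$.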

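Let $d$ be the embedding dimensionality. Each word $w \in V_w$ is associated with a vector $\vec{w} \in \mathbb{R}^d$ and each context $c \in V_c$ is associated with a vector $\vec{c} \in \mathbb{R}^d$.

2.2 SGNS as Matrix Factorization

As shown in [3], if $k$ is the number of negative samples, the SGNS optimization task is to maximize

$$\ell = \sum_{w \in V_w} \sum_{c \in V_c} \#(w, c) \left( \log \sigma(\langle \vec{w}, \vec{c} \rangle) + k \, \mathbb{E}_{c_N \sim P_D} \left[ \log \sigma(-\langle \vec{w}, \vec{c}_N \rangle) \right] \right) \qquad (3)$$

where $\sigma(x) = 1 / (1 + e^{-x})$. The terms of $\ell$ that involve a specific pair $(w, c)$ are

$$\ell(w, c) = \#(w, c) \log \sigma(\langle \vec{w}, \vec{c} \rangle) + k \, \#(w) \frac{\#(c)}{|D|} \log \sigma(-\langle \vec{w}, \vec{c} \rangle)$$

and this local objective is optimized at

$$\langle \vec{w}, \vec{c} \rangle = \log \frac{\#(w, c) \, |D|}{\#(w) \, \#(c)} - \log k$$

If we are going to use a quadratic loss for this matrix factorization, a reasonable way to approximate the objective function with such a loss is to use its second-order Taylor expansion. In this case we get a weighted quadratic loss, so the main question is the values of the weights. If the weights are the same for every pair $(w, c)$, we can use standard SVD, which approximates the initial matrix with respect to the Frobenius norm. The following theorem shows that the weights can differ across pairs $(w, c)$.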
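The optimum of the local objective stated in this section, $\langle \vec{w}, \vec{c} \rangle = PMI(w, c) - \log k$, can be checked numerically. The sketch below treats the dot product as a free scalar `x` and uses made-up counts; the function names and the numbers are assumptions chosen only for illustration.

```python
import numpy as np

def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    return -np.logaddexp(0.0, -x)

def local_objective(x, n_wc, n_w, n_c, size_D, k):
    """Local SGNS objective for one pair, as a function of x = <w, c>."""
    return n_wc * log_sigmoid(x) + k * n_w * n_c / size_D * log_sigmoid(-x)

def shifted_pmi(n_wc, n_w, n_c, size_D, k):
    """PMI(w, c) - log k."""
    return np.log(n_wc * size_D / (n_w * n_c)) - np.log(k)

# Hypothetical counts: #(w, c), #(w), #(c), |D| and the number of negatives k.
n_wc, n_w, n_c, size_D, k = 30.0, 200.0, 150.0, 10000.0, 5.0

x_star = shifted_pmi(n_wc, n_w, n_c, size_D, k)

# Brute-force maximization on a fine grid around the claimed optimum.
grid = np.linspace(x_star - 3.0, x_star + 3.0, 6001)
values = local_objective(grid, n_wc, n_w, n_c, size_D, k)
x_best = grid[np.argmax(values)]
```

The local objective is concave in `x`, so the grid maximizer `x_best` lands within one grid step of `x_star`.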

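Theorem 1. Assume $\langle \vec{w}, \vec{c} \rangle \approx PMI_k(w, c)$, where $PMI_k(w, c) = PMI(w, c) - \log k$. Then:

$$\ell \approx \sum_{w \in V_w} \sum_{c \in V_c} \left( \mathrm{const}(\langle \vec{w}, \vec{c} \rangle) - \frac{1}{2} \, \omega_{wc} \left( \langle \vec{w}, \vec{c} \rangle - PMI_k(w, c) \right)^2 \right)$$

$$\omega_{wc} = \frac{1}{|D|} \cdot \frac{\#(w, c) |D| \cdot k \#(w) \#(c)}{\#(w, c) |D| + k \#(w) \#(c)}$$

Proof. Fix a pair $(w, c)$, write $x = \langle \vec{w}, \vec{c} \rangle$ and $x_0 = PMI_k(w, c)$, and let $\ell_{wc}(x)$ denote $\ell(w, c)$ as a function of $x$. The Taylor expansion around $x_0$ is

$$\ell_{wc}(x) = \ell_{wc}(x_0) + 0 \cdot (x - x_0) + \frac{1}{2} \ell''_{wc}(x_0) (x - x_0)^2 + o\left( (x - x_0)^2 \right)$$

Here the first term is $\mathrm{const}(\langle \vec{w}, \vec{c} \rangle)$, the second term is equal to zero because the expansion is taken at the extremum point, and the third term leads to the quadratic loss. From this equation we almost get the statement of the theorem:

$$\ell \approx \sum_{w \in V_w} \sum_{c \in V_c} \left( \mathrm{const}(\langle \vec{w}, \vec{c} \rangle) + \frac{1}{2} \ell''_{wc}(x_0) \left( \langle \vec{w}, \vec{c} \rangle - PMI_k(w, c) \right)^2 \right)$$

It remains to compute the second derivative:

$$\ell'_{wc}(x) = \#(w, c) \, \sigma(-x) - k \frac{\#(w) \#(c)}{|D|} \sigma(x)$$

$$\ell''_{wc}(x) = -\sigma(x) \sigma(-x) \left( \#(w, c) + k \frac{\#(w) \#(c)}{|D|} \right)$$

Substituting

$$x_0 = PMI_k(w, c) = PMI(w, c) - \log k = \log \frac{\#(w, c) |D|}{k \#(w) \#(c)}$$

for $x = \langle \vec{w}, \vec{c} \rangle$, we have

$$\sigma(x_0) = \frac{\#(w, c) |D|}{\#(w, c) |D| + k \#(w) \#(c)}, \qquad \sigma(-x_0) = \frac{k \#(w) \#(c)}{\#(w, c) |D| + k \#(w) \#(c)}$$

and therefore

$$\ell''_{wc}(x_0) = -\frac{\#(w, c) |D| \cdot k \#(w) \#(c)}{\left( \#(w, c) |D| + k \#(w) \#(c) \right)^2} \cdot \frac{\#(w, c) |D| + k \#(w) \#(c)}{|D|} = -\frac{1}{|D|} \cdot \frac{\#(w, c) |D| \cdot k \#(w) \#(c)}{\#(w, c) |D| + k \#(w) \#(c)} = -\omega_{wc}$$

□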
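As an illustration of the theorem, the sketch below computes the shifted PMI matrix and per-pair weights from a small count matrix. It assumes the weights are taken as the negated second derivative of the local objective at its optimum, which simplifies to $k \#(w,c) \#(w) \#(c) / (\#(w,c)|D| + k \#(w)\#(c))$. The function name `sgns_weights` and the toy counts are hypothetical, and all counts are assumed strictly positive so that the logarithm is finite.

```python
import numpy as np

def sgns_weights(counts, k):
    """Return (pmi_k, omega) for a count matrix with rows = words, cols = contexts.

    Assumes every entry of `counts` is strictly positive; zero counts would
    give PMI = -inf and need separate treatment not covered by this sketch.
    """
    C = np.asarray(counts, dtype=float)
    size_D = C.sum()                      # |D|
    n_w = C.sum(axis=1, keepdims=True)    # #(w), column vector
    n_c = C.sum(axis=0, keepdims=True)    # #(c), row vector
    neg = k * n_w * n_c                   # k #(w) #(c), broadcast to a matrix
    pmi_k = np.log(C * size_D / neg)      # PMI(w, c) - log k
    omega = C * neg / (C * size_D + neg)  # negated curvature at the optimum
    return pmi_k, omega

# Toy counts (an assumption for illustration only).
counts = np.array([[10.0, 2.0],
                   [3.0, 8.0]])
pmi_k, omega = sgns_weights(counts, k=5)
```

The weights differ from pair to pair whenever the counts do, so an unweighted loss treats rare and frequent pairs alike while the second-order approximation does not.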

4 Fitting parameters

The comparison of word2vec and SVD was presented in [4]. However, SVD is a matrix factorization that optimizes the quadratic loss with equal weights for all terms. Those results are also based on the classic method of computing the SVD. In this section we propose iterative matrix factorization techniques that are frequently used in recommender systems. These methods bring the parameter fitting process closer to the original word2vec parameter fitting.

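In both suggested methods we consider the following optimization task:

$$\tilde{\ell} = \sum_{w \in V_w} \sum_{c \in V_c} \omega_{wc} \left( \langle \vec{w}, \vec{c} \rangle - PMI_k(w, c) \right)^2 \to \min_{\vec{w}, \vec{c}}$$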
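A minimal sketch of this objective and of one stochastic gradient step on a single sampled pair is given below. The matrix names (`W` for word vectors, `Cx` for context vectors, `M` for the shifted PMI matrix, `omega` for the weights), the learning rate, and the toy values are assumptions of the sketch rather than settings from the paper.

```python
import numpy as np

def weighted_loss(W, Cx, M, omega):
    """Weighted quadratic loss sum_wc omega_wc (<w, c> - M_wc)^2."""
    R = W @ Cx.T - M
    return float(np.sum(omega * R ** 2))

def sgd_step(W, Cx, M, omega, i, j, lr):
    """Gradient step on the single term (i, j) of the weighted loss.

    Updates row i of W and row j of Cx in place.
    """
    err = W[i] @ Cx[j] - M[i, j]
    grad_w = 2.0 * omega[i, j] * err * Cx[j]  # derivative w.r.t. the word vector
    grad_c = 2.0 * omega[i, j] * err * W[i]   # derivative w.r.t. the context vector
    W[i] -= lr * grad_w
    Cx[j] -= lr * grad_c

# Tiny hand-made example: one word, one context, target value 0.
W = np.array([[1.0, 0.0]])
Cx = np.array([[1.0, 1.0]])
M = np.array([[0.0]])
omega = np.array([[1.0]])

loss_before = weighted_loss(W, Cx, M, omega)
sgd_step(W, Cx, M, omega, i=0, j=0, lr=0.1)
loss_after = weighted_loss(W, Cx, M, omega)
```

Both gradients are computed before either vector is modified, so the step uses a consistent snapshot of the pair. In a full run the pair `(i, j)` would be drawn at random at every step.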

In stochastic gradient descent (SGD) we choose random terms from the sums over $w \in V_w$ and $c \in V_c$ and take a gradient step on each chosen term. The main problem of matrix factorization via SGD is its low convergence rate. Sometimes this problem can be solved with the Alternating Least Squares (ALS) method. The idea is to obtain iteratively $\vec{w}$ as a solution of the equation $\partial \tilde{\ell} / \partial \vec{w} = 0$ and $\vec{c}$ as a solution of the equation $\partial \tilde{\ell} / \partial \vec{c} = 0$. From the expression for the gradient of the objective function one can conclude that, to use ALS in our task, we just need to solve the following linear systems iteratively until convergence:

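$$\left( \sum_{c \in V_c} \omega_{wc} \, \vec{c} \, \vec{c}^T \right) \vec{w} = \sum_{c \in V_c} \omega_{wc} \, PMI_k(w, c) \, \vec{c}$$

$$\left( \sum_{w \in V_w} \omega_{wc} \, \vec{w} \, \vec{w}^T \right) \vec{c} = \sum_{w \in V_w} \omega_{wc} \, PMI_k(w, c) \, \vec{w}$$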
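The alternating scheme can be sketched as follows. Each row update solves the small $d \times d$ linear system obtained by setting the gradient of the weighted quadratic loss with respect to one vector to zero. The tiny ridge term is an assumption added only for numerical stability and is not part of the systems above; the function names, the random initialization, and the fixed iteration count are likewise hypothetical.

```python
import numpy as np

def als_update_rows(A, B, M, omega, ridge=1e-8):
    """For every row i of A solve
        (sum_j omega_ij b_j b_j^T) a_i = sum_j omega_ij M_ij b_j
    with the rows b_j of B held fixed. Updates A in place."""
    d = B.shape[1]
    for i in range(A.shape[0]):
        G = (B * omega[i][:, None]).T @ B + ridge * np.eye(d)
        rhs = B.T @ (omega[i] * M[i])
        A[i] = np.linalg.solve(G, rhs)

def weighted_als(M, omega, d, n_iter=50, seed=0):
    """Alternate between word and context updates for a fixed number of sweeps."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(M.shape[0], d))   # word vectors
    Cx = rng.normal(scale=0.1, size=(M.shape[1], d))  # context vectors
    for _ in range(n_iter):
        als_update_rows(W, Cx, M, omega)       # solve for words, contexts fixed
        als_update_rows(Cx, W, M.T, omega.T)   # solve for contexts, words fixed
    return W, Cx

# Toy target matrix and weights (assumptions for illustration only).
rng = np.random.default_rng(1)
M = rng.normal(size=(4, 5))
omega = rng.uniform(0.5, 1.5, size=(4, 5))
W, Cx = weighted_als(M, omega, d=2, n_iter=20)
```

Each half-sweep solves its subproblem exactly, so the regularized objective never increases. This is the usual reason ALS is preferred when SGD converges slowly, at the cost of one linear solve per word and per context in every sweep.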

5 Conclusion and future research

Theorem 1 from section 3 shows that matrix factorization with a weighted quadratic loss is close to the initial optimization task. We also obtain the weights explicitly and see that previous experiments with an unweighted quadratic loss are less well motivated than experiments with a weighted one.

In [3] the authors also mention that skip-gram with negative sampling (SGNS) remains superior to SVD on analogy questions and that this could stem from the weighted nature of SGNS's factorization. We suppose that experiments with a weighted quadratic loss could improve matrix factorization results on this task up to the level of word2vec. The objective function could also be modified with penalties on baseline predictors, as is common in collaborative filtering, which could slightly improve results too. Future work includes such experiments on the analogy questions task and with the modified objective function.

1. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013, Lake Tahoe, Nevada, United States, 3111-3119 (2013)

2. Levy, O., Goldberg, Y.: word2vec explained: deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722 (2014)

3. Levy, O., Goldberg, Y.: Neural Word Embedding as Implicit Matrix Factorization. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, Montreal, Quebec, Canada, 2177-2185 (2014)

4. Levy, O., Goldberg, Y., Dagan, I.: Improving Distributional Similarity with Lessons Learned from Word Embeddings. Transactions of the Association for Computational Linguistics, 3, 211-225 (2015)