<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Optimization Task in Equivalent to word2vec Matrix Factorization</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Victor Kantor</string-name>
          <email>viktor.kantor@phystech.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Moscow Institute of Physics and Technologies</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Yandex Moscow</institution>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Omer Levy and Yoav Goldberg have shown in their paper at NIPS 2014 that word2vec is equivalent to the factorization of shifted PMI matrix. The question which was not discussed in this work and later papers is the right choice of the norm for an approximation of the matrix. Authors also presented the results of the experiments with SVD approximating the matrix with respect to Frobenius norm. In this work we show that weighted Frobenius norm could be the reasonable choice, but weights shouldn't be equal to one as in Levy and Goldberg experiments. We conjecture that the right choice of weights could help to improve matrix factorization results on analogy questions, where skipgram with negative sampling (SGNS) remains superior to SVD.</p>
      </abstract>
      <kwd-group>
        <kwd>word2vec</kwd>
        <kwd>SGNS</kwd>
        <kwd>matrix factorizations</kwd>
        <kwd>SVD</kwd>
        <kwd>distributional semantics</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        Word2vec is a powerful and popular natural language processing technique
proposed by Mikolov et al. in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. It allows to get word representations with some
useful properties. Dot product of word2vec vectors is a good similarity
measure and arithmetical operations with vectors help to solve some analogy tasks
(popular example: "queen woman + man king").
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] it was shown that word2vec is similar to matrix factorization technique
well-known in NLP and collaborative ltering. And factorizing matrix is almost
the PMI-matrix which is also common object in NLP and CF.
      </p>
      <p>
        The main di erence between matrix factorization in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and common use
of matrix factorization techniques is that Levy et al. result is valid for exact
factorization of the matrix but usually we work with approximate factorization
in NLP and CF applications.
      </p>
      <p>The default choice of norm for approximating the initial matrix in matrix
factorization is Frobenius norm which leads to the quadratic loss:
(1)
jjX</p>
      <p>U V jj2F =</p>
      <p>X(xij
uiT vj )2</p>
      <p>
        Experiments from [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] were considered with Singular Value
Decomposition (SVD) which gives the best approximation according to Frobenius norm
and subsequently quadratic loss. This choice was motivated by popularity of
SVD and Frobenius norm as a default variant. In this work we get the
theoretical motivation for quadratic loss using second order Taylor series approximation
for objective function. An interesting conclusion is that classic SVD optimizing
simple quadratic loss isn't a good choice in this case. Our future work includes
experiments which could complement this conclusion with practical results.
2
      </p>
      <p>
        Skip-Gram with Negative Sampling (SGNS)
In this section we provide a brief review of the result from [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
2.1
      </p>
    </sec>
    <sec id="sec-2">
      <title>Setting and Notation</title>
      <p>The skip-gram model assumes a corpus of words w 2 Vw and their contexts
c 2 Vc, where Vw and Vc are sets of words and contexts. We denote the collection
of observed words and contexts pair as D. To denote the length of D (the number
of words in collection) we use jDj. Note that jDj di ers from jVwj. We also use
#(w; c) to denote the number of times the pair (w, c) appear in D. Notation
#(w) and #(c) has the similar meaning.</p>
      <p>Let d be the embedding dimensionality. Each word w 2 Vw is associated with
a vector w 2 Rd and each context c 2 Vc is associated with a vector c 2 Rd.
2.2</p>
      <p>
        SGNS as Matrix Factorization
As it was shown in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], if k is the number of negative samples, SGNS optimization
task is as follows:
` = X X #(w; c) (log (&lt; w; c &gt;) + k EcN PD log (
&lt; w; cN &gt;)) (3)
`(w; c) = log (&lt; w; c &gt;) + k #(w)
&lt; w; c &gt;= log
#(w; c) jDj
#(w)#(c)
log k
If we are going to use quadratic loss for this matrix factorization, the reasonable
way to approximate the objective function with such loss is to use the second
order Taylor series expansion. In this case we get the weighted quadratic loss,
so the main question is the values of the weights. If the weights are the same for
every (w; c) pair, we can use standard SVD approximating initial matrix with
respect to Frobenius norm. The following theorem shows that weights could be
di erent for di erent pairs (w; c):
      </p>
    </sec>
    <sec id="sec-3">
      <title>Theorem 1. Assume &lt; w; c &gt;</title>
      <p>P M Ik(w; c). Then:
`</p>
      <p>X</p>
      <p>X const(&lt; w; c &gt;) + wc (&lt; w; c &gt;</p>
      <p>P M Ik(w; c))2
jDj
(6)
(7)
(8)
(9)
+0 (&lt; w; c &gt;</p>
      <p>P M Ik(w; c))2+
+o (&lt; w; c &gt;</p>
      <p>P M Ik(w; c))2</p>
      <p>Here the rst term is const(&lt; w; c &gt;), the second term is equal to zero
because it's Taylor series expansion in the extremum point, and the third term
leads to quadratic loss. From this equation we almost get the statement of the
theorem:
`</p>
      <p>X
w2Vw c2Vc
P M Ik(w; c))2
= #(w; c) ( x)</p>
      <p>k
=
(x) ( x) #(w; c) + k
#(w)#(c)
jDj</p>
      <p>(x)
#(w)#(c)
jDj
Substituting P M Ik(w; c) = P M I(w; c)
log k = log
= log
#(w;c) jDj for x =&lt; w; c &gt; we get:
k#(w)#(c)
#(w;c) jDj
#(w)#(c)
log k =
log
log
#(w; c) jDj
k#(w)#(c)
#(w; c) jDj
k#(w)#(c)</p>
      <p>=
=
1
jDj</p>
      <p>(k#(w)#(c))2
#(w; c)jDj k#(w)#(c)
(10)
(11)</p>
      <p>tu
(12)
(13)
4</p>
      <p>
        Fitting parameters
The comparison of word2vec and SVD was presented in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. However SVD is a
matrix factorization optimizing quadratic loss (with the same weights of terms).
Also these results are based on classic SVD calculation method. In this
section we propose the iterative matrix factorization techniques frequently used in
recommender systems. This method makes parameters tting process closer to
original word2vec parameters tting.
      </p>
      <p>In both suggested methods we consider the following optimization task:
`~ = X X
wc (&lt; w; c &gt;</p>
      <p>P M Ik(w; c)) c</p>
      <p>In stochastic gradient decent we choose random terms from sums over w 2 Vw
and c 2 Vc:
The main problem of matrix factorizations via SGD is a low convergence rate.
Sometimes this problem could be solved with Alternating Least Squares method.
The idea is to get iteratively w as a solution of the equation @@w`~ = 0 and c as
a solution of the equation @@c`~ = 0. From the expression for objective function
gradient one can conclude that to use ALS in our task we just need to solve
following linear systems iteratively up to convergence:</p>
      <p>X
c2Vc
X
w2Vw
!
!
wcccT
w = X</p>
      <p>wcc
wcwwT
c = X</p>
      <p>wcw
c2Vc
w2Vw
(15)
(16)
(17)
(18)
5</p>
      <p>Conclusion and future research
Theorem 1 from section 3 shows that matrix factorization with weighted quadratic
loss is close to initial optimization task. Also we get the weights and see that
previous experiments with quadratic loss without weights are less motivated than
experiments with weighted one.</p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] authors have also mentioned that skip-gram with negative sampling
(SGNS) remains superior to SVD on analogy questions and this could stem
from the weighted nature of SGNS's factorization. We suppose that experiments
with weighted quadratic loss could improve matrix factorization results in this
task up to word2vec results. Also the objective function cold be modi ed with
penalties of baseline predictors as it's common in collaborative ltering and it
could slightly improve results too. Future work includes such experiments with
analogy questions task and objective function modi cation.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          .
          <source>In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems</source>
          <year>2013</year>
          ,
          <string-name>
            <given-names>Lake</given-names>
            <surname>Tahoe</surname>
          </string-name>
          , Nevada, United States,
          <volume>3111</volume>
          {
          <fpage>3119</fpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Levy</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goldberg</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>word2vec explained: deriving Mikolov et al.s negativesampling word-embedding method</article-title>
          .
          <source>arXiv preprint arXiv:1402.3722</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Levy</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goldberg</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Neural Word Embedding as Implicit Matrix Factorization</article-title>
          .
          <source>In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems</source>
          <year>2014</year>
          , Montreal, Quebec, Canada,
          <volume>2177</volume>
          {
          <fpage>2185</fpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Levy</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goldberg</surname>
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dagan</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Improving Distributional Similarity with Lessons Learned from Word Embeddings, Transactions of the Association for Computational Linguistics</article-title>
          ,
          <volume>3</volume>
          ,
          <fpage>211</fpage>
          -
          <lpage>225</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>