=Paper=
{{Paper
|id=Vol-1917/paper13
|storemode=property
|title=Interpretable Matrix Factorization with Stochastic Nonnegative DEDICOM
|pdfUrl=https://ceur-ws.org/Vol-1917/paper13.pdf
|volume=Vol-1917
|authors=Rafet Sifa,Cesar Ojeda,Kostadin Cvejoski,Christian Bauckhage
|dblpUrl=https://dblp.org/rec/conf/lwa/SifaOCB17
}}
==Interpretable Matrix Factorization with Stochastic Nonnegative DEDICOM==
Interpretable Matrix Factorization with Stochasticity Constrained Nonnegative DEDICOM

Rafet Sifa (1,2), Cesar Ojeda (1), Kostadin Cvejoski (1), and Christian Bauckhage (1,2)

(1) Fraunhofer IAIS, Sankt Augustin, Germany
(2) University of Bonn, Bonn, Germany
www.multimedia-pattern-recognition.info
{name.surname}@iais.fraunhofer.de

Abstract. Decomposition into Directed Components (DEDICOM) is a special matrix factorization technique that factorizes a given asymmetric similarity matrix into a combination of a loading matrix, describing the latent structures in the data, and an asymmetric affinity matrix, encoding the relationships between the found latent structures. Finding DEDICOM factors can be cast as a matrix norm minimization problem that requires alternating least squares updates to find appropriate factors. Yet, due to the way DEDICOM reconstructs the data, unconstrained factors might yield results that are difficult to interpret. In this paper we derive a projection-free, gradient descent based alternating least squares algorithm to calculate constrained DEDICOM factors. Our algorithm constrains the loading matrix to be column-stochastic and the affinity matrix to be nonnegative for more interpretable low rank representations. Additionally, unlike most of the available approximate solutions for finding the loading matrix, our approach takes all occurrences of the loading matrix into account to assure convergence. We evaluate our algorithm on a behavioral dataset containing pairwise asymmetric associations between a variety of game titles from an online platform.

Keywords: Unsupervised Learning, Matrix Factorization, Constrained Optimization, Behavior Analysis

Copyright © 2017 by the paper's authors. Copying permitted only for private and academic purposes. In: M. Leyer (Ed.): Proceedings of the LWDA 2017 Workshops: KDML, FGWM, IR, and FGDB. Rostock, Germany, 11.-13. September 2017, published at http://ceur-ws.org

1 Introduction

Matrix and tensor factorization methods have been widely used in data science applications, mostly for understanding hidden patterns in datasets and for learning representations for a variety of prediction tasks and recommender systems [1, 5, 10, 11, 13, 16, 17]. Decomposition into Directed Components (DEDICOM) [8] is a matrix and tensor factorization technique and a compact way of finding low rank representations of asymmetric similarity data. The method has found applications in various data science problems, including the analysis of temporal graphs [1], first order customer migration [15], natural language processing [4], spatio-temporal representation learning [2, 16] and link prediction in relational and knowledge graphs [13], where the addressed problems were primarily about analyzing asymmetric relationships between predefined entities.

Fig. 1: Illustration of two-way DEDICOM to partition a given similarity matrix $S \in \mathbb{R}^{n \times n}$, encoding pairwise asymmetric relations among $n$ objects, into a combination of a loading matrix $A \in \mathbb{R}^{n \times k}$ and a square affinity matrix $R \in \mathbb{R}^{k \times k}$. In this representation $A$ contains the latent factors, whose columns describe the hidden structures in $S$, and the relations among these structures are encoded by the asymmetric affinity matrix $R$. A variety of constrained versions of DEDICOM have previously been studied to improve the interpretability and the performance of finding appropriate factors.

Formally, given an asymmetric similarity matrix $S \in \mathbb{R}^{n \times n}$ encoding pairwise asymmetric relations (i.e. $s_{ij} \neq s_{ji}$) among $n$ objects, two-way DEDICOM finds a loading matrix $A \in \mathbb{R}^{n \times k}$ and a square affinity matrix $R \in \mathbb{R}^{k \times k}$ for representing $S$ as

$$S \approx A R A^T. \qquad (1)$$

More precisely, the DEDICOM approximation of the asymmetric association between elements $i$ and $j$ (i.e. $s_{ij}$) can be formulated in vector notation as

$$s_{ij} \approx \boldsymbol{a}_{i:}^T R\, \boldsymbol{a}_{j:} = \sum_{b=1}^{k} a_{ib}\, \boldsymbol{r}_{b:}^T \boldsymbol{a}_{j:} = \sum_{b=1}^{k} \sum_{c=1}^{k} a_{ib}\, r_{bc}\, a_{jc}, \qquad (2)$$

where $\boldsymbol{a}_{i:}$ and $\boldsymbol{a}_{j:}$ represent the $i$th and $j$th rows of $A$, respectively, and $\boldsymbol{r}_{b:}$ represents the $b$th row of $R$. In this representation $A$ contains the latent factors, whose columns describe the hidden structures in $S$, and the relations among these structures are encoded by the asymmetric affinity matrix $R$.
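For concreteness, the following minimal NumPy sketch (our own illustration with random data, not part of the paper) reconstructs a similarity matrix as in (1) and checks the elementwise formulation (2):

```python
import numpy as np

n, k = 6, 3
rng = np.random.default_rng(0)
A = rng.random((n, k))   # loading matrix, n x k
R = rng.random((k, k))   # asymmetric affinity matrix, k x k

# Three-factor reconstruction of the similarity matrix, Eq. (1)
S_hat = A @ R @ A.T

# Elementwise reconstruction of a single association, Eq. (2)
i, j = 1, 4
s_ij = sum(A[i, b] * R[b, c] * A[j, c] for b in range(k) for c in range(k))
assert np.isclose(S_hat[i, j], s_ij)
```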
A pictorial illustration of two-way DEDICOM factorization is shown in Fig. 1.

Considering the three-factor approximation of the similarity values as in (1) and (2), the interpretation (especially with respect to scale) of the hidden structures in given factor matrices $A$ and $R$ might yield misleading results in the presence of negative values [15]. That is, the interpretation of the results is limited to considering only the nonnegative affinities in $R$ or the positive or negative loadings of the corresponding points in (2). Additionally, when analyzing nonnegative data matrices (such as ones containing probabilities or counts), it is usually beneficial to consider nonnegative factor matrices, which can serve as condensed (or compressed) representations for informed decision making and representation learning [1, 8, 15, 16].

In this work we address the issue of interpretability of the DEDICOM factors by introducing a converging algorithm as an alternative to the previously proposed methods in [1, 13, 15, 16]. Our method constrains the loading matrix $A$ to contain column-stochastic vectors and (similar to [15, 16]) the affinity matrix $R$ to be nonnegative. Formally, this amounts to factorizing a given asymmetric similarity matrix $S$ as in (1) while enforcing

$$r_{ij} \geq 0 \quad \forall\, \{i, j\} \qquad (3)$$

and

$$a_{cb} \geq 0 \;\wedge\; \sum_{q=1}^{n} a_{qb} = 1 \quad \forall\, \{c, b\}. \qquad (4)$$

Compared to the additive-scaling based representation introduced in [15, 16], constraining the columns of $A$ to contain probabilities has the direct advantage that each element of $A$ can be interpreted as the representativeness value of the particular loading it belongs to. That is, the value of $a_{ib}$ represents how much the $i$th element in the dataset contributes to the $b$th latent factor. This is indeed parallel to the idea of Archetypal Analysis [3, 6, 17], where the mixture matrices respectively determine how much a data point contributes to constructing the archetypes and how much the archetypes contribute to reconstructing each data point.
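As a small illustration of this interpretability (a sketch with made-up numbers and item names, not data from the paper), a column-stochastic $A$ makes factor memberships directly readable, since each column sums to one and its entries can be ranked as contribution scores:

```python
import numpy as np

# Hypothetical column-stochastic loading matrix for n = 4 items, k = 2 factors;
# entry a_ib is item i's contribution to latent factor b, and each column sums to 1.
A = np.array([[0.70, 0.05],
              [0.20, 0.05],
              [0.05, 0.10],
              [0.05, 0.80]])
assert np.allclose(A.sum(axis=0), 1.0)

items = ["item_0", "item_1", "item_2", "item_3"]
for b in range(A.shape[1]):
    top = np.argsort(A[:, b])[::-1][:2]          # two most representative items
    print(f"factor {b}:", [(items[i], A[i, b]) for i in top])
```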
In the following we first describe the general alternating least squares optimization framework and give an overview of the algorithms for finding appropriate constrained and unconstrained DEDICOM factors. After that we introduce our algorithm by first studying our problem setting and then deriving the algorithm step by step. This is followed by a case study covering a real-world application, in which we analyze game-ownership patterns from data containing asymmetric associations between games using the factors extracted by our algorithm. Finally, we conclude our paper with a summary and some directions for future work.

2 Algorithms for Finding DEDICOM Factors

Finding appropriate DEDICOM factors can be cast as a matrix norm minimization problem with the objective

$$E(A, R) = \left\| S - A R A^T \right\|^2, \qquad (5)$$

which is minimized with respect to $A$ and $R$. Non-convex minimization problems of this kind usually follow an iterative alternating least squares (ALS) procedure, where at each iteration we optimize over a selected factor while treating the other factors as fixed. It is important to note that the objective function in (5) is convex in $R$ for fixed $A$ but not convex in $A$ for fixed $R$, which leads to optimal and sub-optimal approximate solutions when updating $R$ and $A$, respectively.

Starting with the update rules for the affinity matrix $R$, we note that for fixed $A$, minimizing (5) is a matrix regression problem [1, 13, 15] with the globally optimal solution

$$R \leftarrow A^{\dagger} S \left(A^T\right)^{\dagger}, \qquad (6)$$

where $A^{\dagger}$ denotes the Moore-Penrose pseudoinverse of $A$. Furthermore, if $A$ is constrained to be column-orthogonal (i.e. $A^T A = I_k$, where $I_k$ is the $k \times k$ identity matrix), the update for $R$ simplifies to $R \leftarrow A^T S A$ [15]. These updates are not constrained to a particular domain and might result in affinity matrices containing negative elements.

Constraining $R$ to contain only nonnegative values when minimizing (5) can be cast as a nonnegative least squares problem by transforming the arguments of (5) as

$$E(R) = \left\| \mathrm{vec}(S) - \mathrm{vec}(A R A^T) \right\|^2, \qquad (7)$$

where the vec operator vectorizes (or flattens) a matrix $B \in \mathbb{R}^{m \times n}$ as

$$\mathrm{vec}(B) = [b_{11}, \ldots, b_{m1}, \ldots, b_{1n}, \ldots, b_{mn}]^T. \qquad (8)$$

Since vectorizing (5) as in (7) does not affect the value of the norm [15], we can make use of the property that a vectorized product of matrices can be represented as a matrix-vector multiplication to rewrite (7) as

$$E(R) = \left\| \mathrm{vec}(S) - (A \otimes A)\, \mathrm{vec}(R) \right\|^2, \qquad (9)$$

where $\otimes$ represents the Kronecker product. As noted in [15], the expression in (9) can be mapped to a constrained linear regression problem [12] with a converging solution (through an active set algorithm), which defines the update

$$R \leftarrow \operatorname*{argmin}_{R} \left\| \mathrm{vec}(S) - (A \otimes A)\, \mathrm{vec}(R) \right\|^2 \;\; \text{s.t.} \;\; r_{ij} \geq 0 \;\; \forall\, i, j. \qquad (10)$$

Having identified alternating least squares updates for the constrained and unconstrained affinity matrix $R$, we now turn our attention to updates for the loading matrix $A$. Two different directions have been followed previously: approximating the minimization of (5) by treating particular occurrences of $A$ as constant [1, 13], and directly minimizing (5) by means of gradient based approaches. The former method stacks the matrices $S$ and $S^T$ as

$$\left[\, S \;\; S^T \,\right] = \left[\, A R A^T \;\; A R^T A^T \,\right] = A \left[\, R A^T \;\; R^T A^T \,\right], \qquad (11)$$

and solves for $A$ keeping $A^T$ fixed [1, 13] to arrive at the update

$$A \leftarrow \left( S A R^T + S^T A R \right) \left( R A^T A R^T + R^T A^T A R \right)^{-1}. \qquad (12)$$
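The following NumPy/SciPy sketch illustrates these three updates (our own illustration, not the authors' implementation; `scipy.optimize.nnls` stands in for the active-set solver of [12], and a pseudoinverse guards against a singular normal matrix in (12)):

```python
import numpy as np
from scipy.optimize import nnls

def update_R_unconstrained(S, A):
    """Globally optimal regression update, Eq. (6): R <- A^+ S (A^T)^+."""
    Ap = np.linalg.pinv(A)
    return Ap @ S @ Ap.T

def update_R_nonnegative(S, A):
    """Nonnegative least squares update, Eq. (10), via vec(A R A^T) = (A (x) A) vec(R)."""
    K = np.kron(A, A)                      # (n^2) x (k^2) design matrix
    r, _ = nnls(K, S.flatten(order="F"))   # column-major vec matches the Kronecker identity
    k = A.shape[1]
    return r.reshape((k, k), order="F")

def update_A_stacked(S, A, R):
    """Approximate update for A, Eq. (12), treating the trailing A^T as fixed."""
    num = S @ A @ R.T + S.T @ A @ R
    den = R @ A.T @ A @ R.T + R.T @ A.T @ A @ R
    return num @ np.linalg.pinv(den)       # pinv instead of inv for robustness
```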
The latter method's ALS update for the loading matrix $A$ is based on directly minimizing (5) by first-order gradient based optimization [15, 16], which at each update moves the current solution for $A$ in the direction opposite to the gradient of (5). To derive the gradient matrix, we start by expressing (5) in terms of the trace operator as

$$\begin{aligned} E(A, R) &= \mathrm{tr}\left[ S^T S - S^T A R A^T - A R^T A^T S + A R^T A^T A R A^T \right] \\ &= \mathrm{tr}\left[ S^T S - 2 S^T A R A^T + A R^T A^T A R A^T \right] \\ &= \mathrm{tr}\left[ S^T S \right] - 2\, \mathrm{tr}\left[ S^T A R A^T \right] + \mathrm{tr}\left[ A R^T A^T A R A^T \right]. \end{aligned} \qquad (13)$$

Since traces are linear mappings and the term $\mathrm{tr}\left[ S^T S \right]$ does not depend on $A$, minimizing $E(A, R)$ in (13) with respect to $A$ is equivalent to minimizing $E(A)$, which we define as

$$E(A) = E_1(A) + E_2(A) \qquad (14)$$

given that

$$E_1(A) = -2\, \mathrm{tr}\left[ S^T A R A^T \right] \qquad (15)$$

and

$$E_2(A) = \mathrm{tr}\left[ A R^T A^T A R A^T \right]. \qquad (16)$$

Under the orthogonality constraint on $A$, the term in (16) becomes independent of $A$. That is, since traces are invariant under cyclic permutation, we can reformulate (16) as

$$E_2(A) = \mathrm{tr}\left[ A R^T A^T A R A^T \right] = \mathrm{tr}\left[ R^T A^T A R A^T A \right] = \mathrm{tr}\left[ R^T R \right], \qquad (17)$$

which allows us to consider only the minimization of the error term in (15) [15, 16]. Consequently, we can define a gradient descent based update by considering the gradient matrix

$$\frac{\partial E_1(A)}{\partial A} = -2 \left( S^T A R + S A R^T \right), \qquad (18)$$

which with a learning rate $\eta_A$ allows us to define an update for $A$ as

$$A \leftarrow A + 2 \eta_A \left( S^T A R + S A R^T \right). \qquad (19)$$

As a final step, in order to ensure the orthogonality constraint on $A$, we project the updated $A$ by considering its QR decomposition [15, 16], $A = QT$, where $Q \in \mathbb{R}^{n \times k}$ is column-orthogonal and $T \in \mathbb{R}^{k \times k}$ is invertible upper triangular, and applying the update $A \leftarrow Q$. This concludes our overview of the methods for finding appropriate factors. For more detailed information about the alternating least squares solutions for DEDICOM factorization of matrices and tensors, we refer the reader to [1, 13, 15, 16].

3 Stochasticity Constrained Nonnegative DEDICOM

In this section we present a new ALS algorithm that considers DEDICOM factorization with column-stochastic loadings and nonnegative affinities for interpretability of the resulting factors. To begin with, as noted in [15, 16], we can assure nonnegative affinities $R$ by solving the nonnegative least squares problem introduced in [12], as in (10).

Next, considering the theoretical properties of constraining the columns of $A$ as in (4), we note that each column resides in the standard simplex $\Delta^{n-1}$, which is the convex hull of the standard basis vectors of $\mathbb{R}^n$. Formally, given that $\delta_{ij}$ represents the Kronecker delta and $V = \{ \boldsymbol{v}_1, \boldsymbol{v}_2, \ldots, \boldsymbol{v}_n \mid \boldsymbol{v}_i = [\delta_{i1}, \ldots, \delta_{in}]^T \}$ is the set of standard basis vectors of $\mathbb{R}^n$, the standard simplex $\Delta^{n-1}$ is defined as the compact set

$$\Delta^{n-1} = \left\{ \sum_{i=1}^{n} \alpha_i \boldsymbol{v}_i \;\middle|\; \sum_{i=1}^{n} \alpha_i = 1 \;\wedge\; \alpha_i \geq 0 \;\; \forall\, i \in [1, \ldots, n] \right\}. \qquad (20)$$

This indicates that finding an appropriate column-stochastic matrix $A$ minimizing the objective function in (5) reduces to a constrained optimization problem over the convex compact simplex $\Delta^{n-1}$.

Constrained optimization problems of this kind can be tackled using the Frank-Wolfe algorithm [7, 9], also known as the conditional gradient method [9]. The main idea behind the Frank-Wolfe algorithm is to minimize a differentiable convex function over a compact convex domain using iterative first-order Taylor approximations until convergence. Formally, given a differentiable convex function $f : \mathcal{S} \to \mathbb{R}$ over a compact convex set $\mathcal{S}$, the Frank-Wolfe algorithm aims to solve

$$\min_{x \in \mathcal{S}} f(x) \qquad (21)$$

by iteratively solving for

$$s_t = \operatorname*{argmin}_{s \in \mathcal{S}} \; s^T \nabla f(x_t), \qquad (22)$$

where $\nabla f(x_t)$ is the gradient of the optimized function $f$ evaluated at the current solution $x_t$. Following that, at each iteration $t$ the algorithm estimates the new solution by evaluating

$$x_{t+1} = x_t + \alpha_t (s_t - x_t), \qquad (23)$$

where $\alpha_t \in (0, 1]$ is the learning rate that is usually selected by performing a line search on (23) or set to decrease monotonically as a function of $t$ [7, 9].
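Since the feasible set here is the standard simplex, the linear subproblem (22) is solved by a vertex, i.e. the standard basis vector at the coordinate where the gradient is smallest. A minimal sketch of such a Frank-Wolfe loop (our own toy objective and helper name `frank_wolfe_simplex`, not code from the paper):

```python
import numpy as np

def frank_wolfe_simplex(grad, x0, t_max=200):
    """Frank-Wolfe over the standard simplex: the subproblem (22) is solved
    by the vertex e_i with i = argmin_i [grad(x)]_i."""
    x = x0.copy()
    for t in range(t_max):
        g = grad(x)
        i = np.argmin(g)               # minimizing vertex of the simplex
        s = np.zeros_like(x)
        s[i] = 1.0
        alpha = 2.0 / (t + 2.0)        # monotonically decreasing step size
        x += alpha * (s - x)           # Eq. (23); iterate stays in the simplex
    return x

# Toy usage: project a point y onto the simplex by minimizing ||x - y||^2.
y = np.array([0.9, 0.4, -0.2])
x = frank_wolfe_simplex(lambda x: 2.0 * (x - y), np.full(3, 1.0 / 3.0))
print(x, x.sum())                      # nonnegative entries, sums to 1
```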
Additionally, one particular advantage of this optimization method is that it allows for an ε-approximation within a given maximum number of iterations $t_{\max}$ [7, 9]. In our special case, since the set of vertices of the standard simplex is exactly the set of standard basis vectors $V$ of $\mathbb{R}^n$ [3], at each iteration the Frank-Wolfe algorithm moves the current solution in the direction of one of the standard basis vectors until convergence.

Following that, since the stochasticity constraint does not allow for a simplification as in (17), we require the full derivation of the gradient matrix $\frac{\partial E(A)}{\partial A}$ from (13) so as to adapt the Frank-Wolfe algorithm to finding appropriate stochastic columns of $A$. To this end we start by writing $E_2(A)$ from (16) as a scalar summation,

$$\mathrm{tr}\left[ A R^T A^T A R A^T \right] = \sum_{u} \sum_{v} \sum_{w} \sum_{x} \sum_{y} \sum_{z} a_{uv}\, r_{wv}\, a_{xw}\, a_{xy}\, r_{yz}\, a_{uz}, \qquad (24)$$

and consider its partial derivative with respect to an arbitrary element $a_{ij}$ of $A$, which, by linearity of differentiation, can be written as

$$\frac{\partial E_2(A)}{\partial a_{ij}} = \sum_{u} \sum_{v} \sum_{w} \sum_{x} \sum_{y} \sum_{z} \frac{\partial}{\partial a_{ij}} \left( a_{uv}\, r_{wv}\, a_{xw}\, a_{xy}\, r_{yz}\, a_{uz} \right). \qquad (25)$$

The expression in (25) can be further expanded by the product rule as

$$\begin{aligned} \frac{\partial E_2(A)}{\partial a_{ij}} = &\sum_{u,v,w,x,y,z} \frac{\partial}{\partial a_{ij}} (a_{uv} r_{wv}) \left( a_{xw} a_{xy} r_{yz} a_{uz} \right) + \sum_{u,v,w,x,y,z} \frac{\partial}{\partial a_{ij}} (a_{xw}) \left( a_{uv} r_{wv} a_{xy} r_{yz} a_{uz} \right) \\ + &\sum_{u,v,w,x,y,z} \frac{\partial}{\partial a_{ij}} (a_{xy}) \left( a_{uv} r_{wv} a_{xw} r_{yz} a_{uz} \right) + \sum_{u,v,w,x,y,z} \frac{\partial}{\partial a_{ij}} (a_{uz} r_{yz}) \left( a_{uv} r_{wv} a_{xw} a_{xy} \right), \end{aligned} \qquad (26)$$

and evaluating the derivatives (each one fixing the corresponding indices to $i$ and $j$) results in

$$\frac{\partial E_2(A)}{\partial a_{ij}} = \sum_{w,x,y,z} r_{wj}\, a_{xw} a_{xy} r_{yz} a_{iz} + \sum_{u,v,y,z} a_{uv} r_{jv}\, a_{iy} r_{yz} a_{uz} + \sum_{u,v,w,z} a_{uv} r_{wv}\, a_{iw} r_{jz} a_{uz} + \sum_{v,w,x,y} r_{yj}\, a_{iv} r_{wv} a_{xw} a_{xy}. \qquad (27)$$

Finally, reformulating the four summations in (27) in terms of matrix multiplications as

$$\frac{\partial E_2(A)}{\partial a_{ij}} = 2 \left[ A R^T A^T A R \right]_{ij} + 2 \left[ A R A^T A R^T \right]_{ij} \qquad (28)$$

allows us to define the gradient matrix $\frac{\partial E_2(A)}{\partial A}$ as

$$\frac{\partial E_2(A)}{\partial A} = 2 \left( A R^T A^T A R + A R A^T A R^T \right). \qquad (29)$$

Combining the results from (18) and (29), the full gradient matrix of the minimized error norm with respect to $A$ is

$$\frac{\partial E(A)}{\partial A} = 2 \left( A R^T A^T A R + A R A^T A R^T - S^T A R - S A R^T \right), \qquad (30)$$

which allows us to define a column-wise Frank-Wolfe algorithm to come up with an optimal column-stochastic loading matrix $A$ minimizing (5). We list the essential steps of our adaptation of the projection-free Frank-Wolfe algorithm in Alg. 1.

  Randomly initialize valid A and R
  Let V = {v_1, v_2, ..., v_n | v_i = [δ_i1, ..., δ_in]^T}
  while stopping condition is not satisfied do
      // Update A with the Frank-Wolfe procedure
      t ← 0
      while t ≠ t_max and the updates of A are not small do
          // Define the gradient matrix, Eq. (30)
          G = ∂E(A)/∂A = 2(A R^T A^T A R + A R A^T A R^T − S^T A R − S A R^T)
          // Define the monotonically decreasing learning rate
          α ← 2/(t + 2)
          for b ∈ {1, ..., k} do
              // Find the minimizing simplex vertex
              i = argmin_l g_lb
              // Update column a_b of A
              a_b ← a_b + α(v_i − a_b)
          // Update the iterator
          t ← t + 1
      // Update R by solving the nonnegative least squares problem, Eq. (10)
      R ← argmin_R ||vec(S) − (A ⊗ A) vec(R)||^2  s.t.  r_ij ≥ 0 ∀ i, j

Alg. 1: A Frank-Wolfe based projection-free ALS algorithm to find semi-nonnegative DEDICOM factors with a column-stochastic loading matrix A and a nonnegative affinity matrix R. At each iteration of the algorithm, we alternate between finding an optimal column-stochastic A keeping R fixed and finding a nonnegative matrix R keeping A fixed. Note that the learning rate here is set to decrease monotonically.
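Below is a compact NumPy/SciPy sketch of Alg. 1 under simplifying assumptions (fixed iteration counts instead of the stopping tests; the names `stochastic_dedicom` and `grad_E` are ours, and this is an illustration rather than the authors' reference implementation):

```python
import numpy as np
from scipy.optimize import nnls

def grad_E(A, R, S):
    """Full gradient of the error norm w.r.t. A, Eq. (30)."""
    return 2 * (A @ R.T @ A.T @ A @ R + A @ R @ A.T @ A @ R.T
                - S.T @ A @ R - S @ A @ R.T)

def stochastic_dedicom(S, k, als_iters=30, t_max=100, seed=0):
    """Sketch of Alg. 1: ALS alternating Frank-Wolfe updates for a
    column-stochastic A with NNLS updates for a nonnegative R."""
    n = S.shape[0]
    rng = np.random.default_rng(seed)
    A = rng.random((n, k))
    A /= A.sum(axis=0)                          # valid column-stochastic init
    R = rng.random((k, k))                      # valid nonnegative init
    for _ in range(als_iters):
        # Frank-Wolfe updates for the columns of A, keeping R fixed
        for t in range(t_max):
            G = grad_E(A, R, S)
            alpha = 2.0 / (t + 2.0)             # monotonically decreasing rate
            for b in range(k):
                i = np.argmin(G[:, b])          # minimizing simplex vertex
                v = np.zeros(n)
                v[i] = 1.0
                A[:, b] += alpha * (v - A[:, b])  # Eq. (23), columnwise
        # Nonnegative least squares update for R, keeping A fixed, Eq. (10)
        r, _ = nnls(np.kron(A, A), S.flatten(order="F"))
        R = r.reshape((k, k), order="F")
    return A, R
```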
In a nutshell, our algorithm starts by randomly initializing (valid) matrices A and R, continues by alternating between the Frank-Wolfe optimization updates to come up with an optimal column-stochastic A and the alternating least squares updates for a nonnegative R, and stops once a predefined stopping condition is met. The stopping conditions are usually selected based on reaching a maximum number of alternating least squares updates (not to be confused with the maximum iteration count $t_{\max}$ of the Frank-Wolfe updates for the columns of A in Alg. 1), observing only minor subsequent changes in the minimized objective (in our case the matrix norm in (5)), or a combination of both conditions.

4 A Business Intelligence Application: Analyzing Asymmetric Game Ownership Associations in an Online Gaming Platform

In order to evaluate our algorithm, we consider a use case from game analytics [14, 16], in which we analyze the asymmetric relationships among various types of games. To this end we consider a dataset from an online gaming platform called Steam, which hosts thousands of games of various genres for a large player base whose size ranges from 9 to 15 million (daily active) players (see http://store.steampowered.com/stats/). We used the dataset from [14], which contains game ownership information of more than six million users for 3007 titles.

Representing the ownership information as a bipartite matrix $Y \in \{0, 1\}^{m \times n}$ indicating the ownership information of $m$ players for $n$ games, we construct the asymmetric association among games in terms of their pairwise empirical conditional probabilities. To this end, we define the directional similarity from game $i$ to game $j$ as

$$s_{ij} = \frac{\left| \{ c \mid y_{ci} \neq 0 \} \cap \{ b \mid y_{bj} \neq 0 \} \right|}{\left| \{ c \mid y_{ci} \neq 0 \} \right|}, \qquad (31)$$

where $| \cdot |$ indicates set cardinality and $y_{pq}$ indicates whether the $p$th player owns the $q$th game. It is important to note that (31) is inherently asymmetric and a special case of the so-called Tversky index [18] with prototype and variant weights of 1 and 0, respectively (or vice versa). Since the number of computations required to obtain the similarity matrix $S$ scales as $O(m n^2)$, we parallelized the procedure of computing the similarity values in (31) by utilizing a hybrid parallelization technique that simultaneously exploits distributed and shared memory architectures in order to speed up the computation.
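Assuming a binary ownership matrix Y as described above (rows: players, columns: games), the empirical conditional probabilities in (31) can also be computed with matrix operations instead of explicit set intersections; a minimal vectorized sketch (the helper name `directional_similarity` is ours, and every game is assumed to have at least one owner):

```python
import numpy as np

def directional_similarity(Y):
    """Empirical conditional probability matrix, Eq. (31):
    s_ij = |owners of game i who also own game j| / |owners of game i|."""
    Y = Y.astype(float)
    co = Y.T @ Y                      # co[i, j] = number of common owners of games i and j
    owners = np.diag(co).copy()       # co[i, i] = number of owners of game i
    return co / owners[:, None]       # divide row i by |owners of game i|

# Toy usage: 5 players, 3 games.
Y = np.array([[1, 1, 0],
              [1, 0, 0],
              [1, 1, 1],
              [0, 1, 0],
              [0, 1, 1]])
S = directional_similarity(Y)
print(S[0, 1], S[1, 0])   # P(own game 1 | own game 0) vs. P(own game 0 | own game 1)
```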
Having obtained the asymmetric similarity matrix S containing the empirical conditional probability values, we factorize it using our algorithm and present the results for k = 3. To explicitly avoid modeling self-loops, at each alternating least squares iteration we replaced the diagonal of S with the diagonal of its current reconstruction [1]. In Fig. 2 we show the residual sum of squares, which decreases monotonically and converges to a minimum after 30 iterations.

Fig. 2: Illustration of the evolution of the Residual Sum of Squares (RSS) for factorizing our empirical conditional probability matrix indicating asymmetric relations between games using DEDICOM with k = 3. Our provably converging algorithm monotonically decreases the RSS at each iteration. (Plot: RSS, from roughly 270 down to 200, over 30 ALS iterations.)

Analyzing the resulting factors, we observe that the first two columns (or modes) a1 and a2 of A are sparse, whereas the last column is dense in terms of non-zero probability values, and the most dominant games vary from column to column. In Tbl. 1 we present the games with high loadings in their corresponding columns and observe that mostly the free-to-play flagship game Team Fortress 2 (TF 2), followed by the FPS games CS: Source and Red Orchestra: Ostfront 41-45, contributes to a1. On the other hand, the indie-platformer games Wooden Sen'SeY (WSS), Ethan (Meteor Hunter) and Oozi: Earth Adventure contribute mostly to a2. Finally, the mostly dense a3 has very high loadings for all of the flagship games of the analyzed platform and other AAA games, including Left 4 Dead 2, Dota 2, Skyrim, Half-Life 2 (HL 2) and Garry's Mod.

  Loadings | Identifier       | Dominant Games
  a1       | TF 2/FPS         | TF 2, CS: Source and Red Orchestra: Ostfront 41-45
  a2       | Indie/Platformer | WSS, Ethan and Oozi: Earth Adventure
  a3       | Flagships/AAA    | Left 4 Dead 2, Dota 2, Skyrim, HL 2, Garry's Mod

Table 1: Selection of games with high loading values for the analyzed dataset.

Analyzing the resulting row-normalized affinity matrix, which encodes the asymmetric relations between the modes (Fig. 3), the first row indicates that TF 2 players tend more towards the indie-platformer mode than towards the AAA games, owing to the fact that TF 2 is mostly a singular game that people primarily prefer on the Steam platform [14]. On the other hand, in line with the results from [14], the opposite directional associations, namely from the flagships to TF 2, are high, which is related to the fact that the majority of the players playing one or more of the flagship games also prefer TF 2 [14]. Finally, analyzing the second mode, similar to the results in [15], the self-association is the highest association we observe (indicating players preferring mostly to stay in the same genre); this is followed by high associations to TF 2 and to the mode related to the flagship and AAA games.

                   | TF2/FPS | Indie/Platform | Flagships/AAA
  TF2/FPS          |  0.2469 |         0.5158 |        0.2372
  Indie/Platform   |  0.3637 |         0.4867 |        0.1496
  Flagships/AAA    |  0.6623 |         0.1042 |        0.2334

Fig. 3: Results of factorizing the empirical conditional probability graph created from the pairwise game ownership information of more than six million players of an online gaming platform. The normalized nonnegative affinity matrix R shows the relationships between the groups automatically determined by the DEDICOM algorithm.

It is worth mentioning that, parallel to the results in [1], by increasing the number of modes k for the loading matrix A, we observed more detailed partitions of the modeled patterns. Running our algorithm with k = 2, we observed a higher reconstruction error; the indie-platformer mode a2 remained one of the factors, whereas the games contributing to a3 merged with those contributing to mode a1. On the other hand, considering the resulting factors when running our algorithm with k = 4, we obtained a slight improvement in the reconstruction error and observed that the modes a1 and a2 remained the same, while mode a3 split into two different modes: one containing the same games as a3, and a new mode containing additional indie and platformer games (reducing the probability mass on the games of a3) such as Finding Teddy, Gentlemen, Gravi and Face Noir.

5 Conclusion and Future Work

In this work we studied the DEDICOM model for factorizing asymmetric similarity matrices into a combination of a low rank loading matrix and an affinity matrix. We gave an overview of the theoretical details and the algorithms for finding appropriate factors.
In essence, the affinity matrix R indicates different directional structures that are determined by the columns of the loading matrix A. Having defined our objective function as the matrix norm of the difference between the factorized asymmetric similarity matrix and its DEDICOM reconstruction, we derived the gradient matrix of this function with respect to the loading matrix A and presented a variant of the Frank-Wolfe algorithm to obtain interpretable column-stochastic loadings and nonnegative affinities. Running our algorithm on a dataset containing asymmetric pairwise relationships between more than 3000 games, we found interesting patterns indicating directional preferences among games.

Our future work involves analyzing the performance of our model on different tasks, including link prediction and representation learning [13, 16]. Considering the tensor extensions in [13, 16], our future work also involves extending the proposed model to tensor factorizations, which would allow us to analyze asymmetric similarity matrices from different sources. In the context of business intelligence and game analytics, this might allow us to analyze and compare, for instance, the preferences of different countries or player groups.

References

1. Bader, B., Harshman, R., Kolda, T.: Temporal Analysis of Semantic Graphs using ASALSAN. In: Proc. of IEEE ICDM (2007)
2. Bauckhage, C., Sifa, R., Drachen, A., Thurau, C., Hadiji, F.: Beyond Heatmaps: Spatio-Temporal Clustering using Behavior-Based Partitioning of Game Levels. In: Proc. of IEEE CIG (2014)
3. Bauckhage, C.: k-Means Clustering via the Frank-Wolfe Algorithm. In: Proc. of KDML-LWDA (2016)
4. Chew, P.A., Bader, B.W., Rozovskaya, A.: Using DEDICOM for Completely Unsupervised Part-Of-Speech Tagging. In: Proc. Workshop on Unsupervised and Minimally Supervised Learning of Lexical Semantics (2009)
5. Cremonesi, P., Koren, Y., Turrin, R.: Performance of Recommender Algorithms on Top-N Recommendation Tasks. In: Proc. ACM RecSys (2010)
6. Cutler, A., Breiman, L.: Archetypal Analysis. Technometrics 36(4), 338–347 (1994)
7. Frank, M., Wolfe, P.: An Algorithm for Quadratic Programming. Naval Research Logistics 3(1-2) (1956)
8. Harshman, R.A.: Models for Analysis of Asymmetrical Relationships among N Objects or Stimuli. In: Proc. Joint Meeting of the Psychometric Society and the Society for Mathematical Psychology (1978)
9. Jaggi, M.: Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization. In: Proc. of ICML (2013)
10. Kantor, P., Rokach, L., Ricci, F., Shapira, B.: Recommender Systems Handbook. Springer (2011)
11. Koren, Y., Bell, R., Volinsky, C.: Matrix Factorization Techniques for Recommender Systems. Computer 42(8) (2009)
12. Lawson, C.L., Hanson, R.J.: Solving Least Squares Problems. SIAM (1974)
13. Nickel, M., Tresp, V., Kriegel, H.: A Three-way Model for Collective Learning on Multi-relational Data. In: Proc. of ICML (2011)
14. Sifa, R., Drachen, A., Bauckhage, C.: Large-Scale Cross-Game Player Behavior Analysis on Steam. In: Proc. of AAAI AIIDE (2015)
15. Sifa, R., Ojeda, C., Bauckhage, C.: User Churn Migration Analysis with DEDICOM. In: Proc. of ACM RecSys (2015)
16. Sifa, R., Srikanth, S., Drachen, A., Ojeda, C., Bauckhage, C.: Predicting Retention in Sandbox Games with Tensor Factorization-based Representation Learning. In: Proc. of IEEE CIG (2016)
17. Sifa, R., Bauckhage, C., Drachen, A.: Archetypal Game Recommender Systems. In: Proc. of KDML-LWA (2014)
18. Tversky, A.: Features of Similarity. Psychological Review 84(4) (1977)