<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Interpretable Matrix Factorization with Stochasticity Constrained Nonnegative DEDICOM</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rafet Sifa</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cesar Ojeda</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kostadin Cvejoski</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christian Bauckhage</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fraunhofer IAIS</institution>
          ,
          <addr-line>Sankt Augustin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Bonn</institution>
          ,
          <addr-line>Bonn</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <abstract>
<p>Decomposition into Directed Components (DEDICOM) is a special matrix factorization technique that factorizes a given asymmetric similarity matrix into a combination of a loading matrix, describing the latent structures in the data, and an asymmetric affinity matrix, encoding the relationships between the found latent structures. Finding DEDICOM factors can be cast as a matrix norm minimization problem that requires alternating least squares updates to find appropriate factors. Yet, due to the way DEDICOM reconstructs the data, unconstrained factors might yield results that are difficult to interpret. In this paper we derive a projection-free gradient descent based alternating least squares algorithm to calculate constrained DEDICOM factors. Our algorithm constrains the loading matrix to be column stochastic and the affinity matrix to be nonnegative for more interpretable low rank representations. Additionally, unlike most of the available approximate solutions for finding the loading matrix, our approach takes all occurrences of the loading matrix into account to assure convergence. We evaluate our algorithm on a behavioral dataset containing pairwise asymmetric associations between a variety of game titles from an online platform.</p>
      </abstract>
      <kwd-group>
        <kwd>Unsupervised Learning</kwd>
        <kwd>Matrix Factorization</kwd>
        <kwd>Constrained Optimization</kwd>
        <kwd>Behavior Analysis</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
<p>
        Matrix and tensor factorization methods have been widely used in data
science applications, mostly for understanding hidden patterns in datasets and
for learning representations for a variety of prediction tasks and recommender
systems [
        <xref ref-type="bibr" rid="ref1 ref10 ref11 ref13 ref16 ref17 ref5">1, 5, 10, 11, 13, 16, 17</xref>
        ]. Decomposition into Directed Components
(DEDICOM) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] is a matrix and tensor factorization technique and a compact way of
finding low rank representations from asymmetric similarity data. The method
has found applications in various data science problems, including the analysis of
temporal graphs [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], first-order customer migration [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], natural language
processing [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], spatio-temporal representation learning [
        <xref ref-type="bibr" rid="ref16 ref2">2, 16</xref>
        ] and link prediction in
relational and knowledge graphs [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], where the addressed problems are
primarily about analyzing asymmetric relationships between predefined
entities.
      </p>
      <p>Formally, given an asymmetric similarity matrix S ∈ R^(n×n) encoding
pairwise asymmetric relations (i.e. s_ij ≠ s_ji) among n objects, two-way DEDICOM
finds a loading matrix A ∈ R^(n×k) and a square affinity matrix R ∈ R^(k×k) for
representing S as
S ≈ A R A^T. (1)</p>
      <p>More precisely, the DEDICOM approximation of the asymmetric association
between elements i and j (i.e. s_ij) can be formulated in column vector notation as
s_ij ≈ a_i:^T R a_j: = Σ_{b=1}^{k} a_ib r_b:^T a_j: = Σ_{b=1}^{k} Σ_{c=1}^{k} a_ib r_bc a_jc, (2)
where a_i: and a_j: represent the ith and jth rows of A respectively and r_b:
represents the bth row of R. In this representation A contains the latent factors,
where the columns describe the hidden structures in S, and the relations among
these structures are encoded by the asymmetric affinity matrix R. A pictorial
illustration of the two-way DEDICOM factorization is shown in Fig. 1.</p>
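<p>As a small numerical illustration of (1) and (2) (our sketch, not from the paper; the sizes n and k are arbitrary), the three-factor reconstruction and its elementwise form can be checked with NumPy:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 3                      # n objects, k latent factors
A = rng.random((n, k))           # loading matrix A (n x k)
R = rng.random((k, k))           # asymmetric affinity matrix R (k x k)

# Three-factor DEDICOM reconstruction of the similarity matrix, eq. (1)
S_hat = A @ R @ A.T

# Elementwise form, eq. (2): s_ij = sum_b sum_c a_ib * r_bc * a_jc
i, j = 1, 4
s_ij = sum(A[i, b] * R[b, c] * A[j, c] for b in range(k) for c in range(k))

assert np.isclose(S_hat[i, j], s_ij)
# The reconstruction is asymmetric whenever R is asymmetric.
```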
      <p>
        Considering the three-factor approximations of the similarity values as in
(1) and (2), the interpretation (especially with respect to scale) of the hidden
structures from given factor matrices A and R might yield misleading results
in the presence of negative values [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. That is, the interpretation of the
results is limited to only considering nonnegative affinities in R or the positive or
negative loadings of the corresponding points in (2). Additionally, when
analyzing nonnegative data matrices (such as those containing probabilities or counts),
it is usually beneficial to consider nonnegative factor matrices, which can be
seen as condensed (or compressed) representations usable for
informed decision making and representation learning [
        <xref ref-type="bibr" rid="ref1 ref15 ref16 ref8">1, 8, 15, 16</xref>
        ].
      </p>
      <p>
        In this work we address the issue of interpretability of the DEDICOM
factors by introducing a converging algorithm as an alternative to the previously
proposed methods in [
        <xref ref-type="bibr" rid="ref1 ref13 ref15 ref16">1, 13, 15, 16</xref>
        ]. Our method constrains the loading matrix A
to contain column stochastic vectors and (similar to [
        <xref ref-type="bibr" rid="ref15 ref16">15, 16</xref>
        ]) the affinity matrix
R to be nonnegative. Formally, this amounts to factorizing a given asymmetric
similarity matrix S as in (1) while enforcing
r_ij ≥ 0 ∀ {i, j} (3)
and
a_cb ≥ 0 ∧ Σ_{q=1}^{n} a_qb = 1 ∀ {c, b}. (4)
      </p>
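<p>The constraints (3) and (4) amount to two small elementwise checks (a sketch of ours; the helper name is hypothetical): R must be elementwise nonnegative, and every column of A must be a probability vector:</p>

```python
import numpy as np

def satisfies_constraints(A, R, tol=1e-8):
    """Check eq. (3): R nonnegative, and eq. (4): A column stochastic."""
    r_nonneg = np.all(R >= 0)                        # r_ij >= 0 for all i, j
    a_nonneg = np.all(A >= 0)                        # a_cb >= 0 for all c, b
    cols_sum_to_one = np.allclose(A.sum(axis=0), 1.0, atol=tol)
    return bool(r_nonneg and a_nonneg and cols_sum_to_one)

A = np.array([[0.2, 0.5],
              [0.8, 0.5]])        # each column sums to 1
R = np.array([[0.1, 0.9],
              [0.3, 0.0]])        # nonnegative and asymmetric

assert satisfies_constraints(A, R)
assert not satisfies_constraints(A - 0.3, R)   # negative entries violate (4)
```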
      <p>
        Compared to the additive-scaling based representation introduced in [
        <xref ref-type="bibr" rid="ref15 ref16">15, 16</xref>
        ],
constraining the columns of A to contain probabilities has the direct advantage
that each element of A can be interpreted as a representativeness value for the
particular loading it belongs to. That is, the value of a_ib represents how much the
ith element in the dataset contributes to the bth latent factor. This is indeed
parallel to the idea of Archetypal Analysis [
        <xref ref-type="bibr" rid="ref17 ref3 ref6">3, 6, 17</xref>
        ], where the mixture matrices
determine how much each data point contributes to constructing the archetypes
and how much the archetypes contribute to reconstructing each data point.
      </p>
      <p>In the following we will first describe the general alternating least squares
optimization framework and give an overview of the algorithms to find appropriate
constrained and unconstrained DEDICOM factors. After that we will introduce
our algorithm by first studying our problem setting and deriving the algorithm
step by step. This will be followed by a case study covering a real world
application, where we analyze game ownership patterns from data containing
asymmetric associations between games using the factors extracted by running
our algorithm. Finally, we conclude our paper with a summary and some
directions for future work.</p>
      <p>Algorithms for Finding DEDICOM Factors</p>
      <p>Finding appropriate DEDICOM factors for matrix decomposition can be cast as
a matrix norm minimization problem with the objective
E(A, R) = ‖S − A R A^T‖², (5)
which is minimized with respect to A and R. Non-convex minimization problems
of this kind usually follow an iterative alternating least squares (ALS) procedure,
where at each iteration we optimize over a selected factor treating the other
factors as fixed. It is important to note that the minimized objective function in
(5) is convex in R for fixed A but not convex in A for fixed R, which leads
to optimal and sub-optimal approximate solutions when respectively updating
R and A.</p>
      <p>
        Starting with defining update rules for the affinity matrix R, we note that
with fixed A, minimization of (5) is a matrix regression problem [
        <xref ref-type="bibr" rid="ref1 ref13 ref15">1, 13, 15</xref>
        ] with the
global optimum solution
R ← A† S (A†)^T, (6)
where A† indicates the Moore-Penrose pseudoinverse of A. Furthermore, if A is
constrained to be column orthogonal (i.e. A^T A = I_k, where I_k is the k × k identity
matrix), the update for R simplifies to R ← A^T S A [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Here the updates are
not constrained to fall into a particular domain and might result in
affinity matrices containing negative elements.
      </p>
      <p>
        Constraining R to contain only nonnegative values when minimizing (5) can be
cast as a nonnegative least squares problem by vectorizing the arguments of (5) as
E(R) = ‖vec(S) − vec(A R A^T)‖², (7)
where the vec operator vectorizes (or flattens) a matrix B ∈ R^(m×n) as
vec(B) = [b_11, …, b_m1, …, b_1n, …, b_mn]^T. (8)
It is worth mentioning that, since vectorizing (5) as shown in (7) does not affect
the value of the norm [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], we can make use of the identity vec(A R A^T) = (A ⊗ A) vec(R),
where ⊗ represents the Kronecker product, to rewrite (7) as
E(R) = ‖vec(S) − (A ⊗ A) vec(R)‖². (9)
As noted in [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], the expression in
(9) can indeed be mapped to a constrained linear regression problem [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] with a
converging result (through an active set algorithm) to define an update that sets
R to the reshaped solution of this nonnegative least squares problem. (10)
      </p>
      <p>
        For the loading matrix A, one class of approximate ALS updates stacks the
similarity matrix and its transpose as
[S  S^T] = [A R A^T  A R^T A^T] = A [R A^T  R^T A^T]
and solves for A keeping the occurrences of A^T fixed [
        <xref ref-type="bibr" rid="ref1 ref13">1, 13</xref>
        ], which yields an update for A. (11)
      </p>
      <p>
        An alternative ALS update for the loading matrix A is based on directly
minimizing (5) by first order gradient based optimization [
        <xref ref-type="bibr" rid="ref15 ref16">15, 16</xref>
        ],
which in each update moves the current solution for A in the opposite direction
of the gradient of (5). To derive the gradient matrix we start by expressing (5) in terms
of the trace operator as
      </p>
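<p>Both R updates from this overview can be sketched in a few lines (our own illustration; we use SciPy's active-set `nnls` solver for the nonnegativity constrained case of (9) and (10)):</p>

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(1)
n, k = 8, 3
A = rng.random((n, k))
S = rng.random((n, n))

# Unconstrained global optimum, eq. (6): R = pinv(A) S pinv(A)^T
Ap = np.linalg.pinv(A)
R_free = Ap @ S @ Ap.T

# Nonnegative R via the vectorized problem (9): ||vec(S) - (A kron A) vec(R)||
# nnls solves the nonnegativity constrained least squares problem of (10).
K = np.kron(A, A)                             # (n*n) x (k*k) design matrix
r, _ = nnls(K, S.flatten(order="F"))          # vec() stacks columns
R_nn = r.reshape((k, k), order="F")

assert np.all(R_nn >= 0)
```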
      <p>E(A, R) = tr[S^T S − S^T A R A^T − A R^T A^T S + A R^T A^T A R A^T]
= tr[S^T S − 2 S^T A R A^T + A R^T A^T A R A^T]
= tr[S^T S] − 2 tr[S^T A R A^T] + tr[A R^T A^T A R A^T]. (13)</p>
      <p>Since traces are linear mappings and the term tr[S^T S] does not depend on A,
minimizing E(A, R) in (13) with respect to A is equivalent to minimizing
E(A) = E1(A) + E2(A), (14)
given that
E1(A) = −2 tr[S^T A R A^T] (15)
and
E2(A) = tr[A R^T A^T A R A^T]. (16)</p>
      <p>Considering the orthogonality constraint on A, the term in (16) becomes
independent of A. That is, since traces are invariant under cyclic permutation, we
can reformulate (16) as
E2(A) = tr[A R^T A^T A R A^T] = tr[R^T A^T A R A^T A] = tr[R^T R], (17)
which allows for only considering the minimization of the error term in (15) [
        <xref ref-type="bibr" rid="ref15 ref16">15,
16</xref>
        ]. Consequently, we can define a gradient descent based update by considering
the gradient matrix
∂E1(A)/∂A = −2 (S^T A R + S A R^T), (18)
which with a learning rate η_A allows us to define an update for A as
A ← A + 2 η_A (S^T A R + S A R^T). (19)</p>
    </sec>
    <sec id="sec-3">
      <title>Stochasticity Constrained Nonnegative DEDICOM</title>
      <p>
        As a final step, in order to assure the orthogonality constraint on A, the
updated A is projected by considering its QR decomposition [
        <xref ref-type="bibr" rid="ref15 ref16">15, 16</xref>
        ] A = QT, where Q ∈
R^(n×k) is column orthogonal and T ∈ R^(k×k) is invertible upper triangular, and following
the update A ← Q. This concludes our overview of the methods to come up
with appropriate factors. For more detailed information about the alternating
least squares solutions for DEDICOM factorization for matrices and tensors we
refer the reader to [
        <xref ref-type="bibr" rid="ref1 ref13 ref15 ref16">1, 13, 15, 16</xref>
        ].
      </p>
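<p>One full iteration of this orthogonality-constrained ALS scheme from the overview can be sketched as follows (our illustration; the learning rate value and iteration count are arbitrary choices, not from the paper):</p>

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, lr = 8, 3, 1e-3
S = rng.random((n, n))
A, _ = np.linalg.qr(rng.random((n, k)))    # start column orthogonal
R = A.T @ S @ A                            # R update for orthogonal A

for _ in range(200):
    # gradient step, eqs. (18)-(19): A := A + 2*lr*(S^T A R + S A R^T)
    A = A + 2 * lr * (S.T @ A @ R + S @ A @ R.T)
    Q, T = np.linalg.qr(A)                 # A = QT with Q column orthogonal
    A = Q                                  # projection step: A := Q
    R = A.T @ S @ A                        # regression update for R

err = np.linalg.norm(S - A @ R @ A.T)      # reconstruction error of eq. (5)
assert np.allclose(A.T @ A, np.eye(k))     # orthogonality is maintained
```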
      <p>
        In this section we present a new ALS algorithm that specifically considers
DEDICOM factorization with column stochastic loadings and nonnegative
affinities for interpretability of the resulting factors. To begin with, as noted in [
        <xref ref-type="bibr" rid="ref15 ref16">15,16</xref>
        ],
we can assure nonnegative affinities R by solving the nonnegative least squares
problem introduced in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] as in (10).
      </p>
      <p>Next, considering the theoretical properties of constraining the columns of
matrix A as in (4), we note that each column resides in the standard simplex Δ^(n−1),
which is the convex hull of the standard basis vectors of R^n. Formally, given that
δ_ij represents the Kronecker delta and V = {v_1, v_2, …, v_n | v_i = [δ_i1, …, δ_in]^T}
is the set of the standard basis vectors of R^n, the standard simplex Δ^(n−1) is
defined as the compact set
Δ^(n−1) = { Σ_{i=1}^{n} α_i v_i | Σ_{i=1}^{n} α_i = 1 ∧ α_i ≥ 0 ∀ i ∈ [1, …, n] }. (20)
This indicates that finding an appropriate column stochastic matrix A
minimizing the objective function in (5) can be reduced to a constrained
optimization problem over the convex compact simplex Δ^(n−1).</p>
      <p>
        Constrained optimization problems of this kind can be tackled using
the Frank-Wolfe algorithm [
        <xref ref-type="bibr" rid="ref7 ref9">7, 9</xref>
        ], which is also known as the conditional gradient
method [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. The main idea behind the Frank-Wolfe algorithm is to minimize a
differentiable convex function over its domain, which forms a compact convex set,
using iterative first-order Taylor approximations until achieving convergence.
Formally, given a differentiable convex function f : S → R over a compact
convex set S, the Frank-Wolfe algorithm aims to solve
min_{x ∈ S} f(x) (21)
by iteratively solving for
s_t = argmin_{s ∈ S} s^T ∇f(x_t), (22)
where ∇f(x_t) is the gradient of the optimized function f evaluated at the current
solution x_t. Following that, at each iteration t, the algorithm estimates the new
solution by evaluating
x_{t+1} = x_t + η_t (s_t − x_t), (23)
where η_t ∈ (0, 1] is a learning rate that is usually selected by performing a
line search on (23) or set to decrease monotonically as a function of t [
        <xref ref-type="bibr" rid="ref7 ref9">7, 9</xref>
        ].
Additionally, one particular advantage of this optimization method is that it
allows for an ε-approximation for a given maximum number of iterations t_max [
        <xref ref-type="bibr" rid="ref7 ref9">7,9</xref>
        ].
Considering our special case, since the set of vertices of the standard simplex is
equivalent to the set of standard basis vectors V of R^n [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], at each iteration the Frank-Wolfe
algorithm moves the current solution in the direction of one of the standard
basis vectors until achieving convergence.
      </p>
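<p>For intuition, the generic iteration (21)-(23) specialized to the standard simplex looks as follows (our sketch on a toy quadratic; since the vertices are the standard basis vectors, the linear subproblem (22) reduces to an argmin over gradient entries):</p>

```python
import numpy as np

def frank_wolfe_simplex(grad_f, x0, t_max=500):
    """Minimize a differentiable convex f over the standard simplex.

    The simplex vertices are the standard basis vectors, so the linear
    minimization (22) is solved by picking the smallest gradient entry.
    """
    x = x0.copy()
    for t in range(t_max):
        g = grad_f(x)
        s = np.zeros_like(x)
        s[np.argmin(g)] = 1.0            # best vertex, eq. (22)
        eta = 2.0 / (t + 2.0)            # monotonically decreasing step
        x = x + eta * (s - x)            # eq. (23), stays in the simplex
    return x

# toy problem: minimize ||x - y||^2 over the simplex (y lies inside it)
y = np.array([0.2, 0.3, 0.5])
x0 = np.array([1.0, 0.0, 0.0])
x = frank_wolfe_simplex(lambda x: 2 * (x - y), x0)

assert np.allclose(x.sum(), 1.0) and np.all(x >= 0)
assert 0.2 > np.linalg.norm(x - y)       # close to the optimum y
```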
      <p>Following that, since the stochasticity constraint does not allow for a
simplification as in (17), we require the full derivation of the gradient matrix ∂E(A)/∂A
from (13) so as to adapt the Frank-Wolfe algorithm to find appropriate
stochastic columns for A. To this end we start by considering E2(A) from (16) in terms
of a scalar summation as
tr[A R^T A^T A R A^T] = Σ_u Σ_v Σ_w Σ_x Σ_y Σ_z a_uv r_wv a_xw a_xy r_yz a_uz (24)
and consider its partial derivative with respect to an arbitrary element a_ij of A,
which, due to the linearity of differentiation, can be taken summand by summand. (25)
The expression in (25) can be further simplified by the product rule expansion,
and evaluating the derivatives results in
Σ_w Σ_x Σ_y Σ_z r_wj a_xw a_xy r_yz a_iz +
Σ_u Σ_v Σ_y Σ_z a_uv r_jv a_iy r_yz a_uz +
Σ_u Σ_v Σ_w Σ_z a_uv r_wv a_iw r_jz a_uz +
Σ_v Σ_w Σ_x Σ_y a_iv r_wv a_xw a_xy r_yj. (27)
Finally, reformulating the four summations in (27) in terms of matrix
multiplications allows us to define the gradient matrix ∂E2(A)/∂A as
∂E2(A)/∂A = 2 (A R^T A^T A R + A R A^T A R^T). (29)</p>
      <p>Combining the results from (18) and (29), the full gradient matrix of the
minimized error norm with respect to A is calculated as
∂E(A)/∂A = 2 (A R^T A^T A R + A R A^T A R^T − S^T A R − S A R^T). (30)</p>
      <p>Alg. 1: A Frank-Wolfe based projection-free ALS algorithm to find
nonnegative DEDICOM factors with column stochastic loading matrix A
and nonnegative affinity matrix R. At each iteration of the algorithm, we
alternate between finding an optimal column stochastic A keeping R fixed
and finding a nonnegative matrix R keeping A fixed. Note that the learning
rate here is set to decrease monotonically.</p>
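<p>The closed form gradient above can be checked numerically against central finite differences (our own sanity check, not part of the paper):</p>

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 5, 2
S = rng.random((n, n))
A = rng.random((n, k))
R = rng.random((k, k))

def E(A):
    # E(A) = -2 tr(S^T A R A^T) + tr(A R^T A^T A R A^T), eqs. (15)-(16)
    return (-2 * np.trace(S.T @ A @ R @ A.T)
            + np.trace(A @ R.T @ A.T @ A @ R @ A.T))

# closed form gradient of E(A), eq. (30)
G = 2 * (A @ R.T @ A.T @ A @ R + A @ R @ A.T @ A @ R.T
         - S.T @ A @ R - S @ A @ R.T)

# central finite differences, entry by entry
eps = 1e-6
G_fd = np.zeros_like(A)
for i in range(n):
    for j in range(k):
        P = np.zeros_like(A)
        P[i, j] = eps
        G_fd[i, j] = (E(A + P) - E(A - P)) / (2 * eps)

assert np.allclose(G, G_fd, atol=1e-4)
```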
    </sec>
    <sec id="sec-5">
      <title>A Business Intelligence Application: Analyzing Asymmetric Game Ownership Associations in an Online Gaming Platform</title>
      <p>The expression in (30) allows us to define a column-wise Frank-Wolfe algorithm to
come up with an optimal column stochastic loading matrix A minimizing (5). We list the
essential steps of our adaptation of the projection-free Frank-Wolfe algorithm in Alg. 1.
In a nutshell, our algorithm starts by randomly initializing (valid) matrices A and
R, and continues by alternating between Frank-Wolfe optimization updates
to come up with an optimal column stochastic A and alternating least squares
updates for a nonnegative R until a predefined stopping condition is met.</p>
      <p>[Fig. 2: the residual sum of squares of the reconstruction over 30 ALS iterations.]</p>
      <p>The stopping conditions are usually selected based on reaching a maximum number
of alternating least squares updates (not to be confused with the maximum iteration
count t_max of the Frank-Wolfe updates for the columns of A in Alg. 1), on observing
only minor subsequent changes in the minimized objective (in our case the matrix norm
in (5)), or on a combination of both conditions.</p>
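<p>A minimal end-to-end sketch of the alternating scheme of Alg. 1, under our reading of it (Frank-Wolfe passes over the columns of A using the gradient (30), NNLS updates for R; the iteration counts and step sizes here are our own choices):</p>

```python
import numpy as np
from scipy.optimize import nnls

def grad_A(S, A, R):
    # full gradient of the error norm with respect to A, eq. (30)
    return 2 * (A @ R.T @ A.T @ A @ R + A @ R @ A.T @ A @ R.T
                - S.T @ A @ R - S @ A @ R.T)

def constrained_dedicom(S, k, als_iters=30, t_max=50, seed=0):
    rng = np.random.default_rng(seed)
    n = S.shape[0]
    A = rng.random((n, k))
    A /= A.sum(axis=0)                     # valid column stochastic start
    R = rng.random((k, k))                 # valid nonnegative start
    for _ in range(als_iters):
        # Frank-Wolfe updates for the columns of A (R kept fixed)
        for t in range(t_max):
            G = grad_A(S, A, R)
            for b in range(k):
                s = np.zeros(n)
                s[np.argmin(G[:, b])] = 1.0           # simplex vertex step
                A[:, b] += (2.0 / (t + 2.0)) * (s - A[:, b])
        # nonnegative least squares update for R (A kept fixed), eqs. (9)-(10)
        r, _ = nnls(np.kron(A, A), S.flatten(order="F"))
        R = r.reshape((k, k), order="F")
    return A, R

S = np.random.default_rng(4).random((10, 10))
A, R = constrained_dedicom(S, k=3)
assert np.allclose(A.sum(axis=0), 1.0) and np.all(A >= 0) and np.all(R >= 0)
```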
      <p>
        In order to evaluate our algorithm, we consider a use case from game
analytics [
        <xref ref-type="bibr" rid="ref14 ref16">14, 16</xref>
        ], where we analyze the asymmetric relationships among various types
of games. To this end we consider a dataset from an online gaming platform called
Steam, which hosts thousands of games of various genres for a large player base
whose size ranges from 9 to 15 million (daily active) players
(http://store.steampowered.com/stats/). We used the dataset from [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], which contains game ownership information of more than six
million users about 3007 titles.
      </p>
      <p>Representing the ownership information as a bipartite matrix Y ∈ {0, 1}^(m×n)
indicating the ownership information of m players for n games, we construct
the asymmetric association among games in terms of their pairwise empirical
conditional probabilities. To this end, we define the directional similarity from
game i to game j as
s_ij = |{c | y_ci ≠ 0} ∩ {b | y_bj ≠ 0}| / |{c | y_ci ≠ 0}|, (31)
where | · | indicates set cardinality and y_pq indicates whether the pth player owns the
qth game. It is important to note that (31) is inherently asymmetric and a
special case of the so-called Tversky index [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] with prototype and variant weights
of respectively 1 and 0 (or vice versa). Since the number of computations
required to obtain the similarity matrix S scales as O(m n²), we parallelized the
procedure of computing the similarity values in (31) by utilizing a hybrid
parallelization technique that simultaneously exploits both distributed and shared
memory architectures in order to speed up the computation.
      </p>
      <p>[Tbl. 1: games with the highest loadings per column of A. a1 (TF 2/FPS): TF 2, CS: Source and Red Orchestra: Ostfront 41-45; a2 (Indie/Platformer): WSS, Ethan and Oozi: Earth Adventure; a3 (Flagships/AAA): Left 4 Dead 2, Dota 2, Skyrim, HL 2, Garry's Mod.]</p>
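<p>The directional similarity (31) can be computed for all game pairs at once from the binary ownership matrix (a vectorized sketch of ours; the paper's hybrid parallel implementation differs):</p>

```python
import numpy as np

def ownership_similarity(Y):
    """Empirical conditional probabilities s_ij = P(owns j | owns i), eq. (31).

    Y is a binary m x n player-by-game ownership matrix.
    """
    co = Y.T.astype(float) @ Y          # co[i, j] = |owners of i AND owners of j|
    owners = np.diag(co).copy()         # |owners of game i|
    owners[owners == 0] = 1.0           # guard against games nobody owns
    return co / owners[:, None]         # divide row i by |owners of i|

Y = np.array([[1, 1, 0],
              [1, 0, 0],
              [1, 1, 1],
              [0, 1, 0],
              [0, 1, 0]])
S = ownership_similarity(Y)
assert np.isclose(S[0, 1], 2 / 3)       # 2 of the 3 owners of game 0 own game 1
assert np.isclose(S[1, 0], 1 / 2)       # but only 2 of the 4 owners of game 1 own game 0
```

The asymmetry of S (here S[0, 1] ≠ S[1, 0]) is exactly what DEDICOM is designed to model.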
      <p>
        Having obtained the asymmetric similarity matrix with the empirical
conditional probability values in S, we factorize it using our algorithm and present
the results for k = 3. To explicitly ignore modeling the self-loops, at each
alternating least squares iteration we replaced the diagonal of S with the diagonal
of its current reconstruction [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In Fig. 2 we show the residual sum of squares,
which decreases monotonically and converges to a minimum after 30 iterations.
Analyzing the resulting factors, we observe that the first two columns (or
modes) a1 and a2 of A are sparse, whereas the last column is dense in terms of
non-zero probability values, and the most dominant games vary from column
to column. In Tbl. 1 we present the games with high loadings in their
corresponding columns and observe that mostly the free-to-play flagship game Team
Fortress 2 (TF 2), followed by the FPS games CS: Source and Red Orchestra:
Ostfront 41-45, contributes to a1. On the other hand, the indie-platformer games
Wooden Sen'SeY (WSS), Ethan (Meteor Hunter) and Oozi: Earth Adventure
contribute mostly to a2. Finally, the mostly dense a3 has very high loadings
for all of the flagship games of the analyzed platform and other AAA games
including Left 4 Dead 2, Dota 2, Skyrim, Half-Life 2 (HL 2) and Garry's Mod.
      </p>
      <p>
        Analyzing the resulting row-normalized affinity matrix in Fig. 3, which encodes the
asymmetric relations between the modes, the first row indicates stronger
tendencies of TF 2 players towards the indie-platformer mode than towards the AAA
games, which relates to the fact that TF 2 is mostly a singular game that people
primarily prefer on the Steam platform [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. On the other hand, in line with the results
from [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], the opposite directional associations, namely from the flagships to
TF 2, are high, which is related to the fact that the majority of the players playing
one or more of the flagship games also prefer TF 2 [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Finally, analyzing the
second mode, similar to the results in [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], the self association is the highest
association we observe (indicating players preferring mostly to stay in the same
genre); this is followed by high associations to TF 2 and to the mode related
to the flagship and AAA games.
      </p>
      <p>[Fig. 3: row-normalized affinity matrix between the modes TF2/FPS, Indie/Platformer and Flagships/AAA.]</p>
      <p>
        It is worth mentioning that, parallel to the results in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], by increasing the
number of modes k for the loading matrix A, we observed more detailed
partitions of the modeled patterns. Running our algorithm with k = 2, we
observed a higher reconstruction error; the indie-platformer mode a2
remained one of the factors, whereas the games contributing to a3
merged with those contributing to mode a1. On the other hand, considering the
resulting factors when we ran our algorithm with k = 4, we obtained a slight
improvement in the reconstruction error and observed that the modes
a1 and a2 remained the same, whereas mode a3 split into two different
modes, of which one contains the same games as a3 and the new one contains
additional indie and platformer games (reducing the probability on the games of
a3) such as Finding Teddy, Gentlemen, Gravi and Face Noir.
      </p>
      <p>Conclusion and Future Work</p>
      <p>In this work we studied the DEDICOM model to factorize asymmetric similarity
matrices into a combination of a low rank loading matrix and an affinity
matrix. We gave an overview of the theoretical details and the algorithms
to come up with appropriate factors. In essence, the affinity matrix R
indicates different directional structures that are determined by the columns of
the loading matrix A. Having defined our objective function as the matrix
norm of the difference between the factorized asymmetric similarity matrix and
its DEDICOM reconstruction, we derived the gradient matrix of this function
with respect to the loading matrix A and presented a variant of the Frank-Wolfe
algorithm to obtain interpretable column stochastic loadings and nonnegative
affinities. Running our algorithm on a dataset containing asymmetric pairwise
relationships between more than 3000 games, we found interesting patterns
indicating directional preferences among games.</p>
      <p>
        Our future work involves analyzing the performance of our model for different
tasks including link prediction and representation learning [
        <xref ref-type="bibr" rid="ref13 ref16">13, 16</xref>
        ]. Considering
the tensor extensions in [
        <xref ref-type="bibr" rid="ref13 ref16">13, 16</xref>
        ], our future work also involves extending the
proposed model to tensor factorizations, which can allow us to analyze asymmetric
similarity matrices from different sources. In the context of business intelligence
and game analytics, this might allow us to analyze and compare, for instance,
preferences of different countries or player groups.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bader</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harshman</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kolda</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Temporal Analysis of Semantic Graphs using ASALSAN</article-title>
          .
          <source>In: Proc. of IEEE ICDM</source>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bauckhage</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sifa</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Drachen</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thurau</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hadiji</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Beyond Heatmaps: Spatio-Temporal Clustering using Behavior-Based Partitioning of Game Levels</article-title>
          .
          <source>In: Proc. of IEEE CIG</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bauckhage</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>k-Means Clustering via the Frank-Wolfe Algorithm</article-title>
          .
          <source>In: Proc. of KDML-LWDA</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Chew</surname>
            ,
            <given-names>P.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bader</surname>
            ,
            <given-names>B.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rozovskaya</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Using DEDICOM for Completely Unsupervised Part-Of-Speech Tagging</article-title>
          .
          <source>In: Proc. Workshop on Unsupervised and Minimally Supervised Learning of Lexical Semantics</source>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Cremonesi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koren</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Turrin</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Performance of Recommender Algorithms on Top-N Recommendation Tasks</article-title>
          .
          <source>In: Proc. ACM Recsys</source>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Cutler</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Breiman</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Archetypal Analysis</article-title>
          .
          <source>Technometrics</source>
          <volume>36</volume>
          (
          <issue>4</issue>
          ),
          <fpage>338</fpage>
          -
          <lpage>347</lpage>
          (
          <year>1994</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Frank</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wolfe</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>An Algorithm for Quadratic Programming</article-title>
          .
          <source>Naval Research Logistics</source>
          <volume>3</volume>
          (
          <issue>1-2</issue>
          ) (
          <year>1956</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Harshman</surname>
            ,
            <given-names>R.A.</given-names>
          </string-name>
          :
          <article-title>Models for Analysis of Asymmetrical Relationships among N Objects or Stimuli</article-title>
          .
          <source>In: Proc. Joint Meeting of the Psychometric Society and the Society for Mathematical Psychology</source>
          (
          <year>1978</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Jaggi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization</article-title>
          .
          <source>In: Proc. of ICML</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Kantor</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rokach</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ricci</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shapira</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <source>Recommender Systems Handbook</source>
          . Springer (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Koren</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bell</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Volinsky</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Matrix Factorization Techniques for Recommender Systems</article-title>
          .
          <source>Computer</source>
          <volume>42</volume>
          (
          <issue>8</issue>
          ) (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Lawson</surname>
            ,
            <given-names>C.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hanson</surname>
            ,
            <given-names>R.J.</given-names>
          </string-name>
          :
          <article-title>Solving Least Squares Problems</article-title>
          . SIAM (
          <year>1974</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Nickel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tresp</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kriegel</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>A Three-way Model for Collective Learning on Multi-relational Data</article-title>
          .
          <source>In: Proc. of ICML</source>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Sifa</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Drachen</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bauckhage</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Large-Scale Cross-Game Player Behavior Analysis on Steam</article-title>
          .
          <source>In: Proc. of AAAI AIIDE</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Sifa</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ojeda</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bauckhage</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>User Churn Migration Analysis with DEDICOM</article-title>
          .
          <source>In: Proc. of ACM RecSys</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Sifa</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Srikanth</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Drachen</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ojeda</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bauckhage</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Predicting Retention in Sandbox Games with Tensor Factorization-based Representation Learning</article-title>
          .
          <source>In: Proc. of IEEE CIG</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Sifa</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bauckhage</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Drachen</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Archetypal Game Recommender Systems</article-title>
          .
          <source>In: Proc. of KDML-LWA</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Tversky</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Features of Similarity</article-title>
          .
          <source>Psychological Review</source>
          <volume>84</volume>
          (
          <issue>4</issue>
          ) (
          <year>1977</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>