<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Tensor Co-clustering: a Parameter-less Approach</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Elena Battaglia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ruggero G. Pensa</string-name>
          <email>ruggero.pensag@unito.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Turin, Dept. of Computer Science</institution>
          ,
          <addr-line>Turin</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Tensor co-clustering has been proven useful in many applications, due to its ability to cope with high-dimensional data and sparsity. However, setting up a co-clustering algorithm properly requires the specification of the desired number of clusters for each mode as an input parameter. To face this issue, we propose a tensor co-clustering algorithm that does not require the number of desired co-clusters as input, as it optimizes an objective function based on a measure of association across discrete random variables that is not affected by their cardinality. The effectiveness of our algorithm is shown on real-world datasets, also in comparison with state-of-the-art co-clustering methods.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Tensors are widely used mathematical objects that well represent complex
information such as gene expression data, social networks, heterogeneous information
networks, time-evolving data, behavioral patterns, and multi-lingual text
corpora. In general, every n-ary relation can be easily represented as a tensor. From
the algebraic point of view, in fact, they can be seen as multidimensional
generalizations of matrices and, as such, can be processed with mathematical and
computational methods that generalize those usually employed to analyze data
matrices, e.g., non-negative factorization, singular value decomposition, itemset
and association rule mining, clustering and co-clustering.</p>
      <p>Clustering, in particular, is by far one of the most popular unsupervised
machine learning techniques since it allows analysts to obtain an overview of the
intrinsic similarity structures of the data with relatively little background
knowledge about them. However, with the availability of high-dimensional
heterogeneous data, co-clustering has gained popularity, since it provides a simultaneous
partitioning of each mode. Despite its proven usefulness, the correct application
of tensor co-clustering is limited by the fact that it requires the specification of
a congruent number of clusters for each mode while, in realistic analysis
scenarios, the actual number of clusters is unknown. Furthermore, matrix/tensor
(co-)clustering is often based on a preliminary tensor factorization step or on
latent block models [
        <xref ref-type="bibr" rid="ref13 ref14 ref3">14, 3, 13</xref>
        ] that, in their turn, require further input parameters
(e.g., the number of latent factors/blocks within each mode). As a consequence,
it is practically impossible to explore all combinations of parameter values in order
to identify the best clustering results.</p>
      <p>Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). This volume is published and copyrighted by its editors. SEBD 2020, June 21-24, 2020, Villasimius, Italy.</p>
      <p>[Figure 1: overall tensor co-clustering schema, from the input tensor to the contingency tensor and the Goodman-Kruskal's τ association measures.]</p>
      <p>
        The main reason for this problem is that most clustering algorithms (and
tensor factorization approaches) optimize objective functions that strongly depend
on the number of clusters (or factors). Hence, two solutions with two different
numbers of clusters cannot be compared directly. Although this considerably
reduces the size of the search space, it prevents the discovery of a
better partitioning once a wrong number of clusters is selected. In this paper, we
address this limitation by proposing a new tensor co-clustering algorithm that
optimizes an objective function that can be viewed as an n-mode extension of an
association measure called Goodman-Kruskal's τ [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], whose local optima do not
depend on the number of clusters. Compared with state-of-the-art techniques
that require the desired number of clusters in each mode as input parameters,
our approach achieves similar or better results on several real-world datasets.
      </p>
      <p>
        The algorithm presented in this paper has been first introduced in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. An
interested reader could refer to it for further theoretical and experimental details.
      </p>
    </sec>
    <sec id="sec-2">
      <title>An association measure for tensor co-clustering</title>
      <p>
        The objective function optimized by our tensor co-clustering algorithm is an
association measure, called Goodman and Kruskal's τ [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], that evaluates the
dependence between two discrete variables and has been used to evaluate the
quality of 2-way co-clustering [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Goodman and Kruskal's τ estimates the
strength of the link between two discrete variables X and Y according to the
proportional reduction of the error in predicting one of them knowing the other:
τ_{X|Y} = (e_X − E[e_{X|Y}]) / e_X, where e_X is the error in predicting X (estimated as the
probability that two different observations from the marginal distribution of X
fall in different categories) and E[e_{X|Y}] is the expected value of the conditional
error taken with respect to the distribution of Y. Conversely, the proportional
reduction of the error in predicting Y when X is known is τ_{Y|X} = (e_Y − E[e_{Y|X}]) / e_Y.
      </p>
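      <p>As an illustration of the two-variable measure above (a minimal sketch of our own, not the implementation used in the paper), τ_{X|Y} can be computed directly from a non-negative contingency matrix with X on the rows and Y on the columns:

```python
import numpy as np

def goodman_kruskal_tau(counts):
    """tau_{X|Y}: proportional reduction of the error in predicting X (rows)
    of a contingency table when Y (columns) is known."""
    p = counts / counts.sum()               # joint distribution
    px = p.sum(axis=1)                      # marginal of X
    py = p.sum(axis=0)                      # marginal of Y
    e_x = 1.0 - np.sum(px ** 2)             # error in predicting X alone
    cols = py > 0                           # skip empty columns of Y
    e_x_given_y = 1.0 - np.sum(p[:, cols] ** 2 / py[cols])  # E[e_{X|Y}]
    return (e_x - e_x_given_y) / e_x
```

The value lies in [0, 1]: it is 0 when X and Y are independent and 1 when Y determines X exactly.</p>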
      <p>In order to use this measure for the evaluation of a tensor co-clustering, we
need to extend it so that it can evaluate the association of n distinct discrete
variables X_1, …, X_n. Reasoning as in the two-dimensional case, we can define
the proportional reduction of the error in predicting X_i when all the (X_j)_{j≠i} are known as
τ_{X_i} = τ_{X_i|(X_j)_{j≠i}} = (e_{X_i} − E[e_{X_i|(X_j)_{j≠i}}]) / e_{X_i}
for all i ≤ n. When n = 2, the measure coincides with Goodman-Kruskal's τ.</p>
      <p>We will now see how to use function τ to evaluate a tensor co-clustering.
Let X ∈ ℝ_+^{m_1 × ⋯ × m_n} be a tensor with n modes and non-negative values. Let us
denote with x_{k_1⋯k_n} the generic element of X, where k_i = 1, …, m_i for each mode
i = 1, …, n. A co-clustering P of X is a collection of n partitions {P_i}_{i=1,…,n},
where P_i = ∪_{j=1}^{c_i} C_j^i is a partition of the elements on the i-th mode of X in c_i
groups, with c_i ≤ m_i for each i = 1, …, n. Each co-clustering P can be associated
to a tensor T^P ∈ ℝ_+^{c_1 × ⋯ × c_n}, whose generic element is the sum of all entries of X
in the same co-cluster. We can look at T^P as the n-modal contingency table that
empirically estimates the joint distribution of n discrete variables X_1, …, X_n,
where each X_i takes values in {C_1^i, …, C_{c_i}^i}. From this contingency table, we can
derive the marginal probabilities of each variable X_i (i.e., the probability that a
generic element of mode i falls in a particular cluster on that mode) and we can
use these probabilities to compute the Goodman and Kruskal's τ functions: in this
way we associate to each co-clustering P over X a vector τ^P = (τ^P_{X_1}, …, τ^P_{X_n})
that can be used to evaluate the quality of the co-clustering. The overall
co-clustering schema is depicted in Figure 1.</p>
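      <p>The construction of T^P from a co-clustering can be sketched as follows; this is our own illustration (the function name and the encoding of each partition as a vector of cluster labels are assumptions, not the authors' code):

```python
import numpy as np

def contingency_tensor(X, partitions):
    """Aggregate tensor X into the contingency tensor T^P of a co-clustering.
    `partitions[i]` maps each index on mode i to its cluster label (0-based)."""
    shape = tuple(int(p.max()) + 1 for p in partitions)
    T = np.zeros(shape)
    # full index grids, one per mode, each with the same shape as X
    idx = np.meshgrid(*[np.arange(m) for m in X.shape], indexing="ij")
    # map each cell of X to its co-cluster and accumulate
    np.add.at(T, tuple(p[ix] for p, ix in zip(partitions, idx)), X)
    return T
```

The τ functions described above can then be computed from the marginals of T.</p>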
    </sec>
    <sec id="sec-3">
      <title>A stochastic local search approach to co-clustering</title>
      <p>Our co-clustering approach can be formulated as a multi-objective optimization
problem: given a tensor X with n modes and dimension m_i on mode i, an optimal
co-clustering P for X is one that is not dominated by any other co-clustering (i.e.,
there does not exist any other co-clustering Q with τ^Q_{X_j} ≥ τ^P_{X_j} for all j = 1, …, n).</p>
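      <p>The dominance check behind this definition can be written compactly; `dominates` is a hypothetical helper of ours, and it adds the strict-improvement requirement usual in Pareto optimality:

```python
import numpy as np

def dominates(tau_q, tau_p):
    """True if co-clustering Q dominates P: tau is at least as good on
    every mode and strictly better on at least one."""
    tau_q, tau_p = np.asarray(tau_q), np.asarray(tau_p)
    return bool(np.all(tau_q >= tau_p) and np.any(tau_q > tau_p))
```

A co-clustering P is then Pareto-optimal when no Q dominating it exists.</p>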
      <p>Since we do not fix the number of clusters, the space of possible solutions is
huge (for example, given a very small tensor of dimension 10 × 10 × 10, the number
of possible partitions is 1.56 × 10^16): it is clear that a systematic exploration of
all possible solutions is not feasible for a generic tensor X. For this reason we
need to find a heuristic that allows us to reach a "good" partition of X, i.e., a
partition P with high values of τ^P_{X_k} for all modes k. With this aim, we propose
a stochastic local search approach to solve the maximization problem.</p>
      <p>Algorithm 1: TCC(X, N_iter)</p>
      <p>Input: A m_1 × ⋯ × m_n tensor X, N_iter</p>
      <p>Result: P_1, …, P_n
1  Initialize P_1, …, P_n with discrete partitions;
2  i ← 0;
3  iter_without_moves ← 0;
4  while i ≤ N_iter and iter_without_moves &lt; max_{j=1,…,n}(m_j) do
5      for k = 1 to n do
6          if iter_without_moves &lt; t then
7              Randomly choose C_b^k in P_k and x in C_b^k;
8          else
9              x ← next(k) // the element following the one selected at iteration i − 1 on mode k;
10             C_b^k ← cluster of x;
11         end
12         for C_j^k in P_k ∪ {∅} do
13             Q_k^j ← (P_k \ {C_b^k, C_j^k}) ∪ {C_b^k \ {x}, C_j^k ∪ {x}};
14             Q^j ← (P_1, P_2, …, P_{k−1}, Q_k^j, P_{k+1}, …, P_n);
15             Compute the contingency tensor T^j associated to Q^j and τ^{Q^j};
16         end
17         e ← SelectBestPartition(k, b, (τ^{Q^j})_{j=1,…,|P_k ∪ ∅|});
18         P_k ← Q_k^e;
19         if e == b then
20             iter_without_moves ← iter_without_moves + 1;
21         else
22             iter_without_moves ← 0;
23         end
24     end
25     i ← i + 1;
26 end</p>
      <p>
        Algorithm 1 provides the general sketch of our tensor co-clustering algorithm,
called TCC. It repeatedly considers modes one by one, sequentially, and it tries
to improve the quality of the co-clustering by moving one single element from
its original cluster C_b^k to another cluster on the same mode, C_e^k, which most
improves the quality of the partition, according to a criterion chosen to measure
the quality of the partition. When all the n modes have been considered, the
i-th iteration of the algorithm is concluded. If all objects have been tried but
no move is possible, the algorithm ends. At the end of each iteration, one of the
following possible moves has been done on mode k:
– an object x has been moved from cluster C_b^k to a pre-existing cluster C_e^k: in
this case the final number of clusters on mode k remains the same (let's call
it c_k) if C_b^k is non-empty after the move. If C_b^k is empty after the move, it
will be deleted and the final number of clusters will be c_k − 1;
– an object x has been moved from cluster C_b^k to a new cluster C_e^k = ∅: the
final number of clusters on mode k will be c_k + 1 (the useless case when x is
moved from C_b^k = {x} to C_e^k = ∅ is not considered);
– no move has been performed and the number of clusters remains c_k.
Thus, during the iterative process, the updating procedure is able to increase or
decrease the number of clusters at any time. This is due to the fact that, contrary
to other measures, such as the loss in mutual information [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], the τ measure has an
upper limit which does not depend on the number of co-clusters and thus enables
the comparison of co-clustering solutions of different cardinalities.
      </p>
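      <p>Since each step moves a single element on one mode, the contingency tensor of a tentative partition need not be rebuilt from scratch: the contribution of the moved element can be transferred between two slices. A minimal sketch under our own naming (`apply_move`, `contrib`), not taken from the paper:

```python
import numpy as np

def apply_move(T, contrib, k, b, e):
    """Update contingency tensor T after moving one element on mode k
    from cluster b to cluster e. `contrib` is the moved element's slice of
    the original tensor, already aggregated by the cluster labels of the
    other modes (shape = T.shape without axis k)."""
    idx_b = [slice(None)] * T.ndim
    idx_b[k] = b
    idx_e = [slice(None)] * T.ndim
    idx_e[k] = e
    T[tuple(idx_b)] -= contrib   # remove the element from cluster b
    T[tuple(idx_e)] += contrib   # add it to cluster e
    return T
```

Transferring `contrib` this way yields the same tensor as re-aggregating the whole input.</p>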
      <p>As mentioned above, the choice of the best cluster on mode k in which
the selected element x should be moved depends on the way in which we
decide to measure the increase in the quality of the tensor partition. We can
define different measures, corresponding to different ways to implement function
SelectBestPartition in Algorithm 1. According to our experiments, the one with
the best performances in terms of speed of convergence and quality of the identified
co-clusters is the following. Suppose we want to move an object on mode k:
we consider only those moves that improve (or at least do not worsen) τ_{X_k} and,
among them, we choose the one with the greatest value of avg(τ) = (1/n) Σ_{j=1}^{n} τ_{X_j}.</p>
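      <p>An illustrative reading of this strategy in code (our own sketch; the tie-breaking rule that prefers keeping x in its current cluster is an assumption):

```python
import numpy as np

def select_best_partition(k, b, tau_candidates):
    """Among candidate moves (one tau vector per target cluster on mode k),
    keep those that do not worsen tau on mode k with respect to the
    no-move candidate b, then pick the one with the highest average tau."""
    taus = np.asarray(tau_candidates)           # shape: (n_candidates, n_modes)
    current = taus[b, k]                        # tau_{X_k} if x stays where it is
    admissible = np.where(taus[:, k] >= current)[0]
    e = admissible[np.argmax(taus[admissible].mean(axis=1))]
    # prefer no move when it is just as good as the best admissible move
    if np.isclose(taus[e].mean(), taus[b].mean()):
        return b
    return e
```

Returning b when no admissible move improves the average leaves the partition unchanged, which is what increments the no-move counter in Algorithm 1.</p>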
      <p>It can be proven that algorithm TCC with this selection strategy converges to
a Pareto local optimum.
</p>
    </sec>
    <sec id="sec-4">
      <title>Experiments and discussion</title>
      <p>
        In this section, we test the effectiveness of our co-clustering algorithm through
experiments on the following three real-world datasets: the "four-area" DBLP
dataset1; the "hetrec2011-movielens-2k" dataset2 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], from which we extract two
different tensors, MovieLens1 and MovieLens2; the Yelp dataset3, from which we
extract two tensors, YelpTOR and YelpPGH. To assess the quality of the
clustering performances, we consider two measures commonly used in the clustering
literature: normalized mutual information (NMI) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and adjusted Rand index
(ARI) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
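      <p>Both measures are available in standard libraries; for instance, with scikit-learn (assuming it is installed) they can be computed on a toy labelling as follows:

```python
# Standard implementations of NMI and ARI from scikit-learn.
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

true_labels = [0, 0, 1, 1, 2, 2]    # ground-truth classes (toy example)
found_labels = [1, 1, 0, 0, 2, 2]   # cluster ids from some co-clustering

nmi = normalized_mutual_info_score(true_labels, found_labels)
ari = adjusted_rand_score(true_labels, found_labels)
# both measures are invariant to label permutations
```

Here both scores equal 1 (up to floating point), since the two partitions coincide up to a relabelling.</p>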
      <p>
        We compare our results with those of other state-of-the-art tensor co-clustering
algorithms. nnCP is the non-negative CP decomposition [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and can be used to
co-cluster a tensor, as done by [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], by assigning each element in each mode to the
cluster corresponding to the latent factor with highest value. nnTucker is the
non-negative Tucker decomposition [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. nnCP+kmeans and nnT+kmeans
combine CP (or Tucker) decomposition with a post-processing phase in which
k-means is applied on each of the latent factor matrices, similarly as what has
been done by [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. SparseCP consists of a CP decomposition with
nonnegative sparse latent factors [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Finally, TBM performs tensor co-clustering
1 http://web.cs.ucla.edu/~yzsun/data/DBLP_four_area.zip
2 https://grouplens.org/datasets/hetrec-2011/
3 https://www.yelp.com/dataset
via the Tensor Block Model [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. The last four methods require as input
parameters the number of clusters on each mode: we set these numbers equal to the true
number of classes on the three modes of the tensor. When further parameters
are needed, we follow the instructions suggested in the original papers for setting
them. Algorithms nnCP, nnTucker and their variant with k-means are applied
trying different ranks of the decomposition: we report the best result obtained.
In this way we are giving a big advantage to our competitors: we choose the
rank of the decomposition and the number of clusters by looking at the actual
number of categories, which are unknown in standard unsupervised settings.
Despite this, as shown in Table 1, TCC outperforms the other algorithms on all
datasets but one (DBLP) and has comparable results on another (YelpTOR).
In these cases, non-negative Tucker decomposition (with the number of latent
factors set to the correct number of embedded clusters) achieves the best results
and non-negative CP decomposition obtains results that are comparable with
those of TCC. However, when we modify, even slightly, the number of latent
factors (see Figure 2), the results immediately become worse than those of TCC.
      </p>
      <p>The number of clusters identified by TCC is usually close to the correct
number of embedded clusters: on average, 5 instead of 4 for DBLP, 5 instead
of 3 for MovieLens1, the correct number 3 for MovieLens2, 5 instead of 3 for
YelpPGH. Only YelpTOR presents a number of clusters (13) that is far from
the correct number of classes (3). However, more than 85% of the objects
are classified in 3 large clusters, while the remaining objects form very small
clusters: we consider these objects as candidate outliers. The same behaviour is
even more pronounced in DBLP, where four clusters contain 99.9% of the
objects and only 2 objects stay in the "extra cluster".</p>
      <p>Lastly, we provide some insights about the quality of the clusters identified
by our algorithm. To this purpose, we choose a co-clustering of the MovieLens1
dataset. This dataset contains 181 movies and all the tags assigned by the users
to each movie. We construct a (215 × 181 × 142)-dimensional tensor, where the
three modes represent users, movies and tags. The movies are labelled in three
categories (Animation, Horror and Documentary). Algorithm TCC identifies
five clusters of movies, instead of the three categories we consider as labels. The
tag clouds in Figure 3 illustrate the 30 movies with more tags for each cluster
(text size depends on the actual number of tags): it can be easily observed that
the first cluster concerns animated movies for children, mainly Disney and Pixar
movies; the second one is a little cluster containing animated movies realized
with the claymation technique (mainly movies from the Wallace and Gromit saga or
other films by the same director); the third cluster is still a subset of the
animated movies, but it contains anime and animated films from Japan. The fourth
cluster is composed mainly of horror movies and the last one contains only
documentaries. On the tag mode, our algorithm finds thirteen clusters. Six of them
contain more than 90% of the total tags and only 10 uninformative tags are
partitioned in other 7 very small clusters, and could be considered as outliers.
There is a one-to-one correspondence between four clusters of movies (Cartoons,
Anime, Wallace&amp;Gromit and Documentary) and four of the tag clusters; cluster
Horror, instead, can be put in relation with two different tag clusters, the first
containing names of directors, actors or characters of popular horror movies, the
second composed of adjectives typically used to describe disturbing films.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>Our experimental validation has shown that our approach is able to identify
meaningful clusters. Moreover, it outperforms state-of-the-art methods for most
datasets. Even when our algorithm is not the best one, we have found that
the competitors cannot work properly without specifying a correct number of
clusters for each mode of the tensor. As future work, we will design a specific
algorithm for sparse tensors with the aim of reducing the overall computational
complexity of the approach. Finally, we will further investigate the ability of our
method to identify candidate outliers as small clusters in the data.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Battaglia</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pensa</surname>
          </string-name>
          , R.G.:
          <article-title>Parameter-less tensor co-clustering</article-title>
          .
          <source>In: Proceedings of DS 2019</source>
          . pp.
          <volume>205</volume>
          –
          <issue>219</issue>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Cantador</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brusilovsky</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , Kuflik, T.:
          <source>2nd Workshop on Information Heterogeneity and Fusion in Recommender Systems (HetRec 2011)</source>
          .
          <source>In: Proc. RecSys</source>
          <year>2011</year>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Cao</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wei</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          , Han,
          <string-name>
            <given-names>Y.</given-names>
            ,
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <surname>D.</surname>
          </string-name>
          :
          <article-title>Robust face clustering via tensor decomposition</article-title>
          .
          <source>IEEE Trans. Cybernetics</source>
          <volume>45</volume>
          (
          <issue>11</issue>
          ),
          <volume>2546</volume>
          –
          <fpage>2557</fpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Dhillon</surname>
            ,
            <given-names>I.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mallela</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Modha</surname>
            ,
            <given-names>D.S.</given-names>
          </string-name>
          :
          <article-title>Information-theoretic co-clustering</article-title>
          .
          <source>In: Proc. ACM SIGKDD</source>
          <year>2003</year>
          . pp.
          <volume>89</volume>
          –
          <issue>98</issue>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Goodman</surname>
            ,
            <given-names>L.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kruskal</surname>
            ,
            <given-names>W.H.</given-names>
          </string-name>
          :
          <article-title>Measures of association for cross classifications</article-title>
          .
          <source>Journal of the American Statistical Association</source>
          <volume>49</volume>
          ,
          <issue>732</issue>
          –
          <fpage>764</fpage>
          (
          <year>1954</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Harshman</surname>
            ,
            <given-names>R.A.</given-names>
          </string-name>
          :
          <article-title>Foundations of the PARAFAC procedure: models and conditions for an "explanatory" multimodal factor analysis</article-title>
          .
          <source>UCLA Working Papers in Phonetics 16</source>
          ,
          <issue>1</issue>
          –
          <fpage>84</fpage>
          (
          <year>1970</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ding</surname>
            ,
            <given-names>C.H.Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luo</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Simultaneous tensor subspace selection and clustering: the equivalence of high order svd and k-means clustering</article-title>
          .
          <source>In: Proc. ACM SIGKDD</source>
          <year>2008</year>
          . pp.
          <volume>327</volume>
          –
          <issue>335</issue>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Hubert</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Arabie</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Comparing partitions</article-title>
          .
          <source>J. Classif</source>
          .
          <volume>2</volume>
          (
          <issue>1</issue>
          ),
          <volume>193</volume>
          –
          <fpage>218</fpage>
          (
          <year>1985</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Papalexakis</surname>
            ,
            <given-names>E.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sidiropoulos</surname>
            ,
            <given-names>N.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bro</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>From K-means to higher-way coclustering: Multilinear decomposition with sparse latent factors</article-title>
          .
          <source>IEEE Trans. Signal Processing</source>
          <volume>61</volume>
          (
          <issue>2</issue>
          ),
          <volume>493</volume>
          –
          <fpage>506</fpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Robardet</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Feschet</surname>
          </string-name>
          , F.:
          <article-title>Efficient local search in conceptual clustering</article-title>
          .
          <source>In: Proceedings of DS 2001</source>
          . pp.
          <volume>323</volume>
          –
          <issue>335</issue>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Strehl</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghosh</surname>
          </string-name>
          , J.:
          <article-title>Cluster ensembles { A knowledge reuse framework for combining multiple partitions</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>3</volume>
          ,
          <issue>583</issue>
          –
          <fpage>617</fpage>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Tucker</surname>
            ,
            <given-names>L.R.:</given-names>
          </string-name>
          <article-title>Some mathematical notes on three-mode factor analysis</article-title>
          .
          <source>Psychometrika</source>
          <volume>31</volume>
          ,
          <issue>279</issue>
          –
          <fpage>311</fpage>
          (
          <year>1966</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zeng</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Multiway clustering via tensor block models</article-title>
          .
          <source>In: Proc. of NeurIPS 2019</source>
          . pp.
          <volume>713</volume>
          –
          <issue>723</issue>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zong</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Web co-clustering of usage network using tensor decomposition</article-title>
          .
          <source>In: Proceedings of ECBS 2009</source>
          . pp.
          <volume>311</volume>
          –
          <issue>314</issue>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>