Tensor Co-clustering: a Parameter-less Approach
                        (DISCUSSION PAPER)

                     Elena Battaglia and Ruggero G. Pensa

            University of Turin, Dept. of Computer Science, Turin, Italy
                  {elena.battaglia,ruggero.pensa}@unito.it



      Abstract. Tensor co-clustering has proven useful in many applications,
      due to its ability to cope with high-dimensional and sparse data. How-
      ever, setting up a co-clustering algorithm properly requires the specifi-
      cation of the desired number of clusters for each mode as an input
      parameter. To address this issue, we propose a tensor co-clustering algo-
      rithm that does not require the number of desired co-clusters as input,
      as it optimizes an objective function based on a measure of association
      across discrete random variables that is not affected by their cardinality.
      The effectiveness of our algorithm is shown on real-world datasets, also
      in comparison with state-of-the-art co-clustering methods.




1   Introduction
Tensors are widely used mathematical objects that well represent complex infor-
mation such as gene expression data, social networks, heterogeneous information
networks, time-evolving data, behavioral patterns, and multi-lingual text cor-
pora. In general, every n-ary relation can be easily represented as a tensor. From
the algebraic point of view, in fact, tensors can be seen as multidimensional gen-
eralizations of matrices and, as such, can be processed with mathematical and
computational methods that generalize those usually employed to analyze data
matrices, e.g., non-negative factorization, singular value decomposition, itemset
and association rule mining, clustering and co-clustering.
    Clustering, in particular, is by far one of the most popular unsupervised ma-
chine learning techniques, since it allows analysts to obtain an overview of the
intrinsic similarity structures of the data with relatively little background knowl-
edge about them. However, with the availability of high-dimensional heteroge-
neous data, co-clustering has gained popularity, since it provides a simultaneous
partitioning of each mode. Despite its proven usefulness, the correct application
of tensor co-clustering is limited by the fact that it requires the specification of
a congruent number of clusters for each mode, while, in realistic analysis sce-
narios, the actual number of clusters is unknown.
Fig. 1: An example of tensor co-clustering with the related contingency tensor
and the associated Goodman-Kruskal's τ measures.


Furthermore, matrix/tensor (co-)clustering is often based on a preliminary ten-
sor factorization step or on latent block models [14, 3, 13] that, in their turn,
require further input parameters (e.g., the number of latent factors/blocks
within each mode). As a consequence, it is practically impossible to explore all
combinations of parameter values in order to identify the best clustering results.
     The main reason for this problem is that most clustering algorithms (and ten-
sor factorization approaches) optimize objective functions that strongly depend
on the number of clusters (or factors). Hence, two solutions with two different
numbers of clusters cannot be compared directly. Although this considerably
reduces the size of the search space, it prevents the discovery of a better
partitioning once a wrong number of clusters is selected. In this paper, we
address this limitation by proposing a new tensor co-clustering algorithm that
optimizes an objective function that can be viewed as an n-mode extension of an
association measure called Goodman-Kruskal's τ [5], whose local optima do not
depend on the number of clusters. Compared with state-of-the-art techniques
that require the desired number of clusters in each mode as input parameters,
our approach achieves similar or better results on several real-world datasets.
     The algorithm presented in this paper was first introduced in [1]; the in-
terested reader may refer to it for further theoretical and experimental details.


2   An association measure for tensor co-clustering

The objective function optimized by our tensor co-clustering algorithm is an
association measure, called Goodman and Kruskal's τ [5], that evaluates the
dependence between two discrete variables and has been used to evaluate the
quality of 2-way co-clustering [10]. Goodman and Kruskal's τ estimates the
strength of the link between two discrete variables X and Y according to the
proportional reduction of the error in predicting one of them knowing the other:

$$\tau_{X|Y} = \frac{e_X - E[e_{X|Y}]}{e_X},$$

where $e_X$ is the error in predicting X (estimated as the probability that two
different observations from the marginal distribution of X fall in different
categories) and $E[e_{X|Y}]$ is the expected value of the conditional error taken
with respect to the distribution of Y. Conversely, the proportional reduction of
the error in predicting Y when X is known is $\tau_{Y|X} = \frac{e_Y - E[e_{Y|X}]}{e_Y}$.
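    As a concrete illustration, the following minimal NumPy sketch (ours, not
the original implementation) computes $\tau_{X|Y}$ from a 2-way contingency
table, using $e_X = 1 - \sum_i p_{i\cdot}^2$ and $E[e_{X|Y}] = 1 - \sum_{i,j} p_{ij}^2 / p_{\cdot j}$,
which follow directly from the definitions above:

    import numpy as np

    def goodman_kruskal_tau(T):
        """Goodman-Kruskal tau_{X|Y} from a 2-way contingency table T.

        Rows of T index the categories of X, columns those of Y; entries
        are non-negative counts. Sketch only: assumes no empty column.
        """
        P = T / T.sum()                          # joint probabilities p_ij
        px = P.sum(axis=1)                       # marginal distribution of X
        py = P.sum(axis=0)                       # marginal distribution of Y
        e_x = 1.0 - np.sum(px ** 2)              # error in predicting X alone
        e_x_given_y = 1.0 - np.sum(P ** 2 / py)  # expected conditional error
        return (e_x - e_x_given_y) / e_x

    # Example: a strongly associated table yields a tau close to 1.
    T = np.array([[40.0, 2.0], [3.0, 55.0]])
    print(goodman_kruskal_tau(T))  # about 0.81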
    In order to use this measure for the evaluation of a tensor co-clustering, we
need to extend it so that τ can evaluate the association of n distinct discrete
variables $X_1, \ldots, X_n$. Reasoning as in the two-dimensional case, we can
define the reduction of the error in predicting $X_i$ when $(X_j)_{j \neq i}$ are
all known as

$$\tau_{X_i} = \tau_{X_i|(X_j)_{j \neq i}} = \frac{e_{X_i} - E[e_{X_i|(X_j)_{j \neq i}}]}{e_{X_i}}$$

for all $i \leq n$. When $n = 2$, the measure coincides with Goodman-Kruskal's τ.
    We will now see how to use the function τ to evaluate a tensor co-clustering.
Let $\mathcal{X} \in \mathbb{R}_+^{m_1 \times \cdots \times m_n}$ be a tensor with n modes and non-negative values. Let us
denote with $x_{k_1 \ldots k_n}$ the generic element of $\mathcal{X}$, where $k_i = 1, \ldots, m_i$ for each mode
$i = 1, \ldots, n$. A co-clustering P of $\mathcal{X}$ is a collection of n partitions $\{P_i\}_{i=1,\ldots,n}$,
where $P_i = \cup_{j=1}^{c_i} C_j^i$ is a partition of the elements on the i-th mode of $\mathcal{X}$ in $c_i$
groups, with $c_i \leq m_i$ for each $i = 1, \ldots, n$. Each co-clustering P can be associated
to a tensor $T^P \in \mathbb{R}_+^{c_1 \times \cdots \times c_n}$, whose generic element is the sum of all entries of $\mathcal{X}$
in the same co-cluster. We can look at $T^P$ as the n-modal contingency table that
empirically estimates the joint distribution of n discrete variables $X_1, \ldots, X_n$,
where each $X_i$ takes values in $\{C_1^i, \ldots, C_{c_i}^i\}$. From this contingency table, we can
derive the marginal probabilities of each variable $X_i$ (i.e., the probability that a
generic element of mode i falls in a particular cluster on that mode) and we can
use these probabilities to compute the Goodman-Kruskal τ functions: in this
way we associate to each co-clustering P over $\mathcal{X}$ a vector $\tau^P = (\tau_{X_1}^P, \ldots, \tau_{X_n}^P)$
that can be used to evaluate the quality of the co-clustering. The overall co-
clustering schema is depicted in Figure 1.
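    Both steps above (aggregation into $T^P$ and computation of the τ vector)
can be sketched in a few lines of NumPy; function names and the label encoding
are our illustrative choices, not the paper's code:

    import numpy as np

    def contingency_tensor(X, partitions):
        """Aggregate tensor X into T^P by summing entries per co-cluster.

        partitions[i] is an integer array of cluster labels (0..c_i-1),
        one label per element of mode i.
        """
        T = X
        for mode, labels in enumerate(partitions):
            M = np.eye(labels.max() + 1)[labels]   # m_i x c_i membership matrix
            T = np.tensordot(T, M, axes=([mode], [0]))
            T = np.moveaxis(T, -1, mode)           # restore the mode's position
        return T

    def tau_vector(T):
        """tau_{X_i} for every mode of the contingency tensor T^P.

        Sketch only: assumes every co-cluster slice has non-zero mass.
        """
        P = T / T.sum()
        taus = []
        for i in range(P.ndim):
            other = tuple(j for j in range(P.ndim) if j != i)
            p_i = P.sum(axis=other)                  # marginal of mode i
            p_rest = P.sum(axis=i, keepdims=True)    # joint marginal of the others
            e_i = 1.0 - np.sum(p_i ** 2)
            e_cond = 1.0 - np.sum(P ** 2 / p_rest)   # E[e_{X_i|(X_j), j != i}]
            taus.append((e_i - e_cond) / e_i)
        return taus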


3    A stochastic local search approach to co-clustering

Our co-clustering approach can be formulated as a multi-objective optimization
problem: given a tensor $\mathcal{X}$ with n modes and dimension $m_i$ on mode i, an optimal
co-clustering P for $\mathcal{X}$ is one that is not dominated by any other co-clustering,
i.e., there exists no co-clustering Q with $\tau_{X_j}^Q \geq \tau_{X_j}^P$ for all $j = 1, \ldots, n$
and $\tau_{X_j}^Q > \tau_{X_j}^P$ for at least one j.
    Since we do not fix the number of clusters, the space of possible solutions is
huge (for example, given a very small tensor of dimension 10 × 10 × 10, the number
of possible co-clusterings is about $1.56 \times 10^{15}$): it is clear that a systematic
exploration of all possible solutions is not feasible for a generic tensor $\mathcal{X}$. For
this reason we need a heuristic that allows us to reach a “good” partition of $\mathcal{X}$,
i.e., a partition P with high values of $\tau_{X_k}^P$ for all modes k. With this aim, we
propose a stochastic local search approach to solve the maximization problem.
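    That count is the cube of the Bell number $B_{10} = 115975$ (one independent
set partition per mode); a small self-check, using our own helper function:

    from functools import lru_cache
    from math import comb

    @lru_cache(maxsize=None)
    def bell(n):
        """Bell number B(n): the number of partitions of an n-element set."""
        if n == 0:
            return 1
        return sum(comb(n - 1, k) * bell(k) for k in range(n))

    print(bell(10))       # 115975
    print(bell(10) ** 3)  # 1559887017484375, i.e. about 1.56e15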
 Algorithm 1: τTCC(X, N_iter)
   Input: an m_1 × · · · × m_n tensor X, N_iter
   Result: P_1, . . . , P_n
   Initialize P_1, . . . , P_n with discrete partitions;
   i ← 0;
   iter_without_moves ← 0;
   while i ≤ N_iter and iter_without_moves < max_{j=1,...,n}(m_j) do
       for k = 1 to n do
           if iter_without_moves < t then
               randomly choose C_b^k in P_k and x in C_b^k;
           else
               x ← next(k); // the element following the one selected at
                 iteration i − 1 on mode k
               C_b^k ← the cluster of x;
           end
           for C_j^k in P_k ∪ {∅} do
               Q_k^j ← (P_k \ {C_b^k, C_j^k}) ∪ {C_b^k \ {x}, C_j^k ∪ {x}};
               Q^j ← (P_1, . . . , P_{k−1}, Q_k^j, P_{k+1}, . . . , P_n);
               compute the contingency tensor T^j associated to Q^j and τ^{Q^j};
           end
           e ← SelectBestPartition(k, b, (τ^{Q^j})_{j=1,...,|P_k ∪ {∅}|});
           P_k ← Q_k^e;
           if e == b then
               iter_without_moves ← iter_without_moves + 1;
           else
               iter_without_moves ← 0;
           end
           i ← i + 1;
       end
   end




    Algorithm 1 provides the general sketch of our tensor co-clustering algorithm,
called τTCC. It repeatedly considers the modes one by one, sequentially, and
tries to improve the quality of the co-clustering by moving a single element from
its original cluster $C_b^k$ to the cluster on the same mode, $C_e^k$, that most
improves the partition according to a chosen quality criterion. When all the n
modes have been considered, the i-th iteration of the algorithm is concluded. If
all objects have been tried but no move is possible, the algorithm ends. At the
end of each iteration, one of the following possible moves has been performed
on mode k:

 – an object x has been moved from cluster $C_b^k$ to a pre-existing cluster $C_e^k$:
   in this case the final number of clusters on mode k remains the same (let us
   call it $c_k$) if $C_b^k$ is non-empty after the move; if $C_b^k$ is empty after the
   move, it is deleted and the final number of clusters is $c_k - 1$;
 – an object x has been moved from cluster $C_b^k$ to a new cluster $C_e^k = \emptyset$: the
   final number of clusters on mode k will be $c_k + 1$ (the useless case when x is
   moved from $C_b^k = \{x\}$ to $C_e^k = \emptyset$ is not considered);
 – no move has been performed and the number of clusters remains $c_k$.

Thus, during the iterative process, the updating procedure is able to increase or
decrease the number of clusters at any time. This is due to the fact that, contrary
to other measures, such as the loss in mutual information [4], the τ measure has
an upper bound that does not depend on the number of co-clusters and thus
enables the comparison of co-clustering solutions with different cardinalities.
    As mentioned above, the choice of the best cluster on mode k into which
the selected element x should be moved depends on how we decide to measure
the increase in the quality of the tensor partition. We can define different
measures, corresponding to different ways to implement function
SelectBestPartition in Algorithm 1. According to our experiments, the one with
the best performance in terms of speed of convergence and quality of the identified
co-clusters is the following. Suppose we want to move an object on mode k:
we consider only those moves that improve (or at least do not worsen) $\tau_{X_k}$ and,
among them, we choose the one with the greatest value of $avg(\tau) = \frac{1}{n}\sum_{j=1}^{n} \tau_{X_j}$.
It can be proven that algorithm τTCC with this selection strategy converges to
a Pareto local optimum.
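    In code, the selection rule could look as follows (an illustrative sketch with
our own names, not the paper's implementation); taus[j] is the τ vector of
candidate partition $Q^j$ and index b is the no-move candidate:

    def select_best_partition(taus, b, k):
        """Return the index of the candidate move to apply on mode k.

        Keeps only candidates that do not worsen tau_{X_k} with respect
        to the no-move candidate b, then picks the one with the highest
        average tau over all modes.
        """
        best, best_avg = b, sum(taus[b]) / len(taus[b])
        for j, t in enumerate(taus):
            if t[k] >= taus[b][k]:                 # do not worsen mode k
                avg = sum(t) / len(t)
                if avg > best_avg:
                    best, best_avg = j, avg
        return best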


4      Experiments and discussion

In this section, we test the effectiveness of our co-clustering algorithm through
experiments on the following three real-world datasets: the “four-area” DBLP
dataset1; the “hetrec2011-movielens-2k” dataset2 [2], from which we extract two
different tensors, MovieLens1 and MovieLens2; and the Yelp dataset3, from which
we extract two tensors, YelpTOR and YelpPGH. To assess the quality of the clus-
tering performances, we consider two measures commonly used in the clustering
literature: normalized mutual information (NMI) [11] and adjusted rand index
(ARI) [8].
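    Both measures are available off the shelf, e.g. in scikit-learn; a minimal
usage sketch with toy labels:

    from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

    # Ground-truth classes vs. cluster labels on the main mode (toy example).
    y_true = [0, 0, 1, 1, 2, 2]
    y_pred = [0, 0, 1, 2, 2, 2]
    print(normalized_mutual_info_score(y_true, y_pred))
    print(adjusted_rand_score(y_true, y_pred))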
    We compare our results with those of other state-of-the-art tensor co-clustering
algorithms. nnCP is the non-negative CP decomposition [6]; it can be used to
co-cluster a tensor, as done by [14], by assigning each element in each mode to the
cluster corresponding to the latent factor with the highest value. nnTucker is the
non-negative Tucker decomposition [12]. nnCP+kmeans and nnT+kmeans
combine the CP (or Tucker) decomposition with a post-processing phase in which
k-means is applied to each of the latent factor matrices, similarly to what has
been done in [7] and [3]. SparseCP consists of a CP decomposition with non-
negative sparse latent factors [9]. Finally, TBM performs tensor co-clustering
1 http://web.cs.ucla.edu/~yzsun/data/DBLP_four_area.zip
2 https://grouplens.org/datasets/hetrec-2011/
3 https://www.yelp.com/dataset
Table 1: Results achieved by the co-clustering algorithms on the real-world
datasets (average values over 5 runs). NMI and ARI are computed for the main
mode (authors in DBLP, movies in MovieLens and restaurants in Yelp).
     Dataset        DBLP       MovieLens1    MovieLens2     YelpTOR        YelpPGH
    Algorithm    NMI   ARI     NMI   ARI     NMI   ARI     NMI   ARI     NMI   ARI
      τTCC       0.74  0.80    0.66  0.71    0.40  0.51    0.42  0.44    0.38  0.28
    nnTucker     0.78  0.84    0.42  0.53    0.24  0.17    0.43  0.39    0.28  0.19
      nnCP       0.74  0.80    0.38  0.34    0.11  0.03    0.42  0.37    0.10  0.06
    SparseCP     0.00  0.00    0.00  0.00    0.00  0.00    0.20  0.15    0.11  0.04
      TBM         -     -      0.00  0.01    0.01  0.01    0.02  0.01    0.10  0.04
  nnCP+kmeans    0.24  0.08    0.31  0.21    0.15  0.07    0.11  0.01    0.09  0.03
   nnT+kmeans    0.25  0.06    0.38  0.31    0.17  0.09    0.09  0.01    0.09  0.03




via the Tensor Block Model [13]. The last four methods require as input parame-
ters the numbers of clusters on each mode: we set these numbers equal to the true
number of classes on the three modes of the tensor. When further parameters
are needed, we follow the instructions given in the original papers for setting
them. Algorithms nnCP, nnTucker and their variants with k-means are applied
trying different ranks of the decomposition, and we report the best result obtained.
In this way we are giving a considerable advantage to our competitors: we choose
the rank of the decomposition and the number of clusters by looking at the actual
number of categories, which are unknown in standard unsupervised settings. De-
spite this, as shown in Table 1, τTCC outperforms the other algorithms on all
datasets but one (DBLP) and has comparable results on another (YelpTOR).
In these cases, non-negative Tucker decomposition (with the number of latent
factors set to the correct number of embedded clusters) achieves the best results
and non-negative CP decomposition obtains results that are comparable with
those of τTCC. However, when we modify, even slightly, the number of latent
factors (see Figure 2), the results immediately get worse than those of τTCC.
    The number of clusters identified by τTCC is usually close to the correct
number of embedded clusters: on average, 5 instead of 4 for DBLP, 5 instead
of 3 for MovieLens1, the correct number 3 for MovieLens2, and 5 instead of 3 for
YelpPGH. Only YelpTOR presents a number of clusters (13) that is far from
the correct number of classes (3). However, more than 85% of the objects
are classified in 3 large clusters, while the remaining objects form very small
clusters: we consider these objects as candidate outliers. The same behaviour is
even more pronounced in DBLP, where four clusters contain 99.9% of the
objects and only 2 objects fall in the “extra cluster”.
    Lastly, we provide some insights about the quality of the clusters identified
by our algorithm. To this purpose, we consider a co-clustering of the MovieLens1
dataset. This dataset contains 181 movies and all the tags assigned by the users
to each movie. We construct a (215 × 181 × 142)-dimensional tensor, where the
three modes represent users, movies and tags. The movies are labelled with three
categories (Animation, Horror and Documentary). Algorithm τTCC identifies
five clusters of movies, instead of the three categories we consider as labels. The
tag clouds in Figure 3 illustrate the 30 movies with the most tags in each cluster
(text size depends on the actual number of tags).
[Figure: line plots of NMI vs. rank of the decomposition (2 to 10) for NNT,
NNP and τTCC; panels (a) DBLP and (b) YelpTOR.]

Fig. 2: Variation of nnTucker/nnCP results w.r.t. the rank of the decomposition
in DBLP and YelpTOR datasets.




[Figure: tag clouds for the five movie clusters, (a) Cl1 - Cartoons, (b) Cl2 -
Wallace&Gromit, (c) Cl3 - Anime, (d) Cl4 - Horror, (e) Cl5 - Documentaries.]

Fig. 3: First 30 movies in each cluster identified by τTCC on dataset MovieLens1.



It can be easily observed that the first cluster concerns animated movies for
children, mainly Disney and Pixar movies; the second one is a small cluster
containing animated movies realized with the claymation technique (mainly
movies from the Wallace and Gromit saga or other films by the same director);
the third cluster is still a subset of the animated movies, but it contains anime
and animated films from Japan. The fourth cluster is composed mainly of horror
movies and the last one contains only documentaries. On the tag mode, our
algorithm finds thirteen clusters. Six of them contain more than 90% of the
total tags, while only 10 uninformative tags are partitioned into the other 7
very small clusters and could be considered as outliers. There is a one-to-one
correspondence between four clusters of movies (Cartoons, Anime,
Wallace&Gromit and Documentary) and four of the tag clusters; cluster Horror,
instead, can be put in relation with two different tag clusters, the first
containing names of directors, actors or characters of popular horror movies,
the second composed of adjectives typically used to describe disturbing films.
5    Conclusions

Our experimental validation has shown that our approach is able to identify
meaningful clusters. Moreover, it outperforms state-of-the-art methods on most
datasets. Even when our algorithm is not the best one, we have found that
the competitors cannot work properly without specifying a correct number of
clusters for each mode of the tensor. As future work, we will design a specific
algorithm for sparse tensors with the aim of reducing the overall computational
complexity of the approach. Finally, we will further investigate the ability of our
method to identify candidate outliers as small clusters in the data.


References
 1. Battaglia, E., Pensa, R.G.: Parameter-less tensor co-clustering. In: Proceedings of
    DS 2019. pp. 205–219 (2019)
 2. Cantador, I., Brusilovsky, P., Kuflik, T.: 2nd Workshop on Information Hetero-
    geneity and Fusion in Recommender Systems (HetRec 2011). In: Proc. RecSys
    2011 (2011)
 3. Cao, X., Wei, X., Han, Y., Lin, D.: Robust face clustering via tensor decomposition.
    IEEE Trans. Cybernetics 45(11), 2546–2557 (2015)
 4. Dhillon, I.S., Mallela, S., Modha, D.S.: Information-theoretic co-clustering. In:
    Proc. ACM SIGKDD 2003. pp. 89–98 (2003)
 5. Goodman, L.A., Kruskal, W.H.: Measures of association for cross classification.
    Journal of the American Statistical Association 49, 732–764 (1954)
 6. Harshman, R.A.: Foundations of the PARAFAC procedure: Models and conditions
    for an “explanatory” multimodal factor analysis. UCLA Working Papers in Pho-
    netics 16, 1–84 (1970)
 7. Huang, H., Ding, C.H.Q., Luo, D., Li, T.: Simultaneous tensor subspace selection
    and clustering: the equivalence of high order svd and k-means clustering. In: Proc.
    ACM SIGKDD 2008. pp. 327–335 (2008)
 8. Hubert, L., Arabie, P.: Comparing partitions. J. Classif. 2(1), 193–218 (1985)
 9. Papalexakis, E.E., Sidiropoulos, N.D., Bro, R.: From K-means to higher-way co-
    clustering: Multilinear decomposition with sparse latent factors. IEEE Trans. Sig-
    nal Processing 61(2), 493–506 (2013)
10. Robardet, C., Feschet, F.: Efficient local search in conceptual clustering. In: Pro-
    ceedings of DS 2001. pp. 323–335 (2001)
11. Strehl, A., Ghosh, J.: Cluster ensembles – A knowledge reuse framework for com-
    bining multiple partitions. Journal of Machine Learning Research 3, 583–617 (2002)
12. Tucker, L.R.: Some mathematical notes on three-mode factor analysis. Psychome-
    trika 31, 279–311 (1966)
13. Wang, M., Zeng, Y.: Multiway clustering via tensor block models. In: Proc. of
    NeurIPS 2019. pp. 713–723 (2019)
14. Zhou, Q., Xu, G., Zong, Y.: Web co-clustering of usage network using tensor de-
    composition. In: Proceedings of ECBS 2009. pp. 311–314 (2009)