    Multimodal Clustering of Boolean Tensors on
        MapReduce: Experiments Revisited

            Dmitry I. Ignatov, Dmitry Tochilkin, and Dmitry Egurnov

     National Research University Higher School of Economics, Russian Federation
                                  dignatov@hse.ru
                                 http://www.hse.ru


        Abstract. This paper presents a further development of distributed mul-
        timodal clustering. We introduce a new version of the multimodal clustering
        algorithm for distributed processing with Apache Hadoop on computer clus-
        ters. Its implementation allows a user to conduct clustering on data with
        modality greater than two. We provide the time and space complexity of the
        algorithm and justify its relevance. The algorithm is adapted to the MapRe-
        duce distributed processing model. The program, implemented by means
        of the Apache Hadoop framework, is able to perform parallel computing on
        thousands of nodes.

        Keywords: Formal Concept Analysis, n-ary relations, Boolean tensors,
        data mining, big data, MapReduce


1     Introduction
Mining of multimodal patterns in n-ary relations or Boolean tensors is among
the popular topics in Data Mining and Machine Learning [4,1,28,10,21,27,9]. Thus,
cluster analysis of multimodal data, and specifically of dyadic and triadic relations,
is a natural extension of the original clustering idea. In the dyadic case, biclustering
methods (the term bicluster was coined in [22]) are used to simultaneously find
subsets of the sets of objects and attributes that form homogeneous patterns of
the input object-attribute data. In fact, one of the most popular applications
of biclustering is gene expression analysis in Bioinformatics [20,2]. Triclustering
methods operate in the triadic case, where each object-attribute pair is assigned
a set of conditions [23,12,5]. Both biclustering and triclustering algorithms
are widely used in such areas as gene expression analysis [33,19,16], recommender
systems [24,14,13], social network analysis [7], natural language processing [29],
etc. The processing of numeric multimodal data is also possible by modifications
of existing approaches for mining binary relations [15].
    Though there are methods that can enumerate all triclusters satisfying cer-
tain constraints [1] (in most cases they ensure that triclusters are dense), their
time complexity is rather high: in the worst case, the maximal number of tri-
clusters is usually exponential (e.g., in the case of formal triconcepts), showing that
these methods are hardly scalable. To process big data, algorithms should require at
most linear time complexity (e.g., O(|I|) in the case of an n-ary relation I) and be eas-
ily parallelisable. In addition, especially in the case of data streams [26], the output
patterns should be the result of a single pass over the data.
    Earlier, in order to create an algorithm satisfying these requirements, we
adapted a triclustering method based on prime operators (prime OAC-triclustering
method) [5] and proposed its online version, which has linear time complexity; it
is also one-pass and easily parallelisable [6]. However, its parallelisation is pos-
sible in different ways. For example, one can use a popular framework for com-
modity hardware, Map-Reduce (M/R) [25]. In the past, there were several suc-
cessful M/R implementations in the FCA community and other lattice-oriented
domains. Thus, in [17], the authors adapted the Close-by-One algorithm to the
M/R framework and showed its efficiency. In the same year, in [18], an efficient
M/R algorithm for the computation of closed cube lattices was proposed. The
authors of [32] demonstrated that iterative algorithms like Ganter's NextClosure
can benefit from the usage of iterative M/R schemes.
    Our previous M/R implementation of the triclustering method based on prime
operators was proposed in [34], showing computational benefits on rather large
datasets. The M/R triclustering algorithm [34] is a successful distributed adaptation
of the online version of prime OAC-triclustering [6]. This method uses the MapRe-
duce approach as a means for task allocation on computational clusters, launching
the online version of prime OAC-triclustering on each reducer of the first phase.
However, due to the simplicity of this adaptation, the algorithm does not use the
advantages of MapReduce to the full extent. At the first stage, all the input triples
are split into a number of groups equal to the number of reducers by means of a
hash function applied to the entities of one of the types (object, attribute, or con-
dition), whose values are used as keys. It is clear that this way of data allocation
cannot guarantee uniformity of the group sizes. The JobTracker used in Apache
Hadoop is able to evenly allocate tasks across nodes (see https://wiki.apache.
org/hadoop/JobTracker). To do so, the number of tasks should be larger than the
number of working nodes, which is not fulfilled in this implementation. For exam-
ple, let us assume we have 10 reduce SlaveNodes; respectively, r = 10 and the hash
function is applied to the objects (the first element in each input triple). However,
due to the non-uniformity of the hash values modulo 10, it may happen that the set
of objects yields fewer than 10 different residues under division by 10. In this case,
the input triples are distributed into parts of different sizes and processed by only
a fraction of the cluster nodes. Such cases are rather rare; they are possible only
when slicing by entities (objects, attributes, or conditions) with a small number of
distinct elements. However, they may slow down the cluster work drastically.
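    To make the skew concrete, the following minimal sketch (ours, not part of the
paper's implementation; the keys g1, g2, g3 are hypothetical) mimics Hadoop-style
hash partitioning and shows that with only three distinct key values at most three
of the ten reducers ever receive data:

import java.util.HashMap;
import java.util.Map;

// Sketch: Hadoop's default HashPartitioner assigns a key to partition
// (key.hashCode() & Integer.MAX_VALUE) % numReducers. With few distinct
// keys, most partitions (and hence reducers) stay empty.
public class SkewDemo {
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        int r = 10; // number of reduce SlaveNodes
        String[] objectKeys = {"g1", "g2", "g3"}; // only 3 distinct objects
        Map<Integer, Integer> load = new HashMap<>();
        for (String key : objectKeys) {
            load.merge(partition(key, r), 1, Integer::sum);
        }
        // At most 3 of the 10 partitions are non-empty, so at most 3
        // reducers do all the work, regardless of the data volume.
        System.out.println("Non-empty partitions: " + load.keySet());
    }
}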
    The weakest link is the second stage of the algorithm. During the first stage,
triclusters are computed for each data slice separately. Hence, they are not the
final triclusters; we need to merge the obtained results.
    Let us consider the example in Table 1 with a ternary relation on users, items,
and labels. Let us assume that the first mapper splits the data according to their
label component and r = 2; then the triples containing label l1 and those related
to label l2 are processed on different nodes.
                        Table 1. An example with triadic data

                                      l1                l2
                                   i1    i2          i1    i2
                               u1  ×             u1  ×
                               u2  ×     ×       u2  ×     ×
                               u3        ×       u3  ×



    After the first stage completes on the first node, we obtain the tricluster
({u2}, {i1, i2}, {l1}) among others, while the second node produces the tricluster
({u2}, {i1, i2}, {l2}). Clearly, neither tricluster is complete for the whole input
dataset, and they should be merged into ({u2}, {i1, i2}, {l1, l2}). The second stage
of the algorithm is responsible for this type of merging. However, as one can see,
this merging assumes that all the intermediate data are located on the same node.
In a big data setting, such allocation of all the intermediate results on a single
node is a critical point for application performance.
    To calculate tricluster components (or cumuli, see Section 3) and assemble the
final triclusters from them, we need to have large data slices on the computational
nodes. Moreover, during parallelisation of the algorithm, those data slices can be
required simultaneously on different nodes. Thus, to solve these problems, one
can centralise the data (all the required slices should be present at the same node
simultaneously); however, this leads to the accumulation of overly large parts of
the data, as described above. Another approach is to perform data replication. In
this case, the total amount of data processed on the computational nodes increases,
while the data are evenly distributed in the system. The latter approach has been
chosen for our updated study on multimodal clustering.
    Note that experts warn potential M/R users: “the entire distributed-file-
system milieu makes sense only when files are very large and are rarely updated
in place” [25]. In this work, as in our previous study, we assume that there is a
large bulk of data to process and that the data do not arrive online.
    The rest of the paper is organised as follows. In Section 2, we recall the original
method and the online version of the prime OAC-triclustering algorithm. Section 3
generalises prime OAC-triclustering to the case of multimodal data. In Section 4,
we describe the M/R setting of the problem and the corresponding M/R version
of the original algorithm along with important implementation aspects. Finally,
in Section 5, we show the results of several experiments which demonstrate the
efficiency of the M/R version of the algorithm.

2    Prime object-attribute-condition triclustering
Prime object-attribute-condition triclustering (OAC-prime), based on Formal
Concept Analysis [31,3], is an extension of the object-attribute biclustering
method [11] to the triadic case. Triclusters generated by this method have a
structure similar to that of the corresponding biclusters, namely the cross-like
structure of triples inside the input data cuboid (i.e., the formal tricontext).
    Let K = (G, M, B, I) be a triadic context, where G, M , B are respectively
the sets of objects, attributes, and conditions, and I ⊆ G × M × B is a tri-
adic incidence relation. Each prime OAC-tricluster is generated by applying the
following prime operators to each pair of components of some triple:

             (X, Y )′ = {b ∈ B | (g, m, b) ∈ I for all g ∈ X, m ∈ Y },
             (X, Z)′ = {m ∈ M | (g, m, b) ∈ I for all g ∈ X, b ∈ Z},                 (1)
             (Y, Z)′ = {g ∈ G | (g, m, b) ∈ I for all m ∈ Y, b ∈ Z},

where X ⊆ G, Y ⊆ M , and Z ⊆ B.
    Then the triple T = ((m, b)′ , (g, b)′ , (g, m)′ ) is called a prime OAC-tricluster
based on the triple (g, m, b) ∈ I. The components of a tricluster are called, respec-
tively, the tricluster extent, the tricluster intent, and the tricluster modus. The
triple (g, m, b) is called a generating triple of the tricluster T . Figure 1 shows the
structure of an OAC-tricluster (X, Y, Z) based on a triple (g̃, m̃, b̃): the triples
corresponding to the gray cells are contained in the context, while other triples of
the tricluster (cuboid) may be contained in it as well.
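    As a worked instance on the data of Table 1, take the generating triple
(u2, i1, l1). Then (i1, l1)′ = {u ∈ G | (u, i1, l1) ∈ I} = {u1, u2}, (u2, l1)′ =
{i1, i2}, and (u2, i1)′ = {l1, l2}, so the resulting prime OAC-tricluster is
T = ({u1, u2}, {i1, i2}, {l1, l2}).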




Fig. 1. Structure of prime OAC-triclusters: the dense cross-like central layer containing
g̃ (left) and the layer for an object g (right) in M × B dimensions.




    The basic algorithm for the prime OAC-triclustering method is rather straight-
forward (see [5]). First, for each combination of elements from every two sets
of K, we apply the corresponding prime operator (we call the resulting sets
prime sets). After that, we enumerate all triples of I; at each step we generate
a tricluster based on the corresponding triple, check whether this tricluster is
already contained in the tricluster set (by using hashing), and also check extra
conditions.
    The total time complexity of the algorithm depends on whether there is a
non-zero minimal density threshold and on the complexity of the hashing
algorithm used. If we use some basic hashing algorithm processing the
tricluster's extent, intent, and modus, and there is no minimal density threshold,
the total time complexity is O(|G||M ||B| + |I|(|G| + |M | + |B|)); in the case of a
non-zero minimal density threshold, it is O(|I||G||M ||B|). The memory complexity
is O(|I|(|G| + |M | + |B|)), as we need to keep the dictionaries with the prime
sets in memory.
    In the online setting, for triples coming from a triadic context K = (G, M, B, I),
the user has no a priori knowledge of the elements and even of the cardinalities of
G, M , B, and I. At each iteration we receive some set of triples J ⊆ I. After
that, we must process J and obtain the current version of the set of all triclusters.
It is important in this setting to consider every pair of triclusters as different when
they have different generating triples, even if their respective extents, intents, and
modi are equal: any subsequent triple can change one of these two triclusters
without changing the other.
    To efficiently access the prime sets for their processing, the dictionaries con-
taining them are implemented as hash tables.
    The algorithm is straightforward as well (Alg. 1). It takes some set of triples
(J), the current tricluster set (T ), and the dictionaries containing the prime sets
(Primes) as input and outputs the modified versions of the tricluster set and the
dictionaries. The algorithm processes each triple (g, m, b) of J sequentially (line
1). At each iteration, the algorithm modifies the corresponding prime sets (lines
2–4).
    Finally, it adds a new tricluster to the tricluster set. Note that this tricluster
contains pointers to the corresponding prime sets (in the corresponding dictio-
naries) instead of copies of the prime sets (line 5), which allows us to lower the
memory and access costs.


Algorithm 1 Add function for the online algorithm for prime OAC-triclustering.
Input: J is a set of triples;
    T = {T = (∗X, ∗Y, ∗Z)} is a current set of triclusters;
    PrimesOA, PrimesOC, PrimesAC.
Output: T = {T = (∗X, ∗Y, ∗Z)};
    PrimesOA, PrimesOC, PrimesAC.
 1: for all (g, m, b) ∈ J do
 2:    PrimesOA[g, m] := PrimesOA[g, m] ∪ {b}
 3:    PrimesOC[g, b] := PrimesOC[g, b] ∪ {m}
 4:    PrimesAC[m, b] := PrimesAC[m, b] ∪ {g}
 5:    T := T ∪ {(&PrimesAC[m, b], &PrimesOC[g, b], &PrimesOA[g, m])}
 6: end for



    The algorithm is one-pass, and its time and memory complexities are O(|I|).
    Duplicate elimination and the selection of patterns by user-specific constraints
are done as post-processing to avoid pattern loss. The time complexity of the basic
post-processing is O(|I|), and it does not require any additional memory.
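    To make the bookkeeping concrete, the following minimal Java sketch (ours,
assuming string-valued entities; it is not the authors' released code) implements
the Add function of Algorithm 1 with hash tables and shared prime sets, and
replays the two slices of Table 1:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// A sketch of Algorithm 1: each triple (g, m, b) updates three prime sets
// stored in hash tables, and the new tricluster keeps references to those
// sets rather than copies of them.
public class OnlineOACPrime {
    final Map<List<String>, Set<String>> primesOA = new HashMap<>();
    final Map<List<String>, Set<String>> primesOC = new HashMap<>();
    final Map<List<String>, Set<String>> primesAC = new HashMap<>();
    // A tricluster is a triple of references (extent, intent, modus).
    final List<List<Set<String>>> triclusters = new ArrayList<>();

    void add(String g, String m, String b) {
        Set<String> oa = primesOA.computeIfAbsent(List.of(g, m), k -> new HashSet<>());
        Set<String> oc = primesOC.computeIfAbsent(List.of(g, b), k -> new HashSet<>());
        Set<String> ac = primesAC.computeIfAbsent(List.of(m, b), k -> new HashSet<>());
        oa.add(b); // (g, m)' accumulates conditions
        oc.add(m); // (g, b)' accumulates attributes
        ac.add(g); // (m, b)' accumulates objects
        triclusters.add(List.of(ac, oc, oa));
    }

    public static void main(String[] args) {
        OnlineOACPrime algo = new OnlineOACPrime();
        String[][] triples = { // the two slices of Table 1
            {"u1","i1","l1"}, {"u2","i1","l1"}, {"u2","i2","l1"}, {"u3","i2","l1"},
            {"u1","i1","l2"}, {"u2","i1","l2"}, {"u2","i2","l2"}, {"u3","i1","l2"}};
        for (String[] t : triples) algo.add(t[0], t[1], t[2]);
        // The tricluster generated by (u2, i1, l1) ends up being
        // ({u1, u2}, {i1, i2}, {l1, l2}) once all eight triples are seen.
        System.out.println(algo.triclusters.get(1));
    }
}

    Since the triclusters store references, later triples from the l2 slice trans-
parently extend the modus of the tricluster generated in the l1 slice, which is
exactly the property the one-pass algorithm relies on.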
    The algorithm can be easily parallelised by splitting the set of triples J
into several subsets, processing each of them independently, and merging the
resulting sets afterwards. This observation led to our previous MapReduce
implementation [34].


3      Multimodal clustering
The direct extension of prime object-attribute-condition triclustering is multi-
modal clustering for input relations of higher arity. For the input polyadic context
KN = (A1 , A2 , . . . , AN , I ⊆ A1 × A2 × · · · × AN ) [30], we introduce the notion
of a cumulus for each input tuple i = (e1 , e2 , . . . , eN ) ∈ I and the corresponding
entity ek , where k ∈ {1, . . . , N }, as follows:

                cum(i, k) = {e | (e1 , e2 , . . . , ek−1 , e, ek+1 , . . . , eN ) ∈ I}.
    The multimodal cluster generated by the tuple i ∈ I is defined as follows:

                                  (cum(i, 1), . . . , cum(i, N )).
    These cumulus operators are similar to the prime operators for pairs (or tuples)
of sets in Eq. 1:

                cum(i, k) = ({e1 }, {e2 }, . . . , {ek−1 }, {ek+1 }, . . . , {eN })′ .
    However, here they are applied to the tuples of the input relation rather than
to pairs (tuples) of sets.
    In a certain sense, cumuli accumulate all the elements of a fixed type that
are related by I.
    As in its triadic version, the number of multimodal clusters is not greater than
the number of tuples in the input relation, whereas the complete set of polyadic
concepts may be exponential w.r.t. the input size [30].
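    As an illustration of the definition (ours, not the paper's implementation;
tuples are assumed to be lists of strings, and k is 0-based here instead of the
1-based indexing above), a cumulus can be computed by a single scan over the
relation:

import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch: cum(i, k) collects the k-th components of all tuples that
// agree with i on every other component.
public class CumulusDemo {
    static Set<String> cumulus(Set<List<String>> relation, List<String> tuple, int k) {
        Set<String> result = new HashSet<>();
        for (List<String> other : relation) {
            boolean agrees = true;
            for (int j = 0; j < tuple.size(); j++) {
                if (j != k && !tuple.get(j).equals(other.get(j))) {
                    agrees = false;
                    break;
                }
            }
            if (agrees) result.add(other.get(k));
        }
        return result;
    }

    public static void main(String[] args) {
        Set<List<String>> I = Set.of(
            List.of("u2", "i1", "l1"), List.of("u2", "i1", "l2"),
            List.of("u2", "i2", "l1"), List.of("u1", "i1", "l1"));
        // cum((u2, i1, l1), third component) = {l1, l2}.
        System.out.println(cumulus(I, List.of("u2", "i1", "l1"), 2));
    }
}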


4      Map-reduce-based multimodal clustering
4.1     Map-reduce decomposition
We follow a three-stage approach here. At each stage, we sequentially run the
map and reduce procedures: first map → first reduce → second map → second
reduce → third map → third reduce. Each map/reduce procedure of a given stage
is executed in parallel on all the available nodes. The way tasks are distributed
among the nodes depends on the concrete implementation of the MapReduce
technology (in our case, Apache Hadoop). Below, we describe the data flow
between the computational nodes and its processing.
     1) The first map (Algorithm 2) takes a set of input tuples. Each tu-
ple (e1 , e2 , . . . , eN ) is transformed into N key-value pairs: ⟨(e2 , . . . , eN ), e1 ⟩,
⟨(e1 , e3 , . . . , eN ), e2 ⟩, . . . , ⟨(e1 , e2 , . . . , eN −1 ), eN ⟩. The resulting pairs are passed
to the next step.
     2) The first reduce (Algorithm 3) receives all the accumulated values of each
key. Thus, for each (e1 , . . . , eN ) ∈ I and the context entity type k ∈ {1, 2, . . . , N },
we compute the cumulus (e1 , . . . , ek−1 , ek+1 , . . . , eN )′ . The values are passed to the
next MapReduce stage with the key (e1 , . . . , ek−1 , ek+1 , . . . , eN ).
Algorithm 2 Distributed Multimodal clustering: First Map
Input: I is a set of tuples of length N each
Output: ⟨subrelation, entity⟩ pairs.
 1: for all (e1 , e2 , . . . , eN ) ∈ I do
 2:   for all k ∈ {1, . . . , N } do
 3:      subrelation := (e1 , . . . , ek−1 , ek+1 , . . . , eN )
 4:      emit ⟨subrelation, ek ⟩
 5:   end for
 6: end for





Algorithm 3 Distributed Multimodal clustering: First Reduce
Input: key-value pairs ⟨subrelation, entities {e_k^1 , . . . , e_k^L }⟩
Output: ⟨subrelation, cumulus⟩
 1: cumulus := {}
 2: for all e_k ∈ {e_k^1 , . . . , e_k^L } do
 3:   cumulus := cumulus ∪ {e_k }
 4: end for
 5: emit ⟨subrelation, cumulus⟩
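
    For concreteness, here is a hedged sketch of the first stage in the Hadoop
MapReduce API with plain Text keys and values (the actual implementation uses
the custom Tuple and Cumulus Writables described in Subsection 4.2). Input
tuples are assumed to arrive as comma-separated lines, and the removed position
k is carried in the key so that the second map can restore the generating tuples:

import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Stage 1 sketch: the mapper emits N pairs <subrelation, entity> per tuple
// (Algorithm 2); the reducer accumulates the entities of each key into a
// cumulus (Algorithm 3).
public class FirstStage {
    public static class FirstMap extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] tuple = line.toString().split(",");
            for (int k = 0; k < tuple.length; k++) {
                List<String> rest = new ArrayList<>();
                for (int j = 0; j < tuple.length; j++) {
                    if (j != k) rest.add(tuple[j]);
                }
                ctx.write(new Text(k + ":" + String.join(",", rest)),
                          new Text(tuple[k]));
            }
        }
    }

    public static class FirstReduce extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text subrelation, Iterable<Text> entities, Context ctx)
                throws IOException, InterruptedException {
            Set<String> cumulus = new LinkedHashSet<>();
            for (Text e : entities) cumulus.add(e.toString());
            ctx.write(subrelation, new Text(String.join(",", cumulus)));
        }
    }
}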



    3) Second map (Algorithm 4). All the received keys are transformed back into
the original tuples (one generating tuple for each entity of the corresponding
cumulus) and passed to the second reduce procedure with unchanged values.


Algorithm 4 Distributed Multimodal clustering: Second Map
Input: ⟨subrelation, cumulus⟩, where subrelation = (e1 , . . . , ek−1 , ek+1 , . . . , eN )
Output: ⟨generating relation, cumulus⟩ pairs.
 1: for all ek ∈ cumulus do
 2:   generating relation := (e1 , . . . , ek−1 , ek , ek+1 , . . . , eN )
 3:   emit ⟨generating relation, cumulus⟩
 4: end for



    4) Second reduce (Algorithm 5). All the cumuli obtained for each input tuple
of the original relation I are reduced to a single set. At this stage, we obtain
all the original tuples and the generated multimodal clusters. These clusters are
represented as tuples of cumuli for the respective entity types. All the obtained
pairs ⟨generating relation, multimodal cluster⟩ are passed to the next stage.
    5) Third map (Algorithm 6). The task of the third MapReduce stage is dupli-
cate elimination and filtering by the density threshold. It is beneficial to implement
these steps within the reduce phase, but to do so each obtained key-value pair
⟨generating relation, multimodal cluster⟩ should be passed on in the inverted form
⟨multimodal cluster, generating relation⟩.
Algorithm 5 Distributed Multimodal clustering: Second Reduce
Input: ⟨generating relation, cumuli {A1 , A2 , · · · , AN }⟩
Output: ⟨generating relation, multimodal cluster⟩ pairs
 1: multimodal cluster := (A1 , A2 , · · · , AN )
 2: emit ⟨generating relation, multimodal cluster⟩





Algorithm 6 Distributed Multimodal clustering: Third Map
Input: ⟨generating relation, multimodal cluster⟩
 1: emit ⟨multimodal cluster, generating relation⟩



    6) Third reduce (Algorithm 7). For each input multimodal cluster and its
generating tuples, it is possible to compute the density directly. All the unique
clusters are stored.


Algorithm 7 Distributed Multimodal clustering: Third Reduce
Input: ⟨multimodal cluster, generating relations {r1 , r2 , . . . , rM }⟩
 1: if |{r1 , r2 , . . . , rM }| / vol(multimodal cluster) ≥ θ then
 2:    store ⟨multimodal cluster⟩
 3: end if
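
    The density test itself is elementary; a sketch (ours, with illustrative numbers)
that treats the volume of a multimodal cluster as the product of its cumuli
cardinalities could look as follows:

// Sketch of the third-stage filter: a cluster is kept when the number of
// its generating tuples divided by its volume reaches the threshold theta.
public class DensityFilter {
    static boolean passes(int[] cumulusSizes, long generatingTuples, double theta) {
        long volume = 1;
        for (int size : cumulusSizes) volume *= size; // vol = |A1| * ... * |AN|
        return (double) generatingTuples / volume >= theta;
    }

    public static void main(String[] args) {
        // Illustrative numbers: cumuli of sizes 2, 2, 2 give volume 8;
        // with 6 generating tuples the density 0.75 passes theta = 0.5.
        System.out.println(passes(new int[]{2, 2, 2}, 6, 0.5));
    }
}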



    The time and memory complexities are provided under the assumption that the
worst-case scenario corresponds to the absolutely dense cuboid, i.e., the polyadic
context KN = (A1 , A2 , . . . , AN , A1 × A2 × · · · × AN ). Thus, after careful analysis,
the worst-case time complexity of the proposed three-stage algorithm is
O(|I| · (|A1 | + · · · + |AN |)). Not surprisingly, it has the same worst-case memory
complexity, since the maximal number of multimodal clusters stored and passed
is |I|, and the size of each of them is not greater than |A1 | + · · · + |AN |. However,
from the implementation point of view, since HDFS has a default replication
factor of 3, the data elements are copied thrice to ensure fault tolerance.

4.2    Implementation aspects and used technologies
The application has been implemented in Java (the source code is available at
https://github.com/kostarion/multimodal-clustering-hadoop), and as the
distributed computation framework we use Apache Hadoop (http://hadoop.
apache.org/).
    We have used several other technologies: Apache Maven (a framework for au-
tomatic project assembly), Apache Commons (for work with extended Java
collections), Jackson JSON (an open-source library for transforming the object-
oriented representation of an object, such as a tricluster, into a string), TypeTools
(for run-time type resolution of inbound and outbound key-value pairs), etc.
    To provide the reader with basic information on the most important classes of
the M/R implementation, let us briefly describe them below.
Entity. This is a basic abstraction for an element of a certain type. Each entity
is defined by its type index from 0 to n − 1, where n is the arity of the input
formal context. An entity value is a string that needs to be kept during the
program execution. This class implements the Writable interface for storing its
objects in temporary and permanent Hadoop files. This is a mandatory require-
ment for all classes that pass or take their objects as keys and values of map and
reduce methods.
Tuple. This class implements a representation of a relation. Each object of the
class Tuple contains a list of objects of the Entity class and the arity of its data
given as a numeric value. This class implements the interface WritableComparable
so that an object of class Tuple can be used as a key. The interface is similar
to Writable, but one also needs to define a comparison function used in the key-
sorting phase.
Cumulus. This is an abstraction of the cumulus introduced earlier. It contains a
list of string values and the index of the respective entity. It also implements the
following interfaces: WritableComparable, for using a cumulus as a key, and
Iterable, for iterating over its values.
FormalConcept. This is an abstraction of both formal concepts and multimodal
clusters; it contains a list of cumuli and implements the interface Writable.
   The process-like M/R classes are summarised below.
FirstMapper, SecondMapper, ThirdMapper. These classes extend the class
Mapper<> of the Hadoop MapReduce library with the respective mapping func-
tions from Subsection 4.1 (Algorithms 2, 4, and 6).
FirstReducer, SecondReducer, ThirdReducer. These classes extend the class
Reducer<> of the Hadoop MapReduce library, fulfilling Algorithms 3, 5, and 7.
TextCumulusInputFormat, TextCumulusOutputFormat. These classes implement
reading and writing of objects of the Cumulus class; they also extend RecordReader
and RecordWriter, respectively. They are required to exchange results between
different MapReduce phases within one MapReduce program.
JobConfigurator. This is the class for setting the configuration of a single MapRe-
duce stage. It defines the classes of the input/output/intermediate keys and values
of the mapper and reducer, as well as the formats of the input and output data.
App. This class is responsible for launching the application and chaining the M/R
stages.
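
    As a hedged sketch of what App and JobConfigurator do (class and path names
below are illustrative, not the project's actual ones; FirstStage refers to the stage-1
sketch of Subsection 4.1), the three stages can be chained with the standard
Hadoop Job API, each stage reading the HDFS output of the previous one:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative chaining of the MapReduce stages: waitForCompletion blocks
// until a stage finishes, so the next stage sees its complete output.
public class App {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);
        Path stage1Out = new Path(args[1] + "/stage1");

        Job stage1 = Job.getInstance(conf, "multimodal-clustering-stage1");
        stage1.setJarByClass(App.class);
        stage1.setMapperClass(FirstStage.FirstMap.class);
        stage1.setReducerClass(FirstStage.FirstReduce.class);
        stage1.setOutputKeyClass(Text.class);
        stage1.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(stage1, input);
        FileOutputFormat.setOutputPath(stage1, stage1Out);
        if (!stage1.waitForCompletion(true)) System.exit(1);

        // Stages 2 and 3 are configured analogously: the second stage reads
        // stage1Out, and the third stage reads the output of the second.
    }
}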
5     Experiments

Two series of experiments have been conducted in order to test the application
on synthetic contexts and real-world datasets with moderate and large numbers
of triples. In each experiment, both versions of the OAC-triclustering algorithm
have been used to extract triclusters from a given context. Only the online and
M/R versions of the OAC-triclustering algorithm have managed to output pat-
terns for large contexts, since the computation time of the compared algorithms
was too high (>3000 s). To evaluate the runtime more carefully, for each context
the average over 5 runs of the algorithms has been recorded.


5.1   Datasets

Synthetic datasets. The following synthetic datasets were generated.
The dense context K1 = (G, M, B, I), where G = M = B = {1, . . . , 60} and
I = G × M × B \ {(g, m, b) | g = m = b}. In total, 60³ − 60 = 215,940 triples.
The context of three non-overlapping cuboids K2 = (G1 ⊔ G2 ⊔ G3 , M1 ⊔ M2 ⊔
M3 , B1 ⊔ B2 ⊔ B3 , I), where I = (G1 × M1 × B1 ) ∪ (G2 × M2 × B2 ) ∪ (G3 × M3 × B3 )
and |Gk | = |Mk | = |Bk | = 50. In total, 3 · 50³ = 375,000 triples.
The context K3 = (A1 , A2 , A3 , A4 , A1 × A2 × A3 × A4 ) is a dense four-dimensional
cuboid with |A1 | = |A2 | = |A3 | = |A4 | = 30, containing 30⁴ = 810,000 tuples.
    These tests make sense since, in the M/R setting, tuples can be (partially)
repeated, e.g., because of M/R task failures on some nodes (i.e., restarting the
processing of some key-value pairs). Even though the third dataset does not result
in 3^min(|A1 |,|A2 |,|A3 |,|A4 |) formal polyadic concepts, the worst case for concept
generation in terms of the number of patterns, it is an example of the worst-case
scenario for the reducers, since the input has its maximal size w.r.t. the sizes of
the Ai and the number of duplicates. In fact, our algorithm should correctly
assemble the single cluster (A1 , A2 , A3 , A4 ), and it actually does.
IMDB. This dataset consists of the Top-250 list of the Internet Movie Database
(the 250 best movies based on user reviews).
    The following triadic context is composed: the set of objects consists of movie
names, the set of attributes consists of tags, the set of conditions consists of genres,
and each triple of the ternary relation means that the given movie has the given
genre and is assigned the given tag. In total, there are 3,818 triples.
MovieLens. The dataset contains 1,000,000 tuples relating 6,040 users, 3,952
movies, ratings, and timestamps, where ratings are made on a 5-star scale [8].
BibSonomy. Finally, a sample of the data of bibsonomy.org from the ECML PKDD
discovery challenge 2008 has been used.
    This website allows users to share bookmarks and lists of literature and to tag
them. For the tests, the following triadic context has been prepared: the set of
objects consists of users, the set of attributes consists of tags, the set of conditions
consists of bookmarks, and a triple of the ternary relation means that the given
user has assigned the given tag to the given bookmark.
    Table 2 contains the summary of the IMDB and BibSonomy contexts.
         Table 2. Tricontexts based on real data systems for the experiments
                   Context    |G| |M |      |B| # triples Density
                    IMDB      250    795     22    3,818 0.00087
                  BibSonomy 2,337 67,464 28,920 816,197 1.8 · 10−7




5.2    Results

The experiments have been conducted on a computer with an Intel Core i5-
2450M CPU @ 2.50 GHz and 4 GB RAM (typical commodity hardware) in
emulation mode, where the Hadoop cluster contains only one node and operates
locally and sequentially. From the execution time results, one can estimate the
performance in a real distributed environment, assuming that each node's
workload is (roughly) the same.


  Table 3. Execution time of online and MapReduce multimodal clustering, ms
  Results, ms                     IMDB MovieLens100k K1      K2      K3
  Online OAC prime clustering      368    16,298    96,990 185,072 643,978
  MapReduce multimodal clustering 7,124   14,582    37,572 61,367 102,699




            Table 4. Three-stage MapReduce multimodal clustering, ms
      Dataset         Online OAC  M/R total     MapReduce stages      # clusters
                        Prime                 1st      2nd       3rd
      MovieLens100k      89,931      16,348  8,724    5,292     2,332     89,932
      MovieLens250k     225,242      42,708 10,075   20,338    12,295    225,251
      MovieLens500k     461,198      94,701 15,016   46,300    33,384    461,238
      MovieLens1M       958,345     217,694 28,027  114,221    74,446    942,757
      BibSonomy       > 6 hours   3,651,072 19,117 1,972,135 1,659,820   486,221
      (≈800k triples)             (≈1 hour)




    In Tables 3 and 4, we summarise the results of the performed tests. It is clear
that, on average, our application has a lower execution time than its competitor,
the online version of OAC-triclustering. If we compare the implemented program
with its original online version, the results are worse for a not-that-big and sparse
dataset such as IMDB. This is a consequence of the fact that the application
architecture is aimed at processing large amounts of data; in particular, it is
implemented in three stages with time-consuming communication. Launching and
stopping Apache Hadoop as well as writing and passing data between the Map
and Reduce steps of each stage require substantial time; that is why, for datasets
that are not that big, when the execution time is comparable with the time for
infrastructure management, the time performance is not perfect. However, as the
data size increases, the relative performance grows up to five-six times (see Fig. 2).
Thus, the last test, on the BibSonomy data, has been passed successfully, whereas
the competitor was not able to finish it within one hour. As for the M/R stages,
the most time-consuming phases are the 2nd and 3rd stages.
[Plot: execution time (Time, s) by dataset for Online OAC prime clustering and
MapReduce multimodal clustering.]

Fig. 2. Performance curves for six datasets: I stands for the IMDB dataset with 3,818
triples, M100K for the MovieLens dataset with 100K tuples, and M for the MovieLens
dataset with 1M tuples.




6         Conclusion

In this paper, we have presented a map-reduce version of the multimodal clustering
algorithm, which extends triclustering approaches and copes with the bottlenecks
of the earlier M/R triclustering version [34]. We have shown that the proposed
algorithm is efficient from both theoretical and practical points of view. This is
a variant of a map-reduce-based algorithm where the reducer exploits composite
keys directly (see also the Appendix section of [34]). However, despite this step
towards Big Data technologies, a proper comparison of the proposed multimodal
clustering with the noise-tolerant patterns in n-ary relations mined by DataPeeler
and its descendants [1] has not yet been conducted (including in the MapReduce
setting).


Acknowledgements. The study was implemented in the framework of the
Basic Research Program at the National Research University Higher School of
Economics, in the Laboratory of Intelligent Systems and Structural Analysis
(Sections 2 and 5), and funded by the Russian Academic Excellence Project
'5-100'. The first author was also supported by the Russian Science Foundation
(Sections 1, 3, and 4) under grant 17-11-01294. The authors would like to thank
the anonymous reviewers as well as Yuri Kudriavtsev from PM-Square and Dominik
Slezak from Infobright and Warsaw University for their encouragement given to
our studies of M/R technologies.


References

 1. Cerf, L., Besson, J., Nguyen, K.N., Boulicaut, J.F.: Closed and noise-tolerant pat-
    terns in n-ary relations. Data Min. Knowl. Discov. 26(3), 574–619 (2013)
 2. Eren, K., Deveci, M., Kucuktunc, O., Catalyurek, U.V.: A comparative analysis
    of biclustering algorithms for gene expression data. Briefings in Bioinform. (2012)
 3. Ganter, B., Wille, R.: Formal Concept Analysis: Mathematical Foundations.
    Springer-Verlag New York, Inc., Secaucus, NJ, USA, 1st edn. (1999)
 4. Georgii, E., Tsuda, K., Schölkopf, B.: Multi-way set enumeration in weight tensors.
    Machine Learning 82(2), 123–155 (2011)
 5. Gnatyshak, D.V., Ignatov, D.I., Kuznetsov, S.O.: From triadic FCA to triclus-
    tering: Experimental comparison of some triclustering algorithms. In: CLA. pp.
    249–260 (2013)
 6. Gnatyshak, D.V., Ignatov, D.I., Kuznetsov, S.O., Nourine, L.: A one-pass triclus-
    tering approach: Is there any room for big data? In: CLA 2014 (2014)
 7. Gnatyshak, D.V., Ignatov, D.I., Semenov, A.V., Poelmans, J.: Gaining insight
    in social networks with biclustering and triclustering. In: BIR. Lecture Notes in
    Business Information Processing, vol. 128, pp. 162–171. Springer (2012)
 8. Harper, F.M., Konstan, J.A.: The MovieLens datasets: History and context. TiiS
    5(4), 19:1–19:19 (2016). https://doi.org/10.1145/2827872
 9. Henriques, R., Madeira, S.C.: Triclustering algorithms for three-dimensional data
    analysis: A comprehensive survey. ACM Comput. Surv. 51(5), 95:1–95:43 (Sep
    2018). https://doi.org/10.1145/3195833
10. Ignatov, D.I., Gnatyshak, D.V., Kuznetsov, S.O., Mirkin, B.: Triadic formal con-
    cept analysis and triclustering: searching for optimal patterns. Machine Learning
    pp. 1–32 (2015)
11. Ignatov, D.I., Kuznetsov, S.O., Poelmans, J.: Concept-based biclustering for in-
    ternet advertisement. In: ICDM Workshops. pp. 123–130. IEEE Computer Society
    (2012)
12. Ignatov, D.I., Kuznetsov, S.O., Poelmans, J., Zhukov, L.E.: Can triconcepts be-
    come triclusters? International Journal of General Systems 42(6), 572–593 (2013)
13. Ignatov, D.I., Nenova, E., Konstantinova, N., Konstantinov, A.V.: Boolean Matrix
    Factorisation for Collaborative Filtering: An FCA-Based Approach. In: AIMSA
    2014, Varna, Bulgaria, Proceedings. vol. LNCS 8722, pp. 47–58 (2014)
14. Jelassi, M.N., Yahia, S.B., Nguifo, E.M.: A personalized recommender system based
    on users’ information in folksonomies. In: Carr, L., et al. (eds.) WWW (Companion
    Volume). pp. 1215–1224. ACM (2013)
15. Kaytoue, M., Kuznetsov, S.O., Macko, J., Napoli, A.: Biclustering meets triadic
    concept analysis. Ann. Math. Artif. Intell. 70(1-2), 55–79 (2014)
16. Kaytoue, M., Kuznetsov, S.O., Napoli, A., Duplessis, S.: Mining gene expres-
    sion data with pattern structures in formal concept analysis. Inf. Sci. 181(10),
    1989–2001 (2011). https://doi.org/10.1016/j.ins.2010.07.007
17. Krajca, P., Vychodil, V.: Distributed algorithm for computing formal concepts
    using map-reduce framework. In: N. Adams et al. (Eds.): IDA 2009. vol. LNCS
    5772, pp. 333–344 (2009)
18. Kuznecov, S., Kudryavcev, Y.: Applying map-reduce paradigm for paral-
    lel closed cube computation. In: 1st Int. Conf. on Advances in Databases,
    Knowledge, and Data Applications, DBKDA 2009. pp. 62–67 (2009).
    https://doi.org/10.1109/DBKDA.2009.32
19. Li, A., Tuck, D.: An effective tri-clustering algorithm combining expression data
    with gene regulation information. Gene regul. and syst. biol. 3, 49–64 (2009),
    http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2758278/
20. Madeira, S.C., Oliveira, A.L.: Biclustering algorithms for biological data analysis:
    A survey. IEEE/ACM Trans. Comput. Biology Bioinform. 1(1), 24–45 (2004)
21. Metzler, S., Miettinen, P.: Clustering Boolean tensors. Data Min. Knowl. Dis-
    cov. 29(5), 1343–1373 (2015). https://doi.org/10.1007/s10618-015-0420-3
22. Mirkin, B.: Mathematical Classification and Clustering. Kluwer, Dordrecht (1996)
23. Mirkin, B.G., Kramarenko, A.V.: Approximate bicluster and tricluster boxes in the
    analysis of binary data. In: Kuznetsov, S.O., et al. (eds.) RSFDGrC 2011. Lecture
    Notes in Computer Science, vol. 6743, pp. 248–256. Springer (2011)
24. Nanopoulos, A., Rafailidis, D., Symeonidis, P., Manolopoulos, Y.: Musicbox: Per-
    sonalized music recommendation based on cubic analysis of social tags. IEEE
    Transactions on Audio, Speech & Language Processing 18(2), 407–412 (2010)
25. Rajaraman, A., Leskovec, J., Ullman, J.D.: Mining of Massive Datasets, chap.
    MapReduce and the New Software Stack, pp. 19–70. Cambridge University Press,
    England, Cambridge (2013)
26. Schweikardt, N.: One-pass algorithm. In: Encyclopedia of Database Systems, Sec-
    ond Edition (2018). https://doi.org/10.1007/978-1-4614-8265-9_253
27. Shin, K., Hooi, B., Faloutsos, C.: Fast, accurate, and flexible algorithms
    for dense subtensor mining. TKDD 12(3), 28:1–28:30 (2018).
    https://doi.org/10.1145/3154414
28. Spyropoulou, E., De Bie, T., Boley, M.: Interesting pattern mining in multi-
    relational data. Data Mining and Knowledge Discovery 28(3), 808–849 (2014)
29. Ustalov, D., Panchenko, A., Kutuzov, A., Biemann, C., Ponzetto, S.P.: Unsu-
    pervised semantic frame induction using triclustering. In: Gurevych, I., Miyao,
    Y. (eds.) Proceedings of the 56th Annual Meeting of the Association for Com-
    putational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Vol-
    ume 2: Short Papers. pp. 55–62. Association for Computational Linguistics (2018),
    https://aclanthology.info/papers/P18-2010/p18-2010
30. Voutsadakis, G.: Polyadic concept analysis. Order 19(3), 295–304 (2002)
31. Wille, R.: Restructuring lattice theory: An approach based on hierarchies of con-
    cepts. In: Rival, I. (ed.) Ordered Sets, NATO Advanced Study Institutes Series,
    vol. 83, pp. 445–470. Springer Netherlands (1982)
32. Xu, B., de Frein, R., Robson, E., Foghlu, M.O.: Distributed formal concept analysis
    algorithms based on an iterative mapreduce framework. In: Domenach, F., Ignatov,
    D., Poelmans, J. (eds.) ICFCA 2012. vol. LNAI 7278, pp. 292–308 (2012)
33. Zhao, L., Zaki, M.J.: Tricluster: An effective algorithm for mining coherent clusters
    in 3d microarray data. In: SIGMOD 2005 Conference. pp. 694–705 (2005)
34. Zudin, S., Gnatyshak, D.V., Ignatov, D.I.: Putting oac-triclustering on mapreduce.
    In: Yahia, S.B., Konecny, J. (eds.) Proceedings of the Twelfth International Con-
    ference on Concept Lattices and Their Applications, Clermont-Ferrand, France,
    October 13-16, 2015. CEUR Workshop Proceedings, vol. 1466, pp. 47–58. CEUR-
    WS.org (2015), http://ceur-ws.org/Vol-1466/paper04.pdf