<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Putting OAC-triclustering on MapReduce</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sergey Zudin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dmitry V. Gnatyshak</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dmitry I. Ignatov</string-name>
          <email>dignatov@hse.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National Research University Higher School of Economics</institution>
          ,
          <addr-line>Russian Federation</addr-line>
        </aff>
      </contrib-group>
      <fpage>47</fpage>
      <lpage>58</lpage>
      <abstract>
        <p>In our previous work, an efficient one-pass online algorithm for triclustering of binary data (triadic formal contexts) was proposed. This algorithm is a modified version of the basic algorithm of the OAC-triclustering approach; it has linear time and memory complexity. In this paper, we parallelise it via the MapReduce framework in order to make it suitable for big datasets. The results of computer experiments show the efficiency of the proposed algorithm; for example, it outperforms its online counterpart on a BibSonomy dataset with ≈ 800,000 triples.</p>
      </abstract>
      <kwd-group>
        <kwd>Formal Concept Analysis</kwd>
        <kwd>triclustering</kwd>
        <kwd>triadic data</kwd>
        <kwd>data mining</kwd>
        <kwd>big data</kwd>
        <kwd>MapReduce</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Mining of multimodal patterns is one of the hot topics in Data Mining and
Machine Learning [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4">1,2,3,4</xref>
        ]. Thus, cluster analysis of multimodal data, and
specifically of dyadic and triadic relations, is a natural extension of the original
idea of clustering. In the dyadic case, biclustering methods (the term bicluster was coined
in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]) are used to simultaneously find subsets of the sets of objects and
attributes that form homogeneous patterns in the input object-attribute data. In
fact, one of the most popular applications of biclustering is gene expression
analysis in Bioinformatics [
        <xref ref-type="bibr" rid="ref6 ref7">6,7</xref>
        ]. Triclustering methods operate in the triadic case, where
each object-attribute pair is assigned a set of conditions [
        <xref ref-type="bibr" rid="ref10 ref8 ref9">8,9,10</xref>
        ]. Both
biclustering and triclustering algorithms are widely used in such areas as gene
expression analysis [
        <xref ref-type="bibr" rid="ref11 ref12 ref13">11,12,13</xref>
        ], recommender systems [
        <xref ref-type="bibr" rid="ref14 ref15 ref16">14,15,16</xref>
        ], social network
analysis [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], etc. The processing of numeric multimodal data is also possible through
modifications of existing approaches for mining dyadic binary relations [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ].
      </p>
      <p>
        Though there are methods that can enumerate all triclusters satisfying
certain constraints [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] (in most cases they ensure that the triclusters are dense), their
time complexity is rather high, since in the worst case the maximal number of
triclusters is usually exponential (e.g., in the case of formal triconcepts), showing
that these methods are hardly scalable. To process big data, algorithms need to
have at most linear time complexity (e.g., O(|I|) in the case of an n-ary relation I)
and be easily parallelisable. In addition, in most cases, it is necessary that such
algorithms output their results in one pass.
      </p>
      <p>
        Earlier, in order to create an algorithm satisfying these requirements, we
adapted a triclustering method based on prime operators (prime OAC-triclustering
method) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and proposed its online version, which is linear, one-pass and
easily parallelisable [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. However, this parallelisation can be done in different ways.
For example, one can use a popular framework for commodity hardware,
MapReduce (M/R) [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. Notably, there have been several successful M/R
implementations in the FCA community and other lattice-oriented domains. Thus, in [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ],
the authors adapted the Close-by-One algorithm to the M/R framework and showed
its efficiency. In the same year, in [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], an efficient M/R algorithm for
computation of closed cube lattices was proposed. The authors of [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] demonstrated
that iterative algorithms like Ganter’s NextClosure can benefit from the use
of iterative M/R schemes.
      </p>
      <p>
        Note that experts warn potential users that M/R is like a big cannon that
requires long preparation to shoot but fires fast: “the entire distributed-file-system
milieu makes sense only when files are very large and are rarely updated in place” [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. In this work, in contrast to our previous study, we assume that there is a
large bulk of data to process that is not coming online.
      </p>
      <p>The rest of the paper is organized as follows: in Section 2, we recall the
original method and the online version of the prime OAC-triclustering algorithm.
In Section 3, we describe the M/R setting of the problem and the corresponding
M/R version of the original algorithm, with important implementation aspects.
Finally, in Section 4, we show the results of several experiments which
demonstrate the efficiency of the M/R version of the algorithm. As an addendum, in
the Appendix, the reader may find our proposal for alternative models
of M/R-based variants of prime OAC-triclustering.</p>
    </sec>
    <sec id="sec-2">
      <title>Prime object-attribute-condition triclustering</title>
      <p>
        Prime object-attribute-condition triclustering method (OAC-prime) based on
Formal Concept Analysis [
        <xref ref-type="bibr" rid="ref24 ref25">24,25</xref>
        ] is an extension to the triadic case of
the object-attribute biclustering method [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]. Triclusters generated by this method have
a structure similar to that of the corresponding biclusters, namely a cross-like
structure of triples inside the input data cuboid (i.e., a formal tricontext).
      </p>
      <p>Let K = (G, M, B, I) be a triadic context, where G, M, B are respectively
the sets of objects, attributes, and conditions, and I ⊆ G × M × B is a
triadic incidence relation. Each prime OAC-tricluster is generated by applying the
following prime operators to each pair of components of some triple:
(X, Y)′ = {b ∈ B | (g, m, b) ∈ I for all g ∈ X, m ∈ Y},
(X, Z)′ = {m ∈ M | (g, m, b) ∈ I for all g ∈ X, b ∈ Z},
(Y, Z)′ = {g ∈ G | (g, m, b) ∈ I for all m ∈ Y, b ∈ Z},
(1)
where X ⊆ G, Y ⊆ M, and Z ⊆ B.</p>
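      <p>For illustration, the first prime operator in (1) can be computed directly from a set of triples. The following Java sketch is ours, not part of the paper's implementation; it represents triples as lists [g, m, b] of strings and takes the condition set B to be the conditions occurring in I.</p>

```java
import java.util.*;

// Illustrative sketch of the prime operator (X, Y)' from Eq. (1):
// the set of conditions b shared by every pair (g, m) with g in X, m in Y.
class PrimeOps {
    public static Set<String> primeOA(Set<List<String>> I,
                                      Set<String> X, Set<String> Y) {
        Set<String> result = new HashSet<>();
        for (List<String> t : I) result.add(t.get(2));  // start from all conditions in I
        for (String g : X)
            for (String m : Y) {
                Set<String> bs = new HashSet<>();
                for (List<String> t : I)
                    if (t.get(0).equals(g) && t.get(1).equals(m)) bs.add(t.get(2));
                result.retainAll(bs);  // keep only conditions shared by this (g, m)
            }
        return result;
    }
}
```

The other two prime operators of (1) are symmetric and can be obtained by permuting the roles of the components.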
      <p>Then the triple T = ((m, b)′, (g, b)′, (g, m)′) is called a prime OAC-tricluster
based on the triple (g, m, b) ∈ I. The components of the tricluster are called, respectively,
the tricluster extent, intent, and modus. The triple (g, m, b) is
called the generating triple of the tricluster T. Figure 1 shows the structure of an
OAC-tricluster (X, Y, Z) based on a triple (g̃, m̃, b̃); triples corresponding to the
gray cells are contained in the context, and other triples may be contained in the
tricluster (cuboid) as well.</p>
      <p>
        The basic algorithm of the prime OAC-triclustering method is rather
straightforward (see [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]). First of all, for each combination of elements from every two of the three
sets of K we apply the corresponding prime operator (we call the resulting sets
prime sets). After that, we enumerate all triples from I and, at each step,
generate a tricluster based on the corresponding triple, check whether this
tricluster is already contained in the tricluster set (by means of hashing), and also
check extra conditions.
      </p>
      <p>The total time complexity of the algorithm depends on whether there is a
non-zero minimal density threshold and on the complexity of the hashing
algorithm used. If we use a basic hashing algorithm processing the
tricluster’s extent, intent, and modus without a minimal density threshold, the
total time complexity is O(|G||M||B| + |I|(|G| + |M| + |B|)); in the case of a
non-zero minimal density threshold, it is O(|I||G||M||B|). The memory complexity
is O(|I|(|G| + |M| + |B|)), as we need to keep the dictionaries with the prime
sets in memory.</p>
      <p>In the online setting, for triples coming from a triadic context K = (G, M, B, I),
the user has no a priori knowledge of the elements, or even of the cardinalities, of G,
M, B, and I. At each iteration we receive some set of triples from I: J ⊆ I. After
that, we must process J and get the current version of the set of all triclusters. In
this setting it is important to treat every pair of triclusters as different whenever
they have different generating triples, even if their respective extents, intents,
and modi are currently equal: any subsequent triple can change one of these two
triclusters but not the other, making them different.</p>
      <p>To efficiently access prime sets for their processing, the dictionaries
containing the prime sets are implemented as hash-tables.</p>
      <p>The algorithm is straightforward as well (Alg. 1). It takes some set of triples
(J ), the current tricluster set (T ), and the dictionaries containing prime sets
(P rimes) as input and outputs the modified versions of the tricluster set and
dictionaries. The algorithm processes each triple (g, m, b) of J sequentially (line
1). At each iteration the algorithm modifies the corresponding prime sets (lines
2-4).</p>
      <p>Finally, it adds a new tricluster to the tricluster set. Note that this tricluster
contains pointers to the corresponding prime sets (in the corresponding
dictionaries) instead of copies of the prime sets (line 5), which lowers the
memory and access costs.</p>
      <p>Algorithm 1 Add function for the online algorithm for prime OAC-triclustering.
Input: J is a set of triples;</p>
      <p>T = {T = (∗X, ∗Y, ∗Z)} is the current set of triclusters;</p>
      <p>PrimesOA, PrimesOC, PrimesAC.</p>
      <p>Output: T = {T = (∗X, ∗Y, ∗Z)};</p>
      <p>PrimesOA, PrimesOC, PrimesAC.
1: for all (g, m, b) ∈ J do
2: PrimesOA[g, m] := PrimesOA[g, m] ∪ {b}
3: PrimesOC[g, b] := PrimesOC[g, b] ∪ {m}
4: PrimesAC[m, b] := PrimesAC[m, b] ∪ {g}
5: T := T ∪ {(&amp;PrimesAC[m, b], &amp;PrimesOC[g, b], &amp;PrimesOA[g, m])}
6: end for</p>
      <p>The algorithm is one-pass and its time and memory complexities are O(|I|).</p>
      <p>Duplicate elimination and the selection of patterns by user-specific constraints are
done as post-processing to avoid pattern loss. The time complexity of the basic
post-processing is O(|I|), and it does not require any additional memory.</p>
      <p>Finally, it seems that the algorithm can be easily parallelised by splitting the
set of triples J into several subsets, processing each of them independently,
and merging the resulting sets afterwards.</p>
    </sec>
    <sec id="sec-3">
      <title>Map-reduce OAC-triclustering</title>
      <sec id="sec-3-1">
        <title>Map-reduce decomposition</title>
        <p>We use a two-stage M/R approach. The first M/R stage allows us to efficiently
calculate all the primes of the existing pairs. The second M/R stage assembles
the found primes into triclusters. During the first map phase, each triple from
the input context is indexed by a key computed as a hash of one of
the basic entity types: object, attribute, or condition (see Alg. 2). The number
of map keys is equal to the number of reducers.</p>
        <p>Then each first-stage reducer receives the portion of data for a particular key (see
Alg. 3). The internal reducer algorithm is almost a replication of the online
OAC-prime algorithm. However, it does not assemble all found triclusters into a final collection;
the reducer simply writes the current triclusters for a given portion of
data to a file or passes them to the second-stage mapper. Since in Hadoop MapReduce</p>
        <sec id="sec-3-1-1">
          <title>Algorithm 2 Distributed OAC-triclustering: First Map</title>
          <p>Input: S is a set of input triples as strings;
r is a number of reducers;
i is a grouping index (objects, attributes or conditions).</p>
          <p>Output: J˜ is a list of ⟨key, triple⟩ pairs.
1: for all s ∈ S do
2: t := transform(s)
3: key := hash(t[i]) mod r
4: J˜ := J˜ ∪ {⟨key, t⟩}
5: end for
we should work with text input files and our data are mainly in a tuple-based
form, we use the functions encode()/transform() to switch between
the internal tuple representation and the text-based one.</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>Algorithm 3 Distributed OAC-triclustering: First Reduce</title>
          <p>Input: J is a list of triples (for a certain key);</p>
          <p>T = {T = (X, Y, Z)} is a current set of triclusters;</p>
          <p>P rimesOA, P rimesOC, P rimesAC.</p>
          <p>Output: file of strings – encoded ⟨triple, tricluster⟩ pairs.
1: Primes ← initialise a new multimap
2: for all (g, m, b) ∈ J do
3: Primes[g, m] := Primes[g, m] ∪ {b}
4: Primes[g, b] := Primes[g, b] ∪ {m}
5: Primes[m, b] := Primes[m, b] ∪ {g}
6: end for
7: for all (g, m, b) ∈ J do
8: T := (set(Primes[m, b]), set(Primes[g, b]), set(Primes[g, m]))
9: s := encode(⟨(g, m, b), T⟩)
10: store s
11: end for</p>
          <p>The second mapper takes the found intermediate triclusters (with their keys)
as strings from the files produced by the first-stage reducers (see Alg. 4). It
fills the Primes multimap in one pass through all ⟨triple, tricluster⟩ pairs. In
the next loop, for each key (g, m, b) the corresponding tricluster is formed, and
⟨tricluster, tricluster⟩ pairs are passed to the second-stage reducer (the key
tricluster can be efficiently implemented by proper hashing). In turn, the
second-stage reducer eliminates duplicates and outputs the resulting file (Alg. 5).
The set() function helps to avoid duplicates among the values of Primes[·, ·],
which is closer to our implementation. However, one can easily omit set() in line
8, provided that Primes is properly implemented.</p>
          <p>The time complexity of the M/R solution is composed of two terms, one for
each stage: O(|I|/r) and O(|I|). However, there are communication costs that</p>
        </sec>
        <sec id="sec-3-1-3">
          <title>Algorithm 4 Distributed OAC-triclustering: Second Map</title>
          <p>Input: S is a list of strings.</p>
          <p>Output: T˜ is a list of ⟨tricluster, tricluster⟩ pairs.
1: Primes ← initialise a new multimap
2: for all s ∈ S do
3: ⟨(g, m, b), T⟩ := decode(s)
4: update the Primes multimap appropriately
5: I := I ∪ {(g, m, b)}
6: end for
7: for all (g, m, b) ∈ I do
8: T := (set(Primes[m, b]), set(Primes[g, b]), set(Primes[g, m]))
9: T˜ := T˜ ∪ {⟨T, T⟩}
10: end for</p>
        </sec>
        <sec id="sec-3-1-4">
          <title>Algorithm 5 Distributed OAC-triclustering: Second Reduce</title>
          <p>Input: Tˆ is a list of ⟨tricluster, list of triclusters⟩ pairs.</p>
          <p>
            Output: File with a final set of triclusters {T = (X, Y, Z)}.
1: for all ⟨T, [T, . . . , T]⟩ ∈ Tˆ do
2: store T
3: end for
should inevitably be paid, and they can be theoretically estimated as follows [
            <xref ref-type="bibr" rid="ref20">20</xref>
            ]: the
replication rate of the first M/R stage is r1 = 1 (each triple is passed as one
key-value pair), and the reducer size is q1 = |I|/r; the replication rate of the second M/R
stage is r2 = 1 (it assigns one key-value pair to each tricluster), but the reducer
size varies from q2min = 1 (no duplicate triclusters) to q2max = |I| (one final
tricluster when all the initial triples belong to one absolutely dense cuboid).
          </p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>Implementation aspects and used technologies</title>
        <p>The application 1 has been implemented in Java within JRE 8, and as the distributed
computation framework we use Apache Hadoop 2.</p>
        <p>We have used a number of other technologies: Apache Maven (a framework for
automatic project assembly), Apache Commons (for work with extended Java
collections), Google Guava (utilities and data structures), Jackson JSON (an
open-source library for transforming the object-oriented representation of an object
like a tricluster into a string), TypeTools (for run-time type resolution of inbound and
outbound key-value pairs), etc.</p>
        <p>ChainingJob module. During development we found that in Hadoop one
MapReduce process can contain only one Mapper and one Reducer. Thus, in
order to develop an application with three “map” phases and one “reduce” phase,
one needs to create three processes. One process creation (even without various
adjustments) takes 8-10 lines of code. After searching in vain for an appropriate
1 https://github.com/zydins/DistributedTriclustering
2 http://hadoop.apache.org/
library, we developed the “chaining-job” module 3. Its main class contains the
following fields: “jobs” (the list of all scheduled processes), “name” (a common name for
all processes), and “tempDir” (the folder name for intermediate results). First, the
algorithm sets the input path for the first chained process and the path for the result of
the last job; the remaining jobs are connected by their input and output “key-value” pairs
and a directory for intermediate file storage. Then the algorithm runs the processes
according to the schedule and waits for their completion. In other words, it connects
the inputs and outputs of chained processes that run sequentially.</p>
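        <p>The chaining idea can be sketched as follows, with Hadoop specifics abstracted away behind a toy Job interface (the real module configures Hadoop jobs; everything below, including all names, is illustrative):</p>

```java
import java.util.*;

// Illustrative sketch of job chaining: each job reads an input path and
// writes an output path; consecutive jobs are wired through tempDir.
class ChainingJob {
    interface Job { void run(String inputPath, String outputPath); }

    private final List<Job> jobs = new ArrayList<>();
    private final String name, tempDir;

    ChainingJob(String name, String tempDir) {
        this.name = name;
        this.tempDir = tempDir;
    }

    ChainingJob add(Job j) { jobs.add(j); return this; }

    // Connect inputs and outputs of consecutive jobs and run them sequentially.
    void run(String input, String output) {
        String in = input;
        for (int k = 0; k < jobs.size(); k++) {
            String out = (k == jobs.size() - 1)
                       ? output                               // final result path
                       : tempDir + "/" + name + "-step" + k;  // intermediate results
            jobs.get(k).run(in, out);
            in = out;  // the next job consumes the previous job's output
        }
    }
}
```
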
        <p>Let us briefly describe the most important classes of our M/R implementation.
Entity. It is a basic class for the object-oriented representation of input strings and
maintains three entity types: EXTENT, INTENT, and MODUS. For example:
{“Leon”, EXTENT}.</p>
        <p>Tuple. An object of this class stores references to objects of class Entity and
represents two basic entities: triple and tricluster. Mapper and Reducer classes
operate with objects of this type.</p>
        <p>FormalContext. This class is an object-oriented representation of the underlying
ternary relation; it keeps a reference to an object of the EntityStorage class (see
below). It also contains the methods “add” (add a triple) and “getTriclusters” (get
the output set of unique triclusters).</p>
        <p>EntityStorage. This class manages the work with the extents, intents, and modi of
triclusters. It contains three dictionaries with composite keys. For example,
for (g1, m1, c1), the element c1 will be added under the key (g1, m1) to the first dictionary;
analogously for the keys (g1, c1) and (m1, c1).</p>
        <sec id="sec-3-2-1">
          <title>The process-like M/R classes are summarised below.</title>
          <p>TupleReadMapper. Its main goal is to read a triple from the input file and
transform it into an object of class Tuple.</p>
          <p>TupleContextReducer. It receives the input tuples and fills the underlying tricontext
with them. It also sets the number of first-stage reducers. This number depends on the
available nodes in the distributed system and on the structure of the input data: the
more unique entities there are in the triples, the larger this value should be.
PrepareMapper. Its “map” method receives files from the previous stage. They
contain intermediate triclusters from each object of class TupleContextReducer.
It fills the dictionary with primes. Then each tricluster triple is transformed
into the Tuple structure and is passed to the second reduce phase.</p>
          <p>CollectReduce. This class gathers all intermediate triclusters and obtains the final
tricluster set. This process runs in several threads for speed-up; the number of
threads is a user-specified parameter.</p>
          <p>Executor. It is the starting class of the application, which receives the input
parameters, activates the “chaining-job” utility to build a chain of jobs, and starts
the execution.
3 https://github.com/zydins/chaining-job</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
      <p>Two series of experiments have been conducted in order to test the application
on synthetic contexts and real-world datasets with a moderate and a large
number of triples, respectively. In each experiment, both versions of the OAC-triclustering
algorithm have been used to extract triclusters from a given context. Only the online
and M/R versions of the OAC-triclustering algorithm have managed to produce
patterns for the large contexts, since the computation time of the compared algorithms
was too high (&gt;3000 s). To evaluate the runtime more carefully, for each context
the average over 5 runs of the algorithms has been recorded.</p>
      <sec id="sec-4-1">
        <title>Datasets</title>
        <p>Synthetic datasets. As mentioned above, the synthetic contexts were randomly
generated: 1) 20,000 triples (25 unique entities of each type); 2) 100,000 triples (50
unique entities of each type); 3) 1,000,000 triples (all possible combinations of
100 unique entities of each type). However, it is easy to see that some of these datasets
are not correct formal contexts from the algebraic viewpoint. Thus, the first dataset
inevitably contains duplicates, since 25 × 25 × 25 gives only 15,625 unique triples.
The second one contains fewer triples than 50³ = 125,000, the number of all
possible combinations. The third one is just an absolutely dense cuboid 100 × 100 × 100
(it contains only one formal triconcept (OAC-tricluster), the whole context).</p>
        <p>These tests look more like a crash test, but they make sense, since in the M/R
setting triples can be (partially) repeated, e.g., because of M/R task failures
on some nodes (i.e., restarted processing of some key-value pairs). Even though
the third dataset does not result in 3^min(|G|,|M|,|B|) formal triconcepts, the worst
case for formal triconcept generation in terms of the number of patterns, it
is an example of the worst-case scenario for the second reducer, since its size is
maximal (q2max = |I|). Notably, our algorithm should correctly assemble the
single tricluster (G, M, B), and it actually does.</p>
        <p>IMDB. This dataset consists of the Top-250 list of the Internet Movie Database (the 250
best movies based on user reviews). The following triadic context is composed:
the set of objects consists of movie names, the set of attributes of tags, and the set of
conditions of genres; each triple of the ternary relation means that the given
movie has the given genre and is assigned the given tag.</p>
        <p>Bibsonomy. Finally, a sample of the data of bibsonomy.org from the ECML PKDD
discovery challenge 2008 has been used. This website allows users to share
bookmarks and lists of literature and to tag them. For the tests, the following triadic
context has been prepared: the set of objects consists of users, the set of attributes
of tags, and the set of conditions of bookmarks; a triple of the ternary
relation means that the given user has assigned the given tag to the given
bookmark.</p>
        <p>Table 1 contains a summary of the contexts.
The experiments have been conducted on a computer running OS X 10,
with a 1.8 GHz Intel Core i5, 4 GB of 1600 MHz DDR3 memory, and 8 GB of free
space on its hard drive (typical commodity hardware). Two M/R modes have
been tested: a sequential mode of task completion and an emulation of a distributed
mode with 16 first-stage reducers and 32 threads for the second stage.</p>
        <p>In Table 2 we summarise the results of the performed tests. It is clear that on
average our application has a lower execution time than its competitors, except
for the online version of OAC-triclustering. If we compare the implemented program
with its original online version, the results are worse for not-that-big but dense
datasets (closer to the worst-case scenario q2 = |I|). This is a consequence of the
fact that the application architecture is aimed at processing large amounts of
data; in particular, it is implemented in two stages with time-consuming
communication. Launching and stopping Apache Hadoop and writing and passing data
between the Map and Reduce steps of both stages require substantial time, which is
why for not-that-big datasets, when the execution time is comparable with the time for
infrastructure management, the time performance is not perfect. However, as the data
size increases, the relative performance grows. Thus, the last test, for the
BibSonomy data, has been successfully passed: the competitors were not able to
finish it within 50 min, while our M/R program did it, even in sequential mode,
within 25 min.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>
        In this paper we have presented a map-reduce version of the OAC-triclustering
algorithm. We have shown that the algorithm is efficient from both theoretical and
practical points of view. It remains of linear time complexity and is performed
in two stages (with each stage being M/R distributed); this allows us to use it
for big data problems. However, we believe that it is possible to propose other
variants of map-reduce based algorithms where the reducer exploits composite
keys directly (see the Appendix). Such algorithms and their comparison
with the current M/R version on real and artificial data are still in our plans.
Moreover, despite this step towards Big Data technologies, a proper comparison
of the proposed OAC-triclustering with noise-tolerant patterns in n-ary relations
mined by DataPeeler and its descendants [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] has not yet been conducted.
      </p>
      <p>Acknowledgments. The study was implemented in the framework of the Basic
Research Program at the National Research University Higher School of
Economics in 2014-2015, in the Laboratory of Intelligent Systems and Structural
Analysis. The last two authors were partially supported by Russian Foundation
for Basic Research, grant no. 13-07-00504. The authors would like to thank Yuri
Kudriavtsev from PM-Square and Dominik Slezak from Infobright and Warsaw
University for their encouragement given to our studies of M/R technologies.</p>
    </sec>
    <sec id="sec-6">
      <title>Appendix. Alternative variants of two-stage MapReduce</title>
      <p>First Map: Finding primes. During this phase, every input triple (g, m, b)
is encoded by three key-value pairs: ⟨(g, m), b⟩, ⟨(g, b), m⟩, and ⟨(m, b), g⟩. These
pairs are passed to the first reducer. The replication rate is r1 = 3.
First Reduce: Finding primes. This reducer fills the three corresponding
dictionaries with the primes of the keys. So, for example, the first dictionary, PrimeOA,
contains key-value pairs ⟨(g, m), {b1, b2, . . . , bn}⟩. The reducer size is q1 =
max(|G|, |M|, |B|).</p>
      <p>The process can be stopped after the first reduce phase, with all the
triclusters found as (Prime[g, m], Prime[g, b], Prime[m, b]), each obtained by enumerating
(g, m, b) ∈ I. However, to do this faster and to keep the result for further
computation, it is possible to use M/R here as well.</p>
      <p>Second Map: Tricluster generation. The second map does the tricluster
combining job, i.e., for each triple (g, m, b) it composes a new key-value
pair, ⟨(g, m, b), ∅⟩. And for each pair of either type, ⟨(g, m), Prime[g, m]⟩,
⟨(g, b), Prime[g, b]⟩, and ⟨(m, b), Prime[m, b]⟩, it generates the key-value pairs
⟨(g, m, b̃), Prime[g, m]⟩, ⟨(g, m̃, b), Prime[g, b]⟩, and ⟨(g̃, m, b), Prime[m, b]⟩,
where g̃ ∈ G, m̃ ∈ M, and b̃ ∈ B. Here r2 = (|I| + 3|G||M||B|)/(|I| + |G||M| + |G||B| +
|M||B|) ≤ (ρ + 3)/(ρ + 3/max(|G|, |M|, |B|)), where ρ is the input tricontext
density.</p>
      <sec id="sec-6-1">
        <title>Second Reduce: Tricluster generation.</title>
        <p>The second reducer assembles only one value for each key (g, m, b), the generating triple: its tricluster
(Prime[g, m], Prime[g, b], Prime[m, b]). If there is no key-value pair ⟨(g, m, b), ∅⟩
for a particular triple (g, m, b), it does not output any key-value pair for that key.
The reducer size q2 is either 3 (no output) or 4 (a tricluster is assembled).</p>
      </sec>
      <sec id="sec-6-2">
        <title>Second Map: Tricluster generation with duplicate generating triples.</title>
        <p>The second map does the tricluster combining job, i.e., for each triple (g, m, b) it composes
a new key-value pair: ⟨(Prime[g, m], Prime[g, b], Prime[m, b]), (g, m, b)⟩.</p>
      </sec>
      <sec id="sec-6-3">
          <title>Second Reduce: Tricluster generation with duplicate generating triples.</title>
          <p>The second reducer just groups the values for each key: ⟨(X, Y, Z), {(g1, m1, b1), . . . ,
(gn, mn, bn)}⟩.</p>
          <p>These two variations of the second stage have their merits: the first one is
beneficial for further computations with a new portion of triples, and the second one
is more compact and informative. Of course, each variant of the second stage has
its own runtime complexity, which depends not only on the model representation
but is also sensitive to the data-structure implementation and to the M/R communication
costs and settings.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Georgii</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tsuda</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , Schölkopf, B.:
          <article-title>Multi-way set enumeration in weight tensors</article-title>
          .
          <source>Machine Learning 82(2)</source>
          (
          <year>2011</year>
          )
          <fpage>123</fpage>
          -
          <lpage>155</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Cerf</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Besson</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nguyen</surname>
            ,
            <given-names>K.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boulicaut</surname>
            ,
            <given-names>J.F.</given-names>
          </string-name>
          :
          <article-title>Closed and noise-tolerant patterns in n-ary relations</article-title>
          .
          <source>Data Min. Knowl. Discov</source>
          .
          <volume>26</volume>
          (
          <issue>3</issue>
          ) (
          <year>2013</year>
          )
          <fpage>574</fpage>
          -
          <lpage>619</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Spyropoulou</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Bie</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boley</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Interesting pattern mining in multirelational data</article-title>
          .
          <source>Data Mining and Knowledge Discovery</source>
          <volume>28</volume>
          (
          <issue>3</issue>
          ) (
          <year>2014</year>
          )
          <fpage>808</fpage>
          -
          <lpage>849</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Ignatov</surname>
            ,
            <given-names>D.I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gnatyshak</surname>
            ,
            <given-names>D.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuznetsov</surname>
            ,
            <given-names>S.O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mirkin</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Triadic formal concept analysis and triclustering: searching for optimal patterns</article-title>
          .
          <source>Machine Learning</source>
          (
          <year>2015</year>
          )
          <fpage>1</fpage>
          -
          <lpage>32</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Mirkin</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <source>Mathematical Classification and Clustering</source>
          . Kluwer, Dordrecht (
          <year>1996</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Madeira</surname>
            ,
            <given-names>S.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oliveira</surname>
            ,
            <given-names>A.L.</given-names>
          </string-name>
          :
          <article-title>Biclustering algorithms for biological data analysis: A survey</article-title>
          .
          <source>IEEE/ACM Trans. Comput. Biology Bioinform</source>
          .
          <volume>1</volume>
          (
          <issue>1</issue>
          ) (
          <year>2004</year>
          )
          <fpage>24</fpage>
          -
          <lpage>45</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Eren</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deveci</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kucuktunc</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Catalyurek</surname>
            ,
            <given-names>U.V.</given-names>
          </string-name>
          :
          <article-title>A comparative analysis of biclustering algorithms for gene expression data</article-title>
          .
          <source>Briefings in Bioinform.</source>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Mirkin</surname>
            ,
            <given-names>B.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kramarenko</surname>
            ,
            <given-names>A.V.</given-names>
          </string-name>
          :
          <article-title>Approximate bicluster and tricluster boxes in the analysis of binary data</article-title>
          . In
          <string-name>
            <surname>Kuznetsov</surname>
            ,
            <given-names>S.O.</given-names>
          </string-name>
          , et al., eds.:
          <source>RSFDGrC 2011</source>
          . Volume
          <volume>6743</volume>
          of Lecture Notes in Computer Science., Springer (
          <year>2011</year>
          )
          <fpage>248</fpage>
          -
          <lpage>256</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Ignatov</surname>
            ,
            <given-names>D.I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuznetsov</surname>
            ,
            <given-names>S.O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Poelmans</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhukov</surname>
            ,
            <given-names>L.E.</given-names>
          </string-name>
          :
          <article-title>Can triconcepts become triclusters?</article-title>
          <source>International Journal of General Systems</source>
          <volume>42</volume>
          (
          <issue>6</issue>
          ) (
          <year>2013</year>
          )
          <fpage>572</fpage>
          -
          <lpage>593</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Gnatyshak</surname>
            ,
            <given-names>D.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ignatov</surname>
            ,
            <given-names>D.I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuznetsov</surname>
            ,
            <given-names>S.O.</given-names>
          </string-name>
          :
          <article-title>From triadic FCA to triclustering: Experimental comparison of some triclustering algorithms</article-title>
          . In: CLA. (
          <year>2013</year>
          )
          <fpage>249</fpage>
          -
          <lpage>260</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zaki</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          :
          <article-title>Tricluster: An effective algorithm for mining coherent clusters in 3d microarray data</article-title>
          .
          In:
          <source>SIGMOD 2005 Conference</source>
          . (
          <year>2005</year>
          )
          <fpage>694</fpage>
          -
          <lpage>705</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tuck</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>An effective tri-clustering algorithm combining expression data with gene regulation information</article-title>
          .
          <source>Gene Regul. Syst. Biol.</source>
          <volume>3</volume>
          (
          <year>2009</year>
          )
          <fpage>49</fpage>
          -
          <lpage>64</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Kaytoue</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuznetsov</surname>
            ,
            <given-names>S.O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Napoli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duplessis</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Mining gene expression data with pattern structures in formal concept analysis</article-title>
          .
          <source>Inf. Sci</source>
          .
          <volume>181</volume>
          (
          <issue>10</issue>
          ) (
          <year>2011</year>
          )
          <fpage>1989</fpage>
          -
          <lpage>2001</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Nanopoulos</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rafailidis</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Symeonidis</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manolopoulos</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Musicbox: Personalized music recommendation based on cubic analysis of social tags</article-title>
          .
          <source>IEEE Transactions on Audio, Speech &amp; Language Processing</source>
          <volume>18</volume>
          (
          <issue>2</issue>
          ) (
          <year>2010</year>
          )
          <fpage>407</fpage>
          -
          <lpage>412</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Jelassi</surname>
            ,
            <given-names>M.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yahia</surname>
            ,
            <given-names>S.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nguifo</surname>
            ,
            <given-names>E.M.</given-names>
          </string-name>
          :
          <article-title>A personalized recommender system based on users' information in folksonomies</article-title>
          . In Carr, L., et al., eds.:
          <source>WWW (Companion Volume)</source>
          , ACM
          (
          <year>2013</year>
          )
          <fpage>1215</fpage>
          -
          <lpage>1224</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Ignatov</surname>
            ,
            <given-names>D.I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nenova</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Konstantinova</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Konstantinov</surname>
            ,
            <given-names>A.V.</given-names>
          </string-name>
          :
          <article-title>Boolean Matrix Factorisation for Collaborative Filtering: An FCA-Based Approach</article-title>
          . In:
          <source>AIMSA 2014, Varna, Bulgaria, Proceedings. Volume LNCS 8722</source>
          . (
          <year>2014</year>
          )
          <fpage>47</fpage>
          -
          <lpage>58</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Gnatyshak</surname>
            ,
            <given-names>D.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ignatov</surname>
            ,
            <given-names>D.I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Semenov</surname>
            ,
            <given-names>A.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Poelmans</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Gaining insight in social networks with biclustering and triclustering</article-title>
          .
          In:
          <source>BIR. Volume 128 of Lecture Notes in Business Information Processing</source>
          , Springer (
          <year>2012</year>
          )
          <fpage>162</fpage>
          -
          <lpage>171</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Kaytoue</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuznetsov</surname>
            ,
            <given-names>S.O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Macko</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Napoli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Biclustering meets triadic concept analysis</article-title>
          .
          <source>Ann. Math. Artif. Intell</source>
          .
          <volume>70</volume>
          (
          <issue>1-2</issue>
          ) (
          <year>2014</year>
          )
          <fpage>55</fpage>
          -
          <lpage>79</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Gnatyshak</surname>
            ,
            <given-names>D.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ignatov</surname>
            ,
            <given-names>D.I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuznetsov</surname>
            ,
            <given-names>S.O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nourine</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>A one-pass triclustering approach: Is there any room for big data?</article-title>
          In:
          <source>CLA 2014</source>
          . (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Rajaraman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leskovec</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ullman</surname>
            ,
            <given-names>J.D.</given-names>
          </string-name>
          :
          <article-title>MapReduce and the New Software Stack</article-title>
          .
          In:
          <source>Mining of Massive Datasets</source>
          . Cambridge University Press, Cambridge, England (
          <year>2013</year>
          )
          <fpage>19</fpage>
          -
          <lpage>70</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Krajca</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vychodil</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Distributed algorithm for computing formal concepts using map-reduce framework</article-title>
          . In Adams, N., et al., eds.:
          <source>IDA 2009</source>
          . Volume LNCS
          <volume>5772</volume>
          . (
          <year>2009</year>
          )
          <fpage>333</fpage>
          -
          <lpage>344</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Kuznecov</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kudryavcev</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Applying map-reduce paradigm for parallel closed cube computation</article-title>
          .
          In:
          <source>1st Int. Conf. on Advances in Databases, Knowledge, and Data Applications, DBKDS 2009</source>
          . (
          <year>2009</year>
          )
          <fpage>62</fpage>
          -
          <lpage>67</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>de Frein</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Robson</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Foghlu</surname>
            ,
            <given-names>M.O.</given-names>
          </string-name>
          :
          <article-title>Distributed formal concept analysis algorithms based on an iterative mapreduce framework</article-title>
          . In Domenach, F., Ignatov, D., Poelmans, J., eds.:
          <source>ICFCA 2012</source>
          . Volume LNAI
          <volume>7278</volume>
          . (
          <year>2012</year>
          )
          <fpage>292</fpage>
          -
          <lpage>308</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Wille</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Restructuring lattice theory: An approach based on hierarchies of concepts</article-title>
          .
          In Rival, I., ed.:
          <source>Ordered Sets. Volume 83 of NATO Advanced Study Institutes Series</source>
          . Springer Netherlands (
          <year>1982</year>
          )
          <fpage>445</fpage>
          -
          <lpage>470</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Ganter</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wille</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <source>Formal Concept Analysis: Mathematical Foundations. 1st edn</source>
          . Springer-Verlag New York, Inc., Secaucus, NJ, USA (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Ignatov</surname>
            ,
            <given-names>D.I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuznetsov</surname>
            ,
            <given-names>S.O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Poelmans</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Concept-based biclustering for internet advertisement</article-title>
          .
          In:
          <source>ICDM Workshops</source>
          , IEEE Computer Society (
          <year>2012</year>
          )
          <fpage>123</fpage>
          -
          <lpage>130</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>