A Hybrid Partitioning Strategy for Distributed
                        FCA

    Muneeswaran P[0000−0003−4099−3003] , Jyoti[0000−0003−1595−3173] , and Sriram
                        Kailasam[0000−0002−2218−8660]

                     Indian Institute of Technology Mandi, India
                 cegmunees@gmail.com, jjangra2@gmail.com, India
                              sriramk@iitmandi.ac.in


        Abstract. Several parallel and distributed FCA algorithms have been
        proposed to speedup the discovery of concepts. Some require the con-
        text to be replicated on all machines while others partition the context
        horizontally across different machines. The former suffers from memory
        bottlenecks for large contexts as every machine retains the entire context
        in memory. The latter approaches are based on Map-Reduce paradigm
        wherein to compute a valid concept requires all-all communication be-
        tween different machines, incurring huge communication overheads. It
        has been shown that DFS (Depth-First-Search) based approaches can
        exploit database reduction strategy for concept sub-tree exploration.
        This significantly decreases the cost of closure computation in the con-
        cept sub-tree and results in higher speedup. The existing Map-Reduce
        based approaches cannot use DFS exploration strategy. In this paper,
        we propose a hybrid partitioning strategy for distributed FCA that uses
        horizontal partitioning for groups of low support attributes and vertical
        partitioning for high support attributes. This not only alleviates memory
        bottlenecks but also minimizes communication overheads. For vertically
        partitioned sub-context, inter-machine communication is only required
        for concepts below a certain support threshold. We further minimize the
        communication cost by using group-id based auxiliary structure. This
        hybrid partitioning also allows DFS based concept exploration, thereby
        making it more scalable and efficient.

        Keywords: Hybrid partitioning · Large context · Distributed FCA


1     Introduction
Formal Concept Analysis (FCA) is a method of data analysis that aims at ex-
tracting natural clusters (formal concepts) from object and attribute data table
(a.k.a. context). The sequential algorithms for computing concepts (e.g. Close-
By-One (CbO) [8], LCM [1], and DCI-Closed [9]) do not scale to large contexts
as the total number of concepts can be extremely large. In the worst-case, it
is exponential in the number of objects/attributes. To speedup the discovery of
concepts, several parallel and distributed FCA algorithms have been proposed [5,
3, 2, 11, 4]. These algorithms either replicate the input context on all machines


Copyright c 2020 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0). Published in Francisco
J. Valverde-Albacete, Martin Trnecka (Eds.): Proceedings of the 15th International
Conference on Concept Lattices and Their Applications, CLA 2020, pp. 71–82, 2020.
72     P Muneeswaran et al.

or partition the context horizontally across different machines. The algorithms
that use replicated context [3] suffer from memory bottlenecks for large con-
texts as every machine needs to retain the entire context in memory. The other
algorithms [2, 11] partition the context horizontally across multiple machines.
Here, computing a concept that is valid in the entire context involves two steps:
(1) each machine first computes the local closure in its partition. Then it ex-
changes the local closure with other machines using broadcast. (2) At the end
of broadcast, each machine computes the valid concept by intersecting the local
closures from all the machines. The above two steps are repeated until all valid
concepts are discovered. MR-Ganter [2] implements the above procedure using a
sequence of Map-Reduce (MR) jobs whereas MR-Ganter+ [11] uses Twister [10],
a MR-based framework for iterative computations.
    It has been shown that DFS based approaches can exploit database reduction
for concept sub-tree exploration. This significantly decreases the cost of closure
computation in the concept sub-tree and results in higher speedup. The existing
MR-based approaches cannot use DFS exploration strategy as it would make
them highly inefficient. If they compute one concept at a time using DFS, they
would incur large waiting time due to all-all communication between the ma-
chines. Hence, they explore the concept tree in a breadth-first manner and try
to overlap communication and computation. However, they lose the benefits of
database reduction available in DFS based approach. From experiments, we find
that the distributed MR-based algorithms perform slower than even single node
DFS based algorithms like LCM [1].
    In this paper, we propose a hybrid partitioning strategy for distributed FCA
that uses horizontal partitioning for groups of low support attributes and vertical
partitioning for high support attributes (see Fig. 1). This partitioning strategy
allows DFS based concept exploration, alleviates memory bottlenecks and mini-
mizes communication overheads. Hence, it is more scalable and efficient than ex-
isting algorithms. We also propose some optimizations for reducing inter-machine
communication cost by using group-id based auxiliary structure.
    Rest of the paper is organized as follows: Section 2 introduces the relevant
background and related work; Section 3 discusses the hybrid partitioning strat-
egy and the optimizations for reducing inter-machine communication cost; Sec-
tion 4 describes the performance results and Section 5 discusses the conclusion
and future work of this paper.


2     Background and Related Work
2.1   FCA: Notations and Definitions
In this section, we introduce the relevant notations and terminology of FCA.
Definition 1 A formal context K = (G, M, I), consists of two sets G and M ,
and a binary relation I ⊆ G×M . G represents the set of objects and M represents
the set of attributes. If g ∈ G and m ∈ M are in relation I, we write (g, m) ∈
I or gIm. It can be read as “object g has attribute m”.
                        A Hybrid Partitioning Strategy for Distributed FCA       73

   Formal concept is computed using two operators ↑ and ↓ . ↓↑ is a closure
operator [6]. These operators are defined as below:
Definition 2 Let A ⊆ G and B ⊆ M , then: A↑ = {m ∈ M | ∀g ∈ A : (gIm)},
and B ↓ = {g ∈ G | ∀m ∈ B : (gIm)}

Definition 3 An object subset A ⊆ G is closed in (G, M, I) if A = A↑↓ . An
attribute subset B ⊆ M is closed in (G, M, I) if B = B ↓↑ .

A formal concept is described by an extent and an intent as shown below:
Definition 4 A duplet hA, Bi is called formal concept of context K = (G, M, I),
if A ⊆ G and B ⊆ M , A↑ = B, and A = B ↓ . The set A (object set) is called the
extent and the set B (attribute set) is called the intent of the concept.

2.2   Basic Linear time Closed itemset Miner (LCM)
Linear time Closed itemset Miner (LCM) [1] is state-of-the-art sequential algo-
rithm for concept discovery. LCM algorithm induces a search tree on concepts
and explores it in a depth-first-search manner. Given a concept intent, the LCM
procedure adds attributes which are not part of the intent (a.k.a. list of markers)
and computes the resulting closure. These closures form the next-level concepts.
To prevent generating the same concept multiple times (duplicate concepts),
LCM orders the attributes; it considers a particular concept valid only if it is
generated in the sub-tree of the leftmost attribute in that concept. This redun-
dancy check is performed during closure computation, by testing whether the
newly added attributes in the closure contain any attribute to the left of the
marker. If yes, it is considered a duplicate, else it is considered a valid concept.
LCM bounds the time complexity of concept mining to a function linear in the
number of concepts and its memory usage is linear in the size of the context.
However, due to its sequential nature, it does not scale to large contexts. We
have designed a Group based LCM algorithm that can scale to large contexts.

2.3   Related Work
There are both parallel and distributed implementations of the LCM algorithm.
P-LCM [5] is a parallel version of LCM based on shared memory model; the task
queue is shared among all the workers. When a worker becomes idle, it pulls
tasks from the shared queue. This approach works in a multi-core environment,
but requires redesign to work in distributed shared nothing architectures largely
prevalent in the cloud [12]. Distributed-LCM [6] uses a master-worker based
architecture. The master performs static load balancing to assign seed concepts
to each worker. It also maintains meta-data about the load on each worker. The
workers generate the concepts in the sub-tree below the assigned seed concept.
When they become idle, they request the master to assign a donor and fetch work
from the donor. This master-aided dynamic load balancing scheme distributes
the skewed concept tree of LCM evenly among the workers. Distributed-LCM
74     P Muneeswaran et al.

provides linear speedups for many contexts, however, it is not scalable to large
contexts as it uses context replication on all the workers.
    MR-CbO [3] is the first distributed implementation of CbO [2] based on the
Map-Reduce framework. It follows an iterative approach. In the map phase, each
mapper(worker) computes the first level concepts from a given seed concept and
passes them to the reduce phase for redundancy check. Those concepts which
pass the redundancy check are considered as seed concepts for the next iteration.
MR-CbO also requires the entire context to be replicated. Hence it is not scalable.
To address this issue, MR-Ganter [2] partitions the context horizontally. Each
mapper finds the local closure with respect to their partitioned context. In the
next phase, global closure is computed from the local closures of the previous
phase. This method suffers from communication overhead for computing the
concepts valid in the entire context and for performing the redundancy check.
Further, the MR-based approaches do not use DFS based exploration of the
concept tree, thereby losing the benefits of database reduction.
    Yoshizoe et al. [12] proposed an attribute-based partitioning of the context;
wherein each partition consists of objects in the extent of the attribute. This
allows discovery of all concepts that contain a particular attribute without any
inter-machine communication. However, for large contexts, the partition corre-
sponding to high support attributes would contain almost the entire context
leading to memory bottleneck.
    In summary, existing parallel and distributed versions of LCM either work
on replicated context or horizontally partitioned context. The former suffers
from memory bottlenecks for large contexts as every machine retains the entire
context in memory. The latter approaches incur huge communication overheads
and are not able to explore the concept tree using DFS based strategy.


3     G-LCM (group-based LCM)

We propose G-LCM, a group-based LCM algorithm that scales to large contexts.
G-LCM uses a hybrid partitioning strategy to bound the communication cost as
well as the memory consumption across machines.


3.1   Hybrid Partitioning in G-LCM

Hybrid partitioning is a technique that uses horizontal partitioning for low sup-
port attributes and vertical partitioning for high support attributes (above cer-
tain support threshold). The context is reordered according to the support of the
attributes. For creating horizontal partitions, we start with the attribute having
the lowest support and form groups of attributes in an increasing order of their
support by fixing an upper bound on the total number of objects in a group.
    Thus, the low support attributes are grouped into many horizontal partitions,
each of which can be stored in a single machine. The high support attributes are
grouped into a single vertical partition containing only the frequent attributes
                          A Hybrid Partitioning Strategy for Distributed FCA         75

in each of its objects. This enables us to compress the vertical partition since
many objects in this group would become identical and can be coalesced.
    We use 3 Hadoop Map-Reduce jobs to partition the input context. Job1
computes the support of attributes; job2 reorders the attributes based on their
support and job3 partitions the context based on the grouping strategy.

3.1.1 Horizontal Partitioning: The sub-context induced by a group of at-
tributes is referred as a horizontal partition. Consider a formal context K =
(G, M, I) relabeled according to support of the attributes; the attribute with
the lowest support is assigned the lowest index.


Fig. 1: Partitions of the context with N attributes. The number of horizontal partitions
is r, i.e. {K1 , K2 ,. . . , Kr }. K0 denotes the vertical partition.


Definition 5 Suppose attributes are indexed and arranged in the ascending or-
der of their support as m1 , m2 , ... m|M | . The grouping strategy partitions the
attributes linearly into (say r + 1-groups): R1 , R2 , ...,Rr , and R0 .
1. Each horizontal partition, Ri = {mk | mk ∈ M, k ∈ [L(i), U (i)]}
2. Vertical partition (R0 ) = {mk | mk ∈ M, support(mk ) ≥ minT h}
          
            1,              if i = 1
  L(i) =
            U (i − 1) + 1, if 1 < i ≤ r
                             X j                     j+1
                                                      X               
   U (i) = j | j < |M | and        support(mk ) ≤ θ <     support(mk )
                               k=L(i)                       k=L(i)
Where, θ is upper bound on the number of objects in a partition and support(mk )
returns the cardinality of m↓k .

    Suppose that K is partitioned into k horizontal partitions each having group-
id in [1, r](see Fig. 1). While forming groups of attributes, we ensure that the
total number of objects in their corresponding partitions is less than or equal
to a threshold θ. Consider a toy context shown in Table 1 consisting of four
76        P Muneeswaran et al.
               Table 1: Toy context              Table 2: Relabelled context

                      DEL NYC LON BOM                BOM LON NYC DEL
     AirCanada (o1)    X           X            o1    X           X
     AirIndia (o2)     X   X   X                o2        X   X   X
     Indigo (o3)           X                    o3            X
     Aeroflot (o4)     X   X   X   X            o4    X   X   X   X

     Table 3: Group 1 sub-context     Table 4: Group 2 sub-context

          BOM LON NYC DEL                  BOM LON NYC DEL
     o1    X           X              o2        X   X   X
     o4    X   X   X   X              o4    X   X   X   X


attributes and four objects. The relabelled context after sorting the attributes
according to support is shown in Table 2. Assume that the upper bound θ = 2,
then NYC and DEL are the frequent attributes. The two horizontal partitions
are formed with group attribute set {BOM } and {LON } respectively. Group 1
contains objects 1 and 4 (See Table 3). Group 2 contains objects 2 and 4 (see
Table 4).

Definition 6 The sub-context induced by Rk , k > 0, is  S known as the horizontal
partition, Kk = (Gk , Mk , Ik ) and is defined as: Gk = m∈Rk (m↓ ), where m↓ is
                                                       S
the set of objects in K containing attribute m, Mk = g∈Gk (g ↑ ), where g ↑ is the
set of attributes in K containing object g, and Ik = (Gk × Mk ) ∩ I. ↑k and ↓k
denote the corresponding concept forming operators in Kk .

Lemma 1. Every concept (A, B) generated by the horizontal partition Kk that
includes at least one attribute m ∈ Rk as part of its intent (i.e. m ∈ B) is valid
in the entire context (G, M, I).

Proof. Suppose (A, B) is not closed in K. Then by extensivity of closure operator,
B ⊆ B ↓↑ . Let n ∈ B ↓↑ \ B.
=⇒ ∀g ∈ B ↓ : gIn
Consider an attribute m ∈ Rk s.t. m ∈ B
=⇒ ∀g ∈ B ↓ : gIm
=⇒ B ↓ ⊆ m↓
=⇒ B ↓ ⊆ Gk       (m ∈ Rk )
=⇒ B ↓ = B ↓k     (By definition of Ik )
But B ↓k = A      (((A,B) is closed in Kk )
=⇒ ∀g ∈ A : gIn
=⇒ n ∈ B           (hence, proved by contradiction)


3.1.2 Vertical Partitioning: Vertically partitioned context is defined w.r.t. a
support threshold and is assigned group-id 0. It contains the union of occurrences
                            A Hybrid Partitioning Strategy for Distributed FCA     77
      Table 5: Vertical partition      Table 6: Augmented context

                 NYC DEL                         m(1) m(2)
  AirCanada (o1)      X                       o1 X
   AirIndia (o2)  X   X                       o2       X
    Indigo (o3)   X                           o3
   Aeroflot (o4)  X   X                       o4 X     X


of all attributes whose support is greater than the threshold. Vertical partition
with support threshold 3 is shown in Table 5.

Definition 7 The vertically partitioned context K0 = (G0 , M0 , I0 ) w.r.t. support
threshold (minT h) is defined as: M0 ⊆ M s.t. ∀m ∈ M0 , support(m) ≥ minT h,
G0 = G and I0 = (G0 × M0 ) ∩ I.


3.2     Concept discovery from vertical partition of the context

As the vertical partition does not include all the attributes, it can lead to discov-
ery of invalid concepts in certain cases. For example closure of {NYC, DEL} in
vertical partitioned context is {NYC, DEL} but the actual closure in the input
context is {LON, NYC, DEL}. From this example, it is clear that the vertical
partition may generate invalid concepts. Hence, we augment the vertically par-
titioned context with auxiliary information that will help check validity of the
concept. We define the augmented context for the vertical partition.

Definition 8 An augmented context Kac = (Gac , Mac , Iac ) is defined as: Gac =
G, Mac = {m(k) | 1 ≤ k ≤ r}, where r is the number of horizontal partitions
and gIac m(k) iff ∃m ∈ Rk s.t. gIm.

Table 6 shows the augmented context for the toy example.
Lemma 2. Let ↑a denote the ↑ operator in Kac . A concept (Av , Bv ) generated
by the vertical partition is valid if for every horizontal partition Kk : m(k) ∈ Av ↑a
there exists no attribute m ∈ Rk s.t. Av ⊆ m↓k .

Proof. Assume there exists an attribute m ∈ Rk s.t. Av ⊆ m↓k .
=⇒ ∀g ∈ Av : gIm
=⇒ m ∈ A↓v . But A↓v = Bv and m ∈  / Bv . Hence proved.

For each concept generated by the vertical partition, we intersect the group-ids
of the objects in the extent (using the augmented context). If the intersection
returns an empty set, then the concept is valid, else an additional check is re-
quired to test the validity of the concept. For example, suppose that we make
2 groups in our toy context; attributes in group 0 (vertical partition) contain
NYC and DEL, while attributes in group 1 (horizontal partition) contain BOM
and LON. Group 0 includes all 4 objects. In the augmented context, each object
will contain group-id 1 except object 3. Hence, for concept intent {DEL} in the
78     P Muneeswaran et al.

augmented context, the group-id intersection returns {1}. This however does not
mean that {DEL} is an invalid concept intent. We need to test its validity w.r.t.
attributes in group 1. In this case, we find that no attribute in group 1 includes
the extent of {DEL}, and therefore {DEL} is a valid concept intent.
    Thus, group-id based intersection may generate false positives but no false
negatives. To understand how false positives are handled, let us assume a sce-
nario in which, machine M0 is discovering concept in vertical partition; where
as machine M1 and M2 are discovering concepts in group 1 and 2 (horizontal
partitions) respectively. In machine M1 and M2 , a hash table is created with
keys as the low support attributes (m) of the corresponding group and value
being the corresponding attribute’s object ids (m↓ ). To check the validity of the
concept (Av , Bv ) discovered in the vertical partition, a validity check is issued to
all the machines whose id is returned by the group-id based intersection. Upon
receiving this query, each machine checks in its hash table whether the extent
Av is a subset of the value stored against any attribute id of that group. If a
match is found, then the concept discovered by the vertical partition is invalid.
This check can be implemented efficiently using a Cuckoo filter.


3.2.1 Cuckoo filter Cuckoo filter is a probabilistic data structure that checks
the membership of an element in a set. It is mainly designed for applications that
need to store many elements in the set and check the membership of the set with
low false positive rate. Cuckoo filter is variant of cuckoo hash table that stores
only fingerprints instead of (key, value) pairs which leads to space efficient filter
and restricts false positive rate to less than 3% [7].
    For each attribute (m ∈ Rk ), we store its object IDs (m↓ ) in a cuckoo filter. To
test the validity of a concept intent, we first determine the candidate horizontal
partitions using the augmented context. Then for each candidate partition, we
check the concept extent against the cuckoo filter of those attributes whose
support is greater than the size of the extent. If the cuckoo filter returns negative
for any object in the concept extent, it means that the concept intent does
not include that attribute. Otherwise, the attribute can be present with high
probability and needs to be verified by actual comparison of the concept extent
with the concerned attribute’s extent.


3.2.2 G-LCM algorithm for discovering the concepts in vertical par-
tition Algorithm 1 shows how concepts are discovered in the vertical parti-
tion. In line number 2, algorithm iterates over the (M 0 ) attribute set. In line
number 4, it checks whether the newly found concept is redundant or not us-
ing function isN ewConcept similar to LCM (see Section 2.2). If the support
of the new concept is below the threshold (line 5), then it performs the va-
lidity check locally using group-id based intersection function (line 6). If that
returns a non-empty set (line 7) then for each group-id in gIDList (group-
id list), a membership query is sent to the machine which holds that group-
id context (lemma 2). On that machine, the cuckoo filter is used to check
                        A Hybrid Partitioning Strategy for Distributed FCA       79

the validity of the concept (line 8). As cuckoo filter may produce false posi-
tives, we do an actual comparison with the extent if the cuckoo filter returns
true (lines 11 - 18). If the extent is having some common attributes in the
group-id context then it is a true positive (line 14) else false positive (line 16).
 Algorithm 1: G-LCM algorithm
   Input: K0 : Vertically partitioned context, minTh: Minimum threshold,
            Kac : Augmented Context.
 1 // initialize flag=false, X = φ
 2 for i := First_Attribute (M 0 ) to Last_Attribute (M 0 ) do
 3     X = ((X ∪ i) ↓↑)
 4     if isNewConcept(X) then
 5         if support(X)< minT h then
 6              gIDList = groupIDIntersection(Extent(X),Kac )
 7              if size(gIDList) > 0 then
 8                  flag = cuckooFilterMembership(Extent(X),gIDList)
 9              else
10                  Mark X as a valid concept // true negative
11              if flag==true then
12                  fpFlag= intersectInFreqAtt(Extent(X), gIDList)
13                  if fpFlag==true then
14                      // true positive
15                  else
16                      // false positive and SET flag=false
17              else
18                  // false positive
19         if !flag then
20              G-LCM(X, i)
    In case the support of the newly found concept is greater than minimum
threshold (minT h) then the extent of the concept does not have any common
attribute from infrequent attributes. Thus, the concept found is valid (line 19).
In the next section, we discuss experimental evaluation of this algorithm.


4   Results and Discussion

We implemented G-LCM algorithm in Java. The context is partitioned into hori-
zontal and vertical partitions using Map-Reduce jobs. For horizontal partitions,
we can use the distributed LCM algorithm from [6] to compute the concepts.
Each horizontal partition is passed as a separate job to the framework. As each
machine holds only the horizontal partition instead of the entire input context
(see section 2.3), the memory bottleneck is resolved. According to lemma 1,
each horizontal partition should generate only concepts containing at least one
attribute of the group in its intent. To ensure this, we add the attributes of the
corresponding group as first level markers (see section 2.2) to the null concept.
From second level onward, the basic LCM procedure is followed.
80          P Muneeswaran et al.

    Here, we report performance results for computing concepts in the vertical
partition for susy and pumsb_star data-sets. The susy data-set 1 contains 190
attributes and 5 million objects with average object length as 19.


                     Table 7: Experimental evaluation for susy data-set.

                  No. of                                                 No. of
     Support     concepts                                               concepts
                            Group-id based Cuckoo Filter Execution time
       (%)        below                                                  above
                               intersection                 (minutes)
                  minT h                                                 minT h
                                                             <      ≥
                            T Ng F Pg T Pg T Nc F Pc T Pc
                                                          minT h minT h
       25          4230     4230 0        0  0   0    0   17.69 25.36     3745
       30          2575     2575 0        0  0   0    0   10.38 20.86     2450
       35          480      480     0     0  0   0    0    2.95   20.83   2207
       40          520      520     0     0  0   0    0    0.77   19.71   2167
       55           0         0     0     0  0   0    0      0    10.99   959


                 Table 8: Experimental evaluation for pumsb_star data-set.

                                                                            No. of
             No. of
                                                                             con-
    Support concepts
                           Group-id based      Cuckoo Filter Execution time cepts
      (%)    below
                              intersection                     (seconds)    above
             minT h
                                                                            minT h
                                                               <        ≥
                          T Ng       F Pg T Pg T Nc F Pc T Pc
                                                              minT h minT h
       75          1        1         0          0    0 0 0.011       0.067   2
       70         238       0        238   0    238   0 0      0.8     0.62   17
       65        1317       0       1317   0   1317   0 0 4.454        1.3    42
       60        5107       0       5107   0   5107   0 0 8.81        1.534   68
       55       52218       0       52218  0   52218  0 0 51.16       2.424 116
       50       436610      0      436610 0 436610 0 0 284.6          4.515 248
       45      5237012 1971124 3265888 0 3265888 0 0 1028            10.692 713
       40      14946492 7643577 7302915 0 7302915 0 0 2331.3 28.764 2610


    We evaluate the efficacy of group-id based intersection and cuckoo filter for
different values of support threshold. When group-id based intersection function
returns an empty set, no inter-machine communication is necessary to test the
validity of the concept. In Table 7, the true negative column (T Ng ) captures the
frequency of such occurrences in susy dataset. If the function returns a non-empty
set (gIDList) then for each group-id in gIDList, we need to check whether the
1
     FIMI repository, http://fimi.cs.helsinki.fi/. Last accessed 1 March 2020
                        A Hybrid Partitioning Strategy for Distributed FCA       81

concept is valid/invalid. If the group has at least one attribute containing the
extent, then the concept is invalid. As the group-id based intersection did not
detect this, it is counted as False Positive ((F Pg )). Otherwise the concept is
valid (T Pg ). We observe that the group-id based intersection returns an empty
set (T Ng ) almost 100% times for susy data-set. This means that concepts dis-
covered in the vertical partition are valid and there is no need for inter-machine
communication.
     For experiments on pumsb_star data-set 2 (containing 49046 objects, 2088
attributes with average object length as 50.48), we found that group-id based
intersection returned a non-empty set (F Pg ) close to 50% (see Table 8 for
pumsb_star data-set). Here, we need to communicate with a machine whose
group-id is contained in the list returned by group-id based intersection. Still
group-based auxiliary structure reduces the communication cost significantly as
communication is required only with the machines storing the partition corre-
sponding to that group and not all of them. On the machine, the validity check
is performed using cuckoo filter. If cuckoo filter returns false (T Nc ), then the
concept is valid in the vertical partition. In such cases, no comparison with the
actual context of the horizontal partition is necessary. From Table 8, we see that
cuckoo filter always returns true negative (T Nc ).
     Our experiments show that group-id based auxiliary structure decreases the
inter-machine communication cost for testing validity of discovered concepts by
almost 100% for susy data-set and up to 50% for pumsb_star data-set. Cuckoo
filter reported 0% false positives (F Pc ) thereby further minimizing the cost of
validity checks.


5     Conclusions and future work


In this paper, we proposed a novel hybrid partitioning technique that uses hor-
izontal partitioning for low support attributes and vertical partitioning for high
support attributes. This technique ensures that the size of each partition does
not exceed single machine’s memory. Thus, it becomes possible to use the dis-
tributed LCM algorithm [6] to compute concepts in parallel in each horizontal
partition. We proposed a new approach for computing concepts using group-id
based auxiliary structure for the vertical partition. The experimental evaluation
shows significant reduction in the inter-machine communication cost. Thus, us-
ing horizontal and vertical partitioning, our algorithm can scale to large contexts.
Our future work includes coming up with an optimal threshold that takes into
consideration both memory capacity and combined execution time of computing
concepts in vertical and horizontal partitions.

2
    SPMF, https://www.philippe-fournier-viger.com/spmf/. Last accessed 1 March
    2020.
82     P Muneeswaran et al.

Acknowledgements
This research was supported by SPARC, a Government of India Initiative under
grant no. SPARC/2018-2019/P682/SL. We thank Dr. Sergei Obiedkov from HSE
Moscow for his helpful comments in coming up with the paritioning strategy.


References
1. Uno T., Asai T., Uchida Y., Arimura H., “An efficient algorithm for enumerating
   closed patterns in transaction databases,” in Discovery Science, DS 2004, Lecture
   Notes in Computer Science, vol 3245. Springer, 2004.
2. Ganter B., “Two basic algorithms in concept analysis,” In: Kwuida L., Sertkaya B.
   (eds) Formal Concept Analysis. ICFCA 2010. Lecture Notes in Computer Science,
   vol 5986. Springer, 2010.
3. Krajca P., Vychodil V. , “Distributed algorithm for computing formal concepts us-
   ing map-reduce framework,” In: Adams N.M., Robardet C., Siebes A., Boulicaut
   JF. (eds) Advances in Intelligent Data Analysis VIII. IDA 2009. Lecture Notes in
   Computer Science, vol 5772. Springer, 2009.
4. Xu B., de Fréin R., Robson E., Ó Foghlú M, “Distributed formal concept analysis
   algorithms based on an iterative MapReduce framework,” In: Domenach F., Ignatov
   D.I., Poelmans J. (eds) Formal Concept Analysis. ICFCA 2012. Lecture Notes in
   Computer Science, vol 7278. Springer, 2012.
5. B. Negrevergne, A. Termier, J. Méhaut and T. Uno, "Discovering closed frequent
   itemsets on multicore: Parallelizing computations and optimizing memory accesses,"
   In: 2010 International Conference on High Performance Computing and Simulation,
   Caen, 2010, pp. 521-528, 2010.
6. Shravan Patel, Umang Agarwal, and Sriram Kailasam, “A Dynamic Load Balanc-
   ing Scheme for Distributed Formal Concept Analysis,” in 24th IEEE International
   Conference on Parallel and Distributed Systems, Singapore, pp. 489-496, 2018.
7. Bin Fan, Dave G. Andersen, Michael Kaminsky, and Michael D. Mitzenmacher,
   "Cuckoo Filter: Practically Better Than Bloom," In Proceedings of the 10th ACM
   International on Conference on emerging Networking Experiments and Technologies
   (CoNEXT ’14). ACM, New York, USA, pp.75–88, 2014.
8. Kuznetsov S.O., "Learning of Simple Conceptual Graphs from Positive and Neg-
   ative Examples," In: Żytkow J.M., Rauch J. (eds) Principles of Data Mining and
   Knowledge Discovery. PKDD 1999. Lecture Notes in Computer Science, vol 1704.
   Springer, 1999.
9. C. Lucchese, S. Orlando and R. Perego, "Fast and memory efficient mining of fre-
   quent closed itemsets," in IEEE Transactions on Knowledge and Data Engineering,
   vol. 18, no. 1, pp. 21-36, Jan. 2006.
10. Ekanayake, J., Li, H., Zhang, B., Gunarathne, T., Bae, S.-H., Qiu, J., Fox, "G.:
   Twister: a Runtime for Iterative MapReduce," In: Hariri, S., Keahey, K. (eds.)
   HPDC, pp. 810–818, ACM, 2010.
11. Xu B., de Fréin R., Robson E., Ó Foghlú M, "Distributed Formal Concept Analysis
   Algorithms Based on an Iterative MapReduce Framework." In: Domenach F., Igna-
   tov D.I., Poelmans J. (eds) Formal Concept Analysis ICFCA 2012, Lecture Notes
   in Computer Science, vol 7278. Springer, 2012.
12. K. Yoshizoe, A. Terada, K. Tsuda, "Redesigning pattern mining algorithms for
   supercomputers," Corr, 2015.