<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Cloud Implementation of Classifier Nominal Concepts using DistributedWekaSpark</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rawia Fray</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nida Meddouri</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mondher Maddouri</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Tunis El Manar Faculty of Sciences of Tunis LIPAH</institution>
          ,
          <addr-line>Tunis</addr-line>
          ,
          <country country="TN">Tunisia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this article, we are interested in the Cloudification of a classification method based on Formal Concept Analysis, named Classifier Nominal Concepts. The basic idea is to create a distributed version of this existing method, named Distributed Classifier Nominal Concepts, and to implement it on Cloud Computing. Implementing a classification method on the cloud is one of the Distributed/Big Data Mining approaches. The latter generally depends on four requirements: a Big Data framework to support the distribution of applications, a Distributed Data Mining tool, a parallel programming model, e.g. MapReduce, and a distributed system such as Cloud Computing used as an execution environment. In our work, we chose Spark as the Big Data framework, DistributedWekaSpark as the Distributed Data Mining tool, and the Amazon Web Services Cloud as the implementation environment. We implemented our approach on a cluster of five virtual machines, using large data samples for testing. This Cloudified version is compared to the sequential single-node version. The evaluation of the results demonstrates the effectiveness of our work.</p>
      </abstract>
      <kwd-group>
        <kwd>Formal Concept Analysis</kwd>
        <kwd>Cloud Computing</kwd>
        <kwd>Big Data Mining</kwd>
        <kwd>Classifier Nominal Concepts</kwd>
        <kwd>DistributedWekaSpark</kwd>
        <kwd>Amazon Web Services</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Data mining is the process of discovering interesting patterns and knowledge
from large amounts of data [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. With the continuous and rapid growth of data
size, extracting knowledge from these data using traditional data mining tools
and algorithms has become difficult.
      </p>
      <p>
        Big Data frameworks were created to make the processing of large data
possible in general. However, extracting knowledge from these large data depends
on specific tools, called Big Data Mining tools. Typically, these tools rely on
a distributed environment such as Cloud Computing to prove their
effectiveness, so they are also called distributed data mining tools. Several distributed
data mining tools have been created. This has allowed the development of various
distributed versions of data mining methods and the implementation of these
methods on distributed environments. Classifier Nominal Concepts (CNC) is one
of these data mining algorithms. CNC is a classification method based on Formal
Concept Analysis (FCA) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. A distributed version of CNC can be built using
the Big Data Mining tool DistributedWekaSpark [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], and implemented on Cloud
Computing.
      </p>
      <p>Copyright © 2019 for this paper by its authors. Copying permitted for private and
academic purposes.</p>
      <p>This paper is organized as follows. In Section 2, we give an overview of the most
popular Big Data frameworks, the parallel programming model MapReduce,
and Big Data Mining systems. In Section 3, we recall some basics of FCA,
we present the principle of the classification method called Classifier Nominal
Concepts (CNC), and we introduce a distributed version of this method based
on the unified paradigm of the DistributedWekaSpark tool. Section 4 is devoted to
the implementation of our proposed method on the cloud, and to the presentation of
experimental results allowing the evaluation of the performance of our proposed
approach. Section 5 concludes the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>Preliminaries</title>
      <p>
        Distributed data mining attempts to improve the performance of traditional
data mining systems, and it has recently garnered much attention from the data
mining community. Distributed data mining is often mentioned alongside parallel data
mining in the literature [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Several tools have been created to scale existing Data
Mining algorithms. These tools depend on the Big Data framework used.
      </p>
      <p>2.1 Big Data frameworks</p>
      <p>
        With the emergence of cloud computing and other distributed computing
systems, the amount of data generated is increasing every day. These sets of
large volumes of complex data that cannot be processed using traditional data
processing software are called Big Data. Big Data concerns large-volume,
complex and growing data sets with multiple and autonomous sources. There are
many Big Data techniques that can be used to store data, perform tasks faster,
distribute the system, increase processing speed, and analyze data. To
perform these tasks, we need Big Data frameworks. A Big Data framework is the
set of functions or structures that defines how to perform the processing,
manipulation and representation of large data; it manages structured, unstructured
and semi-structured data [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. The best-known Big Data frameworks are Apache
Spark, Apache Hadoop, Apache Storm, Apache Flink and Apache Samza. Survey studies on these frameworks can be found in [
        <xref ref-type="bibr" rid="ref11 ref12 ref16 ref19">19,
16, 11, 12</xref>
        ]. We focus on the two
most appreciated and most used, Apache Hadoop and Apache Spark.
      </p>
      <p>
        Apache Spark is a framework characterized by its speed. It aims to accelerate
batch workloads; this is achieved by performing the computation entirely in memory and by
optimizing the processing [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Spark can be integrated with Hadoop, and it is
more advantageous compared to other Big Data frameworks. Spark is
characterized by the Resilient Distributed Dataset (RDD) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], which is a collection of objects
partitioned across a cluster (a set of computing machines) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        Apache Hadoop is an open-source, scalable and fault-tolerant framework. It
is a processing framework that provides only batch processing and effectively
handles large volumes of data on a commodity hardware cluster. The two main
components of Apache Hadoop are the Hadoop Distributed File System (HDFS) and
MapReduce. HDFS provides a distributed file system that allows large files to
be stored on distributed machines reliably and efficiently [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. MapReduce is the
native batch processing engine of Hadoop.
      </p>
      <p>2.2 MapReduce Programming Model</p>
      <p>
        The MapReduce model is a programming paradigm that allows the computation
over huge amounts of data on clusters of physical or virtual computers [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Its
benefits include scalability, fault tolerance, ease of use and cost reduction. There
are two basic steps in MapReduce.
      </p>
      <p>The Map function is the first step of the model: it takes the input
and creates key-value pairs (k, v). Then, it transforms them into a list
of intermediate key-value pairs List(Ki, Vi). Intermediate values
that belong to the same intermediate key are grouped and then transmitted to
the Reduce function.</p>
      <p>Map(k, v) → List(Ki, Vi)</p>
      <p>The Reduce function follows the Map function: it returns a final model by
merging the values that possess the same key.</p>
      <p>Reduce(Ki, List(Vi)) → List(Vo)</p>
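      <p>The two Map/Reduce steps can be sketched in plain Python on the classic word-count example. This is an illustration of the programming model only; the function names are ours, not those of a specific framework.</p>

```python
from itertools import groupby
from operator import itemgetter

def map_fn(k, v):
    # Map(k, v) -> List(Ki, Vi): emit one (word, 1) pair per word
    return [(word, 1) for word in v.split()]

def reduce_fn(key, values):
    # Reduce(Ki, List(Vi)) -> Vo: merge the values sharing the same key
    return (key, sum(values))

def map_reduce(records):
    intermediate = []
    for k, v in records:
        intermediate.extend(map_fn(k, v))
    # "Shuffle": group intermediate pairs by their key
    intermediate.sort(key=itemgetter(0))
    return [reduce_fn(key, [v for _, v in group])
            for key, group in groupby(intermediate, key=itemgetter(0))]

result = map_reduce([(1, "big data big clusters"), (2, "big clusters")])
print(dict(result))  # {'big': 3, 'clusters': 2, 'data': 1}
```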
      <p>2.3 Big Data Mining system</p>
      <p>
        Data Mining and Machine Learning make it possible to use the different aspects of Big
Data technologies (such as the Big Data frameworks mentioned previously) to scale up
existing algorithms and solve some of the related problems [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. A scalable
solution for Big Data Mining depends on several interrelated components that form a
Big Data Mining system. The first component is the user interface that allows
the user to interact with the Big Data Mining system. The second component
is the application that contains our code with all its dependencies. The third
component is the Big Data framework that corresponds to our application. The
fourth component is the distributed storage layer where data is stored; the latter
encapsulates the local storage of data in a large-scale logical environment.
Finally, the infrastructure layer contains a set of virtual or physical machines;
these machines form a cluster [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>Distributed Classifier Nominal Concepts</title>
      <p>3.1 Basics of Formal Concept Analysis</p>
      <sec id="sec-3-1">
        <title>Definition 1. Formal context</title>
        <p>
          A formal context is a triple (G, M, I). The elements of G are called objects, the
elements of M are called properties (binary attributes), and I is a binary relation
defined between G and M, such that I ⊆ G × M. For g ∈ G and m ∈ M, the
notation (g, m) ∈ I means that the object g verifies the property m [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
        <p>
          Suppose that X ⊆ G and Y ⊆ M are two finite sets. The operators φ(X) and
ψ(Y) are defined as follows [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]:
φ(X) = { m ∈ M | ∀ g ∈ X, (g, m) ∈ I }.
        </p>
        <p>ψ(Y) = { g ∈ G | ∀ m ∈ Y, (g, m) ∈ I }.</p>
        <p>
          The operator φ maps to the attributes shared by all the elements of X. The
operator ψ maps to the objects which share all the attributes of the set Y. The
two operators φ and ψ define the Galois connection between the two sets X and
Y [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
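        <p>The two derivation operators can be sketched in a few lines of Python; the small formal context below is invented for illustration only.</p>

```python
# Toy formal context (G, M, I); the incidence relation I is a set of pairs.
G = {"g1", "g2", "g3"}
M = {"a", "b", "c"}
I = {("g1", "a"), ("g1", "b"), ("g2", "a"), ("g2", "b"), ("g3", "c")}

def phi(X):
    """phi(X): attributes shared by all the objects of X."""
    return {m for m in M if all((g, m) in I for g in X)}

def psi(Y):
    """psi(Y): objects that verify all the attributes of Y."""
    return {g for g in G if all((g, m) in I for m in Y)}

print(sorted(phi({"g1", "g2"})))  # ['a', 'b']
print(sorted(psi({"a", "b"})))    # ['g1', 'g2']
```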
        <sec id="sec-3-1-1">
          <title>Definition 2. Closure</title>
          <p>For both sets X and Y mentioned previously, the closure operators are defined by:
X″ = ψ(φ(X))
Y″ = φ(ψ(Y))</p>
          <p>
            A set is closed if it is equal to its closure. Thus, X is closed if X = X″ and Y
is closed if Y = Y″ [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ].
          </p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>Definition 3. Formal Concept</title>
        <p>A formal concept of the context (G, M, I) is a pair of the form (X, Y) for
which X ⊆ G is the extent (domain) and Y ⊆ M is the intent (co-domain), with
φ(X) = Y and ψ(Y) = X.</p>
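        <p>As a small self-contained Python check on an invented toy context, a pair (X, Y) is a formal concept exactly when φ(X) = Y and ψ(Y) = X:</p>

```python
# Invented toy context: objects g1..g3, attributes a..c.
G = {"g1", "g2", "g3"}
M = {"a", "b", "c"}
I = {("g1", "a"), ("g1", "b"), ("g2", "a"), ("g2", "b"), ("g3", "c")}

def phi(X):  # attributes common to all objects of X
    return {m for m in M if all((g, m) in I for g in X)}

def psi(Y):  # objects having all attributes of Y
    return {g for g in G if all((g, m) in I for m in Y)}

def is_concept(X, Y):
    return phi(X) == Y and psi(Y) == X

print(is_concept({"g1", "g2"}, {"a", "b"}))  # True
print(is_concept({"g1"}, {"a", "b"}))        # False: psi({a, b}) also contains g2
```

Note that the second pair fails precisely because {g1} is not closed: ψ(φ({g1})) = {g1, g2}.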
      </sec>
      <sec id="sec-3-3">
        <title>Definition 4. Many-Valued Context</title>
        <p>A many-valued context allows a different representation of the data than a formal
context (mono-valued context). It is a quadruple (G, M, W, I), where G is a set
of objects, M is a set of attributes, W is a set of attribute values, and I is a
ternary relation satisfying the condition that the same object-attribute pair can
be related to at most one value. An object may have at most one value for each
attribute. So, every attribute m may be treated as a function that maps an object
to an attribute value.</p>
        <p>Proposition 1: From a many-valued context, the ψ operator is given by:
ψ(AN = vj) = { g ∈ G | AN(g) = vj } [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].</p>
        <p>Proposition 2: From a many-valued context, the φ operator is given by:
φ(B) = { vj | ∀ g ∈ B, ∃ ANl ∈ AN, ANl(g) = vj } [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].</p>
        <table-wrap id="tab1">
          <label>Table 1.</label>
          <caption>
            <p>The Weather.nominal training set.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Object</th><th>Outlook</th><th>Temperature</th><th>Humidity</th><th>Windy</th><th>Play</th></tr>
            </thead>
            <tbody>
              <tr><td>g1</td><td>sunny</td><td>hot</td><td>high</td><td>false</td><td>No</td></tr>
              <tr><td>g2</td><td>sunny</td><td>hot</td><td>high</td><td>true</td><td>No</td></tr>
              <tr><td>g3</td><td>overcast</td><td>hot</td><td>high</td><td>false</td><td>Yes</td></tr>
              <tr><td>g4</td><td>rainy</td><td>mild</td><td>high</td><td>false</td><td>Yes</td></tr>
              <tr><td>g5</td><td>rainy</td><td>cool</td><td>normal</td><td>false</td><td>Yes</td></tr>
              <tr><td>g6</td><td>rainy</td><td>cool</td><td>normal</td><td>true</td><td>No</td></tr>
              <tr><td>g7</td><td>overcast</td><td>cool</td><td>normal</td><td>true</td><td>Yes</td></tr>
              <tr><td>g8</td><td>sunny</td><td>mild</td><td>high</td><td>false</td><td>No</td></tr>
              <tr><td>g9</td><td>sunny</td><td>cool</td><td>normal</td><td>false</td><td>Yes</td></tr>
              <tr><td>g10</td><td>rainy</td><td>mild</td><td>normal</td><td>false</td><td>Yes</td></tr>
              <tr><td>g11</td><td>sunny</td><td>mild</td><td>normal</td><td>true</td><td>Yes</td></tr>
              <tr><td>g12</td><td>overcast</td><td>mild</td><td>high</td><td>true</td><td>Yes</td></tr>
              <tr><td>g13</td><td>overcast</td><td>hot</td><td>normal</td><td>false</td><td>Yes</td></tr>
              <tr><td>g14</td><td>rainy</td><td>mild</td><td>high</td><td>true</td><td>No</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>Illustrative example: Consider the training set Weather.nominal (Table 1),
described by a set of nominal attributes AN. This data set is selected from the UCI
Machine Learning Repository 1.</p>
        <p>AN = { ANl | l ∈ {1, ..., L}, ∃ g ∈ G, ∃ m ∈ M, ANl(g) = m }.</p>
        <p>Assuming that the chosen attribute AN from this many-valued context is
'Outlook', according to Proposition 1 we extract the associated objects for
each value vj of this attribute. We get 3 sets of objects: {g1,g2,g8,g9,g11}, {g3,g7,g12,g13} and
{g4,g5,g6,g10,g14}. According to Proposition 2, we look for the other
attributes describing all the extracted objects. In this example, φ(AN = vj) = ({Outlook
= sunny}, {Outlook = overcast}, {Outlook = rainy}). As a result, we obtain
3 formal concepts: ({g1,g2,g8,g9,g11}, {Outlook = sunny}), ({g3,g7,g12,g13},
{Outlook = overcast}) and ({g4,g5,g6,g10,g14}, {Outlook = rainy}).</p>
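        <p>The extraction in this example can be reproduced with a short Python sketch; the 'Outlook' column below is transcribed from Table 1.</p>

```python
# 'Outlook' values transcribed from the Weather.nominal training set (Table 1).
outlook = {"g1": "sunny", "g2": "sunny", "g3": "overcast", "g4": "rainy",
           "g5": "rainy", "g6": "rainy", "g7": "overcast", "g8": "sunny",
           "g9": "sunny", "g10": "rainy", "g11": "sunny", "g12": "overcast",
           "g13": "overcast", "g14": "rainy"}

def psi(value):
    # Proposition 1: the objects g such that Outlook(g) = value
    return {g for g, v in outlook.items() if v == value}

# One extent per value of the attribute, as in the example above
concepts = {v: psi(v) for v in ("sunny", "overcast", "rainy")}
print(sorted(concepts["rainy"], key=lambda g: int(g[1:])))
# ['g4', 'g5', 'g6', 'g10', 'g14']
```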
        <p>3.2 Classifier Nominal Concepts</p>
        <p>
Classifier Nominal Concepts (CNC) is a classifier based on Formal Concept
Analysis that can handle nominal data. Calculating the formal concepts from the
many-valued context by Conceptual Scaling is expensive (RAM consumption, CPU
time). So, CNC calculates them directly using Nominal Scaling. A nominal context is
a many-valued context whose attribute values are of the nominal type [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. From
the nominal training instances G described by L nominal attributes AN, CNC
selects the attribute AN that maximizes the Information Gain [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. The latter is
calculated from the Entropy function E().
        </p>
        <p>1 http://archive.ics.uci.edu/ml/</p>
        <p>Gain_Info(AN, G) = E(G) − Σ_{j=1}^{V_AN} ( S(Val_j | AN) / n ) · E(Val_j | AN)</p>
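        <p>A minimal sketch of the entropy and Information Gain computation, on the Play labels of Table 1; for brevity only two of the four attribute columns are transcribed here.</p>

```python
from math import log2
from collections import Counter

# Play labels and two attribute columns transcribed from Table 1.
play = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
        "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]
attrs = {
    "Outlook": ["sunny", "sunny", "overcast", "rainy", "rainy", "rainy",
                "overcast", "sunny", "sunny", "rainy", "sunny", "overcast",
                "overcast", "rainy"],
    "Windy": ["false", "true", "false", "false", "false", "true", "true",
              "false", "false", "false", "true", "true", "false", "true"],
}

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels):
    # E(G) minus the size-weighted entropy of each value's subset
    n = len(labels)
    weighted = sum(
        len(sub) / n * entropy(sub)
        for v in set(values)
        for sub in [[y for x, y in zip(values, labels) if x == v]]
    )
    return entropy(labels) - weighted

best = max(attrs, key=lambda a: info_gain(attrs[a], play))
print(best)  # Outlook
```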
        <p>
          Once the relevant nominal attribute AN is chosen, Proposition 1 is used to
extract the associated objects for each value vj of this attribute. The next
step is the search for the most relevant value v* and the objects associated with
this value. Then, the attributes verified by this set of objects are selected
according to Proposition 2 and using the closure operator (φ(AN = v*)) [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. So,
the pertinent formal concept is constructed from the selected objects and attributes
(ψ(AN = v*), φ(AN = v*)). The classification rule is obtained by looking
for the majority class corresponding to the extent of this concept (ψ(AN = v*)).
The condition part is formed by the conjunction of the attributes of the intent
of the concept (φ(AN = v*)). The conclusion part is formed by the majority
class [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. In [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], the authors proposed the method named CNC Dagging (DNC).
DNC is a parallel ensemble method that improves the performance of CNC [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. The
cloudification of the DNC method is one of our future perspectives.
        </p>
        <p>Illustrative example: Consider the same training set (Table 1). First, we
calculate the Information Gain of each attribute; the attribute "Outlook" is
chosen, with an Information Gain value of 0.37. It is characterized by 3 different
values: "sunny", "overcast" and "rainy". The most relevant value is "rainy" (or
"sunny"). According to Proposition 1, the objects associated with this value are
{g4,g5,g6,g10,g14}. We use the closure operator with Proposition 2 to select the
attributes verified by these objects; we get {Outlook = rainy}. So, the relevant
concept obtained is ({g4,g5,g6,g10,g14}, {Outlook = rainy}). The associated
majority class is "Play = Yes", and the following classification rule is generated:
"If Outlook = rainy, then Play = Yes".</p>
        <p>Data: n nominal instances G = {(g1, y1), ..., (gn, yn)} with labels yi ∈ Y.</p>
        <p>Result: The classification rule hCNC
begin
  From G, determine AN: the attribute that maximizes the Information Gain;
  From AN, determine the most relevant value v*;
  Calculate the associated closure of this relevant value;
  Generate the relevant concept;
  Define the majority class y*;
  Induce and return hCNC: the classification rule;
end</p>
        <p>
          Algorithm 1: Algorithm of Classifier Nominal Concepts [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]
        </p>
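        <p>Algorithm 1 can be sketched as below. One simplification is assumed on our part: since the relevance criterion for v* is not spelled out here, we take the "most relevant value" to be the value covering the most objects, and we pass the gain measure in as a parameter.</p>

```python
from collections import Counter

def cnc(objects, attrs, labels, gain):
    """Sketch of Algorithm 1. attrs: {name: {obj: value}},
    labels: {obj: class}, gain: function scoring an attribute name."""
    best_attr = max(attrs, key=gain)                 # maximizes the Information Gain
    counts = Counter(attrs[best_attr][g] for g in objects)
    v_star = counts.most_common(1)[0][0]             # "most relevant" value (assumed: max coverage)
    extent = {g for g in objects if attrs[best_attr][g] == v_star}
    majority = Counter(labels[g] for g in extent).most_common(1)[0][0]
    return "If %s = %s then class = %s" % (best_attr, v_star, majority)

# Tiny invented data set with a single attribute (so the gain stub is constant).
objs = ["g1", "g2", "g3", "g4"]
attrs = {"Color": {"g1": "red", "g2": "red", "g3": "blue", "g4": "red"}}
labels = {"g1": "Yes", "g2": "Yes", "g3": "No", "g4": "No"}
rule = cnc(objs, attrs, labels, gain=lambda a: 1.0)
print(rule)  # If Color = red then class = Yes
```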
        <p>Distributed CNC: a distributed version of the CNC algorithm</p>
        <p>
The implementation of a classification method using the DistributedWekaSpark tool
is based on 4 requirements: RDD generation from raw data, creation of headers
using RDDs, model creation, and model evaluation [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. The transformation of
HDFS data into RDDs is not sufficient, because the RDD objects created by
Spark are raw data (string objects) and this type of object is not supported by
Weka. So the RDD format created previously must be transformed into a
second format, the Instances format. The second step is to create a header
that contains attribute types and names, and other statistics, including
minimum and maximum values, to form the ARFF format supported by Weka [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ].
The creation of the model and its evaluation are done through the unified
framework provided by DistributedWekaSpark; this framework allows the customized
distributed implementation of each classification algorithm [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ].
        </p>
        <p>Training phase of Distributed CNC: The master node divides the data into
parts and distributes the task (code) and data partitions to the slave nodes. A
set of partitions is assigned to each slave node; each one applies the CNC method
to each data partition using the Map function of the MapReduce parallel
programming model, and returns the result to the master node. So, we get a list
of classifiers. Each time, we apply an aggregation test to the first two classifiers
in the list. Two classifiers are aggregable if they are homogeneous, i.e. they
have the same class. Thus, if they are aggregable, they are replaced directly by a
single classifier. Otherwise, an average vote is used to merge these two classifiers.
In both cases, the first two classifiers are each time replaced by a single classifier,
until a single classifier is obtained at the end.</p>
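        <p>The training phase just described can be simulated with a toy Map/Reduce loop. The local learner and the merge rule below are simplified stand-ins of ours: a "classifier" is reduced to the class it predicts, so the aggregation test ("same class ⇒ aggregable") becomes a simple equality, and the average vote is left as a placeholder.</p>

```python
def train_local(partition):
    # Stand-in for CNC on one partition: the majority class of the partition
    labels = [y for _, y in partition]
    return max(sorted(set(labels)), key=labels.count)

def merge(c1, c2):
    if c1 == c2:      # aggregable: replaced directly by a single classifier
        return c1
    return c1         # placeholder for the average-vote merge

data = [("g%d" % i, "Yes" if i % 3 else "No") for i in range(1, 13)]
partitions = [data[i:i + 3] for i in range(0, len(data), 3)]  # 4 partitions of 3
classifiers = [train_local(p) for p in partitions]            # Map step
while len(classifiers) > 1:                                   # Reduce: fold pairwise
    classifiers = [merge(classifiers[0], classifiers[1])] + classifiers[2:]
print(classifiers[0])  # Yes
```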
        <p>Data: Dataset in HDFS storage.</p>
        <p>Result: The classification rule hDCNC
begin
  Divide the input data into partitions;
  Distribute the training task and data partitions to the slave nodes;
  Map step: create a CNC model for each partition;
  Reduce step: merge the models;
  Return the result model hDCNC;
end</p>
        <p>Algorithm 2: Distributed CNC: Training Step</p>
        <p>Evaluation phase of Distributed CNC: The evaluation phase of the model
requires a new MapReduce step. The master node distributes the trained CNC model
to the slave nodes. During a new Map phase, each slave node initiates
a model evaluation procedure using its set of partitions, and returns the local
evaluation results. The Reduce function produces the final result by aggregating
the intermediate results.</p>
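        <p>A toy simulation of this evaluation phase: each slave evaluates the trained model on its partitions (Map), then the local error counts are merged (Reduce). The stub model and the data below are invented for illustration.</p>

```python
# Invented stub: the trained classifier always predicts "Yes".
model = lambda g: "Yes"
partitions = [
    [("g1", "Yes"), ("g2", "No")],
    [("g3", "Yes"), ("g4", "Yes")],
]

def evaluate_local(partition):
    # Map step on one slave: local (error count, partition size)
    errors = sum(1 for g, y in partition if model(g) != y)
    return errors, len(partition)

local_results = [evaluate_local(p) for p in partitions]
errors = sum(e for e, _ in local_results)   # Reduce step: aggregate the counts
total = sum(n for _, n in local_results)
print(errors / total)  # 0.25
```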
        <p>Data: hDCNC in HDFS storage.</p>
        <p>Result: Evaluation results
begin
  Distribute the evaluation tasks to the slave nodes;
  Distribute the trained model hDCNC to the slave nodes;
  Map step: each slave node uses its partitions to evaluate the model;
  Reduce step: merge the evaluation results;
  Return the merged evaluation results;
end</p>
        <p>Algorithm 3: Distributed CNC: Evaluation Step</p>
        <p>Illustrative example: We propose to consider, as an example, a dataset composed of 12
objects, and a cluster composed of 3 nodes: 1 master node and
2 slave nodes. Each node has only 2 cores, so the number of partitions in this
case will be 4 (2 * 2); the dataset will be partitioned into 4 partitions, each
partition composed of 3 objects. Each slave node applies the CNC classification
method on its partitions in parallel. So, we obtain 4 classification rules. Only one
classification rule is returned using the Reduce function. In the evaluation step,
a copy of the classification rule is distributed to all slave nodes, where each
one evaluates the model on each of its partitions. Finally, the evaluation results
are aggregated.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Implementation and experimental study</title>
      <sec id="sec-4-1">
        <title>Implementation</title>
        <p>The second step of Cloudification is to create a cluster of virtual machines in
the Cloud, implement Distributed CNC on this cluster, and conduct an
experimental study to compare our results with those of the sequential version. Our
architecture contains five Amazon EC2 instances 2: one master node and four
slave nodes. We chose the Big Data framework Apache Spark for its speed, and
the Big Data Mining tool DistributedWekaSpark. The latter uses the tool Weka
3 as user interface.</p>
        <p>
          Each instance runs Linux Ubuntu 16.04, and is equipped with 4
CPUs, 16 GB of main memory and Amazon Elastic Block Store (Amazon EBS)
storage 4, which provides persistent block storage volumes for use with Amazon
EC2 instances in the AWS cloud. All nodes are configured with Hadoop 2.7.6 5
and Spark 2.3.1 6.
2 The number of instances/virtual machines for our AWS educational account is
limited to 5.
3 https://www.cs.waikato.ac.nz/ml/weka
4 https://aws.amazon.com/fr/ebs
5 https://hadoop.apache.org
6 https://spark.apache.org
Launching the application with spark-submit: After creating and configuring
our cluster, and before starting execution using the spark-submit script, we need
to create a jar file which gathers the other projects on which our code depends; for
that we used Maven 7. Then we send our jar file from our local machine to the
master node of our cluster, and we upload our data files into HDFS. The last
step is to launch our application using the spark-submit script (see figure 1
and figure 2). This script depends on a set of parameters. The first parameter is
the class, which is the entry point of the application. The second parameter is
the master URL of the cluster. The third parameter is the deploy-mode (cluster
mode or local mode). The fourth parameter is the path to the jar file that includes
all dependencies. The fifth parameter is the URL of the dataset in HDFS; the
rest of the parameters are the application arguments. They are the number of
attributes, the class index, the task (build classifier/evaluation), and the URL of the
classification method.
Five data sets with large scale and high dimensionality are used in the
experiments, as shown in Table 2. The first two data sets come from the UCI machine learning
repository 8, and the last three data sets from the Open Machine Learning repository
9. The latter are generated by the Bayesian Network Generator (BNG) [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. Data sets generated
by the Bayesian Network Generator are a collection of artificially generated datasets. These
datasets have been generated to meet the need for a large heterogeneous set of
large datasets [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
7 http://maven.apache.org/
8 http://archive.ics.uci.edu/ml/
9 http://www.openml.org
        </p>
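        <p>The parameter order described above can be illustrated with a hypothetical spark-submit invocation; the class name, jar name, host names, HDFS path and argument values below are placeholders of ours, not the authors' actual values.</p>

```shell
# Hypothetical invocation; every concrete name below is a placeholder.
spark-submit \
  --class org.example.DistributedCNCJob \
  --master spark://master-node:7077 \
  --deploy-mode cluster \
  distributed-cnc-with-dependencies.jar \
  hdfs://master-node:9000/data/letter.arff \
  16 last build_classifier CNC
```

The trailing arguments stand for the application arguments listed in the text: the number of attributes, the class index, the task, and the classification method.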
        <p>To evaluate our approach, we compare our results with those of the
sequential method Classifier Nominal Concepts (CNC). We can classify these
results along two performance axes: the error rate (see Table 3) and
the execution time (see Table 4).</p>
        <table-wrap id="tab3">
          <label>Table 3.</label>
          <caption>
            <p>Error rates of CNC and Distributed CNC.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Data</th><th>CNC</th><th>Distributed CNC</th></tr>
            </thead>
            <tbody>
              <tr><td>Letter</td><td>20.96%</td><td>20.96%</td></tr>
              <tr><td>Covertype</td><td>4.6%</td><td>4.6%</td></tr>
              <tr><td>BNG(kr-vs-kp)</td><td>33.8%</td><td>33.8%</td></tr>
              <tr><td>BNG(ionosphere)</td><td>14.96%</td><td>14.96%</td></tr>
              <tr><td>BNG(spambase)</td><td>39.35%</td><td>39.35%</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>For evaluation, we used the 10-fold cross-validation scheme (for each
partition). The experiments were conducted multiple times automatically, so the
error rate in Table 3 is the mean of the error rates of these experiments. The
results show that after the implementation of the CNC method on the cloud, the CNC
error rate is unchanged. So, we can conclude that the Cloudification of the
CNC method does not affect its accuracy performance.</p>
        <p>Table 4 presents the result of our work. After comparing the execution times
of the two methods, we can notice that the execution time of Distributed CNC
is lower than the execution time of the CNC method; this becomes noticeable beyond a
certain data size. We can conclude that Distributed CNC is more efficient than CNC
in terms of speed. So, we can now conduct experiments on very large data files, which
is impossible on only one local machine. To show this superiority even more, we plan
to conduct experiments on even larger datasets.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>Distributed Data Mining tools are created to scale the existing data mining
algorithms. These tools depend on the Big Data framework used, which is designed
to solve the problems of processing large datasets. These frameworks usually
rely on a parallel programming paradigm; often the MapReduce model is used. In
this article, we have proposed a distributed version of a classification method
based on Formal Concept Analysis. We implemented this version on the Amazon
Web Services Cloud by creating a cluster composed of five virtual machines.
Preparatory experiments have shown that Distributed CNC is faster than the
single-node sequential version (CNC).</p>
      <table-wrap id="tab4">
        <label>Table 4.</label>
        <caption>
          <p>Execution time of CNC and Distributed CNC.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Data</th><th>CNC</th><th>Distributed CNC</th></tr>
          </thead>
          <tbody>
            <tr><td>Letter</td><td>1.21</td><td>1.37</td></tr>
            <tr><td>Covertype</td><td>3.96</td><td>6.4</td></tr>
            <tr><td>BNG(kr-vs-kp)</td><td>38.53</td><td>29.49</td></tr>
            <tr><td>BNG(ionosphere)</td><td>40.4</td><td>28.85</td></tr>
            <tr><td>BNG(spambase)</td><td>64.39</td><td>35.09</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>So, after the Cloudification of the classification method named CNC, we
were able to overcome the runtime problems and the limitations of our material
resources. Now, we can use this method with large datasets without worrying
about time and without thinking about acquiring other, more powerful machines.</p>
      <p>Our work allowed us to discover several future perspectives that can be
considered in the context of Big Data Mining. In future work, we will conduct
experiments on parallel algorithms to improve the efficiency of the use of
computing resources. As a second perspective, still in the context of Big Data
Mining, we will propose a new method of data distribution on the slave nodes; this
method will be inspired by the principle of stratified sampling. Also, we will
create Big Data solutions for improving the benefits of algorithms based on Formal
Concept Analysis.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>We would like to thank the managers of RosettaHUB(www.rosettahub.com), the
platform through which we had access to Amazon Web Services Educate, which
allowed us to have all the services of the AWS Cloud.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Quinlan</surname>
            ,
            <given-names>J. R.</given-names>
          </string-name>
          :
          <article-title>Induction of decision trees</article-title>
          .
          <source>Journal of Machine Learning</source>
          ,
          <volume>1</volume>
          (
          <issue>1</issue>
          ),
          <fpage>81</fpage>
          –
          <lpage>106</lpage>
          (
          <year>1986</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Kuznetsov</surname>
            ,
            <given-names>S. O.</given-names>
          </string-name>
          :
          <article-title>Mathematical aspects of concept analysis</article-title>
          .
          <source>Journal of Mathematical Sciences</source>
          ,
          <volume>80</volume>
          (
          <issue>2</issue>
          ),
          <fpage>1654</fpage>
          –
          <lpage>1698</lpage>
          (
          <year>1996</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Fu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Distributed Data Mining: An Overview</article-title>
          .
          <source>IEEE TCDP newsletter</source>
          , Springer (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Ganter</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stumme</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wille</surname>
          </string-name>
          , R.:
          <article-title>Formal concept analysis: foundations and applications</article-title>
          . Springer (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghemawat</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <source>MapReduce: Simplified Data Processing on Large Clusters. Commun. ACM</source>
          , pp.
          <fpage>107</fpage>
          –
          <lpage>113</lpage>
          , (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Zaharia</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chowdhury</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Franklin</surname>
            ,
            <given-names>M. J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shenker</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stoica</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Spark: Cluster computing with working sets</article-title>
          .
          <source>In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, HotCloud'10</source>
          , pp.
          <fpage>10</fpage>
          -
          <lpage>10</lpage>
          , Berkeley, CA, USA, (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Zaharia</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chowdhury</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Franklin</surname>
            ,
            <given-names>M. J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Das</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dave</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McCauley</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shenker</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stoica</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing</article-title>
          , Berkeley, CA, USA, (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kamber</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pei</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <source>Data Mining: Concepts and Techniques, 3rd edn</source>
          . Elsevier, (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>White</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Hadoop: The definitive guide</article-title>
          . O'Reilly Media, Inc., (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Meddouri</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khoufi</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maddouri</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Parallel learning and classification for rules based on formal concepts</article-title>
          .
          <source>In 18th International Conference on Knowledge-Based and Intelligent Information Engineering Systems - KES2014</source>
          , pp.
          <fpage>358</fpage>
          -
          <lpage>367</lpage>
          , Poland (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Iftikhar</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xie</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>Survey of real-time processing systems for big data</article-title>
          .
          <source>In Proceedings of the 18th International Database Engineering and Applications Symposium</source>
          , pp.
          <fpage>356</fpage>
          -
          <lpage>361</lpage>
          . ACM, (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Singh</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reddy</surname>
            ,
            <given-names>C. K.</given-names>
          </string-name>
          :
          <article-title>A survey on platforms for big data analytics</article-title>
          .
          <source>Journal of Big Data</source>
          ,
          <volume>2</volume>
          (
          <issue>1</issue>
          ):8 (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>van Rijn</surname>
            ,
            <given-names>J.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Holmes</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pfahringer</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vanschoren</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Algorithm Selection on Data Streams</article-title>
          .
          <source>In Discovery Science. Lecture Notes in Computer Science</source>
          , vol
          <volume>8777</volume>
          . Springer, Cham (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Shukla</surname>
            ,
            <given-names>R. K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pandey</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Big data frameworks: At a glance</article-title>
          .
          <source>International Journal of Innovations Advancement in Computer Science IJIACS</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Al-Jarrah</surname>
            ,
            <given-names>O.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yoo</surname>
            ,
            <given-names>P.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Muhaidat</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karagiannidis</surname>
            ,
            <given-names>G.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Taha</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Efficient machine learning for big data: a review</article-title>
          .
          <source>Big Data Res</source>
          ,
          <fpage>87</fpage>
          -
          <lpage>93</lpage>
          , (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Landset</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khoshgoftaar</surname>
            ,
            <given-names>T. M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Richter</surname>
            ,
            <given-names>A. N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hasanin</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>A survey of open source tools for machine learning with big data in the hadoop ecosystem</article-title>
          .
          <source>Journal of Big Data</source>
          ,
          <volume>2</volume>
          (
          <issue>1</issue>
          ):1 (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Koliopoulos</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yiapanis</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tekiner</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nenadic</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Keane</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>A parallel distributed weka framework for big data mining using spark</article-title>
          .
          <source>In IEEE International Congress on Big Data</source>
          <year>2015</year>
          , CA, USA (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Trabelsi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Meddouri</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maddouri</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>A new feature selection method for nominal classifier based on Formal Concept Analysis</article-title>
          .
          <source>In: Proceedings of the 21st International Conference on Knowledge-Based and Intelligent Information and Engineering Systems (KES</source>
          <year>2017</year>
          ),
          <source>Procedia Computer Science</source>
          , Vol.
          <volume>112</volume>
          , pp.
          <fpage>186</fpage>
          -
          <lpage>194</lpage>
          . Elsevier, Marseille, France (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Inoubli</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aridhi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mezni</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maddouri</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mephu Nguifo</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>An experimental survey on big data frameworks</article-title>
          ,
          <source>Future Generation Computer Systems</source>
          , (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>