Proceedings of the International Conference on Big Data Cloud and Applications
Tetuan, Morocco, May 25 - 26, 2015


           Using Big Data Classification and Mining for the
                    Decision-making 2.0 Process

Rhizlane Seltani (1,2)
(1,2) Information Technology and Modeling Systems Research Unit, LIROSA Laboratory
Faculty of Science, Abdelmalek Essaadi University
Tetuan, Morocco
sel.rhizlane@gmail.com

Noura Aknin (1,2)
(1,2) Information Technology and Modeling Systems Research Unit, LIROSA Laboratory
Faculty of Science, Abdelmalek Essaadi University
Tetuan, Morocco
aknin@ieee.org

Souad Amjad (1,2)
(1,2) Information Technology and Modeling Systems Research Unit, LIROSA Laboratory
Faculty of Science, Abdelmalek Essaadi University
Tetuan, Morocco
amjad_souad@uae.ma

Kamal Eddine El Kadiri (2)
(2) Computer Science, Operational Research and Applied Statistics Laboratory
Faculty of Science, Abdelmalek Essaadi University
Tetuan, Morocco
elkadiri@uae.ma


    Abstract—Web 2.0 is a revolution that has affected all areas, especially those of new
technology. Several new concepts have emerged, and a large number of innovative applications
continue to appear every day. Social networking, however, remains the flagship of web 2.0: it
gives the user a space for both communication and information sharing, which generates a very
large amount of varied data created at great speed. We can therefore call these data big data,
and consider them a very rich and interesting basis for decision-making.
    Big data are data characterized by veracity, large volumes, and increasing variety and
velocity, which makes processing them with traditional database management tools a very
difficult task. To overcome this problem, we opt for a big data classification process.
    In this paper, we study some big data classification methods that are the most significant
for classifying big data dedicated to decision-making, and we identify their strengths and
weaknesses. We then propose a framework summarizing the process of formulating a decision from
web 2.0 content, based on big data classification, and we specify the criteria to take into
account when choosing big data classification methods intended for decision-making.

    Keywords—Web 2.0; Big Data; Decision-making; Data Classification

                        I. INTRODUCTION

    The large variety of applications that appeared after the emergence of web 2.0 produces a
huge mass of diverse data. This wealth of information is a very important resource that we want
to exploit to enrich our decision-making systems, in order to generate more meaningful and
relevant decisions.

    To classify and process data, various algorithms and techniques can be used. These methods
differ depending on the data types. In the case of big data, information can be retrieved with
various analysis techniques that have different orientations and results, such as
representation-learning methods based on geometric information, stream classification
algorithms, associative classifiers, etc.

    In this paper, we discuss some methods that we can use to classify big data in order to
elaborate decisions, and we report their strengths and weaknesses. We then present our global
framework of decision-making 2.0 based on big data classification, describing the key pillars to
be considered in order to conduct the classification process well for the purpose of
decision-making.

                        II. WEB 2.0

A. Definition

    Web 2.0 is a combination of technologies, business plans and social skills, which allows
users to create web content and to be more involved in managing this content. It has brought
many creative concepts and techniques that did not exist before and that have made electronic
life simpler and more enjoyable [1][2]. With web 2.0, a new era of web use was born. Several
applications have been developed that have enriched our lives by allowing more interactivity and
collaboration, such as blogs and social networks [3].


B. Architecture and Principles
    Web 2.0 is based on a varied and robust architecture, founded on the introduction of new
principles such as collaboration and interactivity, and on the use of new techniques such as
those of web interface design and content syndication, XHTML, URLs, etc. [4].

    Several principles emerged with the appearance of web 2.0. The most notable are:
 • Collaboration: an important aspect that describes the opportunity a user has to contribute
   to the creation of web content by creating his or her own content.
 • Interactivity: one of the principles introduced by web 2.0, interactivity is reflected in
   the interaction of the user with the web content and with other users.
    These two principles constitute new trends that have changed our lives and our way of
working; they are the basis of social networks, blogs, wikis, etc.

                        III. BIG DATA

A. Definition
    The term big data refers to data sets exchanged by connected objects on the web, whose
volumes are large and whose variety and velocity are increasing [5]. It is a compilation of data
sets characterized by complexity and large volume, so their management and processing
constitute a difficult task if we use traditional database management tools [6].

B. Characteristics
    Compared to other types of data, big data are different and have some specific features.
These differences concern several facets, such as the data format, the volume, the time required
for creation, and the nature of the data.

    The principal features are: data volume, data velocity, data variety, and data veracity. We
can consider these elements as the characterizing pillars of big data (Fig. 1), which make their
processing and analysis a special challenge.
 • Data Volume: refers to the very large quantity of generated information. Data are considered
   big data if their size is so large that we cannot easily handle them for analysis.
 • Data Variety: refers to the many different data formats, such as text, audio, image, etc.
   This variety makes analyzing this type of data a very difficult task.
 • Data Velocity: refers to the speed of creation and generation of data, which has increased
   with the various new web applications.
 • Data Veracity: refers to the anomalies in data. Veracity constitutes the biggest challenge to
   overcome in data analysis, because the veracity of data sources can largely affect the
   precision of analyses.

            IV. BIG DATA CLASSIFICATION FOR DECISION-MAKING

A. Clustering
    Clustering (also called cluster analysis) is a data mining task that consists of assembling
a set of objects in such a way that objects belonging to the same group are more similar to each
other than to those belonging to other groups. A group is called a cluster. Clustering was used
for the first time in classification tasks by Cattell in 1943 for personality psychology
classification [7]. Many clustering algorithms exist. The choice of algorithm depends on the
cluster model used [8]. Among the most distinctive cluster models, we find: centroid models,
distribution models, group models, and connectivity models.

    In addition to its important role in the classification task, clustering has several
advantages, such as revealing information about the data that was not apparent before, for
instance associations, so that we can look for new patterns. Clustering also provides a logical
structure that makes results easy to read and interpret. This is not the case, however, if we
opt for a large number of clusters, because there are no definitive methods to determine
precisely the suitable number of clusters.
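As an illustration of the centroid cluster model named in the clustering discussion, the following is a minimal sketch of k-means written with the Python standard library only. The paper does not prescribe a particular algorithm; the function name `k_means`, the toy points and the starting centroids are assumptions made for this example.

```python
from math import dist  # Euclidean distance, available from Python 3.8


def k_means(points, centroids, max_iter=100):
    """Group points around centroids (a centroid-model clustering sketch).

    `centroids` holds the caller-chosen starting centers, so the number of
    clusters is an input: the algorithm cannot determine it on its own.
    """
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        updated = []
        for old, members in zip(centroids, clusters):
            if members:
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(old)  # an empty cluster keeps its previous centroid
        if updated == centroids:  # converged: no centroid moved
            break
        centroids = updated
    return centroids, clusters


# Hypothetical 2-D data: two visibly separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = k_means(points, centroids=[points[0], points[3]])
```

The number of clusters is fixed here by the two starting centroids, which reflects the limitation noted in the text: the method itself gives no definitive way to choose that number.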
              Fig. 1. Characterizing Pillars of Big Data

B. Decision Trees
    The decision tree is a technique that we can use for classification tasks, by creating a
model that predicts the output value based on a number of input values [9][10]. To use decision
trees for classification, we construct trees starting from the root and subsequently proceeding
down to the leaves.
    A classification rule is developed from example objects, which are described by their values
for a collection of attributes. The decision tree is then expressed in terms of the same
attributes [11]. Decision trees are a good way to represent decisions. An example of the form of
a decision tree is shown in Fig. 2.
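The root-to-leaf construction described for decision trees can be sketched as follows. This is a minimal illustration in the spirit of entropy-based tree induction, not the authors' implementation: the attribute names (`tone`, `has_suggestion`), the class labels and the six example comments are hypothetical, and the tree only handles categorical attribute values that were seen during construction.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def build_tree(rows, labels, attributes):
    """Grow the tree from the root down; a leaf is a plain class label."""
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class

    def remaining_entropy(attr):
        # Weighted entropy of the labels after splitting on `attr`.
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[attr], []).append(label)
        return sum(len(g) / len(labels) * entropy(g) for g in groups.values())

    # Lowest remaining entropy is the same as highest information gain.
    best = min(attributes, key=remaining_entropy)
    node = {"attribute": best, "branches": {}}
    rest = [a for a in attributes if a != best]
    for value in sorted({row[best] for row in rows}):
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = zip(*subset)
        node["branches"][value] = build_tree(list(sub_rows), list(sub_labels), rest)
    return node


def classify(tree, row):
    """Follow the branches matching `row` until a leaf is reached.

    Raises KeyError for an attribute value that was never seen while building.
    """
    while isinstance(tree, dict):
        tree = tree["branches"][row[tree["attribute"]]]
    return tree


# Hypothetical example objects: user comments described by two attributes.
rows = [
    {"tone": "negative", "has_suggestion": "no"},
    {"tone": "negative", "has_suggestion": "yes"},
    {"tone": "positive", "has_suggestion": "yes"},
    {"tone": "positive", "has_suggestion": "no"},
    {"tone": "neutral", "has_suggestion": "no"},
    {"tone": "negative", "has_suggestion": "no"},
]
labels = ["complaint", "proposal", "proposal", "feedback", "feedback", "complaint"]

tree = build_tree(rows, labels, ["tone", "has_suggestion"])
prediction = classify(tree, {"tone": "negative", "has_suggestion": "no"})
```

On this toy data the root tests `has_suggestion`, because that split leaves the least uncertainty about the class, and the `no` branch is then refined by `tone`.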


              Fig. 2. A General Form of a Decision Tree

    Decision trees are characterized by their robustness and by how simple they are to
understand and interpret. An important property of decision trees is that they can handle both
categorical and numerical data. On the other hand, decision trees are unstable, since a small
change in the input data can cause large changes in the entire tree.

C. Support Vector Machines

    Support vector machines (SVMs) were first introduced for binary classification. They are a
collection of methods used for regression and classification, which analyze data in order to
determine to which category an element belongs [12]. They are used in many applications, such as
text categorization, image recognition, handwriting recognition, bioinformatics, etc.

    Some of the advantages of SVM algorithms are robustness, the ability to learn well with few
parameters, and computational efficiency. On the other hand, applying SVMs can at times require
taking into consideration many aspects of learning methods [13]. SVMs are designed to be
directly applicable to two-class tasks. For that reason, when we deal with a multi-class task,
we must use algorithms that either reduce it to a set of binary problems or take all the classes
into account at once through a single optimization formulation over all the data. Different
methods for multi-class support vector machines continue to emerge [14].

D. Associative Classification
    Associative classification refers to classification based on association rules, combining
classification with association mining [15][16]. Compared to other approaches, it is considered
a highly accurate and competitive method, and it can be applied in different ways
[17][18][19][20]. We can define three types of associative classification systems:
 • Classification by Emerging Patterns: based on patterns emerging from a sample, that is,
   event associations whose supports vary depending on the dataset [21].
 • Classification based on High-Order Patterns: a classification system that uses the
   high-order pattern discovery algorithm, which detects significant association patterns using
   residual analysis in statistics [22].
 • Associative Classifiers based on the Apriori Algorithm: the Apriori algorithm proceeds by
   determining the frequent items in the database, from which we can define association rules
   that summarize trends in the database. Many applications in various domains have used this
   technique, such as market basket analysis [23].
    Associative classification provides high accuracy and is easy to understand. However, it
presents some challenges, such as the lack of obvious criteria to classify objects. Since it is
based on a large number of rules, building the classifier is time-consuming, and it becomes
difficult to select the suitable rules.

     V. BIG DATA CLASSIFICATION AS A BASIS OF DECISION-MAKING 2.0

A. The Data Generation Process

    Web 2.0 is a very important source of information. The user interacts continuously with web
content through collaborative applications such as blogs, social networks, etc. As the number of
actors on the web increases, the rate of information circulating on its channels increases. This
large data flow generates the phenomenon of big data. Hence, web 2.0 is a rich platform of
information, which can be processed to generate significant data. The user, primarily a passive
actor, becomes in an instant an active actor by transmitting opinions, which we propose to
process to support decision-making. These opinions can take, for example, the form of:
 • A solution to a particular problem: a problem can be solved quickly and efficiently if the
   process of generating the solution is collaborative. Reviews of an issue, including those of
   experts, may therefore be of great use in making decisions to solve a given problem.
 • Feedback on a given subject: any feedback contains an opinion that we can use to extract
   useful information that enriches the decision-making process.
 • A proposal for improvement: in any field, application, or system, we always look for ways of
   improvement, especially in business. The opinions of clients, in particular those who are
   the most affected by the service, constitute a very important source of inspiration for
   making the right improvement decision.
 • A complaint about a process, a product, or a service: as with proposals for improvement,
   complaints also lead to significant decisions about a product, a process, a service, etc.

B. Decision-Making 2.0 Model Based on Big Data Classification
    To exploit the data generated on web 2.0, it is necessary to isolate the significant
information. The data circulating through web 2.0 applications such as social networks have the
characteristics that make them part of what is called big data. To process them, we propose to
adopt a classification process.
    When we want to process data from web 2.0 content in order to make decisions, a simple
comment or tweet can generate a large data stream through the feedback of users. Taking these
data into account in decision-making is very important to harness collective intelligence.
    After a preliminary processing of the data streams, which centralizes those that meet the
needs of our study, comes the classification phase, which derives classified data according to
specific parameters that depend on the issue in question. Finally, we obtain the basis of
decision-making. The framework that presents the general process, starting with the creation of
data on the web and ending with decision-making, is represented in Fig. 3.

  Fig. 3. Process of the Generation of the Decision 2.0 Based on the Big Data Classification

    In the decision-making 2.0 process, classification serves as the passage from raw data to
classified data, which will be used later to generate decisions. Data that circulate across the
web, especially in social networks, blogs, etc., are difficult to track and manage. To overcome
this problem, our classification process should follow some specifications in order to carry out
this mission properly.

    Our aim is decision-making based on the content reflected by the comments and feedback of
users, and relevant decisions must be generated from meaningful data. Our classification process
must therefore be efficient and suit our purpose.
    As already mentioned, the classification methods have drawbacks as well as advantages. That
is why we opt for a combination, building a multiple classification model that exploits the
strengths of the cited methods while taking into account different parameters, as shown in
Fig. 4.
 • Accuracy: the classification process must guarantee high accuracy to ensure the relevance of
   our decisions, which is a very important factor in evaluating the quality of a decision.
 • Facility of understanding: it is essential that classification provides results that are
   easy to understand, which also means that the results must be interpretable without
   difficulty.
 • Flexibility: the classification must be able to take into consideration categorical data,
   and not just numerical data, for more significant and common decisions.

  Fig. 4. Pillars of Big Data Classification for a Decision-making 2.0 Process Model
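The paper proposes combining several classification methods into a multiple classification model but does not specify how their outputs are merged. The sketch below assumes simple majority voting, which is one common choice and not necessarily the authors' design. The three base classifiers are hypothetical hand-written stand-ins for trained models, and the field names (`tone`, `rating`, `keywords`) are invented for the example.

```python
from collections import Counter


def majority_vote(classifiers, item):
    """Return the label chosen by most classifiers and its share of the votes."""
    votes = Counter(clf(item) for clf in classifiers)
    label, count = votes.most_common(1)[0]
    return label, count / len(classifiers)


# Hypothetical stand-ins for trained models; each maps a comment to a label.
def tone_rule(comment):
    # Categorical input, as a decision tree branch might test it.
    return "complaint" if comment["tone"] == "negative" else "feedback"


def rating_threshold(comment):
    # Numerical input, standing in for a two-class margin-based decision.
    return "complaint" if comment["rating"] < 2.5 else "feedback"


def keyword_rule(comment):
    # Rule of the form {keyword} -> class, as in associative classification.
    return "complaint" if "refund" in comment["keywords"] else "feedback"


comment = {"tone": "negative", "rating": 4.0, "keywords": {"refund", "late"}}
label, share = majority_vote([tone_rule, rating_threshold, keyword_rule], comment)
```

An odd number of voters over two labels avoids ties. The vote share gives a rough indication of how strongly the combined model supports the decision, and mixing categorical and numerical inputs across the voters relates to the flexibility pillar.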


                        VI. CONCLUSION

    In this paper, we gave an overview of the results of a study of big data classification
tools, and presented a summary of the techniques that we can use to process data coming from
web 2.0 in support of decision-making. We then presented a general framework of the entire
process and mentioned the criteria to take into consideration when choosing the classification
method.
    To exploit the strengths of the cited methods, we opt for a combination that develops a
multiple classification model, so that we can ensure the three pillars of big data
classification for a decision-making 2.0 process: accuracy, facility of understanding and
flexibility.

                        ACKNOWLEDGMENT
    The authors of this paper would like to thank our Research Team, the Information Technology
and Modeling Systems Research Unit, and more generally, the Computer Science, Operational
Research and Applied Statistics Laboratory, of the Faculty of Science, Abdelmalek Essaadi
University of Tetuan, Morocco, for their great support.

                        REFERENCES
[1]    T. O'Reilly, "What is Web 2.0: Design patterns and business models for the next
       generation of software." Communications & strategies, (1), 17, 2007.
[2]    T. O'Reilly, and J. Musser, Web 2.0 principles and best practices. O'Reilly Radar, 2006.
[3]    R. Kumar, J. Novak, and A. Tomkins, "Structure and evolution of online social networks."
       In Link mining: models, algorithms, and applications, Springer New York, 2010,
       pp. 337-357.
[4]    T. O'Reilly, What is web 2.0. O'Reilly Media, Inc, 2009.
[5]    P. Zikopoulos, and C. Eaton, Understanding big data: Analytics for enterprise class
       hadoop and streaming data, 2012.
[6]    E. Letouzé, "Big data for development: challenges & opportunities." UN Global Pulse, 47,
       2012.
[7]    R. B. Cattell, "The description of personality: basic traits resolved into clusters."
       Journal of Abnormal and Social Psychology, 38: 476–506, 1943.
[8]    V. Estivill-Castro, "Why so many clustering algorithms — a position paper." ACM SIGKDD
       Explorations Newsletter, 4(1): 65–75, 2002.
[9]    L. Rokach, Data mining with decision trees: theory and applications. World Scientific
       Pub Co Inc. ISBN 978-9812771711, 2008.
[10]   S. B. Kotsiantis, "Decision trees: a recent overview." Artificial Intelligence Review,
       39(4), 261-283, 2013.
[11]   J. R. Quinlan, "Induction of decision trees." Machine learning, 1(1), 81-106, 1986.
[12]   V. N. Vapnik, The nature of statistical learning theory. Springer-Verlag New York, 1995.
[13]   I. Steinwart, and A. Christmann, Support vector machines. Springer Science & Business
       Media, 2008.
[14]   C. W. Hsu, and C. J. Lin, "A comparison of methods for multiclass support vector
       machines." Neural Networks, IEEE Transactions on, 13(2), 415-425, 2002.
[15]   Y. Wang, and A. K. C. Wong, "From association to classification: Inference using weight
       of evidence." IEEE Trans. on Knowledge and Data Engineering, 15(3):764–767, 2003.
[16]   X. Yin, and J. Han, "CPAR: Classification based on predictive association rules." In
       Proceedings 2003 SIAM International Conference on Data Mining (SDM'03), San Francisco,
       CA, May 2003, pp. 331–335.
[17]   G. Dong, X. Zhang, L. Wong, and J. Li, "CAEP: classification by aggregating emerging
       patterns." In Proceedings of The Second International Conference on Discovery Science
       (DS'99), pp. 43–55, Japan, December 1999.
[18]   J. Li, G. Dong, K. Ramamohanarao, and L. Wong, "DeEPS: a new instance-based lazy
       discovery and classification system." Machine Learning, 54(2):99–124, 2004.
[19]   W. Li, J. Han, and J. Pei, "CMAR: accurate and efficient classification based on multiple
       class-association rules." In Proceedings of The 2001 IEEE International Conference on
       Data Mining (ICDM'01), pp. 369–376, San Jose, CA, November 2001.
[20]   B. Liu, W. Hsu, and Y. Ma, "Integrating classification and association rule mining." In
       Proceedings of the Fourth ACM SIGKDD International Conference on Knowledge Discovery and
       Data Mining, pp. 80–86, New York, NY, August 1998.
[21]   G. Dong and J. Li, "Efficient mining of emerging patterns: discovering trends and
       differences." In S. Chaudhuri and D. Madigan, editors, Proceedings of the Fifth ACM
       SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 43–52. ACM
       Press, San Diego, CA, 1999.
[22]   Y. Wang, High-order pattern discovery and analysis of discrete-valued data sets. PhD
       thesis, University of Waterloo, Waterloo, Ontario, Canada, 1997.
[23]   R. Agrawal, and R. Srikant, "Fast algorithms for mining association rules." In Proc. 20th
       Int. Conf. Very Large Data Bases, VLDB, vol. 1215, pp. 487-499, 1994.

