=Paper= {{Paper |id=Vol-1446/smlir_submission5 |storemode=property |title=Integrating an Advanced Classifier in WEKA |pdfUrl=https://ceur-ws.org/Vol-1446/smlir_submission5.pdf |volume=Vol-1446 |dblpUrl=https://dblp.org/rec/conf/edm/PopescuMM15 }} ==Integrating an Advanced Classifier in WEKA== https://ceur-ws.org/Vol-1446/smlir_submission5.pdf
                Integrating an Advanced Classifier in WEKA

            Paul Ştefan Popescu                       Mihai Mocanu               Marian Cristian Mihăescu
          Department of Computers and            Department of Computers and         Department of Computers and
            Information Technology                Information Technology             Information Technology
             Bvd. Decebal no. 107                  Bvd. Decebal no. 107               Bvd. Decebal no. 107
               Craiova, Romania                      Craiova, Romania                   Craiova, Romania
          sppopescu@gmail.com                  mocanu@software.ucv.ro             mihaescu@software.ucv.ro


ABSTRACT
Nowadays, WEKA has become one of the most important data mining and machine learning tools. Despite the fact that it incorporates many algorithms, in the classification area some useful features are still missing. In this paper we cover some of these missing features, which may be useful to researchers and developers working with decision tree classifiers. The rest of the paper presents the design of a package compatible with the WEKA Package Manager, which is currently under development. The functionalities provided by the tool include instance loading, successor/predecessor computation and an alternative visualization feature for an enhanced decision tree, using the J48 algorithm. The paper presents how a new data mining/machine learning classification algorithm can be adapted and integrated into the WEKA workbench.

Keywords
Classifier, J48, WEKA, Machine learning, Data Mining

1. INTRODUCTION
Nowadays huge amounts of data can be gathered from many research areas or industry applications, and there is a real need for data mining or knowledge extraction [6] from these data. From this large amount of data, analysts gather many variables/features, and machine learning techniques are needed to cope with this situation. There are many application domains, such as medicine, economics (i.e., marketing, sales, etc.), engineering or, in our case, the educational research area [16], in which machine learning techniques can be applied. Educational data mining is a growing domain [4] in which a lot of work has been done.

Because the application domains are growing continuously, the tools that support the machine learning process must live up to market standards, providing good performance and intuitive visualization techniques. Nowadays there are many tools that deal with a wide variety of problems. In order to be more explicit, we have tools like RapidMiner [12], KEEL [2], WEKA, KNIME [3] or Mahout [13]. RapidMiner is a graphical drag-and-drop analytics platform, formerly known as YALE, which provides an integrated environment for data mining, machine learning, business and predictive analytics. KEEL is an application package of machine learning software tools, specialized in the evaluation of evolutionary algorithms. KNIME, the Konstanz Information Miner, is a modular data exploration platform, provided as an Eclipse plug-in, which offers a graphical workbench and various components for data mining and machine learning. Mahout is a highly scalable machine learning library based on the Hadoop framework [18], an implementation of the MapReduce programming model, which supports distributed processing of large data sets across clusters of computers.

For our approach we chose WEKA because it has become one of the most popular machine learning and data mining workbenches; its success is due to its constant improvement and development. Moreover, WEKA is a very popular tool in many research domains and has been widely adopted by the educational data mining community.

WEKA is developed in Java and encapsulates a collection of algorithms that tackle many data mining or machine learning tasks, like preprocessing, regression, clustering, association rules, classification, and also visualization techniques. In some cases, however, only a basic implementation of these algorithms is provided.

One aspect that needs to be taken into consideration is that WEKA has a package manager which simplifies the developers' contribution process. There are two kinds of packages that can be installed in WEKA and used via the application interface: official and unofficial packages. This is a very important feature because, if there is an algorithm that fits your problem description and there is a package for it, you can just add it to the application and use it right away. Moreover, you don't need to be a programmer to do that; you don't need to write code, just install the package and then use the algorithm as if it had always been there.

In practice, many of the included algorithms can hardly be used because of their lack of flexibility. For example, with the standard decision trees from WEKA we can perform a classification process but we cannot access a particular instance from the tree. Suppose that we have a training data file and we create the tree model. When we try
to see where instance "X" is placed in the tree, we cannot do so, either from the application interface or when adding the WEKA library to our own code. This is a big drawback, because retrieving the leaf to which an instance belongs provides more information than retrieving only its class. Usually, when performing a classification task, the data analyst divides test instances into classes that have little meaning from the application domain's perspective.

In a real-life scenario, a training dataset may have a large number of features describing the instances. A data analyst should be able to parse a decision tree, see the rule that led to a specific decision and then draw very accurate conclusions. In this paper we address classification and visualization issues by adding new functionalities and improving the decision tree visualization.

Several classification algorithms have been previously contributed to WEKA, but none of them is able to output a data model that is loaded with instances. Consequently, there are no WEKA visualization techniques able to present the data in the model in an efficient way, and no parsing methods ready to implement such functionalities. Traversal of leaves is another missing feature, and it is important because instances from neighbouring leaves have a high degree of similarity and share many attributes with similar values.

One aspect that differentiates WEKA from other similar software is its architecture, which allows developers to contribute in a productive way. All the work that needs to be done refers to creating a specific folder layout, completing a "description.props" file, and adding the ".jar" file and the build script to the archive.

2. RELATED WORK
WEKA is an open source machine learning library that allows developers and researchers to contribute very easily. More than twenty years have passed since WEKA had its first release [9], and contributions have constantly been added to it. Not only machine learning algorithms have been implemented; for example, in 2005 a text data mining module was developed [20]. An overview of the current software was given in [8].

Several classifiers have been developed and contributed as packages to WEKA. In 2007 a classifier built on a set of sub-samples was developed [14] and compared to C4.5 [15], whose WEKA implementation is called J48 [11]. Another contribution is the "Alternating Decision Trees Learning Algorithm" [7], a generalization of decision trees, voted decision trees and voted decision stumps. This kind of classifier is relatively easy to interpret and its rules are usually smaller in size. Classical decision trees, such as C4.5, expand nodes in a depth-first order; an improvement came from "Best-first decision trees" [17], which expand nodes in a best-first order. A package with these trees was contributed to WEKA.

Some other contributions refer to libraries of algorithms that can be accessed via WEKA. One of them is JCLEC [5], an evolutionary computation framework which has been successfully employed for developing several evolutionary algorithms. Another environment for machine learning and data mining knowledge discovery that was contributed to WEKA is R [10]. This contribution was developed in order to make the different sets of tools from both environments available in a single unified system.

Also relevant as related work are some of the latest algorithm developments. In the last year a new fast decision tree algorithm was presented [19]; based on the authors' experiments, the classifier outperforms C5.0, the commercial implementation of C4.5.

3. SYSTEM DESIGN
The package is designed to be used both by developers, in their Java applications, and by researchers, using the WEKA Explorer. At the moment of writing this paper, the package with the Advanced Classifier is still under development, offering more functionalities as a tool for developers than in the Explorer view of WEKA.

Figure 1: Package Integration in WEKA

In Fig. 1 we present the main design of the algorithm and how it can be used in WEKA. At the top of the figure we have the classifier, which can be divided into two main modules: the algorithm and the visualization. As we can see on the next level, both modules can be divided further. All the functionalities are then installed in WEKA via the package manager and then, in the Explorer, we can perform data analysis tasks using a model loaded with data and its associated visualization techniques.

3.1 General Architecture
The package is a zip archive, structured with respect to the WEKA guidelines. That is, it unpacks to the current directory and it contains: the source files, a folder with the required libraries, a build script, a properties file required by WEKA for installing and managing the package, and the actual ".jar" file. A detailed structure of the package is presented below.
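Of these files, the metadata lives in "Description.props", which the Package Manager reads when installing the package. A minimal sketch is shown next; the field names follow WEKA's packaging guidelines, but every value here is illustrative rather than the package's actual metadata:

```properties
# Read by the WEKA Package Manager at install time.
# Field names per WEKA's packaging guidelines; values are illustrative.
PackageName=AdvancedClassifier
Version=1.0.0
Date=2015-06-01
Title=Advanced decision tree classifier with instance-loaded leaves
Category=Classification
Author=Example Author
Maintainer=Example Maintainer
License=GPL 2.0
Description=J48-based classifier offering per-leaf instance loading, successor/predecessor computation and an enhanced tree visualization.
Depends=weka (>=3.7.0)
```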
  +-AdvancedClassifier.jar
  +-Description.props
  +-build_package.xml
  +-src
  | +-main
  |    +-java
  |      +-resources
  |      | +-background_node.png
  |      | +-background_leaf.png
  |      | +-background_leaf_pressed.png
  |      | +-font_node.ttf
  |      | +-font_decision.ttf
  |      +-weka
  |           +-classifiers
  |           | +-trees
  |           |    +-model
  |           |    | +-AdvancedClassifierTree.java
  |           |    | +-AdvancedClassifierTreeBaseNode.java
  |           |    | +-AdvancedClassifierTreeNode.java
  |           |    | +-AdvancedClassifierTreeLeaf.java
  |           |    | +-BaseAttributeValidator.java
  |           |    | +-NominalAttributeValidator.java
  |           |    | +-NumericAttributeValidator.java
  |           |    | +-Constants.java
  |           |    +-AdvancedClassifier.java
  |           |    +-WekaTextfileToXMLTextfile.java
  |           +-gui
  |              +-visualize
  |                 +-plugins
  |                    +-AdvancedClassifierTree.java
  |                    +-AdvancedClassifierTreePanel.java
  |                    +-BaseNodeView.java
  |                    +-AdvancedClassifierTreeNodeView.java
  |                    +-AdvancedClassifierTreeLeafView.java
  |                    +-ConnectingLineView.java
  +-lib
    +-weka.jar
    +-simple-xml.jar
    +-rt.jar

In Figure 2 the system's class diagram is presented. This diagram includes all the Java packages from the project and their relations. As we can see in the above mentioned figure, we have two types of classes: independent and composed. Independent classes either belong to the model part of the Model-View-Controller architecture or perform one-time tasks, like "WekaTextfileToXMLTextfile", which generates an XML file based on the text output of WEKA. On the other side, the composed classes depend on each other, and these relations are shared across packages. One important class that is worth mentioning is "AdvancedClassifierTreeLeaf.java", in which we store the leaves of our tree along with the rules that define each leaf. Discussions about the implementation of the packages are more related to the software engineering research area and beyond the scope of this paper.

3.1.1 Design and Implementation of the Algorithm
The algorithm needs to generate custom rules (dependent on the training dataset) for every leaf of the decision tree. These rules are computed by tracing the path from the root of the tree to the specified leaf. Each decision that leads to a leaf is therefore translated into a rule that encapsulates the name of the attribute and the value on which the decision was made. For each type of attribute defined by WEKA, we need a corresponding rule that matches that type. For this purpose an abstract class has been created to act as a base class for any of the custom rules. The name of this class is "BaseAttributeValidator"; it exposes the required methods that a subclass needs to implement: a "clone" method required by the workflow of the system, and methods that validate whether an instance or a set of instances has the required values for the attribute targeted by the rule. At the moment, the only implemented rules are the ones that handle "NOMINAL" and "NUMERIC" attribute types.

                                                               The rule that validates each nominal attribute is called “Nom-
                                                               inalAttributeValidator” and receives as parameters the name
                                                               of the targeted attribute and a string variable representing
                                                               the accepted value of the attribute. The rule that handles
                                                               the numeric attributes is called “NumericAttributeValida-
                                                               tor” and also receives the name of the attribute and either
                                                               a particular value or the boundaries of an interval.
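As an illustration, the validator hierarchy described above can be sketched in plain Java. The class names mirror the ones in the package, but the member signatures are assumptions, since the package is still under development:

```java
import java.util.Map;

// Base class for the custom per-attribute rules; mirrors the role of
// "BaseAttributeValidator" described above (signatures are assumed).
abstract class BaseAttributeValidator implements Cloneable {
    protected final String attributeName;

    protected BaseAttributeValidator(String attributeName) {
        this.attributeName = attributeName;
    }

    // Returns true if the given instance satisfies the rule.
    abstract boolean validate(Map<String, Object> instance);

    @Override
    public abstract BaseAttributeValidator clone();
}

// Accepts exactly one nominal value for the targeted attribute.
class NominalAttributeValidator extends BaseAttributeValidator {
    private final String acceptedValue;

    NominalAttributeValidator(String attributeName, String acceptedValue) {
        super(attributeName);
        this.acceptedValue = acceptedValue;
    }

    @Override
    boolean validate(Map<String, Object> instance) {
        return acceptedValue.equals(instance.get(attributeName));
    }

    @Override
    public NominalAttributeValidator clone() {
        return new NominalAttributeValidator(attributeName, acceptedValue);
    }
}

// Accepts values inside a half-open numeric interval (lower, upper],
// matching the "v <= t" / "v > t" splits produced by J48.
class NumericAttributeValidator extends BaseAttributeValidator {
    private final double lower, upper;

    NumericAttributeValidator(String attributeName, double lower, double upper) {
        super(attributeName);
        this.lower = lower;
        this.upper = upper;
    }

    @Override
    boolean validate(Map<String, Object> instance) {
        Object v = instance.get(attributeName);
        if (!(v instanceof Number)) return false;
        double d = ((Number) v).doubleValue();
        return d > lower && d <= upper;
    }

    @Override
    public NumericAttributeValidator clone() {
        return new NumericAttributeValidator(attributeName, lower, upper);
    }
}
```

A single-value numeric rule is just the degenerate interval whose boundaries enclose that value.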

In the following paragraphs we present a brief overview of the algorithm, for which we adopt a straightforward approach.

                                                               Firstly, the algorithm retrieves instances from the “.arff” file
                                                               using the methods provided by WEKA. The next step is
                                                               applying the desired classification process. Currently the
                                                               only supported classifier is J48, but employing other decision
                                                               tree classifiers is foreseen as future work. Using the text
                                                               representation of the outputted model and a predefined set
                                                               of rules and tags, an XML is then generated. This is an
                                                               important step during the workflow because the structured
                                                               XML format allows us to obtain the base model for our
                                                               decision tree. The deserialization is done using a third-party
Java library ("Simple XML" [1]).
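The deserialization step can be illustrated without the third-party dependency: the sketch below parses a small, hypothetical XML rendering of the model with the JDK's built-in DOM parser. The real package uses Simple XML, and the element names in the generated XML are assumptions here:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Reads a (hypothetical) XML rendering of a J48 model into memory.
// The actual package deserializes with the Simple XML library; the JDK
// DOM parser is used here only to keep the sketch dependency-free.
class TreeXmlReader {
    static int countLeaves(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            // "leaf" is an assumed tag name for the model's output leaves.
            return doc.getElementsByTagName("leaf").getLength();
        } catch (Exception e) {
            throw new RuntimeException("malformed model XML", e);
        }
    }
}
```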

Figure 2: Class Diagram

The model obtained this way contains a list of nodes and
leaves with the following significance: each node corresponds
to a decision in the tree; the data stored in each object
(node) refers to the name of the actual attribute, the
operator and the value on which the decision was
made, and the results to which making the decision leads (a
list of other nodes or an output leaf). Using this model and
the set of attributes provided by WEKA, the set of rules
is computed. This step is performed by parsing the model
from the first node (i.e., the root) to the last available leaf
and gradually composing the set of rules that defines each
leaf. The setup of the algorithm is finally completed with
the loading of the training dataset into the model.
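The parsing step just described can be sketched in plain Java; the node structure and method names below are illustrative, not the package's actual API:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Minimal model of the parsed tree: each Node carries the textual
// decision that leads to it; a non-null leafName marks a leaf.
class RuleComposer {
    static class Node {
        String decision;                       // e.g. "grade <= 5"; null/empty at the root
        String leafName;                       // non-null only for leaves
        List<Node> children = new ArrayList<>();
    }

    // Walks the tree from the root to every leaf, gradually composing
    // the rule set (the conjunction of decisions along the path).
    static Map<String, List<String>> rulesPerLeaf(Node root) {
        Map<String, List<String>> rules = new LinkedHashMap<>();
        walk(root, new ArrayList<>(), rules);
        return rules;
    }

    private static void walk(Node n, List<String> path, Map<String, List<String>> out) {
        List<String> extended = new ArrayList<>(path);
        if (n.decision != null && !n.decision.isEmpty()) extended.add(n.decision);
        if (n.leafName != null) {              // reached a leaf: record its rule set
            out.put(n.leafName, extended);
            return;
        }
        for (Node child : n.children) walk(child, extended, out);
    }
}
```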

The classifier and the processed data can now be easily han-
dled and different operations can be applied. The methods
currently implemented include basic per-leaf manipulation
of instances, i.e. loading new instances into the model and
retrieving the part of the dataset contained in each leaf, as
well as predecessor and successor computation.
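A minimal sketch of these per-leaf operations, assuming leaves are kept in the left-to-right order produced by the tree parsing (all names illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Per-leaf instance storage: instances are routed to the first leaf
// whose composed rule accepts them; successor/predecessor are simply
// the adjacent entries of the ordered leaf list.
class LeafStore<I> {
    static class Leaf<I> {
        final String name;
        final Predicate<I> rule;                  // conjunction of path rules
        final List<I> instances = new ArrayList<>();
        Leaf(String name, Predicate<I> rule) { this.name = name; this.rule = rule; }
    }

    private final List<Leaf<I>> leaves = new ArrayList<>();

    void addLeaf(String name, Predicate<I> rule) { leaves.add(new Leaf<>(name, rule)); }

    // Loads a dataset into the model, one leaf per instance.
    void load(List<I> dataset) {
        for (I inst : dataset)
            for (Leaf<I> leaf : leaves)
                if (leaf.rule.test(inst)) { leaf.instances.add(inst); break; }
    }

    // Retrieves the part of the dataset contained in one leaf.
    List<I> instancesOf(String leafName) { return leaf(leafName).instances; }

    Leaf<I> successor(String leafName)   { return neighbour(leafName, +1); }
    Leaf<I> predecessor(String leafName) { return neighbour(leafName, -1); }

    private Leaf<I> neighbour(String name, int offset) {
        int i = leaves.indexOf(leaf(name)) + offset;
        return (i >= 0 && i < leaves.size()) ? leaves.get(i) : null;
    }

    private Leaf<I> leaf(String name) {
        for (Leaf<I> l : leaves) if (l.name.equals(name)) return l;
        throw new IllegalArgumentException("unknown leaf: " + name);
    }
}
```

Because neighbouring leaves differ in only the last few path rules, the successor/predecessor of a leaf holds the instances most similar to its own.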

3.1.2    Visualization Plugin
For the visualization feature, a custom panel has been de-
signed to hold the components that build up the decision
tree and expose the data available in the leaves. The con-
structor of the panel requires the decision tree model as a pa-
rameter, and takes care of adding the corresponding views
to the interface. In order to include this functionality in
WEKA, a specialized class that implements WEKA’s Tree-
VisualizePlugin interface has been created. After adding
the package through the Package Manager and selecting this
visualization option, a new JFrame that holds the custom
panel is displayed.
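The wiring of such a panel can be sketched in plain Swing, with the model reduced to a list of leaf labels; the real panel also draws the nodes and connecting lines, and all names here are illustrative:

```java
import java.util.List;
import javax.swing.JButton;
import javax.swing.JPanel;

// Custom panel whose constructor receives the (reduced) tree model and
// adds one clickable view per leaf, as described above.
class AdvancedTreePanel extends JPanel {
    AdvancedTreePanel(List<String> leafLabels) {
        for (String label : leafLabels) {
            JButton leafView = new JButton(label);
            // Clicking a leaf displays the instances it encloses.
            leafView.addActionListener(e -> System.out.println("show instances of " + label));
            add(leafView);
        }
    }
}
```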


Figure 3: Sample from the Dataset

In Figure 3 we present a dataset sample. In order to validate the classifier and its extra functionalities, several tests have been made; for this case study we used three attributes and 788 instances. The feature called "userid" doesn't provide any information gain, but can easily be used for localizing instances in leaves. The significance of the attributes is beyond the scope of this paper.

Figure 4: Tree Sample

In Figure 4 a screenshot of the tree generated from the dataset in Figure 3 is presented. Each node contains the name of the attribute, and each decision is printed on top of the connecting line. In addition, each leaf can be clicked, and the set of enclosed instances is displayed. As previously noted, there is still some work to be done to finalize the development of the package, and the visualization tool needs to be included as well. Efforts will have to be made toward providing the means to visualize and handle the successors/predecessors, outliers and other relevant information.

4. CONCLUSIONS AND FUTURE WORK
In this paper we have presented the integration of a data analysis tool in WEKA. This tool is important because it brings a new classifier to WEKA that aims to improve the classification procedures. Some implementation procedures and details are also presented.

A workflow is also described, together with the whole mechanism used to bring new features to the users. One important thing that needs to be mentioned is that the data loading module opens new data analysis opportunities for researchers.

As future work we plan to implement the other types of attributes supported by WEKA, like "DATE", "String" and "Relational".
5. REFERENCES
 [1] Simple xml. http://simple.sourceforge.net.
 [2] J. Alcalá-Fdez, L. Sánchez, S. García, M. del Jesus, S. Ventura, J. Garrell, J. Otero, C. Romero, J. Bacardit, V. Rivas, J. Fernández, and F. Herrera. Keel: a software tool to assess evolutionary algorithms for data mining problems. Soft Computing, 13(3):307–318, 2009.
 [3] M. R. Berthold, N. Cebron, F. Dill, T. R. Gabriel, T. Kötter, T. Meinl, P. Ohl, K. Thiel, and B. Wiswedel. Knime - the konstanz information miner: Version 2.0 and beyond. SIGKDD Explor. Newsl., 11(1):26–31, Nov. 2009.
 [4] R. Campagni, D. Merlini, R. Sprugnoli, and M. C. Verri. Data mining models for student careers. Expert Systems with Applications, (0):–, 2015.
 [5] A. Cano, J. M. Luna, J. L. Olmo, and S. Ventura. Jclec meets weka! In E. Corchado, M. Kurzynski, and M. Wozniak, editors, HAIS (1), volume 6678 of Lecture Notes in Computer Science, pages 388–395. Springer, 2011.
 [6] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. The kdd process for extracting useful knowledge from volumes of data. Commun. ACM, 39(11):27–34, Nov. 1996.
 [7] Y. Freund and L. Mason. The alternating decision tree learning algorithm. In Proceedings of the Sixteenth International Conference on Machine Learning, ICML '99, pages 124–133, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc.
 [8] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The weka data mining software: An update. SIGKDD Explor. Newsl., 11(1):10–18, Nov. 2009.
 [9] G. Holmes, A. Donkin, and I. H. Witten. Weka: a machine learning workbench. pages 357–361, August 1994.
[10] K. Hornik, C. Buchta, and A. Zeileis. Open-source machine learning: R meets weka. Computational Statistics, 24(2):225–232, 2009.
[11] W.-Y. Loh. Classification and Regression Tree Methods. John Wiley & Sons, Ltd, 2008.
[12] I. Mierswa, M. Wurst, R. Klinkenberg, M. Scholz, and T. Euler. Yale: Rapid prototyping for complex data mining tasks. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '06, pages 935–940, New York, NY, USA, 2006. ACM.
[13] S. Owen, R. Anil, T. Dunning, and E. Friedman. Mahout in Action. Manning Publications Co., Greenwich, CT, USA, 2011.
[14] J. M. Pérez, J. Muguerza, O. Arbelaitz, I. Gurrutxaga, and J. I. Martín. Combining multiple class distribution modified subsamples in a single tree. Pattern Recognition Letters, 28(4):414–422, 2007.
[15] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1993.
[16] C. Romero and S. Ventura. Educational data mining: A survey from 1995 to 2005. Expert Systems with Applications, 33(1):135–146, 2007.
[17] H. Shi. Best-first decision tree learning. Technical report, University of Waikato, 2007.
[18] K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The hadoop distributed file system. In Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on, pages 1–10. IEEE, 2010.
[19] V. Purdilă and S.-G. Pentiuc. Fast decision tree algorithm. Advances in Electrical and Computer Engineering, 14(1):65–68, 2014.
[20] I. H. Witten, G. W. Paynter, E. Frank, C. Gutwin, and C. G. Nevill-Manning. Kea: Practical automatic keyphrase extraction. In Proceedings of the Fourth ACM Conference on Digital Libraries, DL '99, pages 254–255, New York, NY, USA, 1999. ACM.