=Paper= {{Paper |id=Vol-1885/235 |storemode=property |title=EasyMiner – Short History of Research and Current Development |pdfUrl=https://ceur-ws.org/Vol-1885/235.pdf |volume=Vol-1885 |authors=Tomáš Kliegr,Jaroslav Kuchař,Stanislav Vojíř,Václav Zeman |dblpUrl=https://dblp.org/rec/conf/itat/KliegrKVZ17 }} ==EasyMiner – Short History of Research and Current Development== https://ceur-ws.org/Vol-1885/235.pdf
J. Hlaváčová (Ed.): ITAT 2017 Proceedings, pp. 235–239
CEUR Workshop Proceedings Vol. 1885, ISSN 1613-0073, © 2017 T. Kliegr, J. Kuchař, S. Vojíř, V. Zeman



EasyMiner – Short History of Research and Current Development

Tomáš Kliegr¹, Jaroslav Kuchař², Stanislav Vojíř¹, and Václav Zeman¹

¹ Department of Information and Knowledge Engineering, Faculty of Informatics and Statistics, University of Economics, Prague, W. Churchill Sq. 4, Prague 3, Czech Republic
² Web Intelligence Research Group, Faculty of Information Technology, Czech Technical University, Thákurova 9, 160 00, Prague 6, Czech Republic

first.last@{vse|fit.cvut}.cz

Abstract: EasyMiner (easyminer.eu) is an academic data mining project providing data mining of association rules, building of classification models based on association rules and outlier detection based on frequent pattern mining. It differs from other data mining systems by adopting the “web search” paradigm. It is web-based, providing both a REST API and a user interface, and puts emphasis on interactivity, simplicity of user interface and immediate response. This paper gives an overview of research related to the EasyMiner project.

1 Introduction

In this paper, we present the history of research and development of the EasyMiner project (http://easyminer.eu). EasyMiner is an academic data mining project providing data mining of association rules, building of classification models based on association rules and outlier detection based on frequent pattern mining.

EasyMiner was to our knowledge the first interactive web-based data mining system that supported the complete machine learning process. While today there are several web-based machine learning systems on the market¹, owing to continuous development EasyMiner provides a distinct user experience. While most existing machine learning systems offer versatile user interfaces, where the user has to compose a new machine learning workflow for each task, in EasyMiner the user interface is crafted to provide the “web search” experience. The user visually constructs a query against the data, and the system responds with a set of interesting patterns (presented as rules) or a classifier (Figure 1).

¹ Such as BigML.com or Microsoft Azure.

Over the years of development, EasyMiner served as a testbed for a number of new technologies and research ideas. The purpose of this paper is to give a brief overview of this research.

This paper is organized as follows. Section 2 is focused on SEWEBAR-CMS, the predecessor of EasyMiner, used in research on the use of domain knowledge in data mining. Section 3 focuses on association rule discovery. Section 4 presents the adaptation of EasyMiner for learning business rules and Section 5 its subsequent adaptation for association rule classification. Section 6 presents the current focus on outlier detection. The architecture of the system is presented in Section 7. The current development efforts also focus on distributed computation platforms, which are covered in Section 8. Since the beginning, the research was accompanied by standardization efforts, which are presented in Section 9. Section 10 provides an overview of the features that were developed at some point in time as well as of those that are supported by the current version of EasyMiner. Finally, the conclusions present a case for using EasyMiner as a component in new projects requiring data mining functionality and refer the interested reader to other publications regarding comparison with other machine learning as a service (MLaaS) systems.

2 Handling of Domain Knowledge

EasyMiner evolved from the SEWEBAR (SEmantic-WEB Analytical Reports) project, which focused on semantically readable machine learning. In [9], we presented SEWEBAR-CMS as a set of extensions for the Joomla! content management system (CMS) that extends it with functionality required to serve as a communication platform between the data analyst, domain expert and the report user. The system later supported elicitation of domain knowledge from the analyst [12]. Association rules discovered from data with the LISp-Miner system (http://lispminer.vse.cz) were stored in a semantic form in the SEWEBAR-CMS system. The background knowledge was used to help answer user search queries, for example, to find rules that contradict existing domain knowledge [6]. Another novel element in the system was the use of an ontology for representation of the data mining domain.

Related research focused on improving semantic capabilities of content management systems [3] and on designing ontologies and schemata for representation of background knowledge [8, 11].

3 Association Rule Discovery

In its first release, EasyMiner provided a web-based interface for the LISp-Miner system, which was used for association rule mining [23]. EasyMiner interacted with LISp-Miner using its LM-Connect component, which is a web application providing the functionality of LISp-Miner through a REST API.




                                                        Figure 1: Visual query designer in EasyMiner.


      Table 1: Features supported in EasyMiner 2.4. Year: year in which the paper describing the feature was published; API: feature
      available in the REST API; UI: feature available in the user interface.

            Feature                                                                                                   Year        API       UI
            Content Management System [9]                                                                             2009        No        No
            Semantic search over discovered rules [3]                                                                 2010        No        No
            Support for GUHA extension of PMML [10]                                                                   2010        Yes       Yes
            Query for related (confirming, contradicting) rules to the selected rule [6]                              2011        No        No
            Editor of background knowledge [12]                                                                       2011        No        No
            LISp-Miner interface (disjunctions, negations, partial cedents, quantifiers, cuts, coefficients) [23]     2012        No        No
            Export of business rules to Drools [21]                                                                   2013        No        No
            Rule pruning with CBA [5]                                                                                 2014        Yes       Yes
            Evaluation of quality of classification models [20]                                                       2014        Yes       No
            Rule selection and editing for classification model building [20]                                         2014        Yes       No
            R interface (arules package) [22]                                                                         2015        Yes       Yes
            Spark backend [25]                                                                                        2016        Yes       Yes
            Discretization algorithms [25]                                                                            2016        Yes       No
            Support for the input RDF data format                                                                     2017        Yes       No
            Outlier detection [19]                                                                                    2017        Yes       No




EasyMiner with the LISp-Miner backend offered several unique features: 1. negation on attributes, 2. disjunction between attributes, 3. subpatterns allowing for scoping of logical connectives, 4. multiple interest measures (called quantifiers in GUHA), 5. mining directly on multivalued attributes, with no need to create "items", 6. dynamic binning operators (called coefficients in GUHA), 7. PMML-based import and export, 8. grid support.

Since the LM-Connect component is no longer developed and maintained, the integration of the current version of EasyMiner with LISp-Miner is currently not working.²

² It should be noted that all the features listed above can be used directly from the LISp-Miner system.

The current version of EasyMiner primarily relies on the R arules package [2], which wraps a C implementation of the apriori association rule mining algorithm [1].

4 Learning Business Rules

One of the first use cases for EasyMiner was learning business rules. In [21] we presented a software module for EasyMiner that allows selected rules to be exported to the Drools Business Rules Management System (BRMS), transforming the output of association rule learning into the DRL format supported by Drools. We found that the main obstacles to a straightforward use of association rules as candidate business rules are the excessive number of rules discovered even on small datasets, and the fact that contradicting rules are generated. In [5] we propose that a potential solution to these problems is provided by the seminal association rule classification algorithm CBA [16]. In [20] we presented a software module for EasyMiner that allows the domain expert to edit the discovered rules.

5 Association Rule Based Classification

In [5] we started to use the CBA algorithm for postprocessing association rule learning results into a classifier. In [22] we presented an extension for EasyMiner for building classification models. A benchmark against standard symbolic classification algorithms on a news recommender task was presented in [7].
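The CBA-style postprocessing referred to in Sections 4 and 5 can be illustrated with a minimal sketch. It is not the rCBA implementation used by EasyMiner. It assumes that class association rules have already been mined together with their support and confidence, ranks them (higher confidence, then higher support, then shorter antecedent), keeps a rule only if it correctly classifies at least one training instance not covered by a previously kept rule, and classifies new instances by the first matching rule. This is a simplification of the data coverage step of CBA [16]: the final cutoff based on total error is omitted. The `Rule` class, the function names and the toy data are hypothetical and chosen only for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Tuple

Instance = Dict[str, str]


@dataclass(frozen=True)
class Rule:
    antecedent: Tuple[Tuple[str, str], ...]  # (attribute, value) pairs
    label: str                                # class predicted by the rule
    support: float
    confidence: float

    def matches(self, instance: Instance) -> bool:
        # A rule fires when every antecedent item is present in the instance.
        return all(instance.get(attr) == val for attr, val in self.antecedent)


def rank_rules(rules: List[Rule]) -> List[Rule]:
    # Higher confidence first, then higher support, then shorter antecedent.
    return sorted(rules, key=lambda r: (-r.confidence, -r.support, len(r.antecedent)))


def prune_by_coverage(
    rules: List[Rule], data: List[Tuple[Instance, str]]
) -> Tuple[List[Rule], str]:
    remaining = list(data)
    kept: List[Rule] = []
    for rule in rank_rules(rules):
        covered = [(x, y) for x, y in remaining if rule.matches(x)]
        # Keep the rule only if it is right for at least one uncovered instance.
        if any(y == rule.label for _, y in covered):
            kept.append(rule)
            remaining = [(x, y) for x, y in remaining if not rule.matches(x)]
    # Default class: majority among uncovered instances, else among all data.
    pool = remaining or list(data)
    default = Counter(y for _, y in pool).most_common(1)[0][0]
    return kept, default


def classify(instance: Instance, rules: List[Rule], default: str) -> str:
    # First matching rule in ranked order decides; otherwise use the default.
    for rule in rules:
        if rule.matches(instance):
            return rule.label
    return default


# Toy training data and pre-mined rules (hypothetical values).
TRAIN = [
    ({"outlook": "sunny", "windy": "no"}, "play"),
    ({"outlook": "sunny", "windy": "yes"}, "play"),
    ({"outlook": "rainy", "windy": "yes"}, "stay"),
    ({"outlook": "rainy", "windy": "no"}, "play"),
]
RULES = [
    Rule((("outlook", "sunny"),), "play", 0.50, 1.00),
    Rule((("outlook", "rainy"), ("windy", "yes")), "stay", 0.25, 1.00),
    Rule((("windy", "no"),), "play", 0.50, 1.00),
    Rule((("outlook", "rainy"),), "stay", 0.25, 0.50),
]

kept, default = prune_by_coverage(RULES, TRAIN)
print(len(kept), default)
print(classify({"outlook": "rainy", "windy": "yes"}, kept, default))
```

On this toy input the low-confidence rule is dropped because the higher-ranked rules already cover every training instance, which illustrates how pruning reduces both the number of rules and the chance of contradicting rules firing.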

6 Outlier Detection

The most recent addition to the tasks supported by EasyMiner is frequent pattern-based anomaly (outlier) detection. The main idea of the approach is that the more frequent patterns an instance contains, the less likely it is to be an anomaly. The presence or absence of the frequent patterns is then used to assign the deviation level [4]. In [19] we present an extension of the EasyMiner REST API with our new outlier detection algorithm called Frequent Pattern Isolation (FPI) [15], which is inspired by an existing algorithm called Isolation Forests (IF) [17, 18]. Since PMML does not yet support outlier (anomaly) detection, in [14] we present our proposal for a new PMML outlier model. The goal of our work was to design a modular solution that would support a broader range of anomaly detection algorithms, including our FPI method.

7 EasyMiner Architecture

During the development of the EasyMiner system, its architecture was transformed into multiple reusable web services. A schema of the architecture is shown in Figure 2. All the services are fully documented in Swagger.

Figure 2: Architecture of the system EasyMiner

The central component (service) is EasyMinerCenter. This component integrates the functionality of the other services and provides the main graphical web interface and REST API for end users. Internally, this component provides user account and task management, stores discovered association rules and works as an authentication service for other components.

For storing and preparing data before mining, the system uses the services EasyMiner-Data and EasyMiner-Preprocessing. EasyMiner-Data is a web service for management of data sources. It supports upload of data files in CSV and RDF and stores them in databases as sets of transactions. The EasyMiner-Preprocessing service supports creation of datasets from data sources stored with EasyMiner-Data, using user-defined preprocessing methods. The attributes for data mining are created from uploaded data fields using one of these preprocessing algorithms: each value as one bin, enumeration of intervals, enumeration of nominal values, equidistant intervals, equifrequent intervals, equisized intervals (by minimal support of every interval). The preprocessing algorithms as well as data storage are independent of the selected data mining algorithm. The implemented web services support hashing functionality to avoid potential problems with special characters in attribute names and their values. The following mining services work on the “safe” datasets with hashed values.

The main data mining functionality is provided by the service EasyMiner-Miner. This web service provides association rule learning, pruning of discovered association rule sets, building of classification models and outlier detection. EasyMiner-Miner initializes execution of the used R packages and other algorithms.

EasyMiner-Scorer is a web service for testing of classification models based on association rules.

8 Distributed Backend: Spark/Hadoop

As laid out in the previous section, EasyMiner is modular in terms of mining backends. In addition to the default mining backend provided by the arules and rCBA packages, EasyMiner supports an alternate one built on top of Apache Spark/Hadoop, introduced in [25].

The Spark backend is suitable for larger datasets, which can benefit from parallel computation distributed over multiple machines. The Spark backend also uses the FP-Growth frequent pattern mining algorithm instead of apriori. FP-Growth is generally considered faster than apriori. However, for smaller datasets using apriori with the R backend is recommended as it provides faster response times, due to the ability of the implementation to provide intermediate results as the mining progresses.

9 Standardization Efforts (PMML)

Even the earliest research related to EasyMiner was linked to work on standardization efforts. While association rules were already supported in the early versions of PMML, the industry standard format for exchange of data mining models, the GUHA method that was initially used did not comply with this standard, since it produced rules containing a number of constructs not supported by PMML. Since our research involved background knowledge elicited from domain experts, definition of a data format supporting this type of knowledge was also required.

In [8] we proposed a topic map-based ontology for association rule learning, which was based on the GUHA method, and in [11] an extension of this approach that dealt with domain knowledge. An extension of PMML for GUHA-based models was presented in [10] and for handling of background knowledge in [13]. Neither of these efforts was successful: the ISO Topic Maps standard faded in favour of the W3C RDF/OWL stack. The industry was not concerned with exchange of background knowledge at the time, and support of the GUHA method, implemented essentially only by the LISp-Miner system, increased the complexity of the models as opposed to the existing PMML association rule models.³ Our latest standardization effort is related to outlier detection [14] and targets PMML. This proposal is close to industry adoption as it was included in a roadmap for the next release of PMML.

³ Currently, EasyMiner supports export of association rule models in the GUHA PMML format as well as in the standard PMML 4.3 Association Rules form.

10 Features in the EasyMiner Version 2.4

Table 1 presents an overview of the most salient features that were published in some form between 2009, when the first paper on EasyMiner’s predecessor SEWEBAR-CMS appeared, and 2017, when the current version of EasyMiner was released. As follows from the table, a number of features are not supported in the current release.

11 Using EasyMiner in Your Project

During the years of development, EasyMiner was extensively used by over a thousand students at the Faculty of Informatics and Statistics to complete their assignments in association rule learning. The software has also been used in several applied research projects. For example, within the linkedtv.eu project EasyMiner was used to analyze user preferences and within the openbudgets.eu project to analyze budgetary data.

The full project is based on a composition of components and services with fully documented REST APIs. Most of the components and services⁴ are available under the open source Apache License, Version 2.0. This is an important factor which differentiates EasyMiner from the commercial MLaaS offerings. For a more detailed comparison with other machine learning systems refer to [24].

⁴ The main services were presented in Section 7.

In addition to the visual web-based interface, the project exposes a REST API. This API provides the full functionality of EasyMiner, including functions that are not yet available in the GUI. It is possible to use this API to extend your own project with data mining functionality. It is suitable for building mashup applications or for data processing using scripting languages. An example of data mining using the API is available at http://www.easyminer.eu/api-tutorial.

EasyMiner can also be extended with new algorithms: rule mining, outlier detection or scorer service. For this purpose, the integration component EasyMinerCenter provides documented interfaces in PHP.

Acknowledgment

This paper was supported by IGA grant 29/2016 of the University of Economics, Prague.

References

[1] Rakesh Agrawal, Tomasz Imielinski, and Arun N. Swami. Mining association rules between sets of items in large databases. In SIGMOD, pages 207–216. ACM Press, 1993.
[2] Michael Hahsler, Sudheer Chelluboina, Kurt Hornik, and Christian Buchta. The arules R-package ecosystem: analyzing interesting patterns from large transaction data sets. Journal of Machine Learning Research, 12(Jun):2021–2025, 2011.
[3] Andrej Hazucha, Jakub Balhar, and Tomáš Kliegr. A PHP library for Ontopia-CMS integration. In TMRA 2010. University of Leipzig, 2010.
[4] Z. He, X. Xu, Z. Huang, and S. Deng. FP-outlier: Frequent pattern based outlier detection. Computer Science and Information Systems/ComSIS, 2(1):103–118, 2005.
[5] Tomáš Kliegr, Jaroslav Kuchař, Davide Sottara, and Stanislav Vojíř. Learning business rules with association rule classifiers. In Antonis Bikakis, Paul Fodor, and Dumitru Roman, editors, Rules on the Web. From Theory to Applications: 8th International Symposium, RuleML 2014, Co-located with the 21st European Conference on Artificial Intelligence, ECAI 2014, Prague, Czech Republic, August 18-20, 2014. Proceedings, pages 236–250, Cham, 2014. Springer International Publishing.
[6] Tomáš Kliegr, Andrej Hazucha, and Tomáš Marek. Instant feedback on discovered association rules with PMML-based query-by-example. In Web Reasoning and Rule Systems. Springer, 2011.
[7] Tomáš Kliegr and Jaroslav Kuchař. Benchmark of rule-based classifiers in the news recommendation task. In Josiane Mothe, Jacques Savoy, Jaap Kamps, Karen Pinel-Sauvagnat, Gareth J. F. Jones, Eric SanJuan, Linda Cappellato, and Nicola Ferro, editors, Experimental IR Meets Multilinguality, Multimodality, and Interaction - 6th International Conference of the CLEF Association, CLEF 2015, Toulouse, France, September 8-11, 2015, Proceedings, volume 9283 of Lecture Notes in Computer Science, pages 130–141. Springer, 2015.
[8] Tomáš Kliegr, Marek Ovečka, and Jan Zemánek. Topic maps for association rule mining. In Proceedings of TMRA 2009. University of Leipzig, 2009.
[9] Tomáš Kliegr, Martin Ralbovský, Vojtěch Svátek, Milan Šimůnek, Vojtěch Jirkovský, Jan Nemrava, and Jan Zemánek. Semantic analytical reports: A framework for post-processing data mining results. In ISMIS’09: 18th International Symposium on Methodologies for Intelligent Systems, pages 453–458. Springer, 2009.
[10] Tomáš Kliegr and Jan Rauch. An XML format for association rule models based on the GUHA method. In Proceedings of the 2010 International Conference on Semantic Web Rules, RuleML’10, pages 273–288, Berlin, Heidelberg, 2010. Springer-Verlag.

[11] Tomáš Kliegr, Vojtěch Svátek, Milan Šimůnek, Daniel Šťastný, and Andrej Hazucha. An XML schema and a topic map ontology for formalization of background knowledge in data mining. In IRMLeS-2010, 2nd ESWC Workshop on Inductive Reasoning and Machine Learning for the Semantic Web, Heraklion, Crete, Greece, 2010.
[12] Tomáš Kliegr, Vojtěch Svátek, Milan Šimůnek, and Martin Ralbovský. Semantic analytical reports: A framework for post-processing of data mining results. Journal of Intelligent Information Systems, 37(3):371–395, 2011.
[13] Tomáš Kliegr, Stanislav Vojíř, and Jan Rauch. Background knowledge and PMML: first considerations. In Proceedings of the 2011 workshop on Predictive markup language modeling, PMML ’11, pages 54–62, New York, NY, USA, 2011. ACM.
[14] Jaroslav Kuchař, Adam Ashenfelter, and Tomáš Kliegr. Outlier (anomaly) detection modelling in PMML. In RuleML 2017 Poster and Challenge Proceedings. CEUR-WS, 2017.
[15] Jaroslav Kuchař and Vojtěch Svátek. Spotlighting anomalies using frequent patterns. In KDD 2017 Workshop on Anomaly Detection in Finance, Halifax, Nova Scotia, Canada, 2017.
[16] Bing Liu, Wynne Hsu, and Yiming Ma. Integrating classification and association rule mining. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, KDD’98, pages 80–86. AAAI Press, 1998.
[17] F. T. Liu, K. M. Ting, and Z. H. Zhou. Isolation forest. In Proceedings of the 8th IEEE International Conference on Data Mining (ICDM’08), pages 413–422, 2008.
[18] Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. Isolation-based anomaly detection. ACM Trans. Knowl. Discov. Data, 6(1):3:1–3:39, March 2012.
[19] Stanislav Vojíř, Jaroslav Kuchař, Václav Zeman, and Tomáš Kliegr. Using EasyMiner API for financial data analysis in project openbudgets.eu. In RuleML 2017 Poster and Challenge Proceedings. CEUR-WS, 2017. To appear.
[20] Stanislav Vojíř, Přemysl Václav Duben, and Tomáš Kliegr. Business rule learning with interactive selection of association rules. In RuleML Challenge 2014, 2014.
[21] Stanislav Vojíř, Tomáš Kliegr, Andrej Hazucha, Radek Škrabal, and Milan Šimůnek. Transforming association rules to business rules: EasyMiner meets Drools. In Paul Fodor, Dumitru Roman, Darko Anicic, Adam Wyner, Monica Palmirani, Davide Sottara, and Francois Lévy, editors, RuleML-2013 Challenge, volume 1004 of CEUR Workshop Proceedings. CEUR-WS.org, 2013.
[22] Stanislav Vojíř, Václav Zeman, Jaroslav Kuchař, and Tomáš Kliegr. EasyMiner/R preview: Towards a web interface for association rule learning and classification in R. In Challenge+DC@RuleML, 2015.
[23] Radek Škrabal, Milan Šimůnek, Stanislav Vojíř, Andrej Hazucha, Tomáš Marek, David Chudán, and Tomáš Kliegr. Association rule mining following the web search paradigm. In Peter A. Flach, Tijl Bie, and Nello Cristianini, editors, Machine Learning and Knowledge Discovery in Databases, volume 7524 of Lecture Notes in Computer Science, pages 808–811. Springer Berlin Heidelberg, 2012.
[24] Václav Zeman. Analýza cloudového řešení akademického nástroje pro dolování pravidel z databází. Systémová Integrace, 23, 2016.
[25] Václav Zeman, Stanislav Vojíř, Jaroslav Kuchař, and Tomáš Kliegr. Využití cloudu pro dolování asociačních pravidel z velkých dat přes webové rozhraní. In WIKT/DaZ 2016, 2016.