J. Hlaváčová (Ed.): ITAT 2017 Proceedings, pp. 235–239
CEUR Workshop Proceedings Vol. 1885, ISSN 1613-0073, © 2017 T. Kliegr, J. Kuchař, S. Vojíř, V. Zeman

EasyMiner – Short History of Research and Current Development

Tomáš Kliegr¹, Jaroslav Kuchař², Stanislav Vojíř¹, and Václav Zeman¹

¹ Department of Information and Knowledge Engineering, Faculty of Informatics and Statistics, University of Economics, Prague, W. Churchill Sq. 4, Prague 3, Czech Republic
² Web Intelligence Research Group, Faculty of Information Technology, Czech Technical University, Thákurova 9, 160 00, Prague 6, Czech Republic

first.last@{vse|fit.cvut}.cz

Abstract: EasyMiner (easyminer.eu) is an academic data mining project providing data mining of association rules, building of classification models based on association rules and outlier detection based on frequent pattern mining. It differs from other data mining systems by adopting the “web search” paradigm. It is web-based, providing both a REST API and a user interface, and puts emphasis on interactivity, simplicity of the user interface and immediate response. This paper gives an overview of research related to the EasyMiner project.

1 Introduction

In this paper, we present the history of research and development of the EasyMiner project (http://easyminer.eu). EasyMiner is an academic data mining project providing data mining of association rules, building of classification models based on association rules and outlier detection based on frequent pattern mining.

EasyMiner was, to our knowledge, the first interactive web-based data mining system that supported the complete machine learning process. While today there are several web-based machine learning systems on the market¹, owing to continuous development EasyMiner provides a distinct user experience. Most existing machine learning systems offer versatile user interfaces in which the user has to compose a new machine learning workflow for each task. In EasyMiner, the user interface is crafted to provide the “web search” experience: the user visually constructs a query against the data, and the system responds with a set of interesting patterns (presented as rules) or a classifier (Figure 1).

Over the years of development, EasyMiner served as a testbed for a number of new technologies and research ideas. The purpose of this paper is to give a brief overview of this research.

This paper is organized as follows. Section 2 is focused on SEWEBAR-CMS, the predecessor of EasyMiner, used in research on the use of domain knowledge in data mining. Section 3 focuses on association rule discovery. Section 4 presents the adaptation of EasyMiner for learning business rules and Section 5 its subsequent adaptation for association rule classification. Section 6 presents the current focus on outlier detection. The architecture of the system is presented in Section 7. The current development efforts also focus on distributed computation platforms, which are covered in Section 8. Since its beginnings, the research has been accompanied by standardization efforts, which are presented in Section 9. Section 10 provides an overview of the features that were developed at some point in time as well as of those that are supported by the current version of EasyMiner. Finally, Section 11 presents a case for using EasyMiner as a component in new projects requiring data mining functionality and refers the interested reader to another publication comparing it with other machine learning as a service (MLaaS) systems.

¹ Such as BigML.com or Microsoft Azure.

2 Handling of Domain Knowledge

EasyMiner evolved from the SEWEBAR (SEmantic-WEB Analytical Reports) project, which focused on semantically readable machine learning. In [9], we presented SEWEBAR-CMS as a set of extensions for the Joomla! content management system (CMS) that provide the functionality required for it to serve as a communication platform between the data analyst, the domain expert and the report user. The system later supported elicitation of domain knowledge from the analyst [12]. Association rules discovered from data with the LISp-Miner system (http://lispminer.vse.cz) were stored in a semantic form in the SEWEBAR-CMS system. The background knowledge was used to help answer user search queries, for example, to find rules that contradict existing domain knowledge [6]. Another novel element in the system was the use of an ontology for representation of the data mining domain.

Related research focused on improving semantic capabilities of content management systems [3] and on designing ontologies and schemata for representation of background knowledge [8, 11].

3 Association Rule Discovery

In its first release, EasyMiner provided a web-based interface for the LISp-Miner system, which was used for association rule mining [23]. EasyMiner interacted with LISp-Miner using its LM-Connect component, which is a web application providing the functionality of LISp-Miner through a REST API.

Figure 1: Visual query designer in EasyMiner.

Table 1: Features supported in EasyMiner 2.4. Year: when the paper describing the feature was published; API: feature available in the REST API; UI: feature available in the user interface.
Feature | Year | API | UI
Content Management System [9] | 2009 | No | No
Semantic search over discovered rules [3] | 2010 | No | No
Support for GUHA extension of PMML [10] | 2010 | Yes | Yes
Query for related (confirming, contradicting) rules to the selected rule [6] | 2011 | No | No
Editor of background knowledge [12] | 2011 | No | No
LISp-Miner interface (disjunctions, negations, partial cedents, quantifiers, cuts, coefficients) [23] | 2012 | No | No
Export of business rules to Drools [21] | 2013 | No | No
Rule pruning with CBA [5] | 2014 | Yes | Yes
Evaluation of quality of classification models [20] | 2014 | Yes | No
Rule selection and editing for classification model building [20] | 2014 | Yes | No
R interface (arules package) [22] | 2015 | Yes | Yes
Spark backend [25] | 2016 | Yes | Yes
Discretization algorithms [25] | 2016 | Yes | No
Support for the input RDF data format | 2017 | Yes | No
Outlier detection [19] | 2017 | Yes | No

EasyMiner with the LISp-Miner backend offered several unique features: 1. negation on attributes, 2. disjunction between attributes, 3. subpatterns allowing for scoping of logical connectives, 4. multiple interest measures (called quantifiers in GUHA), 5. mining directly on multivalued attributes, with no need to create "items", 6. dynamic binning operators (called coefficients in GUHA), 7. PMML-based import and export, 8. grid support.

Since the LM-Connect component is no longer developed and maintained, the integration of the current version of EasyMiner with LISp-Miner is currently not working.² The current version of EasyMiner primarily relies on the R arules package [2], which wraps a C implementation of the apriori association rule mining algorithm [1].

² It should be noted that all the features listed above can be used directly from the LISp-Miner system.

4 Learning Business Rules

One of the first use cases for EasyMiner was learning business rules. In [21] we presented a software module for EasyMiner that allows selected rules to be exported to the Drools Business Rules Management System (BRMS) by transforming the output of association rule learning into the DRL format supported by Drools. We found that the main obstacles to a straightforward use of association rules as candidate business rules are the excessive number of rules discovered even on small datasets and the fact that contradicting rules are generated. In [5] we proposed that a potential solution to these problems is provided by the seminal association rule classification algorithm CBA [16]. In [20] we presented a software module for EasyMiner that allows the domain expert to edit the discovered rules.

5 Association Rule Based Classification

In [5] we started to use the CBA algorithm for postprocessing association rule learning results into a classifier. In [22] we presented an extension of EasyMiner for building classification models. A benchmark against standard symbolic classification algorithms on a news recommender task was presented in [7].

6 Outlier Detection

The most recent addition to the tasks supported by EasyMiner is frequent pattern-based anomaly (outlier) detection. The main idea of the approach is that the more frequent patterns an instance contains, the less likely it is to be an anomaly. The presence or absence of the frequent patterns is then used to assign the deviation level [4]. In [19] we present an extension of the EasyMiner REST API with our new outlier detection algorithm called Frequent Pattern Isolation (FPI) [15], which is inspired by an existing algorithm called Isolation Forest (IF) [17, 18]. Since PMML does not yet support outlier (anomaly) detection, in [14] we present our proposal for a new PMML outlier model. The goal of our work was to design a modular solution that would support a broader range of anomaly detection algorithms, including our FPI method.

7 EasyMiner Architecture

During the development of the EasyMiner system, its architecture was transformed into multiple reusable web services. A schema of the architecture is shown in Figure 2. All the services are fully documented in Swagger.

Figure 2: Architecture of the system EasyMiner

The central component (service) is EasyMinerCenter. This component integrates the functionality of the other services and provides the main graphical web interface and REST API for end users. Internally, this component provides user account and task management, stores discovered association rules and works as an authentication service for the other components.

For storing and preparing data before mining, the system uses the services EasyMiner-Data and EasyMiner-Preprocessing. EasyMiner-Data is a web service for management of data sources. It supports upload of data files in CSV and RDF and stores them in databases as sets of transactions. The EasyMiner-Preprocessing service supports creation of datasets from data sources stored using EasyMiner-Data, applying user-defined preprocessing methods. The attributes for data mining are created from uploaded data fields using one of these preprocessing algorithms: each value-one bin, enumeration of intervals, enumeration of nominal values, equidistant intervals, equifrequent intervals, equisized intervals (by minimal support of every interval). The preprocessing algorithms as well as the data storage are independent of the selected data mining algorithm. The implemented web services support hashing functionality to avoid potential problems with special characters in attribute names and their values. The subsequent mining services work on the “safe” datasets with hashed values.

The main data mining functionality is provided by the service EasyMiner-Miner. This web service provides association rule learning, pruning of discovered association rule sets, building of classification models and outlier detection. EasyMiner-Miner initializes execution of the used R packages and other algorithms.

EasyMiner-Scorer is a web service for testing of classification models based on association rules.

8 Distributed Backend: Spark/Hadoop

As laid out in the previous section, EasyMiner is modular in terms of mining backends. In addition to the default mining backend provided by the arules and rCBA packages, EasyMiner supports an alternate one built on top of Apache Spark/Hadoop, introduced in [25].

The Spark backend is suitable for larger datasets, which can benefit from parallel computation distributed over multiple machines. The Spark backend also uses the FP-Growth frequent pattern mining algorithm instead of apriori. FP-Growth is generally considered faster than apriori. However, for smaller datasets, using apriori with the R backend is recommended as it provides faster response times, due to the ability of the implementation to provide intermediate results as the mining progresses.

9 Standardization Efforts (PMML)

Even the earliest research related to EasyMiner was linked to work on standardization efforts. While association rules were already supported in the early versions of PMML, the industry standard format for exchange of data mining models, the GUHA method that was initially used did not comply with this standard, since it produced rules containing a number of constructs not supported by PMML. Since our research involved background knowledge elicited from domain experts, definition of a data format supporting this type of knowledge was also required. In [8] we proposed a topic map-based ontology for association rule learning, which was based on the GUHA method, and in [11] an extension of this approach that dealt with domain knowledge. An extension of PMML for GUHA-based models was presented in [10] and for handling of background knowledge in [13]. Neither of these efforts was successful: the ISO Topic Maps standard waned in favour of the W3C RDF/OWL stack. The industry was not concerned with exchange of background knowledge at the time, and support of the GUHA method, implemented essentially only by the LISp-Miner system, increased the complexity of the models compared with the existing PMML association rule models.³ Our latest standardization effort is related to outlier detection [14] and targets PMML. This proposal is closer to industry adoption as it was included in a roadmap for the next release of PMML.

³ Currently, EasyMiner supports export of association rule models in the GUHA PMML format as well as in the standard PMML 4.3 Association Rules form.

10 Features in the EasyMiner Version 2.4

Table 1 presents an overview of the most salient features that were published in some form between 2009, when the first paper on EasyMiner’s predecessor SEWEBAR-CMS appeared, and 2017, when the current version of EasyMiner was released. As follows from the table, a number of features are not supported in the current release.

11 Using EasyMiner in Your Project

During the years of development, EasyMiner was extensively used by over a thousand students at the Faculty of Informatics and Statistics to complete their assignments in association rule learning. The software has also been used in several applied research projects. For example, within the linkedtv.eu project EasyMiner was used to analyze user preferences and within the openbudgets.eu project to analyze budgetary data.

The full project is based on a composition of components and services with fully documented REST APIs. Most of the components and services⁴ are available under the open source Apache License, Version 2.0. This is an important factor which differentiates EasyMiner from the commercial MLaaS offerings. For a more detailed comparison with other machine learning systems refer to [24].

In addition to the visual web-based interface, the project exposes a REST API. This API provides the full functionality of EasyMiner, including functions which are not yet available in the GUI. It is possible to use this API to extend your own project with data mining functionality. It is suitable for building mashup applications or for data processing using scripting languages. An example of data mining using the API is available at http://www.easyminer.eu/api-tutorial.

EasyMiner can also be extended with new algorithms: rule mining, outlier detection or a scorer service. For this purpose, the integration component EasyMinerCenter provides documented interfaces in PHP.

⁴ The main services were presented in Section 7.

Acknowledgment

This paper was supported by IGA grant 29/2016 of the University of Economics, Prague.

References

[1] Rakesh Agrawal, Tomasz Imielinski, and Arun N. Swami. Mining association rules between sets of items in large databases. In SIGMOD, pages 207–216. ACM Press, 1993.

[2] Michael Hahsler, Sudheer Chelluboina, Kurt Hornik, and Christian Buchta. The arules R-package ecosystem: analyzing interesting patterns from large transaction data sets. Journal of Machine Learning Research, 12(Jun):2021–2025, 2011.

[3] Andrej Hazucha, Jakub Balhar, and Tomáš Kliegr. A PHP library for Ontopia-CMS integration. In TMRA 2010. University of Leipzig, 2010.

[4] Z. He, X. Xu, Z. Huang, and S. Deng. FP-outlier: Frequent pattern based outlier detection. Computer Science and Information Systems/ComSIS, 2(1):103–118, 2005.

[5] Tomáš Kliegr, Jaroslav Kuchař, Davide Sottara, and Stanislav Vojíř. Learning business rules with association rule classifiers. In Antonis Bikakis, Paul Fodor, and Dumitru Roman, editors, Rules on the Web. From Theory to Applications: 8th International Symposium, RuleML 2014, Co-located with the 21st European Conference on Artificial Intelligence, ECAI 2014, Prague, Czech Republic, August 18-20, 2014. Proceedings, pages 236–250, Cham, 2014. Springer International Publishing.

[6] Tomáš Kliegr, Andrej Hazucha, and Tomáš Marek. Instant feedback on discovered association rules with PMML-based query-by-example. In Web Reasoning and Rule Systems. Springer, 2011.

[7] Tomáš Kliegr and Jaroslav Kuchař. Benchmark of rule-based classifiers in the news recommendation task. In Josiane Mothe, Jacques Savoy, Jaap Kamps, Karen Pinel-Sauvagnat, Gareth J. F. Jones, Eric SanJuan, Linda Cappellato, and Nicola Ferro, editors, Experimental IR Meets Multilinguality, Multimodality, and Interaction - 6th International Conference of the CLEF Association, CLEF 2015, Toulouse, France, September 8-11, 2015, Proceedings, volume 9283 of Lecture Notes in Computer Science, pages 130–141. Springer, 2015.

[8] Tomáš Kliegr, Marek Ovečka, and Jan Zemánek. Topic maps for association rule mining. In Proceedings of TMRA 2009. University of Leipzig, 2009.

[9] Tomáš Kliegr, Martin Ralbovský, Vojtěch Svátek, Milan Šimůnek, Vojtěch Jirkovský, Jan Nemrava, and Jan Zemánek. Semantic analytical reports: A framework for post-processing data mining results. In ISMIS’09: 18th International Symposium on Methodologies for Intelligent Systems, pages 453–458. Springer, 2009.

[10] Tomáš Kliegr and Jan Rauch. An XML format for association rule models based on the GUHA method. In Proceedings of the 2010 International Conference on Semantic Web Rules, RuleML’10, pages 273–288, Berlin, Heidelberg, 2010. Springer-Verlag.

[11] Tomáš Kliegr, Vojtěch Svátek, Milan Šimůnek, Daniel Štastný, and Andrej Hazucha. An XML schema and a topic map ontology for formalization of background knowledge in data mining. In IRMLeS-2010, 2nd ESWC Workshop on Inductive Reasoning and Machine Learning for the Semantic Web, Heraklion, Crete, Greece, 2010.

[12] Tomáš Kliegr, Vojtěch Svátek, Milan Šimůnek, and Martin Ralbovský. Semantic analytical reports: A framework for post-processing of data mining results. Journal of Intelligent Information Systems, 37(3):371–395, 2011.

[13] Tomáš Kliegr, Stanislav Vojíř, and Jan Rauch. Background knowledge and PMML: first considerations. In Proceedings of the 2011 workshop on Predictive markup language modeling, PMML ’11, pages 54–62, New York, NY, USA, 2011. ACM.

[14] Jaroslav Kuchař, Adam Ashenfelter, and Tomáš Kliegr. Outlier (anomaly) detection modelling in PMML. In RuleML 2017 Poster and Challenge Proceedings. CEUR-WS, 2017.

[15] Jaroslav Kuchař and Vojtěch Svátek. Spotlighting anomalies using frequent patterns. In KDD 2017 Workshop on Anomaly Detection in Finance, Halifax, Nova Scotia, Canada, 2017.

[16] Bing Liu, Wynne Hsu, and Yiming Ma. Integrating classification and association rule mining. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, KDD’98, pages 80–86. AAAI Press, 1998.

[17] F. T. Liu, K. M. Ting, and Z. H. Zhou. Isolation forest. In Proceedings of the 8th IEEE International Conference on Data Mining (ICDM’08), pages 413–422, 2008.

[18] Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. Isolation-based anomaly detection. ACM Trans. Knowl. Discov. Data, 6(1):3:1–3:39, March 2012.

[19] Stanislav Vojíř, Jaroslav Kuchař, Václav Zeman, and Tomáš Kliegr. Using EasyMiner API for financial data analysis in project openbudgets.eu. In RuleML 2017 Poster and Challenge Proceedings. CEUR-WS, 2017. To appear.

[20] Stanislav Vojíř, Přemysl Václav Duben, and Tomáš Kliegr. Business rule learning with interactive selection of association rules. RuleML Challenge, 2014, 2014.

[21] Stanislav Vojíř, Tomáš Kliegr, Andrej Hazucha, Radek Škrabal, and Milan Šimůnek. Transforming association rules to business rules: EasyMiner meets Drools. In Paul Fodor, Dumitru Roman, Darko Anicic, Adam Wyner, Monica Palmirani, Davide Sottara, and Francois Lévy, editors, RuleML-2013 Challenge, volume 1004 of CEUR Workshop Proceedings. CEUR-WS.org, 2013.

[22] Stanislav Vojíř, Václav Zeman, Jaroslav Kuchař, and Tomáš Kliegr. Easyminer/R preview: Towards a web interface for association rule learning and classification in R. In Challenge+ DC@ RuleML, 2015.

[23] Radek Škrabal, Milan Šimůnek, Stanislav Vojíř, Andrej Hazucha, Tomáš Marek, David Chudán, and Tomáš Kliegr. Association rule mining following the web search paradigm. In Peter A. Flach, Tijl Bie, and Nello Cristianini, editors, Machine Learning and Knowledge Discovery in Databases, volume 7524 of Lecture Notes in Computer Science, pages 808–811. Springer Berlin Heidelberg, 2012.

[24] Václav Zeman. Analýza cloudového řešení akademického nástroje pro dolování pravidel z databází. Systémová Integrace, 23, 2016.

[25] Václav Zeman, Stanislav Vojíř, Jaroslav Kuchař, and Tomáš Kliegr. Využití cloudu pro dolování asociačních pravidel z velkých dat přes webové rozhraní. In WIKT/DaZ 2016, 2016.
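As an illustration of the idea behind frequent pattern-based outlier detection described in Section 6 (an instance that contains more frequent patterns is less likely to be an anomaly), the following minimal Python sketch scores each instance by the summed support of the frequent patterns it contains, normalized by the number of frequent patterns, in the spirit of [4]. It is not the FPI algorithm and not EasyMiner code; the function names, the toy data and the brute-force miner are assumptions made purely for illustration.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Brute-force frequent itemset mining; only suitable for toy data."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(1, len(items) + 1):
        found_any = False
        for candidate in combinations(items, size):
            itemset = frozenset(candidate)
            support = sum(itemset <= t for t in transactions) / n
            if support >= min_support:
                frequent[itemset] = support
                found_any = True
        if not found_any:
            # Support is anti-monotone: no larger itemset can be frequent.
            break
    return frequent


def pattern_scores(transactions, frequent):
    """Score = summed support of contained frequent patterns / number of patterns.

    Low scores mark instances that match few frequent patterns.
    """
    if not frequent:
        return [0.0] * len(transactions)
    return [
        sum(s for pattern, s in frequent.items() if pattern <= t) / len(frequent)
        for t in transactions
    ]


# Hypothetical toy dataset: each instance is a set of attribute=value items.
rows = [
    frozenset({"color=red", "size=small"}),
    frozenset({"color=red", "size=small"}),
    frozenset({"color=red", "size=small"}),
    frozenset({"color=red", "size=large"}),
    frozenset({"color=blue", "size=huge"}),
]

patterns = frequent_itemsets(rows, min_support=0.4)
scores = pattern_scores(rows, patterns)
most_suspicious = scores.index(min(scores))
print(most_suspicious, scores)
```

In this toy dataset the last instance shares no frequent pattern with the rest and therefore receives the lowest score. A production system would replace the brute-force miner with apriori or FP-Growth.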
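Two of the preprocessing options listed in Section 7, equidistant and equifrequent intervals, can be sketched in a few lines of Python. This is a hedged illustration under simple assumptions (numeric input, bin indices as output, ties not treated specially); the function names are hypothetical and do not correspond to the EasyMiner-Preprocessing API.

```python
def equidistant_bins(values, k):
    """Assign each value to one of k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0 for _ in values]  # all values identical: a single bin
    # The maximum value would fall into bin k, so clamp it to the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]


def equifrequent_bins(values, k):
    """Assign each value to one of k bins holding roughly equal counts.

    Simplification: equal values may end up in different bins.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins


# Hypothetical skewed data field.
amounts = [1, 2, 3, 4, 10, 20, 30, 100]
by_width = equidistant_bins(amounts, 4)
by_count = equifrequent_bins(amounts, 4)
print(by_width)
print(by_count)
```

On skewed data such as this, equidistant intervals place most values in the first bin, while equifrequent intervals spread them evenly, which is one reason to offer both options.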