Designing KDD-Workflows via HTN-Planning for Intelligent Discovery Assistance

Jörg-Uwe Kietz^1 and Floarea Serban^1 and Abraham Bernstein^1 and Simon Fischer^2

Abstract. Knowledge Discovery in Databases (KDD) has evolved a lot during the last years and reached a mature stage offering plenty of operators to solve complex data analysis tasks. However, user support for building workflows has not progressed accordingly. The large number of operators currently available in KDD systems makes it difficult for users to successfully analyze data. In addition, the correctness of workflows is not checked before execution. Hence, the execution of a workflow frequently stops with an error after several hours of runtime.

This paper presents our tools, eProPlan and eIDA, which solve the above problems by supporting the whole life-cycle of (semi-)automatic workflow generation. Our modeling tool eProPlan allows describing operators and building a task/method decomposition grammar to specify the desired workflows. Additionally, our Intelligent Discovery Assistant, eIDA, allows placing workflows into data mining (DM) tools or workflow engines for execution.

1 Introduction

One of the challenges of Knowledge Discovery in Databases (KDD) is assisting users in creating and executing KDD workflows. Existing KDD systems such as the commercial IBM SPSS Modeler^3 or the open-source KNIME^4 and RapidMiner^5 support the user with nice graphical user interfaces. Operators can be dropped as nodes onto the working pane and the data-flow is specified by connecting the operator-nodes. This works very well as long as neither the workflow becomes too complicated nor the number of operators becomes too large.

However, in the past decade, the number of operators in such systems has been growing fast. All of them contain over 100 operators, and RapidMiner, which includes Weka, R, and several pluggable operator sets (such as anomaly detection, recommendation, text and image mining), now has around 1000. It can be expected that the transition from closed systems (with a fixed set of operators) to open systems that can also use Web services as operators (which is especially interesting for domain-specific data access and transformations) will further accelerate this growth, resulting in total confusion about which operators to use for most users.

In addition to the number of operators, the size of KDD workflows is also growing. Today's workflows easily contain hundreds of operators. Parts of the workflows are applied several times (e.g. the preprocessing sub-workflow has to be applied on training, testing, and application data), implying that the users either need to copy/paste or even repeatedly design the same sub-workflow^6 several times. As none of the systems maintains this "copy"-relationship, it is left to the user to maintain it in the light of changes. Another weak point is that workflows are not checked for correctness before execution. As a consequence, the execution of the workflow oftentimes stops with an error after several hours of runtime due to small syntactic incompatibilities between an operator and the data it should be applied on.

To address these problems several authors [1, 3, 18] propose the use of planning techniques to automatically build such workflows. However, all these approaches are limited. First, they only model a very small number of operations and were only demonstrated to work on very short workflows (less than 10 operators). Second, none of them models operations that work on individual columns of a data set: they only model operations that process all columns of a data set in the same way. Lastly, the approaches cannot scale to large numbers of operators and large workflows: their planning approaches fail in the large design space of "correct" (but nevertheless most often unwanted) solutions. A full literature review about IDAs (including these approaches) can be found in our survey [13].

In this paper we describe the first approach for designing KDD workflows based on ontologies and Hierarchical Task Network (HTN) planning [5]. Hierarchical task decomposition knowledge available in DM (e.g. CRISP-DM [2] and CITRUS [15]) can be used to significantly reduce the number of generated unwanted correct workflows. The main scientific contributions of this paper, hence, are: First, we show how KDD workflows can be designed using ontologies and HTN-planning in eProPlan. Second, we exhibit the possibility to plug our approach into existing DM-tools (as illustrated by RapidMiner and Taverna). Third, we present an evaluation of our approach that shows a significant improvement and simplification of the KDD-workflow design process. Thus, KDD researchers can easily model not only their DM and preprocessing operators but also their DM tasks, which is exploited to guide the workflow generation. Moreover, less experienced users can use our RM-IDA plugin to automatically generate workflows in only 7 clicks. Last but not least, the planning community may find it interesting to see this real-world problem powered by planning techniques and may also find some of the problems we faced and solved rather pragmatically inspiring for further research.

^1 University of Zurich, Department of Informatics, Dynamic and Distributed Information Systems Group, Binzmühlestrasse 14, CH-8050 Zurich, Switzerland {kietz|serban|bernstein}@ifi.uzh.ch
^2 Rapid-I GmbH, Stockumer Str. 475, 44227 Dortmund, Germany fischer@rapid-i.com
^3 http://www.ibm.com/software/analytics/spss/
^4 http://www.knime.org/
^5 http://rapid-i.com/content/view/181/190/
^6 Several operators must be exchanged and cannot be reapplied. Consider for example training data (with labels) and application data (without labels). Label-directed operations like feature-selection cannot be reapplied. But even if there is a label on separate test data, redoing feature selection may result in selecting different features. To apply and test the model, however, exactly the same features have to be selected.

The rest of this paper is structured as follows: Section 2 describes the knowledge used for planning, then Section 3 details the components of our system. Section 4 describes several evaluation methods and, finally, we conclude with Section 5.

2 The Planning Knowledge

We modeled our Data Mining Workflow planning problem as a Data Mining Workflow ontology (DMWF),^7 which we succinctly illustrate here (a more detailed description of it can be found in [8, 9]). The DMWF contains input/output objects and their meta-data (Sec. 2.1), which are sufficiently detailed to enable column-wise operations – a feature that is not available in any of the previous approaches. In addition, tasks and methods are used to guide and simplify the planning process. Each method consists of a sequence of steps, where each step is a (sub-)task or an operator (Sec. 2.2 shows some of the methods). In total, the DMWF contains more than 100 operators from RapidMiner (Sec. 2.3 illustrates one). The number of operators and their partial redundancy made it favorable to structure them in an inheritance hierarchy starting with abstract operators down to the basic operators which can be applied on the data. The number of operators is not a limitation of the HTN-planning approach, but a limitation set by the effort to model them and to keep them consistent with changes introduced by new releases of RapidMiner. To get a significantly higher number of modeled operators, semi-automatic modeling or at least verification methods need to be developed.

Besides the main contribution of supporting users in designing KDD workflows, this paper may also be interesting to the planning community because it shows the successful usage of planning techniques to solve the problem of workflow design and, more generally, problem-specific software-configuration. It may stimulate further research on planning, as we solved some problems that did not get much attention in planning so far. First, in our domain it is usually easy to find a correct plan. The simplest correct plan for prediction uses the default model (mean-value for regression, modal-value for classification). This is a correct solution for all predictive modeling problems, but it is only the baseline that DM wants to improve and not the wanted solution. We tackle that problem by excluding such unwanted workflows from our HTN. The real problem is not finding a solution, but handling the large number of solutions^8. We handle this by grouping plans based on meta-level equivalent output (similar characteristics of the output)^9 and by using the probabilistic patterns generated by meta-learning (see Sec. 2.4) not only to rank the enumerated workflows, but also for a heuristic beam search in the space of possible solutions. Another interesting problem is the large number of relevant operators, which we handled by embedding conditions and effects into an ontological operator hierarchy with inheritance. This is supported by our eProPlan plugin for the popular ontology editor Protégé. Furthermore, RapidMiner (and DM in general) has several special-purpose control/loop operators like cross-validation. They are parametrized operators (the number of folds and the sampling strategy for cross-validation). In contrast to other operators, it is not sufficient to choose these dominating operators, since they contain one or more subtasks that have to be planned as well. This is similar to planning general control structures like if-then-else or loops, but the problem is also easier to solve, as these dominating operators and their subtasks have a special purpose and, therefore, task/method decompositions like all other tasks. Figure 1a shows a cross-validation workflow which first uses a preprocessing task and then applies cross-validation to produce and test the model, as seen in Figure 1b. During the training step it selects the important features and then it trains the model. The produced model is then applied on the testing data. The output is an average of the n runs (where n is the number of folds, usually set to 10).

Figure 1: (a) Cross Validation as Operator (labeled as "Validation" in the Figure) in a workflow; (b) Subtasks of Cross Validation

2.1 Meta-Data to describe Input/Output Objects

The planner recognizes the IO-Objects (e.g. DataTable, DocumentCollection and ImageCollection), Background Knowledge, Model (e.g. PreprocessingModel and PredictionModel, recording required, modified, added and deleted attributes such that the conditions and effects of applying these models on the data can be computed by the planner), and Report (e.g. PerformanceVector and LiftChart). As an example, Table 1 shows the meta-data of a DataTable. The meta-data for the user-data is generated by RapidMiner's/RapidAnalytics' meta-data analyzer and passed to the planner. During the planning process the planner generates the meta-data of an operator's output objects from the operator's effect-specification (see Sec. 2.3).

Table 1: Meta-Data for a Data Table

Attribute | #Attr | Type    | Role   | #Diff | #Miss | Values | Min  | Max  | Mean  | Modal | Std.
age       | 1     | Scalar  | input  |       | 0     | []     | 20.0 | 79.0 | 50.95 |       | 16.74
genLoad   | 1     | Nominal | input  | 2     | 36    | [0,1]  |      |      |       | 1     |
label     | 1     | Nominal | target | 2     | 0     | [+,-]  |      |      |       | +     |
meas1     | 1     | Scalar  | input  |       | 0     | []     | 0.10 | 3.93 | 1.845 |       | 0.861
meas2     | 1     | Scalar  | input  |       | 30    | []     | 0.12 | 4.35 | 1.979 |       | 0.902
meas3     | 1     | Scalar  | input  |       | 0     | []     | 0.33 | 5.35 | 2.319 |       | 1.056
sex       | 1     | Nominal | input  | 2     | 0     | [f,m]  |      |      |       | m     |

^7 It is publicly available from http://www.e-LICO.eu/ontologies/dmo/e-Lico-eProPlan-DMWF-HTN.owl.
The best way to open this ontology is to download Protégé 4.0 or 4.1 from http://protege.stanford.edu/ and eProPlan from http://elico.rapid-i.com/eproplan.html (2.0.1 for Protégé 4.0 and 2.1.0 for 4.1).
^8 With column-wise operations this may be very large; just consider having 5 alternative methods to discretize 100 attributes. This results in 5^100 ≈ 10^70 possible correct plans.
^9 'Meta-level equivalent output' means that the IOO-descriptions produced by the planner are equivalent up to the names of individuals.

One of the strengths of the IDA is the ability to plan workflows with attribute-wise operations – a feature no previous approach has had so far. Especially biological micro-array data can easily contain several thousand columns, turning this possibility into a major performance bottleneck. Looking at a number of such analyses, we observed that these columns often have very similar (if not identical) meta-data and data characteristics, effectively eliminating the need to differentiate between them during planning. To keep the strength and avoid the performance bottleneck of several thousand attributes that are identical on the meta-data level, we introduced attribute groups that collect all attributes between which the operator conditions would not discriminate anyway. This worked well, as all column-wise operators are able to handle not only single columns but also specified sets of columns. Therefore, we are able to generate workflows with attribute-group-wise operations for micro-array and collaborative filtering data with even 50,000 attributes.

There are various approaches which have used meta-data to suggest the best algorithm for a given problem [12, 4, 7] in the context of meta-learning.

2.2 The Task/Method decomposition

The top-level task of the HTN is the DM task. It has six methods: Clustering, Association Rule Mining, Predictive Modeling with Test Set Evaluation (externally given separation), Predictive Modeling with Cross Validation, Predictive Modeling with Test Set Split Evaluation (random 70:30 split), and Simple Modeling with Training Set Performance.

The selection is directed by the (required) main goal, i.e. Pattern Discovery enforces Association Rule Mining, Descriptive Modeling enforces Clustering, and Predictive Modeling forces the choice of one of the others. If test data are provided, Modeling with Test Set Evaluation has to be chosen; otherwise the choice can be influenced by an (optional) evaluation subgoal (not possible with the current GUI). If several methods are still possible, they are enumerated in the rank-order provided by the probabilistic planner (see Sec. 2.4). Each method consists of a sequence of steps, where each step is a (sub-)task or an operator (it can also be an abstract operator subsuming several basic or dominating operators). Planning occurs in the order of steps; however, the resulting data flow is not necessarily linear, as the grammar allows the specification of port mappings. Figure 2a shows the method Modeling with Test Set Evaluation and its flow of IO-Objects in RapidMiner.^10 The white nodes are tasks to be planned and the other nodes are operators. Operators in RapidMiner are grouped in a way similar to the DMWF. Some of the top nodes have colors: the greenish ones are operators that deal with Model as well as objects that inherit from Model (this includes all learners and the Apply Model operator); the more purple ones are data transformation operators. RapidMiner's plan-interpreter adds a Multiply node whenever an IO-Object is used by several operators.

Figure 2: Methods for (a) Modeling with Test-Set Evaluation and (b) Preprocessing

The Preprocessing task has only one method (Fig. 2b). First an empty preprocessing model is created using the operator "Group Models". It is then extended by the next steps. All its sub-tasks have optional "Nothing to Do" methods (as shown for CleanMissingValues in Fig. 3c). Most of the preprocessing methods are recursive, handling one attribute at a time until nothing is left to be done. CleanMissingValues has two different recursive methods; the choice is made by the planner depending on the amount of missing values. If more than 30% of the values are missing, the attribute can be dropped (Figure 3b). When less than 50% are missing, it can be filled with the mean or modal value (Figure 3a). If 30-50% of the values are missing, both methods can be used and plans for both are enumerated in the order provided by the probabilistic planner. The usage of column-wise operations is illustrated in Figure 3, which shows how, depending on the amount of missing values per column, the planner chooses to fill or drop the column.

Figure 3: The (a) "Fill Missing Values", (b) "Drop Missing Values", and (c) "Empty"/"Nothing to Do" methods that can be used in Fig. 2b

Figure 4 shows a generated workflow for the UCI data set labor-negotiations, which has 16 attributes with various amounts of missing values. Note that in its output the planner compresses recursive tasks into a single task to simplify the browsing of the workflow by the user. Such column-wise handling can greatly improve the results of workflows. For users, however, it is usually too much manual effort (placing and connecting the 22 operations and setting their parameters in the workflow below).

Figure 4: Resulting Example Workflow for Missing Value Cleaning

In total, the HTN of the DMWF now contains 16 tasks with 33 methods, such that it handles general DM from single-table data sets with adequate preprocessing. However, we believe that it will show its real power in customer- or application-area-specific grammars designed and/or extended in eProPlan, resulting in adaptive customer templates for workflows. The complexity of the produced workflows depends on the characteristics of the dataset (if it has missing values, if it needs to be normalized, etc.) and on the number of modeled operators for each step of the KDD process (FeatureSelection and DataMining have more operators). Due to space limitations, we only illustrate some parts of the grammar here. The full grammar can be inspected in the publicly available DMWF-Ontology^11.

^10 Note that the figures illustrate the methods with structurally equivalent workflows in RapidMiner and do not show the more complicated method definitions in the ontology.
^11 Use an OWL2 ontology editor like Protégé to "Open OWL ontology from URI" with http://www.e-lico.eu/ontologies/dmo/e-Lico-eProPlan-DMWF-HTN.owl

2.3 The Operator Models

To be able to express the operators' conditions and effects for planning, we stored them as annotations in the ontology. Conditions and effects can contain concept expressions, SWRL rules,^12 and some extended-SWRL-like built-ins. We have introduced a set of special built-ins needed for planning (e.g., new, copy, copyComplex, etc.). These built-ins allow the planner to create, copy, and destroy objects during planning (e.g., models, produced IO-objects, weights, etc.). eProPlan allows defining new built-ins, which are stored as subclasses of the Built-in concept. Each built-in can have types/parameters and a corresponding implementation in Flora2 [16]; however, users who want to add new built-ins need some Flora-2 knowledge. They have the possibility to define new functions/operations on the data and introduce them in the conditions and effects. The built-ins' definitions with parameters and implementations are stored as class annotations.

^12 http://www.w3.org/Submission/SWRL/

Inputs and outputs of an operator are defined as concept expressions and are stored either as superclasses or as equivalent classes. The parameters and the corresponding RapidMiner operator name are stored in equivalent classes. Fig. 5 exemplifies the abstract operator for a classification learner with its corresponding inputs/outputs, condition and effect.

"ClassificationLearner":
  Equiv. class: PredictiveSupervisedLearners and
    (uses exactly 1 DataTable) and
    (produces exactly 1 PredictionModel) and
    (operatorName max 1 Literal)
  Condition: [DataTable and
    (targetAttribute exactly 1 Attribute) and
    (inputAttribute min 1 Attribute) and
    (targetColumn only (DataColumn and columnsHasType only (Categorial))) and
    (inputColumn only (DataColumn and columnsHasType only (Scalar or Categorial)))](?D)
    → new(?this), ClassificationLearner(?this), uses(?this,?D)
  Effect: uses(?this,?D), ClassificationLearner(?this),
    inputColumn(?D,?IC), targetColumn(?D,?TC)
    → copy(?M,?D,{DataTable(?D), containsColumn(?D,?_), amountOfRows(?D,?_)}),
    produces(?this,?M), PredictionModel(?M),
    needsColumn(?M,?IC), predictsColumn(?M,?TC)

Figure 5: An abstract operator: ClassificationLearner

A basic classification learner operator inherits all the characteristics of the classification learner. In addition, it can define more refined conditions or effects and more parameters. Fig. 6 shows the refinement of the general class of all classification learners to a specific Support Vector Machine implementation in RapidMiner. It has an additional condition (binary target, scalar input attributes, and no attribute is allowed to have missing values), but it does not contain a refined effect. Its effect is the one used for all classification learners (it builds a predictive model that requires all input attributes to be present to be applicable and predicts the target attribute).

"RM_Support_Vector_Machine_LibSVM_C_SVC_linear":
  Equiv. class: RM_Operator and
    (usesData exactly 1 DataTable) and
    (producesPredictionModel exactly 1 PredictionModel) and
    (simpleParameter_kernel_type value "linear") and
    (simpleParameter_svm_type value "minimal leaf size") and
    (operatorName exactly 1 {"support_vector_machine_libsvm"})
  Condition: [MissingValueFreeDataTable and
    (targetColumn exactly 1 CategorialColumn) and
    (inputColumn min 1 Thing) and
    (inputColumn only (ScalarColumn))](?D)
    → RM_Support_Vector_Machine_LibSVM_C_SVC_linear(?this),
    simpleParameter_svm_type(?this,"C-SVC"),
    simpleParameter_kernel_type(?this,"linear")

Figure 6: A basic classification learner operator

Each input/output class expression (e.g., usesData exactly 1 DataTable) has an annotation which defines its port mapping to the corresponding RapidMiner operator port (e.g., "training set"). Both conditions and effects are rules. Conditions check the applicability (lhs) and infer the parameter settings (rhs); different solutions can infer that the operator can be applied with different parameter settings. Effects compute the variable bindings (lhs) for the assertions to be made (rhs); all different solutions are asserted as the effect of one operator application.

2.4 Probabilistic Ranking of Workflows

The HTN-planning method described in this paper enumerates all possible workflows for a given problem. The operator models ensure that they are executable without error. Furthermore, the HTN-grammar prevents senseless operator combinations. For example, first normalizing the data and then discretizing it does not make sense, since the normalization effect is absorbed by the discretization. Also, converting scalar data to nominal and then converting it back is a useless operation. Another example is dropping attributes without a reason. Nonetheless, the planner can still generate a very large number of correct candidate workflows, and we are unaware of any analytical knowledge available to decide which of them will perform well on the current data. Meta-learning tries to learn relations between data characteristics and method performance. Hence, the IDA uses such meta-learned patterns [11] to order the enumeration of M candidate workflows (heuristic search) and to finally select the N best plans and present them to the user. The ranking approach works as follows: whenever several operators ('meta-level NON-equivalent output') or methods are applicable, the probabilistic planner is asked for a (local, up-to-now) ranking and delivers the plans in this order. This determines which space is enumerated if the planner is asked for a limited number of solutions. In the end, all generated alternatives are ranked (with left and right context available) for the final presentation to the user.

The planning knowledge described above also does not know about the execution time of operators. This is caused by the fact that the actual runtime of a DM method cannot be predicted easily because of the complexity of the generated models: it can be worst-case bounded in the number of examples and attributes, but the actual model size is strongly affected by statistical properties (e.g. the noise level) of the data. The runtime prediction of DM methods was first introduced in [6]. The ranking in our planner, therefore, relies on a meta-learning based method to predict the runtime of modeling and feature selection operators [19].

3 The Overall System

Our system has two main components, as illustrated in Fig. 7: eProPlan, our modeling support tool for new operators and new tasks to be solved by the planner, and eIDA, which generates and deploys workflows into DM-suites. eProPlan is the modeling environment for the DM Workflow Ontology (DMWF). It allows modeling new operators and uses a task/method decomposition grammar to solve DM problems. Designed as a plugin for the open-source ontology editor Protégé 4,^13 eProPlan exploits the advantages of the ontology as a formal model of the domain knowledge. Instead of employing the ontological inferences for planning (as done in [3, 17]), we extend the ontological formalism with the main components of a plan, namely operator conditions and effects for classical planning and a task/method decomposition grammar for HTN-planning. This allowed us to cleanly separate the inheritance from the planning mechanisms in our systems.

Figure 7: The eProPlan architecture (modeling & testing, workflow generation, reasoning & planning)

The planner is implemented in Flora2/XSB [16] and uses different parts of the workflow ontology for different purposes (see Figure 8). Specifically, it employs a specialized ABox (assertional box: individual assertions) reasoner that relies on an external TBox (terminological box: classes and properties) reasoner (e.g. Pellet^14 or FaCT++ [14]) as a subroutine. Since the TBox reasoner can only partially handle OWL2 (Web Ontology Language^15), we filter all unsupported expressions out of the ontology. The resulting inferred/completed TBox and its possibly inconsistent class definitions are passed to our ABox reasoner. The ABox reasoner, implemented in Flora2/XSB, first compiles the classified TBox obtained from Pellet on the initial ontology. Then, we process the operators together with their inputs, outputs, preconditions, and effects that are stored as OWL annotations. Tasks and methods are handled analogously. Finally, we finish with the compilation of the problem definition, which is represented by a set of individuals. The problem description has two elements: the input description in terms of meta-data (characteristics of the data like attributes, types, median, etc.) and the goals/hints entered by the user. Both are stored as a set of ABox assertions. Having specified the planning domain and the problem description, one can start planning DM workflows.

Figure 8: Workflow Ontology and AI Planner capabilities (the Flora2/XSB Prolog planner applies operators and expands HTN tasks against the DM ontology; Pellet provides the TBox classification; meta-learned probabilistic patterns rank the generated plans)

eIDA is a programming interface to the reasoner & planner used to plug such an Intelligent Discovery Assistant (IDA), based on the services of the planner, into existing systems (so far into RapidMiner and Taverna).^16 It provides methods for retrieving the plans starting from the data set meta-data and the selection of a main goal.

To improve the user experience with the RM-IDA plugin we have developed a simple installer based on precompiled binaries. It works on Linux, Mac OS X 10.5/6, Windows 7 and Windows XP systems. The RapidMiner IDA Extension can be downloaded (or even auto-installed) from the Rapid-I Marketplace.^17 So far it was downloaded over 150 times during the first two months. Both presented tools (eProPlan, eIDA) are open source and available on request.

An alternative implementation of the RapidMiner IDA Extension exists for Taverna^18. Taverna can execute all workflows composed of web services. It can execute the workflows generated by the IDA^19 using any RapidAnalytics^20 server that provides all RapidMiner operators as web services. Extensions for other KDD tools (e.g., KNIME, Enterprise Miner, etc.) would require two steps: first, modeling their corresponding operators in the DMWF; second, implementing the GUI and the plan-converter using the IDA-API.

^13 http://protege.stanford.edu/
^14 http://clarkparsia.com/pellet/
^15 http://www.w3.org/TR/owl2-profiles/
^16 http://www.taverna.org.uk/
^17 http://rapidupdate.de:8180/UpdateServer/faces/product_details.xhtml?productId=rmx_ida
^18 http://e-lico.eu/taverna-ida.html
^19 http://e-lico.eu/taverna-rm.html
^20 http://rapid-i.com/content/view/182/196/

Figure 9: IDA Interface in RapidMiner

4 Evaluation of the IDA

We tested the IDA on 108 datasets from the UCI repository of Machine Learning datasets.^21 It produced executable plans for all 78 classification and 30 regression problems. These datasets have between three and 1558 attributes, which are all nominal (from binary to many different values like ZIP codes), all scalar (normalized or not), or of mixed types. They have varying degrees of missing values. We are not aware of any other Machine Learning or DM approach that is able to adapt itself to so many different and divergent datasets. The IDA also works for less well prepared datasets like the KDD Cup 1998 challenge data (370 attributes, with up to 50% missing values and nominal data), where it generates plans of around 40 operators. Generating and ranking 20 of these workflows took 400 sec. on a 3.2 GHz Quad-Core Intel Xeon.

4.1 Ease of Use

Without an IDA, data mining is typically only achievable by specialized, highly-trained professionals such as DM consultants. They have to know a lot about DM methods and how they are implemented in tools. They have to inspect the data and combine the operators into an adequate workflow.

The IDA reduces this technical burden; it now offers "DM with 7 clicks" (see Figure 9): (1) show the IDA perspective of the tool; (2) drag the data to be analyzed from the repository to the view, or import (and annotate) your data; (3) select your main DM goal; (4) ask the IDA to generate workflows for data and goal; (5) evaluate all plans by executing them in RapidMiner; (6) select the plan you like most to see a summary of it (the screenshot in Figure 9 is made after this step); and finally, (7) inspect the plan and its results. Note that these steps do not require detailed technical knowledge anymore. Still, a user should be aware of what (s)he is doing when (s)he uses DM, i.e. (s)he should know the statistical assumptions underlying DM (e.g., a user should know what it means to have a sample that is representative, relevant, and large enough to solve a problem with DM/statistics). But this is knowledge required in any experimental science.

4.2 Speedup of Workflow Design

Besides making DM easier for inexperienced users, our main goal in building the IDA was to speed up the design of DM workflows. To establish a possible speed-up we compared the efficiency of computer science students who had attended a DM class to that of a person using the IDA. The study comprises in total 24 students (9 in 2011 and 15 in 2012). They had to solve the following DM problems:

• Take the UCI "Communities and Crime" data from http://archive.ics.uci.edu/ml/datasets/Communities+and+Crime and
  a) generate a fine clustering of the data that allows me to look for very similar communities,
  b) generate a description of the clusters (learn a model to predict the cluster label built in task a)), and
  c) generate a function to predict "ViolentCrimesPerPop" and evaluate it with 10-fold cross-validation.
• Take the UCI "Internet Advertisement" data from http://archive.ics.uci.edu/ml/datasets/Internet+Advertisements and generate an evaluated classification for the attribute "Ad/Non-Ad".

All data were provided as already imported into RapidMiner (on a local RapidAnalytics server the students could access). All students took/needed the full 3 hours.

The study confirmed that the standard DM problems they had to solve (clustering and prediction tasks on complex UCI data) can be sped up by using an IDA whilst maintaining comparable quality: it took the students 3 hours (designing and executing the workflows) to solve the tasks, whereas a non-specialist using the IDA accomplished the same tasks in 30 minutes (IDA planning plus minimal manual adaptation and execution of the workflows) with a comparable output.
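Several of the study's subtasks (e.g. Clean Missing Values, which the IDA solved with "drop & fill") correspond to the per-column policy described in Sec. 2.2: drop a column when more than 30% of its values are missing, fill it when less than 50% are missing, and enumerate both alternatives in the 30-50% overlap. A minimal, illustrative Python sketch of that decision rule (our own rendering, not the planner's Flora2 implementation; the function name is invented):

```python
# Illustrative sketch (ours, not the planner's Flora2 implementation) of the
# per-column missing-value policy from Sec. 2.2: drop a column when more than
# 30% of its values are missing, fill it when less than 50% are missing, and
# enumerate both alternatives in the 30-50% overlap.

def missing_value_actions(missing_ratio):
    """Return the cleaning actions the planner may propose for one column."""
    actions = []
    if missing_ratio > 0.30:
        actions.append("drop")
    if missing_ratio < 0.50:
        actions.append("fill")
    return actions
```

For a column with, say, 40% missing values, both plans are generated and enumerated in the order provided by the probabilistic ranking.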
Table 10 shows how many students managed to solve the given problems successfully, and how the IDA solved them.

Task                   #Students succeeded          #Students partial   IDA's solution
Crime Data:
Clean Missing Values   22/24                        1/24                drop & fill
Many-Valued Nominals   11/24                        6/24                drop
Normalization          -                            -                   yes
Clustering             9/24                         -                   2/10-means
Cluster Description    8/24                         -                   DT
Regression             RMSE<0.2: 11/24;             -                   k-NN,
                       0.22≤RMSE≤0.244: 3/11;                           RMSE=0.2±0.010
                       best RMSE=0.136: 4/24
Evaluation             15/24                        2/24                10-fold X-Val
Advertisement Data:
Clean Missing Values   17/24                        1/24                fill
Feature Selection      no                           no                  no
Classification         acc>96%: 13/24;              -                   DT,
                       86%≤acc<96%: 6/24;                               acc=96.34%±0.65
                       best acc=97.32%: 1/24
Evaluation             5/24                         7/24                10-fold X-Val

Table 10: Features of the designed solutions by students and the IDA

Both datasets had missing values, which could be ignored (-); the attributes could all be dropped, all be filled, or, depending on the amount of missing values, individually dropped & filled. Here, both students and IDA did a good job. The "Communities and Crime" data has key-like attributes (many-valued nominals) which are likely to disturb the DM results and should be dropped or marked as (to be) ignore(d). Here, only around half of the students handled this correctly. Numerical attributes with different scales cause unwanted weighting of attributes in distance-based similarity computation; therefore, they should be normalized (note 22). There are several clustering methods in RapidMiner. The best way to solve the clustering task is to use hierarchical top-down 2-means (k-means with k = 2) clustering until the grouping is fine enough. Only one student used this approach. The rest of the students and the IDA used k-means with different values for k (k < 20 is successful; larger values make the prediction very difficult). The IDA sets k = 2, and there is no way to specify the goal of a "fine clustering".
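The hierarchical top-down 2-means strategy mentioned above can be sketched as follows. This is a minimal 1-D illustration under our own assumptions, not the IDA's implementation: in particular, the stopping criterion `max_spread` ("fine enough" means the value range of a cluster is small) and all function names are hypothetical.

```python
import random

def two_means(points, iters=20, seed=0):
    """Plain 2-means on 1-D data: returns two sub-clusters.
    Assumes at least two distinct values (guaranteed by the caller)."""
    rng = random.Random(seed)
    c1, c2 = sorted(rng.sample(sorted(set(points)), 2))  # two distinct seeds
    for _ in range(iters):
        # Assign each point to its nearest centroid, then recompute centroids.
        a = [p for p in points if abs(p - c1) <= abs(p - c2)]
        b = [p for p in points if abs(p - c1) > abs(p - c2)]
        if not a or not b:  # degenerate split; keep previous assignment
            break
        c1, c2 = sum(a) / len(a), sum(b) / len(b)
    return a, b

def top_down_clustering(points, max_spread):
    """Recursively split clusters with 2-means until each one is
    'fine enough' (here: its value range is at most max_spread)."""
    if len(points) < 2 or max(points) - min(points) <= max_spread:
        return [points]
    a, b = two_means(points)
    return top_down_clustering(a, max_spread) + top_down_clustering(b, max_spread)

# Usage: three well-separated value groups come out as three clusters.
data = [0.0, 0.1, 0.2, 5.0, 5.1, 10.0, 10.2]
clusters = top_down_clustering(data, max_spread=0.5)
```

In contrast to picking a single global k, the recursion stops automatically once every cluster is "fine", which is exactly the goal the task formulation could not express to the IDA.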
The fixed k = 2 can be addressed by opening the returned workflow and changing the parameter (re-running it is not much effort). We manually chose k = 10, as we had the next task in mind and knew that it is difficult to predict a nominal attribute with too many different values (note 23). Many students, however, chose a too fine k and failed the cluster-description task (trying different methods like Naive Bayes (NB), Decision Tree (DT), or jRIP); most only built bad models using the key-like attributes that had not been dropped. For "ViolentCrimesPerPop", most students used linear regression (linReg), whereas the probabilistic ranking of the IDA preferred k-nearest neighbor. Common mistakes for this task were: applying regression to the wrong attribute; converting the numeric data to nominal and applying Naive Bayes (bad accuracy); or converting it to binary, no crime (3 examples) vs. crime (595 examples), so that the resulting model of course predicted crime everywhere. The DM step should have used 10-fold cross-validation, but some students delivered a training/test-set split (maybe to save execution time, maybe because that was used in the group exercise).

The "Internet Advertisement" data has many attributes, so Feature Selection would have been an option, but no one used it, and the IDA also had none in its top 5 ranked plans. The task was a simple binary prediction; non-ads are much more frequent (2820 non-ad vs. 459 ad). It was solved by all students, using different methods. One balanced the data, but that worsened the results. One learned a decision tree but did not evaluate the results. Some students even failed to produce a model, or split the data incorrectly (20:80).

This user evaluation provides a strong indication of the strength of the IDA. Note that the students are an optimal user group for the IDA, as they have limited DM experience but understand the principles of DM.

4.3 Performance of the generated workflows

The performance of the generated workflows depends strongly on the ranking. The baseline strategy is to rank the workflows simply by the popularity of their operators: RapidMiner automatically collects operator usage frequencies from all users who agree to submit them, and a workflow is ranked better if it contains more frequently used operators. This already produces workflows comparable to user-designed ones and was used in the speedup experiments. Our e-LICO project partners (note 24) used the planner to systematically generate workflows, executed them to obtain performance data, and applied meta-learning to these experiments [11].

In [10] they evaluated the meta-mining module and the resulting plan ranking on 65 biological datasets, which are high-dimensional with few instances/samples. For their experiments they cross-validated all performances by holding out one dataset at a time. The resulting meta-model was then used to rank the IDA-generated workflows. They found that the meta-learned rankings significantly outperformed the default, frequency-based strategy. Hence, their ranker was able to improve on our ranking and find DM workflows that maximize predictive performance.
5 Conclusion

We presented our Intelligent Discovery Assistant (eIDA and eProPlan) for planning KDD workflows. eIDA can easily be integrated into existing DM-suites or workflow engines. eProPlan is a user-friendly environment for modeling DM operators and for defining the HTN grammar that guides the planning process; furthermore, it is able to plan attribute-wise operations. The main scientific contribution of the IDA is the ability to build complex workflows out of a much larger set of operations than all previous systems. The demo presents how planning-based KDD workflow design can significantly help KDD practitioners to make their daily work more efficient.

ACKNOWLEDGEMENTS

The work described in this paper is partially supported by the European Community 7th framework program ICT-2007.4.4 under grant number 231519 "e-Lico: An e-Laboratory for Interdisciplinary Collaborative Research in DM and Data-Intensive Science" and is the result of collaborative work within this project. We would especially like to thank Phong Nguyen, Alexandros Kalousis, Melanie Hilario, and Tom Smuc for their work on meta-learning and ranking, sketched here to show the overall picture.

Notes

16 http://www.taverna.org.uk/
17 http://rapidupdate.de:8180/UpdateServer/faces/product_details.xhtml?productId=rmx_ida
18 http://e-lico.eu/taverna-ida.html
19 http://e-lico.eu/taverna-rm.html
20 http://rapid-i.com/content/view/182/196/
21 http://archive.ics.uci.edu/ml/
22 In fact, the initial data is 0-1 range normalized, so the students did not need to do that step; but preprocessing operations like filling missing values change the column statistics. The planner cannot predict the results very well for column groups, so it ensures normalization of the data at the end of preprocessing.
23 The meta-data analysis returns categorical for ≤ 10 different values and nominal otherwise. Prediction requires a categorical (or numeric) target, i.e. the IDA refuses to build a plan to predict a target with more than 10 nominal values.
24 http://www.e-LICO.eu/

REFERENCES

[1] A. Bernstein, F. Provost, and S. Hill, 'Towards Intelligent Assistance for a Data Mining Process: An Ontology-based Approach for Cost-sensitive Classification', IEEE Transactions on Knowledge and Data Engineering, 17(4), 503–518, (April 2005).
[2] P. Chapman, J. Clinton, R. Kerber, T. Khabaza, T. Reinartz, C. Shearer, and R. Wirth, 'CRISP-DM 1.0: Step-by-step data mining guide', Technical report, The CRISP-DM Consortium, (2000).
[3] C. Diamantini, D. Potena, and E. Storti, 'KDDONTO: An Ontology for Discovery and Composition of KDD Algorithms', in Proceedings of the SoKD-09 Workshop at ECML/PKDD, (2009).
[4] J. Gama and P. Brazdil, 'Characterization of classification algorithms', Progress in Artificial Intelligence, 189–200, (1995).
[5] M. Ghallab, D. Nau, and P.
Traverso, Automated Planning: Theory & Practice, Morgan Kaufmann, San Francisco, CA, USA, 2004.
[6] N. Jankowski and K. Grabczewski, 'Building meta-learning algorithms basing on search controlled by machine complexity', in Proceedings of the International Joint Conference on Neural Networks, IJCNN 2008, pp. 3601–3608. IEEE, (2008).
[7] A. Kalousis and M. Hilario, 'Model selection via meta-learning: a comparative study', in International Journal on Artificial Intelligence Tools, ICTAI 2000, pp. 406–413. IEEE, (2000).
[8] J.-U. Kietz, F. Serban, A. Bernstein, and S. Fischer, 'Towards cooperative planning of data mining workflows', in Proceedings of the SoKD-09 Workshop at ECML/PKDD, pp. 1–12, (2009).
[9] J.-U. Kietz, F. Serban, A. Bernstein, and S. Fischer, 'Data mining workflow templates for intelligent discovery assistance and auto-experimentation', in Proceedings of the SoKD-10 Workshop at ECML/PKDD, (2010).
[10] P. Nguyen and A. Kalousis, 'Evaluation report on meta-miner', Deliverable 7.2 of the EU-Project e-LICO, January 2012.
[11] P. Nguyen, A. Kalousis, and M. Hilario, 'A meta-mining infrastructure to support KD workflow optimization', in Proc. of the PlanSoKD-11 Workshop at ECML/PKDD, (2011).
[12] L. Rendell, R. Seshu, and D. Tcheng, 'Layered concept learning and dynamically-variable bias management', in Proceedings of the International Joint Conference on Artificial Intelligence, IJCAI-87, pp. 308–314. Citeseer, (1987).
[13] F. Serban, J. Vanschoren, J.-U. Kietz, and A. Bernstein, 'A Survey of Intelligent Assistants for Data Analysis', ACM Computing Surveys, (to appear 2012).
[14] D. Tsarkov and I. Horrocks, 'FaCT++ description logic reasoner: System description', Lecture Notes in Computer Science, 4130, 292, (2006).
[15] R. Wirth, C. Shearer, U. Grimmer, T. P. Reinartz, J. Schlösser, C. Breitner, R. Engels, and G. Lindner, 'Towards process-oriented tool support for knowledge discovery in databases', in Proceedings of the First European Symposium on Principles of Data Mining and Knowledge Discovery, PKDD '97, pp. 243–253, London, UK, (1997). Springer-Verlag.
[16] G. Yang, M. Kifer, and C. Zhao, 'Flora-2: A Rule-Based Knowledge Representation and Inference Infrastructure for the Semantic Web', On The Move to Meaningful Internet Systems 2003: CoopIS, DOA, and ODBASE, 671–688, (2003).
[17] M. Žaková, P. Křemen, F. Železný, and N. Lavrač, 'Automating knowledge discovery workflow composition through ontology-based planning', IEEE Transactions on Automation Science and Engineering, 8(2), 253–264, (2011).
[18] M. Žáková, V. Podpečan, F. Železný, and N. Lavrač, 'Advancing data mining workflow construction: A framework and cases using the Orange toolkit', in Proceedings of the SoKD-09 Workshop at ECML/PKDD, (2009).
[19] M. Znidarsic, M. Bohanec, N. Trdin, T. Smuc, M. Piskorec, and M. Bosnjak, 'Non-functional workflow assessment', Deliverable D7.3 of the EU Project e-LICO, November 2011.