Designing KDD-Workflows via HTN-Planning for Intelligent Discovery Assistance

Jörg-Uwe Kietz^1 and Floarea Serban^1 and Abraham Bernstein^1 and Simon Fischer^2

Abstract. Knowledge Discovery in Databases (KDD) has evolved a lot during the last years and reached a mature stage offering plenty of operators to solve complex data analysis tasks. However, user support for building workflows has not progressed accordingly. The large number of operators currently available in KDD systems makes it difficult for users to successfully analyze data. In addition, the correctness of workflows is not checked before execution. Hence, the execution of a workflow frequently stops with an error after several hours of runtime.

This paper presents our tools, eProPlan and eIDA, which solve the above problems by supporting the whole life-cycle of (semi-)automatic workflow generation. Our modeling tool eProPlan allows describing operators and building a task/method decomposition grammar to specify the desired workflows. Additionally, our Intelligent Discovery Assistant, eIDA, allows placing workflows into data mining (DM) tools or workflow engines for execution.

1 Introduction

One of the challenges of Knowledge Discovery in Databases (KDD) is assisting users in creating and executing KDD workflows. Existing KDD systems such as the commercial IBM SPSS Modeler^3 or the open-source KNIME^4 and RapidMiner^5 support the user with nice graphical user interfaces. Operators can be dropped as nodes onto the working pane and the data-flow is specified by connecting the operator-nodes. This works very well as long as neither the workflow becomes too complicated nor the number of operators becomes too large.

However, in the past decade, the number of operators in such systems has been growing fast. All of them contain over 100 operators, and RapidMiner, which includes Weka, R, and several pluggable operator sets (such as anomaly detection, recommendation, text and image mining), now has around 1000. It can be expected that the transition from closed systems (with a fixed set of operators) to open systems that can also use Web services as operators (which is especially interesting for domain-specific data access and transformations) will further accelerate this growth, resulting in total confusion about which operators to use for most users.

In addition to the number of operators, the size of KDD workflows is also growing. Today's workflows easily contain hundreds of operators. Parts of the workflows are applied several times (e.g. the preprocessing sub-workflow has to be applied on training, testing, and application data), implying that the users either need to copy/paste or even repeatedly design the same sub-workflow^6 several times. As none of the systems maintains this "copy"-relationship, it is left to the user to maintain it in the light of changes. Another weak point is that workflows are not checked for correctness before execution. As a consequence, the execution of the workflow oftentimes stops with an error after several hours of runtime due to small syntactic incompatibilities between an operator and the data it should be applied on.

To address these problems several authors [1, 3, 18] propose the use of planning techniques to automatically build such workflows. However, all these approaches are limited. First, they only model a very small number of operations and were only demonstrated to work on very short workflows (less than 10 operators). Second, none of them models operations that work on individual columns of a data set: they only model operations that process all columns of a data set in the same way. Lastly, the approaches cannot scale to large numbers of operators and large workflows: their planning approaches fail in the large design space of "correct" (but nevertheless most often unwanted) solutions. A full literature review about IDAs (including these approaches) can be found in our survey [13].

In this paper we describe the first approach for designing KDD workflows based on ontologies and Hierarchical Task Network (HTN) planning [5]. Hierarchical task decomposition knowledge available in DM (e.g. CRISP-DM [2] and CITRUS [15]) can be used to significantly reduce the number of generated unwanted correct workflows. The main scientific contributions of this paper, hence, are: First, we show how KDD workflows can be designed using ontologies and HTN-planning in eProPlan. Second, we exhibit the possibility to plug our approach into existing DM-tools (as illustrated by RapidMiner and Taverna). Third, we present an evaluation of our approach that shows a significant improvement and simplification of the KDD-workflow design process. Thus, KDD researchers can easily model not only their DM and preprocessing operators but also their DM tasks, which is exploited to guide the workflow generation. Moreover, less experienced users can use our RM-IDA plugin to automatically generate workflows in only 7 clicks. Last but not least, the planning community may find it interesting to see this real-world problem powered by planning techniques and may also find some of the problems we faced and solved rather pragmatically inspiring for further research.

^1 University of Zurich, Department of Informatics, Dynamic and Distributed Information Systems Group, Binzmühlestrasse 14, CH-8050 Zurich, Switzerland {kietz|serban|bernstein}@ifi.uzh.ch
^2 Rapid-I GmbH, Stockumer Str. 475, 44227 Dortmund, Germany fischer@rapid-i.com
^3 http://www.ibm.com/software/analytics/spss/
^4 http://www.knime.org/
^5 http://rapid-i.com/content/view/181/190/
^6 Several operators must be exchanged and cannot be reapplied. Consider for example training data (with labels) and application data (without labels). Label-directed operations like feature-selection cannot be reapplied. But even if there is a label on separate test data, redoing feature selection may result in selecting different features. To apply and test the model, however, exactly the same features have to be selected.

The rest of this paper is structured as follows: Section 2 describes the knowledge used for planning, then Section 3 details the components of our system. Section 4 describes several evaluation methods and, finally, we conclude with Section 5.

2 The Planning Knowledge

We modeled our Data Mining Workflow planning problem as a Data Mining Workflow ontology (DMWF),^7 which we succinctly illustrate here (a more detailed description of it can be found in [8, 9]). The DMWF contains input/output objects and their meta-data (Sec. 2.1), which are sufficiently detailed to enable column-wise operations – a feature that is not available in any of the previous approaches. In addition, tasks and methods are used to guide and simplify the planning process. Each method consists of a sequence of steps, where each step is a (sub-)task or an operator (Sec. 2.2 shows some of the methods). In total, the DMWF contains more than 100 operators from RapidMiner (Sec. 2.3 illustrates one). The number of operators and their partial redundancy made it favorable to structure them in an inheritance hierarchy starting with abstract operators down to the basic operators which can be applied on the data. The number of operators is not a limitation of the HTN-planning approach, but a limitation set by the effort to model them and to keep them consistent with changes introduced by new releases of RapidMiner. To get a significantly higher number of modeled operators, semi-automatic modeling or at least verification methods need to be developed.

Besides the main contribution of supporting users in designing KDD workflows, this paper may also be interesting to the planning community because it shows the successful usage of planning techniques to solve the problem of workflow design and, more generally, problem-specific software-configuration. It may stimulate further research on planning, as we solved some problems that did not get much attention in planning so far. First, in our domain it is usually easy to find a correct plan. The simplest correct plan for prediction uses the default model (mean-value for regression, modal-value for classification). This is a correct solution for all predictive modeling problems, but it is only the baseline that DM wants to improve and not the wanted solution. We tackle that problem by excluding such unwanted workflows from our HTN. The real problem is not finding a solution, but handling the large number of solutions^8. We handle this by grouping plans based on meta-level equivalent output (similar characteristics of the output)^9 and by using the probabilistic patterns generated by meta-learning (see Sec. 2.4) not only to rank the enumerated workflows, but also for a heuristic beam search in the space of possible solutions. Another interesting problem is the large number of relevant operators, which we handled by embedding conditions and effects into an ontological operator hierarchy with inheritance. This is supported by our eProPlan plugin for the popular ontology editor Protégé. Furthermore, RapidMiner (and DM in general) has several special-purpose control/loop operators like cross-validation. They are parametrized operators (the number of folds and the sampling strategy for cross-validation). In contrast to other operators, it is not sufficient to choose these dominating operators, since they contain one or more subtasks that have to be planned as well. This is similar to planning general control structures like if-then-else or loops, but the problem is also easier to solve, as these dominating operators and their subtasks have a special purpose and, therefore, task/method decompositions like all other tasks. Figure 1a shows a cross-validation workflow which first uses a preprocessing task and then applies cross-validation to produce and test the model, as seen in Figure 1b. During the training step it selects the important features and then it trains the model. The produced model is then applied on the testing data. The output is an average of the n runs (where n is the number of folds, usually set to 10).

Figure 1: (a) Cross Validation as Operator (labeled as "Validation" in the Figure) in a workflow; (b) Subtasks of Cross Validation

2.1 Meta-Data to describe Input/Output Objects

The planner recognizes the IO-Objects (e.g. DataTable, DocumentCollection and ImageCollection), Background Knowledge, Model (e.g. PreprocessingModel and PredictionModel, recording required, modified, added and deleted attributes such that the conditions and effects of applying these models on the data can be computed by the planner), and Report (e.g. PerformanceVector and LiftChart). As an example, Table 1 shows the meta-data of a DataTable. The meta-data for the user-data is generated by RapidMiner's/RapidAnalytics' meta-data analyzer and passed to the planner. During the planning process the planner generates the meta-data of an operator's output objects from the operator's effect-specification (see Sec. 2.3).

Table 1: Meta-Data for a Data Table

Attribute | #Attr | Type    | Role   | #Diff | #Miss | Values | Min  | Max  | Mean  | Modal | Std.
age       | 1     | Scalar  | input  |       | 0     | []     | 20.0 | 79.0 | 50.95 |       | 16.74
genLoad   | 1     | Nominal | input  | 2     | 36    | [0,1]  |      |      |       | 1     |
label     | 1     | Nominal | target | 2     | 0     | [+,-]  |      |      |       | +     |
meas1     | 1     | Scalar  | input  |       | 0     | []     | 0.10 | 3.93 | 1.845 |       | 0.861
meas2     | 1     | Scalar  | input  |       | 30    | []     | 0.12 | 4.35 | 1.979 |       | 0.902
meas3     | 1     | Scalar  | input  |       | 0     | []     | 0.33 | 5.35 | 2.319 |       | 1.056
sex       | 1     | Nominal | input  | 2     | 0     | [f,m]  |      |      |       | m     |

^7 It is publicly available from http://www.e-LICO.eu/ontologies/dmo/e-Lico-eProPlan-DMWF-HTN.owl.
The best way to open this ontology is to download Protégé 4.0 or 4.1 from http://protege.stanford.edu/ and eProPlan from http://elico.rapid-i.com/eproplan.html (2.0.1 for Protégé 4.0 and 2.1.0 for 4.1).
^8 With column-wise operations this may be very large; just consider having 5 alternative methods to discretize 100 attributes. This results in 5^100 ≈ 10^70 possible correct plans.
^9 'Meta-level equivalent output' means that the IOO-descriptions produced by the planner are equivalent up to the names of individuals.

One of the strengths of the IDA is the ability to plan workflows with attribute-wise operations – a feature no previous approach has had so far. Especially biological micro-array data can easily contain several thousand columns, turning this possibility into a major performance bottleneck. Looking at a number of such analyses, we observed that these columns often have very similar (if not identical) meta-data and data characteristics, effectively eliminating the need to differentiate between them during planning. To keep the strength and avoid the performance bottleneck of several thousand attributes that are identical on the meta-data level, we introduced attribute groups that collect all attributes between which the operator conditions would not discriminate anyway. This worked well, as all column-wise operators are able to handle not only single columns but also specified sets of columns. Therefore, we are able to generate workflows with attribute-group-wise operations for micro-array and collaborative filtering data with even 50,000 attributes.

There are various approaches which have used meta-data to suggest the best algorithm for a given problem [12, 4, 7] in the context of meta-learning.

2.2 The Task/Method decomposition

The top-level task of the HTN is the DM task. It has six methods: Clustering, Association Rule Mining, Predictive Modeling with Test Set Evaluation (externally given separation), Predictive Modeling with Cross Validation, Predictive Modeling with Test Set Split Evaluation (random 70:30 split), and Simple Modeling with Training Set Performance.

The selection is directed by the (required) main goal, i.e. Pattern Discovery enforces Association Rule Mining, Descriptive Modeling enforces Clustering, and Predictive Modeling forces the choice of one of the others. If test data are provided, Modeling with Test Set Evaluation has to be chosen; otherwise the choice can be influenced by an (optional) evaluation subgoal (not possible with the current GUI). If several methods are still possible, they are enumerated in the rank-order provided by the probabilistic planner (see Sec. 2.4). Each method consists of a sequence of steps, where each step is a (sub-)task or an operator (it can also be an abstract operator subsuming several basic or dominating operators). Planning occurs in the order of steps; however, the resulting data flow is not necessarily linear, as the grammar allows the specification of port mappings. Figure 2a shows the method Modeling with Test Set Evaluation and its flow of IO-Objects in RapidMiner.^10 The white nodes are tasks to be planned and the other nodes are operators. Operators in RapidMiner are grouped in a way similar to the DMWF. Some of the top nodes have colors: the greenish ones are operators that deal with Model as well as objects that inherit from Model (this includes all learners and the Apply Model operator); the more purple ones are data transformation operators. RapidMiner's plan-interpreter adds a Multiply node whenever an IO-Object is used by several operators.

Figure 2: Methods for (a) Modeling with Test-Set Evaluation and (b) Preprocessing

The Preprocessing task has only one method (Fig. 2b). First an empty preprocessing model is created using the operator "Group Models". It is then extended by the next steps. All its sub-tasks have optional "Nothing to Do" methods (as shown for CleanMissingValues in Fig. 3c). Most of the preprocessing methods are recursive, handling one attribute at a time until nothing is left to be done. CleanMissingValues has two different recursive methods; the choice is made by the planner depending on the amount of missing values. If more than 30% of the values are missing, the attribute can be dropped (Figure 3b). When less than 50% are missing, it can be filled with the mean or modal value (Figure 3a). If 30-50% of the values are missing, both methods can be used and plans for both are enumerated in the order provided by the probabilistic planner. The usage of column-wise operations is illustrated in Figure 3, which shows how, depending on the amount of missing values per column, the planner chooses to fill or drop the column.

Figure 3: The (a) "Fill Missing Values", (b) "Drop Missing Values", and (c) "Empty"/"Nothing to Do" methods that can be used in Fig. 2b

Figure 4 shows a generated workflow for the UCI data set labor-negotiations, which has 16 attributes with various amounts of missing values. Note that in its output the planner compresses recursive tasks into a single task to simplify the browsing of the workflow by the user. Such column-wise handling can greatly improve the results of workflows. For users, however, it is usually too much manual effort (placing and connecting the 22 operations and setting their parameters in the workflow below).

Figure 4: Resulting Example Workflow for Missing Value Cleaning

In total, the HTN of the DMWF now contains 16 tasks with 33 methods, such that it handles general DM from single-table data sets with adequate preprocessing. However, we believe that it will show its real power in customer- or application-area-specific grammars designed and/or extended in eProPlan, resulting in adaptive customer templates for workflows. The complexity of the produced workflows depends on the characteristics of the dataset (if it has missing values, if it needs to be normalized, etc.) and on the number of modeled operators for each step of the KDD process (FeatureSelection and DataMining have more operators). Due to space limitations, we only illustrate some parts of the grammar here. The full grammar can be inspected in the publicly available DMWF-Ontology^11.

^10 Note that the figures illustrate the methods with structurally equivalent workflows in RapidMiner and do not show the more complicated method definitions in the ontology.
^11 Use an OWL2 ontology editor like Protégé to "Open OWL ontology from URI" with http://www.e-lico.eu/ontologies/dmo/e-Lico-eProPlan-DMWF-HTN.owl

2.3 The Operator Models

To be able to express the operators' conditions and effects for planning, we stored them as annotations in the ontology. Conditions and effects can contain concept expressions, SWRL rules,^12 and some extended-SWRL-like built-ins. We have introduced a set of special built-ins needed for planning (e.g., new, copy, copyComplex, etc.). These built-ins allow the planner to create, copy, and destroy objects during planning (e.g., models, produced IO-objects, weights, etc.). eProPlan allows defining new built-ins, which are stored as subclasses of the Built-in concept. Each built-in can have types/parameters and a corresponding implementation in Flora2 [16]; however, users who want to add new built-ins need some Flora-2 knowledge. They have the possibility to define new functions/operations on the data and introduce them in the conditions and effects. The built-ins' definitions with parameters and implementations are stored as class annotations.

^12 http://www.w3.org/Submission/SWRL/

Inputs and outputs of an operator are defined as concept expressions and are stored either as superclasses or as equivalent classes. The parameters and the corresponding RapidMiner operator name are stored in equivalent classes. Fig. 5 exemplifies the abstract operator for a classification learner with its corresponding inputs/outputs, condition and effect.

"ClassificationLearner":
  Equiv. class: PredictiveSupervisedLearners and
    (uses exactly 1 DataTable) and
    (produces exactly 1 PredictionModel) and
    (operatorName max 1 Literal)
  Condition: [DataTable and
    (targetAttribute exactly 1 Attribute) and
    (inputAttribute min 1 Attribute) and
    (targetColumn only (DataColumn and columnsHasType only (Categorial))) and
    (inputColumn only (DataColumn and columnsHasType only (Scalar or Categorial)))](?D)
    → new(?this), ClassificationLearner(?this), uses(?this,?D)
  Effect: uses(?this,?D), ClassificationLearner(?this),
    inputColumn(?D,?IC), targetColumn(?D,?TC)
    → copy(?M,?D,{DataTable(?D), containsColumn(?D,?_), amountOfRows(?D,?_)}),
    produces(?this,?M), PredictionModel(?M),
    needsColumn(?M,?IC), predictsColumn(?M,?TC)

Figure 5: An abstract operator: ClassificationLearner

A basic classification learner operator inherits all the characteristics of the classification learner. In addition, it can define more refined conditions or effects and more parameters. Fig. 6 shows the refinement of the general class of all classification learners to a specific Support Vector Machine implementation in RapidMiner. It has an additional condition (binary target, scalar input attributes, and no attribute is allowed to have missing values), but it does not contain a refined effect. Its effect is the one used for all classification learners (it builds a predictive model that requires all input attributes to be present to be applicable and predicts the target attribute).

"RM_Support_Vector_Machine_LibSVM_C_SVC_linear":
  Equiv. class: RM_Operator and
    (usesData exactly 1 DataTable) and
    (producesPredictionModel exactly 1 PredictionModel) and
    (simpleParameter_kernel_type value "linear") and
    (simpleParameter_svm_type value "minimal leaf size") and
    (operatorName exactly 1 {"support_vector_machine_libsvm"})
  Condition: [MissingValueFreeDataTable and
    (targetColumn exactly 1 CategorialColumn) and
    (inputColumn min 1 Thing) and
    (inputColumn only (ScalarColumn))](?D)
    → RM_Support_Vector_Machine_LibSVM_C_SVC_linear(?this),
    simpleParameter_svm_type(?this,"C-SVC"),
    simpleParameter_kernel_type(?this,"linear")

Figure 6: A basic classification learner operator

Each input/output class expression (e.g., usesData exactly 1 DataTable) has an annotation which defines its port mapping to the corresponding RapidMiner operator port (e.g., "training set"). Both conditions and effects are rules. Conditions check the applicability (lhs) and infer the parameter settings (rhs); different solutions can infer that the operator can be applied with different parameter settings. Effects compute the variable bindings (lhs) for the assertions to be made (rhs); all different solutions are asserted as the effect of one operator application.

2.4 Probabilistic Ranking of Workflows

The HTN-planning method described in this paper enumerates all possible workflows for a given problem. The operator models ensure that they are executable without error. Furthermore, the HTN-grammar prevents senseless operator combinations. For example, first normalizing the data and then discretizing it does not make sense, since the normalization effect is absorbed by the discretization. Also, converting scalar data to nominal and then converting it back is a useless operation. Another example is dropping attributes without a reason. Nonetheless, the planner can still generate a very large number of correct candidate workflows, and we are unaware of any analytical knowledge available to decide which of them will perform well on the current data. Meta-learning tries to learn relations between data characteristics and method performance. Hence, the IDA uses such meta-learned patterns [11] to order the enumeration of M candidate workflows (heuristic search) and to finally select the N best plans and present them to the user. The ranking approach works as follows: whenever several operators ('meta-level NON-equivalent output') or methods are applicable, the probabilistic planner is asked for a (local, up-to-now) ranking and delivers the plans in this order. This determines which space is enumerated if the planner is asked for a limited number of solutions. In the end, all generated alternatives are ranked (with left and right context available) for the final presentation to the user.

The planning knowledge described above also does not know about the execution time of operators. This is caused by the fact that the actual runtime of a DM method cannot be predicted easily because of the complexity of the generated models: it can be worst-case bounded in the number of examples and attributes, but the actual model size is strongly affected by statistical properties (e.g. the noise level) of the data. The runtime prediction of DM methods was first introduced in [6]. The ranking in our planner, therefore, relies on a meta-learning based method to predict the runtime of modeling and feature selection operators [19].

3 The Overall System

Our system has two main components, as illustrated in Fig. 7: eProPlan, our modeling support tool for new operators and new tasks to be solved by the planner, and eIDA, which generates and deploys workflows into DM-suites. eProPlan is the modeling environment for the DM Workflow Ontology (DMWF). It allows modeling new operators and uses a task/method decomposition grammar to solve DM problems. Designed as a plugin for the open-source ontology editor Protégé 4,^13 eProPlan exploits the advantages of the ontology as a formal model of the domain knowledge. Instead of employing the ontological inferences for planning (as done in [3, 17]), we extend the ontological formalism with the main components of a plan, namely operator conditions and effects for classical planning and a task/method decomposition grammar for HTN-planning. This allowed us to cleanly separate the inheritance from the planning mechanisms in our systems.

Figure 7: The eProPlan architecture (modeling & testing, workflow generation, reasoning & planning)

The planner is implemented in Flora2/XSB [16] and uses different parts of the workflow ontology for different purposes (see Figure 8). Specifically, it employs a specialized ABox (assertional box: individual assertions) reasoner that relies on an external TBox (terminological box: classes and properties) reasoner (e.g. Pellet^14 or FaCT++ [14]) as a subroutine. Since the TBox reasoner can only partially handle OWL2 (Web Ontology Language^15), we filter all unsupported expressions out of the ontology. The resulting inferred/completed TBox and its possibly inconsistent class definitions are passed to our ABox reasoner. The ABox reasoner, implemented in Flora2/XSB, first compiles the classified TBox obtained from Pellet on the initial ontology. Then, we process the operators together with their inputs, outputs, preconditions, and effects that are stored as OWL annotations. Tasks and methods are handled analogously. Finally, we finish with the compilation of the problem definition, which is represented by a set of individuals. The problem description has two elements: the input description in terms of meta-data (characteristics of the data like attributes, types, median, etc.) and the goals/hints entered by the user. Both are stored as a set of ABox assertions. Having specified the planning domain and the problem description, one can start planning DM workflows.

Figure 8: Workflow Ontology and AI Planner capabilities (the Flora2/XSB Prolog planner applies operators and expands HTN tasks against the DM ontology; Pellet provides the TBox classification; meta-learned probabilistic patterns rank the generated plans)

eIDA is a programming interface to the reasoner & planner used to plug such an Intelligent Discovery Assistant (IDA), based on the services of the planner, into existing systems (so far into RapidMiner and Taverna).^16 It provides methods for retrieving the plans starting from the data set meta-data and the selection of a main goal.

To improve the user experience with the RM-IDA plugin we have developed a simple installer based on precompiled binaries. It works on Linux, Mac OS X 10.5/6, Windows 7 and Windows XP systems. The RapidMiner IDA Extension can be downloaded (or even auto-installed) from the Rapid-I Marketplace.^17 So far it was downloaded over 150 times during the first two months. Both presented tools (eProPlan, eIDA) are open source and available on request.

An alternative implementation of the RapidMiner IDA Extension exists for Taverna^18. Taverna can execute all workflows composed of web services. It can execute the workflows generated by the IDA^19 using any RapidAnalytics^20 server that provides all RapidMiner operators as web services. Extensions for other KDD tools (e.g., KNIME, Enterprise Miner, etc.) would require two steps: first, modeling their corresponding operators in the DMWF; second, implementing the GUI and the plan-converter using the IDA-API.

^13 http://protege.stanford.edu/
^14 http://clarkparsia.com/pellet/
^15 http://www.w3.org/TR/owl2-profiles/
^16 http://www.taverna.org.uk/
^17 http://rapidupdate.de:8180/UpdateServer/faces/product_details.xhtml?productId=rmx_ida
^18 http://e-lico.eu/taverna-ida.html
^19 http://e-lico.eu/taverna-rm.html
^20 http://rapid-i.com/content/view/182/196/

Figure 9: IDA Interface in RapidMiner

4 Evaluation of the IDA

We tested the IDA on 108 datasets from the UCI repository of Machine Learning datasets.^21 It produced executable plans for all 78 classification and 30 regression problems. These datasets have between three and 1558 attributes, which are all nominal (from binary to many different values like ZIP codes), all scalar (normalized or not), or of mixed types. They have varying degrees of missing values. We are not aware of any other Machine Learning or DM approach that is able to adapt itself to so many different and divergent datasets. The IDA also works for less well prepared datasets like the KDD Cup 1998 challenge data (370 attributes, with up to 50% missing values and nominal data), where it generates plans of around 40 operators. Generating and ranking 20 of these workflows took 400 sec. on a 3.2 GHz Quad-Core Intel Xeon.

4.1 Ease of Use

Without an IDA, data mining is typically only achievable by specialized, highly-trained professionals such as DM consultants. They have to know a lot about DM methods and how they are implemented in tools. They have to inspect the data and combine the operators into an adequate workflow.

The IDA reduces this technical burden; it now offers "DM with 7 clicks" (see Figure 9): (1) show the IDA perspective of the tool; (2) drag the data to be analyzed from the repository to the view, or import (and annotate) your data; (3) select your main DM goal; (4) ask the IDA to generate workflows for data and goal; (5) evaluate all plans by executing them in RapidMiner; (6) select the plan you like most to see a summary of it (the screenshot in Figure 9 is made after this step); and finally, (7) inspect the plan and its results. Note that these steps do not require detailed technical knowledge anymore. Still, a user should be aware of what (s)he is doing when (s)he uses DM, i.e. (s)he should know the statistical assumptions underlying DM (e.g., a user should know what it means to have a sample that is representative, relevant, and large enough to solve a problem with DM/statistics). But this is knowledge required in any experimental science.

4.2 Speedup of Workflow Design

Besides making DM easier for inexperienced users, our main goal in building the IDA was to speed up the design of DM workflows. To establish a possible speed-up we compared the efficiency of computer science students who had attended a DM class to that of a person using the IDA. The study comprises in total 24 students (9 in 2011 and 15 in 2012). They had to solve the following DM problems:

• Take the UCI "Communities and Crime" data from http://archive.ics.uci.edu/ml/datasets/Communities+and+Crime and
  a) generate a fine clustering of the data that allows me to look for very similar communities,
  b) generate a description of the clusters (learn a model to predict the cluster label built in task a)), and
  c) generate a function to predict "ViolentCrimesPerPop" and evaluate it with 10-fold cross-validation.
• Take the UCI "Internet Advertisement" data from http://archive.ics.uci.edu/ml/datasets/Internet+Advertisements and generate an evaluated classification for the attribute "Ad/Non-Ad".

All data were provided as already imported into RapidMiner (on a local RapidAnalytics server the students could access). All students took/needed the full 3 hours.

The study confirmed that the standard DM problems they had to solve (clustering and prediction tasks on complex UCI data) can be sped up by using an IDA whilst maintaining comparable quality: it took the students 3 hours (designing and executing the workflows) to solve the tasks, whereas a non-specialist using the IDA accomplished the same tasks in 30 minutes (IDA planning plus minimal manual adaptation and execution of the workflows) with a comparable output.
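Several of the study's subtasks (e.g. Clean Missing Values, which the IDA solved with "drop & fill") correspond to the per-column policy described in Sec. 2.2: drop a column when more than 30% of its values are missing, fill it when less than 50% are missing, and enumerate both alternatives in the 30-50% overlap. A minimal, illustrative Python sketch of that decision rule (our own rendering, not the planner's Flora2 implementation; the function name is invented):

```python
# Illustrative sketch (ours, not the planner's Flora2 implementation) of the
# per-column missing-value policy from Sec. 2.2: drop a column when more than
# 30% of its values are missing, fill it when less than 50% are missing, and
# enumerate both alternatives in the 30-50% overlap.

def missing_value_actions(missing_ratio):
    """Return the cleaning actions the planner may propose for one column."""
    actions = []
    if missing_ratio > 0.30:
        actions.append("drop")
    if missing_ratio < 0.50:
        actions.append("fill")
    return actions
```

For a column with, say, 40% missing values, both plans are generated and enumerated in the order provided by the probabilistic ranking.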
Table 10 shows how many students managed to solve the given problems successfully, and how the IDA solved them.

Task                   #Students succeeded          #Students partial   IDA's solution
Crime Data:
Clean Missing Values   22/24                        1/24                drop & fill
Many-Valued Nominals   11/24                        6/24                drop
Normalization          -                            -                   yes
Clustering             9/24                         -                   2/10-means
Cluster Description    8/24                         -                   DT
Regression             RMSE<0.2: 11/24;             -                   k-NN,
                       0.22≤RMSE≤0.244: 3/11;                           RMSE=0.2±0.010
                       best RMSE=0.136: 4/24
Evaluation             15/24                        2/24                10-fold X-Val
Advertisement Data:
Clean Missing Values   17/24                        1/24                fill
Feature Selection      no                           no                  no
Classification         acc>96%: 13/24;              -                   DT,
                       86%≤acc<96%: 6/24;                               acc=96.34%±0.65
                       best acc=97.32%: 1/24
Evaluation             5/24                         7/24                10-fold X-Val

Table 10: Features of the designed solutions by students and the IDA

Both datasets had missing values, which could be ignored (-); the attributes could all be dropped, all be filled, or, depending on the amount of missing values, individually dropped & filled. Here, both students and IDA did a good job. The "Communities and Crime" data has key-like attributes (many-valued nominals) which are likely to disturb the DM results and should be dropped or marked as (to be) ignore(d). Here, only around half of the students handled this correctly. Numerical attributes with different scales cause unwanted weighting of attributes in distance-based similarity computation; therefore, they should be normalized (note 22). There are several clustering methods in RapidMiner. The best way to solve the clustering task is to use hierarchical top-down 2-means (k-means with k = 2) clustering until the grouping is fine enough. Only one student used this approach. The rest of the students and the IDA used k-means with different values for k (k < 20 is successful; larger values make the prediction very difficult). The IDA sets k = 2, and there is no way to specify the goal of a "fine clustering".
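The hierarchical top-down 2-means strategy mentioned above can be sketched as follows. This is a minimal 1-D illustration under our own assumptions, not the IDA's implementation: in particular, the stopping criterion `max_spread` ("fine enough" means the value range of a cluster is small) and all function names are hypothetical.

```python
import random

def two_means(points, iters=20, seed=0):
    """Plain 2-means on 1-D data: returns two sub-clusters.
    Assumes at least two distinct values (guaranteed by the caller)."""
    rng = random.Random(seed)
    c1, c2 = sorted(rng.sample(sorted(set(points)), 2))  # two distinct seeds
    for _ in range(iters):
        # Assign each point to its nearest centroid, then recompute centroids.
        a = [p for p in points if abs(p - c1) <= abs(p - c2)]
        b = [p for p in points if abs(p - c1) > abs(p - c2)]
        if not a or not b:  # degenerate split; keep previous assignment
            break
        c1, c2 = sum(a) / len(a), sum(b) / len(b)
    return a, b

def top_down_clustering(points, max_spread):
    """Recursively split clusters with 2-means until each one is
    'fine enough' (here: its value range is at most max_spread)."""
    if len(points) < 2 or max(points) - min(points) <= max_spread:
        return [points]
    a, b = two_means(points)
    return top_down_clustering(a, max_spread) + top_down_clustering(b, max_spread)

# Usage: three well-separated value groups come out as three clusters.
data = [0.0, 0.1, 0.2, 5.0, 5.1, 10.0, 10.2]
clusters = top_down_clustering(data, max_spread=0.5)
```

In contrast to picking a single global k, the recursion stops automatically once every cluster is "fine", which is exactly the goal the task formulation could not express to the IDA.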
The fixed k = 2 can be addressed by opening the returned workflow and changing the parameter (re-running it is not much effort). We manually chose k = 10, as we had the next task in mind and knew that it is difficult to predict a nominal attribute with too many different values (note 23). Many students, however, chose a too fine k and failed the cluster-description task (trying different methods like Naive Bayes (NB), Decision Tree (DT), or jRIP); most only built bad models using the key-like attributes that had not been dropped. For "ViolentCrimesPerPop", most students used linear regression (linReg), whereas the probabilistic ranking of the IDA preferred k-nearest neighbor. Common mistakes for this task were: applying regression to the wrong attribute; converting the numeric data to nominal and applying Naive Bayes (bad accuracy); or converting it to binary, no crime (3 examples) vs. crime (595 examples), so that the resulting model of course predicted crime everywhere. The DM step should have used 10-fold cross-validation, but some students delivered a training/test-set split (maybe to save execution time, maybe because that was used in the group exercise).

The "Internet Advertisement" data has many attributes, so Feature Selection would have been an option, but no one used it, and the IDA also had none in its top 5 ranked plans. The task was a simple binary prediction; non-ads are much more frequent (2820 non-ad vs. 459 ad). It was solved by all students, using different methods. One balanced the data, but that worsened the results. One learned a decision tree but did not evaluate the results. Some students even failed to produce a model, or split the data incorrectly (20:80).

This user evaluation provides a strong indication of the strength of the IDA. Note that the students are an optimal user group for the IDA, as they have limited DM experience but understand the principles of DM.

4.3 Performance of the generated workflows

The performance of the generated workflows depends strongly on the ranking. The baseline strategy is to rank the workflows simply by the popularity of their operators: RapidMiner automatically collects operator usage frequencies from all users who agree to submit them, and a workflow is ranked better if it contains more frequently used operators. This already produces workflows comparable to user-designed ones and was used in the speedup experiments. Our e-LICO project partners (note 24) used the planner to systematically generate workflows, executed them to obtain performance data, and applied meta-learning to these experiments [11].

In [10] they evaluated the meta-mining module and the resulting plan ranking on 65 biological datasets, which are high-dimensional with few instances/samples. For their experiments they cross-validated all performances by holding out one dataset at a time. The resulting meta-model was then used to rank the IDA-generated workflows. They found that the meta-learned rankings significantly outperformed the default, frequency-based strategy. Hence, their ranker was able to improve on our ranking and find DM workflows that maximize predictive performance.
5 Conclusion

We presented our Intelligent Discovery Assistant (eIDA and eProPlan) for planning KDD workflows. eIDA can easily be integrated into existing DM-suites or workflow engines. eProPlan is a user-friendly environment for modeling DM operators and for defining the HTN grammar that guides the planning process; furthermore, it is able to plan attribute-wise operations. The main scientific contribution of the IDA is the ability to build complex workflows out of a much larger set of operations than all previous systems. The demo presents how planning-based KDD workflow design can significantly help KDD practitioners to make their daily work more efficient.

ACKNOWLEDGEMENTS

The work described in this paper is partially supported by the European Community 7th framework program ICT-2007.4.4 under grant number 231519 "e-Lico: An e-Laboratory for Interdisciplinary Collaborative Research in DM and Data-Intensive Science" and is the result of collaborative work within this project. We would especially like to thank Phong Nguyen, Alexandros Kalousis, Melanie Hilario, and Tom Smuc for their work on meta-learning and ranking, sketched here to show the overall picture.

Notes

16 http://www.taverna.org.uk/
17 http://rapidupdate.de:8180/UpdateServer/faces/product_details.xhtml?productId=rmx_ida
18 http://e-lico.eu/taverna-ida.html
19 http://e-lico.eu/taverna-rm.html
20 http://rapid-i.com/content/view/182/196/
21 http://archive.ics.uci.edu/ml/
22 In fact, the initial data is 0-1 range normalized, so the students did not need to do that step; but preprocessing operations like filling missing values change the column statistics. The planner cannot predict the results very well for column groups, so it ensures normalization of the data at the end of preprocessing.
23 The meta-data analysis returns categorical for ≤ 10 different values and nominal otherwise. Prediction requires a categorical (or numeric) target, i.e. the IDA refuses to build a plan to predict a target with more than 10 nominal values.
24 http://www.e-LICO.eu/

REFERENCES

[1] A. Bernstein, F. Provost, and S. Hill, 'Towards Intelligent Assistance for a Data Mining Process: An Ontology-based Approach for Cost-sensitive Classification', IEEE Transactions on Knowledge and Data Engineering, 17(4), 503–518, (April 2005).
[2] P. Chapman, J. Clinton, R. Kerber, T. Khabaza, T. Reinartz, C. Shearer, and R. Wirth, 'CRISP-DM 1.0: Step-by-step data mining guide', Technical report, The CRISP-DM Consortium, (2000).
[3] C. Diamantini, D. Potena, and E. Storti, 'KDDONTO: An Ontology for Discovery and Composition of KDD Algorithms', in Proceedings of the SoKD-09 Workshop at ECML/PKDD, (2009).
[4] J. Gama and P. Brazdil, 'Characterization of classification algorithms', Progress in Artificial Intelligence, 189–200, (1995).
[5] M. Ghallab, D. Nau, and P.
Traverso, Automated Planning: Theory & Practice, Morgan Kaufmann, San Francisco, CA, USA, 2004.
[6] N. Jankowski and K. Grabczewski, 'Building meta-learning algorithms basing on search controlled by machine complexity', in Proceedings of the International Joint Conference on Neural Networks, IJCNN 2008, pp. 3601–3608. IEEE, (2008).
[7] A. Kalousis and M. Hilario, 'Model selection via meta-learning: a comparative study', in International Journal on Artificial Intelligence Tools, ICTAI 2000, pp. 406–413. IEEE, (2000).
[8] J.-U. Kietz, F. Serban, A. Bernstein, and S. Fischer, 'Towards cooperative planning of data mining workflows', in Proceedings of the SoKD-09 Workshop at ECML/PKDD, pp. 1–12, (2009).
[9] J.-U. Kietz, F. Serban, A. Bernstein, and S. Fischer, 'Data mining workflow templates for intelligent discovery assistance and auto-experimentation', in Proceedings of the SoKD-10 Workshop at ECML/PKDD, (2010).
[10] P. Nguyen and A. Kalousis, 'Evaluation report on meta-miner', Deliverable 7.2 of the EU-Project e-LICO, January 2012.
[11] P. Nguyen, A. Kalousis, and M. Hilario, 'A meta-mining infrastructure to support KD workflow optimization', in Proc. of the PlanSoKD-11 Workshop at ECML/PKDD, (2011).
[12] L. Rendell, R. Seshu, and D. Tcheng, 'Layered concept learning and dynamically-variable bias management', in Proceedings of the International Joint Conference on Artificial Intelligence, IJCAI-87, pp. 308–314. Citeseer, (1987).
[13] F. Serban, J. Vanschoren, J.-U. Kietz, and A. Bernstein, 'A Survey of Intelligent Assistants for Data Analysis', ACM Computing Surveys, (to appear 2012).
[14] D. Tsarkov and I. Horrocks, 'FaCT++ description logic reasoner: System description', Lecture Notes in Computer Science, 4130, 292, (2006).
[15] R. Wirth, C. Shearer, U. Grimmer, T. P. Reinartz, J. Schlösser, C. Breitner, R. Engels, and G. Lindner, 'Towards process-oriented tool support for knowledge discovery in databases', in Proceedings of the First European Symposium on Principles of Data Mining and Knowledge Discovery, PKDD '97, pp. 243–253, London, UK, (1997). Springer-Verlag.
[16] G. Yang, M. Kifer, and C. Zhao, 'Flora-2: A Rule-Based Knowledge Representation and Inference Infrastructure for the Semantic Web', On The Move to Meaningful Internet Systems 2003: CoopIS, DOA, and ODBASE, 671–688, (2003).
[17] M. Žaková, P. Křemen, F. Železný, and N. Lavrač, 'Automating knowledge discovery workflow composition through ontology-based planning', IEEE Transactions on Automation Science and Engineering, 8(2), 253–264, (2011).
[18] M. Žáková, V. Podpečan, F. Železný, and N. Lavrač, 'Advancing data mining workflow construction: A framework and cases using the Orange toolkit', in Proceedings of the SoKD-09 Workshop at ECML/PKDD, (2009).
[19] M. Znidarsic, M. Bohanec, N. Trdin, T. Smuc, M. Piskorec, and M. Bosnjak, 'Non-functional workflow assessment', Deliverable D7.3 of the EU Project e-LICO, November 2011.