<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Experimental Evaluation of the e-LICO Meta-Miner (Extended Abstract)</article-title>
      </title-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>The e-LICO Meta-Miner</title>
      <p>
        The role of the AI-planner is to plan valid DM workflows by
reasoning on the applicability of DM operators at a given step i according to
their pre/post-conditions. However, since several operators can have
equivalent conditions, the number of resulting plans can be in the
order of several thousands. The goal of the meta-miner is to select,
at a given step i, among a set of candidate operators Ai, the k best ones
that will optimize the performance measure associated with the user
goal g and its input meta-data m, in order to gear the AI-planner
toward optimal plans. For this, the meta-miner makes use of a quality
function Q which scores a given plan w of length l by the quality q of the
operators that form w as:
Q(w) = q∗(o1 | g, m) · ∏_{i=2}^{l} q(oi | T (wi−1), g, m)    (1)
where T (wi−1) = [o1, .., oi−1] is the sequence of
operators selected so far, and q∗ is an initial operator quality function.
Thus the meta-miner will qualify a candidate operator by its
conditional probability of being applied given all the preceding operators,
and select those that have maximum quality to be applied at a step i.
In order to have reliable probabilities, the meta-miner makes use of
frequent workflow patterns extracted from past DM processes with
the help of the DMOP ontology such that the operator quality
function q is approximated as:
q(o | T (wi−1), g, m) ≈ aggr_{fio ∈ Fio} { supp(fio | g, m) / supp(fi−1 | g, m) }    (2)
where aggr is an aggregation function, Fio is the set of frequent
workflow patterns that match the current candidate workflow wio
built with a candidate operator o, and fi−1 is the pattern prefix for
each pattern fio ∈ Fio. More importantly, the quality of a candidate
workflow wio will depend on the support function supp(fio|g, m) of
its matching patterns. As described in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], this support function is
defined by learning a dataset similarity measure which will retrieve
a dataset’s nearest neighbors ExpN based on the input meta-data m.
We refer the reader to [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] for more details. In the next section, we
present experimental results to validate our meta-mining approach.
      </p>
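      <p>To make the scoring of Eqs. (1) and (2) concrete, the following sketch chains pattern-support ratios along a candidate workflow. The operator names and support values are hypothetical illustrations, and each candidate is assumed to match a single frequent pattern, so the aggregation aggr is trivial.</p>
      <preformat>
```python
# Minimal sketch of the workflow scoring in Eqs. (1)-(2).
# The pattern supports and operator names below are hypothetical.

# Hypothetical supports of frequent workflow patterns, conditioned
# on a (goal, meta-data) neighborhood: supp(pattern | g, m).
SUPP = {
    ("CHI",): 0.60,
    ("CHI", "C4.5"): 0.42,
    ("CHI", "NBN"): 0.30,
}

def q(op, prefix):
    """Approximate q(o | T(w_{i-1}), g, m) as the ratio of the support
    of the extended pattern to the support of its prefix (Eq. 2); with
    a single matching pattern the aggregation is trivial."""
    extended = prefix + (op,)
    if prefix not in SUPP or extended not in SUPP:
        return 0.0
    return SUPP[extended] / SUPP[prefix]

def Q(workflow, q_init):
    """Score a plan w as q*(o1) times the product of the conditional
    operator qualities q(o_i | T(w_{i-1})) for i >= 2 (Eq. 1)."""
    head, rest = workflow[0], workflow[1:]
    score = q_init(head)
    for i, op in enumerate(rest, start=1):
        score *= q(op, tuple(workflow[:i]))
    return score

q_star = lambda op: SUPP.get((op,), 0.0)  # initial operator quality
print(round(Q(["CHI", "C4.5"], q_star), 3))  # 0.6 * (0.42/0.6) = 0.42
```
      </preformat>
      <p>Ranking the candidate operators in Ai by this score and keeping the k best is then what gears the AI-planner toward the preferred plans.</p>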
    </sec>
    <sec id="sec-2">
      <title>Experiments</title>
      <p>
        To meta-mine real experiments, we selected 65 high-dimensional
biological datasets representing genomic or proteomic microarray data.
We applied on these bio-datasets 28 feature selection plus
classification workflows, and 7 classification-only workflows, using
tenfold cross-validation. We used the following 4 feature selection
algorithms: Information Gain, IG, Chi-square, CHI, ReliefF, RF, and
recursive feature elimination with SVM, SVMRFE; we fixed the
number of selected features to ten. For classification we used the
following 7 algorithms: one-nearest-neighbor, 1NN, the C4.5 and CART
decision tree algorithms, a Naive Bayes algorithm with normal
probability estimation, NBN, a logistic regression algorithm, LR, and SVM
with the linear, SVM l, and the rbf, SVM r, kernels. We used the
implementations of these algorithms provided by the RapidMiner data
mining suite with their default parameters. We ended up with a
total of 65 × (28 + 7) = 2275 base-level DM experiments, on which
we gathered all experimental metadata: fold predictions and
performance results, dataset metadata, and workflow patterns, for
meta-mining [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
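      <p>As a sanity check on the 65 × (28 + 7) = 2275 count, the experiment grid above can be enumerated directly; the acronym lists follow the text, and the pairing code is simple bookkeeping rather than part of the method.</p>
      <preformat>
```python
# Sketch of the base-level experiment grid described above.
from itertools import product

feature_selection = ["IG", "CHI", "RF", "SVMRFE"]                     # 4 FS algorithms
classifiers = ["1NN", "C4.5", "CART", "NBN", "LR", "SVM_l", "SVM_r"]  # 7 classifiers

# 28 feature-selection-plus-classification workflows ...
fs_plus_clf = [f"{fs}-{clf}" for fs, clf in product(feature_selection, classifiers)]
# ... plus 7 classification-only workflows.
clf_only = list(classifiers)

workflows = fs_plus_clf + clf_only   # 28 + 7 = 35 workflows per dataset
n_datasets = 65
print(len(workflows), n_datasets * len(workflows))  # 35 2275
```
      </preformat>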
      <p>We constrained the AI-planner so that it generates feature selection
and/or classification workflows only. We did so in order for the past
experiments to be really relevant for the type of workflows we want to
design. Note that the AI-planner can also select operators with
which we have not experimented. For feature selection, these are
Gini Index, Gini, and Information Gain Ratio, IGR; for
classification, a Naive Bayes algorithm with kernel-based probability
estimation, NBK, a Linear Discriminant Analysis algorithm, LDA, a
Rule Induction algorithm, Ripper, a Random Tree algorithm, RDT,
and a Neural Network algorithm, NNet.</p>
    </sec>
    <sec id="sec-3">
      <title>Baseline Strategy</title>
      <p>In order to assess how well our meta-miner performs, we need to
compare it with some baseline. To define this baseline, we will use as
operator quality estimates simply the operators' frequency of use within
the community of RapidMiner users. We will denote this quality
estimate for an operator o by qdef (o). Additionally, we will denote
the quality of a DM workflow, w, computed using the qdef (o) quality
estimates by Qdef (w), thus:</p>
      <p>Qdef (w) = ∏_{oi ∈ T (wf )} qdef (oi)    (3)</p>
      <p>The score qdef (o) focuses on the individual frequency of use of
the DM operators, and does not account for longer-term
interactions and combinations such as the ones captured by our frequent
patterns. It thus simply reflects the popularity of the individual
operators. The most frequently used classification
operators were C4.5, followed by NBN and SVM l. The most frequently
used feature selection algorithms were CHI and SVMRFE.</p>
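      <p>A minimal sketch of the baseline score of Eq. (3) follows; the usage frequencies are hypothetical stand-ins, since the real qdef values come from RapidMiner community statistics that we do not reproduce here.</p>
      <preformat>
```python
# Sketch of the baseline score Q_def(w) of Eq. (3): the product of the
# individual usage frequencies of the operators in a workflow.
# The frequency values below are hypothetical, not RapidMiner statistics.
from math import prod

q_def = {"CHI": 0.5, "SVMRFE": 0.3, "C4.5": 0.6, "NBN": 0.4, "SVM_l": 0.35}

def Q_def(workflow):
    """Q_def(w) = product over operators o_i in T(w_f) of q_def(o_i)."""
    return prod(q_def[op] for op in workflow)

# Workflows are then ranked by this popularity-only score:
ranked = sorted(["CHI-C4.5", "CHI-NBN", "CHI-SVM_l"],
                key=lambda w: Q_def(w.split("-")), reverse=True)
print(ranked[0])  # CHI-C4.5 scores 0.5 * 0.6 = 0.30, the highest here
```
      </preformat>
      <p>Because the score ignores operator interactions, any workflow built from individually popular operators ranks highly, regardless of how well those operators combine.</p>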
    </sec>
    <sec id="sec-4">
      <title>Evaluation and Comparison Strategy</title>
      <p>The evaluation will be done in a leave-one-dataset-out manner, where
we will use our selection strategies on the remaining 64 datasets to
generate workflows for the dataset that was left out. On the left-out
dataset, we will then determine the K best workflows using the
baseline strategy as well as using the meta-miner selection strategy. To
compare the performance of the ordered set of workflows constructed
by each strategy, we will use the average estimated performance of
the K workflows on the given dataset, which we will denote by φa.
We will report the average of φa over all the datasets. Additionally,
we will estimate the statistical significance of the number of times
over all the datasets that the meta-miner strategy has a higher φa
than the baseline strategy; we will denote this by φs. We estimated
the neighborhood ExpN of a dataset using N = 5 nearest neighbors.
We will compare the performance of the baseline and of the
meta-miner for K = 1, 3, 5 generated workflows in order to get a
broad picture of their overall performance.</p>
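      <p>The evaluation protocol can be sketched as follows. The per-dataset φa table is a toy example, and the one-sided binomial sign test is one plausible reading of the φs significance computation, not necessarily the exact test used.</p>
      <preformat>
```python
# Skeleton of the leave-one-dataset-out comparison described above;
# the phi_a table is hypothetical and the significance computation is
# assumed to be a one-sided binomial sign test.
from math import comb

def sign_test_p(wins, n):
    """One-sided binomial sign test: P(X >= wins) for X ~ Bin(n, 0.5)."""
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n

def leave_one_out(datasets, phi_a):
    """For each left-out dataset, compare the average top-K performance
    phi_a of the two strategies and count the meta-miner's wins."""
    wins = sum(1 for d in datasets if phi_a[d]["Q"] > phi_a[d]["Qdef"])
    return wins, sign_test_p(wins, len(datasets))

# Toy example: the meta-miner wins on 5 of 6 left-out datasets.
phi_a = {d: {"Q": 0.8 if d != "d3" else 0.7, "Qdef": 0.75}
         for d in ["d0", "d1", "d2", "d3", "d4", "d5"]}
wins, p = leave_one_out(list(phi_a), phi_a)
print(wins, round(p, 3))  # 5 wins, p = 7/64, about 0.109
```
      </preformat>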
    </sec>
    <sec id="sec-5">
      <title>Performance Results and Comparisons</title>
      <p>K=1. The top-1 workflow selected by the baseline strategy is
CHI-C4.5. When we compare its performance against that
of the top-1 workflow selected by the meta-miner, given in the first
row of table 1, we see that the meta-mining strategy gives an
average performance improvement of around 6% over the baseline
strategy. In addition, its improvement over the baseline is statistically
significant on 53 of the 65 datasets, while the baseline wins on only 11
datasets.</p>
      <p>K=3. The two workflows selected by the baseline strategy in
addition to the top-1 are CHI-NBN and CHI-SVM l. When we
extend the selection to the three best workflows, we obtain the
results given in the second row of table 1, where we see that the
average predictive performance improvement over the baseline strategy
is around 2%. As before, the meta-miner achieves significantly
better performance than the baseline on a larger number of
datasets than vice versa.</p>
      <p>K=5. The two workflows selected by the baseline strategy in
addition to the top-3 are SVMRFE-C4.5 and SVMRFE-SVM l.
We give the results of the five best workflows selected by the
meta-miner in the last row of table 1, where we observe similar trends as
before: a 2% average performance improvement and a statistically
significant difference in the number of improvements in favor of the
meta-mining strategy.</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>Average performance φa of the top-K workflows selected with the baseline score Qdef and with the meta-miner score Q.</p>
        </caption>
        <table>
          <thead>
            <tr><th>K</th><th>Qdef</th><th>Q</th></tr>
          </thead>
          <tbody>
            <tr><td>K = 1</td><td>71.92%</td><td>77.68%</td></tr>
            <tr><td>K = 3</td><td>75.04%</td><td>77.28%</td></tr>
            <tr><td>K = 5</td><td>75.18%</td><td>77.14%</td></tr>
          </tbody>
        </table>
        <table-wrap-foot>
          <p>The φs significance values reported with the table are p = 2e-7 and p = 0.006.</p>
        </table-wrap-foot>
      </table-wrap>
    </sec>
    <sec id="sec-6">
      <title>Selected Workflows</title>
      <p>We briefly discuss the top-K workflows selected by the
meta-miner. For K = 1, we have on a plurality of datasets the selection of
the LDA classifier, an algorithm we have not experimented with. This
happens because within the DMOP ontology this algorithm is related
both to the linear SVM, SVM l, and to the Naive Bayes algorithm, both
of which perform well on our dataset collection. For K = 3 and
K = 5, we additionally have the selection of the previously unseen
NNet and Ripper classifiers. These operator selections demonstrate
the capability of the meta-miner to select new operators based on
their DMOP-given algorithm similarities with past ones.</p>
    </sec>
    <sec id="sec-7">
      <title>Conclusion and Future Works</title>
      <p>This is a preliminary study, but already we see that we are able to
deliver better workflow suggestions, in terms of predictive
performance, compared to the baseline strategy, while at the same time
being able to suggest workflows consisting of operators with which we
have never experimented. Future work includes more detailed
experimentation and evaluation, and the construction of similarity measures
combining both the dataset characteristics and the workflow patterns.</p>
    </sec>
    <sec id="sec-8">
      <title>ACKNOWLEDGEMENTS</title>
      <p>We would like to thank Jörg-Uwe Kietz and Simon Fischer for their
contribution to the development and evaluation of the e-LICO
meta-miner.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Melanie</given-names>
            <surname>Hilario</surname>
          </string-name>
          , Phong Nguyen, Huyen Do, Adam Woznica, and Alexandros Kalousis, '
          <article-title>Ontology-based meta-mining of knowledge discovery workflows'</article-title>
          , in
          <source>Meta-Learning in Computational Intelligence</source>
          , eds., N. Jankowski, W. Duch, and K. Grabczewski, Springer, (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Jörg-Uwe</given-names>
            <surname>Kietz</surname>
          </string-name>
          , Floarea Serban, Abraham Bernstein, and Simon Fischer, '
          <article-title>Towards Cooperative Planning of Data Mining Workflows'</article-title>
          ,
          <source>in Proc of the ECML/PKDD09 Workshop on Third Generation Data Mining: Towards Service-oriented Knowledge Discovery (SoKD-09)</source>
          , (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Phong</given-names>
            <surname>Nguyen</surname>
          </string-name>
          , Alexandros Kalousis, and Melanie Hilario, '
          <article-title>A meta-mining infrastructure to support KD workflow optimization'</article-title>
          ,
          <source>in Proc. of the PlanSoKD-2011 Workshop</source>
          at ECML/PKDD-2011, (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>