<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Experimental Evaluation of the e-LICO Meta-Miner (Extended Abstract)</article-title>
      </title-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>The e-LICO Meta-Miner</title>
      <p>
        The role of the AI-planner is to plan valid DM workflows by
reasoning on the applicability of DM operators at a given step i according to
their pre/post-conditions. However, since several operators can have
equivalent conditions, the number of resulting plans can be in the
order of several thousands. The goal of the meta-miner is to select,
at a given step i, among a set of candidate operators Ai, the k best ones
that will optimize the performance measure associated with the user
goal g and its input meta-data m, in order to gear the AI-planner
toward optimal plans. For this, the meta-miner makes use of a quality
function Q which scores a given plan w of length l by the quality q of the
operators that form w as:
Q(w) = q∗(o1 | g, m) · ∏_{i=2}^{l} q(oi | T (wi−1), g, m)    (1)
where T (wi−1) = [o1, .., oi−1] is the sequence of
operators selected so far, and q∗ is an initial operator quality function.
Thus the meta-miner will qualify a candidate operator by its
conditional probability of being applied given all the preceding operators,
and select those that have maximum quality to be applied at a step i.
In order to have reliable probabilities, the meta-miner makes use of
frequent workflow patterns extracted from past DM processes with
the help of the DMOP ontology such that the operator quality
function q is approximated as:
q(o | T (wi−1), g, m) ≈ aggr_{fio ∈ Fio} { supp(fio | g, m) / supp(fi−1 | g, m) }    (2)
where aggr is an aggregation function, Fio is the set of frequent
workflow patterns that match the current candidate workflow wio
built with a candidate operator o, and fi−1 is the pattern prefix for
each pattern fio ∈ Fio. More importantly, the quality of a candidate
workflow wio will depend on the support function supp(fio|g, m) of
its matching patterns. As described in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], this support function is
defined by learning a dataset similarity measure which will retrieve
a dataset’s nearest neighbors ExpN based on the input meta-data m.
We refer the reader to [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] for more details. In the next section, we
present experimental results to validate our meta-mining approach.
      </p>
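      <p>To make the scoring of Eqs. (1) and (2) concrete, the following sketch chains pattern-support ratios along a candidate workflow. The operator names and support values are hypothetical illustrations, and each candidate is assumed to match a single frequent pattern, so the aggregation aggr is trivial.</p>
      <preformat>
```python
# Minimal sketch of the workflow scoring in Eqs. (1)-(2).
# The pattern supports and operator names below are hypothetical.

# Hypothetical supports of frequent workflow patterns, conditioned
# on a (goal, meta-data) neighborhood: supp(pattern | g, m).
SUPP = {
    ("CHI",): 0.60,
    ("CHI", "C4.5"): 0.42,
    ("CHI", "NBN"): 0.30,
}

def q(op, prefix):
    """Approximate q(o | T(w_{i-1}), g, m) as the ratio of the support
    of the extended pattern to the support of its prefix (Eq. 2); with
    a single matching pattern the aggregation is trivial."""
    extended = prefix + (op,)
    if prefix not in SUPP or extended not in SUPP:
        return 0.0
    return SUPP[extended] / SUPP[prefix]

def Q(workflow, q_init):
    """Score a plan w as q*(o1) times the product of the conditional
    operator qualities q(o_i | T(w_{i-1})) for i >= 2 (Eq. 1)."""
    head, rest = workflow[0], workflow[1:]
    score = q_init(head)
    for i, op in enumerate(rest, start=1):
        score *= q(op, tuple(workflow[:i]))
    return score

q_star = lambda op: SUPP.get((op,), 0.0)  # initial operator quality
print(round(Q(["CHI", "C4.5"], q_star), 3))  # 0.6 * (0.42/0.6) = 0.42
```
      </preformat>
      <p>Ranking the candidate operators in Ai by this score and keeping the k best is then what gears the AI-planner toward the preferred plans.</p>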
    </sec>
    <sec id="sec-2">
      <title>Experiments</title>
      <p>
        To meta-mine real experiments, we selected 65 high-dimensional
biological datasets representing genomic or proteomic microarray data.
We applied on these bio-datasets 28 feature selection plus
classification workflows, and 7 classification-only workflows, using
tenfold cross-validation. We used the following 4 feature selection
algorithms: Information Gain, IG, Chi-square, CHI, ReliefF, RF, and
recursive feature elimination with SVM, SVMRFE; we fixed the
number of selected features to ten. For classification we used the
following 7 algorithms: one-nearest-neighbor, 1NN, the C4.5 and CART
decision tree algorithms, a Naive Bayes algorithm with normal
probability estimation, NBN, a logistic regression algorithm, LR, and SVM
with the linear, SVM l, and the rbf, SVM r, kernels. We used the
implementations of these algorithms provided by the RapidMiner data
mining suite with their default parameters. We ended up with a
total of 65 × (28 + 7) = 2275 base-level DM experiments, on which
we gathered all experimental metadata: fold predictions and
performance results, dataset metadata, and workflow patterns, for
meta-mining [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
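      <p>As a sanity check on the 65 × (28 + 7) = 2275 count, the experiment grid above can be enumerated directly; the acronym lists follow the text, and the pairing code is simple bookkeeping rather than part of the method.</p>
      <preformat>
```python
# Sketch of the base-level experiment grid described above.
from itertools import product

feature_selection = ["IG", "CHI", "RF", "SVMRFE"]                     # 4 FS algorithms
classifiers = ["1NN", "C4.5", "CART", "NBN", "LR", "SVM_l", "SVM_r"]  # 7 classifiers

# 28 feature-selection-plus-classification workflows ...
fs_plus_clf = [f"{fs}-{clf}" for fs, clf in product(feature_selection, classifiers)]
# ... plus 7 classification-only workflows.
clf_only = list(classifiers)

workflows = fs_plus_clf + clf_only   # 28 + 7 = 35 workflows per dataset
n_datasets = 65
print(len(workflows), n_datasets * len(workflows))  # 35 2275
```
      </preformat>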
      <p>We constrained the AI-planner so that it generates feature selection
and/or classification workflows only. We did so in order for the past
experiments to be really relevant for the type of workflows we want to
design. Note that the AI-planner can also select operators with
which we have not experimented. For feature selection, these are
Gini Index, Gini, and Information Gain Ratio, IGR; for
classification, a Naive Bayes algorithm with kernel-based probability
estimation, NBK, a Linear Discriminant Analysis algorithm, LDA, a
Rule Induction algorithm, Ripper, a Random Tree algorithm, RDT,
and a Neural Network algorithm, NNet.</p>
    </sec>
    <sec id="sec-3">
      <title>Baseline Strategy</title>
      <p>In order to assess how well our meta-miner performs, we need to
compare it with some baseline. To define this baseline, we will use as
operator quality estimates simply the operators' frequency of use within
the community of RapidMiner users. We will denote this quality
estimate for an operator o by qdef (o). Additionally, we will denote
the quality of a DM workflow, w, computed using the qdef (o) quality
estimates by Qdef (w), thus:</p>
      <p>Qdef (w) = ∏_{oi ∈ T (wf )} qdef (oi)    (3)</p>
      <p>The score qdef (o) focuses on the individual frequency of use of
the DM operators, and does not account for longer-term
interactions and combinations such as the ones captured by our frequent
patterns. It thus simply reflects the popularity of the individual
operators. The most frequently used classification
operators were C4.5, followed by NBN and SVM l. The most frequently
used feature selection algorithms were CHI and SVMRFE.</p>
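      <p>A minimal sketch of the baseline score of Eq. (3) follows; the usage frequencies are hypothetical stand-ins, since the real qdef values come from RapidMiner community statistics that we do not reproduce here.</p>
      <preformat>
```python
# Sketch of the baseline score Q_def(w) of Eq. (3): the product of the
# individual usage frequencies of the operators in a workflow.
# The frequency values below are hypothetical, not RapidMiner statistics.
from math import prod

q_def = {"CHI": 0.5, "SVMRFE": 0.3, "C4.5": 0.6, "NBN": 0.4, "SVM_l": 0.35}

def Q_def(workflow):
    """Q_def(w) = product over operators o_i in T(w_f) of q_def(o_i)."""
    return prod(q_def[op] for op in workflow)

# Workflows are then ranked by this popularity-only score:
ranked = sorted(["CHI-C4.5", "CHI-NBN", "CHI-SVM_l"],
                key=lambda w: Q_def(w.split("-")), reverse=True)
print(ranked[0])  # CHI-C4.5 scores 0.5 * 0.6 = 0.30, the highest here
```
      </preformat>
      <p>Because the score ignores operator interactions, any workflow built from individually popular operators ranks highly, regardless of how well those operators combine.</p>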
    </sec>
    <sec id="sec-4">
      <title>Evaluation and Comparison Strategy</title>
      <p>The evaluation will be done in a leave-one-dataset-out manner, where
we will use our selection strategies on the remaining 64 datasets to
generate workflows for the dataset that was left out. On the left-out
dataset, we will then determine the K best workflows using the
baseline strategy as well as using the meta-miner selection strategy. To
compare the performance of the ordered set of workflows constructed
by each strategy, we will use the average estimated performance of
the K workflows on the given dataset, which we will denote by φa.
We will report the average of φa over all the datasets. Additionally,
we will estimate the statistical significance of the number of times
over all the datasets that the meta-miner strategy has a higher φa
than the baseline strategy; we will denote this by φs. We estimated
the neighborhood ExpN of a dataset using N = 5 nearest neighbors.
We will compare the performance of the baseline and of the
meta-miner for K = 1, 3, 5 generated workflows in order to get a
broad picture of their overall performance.</p>
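      <p>The evaluation protocol can be sketched as follows. The per-dataset φa table is a toy example, and the one-sided binomial sign test is one plausible reading of the φs significance computation, not necessarily the exact test used.</p>
      <preformat>
```python
# Skeleton of the leave-one-dataset-out comparison described above;
# the phi_a table is hypothetical and the significance computation is
# assumed to be a one-sided binomial sign test.
from math import comb

def sign_test_p(wins, n):
    """One-sided binomial sign test: P(X >= wins) for X ~ Bin(n, 0.5)."""
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n

def leave_one_out(datasets, phi_a):
    """For each left-out dataset, compare the average top-K performance
    phi_a of the two strategies and count the meta-miner's wins."""
    wins = sum(1 for d in datasets if phi_a[d]["Q"] > phi_a[d]["Qdef"])
    return wins, sign_test_p(wins, len(datasets))

# Toy example: the meta-miner wins on 5 of 6 left-out datasets.
phi_a = {d: {"Q": 0.8 if d != "d3" else 0.7, "Qdef": 0.75}
         for d in ["d0", "d1", "d2", "d3", "d4", "d5"]}
wins, p = leave_one_out(list(phi_a), phi_a)
print(wins, round(p, 3))  # 5 wins, p = 7/64, about 0.109
```
      </preformat>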
    </sec>
    <sec id="sec-5">
      <title>Performance Results and Comparisons</title>
      <p>K=1. The top-1 workflow selected by the baseline strategy is
CHI-C4.5. When we compare its performance against that
of the top-1 workflow selected by the meta-miner, given in the first
row of table 1, we see that the meta-mining strategy gives an
average performance improvement of around 6% over the baseline
strategy. In addition, its improvement over the baseline is statistically
significant on 53 of the 65 datasets, while the baseline wins on only 11
datasets.</p>
      <p>K=3. The two workflows selected by the baseline strategy in
addition to the top-1 are CHI-NBN and CHI-SVM l. When we
extend the selection to the three best workflows, we obtain the
results given in the second row of table 1, where we see that the
average predictive performance improvement over the baseline strategy
is around 2%. As before, the meta-miner achieves significantly
better performance than the baseline on a larger number of
datasets than vice versa.</p>
      <p>K=5. The two workflows selected by the baseline strategy in
addition to the top-3 are SVMRFE-C4.5 and SVMRFE-SVM l.
We give the results of the five best workflows selected by the
meta-miner in the last row of table 1, where we observe similar trends as
before: a 2% average performance improvement and a statistically
significant difference in the number of improvements in favor of the
meta-mining strategy.</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>Average performance φa of the top-K workflows selected with the baseline score Qdef and with the meta-miner score Q.</p>
        </caption>
        <table>
          <thead>
            <tr><th>K</th><th>Qdef</th><th>Q</th></tr>
          </thead>
          <tbody>
            <tr><td>K = 1</td><td>71.92%</td><td>77.68%</td></tr>
            <tr><td>K = 3</td><td>75.04%</td><td>77.28%</td></tr>
            <tr><td>K = 5</td><td>75.18%</td><td>77.14%</td></tr>
          </tbody>
        </table>
        <table-wrap-foot>
          <p>The φs significance values reported with the table are p = 2e-7 and p = 0.006.</p>
        </table-wrap-foot>
      </table-wrap>
    </sec>
    <sec id="sec-6">
      <title>Selected Workflows</title>
      <p>We briefly discuss the top-K workflows selected by the
meta-miner. For K = 1, we have on a plurality of datasets the selection of
the LDA classifier, an algorithm we have not experimented with. This
happens because within the DMOP ontology this algorithm is related
both to the linear SVM, SVM l, and to the Naive Bayes algorithm, both
of which perform well on our dataset collection. For K = 3 and
K = 5, we additionally have the selection of the previously unseen
NNet and Ripper classifiers. These operator selections demonstrate
the capability of the meta-miner to select new operators based on
their DMOP-given algorithm similarities with past ones.</p>
    </sec>
    <sec id="sec-7">
      <title>Conclusion and Future Works</title>
      <p>This is a preliminary study, but already we see that we are able to
deliver better workflow suggestions, in terms of predictive
performance, compared to the baseline strategy, while at the same time
being able to suggest workflows consisting of operators with which we
have never experimented. Future work includes more detailed
experimentation and evaluation, and the construction of similarity measures
combining both the dataset characteristics and the workflow patterns.</p>
    </sec>
    <sec id="sec-8">
      <title>ACKNOWLEDGEMENTS</title>
      <p>We would like to thank Jörg-Uwe Kietz and Simon Fischer for their
contribution to the development and evaluation of the e-LICO
meta-miner.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Melanie</given-names>
            <surname>Hilario</surname>
          </string-name>
          , Phong Nguyen, Huyen Do, Adam Woznica, and Alexandros Kalousis, '
          <article-title>Ontology-based meta-mining of knowledge discovery workflows'</article-title>
          , in
          <source>Meta-Learning in Computational Intelligence</source>
          , eds., N. Jankowski, W. Duch, and K. Grabczewski, Springer, (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Jörg-Uwe</given-names>
            <surname>Kietz</surname>
          </string-name>
          , Floarea Serban, Abraham Bernstein, and Simon Fischer, '
          <article-title>Towards Cooperative Planning of Data Mining Workflows'</article-title>
          ,
          <source>in Proc of the ECML/PKDD09 Workshop on Third Generation Data Mining: Towards Service-oriented Knowledge Discovery (SoKD-09)</source>
          , (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Phong</given-names>
            <surname>Nguyen</surname>
          </string-name>
          , Alexandros Kalousis, and Melanie Hilario, '
          <article-title>A meta-mining infrastructure to support KD workflow optimization'</article-title>
          ,
          <source>in Proc. of the PlanSoKD-2011 Workshop</source>
          at ECML/PKDD-2011, (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>