<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Evaluation and Experimental Design in Data Mining and Machine Learning: Motivation and Summary of EDML 2019</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Eirini Ntoutsi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Erich Schubert</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Albrecht Zimmermann</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Leibniz University Hannover &amp; L3S Research Center</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <fpage>2</fpage>
      <lpage>4</lpage>
      <abstract>
        <p>A vital part of proposing new machine learning and data mining approaches is evaluating them empirically, to allow an assessment of their capabilities. Numerous choices go into setting up such experiments: how to choose the data, how to preprocess them (or not), potential problems associated with the selection of datasets, which other techniques to compare to (if any), which metrics to evaluate, and, last but not least, how to present and interpret the results. Learning how to make those choices on the job, often by copying the evaluation protocols used in the existing literature, can easily lead to the development of problematic habits. Numerous, albeit scattered, publications have called attention to those questions and have occasionally called into question published results, or the usability of published methods [11, 4, 2, 9, 12, 3, 1, 5]. At a time of intense discussions about a reproducibility crisis in the natural, social, and life sciences, and with conferences such as SIGMOD, KDD, and ECML PKDD encouraging researchers to make their work as reproducible as possible, we therefore feel that it is important to bring researchers together to discuss those issues on a fundamental level. An issue directly related to the first choice mentioned above is the following: even the best-designed experiment carries only limited information if the underlying data are lacking. We therefore also want to discuss questions related to the availability of data: whether they are reliable and diverse, and whether they correspond to realistic and/or challenging problem settings.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Topics of Interest</title>
      <p>In this workshop, we mainly solicited contributions that discuss those questions on a fundamental level, take stock of the state of the art, offer theoretical arguments, or take well-argued positions, as well as actual evaluation papers that offer new insights, e.g., by questioning published results or shining a spotlight on the characteristics of existing benchmark datasets. As such, topics include, but are not limited to:</p>
    </sec>
    <sec id="sec-2">
      <title>Benchmark datasets for data mining tasks: are they diverse/realistic/challenging?</title>
    </sec>
    <sec id="sec-3">
      <title>Impact of data quality (redundancy, errors, noise, bias, imbalance, ...) on qualitative evaluation</title>
    </sec>
    <sec id="sec-4">
      <title>Propagation/amplification of data quality issues in the data mining results (also the interplay between data and algorithms)</title>
    </sec>
    <sec id="sec-5">
      <title>Evaluation of unsupervised data mining (dilemma between novelty and validity)</title>
    </sec>
    <sec id="sec-6">
      <title>Evaluation measures</title>
      <p>(Automatic) data quality evaluation tools: What
are the aspects one should check before starting to
apply algorithms to given data?</p>
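<p>The question above, of what to check before applying algorithms to given data, can be made concrete with a small sketch. The following is a minimal, hypothetical illustration (plain Python; the function name and the chosen statistics are our own, not a standard tool) that flags three common red flags: missing values, exact duplicate records, and class imbalance.</p>

```python
from collections import Counter

def data_quality_report(rows, label_key=None):
    """Compute a few red-flag statistics for a tabular dataset
    given as a list of dicts (one dict per record)."""
    n = len(rows)
    keys = sorted({k for row in rows for k in row})
    # Fraction of missing (None) values per attribute.
    missing = {k: sum(1 for r in rows if r.get(k) is None) / n for k in keys}
    # Fraction of exact duplicate records (a common source of leakage
    # between training and test data).
    distinct = len({tuple(r.get(k) for k in keys) for r in rows})
    report = {"n": n, "missing": missing, "duplicate_rate": 1 - distinct / n}
    if label_key is not None:
        counts = Counter(r[label_key] for r in rows)
        # Ratio of majority to minority class size: large values signal imbalance.
        report["imbalance_ratio"] = max(counts.values()) / min(counts.values())
    return report

rows = [
    {"x": 1.0, "y": None, "label": "a"},
    {"x": 1.0, "y": None, "label": "a"},   # exact duplicate of the first row
    {"x": 2.0, "y": 3.0, "label": "a"},
    {"x": 4.0, "y": 5.0, "label": "b"},
]
report = data_quality_report(rows, label_key="label")
print(report["duplicate_rate"])   # 0.25
print(report["imbalance_ratio"])  # 3.0
```

<p>Duplicates in particular can silently inflate evaluation scores when they end up on both sides of a train/test split, as observed for the Amazon reviews datasets [1].</p>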
    </sec>
    <sec id="sec-7">
      <title>Issues around runtime evaluation (algorithm vs. implementation, dependency on hardware, algorithm parameters, dataset characteristics)</title>
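<p>The algorithm-vs.-implementation distinction in the title above can be illustrated with a deliberately small, hypothetical sketch: the two summation variants below realize the exact same O(n) algorithm, yet their runtimes differ by a large constant factor because one runs as interpreted bytecode and the other inside the C runtime. A timing measurement therefore always characterizes an implementation on a platform, never an algorithm in isolation.</p>

```python
import timeit

def sum_loop(xs):
    """Summation as an explicit, interpreted Python loop."""
    total = 0
    for x in xs:
        total += x
    return total

xs = list(range(10_000))

# Both variants realize the same O(n) algorithm and return the same result...
assert sum_loop(xs) == sum(xs)

# ...but the built-in (implemented in C) is typically much faster.
t_loop = timeit.timeit(lambda: sum_loop(xs), number=200)
t_builtin = timeit.timeit(lambda: sum(xs), number=200)
print(f"loop: {t_loop:.4f}s, builtin: {t_builtin:.4f}s")
```

<p>Benchmarking efforts such as the FIMI workshops ran into exactly this confound: a faster binary may reflect a better compiler, language, or memory layout rather than a better algorithm [5].</p>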
    </sec>
    <sec id="sec-8">
      <title>Design guidelines for crowd-sourced evaluations</title>
      <sec id="sec-8-1">
        <title>Contributions</title>
        <p>The workshop featured a mix of invited speakers; a number of accepted presentations with ample time for questions, since those contributions were expected to be less technical and more philosophical in nature; and an extensive discussion on the current state, the areas that most urgently need improvement, and recommendations to achieve those improvements.</p>
        <p>3.1 Invited Presentations Four invited presentations enriched the workshop with focused talks around the problems of evaluation in unsupervised learning.</p>
        <p>The first invited presentation, by Ricardo J. G. B. Campello, University of Newcastle, was on "Evaluation of Unsupervised Learning Results: Making the Seemingly Impossible Possible". Ricardo elaborated on the specific difficulties in the evaluation of unsupervised data mining methods (namely clustering and outlier detection) and reported on some recent solutions and improvements, with a special focus on the first internal evaluation measure for outlier detection [<xref ref-type="bibr" rid="ref6">6</xref>].</p>
        <p>The second invited presentation, by Kate Smith-Miles, University of Melbourne, was on "Instance Spaces for Objective Assessment of Algorithms and Benchmark Test Suites", describing attempts to characterize datasets in a way that allows a map of the landscape of varying problems, showing where which algorithms perform well, and thereby also identifying areas where no good algorithm is available. This approach has been applied to characterize optimization problems [<xref ref-type="bibr" rid="ref7">7</xref>] and classification problems [<xref ref-type="bibr" rid="ref8">8</xref>]. It would be interesting to see it applied to unsupervised learning problems as well.</p>
        <p>The third invited presentation, by Bart Goethals, University of Antwerp, reported on "Lessons learned from the FIMI workshops", a series of workshops that Bart ran with others roughly 15 years ago, focusing on the runtime behavior of algorithms for frequent pattern mining [<xref ref-type="bibr" rid="ref2 ref4">4, 2</xref>]. Bart highlighted the various problems encountered in these attempts, for example the difficulty of assessing truly algorithmic merits as opposed to implementation details.</p>
        <p>The fourth invited presentation, by Milos Radovanovic, University of Novi Sad, reported on observations regarding "Clustering Evaluation in High-Dimensional Data" and an apparent bias that is shown by some evaluation indices w.r.t. the dimensionality of the data [<xref ref-type="bibr" rid="ref10">10</xref>].</p>
        <p>3.2 Contributed Papers The submitted papers discussed a variety of problems around the topic of the workshop.</p>
        <p>In "EvalNE: A Framework for Evaluating Network Embeddings on Link Prediction", Alexandru Mara, Jefrey Lijffijt, and Tijl De Bie describe an evaluation framework for benchmarking existing and potentially new algorithms in the targeted area, motivated by an observed lack of reproducibility.</p>
        <p>Martin Aumüller and Matteo Ceccarello contributed a study on "Benchmarking Nearest Neighbor Search: Influence of Local Intrinsic Dimensionality and Result Diversity in Real-World Datasets", in which they study the influence of intrinsic dimensionality on the performance of approximate nearest neighbor search.</p>
        <p>In their contribution "Context-Driven Data Mining through Bias Removal and Incompleteness Mitigation", Feras Batarseh and Ajay Kulkarni describe case studies on the use of context to overcome obstacles rooted in data quality (or a lack thereof), and thereby to improve the quality achieved in the corresponding data mining application.</p>
        <p>Based on the instance space analysis techniques for optimization and classification problems discussed earlier in the invited presentation by Kate Smith-Miles, in "Instance space analysis for unsupervised outlier detection" Sevvandi Kandanaarachchi, Mario Muñoz, and Kate Smith-Miles discuss an approach to extend these techniques to the unsupervised, and therefore more challenging, problem of outlier detection.</p>
        <p>The contribution "Characterizing Transactional Databases for Frequent Itemset Mining" by Christian Lezcano and Marta Arias proposes a list of metrics to capture the representativeness and diversity of benchmark datasets for frequent itemset mining.</p>
        <p>3.3 Program Committee The workshop would not have been possible without the generous help and the time and effort put into reviewing submissions by:</p>
        <p>Martin Aumüller, IT University of Copenhagen</p>
        <p>James Bailey, University of Melbourne</p>
        <p>Roberto Bayardo, Google</p>
        <p>Christian Borgelt, University of Salzburg</p>
        <p>Ricardo J. G. B. Campello, University of Newcastle</p>
        <p>Sarah Cohen-Boulakia, Université Paris-Sud</p>
        <p>Ryan R. Curtin, Symantec Corporation</p>
        <p>Tijl De Bie, Ghent University</p>
      </sec>
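<p>As an illustration of the notion of local intrinsic dimensionality studied in the contribution by Martin Aumüller and Matteo Ceccarello, the following sketch implements the well-known maximum-likelihood LID estimator from k-nearest-neighbor distances. This is our own minimal brute-force version for illustration, not the authors' code, and the exact normalization varies slightly between papers.</p>

```python
import math
import random

def lid_mle(dists):
    """Maximum-likelihood estimate of local intrinsic dimensionality (LID)
    from the sorted distances of a query point to its k nearest neighbors:
    LID = -k / sum_i log(r_i / r_k)."""
    r_k = dists[-1]  # distance to the k-th (farthest) neighbor
    return -len(dists) / sum(math.log(r / r_k) for r in dists)

def knn_dists(data, q, k):
    """Brute-force Euclidean k-nearest-neighbor distances of q within data."""
    return sorted(math.dist(q, p) for p in data if p != q)[:k]

# On points spread over the unit square the estimate should be near 2;
# on points along a line segment, near 1.
random.seed(0)
plane = [(random.random(), random.random()) for _ in range(2000)]
line = [(random.random(), 0.0) for _ in range(2000)]
print(lid_mle(knn_dists(plane, (0.5, 0.5), 50)))  # roughly 2
print(lid_mle(knn_dists(line, (0.5, 0.0), 50)))   # roughly 1
```

<p>The same local estimates, aggregated over many query points, are what characterize a dataset as "easy" or "hard" for approximate nearest-neighbor search in such benchmarks.</p>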
    </sec>
    <sec id="sec-9">
      <title>Program Committee (continued)</title>
      <p>Marcus Edel, Freie Universität Berlin</p>
      <p>Bart Goethals, University of Antwerp</p>
      <p>Markus Goldstein, Hochschule Ulm</p>
      <p>Nathalie Japkowicz, American University</p>
      <p>Daniel Lemire, University of Quebec</p>
      <p>Philippe Lenca, IMT Atlantique</p>
      <p>Helmut Neukirchen, University of Iceland</p>
      <p>Jürgen Pfeffer, Technical University of Munich</p>
      <p>Milos Radovanovic, University of Novi Sad</p>
      <p>Protiva Rahman, Ohio State University</p>
      <p>Mohak Shah, LG Electronics</p>
      <p>Kate Smith-Miles, University of Melbourne</p>
      <p>Joaquin Vanschoren, Eindhoven University of Technology</p>
      <p>Ricardo Vilalta, University of Houston</p>
      <p>Mohammed Zaki, Rensselaer Polytechnic Institute</p>
    </sec>
    <sec id="sec-23">
      <sec id="sec-23-1">
        <title>Conclusions</title>
        <p>To summarize, the submitted papers, as well as the discussion, had a main focus on unsupervised evaluation. But we also touched on other topics, and agreed that the richness of topics and questions calls for a continuation as a workshop series. Some main points of the discussion were:</p>
      </sec>
    </sec>
    <sec id="sec-24">
      <title>Main Discussion Points</title>
      <list list-type="bullet">
        <list-item>
          <p>Dataset complexity is important. So far, the community has mainly focused on building more complex methods; however, evaluating existing and new methods on appropriate benchmarks that reflect real-world complexity is necessary for scientific advance.</p>
        </list-item>
        <list-item>
          <p>In general, the awareness of reviewers should be raised regarding evaluation aspects: full-range evaluation, reproducibility, embracing negative results, etc.</p>
        </list-item>
        <list-item>
          <p>These aspects are important for furthering the maturity of data mining as a scientific effort. However, it still seems very hard to publish papers concerning issues around evaluation in mainstream venues. We need a critical mass to change the current status quo.</p>
        </list-item>
      </list>
      <p>Evaluation is a huge domain, and only a few of its aspects were covered at EDML 2019. Data-related issues such as sample representativeness, redundancy, bias, and non-stationary data have not been discussed. From a learning-method perspective, it would also be interesting to investigate similar questions in the context of deep neural networks, which currently dominate research in the data mining and machine learning areas. These are possible candidate focus areas for future workshops. We plan to continue EDML as a series.</p>
      <p>Finally, we wish to express our appreciation of the presented work, as well as of the interest and vivid participation of the audience.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] D. Basaran, E. Ntoutsi, and A. Zimek. Redundancies in data and their effect on the evaluation of recommendation systems: A case study on the Amazon reviews datasets. In SDM, pages 390–398. SIAM, 2017.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] R. J. Bayardo Jr., B. Goethals, and M. J. Zaki, editors. FIMI '04, Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations, Brighton, UK, November 1, 2004, volume 126 of CEUR Workshop Proceedings. CEUR-WS.org, 2005.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] G. O. Campos, A. Zimek, J. Sander, R. J. G. B. Campello, B. Micenkova, E. Schubert, I. Assent, and M. E. Houle. On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study. Data Min. Knowl. Discov., 30(4):891–927, 2016.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] B. Goethals and M. J. Zaki, editors. FIMI '03, Frequent Itemset Mining Implementations, Proceedings of the ICDM 2003 Workshop on Frequent Itemset Mining Implementations, 19 December 2003, Melbourne, Florida, USA, volume 90 of CEUR Workshop Proceedings. CEUR-WS.org, 2003.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] H.-P. Kriegel, E. Schubert, and A. Zimek. The (black) art of runtime evaluation: Are we comparing algorithms or implementations? Knowl. Inf. Syst., 52(2):341–378, 2017.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] H. O. Marques, R. J. G. B. Campello, A. Zimek, and J. Sander. On the internal evaluation of unsupervised outlier detection. In SSDBM, pages 7:1–7:12. ACM, 2015.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] M. A. Muñoz and K. A. Smith-Miles. Performance analysis of continuous black-box optimization algorithms via footprints in instance space. Evolutionary Computation, 25(4), 2017.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] M. A. Muñoz, L. Villanova, D. Baatar, and K. Smith-Miles. Instance spaces for machine learning classification. Machine Learning, 107(1):109–147, 2018.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] D. Sidlauskas and C. S. Jensen. Spatial joins in main memory: Implementation matters! PVLDB, 8(1):97–100, 2014.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] N. Tomasev and M. Radovanovic. Clustering evaluation in high-dimensional data. In M. E. Celebi and K. Aydin, editors, Unsupervised Learning Algorithms. Springer, 2016.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] Z. Zheng, R. Kohavi, and L. Mason. Real world performance of association rule algorithms. In KDD, pages 401–406. ACM, 2001.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] A. Zimmermann. The data problem in data mining. SIGKDD Explorations, 16(2):38–45, 2014.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>