Evaluation and Experimental Design in Data Mining and Machine Learning: Motivation and Summary of EDML 2019

Eirini Ntoutsi∗   Erich Schubert†   Arthur Zimek‡   Albrecht Zimmermann§

∗ Leibniz University Hannover, Germany & L3S Research Center, Germany
† Technical University Dortmund, Germany
‡ University of Southern Denmark, Denmark
§ University Caen Normandy, France

1 Motivation

A vital part of proposing new machine learning and data mining approaches is evaluating them empirically to allow an assessment of their capabilities. Numerous choices go into setting up such experiments: how to choose the data, how to preprocess them (or not), potential problems associated with the selection of datasets, what other techniques to compare to (if any), what metrics to evaluate, and, last but not least, how to present and interpret the results. Learning how to make those choices on the job, often by copying the evaluation protocols used in the existing literature, can easily lead to the development of problematic habits. Numerous, albeit scattered, publications have called attention to those questions and have occasionally called into question published results, or the usability of published methods [11, 4, 2, 9, 12, 3, 1, 5]. At a time of intense discussions about a reproducibility crisis in the natural, social, and life sciences, and with conferences such as SIGMOD, KDD, and ECML PKDD encouraging researchers to make their work as reproducible as possible, we therefore feel that it is important to bring researchers together and discuss those issues on a fundamental level.

An issue directly related to the first choice mentioned above is the following: even the best-designed experiment carries only limited information if the underlying data are lacking. We therefore also want to discuss questions related to the availability of data, whether they are reliable and diverse, and whether they correspond to realistic and/or challenging problem settings.

2 Topics

In this workshop, we mainly solicited contributions that discuss those questions on a fundamental level, take stock of the state of the art, offer theoretical arguments, or take well-argued positions, as well as actual evaluation papers that offer new insights, e.g., question published results, or shine the spotlight on the characteristics of existing benchmark data sets. As such, topics include, but are not limited to:

• Benchmark datasets for data mining tasks: are they diverse/realistic/challenging?
• Impact of data quality (redundancy, errors, noise, bias, imbalance, ...) on qualitative evaluation
• Propagation/amplification of data quality issues in the data mining results (including the interplay between data and algorithms)
• Evaluation of unsupervised data mining (the dilemma between novelty and validity)
• Evaluation measures
• (Automatic) data quality evaluation tools: what are the aspects one should check before starting to apply algorithms to given data?
• Issues around runtime evaluation (algorithm vs. implementation, dependency on hardware, algorithm parameters, dataset characteristics)
• Design guidelines for crowd-sourced evaluations

3 Contributions

The workshop featured a mix of invited speakers; a number of accepted presentations with ample time for questions, since those contributions were expected to be less technical and more philosophical in nature; and an extensive discussion on the current state of the field, the areas that most urgently need improvement, and recommendations for achieving those improvements.

3.1 Invited Presentations  Four invited presentations enriched the workshop with focused talks around the problems of evaluation in unsupervised learning.

The first invited presentation, by Ricardo J. G. B. Campello, University of Newcastle, was on "Evaluation of Unsupervised Learning Results: Making the Seemingly Impossible Possible". Ricardo elaborated on the specific difficulties in the evaluation of unsupervised data mining methods (namely clustering and outlier detection) and reported on some recent solutions and improvements, with a special focus on the first internal evaluation measure for outlier detection [6].

The second invited presentation, by Kate Smith-Miles, University of Melbourne, was on "Instance Spaces for Objective Assessment of Algorithms and Benchmark Test Suites". It described attempts to characterize data sets in a way that allows drawing a map of the landscape of problems, showing which algorithms perform well where, and thereby also identifying areas where no good algorithm is available. This approach has been applied to characterize optimization problems [7] and classification problems [8]. It would be interesting to see it applied to unsupervised learning problems as well.

The third invited presentation, by Bart Goethals, University of Antwerp, reported on "Lessons learned from the FIMI workshops", a series of workshops that Bart ran together with others roughly 15 years ago, focusing on the runtime behavior of algorithms for frequent pattern mining [4, 2]. Bart highlighted the various problems encountered in these attempts, for example the difficulty of assessing truly algorithmic merits as opposed to implementation details.
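Bart's point can be made concrete with a small sketch. The Python snippet below shows the kind of measurement discipline (warm-up runs, repetitions, robust aggregation) that at least controls for timing noise when comparing two implementations; it cannot, however, separate algorithmic merit from implementation quality, the deeper confound discussed in [5]. The names apriori_impl_a, apriori_impl_b, and transactions are hypothetical placeholders, not part of any FIMI benchmark.

import statistics
import time

def measure(fn, data, repeats=10, warmup=2):
    """Time fn(data): run warm-ups first, then return the median of
    repeated wall-clock measurements, which is more robust to
    interference from other processes than a single run or the mean."""
    for _ in range(warmup):          # trigger caching and lazy initialization
        fn(data)
    times = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        fn(data)
        times.append(time.perf_counter() - start)
    return statistics.median(times)

# Hypothetical usage: even with careful timing, a difference between
# t_a and t_b may reflect implementation quality (language, memory
# layout, low-level tuning) rather than algorithmic merit.
# t_a = measure(apriori_impl_a, transactions)
# t_b = measure(apriori_impl_b, transactions)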
The fourth invited presentation, by Miloš Radovanović, University of Novi Sad, reported on observations regarding "Clustering Evaluation in High-Dimensional Data" and an apparent bias shown by some evaluation indices with respect to the dimensionality of the data [10].

3.2 Contributed Papers  The submitted papers discussed a variety of problems around the topic of the workshop.

In "EvalNE: A Framework for Evaluating Network Embeddings on Link Prediction", Alexandru Mara, Jefrey Lijffijt, and Tijl De Bie describe an evaluation framework for benchmarking existing and potentially new algorithms in the targeted area, motivated by an observed lack of reproducibility.

Martin Aumüller and Matteo Ceccarello contributed a study on "Benchmarking Nearest Neighbor Search: Influence of Local Intrinsic Dimensionality and Result Diversity in Real-World Datasets", in which they study the influence of intrinsic dimensionality on the performance of approximate nearest neighbor search.
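As background for readers unfamiliar with the notion: local intrinsic dimensionality (LID) can be estimated from the distances of a point to its k nearest neighbors alone. The following is a minimal sketch of one common variant of the maximum-likelihood estimator, intended purely as an illustration of the quantity being studied, not necessarily the exact estimator used by the authors.

import numpy as np

def lid_mle(dists):
    """Maximum-likelihood estimate of local intrinsic dimensionality
    from a point's (positive) distances to its k nearest neighbors."""
    dists = np.sort(np.asarray(dists, dtype=float))
    r_k = dists[-1]                  # distance to the k-th neighbor
    # log(r_i / r_k) is zero for i = k, so the last term drops out.
    logs = np.log(dists[:-1] / r_k)
    return -1.0 / logs.mean()

# Toy usage: linearly growing distances look one-dimensional,
# while distances concentrating near r_k yield a high LID.
print(lid_mle([0.2, 0.4, 0.6, 0.8, 1.0]))       # approx. 1.2
print(lid_mle([0.90, 0.93, 0.96, 0.98, 1.0]))   # approx. 17

Higher LID is commonly associated with harder nearest-neighbor workloads, which is the kind of influence the paper investigates on real-world datasets.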
In their contribution "Context-Driven Data Mining through Bias Removal and Incompleteness Mitigation", Feras Batarseh and Ajay Kulkarni describe case studies on using context to overcome obstacles arising from poor or missing data quality, and thereby improve the quality achieved in the corresponding data mining application.

Based on the instance space analysis techniques for optimization and classification problems discussed earlier in the invited presentation by Kate Smith-Miles, in "Instance space analysis for unsupervised outlier detection" Sevvandi Kandanaarachchi, Mario Muñoz, and Kate Smith-Miles discuss an approach to extend these techniques to the unsupervised, and therefore more challenging, problem of outlier detection.

The contribution "Characterizing Transactional Databases for Frequent Itemset Mining" by Christian Lezcano and Marta Arias proposes a list of metrics to capture the representativeness and diversity of benchmark datasets for frequent itemset mining.

3.3 Program Committee  The workshop would not have been possible without the generous help and the time and effort put into reviewing submissions by

• Martin Aumüller, IT University of Copenhagen
• James Bailey, University of Melbourne
• Roberto Bayardo, Google
• Christian Borgelt, University of Salzburg
• Ricardo J. G. B. Campello, University of Newcastle
• Sarah Cohen-Boulakia, Université Paris-Sud
• Ryan R. Curtin, Symantec Corporation
• Tijl De Bie, University of Gent
• Marcus Edel, Freie Universität Berlin
• Bart Goethals, University of Antwerp
• Markus Goldstein, Hochschule Ulm
• Nathalie Japkowicz, American University
• Daniel Lemire, University of Quebec
• Philippe Lenca, IMT Atlantique
• Helmut Neukirchen, University of Iceland
• Jürgen Pfeffer, Technical University Munich
• Miloš Radovanović, University of Novi Sad
• Protiva Rahman, Ohio State University
• Mohak Shah, LG Electronics
• Kate Smith-Miles, University of Melbourne
• Joaquin Vanschoren, Eindhoven University of Technology
• Ricardo Vilalta, University of Houston
• Mohammed Zaki, Rensselaer Polytechnic Institute

4 Conclusions

To summarize, the submitted papers as well as the discussion focused mainly on unsupervised evaluation. But we also touched on other topics and agreed that the richness of topics and questions calls for a continuation as a workshop series. Some main points of the discussion were:

• Dataset complexity is important. So far, the community has mainly focused on building more complex methods; however, evaluating existing and new methods on appropriate benchmarks reflecting real-world complexity is necessary for scientific advancement.

• In general, the awareness of reviewers should be raised regarding evaluation aspects, full-range evaluation, reproducibility, embracing negative results, etc. These aspects are important for furthering the maturity of data mining as a scientific endeavor. However, it still seems very hard to publish papers concerning issues around evaluation in mainstream venues. We need a critical mass to change the status quo.

Evaluation is a huge domain, and only a few aspects have been covered at EDML 2019. Data-related issues like sample representativeness, redundancy, bias, non-stationary data, etc. have not been discussed. From a learning method perspective, it would also be interesting to investigate similar questions in the context of deep neural networks, which currently dominate research in data mining and machine learning. These are possible candidate focus areas for future workshops. We plan to continue EDML as a series.

Finally, we wish to express our appreciation of the presented work as well as of the interest and lively participation of the audience.
References

[1] D. Basaran, E. Ntoutsi, and A. Zimek. Redundancies in data and their effect on the evaluation of recommendation systems: A case study on the Amazon reviews datasets. In SDM, pages 390–398. SIAM, 2017.
[2] R. J. Bayardo Jr., B. Goethals, and M. J. Zaki, editors. FIMI '04, Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations, Brighton, UK, November 1, 2004, volume 126 of CEUR Workshop Proceedings. CEUR-WS.org, 2005.
[3] G. O. Campos, A. Zimek, J. Sander, R. J. G. B. Campello, B. Micenková, E. Schubert, I. Assent, and M. E. Houle. On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study. Data Min. Knowl. Discov., 30(4):891–927, 2016.
[4] B. Goethals and M. J. Zaki, editors. FIMI '03, Frequent Itemset Mining Implementations, Proceedings of the ICDM 2003 Workshop on Frequent Itemset Mining Implementations, 19 December 2003, Melbourne, Florida, USA, volume 90 of CEUR Workshop Proceedings. CEUR-WS.org, 2003.
[5] H.-P. Kriegel, E. Schubert, and A. Zimek. The (black) art of runtime evaluation: Are we comparing algorithms or implementations? Knowl. Inf. Syst., 52(2):341–378, 2017.
[6] H. O. Marques, R. J. G. B. Campello, A. Zimek, and J. Sander. On the internal evaluation of unsupervised outlier detection. In SSDBM, pages 7:1–7:12. ACM, 2015.
[7] M. A. Muñoz and K. A. Smith-Miles. Performance analysis of continuous black-box optimization algorithms via footprints in instance space. Evolutionary Computation, 25(4), 2017.
[8] M. A. Muñoz, L. Villanova, D. Baatar, and K. Smith-Miles. Instance spaces for machine learning classification. Machine Learning, 107(1):109–147, 2018.
[9] D. Sidlauskas and C. S. Jensen. Spatial joins in main memory: Implementation matters! PVLDB, 8(1):97–100, 2014.
[10] N. Tomašev and M. Radovanović. Clustering evaluation in high-dimensional data. In M. E. Celebi and K. Aydin, editors, Unsupervised Learning Algorithms. Springer, 2016.
[11] Z. Zheng, R. Kohavi, and L. Mason. Real world performance of association rule algorithms. In KDD, pages 401–406. ACM, 2001.
[12] A. Zimmermann. The data problem in data mining. SIGKDD Explorations, 16(2):38–45, 2014.