Evaluation and Experimental Design in Data Mining and Machine Learning: Motivation and Summary of EDML 2019

Eirini Ntoutsi∗   Erich Schubert†   Arthur Zimek‡   Albrecht Zimmermann§

∗ Leibniz University Hannover, Germany & L3S Research Center, Germany
† Technical University Dortmund, Germany
‡ University of Southern Denmark, Denmark
§ University Caen Normandy, France

1 Motivation

A vital part of proposing new machine learning and data mining approaches is evaluating them empirically to allow an assessment of their capabilities. Numerous choices go into setting up such experiments: how to choose the data, how to preprocess them (or not), potential problems associated with the selection of datasets, what other techniques to compare to (if any), what metrics to evaluate, and, last but not least, how to present and interpret the results. Learning how to make those choices on the job, often by copying the evaluation protocols used in the existing literature, can easily lead to the development of problematic habits. Numerous, albeit scattered, publications have called attention to those questions and have occasionally called into question published results, or the usability of published methods [11, 4, 2, 9, 12, 3, 1, 5]. At a time of intense discussions about a reproducibility crisis in the natural, social, and life sciences, and with conferences such as SIGMOD, KDD, and ECML PKDD encouraging researchers to make their work as reproducible as possible, we therefore feel that it is important to bring researchers together and discuss those issues on a fundamental level.

An issue directly related to the first choice mentioned above is the following: even the best-designed experiment carries only limited information if the underlying data are lacking. We therefore also want to discuss questions related to the availability of data, whether they are reliable and diverse, and whether they correspond to realistic and/or challenging problem settings.

2 Topics

In this workshop, we mainly solicited contributions that discuss those questions on a fundamental level, take stock of the state of the art, offer theoretical arguments, or take well-argued positions, as well as actual evaluation papers that offer new insights, e.g., question published results, or shine the spotlight on the characteristics of existing benchmark data sets. As such, topics include, but are not limited to:

• Benchmark datasets for data mining tasks: are they diverse/realistic/challenging?
• Impact of data quality (redundancy, errors, noise, bias, imbalance, ...) on qualitative evaluation
• Propagation/amplification of data quality issues in the data mining results (including the interplay between data and algorithms)
• Evaluation of unsupervised data mining (the dilemma between novelty and validity)
• Evaluation measures
• (Automatic) data quality evaluation tools: what are the aspects one should check before starting to apply algorithms to given data?
• Issues around runtime evaluation (algorithm vs. implementation, dependency on hardware, algorithm parameters, dataset characteristics)
• Design guidelines for crowd-sourced evaluations

3 Contributions

The workshop featured a mix of invited speakers; a number of accepted presentations with ample time for questions, since those contributions were expected to be less technical and more philosophical in nature; and an extensive discussion on the current state of the field, the areas that most urgently need improvement, and recommendations for achieving those improvements.

3.1 Invited Presentations  Four invited presentations enriched the workshop with focused talks around the problems of evaluation in unsupervised learning.

The first invited presentation, by Ricardo J. G. B. Campello, University of Newcastle, was on "Evaluation of Unsupervised Learning Results: Making the Seemingly Impossible Possible". Ricardo elaborated on the specific difficulties in the evaluation of unsupervised data mining methods (namely clustering and outlier detection) and reported on some recent solutions and improvements, with a special focus on the first internal evaluation measure for outlier detection [6].

The second invited presentation, by Kate Smith-Miles, University of Melbourne, was on "Instance Spaces for Objective Assessment of Algorithms and Benchmark Test Suites". It described attempts to characterize data sets in a way that allows drawing a map of the landscape of problems, showing which algorithms perform well where, and thereby also identifying areas where no good algorithm is available. This approach has been applied to characterize optimization problems [7] and classification problems [8]. It would be interesting to see it applied to unsupervised learning problems as well.

The third invited presentation, by Bart Goethals, University of Antwerp, reported on "Lessons learned from the FIMI workshops", a series of workshops that Bart ran together with others roughly 15 years ago, focusing on the runtime behavior of algorithms for frequent pattern mining [4, 2]. Bart highlighted the various problems encountered in these attempts, for example the difficulty of assessing truly algorithmic merits as opposed to implementation details.
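Bart's point can be made concrete with a small sketch. The Python snippet below shows the kind of measurement discipline (warm-up runs, repetitions, robust aggregation) that at least controls for timing noise when comparing two implementations; it cannot, however, separate algorithmic merit from implementation quality, the deeper confound discussed in [5]. The names apriori_impl_a, apriori_impl_b, and transactions are hypothetical placeholders, not part of any FIMI benchmark.

import statistics
import time

def measure(fn, data, repeats=10, warmup=2):
    """Time fn(data): run warm-ups first, then return the median of
    repeated wall-clock measurements, which is more robust to
    interference from other processes than a single run or the mean."""
    for _ in range(warmup):          # trigger caching and lazy initialization
        fn(data)
    times = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        fn(data)
        times.append(time.perf_counter() - start)
    return statistics.median(times)

# Hypothetical usage: even with careful timing, a difference between
# t_a and t_b may reflect implementation quality (language, memory
# layout, low-level tuning) rather than algorithmic merit.
# t_a = measure(apriori_impl_a, transactions)
# t_b = measure(apriori_impl_b, transactions)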
The fourth invited presentation, by Miloš Radovanović, University of Novi Sad, reported on observations regarding "Clustering Evaluation in High-Dimensional Data" and an apparent bias shown by some evaluation indices with respect to the dimensionality of the data [10].

3.2 Contributed Papers  The submitted papers discussed a variety of problems around the topic of the workshop.

In "EvalNE: A Framework for Evaluating Network Embeddings on Link Prediction", Alexandru Mara, Jefrey Lijffijt, and Tijl De Bie describe an evaluation framework for benchmarking existing and potentially new algorithms in the targeted area, motivated by an observed lack of reproducibility.

Martin Aumüller and Matteo Ceccarello contributed a study on "Benchmarking Nearest Neighbor Search: Influence of Local Intrinsic Dimensionality and Result Diversity in Real-World Datasets", in which they study the influence of intrinsic dimensionality on the performance of approximate nearest neighbor search.
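As background for readers unfamiliar with the notion: local intrinsic dimensionality (LID) can be estimated from the distances of a point to its k nearest neighbors alone. The following is a minimal sketch of one common variant of the maximum-likelihood estimator, intended purely as an illustration of the quantity being studied, not necessarily the exact estimator used by the authors.

import numpy as np

def lid_mle(dists):
    """Maximum-likelihood estimate of local intrinsic dimensionality
    from a point's (positive) distances to its k nearest neighbors."""
    dists = np.sort(np.asarray(dists, dtype=float))
    r_k = dists[-1]                  # distance to the k-th neighbor
    # log(r_i / r_k) is zero for i = k, so the last term drops out.
    logs = np.log(dists[:-1] / r_k)
    return -1.0 / logs.mean()

# Toy usage: linearly growing distances look one-dimensional,
# while distances concentrating near r_k yield a high LID.
print(lid_mle([0.2, 0.4, 0.6, 0.8, 1.0]))       # approx. 1.2
print(lid_mle([0.90, 0.93, 0.96, 0.98, 1.0]))   # approx. 17

Higher LID is commonly associated with harder nearest-neighbor workloads, which is the kind of influence the paper investigates on real-world datasets.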
In their contribution "Context-Driven Data Mining through Bias Removal and Incompleteness Mitigation", Feras Batarseh and Ajay Kulkarni describe case studies on using context to overcome obstacles arising from poor or missing data quality, and thereby improve the quality achieved in the corresponding data mining application.

Based on the instance space analysis techniques for optimization and classification problems discussed earlier in the invited presentation by Kate Smith-Miles, in "Instance space analysis for unsupervised outlier detection" Sevvandi Kandanaarachchi, Mario Muñoz, and Kate Smith-Miles discuss an approach to extend these techniques to the unsupervised, and therefore more challenging, problem of outlier detection.

The contribution "Characterizing Transactional Databases for Frequent Itemset Mining" by Christian Lezcano and Marta Arias proposes a list of metrics to capture the representativeness and diversity of benchmark datasets for frequent itemset mining.

3.3 Program Committee  The workshop would not have been possible without the generous help and the time and effort put into reviewing submissions by

• Martin Aumüller, IT University of Copenhagen
• James Bailey, University of Melbourne
• Roberto Bayardo, Google
• Christian Borgelt, University of Salzburg
• Ricardo J. G. B. Campello, University of Newcastle
• Sarah Cohen-Boulakia, Université Paris-Sud
• Ryan R. Curtin, Symantec Corporation
• Tijl De Bie, University of Gent
• Marcus Edel, Freie Universität Berlin
• Bart Goethals, University of Antwerp
• Markus Goldstein, Hochschule Ulm
• Nathalie Japkowicz, American University
• Daniel Lemire, University of Quebec
• Philippe Lenca, IMT Atlantique
• Helmut Neukirchen, University of Iceland
• Jürgen Pfeffer, Technical University Munich
• Miloš Radovanović, University of Novi Sad
• Protiva Rahman, Ohio State University
• Mohak Shah, LG Electronics
• Kate Smith-Miles, University of Melbourne
• Joaquin Vanschoren, Eindhoven University of Technology
• Ricardo Vilalta, University of Houston
• Mohammed Zaki, Rensselaer Polytechnic Institute

4 Conclusions

To summarize, the submitted papers as well as the discussion focused mainly on unsupervised evaluation. But we also touched on other topics and agreed that the richness of topics and questions calls for a continuation as a workshop series. Some main points of the discussion were:

• Dataset complexity is important. So far, the community has mainly focused on building more complex methods; however, evaluating existing and new methods on appropriate benchmarks reflecting real-world complexity is necessary for scientific advancement.

• In general, the awareness of reviewers should be raised regarding evaluation aspects, full-range evaluation, reproducibility, embracing negative results, etc. These aspects are important for furthering the maturity of data mining as a scientific endeavor. However, it still seems very hard to publish papers concerning issues around evaluation in mainstream venues. We need a critical mass to change the status quo.

Evaluation is a huge domain, and only a few aspects have been covered at EDML 2019. Data-related issues like sample representativeness, redundancy, bias, non-stationary data, etc. have not been discussed. From a learning method perspective, it would also be interesting to investigate similar questions in the context of deep neural networks, which currently dominate research in data mining and machine learning. These are possible candidate focus areas for future workshops. We plan to continue EDML as a series.

Finally, we wish to express our appreciation of the presented work as well as of the interest and lively participation of the audience.
References

[1] D. Basaran, E. Ntoutsi, and A. Zimek. Redundancies in data and their effect on the evaluation of recommendation systems: A case study on the Amazon reviews datasets. In SDM, pages 390–398. SIAM, 2017.
[2] R. J. Bayardo Jr., B. Goethals, and M. J. Zaki, editors. FIMI '04, Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations, Brighton, UK, November 1, 2004, volume 126 of CEUR Workshop Proceedings. CEUR-WS.org, 2005.
[3] G. O. Campos, A. Zimek, J. Sander, R. J. G. B. Campello, B. Micenková, E. Schubert, I. Assent, and M. E. Houle. On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study. Data Min. Knowl. Discov., 30(4):891–927, 2016.
[4] B. Goethals and M. J. Zaki, editors. FIMI '03, Frequent Itemset Mining Implementations, Proceedings of the ICDM 2003 Workshop on Frequent Itemset Mining Implementations, 19 December 2003, Melbourne, Florida, USA, volume 90 of CEUR Workshop Proceedings. CEUR-WS.org, 2003.
[5] H.-P. Kriegel, E. Schubert, and A. Zimek. The (black) art of runtime evaluation: Are we comparing algorithms or implementations? Knowl. Inf. Syst., 52(2):341–378, 2017.
[6] H. O. Marques, R. J. G. B. Campello, A. Zimek, and J. Sander. On the internal evaluation of unsupervised outlier detection. In SSDBM, pages 7:1–7:12. ACM, 2015.
[7] M. A. Muñoz and K. A. Smith-Miles. Performance analysis of continuous black-box optimization algorithms via footprints in instance space. Evolutionary Computation, 25(4), 2017.
[8] M. A. Muñoz, L. Villanova, D. Baatar, and K. Smith-Miles. Instance spaces for machine learning classification. Machine Learning, 107(1):109–147, 2018.
[9] D. Sidlauskas and C. S. Jensen. Spatial joins in main memory: Implementation matters! PVLDB, 8(1):97–100, 2014.
[10] N. Tomašev and M. Radovanović. Clustering evaluation in high-dimensional data. In M. E. Celebi and K. Aydin, editors, Unsupervised Learning Algorithms. Springer, 2016.
[11] Z. Zheng, R. Kohavi, and L. Mason. Real world performance of association rule algorithms. In KDD, pages 401–406. ACM, 2001.
[12] A. Zimmermann. The data problem in data mining. SIGKDD Explorations, 16(2):38–45, 2014.