<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Evaluation and Experimental Design in Data Mining and Machine Learning: Motivation and Summary of EDML 2019</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Eirini Ntoutsi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Erich Schubert</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Albrecht Zimmermann</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Leibniz University Hannover &amp; L3S Research Center</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <fpage>2</fpage>
      <lpage>4</lpage>
      <abstract>
        <p>A vital part of proposing new machine learning and data mining approaches is evaluating them empirically, to allow an assessment of their capabilities. Numerous choices go into setting up such experiments: how to choose the data, how to preprocess them (or not), potential problems associated with the selection of datasets, which other techniques to compare to (if any), which metrics to evaluate, and, last but not least, how to present and interpret the results. Learning how to make those choices on the job, often by copying the evaluation protocols used in the existing literature, can easily lead to the development of problematic habits. Numerous, albeit scattered, publications have called attention to those questions and have occasionally called into question published results, or the usability of published methods [11, 4, 2, 9, 12, 3, 1, 5]. At a time of intense discussions about a reproducibility crisis in the natural, social, and life sciences, and with conferences such as SIGMOD, KDD, and ECML PKDD encouraging researchers to make their work as reproducible as possible, we therefore feel that it is important to bring researchers together to discuss those issues on a fundamental level. An issue directly related to the first choice mentioned above is the following: even the best-designed experiment carries only limited information if the underlying data are lacking. We therefore also want to discuss questions related to the availability of data: whether they are reliable and diverse, and whether they correspond to realistic and/or challenging problem settings.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Topics of Interest</title>
      <p>In this workshop, we mainly solicited contributions that discuss those questions on a fundamental level, take stock of the state of the art, offer theoretical arguments, or take well-argued positions, as well as actual evaluation papers that offer new insights, e.g., by questioning published results or shining a spotlight on the characteristics of existing benchmark datasets. As such, topics include, but are not limited to:</p>
    </sec>
    <sec id="sec-2">
      <title>Benchmark datasets for data mining tasks: are they diverse/realistic/challenging?</title>
    </sec>
    <sec id="sec-3">
      <title>Impact of data quality (redundancy, errors, noise, bias, imbalance, ...) on qualitative evaluation</title>
    </sec>
    <sec id="sec-4">
      <title>Propagation/amplification of data quality issues in the data mining results (also the interplay between data and algorithms)</title>
    </sec>
    <sec id="sec-5">
      <title>Evaluation of unsupervised data mining (dilemma between novelty and validity)</title>
    </sec>
    <sec id="sec-6">
      <title>Evaluation measures</title>
      <p>(Automatic) data quality evaluation tools: What
are the aspects one should check before starting to
apply algorithms to given data?</p>
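<p>The question above, of what to check before applying algorithms to given data, can be made concrete with a small sketch. The following is a minimal, hypothetical illustration (plain Python; the function name and the chosen statistics are our own, not a standard tool) that flags three common red flags: missing values, exact duplicate records, and class imbalance.</p>

```python
from collections import Counter

def data_quality_report(rows, label_key=None):
    """Compute a few red-flag statistics for a tabular dataset
    given as a list of dicts (one dict per record)."""
    n = len(rows)
    keys = sorted({k for row in rows for k in row})
    # Fraction of missing (None) values per attribute.
    missing = {k: sum(1 for r in rows if r.get(k) is None) / n for k in keys}
    # Fraction of exact duplicate records (a common source of leakage
    # between training and test data).
    distinct = len({tuple(r.get(k) for k in keys) for r in rows})
    report = {"n": n, "missing": missing, "duplicate_rate": 1 - distinct / n}
    if label_key is not None:
        counts = Counter(r[label_key] for r in rows)
        # Ratio of majority to minority class size: large values signal imbalance.
        report["imbalance_ratio"] = max(counts.values()) / min(counts.values())
    return report

rows = [
    {"x": 1.0, "y": None, "label": "a"},
    {"x": 1.0, "y": None, "label": "a"},   # exact duplicate of the first row
    {"x": 2.0, "y": 3.0, "label": "a"},
    {"x": 4.0, "y": 5.0, "label": "b"},
]
report = data_quality_report(rows, label_key="label")
print(report["duplicate_rate"])   # 0.25
print(report["imbalance_ratio"])  # 3.0
```

<p>Duplicates in particular can silently inflate evaluation scores when they end up on both sides of a train/test split, as observed for the Amazon reviews datasets [1].</p>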
    </sec>
    <sec id="sec-7">
      <title>Issues around runtime evaluation (algorithm vs. implementation, dependency on hardware, algorithm parameters, dataset characteristics)</title>
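<p>The algorithm-vs.-implementation distinction in the title above can be illustrated with a deliberately small, hypothetical sketch: the two summation variants below realize the exact same O(n) algorithm, yet their runtimes differ by a large constant factor because one runs as interpreted bytecode and the other inside the C runtime. A timing measurement therefore always characterizes an implementation on a platform, never an algorithm in isolation.</p>

```python
import timeit

def sum_loop(xs):
    """Summation as an explicit, interpreted Python loop."""
    total = 0
    for x in xs:
        total += x
    return total

xs = list(range(10_000))

# Both variants realize the same O(n) algorithm and return the same result...
assert sum_loop(xs) == sum(xs)

# ...but the built-in (implemented in C) is typically much faster.
t_loop = timeit.timeit(lambda: sum_loop(xs), number=200)
t_builtin = timeit.timeit(lambda: sum(xs), number=200)
print(f"loop: {t_loop:.4f}s, builtin: {t_builtin:.4f}s")
```

<p>Benchmarking efforts such as the FIMI workshops ran into exactly this confound: a faster binary may reflect a better compiler, language, or memory layout rather than a better algorithm [5].</p>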
    </sec>
    <sec id="sec-8">
      <title>Design guidelines for crowd-sourced evaluations</title>
      <sec id="sec-8-1">
        <title>Contributions</title>
        <p>The workshop featured a mix of invited speakers; a number of accepted presentations with ample time for questions, since those contributions were expected to be less technical and more philosophical in nature; and an extensive discussion on the current state, the areas that most urgently need improvement, and recommendations to achieve those improvements.</p>
        <p>3.1 Invited Presentations Four invited presentations enriched the workshop with focused talks around the problems of evaluation in unsupervised learning.</p>
        <p>The first invited presentation, by Ricardo J. G. B. Campello, University of Newcastle, was on "Evaluation of Unsupervised Learning Results: Making the Seemingly Impossible Possible". Ricardo elaborated on the specific difficulties in the evaluation of unsupervised data mining methods (namely clustering and outlier detection) and reported on some recent solutions and improvements, with a special focus on the first internal evaluation measure for outlier detection [<xref ref-type="bibr" rid="ref6">6</xref>].</p>
        <p>The second invited presentation, by Kate Smith-Miles, University of Melbourne, was on "Instance Spaces for Objective Assessment of Algorithms and Benchmark Test Suites", describing attempts to characterize datasets in a way that allows a map of the landscape of varying problems, showing where which algorithms perform well, and thereby also identifying areas where no good algorithm is available. This approach has been applied to characterize optimization problems [<xref ref-type="bibr" rid="ref7">7</xref>] and classification problems [<xref ref-type="bibr" rid="ref8">8</xref>]. It would be interesting to see it applied to unsupervised learning problems as well.</p>
        <p>The third invited presentation, by Bart Goethals, University of Antwerp, reported on "Lessons learned from the FIMI workshops", a series of workshops that Bart ran with others roughly 15 years ago, focusing on the runtime behavior of algorithms for frequent pattern mining [<xref ref-type="bibr" rid="ref2 ref4">4, 2</xref>]. Bart highlighted the various problems encountered in these attempts, for example the difficulty of assessing truly algorithmic merits as opposed to implementation details.</p>
        <p>The fourth invited presentation, by Milos Radovanovic, University of Novi Sad, reported on observations regarding "Clustering Evaluation in High-Dimensional Data" and an apparent bias that is shown by some evaluation indices w.r.t. the dimensionality of the data [<xref ref-type="bibr" rid="ref10">10</xref>].</p>
        <p>3.2 Contributed Papers The submitted papers discussed a variety of problems around the topic of the workshop.</p>
        <p>In "EvalNE: A Framework for Evaluating Network Embeddings on Link Prediction", Alexandru Mara, Jefrey Lijffijt, and Tijl De Bie describe an evaluation framework for benchmarking existing and potentially new algorithms in the targeted area, motivated by an observed lack of reproducibility.</p>
        <p>Martin Aumüller and Matteo Ceccarello contributed a study on "Benchmarking Nearest Neighbor Search: Influence of Local Intrinsic Dimensionality and Result Diversity in Real-World Datasets", in which they study the influence of intrinsic dimensionality on the performance of approximate nearest neighbor search.</p>
        <p>In their contribution "Context-Driven Data Mining through Bias Removal and Incompleteness Mitigation", Feras Batarseh and Ajay Kulkarni describe case studies on the use of context to overcome obstacles rooted in data quality (or a lack thereof), and thereby to improve the quality achieved in the corresponding data mining application.</p>
        <p>Based on the instance space analysis techniques for optimization and classification problems discussed earlier in the invited presentation by Kate Smith-Miles, in "Instance space analysis for unsupervised outlier detection" Sevvandi Kandanaarachchi, Mario Muñoz, and Kate Smith-Miles discuss an approach to extend these techniques to the unsupervised, and therefore more challenging, problem of outlier detection.</p>
        <p>The contribution "Characterizing Transactional Databases for Frequent Itemset Mining" by Christian Lezcano and Marta Arias proposes a list of metrics to capture the representativeness and diversity of benchmark datasets for frequent itemset mining.</p>
        <p>3.3 Program Committee The workshop would not have been possible without the generous help and the time and effort put into reviewing submissions by:</p>
        <p>Martin Aumüller, IT University of Copenhagen</p>
        <p>James Bailey, University of Melbourne</p>
        <p>Roberto Bayardo, Google</p>
        <p>Christian Borgelt, University of Salzburg</p>
        <p>Ricardo J. G. B. Campello, University of Newcastle</p>
        <p>Sarah Cohen-Boulakia, Université Paris-Sud</p>
        <p>Ryan R. Curtin, Symantec Corporation</p>
        <p>Tijl De Bie, Ghent University</p>
      </sec>
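<p>As an illustration of the notion of local intrinsic dimensionality studied in the contribution by Martin Aumüller and Matteo Ceccarello, the following sketch implements the well-known maximum-likelihood LID estimator from k-nearest-neighbor distances. This is our own minimal brute-force version for illustration, not the authors' code, and the exact normalization varies slightly between papers.</p>

```python
import math
import random

def lid_mle(dists):
    """Maximum-likelihood estimate of local intrinsic dimensionality (LID)
    from the sorted distances of a query point to its k nearest neighbors:
    LID = -k / sum_i log(r_i / r_k)."""
    r_k = dists[-1]  # distance to the k-th (farthest) neighbor
    return -len(dists) / sum(math.log(r / r_k) for r in dists)

def knn_dists(data, q, k):
    """Brute-force Euclidean k-nearest-neighbor distances of q within data."""
    return sorted(math.dist(q, p) for p in data if p != q)[:k]

# On points spread over the unit square the estimate should be near 2;
# on points along a line segment, near 1.
random.seed(0)
plane = [(random.random(), random.random()) for _ in range(2000)]
line = [(random.random(), 0.0) for _ in range(2000)]
print(lid_mle(knn_dists(plane, (0.5, 0.5), 50)))  # roughly 2
print(lid_mle(knn_dists(line, (0.5, 0.0), 50)))   # roughly 1
```

<p>The same local estimates, aggregated over many query points, are what characterize a dataset as "easy" or "hard" for approximate nearest-neighbor search in such benchmarks.</p>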
    </sec>
    <sec id="sec-9">
      <title>Program Committee (continued)</title>
      <p>Marcus Edel, Freie Universität Berlin</p>
      <p>Bart Goethals, University of Antwerp</p>
      <p>Markus Goldstein, Hochschule Ulm</p>
      <p>Nathalie Japkowicz, American University</p>
      <p>Daniel Lemire, University of Quebec</p>
      <p>Philippe Lenca, IMT Atlantique</p>
      <p>Helmut Neukirchen, University of Iceland</p>
      <p>Jürgen Pfeffer, Technical University of Munich</p>
      <p>Milos Radovanovic, University of Novi Sad</p>
      <p>Protiva Rahman, Ohio State University</p>
      <p>Mohak Shah, LG Electronics</p>
      <p>Kate Smith-Miles, University of Melbourne</p>
      <p>Joaquin Vanschoren, Eindhoven University of Technology</p>
      <p>Ricardo Vilalta, University of Houston</p>
      <p>Mohammed Zaki, Rensselaer Polytechnic Institute</p>
    </sec>
    <sec id="sec-23">
      <sec id="sec-23-1">
        <title>Conclusions</title>
        <p>To summarize, the submitted papers, as well as the discussion, had a main focus on unsupervised evaluation. But we also touched on other topics, and agreed that the richness of topics and questions calls for a continuation as a workshop series. Some main points of the discussion were:</p>
      </sec>
    </sec>
    <sec id="sec-24">
      <title>Main Discussion Points</title>
      <list list-type="bullet">
        <list-item>
          <p>Dataset complexity is important. So far, the community has mainly focused on building more complex methods; however, evaluating existing and new methods on appropriate benchmarks that reflect real-world complexity is necessary for scientific advance.</p>
        </list-item>
        <list-item>
          <p>In general, the awareness of reviewers should be raised regarding evaluation aspects: full-range evaluation, reproducibility, embracing negative results, etc.</p>
        </list-item>
        <list-item>
          <p>These aspects are important for furthering the maturity of data mining as a scientific effort. However, it still seems very hard to publish papers concerning issues around evaluation in mainstream venues. We need a critical mass to change the current status quo.</p>
        </list-item>
      </list>
      <p>Evaluation is a huge domain, and only a few of its aspects were covered at EDML 2019. Data-related issues such as sample representativeness, redundancy, bias, and non-stationary data have not been discussed. From a learning-method perspective, it would also be interesting to investigate similar questions in the context of deep neural networks, which currently dominate research in the data mining and machine learning areas. These are possible candidate focus areas for future workshops. We plan to continue EDML as a series.</p>
      <p>Finally, we wish to express our appreciation of the presented work, as well as of the interest and vivid participation of the audience.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] D. Basaran, E. Ntoutsi, and A. Zimek. Redundancies in data and their effect on the evaluation of recommendation systems: A case study on the Amazon reviews datasets. In SDM, pages 390–398. SIAM, 2017.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] R. J. Bayardo Jr., B. Goethals, and M. J. Zaki, editors. FIMI '04, Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations, Brighton, UK, November 1, 2004, volume 126 of CEUR Workshop Proceedings. CEUR-WS.org, 2005.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] G. O. Campos, A. Zimek, J. Sander, R. J. G. B. Campello, B. Micenkova, E. Schubert, I. Assent, and M. E. Houle. On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study. Data Min. Knowl. Discov., 30(4):891–927, 2016.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] B. Goethals and M. J. Zaki, editors. FIMI '03, Frequent Itemset Mining Implementations, Proceedings of the ICDM 2003 Workshop on Frequent Itemset Mining Implementations, 19 December 2003, Melbourne, Florida, USA, volume 90 of CEUR Workshop Proceedings. CEUR-WS.org, 2003.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] H.-P. Kriegel, E. Schubert, and A. Zimek. The (black) art of runtime evaluation: Are we comparing algorithms or implementations? Knowl. Inf. Syst., 52(2):341–378, 2017.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] H. O. Marques, R. J. G. B. Campello, A. Zimek, and J. Sander. On the internal evaluation of unsupervised outlier detection. In SSDBM, pages 7:1–7:12. ACM, 2015.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] M. A. Muñoz and K. A. Smith-Miles. Performance analysis of continuous black-box optimization algorithms via footprints in instance space. Evolutionary Computation, 25(4), 2017.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] M. A. Muñoz, L. Villanova, D. Baatar, and K. Smith-Miles. Instance spaces for machine learning classification. Machine Learning, 107(1):109–147, 2018.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] D. Sidlauskas and C. S. Jensen. Spatial joins in main memory: Implementation matters! PVLDB, 8(1):97–100, 2014.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] N. Tomasev and M. Radovanovic. Clustering evaluation in high-dimensional data. In M. E. Celebi and K. Aydin, editors, Unsupervised Learning Algorithms. Springer, 2016.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] Z. Zheng, R. Kohavi, and L. Mason. Real world performance of association rule algorithms. In KDD, pages 401–406. ACM, 2001.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] A. Zimmermann. The data problem in data mining. SIGKDD Explorations, 16(2):38–45, 2014.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>