<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Evaluation and Experimental Design in Data Mining and Machine Learning: Motivation and Summary of EDML 2019</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Eirini</forename><surname>Ntoutsi</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">L3S Research Center</orgName>
								<orgName type="institution">Leibniz University Hannover</orgName>
								<address>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Erich</forename><surname>Schubert</surname></persName>
							<affiliation key="aff1">
								<orgName type="institution">Technical University Dortmund</orgName>
								<address>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Arthur</forename><surname>Zimek</surname></persName>
							<affiliation key="aff2">
								<orgName type="institution">University of Southern Denmark</orgName>
								<address>
									<country key="DK">Denmark</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Albrecht</forename><surname>Zimmermann</surname></persName>
							<affiliation key="aff3">
								<orgName type="institution">University Caen Normandy</orgName>
								<address>
									<country key="FR">France</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Evaluation and Experimental Design in Data Mining and Machine Learning: Motivation and Summary of EDML 2019</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">AA28CAF1AE8BCF022E4182E5AAF132ED</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T06:36+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract/>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Motivation</head><p>A vital part of proposing new machine learning and data mining approaches is evaluating them empirically to allow an assessment of their capabilities. Numerous choices go into setting up such experiments: how to choose the data, how to preprocess them (or not), potential problems associated with the selection of datasets, which other techniques to compare against (if any), which metrics to report, and, last but not least, how to present and interpret the results. Learning how to make those choices on the job, often by copying the evaluation protocols used in the existing literature, can easily lead to the development of problematic habits. Numerous, albeit scattered, publications have called attention to those questions and have occasionally cast doubt on published results or on the usability of published methods <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b3">4,</ref><ref type="bibr" target="#b1">2,</ref><ref type="bibr" target="#b8">9,</ref><ref type="bibr" target="#b11">12,</ref><ref type="bibr" target="#b2">3,</ref><ref type="bibr" target="#b0">1,</ref><ref type="bibr" target="#b4">5]</ref>. At a time of intense discussion about a reproducibility crisis in the natural, social, and life sciences, and with conferences such as SIGMOD, KDD, and ECML PKDD encouraging researchers to make their work as reproducible as possible, we feel that it is important to bring researchers together and to discuss those issues on a fundamental level.</p><p>An issue directly related to the first choice mentioned above is the following: even the best-designed experiment carries only limited information if the underlying data are lacking. We therefore also want to discuss questions related to the availability of data: whether they are reliable and diverse, and whether they correspond to realistic and/or challenging problem settings.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Topics</head><p>In this workshop, we mainly solicited contributions that discuss those questions on a fundamental level, take stock of the state of the art, offer theoretical arguments, or take well-argued positions, as well as actual evaluation papers that offer new insights, e.g., question published results or shine a spotlight on the characteristics of existing benchmark datasets. As such, topics include, but are not limited to:</p><p>• Benchmark datasets for data mining tasks: are they diverse/realistic/challenging?</p><p>• Impact of data quality (redundancy, errors, noise, bias, imbalance, ...) on qualitative evaluation</p><p>• Propagation/amplification of data quality issues in the data mining results (including the interplay between data and algorithms)</p><p>• Evaluation of unsupervised data mining (the dilemma between novelty and validity)</p><p>• Evaluation measures</p><p>• (Automatic) data quality evaluation tools: what are the aspects one should check before starting to apply algorithms to given data?</p><p>• Issues around runtime evaluation (algorithm vs. implementation, dependency on hardware, algorithm parameters, dataset characteristics)</p><p>• Design guidelines for crowd-sourced evaluations</p></div>
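The kind of automatic pre-flight data quality check raised in the topics above can be illustrated with a minimal, hypothetical sketch; the function name and the reported fields are our own illustration, not taken from any workshop contribution. It computes two of the aspects one might check before applying any algorithm: the fraction of duplicate rows and the degree of class imbalance.

```python
from collections import Counter

def quick_data_checks(rows, labels):
    """Hypothetical pre-flight checks: duplicate rows and class
    imbalance, computed before any mining algorithm is applied."""
    n_unique = len({tuple(r) for r in rows})
    counts = Counter(labels)
    return {
        "n_rows": len(rows),
        "duplicate_fraction": 1 - n_unique / len(rows),
        "majority_class_fraction": max(counts.values()) / len(labels),
    }

# Toy example: one duplicated row, a 3-to-1 class imbalance.
rows = [[1, 2], [1, 2], [3, 4], [5, 6]]
labels = ["a", "a", "a", "b"]
print(quick_data_checks(rows, labels))
# → {'n_rows': 4, 'duplicate_fraction': 0.25, 'majority_class_fraction': 0.75}
```

A real tool would cover far more aspects (noise, bias, missing values, redundancy between train and test data), but even checks this simple can change how one interprets a subsequent evaluation.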
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Contributions</head><p>The workshop featured a mix of invited talks and accepted presentations, with ample time for questions since those contributions were expected to be less technical and more philosophical in nature, as well as an extensive discussion of the current state of affairs, the areas that most urgently need improvement, and recommendations for achieving those improvements.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Invited Presentations</head><p>Four invited presentations enriched the workshop with focused talks around the problems of evaluation in unsupervised learning. The first invited presentation, by Ricardo J. G. B. Campello, University of Newcastle, was on "Evaluation of Unsupervised Learning Results: Making the Seemingly Impossible Possible". Ricardo elaborated on the specific difficulties in the evaluation of unsupervised data mining methods (namely clustering and outlier detection) and reported on some recent solutions and improvements, with a special focus on the first internal evaluation measure for outlier detection <ref type="bibr" target="#b5">[6]</ref>.</p><p>The second invited presentation, by Kate Smith-Miles, University of Melbourne, was on "Instance Spaces for Objective Assessment of Algorithms and Benchmark Test Suites". It described attempts to characterize datasets in a way that allows mapping the landscape of varying problems, showing where particular algorithms perform well and thereby also identifying areas where no good algorithm is available. This approach has been applied to characterize optimization problems <ref type="bibr" target="#b6">[7]</ref> and classification problems <ref type="bibr" target="#b7">[8]</ref>. It would be interesting to see it applied to unsupervised learning problems as well.</p><p>The third invited presentation, by Bart Goethals, University of Antwerp, reported on "Lessons learned from the FIMI workshops", a series of workshops that Bart ran with others roughly 15 years ago, focusing on the runtime behavior of algorithms for frequent pattern mining <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b1">2]</ref>. Bart highlighted the various problems encountered in these attempts, for example the difficulty of assessing truly algorithmic merits as opposed to implementation details.</p><p>The fourth invited presentation, by Miloš Radovanović, University of Novi Sad, reported on observations regarding "Clustering Evaluation in High-Dimensional Data" and an apparent bias exhibited by some evaluation indices with respect to the dimensionality of the data <ref type="bibr" target="#b9">[10]</ref>.</p></div>
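The dimensionality effect underlying such biases can be illustrated with a small self-contained sketch (our own illustration, not taken from the talk): as dimensionality grows, pairwise distances between uniform random points concentrate, i.e., their relative spread shrinks, and distance-based evaluation indices inherit this behavior.

```python
import math
import random

def relative_distance_spread(dim, n=100, seed=0):
    """Std/mean of pairwise Euclidean distances between n uniform
    random points in [0,1]^dim -- a simple distance-concentration probe."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n)]
    dists = [math.dist(p, q) for i, p in enumerate(pts) for q in pts[i + 1:]]
    mean = sum(dists) / len(dists)
    std = math.sqrt(sum((d - mean) ** 2 for d in dists) / len(dists))
    return std / mean

# The relative spread shrinks with dimensionality: distances "look alike"
# in high dimensions, which can bias distance-based evaluation indices.
for dim in (2, 10, 100):
    print(f"dim={dim:3d}  relative spread = {relative_distance_spread(dim):.3f}")
```

On this toy data the relative spread decreases monotonically from dim=2 to dim=100, which hints at why an index calibrated on low-dimensional benchmarks may behave differently on high-dimensional data.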
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Contributed Papers</head><p>The submitted papers discussed a variety of problems around the topic of the workshop.</p><p>In "EvalNE: A Framework for Evaluating Network Embeddings on Link Prediction", Alexandru Mara, Jefrey Lijffijt, and Tijl De Bie describe an evaluation framework for benchmarking existing and potentially new algorithms in the targeted area, motivated by an observed lack of reproducibility.</p><p>Martin Aumüller and Matteo Ceccarello contributed a study on "Benchmarking Nearest Neighbor Search: Influence of Local Intrinsic Dimensionality and Result Diversity in Real-World Datasets", in which they study the influence of intrinsic dimensionality on the performance of approximate nearest neighbor search.</p><p>In their contribution "Context-Driven Data Mining through Bias Removal and Incompleteness Mitigation", Feras Batarseh and Ajay Kulkarni describe case studies on the use of context to overcome obstacles caused by poor data quality and thereby to improve the quality achieved in the corresponding data mining application.</p><p>Building on the instance space analysis techniques for optimization and classification problems discussed earlier in the invited presentation by Kate Smith-Miles, in "Instance space analysis for unsupervised outlier detection" Sevvandi Kandanaarachchi, Mario Muñoz, and Kate Smith-Miles discuss an approach to extend these techniques to the unsupervised and therefore more challenging problem of outlier detection.</p><p>The contribution "Characterizing Transactional Databases for Frequent Itemset Mining" by Christian Lezcano and Marta Arias proposes a list of metrics to capture the representativeness and diversity of benchmark datasets for frequent itemset mining.</p><p>To summarize, the submitted papers as well as the discussion focused mainly on unsupervised evaluation, but we also touched on other topics and agreed that the richness of topics and open questions calls for a continuation as a workshop series. Some main points of the discussion were:</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Discussion</head><p>• Dataset complexity is important. So far, the community has mainly focused on building more complex methods; however, evaluating existing and new methods on appropriate benchmarks that reflect real-world complexity is necessary for scientific progress.</p><p>• In general, the awareness of reviewers should be raised regarding evaluation aspects, full-range evaluation, reproducibility, embracing negative results, etc.</p><p>These aspects are important for the further maturing of data mining as a scientific endeavor. However, it still seems very hard to publish papers concerning issues around evaluation in mainstream venues. We need a critical mass to change the current status quo.</p><p>Evaluation is a huge domain, and only a few aspects have been covered at EDML 2019. Data-related issues such as sample representativeness, redundancy, bias, non-stationary data, etc. have not been discussed. From a learning method perspective, it would also be interesting to investigate similar questions in the context of deep neural networks, which currently dominate research in the data mining and machine learning areas. These are possible candidate focus areas for future workshops. We plan to continue EDML as a series.</p><p>Finally, we wish to express our appreciation of the presented work as well as of the interest and lively participation of the audience.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Program Committee</head><label></label><figDesc>The workshop would not have been possible without the generous help and the time and effort put into reviewing submissions by • Martin Aumüller, IT University of Copenhagen • James Bailey, University of Melbourne • Roberto Bayardo, Google • Christian Borgelt, University of Salzburg • Ricardo J. G. B. Campello, University of Newcastle • Sarah Cohen-Boulakia, Université Paris-Sud • Ryan R.
Curtin, Symantec Corporation • Tijl De Bie, University of Gent • Marcus Edel, Freie Universität Berlin • Bart Goethals, University of Antwerp • Markus Goldstein, Hochschule Ulm • Nathalie Japkowicz, American University • Daniel Lemire, University of Quebec • Philippe Lenca, IMT Atlantique • Helmut Neukirchen, University of Iceland • Jürgen Pfeffer, Technical University Munich • Miloš Radovanović, University of Novi Sad • Protiva Rahman, Ohio State University • Mohak Shah, LG Electronics • Kate Smith-Miles, University of Melbourne • Joaquin Vanschoren, Eindhoven University of Technology • Ricardo Vilalta, University of Houston • Mohammed Zaki, Rensselaer Polytechnic Institute</figDesc></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Redundancies in data and their effect on the evaluation of recommendation systems: A case study on the Amazon reviews datasets</title>
		<author>
			<persName><forename type="first">D</forename><surname>Basaran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Ntoutsi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zimek</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">SDM</title>
				<imprint>
			<publisher>SIAM</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="390" to="398" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">J</forename><surname>Bayardo</surname><genName>Jr</genName></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Goethals</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J</forename><surname>Zaki</surname></persName>
		</author>
		<ptr target="CEUR-WS.org" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations</title>
		<title level="s">CEUR Workshop Proceedings</title>
		<meeting>the IEEE ICDM Workshop on Frequent Itemset Mining Implementations<address><addrLine>Brighton, UK</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2004-11-01">November 1, 2004. 2005</date>
			<biblScope unit="volume">126</biblScope>
		</imprint>
	</monogr>
	<note>FIMI &apos;04</note>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study</title>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">O</forename><surname>Campos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zimek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sander</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">J G B</forename><surname>Campello</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Micenková</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Schubert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Assent</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">E</forename><surname>Houle</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Data Min. Knowl. Discov</title>
		<imprint>
			<biblScope unit="volume">30</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="891" to="927" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">FIMI &apos;03, Frequent Itemset Mining Implementations</title>
		<author>
			<persName><forename type="first">B</forename><surname>Goethals</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J</forename><surname>Zaki</surname></persName>
		</author>
		<ptr target="CEUR-WS.org" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the ICDM 2003 Workshop on Frequent Itemset Mining Implementations</title>
				<meeting>the ICDM 2003 Workshop on Frequent Itemset Mining Implementations<address><addrLine>Melbourne, Florida, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2003-12-19">19 December 2003. 2003</date>
			<biblScope unit="volume">90</biblScope>
		</imprint>
	</monogr>
	<note>CEUR Workshop Proceedings. CEUR-WS</note>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">The (black) art of runtime evaluation: Are we comparing algorithms or implementations?</title>
		<author>
			<persName><forename type="first">H</forename><surname>Kriegel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Schubert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zimek</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Knowl. Inf. Syst</title>
		<imprint>
			<biblScope unit="volume">52</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="341" to="378" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">On the internal evaluation of unsupervised outlier detection</title>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">O</forename><surname>Marques</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">J G B</forename><surname>Campello</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zimek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sander</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">SSDBM</title>
				<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2015">2015</date>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="page">12</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Performance analysis of continuous black-box optimization algorithms via footprints in instance space</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Muñoz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">A</forename><surname>Smith-Miles</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Evolutionary Computation</title>
		<imprint>
			<biblScope unit="volume">25</biblScope>
			<biblScope unit="issue">4</biblScope>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Instance spaces for machine learning classification</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Muñoz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Villanova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Baatar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Smith-Miles</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Machine Learning</title>
		<imprint>
			<biblScope unit="volume">107</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="109" to="147" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Spatial joins in main memory: Implementation matters!</title>
		<author>
			<persName><forename type="first">D</forename><surname>Sidlauskas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">S</forename><surname>Jensen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">PVLDB</title>
		<imprint>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="97" to="100" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Clustering evaluation in high-dimensional data</title>
		<author>
			<persName><forename type="first">N</forename><surname>Tomasev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Radovanović</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Unsupervised Learning Algorithms</title>
				<editor>
			<persName><forename type="first">M</forename><forename type="middle">E</forename><surname>Celebi</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Aydin</surname></persName>
		</editor>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Real world performance of association rule algorithms</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Kohavi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Mason</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">KDD</title>
				<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2001">2001</date>
			<biblScope unit="page" from="401" to="406" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">The data problem in data mining</title>
		<author>
			<persName><forename type="first">A</forename><surname>Zimmermann</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">SIGKDD Explorations</title>
		<imprint>
			<biblScope unit="volume">16</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="38" to="45" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
