<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">A Generator for Subspace Clusters</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Anna</forename><surname>Beer</surname></persName>
							<email>beer@dbs.ifi.lmu.de</email>
						</author>
						<author>
							<persName><forename type="first">Nadine</forename><forename type="middle">Sarah</forename><surname>Schüler</surname></persName>
							<email>n.schueler@campus.lmu.de</email>
						</author>
						<author>
							<persName><forename type="first">Thomas</forename><surname>Seidl</surname></persName>
							<email>seidl@dbs.ifi.lmu.de</email>
						</author>
						<author>
							<persName><forename type="first">Lmu</forename><surname>Munich</surname></persName>
						</author>
						<title level="a" type="main">A Generator for Subspace Clusters</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">54A5A198BB6AE8F469B04F136F079EC5</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T18:27+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Data Generator</term>
					<term>Subspace Clustering</term>
					<term>Reproducibility</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>We introduce a generator for data containing subspace clusters that is accurately tunable and adjustable to the needs of developers. It is available online and lets users specify a wide range of characteristics the data should have, while it can also generate meaningful data containing subspace clusters from a minimum of input.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Developing algorithms in the field of data mining is usually an iterative process in which a main idea is implemented and then tested on several use cases or experiments with a known ground truth. Depending on the results, the algorithm is modified, and a loop of alternately testing and improving the algorithm starts. If the same dataset, or only a few datasets, are used in several iterations of this cycle, the resulting algorithm tends to overfit to them. Subspace clusters occur in many fields, and especially in gene expression data or other data with a medical background, clusters are most often found only in meaningful subspaces. Nevertheless, the number of labeled datasets is limited, and datasets containing labeled subspace clusters are rare. So, instead of using the few labeled real-world datasets to develop and improve a subspace clustering algorithm, artificial datasets, whose ground truth is known by construction, are often used. Additionally, we can generate datasets in such a way that they emphasize the advantages of the algorithm and help to detect diverse properties that may have emerged during development. Data generators simplify the cumbersome process of constructing new datasets by hand and allow building reproducible datasets that are versatile enough to avoid overfitting in the development cycle described above. Nevertheless, there are only a few publicly available data generators and none for generating data containing subspace clusters, even though some are used in diverse subspace clustering papers, as described in Section 2. Thus, we developed a generator for data containing subspace clusters, which allows users to set a multitude of parameters and is described in Section 3. Section 4 concludes this short paper and gives ideas for future work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Work</head><p>The quality of most subspace clustering algorithms presented in recent years is demonstrated on synthetic data whose construction is usually not well described or not reproducible at all. Most authors created very elementary data generators, leading to a multitude of generators with too few options to construct datasets with reasonably predictable characteristics. Looking at a multitude of subspace clustering papers, we found the following to describe their data generation process best: SubClu <ref type="bibr" target="#b5">[KKK04]</ref>, SURFING <ref type="bibr" target="#b2">[BPR+04]</ref>, CLIQUE <ref type="bibr" target="#b0">[AGGR98]</ref>, which uses the generator described in <ref type="bibr" target="#b9">[ZM97]</ref>, and a review of diverse subspace clustering algorithms <ref type="bibr" target="#b7">[PHL04]</ref>. Further, ResCu <ref type="bibr" target="#b6">[MAG+09]</ref> and INSCY <ref type="bibr" target="#b1">[AKMS08]</ref> use the same generator as <ref type="bibr" target="#b5">[KKK04]</ref>. While all of those generators allow the user to set the number of points and the dimensionality of the dataset as well as the number and dimensionality of clusters explicitly or implicitly, each lacks some crucial aspects. For example, the density or variance of clusters can be set in SubClu and SURFING, but not in <ref type="bibr" target="#b7">[PHL04]</ref> or CLIQUE. CLIQUE constructs clusters differently from the other generators: the user defines hypercubes in which the uniformly distributed points are denser than in the surrounding areas. Surprisingly, only the generator from CLIQUE supports generating data with noise. The other generators construct, similarly to ours, Gaussian distributed clusters and have different properties: in SubClu and SURFING, no cluster can be detected in the full-dimensional space, but the authors do not describe how this is achieved. 
In <ref type="bibr" target="#b7">[PHL04]</ref>, the values of the relevant dimensions for each instance in a cluster can be restricted, leading to hypercube-shaped clusters.</p><p>MDCGen <ref type="bibr" target="#b3">[IZFZ]</ref>, which is probably the most recent generator and a very elaborate one, designed especially for multidimensional data and also for subspace clustering, does not allow a point to belong to multiple clusters at once. Additionally, there are data generators introduced independently of the field of subspace clustering, but to the best of our knowledge none of them is able to construct data containing subspace clusters of arbitrary dimensionality. [MLG+13] gives an overview of some data generators for big data benchmarking, such as Hibench, LinkBench, CloudSuite, TPC-DS, YCSB, BigBench, BigDataBench, and BDGS. MUDD <ref type="bibr" target="#b8">[SP04]</ref> is a generator similar to those. They are designed to create big datasets with properties similar to those of given real-world data, but users cannot specify enough details to expose the advantages and disadvantages of the algorithms they are developing. RAIL <ref type="bibr" target="#b4">[KBS19]</ref> is an interactive generator that concentrates on producing linearly correlated data, but it only allows constructing 3-dimensional datasets containing 2-dimensional planes.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">The Generator</head><p>In contrast to the data generators described in Section 2, the generator presented here lets users define a wide range of characteristics of the dataset to be constructed, while also generating meaningful datasets containing subspace clusters without much thought about parameters. It requires only three parameters for the general set-up: the number of points n, the number of dimensions dim, and m, a flag determining whether a point can belong to more than one subspace cluster. If given only those three parameters, we proceed as follows: to restrict the number of subspaces generated, we use a fixed number of cluster centers, determined by a random number k between 1 and √n. The clusters are then randomly allocated to a number of subspaces &lt; k, and the number of points as well as the number of dimensions of all subspace clusters are drawn randomly from a uniform distribution within the given limits.</p><p>Users can specify the properties of the data further by giving information for every subspace S, namely the number of points, the dimensionality, and the number of clusters in S. Additionally, the variance of each cluster can be given. Figure <ref type="figure">1</ref> shows how subspaces can be distributed. In this example, there are four different subspaces, of which the first contains two clusters, the second and third contain one cluster each, and the fourth contains three clusters. The last two points belong to no cluster at all, while all other points are in two clusters in different subspaces. If m = false, a point may belong to only one cluster, and we insert the given subspaces into the n × dim matrix as long as there are sufficient points not yet assigned. Points and dimensions not assigned to a subspace cluster are filled with uniformly distributed noise data and receive a 0 in the label matrix that gives the cluster assignments. 
If m = true, subspaces are first assigned in the same way as described above, before points are assigned to a second subspace and obtain a second cluster membership (see Figure <ref type="figure">1</ref>). This is again done by going through the points: if a point has enough unassigned dimensions to meet the requested subspace dimensionality, it becomes a member of the second subspace in addition to the first one. When points belong to a subspace cluster, their values are drawn from an appropriate multidimensional Gaussian distribution, the center and standard deviation of which can be given by users. The remaining values are again drawn from a uniform distribution. Uniformly distributed noise points can be added. Our generator is available online at https://github.com/NanniSchueler/SubCluGen.git and outputs the data matrix as well as the label matrix.</p></div>
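<div xmlns="http://www.tei-c.org/ns/1.0"><p>The procedure for m = false described in Section 3 can be sketched as follows. This is a minimal illustration in Python with NumPy, not the code of the published generator: the function name generate_subspace_data, the tuple layout of the subspace specification, the noise range [0, 1), and the range from which cluster centers are drawn are all assumptions made for this example.</p>
<p>
```python
import numpy as np


def generate_subspace_data(n, dim, subspaces, seed=0):
    """Sketch of the m = false case: each point joins at most one subspace.

    `subspaces` is a list of (num_points, num_dims, num_clusters, std) tuples.
    This tuple layout is a hypothetical choice made for illustration.
    """
    rng = np.random.default_rng(seed)
    # Start from uniform noise (range [0, 1) is an assumption); label 0 marks noise.
    data = rng.uniform(0.0, 1.0, size=(n, dim))
    labels = np.zeros((n, dim), dtype=int)

    next_point = 0
    for subspace_id, spec in enumerate(subspaces, start=1):
        num_points, num_dims, num_clusters, std = spec
        # Insert subspaces only while enough unassigned points remain.
        if next_point + num_points > n:
            break
        rows = np.arange(next_point, next_point + num_points)
        # Pick the relevant dimensions of this subspace at random.
        cols = rng.choice(dim, size=num_dims, replace=False)
        # Split the points of this subspace among its clusters.
        for cluster_rows in np.array_split(rows, num_clusters):
            # Center range is an assumption that keeps clusters inside the noise range.
            center = rng.uniform(0.2, 0.8, size=num_dims)
            values = rng.normal(center, std, size=(len(cluster_rows), num_dims))
            data[np.ix_(cluster_rows, cols)] = values
        # The label matrix stores the subspace affinity of every cell.
        labels[np.ix_(rows, cols)] = subspace_id
        next_point += num_points
    return data, labels
```
</p>
<p>For example, calling generate_subspace_data(20, 5, [(8, 2, 2, 0.05), (6, 3, 1, 0.05)]) yields a 20 × 5 data matrix in which the first eight points lie in a two-dimensional subspace and the next six in a three-dimensional one, while the remaining six points consist of noise only. The case m = true would add a second pass over the points that assigns a further subspace wherever enough unassigned dimensions remain.</p></div>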
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Conclusion</head><p>In summary, we introduced a data generator especially designed for subspace clusters. It expects only three parameters: the size and dimensionality of the dataset as well as a Boolean value determining whether a point can belong to clusters in different subspaces. This makes fast construction of data possible. At the same time, users can design reproducible datasets with very specific properties to test the algorithms they are developing for diverse characteristics. The generator is easy to use, and in future work we plan to extend it with further possibilities, such as non-axis-parallel subspace clusters or distributions other than Gaussian. Also, a combination with RAIL or some of the mentioned generators that take real-world data into account could deliver a variety of reproducible datasets containing the desired properties for testing and developing.</p><p>Fig. <ref type="figure">1</ref>: Left: Label matrix as output by the generator; 0 indicates uniformly distributed data, whereas the numbers 1 to 4 indicate the subspace affinity. Right: the exact cluster affinity as well as the variance of the clusters, indicated by colour saturation.</p></div>		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgement</head><p>This work has been funded by the German Federal Ministry of Education and Research (BMBF) under Grant No. 01IS18036A. The authors of this work take full responsibility for its content.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">Automatic subspace clustering of high dimensional data for data mining applications</title>
		<author>
			<persName><forename type="first">Rakesh</forename><surname>Agrawal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Johannes</forename><surname>Gehrke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dimitrios</forename><surname>Gunopulos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Prabhakar</forename><surname>Raghavan</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1998">1998</date>
			<publisher>ACM</publisher>
			<biblScope unit="volume">27</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Inscy: Indexing subspace clusters with in-process-removal of redundancy</title>
		<author>
			<persName><forename type="first">Ira</forename><surname>Assent</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ralph</forename><surname>Krieger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Emmanuel</forename><surname>Müller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Thomas</forename><surname>Seidl</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Eighth IEEE International Conference on Data Mining (ICDM&apos;08)</title>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="719" to="724" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Subspace selection for clustering high-dimensional data</title>
		<author>
			<persName><forename type="first">Christian</forename><surname>Baumgartner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Claudia</forename><surname>Plant</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Karin</forename><surname>Kailing</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Hans-Peter</forename><surname>Kriegel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Peer</forename><surname>Kröger</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Fourth IEEE International Conference on Data Mining (ICDM&apos;04)</title>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2004">2004</date>
			<biblScope unit="page" from="11" to="18" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Mdcgen: Multidimensional dataset generator for clustering</title>
		<author>
			<persName><forename type="first">Félix</forename><surname>Iglesias</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tanja</forename><surname>Zseby</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Daniel</forename><surname>Ferreira</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Arthur</forename><surname>Zimek</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Classification</title>
		<imprint>
			<biblScope unit="page" from="1" to="20" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Data on rails: On interactive generation of artificial linear correlated data</title>
		<author>
			<persName><forename type="first">Daniyal</forename><surname>Kazempour</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Anna</forename><surname>Beer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Thomas</forename><surname>Seidl</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Human-Computer Interaction</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="184" to="189" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Density-connected subspace clustering for high-dimensional data</title>
		<author>
			<persName><forename type="first">Karin</forename><surname>Kailing</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Hans-Peter</forename><surname>Kriegel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Peer</forename><surname>Kröger</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2004 SIAM international conference on data mining</title>
				<meeting>the 2004 SIAM international conference on data mining</meeting>
		<imprint>
			<publisher>SIAM</publisher>
			<date type="published" when="2004">2004</date>
			<biblScope unit="page" from="246" to="256" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Relevant subspace clustering: Mining the most interesting non-redundant concepts in high dimensional data</title>
		<author>
			<persName><forename type="first">Emmanuel</forename><surname>Müller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ira</forename><surname>Assent</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Stephan</forename><surname>Günnemann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ralph</forename><surname>Krieger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Thomas</forename><surname>Seidl</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Ninth IEEE International Conference on Data Mining</title>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<author>
			<persName><forename type="first">Zijian</forename><surname>Ming</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Chunjie</forename><surname>Luo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Wanling</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Rui</forename><surname>Han</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Qiang</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Lei</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jianfeng</forename><surname>Zhan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advancing Big Data Benchmarks</title>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="138" to="154" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Subspace clustering for high dimensional data: a review</title>
		<author>
			<persName><forename type="first">Lance</forename><surname>Parsons</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ehtesham</forename><surname>Haque</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Huan</forename><surname>Liu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM SIGKDD Explorations Newsletter</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="90" to="105" />
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Mudd: a multi-dimensional data generator</title>
		<author>
			<persName><forename type="first">John</forename><forename type="middle">M</forename><surname>Stephens</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Meikel</forename><surname>Poess</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM SIGSOFT Software Engineering Notes</title>
		<imprint>
			<biblScope unit="volume">29</biblScope>
			<biblScope unit="page" from="104" to="109" />
			<date type="published" when="2004">2004</date>
			<publisher>ACM</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">A comparative study of clustering methods</title>
		<author>
			<persName><forename type="first">Mohamed</forename><surname>Zaït</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Hammou</forename><surname>Messatfa</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Future Generation Computer Systems</title>
		<imprint>
			<biblScope unit="volume">13</biblScope>
			<biblScope unit="issue">2-3</biblScope>
			<biblScope unit="page" from="149" to="159" />
			<date type="published" when="1997">1997</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
