<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Combinatorial Approaches to Clustering and Feature Selection</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Michael</forename><forename type="middle">E</forename><surname>Houle</surname></persName>
							<email>meh@nii.ac.jp</email>
							<affiliation key="aff0">
								<orgName type="institution">National Institute of Informatics</orgName>
								<address>
									<addrLine>2-1-2 Hitotsubashi, Chiyoda-ku</addrLine>
									<postCode>101-8430</postCode>
									<settlement>Tokyo</settlement>
									<country key="JP">Japan</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Combinatorial Approaches to Clustering and Feature Selection</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">5A2B4D413C13EAB39951826C42F72771</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T13:42+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract/>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>One of the most serious difficulties in the analysis of high-dimensional data sets involves the treatment of measures of similarity. Although similarity measures often retain some discriminative ability as the dimension increases, the similarity values themselves are often difficult to interpret. Methods for search, clustering and feature selection that perform quantitive tests of similarity values (as opposed to comparative tests) are particularly susceptible to this problem. This presentation will be concerned with combinatorial models of clustering based on shared neighbor information, and their application to feature selection, subspace clustering, and multiple clustering. The models assume a secondary, derived form of similarity measure based on the intersection properties of neighborhoods defined according to the original similarity measure. The use of secondary similarity has been recently shown to offer solutions that are more robust and more scalable with respect to the dimension of the data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Secondary Similarity Measures</head><p>For similarity search and their applications, the distance measures commonly used in practice are known to be sensitive to local variations within the data distribution, as well as the number of data features involved (the dimension). These dependencies can severely limit the the efficiency and accuracy of the search, and ultimately the quality of the solution -a phenomenon often referred to as the curse of dimensionality. Generally speaking, as the number of data features increases, pairwise distance values tend to concentrate tightly about their mean, reducing the overall discriminative ability of the similarity measure. The effect occurs for a broad range of data distributions and similarity measures, and can be so pronounced as to cast doubt upon whether efficient nearest neighbor search is even achievable in higher dimensions <ref type="bibr" target="#b0">[1]</ref>. However, when a data set is composed of many well-formed clusters, the concentration effect will typically be less severe across cluster boundaries, with distances from a cluster member to other cluster members being relatively easy to distinguish from distances to non-members, especially when the clusters are well separated <ref type="bibr" target="#b0">[1]</ref><ref type="bibr" target="#b1">[2]</ref><ref type="bibr" target="#b2">[3]</ref>.</p><p>In general, any improvement in the discriminative ability of the similarity measure employed can be expected to yield improvements in the performances of solutions based on it. Some simple enhancement strategies involve the use of shared neighbor (SN) information, in which a secondary similarity between two points v and w is defined in terms of data objects in the common intersection of neighborhoods based at v and w, where the neighborhoods themselves are determined according to a supplied primary similarity measure. The primary measure can be any function that determines a well-defined ranking of the data objects relative to the query. Recent studies have shown that secondary similarity measures based on SN information are generally more robust in higher dimensions than their associated primary distance measures, since the neighborhoods of object pairs drawn from a common cluster tend to have significantly more items in common than to pairs drawn from different clusters <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b4">5]</ref>. Furthermore, recent advances in approximate similarity search allow for neighborhood information to be generated accurately and efficiently for many practical applications <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b6">7]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Multi-Source RSC Clustering</head><p>Shared-neighbor information has been used to guide clustering algorithms for almost 40 years <ref type="bibr" target="#b7">[8]</ref><ref type="bibr" target="#b8">[9]</ref><ref type="bibr" target="#b9">[10]</ref>. However, early methods required that the neighborhood size k be fixed in advance by the user. The use of fixed values of k can introduce a very significant bias on the sizes and other characteristics of clusters that can be produced by the methods, in that they tend to favor the discovery of groups with size of roughly the same order as k.</p><p>In order to account for the effects of varying k, the Relevant-Set Correlation (RSC) model for clustering was proposed <ref type="bibr" target="#b10">[11]</ref>. The RSC model provides a consistent and comprehensive framework for the assessment of cluster quality, based on the statistical significance of a form of correlation between the neighborhood sets of its members. More precisely, the model tests the significance of any grouping against the assumption that the neighborhoods contain zero information (that is, against the assumption that they were generated by means of random selection). The greater the extent to which the assumption is violated, the greater the significance of the grouping.</p><p>The RSC model quantifies the quality of cluster candidates of any arbitrary size (allowing the comparison of any two candidates regardless of their size), the degree of association between pairs of cluster candidates, and the degree of association between clusters and individual data items. An efficient greedy selection strategy, GreedyRSC, has been developed based on RSC, and was shown to be very competitive in practice <ref type="bibr" target="#b10">[11]</ref>. It requires only two user-supplied parameters, describing the minimum acceptable cluster size, and the size of the maximum acceptable overlap between two clusters. Both of these parameters can be chosen in a natural way with no knowledge of the nature of the data or its distribution. The number of clusters is not specified by the user. This presentation will be concerned with an extension of the RSC model to account for multiple sources of neighborhood information. Each of these sources is assumed to have its own similarity measure based on its own collection of data features (which may or may not contain features also used by other sources). Like the original RSC model, the extended model relies only on the neighborhood rankings produced according to the sources, and has no knowledge of the nature of the similarity measure or features involved.</p><p>The extension of RSC will be seen to have implications for subspace clustering and feature selection, as well as multiclustering. In particular, the discussion will include the following potential applications of the extended model:</p><p>-The significance of data sets can be simultaneously assessed with respect to object membership as well as the number of sources of neighborhood information. If each source is associated with its own collection of features, the model in effect assesses the significance of the association of a particular group of objects with a collection of features. -Under the model, the combination of sources that are most strongly associated with a putative cluster can be identified very efficiently. -In applications for which multiple clusterings of the data have been generated, the model can be used to decide to which clustering a particular candidate cluster is best aligned. This can potentially serve as a foundation upon which multiple clustering criteria can be designed.</p></div>		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">When is &quot;nearest neighbor&quot; meaningful?</title>
		<author>
			<persName><forename type="first">K</forename><surname>Beyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Goldstein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Ramakrishnan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Shaft</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. ICDT</title>
				<meeting>ICDT</meeting>
		<imprint>
			<date type="published" when="1999">1999</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Density-based indexing for approximate nearest-neighbor queries</title>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">P</forename><surname>Bennett</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Fayyad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Geiger</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. KDD</title>
				<meeting>KDD</meeting>
		<imprint>
			<date type="published" when="1999">1999</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Clustering high dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering</title>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">P</forename><surname>Kriegel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Kröger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zimek</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM TKDD</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="1" to="58" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Can sharedneighbor-distances defeat the curse of dimensionality?</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">E</forename><surname>Houle</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">P</forename><surname>Kriegel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Kröger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Schubert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zimek</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. SSDBM</title>
				<meeting>SSDBM</meeting>
		<imprint>
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Quality of similarity rankings in time series</title>
		<author>
			<persName><forename type="first">T</forename><surname>Bernecker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">E</forename><surname>Houle</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">P</forename><surname>Kriegel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Kröger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Renz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Schubert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zimek</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. SSTD</title>
				<meeting>SSTD</meeting>
		<imprint>
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions</title>
		<author>
			<persName><forename type="first">A</forename><surname>Andoni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Indyk</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Symp. Foundations of Computer Science</title>
				<imprint>
			<date type="published" when="2006">2006</date>
			<biblScope unit="page" from="459" to="468" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Fast approximate similarity search in extremely highdimensional data sets</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">E</forename><surname>Houle</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sakama</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. ICDE</title>
				<meeting>ICDE</meeting>
		<imprint>
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Clustering using a similarity measure based on shared near neighbors</title>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">A</forename><surname>Jarvis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">A</forename><surname>Patrick</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE TC C</title>
		<imprint>
			<biblScope unit="volume">22</biblScope>
			<biblScope unit="issue">11</biblScope>
			<biblScope unit="page" from="1025" to="1034" />
			<date type="published" when="1973">1973</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">ROCK: a robust clustering algorithm for categorical attributes</title>
		<author>
			<persName><forename type="first">S</forename><surname>Guha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Rastogi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Shim</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Inform. Sys</title>
		<imprint>
			<biblScope unit="volume">25</biblScope>
			<biblScope unit="page" from="345" to="366" />
			<date type="published" when="2000">2000</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data</title>
		<author>
			<persName><forename type="first">L</forename><surname>Ertöz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Steinbach</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Kumar</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. SDM</title>
				<meeting>SDM</meeting>
		<imprint>
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">The relevant-set correlation model for data clustering</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">E</forename><surname>Houle</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Stat. Anal. Data Min</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="157" to="176" />
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
