<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Empirical Comparative Analysis of 1-of-K Coding and K-Prototypes in Categorical Clustering</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Fei</forename><surname>Wang</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">School of Computing</orgName>
								<orgName type="institution">Dublin Institute of Technology</orgName>
								<address>
									<country key="IE">Ireland</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Hector</forename><surname>Franco</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">School of Computing</orgName>
								<orgName type="institution">Dublin Institute of Technology</orgName>
								<address>
									<country key="IE">Ireland</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">John</forename><surname>Pugh</surname></persName>
							<affiliation key="aff1">
								<orgName type="institution">Nathean Technologies Ltd</orgName>
								<address>
									<settlement>Dublin</settlement>
									<country key="IE">Ireland</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Robert</forename><surname>Ross</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">School of Computing</orgName>
								<orgName type="institution">Dublin Institute of Technology</orgName>
								<address>
									<country key="IE">Ireland</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Empirical Comparative Analysis of 1-of-K Coding and K-Prototypes in Categorical Clustering</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">982343B9792E7C588333D695B2BB6EA0</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T14:14+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>clustering</term>
					<term>categorical data</term>
					<term>k-means</term>
					<term>k-prototypes</term>
					<term>efficiency</term>
					<term>clustering validity</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Clustering is a fundamental machine learning application that partitions data into homogeneous groups. K-means and its variants are the most widely used class of clustering algorithms today. However, the original k-means algorithm can only be applied to numeric data. Categorical data must first be converted into numeric data through 1-of-K coding, which itself causes several problems. K-prototypes, another clustering algorithm derived from k-means, can handle categorical data by adopting a different notion of distance. In this paper, we systematically compare these two methods through an experimental analysis. Our analysis shows that k-prototypes is better suited to large-scale datasets, while the performance of k-means with 1-of-K coding is more stable. We believe these are useful heuristics for clustering methods working with highly categorical data.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Clustering is a fundamental machine learning operation that partitions data into homogeneous groups <ref type="bibr" target="#b14">[15]</ref>. Unlike classification, clustering looks at the intrinsic characteristics of the data rather than its relationship with external labels. Identified data clusters should be "externally isolated and internally cohesive, implying a certain degree of homogeneity within clusters and heterogeneity between clusters" <ref type="bibr" target="#b25">[26]</ref>. In other words, clustering aims to partition a set of objects into clusters such that objects in the same cluster are more similar to each other than objects in different clusters <ref type="bibr" target="#b14">[15,</ref><ref type="bibr" target="#b23">24]</ref>.</p><p>Clustering has historically been the most popular of the unsupervised machine learning techniques. Typical applications include: (a) the discovery of underlying structure in data; (b) the classification of data based on its intrinsic nature; and (c) the compression of data <ref type="bibr" target="#b16">[17]</ref>. As a fundamental method in data mining and machine learning, clustering has been applied in a variety of fields, such as image segmentation, document analysis, customer segmentation, workforce management and genome research in biology <ref type="bibr" target="#b16">[17]</ref>. It is also treated as an important part of unsupervised learning in most data mining and machine learning textbooks <ref type="bibr" target="#b20">[21,</ref><ref type="bibr" target="#b17">18,</ref><ref type="bibr" target="#b5">6,</ref><ref type="bibr" target="#b7">8,</ref><ref type="bibr" target="#b28">29]</ref>.</p><p>Clustering algorithms can be divided into four categories <ref type="bibr" target="#b28">[29]</ref>: (a) Representative-based clustering, e.g. 
agglomerative hierarchical clustering; (c) Density-based clustering, e.g. DBSCAN; and (d) Spectral and graph clustering, e.g. spectral clustering. Jain provides a useful and detailed introduction to the progress and development of the different kinds of clustering algorithms <ref type="bibr" target="#b16">[17]</ref>. Given the multitude of clustering techniques, the primary question for a given application is which algorithm to choose for a specific case. However, answering this question can be as complex as the range of algorithms available: it depends on the characteristics of the source data, the algorithms and the goals of clustering. As stated in <ref type="bibr" target="#b16">[17,</ref><ref type="bibr" target="#b27">28]</ref>, no clustering algorithm can be universally applied to all problems. Algorithms are always designed under certain assumptions or restrictions, so it is important to have a clear idea of the conditions under which clustering takes place. Even though much work can be done before clustering to select an algorithm, it is often still impossible to find the "best" one, because the number of combinations of algorithms and conditions is vast. A systematic comparison of some widely used algorithms is usually the pragmatic way to decide which algorithm to use.</p><p>Our own interest in clustering stems from its importance in customer segmentation. In commercial Business Intelligence applications, the ability to cluster data is a vital tool for providing insights into business data. For end users, the most beneficial form of clustering is one where little a priori knowledge, such as the likely number of clusters, is needed before the clustering process begins. We see the automatic parameterisation and execution of clustering processes as a goal for our work from both an academic and a commercial perspective. 
We are particularly concerned with the problem of clustering data that has a high proportion of categorical features. Clustering highly categorical data has its own associated challenges but is of very real interest to a range of application types. We also pay close attention to the efficiency of algorithm implementations, which is always a vital concern for commercial applications.</p><p>In this paper, we outline an empirical comparison of k-means clustering with its derivative algorithm k-prototypes in the context of categorical data analysis. We begin in Section 2 with a brief recap of key issues in clustering data with a high proportion of categorical features and introduce the two algorithms on which we focus. In Section 3 we outline the design of our empirical comparison of the algorithms in question. Section 4 presents the study results, before we draw conclusions and outline future work in Section 5.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Background</head><p>The most widely known clustering algorithm is k-means, which was first published in 1955 <ref type="bibr" target="#b16">[17]</ref>. Even today, k-means is still widely applied and researched in different fields because of its ease of implementation, simplicity, efficiency, and empirical success. K-means is a typical representative-based algorithm: it first finds the representative of each cluster, then assigns each object to its most similar representative, and finally forms the clusters from the objects sharing the same representative <ref type="bibr" target="#b28">[29]</ref>. It is also a partitional clustering algorithm, which finds all the clusters simultaneously by partitioning all the objects and, unlike hierarchical algorithms, does not produce a hierarchical structure <ref type="bibr" target="#b16">[17,</ref><ref type="bibr" target="#b19">20]</ref>.</p><p>It has long been known that the performance of k-means depends greatly on the initialisation of the means. Several initialisation methods have been proposed for k-means, e.g. k-means++ <ref type="bibr" target="#b2">[3]</ref>. Recent research shows that k-means probably reaches the global optimum when the initial means are well separated <ref type="bibr" target="#b16">[17]</ref>. However, the usual way to overcome local optima is still to run the k-means algorithm multiple times for a given k value with different initial means, and choose the clustering result with the smallest cost function <ref type="bibr" target="#b25">[26]</ref>.</p><p>As with many other machine learning algorithms, the basic k-means algorithm cannot directly deal with categorical data. Firstly, the common distance measure used in k-means is (squared) Euclidean distance, which can only be computed on numeric data. 
Secondly, the arithmetic mean is taken as the representative of each cluster, a concept that is likewise only available for numeric data. However, categorical data is empirically as important as numeric data, and this problem limits the usage of k-means considerably. To solve this problem and make k-means fit different data types, there are several ways to adapt k-means to deal with categorical data.</p><p>The traditional method for most machine learning algorithms to deal with categorical data is to convert all the categorical data into numeric data <ref type="bibr" target="#b25">[26]</ref>. Ordinal data can readily be converted into numeric data based on its inherent order, but truly nominal data cannot be ordered in a meaningful way: the distance from "Red" to "Green" is the same as the distance to "Blue" or "Yellow". Therefore, for nominal data, other methods are required. In this paper, two commonly used methods are considered: 1-of-K coding and k-prototypes.</p><p>The first method is due to Ralambondrainy <ref type="bibr" target="#b21">[22]</ref>, who proposed an extended k-means algorithm as a complement for categorical data clustering. Before the normal k-means steps, this algorithm converts each multi-category feature into a set of binary features, using 1 and 0 to represent the presence or absence of a category value in an object. This method is also called 1-of-K coding and is widely adopted not only in k-means, but also in other machine learning algorithms, such as kNN <ref type="bibr" target="#b11">[12]</ref>.</p><p>K-prototypes, on the other hand, inherits the ideas of k-means, but applies different distances and different representatives to numeric and categorical data <ref type="bibr" target="#b24">[25]</ref>. 
For a dataset with both numeric and categorical features, the features can be organised as</p><formula xml:id="formula_0">A n 1 , A n 2 , ..., A n p , A c p+1 , ..., A c m ,</formula><p>where m is the total number of features, p is the number of numeric features and (m − p) is the number of categorical features. K-prototypes applies the same distance and representative as k-means to the first p numeric features, but for the last (m − p) categorical features, the limitations of k-means are removed by the following modifications <ref type="bibr" target="#b14">[15]</ref>: 1 using the simple matching distance for categorical features; 2 replacing the means of clusters by modes.</p><p>Apart from the definitions of distance and representative, k-prototypes inherits the entire implementation process of k-means, so the simplicity and efficiency of k-means are well retained in k-prototypes. It is easy to see that if the dataset contains only numeric features, k-prototypes reduces to k-means. If the dataset contains only categorical features, the algorithm reduces to k-modes, which handles purely categorical data <ref type="bibr" target="#b25">[26]</ref>.</p><p>K-prototypes has become one of the best-known methods in categorical data clustering <ref type="bibr" target="#b24">[25]</ref>. It has been extended in many different ways and is also used as a benchmark for comparison. <ref type="bibr" target="#b1">[2]</ref> discusses initialisation methods for k-prototypes. In <ref type="bibr" target="#b15">[16]</ref>, k-modes is taken as one of the methods to generate base clusterings for categorical data. <ref type="bibr" target="#b3">[4]</ref> presents an extension of k-modes for clustering high-dimensional categorical data. <ref type="bibr" target="#b8">[9]</ref> takes k-modes as a benchmark and proposes a modified algorithm based on it. 
<ref type="bibr" target="#b9">[10]</ref> presents an approximation algorithm to improve k-modes. <ref type="bibr" target="#b12">[13]</ref> proposes the fuzzy k-modes algorithm.</p><p>Although both 1-of-K coding and k-prototypes have been widely used, their performance has not been compared systematically. From a theoretical perspective, several claims have been made: k-modes is faster because it needs fewer iterations to converge <ref type="bibr" target="#b14">[15]</ref>; 1-of-K coding requires more space and time because it greatly expands the dimensionality <ref type="bibr" target="#b13">[14]</ref>; there is information loss in both methods <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b14">15,</ref><ref type="bibr" target="#b11">12,</ref><ref type="bibr" target="#b24">25]</ref>; and neither method guarantees the global optimum <ref type="bibr" target="#b8">[9,</ref><ref type="bibr" target="#b9">10]</ref>. However, these claims, and the degree to which the underlying effects influence clustering performance, have not been examined experimentally. There are good reasons for this gap: it is very difficult to generate artificial datasets with categorical data for clustering <ref type="bibr" target="#b14">[15]</ref>, and there is no common internal evaluation method for comparing clustering algorithms defined with different distances. In this paper, we carry out the empirical comparison using external evaluation, with additional checks designed especially for this purpose.</p></div>
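The two approaches just described can be made concrete in a few lines. The following is a minimal illustrative sketch (our own, not the implementation used in this study) of 1-of-K coding for a nominal value, and of the k-prototypes mixed distance, which combines squared Euclidean distance on the numeric features with simple matching on the categorical features, weighted by the parameter γ:

```python
def one_of_k(value, categories):
    """Encode one nominal value as a binary indicator vector (1-of-K coding)."""
    return [1.0 if value == c else 0.0 for c in categories]


def k_prototypes_distance(x_num, x_cat, proto_num, proto_cat, gamma):
    """K-prototypes dissimilarity: squared Euclidean distance on the numeric
    part plus gamma times the simple matching distance (count of mismatched
    categories) on the categorical part."""
    numeric = sum((a - b) ** 2 for a, b in zip(x_num, proto_num))
    matching = sum(1 for a, b in zip(x_cat, proto_cat) if a != b)
    return numeric + gamma * matching
```

For example, `one_of_k("Red", ["Red", "Green", "Blue"])` yields `[1.0, 0.0, 0.0]`; with γ = 0 the mixed distance ignores the categorical features entirely, which is why γ must be tuned for mixed datasets.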
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Comparison of 1-of-K Coding and K-prototypes</head><p>Normally, two types of measures can be used to evaluate machine learning algorithms in empirical studies: external measures and internal measures <ref type="bibr" target="#b14">[15,</ref><ref type="bibr" target="#b28">29]</ref>. The former treats labelled datasets as the ground truth and compares the learning results with the existing labels to determine how good the learning is. The latter focuses on the intrinsic structure and characteristics of the datasets, rather than on external man-made labels, and so is widely used in the evaluation of clustering problems, such as choosing the best k value in k-means. Although internal measures such as the silhouette coefficient <ref type="bibr" target="#b22">[23]</ref> can be calculated with any distance, they cannot be used to compare algorithms defined with different distances. Therefore, we use only external measures in the present experiment to evaluate the clustering results.</p><p>Due to resource limitations, ideal datasets from industry with mature labels were not available. Moreover, as discussed above, it is very difficult to generate artificial datasets with categorical data for clustering, so our experiment is based on labelled real-world datasets. All the datasets used here are from the UC Irvine Machine Learning Repository (http://archive.ics.uci.edu/ml/). 
We first choose four datasets that are well known and widely used in categorical data clustering research: Soybean <ref type="bibr" target="#b14">[15,</ref><ref type="bibr" target="#b4">5,</ref><ref type="bibr" target="#b15">16,</ref><ref type="bibr" target="#b3">4,</ref><ref type="bibr" target="#b18">19,</ref><ref type="bibr" target="#b26">27]</ref>, Congressional Voting Records <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b15">16,</ref><ref type="bibr" target="#b6">7,</ref><ref type="bibr" target="#b8">9,</ref><ref type="bibr" target="#b9">10]</ref>, Credit Approval <ref type="bibr" target="#b14">[15,</ref><ref type="bibr" target="#b24">25,</ref><ref type="bibr" target="#b10">11,</ref><ref type="bibr" target="#b18">19]</ref> and Mushroom <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b15">16,</ref><ref type="bibr" target="#b3">4,</ref><ref type="bibr" target="#b6">7,</ref><ref type="bibr" target="#b8">9,</ref><ref type="bibr" target="#b27">28,</ref><ref type="bibr" target="#b9">10]</ref>. We note that the labels were created by human experts for specific purposes; for example, in the Credit Approval dataset the data consists of general information about people, but the labels record only whether credit was granted, so they represent the data from one specific aspect rather than its main structure. Therefore, we evaluate the dataset labels prior to the comparison, keeping only those datasets whose labels correlate with the results of both k-means and k-prototypes. In addition, two large datasets, Adult and Bank Marketing, are added to the experiment to evaluate the time consumed during clustering. Detailed information about the datasets is listed in Table <ref type="table">1</ref>. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Normalisation</head><p>Because of the lack of expert knowledge about these data, all instances with null values are removed before input. However, instances with "?" instead of a null value are retained, because "?" is treated as "unknown", a genuine category arising in the real-life data. After this filtering, the datasets are re-organised into the format required for the clustering implementation.</p><p>Our experiments are run for each dataset and each algorithm as outlined in Fig. <ref type="figure" target="#fig_0">1</ref>. For k-means, 1-of-K coding is applied at the first stage, so that all the data can be normalised. For k-prototypes, there is no need for 1-of-K coding, but a range of values for the parameter γ is used for each dataset. For both algorithms, 100 runs are performed for each configuration (one per γ value for k-prototypes). Although for purely categorical data the setting of γ does not affect the clustering result, we still use a range of γ values to evaluate clustering efficiency. Four steps in this process are worth discussing: the normalisation, the selection of the k value, the 100 runs for each configuration, and the initialisation. Much research has been conducted on normalisation methods. Based on Steinley's review paper <ref type="bibr" target="#b25">[26]</ref>, normalisation by range, as in Eq. 1, rather than by z-scores, leads to better performance, especially for k-means clustering.</p><formula xml:id="formula_1">X ′ = X − X min X max − X min (1)</formula><p>For k-prototypes, the definition of the algorithm also requires normalisation by range for numeric data <ref type="bibr" target="#b14">[15]</ref>. Therefore, normalisation by range is adopted in the experiment.</p></div>
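Normalisation by range (Eq. 1) can be sketched as follows; this is an illustrative implementation of the formula, not the code used in the experiment:

```python
def normalise_by_range(column):
    """Rescale a numeric feature to [0, 1] via (x - min) / (max - min), as in Eq. 1."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant feature carries no information; map every value to 0.0
        # to avoid division by zero.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]
```

Applied per numeric feature, e.g. `normalise_by_range([2, 4, 6])` gives `[0.0, 0.5, 1.0]`; the 0/1 indicator features produced by 1-of-K coding are already in this range.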
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Selection of k</head><p>In the experiment, the k values are selected according to the labels; that is, k equals the number of distinct label categories. Methods for selecting k in k-means have been discussed extensively in previous research <ref type="bibr" target="#b25">[26]</ref>, and most of them can also be applied to k-prototypes.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">100 Runs for Each Configuration</head><p>Due to the characteristics of k-means and k-prototypes, the global optimum is not guaranteed in a single run of clustering. The common approach is to run the algorithm multiple times with the same parameter settings and then choose the result with the best cost function as the final clustering result. Therefore, we focus only on a range of good results, rather than all of them, which differs from the evaluation of other machine learning applications. Likewise, the stability discussed in this paper refers to how often good results can be achieved, rather than to the mean or variance of all results as in other applications.</p></div>
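The multiple-run strategy described above can be sketched as follows. Here `run_clustering` is a hypothetical stand-in for a single randomly initialised run of k-means or k-prototypes that returns a labelling and its cost; only the restart-and-keep-best logic is what this sketch illustrates:

```python
import random


def best_of_n_runs(run_clustering, data, k, n_runs=100, seed=0):
    """Run the clusterer n_runs times with different random initialisations
    and keep the run with the smallest cost function value."""
    rng = random.Random(seed)
    best_labels, best_cost = None, float("inf")
    for _ in range(n_runs):
        labels, cost = run_clustering(data, k, rng)
        if cost < best_cost:
            best_labels, best_cost = labels, cost
    return best_labels, best_cost
```

This mirrors the experimental design: 100 restarts per configuration, with only the best-cost result entering the final comparison.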
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Initialisation</head><p>For both k-means and k-prototypes, the local optimum reached depends on the starting centroids (means or prototypes). Although many initialisation methods have been proposed to avoid locally optimal solutions <ref type="bibr" target="#b25">[26]</ref>, k-means++ <ref type="bibr" target="#b2">[3]</ref> is the most popular. In k-means++, only the first centroid is chosen uniformly from the data points in the dataset; each subsequent centroid is chosen from the remaining data points with probability proportional to its squared distance from its closest existing centroid.</p><p>The evaluation of results starts with a comparison of efficiency, analysing the time taken, the number of iterations and the dimensionality of the input. After that, external measures are used to evaluate clustering validity. There are many external measures widely used in clustering evaluation, such as the F-measure, Normalised Mutual Information and the Jaccard Coefficient <ref type="bibr" target="#b28">[29]</ref>. Accuracy <ref type="bibr" target="#b14">[15,</ref><ref type="bibr" target="#b12">13,</ref><ref type="bibr" target="#b8">9,</ref><ref type="bibr" target="#b26">27</ref>] is adopted in this experiment, because it is easy to understand and the k value is only between 2 and 4. The clustering accuracy r is defined as:</p><formula xml:id="formula_2">r = ( ∑ k i=1 a i ) / n (<label>2</label>)</formula><p>where k is the number of clusters, n is the number of instances, and a i is the number of instances that are clustered correctly into cluster i. 
Over all mappings between the clustering results and the existing labels, the clustering accuracy r is defined as the maximum such value.</p><p>It should be noted that the accuracies on different datasets are not necessarily positively correlated with the validity of the clustering, because subjective opinions entered the labels when the human experts labelled the datasets. On the other hand, the only measure that reliably indicates how good the clustering is for each algorithm is its cost function. Therefore, before evaluating the validity of the algorithms, each dataset needs to be checked to confirm that its accuracy results follow the same trend as the cost function. Only the datasets whose accuracy results follow the same trends as the cost functions of both algorithms can be used in the final evaluation of clustering validity. <ref type="foot" target="#foot_0">3</ref></p></div>
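The accuracy of Eq. 2, maximised over cluster-to-label mappings, can be sketched as follows. This is our own illustrative code, assuming as in the experiment that k equals the number of label categories; since k is only 2 to 4 here, enumerating all mappings by brute force is feasible:

```python
from itertools import permutations


def clustering_accuracy(clusters, labels):
    """Eq. 2: r = (sum of correctly clustered instances) / n, maximised
    over all one-to-one mappings from cluster ids to label values.
    Assumes the number of label values is at least the number of clusters."""
    cluster_ids = sorted(set(clusters))
    label_ids = sorted(set(labels))
    best = 0
    for perm in permutations(label_ids, len(cluster_ids)):
        mapping = dict(zip(cluster_ids, perm))
        hits = sum(1 for c, l in zip(clusters, labels) if mapping[c] == l)
        best = max(best, hits)
    return best / len(labels)
```

For example, clusters `[1, 1, 0, 0]` against labels `["a", "a", "b", "b"]` still score 1.0, because the mapping 1→"a", 0→"b" is found by the maximisation.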
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Results</head><p>In this section we present the results of our empirical analysis. We begin with a discussion of run time costs before moving on to consider measures of accuracy.</p><p>From the time consumed in 100 runs of each algorithm (Fig. <ref type="figure">2</ref> and Fig. <ref type="figure">3</ref>), it can be seen that as the dataset grows large, the time consumed by k-means is 2 to 3 times greater than that of k-prototypes. The time consumed in calculation may not reflect the genuine efficiency of the algorithms exactly, but from a commercial perspective it is meaningful that the implementation of k-prototypes is generally much faster than that of k-means. However, from the number of iterations in each run (Fig. <ref type="figure">4</ref>) and the number of features before and after 1-of-K coding (Fig. <ref type="figure">5</ref>), we can see that the k-means algorithm consumes much more time not because it needs more iterations to converge, but because 1-of-K coding substantially expands the dimensionality of the datasets. As explained before, only the datasets whose accuracy results follow the same trends as the cost functions of both algorithms can be used in the final evaluation of clustering validity. Of the six datasets in Table <ref type="table">2</ref>, only three are chosen: Soybean, Congressional Voting Records and Mushroom. This does not mean that the accuracy results of these three datasets are perfectly positively correlated with the cost function results; after all, they are not artificial datasets designed exactly for clustering. But the accuracy results of these three datasets follow almost the same trends as the cost functions in showing how good the clustering is, so they can serve as common ground between the two algorithms and are used in the evaluation. Fig. <ref type="figure">6</ref>, Fig. <ref type="figure">7</ref> and Fig. 
<ref type="figure" target="#fig_4">8</ref> summarise the accuracy calculation results of Soybean, Congressional Voting Records and Mushroom respectively. The first columns give the clustering accuracy intervals. The second and third columns show the numbers of clustering results that fall into a specific interval. There are in total 100 in each column. For k-prototypes, the experiment is implemented 100 runs with each γ, and averages with decimals are filled into the table, because all of the datasets are purely categorical. From these tables, we get to compare the validity of these two algorithms.  <ref type="table">-</ref>Mushroom</p><p>From these results we can make the following observations:</p><p>1 Both algorithms get almost the same highest accuracy. For Soybean and Mushroom, the differences are within 1%, while for Congressional Voting Records, the difference is about 2%; 2 If taking the best accuracy as BR, and the clustering whose results fall into the interval of [BR − 10%, BR] as valid clustering, it is obvious that the valid results with k-means concentrate at the interval of highest accuracy, while the ones with k-prototypes spread much more widely in the valid interval, but the total numbers of valid clustering are not quite different. From this perspective, k-means is more stable than k-prototypes; 3 The numbers in bold refer to the best results based on cost function, that is, the objectively best clustering results. It is shown that for k-means all the best results in a situation lead to the same result with the best accuracy, but for k-prototypes, they may lead to multiple best results with even different performances. In other words, k-means probably finds only one global optimum, but k-prototype can find multiple global optima. Because the calculation in kprototypes is based on integers, it generates the same cost function easily even when the clustering results are different, while this is very rare in k-means.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Conclusion</head><p>In this paper we have presented k-means with 1-of-K coding and k-prototypes as two valid clustering algorithms for categorical data.</p><p>Even though they use different distances when calculating dissimilarity, k-means with 1-of-K coding and k-prototypes provide similar best results. In terms of clustering speed, k-prototypes is faster than k-means with 1-of-K coding, because the latter significantly expands the dimensionality of the original dataset. In terms of clustering validity, because of the characteristics of each algorithm, the valid results of k-prototypes spread across multiple optima, while those of k-means with 1-of-K coding concentrate at one point. Therefore, we conclude that k-means with 1-of-K coding is more stable than k-prototypes.</p><p>Due to the preliminary nature of our studies and the space constraints here, many questions about k-prototypes are not discussed in this paper, e.g. the selection of the k value, the setting of the parameter γ and feature weighting. Each of these requires more research.</p><p>As a valid clustering algorithm for categorical data, k-prototypes can be explored in different ways. On the one hand, many extensions of k-means or other clustering algorithms can be adjusted and applied to k-prototypes, e.g. using the silhouette coefficient in clustering result evaluation, or using the Hopkins statistic to assess the clustering tendency of datasets. On the other hand, the ideas behind k-prototypes, especially its distance measure, can be used directly to modify other algorithms and make them applicable to categorical data, e.g. density-based clustering algorithms. We see these as useful directions for future research.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 :</head><label>1</label><figDesc>Fig. 1: Experimental Process</figDesc><graphic coords="6,181.68,209.71,252.12,166.10" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Fig. 2 :Fig. 3 :</head><label>23</label><figDesc>Fig. 2: Time Consumed -Soybean, Voting and Credit</figDesc><graphic coords="8,134.76,260.71,173.05,115.50" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Fig. 4 : 5 :</head><label>45</label><figDesc>Fig. 4: Number of Iterations in Each Run Fig. 5: Number of Features before/after 1-of-K Coding</figDesc><graphic coords="8,134.76,515.54,173.05,115.50" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Fig. 6 :Fig. 7 :</head><label>67</label><figDesc>Fig. 6: Accuracy Table -Soybean</figDesc><graphic coords="9,134.76,404.89,172.91,145.68" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Fig. 8 :</head><label>8</label><figDesc>Fig. 8: Accuracy Table -Mushroom</figDesc><graphic coords="10,134.76,174.13,172.91,145.68" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>The Real World Databases Selected for the Experiment</figDesc><table><row><cell>No.</cell><cell>Datasets</cell><cell cols="2">Instances Total Attributes</cell><cell>Type</cell></row><row><cell>1</cell><cell>Soybean</cell><cell>47</cell><cell cols="2">35 (All Categorical) Categorical</cell></row><row><cell>2</cell><cell>Congressional</cell><cell>435</cell><cell cols="2">16 (All Categorical) Categorical</cell></row><row><cell></cell><cell>Voting Records</cell><cell></cell><cell></cell><cell></cell></row><row><cell>3</cell><cell>Credit Approval</cell><cell>690</cell><cell>15 (9 Categorical +</cell><cell>Mixed</cell></row><row><cell></cell><cell></cell><cell></cell><cell>6 Numeric)</cell><cell></cell></row><row><cell>4</cell><cell>Mushroom</cell><cell>8124</cell><cell cols="2">22 (All Categorical) Categorical</cell></row><row><cell>5</cell><cell>Adult</cell><cell>48842</cell><cell>14 (8 Categorical +</cell><cell>Mixed</cell></row><row><cell></cell><cell></cell><cell></cell><cell>6 Numeric)</cell><cell></cell></row><row><cell>6</cell><cell>Bank Marketing</cell><cell>45211</cell><cell>16 (9 Categorical +</cell><cell>Mixed</cell></row><row><cell></cell><cell></cell><cell></cell><cell>7 Numeric)</cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2:</head><label>2</label><figDesc>The Correlation between Accuracy Results and Cost Functions</figDesc><table><row><cell>No.</cell><cell>Datasets</cell><cell>Accuracy Correlated with Cost Function of</cell></row><row><cell>1</cell><cell>Soybean</cell><cell>Both</cell></row><row><cell>2</cell><cell>Congressional Voting Records</cell><cell>Both</cell></row><row><cell>3</cell><cell>Credit Approval</cell><cell>K-prototypes</cell></row><row><cell>4</cell><cell>Mushroom</cell><cell>Both</cell></row><row><cell>5</cell><cell>Adult</cell><cell>K-means</cell></row><row><cell>6</cell><cell>Bank Marketing</cell><cell>None</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_0">We cannot compare the results by cost function alone: each algorithm defines its cost function over a different type of distance, so a direct comparison of cost values across algorithms would be meaningless.</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Acknowledgement. The authors wish to acknowledge the support of Enterprise Ireland through the Innovation Partnership Programme SmartSeg 2.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">A k-mean clustering algorithm for mixed numeric and categorical data</title>
		<author>
			<persName><forename type="first">A</forename><surname>Ahmad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Dey</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Data &amp; Knowledge Engineering</title>
		<imprint>
			<biblScope unit="volume">63</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="503" to="527" />
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">K-Mean and K-Prototype Algorithms Performance Analysis</title>
		<author>
			<persName><forename type="first">I</forename><surname>Ahmad</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Computer and Information Technology</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="823" to="828" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">k-means++: The advantages of careful seeding</title>
		<author>
			<persName><forename type="first">D</forename><surname>Arthur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Vassilvitskii</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms</title>
				<meeting>the eighteenth annual ACM-SIAM symposium on Discrete algorithms</meeting>
		<imprint>
			<date type="published" when="2007">2007</date>
			<biblScope unit="page" from="1027" to="1035" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">A novel attribute weighting algorithm for clustering high-dimensional categorical data</title>
		<author>
			<persName><forename type="first">L</forename><surname>Bai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Liang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Dang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Cao</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Pattern Recognition</title>
		<imprint>
			<biblScope unit="volume">44</biblScope>
			<biblScope unit="issue">12</biblScope>
			<biblScope unit="page" from="2843" to="2861" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">Review of Clustering Algorithm for Categorical Data</title>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">M</forename><surname>Bhagat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">S</forename><surname>Halgaonkar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">M</forename><surname>Wadhai</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2013">2013</date>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page" from="341" to="345" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Pattern Recognition and Machine Learning</title>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">M</forename><surname>Bishop</surname></persName>
		</author>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Rock: A robust clustering algorithm for categorical attributes</title>
		<author>
			<persName><forename type="first">S</forename><surname>Guha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Rastogi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Shim</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 15th International Conference on Data Engineering</title>
				<meeting>the 15th International Conference on Data Engineering</meeting>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="1999">1999</date>
			<biblScope unit="page" from="512" to="521" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">The elements of statistical learning: data mining, inference and prediction</title>
		<author>
			<persName><forename type="first">T</forename><surname>Hastie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Tibshirani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Friedman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Franklin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">The Mathematical Intelligencer</title>
		<imprint>
			<biblScope unit="volume">27</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="83" to="85" />
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Improving k-modes algorithm considering frequencies of attribute values in mode</title>
		<author>
			<persName><forename type="first">Z</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Deng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Xu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Computational Intelligence and Security</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2005">2005</date>
			<biblScope unit="page" from="157" to="162" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Approximation algorithms for k-modes clustering</title>
		<author>
			<persName><forename type="first">Z</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Deng</surname></persName>
		</author>
		<author>
			<persName><surname>Xu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Computational Intelligence</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2006">2006</date>
			<biblScope unit="page" from="296" to="302" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title level="m" type="main">Clustering mixed numeric and categorical data: A cluster ensemble approach</title>
		<author>
			<persName><forename type="first">Z</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Deng</surname></persName>
		</author>
		<idno>arXiv preprint cs/0509011</idno>
		<imprint>
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Integrated dimensionality reduction technique for mixed data involving categorical values</title>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">C</forename><surname>Hsu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">H</forename><surname>Huang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Human Capital without Borders: Knowledge and Learning for Quality of Life; Proceedings of the Management, Knowledge and Learning International Conference</title>
				<imprint>
			<publisher>ToKnow-Press</publisher>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="245" to="255" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">A Fuzzy k-Modes Algorithm for Clustering Categorical Data</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">K</forename><surname>Ng</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Fuzzy Systems</title>
		<imprint>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="446" to="452" />
			<date type="published" when="1999">1999</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Clustering large data sets with mixed numeric and categorical values</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Huang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 1st Pacific-Asia Conference on Knowledge Discovery and Data Mining</title>
				<meeting>the 1st Pacific-Asia Conference on Knowledge Discovery and Data Mining<address><addrLine>Singapore</addrLine></address></meeting>
		<imprint>
			<publisher>PAKDD</publisher>
			<date type="published" when="1997">1997</date>
			<biblScope unit="page" from="21" to="34" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Extensions to the k-means algorithm for clustering large data sets with categorical values</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Huang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Data Mining and Knowledge Discovery</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="283" to="304" />
			<date type="published" when="1998">1998</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">A link-based cluster ensemble approach for categorical data clustering</title>
		<author>
			<persName><forename type="first">N</forename><surname>Iam-On</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Boongeon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Garrett</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Price</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Knowledge and Data Engineering</title>
		<imprint>
			<biblScope unit="volume">24</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="413" to="425" />
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Data clustering: 50 years beyond k-means</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">K</forename><surname>Jain</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Pattern Recognition Letters</title>
		<imprint>
			<biblScope unit="volume">31</biblScope>
			<biblScope unit="issue">8</biblScope>
			<biblScope unit="page" from="651" to="666" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<title level="m" type="main">Fundamentals of machine learning for predictive data analytics: algorithms, worked examples, and case studies</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">D</forename><surname>Kelleher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mac Namee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>D'Arcy</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2015">2015</date>
			<publisher>MIT Press</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Fuzzy clustering of categorical data using fuzzy centroids</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">W</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">H</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Lee</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Pattern Recognition Letters</title>
		<imprint>
			<biblScope unit="volume">25</biblScope>
			<biblScope unit="issue">11</biblScope>
			<biblScope unit="page" from="1263" to="1271" />
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Recent advances in clustering: A brief survey</title>
		<author>
			<persName><forename type="first">S</forename><surname>Kotsiantis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Pintelas</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">WSEAS Transactions on Information Science and Applications</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="73" to="81" />
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<monogr>
		<title level="m" type="main">Information theory, inference and learning algorithms</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">J</forename><surname>MacKay</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2003">2003</date>
			<publisher>Cambridge University Press</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">A conceptual version of the k-means algorithm</title>
		<author>
			<persName><forename type="first">H</forename><surname>Ralambondrainy</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Pattern Recognition Letters</title>
		<imprint>
			<biblScope unit="volume">16</biblScope>
			<biblScope unit="issue">11</biblScope>
			<biblScope unit="page" from="1147" to="1157" />
			<date type="published" when="1995">1995</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Silhouettes: A graphical aid to the interpretation and validation of cluster analysis</title>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">J</forename><surname>Rousseeuw</surname></persName>
		</author>
		<ptr target="http://www.sciencedirect.com/science/article/pii/0377042787901257" />
	</analytic>
	<monogr>
		<title level="j">Journal of Computational and Applied Mathematics</title>
		<imprint>
			<biblScope unit="volume">20</biblScope>
			<biblScope unit="page" from="53" to="65" />
			<date type="published" when="1987">1987</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">A novel similarity measure for clustering categorical data sets</title>
		<author>
			<persName><forename type="first">R</forename><surname>Sayal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">V</forename><surname>Kumar</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Computer Applications</title>
		<imprint>
			<biblScope unit="volume">17</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="25" to="30" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">A two-step method for clustering mixed categorical and numeric data</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">Y</forename><surname>Shih</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Jheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">F</forename><surname>Lai</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Tamkang Journal of Science and Engineering</title>
		<imprint>
			<biblScope unit="volume">13</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="11" to="19" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">K-means clustering: A half-century synthesis</title>
		<author>
			<persName><forename type="first">D</forename><surname>Steinley</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">British Journal of Mathematical and Statistical Psychology</title>
		<imprint>
			<biblScope unit="volume">59</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="1" to="34" />
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">An iterative initial-points refinement algorithm for categorical data clustering</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Chen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Pattern Recognition Letters</title>
		<imprint>
			<biblScope unit="volume">23</biblScope>
			<biblScope unit="issue">7</biblScope>
			<biblScope unit="page" from="875" to="884" />
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Survey of clustering algorithms</title>
		<author>
			<persName><forename type="first">R</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Wunsch</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Neural Networks</title>
		<imprint>
			<biblScope unit="volume">16</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="645" to="678" />
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J</forename><surname>Zaki</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Meira</surname><genName>Jr</genName></persName>
		</author>
		<title level="m">Data mining and analysis: fundamental concepts and algorithms</title>
				<imprint>
			<publisher>Cambridge University Press</publisher>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
