Meta-learning System for Automated Clustering Sergey Muravyov and Andrey Filchenkov ITMO University, Computer Technology Lab, Kronverksky pr. 49, 197101 St.Petersburg, Russia {smuravyov,afilchenkov}@corp.ifmo.ru Introduction. Clustering is the most common unsupervised learning task. The number of the clustering algorithms is constantly increasing, which raises a problem of clustering algorithm selection. Selection of proper features for cluster- ing is also important, and several algorithms are designed to solve this problem. Thus, the question arises if we can simultaneously choose both feature selec- tion (FS) algorithm and clustering algorithm given a dataset. To make it even more complicated, no standard method to assess clustering quality exists. To cope with that four approaches for clustering quality assessment may be dis- tinguished. In [1], a method for clustering quality evaluation based on human assessments was proposed. It was also shown there that none of existing cluster validity index (CVI) suits for all problems. Thus, another question arises if we can select a proper CVI given a dataset. This paper aims to answer both questions above. We propose to use a system consisting of two parts. The first part predicts the best CVI given a dataset using approach presented in [1]. The second part predicts the best clustering algorithm and the best FS algorithm for a given dataset and a clustering performance measure. The second part does not depend on a performance measure and can work without the first part. Related works. In comparison to supervised learning tasks, only a few pa- pers are devoted to creating meta-learning systems for clustering. One of the latest is [2], the authors of which suggested a set of meta-features that are used for selection of a clustering algorithm based on an average ranking of several clustering validity indexes based on inter-/intra-cluster measures. In earlier pa- pers [3–5], the authors used external efficacy measures, which is not applicable in real life. We found no other works on this topic, neither did the authors of [6]. Meta-learning system for clustering. We propose a nested system con- sisting of three components. The first component CVIsel predicts the best CVI for a given dataset. The second component Clusel predicts the best clustering algorithm for a given dataset and a performance measure. The last component FSsel predicts the best FS algorithm given a dataset, a performance measure and a clustering algorithm. CVIsel component is taught using training datasets labelled according to the approach proposed in [1]. The main idea is to ask human assessors to evaluate several partitions of a dataset, to rank them, and to compare with how each CVI ranks these partitions. The more similar resulting ranks are the better CVI is. Another approach is to ask human assessors to simply mark partitions as adequate or not and then to evaluate how high CVI higher inadequate partitions. 2 Sergey Muravyov and Andrey Filchenkov CVIsel uses the labeled set of dataset to learn a meta-classifier that predicts the best CVI for a given dataset and it predicts if this CVI is adequate or not. Clusel component is learned using training datasets labelled according to performance of clustering algorithms with respect to a chosen performance mea- sure. It works in the same way as CVIsel does. FSsel component works in a slightly different way. Application of a FS algo- rithm produces a new dataset that may have other properties than the original one. This means that if we also chose from several CVIs, the new dataset may have another best CVI and another best clustering algorithm. If two different FS algorithms produce two datasets that have different best CVIs then these two algorithms are incomparable. This is why we evaluate each FS algorithm for each CVI. FSsel uses the labeled set of datasets and prediction of adequacy of each CVI for resulting dataset to predict the best FSsel with respect to each CVI predicted to be adequate. We 19 meta-features for clustering that are based on distances between ob- jects proposed in [2]. We also generated 60 landmarks obtained by running of 12 different clustering algorithms with different parameters by each CVI. We ap- plied FS for each described meta-classifier resulting into various meta-features for each of them. Experiments. We took 200 real clustering dataset from different sources, including UCI1 and KEEL2 . We choose the five best CVIs in terms of their rank from the 19 that were compared in [1]. These CVIs are: OS-index, Symmetric index, GD41, GD33 and GD43. We used six clustering algorithms: k-Means [7], X-Means [8], EM [9], DBSCAN [10], FarthestFirst [11], and Hierarchical [12]. We used four FS algorithms for clustering [13]: spectral FS; Laplasian score, localized FS based on scatter separability, and multi-cluster FS. We used F1 -measure and leave-one-out cross-validation to estimate meta- classifier performance. The results are presented in Table 1. As we can see, each component has achieved more than 80% of F1 -measure for multi-class classifica- tion tasks using Random Forest. Table 1. Values of F-measure of different components for predicting CVI, clustering algorithm and mean values for predicting FS algorithm Multilayer Naive Bayesian Random Random Component kNN Perceptron Bayes Network Forest Trees CVIsel .680 .690 .700 .725 .855 .805 Clusel .645 .660 .715 .730 .805 .775 FSsel (average) .590 .611 .690 .797 .818 .808 Conclusion. We proposed a full model selection system for clustering. The system contains three components that predict appropriate CVI, clustering al- 1 https://archive.ics.uci.edu/ml/datasets.html 2 http://sci2s.ugr.es/keel/datasets.php Meta-learning system for automated clustering 3 gorithm and FS algorithm for a given unlabeled dataset. The system was tested on real life dataset and got satisfactory results. We plan to improve the accuracy of the system and explore if we can overcome the constraint on FS algorithm incomparability. Acknowledgments. This work was financially supported The Russian Sci- ence Foundation, Agreement 17-71-30029, and Russian Foundation for Basic Research, Grant 16-37-60115 mol a dk and by The Russian Science Foundation, Agreement 17-71-30029. References 1. Filchenkov, A., Muravyov, S., Parfenov, V.: Towards cluster validity index eval- uation and selection. In: Artificial Intelligence and Natural Language Conference (AINL), IEEE. pp. 1–8. IEEE (2016) 2. Ferrari, D.G., de Castro, L.N.: Clustering algorithm selection by meta-learning systems: A new distance-based problem characterization and ranking combination methods. Information Sciences 301, 181–194 (2015) 3. de Souto, M., Prudêncio, R., Soares, R., de Araujo, D., Costa, I., Ludermir, T., Schliep, A.: Ranking and selecting clustering algorithms using a meta-learning approach. In: Neural Networks, 2008. IJCNN 2008. (IEEE World Congress on Computational Intelligence). IEEE International Joint Conference on. pp. 3729– 3735 (June 2008) 4. Soares, R.G., Ludermir, T.B., De Carvalho, F.A.: An analysis of meta-learning techniques for ranking clustering algorithms applied to artificial data. In: Artificial Neural Networks–ICANN 2009, pp. 131–140. Springer (2009) 5. Ferrari, D.G., de Castro, L.N.: Clustering algorithm recommendation: a meta- learning approach. In: Swarm, Evolutionary, and Memetic Computing, pp. 143– 150. Springer (2012) 6. Van Craenendonck, T., Blockeel, H.: Using internal validity measures to compare clustering algorithms. In: AutoML Workshop at ICML 2015. pp. 1–8 (2015) 7. Vassilvitskii, S., Arthur, D.: k-means++: The advantages of careful seeding. In: Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algo- rithms. pp. 1027–1035 (2006) 8. Pelleg, D., Moore, A.W., et al.: X-means: Extending k-means with efficient esti- mation of the number of clusters. In: ICML. pp. 727–734 (2000) 9. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the em algorithm. Journal of the royal statistical society. Series B (method- ological) pp. 1–38 (1977) 10. Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discov- ering clusters in large spatial databases with noise. In: KDD. vol. 96, p. 226231 (1996) 11. Rosenkrantz, D.J., Stearns, R.E., Lewis, II, P.M.: An analysis of several heuristics for the traveling salesman problem. SIAM journal on computing 6(3), 563–581 (1977) 12. Rokach, L., Maimon, O.: Clustering methods. In: Data mining and knowledge discovery handbook, pp. 321–352. Springer (2005) 13. Alelyani, S., Tang, J., Liu, H.: Feature selection for clustering: A review. Data Clustering: Algorithms and Applications 29, 30–55 (2013)