<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">A study of SOM clustering software implementations</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">A</forename><forename type="middle">B</forename><surname>Adeyemo</surname></persName>
							<email>sesanadeyemo@gmail.com</email>
							<affiliation key="aff0">
								<orgName type="department">Computer Science Department</orgName>
								<orgName type="institution">University of Ibadan Nigeria</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">A study of SOM clustering software implementations</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">A5A115B500F002FCC02E819ADFFD1D5B</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T00:55+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Comparative Analysis</term>
					<term>Clustering</term>
					<term>Self Organizing Maps</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Clustering algorithms generally suffer from some well-known problems that Self-Organizing Map (SOM) algorithms are adept at handling. While there are many variants of the SOM algorithm, software programmes that implement SOM algorithms have tended to give varying results even when tested on the same data sets. This can have serious implications when the goal of the clustering is novelty detection. In this study the performance of several SOM clustering software implementations was compared and the results are presented.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>CCS Concepts</head><p>• General and reference ➝ Computing tools and techniques ➝ Empirical studies</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">INTRODUCTION</head><p>In the clustering process data is grouped in such a way that the intra-cluster similarity is maximized while the inter-cluster similarity is minimized. Data can be described by either categorical or numeric features. Due to the differences in the characteristics of these two kinds of data, attempts to develop criterion functions for mixed data have not been very successful <ref type="bibr" target="#b14">[15]</ref>. There are two widely used clustering methods: the hierarchical and the non-hierarchical (partitional) methods. The hierarchical clustering process is categorized as divisive when a large data set is divided into several small groups, and agglomerative when small groups are merged to create larger clusters. Self-Organizing Maps (SOM) are competitive networks that provide a "topological" mapping from the input space to the clusters <ref type="bibr" target="#b4">[4]</ref>. The SOM was inspired by the way in which various human sensory impressions are neurologically mapped into the brain such that spatial or other relations among stimuli correspond to spatial relations among the neurons.</p><p>In a SOM, the neurons (clusters) are organized into a grid which is usually two-dimensional, but sometimes one-dimensional or, rarely, three- or more-dimensional. The reason for using one- and two-dimensional grids is that structures of higher dimensionality cause problems with data display and cannot be shown on a monitor. The SOM algorithm is a variant of multidimensional vector clustering, of which the K-means clustering algorithm is another example <ref type="bibr">[9]</ref>.</p><p>The SOM neural network uses a competitive learning algorithm and is a method for unsupervised learning, based on a grid of artificial neurons whose weights are adapted to match input vectors in a training set. 
The SOM algorithm is fed with feature vectors, which can be of any dimension. The algorithm for the training of the SOM <ref type="bibr" target="#b4">[4]</ref> is explained easily in terms of a set of artificial neurons, each having its own physical location on the output map, which take part in a winner-take-all process where the node with its weight vector closest to the vector of inputs is declared the winner and its weights are adjusted, making them closer to the input vector. In each training step, one sample vector "x" from the input data set is chosen randomly and a similarity measure is calculated between it and all the weight vectors of the map. The Best-Matching Unit (BMU), denoted as "c", is the unit whose weight vector has the greatest similarity with the input sample "x" (figure <ref type="figure" target="#fig_0">1</ref>). The similarity is usually defined by means of a distance measure, usually the Euclidean distance. The BMU is defined mathematically as the processing element "c" for which d(x, mc) = min i {d(x, mi)} ………………… (1) where d is the distance measure and mi is the weight (reference) vector of unit i.</p><p>Each node has a set of neighbors. When a node wins a competition, the neighbors' weights are also changed, but not as much as that of the winning node. The further a neighbor is from the winner, the smaller its weight change. The SOM update rule for the weight vector of unit i is given mathematically as: mi(t+1) = mi(t) + hc(x),i(t)[x(t) − mi(t)] ………………… (2) where t represents the sample index for each presentation of a sample "x" and hc(x),i represents the neighborhood function around the winner unit "c", with neighborhood radius r(t).</p><p>The neighborhood function is like a smoothing kernel that is time-variable. It is a decreasing function of the distance between the ith and cth reference vectors on the map grid. 
The neighborhood function is usually the Gaussian function, which can be expressed mathematically as: hc(x),i(t) = α(t) exp(−║ri − rc║² / 2σ²(t)) ………………… (3) where α(t) represents the learning rate factor, which takes values 0 &lt; α(t) &lt; 1, and σ(t) represents the width of the neighborhood function, which decreases monotonically with the regression steps.</p><p>A simpler definition of the neighbourhood function given by Kohonen <ref type="bibr" target="#b4">[4]</ref> is:</p><formula xml:id="formula_0">hc(x),i = α(t) ……………………………………………………. (4)</formula><p>if ║ri − rc║ is smaller than a given radius around node "c" (the radius is also a monotonically decreasing function of the regression steps), but otherwise hc(x),i = 0. α(t) is a diminishing function of time: at the beginning of the learning procedure it is fairly large, but it is made to gradually shrink during learning, so that towards the end of learning only a single winning processing element is trained. A linearly diminishing function of time is usually used. The learning process consists of winner selection by Equation (1) and adaptation of the synaptic weights by Equation (2). This process is repeated for each input vector, usually for a large number of cycles, with different inputs producing different winners. The network therefore associates output nodes with groups or patterns in the input data set. The SOM algorithm is very simple and allows for many subtle adaptations.</p><p>There are some visual displays that are used to "determine" where the natural cluster boundaries are in the SOM. Some of the visual tools that can be used are Histograms <ref type="bibr" target="#b5">[6]</ref>, Component Plane displays [3], and U-matrix, P-matrix and U*-matrix displays [10], <ref type="bibr">[11]</ref>, <ref type="bibr">[12]</ref>, <ref type="bibr">[13]</ref>. An important concept in interpreting these displays is the interaction of two properties of the SOM: the neighborhood relationship and the density mapping. 
Neighboring neurons in the SOM cannot be too far away from each other (in order to maintain their similarity), but the SOM also tends to place more neurons in areas of high input density (for example, natural clusters). Because of this, some neurons will be placed in the areas between natural clusters, which are typically areas of low input density (so that the map can "stretch" between clusters).</p><p>The standard SOM algorithm uses numeric variables and the Euclidean distance function. The arithmetic operations used during the learning phase for the update of the feature vectors cannot be applied to categorical values, so the SOM was not directly designed to work with categorical variables due to this limitation of the learning laws. The method usually adopted is to translate categories to numeric values during data pre-processing and then train the standard SOM algorithm on the transformed data <ref type="bibr">[2]</ref>. The Kohonen SOM clustering algorithm has also been used for classification purposes with remarkable results. There is a fundamental difference between the clustering process and the classification process: clustering is an unsupervised process while classification is supervised. Usually data clustering is used as a pre-processor for classification purposes <ref type="bibr">[8]</ref>.</p><p>A rich variety of versions of the basic SOM algorithm have been proposed. Some of the variants aim at improving the preservation of topology by using more flexible map structures instead of the fixed grid; these methods, however, cannot be used for visualization as easily as the regular grid. Other variants aim at reducing the computational complexity of the SOM <ref type="bibr">[3]</ref>. 
Experiments using different distance measures, map topologies, and training parameters such as the learning rate and neighbourhood function can be carried out.</p><p>Using identical settings, training a SOM over different runs can lead to different mappings because of the random initialisation. Yet it has been shown that the conclusions drawn from the map remain remarkably consistent, which makes it a very useful tool in many different circumstances <ref type="bibr">[14]</ref>. Some of the desirable features that good SOM clustering software should have include:</p><p>1. Being able to set the neighborhood kernel function and the start value for the neighborhood function (learning radius): The neighborhood function determines how strongly the processing elements are connected to each other. Neighborhoods of different sizes in different neuron configurations (e.g. rectangular and hexagonal lattices) can be used. The simplest neighborhood function is the bubble (winner-takes-all): it is constant (or 1) over the whole neighborhood of the winner unit and zero elsewhere.</p><p>Usually the neighbourhood function is expressed as a Gaussian function and, as expected, using the winner-takes-all function retrieves fewer clusters than the Gaussian function.</p><p>2. Being able to set the activation function and weight initialization methods: Before the training, initial values are given to the prototype vectors of the SOM. The SOM is very robust with respect to the initialization process; however, when properly accomplished it allows the algorithm to converge faster to a good solution. 
Initialization procedures that have been used are: random initialization, where the weight vectors are initialized with small random values; sample initialization, where the weight vectors are initialized with random samples drawn from the input data set; and linear initialization, where the weight vectors are initialized in an orderly fashion along the linear subspace spanned by the two principal eigenvectors of the input data set.</p><p>3. Being able to set the cooling strategy used during training: for example linear or exponential.</p><p>4. Being able to set the distance measure to be used, for example Euclidean, Manhattan or maximum value: The distance measure between data points is an important component of a clustering algorithm. If the components of the data instance vectors are all in the same physical units then it is possible to use the simple Euclidean distance metric to successfully group similar data elements. The Euclidean distance in a two- or three-dimensional space is the actual geometric distance between objects in the space. However, even the Euclidean distance can sometimes be misleading, because of the way the distances between the single components of the data feature vectors are combined into the single distance measure used for clustering; different formulas lead to different clusterings. Therefore, domain knowledge must be used to guide the formulation of a suitable distance measure for each particular application.</p><p>5. Being able to set the scaling technique to be used: for example the z-transform, (0,1) transform, (1,-1) transform or none, depending on the clustering goal and data set.</p><p>6. Being able to set the starting and stopping learning rate:</p><p>The learning rate is a decreasing function of time taking values in [0,1]. 
The learning rate can be expressed as a linear function of time or as a function inversely proportional to time. Using the inverse function ensures that all input samples have approximately equal influence on the training result. Some learning rate functions that have been implemented are the linear, inverse-of-time, and power series functions.</p><p>7. Being able to set the training algorithm to be used: for example batch, on-line, hybrid etc. The batch algorithm has been shown to be faster <ref type="bibr" target="#b4">[4]</ref> than the normal sequential algorithm (and the results are just as good or even better).</p><p>8. Good data visualization options, for example histograms, Hinton charts, weight charts (maps), U-Matrix, P-Matrix etc., together with good result analysis and presentation functions: computation of vital statistics for evaluating the quality of the clustering, for example the mean, standard deviation (or variance), correlation coefficient, t-test etc.</p><p>This work presents a comparative study of the performance of some SOM clustering software when tested on the same data set. Results are presented, together with reasons for the observed variations. The study also presents the desirable features that standard SOM software should have. Clusters were generated using the three software packages, and the arithmetic mean of each cluster group was computed. The arithmetic mean is a measure of central tendency which describes the central location of the data; it is usually used with other statistical measures such as the standard deviation because it can be affected by extreme values in the data set and therefore be biased. The standard deviation describes the spread of the data and is a popular measure of dispersion: it measures the average distance between a single observation and the mean.</p></div>
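As a concrete illustration of Equations (1)–(3) and of the training parameters discussed above (learning rate, neighborhood radius, cooling strategy), the core SOM training loop can be sketched in a few lines of Python with NumPy. This is a minimal sketch assuming random initialization, a rectangular grid, linear cooling and the Euclidean distance; it is not the implementation used by any of the packages studied here.

```python
import numpy as np

def train_som(data, grid_shape=(10, 10), epochs=100,
              alpha0=0.9, sigma0=3.0, seed=0):
    """Minimal SOM training sketch: BMU search (Eq. 1), Gaussian
    neighborhood (Eq. 3) and weight update (Eq. 2), with a linearly
    cooled learning rate and neighborhood radius."""
    rng = np.random.default_rng(seed)
    rows, cols = grid_shape
    dim = data.shape[1]
    # Random initialization of the prototype (weight) vectors.
    weights = rng.random((rows * cols, dim))
    # Physical grid coordinates r_i of each unit, used by h_{c(x),i}.
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)],
                      dtype=float)
    n_steps = epochs * len(data)
    for t in range(n_steps):
        frac = t / n_steps
        alpha = alpha0 * (1.0 - frac)            # linear cooling of alpha(t)
        sigma = max(sigma0 * (1.0 - frac), 0.5)  # shrinking radius sigma(t)
        x = data[rng.integers(len(data))]        # random sample vector "x"
        # Eq. 1: the BMU c minimizes the Euclidean distance d(x, m_i).
        bmu = np.argmin(np.linalg.norm(weights - x, axis=1))
        # Eq. 3: Gaussian neighborhood around the BMU on the grid.
        grid_d2 = np.sum((coords - coords[bmu]) ** 2, axis=1)
        h = alpha * np.exp(-grid_d2 / (2.0 * sigma ** 2))
        # Eq. 2: pull every unit toward x, weighted by h.
        weights += h[:, None] * (x - weights)
    return weights
```

Swapping the Gaussian `h` for a bubble kernel, or the linear cooling for an inverse-of-time schedule, corresponds to the parameter choices listed in features 1, 3 and 6 above.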
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">MATERIALS AND METHODS</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">RESULTS AND DISCUSSION</head><p>The meteorological data was clustered using the NNClust SOM clustering software with a starting learning rate of 0.9 and was trained over 100 epochs. The software accepts only numeric values; non-numeric values are treated as missing values, which are replaced by the column mean. The software was set to identify a maximum of ten clusters; however, only eight clusters were generated. The software uses the number of clusters specified to create the SOM grid. The mean and standard deviation of the eight clusters were computed.</p><p>Increasing the training cycles did not improve the results. Table <ref type="table" target="#tab_2">1</ref> presents the summary of the eight clusters, while figure <ref type="figure" target="#fig_1">2</ref> presents the chart of the cluster means. The meteorological data was also trained using the Pittnet software with a starting learning rate of 0.9 and was set to train for 100 epochs, although the software stops training as soon as the maximum number of clusters has been generated. The software requires the user to specify the expected number of clusters a priori; this number is used in conjunction with the number of input signals (attributes) to determine the SOM grid size. The expected number of clusters was set to ten, but the software identified only four clusters. The mean and standard deviation of the clusters were computed. Table <ref type="table" target="#tab_3">2</ref> presents the summary of the clusters, while figure <ref type="figure">3</ref> presents the chart of the cluster means.</p><p>The RapidMiner Studio software was used to cluster the meteorological data set using a starting learning rate of 0.9 and was trained over 100 epochs. The expected number of clusters was set at ten and the software generated ten clusters. 
Table <ref type="table" target="#tab_4">3</ref> presents the summary of the cluster means with their standard deviations while figure <ref type="figure">4</ref> presents a chart of their cluster means.</p></div>
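The per-cluster means and standard deviations reported in Tables 1–3 can be reproduced from any of the tools' outputs in the same way: group the records by their assigned cluster and summarize each attribute. A small sketch (the `assignments` array is hypothetical; in practice it would hold the BMU label of each record):

```python
import numpy as np

def cluster_summary(data, assignments):
    """Mean and sample standard deviation of each attribute,
    per cluster, as reported in the result tables."""
    summary = {}
    for label in np.unique(assignments):
        members = data[assignments == label]
        summary[label] = (members.mean(axis=0),
                          members.std(axis=0, ddof=1))  # sample SD (n - 1)
    return summary
```

The `ddof=1` choice matches the sample standard deviation used throughout the tables; omitting it would give the population formula instead.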
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Discussion of Results</head><p>The quality of the clusters identified in the data by the three software packages can be inferred from a comparison of the mean and standard deviation of the clusters. If the value of the standard deviation is low, then the clustered records are within the same range; if the value is high, this suggests the presence of outliers in the clustered data records. For example, table <ref type="table" target="#tab_5">4</ref> presents the clustered records for cluster 2 (table 1) of the NNClust software, which is representative of the trend observed in the clusters identified by the software. Interpretation of the cluster is indecisive when the values in the Total Rainfall field are considered: the field has a mean of 142.05 and a standard deviation of 136.011711. The same trend is observed in the clusters identified by the Pittnet software in table 2. Table <ref type="table" target="#tab_6">5</ref> presents the records for cluster 4 (table <ref type="table" target="#tab_3">2</ref>) of the Pittnet software cluster results. It can be observed that the cluster consists of data records which all have the same value for the FireDangerIndex attribute. However, the Total Rainfall field has a mean value of 39.74444 and a standard deviation of 43.34732; this high standard deviation implies that there are outlier data values in the clustered records.</p><p>The clusters identified by the RapidMiner software presented in table <ref type="table" target="#tab_4">3</ref> were easier to interpret. They followed the expected rainfall pattern which is known for the region where the data was collected <ref type="bibr" target="#b0">[5]</ref>. Cluster 2 (  </p></div>
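The outlier effect described above is easy to verify by hand: recomputing the statistics from the six Total Rainfall values listed in table 4 reproduces the reported cluster-2 mean of 142.05 and a sample standard deviation of about 136.01, dominated by the single 357.1 mm record.

```python
import math

# Total Rainfall values of the six cluster-2 records in table 4.
rainfall = [60, 357.1, 10, 57, 108.9, 259.3]

mean = sum(rainfall) / len(rainfall)
# Sample standard deviation (n - 1 in the denominator).
variance = sum((x - mean) ** 2 for x in rainfall) / (len(rainfall) - 1)
sd = math.sqrt(variance)

print(round(mean, 2), round(sd, 2))  # → 142.05 136.01
```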
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">CONCLUSION</head><p>Some of the problems found in the literature about clustering algorithms are the following. Most clustering techniques are based on distance calculations, which are very sensitive to the ranges of the variables, so the values have to be normalized; normalization, however, is a subjective function, and these transformations cannot be carried out without creating biases.</p><p>The presence of outliers in data sets creates problems for data clustering based on distance calculations when they have not been identified and removed from the data set. Handling categorical variables (non-numeric data, nominal data, or nominal variables) is a problem for most clustering algorithms, and even when data encoding methods are used they can introduce extra biases due to the number of values which the encoding introduces into the categorical variables. The selection of variables also has a large influence on clustering results; while assigning different weights to variables and categorical values can be used, when many variables and categorical values are involved this can affect the clustering quality. Capturing the patterns (or behaviors) hidden inside time-varying variables and modeling them is another problem, and most clustering techniques do not possess this predictive modeling capability. Finally, most clustering techniques were developed for laboratory-generated simple data sets consisting of a few to several numerical variables, hence they can't be used for large data analyses that involve many complex categorical data.</p><p>Most common implementations of data clustering algorithms suffer from these problems; SOMs, however, are very robust and are adept at handling them, although this also depends on the goal of the algorithm's implementation (programming).</p><p>Applications programmed for demonstration purposes cannot be used for large scale projects and some 
implementations are not flexible and do not give users many options. However, if the various implementations of the conventional SOM algorithm (which are usually focused on the goals of the programmer) provide enough options to the user, it is still a very robust algorithm that can be used for numerical, categorical and mixed data sets. Further work in this study is focused on the development of an open, flexible SOM clustering tool with adequate features that can be used for research purposes. </p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Illustration of the updating of the Best Matching Unit (BMU) of a SOM grid and its neighbors</figDesc><graphic coords="1,317.85,391.90,249.59,210.73" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Chart of NNClust cluster means</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :Figure 4 :</head><label>34</label><figDesc>Figure 3: Chart of Pittnet software cluster means</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head></head><label></label><figDesc>Agro-meteorological data for FRIN headquarters, Ibadan, Nigeria was used. The data set had 254 records and the attributes in the data set were: Year (numeric), Month (text), Total Rainfall in millimeters (numeric), Minimum Temperature in Celsius (numeric), Maximum Temperature in Celsius (numeric), Relative Humidity and Fire Danger Index (numeric). The SOM software used were: NNClust, Pittnet Neural Network Educational Software and RapidMiner Studio. The NNClust software was programmed to use only the Gaussian neighbourhood function and the Euclidean distance measure. The user can input the learning rate and starting neighbourhood size. The software automatically normalizes the input data between -1 and 1 and has features for generating data/result statistics and data visualization such as weight maps and radar charts. The Pittnet software also uses the Gaussian neighbourhood function and the Euclidean distance metric. The user also defines the starting learning rate, and the software automatically normalizes the data between 0 and 1. It is a DOS-based program that saves its results in a text file and has no data analysis or data visualization ability. RapidMiner Studio (Community Edition) has facilities for setting the learning rate and neighbourhood radius, and the user can choose whether or not to normalize the data. It also has an array of tools for statistical data analysis and data visualization.</figDesc><table /></figure>
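The two fixed normalization ranges noted above (NNClust's −1 to 1 and Pittnet's 0 to 1) are both min-max rescalings of each numeric column. A sketch of how such a rescaling might be implemented (the function name and the constant-column convention are illustrative, not taken from either tool):

```python
def minmax_scale(values, lo=0.0, hi=1.0):
    """Min-max rescaling of a numeric column into [lo, hi]:
    lo=-1, hi=1 mimics NNClust-style normalization and the
    default lo=0, hi=1 mimics Pittnet-style normalization."""
    vmin, vmax = min(values), max(values)
    span = vmax - vmin
    if span == 0:  # constant column: map every value to the midpoint
        return [(lo + hi) / 2.0 for _ in values]
    return [lo + (v - vmin) * (hi - lo) / span for v in values]
```

For example, `minmax_scale([10, 20, 30], -1, 1)` yields `[-1.0, 0.0, 1.0]`. Because the two tools rescale to different ranges, distances between records (and hence the resulting clusters) are not directly comparable across them, which is one plausible source of the variation discussed in the results.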
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head></head><label></label><figDesc>table 3) contained records with only a high FireDangerIndex of 4 as presented in table 6, while cluster 5 (table 3) contains records with the highest recorded Rainfall level in the data set. The other clusters also contained data records which can be categorized by the Rainfall level pattern of the region.</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 1 : Summary of NNClust clusters</head><label>1</label><figDesc></figDesc><table><row><cell></cell><cell></cell><cell>TotalRainfall</cell><cell>MaxTemp</cell><cell>MinTemp</cell><cell>RH</cell><cell>FireDangerIndex</cell></row><row><cell>Cluster 1</cell><cell>Mean</cell><cell>3.7</cell><cell>32</cell><cell>24</cell><cell>83</cell><cell>2</cell></row><row><cell></cell><cell>SD</cell><cell>0</cell><cell>0</cell><cell>0</cell><cell>0</cell><cell>0</cell></row><row><cell>Cluster 2</cell><cell>Mean</cell><cell>142.05</cell><cell>33.5</cell><cell>24.5</cell><cell>79.33333</cell><cell>2.666666667</cell></row><row><cell></cell><cell>SD</cell><cell>2.61629509</cell><cell>22.627417</cell><cell>16.9706</cell><cell>4.501851</cell><cell>0.516397779</cell></row><row><cell>Cluster 3</cell><cell>Mean</cell><cell>113.313158</cell><cell>31.1236842</cell><cell>31.0605</cell><cell>70.54737</cell><cell>2.5</cell></row><row><cell></cell><cell>SD</cell><cell>69.9895185</cell><cell>15.4557389</cell><cell>11.4404</cell><cell>45.62364</cell><cell>1.246560403</cell></row><row><cell>Cluster 4</cell><cell>Mean</cell><cell>149.99</cell><cell>30.8333333</cell><cell>30.2967</cell><cell>73.75333</cell><cell>2.333333333</cell></row><row><cell></cell><cell>SD</cell><cell>98.1425436</cell><cell>3.53058883</cell><cell>20.0499</cell><cell>25.41582</cell><cell>0.546672274</cell></row><row><cell>Cluster 5</cell><cell>Mean</cell><cell>109.891667</cell><cell>30.6333333</cell><cell>36.1667</cell><cell>64.64444</cell><cell>2.638888889</cell></row><row><cell></cell><cell>SD</cell><cell>92.1210985</cell><cell>4.02073199</cell><cell>24.3938</cell><cell>34.37646</cell><cell>0.723198364</cell></row><row><cell>Cluster 
6</cell><cell>Mean</cell><cell>141.621277</cell><cell>31.7574468</cell><cell>27.0617</cell><cell>73.1617</cell><cell>2.617021277</cell></row><row><cell></cell><cell>SD</cell><cell>97.0359995</cell><cell>2.63056819</cell><cell>13.7078</cell><cell>20.8623</cell><cell>0.644481304</cell></row><row><cell>Cluster 7</cell><cell>Mean</cell><cell>123.545794</cell><cell>31.4411215</cell><cell>29.4963</cell><cell>74.41028</cell><cell>2.411214953</cell></row><row><cell></cell><cell>SD</cell><cell>81.8137003</cell><cell>2.96536463</cell><cell>18.4077</cell><cell>24.4239</cell><cell>0.531165877</cell></row><row><cell>Cluster 8</cell><cell>Mean</cell><cell>175.268966</cell><cell>29.3793103</cell><cell>23.069</cell><cell>86.89655</cell><cell>2.068965517</cell></row><row><cell></cell><cell>SD</cell><cell>85.4901878</cell><cell>1.49794605</cell><cell>1.06674</cell><cell>4.312315</cell><cell>0.257880715</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 2 : Summary of the Pittnet software clusters</head><label>2</label><figDesc></figDesc><table><row><cell></cell><cell></cell><cell>TotalRainfall</cell><cell>MaxTemp</cell><cell>MinTemp</cell><cell>RH</cell><cell>FireDangerIndex</cell></row><row><cell>Cluster 1</cell><cell>Mean</cell><cell>50.850001</cell><cell>24.75</cell><cell>63.5</cell><cell>3.9</cell><cell>4</cell></row><row><cell></cell><cell>SD</cell><cell>31.32483</cell><cell>0.070709</cell><cell>12.0208153</cell><cell>0.141421356</cell><cell>0</cell></row><row><cell>Cluster 2</cell><cell>Mean</cell><cell>134.3332</cell><cell>31.7082</cell><cell>23.5984375</cell><cell>82.4218728</cell><cell>2.3828125</cell></row><row><cell></cell><cell>SD</cell><cell>91.137324</cell><cell>2.254123</cell><cell>1.06439596</cell><cell>6.908488013</cell><cell>0.487025284</cell></row><row><cell>Cluster 3</cell><cell>Mean</cell><cell>138.05185</cell><cell>24.64815</cell><cell>84.4074074</cell><cell>2.196296296</cell><cell>2.407407407</cell></row><row><cell></cell><cell>SD</cell><cell>45.668999</cell><cell>15.90804</cell><cell>27.2370968</cell><cell>39.48311832</cell><cell>1.836329785</cell></row><row><cell>Cluster 4</cell><cell>Mean</cell><cell>39.744444</cell><cell>35.55556</cell><cell>23.5555556</cell><cell>59.22222133</cell><cell>4</cell></row><row><cell></cell><cell>SD</cell><cell>43.347321</cell><cell>1.333333</cell><cell>1.74005108</cell><cell>7.120003363</cell><cell>0</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 3 : Summary of RapidMiner Studio clusters</head><label>3</label><figDesc></figDesc><table><row><cell></cell><cell></cell><cell>TotalRainfall</cell><cell>MaxTemp</cell><cell>MinTemp</cell><cell>RH</cell><cell>FireDangerIndex</cell></row><row><cell>cluster 0</cell><cell>Mean</cell><cell>42.35385</cell><cell>33.41154</cell><cell>23.99615</cell><cell>78.46153846</cell><cell>2.730769231</cell></row><row><cell></cell><cell>SD</cell><cell>8.192056</cell><cell>2.308823</cell><cell>0.911913</cell><cell>7.798619207</cell><cell>0.603833905</cell></row><row><cell>cluster 1</cell><cell>Mean</cell><cell>13.50513</cell><cell>33.47179</cell><cell>23.80769</cell><cell>77.43589744</cell><cell>2.820512821</cell></row><row><cell></cell><cell>SD</cell><cell>9.379343</cell><cell>2.342845</cell><cell>1.280909</cell><cell>6.302860135</cell><cell>0.451418517</cell></row><row><cell>cluster 2</cell><cell>Mean</cell><cell>7.64</cell><cell>35.36</cell><cell>23.42</cell><cell>55.2</cell><cell>3.8</cell></row><row><cell></cell><cell>SD</cell><cell>16.15873</cell><cell>17.96476</cell><cell>13.16786</cell><cell>40.93966268</cell><cell>1.299899072</cell></row><row><cell>cluster 3</cell><cell>Mean</cell><cell>57.94667</cell><cell>25.35333</cell><cell>78.13333</cell><cell>2.726666667</cell><cell>2.933333333</cell></row><row><cell></cell><cell>SD</cell><cell>13.23034</cell><cell>15.63488</cell><cell>11.11308</cell><cell>32.15964741</cell><cell>1.361648053</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 4 : Sample NNClust software cluster result</head><label>4</label><figDesc></figDesc><table><row><cell>Year</cell><cell>Months</cell><cell cols="2">TotalRainfall MaxTemp</cell><cell>MinTemp</cell><cell>RH</cell><cell>FireDangerIndex</cell></row><row><cell>1980</cell><cell>Feb.</cell><cell>60</cell><cell>35</cell><cell>27</cell><cell>75</cell><cell>3</cell></row><row><cell>1987</cell><cell>Aug.</cell><cell>357.1</cell><cell>30</cell><cell>23</cell><cell>86</cell><cell>2</cell></row><row><cell>1987</cell><cell>Nov.</cell><cell>10</cell><cell>35</cell><cell>24</cell><cell>80</cell><cell>3</cell></row><row><cell>1989</cell><cell>Mar.</cell><cell>57</cell><cell>35</cell><cell>25</cell><cell>77</cell><cell>3</cell></row><row><cell>1991</cell><cell>Apr.</cell><cell>108.9</cell><cell>32</cell><cell>24</cell><cell>83</cell><cell>2</cell></row><row><cell>1998</cell><cell>Sept.</cell><cell>259.3</cell><cell>34</cell><cell>24</cell><cell>75</cell><cell>3</cell></row><row><cell>Mean</cell><cell></cell><cell>142.05</cell><cell>33.5</cell><cell>24.5</cell><cell>79.33333</cell><cell>2.666667</cell></row><row><cell>SD</cell><cell></cell><cell>136.0117</cell><cell>2.073644</cell><cell>1.378405</cell><cell>4.501851</cell><cell>0.516398</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_6"><head>Table 5 : Sample Pittnet software cluster result</head><label>5</label><figDesc></figDesc><table><row><cell>Year</cell><cell>Months</cell><cell cols="2">TotalRainfall MaxTemp</cell><cell>MinTemp</cell><cell>RH</cell><cell>FireDangerIndex</cell></row><row><cell>1989</cell><cell>Feb.</cell><cell>18.4</cell><cell>35</cell><cell>22</cell><cell>51</cell><cell>4</cell></row><row><cell>1990</cell><cell>Feb.</cell><cell>40.3</cell><cell>35</cell><cell>23</cell><cell>64</cell><cell>4</cell></row><row><cell>1990</cell><cell>Mar.</cell><cell>11.7</cell><cell>37</cell><cell>25</cell><cell>69</cell><cell>4</cell></row><row><cell>1994</cell><cell>Jan.</cell><cell>1.3</cell><cell>33</cell><cell>20</cell><cell>45</cell><cell>4</cell></row><row><cell>1997</cell><cell>Mar.</cell><cell>122.2</cell><cell>35</cell><cell>23</cell><cell>62</cell><cell>4</cell></row><row><cell>1998</cell><cell>Feb.</cell><cell>2</cell><cell>36</cell><cell>25</cell><cell>60</cell><cell>4</cell></row><row><cell>2000</cell><cell>Mar.</cell><cell>48.8</cell><cell>37</cell><cell>25</cell><cell>62</cell><cell>4</cell></row><row><cell>2001</cell><cell>Mar.</cell><cell>15</cell><cell>37</cell><cell>25</cell><cell>60</cell><cell>4</cell></row><row><cell>2001</cell><cell>Apr.</cell><cell>98</cell><cell>35</cell><cell>24</cell><cell>60</cell><cell>4</cell></row><row><cell>Mean</cell><cell></cell><cell>39.74444</cell><cell>35.55556</cell><cell>23.55556</cell><cell>59.22222</cell><cell>4</cell></row><row><cell>SD</cell><cell></cell><cell>43.34732</cell><cell>1.333333</cell><cell>1.740051</cell><cell>7.120003</cell><cell>0</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_7"><head>Table 6 : Sample RapidMiner software cluster result</head><label>6</label><figDesc></figDesc><table><row><cell>Year</cell><cell>Months</cell><cell>TotalRainfall</cell><cell>MaxTemp</cell><cell>MinTemp</cell><cell>RH</cell><cell>FireDangerIndex</cell></row><row><cell>1989</cell><cell>Feb.</cell><cell>18.4</cell><cell>35</cell><cell>22</cell><cell>51</cell><cell>4</cell></row><row><cell>1994</cell><cell>Jan.</cell><cell>1.3</cell><cell>33</cell><cell>20</cell><cell>45</cell><cell>4</cell></row><row><cell>1998</cell><cell>Feb.</cell><cell>2</cell><cell>36</cell><cell>25</cell><cell>60</cell><cell>4</cell></row><row><cell>2001</cell><cell>Mar.</cell><cell>15</cell><cell>37</cell><cell>25</cell><cell>60</cell><cell>4</cell></row><row><cell>2004</cell><cell>Mar.</cell><cell>1.5</cell><cell>35.8</cell><cell>25.1</cell><cell>60</cell><cell>3</cell></row><row><cell>Mean</cell><cell></cell><cell>7.64</cell><cell>35.36</cell><cell>23.42</cell><cell>55.2</cell><cell>3.8</cell></row><row><cell>SD</cell><cell></cell><cell>8.361399</cell><cell>1.499333</cell><cell>2.319914</cell><cell>6.906519</cell><cell>0.447214</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_8"><head>Table 7 : Sample RapidMiner software cluster result</head><label>7</label><figDesc></figDesc><table><row><cell>Year</cell><cell>Months</cell><cell>TotalRainfall</cell><cell>MaxTemp</cell><cell>MinTemp</cell><cell>RH</cell><cell>FireDangerIndex</cell></row><row><cell>1979</cell><cell>Jul.</cell><cell>291.2</cell><cell>29</cell><cell>23</cell><cell>85</cell><cell>2</cell></row><row><cell>1979</cell><cell>Sept.</cell><cell>269</cell><cell>29</cell><cell>23</cell><cell>86</cell><cell>2</cell></row><row><cell>1979</cell><cell>Oct.</cell><cell>223.6</cell><cell>31</cell><cell>24</cell><cell>86</cell><cell>2</cell></row><row><cell>1979</cell><cell>Nov.</cell><cell>261.4</cell><cell>32</cell><cell>24</cell><cell>83</cell><cell>2</cell></row><row><cell>1980</cell><cell>Jun</cell><cell>306</cell><cell>31</cell><cell>23</cell><cell>82</cell><cell>2</cell></row><row><cell>1980</cell><cell>Aug.</cell><cell>427.4</cell><cell>28</cell><cell>23</cell><cell>88</cell><cell>2</cell></row><row><cell>1980</cell><cell>Sept.</cell><cell>333.5</cell><cell>29</cell><cell>23</cell><cell>90</cell><cell>2</cell></row><row><cell>1981</cell><cell>Sept.</cell><cell>233.9</cell><cell>30</cell><cell>23</cell><cell>86</cell><cell>2</cell></row><row><cell>1981</cell><cell>Oct.</cell><cell>225.1</cell><cell>31</cell><cell>24</cell><cell>83</cell><cell>2</cell></row><row><cell>1983</cell><cell>May</cell><cell>250.7</cell><cell>31</cell><cell>24</cell><cell>85</cell><cell>2</cell></row><row><cell>1984</cell><cell>May</cell><cell>223</cell><cell>32</cell><cell>23</cell><cell>86</cell><cell>2</cell></row><row><cell>1984</cell><cell>Jun</cell><cell>233.6</cell><cell>30</cell><cell>22</cell><cell>82</cell><cell>2</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0">CoRI'16, Sept 7-9, 2016, Ibadan, Nigeria.   </note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Categorical data visualization and clustering using subjective factors</title>
		<author>
			<persName><forename type="first">C</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Ding</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Data &amp; Knowledge Engineering</title>
		<imprint>
			<publisher>Elsevier</publisher>
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">An Extension of Self-Organizing Maps to Categorical Data</title>
		<author>
			<persName><forename type="first">N</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">C</forename><surname>Marques</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 12th Portuguese conference on progress in Artificial Intelligence</title>
				<meeting>the 12th Portuguese conference on progress in Artificial Intelligence<address><addrLine>Berlin; Heidelberg</addrLine></address></meeting>
		<imprint>
			<publisher>Springer-Verlag</publisher>
			<date type="published" when="2005">2005</date>
			<biblScope unit="page" from="304" to="313" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Data exploration using self-organizing maps</title>
		<author>
			<persName><forename type="first">S</forename><surname>Kaski</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Acta Polytechnica Scandinavica, Mathematics, Computing and Management in Engineering Series</title>
		<imprint>
			<biblScope unit="volume">82</biblScope>
			<date type="published" when="1997">1997</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">The Self-Organizing Map (SOM)</title>
		<author>
			<persName><forename type="first">T</forename><surname>Kohonen</surname></persName>
		</author>
		<ptr target="http://www.cis.hut.fi/research/reports/quinquennial/" />
		<imprint>
			<date type="published" when="1999">1999 (covering 1994-1998; accessed January 2006)</date>
		</imprint>
		<respStmt>
			<orgName>Helsinki University of Technology, Laboratory of Computer and Information Science, Neural Networks Research Centre, Quinquennial Report</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Using Smoothed Data Histograms for Cluster Visualization in Self Organizing Maps</title>
		<author>
			<persName><forename type="first">E</forename><surname>Pampalk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rauber</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Merkl</surname></persName>
		</author>
		<idno>OeFAI-TR-2002-29</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the International Conference on Artificial Neural Networks</title>
		<title level="s">Springer Lecture Notes in Computer Science</title>
		<meeting>the International Conference on Artificial Neural Networks<address><addrLine>Madrid, Spain</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
	<note type="report_type">Technical Report</note>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Identification of rainfall patterns over the Valley of Mexico</title>
		<author>
			<persName><forename type="first">I</forename><forename type="middle">J</forename><surname>Pelczer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">L</forename><surname>Cisneros</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">11th International Conference on Urban Drainage</title>
				<meeting><address><addrLine>Edinburgh, Scotland, UK</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">C</forename><surname>Principe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">R</forename><surname>Euliano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Lefebvre</surname></persName>
		</author>
		<title level="m">Neural and Adaptive Systems: Fundamentals Through Simulations</title>
				<imprint>
			<publisher>John Wiley and Sons Inc</publisher>
			<date type="published" when="2000">2000</date>
			<biblScope unit="page">656</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<ptr target="http://www.statsoftinc.com/txtbook/glosd.html#DataMining" />
		<title level="m">Statsoft Electronic Statistics Textbook</title>
				<imprint>
			<publisher>StatSoft, Inc.</publisher>
			<date type="published" when="2002">2002 (accessed June 2002)</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Data Mining and Knowledge Discovery with Emergent Self-Organizing Feature Maps for Multivariate Time Series</title>
		<author>
			<persName><forename type="first">A</forename><surname>Ultsch</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Kohonen Maps</title>
				<imprint>
			<date type="published" when="1999">1999</date>
			<biblScope unit="page" from="33" to="46" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Maps for the Visualization of high-dimensional Data Spaces</title>
		<author>
			<persName><forename type="first">A</forename><surname>Ultsch</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. Workshop on Self Organizing Maps</title>
				<meeting>Workshop on Self Organizing Maps<address><addrLine>Kyushu, Japan</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2003">2003a</date>
			<biblScope unit="page" from="225" to="230" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<title level="m" type="main">U*-Matrix: a Tool to visualize Clusters in high dimensional Data</title>
		<author>
			<persName><forename type="first">A</forename><surname>Ultsch</surname></persName>
		</author>
		<idno>No. 36</idno>
		<imprint>
			<date type="published" when="2003">2003b</date>
		</imprint>
		<respStmt>
			<orgName>Computer Science Department, University of Marburg, Germany</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Technical Report</note>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<title level="m" type="main">ESOM-Maps: tools for clustering, visualization, and classification with Emergent SOM</title>
		<author>
			<persName><forename type="first">A</forename><surname>Ultsch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Moerchen</surname></persName>
		</author>
		<idno>No. 46</idno>
		<imprint>
			<date type="published" when="2005">2005</date>
		</imprint>
		<respStmt>
			<orgName>Dept. of Mathematics and Computer Science, University of Marburg, Germany</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Technical Report</note>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Self-and Superorganizing Maps in R: The kohonen Package</title>
		<author>
			<persName><forename type="first">R</forename><surname>Wehrens</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">M C</forename><surname>Buydens</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Statistical Software</title>
		<imprint>
			<publisher>American Statistical Association</publisher>
			<biblScope unit="volume">21</biblScope>
			<biblScope unit="issue">5</biblScope>
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<title level="m" type="main">Clustering Mixed Categorical and Numeric Data</title>
		<author>
			<persName><forename type="first">Zengyou</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Xiaofei</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Shengchun</forename><surname>Deng</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2003">2003</date>
			<biblScope unit="page" from="6" to="29" />
			<pubPlace>Harbin; P. R. China</pubPlace>
		</imprint>
		<respStmt>
			<orgName>Department of Computer Science and Engineering, Harbin Institute of Technology</orgName>
		</respStmt>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
