=Paper=
{{Paper
|id=Vol-1755/160-168
|storemode=property
|title=A Study of SOM Clustering Software Implementations
|pdfUrl=https://ceur-ws.org/Vol-1755/160-168.pdf
|volume=Vol-1755
|authors=A. B. Adeyemo
|dblpUrl=https://dblp.org/rec/conf/cori/Adeyemo16
}}
==A Study of SOM Clustering Software Implementations==
A. B. Adeyemo
Computer Science Department
University of Ibadan
Nigeria
+2348052107367
sesanadeyemo@gmail.com
ABSTRACT

Clustering algorithms generally suffer from some well-known problems which the Self-Organizing Map (SOM) algorithms are adept at handling. While there are many variants of the SOM algorithm, software programs that implement the SOM algorithms have tended to give varying results even when tested on the same data sets. This can have serious implications when the goal of the clustering is novelty detection. In this study a comparison of the performance of some SOM clustering software was carried out and the results are presented.

CCS Concepts

• General and reference ➝ Cross-computing tools and techniques ➝ Empirical studies

Keywords

Comparative Analysis; Clustering; Self Organizing Maps.

1. INTRODUCTION
In the clustering process data is grouped in such a way that the
intra-cluster similarity is maximized while the inter-cluster
similarity is minimized. Data can be described by either
categorical or numeric features. Due to the differences in the
characteristics of these two kinds of data, attempts to develop
criteria functions for mixed data have not been very successful
[15]. There are two widely used clustering methods: the
hierarchical and the nonhierarchical (partitional) methods. The
hierarchical clustering process can be categorized as divisive, when a large data set is divided into several smaller groups, or agglomerative, when small groups are merged to create larger clusters. Self-Organizing Maps (SOM) are competitive
networks that provide a "topological" mapping from the input
space to the clusters [4]. The SOM was inspired by the way in
which various human sensory impressions are neurologically
mapped into the brain such that spatial or other relations
among stimuli correspond to spatial relations among the
neurons.
In a SOM, the neurons (clusters) are organized into a grid which is usually two-dimensional, but sometimes one-dimensional or, rarely, three- or more-dimensional. The reason for using one- and two-dimensional grids is that structures of higher dimensionality cause problems with data display and cannot be shown on a monitor. The SOM working algorithm is a variant of multidimensional vector clustering, of which the K-means clustering algorithm is an example [9].

The SOM neural network uses a competitive learning algorithm and is a method for unsupervised learning, based on a grid of artificial neurons whose weights are adapted to match input vectors in a training set. The SOM algorithm is fed with feature vectors, which can be of any dimension. The algorithm for training the SOM [4] is easily explained in terms of a set of artificial neurons, each having its own physical location on the output map, which take part in a winner-take-all process: the node whose weight vector is closest to the input vector is declared the winner, and its weights are adjusted to bring them closer to that input vector.

Figure 1: Illustration of the updating of the Best Matching Unit (BMU) of a SOM grid and its neighbors

In each training step, one sample vector x from the input data set is chosen randomly and a similarity measure is calculated between it and all the weight vectors of the map. The Best-Matching Unit (BMU), denoted c, is the unit whose weight vector has the greatest similarity with the input sample x (figure 1). The similarity is usually defined by means of a distance measure, usually the Euclidean distance.
The BMU is defined mathematically as the processing element for which

$d(x, m_c) = \min_i \, d(x, m_i)$ ..................... (1)

where $d$ is the distance measure.

Each node has a set of neighbors. When a node wins a competition, the neighbors' weights are also changed, but not as much as that of the winning node. The further a neighbor is from the winner, the smaller its weight change. The SOM update rule for the weight vector of unit $i$ is given mathematically as

$m_i(t+1) = m_i(t) + h_{c(x),i}(t) \, [x(t) - m_i(t)]$ ..................... (2)

where $t$ represents the sample index for each presentation of a sample $x$, and $h_{c(x),i}$ represents the neighborhood function around the winner unit $c$, with neighborhood radius $r(t)$.

The neighborhood function is like a smoothing kernel that is time-variable. It is a decreasing function of the distance between the $i$th and $c$th reference vectors on the map grid. The neighborhood function is usually expressed as the Gaussian function, which can be written mathematically as

$h_{c(x),i}(t) = \alpha(t) \, \exp\!\left( -\frac{\lVert r_i - r_c \rVert^2}{2\sigma^2(t)} \right)$ ..................... (3)

where $\alpha(t)$ represents the learning rate factor, which takes values $0 < \alpha(t) < 1$, and $\sigma(t)$ represents the width of the neighborhood function, which decreases monotonically with the regression steps.

A simpler definition of the neighbourhood function given by Kohonen [4] is

$h_{c(x),i} = \sigma(t)$ ..................... (4)

if $\lVert r_i - r_c \rVert$ is smaller than a given radius around node $c$ (the radius also being a monotonically decreasing function of the regression steps), and $h_{c(x),i} = 0$ otherwise. Here $\sigma(t)$ is a diminishing function of time: at the beginning of the learning procedure it is fairly large, but it is made to shrink gradually during learning, so that towards the end of learning only a single winning processing element is trained. A linearly diminishing function of time is usually used. The learning process consists of winner selection by Equation (1) and adaptation of the synaptic weights by Equation (2). This process is repeated for each input vector, usually for a large number of cycles, with different inputs producing different winners. The network therefore associates output nodes with groups or patterns in the input data set. The SOM algorithm is very simple and allows for many subtle adaptations.
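To make Equations (1)-(3) concrete, the following short sketch implements a single SOM training step in Python with NumPy. It is a minimal illustration of the algorithm described above, not of any of the software packages studied later; the grid size, learning rate and neighborhood width are arbitrary example values.

```python
import numpy as np

def som_training_step(weights, positions, x, alpha, sigma):
    """One SOM update: BMU search (Eq. 1) and weight adaptation (Eq. 2)
    with a Gaussian neighborhood (Eq. 3).

    weights   -- (n_units, n_features) array of map weight vectors m_i
    positions -- (n_units, 2) array of grid coordinates r_i
    x         -- (n_features,) input sample
    alpha     -- learning rate factor, 0 < alpha < 1
    sigma     -- current neighborhood width sigma(t)
    """
    # Eq. 1: the BMU c is the unit whose weight vector is closest to x
    # under the Euclidean distance.
    c = np.argmin(np.linalg.norm(weights - x, axis=1))
    # Eq. 3: Gaussian neighborhood value of every unit, based on its
    # distance to the winner on the map grid (not in feature space).
    grid_dist_sq = np.sum((positions - positions[c]) ** 2, axis=1)
    h = alpha * np.exp(-grid_dist_sq / (2.0 * sigma ** 2))
    # Eq. 2: move every unit towards x, weighted by its neighborhood value.
    weights += h[:, np.newaxis] * (x - weights)
    return weights

# Example: one step on a 10x10 map with 5-dimensional inputs.
rng = np.random.default_rng(0)
positions = np.array([(i, j) for i in range(10) for j in range(10)], dtype=float)
weights = rng.random((100, 5))
weights = som_training_step(weights, positions, rng.random(5), alpha=0.9, sigma=3.0)
```

In a full training run this step is repeated for many cycles while alpha and sigma are gradually decreased, as described above.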
2. Being able to set the activation function and weight
There are some visual displays that are used to "determine" where initialization methods: Before the training, initial values
the natural cluster boundaries are in the SOM. Some of the visual are given to the prototype vectors of the SOM. The
tools that can be used are Histograms [6], Component Plane SOM is very robust with respect to the initialization
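As an illustration of how one of these displays is obtained, the sketch below computes a simple U-matrix for a rectangular SOM grid: each cell holds the average distance between a unit's weight vector and those of its immediate grid neighbors, so high values mark likely cluster boundaries. This is a generic sketch assuming a rows-by-cols rectangular lattice, not code from any of the packages studied.

```python
import numpy as np

def u_matrix(weights, rows, cols):
    """Average distance from each unit's weight vector to its 4-connected
    grid neighbors; high values suggest natural cluster boundaries."""
    w = weights.reshape(rows, cols, -1)
    u = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            # Distances to the existing up/down/left/right neighbors.
            dists = [np.linalg.norm(w[i, j] - w[ni, nj])
                     for ni, nj in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                     if 0 <= ni < rows and 0 <= nj < cols]
            u[i, j] = np.mean(dists)
    return u
```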
The standard SOM algorithm uses numeric-type variables and the Euclidean distance function. The arithmetic operations used during the learning phase for the update of the feature vectors cannot be applied to categorical values, so the SOM was not directly designed to work with categorical variables, due to this limitation of its learning laws. The method usually adopted is to translate categories to numeric values during data pre-processing and then to train on the transformed data using the standard SOM algorithm [2]. The Kohonen SOM clustering algorithm has also been used for classification purposes, with remarkable results. There is a fundamental difference between the clustering process and the classification process: clustering is an unsupervised process while classification is supervised. Usually data clustering is used as a pre-processor for classification purposes [8].
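A minimal sketch of the pre-processing step described above, translating a categorical attribute into numeric columns via one-hot encoding so that the Euclidean arithmetic of the standard SOM applies; the column names and values here are illustrative assumptions only.

```python
import pandas as pd

# Hypothetical records with one categorical and one numeric attribute.
df = pd.DataFrame({"Month": ["Jan.", "Feb.", "Jan."],
                   "TotalRainfall": [1.3, 18.4, 2.0]})

# One-hot encode the categorical column so that every attribute is
# numeric before training the standard SOM on the transformed data.
encoded = pd.get_dummies(df, columns=["Month"], dtype=float)
print(encoded)
```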
A rich variety of versions of the basic SOM algorithm have been proposed. Some of the variants aim at improving the preservation of topology by using more flexible map structures instead of the fixed grid; some of these methods, however, cannot be used for visualization as easily as the regular grid. Other variants aim at reducing the computational complexity of the SOM [3]. Experiments using different distance measures, map topologies, and training parameters such as the learning rate and neighbourhood function can be carried out.

Using identical settings, training a SOM in different runs can lead to different mappings because of the random initialisation. Yet it has been shown that the conclusions drawn from the map remain remarkably consistent, which makes it a very useful tool in many different circumstances [14]. Some of the desirable features that good SOM clustering software should have include the following (a sketch of some of these options follows the list):

1. Being able to set the neighborhood kernel function and the start value for the neighborhood function (learning radius): the neighborhood function determines how strongly the processing elements are connected to each other. Neighborhoods of different sizes in different neuron configurations (e.g. rectangular and hexagonal lattices) can be used. The simplest neighborhood function is the bubble (winner-takes-all) function: it is constant (e.g. 1) over the whole neighborhood of the winner unit and zero elsewhere. Usually the neighbourhood function is expressed as a Gaussian function and, as expected, using the winner-takes-all function retrieves fewer clusters than the Gaussian function.

2. Being able to set the activation function and weight initialization methods: before training, initial values are given to the prototype vectors of the SOM. The SOM is very robust with respect to the initialization process; however, when properly accomplished, initialization allows the algorithm to converge faster to a good solution. Initialization procedures that have been used are: random initialization, where the weight vectors are initialized with small random values; sample initialization, where the weight vectors are initialized with random samples drawn from the input data set; and linear initialization, where the weight vectors are initialized in an orderly fashion along the linear subspace spanned by the two principal eigenvectors of the input data set.
3. Being able to set the choice of cooling strategy during training: for example, linear or exponential.

4. Being able to set the distance measure to be used, for example Euclidean, Manhattan or maximum value: the distance measure between data points is an important component of a clustering algorithm. If the components of the data instance vectors are all in the same physical units then it is possible to use the simple Euclidean distance metric to successfully group similar data elements. The Euclidean distance in a two- or three-dimensional space measures the actual geometric distance between objects in the space. However, it has been observed that even the Euclidean distance can sometimes be misleading, because of the way the mathematical formula combines the distances between the single components of the data feature vectors into a single distance measure for clustering purposes. Different formulas lead to different clusterings; therefore, domain knowledge must be used to guide the formulation of a suitable distance measure for each particular application.

5. Being able to set the scaling technique to be used: for example, z-transform, (0,1) transform, (1,-1) transform or none, depending on the clustering goal and the data set.

6. Being able to set the starting and stopping learning rate: the learning rate is a decreasing function of time taking values in [0,1]. The learning rate can be expressed as a linear function or as a function inversely proportional to time; using the inverse function ensures that all input samples have approximately equal influence on the training result. Some learning rate functions that have been implemented are the linear, inverse-of-time and power-series functions.

7. Being able to set the training algorithm to be used: for example batch, on-line, hybrid etc. The batch algorithm has been shown to be faster [4] than the normal sequential algorithm (and the results are just as good or even better).

8. Good data visualization options, for example histograms, Hinton charts, weight charts (maps), U-matrix, P-matrix etc., and good result analysis and presentation functions: computation of vital statistics for evaluating the quality of the clustering, for example the mean, standard deviation (or variance), correlation coefficient, t-test etc.
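The sketch below illustrates a few of the options from the list above: the bubble and Gaussian neighborhood kernels (feature 1) and linear, exponential and inverse-of-time cooling schedules (features 3 and 6). The functional forms are common textbook choices, stated here as assumptions rather than as the exact formulas used by any of the packages studied.

```python
import numpy as np

# Feature 1: neighborhood kernels as functions of the grid distance d
# between a unit and the winner, and the current radius sigma.
def bubble_kernel(d, sigma):
    """Winner-takes-all style kernel: constant inside the radius, zero outside."""
    return np.where(d <= sigma, 1.0, 0.0)

def gaussian_kernel(d, sigma):
    """Smooth kernel that decays with grid distance."""
    return np.exp(-d ** 2 / (2.0 * sigma ** 2))

# Features 3 and 6: cooling schedules that take a parameter from `start`
# towards `stop` over n_steps training steps.
def linear_decay(start, stop, t, n_steps):
    return start + (stop - start) * t / (n_steps - 1)

def exponential_decay(start, stop, t, n_steps):
    return start * (stop / start) ** (t / (n_steps - 1))

def inverse_time_decay(start, t):
    """Inverse-of-time decay, giving later samples roughly equal influence."""
    return start / (1.0 + t)
```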
This work presents a comparative study of the performance of some SOM clustering software when tested on the same data set. The results are presented, along with reasons for the observed variations. The study also presents the desirable features that standard SOM software should have.

2. MATERIALS AND METHODS

Agro-meteorological data for the FRIN headquarters, Ibadan, Nigeria was used. The data set had 254 records, and the attributes in the data set were: Year (numeric), Month (text), Total Rainfall in millimeters (numeric), Minimum Temperature in Celsius (numeric), Maximum Temperature in Celsius (numeric), Relative Humidity (numeric) and Fire Danger Index value (numeric). The SOM software used were NNClust, Pittnet Neural Network Educational Software and RapidMiner Studio.

The NNClust software is programmed to use only the Gaussian neighbourhood function and the Euclidean distance measure. The user can input the learning rate and starting neighbourhood size. The software automatically normalizes the input data between -1 and 1, and it has features for generating data/result statistics and data visualizations such as weight maps and radar charts. The Pittnet software also uses the Gaussian neighbourhood function and the Euclidean distance metric. The user defines the starting learning rate, and the software automatically normalizes the data between 0 and 1. It is a DOS-based program that saves its results in a text file and has no data analysis or data visualization abilities. RapidMiner Studio (Community Edition) has facilities for setting the learning rate and the neighbourhood radius, and the user can choose whether or not to normalize the data. It also has an array of tools for statistical data analysis and data visualization.

Using the three software packages, clusters were generated and the arithmetic mean of each cluster group was computed. The arithmetic mean is a measure of central tendency which describes the central location of the data; it is usually used together with other statistical measures such as the standard deviation, because it can be affected by extreme values in the data set and therefore be biased. The standard deviation describes the spread of the data and is a popular measure of dispersion: it measures the average distance between a single observation and the mean.
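The per-cluster statistics reported in Tables 1-3 can be reproduced along the following lines, assuming the records sit in a pandas DataFrame with a column holding the cluster label assigned by one of the packages; the file name and the "Cluster" column are illustrative assumptions.

```python
import pandas as pd

# Hypothetical export of the clustered records (file name is an assumption).
df = pd.read_csv("frin_agromet_clustered.csv")
features = ["TotalRainfall", "MaxTemp", "MinTemp", "RH", "FireDangerIndex"]

# Mean and standard deviation of every attribute within each cluster,
# the summary reported in Tables 1-3.
summary = df.groupby("Cluster")[features].agg(["mean", "std"])
print(summary)
```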
3. RESULTS AND DISCUSSION

The meteorological data was clustered using the NNClust SOM clustering software with a starting learning rate of 0.9 and was trained over 100 epochs. The software accepts only numeric values; non-numeric values are treated as missing values, which are replaced by the column mean. The software was set to identify a maximum of ten clusters; however, only eight clusters were generated. The software uses the number of clusters specified to create the SOM grid. The mean and standard deviation of the eight clusters were computed. Increasing the number of training cycles did not improve the results. Table 1 presents the summary of the eight clusters, while figure 2 presents the chart of the cluster means.
Figure 2: Chart of NNClust cluster means
The meteorological data was trained using the Pittnet software with a starting learning rate of 0.9 and was set to train for 100 epochs, although the software stops training as soon as the maximum number of clusters has been generated. The software requires the user to specify the expected number of clusters a priori; this number is used in conjunction with the number of input signals (attributes) to determine the SOM grid size. The expected number of clusters was set to ten. The software identified only four clusters. The mean and standard deviation of the clusters were computed. Table 2 presents the summary of the clusters, while figure 3 presents the chart of the cluster means.

Figure 3: Chart of Pittnet software cluster means

The RapidMiner Studio software was used to cluster the meteorological data set using a starting learning rate of 0.9 and was trained over 100 epochs. The expected number of clusters was set at ten, and the software generated ten clusters. Table 3 presents the summary of the cluster means with their standard deviations, while figure 4 presents a chart of the cluster means.

Figure 4: Chart of RapidMiner Studio cluster means
3.1 Discussion of Results
The quality of the clusters identified in the data by the three software packages can be inferred from a comparison of the mean and standard deviation of the clusters. If the value of the standard deviation is low, then the clustered records are within the same range; if the value is high, this suggests the presence of outliers among the clustered data records. For example, table 4 presents the clustered records for cluster 2 (table 1) of the NNClust software, which is representative of the trend observed in the clusters identified by that software. Interpreting the cluster is inconclusive when the values in the Total Rainfall field are considered: the field has a mean of 142.05 but a standard deviation of 136.011711.

Similarly, considering the clusters identified by the Pittnet software in table 2, the same trend is observed. Table 5 presents the records for cluster 4 (table 2) of the Pittnet software results. It can be observed that the cluster consists of data records which have the same value for the FireDangerIndex attribute. However, the Total Rainfall field has a mean value of 39.74444 and a standard deviation of 43.34732; the high standard deviation implies that there are outlier values among the clustered records.

The clusters identified by the RapidMiner software, presented in table 3, were easier to interpret. They followed the expected rainfall pattern which is known for the region where the data was collected [5]. Cluster 2 (table 3) contained records with a high FireDangerIndex (mean 3.8), as presented in table 6, while cluster 5 (table 3) contains the records with the highest recorded rainfall levels in the data set, as presented in table 7. The other clusters also contained data records which can be categorized by the rainfall level pattern of the region.
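The reading above, in which a high within-cluster standard deviation signals outliers, can be checked directly on the cluster 2 records of table 4. The sketch below computes the z-score of each Total Rainfall value relative to the cluster mean; with only six records no value exceeds two standard deviations, but the 357.1 mm record is the most extreme, and the spread (SD of about 136 against a mean of about 142) is what makes the cluster hard to interpret.

```python
import pandas as pd

# Total Rainfall values of the NNClust cluster 2 records (table 4).
rainfall = pd.Series([60.0, 357.1, 10.0, 57.0, 108.9, 259.3])

mean, sd = rainfall.mean(), rainfall.std()
print(f"mean={mean:.2f}, sd={sd:.2f}")  # approximately 142.05 and 136.01

# Z-scores: distance of each record from the cluster mean in SD units.
print(((rainfall - mean) / sd).round(2))
```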
4. CONCLUSION

Some of the problems noted in the literature about clustering algorithms are the following. Most clustering techniques are based on distance calculations, which are very sensitive to the ranges of the variables, so the values have to be normalized; normalization, however, is a subjective function, and such transformations cannot be carried out without creating biases. The presence of outliers in data sets creates problems for distance-based data clustering when the outliers have not been identified and removed from the data set. Handling categorical variables (non-numeric data, non-numeric variables, categorical data, nominal data, or nominal variables) is a problem for most clustering algorithms, and even when data encoding methods are used they can introduce extra biases due to the number of values which the encoding introduces for the categorical variables. The selection of variables also has a large influence on clustering results; assigning different weights to variables and categorical values can be used to counter this, but when many variables and categorical values are involved it can affect the clustering quality. Capturing the patterns (or behaviors) hidden inside time-varying variables and modeling them is another problem, and most clustering techniques do not possess this predictive modeling capability. Finally, most clustering techniques were developed for laboratory-generated, simple data sets consisting of a few to several numerical variables; hence they cannot be used for large data analyses that consist of many complex categorical variables.

The most common implementations of data clustering algorithms suffer from these problems. SOMs, however, are very robust and adept at handling them, although this also depends on the goal of the algorithm's implementation (programming). Applications programmed for demonstration purposes cannot be used for large-scale projects, and some implementations are not flexible and do not give users many options. Nevertheless, if an implementation of the conventional SOM algorithm (which is usually focused on the goals of the programmer) provides enough options to the user, the SOM remains a very robust algorithm that can be used for numerical, categorical and mixed data sets. Further work in this study is focused on the development of an open, flexible SOM clustering tool with adequate features that can be used for research purposes.

5. REFERENCES

[1]. Chang C., Ding Z., (2004), "Categorical data visualization and clustering using subjective factors", Data & Knowledge Engineering, Elsevier B.V.
[2]. Chen N. and Marques N. C., (2005), "An Extension of Self-Organizing Maps to Categorical Data", Proceedings of the 12th Portuguese Conference on Progress in Artificial Intelligence, pp. 304-313, Springer-Verlag, Berlin, Heidelberg, 2005.
[3]. Kaski S., (1997), "Data exploration using self-organizing maps", Acta Polytechnica Scandinavica, Mathematics, Computing and Management in Engineering Series No. 82, Espoo, 1997.
[4]. Kohonen T., (1999), "The Self-Organizing Map (SOM)", Helsinki University of Technology, Laboratory of Computer and Information Science, Neural Networks Research Centre, Quinquennial Report (1994-1998). (Downloaded from http://www.cis.hut.fi/research/reports/quinquennial/ January 2006.)
[5]. Nigeria Climate Review, (2010), Nigerian Meteorological Agency, www.nimetng.org
[6]. Pampalk E., Rauber A., Merkl D., (2002), "Using Smoothed Data Histograms for Cluster Visualization in Self-Organizing Maps", Technical Report OeFAI-TR-2002-29; extended version published in Proceedings of the International Conference on Artificial Neural Networks, Springer Lecture Notes in Computer Science, Madrid, Spain, 2002.
[7]. Pelczer I. J. and Cisneros H. L., (2008), "Identification of rainfall patterns over the Valley of Mexico", 11th International Conference on Urban Drainage, Edinburgh, Scotland, UK, 2008.
[8]. Principe J. C., Euliano N. R., Lefebvre W. C., (2000), Neural and Adaptive Systems: Fundamentals Through Simulations, John Wiley and Sons Inc., ISBN 0-471-35167-9, 656 pp.
[9]. StatSoft Electronic Statistics Textbook, (2002), Copyright 1984-2003 (http://www.statsoftinc.com/txtbook/glosd.html#Data Mining), downloaded June 2002.
[10]. Ultsch A., (1999), "Data Mining and Knowledge Discovery with Emergent Self-Organizing Feature Maps for Multivariate Time Series", in Kohonen Maps, (1999), pp. 33-46.
[11]. Ultsch A., (2003a), "Maps for the Visualization of high-dimensional Data Spaces", Proc. Workshop on Self-Organizing Maps, pp. 225-230, Kyushu, Japan, 2003.
[12]. Ultsch A., (2003b), "U*-Matrix: a Tool to visualize Clusters in high dimensional Data", Technical Report No. 36, Computer Science Department, University of Marburg, Germany, 2003.
[13]. Ultsch A., Moerchen F., (2005), "ESOM-Maps: tools for clustering, visualization, and classification with Emergent SOM", Technical Report No. 46, Dept. of Mathematics and Computer Science, University of Marburg, Germany, 2005.
[14]. Wehrens R., Buydens L. M. C., (2007), "Self- and Super-organizing Maps in R: The kohonen Package", Journal of Statistical Software, Vol. 21, Issue 5.
[15]. Zengyou He, Xiaofei Xu, Shengchun Deng, (2003), "Clustering Mixed Categorical and Numeric Data", Department of Computer Science and Engineering, Harbin Institute of Technology, Harbin 150001, P. R. China.
Table 1: Summary of NNClust clusters
TotalRainfall MaxTemp MinTemp RH FireDangerIndex
Cluster 1 Mean 3.7 32 24 83 2
SD 0 0 0 0 0
Cluster 2 Mean 142.05 33.5 24.5 79.33333 2.666666667
SD 2.61629509 22.627417 16.9706 4.501851 0.516397779
Cluster 3 Mean 113.313158 31.1236842 31.0605 70.54737 2.5
SD 69.9895185 15.4557389 11.4404 45.62364 1.246560403
Cluster 4 Mean 149.99 30.8333333 30.2967 73.75333 2.333333333
SD 98.1425436 3.53058883 20.0499 25.41582 0.546672274
Cluster 5 Mean 109.891667 30.6333333 36.1667 64.64444 2.638888889
SD 92.1210985 4.02073199 24.3938 34.37646 0.723198364
Cluster 6 Mean 141.621277 31.7574468 27.0617 73.1617 2.617021277
SD 97.0359995 2.63056819 13.7078 20.8623 0.644481304
Cluster 7 Mean 123.545794 31.4411215 29.4963 74.41028 2.411214953
SD 81.8137003 2.96536463 18.4077 24.4239 0.531165877
Cluster 8 Mean 175.268966 29.3793103 23.069 86.89655 2.068965517
SD 85.4901878 1.49794605 1.06674 4.312315 0.257880715
Table 2: Summary of the Pittnet software clusters
TotalRainfall MaxTemp MinTemp RH FireDangerIndex
Cluster 1 Mean 50.850001 24.75 63.5 3.9 4
SD 31.32483 0.070709 12.0208153 0.141421356 0
Cluster 2 Mean 134.3332 31.7082 23.5984375 82.4218728 2.3828125
SD 91.137324 2.254123 1.06439596 6.908488013 0.487025284
Cluster 3 Mean 138.05185 24.64815 84.4074074 2.196296296 2.407407407
SD 45.668999 15.90804 27.2370968 39.48311832 1.836329785
Cluster 4 Mean 39.744444 35.55556 23.5555556 59.22222133 4
SD 43.347321 1.333333 1.74005108 7.120003363 0
Table 3: Summary of RapidMiner Studio clusters
TotalRainfall MaxTemp MinTemp RH FireDangerIndex
Cluster 0 Mean 42.35385 33.41154 23.99615 78.46153846 2.730769231
SD 8.192056 2.308823 0.911913 7.798619207 0.603833905
Cluster 1 Mean 13.50513 33.47179 23.80769 77.43589744 2.820512821
SD 9.379343 2.342845 1.280909 6.302860135 0.451418517
Cluster 2 Mean 7.64 35.36 23.42 55.2 3.8
SD 16.15873 17.96476 13.16786 40.93966268 1.299899072
Cluster 3 Mean 57.94667 25.35333 78.13333 2.726666667 2.933333333
SD 13.23034 15.63488 11.11308 32.15964741 1.361648053
Cluster 4 Mean 211.4214 23.90714 88.14286 1.871428571 2.071428571
SD 46.93198 1.320527 4.24005 0.299816794 0.267261242
Cluster 5 Mean 270.4346 30.36154 23.21923 85.19230769 2.115384615
SD 42.68863 1.395814 0.859101 5.129837598 0.322602539
Cluster 6 Mean 188.0463 30.77805 23.31463 84.90243902 2.146341463
SD 15.90989 1.518801 0.887288 5.180757078 0.357839043
Cluster 7 Mean 144.6971 31.42 23.47429 82.85714286 2.342857143
SD 10.84353 1.991127 0.995089 7.6855206 0.481593992
Cluster 8 Mean 110.85 31.84474 23.72105 82.31578947 2.473684211
SD 9.73158 2.332462 1.076822 6.794692934 0.603451429
Cluster 9 Mean 70.05862 32.27241 24.04828 81.31034483 2.482758621
SD 8.635041 2.37936 1.180684 9.043953972 0.508547628
Table 4: Sample NNClust software cluster result (cluster 2)
Year Months TotalRainfall MaxTemp MinTemp RH FireDangerIndex
1980 Feb. 60 35 27 75 3
1987 Aug. 357.1 30 23 86 2
1987 Nov. 10 35 24 80 3
1989 Mar. 57 35 25 77 3
1991 Apr. 108.9 32 24 83 2
1998 Sept. 259.3 34 24 75 3
Mean 142.05 33.5 24.5 79.33333 2.666667
SD 136.0117 2.073644 1.378405 4.501851 0.516398
Table 5: Sample Pittnet software cluster result (cluster 4)
Year Months TotalRainfall MaxTemp MinTemp RH FireDangerIndex
1989 Feb. 18.4 35 22 51 4
1990 Feb. 40.3 35 23 64 4
1990 Mar. 11.7 37 25 69 4
1994 Jan. 1.3 33 20 45 4
1997 Mar. 122.2 35 23 62 4
1998 Feb. 2 36 25 60 4
2000 Mar. 48.8 37 25 62 4
2001 Mar. 15 37 25 60 4
2001 Apr. 98 35 24 60 4
Mean 39.74444 35.55556 23.55556 59.22222 4
SD 43.34732 1.333333 1.740051 7.120003 0
Table 6: Sample RapidMiner software cluster result (cluster 2)
Year Months TotalRainfall MaxTemp MinTemp RH FireDangerIndex
1989 Feb. 18.4 35 22 51 4
1994 Jan. 1.3 33 20 45 4
1998 Feb. 2 36 25 60 4
2001 Mar. 15 37 25 60 4
2004 Mar. 1.5 35.8 25.1 60 3
Mean 7.64 35.36 23.42 55.2 3.8
SD 8.361399 1.499333 2.319914 6.906519 0.447214
Table 7: Sample RapidMiner software cluster result (cluster 5)
Year Months TotalRainfall MaxTemp MinTemp RH FireDangerIndex
1979 Jul. 291.2 29 23 85 2
1979 Sept. 269 29 23 86 2
1979 Oct. 223.6 31 24 86 2
1979 Nov. 261.4 32 24 83 2
1980 Jun 306 31 23 82 2
1980 Aug. 427.4 28 23 88 2
1980 Sept. 333.5 29 23 90 2
1981 Sept. 233.9 30 23 86 2
1981 Oct. 225.1 31 24 83 2
1983 May 250.7 31 24 85 2
1984 May 223 32 23 86 2
1984 Jun 233.6 30 22 82 2
1985 Jul. 307.2 30 23 86 2
1985 Aug. 232.2 30 23 89 2
1986 Jun 312.9 31 23 83 2
1986 Sept. 374.1 29 22 84 2
1987 Jul. 246.8 30 23 85 2
1987 Aug. 357.1 30 23 86 2
1987 Sept. 252.5 31 23 87 2
1988 Jun 242.9 30 22 82 2
1988 Jul. 240.9 29 23 84 2
1988 Sept. 225.1 30 23 87 2
1989 May 259.2 32 23 83 2
1989 Jun 338.7 31 23 86 2
1989 Aug. 275 29 22 88 2
1990 Apr. 233.8 33 24 82 3
1990 Jul. 293.6 29 23 90 2
1990 Oct. 255.4 31 23 85 2
1991 May 258.2 32 24 84 2
1991 Jul. 306.6 29 23 90 2
1992 Sept. 275.4 29 23 90 2
1992 Oct. 276.3 31 23 88 2
1993 Jul. 261 29 27 87 2
1993 Aug. 237.7 29 23 90 2
1993 Sept. 255.5 30 23 86 2
1994 Sept. 236 30 23 89 2
1995 May 334.3 31 24 81 2
1995 Aug. 304.2 29 23 91 2
1996 Aug. 224.7 30 23 89 2
1996 Sept. 304.1 29 22 90 2
1997 Apr. 261.7 32 24 70 3
1998 May 245.4 34 25 70 3
1998 Sept. 259.3 34 24 75 3
2000 Jul. 220.4 29 23 73 3
2000 Aug. 263.8 29 23 85 2
2001 May 265 33 24 74 3
2001 Sept. 275.2 29 22 90 2
2002 Oct. 265 29 24 87 2
2003 Jun 275.3 30.6 24.5 92 2
2003 Sept. 226 30.8 22.4 92 2
2003 Oct. 254.9 32 23.2 92 2
2006 Sept. 250.8 30.4 22.3 86 2
Mean 270.4346 30.36154 23.21923 85.19231 2.115385
SD 42.68863 1.395814 0.859101 5.129838 0.322603