<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>IICST</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>INITIAL CENTROID OPTIMIZATION IN K-MEANS ALGORITHM FOR EDUCATIONAL DATA MINING</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Raga S.H. Istanto</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Arinda D. Listikowati</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bekti P. Wibowo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fitra A. Bachtiar</string-name>
          <email>fitra.bachtiar@ub.ac.id</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Brawijaya University</institution>
          ,
          <country country="ID">Indonesia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <volume>5</volume>
      <fpage>51</fpage>
      <lpage>56</lpage>
      <abstract>
        <p>Senior High School is a formal educational institution at the level of secondary education that aims to develop students' potential in various disciplines. In the process of development, students are directed to certain interests. In the traditional method, the process of selecting interests is carried out by the teacher which turns out to be difficult, especially in identifying and finding information that is useful to determine student areas of interest. To overcome these problems, the Educational Data Mining (EDM) approach is used. K-Means is one method that can be used in EDM. However, the K-Means method has a disadvantage in choosing initial centroids, because it is slow and less accurate. This study proposed 3 methods for conducting initial centroid selection. Evaluation results obtained in the study using 245 student data showed that the initial centroid using average calculations on data that had been normalized using min-max normalization obtained better results.</p>
      </abstract>
      <kwd-group>
        <kwd>Initial Centroid</kwd>
        <kwd>K-Means</kwd>
        <kwd>Educational Data Mining</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        In Indonesia, based on government regulation no. 17 of 2010 article 79
        <xref ref-type="bibr" rid="ref6">(PP, 2010)</xref>
        , there is a process for dividing
senior high school students into several interests, which are named natural science study programs, social science
study programs, and other study programs that are needed by the community. In the traditional method, the process
of selecting interests is done by the teacher. The problem with the traditional selection process is that teachers
must identify and find useful information on large data manually which is a difficult task to do
        <xref ref-type="bibr" rid="ref7">(Quadril and
Kalyankar, 2010)</xref>
        . A very promising solution to facilitate the process of selecting interests is to use educational
data mining (EDM)
        <xref ref-type="bibr" rid="ref8">(Romero, 2007)</xref>
        .
      </p>
      <p>
        This approach has been widely used to carry out data grouping processes such as the case for selecting student
interests. There are many methods in the educational data mining approach to grouping data. K-means is a method
of grouping data that is very popular to use because of its simplicity which makes computing faster and uses
memory more efficiently
        <xref ref-type="bibr" rid="ref9">(Singh et al., 2011)</xref>
        . K-Means is a partitioning clustering method that separates data into
different k groups. With the principle of partitioning iteratively, K-Means minimizes the average distance of each
data to its cluster. However, the performance of K-means is strongly influenced by the selection of the initial
centroid center point (initial centroid)
        <xref ref-type="bibr" rid="ref5">(Mahmud, 2012)</xref>
        . K-Means raises the initial center point of the cluster
through random selection. If the randomly selected center point of the cluster is approaching the final solution of
the cluster center, the k-means algorithm will work faster and uses memory more efficiently without any incorrect
clustering results. Conversely, if the initial center point of the cluster is far from the center of the final solution
cluster, then it is very likely to cause incorrect clustering results
        <xref ref-type="bibr" rid="ref2">(Cheung, 2003)</xref>
        . Poor choice of initial centroid
because random selection can cause this method to be slow and inaccurate in grouping data.
      </p>
      <p>
        Several methods have been introduced to optimize the initial centroid for the K-Means algorithm.
        <xref ref-type="bibr" rid="ref3">Duda and
Hart (1973)</xref>
        , have discussed a recursive method for initializing the average value obtained from the whole data and
randomly generated k times the initial center point.
        <xref ref-type="bibr" rid="ref1">Bradley and Fayyad (1998)</xref>
        propose an algorithm that can
optimize the starting point by analyzing data distribution and data density probabilities. Shehroz and Ahmad (2004)
introduced a method called the Cluster Center Initialization Algorithm (CCIA) to complete the initialization of the
initial center point for K-means. CCIA calculates the average values and standard deviations for all data attributes
and then separates the data using a normal curve to a particular partition. CCIA uses K-means and density-based
multi-scale data conditions to observe the similarity of data patterns before finding the starting point for K-means.
      </p>
      <p>Based on these explanations, the researcher is interested in proposing the optimization of initial centroid
selection by proposing new methods resulting from variations and combinations of methods that already exist.
Determination of the initial center point will be done through a series of stages of normalizing data, separating a
number of data with normal curves to certain partitions using the vote technique, and initializing the initial center
point using the average value of each partition. The results of the proposed optimization will then be evaluated so
that it can be shown changes in the level of accuracy, the number of iterations, and data density.
2.1</p>
    </sec>
    <sec id="sec-2">
      <title>Data Pre-processing</title>
      <p>The dataset used in this study is a dataset of student grades taken from one school in Indonesia. The raw dataset is
265 data. The dataset has 12 features in the form of 9 subject values ranging from 0 to 100, student id, name and
one feature output in the form of classes displayed in IPA (natural science) or IPS (social) class names. Raw data
that is separated into several files are put together in one container. The duplicate grades are removed based on the
student full name match. Data that has a feature with a value of 0 and/or null is also removed. Finally after
preprocessed, the data used in this study were 245 data.
2.2</p>
    </sec>
    <sec id="sec-3">
      <title>Proposed Method 1 Fig. 1. The framework of proposed method 1</title>
      <p>The first proposed optimization for initial centroid selection that can be seen in Figure 1 starts with finding the
maximum value called (M) and minimum values (N) for each feature, then find the average value for each feature
using M and N values. After that calculate the global average value as the middle value of the dataset to be clustered.
The middle value is obtained by finding the average value of the average value of each attribute in the dataset, this
value is called the middle value. After the middle value is obtained, then insert each object into the initial cluster
using the calculation of the majority of data membership. For example, if an object has data [70,75,71] on all three
attributes and the middle value that is calculated is 74 then the value will be [L, H, L] in other words the object
will be entered into a cluster member L. After each object is entered into in the initial cluster, the next step is to
calculate the average value of each attribute in the cluster to be used as the initial centroid. The next step in
clustering follows the K-means algorithm after getting the initial centroid. Assign each data to a specific cluster
based on calculation of the distance from each initial centroid where the distance of data from the centroid is
minimum using Euclidean distance. Then revive new centroids from each cluster and reassign data based on the
distance from the new centroid. Continue to repeat until there are no cluster changes in each data.
2.3</p>
    </sec>
    <sec id="sec-4">
      <title>Proposed Method 2</title>
      <p>The second proposed optimization of initial centroid selection starts with finding the middle value of the dataset
to be clustered. The middle value is obtained by finding the average value of the average value of each attribute in
the dataset, this value is called the middle value. After the middle value is obtained, then insert each object into
the initial cluster using the calculation of the majority of data membership. For example, the object has data [69,
78, 73] on all three attributes and the middle value that is owned is 74 then the value will be [L, H, L] in other
words the object will be entered into the members of cluster L (the process is broadly the same as the first proposed
method to get initial group members, see Figure 1 for details). After getting all members in every initial cluster,
initial centroids are then randomly selected from each cluster. The next clustering step follows the K-means
algorithm after determining the initial centroid.
2.4</p>
    </sec>
    <sec id="sec-5">
      <title>Proposed Method 3</title>
      <p>The third optimization proposed for initial centroid selection that can be modeled in Figure 2 starts with
normalizing the dataset using min-max normalization. After the data is normal, it then determines the initial
KMembers randomly. After each object is clustered into a number of initial K, the mean value is calculated for each
cluster to be used as the initial centroid. The next clustering step follows the K-Means algorithm until there is no
cluster displacement on each object.</p>
    </sec>
    <sec id="sec-6">
      <title>The framework of proposed method 3</title>
    </sec>
    <sec id="sec-7">
      <title>3. ANALYSIS 3.1</title>
    </sec>
    <sec id="sec-8">
      <title>Initial Method</title>
      <p>At this stage, testing of the K-means method in general is carried out. Testing is done by looking at the value of
many iterations, accuracy, and silhouette values. Testing is done by selecting the initial centroid randomly that can
be seen in Table 1. Selected the initial centroid by random for the data line [12.28; 120.40; 25.197; 227.87; 5.245].
Tests carried out as many as 5 until the test results can be presented in Table 2.</p>
      <p>PK
1
2
3
4
5
1
2
3
4
5
1
2
1
2
1
2
1
2
1
2
(data-id)
12;28
120;40
25;197
227;87
5;245
Average
83
88
79
85
86
78
81
80
91
88
82
77
75
79
76
75
80
75
78
75
7
8
9
18
17
11.8
82
84
80
82
78
80
78
80
78
76
94
80
80
86
80
80
80
80
81
80
70.20
69.80</p>
      <p>The results of the K-means method show that the number of iterations is at least 7 and the most are 18, the
lowest accuracy is 68.16 with the highest accuracy of 70.20, and the lowest silhouette value is -0.283 with the
highest silhouette value is 0.126. This can occur due to incorrect selection of initial centroids, or initial centroid
much difference from the final centroid. Compared to the first proposed method, regular k-means method provide
an average 0.8 more iterations, accuracy is 0.9% better and the silhouette value decreases by 0.169.
3.2</p>
    </sec>
    <sec id="sec-9">
      <title>Proposed Method 1</title>
      <p>The test scenario performed on the proposed method 1 is to select the initial centroid through the average value in
the dataset which has been normalized on the normal curve whose middle limit is the average value of the average
75
75
77
75
75
75
76
75
75
75
80
82
80
82
80
80
82
80
80
80
85
85
80
75
80
80
80
85
80
80
minimum and maximum value of each feature or attribute, see Table 3 for details. The middle limit that is obtained
is 79.88. Furthermore, each dataset will be categorized into 2 data groups, below the limit and above the limit.
After it is divided into 2 groups, the average value of the initial centroid is selected. Initial centroids obtained are
[82.39; 75.60; 78.14; 80.49; 73.19; 76.62; 75.25; 79.91; 80.00] for cluster 1, while for cluster 2 [85.76; 78.76;
81.11; 81.04; 78.96; 82.51; 75.43; 80.89; 82.47]. The test results using the proposed initial centroid 1 selection
method can be shown in Table 4.</p>
      <p>PK</p>
      <p>Because it uses the average value to select the initial centroid and the initial group selection uses the vote
system, only 1 experiment can be conducted. From the test results, it can be seen that the number of iterations is
reduced by 0.8 when compared to the initial centroid random selection method. Accuracy decreased by 0.9%, and
silhouette values were better by 0.1690.
3.3</p>
    </sec>
    <sec id="sec-10">
      <title>Proposed Method 2</title>
      <p>Proposed method 2 randomly selects initial centroids on a normalized dataset on a normal curve whose middle
limit is the average value of the average minimum and maximum values of each feature or attribute. The middle
limit that is obtained is 79.88. Furthermore, each dataset will be categorized into 2 data groups, below the limit
and above the limit. After it is divided into 2, then the initial centroid will be randomly selected by 5 samples that
can be shown in Table 5. The results of testing using the proposed method 2 can be presented in Table 6.</p>
      <p>PK
77
86
84
90
75
80
78
89
79
88
75
77
75
79
75
75
85
88
77
82
81
81
77
84
79
80
79
82
78
83
80
80
81
80
84
80
80
80
81
84
75
75
82
80
75
79
75
79
75
92
75
93
76
77
75
79
80
90
75
84
75
75
75
76
75
75
75
76
78
75
82
82
78
82
80
82
80
82
80
82
90
80
80
90
75
80
75
85
80
85</p>
      <p>The results of testing the initial centroid selection method 2 give the same level of accuracy results with the
random selection of initial centroids on the usual K-means method. However, the average iteration
3.8 fewer and silhouette values increased by 0.1516.
1
2
3
4
5
1
2
3
4
5
1
2
3
4
5</p>
      <p>Average
In the proposed method 3 centroid selection will be done using the average value in the dataset that has been
grouped randomly after going through the normalization process. Normalization using min-max normalization.
After knowing the normal form in the dataset, then each data is allocated into the initial group randomly, into 2
data groups. From these groups, an average value will be calculated to be used as the initial centroid. Random
grouping carried out 5 times, resulting in 5 initial centroid samples that can be seen in table 7 as experimental
material. With the result of testing the proposed method 3 can be shown in Table 8.</p>
      <p>PK</p>
      <p>From the test results when compared with the random selection of initial centroids on the usual K-means
method, the proposed method 3 provides an average of 3.8 fewer iterations, accuracy increase 3.43%, and the
silhouette value decreases 0.07444. Compared to the first proposed method, proposed method 3 provide an average
of 3 fewer iterations, accuracy is 4.33% better and the silhouette value increases by 0.2434. While compared to
the second proposed method, the proposed method 3 gives the same iteration average as the proposed method 2,
accuracy increase 3.43% and the silhouette value increases by 0.226.
3.5</p>
    </sec>
    <sec id="sec-11">
      <title>Experiment Result</title>
      <p>From the result of experiments concluded on 245 data using 3 proposed methods for selecting initial centroid with
5 experiments in each proposed method. The result obtained for the least average iteration and the best accuracy
is obtained in the third proposed method. However instead, the level of data density in the third proposed method
is worse when compared to the 2 other proposed methods. Meanwhile, the best data density is obtained in the first
proposed method, but the average iteration and accuracy obtained cannot be as good as in the third proposed
method.</p>
    </sec>
    <sec id="sec-12">
      <title>4. CONCLUSION REFERENCES</title>
      <p>Based on experiments, it is known that from the 3 proposed optimization methods it can be concluded that for the
number of iterations and the level of accuracy of the proposed optimization method 3 namely min-max
normalization shows the best results, while for the best silhouette value is owned by the first proposed method. In
future work, a similar dataset with a larger amount of data can be tested using all proposed methods. Other data
sources can also be used to find out other results from all proposed methods.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Bradley</surname>
            ,
            <given-names>P.S.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Fayyad</surname>
            ,
            <given-names>U.M.</given-names>
          </string-name>
          (
          <year>1998</year>
          ).
          <article-title>Refining initial points for K-means clustering</article-title>
          ,
          <source>In: Proceeding of the 15th International Conference on Machine Learning (ICML'98)</source>
          , Shavlik,
          <string-name>
            <surname>J</surname>
          </string-name>
          . (ed.),
          <fpage>91</fpage>
          -
          <lpage>99</lpage>
          . Morgan Kaufmann: San Francisco.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Cheung</surname>
            ,
            <given-names>Y.M.</given-names>
          </string-name>
          (
          <year>2003</year>
          ).
          <article-title>k-Means: A new generalized k-means clustering algorithm</article-title>
          .
          <source>Pattern Recognition Letters</source>
          ,
          <volume>24</volume>
          (
          <issue>15</issue>
          ),
          <fpage>2883</fpage>
          -
          <lpage>2893</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Duda</surname>
            ,
            <given-names>R.O.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Hart</surname>
            ,
            <given-names>P.E.</given-names>
          </string-name>
          (
          <year>1973</year>
          ).
          <article-title>Pattern Classification and Scene Analysis</article-title>
          . Wiley: New York.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Khan</surname>
            ,
            <given-names>S.S.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Ahmad</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2004</year>
          ).
          <article-title>Cluster center initialization algorithm for K-Means clustering</article-title>
          .
          <source>Pattern Recognition Letters</source>
          ,
          <volume>25</volume>
          (
          <issue>11</issue>
          ),
          <fpage>1293</fpage>
          -
          <lpage>1302</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Mahmud</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rahman</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Akhtar</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          (
          <year>2012</year>
          ).
          <article-title>Improvement of K-means Clustering algorithm with better initial centroids based on weighted average</article-title>
          ,
          <source>In: 7th International Conference on Electrical and Computer Engineering</source>
          ,
          <fpage>647</fpage>
          -
          <lpage>650</lpage>
          . IEEE: New York NY.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>PP.</surname>
          </string-name>
          (
          <year>2010</year>
          ).
          <article-title>Peraturan Pemerintah Republik Indonesia nomor 17 tahun 2010 Tentang Pengelolaan dan Penyelenggaraan Pendidikan</article-title>
          . Presiden Republik Indonesia: Indonesia.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Quadril</surname>
            ,
            <given-names>M.N.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Kalyankar</surname>
            ,
            <given-names>N.V.</given-names>
          </string-name>
          (
          <year>2010</year>
          ).
          <article-title>Drop out feature of student data for academic performance using decision tree techniques</article-title>
          .
          <source>Global Journal of Computer Science and Technology</source>
          ,
          <volume>10</volume>
          (
          <issue>2</issue>
          ),
          <fpage>2</fpage>
          -
          <lpage>5</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Romero</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Ventura</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>2007</year>
          ).
          <article-title>Educational Data Mining: A Survey from 1995 to 2005</article-title>
          .
          <article-title>Expert System with Applications</article-title>
          ,
          <volume>33</volume>
          (
          <issue>1</issue>
          ),
          <fpage>135</fpage>
          -
          <lpage>146</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Singh</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Malik</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Sharma</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          (
          <year>2011</year>
          ).
          <article-title>Evolving limitations in K-Means algorithm in data mining and their removal</article-title>
          .
          <source>International Journal of Computational Engineering &amp; Management (IJCEM)</source>
          ,
          <volume>12</volume>
          ,
          <fpage>105</fpage>
          -
          <lpage>109</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>