<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Anomaly Detection Using a Fuzzy K-Means Algorithm</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Avadh Naresh Kushwaha</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Delhi Technological University</institution>
          ,
          <addr-line>Shahbad Daulatpur, Main Bawana, Delhi</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <fpage>185</fpage>
      <lpage>191</lpage>
      <abstract>
<p>Anomaly detection uncovers abnormal actions in a given data set. It is the process of finding patterns that do not conform to expected behavior. Several activities, such as the hourly temperature of a place or the number of users who visit a website per minute, have a regular pattern or waveform that depicts normal behavior. Any deviation from this normal or common pattern means that something unusual is happening. Several methods have been used to detect anomalies/abnormal behavior, such as clustering, which finds the underlying patterns or structures within a collection of data to form different clusters. This article describes an anomaly as a particular wave shape that has never been seen before. A library of normal waveforms is created and later used to reconstruct a waveform that is being tested. The "normal" waveforms are determined using the K-means clustering algorithm to group/cluster similar waveforms. The experiments resulted in a reconstruction error of 23.8%, which indicated there was something abnormal.</p>
      </abstract>
      <kwd-group>
        <kwd>K-means Algorithm</kwd>
        <kwd>Anomaly Detection</kwd>
        <kwd>Clustering Analysis</kwd>
        <kwd>Intrusion detection</kwd>
        <kwd>Cluster</kwd>
        <kwd>Outlier Detection</kwd>
        <kwd>Waveform</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Anomaly detection is a method used to recognize abnormal patterns, called outliers, that do not match expected
behavior. Nowadays, a massive amount of data and software is available on social
platforms to ensure service. This data needs to be monitored systematically to
detect abnormal events such as abnormal health conditions, intrusions, and
fraud.</p>
      <p>Numerous types of algorithms have been developed for anomaly detection, driven by both academic
research and commercial interest, and applied to protect specific applications or devices.</p>
      <p>In most kinds of records, such as temperature or visits to a website, there is usually a normal
accepted behavior indicating that everything is working fine. In some cases it might be hard to determine an
anomaly based on an instantaneous value of a waveform. Instead, it is better to consider
the shape of the waveform, as discussed in this paper.</p>
      <p>A library of different waveforms is created, and with the use of k-means clustering we can
determine what constitutes the normal shape of a waveform. The dataset used is plotted to form a wave
of a particular shape. The waveform is then reconstructed using the waveform library. If the
reconstruction is poor, the wave certainly contains some abnormality.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Jianliang, Haikun, and Ling [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] performed intrusion detection on networks using K-means. The
authors evaluated their proposed model using “the KDD Cup 1999 data”. The data provided a wide
variety of intrusions simulated over a military network. Of the “100,000 labeled
data items”, 1,000 contained attack samples, while the other 99,000 were normal samples
free from attacks.
      </p>
      <p>
        They preprocess the data to obtain more expressive features in two ways: first, they calculate the
mean absolute deviation [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and then calculate the standardized measurement [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. New training
datasets are then created based on the formulas above, and the proposed algorithm is applied.
The experiments performed showed that the K-means algorithm was the most suitable method for
intrusion detection [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
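      <p>The two preprocessing transformations described above can be sketched as follows. This is an illustrative NumPy implementation, not the authors' code; the exact formulas in [1] may differ in detail.</p>

```python
import numpy as np

def standardize(X):
    """Preprocessing in the style of [1]: compute the mean absolute
    deviation per feature, then the standardized measurement
    (each value's deviation from the mean, scaled by the MAD)."""
    mean = X.mean(axis=0)
    mad = np.mean(np.abs(X - mean), axis=0)   # mean absolute deviation
    return (X - mean) / mad                   # standardized measurement
```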
      <p>
        Xinlong and Weishi [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] used an improved k-means clustering technique to perform anomaly
detection in cloud computing. The method proposed by the authors finds known attacks and
anomalous attacks in the cloud computing environment. They use a distributed intrusion detection model for cloud
computing to perform the detections [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. The authors also made quantitative comparisons of the
different intrusion detection methods previously used, based on their advantages and disadvantages.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. K-Means Algorithm</title>
      <p>
        Clustering is “the process that groups objects into subclasses so that those subclasses are entirely
related or have some similarities and the different clusters are altogether dissimilar from each other”
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Clustering algorithms are divided into four groups: Partitioning Clustering,
Hierarchical Algorithms, Density-Based, and Grid-Based.
      </p>
    </sec>
    <sec id="sec-4">
      <title>3.1. Partitioning Clustering</title>
      <p>
        In partitioning clustering [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], the data objects are directly divided into a pre-defined
number of clusters. "Checking all possible clusterings is computationally impractical; certain greedy
heuristics are used in the form of iterative optimization of a cluster. The partitioning clustering
consists of several approaches such as K-means Clustering, K-medoid Clustering, Relocation
Algorithms, and Probabilistic Clustering." [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
      </p>
    </sec>
    <sec id="sec-5">
      <title>3.2. Hierarchical Algorithm</title>
      <p>
        A hierarchical clustering [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] "creates a hierarchical decomposition of the given set of data objects.
It builds a cluster hierarchy, known as a dendrogram. Disjoint groups of clusters are obtained by
cutting the tree at the desired level. Hierarchical clustering is used to find data on different levels of
dissimilarity." [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
      </p>
    </sec>
    <sec id="sec-6">
      <title>3.3. Density-Based</title>
      <p>
        A density-based algorithm [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] "can find clusters of arbitrary shape. To group objects,
it uses a density objective function; in this method, clusters grow until the number of
objects in the neighborhood exceeds some limit." [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
      </p>
    </sec>
    <sec id="sec-7">
      <title>3.4. Grid-Based</title>
      <p>
        Grid-based clustering [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] "quantizes the object space into a finite number of cells that form a grid
structure. After calculating the cell densities, grid-based clustering sorts the cells according to their densities.
Cluster centers are identified, and all neighbor cells are traversed." [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
      </p>
      <p>
        Partitioning is done to divide a group of data into k clusters [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. K-means constitutes a proper partitioning
clustering technique based on iterative training.
      </p>
    </sec>
    <sec id="sec-8">
      <title>3.5. The Objective of the Algorithm</title>
      <p>Consider the d-dimensional data set {x_i | x_i ∈ R^d, i = 1, 2, ..., n}. The method follows simple steps
to classify the given dataset into K clusters C_1, C_2, ..., C_K by defining K centroids c_1, c_2, ..., c_K, one
for each cluster:</p>
      <p>c_j = (1 / n_j) Σ_{x ∈ C_j} x,</p>
      <p>where n_j is the number of data points in cluster C_j.</p>
      <p>
        Each data point is assigned to its nearest centroid, and the centroids of the resulting clusters are then
recalculated [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>A new assignment is then made between the current data points and the closest new centroids. This
generates an iterative loop in which the centroids change their location gradually until no further change
occurs. The algorithm thereby minimizes an objective function.</p>
      <p>The objective function is J = Σ_{j=1}^{K} Σ_{x ∈ C_j} d(x, c_j)^2, where d(x, c_j) is the
distance between data point x and cluster center c_j.</p>
    </sec>
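      <p>For concreteness, the objective J can be computed directly from the definitions above. This is a small illustrative sketch in NumPy; X, the centers c_j, and the labels follow the notation of this section.</p>

```python
import numpy as np

def objective(X, centers, labels):
    """K-means objective J: sum of squared Euclidean distances of
    each data point to the center of its assigned cluster."""
    return sum(np.sum((X[labels == j] - c) ** 2)
               for j, c in enumerate(centers))
```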
    <sec id="sec-9">
      <title>3.6. Step by Step Algorithm</title>
      <p>Step 1 (initialize): randomly choose K instances c_1, c_2, ..., c_K from the data set X as the initial cluster centers of the clustering space.</p>
      <p>Step 2 (assignment): assign each instance x to the nearest center c_j, i.e., the center for which d(x, c_j) &lt; d(x, c_l) for all l = 1, 2, ..., K with l ≠ j.</p>
      <p>Step 3 (update): recalculate the centroids of the clusters, c_1*, c_2*, ..., c_K*.</p>
      <p>Step 4 (loop): if c_j* = c_j for all j = 1, ..., K, stop the algorithm; the centers c_1*, c_2*, ..., c_K* represent the final clusters. Otherwise set c_j = c_j* and repeat steps 2 and 3 until there is no further change.</p>
    </sec>
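      <p>The four steps can be sketched in a few lines of Python. This is an illustrative NumPy implementation, not the authors' code; for determinism, the first k instances are used as initial centers instead of a random choice in Step 1.</p>

```python
import numpy as np

def kmeans(X, k, max_iter=100):
    # Step 1 (initialize): take k instances as the initial cluster centers.
    centers = X[:k].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # Step 2 (assignment): assign each instance to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3 (update): recalculate each cluster's centroid.
        new_centers = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        # Step 4 (loop): stop when the centers no longer change.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```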
    <sec id="sec-10">
      <title>3.7. Advantage(s)</title>
      <p>
        The K-means algorithm is efficient for considerably large datasets [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], and its time complexity is O(tkn),
where t is the number of iterations, k is the number of clusters, and n is the number of data points in the dataset.
      </p>
    </sec>
    <sec id="sec-11">
      <title>3.8. Disadvantage(s)</title>
      <p>
        With the k-means algorithm, the number of clusters must be specified in advance, yet it is
usually not known beforehand. The algorithm is also sensitive to outliers [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>It is not easy to predict the value of k, the k-means algorithm does not work well for non-globular clusters, and
it does not perform well with clusters of different density and size.</p>
    </sec>
    <sec id="sec-12">
      <title>4. Proposed Methodology</title>
      <p>In our approach, we define an anomaly as a waveform shape that has not been experienced
or seen before. The algorithm builds a library of normal shapes, reconstructs the waveform, and
determines its shape. The waveform is said to show abnormal behavior if its reconstruction
from the library is inadequate.</p>
    </sec>
    <sec id="sec-13">
      <title>4.1. Dataset</title>
      <p>In this paper, the data used for the experiments is the EKG dataset publicly available on PhysioNet.
This version of the data provides a regular waveform, which gives a reasonable basis for exploring
the proposed algorithm.</p>
    </sec>
    <sec id="sec-14">
      <title>4.2. Clustering</title>
      <p>To fully ascertain what constitutes a waveform's normal shape, we use k-means clustering to form
different clusters and determine a normal cluster group. The clusters can lie in a separate n-dimensional
space defined by n features. In our case, the clusters are determined programmatically using the
clustering algorithm.</p>
    </sec>
    <sec id="sec-15">
      <title>4.3. Waveform Space</title>
      <p>We consider a 32-dimensional waveform space to perform anomaly detection, such that
every point within the space potentially represents a waveform segment. We cluster the segments
that are similar to each other. “The middle of each cluster (the centroid) will measure the prototypical
waveform pattern that all those segments are specific instances of”. Where necessary, the
average of all the waveforms within a cluster will be taken as the centroid.</p>
      <p>Since the center of a cluster is also a point in the waveform space, it is itself a
“waveform”. This means that the centroids of the clusters constitute a set of “normal
waveform segments”.</p>
      <p>Using these normal segments, we then try to reconstruct the dataset being tested. The
reconstruction is considered good if the result is similar to the original data. However, suppose there
is some abnormal shape in the data. In that case, it won't be possible to reconstruct the waveform
using the “normal” waveform library, which then results in a reconstruction error. This
indicates that there is some anomaly.</p>
      <p>The algorithm is divided into two parts: training and testing.</p>
    </sec>
    <sec id="sec-16">
      <title>4.4. Training</title>
      <p>Divide the waveform data into segments of, say, n samples each.</p>
      <p>Create an n-dimensional space with each of the segments considered as one point.</p>
      <p>Cluster the segment points and determine the centroids of the clusters.</p>
      <p>The cluster centroids provide a library of normal waveform shapes.</p>
      <sec id="sec-16-1">
        <title>4.4.1. Testing</title>
        <p>Using the cluster centroids from the training phase, we try to reconstruct the waveform data.</p>
        <p>An abnormal shape is detected if there is a large reconstruction error on any segment.</p>
      </sec>
    </sec>
    <sec id="sec-17">
      <title>4.5. Step by Step Algorithm</title>
      <sec id="sec-17-1">
        <title>Step 1: Import the data</title>
      </sec>
      <sec id="sec-17-2">
        <title>Step 2: Plot the data</title>
        <p>Let FN be the filename of the data set. Let DS be the variable that holds the data after reading it
using the ekg_data primitive function.</p>
        <p>Using the matplotlib.pyplot functions and a sample size of 300, plot the data. Label the x and y axes, then
show the plotted graph.</p>
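        <p>A sketch of Steps 1 and 2 in Python. The paper's ekg_data reader and its input file are not available here, so a synthetic EKG-like signal stands in for DS, and matplotlib's non-interactive backend with savefig replaces show() so the script runs headless.</p>

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend: no display needed
import matplotlib.pyplot as plt

# Stand-in for reading FN with the ekg_data helper (file format not given):
# synthesize a periodic, EKG-like waveform with a little noise.
t = np.arange(300)
DS = np.sin(2 * np.pi * t / 60) + 0.1 * np.random.default_rng(0).standard_normal(300)

plt.plot(t, DS[:300])          # plot the first 300 samples (Step 2)
plt.xlabel("Sample number")    # label the x axis
plt.ylabel("Signal value")     # label the y axis
plt.savefig("ekg_plot.png")    # save instead of show() for headless runs
```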
      </sec>
      <sec id="sec-17-3">
        <title>Step 3: Split data into segments</title>
        <p>Let seg_len be the segment length, with value 32, and slide_len the slide length, with value 2, and let an array named
segments store the results. Loop through the DS dataset to split it into waveform segments.</p>
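        <p>Step 3 can be sketched as a sliding-window loop. This is an illustrative NumPy version, not the authors' code; seg_len and slide_len match the values named above.</p>

```python
import numpy as np

def sliding_segments(data, seg_len=32, slide_len=2):
    """Split a 1-D waveform into overlapping segments of seg_len samples,
    advancing the window by slide_len samples each step (Step 3)."""
    return np.array([data[start:start + seg_len]
                     for start in range(0, len(data) - seg_len + 1, slide_len)])
```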
      </sec>
      <sec id="sec-17-4">
        <title>Step 4: Cluster the segments</title>
        <p>Perform clustering of the segments in a 32-dimensional space using the k-means algorithm
provided in Python's scikit-learn library.</p>
        <p>The cluster centroids created at this point then constitute the normal waveform shapes of the
library.</p>
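        <p>A minimal sketch of Step 4 using scikit-learn's KMeans, as the paper names that library; the function name and cluster count here are illustrative assumptions, not the authors' code.</p>

```python
import numpy as np
from sklearn.cluster import KMeans

def learn_library(segments, n_clusters, seed=0):
    """Cluster segments in 32-dimensional space; the cluster centroids
    form the library of normal waveform shapes (Step 4)."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    km.fit(segments)
    return km.cluster_centers_
```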
      </sec>
      <sec id="sec-17-5">
        <title>Step 5: Reconstruct the waveform</title>
        <p>We try to reconstruct the data or waveform being tested using the learned library. The
reconstruction process consists of four sub-steps:</p>
        <p>First, the data is split into overlapping segments.</p>
        <p>Secondly, we determine each segment's nearest cluster centroid, using the average of the samples in a
cluster.</p>
        <p>Using the centroid from the above step, we reconstruct the segment.</p>
        <p>Finally, we join all the segments to form the waveform.</p>
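        <p>The reconstruction sub-steps can be sketched as follows. This is a simplified illustration, not the authors' code: it uses non-overlapping windows and nearest-centroid matching by Euclidean distance, whereas the paper reconstructs from overlapping segments.</p>

```python
import numpy as np

def reconstruct(data, library, seg_len=32):
    """Step 5 (simplified): rebuild the waveform by replacing each
    window with its nearest centroid from the normal-shape library."""
    recon = np.array(data, dtype=float)
    for start in range(0, len(data) - seg_len + 1, seg_len):
        segment = data[start:start + seg_len]
        # nearest "normal" shape by Euclidean distance
        nearest = library[np.linalg.norm(library - segment, axis=1).argmin()]
        recon[start:start + seg_len] = nearest
    return recon
```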
      </sec>
      <sec id="sec-17-6">
        <title>Step 6: Conclude on the anomaly</title>
      </sec>
    </sec>
    <sec id="sec-18">
      <title>5. Result and Analysis</title>
      <p>We decide whether there is an anomaly or not based on the reconstruction error obtained. If there
is a very high reconstruction error, then we conclude that there is an anomaly.</p>
      <p>Using the EKG dataset, we performed clustering on the data and then tried to reconstruct it to find the
error between the original plotted data and the newly reconstructed data.</p>
      <p>The new shape formed after reconstruction of the data indicates a detected anomaly, since there is
a substantial visible error. The reconstruction error obtained was 23.8%.</p>
      <p>The figure below shows the original data's waveform, the reconstructed data, and the error
after reconstruction.</p>
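      <p>One way to quantify the reconstruction error is shown below. The paper reports 23.8% but does not state its exact formula, so this metric (mean absolute error relative to the signal's range) is an assumed, illustrative choice.</p>

```python
import numpy as np

def reconstruction_error(original, reconstructed):
    """Mean absolute reconstruction error as a percentage of the
    original signal's range (an assumed metric for illustration)."""
    span = float(original.max() - original.min())
    return 100.0 * float(np.mean(np.abs(original - reconstructed))) / span
```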
    </sec>
    <sec id="sec-19">
      <title>6. Conclusion</title>
      <p>Anomaly detection is essential in everyday life, and therefore it is vital to continuously expand on
the research.</p>
      <p>Every different kind of data contains a regular pattern that depicts normal behavior. When the
data are plotted to form a waveform, it can be hard to determine an anomaly
based on the waveform's instantaneous value; instead, the anomaly is determined by its shape.</p>
      <p>This paper shows how K-means clustering plays a key role in identifying normal waveform
shapes in time series data to form a trained library of waveforms. Based on the data's reconstruction
by the created library, we determine the shape of the waveform. The error
obtained signifies that there is an anomaly, since the shape of the waveform has not been seen before.</p>
    </sec>
    <sec id="sec-20">
      <title>7. Acknowledgements</title>
      <p>Special appreciation to the Department of Software Engineering, Delhi Technological University,
for supporting this work.</p>
    </sec>
    <sec id="sec-21">
      <title>8. References</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Jianliang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Haikun</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Ling</surname>
          </string-name>
          ,
          <article-title>"The Application of Intrusion Detection Based on K-means Cluster Algorithm,"</article-title>
          <source>2009 International Forum on Information Technology and Applications</source>
          , Chengdu, China,
          <year>2009</year>
          , pp.
          <fpage>150</fpage>
          -
          <lpage>152</lpage>
          , doi: 10.1109/IFITA.2009.34.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Yazdi Pusadan</surname>
          </string-name>
          , Jaka Lianto Buliali, Raden Venantius Hari Ginardi,
          <article-title>"Cluster Phenomenon to Determine Anomaly Detection on Flight Route,"</article-title>
          <source>The Fifth Information Systems International Conference</source>
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <article-title>"An Improved K-Means Using in Anomaly Detection,"</article-title>
          <source>2015 First International Conference on Computational Intelligence Theory, Systems and Applications (CCITSA)</source>
          , Ilan, Taiwan,
          <year>2015</year>
          , pp.
          <fpage>129</fpage>
          -
          <lpage>132</lpage>
          , doi: 10.1109/CCITSA.2015.11.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C. T.</given-names>
            <surname>Baviskar</surname>
          </string-name>
          and
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Patil</surname>
          </string-name>
          ,
          <article-title>"Improvement of data object's membership by using Fuzzy K-Means clustering approach,"</article-title>
          <source>2016 International Conference on Computation of Power, Energy Information and Communication (ICCPEIC)</source>
          , Melmaruvathur, India,
          <year>2016</year>
          , pp.
          <fpage>139</fpage>
          -
          <lpage>147</lpage>
          , doi: 10.1109/ICCPEIC.2016.7557238.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>I.P</surname>
          </string-name>
          and
          <string-name>
            <surname>G.DK</surname>
          </string-name>
          ,
          <article-title>"A Survey on different clustering algorithm in data mining technique"</article-title>
          <source>International Journal of Modern Engineering Research (IJMER)</source>
          , Vol.
          <volume>3</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>267</fpage>
          -
          <lpage>274</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ayramo</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Karkkainen</surname>
          </string-name>
          ,
          <article-title>"Introduction to partitioning based clustering methods with a robust example,"</article-title>
          <source>Reports of the Department of Mathematical Information Technology Series C. Software and Computational Engineering</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>E. G.</given-names>
            <surname>Mansoori</surname>
          </string-name>
          ,
          <article-title>"Frbc: a fuzzy rule-based clustering algorithm," Fuzzy Systems, IEEE Transactions on</article-title>
          , vol.
          <volume>19</volume>
          , no.
          <issue>5</issue>
          , pp.
          <fpage>960</fpage>
          -
          <lpage>971</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Han</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Kamber</surname>
          </string-name>
          ,
          <article-title>"Data mining: concepts and techniques,"</article-title>
          United States of America: Morgan Kaufmann Publishers,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>C. C.</given-names>
            <surname>Aggarwal</surname>
          </string-name>
          and
          <string-name>
            <given-names>C. K.</given-names>
            <surname>Reddy</surname>
          </string-name>
          , Eds.,
          <source>Data Clustering: Algorithms and Applications</source>
          . CRC Press,
          <year>2014</year>
          . [Online]. Available: http://www.charuaggarwal.net/clusterbook.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          and
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , “
          <article-title>An Anomaly Intrusion Detection Method Based on Improved K-Means of Cloud Computing,”</article-title>
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>