<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Web Benefit Utilizations with K-means Clustering Approach for Efficient Clustering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Priya B. Pandharbale</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sasmita Choudhury</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sachi Nandan Mohanty</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alok Kumar Jagadev</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science Engineering, MCKV Institute of Engineering</institution>
          ,
          <addr-line>Liluah, Howrah, West Bengal</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Computer Engineering, KIIT Deemed to be University</institution>
          ,
          <addr-line>Bhubaneswar, Odisha</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Vardhaman College of Engineering</institution>
          ,
          <addr-line>Hyderabad</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Clustering is the process of identifying similar groups in a dataset based on some characteristics of the data. This work uses the k-means clustering algorithm to find cluster formations over various parameters of a weblog dataset. The clusters are examined to identify the status responses generated while accessing web data as well as the most popular methods users employ to access the web. The work concentrates on finding the optimal value of k using the Elbow method, showing how the number of clusters forms as the value of k varies.</p>
      </abstract>
      <kwd-group>
        <kwd>k-Means</kwd>
        <kwd>clustering</kwd>
        <kwd>web service</kwd>
        <kwd>weblog</kwd>
        <kwd>access methods</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>Literature Survey</title>
      <p>
        Clustering is the process of identifying similar groups in a dataset based on some characteristics of the
data. Because no class information is needed, it is an unsupervised learning technique. It
has many applications, such as text clustering. Clustering algorithms are generally divided into two categories: hierarchical and
partitioning. Partitioning algorithms are suitable for clustering large datasets.
      </p>
      <p>
        The authors applied the k-means clustering method to corn crop data from the
last two years to produce feasibility information for each sub-district [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Crops are usually distributed according to the name of the corn-producing sub-district, so a grouping of potential
corn-producing regions is needed to know which areas produce large or small quantities of corn.
The paper in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] proposes a boundary-profile-based incremental clustering (BPIC) method to find arbitrarily
shaped clusters in dynamically growing datasets. The method represents the existing clustering
results with a collection of boundary profiles and discards the interior points of clusters instead of
keeping all the data.
      </p>
      <p>
        The work presented another clustering approach named CluStream [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. It has an online component that
periodically stores detailed summary statistics and an offline component that uses
these statistics. The online component is the statistical data collection part and the
offline component is the analytical part. CluStream can handle emerging and disappearing clusters
but cannot handle changing data items and their representation.
      </p>
      <p>
        The D-Stream clustering approach uses density-based techniques [
        <xref ref-type="bibr" rid="ref4 ref8">4,8</xref>
        ]. It has an online and an
offline component. The online component maps every input data item into a grid, and the
offline component computes the grid density. Dynamic changes in the data stream
are handled using a decaying technique. The approach also identifies the sporadic grids populated
by outliers. It can be used for clustering real-time stream data. The
advantages of this method are that it can efficiently form clusters in real time, can find clusters
of arbitrary shapes, and can accurately recognize the evolving behavior of real-time data
streams.
      </p>
      <p>
        The authors defined an entropy-based objective function for the initialization process, which they report
is superior to other existing initialization techniques for k-means clustering. They also designed an
algorithm to determine the correct number of clusters in a dataset using cluster validity indices
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        The algorithm Fair-Lloyd is a modification of Lloyd's heuristic for k-means that inherits its
simplicity, efficiency, and stability. Fair-Lloyd exhibits unbiased performance by ensuring that
all groups have equal costs in the output k-clustering, while incurring
a negligible increase in running time, which makes it a viable fair option wherever k-means
is currently used [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        A variant of k-means clustering called spherical k-means is used for document clustering [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. It partitions the
high-dimensional unit sphere by means of a collection of great hypercircles. The algorithm performs
a disjoint partitioning of the document vectors and, for each partition, computes a centroid using cosine
similarity. The normalized centroids are called 'concept vectors' and contain valuable semantic
information about the clusters. The main benefits of this algorithm are that it converges quickly and can deal with
the sparsity of text data. Moreover, it can be parallelized easily.
      </p>
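      <p>
        The partitioning and concept-vector steps described above can be illustrated with a minimal sketch. It is not the implementation from the cited source: the function names, the seeding with two of the input vectors, and the toy document vectors are all assumptions made for illustration.
      </p>
      <preformat><![CDATA[
```python
import math

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def spherical_kmeans(docs, seeds, iterations=10):
    """Cluster vectors by cosine similarity; returns (labels, concept_vectors)."""
    points = [normalize(d) for d in docs]
    concepts = [normalize(s) for s in seeds]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment: on unit vectors the dot product equals the cosine similarity.
        labels = [max(range(len(concepts)), key=lambda j: dot(p, concepts[j]))
                  for p in points]
        # Update: sum the members of each partition and re-normalize the result.
        for j in range(len(concepts)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                concepts[j] = normalize([sum(col) for col in zip(*members)])
    return labels, concepts

# Hypothetical term-count vectors for four short documents.
docs = [[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 1, 0]]
labels, concepts = spherical_kmeans(docs, seeds=[docs[0], docs[2]])
print(labels)
```
]]></preformat>
      <p>
        Normalizing both the documents and the centroids keeps every comparison on the unit sphere, which is why a plain dot product is enough in the assignment step.
      </p>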
      <p>
        This article attempts to develop a mathematical model for allocating tasks to processors to
achieve the optimal cost and optimal reliability of the system [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
The author has presented a survey of different clustering techniques [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Table 1 shows
the presented survey of different clustering algorithms, considering the parameters
category, clustering algorithms surveyed, and their time complexities. The author claims that k-means
gives better results for large data than SOM and hierarchical clustering algorithms.
Our previous works in the area of web services clustering help find better recommendations using
k-means clustering [
        <xref ref-type="bibr" rid="ref12 ref13 ref14 ref15">12-15</xref>
        ].
      </p>
      <p>
        The work presents efficient clustering techniques, namely k-means clustering, hierarchical
agglomerative clustering, and Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH),
for web service clustering [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
      <p>
        A k-means-type clustering algorithm, namely the Leader-Follower algorithm, is used here [
        <xref ref-type="bibr" rid="ref17 ref18">17,18</xref>
        ]. In
this approach, for a new item 'i', the nearest cluster center 'c' is identified. If the distance
between item 'i' and the cluster center is above the threshold, a new cluster is created.
Otherwise, the data item is added to the cluster represented by 'c'. This process is repeated until there
are no more data items.
      </p>
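      <p>
        The loop described above can be sketched as follows. This is a minimal illustration, not the cited authors' code: the Euclidean distance, the threshold value, the sample points, and the choice to move a center to the mean of its members after each addition are assumptions.
      </p>
      <preformat><![CDATA[
```python
import math

def leader_follower(items, threshold):
    """Assign each item to its nearest cluster center, or open a new cluster
    when the nearest center is farther away than `threshold`."""
    centers, members = [], []
    for item in items:
        nearest = None
        if centers:
            nearest = min(range(len(centers)),
                          key=lambda j: math.dist(item, centers[j]))
        if nearest is None or math.dist(item, centers[nearest]) > threshold:
            # Too far from every existing center: the item starts a new cluster.
            centers.append(list(item))
            members.append([item])
        else:
            members[nearest].append(item)
            # Assumption: the center follows the mean of its members.
            centers[nearest] = [sum(col) / len(members[nearest])
                                for col in zip(*members[nearest])]
    return centers, members

# Hypothetical 2-D items processed in arrival order.
items = [(0, 0), (0, 1), (5, 5), (5, 6), (0.5, 0.5)]
centers, members = leader_follower(items, threshold=2.0)
print(len(centers), [len(m) for m in members])
```
]]></preformat>
      <p>
        Unlike k-means, the number of clusters is not fixed in advance; it follows from the threshold and from the order in which the items arrive.
      </p>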
      <p>
        ICECPG, an incremental clustering approach using extended condensation points and a grid, is based
on fast density-based clustering techniques and targets incremental clustering of dynamic data [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The algorithm uses a heuristic search method to form
sub-clusters. A cluster is formed by uniting all the sub-clusters reachable from one another.
As new data arrive, they are assigned to existing clusters. The algorithm
captures the shape of the database through extended condensation points. Then, for
clustering the data items, it uses a grid-based and density-based clustering approach
that applies gradient-based climbing concepts. This method combines the benefits of density-based and
grid-based methods. It has linear time complexity and can be used for mining large datasets. It also reduces
I/O costs.
      </p>
      <p>
        A few applications of stream clustering are intrusion detection, environmental monitoring, e-business,
emergency response systems [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], website analysis, etc. In stream clustering, each distinct data item is
treated as a new input data item. Stream clustering approaches do not handle dynamic data
because they do not store the data. Incremental clustering does not handle the creation of new clusters or
the updating of a cluster for an item that changes over time. Both incremental and stream clustering approaches are
therefore less suitable for dynamic applications like the Web. In Web-based applications, the features of a data
item may change over time because of a change in the likes and dislikes of end users.
On the Web, handling emerging and disappearing clusters is also essential. To improve
the quality of Web-based applications, the clustering methods used should be able to handle
dynamic situations. A survey of various clustering algorithms and their
complexities is given in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>Methodology</title>
      <p>
        The weblog data is pre-processed. The dataset used here is available at
https://www.kaggle.com/shawon10/web-log-dataset. The work focuses on the step-by-step analysis of
the weblog data to find the clusters. The work uses k-means clustering [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] to create the initial
cluster formation CF1 using the user data U and the most frequently accessed URLs FA. The website
utilization parameters date D and status S are used to form CF1. The status parameter
holds the HyperText Transfer Protocol (HTTP) response status code. The 400 Bad Request
status code indicates that the server cannot or will not process the request because of something
perceived to be a client error (e.g., malformed request syntax, invalid request
message framing, or deceptive request routing).
      </p>
      <p>The HTTP 300 Multiple Choices redirect status response code indicates that the request has more than
one possible response. The user agent or the user should choose one of them. As there is no
standardized way of choosing one of the responses, this response code is rarely used. The HTTP 200
OK success status response code indicates that the request has succeeded. A 200 response is
cacheable by default.</p>
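      <p>
        Since the status parameter S is a numeric code, its first digit gives the standard HTTP response class. The helper below is a small sketch of that mapping; the function name is hypothetical and the helper is not part of the method described in this paper.
      </p>
      <preformat><![CDATA[
```python
# Standard HTTP response classes, keyed by the first digit of the status code.
STATUS_CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}

def status_class(code):
    """Return the response class of an HTTP status code, or 'unknown'."""
    return STATUS_CLASSES.get(code // 100, "unknown")

for code in (200, 300, 400):
    print(code, status_class(code))
```
]]></preformat>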
      <p>Algorithm 1:
K-Means Clustering: URL Analysis for Status Response Code
1. Input: N records from dataset S.
2. For each user U, find the most frequently accessed URLs FA.
3. Form clusters CF1 using the website utilization information date D and status S.
4. End</p>
      <p>Algorithm 2:
K-Means Clustering: User Web URL Access Method Analysis
1. Input: N records from dataset S.
2. For each user web URL WU, find the access method M.
3. Form clusters CF2 using FA and M.
4. End</p>
      <p>The clustering algorithm is then reapplied over the cluster formation CF1; the parameters for creating the new
clusters CF2 are the user web access method M and FA. Among the web URL access methods
M, GET and POST are the most popular. The GET method requests a
representation of the specified resource. Requests using GET should only retrieve data. The POST
method is used to submit an entity to the specified resource, often causing a
change of state or side effects on the server.</p>
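      <p>
        Step 2 of each algorithm can be sketched in a few lines. The sketch makes several assumptions that the text does not confirm: the sample records are invented, the URL field is assumed to hold a request line such as "GET /login.php HTTP/1.1", a user is identified by IP address, and the function names are hypothetical. The encoding GET = 0 and POST = 1 follows the labels used in the results.
      </p>
      <preformat><![CDATA[
```python
from collections import Counter, defaultdict

# Hypothetical records in the column order IP, Time, URL, Response Status.
records = [
    ("192.0.2.10", "29/Nov/2017:06:58:55", "GET /login.php HTTP/1.1", 200),
    ("192.0.2.10", "29/Nov/2017:06:59:02", "POST /process.php HTTP/1.1", 302),
    ("192.0.2.10", "29/Nov/2017:06:59:03", "GET /login.php HTTP/1.1", 200),
    ("192.0.2.20", "29/Nov/2017:07:00:10", "GET /home.php HTTP/1.1", 200),
]

METHOD_CODES = {"GET": 0, "POST": 1}  # numeric encoding of the access method M

def most_frequent_urls(rows):
    """Algorithm 1, step 2: the most frequently accessed URL (FA) per user."""
    per_user = defaultdict(Counter)
    for ip, _time, request, _status in rows:
        _method, path, *_rest = request.split()
        per_user[ip][path] += 1
    return {ip: counts.most_common(1)[0][0] for ip, counts in per_user.items()}

def access_methods(rows):
    """Algorithm 2, step 2: (URL, method code) for every GET or POST request."""
    pairs = []
    for _ip, _time, request, _status in rows:
        method, path, *_rest = request.split()
        if method in METHOD_CODES:
            pairs.append((path, METHOD_CODES[method]))
    return pairs

fa = most_frequent_urls(records)
pairs = access_methods(records)
print(fa)
print(pairs)
```
]]></preformat>
      <p>
        The resulting values are the kind of features that a k-means routine would then cluster in step 3.
      </p>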
    </sec>
    <sec id="sec-4">
      <title>Results and Discussion</title>
      <p>The dataset used here is available at https://www.kaggle.com/shawon10/web-log-dataset. The work
focuses on the step-by-step analysis of the weblog data to find clusters for the status response codes of
the web services and for the web URL access methods most used by the users. The dataset has 16008
rows and 4 columns: IP, Time, URL, and Response Status.</p>
      <p>Figure 4 shows the metrics used to calculate the mean values for creating the initial
clusters. As described in the methodology section, the web URLs are clustered using the status
response code as the criterion.</p>
      <p>According to figure 6, the analysis of the weblog data shows that, among the web URL access methods,
GET and POST are the most popular and are the ones customers mostly use to invoke
the URLs. On applying k-means clustering to the web URL access methods, the optimal value of
k is 2, so the most popular web URL access methods form two clusters.
Figure 7 shows the selection of k = 2 using the Elbow method, in which the
optimal value of k is read off at the elbow point of the graph.</p>
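      <p>
        The Elbow method can be sketched as follows: run k-means for increasing k, record the within-cluster sum of squares (WCSS), and look for the k after which the curve flattens. This is a minimal one-dimensional sketch under stated assumptions; the status values are invented, the deterministic seeding is a simplification, and it does not reproduce the figures or the k values reported here.
      </p>
      <preformat><![CDATA[
```python
def kmeans_1d(values, k, iterations=50):
    """Plain Lloyd's k-means on 1-D values; returns (centers, wcss)."""
    distinct = sorted(set(values))
    # Assumption: seed the centers evenly across the sorted distinct values.
    centers = [distinct[i * (len(distinct) - 1) // max(k - 1, 1)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(k), key=lambda j: (v - centers[j]) ** 2)
            clusters[nearest].append(v)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        new_centers = [sum(c) / len(c) if c else centers[j]
                       for j, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    wcss = sum(min((v - c) ** 2 for c in centers) for v in values)
    return centers, wcss

# Hypothetical response status codes taken from a weblog.
statuses = [200, 200, 200, 206, 302, 302, 304, 404, 404, 200]
curve = [(k, kmeans_1d(statuses, k)[1]) for k in range(1, 6)]
for k, wcss in curve:
    print(k, round(wcss, 1))
```
]]></preformat>
      <p>
        Plotting WCSS against k gives the elbow graph; the elbow is usually judged visually, which is why the sketch only prints the curve and does not pick k automatically.
      </p>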
      <p>If the server cannot find the page at the requested address, it either sends a 404 error
code (page not found) or redirects the visitor to the new URL if it is known.
Figure 9 shows an example of the clustering of the web URLs for the methods
GET (0) and POST (1).</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>In this work, we have discussed various clustering techniques used for analyzing
data and for easing access to huge datasets. We applied
k-means clustering to the weblog dataset to analyze and utilize it efficiently. The
algorithm uses various parameters of the weblog dataset to form clusters. The
Elbow method is then used to find the optimal value of k in k-means, that is, the number of clusters
formed for the given dataset parameters. The optimal value of k is 4 for the status response
codes, whereas k = 2 for the most popular web access methods,
GET and POST. In future work, we will use the various-width clustering algorithm to
calculate distances for finding the optimal value of k.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Aldino</surname>
            ,
            <given-names>A. A.</given-names>
          </string-name>
          , et al.
          <article-title>"Implementation of K-means algorithm for clustering corn planting feasibility area in South Lampung Regency</article-title>
          .
          <source>" Journal of Physics: Conference Series</source>
          . Vol.
          <volume>1751</volume>
          . No.
          <issue>1</issue>
          .
          <publisher-name>IOP Publishing</publisher-name>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Bao</surname>
          </string-name>
          ,
          <string-name>
            <surname>Junpeng</surname>
          </string-name>
          , et al.
          <article-title>"An incremental clustering method based on the boundary profile</article-title>
          .
          <source>" Plos one 13.4</source>
          (
          <year>2018</year>
          ):
          <fpage>e0196108</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Benabdellah</surname>
            ,
            <given-names>Abla</given-names>
          </string-name>
          <string-name>
            <surname>Chouni</surname>
            , Asmaa Benghabrit, and
            <given-names>Imane</given-names>
          </string-name>
          <string-name>
            <surname>Bouhaddou</surname>
          </string-name>
          .
          <article-title>"A survey of clustering algorithms for an industrial context</article-title>
          .
          <source>" Procedia computer science 148</source>
          (
          <year>2019</year>
          ):
          <fpage>291</fpage>
          -
          <lpage>302</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Zhuo</surname>
          </string-name>
          , Chen, Liu Xiang-shuang, and
          <string-name>
            <surname>Zhuang</surname>
          </string-name>
          Xiao-dong.
          <article-title>"A fast incremental clustering algorithm based on grid and density</article-title>
          .
          <source>" Third International Conference on Natural Computation (ICNC</source>
          <year>2007</year>
          ). Vol.
          <volume>5</volume>
          . IEEE,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Chowdhury</surname>
            , Kuntal,
            <given-names>Debasis</given-names>
          </string-name>
          <string-name>
            <surname>Chaudhuri</surname>
          </string-name>
          , and Arup Kumar Pal.
          <article-title>"An entropy-based initialization method of K-means clustering on the optimal number of clusters."</article-title>
          <source>Neural Computing and Applications</source>
          <volume>33</volume>
          .12 (
          <year>2021</year>
          ):
          <fpage>6965</fpage>
          -
          <lpage>6982</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Ghadiri</surname>
            , Mehrdad,
            <given-names>Samira</given-names>
          </string-name>
          <string-name>
            <surname>Samadi</surname>
            , and
            <given-names>Santosh</given-names>
          </string-name>
          <string-name>
            <surname>Vempala</surname>
          </string-name>
          .
          <article-title>"Socially fair k-means clustering</article-title>
          .
          <source>" Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency</source>
          .
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7] https://medium.com/analytics-vidhya/
          <article-title>comparative-study-of-the-clustering-algorithms54d1ed9ea732.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Khalilian</surname>
            , Madjid,
            <given-names>Norwati</given-names>
          </string-name>
          <string-name>
            <surname>Mustapha</surname>
            , and
            <given-names>Nasir</given-names>
          </string-name>
          <string-name>
            <surname>Sulaiman</surname>
          </string-name>
          .
          <article-title>"Data stream clustering by divide and conquer approach based on vector model</article-title>
          .
          <source>" Journal of Big Data 3.1</source>
          (
          <year>2016</year>
          ):
          <fpage>1</fpage>
          -
          <lpage>21</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Kumar</surname>
          </string-name>
          , Harendra, Nutan Kumari Chauhan, and Pradeep Kumar Yadav.
          <article-title>"A high performance model for task allocation in distributed computing system using k-means clustering technique." Research Anthology on Architectures, Frameworks, and Integration Strategies for Distributed and Cloud Computing</article-title>
          .
          <source>IGI Global</source>
          ,
          <year>2021</year>
          .
          <fpage>1244</fpage>
          -
          <lpage>1268</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <surname>Wei</surname>
          </string-name>
          , et al.
          <article-title>"Data Stream Clustering Algorithm for Smart Site</article-title>
          and
          <source>Its Implementation Based on Flink." 2019 IEEE Symposium Series on Computational Intelligence (SSCI)</source>
          . IEEE,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>MacQueen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <article-title>"Classification and analysis of multivariate observations." 5th Berkeley Symp</article-title>
          . Math. Statist. Probability.
          <year>1967</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>M. P. B. P. M. S. M. B. P.</surname>
          </string-name>
          <article-title>Semantic Search and Social-Semantic Search as Cooperative Approach</article-title>
          .
          <source>International Journal on Recent and Innovation Trends in Computing and Communication</source>
          ,
          <volume>5</volume>
          (
          <issue>1</issue>
          ),
          <fpage>110</fpage>
          -
          <lpage>114</lpage>
          . https://doi.org/10.17762/ijritcc.v5i1.98.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Pandharbale</surname>
            ,
            <given-names>Priya B.</given-names>
          </string-name>
          , Sachi Nandan Mohanty, and Alok Kumar Jagadev.
          <article-title>"Recent web service recommendation methods: A review." Materials Today: Proceedings (</article-title>
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Pandharbale</surname>
          </string-name>
          , Priya, Sachi Nandan Mohanty, and Alok Kumar Jagadev.
          <article-title>"Study of Recent Web Service Recommendation Methods</article-title>
          .
          <article-title>" 2020 2nd International Conference on Innovative Mechanisms for Industry Applications (ICIMIA)</article-title>
          . IEEE,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Pandharbale</surname>
            ,
            <given-names>Priya</given-names>
          </string-name>
          <string-name>
            <surname>Bhaskar</surname>
          </string-name>
          ,
          <source>Sachi Nandan Mohanty, and Alok Kumar Jagadev. "Novel Clustering-Based Web Service Recommendation Framework." International Journal of System Dynamics Applications (IJSDA) 11.5</source>
          (
          <year>2021</year>
          ):
          <fpage>1</fpage>
          -
          <lpage>15</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Parimalam</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          , and
          <string-name>
            <given-names>K. Meenakshi</given-names>
            <surname>Sundaram</surname>
          </string-name>
          .
          <article-title>"Efficient clustering techniques for web services clustering." 2017 ieee international conference on computational intelligence and computing research (iccic)</article-title>
          .
          <source>IEEE</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Reyes</surname>
            ,
            <given-names>Jaciel E.</given-names>
          </string-name>
          , et al.
          <article-title>"A Classification of Web Service Credibility Measures</article-title>
          .
          <article-title>" 2021 IEEE 45th Annual Computers, Software, and Applications Conference (COMPSAC)</article-title>
          . IEEE,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Sardar</surname>
            ,
            <given-names>Tanvir</given-names>
          </string-name>
          <string-name>
            <surname>Habib</surname>
            , and
            <given-names>Zahid</given-names>
          </string-name>
          <string-name>
            <surname>Ansari</surname>
          </string-name>
          .
          <article-title>"An analysis of distributed document clustering using MapReduce based K-means algorithm</article-title>
          .
          <source>" Journal of The Institution of Engineers (India): Series B 101.6</source>
          (
          <year>2020</year>
          ):
          <fpage>641</fpage>
          -
          <lpage>650</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Yeoh</surname>
            ,
            <given-names>Jia</given-names>
          </string-name>
          <string-name>
            <surname>Ming</surname>
          </string-name>
          , et al.
          <article-title>"A clustering system for dynamic data streams based on meta heuristic optimisation</article-title>
          .
          <source>" Mathematics 7</source>
          .
          <volume>12</volume>
          (
          <year>2019</year>
          ):
          <fpage>1229</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>