Web Benefit Utilizations with K-means Clustering
Approach for Efficient Clustering
Priya B. Pandharbale1, Sasmita Choudhury2, Sachi Nandan Mohanty3,
Alok Kumar Jagadev1

 1 School of Computer Engineering, KIIT Deemed to be University, Bhubaneswar, Orrisa, India.
2 Department of Computer science Engineering, Mckv Institute of Engineering, Liluah, Howrah, West Bengal,
  India.
3 Vardhaman College of Engineering, Hyderabad, India.


              Abstract
              Clustering is the process of identifying similar groups in a dataset based on some characteristics
              of the data. This work uses the k-means clustering algorithm for finding the numerous cluster
              formations of various parameters in the weblog dataset. The clusters are formed and are exam-
              ined for finding the various status responses generated while accessing the web data as well as
              the popular methods the users are using for accessing the web. The work concentrates on the
              optimal k value finding using the Elbow method showing the formation of the number of clus-
              ters as the value of k varies.

              Keywords: k-Means, clustering, web service, weblog, access methods


1.     Introduction
Clustering is essentially depicted as a division of information into bunches of identical articles. Each
cluster includes objects that are comparable among themselves and various checked out of various
packs. We should contemplate among various sorts of packs. The assessments under talk about are: k-
means clustering, distinctive leveled out gathering assessment, self-masterminding maps assessment,
and need expansion bundling computation. Assessment Metrics are selected like calculations, dataset
size, programming utilized, execution, precision, and nature of calculations [7].
In this work, we are advancing to zero in on the quality and the execution of the web information. The
irksome web URLs are the filling the job in off-track check of the data. The superfluous Data is causing
inconveniences in page arranging. The work centers on finding the practical reactions from the net
laborers by purging the immaterial information from the net log dataset. The k-means calculation could
be an extraordinarily normal assessment differentiated and the wide scope of different clustering as-
sessments to the extent the time complexities similarly as information preparing. The work focuses on
the formation of various clusters of web information depending on various parameters like web infor-
mation access date, the status of the web service, various access methods which can be utilized by the
maximum users, etc. The work finds the optimal value of k for the k-means clustering approach applied
for various web information parameters clusters.


ACI’22: Workshop on Advances in Computation Intelligence, its Concepts & Applications at ISIC 2022, May 17-19,
Savannah, United States
EMAIL: priyasathe123@gmail.com (A. 1)
           ©️ 2020 Copyright for this paper by its authors.
           Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
           CEUR Workshop Proceedings (CEUR-WS.org)


                                                                1
2.    Literature Survey
Clustering is the process of identifying similar groups in a dataset based on some characteristics of the
data. In clustering, no class information is needed. Hence it is an unsupervised learning technique. It
has many applications like text clustering. It is generally divided into two categories: hierarchical and
partitioning. Partitioned clustering algorithms are suitable for clustering large datasets.
The creators attempted to apply the k-Means bunching technique from the corn crop information of the
most recent 2 years to deliver achievability data from each sub-district [1]. The conveyance of harvests
is typically done dependent on the name of the corn-creating sub-district. A gathering of potential corn-
delivering locales is needed to know which regions produce huge or modest quantities of corn.
The paper proposes a boundary profile-based gradual grouping (BPIC) technique to find self-assertively
molded bunches with powerfully developing datasets [2]. This technique addresses the current bunching
results with an assortment of limit profiles and disposes of the internal places of groups as opposed to
keeping all information.

The work showed another social occasion approach named CluStream [3]. It had a web part that incon-
sistently put missing incorporate summary pieces of information and a disconnected piece that used
these assessments. The internet-based part was the quantifiable information assortment piece and the
disconnected part was the legitimate area. The CluStream can deal along arising and evaporating packs
anyway can't administer changing information things and their portrayal.

D-stream gathering approach used thickness-based systems [4,8]. This had an on the web and discon-
nected section. The web-based part maps every data information thing into a structure and a discon-
nected area which shapes the framework thickness. The exceptional changes of the information stream
were overseen using a rotting technique. It also perceived the inconsistent organizations organized
through the exclusions. It will in general be used for social event constant flow information. The ad-
vantages of this procedure are that it can productively make packs progressively, can track down lots
of emotional shapes, and can unequivocally perceive the creating sharpens of nonstop information
streams.

Authors have characterized an entropy-based objective capacity for the instatement interaction, which
is superior to other existing introduction techniques for k-implies grouping. Additionally planned a
calculation to ascertain the right number of bunches of datasets utilizing some group legitimacy records
[5].
The calculation uses Fair-Lloyd, a change of Lloyd's heuristic for k-implies, acquiring its straightfor-
wardness, proficiency, and solidness. Fair-Lloyd displays fair-minded execution by guaranteeing that
all gatherings have equivalent expenses in the result k-grouping, while at the same time bringing about
an irrelevant expansion in running time, accordingly making it a reasonable choice any place k-implies
is as of now utilized [6].
A variety of k-implies grouping called round k-means bunch for report bunching [7]. It partitioned the
tall dimensional unit circle through infers of social affair of great hyper circles. The estimation played
out a disjoint allocating of the document vectors, and, for each package, figured a centroid using cosine
resemblance. The standardized centroid was called 'idea vectors' which contain significant semantic
data around bunches. The most benefit of this computation is that it meets quickly and it can deal with
the sparsity of content data. Moreover, it tends to be parallelized quickly.

This article endeavors to foster a numerical model for designating the assignments to the processors to
accomplish the ideal expense and ideal unwavering quality of the framework [9].


                                                    2
The author has introduced the review on different grouping techniques in their work [10]. Table 1 shows
the introduced review for different grouping calculations by thinking about the boundaries classifica-
tion, bunching calculations overviewed, and their time intricacies. Creator guarantees that K-means
give a higher outcome for gigantic information than SOM and progressive grouping calculation.
Our previous works in the area of web services clustering help find better recommendations using k-
means clustering [12-15].

The work deals with effective bunching strategies, for example, K-implies grouping, Hierarchical ag-
glomerative bunching, and Balanced Iterative Reducing and Clustering utilizing Hierarchies (BIRCH)
bunching are presented for web administration bunching [16].

A K -means sort of clustering to be specific Pioneer Supporter calculation is utilized here [17,18]. In
this approach for an unused thing ‘i', a closest cluster middle 'c' is recognized. In the event that separates
between things 'I' and cluster middle is over the edge, at that point a modern cluster is made. Something
else the information thing is included to the cluster spoken to through 'c'. Rehash this handle until there
are no more information things.

ICECPG clustering using extended condensation point and grid clustering algorithm which was based
on fast density-based clustering techniques This algorithm used a heuristic search method to form sub-
clusters. A cluster is formed by uniting all the sub-clusters reachable from one another. A steady group-
ing utilizing expanded build-up point and lattice for continuous bunching of dynamic information ap-
proach [4]. As the new information showed up, it was appointed to existing groups. This calculation
catches the state of the information base through expanded build-up focuses. Then, at that point, for
bunching the information things, it utilized a network-based and thickness-based grouping approach
that utilizes slope-based climbing ideas. This strategy enjoys the benefit of thickness-based and matrix-
based strategies. It has straight time intricacy and can be utilized for mining huge datasets. It decreases
I/O costs.

A couple of utilization of stream bunching is interference affirmation, environment insights, E-business,
crisis counter structures [19], site assessment, etc. In-stream grouping each exceptional data thing is
considered as the advanced info data thing. Stream grouping approaches don't deal with lively data
since they don't store the data. Gradual grouping doesn't deal with the time of unused bunches and
updating a group for a thing that changes over time. Both gradual and stream grouping approaches are
less sensible for enthusiastic applications like the Web. In Web-based applications, features of a data
thing might modify quite a while since of an adjust inside the preferences and loathe of end clients.
Also on the net, dealing with creating and evaporating groups is furthermore indispensable. To gain
ground on the nature of electronic applications grouping strategies used should have the option to deal
with enthusiastic circumstances. The survey of various clustering algorithms for finding out the com-
plexities is discussed in [3].


3.    Methodology
The web log data is pre-processed. The data set used here is available at
https://www.kaggle.com/shawon10/web-log-dataset. The work focuses on the step-by-step analysis of
the weblog data to find the clusters. The work uses k-means clustering [11] for the creation of the initial
cluster’s formation CF1 using the User data U and the most frequently accessed URL's FA. The website
utilization information parameters like date D and status S are used to form CF1. The status parameter
used for the HyperText Transfer Protocol (HTTP) are identified as 400 is used to indicate a Bad Request


                                                      3
reaction status code it shows that the server can't or won't handle the solicitation because of something
saw to be a customer mistake (e.g., contorted solicitation language structure, invalid solicitation mes-
sage outlining, or beguiling solicitation directing).


Figure 1: Architecture Diagram for Clustering Data on Various Parameters in Weblog Dataset and Find-
ing Optimal Value of k.

The HTTP 300 Multiple Choices divert status reaction code shows that the solicitation has more than
one potential reaction. The client specialist or the client ought to pick one of them. As there is no nor-
malized method of picking one of the reactions, this reaction code is seldom utilized. The HTTP 200
OK accomplishment status response code shows that the sales have succeeded. A 200 response is cache-
able as is normally done.
Algorithm 1:
K-Means Clustering: URL Analysis for Status Response Code
1. Input: N number of records from dataset S.
2. For each user U finds the most frequently
   accessed URLs FA.
3. cluster formation, CF1 using website utilization information date D and status S.
4. End

Algorithm 2:
K-Means Clustering: User Web URL Access Method Analysis
1. Input: N number of records from dataset S.
2. for each user web URL WU find the access
   method M
3. cluster formation, CF2 using FA and M
4. End

Reapplying the bunching calculation over the cluster formation CF1 in the boundaries for making new
bunches CF2 is the client web access method M and the FA. Among the Web URL access techniques
M, the GET and Post strategies are the most famous techniques utilized. The GET system requests a
depiction of the predefined resource. Requesting using GET should simply recuperate data. The POST
strategy is used to introduce a substance to the foreordained resource, as often as possible causing a
change of state or accidental impacts on the server.


                                                    4
4.    Results and Discussion
The data set used here is available at https://www.kaggle.com/shawon10/web-log-dataset.The work fo-
cuses on the step-by-step analysis of the weblog data to find the clusters for the status response code of
the web services and the web URL access methods are mostly used by the users. This dataset has 16008
rows and 4 columns. Columns are IP, Time, URL, Response Status.


Figure 2: An Example of the Information Extraction for the Status Response Code of the Web Services

Figure 2 shows the information extraction for the status response code of the web services from the
weblog dataset.
Figure 3 shows the plot of the web URLs frequently utilized by many customers. It is observed that the
customers like to visit some web URLs frequently making them their favorite websites based on the
frequency of accessing the URL.


                                           URLS


Figure 3: Frequently Accessed Web URLs


                                                    5
In figure 4 we can find the metrics for the calculation of the mean values for the creation of the initial
clusters. As depicted in the methodology section the web URLs are clustered using the criteria status
response code.


Figure 4: Calculation of the Mean Values for the Creation of the Initial Clusters

Figure 5 shows the optimal value for k here is 4. Hence, we can observe the four clusters are formed
for the status response code for various status responses.


Figure 5: URL Analysis for Status Response Code

According to figure 6, the analysis of weblog data shows that among the Web URL access techniques
the GET and Post strategies are the most famous techniques utilized by the customers. The access meth-
ods popular amongst all the other access methods are GET and POST.


                                                    6
Figure 6: Analysis of the Web Access Methods

From figure 6 it is observed that these methods are mostly used by the customers for the invocation of
the URLs. On applying the k-means clustering for the web URL access methods the optimal value for
k=2. The clusters formed for the most popular Web URL access methods have two clusters.
Figure 7 shows the selection of the value of k as 2 using the Elbow method it is very easy to predict the
optimal value of k at an elbow point in the graph.


Figure 7: Finding the Optimal k Value

Figure 8 shows the clustering of the Web URL access methods The web server processes the data and
communicates a HTTP status code. Should the solicitation find success, the server sends an information
bundle to the internet browser with all the data expected for the page.


                                                    7
     Figure 8: User Web URL Access Method Analysis


Figure 9: An Example of Clustering Using the Web URL Access Methods.

In the event that the server can't observe the page at the mentioned address, it either sends a 404-blunder
code (site page not found) or sends the guest to the new URL through divert assuming it's known. In
figure 9 the example for the clustering of the web URLs is shown for the cluster formation for methods
GET (0) and POST (1).


5.      Conclusion
In this work, we have discussed various clustering techniques used efficiently for the analysis of the
data and removing the barriers to accessing the huge datasets. Moreover, this work helps to elaborate
k-Means clustering over the weblog dataset to analyze and utilize the weblog dataset efficiently. The
algorithm utilizes various parameters of the weblog dataset for the formation of various clusters. The
Elbow method is then used to find the optimal value of the k in k-means to predict the number of clusters
formed for the given dataset parameters. The optimal value of k is 4 for the status response code for
various status responses. Whereas the value of k=2 for the most popular methods to access the web that
is GET and POST. For the future work we will be using the various width clustering algorithm for the
calculation of the distance for finding the optimal value of k.


                                                     8
References
[1] Aldino, A. A., et al. "Implementation of K-means algorithm for clustering corn planting feasibility
     area in south lampung regency." Journal of Physics: Conference Series. Vol. 1751. No. 1. IOP
     Publishing, 2021.
[2] Bao, Junpeng, et al. "An incremental clustering method based on the boundary profile." Plos
     one 13.4 (2018): e0196108.
[3] Benabdellah, Abla Chouni, Asmaa Benghabrit, and Imane Bouhaddou. "A survey of clustering
     algorithms for an industrial context." Procedia computer science 148 (2019): 291-302.
[4] Zhuo, Chen, Liu Xiang-shuang, and Zhuang Xiao-dong. "A fast incremental clustering algorithm
     based on grid and density." Third International Conference on Natural Computation (ICNC 2007).
     Vol. 5. IEEE, 2007.
[5] Chowdhury, Kuntal, Debasis Chaudhuri, and Arup Kumar Pal. "An entropy-based initialization
     method of K-means clustering on the optimal number of clusters." Neural Computing and Appli-
     cations 33.12 (2021): 6965-6982.
[6] Ghadiri, Mehrdad, Samira Samadi, and Santosh Vempala. "Socially fair k-means clustering." Pro-
     ceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. 2021.
[7] https://medium.com/analytics-vidhya/comparative-study-of-the-clustering-algorithms-
     54d1ed9ea732.
[8] Khalilian, Madjid, Norwati Mustapha, and Nasir Sulaiman. "Data stream clustering by divide and
     conquer approach based on vector model." Journal of Big Data 3.1 (2016): 1-21.
[9] Kumar, Harendra, Nutan Kumari Chauhan, and Pradeep Kumar Yadav. "A high performance
     model for task allocation in distributed computing system using k-means clustering tech-
     nique." Research Anthology on Architectures, Frameworks, and Integration Strategies for Distrib-
     uted and Cloud Computing. IGI Global, 2021. 1244-1268.
[10] Li, Wei, et al. "Data Stream Clustering Algorithm for Smart Site and Its Implementation Based on
     Flink." 2019 IEEE Symposium Series on Computational Intelligence (SSCI). IEEE, 2019.
[11] MacQueen, J. "Classification and analysis of multivariate observations." 5th Berkeley Symp.
     Math. Statist. Probability. 1967.
[12] M. P. B. P. M. S. M. B. P. Semantic Search and Social-Semantic Search as Cooperative Ap-
     proach. International Journal on Recent and Innovation Trends in Computing and Communica-
     tion, 5(1), 110 - 114. https://doi.org/10.17762/ijritcc.v5i1.98.
[13] Pandharbale, Priya B., Sachi Nandan Mohanty, and Alok Kumar Jagadev. "Recent web service
     recommendation methods: A review." Materials Today: Proceedings (2021).
[14] Pandharbale, Priya, Sachi Nandan Mohanty, and Alok Kumar Jagadev. "Study of Recent Web
     Service Recommendation Methods." 2020 2nd International Conference on Innovative Mecha-
     nisms for Industry Applications (ICIMIA). IEEE, 2020.
[15] Pandharbale, Priya Bhaskar, Sachi Nandan Mohanty, and Alok Kumar Jagadev. "Novel Cluster-
     ing-Based Web Service Recommendation Framework." International Journal of System Dynamics
     Applications (IJSDA) 11.5 (2021): 1-15.
[16] Parimalam, T., and K. Meenakshi Sundaram. "Efficient clustering techniques for web services
     clustering." 2017 ieee international conference on computational intelligence and computing re-
     search (iccic). IEEE, 2017.
[17] Reyes, Jaciel E., et al. "A Classification of Web Service Credibility Measures." 2021 IEEE 45th
     Annual Computers, Software, and Applications Conference (COMPSAC). IEEE, 2021.
[18] Sardar, Tanvir Habib, and Zahid Ansari. "An analysis of distributed document clustering using
     MapReduce based K-means algorithm." Journal of The Institution of Engineers (India): Series
     B 101.6 (2020): 641-650.
[19] Yeoh, Jia Ming, et al. "A clustering system for dynamic data streams based on meta heuristic op-
     timisation." Mathematics 7.12 (2019): 1229.


                                                   9