KCRS: KClustering Recommender System for Component Configuration

Maurizio Monticelli1, Ricardo Anibal Matamoros Aragon2[0000-0002-1957-2530], Francesco Epifania2, Luca Marconi2[0000-0002-0236-6159], and Antonio De Simone2[0000-0001-5217-7586]

1 Nathan Instruments srl, Italy
monticelli@nathan.it
2 Social Things SRL, Italy
{ricardo.matamoros,francesco.epifania,luca.marconi,antonio.desimone}@socialthingum.com

Abstract. In recent years, the exponential growth in the number of items and products handled by e-commerce sites has led to the introduction of intelligent systems aimed at supporting users during the decision-making process. Choosing a product among thousands of items is complicated for consumers, and recommender systems (RS) emerged in response to this problem. These systems are a family of algorithms based on the concept of information filtering, and they make it possible to reduce the cognitive effort required of users. In this paper we present a model-based RS, belonging to the collaborative filtering (CF) category, for the e-commerce website of the company Nathan Instruments (NI). The main objective of this paper is to provide an intelligent approach for recommending configurations of hardware components for computers. This configurator uses clustering algorithms to address the problems associated with small dataset sizes. Finally, the experimentation and conclusion sections report how the proposed model simplifies the decision process related to the required computer customisation in terms of hardware and software components.

Keywords: Recommender Systems · Clustering · E-commerce · Machine Learning · Software Components.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

This article presents the intelligent model created by the NI team, which optimises the process of recommending computer configurations. This RS can be used both within NI's e-commerce site and by the company's own technicians who assist customers in ordering and purchasing products in physical shops [1]. Introducing the RS into NI's sales process optimises the customisation of the required computer in terms of hardware, software and various options [18]. At the same time, it reduces the time and cost of producing the offers themselves [8], which were initially built from a catalogue offer and managed by human staff who had to take into account the compatibility of the available base components.

The sales process without RS support was therefore complex: given the large number of hardware components, the cardinality of possible configurations grew exponentially and required the vendor to memorise the compatibility and budget constraints specified by the customer (PCs intended for office automation rather than for processing large amounts of data, preference for a particular operating system, etc.). The introduction of the RS [6], on the other hand, automates the collection of the requirements and technical specifications that the final configuration must satisfy, and it is the customer who enters this information in a simple and intuitive way [5].

The following sections describe state-of-the-art methodologies in the field of artificial intelligence (AI) [9,10], such as machine learning (ML) [7], used to continuously update knowledge on the technical specifications of the various components.
In particular, clustering methodologies are presented, which make it possible to group similar configurations and components homogeneously and thus to suggest those that are closest to the user's requirements. The storage of the knowledge about the technical specifications of the hardware and software products in the catalogue, and its processing, was carried out through cloud computing (CC). The article also reports on the experimentation phase carried out to identify the best type of clustering for the case study described, as well as the results and conclusions related to the performance of the RS.

2 Clustering Methodologies

For the realisation of the RS it was decided to analyse several clustering methodologies in order to identify similar products on the basis of predefined characteristics, the aim being to make recommendations that are efficient and compatible with the user's requirements [12]. Two different methodologies were adopted for the subdivision of the products:

– clustering on the hardware components of the products within each user category, so as to search for different types of objects within the same user type;
– double clustering on the entire range of available products, with the aim of identifying substantial differences between the various objects and classifying them on the basis of the components of each.

Both methods use the k-means algorithm, based on vector distance. Although this methodology is the one applied by default, it was decided to study and implement other types of clustering through which similar solutions can be reached. The different methodologies analysed are described below.

2.1 K-Means

Let \(C_1, \dots, C_K\) denote a set of clusters containing the observations \(1, \dots, n\); such a set must satisfy the following properties [2]:

– \(C_1 \cup C_2 \cup \dots \cup C_K = \{1, \dots, n\}\)
– \(C_k \cap C_{k'} = \emptyset\), with \(k \neq k'\)

In K-means a good grouping of observations is one for which the intra-cluster variation \(W(C_k)\) is minimal, i.e. the following problem must be solved [13]:

\[
\min_{C_1,\dots,C_K} \sum_{k=1}^{K} W(C_k) \tag{1}
\]

where, using the squared Euclidean distance, we define

\[
W(C_k) = \frac{1}{|C_k|} \sum_{i,i' \in C_k} \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 \tag{2}
\]

where \(|C_k|\) is the number of observations in cluster \(k\). The algorithm performs the following steps (a minimal implementation is sketched after this list):

1. A value from 1 to K is randomly assigned to each observation;
2. The centroid of each cluster is computed: the centroid of the k-th cluster is the vector of dimension p containing the means of the variables for the observations in the k-th cluster;
3. Each observation is assigned to the cluster whose centroid is closest, proximity being measured by the Euclidean distance;
4. Steps 2 and 3 are repeated until convergence is achieved.
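To make the procedure concrete, the following minimal sketch implements the four steps above. It is ours, not the KCRS production code; the paper does not specify an implementation language, so Python with NumPy is assumed.

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Naive K-means following the four steps described above."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Step 1: randomly assign each observation to one of the k clusters.
    labels = rng.integers(0, k, size=n)
    for _ in range(max_iter):
        # Step 2: the centroid of cluster j is the mean of its observations;
        # an emptied cluster is re-seeded with a random observation.
        centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j)
            else X[rng.integers(n)]
            for j in range(k)
        ])
        # Step 3: reassign each observation to the nearest centroid
        # (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: repeat steps 2-3 until the assignments stop changing.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centroids
```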
2.2 Hierarchical Clustering

In this type of method, the data are associated with a tree structure in such a way that the leaves of the tree correspond to observations and the nodes to subsets of observations. The very nature of the tree introduces a hierarchy into the subsets associated with the branches. There are two broad families of hierarchical methods:

1. Agglomerative methods (AGNES): starting from an initial state of n groups, in which each observation is a separate group, successive mergers of the most similar groups are performed until k = 1, i.e. all observations belong to the same group;
2. Divisive methods (DIANA): starting from an initial state with k = 1 groups, i.e. a single group, successive subdivisions are performed until n groups are reached [4].

Both procedures operate on a dissimilarity matrix. The main feature of this type of method is that once two groups have been merged they will never be separated later and, similarly [11], once two groups have been separated they will never again belong to the same cluster. Furthermore, when applying an algorithm of this type, the same tree is used for all values of k, each time referring to a different level of the tree. It is therefore a rigid structure.

2.3 Mini-Batch K-Means

In the case of large amounts of data, the computational complexity of the K-means algorithm is known to increase considerably. Consequently, it was decided to study a modification of K-means clustering in order to reduce its computational cost. Mini-batch K-means is a variant of K-means in which the amount of data considered at each iteration is limited: standard K-means operates on the entire dataset at once, while mini-batch K-means operates on a small, randomly sampled subset of it [17]. The mini-batches are sampled at random from the entire dataset, and at each new iteration a new random sample is selected and used to update the positions of the centroids.

In mini-batch K-means, clusters are updated using a combination of the mini-batch values and a learning rate. The learning rate decreases over the iterations and is the inverse of the number of elements assigned to a specific cluster. The effect of reducing the learning rate is that the impact of new elements decreases; convergence is reached when, after several iterations, there are no further changes in the clusters. Studies on the effectiveness of mini-batch clustering suggest that it can reduce computation time with only a slight trade-off in the quality of the final clusters.

3 Kcluster Recommender System (KCRS)

The Kcluster algorithm is able to recommend to a user, whether new or not, products that, based on their past purchases or on their similarity with other users, can satisfy their needs. The proposed algorithm therefore addresses the well-known cold-start problem, which occurs when a new user registers with the system and has not yet provided any interactions, so that personalised recommendations could not otherwise be offered with so little data [15].

The KCRS model is mainly based on similarities between products. In particular, using data simulated according to precise patterns of computer configurations, as explained in Section 4, two products can be recommended:

1. the first is extracted on the basis of the similarities that a user (belonging to a specific consumer category) has with other members of the same category;
2. the second is obtained by a double subdivision of the products, again based on the similarity between them, so that it is as similar as possible to the products previously purchased by the customer making the request. This second recommendation is not applicable when the customer is new and has never purchased products from NI; in this specific case, both extracted products come from the first search methodology.

All recommendations also respect certain constraints, such as lower price, stock availability and year of production of the product. The Kcluster algorithm allows the preferred clustering method to be chosen from the three implemented: K-means, hierarchical and mini-batch K-means. Naturally, a default method is set, namely K-means [16].
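The sketch below illustrates how the second recommendation pathway, together with the method selection, could look. The function name, feature columns and constraint handling are our assumptions for illustration, built on scikit-learn and pandas; this is not NI's actual implementation.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans, MiniBatchKMeans

# The three implemented methods; K-means is the default, as in KCRS.
CLUSTERERS = {
    "kmeans": lambda k: KMeans(n_clusters=k, n_init=10, random_state=0),
    "hierarchical": lambda k: AgglomerativeClustering(n_clusters=k),
    "minibatch": lambda k: MiniBatchKMeans(n_clusters=k, random_state=0),
}

def recommend(products, purchased_idx, k=5, method="kmeans"):
    """Suggest in-stock, not-yet-bought products drawn from the cluster(s)
    of the user's past purchases.

    products: pandas DataFrame with numeric columns cpu_tier, ram, storage,
              plus price and availability (hypothetical schema).
    purchased_idx: positional indices of previously bought products.
    """
    X = products[["cpu_tier", "ram", "storage"]].to_numpy(dtype=float)
    labels = CLUSTERERS[method](k).fit_predict(X)
    target = set(labels[purchased_idx])  # clusters of past purchases
    mask = (
        np.isin(labels, list(target))
        & (products["availability"] > 0).to_numpy()  # stock constraint
        & ~products.index.isin(purchased_idx)        # exclude past buys
    )
    # Prefer cheaper items, mirroring the lower-price constraint.
    return products[mask].sort_values("price").head(2)
```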
4 NI Simulated Dataset

Initially, the available data were numerically insufficient for the application of the KCRS algorithm. In particular, there was no identifier linking the different data sets, i.e. the price list and the billing system. As a result, there were not enough data to validate the recommendations against products actually sold. To deal with this lack of data, it was decided to simulate data that followed as closely as possible the real data in the price list and sales system.

In the domain of computer sales, the relevant data relate to hardware components, and it was therefore decided to simulate only three main components, namely:

– Processor
– RAM
– Storage

So that the algorithm can later be applied to real data, the following variables were also simulated:

– User id
– Product id
– Selling price
– Availability
– Date of sale
– Year of production

The statistical process of simulating the dataset is based on the knowledge of the development team and on a survey carried out with Google Forms on a sample of 100 individuals, through which the probability of the presence of a given component within a specific computer was defined. For this reason, the simulation is based on different configuration patterns, each with a different extraction probability; a sketch of this sampling scheme is given below.
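Since the actual patterns and survey-derived probabilities are not published, the following sketch shows one plausible way such a pattern-based simulation could be set up. All patterns, probabilities and the pricing rule here are invented placeholders.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical configuration patterns: (processor, RAM GB, storage GB,
# extraction probability). The real probabilities came from the survey.
PATTERNS = [
    ("i3", 4, 256, 0.25),    # entry-level office machine
    ("i5", 8, 512, 0.35),    # mid-range
    ("i7", 16, 1024, 0.25),  # high-end
    ("i9", 32, 2048, 0.15),  # workstation
]

def simulate_sales(n=1000):
    probs = [p for *_, p in PATTERNS]
    picks = rng.choice(len(PATTERNS), size=n, p=probs)
    rows = []
    for product_id, pick in enumerate(picks):
        cpu, ram, storage, _ = PATTERNS[pick]
        rows.append({
            "user_id": int(rng.integers(1, 200)),
            "product_id": product_id,
            "processor": cpu,
            "ram": ram,
            "storage": storage,
            # Invented pricing rule, only to produce plausible numbers.
            "selling_price": round(float(rng.normal(400 + 30 * ram, 100)), 2),
            "availability": int(rng.integers(0, 3)),
            "date_of_sale": pd.Timestamp("2019-01-01")
                            + pd.Timedelta(days=int(rng.integers(0, 365))),
            "year_of_production": int(rng.choice([2018, 2019])),
        })
    return pd.DataFrame(rows)
```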
5 Evaluation of KCRS

Since these are unsupervised algorithms, evaluation is not as immediate as for supervised methods, for which the accuracy of an algorithm is given by the difference between the prediction and the observed value. The choice of the optimal k (\(k^*\)) is a fundamental step to obtain the best product partitioning in both of the methodologies followed; in particular, the objective is to maximise the separation between clusters (BSS) and to minimise the dispersion within clusters (WSS). To do so, two evaluation metrics were implemented, through which the optimal \(k^*\) can be chosen [3].

5.1 Silhouette

Each cluster is represented by a silhouette, which indicates how well each point belongs to its reference cluster: the closer the value is to 1, the better the point has been classified, and vice versa the further the value is from unity. The metric is computed through the following steps:

1. consider each observation i in the dataset, and let A be the cluster of i;
2. compute the average distance \(a_i\) of i to all other observations in A:
\[
a_i = \frac{1}{N_A} \sum_{j \in A,\, j \neq i} d(i, j) \tag{3}
\]
3. for each cluster C different from A, define the average distance d(i, C) of i to the observations in C:
\[
d(i, C) = \frac{1}{N_C} \sum_{c \in C} d(i, c) \tag{4}
\]
4. compute d(i, C) for all \(C \neq A\), and select \(b_i = \min_{C \neq A} d(i, C)\).

The silhouette of i is defined as:
\[
\mathrm{Sil}_i = \frac{b_i - a_i}{\max(a_i, b_i)} \tag{5}
\]

The variant used in the current project is the average silhouette:
\[
\mathrm{sil}_{av} = \frac{\sum_i \mathrm{sil}_i}{N} \tag{6}
\]

The number of clusters k can then be determined by choosing the value of k that yields the highest average silhouette value.

5.2 Calinski-Harabasz Index

For any number of clusters \(k \geq 2\), the Calinski-Harabasz index is defined as:
\[
ch(k) = \frac{\mathrm{tr}(B_k)/(k - 1)}{\mathrm{tr}(W_k)/(n - k)} \tag{7}
\]

where n and k are the total number of observations and the number of clusters, respectively, \(\mathrm{tr}(B_k)\) is the trace of the between-group dispersion matrix, and \(\mathrm{tr}(W_k)\) is the trace of the within-group dispersion matrix. The optimal number of clusters is obtained for the k at which there are large dissimilarities between clusters and large similarities within clusters: the solution is therefore the value of k that maximises ch(k).

Both metrics give similar results in terms of \(k^*\), so it was decided to use only one evaluation measure: the average silhouette is applied as the default metric, since it is the one with the highest level of explainability [14].
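Both criteria are available off the shelf, e.g. in scikit-learn (our choice of library; the paper does not name its tooling), so the search for \(k^*\) can be sketched as follows:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score, silhouette_score

def choose_k(X, k_range=range(2, 11)):
    """Pick the number of clusters that maximises the average silhouette;
    the Calinski-Harabasz index is computed alongside for comparison."""
    best_k, best_sil = None, -1.0
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        sil = silhouette_score(X, labels)        # average silhouette, Eq. (6)
        ch = calinski_harabasz_score(X, labels)  # Eq. (7)
        print(f"k={k}: silhouette={sil:.3f}, CH={ch:.1f}")
        if sil > best_sil:
            best_k, best_sil = k, sil
    return best_k
```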
6 Experimentation and Results

This section reports the results obtained with the KCRS algorithm. In the first experiment we simulate a given user who performs a search through their user id, taking into account purchases made at the same retailer. The algorithm outputs a summary of past purchases ("Previous Purchases") and then lists the two recommendations ("Recommendations"), as shown in Table 1:

– the first recommendation refers to the product searched for within the same user category;
– the second recommendation refers to the product most similar to those previously bought by the user.

The two recommended products (Table 1) in this case are very similar to those previously purchased by the user, which means that the KCRS algorithm recommends correctly, as it captures the similarity between purchased and recommended products.

In the second example (Table 2), the user enters both their id x, belonging to user category y, and a user category z within which the search is carried out. As can be seen from the table, this case is similar to the first, the difference being that the recommendation no longer refers to past purchases but to the user category entered.

In the last example (Table 3), where only the user category y is entered, a single recommendation is returned. This is because, without a user id, the only recommendation that can be proposed relies on similarity between users, i.e. on products sold within the same user category. Consequently, the recommendation does not take into account products that have never been sold, for the simple reason that a product is only assigned to a user category once it has been purchased.

These three short tables obviously cannot be considered a systematic verification of the algorithm; they are intended to briefly summarise the functionalities and recommendations of the KCRS algorithm which, as the results just shown indicate, works and meets the initial needs of the customer.

Table 1. Recommendation by user id.

Previous Purchases
         Id    Processor  RAM (GB)  Storage (GB)  Price      Quant.  Year
item 1   5447  i7         32        3072          $1.249,57  1       2019
item 2   9861  R5         8         1024          $938,66    0       2019

Recommendations
         Id    Processor  RAM (GB)  Storage (GB)  Price      Quant.  Year
item 1   9794  i5         4         1024          $704,27    1       2018
item 2   98    i5         4         256           $948,23    1       2019

Table 2. Recommendation by user id and user category.

Previous Purchases
         Id    Processor  RAM (GB)  Storage (GB)  Price      Quant.  Year
item 1   3033  i7         16        3072          $1.424,44  1       2019
item 2   4821  R9         16        2048          $2.372,50  1       2019

Recommendations
         Id    Processor  RAM (GB)  Storage (GB)  Price      Quant.  Year
item 1   3528  R3         8         256           $456,24    1       2018
item 2   94    i7         16        2048          $1.407,21  1       2019

Table 3. Recommendation by user category.

Previous Purchases
         Id    Processor  RAM (GB)  Storage (GB)  Price      Quant.  Year
-        -     -          -         -             -          -       -

Recommendations
         Id    Processor  RAM (GB)  Storage (GB)  Price      Quant.  Year
item 1   5222  i9         64        2048          $1.837,73  1       2019

7 Conclusions

The work described in this paper provides an interesting case of the use of clustering algorithms within an RS, in particular an RS combining both content-based and CF approaches. Starting from a dataset simulated for the domain of e-commerce sites, in particular in the electronics and IT category, the KCRS algorithm makes it possible, within catalogues containing considerable quantities of products, to make the recommendations most suited to the customer's needs.

The decision to use simulated data stems from the fact that NI's billing system cannot yet be linked with the price lists of computer suppliers. The simulated data have the sole purpose of testing and validating the developed algorithm, so that KCRS can then be applied to real data in the future. In particular, using K-means, hierarchical and mini-batch clustering together with the cluster-separation evaluation metrics, the silhouette and the Calinski-Harabasz index, it was concluded that the clusters found within the datasets are sufficiently separated to make personalised recommendations.

Some improvements and future developments consist in analysing further characteristics of the users, e.g. the metadata associated with their profiles. A user is described not only by job category, but can be categorised in other ways within the platform, and recommendations could consequently achieve better accuracy. In addition, a test phase is planned in which the Precision@K, Recall@K and F-measure@K metrics will be applied, allowing a better assessment of the model's performance on its recommendations.
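As a pointer to that planned test phase, these top-K metrics can be computed per user as in the following sketch; the definitions are standard, while the function name and signature are ours.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision@K, Recall@K and F-measure@K for one user.

    recommended: ranked list of recommended product ids
    relevant:    set of product ids the user actually bought
    """
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure
```

For example, with recommended = [98, 9794, 5222] and a (hypothetical) relevant set {98, 5222}, precision_recall_at_k(recommended, relevant, k=2) counts one hit among the top two items and yields precision 0.5, recall 0.5 and F-measure 0.5.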
References

1. Thrun, S., Burgard, W., Fox, D.: Probabilistic Robotics. MIT Press (2005)
2. MacQueen, J.B.: Some methods for classification and analysis of multivariate observations. In: Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability. University of California Press, Berkeley (1967)
3. Pfitzner, D., Leibbrandt, R., Powers, D.M.W.: Characterization and evaluation of similarity measures for pairs of clusterings. Knowledge and Information Systems (2009)
4. Kaufman, L., Rousseeuw, P.J.: Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York (1990)
5. Ortega, F., González-Prieto, Á.: Recommender Systems and Collaborative Filtering (2020)
6. Ricci, F., Rokach, L., Shapira, B.: Introduction to recommender systems handbook. In: Recommender Systems Handbook. Springer (2011)
7. Mitchell, T.M.: Machine Learning, 1st edn. McGraw-Hill, USA (1997)
8. Scholz, M., Dorner, V., Schryen, G., Benlian, A.: A configuration-based recommender system for supporting e-commerce decisions. European Journal of Operational Research 259(1) (2017)
9. Melville, P., Sindhwani, V.: Recommender systems. In: Encyclopedia of Machine Learning, vol. 1, pp. 829–838 (2010)
10. Zhao, Q., Harper, F.M., Adomavicius, G., Konstan, J.A.: Explicit or implicit feedback? Engagement or satisfaction? A field experiment on machine-learning-based recommender systems. In: Proceedings of the 33rd Annual ACM Symposium on Applied Computing, pp. 1331–1340 (2018)
11. Fowlkes, E.B., Mallows, C.L.: A method for comparing two hierarchical clusterings. Journal of the American Statistical Association 78(383), 553–569 (1983)
12. Davies, D.L., Bouldin, D.W.: A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell. 2, 224–227 (1979)
13. Likas, A., Vlassis, N., Verbeek, J.J.: The global k-means clustering algorithm. Pattern Recognition 36(2), 451–461 (2003)
14. Lukasik, S., Kowalski, P.A., Charytanowicz, M., Kulczycki, P.: Clustering using flower pollination algorithm and Calinski-Harabasz index. In: 2016 IEEE Congress on Evolutionary Computation (CEC), pp. 2724–2728 (2016). doi:10.1109/CEC.2016.7744132
15. Kuzelewska, U., Kurylowicz, A.: Multi-clustering applied to collaborative recommender systems. In: 2018 Thirteenth International Conference on Digital Information Management (ICDIM), pp. 118–123 (2018). doi:10.1109/ICDIM.2018.8847141
16. Esteves, R.M., Hacker, T., Rong, C.: Competitive K-Means, a new accurate and distributed K-Means algorithm for large datasets. In: 2013 IEEE 5th International Conference on Cloud Computing Technology and Science, pp. 17–24 (2013)
17. Wang, Z., Zhou, Y., Li, G.: Anomaly detection by using streaming K-Means and batch K-Means. In: 2020 5th IEEE International Conference on Big Data Analytics (ICBDA), pp. 11–17 (2020)
18. Wang, L., Ruan, H.: An improved recommender system in publications e-commerce based on TOPSIS algorithm. In: 2011 International Conference on Computer Science and Service System (CSSS) (2011)
19. Schröder, G., Thiele, M., Lehner, W.: Setting goals and choosing metrics for recommender system evaluations. In: UCERSTI2 Workshop at the 5th ACM Conference on Recommender Systems, Chicago, USA, vol. 23, p. 53 (2011)