Comparative Analysis of Two Approaches to the
    Clustering of Respondents (based on Survey Results)

    Iryna Strutynska, Halyna Kozbur, Lesia Dmytrotsa, Olha Hlado, Liliya Melnyk

           Ternopil Ivan Puluj National Technical University, Ternopil, Ukraine
      {ringtons999, dmytrotsa.lesya, kozbur.galina}@ gmail.com


        Abstract. This paper proposes an algorithm for solving the survey respondents'
        clustering problem, including the steps of collecting, preparing data, summariz-
        ing key results, and developing future goals. The research consists of two ap-
        proaches to clustering: iterative and hierarchical in order to produce consistent
        and comprehensible results. The iterative method is implemented in MS Excel
        using the Data Mining add-in, hierarchical one is used with the help of writing
        code and using Python libraries. Hard clusters with sufficient degree of similari-
        ty within the cluster and differences from others were distinguished, the main
        characteristics of the obtained clusters were described as well. It has been ex-
        perimentally established that the method of agglomerative hierarchical cluster-
        ing is more effective for solving the problem of clustering of mixed-type data
        obtained from the survey of respondents.


        Keywords: digital maturity, clustering methods, mixed-type data


1       Introduction

The fastest and the most convenient way to get any information you need today is to
directly interview your target audience on a specific topic. With the development of
information technology, such questionnaires are increasingly shifting from personal
or telephone communication to online questionnaires. This allows you to reach a larg-
er audience in a shorter time span and with fewer human resources. The positive as-
pects of such surveys are: convenience of expression; partial or complete anonymity
of results; the ability to complete a survey in any convenient for the respondent way;
no need to communicate with the employees of the survey organization, etc. Online
surveys are a particularly effective way of retrieving information if your target audi-
ence is the users of the web. Data collection is only part of the complex task of getting
the information you need. Further processing and analysis of data with conclusions
and recommendations make the data cycle complete. Segmentation or clustering is
one of the most important and interesting tasks of data analysis. This paper offers an
algorithm for solving the problem of clustering respondents by online survey, includ-
ing the steps of collecting, preparing data, summarizing key findings, and developing
future goals.

    Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attrib-
ution 4.0 International (CC BY 4.0) CMiGIN-2019: International Workshop on Conflict Management in
Global Information Networks.
   The problem of clustering of numerical data as a result of a series of measurements
was described by [1]. A similar grouping of respondents was addressed in the works
of [2]. The main purpose of the work was to provide recommendations on the results
of determining the respondents’ political preferences and to compare the clustering
method with other generally accepted methods of providing such recommendations.
The problem of clustering of categorical data using probabilistic approach and
GACUC algorithm was investigated by [3]. The main statements regarding the clus-
tering of mixed-type data and applying of chosen in the paper algorithms and metrics
were discussed in [4,5,6,7]. However, the problem of processing and clustering of
mixed data obtained from the questionnaire has not been researched so far.
   This paper studies clustering of respondents using two approaches: iterative and
hierarchical in order to produce consistent and comprehensible results. The iterative
method is implemented in MS Excel using the Data Mining add-in, hierarchical one is
used with the help of writing code and using Python libraries.
   The survey, the results of which were taken as inputs to the clustering task, was
conducted among small and medium-sized enterprises in Ternopil region for the use
of digital technologies and tools in their business activities [8].


2      The problem of clustering. Approach typing and solution
       algorithms

Clustering or cluster data analysis is one of the machine learning tasks of splitting
multiple objects into subsets (clusters) so that the objects assigned to one cluster are
as similar as possible to each other and the objects referred to different kinds are as
different as possible. This approach does not require a labeled data.
   One of the most common contemporary tasks that uses cluster analysis is text anal-
ysis for news broadcasting, image grouping, consumer segmentation, community
identification on social networks, etc.
   The variability of tasks, types of datasets and expected results has led to the for-
mation of a large number of methods and approaches to clustering, which differ in
their understanding of the concept of “cluster”, as well as adjusting the parameters of
algorithms (number of expected clusters, density threshold, distance metrics, etc.) the
specifics of the dataset and the subsequent use of the results. Thus, this makes it diffi-
cult to uniquely select the algorithm of operation and its parameters for each type of
task.
   Due to this, clustering can also be called an interactive task of machine learning
“with reinforcement”, which provides repeated experimental correction of algorithm
parameters for obtaining stable and interpretative results [9,10].
   There is no single common way to classify clustering methods and algorithms. One
approach is to distinguish clustering methods by cluster models used (connectivity-
based or hierarchical, centroid-based, distribution-based, density-based overlapping
clustering etc.). Another approach uses grouping of methods based on their key char-
acteristics (probabilistic, logical, graph-theoretic, hierarchical, neural, frequency algo-
rithms, etc.) [10].
    One of the simplest approaches to clustering methods is to divide them into two
groups: hierarchical and non-hierarchical. Hierarchical cluster analysis methods are
divided into ascending or descending and can be represented graphically in the form
of dendrograms. At the same time with each subsequent step the number of clusters
increases or decreases depending on the chosen method: divisional or agglomerative
respectively.
    The largest group among non-hierarchical methods is iterative. In an iterative ap-
proach, they define cluster centers and redistribute the elements of the data set by
proximity to the selected centers. These include algorithms k-means, Expectation-
Maximization method, mean-shift and others [11].
    The similarity of cluster elements and the closeness of clusters are determined by
predefined metrics.


3      Data Collection

The input data set was obtained through Google Forms questionnaires from execu-
tives in various companies and enterprises (limited liability companies (LLC), indi-
vidual entrepreneur) and businesses (construction, trade, repair, logistics, services,
etc.)
   Respondents answered 35 questions regarding two main aspects of running their
business:
   • forms, organizations and spheres of activity;
   • level of informatization of business activities (use of digital tools in their work,
work with social networks, planning services, analytics or advertising).
   Due to the specifics of the information requested and the use of different categories
of questions, both quantitative and categorical data were received. For example, in-
formation about the number of employees in an organization was obtained in the form
of natural numbers, and information about the presence of a business model was pre-
sented as a binary “yes” or “no” answer. There were some open-ended questions re-
garding the respondent’s attitude to a particular problem related to informatization of
the business structure. Such responses were excluded from the general clustering
dataset.
   Numerous mechanical errors and blank answers were found in the data retrieval.
These problems were solved with manual processing. However, as the number of
respondents increases, such processing will require the unification of the possible
answer options for each of the questions or the reduction of all possible answers only
to the choice of the suggested ones.


4      Data Preparation

Data preparation consisted of the steps of clearing and encoding data, missing values
were not identified in this study.
4.1    Clearing data
Both the hierarchical agglomerative algorithm and the EM method were run for the
same data set, so pre-processing due to data cleaning was performed equally for both
methods.
   The answers to the open-ended questions were reduced to a specific template, for
example, only “yes” or “no”, or otherwise unified, for example how is shown in Fig 1.
Attributes of the “automatically calculated questionnaire time” or “respondent’s per-
sonal attitude” were marked as informative and removed from the task input. All ma-
nipulations were performed manually due to the small dimension of the task.


                    Fig.1. Respondents' answers to the questionnaire


4.2    Data encoding
Using the MS Excel add-on requires no special training and accepts a simple spread-
sheet of values of any type. Therefore, data encoding was not performed for clustering
with the Data Mining add-in in MS Excel.
   Data encryption was required to work with Python machine learning libraries since
most algorithms use mathematical operations on quantitative data. For those questions
where it was possible to rank more or less or better or worse, ranked value coding was
used. Responses were coded from 0 to some positive number, where 0 meant a single
answer “no” or a number close to 0, and other values were ranked according to the
increase in manifestation of the sign.
   Non-ranking answer options are nominal type data and have been indicated by
some character numbers. For further computational work of the algorithm with such
data, the Hower metric was used, which makes it possible to work with both quantita-
tive and categorical numerical data at the same time. Respondents' coded answers are
shown in Fig. 2.
                        Fig 2. The survey respondents' coded answers


5       Solving the clustering problem

The main objectives of the study were:

 experimental finding of the optimal number of clusters and their characteristic
  features for the interpreted (understandable) segmentation of business structures
  according to the level of digital maturity by several methods;
 comparing the results obtained by different methods and determining the most
  effective for a particular data analysis task.

    The study used two methods of clustering:

1. Using the Data Mining add-in for MS Excel spreadsheets [13]. Clustering capabili-
   ties in MS Excel are represented by iterative algorithms: k-means and Expectation-
   Maximization. For the reference, it was determined EM-algorithm;
2. Using the functions of libraries for machine learning Python programming lan-
   guage [14,15].

   To describe how it works with two algorithms, we have introduced the notation: N
respondents 𝑈 = {𝑢 ⃗⃗⃗⃗1 , ⃗⃗⃗⃗     𝑢𝑁 } and M questions 𝑄 = {𝑞1 , 𝑞2 , … , 𝑞𝑀 }. Every partic-
                           𝑢2 , … , ⃗⃗⃗⃗⃗
      ⃗ 𝑖 ∈ 𝑈 (𝑙 ∈ 1,
ipant 𝑢            ̅̅̅̅̅𝑁) answered each of the questions 𝑞 ∈ 𝑄 (𝑘 ∈ 1,
                                                                 𝑘
                                                                                  ̅̅̅̅̅̅
                                                                                     𝑀), so the
result is a matrix of responses with dimension(𝑁 × 𝑀), in which each respondent is
represented as follows: ⃗𝑢𝑙 = {𝑢𝑙1 , 𝑢𝑙2 , … , 𝑢𝑙𝑘 , … , 𝑢𝑙𝑀 }, where 𝑢𝑙𝑘 is an answer l-
respondents to k-question (Fig. 3). In the future, we call this tuple a point [8].


                                Fig. 3. The matrix of answers
Let us consider the principles of the selected methods.


5.1    Method of hierarchical agglomeration
The principle of operation of the modified agglomerative method is described in de-
tail by [8]. According to the agglomerative approach to clustering, each point is con-
sidered to be a separate cluster at the beginning. As the algorithm works, each of the
two closest clusters is merged at each step, eventually forming a predefined number
of clusters or merging into one. To get started with the agglomerative algorithm, we
build a matrix of pairwise distances between the objects of the cluster. In the context
of the problem, the Hower metric (1) proposed in [8] was used to calculate the dis-
tance matrix.

                                                  1
                               𝑑(𝑢
                                 ⃗⃗⃗𝑖 , ⃗⃗⃗
                                        𝑢𝑗 ) =        ∑𝑀
                                                       𝑘=1 𝑑𝑖𝑗𝑘 ,                            (1)
                                                  𝑀


where 𝑑𝑖𝑗𝑘 = 𝑑(𝑢𝑖𝑘 , 𝑢𝑗𝑘 ) – the distance between the answers in the k-th question, M
is the number of answers to the query in the tuple. The distance matrix Dk for the k-th
question is symmetric:

                               0        𝑑12𝑘             𝑑13𝑘          …     𝑑1𝑁𝑘
                                         0               𝑑23𝑘          …     𝑑2𝑁𝑘
                  Dk =                                    0            …     𝑑3𝑁𝑘
                                                                       …      …
                                                                       …      0

   The symmetric matrix 𝐷 for distances between individual points of the cluster
looks like:

                           0       𝑑(𝑢       𝑢2 ) 𝑑(𝑢
                                     ⃗⃗⃗⃗1 , ⃗⃗⃗⃗           𝑢3 )
                                                    ⃗⃗⃗⃗1 , ⃗⃗⃗⃗       …   𝑑(𝑢       𝑢𝑁 )
                                                                             ⃗⃗⃗⃗1 , ⃗⃗⃗⃗⃗
                                       0          𝑑(𝑢       𝑢3 )
                                                    ⃗⃗⃗⃗2 , ⃗⃗⃗⃗       …   𝑑(𝑢       𝑢𝑁 )
                                                                             ⃗⃗⃗⃗2 , ⃗⃗⃗⃗⃗
                𝐷=                                      0              …   𝑑(𝑢       𝑢𝑁 )
                                                                             ⃗⃗⃗⃗3 , ⃗⃗⃗⃗⃗
                                                                       …        …
                                                                       …         0

   The elements of the matrix D are the averaged values of the pairwise values of the
distances calculated by the formula (1). All questionnaire weights are taken to be 1.
  The way if measuring distances 𝑑𝑖𝑗𝑘 depend on the type of data in k question. If 𝑢𝑖𝑘
and 𝑢𝑗𝑘 quantitative, then distance 𝑑𝑖𝑗𝑘 is expressed by the formula (2):


                                                 |𝑢𝑖𝑘 −𝑢𝑗𝑘 |
                               𝑑𝑖𝑗𝑘 =                              .                         (2)
                                         max(𝑢𝑘 )−min(𝑢𝑘 )
In this case, 𝑑(𝑢      𝑢𝑗 ) ∈ [0; 1]. If 𝑢𝑖𝑘 and 𝑢𝑗𝑘 – nominal data that cannot be ordered,
                ⃗⃗⃗𝑖 , ⃗⃗⃗
then the distance is calculated by the formula (3):


                                       0,   𝑢𝑖𝑘 = 𝑢𝑗𝑘 ,
                            𝑑𝑖𝑗𝑘 = {                                                   (3)
                                       1,   𝑢𝑖𝑘 ≠ 𝑢𝑗𝑘 .


In both cases 𝑑𝑖𝑗𝑘 = 0 means identical answers of the respondents 𝑢𝑗 to k question,
and 𝑑𝑖𝑗𝑘 = 1 – maximal difference. As a consequence, for the averaged distances
calculated by formula (1), all values𝑑(𝑢 ⃗⃗⃗𝑖 , ⃗⃗⃗
                                                𝑢𝑗 ) ∈ [0; 1].
   The distance between the individual clusters was by the distance neighbour meth-
od. Clusters closest to the selected metric are merged, distances from newly created to
other clusters are recalculated, the distance matrix is automatically updated, and clus-
tering continues. The method of the far neighbour allows to allocate rather compact
and stable structures corresponding to the task.


5.2     Expectation-Maximization method
In contrast to the proposed modification of the agglomerative method, the fuzzy clus-
tering EM algorithm presented in the Data Mining Add-in for Microsoft Excel was
selected among the iterative algorithms. In this case, the main idea of the method is to
assume that the elements of the input data set are independent random variables dis-
tributed by a law, in most cases a normal Gaussian distribution [16,17,18,19].
    When using the EM method, any object in the dataset is considered to belong to all
clusters with different probabilities. Before starting the algorithm, the number of K
clusters and the initial approximate parameters for each of the K distributions of the
input data are specified. Iterations incrementally improve the distribution parameters
to a predetermined level of model accuracy. Upon completion of the algorithm, each
object will be assigned to a cluster with the highest probability of belonging. Thus,
two successive steps are performed at each iteration:
    Further, the algorithm is based on an iterative repetition of two consecutive steps as
shown in the Fig. 4.

 1. Expectation is calculating the probability (plausibility) of the points belonging to
    each of the clusters;
2. Maximization is improvement of distribution parameter values to maximize the
    likelihood of points belonging to clusters.


5.3     Adjustment of algorithm parameters
The MS Cluster Task Wizard in MS Excel allows you to select the desired parameters
and adjust their values [13]. For this task, the list of questions that will affect the re-
sult was changed in the algorithm settings, the value of the number of clusters and the
cluster seed of the EM clustering method were set.
   Referring to the sklearn.cluster.AgglomerativeClustering method to create a clus-
tering model using Python generally involves specifying 3 parameters: number of
clusters, intra-cluster distance and inter-cluster distance metrics. The metric of the
distance metric between the elements may be one of those proposed in [21] or other-
wise calculated. Calling the function of creating a cluster model:
   model = AgglomerativeClustering (number of clusters = m, metric = “precomput-
ed”, linkage = complete)
   labels = model.fit_predict(distances),
   where m − predefined number of clusters, distances – distance matrix previously
calculated by the Hower metric (1) – (3).


            Fig. 4. Functional diagram of the EM algorithm, borrowed from [20]


6      Analysis of the results

Let us consider the clustering results by each method and compare the results ob-
tained. The clustering output of MS Excel's Data Mining add-on provides a break-
down of the dataset into clusters with the ability to visualize, view statistics, and clus-
ter profiles.
   Using the clustering methods of the sklearn library to analyse data on Python out-
put provides a one-dimensional numeric array indicating which cluster each input
tuple belongs to. Further analysis and visualization of the obtained result is carried out
additionally. The algorithm for selecting the optimal number of clusters is described
in [8].


6.1    The results of hierarchical agglomerative clustering
   The matrix of distances D between the points of the input set now looks like this
(the fragment of the matrix is shown in Table 1):

                       Тable 1. Matrix of distances between questions


As a result of the agglomerative clustering algorithm using Python sklearn, a stable
distribution of 5 clusters was obtained, with a satisfactory value of the quality metric,
the Silhouette index [22]: 𝑠𝑖𝑙 ≈ 0.16. A comparative analysis of the clusters obtained
by main characteristics is shown in Table 2.

 Table 2. Comparative characteristics of clusters formed by agglomerative clustering with Py-
                                          thon tools
The percentages indicate the proportion of respondents in each cluster who answered
the same question the same way. The number of respondents who answered equally to
the selected questions varied from 40 to 100%. To distinguish the characteristic fea-
tures of the formed clusters, we leave the values greater than 80% and depict the
comparison of the clusters.
   According to the results, the largest number of respondents (16) was attributed to
the first cluster, the main characteristics of which are:
 lack of experience with any digital tools;
 absence of companies in the Internet environment;
 site inefficiency, if any.
   The second cluster was formed by 5 companies, the main characteristics of which
are defined as follows:
 availability of companies in the Internet space;
 usage of simple tools to a limited extent.
   2 more respondents formed the third cluster that is characterized by:
 effective functioning of the site and the purchase chain;
 use of most digital tools including advertising;
 the work of a marketer to promote a brand or product.
   The fourth cluster consists of 10 respondents, and its characteristic features are:
 lack of functioning of the purchase chain on the site;
 non-use of sophisticated digital tools;
 availability of social networks only.
   The fifth cluster consists of only one company that successfully uses virtually all
digital tools with the help of specialists.
   A sufficient degree of differences between clusters and a sufficient degree of simi-
larity of elements within the cluster (80-100%) makes it possible to clearly identify
the following groups and rank them by the level of use of digital technologies and
tools in business activities. Table 3 shows the ranking of types of business structures
by the decline in digital maturity.


          Table 3. Ranking of business structure clusters by digital maturity level
6.2      The results of ЕМ-clustering
The clustering performed by the EM method in MS Excel proved to be unstable. Be-
cause the EM algorithm is a group of iterative fuzzy clustering methods, this result is
normal and suitable for use in a particular class of tasks. The comparative characteris-
tics of the clusters obtained are shown in Table 4. In contrast to agglomerative cluster-
ing, the degree of uniformity of answers to questions within the clusters is much low-
er, and fluctuates on average within 60-70%. As we can see, the degree of difference
between clusters is also low. Repeated application of the EM method did not improve
the quality of the results.
   Similar to the previous clustering method, a cluster was identified that included
two business entities that are actively using digital technology in their businesses. The
differences between the other clusters are small, the differences in the percentage of
answers to the questions are minimal, so it is impossible to distinguish the characteris-
tics of each subset. The inability to distinguish clusters with distinct features does not
meet the objective of the study.


      Table 4. Comparative characteristics of clusters formed by using an iterative approach


7        Conclusions and future work

In this study, we conducted an experimental comparison of the use of two approaches
to the clustering of respondents according to online survey results using the Google
Forms service, hard and soft clustering, in particular. Hard clustering was implement-
ed with the use of Python tools and the hierarchical agglomerative method, while soft
clustering was viewed through the use of the Data Mining add-in MS Excel and the
iterative EM method.
   A comparative analysis of the results obtained by the two methods showed the fol-
lowing results:

 Using hierarchical agglomerative clustering, we obtained 5 clusters, sufficiently
  different from each other and with a high degree of similarity between the elements
  of the cluster (60-100% depending on the question). The cluster features are distin-
  guished (use of social networks, advertising offices and services, analytical tools,
  search engine optimization of sites, etc.);
 the use of the EM method did not allow to obtain good clustering results and to
  achieve the goal of the task, the results of the EM method implementation changed
  with each run of the algorithm.

   It has been experimentally established that the method of agglomerative hierar-
chical clustering is an effective method for solving the problem of clustering of
mixed-type data obtained from the survey of respondents.
   In addition to improving the parameters of the algorithm, the tasks for the further
studies are: elimination of mechanical errors when entering answers and the presence
of empty values; diversity of data, which causes complexity of their unification and
proper ordering; selection of mathematical metrics used as arguments for clustering
functions and calculating their quality.


References
[1]   Cherezov D., Tyuakchev N. Obzor osnovnykh metodov klassifikatsii I klasterizatsii dannykh
      [Overview main methods of data classification and clustering]. Vestnik VGU, Seria: Sistemnyi analiz
      i      informachionnye       tekhnologii,    no     2.      pp.25-29,     2009.     Avaliable     at:
      http://www.vestnik.vsu.ru/pdf/analiz/2009/02/2009-02-05.pdf (Accessed 14 August 2019). (In
      Russian).
[2]   I. Katakis, N. Tsapatsoulis, C. Tziouvas and F. Mendes. (2012). Clustering Online Poll Data: Towards
      a Voting Assistance System, 2012 Seventh International Workshop on Semantic and Social Media
      Adaptation                 and               Personalization,              http://www.katakis.eu/wp-
      content/uploads/2014/11/katakissmap12.pdf.doi.org/10.1109/SMAP.2012.19
[3]   J. McCaffrey, Machine Learning Using C#. Syncfusion, 2014.
[4]   Chae, S. S., Kim, J.-M. & Yang, W. Y., (2006). Cluster analysis with balancing weights on mixed-
      type data. The Korean communications in statistics, 13(3) doi.org/10.5351/CKSS.2006.13.3.719
[5]   Gower JC (1967) A comparison of some methods of cluster analysis. Biometrics 23:623–637
      doi.org/10.2307/2528417
[6]   William M. Rand. Objective Criteria for the Evaluation of Clustering Methods. Journal of the
      American Statistical Association. Vol. 66, No. 336 (Dec., 1971), pp. 846-850
      doi.org/10.2307/2284239
[7]   Gower, J. C., (1971). A General Coefficient of Similarity and Some of Its Properties. Biometrics,
      27(4), p. 859. doi.org/10.2307/2528823
[8]   I.Strutynska, H.Kozbur, L.Dmytrotsa, I.Bodnarchuk and O.Hlado. (2019) Small and Medium
      Business Structures Clustering Method Based on Their Digital Maturity. 2019 International
      Scientific-Practical Conference Problems of Infocommunications. Science and Technology, pp.278-
      282 (in print)
[9]    Klasterizatsia [Clustering] Wikiconspekts, ITMO University [Online]. 2019. Avaliable:
      http://neerc.ifmo.ru/wiki/index.php?title=Кластеризация (in Russian)
[10] Cluster analysis, Wikipedia [Online]. Avaliable: https://en.wikipedia.org/ wiki/Cluster_analysis
[11] Zatserkovnyi V. I., Burachek V. H., Zhelezniak O. O., Tereshchenko A. O. Heoinformatsiini systemy
     i bazy danykh [Geoinformation systems and databases]. Nizhyn, Ukraine: NDU im. M. Hoholia,
     2017. pp. 77-95. Avaliable at: https://studfiles.net/preview/6440954 (In Ukrainian).
[12] Google Forms, https://www.google.com/intl/uk_ua/forms
[13] Microsoft (2017 Dec). Cluster Wizard (Data Mining Add-ins for Excel), Microsoft Docs [On-line].
     Avaliable: https://docs.microsoft.com/en-us/sql/analysis-services/cluster-wizard-data-mining-add-ins-
     for-excel?view=sql-server-2014
[14] Python, https://www.python.org.
[15] Scikit.    Clustering documentation.          Scikit    learn    [Online]. Avaliable:    https://scikit-
     learn.org/stable/modules/clustering.html.
[16] Expectation-maximization             algorithm,           Wikipedia        [Online].        Avaliable:
     https://en.wikipedia.org/wiki/Expectation-maximization_algorithm
[17] Yair Weiss. Bayesian motion estimation and segmentation. PhD thesis, Massachusetts Institute of
     Technology, May 1998.
[18] Zinchenko D. EM-algoritm klasterizatsii [EM-clustering algorithm], AlgoWiki [Online], 2016.
     Avaliable: https://algowiki-project.org/ru/Участник:Noite/EM-алгоритм_кластеризации
[19] Oreshkov V. EM – masshtabiruyemyy algoritm klasterizatsii [EM – scalable clustering algorithm],
     BaseGroup Labs [Online], 2013. Avaliable: https://basegroup.ru/community/articles/em
[20] Lotfi,               E.             (2018).                [image]            Available              at:
     https://www.researchgate.net/profile/Elaachak_Lotfi/publication/322641344/figure/fig3/AS:5856073
     73922305@1516631086753/flowchart-of-EM-algorithm.png [Accessed 15 Oct. 2019].
[21] Scikit.        Agglomerative        Clustering.        Scikit      learn     [Online].      Avaliable:
     https://scikitlearn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html
[22] Scikit. Silhourtte Score. Scikit learn [Online]. Avaliable: https://scikit-learn.org/stable/modules/
     generated/sklearn.metrics.silhouette_score.html