-

Comparative Analysis of Two Approaches to the Clustering of Respondents (Based on Survey Results)

Iryna Strutynska

strutynska@ukr.net 1

Halyna Kozbur

Lesia Dmytrotsa

Olha Hlado

Liliya Melnyk

0 State Higher Vocational School in Nowy Sącz , Nowy Sącz , Poland 1 Ternopil Ivan Puluj National Technical University , Ternopil , Ukraine

This paper proposes an algorithm for solving the survey respondents' clustering problem, including the steps of collecting, preparing data, summarizing key results, and developing future goals. The research consists of two approaches to clustering: iterative and hierarchical in order to produce consistent and comprehensible results. The iterative method is implemented in MS Excel using the Data Mining add-in, hierarchical one is used with the help of writing code and using Python libraries. Hard clusters with sufficient degree of similarity within the cluster and differences from others were distinguished, the main characteristics of the obtained clusters were described as well. It has been experimentally established that the method of agglomerative hierarchical clustering is more effective for solving the problem of clustering of mixed-type data obtained from the survey of respondents.

digital maturity clustering methods mixed-type data security

The fastest and the most convenient way to get any information you need today is to directly interview your target audience on a specific topic. With the development of information technology, such questionnaires are increasingly shifting from personal or telephone communication to online questionnaires. This allows you to reach a larger audience in a shorter time span and with fewer human resources. The positive aspects of such surveys are: convenience of expression; partial or complete anonymity of results; the ability to complete a survey in any convenient for the respondent way; no need to communicate with the employees of the survey organization, etc. Online surveys are a particularly effective way of retrieving information if your target audience is the users of the web. Data collection is only part of the complex task of getting the information you need. Further processing and analysis of data with conclusions and recommendations make the data cycle complete. Segmentation or clustering is one of the most important and interesting tasks of data analysis. This paper offers an algorithm for solving the problem of clustering respondents by online survey, including the steps of collecting, preparing data, summarizing key findings, and developing future goals.

The problem of clustering of numerical data as a result of a series of measurements was described by [ 1 ]. A similar grouping of respondents was addressed in the works of [ 2 ]. The main purpose of the work was to provide recommendations on the results of determining the respondents’ political preferences and to compare the clustering method with other generally accepted methods of providing such recommendations. The problem of clustering of categorical data using probabilistic approach and GACUC algorithm was investigated by [ 3 ]. The main statements regarding the clustering of mixed-type data and applying of chosen in the paper algorithms and metrics were discussed in [ 4,5,6,7 ]. However, the problem of processing and clustering of mixed data obtained from the questionnaire has not been researched so far.

This paper studies clustering of respondents using two approaches: iterative and hierarchical in order to produce consistent and comprehensible results. The iterative method is implemented in MS Excel using the Data Mining add-in, hierarchical one is used with the help of writing code and using Python libraries.

The survey, the results of which were taken as inputs to the clustering task, was conducted among small and medium-sized enterprises in Ternopil region for the use of digital technologies and tools in their business activities [ 8 ]. 2

The problem of clustering. Approach typing and solution algorithms

Clustering or cluster data analysis is one of the machine learning tasks of splitting multiple objects into subsets (clusters) so that the objects assigned to one cluster are as similar as possible to each other and the objects referred to different kinds are as different as possible. This approach does not require a labeled data.

One of the most common contemporary tasks that uses cluster analysis is text analysis for news broadcasting, image grouping, consumer segmentation, community identification on social networks, etc.

The variability of tasks, types of datasets and expected results has led to the formation of a large number of methods and approaches to clustering, which differ in their understanding of the concept of “cluster”, as well as adjusting the parameters of algorithms (number of expected clusters, density threshold, distance metrics, etc.) the specifics of the dataset and the subsequent use of the results. Thus, this makes it difficult to uniquely select the algorithm of operation and its parameters for each type of task.

Due to this, clustering can also be called an interactive task of machine learning “with reinforcement”, which provides repeated experimental correction of algorithm parameters for obtaining stable and interpretative results [ 9,10 ].

There is no single common way to classify clustering methods and algorithms. One approach is to distinguish clustering methods by cluster models used (connectivitybased or hierarchical, centroid-based, distribution-based, density-based overlapping clustering etc.). Another approach uses grouping of methods based on their key characteristics (probabilistic, logical, graph-theoretic, hierarchical, neural, frequency algorithms, etc.) [ 10 ].

One of the simplest approaches to clustering methods is to divide them into two groups: hierarchical and non-hierarchical. Hierarchical cluster analysis methods are divided into ascending or descending and can be represented graphically in the form of dendrograms. At the same time with each subsequent step the number of clusters increases or decreases depending on the chosen method: divisional or agglomerative respectively.

The largest group among non-hierarchical methods is iterative. In an iterative approach, they define cluster centers and redistribute the elements of the data set by proximity to the selected centers. These include algorithms k-means, ExpectationMaximization method, mean-shift and others [ 11 ].

The similarity of cluster elements and the closeness of clusters are determined by predefined metrics. 3

Data Collection

The input data set was obtained through Google Forms questionnaires from executives in various companies and enterprises (limited liability companies (LLC), individual entrepreneur) and businesses (construction, trade, repair, logistics, services, etc.)

Respondents answered 35 questions regarding two main aspects of running their business: • forms, organizations and spheres of activity; • level of informatization of business activities (use of digital tools in their work, work with social networks, planning services, analytics or advertising).

Due to the specifics of the information requested and the use of different categories of questions, both quantitative and categorical data were received. For example, information about the number of employees in an organization was obtained in the form of natural numbers, and information about the presence of a business model was presented as a binary “yes” or “no” answer. There were some open-ended questions regarding the respondent’s attitude to a particular problem related to informatization of the business structure. Such responses were excluded from the general clustering dataset.

Numerous mechanical errors and blank answers were found in the data retrieval. These problems were solved with manual processing. However, as the number of respondents increases, such processing will require the unification of the possible answer options for each of the questions or the reduction of all possible answers only to the choice of the suggested ones. 4

Data Preparation

Data preparation consisted of the steps of clearing and encoding data, missing values were not identified in this study. 4.1

Clearing data

Both the hierarchical agglomerative algorithm and the EM method were run for the same data set, so pre-processing due to data cleaning was performed equally for both methods.

The answers to the open-ended questions were reduced to a specific template, for example, only “yes” or “no”, or otherwise unified, for example how is shown in Fig 1. Attributes of the “automatically calculated questionnaire time” or “respondent’s personal attitude” were marked as informative and removed from the task input. All manipulations were performed manually due to the small dimension of the task. Using the MS Excel add-on requires no special training and accepts a simple spreadsheet of values of any type. Therefore, data encoding was not performed for clustering with the Data Mining add-in in MS Excel.

Data encryption was required to work with Python machine learning libraries since most algorithms use mathematical operations on quantitative data. For those questions where it was possible to rank more or less or better or worse, ranked value coding was used. Responses were coded from 0 to some positive number, where 0 meant a single answer “no” or a number close to 0, and other values were ranked according to the increase in manifestation of the sign.

Non-ranking answer options are nominal type data and have been indicated by some character numbers. For further computational work of the algorithm with such data, the Hower metric was used, which makes it possible to work with both quantitative and categorical numerical data at the same time. Respondents' coded answers are shown in Fig. 2.

Fig 2. The survey respondents' coded answers 5

Solving the clustering problem

The main objectives of the study were:  experimental finding of the optimal number of clusters and their characteristic features for the interpreted (understandable) segmentation of business structures according to the level of digital maturity by several methods;  comparing the results obtained by different methods and determining the most effective for a particular data analysis task.

The study used two methods of clustering: 1. Using the Data Mining add-in for MS Excel spreadsheets [ 13 ]. Clustering capabilities in MS Excel are represented by iterative algorithms: k-means and ExpectationMaximization. For the reference, it was determined EM-algorithm; 2. Using the functions of libraries for machine learning Python programming language [ 14,15 ].

To describe how it works with two algorithms, we have introduced the notation: N respondents = {⃗⃗⃗⃗1, ⃗⃗⃗⃗2, … , ⃗⃗⃗⃗⃗ } and M questions = { 1, 2, … , }. Every participant ⃗⃗ ∈ ( ∈ ̅1̅,̅̅̅) answered each of the questions ∈ ( ∈ ̅1̅̅,̅̅̅), so the result is a matrix of responses with dimension( × ), in which each respondent is represented as follows: ⃗⃗ = { 1, 2, … , , … , }, where is an answer lrespondents to k-question (Fig. 3). In the future, we call this tuple a point [ 8 ]. Let us consider the principles of the selected methods. of the problem, the Hower metric (1) proposed in [ 8 ] was used to calculate the distance matrix. where = ( ,

) – the distance between the answers in the k-th question, M is the number of answers to the query in the tuple. The distance matrix Dk for the k-th question is symmetric: (1) (2) (⃗⃗⃗ , ⃗⃗⃗ ) = 1 ∑

=1 , 0 12 0 13 23

0 0 (⃗⃗⃗⃗1, ⃗⃗⃗⃗2) (⃗⃗⃗⃗1, ⃗⃗⃗⃗3) 0 (⃗⃗⃗⃗2, ⃗⃗⃗⃗3)

0 Dk = = … … … … … … … … … … 1 2 3 … 0 (⃗⃗⃗⃗1, ⃗⃗⃗⃗⃗ ) (⃗⃗⃗⃗2, ⃗⃗⃗⃗⃗ ) (⃗⃗⃗⃗3, ⃗⃗⃗⃗⃗ ) … 0 The symmetric matrix

for distances between individual points of the cluster looks like:

The elements of the matrix D are the averaged values of the pairwise values of the distances calculated by the formula (1). All questionnaire weights are taken to be 1.

The way if measuring distances

depend on the type of data in k question. If and quantitative, then distance

is expressed by the formula (2): .

In this case, (⃗⃗⃗ , ⃗⃗⃗ ) ∈ [0; 1]. If then the distance is calculated by the formula (3): – nominal data that cannot be ordered,

= 0 means identical answers of the respondents to k question, and

= 1 – maximal difference. As a consequence, for the averaged distances calculated by formula (1), all values (⃗⃗⃗ , ⃗⃗⃗ ) ∈ [0; 1].

The distance between the individual clusters was by the distance neighbour method. Clusters closest to the selected metric are merged, distances from newly created to other clusters are recalculated, the distance matrix is automatically updated, and clustering continues. The method of the far neighbour allows to allocate rather compact and stable structures corresponding to the task. 5.2

Expectation-Maximization method

In contrast to the proposed modification of the agglomerative method, the fuzzy clustering EM algorithm presented in the Data Mining Add-in for Microsoft Excel was selected among the iterative algorithms. In this case, the main idea of the method is to assume that the elements of the input data set are independent random variables distributed by a law, in most cases a normal Gaussian distribution [16,17,18,19].

When using the EM method, any object in the dataset is considered to belong to all clusters with different probabilities. Before starting the algorithm, the number of K clusters and the initial approximate parameters for each of the K distributions of the input data are specified. Iterations incrementally improve the distribution parameters to a predetermined level of model accuracy. Upon completion of the algorithm, each object will be assigned to a cluster with the highest probability of belonging. Thus, two successive steps are performed at each iteration:

Further, the algorithm is based on an iterative repetition of two consecutive steps as shown in the Fig. 4. 1. Expectation is calculating the probability (plausibility) of the points belonging to 2.

Maximization is improvement of distribution parameter values to maximize the each of the clusters; likelihood of points belonging to clusters. 5.3

Adjustment of algorithm parameters

The MS Cluster Task Wizard in MS Excel allows you to select the desired parameters and adjust their values [ 13 ]. For this task, the list of questions that will affect the result was changed in the algorithm settings, the value of the number of clusters and the cluster seed of the EM clustering method were set.

Referring to the sklearn.cluster.AgglomerativeClustering method to create a clustering model using Python generally involves specifying 3 parameters: number of clusters, intra-cluster distance and inter-cluster distance metrics. The metric of the distance metric between the elements may be one of those proposed in [21] or otherwise calculated. Calling the function of creating a cluster model:

model = AgglomerativeClustering (number of clusters = m, metric = “precomputed”, linkage = complete) labels = model.fit_predict(distances), where m − predefined number of clusters, distances – distance matrix previously calculated by the Hower metric (1) – (3). Let us consider the clustering results by each method and compare the results obtained. The clustering output of MS Excel's Data Mining add-on provides a breakdown of the dataset into clusters with the ability to visualize, view statistics, and cluster profiles.

Using the clustering methods of the sklearn library to analyse data on Python output provides a one-dimensional numeric array indicating which cluster each input tuple belongs to. Further analysis and visualization of the obtained result is carried out additionally. The algorithm for selecting the optimal number of clusters is described in [ 8 ]. 6.1

The results of hierarchical agglomerative clustering

The matrix of distances D between the points of the input set now looks like this (the fragment of the matrix is shown in Table 1):

Тable 1. Matrix of distances between questions As a result of the agglomerative clustering algorithm using Python sklearn, a stable distribution of 5 clusters was obtained, with a satisfactory value of the quality metric, the Silhouette index [22]: ≈ 0.16. A comparative analysis of the clusters obtained by main characteristics is shown in Table 2. The percentages indicate the proportion of respondents in each cluster who answered the same question the same way. The number of respondents who answered equally to the selected questions varied from 40 to 100%. To distinguish the characteristic features of the formed clusters, we leave the values greater than 80% and depict the comparison of the clusters. According to the results, the largest number of respondents (16) was attributed to the first cluster, the main characteristics of which are:  lack of experience with any digital tools;  absence of companies in the Internet environment;  site inefficiency, if any.

The second cluster was formed by 5 companies, the main characteristics of which are defined as follows:  availability of companies in the Internet space;  usage of simple tools to a limited extent.

2 more respondents formed the third cluster that is characterized by:  effective functioning of the site and the purchase chain;  use of most digital tools including advertising;  the work of a marketer to promote a brand or product.

The fourth cluster consists of 10 respondents, and its characteristic features are:  lack of functioning of the purchase chain on the site;  non-use of sophisticated digital tools;  availability of social networks only.

The fifth cluster consists of only one company that successfully uses virtually all digital tools with the help of specialists.

A sufficient degree of differences between clusters and a sufficient degree of similarity of elements within the cluster (80-100%) makes it possible to clearly identify the following groups and rank them by the level of use of digital technologies and tools in business activities. Table 3 shows the ranking of types of business structures by the decline in digital maturity. The clustering performed by the EM method in MS Excel proved to be unstable. Because the EM algorithm is a group of iterative fuzzy clustering methods, this result is normal and suitable for use in a particular class of tasks. The comparative characteristics of the clusters obtained are shown in Table 4. In contrast to agglomerative clustering, the degree of uniformity of answers to questions within the clusters is much lower, and fluctuates on average within 60-70%. As we can see, the degree of difference between clusters is also low. Repeated application of the EM method did not improve the quality of the results.

Similar to the previous clustering method, a cluster was identified that included two business entities that are actively using digital technology in their businesses. The differences between the other clusters are small, the differences in the percentage of answers to the questions are minimal, so it is impossible to distinguish the characteristics of each subset. The inability to distinguish clusters with distinct features does not meet the objective of the study. In this study, we conducted an experimental comparison of the use of two approaches to the clustering of respondents according to online survey results using the Google Forms service, hard and soft clustering, in particular. Hard clustering was implemented with the use of Python tools and the hierarchical agglomerative method, while soft clustering was viewed through the use of the Data Mining add-in MS Excel and the iterative EM method.

A comparative analysis of the results obtained by the two methods showed the following results:  Using hierarchical agglomerative clustering, we obtained 5 clusters, sufficiently different from each other and with a high degree of similarity between the elements of the cluster (60-100% depending on the question). The cluster features are distinguished (use of social networks, advertising offices and services, analytical tools, search engine optimization of sites, etc.);  the use of the EM method did not allow to obtain good clustering results and to achieve the goal of the task, the results of the EM method implementation changed with each run of the algorithm.

It has been experimentally established that the method of agglomerative hierarchical clustering is an effective method for solving the problem of clustering of mixed-type data obtained from the survey of respondents.

In addition to improving the parameters of the algorithm, the tasks for the further studies are: elimination of mechanical errors when entering answers and the presence of empty values; diversity of data, which causes complexity of their unification and proper ordering; selection of mathematical metrics used as arguments for clustering functions and calculating their quality. 15. Scikit. Clustering documentation. Scikit learn [Online]. Avaliable: https://scikitlearn.org/stable/modules/clustering.html. 16. Expectation-maximization algorithm, Wikipedia [Online]. Avaliable: https://en.wikipedia.org/wiki/Expectation-maximization_algorithm 17. Yair Weiss. Bayesian motion estimation and segmentation. PhD thesis, Massachusetts

Institute of Technology, May 1998. 18. Zinchenko D. EM-algoritm klasterizatsii [EM-clustering algorithm], AlgoWiki [Online], 2016. Avaliable: https://algowiki-project.org/ru/Участник:Noite/EMалгоритм_кластеризации 19. Oreshkov V. EM – masshtabiruyemyy algoritm klasterizatsii [EM – scalable clustering algorithm], BaseGroup Labs [Online], 2013. Avaliable: https://basegroup.ru/community/articles/em 20. Lotfi, E. (2018). [image] Available at: https://www.researchgate.net/profile/Elaachak_Lotfi/publication/322641344/figure/fig3/A S:585607373922305@1516631086753/flowchart-of-EM-algorithm.png [Accessed 15 Oct. 2019]. 21. Scikit. Agglomerative Clustering. Scikit learn [Online]. Avaliable: https://scikitlearn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.h tml 22. Scikit. Silhourtte Score. Scikit learn [Online]. Avaliable: https://scikitlearn.org/stable/modules/ generated/sklearn.metrics.silhouette_score.html

1. Cherezov

, Tyuakchev

Obzor osnovnykh metodov klassifikatsii I klasterizatsii dannykh [Overview main methods of data classification and clustering] . Vestnik

VGU

, Seria: Sistemnyi analiz i informachionnye tekhnologii, no 2 . pp. 25 - 29 , 2009 . Avaliable at: http://www.vestnik.vsu.ru/pdf/analiz/2009/02/2009-02-05.pdf ( Accessed 14 August 2019 ). (In Russian).

2. Katakis , N.

Tsapatsoulis , C.

Tziouvas and F. Mendes. ( 2012 ). Clustering Online Poll Data: Towards a Voting Assistance System , 2012 Seventh International Workshop on Semantic and Social Media Adaptation and Personalization , http://www.katakis.eu/wpcontent/uploads/2014/11/katakissmap12.pdf.doi.org/10.1109/SMAP. 2012 .19

3. J. McCaffrey , Machine Learning Using C# . Syncfusion, 2014 .

4. Chae , S. S. , Kim , J.-M. & Yang , W. Y. , ( 2006 ). Cluster analysis with balancing weights on mixed-type data . The Korean communications in statistics , 13 ( 3) doi .org/10.5351/CKSS. 2006 . 13 .3. 719

5. Gower

(

1967 ) A comparison of some methods of cluster analysis . Biometrics 23 : 623 - 637 doi.org/10.2307/2528417

6. William

Rand . Objective Criteria for the Evaluation of Clustering Methods . Journal of the American Statistical Association . Vol. 66 , No. 336 ( Dec ., 1971 ), pp. 846 - 850 doi.org/10.2307/2284239

7. Gower , J. C. , ( 1971 ). A General Coefficient of Similarity and Some of Its Properties . Biometrics, 27 ( 4 ), p. 859 . doi.org/10.2307/2528823

Strutynska ,

Kozbur ,

Dmytrotsa ,

Bodnarchuk and O.Hlado. ( 2019 ) Small and Medium Business Structures Clustering Method Based on Their Digital Maturity . 2019 International Scientific-Practical Conference Problems of Infocommunications. Science and Technology , pp. 278 - 282 (in print)

9. Klasterizatsia [Clustering] Wikiconspekts, ITMO University [Online]. 2019 . Avaliable: http://neerc.ifmo.ru/wiki/index.php?title=Кластеризация (in Russian)

10. Cluster

analysis

, Wikipedia [Online]. Avaliable: https://en.wikipedia.org/ wiki/Cluster_analysis

11. Zatserkovnyi

V. I.

, Burachek

V. H.

, Zhelezniak

O. O.

, Tereshchenko

A. O.

Heoinformatsiini systemy i bazy danykh [Geoinformation systems and databases]. Nizhyn, Ukraine: NDU im. M. Hoholia, 2017 . pp. 77 - 95 . Avaliable at: https://studfiles.net/preview/6440954 (In Ukrainian).

12. Google Forms, https://www.google.com/intl/uk_ua/forms

13. Microsoft ( 2017 Dec) . Cluster Wizard (Data Mining Add-ins for Excel) , Microsoft Docs [On-line] . Avaliable: https://docs.microsoft. com/en-us/sql/analysis-services/clusterwizard-data-mining-add-ins-for-excel?view=sql-server-2014

14. Python , https://www.python.org.