Comparative Analysis of Two Approaches to the Clustering of Respondents (based on Survey Results) Iryna Strutynska, Halyna Kozbur, Lesia Dmytrotsa, Olha Hlado, Liliya Melnyk Ternopil Ivan Puluj National Technical University, Ternopil, Ukraine {ringtons999, dmytrotsa.lesya, kozbur.galina}@ gmail.com Abstract. This paper proposes an algorithm for solving the survey respondents' clustering problem, including the steps of collecting, preparing data, summariz- ing key results, and developing future goals. The research consists of two ap- proaches to clustering: iterative and hierarchical in order to produce consistent and comprehensible results. The iterative method is implemented in MS Excel using the Data Mining add-in, hierarchical one is used with the help of writing code and using Python libraries. Hard clusters with sufficient degree of similari- ty within the cluster and differences from others were distinguished, the main characteristics of the obtained clusters were described as well. It has been ex- perimentally established that the method of agglomerative hierarchical cluster- ing is more effective for solving the problem of clustering of mixed-type data obtained from the survey of respondents. Keywords: digital maturity, clustering methods, mixed-type data 1 Introduction The fastest and the most convenient way to get any information you need today is to directly interview your target audience on a specific topic. With the development of information technology, such questionnaires are increasingly shifting from personal or telephone communication to online questionnaires. This allows you to reach a larg- er audience in a shorter time span and with fewer human resources. The positive as- pects of such surveys are: convenience of expression; partial or complete anonymity of results; the ability to complete a survey in any convenient for the respondent way; no need to communicate with the employees of the survey organization, etc. Online surveys are a particularly effective way of retrieving information if your target audi- ence is the users of the web. Data collection is only part of the complex task of getting the information you need. Further processing and analysis of data with conclusions and recommendations make the data cycle complete. Segmentation or clustering is one of the most important and interesting tasks of data analysis. This paper offers an algorithm for solving the problem of clustering respondents by online survey, includ- ing the steps of collecting, preparing data, summarizing key findings, and developing future goals. Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attrib- ution 4.0 International (CC BY 4.0) CMiGIN-2019: International Workshop on Conflict Management in Global Information Networks. The problem of clustering of numerical data as a result of a series of measurements was described by [1]. A similar grouping of respondents was addressed in the works of [2]. The main purpose of the work was to provide recommendations on the results of determining the respondents’ political preferences and to compare the clustering method with other generally accepted methods of providing such recommendations. The problem of clustering of categorical data using probabilistic approach and GACUC algorithm was investigated by [3]. The main statements regarding the clus- tering of mixed-type data and applying of chosen in the paper algorithms and metrics were discussed in [4,5,6,7]. However, the problem of processing and clustering of mixed data obtained from the questionnaire has not been researched so far. This paper studies clustering of respondents using two approaches: iterative and hierarchical in order to produce consistent and comprehensible results. The iterative method is implemented in MS Excel using the Data Mining add-in, hierarchical one is used with the help of writing code and using Python libraries. The survey, the results of which were taken as inputs to the clustering task, was conducted among small and medium-sized enterprises in Ternopil region for the use of digital technologies and tools in their business activities [8]. 2 The problem of clustering. Approach typing and solution algorithms Clustering or cluster data analysis is one of the machine learning tasks of splitting multiple objects into subsets (clusters) so that the objects assigned to one cluster are as similar as possible to each other and the objects referred to different kinds are as different as possible. This approach does not require a labeled data. One of the most common contemporary tasks that uses cluster analysis is text anal- ysis for news broadcasting, image grouping, consumer segmentation, community identification on social networks, etc. The variability of tasks, types of datasets and expected results has led to the for- mation of a large number of methods and approaches to clustering, which differ in their understanding of the concept of “cluster”, as well as adjusting the parameters of algorithms (number of expected clusters, density threshold, distance metrics, etc.) the specifics of the dataset and the subsequent use of the results. Thus, this makes it diffi- cult to uniquely select the algorithm of operation and its parameters for each type of task. Due to this, clustering can also be called an interactive task of machine learning “with reinforcement”, which provides repeated experimental correction of algorithm parameters for obtaining stable and interpretative results [9,10]. There is no single common way to classify clustering methods and algorithms. One approach is to distinguish clustering methods by cluster models used (connectivity- based or hierarchical, centroid-based, distribution-based, density-based overlapping clustering etc.). Another approach uses grouping of methods based on their key char- acteristics (probabilistic, logical, graph-theoretic, hierarchical, neural, frequency algo- rithms, etc.) [10]. One of the simplest approaches to clustering methods is to divide them into two groups: hierarchical and non-hierarchical. Hierarchical cluster analysis methods are divided into ascending or descending and can be represented graphically in the form of dendrograms. At the same time with each subsequent step the number of clusters increases or decreases depending on the chosen method: divisional or agglomerative respectively. The largest group among non-hierarchical methods is iterative. In an iterative ap- proach, they define cluster centers and redistribute the elements of the data set by proximity to the selected centers. These include algorithms k-means, Expectation- Maximization method, mean-shift and others [11]. The similarity of cluster elements and the closeness of clusters are determined by predefined metrics. 3 Data Collection The input data set was obtained through Google Forms questionnaires from execu- tives in various companies and enterprises (limited liability companies (LLC), indi- vidual entrepreneur) and businesses (construction, trade, repair, logistics, services, etc.) Respondents answered 35 questions regarding two main aspects of running their business: • forms, organizations and spheres of activity; • level of informatization of business activities (use of digital tools in their work, work with social networks, planning services, analytics or advertising). Due to the specifics of the information requested and the use of different categories of questions, both quantitative and categorical data were received. For example, in- formation about the number of employees in an organization was obtained in the form of natural numbers, and information about the presence of a business model was pre- sented as a binary “yes” or “no” answer. There were some open-ended questions re- garding the respondent’s attitude to a particular problem related to informatization of the business structure. Such responses were excluded from the general clustering dataset. Numerous mechanical errors and blank answers were found in the data retrieval. These problems were solved with manual processing. However, as the number of respondents increases, such processing will require the unification of the possible answer options for each of the questions or the reduction of all possible answers only to the choice of the suggested ones. 4 Data Preparation Data preparation consisted of the steps of clearing and encoding data, missing values were not identified in this study. 4.1 Clearing data Both the hierarchical agglomerative algorithm and the EM method were run for the same data set, so pre-processing due to data cleaning was performed equally for both methods. The answers to the open-ended questions were reduced to a specific template, for example, only “yes” or “no”, or otherwise unified, for example how is shown in Fig 1. Attributes of the “automatically calculated questionnaire time” or “respondent’s per- sonal attitude” were marked as informative and removed from the task input. All ma- nipulations were performed manually due to the small dimension of the task. Fig.1. Respondents' answers to the questionnaire 4.2 Data encoding Using the MS Excel add-on requires no special training and accepts a simple spread- sheet of values of any type. Therefore, data encoding was not performed for clustering with the Data Mining add-in in MS Excel. Data encryption was required to work with Python machine learning libraries since most algorithms use mathematical operations on quantitative data. For those questions where it was possible to rank more or less or better or worse, ranked value coding was used. Responses were coded from 0 to some positive number, where 0 meant a single answer “no” or a number close to 0, and other values were ranked according to the increase in manifestation of the sign. Non-ranking answer options are nominal type data and have been indicated by some character numbers. For further computational work of the algorithm with such data, the Hower metric was used, which makes it possible to work with both quantita- tive and categorical numerical data at the same time. Respondents' coded answers are shown in Fig. 2. Fig 2. The survey respondents' coded answers 5 Solving the clustering problem The main objectives of the study were:  experimental finding of the optimal number of clusters and their characteristic features for the interpreted (understandable) segmentation of business structures according to the level of digital maturity by several methods;  comparing the results obtained by different methods and determining the most effective for a particular data analysis task. The study used two methods of clustering: 1. Using the Data Mining add-in for MS Excel spreadsheets [13]. Clustering capabili- ties in MS Excel are represented by iterative algorithms: k-means and Expectation- Maximization. For the reference, it was determined EM-algorithm; 2. Using the functions of libraries for machine learning Python programming lan- guage [14,15]. To describe how it works with two algorithms, we have introduced the notation: N respondents 𝑈 = {𝑢 ⃗⃗⃗⃗1 , ⃗⃗⃗⃗ 𝑢𝑁 } and M questions 𝑄 = {𝑞1 , 𝑞2 , … , 𝑞𝑀 }. Every partic- 𝑢2 , … , ⃗⃗⃗⃗⃗ ⃗ 𝑖 ∈ 𝑈 (𝑙 ∈ 1, ipant 𝑢 ̅̅̅̅̅𝑁) answered each of the questions 𝑞 ∈ 𝑄 (𝑘 ∈ 1, 𝑘 ̅̅̅̅̅̅ 𝑀), so the result is a matrix of responses with dimension(𝑁 × 𝑀), in which each respondent is represented as follows: ⃗𝑢𝑙 = {𝑢𝑙1 , 𝑢𝑙2 , … , 𝑢𝑙𝑘 , … , 𝑢𝑙𝑀 }, where 𝑢𝑙𝑘 is an answer l- respondents to k-question (Fig. 3). In the future, we call this tuple a point [8]. Fig. 3. The matrix of answers Let us consider the principles of the selected methods. 5.1 Method of hierarchical agglomeration The principle of operation of the modified agglomerative method is described in de- tail by [8]. According to the agglomerative approach to clustering, each point is con- sidered to be a separate cluster at the beginning. As the algorithm works, each of the two closest clusters is merged at each step, eventually forming a predefined number of clusters or merging into one. To get started with the agglomerative algorithm, we build a matrix of pairwise distances between the objects of the cluster. In the context of the problem, the Hower metric (1) proposed in [8] was used to calculate the dis- tance matrix. 1 𝑑(𝑢 ⃗⃗⃗𝑖 , ⃗⃗⃗ 𝑢𝑗 ) = ∑𝑀 𝑘=1 𝑑𝑖𝑗𝑘 , (1) 𝑀 where 𝑑𝑖𝑗𝑘 = 𝑑(𝑢𝑖𝑘 , 𝑢𝑗𝑘 ) – the distance between the answers in the k-th question, M is the number of answers to the query in the tuple. The distance matrix Dk for the k-th question is symmetric: 0 𝑑12𝑘 𝑑13𝑘 … 𝑑1𝑁𝑘 0 𝑑23𝑘 … 𝑑2𝑁𝑘 Dk = 0 … 𝑑3𝑁𝑘 … … … 0 The symmetric matrix 𝐷 for distances between individual points of the cluster looks like: 0 𝑑(𝑢 𝑢2 ) 𝑑(𝑢 ⃗⃗⃗⃗1 , ⃗⃗⃗⃗ 𝑢3 ) ⃗⃗⃗⃗1 , ⃗⃗⃗⃗ … 𝑑(𝑢 𝑢𝑁 ) ⃗⃗⃗⃗1 , ⃗⃗⃗⃗⃗ 0 𝑑(𝑢 𝑢3 ) ⃗⃗⃗⃗2 , ⃗⃗⃗⃗ … 𝑑(𝑢 𝑢𝑁 ) ⃗⃗⃗⃗2 , ⃗⃗⃗⃗⃗ 𝐷= 0 … 𝑑(𝑢 𝑢𝑁 ) ⃗⃗⃗⃗3 , ⃗⃗⃗⃗⃗ … … … 0 The elements of the matrix D are the averaged values of the pairwise values of the distances calculated by the formula (1). All questionnaire weights are taken to be 1. The way if measuring distances 𝑑𝑖𝑗𝑘 depend on the type of data in k question. If 𝑢𝑖𝑘 and 𝑢𝑗𝑘 quantitative, then distance 𝑑𝑖𝑗𝑘 is expressed by the formula (2): |𝑢𝑖𝑘 −𝑢𝑗𝑘 | 𝑑𝑖𝑗𝑘 = . (2) max(𝑢𝑘 )−min(𝑢𝑘 ) In this case, 𝑑(𝑢 𝑢𝑗 ) ∈ [0; 1]. If 𝑢𝑖𝑘 and 𝑢𝑗𝑘 – nominal data that cannot be ordered, ⃗⃗⃗𝑖 , ⃗⃗⃗ then the distance is calculated by the formula (3): 0, 𝑢𝑖𝑘 = 𝑢𝑗𝑘 , 𝑑𝑖𝑗𝑘 = { (3) 1, 𝑢𝑖𝑘 ≠ 𝑢𝑗𝑘 . In both cases 𝑑𝑖𝑗𝑘 = 0 means identical answers of the respondents 𝑢𝑗 to k question, and 𝑑𝑖𝑗𝑘 = 1 – maximal difference. As a consequence, for the averaged distances calculated by formula (1), all values𝑑(𝑢 ⃗⃗⃗𝑖 , ⃗⃗⃗ 𝑢𝑗 ) ∈ [0; 1]. The distance between the individual clusters was by the distance neighbour meth- od. Clusters closest to the selected metric are merged, distances from newly created to other clusters are recalculated, the distance matrix is automatically updated, and clus- tering continues. The method of the far neighbour allows to allocate rather compact and stable structures corresponding to the task. 5.2 Expectation-Maximization method In contrast to the proposed modification of the agglomerative method, the fuzzy clus- tering EM algorithm presented in the Data Mining Add-in for Microsoft Excel was selected among the iterative algorithms. In this case, the main idea of the method is to assume that the elements of the input data set are independent random variables dis- tributed by a law, in most cases a normal Gaussian distribution [16,17,18,19]. When using the EM method, any object in the dataset is considered to belong to all clusters with different probabilities. Before starting the algorithm, the number of K clusters and the initial approximate parameters for each of the K distributions of the input data are specified. Iterations incrementally improve the distribution parameters to a predetermined level of model accuracy. Upon completion of the algorithm, each object will be assigned to a cluster with the highest probability of belonging. Thus, two successive steps are performed at each iteration: Further, the algorithm is based on an iterative repetition of two consecutive steps as shown in the Fig. 4. 1. Expectation is calculating the probability (plausibility) of the points belonging to each of the clusters; 2. Maximization is improvement of distribution parameter values to maximize the likelihood of points belonging to clusters. 5.3 Adjustment of algorithm parameters The MS Cluster Task Wizard in MS Excel allows you to select the desired parameters and adjust their values [13]. For this task, the list of questions that will affect the re- sult was changed in the algorithm settings, the value of the number of clusters and the cluster seed of the EM clustering method were set. Referring to the sklearn.cluster.AgglomerativeClustering method to create a clus- tering model using Python generally involves specifying 3 parameters: number of clusters, intra-cluster distance and inter-cluster distance metrics. The metric of the distance metric between the elements may be one of those proposed in [21] or other- wise calculated. Calling the function of creating a cluster model: model = AgglomerativeClustering (number of clusters = m, metric = “precomput- ed”, linkage = complete) labels = model.fit_predict(distances), where m − predefined number of clusters, distances – distance matrix previously calculated by the Hower metric (1) – (3). Fig. 4. Functional diagram of the EM algorithm, borrowed from [20] 6 Analysis of the results Let us consider the clustering results by each method and compare the results ob- tained. The clustering output of MS Excel's Data Mining add-on provides a break- down of the dataset into clusters with the ability to visualize, view statistics, and clus- ter profiles. Using the clustering methods of the sklearn library to analyse data on Python out- put provides a one-dimensional numeric array indicating which cluster each input tuple belongs to. Further analysis and visualization of the obtained result is carried out additionally. The algorithm for selecting the optimal number of clusters is described in [8]. 6.1 The results of hierarchical agglomerative clustering The matrix of distances D between the points of the input set now looks like this (the fragment of the matrix is shown in Table 1): Тable 1. Matrix of distances between questions As a result of the agglomerative clustering algorithm using Python sklearn, a stable distribution of 5 clusters was obtained, with a satisfactory value of the quality metric, the Silhouette index [22]: 𝑠𝑖𝑙 ≈ 0.16. A comparative analysis of the clusters obtained by main characteristics is shown in Table 2. Table 2. Comparative characteristics of clusters formed by agglomerative clustering with Py- thon tools The percentages indicate the proportion of respondents in each cluster who answered the same question the same way. The number of respondents who answered equally to the selected questions varied from 40 to 100%. To distinguish the characteristic fea- tures of the formed clusters, we leave the values greater than 80% and depict the comparison of the clusters. According to the results, the largest number of respondents (16) was attributed to the first cluster, the main characteristics of which are:  lack of experience with any digital tools;  absence of companies in the Internet environment;  site inefficiency, if any. The second cluster was formed by 5 companies, the main characteristics of which are defined as follows:  availability of companies in the Internet space;  usage of simple tools to a limited extent. 2 more respondents formed the third cluster that is characterized by:  effective functioning of the site and the purchase chain;  use of most digital tools including advertising;  the work of a marketer to promote a brand or product. The fourth cluster consists of 10 respondents, and its characteristic features are:  lack of functioning of the purchase chain on the site;  non-use of sophisticated digital tools;  availability of social networks only. The fifth cluster consists of only one company that successfully uses virtually all digital tools with the help of specialists. A sufficient degree of differences between clusters and a sufficient degree of simi- larity of elements within the cluster (80-100%) makes it possible to clearly identify the following groups and rank them by the level of use of digital technologies and tools in business activities. Table 3 shows the ranking of types of business structures by the decline in digital maturity. Table 3. Ranking of business structure clusters by digital maturity level 6.2 The results of ЕМ-clustering The clustering performed by the EM method in MS Excel proved to be unstable. Be- cause the EM algorithm is a group of iterative fuzzy clustering methods, this result is normal and suitable for use in a particular class of tasks. The comparative characteris- tics of the clusters obtained are shown in Table 4. In contrast to agglomerative cluster- ing, the degree of uniformity of answers to questions within the clusters is much low- er, and fluctuates on average within 60-70%. As we can see, the degree of difference between clusters is also low. Repeated application of the EM method did not improve the quality of the results. Similar to the previous clustering method, a cluster was identified that included two business entities that are actively using digital technology in their businesses. The differences between the other clusters are small, the differences in the percentage of answers to the questions are minimal, so it is impossible to distinguish the characteris- tics of each subset. The inability to distinguish clusters with distinct features does not meet the objective of the study. Table 4. Comparative characteristics of clusters formed by using an iterative approach 7 Conclusions and future work In this study, we conducted an experimental comparison of the use of two approaches to the clustering of respondents according to online survey results using the Google Forms service, hard and soft clustering, in particular. Hard clustering was implement- ed with the use of Python tools and the hierarchical agglomerative method, while soft clustering was viewed through the use of the Data Mining add-in MS Excel and the iterative EM method. A comparative analysis of the results obtained by the two methods showed the fol- lowing results:  Using hierarchical agglomerative clustering, we obtained 5 clusters, sufficiently different from each other and with a high degree of similarity between the elements of the cluster (60-100% depending on the question). The cluster features are distin- guished (use of social networks, advertising offices and services, analytical tools, search engine optimization of sites, etc.);  the use of the EM method did not allow to obtain good clustering results and to achieve the goal of the task, the results of the EM method implementation changed with each run of the algorithm. It has been experimentally established that the method of agglomerative hierar- chical clustering is an effective method for solving the problem of clustering of mixed-type data obtained from the survey of respondents. In addition to improving the parameters of the algorithm, the tasks for the further studies are: elimination of mechanical errors when entering answers and the presence of empty values; diversity of data, which causes complexity of their unification and proper ordering; selection of mathematical metrics used as arguments for clustering functions and calculating their quality. References [1] Cherezov D., Tyuakchev N. Obzor osnovnykh metodov klassifikatsii I klasterizatsii dannykh [Overview main methods of data classification and clustering]. Vestnik VGU, Seria: Sistemnyi analiz i informachionnye tekhnologii, no 2. pp.25-29, 2009. Avaliable at: http://www.vestnik.vsu.ru/pdf/analiz/2009/02/2009-02-05.pdf (Accessed 14 August 2019). (In Russian). [2] I. Katakis, N. Tsapatsoulis, C. Tziouvas and F. Mendes. (2012). Clustering Online Poll Data: Towards a Voting Assistance System, 2012 Seventh International Workshop on Semantic and Social Media Adaptation and Personalization, http://www.katakis.eu/wp- content/uploads/2014/11/katakissmap12.pdf.doi.org/10.1109/SMAP.2012.19 [3] J. McCaffrey, Machine Learning Using C#. Syncfusion, 2014. [4] Chae, S. S., Kim, J.-M. & Yang, W. Y., (2006). Cluster analysis with balancing weights on mixed- type data. The Korean communications in statistics, 13(3) doi.org/10.5351/CKSS.2006.13.3.719 [5] Gower JC (1967) A comparison of some methods of cluster analysis. Biometrics 23:623–637 doi.org/10.2307/2528417 [6] William M. Rand. Objective Criteria for the Evaluation of Clustering Methods. Journal of the American Statistical Association. Vol. 66, No. 336 (Dec., 1971), pp. 846-850 doi.org/10.2307/2284239 [7] Gower, J. C., (1971). A General Coefficient of Similarity and Some of Its Properties. Biometrics, 27(4), p. 859. doi.org/10.2307/2528823 [8] I.Strutynska, H.Kozbur, L.Dmytrotsa, I.Bodnarchuk and O.Hlado. (2019) Small and Medium Business Structures Clustering Method Based on Their Digital Maturity. 2019 International Scientific-Practical Conference Problems of Infocommunications. Science and Technology, pp.278- 282 (in print) [9] Klasterizatsia [Clustering] Wikiconspekts, ITMO University [Online]. 2019. Avaliable: http://neerc.ifmo.ru/wiki/index.php?title=Кластеризация (in Russian) [10] Cluster analysis, Wikipedia [Online]. Avaliable: https://en.wikipedia.org/ wiki/Cluster_analysis [11] Zatserkovnyi V. I., Burachek V. H., Zhelezniak O. O., Tereshchenko A. O. Heoinformatsiini systemy i bazy danykh [Geoinformation systems and databases]. Nizhyn, Ukraine: NDU im. M. Hoholia, 2017. pp. 77-95. Avaliable at: https://studfiles.net/preview/6440954 (In Ukrainian). [12] Google Forms, https://www.google.com/intl/uk_ua/forms [13] Microsoft (2017 Dec). Cluster Wizard (Data Mining Add-ins for Excel), Microsoft Docs [On-line]. Avaliable: https://docs.microsoft.com/en-us/sql/analysis-services/cluster-wizard-data-mining-add-ins- for-excel?view=sql-server-2014 [14] Python, https://www.python.org. [15] Scikit. Clustering documentation. Scikit learn [Online]. Avaliable: https://scikit- learn.org/stable/modules/clustering.html. [16] Expectation-maximization algorithm, Wikipedia [Online]. Avaliable: https://en.wikipedia.org/wiki/Expectation-maximization_algorithm [17] Yair Weiss. Bayesian motion estimation and segmentation. PhD thesis, Massachusetts Institute of Technology, May 1998. [18] Zinchenko D. EM-algoritm klasterizatsii [EM-clustering algorithm], AlgoWiki [Online], 2016. Avaliable: https://algowiki-project.org/ru/Участник:Noite/EM-алгоритм_кластеризации [19] Oreshkov V. EM – masshtabiruyemyy algoritm klasterizatsii [EM – scalable clustering algorithm], BaseGroup Labs [Online], 2013. Avaliable: https://basegroup.ru/community/articles/em [20] Lotfi, E. (2018). [image] Available at: https://www.researchgate.net/profile/Elaachak_Lotfi/publication/322641344/figure/fig3/AS:5856073 73922305@1516631086753/flowchart-of-EM-algorithm.png [Accessed 15 Oct. 2019]. [21] Scikit. Agglomerative Clustering. Scikit learn [Online]. Avaliable: https://scikitlearn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html [22] Scikit. Silhourtte Score. Scikit learn [Online]. Avaliable: https://scikit-learn.org/stable/modules/ generated/sklearn.metrics.silhouette_score.html