Analysis of Business Structures Regarding the Level of
                                Digital Maturity Using Data Mining Methods
                                Iryna Strutynska1,†, Halyna Kozbur2,†, Olena Sorokivska2,∗,†, Lesia Dmytrotsa2,†and Ihor
                                Kozbur2,†
                                1
                                  The Netherlands Loughborough University London, 3 Lesney Avenue, Queen Elizabeth Olympic Park, London, E20
                                3BS, UK
                                2
                                  Ternopil Ivan Puluj National Technical University, Ruska 56 46001 Ternopil, Ukraine


                                                Abstract
                                                Cluster analysis is proposed as an unsupervised machine learning method to divide small and
                                                medium-sized businesses in Ukraine into groups based on their level and types of digital maturity.
                                                The input data used is a dataset formed by expert assessments of the state of digital technology
                                                usage in regional small and medium-sized businesses. The Digital Transformation Index "HIT" is
                                                used to numerically measure the level of digital maturity of domestic enterprises. Various
                                                approaches to building clustering models are implemented using built-in methods in the scikit-learn
                                                library for Data Mining problems. The quality of the constructed models is evaluated using three
                                                indicators. Groups of companies are identified based on similarity in understanding digital
                                                development, and a comparative analysis is performed. Performing clustering for a representative
                                                sample of domestic small and medium-sized businesses will allow understanding the current state
                                                of their use of digital technologies and developing a well-reasoned system of actions to effectively
                                                digitize entrepreneurship in Ukraine.


                                                Keywords
                                                Data Mining Algorithms, Digital Transformation, ICT for Data Analysis, Scikit-learn, Clustering1


                                1. Introduction
                                Digital transformation of small and medium-sized enterprises (SMEs) is a top priority for the
                                development of the Organization for Economic Cooperation and Development (OECD). OECD
                                policy tools, such as the "Digital Policy Framework" and the approved national program
                                "Digitalization for Recovery in Ukraine", envisage that in the long-term perspective (2026-
                                2032) Ukraine can focus on creating a sound data infrastructure for measuring the digital
                                economy [1].
   The processes of digital transformation in domestic SMEs (the transformation of their
business strategies, models, operations, goals, marketing approaches, etc. towards increased
use of digital technologies and improved efficiency) are slow and underdeveloped. One of
the problems is the lack of necessary knowledge among entrepreneurs regarding the
application of innovative digital technologies, as well as the insufficient number of tools
(platforms, services, or applications) that would allow them to assess the current level of
digital maturity of individual enterprises and at the same time provide a roadmap of digital
opportunities for business transformation. Clustering SMEs by the level and types of digital
maturity will make it possible to understand the current state of digitalization, identify
problem groups of enterprises and bottlenecks in the process of digital transformation, and
recommend a reasoned, systemic program of actions for effective digital growth.

1 BAIT’2024: The 1st International Workshop on “Bioinformatics and applied information technologies”, October 02-04,
2024, Zboriv, Ukraine
∗ Corresponding author.
† These authors contributed equally.
soroka220996@gmail.com (O. Sorokivska); strutynskairy@gmail.com (I. Strutynska); kozbur.galina@gmail.com
(H. Kozbur); dmytrotsa.lesya@gmail.com (L. Dmytrotsa); kozbur.igor@gmail.com (I. Kozbur)
0000-0001-8549-2910 (O. Sorokivska); 0000-0001-5667-6569 (I. Strutynska); 0000-0003-3297-0776 (H. Kozbur);
0000-0003-2583-3271 (L. Dmytrotsa); 0000-0002-3113-0014 (I. Kozbur)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings, ceur-ws.org, ISSN 1613-0073

2. Related works
The process of digitalization of business and the use of digital technologies in activities is the
subject of many scientific studies. Thus, in the work of J. Cenamor, V. Parida, and J. Vincent,
the relationship between the use of digital platforms and small business performance
indicators is analyzed [2]. Features of the use of digital business models are highlighted in the
works of N. Ivanchenko, Zh. Kudrytska, K. Rekachynska [3], N. Kraus, O. Holoborodka,
K. Kraus [4]. Digital transformation is proposed to be considered as "processes that aim to
improve an economic entity by triggering significant changes in its properties through a
combination of information, computing, communications and connectivity" [5]. Digital
transformation affects business processes, operational procedures, and organizational
capabilities [6], requiring enterprises to update workforce skills, achieve a certain level of
digital maturity, and improve productivity and efficiency.
    R. Ochoa in [7] summarizes and forms the semantic core of literature reviews of various
scientists regarding the definition of the digital maturity models. Domestic scientists pay
attention to factors specific to Ukraine (in particular, the low level of digital literacy of society
and cyber security, insufficient regulatory and legal regulation of digitalization), which reduce
the interest of small businesses in the digitalization of business processes [8, p. 231; 9, p. 58].
In connection with this, an important direction of scientific research in the field of digitization
is the study of the peculiarities of the formation of the digital space in Ukraine, as well as the
participation of the state in the institutional and legal regulation of this process (O.
Pishchulina [8], H. Zhekalo [9], H. Karcheva, D. Ohorodnia, and V. Open'ko [10]).
    Investigating the use of digital tools by business organizations [11, 12], the authors
developed methodologies for applying mathematical and computer modeling methods to
measure the level of digital transformations [13, 14]. The main methodological tool of this
study is cluster analysis. General problems of clustering are fully covered in the sources [15,
16]. Authors of scientific studies use diversified methods of cluster analysis, depending on the
problem to be solved. Thus, C. Iyigun, M. Türkeş, I. Batmaz, C. Yozgatligil, V. Purutçuoğlu,
E. Kartal, M. Öztürk [17] and K. Sablin, E. Kagan, E. Chernova [18] use hierarchical
clustering methods, while K. Gorbatiuk, O. Mantalyuk, O. Proskurovych, O. Valkov [19]
study fuzzy clustering methods. Cluster analysis is often used in scientific works by
both domestic and foreign authors to perform macro analysis, namely the differentiation of
socio-economic development of regions. Works [20, 23-25] are devoted to various directions
of building clusters among the regions of Ukraine. As for tasks at the micro level, many
scientific works are focused on the study of financial transactions in banking institutions and
trade organizations. The work of M. R. Pinto, P. K. Salume, M. W. Barbosa, and P. R. de
Sousa [26] is particularly informative: it clusters retail trade enterprises by their level of
digital maturity along five dimensions (strategy, market, operations, culture and
technology). Culture is proposed as a driver of digital transformation.
   The importance of digital education, awareness, and skills for entrepreneurship, as well as
the use of data analysis techniques in digital business transformation processes, has been
discussed in the works of domestic and foreign scientists [27-31]. However, the question of
clustering business structures by the level of digital maturity in order to develop practical
recommendations for digital transformation currently requires further study.

3. Methodology for assessing the level of digital maturity of
   Ukrainian enterprises
Many countries have their own methodologies, frameworks, and tools for measuring digital
maturity and digital transformation of business structures. For example, the UK uses diverse
tools (Digital Acceleration Index (DAI) (Boston Consulting Group (BCG) and Google), The
Digital Scorecard (Lloyds Bank), Digital Maturity Assessment (Department for Digital,
Culture, Media & Sport (DCMS)), Digital Capability Assessment Tool (Department for
Business, Energy and Industrial Strategy (BEIS)), Digital Business Academy Assessment (Tech
Nation, a UK-based network for entrepreneurs)) based on different methodologies to
understand the situation of digital business development. Collecting and processing relevant
data provides an understanding of the development and implementation of various digital
technologies and enables the formation of digital transformation "roadmaps".
    The current state of digital technology use in domestic businesses differs sharply from that
in the rest of the world. The use of international methodologies to determine the level of digital maturity in
business using relevant indicators is not acceptable for domestic realities due to the low
overall level of the use of digital technologies in the economic space. The low level of
awareness of small and medium-sized enterprises about the opportunities for integrating
technologies into their business processes hinders the development of companies and creates
difficulties in the entry of domestic businesses into the international arena. Therefore,
research on the development of digital transformation indicators for businesses, regular
assessments of digital development, and the implementation of regular, systematic statistical
observations [11, 12] deserve special attention.
    It is necessary to develop our own methodology for determining the digital transformation
index of businesses with corresponding indicators that reflect the current state of affairs,
provide a deep analysis of the digital maturity indicators of business structures and take into
account their dynamics, while remaining flexible to quickly respond to new economic
processes and phenomena and ensure further alignment with international methodologies for
comparing Ukraine with the most developed countries in the world.
    A methodology for determining the Digital Transformation Index “HIT” of domestic SMEs
was proposed in [14]. It makes it possible not only to evaluate the level of digital maturity of a
business structure but also to obtain a vector of digital development strategy. The main indicators of the
HIT index are:
            Humans (H): digital literacy (competence) of human capital, which is defined as the
             ability of an employee to perform complex tasks and requirements that involve both
             professional and personal digital skills.
            Instruments (I): use of digital tools, which includes components such as social media
             management, website functioning and search engine optimization, work with
             specialized business process automation systems, etc.
            Technologies (T): use of digital technologies, that is, the level of enterprise
             infrastructure provision with necessary equipment (personal computers, laptops,
             smartphones) and broadband Internet.

   The value of the Digital Transformation Index is calculated as a weighted sum of the
values of the three corresponding indicators:

                $$ HIT = \omega_H \cdot \Sigma_H + \omega_I \cdot \Sigma_I + \omega_T \cdot \Sigma_T, \qquad HIT \in [0; 1]; \tag{1} $$

where $\Sigma_H$ is the aggregated indicator of the digital literacy level of the organization's
human capital; $\Sigma_I$ is the aggregated indicator of the functioning of digital tools integrated
into the organization's business processes; $\Sigma_T$ is the aggregated indicator of the
functioning of the organization's digital infrastructure; $\omega_H$, $\omega_I$, $\omega_T$ are the respective weight
factors of the indicators, where $\omega_H + \omega_I + \omega_T = 1$.
   The weight factors were obtained by expert evaluation: $\omega_H = 0.3$, $\omega_I = 0.5$, $\omega_T = 0.2$.
   The aggregated indicators $\Sigma_X$ for each of the indicators H, I, T are calculated using the
formula:

                $$ \Sigma_X = \sum_{i=1}^{m_X} n_i^{(X)} \cdot k_i^{(X)}, \tag{2} $$

where $\Sigma_X$ is the aggregated value of indicator X (H, I, or T);
   $m_X$ is the number of components of indicator X;
   $n_i^{(X)}$ is the functioning level of the ith component of indicator X;
   $k_i^{(X)}$ is the weight factor of the ith component of indicator X.
    Depending on the obtained value of the HIT index, the following gradations of the level of
digital maturity of domestic SMEs were defined: [0; 0.2) is considered very low; [0.2; 0.4) – low;
[0.4; 0.6) – medium; [0.6; 0.8) – high; and [0.8; 1] – very high.
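
Formulas (1) and (2) and the gradation above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' application: the function names are hypothetical, and the component levels and component weights in the usage lines are invented for the example. Only the indicator weights 0.3, 0.5, 0.2 and the interval boundaries come from the text.

```python
# Indicator weights reported in the text (obtained by expert evaluation).
WEIGHTS = {"H": 0.3, "I": 0.5, "T": 0.2}

# Lower bounds of the maturity levels listed in the text, highest first.
LEVELS = [(0.8, "very high"), (0.6, "high"), (0.4, "medium"),
          (0.2, "low"), (0.0, "very low")]


def aggregate(levels, weights):
    """Formula (2): weighted sum of the component functioning levels."""
    if len(levels) != len(weights):
        raise ValueError("levels and weights must have the same length")
    return sum(n * k for n, k in zip(levels, weights))


def hit_index(sigma_h, sigma_i, sigma_t):
    """Formula (1): weighted sum of the three aggregated indicators."""
    return (WEIGHTS["H"] * sigma_h
            + WEIGHTS["I"] * sigma_i
            + WEIGHTS["T"] * sigma_t)


def maturity_level(hit):
    """Map a HIT value in [0; 1] to its maturity level."""
    if not 0.0 <= hit <= 1.0:
        raise ValueError("HIT must lie in [0; 1]")
    for lower, label in LEVELS:
        if hit >= lower:
            return label


# Hypothetical enterprise: three components of indicator I with
# made-up levels and made-up component weights (summing to 1).
sigma_i = aggregate([1.0, 0.5, 0.0], [0.4, 0.4, 0.2])
hit = hit_index(sigma_h=0.5, sigma_i=sigma_i, sigma_t=1.0)
print(round(hit, 3), maturity_level(hit))
```

Because the half-open intervals are checked from the top down, a value exactly on a boundary such as 0.4 falls into the higher level, which matches the bracket notation used above.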


4. Dataset description
The dataset represents the results of a survey conducted through Google Forms among
Ukrainian entrepreneurs. Thirty-four representatives of various small and medium-sized
businesses registered in the Ternopil region participated in the survey. Participants were
asked to answer 29 questions related to the level of digitization of business activity based on
the components of the HIT index. The set of responses was defined as an experimental
dataset.
   The answers of N respondents to M questions formed a matrix of dimension (N × M). It is
assumed that each participant $\vec{u}_i$ answered each of the questions $q_k$. Thus, each surveyed
participant is represented in the form of the vector $\vec{u}_i = \{u_{i1}, u_{i2}, \ldots, u_{ik}, \ldots, u_{iM}\}$, where $u_{ik}$ is
the answer of the ith participant to the kth question. Each such vector is treated as a point in
the rest of this work.


Figure 1: Matrix of Answers.

   Encoding was used to transform categorical data into numeric data (Figure 2).


Figure 2: The table portion of the input dataset with encoded values.
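
The encoding step can be illustrated with pandas. The paper does not list its actual coding scheme, so the column names, answer options, and numeric codes below are hypothetical and serve only to show how categorical answers become the numeric N × M matrix.

```python
import pandas as pd

# Hypothetical survey answers: one row per respondent, one column per question.
answers = pd.DataFrame({
    "has_website": ["yes", "no", "yes"],
    "social_media_use": ["effective", "none", "ineffective"],
})

# Hypothetical ordinal codes (assumption: higher code = more advanced usage).
codes = {
    "has_website": {"no": 0, "yes": 1},
    "social_media_use": {"none": 0, "ineffective": 1, "effective": 2},
}

# Replace every categorical answer with its numeric code.
encoded = answers.copy()
for column, mapping in codes.items():
    encoded[column] = answers[column].map(mapping)

# Each row of the resulting matrix is one respondent vector u_i.
matrix = encoded.to_numpy()
print(matrix)
```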

   All procedures related to data processing were performed in a specially developed software
application using Python. Python libraries used at various stages of the research:

       scikit-learn – for using clustering algorithms and computing quality metrics;
       scipy – for computing distance matrices based on a dataset;
       matplotlib – for visualizing obtained data in the form of graphs;
       pandas – for storing and manipulating a dataset in a special structure, a dataframe.

5. Choice of Clustering Specifications
After obtaining the values of the three components of the HIT index for each SME, the data
set consisted of 34 items with 3 numerical attributes. Clustering of preprocessed data using
the defined method and distance measure was performed sequentially using the number of
clusters from 2 to 8. For each obtained clustering model, quality metrics (Silhouette, Calinski-
Harabasz, and Davies-Bouldin indices) were calculated. Based on visual analysis of the
dependencies, the optimal number of clusters was selected. Figure 3 plots each quality
index against the number of clusters for agglomerative clustering using
cosine distance and Ward linkage.


Figure 3: Choosing the optimal number of clusters by: (A) – Silhouette Coefficient, (B) –
Calinski-Harabasz Index, (C) – Davies-Bouldin Index.

   As shown in Figure 3, the Silhouette and Calinski-Harabasz indices reach local maxima at
3 and 8 clusters, and the Davies-Bouldin index has local minima at the same points. Since
8 clusters seemed too many for a dataset of 34 points, 3 clusters were chosen.
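
The sweep over the number of clusters can be sketched with scikit-learn as follows. The data here are a synthetic 34 × 3 stand-in, not the survey data, and the sketch assumes Ward linkage with its default Euclidean distance; it shows only the general procedure of computing the three indices for 2 to 8 clusters.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Synthetic stand-in for the 34 x 3 matrix of (H, I, T) values.
rng = np.random.default_rng(0)
X = rng.random((34, 3))

scores = {}
for k in range(2, 9):
    # Ward linkage on Euclidean distance (assumption for this sketch).
    labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X)
    scores[k] = {
        "silhouette": silhouette_score(X, labels),                # higher is better
        "calinski_harabasz": calinski_harabasz_score(X, labels),  # higher is better
        "davies_bouldin": davies_bouldin_score(X, labels),        # lower is better
    }

for k, s in scores.items():
    print(f"k={k}: silhouette={s['silhouette']:.3f}, "
          f"CH={s['calinski_harabasz']:.2f}, DB={s['davies_bouldin']:.3f}")
```

Plotting each index in `scores` against k with matplotlib would give curves of the kind described for Figure 3.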
   Since a distance metric is a selectable parameter for only two of the clustering methods
used, agglomerative and OPTICS, the set of criteria (distance, number of clusters,
neighbors) was selected only for them. For each distance metric, the optimal number of
clusters was determined. Then, among all the distance metrics tried, the one that showed the
best results for the current method was selected. The result of this comparison for the
agglomerative method is shown in Table 2.
Table 2
An example of choosing the optimal metric and number of clusters
   Metric used for          Number of    Silhouette     Davies-Bouldin    Calinski-Harabasz
   intracluster distance    clusters     Coefficient    Index             Index
   Euclidean                3            0.34           1                 18
   Cosine                   3            0.65           1.4               11
   Manhattan                7            0.36           0.9               18
   Chebyshev                4            0.36           1                 17
   Hamming                  7            0.13           3                 3.5
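
A comparison of distance metrics of this kind can be sketched with scipy distance matrices and hierarchical clustering. Several choices below are assumptions made for the illustration and are not taken from the paper: the data are synthetic, the helper name `silhouette_for` is hypothetical, and average linkage is used because it accepts an arbitrary precomputed distance.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics import silhouette_score

# Synthetic, strictly positive 34 x 3 data so that cosine distance is defined.
rng = np.random.default_rng(0)
X = rng.random((34, 3)) + 0.1


def silhouette_for(data, metric, n_clusters):
    """Cluster with the given metric and score the result with the same metric."""
    # Condensed pairwise distances; clip tiny negative rounding errors.
    condensed = np.clip(pdist(data, metric=metric), 0.0, None)
    tree = linkage(condensed, method="average")
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    if len(set(labels)) < 2:
        return None  # silhouette is undefined for a single cluster
    return silhouette_score(squareform(condensed), labels, metric="precomputed")


metrics = ("euclidean", "cosine", "cityblock", "chebyshev")
results = {m: silhouette_for(X, m, n_clusters=3) for m in metrics}
for m, value in results.items():
    print(m, value)
```

In scipy, `cityblock` is the name of the Manhattan distance. Repeating the call for each candidate number of clusters would yield one row per metric, as in Table 2.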


    A similar evaluation was conducted for each used method and distance measure. For each
of the methods used, a summary analytical table was compiled with the main characteristics
of the formed clusters (Tables 3-7). The figures also show a scatter plot of the dependence of
the HIT index on the level of use of digital instruments (on the left) and a bar chart of clusters
by HIT index value (on the right). The elements that belong to one cluster are highlighted in
the same color.
    1. The dataset was divided into 3 clusters using the K-means clustering algorithm. As seen
in the scatter plot in Figure 4, the clusters barely intersect and their members are
sufficiently similar to one another. Cluster #2 (blue dots) is clearly separated and lies at the
bottom of the graph of HIT index value versus the use of digital tools. Cluster #1 contains
most of the points, which lie in the middle intervals of both the HIT index value and the use
of digital tools. Cluster #3 is characterized by the highest index values.
    Members of Cluster #1 are partially effective in using social networks but do not use their
own websites, advertising or analytics tools, while having sufficient technical equipment. The
literacy of the human capital is at an elementary level (Table 3).
    Cluster #2 shows similar indicators to Cluster #1, except that they do not use social
networks or use them inefficiently, and the companies lack sufficient technical equipment. In
contrast, Cluster #3 includes respondents who more effectively use the necessary digital tools:
websites, social networks, advertising, and have sufficient human capital literacy.
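
The K-means step and the per-cluster value ranges of the kind reported in Table 3 can be sketched as follows. The data are synthetic and the column names are chosen for the illustration; only the number of clusters and the HIT weights come from the text.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the 34 enterprises with H, I, T indicator values.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.random((34, 3)), columns=["H", "I", "T"])
data["HIT"] = 0.3 * data["H"] + 0.5 * data["I"] + 0.2 * data["T"]

# K-means with 3 clusters on the three indicators.
model = KMeans(n_clusters=3, n_init=10, random_state=0)
data["cluster"] = model.fit_predict(data[["H", "I", "T"]])

# Cluster sizes and the min/max range of every indicator per cluster.
sizes = data["cluster"].value_counts().sort_index()
summary = data.groupby("cluster")[["H", "I", "T", "HIT"]].agg(["min", "max"])
print(sizes)
print(summary.round(3))
```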
    2. The agglomerative method with the Euclidean distance measure and Ward linkage gave a
fairly good division into 3 clusters (Figure 5). Cluster #2 (blue dots), which contains
respondents with the lowest HIT index values, is fairly well separated. Clusters #1 and #3
are fairly spread out in space, although they overlap at a few points. A comparison of the
main characteristics of the formed clusters is presented in Table 4.


Figure 4: Results of clustering using the K-means method with Euclidean distance. Panels: “HIT
index value by indicator ‘I’” (axis label: indicator “I” value) and “HIT index value by participants”.

Table 3
Main characteristics of the clusters formed by the K-means method with Euclidean distance

K-means                                   Cluster #1 (18)           Cluster #2 (8)              Cluster #3 (8)

Ranges of indicator values:
   H                                      [0; 0.364]                [0; 0.364]                  [0.636; 1]
   I                                      [0.128; 0.614]            [0.067; 0.657]              [0.29; 0.826]
   T                                      [0.7; 1]                  [0; 0.5]                    [0.5; 1]
   Weighted Sum (HIT)                     [0.234; 0.56]             [0.11; 0.488]               [0.44; 0.91]

Characteristic (status, percentage of cases):
   Website availability,
   optimization and effectiveness         Not optimized, 61.1%      Not optimized, 70.0%        Optimized, 70.0%
   Social media availability
   and effectiveness                      Effectively, 50.0%        Not effectively, 70.8%      Effectively, 70.0%
   Use of online advertising
   and analytics                          Not used, 74.1%           Not used, 91.6%             Used, 58.3%
   Use of specialized
   management systems                     Not used, 80.2%           Not used, 73.2%             Not used, 71.4%
   Use of specialized
   technical systems                      Not used, 96.4%           Not used, 79.2%             Not used, 87.5%
   Level of technical support             Satisfactory, 98.1%       Not satisfactory, 62.5%     Satisfactory, 83.3%
   Level of Digital Literacy              Basic, 50.0%              Basic, 62.5%                Intermediate or above
                                                                                                intermediate, 87.5%
   Communication channels                 With the use of ICT,      With the use of ICT,        With the use of ICT,
                                          74.7%                     83.3%                       75.0%

Quality metrics (whole model):
   Silhouette Coefficient                 0.411
   Calinski-Harabasz Index                24.105
   Davies-Bouldin Index                   0.889


   Cluster #1 members, who belong to the area with the highest indicator values, effectively
use the website and social media, and also have a level of digital literacy that is at or above the
average for most respondents. In contrast, Cluster #2 is characterized by ineffective use of
digital tools for most members, as well as low digital literacy and unsatisfactory technical
equipment for more than half of the surveyed. Cluster #3 has a certain intensity of social
media use, but low indicators in other areas, such as elementary level of digital literacy among
employees.

Figure 5: Results of clustering using the Agglomerative method with Ward linkage. Panels: “HIT
index value by indicator ‘I’” (axis label: indicator “I” value) and “HIT index value by participants”.

Table 4
Main characteristics of the clusters formed by the Agglomerative method with Ward linkage

Agglomerative clustering                  Cluster #1 (9)            Cluster #2 (8)              Cluster #3 (17)

Ranges of indicator values:
   H                                      [0.2; 1]                  [0; 0.364]                  [0; 0.364]
   I                                      [0.097; 0.826]            [0.067; 0.357]              [0.097; 0.614]
   T                                      [0.25; 1]                 [0; 0.7]                    [0.75; 1]
   Weighted Sum (HIT)                     [0.43; 0.91]              [0.11; 0.26]                [0.28; 0.56]

Characteristic (status, percentage of cases):
   Website availability,
   optimization and effectiveness         Optimized, 68.9%          Not optimized, 80.0%        Not optimized, 60.0%
   Social media availability
   and effectiveness                      Effectively, 70.3%        Not effectively, 75.0%      Effectively, 51.0%
   Use of online advertising
   and analytics                          Not used, 55.6%           Not used, 100.0%            Not used, 72.5%
   Use of specialized
   management systems                     Not used, 65.1%           Not used, 82.1%             Not used, 79.8%
   Use of specialized
   technical systems                      Not used, 88.9%           Not used, 79.2%             Not used, 98.0%
   Level of technical support             Satisfactory, 77.8%       Not satisfactory, 58.3%     Satisfactory, 100.0%
   Level of Digital Literacy              Intermediate or above     Basic, 75.0%                Basic, 70.6%
                                          intermediate, 83.3%
   Communication channels                 With the use of ICT,      With the use of ICT,        With the use of ICT,
                                          77.8%                     75.0%                       76.5%

Quality metrics (whole model):
   Silhouette Coefficient                 0.398
   Calinski-Harabasz Index                22.497
   Davies-Bouldin Index                   0.954


   3. OPTICS was applied with the Chebyshev distance metric and a minimum of 7 points for
cluster formation. Despite obtaining an optimal value for quality metrics, the clustering itself
was not successful from a practical standpoint. As can be seen in the visualization in Figure 6,
the clusters contain almost the same number of members. Additionally, the clusters were
distributed as internal and external, making it impossible to establish fundamental differences
between them, as seen in Table 5.

Figure 6: Results of clustering using the OPTICS method with Chebyshev distance and 7
neighbors. Panels: “HIT index value by indicator ‘I’” (axis label: indicator “I” value) and “HIT
index value by participants”.

   The reason for this result is that OPTICS is a density-based algorithm, and the underlying
data set does not contain dense areas. Therefore, the internal cluster (green) turned out to be
an artificial area of dense values, while the external one was marked as outliers, that is,
points treated as noise rather than as members of a meaningful group.
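
The OPTICS configuration described above can be sketched with scikit-learn. The data are synthetic, so the resulting split will not match the paper. The sketch only shows how the settings are passed and how points labelled -1 (noise) are counted.

```python
import numpy as np
from sklearn.cluster import OPTICS

# Synthetic stand-in for the 34 x 3 matrix of (H, I, T) values.
rng = np.random.default_rng(0)
X = rng.random((34, 3))

# Chebyshev distance and at least 7 points per dense neighbourhood.
optics = OPTICS(min_samples=7, metric="chebyshev")
labels = optics.fit_predict(X)

# scikit-learn marks points outside any dense region with the label -1.
n_noise = int(np.sum(labels == -1))
n_clusters = len(set(labels.tolist()) - {-1})
print(f"clusters: {n_clusters}, noise points: {n_noise}")
```

On data without pronounced dense regions, a large share of the points may be labelled as noise, in line with the behaviour described in the text.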
   4. The Affinity Propagation method does not require the number of clusters or a distance
measure to be specified, so its results represent the inherent data structure without any user
influence. As seen in Figure 7 and Table 6, the data was divided into 6 clusters. Some of the clusters
(such as #1, #5 and #6) are quite distinct from the others. At the same time, clusters #2, #3 and
#4 overlap somewhat with other clusters. The distribution of respondents based on the value
of the HIT index clearly highlights the cluster leader (#5), as well as the clusters with the
lowest values (#2 and #4). Clusters #1, #3 and #6 consist of respondents with average and
above-average values of the index.
   Clusters #1, #3 and #5 are quite similar to each other, as can be seen from the table.
However, it is interesting that about 2/3 of the participants in cluster #1 are successfully using
the website and social media, although they rate the level of human capital literacy as
elementary.
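
   A minimal sketch of this step with scikit-learn is given below. It is an illustration under
assumptions, not the authors' code: the data is synthetic (three artificial groups), and the
estimator is left at its defaults. In scikit-learn these defaults are a similarity based on
negative squared Euclidean distance and the `damping` and `preference` parameters, so the
number of clusters is found by the algorithm rather than set by the user.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

# Synthetic stand-in data (assumption): three loose groups in the
# space of three indicators, 12 points each.
rng = np.random.default_rng(1)
centers = np.array([[0.1, 0.3, 0.9],
                    [0.7, 0.5, 0.5],
                    [0.2, 0.2, 0.1]])
X = np.vstack([c + rng.normal(0.0, 0.05, size=(12, 3)) for c in centers])

# No number of clusters is passed; exemplars emerge from message passing.
model = AffinityPropagation(random_state=0)
labels = model.fit_predict(X)

# Each cluster is represented by one exemplar taken from the data itself.
n_clusters = len(model.cluster_centers_indices_)
print(f"clusters found: {n_clusters}")
```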


Table 5
Main characteristics of the clusters formed by the OPTICS method with Chebyshev distance
and 7 neighbors

OPTICS                                                Cluster #1 (18)                                 Cluster #2 (16)

Ranges of indicator values                            H ∈ [0; 0.364]                                  H ∈ [0; 1]
                                                      I ∈ [0.128; 0.614]                              I ∈ [0.067; 0.826]
                                                      T ∈ [0.7; 1]                                    T ∈ [0; 1]
                                                      Weighted Sum (HIT) ∈ [0.23; 0.56]               Weighted Sum (HIT) ∈ [0.13; 0.91]

Characteristic                                        Status                              % of cases  Status                              % of cases
Website availability, optimization and effectiveness  Not optimized                       61.1%       Not optimized                       51.3%
Social media availability and effectiveness           Effectively                         50.0%       Effectively                         50.0%
Use of online advertising and analytics               Not used                            74.1%       Not used                            75.0%
Use of specialized management systems                 Not used                            80.2%       Not used                            72.3%
Use of specialized technical systems                  Not used                            96.3%       Not used                            85.4%
Level of technical support                            Satisfactory                        100.0%      Satisfactory                        60.4%
Level of Digital Literacy                             Basic                               69.4%       Intermediate or above intermediate  56.3%
Communication channels                                With the use of ICT                 74.1%       With the use of ICT                 79.2%

Silhouette Coefficient                                0.327
Calinski-Harabasz Index                               13.554
Davies-Bouldin Index                                  1.408


[Figure 7 panels: HIT index value by indicator “I”; HIT index value by participants. Axis label: indicator “I” value]

Figure 7: Results of clustering using the Affinity Propagation method


    In contrast, cluster #4 has a high level of digital literacy, but only slightly more than
half of its participants use digital technologies successfully (given the small size of the
cluster, this may be within the margin of error). Cluster #5 is the smallest, but it consists
of respondents with the highest level of digital tool usage and the highest transformation
index values. Clusters #2 and #4 are characterized by inefficient use of digital resources; the
difference between them lies in the value of the digital literacy indicator. Cluster #6 is also
of interest, as it shows effective use of social media at low levels of the other indicators.
    5. The Gaussian Mixture soft clustering algorithm, fitted by Expectation-Maximization,
divided the dataset into 3 clusters; the visualization is shown in Figure 8. Cluster #2 (blue
dots) is dense, with HIT index values falling in the middle of the range, which indicates that
digital tools are in use. Slightly higher values can be observed in cluster #3, which is also
well grouped.
    In contrast, the largest cluster, #1, is very dispersed and contains points with both the
lowest and the highest values of the HIT index components. The points in this cluster, shown
in green, are located around the perimeter of the scatter plot. This division is likely due to
the initial dataset being far from normally distributed.
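
    The difference between soft and hard assignment can be illustrated with a short sketch.
It is written under assumptions and is not the authors' code: the data is synthetic and
deliberately non-Gaussian (uniform), and the column meaning is hypothetical. The model assumes
that each cluster is Gaussian, which is why data far from a normal distribution may be split
in ways that are hard to interpret.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic, deliberately non-Gaussian stand-in data (assumption):
# 34 points spread uniformly over the unit cube.
rng = np.random.default_rng(2)
X = rng.uniform(0.0, 1.0, size=(34, 3))

# Three Gaussian components fitted with the EM algorithm.
gmm = GaussianMixture(n_components=3, random_state=0)
gmm.fit(X)

# Soft clustering: one membership probability per point and component.
soft = gmm.predict_proba(X)

# Hard labels pick the most probable component for each point.
hard_labels = gmm.predict(X)
print(soft[:3].round(3))
print(hard_labels[:10])
```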
    In Cluster #1, half of the respondents do not use digital tools, although almost 70% of
those surveyed claim to have an average or high level of digital literacy. In Cluster #2, the
majority do not use modern capabilities, even though all respondents have a basic level of
technical equipment.

Table 6
Main characteristics of the clusters formed by the Affinity Propagation method

Affinity Propagation                                  Cluster #1 (7)                                  Cluster #2 (5)                                  Cluster #3 (5)

Ranges of indicator values                            H ∈ [0; 0.2]                                    H ∈ [0; 0.2]                                    H ∈ [0.636; 0.8]
                                                      I ∈ [0.34; 0.61]                                I ∈ [0.067; 0.657]                              I ∈ [0.097; 0.73]
                                                      T ∈ {1}                                         T ∈ [0.5; 0.7]                                  T ∈ [0.25; 0.75]
                                                      Weighted Sum (HIT) ∈ [0.37; 0.56]               Weighted Sum (HIT) ∈ [0.13; 0.488]              Weighted Sum (HIT) ∈ [0.43; 0.66]

Characteristic                                        Status                              % of cases  Status                              % of cases  Status                              % of cases
Website availability, optimization and effectiveness  Optimized                           60.0%       Not optimized                       64.0%       Optimized                           52.0%
Social media availability and effectiveness           Effectively                         66.7%       Not effectively                     66.6%       Effectively                         53.3%
Use of online advertising and analytics               Not used                            52.4%       Not used                            86.7%       Not used                            60.0%
Use of specialized management systems                 Not used                            81.6%       Not used                            71.4%       Not used                            74.2%
Use of specialized technical systems                  Not used                            100.0%      Not used                            66.6%       Not used                            93.3%
Level of technical support                            Satisfactory                        100.0%      Satisfactory                        53.3%       Satisfactory                        86.7%
Level of Digital Literacy                             Basic                               85.7%       Basic                               70.0%       Intermediate or above intermediate  80.0%
Communication channels                                With the use of ICT                 85.7%       With the use of ICT                 66.6%       With the use of ICT                 73.3%

Affinity Propagation                                  Cluster #4 (4)                                  Cluster #5 (3)                                  Cluster #6 (10)

Ranges of indicator values                            H ∈ [0; 0.36]                                   H ∈ [0.636; 1]                                  H ∈ [0.1; 0.36]
                                                      I ∈ [0.12; 0.36]                                I ∈ [0.43; 0.83]                                I ∈ [0.097; 0.369]
                                                      T ∈ [0; 0.25]                                   T ∈ [0.9; 1]                                    T ∈ [0.75; 1]
                                                      Weighted Sum (HIT) ∈ [0.11; 0.20]               Weighted Sum (HIT) ∈ [0.61; 0.91]               Weighted Sum (HIT) ∈ [0.28; 0.43]

Characteristic                                        Status                              % of cases  Status                              % of cases  Status                              % of cases
Website availability, optimization and effectiveness  Not optimized                       85.0%       Optimized                           93.3%       Not optimized                       72.0%
Social media availability and effectiveness           Not effectively                     75.0%       Effectively                         100.0%      Effectively                         60.0%
Use of online advertising and analytics               Not used                            100.0%      Not used                            55.5%       Not used                            86.7%
Use of specialized management systems                 Not used                            78.6%       Not used                            66.6%       Not used                            78.6%
Use of specialized technical systems                  Not used                            91.6%       Not used                            88.9%       Not used                            96.7%
Level of technical support                            Not satisfactory                    75.0%       Satisfactory                        100.0%      Satisfactory                        100.0%
Level of Digital Literacy                             Basic                               75.0%       Intermediate or above intermediate  100.0%      Basic                               60.0%
Communication channels                                With the use of ICT                 91.7%       With the use of ICT                 77.8%       With the use of ICT                 70.0%

Silhouette Coefficient                                0.351
Calinski-Harabasz Index                               21.607
Davies-Bouldin Index                                  0.931


   Cluster #3 shows moderate success in using simple tools, such as a website and social
networks, even though 80% of respondents consider the digital competencies of their
employees to be basic. Another observation is that for some tools, such as analytics, about
half of the respondents use them and half do not, which makes it impossible to identify
precise distinguishing features between the clusters.


[Figure 8 panels: HIT index value by indicator “I”; HIT index value by participants. Axis label: indicator “I” value]

Figure 8: Results of clustering using the Gaussian Mixture (EM-method)

  Analytical data with the main characteristics of the formed clusters are presented in
Table 7.

Table 7
Main characteristics of the clusters formed by the Gaussian Mixture (EM-method)

Gaussian Mixture (EM)                                 Cluster #1 (17)                                 Cluster #2 (9)                                  Cluster #3 (8)

Ranges of indicator values                            H ∈ [0; 1]                                      H ∈ [0.1; 0.364]                                H ∈ [0; 0.2]
                                                      I ∈ [0.067; 0.826]                              I ∈ [0.097; 0.369]                              I ∈ [0.319; 0.614]
                                                      T ∈ [0; 1]                                      T ∈ [0.75; 1]                                   T ∈ {1}
                                                      Weighted Sum (HIT) ∈ [0.13; 0.91]               Weighted Sum (HIT) ∈ [0.28; 0.43]               Weighted Sum (HIT) ∈ [0.37; 0.56]

Characteristic                                        Status                              % of cases  Status                              % of cases  Status                              % of cases
Website availability, optimization and effectiveness  Not optimized                       61.1%       Not optimized                       70.0%       Optimized                           70.0%
Social media availability and effectiveness           Effectively                         50.0%       Not effectively                     70.8%       Effectively                         70.0%
Use of online advertising and analytics               Not used                            74.1%       Not used                            91.6%       Not used                            58.3%
Use of specialized management systems                 Not used                            80.2%       Not used                            73.2%       Not used                            71.4%
Use of specialized technical systems                  Not used                            96.4%       Not used                            79.2%       Not used                            87.5%
Level of technical support                            Satisfactory                        98.1%       Satisfactory                        62.5%       Satisfactory                        83.3%
Level of Digital Literacy                             Intermediate or above intermediate  50.0%       Basic                               62.5%       Basic                               87.5%
Communication channels                                With the use of ICT                 74.7%       With the use of ICT                 83.3%       With the use of ICT                 75.0%

Silhouette Coefficient                                0.192
Calinski-Harabasz Index                               8.578
Davies-Bouldin Index                                  1.352
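
    The three quality indicators reported under each table are available in scikit-learn and
can be computed for any labelling. The sketch below is an illustration on synthetic data with
hypothetical group centres, not a reproduction of the values in the tables. The Silhouette
Coefficient lies in [-1, 1] and higher is better, a higher Calinski-Harabasz Index indicates
better-defined clusters, and a lower Davies-Bouldin Index (minimum 0) indicates better
separation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (
    silhouette_score,
    calinski_harabasz_score,
    davies_bouldin_score,
)

# Synthetic stand-in data (assumption): three compact groups, 12 points each.
rng = np.random.default_rng(3)
centers = np.array([[0.1, 0.2, 0.2],
                    [0.3, 0.4, 0.9],
                    [0.8, 0.7, 0.9]])
X = np.vstack([c + rng.normal(0.0, 0.05, size=(12, 3)) for c in centers])

# Any clustering model could supply the labels; K-Means is used here.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

sil = silhouette_score(X, labels)         # in [-1, 1], higher is better
ch = calinski_harabasz_score(X, labels)   # higher is better
db = davies_bouldin_score(X, labels)      # >= 0, lower is better
print(f"Silhouette: {sil:.3f}  Calinski-Harabasz: {ch:.3f}  Davies-Bouldin: {db:.3f}")
```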


    It is worth noting that the digital literacy of employees has a significant impact on the
overall state of digitalization of the enterprise. If the digital literacy of employees is
rated as elementary, the enterprise tends to lack a website, a social media presence and other
digital tools. As the digital literacy of employees increases, the share of tools and
technologies in use increases as well, so investing in people is seen as an important
contribution to the success of digitalization. Interestingly, the level of technical equipment
does not have a significant impact on the overall digital level of the enterprises.

6. Conclusions
The paper presents 5 data clustering models for understanding the current state of
digitalization of business processes among small and medium-sized enterprises in the Ternopil
region of Ukraine. The Digital Transformation Index "HIT" was used to measure numerically the
current level of digital maturity of domestic enterprises. Enterprises were clustered on the
numerical values of three indicators, the components of the Digital Transformation Index. A
dedicated software application was developed in Python for this task. Various approaches to
building clustering models were implemented using built-in methods of the scikit-learn library
for Data Mining problems. Four hard clustering methods (K-Means, Affinity Propagation,
Hierarchical clustering, OPTICS) and one soft clustering method based on the EM algorithm
(Gaussian Mixture) were used. The Silhouette Coefficient was used as the main quality metric.
In terms of similarity between elements within groups and differences between clusters, the
best results on the dataset were demonstrated by Affinity Propagation, Ward's hierarchical
clustering with 3 clusters, and K-Means with 3 clusters. Analysis of the constructed models
showed that high values of quality metrics do not always indicate an optimal and effective
division into groups that can be successfully interpreted. New insights were obtained
regarding the importance of individual components of the Digital Transformation Index.
Common features of the obtained groups of enterprises, as well as their strengths and
weaknesses in the use of digital tools and in the digital literacy of human capital, were
identified. In the future, stable clusters can be used to classify newly surveyed enterprises
and to identify the attributes with the greatest impact on the digital maturity of an
enterprise, or to develop a methodology for providing recommendations to improve the level of
digital maturity of the enterprise.
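
The future use mentioned above, assigning a newly surveyed enterprise to an existing cluster,
could look as follows. This is a minimal sketch under assumptions: the data is synthetic,
K-Means is chosen only because its scikit-learn estimator offers a `predict` method, and the
indicator values of the new enterprise are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for already surveyed enterprises (assumption):
# three groups of 10 points in the space of three indicators.
rng = np.random.default_rng(4)
centers = np.array([[0.1, 0.2, 0.2],
                    [0.3, 0.4, 0.9],
                    [0.8, 0.7, 0.9]])
X = np.vstack([c + rng.normal(0.0, 0.04, size=(10, 3)) for c in centers])
X = np.clip(X, 0.0, 1.0)  # indicator values are kept within [0, 1]

# Fit the model once on the existing survey data.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Hypothetical indicator values of a newly surveyed enterprise.
new_enterprise = np.array([[0.75, 0.65, 0.95]])

# The new point is assigned to the cluster with the nearest centroid.
cluster = int(km.predict(new_enterprise)[0])
print(f"new enterprise assigned to cluster {cluster}")
```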

7. References
[1] OECD Policy Responses on the Impacts of the War in Ukraine “Digitalisation for
    recovery                 in               Ukraine“,                 2022.               URL:
    https://www.oecd.org/ukraine-hub/policy-responses/digitalisation-for-recovery-in-
    ukraine-c5477864/.
[2] J. Cenamor, V. Parida and J. Wincent, How entrepreneurial SMEs compete through digital
    platforms: The roles of digital platform capability, network capability and ambidexterity,
    Journal of Business Research, Julay, vol. 100, (2019), pp. 196–216.
[3] N. Ivanchenko, Zh. Kudryts'ka. and K. Rekachyns'ka, Business models in the conditions
    of digital transformations, Vcheni zapysky TNU imeni V. I. Vernads'koho, Seriia:
    Ekonomika i upravlinnia, vol. 3, no. 31 (2020), pp. 185–190.
[4] N. M. Kraus, O. P. Holoborod'ko, and K. M. Kraus, Digital economy: trends and
    perspectives of the abangard change of development, Efektyvna ekonomika, vol. 1, 2018.
[5] A. Annarelli et al. Literature review on digitalization capabilities: Co-citation analysis of
    antecedents, conceptualization and consequences Technol. Forecast. Soc. Change, 2021.
[6] J. Mero et al. An effectual approach to executing dynamic capabilities under unexpected
    uncertainty Ind. Market. Manag, 2022.
[7] Digital Maturity Models: A Systematic Literature Review May 2021. doi:10.1007/978-3-
    030-69380-0_5                                                                           URL:
     https://www.researchgate.net/publication/351975241_Digital_Maturity_Models_A_Syste
     matic_Literature_Review.
[8] O. Pischulina, Digital economy: trends, risks and social determinants: report, Tsentr
     Razumkova,                  2020,                271               p.                URL:
     https://razumkov.org.ua/uploads/article/2020_digitalization.pdf.
[9] H. Zhekalo, Digital economy of Ukraine: problems and prospects of development,
     Naukovyj visnyk Uzhhorods'koho natsional'noho universytetu, Seriia: Mizhnarodni
     ekonomichni vidnosyny ta svitove hospodarstvo, vol. 26, no. 1, (2019): 56–60.
[10] H. Karcheva, D. Ohorodnia, and V. Open'ko, Digital economy and its impact on the
     development of national and international economy, Finansovyj prostir, vol. 3, no. 1,
     (2017): 13–21.
[11] I. Strutynska, L. Dmytrotsa, H. Kozbur, O. Hlado, P. Dudkin and O. Dudkina,
     Development of Digital Platform to Identify and Monitor the Digital Business
     Transformation Index, in: Proceedings of the 15th International Conference on Computer
     Sciences and Information Technologies (CSIT), Zbarazh, Ukraine, September 23, 2020, pp.
     171-175, doi: 10.1109/CSIT49958.2020.9322016.
[12] I. Strutynska, L. Dmytrotsa, H. Kozbur, L. Melnyk, O. Hlado. Developing Practical
     Recommendations for Increasing the Level of Digital Business Transformation Index, in:
     Proceedings of the 16th International Conference on ICT in Education, Research and
     Industrial Applications. Integration, Harmonization and Knowledge Transfer, volume II:
     Workshops of ICTERI, Part III: 8th International Workshop Information Technology in
     Economic Research (ITER 2020), Kharkiv, Ukraine, October 06-10, 2020, pp. 351-362. URL:
     https://ceur-ws.org/Vol-2732/20200351.pdf.
[13] I. Strutynska, L. Dmytrotsa, H. Kozbur, L. Melnyk, System-Integrated Methodological
     Approach Development to Calculating the Digital Transformation Index of Businesses,
     in: Proceedings of the 16th International Conference on ICT in Education, Research and
     Industrial Applications. Integration, Harmonization and Knowledge Transfer, volume I:
     Main Conference (ICTERI 2020), Kharkiv, Ukraine, October 06-10, 2020, pp. 373-379. URL:
     http://ceur-ws.org/Vol-2740/20200373.pdf.
[14] I. Strutynska, L. Dmytrotsa, H. Kozbur, L. Melnyk, The Digital Business Transformation
     Index Determining and Monitoring: Development of a National Online Platform, in:
     Proceedings of the 1st International Workshop on Information Technologies: Theoretical
     and Applied Problems, ITTAP 2021, Ternopil, Ukraine, 2021, pp. 327-334.
[15] H. Cuesta, S. Kumar, Practical Data Analysis. Birmingham, Packt Publishing Ltd, 2016.
[16] Data Science & Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting
     Data, EMC Education Services. Indianapolis, John Wiley & Sons, Inc, 2015.
[17] C. Iyigun, M. Türkeş, I. Batmaz, C. Yozgatligil, V. Purutçuoğlu, E. Kartal, M. Öztürk,
     Clustering current climate regions of Turkey by using a multivariate statistical method.
     Theoretical and Applied Climatology, 114 (2013): 95-106.
[18] K. Sablyn, E. Kahan, E. Chernova, Clustering of coal mining regions of Russia: investment
     and innovation activity. Journal of New Economy, 21 (1) (2020): 89-106.
[19] K. Gorbatiuk, O. Mantalyuk, O. Proskurovych, O.Valkov, Application of Fuzzy Clustering
     to Shaping Regional Development Strategies in Ukraine, Proceedings of the 6th
     International Conference on Strategies, Models and Technologies of Economic Systems
     Management (SMTESM 2019), 2019, pp. 271-276.
[20] T. Paianok, Y. Vazhaliuk, Cluster analysis of labor potential of Ukraine. Economy and
     State, 12 (2019): 109-114.
[21] S. Behun, Application of cluster analysis to study the demographic situation in the
     region. Economic Journal of Lesya Ukrainka East European National University, 2 (2016):
     122-128.
[22] S. Synytsia, O. Vakun, Clustering of regions by level of economic potential. Economy and
     society Mukachevo State University, 12 (2017): 776-784.
[23] L. Zomchak, Y. Dobrotii, Clustering of regions of Ukraine by competitiveness.
     Proceedings of the International scientific-practical conference Administrative-territorial
     vs economic spatial borders of regions, KNEU, 2020, pp. 328-332.
[24] V. Aulin, O. Lyashuk, O. Pavlenko, D. Velykodnyi, A. Hrynkiv, S. Lysenko, et al.,
     "Realization of the Logistic Approach in the International Cargo Delivery System",
     COMMUNICATIONS, vol. 21, no. 2, pp. 3-12, 2019.
[25] Petraška, A.; Čižiuniene, K.; Jarašuniene, A.; Maruschak, P.; Prentkovskis, O. Algorithm
     for the assessment of heavyweight and oversize cargo transportation routes. J. Bus. Econ.
     Manag. 2017, 18, 1098–1114
[26] The path to digital maturity: A cluster analysis of the retail industry in an emerging
     economy Marcelo Rezende Pinto, Paula Karina Salume, Marcelo Werneck Barbosa, Paulo
     Renato de Sousa https://doi.org/10.1016/j.techsoc.2022.102191.
[27] M. Halkidi, Y. Batistakis, M. Vazirgiannis, Clustering algorithms and validity measures,
     in: Proceedings of the Thirteenth International Conference on Scientific and Statistical
     Database Management. SSDBM, Fairfax, VA, USA SourceIEEE Xplore, July 18, 2001.
     doi:10.1109/SSDM.2001.938534.
[28] Clustering, 2022. URL: https://scikit-learn.org/stable/modules/clustering.
[29] E. Zuccarelli, Performance Metrics in Machine Learning — Part 3: Clustering, 2021. URL:
     https://towardsdatascience.com/performance-metrics-in-machine-learning-part-3-
     clustering-d69550662dc6.
[30] N. Bolshakova, F. Azuaje, Cluster validation techniques for genome expression data,
     volume 83 of Signal Processing, Issue 4, April 2003, pp. 825-833. doi: 10.1016/S0165-
     1684(02)00475-9.
[31] Yavorskyi, A.V.; Karpash, M.O.; Zhovtulia, L.Y.; Poberezhny, L.Y.; Maruschak, P.O. Safe
     operation of engineering structures in the oil and gas industry. J. Nat. Gas Sci. Eng. 2017,
     46, 289–295.