<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Data Mining Methods for Market Segmentation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ildus Rizaev</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elza Takhavova</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zemfira Zakharova</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Kazan National Research Technical University named after A.N. Tupolev-KAI</institution>
          ,
          <addr-line>10, K. Marx str., Kazan, 420111, Russian Federation</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Data mining methods make it possible to solve problems of current interest, among which is the market segmentation problem. There are different approaches to the market segmentation problem that differ in the methods used. Data mining methods are applied to solve classification and clustering problems. The k-means method, the EM algorithm and neural networks are considered and compared. The Deductor platform is used to analyse the implementation of the clustering algorithms. Keywords: market segmentation, customer preferences, clustering, similarity measure, neural networks.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The development of methods for recording and storing data has led to a rapid
growth in the amount of collected information. The volumes of data are so large that
it is not realistic for a person to analyse them manually, and the problem of analysing
collected information more efficiently is becoming increasingly pressing. Modern companies
have the ability to store customer data in a single database,
called the customer base [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In addition, organizations pursue the goal
of increasing profitability and reducing costs. The analysis of the customer base
remains incomplete if all customers are treated as identical; it is improved
by identifying distinct preferences among clients [
        <xref ref-type="bibr" rid="ref2 ref3">2-3</xref>
        ]. To increase efficiency, it is
necessary to identify which customer groups exist and then determine what actions will
help attract more customers. Cluster analysis is used to solve this problem. In Data
Mining, a common measure for assessing the proximity between objects is a metric, that
is, a way of specifying the distance. Clustering algorithms pose difficulties, since the
same set of objects can be grouped into clusters in different ways, which makes it
necessary to choose among a large number of clustering algorithms [
        <xref ref-type="bibr" rid="ref4 ref5 ref6">4-6</xref>
        ].
      </p>
      <p>With a large number of customers, it is difficult to take an individual approach, so
it is convenient to group them into segments, i.e. groups with homogeneous characteristics.
Clustering can be used to segment customers and build customer profiles. The
efficiency of working with clients increases when their personal preferences are taken
into account. Clustering is applied in a wide variety of areas, including retail, banking,
telecommunications, insurance and government services.</p>
    </sec>
    <sec id="sec-2">
      <title>Materials and methods</title>
      <p>
        Clustering differs from classification in that an output variable is not required and the
number of clusters into which a data set must be grouped may not be known in advance. The
output of clustering is not a ready-made answer: a cluster is a group of similar objects,
and clustering only indicates the similarity of the objects it includes, so the resulting
clusters require additional interpretation. To determine the similarity of objects, a
measure of proximity must be specified. The most popular measure of proximity
is the Euclidean distance or, more generally, the Minkowski metric [
        <xref ref-type="bibr" rid="ref6 ref7">6-7</xref>
        ]. There are many approaches to solving the
clustering problem.
      </p>
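      <p>As a minimal sketch (not part of the workflow described below; the data and function names
are assumptions made for this illustration), the proximity measure can be computed in Python as
follows; p = 2 gives the Euclidean distance and p = 1 the Manhattan distance.</p>
      <preformat>
import numpy as np

def minkowski_distance(x, y, p=2.0):
    """Minkowski metric between two feature vectors; p=2 is the Euclidean distance."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))

# Two hypothetical customers described by (age, income in thousands)
a = [36, 50]
b = [21, 20]
print(minkowski_distance(a, b))        # Euclidean distance (p = 2)
print(minkowski_distance(a, b, p=1))   # Manhattan distance (p = 1)
      </preformat>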
      <p>
        The k-means algorithm is one of the simplest, but at the same time not entirely
accurate, clustering methods [
        <xref ref-type="bibr" rid="ref7 ref8">7-8</xref>
        ]. The goal of the method is to divide m observations into
k clusters, with each object belonging to the cluster whose center (centroid) is closest
to it. In this method, the number of clusters is predefined. The method minimizes the
total squared deviation of the cluster points from the centers of these clusters:
J = \sum_{i=1}^{k} \sum_{x \in S_i} (x - \mu_i)^2 \qquad (1)
      </p>
      <p>In (1), k is the number of clusters, Si are the resulting clusters, i varies from 1 to k,
and μi are the centers of the elements x from cluster Si. The algorithm is performed iteratively
until the cluster boundaries and the locations of the centroids stop changing; it may take a
dozen iterations to converge. The advantages of the algorithm are simplicity of implementation
and speed of execution. The disadvantage is the need to set the number of clusters and choose
the initial mass centers in advance.</p>
      <p>
        Kohonen neural networks form a class of neural networks whose main element is
the Kohonen layer; they are used for data analysis and for solving clustering
problems [
        <xref ref-type="bibr" rid="ref10 ref9">9-10</xref>
        ]. The self-organizing Kohonen map is a kind of neural network algorithm
characterized by unsupervised learning: the result depends only on the structure of the
data. The weights can be initialized either with small random values or from examples of
the training sample. The advantages of this method are that the network is trained without
a teacher, the implementation is simple, and the corresponding answer is obtained after
passing the data through the layers. The disadvantages are that it works only with
numerical data and that the number of clusters must be determined in advance.
      </p>
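      <p>The training rule of a self-organizing Kohonen map can be sketched in Python as follows;
the rectangular grid size, the Gaussian neighbourhood and the decay schedules are assumptions
made for illustration and do not correspond to the Deductor implementation.</p>
      <preformat>
import numpy as np

def train_som(X, grid=(5, 5), n_iter=1000, lr0=0.5, sigma0=2.0, seed=0):
    """Unsupervised SOM training: pull node weights towards inputs, strongest at the winner."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    W = rng.normal(scale=0.1, size=(rows, cols, X.shape[1]))   # small random initial weights
    coords = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)
    for t in range(n_iter):
        x = X[rng.integers(len(X))]                            # one training example
        # best matching unit: the node whose weight vector is closest to x
        bmu = np.unravel_index(np.linalg.norm(W - x, axis=2).argmin(), (rows, cols))
        lr = lr0 * np.exp(-t / n_iter)                         # decaying learning rate
        sigma = sigma0 * np.exp(-t / n_iter)                   # shrinking neighbourhood radius
        dist2 = np.sum((coords - np.array(bmu)) ** 2, axis=2)
        h = np.exp(-dist2 / (2.0 * sigma ** 2))                # Gaussian neighbourhood weights
        W += lr * h[..., None] * (x - W)                       # move weights towards the input
    return W

# Purely numerical data, since the method works only with numerical features
X = np.random.default_rng(2).normal(size=(200, 3))
weights = train_som(X)
      </preformat>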
      <p>
        The EM (Expectation Maximization) algorithm is based on maximizing the expected
likelihood; it assumes that the observations are distributed according to the normal law,
that is, the Gaussian function [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. In the EM
algorithm, auxiliary hidden variables are introduced, on the basis of which the coefficients
are recalculated in order to bring the parameter vector closer to the maximum of the
likelihood. Optimal parameters are found by a sequential, iterative EM procedure that
consists of two steps. At the first, E-step (Expectation), the values of the likelihood
function are computed. At the second, M-step (Maximization), the maximum likelihood
estimate is found. The order of execution can be represented as follows.
E-step. Based on the current values of the parameters (2), the vector of hidden variables γ
is calculated (3).
M-step. Based on the current values of the hidden variables, the parameter vector is
re-estimated according to (4).
      </p>
      <p>\theta = (w_1, \dots, w_k;\; \mu_1, \dots, \mu_k;\; \Sigma_1, \dots, \Sigma_k) \qquad (2)</p>
      <p>\gamma_{ij} = \frac{w_j\, N(x_i \mid \mu_j, \Sigma_j)}{\sum_{l=1}^{k} w_l\, N(x_i \mid \mu_l, \Sigma_l)} \qquad (3)</p>
      <p>w_j = \frac{1}{m} \sum_{i=1}^{m} \gamma_{ij}, \quad \mu_j = \frac{\sum_{i=1}^{m} \gamma_{ij}\, x_i}{\sum_{i=1}^{m} \gamma_{ij}}, \quad \Sigma_j = \frac{\sum_{i=1}^{m} \gamma_{ij} (x_i - \mu_j)(x_i - \mu_j)^T}{\sum_{i=1}^{m} \gamma_{ij}} \qquad (4)</p>
      <p>The procedure is stopped when the change of the hidden variables between iterations does
not exceed the specified constant Δ (5). This model is based on the methods of mathematical
statistics.</p>
      <p>\max_j \{\, |\gamma_j - \gamma_j^{0}| \,\} \le \Delta \qquad (5)</p>
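      <p>A compact Python sketch of the E-step (3), the M-step (4) and the stopping rule (5) for a
Gaussian mixture is given below; the data, the tolerance and the initialization are assumptions
made for illustration.</p>
      <preformat>
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iter=100, tol=1e-4, seed=0):
    """EM for a Gaussian mixture: E-step computes responsibilities, M-step re-estimates (2)."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w = np.full(k, 1.0 / k)                                    # mixture weights
    mu = X[rng.choice(m, size=k, replace=False)]               # means
    sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d)] * k)     # covariances
    gamma_old = np.zeros((m, k))
    for _ in range(n_iter):
        # E-step (3): responsibilities from the current parameter vector (2)
        dens = np.column_stack([w[j] * multivariate_normal.pdf(X, mu[j], sigma[j])
                                for j in range(k)])
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step (4): re-estimate weights, means and covariances
        nk = gamma.sum(axis=0)
        w = nk / m
        mu = (gamma.T @ X) / nk[:, None]
        for j in range(k):
            diff = X - mu[j]
            sigma[j] = (gamma[:, j, None] * diff).T @ diff / nk[j] + 1e-6 * np.eye(d)
        # stopping rule (5): largest change of the hidden variables within the tolerance
        if tol >= np.max(np.abs(gamma - gamma_old)):
            break
        gamma_old = gamma
    return w, mu, sigma, gamma

X = np.random.default_rng(3).normal(size=(300, 2))
w, mu, sigma, gamma = em_gmm(X, k=5)
      </preformat>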
      <p>Data mining software tools make it possible to apply and compare different approaches
and to choose the most appropriate method.</p>
    </sec>
    <sec id="sec-2a">
      <title>Results</title>
      <p>
        To test the clustering methods, the information was prepared with the assignments of the
initial data: input, descriptive and output data. The following features were selected as
input data: gender, age, marital status, income and store category. Full name and purchase
amount were used for informative purposes. The rest of the data was marked as not
being used. For the analysis, the Deductor Studio platform [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] was used, which allows
for a comprehensive analysis of enterprise data, prediction of the indicators of its
development, segmentation and search for patterns [
        <xref ref-type="bibr" rid="ref6 ref7">6-7</xref>
        ]. The initial data is represented in
Deductor Studio in the form of a table (Figure 1).
      </p>
      <p>Segmentation analysis was performed using three methods: k-means, Kohonen maps and
EM clustering. When using the k-means method, the initial data was assigned as described
above: input, informative and output data. The choice of the number k can be based on
theoretical justification or intuition; the number of clusters was set equal to 5. The
resulting cluster profiles are shown in Figure 2. They display statistical information on the
clusters as percentages. The “category of stores” data (this column is not shown in the table
in Figure 1) includes such stores as grocery, construction, furniture, household, pharmacies,
etc., 51 records in total. Figure 2 shows this category in the highlighted part of the table
on the right. In total, five clusters were identified by the categories:
shops, marital status, gender, income and age.</p>
      <p>The main difference between Kohonen self-organizing maps and the k-means method is
that all neurons (class centers and nodes) are ordered into a certain structure. The example
with clients and their preferences is continued: the same document was selected for processing,
and the same input and informative data were kept. The results of the algorithm are displayed
on maps, and a separate map is built for each input parameter. The convenience of the map
(Figure 3) is that a user can click on a cluster number and see the location of that cluster
on the other maps built from the input values. For example, cluster 0 includes men with an
average age of 36 years, marital status “married” and an income of 50 thousand, who chose the
following categories of stores: construction, food.</p>
      <p>[Figure 3: Kohonen maps showing cluster links and profiles by age, sex and shop category.]</p>
      <p>For EM clustering, the same sample was chosen, keeping the same input data and settings.
At the parameter-setting step, this method allows the user to choose either an automatic or a
fixed number of clusters. Judging by the connections between the clusters, the system allocated
the maximum number of clusters that was set.</p>
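      <p>Outside Deductor, a rough analogue of choosing between a fixed and an automatic number of
clusters (an assumption made for illustration, not the tool's own interface) is to fit Gaussian
mixtures for several candidate numbers of components and keep the one with the lowest BIC:</p>
      <preformat>
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(4).normal(size=(300, 2))  # hypothetical numeric customer features

# Fixed number of clusters
labels_fixed = GaussianMixture(n_components=5, random_state=0).fit_predict(X)

# "Automatic" choice: fit several models and keep the one with the lowest BIC
models = [GaussianMixture(n_components=k, random_state=0).fit(X) for k in range(2, 9)]
best = min(models, key=lambda m: m.bic(X))
labels_auto = best.predict(X)
print(best.n_components)
      </preformat>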
      <p>[Figure 4: cluster links, “what if” analysis, cluster profiles (age, family, cash income),
distance matrix and shop categories (grocery, hardware, beauty, clothing, furniture, auto,
pharmacy).]</p>
      <p>Let us configure a fixed number of clusters; the method of initial initialization of the
clusters must also be chosen. If it is chosen at random, we get 5 clusters. We can see that
cluster 1 includes both men and women with the status married or widower, aged 23 to 65,
covering all categories of stores (Figure 4).</p>
      <p>For example, cluster 2 included 70% women and 30% men aged 16 to 21, single and
unmarried, interested in such categories of stores as food, beauty, pharmacy, clothing and
household. When the method of initial initialization of the clusters is “from the training set”,
we also get 5 clusters (Figure 4). In this case, cluster 2 includes unmarried women from 16 to
20 years old who are interested in the store categories: food, clothing, household and
pharmacy.</p>
      <p>The considered algorithms all have a fast cluster detection speed. However, the simplest
clustering algorithm for the user turned out to be the Kohonen map. In this algorithm, there
is no need to specify the number of clusters at the input: the number of clusters is determined
by the method itself. In addition, this method has its own way of displaying results in the
form of colored maps, on which it is convenient to see immediately where each cluster is
located and what data it includes. This algorithm also has good resistance to noisy data. Thus,
with the help of Kohonen maps, the analyst can obtain more accurate segmentation results.</p>
    </sec>
    <sec id="sec-3">
      <title>Conclusion</title>
      <p>To choose a method in the context of the market segmentation problem, a comparative
analysis of three clustering methods was performed. The k-means method, EM clustering and
the Kohonen neural network method were considered and analysed as applied to market
segmentation. The Kohonen neural network method was recommended as preferable, because
accuracy is the most critical requirement for subject areas linked with the marketing sphere.
Database information about customers makes it possible to group clients by different
preferences and to take the interests of these groups into account when making marketing
decisions to improve business processes.</p>
    </sec>
    <sec id="sec-4">
      <title>References</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Pavlov</surname>
            ,
            <given-names>B.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garifullin</surname>
            ,
            <given-names>R.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mingaleev</surname>
            ,
            <given-names>G.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Babushkin</surname>
            ,
            <given-names>V.M.:</given-names>
          </string-name>
          <article-title>Key technologies of digital economy in the Russian Federation</article-title>
          .
          <source>In: Proceedings of the 33rd International Business Information Management Association Conference, IBIMA 2019: Education Excellence and Innovation Management through Vision</source>
          ,
          <year>2020</year>
          ,
          <fpage>3401</fpage>
          -
          <lpage>3407</lpage>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Kotler</surname>
            ,
            <given-names>Ph.</given-names>
          </string-name>
          :
          <source>Fundamentals of Marketing. Short course</source>
          . Dialectics, Moscow; St. Petersburg
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Kotler</surname>
            ,
            <given-names>Ph.</given-names>
          </string-name>
          , Keller, K.L.:
          <article-title>Marketing management</article-title>
          .
          <volume>12</volume>
          edn. Piter, Saint Petersburg
          (
          <year>2006</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Barsegyan</surname>
            ,
            <given-names>A.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kupriyanov</surname>
            ,
            <given-names>M.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stepanenko</surname>
            ,
            <given-names>V.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kholod</surname>
            ,
            <given-names>I.I.</given-names>
          </string-name>
          :
          <article-title>Data analysis technology: Data Mining, Visual Mining, Text Mining, OLAP. 2nd edn</article-title>
          . BHV-Petersburg, Saint Petersburg
          (
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Mandel</surname>
            ,
            <given-names>I.D.</given-names>
          </string-name>
          :
          <article-title>Cluster analysis</article-title>
          .
          <source>Finance and statistics</source>
          , Moscow (
          <year>1988</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Witten</surname>
            ,
            <given-names>I.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frank</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hall</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          :
          <article-title>Data Mining: Practical Machine Learning Tools and Techniques</article-title>
          . 3rd edn, Morgan Kaufmann (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Dyuk</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Flegontov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fomina</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Application of Data Mining technologies in the scientific, technical and humanitarian areas</article-title>
          .
          <source>Izvestia: Herzen university journal of humanities &amp; sciences, Saint Petersburg: Russian state. A.I. Herzen Pedagogical University</source>
          ,
          <volume>138</volume>
          ,
          <fpage>77</fpage>
          -
          <lpage>84</lpage>
          (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Mirkes</surname>
            <given-names>E.M.:</given-names>
          </string-name>
          <article-title>K-means and K-medoids: applet</article-title>
          . University of Leicester (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Kohonen</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <source>Self-Organizing Maps</source>
          . 3rd extended edn: Springer-Verlag, New York (
          <year>2001</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Wasserman</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <source>Neural Computing. Theory and Practice: Mir</source>
          , Moscow (
          <year>1992</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Hastie</surname>
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tibshirani</surname>
          </string-name>
          , R.,
          <string-name>
            <surname>Friedman</surname>
          </string-name>
          , J.:
          <article-title>The EM algorithm</article-title>
          .
          <source>The Elements of Statistical Learning</source>
          : Springer, New York (
          <year>2001</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Rizaev</surname>
            ,
            <given-names>I.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Takhavova</surname>
          </string-name>
          , E.G.:
          <article-title>Solution of the Problem of Classification of Vehicles on the Basis of Statistical Estimates of Data</article-title>
          .
          <source>Proceedings 12th International Scientific and Technical Conference "Dynamics of Systems, Mechanisms and Machines"</source>
          ,
          <year>Dynamics 2018</year>
          ,
          <volume>8601417</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>