<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Review of Modern Techniques of Qualitative Data Clustering</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>The North Caucasus Federal University, Institute of Information Technology and Telecommunications</institution>
        </aff>
      </contrib-group>
      <fpage>7</fpage>
      <lpage>13</lpage>
      <abstract>
        <p>This article presents a brief comparative survey of modern algorithms for clustering quantitative and qualitative data. As a practical component of the analysis, the algorithms are applied to the consumer basket analysis task. Examples of the commercial use of clustering algorithms are given, and current problems in applying cluster analysis are described.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Descriptive Comparison of Clustering Algorithms</title>
      <p>Speaking of the current state of clustering techniques, it should be noted
that the range of methods for analyzing quantitative data is wide enough to
perform clustering tasks in accordance with all requirements for processing speed
and for the quality of the final result. The best-known representatives of
numeric data clustering algorithms are:
1. CURE (Clustering Using Representatives).
2. BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies).
3. HCM (Hard C-Means).</p>
      <p>Using the aforementioned methods to cluster categorical data is inefficient,
and often impossible. The main difficulties stem from the high dimensionality and
huge volume of data that typically characterize such databases, because pairwise
comparison of object attributes in database tables with millions of records can
take a very long time. Clustering algorithms of the aforementioned kinds are
therefore subject to a number of special requirements and recommendations that
optimize performance when working with large amounts of data (a minimal sketch of
such a processing loop is given after the list):</p>
      <sec id="sec-1-2">
        <title>1. Ability to work in a limited amount of RAM.</title>
        <p>
          2. The algorithm should work under the condition that the information from
the database can be gathered only in the forward-only cursor.
3. Ability to abort the algorithm with preservation of intermediate results
with the possibility of resuming the process of data processing.
4. Minimizing the number of requests for the full database table scan [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
        </p>
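      <p>Purely as an illustration of these requirements, the following minimal
Python sketch consumes rows from a forward-only source exactly once, keeps only
compact per-cluster summaries in RAM, and periodically checkpoints its state so an
aborted run can be resumed; the names (cluster_stream, assign_and_update, the
checkpoint file) are hypothetical and do not come from the article.</p>
      <preformat>
import pickle

CHECKPOINT = "clusters.pkl"  # hypothetical checkpoint file name


def cluster_stream(rows, assign_and_update, checkpoint_every=10000):
    """Single-pass clustering over a forward-only source of rows.

    `rows` yields each database row exactly once (requirement 2);
    `assign_and_update` is a hypothetical function that picks a cluster for
    the row and updates that cluster's compact summary. Only the summaries
    are kept in RAM (requirement 1), state is written to disk periodically
    so an aborted run can resume (requirement 3), and the data is scanned
    only once (requirement 4).
    """
    summaries = {}  # cluster id -> compact summary statistics
    for i, row in enumerate(rows, start=1):
        assign_and_update(row, summaries)
        if i % checkpoint_every == 0:
            with open(CHECKPOINT, "wb") as f:
                pickle.dump(summaries, f)
    return summaries
</preformat>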
      <p>Currently, the most promising and popular algorithms for clustering
qualitative data include:
1. The Clope algorithm. Its main advantages are the speed and quality of
clustering, which are achieved by using an approximate global optimization
criterion based on maximizing the height of the cluster histogram. While running,
the algorithm keeps only a small amount of information about each cluster in
memory and requires a minimal number of passes over the data set. The number of
clusters is selected automatically by the algorithm based on the repulsion
coefficient set by the user: the larger the coefficient, the lower the required
level of intra-cluster similarity and the more clusters are generated
[<xref ref-type="bibr" rid="ref3">3</xref>].
2. The LargeItem algorithm, developed in 1999 as an optimized algorithm for
clustering transactional data sets based on an estimated cost function that uses
a support parameter.
3. Kohonen maps (this method is applicable to both qualitative and quantitative
indicators). They make it possible to identify useful and non-trivial patterns,
take the influence of hundreds of factors into account, and visualize complex
multidimensional clusters in the form of clear and accessible maps.
4. Hierarchical clustering, for example the MST algorithm (based on minimum
spanning trees; a minimal sketch is given after this list). The MST algorithm
first builds a minimum spanning tree over the graph and then sequentially removes
the edges with the largest weight. Its drawback is a high degree of sensitivity
to outliers in the data set [<xref ref-type="bibr" rid="ref4">4</xref>].</p>
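      <p>To make item 4 concrete, here is a minimal sketch of MST-based clustering,
assuming SciPy is available; the input is a small pairwise distance matrix, and the
number k of heaviest edges to remove (i.e. the number of clusters minus one) is a
parameter chosen purely for illustration.</p>
      <preformat>
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components


def mst_clusters(dist, k):
    """Build a minimum spanning tree over the complete graph given by the
    distance matrix `dist`, remove the k heaviest edges, and label the
    remaining connected components as clusters."""
    mst = minimum_spanning_tree(dist).toarray()   # edges of the MST
    edges = np.argwhere(mst > 0)                  # (i, j) index pairs of MST edges
    weights = mst[mst > 0]
    for i, j in edges[np.argsort(weights)[-k:]]:  # drop the k largest edges
        mst[i, j] = 0
    _, labels = connected_components(mst, directed=False)
    return labels


# Toy example: two well-separated groups of one-dimensional points.
points = np.array([[0.0], [0.2], [0.4], [5.0], [5.3]])
dist = np.abs(points - points.T)                  # pairwise distance matrix
print(mst_clusters(dist, k=1))                    # expected: [0 0 0 1 1]
</preformat>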
    </sec>
    <sec id="sec-2">
      <title>Comparison of Algorithms Clope and Koho</title>
      <p>nen‘s Maps for the Problem of the Consumer</p>
    </sec>
    <sec id="sec-3">
      <title>Basket</title>
      <p>For a more practical comparison of the algorithms, we describe a typical
clustering problem: consumer basket analysis. The aim is to identify the sets of
products that most often occur together in a single check (or order).</p>
      <sec id="sec-3-1">
        <title>Algorithm Clope</title>
        <p>The Clope algorithm is based on the idea of maximizing a global cost
function, which increases the degree of similarity of the transactions within
clusters by increasing the cluster histogram parameter. Consider a simple example
with 5 consumer checks:
1. {bread, milk}.
2. {bread, milk, sour cream}.
3. {bread, sour cream, cheese}.
4. {cheese, sausage}.
5. {cheese, sausage, peppers}.</p>
        <sec id="sec-3-1-1">
          <title>Suppose that we already have two partitions into clusters:</title>
          <p>1. {[bread, milk, bread, milk, sour cream, bread, sour cream, cheese] [cheese,
sausage, cheese, sausage, peppers]}.
2. {[bread, milk, bread, milk, sour cream}, {[bread, sour cream, cheese,
cheese, sausage, cheese, sausage, peppers]}.
Let’s calculate the height (H) and width of the cluster (W) for both clusters.</p>
        <p>H, the height of the histogram, is calculated as the average height
(density) of all the columns in the histogram. For example, for the first cluster
of the first partition we get H = 2 from the following arithmetic operation:
(3 + 2 + 2 + 1) / 4 = 2.</p>
        <p>W, the width of the cluster, is equal to the number of distinct items
(columns) in the cluster. To judge the quality of a partition under Clope, we
compute the cluster diagram parameter (H / W), summing H and W over the clusters
of the partition. For the first partition it is (2 + 1.67) / (4 + 3) = 0.52; for
the second, (1.67 + 1.6) / (3 + 5) = 0.41. The first partition is therefore the
better option, because its cluster diagram parameter is higher, which indicates
that its transactions overlap with each other to a greater degree.</p>
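        <p>The H and W calculations above can be reproduced with a few lines of
Python; this is only a sketch of the simplified H / W criterion used in this
example, not a full implementation of Clope.</p>
        <preformat>
from collections import Counter


def height_and_width(cluster):
    """H and W of a cluster, given as a list of transactions (lists of items)."""
    counts = Counter(item for tx in cluster for item in tx)
    W = len(counts)               # number of distinct items (columns)
    H = sum(counts.values()) / W  # average column height of the histogram
    return H, W


def partition_score(partition):
    """Sum of cluster heights divided by the sum of cluster widths."""
    hw = [height_and_width(cluster) for cluster in partition]
    return sum(h for h, _ in hw) / sum(w for _, w in hw)


checks = [["bread", "milk"],
          ["bread", "milk", "sour cream"],
          ["bread", "sour cream", "cheese"],
          ["cheese", "sausage"],
          ["cheese", "sausage", "peppers"]]

partition_1 = [checks[:3], checks[3:]]         # checks 1-3 and 4-5
partition_2 = [checks[:2], checks[2:]]         # checks 1-2 and 3-5
print(round(partition_score(partition_1), 2))  # 0.52
print(round(partition_score(partition_2), 2))  # 0.41
</preformat>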
      </sec>
      <sec id="sec-3-2">
        <title>Kohonen’s Maps</title>
        <p>Let us try to obtain the same distribution using Kohonen maps. We compose a
text file containing information about the checks: the column "Id" holds the check
number and the column "Item" describes an item contained in that check.</p>
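        <p>For illustration only, the (Id, Item) records just described can be turned
into the binary item vectors that a Kohonen network actually consumes; the pandas
based encoding below is just one possible way to prepare such data and is not tied
to any particular tool.</p>
        <preformat>
import pandas as pd

# Hypothetical contents of the text file: one row per (check, item) pair.
records = pd.DataFrame({
    "Id":   [1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 5, 5],
    "Item": ["bread", "milk",
             "bread", "milk", "sour cream",
             "bread", "sour cream", "cheese",
             "cheese", "sausage",
             "cheese", "sausage", "peppers"],
})

# One-hot matrix: rows are checks, columns are items, entries are 0/1.
X = pd.crosstab(records["Id"], records["Item"]).clip(upper=1)
print(X)
</preformat>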
        <p>The processing options in the input file specify that we want to obtain 2
clusters as a result of processing, built on the basis of the field "Item". The
neuron coefficients were left at their default values. The result of the processing
is as follows (see Figures 4 and 5). According to the results of the distribution,
two clusters were allocated: the first contains checks 1, 2, 3 and the second
checks 4, 5. The distribution is similar to the one produced by the Clope
algorithm. The basic formula, in which the distribution is governed by a
neighborhood function, defines a "measure of neighborhood" between the compared
nodes:
h_ci(t) = α(t) × exp(−‖r_c − r_i‖² / (2σ²(t))),
where 0 &lt; α(t) &lt; 1 is a learning-rate factor that decreases over time, and r_c
and r_i are the coordinates of the nodes M_c(t) and M_i(t). The last step of the
algorithm is to move the weight vectors towards the observation under
consideration:
m_i(t) = m_i(t − 1) + h_ci(t) × (x(t) − m_i(t − 1)).</p>
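        <p>A minimal sketch of these two formulas in Python (a toy one-dimensional
Kohonen map over the binary basket vectors from above; the grid size, learning-rate
schedule and number of epochs are arbitrary choices for illustration, not the
settings used in the experiment described here):</p>
        <preformat>
import numpy as np

rng = np.random.default_rng(0)

# Binary basket vectors for the 5 checks
# (columns: bread, milk, sour cream, cheese, sausage, peppers).
X = np.array([[1, 1, 0, 0, 0, 0],
              [1, 1, 1, 0, 0, 0],
              [1, 0, 1, 1, 0, 0],
              [0, 0, 0, 1, 1, 0],
              [0, 0, 0, 1, 1, 1]], dtype=float)

n_nodes = 2                            # a tiny 1-D map with two nodes
r = np.arange(n_nodes, dtype=float)    # node coordinates r_i on the map
m = rng.random((n_nodes, X.shape[1]))  # initial weight vectors m_i(0)

for t in range(1, 101):
    alpha = 0.5 * (1 - t / 100)        # decreasing learning rate alpha(t)
    sigma = 1.0 * (1 - t / 101)        # shrinking neighborhood radius sigma(t)
    for x in X:
        c = np.argmin(np.linalg.norm(m - x, axis=1))  # best-matching node M_c
        # h_ci(t) = alpha(t) * exp(-||r_c - r_i||^2 / (2 * sigma(t)^2))
        h = alpha * np.exp(-(r[c] - r) ** 2 / (2 * sigma ** 2))
        # m_i(t) = m_i(t-1) + h_ci(t) * (x(t) - m_i(t-1))
        m += h[:, None] * (x - m)

# Node (cluster) assigned to each check after training.
print([int(np.argmin(np.linalg.norm(m - x, axis=1))) for x in X])
</preformat>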
        <p>As a result of the algorithm, the user obtains a visual map in which similar
indicators are grouped into visually isolated clusters.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>Clustering methods are now in active practical use, since they make working
with homogeneous data sets much easier and allow the use of specialized processing
methods that are not suitable for the sample as a whole. In fact, companies around
the globe are allocating ever larger budgets to scientific segmentation and
clustering of customers and end products. The best-known initiatives in this
direction include the "Svyaznoi" company with its "Svyaznoi Club" program, the
Bank of Moscow with the targeted marketing automation system SAS Marketing
Automation, "PSB", "Tinkoff" bank and many others. For example, in its loyalty
program "Svyaznoi Club" the company "Svyaznoi" implemented deep segmentation of
the program's cardholders. Specialists from BaseGroup Labs and the Deductor
analytical platform were involved in solving this problem; they helped to process
the entire array of raw data and built more than 120 classification models of
client response to promotional offers. Despite the fairly extensive use of
qualitative data clustering methods, a number of topical issues hinder their
application and further development:
1. The lack of a comparable treatment for describing objects whose attributes are
of different nature and measured in different units.
2. The complexity of producing a formalized description of objects that have both
qualitative and quantitative traits.
3. The problem of detecting and displaying artifacts (outliers) in the source
data: a clustering method is needed that would identify such artifacts and either
exclude them from the final result or process them in a specific way so that the
fact of an outlier is represented visually.</p>
    </sec>
    <sec id="sec-5">
      <title>References</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Article</given-names>
            <surname>Ke</surname>
          </string-name>
          <string-name>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <surname>ChuXu</surname>
          </string-name>
          , Bing Liu Clustering Transactions Using Large Items 2003
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guan</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>You</surname>
          </string-name>
          . J. CLOPE:
          <article-title>A fast and Effective Clustering Algorithm for Transactional Data In Proc</article-title>
          .
          <source>of SIGKDD'02, July 23-26</source>
          ,
          <year>2002</year>
          , Edmonton, Alberta, Canada.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <article-title>[3] Classification and comparison of clustering methods / IM transmission line protected [electronic resource].</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>F.</given-names>
            <surname>Hooper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Klawonn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kruse</surname>
          </string-name>
          , T.Runkler,
          <string-name>
            <given-names>Fuzzy</given-names>
            <surname>Clustering</surname>
          </string-name>
          . Chichester, United Kingdom: Wiley,
          <year>1999</year>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>