Personal data segmentation based on conjugation index usage

                    P V Hripunov1 and D A Zherdev2


                    1
                        Pension Fund of the Russian Federation, Shabolovka str. 4, Moscow, Russia, 119991
                    2
                        Samara National Research University, Moskovskoe Shosse 34, Samara, Russia, 443086


                    Abstract. The paper proposes a method for processing personal data that allows them to be
                    divided into many segments or classes. The customer database is used as the source data. We
                    use the indicator of conjugacy that has already proved the effectiveness in both recognition and
                    clustering of data problems.


1. Introduction
Data mining problem is a primary problem inprocessing of huge amount data. Different methods of
pattern recognition, classification, images clustering and others have foundthe implementation in the
data mining. Many works [1-4] study clusterization processes of big data. In this study, we research
the recognition ability of some personal data that was presented by a digit vector. For classification
within some probability, it is necessary to find out is a similar element consist in the database. To
achieve this, we use conjugation index. The index effectiveness was shown in study of face
recognition problems [5], as well as objects recognition in radar images [6], [7].

2. Segmentation and classification of personal data
In this section the clustering approach based on conjugation index usage is described. We can decide
that a vector belongs to a class by calculation of conjugation index value. The higher probability of
conjugation index shows that a vector has similarity to the vectors that form a class:
                             Xk  x1  k  , x2  k  , ..., x j  k  ,..., xM  k  , k  1, K ,
where x j   x1 , x2 ,..., xi ,..., xN  is aN×1 feature vector.
    The conjugation index can be presented as:
                                                             xTj Qk x j
                                              Rk  x j                  ,   k  1, K ,
                                                              xTj x j
where K is a class count,
                                                                    1
                                            Qk  Xk  XTk Xk  XTk , k  1, K ,
is a N  N matrix of k -class.
    Each letter in a single categorical data is coded by some index thus digital vector of a string can be
formed. Each letter of Russian alphabet “А-Я” coded by numbers1-33. In the result the new database
of vectors can be formed.
    There are three fields in the database: first name, second name and middle name. For convenience,
the maximum number of possible symbols in the database was chosen to be 100. As the result, all


IV International Conference on "Information Technology and Nanotechnology" (ITNT-2018)
Data Science
P V Hripunov and D A Zherdev


vectors in the new dataset contain 300 features. All vectors in dataset must be the same size. For
example, if the first field size equals 12 symbols then the first 12 features of a vector contain the field
value and other 88 features are filled by zero value.When we processed the database, all personal data
was encrypted by summing up with some digital key.
   We use the similar procedure for clustering which was used in work [6] for clustering of radar
images. At first step of the whole set we choose the two most "distanced" vectors. These vectors have
the minimal value of the correlation ratio and we can be labeled them as x1 , xM .
   Then the algorithm from the remain set of vectors adds two new vectors ( x2 , xM 1 ). Each one of
these vectors must have the maximum of the correlation ratio:
                                                    (xT x )2
                                              R1,2  1 2 ,
                                                     x1 x 2
                                                                (xTM 1x M )2
                                                   RM 1, M                  ,
                                                                 x M 1 x M
with one vector was obtained at the first step. In the result received pairs of vectors x1 , x2 and
xM 1 , xM form the subspaces that were formed by matrices X1,2 and XM 1, M correspondingly. Then
using the remaining set next two vectors x3 , xM 2 that are closest to the subspaces are joined to
previously formed subspaces using computation ofconjugation index with a maximum value.
   Since the database contains a large number of vectors, process continues due finding of specific
number of vectors in both subspaces. For example, for such dataset the resulted subspace contains 15
vectors in both matrices Xk , Xl , which correspond to two subclasses.
   The procedure described above is repeated iteratively with all unlabeled vectors. Clustering is
continued until all the vectors will be specified to any of the subspaces.At the recognition stage with a
certain decision rule, the vector closest to one of the subclasses formed in thedescribed manner is
considered to belong to the class.

3. Results and discussion
In this paper, the problem of the determining possibility whether there is some given record in the
database is examined. After clustering process we can figure out the belonging of a vector to some
class. The subclass stores a small number of vectors in comparisonof the initial database.Therefore,
after the classification of the current vector, it will be easy to analyze data in a subclass and determine
isit possible to add a new value into the database.
    Thus, to verify the above assumption, we performed the experiment. From the database of 1041100
records there was performed the random selection of 1040 records five times. Each selection was
divided onto 80 subclasses, a subclass consists of 13 vectors. After the clustering procedure, the
generated vectors were classified.
    The testing vectors were formed using the existed in the dataset records with some modifications.
For example, there was simulated situation of incorrect handwritten letters conversion when the
personal data was filled in some document. As it was shown in the work [8] the problem of text
recognition is a difficult and can have many solutions. Figure 1 a, b presents the images of two
handwritten words. The word in Figure 1 a was correct converted by some letters recognition software
into “ADAM” as opposed to word in Figure 1 b that led to incorrect result: “ADRM”.


          a)                                         b)
               Figure 1. Examples of a) correct and b) incorrect handwritten letters conversion.

  In the classification experiment, 20 vectors of the type described above were tested. All vectors
were successfully classified based on the many-to-many approach [9] extending the possibilities of the

IV International Conference on "Information Technology and Nanotechnology" (ITNT-2018)                 441
Data Science
P V Hripunov and D A Zherdev


binary classification of the support subspaces algorithm [7]. Moreover the average value of
conjugation index was 0.95 for true defined class. This fact undoubtedly indicates the reliability of
using the conjugation index in problems of this kind. This is an advantage for following research of
such kind both with databases of a more complex type, with a larger field number, and for
classification using a whole database of one million or more records.

4. References
[1] Yang Y and Guan J 2002 CLOPE: a fast and effective clustering algorithm for transactional
      data Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery
      and data mining 682-687
[2] Zhang T, Ramakrishnan R and Livny M 1996 BIRCH: an efficient data clustering method for
      very large databases ACM Sigmod Record 25(2) 103-114
[3] He Z, Xu X and Deng S 2005 A cluster ensemble method for clustering categorical data
      Information Fusion 6(2) 143-151
[4] Huang Z 1997 A fast clustering algorithm to cluster very large categorical data sets in data
      mining DMKD 3(8) 34-39
[5] Fursov V and Kozin N 2007 Recognition through constructing the eigenface classifiers using
      conjugation indices IEEE Conference on Advanced Video and Signal Based Surveillance 465-
      469
[6] Minaev E and Fursov V 2016 Support subspaces method for fractal images recognition CEUR
      Workshop Proceedings 1638 379-385
[7] Zherdev D A, Kazanskiy N L and Fursov V A 2015 Object recognition in radar images using
      conjugation indices and support subspaces Computer Optics 39(2) 255-264 DOI:
      10.18287/0134-2452-2015-39-2-255-264
[8] Bolotova Y A, Spitsyn V G and Osina P M 2017 A review of algorithms for text detection in
      images and videos Computer Optics 41(3) 441-452 DOI: 10.18287/2412-6179-2017-41-3-441-
      452
[9] Bishop Ch M 2006 Pattern Recognition and Machine Learning (New York: Springer) p 738


IV International Conference on "Information Technology and Nanotechnology" (ITNT-2018)            442