Clustering Algorithms for Economic and Psychological Analysis of Human Behavior

Nataliya Boyko[0000-0002-6962-9363], Hanna Komarnytska[0000-0002-5533-6439], Yurii Kryvenchuk[0000-0002-2504-5833], Yuriy Malynovskyy[0000-0002-7139-5623]

Lviv Polytechnic National University, Lviv 79013, Ukraine
nataliya.i.boyko@lpnu.ua, lozjuk7@gmail.com, yurkokryvenchuk@gmail.com, inem.news@gmail.com

Abstract. This article proposes an approach to analyzing the behavior of groups of people and shows how to predict a person's location for the next month. Clustering algorithms were used in this research, and we also examined the problem of finding associative rules, which are able to find and describe patterns in large datasets. We used the scalable Apriori algorithm to find the best rules and the standard mlxtend library to aggregate the data by cluster, user login, and time. While working, we faced the problem of inaccuracy and inconsistency of the data with real conditions and were forced to reduce the minimum support for the associative rules.

Keywords: Preprocessing, clustering, associative rules, Apriori algorithm.

1 Introduction

Nowadays, information technologies offer many opportunities for communication and social networking. At the same time, much of this communication is virtual, and the resulting partial isolation is a serious obstacle to establishing socio-communicative relationships between people. One way to partially solve this problem is the MeetCity application, which uses geolocation and artificial intelligence methods to predict and recommend meetings. For example, if you want to talk to a person with similar interests during your lunch, this application will help [1-3].

To solve the problems described above, we should use machine learning technologies. We propose to use clustering and associative rules. While working, we have formed a scheme for finding algorithms and patterns of behavior of people or groups of people united by common interests [8, 13].

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0) CMiGIN-2019: International Workshop on Conflict Management in Global Information Networks.

2 Statement of the problem

To begin the analysis, we select the following data (Fig. 1):

Fig. 1. MeetCity application input

First of all, we need to vectorize (binarize) the text data of the PlaceName field, using the LabelBinarizer class from the standard sklearn library, in order to be able to apply clustering [5, 6]. As a result, we obtain a binary matrix in which each row is the one-hot encoding of the corresponding PlaceName value.

from sklearn.preprocessing import LabelBinarizer

encoder = LabelBinarizer()
encoder.fit(df['PlaceName'])
data = encoder.transform(df['PlaceName'])
df1 = pd.DataFrame(data)

Fig. 2. Results of binarization of MeetCity application data

In Fig. 2, columns 0 to 8 are the dimensions of the multidimensional space. A value of 1 in a column indicates that the record had the corresponding value of the original PlaceName field.

Since the k-means algorithm is sensitive to outliers, the data must first be normalized. To do this, we apply the MinMax scaling method, which maps all points of the space into the range [0, 1]:

X_norm = (X - X_min) / (X_max - X_min)    (1)
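Formula (1) is exactly what sklearn's MinMaxScaler computes. The following is a small illustrative sketch of it (the toy array and variable names are ours; the preprocessing actually used in this work, based on the Normalizer class, is shown next):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# toy column with one obvious outlier
X = np.array([[1.0], [2.0], [3.0], [100.0]])

scaler = MinMaxScaler()          # implements formula (1): (X - min) / (max - min)
print(scaler.fit_transform(X).ravel())
# [0.         0.01010101 0.02020202 1.        ]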
To do this, we will use the Normalizer class from the sklearn library:

from sklearn.preprocessing import Normalizer

scaler = Normalizer()
scaler.fit(df1)
scaled_df = scaler.transform(df1)
df1 = pd.DataFrame(scaled_df)

As a result of using the Normalizer class, we obtain the output data of the preprocessing function (Fig. 3).

Fig. 3. Output data of the preprocessing function

In Fig. 3 the data are the result of normalization and of grouping the records of Fig. 2 by the Login value. Columns 0-10 are the dimensions of the multidimensional space, and the rows are the projections of the user vectors onto them. All data are normalized with the MinMax scaling method [12-15].

In order for the system to be dynamic and to allow users to change their behavior and, as a consequence, to move to other clusters, the number of clusters should be determined automatically. For this purpose it is necessary to cluster the data for n different numbers of clusters and then to find the inflection point ("elbow") of the dispersion function; a sketch of how this point can be located programmatically is given below. This number is the optimal number of clusters at the moment [4, 7].

An alternative to the "elbow" of the dispersion function is the gap statistic [6-8], which is generated on the basis of resampling and Monte Carlo simulation procedures. Let E*_n{log(W*_k)} denote an estimate of the average dispersion W*_k obtained by the bootstrap method when k clusters are generated from random sets of objects drawn from an input sample of size n. Then the statistic Gap_n(k) = E*_n{log(W*_k)} - log(W_k) determines the deviation of the observed dispersion W_k from its expected value under the null hypothesis that the input data form only one cluster.

To do this, we cluster the data into different numbers of clusters, compute the dispersion, and find the inflection point by the "elbow" method.

from sklearn.cluster import KMeans

def clasternumber(df1):
    # find the number of clusters by the "elbow" method
    cluster_range = range(1, 10)
    cluster_errors = []
    for num_clusters in cluster_range:
        clusters = KMeans(num_clusters)
        clusters.fit(df1)
        cluster_errors.append(clusters.inertia_)

Fig. 4. Dispersion for n clusters

We then find the inflection point of the function. It corresponds to 4 clusters, so the given dataset must be divided into 4 clusters for the best results.

Clustering is the grouping of objects based on the similarity of their properties, so that each cluster consists of similar objects and the objects of different clusters differ significantly [12-18]. Clustering helps us understand the natural grouping or structure of the data. The purpose of clustering is to determine the internal grouping of a set of unlabeled data segments. With the help of clustering we can solve the following data analysis problems: search and discovery of knowledge; grouping and recognition of objects; search for representatives of homogeneous groups (reducing the dimensionality of the data); search for natural clusters and description of their unknown properties; search for useful and appropriate groupings; and search for unusual objects (outlier detection) [2, 19].

Unlike hierarchical methods, in which clusters are built up step by step, partitioning clustering methods examine all segments at once. In doing so, they either attempt to identify clusters by iteratively moving points between subsets, or to identify clusters as areas that are densely filled with objects. The first kind of algorithms belongs to partitioning methods with movement (partitioning relocation clustering).
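As referenced above, the inflection point of the dispersion curve can also be located programmatically rather than read off the plot. The following is a minimal, illustrative sketch (the helper find_elbow, the synthetic data from make_blobs, and the specific knee heuristic are our assumptions; they stand in for the normalized user vectors df1 and the cluster_errors list collected by the loop shown earlier):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def find_elbow(cluster_range, cluster_errors):
    # "Elbow" heuristic: the point of the error curve that lies farthest
    # from the straight line joining its first and last points.
    ks = np.asarray(list(cluster_range), dtype=float)
    errors = np.asarray(cluster_errors, dtype=float)
    x = (ks - ks[0]) / (ks[-1] - ks[0])                # normalize both axes
    y = (errors - errors[-1]) / (errors[0] - errors[-1])
    dist = np.abs(x + y - 1) / np.sqrt(2)              # distance to the chord x + y = 1
    return int(ks[np.argmax(dist)])

# synthetic data standing in for df1
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

cluster_range = range(1, 10)
cluster_errors = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
                  for k in cluster_range]
print(find_elbow(cluster_range, cluster_errors))       # number of clusters at the "elbow"

The same helper can be applied to the cluster_errors list produced by the clasternumber function above, instead of reading Fig. 4 by eye.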
Partitioning relocation methods, in turn, are divided into probabilistic, k-means, and k-medoids methods, and concentrate on assigning points to the appropriate clusters, with a tendency to construct spherical segments. The second type of partitioning algorithms belongs to the group of density-based partitioning methods. These try to identify dense, cohesive data components that are flexible in terms of their shape. This group is less sensitive to outliers and may find irregularly shaped clusters [6-10].

Partitioning relocation clustering methods have several advantages: linear complexity of the algorithm; relative scalability and simplicity; and a good fit to data with compact, well-separated, spherical clusters. The disadvantages include: a significant decrease in performance on high-dimensional data; the need to specify the number of clusters in advance; sensitivity to initialization, noise, and outliers; the possibility of converging to local optima; and the inability to cope with clusters of different shapes and densities [11, 14, 16].

To divide users into clusters, we use the k-means algorithm and then apply the resulting data to the Apriori algorithm to obtain user behavior patterns [1, 15].

The principle of the k-means algorithm is to find cluster centers and the sets of elements of each cluster, given some function F(·) that expresses the quality of the current partition of the set into k clusters, such that the total squared deviation of the cluster elements from the centers of these clusters is smallest:

V = \sum_{i=1}^{k} \sum_{x_j \in S_i} (x_j - \mu_i)^2    (2)

where k is the number of clusters, S_i (i = 1, 2, ..., k) are the obtained clusters, and \mu_i is the center of mass (centroid) of the vectors x_j \in S_i.

At the first step of the k-means algorithm, we select the cluster centers arbitrarily. Then, for each element of the set, we calculate the distances to the centers and assign each element to the nearest cluster. For each of the obtained clusters, we calculate a new value of the center, trying to minimize the function F(·). After that, the procedure of redistributing the elements between the clusters is repeated.

Algorithm of clustering using the k-means scheme (a minimal sketch of this loop is given at the end of this subsection):
– select k data points as the initial cluster centers;
– repeat the following steps until the cluster centers stop changing:
– assign each data point to the cluster whose center is nearest;
– ensure that each cluster contains at least one point; to do this, each empty cluster is supplemented by an arbitrary point located "far" from the center of the cluster;
– replace the center of each cluster with the mean value of its elements.

The main advantages of the k-means method are its simplicity and speed of execution. The k-means method is more convenient for clustering large numbers of observations than hierarchical cluster analysis (in which the dendrograms become overloaded and lose clarity). One of the disadvantages of the basic method is that the connectivity of the elements of one cluster may be violated, so various modifications of the method have been developed, as well as fuzzy k-means methods, in which, at the first stage of the algorithm, one element of the set is allowed to belong to several clusters (with varying degrees of membership).
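As referenced above, the loop that minimizes expression (2) can be written in a few lines. The following is a minimal, illustrative NumPy sketch of the basic scheme (the function name kmeans_loop, the random initialization, and the handling of empty clusters are our simplifying assumptions; the application itself uses sklearn's KMeans, shown in the next subsection):

import numpy as np

def kmeans_loop(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # select k data points as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each point to the cluster whose center is nearest
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # replace each center with the mean of its cluster elements;
        # an empty cluster keeps its previous center here (a simplification
        # of the "far point" rule described above)
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):    # centers stopped changing
            break
        centers = new_centers
    return labels, centers

Calling kmeans_loop(df1.values, 4) would produce a partition of the normalized user vectors into four clusters; in practice sklearn's implementation adds better initialization (k-means++) and multiple restarts, which mitigates the sensitivity to the starting positions discussed below.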
Despite the obvious advantages of the method, it also has significant disadvantages:
– the result of the classification strongly depends on the random starting positions of the cluster centers;
– the algorithm is sensitive to outliers, which can distort the mean value;
– the researcher has to determine the number of clusters in advance.

For clustering, we pass to the function numerical normalized data, which are the coordinates of the user vectors in multidimensional space based on their activity:

Fig. 5. Input data for clustering

1. K-means algorithm using Python and the sklearn library:

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

def kmeans(n, df1):
    df1.index = np.arange(len(df1))
    # clustering into n clusters (n = 4 in our case)
    km = KMeans(n_clusters=n)
    data = km.fit_predict(df1)
    data = pd.DataFrame(data)
    centroids = km.cluster_centers_
    centroids = pd.DataFrame(centroids)
    return data, centroids

2. At the output, each element is assigned to a specific cluster:

Fig. 6. K-means algorithm result displayed in a diagram

As a result, we obtained data divided into 4 clusters. The size of each section corresponds to the number of values in the cluster. Next, we need to use this information to find rules of behavior; therefore, after clustering, we start looking for associative rules.

Affinity analysis is one of the common methods of Data Mining. The purpose of this method is to investigate the relationships between events that occur together, that is, to identify associations between different events and to find rules for quantifying the relationship between two or more events. These rules are called association rules.

The basic concepts in associative rule theory are the subject set and the transaction. A subject set is some non-empty set of elements that transactions can include:

I = {i_1, i_2, ..., i_k, ..., i_n},    (3)

where i_k are the elements included in the subject sets, k = 1..n, and n is the number of elements of the set I.

A transaction is a set of elements of I that occur together. A transaction also has a unique identifier, TID (Transaction ID). The database contains a certain set of transactions:

T = {t_1, t_2, ..., t_i, ..., t_m},    (4)

where t_i is a transaction and m is the number of transactions.

Between transaction elements we can establish regularities in the form of associative rules: a set X = {i_k | i_k ∈ I}, called the antecedent (condition), and a set Y = {j_k | j_k ∈ I}, called the consequent. The same element must not be included in the antecedent and the consequent at the same time: i_k ≠ j_k.

An associative rule describes the relationship between subject sets in which the condition implies the consequence; it is written as X → Y. The sets X and Y must not intersect: X ∩ Y = Ø. The main indicators of the importance of an associative rule are support and confidence.

We should distinguish between support for a set and support for an associative rule. In both cases it is defined as the ratio of the number of transactions containing the specified items to the total number of transactions. The only difference is that, to calculate the support of a set, we count the transactions that contain that set, whereas to calculate the support of an associative rule, we count the transactions that contain both the condition and the consequence at the same time.

Let us assume that we have some set X and an associative rule X → Y. Then the support of the set X is:

Supp(X) = |X(t)| / |T|,    (5)

where X(t) = {t ∈ T | X ⊆ t}.

The support of the associative rule is:

Supp(X → Y) = |X(t) ∩ Y(t)| / |T|,    (6)

where X(t) = {t ∈ T | X ⊆ t} and Y(t) = {t ∈ T | Y ⊆ t}.
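To make formulas (5) and (6) concrete, the following is a small illustrative sketch that computes them for a toy transaction set (the place names and the helper support() are invented for illustration and are not part of the application code):

# toy transactions (formula (4)): each set is one transaction t_i
transactions = [
    {"Driving", "Shopping"},
    {"Driving", "Shopping", "Finance"},
    {"Driving"},
    {"Amusements", "Finance"},
]

def support(itemset, transactions):
    # Supp(X) = |{t in T : X is a subset of t}| / |T|   -- formula (5)
    return sum(itemset <= t for t in transactions) / len(transactions)

X, Y = {"Driving"}, {"Shopping"}
print(support(X, transactions))        # Supp(X) = 3/4 = 0.75
print(support(X | Y, transactions))    # Supp(X -> Y) = 2/4 = 0.5, formula (6)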
Because modern databases can reach large volumes (gigabytes and terabytes), finding associative rules requires efficient algorithms that are scalable and allow a solution of the problem to be found in an acceptable time.

The Apriori algorithm was designed for relational databases and allows frequent datasets to be generated from transaction tables.

The Apriori algorithm uses an iterative approach. At the first step it finds the single-element frequent datasets, denoted by the set L1. At the next step, L1 is used to find the frequent sets of two elements, which form the set L2; L2 is then used to find L3, and so on. To increase the productivity of frequent dataset generation, the anti-monotonicity property is used. It is based on the following observation: if some dataset I is not frequent, sup(I) < min_sup, then adding any object i to it yields a dataset that is also not frequent:

sup(I ∪ {i}) < min_sup.    (7)

Using this property, the frequent k-element sets L_k can be obtained by combining frequent (k-1)-element sets. Moreover, in order for some k-element set to be included in L_k, all of its (k-1)-element subsets must also be frequent. If at least one of them is not frequent, the set must be excluded from the set of frequent subject sets.

This observation allows a set of k-element candidates C_k to be created, of which L_k will be a subset. L_k is obtained by removing from C_k the infrequent datasets, which is the result of checking the support value of each candidate c_k ∈ C_k. Based on the anti-monotonicity property, the set C_k is generated in two steps. In the first step, candidates are generated by joining members of the set of frequent sets L_{k-1}, where two members can be joined if they have k-2 common elements, i.e.:

L_{k-1} * L_{k-1} = { A ∪ B | A, B ∈ L_{k-1}, |A ∩ B| = k - 2 }.    (8)

The next step is to remove from C_k those members that include (k-1)-element subsets that are not frequent.

3 Numerical experiments

We used the standard mlxtend library to aggregate the data by cluster, user login, and time.

Fig. 7. Data preparation for accurate prediction

After that we use the apriori function to find associative dependencies:

def rules(data, df2):
    # convert the Unix timestamps and attach each user's cluster label
    df2['TimeStamp'] = pd.to_datetime(df2['TimeStamp'], unit='s')
    df2 = pd.merge(df2, data, on='Login')
    df2.set_index(['Class', 'Login'], inplace=True)
    df2.sort_index(inplace=True)
    # drop the fields that are not needed for rule mining
    del df2['Longitude']
    del df2['Latitude']
    del df2['Ratting']
    # one transaction = the list of places visited at the same time
    df2 = df2.groupby('TimeStamp')['PlaceName'].apply(list)
    from mlxtend.preprocessing import TransactionEncoder
    te = TransactionEncoder()
    te_ary = te.fit(df2).transform(df2)
    df = pd.DataFrame(te_ary, columns=te.columns_)
    from mlxtend.frequent_patterns import apriori
    rules = apriori(df, min_support=0.01, use_colnames=True)
    return rules

4 Analysis of the results

The result of the algorithm is shown below:

Fig. 8. Graph representation of the discovered rules

In Fig. 8 we can see the result of the associative rule algorithm: the diameter of a circle corresponds to the support (frequency) of a rule, and the arrows to and from a circle indicate the order of the elements in the rule. As we can see, the Driving → Shopping rule (the largest and brightest circle) is the most likely; it has a support of 0.046 and a color saturation of 1.818. The least likely is the Amusements → Finance rule, with a support of 0.008 and a circle saturation of 0.369. A sketch of how such rule-level metrics can be derived from the frequent itemsets is given below.
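As mentioned above, the rule-level metrics (the support and strength of rules such as Driving → Shopping) can be derived from the frequent itemsets returned by apriori using mlxtend's association_rules function. A minimal illustrative sketch (the toy transactions are invented; in the application, the DataFrame returned by the rules() function above would be passed to association_rules instead, and exact argument names may differ slightly between mlxtend versions):

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# illustrative transactions standing in for the aggregated place lists
transactions = [['Driving', 'Shopping'], ['Driving', 'Shopping'],
                ['Driving'], ['Amusements', 'Finance']]

te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions),
                  columns=te.columns_)

frequent = apriori(df, min_support=0.01, use_colnames=True)
# turn the frequent itemsets into X -> Y rules ranked by lift
found_rules = association_rules(frequent, metric='lift', min_threshold=1.0)
print(found_rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']]
      .sort_values('lift', ascending=False))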
Fig. 9. Most popular places according to the Apriori algorithm

Fig. 9 shows a diagram of the five most popular places, in descending order. It can therefore be assumed that, thanks to the use of MeetCity and its geolocation capabilities for predicting and recommending meetings, Driving is the most visited place.

5 Conclusion

In this article we explored the Apriori and k-means clustering algorithms to obtain a template for user behavior analysis. In the process, we looked at the problem of finding associative rules, which are able to find and describe patterns in large datasets. We used the scalable Apriori algorithm to find the best rules. For the analysis, we used the standard mlxtend library to aggregate the data by cluster, user login, and time.

While working, we faced the problem of inaccuracy and inconsistency of the data with real conditions and were forced to reduce the minimum support for the associative rules.

References

1. Estivill-Castro, V., Lee, I.: Amoeba: Hierarchical clustering based on spatial proximity using Delaunay diagram. In: 9th Intern. Symp. on spatial data handling, pp. 26–41, Beijing, China (2000)
2. Kang, H.-Y., Lim, B.-J., Li, K.-J.: P2P spatial query processing by Delaunay triangulation. Lecture notes in computer science, vol. 3428, pp. 136–150, Springer, Heidelberg (2005)
3. Boehm, C., Kailing, K., Kriegel, H., Kroeger, P.: Density connected clustering with local subspace preferences. In: Proc. of the 4th IEEE Intern. conf. on data mining, pp. 27–34, Los Alamitos: IEEE Computer Society (2004)
4. Wang, Y., Wu, X.: Heterogeneous spatial data mining based on grid. Lecture notes in computer science, vol. 4683, pp. 503–510, Springer, Heidelberg (2007)
5. Harel, D., Koren, Y.: Clustering spatial data using random walks. In: Proc. of the 7th ACM SIGKDD Intern. conf. on knowledge discovery and data mining, pp. 281–286, San Francisco, California (2001)
6. Turton, I., Openshaw, S., Brunsdon, C. et al.: Testing spacetime and more complex hyperspace geographical analysis tools. In: Innovations in GIS 7, pp. 87–100, L.: Taylor & Francis (2000)
7. Tung, A.K.H., Hou, J., Han, J.: Spatial clustering in the presence of obstacles. In: The 17th Intern. conf. on data engineering (ICDE'01), pp. 359–367, Heidelberg (2001)
8. Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P.: Automatic subspace clustering of high dimensional data. In: Data mining knowledge discovery, vol. 11(1), pp. 5–33 (2005)
9. Guimei, L., Jinyan, L., Sim, K., Limsoon, W.: Distance based subspace clustering with flexible dimension partitioning. In: Proc. of the IEEE 23rd Intern. conf. on digital object identifier, vol. 15, pp. 1250–1254 (2007)
10. Aggarwal, C., Yu, P.: Finding generalized projected clusters in high dimensional spaces. In: ACM SIGMOD Intern. conf. on management of data, pp. 70–81 (2000)
11. Procopiuc, C.M., Jones, M., Agarwal, P.K., Murali, T.M.: A Monte Carlo algorithm for fast projective clustering. In: ACM SIGMOD Intern. conf. on management of data, pp. 418–427, Madison, Wisconsin, USA (2002)
12. Ankerst, M., Ester, M., Kriegel, H.-P.: Towards an effective cooperation of the user and the computer for classification. In: Proc. of the 6th ACM SIGKDD Intern. conf. on knowledge discovery and data mining, pp. 179–188, Boston, Massachusetts, USA (2000)
13. Peuquet, D.J.: Representations of space and time. N. Y., Guilford Press (2002)
14. Guo, D., Peuquet, D.J., Gahegan, M.: ICEAGE: Interactive clustering and exploration of large and high-dimensional geodata. Geoinformatica, vol. 3, N. 7, pp. 229–253 (2003)
15. Boyko, N.: A look trough methods of intellectual data analysis and their applying in informational systems. In: Scientific and Technical Conference "Computer Sciences and Information Technologies" (CSIT), 2016 XIth International, pp. 183–185, IEEE (2016)
16. Shakhovska, N., Boyko, N., Zasoba, Y., Benova, E.: Big data processing technologies in distributed information systems. Procedia Computer Science, 10th International conference on emerging ubiquitous systems and pervasive networks (EUSPN-2019), 9th International conference on current and future trends of information and communication technologies in healthcare (ICTH-2019), vol. 160, pp. 561–566, Lviv, Ukraine (2019)
17. Boyko, N., Shakhovska, Kh., Mochurad, L., Campos, J.: Information system of catering selection by using clustering analysis. In: Proceedings of the 1st International Workshop on Digital Content & Smart Multimedia (DCSMart 2019), pp. 94–106, Lviv, Ukraine (2019)
18. Kunanets, N., Vasiuta, O., Boiko, N.: Advanced technologies of big data research in distributed information systems. In: Proceedings of the 14th International conference "Computer sciences and Information technologies" (CSIT 2019), pp. 71–76, Lviv, Ukraine (2019)
19. Boyko, N., Basystiuk, O.: Comparison of machine learning libraries performance used for machine translation based on recurrent neural networks. In: 2018 IEEE Ukraine Student, Young Professional and Women in Engineering Congress (UKRSYW), pp. 78–82, Kyiv, Ukraine (2018)