=Paper= {{Paper |id=Vol-2416/paper63 |storemode=property |title=Determining the proximity of groups in social networks based on text analysis using big data |pdfUrl=https://ceur-ws.org/Vol-2416/paper63.pdf |volume=Vol-2416 |authors=Andrey Mykhin,Igor Rytsarev,Rustam Paringer,Alexander Kupriyanov,Dmitriy Kirsh }} ==Determining the proximity of groups in social networks based on text analysis using big data == https://ceur-ws.org/Vol-2416/paper63.pdf
Determining the proximity of groups in social networks
based on text analysis using big data

                A S Mukhin1, I A Rytsarev1,2, R A Paringer1,2, A V Kupriyanov1,2, D V Kirsh1,2


                1 Samara National Research University, Moskovskoe Shosse 34A, Samara, Russia, 443086
                2 Image Processing Systems Institute of RAS - Branch of the FSRC "Crystallography and
                Photonics" RAS, Molodogvardejskaya street 151, Samara, Russia, 443001



                e-mail: andrey63ru@mail.ru



                Abstract. The article is devoted to determining the proximity of groups in social
                networks. Data from the social network VK (Vkontakte) were chosen as the object of
                study. Text data were collected, processed and analyzed. To obtain the necessary
                information, we studied how to optimize data collection from VK. A software tool was
                developed that collects and then processes the necessary data from the specified
                resources. Existing text analysis algorithms, mainly those suited to large volumes
                of text, were investigated and applied.



1. Introduction
Currently, social networks are booming: every day their users send billions of messages and leave
millions of comments under the relevant posts. The analysis of such content is of great importance
for many areas of business. For example, it is impossible to overestimate the impact of Internet
marketing on the promotion of goods and services. However, a clear understanding of user requests is
essential to use these mechanisms effectively. The source of such information can be the materials
published by users of social networks, as well as shares and reposts by individual users and entire
communities. Thus, determining the proximity of groups in the social network Vkontakte using
BigData technology, which is considered in this paper, is a relevant task of scientific importance
in the field of data analysis.
    Processing of social network data is an active research area. For example, the authors of [1]
propose text normalization with a deep convolutional character-level embedding (Conv-char-Emb)
neural network model for sentiment analysis (SA) of unstructured data. This model addresses three
problems: (1) processing noisy sentences for sentiment detection; (2) handling the small memory
space in word-level embedded learning; (3) accurate sentiment analysis of unstructured data. In [2],
the authors introduce SS3, a novel supervised learning model for text classification designed as a
general framework for early risk detection (ERD) problems. In [3], the authors propose a
nonparametric model (NPMM) which exploits auxiliary word embeddings to infer the number of topics
and employs a “spike and slab” function to alleviate the sparsity problem of topic-word
distributions in online short text analysis. NPMM can automatically


                    V International Conference on "Information Technology and Nanotechnology" (ITNT-2019)
Data Science
A S Mukhin, I A Rytsarev, R A Paringer, A V Kupriyanov, D V. Kirsh




decide whether a given document belongs to existing topics, measured by the squared Mahalanobis
distance. The authors of [4] examine the long-term relationship between signals derived from nine
years of unstructured social media microblog text data and financial market developments in five
major economic regions; they construct directional sentiment metrics using statistical language
modeling techniques. In [5], the authors propose a background feature clustering technique for
social text. Compared with traditional methods, background feature clustering avoids the
constraints caused by data sparseness and spatio-temporal dependence and can be used to discover
unpredictable activities.

2. Social network data collection
The social network Vkontakte was selected as a data source for this study [6]. The reasons for this
choice are as follows:
    • the network provides open access to its data (no restrictions on accessing the server data);
    • Vkontakte is the most popular social network in Russia and the fifth most popular social
       network in the world;
    • Vkontakte is a full-fledged social network (unlike Twitter and Instagram, which are
       microblogs) that allows users to create thematic communities, which are of particular
       interest for this study.
    As part of this study, a Python software package was developed, consisting of an authorization
module, a data collection module, and a filtration module. The package collects data and filters
them so that only the relevant information is kept.
    Within this study, the developed software package was used to collect more than 8,000 posts and
over 280,000 comments on them from the two most popular communities of the city of Samara
(“Podslushano Samara” and “Uslyshano Samara”) and from the community of the Samara
University students (“Podslushano Samarsky Universitet”).
    Streaming data obtained from social networks contain a lot of service information. Only the
relevant data matter for further analysis, so the service information must be separated from them.
The pre-processing module of the software package structures the collected data and separates the
relevant fields from the service fields.
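
The filtering step can be illustrated with a minimal Python sketch. The field names (`id`, `date`, `text`) and the sample records are hypothetical: the source does not list which fields the package keeps, so this only shows the general idea of separating relevant fields from service fields.

```python
from typing import Any, Dict, List

# Hypothetical set of relevant fields; the source does not name them.
RELEVANT_FIELDS = ("id", "date", "text")


def keep_relevant(record: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the record that contains only the relevant fields."""
    return {key: record[key] for key in RELEVANT_FIELDS if key in record}


def filter_records(records: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Drop service fields and skip records whose text is empty."""
    cleaned = [keep_relevant(record) for record in records]
    return [record for record in cleaned if record.get("text", "").strip()]


# Made-up sample records with extra service fields.
raw = [
    {"id": 1, "date": 1546300800, "text": "Hello Samara", "marked_as_ads": 0},
    {"id": 2, "date": 1546300900, "text": "   ", "post_type": "post"},
]
print(filter_records(raw))
```

Keeping the list of relevant fields in one constant makes it easy to adjust the filter when a different analysis needs other fields.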

3. Determination of the proximity of groups using BigData technology
To determine the proximity of groups, several metrics for the comparison of word indexes were
considered: Euclidean distance, city-block distance and Mahalanobis distance [7, 8]. The Euclidean
distance was chosen, since it is most suitable for this experiment according to the following criteria:
     1) It is the most widely used and universal metric;
     2) The Euclidean distance is calculated based on the original, not the standardized data.
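
For reference, the three candidate metrics can be written as follows for attribute vectors $x$ and $y$ of length $n$. The notation is standard rather than taken from the source; in particular, $S$ (the covariance matrix of the attributes) is a symbol introduced here for illustration.

```latex
\begin{align*}
d_{\mathrm{Euclid}}(x, y) &= \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}, \\
d_{\mathrm{city}}(x, y)   &= \sum_{i=1}^{n} \lvert x_i - y_i \rvert, \\
d_{\mathrm{Mahal}}(x, y)  &= \sqrt{(x - y)^{\mathsf{T}} S^{-1} (x - y)}.
\end{align*}
```

The Mahalanobis distance rescales the attributes by their covariance, which relates to criterion 2: the Euclidean distance works directly on the original weights. When $S$ is the identity matrix, the Mahalanobis distance reduces to the Euclidean one.
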
    To calculate the Euclidean distance, attribute vectors were formed for the groups by combining
two or more word indexes into a common one [9]. A weight was assigned to each word in a word index,
so that each group took the form of a vector of attributes (words) with their own weights. In this
paper, the word frequency count was used as the weight [10, 11].
    Calculating word weights with traditional methods and technologies requires huge computational
resources and takes a long time as the volume and the number of analyzed word indexes grow, so
BigData technology and computational clusters were used for this purpose [12]. At this stage, an
algorithm based on MapReduce technology was developed; it rejects non-informative entries of the
word index (words shorter than three or longer than fifteen characters) and counts the frequency of
each word in the text. As a result, three word indexes with weighted elements were built; part of
one of them is shown in Fig. 1.
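
The word-counting step can be sketched in plain Python in the MapReduce style. This is a single-machine illustration, not the cluster implementation used in the study; the tokenization rule, the lowercasing and all function names are assumptions. Only the length bounds (three to fifteen characters) come from the text.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

MIN_LEN, MAX_LEN = 3, 15  # length bounds stated in the text


def map_words(text: str) -> Iterator[Tuple[str, int]]:
    """Map phase: emit (word, 1) for every word of acceptable length."""
    for word in re.findall(r"\w+", text.lower()):  # assumed tokenization
        if MIN_LEN <= len(word) <= MAX_LEN:
            yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle phase: group the emitted counts by word."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_counts(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce phase: sum the counts of each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


def build_word_index(texts: Iterable[str]) -> Dict[str, int]:
    """Build a word index (word -> frequency) from a collection of texts."""
    pairs = (pair for text in texts for pair in map_words(text))
    return reduce_counts(shuffle(pairs))


print(build_word_index(["The city of Samara", "Samara is a big city"]))
```

On a cluster, the map and reduce functions would run in parallel over partitions of the data, while the shuffle would be handled by the framework.
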
    At the next step, two word indexes (those of the groups “Podslushano Samara” and “Uslyshano
Samara”) were used to build a common word index, and the remaining word index (that of the group
“Podslushano Samarsky Universitet”) was used as a test. The common word index consisted of the
overlapping words, with weights recalculated according to formula (1).




             Figure 1. Part of the word index developed for the group “Podslushano Samara”.

$$g(g_1, g_2) = \frac{g_1 + g_2}{n} \qquad (1)$$

where $g(g_1, g_2)$ is the weight of the word in the common word index; $g_1$ is the weight of the
word in the first word index; $g_2$ is the weight of the word in the second word index.
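
A minimal Python sketch of building the common word index is given below. The source does not define $n$; the sketch assumes that $n$ is the number of word indexes being combined (here 2), so that the common weight is the mean of the two weights. The function name and the sample data are hypothetical.

```python
from typing import Dict


def common_word_index(first: Dict[str, float], second: Dict[str, float]) -> Dict[str, float]:
    """Keep the words present in both indexes and average their weights."""
    n = 2  # assumption: n is the number of combined word indexes
    shared = first.keys() & second.keys()
    return {word: (first[word] + second[word]) / n for word in shared}


# Made-up frequency counts for two groups.
first = {"samara": 4, "city": 2, "river": 1}
second = {"samara": 2, "city": 6, "student": 3}
print(common_word_index(first, second))
```

Words that occur in only one of the indexes ("river" and "student" in this example) are left out of the common index.
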
    The last step was to calculate first the distances between the resulting word index and the groups
it was based on, and then the distance between the resulting word index and the test group. The
Euclidean formula of distance between the two groups was used to measure this value.
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \qquad (2)$$
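
Formula (2) can be applied to two word indexes stored as dictionaries, as in the Python sketch below. How the vectors are aligned is an assumption: the sum runs over a supplied vocabulary (or over all words of both indexes if none is given), and a word missing from an index gets weight zero. The function name and the sample data are hypothetical.

```python
import math
from typing import Dict, Iterable, Optional


def euclidean_distance(
    x: Dict[str, float],
    y: Dict[str, float],
    vocabulary: Optional[Iterable[str]] = None,
) -> float:
    """Euclidean distance between two word indexes over a vocabulary.

    Assumption: a word absent from an index has weight 0.
    """
    words = vocabulary if vocabulary is not None else x.keys() | y.keys()
    return math.sqrt(sum((x.get(w, 0.0) - y.get(w, 0.0)) ** 2 for w in words))


# Made-up data: a common index and the index of one group.
common = {"samara": 3.0, "city": 4.0}
group = {"samara": 4, "city": 2, "river": 1}
print(euclidean_distance(common, group, vocabulary=common.keys()))
print(euclidean_distance(common, group))
```

If the common weights are the mean of two groups' weights ($n = 2$ in formula (1)) and the sum runs only over the shared words, the distances from the common index to both source groups coincide. The unequal values in Table 1 therefore suggest that the vectors were aligned in some other way, which the source does not describe.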

    The results are provided in Table 1.

                      Table 1. The results of calculating the distances between the groups.
                                   Title                                  Euclidean distance
                     “Podslushano Samara”                                       188.32
                     “Uslyshano Samara”                                         173.11
                     “Podslushano Samarsky Universitet”                         165.98

    Based on the results, the distances from the common word index to the first two groups are
close to each other. This suggests that the common word index is compiled quite accurately and
reflects the context of the messages in these groups well. After analyzing the distance to the test
group, one can notice that all three values are of the same order, but it is also clear that this
distance has doubled as compared to the other two. It can be assumed that this group differs
slightly from the other two.

4. Research on five groups
At the next step, two more communities were added to the groups already analyzed, and additional
research was conducted. Table 2 lists the groups by number together with their numbers of
subscribers, posts, comments and words.


                       Table 2. Analyzed communities and their quantitative indicators.
            Group number                            №1           №2           №3          №4        №5
            Count of subscribers at the
            time of receiving data             217 679       92 024      150 587      60 225    10 465
            Count of posts                      32 485       10 957       20 783       8 572     1 034
            Count of comments                3 898 212    1 949 106      133 188     171 444    22 748
            Count of words                  40 167 816   17 541 954   14 650 780   1 371 552   107 650
   To show the applicability of the proposed method for calculating distances between groups using
a common dictionary, we calculated the distances between groups both without and with a common
dictionary. Table 3 presents the results of the calculations without a common dictionary. Table 4
presents the Euclidean distances between each pair of groups and the common dictionary built from
that pair.

      Table 3. Euclidean distance calculation results for all five groups without using a common
                                              dictionary.
                                      №1            №2               №3          №4            №5
                   №1                 -             -                -           -             -
                   №2                 513.363       -                -           -             -
                   №3                 571.324       603.66           -           -             -
                   №4                 644.413       863.041          867.504     -             -
                   №5                 701.423       727.723          689.51      974.583       -

Table 4. The results of the calculation of the Euclidean distances between groups and their common
                vocabulary. Each cell lists the two distances reported for the pair.
                         №1                №2                №3                №4             №5
         №1              -                 -                 -                 -              -
         №2       188.32 / 173.11          -                 -                 -              -
         №3       195.55 / 365.98   219.66 / 386.85          -                 -              -
         №4       296.01 / 542.91   288.95 / 589.46   272.03 / 541.34          -              -
         №5       193.5 / 390.71    201.74 / 402.43   225.43 / 453.11   167.21 / 467.21       -
   Comparing the obtained results, we can see that the distances calculated with a common
dictionary are smaller than those calculated without it. From this we can assume that the use of a
common dictionary for calculating Euclidean distances is justified and applicable to the problem
posed.

5. Dependence of the Euclidean distance between groups on the size of the common dictionary
We investigate at what size of the common dictionary the measure of similarity between groups is
most informative. To do this, we experimentally calculate the Euclidean distances between groups 1
and 2 and their common vocabulary while varying the size of the common vocabulary. For the study,
we choose a dictionary size equal to the greatest number of unique words for the second group
(18,948 words), a small dictionary size (300 words) and several intermediate values. Table 5 shows
the results of this experiment.
    The results show that for both the anomalously large and the anomalously small common
dictionary, the distances are the least informative. Most likely, this is because a small
dictionary contains mostly the most common words, which carry little information and occur about
equally often in the texts of both groups. For the maximum dictionary size the situation is the
opposite, because many rare words that occur in only one of the groups are taken into account, and
the results therefore show an abnormally large scatter. For the intermediate sizes, the distances
stop changing sharply relative to each other when the common dictionary holds about 3/5 of the
number of unique words in the group that has the most of them (3/5 × 18,948 ≈ 11,369, which is
close to the 11,000-word row of Table 5). At this size, the least informative words are cut off.

Table 5. The results of the calculation of the Euclidean distances between groups and their common
                          vocabulary for different dictionary sizes.
                        Dictionary size      Distance to group number 1      Distance to group number 2
                             18948                    349.92                          306.92
                             14000                    278.21                          249.96
                             11000                    188.32                          173.11
                              9000                    195.44                          181.06
                              2000                    161.32                          155.78
                               300                     60.66                           61.05
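
The experiment summarized in Table 5 can be sketched in Python as follows. The source does not state how the common dictionary is cut to a given size; the sketch assumes that the words are ranked by weight and the top `size` words are kept, which matches the remark that a small dictionary holds mostly the most common words. The function names and the sample data are hypothetical.

```python
import math
from typing import Dict


def truncate_index(index: Dict[str, float], size: int) -> Dict[str, float]:
    """Keep the `size` heaviest words (ties broken alphabetically)."""
    ranked = sorted(index.items(), key=lambda item: (-item[1], item[0]))
    return dict(ranked[:size])


def distance_at_size(common: Dict[str, float], group: Dict[str, float], size: int) -> float:
    """Euclidean distance between a truncated common index and a group.

    Assumption: a word absent from the group's index has weight 0.
    """
    reduced = truncate_index(common, size)
    return math.sqrt(
        sum((weight - group.get(word, 0.0)) ** 2 for word, weight in reduced.items())
    )


# Made-up data: a common index and the index of one group.
common = {"samara": 9.0, "city": 4.0, "river": 2.0}
group = {"samara": 10, "city": 1, "bus": 7}
for size in (1, 2, 3):
    print(size, round(distance_at_size(common, group, size), 3))
```

Running the loop over a range of sizes gives one distance per dictionary size, which is the shape of the data in Table 5.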

6. Conclusion
Within this study, a set of software modules was developed that makes it possible to determine the
distance between communities of the social network Vkontakte. As a result of the work, a common
word index was compiled, on the basis of which the degrees of proximity between three communities
were determined. In the future, the results of the work can be used to develop algorithms for
determining the proximity of larger groups and communities using BigData technology.

7. References
[1] Arora M and Kansal V 2019 Character level embedding with deep convolutional neural network
      for text normalization of unstructured data for Twitter sentiment analysis Social Network
      Analysis and Mining 9(1) 12
[2] Burdisso S G, Errecalde M and Montes-y-Gómez M 2019 A text classification framework for
      simple and effective early depression detection over social media streams Expert Systems with
      Applications 133 182-197
[3] Chen J, Gong Z and Liu W 2019 A Nonparametric Model for Online Topic Discovery with
      Word Embeddings Information Sciences
[4] Groß-Klußmann A, König S and Ebner M 2019 Buzzwords build Momentum: Global
      Financial Twitter Sentiment and the Aggregate Stock Market Expert Systems with
      Applications
[5] Zhu C and Du J 2018 Background feature clustering and its application to social text Information
      Processing Letters 136 44-48
[6] Xu X 2007 Scan: a structural clustering algorithm for networks Proceedings of the 13th ACM
      SIGKDD international conference on Knowledge discovery and data mining 824-833



[7]  Rytsarev I A, Kozlov D D, Kravtsova N S, Kupriyanov A V, Liseckiy K S, Liseckiy S K,
     Paringer R A and Samykina N Yu 2018 Application of the principal component analysis to
     detect semantic differences during the content analysis of social networks CEUR Workshop
     Proceedings 2212 262-269
[8] Rytsarev I A, Kupriyanov A V, Kirsh D V and Liseckiy K S 2018 Clustering of social media
     content with the use of BigData technology Journal of Physics: Conference Series 1096(1)
[9] Rytsarev I A, Kirsh D V and Kupriyanov A V 2018 Clustering of media content from social
     networks using BigData technology Computer Optics 42(5) 921-927 DOI:
     10.18287/2412-6179-2018-42-5-921-927
[10] Mikhaylov D V, Kozlov A P and Emelyanov G M 2016 Extraction of knowledge and relevant
     linguistic means with efficiency estimation for the formation of subject-oriented text sets
     Computer Optics 40(4) 572-582 DOI: 10.18287/2412-6179-2016-40-4-572-582
[11] Rytsarev I A, Kupriyanov A V, Kirsh D V and Liseckiy K S 2018 Clustering of social media
     content with the use of BigData technology Journal of Physics: Conference Series 1096(1)
     DOI: 10.1088/1742-6596/1096/1/01208
[12] Kropotov Y A, Proskuryakov A Y and Belov A A 2018 Method for forecasting changes in
     time series parameters in digital information management systems Computer Optics 42(6)
     1093-1100 DOI: 10.18287/2412-6179-2018-42-6-1093-1100

Acknowledgments
This work was financially supported by the Russian Foundation for Basic Research under grants
# 19-29-01135, # 18-37-00418, # 17-01-00972 and by the Ministry of Science and Higher Education
within the State assignment to the FSRC “Crystallography and Photonics” RAS
No. 007-GZ/Ch3363/26 (theoretical results).



