Research and analysis of messages of users of social networks
using BigData technology

                I A Rytsarev1,2, A V Kupriyanov1,2, D V Kirsh1,2 and R A Paringer1,2


                1
                 Samara National Research University, Moskovskoe Shosse 34А, Samara, Russia, 443086
                2
                 Image Processing Systems Institute of RAS - Branch of the FSRC "Crystallography and
                Photonics" RAS, Molodogvardejskaya street 151, Samara, Russia, 443001


                e-mail: rycarev@gmail.com


                Abstract. In this paper is dedicated to the World Cup held in the city of Samara from June 15
                to July 15, 2018. As part of the work, a multithreaded collection in real time was organized,
                filtering and processing messages from users of the social network Twitter within the host city
                and its surroundings from May 15 to August 15, 2018. Then, a study was conducted of the
                texts of user messages on the subject of the popularity of topics and the construction of a “word
                cloud”. The second study was the construction of a diagram of the dynamics of the number of
                messages in different languages. As part of the work, modules for collecting, filtering and
                processing data using BigData technology were implemented.


1. Introduction
Currently, social networks are booming: every day their users generate hundreds of terabytes of media
content: images and video. The analysis of such content is of great importance for many areas of
business. For example, it is impossible to overestimate the impact of Internet marketing on the
promotion of goods and services. However, clear understanding of user requests is essential to use
these mechanisms effectively. The source of such information can be the materials published by users
of social networks, as well as the shares and reposts by users and the entire communities. But in the
period of any major events the population of online communities can vary greatly. In this paper, a
comparison is made between the flow of messages before the World Cup, during and after it.
       The task considered in the framework of this work is undoubtedly an urgent task, the solution of
which is also of great scientific importance in the field of data analysis. In the article [1], a large
dataset of geotagged tweets containing certain keywords relating to climate change is analyzed using
volume analysis and text mining techniques such as topic modeling and sentiment analysis. In the
article [2], the local and global term frequencies are computed through a bag-of-words (BOW) model.
To remove the negative impact of high dimensionality on the global term weighting, the principal
component analysis is adopted; thereafter the fuzzy c-means algorithm is employed to retrieve the
semantically relevant topics from the documents. In the article [3], examine the long-term relationship
between signals derived from nine years of unstructured social media microblog text data and financial
market developments in five major economic regions. Employing statistical language modeling
techniques we construct directional sentiment metrics. In the article [4], the authors propose a


                    V International Conference on "Information Technology and Nanotechnology" (ITNT-2019)
Data Science
I A Rytsarev, A V Kupriyanov, D V Kirsh and R A Paringer


background clustering technology for discussion. Compared with the traditional methods, background
future clustering keeps the constrains caused by data sparseness and spatio-temporal dependence off,
and can be used for unpredictable activities discovery.In the article [5], the method of applying cross-
references was considered to improve the accuracy of providing dictionaries in the task of calculating
distributions between social communities based on text messages. In the article [6], the technology of
processing large-scale text data on data collected from a social network was tested. The article [7]
proposed a mathematical model for calculating the activity of users of social networks. Article [8]
proposed a technology for normalizing text data. To capture the contextual meaning of tokens, authors
create a neural word embeddings using word2vec trained on over a million social media messages
representing a mix of domains and degrees of linguistic deviations.

2. Social network data collection
The Twitter social network was selected as a data source for this study. The reasons for this choice are
as follows:
    • the network provides open access to its data (no restrictions on accessing the server data);
    • Twitter is the second most popular social network (after Facebook, which does not provide open
       access to its data) among users all over the world;
    • Twitter is not a specialized network, which means it reflects the public opinion of a wider range
       of users [9].
    The data collection from the Twitter social network can be carried out using the software products
Apache Ambari and Flume, this method is described in more detail in [10]. However, it is often more
convenient to develop a dedicated software product using standard libraries (twitter4j, tweepy, etc.) to
collect the data using a number of filters [11, 12].
    As part of this study, a Python software package was developed, containing an authorization
module, a data collection module, and a filtration module. This software package allows to collect data
by geolocation, by keywords, by user. The Twitter social network has a restriction in the form of a
message limit that a client can receive during real-time monitoring. According to the documentation,
this limit is 60 messages per second (this is about 1% of the average rate of tweets). A network of
computers located in different cities was set up and cloud services were involved in order to avoid
interruptions in the operation of the software complex, and to minimize message loss. Multiple unique
authorization keys have been implemented in each copy. The designed software complex operates in
real-time monitoring mode, and can make requests to receive information located on servers.
    The geolocation filtering parameters were the coordinates of the city of Samara (the host city of the
World Cup) in the form of an extended geobox (48.9700523344,52.7652295668,
50.7251182524,53.6648329274), which includes not only the city of Samara, but also the city of
Togliatti (the training base of football players and the city where the tourists lived), airport Kurumoch
and the settlements nearby the city of Samara.
    More than 1,200,000 user messages were collected during the operation of the distributed network
of the software complex nodes.

3. Analysis of the collected data using the BigData technology
The merging of the collected data, data processing and analysis using traditional approaches requires
huge computational resources and takes a long time. For this reason, it was decided to use the BigData
technology and the computing cluster for processing extra-large data available at the Samara
University.
    First of all, the data collected had to be merged. For this purpose, a data merging module was
implemented using MapReduce technology. As a result of the module operation, we received more
than 170,000 unique user messages.
    The second task was the primary data processing. Streaming data obtained from social networks
contains a lot of service information. Only the relevant data is important for further analysis; therefore,
it is necessary to separate the service information from the relevant data. For this purpose, a json-
response processing module has been implemented. This module uses the MapReduce technology for
data structuring by way of arranging the data and excluding non-relevant and service data.

V International Conference on "Information Technology and Nanotechnology" (ITNT-2019)                   505
Data Science
I A Rytsarev, A V Kupriyanov, D V Kirsh and R A Paringer


   The third task was to analyze the data collected. The first study was the construction of a “tag
cloud” for each of the three months separately. The results of the study are provided in Fig. 1, 2 and 3
respectively.


                             Figure 1. “Tag Cloud” for the period 15.05-14.06, 2018.


                             Figure 2. “Tag Cloud” for the period 15.06-14.07, 2018


                            Figure 3. “Tag Cloud” for the period 15.07-14.08, 2018.

                    0,2

                    0,1

                      0
                             Week 1       Week 2           Week 3     Week 4      Week 5   Week 6

                       English           Spanish             Ukrainian       Portuguese    Turkish
                       Indonesian        French              Arabic          Bengali       Japanese
              Figure 4. Distribution of messages by language for the period 11.06-22.07.2018.


V International Conference on "Information Technology and Nanotechnology" (ITNT-2019)                 506
Data Science
I A Rytsarev, A V Kupriyanov, D V Kirsh and R A Paringer


                                                    0,08
                                                    0,07


                        Count of posts per day, %
                                                    0,06
                                                    0,05
                                                    0,04
                                                    0,03
                                                    0,02
                                                    0,01
                                                       0


                                                                                Date

      Figure 5. The distribution of the count of messages by day for the period 11.06-24.07 2018.


                                                       Figure 6. World Cup World Cup schedule 2018.

V International Conference on "Information Technology and Nanotechnology" (ITNT-2019)                 507
Data Science
I A Rytsarev, A V Kupriyanov, D V Kirsh and R A Paringer


    It can be seen in Figures 1, 2 and 3, that the filling of the “clouds” changed dramatically with the
beginning of the World Cup in the Samara Region. Taking into account the results of the previous
study, the decision was taken to look at the dynamics of changes in the number of messages in
different languages in the next study. The analysis of the language of writing a message was carried
out on the basis of data provided by the Twitter social network in json-response. A 7-day period was
selected as an analysis period. The results are provided in Fig. 4.
    As it can be seen from Fig. 4, the number of messages in the languages other than Russian varied in
accordance with the football games held in the city of Samara. It started to increase a week before the
beginning of the tournament, then the number of messages remained at the same level throughout the
tournament and then dropped to the values close to zero due to the departure of delegations.
    Additionally, we construct a graph of user activity by day (Figure 5) and relate it to the schedule of
games (Figure 6).
    As can be seen on the graph, the peak of user activity fell on the days of the games at the Samara
Arena stadium. On the days of the games at other stadiums, user activity was lower than on the days of
matches in Samara. On the other days, the activity did not exceed 0.01 percent (the exception was the
match days for the third place and the final). The peak of activity came on 07.07.18 when the matches
at the Samara Arena and Russia - Croatia took place. After the end of the World Cup, user activity has
declined sharply.

4. Conclusion
In this paper, a study was conducted of the activity of the users of the social network Twitter of the
Samara region, as well as the activity of the guests of the 2018 World Cup who came to support the
national teams in the city of Samara. The study showed that a major event can drastically change the
main subjects of messages and dictionaries of frequently used words in social networks. From this it
follows that when analyzing social network data in the period of any major events, it is necessary to
apply methods of reactive data analysis, as well as take into account user profile information for
correct data processing (collect separate statistics, since it will be completely different from statistical
data which was collected before the event).

5. References
[1] Dahal B, Kumar S A P and Li Z 2019 Topic modeling and sentiment analysis of global climate
      change tweets Social Network Analysis and Mining 9(1) 24
[2] Rashid J, Shah S M A and Irtaza A 2019 Fuzzy topic modeling approach for text mining over
      short text Information Processing & Management 56(6) 102060
[3] Groß-Klußmann A, König S and Ebner M 2019 Buzzwords build Momentum: Global Financial
      Twitter Sentiment and the Aggregate Stock Market Expert Systems with Applications
[4] Zhu C, Du J 2018 Background feature clustering and its application to social text Information
      Processing Letters 136 44-48
[5] Rytsarev I A, Kupriyanov A V, Kirsh D V and Liseckiy K S 2018 Clustering of social media
      content with the use of BigData technology Journal of Physics: Conference Series 1096(1)
      012085.
[6] Blagov A, Rytcarev I, Strelkov K and Khotilin M 2015 Big Data Instruments for Social Media
      Analysis Proceedings of the 5th International Workshop on Computer Science and Engineering
      179-184
[7] Rytsarev I, Blagov A 2017 Creating the Model of the Activity of Social Network Twitter Users
      Journal of Telecommunication, Electronic and Computer Engineering (JTEC) 9(1-3) 27-30
[8] Kusumawardani R P, Priansya S and Atletiko F J 2018 Context-sensitive normalization of
      social media text in bahasa Indonesia based on neural word embeddings Procedia computer
      science 144 105-117
[9] Rytsarev I A, Blagov A V 2017 Development and research of algorithms for clustering data of
      super-large volume CEUR Workshop Proceedings 1903 80-83


V International Conference on "Information Technology and Nanotechnology" (ITNT-2019)                    508
Data Science
I A Rytsarev, A V Kupriyanov, D V Kirsh and R A Paringer


[10] Mikhaylov D V, Kozlov A P and Emelyanov G M 2016 Extraction of knowledge and relevant
     linguistic means with efficiency estimation for the formation of subject-oriented text sets
     Computer Optics 40(4) 572-582 DOI: 10.18287/2412-6179-2016-40-4-572-582
[11] Rytsarev I A, Kirsh D V and Kupriyanov A V 2018 Clustering of media content from social
     networks using BigData technology Computer Optics 42(5) 921-927 DOI: 10.18287/2412-
     6179-2018-42-5-921-927.
[12] Kropotov Y A, Proskuryakov A Y and Belov A A 2018 Method for forecasting changes in time
     series parameters in digital information management systems Computer Optics 42(6) 1093-1100
     DOI: 10.18287/2412-6179-2018-42-6-1093-1100

Acknowledgments
This work was financially supported by the Russian Foundation for Basic Research under grant # 19-
29-01135, # 18-37-00418, # 17-01-00972 and by the Ministry of Science and Higher Education within
the State assignment to the FSRC “Crystallography and Photonics” RAS No. 007-GZ/Ch3363/26
(theoretical results).


V International Conference on "Information Technology and Nanotechnology" (ITNT-2019)          509