Integration Issues of Big Data Analysis on Social Networks

                    A V Ivaschenko1, N Yu Ilyasova1,2, A A Khorina1, V A Isayko1,
                    D N Krupin3, V A Bolotsky3, P V Sitnikov4


                    1 Samara National Research University, Moskovskoe Shosse 34А, Samara, Russia, 443086
                    2 Image Processing Systems Institute - Branch of the Federal Scientific Research Centre
                      “Crystallography and Photonics” of Russian Academy of Sciences, Molodogvardeyskaya str.
                      151, Samara, Russia, 443001
                    3 IPSI SEC “Open Code”, Yarmarochnaya Str. 55, Samara, Russia, 443001
                    4 ITMO University, Birzhevaya liniya 14 lit. A, Saint-Petersburg, Russia, 199034


                    Abstract. Social media has become one of the major providers of Big Data for the
                    analysis of users’ behaviour, focus, trends, and deviations. One user can be represented in
                    several social networks by different avatars, and users differ in the dynamics with which
                    they process and generate data. To deal with this, a software library for integration with a
                    number of social networks was developed and implemented. This paper describes the problem,
                    the solution architecture and the technical details of its implementation, supported by the
                    results of simulation and real data analysis for a number of popular social networks.


1. Introduction
How we respond to the advent of the Big Data era is crucial. Although the future of this phenomenon
remains uncertain, the growing automation of data collection and analysis increases the number of
algorithms that can extract and illustrate large-scale models of human behavior. How do systems
carry out this practice, and how do they regulate the flow of data?
    The market sees Big Data as a pure opportunity: marketers optimize their offers based on market
analysis, and Wall Street bankers process tons of information about the dynamics of changing rates.
Legislation has already been proposed to limit the collection and storage of data, as a rule on the
grounds of the inviolability of private life.
    In recent years, the amount of information generated by business, science and social networks
has been growing exponentially. This phenomenon is also known as a data stream.
    In business, Walmart's transaction databases are estimated to store more than 2.5 petabytes of
data, including information on customer behavior and preferences, data about network activity and
devices, and information on market trends.
    In science, for example, the Large Hadron Collider (LHC) at the European Organization for
Nuclear Research produced 13 petabytes of data in 2010.
    In addition, sensor, social network, mobile, subscriber and location data are growing at a
frenzied pace. Along with this growth in the volume of information, data are also becoming more
interconnected. Facebook, for example, is almost completely connected: 99.91% of the network
belongs to a single large connected component.


IV International Conference on "Information Technology and Nanotechnology" (ITNT-2018)
Data Science
A V Ivaschenko, N Yu Ilyasova, A A Khorina, V A Isayko, D N Krupin, V A Bolotsky and P V Sitnikov

    Modern social media can be treated as a major source of Big Data that describes the process of
users’ interaction and various information exchanges. Analysis of this data is a complicated
technical problem: it requires integrating with multiple social media for data import, associating
separate profiles of the same user in different networks, matching the facts of their interaction
with real events, and deriving basic trends and deviations.
    To solve this problem, a model of social media user behavior was developed, together with a
software solution based on it that provides capabilities for social network analysis and simulation.

2. State of the Art
New opportunities for interaction in virtual environments allow Internet users to exchange ideas
immediately. At the same time, everybody needs to receive and process many incoming events. Under
such informational pressure, individuals start prioritizing the most important data, filtering out
and rejecting everything that is not currently interesting. Such a focus on current interest instead
of importance leads to various imperfections, including constraints on creativity. This process can
be described by the modern principles of distributed simulation and decision-making support powered
by multi-agent technology [1].
    The virtual world of social media should be treated as a complex network of continuously running
and co-evolving intelligent agents. Such solutions are based on the holonic paradigm and the
bio-inspired approach [2], which requires the development of new methods and tools supporting
fundamental mechanisms of self-organization and evolution similar to those of living organisms
(colonies of ants, swarms of bees, etc.) [3]. Since human beings are represented by actors or
agents, a model of a social network user should consider a combination of human and time factors.
The interaction of customers and service providers through intermediary services generates, and can
be characterized by, a large number of events that form Big Data and require modern technologies
for their analysis [4].
    Modeling of Internet users’ behavior can be based on the modern principles of knowledge
representation in the form of ontologies [5]. These concepts allow formalizing self-organization and
semantics, which is advantageous for the abstract description of social concepts and their
interaction in technical applications [6]. Detection of uncharacteristic user behavior is discussed
in [7], and methods of user classification in [8].
    Also relevant in the context of this paper are works on Internet development strategies [9] and
studies of virtual communities and social networks [10-11]. Despite the successful application of
mathematical statistics to cluster and generalize users’ behavior, the problem of Big Data analysis
of social networks remains open, because user activity models need to be personalized and the
individual features of human behavior need to be understood.
    Our experience in the development of an integrated information space and the analysis of its
users’ behavior [12-15] can be used to build a software solution that derives basic trends in social
media and provides intelligent functionality for social media Big Data analysis. The proposed
abstract model and solution vision are given below.

3. Abstract Model
Let us represent a community of Internet users by $u_i$, where $i = 1..N_u$ and $N_u$ is the number
of users. The information exchange activity of users can be represented by posts, comments or
messages $p_j$, where $j = 1..N_w$ is the absolute number of an information object. Post generation
is an event
\[ g_{i,j} = \left( u_i, p_j, t^0_{i,j} \right). \tag{1} \]
    The issue or processing of an information object can be represented by an event $e_{i,j,k}$ that
is characterized by the combination of user, focus, and time:
\[ e_{i,j,k} = e\left( p_j, \left( u_i, f_{i,k}, t_{i,j,k} \right) \right) \in \{0,1\}, \tag{2} \]
where the focus $f_{i,k}$ represents the current user interest and can be described by a tag cloud,
which is a set of pairs:
\[ f_{i,k} = \left\{ \left( \tau_n, w_{n,k} \right) \right\}_{i,k}, \tag{3} \]
where $\tau_n$ is a tag (keyword) with weight $w_{n,k}$.
    The sequence of interdependent user focuses represents the evolution of the user’s interest.
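
    The objects of Eqs. (1)-(3) can be written down as plain data structures. The following is only a minimal sketch: the class and field names (`PostGenerated`, `PostProcessed`, `value`) and the choice of a dictionary for the tag cloud are assumptions made for illustration and are not taken from the described software.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# A tag cloud maps each tag (keyword) to its weight, as in Eq. (3).
TagCloud = Dict[str, float]
FrozenCloud = Tuple[Tuple[str, float], ...]


def snapshot(cloud: TagCloud) -> FrozenCloud:
    """Freeze a tag cloud so it can be stored inside an immutable event."""
    return tuple(sorted(cloud.items()))


@dataclass(frozen=True)
class PostGenerated:
    """Event g_{i,j} of Eq. (1): user i creates object j at time t0."""
    user_id: int
    post_id: int
    t0: float


@dataclass(frozen=True)
class PostProcessed:
    """Event e_{i,j,k} of Eq. (2): user i processes object j under focus k."""
    user_id: int
    post_id: int
    focus: FrozenCloud
    t: float
    value: bool  # Boolean value of the event (hypothetical field name)


focus = {"music": 0.7, "travel": 0.3}
g = PostGenerated(user_id=1, post_id=10, t0=0.0)
e = PostProcessed(user_id=2, post_id=10, focus=snapshot(focus), t=5.0, value=True)
```

    Storing a frozen snapshot of the focus in every event keeps the sequence of focuses available for later trend analysis.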
    Each user has their own ontology, which forms the basis of their perception. It changes over
time under the influence of learning and forgetting information (presented by posts, comments or
messages) and can be represented by a chain of contexts:
\[ c_{i,m} = \left\{ \left( \tau'_l, w'_{l,m} \right) \right\}_{i,m}. \tag{4} \]
    This change is correlated with the user focus. To be perceived positively, a focus cannot be
entirely new with respect to the context; at the same time, it must differ from the context in
order to excite interest.
    Considering this correlation, let us synchronize the context and focus changes:
\[ e'_{i,j,m} = e'\left( p_j, \left( u_i, c_{i,m}, t_{i,j,m} \right) \right) \in \{0,1\}. \tag{5} \]
    The events (2) and (5) are Boolean variables, which means that the appearance or perception of
a post, comment or message does not guarantee a change in focus or context.
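
    The requirement that a focus be neither entirely new nor identical to the context can be illustrated with a similarity measure between two tag clouds. The sketch below uses cosine similarity; both the measure and the thresholds `low` and `high` are assumptions chosen for illustration, since the model itself does not prescribe them.

```python
import math
from typing import Dict

TagCloud = Dict[str, float]


def cosine_similarity(a: TagCloud, b: TagCloud) -> float:
    """Cosine of the angle between two tag clouds treated as sparse vectors."""
    dot = sum(weight * b.get(tag, 0.0) for tag, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def can_excite_interest(focus: TagCloud, context: TagCloud,
                        low: float = 0.2, high: float = 0.9) -> bool:
    """True if the focus overlaps the context without repeating it.

    The thresholds are hypothetical: below `low` the focus is treated as
    entirely new, above `high` as equal to the context.
    """
    return low <= cosine_similarity(focus, context) <= high


context = {"music": 1.0, "travel": 1.0}
familiar = {"music": 1.0, "travel": 1.0}   # equal to the context
alien = {"chess": 1.0}                     # entirely new
partly_new = {"music": 1.0, "chess": 1.0}  # overlaps the context
```

    With these inputs only `partly_new` passes the check, which matches the intuition behind the correlation of focus and context.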
    Events (2) and (5) can be used for analysis. One possible implementation is presented below.
Studying the trends of the user’s focus and context allows identifying the tendencies, variations
and iterations that form the patterns of user creativity.
    If a new informational proposal remains suspended and has no effect on the user focus, the user
sees no interest in it. Possible reasons are related to context: additional education is needed to
provoke such interest. On the other hand, many changes of the user context indicate a search for a
stable interest, which should be proposed to the user at a certain time.
    Context and focus can also be influenced by negative intervention. To manage the user focus, a
series of repeated affections can be generated that partially cover the actual context and the
targeted interest. Such patterns can be identified by applying cross-correlation analysis to the
proposed model, which helps to identify and resist negative informational influence.
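
    As an illustration of the cross-correlation analysis mentioned above, the sketch below correlates a series of injected posts with the weight of the targeted tag in a user focus and finds the lag at which they match best. The function names and the synthetic series are assumptions; the normalized (Pearson) form of cross-correlation is one common choice and is not necessarily the one used by the authors.

```python
from statistics import mean, pstdev
from typing import Sequence


def cross_correlation(x: Sequence[float], y: Sequence[float], lag: int) -> float:
    """Pearson correlation between x[t] and y[t + lag] for a non-negative lag."""
    if lag < 0:
        raise ValueError("lag must be non-negative")
    n = min(len(x), len(y) - lag)
    if n < 2:
        return 0.0
    xs, ys = list(x[:n]), list(y[lag:lag + n])
    sx, sy = pstdev(xs), pstdev(ys)
    if sx == 0 or sy == 0:
        return 0.0  # a constant series carries no correlation information
    mx, my = mean(xs), mean(ys)
    covariance = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / n
    return covariance / (sx * sy)


def best_lag(x: Sequence[float], y: Sequence[float], max_lag: int) -> int:
    """Lag in [0, max_lag] at which y follows x most closely."""
    return max(range(max_lag + 1), key=lambda lag: cross_correlation(x, y, lag))


# Synthetic data: repeated posts with a target tag, and the weight of that
# tag in the user focus, which reacts two time steps later.
injected = [0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0]
focus_weight = [0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0]
lag = best_lag(injected, focus_weight, max_lag=3)
```

    A pronounced correlation peak at a stable lag suggests that the focus changes follow the injected series rather than the user's own context.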

4. Solution Architecture
The proposed approach is based on the simulation of focus and context. It was implemented in a
multi-agent architecture, which is presented in Figure 1.
    Within the proposed architecture we provide a profile descriptor, a post generator and a
navigator. These are methods used to simulate the real activity of users in social networks. The
post generator is used to create posts according to predefined logic.
    The navigator is used to process incoming data. This data is described by a network and can be
presented as a sorted graph where the nodes are information objects (for example web sites,
documents, posts and comments) and the links are references between these objects.
    Each object can refer to several other objects and documents, and the navigator decides which
link to follow according to its predefined logic.
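
    The navigator can be sketched as a walk over the graph of information objects. The paper does not specify the predefined logic, so the rule used here (follow the unvisited link whose tag cloud is most relevant to the current focus) and all names in the example are assumptions.

```python
from typing import Dict, List, Optional, Set

TagCloud = Dict[str, float]


def relevance(focus: TagCloud, tags: TagCloud) -> float:
    """Weighted overlap between the agent focus and the tags of an object."""
    return sum(weight * tags.get(tag, 0.0) for tag, weight in focus.items())


def choose_next(current: str, links: Dict[str, List[str]],
                tags: Dict[str, TagCloud], focus: TagCloud,
                visited: Set[str]) -> Optional[str]:
    """Pick the most relevant unvisited object referenced by `current`."""
    candidates = [node for node in links.get(current, []) if node not in visited]
    if not candidates:
        return None
    return max(candidates, key=lambda node: relevance(focus, tags[node]))


def walk(start: str, links: Dict[str, List[str]], tags: Dict[str, TagCloud],
         focus: TagCloud, max_steps: int = 10) -> List[str]:
    """Follow links from `start` until no unvisited link is left."""
    path, visited, current = [start], {start}, start
    for _ in range(max_steps):
        nxt = choose_next(current, links, tags, focus, visited)
        if nxt is None:
            break
        path.append(nxt)
        visited.add(nxt)
        current = nxt
    return path


links = {"post1": ["post2", "site1"], "post2": ["doc1"], "site1": [], "doc1": []}
tags = {
    "post1": {"music": 0.5},
    "post2": {"travel": 1.0},
    "site1": {"music": 0.9},
    "doc1": {"travel": 0.4},
}
focus = {"music": 0.8, "travel": 0.2}
path = walk("post1", links, tags, focus)
```

    Here the agent prefers `site1` over `post2` because its focus is dominated by the tag `music`.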
    In addition to the navigator and post generator, the multi-agent architecture provides
informational frames that correspond to the focus and context concepts defined above. Focus is used
to represent the current interest of Internet users.
    Context is used to formalize the informational space in which the agent performs its negotiating
activity. Based on the provided model, an algorithm has been developed for social media Big Data
analysis.


                                    Figure 1. Solution multi-agent architecture.

   The model is used to formalize the social media user and integrate the analytical software with
various social media open for data import and analysis.
   This algorithm consists of two stages:
   1. Calculation of the sample frequency vector for all users and of the standard deviation vector
across users. Topics are selected and converted to pairs
\[ \left( T_i, \frac{1}{N_{i,p}} \right), \tag{6} \]
where $p$ is the hour of publication, $N_{i,p}$ is the number of users that posted on a certain
topic in the time period $p$, and $T_i$ is the identifier of a topic. The resulting pairs are then
processed to calculate the sum
\[ \left( T_i, \sum_{k=0}^{i} \frac{1}{N_{k,p}} \right) \tag{7} \]
and the standard deviation $\sigma_i$ for each pair. The obtained values are divided by the period.
   2. Calculation of the deviation metric for a particular user. Topics are selected and converted
to a key-value view. The data pairs are then processed and the topics with the same key are summed:
\[ \begin{cases} 0, & \Delta_i \le 3\sigma_i, \\ 1, & \Delta_i > 3\sigma_i. \end{cases} \tag{8} \]
    Next, the deviations of the particular user are computed and summed, the sum is divided by the
number of topics ($n$) for that user, and the result is written to a CSV file in the form of a
table with the user data and the deviation of this user.
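
    The two stages can be sketched in a simplified form. The code below is an interpretation rather than the described implementation: it replaces the hourly $1/N_{i,p}$ weighting of Eqs. (6)-(7) with plain per-user post counts per topic, takes $\Delta_i$ to be the absolute difference between the user's count and the mean count, and applies the three-sigma rule of Eq. (8). All function names and the input data are hypothetical.

```python
import csv
import io
from collections import Counter
from statistics import mean, pstdev
from typing import Dict, List, Tuple

Post = Tuple[str, str]  # (user, topic)


def topic_statistics(posts: List[Post]) -> Dict[str, Tuple[float, float]]:
    """Stage 1: mean and standard deviation of per-user post counts per topic."""
    counts = Counter(posts)  # (user, topic) -> number of posts
    users = {user for user, _ in counts}
    topics = {topic for _, topic in counts}
    stats = {}
    for topic in topics:
        per_user = [counts.get((user, topic), 0) for user in users]
        stats[topic] = (mean(per_user), pstdev(per_user))
    return stats


def user_deviation(user: str, posts: List[Post],
                   stats: Dict[str, Tuple[float, float]]) -> float:
    """Stage 2: share of the user's topics flagged by the 3-sigma rule, Eq. (8)."""
    own = Counter(topic for u, topic in posts if u == user)
    if not own:
        return 0.0
    flags = [1 if abs(n - stats[topic][0]) > 3 * stats[topic][1] else 0
             for topic, n in own.items()]
    return sum(flags) / len(flags)


def deviations_csv(posts: List[Post]) -> str:
    """Build the resulting table (user, deviation) as CSV text."""
    stats = topic_statistics(posts)
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["user", "deviation"])
    for user in sorted({u for u, _ in posts}):
        writer.writerow([user, f"{user_deviation(user, posts, stats):.2f}"])
    return out.getvalue()


# Synthetic data: twelve regular users post twice on a topic, one bot posts 40 times.
posts = [(f"u{i}", "news") for i in range(12) for _ in range(2)]
posts += [("bot", "news")] * 40
table = deviations_csv(posts)
```

    In this synthetic example only the bot exceeds the $3\sigma_i$ threshold, so its deviation metric is 1 while that of the regular users is 0.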
    One of the main features of Internet users’ online activity that should be considered in the
explored scope is the mutual influence of the contexts and focuses of communicating peers. This
factor makes it possible to introduce a control loop: in addition to semantic analysis of web
content, the platform starts to manage the users’ interest based on focus identification and
context feedback.


    This information is collected in social networks and has all the details necessary to obtain
up-to-date estimates. Still, in this case integration with social networks is required, and the
data being processed contain many subjective assessments and perceptions.
    Online libraries and professional communities are more neutral. For example, Wikipedia requires
various groups of authors to update articles with maximum objectivity as the target. Analysis of
this data can help to adequately identify significant trends in consumers’ focus, which can be used
in practice, e.g. for marketing and product placement.
    An activation method is used to simulate multi-agent activity in real time. A special dispatcher
agent calls all the agents through this activation method; each time it is activated, an agent
generates a time series period according to some distribution rule. In this way the agent generates
a time series of calls to the navigation and post generation methods.
    At these stages, navigation and generation are arranged in such a way that we can model post
writers or readers and introduce specific patterns of online activity: for example, an agent can be
more active at night, or some time frame can be used for high or low activity.
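
    The activation scheme can be illustrated by a small discrete-event simulation. The sketch assumes an exponential distribution of the time between activations and a fixed night interval from 22:00 to 06:00; neither is specified in the text, and the names `Agent`, `night_factor` and `post_probability` are hypothetical.

```python
import heapq
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Agent:
    name: str
    base_rate: float               # mean activations per hour in the daytime
    night_factor: float = 1.0      # rate multiplier between 22:00 and 06:00
    post_probability: float = 0.3  # chance that an activation produces a post

    def rate(self, t: float) -> float:
        hour = t % 24
        is_night = hour >= 22 or hour < 6
        return self.base_rate * (self.night_factor if is_night else 1.0)


def simulate(agents: List[Agent], horizon: float,
             seed: int = 0) -> List[Tuple[float, str, str]]:
    """Dispatcher loop: always activate the agent with the earliest pending time."""
    rng = random.Random(seed)
    queue = [(rng.expovariate(a.rate(0.0)), i) for i, a in enumerate(agents)]
    heapq.heapify(queue)
    log = []
    while queue:
        t, i = heapq.heappop(queue)
        if t >= horizon:
            continue  # this agent has no more activations within the horizon
        agent = agents[i]
        action = "post" if rng.random() < agent.post_probability else "navigate"
        log.append((t, agent.name, action))
        # The agent schedules its own next activation from its distribution rule.
        heapq.heappush(queue, (t + rng.expovariate(agent.rate(t)), i))
    return log


agents = [Agent("reader", base_rate=0.5, post_probability=0.05),
          Agent("night_writer", base_rate=0.2, night_factor=5.0, post_probability=0.6)]
log = simulate(agents, horizon=72.0, seed=42)
```

    The rate is evaluated at the moment of activation, which is a simplification: an interval that starts in the daytime and ends at night is drawn with the daytime rate.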
    Focus and contexts are updated as a result of real agent behavior under the influence of
information objects. We can generate focus and contexts according to our goals: if we want an agent
to behave in a specific way, we introduce this control directly and formulate the focus and context
so that the agent does what is wanted.
    This approach allows simulating this influence and is introduced in the system. In this case we
need to analyze the focus and context changes of the agent over a period of time. On the basis of
this analysis we introduce changes in focus by generating information objects inside the network.
In real systems this can be done using context-based advertisements: we generate only objects with
a certain informational context, which can be described by tag clouds.
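
    How generated information objects may shift a focus can be shown with a simple update rule. The exponential blending used below, with a hypothetical `learning_rate`, is an assumption standing in for the learning and forgetting mechanism; the paper does not give the actual update formula.

```python
from typing import Dict

TagCloud = Dict[str, float]


def update_focus(focus: TagCloud, object_tags: TagCloud,
                 learning_rate: float = 0.2) -> TagCloud:
    """Blend the focus with the tag cloud of a perceived information object.

    Old weights decay by (1 - learning_rate), which models forgetting, and
    the tags of the new object are reinforced, which models learning.
    """
    all_tags = set(focus) | set(object_tags)
    return {
        tag: (1 - learning_rate) * focus.get(tag, 0.0)
        + learning_rate * object_tags.get(tag, 0.0)
        for tag in all_tags
    }


focus = {"music": 1.0}
advertisement = {"travel": 1.0}  # generated object with the targeted context

# Five repeated exposures to the same generated object.
for _ in range(5):
    focus = update_focus(focus, advertisement)

dominant = max(focus, key=focus.get)
```

    Under these assumptions the weight of `music` decays to 0.8^5 and `travel` becomes the dominant tag, which mirrors the effect of repeated context-based advertisements.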
    The introduced architecture can be used to simulate online users in social networks and to model
realistic Internet behavior. In the area of simulation, a practical application is the generation
of cognitive patterns of collective behavior based on self-organization. In this case the agent
should be simple, and the logic of focus and context should be close to very simple but generic
behavior. This logic can correspond to known real users of social networks, but it can also
represent some generalized behavior of the community of agents. Such behavior can be used to study
and develop visual cases.
    In another case it is implemented as a kind of frame through which algorithms of syntactical
analysis or other large-scale data analysis can look at the real world of social networks and
filter the data for intelligent study.

5. Implementation
To implement the proposed approach, a software solution was developed for social media focus
identification based on knowledge discovery and Big Data analysis.
    The solution can integrate with various data sources, pick out concepts, generate tag clouds for
contexts and focuses, and process their changes over time. The implementation architecture of the
solution is presented in Fig. 2. The data imported from social networks is stored in a database and
can be processed either in real time or in batch mode.


                                               Figure 2. Integration model.


    The crawler asynchronously sends requests for social network data to a web service. After
receiving a request, the web service starts processing it and accesses the integrator, which starts
downloading the requested data in the form of RDF/XML files. The integrator stores the intermediate
data received for a single crawler request, so that the data already downloaded can be transferred
to the crawler as a single block.
    Then, in the background, i.e. in a mode where there is no need to control the data loading
process, the integrator automatically continues the process, uploads the data to the database and
uses Apache Jena to generate the RDF/XML files that will be transferred to the address of the
requesting crawler.
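
    The interaction of the crawler and the integrator can be sketched with Python's `asyncio`. This is only an illustration of the described flow under several assumptions: the social network and the database are replaced by in-memory lists, the RDF/XML is produced with `xml.etree.ElementTree` instead of Apache Jena, and the vocabulary `http://example.org/social#` as well as all class and function names are hypothetical.

```python
import asyncio
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional, Tuple

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EX_NS = "http://example.org/social#"  # hypothetical vocabulary, not from the paper
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("ex", EX_NS)

Post = Dict[str, str]


def to_rdf_xml(posts: List[Post]) -> str:
    """Serialize posts as a minimal RDF/XML document."""
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    for post in posts:
        node = ET.SubElement(root, f"{{{RDF_NS}}}Description",
                             {f"{{{RDF_NS}}}about": EX_NS + post["id"]})
        ET.SubElement(node, f"{{{EX_NS}}}text").text = post["text"]
    return ET.tostring(root, encoding="unicode")


class Integrator:
    def __init__(self, source: List[Post], block_size: int = 2) -> None:
        self.source = source            # stands in for a social network API
        self.block_size = block_size
        self.database: List[Post] = []  # stands in for the real database
        self.background: Optional[asyncio.Task] = None

    async def _download(self, start: int) -> List[Post]:
        await asyncio.sleep(0)  # yield control, as a real network call would
        return self.source[start:start + self.block_size]

    async def _load_rest(self, start: int) -> None:
        while start < len(self.source):
            self.database.extend(await self._download(start))
            start += self.block_size

    async def request(self) -> str:
        """Return the first block at once and keep loading in the background."""
        first = await self._download(0)
        self.database.extend(first)
        self.background = asyncio.create_task(self._load_rest(self.block_size))
        return to_rdf_xml(first)


async def crawler(source: List[Post]) -> Tuple[str, str]:
    integrator = Integrator(source)
    first_block = await integrator.request()
    await integrator.background  # wait until the background loading is done
    return first_block, to_rdf_xml(integrator.database)


posts = [{"id": f"p{n}", "text": f"post {n}"} for n in range(1, 6)]
first_block, full_export = asyncio.run(crawler(posts))
```

    The crawler receives the first block without waiting for the whole download, while the remaining data reach the database in the background.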
    The described model, the software solution and its implementation were tested using a typical
data set derived from a number of social networks. In addition to a regular result set of real
social media users’ negotiations, a peak batch of posts generated by an online bot was introduced.
    Independently of the social media involved (with no a priori knowledge of the data structure),
the Big Data analysis algorithms were able to identify the influence of the online bot. The results
are presented in Fig. 3. Gray lines represent the annual trends of users’ activity. The peak on
Aug 15 corresponds to the bot activity and can easily be identified by the agent by comparison with
the behavior in previous periods. The described research results show that the proposed model can
be used for online behavior analysis and identification of negative informational influence.


                                          Figure 3. Bot activity identification.

    The data of 32,000 users and their posts for the period 2014-2017 were processed. To simulate
the intervention, 50 bot users were modeled that automatically perform actions through interfaces
intended for people. The statistics show the distribution of posts over the considered time for
each year. The horizontal axis is the time axis and contains the values of t recalculated in step 6
of the above-mentioned algorithm.
    Each line has two similar peaks at the beginning of the year (a detailed analysis showed that
these outliers fall on holidays), and the lines are similar to each other throughout the rest of
the time. The curve of 2016, however, has an unusual outburst (see 8/15), which indicates the
appearance of unusual user behavior.
    This example illustrates that the model and the statistically derived patterns of users’
creativity can be used to identify negative deviations and attempts to exert influence through
repeated affections.
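
    The comparison of one year with the other years can be sketched as follows. The numbers are synthetic and do not reproduce the data of Fig. 3; the function name, the threshold factor `k` and the lower bound `min_sigma` are assumptions made for the example.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def unusual_points(target: Sequence[float], baselines: List[Sequence[float]],
                   k: float = 3.0, min_sigma: float = 1.0) -> List[int]:
    """Indices where `target` exceeds the baseline mean by more than k sigma.

    `min_sigma` is a lower bound on the standard deviation, so that points
    where the baseline years happen to coincide are not flagged needlessly.
    """
    flagged = []
    for index, value in enumerate(target):
        history = [year[index] for year in baselines]
        sigma = max(pstdev(history), min_sigma)
        if value > mean(history) + k * sigma:
            flagged.append(index)
    return flagged


# Synthetic activity curves: every year starts with a holiday peak,
# and only the 2016 curve contains an additional burst at index 5.
y2014 = [100, 40, 42, 41, 39, 40, 41, 40]
y2015 = [104, 41, 40, 43, 40, 42, 39, 41]
y2017 = [98, 39, 41, 40, 41, 41, 40, 42]
y2016 = [101, 40, 41, 42, 40, 95, 40, 41]

peaks = unusual_points(y2016, [y2014, y2015, y2017])
```

    The holiday peak at index 0 is present in every year and is therefore not flagged, while the burst at index 5 stands out against the other years.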

6. Conclusion
As shown above, the proposed model allows capturing the process of Internet users’ activity,
considering a combination of human and time factors.

7. References
[1] Wooldridge M 2002 An introduction to multi-agent systems (Chichester: John Wiley and Sons)
      p 340

[2] Leitao P 2009 Holonic rationale and self-organization on design of complex evolvable systems
      HoloMAS LNAI 5696 1-12
[3] Gorodetskii V 2012 Self-organization and multiagent systems: I. Models of multiagent self-
      organization Journal of Computer and Systems Sciences International 51(2) 256-281
[4] Bessis N and Dobre C 2014 Big Data and Internet of Things: A roadmap for smart
      environments (Berlin: Springer) p 450
[5] Mouromtsev D, Pavlov D, Emelyanov Y, Morozov A, Razdyakonov D and Galkin M 2015 The
      simple, web-based tool for visualization and sharing of semantic data and ontologies CEUR
      Workshop Proceedings 1486 77
[6] One Internet. Global commission on Internet Governance 2016 (Access mode:
      https://www.cigionline.org/initiatives/global-commission-internet-governance) (01.11.2017)
[7] Shatalin R, Fidelman V and Ovchinnikov P 2017 Abnormal behavior detection method for
      video surveillance applications Computer Optics 41(1) 37-45 DOI: 10.18287/2412-6179-2017-
      41-1-37-45
[8] Rybintsev A, Konushin V and Konushin A 2015 Consecutive gender and age classification
      from facial images based on ranked local binary patterns Computer Optics 39(5) 762-769 DOI:
      10.18287/0134-2452-2015-39-5-762-769
[9] Balakrishnan H and Deo N 2006 Discovering communities in complex networks Proceedings of
      the 44th Annual Southeast Regional Conference 280-285
[10] Wei W, Joseph K, Liu H and Carley K 2016 Exploring Characteristics of Suspended Users and
      Network Stability on Twitter Social Network Analysis and Mining 6-51
[11] Kadushin C 2012 Understanding social networks: theories, concepts, and findings (Oxford:
      Oxford University Press) p 264
[12] Ivaschenko A 2014 Multi-agent solution for business processes management of 5PL
      transportation provider Lecture Notes in Business Information Processing 170 110-120
[13] Ivaschenko A, Minaev A and Spodobaev M 2015 Self-mediator software for sensor networks
      Proceedings of the 2015 International Siberian Conference on Control and Communications
      (SIBCON) 1-4
[14] Ivaschenko A, Lednev A, Diyazitdinova A and Sitnikov P 2016 Agent-based outsourcing
      solution for agency service management Lecture Notes in Networks and Systems 16 204-215
[15] Protsenko V, Kazanskiy N and Serafimovich P 2015 Real-time analysis of parameters of
      multiple object detection systems Computer Optics 39(4) 582-591 DOI: 10.18287/0134-2452-
      2015-39-4-582-591

