=Paper= {{Paper |id=Vol-1903/paper19 |storemode=property |title=The analysis of profiles on social networks |pdfUrl=https://ceur-ws.org/Vol-1903/paper19.pdf |volume=Vol-1903 |authors=Vladislav A. Bakayev,Alexander V. Blagov }} ==The analysis of profiles on social networks == https://ceur-ws.org/Vol-1903/paper19.pdf
                                The analysis of profiles on social networks
                                                     V.A. Bakayev1, A.V. Blagov1
                                  1
                                   Samara National Research University, 34 Moskovskoe Shosse, 443086, Samara, Russia


Abstract

The article is devoted to the analysis of data and communications on various social networks. The method of search of the profiles belonging
to the same users, based on the analysis of the communications and communities which are available for a profile is offered. The program
complex realizing this method is created.

Keywords: data mining; social networks; graphs; Label Propagation algorithm; Apache Spark


1. Introduction

    Now one of the most urgent directions in information technologies is the analysis of data or Data Mining. The analysis of data
represents process of detection of data, suitable for use, in large data sets, often diverse. Usually such data can't be found at
traditional viewing and search as communications are too difficult, or because of the excessive volume.
    On social networks big data flows are generated (profiles, communications, content are created). Analyzing these data it is
possible to obtain a lot of useful information as on various groups, communities and discussions, and on each user separately [1-
4].
    The great interest in social networks is shown by various commercial organizations using them as the instrument of
interaction with audience. Applying specialized services, the companies analyze information on users, their activities and
personalize offers for separately taken segments of the target audience, thereby increasing conversion and reducing costs of
advertizing campaign.
    In the article the method of increase in efficiency of this sort of tools and services which is based on psychology and patterns
of human behavior is offered.
    The offered method is based on the following facts:
    ● many Internet users have accounts on several popular social networks at once (VKontakte, Facebook, Instagram and
Twitter);
    ● many users of social networks hide information on themselves from strangers (including information on existence of
accounts in other social networks);
    ● as social networks are a subject of socialization of people, for each user it is possible to allocate at least one community of
people it that users of it community are in pairs familiar with each other;
    ● the person has accounts in different social networks and contacts to the same people.
    Is developed the program complex which for realization of a method:
    a) analyzes all profiles of target social networks and on the basis of public data finds the accounts belonging to the owner of
an initial profile;
    b) for users on whose pages there is no information on existence at them of accounts in other social networks finds
communications with other social networks on the basis of communities in which the user consists.

2. The object of the study (Model, Process, Device, Sample preparation etc.)

   At the first stage the system analyzes all users of social networks VKontakte, Twitter and Instagram and groups them in the
following rules:
   ● in each group there are no more than one profile from each social network;
   ● all profiles in one group belong to one person.
   This problem is solved by means of a program framework of Apache Spark (in particular, superstructures of Spark Streaming
intended for stream data processing see fig. 1) and the broker of messages RabbitMQ realizing delivery of basic data in Spark
Streaming [5].




                                          Fig. 1. Architecture of the aggregator of users of social networks (pipeline).


3rd International conference “Information Technology and Nanotechnology 2017”                                                          88
                                                          Data Science / V.A. Bakayev, A.V. Blagov

   Description of steps:
   1. Adding of data from different sources in queue for later processing. Data represent a set of couples (network_id, user_id)
containing information on profiles which are required to be analyzed.
   2. RDD (Resilient Distributed Dataset) formation by a packing of the basic data which are in queue for increase in
productivity.
   3. RDD (mapping) conversion. For each couple (network_id, user_id) the algorithm finds and groups profiles on other social
networks, and also additional information on the person to whom belongs the initial account. The algorithm is restarted for each
found profile until all available information on the user is found. As sources can be: the public information specified on the page
of the user (the status, contact information, entries in the film, etc.).
   4. Export of data retrieveds from RDD in queue for the subsequent saving.
   5. Saving results in NoSQL to the MongoDB database in the form of documents with structure, the reflected in table 1.
   The speed of data processing makes about 120-130 profiles a second. For work the Microsoft Azure A2 v2 virtual computer
was used (2 kernels, 4 GB of RAM, 20 GB of SSD). Casual users of social network VKontakte (1.000.000 profiles) were
analyzed.
   Thus, if to assume that speeds of processing of profiles of VKontakte, Instagram and Twitter are equal, we will receive an
approximate assessment of time which will be required for the analysis of all users of target social networks:
                                               4 ∗ 108 + 6 ∗ 108 + 13 ∗ 108
                               𝑇(𝑛) =                        ≈ 5324 ℎ𝑜𝑢𝑟𝑠 ≈ 221 𝑑𝑎𝑦,                                          (1)
                                                 120𝑛
   where n - the number of servers in a cluster with a similar configuration.
   In case of horizontal scaling of a cluster the linear dependence between the number of servers and processing rate of profiles
is watched.
   Example of the reference to the table: results of an experiment are reflected in table 1.
                               Table 1. Data storage structure of profiles.
                                Name of the field     Type            Description
                                _id                    ObjectId         the document identifier in a collection
                                vk_id                  Int32            the profile identifier in VKontake
                                facebook_id            Int64            the profile identifier in Facebook network
                                instagram_id           Int64            the profile identifier in Instagram network
                                twitter_id             Int64            the profile identifier in Twitter network
                                other                  Object           The additional information (phone number, e-mail
                                                                        address, skype, etc.)


  The further task comes down to expansion of the received base by association of profiles by the rules described earlier on
which pages bek-links aren't specified other social networks.

3. Methods

    Based on a hypothesis that the person consists in the same communities on all social networks which uses, we can establish
connection between profiles of the different social networks which are in one community and with a certain probability to
assume that they belong to one person.
    The purpose of this stage is separation of communities among the people who are in the database received from the previous
step. Extension of Apache Spark GraphX which is intended for distributed processing of graphs is for this purpose used.
    The algorithm of preliminary data handling:
    1.   Generation of a graph on the basis of the available data. Peaks represent an entity of "people" and store identifiers of the
         profiles belonging to the specific user. Connection between peaks is established by the following principle: two peaks
         are adjacent if profiles of matching social networks are coherent. We will determine edge weight between peaks as:
                                𝑤(𝐴, 𝐵) = |{𝑖: 𝑖 ∈ [0, 𝑛 − 1] ∧ ∃ 𝐴𝑖 , 𝐵𝑖 ∧ 𝑟𝑒𝑙(𝐴𝑖 , 𝐵𝑖 )}|,                                      (2)
    where rel (a, b) - function, which the truth in only case when when profiles an and b are interconnected (the relation of
friendship or manifestation of activity is established).
    2. The algorithm Label Propagation [6] realized in GraphX API which solves the problem of a clustering is applied to the
received graph and finds communities in the graph.
    3. For each community RDD with a set of profiles of VKontakte, Facebook, Instagram and Twitter which are connected to
one or several members of the initial community is generated.
    Thus the data set (dataset) on the basis of which we can speculate about accessory of group of accounts of different social
networks (without obvious indication of interrelations) to one person is formed.
    Grouping of the common features given on the basis of the analysis is final stage in the solution of an objective. The
following procedure is applied to each data set from a dataset:
    1. Creation of the full multiple-count graph in which information on profiles of social networks and the potential
characterizing probability of their accessory to one person is stored;
    Tops of the graph contain information on profiles which is used at their comparison.
    For comparison of two profiles the multilayered neural network is used. On an entrance layer of network the vector of
dimension 12 containing the following data moves:
    ● Name ↔ Name'
    ● max (Name → Username', Name' → Username)

3rd International conference “Information Technology and Nanotechnology 2017”                                                89
                                                       Data Science / V.A. Bakayev, A.V. Blagov
   ● max (Name → E-mail', Name' → E-mail)
   ● max (Name → Skype', Name' → Skype)
   ● Username ↔ Username'
   ● max (Username → E-mail', Username' → E-mail)
   ● Username ↔ Skype'
   ● max (Skype → Username', Skype' → Username)
   ● max (Skype → E-mail', Skype' → E-mail)
   ● E-mail ↔ E-mail'
   ● Phone ↔ Phone'
   ● Website ↔ Website'
   We will determine operations a → by b and a ↔ b:
   1.1 Completeness of entry of a into b:
                                                       𝑑+𝑟 +𝑠
                                              𝑎 →𝑏 =1−        ∈ [0, 1],                                                              (3)
                                                                          𝑙𝑒𝑛(𝑎)
   where d - the number of operations of removal for transformation a to b;
   r - the number of operations of replacement for transformation a to b;
   s - the number of operations of a transposition for transformation a to b;
   len(x) - function of calculation of length of an argument.
   1.2 Comparison of an and b:
                                                                                                 𝑑𝑖𝑠𝑡(𝑎[𝑖], 𝑏[𝑗])
                                   ∀𝑖 ∈ [1, 𝑙𝑒𝑛(𝑎)], 𝑗 ∈ [1, 𝑙𝑒𝑛(𝑏)] 𝑑[𝑖, 𝑗] = 1 −                  𝑙𝑒𝑛(𝑏[𝑗])
                                                                                                                    ∈ [0, 1],        (4)
                                                                      𝑙𝑒𝑛(𝑎)
                                                                     ∑1        𝑑 [𝑖, 𝑓𝑖𝑡(𝑖)]
                                                        𝑎↔𝑏=                                   ∈ [0, 1],                             (5)
                                                                    𝑚𝑖𝑛(𝑙𝑒𝑛(𝑎),𝑙𝑒𝑛(𝑏))
   where dist (a, b) - the function calculating Damerau-Levenstein [7] distance for lines an and b;
   fit(i) - the function returning an index of the word of a line b put in compliance to the word a[i].
   Operation of comparison doesn't consider a word order. All words of initial lines are in pairs compared, and then, by means of
Kuhn-Munkres [8] algorithm, to each word of a line the word of a line b is put in compliance so that the similarity sum on all
couples of words was maximum. Also punctuation marks and other symbols aren't considered (except for letters and figures).
   Before processing all symbols of entrance data are given to Latin by rules of a transliteration. The summary table of the main
alphabets is for this purpose used (Russian, Ukrainian, Bulgarian, Indian, Arab).

4. Results and Discussion

   The training and control selections are collected on the basis of primary data. The amount of the training selection ~ 106
couples.
   As negative examples both casual couples of profiles, and couples found by means of full text search in different parameters
were used (name, username, email, skype).
   In generated to the column for each couple of shares the following algorithm is carried out:
   1. edges are sorted in decreasing order of scales;
   2. edges which weight is less than threshold value are removed or one of incidental tops is already connected with any top of
an opposite share.
   As a result of these transformations the count in whom everyone a component of connectivity represents group of accounts of
different social networks which belong to one person turns out.
   Because the person can belong at the same time several community and also if the same group has been created in several
communities, then it is possible to believe that accounts of this group really belong to one user.
   The data obtained at the last stage register in the same base where primary data are stored. However they aren't used as
entrance data for the realized algorithm in view of the unreliability.

5. Conclusion

   The processing and data analysis of social networks allows to personalize a product or service for a specific segment of target
audience. The program complex received as a result of operation erases boundaries between social networks for different
services, allowing them to integrate API and to operate with an entity of "people", but not "profile" that does their operation
more effective.
   The further research can be continued regarding upgrade of algorithms of aggregation and the analysis of data retrieveds. It is
necessary for implementation of the decision of the following tasks:
   ● the detection of popular social networks and services which users mention on the pages (for example, LinkedIn, Last.fm)
and development of parcers for them;
   ● the analysis of additional information sources on pages of users in VKontakte, Instagram and Twitter (for example, a news
feed on Twitter);
   ● the analysis of pages of users on target social networks and search of additional parameters based on which one user can
compare profiles on accessory.

Acknowledgements

   The work has been performed with partial financial support from the Ministry of Education and Sciences of the Russian

3rd International conference “Information Technology and Nanotechnology 2017”                                                   90
                                                         Data Science / V.A. Bakayev, A.V. Blagov
Federation within the framework of implementation of the Program for Improving the Samara University Competitiveness
among the World's Leading Research and Educational Centers for the Period of 2013-2020s.

References

[1] Tan W, Blake MB, Saleh I, Dustdar S. Social-network-sourced big data analytics. IEEE Internet Computing 2013; 5: 62–69.
[2] Khotilin MI, Blagov AV. Visualization and Cluster Analysis of Social Networks. CEUR Workshop Proceedings 2016; 1638: 843–850.
[3] Protsenko VI, Serafimovich PG, Popov SB, Kazanskiy NL. Software and hardware infrastructure for data stream processing. CEUR Workshop Proceedings
     2016; 1638: 782–787. DOI: 10.18287/1613-0073-2016-1638-782-787.
[4] Popov SB. The Big Data methodology in computer vision systems. CEUR Workshop Proceedings 2015; 1490: 420–425. DOI: 10.18287/1613-0073-2015-
     1490-420-425.
[5] Apache Spark Documentation. Access mode: http://spark.apache.org/docs/2.1.0.
[6] Smetanin N. Fuzzy search in the text and the dictionary. URL: https://habrahabr.ru/post/114997. (in Russian)
[7] Liu X, Murata T. Advanced modularity-specialized label propagation algorithm for detecting communities in networks. Physica A: Statistical Mechanics and
     its Applications. 389(7): 1493–1500.
[8] Hungarian algorithm for solving the assignment problem. URL: http://e-maxx.ru/algo/assignment_hungary. (in Russian)




3rd International conference “Information Technology and Nanotechnology 2017”                                                                      91