=Paper=
{{Paper
|id=Vol-2475/paper10
|storemode=property
|title=The Social Portrait Building of a Social Network User Based on Semi-Structured Data Analysis
|pdfUrl=https://ceur-ws.org/Vol-2475/paper10.pdf
|volume=Vol-2475
|authors=Nadezhda Yarushkina,Aleksey Filippov,Vadim Moshkin,Alexey Namestnikov,Gleb Guskov
}}
==The Social Portrait Building of a Social Network User Based on Semi-Structured Data Analysis==
<pdf width="1500px">https://ceur-ws.org/Vol-2475/paper10.pdf</pdf>
<pre>
     The Social Portrait Building of a Social Network User
          Based on Semi-Structured Data Analysis

             N.Yarushkina, A.Filippov, V.Moshkin, A.Namestnikov, G.Guskov

                       Ulyanovsk State Technical University, Ulyanovsk, Russia
                    e-mail: {jng, al.filippov, v.moshkin, nam, g.guskov}@ulstu.ru


        Abstract. The article presents a method for constructing a social portrait of a
        social network user. The method is implemented in a module of a system for
        opinion mining. It is based on collecting statistical information about the
        user and the dynamics of the user's activity, and on semantic analysis of the
        topics of the user's posts and comments using a linguistic ontology.


1 Introduction

   The active growth of the social media audience on the Internet (social networks,
forums, blogs and online media) has made social media a new source of data and
knowledge. Working with social media has several advantages and disadvantages.
   Advantages include:

• high speed of access to information;
• a broad audience;
• a wide range of data topics;
• large amount of data.
    The disadvantages are:

• large amount of data;
• unstructured presentation of information;
• absence of a single conceptual framework.
    A large amount of social media data is an advantage and a disadvantage at the
same time. According to statistics for 2018, about 30 million unique authors publish
580 billion messages in Russian social networks every month.
    On the one hand, a large amount of data makes it possible to obtain large training
sets for machine learning methods and large statistical samples for social studies.
    On the other hand, the billions of unstructured text messages and publications
that users leave every month cannot be processed manually.
    There is a need for methods of automated intelligent analysis and sentiment
analysis of text data. These methods handle large amounts of data and understand their meaning


___________________________
Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).
In: P. Sosnin, V. Maklaev, E. Sosnina (eds.): Proceedings of the IS-2019 Conference, Ulyanovsk, Russia,
24-27 September 2019, published at http://ceur-ws.org


(Text Mining), determine the sentiment (Opinion Mining) of user messages and
publications in a short time [1-5].
    Understanding the meaning and sentiment of publications in social media is the
most important and complex element of automated text processing [6-11].
    Our scientific group has created an intelligent tool for Opinion Mining of social
media. This tool includes new approaches that hybridize ontological analysis and
knowledge engineering methods with natural language processing (NLP) methods to
extract the semantic and emotional components of semi-structured and unstructured
text resources [12-16].
    These approaches are intended to improve the efficiency of social media analysis
by taking into account the specifics of the content and the fuzziness of natural
language.
    This paper considers the implementation of the model and methodology for
obtaining a social portrait of a social network user. The model was developed using
the example of VKontakte, the most popular social network in Russia [17]. The
number of users of this social network exceeded 528 million people at the time of
publication of the article.
    It is necessary to use intelligent data analysis systems to automate the analysis
of the target audience. Large amounts of data, the variety of forms in which they are
presented, and their lack of structure make it impossible to quickly build a social
portrait of a potential user of a company's product.
    This problem increases the time needed to analyze requirements, ideas,
competitors, and target audiences. The ability to form a social portrait of a social
network user is useful for:

• the fight against terrorism and extremism in the social network;
• the construction of a person-oriented education and health care system through the
  correct presentation of information about a healthy lifestyle, cultural and social
  values;
• sociological research;
• workforce planning, etc.


2 The architecture of the System for Opinion Mining

    A service-oriented approach is the basis of the architecture of the software system
for Opinion Mining of Social Media (SOM). This approach makes it possible to:

• increase the overall fault tolerance of the SOM by running services in different
  address spaces;
• increase the scalability of the SOM by running several instances of services and
  balancing the load between them;
• use different operating systems, programming languages, storage technologies, etc.;
• reduce the downtime of the SOM when making changes, correcting errors, etc.;
• completely replace services while maintaining the interface of interaction with
  other parts of the SOM.


    REST in conjunction with the HTTP protocol [18] is the basis for the organization
of the interface for the interaction of SOM services. REST allows a distributed system
of any type to have the following properties: performance, extensibility, simplicity,
updatability, intelligibility, portability and reliability.
    The architecture of SOM is shown in Figure 1.


                           Figure 1. The architecture of SOM

    1. The social data extraction module is a subsystem for importing data from social
media. This subsystem works with social networks through their public application
programming interfaces (Public API).
    2. The data storage module represents the information extracted from social
networks in a unified structure that is convenient for further processing. The data is
stored in the context of users, collections, data sources, versions, etc. The following
database management systems (DBMS) are used:

• Elasticsearch for indexing and retrieving data [19];
• MongoDB for storing data in JSON format [20];
• Neo4j for storing graphs of social interaction (social graph) and the ontology [21].


    3. The domain knowledge management module translates an OWL/RDF ontology
into the graph knowledge base [22].
    4. The administration and visualization module manages user rights and tasks,
provides a friendly system interface, and displays the necessary reports on the
analysis of social network data.
    5. The text data search and analysis module preprocesses text resources using
statistical and linguistic methods. This module also searches for objects related to a
specific task. The task is presented as a set of keywords. The user's query can be
extended semantically using an ontology, which contains descriptions of the features
of the problem area (PrA).
    6. The social portrait building module performs semantic analysis of the user's
posts and comments and collects statistics on the user's activity. The operation of this
module is described in more detail in the following sections of the article.
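
    The semantic extension of a keyword query mentioned for module 5 can be
illustrated with a minimal sketch. It assumes that the ontology is available as a
plain mapping from a concept to its related terms. The mapping ONTOLOGY, its
contents and the function expand_query are hypothetical and are not part of SOM.

```python
from typing import Dict, List, Set

# Hypothetical ontology fragment: concept -> related terms (synonyms, sub-concepts).
ONTOLOGY: Dict[str, Set[str]] = {
    "sport": {"football", "hockey", "training"},
    "music": {"concert", "guitar"},
}


def expand_query(keywords: List[str], ontology: Dict[str, Set[str]]) -> List[str]:
    """Return the keywords plus related ontology terms, lowercased, without duplicates."""
    expanded: List[str] = []
    seen: Set[str] = set()
    for word in keywords:
        key = word.lower()
        # The original keyword comes first, then its related terms in sorted order.
        for term in [key] + sorted(ontology.get(key, set())):
            if term not in seen:
                seen.add(term)
                expanded.append(term)
    return expanded
```

    Keywords that are absent from the mapping are passed through unchanged, so the
extension only ever widens the query.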


3 Formal model of the social network user

A.        Statistical social portrait of a social network user
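    Formally, the statistical social portrait of a social network user is:
                        $SP = \langle M, S, G \rangle$,
where $M$ is the user meta information; $S$ is the statistical information about the
user; $G$ is the user's social graph.
    The following expression holds for any user of a social network:
                $\forall u_i \in U: SP_i = \langle M_i, S_i, G_i \rangle$,
where $i$ is a specific user ID and $U$ is the set of social network users.
    Let us consider the components of $SP_i$ in more detail. Formally, the attributes
that determine the meta information $M_i$ of the i-th user are:
    $M_i = \langle \mathit{url}_i, \mathit{dom}_i, \mathit{wall}_i, \mathit{act}_i, \mathit{ver}_i, \mathit{name}_i,$
          $\mathit{sex}_i, \mathit{cont}_i, \mathit{int}_i, \mathit{edu}_i, \mathit{car}_i, \mathit{mil}_i \rangle$,
where
    $\mathit{url}_i$ is the VKontakte social network page of the i-th user;
    $\mathit{dom}_i$ is the short address of the i-th user page;
    $\mathit{wall}_i$ is the address of the i-th user's wall page;
    $\mathit{act}_i \in \{0, 1\}$ is the account status (0 - account is inactive, 1 - account is active);
    $\mathit{ver}_i \in \{0, 1\}$ is the page verification (0 - no, 1 - yes);
    $\mathit{name}_i$ are the i-th user's first name, last name and nickname;
    $\mathit{sex}_i \in \{f, m\}$ is the user's gender (f is female, m is male);
    $\mathit{cont}_i$ is the contact list of the i-th user;
    $\mathit{int}_i$ is a set of interests of the i-th user (may be categorical);
    $\mathit{edu}_i$ is the information about the education received by the i-th user;
    $\mathit{car}_i$ is the information about the i-th user's career;
    $\mathit{mil}_i$ is the information about the military service of the i-th user.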
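
    The meta information attributes listed above map naturally onto a record type.
The following is a minimal sketch under stated assumptions: the class name UserMeta
and all field names are illustrative, they are not taken from SOM or from the
VKontakte API, and optional fields model data that a user may not have made public.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class UserMeta:
    """Meta information of one user; all field names are illustrative."""
    page_url: str
    short_address: str
    wall_url: str
    is_active: int                      # 0 - account is inactive, 1 - account is active
    is_verified: int                    # 0 - no, 1 - yes
    first_name: str
    last_name: str
    nickname: Optional[str] = None
    gender: Optional[str] = None        # "f" or "m"; None if not public (assumption)
    contacts: List[str] = field(default_factory=list)
    interests: List[str] = field(default_factory=list)
    education: List[str] = field(default_factory=list)
    career: List[str] = field(default_factory=list)
    military: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Enforce the value domains given in the attribute list.
        if self.is_active not in (0, 1):
            raise ValueError("is_active must be 0 or 1")
        if self.is_verified not in (0, 1):
            raise ValueError("is_verified must be 0 or 1")
        if self.gender not in (None, "f", "m"):
            raise ValueError("gender must be 'f' or 'm'")
```

    Validating the value domains at construction time keeps malformed records out of
the later statistical and semantic steps.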


   Statistical information about the social network user $S_i$ is formed in various
time sections with an indication of the time period:
        $S_i = \{ \langle t_j, p_k, I_{jk} \rangle \mid j = 1, \dots, n;\ k = 1, \dots, l \}$,
where $t_j$, $p_k$ is a time section and a related period of time; $I_{jk}$ is the set
of indicator values recorded for them; $n$ is the number of time sections; $l$ is the
number of time periods for analyzing the activity of a social network user.
    The analysis is performed on a set of indicators:
        $I = \{ \mathit{Com}, \mathit{Fr}, \mathit{Sbr}, \mathit{Sbn}, \mathit{Pst}, \mathit{Cmt} \}$,
where
    $\mathit{Com}$ is the number of communities the user belongs to;
    $\mathit{Fr}$ is the number of friends of a social network user;
    $\mathit{Sbr}$ is the number of subscribers of the user;
    $\mathit{Sbn}$ is the number of subscriptions the user has;
    $\mathit{Pst}$ is the number of user posts;
    $\mathit{Cmt}$ is the number of user comments.
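    Each indicator $x$ corresponds to a value $v_x$ that characterizes its sum over
the time sections:
        $v_x = \sum_{j=1}^{n} x_j$,
where $x_j$ is the value of the indicator in the j-th time section.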
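
    The summation of indicators over time sections can be sketched as follows. This
is an illustration only: it assumes that every time section stores the change of each
indicator within its period (for example, new posts or new friends), and the
indicator names in INDICATORS and the function sum_indicators are hypothetical.

```python
from typing import Dict, List

# Hypothetical names for the six indicators of user activity.
INDICATORS = ("communities", "friends", "subscribers", "subscriptions", "posts", "comments")


def sum_indicators(sections: List[Dict[str, int]]) -> Dict[str, int]:
    """Sum each indicator over all time sections; missing values count as 0."""
    totals = {name: 0 for name in INDICATORS}
    for section in sections:
        for name in INDICATORS:
            totals[name] += section.get(name, 0)
    return totals
```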
    The structure of the social relations of the analyzed user is represented as a graph:
        $G_i = (V_i, E_i)$,
where $V_i$ is a finite set of social graph vertices; $E_i$ is a finite set of edges
defining pairs of adjacent identifiers of social network users.
    The set of vertices of the social graph is the union of the singleton set that
contains the identifier $u_i$ of the analyzed social network user with the set of
identifiers of his friends:
        $V_i = \{ u_i \} \cup \{ f_1, f_2, \dots, f_k \}$,
    where $k$ is the number of friends of the network user.
    Formally, the set of edges is:
        $E_i = \{ (u_i, f_1), (u_i, f_2), \dots, (u_i, f_k) \}$.
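
    The construction of the vertex and edge sets can be sketched as follows. This is
a minimal illustration, assuming that user identifiers are integers; the function
build_social_graph is hypothetical and is not part of SOM.

```python
from typing import List, Set, Tuple


def build_social_graph(user_id: int, friend_ids: List[int]) -> Tuple[Set[int], Set[Tuple[int, int]]]:
    """Build the star-shaped graph: the user plus friends, one edge per friend."""
    friends = set(friend_ids) - {user_id}    # drop duplicates and self-references
    vertices = {user_id} | friends           # union of the singleton set with the friends
    edges = {(user_id, friend) for friend in friends}
    return vertices, edges
```

    Sets are used for both results, so a friend that appears twice in the input
produces a single vertex and a single edge.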
B.        Semantic representation of the social portrait of a social network user
    The task of building a social portrait of a social network user is the task of
classifying a set of users by classifying text fragments (social network posts,
comments) of a specific user or his friends. Classes are categories of social network
user interests. These categories may include topics related to subject areas: sports, IT-
technologies, music, business and others.
    Formally, the task of classifying text fragments is described by a set of text
fragments:
        $T = \{ t_1, t_2, \dots, t_n \}$.
    The user interest categories form the set:
        $C = \{ c_j \}$, where $j = 1, \dots, m$.
    The hierarchy of categories is represented by a set of pairs that determines the
nesting relation between categories:
        $H = \{ (c_j, c_p) \}$, $c_j, c_p \in C$
    (the category $c_p$ is nested in the category $c_j$).
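
    A hierarchy given as a set of pairs can be traversed to find every category
nested in a given one. The sketch below is an illustration under the assumption
that each pair is ordered as (parent, child); the function descendants and the
category names in the usage note are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def descendants(category: str, pairs: Set[Tuple[str, str]]) -> Set[str]:
    """Return all categories nested, directly or transitively, in `category`.

    Each pair is (parent, child): the child category is nested in the parent.
    """
    children: Dict[str, List[str]] = {}
    for parent, child in pairs:
        children.setdefault(parent, []).append(child)
    found: Set[str] = set()
    stack = list(children.get(category, []))
    while stack:
        current = stack.pop()
        if current not in found:        # guards against cycles in malformed input
            found.add(current)
            stack.extend(children.get(current, []))
    return found
```

    For example, with the pairs ("sports", "football") and ("football", "futsal"),
the call descendants("sports", pairs) yields both "football" and "futsal".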


    The hierarchy of categories is formed as a domain ontology, and each category is
represented as a class (concept). In the classification problem it is necessary to
build a procedure based on this data. The procedure should find the most likely
category from the set $C$ for the text fragment $t_i$.
    Our method for classifying text fragments is based on the assumption that texts
belonging to the same category contain the same attributes (words or phrases). The
presence or absence of such attributes of the text fragment indicates its belonging or
non-belonging to one or another topic.
    Each category has a set of attributes; together they form:
        $F(C) = \bigcup_{j} F(c_j)$,
where $F(c_j) = \{ f_1, \dots, f_k, \dots, f_z \}$ is the set of attributes of the category $c_j$.
    The specified set of attributes defines the dictionary. This dictionary consists of
tokens, including words and phrases characterizing the category. This dictionary is
considered as the linguistic basis of the ontological resource of the developed system.
    Each text fragment also has a set of attributes, similar to the categories. On the
basis of these attributes, a fragment can be attributed to one or more categories with
some degree of probability:
        $F(t_i) = \{ f_1, \dots, f_k, \dots, f_y \}$.
    The set of all text fragment attributes should be equal to the set of attributes of
the interest categories of social network users, i.e.:
        $F(C) = \bigcup_{i} F(t_i)$.
    The decision to classify the text fragment $t_i$ as $c_j$ is made on the basis of the
intersection:
        $F(t_i) \cap F(c_j)$.
    The category of a specific social network user is determined on the basis of a
numerical indicator that aggregates the values of text fragments of posts and user
comments.
    Formally, the metric for calculating the degree of conformity of a text (post,
comment) to the description of an area of interest from the ontology is:
        $\mu(t_i, c_j) = \frac{|F(t_i) \cap F(c_j)|}{|F(t_i)|}$, $\mu(t_i, c_j) \in [0..1]$,
where $|F(t_i) \cap F(c_j)|$ is the number of matched attributes of the
dictionaries $F(t_i)$ and $F(c_j)$ respectively; $|F(t_i)|$ is the number of
attributes in the dictionary $F(t_i)$.
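
    A metric of this kind can be sketched as the share of matched attributes. The
following is a minimal illustration, not the authors' implementation: it assumes
that attributes are single lowercase word tokens and that the number of matches is
normalized by the number of attributes of the text fragment. The functions
tokenize and conformity are hypothetical.

```python
import re
from typing import Set


def tokenize(text: str) -> Set[str]:
    """Split a text fragment into a set of lowercase word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def conformity(fragment_tokens: Set[str], category_tokens: Set[str]) -> float:
    """Share of the fragment's tokens that also occur in the category dictionary."""
    if not fragment_tokens:
        return 0.0                       # an empty fragment matches nothing
    return len(fragment_tokens & category_tokens) / len(fragment_tokens)
```

    The result always lies between 0 and 1. Normalizing by the size of the category
dictionary instead would measure how much of the category description a fragment
covers, which penalizes short posts against large dictionaries.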
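    A set of degrees of correspondence of the text fragment $t_i$ to the set of
categories of interests $C$ is formed:
        $D(t_i) = \{ \mu(t_i, c_1), \mu(t_i, c_2), \dots, \mu(t_i, c_m) \}$,
where $\mu(t_i, c_j)$ is the degree of conformity of the fragment $t_i$ to the category $c_j$.
    The following expression is used to calculate how strongly each interest
category is expressed for the user in the process of forming a social portrait: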
                                     ,         ,   ,…,
                                ∑
                                     ,         ,   ,…,
                                                               ,
where n is the number of text fragments; m - the number of categories.
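
    One possible way to aggregate per-fragment degrees into a user-level indicator
is sketched below. It is not taken from the paper: it assumes that the degrees of
each category are summed over all text fragments and then normalized by the total
over all categories. The function interest_profile is hypothetical.

```python
from typing import Dict, List


def interest_profile(degrees: List[Dict[str, float]]) -> Dict[str, float]:
    """Aggregate per-fragment degrees into per-category shares that sum to 1."""
    totals: Dict[str, float] = {}
    for fragment in degrees:             # one mapping category -> degree per fragment
        for category, value in fragment.items():
            totals[category] = totals.get(category, 0.0) + value
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {category: 0.0 for category in totals}
    return {category: value / grand_total for category, value in totals.items()}
```

    With this normalization the resulting values can be read as the relative weight
of each interest category in the user's posts and comments.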


C.       Building a social portrait of a social network user
    The social portrait model of a social network user is implemented as a SOM
module. Experiments on the construction of a social portrait were carried out on the
open data of users of the social network VKontakte.
    The constructed social portrait consists of four blocks:
• User information.
• Statistical data.
• The interests of the user and user friends.
• Social graph.
    The first block contains the main public data from the user’s page (Fig.2).


                                  Figure 2. User information

     This block contains a list of interests of the user, his education, career, etc.
     The second block is a graph of the dynamics of user activity (Fig.3):
• Communities
• Friends
• Subscribers
• Subscriptions
• Posts
• Comments


                                 Figure 3. Statistical Data

   Data can be presented by day, month and year.
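
   Presenting activity by day, month or year amounts to grouping dated events by
a chosen granularity. The sketch below is an illustration only; the function
activity_by_period and the key formats in _FORMATS are assumptions and do not
describe the SOM implementation.

```python
from collections import Counter
from datetime import date
from typing import Dict, Iterable

# Key format for each supported granularity.
_FORMATS = {"day": "%Y-%m-%d", "month": "%Y-%m", "year": "%Y"}


def activity_by_period(dates: Iterable[date], granularity: str) -> Dict[str, int]:
    """Count events (for example posts) per day, month or year."""
    if granularity not in _FORMATS:
        raise ValueError("granularity must be 'day', 'month' or 'year'")
    pattern = _FORMATS[granularity]
    return dict(Counter(d.strftime(pattern) for d in dates))
```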
   The third block contains the results of semantic analysis of the user's posts and
comments (Fig.4). The calculation model is presented in Section 3.


                     Figure 4. User interests and user friends interests

   The fourth block of the social portrait is a social graph. The social graph contains
data about the social network users connected to a specific user by various types of
relations: friend, follower, etc. (Fig.5).


                                 Figure 5. Social graph


Conclusion

    Analysis of social network data can be useful in personnel management, since it
is often possible to learn more about a person's professional and personal qualities
from social networks than from a resume.
    The analysis of a person's interests and psycho-physiological characteristics is
also important for ensuring the overall security of an organization.
    The developed algorithm for forming a social portrait within SOM is aimed
primarily at HR specialists and will allow them to quickly get an objective
understanding of the personal, psycho-physiological and business qualities of a
person. The use of this system can reduce the company's risks associated with the
work of the specialists involved.


Acknowledgements. This study was supported by the Russian Foundation for Basic
Research (Grants No. 18-47-732007 and 18-47-730035).

References

    1. Leskovec J., Faloutsos C. Sampling from large graphs //Proceedings of the 12th ACM
SIGKDD international conference on Knowledge discovery and data mining. ACM. pp. 631-
636 (2006).
    2. Gjoka M. et al. Practical recommendations on crawling online social networks
//Selected Areas in Communications, IEEE Journal on. Vol. 29. №. 9. pp. 1872-1892 (2011).
    3. Boyd D., Ellison N. Social network sites: Definition, history, and scholarship. Journal
of Computer-Mediated Communication. Vol. 13(1). article 11. (2007)
    4. Pallis G., Zeinalipour-Yazti D., Dikaiakos M. Online Social Networks: Status and
Trends. New Directions in Web Data Management 1, Studies in Computational Intelligence
Volume 331, pp. 213-234 (2011).
    5. Key Trends to Watch in Gartner 2012 Emerging Technologies Hype Cycle.
http://www.forbes.com/sites/gartnergroup/2012/09/18/key-trends-to-watch-in-gartner2012-
emerging-technologies-hype-cycle-2, last accessed 2019/08/08.
    6. Korshunov A. Tasks and methods for determining the attributes of users of social
networks // Proceedings of the 15th All-Russian Scientific Conference "Digital Libraries:
Advanced Methods and Technologies, Digital Collections" - RCDL'2013
    7. Korshunov A., Beloborodov I., Gomzin A., Chuprina K., Astrakhantsev N., Nedumov
J., Turdakov D. Determination of demographic attributes of users of microblogging //
Proceedings of the Institute of System Programming of RAS. Vol. 25, 2013 DOI : 10.15514 /
ISPRAS-2013-25-10.
    8. Fleuret F. Fast Binary Feature Selection with Conditional Mutual Information // JMLR,
5:1531–1555 (2004).
    9. Crammer K., Dekel O., Keshet J., Shalev-Shwartz S., Singer Y. Online Passive-
Aggressive Algorithms // JMLR, 7(Mar):551–585 (2006).
    10. Pang B., Lee L., Vaithyanathan S. Thumbs up? Sentiment Classification using Machine
Learning Techniques. pp. 79–86 (2002).
    11. Turney P. Thumbs Up or Thumbs Down? Semantic Orientation Applied to
Unsupervised Classification of Reviews // Proceedings of the Association for Computational
Linguistics. pp. 417–424. arXiv: LG/0212032 (2002)
    12. Chetviorkin I., Loukachevitch N. Sentiment Analysis Track at ROMIP-2012.
Computer linguistics and intellectual technologies: Dialogue-2013. Collection of
scientific articles, volume 2, pp. 40-50.
    13. Antonova A., Soloviev A. Using the method of conditional random fields for
processing texts in Russian. Computer linguistics and intellectual technologies:
Dialogue-2013. Collection of scientific articles / Issue 12 (19). Moscow: Publishing
House of the RSUH. pp. 27-44 (2013).
    14. Pazelskaya A., Soloviev A. Method of definition of emotions in texts in Russian.
Computer linguistics and intellectual technologies: Dialogue-2011. Collection of
scientific articles / Issue 11 (18). Moscow: Publishing House of the RSUH.
pp. 510-523 (2011).
    15. García-Moya, L., Anaya-Sanchez, H., Berlanga-Llavori, R.: Retrieving product features
and opinions from customer reviews. IEEE Intelligent Systems 28(3), pp. 19–27 (2013)


    16. Tarasov D. Deep Recurrent Neural Networks for Multiple Language Aspect-Based
Sentiment Analysis // Computational Linguistics and Intellectual Technologies: Proceedings of
Annual International Conference “Dialogue-2015”. Issue 14(21), Vol.2, pp. 65-74 (2015).
    17. Social Network VKontakte, https://vk.com, last accessed 2019/08/08.
    18. Representational state transfer,
https://en.wikipedia.org/wiki/Representational_state_transfer, last accessed 2019/08/08.
    19. The Heart of the Elastic Stack, https://www.elastic.co/products/elasticsearch, last
accessed 2019/08/08.
    20. MongoDB. For Giant ideas, https://www.mongodb.com, last accessed 2019/08/08.
    21. Introducing the Neo4j Graph Platform, https://neo4j.com, last accessed 2019/08/08.
    22. Yarushkina N., Filippov A., Moshkin V. Development of the unified technological
platform for constructing the domain knowledge base through the context analysis.
Communications in Computer and Information Science. 2017. Vol. 754. pp. 62-72.

</pre>