=Paper= {{Paper |id=Vol-2667/paper49 |storemode=property |title=An integrated approach to mapping user profiles on social networks |pdfUrl=https://ceur-ws.org/Vol-2667/paper49.pdf |volume=Vol-2667 |authors=Vladimir Belov,Dmitriy Drozdov,Roman Shakurov,Vadim Moshkin,Ilya Andreev }} ==An integrated approach to mapping user profiles on social networks == https://ceur-ws.org/Vol-2667/paper49.pdf
An integrated approach to mapping user profiles on social networks

Vladimir Belov, Information Systems department, Faculty of Information Systems and Technologies, Ulyanovsk State Technical University, Ulyanovsk, Russia, belo.199666@mail.ru
Dmitriy Drozdov, Information Systems department, Faculty of Information Systems and Technologies, Ulyanovsk State Technical University, Ulyanovsk, Russia, dimoxx123@gmail.com
Roman Shakurov, Information Systems department, Faculty of Information Systems and Technologies, Ulyanovsk State Technical University, Ulyanovsk, Russia, relife@inbox.ru
Vadim Moshkin, Information Systems department, Faculty of Information Systems and Technologies, Ulyanovsk State Technical University, Ulyanovsk, Russia, v.moshkin@ulstu.ru
Ilya Andreev, Information Systems department, Faculty of Information Systems and Technologies, Ulyanovsk State Technical University, Ulyanovsk, Russia, ia.andreev@ulstu.ru

    Abstract—In this paper, we consider an integrated approach to uniquely identifying a person across several social networks by analyzing questionnaire data, poorly structured information, and images from social network profiles. The paper also describes a software service that implements the proposed approach.

    Keywords—social network, account, search, mapping

I. INTRODUCTION

    The active growth of the audience of social networks has led to the emergence of these resources as a new source of data and knowledge. In Russia, several social networks are currently the most popular, each with its own focus and its own kind of posted content. These include VKontakte, Odnoklassniki, Instagram, Facebook, YouTube, and Twitter [1]. Many users have accounts on several social networks and publish different or similar content on them, so finding a particular person on any given network becomes problematic.

    Working with social networks can benefit a company's personnel management system, since social networks often reveal much more about an applicant's professional and personal qualities than a CV does. Currently, HR specialists collect and analyze information from social networks manually, which is time-consuming and limits the amount of information processed.

    Thus, there is a need for a software system that identifies a person's profiles across several social networks. Such a system would make it possible to aggregate more data about users in order to assess how strongly their personal characteristics are expressed. This work aims to find an integrated approach to mapping (comparing) user profiles across social networks based on the analysis of structured data, text, and graphic materials, with a view to further analysis of the user's social portrait.

II. THE MAIN APPROACHES TO SOLVING THE PROBLEM OF USER IDENTIFICATION

A. Methods and algorithms for mapping user accounts on social networks

    Currently, the task of identifying users from social network profile data is solved in various ways [1, 2, 3].

    In [4, 5, 6, 7, 8], methods for analyzing profile data from the social networks MySpace and StudiVZ are described. However, these networks are not popular in Russia. The proposed approaches construct feature vectors of user characteristics from the information provided on personal pages and then apply exact, partial, and fuzzy comparison methods to the resulting vectors. In these works, the authors also proposed the features that are most significant when comparing accounts. The developed algorithms achieved an accuracy of about 80% on a test sample of user accounts.

    In [6, 7, 9], methods for mapping social network profiles by analyzing published unstructured (text) information are presented. In [6], the authors conclude that the author of a text note can be identified by a unique writing style. In [7], a method is shown that takes into account not only the text published by a user in a note but also the meta-information associated with it: geolocation, publication time, hashtags, etc.

B. Software services for searching users in social networks

    Currently, there are several services on the RuNet for finding people's profiles on social networks. Most of them work like conventional search engines: they download all available open profile data and save it to a local database.

    FindFace [10]. One such service is the FindFace system, along with many other systems based on it, which finds a person's profile on a social network from a photo. To start the search, the user selects a photo in which the face is clearly visible and uploads it. The algorithm finds pages with similar photos and lists links to them with example images. Each link has a rating from 0 to 1. A rating above 0.67 means that the system registered the most complete match. The developed neural network scanned the faces of 500 million users of the VKontakte social network.



Copyright © 2020 for this paper by its authors.
Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)
Data Science

    Yandex.People [11]. The Yandex.People system searches using text data obtained from social network profiles. The following data is loaded from a person's profiles:
     • Name of the user (or at least one of the parameters that allow the person to be identified).
     • Age of the user.
     • Place of residence or user address.
     • Place of study or completed education.
     • User's place of work.

    Although these services solve specific tasks of searching for social network users, there are currently no comprehensive approaches or universal services that compare user profiles across social networks by analyzing not only profile data but also poorly structured information from the pages of the respective accounts.

III. APPROACH TO MAPPING USER PROFILES ON SOCIAL NETWORKS

    The developed algorithm starts from a person's personal page on a social network. Various information is downloaded from it and used to search for and compare profiles on other social networks. At the current stage, the following data is used:
     • First name, middle name of the user;
     • Date of birth;
     • Place of residence;
     • Place of birth;
     • Friends;
     • Text notes (posts);
     • Place of work;
     • Place of study;
     • Contacts, email, phone number;
     • Profile avatar, as well as profile photos.

    This information is downloaded both for the original profile and for the candidate profiles in other social networks. The loaded profile data is compared with the source profile data. Since several profiles may be found, they are sorted according to the following criteria:

     • Criterion for the presence of similar faces in photographs. Using the HOG method, faces are detected in photographs and a vector representation is generated for each of them. The vectors are then compared using the Euclidean norm.

     • Criterion for the presence of similar contacts. Profiles containing links to each other are very likely to belong to the same person.

     • Criterion for a similar place of work and place of study. To calculate this indicator, the strings are pre-processed: punctuation is removed and the text is converted to lower case. The lines are then lemmatized, and the resulting lemmas are compared according to the following metric:

       $$K = \frac{N}{\max(N_1, N_2)}$$

       where $N$ is the number of pairwise matching lemmas and $N_1$, $N_2$ are the numbers of lemmas in lines 1 and 2, respectively. If the value of this criterion exceeds 0.85, the lines are considered similar.

     • Criterion for the presence of similar posts. Two metrics were used to compare text notes. The first is the Levenshtein distance (edit distance) [12, 13]: the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. As the second method for finding similar posts, the shingles algorithm was implemented [14]. This algorithm splits the text into shingles, computes hashes of the shingles, and compares the hashes pairwise. The metric used for the shingle method (formula not recoverable from the source) is built from the number of matching shingle hashes and the numbers of shingles in the first and second texts, respectively.

       A visual representation of the shingle algorithm is shown in Figure 1.

       Fig. 1. Shingles Algorithm.

     • Criterion for having similar friends. This indicator is calculated by pairwise comparison of the names of friends. The more matches there are, the higher the profile appears in the final search results.

IV. IMPLEMENTATION OF A SOFTWARE SYSTEM FOR MAPPING USER PROFILES ON SOCIAL NETWORKS

    To test the effectiveness of the proposed approach, a software system for mapping user profiles on social networks was implemented. The developed system is a client-server application in which the server is a Java web service built on the Spring Boot platform.

    The system integrates with the three most popular social networks in the CIS: VKontakte [15], Odnoklassniki [16], and Facebook [17]. Data from VKontakte is downloaded through integration with the VK API. Data from Odnoklassniki and Facebook is downloaded by parsing the desktop and mobile versions of the respective websites.

    A web service containing a Python application has also been developed. Using the DLIB library [18], this service builds a vector representation of user photos.
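The comparison of face vectors mentioned for the similar-faces criterion can be illustrated with a minimal sketch. It assumes the face descriptors have already been produced by the photo-processing service and shows only the comparison step. The threshold of 0.6 is the value commonly quoted for dlib's 128-dimensional face descriptors, not a value given in this paper, and the function names and toy vectors are hypothetical.

```python
import math

# Assumed decision threshold: 0.6 is commonly quoted for dlib's 128-dimensional
# face descriptors; the paper itself does not state a threshold.
SAME_PERSON_THRESHOLD = 0.6


def euclidean_distance(a, b):
    """Euclidean norm of the difference between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("descriptors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def faces_match(descriptor_a, descriptor_b, threshold=SAME_PERSON_THRESHOLD):
    """Return True when two face descriptors are closer than the threshold."""
    return euclidean_distance(descriptor_a, descriptor_b) < threshold


# Toy 4-dimensional descriptors stand in for real 128-dimensional ones.
source_face = [0.10, 0.20, 0.30, 0.40]
candidate_same = [0.12, 0.18, 0.33, 0.41]
candidate_other = [0.90, 0.10, 0.70, 0.05]

print(faces_match(source_face, candidate_same))
print(faces_match(source_face, candidate_other))
```

A smaller distance means more similar faces, so a candidate profile could be credited with this criterion when at least one pair of photos falls below the threshold.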

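The lemma-based criterion for place of work and place of study from Section III can be sketched as follows. The sketch assumes that the metric divides the number of pairwise matching lemmas by the larger of the two lemma counts, which is how the `max` in the formula reads. The `lemmatize` function is a hypothetical stand-in that returns each token unchanged; a real system would call a morphological analyzer for the profile's language. Only ASCII punctuation is stripped here.

```python
import string
from collections import Counter

SIMILARITY_THRESHOLD = 0.85  # threshold stated in the text


def lemmatize(token):
    """Hypothetical stand-in: treats every token as its own lemma."""
    return token


def lemmas(line):
    """Remove ASCII punctuation, lower-case, split on whitespace, lemmatize."""
    cleaned = line.translate(str.maketrans("", "", string.punctuation)).lower()
    return [lemmatize(token) for token in cleaned.split()]


def lemma_similarity(line_1, line_2):
    """Matching lemmas divided by the larger lemma count (assumed metric)."""
    lemmas_1, lemmas_2 = lemmas(line_1), lemmas(line_2)
    if not lemmas_1 or not lemmas_2:
        return 0.0
    # Multiset intersection counts each lemma at most as often as it
    # appears in both lines.
    matches = sum((Counter(lemmas_1) & Counter(lemmas_2)).values())
    return matches / max(len(lemmas_1), len(lemmas_2))


def lines_similar(line_1, line_2):
    """Apply the 0.85 threshold from the text."""
    return lemma_similarity(line_1, line_2) > SIMILARITY_THRESHOLD


print(lemma_similarity("Ulyanovsk State Technical University",
                       "ulyanovsk state technical university."))
print(lemma_similarity("Ulyanovsk State Technical University",
                       "Ulyanovsk State University"))
```

Dividing by the larger count penalizes a short string that merely shares a few words with a long one, which is one plausible reason for choosing `max` in the denominator.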

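The Levenshtein distance used as the first post-comparison metric in Section III can be computed with the standard dynamic-programming scheme. The sketch below keeps only two rows of the table. The `normalized_similarity` helper is an assumption added for illustration: the paper does not say how the raw distance is turned into a similarity decision.

```python
def levenshtein(source, target):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn `source` into `target`."""
    # Keep the shorter string as the row to reduce memory use.
    if len(source) < len(target):
        source, target = target, source
    previous = list(range(len(target) + 1))
    for i, source_char in enumerate(source, start=1):
        current = [i]
        for j, target_char in enumerate(target, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            replace_cost = previous[j - 1] + (source_char != target_char)
            current.append(min(insert_cost, delete_cost, replace_cost))
        previous = current
    return previous[-1]


def normalized_similarity(text_1, text_2):
    """Assumed normalization: 1 minus distance over the longer length."""
    longest = max(len(text_1), len(text_2))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(text_1, text_2) / longest


print(levenshtein("kitten", "sitting"))
print(normalized_similarity("same post", "same post"))
```

The running time grows with the product of the two lengths, so for long posts it may be worth comparing only a prefix or falling back to the shingle method.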

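The shingle method from Section III (split the text into shingles, hash them, compare the hashes) can be sketched as below. Several details are assumptions, since the text does not specify them: shingles are taken as overlapping runs of three words, CRC32 is used as the hash, duplicate shingles are collapsed into a set, and the similarity is the number of matching hashes divided by the larger number of hashes.

```python
import re
import zlib


def shingles(text, size=3):
    """Overlapping word n-grams of length `size` (assumed shingle shape)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        return [" ".join(words)] if words else []
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]


def shingle_hashes(text, size=3):
    """Set of CRC32 hashes of the shingles of `text`."""
    return {zlib.crc32(s.encode("utf-8")) for s in shingles(text, size)}


def shingle_similarity(text_1, text_2, size=3):
    """Matching hashes divided by the larger hash count (assumed metric)."""
    hashes_1 = shingle_hashes(text_1, size)
    hashes_2 = shingle_hashes(text_2, size)
    if not hashes_1 or not hashes_2:
        return 0.0
    return len(hashes_1 & hashes_2) / max(len(hashes_1), len(hashes_2))


print(shingle_similarity("The quick brown fox jumps",
                         "the quick brown fox sleeps"))
```

Comparing hashes instead of the shingles themselves keeps the stored representation of each post compact, at the cost of a very small chance of hash collisions.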
VI International Conference on "Information Technology and Nanotechnology" (ITNT-2020)                                                 226

    As input, the system accepts a link to a profile in one of the social networks. All available data about the person is loaded from this profile, and a single model of the desired profile is built from it. Similar profiles are found by searching on the fields loaded from the original profile. A rating is then computed for the loaded similar profiles and used for sorting.

    Similar profiles are sorted by the resulting rating and displayed in a web form.

    The system architecture is shown in Figure 2.

    Fig. 2. System architecture.

    As can be seen from the figure, the system consists of two server applications:
     • Spring Boot App [19],
     • Flask App [20].

    There is also a React App client application. The Spring Boot App contains the basic logic of the system, as well as the data loaders for social networks:
     • VK Api Loader,
     • Facebook loader,
     • OK loader.

    The Flask App contains methods for recognizing faces in photographs using the DLIB library. An H2 Database is used to store some non-confidential data, for example the ids of cities and countries from the VK API.

    An example of the system interface is shown in Figure 3.

    Fig. 3. System appearance.

    The page displays the selected profiles, and on the left there is an application menu.

V. EXPERIMENT RESULTS

    A pre-prepared sample of 100 users with profiles in various social networks was used as the experimental base. Together these users had 204 accounts, since not all of them had accounts on every network. For each of these accounts, we tried to find similar ones using the developed service. The results of the experiments are summarized in the diagram in Figure 4.

    Fig. 4. Chart of the percentage of found profiles.

    The diagram shows that the system coped best with finding profiles on the VK social network and worst with Facebook. This is due to how convenient it is to extract data from each resource. The VK API makes it possible to extract large amounts of data quickly, which improves recognition quality, while parsing the other networks consumes many resources, which forces a limit on the amount of data retrieved.

    The results for the individual profile comparison criteria were also calculated and are shown in Figure 5.

    Fig. 5. Profile comparison criteria diagram.

    The diagram shows that in all cases the system managed to find at least one common friend. The other criteria matched much less often. Common educational institutions and common places of work were found about half as often. This is because users do not always indicate these characteristics on their pages, and the format of the specified data often does not allow a correct comparison. The system was even less successful at finding common faces in photos. This is due to many factors, such as the accuracy of the model itself and the quality and number of photos uploaded. Common posts were found in only a third of the experiments, because users do not always publish the same posts on different pages. Cross-references between profiles were found least often, as users provide such information less frequently.
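The rating-based sorting of candidate profiles described in Section IV, together with the friends criterion from Section III, could be sketched as follows. The paper does not say how the individual criteria are combined into a rating, so the equal weights, the `Candidate` structure, the use of the share of common friends as the friends score, and the example URLs are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical equal weights: the text does not state how criteria are combined.
WEIGHTS = {"faces": 1.0, "contacts": 1.0, "work_study": 1.0,
           "posts": 1.0, "friends": 1.0}


@dataclass
class Candidate:
    url: str
    friends: set = field(default_factory=set)
    scores: dict = field(default_factory=dict)  # criterion name -> value in [0, 1]


def normalize(name):
    """Lower-case a name and collapse repeated whitespace."""
    return " ".join(name.lower().split())


def friends_score(source_friends, candidate_friends):
    """Share of the source profile's friends also found in the candidate's list."""
    source = {normalize(n) for n in source_friends}
    candidate = {normalize(n) for n in candidate_friends}
    if not source:
        return 0.0
    return len(source & candidate) / len(source)


def rating(candidate, source_friends):
    """Weighted sum of criterion scores, including the friends criterion."""
    scores = dict(candidate.scores)
    scores["friends"] = friends_score(source_friends, candidate.friends)
    return sum(WEIGHTS.get(name, 0.0) * value for name, value in scores.items())


def rank(candidates, source_friends):
    """Sort candidates so that the highest-rated profile comes first."""
    return sorted(candidates, key=lambda c: rating(c, source_friends), reverse=True)


source_friends = {"Ivan Petrov", "Anna  Sidorova", "Oleg Ivanov"}
candidates = [
    Candidate("https://example.org/profile/b", {"Pavel Smirnov"}, {"faces": 0.0}),
    Candidate("https://example.org/profile/a", {"ivan petrov", "Anna Sidorova"}, {"faces": 1.0}),
]
ranked = rank(candidates, source_friends)
print([c.url for c in ranked])
```

Given the experimental observation that common friends were found in every case while other criteria matched less often, unequal weights would probably be needed in practice; choosing them is outside the scope of this sketch.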




CONCLUSION

    Thus, within the framework of this work, an integrated approach was proposed for finding user profiles on different social networks by analyzing not only profile data but also poorly structured information from the pages of the respective accounts, as well as graphic information.

    As a result of the work done, a software system was developed that searches for and maps similar profiles on social networks. The application can be used as a personnel search platform. The proposed methodology lays the foundation for further work on conducting relevant experiments and on developing new algorithms for searching, comparing, analyzing, and building a portrait of a user based on open data about them.

ACKNOWLEDGMENT

    This work was supported by the Foundation for Assistance to the Development of Small Forms of Enterprises in the Scientific and Technical Sphere within the framework of the project "Development, technical implementation and testing of a prototype platform for the formation of a social portrait of an applicant based on intelligent data retrieval in social networks using the principles of knowledge engineering" under Agreement No. 60GS1CTS10-D5 / 56043 dated 06.02.2020.

REFERENCES

[1] I. Rytsarev, D. Kirsh and A. Kupriyanov, "Clustering of media content from social networks using bigdata technology," Computer Optics, vol. 42, no. 5, pp. 921-927, 2018. DOI: 10.18287/2412-6179-2018-42-5-921-927.
[2] A. Filippov, V. Moshkin and N. Yarushkina, "Development of a Software for the Semantic Analysis of Social Media Content," Recent Research in Control Engineering and Decision Making. Studies in Systems, Decision and Control, vol. 199, pp. 421-432, 2019.
[3] N. Yarushkina, A. Filippov, V. Moshkin, G. Guskov and A. Romanov, "Intelligent Instrumentation for Opinion Mining in Social Media," Proceedings of the II International Scientific and Practical Conference Fuzzy Technologies in the Industry, Ulyanovsk, Russia, pp. 50-55, 2018.
[4] Y. Gaewon, "Enhancing Entity Search with Social Network Matching," EDBT/ICDT: Proceedings of the 14th International Conference on Extending Database Technology, 2011.
[5] M. Motoyama and G. Varghese, "I seek you: searching and matching individuals in social networks," Proceedings of the eleventh international workshop on Web information and data management, 2009.
[6] E. Raad, R. Chbeir and A. Dipanda, "User profile matching in social networks," 13th International Conference on Network-Based Information Systems, IEEE, 2010.
[7] J. Vosecky, D. Hong and V.Y. Shen, "User identification across multiple social networks," 1st International Conference on Networked Digital Technologies, IEEE, 2009.
[8] I. Veldman, "Matching Profiles from Social Network Sites," Master's thesis, University of Twente, 2009.
[9] N. Yarushkina, A. Filippov, V. Moshkin, A. Namestnikov and G. Guskov, "The social portrait building of a social network user based on semi-structured data analysis," CEUR Workshop Proceedings, vol. 2475, pp. 119-129, 2019.
[10] FindFace [Online]. URL: https://findface.pro.
[11] Yandex.People [Online]. URL: https://yandex.ru/people.
[12] V. Chernenkyi and Yu. Gapanyuk, "Passenger identification technique based on installation data," Engineering Journal: Science and Innovation, vol. 3, no. 3, p. 3, 2012.
[13] D.V. Mikhailov, A.P. Kozlov and G.M. Emelyanov, "An approach based on analysis of n-grams on links of words to extract the knowledge and relevant linguistic means on subject-oriented text sets," Computer Optics, vol. 41, no. 3, pp. 461-471, 2017. DOI: 10.18287/2412-6179-2017-41-3-461-471.
[14] A. Tsimbalov and O. Zolotarev, "The method of shingles," Vestnik of the Russian New University. Series: Complex systems: models, analysis and management, vol. 4, no. 4, pp. 72-79, 2016.
[15] Vkontakte [Online]. URL: https://vk.com, last accessed 2020/05/11.
[16] Odnoklassniki [Online]. URL: https://ok.ru/.
[17] Facebook [Online]. URL: https://www.facebook.com.
[18] DLIB library [Online]. URL: http://dlib.net/.
[19] Spring Boot App [Online]. URL: https://spring.io/projects/spring-boot.
[20] Flask App [Online]. URL: https://flask.palletsprojects.com/en/1.1.x/quickstart/.



