=Paper=
{{Paper
|id=None
|storemode=property
|title=Estimating users' areas of research by publications and profiles on social networks
|pdfUrl=https://ceur-ws.org/Vol-1210/SP2014_11.pdf
|volume=Vol-1210
|dblpUrl=https://dblp.org/rec/conf/ht/SalounOZ14
}}
==Estimating users' areas of research by publications and profiles on social networks==
<pdf width="1500px">https://ceur-ws.org/Vol-1210/SP2014_11.pdf</pdf>
<pre>
     Estimating users’ areas of research by publications and
                  profiles on social networks

                    Petr Saloun                           Adam Ondrejka                            Ivan Zelinka
            VSB-Technical University of              VSB-Technical University of           VSB-Technical University of
             Ostrava, 17. listopadu 15                Ostrava, 17. listopadu 15             Ostrava, 17. listopadu 15
                  70833 Ostrava                            70833 Ostrava                         70833 Ostrava
                 Czech Republic                           Czech Republic                        Czech Republic
              petr.saloun@vsb.cz                   adam.ondrejka.st@vsb.cz                   ivan.zelinka@vsb.cz

ABSTRACT                                                              person and as an input for further research in finding suit-
We focus on estimating a research area of a researcher/user           able reviewers of publications presented at conferences and
by finding a unique identity in digital libraries and social          for detecting the violations of a copyright.
networks and by analysing public metadata of their publica-
tions and information published on social network profiles.           2.   ESTIMATING AREA OF AUTHOR’S RESEARCH
Missing metadata in some publications is compensated for by
information retrieval using NLP techniques. We estimate the
author’s domain by extracting keywords from abstracts as              To find the right profiles we used a technique which compares
well as from information published on social profiles. The            specific attributes with different weights. Details are described
result of this work is a design, an original algorithm and            in [2]. We used a modified version shown in Equation 1
experimental verification of the algorithm.                           (similar work is mentioned in [3]).

Categories and Subject Descriptors                                    \mathrm{sim}_{u,p} = \begin{cases}
H.5.4 [Information Interfaces and Presentation]: Hy-                      \sum_{i=0}^{n} w_i \cdot \mathrm{sim}(a_{i,u}, a_{i,p}) & \text{if } \mathrm{sim}(a_{name}) > th_{name} \\
pertext/Hypermedia—User issues; H.3.7 [Information In-                    0 & \text{otherwise}
terfaces and Presentation]: Digital Libraries—User is-                \end{cases} \qquad (1)
sues

                                                                      where sim(a_{name}) is the similarity between the author and user
General Terms                                                         profile names, th_{name} is the threshold that decides whether
Design                                                                the names match, n is the number of compared attributes, w_i is
                                                                      the weight of the i-th attribute, a_p is the set of the user’s
Keywords                                                              profile attributes, a_u is the set of the user’s attributes derived
digital library, identify user, social media, information re-         from their publications, and sim(a_{i,u}, a_{i,p}) is the similarity
trieval, natural language processing                                  between attributes. Text is compared by fuzzy matching to
                                                                      tolerate potential typing errors in attributes.
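
The gated weighted sum of Equation 1 can be illustrated with a minimal Python sketch. The paper does not name its fuzzy matcher, weights or threshold, so everything concrete here is an assumption: `difflib.SequenceMatcher` stands in for the fuzzy comparison, and the attribute names, the weights and the threshold value 0.85 are hypothetical.

```python
from difflib import SequenceMatcher


def fuzzy_sim(a: str, b: str) -> float:
    """Return a similarity in [0, 1] that tolerates small typing errors."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def profile_similarity(author: dict, profile: dict, weights: dict,
                       name_threshold: float = 0.85) -> float:
    """Weighted attribute similarity, gated by name similarity (cf. Equation 1)."""
    # Gate: the score is 0 unless the names are similar enough.
    if fuzzy_sim(author["name"], profile["name"]) <= name_threshold:
        return 0.0
    # Weighted sum over the attributes that both sides provide.
    return sum(
        w * fuzzy_sim(author[attr], profile[attr])
        for attr, w in weights.items()
        if attr in author and attr in profile
    )


# Hypothetical data: one author record built from publications, two profiles.
author = {"name": "Jan Novak", "affiliation": "VSB-Technical University of Ostrava"}
same_person = {"name": "Jan Novak", "affiliation": "VSB - Technical University of Ostrava"}
other_person = {"name": "Eva Dvorak", "affiliation": "VSB-Technical University of Ostrava"}
weights = {"name": 0.5, "affiliation": 0.5}  # assumed weights, not from the paper

score_same = profile_similarity(author, same_person, weights)
score_other = profile_similarity(author, other_person, weights)
print(score_same, score_other)
```

In this sketch a profile with a different name scores exactly 0 regardless of its other attributes, while a small formatting difference in the affiliation only lowers the score slightly.
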
1.   INTRODUCTION
There are situations in life when we need to find works of a          As shown in Algorithm 1, the input is the name of the re-
specific researcher, for example when we organize a confer-           searcher. Search requests are then sent to all the digital
ence. One of the most common ways to solve this problem               libraries and the publications are downloaded. Each pub-
is to search for information about this researcher, either            lication is then categorized by the defined criteria. Initially
by looking at the institutions and his publications or by ex-         we eliminated all the articles that were similar or equal and
amining his topics at various conferences, and then                   occurred in multiple libraries. Then we categorized the
create a profile of the researcher manually. With the boom            publications by affiliations using the text similarity algo-
of social networking, people began to publish more openly             rithm and also by their co-authors. Now we have groups
accessible data than before. Using the data may reveal an             of possible unique authors. A problem arises when an
interesting complement to the true identity of a person. Un-          author publishes alone or is active under multiple affilia-
fortunately, the expansion and the emergence of various so-           tions, because the algorithm then splits him into several
cial networks caused relatively large fragmentation, and              groups. To handle this situation, we included a compari-
users publish specific information about themselves on a so-          son with the user’s connections retrieved from social networks
cial network focusing on a specific topic. The fact that              and additional information about skills, experience and so
people can have the same name is another obstacle; there-             on. After that we categorized the keywords by social profile
fore it is necessary to verify that it is actually a profile of the   similarity. We found all the profiles associated with the re-
right person and not of his namesake. The main objective              searcher’s name. Then we tried to find common connections
of this work is to identify researchers on social networks and        and affiliations, and if there was at least one in each pair we
digital libraries. Based on the public information on these           assigned them together with the compared social profiles.
sites, we estimate the area of a person’s research. The re-           The process was repeated for every co-author found and for
sults are keywords that serve both as a description of the            referred publications, with the input of the previously
found authors, so the results would be more accurate. Peo-
ple with the same name are not merged into one identity,              Table 1: Experiment of finding identities
because of the classification by connections and affiliates. It                     R   POA   NA   Precision   Recall
is highly unlikely that these people will have the same co-       Co-authors     118     3   59        97 %     66 %
authors, friends and jobs. More information about a unique        C-A + Social   132     3   45        98 %     74 %
user identity is described in [1].                                C-A + S + K    166    14    0        92 %    100 %
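
The paper does not state how the Precision and Recall columns of Table 1 were computed. One reading that reproduces them to within one percentage point is precision = R / (R + POA) and recall = R / (R + NA), where R counts correctly grouped authors, POA counts authors assigned other authors' publications, and NA counts authors split across groups. The sketch below checks this assumed reading against the counts in the table; the share of all 180 researchers, R / (R + POA + NA), corresponds to the 65 %, 73 % and 92 % rates quoted in Section 3.

```python
# Counts and reported percentages copied from Table 1.
TABLE_1 = {
    "Co-authors":   {"R": 118, "POA": 3,  "NA": 59, "precision": 97, "recall": 66},
    "C-A + Social": {"R": 132, "POA": 3,  "NA": 45, "precision": 98, "recall": 74},
    "C-A + S + K":  {"R": 166, "POA": 14, "NA": 0,  "precision": 92, "recall": 100},
}


def rates(r: int, poa: int, na: int) -> tuple:
    """Return (precision, recall, share of all researchers) in percent.

    The formulas are an assumed reading of the table, not stated in the paper.
    """
    precision = 100 * r / (r + poa)
    recall = 100 * r / (r + na)
    share_correct = 100 * r / (r + poa + na)
    return precision, recall, share_correct


for label, row in TABLE_1.items():
    p, rec, share = rates(row["R"], row["POA"], row["NA"])
    print(f"{label:13s} precision={p:5.1f}  recall={rec:5.1f}  correct={share:5.1f}")
```

Every row sums to 180 researchers, which matches the sample size given in Section 3.
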

The research domain is obtained by analysing the keywords
of all found publications and by extracting additional            In the next step we included comparisons of authors by data
information from the social profiles. Because keywords are        found on their social profiles. 132 users were identified cor-
missing or incompletely chosen in many publications, we had       rectly (73 %), and the same 3 authors were again assigned
to use our original technique to get additional keywords from     wrong publications for the same reason, so this error rate
abstracts. We do not describe this technique in detail            remained 2 %. The only improvement was in the case where
because of the page limit of this poster.                         one author was split into two different groups (“NA”) and
                                                                  connections between them were found in social profiles;
Data: Author’s first name and last name                           this error rate decreased to 25 %. Finally we added com-
Result: User’s identities                                         parisons by keywords between publications with a single au-
firstName, lastName ← {user raw input};                           thor and publications with multiple authors. 166 users were
for searcher in DigitalLibrariesSearchers do                      identified correctly, and the correct rate increased to 92 %.
     publications ←SearchAuthor(firstName, lastName);             No author was split into several groups (“NA”; the error
end                                                               rate of this category decreased to 0 %). Unfortunately,
GroupByPublication(publications);                                 14 users were assigned wrong publications (“POA”), so this
GroupByAffiliates(publications);                                  error rate increased to 8 %. This was caused by errors in
for searcher in SocialNetworkSearchers do                         keyword extraction and the resulting bad detection of
     publications ←SearchAuthor(firstName, lastName);             similarity between researchers and publications.
end
for group in groups do                                            4.     CONCLUSION
     for publication in publications do                           The goal of our work was to create an algorithm to estimate
        groupKeywords +=                                          the research area of users by finding their identities in digital
        AnalyzePublication(publication);                          libraries and social networks and by analysing the data found.
     end                                                          As the results of our experiment show, the algorithm for
end                                                               identifying research identities in digital libraries and social
finalGroups = GroupBySocialProfiles(groups);                      networks was successful in 92 % of all attempts in its final
Algorithm 1: Finding unique author identity on digital            configuration. This work is the first step in research on rec-
libraries and social networks                                     ommending publications to authors and finding copyright
                                                                  violations. We would like to combine the comparison of
3.   EXPERIMENT                                                   authors’ domains detected from publications and informa-
From the digital libraries we chose IEEE Xplore¹, ACM Digi-       tion on the Internet with the classic full-text search ap-
tal Library² and SpringerLink³. In this work, the researchers     proach. This work is an input for further research in finding
are searched for on the LinkedIn⁴ and ResearchGate⁵ social        suitable reviewers of publications presented at conferences
networks. In the experiment we check whether we can find          and for detecting copyright violations.
unique identities and research domains of 180 randomly
selected researchers. The search for user identities in           5.     ACKNOWLEDGMENT
digital libraries was tested on 180 researchers by download-      The following grant is acknowledged for the financial support
ing and analysing about 3100 publications (Table 1). The          provided for this research: Grant of SGS No. SP2014/42,
researchers were chosen randomly and included people of dif-      VSB - Technical University of Ostrava, Czech Republic.
ferent nationalities. Initially, users were grouped only by
co-authors and affiliations. 118 authors were grouped cor-
rectly (“R”), a rate of 65 %. 3 authors were assigned another     6.     REFERENCES
author’s publications (“POA”, 2 % error rate), because the        [1] K. Kostkova, M. Barla, and M. Bielikova. Social
searched author had publications with namesake co-authors,            relationships as a means for identifying an individual in
who were wrongly evaluated as the same person. 59 authors             large information spaces. In M. Bramer, editor,
were not merged correctly (“NA”): too many identities were            Artificial Intelligence in Theory and Practice III,
created for what should have been a single author. This was           volume 331 of IFIP Advances in Information and
caused by publications with no or one co-author and dif-              Communication Technology, pages 35–44. Springer
ferent affiliations, for which no connection between them             Berlin Heidelberg, 2010.
could be found. The error rate of this category was 33 %.         [2] E. Raad, R. Chbeir, and A. Dipanda. User profile
                                                                      matching in social networks. In Network-Based
¹ http://ieeexplorer.ieee.org                                         Information Systems (NBiS), 2010 13th International
² http://dl.acm.org                                                   Conference on, pages 297–304, Sept 2010.
³ http://www.springerlink.com                                     [3] J. Vosecky, D. Hong, and V. Y. Shen. User
⁴ http://linkedin.com                                                 identification across social networks using the web
⁵ http://www.researchgate.com                                         profile and friend network. IJWA, 2(1):23–34, 2010.

</pre>