=Paper=
{{Paper
|id=None
|storemode=property
|title=Estimating users' areas of research by publications and profiles on social networks
|pdfUrl=https://ceur-ws.org/Vol-1210/SP2014_11.pdf
|volume=Vol-1210
|dblpUrl=https://dblp.org/rec/conf/ht/SalounOZ14
}}
==Estimating users' areas of research by publications and profiles on social networks==
Estimating users’ areas of research by publications and
profiles on social networks
Petr Saloun, Adam Ondrejka, Ivan Zelinka
VSB-Technical University of Ostrava, 17. listopadu 15, 70833 Ostrava, Czech Republic
petr.saloun@vsb.cz, adam.ondrejka.st@vsb.cz, ivan.zelinka@vsb.cz

ABSTRACT
We focus on estimating the research area of a researcher (user) by finding their unique identity in digital libraries and social networks and by analysing the public metadata of their publications and the information published on their social network profiles. Where the metadata of a publication is incomplete, we fill the gaps by information retrieval using NLP techniques. We estimate the author's domain by extracting keywords from abstracts as well as from information published on social profiles. The result of this work is a design, an original algorithm and an experimental verification of the algorithm.

Categories and Subject Descriptors
H.5.4 [Information Interfaces and Presentation]: Hypertext/Hypermedia - User issues; H.3.7 [Information Storage and Retrieval]: Digital Libraries - User issues

General Terms
Design

Keywords
digital library, identify user, social media, information retrieval, natural language processing

1. INTRODUCTION
There are situations when we need to find the works of a specific researcher, for example when we organize a conference. One of the most common ways to solve this problem is to search for information about the researcher, either by looking at their institutions and publications or by examining the topics they presented at various conferences, and then to create a profile of the researcher manually. With the boom of social networking, people began to publish more openly accessible data than before. Using this data may reveal an interesting complement to the true identity of a person. Unfortunately, the expansion and emergence of various social networks caused a relatively large fragmentation: users publish specific information about themselves on a social network focused on a specific topic. The fact that people can have the same name is another obstacle; it is therefore necessary to verify that a profile actually belongs to the right person and not to a namesake. The main objective of this work is to identify researchers on social networks and in digital libraries. Based on the public information on these sites, we estimate the area of a person's research. The results are keywords that serve both as a description of the person and as an input for further research in finding suitable reviewers of publications presented at conferences and in detecting copyright violations.

2. ESTIMATING AREA OF AUTHOR'S RESEARCH
To find the right profiles we used a technique that compares specific attributes with different weights. Details are described in [2]. We used a modified version, shown in Equation 1 (similar work is mentioned in [3]).

<math display="block">sim_{u,p} = \begin{cases} \sum_{i=0}^{n} w_i \cdot sim(a_{i,u}, a_{i,p}) & \text{if } sim(a_{name}) > th_{name} \\ 0 & \text{otherwise} \end{cases} \qquad (1)</math>

where <math>sim(a_{name})</math> is the similarity between the author and user profile names, <math>th_{name}</math> is the threshold value used to decide whether the names are the same, <math>n</math> is the number of compared attributes, <math>w_i</math> is the weight of the <math>i</math>-th compared attribute, <math>a_p</math> is the set of the user's profile attributes, <math>a_u</math> is the set of the user's attributes derived from their publications, and <math>sim(a_{i,u}, a_{i,p})</math> is the similarity between attributes. The text comparison is done by fuzzy matching to tolerate potential typing errors in attributes.

As shown in Algorithm 1, the input is the name of the researcher. Search requests are then sent to all the digital libraries and the publications are downloaded. Each publication is then categorized by the defined criteria. Initially we eliminated all articles that were similar or identical and occurred in multiple libraries. Then we categorized the publications by affiliation, using the text similarity algorithm, and by their co-authors. This gives groups of possible unique authors. A problem arises when an author publishes alone or is active in multiple affiliations, because the algorithm then splits that author into several groups. To handle this situation, we included a comparison with the user's connections retrieved from social networks and additional information about skills, experience and so on. After that we categorized the keywords by social profile similarity. We found all the profiles associated with the researcher's name. Then we tried to find common connections and affiliations, and if there was at least one in a pair, we assigned them together with the compared social profiles. The process was repeated for every co-author found and every referenced publication, with the input of the previously
found authors, so that the results would be more accurate. People with the same name are not merged into one identity, because of the classification by connections and affiliations. It is highly unlikely that these people will have the same co-authors, friends and jobs. More information about unique user identity is given in [1].

The research domain is obtained by analysing the keywords of all found publications and by extracting additional information from the social profiles. Because keywords are missing or incompletely chosen in many publications, we had to use our original technique to get additional keywords from the abstract. We do not describe this technique in detail because of the page limit of this poster.

 Data: Author's first name and last name
 Result: User's identities
 firstName, lastName ← {user raw input};
 for searcher in DigitalLibrariesSearchers do
     publications ← SearchAuthor(firstName, lastName);
 end
 GroupByPublication(publications);
 GroupByAffiliates(publications);
 for searcher in SocialNetworkSearchers do
     publications ← SearchAuthor(firstName, lastName);
 end
 for group in groups do
     for publication in publications do
         groupKeywords += AnalyzePublication(publication);
     end
 end
 finalGroups = GroupBySocialProfiles(groups);

Algorithm 1: Finding unique author identity on digital libraries and social networks

3. EXPERIMENT
From the digital libraries we chose IEEExplorer (http://ieeexplorer.ieee.org), ACM Digital Library (http://dl.acm.org) and SpringerLink (http://www.springerlink.com). In this work, the researchers are found on the LinkedIn (http://linkedin.com) and Researchgate (http://www.researchgate.com) social networks. In the experiment we check whether we can find the unique identities and research domains of 180 randomly selected researchers. The search for user identities in digital libraries was tested on at least 180 researchers, by downloading and analysing about 3100 publications (Table 1). The researchers were chosen randomly and included people of different nationalities.

Initially, users were grouped only by co-authors and affiliations. 118 authors were grouped correctly ("R"), a rate of 65 %. 3 authors were assigned other authors' publications ("POA", 2 % error rate), because the searched author had publications with namesake co-authors who were wrongly evaluated as the same person. 59 authors were not merged correctly ("NA"): too many identities were created for what should have been a single author. This was caused by publications with no or one co-author and different affiliations, for which it was not possible to find a connection between them. The error rate of this category was 33 %.

{| class="wikitable"
|+ Table 1: Experiment of finding identities
! Method !! R !! POA !! NA !! Precision !! Recall
|-
| Co-authors || 118 || 3 || 59 || 97 % || 66 %
|-
| C-A + Social || 132 || 3 || 45 || 98 % || 74 %
|-
| C-A + S + K || 166 || 14 || 0 || 92 % || 100 %
|}

In the next step we included comparisons of authors by data found on their social profiles. 132 users were identified correctly (73 %), and the same 3 authors were again assigned wrong publications for the same reasons, so that error rate remained 2 %. The only improvements occurred where one author had been split into two different groups ("NA") and connections between them were found in social profiles, so the error rate of this category decreased to 25 %.

Finally we added comparisons by keywords between publications with a single author and publications with multiple authors. 166 users were identified correctly, and the correct rate increased to 92 %. No author was split into multiple groups ("NA"; the error rate of this category decreased to 0 %). Unfortunately, 14 users were assigned wrong publications ("POA"), and this error rate increased to 8 %. This was caused by errors in keyword extraction and the resulting poor detection of similarity between researchers and publications.

4. CONCLUSION
The goal of our work was to create an algorithm that estimates the research area of users by finding their identities in digital libraries and social networks and by analysing the data found. As the results of our experiment show, the algorithm for identifying researcher identities in digital libraries and on social networks was successful in 92 % of all attempts in its final form. This work was the first step in research on recommending publications to authors and finding copyright violations. We would like to try combining the comparison of authors' domains, detected from publications and information on the Internet, with the classic full-text search approach. This work is an input for further research in finding suitable reviewers of publications presented at conferences and in detecting copyright violations.

5. ACKNOWLEDGMENT
The following grant is acknowledged for the financial support provided for this research: Grant of SGS No. SP2014/42, VSB - Technical University of Ostrava, Czech Republic.

6. REFERENCES
[1] K. Kostkova, M. Barla, and M. Bielikova. Social relationships as a means for identifying an individual in large information spaces. In M. Bramer, editor, Artificial Intelligence in Theory and Practice III, volume 331 of IFIP Advances in Information and Communication Technology, pages 35–44. Springer Berlin Heidelberg, 2010.

[2] E. Raad, R. Chbeir, and A. Dipanda. User profile matching in social networks. In Network-Based Information Systems (NBiS), 2010 13th International Conference on, pages 297–304, Sept 2010.

[3] J. Vosecky, D. Hong, and V. Y. Shen. User identification across social networks using the web profile and friend network. IJWA, 2(1):23–34, 2010.
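
As an illustration of the weighted attribute comparison in Equation 1 (Section 2), the following is a minimal Python sketch. It is not the authors' implementation: the attribute names, the weights, the threshold value and the use of `difflib.SequenceMatcher` as the fuzzy text matcher are all assumptions chosen for the example, and the sample people are hypothetical.

```python
from difflib import SequenceMatcher


def fuzzy_sim(a: str, b: str) -> float:
    """Fuzzy text similarity in [0, 1] that tolerates small typing errors.

    Assumption: difflib's ratio stands in for the unspecified fuzzy matcher.
    """
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def profile_similarity(author, profile, weights, name_threshold=0.8):
    """Weighted attribute similarity gated by name similarity (Equation 1).

    author, profile: dicts mapping attribute name -> text (hypothetical layout).
    weights: dict mapping attribute name -> weight w_i.
    Returns 0.0 unless the name similarity exceeds name_threshold.
    """
    if fuzzy_sim(author["name"], profile["name"]) <= name_threshold:
        return 0.0
    total = 0.0
    for attribute, weight in weights.items():
        # An attribute missing on either side contributes nothing to the sum.
        if attribute in author and attribute in profile:
            total += weight * fuzzy_sim(author[attribute], profile[attribute])
    return total


# Hypothetical example data; weights and threshold are illustrative only.
WEIGHTS = {"name": 0.5, "affiliation": 0.3, "city": 0.2}
author = {
    "name": "Jan Novak",
    "affiliation": "Technical University of Ostrava",
    "city": "Ostrava",
}
misspelled_profile = {
    "name": "Jan Novvak",  # typing error in the profile name
    "affiliation": "Technical University of Ostrava",
    "city": "Ostrava",
}
different_person = {
    "name": "Petra Svobodova",
    "affiliation": "Technical University of Ostrava",
    "city": "Ostrava",
}

score_same = profile_similarity(author, misspelled_profile, WEIGHTS)
score_other = profile_similarity(author, different_person, WEIGHTS)
```

With these illustrative values, the misspelled profile still passes the name threshold and receives a high score, while the profile with a different name is rejected with a score of 0 even though its affiliation and city match exactly.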