Profiling Less Active Users in Online
                     Communities

       Alexandra Barysheva, Anna Golubtsova, and Rostislav Yavorskiy

              Department of Data Analysis and Artificial Intelligence
                          Faculty of Computer Science
                          Higher School of Economics
                    Myasnitskaya 20, Moscow, Russia, 101000
            {asbarysheva, annagolubtsova1, ryavorsky}@gmail.com


      Abstract. Our research is focused on the study of social interactions of
      online community users, especially in business-oriented social network-
      ing services like LinkedIn or Habrahabr. The general aim of the work is
      to design methods for profiling of discussion participants within groups
      according to their interaction patterns. One of our goals is to make the
      approach independent of the language of communication, which is why
      we build our analysis on the comment graph and do not use information
      from the content of posts. This paper suggests an FCA-based approach to
      profiling less active users, for whom little data is available and
      statistical analysis is not applicable.

      Keywords: online community, communication graph, user profiles


1   Introduction
Social Internet development unveiled great research potential for network analy-
sis which includes the analysis of relationships and flows between people, groups,
organizations, computers, URLs, and other connected information/knowledge
entities.
    This paper focuses on the behaviour patterns of members of social network
groups (communities) of interest. In these online groups, users can regularly
publish information or news items, called posts, and interact with each
other by commenting on or liking them.
    The general goal of this work is to provide a method of profiling group users
by analysing the group interaction graph. An interaction graph is a graph whose
vertices correspond to users and whose edges represent the relation “user A
comments on or likes a post of user B”.
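    The definition above can be illustrated with a minimal Python sketch. The
function name, the input format (pairs of acting user and post author) and the
choice to ignore self-interactions are assumptions made for this illustration
only; they are not taken from the data sets described later.

```python
from collections import defaultdict

def build_interaction_graph(events):
    """Build a directed, weighted interaction graph.

    `events` is an iterable of (actor, post_author) pairs, one pair per
    comment or like. The weight of the edge actor -> post_author counts how
    many times the actor commented on or liked a post of post_author.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for actor, post_author in events:
        if actor == post_author:
            continue  # assumption: interactions with one's own post are ignored
        graph[actor][post_author] += 1
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {actor: dict(targets) for actor, targets in graph.items()}

# Hypothetical toy data: (who commented or liked, whose post it was).
events = [("ann", "bob"), ("ann", "bob"), ("carl", "bob"),
          ("bob", "ann"), ("bob", "bob")]
graph = build_interaction_graph(events)
```

In this toy example the edge from ann to bob gets weight 2, because ann
interacted twice with posts of bob.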
    Today almost every social media site provides an API for easy data retrieval.
An application programming interface (API) is a set of routines, protocols, and
tools for building software applications that use the obtained data. To build
the graph of social interaction we use data sets collected from the
business-oriented social networking service LinkedIn and from Habrahabr (a
leading Russian blog on Information Technology topics).
    In this paper we continue research described in [1], which is also dedicated
to the task of profiling online community users. The method proposed in that
paper is based on clustering users according to statistical characteristics of their
communication patterns.
    Clustering based on statistical characteristics allows one to study
communication patterns, but it is not applicable to users with low activity, for
whom not enough data is available. Our study shows that actively involved users
constitute approximately 2% of a community, while more than half of the community
members could be classified as “observers” (see [1]). That motivates us to design
a separate technique for profiling less active members of an online community.
    The paper is organized as follows. Section 2 reviews related work, organized
according to the approach used. Section 3 describes the data sets. Section 4
summarizes achieved and anticipated results. In Section 5 we conclude and
discuss the possible applications of our work.


2   Related work

Relationship is a central concept of Social Network Analysis. Our race,
ethnicity, background and personality all influence how we behave and
interact. Thus, the analysis of behavioural patterns in online communities can
provide information about a user that is not explicit in his or her profile page
(and can also cast some light on the principles of social behaviour in online
networks).
    There are different types of relationships between people: friendship, trust,
influence, conflict, dislike, etc. In [2] the authors describe several types of
relationships in social networks, including (1) binary and valued relationships,
(2) symmetric and asymmetric relationships, and (3) multimodal relationships.
Examples of binary and valued relationships are “Sam follows Ann on Facebook”
and “Alex retweeted 4 tweets from Mary”, respectively. Following or reposting on
Twitter, Facebook or LinkedIn are asymmetric relationships by definition, but
a follow-back tie can exist, thus symmetrizing them. An example of a symmetric
relation is “Ann and Bob have common interests”. Multimodal relationships
are interactions between actors of different types: people possess information,
a group adds people, and so on. In our study, we analyze all three types of
relationships between group users.
    The task of user profile modelling comprises many subtasks and approaches,
such as content-based methods [3], the island method [2], the study of users’
friend or follower connections, and the tracing of user activity [4]. Typically,
most proposed profiling methods combine different techniques.
    An example of Twitter user profiling is presented in [5]. The authors study
demographic estimation algorithms based on users’ tweets and community
relationships. They propose a hybrid community-based and text-based method in
which the demographics of Twitter users are estimated by tracking the tweet
history and clustering the followers/followings. The method estimates a wide
variety of demographics, such as gender, age and area. The authors also consider
users with few tweets, such as followers of corporate accounts.
     In [6] the authors suggest a generic model for user classification in social
media with application to Twitter. Analyzing the user’s behaviour, the linguistic
content and the network structure of the user’s Twitter feed, they develop a
method for automatically inferring the values of user attributes such as political
orientation or ethnicity. A machine learning approach is used, relying on four
general feature classes: user profile, user tweeting behaviour, linguistic content
of user messages and user social network features. The paper presents experimental
results on 3 tasks with different characteristics: ethnicity identification,
political affiliation detection and detecting affinity for a particular business.
     A weakly supervised approach to user profile extraction from Twitter is
suggested in [7]. In addition to traditional linguistic features, this approach
takes into account network information offered by social media. The authors use
user profiles from social media websites such as Facebook or Google Plus as
a distant source of supervision for extracting attributes from user-generated
text. They test the algorithm on three attribute domains (spouse, job and
education), and the results demonstrate accurate predictions of user attributes
based on tweets.
     Unlike the previously mentioned works, article [4] focuses only on user
activity, ignoring the content of the messages a user exchanged. The authors take
into consideration both social interactions and tweeting patterns on the
microblogging service Twitter, which allows profiling users according to their
activity patterns. According to the study, 75% of the users can be assigned to
their appropriate cluster with an assignment probability of 0.9. Clusters are
characterized by a set of statistical features relating to user activity, network
structure and dynamic patterns. Furthermore, the authors propose three algorithms
to analyze the impact of content posted by a user.
     In [8] the authors use the profiles of some users in a network to predict the
profiles of other users. They gather fine-grained data from two social networks
and try to infer user profile attributes. The article proposes a method of
inferring user attributes that is inspired by previous approaches to detecting
communities in social networks and is based on the fact that users with common
attributes are more likely to be friends and often form dense communities. Results
show that certain user attributes can be inferred with high accuracy when given
information on as little as 20% of the users.
     The expertise retrieval task of user profiling is discussed in [9]. In this
work, the topical profiling task is decomposed into two stages: (1) discovering and
identifying possible knowledge areas, and (2) measuring the person’s competency
in each of these areas.
     Our research has the same goal as the previously mentioned works: to provide
a method of profiling users. The main task of this paper is to suggest an approach
to profiling less active users, who usually form the majority of any online
community. Since we have little data for these members, statistical analysis of
their behaviour is not possible. That is why we turn to FCA tools.
    The idea of applying formal concept analysis to social network analysis is not
new, see e.g. [10] or [11]. Usually the technique is used for network clustering
and community detection. Our goal is slightly different: we assume that a
community is already given, and we aim at a detailed description of the roles of
different users in this community.
    A method of retrieving groups of website users with similar behaviour using
Formal Concept Analysis is presented in [12]. The authors propose to construct a
taxonomy based on users’ visits to different pages of websites. The problem of
the large number of concepts is solved by applying the stability index [13] to
the lattice concepts.
    An extension of the use of the stability index in taxonomy construction (not
only in terms of intent stability, but also in terms of extent stability) is
described in [14]. For instance, in this work the authors study the dataset from
the research by Davis, Gardner and Gardner [15], which features ladies attending
particular events in a small Mississippi town in the 1930s. By constructing a
lattice stabilised according to extent, the authors found the core members of
the groups.


3     Description of the data set

In our work we use two data sets. The first one is a communication graph retrieved
from the Habrahabr blogging site for several of the most popular topics. The
second includes communication graphs for several LinkedIn groups.


3.1    Habrahabr data

Habrahabr (http://habrahabr.ru/) is the most popular Russian blog service
devoted to Information Technology. Currently we work with communication graphs
for the most active topics, including “Algorithms”, “Big Data”, “High performance
computing”, “Information security” and others. For a given topic the
dataset is a single table with the following columns:

 – Post Id
 – Post author
 – Comment Id
 – Comment author
 – Parent comment Id
 – Time stamp

Also, for convenience we added some derived values such as comment depth, number
of child comments, etc.
    The data set for the “Big data” community in CSV format is available at the
project page on GitHub, see https://github.com/ryavorsky/HabraGraph.
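    As an illustration of how the derived values could be obtained from such a
table, consider the following Python sketch. The column names and the sample
rows are hypothetical (the actual CSV header in the repository may differ), and
the sketch assumes that top-level comments have an empty parent field and
depth 1.

```python
import csv
import io
from collections import Counter

# Hypothetical sample in the shape of the table described above.
SAMPLE = """post_id,post_author,comment_id,comment_author,parent_comment_id,timestamp
p1,alice,c1,bob,,2015-01-01T10:00:00
p1,alice,c2,carol,c1,2015-01-01T10:05:00
p1,alice,c3,bob,c2,2015-01-01T10:09:00
p1,alice,c4,dave,c1,2015-01-01T10:12:00
"""

def add_derived_columns(rows):
    """Add comment depth and number of direct child comments to each row."""
    # Map every comment to its parent (None for top-level comments).
    parent = {r["comment_id"]: r["parent_comment_id"] or None for r in rows}
    # Count how many comments name each comment as their parent.
    children = Counter(p for p in parent.values() if p is not None)

    def depth(comment_id):
        # Walk up the reply chain; assumes the parent links contain no cycles.
        d = 1
        while parent.get(comment_id) is not None:
            comment_id = parent[comment_id]
            d += 1
        return d

    for r in rows:
        r["depth"] = depth(r["comment_id"])
        r["n_children"] = children[r["comment_id"]]
    return rows

rows = add_derived_columns(list(csv.DictReader(io.StringIO(SAMPLE))))
summary = [(r["comment_id"], r["depth"], r["n_children"]) for r in rows]
```

For a real file, the same function could be applied to the rows read with
csv.DictReader from the downloaded CSV, after adjusting the column names.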

3.2    LinkedIn data
The business-oriented social networking service LinkedIn, http://www.linkedin.com,
allows users to create profiles and interact with each other in an online social
network, which may represent real-world professional relationships. LinkedIn
also supports the formation of interest groups, which are generally employment
related: most topics concern professional and career issues. Currently we work
with communication graphs for the largest groups related to the topic
“Bioinformatics”. The dataset has the same structure: Post Id, Post author,
Comment Id, Comment author, Like author, Time stamp, and some derived
characteristics, such as the number of child comments of a comment and its
depth in the thread.


4     Users profiling
4.1    Clustering according to the statistical characteristics
In this paper we continue the research on profiling online community users
described in [1]. First, a set of user attributes that can be computed from the
community comment graph and the post comment graphs is listed. They are: the
number of people who commented on the user and the number of people the user
commented on, the number of posts the user wrote and the number of posts the user
commented on, the average depth of the user’s comments, and how often the user’s
comment was a terminal one. These attributes reflect the user’s communication
style in online discussions.
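    A rough Python sketch of how such attributes could be computed from a comment
table is given below. It is an illustration under stated assumptions, not the
exact procedure of [1]: the field names are hypothetical, a comment is treated
as addressed to the author of its parent comment (or to the post author for a
top-level comment), and only posts that received at least one comment are
visible in the table.

```python
from collections import defaultdict

def user_attributes(comments):
    """comments: list of dicts with keys post_id, post_author, comment_id,
    comment_author, parent_comment_id (None for top-level comments)."""
    by_id = {c["comment_id"]: c for c in comments}
    has_child = {c["parent_comment_id"] for c in comments if c["parent_comment_id"]}

    def depth(c):
        d = 1  # top-level comments have depth 1
        while c["parent_comment_id"] in by_id:
            c = by_id[c["parent_comment_id"]]
            d += 1
        return d

    stats = defaultdict(lambda: {"written": set(), "commented": set(),
                                 "replied_to": set(), "replied_by": set(),
                                 "depths": [], "terminal": 0})
    for c in comments:
        author = c["comment_author"]
        parent = by_id.get(c["parent_comment_id"])
        target = parent["comment_author"] if parent else c["post_author"]
        stats[c["post_author"]]["written"].add(c["post_id"])
        s = stats[author]
        s["commented"].add(c["post_id"])
        s["depths"].append(depth(c))
        if c["comment_id"] not in has_child:
            s["terminal"] += 1  # nobody replied to this comment
        if target != author:
            s["replied_to"].add(target)
            stats[target]["replied_by"].add(author)

    result = {}
    for user, s in stats.items():
        n = len(s["depths"])
        result[user] = {
            "posts_written": len(s["written"]),
            "posts_commented": len(s["commented"]),
            "replied_to": len(s["replied_to"]),
            "replied_by": len(s["replied_by"]),
            "avg_depth": sum(s["depths"]) / n if n else 0.0,
            "terminal_share": s["terminal"] / n if n else 0.0,
        }
    return result

# Hypothetical toy thread: one post by alice with four comments.
comments = [
    {"post_id": "p1", "post_author": "alice", "comment_id": "c1",
     "comment_author": "bob", "parent_comment_id": None},
    {"post_id": "p1", "post_author": "alice", "comment_id": "c2",
     "comment_author": "alice", "parent_comment_id": "c1"},
    {"post_id": "p1", "post_author": "alice", "comment_id": "c3",
     "comment_author": "bob", "parent_comment_id": "c2"},
    {"post_id": "p1", "post_author": "alice", "comment_id": "c4",
     "comment_author": "carol", "parent_comment_id": None},
]
attrs = user_attributes(comments)
```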
    The clustering allowed us to figure out the following user types:
1. Silent stars (2 users). Authors of popular posts who do not participate in
   the discussions.
2. Communicative stars (2 users). Authors of popular posts who are actively
   involved in the discussions.
3. Active chatters (2% of users). Participants who leave many comments,
   and reply to almost every comment on their posts.
4. Idle chatters (2% of users). People who write few comments, but usually
   their comments support the subsequent debate.
5. Socializers (5% of users). Users who do not produce many comments,
   although the number of people they talk with is notably high.
6. Investigators (15% of users). Participants who communicate with many
   people within a very narrow range of discussions (few blog posts).
7. Concluders (22% of users). Participants who produce few comments, and
   quite often their comment is the last one in the discussion branch.
8. Observers (more than 50% of users). The most inactive users: each one
   left no more than 3 comments.
    It can be seen that less active users represent the larger part of the
community. That is why the goal of the current work is to provide a method for
detecting dependencies between users with different activity rates. In other
words, we want to know which of the active users the less active users resemble.

4.2   Profiling of less active users
As mentioned above, the task of analyzing and profiling user behaviour
is rather straightforward for more active users, for whom a lot of data is
available.
    For the rest of the online community, the majority of less active members,
we suggest describing the user profile in terms of similarity to one or several
preselected benchmarks: key users of the community with well-known behavioural
patterns.
    In more detail, the suggested procedure is the following. First, select a small
number of key users of the studied community. Second, build the object-attribute
table, in which rows (objects) are all group users and columns (properties) are
benchmark profiles (key users). Then use FCA tools to compute the lattice
of formal concepts. Finally, conclude that the activity pattern of a user user1
can be described in terms of the intersection of a few benchmarks, e.g. core user1
and core user3.

4.3   Core users
There are many different ways to determine the set of benchmarks. In our work
we use the notion of the communication graph core [16,17]. Fig. 1 shows the
3-core of the users-posts graph (that is, the largest subset of users and posts
in which each user left comments in at least three posts and each post has
comments from at least three users) corresponding to the “big data” community
on the Habrahabr.ru platform.
    Restricting our focus to the 3-core helps us to filter out blog posts that
are not very relevant to the main community topic, and also to identify users
who play a central role in the group communications.
    Consider for example an irrelevant post that produced quite an intensive
discussion. We can detect its irrelevance by the fact that core users did not
participate in the thread. Formally, to classify a post as a core one we require
that at least 3 core users participate in the discussion.
    Similarly, there might be a user who went into a long exchange of comments in
a single thread or left a few remarks in some rather irrelevant discussions. A
user with such behaviour is usually interpreted as a casual visitor, not a core
one. To be included in the community core, a user must participate in at least 3
core discussions.
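    The 3-core of the users-posts graph can be computed by the standard iterative
pruning procedure for k-cores: repeatedly remove users with fewer than k posts
and posts with fewer than k users until nothing changes. The Python sketch
below illustrates this; the function name and the input format (pairs of user
and post) are assumptions made for the example.

```python
from collections import defaultdict

def bipartite_k_core(participation, k=3):
    """participation: iterable of (user, post) pairs, "user commented on post".

    Returns (core_users, core_posts): the largest sets in which every user
    commented on at least k core posts and every post has comments from at
    least k core users.
    """
    posts_of = defaultdict(set)
    users_of = defaultdict(set)
    for user, post in participation:
        posts_of[user].add(post)
        users_of[post].add(user)

    changed = True
    while changed:
        changed = False
        # Remove users that take part in fewer than k remaining posts.
        for user in [u for u, ps in posts_of.items() if len(ps) < k]:
            for post in posts_of.pop(user):
                users_of[post].discard(user)
            changed = True
        # Remove posts that have fewer than k remaining users.
        for post in [p for p, us in users_of.items() if len(us) < k]:
            for user in users_of.pop(post):
                posts_of[user].discard(post)
            changed = True
    return set(posts_of), set(users_of)

# Hypothetical toy data: u1..u3 all commented on p1..p3; u4 and u5 are casual.
pairs = [(u, p) for u in ("u1", "u2", "u3") for p in ("p1", "p2", "p3")]
pairs += [("u4", "p1"), ("u5", "p4")]
core_users, core_posts = bipartite_k_core(pairs, k=3)
```

In the toy data the casual visitors u4 and u5 and the post p4 are pruned, which
mirrors the filtering of irrelevant posts and casual visitors described above.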

4.4   Why FCA
As already mentioned above, the main goal of this work is to design an
approach for describing the profiles of the majority of less active members of an
online community. Usually we have very little data for such users, a couple of
comments or so. That is why classification according to numerical characteristics
suggested in [1] hardly makes sense: the users will simply be classified as
“inactive”.
    In this paper we suggest using information about the particular topics that
attracted a user. That data has “object-property” type, so we turn to FCA.

Fig. 1. 3-core of the graph corresponding to the “big data” community at Habrahabr.ru


     The formal context is defined as follows. Assuming user is a group user and
benchmark is a core user, we say that object user has property benchmark if
users user and benchmark both participated in the discussion of some post. For
the example of the “big data” community mentioned above, the formal context table
has 13 properties (the number of core users) and hundreds of objects (all the
other community members).
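     The construction of this formal context can be sketched in Python as
follows. The function name, the input format and the toy user names are
hypothetical; the sketch also assumes that the participation list covers
everyone involved in a post discussion.

```python
from collections import defaultdict

def build_formal_context(participation, core_users):
    """participation: iterable of (user, post) pairs; core_users: set of users.

    Returns a dict mapping every non-core user (object) to the set of core
    users (properties) with whom the user shared at least one discussion.
    """
    users_of = defaultdict(set)
    for user, post in participation:
        users_of[post].add(user)

    context = defaultdict(set)
    for participants in users_of.values():
        benchmarks = participants & core_users
        for user in participants - core_users:
            # Users who never met a core user keep an empty property set.
            context[user] |= benchmarks
    return dict(context)

# Hypothetical toy data.
participation = [("core1", "p1"), ("core2", "p1"), ("u1", "p1"),
                 ("core2", "p2"), ("u1", "p2"), ("u2", "p2"),
                 ("u3", "p3")]
context = build_formal_context(participation, {"core1", "core2"})
```

Here u1 obtains both properties, u2 only core2, and u3, who never shared a
discussion with a core user, gets an empty row.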
     We use FCA tools (Concept Explorer [18] and FCART [19,20]) to build the
lattice of the formal context. As a result, the set of formal concepts is obtained
(see fig. 2). Each formal concept has a set of objects (extent). These users are
similar to each other and their profile can be specified in terms of the
benchmarks, that is, the set of core users they share (intent).
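     For small contexts, the set of formal concepts can also be enumerated
directly. The following Python sketch uses a naive approach that relies on the
fact that every concept intent is an intersection of object rows (or the full
attribute set). It is meant only as an illustration with hypothetical data and
does not reproduce the algorithms used inside Concept Explorer or FCART.

```python
def formal_concepts(context, attributes):
    """context: dict mapping object -> set of attributes.

    Returns a list of (extent, intent) pairs of frozensets, ordered by the
    size of the intent.
    """
    attributes = frozenset(attributes)
    intents = {attributes}  # intent of the empty set of objects
    for obj_attrs in context.values():
        row = frozenset(obj_attrs)
        # Close the family of intents under intersection with this row.
        intents |= {row & intent for intent in intents}

    concepts = []
    for intent in intents:
        # The extent consists of all objects that have every attribute of the intent.
        extent = frozenset(o for o, attrs in context.items() if intent <= attrs)
        concepts.append((extent, intent))
    return sorted(concepts, key=lambda c: (len(c[1]), sorted(c[1])))

# Hypothetical toy context: three less active users, three core users.
context = {"u1": {"core1", "core2"},
           "u2": {"core2"},
           "u3": {"core2", "core3"}}
concepts = formal_concepts(context, {"core1", "core2", "core3"})
```

In the toy context all three users share the benchmark core2, so the concept
with intent {core2} has the extent {u1, u2, u3}, while u1 alone forms the
extent of the concept with intent {core1, core2}.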
     By combining the 3-core graph and the lattice we can get a visual map of
the community, see fig. 3.
     Also, in applications we can use the resulting formal concepts to introduce
the corresponding links between the user profiles. Indeed, for a user with low
activity the profile will be almost empty due to lack of statistics. The links from
this empty profile to the more detailed profiles of the most similar core users
will help to get at least some information about the user’s interests.

Fig. 2. Lattice of users-benchmarks formal context corresponding to “big data”
community on Habrahabr


5   Conclusion

The paper describes work in progress on developing a universal tool for auto-
mated building of profiles for online community users. The proposed method
is based on the user activity in the process of posting, liking and commenting
group posts.
    To make the approach suitable for the analysis of different online
communities, it does not use information from the user profile or content
analysis. It is based solely on user activity and on how the user interacts with
other group participants.
    To classify less active online group members, we suggest a method based on
retrieving formal concepts with core users as attributes. The results
can be used to extend the functionality of groups with a detailed description
of the participants’ profiles and the nature of their interaction, which in
turn should help to understand user behaviour.
    The developed method can be applied to any online community.
Fig. 3. The map of the Habrahabr “big data” community built from the 3-core graph


References

 1. Barysheva A., Yavorskiy R. Building Profiles of Blog Users Based on Comment
    Graph Analysis, Proceedings of AIST’2015, 4-th International Conference “Anal-
    ysis of Images, Social Networks and Texts”, Yekaterinburg, 9-11 April 2015. To
    appear in Springer CCIS.
 2. Kouznetsov A., Tsvetovat M. Social Network Analysis for Startups, O’Reilly,
    2011.
 3. Santosh R. Author Profiling: Predicting Age and Gender from Blogs, PAN at CLEF,
    2013.
 4. Rocha E. User profiling on Twitter, Semantic Web. Interoperability, Usability,
    Applicability, 2011.
 5. Kazushi I. Twitter user profiling based on text and community mining for market
    analysis, Knowledge-Based Systems, 2013, pp. 35-47.
 6. Pennacchiotti M. A Machine Learning Approach to Twitter User Classification,
    Fifth International AAAI Conference on Weblogs and Social Media, 2011, p. 45.
 7. Li J., Ritter A., Hovy E. Weakly Supervised User Profile Extraction from Twitter,
    ACL, 2014.
 8. Druschel P., Gummadi K. P., Mislove A., Viswanath B. You Are Who You Know:
    Inferring User Profiles in Online Social Networks, ACM WSDM, 2010.
 9. Balog K., Fang Y., de Rijke M., Serdyukov P., and Si. L. Expertise Retrieval,
    Foundations and Trends in Information Retrieval, 6 (2-3), 2012, pp. 127-256.
10. Snasel, Vaclav, Zdenek Horak, and Ajith Abraham. Understanding social networks
    using formal concept analysis. Proceedings of the 2008 IEEE/WIC/ACM Interna-
    tional Conference on Web Intelligence and Intelligent Agent Technology-Volume
    03. IEEE Computer Society, 2008.
11. Gnatyshak, D., Ignatov, D. I., Semenov, A., Poelmans, J. (2012). Gaining insight
    in social networks with biclustering and triclustering. In Perspectives in Business
    Informatics Research (pp. 162-171). Springer Berlin Heidelberg.
12. Sergei O. Kuznetsov, D.I. Ignatov, Concept Stability for Constructing Taxonomies
    of Web-site users. In: S. Obiedkov, C. Roth, Eds., Proc. Social Network Analysis
    and Conceptual Structures: Exploring Opportunities, Clermont-Ferrand, 2007.
13. Kuznetsov, S.O.: On stability of a formal concept. In SanJuan, E., ed.: JIM, Metz,
    France (2003)
14. Sergei O. Kuznetsov, Sergei Obiedkov and Camille Roth, Reducing the Represen-
    tation Complexity of Lattice-Based Taxonomies. In: U. Priss, S. Polovina, R. Hill,
    Eds., Proc. 15th International Conference on Conceptual Structures (ICCS 2007),
    Lecture Notes in Artificial Intelligence (Springer), Vol. 4604, pp. 241-254, 2007.
15. Davis, A., Gardner, B.B., Gardner, M.R.: Deep South. University of Chicago Press,
    Chicago (1941)
16. Batagelj V., Zaversnik M. Generalized Cores, arXiv:cs/0202039v1, 2002.
17. Seidman S. B. Network structure and minimum degree, Social Networks, 5, 1983,
    pp. 269–287.
18. Yevtushenko, Serhiy A. System of data analysis “Concept Explorer”. Proceedings
    of the 7th national conference on Artificial Intelligence KII. Vol. 2000. 2000.
19. Neznanov, Alexey, Dmitry Ilvovsky, and Andrey Parinov. Advancing FCA Work-
    flow in FCART System for Knowledge Discovery in Quantitative Data. Procedia
    Computer Science 31 (2014): 201-210.
20. Neznanov, A. A., and A. A. Parinov. FCA Analyst Session and Data Access Tools in
    FCART. Artificial Intelligence: Methodology, Systems, and Applications. Springer
    International Publishing, 2014. 214-221.