=Paper=
{{Paper
|id=Vol-1893/Paper4
|storemode=property
|title=Identifying Influential Users' Professions via the Microblogs They Forward
|pdfUrl=https://ceur-ws.org/Vol-1893/Paper4.pdf
|volume=Vol-1893
|authors=Yuan Wang,Hangyu Mao,Zhen Xiao
|dblpUrl=https://dblp.org/rec/conf/ijcai/WangMX17
}}
==Identifying Influential Users' Professions via the Microblogs They Forward==
Proceedings of the 3rd International Workshop on Social Influence Analysis (SocInf 2017)
August 19th, 2017 - Melbourne, Australia
Yuan Wang, Hangyu Mao, and Zhen Xiao
Department of Computer Science, Peking University, Beijing 100871, China
{wangyuan, mhy, xiaozhen}@net.pku.edu.cn
Abstract. For most social media sites, identifying the professions of (influential) users is an important task. Much work has explored this task by mining user-generated textual content or analyzing the social network structure. In this paper, we take a different approach and examine only which microblog messages an influential user has forwarded. First, we define hot microblog messages under two standards and identify them from a large number of candidate messages. Each identified message points to a specific hot event. Next, we group similar hot messages together based on their word similarity, semantic similarity, and forwarders’ similarity. Last, we represent users with the hot messages they forwarded and design a method to identify their professions. We collect a real-world dataset for our experiments and show that our method performs significantly better than the traditional method.
1 Introduction
Online microblogging services have become an integral part of daily life for most netizens. These services want to know more about their users’ profiles, since user profiles play an important role in commercial services such as personalized recommendation and online advertising. However, user profiles are usually not easy to obtain, because users are reluctant to expose them to the public. Some work has been conducted to solve this problem. A traditional practice is to cut users’ messages into bags of words and train a classifier. This practice achieves acceptable results on simple tasks such as predicting gender and age [1], but it cannot solve more complex tasks [14].
Profession, which is founded upon specialized educational training, is a critical social profile of influential users. In Weibo, the largest microblogging service in China, influential users are mainly organized by their professions. They are more likely to follow other users who have the same profession. It is important for microblogging services to correctly identify influential users and their professions.
Message forwarding (e.g. retweeting on Twitter.com and reposting on Weibo.com) is one of the most popular functions in existing microblogging services. In Weibo, users can forward messages or any interesting content on the web, such as real blogs, photos and external links. In this paper, if a weibo message was forwarded by any user, we call it a forwarded message; otherwise we call it a non-forwarded message. Based on a large dataset, we find that about 60% of weibo messages are forwarded messages. For most users, the messages they forward are exactly what they are interested in, so users’ professions are reflected to some extent by the messages they forward. However, the traditional “bag of words” model discards the information contained in users’ forwarding behaviors. Naturally, in this paper, we ask and try to answer the following question: can we represent microblog users with the messages they forwarded, and predict their professions more accurately than the traditional method?
Copyright © 2017 for the individual papers by the papers’ authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.
The task poses some challenges that make it non-trivial. The first challenge is that there are too many forwarded messages. If we consider each forwarded message as a feature, the feature vector will be very large and sparse. We observe that most of these messages have been forwarded by no more than 3 weibo users. In this paper, we define them as non-hot forwarded messages and define the messages that are forwarded by more users as hot forwarded messages. In our experiment, we discard the non-hot messages. Another challenge is that even after we filter out non-hot messages, the number of remaining hot messages is still quite large. We observe that every hot message points to a hot event (e.g. breaking news or a recently released movie). We therefore need methods to group similar hot weibo messages together.
Fig. 1. The framework of PIFB
In this paper, we propose an efficient framework, Profession Identification by using Forwarding Behaviors (PIFB). As Figure 1 shows, we first identify the hot forwarded messages from a large number of candidates. Each of these identified messages points to a specific hot event. Next, we introduce three methods to group similar messages together, downsizing our message sets. Then, influential users can be represented with the merged hot messages that they have forwarded. Finally, we predict users’ professions, and the results are more accurate than those of the traditional method.
2 Dataset and Professions
We collect 41,531 manually annotated influential users from Weibo (http://weibo.com). To avoid robot users, we only collected verified users. Weibo conducts manual verification to make sure that verified users provide real and authentic information. These users belong to 11 representative professions. As Table 1 shows, the professions include “media”, “entertainment”, “sports”, and “IT”, among others.
We also collect each user’s latest 500 weibo messages. These messages can be classified into two categories: forward actions and post actions. In general, a forward action consists of a trace and content. The trace records the chain of users through whom the message reached the current user. The content can take any form that users can share with their followers, such as videos and blogs. A simple example is shown below, in which a user forwards the message:
RT @Raj RT @Sheldon : It took 50 years ...
(trace: “RT @Raj RT @Sheldon”; content: “It took 50 years ...”)
This forward action indicates that “It took 50 years ...” was originally posted by
“Sheldon” and was forwarded by “Raj”, and now is forwarded by the current user again.
In general, post action only contains the “content” part, representing that the current
user posted an original message.
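The trace/content split described above can be sketched in a few lines of Python. This is only an illustration: the regular expression and the function name `parse_message` are assumptions based on the "RT @user" convention of the example, not the parser used in the paper, and real Weibo reposts use a different surface syntax.

```python
import re

# Hypothetical pattern: one or more "RT @name" hops, then ":", then the content.
_FORWARD = re.compile(r"^\s*((?:RT\s+@\w+\s*)+):\s*(.*)$", re.DOTALL)

def parse_message(text):
    """Return (trace, content); trace is empty for a post action."""
    match = _FORWARD.match(text)
    if match is None:
        # No "RT @..." prefix: treat the message as an original post.
        return [], text.strip()
    # Users appear in the order written, most recent forwarder first.
    trace = re.findall(r"@(\w+)", match.group(1))
    return trace, match.group(2).strip()
```

For the example above, the trace lists "Raj" (the most recent forwarder) followed by "Sheldon" (the original poster).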
3 The Framework of PIFB
In this section, we formalize our problem as a classification task and introduce the main
steps of PIFB.
3.1 Hot Message Identification
This paper focuses on influential users’ behaviors with respect to forwarded messages. A critical step is to identify the hot forwarded messages. In this part, we define hot messages under two standards.
Absolutely Hot Message We argue that the more users have forwarded a message, the more information it carries, and the more the forwarding behaviors about this message can help our profession prediction. Nowadays, Weibo has become the biggest “News Site” in China. Most traditional news organizations have opened official accounts in Weibo, and these accounts are all very active. They usually publish breaking news promptly and make the news spread quickly. There are also many Chinese celebrities in Weibo, including actors, singers and entrepreneurs. They post their personal views or daily lives in their accounts. They generally have a great number of followers, and their daily updates are likely to get thousands of forwards. So, in this paper, if a weibo message has been forwarded more than a certain number of times (for example, 500), it is regarded as the first kind of “hot forwarded message” (absolutely hot).
Table 1. The distribution of professions in our dataset.
No.  Category       (%)
1    Media          26.3
2    Entertainment  10.1
3    Estate          9.1
4    Finance         8.6
5    Government      8.5
6    IT              8.4
7    Sports          6.4
8    Fashion         6.2
9    Education       5.9
10   Literature      5.4
11   Game            5.1
Relatively Hot Message The 11 professions shown in Table 1 are not “evenly matched” in attracting attention. Nearly all the highly forwarded messages are posted by “entertainment” and “sports” stars. For an “estate” account, it is not easy to post an absolutely hot message, because “estate” accounts usually have relatively fewer followers and a lower forwarding rate. If we only adopt the absolutely hot messages described in the previous paragraph, it is very possible that we only get messages posted by a small subset of the 11 categories (maybe 2-4). Therefore, as a supplement to the first standard, we define another kind of hot message. In our dataset, if a message’s owner has f followers (f > 500) and this message has been forwarded more than f/5 times, it is regarded as the second kind of “hot forwarded message” (relatively hot).
After identifying these two types of “hot messages”, we can build a matrix M, whose columns denote hot messages and whose rows denote users. This matrix represents all the forwarding relationships between weibo users and hot messages. M would have too many columns if we did not filter out the non-hot messages. Even when we only consider the hot messages, the number of columns is still very large. To slim down M, we propose three methods to group similar messages together in the next subsection.
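The two hotness standards and the construction of M can be sketched as follows. The thresholds (500 forwards, f > 500, f/5) come from the text; the data layout (`forward_log`, `message_stats`), the function names, and the choice to store M sparsely as one set of message ids per user are assumptions made for illustration.

```python
from collections import defaultdict

ABS_THRESHOLD = 500   # absolutely hot: forwarded more than this many times
MIN_FOLLOWERS = 500   # relatively hot applies only when f > 500
REL_DIVISOR = 5       # ... and the message was forwarded more than f / 5 times

def is_absolutely_hot(forwards):
    return forwards > ABS_THRESHOLD

def is_relatively_hot(forwards, followers):
    return followers > MIN_FOLLOWERS and forwards > followers / REL_DIVISOR

def build_matrix(forward_log, message_stats):
    """Build a sparse user-by-hot-message matrix.

    forward_log:   iterable of (user_id, message_id) pairs (hypothetical layout)
    message_stats: message_id -> (total_forwards, owner_followers)
    Returns (rows, hot), where rows maps user_id -> set of hot message ids.
    """
    hot = {
        msg for msg, (forwards, followers) in message_stats.items()
        if is_absolutely_hot(forwards) or is_relatively_hot(forwards, followers)
    }
    rows = defaultdict(set)
    for user, msg in forward_log:
        if msg in hot:          # non-hot messages never become columns
            rows[user].add(msg)
    return dict(rows), hot
```

Storing each row as a set keeps only the non-zero entries, which matters because most users forward only a tiny fraction of the hot messages.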
3.2 Group Similar Hot Messages Together
In most microblogging services, users can be divided into two categories: information producers and information consumers. Information producers mainly include news site accounts, self-media accounts, and profit-seeking accounts with legions of followers. Their main purpose is to make their microblogs spread as widely as possible, to expand their influence and gain new followers. Whenever there is news, producers promptly post relevant microblogs. Producers are very likely to post similar content, because the texts may be pasted from the same source. Information consumers are mainly normal weibo users. More than 90% of weibo users can be classified into this category. Their most important actions are reading and forwarding messages. Normally, hot messages are more likely to attract them.
If hot messages only contain a video link or a web link, it is easy to determine whether they are similar. If they contain text content, the task is more difficult. Next, we introduce three methods to solve it.
Simhash As described above, information producers are likely to post similar weibo messages. The most direct idea is to merge similar hot messages based on their word similarity. Simhash [2] is a widely used dimensionality reduction technique for calculating document similarity. It maps a high-dimensional document vector to a small k-bit fingerprint, where k is quite small, such as 64. An important characteristic of simhash is that similar documents have similar hash values. For instance, if two documents differ only in a single word, a cryptographic hash function will hash them into two completely different values, whereas simhash will hash them into similar fingerprints. This characteristic is very important in calculating document similarity.
In this method, we first calculate the simhash values of all the hot messages. After that, we group messages together if the Hamming distance of their simhash fingerprints is less than or equal to 3.
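A minimal sketch of a 64-bit simhash fingerprint and the Hamming-distance test is given below. It assumes the message has already been split into tokens, gives every token unit weight, and uses truncated MD5 as the per-token hash; none of these choices is specified in the paper, and the function names are illustrative.

```python
import hashlib

def simhash(tokens, bits=64):
    """Compute a simhash fingerprint (bits must be a multiple of 8, at most 128)."""
    weights = [0] * bits
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        # Each token votes +1 on the bits it sets and -1 on the others.
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

def is_near_duplicate(a, b, max_distance=3):
    # Grouping rule from the text: Hamming distance <= 3.
    return hamming(a, b) <= max_distance
```

Because each bit is decided by a majority vote over all tokens, changing one token in a long message flips only the bits where the vote was close, which is why near-identical messages tend to end up within a small Hamming distance.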
Paragraph Vector Simhash can only calculate documents’ similarity based on their word similarity. It cannot deal with the situation in which two documents have similar semantics but are written with different words. Le and Mikolov [8] propose “Paragraph Vector” (P2V), an unsupervised framework that learns continuous distributed vector representations for pieces of text. This method can be applied to variable-length paragraphs and transforms them into fixed-length vectors. In this model, every weibo message is mapped to a unique vector, represented by a column in a matrix, and every word is also mapped to a unique vector, represented by a column in another matrix. The paragraph vectors and word vectors are concatenated to predict the next word. They are trained using stochastic gradient descent, and the gradient is obtained via backpropagation. Details can be found in the original paper. After training, the distance between two paragraph vectors will be small if they talk about the same topic. The representation is not sensitive to the use of synonyms. These vectors can be used directly as features for conventional machine learning models, such as logistic regression or k-means.
We first calculate hot messages’ representative vectors using the “Paragraph Vector” method. The length of the vectors is set to 400, following the original paper. After that, we calculate their distances. A pair of hot messages is grouped together if their distance is smaller than a threshold.
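The pairwise grouping rule can be sketched with a union-find structure over precomputed vectors. The sketch assumes Euclidean distance and transitive merging (if A is close to B and B is close to C, all three share a group); the paper states the pairwise rule but not how chains of pairs are resolved, so the transitivity is an assumption. The naive double loop is quadratic and is meant only to show the logic.

```python
import math

def group_by_distance(vectors, threshold):
    """Group vector indices whose pairwise Euclidean distance is below threshold."""
    parent = list(range(len(vectors)))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if math.dist(vectors[i], vectors[j]) < threshold:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    # Keep the smaller index as the representative.
                    parent[max(root_i, root_j)] = min(root_i, root_j)

    groups = {}
    for i in range(len(vectors)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

The same routine applies to any fixed-length message representation, so it could also serve the topic vectors of the third method.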
User-Weibo Matrix Factorization The first method is based on messages’ word similarity and the second on their semantic similarity. Both are calculated directly from the weibo content. As described in section 3.1, we have generated the user-weibo relationship matrix M. We can therefore find more similar messages based on which users have forwarded them. Hofmann [5] introduced probabilistic latent semantic models (PLSA) for collaborative filtering. In this step, PLSA models users (u ∈ U) and documents (d ∈ D) as random variables, taking values from the space of all possible users and documents respectively. The relationship between them is learned by modeling the joint distribution of users and documents as a mixture distribution. The hidden variables t (t ∈ T, |T| = k) represent the topics between U and D. The model can be written in the form of a mixture model as the following equation:
$$P(u \mid d; \theta) = \sum_{t=1}^{k} p(u \mid t)\, p(t \mid d) \qquad (1)$$
Based on this model, we can transform the user-weibo matrix into two new matrices. The first is the user-topic matrix, which represents each user with a vector of k topics. The second is the document-topic matrix, which likewise represents each document with a vector of k topics. In the second matrix, if documents contain similar topics, their vectors are more likely to be similar. We group two hot messages together if the distance between their vectors is under a threshold. In this paper, we empirically set k to 400 and name this method UWMF.
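The paper does not state how the parameters of Equation (1) are fitted. A common choice for this kind of mixture model is expectation-maximization; the updates below are the standard ones for this formulation and are given only as an assumed sketch, with n(u, d) denoting the entry of M for user u and message d (1 if u forwarded d, 0 otherwise).

```latex
% E-step: posterior over topics for each observed (u, d) pair
p(t \mid u, d) = \frac{p(u \mid t)\, p(t \mid d)}{\sum_{t'=1}^{k} p(u \mid t')\, p(t' \mid d)}

% M-step: re-estimate both conditional distributions
p(u \mid t) = \frac{\sum_{d} n(u, d)\, p(t \mid u, d)}{\sum_{u'} \sum_{d} n(u', d)\, p(t \mid u', d)}
\qquad
p(t \mid d) = \frac{\sum_{u} n(u, d)\, p(t \mid u, d)}{\sum_{u} n(u, d)}
```

Under this reading, the vector (p(t | d)) for t = 1, ..., k is the row of the document-topic matrix for message d.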
3.3 Profession Prediction
After merging similar hot messages, users can be represented as more compact vectors.
Each element of these vectors represents a merged hot message, and the elements will
be used as features in our multi-class classifier.
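As a sketch of this representation, the function below turns each user's set of forwarded hot messages into a fixed-length vector with one element per merged hot event. The binary encoding (1 if the user forwarded any message of the event, else 0) and the names `rows` and `msg_to_event` are assumptions; the paper does not say whether elements are indicators or counts.

```python
def user_vectors(rows, msg_to_event, n_events):
    """Map each user to a binary vector over merged hot events.

    rows:         user_id -> set of hot message ids the user forwarded
    msg_to_event: message_id -> index of the merged event (0 .. n_events - 1)
    """
    vectors = {}
    for user, messages in rows.items():
        vector = [0] * n_events
        for msg in messages:
            # Several messages of the same event collapse into one feature.
            vector[msg_to_event[msg]] = 1
        vectors[user] = vector
    return vectors
```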
Over the last several decades, many kinds of discriminative classifiers have been created. In our experiment, we compare Logistic Regression (LR) and Gradient Boosted Decision Trees (GBDT). We choose GBDT as our default multi-class classifier, because we find that GBDT performs better in most instances. Hence, in the following part we only show the results obtained with GBDT [3].
4 Experiment Results
In this section, we first study our dataset statistically. After that, we identify the hot weibo messages and merge the similar ones. Finally, we compare our methods with the baseline method comprehensively.
4.1 Observation
We first measure influential users’ forwarding rates across professions. As Figure 2(a) shows, different professions have different forwarding rates on average. It is a little surprising that the “estate” and “government” accounts forwarded more messages than the “finance” accounts. Overall, the difference between professions is not significant. In our dataset, about 58% of weibo messages are forwarded messages. For about 66% of users, more than half of their messages are forwarded messages. Figure 2(b) shows the distribution of how many messages users forwarded (among their latest 500 messages) in our dataset. We find that about 95% of users forwarded more than 50 messages. In this paper, our goal is to predict users’ professions based only on their forwarding behaviors, so in our experiment we discard the other 5% of users, who forwarded no more than 50 messages.
As described in section 3.1, we define the absolutely hot message and the relatively hot message separately. To better understand these two types, we calculate how many times users’ latest 500 weibo messages have been forwarded on average, by category. As Figure 2(c) shows, these numbers are very unbalanced across categories. The “entertainment” and “literature” accounts attract many more forwards than “estate” accounts. The main reason is that the “entertainment” and “literature” accounts have relatively more followers. If we only adopt absolutely hot messages (for example, with the threshold set to 500), it is possible that we cannot get any hot messages posted by “estate” accounts. So identifying relatively hot messages is necessary in our model.
Fig. 2. Data observation. Professions are abbreviated Me, En, Es, Fi, Go, IT, Sp, Fa, Ed, Li, Ga. (a) Users forward how many messages in their latest 500 messages (x-axis: Professions; y-axis: Number of forwards). (b) The distribution of users’ forwarding behavior (x-axis: Forward how many messages; y-axis: Number of users). (c) How many times users’ latest 500 messages were forwarded (x-axis: Professions; y-axis: How many times, ×10^4). (d) Length distribution of forwarded messages (x-axis: Length of forwarded Weibo; y-axis: Count).
Weibo limits message length to 140 Chinese characters or 280 English characters. Figure 2(d) shows the length distribution of hot messages in our dataset. There are two peaks. The first peak represents the hot messages that contain only 10-20 characters. These messages are likely to be posted by star users who have millions of fans. This kind of message usually also contains a picture or a video link. The second peak represents the messages that contain 140 Chinese characters. This kind of message generally contains rich semantics.
4.2 Identify Hot Messages
As described in section 3.1, if a message has been forwarded more than a certain number of times, it is considered an absolutely hot message. Setting the threshold is a double-edged sword. If we set the threshold to a smaller value (more hot messages), on one hand, users can be represented with more messages and our model’s expressive power increases; on the other hand, our model must handle more features and runs the risk of over-fitting. As Table 2 shows, we set the threshold to 500, 2,000, and 10,000 in turn. When the threshold is set to 500, we get 731,153 hot weibo messages. This number is too large, and most of these messages have been forwarded by no more than 5 users in our dataset (40 thousand users). We therefore filter out such messages from our hot message sets, leaving 100,219 valid messages. In the prediction tasks, we compare the performance of these three thresholds and choose 500 as the default value.
Table 2. Number of absolutely hot messages
No. Threshold # before filter # after filter
1 500 731,150 100,219
2 2000 426,019 82,339
3 10000 74,308 32,955
As section 3.1 shows, if a message’s owner has f followers (f > 500) and this message has been forwarded more than f/5 times, we regard this weibo message as a relatively hot message. As with the absolutely hot messages, we also filter out the messages that have been forwarded by no more than 5 users in our dataset, and get 61,806 relatively hot messages.
Eventually, we collect 162,025 hot messages in total (100,219 absolutely hot and 61,806 relatively hot).
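The in-dataset filter applied to both kinds of hot messages can be sketched as a column filter on the sparse user-message matrix. The rule (drop messages forwarded by no more than 5 users in the dataset) is from the text; the function name and the `rows` layout (user id mapped to a set of hot message ids) are assumptions.

```python
from collections import Counter

def filter_rare_messages(rows, min_users=6):
    """Keep only messages forwarded by at least min_users users in the dataset.

    The default of 6 mirrors the rule of discarding messages
    forwarded by no more than 5 dataset users.
    """
    counts = Counter(msg for messages in rows.values() for msg in messages)
    keep = {msg for msg, count in counts.items() if count >= min_users}
    filtered = {user: messages & keep for user, messages in rows.items()}
    return filtered, keep
```

Note that this count is over users in the collected dataset, which is different from the platform-wide forward count used by the hotness thresholds.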
4.3 Group Similar Hot Messages Together
In this part, we evaluate the performance of our three methods on clustering similar hot messages. As Table 3 shows: (1) In the simhash method, we choose 64 as the default length of the hash value and group messages together if their Hamming distance is less than or equal to 3. This merges the 162,025 hot messages identified in section 4.2 into 57,624 hot events. (2) In the second method, we choose 400 as the default size of the paragraph vector and merge similar messages according to their Euclidean distances. This merges the 162,025 hot messages into 32,118 hot events. (3) In the third method, we also choose 400 as the number of hidden variables and adopt Euclidean distance to measure similarity. This merges the 162,025 hot messages into 27,129 hot events. In our experiment, the lengths of these three vectors (64, 400, 400) are chosen empirically [8, 10]. We tune the other hyper-parameters (where to stop merging) on the validation set to find the best stopping points.
In practice, we serially combine all three methods. First, we adopt simhash to find similar hot messages, making users’ representative vectors more compact. On the basis of these results, we adopt the second method, further compressing users’ vectors. Finally, we perform the third method on the current results. After these three steps, our 162,025 hot messages are clustered into 17,196 hot events. Next, we study whether these optimizations can improve our profession identification task.
Table 3. Number of messages under different merging strategies
No.  Merging Strategy    # before   # after
1    Simhash             162,025    57,624
2    P2V                 162,025    32,118
3    UWMF                162,025    27,129
4    Simhash+P2V+UWMF    162,025    17,196
4.4 Results of Prediction
We randomly divide our 40 thousand labeled users into a training set (60%), a validation set (20%), and a test set (20%). We regard users’ labeled professions as the gold standard, and select accuracy and macro-averaged precision, recall, and F-measure as evaluation metrics.
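The metrics can be sketched in plain Python as below. One detail is inferred rather than stated: the F1 values reported in Table 4 agree with the harmonic mean of macro precision and macro recall (for the baseline, 2 × 64.03 × 60.29 / (64.03 + 60.29) ≈ 62.10), so the sketch computes F1 that way instead of averaging per-class F1 scores. The function name is illustrative.

```python
def macro_scores(y_true, y_pred):
    """Return (accuracy, macro precision, macro recall, F1)."""
    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    precisions, recalls = [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)

    precision = sum(precisions) / len(labels)
    recall = sum(recalls) / len(labels)
    # Assumption: F1 is the harmonic mean of the macro-averaged P and R.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1
```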
To verify the validity of our method, we build a baseline model. The feature candidates of the baseline model include: (1) words in users’ original messages; (2) words in users’ forwarded messages; (3) mentioned user ids in messages; (4) URLs in messages; (5) hashtags in messages. There are hundreds of thousands of feature candidates, so we have to perform feature selection to downsize our feature sets. Following established practice in feature selection for text classification, we use the χ2 statistic to select representative features. We evaluate performance with different numbers of features and select 9,200 of the feature candidates. We compare LR and GBDT on these features and find that they have similar performance. To be consistent with our model, we also choose GBDT as the default baseline classifier.
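A sketch of χ2 feature selection for the baseline follows. It uses the usual 2×2 contingency form of the statistic for a binary feature and a single class, and scores each feature by its maximum over classes; the multi-class aggregation and the function names are assumptions, since the paper only names the statistic.

```python
from collections import Counter, defaultdict

def chi2(a, b, c, d):
    """Chi-square for a 2x2 table: a = feature and class, b = feature and not class,
    c = no feature and class, d = no feature and not class."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0

def select_features(docs, labels, top_k):
    """docs: list of feature sets (one per user); labels: parallel list of classes."""
    n = len(docs)
    class_count = Counter(labels)
    feature_count = Counter()
    feature_class = defaultdict(Counter)
    for features, label in zip(docs, labels):
        for feature in features:
            feature_count[feature] += 1
            feature_class[feature][label] += 1

    scores = {}
    for feature, n_feature in feature_count.items():
        best = 0.0
        for label, n_label in class_count.items():
            a = feature_class[feature][label]
            b = n_feature - a
            c = n_label - a
            d = n - n_feature - c
            best = max(best, chi2(a, b, c, d))  # assumed aggregation: max over classes
        scores[feature] = best
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda f: (-scores[f], f))[:top_k]
```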
Table 4. Evaluation results for various features and combinations. (%)
No.  Method              Accuracy          Precision  Recall  F1
1    Baseline            62.38             64.03      60.29   62.10
2    Simhash             69.24 ↑ 6.86%     70.88      67.61   69.21 ↑ 7.11%
3    Simhash+P2V         73.79 ↑ 11.41%    73.90      71.28   72.57 ↑ 10.47%
4    Simhash+P2V+UWMF    73.98 ↑ 11.60%    74.81      72.95   73.87 ↑ 11.77%
Table 4 reports the evaluation results. The baseline model achieves an accuracy of 62.38%, and all three of our models get better results. This comparison shows that users’ forwarding behavior is effective for profession identification. As Table 4 shows, the prediction improves gradually as the three merging strategies are applied in turn. Our model in the fourth line, which serially adopts all three merging strategies, achieves the best result (accuracy = 73.98, F1 = 73.87). This result indicates that effective clustering of similar messages is necessary, because there are too many forwarded messages.
To better understand the prediction errors, we present the details of the best result.
In Table 5, the value of ith row and jth column represents the ratio of the users in
profession i being identified as profession j.
Table 5. Distribution of identified professions in each profession.
Me En Es Fi Go IT Sp Fa Ed Li Ga
Me 76.7 5.6 2.7 3.5 4.1 3.3 2.2 0.9 0.6 0.2 0.2
En 7.2 74.5 0.2 3.3 0.7 1.4 4.4 5.1 0.2 1.3 1.7
Es 7.4 2.0 72.9 8.5 5.3 2.2 0.4 0.9 0.1 0.0 0.3
Fi 8.4 0.1 6.4 70.2 5.3 6.2 0.2 1.3 1.7 0.1 0.1
Go 4.9 2.2 0.4 4.2 78.2 2.9 4.1 0.4 2.5 0.2 0.0
IT 6.1 0.7 3.9 4.3 1.3 76.3 0.2 0.1 2.6 0.7 3.8
Sp 5.1 2.9 0.0 0.3 0.3 1.0 86.2 2.2 0.7 0.0 1.3
Fa 9.7 14.9 1.0 6.2 0.2 0.0 3.3 61.5 0.9 1.2 1.1
Ed 5.2 3.9 3.3 4.6 2.0 3.2 0.7 1.8 68.4 4.2 2.7
Li 13.7 7.2 0.7 1.3 0.6 1.4 0.4 3.3 9.8 60.9 0.7
Ga 5.2 3.9 0.8 0.0 0.4 7.1 1.2 4.3 0.1 0.3 76.7
To make the data more intuitive, we illustrate the ratio in each entry using different shades of color. We can observe that: (1) Our model performs differently on different professions. The recall scores (values on the diagonal) of most professions are above 70%, with only “fashion” and “literature” below 65%. The main reason is that the forwarding behavior of these two professions has no special characteristics. (2) The “media” accounts make up about a quarter of our user collection. Our model tends to predict uncertain users as “media” accounts, making the precision score of “media” relatively low (51.3%). (3) The behaviors of some professions are quite similar. For example, “entertainment” users and “fashion” users have similar interests and usually follow and interact with each other. This makes the boundary between these two professions unclear for identification.
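A row-normalized matrix of the kind shown in Table 5 can be computed as sketched below: entry (i, j) is the percentage of users whose true profession is i and whose predicted profession is j, so each row sums to 100 and the diagonal holds per-profession recall. The function name and argument layout are illustrative assumptions.

```python
def confusion_ratios(y_true, y_pred, labels):
    """Return a matrix of percentages: rows are true labels, columns are predictions."""
    index = {label: i for i, label in enumerate(labels)}
    counts = [[0] * len(labels) for _ in labels]
    for truth, prediction in zip(y_true, y_pred):
        counts[index[truth]][index[prediction]] += 1

    ratios = []
    for row in counts:
        total = sum(row)
        # A profession with no test users yields an all-zero row.
        ratios.append([100.0 * c / total if total else 0.0 for c in row])
    return ratios
```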
5 Related work
Users’ attributes can be inferred from user-generated text data and social network structure. [6] showed that users’ age and gender can be predicted from their webpage browsing logs. [9] showed that users’ profiles can be predicted from their mobile phone apps. [13] analyzed tens of thousands of blogs and found significant differences in writing style and word usage between different gender and age groups. [1, 11] predicted users’ gender and age based on the linguistic characteristics of their tweets. [15] identified weibo users’ profiles only via the videos they talk about. [12] identified users’ political orientation and ethnicity by leveraging their network structure and linguistic characteristics. [4, 17] predicted users’ profiles based on their social network structure and check-ins. Recently, there has been some research on identifying users’ professions. [14] presented an efficient framework for profession identification in Weibo, which identified users’ professions based on both personal information and network structure. [7, 16] showed that computers’ judgments of people’s personalities based on their Facebook Likes are more accurate than judgments made by their close acquaintances.
6 Conclusion
In this paper, we present PIFB, an efficient framework that predicts influential users’ professions by examining only which microblogs they have forwarded. In the first step, we identify the hot weibo messages from a large number of candidate messages, and represent users with the hot messages they forwarded. After that, we group hot messages together if they talk about similar topics. This step makes users’ representative vectors more compact. Finally, we design a multi-class classifier to predict their professions. The experiments on a real-world dataset demonstrate the effectiveness of PIFB. Our method performs significantly better than the traditional “bag of words” based method.
Acknowledgments
The authors would like to thank the anonymous reviewers for their comments. This
work was supported by the National Natural Science Foundation of China under Grant
No.61572044. The contact author is Zhen Xiao.
References
1. Burger, J.D., Henderson, J., Kim, G., Zarrella, G.: Discriminating gender on twitter. In: Pro-
ceedings of the EMNLP. pp. 1301–1309 (2011)
2. Charikar, M.S.: Similarity estimation techniques from rounding algorithms. In: Proceedings of the thirty-fourth annual ACM symposium on Theory of computing. pp. 380–388. ACM (2002)
3. Chen, T., Guestrin, C.: Xgboost: A scalable tree boosting system. In: Proceedings of SIGKDD. pp. 785–794. ACM (2016)
4. Culotta, A., Kumar, N.R., Cutler, J.: Predicting the demographics of twitter users from web-
site traffic data. In: Proceedings of AAAI. pp. 72–78 (2015)
5. Hofmann, T.: Latent semantic models for collaborative filtering. ACM Transactions on In-
formation Systems (TOIS) 22(1), 89–115 (2004)
6. Hu, J., Zeng, H.J., Li, H., Niu, C., Chen, Z.: Demographic prediction based on user’s brows-
ing behavior. In: Proceedings of WWW. pp. 151–160. ACM (2007)
7. Kosinski, M., Stillwell, D., Graepel, T.: Private traits and attributes are predictable from dig-
ital records of human behavior. Proceedings of the National Academy of Sciences 110(15),
5802–5805 (2013)
8. Le, Q.V., Mikolov, T.: Distributed representations of sentences and documents. In: Proceed-
ings of ICML. pp. 1188–1196 (2014)
9. Malmi, E., Weber, I.: You are what apps you use: Demographic prediction based on user’s
apps. arXiv preprint arXiv:1603.00059 (2016)
10. Manku, G.S., Jain, A., Das Sarma, A.: Detecting near-duplicates for web crawling. In: Pro-
ceedings of WWW. pp. 141–150. ACM (2007)
11. Nguyen, D., Gravel, R., Trieschnigg, D., Meder, T.: “How old do you think I am?”: A study of language and age in Twitter. In: Proceedings of ICWSM. AAAI Press (2013)
12. Pennacchiotti, M., Popescu, A.M.: A machine learning approach to twitter user classification.
In: Proceedings of ICWSM. pp. 281–288 (2011)
13. Schler, J., Koppel, M., Argamon, S., Pennebaker, J.W.: Effects of age and gender on blog-
ging. In: AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs.
vol. 6, pp. 199–205 (2006)
14. Tu, C., Liu, Z., Sun, M.: Social Media Processing: 4th National Conference, SMP 2015,
Guangzhou, China, November 16-17, 2015, Proceedings, chap. PRISM: Profession Identi-
fication in Social Media with Personal Information and Community Structure, pp. 15–27.
Springer Singapore, Singapore (2015)
15. Wang, Y., Xiao, Y., Ma, C., Xiao, Z.: Improving users’ demographic prediction via the videos
they talk about. In: Proceedings of EMNLP (2016)
16. Wu, Y., Kosinski, M., Stillwell, D.: Computer-based personality judgments are more accu-
rate than those made by humans. Proceedings of the National Academy of Sciences 112(4),
1036–1040 (2015)
17. Zhong, Y., Yuan, N.J., Zhong, W., Zhang, F., Xie, X.: You are where you go: Inferring demo-
graphic attributes from location check-ins. In: Proceedings of WSDM. pp. 295–304. ACM
(2015)