<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Identifying Influential Users&apos; Professions via the Microblogs They Forward</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Yuan</forename><surname>Wang</surname></persName>
							<email>wangyuan@net.pku.edu.cn</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">Peking University</orgName>
								<address>
									<postCode>100871</postCode>
									<settlement>Beijing</settlement>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Hangyu</forename><surname>Mao</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">Peking University</orgName>
								<address>
									<postCode>100871</postCode>
									<settlement>Beijing</settlement>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Zhen</forename><surname>Xiao</surname></persName>
							<email>xiaozhen@net.pku.edu.cn</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">Peking University</orgName>
								<address>
									<postCode>100871</postCode>
									<settlement>Beijing</settlement>
									<country key="CN">China</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Identifying Influential Users&apos; Professions via the Microblogs They Forward</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">B1B96CDAB0AE8D6E4F370401C3F0E4A9</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-23T23:53+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>For most social media sites, how to find out (influential) users' professions is an important task. Much work has been conducted to explore this task through mining user-generated textual content or analyzing the social network structure. In this paper, we innovatively solve this task by only examining which microblog messages an influential user has forwarded. First, we define hot microblog messages under two standards and identify them from a large number of candidate messages. Each of the identified messages points to a specific hot event. Next, we group similar hot messages together based on their word similarity, semantic similarity, and forwarders' similarity. Last, we represent users with the hot messages they forwarded and design an identification method to identify their professions. Moreover, we collect a real-world dataset to conduct experiments and prove that our method performs significantly better than the traditional method.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Online microblogging services have become an integral part of daily life for most netizens. These services want to know more about their users' profiles, since user profiles play an important role in commercial services such as personalized recommendation and online advertising. However, user profiles are usually not easy to obtain, because users are reluctant to expose them to the public. Some work has been conducted to address this problem. A traditional practice is to cut users' messages into bags of words and train a classifier. This practice can achieve acceptable results on simple tasks such as predicting gender and age <ref type="bibr" target="#b0">[1]</ref>, but it cannot solve more complex tasks <ref type="bibr" target="#b13">[14]</ref>.</p><p>Profession, which is founded upon specialized educational training, is a critical social profile of influential users. In Weibo, the largest microblogging service in China, influential users are mainly organized by their professions. They are more likely to follow other users who have the same profession. It is important for microblogging services to correctly identify influential users and their professions.</p><p>Message forwarding (e.g. retweeting on Twitter.com and reposting on Weibo.com) is one of the most popular functions in existing microblogging services. In Weibo, users can forward messages or any interesting content on the web, such as real blogs, photos and external links. In this paper, if a weibo message was forwarded by any user, we define it as a forwarded message; otherwise we define it as a non-forwarded message. Based on a large dataset, we find that about 60% of weibo messages are forwarded messages. For most users, the messages they forward are exactly what they are interested in. Users' professions can therefore be reflected, to some extent, by the messages they forward. However, the traditional "bag of words" model discards the information contained in users' forwarding behaviors. Naturally, in this paper, we ask and try to answer the following question: can we represent microblog users with the messages they forwarded, and predict their professions more accurately than the traditional method?</p><p>The task poses several challenges that make it non-trivial. The first challenge is that there are too many forwarded messages. If we consider each forwarded message as a feature, the feature vector will be very large and sparse. We observe that most of these messages have been forwarded by no more than 3 weibo users. In this paper, we define them as non-hot forwarded messages and define the messages that are forwarded by more users as hot forwarded messages. In our experiment, we discard the non-hot messages. Another challenge is that even after filtering out non-hot messages, the number of remaining hot messages is still quite large. We observe that every hot message points to a hot event (e.g. a piece of breaking news or a recently released movie), so we need methods to group similar hot weibo messages together. In this paper, we propose an efficient framework of Profession Identification by using Forwarding Behaviors (PIFB). As Figure <ref type="figure" target="#fig_0">1</ref> shows, we first identify the hot forwarded messages from a large number of candidates. Each of these identified messages points to a specific hot event. Next, we introduce three methods to group similar messages together, downsizing our message sets. Then, influential users can be represented with the merged hot messages that they have forwarded. Finally, we predict users' professions, and the results are more accurate than those of the traditional method.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Dataset and Professions</head><p>We collect 41,531 manually annotated influential users from Weibo (http://weibo.com). To avoid robot users, we only collect verified users. Weibo conducts manual verification to make sure that verified users provide real and authentic information. These users belong to 11 representative professions. As Table <ref type="table" target="#tab_0">1</ref> shows, the professions include "media", "entertainment", "sports", and "IT", etc.</p><p>We also collect each user's latest 500 weibo messages. These messages can be classified into two categories: forward actions and post actions. In general, a forward action consists of a trace and a content. The trace records the chain of users through which the message reached the current user. The content can take any form that users can share with their followers, such as videos and blogs. A simple example is a user forwarding the following message:</p><p>RT @Raj RT @Sheldon: It took 50 years ...</p><p>Here "RT @Raj RT @Sheldon" is the trace and "It took 50 years ..." is the content. This forward action indicates that "It took 50 years ..." was originally posted by "Sheldon", was forwarded by "Raj", and is now forwarded again by the current user.</p><p>In general, a post action only contains the "content" part, representing that the current user posted an original message.</p></div>
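<div xmlns="http://www.tei-c.org/ns/1.0"><p>The split of a forward action into trace and content can be sketched as follows. This is a minimal illustration, assuming the plain-text layout of the example in this section (each hop written as "RT @name", followed by the content); the regular expression and the function names split_forward and is_forward are hypothetical, and real Weibo data may use a different delimiter.</p><p><code lang="python"><![CDATA[
```python
import re

# Assumed layout, following the example in the text:
#   "RT @Raj RT @Sheldon: It took 50 years ..."
# The pattern is hypothetical; real Weibo exports may differ.
_HOP = re.compile(r"RT\s+@(\w+)\s*")


def split_forward(message):
    """Return (trace, content).

    trace lists user names from the most recent forwarder
    to the original poster; it is empty for a post action.
    """
    trace = []
    pos = 0
    while True:
        match = _HOP.match(message, pos)
        if match is None:
            break
        trace.append(match.group(1))
        pos = match.end()
    # Drop the separator between the trace and the content.
    content = message[pos:].lstrip(": ").strip()
    return trace, content


def is_forward(message):
    """A forward action has at least one hop in its trace."""
    return bool(split_forward(message)[0])
```
]]></code></p><p>With this sketch, the example message yields the trace Raj, Sheldon and the content "It took 50 years ...", while a message without any "RT @name" prefix is treated as a post action.</p></div>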
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">The Framework of PIFB</head><p>In this section, we formalize our problem as a classification task and introduce the main steps of PIFB.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Hot Message Identification</head><p>This paper focuses on influential users' forwarding behaviors. A critical step is to identify the hot forwarded messages. In this part, we define hot messages under two standards.</p><p>Absolutely Hot Message. We argue that the more users have forwarded a message, the more information it carries, and the more its forwarding behaviors can help our profession prediction. Nowadays, Weibo has become the biggest "News Site" in China. Most traditional news organizations have opened official accounts on Weibo, and these accounts are all very active. They usually publish breaking news promptly and make the news spread quickly. There are also many Chinese celebrities on Weibo, including actors, singers and entrepreneurs. They post their personal views or daily lives in their accounts. They generally have a great number of followers and their daily updates are likely to get thousands of forwards. So, in this paper, if a weibo message has been forwarded more than a certain number of times (for example, 500), it is regarded as the first kind of "hot forwarded message" (absolutely hot).</p><p>Relatively Hot Message. The 11 professions, shown in Table <ref type="table" target="#tab_0">1</ref>, are not "evenly matched" in attracting attention. Nearly all the highly forwarded messages are posted by "entertainment" and "sports" stars. For an "estate" account, it is not easy to post an absolutely hot message, because "estate" accounts usually have relatively fewer followers and a lower forwarding rate. If we only adopt the absolutely hot messages described in the previous paragraph, it is very possible that we only get the messages posted by a small subset of the 11 categories (maybe 2-4). Therefore, as a supplement to the first standard, we define another kind of hot message. In our dataset, if a message's owner has f followers (f &gt; 500) and this message has been forwarded more than f/5 times, it is regarded as the second kind of "hot forwarded message" (relatively hot).</p><p>After identifying these two types of "hot messages", we can build a matrix M, whose columns denote hot messages and whose rows denote users. This matrix represents all the forwarding relationships between weibo users and hot messages. M would have too many columns if we did not filter out the non-hot messages. Even when we only consider the hot messages, the number of columns is still very large. To slim down M, we propose three methods to group similar messages together in the next subsection.</p></div>
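<div xmlns="http://www.tei-c.org/ns/1.0"><p>The two standards and the construction of M can be sketched as follows. This is an illustration only: the thresholds are the example values given in this section, the Message record and the names hot_kind and build_matrix are hypothetical, M is stored sparsely as a mapping from users to the hot messages they forwarded, and labeling a message that satisfies both standards as absolutely hot is an assumption of the sketch.</p><p><code lang="python"><![CDATA[
```python
from dataclasses import dataclass

ABSOLUTE_THRESHOLD = 500  # example value from the text
MIN_FOLLOWERS = 500       # relative standard requires f > 500
RELATIVE_DIVISOR = 5      # forwarded more than f / 5 times


@dataclass
class Message:
    message_id: str
    owner_followers: int  # f, followers of the message's owner
    forward_count: int    # total number of forwards


def hot_kind(msg):
    """Return 'absolute', 'relative', or None for a non-hot message."""
    if msg.forward_count > ABSOLUTE_THRESHOLD:
        return "absolute"
    # forward_count > f / 5, written without division
    if (msg.owner_followers > MIN_FOLLOWERS
            and msg.forward_count * RELATIVE_DIVISOR > msg.owner_followers):
        return "relative"
    return None


def build_matrix(forwards, hot_ids):
    """Sparse M: user -> set of hot message ids that user forwarded.

    forwards is an iterable of (user, message_id) pairs.
    """
    matrix = {}
    for user, message_id in forwards:
        if message_id in hot_ids:
            matrix.setdefault(user, set()).add(message_id)
    return matrix
```
]]></code></p><p>For example, a message with 250 forwards whose owner has 1,000 followers is not absolutely hot, but it is relatively hot because 250 exceeds 1,000/5 = 200.</p></div>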
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Group Similar Hot Messages Together</head><p>In most microblogging services, users can be divided into two categories: information producers and information consumers. Information producers mainly include news site accounts, self-media accounts, and profit-seeking accounts with legions of followers. Their main purpose is to broadcast their microblogs as widely as possible to expand their influence and gain new followers. Whenever there is news, producers promptly post relevant microblogs. Producers are very likely to post similar content, because the texts may be pasted from the same source. Information consumers are mainly normal weibo users. More than 90% of weibo users can be classified into this category. Their most important actions are reading and forwarding messages. Normally, hot messages are more likely to attract them.</p><p>If the hot messages only contain a video link or a web link, it is easy to determine whether they are similar. But if they contain text content, the task is more difficult. Below, we introduce three methods to solve it.</p><p>Simhash. As described above, information producers are likely to post similar weibo messages. The most direct idea is to merge similar hot messages based on their word similarity. Simhash <ref type="bibr" target="#b1">[2]</ref> is a widely used dimensionality reduction technique for calculating document similarity. It maps high-dimensional document vectors to small fingerprints: with simhash, we can transform such a high-dimensional vector into a k-bit fingerprint where k is quite small, such as 64. An important characteristic of simhash is that similar documents have similar hash values. For instance, if two documents differ only in a single word, a cryptographic hash function will hash them into two completely different values, whereas simhash will hash them into similar fingerprints. This characteristic is very important in calculating document similarity.</p><p>In this method, we first calculate the simhash values of all the hot messages. After that, we group similar messages together if the Hamming distance of their simhash fingerprints is less than or equal to 3.</p></div>
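<div xmlns="http://www.tei-c.org/ns/1.0"><p>A simhash fingerprint and the Hamming-distance grouping can be sketched as follows. The sketch makes several assumptions that the text does not specify: messages are already tokenized (Chinese text would need a word segmenter), every token occurrence has weight 1, BLAKE2b is used as the per-token hash, and grouping is greedy against the first member of each group. The function names are hypothetical.</p><p><code lang="python"><![CDATA[
```python
import hashlib


def simhash(tokens, bits=64):
    """Fingerprint a token list: each token votes +1/-1 on every bit."""
    votes = [0] * bits
    for token in tokens:
        digest = hashlib.blake2b(
            token.encode("utf-8"), digest_size=bits // 8
        ).digest()
        value = int.from_bytes(digest, "big")
        for i in range(bits):
            votes[i] += 1 if (value >> i) & 1 else -1
    fingerprint = 0
    for i, vote in enumerate(votes):
        if vote > 0:  # majority of tokens set this bit
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def group_by_simhash(messages, max_distance=3):
    """Greedy grouping against each group's first fingerprint.

    messages is a list of token lists; returns lists of indices.
    """
    groups = []  # (representative fingerprint, member indices)
    for index, tokens in enumerate(messages):
        fp = simhash(tokens)
        for rep, members in groups:
            if hamming(rep, fp) <= max_distance:
                members.append(index)
                break
        else:
            groups.append((fp, [index]))
    return [members for _, members in groups]
```
]]></code></p><p>The greedy loop compares each message with every existing group, which is only practical for small inputs; at the scale of the dataset described here, an index over fingerprint blocks would likely be needed.</p></div>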
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Paragraph Vector</head><p>Simhash can only calculate documents' similarity based on their word similarity. It cannot deal with the situation in which two documents have similar semantics but are written with different words. <ref type="bibr" target="#b7">[8]</ref> proposes "Paragraph Vector" (P2V), an unsupervised framework that learns continuous distributed vector representations for pieces of text. This method can be applied to variable-length paragraphs and transforms them into fixed-length vectors. In this model, every weibo message is mapped to a unique vector, represented by a column in a matrix, and every word is also mapped to a unique vector, represented by a column in another matrix. The paragraph vectors and word vectors are concatenated to predict the next word. They are trained using stochastic gradient descent, with the gradient obtained via backpropagation. Details can be found in the original paper. After training, the distance between two paragraph vectors will be small if they talk about the same topic, even when they use different synonyms. These vectors can be used directly as features in conventional machine learning models, such as logistic regression or k-means.</p><p>We first calculate the hot messages' representative vectors by using the "Paragraph Vector" method. The vector length is set to 400, following the original paper. After that, we calculate their pairwise distances. A pair of hot messages is grouped together if their distance is smaller than a threshold.</p></div>
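<div xmlns="http://www.tei-c.org/ns/1.0"><p>Grouping by a distance threshold can be sketched as follows. The text only states that a pair is grouped when its distance is below a threshold; treating groups as transitive (single linkage, implemented with union-find) and comparing all pairs are assumptions of this sketch, and the vectors are taken as already computed. The function names are hypothetical.</p><p><code lang="python"><![CDATA[
```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def group_by_distance(vectors, threshold):
    """Single-linkage grouping: merge every pair closer than threshold.

    Returns a sorted list of groups, each a list of vector indices.
    """
    parent = list(range(len(vectors)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if euclidean(vectors[i], vectors[j]) < threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(vectors)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```
]]></code></p><p>The same routine applies to any fixed-length message representation, so it could also be reused for topic vectors. The all-pairs loop is quadratic in the number of messages and is meant only to make the grouping rule concrete.</p></div>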
<div xmlns="http://www.tei-c.org/ns/1.0"><head>User-Weibo Matrix Factorization</head><p>The first method is based on messages' word similarity and the second on their semantic similarity. Both are calculated directly from the weibo contents. As described in section 3.1, we have generated the user-weibo relationship matrix M, so we can find further similar messages based on which users have forwarded them. Hofmann <ref type="bibr" target="#b4">[5]</ref> introduced PLSA, which developed probabilistic latent semantic models for performing collaborative filtering. In this step, PLSA models users (u ∈ U) and documents (d ∈ D) as random variables, taking values from the space of all possible users and documents respectively. The relationship between them is learned by modeling the joint distribution of users and documents as a mixture distribution. The hidden variables t (t ∈ T, |T| = k) represent the topics between U and D. The model can be written as a mixture model:</p><formula xml:id="formula_0">P(u \mid d; \theta) = \sum_{t=1}^{k} p(u \mid t)\, p(t \mid d)<label>(1)</label></formula><p>Based on this model, we can transform the user-weibo matrix into two new matrices. The first is the user-topic matrix, which represents each user with a vector of k topics. The second is the document-topic matrix, which likewise represents each document with a vector of k topics. In the second matrix, documents that contain similar topics are more likely to have similar vectors. We group two similar hot messages together if the distance between their vectors is under a threshold. In this paper, we empirically set k to 400 and name this method UWMF.</p></div>
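<div xmlns="http://www.tei-c.org/ns/1.0"><p>Fitting the mixture model of Equation (1) can be sketched with plain expectation-maximization. This is an illustrative sketch rather than the implementation used in the paper: it treats every observed (user, message) forward as one count, uses random initialization with a fixed seed, adds a tiny smoothing constant to avoid zero rows, and omits refinements such as tempered EM. The function name plsa and its parameters are hypothetical.</p><p><code lang="python"><![CDATA[
```python
import random


def plsa(pairs, n_users, n_docs, k, iterations=50, seed=0):
    """Fit P(u|d) = sum_t p(u|t) p(t|d) by EM.

    pairs: observed (user index, document index) forwards.
    Returns (p_u_t, p_t_d) with p_u_t[t][u] = p(u|t)
    and p_t_d[d][t] = p(t|d).
    """
    rng = random.Random(seed)

    def normalize(row):
        total = sum(row)
        return [x / total for x in row]

    p_u_t = [normalize([rng.random() + 1e-3 for _ in range(n_users)])
             for _ in range(k)]
    p_t_d = [normalize([rng.random() + 1e-3 for _ in range(k)])
             for _ in range(n_docs)]

    for _ in range(iterations):
        # Small constant keeps every row strictly positive.
        new_u_t = [[1e-12] * n_users for _ in range(k)]
        new_t_d = [[1e-12] * k for _ in range(n_docs)]
        for u, d in pairs:
            # E-step: posterior over topics for this forward.
            post = normalize(
                [p_u_t[t][u] * p_t_d[d][t] for t in range(k)]
            )
            # M-step: accumulate expected counts.
            for t in range(k):
                new_u_t[t][u] += post[t]
                new_t_d[d][t] += post[t]
        p_u_t = [normalize(row) for row in new_u_t]
        p_t_d = [normalize(row) for row in new_t_d]
    return p_u_t, p_t_d
```
]]></code></p><p>Each row of the returned p_t_d is the k-dimensional topic vector of one message, which is the representation compared under a distance threshold in UWMF.</p></div>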
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Profession Prediction</head><p>After merging similar hot messages, users can be represented as more compact vectors. Each element of these vectors represents a merged hot message, and the elements are used as features in our multi-class classifier.</p><p>Over the last several decades, many kinds of discriminative classifiers have been developed. In our experiment, we compare Logistic Regression (LR) and Gradient Boosted Decision Tree (GBDT). We choose GBDT as our default multi-class classifier, because we find that GBDT performs better in most instances. Hence, in the following part we only show the results obtained with GBDT <ref type="bibr" target="#b2">[3]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Experiment Results</head><p>In this section, we first study our dataset statistically. After that, we identify the hot weibo messages and merge the similar ones. Finally, we compare our methods with the baseline method comprehensively.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Observation</head><p>We first compute influential users' forwarding rates for different professions. As Figure <ref type="figure" target="#fig_1">2</ref>(a) shows, different professions have different forwarding rates on average. It is a little surprising that the "estate" and "government" accounts forwarded more messages than the "finance" accounts. Overall, the difference between professions is not significant. In our dataset, about 58% of weibo messages are forwarded messages. For about 66% of users, more than half of their messages are forwarded messages. Figure <ref type="figure" target="#fig_1">2(b)</ref> shows the distribution of how many messages users forwarded (in their latest 500 messages) in our dataset. We find that about 95% of users forwarded more than 50 messages. In this paper, our goal is to predict users' professions based only on their forwarding behaviors, so in our experiment we discard the other 5% of users, who forwarded no more than 50 messages.</p><p>As described in section 3.1, we define the absolutely hot message and the relatively hot message separately. To better understand these two types, we calculate, by category, how many times users' latest 500 weibo messages have been forwarded on average. As Figure <ref type="figure" target="#fig_1">2</ref>(c) shows, these numbers are very unbalanced across categories. The "entertainment" and "literature" accounts attract many more forwards than "estate" accounts. The main reason is that the "entertainment" and "literature" accounts have relatively more followers. If we only adopt absolutely hot messages (for example, with the threshold set to 500), it is possible that we cannot get any hot messages posted by "estate" accounts. So identifying relatively hot messages is necessary in our model. Weibo limits message length to 140 Chinese characters or 280 English characters. Figure <ref type="figure" target="#fig_1">2(d)</ref> shows the length distribution of hot messages in our dataset. We find that there are two peaks. The first peak represents the hot messages that only contain 10-20 characters. These messages are likely to be posted by star users who have millions of fans. This kind of message usually also contains a picture or a video link. The second peak represents the messages that contain 140 Chinese characters. This kind of message generally contains rich semantics.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Identify Hot Messages</head><p>As described in section 3.1, if a message has been forwarded more than a certain number of times, it is considered an absolutely hot message. The choice of threshold is a double-edged sword. If we set the threshold to a smaller value (more hot messages), on one hand, users can be represented with more messages and our model's expressive power increases; on the other hand, our model has to handle more features and runs the risk of over-fitting. As Table <ref type="table" target="#tab_1">2</ref> shows, we set the threshold to 500, 2,000, and 10,000 separately. When the threshold is set to 500, we get 731,153 hot weibo messages. This number is too large, and most of these messages have been forwarded by no more than 5 users in our dataset (40 thousand users). We therefore filter out such messages from our hot message sets, leaving 100,219 valid messages. In the prediction tasks, we compare the performance of these three thresholds and choose 500 as the default value. As defined in section 3.1, if a message's owner has f followers (f &gt; 500) and this message has been forwarded more than f/5 times, we regard this weibo message as a relatively hot message. As with the absolutely hot messages, we also filter out the messages that have been forwarded by no more than 5 users in our dataset, and get 61,806 relatively hot messages.</p><p>Eventually, we collect 162,025 hot messages in total (100,219 absolutely hot &amp; 61,806 relatively hot).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Group Similar Hot Messages Together</head><p>In this part, we evaluate the performance of our three methods on clustering similar hot messages. As Table <ref type="table" target="#tab_2">3</ref> shows: (1) In the simhash method, we choose 64 as the default length of the hash value. In this step, we group similar messages together if their Hamming distance is less than or equal to 3. We can merge the 162,025 hot messages identified in section 4.2 into 57,624 hot events. (2) In the second method, we choose 400 as the default size of the paragraph vector, and merge similar messages according to their Euclidean distances. In this step, we can merge the 162,025 hot messages into 32,118 hot events. (3) In the third method, we also choose 400 as the number of hidden variables, and adopt Euclidean distance to measure similarity. In this step, we can merge the 162,025 hot messages into 27,129 hot events. In our experiment, the lengths of these three vectors (64, 400, 400) are chosen empirically <ref type="bibr" target="#b7">[8,</ref><ref type="bibr" target="#b9">10]</ref>. We tune the other hyper-parameters (where to stop merging) on the validation set to find the best stopping points.</p><p>In practice, we combine all three methods serially. First, we adopt simhash to find similar hot messages, making users' representative vectors more compact. On the basis of these results, we adopt the second method, further compressing users' vectors. Finally, we perform the third method on the current results. After these three steps, our 162,025 hot messages are clustered into 17,196 hot events. Next, we study whether these optimizations can improve our profession identification task.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Results of Prediction</head><p>We randomly divide our 40 thousand labeled users into a training set (60%), a validation set (20%), and a test set (20%). We regard each user's labeled profession as the gold standard, and select accuracy and macro-averaged precision, recall and F-measure as evaluation metrics.</p><p>To verify the validity of our method, we build a baseline model. The feature candidates of the baseline model include: (1) Words in the user's original messages; (2) Words in the user's forwarded messages; (3) Mentioned user ids in messages; (4) URLs in messages; (5) Hash tags in messages. There are hundreds of thousands of feature candidates, so we have to perform feature selection to downsize our feature sets. Following established practice in feature selection for text classification, we use the χ² statistic to select representative features. We evaluate performance with different numbers of features, and select 9,200 features. We compare LR and GBDT on these features and find they have similar performance. To be consistent with our model, we also choose GBDT as the default baseline classifier. Table <ref type="table" target="#tab_3">4</ref> presents the evaluation results. The baseline model achieves an accuracy of 62.38%, and our three models all get better results. This comparison shows that users' forwarding behavior is effective in profession identification. As Table <ref type="table" target="#tab_3">4</ref> shows, the prediction improves gradually as the three merging strategies are applied in turn. The model in the fourth row, which serially applies all three merging strategies, achieves the best result (accuracy = 73.98%, F1 = 73.87%). This result indicates that effective clustering of similar messages is necessary, because there are too many forwarded messages.</p><p>To better understand the prediction errors, we present the details of the best result. In Table <ref type="table" target="#tab_4">5</ref>, the value in the i-th row and j-th column represents the ratio of the users in profession i being identified as profession j. To make the data more intuitive, we illustrate the ratio in each entry using different shades of color. We can observe that: (1) Our model performs differently on different professions. The recall scores (values on the diagonal) of most professions are above 70%, with only "fashion" and "literature" below 65%. The main reason is that the forwarding behavior of these two professions has no special characteristics. (2) The "media" accounts make up about a quarter of our user collection. Our model tends to predict uncertain users as "media" accounts, making the precision score of "media" relatively low (51.3%). (3) The behaviors of some professions are quite similar. For example, "entertainment" users and "fashion" users have similar interests, and they usually follow and interact with each other. This makes the boundary between these two professions unclear for identification.</p></div>
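<div xmlns="http://www.tei-c.org/ns/1.0"><p>The evaluation metrics can be sketched from a confusion matrix of counts as follows. The F-measure convention is inferred rather than stated: taking the harmonic mean of macro-averaged precision and macro-averaged recall reproduces the F1 column of Table 4 from its Precision and Recall columns (for the baseline, 2 × 64.03 × 60.29 / (64.03 + 60.29) ≈ 62.10), so the sketch uses that convention. The function names are hypothetical.</p><p><code lang="python"><![CDATA[
```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_metrics(confusion):
    """Accuracy and macro-averaged precision/recall/F1.

    confusion[i][j] = number of users of true class i
    that were predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))
    precisions, recalls = [], []
    for c in range(n):
        tp = confusion[c][c]
        predicted = sum(confusion[i][c] for i in range(n))
        actual = sum(confusion[c])
        precisions.append(tp / predicted if predicted else 0.0)
        recalls.append(tp / actual if actual else 0.0)
    macro_p = sum(precisions) / n
    macro_r = sum(recalls) / n
    return {
        "accuracy": correct / total,
        "precision": macro_p,
        "recall": macro_r,
        "f1": f_measure(macro_p, macro_r),
    }
```
]]></code></p><p>Note that Table 5 reports row-normalized ratios rather than counts, so per-class precision cannot be recomputed from it without the class sizes.</p></div>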
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Related work</head><p>Users' attributes can be inferred from user-generated text data and social network structure. <ref type="bibr" target="#b5">[6]</ref> showed that users' age and gender can be predicted from their webpage browsing logs. <ref type="bibr" target="#b8">[9]</ref> showed that users' profiles can be predicted from their mobile phone apps. <ref type="bibr" target="#b12">[13]</ref> analyzed tens of thousands of blogs and found significant differences in writing style and word usage between different gender and age groups. <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b10">11]</ref> predicted users' gender and age based on the linguistic characteristics of their Twitter messages. <ref type="bibr" target="#b14">[15]</ref> identified weibo users' profiles only via the videos they talk about. <ref type="bibr" target="#b11">[12]</ref> identified users' political orientation and ethnicity by leveraging their network structure and linguistic characteristics. <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b16">17]</ref> predicted users' profiles based on their social network structure and check-ins.</p><p>Recently, there has been some research on identifying users' professions. <ref type="bibr" target="#b13">[14]</ref> presented an efficient framework for profession identification in Weibo. This work identified users' professions based on both personal information and network structure. <ref type="bibr" target="#b6">[7,</ref><ref type="bibr" target="#b15">16]</ref> showed that computers' judgments of people's personalities based on their Facebook Likes are more accurate than judgments made by their close acquaintances.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusion</head><p>In this paper, we present an efficient framework, PIFB, to predict influential users' professions by only examining which microblogs they have forwarded. In the first step, we identify the hot weibo messages from a large number of candidate messages, and represent users with the hot messages they forwarded. After that, we group hot messages together if they talk about similar topics. This step makes users' representative vectors more compact. Finally, we design a multi-class classifier to predict their professions. The experiments on a real-world dataset demonstrate the effectiveness of PIFB. Our method performs significantly better than the traditional "bag of words" based method.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 .</head><label>1</label><figDesc>Fig. 1. The framework of PIFB</figDesc><graphic coords="2,211.62,377.68,192.12,158.40" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Fig. 2 .</head><label>2</label><figDesc>Fig. 2. Data observation</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 .</head><label>1</label><figDesc>The distribution of professions in our dataset.</figDesc><table><row><cell>No.</cell><cell>Category</cell><cell>(%)</cell><cell>No.</cell><cell>Category</cell><cell>(%)</cell></row><row><cell>1</cell><cell>Media</cell><cell>26.3</cell><cell>7</cell><cell>Sports</cell><cell>6.4</cell></row><row><cell>2</cell><cell>Entertainment</cell><cell>10.1</cell><cell>8</cell><cell>Fashion</cell><cell>6.2</cell></row><row><cell>3</cell><cell>Estate</cell><cell>9.1</cell><cell>9</cell><cell>Education</cell><cell>5.9</cell></row><row><cell>4</cell><cell>Finance</cell><cell>8.6</cell><cell>10</cell><cell>Literature</cell><cell>5.4</cell></row><row><cell>5</cell><cell>Government</cell><cell>8.5</cell><cell>11</cell><cell>Game</cell><cell>5.1</cell></row><row><cell>6</cell><cell>IT</cell><cell>8.4</cell><cell/><cell/><cell/></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 .</head><label>2</label><figDesc>Number of absolutely hot messages</figDesc><table><row><cell>No.</cell><cell>Threshold</cell><cell># before filter</cell><cell># after filter</cell></row><row><cell>1</cell><cell>500</cell><cell>731,150</cell><cell>100,219</cell></row><row><cell>2</cell><cell>2000</cell><cell>426,019</cell><cell>82,339</cell></row><row><cell>3</cell><cell>10000</cell><cell>74,308</cell><cell>32,955</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3 .</head><label>3</label><figDesc>Number of messages under different merging strategies</figDesc><table><row><cell>No.</cell><cell>Merging Strategy</cell><cell># before</cell><cell># after</cell></row><row><cell>1</cell><cell>Simhash</cell><cell>162,025</cell><cell>57,624</cell></row><row><cell>2</cell><cell>P2V</cell><cell>162,025</cell><cell>32,118</cell></row><row><cell>3</cell><cell>UWMF</cell><cell>162,025</cell><cell>27,129</cell></row><row><cell>4</cell><cell>Simhash+P2V+UWMF</cell><cell>162,025</cell><cell>17,196</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 4.</head><label>4</label><figDesc>Evaluation results for various features and combinations. (%)</figDesc><table><row><cell>No.</cell><cell>Method</cell><cell>Accuracy</cell><cell>Precision</cell><cell>Recall</cell><cell>F1</cell></row><row><cell>1</cell><cell>Baseline</cell><cell>62.38</cell><cell>64.03</cell><cell>60.29</cell><cell>62.10</cell></row><row><cell>2</cell><cell>Simhash</cell><cell>69.24 ↑ 6.86%</cell><cell>70.88</cell><cell>67.61</cell><cell>69.21 ↑ 7.11%</cell></row><row><cell>3</cell><cell>Simhash+P2V</cell><cell>73.79 ↑ 11.41%</cell><cell>73.90</cell><cell>71.28</cell><cell>72.57 ↑ 10.47%</cell></row><row><cell>4</cell><cell>Simhash+P2V+UWMF</cell><cell>73.98 ↑ 11.60%</cell><cell>74.81</cell><cell>72.95</cell><cell>73.87 ↑ 11.77%</cell></row></table></figure>
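<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 5.</head><label>5</label><figDesc>Distribution of identified professions within each actual profession (%). Rows are actual professions; columns are identified professions.</figDesc><table><row><cell></cell><cell>Me</cell><cell>En</cell><cell>Es</cell><cell>Fi</cell><cell>Go</cell><cell>IT</cell><cell>Sp</cell><cell>Fa</cell><cell>Ed</cell><cell>Li</cell><cell>Ga</cell></row><row><cell>Me</cell><cell>76.7</cell><cell>5.6</cell><cell>2.7</cell><cell>3.5</cell><cell>4.1</cell><cell>3.3</cell><cell>2.2</cell><cell>0.9</cell><cell>0.6</cell><cell>0.2</cell><cell>0.2</cell></row><row><cell>En</cell><cell>7.2</cell><cell>74.5</cell><cell>0.2</cell><cell>3.3</cell><cell>0.7</cell><cell>1.4</cell><cell>4.4</cell><cell>5.1</cell><cell>0.2</cell><cell>1.3</cell><cell>1.7</cell></row><row><cell>Es</cell><cell>7.4</cell><cell>2.0</cell><cell>72.9</cell><cell>8.5</cell><cell>5.3</cell><cell>2.2</cell><cell>0.4</cell><cell>0.9</cell><cell>0.1</cell><cell>0.0</cell><cell>0.3</cell></row><row><cell>Fi</cell><cell>8.4</cell><cell>0.1</cell><cell>6.4</cell><cell>70.2</cell><cell>5.3</cell><cell>6.2</cell><cell>0.2</cell><cell>1.3</cell><cell>1.7</cell><cell>0.1</cell><cell>0.1</cell></row><row><cell>Go</cell><cell>4.9</cell><cell>2.2</cell><cell>0.4</cell><cell>4.2</cell><cell>78.2</cell><cell>2.9</cell><cell>4.1</cell><cell>0.4</cell><cell>2.5</cell><cell>0.2</cell><cell>0.0</cell></row><row><cell>IT</cell><cell>6.1</cell><cell>0.7</cell><cell>3.9</cell><cell>4.3</cell><cell>1.3</cell><cell>76.3</cell><cell>0.2</cell><cell>0.1</cell><cell>2.6</cell><cell>0.7</cell><cell>3.8</cell></row><row><cell>Sp</cell><cell>5.1</cell><cell>2.9</cell><cell>0.0</cell><cell>0.3</cell><cell>0.3</cell><cell>1.0</cell><cell>86.2</cell><cell>2.2</cell><cell>0.7</cell><cell>0.0</cell><cell>1.3</cell></row><row><cell>Fa</cell><cell>9.7</cell><cell>14.9</cell><cell>1.0</cell><cell>6.2</cell><cell>0.2</cell><cell>0.0</cell><cell>3.3</cell><cell>61.5</cell><cell>0.9</cell><cell>1.2</cell><cell>1.1</cell></row><row><cell>Ed</cell><cell>5.2</cell><cell>3.9</cell><cell>3.3</cell><cell>4.6</cell><cell>2.0</cell><cell>3.2</cell><cell>0.7</cell><cell>1.8</cell><cell>68.4</cell><cell>4.2</cell><cell>2.7</cell></row><row><cell>Li</cell><cell>13.7</cell><cell>7.2</cell><cell>0.7</cell><cell>1.3</cell><cell>0.6</cell><cell>1.4</cell><cell>0.4</cell><cell>3.3</cell><cell>9.8</cell><cell>60.9</cell><cell>0.7</cell></row><row><cell>Ga</cell><cell>5.2</cell><cell>3.9</cell><cell>0.8</cell><cell>0.0</cell><cell>0.4</cell><cell>7.1</cell><cell>1.2</cell><cell>4.3</cell><cell>0.1</cell><cell>0.3</cell><cell>76.7</cell></row></table></figure>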
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0">Proceedings of the 3rd International Workshop on Social Influence Analysis (SocInf 2017), August 19th, 2017, Melbourne, Australia</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>The authors would like to thank the anonymous reviewers for their comments. This work was supported by the National Natural Science Foundation of China under Grant No. 61572044. The contact author is Zhen Xiao.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Discriminating gender on Twitter</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">D</forename><surname>Burger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Henderson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Zarrella</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of EMNLP</title>
				<meeting>EMNLP</meeting>
		<imprint>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="1301" to="1309" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Similarity estimation techniques from rounding algorithms</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">S</forename><surname>Charikar</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the thirty-fourth annual ACM symposium on Theory of computing</title>
				<meeting>the thirty-fourth annual ACM symposium on Theory of computing</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2002">2002</date>
			<biblScope unit="page" from="380" to="388" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Xgboost: A scalable tree boosting system</title>
		<author>
			<persName><forename type="first">T</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Guestrin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of SIGKDD</title>
				<meeting>SIGKDD</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="785" to="794" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Predicting the demographics of Twitter users from website traffic data</title>
		<author>
			<persName><forename type="first">A</forename><surname>Culotta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">R</forename><surname>Kumar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Cutler</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of AAAI</title>
				<meeting>AAAI</meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="72" to="78" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Latent semantic models for collaborative filtering</title>
		<author>
			<persName><forename type="first">T</forename><surname>Hofmann</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Transactions on Information Systems (TOIS)</title>
		<imprint>
			<biblScope unit="volume">22</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="89" to="115" />
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Demographic prediction based on user&apos;s browsing behavior</title>
		<author>
			<persName><forename type="first">J</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">J</forename><surname>Zeng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Niu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Chen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of WWW</title>
				<meeting>WWW</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2007">2007</date>
			<biblScope unit="page" from="151" to="160" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Private traits and attributes are predictable from digital records of human behavior</title>
		<author>
			<persName><forename type="first">M</forename><surname>Kosinski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Stillwell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Graepel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Proceedings of the National Academy of Sciences</title>
		<imprint>
			<biblScope unit="volume">110</biblScope>
			<biblScope unit="issue">15</biblScope>
			<biblScope unit="page" from="5802" to="5805" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Distributed representations of sentences and documents</title>
		<author>
			<persName><forename type="first">Q</forename><forename type="middle">V</forename><surname>Le</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Mikolov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of ICML</title>
				<meeting>ICML</meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="1188" to="1196" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<title level="m" type="main">You are what apps you use: Demographic prediction based on user&apos;s apps</title>
		<author>
			<persName><forename type="first">E</forename><surname>Malmi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Weber</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1603.00059</idno>
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Detecting near-duplicates for web crawling</title>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">S</forename><surname>Manku</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Jain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Das Sarma</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of WWW</title>
				<meeting>WWW</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2007">2007</date>
			<biblScope unit="page" from="141" to="150" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">&quot;How old do you think I am?&quot; A study of language and age in Twitter</title>
		<author>
			<persName><forename type="first">D</forename><surname>Nguyen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Gravel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Trieschnigg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Meder</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of ICWSM</title>
				<meeting>ICWSM</meeting>
		<imprint>
			<publisher>AAAI Press</publisher>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">A machine learning approach to Twitter user classification</title>
		<author>
			<persName><forename type="first">M</forename><surname>Pennacchiotti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">M</forename><surname>Popescu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of ICWSM</title>
				<meeting>ICWSM</meeting>
		<imprint>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="281" to="288" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Effects of age and gender on blogging</title>
		<author>
			<persName><forename type="first">J</forename><surname>Schler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Koppel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Argamon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Pennebaker</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs</title>
				<imprint>
			<date type="published" when="2006">2006</date>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="page" from="199" to="205" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">PRISM: Profession Identification in Social Media with Personal Information and Community Structure</title>
		<author>
			<persName><forename type="first">C</forename><surname>Tu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sun</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Social Media Processing: 4th National Conference, SMP 2015, Proceedings</title>
				<meeting>SMP 2015<address><addrLine>Guangzhou, China</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<pubPlace>Singapore</pubPlace>
			<date type="published" when="2015">November 16-17, 2015</date>
			<biblScope unit="page" from="15" to="27" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Improving users&apos; demographic prediction via the videos they talk about</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Xiao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Xiao</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of EMNLP</title>
				<meeting>EMNLP</meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Computer-based personality judgments are more accurate than those made by humans</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kosinski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Stillwell</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Proceedings of the National Academy of Sciences</title>
		<imprint>
			<biblScope unit="volume">112</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="1036" to="1040" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">You are where you go: Inferring demographic attributes from location check-ins</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">J</forename><surname>Yuan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Zhong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Xie</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of WSDM</title>
				<meeting>WSDM</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="295" to="304" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
