
Mining Twitter for an Explanatory Model of Social Influence

Jan Hauffa

Benjamin Koster

Florian Hartl

Valeria Kollhofer

Georg Groh

grohg@in.tum.de, Technische Universität München, Department of Informatics, Boltzmannstr. 3, 85748 Garching, Germany

2016


The large-scale availability of online communication data offers an opportunity to learn about social influence on the individual level. Starting from an abstract cognitive definition, we iteratively build a predictive model of social influence upon the principle of locality of influence, which implies the decomposition of observed behavior into resistance to influence, and influence received via direct and indirect exposure to others' behavior. After training the model on a 30,000 user dataset of the social network service Twitter, we find that direct exposure has much less explanatory value than expected, and sources of influence exhibit strong temporal variation. We identify two modes of communication on Twitter, differing in the manifestation of influence.

1 Introduction

Interpersonal social influence has long been a subject of research in the social sciences. A generally accepted definition is "change in an individual's thoughts, feelings, attitudes, or behaviors that results from interaction" [10], but the nature of the process by which an individual receives influence remains under active research and debate. With the rise of online social network services (SNS), social interaction has become observable outside of constrained experimental settings and accessible to large-scale data mining. Longitudinal interaction data makes changes in behavior visible, enabling inference about changes in people's attitude and reasoning about the process that drives these changes. By analyzing communication data in large volume, we attempt to identify fundamental characteristics of social influence.

1.1 Influence in Social Networks

In a simple model of human cognition, the behavior of an individual is determined by an internal state, which is constantly updated by perception of the environment. Change of behavior in reaction to events in the environment is the most general form of influence. The internal state is not observable, but observing both the environment and the behavior of an individual enables inductive reasoning about their relationship, and by extension about the underlying cognitive processes. Inferences can be tested by applying them to the prediction of future behavior. Social influence can be defined as the subset of updates to the internal state caused by interpersonal interaction, and its effect on future interactions.

Copyright © 2016 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.

From an outside perspective, the effects of social interaction and general perception cannot be separated, so any amount of data that can be gathered in a practical experiment will be insufficient for reasoning within this model. To make inference tractable, we introduce an assumption called locality of influence: The influence of behavior perceived in social context a on behavior produced in context b is proportional to the similarity of a and b. Local influences may override external influence, but the resulting change in behavior may also be limited to a particular social context.

Related concepts can be found in the literature: Latané's [5] dynamic theory of social impact asserts that "[...] influence is directly proportional to the immediacy of the source of influence." Immediacy is defined as a combination of variables, including "richness of the communication channels" and geospatial distance. Myers et al. [8] provide empirical support by attributing only 29% of information in a complete record of Twitter activity over one month to "external events and factors outside the network". The role of local graph structure for information diffusion in social networks is discussed e.g. by Zhang et al. [15].

1.2 Related Work

The main difference between our work and other studies of social influence [12] is our goal of learning about the influence process. Instead of inferring an influence network from observed interactions, our model yields a network-wide rule for generating individual influence networks for each user, comparable to egocentric diffusion networks [15].

2 Data Acquisition

Characterizing the social influence process requires a large corpus of observed social interaction that is not restricted to a particular social group or subject matter. We build such a corpus by crawling Twitter, an online service focused on the exchange of short text messages ("tweets") up to 140 characters in length, which are public by default. The only method of interaction is posting a tweet, and the only relation over the set of users is "a follows b", whereby a subscribes to tweets sent by b. Following is asymmetric, and does not require confirmation by the followee. Each user has a personal news feed that chronologically aggregates the tweets sent by followees.

2.1 Crawling Twitter

The follower network was crawled using non-exhaustive breadth-first search (BFS), ignoring the direction of edges. Accounts younger than 10 days, with a degree greater than 25,000, or not posting in English were excluded, to avoid spammers and mitigate the effect of "hubs", e.g. celebrities, who connect otherwise distant parts of the network.
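The crawl described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the crawler used for the study: the `Account` record, the `neighbors` callback, and the `max_users` budget are hypothetical stand-ins for data returned by the Twitter API, and the choice not to expand excluded accounts is our assumption.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Account:
    """Hypothetical account record; field names are illustrative only."""
    user_id: int
    age_days: int
    degree: int      # number of follower/followee edges, direction ignored
    language: str


def eligible(acc):
    """Exclusion rules from the text: young, very high degree, or non-English."""
    return acc.age_days >= 10 and acc.degree <= 25_000 and acc.language == "en"


def crawl(start, accounts, neighbors, max_users):
    """Non-exhaustive BFS over the undirected follower graph.

    accounts:  dict mapping user_id -> Account (hypothetical lookup)
    neighbors: function user_id -> iterable of adjacent user_ids
    max_users: crawl budget; the search stops once it is reached
    """
    visited, kept = {start}, []
    queue = deque([start])
    while queue and len(kept) < max_users:
        uid = queue.popleft()
        if not eligible(accounts[uid]):
            continue  # assumption: excluded accounts are not expanded either
        kept.append(uid)
        for v in neighbors(uid):
            if v not in visited:
                visited.add(v)
                queue.append(v)
    return kept
```

Because excluded hubs are never expanded in this sketch, parts of the network reachable only through them are not visited, which is one way the stated filter could mitigate the effect of hubs.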

Crawling produced a longitudinal dataset of 358,342 users and their tweets, which was subsampled to 30,000 users by BFS traversal from the original starting point due to the computational complexity of subsequent processing. Table 1 compares the samples to the full Twitter follower graph of July 2009 [4]. The metrics confirm that BFS is biased towards high-degree nodes, but preserves the disassortative tendency of the graph, and improves data quality for our use case by yielding subgraphs that are more dense than the original graph by orders of magnitude.

The originally intended use case for Twitter was posting brief "status updates". When holding conversations over Twitter became more popular, the community reached consensus on social conventions, which were later adopted by Twitter and integrated into the UI:

@-mention: Prefixing a user name with the '@' sign anywhere in a tweet causes the specified user to be notified. Honeycutt and Herring [3] identify two main uses: addressing a message to another user, and referencing a user in a message intended for a wider audience.

Reply: Tweets starting with an @-mention are considered part of an ongoing conversation.

Retweet: Reposting a received tweet under one's own name extends its visibility. The usual way of attribution is prefixing the quoted tweet with "RT" or "via", followed by @-mentioning the original author.

Among the 17 million tweets of the 30,000 user dataset, 46% are regular tweets, 36% contain at least one @-mention, and 18% are retweets. 77% of tweets containing @-mentions are explicit replies via the UI. 8% of replies are users replying to their own posts, presumably chaining related posts.

Addressivity is a property of communication in online social media. The sender of an addressive message explicitly designates one or more recipients, demonstrating awareness. Non-addressive messages are "broadcast" to an undisclosed group of people. For the purposes of this work, we treat regular tweets as non-addressive and replies as addressive, while tweets containing @-mentions are counted both as non-addressive and as addressed to each mentioned user. On average, 36% of a user's tweets are addressive ($\sigma$ = 24%).
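The classification rules above can be illustrated with a small text-based heuristic. This is a sketch under assumptions: the regular expressions and the `classify` function are hypothetical, replies are detected from the text alone (the dataset identifies explicit replies via the UI), and a reply is assumed to be addressed to every user it mentions.

```python
import re

# Hypothetical patterns; real tweets may need more careful tokenization.
MENTION = re.compile(r"@(\w+)")
RETWEET = re.compile(r"\b(?:RT|via)\s*@\w+")


def classify(text):
    """Assign a tweet to one of the four kinds discussed in the text."""
    mentions = MENTION.findall(text)
    if RETWEET.search(text):
        # Retweets are excluded from the experiments.
        kind, non_addressive, addressed_to = "retweet", False, []
    elif text.lstrip().startswith("@"):
        # Replies count as addressive only.
        kind, non_addressive, addressed_to = "reply", False, mentions
    elif mentions:
        # Counted both as non-addressive and as addressed to each mention.
        kind, non_addressive, addressed_to = "mention", True, mentions
    else:
        kind, non_addressive, addressed_to = "regular", True, []
    return {"kind": kind, "non_addressive": non_addressive,
            "addressed_to": addressed_to}
```

For example, `classify("great talk by @bob today")` is treated as a mention tweet that contributes to the author's non-addressive messages and to the addressive messages sent to `bob`.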

Given the conceptual differences between the two types of communication, it stands to reason that they are also different in terms of influence, so we analyze them separately. As retweeting has already been studied within the information diffusion framework, e.g. by Zhang et al. [15], we exclude retweets from the following experiments.

2.3 Data Sparsity

Certain characteristics of the dataset may cause a lack of data in an experimental setting. The first issue is the low information content of a single tweet, caused by the size limit of 140 characters, and the presence of elements with a primarily social function, e.g. @-mentions. The second issue is sparsity of the spatio-temporal distribution of tweets. When discretizing time into periods of equal length, and assigning non-addressive and addressive messages to the nodes and directed edges of the social network graph, respectively, not all of them will be active, i.e. have at least one associated tweet, in each period. For a period length of 14 days, on average 69.2% of nodes and only 0.9% of edges were active, while for a period length of 2 days, 48.5% of nodes and 0.2% of edges were active. The third issue is missing observations. On average, only 19% of a node's first-degree neighbors in the Twitter follower graph are present in the sample.

3 Data Representation via Topic Modeling

The most salient component of interaction on Twitter is unstructured text, so a suitable numeric representation has to be found. Given evidence that individual potential to exert influence depends on the topic of conversation [6], topic models appear to be an appropriate choice.

Latent Dirichlet Allocation (LDA) [11] represents each document in a collection as a probability distribution over T topics, which in turn are probability distributions over the set of unique words. The Author-Recipient-Topic model (ART) [7], designed for email messages, extends LDA by observed variables for the sender and one or more recipients. For each sender-recipient pair, it yields a relationship-topic distribution representing the messages sent along the corresponding social graph edge. ART assigns each word of a message to an individual recipient. For short messages like tweets, it is more fitting to assume that the message as a whole is addressed to all recipients. As a compromise, we choose a canonical sender-recipient pair for each tweet: The first @-mentioned user in an addressive tweet is the recipient, while the author of a non-addressive tweet is both sender and recipient, yielding separate topic distributions for each mode of communication.

3.1 Parameter Estimation and Inference

The tweet text is subjected to domain-specific tokenization and stop word removal. The number of topics T is arbitrarily set to 150; values of the other ART hyper-parameters are chosen according to best practices: $\beta$ is set to 0.01 [11] to obtain a symmetric Dirichlet prior for $\phi$, while $\alpha$ is determined in a data-driven way [13], allowing the prior of $\theta$ to be asymmetric. Exact estimation of the model parameters is intractable, so we approximate them via 2000 iterations of Gibbs sampling.

For predicting behavior and evaluating the prediction, it is necessary to subdivide the dataset along the time axis, and compute separate relationship-topic distributions for each period. To be comparable, these distributions need to refer to a single set of topics $\phi$. After parameter estimation on the full dataset, relationship-topic distributions for arbitrary subsets of the original data can be computed by resampling, i.e. repeating the Gibbs sampling process with fixed $\phi$, for which 200 iterations are sufficient.

After resampling, the sampler's internal state can be used for fast approximation of aggregate relationship-topic distributions over groups of senders and recipients. The formula for estimation of $\theta$ [7] is adapted to sum over a set of senders S and recipients R, resulting in Eq. 1 for approximation of the aggregate distribution $\theta_{S,R}$, where $t = 1..T$ is the topic index, and $n_{i,j,t}$ the number of words in messages from i to j assigned to topic t.

\[
\theta_{S,R,t} = \frac{\alpha_t + \sum_{i \in S} \sum_{j \in R} n_{i,j,t}}{\sum_{t'=1}^{T} \left( \alpha_{t'} + \sum_{i \in S} \sum_{j \in R} n_{i,j,t'} \right)} \tag{1}
\]
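The aggregation in Eq. 1 amounts to summing topic counts over all selected sender-recipient pairs, adding the prior, and normalizing. The following sketch shows this computation; the dictionary layout of the counts and the function name `aggregate_theta` are assumptions made for illustration, not the sampler's actual data structures.

```python
def aggregate_theta(n, alpha, senders, recipients):
    """Approximate the aggregate relationship-topic distribution of Eq. 1.

    n:          dict mapping (i, j) -> list of T per-topic word counts for
                messages from i to j (hypothetical layout of sampler state)
    alpha:      list of T Dirichlet prior parameters
    senders:    iterable of sender ids (the set S)
    recipients: iterable of recipient ids (the set R)
    """
    T = len(alpha)
    recipients = list(recipients)
    counts = [0.0] * T
    for i in senders:
        for j in recipients:
            # Pairs without messages contribute zero counts.
            for t, c in enumerate(n.get((i, j), [0] * T)):
                counts[t] += c
    unnormalized = [alpha[t] + counts[t] for t in range(T)]
    z = sum(unnormalized)
    return [u / z for u in unnormalized]
```

With an empty selection the result reduces to the normalized prior, which matches the behavior of the model in the absence of data.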

After fitting an ART model to Twitter data covering a certain time period, we partition that data into observation and evaluation periods of equal length, and separate addressive from non-addressive communication. For each of these four subsets, various relationship-topic distributions ($\theta^M$ in Table 2) are computed via resampling and aggregation.

4 The Social Content Influence Model

The Social Content Influence Model (SCIM) learns to express the content of future interactions in terms of observed past interactions. Its predictive accuracy serves as an indicator for the explanatory value of the learned parameters.

Ignoring all other cognitive or social processes, future behavior can be fully explained by the presence or absence of social influence, or equivalently as a combination of inertia and exposure to others' behavior. If exposure is potential influence, then inertia is individual resistance to influence, a tendency not to deviate from past behavior. Unobserved sources of influence exist outside of the studied social medium, but also within, due to sampling. Their effect on the observed network appears as indirect influence, i.e. correlated behavioral changes in non-incident nodes [2]. Analogously, we distinguish direct and indirect exposure. If person a interacts with b, the content of the interaction can be directly observed, but will also be partially reflected in the future interactions of b with others. Aggregating the behavior of a group smooths over individual preferences, but preserves information about strong influence that equally affected every member. With the principle of locality, it follows that the aggregated behavior of people who are socially close to b reflects the behavior b is exposed to.

From the perspective of an individual node or node pair (ego and alter) connected by an edge, the social network can be viewed as a hierarchy of social circles of decreasing locality. To account for missing observations within the medium, we aggregate over a node's social neighborhood. Among different definitions of neighborhood, we aim to identify those that capture indirect exposure equally well across the whole graph. Influence from outside the medium is approximated by the aggregate behavior of the whole network, which potentially reflects strong trends from other media. This tripartite view of the egocentric social network corresponds to the distinction between interpersonal, peer, and media influence in sociology [14].

4.1 Prediction

Given the observed topic distributions from two successive time periods, the prediction problem can be formulated as using information from the first period to make predictions $\hat{\theta}_i^{M,n,s}$ for each node i, or $\hat{\theta}_{i,j}^{M,a,s}$ for each edge from i to j, so that their Jensen-Shannon divergence (JSD) from the distributions $\theta_i^{M,n,s}$, $\theta_{i,j}^{M,a,s}$ (see Table 2) in the second period is minimal. The JSD belongs to the family of symmetrized Kullback-Leibler divergences, which are commonly used for comparing topic distributions [11]. When defining the prediction $\hat{\theta}$ as a finite mixture of observed topic distributions $\theta_k$, $k \in C$ (Eq. 2), finding coefficients c that minimize the JSD is a convex optimization problem (Eq. 3).

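\[
\hat{\theta}_{i,j} = \sum_{k \in C \setminus d} c_k \theta_k + c_d \theta_d \tag{2}
\]

\[
\operatorname*{argmin}_{c,\, \theta_d} \; \sum_{i,j} D_{JS}(\hat{\theta}_{i,j}, \theta_{i,j}) + \lambda \sum_{i,j} \lVert \hat{\theta}_{i,j} \rVert_1 \tag{3}
\]

subject to

\[
0 \le c_k, \theta_{td} \le 1 \text{ for } k \in C,\ t = 1..T; \qquad \sum_{k \in C} c_k = 1; \qquad \sum_{t=1}^{T} \theta_{td} = 1
\]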
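To make the prediction step concrete, the sketch below computes the JSD (in bits) and evaluates the divergence term of Eq. 3 for a given coefficient vector. It is a simplified illustration under assumptions: the regularization term and the data-estimated component $\theta_d$ are omitted, all function names are hypothetical, and the coarse grid search over two components merely stands in for a proper convex solver.

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence in bits; terms with p_t = 0 contribute 0."""
    return sum(pt * math.log2(pt / qt) for pt, qt in zip(p, q) if pt > 0)


def jsd(p, q):
    """Jensen-Shannon divergence between two topic distributions."""
    m = [(pt + qt) / 2 for pt, qt in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def mix(coeffs, components):
    """Finite mixture of topic distributions, as in Eq. 2 without theta_d."""
    T = len(components[0])
    return [sum(c * comp[t] for c, comp in zip(coeffs, components))
            for t in range(T)]


def objective(coeffs, samples):
    """Sum of JSDs over all nodes or edges (divergence term of Eq. 3 only).

    samples: list of (components, target) pairs, where components are the
             observed distributions and target is the evaluation-period one.
    """
    return sum(jsd(mix(coeffs, comps), target) for comps, target in samples)


def grid_search(samples, steps=100):
    """Two-component case only: scan c = (w, 1 - w) on a regular grid."""
    best = min(range(steps + 1),
               key=lambda s: objective((s / steps, 1 - s / steps), samples))
    return best / steps, 1 - best / steps
```

A single set of coefficients is fitted to all samples at once, mirroring the model's use of one global coefficient vector rather than per-user weights.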

The models for addressive and non-addressive communication differ only in the number of mixture components. Table 2 lists all 15 components, names the subset of messages they are computed from, and defines the set of senders and recipients they are aggregated over, where applicable. Each component represents either inertia, indirect, or direct exposure at a particular level of locality (scope). The components at relationship scope only apply to addressive communication.

Table 2: Mixture components, the messages they are computed from, the sets of senders S and recipients R they are aggregated over, and their role and scope.

| Component | Messages | S | R | Role | Scope |
|---|---|---|---|---|---|
| $\theta_i^{M,n,s}$ | non-addr. messages sent by i | {i} | V | inertia | personal |
| $\theta_i^{M,a,s}$ | addr. messages sent by i | {i} | V | inertia | personal |
| $\theta_{i,j}^{M,a,s}$ | addr. messages from i to j | {i} | {j} | inertia | relationship |
| $\theta_i^{N(i),a,s}$ | addr. messages from i to neighbors | - | - | inertia | neighborhood |
| $\theta_i^{M,n,r}$ | non-addr. messages received by i | {x ∈ V : i follows x} | V | direct exposure | personal |
| $\theta_i^{M,a,r}$ | addr. messages received by i | V | {i} | direct exposure | personal |
| $\theta_{j,i}^{M,a,s}$ | addr. messages from j to i | {j} | {i} | direct exposure | relationship |
| $\theta_i^{N(i),a,r}$ | addr. messages from neighbors to i | - | - | direct exposure | neighborhood |
| $\theta_j^{M,n,s}$ | non-addr. messages sent by j | {j} | V | indirect exposure | relationship |
| $\theta_j^{M,a,s}$ | addr. messages sent by j | {j} | V | indirect exposure | relationship |
| $\theta^{N(i),n}$ | non-addr. messages sent by neighbors | - | - | indirect exposure | neighborhood |
| $\theta^{N(i),a}$ | addr. messages sent by neighbors | - | - | indirect exposure | neighborhood |
| $\theta^{M,n}$ | all non-addr. messages | V | V | indirect exposure | medium |
| $\theta^{M,a}$ | all addr. messages | V | V | indirect exposure | medium |
| $\theta_d$ | estimated from data | - | - | indirect exposure | medium |

Computing a single set of scalar coefficients that minimizes the error sum implies the assumption that the influence process is dominated by global, instead of individual or topical characteristics. Component $\theta_d$ is estimated from the data, capturing all global effects of influence that are either not explicitly represented in the SCIM or not directly observable. It allows the model to attain a training error of 0 if the influence process does not have any individual characteristics. The $\ell_1$ regularization promotes sparse predictions and thereby the sparsity of c and $\theta_d$. The regularization factor $\lambda$ is set to 0.001.

4.2 Construction of the Social Neighborhood

The social neighborhood N(i) of node i is a node-weighted subgraph of the social network graph (V, E), induced by an indicator function $I_i : V \to \{0, 1\}$ and a weight function $W_i : V \to \mathbb{R}^+$. The neighborhood mixture components $\theta^{N(i)}$ are weighted sums over particular relationship-topic distributions of the subgraph nodes: $\theta_{i,v}^{M,a,s}$ for $\theta_i^{N(i),a,s}$, $\theta_v^{M,n,s}$ for $\theta^{N(i),n}$, $\theta_{v,i}^{M,a,s}$ for $\theta_i^{N(i),a,r}$, and $\theta_v^{M,a,s}$ for $\theta^{N(i),a}$.
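A neighborhood component of this kind can be sketched as a weighted average over the per-node distributions selected by the indicator. The function below is illustrative only: its name and signature are hypothetical, and dividing by the total weight so that the result remains a probability distribution is our assumption, since the text only speaks of weighted sums.

```python
def neighborhood_component(thetas, indicator, weight):
    """Weighted combination of topic distributions over a neighborhood.

    thetas:    dict mapping node v -> topic distribution (list of T floats)
    indicator: function v -> 0 or 1, selecting the neighborhood members
    weight:    function v -> positive weight of a member
    Returns None for an empty neighborhood, where the component is undefined.
    """
    members = [v for v in thetas if indicator(v) == 1]
    if not members:
        return None
    total = sum(weight(v) for v in members)
    T = len(thetas[members[0]])
    # Normalizing by the total weight is an assumption (see lead-in).
    return [sum(weight(v) * thetas[v][t] for v in members) / total
            for t in range(T)]
```

A uniform weight function such as `lambda v: 1.0` yields the plain average of the members' distributions, while structural or similarity-based weights shift the component towards selected neighbors.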

We consider seven indicator and 25 weight functions. One family of indicators defines the neighborhood of i as the set of all nodes with a maximum distance of either one or two from i, either in the follower graph or the graph induced by addressive communication. The second family finds dense subgraphs of the undirected graph of reciprocal following, either by randomly selecting a maximal clique containing i, or applying the clique percolation method (k = 5) [9] or edge clustering [1], and taking the union of the communities i is a member of.

A basic weight function assigns uniform weight to all neighborhood nodes j. More complex functions derive the weight from structural properties of the social network graph (both local, such as the in-degree of j, and global, e.g. PageRank), from community structure (e.g. the number of shared communities of i and j), or from the communication behavior of j (e.g. how often j is retweeted).

5 Experimental Evaluation

The basic prediction experiment is defined as follows: First, a candidate set of either edges or nodes is built, depending on the type of communication to be analyzed. Candidates have to be active in both the observation and the evaluation period. The set is split randomly into training and test sets of equal size, then parameter estimation and evaluation are performed.

This basic experiment is repeated, testing all combinations of four experiment parameters: The observation date marks the end of the observation and the beginning of the evaluation period. Three equidistant dates within eight weeks were chosen, April 20, May 4, and May 18, 2012, aiming to test the temporal stability of the model. The length of the observation and evaluation period (time period length) needs to match the speed of conversation flow. We test periods of 14, 5, and 2 days, falling back to an extended period of 14 days if there is no activity. The relationship type is only relevant for addressive communication. It controls whether or not a needs to follow b for the edge from a to b to be considered. The last parameter is the choice of social neighborhood.

The SCIM is compared to three baseline predictors to verify that it captures non-trivial information about the influence process. The first predictor draws randomly from a Dirichlet distribution Dir($\alpha$) with $\alpha$ taken from the ART. The second predictor outputs the mean of Dir($\alpha$), which is the relationship-topic distribution the ART would produce in the absence of data. The third predictor outputs the relationship-topic distribution of the observed behavior, effectively a model of influence fully driven by inertia.
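The three baselines are simple enough to state in a few lines. The sketch below uses only the Python standard library; the function names are hypothetical, and sampling a Dirichlet vector by normalizing independent Gamma draws is a standard construction rather than a detail taken from the experiments.

```python
import random


def dirichlet_sample(alpha, rng=random):
    """Baseline 1: random draw from Dir(alpha) via normalized Gamma variates.

    Note: for very small alpha values the Gamma draws can underflow to 0;
    a production implementation would need to guard against a zero sum.
    """
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def dirichlet_mean(alpha):
    """Baseline 2: mean of Dir(alpha), i.e. the normalized prior."""
    total = sum(alpha)
    return [a / total for a in alpha]


def inertia_baseline(observed):
    """Baseline 3: predict that the observed distribution persists unchanged."""
    return list(observed)
```

Each baseline returns a topic distribution of the same length as the predictions of the SCIM, so all predictors can be scored with the same divergence measure.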

The experiment results are filtered to improve interpretability. Two restricted variants of the SCIM are introduced specifically to assess the utility of the coefficients and the neighborhood definitions. In the first variant, coefficients are uniform ($c_{1..|C|} = 1/|C|$; $c_d = 0$), while in the second variant all neighborhoods are empty. Any neighborhood definition that does not outperform these variants or the baselines across all combinations of experiment parameters is discarded.

To determine the experiment parameters' effect on prediction accuracy, we propose an ANOVA design, where the choice of neighborhood is a repeated measurement (including the baseline predictors for reference), and the remaining parameters are between-subject factors. The candidate sets are constructed and assigned to the experiments accordingly. All pairs of neighborhood definitions are tested post-hoc for significant differences in mean prediction error with Tukey's HSD test. The results can be expressed as homogeneous subsets of neighborhoods with equivalent performance. After ranking them by mean error, the mixture coefficients of the best-performing subset are analyzed via descriptive statistics.

5.1 Results

43.4% of experiments for non-addressive, and 91.5% for addressive communication are filtered out. ANOVA is performed with a per-group sample size of 238. For both types of communication, there are significant interaction effects ($\alpha = 0.01$) involving neighborhood definition, observation date and time period length. This indicates that the amount of indirect influence captured by some or all of the neighborhood definitions varies over time, possibly related to the temporally irregular activity of users (Section 2.3). An interaction between neighborhood and time period length indicates that subgraphs differ in speed of information flow.

For non-addressive communication, there is a significant effect of time period length, with longer time periods improving the accuracy, but this effect may already be fully explained by the higher-order interactions. There is no significant effect involving the relationship type, so the existence of a follower relationship does not appear to affect the perception of addressive messages. For both types of communication, the choice of neighborhood is significant. Tukey's test yields a high number of overlapping homogeneous subsets, but isolated baseline predictors. The lack of clustering limits the explanatory value of the best subsets.

The subset for non-addressive communication contains neighborhoods built by three indicator functions: First, communities found by edge clustering are given uniform weight, which implies that follower communities reflect indirect influence to a degree that is difficult to improve by weighting. Second, followers with a path distance of up to two, weighted with the number of shared followees or communities, also hint at the importance of cohesive social groups. Third, followers of distance one are paired with weights based on similarity of users or their message content, promoting homogeneous neighborhoods.

The neighborhoods in the best subset for addressive communication are built by a single indicator function, followers with a distance of up to two. Weights are mostly similarity-based and include the number of shared followees and the similarities of both kinds of communication.

Figure 1 compares the mean prediction error of the best subset to the baseline predictors. The SCIM outperforms all baselines, with a 10% improvement over the best performing baseline for non-addressive, and 28% for addressive communication. The lower error of the Dirichlet mean baseline predictor in case of addressive communication reflects the spatio-temporal sparsity discussed in Section 2.3.

Figure 2 shows the mixture coefficients as leaves of a tree, with the parent nodes representing either role or scope as listed in Table 2. Line width is proportional to the coefficient mean across the best subset, while the color corresponds to the ratio of mean and standard deviation: The darker the color, the less the coefficient is affected by the experiment parameters. Both addressive and non-addressive communication are strongly driven by inertia, but the predictive value of direct exposure is unexpectedly low, contradicting the principle of locality. The value of indirect exposure from the neighborhood is as expected, while the high value of the data-driven component d suggests the existence of patterns of indirect influence not covered by the SCIM. Communication is mainly influenced by other communication of the same type. Components aggregating the relationship-topic distributions of a large number of users are generally of low predictive value.

6 Discussion

We report two main results: First, a novel point of view on the question whether Twitter is a social network, or a bipartite network of content producers and consumers [4]. A major difference to other social media is the high volume of non-addressive communication. Messaging behavior of individuals is highly variable, with the proportion of addressive communication having a one-SD range of 12% to 60%. The difference between the two modes of communication is visible in the influence process: Non-addressive communication is more resistant to influence, so the more stable communication behavior can be exploited by longer observation periods. Users are influenced in their non-addressive communication by their edge communities, while their addressive communication receives influence from a larger set of neighbors, weighted by similarity. In effect, the Twitter social network is a product of the follower network, which governs the flow of non-addressive communication, and the implicit network formed by addressive messaging.

Second, future behavior can be predicted to a certain extent from local sources of information, which the SCIM learns to exploit. However, our results do not fully confirm the decomposability of social influence into inertia, direct, and indirect exposure, which follows from the principle of locality. The low explanatory value of direct exposure implies that locality is not sufficient on its own to explain why the SCIM is able to outperform the baselines: If interactions within and from outside the medium have similar potential for influence, observable interactions are responsible for just a fraction of the overall influence. Therefore it is important to exploit indirect influence, which allows information to cross the medium boundary. The best-performing neighborhood definitions favor nodes that are similar to the ego, and likely to be exposed to similar external influences.

[Figure 2, label fragment: sent by ego (non-addr.), µ=0.62, σ=0.12]

Proceedings of the 2nd International Workshop on Social Influence Analysis (SocInf 2016)

Future work involves repeating the experiments on new datasets from different social media to test if our results apply to social interaction in general.

References

1. Ahn, Y., Bagrow, J., Lehmann, S.: Link communities reveal multiscale complexity
2. Christakis, N., Fowler, J.: Social contagion theory: Examining dynamic social networks and human behavior. Statistics in Medicine 32(4), 556–577 (2013)
3. Honeycutt, C., Herring, S.: Beyond microblogging: Conversation and collaboration via Twitter. In: Proceedings of HICSS (Jan 2009)
4. Kwak, H., Lee, C., Park, H., Moon, S.: What is Twitter, a social network or a news media? In: Proceedings of WWW (Apr 2010)
5. Latané, B.: Dynamic social impact: The creation of culture by communication. Journal of Communication 46(4), 13–25 (1996)
6. Liu, L., Tang, J., Han, J., Jiang, M., Yang, S.: Mining topic-level influence in heterogeneous networks. In: Proceedings of CIKM (Oct 2010)
7. McCallum, A., Wang, X., Corrada-Emmanuel, A.: Topic and role discovery in social networks with experiments on Enron and academic email. Journal of Artificial Intelligence Research 30, 249–272 (2007)
8. Myers, S., Zhu, C., Leskovec, J.: Information diffusion and external influence in networks. In: Proceedings of SIGKDD (Aug 2012)
9. Palla, G., Derényi, I., Farkas, I., Vicsek, T.: Uncovering the overlapping community structure of complex networks in nature and society. Nature 435(7043), 814–818 (2005)
10. Rashotte, L.: Social influence. In: Ritzer, G. (ed.) The Blackwell Encyclopedia of Sociology, vol. 9, pp. 4426–4429. Blackwell (2007)
11. Steyvers, M., Griffiths, T.: Probabilistic topic models. In: Landauer, T., McNamara, D., Dennis, S., Kintsch, W. (eds.) Handbook of Latent Semantic Analysis, chap. 21. Lawrence Erlbaum (2007)
12. Sun, J., Tang, J.: A survey of models and algorithms for social influence analysis. In: Social Network Data Analysis, chap. 7. Springer (2011)
13. Wallach, H., Mimno, D., McCallum, A.: Rethinking LDA: Why priors matter. In: Proceedings of NIPS (Dec 2009)
14. Walther, J., Carr, C., Choi, S., DeAndrea, D., Kim, J., Tong, S., Van Der Heide, B.: Interaction of interpersonal, peer, and media influence sources online. In: Papacharissi, Z. (ed.) A Networked Self, chap. 1. Routledge (2010)
15. Zhang, J., Tang, J., Li, J., Liu, Y., Xing, C.: Who influenced you? Predicting retweet via social influence locality. ACM Transactions on Knowledge Discovery from Data 9(3), 25 (2014)