<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Scalable Privacy-Compliant Virality Prediction on Twitter</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>DTU Compute</institution>
          ,
          <addr-line>Matematiktorvet 303B 2800 Kgs. Lyngby</addr-line>
          ,
          <country country="DK">Denmark</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Microsoft Development Center Copenhagen</institution>
          ,
          <addr-line>Kanalvej 7 2800 Kgs. Lyngby</addr-line>
          ,
          <country country="DK">Denmark</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The digital town hall of Twitter has become a preferred medium of communication for individuals and organizations across the globe. Some of them reach audiences of millions, while others struggle to get noticed. Given the impact of social media, the question remains more relevant than ever: how to model the dynamics of attention on Twitter. Researchers around the world turn to machine learning to predict the most influential tweets and authors, navigating the volume, velocity, and variety of social big data, with many compromises. In this paper, we revisit content popularity prediction on Twitter. We argue that strict alignment of data acquisition, storage and analysis algorithms is necessary to avoid the common trade-offs between scalability, accuracy and privacy compliance. We propose a new framework for the rapid acquisition of large-scale datasets, high accuracy supervisory signal and multilanguage sentiment prediction while respecting every privacy request applicable. We then apply a novel gradient boosting framework to achieve state-of-the-art results in virality ranking, even before including a tweet's visual or propagation features. Our Gradient Boosted Regression Tree is the first to offer explainable, strong ranking performance on benchmark datasets. Since the analysis focuses on features available early, the model is immediately applicable to incoming tweets in 18 languages.</p>
      </abstract>
      <kwd-group>
        <kwd>Twitter</kwd>
        <kwd>scalability</kwd>
        <kwd>popularity</kwd>
        <kwd>virality</kwd>
        <kwd>privacy</kwd>
        <kwd>sentiment</kwd>
        <kwd>explainability</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction and motivation</title>
      <p>
"The role of the social and professional networks in the spread and
acceptance of innovations, knowledge, business practices, products,
behavior, rumors, and memes, is a much-studied problem in social sciences,
marketing and economics. Online environments like Twitter, offer an
unprecedented opportunity to track such phenomena." [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
The knowledge discovery process, however, is becoming even more tangled with
the arrival of social big data. 700 million tweets have been posted on the day
of writing this introduction. The volume, velocity, and variety of mostly
unstructured information even from a single social network are evolving at an
extremely fast pace. From an engineering and data science perspective, near
real-time analysis via online services and scalable in-memory algorithms is
required, and demands substantial computational resources. Scientific endeavors to
date offer progress toward specific subtasks of social network analysis (SNA), yet
data collection and privacy compliance remain among the biggest challenges in
extracting knowledge [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Arguably the most significant among them is privacy
[
        <xref ref-type="bibr" rid="ref34">34</xref>
        ]. The social nature of nodes in these networks makes data subject to many
privacy concerns and laws. The new European General Data Protection
Regulation (GDPR and ISO/IEC 27001), in force since May 25th, 2018, makes SNA
and black-box approaches (like deep neural networks) more difficult to use in
business, requiring the results to be retraceable (explainable) on demand [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
In machine learning, explainable (compliant) real-time analysis is often at odds
with predictive accuracy. In social popularity prediction, some of the best
results today are achieved using deep neural networks, which are difficult to interpret [
        <xref ref-type="bibr" rid="ref37">37</xref>
        ] or
data modalities time-consuming to acquire [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Modeling popularity relies on a
precise count of responses (subject to privacy requests, i.e., retweets in virality
prediction), which exposes them further. Accuracy in such studies depends on
processing documents no longer available, while privacy compliance requires
removing them. Ensuring accurate and explainable analysis via quality of the data
and methods, while respecting user privacy, remain conflicting goals and open
research issues individually. In this work we argue that significant advancement
in SNA requires avoiding such trade-offs and addressing all the above issues
simultaneously. We draw inspiration from multiple disciplines to challenge the state
of the art in content virality prediction on Twitter. We propose a framework
which, to the best of our knowledge, is the first one that satisfies the properties
of model preserving and privacy-compliant simultaneously. We use it to train a
scalable and explainable model, and are the first to achieve strong [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] ranking
performance on benchmark datasets.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Related work</title>
      <sec id="sec-2-1">
        <title>Social big data analysis before GDPR</title>
        <p>
          Social big data has become essential for various distributed services, applications,
and systems [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ], enabling event detection [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], sentiment analysis [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ],
popularity prediction [
          <xref ref-type="bibr" rid="ref38">38</xref>
          ], natural language processing, finding influential bloggers,
personalized recommendation [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], online advertising, viral marketing, opinion
leader detection etc. Computational and storage requirements of such
applications have led to cloud scale reinvention of data storage and processing
technologies. New tools are constantly emerging to replace the conventional non-effective
ones, and a hybrid of techniques [
          <xref ref-type="bibr" rid="ref15 ref20">20,15</xref>
          ] is now a requirement to extract value
from the social big data. [
          <xref ref-type="bibr" rid="ref35">35</xref>
          ] proposes a solution based on Hadoop technology
and a Naive Bayes classification for sentiment analysis of tweets. The sentiment
analysis is performed in the MapReduce layer and the results are stored in a distributed
NoSQL database. [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] uses Lucene indexing with full-text searching ability on top
of Hadoop for spectral clustering, to detect Twitter communities during the
Hurricane Sandy disaster. In our work we pursue close alignment of data
acquisition and analysis algorithms, with the strict constraints of storage and time,
to accommodate both user-generated content (UGC) and privacy requests,
arriving at high volume and velocity. Instead of perturbing or anonymizing the
data, sensitive or deleted information is permanently eliminated from storage
and subsequent analysis.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Content popularity prediction</title>
        <p>
          Social network influence can be defined as the ability of a user to spread
information in the network [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ], with the retweet count assumed as a measure of
a tweet's popularity. One common challenge for content-based popularity
prediction is the 140-character constraint imposed by Twitter, making it difficult
to identify and extract predictive features [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. [
          <xref ref-type="bibr" rid="ref36">36</xref>
          ] showed that carefully crafted
wording of the message could help propagate the tweets better, but there's much
more to UGC than the caption. [
          <xref ref-type="bibr" rid="ref19 ref37">19,37</xref>
          ] demonstrate social-oriented features were
the best performers to predict image popularity on Twitter. [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ] utilized textual,
visual, and social cues to predict the image popularity on Flickr. [
          <xref ref-type="bibr" rid="ref37">37</xref>
          ] proposed a
joint-embedding neural network combining the same cues to rival state-of-the-art
methods. Recurrent and Deep Neural Networks advance feature extraction from
high-dimensional unstructured data (e.g., image attachments); however, their
low explainability is a major drawback for critical decision-making
processes (with recent advances by [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ]). In this study, we prioritize explainable
methods in application to structured data. [
          <xref ref-type="bibr" rid="ref23 ref32 ref7">32,23,7</xref>
          ] demonstrate relationships
between the number of followers of Twitter users and their influence on
information spreading. Ranking users by the number of followers is found to perform
similarly to PageRank [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ]. [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ] models the probability to be retweeted by a
power law function. [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ] have used an explainable Random Forest classifier to
predict a range of the logarithm of the retweet volume. They demonstrate the
predictive value of user features (e.g., count of followers), network features, and
the popularity of hashtags included. [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] provide a comparison of learning
methods and features, regarding retweet prediction accuracy and feature importance.
They find Random Forests to achieve the best performance in binary classification
of retweetability and highlight the value of author features: number of times
the user is listed by other users, number of followers and the average number
of tweets posted per day. [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ] uses recursive partitioning trees to achieve 0.682
classification accuracy on a large topical dataset, albeit using features
unavailable early (favorites count) or no longer available (local publication time), challenging both
scalability and reproducibility. [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] investigated the features of tweets
contributing to retweetability and is the first to explore the impact of negative sentiment
on the diffusion of news on Twitter. We follow [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] to consider affect in our model.
Substantial gains are seen when including network features extracted from the
content graph formed by retweets, or relationship graph formed by "friendships".
The document-level subgraphs to inform prediction are often acquired via
real-time monitoring of the diffusion process. [
          <xref ref-type="bibr" rid="ref39">39</xref>
          ] predicted the popularity of a tweet
through the time-series path of its retweets, using a Bayesian probabilistic model.
[
          <xref ref-type="bibr" rid="ref37">37</xref>
          ] uses a preconditioned recurrent neural network to model the temporal
diffusion, and shows SOTA ranking performance of 0.366 on benchmark datasets. [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]
used temporal evolution patterns to predict the popularity of online UGC. [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]
use temporal and structural features to predict the cascades of photo shares on
Facebook. [
          <xref ref-type="bibr" rid="ref41">41</xref>
          ] model the retweeting cascades as a self-exciting point process. [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]
argues that determining the topic of interest of a user based on their past tweets
might boost predictive accuracy. [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ] studied retweet network propagation trends
using conditional random fields, demonstrating gains in accuracy when
considering social relationships and retweet history. Access to subgraphs on the author
or even document level is, however, strictly limited by social networks; thus
leveraging a tweet's (early) performance, authors' relationships, preferences or retweet
history is prohibitive for a scalable, near real-time prediction on a single tweet.
        </p>
        <p>
          In this study we seek to maximize virality ranking performance. We follow
[
          <xref ref-type="bibr" rid="ref37">37</xref>
          ] to approach the problem as Poisson regression, and [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] to consider tweet
sentiment in prediction. However, in contrast to prior work, we do not
sacrifice scalability or privacy compliance, nor rely on the available retweet count for
ground truth.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Solution overview</title>
      <sec id="sec-3-1">
        <title>Data acquisition</title>
        <p>We use Twitter's Historical APIs to acquire datasets of tweets for training and
validation against other studies. In contrast to sampling Twitter's x-hose,
predominant in prior work, we apply Twitter's PowerTrack search rules to
formulate and collect entire datasets retroactively. The documents are then stored in
a globally distributed NoSQL database, hosted by Microsoft Azure. The data
remains online, exposed to every privacy request applicable.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Privacy compliant storage</title>
        <p>Data analyzed in this study is publicly available during collection. Exactly how
much of it remains public changes rapidly afterwards. Account removal,
suspension, or deletion of a single tweet renders the affected content unavailable for analysis
in a privacy-compliant way. Users exercise their right to be forgotten at an
unprecedented rate. We consume an average of 4,000 such requests per second
via Twitter's Compliance Firehose API and apply them to our storage simultaneously
with analysis. For perspective, the average rate of new tweets published today
is 8,000/s. To support this velocity and rapid feature extraction for dependent
analysis, we choose Azure Cosmos DB as the persistent data store.</p>
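        <p>As an illustration of this storage pattern, the following minimal sketch applies compliance events to a keyed document store before an analysis sample is drawn. The event shapes (<monospace>delete_tweet</monospace>, <monospace>delete_user</monospace>), the field names and the in-memory dictionary standing in for the database are assumptions made for this example; they are not the schema of Twitter's Compliance Firehose or of Azure Cosmos DB.</p>
        <preformat>
```python
from dataclasses import dataclass, field


@dataclass
class TweetStore:
    """Toy stand-in for the central store: tweet id -> document."""
    docs: dict = field(default_factory=dict)

    def add(self, tweet_id, author_id, text):
        self.docs[tweet_id] = {"author_id": author_id, "text": text}

    def apply(self, event):
        """Permanently remove content named by a (hypothetical) compliance event."""
        kind = event["type"]
        if kind == "delete_tweet":
            # A single tweet was deleted by its author.
            self.docs.pop(event["tweet_id"], None)
        elif kind == "delete_user":
            # Account removal or suspension: drop every tweet of that author.
            doomed = [tid for tid, doc in self.docs.items()
                      if doc["author_id"] == event["author_id"]]
            for tid in doomed:
                del self.docs[tid]

    def sample(self):
        """Ids still available for an in-memory analysis sample."""
        return sorted(self.docs)


store = TweetStore()
store.add(1, "alice", "hello")
store.add(2, "bob", "world")
store.add(3, "bob", "again")
store.add(4, "carol", "still here")
store.apply({"type": "delete_tweet", "tweet_id": 1})
store.apply({"type": "delete_user", "author_id": "bob"})
remaining = store.sample()
```
        </preformat>
        <p>Because every analysis sample is drawn from the single store after the events are applied, removed content cannot survive in a stale copy.</p>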
      </sec>
      <sec id="sec-3-3">
        <title>High accuracy labels</title>
        <p>In contrast to prior work, we do not rely on the available retweet count for
training supervision. Twitter's Engagement Totals API is called during data
collection to retrieve the number of retweets and favorites ever registered for
the tweet (including those deleted shortly after). This enables our data collection
effort to focus on unique content only, reducing the document volume required
for the task (and proportional compliance responsibility) by more than half,
while ensuring 100% accuracy of the supervisory signal.</p>
      </sec>
      <sec id="sec-3-4">
        <title>Sentiment analysis</title>
        <p>
          To compute document sentiment, we adopt Text Analytics API from Microsoft
Cognitive Services [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ], a collection of readily consumable ML algorithms in the
cloud. At the time of this study, the service supports 18 languages: English,
Spanish, Portuguese, French, German, Italian, Dutch, Norwegian, Swedish,
Polish, Danish, Finnish, Russian, Greek, Turkish, Arabic, Japanese and Chinese.
The service is for-profit and continuously improving (changing) over time, which
might challenge reproduction. To address this, we share the score of each
document.
        </p>
      </sec>
      <sec id="sec-3-5">
        <title>Compute</title>
        <p>
          We conduct an in-memory analysis of entries no longer personally identifiable.
This prevents fragmentation of sensitive data outside of the central store
exposed to user privacy requests. Instead of anonymizing the datasets, sensitive
or deleted information is eliminated from storage and future analysis as soon as
the request from the user is processed by the social media platform. We dedicate
an Apache Spark cluster to data preprocessing and analysis. Spark is efficient
at iterative computations and is thus well-suited for the development of
large-scale machine learning applications [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ]. Communication performance between
Spark and our privacy-compliant Cosmos DB enables feature extraction at rates
exceeding 65,000 tweets per second. The resulting in-memory dataset is then
aggregated by the Spark master node, equipped with Tesla K80 GPUs (Graphics
Processing Units) for predictive analysis and model tuning. We choose the
LightGBM framework to train our Gradient Boosted Regression Tree and explain the
choice in the following section.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Data collection</title>
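      <p>
        We use the new framework to build multiple datasets across different time
periods for training and evaluation of our models (Table 1).
      </p>
      <table-wrap id="tab-1">
        <label>Table 1</label>
        <table>
          <thead>
            <tr><th>Total</th><th>Unique (acquired)</th><th>Never retweeted</th></tr>
          </thead>
          <tbody>
            <tr><td>2,724,764</td><td>1,319,288</td><td>1,042,411</td></tr>
            <tr><td>9,025,826</td><td>2,804,153</td><td>2,106,475</td></tr>
            <tr><td>8,469,016</td><td>2,736,600</td><td>2,088,377</td></tr>
            <tr><td>27,032,417</td><td>14,788,552</td><td>12,809,021</td></tr>
            <tr><td>19,850,448</td><td>9,719,264</td><td>8,774,009</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>
        Benchmark datasets. We acquire three benchmark datasets, MBI, T2015 and
T2016 (with a total of 6,860,041 unique tweets), to enable comparison with the
work of [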
        <xref ref-type="bibr" rid="ref22 ref25 ref37 ref6">25,22,6,37</xref>
        ]. The datasets match the same filters as applied before (e.g.,
timeframe, language or presence of image attachment) yet result in higher
volume. We follow [
        <xref ref-type="bibr" rid="ref37 ref6">37,6</xref>
        ] to split the tweets into 70% training, 10% validation, and
20% test sets respectively.
      </p>
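      <p>The 70/10/20 split can be sketched as follows. The text does not state how tweets are assigned to the three sets, so the uniform shuffle with a fixed seed used here is an assumption, and <monospace>split_70_10_20</monospace> is a hypothetical helper name.</p>
      <preformat>
```python
import random


def split_70_10_20(tweet_ids, seed=0):
    """Shuffle ids reproducibly and cut them into train/validation/test lists."""
    ids = list(tweet_ids)
    random.Random(seed).shuffle(ids)
    # Integer arithmetic avoids floating-point rounding of the cut points.
    n_train = len(ids) * 70 // 100
    n_valid = len(ids) * 10 // 100
    train = ids[:n_train]
    valid = ids[n_train:n_train + n_valid]
    test = ids[n_train + n_valid:]
    return train, valid, test


train, valid, test = split_70_10_20(range(1000))
sizes = (len(train), len(valid), len(test))
```
      </preformat>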
      <p>Twitter 2017. For the general multilanguage model, we have collected 10 million
unique tweets and used 9.7M of them for predictive analysis, after applying
privacy requests. The dataset has been downsampled from the entire Twitter
2017 volume to the 18 languages supported by the sentiment scoring service, then
further using Twitter PowerTrack's sample and bio operators, to manage the volume
without sacrificing our model's generalization capability over the full year.</p>
      <sec id="sec-4-1">
        <title>Sentiment score and all-time totals</title>
        <p>Retweet counts, favorite counts, and sentiment scores were collected for ca. 30
million unique tweets, simultaneously with applying privacy requests. It is worth
noting that 85% of unique tweets acquired had never been retweeted.</p>
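        <p>The aggregate figures quoted in the text can be checked against the five rows of Table 1. The short calculation below copies the table values; reading the first three rows as the benchmark datasets is an inference from their unique counts summing to 6,860,041, since the row labels are not reproduced here.</p>
        <preformat>
```python
# (total, unique, never_retweeted) per dataset row, copied from Table 1.
TABLE_1 = [
    (2_724_764, 1_319_288, 1_042_411),
    (9_025_826, 2_804_153, 2_106_475),
    (8_469_016, 2_736_600, 2_088_377),
    (27_032_417, 14_788_552, 12_809_021),
    (19_850_448, 9_719_264, 8_774_009),
]

total = sum(row[0] for row in TABLE_1)
unique = sum(row[1] for row in TABLE_1)
never = sum(row[2] for row in TABLE_1)

# Unique tweets in the first three rows (assumed to be MBI, T2015, T2016).
benchmark_unique = sum(row[1] for row in TABLE_1[:3])
# Fraction of all documents that had to be acquired and stored.
unique_share = unique / total
# Fraction of unique tweets that never received a retweet.
never_retweeted_share = never / unique
```
        </preformat>
        <p>Under these inputs the unique tweets amount to roughly 31.4 million, fewer than half of all documents, and about 85% of them were never retweeted, in line with the statements above.</p>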
      </sec>
      <sec id="sec-4-2">
        <title>Feature selection</title>
        <p>Multiple features have been extracted from the rich Twitter metadata, to
capture what is being said (content), by whom (author), when (temporal) and how
(sentiment). Table 2 describes selected features and their Pearson correlation
coefficient with the logarithm of retweet count in T2017-BIO. Only the
information available at the time of acquisition or immediately after is considered,
to maximize the scalability of the solution. Specifically, we do not consider the
early performance of the tweet (i.e., retweet or favorite counts received) or
image-based features at this point.</p>
        <p>
          Some authors (e.g., celebrities) receive more attention than others despite
low activity. We calculate the two author ratio features in an attempt to
isolate such examples. Attachments (like hashtags, mentions, URLs,
images, symbols and videos) compete for viewers' attention with the original
140-character body of the tweet, and their total count is also considered.
Finally, we log-transform selected author features (e.g. author's favorite and listed
counts) due to their power-law distribution [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
We consider the problem of predicting the scale of the retweet cascade for a given
tweet based on data modalities available immediately after its delivery. The
author features are used together with the content, language, and temporal features to
predict the number of future retweets. In this study, we assume the future retweet
count r of a tweet follows a Poisson distribution:
        </p>
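        <p>
          <disp-formula id="eq-1"><label>(1)</label><tex-math>P(R = r \mid \lambda) = \frac{e^{-\lambda} \lambda^{r}}{r!}</tex-math></disp-formula>
where the latent variable <inline-formula><tex-math>\lambda \in \mathbb{R}^{+}</tex-math></inline-formula> defines the mean and variance of the
distribution, and maximize the Poisson log-likelihood given a collection of N training
tuples of tweets <inline-formula><tex-math>t_i</tex-math></inline-formula> and their retweet counts <inline-formula><tex-math>r_{gt,i}</tex-math></inline-formula>.
GBRT is a tree ensemble algorithm which builds one regression tree at a time
by fitting the residual of the trees that preceded it. With our twice-differentiable
loss function, denoted as:
          <disp-formula id="eq-3"><label>(3)</label><tex-math>L_{Poisson}(r_{gt}, t) = -r_{gt} \ln \lambda(t) + \lambda(t)</tex-math></disp-formula>
GBRT minimizes the loss function (regularization term omitted for simplicity):
          <disp-formula id="eq-4"><label>(4)</label><tex-math>L = \sum_{i=1}^{N} L_{Poisson}(r_{gt,i}, F(t_i))</tex-math></disp-formula>
with a function estimation F(t) represented in an additive form:
          <disp-formula id="eq-5"><label>(5)</label><tex-math>F(t) = \sum_{m=1}^{T} f_m(t)</tex-math></disp-formula>
where each <inline-formula><tex-math>f_m(t)</tex-math></inline-formula> is a regression tree and T is the number of trees. GBRT learns
these regression trees in an incremental way: at the m-th stage, fixing the previous
m-1 trees when learning the m-th tree. To construct the m-th tree, GBRT
minimizes the following loss:
          <disp-formula id="eq-6"><label>(6)</label><tex-math>L_m = \sum_{i=1}^{N} L_{Poisson}(r_{gt,i}, F_{m-1}(t_i) + f_m(t_i))</tex-math></disp-formula>
where <inline-formula><tex-math>F_{m-1}(t) = \sum_{k=1}^{m-1} f_k(t)</tex-math></inline-formula>.
        </p>
        <p>
          The optimization problem (6) can be solved by Taylor expansion of the loss
function:
          <disp-formula id="eq-7"><label>(7)</label><tex-math>L_m \approx \tilde{L}_m = \sum_{i=1}^{N} \left[ L_{Poisson}(r_{gt,i}, F_{m-1}(t_i)) + \nabla_i f_m(t_i) + \frac{\nabla_i^2}{2} f_m^2(t_i) \right]</tex-math></disp-formula>
with the gradient and Hessian defined as:
          <disp-formula id="eq-8"><label>(8)</label><tex-math>\nabla_i = \left. \frac{\partial L_{Poisson}(r_{gt,i}, F(t_i))}{\partial F(t_i)} \right|_{F(t_i) = F_{m-1}(t_i)}, \qquad \nabla_i^2 = \left. \frac{\partial^2 L_{Poisson}(r_{gt,i}, F(t_i))}{\partial F(t_i)^2} \right|_{F(t_i) = F_{m-1}(t_i)}</tex-math></disp-formula>
We train our GBRT by minimizing <inline-formula><tex-math>\tilde{L}_m</tex-math></inline-formula>, which is equivalent to minimizing:
          <disp-formula id="eq-9"><label>(9)</label><tex-math>\min_{f_m \in \mathcal{F}} \sum_{i=1}^{N} \frac{\nabla_i^2}{2} \left( f_m(t_i) + \frac{\nabla_i}{\nabla_i^2} \right)^2</tex-math></disp-formula>
This approach is vulnerable to overdispersion and power-law distribution,
characterizing the retweet count. In extreme cases where the Hessian is nearly zero, (9)
approaches positive infinity. To safeguard the optimization, we cap each tree's
weight estimation at 1.5 and follow [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] to use total retweet count as ground-truth
after log-transformation:
          <disp-formula id="eq-10"><label>(10)</label><tex-math>r_{gt} = \ln(r_{total} + 1)</tex-math></disp-formula>
        </p>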
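        <p>To make the second-order step concrete, the sketch below computes the gradient and Hessian of the Poisson loss and the resulting Newton update for a single leaf, then clips it at 1.5. It assumes a log link, with the mean given by exp(F), which the text does not state explicitly, and it reads the cap as a clip on the leaf's step; both are assumptions of this example rather than a description of the authors' implementation. For a leaf with constant output, minimizing the quadratic form above gives the step as minus the sum of gradients divided by the sum of Hessians.</p>
        <preformat>
```python
import math


def poisson_loss(r_gt, score):
    """Poisson loss -r*ln(lambda) + lambda, assuming lambda = exp(score)."""
    return math.exp(score) - r_gt * score


def gradient_and_hessian(r_gt, score):
    """First and second derivative of the loss with respect to the score."""
    lam = math.exp(score)
    return lam - r_gt, lam


def leaf_weight(targets, scores, cap=1.5):
    """Newton step for one leaf, clipped to the interval [-cap, cap]."""
    grads, hessians = zip(*(gradient_and_hessian(r, s)
                            for r, s in zip(targets, scores)))
    step = -sum(grads) / sum(hessians)
    return max(-cap, min(cap, step))


# Ground truth after the log transform r_gt = ln(r_total + 1).
retweet_totals = [0, 0, 0, 250]
targets = [math.log1p(r) for r in retweet_totals]

# Starting from a score of 0 for every tweet, the step is small and uncapped.
well_behaved = leaf_weight(targets, [0.0, 0.0, 0.0, 0.0])

# Scores of -20 make the Hessian exp(-20) almost zero, so the raw step
# explodes and the cap takes effect.
capped = leaf_weight([3.0, 0.0], [-20.0, -20.0])
```
        </preformat>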
      </sec>
      <sec id="sec-4-3">
        <title>Gradient Boosting Framework</title>
        <p>
          LightGBM [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] implementation of GBDT is chosen for the task, due to
distinctive techniques applicable. Experiments on multiple public datasets show
that Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling
(EFB) can accelerate the training process by over 20 times while achieving almost
the same accuracy [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]. Most of all, LightGBM implements a novel
histogram-based algorithm to approximately find the best splits, which is highly scalable on
GPUs [
          <xref ref-type="bibr" rid="ref40">40</xref>
          ]. The framework allows us to explore a substantially larger
hyperparameter space during cross-validation. Finally, LightGBM offers good accuracy with
integer-encoded categorical features by applying [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] to find the optimal split
over categories. This often performs better than one-hot encoding and enables
treating more features as categorical while avoiding dimensionality explosion.
        </p>
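        <p>A configuration along these lines might look like the parameter dictionary below. The parameter names are LightGBM options, but every value is illustrative and none is taken from the study. In particular, mapping the 1.5 cap mentioned earlier to <monospace>poisson_max_delta_step</monospace> is an assumption, and the categorical column names are hypothetical.</p>
        <preformat>
```python
# Hypothetical LightGBM parameters; values are illustrative, not the authors'.
params = {
    "objective": "poisson",         # Poisson regression objective
    "poisson_max_delta_step": 1.5,  # assumed counterpart of the 1.5 cap
    "device_type": "gpu",           # build histograms on the GPU
    "enable_bundle": True,          # Exclusive Feature Bundling (default)
    "max_bin": 63,                  # illustrative histogram resolution
    "num_leaves": 255,              # illustrative tree size
    "learning_rate": 0.05,          # illustrative shrinkage
    "metric": "rmse",               # metric reported during training
}

# Integer-encoded columns to be treated as categorical (hypothetical names).
CATEGORICAL = ["language", "weekday"]
```
        </preformat>
        <p>With LightGBM installed, such a dictionary would typically be passed to <monospace>lightgbm.train</monospace> together with a <monospace>lightgbm.Dataset</monospace> whose <monospace>categorical_feature</monospace> argument lists the categorical columns.</p>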
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Experiments</title>
      <p>We exercise gradient boosted Poisson regression in experiments organized by
datasets, to tune and compare our approach against recent state-of-the-art
methods, before attempting to generalize the prediction across topics and cultures in
the multilingual extended timeframe study.</p>
      <sec id="sec-5-1">
        <title>Evaluation metrics</title>
        <p>
          We compute the Spearman Rho ranking coefficient to measure our model's
ability to rank the content by expected popularity. Interpretation of this coefficient
is domain specific, with guidelines for social/behavioral sciences proposed by [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
SpearmanR from SciPy version 1.4.0 is used to ensure tie handling. We did not
find this concern expressed in prior work. The p-value for all reported Spearman
results is p &lt; 0.001.
        </p>
        <p>
          Relative and absolute measures of fit, R2 and RMSE, are chosen for
optimization, to penalize large errors more heavily (i.e. when underestimating highly viral
content or vice-versa). The mean-absolute-percentage-error (MAPE) is computed
due to its popularity in previous studies [
          <xref ref-type="bibr" rid="ref37">37</xref>
          ], but not considered for tuning. We
dispute MAPE's value relative to the above when fitting the asymmetric, zero-inflated
distribution of the dependent variable (like retweet count). It is undefined for
the majority of examples (Table 1), which never receive a retweet, and penalizes
errors for the least retweeted content more heavily.
        </p>
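        <p>Why tie handling matters on zero-inflated counts can be seen in a small example. The study itself relies on SciPy's implementation; the plain-Python version below is only an illustration of the average-rank convention, and the prediction values are made up for the example.</p>
        <preformat>
```python
import math


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start != len(order):
        end = start
        # Extend the run while the next value equals the current one.
        while end + 1 != len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    covariance = sum(a * b for a, b in zip(dx, dy))
    return covariance / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))


retweets = [0, 0, 0, 1, 5]             # zero-inflated ground truth
predicted = [0.1, 0.2, 0.0, 0.9, 3.0]  # hypothetical model scores
rho = spearman_rho(retweets, predicted)
# MAPE divides by the true value, so it has no value for these examples.
mape_undefined = sum(1 for r in retweets if r == 0)
```
        </preformat>
        <p>The three tweets with zero retweets share one rank, so their relative order in the prediction neither rewards nor penalizes the model arbitrarily, while MAPE cannot be evaluated for any of them.</p>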
      </sec>
      <sec id="sec-5-2">
        <title>Validation on benchmark datasets</title>
        <p>We begin with evaluation of our multimodal GBRT against previous
state-of-the-art methods. For a fair comparison, we use Poisson regression on the joint author,
content and temporal features (ACT), before including sentiment (ACTL). Table
4 demonstrates that our proposed model achieves substantially higher ranking
performance, compared to other content-based methods, even before
considering image and propagation modalities. Using more advanced feature
representations, sentiment score and high accuracy ground-truth, we outperform the
state-of-the-art by more than 37% on multiple datasets.</p>
      </sec>
      <sec id="sec-5-3">
        <title>Multilingual, extended timeframe experiments</title>
        <p>
          We apply our method to the new T2017-BIO dataset to generalize
popularity prediction across languages and time. Tweet t(A, C, T, L) includes content
descriptions C, language descriptions L and is first issued by author A, at the
time T. Table 4 summarizes contributions of these modalities individually and
in combination. The baseline model is trained on a single feature, most popular
in literature: the count of the author's followers, notified about the tweet.
When prioritizing social posts by expected popularity, the model's ranking
performance might take precedence over metrics of overall fit. Interpretation of Spearman and R2
metrics is domain specific. For social/behavioral sciences, reaching 0.5 indicates
strong correlation [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. The final study aimed to explore generalizability of our
method over an extended time-frame and 18 languages. The relative insignificance
of the Temporal modality (Table 4) suggests low correlation between the
time of posting and the content popularity, thereby challenging the common
intuition that posting at the time of the audience's activity helps propagate the
content. We also find that content-based features alone have higher value for
expected popularity ranking than the number of followers. How many people
like you appears less important than what you have to say.
        </p>
        <p>
          Non-linear advanced ML algorithms like deep neural networks and gradient
boosted decision trees are among the most successful methods used today. The
fact is often attributed to the inherent capability of discovering non-linear
relationships between groups of features. It was not necessary in our study to
compute e.g., all cross-products to rival state-of-the-art, and at times we have noticed
a higher cumulative contribution of combined modalities over their individual
gains (Table 4). The size of the audience immediately exposed to the tweet,
measured as the count of the author's followers, remains the single strongest predictor
of tweet popularity when considered in isolation (Figure 2). The number of times
an author has been listed by others, followed others or favorited other content
are also among significant features, open to interpretation. Number of friends is
arguably related to the diversity of content the author is exposed to. We expect
the count of tweets favorited over time (i.e. age of account) to differentiate
active from passive consumers. Assuming the author's influence is measured by her
capacity to spread information in the social network [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ], could the diversity of
content actively consumed over time maximize the author's influence? We propose
this hypothesis for computational social science.
        </p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusions and future work</title>
      <p>
        In this paper, we have studied the problem of predicting tweet popularity under
scalability, explainability and privacy compliance constraints. Our method
estimates the potential reach of a tweet, i.e. the size of retweet cascades, based on
modalities available immediately after document creation. We show it is possible to
rival state-of-the-art results without compromising on explainability, scalability
or privacy compliance. Our Gradient Boosted Regression Tree, combining
available modalities with sentiment score and high accuracy ground-truth, achieves
state-of-the-art results on multiple datasets and is the first to achieve strong [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]
virality ranking performance. In the final round of experiments, we apply our
method to generalize prediction across an extended time-frame in 18 languages and
explain the contribution of each modality.
      </p>
      <p>
        Training the final model on an NVidia Tesla K80 took 10 minutes. Computing
predictions for the 2 million unique tweets in the validation set took another 45
seconds. That's over 44,000 tweets scored per second, with a single GPU.
Assuming incoming tweets are already vectorized, the ACT model deployed on a Tesla
K80 can cope with 5 (five) times today's Twitter volume and velocity. [
        <xref ref-type="bibr" rid="ref37">37</xref>
        ] take
up to 72 additional hours (after data collection) to acquire propagation features
for the prediction. During that time, our model will have predicted popularity
for up to 11 billion tweets.
      </p>
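      <p>The throughput figures above follow from simple arithmetic, reproduced here as a check. The inputs are the numbers quoted in the text; the variable names are ours.</p>
      <preformat>
```python
validation_tweets = 2_000_000  # unique tweets scored
scoring_seconds = 45           # reported scoring time on one GPU
twitter_rate = 8_000           # average new tweets per second quoted earlier

tweets_per_second = validation_tweets / scoring_seconds
# How many times the quoted Twitter rate a single GPU could absorb.
headroom = tweets_per_second / twitter_rate
# Tweets scored while propagation features are still being collected.
scored_in_72_hours = tweets_per_second * 72 * 3600
```
      </preformat>
      <p>This gives roughly 44,400 tweets per second, a little over five times the quoted publication rate, and about 11.5 billion tweets in 72 hours.</p>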
      <sec id="sec-6-1">
        <title>Applications</title>
        <p>Our model is ready for production, with immediate application to social media
monitoring. The proposed framework is extendable to other data modalities (e.g.
visual) and other methods (e.g. deep neural networks). Our privacy-compliant
storage solution is immediately applicable to data collection and analysis from
other social networks that expose privacy signals (e.g. Tumblr and WordPress, with
privacy requests available as compliance interactions from DataSift). Our
approach of focusing analysis on temporary in-memory samples, created ad hoc for
every iteration from a single central persistent store that receives compliance
requests, is applicable to any data sourced from social networks. Our approach of relying
on dedicated APIs for high-accuracy labels, instead of the error-prone counting or
crawling used in prior work, is immediately applicable to Instagram, Tumblr and
Facebook Pages. Our explainable GBRT approach is immediately applicable to
Instagram and Tumblr.</p>
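        <p>The storage pattern described above can be illustrated with a minimal sketch, assuming a plain dictionary stands in for the persistent store. The class CentralStore and its methods are hypothetical names chosen for illustration; they do not reproduce the DataSift compliance API or the storage layer used in this work.</p>
        <preformat>
```python
import random


class CentralStore:
    """Hypothetical single persistent store; a dict stands in for a database."""

    def __init__(self):
        self._tweets = {}

    def add(self, tweet_id, user_id, features):
        self._tweets[tweet_id] = {"user_id": user_id, "features": features}

    def delete_tweet(self, tweet_id):
        """Apply a tweet deletion request."""
        self._tweets.pop(tweet_id, None)

    def delete_user(self, user_id):
        """Apply a user deletion request by dropping all of the user's tweets."""
        doomed = [t for t, rec in self._tweets.items() if rec["user_id"] == user_id]
        for tweet_id in doomed:
            del self._tweets[tweet_id]

    def sample(self, k, seed=None):
        """Draw a temporary in-memory sample; callers discard it after use."""
        rng = random.Random(seed)
        ids = rng.sample(sorted(self._tweets), min(k, len(self._tweets)))
        return [dict(self._tweets[i], tweet_id=i) for i in ids]


store = CentralStore()
store.add("t1", "u1", [0.1, 0.2])
store.add("t2", "u2", [0.3, 0.4])
store.add("t3", "u1", [0.5, 0.6])
store.add("t4", "u3", [0.7, 0.8])

store.delete_user("u1")    # user deletion request
store.delete_tweet("t4")   # tweet deletion request

# Each training iteration draws a fresh sample, so deleted content never reappears.
batch = store.sample(10, seed=0)
print(sorted(rec["tweet_id"] for rec in batch))
```
        </preformat>
        <p>Because compliance requests are applied in one place and samples are rebuilt for every iteration, no long-lived copy of a deleted tweet remains in the analysis pipeline.</p>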
      </sec>
      <sec id="sec-6-2">
        <title>Acknowledgements</title>
        <p>This project is supported by Microsoft Development Center Copenhagen and
the Danish Innovation Fund, Case No. 5189-00089B. We would like to thank
Charlotte Mark, Lars Kai Hansen, Joerg Derungs, Petter Stengard and Uffe
Kjall. Any opinions, findings, conclusions or recommendations expressed in this
material are those of the authors and do not necessarily reflect those of the
sponsors.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Ahmed</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Spagna</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huici</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>A Peek into the Future: Predicting the Evolution of Popularity in User Generated Content</article-title>
          .
          <source>In: Proceedings of the sixth ACM international conference on Web search and data mining</source>
          (
          <year>2013</year>
          ). https://doi.org/10.1145/2433396.2433473
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Barabási</surname>
            ,
            <given-names>A.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pósfai</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Network science</article-title>
          . Cambridge University Press, Cambridge (
          <year>2016</year>
          ), http://barabasi.com/networksciencebook/
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bello-Orgaz</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jung</surname>
            ,
            <given-names>J.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Camacho</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Social big data: Recent achievements and new challenges</article-title>
          .
          <source>Information Fusion</source>
          (
          <year>2016</year>
          ). https://doi.org/10.1016/j.inffus.2015.08.005
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Bunyamin</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tunys</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>A Comparison of Retweet Prediction Approaches: The Superiority of Random Forest Learning Method</article-title>
          .
          <source>TELKOMNIKA (Telecommunication Computing Electronics and Control)</source>
          <volume>14</volume>
          (
          <issue>3</issue>
          ),
          <fpage>1052</fpage>
          (Sep
          <year>2016</year>
          ). https://doi.org/10.12928/telkomnika.v14i3.3150, http://www.journal.uad.ac.id/index.php/TELKOMNIKA/article/view/3150
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Can</surname>
            ,
            <given-names>E.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oktay</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manmatha</surname>
          </string-name>
          , R.:
          <article-title>Predicting retweet count using visual cues</article-title>
          .
          <source>In: Proceedings of the 22nd ACM international conference on Conference on information &amp; knowledge management - CIKM '13</source>
          (
          <year>2013</year>
          ). https://doi.org/10.1145/2505515.2507824
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Cappallo</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mensink</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Snoek</surname>
            ,
            <given-names>C.G.</given-names>
          </string-name>
          :
          <article-title>Latent Factors of Visual Popularity Prediction</article-title>
          .
          <source>In: Proceedings of the 5th ACM on International Conference on Multimedia Retrieval - ICMR '15</source>
          (
          <year>2015</year>
          ). https://doi.org/10.1145/2671188.2749405
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Cha</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haddadi</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Benevenuto</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gummadi</surname>
            ,
            <given-names>K.P.</given-names>
          </string-name>
          :
          <article-title>Measuring User Influence in Twitter: The Million Follower Fallacy</article-title>
          .
          <source>In: ICWSM</source>
          <volume>10</volume>
          (
          <year>2010</year>
          ). https://doi.org/10.1.1.167.192
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Cheng</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Adamic</surname>
            ,
            <given-names>L.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dow</surname>
            ,
            <given-names>P.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kleinberg</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leskovec</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Can Cascades be Predicted?</article-title>
          (Mar
          <year>2014</year>
          ). https://doi.org/10.1145/2566486.2567997, http://arxiv.org/abs/1403.4608
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Cohen</surname>
          </string-name>
          , J.:
          <article-title>Statistical Power Analysis for the Behavioral Sciences</article-title>
          . Lawrence Erlbaum Associates (
          <year>1988</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Dong</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mavroeidis</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Calabrese</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frossard</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Multiscale event detection in social media</article-title>
          .
          <source>Data Mining and Knowledge Discovery</source>
          (
          <year>2015</year>
          ). https://doi.org/10.1007/s10618-015-0421-2
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Feldman</surname>
          </string-name>
          , R.:
          <article-title>Techniques and applications for sentiment analysis</article-title>
          .
          <source>Commun. ACM</source>
          <volume>56</volume>
          (
          <issue>4</issue>
          ),
          <fpage>82</fpage>
          –
          <lpage>89</lpage>
          (Apr
          <year>2013</year>
          ). https://doi.org/10.1145/2436256.2436274, http://doi.acm.org/10.1145/2436256.2436274
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Firdaus</surname>
            ,
            <given-names>S.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ding</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sadeghian</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Retweet prediction considering user's difference as an author and retweeter</article-title>
          .
          <source>In: Proceedings of the 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM 2016</source>
          (
          <year>2016</year>
          ). https://doi.org/10.1109/ASONAM.2016.7752337
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Fisher</surname>
          </string-name>
          , W.D.:
          <article-title>On Grouping For Maximum Homogeneity</article-title>
          . American Statistical Association Journal (
          <year>1958</year>
          ), http://www.csiss.org/SPACE/workshops/2004/SAC/files/fisher.pdf
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Gan</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jiang</surname>
          </string-name>
          , R.:
          <article-title>FLOWER: Fusing global and local associations towards personalized social recommendation</article-title>
          .
          <source>Future Generation Computer Systems</source>
          (
          <year>2018</year>
          ). https://doi.org/10.1016/j.future.2017.02.027
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Gandomi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haider</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Beyond the hype: Big data concepts, methods, and analytics</article-title>
          .
          <source>International Journal of Information Management</source>
          (
          <year>2015</year>
          ). https://doi.org/10.1016/j.ijinfomgt.2014.10.007
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Hansen</surname>
            ,
            <given-names>L.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Arvidsson</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nielsen</surname>
            ,
            <given-names>F.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Colleoni</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Etter</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Good friends, bad news - Affect and virality in twitter</article-title>
          .
          <source>In: Communications in Computer and Information Science</source>
          (
          <year>2011</year>
          ). https://doi.org/10.1007/978-3-642-22309-9_5
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Holzinger</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Biemann</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pattichis</surname>
            ,
            <given-names>C.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kell</surname>
            ,
            <given-names>D.B.</given-names>
          </string-name>
          :
          <article-title>What do we need to build explainable AI systems for the medical domain?</article-title>
          (Dec
          <year>2017</year>
          ), http://arxiv.org/abs/1712.09923
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dong</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yesha</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>A Scalable System for Community Discovery in Twitter During Hurricane Sandy</article-title>
          .
          <source>In: 2014 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing</source>
          . pp.
          <fpage>893</fpage>
          –
          <lpage>899</lpage>
          . IEEE (May
          <year>2014</year>
          ). https://doi.org/10.1109/CCGrid.2014.122, http://ieeexplore.ieee.org/document/6846543/
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Ishiguro</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kimura</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Takeuchi</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Towards automatic image understanding and mining via social curation</article-title>
          .
          <source>In: Proceedings - IEEE International Conference on Data Mining</source>
          , ICDM (
          <year>2012</year>
          ). https://doi.org/10.1109/ICDM.2012.37
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Kaisler</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Armour</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Espinosa</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Money</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Big data: Issues and challenges moving forward</article-title>
          .
          <source>In: Proceedings of the Annual Hawaii International Conference on System Sciences</source>
          (
          <year>2013</year>
          ). https://doi.org/10.1109/HICSS.2013.645
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Ke</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Meng</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Finley</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ye</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>T.Y.</given-names>
          </string-name>
          :
          <article-title>LightGBM: A highly efficient gradient boosting decision tree</article-title>
          .
          <source>Advances in Neural Information Processing Systems</source>
          (
          <year>2017</year>
          ). https://doi.org/10.1046/j.1365-2575.1999.00060.x
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Khosla</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Das Sarma</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hamid</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>What makes an image popular?</article-title>
          <source>In: Proceedings of the 23rd international conference on World wide web - WWW '14</source>
          (
          <year>2014</year>
          ). https://doi.org/10.1145/2566486.2567996
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Kwak</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Park</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moon</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>What is Twitter, a social network or a news media?</article-title>
          <source>In: Proceedings of the 19th international conference on World wide web - WWW '10</source>
          (
          <year>2010</year>
          ). https://doi.org/10.1145/1772690.1772751
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Mazloom</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rietveld</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rudinac</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Worring</surname>
          </string-name>
          , M.,
          <string-name>
            <surname>van Dolen</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Multimodal Popularity Prediction of Brand-related Social Media Posts</article-title>
          .
          <source>In: Proceedings of the 2016 ACM on Multimedia Conference - MM '16</source>
          (
          <year>2016</year>
          ). https://doi.org/10.1145/2964284.2967210
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>McParlane</surname>
            ,
            <given-names>P.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moshfeghi</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jose</surname>
            ,
            <given-names>J.M.</given-names>
          </string-name>
          :
          <article-title>"Nobody comes here anymore, it's too crowded"; Predicting Image Popularity on Flickr</article-title>
          .
          <source>Proceedings of International Conference on Multimedia Retrieval - ICMR '14</source>
          (
          <year>2014</year>
          ). https://doi.org/10.1145/2578726.2578776
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Meng</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bradley</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yavuz</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sparks</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Venkataraman</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Freeman</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tsai</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Amde</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Owen</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xin</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xin</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Franklin</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zadeh</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zaharia</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Talwalkar</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>MLlib: Machine Learning in Apache Spark</article-title>
          .
          <source>J. Mach. Learn. Res</source>
          .
          <volume>17</volume>
          (
          <issue>1</issue>
          ),
          <fpage>1235</fpage>
          –
          <lpage>1241</lpage>
          (Jan
          <year>2016</year>
          ), http://dl.acm.org/citation.cfm?id=2946645.2946679
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Microsoft</surname>
          </string-name>
          :
          <article-title>Cognitive Services APIs reference</article-title>
          . https://westus.dev.cognitive.microsoft.com/docs/services/TextAnalytics.V2.0/operations/56f30ceeeda5650db055a3c9 (
          <year>2017</year>
          ), accessed: 2018-09-05
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <surname>Nesi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pantaleo</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paoli</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zaza</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Assessing the reTweet proneness of tweets: predictive models for retweeting</article-title>
          .
          <source>Multimedia Tools and Applications</source>
          (
          <year>2018</year>
          ). https://doi.org/10.1007/s11042-018-5865-0
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <string-name>
            <surname>Pálovics</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daróczy</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Benczúr</surname>
            ,
            <given-names>A.A.</given-names>
          </string-name>
          :
          <article-title>Temporal prediction of retweet count</article-title>
          .
          <source>In: 4th IEEE International Conference on Cognitive Infocommunications, CogInfoCom 2013 - Proceedings</source>
          (
          <year>2013</year>
          ). https://doi.org/10.1109/CogInfoCom.2013.6719254
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30.
          <string-name>
            <surname>Peng</surname>
            ,
            <given-names>H.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piao</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yan</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , Zhang, Y.:
          <article-title>Retweet modeling using conditional random fields</article-title>
          .
          <source>In: Proceedings - IEEE International Conference on Data Mining</source>
          , ICDM (
          <year>2011</year>
          ). https://doi.org/10.1109/ICDMW.2011.146
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          31.
          <string-name>
            <surname>Peng</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cao</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Niu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jia</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Influence analysis in social networks: A survey</article-title>
          (
          <year>2018</year>
          ). https://doi.org/10.1016/j.jnca.2018.01.005
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          32.
          <string-name>
            <surname>Pezzoni</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>An</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Passarella</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Crowcroft</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Conti</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Why do I retweet it? An information propagation model for microblogs</article-title>
          .
          <source>In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)</source>
          (
          <year>2013</year>
          ). https://doi.org/10.1007/978-3-319-03260-3_31
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          33.
          <string-name>
            <surname>Samek</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wiegand</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          , Muller, K.R.:
          <article-title>Explainable Arti cial Intelligence: Understanding, Visualizing and Interpreting Deep Learning Models</article-title>
          (aug
          <year>2017</year>
          ), http://arxiv.org/abs/1708.08296
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          34.
          <string-name>
            <surname>Sapountzi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Psannis</surname>
            ,
            <given-names>K.E.</given-names>
          </string-name>
          :
          <article-title>Social networking data analysis tools &amp; challenges</article-title>
          .
          <source>Future Generation Computer Systems</source>
          (
          <year>2018</year>
          ). https://doi.org/10.1016/j.future.2016.10.019
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          35.
          <string-name>
            <surname>Sheela</surname>
            ,
            <given-names>L.J.</given-names>
          </string-name>
          :
          <article-title>A Review of Sentiment Analysis in Twitter Data Using Hadoop</article-title>
          .
          <source>International Journal of Database Theory and Application</source>
          (
          <year>2016</year>
          ). https://doi.org/10.14257/ijdta.2016.9.1.07
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          36.
          <string-name>
            <surname>Tan</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pang</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>The effect of wording on message propagation: Topic- and author-controlled natural experiments on Twitter</article-title>
          .
          <source>In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          . pp.
          <fpage>175</fpage>
          –
          <lpage>185</lpage>
          . Association for Computational Linguistics, Baltimore, Maryland (June
          <year>2014</year>
          ), http://www.aclweb.org/anthology/P14-1017
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          37.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bansal</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frahm</surname>
            ,
            <given-names>J.M.</given-names>
          </string-name>
          :
          <article-title>Retweet wars: Tweet popularity prediction via dynamic multimodal regression</article-title>
          .
          <source>In: Proceedings - 2018 IEEE Winter Conference on Applications of Computer Vision, WACV 2018</source>
          (
          <year>2018</year>
          ). https://doi.org/10.1109/WACV.2018.00204
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          38.
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Analyzing and predicting news popularity on Twitter</article-title>
          .
          <source>International Journal of Information Management</source>
          (
          <year>2015</year>
          ). https://doi.org/10.1016/j.ijinfomgt.2015.07.003
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          39.
          <string-name>
            <surname>Zaman</surname>
            ,
            <given-names>T.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Herbrich</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van Gael</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stern</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Predicting Information Spreading in Twitter</article-title>
          .
          <source>In: Workshop on Computational Social Science and the Wisdom of Crowds, NIPS 2010</source>
          (
          <year>2010</year>
          ). https://doi.org/10.1016/j.jclepro.2015.12.007
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          40.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Si</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hsieh</surname>
            ,
            <given-names>C.J.</given-names>
          </string-name>
          :
          <article-title>GPU-acceleration for Large-scale Tree Boosting</article-title>
          (Jun
          <year>2017</year>
          ), http://arxiv.org/abs/1706.08359
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          41.
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Erdogdu</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>He</surname>
            ,
            <given-names>H.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rajaraman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leskovec</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>SEISMIC: A self-exciting point process model for predicting tweet popularity</article-title>
          .
          <source>CoRR abs/1506.02594</source>
          (
          <year>2015</year>
          ), http://arxiv.org/abs/1506.02594
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>