=Paper=
{{Paper
|id=Vol-2881/paper4
|storemode=property
|title=Regression-enhanced Random Forests with Personalized Patching for COVID-19 Retweet Prediction
|pdfUrl=https://ceur-ws.org/Vol-2881/paper4.pdf
|volume=Vol-2881
|authors=Guangyuan Piao,Weipeng Huang
}}
==Regression-enhanced Random Forests with Personalized Patching for COVID-19 Retweet Prediction==
Guangyuan Piao (Maynooth International Engineering College, Department of Computer Science, Maynooth University, Maynooth, Co Kildare, Ireland; guangyuan.piao@mu.ie) and Weipeng Huang (Insight Centre for Data Analytics, School of Computer Science, University College Dublin, Dublin, Ireland; weipeng.huang@insight-centre.org)

ABSTRACT

In this report, we describe an ensemble approach with a set of enhanced random forest models for the COVID-19 retweet prediction challenge at CIKM AnalytiCup 2020, held by the 29th ACM International Conference on Information and Knowledge Management. The proposed approach is based on a global model and a set of personalized models. The global model consists of a set of random forests enhanced by three different types of models: linear regression, feed-forward neural networks, and factorization machines. In addition to this global model, we trained a number of personalized models for users that exist in both the training and test sets and have a sufficient number of tweets for training. Our approach obtained an MSLE (Mean Squared Log Error) value of 0.149997 on the test set of the challenge and ranked 4th on the final leaderboard.

KEYWORDS

COVID-19, Random Forests, Neural Networks, Factorization Machines, Deep Learning, Retweet Prediction, Twitter

1 INTRODUCTION

Retweeting or reposting, a function to share a post such as a tweet with one's followers, is one of the most crucial functionalities of popular social media platforms such as Twitter (https://twitter.com) or Weibo (https://weibo.com), as it enables information spreading on those platforms. Understanding retweet behavior is useful for many applications such as political audience design [8] or fake news spreading and tracking [9]. Therefore, understanding and modeling retweet behavior has been an active research area and might be particularly helpful during times of crisis, such as the current COVID-19 pandemic. In this regard, the COVID-19 retweet prediction challenge, held in conjunction with the 29th ACM International Conference on Information and Knowledge Management, was launched to better understand retweet behavior in the context of COVID-19.

1.1 COVID-19 Retweet Prediction Challenge

The retweet prediction challenge is based on the TweetsCOV19 dataset [2], a publicly available dataset containing more than 8 million tweets related to COVID-19, spanning the period October 2019 to April 2020. The dataset provided by the challenge on top of TweetsCOV19, the problem, and the evaluation metric are given as follows.

Dataset. The dataset of the challenge consists of 8,151,524 COVID-19 related tweets for training, and 961,182 and 961,183 tweets for validation and testing, respectively. In addition, the challenge also provides a set of features for each tweet, such as:

• TweetID of each tweet from Twitter
• Username, i.e., the author of a tweet
• Timestamp of a tweet in the UTC time zone
• #Followers (No. of followers), which indicates the number of followers of the author of a tweet
• #Friends (No. of friends), which indicates the number of friends of the author of a tweet
• #Favorites (No. of favorites), which indicates the number of favorites of a tweet
• Entities and their scores extracted from each tweet using the FEL library [1]
• Sentiment scores of each tweet extracted with SentiStrength (http://sentistrength.wlv.ac.uk/)
• Mentions of other user accounts in each tweet
• Hashtags in each tweet
• URLs in each tweet
• #Retweets (No. of retweets), which indicates the number of retweets of a tweet; this is the target variable for prediction on the validation and test datasets

Problem. Given the set of features for a tweet from TweetsCOV19, the task is to predict the number of times it has been retweeted.
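The evaluation metric, defined next, compares predicted and actual retweet counts on a log scale. As a minimal, hedged sketch (using NumPy; the function name is ours, not part of the challenge code), it can be computed as follows:

```python
import numpy as np

def msle(y_true, y_pred):
    """Mean Squared Log Error between actual and predicted retweet counts."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Compare counts on a log scale, so a few viral tweets do not dominate the error
    return np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)

# Toy example: predictions close to the true counts yield a small MSLE
print(round(msle([0, 10, 100], [0, 12, 90]), 4))  # → 0.0129
```

Note that `log1p` computes ln(1 + x), matching the +1 offset in the metric, which keeps tweets with zero retweets well defined.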
Evaluation metric. Consider the predicted retweet counts ŷ and the actual retweet counts y on the test set, both of length n. Performance is evaluated by MSLE (Mean Squared Log Error):

MSLE(y, ŷ) = (1/n) Σ_{i=1}^{n} (ln(1 + y_i) - ln(1 + ŷ_i))²   (1)

The challenge had two phases, validation and testing, where 51 teams participated in the validation phase and 20 teams participated in the testing phase. In this report, we present our proposed approach for the retweet prediction task in the challenge, which ranked 4th place on the final leaderboard after the testing phase.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). In: Dimitar Dimitrov, Xiaofei Zhu (eds.): Proceedings of the CIKM AnalytiCup 2020, 22 October, 2020, Galway (Virtual Event), Ireland, published at http://ceur-ws.org.

2 PROPOSED APPROACH

Our approach consists of two main components, obtained by splitting users into two groups based on whether a user exists in the training, validation, and test sets. Figure 1 shows an overview of the approach.

2.1 Global model

We use •RF to refer to a RERF (regression-enhanced random forest, defined below in this section) depending on which regression model is used for enhancing the random forest. The global model consists of three types of RERFs, 16 models in total, where the final prediction is the mean of the predicted values from those models:

• A LRRF (Linear Regression-enhanced Random Forest), which denotes a simple linear regression-enhanced random forest model. We used a simple linear regression without an intercept and without regularization, given the large number of examples in the training set. For the corresponding random forest model, we used one with a maximum depth of 20 consisting of 500 estimators/trees.
• Ten NNRFs (Neural Networks-enhanced Random Forests), where each NNRF uses feed-forward neural networks with different hyper-parameters (e.g., the number of hidden layers and neurons) for enhancing the corresponding random forest model. For the corresponding random forest model, we used one with a maximum depth of 18 consisting of 500 estimators.
• Five FMRFs (Factorization Machine-enhanced Random Forests), where four of them are DeepFM (Deep Factorization Machine) [3] models with different hyper-parameters (e.g., the number of iterations or the seed) and one is an xDeepFM [4] model for enhancing the corresponding random forest model. The random forest model consists of 500 estimators, has a maximum depth of 16, and uses a maximum of 50% of the features.

For training, the input of each RERF is a set of feature values (we will discuss the features in Section 2.3) for a tweet together with its number of retweets. Given MSLE as the evaluation metric of the challenge, we further log-transformed the set of feature values and the number of retweets of each tweet before training a RERF.

Figure 1: Overview of our proposed approach based on LRRF (Linear Regression-enhanced Random Forest), NNRF (Neural Networks-enhanced Random Forest), and FMRF (Factorization Machine-enhanced Random Forest), which are introduced in Section 2.1.

As Figure 1 shows, the first group of users consists of those who exist in both the training and the test (and validation) sets with a sufficient number of tweets for training; the rest of the users fall into the second group. First, for the second group of users, we build a global model, which is an ensemble of random forest models enhanced by linear regression, feed-forward neural networks, and factorization machines. Second, for each user in the first group, we build a personalized model using a random forest enhanced by a linear regression model. Next, we discuss the global and personalized models in detail.

The regression-enhanced random forest (RERF) underlying the global model has been introduced recently in [10] to cope with the extrapolation problem of random forests, where predictions on the test set are required at points outside the domain of the training dataset. In contrast to the definition of RERF with a specific regression model (Lasso) in [10], we use a general definition of RERF in this work as follows. Given a training dataset D = {D_i = (x_i, y_i) : i = 1, ..., N}, where N is the size of the training set, let Y = {y_i : i = 1, ..., N} be the set of target values and X = {x_i : i = 1, ..., N} refer to the final set of features (e.g., after manual engineering, transformation, scaling, or adding higher-order or interaction features):

Step 1: Train a regression model f(X) using the training set, and let r_R = Y - f(X) be the residual from f(X). Here, f(X) can be any regression model such as linear regression, Lasso, Ridge, neural networks, or factorization machines, except a tree-based regressor. We then create a new training dataset D_R = {D_Ri = (x_i, r_Ri) : i = 1, ..., N}.
Step 2: Train a random forest model h(X) using the new training set D_R. The hyper-parameters can be predefined or determined with grid search and cross-validation.
Step 3: Given the trained models f(·) and h(·), the RERF prediction ŷ for the response at x̂ is given by ŷ = f(x̂) + h(x̂).

Those RERFs are implemented using the scikit-learn [5] and DeepCTR [7] Python packages. The implementation details can be found in our GitHub repository (https://github.com/parklize/cikm2020-analyticup).

2.2 Patching personalized models

Although the global model captures the overall relationship between the set of features and the retweet count of a tweet, that relationship varies depending on the author of a tweet [6]. Figure 2 shows an example of this variance: the relationship between the number of favorites and the number of retweets for two different users on a log scale. Therefore, for the first group of users, who are in both the training and test sets and have at least 10 tweets for training, a personalized LRRF model is trained for each user, and the prediction of the global model is patched/updated with the prediction from the personalized model.

Figure 2: The relationship between the number of favorites and retweets for two different users on a log scale.

One challenge of training a personalized model is that the number of tweets of a user can be limited, and using all the features used for training the global model can result in overfitting. To cope with this problem, we used only #Favorites as a single feature to learn a personalized model for each user. Also, as tweets with zero values in either #Favorites or #Retweets are not useful for learning a personalized model, we further restrict this group to users who have more than six tweets with nonzero values in both #Favorites and #Retweets. Overall, 236,240 tweets in the test set belong to this category.

Figure 3: Improvement of the performance in terms of MSLE on the validation set when using a regression-enhanced random forest and personalized patching, compared to using a random forest model.

Time features, one of the feature categories described in Section 2.3, consist of features that capture relevant information related to the time when a tweet is posted, such as whether the tweet is posted on a weekend, or on which day of the week.
Sentiment features refer to the positive and negative sentiment scores of a tweet provided by SentiStrength, and their interaction (e.g., the sum of the positive and negative scores).

On the one hand, the above-mentioned personalized LRRFs using a single feature might resolve the problem of overfitting for users with a small number of tweets. On the other hand, we found that those LRRFs can result in underfitting for users who have a large number of tweets for training. Therefore, for the group of users who have more than m tweets with nonzero values in both #Favorites and #Retweets, we use RidgeRFs (i.e., LRRFs with L2 regularization) with all the features used for the global model, where the penalty term is set to 5. We empirically found that m = 160 achieves the best results. Overall, 70,821 tweets in the test set belong to this category.

2.3 Features

On top of the features provided by the challenge for each tweet, which were introduced in Section 1, we extracted 30 features, described in detail in Table 2. The features used for training the models in Sections 2.1 and 2.2 can be classified into four categories: (1) user features, (2) content features, (3) time features, and (4) sentiment features.

User features denote a set of features related to the user/author of a tweet. In addition to the number of followers and friends of a user, we also included the ratio of those two numbers and the total number of tweets posted by the user in the training, validation, and test datasets. The total number of tweets reflects the activity level of a user, and we found that it helped to improve the prediction performance.

Content features include a set of features related to the tweet content that capture different characteristics of the content, for example, the number of favorites that a tweet has and the popularity of entities, hashtags, mentions, and the URL domain in a tweet. The popularity of an entity can be estimated by how many times an entity in a tweet appears across all tweets in the training, validation, and testing datasets. We also noticed that a tweet can be retweeted more when a popular account (e.g., @WHO) is mentioned in the tweet. To incorporate the popularity of mentioned users in a tweet, we used the maximum number of followers and friends of the mentioned users, where the number of followers and friends of each mentioned user has been obtained via the Twitter API.

3 RESULTS

Table 1 shows the results of the top six teams (semi-finalists) according to the MSLE score in the testing phase. As we can see from the table, our team (PH) achieved an MSLE score of 0.149997 on the test set and ranked 4th among 20 teams.

Table 1: Results in terms of MSLE (Mean Squared Log Error) for the semi-finalists of the challenge.

User (Team)                  MSLE
vinayaka (BIAS)              0.120551 (1)
mc-aida (MC-AIDA)            0.121094 (2)
myaunraitau                  0.136239 (3)
parklize (PH)                0.149997 (4)
JimmyChang (GrandMasters)    0.156876 (5)
Thomary                      0.169047 (6)
...                          ...

To investigate whether a regression-enhanced random forest or personalized patching (i.e., updating with personalized models) improves the prediction performance, we tested the prediction results on the validation set using a random forest model, an LRRF, and an LRRF with personalized patching for users who have a sufficient number of tweets for training a personalized model, as described in Section 2.2. Figure 3 shows that the MSLE decreases when using an LRRF as well as when applying personalized patching, which clearly shows the contribution of each component of our approach.

4 CONCLUSION

In this report, we presented an approach using regression-enhanced random forests with personalized patching for the task of COVID-19 retweet prediction. Regression-enhanced random forests with different types of regression models improved the prediction performance compared to using a single regression-enhanced random forest. In addition, personalized patching for those users having a sufficient number of tweets for training a personalized model further improved the performance.
Table 2: Details of the features used in our approach. The features are classified into four categories (with the number of features in each category).

User (4)
- No. of followers: Number of followers that a user has
- No. of friends: Number of friends that a user has
- No. of friends / No. of followers: The ratio of those two numbers
- Number of tweets: No. of tweets posted by a user

Content (20)
- No. of favorites: Number of favorites that a tweet has
- No. of favorites / No. of followers: The ratio of those two numbers
- Has entity: 1 or 0 to denote whether a tweet contains any entity
- Has hashtag: 1 or 0 to denote whether a tweet contains any hashtag
- Has mention: 1 or 0 to denote whether a tweet mentions other users
- Has URL: 1 or 0 to denote whether a tweet contains any URL
- No. of entities: The total number of entities extracted from a tweet
- No. of hashtags: The total number of hashtags in a tweet
- No. of mentions: The total number of mentions in a tweet
- No. of URLs: The total number of URLs in a tweet
- Entity popularity: How many times an entity in a tweet appeared in all tweets (take the maximum value over all entities in a tweet)
- Hashtag popularity: How many times a hashtag in a tweet appeared in all tweets
- Mention popularity: How many times a mentioned user in a tweet appeared in all tweets
- URL domain popularity: How many times the domain of a URL in a tweet appeared in all tweets
- Tweet length: The total number of entities, hashtags, mentions, and URLs
- No. of top 20 entities: Number of top-20 entities from all tweets of a day
- No. of top 20 hashtags: Number of top-20 hashtags from all tweets of a day
- No. of top 20 mentions: Number of top-20 mentioned users from all tweets of a day
- Maximum No. of followers of mentioned users: The maximum number of followers of mentioned users in a tweet
- Maximum No. of friends of mentioned users: The maximum number of friends of mentioned users in a tweet

Time (3)
- Time segment: The time segment of a tweet from {1, ..., 24} indicating when it is posted
- Weekend: 1 or 0 to indicate whether a tweet is posted on a weekend or not
- Day of week: A value from {1, ..., 7} to indicate the i-th day of the week

Sentiment (3)
- Positive sentiment: A score (1 to 5) for the positive sentiment of a tweet
- Negative sentiment: A score (-1 to -5) for the negative sentiment of a tweet
- Overall sentiment: The sum of the positive and negative sentiment scores of a tweet

ACKNOWLEDGEMENTS

We pay our highest respect to the numerous healthcare professionals and volunteers battling the COVID-19 pandemic on the front lines. W. Huang is supported by Science Foundation Ireland under grant number SFI/12/RC/2289_P2.

REFERENCES

[1] Roi Blanco, Giuseppe Ottaviano, and Edgar Meij. 2015. Fast and Space-Efficient Entity Linking in Queries. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining (Shanghai, China) (WSDM '15). ACM, New York, NY, USA, 10 pages.
[2] Dimitar Dimitrov, Erdal Baran, Pavlos Fafalios, Ran Yu, Xiaofei Zhu, Matthäus Zloch, and Stefan Dietze. 2020. TweetsCOV19 - A Knowledge Base of Semantically Annotated Tweets about the COVID-19 Pandemic. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management. Association for Computing Machinery, New York, NY, USA, 2991-2998. https://doi.org/10.1145/3340531.3412765
[3] Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: a factorization-machine based neural network for CTR prediction. arXiv preprint arXiv:1703.04247 (2017).
[4] Jianxun Lian, Xiaohuan Zhou, Fuzheng Zhang, Zhongxia Chen, Xing Xie, and Guangzhong Sun. 2018. xDeepFM: Combining explicit and implicit feature interactions for recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1754-1763.
[5] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825-2830.
[6] Guangyuan Piao and John G. Breslin. 2018. Learning to Rank Tweets with Author-Based Long Short-Term Memory Networks. In International Conference on Web Engineering. Springer, 288-295.
[7] Weichen Shen. 2018. DeepCTR: Easy-to-use, Modular and Extendible package of deep-learning based CTR models. https://github.com/shenweichen/deepctr.
[8] Stefan Stieglitz and Linh Dang-Xuan. 2012. Political communication and influence through microblogging: An empirical analysis of sentiment in Twitter messages and retweet behavior. In 2012 45th Hawaii International Conference on System Sciences. IEEE, 3500-3509.
[9] Soroush Vosoughi, Deb Roy, and Sinan Aral. 2018. The spread of true and false news online. Science 359, 6380 (2018), 1146-1151.
[10] Haozhe Zhang, Dan Nettleton, and Zhengyuan Zhu. 2019. Regression-enhanced random forests. arXiv preprint arXiv:1904.10416 (2019).