=Paper=
{{Paper
|id=Vol-2881/paper4
|storemode=property
|title=Regression-enhanced Random Forests with Personalized Patching for COVID-19 Retweet Prediction
|pdfUrl=https://ceur-ws.org/Vol-2881/paper4.pdf
|volume=Vol-2881
|authors=Guangyuan Piao,Weipeng Huang
}}
==Regression-enhanced Random Forests with Personalized Patching for COVID-19 Retweet Prediction==
Guangyuan Piao (Maynooth International Engineering College, Department of Computer Science, Maynooth University, Maynooth, Co Kildare, Ireland; guangyuan.piao@mu.ie) and Weipeng Huang (Insight Centre for Data Analytics, School of Computer Science, University College Dublin, Dublin, Ireland; weipeng.huang@insight-centre.org)

ABSTRACT

In this report, we describe an ensemble approach with a set of enhanced random forest models for the COVID-19 retweet prediction challenge at CIKM AnalytiCup 2020, held by the 29th ACM International Conference on Information and Knowledge Management. The proposed approach is based on a global model and a set of personalized models. The global model consists of a set of random forests enhanced by three different types of models: linear regression, feed-forward neural networks, and factorization machines. In addition to this global model, we trained a number of personalized models for users that exist in both the training and test sets and have a sufficient number of tweets for training. Our approach obtained an MSLE (Mean Squared Log Error) value of 0.149997 on the test set of the challenge and ranked 4th on the final leaderboard.

KEYWORDS

COVID-19, Random Forests, Neural Networks, Factorization Machines, Deep Learning, Retweet Prediction, Twitter

1 INTRODUCTION

Retweeting or reposting, a function to share a post such as a tweet with one's followers, is one of the most crucial functionalities of popular social media platforms such as Twitter (https://twitter.com) or Weibo (https://weibo.com), as it enables information spreading on those platforms. Understanding retweet behavior is useful for many applications such as political audience design [8] or fake news spreading and tracking [9]. Therefore, understanding and modeling retweet behavior has been an active research area and might be particularly helpful during times of crisis, such as the current COVID-19 pandemic. In this regard, the COVID-19 retweet prediction challenge, held in conjunction with the 29th ACM International Conference on Information and Knowledge Management, was launched to better understand retweet behavior in the context of COVID-19.

1.1 COVID-19 Retweet Prediction Challenge

The retweet prediction challenge is based on the TweetsCOV19 dataset [2], a publicly available dataset containing more than 8 million tweets related to COVID-19, spanning the period October 2019 to April 2020. The dataset provided by the challenge on top of TweetsCOV19, the problem, and the evaluation metric are given as follows.

Dataset. The dataset of the challenge consists of 8,151,524 COVID-19 related tweets for training, and 961,182 and 961,183 tweets for validation and testing, respectively. In addition, the challenge also provides a set of features for each tweet, such as:

• TweetID of each tweet from Twitter
• Username, i.e., the author of a tweet
• Timestamp of a tweet in the UTC time zone
• #Followers (No. of followers), which indicates the number of followers of the author of a tweet
• #Friends (No. of friends), which indicates the number of friends of the author of a tweet
• #Favorites (No. of favorites), which indicates the number of favorites of a tweet
• Entities and their scores extracted from each tweet using the FEL library [1]
• Sentiment scores of each tweet extracted with SentiStrength (http://sentistrength.wlv.ac.uk/)
• Mentions of other user accounts in each tweet
• Hashtags in each tweet
• URLs in each tweet
• #Retweets (No. of retweets), which indicates the number of retweets of a tweet; this is the target variable for prediction on the validation and test datasets

Problem. Given the set of features for a tweet from TweetsCOV19, the task is to predict the number of times it has been retweeted.
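The evaluation metric, defined next, compares predicted and actual retweet counts on a log scale. As a minimal, hedged sketch (using NumPy; the function name is ours, not part of the challenge code), it can be computed as follows:

```python
import numpy as np

def msle(y_true, y_pred):
    """Mean Squared Log Error between actual and predicted retweet counts."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Compare counts on a log scale, so a few viral tweets do not dominate the error
    return np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)

# Toy example: predictions close to the true counts yield a small MSLE
print(round(msle([0, 10, 100], [0, 12, 90]), 4))  # → 0.0129
```

Note that `log1p` computes ln(1 + x), matching the +1 offset in the metric, which keeps tweets with zero retweets well defined.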
Evaluation metric. Consider the predicted retweet counts ŷ and the actual retweet counts y on the test set, both of length n. Performance is evaluated by MSLE (Mean Squared Log Error):

MSLE(y, ŷ) = (1/n) Σ_{i=1}^{n} (ln(1 + y_i) - ln(1 + ŷ_i))²   (1)

The challenge had two phases, validation and testing, where 51 teams participated in the validation phase and 20 teams participated in the testing phase. In this report, we present our proposed approach for the retweet prediction task in the challenge, which ranked 4th place on the final leaderboard after the testing phase.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). In: Dimitar Dimitrov, Xiaofei Zhu (eds.): Proceedings of the CIKM AnalytiCup 2020, 22 October, 2020, Galway (Virtual Event), Ireland, published at http://ceur-ws.org.

2 PROPOSED APPROACH

Our approach consists of two main components, obtained by splitting users into two groups based on whether a user exists in the training, validation, and test sets. Figure 1 shows an overview of the approach.

2.1 Global model

We use •RF to refer to a RERF (regression-enhanced random forest, defined below in this section) depending on which regression model is used for enhancing the random forest. The global model consists of three types of RERFs, 16 models in total, where the final prediction is the mean of the predicted values from those models:

• A LRRF (Linear Regression-enhanced Random Forest), which denotes a simple linear regression-enhanced random forest model. We used a simple linear regression without an intercept and without regularization, given the large number of examples in the training set. For the corresponding random forest model, we used one with a maximum depth of 20 consisting of 500 estimators/trees.
• Ten NNRFs (Neural Networks-enhanced Random Forests), where each NNRF uses feed-forward neural networks with different hyper-parameters (e.g., the number of hidden layers and neurons) for enhancing the corresponding random forest model. For the corresponding random forest model, we used one with a maximum depth of 18 consisting of 500 estimators.
• Five FMRFs (Factorization Machine-enhanced Random Forests), where four of them are DeepFM (Deep Factorization Machine) [3] models with different hyper-parameters (e.g., the number of iterations or the seed) and one is an xDeepFM [4] model for enhancing the corresponding random forest model. The random forest model consists of 500 estimators, has a maximum depth of 16, and uses a maximum of 50% of the features.

For training, the input of each RERF is a set of feature values (we will discuss the features in Section 2.3) for a tweet together with its number of retweets. Given MSLE as the evaluation metric of the challenge, we further log-transformed the set of feature values and the number of retweets of each tweet before training a RERF.

Figure 1: Overview of our proposed approach based on LRRF (Linear Regression-enhanced Random Forest), NNRF (Neural Networks-enhanced Random Forest), and FMRF (Factorization Machine-enhanced Random Forest), which are introduced in Section 2.1.

As Figure 1 shows, the first group of users consists of those who exist in both the training and the test (and validation) sets with a sufficient number of tweets for training; the rest of the users fall into the second group. First, for the second group of users, we build a global model, which is an ensemble of random forest models enhanced by linear regression, feed-forward neural networks, and factorization machines. Second, for each user in the first group, we build a personalized model using a random forest enhanced by a linear regression model. Next, we discuss the global and personalized models in detail.

The regression-enhanced random forest (RERF) underlying the global model has been introduced recently in [10] to cope with the extrapolation problem of random forests, where predictions on the test set are required at points outside the domain of the training dataset. In contrast to the definition of RERF with a specific regression model (Lasso) in [10], we use a general definition of RERF in this work as follows. Given a training dataset D = {D_i = (x_i, y_i) : i = 1, ..., N}, where N is the size of the training set, let Y = {y_i : i = 1, ..., N} be the set of target values and X = {x_i : i = 1, ..., N} refer to the final set of features (e.g., after manual engineering, transformation, scaling, or adding higher-order or interaction features):

Step 1: Train a regression model f(X) using the training set, and let r_R = Y - f(X) be the residual from f(X). Here, f(X) can be any regression model such as linear regression, Lasso, Ridge, neural networks, or factorization machines, except a tree-based regressor. We then create a new training dataset D_R = {D_Ri = (x_i, r_Ri) : i = 1, ..., N}.
Step 2: Train a random forest model h(X) using the new training set D_R. The hyper-parameters can be predefined or determined with grid search and cross-validation.
Step 3: Given the trained models f(·) and h(·), the RERF prediction ŷ for the response at x̂ is given by ŷ = f(x̂) + h(x̂).

Those RERFs are implemented using the scikit-learn [5] and DeepCTR [7] Python packages. The implementation details can be found in our GitHub repository (https://github.com/parklize/cikm2020-analyticup).

2.2 Patching personalized models

Although the global model captures the overall relationship between the set of features and the retweet count of a tweet, that relationship varies depending on the author of a tweet [6]. Figure 2 shows an example of this variance: the relationship between the number of favorites and the number of retweets for two different users on a log scale. Therefore, for the first group of users, who are in both the training and test sets and have at least 10 tweets for training, a personalized LRRF model is trained for each user, and the prediction of the global model is patched/updated with the prediction from the personalized model.

Figure 2: The relationship between the number of favorites and retweets for two different users on a log scale.

One challenge of training a personalized model is that the number of tweets of a user can be limited, and using all the features used for training the global model can result in overfitting. To cope with this problem, we used only #Favorites as a single feature to learn a personalized model for each user. Also, as tweets with zero values in either #Favorites or #Retweets are not useful for learning a personalized model, we further restrict this group to users who have more than six tweets with nonzero values in both #Favorites and #Retweets. Overall, 236,240 tweets in the test set belong to this category.

Figure 3: Improvement of the performance in terms of MSLE on the validation set when using a regression-enhanced random forest and personalized patching, compared to using a random forest model.

Time features, one of the feature categories described in Section 2.3, consist of features that capture relevant information related to the time when a tweet is posted, such as whether the tweet is posted on a weekend, or on which day of the week.
Sentiment features refer to the positive and negative sentiment scores of a tweet provided by SentiStrength, and their interaction (e.g., the sum of the positive and negative scores).

On the one hand, the above-mentioned personalized LRRFs using a single feature might resolve the problem of overfitting for users with a small number of tweets. On the other hand, we found that those LRRFs can result in underfitting for users who have a large number of tweets for training. Therefore, for the group of users who have more than m tweets with nonzero values in both #Favorites and #Retweets, we use RidgeRFs (i.e., LRRFs with L2 regularization) with all the features used for the global model, where the penalty term is set to 5. We empirically found that m = 160 achieves the best results. Overall, 70,821 tweets in the test set belong to this category.

2.3 Features

On top of the features provided by the challenge for each tweet, which were introduced in Section 1, we extracted 30 features, described in detail in Table 2. The features used for training the models in Sections 2.1 and 2.2 can be classified into four categories: (1) user features, (2) content features, (3) time features, and (4) sentiment features.

User features denote a set of features related to the user/author of a tweet. In addition to the number of followers and friends of a user, we also included the ratio of those two numbers and the total number of tweets posted by the user in the training, validation, and test datasets. The total number of tweets reflects the activity level of a user, and we found that it helped to improve the prediction performance.

Content features include a set of features related to the tweet content that capture different characteristics of the content, for example, the number of favorites that a tweet has and the popularity of entities, hashtags, mentions, and the URL domain in a tweet. The popularity of an entity can be estimated by how many times an entity in a tweet appears across all tweets in the training, validation, and testing datasets. We also noticed that a tweet can be retweeted more when a popular account (e.g., @WHO) is mentioned in the tweet. To incorporate the popularity of mentioned users in a tweet, we used the maximum number of followers and friends of the mentioned users, where the number of followers and friends of each mentioned user has been obtained via the Twitter API.

3 RESULTS

Table 1 shows the results of the top six teams (semi-finalists) according to the MSLE score in the testing phase. As we can see from the table, our team (PH) achieved an MSLE score of 0.149997 on the test set and ranked 4th among 20 teams.

Table 1: Results in terms of MSLE (Mean Squared Log Error) for the semi-finalists of the challenge.

User (Team)                  MSLE
vinayaka (BIAS)              0.120551 (1)
mc-aida (MC-AIDA)            0.121094 (2)
myaunraitau                  0.136239 (3)
parklize (PH)                0.149997 (4)
JimmyChang (GrandMasters)    0.156876 (5)
Thomary                      0.169047 (6)
...                          ...

To investigate whether a regression-enhanced random forest or personalized patching (i.e., updating with personalized models) improves the prediction performance, we tested the prediction results on the validation set using a random forest model, an LRRF, and an LRRF with personalized patching for users who have a sufficient number of tweets for training a personalized model, as described in Section 2.2. Figure 3 shows that the MSLE decreases when using an LRRF as well as when applying personalized patching, which clearly shows the contribution of each component of our approach.

4 CONCLUSION

In this report, we presented an approach using regression-enhanced random forests with personalized patching for the task of COVID-19 retweet prediction. Regression-enhanced random forests with different types of regression models improved the prediction performance compared to using a single regression-enhanced random forest. In addition, personalized patching for those users having a sufficient number of tweets for training a personalized model further improved the performance.
Table 2: Details of the features used in our approach. The features are classified into four categories (with the number of features in each category).

User (4)
- No. of followers: Number of followers that a user has
- No. of friends: Number of friends that a user has
- No. of friends / No. of followers: The ratio of those two numbers
- Number of tweets: No. of tweets posted by a user

Content (20)
- No. of favorites: Number of favorites that a tweet has
- No. of favorites / No. of followers: The ratio of those two numbers
- Has entity: 1 or 0 to denote whether a tweet contains any entity
- Has hashtag: 1 or 0 to denote whether a tweet contains any hashtag
- Has mention: 1 or 0 to denote whether a tweet mentions other users
- Has URL: 1 or 0 to denote whether a tweet contains any URL
- No. of entities: The total number of entities extracted from a tweet
- No. of hashtags: The total number of hashtags in a tweet
- No. of mentions: The total number of mentions in a tweet
- No. of URLs: The total number of URLs in a tweet
- Entity popularity: How many times an entity in a tweet appeared in all tweets (take the maximum value over all entities in a tweet)
- Hashtag popularity: How many times a hashtag in a tweet appeared in all tweets
- Mention popularity: How many times a mentioned user in a tweet appeared in all tweets
- URL domain popularity: How many times the domain of a URL in a tweet appeared in all tweets
- Tweet length: The total number of entities, hashtags, mentions, and URLs
- No. of top 20 entities: Number of top-20 entities from all tweets of a day
- No. of top 20 hashtags: Number of top-20 hashtags from all tweets of a day
- No. of top 20 mentions: Number of top-20 mentioned users from all tweets of a day
- Maximum No. of followers of mentioned users: The maximum number of followers of mentioned users in a tweet
- Maximum No. of friends of mentioned users: The maximum number of friends of mentioned users in a tweet

Time (3)
- Time segment: The time segment of a tweet from {1, ..., 24} indicating when it is posted
- Weekend: 1 or 0 to indicate whether a tweet is posted on a weekend or not
- Day of week: A value from {1, ..., 7} to indicate the i-th day of the week

Sentiment (3)
- Positive sentiment: A score (1 to 5) for the positive sentiment of a tweet
- Negative sentiment: A score (-1 to -5) for the negative sentiment of a tweet
- Overall sentiment: The sum of the positive and negative sentiment scores of a tweet

ACKNOWLEDGEMENTS

We pay our highest respect to the numerous healthcare professionals and volunteers battling the COVID-19 pandemic on the front lines. W. Huang is supported by Science Foundation Ireland under grant number SFI/12/RC/2289_P2.

REFERENCES

[1] Roi Blanco, Giuseppe Ottaviano, and Edgar Meij. 2015. Fast and Space-Efficient Entity Linking in Queries. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining (Shanghai, China) (WSDM '15). ACM, New York, NY, USA, 10 pages.
[2] Dimitar Dimitrov, Erdal Baran, Pavlos Fafalios, Ran Yu, Xiaofei Zhu, Matthäus Zloch, and Stefan Dietze. 2020. TweetsCOV19 - A Knowledge Base of Semantically Annotated Tweets about the COVID-19 Pandemic. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management. Association for Computing Machinery, New York, NY, USA, 2991-2998. https://doi.org/10.1145/3340531.3412765
[3] Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: a factorization-machine based neural network for CTR prediction. arXiv preprint arXiv:1703.04247 (2017).
[4] Jianxun Lian, Xiaohuan Zhou, Fuzheng Zhang, Zhongxia Chen, Xing Xie, and Guangzhong Sun. 2018. xDeepFM: Combining explicit and implicit feature interactions for recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1754-1763.
[5] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825-2830.
[6] Guangyuan Piao and John G. Breslin. 2018. Learning to Rank Tweets with Author-Based Long Short-Term Memory Networks. In International Conference on Web Engineering. Springer, 288-295.
[7] Weichen Shen. 2018. DeepCTR: Easy-to-use, Modular and Extendible package of deep-learning based CTR models. https://github.com/shenweichen/deepctr.
[8] Stefan Stieglitz and Linh Dang-Xuan. 2012. Political communication and influence through microblogging: An empirical analysis of sentiment in Twitter messages and retweet behavior. In 2012 45th Hawaii International Conference on System Sciences. IEEE, 3500-3509.
[9] Soroush Vosoughi, Deb Roy, and Sinan Aral. 2018. The spread of true and false news online. Science 359, 6380 (2018), 1146-1151.
[10] Haozhe Zhang, Dan Nettleton, and Zhengyuan Zhu. 2019. Regression-enhanced random forests. arXiv preprint arXiv:1904.10416 (2019).