=Paper=
{{Paper
|id=Vol-2881/paper2
|storemode=property
|title=Word and Graph Embeddings for COVID-19 Retweet Prediction
|pdfUrl=https://ceur-ws.org/Vol-2881/paper2.pdf
|volume=Vol-2881
|authors=Tam T. Nguyen,Karamjit Singh,Sangam Verma,Hardik Wadhwa,Siddharth Vimal,Lalasa Dheekollu,Sheng Jie Lui,Divyansh Gupta,Dong Yang Yin,Zha Wei
}}
==Word and Graph Embeddings for COVID-19 Retweet Prediction==
Tam T. Nguyen, Karamjit Singh, Sangam Verma, Hardik Wadhwa, Siddharth Vimal, Lalasa Dheekollu, Sheng Jie Lui, Divyansh Gupta, Dong Yang Jin, Zha Wei

Corresponding author: Tam T. Nguyen. E-mail: nthanhtam@gmail.com

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). In: Dimitar Dimitrov, Xiaofei Zhu (eds.): Proceedings of the CIKM AnalytiCup 2020, 22 October 2020, Galway (Virtual Event), Ireland, 2020, published at http://ceur-ws.org.

===Abstract===
In this paper, we present our solution for the COVID-19 retweet prediction challenge. The proposed approach consists of feature engineering and modeling. For feature engineering, we leverage both hand-crafted and unsupervised learning features. As the provided data set is large, we implement auto-encoding algorithms to reduce the feature dimension. To develop predictive models, we utilize ensemble learning and deep learning algorithms. We then combine these models to generate the final blended model. Moreover, to stabilize the predictions, we also apply bagging as well as down-sampling techniques to remove tweets whose number of retweets equals zero. Our solution is ranked […] on the public test set and […] on the private test set.

===1 Introduction===
Machine learning and text mining have proved to be powerful tools for processing unstructured text data and making short-term predictions. Whether they are still helpful during a time of crisis, e.g. the ongoing Coronavirus disease 2019 (COVID-19), is still in question. In this paper, we take advantage of both text mining and machine learning methods to build more accurate models to predict the number of retweets on Twitter.

Our proposed approach consists of two major parts: feature engineering and modelling. For feature engineering, we leverage hand-crafted features and automatic feature learning based on neural network models. We use many state-of-the-art algorithms in text mining and machine learning. For example, we rely on word2vec [9], doc2vec [9], and BERT [4] to learn embedding features. For graph features, we utilize the node2vec and PyTorch-BigGraph [6] algorithms to extract features. Finally, we build predictive models using emerging boosting algorithms such as CatBoost [1] and LightGBM [5]. Our solution is released as open source on GitHub (https://github.com/nthanhtam/cikm-cup-2020.git).

===2 Descriptive Analysis===
The provided TweetsCOV19 dataset [3] contains extracted tweet instances whose attributes can be broadly divided into three types: 1) Metadata: tweet ID, username, and timestamp; 2) Connection: number of followers, number of friends, and number of favorites; and 3) Tweet text based: entities, sentiment, mentions, hashtags, and URLs. The goal is to predict the number of retweets for each tweet. The train, validation, and test datasets are provided sequentially with respect to time. The training data covers tweet instances from October 2019 to April 2020.

There are approximately 3.6M unique users in the train data and 300k common users between the train and test data. On average, the total number of tweets and retweets remains constant from October 2019 to February 2020, as shown in Figure 1(a), and a sharp increase is seen in March 2020. The number of tweets is also higher in certain hours of the day (Figure 1(b)), which might be due to the distribution of the data over day and night, depending on the geographical location. We also observe a negative correlation between the number of mentions and the number of retweets per tweet, as shown in Figure 1(d).

Figure 1: Retweets per Tweet against Number of Mentions

Exploring the number of mentions against retweets per tweet yields two key observations, as shown in Figure 1(d). First, for zero mentions, the average number of retweets per tweet is the highest. This might be because influential people use fewer mentions and get retweeted far more than average. Second, as the number of mentions increases, there is a downward trend in retweets per tweet.

===3 Feature Engineering===
Our feature set consists of both hand-crafted features and unsupervised learning features.

====3.1 User-level Feature====
We aggregate data by user to generate user-level features. For example, we group data by user and concatenate all tweets belonging to each user. We then generate features for them as follows.

=====3.1.1 Hand-crafted user feature=====
* Followers and friends ratio. Along with the number of followers and friends, we derive the ratio followers/(friends+1).
* Tweet count. The number of tweets for each user is calculated.
* Statistical features. Since the followers, friends, and favourites of a given user vary over time, various statistics such as sum, mean, median, min, max, and standard deviation are used to aggregate these columns separately for each user.
* Time based features. The number of tweets made by a user in a particular month and hour, the number of different hours in which a user tweeted, and the average time between two consecutive tweets are used as features. The total number of unique users who tweeted on a particular date and hour is also calculated.
* User sentiments. For each user, the mean and standard deviation are calculated for both positive and negative sentiments. In addition, the number of tweets for each value of positive and negative sentiment is calculated per user.

=====3.1.2 Hashtags/Mentions=====
There are two kinds of hashtag/mention features in our models.

'''Hand-crafted Feature.''' We calculate count features from hashtags and mentions as follows.
* The number of unique hashtags and mentions used by a user is given as input for each tweet of that user.
* The number of tweets in which each hashtag or mention occurs.

'''Word2Vec Features.''' There are only a few mentions per tweet in the given data set. Most of the time, users do not mention anyone in their tweets. If we used mentions directly as features (e.g. with one-hot encoding), the feature space would be very sparse. Therefore, we propose to learn dense embedding features for mentions by treating each mention as a word and training a word2vec model of dimension 64 using Gensim [9].

=====3.1.3 Entity Embedding Feature=====
In this section, we describe how we extract embedding features from entities:
* Pre-process entity triples and keep matched entities only.
* Aggregate entities by concatenating them per user.
* Train different embedding models with various parameters such as embedding size and vocabulary size.
* Extract embedding features for each user.

For embedding models, we try word2vec, doc2vec, and BERT. The following paragraphs describe how we train these models.

'''Doc2Vec.''' To generate doc2vec features [9], we treat all entities of a user as a document. We then train doc2vec models with different embedding sizes (32, 64, 128, and 256), extract features, and train our predictive models. Based on the experimental results on our internal validation set, we choose an embedding size of 64, as it gives the best score.

'''BERT.''' As with the doc2vec model, we aggregate entities by user and treat them as documents. We use the top attention layer of the BERT model to extract features. As this layer has 1024 dimensions, we reduce its dimension using:
* Truncated SVD [8] with 55 components.
* Autoencoder: we train an encoder-decoder neural network with input and output size 1024. We try different hidden layer sizes and choose 128, which gives good performance on both auto-encoding and predictive models.

=====3.1.4 URL Feature=====
* Number of unique URLs used by a user.
* Number of legitimate URLs in the tweet.
* The URL field is fitted and transformed with a Count Vectorizer [8]. The result is given as input to an SVD (Singular Value Decomposition) with 5 principal components to obtain 5 URL_SVD features.
* The domain name and suffix (the URL part that follows the domain name) are extracted for each URL in a given tweet, and the following features are created:
** (1) A flag that indicates the presence of the word “Twitter” in the domain name.
** (2) For each domain name, the number of tweets with that domain name is recorded. If a tweet has multiple URLs, statistics such as min, max, and average are computed over the domain frequencies for that tweet.
* The same features are created for the suffix.

=====3.1.5 Sentiment Feature=====
* For each tweet, we take the product of the number of followers and the sentiment values, both positive and negative. For negative sentiment values, the absolute value is used.
* Similarly, the product of the number of friends and the sentiment values is used as a feature.

====3.2 Tweet-level Feature====
For tweet-level features, we generate features using label encoding for user ID, hashtag, mention, and URL. We also use some features given in the raw data directly, such as favourites, friends, and followers. Note that we use the label encoded features directly only in tree-based/boosting models. For neural network models, we use an embedding layer on top of them.

====3.3 Trend-level Feature====
Text based attributes of tweets such as hashtags {h_i}, mentions {m_i}, and entities {e_i} can have a time based trend, and each tweet can carry multiple values of each attribute. We therefore create features to capture those trends. For the purpose of explanation, we use {a_i} to stand for any of {h_i}, {m_i}, and {e_i}. We first create features for each a_i and, since one tweet can have multiple a_i, we then aggregate these features to the tweet level.
* Score: number of tweets that used a_i on that day.
* Age: number of days since a_i came into existence.
* Life: number of days a_i has been active through the course of the data.
* Trend: change in the score of a_i from the previous day.
* Peak: fraction of tweets of a_i on this day.

The aggregate tweet-level features are:
* Score based features: we take the average and max of the score of all a_i in a tweet.
* Weighted score features: instead of the plain average and max of scores, we take a weighted average with two types of weights: the inverse of “Age” and the inverse of “Life” of a_i.
* Age and life based features: we take the average, min, and max of “Age” of all a_i in a tweet attribute. Similar features are created for “Life”.
* Trend: we take the average and max trend of all a_i in a tweet attribute.
* Peak: we take the average and max of the peak values of all a_i in a tweet.

====3.4 Graph-based Feature====
In this section, we introduce our graph-based feature set. To generate the features, we build several kinds of graphs: a user-entity graph (UE), a user-hashtag graph (UH), and a combined user-hashtag-URL-mention graph (CG).

=====3.4.1 UE=====
To generate graph features, we build a user-entity graph as follows: (i) each node represents a user or an entity and (ii) an edge is created between a user and an entity if the user tweets about that entity. We use PyTorch-BigGraph [6] to generate user-level embeddings of size 64, which we use as features in our models.

=====3.4.2 UH=====
Similar to the user-entity graph, we build a user-hashtag graph as follows: (i) each node represents a user or a hashtag and (ii) an edge is created between a user and a hashtag if the user uses that hashtag in a tweet. As for UE, we use PyTorch-BigGraph [6] to generate user-level embeddings of size 64 and use them as features.

=====3.4.3 CG=====
This graph is similar to the two graphs above, but we replace entities and hashtags by mentions and URLs. The edges of the graph are determined by whether users use mentions or URLs in their tweets.

In addition to graph embedding features, we use traditional graph algorithms to extract information about users from the graphs, such as centrality and average neighbor degree. Moreover, we also train node2vec [7] models to extract embedding features.

===4 Modeling Approach===
We use the hold-out testing technique to validate our models, where 80% of the data is used for training and 20% for testing. Our final solution is an ensemble of 3 popular machine learning algorithms: CatBoost [1], LightGBM [5], and neural networks using Keras [2].

====4.1 Boosting Models====
We use CatBoost (CAT) [1] and LightGBM (LGB) [5] to train two separate models on the features discussed in Section 3. For LightGBM, we also use bagging by repeating the training with various random seeds. Doing so stabilizes its predictions on the validation and test sets. We discuss the parameters of these algorithms further in the next section.

====4.2 Neural Network Models====

=====4.2.1 Convolutional Neural Networks (CNN)=====
Figure 2 shows the architecture of the CNN and feed forward based model. The inputs to the model are all numerical features discussed above and embeddings for entities, hashtags, and mentions. We also use graph based features derived from the user-hashtag graph, as described in Section 3.4.2. 1D convolutional layers are used to extract features from local input patches, allowing for representation modularity and data efficiency. We use 64 filters for each variable, followed by max pooling. We then concatenate the outputs of these CNN layers. Simultaneously, we pass categorical features such as hour, weekday, month, user ID, hashtag, and entity to the embedding layer. We concatenate all of these layers and pass them to a feed forward neural network to obtain the final retweet count.

Figure 2: CNN Model Architecture

=====4.2.2 Embedding Neural Networks Model=====
To develop the embedding neural network model, we classify features into two categories: numerical and categorical. For numerical features such as the number of friends and the number of followers, we apply a log transformation. For categorical features such as mentions, hashtags, and entities, we use an embedding layer.

====4.3 Common-User Model====
As discussed in Section 2, there are about 50% common users in the train and test sets. This motivates training a model specific to the common users only. Apart from the features discussed in Section 3, we create some additional features specific to the common-user model. We use CatBoost to train the model.

'''K-Fold Target Encoding.''' Target encoding for each user is created to capture the popularity of a user, but a model using traditional target encoding (taking the mean of the retweets over all tweets of a user) tends to overfit the training data set and did not perform well. To tackle this problem, we use 5-fold target encoding.

'''Moving Averages.''' To capture the trend of a particular user, whether a gain or a loss in popularity over time, we create trend based features. For a user on a given day, we compute the moving averages of the number of followers and friends over the last 3, 5, and 10 tweets. These are then used directly as input features of the model.

====4.4 K-fold Down Sampling====
We notice that most tweets have no retweets. If we use the whole training data to train models, the predictions on the validation and test sets tend to be smaller than the actual number of retweets. Hence, we down-sample these tweets using k folds as follows:
* Split the training data into two sets: A (retweets > 0) and B (retweets = 0).
* Apply k-fold cross validation on set B to select 80% of the data as a new data set C. Merge sets A and C to train predictive models and make predictions on the validation and test sets. In this case, we have 5 predictions in total, one for each fold.
* The final prediction for the validation and test sets is the average of the predictions of the k folds.

====4.5 Ensemble====
In practice, combining multiple machine learning models helps improve the performance of the final model. The simplest way to do that is to take a weighted average of the predictions of selected models. Our ensemble prediction is a linear combination of the CatBoost, LightGBM, and neural network models. We estimate the weights based on the performance of the models on the leaderboard; the weights of CatBoost, LightGBM, and neural networks are 0.6, 0.3, and 0.1, respectively.

===5 Experiments===
In this section, we discuss the experiment settings, including the hyper-parameters selected for each model, and the performance of each model in different settings based on our internal validation set.

====5.1 Experiment Settings====
We use two boosting based models, CatBoost and LightGBM. We use a larger learning rate and fewer trees for LightGBM (LGB parameters: num_leaves: 35, max_depth: 8, min_child_sample: 100, subsample: 0.7, colsample_bytree: 0.7, n_estimators: 15k, learning rate: 0.2) and a smaller learning rate for CatBoost (CAT parameters: iterations: 53k, border_count: 254, max_depth: 11, learning rate: 0.03).

====5.2 Results====
As mentioned in Section 3.4, we generate two types of embeddings from the following graphs: the user-hashtag graph (UH) and the user-entity graph (UE), with an embedding size of 64 in both cases. Table 2 shows the performance of the CatBoost model on different feature sets. In this setting, we study the impact of graph based features on the performance of our models. Without any graph features, CatBoost (CAT) has an error of 0.12494. By using graph features such as UH and UE, we can improve the model by 0.0015. This means that adding the graph embeddings improves the performance of the models.

Table 1 shows the performance improvement obtained by adding the k-fold and common-model strategies to the baseline models. It shows that all our base models improve with these strategies. Further, Table 3 shows the performance of various ensemble approaches; the ensemble of all four models yields the best performance. We use the weighted average approach to combine our predictive models. Please note that we do not have results for every model. We only chose important models to submit and get a score.

{| class="wikitable"
|+ Table 1: Performance Comparison of Various Strategies
! rowspan="2" | Models
! colspan="2" | CAT
! colspan="2" | LGB
! colspan="2" | NN
! colspan="2" | CNN
|-
! val !! test !! val !! test !! val !! test !! val !! test
|-
| base || 0.12344 || 0.12433 || 0.13582 || - || 0.15693 || - || 0.12763 || 0.13139
|-
| base+common-model || 0.12286 || 0.12374 || - || - || - || - || - || -
|-
| base+kfold || 0.12279 || 0.12366 || 0.13127 || - || - || - || 0.12649 || 0.13018
|-
| base+kfold+common-model || 0.12164 || 0.12285 || - || - || - || - || - || -
|}

{| class="wikitable"
|+ Table 2: Graph Features on Internal Validation
! Model + Embedding !! CAT !! CAT + UH !! CAT + UE
|-
| Validation Score || 0.12494 || 0.12464 || 0.12343
|}

As we only know the score after submitting our predictions, this table only shows part of all results. Note that we did not submit the CAT∗∗ + CNN∗ ensemble in the validation phase, so we do not have its score. Based on the results, one can see that adding more models can improve the accuracy. For instance, the CAT∗∗ + CNN∗ + LGB∗ ensemble works better than CAT∗ + LGB∗, and our final ensemble consists of 4 algorithms.

{| class="wikitable"
|+ Table 3: Comparison of Various Ensemble Approaches
! Model !! Valid Score !! Test Score
|-
| CAT∗∗ + CNN∗ || - || 0.121874
|-
| CAT∗∗ + LGB∗ || 0.12518 || -
|-
| CAT∗∗ + CNN∗ + LGB∗ || 0.12511 || -
|-
| CAT∗∗ + CNN∗ + NN∗ + LGB∗ || - || 0.121094
|}

Note that ∗ indicates base model + k-fold and ∗∗ indicates base model + k-fold + common model.

===6 Conclusion===
We have presented our solution for COVID-19 retweet prediction, which consists of feature engineering and bagging/down-sampling modelling techniques. The proposed approach is ranked […] and […] on the public and private test sets, respectively. With an exhaustive feature set and powerful modeling techniques, we hope that our solution provides a solid baseline for retweet prediction research, especially in times of crisis like the COVID-19 pandemic.

===References===
* [1] Dorogush Anna Veronika, Ershov Vasily, and Gulin Andrey. 2018. CatBoost: gradient boosting with categorical features support. (2018).
* [2] Francois Chollet et al. 2015. Keras. https://github.com/fchollet/keras
* [3] Dimitar Dimitrov, Erdal Baran, Pavlos Fafalios, Ran Yu, Xiaofei Zhu, Matthäus Zloch, and Stefan Dietze. 2020. TweetsCOV19 - A Knowledge Base of Semantically Annotated Tweets about the COVID-19 Pandemic. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management. Association for Computing Machinery, New York, NY, USA, 2991–2998. https://doi.org/10.1145/3340531.3412765
* [4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805 (2018).
* [5] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. 2017. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In Proceedings of the 31st International Conference on Neural Information Processing Systems (Long Beach, California, USA) (NIPS’17). Curran Associates Inc., Red Hook, NY, USA, 3149–3157.
* [6] Adam Lerer, Ledell Wu, Jiajun Shen, Timothee Lacroix, Luca Wehrstedt, Abhijit Bose, and Alex Peysakhovich. 2019. Pytorch-biggraph: A large-scale graph embedding system. arXiv preprint arXiv:1903.12287 (2019).
* [7] Adam Lerer, Ledell Wu, Jiajun Shen, Timothee Lacroix, Luca Wehrstedt, Abhijit Bose, and Alex Peysakhovich. 2019. PyTorch-BigGraph: A Large-scale Graph Embedding System. In Proceedings of the 2nd SysML Conference. Palo Alto, CA, USA.
* [8] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
* [9] Radim Řehůřek and Petr Sojka. 2010. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. ELRA, Valletta, Malta, 45–50. http://is.muni.cz/publication/884893/en.
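As an illustration of the hand-crafted user features in Section 3.1.1, the following is a minimal sketch in plain Python. The field names (`user`, `followers`, `friends`), the function name `user_features`, and the choice to average the per-tweet ratio are assumptions made for this example; the paper only states that the ratio followers/(friends+1) and per-user statistics are computed.

```python
from collections import defaultdict
from statistics import mean, median, pstdev


def user_features(rows):
    """Aggregate per-tweet rows into per-user features.

    rows: iterable of dicts with hypothetical keys 'user', 'followers', 'friends'.
    """
    by_user = defaultdict(list)
    for row in rows:
        by_user[row["user"]].append(row)

    features = {}
    for user, tweets in by_user.items():
        followers = [t["followers"] for t in tweets]
        friends = [t["friends"] for t in tweets]
        features[user] = {
            "tweet_count": len(tweets),
            "followers_mean": mean(followers),
            "followers_median": median(followers),
            "followers_min": min(followers),
            "followers_max": max(followers),
            "followers_std": pstdev(followers),
            # The +1 in the denominator avoids division by zero for users with no friends.
            "ff_ratio_mean": mean(f / (g + 1) for f, g in zip(followers, friends)),
        }
    return features


rows = [
    {"user": "u1", "followers": 100, "friends": 9},
    {"user": "u1", "followers": 120, "friends": 9},
    {"user": "u2", "followers": 5, "friends": 0},
]
feats = user_features(rows)
```

The same aggregation pattern would apply to the friends and favourites columns; they are omitted here to keep the sketch short.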
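The trend-level features of Section 3.3 can be sketched roughly as below. This is one possible reading of the definitions, not the authors' implementation: days are assumed to be integer indices, Age is measured from the first day an attribute appears, Life is the number of distinct days on which it appears, and Peak is the share of the attribute's tweets that fall on the given day. The function names `trend_features` and `tweet_level` are hypothetical.

```python
from collections import defaultdict


def trend_features(tweets):
    """tweets: list of (day, [attributes]) with integer day indices.

    Returns {(attribute, day): {"score", "age", "life", "trend", "peak"}}.
    """
    score = defaultdict(int)    # (attribute, day) -> number of tweets using it that day
    days_of = defaultdict(set)  # attribute -> set of days on which it was used
    for day, attrs in tweets:
        for a in set(attrs):    # count each tweet once per attribute
            score[(a, day)] += 1
            days_of[a].add(day)

    feats = {}
    for (a, day), s in score.items():
        total = sum(score[(a, d)] for d in days_of[a])
        feats[(a, day)] = {
            "score": s,
            "age": day - min(days_of[a]),
            "life": len(days_of[a]),
            "trend": s - score.get((a, day - 1), 0),
            "peak": s / total,
        }
    return feats


def tweet_level(day, attrs, feats, key="score"):
    """Aggregate one attribute-level feature to the tweet level as (average, max)."""
    values = [feats[(a, day)][key] for a in set(attrs)]
    if not values:
        return (0.0, 0)
    return (sum(values) / len(values), max(values))


tweets = [(0, ["#a"]), (1, ["#a", "#b"]), (1, ["#a"])]
feats = trend_features(tweets)
avg_score, max_score = tweet_level(1, ["#a", "#b"], feats)
```

The weighted variants described in the paper would replace the plain average in `tweet_level` with an average weighted by the inverse of Age or Life.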
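The 5-fold target encoding mentioned in Section 4.3 is described only briefly, so the following is a hedged sketch of the usual out-of-fold scheme rather than the authors' code. Each tweet's encoding is the mean retweet count of the same user computed on the other folds, so a tweet never sees its own target. The fallback to the mean of the other folds for unseen users, the random fold assignment, and the name `kfold_target_encode` are assumptions.

```python
import random
from collections import defaultdict


def kfold_target_encode(users, targets, k=5, seed=0):
    """Return one out-of-fold mean target per row, keyed by user."""
    n = len(users)
    order = list(range(n))
    random.Random(seed).shuffle(order)
    fold_of = [0] * n
    for pos, idx in enumerate(order):
        fold_of[idx] = pos % k

    encoded = [0.0] * n
    for fold in range(k):
        sums, counts = defaultdict(float), defaultdict(int)
        total, seen = 0.0, 0
        # Collect per-user statistics from every row outside the current fold.
        for i in range(n):
            if fold_of[i] != fold:
                sums[users[i]] += targets[i]
                counts[users[i]] += 1
                total += targets[i]
                seen += 1
        prior = total / seen if seen else 0.0
        # Encode rows inside the fold using only the out-of-fold statistics.
        for i in range(n):
            if fold_of[i] == fold:
                u = users[i]
                encoded[i] = sums[u] / counts[u] if counts[u] else prior
    return encoded


users = ["a"] + ["b"] * 9
targets = [1000] + [1] * 9
enc = kfold_target_encode(users, targets)
```

In this toy input, user "a" has a single tweet with 1000 retweets; its encoding comes from the other folds only, which is the property that limits the overfitting described in the text.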
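The k-fold down-sampling procedure of Section 4.4 can be sketched as follows. The `fit` and `predict` callables stand in for any model (the paper uses CatBoost, LightGBM, and neural networks); they, the function name `kfold_downsample_predict`, and the toy mean predictor are placeholders introduced for this example.

```python
import random


def kfold_downsample_predict(train, fit, predict, test, k=5, seed=0):
    """train: list of (features, retweets); test: list of features.

    fit(rows) -> model and predict(model, x) -> float are supplied by the caller.
    """
    set_a = [row for row in train if row[1] > 0]    # tweets with retweets
    set_b = [row for row in train if row[1] == 0]   # tweets without retweets
    random.Random(seed).shuffle(set_b)

    preds = []
    for fold in range(k):
        # Drop one fold of B, keeping (k-1)/k of it (80% when k = 5) as set C.
        set_c = [row for i, row in enumerate(set_b) if i % k != fold]
        model = fit(set_a + set_c)
        preds.append([predict(model, x) for x in test])

    # Average the k fold predictions for each test item.
    return [sum(col) / k for col in zip(*preds)]


def fit_mean(rows):
    """Toy model: predict the mean retweet count of the training rows."""
    return sum(y for _, y in rows) / len(rows)


def predict_mean(model, x):
    return model


train = [((i,), 10) for i in range(5)] + [((i,), 0) for i in range(10)]
out = kfold_downsample_predict(train, fit_mean, predict_mean, [(0,), (1,)])
```

With the toy mean predictor, each fold trains on 5 positive and 8 zero-retweet rows, so the averaged prediction is higher than the mean over the full training data, which matches the motivation given in the text.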