=Paper= {{Paper |id=Vol-2881/paper2 |storemode=property |title=Word and Graph Embeddings for COVID-19 Retweet Prediction |pdfUrl=https://ceur-ws.org/Vol-2881/paper2.pdf |volume=Vol-2881 |authors=Tam T. Nguyen,Karamjit Singh,Sangam Verma,Hardik Wadhwa,Siddharth Vimal,Lalasa Dheekollu,Sheng Jie Lui,Divyansh Gupta,Dong Yang Yin,Zha Wei }} ==Word and Graph Embeddings for COVID-19 Retweet Prediction== https://ceur-ws.org/Vol-2881/paper2.pdf
Word and Graph Embeddings for COVID-19 Retweet Prediction
           Tam T. Nguyen, Karamjit Singh, Sangam Verma, Hardik Wadhwa, Siddharth Vimal, Lalasa
                     Dheekollu, Sheng Jie Lui, Divyansh Gupta, Dong Yang Jin, Zha Wei∗
ABSTRACT
In this paper, we present our solution for the COVID-19 retweet
prediction challenge. The proposed approach consists of feature
engineering and modeling. For feature engineering, we leverage both
hand-crafted and unsupervised learning features. As the provided
data set is large, we implement auto-encoding algorithms to reduce
the feature dimension. To develop predictive models, we utilize
ensemble learning and deep learning algorithms. We then combine
these models to generate the final blended model. Moreover, to
stabilize the predictions, we also apply bagging as well as
down-sampling techniques to remove tweets whose number of retweets
equals zero. Our solution is ranked First on the public test set and
Second on the private test set.
1    INTRODUCTION
Machine learning and text mining have proved to be powerful tools for
processing unstructured text data and making short-term predictions.
Whether they are still helpful during a time of crisis, e.g. the
ongoing Coronavirus disease 2019 (COVID-19), is still in question. In
this paper, we take advantage of both text mining and machine learning
methods to build more accurate models to predict the number of retweets
on Twitter. Our proposed approach consists of two major parts: feature
engineering and modelling. For feature engineering, we leverage
hand-crafted features and automatic feature learning based on neural
network models. We use most of the state-of-the-art algorithms in text
mining and machine learning. For example, we rely on word2vec [9],
doc2vec [9], and BERT [4] to learn embedding features. For graph
features, we utilize the node2vec and PyTorch-BigGraph [6] algorithms
to extract features. Finally, we build predictive models using emerging
boosting algorithms such as Catboost [1] and LightGBM [5]. Our solution
is released as open source on GitHub¹.

∗ Corresponding author: Tam T. Nguyen. E-mail: nthanhtam@gmail.com
¹ https://github.com/nthanhtam/cikm-cup-2020.git

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons
License Attribution 4.0 International (CC BY 4.0).
In: Dimitar Dimitrov, Xiaofei Zhu (eds.): Proceedings of the CIKM AnalytiCup 2020, 22
October, 2020, Galway (Virtual Event), Ireland, 2020, published at http://ceur-ws.org.

2    DESCRIPTIVE ANALYSIS
The provided TweetsCOV19 dataset [3] contains extracted tweet instances
whose fields can be broadly divided into three types: 1) Metadata:
tweet ID, username, and timestamp; 2) Connection: number of followers,
number of friends, and number of favorites; and 3) Tweet text based:
entities, sentiment, mentions, hashtags, and URLs. The goal is to
predict the number of retweets for each tweet. The train, validation,
and test datasets are provided sequentially with respect to time. The
training data covers the tweet instances from October 2019 to April
2020.
   There are approximately 3.6M unique users in the train data and 300k
common users between the train and test data. On average, the total
number of tweets and retweets remains constant in the period from
October 2019 to February 2020, as shown in Figure 1(a), and a sharp
increase is seen in March 2020. It is also observed that the number of
tweets is higher in certain hours of the day (Figure 1(b)), which might
be due to the distribution of the data across day and night, depending
on the geographical location. We also observe that there is a negative
correlation between the number of mentions and the number of retweets
per tweet, as shown in Figure 1(d).

        Figure 1: Retweets per Tweet against Number of Mentions

   While exploring the number of mentions against retweets per tweet,
there are two key observations, as shown in Figure 1(d). First, for
zero mentions, the average number of retweets per tweet is the highest.
This might be due to the fact that influential people use fewer
mentions and get retweeted far more than the average. Second, as the
number of mentions increases, there is a downward trend in retweets per
tweet.

3    FEATURE ENGINEERING
Our feature set consists of both hand-crafted features and unsupervised
learning features.

3.1    User-level Feature
We aggregate data by users to generate user-level features. For
example, we group data by users and concatenate all tweets belonging to
each user. We then generate features for them as follows.

3.1.1    Hand-crafted user feature.
  • Followers and friends ratio. Along with the number of followers
    and friends, we derive the ratio followers/(friends+1).
  • Tweet count. The number of tweets for each user is calculated.
  • Statistical features. Since the followers, friends and favourites
    of a given user vary over time, various statistics such as sum,
    mean, median, min, max, and standard deviation are used to
    aggregate these columns separately for each user.
  • Time based features. The number of tweets made by a user
    in a particular month and hour, number of different hours
    in which a user tweeted, and the average time between two
    consecutive tweets are used as features. The total number of
    unique users who tweeted on a particular date and hour is also
    calculated.
  • User sentiments. For each user, the mean and standard deviation
    are calculated for both positive and negative sentiments. Also,
    the number of tweets for each value of positive and negative
    sentiment is calculated for each user.

3.1.2 Hashtags/Mentions. There are two kinds of hashtag/mention
features in our models.
   Hand-crafted Feature. We calculate count features from hashtags and
mentions as follows.
  • The number of unique hashtags and mentions used by a user is
    given as input for each tweet of that user.
  • The number of tweets that hashtags or mentions occur in.
   Word2Vec Features. There are only a few mentions in a tweet in the
given data set. Most of the time, users don't mention anyone in their
tweets. If we directly use mentions as features (e.g. one-hot
encoding), the feature space will be very sparse. Therefore, we propose
to learn dense embedding features for mentions by treating each mention
as a word and training a word2vec model with 64 dimensions using
Gensim [9].

3.1.3 Entity Embedding Feature. In this section, we introduce our
approach to extracting embedding features from entities, as follows.
  • Pre-process entity triples and keep matched entities only.
  • Aggregate entities by concatenating them based on user.
  • Train different embedding models with various parameters such as
    embedding size and vocabulary size.
  • Extract embedding features for each user.
For embedding models, we try word2vec, doc2vec, and BERT models. In the
following paragraphs, we present how we train these models in detail.
   Doc2Vec. In order to generate doc2vec features [9], we treat all
entities of a user as a document. We then train doc2vec models using
different embedding sizes such as 32, 64, 128, and 256. We then extract
features and train our predictive models. Based on our experimental
results on our internal validation set, we choose the embedding size of
64 as it gives us the best score.
   BERT. Similar to the doc2vec model, we aggregate entities by users
and treat these as documents. We choose the top attention layer of the
BERT model to extract features. As this layer has a dimension of 1024,
we reduce its dimension using:
  • Truncated SVD [8] with 55 components.
  • Autoencoder: We train an encoder-decoder neural network with input
    and output size of 1024. We try different hidden layer sizes and
    choose the size of 128, which gives us good performance on both
    auto-encoding and predictive models.

3.1.4    URL Feature.
  • Number of unique URLs used by a user.
  • Number of legit URLs in the tweet.
  • The URL field is fitted and transformed with a Count Vectorizer [8].
    This is given as input to an SVD (Singular Value Decomposition)
    with 5 principal components to obtain 5 URL_SVD features.
  • Domain name and suffix (URL part followed by the domain name) are
    extracted for each URL in a given tweet and the following features
    are created:
    (1) A flag that indicates the presence of the word ‘Twitter’ in the
        domain name.
    (2) For each domain name, the number of tweets with that domain
        name is recorded. If a tweet has multiple URLs, then statistics
        like min, max, and average are computed on the domain
        frequencies for each tweet.
  • The same features are also created for the suffix.

3.1.5    Sentiment Feature.
  • For each tweet, the product of the number of followers and the
    sentiment values, both positive and negative, is taken. For the
    negative sentiment values, the absolute value of the sentiment is
    taken.
  • Similarly, the product of the number of friends and the sentiment
    values is also used as a feature.

3.2    Tweet-level Feature
For tweet-level features, we generate the features using label encoding
for user ID, Hashtag, Mention and URL. We also directly use some
features given in the raw data, like favourites, friends, followers,
etc. Note that we only use label encoded features directly in
tree-based/boosting models. For neural network models, we use an
embedding layer on top of these.

3.3    Trend-level Feature
Text based attributes of tweets such as hashtags {h_i}, mentions {m_i},
and entities {e_i} can have a time based trend, and each tweet can have
multiple values of these attributes. We therefore create features to
capture those trends. For explanation purposes, we use {a_i} to denote
any of {h_i}, {m_i}, and {e_i}. We first create features for each a_i
and, since one tweet can have multiple a_i, we then aggregate these
features to the tweet level.
  • Score: Number of tweets which have used a_i on that day.
  • Age: Number of days since a_i came into existence.
  • Life: Number of days a_i has been active through the course of the
    data.
  • Trend: Change in the score of a_i from the previous day.
  • Peak: Fraction of tweets of a_i on this day.
Below are the aggregate tweet level features:
  • Score based features: we take the average and max of the score of
    all a_i in a tweet.
  • Weighted score features: Instead of the normal average and max of
    scores, we take a weighted average where two types of weights are
    used: the inverse of ‘Age’ and of ‘Life’ of a_i.
  • Age and life based features: we take the average, min and max of
    the ‘Age’ of all a_i in a tweet attribute. Similar features are
    created for life.
  • Trend: we take the average and max of the trend of all a_i in a
    tweet attribute.
  • Peak: we take the average and max of the peak values of all a_i in
    a tweet.
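
The per-attribute trend features and their tweet-level aggregates listed above can be sketched with pandas as follows. This is a minimal illustration under stated assumptions, not the released implementation: the column names (`tweet_id`, `attr`, `day`), the toy data, and both function names are hypothetical. Where the text leaves a definition open, the sketch assumes that Trend is measured against the previous day on which the attribute was used, that Life is the number of distinct active days, that Peak is the day's share of all tweets using the attribute, and that the Age weight is 1/(Age+1) to avoid division by zero.

```python
import pandas as pd

# Toy input (hypothetical schema): one row per (tweet, attribute) pair.
rows = pd.DataFrame({
    "tweet_id": [1, 1, 2, 3, 4, 6, 5],
    "attr": ["covid", "news", "covid", "covid", "covid", "covid", "news"],
    "day": pd.to_datetime([
        "2020-03-01", "2020-03-01", "2020-03-01", "2020-03-02",
        "2020-03-02", "2020-03-02", "2020-03-04",
    ]),
})


def attribute_day_features(rows):
    """Score, Age, Life, Trend and Peak for every (attribute, day) pair."""
    # Score: number of tweets that used the attribute on that day.
    daily = (rows.groupby(["attr", "day"])["tweet_id"].nunique()
                 .rename("score").reset_index()
                 .sort_values(["attr", "day"]))
    by_attr = daily.groupby("attr")
    # Age: days since the attribute first appeared in the data.
    daily["age"] = (daily["day"] - by_attr["day"].transform("min")).dt.days
    # Life (assumption): number of distinct days the attribute was used.
    daily["life"] = by_attr["day"].transform("nunique")
    # Trend (assumption): change in score from the previous active day.
    daily["trend"] = by_attr["score"].diff().fillna(0)
    # Peak (assumption): share of the attribute's tweets posted that day.
    daily["peak"] = daily["score"] / by_attr["score"].transform("sum")
    return daily


def tweet_level_features(rows, daily):
    """Aggregate the attribute features of each tweet to one row per tweet."""
    merged = rows.merge(daily, on=["attr", "day"], how="left")
    out = merged.groupby("tweet_id").agg(
        score_mean=("score", "mean"), score_max=("score", "max"),
        age_mean=("age", "mean"), age_min=("age", "min"), age_max=("age", "max"),
        life_mean=("life", "mean"), life_min=("life", "min"), life_max=("life", "max"),
        trend_mean=("trend", "mean"), trend_max=("trend", "max"),
        peak_mean=("peak", "mean"), peak_max=("peak", "max"),
    )
    # Weighted average of the score, weighted by inverse Age and inverse Life.
    merged["w_age"] = 1.0 / (merged["age"] + 1)  # +1 avoids division by zero
    merged["w_life"] = 1.0 / merged["life"]      # life is at least 1
    for w in ("w_age", "w_life"):
        num = (merged["score"] * merged[w]).groupby(merged["tweet_id"]).sum()
        den = merged[w].groupby(merged["tweet_id"]).sum()
        out["score_" + w] = num / den
    return out


daily = attribute_day_features(rows)
tweet_feats = tweet_level_features(rows, daily)
print(tweet_feats.round(3))
```

The same two functions could be applied separately to hashtags, mentions, and entities, giving one block of trend features per attribute type.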

3.4    Graph-based Feature
In this section, we introduce our graph-based feature set. In order to
generate the features, we build several kinds of graphs: a user-entity
graph (UE), a user-hashtag graph (UH), and a combined
user-hashtag-URL-mention graph (CG).

3.4.1 UE. To generate graph features, we build a user-entity graph as
follows: (i) each node represents a user or an entity and (ii) we
create an edge between a user and an entity if the user tweets about
that entity. We use PyTorch-BigGraph [6] to generate user level
embeddings of size 64, which we use as features in our models.

3.4.2 UH. Similar to the user-entity graph, we build a user-hashtag
graph as follows: (i) each node represents a user or a hashtag and (ii)
we create an edge between a user and a hashtag if the user uses that
hashtag in a tweet. Similar to UE, we also use PyTorch-BigGraph [6] to
generate user level embeddings of size 64 and use them as features.

3.4.3 CG. This graph is similar to the above two graphs but we replace
entities and hashtags by mentions and URLs. The edges of the graph are
determined by whether users use mentions or URLs in their tweets.
   In addition to the graph embedding features, we use traditional
graph algorithms to extract information about users from the graphs,
such as centrality, average neighbor degree, etc. Moreover, we also
train node2vec [7] models to extract embedding features.

4    MODELING APPROACH
We use the hold-out testing technique to validate our models, where 80%
of the data is for training and 20% is for testing. Our final solution
is an ensemble of 3 popular machine learning algorithms: Catboost [1],
LightGBM [5], and Neural Networks using Keras [2].

4.1    Boosting Models
We use Catboost (CAT) [1] and LightGBM (LGB) [5] to train two separate
models on the features discussed in Section 3. For the LightGBM
algorithm, we also use bagging by repeating its training with various
random seed numbers. Doing so, we can stabilize its predictions on the
validation and test sets. We discuss these algorithms' parameters
further in the next section.

4.2    Neural Network Models
4.2.1 Convolutional Neural Networks (CNN). Figure 2 shows the
architecture of the CNN and feed forward based model. The inputs to the
model are all numerical features discussed above and the embeddings for
entities, hashtags, and mentions. We also use graph based features
derived from the user-hashtag graph as described in Section 3.4.2. 1D
convolutional layers are used to extract features from local input
patches, allowing for representation modularity and data efficiency. We
use 64 filters for each variable followed by max pooling. We then
concatenate the outputs of these CNN layers. Simultaneously, we pass
categorical features like hour, weekday, month, user id, hashtag, and
entity to the embedding layer. We concatenate all of these layers
together and pass them to a feed forward neural network to obtain the
final retweet count.

                   Figure 2: CNN Model Architecture

4.2.2 Embedding Neural Networks Model. To develop the embedding neural
network model, we classify features into two categories: numerical and
categorical features. For numerical features such as no. of friends,
no. of followers, etc., we use a log scale to transform the data. For
categorical features such as mentions, hashtags, and entities, we use
an embedding layer to deal with them.

4.3    Common-User Model
As discussed in Section 2, there are about 50% common users in the
train and test sets. This serves as the motivation to train a model
specific to the common users only. Apart from the features discussed in
Section 3, we create some additional features specific to the
common-user model. We use Catboost to train the model.
   K-Fold Target Encoding. Target encoding for each user is created to
capture the popularity of a user, but a model using traditional target
encoding (taking the mean of the retweets over all the tweets of a
user) tends to overfit the training data set and did not perform well.
To tackle this problem, we use 5-fold target encoding.
   Moving Averages. To capture the trend of a particular user, whether
a gain or a loss in popularity over time, we create trend based
features. For a user on a given day, we find the moving averages of the
number of followers and friends over the last 3, 5, and 10 tweets.
These are then used directly as features in the model.

4.4    K-fold Down Sampling
We notice that most tweets have no retweets. If we use the whole
training data to train models, the predictions on the validation and
test sets tend to be smaller than the actual number of retweets. Hence,
we apply k-fold down-sampling to these tweets as follows:
  • Split the training data into two sets A (retweets > 0) and B
    (retweets = 0).
  • Apply k-fold cross validation on set B to select 80% of the data as
    a new data set C. Merge sets A and C to train predictive models and
    make predictions on the validation and test sets. In this case, we
    have a total of 5 predictions, one for each fold.
  • The final prediction for the validation and test sets is the
    average of the predictions of the k folds.
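
The k-fold down-sampling procedure of Section 4.4 can be sketched as below. This is a minimal illustration, not the released code: the function names and the synthetic data are hypothetical, and a plain least-squares regressor stands in for the Catboost, LightGBM, and neural network models so that the example depends only on NumPy. With k = 5, each model is trained on all of set A plus four of the five folds of set B, i.e. 80% of the zero-retweet tweets.

```python
import numpy as np


def kfold_downsample_predict(X, y, X_test, fit, predict, k=5, seed=0):
    """Average k models, each trained on set A plus (k-1)/k of set B."""
    rng = np.random.default_rng(seed)
    idx_a = np.flatnonzero(y > 0)                     # set A: retweets > 0
    idx_b = rng.permutation(np.flatnonzero(y == 0))   # set B: retweets = 0
    folds = np.array_split(idx_b, k)
    preds = []
    for i in range(k):
        # Set C: all folds of B except the i-th one (80% of B when k = 5).
        idx_c = np.concatenate([f for j, f in enumerate(folds) if j != i])
        train_idx = np.concatenate([idx_a, idx_c])
        model = fit(X[train_idx], y[train_idx])
        preds.append(predict(model, X_test))
    # Final prediction: mean over the k fold models.
    return np.mean(preds, axis=0)


# Stand-in model (assumption): ordinary least squares with an intercept.
def fit_lstsq(X, y):
    design = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def predict_lstsq(coef, X):
    return np.column_stack([X, np.ones(len(X))]) @ coef


# Synthetic data in which most targets are zero, mimicking the retweet counts.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = np.where(rng.random(200) < 0.7, 0.0, np.abs(X[:, 0]) * 5 + 1)
X_test = rng.normal(size=(10, 3))

y_pred = kfold_downsample_predict(X, y, X_test, fit_lstsq, predict_lstsq, k=5)
print(y_pred.shape)
```

Passing `fit` and `predict` as callables keeps the sampling logic independent of the model, so the same routine could wrap any of the regressors used in the paper.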
                  Table 1: Performance Comparison of Various Strategies
Models                     CAT val   CAT test   LGB val   LGB test   NN val    NN test   CNN val   CNN test
base                       0.12344   0.12433    0.13582   -          0.15693   -         0.12763   0.13139
base+common-model          0.12286   0.12374    -         -          -         -         -         -
base+kfold                 0.12279   0.12366    0.13127   -          -         -         0.12649   0.13018
base+kfold+common-model    0.12164   0.12285    -         -          -         -         -         -

4.5    Ensemble
In practice, combining multiple machine learning models helps improve
the performance of the final model. The simplest way to do that is to
take a weighted average of the predictions of the selected models. Our
ensemble model prediction is a linear combination of the Catboost,
LightGBM, and neural network models. We estimate the weights based on
the performance of the models on the leaderboard, where the weights of
Catboost, LightGBM, and neural networks are 0.6, 0.3, and 0.1,
respectively.

5    EXPERIMENTS
In this section, we discuss the experiment settings, including the
hyper-parameters selected for each model, and the performance of each
model in different settings based on our internal validation set.

5.1    Experiment Settings
We use two boosting based models, Catboost and LightGBM, where we use a
larger learning rate and fewer trees for LightGBM (LGB Parameters -
num_leaves: 35, max_depth: 8, min_child_sample: 100, subsample: 0.7,
colsample_bytree: 0.7, n_estimators: 15k, learning rate: 0.2), and a
smaller learning rate for Catboost (CAT Parameters - iterations: 53k,
border_count: 254, max_depth: 11, learning rate: 0.03).

5.2    Results
As mentioned in Section 3.4, we generate two types of embeddings from
the following types of graphs: the user-hashtag graph (UH) and the
user-entity graph (UE), where the embedding size is 64 in both cases.
Table 2 shows the performance of the Catboost model on different
feature sets. In this setting, we would like to study the impact of
graph based features on the performance of our models. Without using
any graph features, Catboost (CAT) has an error of 0.12494. By using
graph features such as UH and UE, we can improve the model by up to
0.0015 (with UE). It means that adding the graph embeddings improves
the performance of the models.
   Table 1 shows the performance improvement from using the k-fold and
common-model strategies with the baseline models. It shows that all our
base models improve by using these strategies. Further, Table 3 shows
the performance of various ensemble approaches; it shows that the
ensemble of all four models yields the best performance. We use the
weighted average approach to combine our predictive models. Please note
that we don't have results for every model. We only choose important
models to submit and get the score.

       Table 2: Graph Features on Internal Validation
Model + Embedding     CAT        CAT + UH    CAT + UE
Validation Score      0.12494    0.12464     0.12343

   As we only know the score after submitting our submission, this
table only shows part of all results. Note that we don't submit the
CAT** + CNN* ensemble in the validation phase, so we don't have its
score. Based on the results, one can see that adding more models can
improve the accuracy. For instance, the CAT** + CNN* + LGB* ensemble
works better than CAT* + LGB*, and our final ensemble consists of 4
algorithms.

     Table 3: Comparison of Various Ensemble Approaches
Model                          Valid Score    Test Score
CAT** + CNN*                   -              0.121874
CAT** + LGB*                   0.12518        -
CAT** + CNN* + LGB*            0.12511        -
CAT** + CNN* + NN* + LGB*      -              0.121094
Note that * indicates base model + k-fold and **: base + k-fold +
common model.

6    CONCLUSION
We have presented our solution for COVID-19 retweet prediction, which
consists of feature engineering and bagging/down-sampling modelling
techniques. The proposed approach is ranked First and Second on the
public and private test sets, respectively. With an exhaustive feature
set and powerful modeling techniques, we hope that our solution
provides a solid baseline for retweet prediction research, especially
in times of crisis like the COVID-19 pandemic.

REFERENCES
[1] Anna Veronika Dorogush, Vasily Ershov, and Andrey Gulin. 2018. CatBoost:
    gradient boosting with categorical features support. (2018).
[2] Francois Chollet et al. 2015. Keras. https://github.com/fchollet/keras
[3] Dimitar Dimitrov, Erdal Baran, Pavlos Fafalios, Ran Yu, Xiaofei Zhu, Matthäus
    Zloch, and Stefan Dietze. 2020. TweetsCOV19 - A Knowledge Base of Semantically
    Annotated Tweets about the COVID-19 Pandemic. In Proceedings of the 29th ACM
    International Conference on Information & Knowledge Management. Association
    for Computing Machinery, New York, NY, USA, 2991–2998.
    https://doi.org/10.1145/3340531.3412765
[4] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT:
    Pre-training of Deep Bidirectional Transformers for Language Understanding.
    arXiv preprint arXiv:1810.04805 (2018).
[5] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei
    Ye, and Tie-Yan Liu. 2017. LightGBM: A Highly Efficient Gradient Boosting
    Decision Tree. In Proceedings of the 31st International Conference on Neural
    Information Processing Systems (Long Beach, California, USA) (NIPS'17). Curran
    Associates Inc., Red Hook, NY, USA, 3149–3157.
[6] Adam Lerer, Ledell Wu, Jiajun Shen, Timothee Lacroix, Luca Wehrstedt, Abhijit
    Bose, and Alex Peysakhovich. 2019. Pytorch-biggraph: A large-scale graph
    embedding system. arXiv preprint arXiv:1903.12287 (2019).
[7] Adam Lerer, Ledell Wu, Jiajun Shen, Timothee Lacroix, Luca Wehrstedt, Abhijit
    Bose, and Alex Peysakhovich. 2019. PyTorch-BigGraph: A Large-scale Graph
    Embedding System. In Proceedings of the 2nd SysML Conference. Palo Alto, CA,
    USA.
[8] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel,
    M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos,
    D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn:
    Machine Learning in Python. Journal of Machine Learning Research 12 (2011),
    2825–2830.
[9] Radim Řehůřek and Petr Sojka. 2010. Software Framework for Topic Modelling
    with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges
    for NLP Frameworks. ELRA, Valletta, Malta, 45–50.
    http://is.muni.cz/publication/884893/en.