=Paper=
{{Paper
|id=Vol-2881/paper1
|storemode=property
|title=CIKM AnalytiCup 2020: COVID-19 Retweet Prediction with Personalized Attention
|pdfUrl=https://ceur-ws.org/Vol-2881/paper1.pdf
|volume=Vol-2881
|authors=T Vinayaka Raj
}}
==CIKM AnalytiCup 2020: COVID-19 Retweet Prediction with Personalized Attention==
T Vinayaka Raj, Rakuten Inc. (vinayaka.raj@rakuten.com)

ABSTRACT

This paper describes the first-place winning solution for the CIKM AnalytiCup 2020 COVID-19 retweet prediction challenge. The objective of the challenge is to predict the popularity of COVID-19 related tweets in terms of the number of retweets, and the submitted solutions are ranked based on Mean Squared Logarithmic Error (MSLE) on the leaderboard. The proposed deep learning model uses minimal hand-engineered features and learns to predict the retweet count using a personalized attention mechanism. As a tweet keyword may have different informativeness for different users, the personalized attention mechanism helps the model weigh the importance of tweet keywords based on a user's interest to retweet. Additional techniques such as adding external datasets to training and pseudo-labelling are also experimented with to further improve the MSLE score. The final solution comprises an ensemble of different personalized attention-based deep learning models, and the source code for the solution can be found at https://github.com/vinayakaraj-t/CIKM2020.

KEYWORDS

Deep Learning, Personalized attention, COVID-19, Pseudo-labelling

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). In: Dimitar Dimitrov, Xiaofei Zhu (eds.): Proceedings of the CIKM AnalytiCup 2020, 22 October 2020, Galway (Virtual Event), Ireland, published at http://ceur-ws.org.

1 INTRODUCTION

Understanding information diffusion in social networks is imperative, as it helps to comprehend social interactions among users in a better way. Information spread on a large scale in social networks enables marketers and advertisers to design their campaigns more effectively to target potential customers. In addition, identifying influential users [8] in social media is significant, as these users contribute immensely to information diffusion during viral marketing campaigns. Relationships between users on social networks heavily affect the amount of information exchanged among them. Furthermore, understanding how fake news spreads in social networks is crucial to prevent the propagation of misinformation during global pandemics such as COVID-19.

Modeling information diffusion in social networks is a research topic that has garnered increasing interest of late. In CIKM AnalytiCup 2020, the competition objective is to model the information spreading mechanism during COVID-19 by predicting the retweet count of tweets on Twitter. Retweeting is one of the functions of Twitter that helps users quickly share their own tweets, or the tweets of other users, with all of their followers. Retweets can be seen as one of the ways information spreads on Twitter and are very useful for understanding the information diffusion mechanism on Twitter. Some practical applications of information diffusion using tweets are political audience design [9, 15], fake news spreading and tracking [10, 17], and health promotion [3].

In this paper, all the techniques used to predict the retweet count of tweets are discussed in detail. The first section provides a summary of the dataset presented to solve the problem; hand-engineered features and their pre-processing techniques are also explained in this section. The model architecture and the personalized attention mechanism are described in the next section, and finally, in the last section, all the experiments carried out to improve the model score on the test leaderboard are explained in detail.

2 DATASET

The TweetsCOV19 [5] dataset provided in the competition is a large collection of COVID-19 related tweets, extracted using a seed list of 268 COVID-19 related keywords [4] from the large anonymized and annotated TweetsKB [6] corpus. TweetsCOV19 contains all the COVID-19 related tweets from October 2019 to April 2020; in total, the dataset holds around 8 million tweets posted by 3.7 million users. For each tweet, the dataset provides the user who posted it, the time of the tweet, metadata such as #Followers, #Favorites, and #Friends, and the text information of the tweet. The text information is split into entities, hashtags, mentions, and URLs. The entities of each tweet are created using the Fast Entity Linker [1, 11]. The sentiment of each tweet is also provided, extracted using SentiStrength [16], which scores each tweet between -4 (very negative) and 4 (very positive).

In addition to the given metadata features, a few more features are derived from the tweet metadata and used to predict the retweet count. A full list of features and their preprocessing techniques is provided in Table 1.

Both the original tweet keywords and their respective annotated entities are extracted and considered for analysis. Numbers and special characters are removed from hashtags and mentions, and duplicate keywords are removed from entities, hashtags, and mentions. URLs are split into two parts: the hostname of the tweet URL is extracted as URL-1, and the path of the URL is considered as URL-2. Numbers and special characters are also removed from URL-2.
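The keyword and URL preprocessing described above can be sketched as follows. This is a minimal illustration, not the competition code: the function names and the exact cleaning rule (keep letters only) are assumptions.

```python
import re
from urllib.parse import urlparse

def clean_keywords(keywords):
    """Strip numbers/special characters and drop duplicates, keeping order."""
    cleaned = [re.sub(r"[^a-zA-Z]", "", k) for k in keywords]
    seen, result = set(), []
    for k in cleaned:
        if k and k not in seen:
            seen.add(k)
            result.append(k)
    return result

def split_url(url):
    """Split a tweet URL into URL-1 (hostname) and URL-2 (cleaned path)."""
    parsed = urlparse(url)
    url1 = parsed.netloc                            # hostname -> URL-1
    url2 = re.sub(r"[^a-zA-Z]", "", parsed.path)    # cleaned path -> URL-2
    return url1, url2

print(clean_keywords(["covid19", "covid19", "#lockdown!"]))
print(split_url("https://example.com/a1/b2"))
```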
3 MODEL ARCHITECTURE

Figure 1 shows the network architecture of the retweet prediction model and the attention mechanism.

Table 1: Feature Information and Preprocessing Techniques

{| class="wikitable"
! Feature !! Description !! Preprocessing Technique
|-
| week || Week info extracted from timestamp || One-hot-encoded
|-
| time || Time info extracted from timestamp || Log transformed and then standardized
|-
| year || Year info extracted from timestamp || Log transformed and then standardized
|-
| no_entities || No. of entities in a tweet || Log transformed and then standardized
|-
| keyword_entities || No. of COVID related entities in a tweet || Log transformed and then standardized
|-
| no_hashtags || No. of hashtags in a tweet || Log transformed and then standardized
|-
| keyword_hashtags || No. of COVID related hashtags in a tweet || Log transformed and then standardized
|-
| no_mentions || No. of mentions in a tweet || Log transformed and then standardized
|-
| no_urls || No. of URLs in a tweet || Log transformed and then standardized
|-
| Sentiment || Sentiment score from SentiStrength (-4 to 4) || One-hot-encoded
|-
| #Favorites || Tweet favorites || Log transformed and then standardized
|-
| #Followers || No. of followers of a user || Log transformed and then standardized
|-
| #Friends || No. of friends of a user || Log transformed and then standardized
|-
| #Followers/#Friends || Followers-to-friends ratio || Log transformed and then standardized
|-
| #Friends/#Favorites || Friends-to-favorites ratio || Log transformed and then standardized
|-
| #Favorites/#Followers || Favorites-to-followers ratio || Log transformed and then standardized
|-
| username || Encrypted username || Label encoded
|}

Figure 1: Architecture of the deep learning model to predict retweet counts.

A high-cardinality feature such as username is embedded as a fixed-length vector using a user embedding layer, which is then passed through a series of user dense layers to obtain the final representation of the user. The word embedding layer is shared by the preprocessed keywords of tweet entities, mentions, hashtags, and URLs, and is initialized with pre-trained word vectors.
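The user tower described above can be sketched in numpy with the layer sizes from Table 3 (64-d user embedding, one 150-unit user dense layer). The random weights and tanh activation are stand-ins, not the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

n_users, emb_size, dense_size = 1000, 64, 150       # sizes follow Table 3
user_embedding = rng.normal(0, 0.01, (n_users, emb_size))  # embedding table
W_dense = rng.normal(0, 0.01, (emb_size, dense_size))      # user dense layer
b_dense = np.zeros(dense_size)

def user_vector(user_id):
    """Look up the user embedding and pass it through the user dense layer."""
    e = user_embedding[user_id]            # fixed-length embedding of the username
    return np.tanh(e @ W_dense + b_dense)  # final user representation u

u = user_vector(42)
print(u.shape)  # (150,)
```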
For a tweet, the entity word vectors are a sequence of word vectors queried from the word embedding layer, one per entity keyword extracted from that tweet. The length of the sequence is fixed for the entire dataset and is a hyper-parameter of the model: entity keyword sequences shorter than the sequence length are padded with zeros on the left, and longer ones are trimmed on the right. Word vectors for hashtags, mentions, and URLs are created in the same way as the entity word vectors. These word vector sequences are then input to an LSTM/CNN layer, which learns the representations of entities, mentions, hashtags, and URLs from their respective sequences.

Table 2: Dataset Splits

{| class="wikitable"
! Data Split !! Start Date !! End Date
|-
| Training || 2019-09-30 || 2020-04-25
|-
| Validation || 2020-04-26 || 2020-04-30
|-
| Testing Set 1 || 2020-05-01 || 2020-05-15
|-
| Testing Set 2 || 2020-05-16 || 2020-05-31
|}

Table 3: Model Setup

{| class="wikitable"
! Hyper-parameter !! Value
|-
| Optimizer || Adam
|-
| Learning Rate || 0.0001
|-
| Batch Size || 2048
|-
| Entity Sequence Length || 10
|-
| Hashtag Sequence Length || 5
|-
| Mentions Sequence Length || 5
|-
| URL-1 Sequence Length || 3
|-
| URL-2 Sequence Length || 15
|-
| User Embedding Size || 64
|-
| User Dense Layer || 150
|-
| Word Embedding Size || 150
|-
| LSTM Units || 150
|-
| CNN Units || 150
|-
| Dense Layer 1 || 500
|-
| Dense Layer 2 || 150
|}

Figure 2: Personalized attention mechanism

The user vector representation u and the LSTM/CNN output vectors v are used to build the personalized attention mechanism [18], illustrated in Figure 2. The attention weight a_i is formulated as:

a_i = v_i^T \tanh(W_u u + b_u)

\alpha_i = \frac{\exp(a_i)}{\sum_{j=1}^{m} \exp(a_j)}

where W_u and b_u are user projection parameters and m is the sequence length. The final representation r of the entities/hashtags/mentions/URLs is given by:

r = \sum_{i=1}^{m} \alpha_i v_i
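The attention equations above translate directly into numpy. In this sketch, random vectors stand in for the LSTM/CNN output sequence v and the user dense vector u; the shapes follow Table 3.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d, du = 10, 150, 150       # sequence length, output size, user-vector size

v = rng.normal(size=(m, d))   # LSTM/CNN output sequence v_1..v_m (stand-in)
u = rng.normal(size=du)       # user dense vector u (stand-in)
W_u = rng.normal(size=(d, du)) * 0.1  # user projection parameters
b_u = np.zeros(d)

# a_i = v_i^T tanh(W_u u + b_u)
a = v @ np.tanh(W_u @ u + b_u)

# alpha_i = exp(a_i) / sum_j exp(a_j)  (softmax, shifted for stability)
alpha = np.exp(a - a.max()) / np.exp(a - a.max()).sum()

# r = sum_i alpha_i v_i : user-personalized representation of the sequence
r = alpha @ v

print(alpha.sum(), r.shape)
```

Because the weights α depend on u, the same keyword sequence yields a different representation r for different users, which is the point of the personalization.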
The final representation vectors of entities, hashtags, mentions, and URLs are then concatenated with the user vector and the other features, such as tweet metadata and time-based features. The concatenated feature vector is passed through a series of dense layers to estimate the retweet count.

4 EXPERIMENTS

4.1 Experiment Setting

The dataset provided for the competition comprises all COVID-19 related tweets from 2019-09-30 to 2020-05-31. The entire month of May 2020 is held out for testing and is split into two testing sets (testing set 1 and testing set 2). Testing set 1 is used for validating the model on the leaderboard, and testing set 2 is used to rank the final winners of the competition. The rest of the dataset, from 2019-09-30 to 2020-04-30, is used for training the model. The training dataset is sorted in chronological order, and the most recent 5% of its tweets are filtered out and used as the validation set. Information about the training, validation, and testing splits is provided in Table 2.

Mean Squared Logarithmic Error (MSLE) is the evaluation metric used in this competition. MSLE is given by:

MSLE = \frac{1}{N} \sum_{i=1}^{N} (\log(p_i + 1) - \log(a_i + 1))^2

where a_i and p_i are the actual and predicted retweet counts, respectively, and N is the number of tweets. MSLE penalizes underestimation more than overestimation.

The model described in Figure 1 is trained on a Tesla V100 GPU machine. The optimal hyper-parameter settings are selected based on the model with the best MSLE score on the validation set; the tuned hyper-parameters are provided in Table 3.

4.2 Results

The performance of the models on the final testing dataset is shown in Table 4. A single personalized attention model with fastText embeddings and an LSTM head for learning tweet representations provides an MSLE score of 0.12860 on the test dataset. A large collection of annotated tweets from March and April 2020, taken from the parent TweetsKB corpus, is then added to the training dataset, and a deep learning model is trained on the combined dataset.
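The MSLE metric above is a few lines of numpy, and a small example confirms the asymmetry: under-predicting by a fixed amount costs more than over-predicting by the same amount.

```python
import numpy as np

def msle(actual, predicted):
    """Mean Squared Logarithmic Error between actual and predicted retweet counts."""
    a = np.asarray(actual, dtype=float)
    p = np.asarray(predicted, dtype=float)
    return np.mean((np.log1p(p) - np.log1p(a)) ** 2)  # log1p(x) = log(x + 1)

# Under-prediction by 50 retweets is penalized more than over-prediction by 50:
print(msle([100], [50]))   # under-predict
print(msle([100], [150]))  # over-predict (smaller error)
```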
This addition of an external dataset decreases the MSLE score by 3.76%. TweetsCOV19 is a subset of TweetsKB and hence does not include all the tweets of each user, only their COVID-related ones. Including all tweets of a user not only helps the personalized attention mechanism understand the relation between users and their tweets but also helps the model learn a richer representation of users and tweet keywords.

Table 4: Model Performance Results

{| class="wikitable"
! Model Type !! MSLE
|-
| Fasttext Embedding || 0.12860
|-
| Fasttext Embedding + External Dataset || 0.12376
|-
| Ensemble || 0.12071
|-
| Ensemble + Pseudo-Labelling || 0.12055
|}
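Reading the reported gains as relative MSLE reductions (an assumption, but one that matches the numbers), they can be reproduced from Table 4:

```python
scores = {
    "fasttext": 0.12860,                # single model
    "fasttext+external": 0.12376,       # + external TweetsKB data
    "ensemble": 0.12071,                # ten-model ensemble
    "ensemble+pseudo": 0.12055,         # + pseudo-labelling
}

def reduction(before, after):
    """Relative MSLE reduction, in percent."""
    return 100 * (before - after) / before

print(round(reduction(scores["fasttext"], scores["fasttext+external"]), 2))  # 3.76
print(round(reduction(scores["fasttext+external"], scores["ensemble"]), 2))  # 2.46
print(round(reduction(scores["ensemble"], scores["ensemble+pseudo"]), 2))    # 0.13
```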
To further improve the score, techniques such as ensembling and pseudo-labelling are also tried.

4.2.1 Ensembling. In addition to initializing the tweet keywords with fastText embeddings [2, 7], pre-trained word vectors such as GloVe-840B [12], GloVe-Twitter [12], fastText-wiki [2, 7], and LexVec [13, 14] are also used to train the deep learning model. Among the five models trained with different pre-trained vectors, fastText embedding initialization provides the best score on the testing leaderboard. Furthermore, another set of models is trained by replacing the LSTM head with a CNN head, again with all five pre-trained word vectors. The individual MSLE scores of the models with CNN heads are much lower than those of the models with LSTM heads, but ensembling all the models together provides a significant improvement on the testing leaderboard. In total, there are ten personalized attention models, and the final solution is created by ensembling all ten output predictions with simple averaging. Ensembling decreases the MSLE score by 2.464%.
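The simple-averaging ensemble is just a mean over the ten models' predictions. In this sketch, random values stand in for the outputs of the ten models (five pre-trained embeddings × {LSTM, CNN} heads):

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in predictions of the ten personalized attention models on the same tweets.
n_models, n_tweets = 10, 1000
predictions = rng.uniform(0, 5, size=(n_models, n_tweets))

# Simple averaging across models gives the final ensemble prediction per tweet.
ensemble = predictions.mean(axis=0)
print(ensemble.shape)
```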
4.2.2 Pseudo-Labelling. Pseudo-labelling is another trick tried to decrease the MSLE score. The best output solutions on the leaderboard for test set 1 and test set 2 are used as labels for the respective datasets, which are then added to the training set for building the deep learning models. Similar to the ensembling technique described above, ten different models with different pre-trained word vectors and LSTM/CNN heads are built on the new dataset, and their predictions are then averaged. Pseudo-labelling decreased the MSLE score by a very small margin of 0.132%.
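The pseudo-labelling step amounts to attaching the best leaderboard predictions to the unlabeled test tweets and folding them into the training set. A minimal sketch, with hypothetical stand-in data structures in place of the competition files:

```python
# (features, retweet-count label) pairs for the original training data (stand-ins).
train = [({"tweet_id": i}, float(i % 7)) for i in range(100)]

# Unlabeled test tweets and the best leaderboard predictions for them (stand-ins).
test1 = [{"tweet_id": 100 + i} for i in range(20)]
best_preds1 = [1.5] * 20

# Pair each test tweet with its pseudo-label and extend the training set;
# the ten ensemble models are then retrained on this augmented set.
pseudo = list(zip(test1, best_preds1))
augmented_train = train + pseudo
print(len(augmented_train))
```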
5 CONCLUSION

A methodology to estimate the retweet count of COVID-related tweets is proposed in this paper. The personalized attention-based deep learning model described here uses few hand-engineered features and learns a rich representation of users and tweet keywords to predict retweet counts. To further improve the performance of the model, techniques such as adding external datasets, ensembling, and pseudo-labelling are also tried. The final solution, an ensemble of deep learning models, placed the team first on the testing leaderboard.

REFERENCES

[1] Roi Blanco, Giuseppe Ottaviano, and Edgar Meij. 2015. Fast and Space-Efficient Entity Linking in Queries. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining (WSDM '15). ACM, New York, NY, USA.
[2] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching Word Vectors with Subword Information. CoRR abs/1607.04606. http://arxiv.org/abs/1607.04606
[3] Jae Eun Chung. 2017. Retweeting in health promotion: Analysis of tweets about Breast Cancer Awareness Month. Computers in Human Behavior 74. https://doi.org/10.1016/j.chb.2017.04.025
[4] Dimitar Dimitrov. 2020 (accessed August 2020). COVID-19 related keywords. https://data.gesis.org/tweetscov19/keywords.txt
[5] Dimitar Dimitrov, Erdal Baran, Pavlos Fafalios, Ran Yu, Xiaofei Zhu, Matthäus Zloch, and Stefan Dietze. 2020. TweetsCOV19 - A Knowledge Base of Semantically Annotated Tweets about the COVID-19 Pandemic. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management. ACM, New York, NY, USA, 2991-2998. https://doi.org/10.1145/3340531.3412765
[6] Pavlos Fafalios, Vasileios Iosifidis, Eirini Ntoutsi, and Stefan Dietze. 2018. TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets. In European Semantic Web Conference. Springer, 177-190.
[7] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of Tricks for Efficient Text Classification. CoRR abs/1607.01759. http://arxiv.org/abs/1607.01759
[8] E. Kafeza, A. Kanavos, C. Makris, and P. Vikatos. 2014. T-PICE: Twitter Personality Based Influential Communities Extraction System. In 2014 IEEE International Congress on Big Data, 212-219. https://doi.org/10.1109/BigData.Congress.2014.38
[9] Eunice Kim, Yongjun Sung, and Hamsu Kang. 2014. Brand followers' retweeting behavior on Twitter: How brand relationships influence brand electronic word-of-mouth. Computers in Human Behavior 37, 18-25. https://doi.org/10.1016/j.chb.2014.04.020
[10] Cristian Lumezanu, Nick Feamster, and Hans Klein. 2012. #bias: Measuring the Tweeting Behavior of Propagandists.
[11] Aasish Pappu, Roi Blanco, Yashar Mehdad, Amanda Stent, and Kapil Thadani. 2017. Lightweight Multilingual Entity Extraction and Linking. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining (WSDM '17). ACM, New York, NY, USA, 365-374. https://doi.org/10.1145/3018661.3018724
[12] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Empirical Methods in Natural Language Processing (EMNLP), 1532-1543. http://www.aclweb.org/anthology/D14-1162
[13] Alexandre Salle, Marco Idiart, and Aline Villavicencio. 2016. Enhancing the LexVec Distributed Word Representation Model Using Positional Contexts and External Memory. CoRR abs/1606.01283. http://arxiv.org/abs/1606.01283
[14] Alexandre Salle, Aline Villavicencio, and Marco Idiart. 2016. Matrix Factorization using Window Sampling and Negative Sampling for Improved Word Representations. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). ACL, Berlin, Germany, 419-424. https://doi.org/10.18653/v1/P16-2068
[15] S. Stieglitz and L. Dang-Xuan. 2012. Political Communication and Influence through Microblogging - An Empirical Analysis of Sentiment in Twitter Messages and Retweet Behavior. In 2012 45th Hawaii International Conference on System Sciences, 3500-3509. https://doi.org/10.1109/HICSS.2012.476
[16] Mike Thelwall, Kevan Buckley, Georgios Paltoglou, Di Cai, and Arvid Kappas. 2010. Sentiment strength detection in short informal text. Journal of the American Society for Information Science and Technology 61, 12, 2544-2558.
[17] Soroush Vosoughi, Deb Roy, and Sinan Aral. 2018. The spread of true and false news online. Science 359, 6380, 1146-1151. https://doi.org/10.1126/science.aap9559
[18] Chuhan Wu, Fangzhao Wu, Mingxiao An, Jianqiang Huang, Yongfeng Huang, and Xing Xie. 2019. NPA: Neural News Recommendation with Personalized Attention. CoRR abs/1907.05559. http://arxiv.org/abs/1907.05559