=Paper=
{{Paper
|id=Vol-2881/paper1
|storemode=property
|title=CIKM AnalytiCup 2020: COVID-19 Retweet Prediction with Personalized Attention
|pdfUrl=https://ceur-ws.org/Vol-2881/paper1.pdf
|volume=Vol-2881
|authors=T Vinayaka Raj
}}
==CIKM AnalytiCup 2020: COVID-19 Retweet Prediction with Personalized Attention==
T Vinayaka Raj, Rakuten Inc. (vinayaka.raj@rakuten.com)

ABSTRACT

This paper describes the first-place winning solution for the CIKM AnalytiCup 2020 COVID-19 retweet prediction challenge. The objective of the challenge is to predict the popularity of COVID-19 related tweets in terms of the number of retweets, and the submitted solutions are ranked based on Mean Squared Logarithmic Error (MSLE) on the leaderboard. The proposed deep learning model uses minimal hand-engineered features and learns to predict the retweet count using a personalized attention mechanism. As a tweet keyword may have different informativeness for different users, the personalized attention mechanism helps the model weigh the importance of tweet keywords based on a user's interest to retweet. Additional techniques such as adding external datasets to training and pseudo-labelling are also experimented with to further improve the MSLE score. The final solution comprises an ensemble of different personalized attention-based deep learning models, and the source code for the solution can be found at https://github.com/vinayakaraj-t/CIKM2020.

KEYWORDS

Deep Learning, Personalized attention, COVID-19, Pseudo-labelling

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). In: Dimitar Dimitrov, Xiaofei Zhu (eds.): Proceedings of the CIKM AnalytiCup 2020, 22 October 2020, Galway (Virtual Event), Ireland, published at http://ceur-ws.org.

1 INTRODUCTION

Understanding information diffusion in social networks is imperative, as it helps to comprehend social interactions among users in a better way. Information spread on a large scale in social networks enables marketers and advertisers to design their campaigns more effectively to target potential customers. In addition, identifying influential users [8] in social media is significant, as these users contribute immensely to information diffusion during viral marketing campaigns. Relationships between users on social networks heavily affect the amount of information exchanged among them. Furthermore, understanding how fake news spreads in social networks is crucial to prevent the propagation of misinformation during global pandemics such as COVID-19.

Modeling information diffusion in social networks is a research topic that has garnered increasing interest of late. In CIKM AnalytiCup 2020, the competition objective is to model the information spreading mechanism during COVID-19 by predicting the retweet count of tweets on Twitter. Retweeting is one of the functions of Twitter that helps users quickly share their own tweets, or the tweets of other users, with all of their followers. Retweets can be seen as one of the ways information spreads on Twitter and are very useful for understanding the information diffusion mechanism on Twitter. Some practical applications of information diffusion using tweets are political audience design [9, 15], fake news spreading and tracking [10, 17], and health promotion [3].

In this paper, all the techniques used to predict the retweet count of tweets are discussed in detail. The first section provides a summary of the dataset presented to solve the problem; hand-engineered features and their pre-processing techniques are also explained in this section. The model architecture and the personalized attention mechanism are described in the next section, and finally, in the last section, all the experiments carried out to improve the model score on the test leaderboard are explained in detail.

2 DATASET

The TweetsCOV19 [5] dataset provided in the competition is a large collection of COVID-19 related tweets, extracted using a seed list of 268 COVID-19 related keywords [4] from the large anonymized and annotated TweetsKB [6] corpus. TweetsCOV19 contains all the COVID-19 related tweets from October 2019 to April 2020; in total, the dataset holds around 8 million tweets posted by 3.7 million users. For each tweet, the dataset provides the user who posted it, the time of the tweet, metadata such as #Followers, #Favorites, and #Friends, and the text information of the tweet. The text information is split into entities, hashtags, mentions, and URLs. The entities of each tweet are created using the Fast Entity Linker [1, 11]. The sentiment of each tweet is also provided, extracted using SentiStrength [16], which scores each tweet between -4 (very negative) and 4 (very positive).

In addition to the given metadata features, a few more features are derived from the tweet metadata and used to predict the retweet count. A full list of features and their preprocessing techniques is provided in Table 1.

Both the original tweet keywords and their respective annotated entities are extracted and considered for analysis. Numbers and special characters are removed from hashtags and mentions, and duplicate keywords are removed from entities, hashtags, and mentions. URLs are split into two parts: the hostname of the tweet URL is extracted as URL-1, and the path of the URL is considered as URL-2. Numbers and special characters are also removed from URL-2.
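The keyword and URL preprocessing described above can be sketched as follows. This is a minimal illustration, not the competition code: the function names and the exact cleaning rule (keep letters only) are assumptions.

```python
import re
from urllib.parse import urlparse

def clean_keywords(keywords):
    """Strip numbers/special characters and drop duplicates, keeping order."""
    cleaned = [re.sub(r"[^a-zA-Z]", "", k) for k in keywords]
    seen, result = set(), []
    for k in cleaned:
        if k and k not in seen:
            seen.add(k)
            result.append(k)
    return result

def split_url(url):
    """Split a tweet URL into URL-1 (hostname) and URL-2 (cleaned path)."""
    parsed = urlparse(url)
    url1 = parsed.netloc                            # hostname -> URL-1
    url2 = re.sub(r"[^a-zA-Z]", "", parsed.path)    # cleaned path -> URL-2
    return url1, url2

print(clean_keywords(["covid19", "covid19", "#lockdown!"]))
print(split_url("https://example.com/a1/b2"))
```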
3 MODEL ARCHITECTURE

Figure 1 shows the network architecture of the retweet prediction model and the attention mechanism.

Table 1: Feature Information and Preprocessing Techniques

{| class="wikitable"
! Feature !! Description !! Preprocessing Technique
|-
| week || Week info extracted from timestamp || One-hot-encoded
|-
| time || Time info extracted from timestamp || Log transformed and then standardized
|-
| year || Year info extracted from timestamp || Log transformed and then standardized
|-
| no_entities || No. of entities in a tweet || Log transformed and then standardized
|-
| keyword_entities || No. of COVID related entities in a tweet || Log transformed and then standardized
|-
| no_hashtags || No. of hashtags in a tweet || Log transformed and then standardized
|-
| keyword_hashtags || No. of COVID related hashtags in a tweet || Log transformed and then standardized
|-
| no_mentions || No. of mentions in a tweet || Log transformed and then standardized
|-
| no_urls || No. of URLs in a tweet || Log transformed and then standardized
|-
| Sentiment || Sentiment score from SentiStrength (-4 to 4) || One-hot-encoded
|-
| #Favorites || Tweet favorites || Log transformed and then standardized
|-
| #Followers || No. of followers of a user || Log transformed and then standardized
|-
| #Friends || No. of friends of a user || Log transformed and then standardized
|-
| #Followers/#Friends || Followers-to-friends ratio || Log transformed and then standardized
|-
| #Friends/#Favorites || Friends-to-favorites ratio || Log transformed and then standardized
|-
| #Favorites/#Followers || Favorites-to-followers ratio || Log transformed and then standardized
|-
| username || Encrypted username || Label encoded
|}

Figure 1: Architecture of the deep learning model to predict retweet counts.

A high-cardinality feature such as username is embedded as a fixed-length vector using a user embedding layer, which is then passed through a series of user dense layers to obtain the final representation of the user. The word embedding layer is shared by the preprocessed keywords of tweet entities, mentions, hashtags, and URLs, and is initialized with pre-trained word vectors.
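The user tower described above can be sketched in numpy with the layer sizes from Table 3 (64-d user embedding, one 150-unit user dense layer). The random weights and tanh activation are stand-ins, not the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

n_users, emb_size, dense_size = 1000, 64, 150       # sizes follow Table 3
user_embedding = rng.normal(0, 0.01, (n_users, emb_size))  # embedding table
W_dense = rng.normal(0, 0.01, (emb_size, dense_size))      # user dense layer
b_dense = np.zeros(dense_size)

def user_vector(user_id):
    """Look up the user embedding and pass it through the user dense layer."""
    e = user_embedding[user_id]            # fixed-length embedding of the username
    return np.tanh(e @ W_dense + b_dense)  # final user representation u

u = user_vector(42)
print(u.shape)  # (150,)
```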
For a tweet, the entity word vectors are a sequence of word vectors queried from the word embedding layer, one per entity keyword extracted from that tweet. The length of the sequence is fixed for the entire dataset and is a hyper-parameter of the model: entity keyword sequences shorter than the sequence length are padded with zeros on the left, and longer ones are trimmed on the right. Word vectors for hashtags, mentions, and URLs are created in the same way as the entity word vectors. These word vector sequences are then input to an LSTM/CNN layer, which learns the representations of entities, mentions, hashtags, and URLs from their respective sequences.

Table 2: Dataset Splits

{| class="wikitable"
! Data Split !! Start Date !! End Date
|-
| Training || 2019-09-30 || 2020-04-25
|-
| Validation || 2020-04-26 || 2020-04-30
|-
| Testing Set 1 || 2020-05-01 || 2020-05-15
|-
| Testing Set 2 || 2020-05-16 || 2020-05-31
|}

Table 3: Model Setup

{| class="wikitable"
! Hyper-parameter !! Value
|-
| Optimizer || Adam
|-
| Learning Rate || 0.0001
|-
| Batch Size || 2048
|-
| Entity Sequence Length || 10
|-
| Hashtag Sequence Length || 5
|-
| Mentions Sequence Length || 5
|-
| URL-1 Sequence Length || 3
|-
| URL-2 Sequence Length || 15
|-
| User Embedding Size || 64
|-
| User Dense Layer || 150
|-
| Word Embedding Size || 150
|-
| LSTM Units || 150
|-
| CNN Units || 150
|-
| Dense Layer 1 || 500
|-
| Dense Layer 2 || 150
|}

Figure 2: Personalized attention mechanism

The user vector representation u and the LSTM/CNN output vectors v are used to build the personalized attention mechanism [18], illustrated in Figure 2. The attention weight a_i is formulated as:

a_i = v_i^T \tanh(W_u u + b_u)

\alpha_i = \frac{\exp(a_i)}{\sum_{j=1}^{m} \exp(a_j)}

where W_u and b_u are user projection parameters and m is the sequence length. The final representation r of the entities/hashtags/mentions/URLs is given by:

r = \sum_{i=1}^{m} \alpha_i v_i
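The attention equations above translate directly into numpy. In this sketch, random vectors stand in for the LSTM/CNN output sequence v and the user dense vector u; the shapes follow Table 3.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d, du = 10, 150, 150       # sequence length, output size, user-vector size

v = rng.normal(size=(m, d))   # LSTM/CNN output sequence v_1..v_m (stand-in)
u = rng.normal(size=du)       # user dense vector u (stand-in)
W_u = rng.normal(size=(d, du)) * 0.1  # user projection parameters
b_u = np.zeros(d)

# a_i = v_i^T tanh(W_u u + b_u)
a = v @ np.tanh(W_u @ u + b_u)

# alpha_i = exp(a_i) / sum_j exp(a_j)  (softmax, shifted for stability)
alpha = np.exp(a - a.max()) / np.exp(a - a.max()).sum()

# r = sum_i alpha_i v_i : user-personalized representation of the sequence
r = alpha @ v

print(alpha.sum(), r.shape)
```

Because the weights α depend on u, the same keyword sequence yields a different representation r for different users, which is the point of the personalization.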
The final representation vectors of entities, hashtags, mentions, and URLs are then concatenated with the user vector and the other features, such as tweet metadata and time-based features. The concatenated feature vector is passed through a series of dense layers to estimate the retweet count.

4 EXPERIMENTS

4.1 Experiment Setting

The dataset provided for the competition comprises all COVID-19 related tweets from 2019-09-30 to 2020-05-31. The entire month of May 2020 is held out for testing and is split into two testing sets (testing set 1 and testing set 2). Testing set 1 is used for validating the model on the leaderboard, and testing set 2 is used to rank the final winners of the competition. The rest of the dataset, from 2019-09-30 to 2020-04-30, is used for training the model. The training dataset is sorted in chronological order, and the most recent 5% of its tweets are filtered out and used as the validation set. Information about the training, validation, and testing splits is provided in Table 2.

Mean Squared Logarithmic Error (MSLE) is the evaluation metric used in this competition. MSLE is given by:

MSLE = \frac{1}{N} \sum_{i=1}^{N} (\log(p_i + 1) - \log(a_i + 1))^2

where a_i and p_i are the actual and predicted retweet counts, respectively, and N is the number of tweets. MSLE penalizes underestimation more than overestimation.

The model described in Figure 1 is trained on a Tesla V100 GPU machine. The optimal hyper-parameter settings are selected based on the model with the best MSLE score on the validation set; the tuned hyper-parameters are provided in Table 3.

4.2 Results

The performance of the models on the final testing dataset is shown in Table 4. A single personalized attention model with fastText embeddings and an LSTM head for learning tweet representations provides an MSLE score of 0.12860 on the test dataset. A large collection of annotated tweets from March and April 2020, taken from the parent TweetsKB corpus, is then added to the training dataset, and a deep learning model is trained on the combined dataset.
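The MSLE metric above is a few lines of numpy, and a small example confirms the asymmetry: under-predicting by a fixed amount costs more than over-predicting by the same amount.

```python
import numpy as np

def msle(actual, predicted):
    """Mean Squared Logarithmic Error between actual and predicted retweet counts."""
    a = np.asarray(actual, dtype=float)
    p = np.asarray(predicted, dtype=float)
    return np.mean((np.log1p(p) - np.log1p(a)) ** 2)  # log1p(x) = log(x + 1)

# Under-prediction by 50 retweets is penalized more than over-prediction by 50:
print(msle([100], [50]))   # under-predict
print(msle([100], [150]))  # over-predict (smaller error)
```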
This addition of an external dataset decreases the MSLE score by 3.76%. TweetsCOV19 is a subset of TweetsKB and hence does not include all the tweets of each user, only their COVID-related ones. Including all tweets of a user not only helps the personalized attention mechanism understand the relation between users and their tweets but also helps the model learn a richer representation of users and tweet keywords.

Table 4: Model Performance Results

{| class="wikitable"
! Model Type !! MSLE
|-
| Fasttext Embedding || 0.12860
|-
| Fasttext Embedding + External Dataset || 0.12376
|-
| Ensemble || 0.12071
|-
| Ensemble + Pseudo-Labelling || 0.12055
|}
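Reading the reported gains as relative MSLE reductions (an assumption, but one that matches the numbers), they can be reproduced from Table 4:

```python
scores = {
    "fasttext": 0.12860,                # single model
    "fasttext+external": 0.12376,       # + external TweetsKB data
    "ensemble": 0.12071,                # ten-model ensemble
    "ensemble+pseudo": 0.12055,         # + pseudo-labelling
}

def reduction(before, after):
    """Relative MSLE reduction, in percent."""
    return 100 * (before - after) / before

print(round(reduction(scores["fasttext"], scores["fasttext+external"]), 2))  # 3.76
print(round(reduction(scores["fasttext+external"], scores["ensemble"]), 2))  # 2.46
print(round(reduction(scores["ensemble"], scores["ensemble+pseudo"]), 2))    # 0.13
```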
To further improve the score, techniques such as ensembling and pseudo-labelling are also tried.

4.2.1 Ensembling. In addition to initializing the tweet keywords with fastText embeddings [2, 7], pre-trained word vectors such as GloVe-840B [12], GloVe-Twitter [12], fastText-wiki [2, 7], and LexVec [13, 14] are also used to train the deep learning model. Among the five models trained with different pre-trained vectors, fastText embedding initialization provides the best score on the testing leaderboard. Furthermore, another set of models is trained by replacing the LSTM head with a CNN head, again with all five pre-trained word vectors. The individual MSLE scores of the models with CNN heads are much lower than those of the models with LSTM heads, but ensembling all the models together provides a significant improvement on the testing leaderboard. In total, there are ten personalized attention models, and the final solution is created by ensembling all ten output predictions with simple averaging. Ensembling decreases the MSLE score by 2.464%.
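The simple-averaging ensemble is just a mean over the ten models' predictions. In this sketch, random values stand in for the outputs of the ten models (five pre-trained embeddings × {LSTM, CNN} heads):

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in predictions of the ten personalized attention models on the same tweets.
n_models, n_tweets = 10, 1000
predictions = rng.uniform(0, 5, size=(n_models, n_tweets))

# Simple averaging across models gives the final ensemble prediction per tweet.
ensemble = predictions.mean(axis=0)
print(ensemble.shape)
```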
4.2.2 Pseudo-Labelling. Pseudo-labelling is another trick tried to decrease the MSLE score. The best output solutions on the leaderboard for test set 1 and test set 2 are used as labels for the respective datasets, which are then added to the training set for building the deep learning models. Similar to the ensembling technique described above, ten different models with different pre-trained word vectors and LSTM/CNN heads are built on the new dataset, and their predictions are then averaged. Pseudo-labelling decreased the MSLE score by a very small margin of 0.132%.
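The pseudo-labelling step amounts to attaching the best leaderboard predictions to the unlabeled test tweets and folding them into the training set. A minimal sketch, with hypothetical stand-in data structures in place of the competition files:

```python
# (features, retweet-count label) pairs for the original training data (stand-ins).
train = [({"tweet_id": i}, float(i % 7)) for i in range(100)]

# Unlabeled test tweets and the best leaderboard predictions for them (stand-ins).
test1 = [{"tweet_id": 100 + i} for i in range(20)]
best_preds1 = [1.5] * 20

# Pair each test tweet with its pseudo-label and extend the training set;
# the ten ensemble models are then retrained on this augmented set.
pseudo = list(zip(test1, best_preds1))
augmented_train = train + pseudo
print(len(augmented_train))
```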
5 CONCLUSION

A methodology to estimate the retweet count of COVID-related tweets is proposed in this paper. The personalized attention-based deep learning model described here uses few hand-engineered features and learns a rich representation of users and tweet keywords to predict retweet counts. To further improve the performance of the model, techniques such as adding external datasets, ensembling, and pseudo-labelling are also tried. The final solution, an ensemble of deep learning models, placed the team first on the testing leaderboard.

REFERENCES

[1] Roi Blanco, Giuseppe Ottaviano, and Edgar Meij. 2015. Fast and Space-Efficient Entity Linking in Queries. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining (WSDM '15). ACM, New York, NY, USA.
[2] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching Word Vectors with Subword Information. CoRR abs/1607.04606. http://arxiv.org/abs/1607.04606
[3] Jae Eun Chung. 2017. Retweeting in health promotion: Analysis of tweets about Breast Cancer Awareness Month. Computers in Human Behavior 74. https://doi.org/10.1016/j.chb.2017.04.025
[4] Dimitar Dimitrov. 2020 (accessed August 2020). COVID-19 related keywords. https://data.gesis.org/tweetscov19/keywords.txt
[5] Dimitar Dimitrov, Erdal Baran, Pavlos Fafalios, Ran Yu, Xiaofei Zhu, Matthäus Zloch, and Stefan Dietze. 2020. TweetsCOV19 - A Knowledge Base of Semantically Annotated Tweets about the COVID-19 Pandemic. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management. ACM, New York, NY, USA, 2991-2998. https://doi.org/10.1145/3340531.3412765
[6] Pavlos Fafalios, Vasileios Iosifidis, Eirini Ntoutsi, and Stefan Dietze. 2018. TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets. In European Semantic Web Conference. Springer, 177-190.
[7] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of Tricks for Efficient Text Classification. CoRR abs/1607.01759. http://arxiv.org/abs/1607.01759
[8] E. Kafeza, A. Kanavos, C. Makris, and P. Vikatos. 2014. T-PICE: Twitter Personality Based Influential Communities Extraction System. In 2014 IEEE International Congress on Big Data, 212-219. https://doi.org/10.1109/BigData.Congress.2014.38
[9] Eunice Kim, Yongjun Sung, and Hamsu Kang. 2014. Brand followers' retweeting behavior on Twitter: How brand relationships influence brand electronic word-of-mouth. Computers in Human Behavior 37, 18-25. https://doi.org/10.1016/j.chb.2014.04.020
[10] Cristian Lumezanu, Nick Feamster, and Hans Klein. 2012. #bias: Measuring the Tweeting Behavior of Propagandists.
[11] Aasish Pappu, Roi Blanco, Yashar Mehdad, Amanda Stent, and Kapil Thadani. 2017. Lightweight Multilingual Entity Extraction and Linking. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining (WSDM '17). ACM, New York, NY, USA, 365-374. https://doi.org/10.1145/3018661.3018724
[12] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Empirical Methods in Natural Language Processing (EMNLP), 1532-1543. http://www.aclweb.org/anthology/D14-1162
[13] Alexandre Salle, Marco Idiart, and Aline Villavicencio. 2016. Enhancing the LexVec Distributed Word Representation Model Using Positional Contexts and External Memory. CoRR abs/1606.01283. http://arxiv.org/abs/1606.01283
[14] Alexandre Salle, Aline Villavicencio, and Marco Idiart. 2016. Matrix Factorization using Window Sampling and Negative Sampling for Improved Word Representations. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). ACL, Berlin, Germany, 419-424. https://doi.org/10.18653/v1/P16-2068
[15] S. Stieglitz and L. Dang-Xuan. 2012. Political Communication and Influence through Microblogging - An Empirical Analysis of Sentiment in Twitter Messages and Retweet Behavior. In 2012 45th Hawaii International Conference on System Sciences, 3500-3509. https://doi.org/10.1109/HICSS.2012.476
[16] Mike Thelwall, Kevan Buckley, Georgios Paltoglou, Di Cai, and Arvid Kappas. 2010. Sentiment strength detection in short informal text. Journal of the American Society for Information Science and Technology 61, 12, 2544-2558.
[17] Soroush Vosoughi, Deb Roy, and Sinan Aral. 2018. The spread of true and false news online. Science 359, 6380, 1146-1151. https://doi.org/10.1126/science.aap9559
[18] Chuhan Wu, Fangzhao Wu, Mingxiao An, Jianqiang Huang, Yongfeng Huang, and Xing Xie. 2019. NPA: Neural News Recommendation with Personalized Attention. CoRR abs/1907.05559. http://arxiv.org/abs/1907.05559