=Paper=
{{Paper
|id=Vol-2881/paper3
|storemode=property
|title=Feature Extraction for Deep Neural Networks: A Case Study on the COVID-19 Retweet Prediction Challenge
|pdfUrl=https://ceur-ws.org/Vol-2881/paper3.pdf
|volume=Vol-2881
|authors=Daichi Takehara
}}
==Feature Extraction for Deep Neural Networks: A Case Study on the COVID-19 Retweet Prediction Challenge==
Daichi Takehara, Aidemy Inc. (takehara-d@aidemy.co.jp)

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). In: Dimitar Dimitrov, Xiaofei Zhu (eds.): Proceedings of the CIKM AnalytiCup 2020, 22 October 2020, Galway (Virtual Event), Ireland, published at http://ceur-ws.org.

ABSTRACT

This paper presents our solution for the COVID-19 Retweet Prediction Challenge, which is part of the CIKM 2020 AnalytiCup. The challenge was to predict how many times tweets related to COVID-19 would be retweeted. We tackled this challenge with a deep neural network (DNN)-based retweet prediction method, into which we introduced useful feature extraction techniques. Experiments confirmed the effectiveness of these techniques, especially of the two primary processes: numerical feature transformation and user modeling. Finally, the solution used a stacking-based ensemble method to produce the final predictions for the competition. The code for this solution is available at https://github.com/haradai1262/CIKM2020-AnalytiCup.

KEYWORDS

Information diffusion, Retweet prediction, Feature extraction, Deep learning, COVID-19

1 INTRODUCTION

Understanding the mechanisms of information diffusion is an active area of research with many practical applications. In a crisis like COVID-19, information diffusion directly influences people's behavior and becomes especially valuable [6]. Retweeting, i.e., sharing tweets directly with one's followers on Twitter, can be viewed as amplifying the diffusion of the original content. Retweet prediction is therefore beneficial for understanding the mechanisms of information diffusion.

Retweet prediction has been widely studied. In recent years, there has been growing interest in methods based on deep neural networks (DNNs), which have reported high performance [10, 15, 19]. DNNs have made it possible to skip much of the feature engineering, especially in image processing and natural language processing. However, for DNNs on tabular data, which includes retweet prediction, data pre-processing and feature engineering are still often necessary and significantly impact performance [9, 14].

In retweet prediction, the processing of numerical features related to tweets, such as the number of followers, strongly affects performance. To train DNNs effectively, it can be useful to transform the numerical features into different distributions [20]. Furthermore, it is crucial to learn a representation of the user who publishes the tweet. Although embedding the user ID is a common approach in DNN-based methods, it is not easy to sufficiently learn representations of the infrequent users in the training data [5]. In short, the input features to the DNN must be designed according to the data and task, which can be difficult.

As a case study in tackling these difficulties, this paper presents our solution to the COVID-19 Retweet Prediction Challenge of the CIKM 2020 AnalytiCup. The challenge's task was to predict the number of retweets of a given COVID-19-related tweet. We propose a DNN-based retweet prediction method in which we introduce a useful feature extraction stage that produces the inputs to the DNN. In the feature extraction, we transform numerical features into multiple different distributions to effectively utilize the metrics related to tweets. In addition, we cluster users based on multiple types of features so that user attributes can be represented even for infrequent users. Using the obtained features, we train a DNN model. In the experiments, we verify the effectiveness of the numerical feature transformations and the user data handling, which are essential issues in retweet prediction. Finally, as our solution for the competition, we introduce a stacking-based ensemble method to improve the performance and robustness of the predictions.

2 CHALLENGE

2.1 Dataset

In the COVID-19 Retweet Prediction Challenge, the TweetsCOV19 dataset was provided. This dataset consists of 8,151,524 tweets concerning COVID-19 published on Twitter by 3,664,518 users from October 2019 until April 2020. For each tweet, the dataset provides metadata and some precalculated features. The contents of the dataset and the process of their generation are detailed in [3].

2.2 Task Description

Given a tweet from the TweetsCOV19 dataset, the task was to predict its number of retweets (#retweets). The test data for the evaluation are tweets published during May 2020, the month subsequent to the tweets included in the TweetsCOV19 dataset. The mean squared log error (MSLE) is used as the evaluation metric for the task.
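For reference, MSLE compares predictions and ground truth in log space, so relative rather than absolute errors are penalized. A minimal sketch in Python (the function name is ours, not part of the competition toolkit):

<pre>
import numpy as np

def msle(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean squared log error, the challenge's evaluation metric."""
    return float(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))

# Example: predicting 10 retweets when the ground truth is 12.
print(msle(np.array([12.0]), np.array([10.0])))  # ~0.0279
</pre>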
3 METHOD

3.1 Overview

An overview of the proposed method is shown in Figure 1. First, we extract the features to be input to the DNN. These features are divided into numerical, categorical, and multi-hot categorical features; the categorical and multi-hot categorical features are converted into low-dimensional vectors through embedding layers. Using the extracted features, we train a multilayer perceptron (MLP) for retweet prediction.

Figure 1: Overview of our proposed method. The notation of the features corresponds to the Name column in Table 1. (The figure shows the numerical features concatenated with embeddings of the categorical and multi-hot categorical features, followed by fully-connected layers of sizes 2048, 512, and 128, each with batch normalization, ReLU, and dropout, a final fully-connected layer of size 1 with ReLU, and an MSE loss against the log-transformed #retweets.)

3.2 Features

The features used in the proposed method are shown in Table 1. Numerical feature transformation and user modeling, the critical issues of retweet prediction, are discussed in the following subsections. Please refer to the published code (https://github.com/haradai1262/CIKM2020-AnalytiCup/blob/master/src/feature_extraction.py) for the exact processing.

Table 1: Feature table. Numerical, categorical, and multi-hot categorical features are denoted by N, C, and MC in the Type column, respectively. The value in the #Dim column is the number of dimensions of the feature.

Name | Type | #Dim | Description
Tweet metrics | N | 6 | Metrics related to a tweet: #followers, #friends, and #favorites, as well as the products #followers × #favorites, #friends × #favorites, and #followers × #friends × #favorites
Tweet metrics z | N | 6 | Values obtained by applying the z transformation to "tweet metrics"
Tweet metrics CDF | N | 6 | Values obtained by applying the CDF transformation to "tweet metrics"
Tweet metrics rank | N | 6 | Values obtained by applying the rank transformation to "tweet metrics"
Tweet metrics log | N | 6 | Values obtained by applying the log transformation to "tweet metrics"
Tweet metrics binning | N | 6 | Values obtained by applying the binning transformation to "tweet metrics"
Sentiment | C | 2 | Positive (1 to 5) and negative (-1 to -5) sentiment scores extracted from the tweet text by SentiStrength [18]
Time | N, C | 5 | Features from the tweet timestamp: "weekday", "hour", "day", and "week of month" as categorical features, and the difference between the timestamp and 2020/6/1 as a numerical feature
Entities | MC | 1 | Entities extracted from the tweet text by the Fast Entity Linker [1]
Hashtags | MC | 1 | Hashtags included in a tweet
Mentions | MC | 1 | Mentions included in a tweet
URLs | MC | 1 | URLs included in a tweet
Components of URLs | MC | 3 | The components "protocol", "host", and "top-level domain" of the URLs in a tweet (e.g., "http", "www.youtube.com", and ".com" are extracted from http://www.youtube.com/)
User ID | C | 1 | User identifier
User cluster ID | C | 3 | Identifiers assigned to a user by the three clustering methods described in section 3.2.2
User metrics | N | 10 | The mean and standard deviation of a user's followers, friends, and favorites, and the unique numbers of entities, hashtags, mentions, and URLs in the user's tweet history
User dynamics | N | 8 | The increase in a user's #followers and #friends from the previous day, from the previous week, on the same day, and within the same week
Topic | N | 5 | 5-dimensional features from applying TFIDF [2] to sequences of the entities, hashtags, mentions, and URLs in tweets, reduced by SVD [4]
Count encoding | N | 6 | Values from applying count encoding [16] to the categorical features "sentiment" and "time"
Target encoding | N | 11 | Values from applying target encoding [12] to the categorical features "tweet metrics binning", "sentiment", "time", and "user ID"

3.2.1 Numerical Feature Transformation. #retweets is strongly related to the metrics of a tweet, such as the numbers of followers (#followers) and favorites (#favorites), which are expected to have a significant impact on the performance of our prediction model. In the proposed method, we represent various distributions of these metrics and improve performance by combining them as input to the DNN. Specifically, we introduce the following five numerical feature transformations.

Z transformation. We transform each value x_i \in X using the mean \bar{x} and standard deviation \sigma of the dataset X:

F_z(x_i) = \frac{x_i - \bar{x}}{\sigma}    (1)

CDF transformation. We fit a normal distribution to the mean and standard deviation observed in the dataset and transform the original values by its cumulative distribution function (CDF). To implement this transformation, we used the Python library SciPy (scipy.stats.norm).

Rank transformation. We transform each value x_i \in X by counting the values smaller than it:

F_{rank}(x_i) = \sum_{x_j \in X} \mathbb{I}[x_j < x_i],    (2)

where \mathbb{I}[x_j < x_i] is 1 if x_j < x_i and 0 otherwise.

Log transformation. We transform each value x_i \in X by

F_{log}(x_i) = \log_e(x_i + 1)    (3)

Here we add one to x_i so that the output stays finite when x_i is zero.

Binning transformation. We separate the values into equal-sized buckets based on the quantiles of the sample; in the proposed method, the values are divided into ten quantiles. Unlike the other transformations, we use the transformed values as categorical features.

We apply these transformations to the tweet metrics (Table 1) and input the obtained values into the MLP.
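The five transformations can be sketched as follows. This is a simplified illustration that assumes each metric is a one-dimensional array; the published feature-extraction code is the authoritative version:

<pre>
import numpy as np
import pandas as pd
from scipy.stats import norm, rankdata

def z_transform(x: np.ndarray) -> np.ndarray:
    # Eq. (1): standardize by the dataset mean and standard deviation.
    return (x - x.mean()) / x.std()

def cdf_transform(x: np.ndarray) -> np.ndarray:
    # CDF of a normal distribution fitted to the observed mean/std.
    return norm.cdf(x, loc=x.mean(), scale=x.std())

def rank_transform(x: np.ndarray) -> np.ndarray:
    # Eq. (2): the number of values strictly smaller than x_i.
    return rankdata(x, method="min") - 1

def log_transform(x: np.ndarray) -> np.ndarray:
    # Eq. (3): log(x + 1) keeps zero-valued metrics finite.
    return np.log1p(x)

def binning_transform(x: np.ndarray, n_bins: int = 10) -> np.ndarray:
    # Quantile buckets, used downstream as a categorical feature.
    return pd.qcut(x, q=n_bins, labels=False, duplicates="drop")
</pre>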
3.2.2 User Modeling. Appropriately representing the user who published the tweet is essential for predicting #retweets. For DNN-based prediction models, a common and effective method is to feed the user ID into an embedding layer. However, the embeddings of infrequent users in the training data are not sufficiently trained [5]. By clustering users from various points of view and embedding users based on their cluster IDs, we can learn user attributes even for infrequent users. Specifically, we introduce the following three types of user clustering.

User topic clustering. We clustered users by the topics contained in their tweets. Specifically, we combined the entities, hashtags, mentions, and URLs included in the tweets and treated them as a sequence for each user. Next, user topic features were extracted by applying term frequency-inverse document frequency (TFIDF) [2] to the sequences and reducing the dimensionality with singular value decomposition (SVD) [4]. Using the extracted features, we applied K-means clustering [11] to the users.

User metric clustering. We clustered users by user-related metrics. The user metric features consist of the mean and standard deviation of each user's followers, friends, and favorites, as well as the unique numbers of entities, hashtags, mentions, and URLs in the tweet log posted by each user. Using these features, we applied K-means clustering to the users.

User topic and metric clustering. Using features that concatenate the user topic features and the user metric features, we applied K-means clustering to the users.

Note that the number of clusters is set to 1000 in each clustering.
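The user topic clustering condenses into a few lines of scikit-learn. In this sketch, we assume each user's entities, hashtags, mentions, and URLs have already been joined into one token string per user; apart from the 1000 clusters and the 5 SVD dimensions from Table 1, the hyperparameters are illustrative:

<pre>
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

def user_topic_clusters(user_docs, n_clusters=1000):
    """user_docs[i]: space-joined entities/hashtags/mentions/URLs of user i."""
    tfidf = TfidfVectorizer().fit_transform(user_docs)          # TFIDF features
    topics = TruncatedSVD(n_components=5).fit_transform(tfidf)  # SVD reduction
    return KMeans(n_clusters=n_clusters).fit_predict(topics)    # cluster IDs
</pre>

The resulting cluster IDs are then treated as categorical features and embedded, just like the user ID itself.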
3.3 Model

Using the extracted features, we trained the MLP. In the proposed method, the inputs of the MLP are divided into numerical, categorical, and multi-hot categorical features. The numerical features were min-max scaled to the range [0, 1]. The categorical features were transformed into low-dimensional vectors by embedding layers. Specifically, we represent the one-hot vector x_i of a categorical feature by a low-dimensional vector e_i:

e_i = E_c x_i,    (4)

where E_c is the embedding matrix for categorical feature c. We further modify this to represent the multi-hot vector x_i of a multi-hot categorical feature as

e_i = \frac{1}{n_c} E_c x_i,    (5)

where n_c is the number of items a sample has for categorical feature c. The processed values are concatenated and flattened before being input to the MLP.

As shown in Figure 1, the MLP uses ReLU [13] as the activation function and includes batch normalization [7] and dropout [17]. In the proposed method, the mean squared error (MSE) loss is calculated between the log-transformed ground-truth #retweets and the prediction of the MLP. Note that, at inference time, the inverse transformation is applied to the output value to return it to the original scale.
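The following PyTorch sketch shows how Eqs. (4) and (5) and the architecture of Figure 1 fit together: single-ID features use a plain embedding lookup, while multi-hot features average their lookups, which equals (1/n_c) E_c x_i. It is a minimal reading of the paper, not the author's exact implementation:

<pre>
import torch
import torch.nn as nn

class RetweetMLP(nn.Module):
    def __init__(self, n_num, cat_sizes, emb_dim=32):
        super().__init__()
        # One embedding table per (multi-hot) categorical feature.
        self.embs = nn.ModuleList(nn.Embedding(n, emb_dim) for n in cat_sizes)
        in_dim = n_num + emb_dim * len(cat_sizes)
        layers = []
        for h in (2048, 512, 128):  # layer sizes from Figure 1
            layers += [nn.Linear(in_dim, h), nn.BatchNorm1d(h),
                       nn.ReLU(), nn.Dropout(0.3)]
            in_dim = h
        layers += [nn.Linear(in_dim, 1), nn.ReLU()]  # predicts log(1 + #retweets)
        self.mlp = nn.Sequential(*layers)

    def forward(self, x_num, cat_ids):
        # cat_ids[j]: LongTensor (batch, n_c) of item IDs for feature j.
        # Averaging over the n_c lookups implements Eq. (5); for a single
        # ID (n_c = 1) it reduces to the plain lookup of Eq. (4).
        embs = [emb(ids).mean(dim=1) for emb, ids in zip(self.embs, cat_ids)]
        return self.mlp(torch.cat([x_num] + embs, dim=1)).squeeze(1)
</pre>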
3.4 Validation Strategy

The dataset was divided into training and validation data as follows. In this competition, the test data for evaluation are from May 2020, one month after the data included in the training data. To bring the distribution of the validation data closer to that of the test data, the validation data should be as close to the test data as possible in the time series. Thus, we used the data from April 2020, the final month of the training period, as the validation data. We also wanted to utilize this fresh data, closest to the test data, for training. For this reason, the April 2020 data was divided into five validation folds, and five models were trained: when validating on one fold, the remaining four folds are added to the training data. Finally, the prediction for the test data was calculated by each model, and the evaluation score was calculated from the average of the five predictions.
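A sketch of this split, under the assumptions above (the month encoding and fold handling are illustrative): the final month of the training period is cut into five folds, each model trains on all earlier data plus four folds, and test predictions are averaged over the five models:

<pre>
import numpy as np
from sklearn.model_selection import KFold

def temporal_folds(months: np.ndarray, n_folds: int = 5):
    """Yield (train_idx, valid_idx) pairs. Validation folds are drawn only
    from the final month of the training period, closest to the test set."""
    last = months == months.max()
    early_idx = np.flatnonzero(~last)   # always used for training
    late_idx = np.flatnonzero(last)     # split into n_folds validation folds
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=0)
    for tr, va in kf.split(late_idx):
        yield np.concatenate([early_idx, late_idx[tr]]), late_idx[va]

# The final test-set prediction is the mean of the five models' predictions.
</pre>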
4 EXPERIMENTS

4.1 Settings

The experimental results below are not scores on the test dataset but the average over the 5-fold validation described in section 3.4. We empirically set the sizes of the three fully-connected layers to 2048, 512, and 128, the embedding dimension to 32, the dropout rate to 0.3, and the batch size to 256. We use Adam [8] to optimize all models. The other hyperparameters can be checked exactly in the published code.

4.2 Results

First, we verified the effectiveness of the numerical feature transformations introduced in section 3.2.1. In the experiment, we tried the tweet metrics without any transformation, with each single transformation applied, and with all transformations applied. The experimental results are shown in Table 2. The log, rank, CDF, z, and binning transformations contributed to improving the performance, in that order. Since the numbers of followers and favorites in the tweet metrics follow a power law, it is reasonable that the log transformation is useful. The best MSLE was obtained when applying all transformations. This result shows the effectiveness of transforming the tweet metrics into different distributions and inputting them into the DNN model for retweet prediction.

Table 2: Comparison of numerical feature transformations.

Method | MSLE
Tweet metrics | 0.187028
Tweet metrics + z transformation | 0.173821
Tweet metrics + CDF transformation | 0.151882
Tweet metrics + log transformation | 0.129360
Tweet metrics + rank transformation | 0.130994
Tweet metrics + binning transformation | 0.174810
Tweet metrics + all transformations | 0.127761

Next, we verified the effectiveness of the user modeling introduced in section 3.2.2. In the experiment, regarding the embeddings of the user ID and the user cluster ID, we tried using neither, using either one, and using both. The experimental results are shown in Table 3. The performance improved when the user cluster ID was used in addition to the user ID, compared to using the user ID alone.

Table 3: Comparison of user modeling features.

Method | MSLE
Neither user ID nor user cluster ID | 0.144809
User ID | 0.128004
User cluster ID | 0.137432
User ID and user cluster ID | 0.127761

5 SOLUTION

We used an ensemble of multiple models with modified hyperparameters (embedding dimension, sizes of the fully-connected layers, and dropout rate) and loss functions. Table 4 shows the seven models used for the ensemble. Stacking ridge regression [2] was used as the ensemble method: each model's predictions are blended by a linear combination whose weights are learned by ridge regression. The final predicted value was rounded to an integer, following the competition's conventions. The final leaderboard is shown in Table 5; our solution placed 3rd.

Table 4: Models used for the ensemble of our solution. MAE in the Loss column denotes the mean absolute error loss.

Embedding dim | Sizes of FC layers | Dropout rate | Loss | MSLE
32 | 2048, 512, 128 | 0.1 | MSE | 0.128448
32 | 2048, 512, 128 | 0.3 | MSE | 0.127761
32 | 2048, 512, 128 | 0.5 | MSE | 0.128413
40 | 4096, 1024, 128 | 0.1 | MSE | 0.127964
40 | 4096, 1024, 128 | 0.3 | MSE | 0.127810
40 | 4096, 1024, 128 | 0.5 | MSE | 0.128520
40 | 4096, 1024, 128 | 0.1 | MAE | 0.132143

Table 5: Final submission results of the top six teams (semi-finalists) in the competition.

Rank | Team | MSLE (test dataset)
1 | vinayaka | 0.120551
2 | mc-aida | 0.121094
3 | myaunraitau (ours) | 0.136239
4 | parklize | 0.149997
5 | JimmyChang | 0.156876
6 | Thomary | 0.169047
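The stacking step reduces to fitting a ridge regression on the validation predictions and applying the learned weights to the test predictions. A minimal sketch (function and variable names are ours; predictions are assumed to be in log space, as in the training loss):

<pre>
import numpy as np
from sklearn.linear_model import Ridge

def stack_predictions(valid_preds, y_valid, test_preds, alpha=1.0):
    """valid_preds/test_preds: (n_samples, n_models) arrays of per-model
    predictions of log(1 + #retweets); y_valid: log1p of true #retweets."""
    blender = Ridge(alpha=alpha).fit(valid_preds, y_valid)
    blended = blender.predict(test_preds)
    # Invert the log transformation and round to an integer #retweets.
    return np.rint(np.expm1(blended)).clip(min=0).astype(int)
</pre>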
6 CONCLUSION

This paper presented our solution for the COVID-19 Retweet Prediction Challenge. We proposed a DNN-based retweet prediction method. To improve its performance, we introduced a feature extraction method for the DNN inputs, mainly focusing on numerical feature transformation and user modeling, and confirmed its effectiveness experimentally. As our solution for the competition, we introduced a stacking-based ensemble over multiple models, which placed us 3rd.

REFERENCES

[1] Roi Blanco, Giuseppe Ottaviano, and Edgar Meij. 2015. Fast and Space-Efficient Entity Linking for Queries. In Proceedings of the Eighth ACM Int. Conf. on Web Search and Data Mining. ACM, 179–188.
[2] Leo Breiman. 1996. Stacked regressions. Machine Learning 24, 1 (1996), 49–64.
[3] Dimitar Dimitrov, Erdal Baran, Pavlos Fafalios, Ran Yu, Xiaofei Zhu, Matthäus Zloch, and Stefan Dietze. 2020. TweetsCOV19 - A Knowledge Base of Semantically Annotated Tweets about the COVID-19 Pandemic. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management. Association for Computing Machinery, New York, NY, USA, 2991–2998. https://doi.org/10.1145/3340531.3412765
[4] Nathan Halko, Per-Gunnar Martinsson, and Joel A. Tropp. 2011. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review 53, 2 (2011), 217–288.
[5] Casper Hansen, Christian Hansen, Jakob Grue Simonsen, Stephen Alstrup, and Christina Lioma. 2020. Content-aware Neural Hashing for Cold-start Recommendation. In Proceedings of the 43rd Int. ACM SIGIR Conf. on Research and Development in Information Retrieval. 971–980.
[6] Cindy Hui, Yulia Tyshchuk, William A. Wallace, Malik Magdon-Ismail, and Mark Goldberg. 2012. Information cascades in social media in response to a crisis: a preliminary model and a case study. In Proceedings of the 21st Int. Conf. on World Wide Web. 653–656.
[7] Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd Int. Conf. on Machine Learning. 448–456.
[8] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980.
[9] Yuanfei Luo, Mengshuo Wang, Hao Zhou, Quanming Yao, Wei-Wei Tu, Yuqiang Chen, Wenyuan Dai, and Qiang Yang. 2019. AutoCross: Automatic feature crossing for tabular data in real-world applications. In Proceedings of the 25th ACM SIGKDD Int. Conf. on Knowledge Discovery & Data Mining. 1936–1945.
[10] Renfeng Ma, Xiangkun Hu, Qi Zhang, Xuanjing Huang, and Yu-Gang Jiang. 2019. Hot Topic-Aware Retweet Prediction with Masked Self-attentive Model. In Proceedings of the 42nd Int. ACM SIGIR Conf. on Research and Development in Information Retrieval. 525–534.
[11] James MacQueen et al. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1. 281–297.
[12] Daniele Micci-Barreca. 2001. A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems. ACM SIGKDD Explorations Newsletter 3, 1 (2001), 27–32.
[13] Vinod Nair and Geoffrey E. Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th Int. Conf. on Machine Learning. 807–814.
[14] Jean-François Puget. 2017. Feature Engineering For Deep Learning. https://medium.com/inside-machine-learning/feature-engineering-for-deep-learning-2b1fc7605ace. Accessed: 2020-09-28.
[15] Jiezhong Qiu, Jian Tang, Hao Ma, Yuxiao Dong, Kuansan Wang, and Jie Tang. 2018. DeepInf: Social influence prediction with deep learning. In Proceedings of the 24th ACM SIGKDD Int. Conf. on Knowledge Discovery & Data Mining. 2110–2119.
[16] Shubham Singh. 2020. Categorical Variable Encoding Techniques. https://medium.com/analytics-vidhya/categorical-variable-encoding-techniques-17e607fe42f9. Accessed: 2020-09-28.
[17] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15, 1 (2014), 1929–1958.
[18] Mike Thelwall, Kevan Buckley, Georgios Paltoglou, Di Cai, and Arvid Kappas. 2010. Sentiment strength detection in short informal text. Journal of the American Society for Information Science and Technology 61, 12 (2010), 2544–2558.
[19] Qi Zhang, Yeyun Gong, Jindou Wu, Haoran Huang, and Xuanjing Huang. 2016. Retweet prediction with attention-based deep neural network. In Proceedings of the 25th ACM Int. Conf. on Information and Knowledge Management. 75–84.
[20] Honglei Zhuang, Xuanhui Wang, Michael Bendersky, and Marc Najork. 2020. Feature transformation for neural ranking models. In Proceedings of the 43rd Int. ACM SIGIR Conf. on Research and Development in Information Retrieval. 1649–1652.