News Feed for Stock Movement Prediction

Andriy Krysovatyy1[0000-0003-1545-0584], Oleksandra Vasylchyshyn1[0000-0002-9948-5532], Oksana Desyatnyuk1 and Svitlana Galeshchuk1,2[0000-0002-6706-3028]

1 Faculty of Finance, Ternopil National Economic University, Ternopil 46006, Ukraine
2 Governance Analytics, Paris Dauphine University, Paris 75016, France
rector@tneu.edu.ua, volexandra@gmail.com, desyatnyuk.oksana@tneu.edu.ua, svitlana.galeshchuk@dauphine.fr

Abstract. The study aims at predicting 10-day stock return movements using heterogeneous data over a timespan of 5 years: historical stock market performance and the news feed with information on the particular firm's asset. Feature engineering helps reduce the number of variables used in the classification model as it excludes multicollinearity. A suite of parametric and non-parametric machine learning methods has not provided satisfactory accuracy: the random forest ensemble gives only 66% precision on the out-of-sample data using all features and 51% with only historical data from the stock market. This motivated us to develop a convolutional neural network architecture which delivered significantly better results.

Keywords: classification, stock market, prediction, machine learning, convolutional neural networks

1 Introduction

Stock exchange prediction is a longstanding challenge that spurs interest in time-series modelling, pattern detection, and the analysis of macroeconomic and market data among both academics and practitioners. Our research contributes to this domain with its general objective to predict the directional change of stock exchange returns.

Predictability of stock prices from past and current information is a fundamental basis for modern trading techniques with implications in investing. It constitutes one of the most profound controversies between academics and market participants.
Despite that, fundamental and technical analyses are still used by foreign exchange professionals to predict movements in the currency market due to the belief that price fluctuations will reflect known patterns.

Technical analysis implies three main principles (Neely & Weller, 2011): (1) asset price history incorporates all relevant information, so any research into asset fundamentals is pointless; (2) asset prices move in trends, which is a circumstantial factor for academic investigation because trends imply predictability and allow traders to make profits; (3) history tends to repeat itself, which traders exploit by adhering to patterns observed under similar conditions.

Fundamental analysis involves the use of economic data (e.g., production, consumption, disposable income) to forecast prices.

However, researchers often do not take into account nonlinearities between economic data, political and behavioural factors, and financial markets. Heaton et al. (2016) point out that the possibly relevant data for financial market prediction is extensive, while the importance of the data and the potentially complex interactions in the data are not well specified by financial economic theory (see also Engel, 2014). Behavioural factors are frequently omitted from the models.

Since financial markets are complex, evolutionary, noisy, and nonlinear dynamic systems (Huang & Tsai, 2009), more adaptive and flexible mechanisms are required to improve forecasting accuracy (Cavalcante et al., 2016). This motivates researchers to investigate the ability of more flexible methods, in particular machine learning methods, to study financial markets (see Chen et al., 2015; Patel et al., 2015).

Nevertheless, methodology is not the only factor that defines the experimental outcomes. The quality and richness of the data, together with feature engineering, play a crucial role in achieving high accuracy on the out-of-sample set.
The choice of variables and their tuning paves the way to convincing results. Domain experts usually determine the set of features based on prior knowledge of the dependent variable. As mentioned above, researchers tend to build on macroeconomic indicators, historical market data, or both. However, a significant part of the economic community believes that behavioural factors such as news and public reaction may have a significant influence on stock prices. We decided to test this mainstream of economic thought by developing our predictive model with extant machine learning methods.

This conclusion, along with the available data, shaped and specified our general objective mentioned at the beginning of this section. We now define it as follows: this paper aims at developing a prediction method for the directional change of stock exchange 10-day returns with cutting-edge machine learning approaches by integrating historical market features and news data.

The paper is organized as follows: Section 2 introduces the data and its descriptive analysis. Section 3 elaborates on the methodological set-up of the study and the evaluation techniques considered. Section 4 presents the results of our model and its comparison with the outputs provided by other existing methods. Section 5 concludes with comments and directions for future research.

2 Data

2.1 Data Sources

The data on stock exchange performance is publicly available, so we did not experience any challenges in obtaining it. However, collecting and processing the news is a time-consuming and labor-intensive task. Many datasets are now available for training models, and Kaggle1 contributes to the machine learning community by publishing some data from trustworthy sources. The Kaggle competition "Two Sigma: Using News to Predict Stock Movements" includes market and news data from 2007 to 2016.

1 https://www.kaggle.com/c/two-sigma-financial-news
Moreover, Thomson Reuters, a mass media and information firm with a longstanding tradition of news procurement, is the supplier of this dataset.

The market data reflects the following indicators for US-listed firms and their assets:

1) raw open-to-open daily returns, market-residualized open-to-open returns;
2) 10-day raw open-to-open returns, 10-day market-residualized open-to-open returns;
3) raw close-to-close daily returns, market-residualized close-to-close daily returns;
4) 10-day raw close-to-close returns, 10-day market-residualized close-to-close returns;
5) daily trading volume in shares;
6) daily open price;
7) daily close price;
8) 10-day forward market-residualized open-to-open daily returns.

The news table comprises data on the articles published concerning a particular company and its assets: the title, source, sentiment (negative, neutral, positive) of a story, word count, novelty vis-à-vis previous news (12-hour, 24-hour, 3-day, 5-day, 7-day novelty), volume of news (12-hour, 24-hour, 3-day, 5-day, 7-day volume), news relevance, sentiment scores (positive, negative, neutral, general (binary)), and news urgency.

Our task is to predict the directional change of the 10-day forward market-residualized open-to-open daily returns (whether they will go down, stay stable, or go up). For more details on the dataset and its variables please follow the link.2

2.2 Descriptive Statistics and Feature Engineering

The market dataset accounts for circa 4 mln observations (3,979,902) for more than 2000 firms. We first create some new variables such as "price difference" (the ratio between the difference in the close and open prices and the open price), "volume percentage change", and "absolute change". With this rich dataset, the problem of missing values occurred; we simply drop the rows containing 'nan' values. Linear correlation analysis (Fig.
1a) does not show clear dependencies between the 10-day forward open returns and the other variables. However, it provides insights into possible multicollinearity to avoid in developing the model.

The Jaccard index measures the non-linear dependencies between the sets of data. We first calculated the directional change for each variable as:

b[t] = 1 if x[t+1] > x[t], and 0 otherwise.

The Jaccard index for two Boolean arrays may in our case be defined as:

J(X, Y) = (C_{0X} + C_{Y0}) / (C_{XY} + C_{0X} + C_{Y0}),

where C_{XY} represents the number of occurrences when both X and Y are equal to 1, and C_{0X} / C_{Y0} the number of occurrences when X / Y is equal to 0. Fig. 1b depicts the results that support the conclusion about the non-linear relationship between 10-day forward returns and the rest of the open-to-open returns. Moreover, interestingly, the difference in close and open prices relates to the 1-day close-to-close returns.

Fig. 1. a) Correlation between the variables; b) Jaccard index measure of the similarity between the sets of data

The data distribution in Fig. 2 shows that close-to-close returns generally stay close to the mean, while open-to-open ones are more dispersed. The plot also detects outliers: we remove the observations with 10-day future returns that violate the boundaries [-800; 800], filtering away circa 16,000 observations. Based on the output of the descriptive statistics, we decide to ignore close-to-close returns in our experimental set-up.

The next step explores the news dataset. The table contains many possible variables, but some feature engineering is necessary to avoid overfitting and multicollinearity.

2 https://www.kaggle.com/c/two-sigma-financial-news/data
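The data-preparation steps above (derived price variables, dropping of missing rows, directional-change encoding, and the Jaccard index) can be sketched in Python as follows. The column names and toy values are illustrative, not the exact Kaggle schema; the Jaccard function implements the definition given above.

```python
import numpy as np
import pandas as pd

# Illustrative market rows; column names are assumptions, not the exact
# Kaggle "Two Sigma" schema.
market = pd.DataFrame({
    "open":  [10.0, 20.0, 15.0, 21.0],
    "close": [11.0, 19.0, None, 25.0],   # one missing close price
})
# "price difference": (close - open) / open, then drop rows with NaN
market["price_diff"] = (market["close"] - market["open"]) / market["open"]
market = market.dropna()

def directional_change(x):
    # b[t] = 1 if x[t+1] > x[t], and 0 otherwise
    return (np.diff(x) > 0).astype(int)

def jaccard_index(bx, by):
    # C_XY: both series go up; C_0X: only Y goes up; C_Y0: only X goes up
    c_xy = int(np.sum((bx == 1) & (by == 1)))
    c_0x = int(np.sum((bx == 0) & (by == 1)))
    c_y0 = int(np.sum((bx == 1) & (by == 0)))
    return (c_0x + c_y0) / (c_xy + c_0x + c_y0)

bx = directional_change(np.array([1.0, 2.0, 1.0, 3.0]))  # -> [1, 0, 1]
by = directional_change(np.array([2.0, 1.0, 3.0, 4.0]))  # -> [0, 1, 1]
print(jaccard_index(bx, by))  # (1 + 1) / (1 + 1 + 1)
```

Note that this form counts mismatched movements in the numerator, so larger values indicate less co-movement between the two directional series.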
We create new features taking into consideration the relative importance of each indicator:

(i) sentiment_positive = (sentimentPositive × relevance) / (urgency × 0.2·noveltyCount12H × 0.1·noveltyCount24H)
(ii) sentiment_negative = (sentimentNegative × relevance) / (urgency × 0.2·noveltyCount12H × 0.1·noveltyCount24H)
(iii) sentiment_neutral = (sentimentNeutral × relevance) / (urgency × 0.2·noveltyCount12H × 0.1·noveltyCount24H)

The volume indicators are kept in absolute values. We merged the market and news data on dates and firm names. We used the Scikit-Learn Python library to scale the features. We run a stratified split of the data into train and test sets with a ratio of 80:15; thus we obtain the same ratio of 0 and 1 classes for the target variable in every set.

Fig. 2. The distribution of data

3 Methodology

Recall from the Introduction that our goal is a directional prediction of the 10-day forward rate of stock return. We use the following formal definition of the directional change: the direction of change z_k(t) = 1 if the rate increases, i.e. if y(t+1) − y(t) > 0; otherwise, z_k(t) = 0. A k-period (k = 1 in our set-up) forward prediction model is evaluated by its classification accuracy on out-of-sample observations, where classification accuracy is defined as the percentage of test cases for which the predicted direction of change ẑ_k(t) equals the true direction of change z_k(t).

This section elaborates on the baseline techniques used, the developed method, and the evaluation of the results.

3.1 Baseline Methods

We apply a number of extant classification methods to predict the directional movement of the 10-day forward returns. The following methods are used:3

Fixed-effects linear regression reveals the linear dependencies between the dependent and independent variables vis-à-vis each firm. The Python library Linearmodels4 includes fixed-effects panel regression models.
Our model may be summarized by the following equation:

ŷ = α + βX + δF + ε,

where α is an intercept, X is a vector of independent variables, F represents firms' fixed effects, and ε is an error term. We then compute the difference between the predicted and the previous value to determine the directional change.

Simple logistic regression calculates a weighted sum of the input variables (like linear regression), outputting the probability of each instance belonging to the positive class. Our set-up exploits all the available training instances to train the logistic regression, relaxing the firms' individual effects. The Python library Scikit-Learn (linear models)5 allows logit estimations.

Decision trees usually handle well linearities and non-linearities in the data.6 The method is versatile and simple to interpret, and we do not need to run feature scaling while training. Scikit-Learn provides us with a decision tree implementation. The method is, however, prone to be sensitive to data variation.

Random forest helps overcome the disadvantages of a single decision tree by summarizing and averaging predictions over a number of trees. It is an ensemble learning approach that uses the outputs of the individual predictors as votes: if the positive class gets more votes, the method returns the corresponding result. Again, Scikit-Learn includes random forest as a part of its ensemble methods.7

A shallow multilayer perceptron (MLP) has the ability to capture non-linearity between features. We use the following architecture to train the model: 16-300-1, where 16 is the number of input neurons and 300 the number of neurons in the hidden layer; since we have a binary classification problem, there is 1 final output. Adam is used for the model optimization.

3 We skip the random walk model since it has been previously implemented by the Kaggle competition founders to show its poor accuracy.
4 https://pypi.org/project/linearmodels/
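The scikit-learn baselines described above can be assembled in a few lines. This is a sketch on synthetic data: the 16 input features and the 300-neuron Adam-trained MLP follow the paper, while the target, sample size, and the remaining hyperparameters are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 16))            # 16 features, matching the MLP input size
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # synthetic directional target

models = {
    "logit":  LogisticRegression(max_iter=1000),
    "tree":   DecisionTreeClassifier(max_depth=5),       # illustrative depth
    "forest": RandomForestClassifier(n_estimators=100),  # illustrative size
    # 16-300-1 architecture trained with Adam, as in the paper
    "mlp":    MLPClassifier(hidden_layer_sizes=(300,), solver="adam",
                            max_iter=300, random_state=0),
}
train_acc = {name: m.fit(X, y).score(X, y) for name, m in models.items()}
```

In the actual experiments, directional accuracy would be measured on the held-out stratified test set rather than on the training data shown here.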
We use Randomized Search to tune the parameters for the decision trees, random forest, and MLP (e.g., it determines 512 as the optimal number of trees in the random forest classification).

3.2 Convolutional Neural Network

Convolutional neural networks (CNN) belong to a family of deep learning methods with empirically proven classification ability on large datasets. CNNs are capable of learning complex patterns thanks to the idea of receptive fields: each hidden neuron is connected not to all input neurons but to a corresponding local part of them. Moreover, CNNs can detect learned patterns anywhere in the input data, and they have fewer parameters than vanilla deep networks, which makes them less prone to overfitting. This motivated us to use a CNN architecture in our study. We define the structure of the CNN by trial and error. The architecture that provides the highest accuracy on the test set is described in Fig. 3.

5 https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
6 https://scikit-learn.org/stable/modules/tree.html
7 https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

Fig. 3. The CNN Architecture

4 Results and Conclusion

Table 1 describes the output accuracy of the predictors on the test sets. The fixed-effects linear regression model explains only a small part of the data variance, as R² is insignificant on the training set; thus, we do not need to verify it on the out-of-sample data.

As Table 1 shows, the prediction accuracy of the developed CNN is higher than those of the other methods. We also ran the same models on the market data only, without the news data: the results, even with the CNN architecture, are significantly poorer. This empirically proves that on our dataset, which comprises stock exchange market data over 10 years with over 4 mln observations, the news information is essential for forward market prediction.
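The receptive-field property of the CNN described in Section 3.2 (each output neuron sees only a local window of the input, with the same weights shared across windows) can be illustrated by a minimal 1-D convolution in plain NumPy. This is an illustration of the mechanism only, not the architecture of Fig. 3.

```python
import numpy as np

def conv1d(x, kernel):
    """Valid-mode 1-D cross-correlation: each output neuron is connected
    only to a local window of the input, and every window shares the
    same kernel weights."""
    k = len(kernel)
    return np.array([float(np.dot(x[i:i + k], kernel))
                     for i in range(len(x) - k + 1)])

signal = np.array([1.0, 2.0, 4.0, 7.0])
kernel = np.array([-1.0, 1.0])   # a single shared "difference" detector
print(conv1d(signal, kernel))    # neighbour differences: [1. 2. 3.]
```

Because the same two-weight kernel is applied at every position, the detector responds to its pattern anywhere in the series while using far fewer parameters than a fully connected layer of the same width.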
5 Further Research

The results contribute to market theory by proving that news data is significant for the prediction of stock exchange returns. We plan to extend the dataset with data from Google Trends, as we believe it has predictive significance. Moreover, we envisage using an LSTM with an attention mechanism in our future studies.

References

1. Cavalcante, R. C., Brasileiro, R. C., Souza, V. L., Nobrega, J. P., & Oliveira, A. L. (2016). Computational intelligence and financial markets: A survey and future directions. Expert Systems with Applications, 55, 194-211.
2. Chen, K., Zhou, Y., & Dai, F. (2015, October). A LSTM-based method for stock returns prediction: A case study of China stock market. In 2015 IEEE International Conference on Big Data (Big Data) (pp. 2823-2824). IEEE.
3. Engel, C. (2014). Exchange rates and interest parity. In Handbook of international economics (Vol. 4, pp. 453-522). Elsevier.
4. Huang, C. L., & Tsai, C. Y. (2009). A hybrid SOFM-SVR with a filter-based feature selection for stock market forecasting. Expert Systems with Applications, 36(2), 1529-1539.
5. Heaton, J. B., Polson, N. G., & Witte, J. H. (2016). Deep learning in finance. arXiv preprint arXiv:1602.06561.
6. Neely, C. J., & Weller, P. A. (2011). Technical analysis in the foreign exchange market. Federal Reserve Bank of St. Louis Working Paper No.
7. Patel, J., Shah, S., Thakkar, P., & Kotecha, K. (2015). Predicting stock market index using fusion of machine learning techniques. Expert Systems with Applications, 42(4), 2162-2172.