Bitcoin Price Prediction with Neural Networks

Kejsi Struga (kejsi.struga@fshnstudent.info), Olti Qirici (olti.qirici@fshn.edu.al)
University of Tirana

Abstract

In this work, we use the LSTM variant of Recurrent Neural Networks to predict the price of Bitcoin. In order to develop a better understanding of its price influencers and the general vision behind this innovation, we first give a brief perspective on Bitcoin and its economics. We then describe the dataset, which comprises data from stock market indices, sentiment, the blockchain, and Coinmarketcap¹. Building on this, we show the usage of the LSTM architecture with the aforementioned time series. To conclude, we outline the results of predicting the Bitcoin price 30 and 60 days ahead.

¹ Coinmarketcap takes the average of every exchange that trades Bitcoin.

1 Introduction

With the advent of Bitcoin 10 years ago, the world of economics, albeit on a small scale, has been experiencing a revolution. Bitcoin introduced itself as the system that solved the Double Spend problem [Nakamoto2008], a prevalent issue in earlier digital cash systems. Nevertheless, the impact during the coming years was greater: Distributed Ledger Technologies (DLT), smart contracts, cryptocurrencies, etc. all stemmed from this very "Bitcoin idea". This is attributed to its unique combination of decentralization and intuitive incentives. On the other end of the spectrum, with data being regarded as the oil of our time, along with the tremendous increases in hardware efficiency, machine learning is increasingly being utilized. As a result, we attempt to predict the Bitcoin price, despite the dynamic nature present not only in Bitcoin exchanges (Fig. 2), but in financial markets in general.

2 Bitcoin Price

Many endorse Bitcoin, while others are sceptical. Regardless, the price of Bitcoin is a topic discussed from economic, computer science, financial, and psychological perspectives. At the time of writing, the price seems to be stabilizing. In some way this may be expected, given that many investors are waiting to see regulations, the scaling problem is not seeing any improvement, and the first hype around 2017 is now over, so it is normal for a market to stabilize for some time. Nevertheless, Bitcoin has features such as:

• Tax free
• Unforgeable
• Borderless and unbound from distance
• Decentralized
• Verifiable and secure
• Normally, negligible transaction fees
• Cannot be counterfeited

It is therefore a plausible option for countries with developing economies and financial systems to improve their economic position while struggling to access the best technology of the time. However, for a country to have an aptitude towards this monetary system, a tech-savvy population must be present to adopt it, along with regulations from financial institutions that support it. Both of these are absent for the moment, but the future prospect is promising. Another reason why we believe Bitcoin should be studied is the fast-paced advancement of technology (Fig. 1), which favours Bitcoin, considering its qualities as a software system and a decentralized system that cannot be matched by banking systems or the like.
Figure 1: 5 tech companies together are worth more than 282 other companies, by market capitalization. Source: [Batnick2018]

While we will try to build a predictive model for the Bitcoin price, we are aware in advance that the price may differ greatly from any forecast because of factors internal and external to Bitcoin. By internal factors we mean factors inside the Bitcoin system itself (for instance a security breach). By external factors we refer to agents that indirectly influence the price of Bitcoin (exchange closures, competing cryptocurrencies, speculative markets, the widely held belief that over 80% of Bitcoins in circulation are concentrated in a limited number of investors, etc.).

In any case, we shall compare our results to other models built for cryptocurrency prediction. Let us not forget that in the first month of 2018 there were models predicting that Bitcoin would surpass 100,000.00 USD per Bitcoin by the end of the year, while we are barely reaching the 7,000.00 USD mark just 2 months before the end of the year.

2.1 Bitcoin Deflation

Bitcoin's supply is predetermined by design, and is represented by the geometric series

S_n = a(1 − r^n)/(1 − r) = 210000 × 50 × (1 − 0.5^n)/(1 − 0.5) ≈ 21 × 10^6   (1)

where a = 210000 × 50 is the number of coins created in the first reward era (50 BTC for each of 210000 blocks) and r = 0.5 is the halving factor.

As clearly stated in the Bitcoin wiki [BitcoinWiki2017], this decrease of supply resembles the rate at which commodities like gold are mined. This makes many consider Bitcoin deflationary, but the currency is practically infinitely divisible: not only is 1 BTC = 10^8 satoshis², so no one would run out of satoshis that fast, but the protocol could also be updated to make satoshis more divisible (add more decimal places). As a result, deflation does not have to occur.

² Satoshi is the smallest unit of the Bitcoin currency.

2.2 Bitcoin Inflation

Bitcoin is not debt based, and no artificial money can be issued. Additionally, because of the fixed supply mentioned above, no more Bitcoins than predicted can be created, unlike in today's economic system. Bitcoin's deflationary attribute stems from imitating gold, in that a currency must be scarce (or, in the case of BTC, have a finite supply); consequently, no one can increase the supply and inflate away the value of goods.

Nonetheless, Bitcoin also has its dark side, which sometimes makes users quite sceptical about its possession and usage. Several of these problems have been reported (and may not be limited to those) by Kaspersky Lab [Kaspersky2017]. Among them: blockchain nodes all do exactly the same work (no paralleling, no synergy, no mutual assistance), the storage used keeps growing (currently around 100 GB of the Bitcoin database for each high-grade Bitcoin network client), transaction confirmation needs 10-50 minutes, etc.

Figure 2: Bitcoin's steep price movements.

3 Data preprocessing

3.1 Data gathering

Daily data from four channels are considered, starting in 2013. First, the Bitcoin price history, which is extracted from Coinmarketcap through its open API. Secondly, data from the blockchain is gathered; in particular we choose the average block size, the number of user addresses, the number of transactions, and the miners' revenue. We found it somewhat counter-intuitive to include blockchain data, given the incessant scaling problem; on the other hand, the number of addresses is by definition related to price movements, since an increase in the number of addresses either means more transactions occurring (presumably for exchanging with different parties and not just transferring Bitcoins to another address), or is a sign of more users joining the network. Thirdly, for the sentiment data we obtain the Interest over Time for the word 'Bitcoin' using the PyTrends library. Lastly, two indices are considered, the S&P 500 and the Dow Jones; both are retrieved through the Yahoo Finance API. All in all, these make for 12 features.
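As a rough illustration of this gathering step, the sketch below pulls the Google Trends interest and the two stock indices. The paper only states that PyTrends and the Yahoo Finance API were used, so the yfinance package, the ticker symbols, and the exact date range are our assumptions, and the Coinmarketcap and blockchain channels are omitted.

# Hedged sketch of the sentiment and index channels from Section 3.1.
# Assumptions: the yfinance package as one route to the Yahoo Finance API,
# the tickers ^GSPC (S&P 500) and ^DJI (Dow Jones), and the 2013-2018 range.
from pytrends.request import TrendReq
import yfinance as yf

START, END = "2013-01-01", "2018-09-30"

# Google Trends "Interest over time" for the keyword 'Bitcoin' (PyTrends).
pytrends = TrendReq(hl="en-US", tz=0)
pytrends.build_payload(kw_list=["Bitcoin"], timeframe=f"{START} {END}")
trends = pytrends.interest_over_time()    # weekly for long ranges; resample to daily if needed

# S&P 500 and Dow Jones daily quotes via Yahoo Finance.
sp500 = yf.download("^GSPC", start=START, end=END)
dow = yf.download("^DJI", start=START, end=END)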
The Pearson correlation between the attributes is shown in Figure 3. Clearly, some attributes are not very correlated; for example, the financial indices are correlated with each other, but not with any of the Bitcoin-related attributes. We also see how Google Trends interest is related to Bitcoin transactions.

Figure 3: Pearson correlation; 1.0 means the highest correlation.

3.2 Data cleansing

From the exchange data we consider relevant only the Volume, Close, Open, and High prices, and the Market capitalization. For all datasets, if NaN values are found they are replaced with the mean of the respective attribute. After this, all datasets are merged into one along the time dimension. Judging from the Bitcoin price movements during the period from 2013 until 2014, we considered it best to discard data points before 2014, hence the data passed to the neural network spans 2014 until September 2018.

3.3 Data normalization

Deciding on a method for normalizing a time series, especially a financial one, is never easy. What is more, as a rule of thumb a neural network should not be fed data that take relatively large values, or data that are heterogeneous (referring to time series that have different scales, such as the exchange price versus Google Trends). Doing so can trigger large gradient updates that will prevent the network from converging. To make learning easier for the network, data should have the following characteristics [Geron2017]:

• Take small values: typically, most values should be in the 0-1 range.
• Be homogeneous: that is, all features should take values in roughly the same range.

The most common normalization methods used during data transformation include:

• Min-Max Scaling, where the data inputs are mapped to a number from 0 to 1:

  x' = (x − min(X)) / (max(X) − min(X))   (2)

• Mean Normalization, which makes the data take values between -1 and 1 with a mean of 0:

  x' = (x − mean(X)) / (max(X) − min(X))   (3)

• Z-Score (Standardization), where the features are redistributed to have a mean of 0 and a standard deviation of 1:

  x' = (x − mean(X)) / σ(X)   (4)

For our problem, we use Min-Max Scaling and adjust the features to a scale from 0 to 1, given that most of our time series have a peak; we might therefore argue that we know the maximum of each series, in which case Min-Max Scaling does a good job.
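A minimal sketch of the cleansing and scaling steps from Sections 3.2 and 3.3 follows; the random placeholder frame stands in for the merged 12-feature dataset, so shapes, dates, and column layout are illustrative only.

# Sketch of Sections 3.2-3.3: replace NaNs with the column mean, drop the
# pre-2014 rows, and map every feature to [0, 1] with scikit-learn's
# MinMaxScaler. The random frame below is a stand-in for the merged data.
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

dates = pd.date_range("2013-01-01", "2018-09-30", freq="D")
data = pd.DataFrame(np.random.rand(len(dates), 12), index=dates)
data.iloc[5, 3] = np.nan                      # an example gap to be cleaned

data = data.fillna(data.mean())               # NaN -> mean of the attribute (3.2)
data = data.loc["2014-01-01":]                # discard data points before 2014 (3.2)

scaler = MinMaxScaler(feature_range=(0, 1))   # x' = (x - min(X)) / (max(X) - min(X))
scaled = scaler.fit_transform(data.values)    # shape (n_days, 12), values in [0, 1]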
4 Machine Learning Pipeline

In this section we describe how to make time series data suitable for supervised machine learning. The price prediction is treated as regression rather than classification, and we show how LSTM can be used in such a setting. We then discuss the hyperparameters.

4.1 Software used

As the deep learning backend we choose Tensorflow, with Keras as the front-end layer for building neural networks quickly. Pandas is used extensively for data-related tasks, Numpy is utilized for matrix/vector operations and for storing the training and test datasets, and Scikit-learn (also known as sklearn) is used for performing the min-max normalization. Lastly, Plotly is used for displaying the charts.

4.2 Time series data

Normally a time series is a sequence of numbers along time. LSTM for sequence prediction acts as a supervised algorithm, unlike its autoencoder version. As such, the overall dataset should be split into inputs and outputs. Moreover, LSTM compares favourably with classical linear statistical models, since it can more easily handle multiple-input forecasting problems. In our approach, the LSTM uses previous data to predict the closing price 30 days ahead. First, we must decide how many previous days one forecast will have access to; this number we refer to as the window size. We have opted for 35 days in the case of the one-month prediction and 65 days for the two-month prediction, so the input dataset is a tensor comprising matrices of dimension 35x12 or 65x12 respectively, such that we have 12 features and 35 (or 65) rows in each window. The first window thus consists of rows 0 to 34 (Python is zero-indexed), the second of rows 1 to 35, and so on. Another reason for choosing this window length is that a small window leaves out patterns that may appear only in a longer sequence. The output data take into account not only the window size but also the prediction range, which in our case is 30 days: the output dataset starts from row 35 up until the end and is made of chunks of length 30. The prediction range also determines the output size of the LSTM network.

4.3 Split into training and test data

This step is one of the most important, especially in the case of Bitcoin. We first wanted to predict a whole year ahead, but this would mean that data from 1 January 2018 until September 2018 would be used for testing; the downside, of course, is the steep slope in 2017, which would make the neural network learn this pattern from the last inputs, and the prediction for 2018 would not be very sensible. Thus we use training data from 2014-01-01 until 2018-07-05, which leaves approximately two months for prediction; since we predict two months ahead, the dataset is split a bit earlier, at 2018-06-01, to leave room for those two months. Each training set and test set is composed of input and output features.

4.4 Turn data into tensors

LSTM expects the input to be given in the form of a 3-dimensional tensor of float values. A key feature of tensors is their shape, which in Python is a tuple of integers giving the size along each axis. For instance, in our Bitcoin data the shape of the training inputs is (1611, 35, 12): we have 1611 samples, a window size (timestep) of 35 values, and 12 features. Overall the idea is simple: we separate the data into chunks of 35 rows and push these small windows into a numpy array; each window is a 35x12 matrix, and together the windows form the tensor (a short numpy sketch is given after the list below). Furthermore, in an LSTM the input layer is by design specified via the input shape argument of the first hidden layer, which carries exactly these three dimensions of the input shape:

• Samples
• Window size
• Number of features
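To make the window construction concrete, here is a small numpy sketch of how the (samples, 35, 12) input tensor and the 30-day output chunks can be built; the column holding the closing price and the stand-in array are assumptions for illustration.

# Sketch of the windowing in Sections 4.2-4.4: 35 days of all 12 features
# predict the next 30 closing prices. Assumes the close price sits in
# column 0; `scaled` stands in for the normalized array from Section 3.3.
import numpy as np

WINDOW, HORIZON, CLOSE_COL = 35, 30, 0
scaled = np.random.rand(1700, 12)             # placeholder for the real data

def make_windows(data, window=WINDOW, horizon=HORIZON, target_col=CLOSE_COL):
    X, y = [], []
    for start in range(len(data) - window - horizon + 1):
        X.append(data[start:start + window])                                  # one 35x12 window
        y.append(data[start + window:start + window + horizon, target_col])   # next 30 closes
    return np.array(X), np.array(y)

X, y = make_windows(scaled)
print(X.shape, y.shape)                       # (samples, 35, 12) and (samples, 30)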
4.5 LSTM implementation

4.5.1 LSTM internals

A chief feature of feedforward networks is that they do not retain any memory: each input is processed independently, with no state being kept between inputs. Given that we are dealing with time series, where information from previous Bitcoin prices is needed, we should maintain some state in order to predict the future. An architecture providing this is the Recurrent Neural Network (RNN), which along with the output has a loop feeding back into itself, so the window we provide as input gets processed as a sequence rather than in a single step. However, when the time step (the size of the window) is large, which is often the case, the gradient gets too small or too large, leading to the phenomenon known as the vanishing or exploding gradient respectively [Chollet2017]. This problem occurs while the optimizer backpropagates, and it makes the algorithm run while the weights barely change at all. RNN variations mitigate the problem, namely LSTM and GRU (Fig. 4).

The LSTM layer adds cells that carry information across many timesteps (Fig. 4). The cell state is the horizontal line from C_{t−1} to C_t, and its importance lies in holding the long-term or short-term memory. The output of the LSTM is modulated by the state of these cells, which is important when predicting based on historic context rather than only on the last input. LSTM networks manage to remember inputs by making use of this loop, which is absent in plain feedforward networks. On the other hand, as more time passes, it becomes less likely that the next output depends on a very old input, so forgetting is also necessary. LSTM achieves this by learning when to remember and when to forget, through its forget gates. We list the gates briefly, so as not to treat the LSTM as just a black-box model [Olah2015]:

• Forget gate: f_t = σ(W_f S_{t−1} + W_f X_t)
• Input gate: i_t = σ(W_i S_{t−1} + W_i X_t)
• Output gate: o_t = σ(W_o S_{t−1} + W_o X_t)
• Intermediate cell state: C̃_t = tanh(W_c S_{t−1} + W_c X_t)
• Cell state (memory for the next input): c_t = (i_t ∗ C̃_t) + (f_t ∗ c_{t−1})
• New state: h_t = o_t ∗ tanh(c_t)

Figure 4: LSTM cell

As can be seen from the equations, each gate has its own set of weights. In the cell-state equation, the input gate modulates the intermediate cell state and the forget gate modulates the old cell state; the output of this operation is then used to calculate the new state h_t. This advanced cell, with four interacting layers instead of just the single tanh layer of a plain RNN, makes LSTM well suited for sequence prediction.
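Purely as an illustration of the equations above (the model itself relies on Keras' built-in LSTM layer), a single cell step can be written in numpy as follows; the concatenated-weight form and the omission of bias terms mirror the simplified notation used in the text, and all dimensions are example values.

# One LSTM timestep following the gate equations above. Each weight matrix
# acts on the concatenation of the previous state S_{t-1} and the input X_t;
# biases are omitted, as in the equations in the text.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, s_prev, c_prev, W_f, W_i, W_o, W_c):
    z = np.concatenate([s_prev, x_t])
    f_t = sigmoid(W_f @ z)                    # forget gate
    i_t = sigmoid(W_i @ z)                    # input gate
    o_t = sigmoid(W_o @ z)                    # output gate
    c_tilde = np.tanh(W_c @ z)                # intermediate cell state
    c_t = i_t * c_tilde + f_t * c_prev        # new cell state (memory for next input)
    h_t = o_t * np.tanh(c_t)                  # new state / output
    return h_t, c_t

hidden, features = 10, 12
rng = np.random.default_rng(0)
weights = [rng.standard_normal((hidden, hidden + features)) for _ in range(4)]
h, c = lstm_step(rng.standard_normal(features), np.zeros(hidden), np.zeros(hidden), *weights)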
4.6 Hyperparameters

4.6.1 Optimizer

While Stochastic Gradient Descent is used in many neural network problems, it can converge to a local minimum, which is of course a problem considering the Bitcoin price. Other good optimizers are variations of adaptive learning algorithms, such as Adam, Adagrad, and RMSProp. Adam was found to work slightly better than the rest, which is why we go with it. (All of these come packed with Keras.)

4.6.2 Loss function

The performance measure for regression problems is typically either RMSE (Root Mean Square Error) or MAE (Mean Absolute Error):

• RMSE(X, h) = sqrt( (1/n) Σ_{i=1}^{n} (h(x_i) − y_i)² )
• MAE(X, h) = (1/n) Σ_{i=1}^{n} |h(x_i) − y_i|

RMSE is generally used when the error distribution resembles a bell-shaped curve, but given the Bitcoin price spikes we chose MAE, since it deals better with outliers.

4.6.3 Activation function

The choice of activation function was not very difficult. The most popular are sigmoid, tanh, and ReLU. Sigmoid suffers from the vanishing gradient, so almost no signal flows through the neuron to its weights; moreover, it is not centered around zero, so the gradient may end up too high or too low. By contrast, tanh makes the output zero-centered, and in practice it is almost always preferred to sigmoid. ReLU is also widely used, and since it was invented later it is often assumed to be better; nevertheless, for predicting the Bitcoin price that was not the case, and we chose tanh due to its better results.

4.6.4 Dropout Rate

Regularization is a technique for constraining the weights of the network. While in simple neural networks l1 and l2 regularization are used, in multi-layer networks dropout regularization takes their place. It randomly sets some input units to 0 in order to prevent overfitting; hence its value represents the fraction of disabled neurons in the preceding layer and ranges from 0 to 1. We tried 0.25 and 0.3 and finally settled on 0.3.

4.6.5 Number of Neurons in hidden layers

We opted for 10 neurons in the hidden layer; having more neurons is costly, as the training process lasts longer, and trying a larger number did not give improved results.

4.6.6 Epochs

Rather arbitrarily, we decided on 100 epochs, after trying other values such as 50 and 20. As with the number of hidden-layer neurons, the more epochs, the more time training takes, since one epoch is a full iteration over the training data. It may also overfit the model.

4.6.7 Batch Size

We decided to feed the network with batches of 120 samples (again, this number is an educated guess).

4.6.8 Architecture of Network

We used the Sequential API of Keras rather than the functional one. The overall architecture is as follows, and a sketch of the corresponding Keras code is given after the list:

• 1 LSTM Layer: the LSTM layer is the inner one; all the gates mentioned above are already implemented by Keras, with a default recurrent activation of hard-sigmoid [Keras2015]. The LSTM parameters are the number of neurons and the input shape discussed above.
• 1 Dropout Layer: typically this is used before the Dense layer. In Keras, dropout can be added after any hidden layer; in our case it follows the LSTM.
• 1 Dense Layer: this is the regular fully connected layer.
• 1 Activation Layer: because we are solving a regression problem, the last layer should give the linear combination of the activations of the previous layer with the weight vectors, therefore this activation is a linear one. Alternatively, it could be passed as a parameter to the previous Dense layer.
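As referenced in the list above, the sketch below assembles the described stack with the Keras Sequential API, using the hyperparameters stated in this section (10 LSTM neurons with tanh, dropout 0.3, a Dense layer sized to the 30-day prediction range, a linear activation, Adam, MAE, 100 epochs, batch size 120); anything beyond those stated values, such as the validation split, is an assumption.

# Sketch of the architecture from Section 4.6.8 with the hyperparameters of
# Section 4.6; X_train and y_train are the tensors built in Section 4.4.
from keras.models import Sequential
from keras.layers import LSTM, Dropout, Dense, Activation

WINDOW, N_FEATURES, HORIZON = 35, 12, 30

model = Sequential()
model.add(LSTM(10, activation="tanh", input_shape=(WINDOW, N_FEATURES)))  # recurrent activation defaults to hard-sigmoid
model.add(Dropout(0.3))            # 30% of the preceding layer's units disabled
model.add(Dense(HORIZON))          # one output per predicted day
model.add(Activation("linear"))    # linear output for regression

model.compile(optimizer="adam", loss="mae")
# model.fit(X_train, y_train, epochs=100, batch_size=120, validation_split=0.1)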
5 Results and Analysis

In this section we show the results of our LSTM model. It was noted during training that the higher the batch size (200) (Fig. 7, 8), the worse the prediction on the test set. This is no surprise, since with more training the model becomes more prone to overfitting. While it is difficult to predict the price of Bitcoin, we see that the features are critical to the algorithm; future work includes trying out the Gated Recurrent Unit (GRU) variant of RNN, as well as further tuning of the existing hyperparameters. Below we show the loss from the Mean Absolute Error function when using the model to predict the training and test data.

Figure 5: Error loss during training

6 Conclusions

All in all, predicting a price-related variable is difficult given the multitude of forces impacting the market, added to the fact that prices depend to a large extent on future prospects rather than on historic data. However, using deep neural networks has provided us with a better understanding of Bitcoin and of the LSTM architecture. Work in progress includes implementing hyperparameter tuning, in order to arrive at a more accurate network architecture. Other features can also be considered, although in our experiments with Bitcoin more features have not always led to better results; microeconomic factors might be included in the model for a better predictive result. In any case, the data we gathered for Bitcoin, even though collected over several years, may have become truly informative, producing a meaningful historic interpretation, only in the last couple of years. Furthermore, a breakthrough evolution in peer-to-peer transactions is ongoing and is transforming the landscape of payment services. While it seems not all doubts have been settled, the time might be perfect to act; we think it is difficult to give a mature judgement on the future of Bitcoin.

References

[Nakamoto2008] S. Nakamoto. Bitcoin: A Peer-to-Peer Electronic Cash System. 2008.

[BitcoinWiki2017] Bitcoin Wiki. Controlled supply. https://en.bitcoin.it/wiki/Controlled_supply

[Chollet2017] F. Chollet. Deep Learning with Python. Manning Publications. https://www.manning.com/books/deep-learning-with-python

[Batnick2018] M. Batnick. The market cap of the top 5 S&P 500 companies. https://theirrelevantinvestor.com/2018/07/19/pareto/

[Geron2017] A. Géron. Hands-On Machine Learning with Scikit-Learn and TensorFlow. O'Reilly Media. https://www.oreilly.com/library/view/hands-on-machine-learning/9781491962282/

[Olah2015] C. Olah. Understanding LSTMs. https://colah.github.io/posts/2015-08-Understanding-LSTMs/

[Kaspersky2017] Kaspersky Lab Daily. Six Myths about blockchain and Bitcoin: Debunking the effectiveness of the technology. https://www.kaspersky.com/blog/bitcoin-blockchain-issues/18019/

[Keras2015] Keras documentation. https://keras.io/getting-started

Results Visualization

Figure 6: Bitcoin Prediction on Training Set

Figure 7: Bitcoin Prediction on Test Set, Batch Size 100

Figure 8: Bitcoin Prediction on Test Set, with a batch size of 200. Loss is greater than with batch 50.

Figure 9: Bitcoin Prediction on Test Set, 60 days prediction