Bitcoin Price Prediction with Neural Networks

Kejsi Struga (kejsi.struga@fshnstudent.info), Olti Qirici (olti.qirici@fshn.edu.al)
University of Tirana

Abstract

In this work, we use the LSTM variant of Recurrent Neural Networks to predict the price of Bitcoin. In order to develop a better understanding of its price influencers and the general vision behind this innovation, we first give a brief perspective on Bitcoin and its economics. We then describe the dataset, which comprises data from stock market indices, sentiment, the blockchain, and Coinmarketcap¹. Building on this, we show the usage of the LSTM architecture with the aforementioned time series. To conclude, we outline the results of predicting the Bitcoin price 30 and 60 days ahead.

¹ Coinmarketcap takes the average of every exchange that trades Bitcoin.

1 Introduction

With the advent of Bitcoin 10 years ago, the world of economics, albeit on a small scale, has been experiencing a revolution. Bitcoin introduced itself as the system that solved the Double Spend problem [Nakamoto2008], a prevalent issue in earlier digital cash systems. Nevertheless, the impact during the coming years was greater: Distributed Ledger Technologies (DLT), smart contracts, cryptocurrencies, etc. all stemmed from this very "Bitcoin idea". This is attributed to its unique combination of decentralization and intuitive incentives. On the other end of the spectrum, with data being regarded as the oil of our time, along with the tremendous increases in hardware efficiency, machine learning is increasingly being utilized. As a result, we attempt to predict the Bitcoin price, despite the dynamic nature present not only in Bitcoin exchanges (Fig. 2), but in financial markets in general.

2 Bitcoin Price

Many endorse Bitcoin, while others are sceptical. Regardless, the price of Bitcoin is a topic discussed from economic, computer science, financial, and psychological perspectives. At the time of writing, the price seems to be stabilizing. In some way this may be expected, given that many investors are waiting to see regulations, the scaling problem is not seeing any improvement, and the first hype around 2017 is now over, so it is normal for a market to stabilize for some time. Nevertheless, Bitcoin has features such as:

• Tax free
• Unforgeable
• Borderless and unbound from distance
• Decentralized
• Verifiable and secure
• Normally, negligible transaction fees
• Cannot be counterfeited

It is therefore a plausible option for countries with developing economies and financial systems to improve their economic position while struggling to access the best technology of the time. However, for a country to have an aptitude towards this monetary system, a tech-savvy population must be present to adopt it, along with regulations from financial institutions that support it. Both of these are absent for the moment, but the future prospect is promising. Another reason why we believe Bitcoin should be studied is the fast-paced advancement of technology (Fig. 1), which favours Bitcoin, considering its qualities as a software system and a decentralized system that cannot be matched by banking systems or the like.
Figure 1: 5 tech companies together are worth more than 282 other companies, by market capitalization. Source: [Batnick2018]

While we will try to build a predictive model for the Bitcoin price, we are aware in advance that the price may differ greatly from any forecast because of factors internal and external to Bitcoin. By internal factors we mean factors inside the Bitcoin system itself (for instance a security breach). By external factors we refer to agents that indirectly influence the price of Bitcoin (exchange closures, competing cryptocurrencies, speculative markets, the widely held belief that over 80% of Bitcoins in circulation are concentrated in a limited number of investors, etc.).

In any case, we shall compare our results to other models built for cryptocurrency prediction. Let us not forget that in the first month of 2018 there were models predicting that Bitcoin would surpass 100,000.00 USD per Bitcoin by the end of the year, while we are barely reaching the 7,000.00 USD mark just 2 months before the end of the year.

2.1 Bitcoin Deflation

Bitcoin's supply is predetermined by design, and is represented by the geometric series

S_n = a(1 − r^n)/(1 − r) = 210000 × 50 × (1 − 0.5^n)/(1 − 0.5) ≈ 21 × 10^6   (1)

where a = 210000 × 50 is the number of coins created in the first reward era (50 BTC for each of 210000 blocks) and r = 0.5 is the halving factor.

As clearly stated in the Bitcoin wiki [BitcoinWiki2017], this decrease of supply resembles the rate at which commodities like gold are mined. This makes many consider Bitcoin deflationary, but the currency is practically infinitely divisible: not only is 1 BTC = 10^8 satoshis², so no one would run out of satoshis that fast, but the protocol could also be updated to make satoshis more divisible (add more decimal places). As a result, deflation does not have to occur.

² Satoshi is the smallest unit of the Bitcoin currency.

2.2 Bitcoin Inflation

Bitcoin is not debt based, and no artificial money can be issued. Additionally, because of the fixed supply mentioned above, no more Bitcoins than predicted can be created, unlike in today's economic system. Bitcoin's deflationary attribute stems from imitating gold, in that a currency must be scarce (or, in the case of BTC, have a finite supply); consequently, no one can increase the supply and inflate away the value of goods.

Nonetheless, Bitcoin also has its dark side, which sometimes makes users quite sceptical about its possession and usage. Several of these problems have been reported (and may not be limited to those) by Kaspersky Lab [Kaspersky2017]. Among them: blockchain nodes all do exactly the same work (no paralleling, no synergy, no mutual assistance), the storage used keeps growing (currently around 100 GB of the Bitcoin database for each high-grade Bitcoin network client), transaction confirmation needs 10-50 minutes, etc.

Figure 2: Bitcoin's steep price movements.

3 Data preprocessing

3.1 Data gathering

Daily data from four channels are considered, starting in 2013. First, the Bitcoin price history, which is extracted from Coinmarketcap through its open API. Secondly, data from the blockchain is gathered; in particular we choose the average block size, the number of user addresses, the number of transactions, and the miners' revenue. We found it somewhat counter-intuitive to include blockchain data, given the incessant scaling problem; on the other hand, the number of addresses is by definition related to price movements, since an increase in the number of addresses either means more transactions occurring (presumably for exchanging with different parties and not just transferring Bitcoins to another address), or is a sign of more users joining the network. Thirdly, for the sentiment data we obtain the Interest over Time for the word 'Bitcoin' using the PyTrends library. Lastly, two indices are considered, the S&P 500 and the Dow Jones; both are retrieved through the Yahoo Finance API. All in all, these make for 12 features.
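As a rough illustration of this gathering step, the sketch below pulls the Google Trends interest and the two stock indices. The paper only states that PyTrends and the Yahoo Finance API were used, so the yfinance package, the ticker symbols, and the exact date range are our assumptions, and the Coinmarketcap and blockchain channels are omitted.

# Hedged sketch of the sentiment and index channels from Section 3.1.
# Assumptions: the yfinance package as one route to the Yahoo Finance API,
# the tickers ^GSPC (S&P 500) and ^DJI (Dow Jones), and the 2013-2018 range.
from pytrends.request import TrendReq
import yfinance as yf

START, END = "2013-01-01", "2018-09-30"

# Google Trends "Interest over time" for the keyword 'Bitcoin' (PyTrends).
pytrends = TrendReq(hl="en-US", tz=0)
pytrends.build_payload(kw_list=["Bitcoin"], timeframe=f"{START} {END}")
trends = pytrends.interest_over_time()    # weekly for long ranges; resample to daily if needed

# S&P 500 and Dow Jones daily quotes via Yahoo Finance.
sp500 = yf.download("^GSPC", start=START, end=END)
dow = yf.download("^DJI", start=START, end=END)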
The Pearson correlation between the attributes is shown in Figure 3. Clearly, some attributes are not very correlated; for example, the financial indices are correlated with each other, but not with any of the Bitcoin-related attributes. We also see how Google Trends interest is related to Bitcoin transactions.

Figure 3: Pearson correlation; 1.0 means the highest correlation.

3.2 Data cleansing

From the exchange data we consider relevant only the Volume, Close, Open, and High prices, and the Market capitalization. For all datasets, if NaN values are found they are replaced with the mean of the respective attribute. After this, all datasets are merged into one along the time dimension. Judging from the Bitcoin price movements during the period from 2013 until 2014, we considered it best to discard data points before 2014, hence the data passed to the neural network spans 2014 until September 2018.

3.3 Data normalization

Deciding on a method for normalizing a time series, especially a financial one, is never easy. What is more, as a rule of thumb a neural network should not be fed data that take relatively large values, or data that are heterogeneous (referring to time series that have different scales, such as the exchange price versus Google Trends). Doing so can trigger large gradient updates that will prevent the network from converging. To make learning easier for the network, data should have the following characteristics [Geron2017]:

• Take small values: typically, most values should be in the 0-1 range.
• Be homogeneous: that is, all features should take values in roughly the same range.

The most common normalization methods used during data transformation include:

• Min-Max Scaling, where the data inputs are mapped to a number from 0 to 1:

  x' = (x − min(X)) / (max(X) − min(X))   (2)

• Mean Normalization, which makes the data take values between -1 and 1 with a mean of 0:

  x' = (x − mean(X)) / (max(X) − min(X))   (3)

• Z-Score (Standardization), where the features are redistributed to have a mean of 0 and a standard deviation of 1:

  x' = (x − mean(X)) / σ(X)   (4)

For our problem, we use Min-Max Scaling and adjust the features to a scale from 0 to 1, given that most of our time series have a peak; we might therefore argue that we know the maximum of each series, in which case Min-Max Scaling does a good job.
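A minimal sketch of the cleansing and scaling steps from Sections 3.2 and 3.3 follows; the random placeholder frame stands in for the merged 12-feature dataset, so shapes, dates, and column layout are illustrative only.

# Sketch of Sections 3.2-3.3: replace NaNs with the column mean, drop the
# pre-2014 rows, and map every feature to [0, 1] with scikit-learn's
# MinMaxScaler. The random frame below is a stand-in for the merged data.
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

dates = pd.date_range("2013-01-01", "2018-09-30", freq="D")
data = pd.DataFrame(np.random.rand(len(dates), 12), index=dates)
data.iloc[5, 3] = np.nan                      # an example gap to be cleaned

data = data.fillna(data.mean())               # NaN -> mean of the attribute (3.2)
data = data.loc["2014-01-01":]                # discard data points before 2014 (3.2)

scaler = MinMaxScaler(feature_range=(0, 1))   # x' = (x - min(X)) / (max(X) - min(X))
scaled = scaler.fit_transform(data.values)    # shape (n_days, 12), values in [0, 1]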
4 Machine Learning Pipeline

In this section we describe how to make time series data suitable for supervised machine learning. The price prediction is treated as regression rather than classification, and we show how LSTM can be used in such a setting. We then discuss the hyperparameters.

4.1 Software used

As the deep learning backend we choose Tensorflow, with Keras as the front-end layer for building neural networks quickly. Pandas is used extensively for data-related tasks, Numpy is utilized for matrix/vector operations and for storing the training and test datasets, and Scikit-learn (also known as sklearn) is used for performing the min-max normalization. Lastly, Plotly is used for displaying the charts.

4.2 Time series data

Normally a time series is a sequence of numbers along time. LSTM for sequence prediction acts as a supervised algorithm, unlike its autoencoder version. As such, the overall dataset should be split into inputs and outputs. Moreover, LSTM compares favourably with classical linear statistical models, since it can more easily handle multiple-input forecasting problems. In our approach, the LSTM uses previous data to predict the closing price 30 days ahead. First, we must decide how many previous days one forecast will have access to; this number we refer to as the window size. We have opted for 35 days in the case of the one-month prediction and 65 days for the two-month prediction, so the input dataset is a tensor comprising matrices of dimension 35x12 or 65x12 respectively, such that we have 12 features and 35 (or 65) rows in each window. The first window thus consists of rows 0 to 34 (Python is zero-indexed), the second of rows 1 to 35, and so on. Another reason for choosing this window length is that a small window leaves out patterns that may appear only in a longer sequence. The output data take into account not only the window size but also the prediction range, which in our case is 30 days: the output dataset starts from row 35 up until the end and is made of chunks of length 30. The prediction range also determines the output size of the LSTM network.

4.3 Split into training and test data

This step is one of the most important, especially in the case of Bitcoin. We first wanted to predict a whole year ahead, but this would mean that data from 1 January 2018 until September 2018 would be used for testing; the downside, of course, is the steep slope in 2017, which would make the neural network learn this pattern from the last inputs, and the prediction for 2018 would not be very sensible. Thus we use training data from 2014-01-01 until 2018-07-05, which leaves approximately two months for prediction; since we predict two months ahead, the dataset is split a bit earlier, at 2018-06-01, to leave room for those two months. Each training set and test set is composed of input and output features.

4.4 Turn data into tensors

LSTM expects the input to be given in the form of a 3-dimensional tensor of float values. A key feature of tensors is their shape, which in Python is a tuple of integers giving the size along each axis. For instance, in our Bitcoin data the shape of the training inputs is (1611, 35, 12): we have 1611 samples, a window size (timestep) of 35 values, and 12 features. Overall the idea is simple: we separate the data into chunks of 35 rows and push these small windows into a numpy array; each window is a 35x12 matrix, and together the windows form the tensor (a short numpy sketch is given after the list below). Furthermore, in an LSTM the input layer is by design specified via the input shape argument of the first hidden layer, which carries exactly these three dimensions of the input shape:

• Samples
• Window size
• Number of features
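To make the window construction concrete, here is a small numpy sketch of how the (samples, 35, 12) input tensor and the 30-day output chunks can be built; the column holding the closing price and the stand-in array are assumptions for illustration.

# Sketch of the windowing in Sections 4.2-4.4: 35 days of all 12 features
# predict the next 30 closing prices. Assumes the close price sits in
# column 0; `scaled` stands in for the normalized array from Section 3.3.
import numpy as np

WINDOW, HORIZON, CLOSE_COL = 35, 30, 0
scaled = np.random.rand(1700, 12)             # placeholder for the real data

def make_windows(data, window=WINDOW, horizon=HORIZON, target_col=CLOSE_COL):
    X, y = [], []
    for start in range(len(data) - window - horizon + 1):
        X.append(data[start:start + window])                                  # one 35x12 window
        y.append(data[start + window:start + window + horizon, target_col])   # next 30 closes
    return np.array(X), np.array(y)

X, y = make_windows(scaled)
print(X.shape, y.shape)                       # (samples, 35, 12) and (samples, 30)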
4.5 LSTM implementation

4.5.1 LSTM internals

A chief feature of feedforward networks is that they do not retain any memory: each input is processed independently, with no state being kept between inputs. Given that we are dealing with time series, where information from previous Bitcoin prices is needed, we should maintain some state in order to predict the future. An architecture providing this is the Recurrent Neural Network (RNN), which along with the output has a loop feeding back into itself, so the window we provide as input gets processed as a sequence rather than in a single step. However, when the time step (the size of the window) is large, which is often the case, the gradient gets too small or too large, leading to the phenomenon known as the vanishing or exploding gradient respectively [Chollet2017]. This problem occurs while the optimizer backpropagates, and it makes the algorithm run while the weights barely change at all. RNN variations mitigate the problem, namely LSTM and GRU (Fig. 4).

The LSTM layer adds cells that carry information across many timesteps (Fig. 4). The cell state is the horizontal line from C_{t−1} to C_t, and its importance lies in holding the long-term or short-term memory. The output of the LSTM is modulated by the state of these cells, which is important when predicting based on historic context rather than only on the last input. LSTM networks manage to remember inputs by making use of this loop, which is absent in plain feedforward networks. On the other hand, as more time passes, it becomes less likely that the next output depends on a very old input, so forgetting is also necessary. LSTM achieves this by learning when to remember and when to forget, through its forget gates. We list the gates briefly, so as not to treat the LSTM as just a black-box model [Olah2015]:

• Forget gate: f_t = σ(W_f S_{t−1} + W_f X_t)
• Input gate: i_t = σ(W_i S_{t−1} + W_i X_t)
• Output gate: o_t = σ(W_o S_{t−1} + W_o X_t)
• Intermediate cell state: C̃_t = tanh(W_c S_{t−1} + W_c X_t)
• Cell state (memory for the next input): c_t = (i_t ∗ C̃_t) + (f_t ∗ c_{t−1})
• New state: h_t = o_t ∗ tanh(c_t)

Figure 4: LSTM cell

As can be seen from the equations, each gate has its own set of weights. In the cell-state equation, the input gate modulates the intermediate cell state and the forget gate modulates the old cell state; the output of this operation is then used to calculate the new state h_t. This advanced cell, with four interacting layers instead of just the single tanh layer of a plain RNN, makes LSTM well suited for sequence prediction.
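Purely as an illustration of the equations above (the model itself relies on Keras' built-in LSTM layer), a single cell step can be written in numpy as follows; the concatenated-weight form and the omission of bias terms mirror the simplified notation used in the text, and all dimensions are example values.

# One LSTM timestep following the gate equations above. Each weight matrix
# acts on the concatenation of the previous state S_{t-1} and the input X_t;
# biases are omitted, as in the equations in the text.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, s_prev, c_prev, W_f, W_i, W_o, W_c):
    z = np.concatenate([s_prev, x_t])
    f_t = sigmoid(W_f @ z)                    # forget gate
    i_t = sigmoid(W_i @ z)                    # input gate
    o_t = sigmoid(W_o @ z)                    # output gate
    c_tilde = np.tanh(W_c @ z)                # intermediate cell state
    c_t = i_t * c_tilde + f_t * c_prev        # new cell state (memory for next input)
    h_t = o_t * np.tanh(c_t)                  # new state / output
    return h_t, c_t

hidden, features = 10, 12
rng = np.random.default_rng(0)
weights = [rng.standard_normal((hidden, hidden + features)) for _ in range(4)]
h, c = lstm_step(rng.standard_normal(features), np.zeros(hidden), np.zeros(hidden), *weights)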
4.6 Hyperparameters

4.6.1 Optimizer

While Stochastic Gradient Descent is used in many neural network problems, it can converge to a local minimum, which is of course a problem considering the Bitcoin price. Other good optimizers are variations of adaptive learning algorithms, such as Adam, Adagrad, and RMSProp. Adam was found to work slightly better than the rest, which is why we go with it. (All of these come packed with Keras.)

4.6.2 Loss function

The performance measure for regression problems is typically either RMSE (Root Mean Square Error) or MAE (Mean Absolute Error):

• RMSE(X, h) = sqrt( (1/n) Σ_{i=1}^{n} (h(x_i) − y_i)² )
• MAE(X, h) = (1/n) Σ_{i=1}^{n} |h(x_i) − y_i|

RMSE is generally used when the error distribution resembles a bell-shaped curve, but given the Bitcoin price spikes we chose MAE, since it deals better with outliers.

4.6.3 Activation function

The choice of activation function was not very difficult. The most popular are sigmoid, tanh, and ReLU. Sigmoid suffers from the vanishing gradient, so almost no signal flows through the neuron to its weights; moreover, it is not centered around zero, so the gradient may end up too high or too low. By contrast, tanh makes the output zero-centered, and in practice it is almost always preferred to sigmoid. ReLU is also widely used, and since it was invented later it is often assumed to be better; nevertheless, for predicting the Bitcoin price that was not the case, and we chose tanh due to its better results.

4.6.4 Dropout Rate

Regularization is a technique for constraining the weights of the network. While in simple neural networks l1 and l2 regularization are used, in multi-layer networks dropout regularization takes their place. It randomly sets some input units to 0 in order to prevent overfitting; hence its value represents the fraction of disabled neurons in the preceding layer and ranges from 0 to 1. We tried 0.25 and 0.3 and finally settled on 0.3.

4.6.5 Number of Neurons in hidden layers

We opted for 10 neurons in the hidden layer; having more neurons is costly, as the training process lasts longer, and trying a larger number did not give improved results.

4.6.6 Epochs

Rather arbitrarily, we decided on 100 epochs, after trying other values such as 50 and 20. As with the number of hidden-layer neurons, the more epochs, the more time training takes, since one epoch is a full iteration over the training data. It may also overfit the model.

4.6.7 Batch Size

We decided to feed the network with batches of 120 samples (again, this number is an educated guess).

4.6.8 Architecture of Network

We used the Sequential API of Keras rather than the functional one. The overall architecture is as follows, and a sketch of the corresponding Keras code is given after the list:

• 1 LSTM Layer: the LSTM layer is the inner one; all the gates mentioned above are already implemented by Keras, with a default recurrent activation of hard-sigmoid [Keras2015]. The LSTM parameters are the number of neurons and the input shape discussed above.
• 1 Dropout Layer: typically this is used before the Dense layer. In Keras, dropout can be added after any hidden layer; in our case it follows the LSTM.
• 1 Dense Layer: this is the regular fully connected layer.
• 1 Activation Layer: because we are solving a regression problem, the last layer should give the linear combination of the activations of the previous layer with the weight vectors, therefore this activation is a linear one. Alternatively, it could be passed as a parameter to the previous Dense layer.
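As referenced in the list above, the sketch below assembles the described stack with the Keras Sequential API, using the hyperparameters stated in this section (10 LSTM neurons with tanh, dropout 0.3, a Dense layer sized to the 30-day prediction range, a linear activation, Adam, MAE, 100 epochs, batch size 120); anything beyond those stated values, such as the validation split, is an assumption.

# Sketch of the architecture from Section 4.6.8 with the hyperparameters of
# Section 4.6; X_train and y_train are the tensors built in Section 4.4.
from keras.models import Sequential
from keras.layers import LSTM, Dropout, Dense, Activation

WINDOW, N_FEATURES, HORIZON = 35, 12, 30

model = Sequential()
model.add(LSTM(10, activation="tanh", input_shape=(WINDOW, N_FEATURES)))  # recurrent activation defaults to hard-sigmoid
model.add(Dropout(0.3))            # 30% of the preceding layer's units disabled
model.add(Dense(HORIZON))          # one output per predicted day
model.add(Activation("linear"))    # linear output for regression

model.compile(optimizer="adam", loss="mae")
# model.fit(X_train, y_train, epochs=100, batch_size=120, validation_split=0.1)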
5 Results and Analysis

In this section we show the results of our LSTM model. It was noted during training that the higher the batch size (200) (Fig. 7, 8), the worse the prediction on the test set. This is no surprise, since with more training the model becomes more prone to overfitting. While it is difficult to predict the price of Bitcoin, we see that the features are critical to the algorithm; future work includes trying out the Gated Recurrent Unit (GRU) variant of RNN, as well as further tuning of the existing hyperparameters. Below we show the loss from the Mean Absolute Error function when using the model to predict the training and test data.

Figure 5: Error loss during training

6 Conclusions

All in all, predicting a price-related variable is difficult given the multitude of forces impacting the market, added to the fact that prices depend to a large extent on future prospects rather than on historic data. However, using deep neural networks has provided us with a better understanding of Bitcoin and of the LSTM architecture. Work in progress includes implementing hyperparameter tuning, in order to arrive at a more accurate network architecture. Other features can also be considered, although in our experiments with Bitcoin more features have not always led to better results; microeconomic factors might be included in the model for a better predictive result. In any case, the data we gathered for Bitcoin, even though collected over several years, may have become truly informative, producing a meaningful historic interpretation, only in the last couple of years. Furthermore, a breakthrough evolution in peer-to-peer transactions is ongoing and is transforming the landscape of payment services. While it seems not all doubts have been settled, the time might be perfect to act; we think it is difficult to give a mature judgement on the future of Bitcoin.

References

[Nakamoto2008] S. Nakamoto. Bitcoin: A Peer-to-Peer Electronic Cash System. 2008.

[BitcoinWiki2017] Bitcoin Wiki. Controlled supply. https://en.bitcoin.it/wiki/Controlled_supply

[Chollet2017] F. Chollet. Deep Learning with Python. Manning Publications. https://www.manning.com/books/deep-learning-with-python

[Batnick2018] M. Batnick. The market cap of the top 5 S&P 500 companies. https://theirrelevantinvestor.com/2018/07/19/pareto/

[Geron2017] A. Géron. Hands-On Machine Learning with Scikit-Learn and TensorFlow. O'Reilly Media. https://www.oreilly.com/library/view/hands-on-machine-learning/9781491962282/

[Olah2015] C. Olah. Understanding LSTMs. https://colah.github.io/posts/2015-08-Understanding-LSTMs/

[Kaspersky2017] Kaspersky Lab Daily. Six Myths about blockchain and Bitcoin: Debunking the effectiveness of the technology. https://www.kaspersky.com/blog/bitcoin-blockchain-issues/18019/

[Keras2015] Keras documentation. https://keras.io/getting-started

Results Visualization

Figure 6: Bitcoin Prediction on Training Set

Figure 7: Bitcoin Prediction on Test Set, Batch Size 100

Figure 8: Bitcoin Prediction on Test Set, with a batch size of 200. Loss is greater than with batch 50.

Figure 9: Bitcoin Prediction on Test Set, 60 days prediction