=Paper= {{Paper |id=Vol-1565/bmaw2015_paper2 |storemode=property |title=Bayesian Optimisation of Gated Bayesian Networks for Algorithmic Trading |pdfUrl=https://ceur-ws.org/Vol-1565/bmaw2015_paper2.pdf |volume=Vol-1565 |authors=Marcus Bendtsen |dblpUrl=https://dblp.org/rec/conf/uai/Bendtsen15 }} ==Bayesian Optimisation of Gated Bayesian Networks for Algorithmic Trading== https://ceur-ws.org/Vol-1565/bmaw2015_paper2.pdf
    Bayesian Optimisation of Gated Bayesian Networks for Algorithmic Trading



                                                    Marcus Bendtsen
                                                marcus.bendtsen@liu.se
                                     Department of Computer and Information Science
                                             Linköping University, Sweden



Abstract

Gated Bayesian networks (GBNs) are an extension of Bayesian networks that aim to model systems that have distinct phases. In this paper, we aim to use GBNs to output buy and sell decisions for use in algorithmic trading systems. These systems may have several parameters that require tuning, and assessing the performance of these systems as a function of their parameters cannot be expressed in closed form, and thus requires simulation. Bayesian optimisation has grown in popularity as a means of global optimisation of parameters where the objective function may be costly or a black box. We show how algorithmic trading using GBNs, supported by Bayesian optimisation, can lower risk towards invested capital, while at the same time generating similar or better rewards, compared to the benchmark investment strategy buy-and-hold.

1    INTRODUCTION

Algorithmic trading can be viewed as a process of actively deciding when to own assets and when to not own assets, so as to get better risk and reward on invested capital, compared to holding the assets over a long period of time. On the other end of the spectrum is the buy-and-hold strategy, where one owns assets continuously over a period of time without making any decisions of selling or buying. An algorithmic trading system consists of several components, some of which may be automated by a computer, and others that may be manually executed [1, 2, 3]. At the heart of an algorithmic trading system are the alpha models. They are responsible for outputting decisions for buying and selling assets based on the data they are given. These decisions are commonly referred to as signals. The data which is supplied to the alpha models varies greatly, e.g. potential prospects, sentiment analysis, previous trades, or technical analysis, which will be the focus of the included application. If the signals are followed, then they give rise to certain risk and reward on the initial investment, which will be described further in Section 3.2. Further down the line in algorithmic trading systems are components that combine signals from several alpha models, and other so called risk models, to combine a portfolio of assets. We will not address these later components in this paper; our focus will be on the alpha models.

In Figure 1, the price of an asset is plotted along with buy signals (upward arrows) and sell signals (downward arrows). We view the time spent between these signals as two different phases. Before a buy signal, our intention is to have a model that identifies good opportunities to buy the asset. Once such an opportunity has been identified and a buy signal has been generated, we move into a different phase. In this second phase, we intend to model the identification of good opportunities to sell the asset. Once a sell signal is generated, we move back to the original phase, once again using a model to generate buy signals. This particular situation was the main motivation for the introduction of gated Bayesian networks (GBNs) [4, 5, 6], which we will describe in Section 2.

[Figure 1: Buy and Sell Signals. Axis labels: Price, Equity curve; dates from Dec 31 2007 to Dec 29 2008.]

Alpha models normally take a set of parameters, allowing them to be tuned to the input data. Naturally, two different sets of parameters may yield two different sets of signals. Therefore, it is imperative to assess how good a set of signals is, so that different parameter sets may be compared.
This is usually done by backtesting, a type of simulation that calculates certain scores of the signals, e.g. how much the return on the initial investment would have been. Backtesting cannot be written as a function of the alpha model's parameters in closed form, thus it is not possible to analytically find the optimal parameters. Instead, backtesting must be considered a black box function that should be optimised.

Bayesian optimisation has grown in popularity in the machine learning community as an intuitive way of maximising black box and/or very costly objective functions (costly in the sense of both time and resources) [7]. Utilising a prior over objective functions, and then sparingly evaluating the objective function at certain points (guided by the posterior), Bayesian optimisation attempts to find the global maximum of the objective function within a predefined grid.

Our intention in this paper is to combine the use of GBNs as alpha models with the optimisation of the parameters of these GBNs using Bayesian optimisation.

The rest of the paper is organised as follows. We begin by giving a brief introduction to GBNs in Section 2, which will illuminate how GBNs can be used as alpha models. We continue by explaining by which metrics alpha models can be evaluated in Section 3, and give slightly more details regarding backtesting. In Section 4 we will describe the components of Bayesian optimisation, including the use of Gaussian processes as priors, as well as kernel and acquisition functions. In Section 5 we will account for the procedure we will use to evaluate the expected performance of using Bayesian optimisation over the parameters of GBNs. Once the procedure has been described, we will in Section 6 account for a real-world application where we show how GBNs can be used as alpha models with support from Bayesian optimisation. Finally, in Section 7 we will offer a few words regarding our conclusions and future work.

2   GATED BAYESIAN NETWORKS

Bayesian networks (BNs) can be interpreted as models of causality at the macroscopic level, where unmodelled causes add uncertainty. Cause and effect are modelled using random variables that are placed in a directed acyclic graph (DAG). The causal model implies some probabilistic independencies among the variables, which can easily be read off the DAG. Therefore, a BN does not only represent a causal model but also an independence model. The qualitative model can be quantified by specifying certain marginal and conditional probability distributions so as to specify a joint probability distribution, which can later be used to answer queries regarding posterior probabilities, interventions, counterfactuals, etc. The independencies represented in the DAG make it possible to compute these queries efficiently. Furthermore, they reduce the number of parameters needed to represent the joint probability distribution, thus making it easier to elicit the probability parameters needed from experts or from data. See [8, 9, 10] for more details.

Despite their popularity and advantages, there are situations where a BN is not enough. For instance, when trying to model the process of buying and selling assets, we wanted to model the constant flow between identifying buying opportunities and then, once such have been found, identifying selling opportunities, as is required by an alpha model. These two phases can be very different and the variables included in the BNs modelling them are not necessarily the same. The need to switch between two different BNs was the foundation for the introduction of GBNs.

[Figure 2: Two Phased GBN. Diagram labels: BN1, BN2, nodes A, B, S, W, E, F, gates G1, G2.]

Switching between phases is done using so called gates. These gates are encoded with predefined logical expressions regarding posterior probabilities of random variables in the BNs. This allows for the activation and deactivation of BNs based on posterior probabilities. A GBN that uses two different BNs (BN1 and BN2) is shown in Figure 2. Here, we will give a brief explanation of GBNs in general, and the GBN in Figure 2 in particular (for the full definition of GBNs see [4, 6]):

  • A GBN consists of BNs and gates. BNs can be active or inactive. The label of BN1 is underlined, indicating that it is active at the initial state of the GBN. The BNs supply posterior probabilities to the gates via so called trigger nodes. The node S is a trigger node for gate G1 and W is a trigger node for G2. A gate can utilise more than one trigger node.

  • Each gate is encoded with a predefined logical expression regarding its trigger nodes' posterior probability of a certain state, e.g. G1 may be encoded with $p(S = s_1 \mid e) > 0.7$. This expression is known as the trigger logic for gate G1.

  • When evidence is supplied to the GBN, an evidence handling algorithm updates posterior probabilities and checks if any of the logical statements in the gates are satisfied. If the trigger logic is satisfied for a gate it is said to trigger. A BN that is inactive never supplies any posterior probabilities, hence G2 will never trigger as long as BN2 is inactive.

  • When a gate triggers, it deactivates all of its parent
      BNs and activates its child BNs (as defined by the direction of the edges between gates and BNs). In our example, if G1 was to trigger it would deactivate BN1 and activate BN2; this implies that the model has switched phase.

If the GBN was used as an alpha model, then when the GBN identifies a buying opportunity, and moves to the sell phase, a buy signal is generated. Looking again at Figure 1, each buy and sell signal was generated as the GBN switched back and forth between its phases.

For the purpose of discussing GBN parameter optimisation in general, we will say that a GBN is parameterised by three disjoint parameter sets Θ, Λ and Γ. The parameters in Θ are the parameters of the marginal and conditional probability distributions of the variables in the contained BNs. All other free parameters are contained in Λ, while any fixed parameters are contained in Γ. For instance, in a setting where the only unknowns are the thresholds in the trigger logic of the gates, we say that the thresholds are in Λ and all other parameters are fixed in Γ. This notation allows a bit of convenience when discussing the evaluation of the optimisation procedure in Section 5 and the application in Section 6.

3     EVALUATION OF ALPHA MODELS

As we alluded to in Section 1, and as we shall see in Section 6, it is necessary to assess how good a set of signals is, thereby assessing the performance of an alpha model. Regression models can be evaluated by how well they minimise some error function or by their log predictive scores. For classification, the accuracy and precision of a model may be of greatest interest. Alpha models may rely on regression and classification, but cannot be evaluated as either. An alpha model's performance needs to be based on its generated signals over a period of time, and the performance must be measured by the risk and reward of the model. This is known as backtesting.

3.1   BACKTESTING

The process of evaluating an alpha model on historic data is known as backtesting, and its goal is to produce metrics that describe the behaviour of a specific alpha model. These metrics can then be used for comparison between alpha models [11, 12]. A time range, price data for assets traded and a set of signals are used as input. The backtester steps through the time range and executes signals that are associated with the current time (using the supplied price data) and computes an equity curve (which will be explained in Section 3.2). From the equity curve it is possible to compute metrics of risk and reward. To simulate potential transaction costs, often referred to as commission, every trade executed is usually charged a small percentage of the total value (0.06% is a common commission charge used in the included application).

Alpha models are backtested separately from the other components of the algorithmic trading system, as the backtesting results are input to the other components. Therefore, we execute every signal from an alpha model during backtesting, whereas in a full algorithmic trading system we would have a portfolio construction model that would combine several alpha models and decide how to build a portfolio from their signals.

3.2   ALPHA MODEL METRICS

What constitutes risk and reward is not necessarily the same for every investor, and investors may have their own personal preferences. However, there are a few metrics that are common and often taken into consideration [12]. Here we will introduce the metrics that we will use to evaluate the performance of our alpha models.

Although not a metric on its own, the equity curve needs to be defined in order to define the following metrics. The equity curve represents the total value of a trading account at a given point in time. If a daily timescale is used, then it is created by plotting the value of the trading account day by day. If no assets are bought, then the equity curve will be flat at the same level as the initial investment. If assets are bought that increase in value, then the equity curve will rise. If the assets are sold at this higher value then the equity curve will again go flat at this new level. The equity curve summarises the value of the trading account including cash holdings and the value of all assets. We will use $E_t$ to reference the value of the equity curve at point t.

Metric 1 (Return) The return of an investment is defined as the percentage difference between two points on the equity curve. If the timescale of the equity curve is daily, then $r_t = (E_t - E_{t-1})/|E_{t-1}|$ would be the daily return between day t and t − 1. We will use $\bar{r}$ and $\sigma_r$ to denote the mean and standard deviation of a set of returns.

Metric 2 (Sharpe Ratio) One of the most well known metrics used is the so called Sharpe ratio. Named after its inventor, Nobel laureate William F. Sharpe, this ratio is defined as $(\bar{r} - \text{risk free rate})/\sigma_r$. The risk free rate is usually set to be a "safe" investment such as government bonds or the current interest rate, but is also sometimes removed from the equation [12]. The intuition behind the Sharpe ratio is that one would prefer a model that gives consistent returns (returns around the mean), rather than one that fluctuates. This is important since investors tend to trade on margin (borrowing money to take larger positions), and it is then more important to get consistent returns than returns that sometimes are large and sometimes small. This is why the Sharpe ratio is used as a reward metric rather than the return.
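The backtesting step of Section 3.1 and Metrics 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the backtester used in the paper: the function names (`backtest`, `daily_returns`, `sharpe_ratio`), the long-only "invest all cash on a buy, sell everything on a sell" position sizing, the use of the sample standard deviation, and the default risk free rate of zero are all choices made here for the example.

```python
import math


def backtest(prices, signals, initial=10_000.0, commission=0.0006):
    """Toy long-only backtester (hypothetical helper) returning an equity curve.

    prices  -- list with one asset price per day
    signals -- dict mapping a day index to "buy" or "sell"
    """
    cash, units = initial, 0.0
    equity = []
    for day, price in enumerate(prices):
        action = signals.get(day)
        if action == "buy" and units == 0.0:
            # Invest all cash; commission is a percentage of the traded value.
            units = cash / (price * (1.0 + commission))
            cash = 0.0
        elif action == "sell" and units > 0.0:
            cash = units * price * (1.0 - commission)
            units = 0.0
        # Equity = cash holdings plus the current value of all assets.
        equity.append(cash + units * price)
    return equity


def daily_returns(equity):
    """Metric 1: r_t = (E_t - E_{t-1}) / |E_{t-1}| for consecutive days."""
    return [(cur - prev) / abs(prev) for prev, cur in zip(equity, equity[1:])]


def sharpe_ratio(returns, risk_free_rate=0.0):
    """Metric 2: (mean return - risk free rate) / std of returns.

    Assumes at least two returns that are not all identical; the sample
    standard deviation (n - 1 denominator) is an assumption of this sketch.
    """
    mean = sum(returns) / len(returns)
    variance = sum((r - mean) ** 2 for r in returns) / (len(returns) - 1)
    return (mean - risk_free_rate) / math.sqrt(variance)


# Example: buy on day 1, sell on day 3, then stay out of the market.
prices = [100.0, 100.0, 110.0, 110.0, 99.0]
curve = backtest(prices, {1: "buy", 3: "sell"})
print(curve, sharpe_ratio(daily_returns(curve)))
```

Note how the equity curve stays flat after the sell signal even though the price drops on the last day, which matches the description of the equity curve above.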



Furthermore, under certain assumptions it can be shown that there exists an optimal allocation of equity between alpha models (in the portfolio construction model), such that the long-term growth rate of equity is maximised [12]. This growth rate turns out to be $g = r + S^2/2$, where r is the risk free rate and S is the Sharpe ratio. Thus, a high Sharpe ratio is not only an indication of good risk adjusted return, but holding the risk free rate constant, the optimal growth rate is an increasing function of the Sharpe ratio.

Using the Sharpe ratio as a metric will ensure that the alpha models are evaluated on their risk adjusted return. However, there are other important alpha model behaviours that need to be measured. A family of these, known as drawdown risks, is presented here (see Figure 3 for examples of an equity curve and these metrics).

[Figure 3: Equity Curve with Drawdown Risks. Equity in $ plotted against Time, with the initial investment level, MDD, LVFI, TIMR and 1 - TIMR marked.]

Metric 3 (Maximum Drawdown (MDD)) The percentage between the highest peak and the lowest trough of the equity curve during backtesting. The peak must come before the trough in time. The MDD is important from both a technical and psychological regard. It can be seen as a measure of the maximum risk that the investment will live through. Investors that use their existing investments that have gained in value as safety for new investments may be put in a situation where they are forced to sell everything. Other risk management models may automatically sell investments that are losing value sharply. For the individual who is not actively trading but rather placing money in a fund, the MDD is psychologically frustrating to the point where the individual may withdraw their investment at a loss in fear of losing more money.

Metric 4 (Lowest Value From Investment (LVFI)) The percentage between the initial investment and the lowest value of the equity curve. This is one of the most important metrics, and has a significant impact on technical and psychological factors. For investors trading on margin, a high LVFI will cause the lender to ask the investor for more safety capital (known as a margin call). This can be potentially devastating, as the investor may not have the capital required, and is then forced to sell the investment. The investor will then never enjoy the return the investment could have produced. Individuals who are not investing actively, but instead are choosing between funds that invest in their place, should be aware of the LVFI as it is the worst case scenario if they need to retract their investment prematurely.

Metric 5 (Time In Market Ratio (TIMR)) The percentage of time of the investment period where the alpha model owned assets. This metric may seem odd to place within the same family as the other drawdown risks, however it fits naturally in this space. We can assume that on the days the alpha model does not own any assets the drawdown risk is zero. If we are not invested, then there is no risk of loss. In fact, we can further assume that our equity is growing according to the risk free rate, as it is not bound in assets.

4     BAYESIAN OPTIMISATION

Our intention is to use GBNs as alpha models and to optimise the free parameters Λ with respect to the metrics given in Section 3.2. In order to do so we must backtest the signals that a GBN produces, and thus we cannot analytically solve the optimisation problem, as backtesting as a function of Λ has no general closed form expression. At the same time, backtesting is relatively costly, as one must create the model, prepare data, estimate parameters, generate signals and walk through the time range to simulate the trading. For this reason, it is not feasible to exhaustively sweep a large grid of parameters. However, Bayesian optimisation allows us to prioritise the points on the grid to evaluate, thus reducing the number of evaluations, while still finding the global maximum of a potentially costly and black box objective function.

4.1   GAUSSIAN PROCESS AS SURROGATE FUNCTION

Essentially, we would like to find the parameters $\Lambda^* \in \boldsymbol{\Lambda}$ that maximise an unknown function f. We place a prior, $p(f)$, over the possible functions f, and compute the posterior over f using observations $\{\Lambda_{1:i}, f_{1:i}\}$, where $f_j = f(\Lambda_j)$. Hence, we compute $p(f \mid \{\Lambda_{1:i}, f_{1:i}\}) \propto p(\{\Lambda_{1:i}, f_{1:i}\} \mid f)\,p(f)$. We can then use this posterior distribution over objective functions as an estimate of our objective function. This is sometimes known as using the posterior as a surrogate function to the true objective function.

In Bayesian optimisation it is common to use a Gaussian process (GP) as the surrogate function [7]. It is defined as a multivariate normal distribution of infinite dimension, where each dimension is a point along some grid. A finite set of these dimensions will form a Gaussian distribution, thus allowing a GP to be defined completely by a mean function µ and a kernel function κ. The GP over the grid $\boldsymbol{\Lambda}$ is then defined as $N(\mu(\Lambda), \kappa(\Lambda, \Lambda'))$ for all $\Lambda, \Lambda' \in \boldsymbol{\Lambda}$. Commonly, the prior $\mu(\Lambda)$ is assumed to be zero for all $\Lambda \in \boldsymbol{\Lambda}$, although this is by no means necessary if prior information is available to suggest otherwise. The more
involved task is to define the kernel function κ. With κ we can express our prior belief about the objective function that we wish to maximise. Although we do not know the form of the objective function, we often assume that points close to each other on the grid give similar results, thus we assume the objective function to possess at least some smoothness. These assumptions can be articulated in κ, for instance by the rational quadratic kernel in Equation 1, where c is a tuning constant for how smooth we believe the objective function to be. For points close to each other, Equation 1 will result in values close to 1, while points further away will be given values closer to 0. The GP prior will obtain the same smoothness properties, as the covariance matrix is completely defined by κ. To visualise the smoothness achieved by tuning c, Figure 4 shows the decreasing covariance as distance grows with three different settings of c (1, 5 and 10). As can be seen, as c increases the decrease is slower, thus more smoothness is assumed.

\[
\kappa(\Lambda, \Lambda') = 1 - \frac{\|\Lambda - \Lambda'\|^2}{\|\Lambda - \Lambda'\|^2 + c} \tag{1}
\]

[Figure 4: Covariance Decrease by Distance. Covariance plotted against Distance (0 to 10) for c = 1, c = 5 and c = 10.]

Assuming that we have observed $\{\Lambda_{1:i}, f_{1:i}\}$, and that we wish to calculate the posterior predictive distribution for an unobserved point $\Lambda_{i+1}$, a closed form expression exists for this calculation as described in Equation 2. Thus, it is possible to efficiently calculate the posterior distribution of an unobserved point where both the prior smoothness and observed data have been considered. For more on GPs, please see [13].

\[
\begin{aligned}
\begin{bmatrix} f_{1:i} \\ f_{i+1} \end{bmatrix} &\sim N\left(0, \begin{bmatrix} K & K_*^T \\ K_* & K_{**} \end{bmatrix}\right) \\
K &= \begin{bmatrix} \kappa(\Lambda_1, \Lambda_1) & \cdots & \kappa(\Lambda_1, \Lambda_i) \\ \vdots & \ddots & \vdots \\ \kappa(\Lambda_i, \Lambda_1) & \cdots & \kappa(\Lambda_i, \Lambda_i) \end{bmatrix} \\
K_* &= \begin{bmatrix} \kappa(\Lambda_{i+1}, \Lambda_1) & \cdots & \kappa(\Lambda_{i+1}, \Lambda_i) \end{bmatrix} \\
K_{**} &= \kappa(\Lambda_{i+1}, \Lambda_{i+1}) \\
p(f_{i+1} \mid \{\Lambda_{1:i}, f_{1:i}\}) &= N(\mu_i(\Lambda_{i+1}), \sigma_i^2(\Lambda_{i+1})) \\
\mu_i(\Lambda_{i+1}) &= K_* K^{-1} f_{1:i} \\
\sigma_i^2(\Lambda_{i+1}) &= K_{**} - K_* K^{-1} K_*^T
\end{aligned} \tag{2}
\]

4.2    ACQUISITION FUNCTIONS AND BAYESIAN OPTIMISATION

In Bayesian optimisation we make use of a so called acquisition function. Several acquisition functions have been suggested, however the goal is to trade off exploring the grid where the posterior uncertainty is high, while exploiting points that have a high posterior mean. We will use the upper confidence bound criterion, which is expressed as $UCB(\Lambda) = \mu(\Lambda) + \eta\sigma(\Lambda)$, where $\mu(\Lambda)$ and $\sigma(\Lambda)$ represent the mean and standard deviation at the point Λ of the GP, and η is a tuning parameter to allow for more exploration (as η is increased) or more exploitation (as η is decreased).

Succinctly, define a GP over a grid with some kernel function, then randomly sample a point and evaluate the objective function at this point. Calculate the posterior of the GP given this new observation and find the $\Lambda'$ that maximises the acquisition function. Then $\Lambda'$ is the next point at which to evaluate the objective function. Iterate these steps for a predefined number of iterations. Once all iterations have passed, the Λ with the highest posterior mean is the set of parameters that maximises the objective function.

5    EVALUATION PROCEDURE

In Section 6 we will account for a real-world application of GBNs as alpha models supported by Bayesian optimisation. However, in this section we will introduce the optimisation procedure used, as well as the method used to evaluate the performance of the optimisation, which is essentially the same method used in [5].

A data set D of consecutive evidence sets, e.g. observations over all or some of the random variables in the GBN, is divided into n equally sized blocks $(D_1, \ldots, D_n)$, such that they are mutually exclusive and exhaustive. Each block contains consecutive evidence sets and all evidence sets in block $D_i$ come before all evidence sets in $D_j$ for all $i < j$. Depending on the amount of available data, k is chosen as
Using a GP as a surrogate to the objective function al-             the number of blocks used for optimisation. Starting from
lows us to encode prior beliefs about the unknown objec-            index 1, blocks 1,...,k are used for optimisation and k + 1
tive function, and sampling the objective function allows           for testing, thus ensuring that the evidence sets in the test-
us to update the posterior of the surrogate. What is left           ing data occurs after the optimisation data. The procedure
to do is to decide where to sample the objective function.          is then repeated starting from index 2 (i.e. blocks 2, ..., k+1



are used for optimisation and k + 2 for testing). By doing so we create t repeated simulations, moving the testing data one block forward each time. An illustration of this procedure when n = 12, k = 5 and t = 7 is shown in Figure 5.

[Figure 5: Data Split For Optimisation and Testing. Rows: Simulation 1 to Simulation 7; legend: data for optimisation, data withheld for testing.]

During Bayesian optimisation, when the objective function is evaluated for some acquired Λ, a cross-validation estimate is calculated for the k blocks used. Here, k − 1 blocks are used to estimate the parameters Θ of the contained BNs and the held out block is used as validation data to calculate a score ρ. The value of the objective function, given parameters Λ, is thus the average of all ρ when each block in the optimisation data has been held out.

the natural order of the data, thus allowing a validation block to come before a block used for estimating the parameters Θ. This could potentially induce a bias in the cross-validation estimate as the data used for estimating the parameters would not have been known at the time the data for the validation block was generated. However, as we do not use this scheme when we evaluate the performance of the optimisation, the expected performance of the optimisation is not biased in this way. We simply use this scheme to make the best use of the data during cross-validation. Second, one scoring function J has been used both during optimisation and for evaluating the expected performance of the optimisation. The scoring function J could internally use many different metrics to come up with one score to maximise. However, it is natural in the coming setting to expose the actual values of several metrics, thus several scoring functions J are used to get a vector of mean scores [ρ̄_{J_1}, ..., ρ̄_{J_m}].

Another approach to combining Bayesian optimisation with cross-validation is to reduce the number of fold evaluations necessary [14], as certain folds may be closely correlated; our approach, however, is to reduce the number of parameters that we need to test with cross-validation.
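The block scheme and the cross-validated objective described in this section can be sketched as follows. This is an illustration only: `walk_forward_splits`, `cv_objective`, and the `fit` and `score` callables are hypothetical names introduced here, standing in for estimating the BN parameters Θ and computing a score ρ on the held out block.

```python
from statistics import mean

def walk_forward_splits(n, k):
    """Yield (optimisation_blocks, test_block) pairs using 1-based block indices."""
    for start in range(1, n - k + 1):
        yield list(range(start, start + k)), start + k

def cv_objective(params, opt_blocks, fit, score):
    """Average validation score when each optimisation block is held out once."""
    scores = []
    for held_out in opt_blocks:
        train = [b for b in opt_blocks if b != held_out]
        model = fit(params, train)             # hypothetical: estimate Theta on k - 1 blocks
        scores.append(score(model, held_out))  # hypothetical: score rho on the held out block
    return mean(scores)
```

With n = 12 and k = 5 the generator yields seven simulations, the first using blocks 1 to 5 for optimisation and block 6 for testing, and the last using blocks 7 to 11 with block 12 for testing.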
In order to formalise the procedure used to evaluate the optimisation, recall from Section 2 that Λ is used to represent the free parameters of a GBN and Γ is used to represent all fixed parameters. Let J be a score function such that J(Λ, D_j, {D}_l^m | Γ) is the score for a GBN under some parameterisation Λ and Γ when block j has been used for either testing or validation and the blocks D_l, ..., D_m have been used to estimate Θ of the BNs in the GBN (under the parameters Λ and Γ).

 1. For each simulation t, where (as discussed previously) D_{t+k} is the testing data and D_t, ..., D_{t+k−1} is the optimisation data, use Bayesian optimisation to find the parameters Λ^t that satisfy Equation 3.

\[
\Lambda^t = \arg\max_{\Lambda \in \boldsymbol{\Lambda}} \frac{1}{k} \sum_{j=t}^{t+k-1} \mathcal{J}(\Lambda, D_j, \{D\}_t^{t+k-1} \setminus D_j \mid \Gamma) \tag{3}
\]

 2. For each Λ^t calculate the score ρ^t_J on the testing set with respect to the scoring function J according to Equation 4.

\[
\rho^t_{\mathcal{J}} = \mathcal{J}(\Lambda^t, D_{t+k}, \{D\}_t^{t+k-1} \mid \Gamma) \tag{4}
\]

 3. The expected performance ρ̄_J of the optimisation, with respect to the score function J, is then given by the average of the scores ρ^t_J, i.e. ρ̄_J = (1/t) Σ_t ρ^t_J.

Two things to note about this procedure. First, during cross-validation inside the objective function we disregard

6   APPLICATION

Having established the optimisation procedure, and the method we intend to use to evaluate the performance of the optimisation, we turn our attention to a real-world application. We aim to use GBNs as alpha models to generate buy and sell signals of stock indices in such a way that drawdown risks are mitigated, compared to the buy-and-hold strategy, while at the same time maintaining similar or better rewards.

Stock indices are weighted averages of their respective stock components. For instance, the Dow Jones Industrial Average (DJIA) is a weighted average of 30 large companies based in the United States. Indices may have different schemes for how the different components are weighted, however they all aim to give a collective representation of their components.

An index fund owns shares of the components of a specific index, proportional to the weights, such that the fund's return mirrors that of the index. These funds are very popular, as they are easy for the investor to comprehend, whereas trading the individual components of an index requires a lot of effort.

A buy-and-hold strategy on stock indices via index funds may be convenient, however it implies that the equity is put through the full force of drawdown risks described in Section 3.2. The buy-and-hold strategy holds assets over the entire backtesting period and so will be subject to the full force of these metrics. For instance, as an asset will be held



[Figure 6: GBN-1 and GBN-2. GBN-1 (left): phases Buy and Sell with gates G1 and G2. GBN-2 (right): phases Trend, Long and Short with gates G1, G2, G3 and G4.]

6.1.1   Variables

The variables used in the GBNs were discretisations of so-called technical analysis indicators. One of the major tenets in technical analysis is that the movement of the price of an asset repeats itself in recognisable patterns. Indicators are computations of price and volume that support the identification and confirmation of patterns used for forecasting. Many classical indicators exist, such as the moving average (MA), which is the average price over time, and the relative strength index (RSI), which compares the size of recent gains to the size of recent losses. For the full definition of these indicators, please see [16, 17].
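As a rough illustration of the two indicators just mentioned, the sketch below computes a simple moving average and an RSI value from a list of prices. It rests on assumptions that go beyond the text: the MA is taken as an unweighted mean, and the RSI uses the common form 100 − 100 / (1 + RS), with RS the ratio of summed gains to summed losses over the window. Other variants exist (for instance with smoothed averages), and the exact definitions used in the paper are those of [16, 17]. The function names and the handling of a window without losses are choices made for this sketch.

```python
def moving_average(prices, periods):
    """Simple MA over `periods` prices, for each index where a full window exists."""
    return [sum(prices[i - periods + 1:i + 1]) / periods
            for i in range(periods - 1, len(prices))]

def rsi(prices, periods=14):
    """RSI from the last `periods` price changes (unsmoothed variant, an assumption)."""
    if len(prices) < periods + 1:
        raise ValueError("need at least periods + 1 prices")
    changes = [b - a for a, b in zip(prices[-periods - 1:-1], prices[-periods:])]
    gains = sum(c for c in changes if c > 0)
    losses = sum(-c for c in changes if c < 0)
    if losses == 0:
        # convention assumed here: 100 when only gains occurred, 50 for a flat window
        return 100.0 if gains > 0 else 50.0
    return 100.0 - 100.0 / (1.0 + gains / losses)
```

The ratio of summed gains to summed losses equals the ratio of their averages over the same window, so no explicit division by the window length is needed.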
For each phase in the GBNs (Buy, Sell, Trend, Long and Short), we placed a naïve Bayesian classifier over the same technical analysis indicators. However, by allowing the parameterisation of one of the technical analysis indicators to vary between the phases, we essentially created different variables in the different phases. The tuning of the technical analysis parameters allowed us to better capture the dynamics of the data, as they may differ between assets as well as between the different phases of trading. Figure 7 depicts the classifier structure and variables used. The variables are explained below, along with an example in Figure 8.

  • S represents the first-order finite backward difference of 5 periods of the MA of ψ periods, shifted 5 periods into the future. To clarify, the first plot in Figure 8 shows the price of an asset along with the MA. If the current time is t, then S represents the slope of the line between what the MA will be at t + 5 and what it is at t.

  • A represents the same slope as S but at its current value (i.e. between t and t − 5).

  • B represents the difference between the current value of the MA of ψ periods and the current raw price. This can be seen in Figure 8 as the difference between the two time series in the first plot.

  • C represents the current RSI value (at t in the second plot of the figure) using 14 periods.

  • D represents what C was 5 time steps in the past (at t − 5 in the second plot of the figure).

The choice of 14 periods for RSI is based on the prevailing standard [16, 17], and the choice of 5 periods as the prediction horizon is based on the number of trading days in a week.

Variables A, B, C and D were discretised into six bins, each using equal width binning, and S was discretised into two bins separated by zero. Thus, the states of S represent

throughout the period, the lowest point of the asset's value will coincide with LVFI. In dwindling stock markets, the index funds will lose value, and equity could be salvaged and possibly be placed in risk-free assets during these periods. Furthermore, utilising certain financial products, it is also possible to increase equity during these times of distress by purchasing short positions of the index. Short positions can be thought of as a loan, where the value of the loan increases if the index decreases in value, and it is possible to sell the loan at its higher value (to make the distinction, regular positions are called long when short positions are considered).

At first the buy-and-hold strategy may seem naïve, however it has been shown that deciding when to own and not own assets requires consistent high accuracy of predictions in order to gain higher returns than the buy-and-hold strategy [15]. The buy-and-hold strategy has become a standard benchmark, not only because of the required accuracy, but also because it requires very little effort to execute (no complex computations and/or experts needed).

6.1   METHODOLOGY

We used two different GBN structures to create alpha models. The first GBN structure (henceforth known as GBN-1) modelled buying and selling long positions only, while the second GBN structure (GBN-2) modelled buying and selling long and short positions. The structures are depicted in Figure 6 (GBN-1 on the left and GBN-2 on the right). The structure for GBN-1 works as described in Section 2. The structure for GBN-2 starts in the Trend phase, from where either G1 or G2 can trigger. If G1 triggers then a long open signal is generated and the Long phase is activated (deactivating the Trend phase). If gate G3 then triggers, a long close signal is generated, and the Trend phase is activated again (deactivating the Long phase). However, if G2 triggers before G1, then a short open signal is generated, and the Short phase is activated (deactivating Trend). In similar fashion, when G4 triggers a short close signal is generated, activating Trend and deactivating Short.
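The phase transitions described above for GBN-2 can be written as a small lookup table. The sketch below is illustrative only: the names `GBN2_TRANSITIONS`, `step` and `run` are introduced here, the gate trigger logic itself is abstracted into a set of gates that fired at each time step, and checking G1 before G2 when both fire in the same step is an assumption, as the text does not specify a priority.

```python
# (active phase, triggering gate) -> (signal, next active phase), as described for GBN-2
GBN2_TRANSITIONS = {
    ("Trend", "G1"): ("long open", "Long"),
    ("Trend", "G2"): ("short open", "Short"),
    ("Long", "G3"): ("long close", "Trend"),
    ("Short", "G4"): ("short close", "Trend"),
}

def step(phase, triggered):
    """Advance one time step; `triggered` is the set of gates whose logic fired."""
    # dict order gives G1 priority over G2 in the Trend phase (an assumption)
    for (src, gate), (signal, nxt) in GBN2_TRANSITIONS.items():
        if src == phase and gate in triggered:
            return signal, nxt
    return None, phase  # no gate of the active phase fired

def run(trigger_sequence, phase="Trend"):
    """Collect the signals generated over a sequence of triggered-gate sets."""
    signals = []
    for triggered in trigger_sequence:
        signal, phase = step(phase, triggered)
        if signal:
            signals.append(signal)
    return signals, phase
```

Gates that belong to an inactive phase are ignored, which mirrors the idea that only the active phase can hand over control.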



a predicted positive or negative future value of the modelled asset price (smoothed by the moving average of ψ periods). As S represents a future value, evidence for S was only available during estimation of the parameters Θ, not during the generation of signals.

[Figure 7: Bayesian Classifier in GBN Phases. Class variable S: ∇^1_5 MA(ψ), Offset(+5). Attributes A: ∇^1_5 MA(ψ); B: PDIFF(MA(ψ)); C: RSI(14); D: RSI(14), Offset(−5).]

[Figure 8: Visualisation of Variables. Upper plot: Price and MA, with A, S and B marked around t − 5, t and t + 5. Lower plot: RSI, with D and C marked at t − 5 and t. Time axis: March 2011.]

The gates all defined trigger logic over the posterior distribution of S with some threshold τ. For instance, in GBN-1, the trigger logic for G1 was TL(G1) : p(S = positive | e) > τ_G1, i.e. if the posterior probability of a positive climate is greater than some threshold, then the model should give a buy signal and move to the next phase (the sell phase). Naturally, the trigger logic for G2 in the same GBN was TL(G2) : p(S = negative | e) > τ_G2, thus giving a sell signal if the posterior probability of a negative future value exceeds some threshold.

6.1.2    Bayesian Optimisation Settings

The previous section implies the following for the two GBNs:

  • For GBN-1 the free parameters Λ to be optimised are: Λ = {τ_G1, τ_G2, ψ_Buy, ψ_Sell}.

  • For GBN-2 the free parameters Λ to be optimised are: Λ = {τ_G1, τ_G2, τ_G3, τ_G4, ψ_Trend, ψ_Long, ψ_Short}.

  • In both cases all τ were confined to [60, 90] and all ψ to [10, 40].

We used the upper confidence bound acquisition function (as described in Section 4.2) with η = 5, which allowed for abundant exploration, as our objective function was not extremely expensive to evaluate. We used the rational quadratic kernel as described in Equation 1 with c = 1. For GBN-1 we ran the Bayesian optimisation for 1,600 iterations, and for GBN-2 we ran 12,800 iterations.

6.1.3    Data Sets

We used four indices in this study: DJIA and NASDAQ, which are both based on companies in the United States, FTSE100, which is based on companies in the United Kingdom, and DAX, which is based on companies located in Germany. We ran our experiments on daily adjusted closing prices for these indices ranging from 2001-01-01 to 2012-12-28 (data downloaded from Yahoo! Finance™). This gave a total of 12 years of price data for each index, where each year was allocated to a block, thus n = 12. For the cross-validation step we used k = 5, giving t = 7 simulations from which to calculate [ρ̄_{J_1}, ..., ρ̄_{J_m}] (the data split is depicted in Figure 5).

6.1.4    Scoring Functions

The signals generated were backtested in order to calculate relevant metrics. During optimisation (i.e. step 1 in Section 5) the objective function used the Sharpe ratio. The choice was made as it combines both risk and reward into one score, for which a cross-validation estimate could be returned by the objective function. For evaluating the performance of the optimisation (step 2 in Section 5), we used the return and the drawdown risks described in Section 3.2 to create a score vector [ρ̄_{J_1}, ..., ρ̄_{J_m}]. The same metrics were calculated for the buy-and-hold strategy.

6.2     RESULTS AND DISCUSSION

The score vectors from the evaluation of the optimisation versus the score vector for the buy-and-hold strategy over the seven simulations are shown in Table 1. The annual Sharpe presented in the table is the mean return divided by the standard deviation of returns over the seven simulations, and since each block was allocated one year of data it becomes the annual Sharpe ratio.

We will first turn our attention to GBN-1. We use the Sharpe ratio as our measure of reward, prioritised above the raw return for reasons discussed in Section 3.2. Therefore, we must first ensure that our algorithmic trading system produces similar or better Sharpe ratios than the buy-and-hold strategy. As can be seen, this was the case for DJIA, NASDAQ and DAX, but not for



FTSE100. Secondly, we must take into consideration the TIMR. For GBN-1, we were invested only slightly above half of the time compared to buy-and-hold, reducing risk to equity considerably. Meanwhile, the rest of the time the equity could have gained in value from interest rates (or other risk-free assets); this potential gain was not considered in these results. Risk to equity from MDD was roughly half its counterpart from the buy-and-hold strategy for all indices. The LVFI is a major threat to equity (as discussed in Section 3.2), and one where buy-and-hold severely underperforms. For DAX the LVFI was only about a third of the buy-and-hold LVFI, and for the other three indices it was roughly half.

Table 1: Metric Values for GBNs and Buy-and-Hold

Index      Score            GBN-1     GBN-2     BaH
DJIA       Annual Sharpe    0.289     0.330     0.157
           Return           0.019     0.032     0.028
           MDD              0.085     0.116     0.167
           LVFI             0.058     0.062     0.119
           TIMR             0.628     0.91      1.0
NASDAQ     Annual Sharpe    0.308     0.081     0.254
           Return           0.033     0.012     0.067
           MDD              0.101     0.164     0.207
           LVFI             0.062     0.099     0.146
           TIMR             0.554     0.94      1.0
FTSE100    Annual Sharpe    -0.057    -0.64     0.127
           Return           -0.006    -0.074    0.022
           MDD              0.098     0.167     0.188
           LVFI             0.074     0.121     0.142
           TIMR             0.649     0.962     1.0
DAX        Annual Sharpe    0.778     0.589     0.278
           Return           0.081     0.062     0.069
           MDD              0.107     0.171     0.213
           LVFI             0.056     0.059     0.154
           TIMR             0.610     0.926     1.0

All in all, the results clearly indicate that GBN-1 was competitive with the buy-and-hold strategy for three of the indices, as Sharpe ratios were improved upon and risk to equity was decreased significantly. Furthermore, these results were achieved while having equity invested for only slightly more than half of the investment period. It is also clear that we cannot expect the same GBN to be useful for all indices, as the reward was not improved upon for FTSE100. Some of the parameters that were fixed in Γ may have to be tuned in order to accommodate the dynamics of FTSE100, such as the technical analysis indicators used, or the fixed parameters of the ones used currently.

Moving on to GBN-2, we can see that allowing the GBN to open short positions changes the results dramatically. For DJIA, we improved upon the Sharpe ratio, at the cost of the drawdown risks. Both MDD and LVFI were increased marginally, yet were still lower than for buy-and-hold. The TIMR was also increased to such a degree that we were invested almost the entire investment period. There is potential gain in reward from using GBN-2 for DJIA, however the increased risk must be considered.

For NASDAQ, FTSE100 and DAX there is no improvement over GBN-1. Instead, Sharpe ratios decreased and drawdown risks increased. There could be several reasons for this that are worth investigating, however our immediate response is either that we have overfitted the model, as several more parameters were optimised over the same amount of data, or that a bad short position is doubly bad for equity, as we also lose out on the profit from a long position during the same time.

7   CONCLUSIONS AND FUTURE WORK

Our results show that it is feasible to use GBNs as alpha models, and to use Bayesian optimisation to tune them in order to beat the buy-and-hold benchmark, with respect to certain risk and reward metrics. Some of the design decisions made before optimisation may however have reduced the performance of the GBNs on some of the used data sets. Short positions are optimally taken during times of distress, and due to increased volatility, markets move very differently compared to stable increasing markets. We decided to lock in the forward and backward horizons to 5 time steps, and the RSI period to 14, which may have made it impossible to capture the more volatile dynamics. Furthermore, stock indices generally increase in value over long periods of time, thus short selling will always be in the opposite direction of the long-term trend, which in general is ill-advised.

Nevertheless, we are encouraged to see the included positive results and are at the same time motivated to address the problems we faced with GBN-2. We would not expect the exact same model to perform well on all given data sets, and so further work is needed to improve upon the results on FTSE100 to make them on par with the other three indices. For instance, there is room to make the objective function even more expensive by not only estimating BN parameters, but also performing variable selection and structure learning during cross-validation.

Acknowledgements

BN inference in our implementation is based on the SMILE reasoning engine, contributed to the community by the Decision Systems Laboratory of the University of Pittsburgh and available at https://dslpitt.org/genie/.



References

 [1] P. Treleaven, M. Galas, and V. Lalchand, “Algorithmic trading review,” Communications of the ACM, vol. 56, no. 11, pp. 76–85, 2013.

 [2] G. Nuti, M. Mirghaemi, P. Treleaven, and C. Yingsaeree, “Algorithmic trading,” Computer, vol. 44, no. 11, pp. 61–69, 2011.

 [3] R. K. Narang, Inside the black box. John Wiley & Sons, 2013.

 [4] M. Bendtsen and J. M. Peña, “Gated Bayesian networks,” in Proceedings of the Twelfth Scandinavian Conference on Artificial Intelligence, pp. 35–44, 2013.

 [5] M. Bendtsen and J. M. Peña, “Learning gated Bayesian networks for algorithmic trading,” in Proceedings of the Seventh European Workshop on Probabilistic Graphical Models, pp. 49–64, 2014.

 [6] M. Bendtsen and J. M. Peña, “Gated Bayesian networks for algorithmic trading,” International Journal of Approximate Reasoning, 2015, submitted.

 [7] E. Brochu, V. M. Cora, and N. de Freitas, “A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning,” Tech. Rep. UBC TR-2009-023 and arXiv:1012.2599, 2009.

 [8] J. Pearl, Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann Publishers, 1988.

 [9] F. V. Jensen and T. D. Nielsen, Bayesian networks and decision graphs. Springer, 2007.

[10] K. B. Korb and A. E. Nicholson, Bayesian artificial intelligence. Taylor and Francis Group, 2011.

[11] R. Pardo, The evaluation and optimization of trading strategies. John Wiley & Sons, 2008.

[12] E. P. Chan, Quantitative trading. John Wiley & Sons, 2009.

[13] C. E. Rasmussen and C. K. I. Williams, Gaussian processes for machine learning. MIT Press, 2006.

[14] K. Swersky, J. Snoek, and R. P. Adams, “Multi-task Bayesian optimization,” in Advances in Neural Information Processing Systems 26, pp. 2004–2012, 2013.

[15] W. F. Sharpe, “Likely gains from market timing,” Financial Analysts Journal, vol. 31, no. 2, pp. 60–69, 1975.

[16] J. J. Murphy, Technical analysis of the financial markets. New York Institute of Finance, 1999.

[17] M. J. Pring, Technical analysis explained. McGraw-Hill, 2002.


