=Paper= {{Paper |id=Vol-1565/bmaw2015_paper2 |storemode=property |title=Bayesian Optimisation of Gated Bayesian Networks for Algorithmic Trading |pdfUrl=https://ceur-ws.org/Vol-1565/bmaw2015_paper2.pdf |volume=Vol-1565 |authors=Marcus Bendtsen |dblpUrl=https://dblp.org/rec/conf/uai/Bendtsen15 }} ==Bayesian Optimisation of Gated Bayesian Networks for Algorithmic Trading== https://ceur-ws.org/Vol-1565/bmaw2015_paper2.pdf

Bayesian Optimisation of Gated Bayesian Networks for Algorithmic Trading

Marcus Bendtsen
marcus.bendtsen@liu.se
Department of Computer and Information Science
Linköping University, Sweden

Abstract

Gated Bayesian networks (GBNs) are an extension of Bayesian networks that aim to model systems that have distinct phases. In this paper, we aim to use GBNs to output buy and sell decisions for use in algorithmic trading systems. These systems may have several parameters that require tuning, and assessing the performance of these systems as a function of their parameters cannot be expressed in closed form, and thus requires simulation. Bayesian optimisation has grown in popularity as a means of global optimisation of parameters where the objective function may be costly or a black box. We show how algorithmic trading using GBNs, supported by Bayesian optimisation, can lower risk towards invested capital, while at the same time generating similar or better rewards, compared to the benchmark investment strategy buy-and-hold.

1 INTRODUCTION

Algorithmic trading can be viewed as a process of actively deciding when to own assets and when to not own assets, so as to get better risk and reward on invested capital, compared to holding the assets over a long period of time. On the other end of the spectrum is the buy-and-hold strategy, where one owns assets continuously over a period of time without making any decisions of selling or buying. An algorithmic trading system consists of several components, some of which may be automated by a computer, and others that may be manually executed [1, 2, 3]. At the heart of an algorithmic trading system are the alpha models. They are responsible for outputting decisions for buying and selling assets based on the data they are given. These decisions are commonly referred to as signals. The data supplied to the alpha models varies greatly, e.g. potential prospects, sentiment analysis, previous trades, or technical analysis, which will be the focus of the included application. If the signals are followed, then they give rise to certain risk and reward on the initial investment, which will be described further in Section 3.2. Further down the line in algorithmic trading systems are components that combine signals from several alpha models, and other so-called risk models, to construct a portfolio of assets. We will not address these later components in this paper; our focus will be on the alpha models.

In Figure 1, the price of an asset is plotted along with buy signals (upward arrows) and sell signals (downward arrows). We view the time spent between these signals as two different phases: before a buy signal, our intention is to have a model that identifies good opportunities to buy the asset. Once such an opportunity has been identified and a buy signal has been generated, we move into a different phase. In this second phase, we intend to model the identification of good opportunities to sell the asset. Once a sell signal is generated, we move back to the original phase, once again using a model to generate buy signals. This particular situation was the main motivation for the introduction of gated Bayesian networks (GBNs) [4, 5, 6], which we will describe in Section 2.

[Figure 1: Buy and Sell Signals. Panels: Price; Equity curve. Time axis: Dec 31, 2007 to Dec 29, 2008.]

Alpha models normally take a set of parameters, allowing them to be tuned to the input data. Naturally, two different sets of parameters may yield two different sets of signals. Therefore, it is imperative to assess how good a set of signals is, so that different parameter sets may be compared.
This is usually done by backtesting, a type of simulation that calculates certain scores of the signals, e.g. how much the return on the initial investment would have been. Backtesting cannot be written as a function of the alpha model's parameters in closed form, thus it is not possible to analytically find the optimal parameters. Instead, backtesting must be considered a black box function that should be optimised.

Bayesian optimisation has grown in popularity in the machine learning community as an intuitive way of maximising either black box objective functions and/or very costly objective functions (costly in the sense of both time and resources) [7]. Utilising a prior over objective functions, and then sparingly evaluating the objective function at certain points (guided by the posterior), Bayesian optimisation attempts to find the global maximum of the objective function within a predefined grid.

Our intention in this paper is to combine the use of GBNs as alpha models and optimising the parameters of these GBNs using Bayesian optimisation.

The rest of the paper is organised as follows. We begin by giving a brief introduction to GBNs in Section 2, this will illuminate how GBNs can be used as alpha models. We continue by explaining by which metrics alpha models can be evaluated in Section 3, and give slightly more details regarding backtesting. In Section 4 we will describe the components of Bayesian optimisation, including the use of Gaussian processes as priors, as well as kernel and acquisition functions. In Section 5 we will account for the procedure we will use to evaluate the expected performance of using Bayesian optimisation over the parameters of GBNs. Once the procedure has been described, we will in Section 6 account for a real-world application where we show how GBNs can be used as alpha models with support from Bayesian optimisation. Finally, in Section 7 we will offer a few words regarding our conclusions and future work.

2 GATED BAYESIAN NETWORKS

Bayesian networks (BNs) can be interpreted as models of causality at the macroscopic level, where unmodelled causes add uncertainty. Cause and effect are modelled using random variables that are placed in a directed acyclic graph (DAG). The causal model implies some probabilistic independencies among the variables, that can easily be read off the DAG. Therefore, a BN does not only represent a causal model but also an independence model. The qualitative model can be quantified by specifying certain marginal and conditional probability distributions so as to specify a joint posterior distribution, which can later be used to answer queries regarding posterior probabilities, interventions, counterfactuals, etc. The independencies represented in the DAG make it possible to compute these queries efficiently. Furthermore, they reduce the number of parameters needed to represent the joint probability distribution, thus making it easier to elicit the probability parameters needed from experts or from data. See [8, 9, 10] for more details.

Despite their popularity and advantages, there are situations where a BN is not enough. For instance, when trying to model the process of buying and selling assets, we wanted to model the constant flow between identifying buying opportunities and then, once such have been found, identifying selling opportunities, as is required by an alpha model. These two phases can be very different and the variables included in the BNs modelling them are not necessarily the same. The need to switch between two different BNs was the foundation for the introduction of GBNs.

Switching between phases is done using so called gates. These gates are encoded with predefined logical expressions regarding posterior probabilities of random variables in the BNs. This allows for the activation and deactivation of BNs based on posterior probabilities. A GBN that uses two different BNs (BN1 and BN2) is shown in Figure 2. Here, we will give a brief explanation of GBNs in general, and the GBN in Figure 2 in particular (for the full definition of GBNs see [4, 6]):

[Figure 2: Two Phased GBN. Labels: BN1, BN2; nodes A, B, S, W, E, F; gates G1, G2.]

• A GBN consists of BNs and gates. BNs can be active or inactive. The label of BN1 is underlined, indicating that it is active at the initial state of the GBN. The BNs supply posterior probabilities to the gates via so called trigger nodes. The node S is a trigger node for gate G1 and W is a trigger node for G2. A gate can utilise more than one trigger node.

• Each gate is encoded with a predefined logical expression regarding its trigger nodes' posterior probability of a certain state, e.g. G1 may be encoded with $p(S = s_1 \mid e) > 0.7$. This expression is known as the trigger logic for gate G1.

• When evidence is supplied to the GBN, an evidence handling algorithm updates posterior probabilities and checks if any of the logical statements in the gates are satisfied. If the trigger logic is satisfied for a gate it is said to trigger. A BN that is inactive never supplies any posterior probabilities, hence G2 will never trigger as long as BN2 is inactive.

• When a gate triggers, it deactivates all of its parent
BNs and activates its child BNs (as defined by the direction of the edges between gates and BNs). In our example, if G1 was to trigger it would deactivate BN1 and activate BN2, which implies that the model has switched phase.

If the GBN was used as an alpha model, then when the GBN identifies a buying opportunity, and moves to the sell phase, a buy signal is generated. Looking again at Figure 1, each buy and sell signal was generated as the GBN switched back and forth between its phases.

For the purpose of discussing GBN parameter optimisation in general, we will say that a GBN is parameterised by three disjoint parameter sets Θ, Λ and Γ. The parameters in Θ are the parameters of the marginal and conditional probability distributions of the variables in the contained BNs. All other free parameters are contained in Λ, while any fixed parameters are contained in Γ. For instance, in a setting where the only unknowns are the thresholds in the trigger logic of the gates, we say that the thresholds are in Λ and all other parameters are fixed in Γ. This notation allows a bit of convenience when discussing the evaluation of the optimisation procedure in Section 5 and the application in Section 6.

3 EVALUATION OF ALPHA MODELS

As we alluded to in Section 1, and as we shall see in Section 6, it is necessary to assess how good a set of signals is, thereby assessing the performance of an alpha model. Regression models can be evaluated by how well they minimise some error function or by their log predictive scores. For classification, the accuracy and precision of a model may be of greatest interest. Alpha models may rely on regression and classification, but cannot be evaluated as either. An alpha model's performance needs to be based on its generated signals over a period of time, and the performance must be measured by the risk and reward of the model. This is known as backtesting.

3.1 BACKTESTING

The process of evaluating an alpha model on historic data is known as backtesting, and its goal is to produce metrics that describe the behaviour of a specific alpha model. These metrics can then be used for comparison between alpha models [11, 12]. A time range, price data for assets traded and a set of signals are used as input. The backtester steps through the time range and executes signals that are associated with the current time (using the supplied price data) and computes an equity curve (which will be explained in Section 3.2). From the equity curve it is possible to compute metrics of risk and reward. To simulate potential transaction costs, often referred to as commission, every trade executed is usually charged a small percentage of the total value (0.06% is a common commission charge used in the included application).

Alpha models are backtested separately from the other components of the algorithmic trading system, as the backtesting results are input to the other components. Therefore, we execute every signal from an alpha model during backtesting, whereas in a full algorithmic trading system we would have a portfolio construction model that would combine several alpha models and decide how to build a portfolio from their signals.

3.2 ALPHA MODEL METRICS

What constitutes risk and reward is not necessarily the same for every investor, and investors may have their own personal preferences. However, there are a few metrics that are common and often taken into consideration [12]. Here we will introduce the metrics that we will use to evaluate the performance of our alpha models.

Although not a metric on its own, the equity curve needs to be defined in order to define the following metrics. The equity curve represents the total value of a trading account at a given point in time. If a daily timescale is used, then it is created by plotting the value of the trading account day by day. If no assets are bought, then the equity curve will be flat at the same level as the initial investment. If assets are bought that increase in value, then the equity curve will rise. If the assets are sold at this higher value then the equity curve will again go flat at this new level. The equity curve summarises the value of the trading account including cash holdings and the value of all assets. We will use $E_t$ to reference the value of the equity curve at point t.

Metric 1 (Return) The return of an investment is defined as the percentage difference between two points on the equity curve. If the timescale of the equity curve is daily, then $r_t = (E_t - E_{t-1}) / |E_{t-1}|$ would be the daily return between day t and t − 1. We will use $\bar{r}$ and $\sigma_r$ to denote the mean and standard deviation of a set of returns.

Metric 2 (Sharpe Ratio) One of the most well known metrics used is the so called Sharpe ratio. Named after its inventor, Nobel laureate William F. Sharpe, this ratio is defined as $(\bar{r} - \text{risk free rate}) / \sigma_r$. The risk free rate is usually set to be a "safe" investment such as government bonds or the current interest rate, but is also sometimes removed from the equation [12]. The intuition behind the Sharpe ratio is that one would prefer a model that gives consistent returns (returns around the mean), rather than one that fluctuates. This is important since investors tend to trade on margin (borrowing money to take larger positions), and it is then more important to get consistent returns than returns that sometimes are large and sometimes small. This is why the Sharpe ratio is used as a reward metric rather than the return.
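
As a minimal sketch of Metrics 1 and 2, the following computes per-period returns and the Sharpe ratio from an equity curve. It assumes the equity curve is a plain list of account values, that the sample standard deviation (n − 1 denominator) is used for $\sigma_r$, and that the risk free rate is given per period; the function names are illustrative and are not taken from the paper.

```python
import statistics


def period_returns(equity):
    """Return r_t = (E_t - E_{t-1}) / |E_{t-1}| for each consecutive pair."""
    return [(curr - prev) / abs(prev) for prev, curr in zip(equity, equity[1:])]


def sharpe_ratio(returns, risk_free_rate=0.0):
    """Return (mean(r) - risk free rate) / std(r).

    Assumption: the sample standard deviation is used; the text does not
    say which estimator is intended.
    """
    mean_r = statistics.mean(returns)
    std_r = statistics.stdev(returns)  # needs at least two returns
    return (mean_r - risk_free_rate) / std_r


# Hypothetical four-day equity curve.
equity_curve = [100.0, 110.0, 99.0, 108.9]
returns = period_returns(equity_curve)
print(returns)
print(sharpe_ratio(returns))
```

Setting `risk_free_rate` to zero corresponds to the case mentioned above where the risk free rate is removed from the equation.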

Furthermore, under certain assumptions it can be shown that there exists an optimal allocation of equity between alpha models (in the portfolio construction model), such that the long-term growth rate of equity is maximised [12]. This growth rate turns out to be $g = r + S^2/2$, where r is the risk free rate and S is the Sharpe ratio. Thus, a high Sharpe ratio is not only an indication of good risk adjusted return, but holding the risk free rate constant, the optimal growth rate is an increasing function of the Sharpe ratio.

Using the Sharpe ratio as a metric will ensure that the alpha models are evaluated on their risk adjusted return; however, there are other important alpha model behaviours that need to be measured. A family of these, known as drawdown risks, is presented here (see Figure 3 for examples of an equity curve and these metrics).

[Figure 3: Equity Curve with Drawdown Risks. Axes: Time, Equity in $. Annotations: Initial investment, MDD, LVFI, TIMR, 1 - TIMR.]

Metric 3 (Maximum Drawdown (MDD)) The percentage difference between the highest peak and the lowest trough of the equity curve during backtesting. The peak must come before the trough in time. The MDD is important from both a technical and psychological regard. It can be seen as a measure of the maximum risk that the investment will live through. Investors that use their existing investments that have gained in value as safety for new investments may be put in a situation where they are forced to sell everything. Other risk management models may automatically sell investments that are losing value sharply. For the individual who is not actively trading but rather placing money in a fund, the MDD is psychologically frustrating to the point where the individual may withdraw their investment at a loss in fear of losing more money.

Metric 4 (Lowest Value From Investment (LVFI)) The percentage difference between the initial investment and the lowest value of the equity curve. This is one of the most important metrics, and has a significant impact on technical and psychological factors. For investors trading on margin, a high LVFI will cause the lender to ask the investor for more safety capital (known as a margin call). This can be potentially devastating, as the investor may not have the capital required, and is then forced to sell the investment. The investor will then never enjoy the return the investment could have produced. Individuals who are not investing actively, but instead are choosing between funds that invest in their place, should be aware of the LVFI as it is the worst case scenario if they need to retract their investment prematurely.

Metric 5 (Time In Market Ratio (TIMR)) The percentage of time of the investment period where the alpha model owned assets. This metric may seem odd to place within the same family as the other drawdown risks, however it fits naturally in this space. We can assume that on the days the alpha model does not own any assets the drawdown risk is zero. If we are not invested, then there is no risk of loss. In fact, we can further assume that our equity is growing according to the risk free rate, as it is not bound in assets.

4 BAYESIAN OPTIMISATION

Our intention is to use GBNs as alpha models and to optimise the free parameters Λ with respect to the metrics given in Section 3.2. In order to do so we must backtest the signals that a GBN produces, and thus we cannot analytically solve the optimisation problem, as backtesting as a function of Λ has no general closed form expression. At the same time, backtesting is relatively costly, as one must create the model, prepare data, estimate parameters, generate signals and walk through the time range to simulate the trading. For this reason, it is not feasible to exhaustively sweep a large grid of parameters. However, Bayesian optimisation allows us to prioritise the points on the grid to evaluate, thus reducing the number of evaluations, while still finding the global maximum of a potentially costly and black box objective function.

4.1 GAUSSIAN PROCESS AS SURROGATE FUNCTION

Essentially, we would like to find the parameters Λ* ∈ Λ that maximise an unknown function f. We place a prior, p(f), over the possible functions f, and compute the posterior over f using observations $\{\Lambda_{1:i}, f_{1:i}\}$, where $f_j = f(\Lambda_j)$. Hence, we compute $p(f \mid \{\Lambda_{1:i}, f_{1:i}\}) \propto p(\{\Lambda_{1:i}, f_{1:i}\} \mid f)\, p(f)$. We can then use this posterior distribution over objective functions as an estimate of our objective function. This is sometimes known as using the posterior as a surrogate function to the true objective function.

In Bayesian optimisation it is common to use a Gaussian process (GP) as the surrogate function [7]. It is defined as a multivariate normal distribution of infinite dimension, where each dimension is a point along some grid. A finite set of these dimensions will form a Gaussian distribution, thus allowing a GP to be defined completely by a mean function μ and a kernel function κ. The GP over the grid Λ is then defined as $\mathcal{N}(\mu(\Lambda), \kappa(\Lambda, \Lambda'))$ for all Λ, Λ′ ∈ Λ. Commonly, the prior μ(Λ) is assumed to be zero for all Λ ∈ Λ, although this is by no means necessary if prior information is available to suggest otherwise. The more
involved task is to define the kernel function κ. With κ we can express our prior belief about the objective function that we wish to maximise. Although we do not know the form of the objective function, we often assume that points close to each other on the grid give similar results, thus we assume the objective function to possess at least some smoothness. These assumptions can be articulated in κ, for instance by the rational quadratic kernel in Equation 1, where c is a tuning constant for how smooth we believe the objective function to be. For points close to each other, Equation 1 will result in values close to 1, while points further away will be given values closer to 0. The GP prior will obtain the same smoothness properties, as the covariance matrix is completely defined by κ. To visualise the smoothness achieved by tuning c, Figure 4 shows the decreasing covariance as distance grows with three different settings of c (1, 5 and 10). As can be seen, as c increases the decrease is slower, thus more smoothness is assumed.

[Figure 4: Covariance Decrease by Distance. Axes: Distance (0 to 10), Covariance. Curves: c = 1, c = 5, c = 10.]

$$
\kappa(\Lambda, \Lambda') = 1 - \frac{\lVert \Lambda - \Lambda' \rVert^2}{\lVert \Lambda - \Lambda' \rVert^2 + c} \tag{1}
$$

Assuming that we have observed $\{\Lambda_{1:i}, f_{1:i}\}$, and that we wish to calculate the posterior predictive distribution for an unobserved point $\Lambda_{i+1}$, a closed form expression exists for this calculation as described in Equation 2. Thus, it is possible to efficiently calculate the posterior distribution of an unobserved point where both the prior smoothness and observed data have been considered. For more on GPs, please see [13].

$$
\begin{aligned}
\begin{bmatrix} f_{1:i} \\ f_{i+1} \end{bmatrix} &\sim \mathcal{N}\left(0, \begin{bmatrix} \mathbf{K} & \mathbf{K}_*^{T} \\ \mathbf{K}_* & K_{**} \end{bmatrix}\right) \\
\mathbf{K} &= \begin{bmatrix} \kappa(\Lambda_1, \Lambda_1) & \cdots & \kappa(\Lambda_1, \Lambda_i) \\ \vdots & \ddots & \vdots \\ \kappa(\Lambda_i, \Lambda_1) & \cdots & \kappa(\Lambda_i, \Lambda_i) \end{bmatrix} \\
\mathbf{K}_* &= \begin{bmatrix} \kappa(\Lambda_{i+1}, \Lambda_1) & \cdots & \kappa(\Lambda_{i+1}, \Lambda_i) \end{bmatrix} \\
K_{**} &= \kappa(\Lambda_{i+1}, \Lambda_{i+1}) \\
p(f_{i+1} \mid \{\Lambda_{1:i}, f_{1:i}\}) &= \mathcal{N}(\mu_i(\Lambda_{i+1}), \sigma_i^2(\Lambda_{i+1})) \\
\mu_i(\Lambda_{i+1}) &= \mathbf{K}_* \mathbf{K}^{-1} f_{1:i} \\
\sigma_i^2(\Lambda_{i+1}) &= K_{**} - \mathbf{K}_* \mathbf{K}^{-1} \mathbf{K}_*^{T}
\end{aligned} \tag{2}
$$

4.2 ACQUISITION FUNCTIONS AND BAYESIAN OPTIMISATION

Using a GP as a surrogate to the objective function allows us to encode prior beliefs about the unknown objective function, and sampling the objective function allows us to update the posterior of the surrogate. What is left to do is to decide where to sample the objective function. In Bayesian optimisation we make use of a so called acquisition function. Several acquisition functions have been suggested, however the goal is to trade off exploring the grid where the posterior uncertainty is high, while exploiting points that have a high posterior mean. We will use the upper confidence bound criterion, which is expressed as $UCB(\Lambda) = \mu(\Lambda) + \eta\sigma(\Lambda)$, where μ(Λ) and σ(Λ) represent the mean and standard deviation at the point Λ of the GP, and η is a tuning parameter to allow for more exploration (as η is increased) or more exploitation (as η is decreased).

Succinctly, define a GP over a grid with some kernel function, then randomly sample a point and evaluate the objective function at this point. Calculate the posterior of the GP given this new observation and find Λ′ that maximises the acquisition function. Then Λ′ is the next point where to evaluate the objective function. Iterate these steps for a predefined number of iterations. Once all iterations have passed, the Λ with the highest posterior mean is taken to be the set of parameters that maximises the objective function.

5 EVALUATION PROCEDURE

In Section 6 we will account for a real-world application of GBNs as alpha models supported by Bayesian optimisation. However, in this section we will introduce the optimisation procedure used, as well as the method used to evaluate the performance of the optimisation, which is essentially the same method used in [5].

A data set D of consecutive evidence sets, e.g. observations over all or some of the random variables in the GBN, is divided into n equally sized blocks $(D_1, ..., D_n)$, such that they are mutually exclusive and exhaustive. Each block contains consecutive evidence sets and all evidence sets in block $D_i$ come before all evidence sets in $D_j$ for all i < j. Depending on the amount of available data, k is chosen as the number of blocks used for optimisation. Starting from index 1, blocks 1, ..., k are used for optimisation and k + 1 for testing, thus ensuring that the evidence sets in the testing data occur after the optimisation data. The procedure is then repeated starting from index 2 (i.e. blocks 2, ..., k+1
are used for optimisation and k + 2 for testing). By doing so we create t repeated simulations, moving the testing data one block forward each time. An illustration of this procedure when n = 12, k = 5 and t = 7 is shown in Figure 5.

[Figure 5: Data Split For Optimisation and Testing. Rows: Simulation 1 to Simulation 7. Legend: Data for optimisation; Data withheld for testing.]

During Bayesian optimisation, when the objective function is evaluated for some acquired Λ, a cross-validation estimate is calculated for the k blocks used. Here, k − 1 blocks are used to estimate the parameters Θ of the contained BNs and the held out block is used as validation data to calculate a score ρ. The value of the objective function, given parameters Λ, is thus the average of all ρ when each block in the optimisation data has been held out.

In order to formalise the procedure used to evaluate the optimisation, recall from Section 2 that Λ is used to represent the free parameters of a GBN and Γ is used to represent all fixed parameters. Let J be a score function such that $\mathcal{J}(\Lambda, D_j, \{D\}_l^m \mid \Gamma)$ is the score for a GBN under some parameterisation Λ and Γ when block j has been used for either testing or validation and the blocks $D_l, ..., D_m$ have been used to estimate Θ of the BNs in the GBN (under the parameters Λ and Γ).

1. For each simulation t, where (as discussed previously) $D_{t+k}$ is the testing data and $D_t, ..., D_{t+k-1}$ is the optimisation data, use Bayesian optimisation to find the parameters $\Lambda_t$ that satisfy Equation 3.

$$
\Lambda_t = \arg\max_{\Lambda \in \boldsymbol{\Lambda}} \frac{1}{k} \sum_{j=t}^{t+k-1} \mathcal{J}(\Lambda, D_j, \{D\}_t^{t+k-1} \setminus D_j \mid \Gamma) \tag{3}
$$

2. For each $\Lambda_t$ calculate the score $\rho_{\mathcal{J}}^{t}$ on the testing set with respect to the scoring function J according to Equation 4.

$$
\rho_{\mathcal{J}}^{t} = \mathcal{J}(\Lambda_t, D_{t+k}, \{D\}_t^{t+k-1} \mid \Gamma) \tag{4}
$$

3. The expected performance $\bar{\rho}_{\mathcal{J}}$ of the optimisation, with respect to the score function J, is then given by the average of the scores $\rho_{\mathcal{J}}^{t}$, i.e. $\bar{\rho}_{\mathcal{J}} = \frac{1}{t} \sum_t \rho_{\mathcal{J}}^{t}$.

Two things are worth noting about this procedure. First, during cross-validation inside the objective function we disregard the natural order of the data, thus allowing a validation block to come before a block used for estimating the parameters Θ. This could potentially induce a bias in the cross-validation estimate as the data used for estimating the parameters would not have been known at the time the data for the validation block was generated. However, as we do not use this scheme when we evaluate the performance of the optimisation, the expected performance of the optimisation is not biased in this way. We simply use this scheme to make the best use of the data during cross-validation. Second, one scoring function J has been used both during optimisation and for evaluating the expected performance of the optimisation. The scoring function J could internally use many different metrics to come up with one score to maximise. However, it is natural in the coming setting to expose the actual values of several metrics, thus several scoring functions J are used to get a vector of mean scores $[\bar{\rho}_{\mathcal{J}_1}, ..., \bar{\rho}_{\mathcal{J}_m}]$.

Another approach to combine Bayesian optimisation with cross-validation is to reduce the number of fold evaluations necessary [14], as certain folds may be closely correlated, however our approach is to reduce the number of parameters that we need to test with cross-validation.

6 APPLICATION

Having established the optimisation procedure, and the method we intend to use to evaluate the performance of the optimisation, we turn our attention to a real-world application. We aim to use GBNs as alpha models to generate buy and sell signals of stock indices in such a way that drawdown risks are mitigated, compared to the buy-and-hold strategy, while at the same time maintaining similar or better rewards.

Stock indices are weighted averages of their respective stock components. For instance, the Dow Jones Industrial Average (DJIA) is a weighted average of 30 large companies based in the United States. Indices may have different schemes for how the different components are weighted, however they all aim to give a collective representation of their components.

An index fund owns shares of the components of a specific index, proportional to the weights, such that the fund's return is mirrored by the index. These funds are very popular, as they are easy for the investor to comprehend, whereas trading the individual components of an index requires a lot of effort.

A buy-and-hold strategy on stock indices via index funds may be convenient, however it implies that the equity is put through the full force of drawdown risks described in Section 3.2. The buy-and-hold strategy holds assets over the entire backtesting period and so will be subject to the full force of these metrics. For instance, as an asset will be held
throughout the period, the lowest point of the asset's value will coincide with LVFI. In dwindling stock markets, the index funds will lose value, and equity could be salvaged and possibly be placed in risk-free assets during these periods. Furthermore, utilising certain financial products, it is also possible to increase equity during these times of distress by purchasing short positions of the index. Short positions can be thought of as a loan, where the value of the loan increases if the index decreases in value, and it is possible to sell the loan at its higher value (to make the distinction, regular positions are called long when short positions are considered).

At first the buy-and-hold strategy may seem naïve, however it has been shown that deciding when to own and not own assets requires consistent high accuracy of predictions in order to gain higher returns than the buy-and-hold strategy [15]. The buy-and-hold strategy has become a standard benchmark, not only because of the required accuracy, but also because it requires very little effort to execute (no complex computations and/or experts needed).

[Figure 6: GBN-1 and GBN-2. Labels: Buy, Sell, Trend, Long, Short; gates G1, G2, G3, G4.]

6.1 METHODOLOGY

We used two different GBN structures to create alpha models. The first GBN structure (henceforth known as GBN-1) modelled buying and selling long positions only, while the second GBN structure (GBN-2) modelled buying and selling long and short positions. The structures are depicted in Figure 6 (GBN-1 on the left and GBN-2 on the right). The structure for GBN-1 works as described in Section 2. The structure for GBN-2 starts in the Trend phase, from where either G1 or G2 can trigger. If G1 triggers then a long open signal is generated and the Long phase is activated (deactivating the Trend phase). If gate G3 then triggers, a long close signal is generated, and the Trend phase is activated again (deactivating the Long phase). However, if before G1 triggers G2 triggers instead, then a short open position is generated, and the Short phase is activated (deactivating Trend). In similar fashion, when G4 triggers a

6.1.1 Variables

The variables used in the GBNs were discretisations of so called technical analysis indicators. One of the major tenets in technical analysis is that the movement of the price of an asset repeats itself in recognisable patterns. Indicators are computations of price and volume that support the identification and confirmation of patterns used for forecasting. Many classical indicators exist, such as the moving average (MA), which is the average price over time, and the relative strength index (RSI), which compares the size of recent gains to the size of recent losses. For the full definition of these indicators, please see [16, 17].

For each phase in the GBNs (Buy, Sell, Trend, Long and Short), we placed a naïve Bayesian classifier over the same technical analysis indicators. However, by allowing the parameterisation of one of the technical analysis indicators to vary between the phases, we essentially created different variables in the different phases. The tuning of the technical analysis parameters allowed us to better capture the dynamics of the data, as they may differ between assets as well as between the different phases of trading. Figure 7 depicts the classifier structure and variables used. The variables are explained below, along with an example in Figure 8.

• S represents the first-order finite backward difference of 5 periods of the MA of ψ periods, shifted 5 periods into the future. To clarify, the first plot in Figure 8 shows the price of an asset along with the MA. If the current time is t, then S represents the slope of the line between what the MA will be at t + 5 and what it is at t.

• A represents the same slope as S but at its current value (i.e. between t and t − 5).

• B represents the difference between the current value of the MA of ψ periods and the current raw price. This can be seen in Figure 8 as the difference between the two time series in the first plot.

• C represents the current RSI value (at t in the second plot of the figure) using 14 periods.

• D represents what C was 5 time steps in the past (at t − 5 in the second plot of the figure).

The choice of 14 periods for RSI is based on the prevailing standard [16, 17], and the choice of 5 periods as the prediction horizon is based on the number of trading days in a week.

Variables A, B, C and D were discretised into six bins,
short close signal is generated, activating T rend and deac- each using equal width binning, and S was discretised into
tivating Short. two bins separated by zero. Thus, the states of S represents

a predicted positive or negative future value of the modelled asset price (smoothed by the moving average of ψ periods). As S represents a future value, evidence for S was only available during estimation of the parameters Θ, not during the generation of signals.

Figure 7: Bayesian Classifier in GBN Phases [nodes: S = ∇^1_5 MA(ψ), Offset(+5); A = ∇^1_5 MA(ψ); B = PDIFF(MA(ψ)); C = RSI(14); D = RSI(14), Offset(-5)]

Figure 8: Visualisation of Variables [top panel: Price and MA, 1 to 24 March 2011, with A marked between t − 5 and t and S between t and t + 5; bottom panel: RSI over the same dates, with D marked at t − 5 and C at t]

The gates all defined trigger logic over the posterior distribution of S with some threshold τ. For instance, in GBN-1, the trigger logic for G1 was TL(G1) : p(S = positive | e) > τ_G1, i.e. if the posterior probability of a positive climate is greater than some threshold, then the model should give a buy signal and move to the next phase (the sell phase). Naturally, the trigger logic for G2 in the same GBN was TL(G2) : p(S = negative | e) > τ_G2, thus giving a sell signal if the posterior probability of a negative future value exceeds some threshold.

6.1.2 Bayesian Optimisation Settings

The previous section implies the following for the two GBNs:

• For GBN-1 the free parameters Λ to be optimised are: Λ = {τ_G1, τ_G2, ψ_Buy, ψ_Sell}.

• For GBN-2 the free parameters Λ to be optimised are: Λ = {τ_G1, τ_G2, τ_G3, τ_G4, ψ_Trend, ψ_Long, ψ_Short}.

• In both cases all τ were confined to [60, 90] and all ψ to [10, 40].

We used the upper confidence bound acquisition function (as described in Section 4.2) with η = 5, which allowed for abundant exploration, as our objective function was not extremely expensive to evaluate. We used the rational quadratic kernel as described in Equation 1 with c = 1. For GBN-1 we ran the Bayesian optimisation for 1,600 iterations, and for GBN-2 we ran 12,800 iterations.

6.1.3 Data Sets

We used four indices in this study: DJIA and NASDAQ, which are both based on companies in the United States; FTSE100, which is based on companies in the United Kingdom; and DAX, which is based on companies located in Germany. We ran our experiments on daily adjusted closing prices for these indices ranging from 2001-01-01 to 2012-12-28 (data downloaded from Yahoo! Finance™). This gave a total of 12 years of price data for each index, where each year was allocated to a block, thus n = 12. For the cross-validation step we used k = 5 giving t = 7 simulations from which to calculate [ρ̄_J1, ..., ρ̄_Jm] (the data split is depicted in Figure 5).

6.1.4 Scoring Functions

The signals generated were backtested in order to calculate relevant metrics. During optimisation (i.e. step 1 in Section 5) the objective function used the Sharpe ratio. The choice was made as it combines both risk and reward into one score, for which a cross-validation estimate could be returned by the objective function. For evaluating the performance of the optimisation (step 2 in Section 5), we used the return and the drawdown risks described in Section 3.2 to create a score vector [ρ̄_J1, ..., ρ̄_Jm]. The same metrics were calculated for the buy-and-hold strategy.

6.2 RESULTS AND DISCUSSION

The score vectors from the evaluation of the optimisation versus the score vector for the buy-and-hold strategy over the seven simulations are shown in Table 1. The annual Sharpe presented in the table is the mean return divided by the standard deviation of returns over the seven simulations, and since each block was allocated one year of data it becomes the annual Sharpe ratio.

We will first turn our attention to GBN-1. We use the Sharpe ratio as our measure of reward, prioritised above the raw return for reasons discussed in Section 3.2. Therefore, we must first ensure that our algorithmic trading system produces similar or better Sharpe ratios than the buy-and-hold strategy. As can be seen, this was the case for DJIA, NASDAQ and DAX, but not for
FTSE100. Secondly, we must take into consideration the TIMR. For GBN-1, we were invested only slightly above half of the time compared to buy-and-hold, reducing risk to equity considerably. Meanwhile, the rest of the time the equity could have gained in value from interest rates (or other risk-free assets); this potential gain was not considered in these results. Risk to equity from MDD was approximately half its counterpart from the buy-and-hold strategy for all indices. The LVFI is a major threat to equity (as discussed in Section 3.2), and one where buy-and-hold severely underperforms. For DAX the LVFI was only a third of the buy-and-hold LVFI, and for the other three indices it was approximately half.

Table 1: Metric Values for GBNs and Buy-and-Hold

DJIA Score       GBN-1    GBN-2    BaH
Annual Sharpe    0.289    0.330    0.157
Return           0.019    0.032    0.028
MDD              0.085    0.116    0.167
LVFI             0.058    0.062    0.119
TIMR             0.628    0.91     1.0

NASDAQ Score     GBN-1    GBN-2    BaH
Annual Sharpe    0.308    0.081    0.254
Return           0.033    0.012    0.067
MDD              0.101    0.164    0.207
LVFI             0.062    0.099    0.146
TIMR             0.554    0.94     1.0

FTSE100 Score    GBN-1    GBN-2    BaH
Annual Sharpe    -0.057   -0.64    0.127
Return           -0.006   -0.074   0.022
MDD              0.098    0.167    0.188
LVFI             0.074    0.121    0.142
TIMR             0.649    0.962    1.0

DAX Score        GBN-1    GBN-2    BaH
Annual Sharpe    0.778    0.589    0.278
Return           0.081    0.062    0.069
MDD              0.107    0.171    0.213
LVFI             0.056    0.059    0.154
TIMR             0.610    0.926    1.0

All in all, the results clearly indicate that GBN-1 was competitive with the buy-and-hold strategy for three of the indices, as Sharpe ratios were improved upon and risk to equity was decreased significantly. Furthermore, these results were achieved while only having equity invested for roughly half of the investment period. It is also clear that we cannot expect the same GBN to be useful for all indices, as the reward was not improved upon for FTSE100. Some of the parameters that were fixed in Γ may have to be tuned in order to accommodate the dynamics of FTSE100, such as the technical analysis indicators used, or the fixed parameters of the ones used currently.

Moving on to GBN-2, we can see that allowing the GBN to open short positions changes the results dramatically. For DJIA, we improved upon the Sharpe ratio, at the cost of the drawdown risks. Both MDD and LVFI were increased marginally, yet were still lower than for buy-and-hold. The TIMR was also increased to such a degree that we were invested almost the entire investment period. There is potential gain in reward from using GBN-2 for DJIA; however, the increased risk must be considered.

For NASDAQ, FTSE100 and DAX there is no improvement over GBN-1. Instead, Sharpe ratios decreased and drawdown risks increased. There could be several reasons for this that are worth investigating; however, our immediate response is that we have either overfitted the model, due to several more parameters being optimised over the same amount of data, or that a bad short position is doubly bad for equity, as we also lose out on the profit from a long position during the same time.

7 CONCLUSIONS AND FUTURE WORK

Our results show that it is feasible to use GBNs as alpha models, and to use Bayesian optimisation to tune them in order to beat the buy-and-hold benchmark, with respect to certain risk and reward metrics. Some of the design decisions made before optimisation may, however, have reduced the performance of the GBNs on some of the data sets used. Short positions are optimally taken during times of distress, and due to increased volatility, markets move very differently compared to stable increasing markets. We decided to lock in the forward and backward horizons to 5 time steps, and the RSI period to 14, which may have made it impossible to capture the more volatile dynamics. Furthermore, stock indices generally increase in value over long periods of time, thus short selling will always go against the long-term trend, which in general is ill-advised.

Nevertheless, we are encouraged to see the included positive results and are at the same time motivated to address the problems we faced with GBN-2. We would not expect the exact same model to perform well on all given data sets, and so further work is needed to improve upon the results on FTSE100 to bring them on par with the other three indices. For instance, there is room to make the objective function even more expensive by not only estimating BN parameters, but also performing variable selection and structure learning during cross-validation.

Acknowledgements

BN inference in our implementation is based on the SMILE reasoning engine, contributed to the community by the Decision Systems Laboratory of the University of Pittsburgh and available at https://dslpitt.org/genie/.
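
As an implementation note, the GBN-2 phase transitions described in Section 6.1 can be read as a small state machine over the phases Trend, Long and Short. The following is only a minimal sketch under stated assumptions, not the implementation used in the experiments: the names Phase, TRANSITIONS, step and run are hypothetical, gates are represented by plain strings, the emitted signals are labelled "long open", "long close", "short open" and "short close", and the rule that G1 takes precedence if G1 and G2 hold at the same time step is an assumption, since the text does not state a tie-break.

```python
from enum import Enum


class Phase(Enum):
    TREND = "Trend"
    LONG = "Long"
    SHORT = "Short"


# (active phase, gate that fired) -> (signal emitted, next active phase),
# following the verbal description of GBN-2 in Section 6.1.
TRANSITIONS = {
    (Phase.TREND, "G1"): ("long open", Phase.LONG),
    (Phase.TREND, "G2"): ("short open", Phase.SHORT),
    (Phase.LONG, "G3"): ("long close", Phase.TREND),
    (Phase.SHORT, "G4"): ("short close", Phase.TREND),
}


def step(phase, fired_gates):
    """Advance one time step and return (signal or None, next phase).

    fired_gates is the set of gates whose trigger logic held at this step.
    Only gates attached to the active phase have any effect. If G1 and G2
    both hold in the Trend phase, G1 wins; this tie-break is an assumption.
    """
    for gate in ("G1", "G2", "G3", "G4"):
        if gate in fired_gates and (phase, gate) in TRANSITIONS:
            return TRANSITIONS[(phase, gate)]
    return None, phase


def run(gate_events, start=Phase.TREND):
    """Replay a sequence of fired-gate sets and collect the emitted signals."""
    phase, signals = start, []
    for fired in gate_events:
        signal, phase = step(phase, fired)
        if signal is not None:
            signals.append(signal)
    return signals, phase
```

Calling run([set(), {"G1"}, {"G4"}, {"G3"}]) under these assumptions opens and then closes a long position; the G4 event is ignored because G4 is not attached to the Long phase.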

References

[1] P. Treleaven, M. Galas, and V. Lalchand, “Algorithmic trading review,” Communications of the ACM, vol. 56, no. 11, pp. 76–85, 2013.

[2] G. Nuti, M. Mirghaemi, P. Treleaven, and C. Yingsaeree, “Algorithmic trading,” Computer, vol. 44, no. 11, pp. 61–69, 2011.

[3] R. K. Narang, Inside the black box. John Wiley & Sons, 2013.

[4] M. Bendtsen and J. M. Peña, “Gated Bayesian networks,” in Proceedings of the Twelfth Scandinavian Conference on Artificial Intelligence, pp. 35–44, 2013.

[5] M. Bendtsen and J. M. Peña, “Learning gated Bayesian networks for algorithmic trading,” in Proceedings of the Seventh European Workshop on Probabilistic Graphical Models, pp. 49–64, 2014.

[6] M. Bendtsen and J. M. Peña, “Gated Bayesian networks for algorithmic trading,” International Journal of Approximate Reasoning, 2015, submitted.

[7] E. Brochu, V. M. Cora, and N. de Freitas, “A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning,” Tech. Rep. UBC TR-2009-023 and arXiv:1012.2599, 2009.

[8] J. Pearl, Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann Publishers, 1988.

[9] F. V. Jensen and T. D. Nielsen, Bayesian networks and decision graphs. Springer, 2007.

[10] K. B. Korb and A. E. Nicholson, Bayesian artificial intelligence. Taylor and Francis Group, 2011.

[11] R. Pardo, The evaluation and optimization of trading strategies. John Wiley & Sons, 2008.

[12] E. P. Chan, Quantitative trading. John Wiley & Sons, 2009.

[13] C. E. Rasmussen and C. K. I. Williams, Gaussian processes for machine learning. MIT Press, 2006.

[14] K. Swersky, J. Snoek, and R. P. Adams, “Multi-task Bayesian optimization,” in Advances in Neural Information Processing Systems 26, pp. 2004–2012, 2013.

[15] W. F. Sharpe, “Likely gains from market timing,” Financial Analysts Journal, vol. 31, no. 2, pp. 60–69, 1975.

[16] J. J. Murphy, Technical analysis of the financial markets. New York Institute of Finance, 1999.

[17] M. J. Pring, Technical analysis explained. McGraw-Hill, 2002.
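
As a supplementary illustration of the classifier variables S, A, B, C and D defined in Section 6.1 and their discretisation, the following sketch computes them from a list of daily prices. It is a hedged example rather than the code used in the paper, and it rests on several assumptions: the MA is taken to be a simple (unweighted) moving average, the RSI is computed from simple averages of gains and losses over the period (other smoothing variants exist, see [16, 17]), B is taken as MA minus price (the sign convention, and whether a relative difference was used, is not stated in the text), the bin range [lo, hi] for equal width binning is supplied by the caller, and a value of exactly zero for S is assigned to the negative state. All function names are hypothetical.

```python
def sma(prices, period):
    """Simple moving average; entry i is None until `period` prices exist."""
    out = [None] * len(prices)
    for i in range(period - 1, len(prices)):
        out[i] = sum(prices[i - period + 1:i + 1]) / period
    return out


def rsi(prices, period=14):
    """RSI from simple averages of gains and losses over `period` changes."""
    out = [None] * len(prices)
    for i in range(period, len(prices)):
        changes = [prices[j] - prices[j - 1] for j in range(i - period + 1, i + 1)]
        gain = sum(c for c in changes if c > 0) / period
        loss = sum(-c for c in changes if c < 0) / period
        out[i] = 100.0 if loss == 0 else 100.0 - 100.0 / (1.0 + gain / loss)
    return out


def variables_at(prices, t, psi, horizon=5, rsi_period=14):
    """Return S, A, B, C, D at time index t; a value is None if undefined."""
    ma = sma(prices, psi)
    r = rsi(prices, rsi_period)

    def at(xs, i):
        return xs[i] if 0 <= i < len(xs) else None

    def sub(x, y):
        return None if x is None or y is None else x - y

    return {
        # S looks `horizon` steps ahead, so it is only known in hindsight
        # (available when estimating parameters, not when generating signals).
        "S": sub(at(ma, t + horizon), at(ma, t)),
        "A": sub(at(ma, t), at(ma, t - horizon)),
        "B": sub(at(ma, t), at(prices, t)),  # sign convention is an assumption
        "C": at(r, t),
        "D": at(r, t - horizon),
    }


def equal_width_bin(x, lo, hi, bins=6):
    """Index (0..bins-1) of the equal-width bin of [lo, hi] containing x."""
    if hi <= lo:
        raise ValueError("hi must exceed lo")
    idx = int((x - lo) / (hi - lo) * bins)
    return min(max(idx, 0), bins - 1)  # clamp values outside [lo, hi]


def s_state(s):
    """Two states separated by zero; zero itself is treated as negative."""
    return "positive" if s > 0 else "negative"
```

For a steadily rising series such as prices = [100 + i for i in range(40)] with psi = 10, variables_at(prices, 20, 10) gives equal slopes for S and A and an RSI of 100 for both C and D, since the series contains no losses.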