Bayesian Optimisation of Gated Bayesian Networks for Algorithmic Trading

Marcus Bendtsen
marcus.bendtsen@liu.se
Department of Computer and Information Science
Linköping University, Sweden

Abstract

Gated Bayesian networks (GBNs) are an extension of Bayesian networks that aim to model systems that have distinct phases. In this paper, we aim to use GBNs to output buy and sell decisions for use in algorithmic trading systems. These systems may have several parameters that require tuning, and assessing the performance of these systems as a function of their parameters cannot be expressed in closed form, and thus requires simulation. Bayesian optimisation has grown in popularity as a means of global optimisation of parameters where the objective function may be costly or a black box. We show how algorithmic trading using GBNs, supported by Bayesian optimisation, can lower risk towards invested capital, while at the same time generating similar or better rewards, compared to the benchmark investment strategy buy-and-hold.

1 INTRODUCTION

Algorithmic trading can be viewed as a process of actively deciding when to own assets and when not to own assets, so as to get better risk and reward on invested capital compared to holding the assets over a long period of time. On the other end of the spectrum is the buy-and-hold strategy, where one owns assets continuously over a period of time without making any decisions to sell or buy. An algorithmic trading system consists of several components, some of which may be automated by a computer, and others that may be manually executed [1, 2, 3]. At the heart of an algorithmic trading system are the alpha models. They are responsible for outputting decisions for buying and selling assets based on the data they are given. These decisions are commonly referred to as signals. The data supplied to the alpha models varies greatly, e.g. potential prospects, sentiment analysis, previous trades, or technical analysis, which will be the focus of the included application. If the signals are followed, then they give rise to certain risk and reward on the initial investment, which will be described further in Section 3.2. Further down the line in algorithmic trading systems are components that combine signals from several alpha models, and other so called risk models, to construct a portfolio of assets. We will not address these later components in this paper; our focus will be on the alpha models.

In Figure 1, the price of an asset is plotted along with buy signals (upward arrows) and sell signals (downward arrows). We view the time spent between these signals as two different phases: before a buy signal, our intention is to have a model that identifies good opportunities to buy the asset; once such an opportunity has been identified and a buy signal has been generated, we move into a different phase. In this second phase, we intend to model the identification of good opportunities to sell the asset. Once a sell signal is generated, we move back to the original phase, once again using a model to generate buy signals. This particular situation was the main motivation for the introduction of gated Bayesian networks (GBNs) [4, 5, 6], which we will describe in Section 2.

Figure 1: Buy and Sell Signals

Alpha models normally take a set of parameters, allowing them to be tuned to the input data. Naturally, two different sets of parameters may yield two different sets of signals. Therefore, it is imperative to assess how good a set of signals is, so that different parameter sets may be compared.
This is usually done by backtesting, a type of simulation that calculates certain scores for the signals, e.g. how much the return on the initial investment would have been. Backtesting cannot be written as a function of the alpha model's parameters in closed form, thus it is not possible to analytically find the optimal parameters. Instead, backtesting must be considered a black box function that should be optimised.

Bayesian optimisation has grown in popularity in the machine learning community as an intuitive way of maximising black box and/or very costly objective functions (costly in the sense of both time and resources) [7]. Utilising a prior over objective functions, and then sparingly evaluating the objective function at certain points (guided by the posterior), Bayesian optimisation attempts to find the global maximum of the objective function within a predefined grid.

Our intention in this paper is to combine the use of GBNs as alpha models with the optimisation of the parameters of these GBNs using Bayesian optimisation.

The rest of the paper is organised as follows. We begin by giving a brief introduction to GBNs in Section 2, which will illuminate how GBNs can be used as alpha models. We continue by explaining by which metrics alpha models can be evaluated in Section 3, and give slightly more details regarding backtesting. In Section 4 we describe the components of Bayesian optimisation, including the use of Gaussian processes as priors, as well as kernel and acquisition functions. In Section 5 we account for the procedure we will use to evaluate the expected performance of using Bayesian optimisation over the parameters of GBNs. Once the procedure has been described, we account in Section 6 for a real-world application where we show how GBNs can be used as alpha models with support from Bayesian optimisation. Finally, in Section 7 we offer a few words regarding our conclusions and future work.

2 GATED BAYESIAN NETWORKS

Bayesian networks (BNs) can be interpreted as models of causality at the macroscopic level, where unmodelled causes add uncertainty. Cause and effect are modelled using random variables that are placed in a directed acyclic graph (DAG). The causal model implies some probabilistic independencies among the variables that can easily be read off the DAG. Therefore, a BN does not only represent a causal model but also an independence model. The qualitative model can be quantified by specifying certain marginal and conditional probability distributions so as to specify a joint posterior distribution, which can later be used to answer queries regarding posterior probabilities, interventions, counterfactuals, etc. The independencies represented in the DAG make it possible to compute these queries efficiently. Furthermore, they reduce the number of parameters needed to represent the joint probability distribution, thus making it easier to elicit the probability parameters needed from experts or from data. See [8, 9, 10] for more details.

Despite their popularity and advantages, there are situations where a BN is not enough. For instance, when trying to model the process of buying and selling assets, we wanted to model the constant flow between identifying buying opportunities and then, once such have been found, identifying selling opportunities, as is required by an alpha model. These two phases can be very different, and the variables included in the BNs modelling them are not necessarily the same. The need to switch between two different BNs was the foundation for the introduction of GBNs.

Switching between phases is done using so called gates. These gates are encoded with predefined logical expressions regarding posterior probabilities of random variables in the BNs. This allows for the activation and deactivation of BNs based on posterior probabilities. A GBN that uses two different BNs (BN1 and BN2) is shown in Figure 2. Here, we will give a brief explanation of GBNs in general, and the GBN in Figure 2 in particular (for the full definition of GBNs see [4, 6]):

Figure 2: Two Phased GBN

• A GBN consists of BNs and gates. BNs can be active or inactive. The label of BN1 is underlined, indicating that it is active at the initial state of the GBN. The BNs supply posterior probabilities to the gates via so called trigger nodes. The node S is a trigger node for gate G1 and W is a trigger node for G2. A gate can utilise more than one trigger node.

• Each gate is encoded with a predefined logical expression regarding its trigger nodes' posterior probability of a certain state, e.g. G1 may be encoded with p(S = s1|e) > 0.7. This expression is known as the trigger logic for gate G1.

• When evidence is supplied to the GBN, an evidence handling algorithm updates posterior probabilities and checks if any of the logical statements in the gates are satisfied. If the trigger logic is satisfied for a gate, it is said to trigger. A BN that is inactive never supplies any posterior probabilities, hence G2 will never trigger as long as BN2 is inactive.

• When a gate triggers, it deactivates all of its parent BNs and activates its child BNs (as defined by the direction of the edges between gates and BNs). In our example, if G1 was to trigger it would deactivate BN1 and activate BN2; this implies that the model has switched phase.

If the GBN was used as an alpha model, then when the GBN identifies a buying opportunity, and moves to the sell phase, a buy signal is generated. Looking again at Figure 1, each buy and sell signal was generated as the GBN switched back and forth between its phases.

For the purpose of discussing GBN parameter optimisation in general, we will say that a GBN is parameterised by three disjoint parameter sets Θ, Λ and Γ. The parameters in Θ are the parameters of the marginal and conditional probability distributions of the variables in the contained BNs. All other free parameters are contained in Λ, while any fixed parameters are contained in Γ. For instance, in a setting where the only unknowns are the thresholds in the trigger logic of the gates, we say that the thresholds are in Λ and all other parameters are fixed in Γ. This notation allows a bit of convenience when discussing the evaluation of the optimisation procedure in Section 5 and the application in Section 6.
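To make the gate mechanics concrete, here is a minimal sketch of a two-phase GBN like the one in Figure 2. The class names and the stand-in "BNs" (plain callables that map evidence directly to a trigger node's posterior probability) are our own simplifications; a real GBN would run full BN inference to obtain quantities such as p(S = s1|e).

```python
# Minimal sketch of a two-phase gated Bayesian network (GBN).
# BN inference is abstracted away: each "BN" is a callable mapping
# evidence to the posterior probability of its trigger node's state.

class Gate:
    def __init__(self, parent, child, trigger_logic):
        self.parent = parent                # BN deactivated on trigger
        self.child = child                  # BN activated on trigger
        self.trigger_logic = trigger_logic  # posterior -> bool

class GBN:
    def __init__(self, bns, gates, initial):
        self.bns = bns                      # name -> posterior function
        self.gates = gates
        self.active = initial               # name of the active BN

    def handle_evidence(self, evidence):
        """Update posteriors and fire any gate whose trigger logic is
        satisfied; inactive BNs never supply posteriors."""
        for gate in self.gates:
            if gate.parent != self.active:
                continue                    # parent BN inactive: cannot trigger
            posterior = self.bns[gate.parent](evidence)
            if gate.trigger_logic(posterior):
                self.active = gate.child    # phase switch
                return gate
        return None

# Two toy "BNs": posterior of S = s1 in the buy phase, W = w1 in the sell phase.
bn1 = lambda e: e["p_s1"]
bn2 = lambda e: e["p_w1"]
g1 = Gate("BN1", "BN2", lambda p: p > 0.7)  # e.g. p(S = s1|e) > 0.7
g2 = Gate("BN2", "BN1", lambda p: p > 0.6)

gbn = GBN({"BN1": bn1, "BN2": bn2}, [g1, g2], initial="BN1")
gbn.handle_evidence({"p_s1": 0.8, "p_w1": 0.9})  # G1 triggers: buy signal
print(gbn.active)
```

Each call to `handle_evidence` mirrors the evidence handling algorithm: only gates whose parent BN is active are checked, so G2 cannot trigger while BN2 is inactive.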
3 EVALUATION OF ALPHA MODELS

As we alluded to in Section 1, and as we shall see in Section 6, it is necessary to assess how good a set of signals is, thereby assessing the performance of an alpha model. Regression models can be evaluated by how well they minimise some error function or by their log predictive scores. For classification, the accuracy and precision of a model may be of greatest interest. Alpha models may rely on regression and classification, but cannot be evaluated as either. An alpha model's performance needs to be based on its generated signals over a period of time, and the performance must be measured by the risk and reward of the model. This is known as backtesting.

3.1 BACKTESTING

The process of evaluating an alpha model on historic data is known as backtesting, and its goal is to produce metrics that describe the behaviour of a specific alpha model. These metrics can then be used for comparison between alpha models [11, 12]. A time range, price data for the assets traded and a set of signals are used as input. The backtester steps through the time range, executes the signals that are associated with the current time (using the supplied price data) and computes an equity curve (which will be explained in Section 3.2). From the equity curve it is possible to compute metrics of risk and reward. To simulate potential transaction costs, often referred to as commission, every trade executed is usually charged a small percentage of the total value (0.06% is a common commission charge, used in the included application).

Alpha models are backtested separately from the other components of the algorithmic trading system, as the backtesting results are input to the other components. Therefore, we execute every signal from an alpha model during backtesting, whereas in a full algorithmic trading system we would have a portfolio construction model that would combine several alpha models and decide how to build a portfolio from their signals.

3.2 ALPHA MODEL METRICS

What constitutes risk and reward is not necessarily the same for every investor, and investors may have their own personal preferences. However, there are a few metrics that are common and often taken into consideration [12]. Here we will introduce the metrics that we will use to evaluate the performance of our alpha models.

Although not a metric on its own, the equity curve needs to be defined in order to define the following metrics. The equity curve represents the total value of a trading account at a given point in time. If a daily timescale is used, then it is created by plotting the value of the trading account day by day. If no assets are bought, then the equity curve will be flat at the same level as the initial investment. If assets are bought that increase in value, then the equity curve will rise. If the assets are sold at this higher value, then the equity curve will again go flat at this new level. The equity curve summarises the value of the trading account including cash holdings and the value of all assets. We will use Et to denote the value of the equity curve at point t.

Metric 1 (Return) The return of an investment is defined as the percentage difference between two points on the equity curve. If the timescale of the equity curve is daily, then rt = (Et − Et−1)/|Et−1| would be the daily return between day t and t − 1. We will use r̄ and σr to denote the mean and standard deviation of a set of returns.

Metric 2 (Sharpe Ratio) One of the most well known metrics used is the so called Sharpe ratio. Named after its inventor, Nobel laureate William F. Sharpe, this ratio is defined as: (r̄ − risk free rate)/σr. The risk free rate is usually set to be a "safe" investment such as government bonds or the current interest rate, but it is also sometimes removed from the equation [12]. The intuition behind the Sharpe ratio is that one would prefer a model that gives consistent returns (returns around the mean), rather than one that fluctuates. This is important since investors tend to trade on margin (borrowing money to take larger positions), and it is then more important to get consistent returns than returns that sometimes are large and sometimes small. This is why the Sharpe ratio is used as a reward metric rather than the return.
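Metrics 1 and 2 can be computed directly from an equity curve. The sketch below follows the definitions above, with the sample standard deviation and a risk free rate defaulting to zero (which, as noted, is sometimes done); the function names are ours.

```python
from statistics import mean, stdev

def daily_returns(equity):
    """Metric 1: r_t = (E_t - E_{t-1}) / |E_{t-1}| along the equity curve."""
    return [(b - a) / abs(a) for a, b in zip(equity, equity[1:])]

def sharpe_ratio(returns, risk_free_rate=0.0):
    """Metric 2: (mean return - risk free rate) / std of returns."""
    return (mean(returns) - risk_free_rate) / stdev(returns)

equity = [100.0, 101.0, 99.0, 102.0, 104.0]  # toy daily equity curve
r = daily_returns(equity)
print(round(sharpe_ratio(r), 3))
```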
Furthermore, under certain assumptions it can be shown that there exists an optimal allocation of equity between alpha models (in the portfolio construction model), such that the long-term growth rate of equity is maximised [12]. This growth rate turns out to be g = r + S²/2, where r is the risk free rate and S is the Sharpe ratio. Thus, a high Sharpe ratio is not only an indication of good risk adjusted return; holding the risk free rate constant, the optimal growth rate is an increasing function of the Sharpe ratio.

Using the Sharpe ratio as a metric will ensure that the alpha models are evaluated on their risk adjusted return. However, there are other important alpha model behaviours that need to be measured. A family of these, known as drawdown risks, is presented here (see Figure 3 for examples of an equity curve and these metrics).

Figure 3: Equity Curve with Drawdown Risks

Metric 3 (Maximum Drawdown (MDD)) The percentage between the highest peak and the lowest trough of the equity curve during backtesting. The peak must come before the trough in time. The MDD is important from both a technical and a psychological regard. It can be seen as a measure of the maximum risk that the investment will live through. Investors that use their existing investments that have gained in value as safety for new investments may be put in a situation where they are forced to sell everything. Other risk management models may automatically sell investments that are losing value sharply. For the individual who is not actively trading but rather placing money in a fund, the MDD is psychologically frustrating to the point where the individual may withdraw their investment at a loss in fear of losing more money.

Metric 4 (Lowest Value From Investment (LVFI)) The percentage between the initial investment and the lowest value of the equity curve. This is one of the most important metrics, and has a significant impact on technical and psychological factors. For investors trading on margin, a high LVFI will cause the lender to ask the investor for more safety capital (known as a margin call). This can be potentially devastating, as the investor may not have the capital required, and is then forced to sell the investment. The investor will then never enjoy the return the investment could have produced. Individuals who are not investing actively, but instead are choosing between funds that invest in their place, should be aware of the LVFI, as it is the worst case scenario if they need to retract their investment prematurely.

Metric 5 (Time In Market Ratio (TIMR)) The percentage of time of the investment period where the alpha model owned assets. This metric may seem odd to place within the same family as the other drawdown risks, however it fits naturally in this space. We can assume that on the days the alpha model does not own any assets the drawdown risk is zero. If we are not invested, then there is no risk of loss. In fact, we can further assume that our equity is growing according to the risk free rate, as it is not bound in assets.

4 BAYESIAN OPTIMISATION

Our intention is to use GBNs as alpha models and to optimise the free parameters Λ with respect to the metrics given in Section 3.2. In order to do so we must backtest the signals that a GBN produces, and thus we cannot analytically solve the optimisation problem, as backtesting as a function of Λ has no general closed form expression. At the same time, backtesting is relatively costly, as one must create the model, prepare data, estimate parameters, generate signals and walk through the time range to simulate the trading. For this reason, it is not feasible to exhaustively sweep a large grid of parameters. However, Bayesian optimisation allows us to prioritise the points on the grid to evaluate, thus reducing the number of evaluations, while still finding the global maximum of a potentially costly and black box objective function.

4.1 GAUSSIAN PROCESS AS SURROGATE FUNCTION

Essentially, we would like to find the parameters Λ∗ ∈ Λ that maximise an unknown function f. We place a prior, p(f), over the possible functions f, and compute the posterior over f using observations {Λ1:i, f1:i}, where fj = f(Λj). Hence, we compute p(f|{Λ1:i, f1:i}) ∝ p({Λ1:i, f1:i}|f)p(f). We can then use this posterior distribution over objective functions as an estimate of our objective function. This is sometimes known as using the posterior as a surrogate function for the true objective function.

In Bayesian optimisation it is common to use a Gaussian process (GP) as the surrogate function [7]. A GP is defined as a multivariate normal distribution of infinite dimension, where each dimension is a point along some grid. A finite set of these dimensions will form a Gaussian distribution, thus allowing a GP to be defined completely by a mean function µ and a kernel function κ. The GP over the grid Λ is then defined as N(µ(Λ), κ(Λ, Λ′)) for all Λ, Λ′ ∈ Λ. Commonly, the prior µ(Λ) is assumed to be zero for all Λ ∈ Λ, although this is by no means necessary if prior information is available to suggest otherwise. The more involved task is to define the kernel function κ. With κ we can express our prior belief about the objective function that we wish to maximise. Although we do not know the form of the objective function, we often assume that points close to each other on the grid give similar results, thus we assume the objective function to possess at least some smoothness. These assumptions can be articulated in κ, for instance by the rational quadratic kernel in Equation 1, where c is a tuning constant for how smooth we believe the objective function to be. For points close to each other, Equation 1 will result in values close to 1, while points further away will be given values closer to 0. The GP prior will obtain the same smoothness properties, as the covariance matrix is completely defined by κ. To visualise the smoothness achieved by tuning c, Figure 4 shows the decreasing covariance as distance grows with three different settings of c (1, 5 and 10). As can be seen, as c increases the decrease is slower, thus more smoothness is assumed.

Figure 4: Covariance Decrease by Distance
κ(Λ, Λ′) = 1 − ‖Λ − Λ′‖² / (‖Λ − Λ′‖² + c)        (1)

Assuming that we have observed {Λ1:i, f1:i}, and that we wish to calculate the posterior predictive distribution for an unobserved point Λi+1, a closed form expression exists for this calculation, as described in Equation 2. Thus, it is possible to efficiently calculate the posterior distribution of an unobserved point where both the prior smoothness and the observed data have been considered. For more on GPs, please see [13].

[f1:i, fi+1]ᵀ ∼ N(0, [[K, K∗ᵀ], [K∗, K∗∗]])
K = the i × i matrix with entries κ(Λa, Λb), for a, b = 1, ..., i
K∗ = [κ(Λi+1, Λ1), ..., κ(Λi+1, Λi)],  K∗∗ = κ(Λi+1, Λi+1)
p(fi+1 | {Λ1:i, f1:i}) = N(µi(Λi+1), σi²(Λi+1))
µi(Λi+1) = K∗K⁻¹f1:i
σi²(Λi+1) = K∗∗ − K∗K⁻¹K∗ᵀ        (2)

4.2 ACQUISITION FUNCTIONS AND BAYESIAN OPTIMISATION

Using a GP as a surrogate for the objective function allows us to encode prior beliefs about the unknown objective function, and sampling the objective function allows us to update the posterior of the surrogate. What is left to do is to decide where to sample the objective function. In Bayesian optimisation we make use of a so called acquisition function. Several acquisition functions have been suggested; however, the goal is to trade off exploring the grid where the posterior uncertainty is high against exploiting points that have a high posterior mean. We will use the upper confidence bound criterion, which is expressed as UCB(Λ) = µ(Λ) + ησ(Λ), where µ(Λ) and σ(Λ) represent the mean and standard deviation at the point Λ of the GP, and η is a tuning parameter to allow for more exploration (as η is increased) or more exploitation (as η is decreased).

Succinctly: define a GP over a grid with some kernel function, then randomly sample a point and evaluate the objective function at this point. Calculate the posterior of the GP given this new observation and find the Λ′ that maximises the acquisition function. Then Λ′ is the next point at which to evaluate the objective function. Iterate these steps for a predefined number of iterations. Once all iterations have passed, the Λ with the highest posterior mean is the set of parameters that maximises the objective function.
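The loop just described can be sketched end to end in a few dozen lines. This is a toy, pure-Python illustration under our own simplifying assumptions: a one-dimensional grid, noiseless observations, the rational quadratic kernel of Equation 1, a zero prior mean, and a deterministic first sample instead of a random one; the matrix inversion is naive and only suitable for small numbers of observations.

```python
import math

def rq_kernel(a, b, c=1.0):
    """Rational quadratic kernel of Equation 1 for scalar grid points."""
    d2 = (a - b) ** 2
    return 1 - d2 / (d2 + c)

def mat_inv(m):
    """Gauss-Jordan inverse for small dense matrices (lists of lists)."""
    n = len(m)
    aug = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(m)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def gp_posterior(x, xs, fs, kinv):
    """Posterior mean and variance at x per Equation 2:
    mu = K* K^-1 f and var = K** - K* K^-1 K*^T, zero prior mean."""
    ks = [rq_kernel(x, xi) for xi in xs]
    kv = [sum(ks[j] * kinv[j][i] for j in range(len(xs))) for i in range(len(xs))]
    mu = sum(kv[i] * fs[i] for i in range(len(xs)))
    var = rq_kernel(x, x) - sum(kv[i] * ks[i] for i in range(len(xs)))
    return mu, max(var, 0.0)

def bayes_opt(objective, grid, iterations, eta=2.0):
    xs, fs = [grid[0]], [objective(grid[0])]
    for _ in range(iterations - 1):
        kinv = mat_inv([[rq_kernel(a, b) for b in xs] for a in xs])
        def ucb(x):
            mu, var = gp_posterior(x, xs, fs, kinv)
            return mu + eta * math.sqrt(var)
        # acquire the unseen grid point maximising UCB(x) = mu(x) + eta*sigma(x)
        nxt = max((x for x in grid if x not in xs), key=ucb)
        xs.append(nxt)
        fs.append(objective(nxt))
    kinv = mat_inv([[rq_kernel(a, b) for b in xs] for a in xs])
    # report the grid point with the highest posterior mean
    return max(grid, key=lambda x: gp_posterior(x, xs, fs, kinv)[0])

# Toy objective with its maximum at x = 3 on a grid of 13 points
grid = [i * 0.5 for i in range(13)]
best = bayes_opt(lambda x: 1 - (x - 3.0) ** 2, grid, iterations=12)
print(best)
```

The acquisition step restricts itself to not-yet-sampled grid points so that the kernel matrix stays invertible under noiseless observations.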
5 EVALUATION PROCEDURE

In Section 6 we will account for a real-world application of GBNs as alpha models supported by Bayesian optimisation. However, in this section we will introduce the optimisation procedure used, as well as the method used to evaluate the performance of the optimisation, which is essentially the same method used in [5].

A data set D of consecutive evidence sets, e.g. observations over all or some of the random variables in the GBN, is divided into n equally sized blocks (D1, ..., Dn), such that they are mutually exclusive and exhaustive. Each block contains consecutive evidence sets, and all evidence sets in block Di come before all evidence sets in Dj for all i < j. Depending on the amount of available data, k is chosen as the number of blocks used for optimisation. Starting from index 1, blocks 1, ..., k are used for optimisation and block k + 1 for testing, thus ensuring that the evidence sets in the testing data occur after the optimisation data. The procedure is then repeated starting from index 2 (i.e. blocks 2, ..., k + 1 are used for optimisation and block k + 2 for testing). By doing so we create t repeated simulations, moving the testing data one block forward each time. An illustration of this procedure when n = 12, k = 5 and t = 7 is shown in Figure 5.

Figure 5: Data Split For Optimisation and Testing

During Bayesian optimisation, when the objective function is evaluated for some acquired Λ, a cross-validation estimate is calculated for the k blocks used. Here, k − 1 blocks are used to estimate the parameters Θ of the contained BNs and the held out block is used as validation data to calculate a score ρ. The value of the objective function, given parameters Λ, is thus the average of all ρ when each block in the optimisation data has been held out.
In order to formalise the procedure used to evaluate the optimisation, recall from Section 2 that Λ is used to represent the free parameters of a GBN and Γ is used to represent all fixed parameters. Let J be a score function such that J(Λ, Dj, {D}_l^m | Γ) is the score for a GBN under some parameterisation Λ and Γ when block j has been used for either testing or validation and the blocks Dl, ..., Dm have been used to estimate Θ of the BNs in the GBN (under the parameters Λ and Γ).

1. For each simulation t, where (as discussed previously) Dt+k is the testing data and Dt, ..., Dt+k−1 is the optimisation data, use Bayesian optimisation to find the parameters Λt that satisfy Equation 3.

Λt = arg max_{Λ ∈ Λ} (1/k) Σ_{j=t}^{t+k−1} J(Λ, Dj, {D}_t^{t+k−1} \ Dj | Γ)        (3)

2. For each Λt calculate the score ρ_J^t on the testing set with respect to the scoring function J according to Equation 4.

ρ_J^t = J(Λt, Dt+k, {D}_t^{t+k−1} | Γ)        (4)

3. The expected performance ρ̄_J of the optimisation, with respect to the score function J, is then given by the average of the scores ρ_J^t, i.e. ρ̄_J = (1/t) Σ_t ρ_J^t.

There are two things to note about this procedure. First, during cross-validation inside the objective function we disregard the natural order of the data, thus allowing a validation block to come before a block used for estimating the parameters Θ. This could potentially induce a bias in the cross-validation estimate, as the data used for estimating the parameters would not have been known at the time the data for the validation block was generated. However, as we do not use this scheme when we evaluate the performance of the optimisation, the expected performance of the optimisation is not biased in this way. We simply use this scheme to make the best use of the data during cross-validation. Second, one scoring function J has been used both during optimisation and for evaluating the expected performance of the optimisation. The scoring function J could internally use many different metrics to come up with one score to maximise. However, it is natural in the coming setting to expose the actual values of several metrics, thus several scoring functions J are used to get a vector of mean scores [ρ̄J1, ..., ρ̄Jm].

Another approach to combining Bayesian optimisation with cross-validation is to reduce the number of fold evaluations necessary [14], as certain folds may be closely correlated; however, our approach is to reduce the number of parameters that we need to test with cross-validation.

6 APPLICATION

Having established the optimisation procedure, and the method we intend to use to evaluate the performance of the optimisation, we turn our attention to a real-world application. We aim to use GBNs as alpha models to generate buy and sell signals for stock indices in such a way that drawdown risks are mitigated, compared to the buy-and-hold strategy, while at the same time maintaining similar or better rewards.

Stock indices are weighted averages of their respective stock components. For instance, the Dow Jones Industrial Average (DJIA) is a weighted average of 30 large companies based in the United States. Indices may have different schemes for how the different components are weighted, however they all aim to give a collective representation of their components.

An index fund owns shares of the components of a specific index, proportional to the weights, such that the fund's return is mirrored by the index. These funds are very popular, as they are easy for the investor to comprehend, while trading the individual components of an index requires a lot of effort.

A buy-and-hold strategy on stock indices via index funds may be convenient, however it implies that the equity is put through the full force of the drawdown risks described in Section 3.2. The buy-and-hold strategy holds assets over the entire backtesting period and so will be subject to the full force of these metrics. For instance, as an asset will be held throughout the period, the lowest point of the asset's value will coincide with the LVFI. In dwindling stock markets, the index funds will lose value, and equity could be salvaged and possibly be placed in risk-free assets during these periods.
Furthermore, utilising certain financial products, it is the parameterisation of one of the technical analysis indi- also possible to increase equity during these times of dis- cators to vary between the phases, we essentially created tress by purchasing short positions of the index. Short posi- different variables in the different phases. The tuning of tions can be thought of as a loan, where the value of the loan the technical analysis parameters allowed us to better cap- increases if the index decreases in value, and it is possible ture the dynamics of the data, as they may differ between to sell the loan at its higher value (to make the distinction, assets as well as between the different phases of trading. regular positions are called long when short positions are Figure 7 depicts the classifier structure and variables used. considered). The variables are explained below, along with an example At first the buy-and-hold strategy may seem naı̈ve, how- in Figure 8. ever it has been shown that deciding when to own and not own assets requires consistent high accuracy of predictions • S represents the first-order finite backward difference in order to gain higher returns than the buy-and-hold strat- of 5 periods of the MA of ψ periods, shifted 5 periods egy [15]. The buy-and-hold strategy has become a standard into the future. To clarify, the first plot in Figure 8 benchmark, not only because of the required accuracy, but shows the price of an asset along with the MA. If the also because it requires very little effort to execute (no com- current time is t, then S represents the slope of the line plex computations and/or experts needed). between what the MA will be at t + 5 and what it is at t. • A represents the same slope as S but at its current 6.1 METHODOLOGY value (i.e. between t and t − 5). We used two different GBN structures to create alpha mod- • B represents the difference between the current value els. 
of the MA of ψ periods and the current raw price. This can be seen in Figure 8 as the difference between the two time series in the first plot.

• C represents the current RSI value (at t in the second plot of the figure) using 14 periods.

• D represents what C was 5 time steps in the past (at t − 5 in the second plot of the figure).

The choice of 14 periods for RSI is based on the prevailing standard [16, 17], and the choice of 5 periods as the prediction horizon is based on the number of trading days in a week.

Variables A, B, C and D were discretised into six bins, each using equal width binning, and S was discretised into two bins separated by zero.

The first GBN structure (henceforth known as GBN-1) modelled buying and selling long positions only, while the second GBN structure (GBN-2) modelled buying and selling long and short positions. The structures are depicted in Figure 6 (GBN-1 on the left and GBN-2 on the right). The structure for GBN-1 works as described in Section 2. The structure for GBN-2 starts in the Trend phase, from where either G1 or G2 can trigger. If G1 triggers, then a long open signal is generated and the Long phase is activated (deactivating the Trend phase). If gate G3 then triggers, a long close signal is generated, and the Trend phase is activated again (deactivating the Long phase). However, if G2 triggers before G1, then a short open signal is generated, and the Short phase is activated (deactivating Trend). In similar fashion, when G4 triggers a short close signal is generated, activating Trend and deactivating Short.

• In both cases all τ were confined to [60, 90] and all ψ to [10, 40].
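As an illustration, the phase switching described above can be sketched as a small state machine. The gate predicates here are hypothetical stand-ins for the actual trigger logic (posterior probabilities of S compared against thresholds τ), so this is only a sketch of the control flow, not of the paper's implementation.

```python
# Sketch of GBN-2's phase switching (Trend/Long/Short, gates G1-G4).
# The gate callables are hypothetical stand-ins for the posterior
# threshold tests p(S | e) > tau used by the actual gates.

class GBN2Phases:
    def __init__(self, g1, g2, g3, g4):
        # Each gate is a callable taking the current evidence and
        # returning True when the gate triggers.
        self.gates = {"G1": g1, "G2": g2, "G3": g3, "G4": g4}
        self.phase = "Trend"  # GBN-2 starts in the Trend phase

    def step(self, evidence):
        """Evaluate the active phase's gates; return a signal or None."""
        if self.phase == "Trend":
            if self.gates["G1"](evidence):      # long open, Trend -> Long
                self.phase = "Long"
                return "long open"
            if self.gates["G2"](evidence):      # short open, Trend -> Short
                self.phase = "Short"
                return "short open"
        elif self.phase == "Long" and self.gates["G3"](evidence):
            self.phase = "Trend"                # long close, back to Trend
            return "long close"
        elif self.phase == "Short" and self.gates["G4"](evidence):
            self.phase = "Trend"                # short close, back to Trend
            return "short close"
        return None                             # no gate triggered
```

Note that at most one phase is active at a time, which mirrors the activation/deactivation behaviour of the gates described above.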
The two states of S thus represent a predicted positive or negative future value of the modelled asset price (smoothed by the moving average of ψ periods).

Figure 7: Bayesian Classifier in GBN Phases

Figure 8: Visualisation of Variables (top: price and MA around t − 5, t and t + 5; bottom: RSI)

We used the upper confidence bound acquisition function (as described in Section 4.2) with η = 5, which allowed for abundant exploration, as our objective function was not extremely expensive to evaluate. We used the rational quadratic kernel as described in Equation 1 with c = 1. For GBN-1 we ran the Bayesian optimisation for 1,600 iterations, and for GBN-2 we ran 12,800 iterations.

6.1.3 Data Sets

We used four indices in this study: DJIA and NASDAQ, which are both based on companies in the United States; FTSE100, which is based on companies in the United Kingdom; and DAX, which is based on companies located in Germany. We ran our experiments on daily adjusted closing prices for these indices ranging from 2001-01-01 to 2012-12-28 (data downloaded from Yahoo! Finance™). This gave a total of 12 years of price data for each index, where each year was allocated to a block, thus n = 12. For the cross-validation step we used k = 5, giving t = 7 simulations from which to calculate [ρ̄J1, ..., ρ̄Jm] (the data split is depicted in Figure 5).

6.1.4 Scoring Functions

The signals generated were backtested in order to calculate relevant metrics. During optimisation (i.e. step 1 in Section 5) the objective function used the Sharpe ratio.
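For concreteness, two of the metrics involved can be sketched as follows: the Sharpe ratio (mean return divided by the standard deviation of returns, as in Table 1) and the maximum drawdown (MDD). The paper's exact return and risk-free-rate conventions are not reproduced here; plain per-period returns and a zero risk-free rate are assumed.

```python
import statistics

# Sketch of two backtest metrics, assuming simple per-period returns
# and a zero risk-free rate (the paper's exact conventions may differ).

def sharpe_ratio(returns, risk_free=0.0):
    """Mean excess return divided by the sample standard deviation."""
    excess = [r - risk_free for r in returns]
    return statistics.mean(excess) / statistics.stdev(excess)

def max_drawdown(equity):
    """Largest peak-to-trough relative decline of an equity curve."""
    peak, mdd = equity[0], 0.0
    for value in equity:
        peak = max(peak, value)
        mdd = max(mdd, (peak - value) / peak)
    return mdd
```

Combining risk and reward into a single number is what makes the Sharpe ratio convenient as an optimisation objective, as discussed next.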
The choice was made as it combines both risk and reward into one score, for which a cross-validation estimate could be returned by the objective function. For evaluating the performance of the optimisation (step 2 in Section 5), we used the return and the drawdown risks described in Section 3.2 to create a score vector [ρ̄J1, ..., ρ̄Jm]. The same metrics were calculated for the buy-and-hold strategy.

As S represents a future value, evidence for S was only available during estimation of the parameters Θ, not during the generation of signals.

The gates all defined trigger logic over the posterior distribution of S with some threshold τ. For instance, in GBN-1, the trigger logic for G1 was TL(G1) : p(S = positive | e) > τG1, i.e. if the posterior probability of a positive climate is greater than some threshold, then the model should give a buy signal and move to the next phase (the sell phase). Naturally, the trigger logic for G2 in the same GBN was TL(G2) : p(S = negative | e) > τG2, thus giving a sell signal if the posterior probability of a negative future value exceeds some threshold.

6.1.2 Bayesian Optimisation Settings

The previous section implies the following for the two GBNs:

• For GBN-1 the free parameters Λ to be optimised are: Λ = {τG1, τG2, ψBuy, ψSell}.

• For GBN-2 the free parameters Λ to be optimised are: Λ = {τG1, τG2, τG3, τG4, ψTrend, ψLong, ψShort}.

6.2 RESULTS AND DISCUSSION

The score vectors from the evaluation of the optimisation versus the score vector for the buy-and-hold strategy over the seven simulations are shown in Table 1. The annual Sharpe presented in the table is the mean return divided by the standard deviation of returns over the seven simulations, and since each block was allocated one year of data it becomes the annual Sharpe ratio.

Table 1: Metric Values for GBNs and Buy-and-Hold

DJIA
Score           GBN-1    GBN-2    BaH
Annual Sharpe   0.289    0.330    0.157
Return          0.019    0.032    0.028
MDD             0.085    0.116    0.167
LVFI            0.058    0.062    0.119
TIMR            0.628    0.91     1.0

NASDAQ
Score           GBN-1    GBN-2    BaH
Annual Sharpe   0.308    0.081    0.254
Return          0.033    0.012    0.067
MDD             0.101    0.164    0.207
LVFI            0.062    0.099    0.146
TIMR            0.554    0.94     1.0

FTSE100
Score           GBN-1    GBN-2    BaH
Annual Sharpe   -0.057   -0.64    0.127
Return          -0.006   -0.074   0.022
MDD             0.098    0.167    0.188
LVFI            0.074    0.121    0.142
TIMR            0.649    0.962    1.0

DAX
Score           GBN-1    GBN-2    BaH
Annual Sharpe   0.778    0.589    0.278
Return          0.081    0.062    0.069
MDD             0.107    0.171    0.213
LVFI            0.056    0.059    0.154
TIMR            0.610    0.926    1.0

We will first turn our attention to GBN-1. We use the Sharpe ratio as our measure of reward, prioritised above the raw return for reasons discussed in Section 3.2. Therefore, we must first ensure that our algorithmic trading system produces similar or better Sharpe ratios than the buy-and-hold strategy. As can be seen, this was the case for DJIA, NASDAQ and DAX, but not for FTSE100. Secondly, we must take into consideration the TIMR. For GBN-1, we were invested only slightly above half of the time compared to buy-and-hold, reducing risk to equity considerably. Meanwhile, the rest of the time the equity could have gained in value from interest rates (or other risk-free assets); this potential gain was not considered in these results. Risk to equity from MDD was half its counterpart from the buy-and-hold strategy for all indices. The LVFI is a major threat to equity (as discussed in Section 3.2), and one where buy-and-hold severely underperforms. For DAX the LVFI was only a third of the buy-and-hold LVFI, and for the other three indices it was half.

All in all, the results clearly indicate that GBN-1 was competitive with the buy-and-hold strategy for three of the indices, as Sharpe ratios were improved upon and risk to equity was decreased significantly. Furthermore, these results were achieved while at the same time only having equity invested half of the investment period. It is also clear that we cannot expect the same GBN to be useful for all indices, as the reward was not improved upon for FTSE100. Some of the parameters that were fixed in Γ may have to be tuned in order to accommodate the dynamics of FTSE100, such as the technical analysis indicators used, or the fixed parameters of the ones used currently.

Moving on to GBN-2, we can see that allowing the GBN to open short positions changes the results dramatically. For DJIA, we improved upon the Sharpe ratio, at the cost of the drawdown risks. Both MDD and LVFI were increased marginally, yet still lower than buy-and-hold. The TIMR was also increased to such a degree that we were invested almost the entire investment period. There is potential gain in reward from using GBN-2 for DJIA, however the increased risk must be considered.

For NASDAQ, FTSE100 and DAX there is no improvement over GBN-1. Instead, Sharpe ratios are decreased, as well as an increase in drawdown risks. There could be several reasons for this that are worth investigating; however, our immediate response is that we have either overfitted the model due to several more parameters being optimised over the same amount of data, or the fact that a bad short position is doubly bad on equity, as we will lose out on the profit from a long position during the same time.

7 CONCLUSIONS AND FUTURE WORK

Our results show that it is feasible to use GBNs as alpha models, and to use Bayesian optimisation to tune them in order to beat the buy-and-hold benchmark, with respect to certain risk and reward metrics. Some of the design decisions made before optimisation may however have reduced the performance of the GBNs on some of the used data sets. Short positions are optimally taken during times of distress, and due to increased volatility, markets move very differently compared to stable increasing markets. We decided to lock in the forward and backward horizons to 5 time steps, and the RSI period to 14, which may have made it impossible to capture the more volatile dynamics. Furthermore, stock indices generally increase in value over long periods of time, thus short selling will always go against the long-term trend, which in general is ill-advised.

Nevertheless, we are encouraged to see the included positive results and are at the same time motivated to address the problems we faced with GBN-2. We would not expect the exact same model to perform well on all given data sets, and so further work is needed to improve upon the results on FTSE100 to make them on par with the other three indices. For instance, there is room to make the objective function even more expensive by not only estimating BN parameters, but also performing variable selection and structure learning during cross-validation.

Acknowledgements

BN inference in our implementation is based on the SMILE reasoning engine, contributed to the community by the Decision Systems Laboratory of the University of Pittsburgh and available at https://dslpitt.org/genie/.

References

[16] J. J. Murphy, Technical analysis of the financial markets. New York Institute of Finance, 1999.

[17] M. J. Pring, Technical analysis explained. McGraw-Hill, 2002.

[1] P. Treleaven, M. Galas, and V. Lalchand, “Algorithmic trading review,” Communications of the ACM, vol. 56, no. 11, pp. 76–85, 2013.

[2] G. Nuti, M. Mirghaemi, P. Treleaven, and C. Yingsaeree, “Algorithmic trading,” Computer, vol. 44, no. 11, pp. 61–69, 2011.

[3] R. K. Narang, Inside the black box. John Wiley & Sons, 2013.

[4] M. Bendtsen and J. M.
Peña, “Gated Bayesian networks,” in Proceedings of the Twelfth Scandinavian Conference on Artificial Intelligence, pp. 35–44, 2013.

[5] M. Bendtsen and J. M. Peña, “Learning gated Bayesian networks for algorithmic trading,” in Proceedings of the Seventh European Workshop on Probabilistic Graphical Models, pp. 49–64, 2014.

[6] M. Bendtsen and J. M. Peña, “Gated Bayesian networks for algorithmic trading,” International Journal of Approximate Reasoning, 2015, submitted.

[7] E. Brochu, V. M. Cora, and N. de Freitas, “A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning,” Tech. Rep. UBC TR-2009-023 and arXiv:1012.2599, 2009.

[8] J. Pearl, Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann Publishers, 1988.

[9] F. V. Jensen and T. D. Nielsen, Bayesian networks and decision graphs. Springer, 2007.

[10] K. B. Korb and A. E. Nicholson, Bayesian artificial intelligence. Taylor and Francis Group, 2011.

[11] R. Pardo, The evaluation and optimization of trading strategies. John Wiley & Sons, 2008.

[12] E. P. Chan, Quantitative trading. John Wiley & Sons, 2009.

[13] C. E. Rasmussen and C. K. I. Williams, Gaussian processes for machine learning. MIT Press, 2006.

[14] K. Swersky, J. Snoek, and R. P. Adams, “Multi-task Bayesian optimization,” in Advances in Neural Information Processing Systems 26, pp. 2004–2012, 2013.

[15] W. F. Sharpe, “Likely gains from market timing,” Financial Analysts Journal, vol. 31, no. 2, pp. 60–69, 1975.