<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Avoiding the Pitfalls on the Stock Market: Challenges and Solutions in Developing Quantitative Strategies</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marco Bergianti</string-name>
          <email>marco.bergianti@cristail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicola Cioffo</string-name>
          <email>nicola.cioffo@cristail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Angelo Porrello</string-name>
          <email>angelo.porrello@unimore.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesco Del Buono</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Cristail</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Modena and Reggio Emilia</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>1994</year>
      </pub-date>
      <volume>1</volume>
      <fpage>29</fpage>
      <lpage>31</lpage>
      <abstract>
        <p>Quantitative stock trading based on Machine Learning (ML) and Deep Learning (DL) has gained great attention in recent years thanks to the ever-increasing availability of financial data and the ability of this technology to analyze the complex dynamics of the stock market. Despite the plethora of approaches present in the literature, a large gap exists between the solutions produced by the scientific community and the practices adopted in real-world systems. Most of these works in fact lack a practical vision of the problem and ignore the main issues afflicting fintech practitioners. To fill such a gap, we provide a systematic review of the main dangers affecting the development of an ML/DL pipeline in the financial domain. They include managing the stochastic and non-stationary characteristics of stock data, various types of bias, model overfitting, and the design of impartial evaluation methods. Finally, we present possible solutions to these critical issues.</p>
      </abstract>
      <kwd-group>
        <kwd>Financial Markets</kwd>
        <kwd>Quantitative Trading Strategies</kwd>
        <kwd>Machine Learning</kwd>
        <kwd>Deep Learning</kwd>
        <kwd>Bias</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Quantitative trading is a field of financial investment which has undergone a massive renewal in recent years. Given the ever-increasing availability of financial data, traditional approaches based on human intuition have been gradually replaced by the most modern Machine Learning (ML) and Deep Learning (DL) methodologies, by virtue of their effectiveness in identifying hidden patterns with high predictive power.</p>
      <p>This technology, when applied in the financial domain, is mainly used to predict stock prices, their trends (i.e., positive or negative depending on whether stock prices are expected to increase or decrease) or directly the most profitable stocks. In the first two scenarios, regressors and classifiers are respectively employed to predict the future behavior of the stocks, while in the last case the model is trained to learn a ranking function that sorts stocks in descending order by expected profit. The outputs of these models are then exploited to select the top-k most profitable stocks and to build trading strategies.</p>
      <p>In the literature, a large variety of financial models have been proposed to solve these tasks. They can be classified into methods based on technical analysis (TA) and approaches based on fundamental analysis (FA) [1].</p>
      <p>In more detail, this paper will dive through the main macro-steps of a typical ML/DL pipeline, namely data preparation, featurization, modeling, and evaluation, which only a few works have already addressed [9, 10, 11]. For each of them we will explore the main challenges, and we will discuss some of the most adopted solutions.</p>
      <p>[Figure 1: the main pitfalls across the ML/DL pipeline. Data preparation: outliers and missing values, look-ahead bias, survival bias. Featurization: inhomogeneous series, non-stationarity, future time horizon, class distribution. Modeling: profit-driven optimization, stochasticity, covariates. Evaluation: time/serial correlations, overfitting.]</p>
      <p>The rest of this paper is organized as follows. Section 2 provides an overview of the dangers across the ML/DL pipeline. In Sections 3-6 we investigate, for each of the above steps, the main solutions to mitigate the relative critical issues. Finally, in Section 7 we sketch out some conclusions and future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Overview</title>
      <p>This section provides an overview of the main challenges that will be covered in this paper, following the main macro-steps of a typical ML/DL pipeline (see Figure 1).</p>
      <p>Data Preparation. Preparing financial data is a complex activity due to the presence of outliers, missing values and bias in the data. These mainly include look-ahead bias, survivorship bias, and dividend/split adjustment, which require ad-hoc procedures to avoid information leakage and erroneous predictions.</p>
      <p>Featurization. Designing financial supervised tasks includes both stock data featurization and label preparation. Featurization is needed to remove unwanted properties from raw stock price series, which exhibit non-homogeneity (i.e., values arrive with an irregular frequency) and non-stationarity (i.e., their statistical properties vary over time). Preparing financial data labels, on the other hand, mainly means managing imbalanced label distributions in classification scenarios and appropriately defining the prediction dates in regression scenarios (i.e., whether to set them statically or dynamically).</p>
      <p>Modeling. Designing financial models presents its own set of challenges, where stochasticity and the exploitation of stock relations are the most relevant aspects.</p>
      <p>Evaluation. The application of traditional ML/DL evaluation methods in the financial domain often results in inflated performance due to different forms of bias and data dependencies. Furthermore, ad-hoc countermeasures must be taken to handle model and backtest overfitting.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Data Preparation</title>
      <p>Preparing data for financial models is a crucial task as it requires handling incomplete and inaccurate data with different forms of bias. Indeed, biased data can lead to the development of ineffective trading strategies that underperform in the real market.</p>
      <sec id="sec-3-1">
        <title>3.1. Outliers and missing values</title>
        <p>Financial data frequently contains stocks that trade intermittently and outliers (e.g., price values that deviate strongly from average behavior), which can reveal abnormal patterns (e.g., abnormal returns). Managing these anomalies is much more pressing in the financial domain than in any other field, as financial decisions are often critical and profit-driven, i.e., even small errors can result in significant losses. Furthermore, they can negatively affect the training of ML/DL models, which acquire a distorted knowledge of the task. A possible solution to the first problem is to consider only the stocks that have been traded on more than a certain percentage of trading days (e.g., 98%), while the standard method to deal with outliers is to clip values within a specific range [12].</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Look-ahead bias</title>
        <p>Look-ahead bias occurs when a model uses information that would not have been available at inference time [8]. A generic approach to mitigate this problem is to implement out-of-sample testing, which involves dividing the data into two parts: one for model construction and one for validation. The model is trained on the first part of the data and then tested on the second, which helps avoid overfitting and yields a more accurate estimate of its performance.</p>
        <p>Despite the use of this technique, look-ahead bias may still emerge when processing adjusted price data and fundamental data. Adjusted prices, for example, are constantly updated based on the occurrence of a split or the payment of dividends. When such events occur, the whole past time series is corrected accordingly. For example, when a 2-for-1 stock split occurs, all prices before that date are halved. As a consequence, adjusted prices implicitly store information about future events and should be used with caution. To mitigate this problem, the yield series is preferred over the original series: it operates on percentage differences rather than on absolute values and is not affected by the bias produced by such corrections.</p>
        <p>When fundamental data is processed, instead, it is necessary to pay attention to its publication process. These documents are written on a certain date and subsequently corrected without updating the filing date, implicitly indicating that the new information was already known at the initial writing time of the document. Not considering this aspect means including future information in the historical data, which results in inflated performance.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Survivorship bias</title>
        <p>Survivorship bias occurs when the data used to train and test a model only includes the stocks that have survived until the present time, hence ignoring that some companies went bankrupt and their securities were delisted. This bias can result in an overestimation of the performance of the strategies, as they ignore the stocks that have gone bankrupt or been delisted [8, 13, 14]. Various solutions have been proposed in the literature to address this bias, such as including delisted securities in the analysis [15] or applying a survivorship bias correction method, which involves adjusting the returns of surviving securities to account for the returns of the delisted ones.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Featurization</title>
      <p>The data preparation phase is typically followed by a featurization phase, which aims at transforming the raw data in order to 1) highlight expressive patterns for the stock selection task and 2) obtain better statistical properties that facilitate processing through ML/DL. This procedure is mainly applied to raw stock price series, which exhibit unwanted properties such as non-homogeneity (i.e., values arrive with irregular frequency) and non-stationarity (i.e., their statistical properties vary over time).</p>
      <p>In this section, we present some solutions to these problems, distinguishing between solutions for the input (i.e., feature space) and the output (i.e., label space).</p>
      <sec id="sec-4-1">
        <title>4.1. Input</title>
        <p>A very popular category of stock selection approaches is based on technical analysis, which directly elaborates on numerical features like past prices and macroeconomic indicators. This type of data is affected by several problematic conditions that must be managed appropriately to create effective trading strategies.</p>
        <sec id="sec-4-1-1">
          <title>4.1.1. Inhomogeneous series</title>
          <p>In the literature, stock price series are typically time-indexed, i.e., their values are sampled at fixed time intervals. This represents the most intuitive choice, as it is consistent with sunlight cycles. Unfortunately, markets are operated by algorithms that trade with limited human supervision, for which CPU processing cycles are much more relevant than chronological intervals [16]. As a consequence, sampling information on a time basis results in oversampling during low-activity periods and undersampling during high-activity periods. Furthermore, time-sampled series often exhibit poor statistical properties, like serial correlation, heteroscedasticity, and non-normality of returns. To alleviate this problem, alternative forms of sampling have been proposed, such as volume bars, which collect information whenever a certain amount of stock units has been traded, or dollar bars, which sample data every time a pre-defined market value is exchanged.</p>
        </sec>
        <sec id="sec-4-1-2">
          <title>4.1.2. Non-stationarity</title>
          <p>Another undesired property of the raw stock price series is non-stationarity [17, 18], i.e., its statistical properties vary over time. This prevents the direct application of inferential analyses, as they operate exclusively on invariant processes. To circumvent this problem, the most adopted solution is to transform the raw price series into a yield series, where the absolute values of the prices are replaced by percentage variations. Although this transformation makes the series stationary, its drawback is that it removes memory from the data (i.e., it removes correlations between past and future observations), which is the main basis for the model's predictive power. Recent featurization methodologies based on fractionally differentiated features have been explored to obtain an effective trade-off between stationarity and memory [8].</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Output</title>
        <p>Parallel to the input featurization, the label space must be transformed coherently with the type of task to be solved (i.e., classification or regression).</p>
        <sec id="sec-4-2-1">
          <title>4.2.1. Class unbalanced distribution</title>
          <p>In a classification scenario, observations are typically labeled based on whether the return is positive or negative. However, this may produce unbalanced classes, as during market booms the probability of a positive return is much higher, while during market crashes it is lower [19]. This unbalanced distribution can introduce a bias in the model training by favoring the more frequent classes over the rarer ones. To avoid this condition, in [20] an asymmetric threshold assignment is used to balance the classes (e.g., samples with returns ≤ -0.5% and &gt; 0.55% are labeled as down and up, respectively).</p>
        </sec>
        <sec id="sec-4-2-2">
          <title>4.2.2. Fixed vs variable future time horizon</title>
          <p>A more specific concern of regression scenarios is the definition of the prediction time horizon, i.e., whether to determine it statically (e.g., using a fixed time interval) or dynamically (e.g., when certain events occur). Although the first category is more intuitive, several approaches based on variable time horizons are applied in the industry, e.g., based on the occurrence of significant price changes with respect to an average volatility. This is done to adhere to the dynamics of the market, where conditions for exiting a position are often defined through thresholds for profit-taking and stop-losses [8].</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Modeling</title>
      <p>Given stock features and related labels, the next step is to apply supervised approaches to learn hidden patterns in past data and acquire predictive capabilities on future data. Several challenges afflict the design of ML/DL models in the financial domain, such as the management of the stochastic nature of data (mainly in price series), the exploitation of correlations between stocks, and the correct definition of the model optimization function (e.g., to identify the most profitable stocks).</p>
      <sec id="sec-5-1">
        <title>5.1. Stochasticity</title>
        <p>Stock data have a chaotic and noisy nature: they are largely driven by new information and result in a random-walk pattern [20]. This random component can negatively impact the training process. Traditional supervised techniques are in fact designed to operate on clean data and are not capable of handling uncertain data. This has motivated an intense effort in the area of deep learning, leading to several solutions over the last few years. Among these, three categories of methods have been explored: 1) the adoption of ad-hoc loss functions, 2) the exploitation of adversarial training procedures, and 3) the construction of intrinsically probabilistic models.</p>
        <p>The intuition behind the first category is to model the output in probabilistic terms, estimating a probability distribution rather than relying on punctual targets. Quantile loss [21] and Gaussian loss [22] represent the main objective functions used in this category of methods. Adversarial training approaches instead try to manage the stochasticity by training the model to produce similar outputs for different variations of the same target input [18]. Finally, instead of using only deterministic features, generative models incorporate inherently probabilistic components. Variational auto-encoders (VAE) [23] are the best-known example of this component, and several stock selection approaches rely on them [20, 24].</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Covariates</title>
        <p>The stock market is also characterized by significant forms of correlation between stocks, e.g., stocks belonging to the same sector show similar patterns. Capturing these types of relationships is essential to better understand market dynamics and create effective trading strategies accordingly. Although initially most of the approaches proposed in the literature treated each stock as isolated for prediction, a new line of work is actively exploring the joint prediction of multiple stocks. Most of these works integrate graph neural networks [25] to model such correlations in static [26, 27] or dynamic (i.e., learned directly by the model) [17] graphs.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Profit-Driven Optimization</title>
        <p>Another aspect often overlooked in the design of ML/DL models in finance concerns the correct definition of the learning strategy according to the investment objective. Most of the approaches do not directly optimize the target of investment in terms of profit, even if they are interested in identifying the most profitable stocks. In other words, the stock selection task is typically formulated as a classification problem (to estimate the future trend of stocks) or a regression problem (to directly estimate the future price/return of stocks). However, correctly solving these tasks can lead to sub-optimal solutions in terms of profit [12, 4]. Consider the toy example shown in Table 1, where two regressors (R1 and R2) and two classifiers (C1, C2) are respectively used to predict the return and the trend of 5 stocks. As can be seen, the worst-performing models (i.e., R2 and C2) are able to select the most profitable top-1 stock compared to the best-performing methods (i.e., R1 and C1); note that in the regression the top-1 stock is selected based on the highest predicted return, while in the classification it is selected based on the highest probability of the positive trend. Following this direction, a new line of work has suggested adopting a ranking approach, which is closer to the problem of selecting the most profitable stocks [27, 17]. Instead of predicting, for example, the return of stocks (as in a regression task), the goal here is to sort the stocks by decreasing return. In this way, the stocks that perform better than others appear first in the ranking and are selected by a top-k-based trading strategy.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Evaluation</title>
      <p>The goal of the evaluation step in the financial domain is twofold. First, the predictive ability of the ML/DL model must be evaluated and, second, the performance of the trading strategy must be analyzed. The latter is built on top of the model's predictions and varies depending on the type of supervised task used for model training. In a classification scenario, the up and down predictions are interpreted as buy and sell signals. In the regression and ranking scenarios, the (top-k) stocks with the highest predictions are bought and those (top-k) with the lowest predictions are sold.</p>
      <p>To achieve this, different metrics and evaluation procedures for both tasks have been proposed. With regard to evaluation metrics, a distinction is made between model metrics and portfolio metrics, depending on whether they evaluate the model or the strategy. Commonly used portfolio metrics include return, Sharpe ratio, and Sortino ratio. Regarding the evaluation procedures, instead, an out-of-sample evaluation scheme is typically used to evaluate the effectiveness of the model (e.g., cross-validation is the most commonly adopted solution), while a backtesting technique is employed to analyze the performance of the trading strategy.</p>
      <p>However, there are still several problems that practitioners may encounter during the evaluation process. They arise mainly from the tendency of models to overfit and from the presence of serial correlation in the data.</p>
      <sec id="sec-6-1">
        <title>6.1. Time/Serial correlations</title>
        <p>Although most financial models are evaluated with standard cross-validation (i.e., an extension of out-of-sample evaluation to multiple train-test splits), it is not the ideal evaluation tool for financial data. This is due to the existence of various forms of temporal correlation in the data, which create leakages, or implicit overlaps between train and test data, compromising the reliability of the evaluation process. To mitigate this issue, a new cross-validation scheme has been proposed in [8], where purging and embargoing techniques are applied to remove such dependencies. More specifically, the purging technique removes from the train set all observations whose labels overlap in time with those included in the test set. In a task that predicts monthly stock returns, for example, this means creating a window of at least 30 days between train and test observations. On the other hand, embargoing creates a further gap between train and test sets when the latter precedes the train set in time. This is done to avoid that the test set contains information that is highly correlated with the next train set.</p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Overfitting</title>
        <p>A very common condition in financial machine learning is overfitting, i.e., the poor ability to generalize to new data. This condition mainly affects backtesting strategies, although it is also common in financial model training [18]. Backtest overfitting occurs when a strategy is over-optimized on a specific backtest scheme, resulting in poor performance if the backtest is changed. Most trading strategies are affected by this condition, as they are evaluated exclusively with the popular walk-forward (WF) scheme. With this procedure, the historical data is divided into two sets, the in-sample and out-of-sample periods. The strategy is developed and optimized during the in-sample period and evaluated during the out-of-sample period. The scheme is repeated by moving the in-sample and out-of-sample periods forward in time. Although this procedure has the advantage of providing a clear historical interpretation of the performance of a strategy, it has the disadvantage of testing a single scenario, obtained by splitting the data only in the forward direction. To mitigate this problem, Combinatorial Purged Cross-Validation (CPCV) has recently been proposed [8]. It modifies a traditional K-Fold cross-validation scheme by generating all possible combinations of train-test splits, using a group of more than one fold as test set and the remaining folds as train set, while purging train observations that contain leaked information. Unlike traditional cross-validation methods, the test sets are not used to compute performance metrics directly. Instead, they are divided into groups, each representing an independent evaluation path. In this way, multiple backtest paths are evaluated instead of a single one, reducing backtest overfitting.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusions</title>
      <p>In this paper, we have provided a systematic review of the main pitfalls afflicting fintech practitioners in developing stock selection strategies, and we have collected the main solutions used to mitigate them. Starting from the data preparation step, the most adopted practices are the use of clipping techniques to reduce abnormal patterns, the correct management of price-adjusted and fundamental data to avoid look-ahead bias, and the inclusion of delisted stocks to limit survivorship bias. In the featurization phase, the main solutions to manage the inhomogeneity and non-stationarity of stock series are the adoption of sampling techniques based on volume or dollar bars and the transformation of price series into yield series.</p>
    </sec>
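        <p>A minimal sketch of these two practices (keeping only sufficiently traded stocks and clipping prices into a range), in plain Python; the function names, data layout and thresholds are illustrative, not from the paper:</p>

```python
def liquid_tickers(prices_by_ticker, min_coverage=0.98):
    """Keep only tickers traded on more than `min_coverage` of days.
    `prices_by_ticker` maps ticker -> list of daily prices, with None
    marking days on which the stock did not trade."""
    kept = []
    for ticker, prices in prices_by_ticker.items():
        traded = sum(p is not None for p in prices)
        if traded / len(prices) > min_coverage:
            kept.append(ticker)
    return kept


def clip_series(prices, lo, hi):
    """Clip each price into [lo, hi] to tame outliers."""
    return [min(max(p, lo), hi) for p in prices]
```

        <p>For example, clip_series([10, 11, 500, 9], 5, 20) maps the outlier 500 to the upper bound 20 while leaving the other values untouched.</p>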
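        <p>A small sketch of why the yield series sidesteps the adjustment bias: percentage variations are invariant to the retroactive rescaling that a split or dividend adjustment applies to all past prices (toy numbers, plain Python):</p>

```python
def to_returns(prices):
    """Percentage variations between consecutive prices (the yield series)."""
    return [(b - a) / a for a, b in zip(prices, prices[1:])]


# A later 2-for-1 split retroactively halves every earlier adjusted price,
# but the return series computed from either version is identical:
raw = [10.0, 11.0, 12.1]
adjusted = [p / 2 for p in raw]
```

        <p>Here to_returns(raw) and to_returns(adjusted) both yield two steps of +10%, so the model never sees the future split leak into past values.</p>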
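        <p>A minimal sketch of dollar-bar sampling in plain Python (names and the bar threshold are illustrative; for simplicity the accumulator is reset rather than carrying the remainder into the next bar):</p>

```python
def dollar_bars(trades, bar_value):
    """Group (price, volume) trades into bars, closing a bar each time the
    accumulated traded dollar value reaches `bar_value`; returns the close
    price of each bar."""
    bars, accum = [], 0.0
    for price, volume in trades:
        accum += price * volume
        if accum >= bar_value:
            bars.append(price)  # close the bar at the last trade price
            accum = 0.0
    return bars
```

        <p>Volume bars follow the same scheme with accum += volume instead of accum += price * volume.</p>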
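        <p>A toy illustration of the overestimation, with entirely made-up numbers: averaging returns over survivors only inflates the estimate relative to the full universe that includes delisted names:</p>

```python
# Hypothetical annual returns: two survivors and two delisted stocks.
universe = {
    "SURV1": 0.08, "SURV2": 0.12,    # still listed today
    "DEAD1": -0.60, "DEAD2": -1.00,  # went bankrupt / delisted
}

survivors_only = [r for t, r in universe.items() if t.startswith("SURV")]
full = list(universe.values())

biased = sum(survivors_only) / len(survivors_only)  # +10% on survivors
unbiased = sum(full) / len(full)                    # -35% on the whole universe
```
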
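        <p>A sketch of fractional differentiation with a fixed-width window, assuming the standard binomial-expansion weights of the operator (1 - B)^d; with d = 1 and window 2 it reduces to the ordinary first difference, while fractional d keeps part of the series memory:</p>

```python
def fracdiff_weights(d, size):
    """Weights of the fractional difference operator (1 - B)^d:
    w_0 = 1 and w_k = -w_{k-1} * (d - k + 1) / k."""
    w = [1.0]
    for k in range(1, size):
        w.append(-w[-1] * (d - k + 1) / k)
    return w


def fracdiff(series, d, window):
    """Fixed-window fractionally differentiated series."""
    w = fracdiff_weights(d, window)
    out = []
    for i in range(window - 1, len(series)):
        out.append(sum(w[k] * series[i - k] for k in range(window)))
    return out
```

        <p>For instance, fracdiff(series, 1, 2) returns plain first differences, whereas d = 0.5 yields slowly decaying weights (1, -0.5, -0.125, ...) that retain memory of older observations.</p>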
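        <p>A minimal sketch of asymmetric-threshold labeling in plain Python, using the thresholds quoted from [20] (-0.5% and +0.55%); treating in-between returns as a neutral class is an assumption of this sketch:</p>

```python
def label_returns(returns, down_thr=-0.005, up_thr=0.0055):
    """Assign 'down'/'up' labels with asymmetric thresholds; returns
    falling between the two thresholds are labeled 'neutral'."""
    labels = []
    for r in returns:
        if down_thr >= r:
            labels.append("down")
        elif r > up_thr:
            labels.append("up")
        else:
            labels.append("neutral")
    return labels
```

        <p>Shifting the up threshold slightly higher than the magnitude of the down threshold compensates for the prevalence of positive returns during bullish periods.</p>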
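        <p>A sketch of a variable (event-driven) horizon: instead of labeling at a fixed future date, the position is closed at the first bar whose return from entry crosses a profit-taking or stop-loss threshold. Names and thresholds are illustrative:</p>

```python
def exit_index(prices, entry, take_profit=0.05, stop_loss=0.03):
    """Index of the first bar whose return from `entry` reaches
    +take_profit or -stop_loss; None if neither barrier is hit."""
    for i, p in enumerate(prices):
        ret = (p - entry) / entry
        if ret >= take_profit or -stop_loss >= ret:
            return i
    return None
```

        <p>The label (and its horizon) then depends on which barrier is touched first, mirroring how exit conditions are defined in practice.</p>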
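        <p>The mismatch between prediction error and profit can be reproduced with a tiny numeric example in the spirit of Table 1 (all numbers below are made up for illustration): the regressor with the lower MSE misprices the single best stock and therefore picks a worse top-1 investment.</p>

```python
def mse(pred, true):
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)


def top1(pred):
    """Index of the stock ranked first (highest predicted value)."""
    return max(range(len(pred)), key=lambda i: pred[i])


realized = [0.10, 0.02, -0.03, 0.01, 0.04]      # hypothetical future returns
model_a = [0.00, 0.021, -0.029, 0.011, 0.041]   # low error, misses the winner
model_b = [0.05, -0.05, -0.10, -0.05, -0.05]    # high error, ranks the winner first

# model_a has the lower MSE, yet model_b's top-1 pick earns more:
profit_a = realized[top1(model_a)]  # stock 4 -> 0.04
profit_b = realized[top1(model_b)]  # stock 0 -> 0.10
```
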
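        <p>The combinatorial part of CPCV can be sketched with the standard library: every combination of test folds is enumerated, and each fold therefore appears in several test sets, which is what produces multiple backtest paths (purging is omitted here and shown separately):</p>

```python
from itertools import combinations


def cpcv_splits(n_folds, n_test_folds):
    """All train/test fold assignments of Combinatorial Cross-Validation:
    each combination of `n_test_folds` folds serves once as test set."""
    folds = list(range(n_folds))
    splits = []
    for test in combinations(folds, n_test_folds):
        train = [f for f in folds if f not in test]
        splits.append((train, list(test)))
    return splits
```

        <p>With 6 folds and 2 test folds this yields 15 splits, and since each fold occurs in 5 test sets, the evaluations can be regrouped into 5 independent backtest paths.</p>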
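        <p>A simplified sketch of purging and embargoing over scalar time stamps (e.g., day numbers); the label of an observation at time t is assumed to span [t, t + label_horizon], and the sizes are illustrative:</p>

```python
def purged_train_times(times, label_horizon, test_times, embargo):
    """Train observations are purged when their label window overlaps the
    test period, and embargoed when they start within `embargo` units
    right after the test period ends."""
    test_start, test_end = min(test_times), max(test_times)
    train = []
    for t in times:
        if t in test_times:
            continue
        # purge: label window [t, t + label_horizon] overlaps the test period
        overlaps = test_end >= t and t + label_horizon >= test_start
        # embargo: t falls just after the end of the test period
        embargoed = t > test_end and test_end + embargo >= t
        if not overlaps and not embargoed:
            train.append(t)
    return train
```

        <p>With days 0-9, a test period of days 4-5, a 2-day label horizon and a 1-day embargo, days 2-3 are purged, day 6 is embargoed, and only days 0, 1, 7, 8, 9 remain in the train set.</p>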
  </body>
  <back>
    <ref-list />
  </back>
</article>