<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Avoiding the Pitfalls on the Stock Market: Challenges and Solutions in Developing Quantitative Strategies</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marco Bergianti</string-name>
          <email>marco.bergianti@cristail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicola Cioffo</string-name>
          <email>nicola.cioffo@cristail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Angelo Porrello</string-name>
          <email>angelo.porrello@unimore.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesco Del Buono</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Cristail</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Modena and Reggio Emilia</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>1994</year>
      </pub-date>
      <volume>1</volume>
      <fpage>29</fpage>
      <lpage>31</lpage>
      <abstract>
        <p>Quantitative stock trading based on Machine Learning (ML) and Deep Learning (DL) has gained great attention in recent years thanks to the ever-increasing availability of financial data and the ability of this technology to analyze the complex dynamics of the stock market. Despite the plethora of approaches present in the literature, a large gap exists between the solutions produced by the scientific community and the practices adopted in real-world systems. Most of these works in fact lack a practical vision of the problem and ignore the main issues afflicting fintech practitioners. To fill such a gap, we provide a systematic review of the main dangers affecting the development of an ML/DL pipeline in the financial domain. They include managing the stochastic and non-stationary characteristics of stock data, various types of bias, model overfitting, and the design of impartial evaluation methods. Finally, we present possible solutions to these critical issues.</p>
      </abstract>
      <kwd-group>
        <kwd>Financial Markets</kwd>
        <kwd>Quantitative Trading Strategies</kwd>
        <kwd>Machine Learning</kwd>
        <kwd>Deep Learning</kwd>
        <kwd>Bias</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Quantitative trading is a field of financial investment which has undergone a massive renewal in recent years. Given the ever-increasing availability of financial data, traditional approaches based on human intuition have been gradually replaced by the most modern Machine Learning (ML) and Deep Learning (DL) methodologies, by virtue of their effectiveness in identifying hidden patterns with high predictive power.</p>
      <p>This technology, when applied in the financial domain, is mainly used to predict stock prices, their trends (i.e., positive or negative depending on whether stock prices are expected to increase or decrease) or directly the most profitable stocks. In the first two scenarios, regressors and classifiers are respectively employed to predict the future behavior of the stocks, while in the last case the model is trained to learn a ranking function that sorts stocks in descending order by expected profit. The outputs of these models are then exploited to select the top-k most profitable stocks and to build trading strategies.</p>
      <p>In the literature, a large variety of financial models have been proposed to solve these tasks. They can be classified into methods based on technical analysis (TA) and approaches based on fundamental analysis (FA) [1].</p>
      <p>In more detail, this paper will dive through the main macro-steps of a typical ML/DL pipeline, namely data preparation, featurization, modeling, and evaluation, which only a few works have already addressed [9, 10, 11]. For each of them we will explore the main challenges, and we will discuss some of the most adopted solutions.</p>
      <p>[Figure 1: the main pitfalls across the ML/DL pipeline. Data preparation: outliers and missing values, look-ahead bias, survival bias. Featurization: inhomogeneous series, non-stationarity, future time horizon, class distribution. Modeling: profit-driven optimization, stochasticity, covariates. Evaluation: time/serial correlations, overfitting.]</p>
      <p>The rest of this paper is organized as follows. Section 2 provides an overview of the dangers across the ML/DL pipeline. In Sections 3-6 we investigate, for each of the above steps, the main solutions to mitigate the relative critical issues. Finally, in Section 7 we sketch out some conclusions and future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Overview</title>
      <p>This section provides an overview of the main challenges that will be covered in this paper, following the main macro-steps of a typical ML/DL pipeline (see Figure 1).</p>
      <p>Data Preparation. Preparing financial data is a complex activity due to the presence of outliers, missing values and bias in the data. These mainly include look-ahead bias, survivorship bias, and dividend/split adjustment, which require ad-hoc procedures to avoid information leakage and erroneous predictions.</p>
      <p>Featurization. Designing financial supervised tasks includes both stock data featurization and label preparation. Featurization is needed to remove unwanted properties from raw stock price series, which exhibit non-homogeneity (i.e., values arrive with an irregular frequency) and non-stationarity (i.e., their statistical properties vary over time). Preparing financial data labels, on the other hand, mainly means managing imbalanced label distributions in classification scenarios and appropriately defining the prediction dates in regression scenarios (i.e., whether to set them statically or dynamically).</p>
      <p>Modeling. Designing financial models presents its own set of challenges, where stochasticity and the exploitation of stock relations are the most relevant aspects.</p>
      <p>Evaluation. The application of traditional ML/DL evaluation methods in the financial domain often results in inflated performance due to different forms of bias and data dependencies. Furthermore, ad-hoc countermeasures must be taken to handle model and backtest overfitting.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Data Preparation</title>
      <p>Preparing data for financial models is a crucial task as it requires handling incomplete and inaccurate data with different forms of bias. Indeed, biased data can lead to the development of ineffective trading strategies that underperform in the real market.</p>
      <sec id="sec-3-1">
        <title>3.1. Outliers and missing values</title>
        <p>Financial data frequently contains stocks that trade intermittently and outliers (e.g., price values that deviate strongly from average behavior), which can reveal abnormal patterns (e.g., abnormal returns). Managing these anomalies is much more pressing in the financial domain than in any other field, as financial decisions are often critical and profit-driven, i.e., even small errors can result in significant losses. Furthermore, they can negatively affect the training of ML/DL models, which acquire a distorted knowledge of the task. A possible solution to the first problem is to consider only the stocks that have been traded on more than a certain percentage of trading days (e.g., 98%), while the standard method to deal with outliers is to clip values within a specific range [12].</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Look-ahead bias</title>
        <p>Look-ahead bias occurs when a model uses information that would not have been available at inference time [8]. A generic approach to mitigate this problem is to implement out-of-sample testing, which involves dividing the data into two parts: one for model construction and one for validation. The model is trained on the first part of the data and then tested on the second, which helps avoid overfitting and yields a more accurate estimate of its performance.</p>
        <p>Despite the use of this technique, look-ahead bias may still emerge when processing adjusted price data and fundamental data. Adjusted prices, for example, are constantly updated based on the occurrence of a split or the payment of dividends. When such events occur, the whole past time series is corrected accordingly. For example, when a 2-for-1 stock split occurs, all prices before that date are halved. As a consequence, adjusted prices implicitly store information about future events and should be used with caution. To mitigate this problem, the yield series is preferred over the original series: it operates on percentage differences rather than on absolute values and is not affected by the bias produced by such corrections.</p>
        <p>When fundamental data is processed, instead, it is necessary to pay attention to its publication process. These documents are written on a certain date and subsequently corrected without updating the filing date, implicitly indicating that the new information was already known at the initial writing time of the document. Not considering this aspect means including future information in the historical data, which results in inflated performance.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Survivorship bias</title>
        <p>Survivorship bias occurs when the data used to train and test a model only includes the stocks that have survived until the present time, hence ignoring that some companies went bankrupt and their securities were delisted. This bias can result in an overestimation of the performance of the strategies, as they ignore the stocks that have gone bankrupt or been delisted [8, 13, 14]. Various solutions have been proposed in the literature to address this bias, such as including delisted securities in the analysis [15] or applying a survivorship bias correction method, which involves adjusting the returns of surviving securities to account for the returns of the delisted ones.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Featurization</title>
      <p>The data preparation phase is typically followed by a featurization phase, which aims at transforming the raw data in order to 1) highlight expressive patterns for the stock selection task and 2) obtain better statistical properties that facilitate processing through ML/DL. This procedure is mainly applied to raw stock price series, which exhibit unwanted properties such as non-homogeneity (i.e., values arrive with irregular frequency) and non-stationarity (i.e., their statistical properties vary over time).</p>
      <p>In this section, we present some solutions to these problems, distinguishing between solutions for the input (i.e., feature space) and the output (i.e., label space).</p>
      <sec id="sec-4-1">
        <title>4.1. Input</title>
        <p>A very popular category of stock selection approaches is based on technical analysis, which directly elaborates on numerical features like past prices and macroeconomic indicators. This type of data is affected by several problematic conditions that must be managed appropriately to create effective trading strategies.</p>
        <sec id="sec-4-1-1">
          <title>4.1.1. Inhomogeneous series</title>
          <p>In the literature, stock price series are typically time-indexed, i.e., their values are sampled at fixed time intervals. This represents the most intuitive choice, as it is consistent with sunlight cycles. Unfortunately, markets are operated by algorithms that trade with limited human supervision, for which CPU processing cycles are much more relevant than chronological intervals [16]. As a consequence, sampling information on a time basis results in oversampling during low-activity periods and undersampling during high-activity periods. Furthermore, time-sampled series often exhibit poor statistical properties, like serial correlation, heteroscedasticity, and non-normality of returns. To alleviate this problem, alternative forms of sampling have been proposed, such as volume bars, which collect information whenever a certain amount of stock units has been traded, or dollar bars, which sample data every time a pre-defined market value is exchanged.</p>
        </sec>
        <sec id="sec-4-1-2">
          <title>4.1.2. Non-stationarity</title>
          <p>Another undesired property of the raw stock price series is non-stationarity [17, 18], i.e., its statistical properties vary over time. This prevents the direct application of inferential analyses, as they operate exclusively on invariant processes. To circumvent this problem, the most adopted solution is to transform the raw price series into a yield series, where the absolute values of the prices are replaced by percentage variations. Although this transformation makes the series stationary, its drawback is that it removes memory from the data (i.e., it removes correlations between past and future observations), which is the main basis for the model's predictive power. Recent featurization methodologies based on fractionally differentiated features have been explored to obtain an effective trade-off between stationarity and memory [8].</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Output</title>
        <p>Parallel to the input featurization, the label space must be transformed coherently with the type of task to be solved (i.e., classification or regression).</p>
        <sec id="sec-4-2-1">
          <title>4.2.1. Class unbalanced distribution</title>
          <p>In a classification scenario, observations are typically labeled based on whether the return is positive or negative. However, this may produce unbalanced classes, as during market booms the probability of a positive return is much higher, while during market crashes it is lower [19]. This unbalanced distribution can introduce a bias in the model training by favoring the more frequent classes over the rarer ones. To avoid this condition, in [20] an asymmetric threshold assignment is used to balance the classes (e.g., samples with returns ≤ -0.5% and &gt; 0.55% are labeled as down and up, respectively).</p>
        </sec>
        <sec id="sec-4-2-2">
          <title>4.2.2. Fixed vs variable future time horizon</title>
          <p>A more specific concern of regression scenarios is the definition of the prediction time horizon, i.e., whether to determine it statically (e.g., using a fixed time interval) or dynamically (e.g., when certain events occur). Although the first category is more intuitive, several approaches based on variable time horizons are applied in the industry, e.g., based on the occurrence of significant price changes with respect to an average volatility. This is done to adhere to the dynamics of the market, where conditions for exiting a position are often defined through thresholds for profit-taking and stop-losses [8].</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Modeling</title>
      <p>Given stock features and related labels, the next step is to apply supervised approaches to learn hidden patterns in past data and acquire predictive capabilities on future data. Several challenges afflict the design of ML/DL models in the financial domain, such as the management of the stochastic nature of data (mainly in price series), the exploitation of correlations between stocks, and the correct definition of the model optimization function (e.g., to identify the most profitable stocks).</p>
      <sec id="sec-5-1">
        <title>5.1. Stochasticity</title>
        <p>Stock data have a chaotic and noisy nature: they are largely driven by new information and result in a random-walk pattern [20]. This random component can negatively impact the training process. Traditional supervised techniques are in fact designed to operate on clean data and are not capable of handling uncertain data. This has motivated an intense effort in the area of deep learning, leading to several solutions over the last few years. Among these, three categories of methods have been explored: 1) the adoption of ad-hoc loss functions, 2) the exploitation of adversarial training procedures, and 3) the construction of intrinsically probabilistic models.</p>
        <p>The intuition behind the first category is to model the output in probabilistic terms, estimating a probability distribution rather than relying on punctual targets. Quantile loss [21] and Gaussian loss [22] represent the main objective functions used in this category of methods. Adversarial training approaches instead try to manage the stochasticity by training the model to produce similar outputs for different variations of the same target input [18]. Finally, instead of using only deterministic features, generative models incorporate inherently probabilistic components. Variational auto-encoders (VAE) [23] are the best-known example of this component, and several stock selection approaches rely on them [20, 24].</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Covariates</title>
        <p>The stock market is also characterized by significant forms of correlation between stocks, e.g., stocks belonging to the same sector show similar patterns. Capturing these types of relationships is essential to better understand market dynamics and create effective trading strategies accordingly. Although initially most of the approaches proposed in the literature treated each stock as isolated for prediction, a new line of work is actively exploring the joint prediction of multiple stocks. Most of these works integrate graph neural networks [25] to model such correlations in static [26, 27] or dynamic (i.e., learned directly by the model) [17] graphs.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Profit-Driven Optimization</title>
        <p>Another aspect often overlooked in the design of ML/DL models in finance concerns the correct definition of the learning strategy according to the investment objective. Most of the approaches do not directly optimize the target of investment in terms of profit, even if they are interested in identifying the most profitable stocks. In other words, the stock selection task is typically formulated as a classification problem (to estimate the future trend of stocks) or a regression problem (to directly estimate the future price/return of stocks). However, correctly solving these tasks can lead to sub-optimal solutions in terms of profit [12, 4]. Consider the toy example shown in Table 1, where two regressors (R1 and R2) and two classifiers (C1, C2) are respectively used to predict the return and the trend of 5 stocks. As can be seen, the worst-performing models (i.e., R2 and C2) are able to select the most profitable top-1 stock compared to the best-performing methods (i.e., R1 and C1); note that in the regression the top-1 stock is selected based on the highest predicted return, while in the classification it is selected based on the highest probability of the positive trend. Following this direction, a new line of work has suggested adopting a ranking approach, which is closer to the problem of selecting the most profitable stocks [27, 17]. Instead of predicting, for example, the return of stocks (as in a regression task), the goal here is to sort the stocks by decreasing return. In this way, the stocks that perform better than others appear first in the ranking and are selected by a top-k-based trading strategy.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Evaluation</title>
      <p>The goal of the evaluation step in the financial domain is twofold. First, the predictive ability of the ML/DL model must be evaluated and, second, the performance of the trading strategy must be analyzed. The latter is built on top of the model's predictions and varies depending on the type of supervised task used for model training. In a classification scenario, the up and down predictions are interpreted as buy and sell signals. In the regression and ranking scenarios, the (top-k) stocks with the highest predictions are bought and those (top-k) with the lowest predictions are sold.</p>
      <p>To achieve this, different metrics and evaluation procedures for both tasks have been proposed. With regard to evaluation metrics, a distinction is made between model metrics and portfolio metrics, depending on whether they evaluate the model or the strategy. Commonly used portfolio metrics include return, Sharpe ratio, and Sortino ratio. Regarding the evaluation procedures, instead, an out-of-sample evaluation scheme is typically used to evaluate the effectiveness of the model (e.g., cross-validation is the most commonly adopted solution), while a backtesting technique is employed to analyze the performance of the trading strategy.</p>
      <p>However, there are still several problems that practitioners may encounter during the evaluation process. They arise mainly from the tendency of models to overfit and from the presence of serial correlation in the data.</p>
      <sec id="sec-6-1">
        <title>6.1. Time/Serial correlations</title>
        <p>Although most financial models are evaluated with standard cross-validation (i.e., an extension of out-of-sample evaluation to multiple train-test splits), it is not the ideal evaluation tool for financial data. This is due to the existence of various forms of temporal correlation in the data, which create leakages, or implicit overlaps between train and test data, compromising the reliability of the evaluation process. To mitigate this issue, a new cross-validation scheme has been proposed in [8], where purging and embargoing techniques are applied to remove such dependencies. More specifically, the purging technique removes from the train set all observations whose labels overlap in time with those included in the test set. In a task that predicts monthly stock returns, for example, this means creating a window of at least 30 days between train and test observations. On the other hand, embargoing creates a further gap between train and test sets when the latter precedes the train set in time. This is done to avoid that the test set contains information that is highly correlated with the next train set.</p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Overfitting</title>
        <p>A very common condition in financial machine learning is overfitting, i.e., the poor ability to generalize to new data. This condition mainly affects backtesting strategies, although it is also common in financial model training [18]. Backtest overfitting occurs when a strategy is over-optimized on a specific backtest scheme, resulting in poor performance if the backtest is changed. Most trading strategies are affected by this condition, as they are evaluated exclusively with the popular walk-forward (WF) scheme. With this procedure, the historical data is divided into two sets, the in-sample and out-of-sample periods. The strategy is developed and optimized during the in-sample period and evaluated during the out-of-sample period. The scheme is repeated by moving the in-sample and out-of-sample periods forward in time. Although this procedure has the advantage of providing a clear historical interpretation of the performance of a strategy, it has the disadvantage of testing a single scenario, obtained by splitting the data only in the forward direction. To mitigate this problem, Combinatorial Purged Cross-Validation (CPCV) has recently been proposed [8]. It modifies a traditional K-Fold cross-validation scheme by generating all possible combinations of train-test splits, using a group of more than one fold as test set and the remaining folds as train set, while purging train observations that contain leaked information. Unlike traditional cross-validation methods, the test sets are not used to compute performance metrics directly. Instead, they are divided into groups, each representing an independent evaluation path. In this way, multiple backtest paths are evaluated instead of a single one, reducing backtest overfitting.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusions</title>
      <p>In this paper, we have provided a systematic review of the main pitfalls afflicting fintech practitioners in developing stock selection strategies, and we have collected the main solutions used to mitigate them. Starting from the data preparation step, the most adopted practices are the use of clipping techniques to reduce abnormal patterns, the correct management of price-adjusted and fundamental data to avoid look-ahead bias, and the inclusion of delisted stocks to limit survivorship bias. In the featurization phase, the main solutions to manage the inhomogeneity and non-stationarity of stock series are the adoption of sampling techniques based on volume or dollar bars and the transformation of price series into yield series.</p>
    </sec>
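        <p>A minimal sketch of these two practices (keeping only sufficiently traded stocks and clipping prices into a range), in plain Python; the function names, data layout and thresholds are illustrative, not from the paper:</p>

```python
def liquid_tickers(prices_by_ticker, min_coverage=0.98):
    """Keep only tickers traded on more than `min_coverage` of days.
    `prices_by_ticker` maps ticker -> list of daily prices, with None
    marking days on which the stock did not trade."""
    kept = []
    for ticker, prices in prices_by_ticker.items():
        traded = sum(p is not None for p in prices)
        if traded / len(prices) > min_coverage:
            kept.append(ticker)
    return kept


def clip_series(prices, lo, hi):
    """Clip each price into [lo, hi] to tame outliers."""
    return [min(max(p, lo), hi) for p in prices]
```

        <p>For example, clip_series([10, 11, 500, 9], 5, 20) maps the outlier 500 to the upper bound 20 while leaving the other values untouched.</p>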
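        <p>A small sketch of why the yield series sidesteps the adjustment bias: percentage variations are invariant to the retroactive rescaling that a split or dividend adjustment applies to all past prices (toy numbers, plain Python):</p>

```python
def to_returns(prices):
    """Percentage variations between consecutive prices (the yield series)."""
    return [(b - a) / a for a, b in zip(prices, prices[1:])]


# A later 2-for-1 split retroactively halves every earlier adjusted price,
# but the return series computed from either version is identical:
raw = [10.0, 11.0, 12.1]
adjusted = [p / 2 for p in raw]
```

        <p>Here to_returns(raw) and to_returns(adjusted) both yield two steps of +10%, so the model never sees the future split leak into past values.</p>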
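        <p>A minimal sketch of dollar-bar sampling in plain Python (names and the bar threshold are illustrative; for simplicity the accumulator is reset rather than carrying the remainder into the next bar):</p>

```python
def dollar_bars(trades, bar_value):
    """Group (price, volume) trades into bars, closing a bar each time the
    accumulated traded dollar value reaches `bar_value`; returns the close
    price of each bar."""
    bars, accum = [], 0.0
    for price, volume in trades:
        accum += price * volume
        if accum >= bar_value:
            bars.append(price)  # close the bar at the last trade price
            accum = 0.0
    return bars
```

        <p>Volume bars follow the same scheme with accum += volume instead of accum += price * volume.</p>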
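        <p>A toy illustration of the overestimation, with entirely made-up numbers: averaging returns over survivors only inflates the estimate relative to the full universe that includes delisted names:</p>

```python
# Hypothetical annual returns: two survivors and two delisted stocks.
universe = {
    "SURV1": 0.08, "SURV2": 0.12,    # still listed today
    "DEAD1": -0.60, "DEAD2": -1.00,  # went bankrupt / delisted
}

survivors_only = [r for t, r in universe.items() if t.startswith("SURV")]
full = list(universe.values())

biased = sum(survivors_only) / len(survivors_only)  # +10% on survivors
unbiased = sum(full) / len(full)                    # -35% on the whole universe
```
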
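        <p>A sketch of fractional differentiation with a fixed-width window, assuming the standard binomial-expansion weights of the operator (1 - B)^d; with d = 1 and window 2 it reduces to the ordinary first difference, while fractional d keeps part of the series memory:</p>

```python
def fracdiff_weights(d, size):
    """Weights of the fractional difference operator (1 - B)^d:
    w_0 = 1 and w_k = -w_{k-1} * (d - k + 1) / k."""
    w = [1.0]
    for k in range(1, size):
        w.append(-w[-1] * (d - k + 1) / k)
    return w


def fracdiff(series, d, window):
    """Fixed-window fractionally differentiated series."""
    w = fracdiff_weights(d, window)
    out = []
    for i in range(window - 1, len(series)):
        out.append(sum(w[k] * series[i - k] for k in range(window)))
    return out
```

        <p>For instance, fracdiff(series, 1, 2) returns plain first differences, whereas d = 0.5 yields slowly decaying weights (1, -0.5, -0.125, ...) that retain memory of older observations.</p>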
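        <p>A minimal sketch of asymmetric-threshold labeling in plain Python, using the thresholds quoted from [20] (-0.5% and +0.55%); treating in-between returns as a neutral class is an assumption of this sketch:</p>

```python
def label_returns(returns, down_thr=-0.005, up_thr=0.0055):
    """Assign 'down'/'up' labels with asymmetric thresholds; returns
    falling between the two thresholds are labeled 'neutral'."""
    labels = []
    for r in returns:
        if down_thr >= r:
            labels.append("down")
        elif r > up_thr:
            labels.append("up")
        else:
            labels.append("neutral")
    return labels
```

        <p>Shifting the up threshold slightly higher than the magnitude of the down threshold compensates for the prevalence of positive returns during bullish periods.</p>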
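        <p>A sketch of a variable (event-driven) horizon: instead of labeling at a fixed future date, the position is closed at the first bar whose return from entry crosses a profit-taking or stop-loss threshold. Names and thresholds are illustrative:</p>

```python
def exit_index(prices, entry, take_profit=0.05, stop_loss=0.03):
    """Index of the first bar whose return from `entry` reaches
    +take_profit or -stop_loss; None if neither barrier is hit."""
    for i, p in enumerate(prices):
        ret = (p - entry) / entry
        if ret >= take_profit or -stop_loss >= ret:
            return i
    return None
```

        <p>The label (and its horizon) then depends on which barrier is touched first, mirroring how exit conditions are defined in practice.</p>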
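        <p>The mismatch between prediction error and profit can be reproduced with a tiny numeric example in the spirit of Table 1 (all numbers below are made up for illustration): the regressor with the lower MSE misprices the single best stock and therefore picks a worse top-1 investment.</p>

```python
def mse(pred, true):
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)


def top1(pred):
    """Index of the stock ranked first (highest predicted value)."""
    return max(range(len(pred)), key=lambda i: pred[i])


realized = [0.10, 0.02, -0.03, 0.01, 0.04]      # hypothetical future returns
model_a = [0.00, 0.021, -0.029, 0.011, 0.041]   # low error, misses the winner
model_b = [0.05, -0.05, -0.10, -0.05, -0.05]    # high error, ranks the winner first

# model_a has the lower MSE, yet model_b's top-1 pick earns more:
profit_a = realized[top1(model_a)]  # stock 4 -> 0.04
profit_b = realized[top1(model_b)]  # stock 0 -> 0.10
```
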
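        <p>The combinatorial part of CPCV can be sketched with the standard library: every combination of test folds is enumerated, and each fold therefore appears in several test sets, which is what produces multiple backtest paths (purging is omitted here and shown separately):</p>

```python
from itertools import combinations


def cpcv_splits(n_folds, n_test_folds):
    """All train/test fold assignments of Combinatorial Cross-Validation:
    each combination of `n_test_folds` folds serves once as test set."""
    folds = list(range(n_folds))
    splits = []
    for test in combinations(folds, n_test_folds):
        train = [f for f in folds if f not in test]
        splits.append((train, list(test)))
    return splits
```

        <p>With 6 folds and 2 test folds this yields 15 splits, and since each fold occurs in 5 test sets, the evaluations can be regrouped into 5 independent backtest paths.</p>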
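        <p>A simplified sketch of purging and embargoing over scalar time stamps (e.g., day numbers); the label of an observation at time t is assumed to span [t, t + label_horizon], and the sizes are illustrative:</p>

```python
def purged_train_times(times, label_horizon, test_times, embargo):
    """Train observations are purged when their label window overlaps the
    test period, and embargoed when they start within `embargo` units
    right after the test period ends."""
    test_start, test_end = min(test_times), max(test_times)
    train = []
    for t in times:
        if t in test_times:
            continue
        # purge: label window [t, t + label_horizon] overlaps the test period
        overlaps = test_end >= t and t + label_horizon >= test_start
        # embargo: t falls just after the end of the test period
        embargoed = t > test_end and test_end + embargo >= t
        if not overlaps and not embargoed:
            train.append(t)
    return train
```

        <p>With days 0-9, a test period of days 4-5, a 2-day label horizon and a 1-day embargo, days 2-3 are purged, day 6 is embargoed, and only days 0, 1, 7, 8, 9 remain in the train set.</p>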
  </body>
  <back>
    <ref-list />
  </back>
</article>