=Paper=
{{Paper
|id=None
|storemode=property
|title=Applying Neural Networks for Concept Drift Detection in Financial Markets
|pdfUrl=https://ceur-ws.org/Vol-960/paper9.pdf
|volume=Vol-960
}}
==Applying Neural Networks for Concept Drift Detection in Financial Markets==
Bruno Silva (DSI/ESTSetúbal, Instituto Politécnico de Setúbal, Portugal, bruno.silva@estsetubal.ips.pt), Nuno Marques (CITI and Departamento de Informática, FCT, Universidade Nova de Lisboa, Portugal, nmm@fct.unl.pt) and Gisele Panosso (ISEGI, Instituto Superior de Estatística e Gestão de Informação, Universidade Nova de Lisboa, Portugal, m2010147@isegi.unl.pt)

Abstract. Traditional stock market analysis is based on the assumption of stationary market behavior. The recent financial crisis was an example of the inappropriateness of such an assumption, namely through the presence of much higher variations than would normally be expected by traditional models. Data stream methods present an alternative for modeling the vast amounts of data arriving each day to a financial analyst. This paper discusses the use of a framework based on an artificial neural network that continuously monitors itself and allows the implementation of a multivariate, non-stationary financial model of market behavior. An initial study is performed over ten years of the Dow Jones Industrial Average index (DJI) and shows empirical evidence of concept drift in the multivariate financial statistics used to describe the index data stream.

1 INTRODUCTION

Data streams are generated naturally within several domains. Network monitoring, web mining, telecommunications data management, stock-market analysis and sensor data processing are applications that have vast amounts of data arriving continuously. In such applications the process may not be strictly stationary, i.e., the target concept may change over time. Concept drift means that the concept about which data is being collected may shift from time to time, each time after some minimum permanence [6].

In this paper we address the detection and analysis of concept drift in financial markets by employing a methodology based on Artificial Neural Networks (ANN). ANN are a set of biologically inspired algorithms and well-established data mining methods, popular for technical market analysis and price prediction. We are currently undertaking wider research on the use of ANN in Ubiquitous Data Mining. This work, in essence, is a real-world application of a mechanism to detect concept drift while processing data streams. The motivation for this approach in the financial field is easily explained. Mathematical finance has made wide use of normal distributions in stock market analysis to maximize return rates, i.e., it assumes stationary distributions, which are easier to understand and work well most of the time. However, this traditional approach neglects the heavy tails of the distributions, i.e., huge asset losses, and their weight in risk evaluation [11, 12]. This is where the detection of drifting from this normal behavior is of critical importance to reduce investment risk in the presence of a non-normal distribution of market events.

The main contributions of this work are: (i) a drift detection method based on the output of Adaptive Resonance Theory (ART) networks [7], which produce aggregations (or data synopses, in some literature) of d-dimensional data streams. These fast aggregations compress a possibly high-rate stream while maintaining the intrinsic relations within the data. A fixed sequence of consecutive aggregations is then analyzed to infer concept drift in the underlying distribution (Section 2); (ii) an application of the previous scheme to the stock market, namely to the Dow Jones Industrial index (DJI), using a stream built from a chosen set of statistical and technical indicators. The detection of concept drift is performed over an incoming stream of these observations (Section 3).

These contributions adhere to the impositions of the data stream models in [8], namely: the data points can only be accessed in the order in which they arrive; random access to the data is not allowed; memory is assumed to be small relative to the number of data points, thus only allowing a limited amount of information to be stored. Therefore, all of the additional indicators are computed using sliding windows, so that only a small subset of the data needs to be kept in memory. This is also true for the number of aggregations needed to compute the concept drift. At the end of the paper, Section 4, the results are discussed together with final conclusions.

2 METHODOLOGY

The presented methodology for drift detection comprises two modules. The first module uses an ART network that receives the incoming stream and produces aggregations, or data synopses, compressing the data and retaining the intrinsic relationships within the distribution (Section 2.1). This module feeds a second module that takes a fixed set of these aggregations and, through simple computations, produces an output that can be used to detect concept drift.

2.1 Online Data Aggregation

One should point out that algorithms operating on data streams are expected to produce "only" approximated models [6], since the data cannot be revisited to refine the generated models. The aggregation module is responsible for the online summarization of the incoming stream and processes the stream in blocks of size S. For each S observations, q representative prototypes of the data are created, where q ≪ S. This can be related to an incremental clustering process, here performed by an ART network. Each prototype is included in a tuple that stores other relevant information, such as the number of observations described by that particular prototype and the point in time that the prototype was last updated.
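The micro-cluster tuples and the block-wise, single-pass processing just described can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the class and function names are ours.

```python
# Illustrative sketch of the micro-cluster tuple M_j = {P_j, N_j, T_j}
# and of block-wise stream processing under the constraints in [8]:
# single pass over the data, bounded memory.
from dataclasses import dataclass
from typing import List

@dataclass
class MicroCluster:
    prototype: List[float]  # P_j: representative d-dimensional vector
    count: int              # N_j: number of observations it summarizes
    timestamp: int          # T_j: last point in time the prototype was updated

def process_stream(stream, block_size):
    """Consume the stream in blocks of S observations (single pass).

    A trailing partial block is dropped, since aggregation operates on
    full blocks of size S.
    """
    block = []
    for x in stream:
        block.append(x)
        if len(block) == block_size:
            yield block  # each full block is handed to the aggregation module
            block = []

# Example: a 2-dimensional stream of 8 observations, blocks of S = 4.
blocks = list(process_stream([[i, i + 1] for i in range(8)], 4))
```

Each yielded block would then be summarized into q ≪ S micro-clusters by the ART network described next.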
These data structures were popularized in [1] and called micro-clusters. Hence, we create q "weighted" prototypes of the data stored in tuples Q = {M_1, ..., M_j, ..., M_q}, each containing: a prototype of the data P_j; the number of input patterns N_j assigned to that prototype; and a timestamp T_j recording the point in time the prototype was last accessed; hence M_j = {P_j, N_j, T_j}. The prototype together with the number of inputs assigned to it (the weighting) is important to preserve the input space density if one is interested in creating offline models of the distribution. The timestamp allows the creation of models from specific intervals in time.

ART [7] is a family of neural networks that develop stable recognition categories (clusters) by self-organization in response to arbitrary sequences of input patterns. Its fast commitment mechanism and capability of learning at moderate speed guarantee a high efficiency. The common algorithm used for clustering in any kind of ART network is closely related to the k-means algorithm. Both use single prototypes to internally represent and dynamically adapt clusters. The k-means algorithm clusters a given set of input patterns into k groups; the parameter k thus specifies the coarseness of the partition. In contrast, ART uses a minimum required similarity between the patterns that are grouped within one cluster. The resulting number k of clusters then depends on the distances (in terms of the applied metric) between all input patterns presented to the network during training. This similarity parameter is called the vigilance ρ. K-means is a popular algorithm for clustering data streams, e.g., [4], but suffers from the problem that the initial k clusters have to be set either randomly or through other methods, which has a strong impact on the quality of the clustering process. ART networks do not suffer from this problem.

More formally, a data stream is a sequence of data items (observations) x_1, ..., x_i, ..., x_n such that the items are read once, in increasing order of the index i. If each observation contains a set of d-dimensional features, then a data stream is a sequence of vectors X_1^d, ..., X_i^d, ..., X_n^d. We employ an ART2-A [3] network, specially geared towards fast one-shot training, with an important modification given our goals: the network is constrained to a maximum of q prototypes. It shares the basic processing of all ART networks, which is based on competitive learning. ART requires the same input pattern size for all patterns, i.e., the dimension d of the input space where the cluster regions shall be placed. Starting with an empty set of prototypes P_1^d, ..., P_j^d, ..., P_q^d, each input pattern X_i^d is compared to the j stored prototypes in a search stage, in a winner-takes-all fashion. If the degree of similarity between the current input pattern and the best fitting prototype W_J is at least as high as the vigilance ρ, this prototype is chosen to represent the micro-cluster containing the input. The similarity between input pattern i and prototype j is given by Equation 1, where the distance is subtracted from one so that S_{X_i,P_j} = 1 if input and prototype are identical. The distance is normalized by the dimension d of the input vector, which keeps the measurement of similarity independent of the number of features.

    S_{X_i,P_j} = 1 - \sqrt{\frac{1}{d}\sum_{n=1}^{d}(X_{in} - P_{jn})^2}    (1)

The degree of similarity is limited to the range [0, 1]. If the similarity between the input pattern and the best matching prototype does not fit into the vigilance interval [ρ, 1], i.e., S_{X_i,P_j} < ρ, a new micro-cluster has to be created, with the current input used as the prototype initialization. Otherwise, if one of the previously committed prototypes (micro-clusters) matches the input pattern well enough, it is adapted by shifting the prototype's values towards the values of the input by the update rule in Equation 2.

    P_J^{(new)} = \eta \cdot X_i + (1 - \eta) \cdot P_J^{(old)}    (2)

A constant learning rate η ∈ [0, 1] is usually chosen to prevent the prototype P_J from moving too fast and thereby destabilizing the learning process. However, given our goals, i.e., to perform an adaptive vector quantization, we define η dynamically in such a way that the mean quantization error of the inputs represented by a prototype is minimized. Equation 3 establishes the dynamic value of η, where N_J is the current number of input patterns assigned to prototype J; with this choice, the update in Equation 2 computes the running mean, so the prototypes are expected to converge to the mean of their assigned input patterns.

    \eta = \frac{1}{N_J + 1}    (3)

This does not guarantee convergence to a local minimum; however, according to the adaptive vector quantization (AVQ) convergence theorem [2], AVQ can be viewed as a way to learn prototype vector patterns of real numbers, and it guarantees that average synaptic vectors converge to centroids exponentially quickly.

Another needed modification arises from the fact that ART networks, by design, form as many prototypes as needed, based on the vigilance value. At the extremes, ρ = 1 causes each unique input to be encoded by a separate prototype, whereas ρ = 0 causes all inputs to be represented by a single prototype. Therefore, for decreasing values of ρ, coarser prototypes are formed. However, obtaining exactly q prototypes solely through a manually tuned value of ρ is a very hard task, mainly because the input space density can change over time and also differs from application to application. To overcome this, we modify the ART2-A algorithm to impose a maximum of q prototypes and to adjust the vigilance parameter dynamically. We start with ρ = 1, so that a new micro-cluster is assigned to each arriving input vector. After learning an input vector, a verification is made to check whether q = j + 1, where j is the current number of stored micro-clusters. If this condition is met, then to keep only q micro-clusters we need to merge the nearest pair. Let T_{r,s} = \min\{\|P_r - P_s\|_2 : r, s = 1, ..., q, r \neq s\} be the minimum Euclidean distance between prototypes stored in micro-clusters M_r and M_s. We merge the two micro-clusters into one:

    M_{merge} = \{P_{merge}, N_r + N_s, \max\{T_r, T_s\}\}    (4)

with the new prototype being a "weighted" average of the previous two:

    P_{merge} = \frac{N_r}{N_r + N_s} P_r + \frac{N_s}{N_r + N_s} P_s    (5)

With d-dimensional input vectors, Equation 1 defines a hypersphere around any stored prototype with radius r = (1 - ρ) · \sqrt{d}. By solving this equation with respect to ρ, we update the vigilance parameter dynamically with Equation 6; hence ρ^{(new)} < ρ^{(old)} and the radius, consequently, increases.

    \rho^{(new)} = 1 - \frac{T_{r,s}}{\sqrt{d}}    (6)

Our experimental results show that this approach is effective in providing a summarization of the underlying distribution within the data streams; the inclusion of these results is out of the scope of this paper. We must point out that the aggregation module produces more information than is actually necessary for concept drift detection, namely the weighting of the prototypes and the timestamps. This module is an integral part of a larger framework that also generates offline models of the incoming stream for specific points in time.

2.2 Detecting Concept Drift

Our method assumes that if the underlying distribution is stationary, the error-rate of the learning algorithm will decrease as the number of samples increases [5]. Hence, we compute the quantization error at each aggregation phase of the ART network and track the changes of these errors over time. We use a queue B of b aggregation results, such that B = {Q_l, Q_{l-1}, ..., Q_{l-b+1}}, where Q_l is the last aggregation obtained.
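Before detailing the drift computation, the modified ART2-A aggregation of Section 2.1 (Equations 1–6) can be sketched as below. This is a minimal illustrative reconstruction, not the authors' implementation: all identifiers are ours, and the prototype update uses the incremental-mean form, consistent with the stated goal that prototypes converge to the mean of their assigned inputs.

```python
# Sketch of the modified ART2-A aggregation (Eqs. 1-6): at most q
# prototypes, vigilance rho relaxed on each merge. Illustrative only.
import math

def similarity(x, p):
    # Eq. 1: one minus the dimension-normalized Euclidean distance.
    d = len(x)
    return 1.0 - math.sqrt(sum((xi - pi) ** 2 for xi, pi in zip(x, p)) / d)

def aggregate(block, q):
    """Summarize one block into at most q micro-clusters [P_j, N_j, T_j]."""
    rho = 1.0      # start fully vigilant: every distinct input commits a prototype
    clusters = []  # each entry: [prototype, count, timestamp]
    for t, x in enumerate(block):
        d = len(x)
        # Search stage: winner-takes-all best matching prototype.
        best = max(range(len(clusters)),
                   key=lambda j: similarity(x, clusters[j][0]),
                   default=None)
        if best is not None and similarity(x, clusters[best][0]) >= rho:
            # Update rule (Eqs. 2-3): shift the winner towards the input,
            # in incremental-mean form so it tracks the mean of its inputs.
            p, n, _ = clusters[best]
            p = [(n * pi + xi) / (n + 1) for pi, xi in zip(p, x)]
            clusters[best] = [p, n + 1, t]
        else:
            clusters.append([list(x), 1, t])  # commit a new micro-cluster
            if len(clusters) == q + 1:        # q exceeded: merge nearest pair
                r, s, dmin = None, None, float("inf")
                for a in range(len(clusters)):
                    for b in range(a + 1, len(clusters)):
                        dd = math.dist(clusters[a][0], clusters[b][0])
                        if dd < dmin:
                            r, s, dmin = a, b, dd
                pr, nr, tr = clusters[r]
                ps, ns, ts = clusters[s]
                # Eqs. 4-5: count-weighted average prototype, latest timestamp.
                pm = [(nr * u + ns * v) / (nr + ns) for u, v in zip(pr, ps)]
                clusters = [c for i, c in enumerate(clusters) if i not in (r, s)]
                clusters.append([pm, nr + ns, max(tr, ts)])
                # Eq. 6: relax the vigilance so the coarser radius covers the pair.
                rho = min(rho, 1.0 - dmin / math.sqrt(d))
    return clusters
```

In the paper's setting, each 100-observation window of the normalized DJI stream would be summarized this way into q = 10 weighted prototypes.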
For each Q_l that arrives, we compute the average Euclidean distance between each prototype P_i in Q_l and the closest one in B_{l-1} = {Q_{l-1}, ..., Q_{l-b+1}}. Equation 7 formalizes this Average Quantization Error (AQE) computation for the l-th aggregation, where \| \cdot \|_2 is the Euclidean distance and q is, by definition, the number of prototypes in Q_l. This computes the error of the last aggregation in "quantizing" the previous aggregations at a particular point in time.

    AQE(l) = \frac{1}{q}\sum_{i=1}^{q} \min(\|P_i - P_j\|_2, \forall P_j \in B_{l-1})    (7)

By repeating this procedure over time, we obtain a series of errors that stabilizes and/or decreases when the underlying distribution is stationary, and that presents increases when the underlying distribution is changing, i.e., when concept drift is occurring. This series of errors is the drift curve. Larger values of b are used to detect abrupt changes in the underlying distribution, whereas to detect gradual concept drift a lower value should be adopted. We exemplify automatic concept drift detection on this drift curve using a moving average in Section 3.2.

3 APPLICATION TO DOW JONES INDUSTRIAL

We present an application of the previous methodology to the stock market, namely to the Dow Jones Industrial index (DJI). Instead of using the daily prices of the several stocks that compose the DJI, our approach uses the DJI daily index values themselves and other computed statistical and technical indicators, which are explained in Section 3.1. We make extensive use of moving averages, as they reduce the short-term volatility of time series and retain information from previous market events; another statistical indicator is the Hurst index [9], defined as a function to uncover changes in the direction of the trend of a set of values in time. We believe that these indicators, together with the index value, can provide a multivariate insight into hidden and subtle changes in the normality of financial events and can be used to assess the risk of investment at any point in time, thus lowering exposure to risk. This application makes use of data gathered in the period between the 1st of January of 2001 and the 31st of December of 2011, a total of 2767 observations.

3.1 Variable Selection and Generated Data Stream

The data gathered was composed of a set of technical variables, including different index values for one trading day, such as the Open, Close, High and Low values. From these we chose the lowest daily price (PX LOW) because it provides better insight into the risk of a fall. Another available technical indicator was the trading Volume. In terms of statistical indicators, we initially considered a large number of them, such as moving averages (MA) from 20 to 180 trading days, relative numbers, i.e., the DJI index value divided by moving averages (AVG), price fluctuation and the Hurst index. However, it was important to reduce the number of variables, because redundant variables can reduce the model efficiency. For this purpose we performed an analysis with the VARCLUS procedure (SAS/STAT), which can be used as a variable-reduction method: it divides a set of numeric variables into disjoint or hierarchical clusters through principal component analysis. All variables were treated as equally important. The output created by VARCLUS was used by the TREE procedure to draw a tree diagram of hierarchical clusters (SAS/STAT 9.1 User's Guide, p. 4797). The tree diagram is depicted in Figure 1.

Figure 1. Hierarchical clustering of variables produced by VARCLUS.

We can observe in the hierarchical clustering that the price variables and moving averages are correlated, so only PX LOW was chosen from Cluster 1. In Cluster 2 all variables were selected because, although they are correlated, they measure different characteristics. In the case of the relative numbers, different averages were selected because it is interesting to see the differences between short-, medium- and long-term analysis. Finally, in Cluster 3 and Cluster 4 only the Hurst index and the price fluctuation appeared; since they are not correlated with any other variable, these variables were included in the final data set.

Hence, the complete set of features in the data stream is the following:

PX LOW: minimum daily price;
PX VOLUME: volume of daily business;
IX HURST: Hurst index computed over 30 days;
IX CAP FLUTUATION: PX LOW(t) / PX LOW(t−1). This variable represents the price fluctuation over a one-day interval;
AVG 20: PX LOW / 20-day moving average. This variable represents the relative number of the current price divided by the 20-day moving average, showing whether the current price is cheap, average, expensive or really expensive. The same applies to the next indicators, but within other time frames;
AVG 30: PX LOW / 30-day moving average;
AVG 60: PX LOW / 60-day moving average;
AVG 100: PX LOW / 100-day moving average;
AVG 120: PX LOW / 120-day moving average;
AVG 180: PX LOW / 180-day moving average.

The dataset is depicted in Figure 2, where the behavior of all variables can be seen. This data is our data stream. The stream comprises 10 features, i.e., it is a 10-dimensional stream.

Figure 2. Variables of the data stream used in the presented application. It comprises technical and statistical indicators (description in text).

3.2 Concept Drift in the Dow Jones Industrial

The methodology presented in Section 2 was applied to the above data. It is converted into a data stream by taking the data input order as the order of the streaming. All features were previously normalized to the range [0, 1] so that they have equal importance in the Euclidean distances used to process them. The largest moving average indicator was computed over 180 days; therefore, only after the 180th observation can the stream be presented to the algorithm. However, since we are dealing with financial time series, it is important to retain the time dependency of the sequence of observations. Therefore, in this application, we use a sliding window of 100 trading days, i.e., approximately a trimester of trading, as the input to each aggregation phase (note that a year of trading has approximately 260 days). This means that the stream is processed in blocks of 100 observations that are kept in a queue: for each new observation that arrives, the oldest in the queue is discarded and the new one added. The parameterization used was the following:

Block size: S = 100;
Number of micro-clusters: q = 10;
Concept drift buffer size: b = 15.

The result of the procedure of Section 2.2 applied to the data stream is presented in Figure 3. Each point of the series corresponds to the error of the model for a particular trading day, thus providing possible indications of drifting. An overall curve shape indicating the drift over time can be seen. Since this drift is computed for every trading day, the "noise" around the curve is considered normal, as it is affected by the daily volatility of the index values. To obtain a "clean" curve, we apply a convolution filter along this drift series of the same size as b, i.e., 15 days. An alarm scheme is created through an empirical moving average of 60 days computed over the drift series. The cleaned drift curve and its moving average are depicted in Figure 4a).

Figure 3. Concept drift series obtained through the methodology in Section 2.2, computed for each trading day.

We then compare the differences between the drift series and its moving average, obtaining a line that oscillates around zero. We call this line the drift trend, shown in Figure 4b). Whenever the drift series has values lower than its moving average, we are in a descending trend; this is reflected in the drift trend by values lower than zero. Whenever the moving average is crossed by the drift series, it signals a shift in the trend and the drift trend crosses zero. This reasoning for detecting trends is also very popular in financial technical analysis. In this context, the 60-trading-day moving average reflects the intuitive notion of a long-term "decreasing" or "increasing" trend of the drift. All plots in Figure 4 are aligned in time for easy comparison. Figure 4c) shows the time series of PX LOW, i.e., the DJI index, which we compare to the drift detection performed.

Figure 4. a) Cleaned drift curve and its moving average. b) The drift trend curve, used to automatically detect drifting. c) The DJI index time series (PX LOW variable).

4 DISCUSSION AND CONCLUSIONS

Based on experiments, we found that a number of prototypes equal to a tenth of the number of observations is sufficient in most applications to represent them adequately; hence, q = 10. Using higher values of q did not improve the results and had the additional problem of increased computational time. Additionally, since we are interested in both abrupt and gradual drift detection, we used a moderately sized buffer of aggregations (b = 15) to compute the series of quantization errors. During our experiments we found that this value was appropriate for the established goals.

By inspecting Figure 4 and comparing the drift trend with the behavior of the DJI index, we can make two important observations: (i) the drift trend crossed zero before the market crash of 2008 (around day 1500).
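The drift computations of Sections 2.2 and 3.2 — the AQE of Equation 7 over a buffer of b aggregations, and the drift trend as the difference between the drift series and its 60-day moving average — can be sketched as follows. Again an illustrative reconstruction with our own identifiers, not the authors' code; prototypes are plain lists of floats.

```python
# Sketch of the drift-detection computations: AQE (Eq. 7), drift curve,
# moving average, and drift trend (series minus its moving average).
import math
from collections import deque

def aqe(Q_l, B_prev):
    # Eq. 7: mean distance from each prototype in Q_l to its nearest
    # prototype among the previous aggregations in the buffer.
    pool = [p for Q in B_prev for p in Q]
    return sum(min(math.dist(p, p2) for p2 in pool) for p in Q_l) / len(Q_l)

def drift_curve(aggregations, b):
    """One AQE value per aggregation, against a buffer of up to b-1 predecessors."""
    buffer, curve = deque(maxlen=b - 1), []
    for Q in aggregations:
        if buffer:
            curve.append(aqe(Q, list(buffer)))
        buffer.append(Q)
    return curve

def moving_average(series, w):
    return [sum(series[i - w + 1:i + 1]) / w for i in range(w - 1, len(series))]

def drift_trend(curve, w=60):
    # Negative values: drift series below its moving average (descending
    # trend); zero crossings signal a shift in the trend.
    ma = moving_average(curve, w)
    return [c - m for c, m in zip(curve[w - 1:], ma)]
```

With the paper's parameterization, `drift_curve` would receive one q = 10 prototype aggregation per trading day (b = 15), and `drift_trend` would use the 60-day window of the alarm scheme.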
It appears that the concept that was being learned changed some time before the crash occurred. (ii) It may be reasonable to assume that in periods of normality the long-term tendency of these indexes is upwards. One such period runs from the recovery after the 2002 market crash, i.e., the dot-com bubble, until the crash of 2008 (approximately between days 300 and 1300). During this period it is interesting to see that the drift trend was always below zero.

In the present work we have shown a methodology to detect concept drift in financial markets. We intend to apply this same methodology to intra-day trading as soon as possible, thus reinforcing the need for efficient processing of large volumes of data. The proposed methodology, applied over a data stream comprising carefully chosen technical and statistical indicators, seems promising in detecting changes in market events ahead of time, which can reduce the exposure to risk.

The characterization of the drifts, i.e., trying to understand what is really changing in the markets through inspection of hidden changes in the indicators, is reserved for future work. Work is under way on this subject, and we are using Self-Organizing Maps [10] to produce different mappings of the variables for particular segments in time, namely ones where the market seems to exhibit stable behavior, comparing them with others where it does not. These segments are obtained by segmenting time with the concept drift detection. As other immediate future work, we will apply this methodology to other indexes and perform the same study.

REFERENCES

[1] C.C. Aggarwal, J. Han, J. Wang, and P.S. Yu, 'A framework for clustering evolving data streams', in Proceedings of the 29th International Conference on Very Large Data Bases, volume 29, pp. 81–92. Morgan Kaufmann Publishers Inc., (2003).
[2] B. Kosko, Neural Networks and Fuzzy Systems: A Dynamical Systems Approach to Machine Intelligence, Prentice-Hall of India, 1997.
[3] G.A. Carpenter, S. Grossberg, and D.B. Rosen, 'ART 2-A: An adaptive resonance algorithm for rapid category learning and recognition', Neural Networks, 4(4), 493–504, (1991).
[4] F. Farnstrom, J. Lewis, and C. Elkan, 'Scalability for clustering algorithms revisited', ACM SIGKDD Explorations Newsletter, volume 2, pp. 51–57. ACM, (2000).
[5] J. Gama, P. Medas, G. Castillo, and P. Rodrigues, 'Learning with drift detection', Advances in Artificial Intelligence – SBIA 2004, 66–112, (2004).
[6] João Gama, Knowledge Discovery from Data Streams, Chapman & Hall/CRC Data Mining and Knowledge Discovery Series, 2010.
[7] S. Grossberg, 'Adaptive pattern classification and universal recoding: II. Feedback, expectation, olfaction, illusions', Biological Cybernetics, 23, (1976).
[8] Monika R. Henzinger, Prabhakar Raghavan, and Sridhar Rajagopalan, 'Computing on data streams', in External Memory Algorithms, 107–118, American Mathematical Society, Boston, MA, USA, (1999).
[9] H.E. Hurst, R.P. Black, and Y.M. Simaika, Long-Term Storage: An Experimental Study, Constable, 1965.
[10] T. Kohonen, 'Self-organized formation of topologically correct feature maps', Biological Cybernetics, 43(1), 59–69, (1982).
[11] B. Mandelbrot, R.L. Hudson, and E. Grunwald, 'The (mis)behaviour of markets', The Mathematical Intelligencer, 27(3), 77–79, (2005).
[12] N.N. Taleb, 'Errors, robustness, and the fourth quadrant', International Journal of Forecasting, 25(4), 744–759, (2009).