=Paper=
{{Paper
|id=None
|storemode=property
|title=Applying Neural Networks for Concept Drift Detection in Financial Markets
|pdfUrl=https://ceur-ws.org/Vol-960/paper9.pdf
|volume=Vol-960
}}
==Applying Neural Networks for Concept Drift Detection in Financial Markets==
Bruno Silva (DSI/ESTSetúbal, Instituto Politécnico de Setúbal, Portugal, bruno.silva@estsetubal.ips.pt), Nuno Marques (CITI and Departamento de Informática, FCT, Universidade Nova de Lisboa, Portugal, nmm@fct.unl.pt) and Gisele Panosso (ISEGI, Instituto Superior de Estatística e Gestão de Informação, Universidade Nova de Lisboa, Portugal, m2010147@isegi.unl.pt)

Abstract. Traditional stock market analysis is based on the assumption of stationary market behavior. The recent financial crisis was an example of the inappropriateness of such an assumption, namely through the presence of much higher variations than would normally be expected by traditional models. Data stream methods present an alternative for modeling the vast amounts of data arriving each day to a financial analyst. This paper discusses the use of a framework based on an artificial neural network that continuously monitors itself and allows the implementation of a multivariate, non-stationary financial model of market behavior. An initial study is performed over ten years of the Dow Jones Industrial Average index (DJI) and shows empirical evidence of concept drift in the multivariate financial statistics used to describe the index data stream.

1 INTRODUCTION

Data streams are generated naturally within several domains. Network monitoring, web mining, telecommunications data management, stock-market analysis and sensor data processing are applications that have vast amounts of data arriving continuously. In such applications the process may not be strictly stationary, i.e., the target concept may change over time. Concept drift means that the concept about which data is being collected may shift from time to time, each time after some minimum permanence [6].

In this paper we address the detection and analysis of concept drift in financial markets by employing a methodology based on Artificial Neural Networks (ANN). ANN are a set of biologically inspired algorithms and well-established data mining methods, popular for technical market analysis and price prediction. We are currently undertaking wider research on the use of ANN in Ubiquitous Data Mining. This work, in essence, is a real-world application of a mechanism to detect concept drift while processing data streams. The motivation for this approach in the financial field is easily explained. Mathematical finance has made wide use of normal distributions in stock market analysis to maximize return rates, i.e., it assumes stationary distributions, which are easier to understand and work well most of the time. However, this traditional approach neglects the heavy tails of the distributions, i.e., huge asset losses, and their weight in risk evaluation [11, 12]. This is where the detection of drifting from this normal behavior is of critical importance to reduce investment risk in the presence of a non-normal distribution of market events.

The main contributions of this work are: (i) a drift detection method based on the output of Adaptive Resonance Theory (ART) networks [7], which produce aggregations (or data synopses, in some literature) of d-dimensional data streams. These fast aggregations compress a possibly high-rate stream while maintaining the intrinsic relations within the data. A fixed sequence of consecutive aggregations is then analyzed to infer concept drift in the underlying distribution (Section 2); (ii) an application of the previous scheme to the stock market, namely to the Dow Jones Industrial index (DJI), using a stream built from a chosen set of statistical and technical indicators. The detection of concept drift is performed over an incoming stream of these observations (Section 3).

These contributions adhere to the impositions of the data stream models in [8], namely: the data points can only be accessed in the order in which they arrive; random access to the data is not allowed; memory is assumed to be small relative to the number of data points, thus only allowing a limited amount of information to be stored. Therefore, all of the additional indicators are computed using sliding windows, so that only a small subset of the data needs to be kept in memory. This is also true for the number of aggregations needed to compute the concept drift. At the end of the paper, Section 4, the results are discussed together with final conclusions.

2 METHODOLOGY

The presented methodology for drift detection comprises two modules. The first module uses an ART network that receives the incoming stream and produces aggregations, or data synopses, compressing the data and retaining the intrinsic relationships within the distribution (Section 2.1). This module feeds a second module that takes a fixed set of these aggregations and, through simple computations, produces an output that can be used to detect concept drift.

2.1 Online Data Aggregation

One should point out that algorithms operating on data streams are expected to produce "only" approximated models [6], since the data cannot be revisited to refine the generated models. The aggregation module is responsible for the online summarization of the incoming stream and processes the stream in blocks of size S. For each S observations, q representative prototypes of the data are created, where q ≪ S. This can be related to an incremental clustering process, here performed by an ART network. Each prototype is included in a tuple that stores other relevant information, such as the number of observations described by that particular prototype and the point in time that the prototype was last updated.
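The micro-cluster tuples and the block-wise, single-pass processing just described can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the class and function names are ours.

```python
# Illustrative sketch of the micro-cluster tuple M_j = {P_j, N_j, T_j}
# and of block-wise stream processing under the constraints in [8]:
# single pass over the data, bounded memory.
from dataclasses import dataclass
from typing import List

@dataclass
class MicroCluster:
    prototype: List[float]  # P_j: representative d-dimensional vector
    count: int              # N_j: number of observations it summarizes
    timestamp: int          # T_j: last point in time the prototype was updated

def process_stream(stream, block_size):
    """Consume the stream in blocks of S observations (single pass).

    A trailing partial block is dropped, since aggregation operates on
    full blocks of size S.
    """
    block = []
    for x in stream:
        block.append(x)
        if len(block) == block_size:
            yield block  # each full block is handed to the aggregation module
            block = []

# Example: a 2-dimensional stream of 8 observations, blocks of S = 4.
blocks = list(process_stream([[i, i + 1] for i in range(8)], 4))
```

Each yielded block would then be summarized into q ≪ S micro-clusters by the ART network described next.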
These data structures were popularized in [1] and called micro-clusters. Hence, we create q "weighted" prototypes of the data stored in tuples Q = {M_1, ..., M_j, ..., M_q}, each containing: a prototype of the data P_j; the number of input patterns N_j assigned to that prototype; and a timestamp T_j recording the point in time the prototype was last accessed; hence M_j = {P_j, N_j, T_j}. The prototype together with the number of inputs assigned to it (the weighting) is important to preserve the input space density if one is interested in creating offline models of the distribution. The timestamp allows the creation of models from specific intervals in time.

ART [7] is a family of neural networks that develop stable recognition categories (clusters) by self-organization in response to arbitrary sequences of input patterns. Its fast commitment mechanism and capability of learning at moderate speed guarantee a high efficiency. The common algorithm used for clustering in any kind of ART network is closely related to the k-means algorithm. Both use single prototypes to internally represent and dynamically adapt clusters. The k-means algorithm clusters a given set of input patterns into k groups; the parameter k thus specifies the coarseness of the partition. In contrast, ART uses a minimum required similarity between the patterns that are grouped within one cluster. The resulting number k of clusters then depends on the distances (in terms of the applied metric) between all input patterns presented to the network during training. This similarity parameter is called the vigilance ρ. K-means is a popular algorithm for clustering data streams, e.g., [4], but suffers from the problem that the initial k clusters have to be set either randomly or through other methods, which has a strong impact on the quality of the clustering process. ART networks do not suffer from this problem.

More formally, a data stream is a sequence of data items (observations) x_1, ..., x_i, ..., x_n such that the items are read once, in increasing order of the index i. If each observation contains a set of d-dimensional features, then a data stream is a sequence of vectors X_1^d, ..., X_i^d, ..., X_n^d. We employ an ART2-A [3] network, specially geared towards fast one-shot training, with an important modification given our goals: the network is constrained to a maximum of q prototypes. It shares the basic processing of all ART networks, which is based on competitive learning. ART requires the same input pattern size for all patterns, i.e., the dimension d of the input space where the cluster regions shall be placed. Starting with an empty set of prototypes P_1^d, ..., P_j^d, ..., P_q^d, each input pattern X_i^d is compared to the j stored prototypes in a search stage, in a winner-takes-all fashion. If the degree of similarity between the current input pattern and the best fitting prototype W_J is at least as high as the vigilance ρ, this prototype is chosen to represent the micro-cluster containing the input. The similarity between input pattern i and prototype j is given by Equation 1, where the distance is subtracted from one so that S_{X_i,P_j} = 1 if input and prototype are identical. The distance is normalized by the dimension d of the input vector, which keeps the measurement of similarity independent of the number of features.

    S_{X_i,P_j} = 1 - \sqrt{\frac{1}{d}\sum_{n=1}^{d}(X_{in} - P_{jn})^2}    (1)

The degree of similarity is limited to the range [0, 1]. If the similarity between the input pattern and the best matching prototype does not fit into the vigilance interval [ρ, 1], i.e., S_{X_i,P_j} < ρ, a new micro-cluster has to be created, with the current input used as the prototype initialization. Otherwise, if one of the previously committed prototypes (micro-clusters) matches the input pattern well enough, it is adapted by shifting the prototype's values towards the values of the input by the update rule in Equation 2.

    P_J^{(new)} = \eta \cdot X_i + (1 - \eta) \cdot P_J^{(old)}    (2)

A constant learning rate η ∈ [0, 1] is usually chosen to prevent the prototype P_J from moving too fast and thereby destabilizing the learning process. However, given our goals, i.e., to perform an adaptive vector quantization, we define η dynamically in such a way that the mean quantization error of the inputs represented by a prototype is minimized. Equation 3 establishes the dynamic value of η, where N_J is the current number of input patterns assigned to prototype J; with this choice, the update in Equation 2 computes the running mean, so the prototypes are expected to converge to the mean of their assigned input patterns.

    \eta = \frac{1}{N_J + 1}    (3)

This does not guarantee convergence to a local minimum; however, according to the adaptive vector quantization (AVQ) convergence theorem [2], AVQ can be viewed as a way to learn prototype vector patterns of real numbers, and it guarantees that average synaptic vectors converge to centroids exponentially quickly.

Another needed modification arises from the fact that ART networks, by design, form as many prototypes as needed, based on the vigilance value. At the extremes, ρ = 1 causes each unique input to be encoded by a separate prototype, whereas ρ = 0 causes all inputs to be represented by a single prototype. Therefore, for decreasing values of ρ, coarser prototypes are formed. However, obtaining exactly q prototypes solely through a manually tuned value of ρ is a very hard task, mainly because the input space density can change over time and also differs from application to application. To overcome this, we modify the ART2-A algorithm to impose a maximum of q prototypes and to adjust the vigilance parameter dynamically. We start with ρ = 1, so that a new micro-cluster is assigned to each arriving input vector. After learning an input vector, a verification is made to check whether q = j + 1, where j is the current number of stored micro-clusters. If this condition is met, then to keep only q micro-clusters we need to merge the nearest pair. Let T_{r,s} = \min\{\|P_r - P_s\|_2 : r, s = 1, ..., q, r \neq s\} be the minimum Euclidean distance between prototypes stored in micro-clusters M_r and M_s. We merge the two micro-clusters into one:

    M_{merge} = \{P_{merge}, N_r + N_s, \max\{T_r, T_s\}\}    (4)

with the new prototype being a "weighted" average of the previous two:

    P_{merge} = \frac{N_r}{N_r + N_s} P_r + \frac{N_s}{N_r + N_s} P_s    (5)

With d-dimensional input vectors, Equation 1 defines a hypersphere around any stored prototype with radius r = (1 - ρ) · \sqrt{d}. By solving this equation with respect to ρ, we update the vigilance parameter dynamically with Equation 6; hence ρ^{(new)} < ρ^{(old)} and the radius, consequently, increases.

    \rho^{(new)} = 1 - \frac{T_{r,s}}{\sqrt{d}}    (6)

Our experimental results show that this approach is effective in providing a summarization of the underlying distribution within the data streams; the inclusion of these results is out of the scope of this paper. We must point out that the aggregation module produces more information than is actually necessary for concept drift detection, namely the weighting of the prototypes and the timestamps. This module is an integral part of a larger framework that also generates offline models of the incoming stream for specific points in time.

2.2 Detecting Concept Drift

Our method assumes that if the underlying distribution is stationary, the error-rate of the learning algorithm will decrease as the number of samples increases [5]. Hence, we compute the quantization error at each aggregation phase of the ART network and track the changes of these errors over time. We use a queue B of b aggregation results, such that B = {Q_l, Q_{l-1}, ..., Q_{l-b+1}}, where Q_l is the last aggregation obtained.
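Before detailing the drift computation, the modified ART2-A aggregation of Section 2.1 (Equations 1–6) can be sketched as below. This is a minimal illustrative reconstruction, not the authors' implementation: all identifiers are ours, and the prototype update uses the incremental-mean form, consistent with the stated goal that prototypes converge to the mean of their assigned inputs.

```python
# Sketch of the modified ART2-A aggregation (Eqs. 1-6): at most q
# prototypes, vigilance rho relaxed on each merge. Illustrative only.
import math

def similarity(x, p):
    # Eq. 1: one minus the dimension-normalized Euclidean distance.
    d = len(x)
    return 1.0 - math.sqrt(sum((xi - pi) ** 2 for xi, pi in zip(x, p)) / d)

def aggregate(block, q):
    """Summarize one block into at most q micro-clusters [P_j, N_j, T_j]."""
    rho = 1.0      # start fully vigilant: every distinct input commits a prototype
    clusters = []  # each entry: [prototype, count, timestamp]
    for t, x in enumerate(block):
        d = len(x)
        # Search stage: winner-takes-all best matching prototype.
        best = max(range(len(clusters)),
                   key=lambda j: similarity(x, clusters[j][0]),
                   default=None)
        if best is not None and similarity(x, clusters[best][0]) >= rho:
            # Update rule (Eqs. 2-3): shift the winner towards the input,
            # in incremental-mean form so it tracks the mean of its inputs.
            p, n, _ = clusters[best]
            p = [(n * pi + xi) / (n + 1) for pi, xi in zip(p, x)]
            clusters[best] = [p, n + 1, t]
        else:
            clusters.append([list(x), 1, t])  # commit a new micro-cluster
            if len(clusters) == q + 1:        # q exceeded: merge nearest pair
                r, s, dmin = None, None, float("inf")
                for a in range(len(clusters)):
                    for b in range(a + 1, len(clusters)):
                        dd = math.dist(clusters[a][0], clusters[b][0])
                        if dd < dmin:
                            r, s, dmin = a, b, dd
                pr, nr, tr = clusters[r]
                ps, ns, ts = clusters[s]
                # Eqs. 4-5: count-weighted average prototype, latest timestamp.
                pm = [(nr * u + ns * v) / (nr + ns) for u, v in zip(pr, ps)]
                clusters = [c for i, c in enumerate(clusters) if i not in (r, s)]
                clusters.append([pm, nr + ns, max(tr, ts)])
                # Eq. 6: relax the vigilance so the coarser radius covers the pair.
                rho = min(rho, 1.0 - dmin / math.sqrt(d))
    return clusters
```

In the paper's setting, each 100-observation window of the normalized DJI stream would be summarized this way into q = 10 weighted prototypes.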
For each Q_l that arrives, we compute the average Euclidean distance between each prototype P_i in Q_l and the closest one in B_{l-1} = {Q_{l-1}, ..., Q_{l-b+1}}. Equation 7 formalizes this Average Quantization Error (AQE) computation for the l-th aggregation, where \| \cdot \|_2 is the Euclidean distance and q is, by definition, the number of prototypes in Q_l. This computes the error of the last aggregation in "quantizing" the previous aggregations at a particular point in time.

    AQE(l) = \frac{1}{q}\sum_{i=1}^{q} \min(\|P_i - P_j\|_2, \forall P_j \in B_{l-1})    (7)

By repeating this procedure over time, we obtain a series of errors that stabilizes and/or decreases when the underlying distribution is stationary, and that presents increases when the underlying distribution is changing, i.e., when concept drift is occurring. This series of errors is the drift curve. Larger values of b are used to detect abrupt changes in the underlying distribution, whereas to detect gradual concept drift a lower value should be adopted. We exemplify automatic concept drift detection on this drift curve using a moving average in Section 3.2.

3 APPLICATION TO DOW JONES INDUSTRIAL

We present an application of the previous methodology to the stock market, namely to the Dow Jones Industrial index (DJI). Instead of using the daily prices of the several stocks that compose the DJI, our approach uses the DJI daily index values themselves and other computed statistical and technical indicators, which are explained in Section 3.1. We make extensive use of moving averages, as they reduce the short-term volatility of time series and retain information from previous market events; another statistical indicator is the Hurst index [9], defined as a function to uncover changes in the direction of the trend of a set of values in time. We believe that these indicators, together with the index value, can provide a multivariate insight into hidden and subtle changes in the normality of financial events and can be used to assess the risk of investment at any point in time, thus lowering exposure to risk. This application makes use of data gathered in the period between the 1st of January of 2001 and the 31st of December of 2011, a total of 2767 observations.

3.1 Variable Selection and Generated Data Stream

The data gathered was composed of a set of technical variables, including different index values for one trading day, such as the Open, Close, High and Low values. From these we chose the lowest daily price (PX LOW) because it provides better insight into the risk of a fall. Another available technical indicator was the trading Volume. In terms of statistical indicators, we initially considered a large number of them, such as moving averages (MA) from 20 to 180 trading days, relative numbers, i.e., the DJI index value divided by moving averages (AVG), price fluctuation and the Hurst index. However, it was important to reduce the number of variables, because redundant variables can reduce the model efficiency. For this purpose we performed an analysis with the VARCLUS procedure (SAS/STAT), which can be used as a variable-reduction method: it divides a set of numeric variables into disjoint or hierarchical clusters through principal component analysis. All variables were treated as equally important. The output created by VARCLUS was used by the TREE procedure to draw a tree diagram of hierarchical clusters (SAS/STAT 9.1 User's Guide, p. 4797). The tree diagram is depicted in Figure 1.

Figure 1. Hierarchical clustering of variables produced by VARCLUS.

We can observe in the hierarchical clustering that the price variables and moving averages are correlated, so only PX LOW was chosen from Cluster 1. In Cluster 2 all variables were selected because, although they are correlated, they measure different characteristics. In the case of the relative numbers, different averages were selected because it is interesting to see the differences between short-, medium- and long-term analysis. Finally, in Cluster 3 and Cluster 4 only the Hurst index and the price fluctuation appeared; since they are not correlated with any other variable, these variables were included in the final data set.

Hence, the complete set of features in the data stream is the following:

PX LOW: minimum daily price;
PX VOLUME: volume of daily business;
IX HURST: Hurst index computed over 30 days;
IX CAP FLUTUATION: PX LOW(t) / PX LOW(t−1). This variable represents the price fluctuation over a one-day interval;
AVG 20: PX LOW / 20-day moving average. This variable represents the relative number of the current price divided by the 20-day moving average, showing whether the current price is cheap, average, expensive or really expensive. The same applies to the next indicators, but within other time frames;
AVG 30: PX LOW / 30-day moving average;
AVG 60: PX LOW / 60-day moving average;
AVG 100: PX LOW / 100-day moving average;
AVG 120: PX LOW / 120-day moving average;
AVG 180: PX LOW / 180-day moving average.

The dataset is depicted in Figure 2, where the behavior of all variables can be seen. This data is our data stream. The stream comprises 10 features, i.e., it is a 10-dimensional stream.

Figure 2. Variables of the data stream used in the presented application. It comprises technical and statistical indicators (description in text).

3.2 Concept Drift in the Dow Jones Industrial

The methodology presented in Section 2 was applied to the above data. It is converted into a data stream by taking the data input order as the order of the streaming. All features were previously normalized to the range [0, 1] so that they have equal importance in the Euclidean distances used to process them. The largest moving average indicator was computed over 180 days; therefore, only after the 180th observation can the stream be presented to the algorithm. However, since we are dealing with financial time series, it is important to retain the time dependency of the sequence of observations. Therefore, in this application, we use a sliding window of 100 trading days, i.e., approximately a trimester of trading, as the input to each aggregation phase (note that a year of trading has approximately 260 days). This means that the stream is processed in blocks of 100 observations that are kept in a queue: for each new observation that arrives, the oldest in the queue is discarded and the new one added. The parameterization used was the following:

Block size: S = 100;
Number of micro-clusters: q = 10;
Concept drift buffer size: b = 15.

The result of the procedure of Section 2.2 applied to the data stream is presented in Figure 3. Each point of the series corresponds to the error of the model for a particular trading day, thus providing possible indications of drifting. An overall curve shape indicating the drift over time can be seen. Since this drift is computed for every trading day, the "noise" around the curve is considered normal, as it is affected by the daily volatility of the index values. To obtain a "clean" curve, we apply a convolution filter along this drift series of the same size as b, i.e., 15 days. An alarm scheme is created through an empirical moving average of 60 days computed over the drift series. The cleaned drift curve and its moving average are depicted in Figure 4a).

Figure 3. Concept drift series obtained through the methodology in Section 2.2, computed for each trading day.

We then compare the differences between the drift series and its moving average, obtaining a line that oscillates around zero. We call this line the drift trend, shown in Figure 4b). Whenever the drift series has values lower than its moving average, we are in a descending trend; this is reflected in the drift trend by values lower than zero. Whenever the moving average is crossed by the drift series, it signals a shift in the trend and the drift trend crosses zero. This reasoning for detecting trends is also very popular in financial technical analysis. In this context, the 60-trading-day moving average reflects the intuitive notion of a long-term "decreasing" or "increasing" trend of the drift. All plots in Figure 4 are aligned in time for easy comparison. Figure 4c) shows the time series of PX LOW, i.e., the DJI index, which we compare to the drift detection performed.

Figure 4. a) Cleaned drift curve and its moving average. b) The drift trend curve, used to automatically detect drifting. c) The DJI index time series (PX LOW variable).

4 DISCUSSION AND CONCLUSIONS

Based on experiments, we found that a number of prototypes equal to a tenth of the number of observations is sufficient in most applications to represent them adequately; hence, q = 10. Using higher values of q did not improve the results and had the additional problem of increased computational time. Additionally, since we are interested in both abrupt and gradual drift detection, we used a moderately sized buffer of aggregations (b = 15) to compute the series of quantization errors. During our experiments we found that this value was appropriate for the established goals.

By inspecting Figure 4 and comparing the drift trend with the behavior of the DJI index, we can make two important observations: (i) the drift trend crossed zero before the market crash of 2008 (around day 1500).
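The drift computations of Sections 2.2 and 3.2 — the AQE of Equation 7 over a buffer of b aggregations, and the drift trend as the difference between the drift series and its 60-day moving average — can be sketched as follows. Again an illustrative reconstruction with our own identifiers, not the authors' code; prototypes are plain lists of floats.

```python
# Sketch of the drift-detection computations: AQE (Eq. 7), drift curve,
# moving average, and drift trend (series minus its moving average).
import math
from collections import deque

def aqe(Q_l, B_prev):
    # Eq. 7: mean distance from each prototype in Q_l to its nearest
    # prototype among the previous aggregations in the buffer.
    pool = [p for Q in B_prev for p in Q]
    return sum(min(math.dist(p, p2) for p2 in pool) for p in Q_l) / len(Q_l)

def drift_curve(aggregations, b):
    """One AQE value per aggregation, against a buffer of up to b-1 predecessors."""
    buffer, curve = deque(maxlen=b - 1), []
    for Q in aggregations:
        if buffer:
            curve.append(aqe(Q, list(buffer)))
        buffer.append(Q)
    return curve

def moving_average(series, w):
    return [sum(series[i - w + 1:i + 1]) / w for i in range(w - 1, len(series))]

def drift_trend(curve, w=60):
    # Negative values: drift series below its moving average (descending
    # trend); zero crossings signal a shift in the trend.
    ma = moving_average(curve, w)
    return [c - m for c, m in zip(curve[w - 1:], ma)]
```

With the paper's parameterization, `drift_curve` would receive one q = 10 prototype aggregation per trading day (b = 15), and `drift_trend` would use the 60-day window of the alarm scheme.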
It appears that the concept that was being learned changed some time before the crash occurred. (ii) It may be reasonable to assume that in periods of normality the long-term tendency of these indexes is upwards. One such period runs from the recovery after the 2002 market crash, i.e., the dot-com bubble, until the crash of 2008 (approximately between days 300 and 1300). During this period it is interesting to see that the drift trend was always below zero.

In the present work we have shown a methodology to detect concept drift in financial markets. We intend to apply this same methodology to intra-day trading as soon as possible, thus reinforcing the need for efficient processing of large volumes of data. The proposed methodology, applied over a data stream comprising carefully chosen technical and statistical indicators, seems promising in detecting changes in market events ahead of time, which can reduce the exposure to risk.

The characterization of the drifts, i.e., trying to understand what is really changing in the markets through inspection of hidden changes in the indicators, is reserved for future work. Work is under way on this subject, and we are using Self-Organizing Maps [10] to produce different mappings of the variables for particular segments in time, namely ones where the market seems to exhibit stable behavior, comparing them with others where it does not. These segments are obtained by segmenting time with the concept drift detection. As other immediate future work, we will apply this methodology to other indexes and perform the same study.

REFERENCES

[1] C.C. Aggarwal, J. Han, J. Wang, and P.S. Yu, 'A framework for clustering evolving data streams', in Proceedings of the 29th International Conference on Very Large Data Bases, volume 29, pp. 81–92. Morgan Kaufmann Publishers Inc., (2003).
[2] B. Kosko, Neural Networks and Fuzzy Systems: A Dynamical Systems Approach to Machine Intelligence, Prentice-Hall of India, 1997.
[3] G.A. Carpenter, S. Grossberg, and D.B. Rosen, 'ART 2-A: An adaptive resonance algorithm for rapid category learning and recognition', Neural Networks, 4(4), 493–504, (1991).
[4] F. Farnstrom, J. Lewis, and C. Elkan, 'Scalability for clustering algorithms revisited', ACM SIGKDD Explorations Newsletter, volume 2, pp. 51–57. ACM, (2000).
[5] J. Gama, P. Medas, G. Castillo, and P. Rodrigues, 'Learning with drift detection', Advances in Artificial Intelligence – SBIA 2004, 66–112, (2004).
[6] João Gama, Knowledge Discovery from Data Streams, Chapman & Hall/CRC Data Mining and Knowledge Discovery Series, 2010.
[7] S. Grossberg, 'Adaptive pattern classification and universal recoding: II. Feedback, expectation, olfaction, illusions', Biological Cybernetics, 23, (1976).
[8] Monika R. Henzinger, Prabhakar Raghavan, and Sridhar Rajagopalan, 'Computing on data streams', in External Memory Algorithms, 107–118, American Mathematical Society, Boston, MA, USA, (1999).
[9] H.E. Hurst, R.P. Black, and Y.M. Simaika, Long-Term Storage: An Experimental Study, Constable, 1965.
[10] T. Kohonen, 'Self-organized formation of topologically correct feature maps', Biological Cybernetics, 43(1), 59–69, (1982).
[11] B. Mandelbrot, R.L. Hudson, and E. Grunwald, 'The (mis)behaviour of markets', The Mathematical Intelligencer, 27(3), 77–79, (2005).
[12] N.N. Taleb, 'Errors, robustness, and the fourth quadrant', International Journal of Forecasting, 25(4), 744–759, (2009).