Dimensionality reduction for financial data visualization

Jelena Zubova
Institute of Mathematics and Informatics, Vilnius University, Vilnius, Lithuania
e-mail: jelena.zubova@mii.vu.lt

Olga Kurasova
Institute of Mathematics and Informatics, Vilnius University, Vilnius, Lithuania
e-mail: olga.kurasova@mii.vu.lt

Marius Liutvinavičius
Kaunas Faculty, Vilnius University, Kaunas, Lithuania
e-mail: marius.liutvinavicius@khf.vu.lt

Abstract—Various data mining methods are used for examining large financial data sets to uncover hidden and useful information. The ability to access big data sources raises new challenges related to the capability to handle such enormous amounts of data. This research focuses on big financial data visualization based on dimensionality reduction methods. We use a data set that contains financial ratios of stocks traded on the NASDAQ stock exchange. A brief overview of the most popular dimensionality reduction and visualization methods is presented in this paper. We also show how to adjust the algorithms of these methods for parallel computing. MPI technology is applied in a computer cluster to perform dimensionality reduction. The results show that the Random projection and Multidimensional scaling methods can effectively classify data and find the most promising stocks.

Keywords—financial data; dimensionality reduction; visualization

I. INTRODUCTION

We use big data analytics for examining large financial data sets to uncover hidden and useful information. The ability to access new data sources provides many opportunities, for example, creating new investor sentiment indexes from online social media streams. But it also raises new challenges related to the capability to handle such enormous amounts of data. We need effective methods and powerful environments to complete large and complex tasks. This research focuses on big financial data visualization based on dimensionality reduction methods. We use a data set that contains financial ratios of stocks traded on the NASDAQ stock exchange. Our goal is to find the most effective ways to analyze and visualize such data.

In the second section we present a brief overview of the most popular dimensionality reduction and visualization methods. In the third section we show how to adjust the algorithms of these methods for parallel computing; MPI technology is used in a computer cluster to perform dimensionality reduction. In the last section we present the classification and visualization results obtained using the Random projection and Multidimensional scaling methods. The current data set is relatively small, but the results suggest which techniques to choose for future work with big online data streams.

II. DIMENSIONALITY REDUCTION METHODS

Detailed reviews of dimensionality reduction methods were given by I. K. Fodor (2002) [3], M. Mizuta (2004) [11], and C. O. S. Sorzano, J. Vargas et al. (2014) [5]. In this section we present a brief summary of the most popular methods in order to choose the one best suited to our case.

Dimensionality reduction refers to the process of taking a data set with a usually large number of dimensions and creating a new data set with fewer dimensions, which are meant to capture only those dimensions that are in some sense "important". The idea is to preserve as much "structure" of the data as possible while reducing the number of dimensions [6].

The demand for such methods arises because various sources generate enormous amounts of data, e.g. laboratory instruments can report thousands of measurements for a single experiment, and statistical methods face challenging tasks when dealing with such high-dimensional data. However, according to C. O. S. Sorzano, J. Vargas et al. (2014) [5], much of the data is highly redundant and can be efficiently brought down to a much smaller number of variables without a significant loss of information.

Fig. 1. Dimensionality reduction method types
A. Principal Component Analysis (PCA)

As long as data have a near-linear structure, the singularities of the data can be pointed out using Principal Component Analysis (PCA) [11]. PCA is by far one of the most popular algorithms for dimensionality reduction [5].

PCA finds components that make the projections uncorrelated by selecting the highest eigenvalues of the covariance matrix, and it maximizes the retained variance [2]. The theoretical idea behind PCA is to find the principal components of the data, which correspond to the directions along which there is the most variation [6].

B. Random Projection

Random Projection finds components that make the projections uncorrelated by multiplying the data by a random matrix, and it minimizes the computations for a particular dimension size [2].

This method involves taking a high-dimensional data set and mapping it into a lower-dimensional space, while providing some guarantees on the approximate preservation of distances. If the input data A is an n × d matrix, then to do a projection we choose a "suitable" d × k matrix R and define the projection of A to be E = AR, which stores k-dimensional approximations of our n points (matrix E is n × k) [6]. R is a matrix with elements rij, where rij is drawn from a random Gaussian distribution. R can also be constructed in one of the following ways [2]:

- rij = ±1, each with probability 0.5;
- rij = ±1, each with probability 1/6, and 0 with probability 2/3.

Steve Vincent (2004) compared principal component analysis and random projection in text mining. He found that in general Random projection is faster than PCA by many orders of magnitude, but in most cases it produced lower accuracy. This leads to the suggestion to use the Random projection method when processing speed is most important [2].
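As an illustration of the projection E = AR described above, the following NumPy sketch builds R in the three ways listed (Gaussian, ±1, and the sparse ±1/0 variant). The function name and the toy data sizes are our own and do not come from the cited works.

```python
import numpy as np

def random_projection(A, k, mode="gaussian", seed=0):
    """Project an n x d data matrix A down to k dimensions: E = A R."""
    n, d = A.shape
    rng = np.random.default_rng(seed)
    if mode == "gaussian":
        R = rng.standard_normal((d, k))               # r_ij drawn from N(0, 1)
    elif mode == "sign":
        R = rng.choice([-1.0, 1.0], size=(d, k))      # +-1 with probability 0.5 each
    else:                                             # sparse variant
        R = rng.choice([-1.0, 0.0, 1.0], size=(d, k),
                       p=[1/6, 2/3, 1/6])             # +-1 w.p. 1/6, 0 w.p. 2/3
    return A @ R                                      # n x k matrix of projected points

# Toy usage: 1400 "stocks" with 54 ratios reduced to 5 dimensions
A = np.random.rand(1400, 54)
E = random_projection(A, k=5, mode="sign")
print(E.shape)  # (1400, 5)
```

For comparison, PCA would instead compute the top k eigenvectors of the covariance matrix of A, which is considerably more expensive for large d; that cost difference is the source of the speed gap reported above.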
C. Factor analysis

Like PCA, factor analysis is a linear method based on second-order data summaries. Factor analysis assumes that the measured variables depend on some unknown, and often unmeasurable, common factors. The goal of this method is to uncover such relations, and it can thus be used to reduce the dimension of data sets that follow the factor model [3].

The factors are assumed to follow a multivariate normal distribution and to be uncorrelated with the noise [5].

D. Independent component analysis (ICA)

ICA is a higher-order method that seeks linear projections, not necessarily orthogonal to each other, that are as nearly statistically independent as possible. Statistical independence is a much stronger condition than uncorrelatedness. ICA can be considered a generalization of the PCA and Projection pursuit concepts. While PCA seeks uncorrelated variables, ICA seeks independent variables. In contrast with PCA, the goal of ICA is not necessarily dimension reduction [3].

E. Maximum likelihood

This method specifies the likelihood of the noise-free ICA model and uses the maximum likelihood principle to estimate the parameters. The advantages of this method include the asymptotic efficiency of maximum likelihood estimates under regularity conditions. However, it is computationally intensive, which makes it undesirable in many practical situations [3].

F. Vector quantization [3]

Probably the simplest way of reducing dimensionality is to assign a class (among a total of K classes) to each one of the observations xn. This can be seen as an extreme case of dimensionality reduction in which we go from M dimensions to 1 (the discrete class label k). Each class k has a representative xk, which is the average of all the observations assigned to that class. If a vector xn has been assigned to the kn-th class, then its approximation after the dimensionality reduction is simply the class representative xkn. The widely used K-means method also belongs to this group of dimensionality reduction methods [5].

G. Principal curves, surfaces and manifolds

PCA is the perfect tool for reducing data that in their original M-dimensional space lie in a linear manifold. However, there are situations in which the data follow some curved structure. In this case, approximating the curve by a straight line will not give a good approximation of the original data. For such data the solution is to use principal curves, surfaces and manifolds [5].

Curve fitting is an important method of data analysis. When we obtain a fitting curve for the data, the dimension of the data is nonlinearly reduced to one [11].

H. Projection pursuit

PCA is ineffective in analyzing nonlinear structures, i.e. curves, surfaces or clusters. In such cases the Projection pursuit method may be a solution. The goal of projection pursuit is to find a projection that reveals interesting structures in the data. "Interesting" is defined as being "far from the normal distribution", i.e. the normal distribution is assumed to be the most uninteresting. The degree of being "far from the normal distribution" is measured by a projection index. There are various standards of interestingness [11].

I. Multidimensional scaling

Given n items in a d-dimensional space and an n × n matrix of proximity measures among the items, multidimensional scaling (MDS) produces a k-dimensional (k ≤ d) representation of the items such that the distances among the points in the new space reflect the proximities in the data [3].

The proximity measures the (dis)similarities among the items, and in general it is a distance measure: the more similar two items are, the smaller their distance. Popular distance measures are the Euclidean distance, the Manhattan distance and the maximum norm [3].
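A minimal sketch of metric MDS on toy data, using scikit-learn. The Sammon variant used later in this paper is not part of scikit-learn, so standard metric MDS stands in here, and the data sizes are illustrative rather than taken from our data set.

```python
import numpy as np
from sklearn.manifold import MDS

# Toy data: 100 items described by 5 variables (e.g., reduced financial ratios)
X = np.random.rand(100, 5)

# Metric MDS: find a 2-D configuration whose pairwise Euclidean distances
# approximate the distances between the original 5-D points.
mds = MDS(n_components=2, dissimilarity="euclidean", random_state=0)
X_2d = mds.fit_transform(X)
print(X_2d.shape)  # (100, 2) coordinates ready for plotting
```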
J. Kohonen's self-organizing maps

Self-Organizing Maps (SOM) are generalizations of the vector quantization approach presented above. SOM work by assigning to each input vector a label kn corresponding to the class whose representative vector is closest to it. Kohonen's SOMs start by creating a set of labels on a given manifold (usually a plane). The labels are distributed in a regular grid, and the topological neighborhood is defined by the neighbors of each grid point in the plane [5].

K. Support Vector Machines

Support vector machines (SVMs) have been recognized as one of the most successful classification methods [8]. SVMs are a powerful machine learning technique not only for classification but also for regression. They work by finding a hyperplane that best splits the target values. SVM performs classification by finding the hyperplane that maximizes the margin between the two classes. The vectors (cases) that define the hyperplane are the support vectors. The margin is the largest distance between the borderline data points. SVMs handle nonlinearity by mapping the data into a higher-dimensional space using kernel tricks [9].

L. Other methods

An important comparison of different methods was made by R. S. Rosaria, I. Adae et al. (2014) [12]. They concentrated on a few state-of-the-art methods for reducing input dimensionality and examined how these might affect the final classification accuracy. In particular, they tried removing data columns with too many missing values and used a low variance filter, which calculates the variance of each column and removes those columns whose variance falls below a given threshold. They also reduced highly correlated columns and implemented dimensionality reduction via tree ensembles (Random forests). But the best results were obtained by using backward feature elimination and forward feature construction together with removing data columns with too many missing values.

The Backward Feature Elimination loop performs the dimensionality reduction against a particular machine learning algorithm. Similarly to the Backward Feature Elimination approach, a Forward Feature Construction loop builds a number of pre-selected classifiers using an incremental number of input features. The Forward Feature Construction loop starts from 1 feature and adds one more feature at a time in the subsequent iterations [12].

III. ALGORITHM ADAPTATION FOR PARALLEL COMPUTING

In this section we briefly describe the MPI technology that is used for parallel computing in a computer cluster. We also present the algorithm of the selected dimensionality reduction method adapted for MPI.

A. MPI

MPI is a message-passing library interface specification. MPI primarily addresses the message-passing parallel programming model, in which data is moved from the address space of one process to that of another process through cooperative operations on each process [7]. Figure 2 shows the typical structure of a multi-core cluster. In the pure MPI case there is one MPI process per core.

Fig. 2. MPI scheme: multi-core CPUs with per-node shared memory, connected via a network

H. Jin, D. Jespersen (2011) [13] stated that the increasing number of cores in modern microprocessors is pushing current high performance computing systems into the petascale and exascale era. According to R. Rabenseifner, G. Hager, et al. (2009) [14], most systems in high-performance computing feature a hierarchical hardware design: shared memory nodes with several multi-core CPUs are connected via a network infrastructure.

A comparison between MPI and hybrid models was made by F. Cappello and D. Etiemble (2000) [15]. Their test results showed that a unified MPI approach is better for most of the benchmarks.

Rifat Chowdhury (2010) [1] designed an algorithm using OpenMP libraries to compute matrix multiplication efficiently. The speedup was close to the ideal 4 times when using four cores of a processor instead of one to compute the final matrix. This is an example of how parallel computing can increase performance.

B. Random Projection method

Our goal is to combine speed and accuracy. For the initial dimensionality reduction we selected the Random projection method because it is very fast. Moreover, we adapted this method for parallel computing. The method can be applied for many purposes. For example, G. E. Dahl, J. W. Stokes et al. [16] successfully used Random projections for large-scale malware classification to further reduce the dimensionality of the original input space. E. Bingham and H. Mannila [17] used Random projection as a dimensionality reduction tool for processing both noisy and noiseless images and for information retrieval in text documents. In our case we use this method for financial data mining.

To use the Random projection method, we first declare 3 matrices: an n × d matrix1 (containing the initial financial data), a d × k matrix2 (the operating matrix) and an n × k matrix3, which is our result matrix. In the next step we initialize all elements of matrix1 with random integer values. Then we assign values of 1 or -1 to matrix2, each with probability 0.5. Then matrix1 and matrix2 are multiplied, and the results are assigned to the elements of matrix3.

The code is constructed according to MPI requirements (Fig. 3). The program includes the MPI library. All matrices and variables are declared in the serial code region, but the MPI environment is initialized by special functions before the parallel region begins. The program uses a determined number of processes ("ranks"). The "zero" rank is used to divide the multiplication task into multiple processes and to handle the incoming results from the remaining ranks. All computing operations are done simultaneously by multiple ranks.

Fig. 3. Random projection algorithm for parallelization with MPI
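The authors' MPI program (Fig. 3) is not reproduced in the text; the following mpi4py sketch illustrates the same row-partitioned scheme, with rank 0 distributing blocks of matrix1 and gathering the partial products. The sizes and variable names mirror the description above but are otherwise our own assumptions, not the original implementation.

```python
# Row-partitioned random projection with MPI (illustrative mpi4py sketch).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

n, d, k = 1400, 54, 5                         # stocks x ratios, reduced to k dimensions

if rank == 0:
    matrix1 = np.random.rand(n, d)                         # initial financial data
    matrix2 = np.random.choice([-1.0, 1.0], size=(d, k))   # +-1 with probability 0.5
    chunks = np.array_split(matrix1, size, axis=0)         # one block of rows per rank
else:
    matrix2, chunks = None, None

# The "zero" rank distributes the work; every rank receives the projection matrix.
local_rows = comm.scatter(chunks, root=0)
matrix2 = comm.bcast(matrix2, root=0)

local_result = local_rows @ matrix2           # each rank multiplies only its own rows

# Rank 0 gathers the partial results into the n x k result matrix (matrix3).
gathered = comm.gather(local_result, root=0)
if rank == 0:
    matrix3 = np.vstack(gathered)
    print(matrix3.shape)                      # (1400, 5)
```

Run, for example, with `mpiexec -n 8 python random_projection_mpi.py` (the file name is hypothetical). Each rank works on an independent block of rows, which is why the speedup over the serial code grows with the amount of data per rank.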
IV. RESEARCH RESULTS

In this section we present the results of applying the Random projection and Multidimensional scaling methods to our financial market data set.

The data set contains information about 1400 companies traded on the NASDAQ stock exchange. Every company is described by 54 different parameters. All these parameters are grouped into 6 categories: overview, valuation, financial, performance, technical and ownership. All data come from the finviz.com website.

The Overview group contains such parameters as sector, industry, market capitalization, price and volume. In the Valuation group we have P/E, PEG, P/B, EPS and sales parameters. The Financial group contains ROA, ROE, ROI, earnings, debt and margin parameters. The Performance indicators present price changes over different periods of time, volatility and recommendation values. The Technical indicators are ATR, Beta, simple moving averages over different periods of time, RSI, and the lowest and highest prices at which a stock has traded in the previous 52 weeks. The last group contains stock ownership parameters.

Fig. 5. Parameters of the analysed companies

Usually our goal is to visualize data and uncover hidden information. This means that, for example, in this case we have to reduce the number of dimensions from 54 to 2 or 3. It is complicated to perform such tasks quickly on a single machine. So in the first step we executed the Random projection algorithm in a cluster having 120 nodes to reduce the dimensions from the initial 54 to a smaller number: 25, 15, 10 and 5 (obtaining several new data sets). When the most suitable number of processes was chosen, it took less than 0.04 seconds to complete each task (Figure 4). It should be noted that using more processes does not always lead to better performance. As this data set was relatively small, 10 to 30 processes were the optimal choice for executing the algorithm in the fastest way. Compared with the serial code, the MPI code worked 10 to 20 times faster.

Fig. 4. The execution times of the Random projection algorithm

In the second step we applied the Multidimensional scaling method (with Sammon mapping) to visualize the reduced data set containing 5 variables. We investigated two stock classification cases: one based on analysts' recommendations and one based on a multicriteria indicator.
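A condensed sketch of this two-step pipeline (random projection from 54 to 5 variables, then MDS down to 2 dimensions for plotting). Scikit-learn's metric MDS stands in for the Sammon mapping used here, and the data are random placeholders rather than the finviz.com ratios.

```python
# Two-step pipeline from Section IV: Random projection (54 -> 5), then MDS to 2-D.
import numpy as np
from sklearn.manifold import MDS

data = np.random.rand(1400, 54)                    # 1400 companies x 54 ratios (placeholder)
R = np.random.choice([-1.0, 1.0], size=(54, 5))    # random +-1 projection matrix
reduced = data @ R                                 # step 1: 1400 x 5

embedding = MDS(n_components=2, random_state=0).fit_transform(reduced)  # step 2
print(embedding.shape)                             # (1400, 2) points to plot
```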
A. Stock classification based on analysts' recommendations

One of our initial 54 factors was the recommendation of financial analysts. This variable ranges from 1 (a strong recommendation to buy the stock) to 5 (a strong recommendation to sell the stock). We raised the hypothesis that "good" companies (having a recommendation value from 1 to 2) should be separated from the rest of the companies after the dimensionality reduction and visualization steps.

However, figure 6 shows that the combination of the Random projection and Multidimensional scaling methods could not separate these two classes; they simply overlapped each other. We tried to use a 3D plot, but it was uninformative too (Fig. 7).

Fig. 6. Data visualization using Multidimensional scaling

Fig. 7. Data visualization using Multidimensional scaling (3D plot)

The first thought could be that the methods are not efficient enough. But it could also be that in reality the potential of "good" and "bad" companies (as determined by analysts) does not differ that much. This would explain why relying on the opinion of financial advisors often leads to unstable returns or even losses. It also suggests that such opinions may be based more on intuition than on the various stock ratios.

B. Stock classification based on a multicriteria indicator

In the second case we constructed a multicriteria indicator for separating stocks that are worth investing in ("good"), neutral stocks and "bad" stocks. In our assumption, "good" stocks meet all of the following criteria: P/E < 20, EPS this year > 5, ROE > 5, 50-period SMA > 0, RSI > 50. Table I shows the ratios by which the "good" and "bad" classes of stocks are constructed. "Neutral" stocks are those which do not fall within either of the two previously defined classes (a labelling sketch is given at the end of this subsection).

TABLE I. MULTICRITERIA INDICATOR FOR STOCK CLASSIFICATION

                  P/E     EPS this Year %   ROE    SMA50 %   RSI
  "Good stocks"   < 20    > 5               > 5    > 0       > 50
  "Bad stocks"    > 20    < 5               < 5    < 0       < 50

We used the same Multidimensional scaling method to visualize the reduced data set, so the arrangement of the data points is the same as presented in figure 6. But in this case we found that the three classes of stocks were separated quite well.

Figure 8 shows that the "good" companies appear on the left side of the plot. The "bad" and "neutral" companies overlapped, but the "bad" companies were separated from the "good" ones.

Fig. 8. Visualizing three classes of stocks

This leads to the suggestion that dimensionality reduction and visualization methods can effectively classify data and find the most promising stocks. However, in order to explain the differences between the classes we need to use several different ratios.

We used the combination of the Multidimensional scaling and Random projection methods because the latter is very fast. However, it may not be as accurate. So we also applied the MDS method to the full initial data set, and it actually led to better results: the 3D plot in figure 9 shows that the "bad" stocks were separated from the "neutral" stocks.

Fig. 9. Visualizing the initial data set using MDS
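The multicriteria rule in Table I can be expressed as a simple labelling function, sketched below with pandas. The column names and the toy values are assumptions for illustration, not the exact finviz.com field names or our data.

```python
# Hedged sketch of the multicriteria labelling from Table I.
import pandas as pd

def classify(row):
    good = (row["P/E"] < 20 and row["EPS this Y %"] > 5 and
            row["ROE"] > 5 and row["SMA50 %"] > 0 and row["RSI"] > 50)
    bad = (row["P/E"] > 20 and row["EPS this Y %"] < 5 and
           row["ROE"] < 5 and row["SMA50 %"] < 0 and row["RSI"] < 50)
    return "good" if good else "bad" if bad else "neutral"

stocks = pd.DataFrame({
    "P/E":          [12.0, 35.0, 18.0],
    "EPS this Y %": [ 8.0, -2.0,  6.0],
    "ROE":          [14.0,  1.0,  3.0],
    "SMA50 %":      [ 2.5, -4.0,  1.0],
    "RSI":          [61.0, 38.0, 55.0],
})
stocks["class"] = stocks.apply(classify, axis=1)
print(stocks["class"].tolist())   # ['good', 'bad', 'neutral']
```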
V. CONCLUSIONS

Various data mining methods are used for examining large financial data sets to uncover hidden and useful information. This research focused on financial data visualization based on dimensionality reduction methods. We used a data set that contained financial ratios of stocks traded on the NASDAQ stock exchange. A brief overview of the most popular dimensionality reduction and visualization methods was presented in this paper. We also showed how to adjust the algorithm of the Random projection method for parallel computing. MPI technology was applied in a computer cluster to perform dimensionality reduction. The performance results revealed the advantages of parallel computing.

Our goal was to visualize data and uncover hidden information. In order to do this we had to reduce the number of dimensions to 2 or 3. In the first step we executed the Random projection algorithm in the cluster to reduce the dimensions from the initial 54 to a smaller number. In the second step we applied the Multidimensional scaling method to visualize the reduced data set.

One of the data set ratios was the recommendation of financial analysts. We raised the hypothesis that companies with a recommendation to buy should be separated from the rest of the companies. But the results showed that the combination of the Random projection and Multidimensional scaling methods could not do this. This might have happened because in reality the potential of "good" and "bad" companies (as determined by analysts' recommendations) does not differ that much. However, in the second case of stock classification, based on the multicriteria indicator, the "good" and "bad" stocks were separated quite well. This leads to the suggestion that dimensionality reduction and visualization methods can effectively classify data and find the most promising stocks. But in order to explain the differences between the classes we need to use several different ratios. It should also be noted that the MDS method alone was more accurate than the combination of the two methods.

REFERENCES

[1] R. Chowdhury, "Parallel Computing with OpenMP to solve matrix Multiplication," UCONN BIOGRID REU Summer 2010, Department of Computer Science & Engineering, University of Connecticut, Storrs, CT 06269.
[2] Dr. Domeniconi, "Comparison of Principal Component Analysis and Random Projection in Text Mining," INFS 795, April 29, 2004.
[3] I. K. Fodor, "A survey of dimension reduction techniques," Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, June 2002.
[4] Intel Parallel Studio. Available: https://software.intel.com/en-us/intel-parallel-studio-xe
[5] C. O. S. Sorzano, J. Vargas, A. Pascual Montano, "A survey of dimensionality reduction techniques," 2014. Available: arXiv:1403.2877.
[6] A. K. Menon, "Random projections and applications to dimensionality reduction," School of Information Technologies, The University of Sydney, 2007.
[7] MPI technology. Available: http://www.mpi-forum.org/docs/mpi-3.1/mpi31-report.pdf
[8] H. Kim, P. Howland, H. Park, "Dimension Reduction in Text Classification with Support Vector Machines," Journal of Machine Learning Research 6 (2005) 37-53.
[9] S. Kudyba, Big Data, Mining, and Analytics: Components of Strategic Decision Making. Auerbach Publications, 2014. ISBN 9781466568709.
[10] Message Passing Interface. Available: https://computing.llnl.gov/tutorials/mpi/
[11] M. Mizuta, "Dimension Reduction Methods," Papers, Humboldt-Universität Berlin, Center for Applied Statistics and Economics (CASE), 15, 2004.
[12] R. S. Rosaria, I. Adae, A. Hart, M. Berthold, "Seven Techniques for Dimensionality Reduction," KNIME, 2014.
[13] H. Jin, D. Jespersen, et al., "High performance computing using MPI and OpenMP on multi-core parallel systems," Parallel Computing 37 (2011) 562-575.
[14] R. Rabenseifner, G. Hager, G. Jost, "Hybrid MPI/OpenMP Parallel Programming on Clusters of Multi-Core SMP Nodes," Proceedings of the 17th Euromicro International Conference on Parallel, Distributed and Network-Based Processing, PDP 2009, Weimar, Germany, 18-20 February 2009.
[15] F. Cappello, D. Etiemble, "MPI versus MPI+OpenMP on the IBM SP for the NAS Benchmarks," Supercomputing, ACM/IEEE 2000 Conference. ISSN: 1063-9535.
[16] G. E. Dahl, J. W. Stokes, L. Deng, D. Yu, "Large-scale malware classification using random projections and neural networks," Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference, pp. 3422-3426, 2013.
[17] E. Bingham, H. Mannila, "Random projection in dimensionality reduction: Applications to image and text data," Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 245-250, 2001.