Clustering underlying stock trends via non-negative matrix factorization

Clustering underlying stock trends via non-negative matrix factorization AndreaPazienza Dipartimento di Informatica Università di Bari SabrinaFrancescaPellegrino Dipartimento di Matematica Università di Bari StefanoFerilli Dipartimento di Informatica Università di Bari FlorianaEsposito Dipartimento di Informatica Università di Bari Clustering underlying stock trends via non-negative matrix factorization 61CF704E4B95F21EA21449E07CA14AAA GROBID - A machine learning software for extracting information from scholarly documents

Building a diversified portfolio is an appealing strategy in the analysis of stock market dynamics. It aims at reducing risk in market capital investments. Grouping stocks by similar latent trend can be cast into a clustering problem. The classical K-Means clustering algorithm does not fit the task of financial data analysis. Hence, we investigate Non-negative Matrix Factorization (NMF) techniques which, contrary to K-Means, turn out to be very effective when applied to stock data. In particular, recently developed NMF techniques, which incorporate convexity constraints, generate more disjoint latent trend groupings than the traditional sector-based groupings. In this paper, the NMF technique and its variants are applied to NASDAQ stock data (i.e., daily closing prices). Experimental results confirm that (convex ) NMF techniques are highly recommended to produce trend based assets and build a good diversified portfolio.

Introduction

A trader's purpose is to beat the market and, then, to make money. To achieve this objective, the trader should be able to predict future stock prices. In this way, he could determine a self-financing trading strategy that maximizes his portfolio return [5]. However, because of randomness in the market, creating and managing successful portfolios of financial assets is a difficult practice. Diversification theory is the most widely used practice by individuals to develop portfolios. It is based on the principle of attempting to maximize expected return for a given amount of risk or, equivalently, to minimize the risk for a given amount of return [9]. It is one of the most effective ways to get low risk-reward ratios. This problem can be seen as a clustering process, in which the aim is grouping data (e.g., stocks) into subgroups of similar behavior (e.g., market trend).

Clustering is arguably one of the most common steps in unsupervised behavior analysis. With K-Means [8], a classical clustering algorithm, it is not possible to establish the effectiveness and coherence of the clusters when dealing with stock data. Therefore, more powerful analysis techniques are required. Matrix decomposition strategies can overcome this problem, in fact, they can provide ways to produce cleaner data which may lead to a better interpretation of the results. Since the closing prices of stocks are definitely non-negative signals, it makes sense to apply Non-negative Matrix Factorization (NMF) on them.

In this paper, we use NMF and its variants to learn the components which drive the stock market, and to construct a diversification method using cluster analysis of financial assets. We compare NMF with its two variants, Convex NMF (C-NMF) and Convex-Hull NMF (CH-NMF), obtained by imposing orthogonal and convex constraints. We investigate the impact of these constraints in real stock return data.

Finally, we conclude that CH-NMF yields a more accurate and disjoint representation of the data, and allows a better interpretation of clustering. Moreover, CH-NMF is very useful because it converges in only one iteration, with very low runtime and, most important, achieving a very small error in Frobenius norm.

This paper is organized as follows. The next section recalls useful background information, including related works. Then, Section 3 introduces the mathematical aspects of NMF, its variants C-NMF and CH-NMF. Section 4 describes experimental results and evaluates the difference between the methods in terms of number of iterations, error and clustering representation. Finally, Section 5 concludes the paper.

Background and Related Work

Portfolio Diversification is the process of choosing investments in order to reduce exposure to a particular asset. This is typically done by investing in a variety of assets, because, if the stock prices do not move together, then a diversified portfolio of assets will have lower variance than the weighted average variance of the assets. A straightforward diversification breaks the stock market into several classic sectors according to the primary activities of a company (such as Basic Materials, Technology, Financial, Health Care, etc.).

As the stock evolution is comparable to a stochastic process, stock prices are determined by fluctuations in underlying or latent trends, which can be modeled by a Brownian motion [5]. Therefore, stocks in the same sector may not show similar behavior in the market. So, albeit the sector-based strategy is simple enough to apply, the portfolio will not ensure the maximum return. Hence, if one were able to identify and predict the underlying trends from the stock market data, one would have the opportunity to leverage this knowledge to obtain genuine portfolio diversification opportunities. In other words, investors should diversify their money into not only different sectors, but also different trends (e.g. clusters).

Unfortunately, the K-Means clustering algorithm, has still some limitations in the exploitation of financial data. Indeed, in [1] is stated that K-Means clustering tends to find spherical clusters so that centroid-based clustering does not handle the noise. Hence, the authors aimed to discover other centroid-based clustering approaches for financial datasets and to introduce weighted Euclidean distance instead of standard Euclidean distance to re-evaluate centroid-based clusters, as to overcome the limitations of K-Means.

Matrix decompositions, especially NMF, are used in the literature for the analysis of financial data. In particular, [4] applied a constrained NMF to real stock data and found that there is a tradeoff between smoothness of trend and sparsity of the weight matrix. [2] provides a contribution to the Diversification Theory by comparing Semi-NMF with Sparse-semiNMF applied to a diffusive model based on the Black&Scholes equation for option pricing. It is deduced that Sparse-semiNMF outperforms semi-NMF because it better reduces the risk related to the portfolio selection. Using the multiplicative update rules algorithm, [7] analyzes the behavior of latent trends for different value of the number of the latent forces and shows that the increase in the underlying force does not affect the trends of the original forces. In [11] a variant of Sparse-semiNMF with sum-to-one and smoothness constraints is applied to the portfolio diversification problem, and results show a disjoint data representation that allows a better understanding of stock properties.

In our setting, assume the market is made up of m stocks S 1 , S 2 , . . . , S m ; each stock S i is stored as a row vector whose entries are n daily closing prices. Suppose there are k latent bases, W 1 , W 2 , . . . , W k ; each W j is a n-dimensional row vector, which can be thought as a Brownian motion. So it is possible to express each stock as linear combination of these bases:

(1)

S i = k j=1 h ij W j ,

where h ij is a non negative real number and indicates the association degree of the i-th stock with the basis W j . With matrix notation, (1) can be expressed as

(2)

S + = H + W ± ,

where S ∈ R m×n + , H ∈ R m×k + (weight matrix) and W ∈ R k×n ± (trend matrix). This strongly recalls the non-negative matrix factorization formulation.

Non-Negative Matrix Factorization

The standard definition for non-negative matrix factorization of a matrix S is

(3) S = H W,

where S ∈ R m×n , H ∈ R m×k and W ∈ R k×n , and k ≤ m. Both W and H must contain only non-negative entries. W is the matrix of factors and H is the mixing matrix.

According to [6], each data point, which is represented as a row in S, can be approximated by an additive combination of the non-negative basis vectors, which are represented as rows in W , weighted by the components of H. Matrices H and W are found by solving the optimization problem (4) min

H≥0, W ≥0 S − H W 2 F ,

where • F is the Frobenius norm. The algorithm is expressed in terms of a pair of update rules that are applied alternately:

H ij = H ij k S ik (H W ) ik W jk , W ij = W ij k H ki S kj (H W ) kj .

Matrices H and W are initialized at random. Various variants and improvements to NMF have been introduced in recent years [3,10].

Convex NMF Convex non-negative matrix factorization (C-NMF) [3] allows the data matrix S to have mixed signs. It minimizes

S − S H W 2 F subject to the convex constraint H i 1 = 1, H ≥ 0, where S ∈ R m×n , H ∈ R n×k and W ∈ R k×n .

Matrices H and W are updated iteratively, until convergence, using the following update rules:

H ik = H ik (Y + W ) ik + (Y − H W T W ) ik (Y − W ) ik + (Y + H W T W ) ik W ik = W ik (Y + H) ik + (W H T Y − H) ik (Y − H) ik + (W H T Y + H) ik

where Y = S T S, and matrices Y + and Y − are given by

Y + ik = 1 2 |Y ik | + Y ik and Y − ik = 1 2 |Y ik | − Y ik respectively.

Matrices H and W are initialized at random. The convex constraint imposed on H has the advantage that one might interpret the rows of H as weighted sums of certain data points. This means that these rows can be interpreted as centroids. Moreover, C-NMF solutions are generally significantly more orthogonal than Semi-NMF solutions.

Convex-Hull NMF Massive datasets are likely to capture even very rare aspects of the problem at hand. Along this line, [10] recently introduced a datadriven Convex NMF approach, called Convex-Hull NMF (CH-NMF), that is fast and scales extremely well: it can efficiently factorize huge matrices and in turn extract meaningful "clusters" from massive datasets. The key idea is to restrict the clusters to be combinations of vertices of the convex hull of the dataset; this allows to directly explore the data itself to solve the convex NMF problem.

We consider a factorization of the form S = S H W where S ∈ R m×n , H ∈ R n×k and W ∈ R k×n . We further restrict the rows of H and W to convexity, i.e.,

H i 1 = 1 H ≥ 0, W j 1 = 1 W ≥ 0.

In contrast to C-NMF, we consider a convex combination on both H and W . The task now is to minimize

(5) S − S H W 2 F , s.t. H i 1 = 1, H ≥ 0, W j 1 = 1, W ≥ 0.

This optimization problem is equivalent to projecting the solution in the convex hull of S. Convexity constraints yield latent components with some properties: first, any data point can be expressed as a convex and meaningful combination of these basis vectors; this offers interesting new opportunities for data interpretation. Second, they span a simplex that encloses most of the remaining data. CH-NMF aims at a data factorization based on the data points residing on the data convex hull. Therefore, CH-NMF seeks an approximate solution by subsampling the convex hull, exploiting each data point on the convex hull as a linear lower dimensional projection of the original data.

A consequence of the convexity of H and W is that the rows of H tend to become nearly orthogonal. Requiring orthogonal rows for H produces a noncorrelation between stocks being attracted from different clusters. This indicates a more accurate clustering and, hence, that the NMF family is competitive with K-Means for the purposes of clustering financial data. Therefore, the aim of this paper is to exploit NMF techniques, and in particular the ones with convexity constraints, in the context of financial data.

Evaluations and Comparisons of NMF Techniques

The objective of this work is studying the application of clustering approaches to determine a trend-based portfolio diversification that is more consistent than the sector-based one. We ran experiments on stock data gathered from the NASDAQ Stock Market (http://www.nasdaq.com), and specifically on the closing prices of 28 stocks in the past 10 years (2518 working days). Table 1 reports the list of stocks involved in the experiments and their belonging sector. Fig. 1(a) shows the actual stock prices, useful as a reference to compare the numerical solutions that we obtain from the different methods. So, we want to identify the subgroups (clusters) of stocks that show a similar trend. In particular, we want to assess the performance of NMF, C-NMF and CH-NMF. We try different numbers k of clusters. Since the considered stocks involve only 8 market sectors overall, we ran the methods 6 times, one for each k ∈ {3, 4, . . . , 8}. We used a Python implementation of these methods available at http://pymf.googlecode.com.

The raw data were preprocessed as log-returns and stored into a matrix S ∈ R 28×2518 . In this way the variances of stock data distribution are homogeneous to allow a better understanding of the graphical results. We compute two matrices H and W for the given stock data matrix S. H and W are used to identify the k cluster labels of the stocks: in fact, rows of W regard the cluster centroids while H is the cluster membership indicator matrix. In other words, sample i is in cluster j if H ij is the largest value in row H i,: . The product H W , representing the reconstruction of S, is a useful way to explicit the difference between the original stock prices data and the approximated dynamics of transformed data.

Table 2 reports, for each k, statistics about the results obtained by the NMF methods: number of iterations for convergence, error estimation in Frobenius norm, and number of non-empty clusters in the result. We note that, in most cases, all decomposition methods provide a good reconstruction of data. They differ in the cost to achieve a good decomposition (the larger the number of iterations, the more expensive the method). We want to highlight that CH-NMF is very fast and its estimated Frobenius error is of the same order of magnitude as the other methods. In fact, even if we set a higher number of maximum iterations, CH-NMF needs only 1 iteration to converge for these particular dataset, as reported in Table 2. For others techniques involved in the experiments, NMF requires up to 5000 iterations in the worst case, while C-NMF takes less iterations than NMF to converge, but it always discovers only two non-empty clusters. This is very unattractive because, considering only 2 trends will expose investors to high risks as they would diversify their portfolio making it up with only 2 stocks, one for each trend. The role of k is to force a representation for the data that is more compact than its actual form. Assuming a more compact representation will capture underlying regularities in the data that might be obscured by the form in which the data is found in matrix S. The target is to achieve a low rank approximation which ensures a good interpretation of data in terms of clustering partition, data reconstruction and compactness of representation. Indeed, for a genuine portfolio diversification, choosing k = 4 represents a fair compromise between a lower rank approximation and the goal of yielding a good cluster partition. The reconstruction obtained for k = 4 is shown Fig. 1(b,c,d).

Now, we focus on the study of W , i.e., the matrix which represents the latent trends. As shown in Fig. 2, the trend obtained from NMF points out the increment of fluctuations as k grows up, despite the quality of graphical reconstruction being good. In Fig. 3 we can see that C-NMF shows too many fluctuations and does not allow us to compose a good diversified portfolio. Indeed, as shown in Table 2, for each k, C-NMF provides always only 2 clusters, which reflect the trends that can be seen in the figures. A reason for this behavior is related to the fact that C-NMF imposes the convexity constraint only on H. Hence, convexity for H should be used together with convexity also on W . In fact, CH-NMF is able to overcome this problem and leads to more regular components (cfr. Fig. 4). Therefore, we can state that the convexity constraint of H and W provides a good adjustment: the bases become more disjoint and Frobenius norm decreases at a speed that depends on the number of iterations. It is important to note that while all methods try to minimize the same criterion, they impose different constraints and thus yield different matrix factors. For example, CH-NMF assumes W and H to be non-negative matrices and often leads to sparse representations of the data.

Another important graphical confirmation of our proposal can be found in the analysis of colormaps which is a good way to display matrices in scaled colors. It represents a color-filled table in which every color indicates the weight of each corresponding matrix component according to a total order relation managed by a color scale called colorbar. In this way, we can evaluate the components of greater weight associated with latent trends. More precisely, every H ij indicates how the i-th stock is related to the row basis W j . Also in this representation, we can observe the degree of belonging of each data point to a cluster by selecting, for each row, the highest element in the colormap. In the case of NMF (see Fig. 5(a)), colormaps display a regular proceeding of data with a high peak in some rows. This determines the membership of an element in a row to a cluster in the corresponding column. While, for C-NMF (see Fig. 5(b)) the elements with the highest color with respect to the colorbar are located in the first or last column, giving more emphasis to the fact that the resulting clusters are always 2. Regarding colormaps for CH-NMF (see Fig. 5(c)), we can observe a clearer disjunction of columns, meaning that the resulting clusters are readily visible. Compared to the K-Means algorithm, used as baseline, the main difference lies into the creation of clusters. In fact, K-Means always produces exactly k clusters, while NMF methods generate at most k clusters, as shown in Table 2. This means that there are centroids which are not attracting stocks.

After analyzing all decompositions, our main purpose is to obtain clusters of stocks with the same trend, starting from a matrix decomposition of data in a such way that W would hold centroids coordinates and H would hold the relationship degree of different centroids. We collect every i-th row of H, which corresponds to the i-th stock index for Table 1, into their own membership cluster in order to discover, for each k, which subgroups of stocks remain unaffected. Thus, the most frequent subgroups of stock data can be chosen as final outcome of our portfolio diversification strategy. More precisely, it could be possible to construct a tempting portfolio by selecting the most promising stocks from each subgroup. We implemented this functionality in MATLAB with the objective to take out a cluster matrix by varying both of k and the decomposition method. In Tables 3-5, we show the resulting clusters of stock data: we see that clusters of stocks persist across the different decomposition methods. To give some concrete examples, stocks with index 1, 2, 9, 27 (and often 12 too) are always grouped together. They correspond to red indexes in the tables. Summing up, the obtained clusters demonstrate successful application of CH-NMF to the analysis of financial data. This means that CH-NMF is robust in the case of analysis of stock market. Moreover, it provides a trend-based diversification containing groups of different sectors. The most interesting result is that the stocks of the same sector is not necessarily assigned into the same cluster and vice versa, which is of potential use to guide diversified portfolio.

Conclusions

Constructing a diversified portfolio, in which the correlation between constituent asset classes and investment strategies is meaningfully low, can be challenging, in order to reduce the exposure to risk by investing in a variety of assets. Our aim is to group stocks having similar trend. This can be cast as a clustering problem in data mining that we solve with NMF techniques. We investigate NMF and its variants with convexity constraints to improve the exploitation of similar stock trends. In particular, we show that, for this task, CH-NMF is a very fast and scalable in terms of speed and reconstruction quality. Our extensive experimental evaluation shows that NMFs better point out the clustering properties, additionally yielding very low error in Frobenius norm and high efficiency in terms of convergence time. Furthermore, we compared the resulting clusters to check whether frequent itemsets of stock stay together still while the number of requested clusters changes. The task of prediction is not applicable for NMF techniques because the number of clusters to be discovered is given in input.

As future work, we will use more datasets from different markets and will investigate further decomposition techniques that can further improve the effectiveness of clustering stock data and impose other penalty constraints in order to achieve a better portfolio diversification strategy, reduce the risk of investments and, hence, beat the market.

( a )aActual stock prices data. (b) NMF. (c) C-NMF. (d) CH-NMF.

Fig. 1 :1Fig. 1: Original stock data with reconstruction for k = 4

(a) k = 3 .3(b) k = 4. (c) k = 5. (d) k = 6. (e) k = 7. (f) k = 8.

Fig. 2 :2Fig. 2: Trends for NMF

Fig. 3 :3Fig. 3: Trends for C-NMF

Fig. 4 :4Fig. 4: Trends for CH-NMF

(a) NMF. (b) C-NMF. (c) CH-NMF.

Fig. 5 :5Fig. 5: Colormaps for k = 4

Table 1 :1Stocks Table# Code CompanySector1 AA Alcoa Inc.Basic Materials2 AIG American International Group, Inc.Financial3 AAPL Apple Inc.Technology4 AXP American Express CompanyFinancial5 BA Boeing CompanyIndustrial Goods6 CAT Caterpillar, Inc.Industrial Goods7 DD E.I. du Pont de Nemours and CompanyBasic Materials8 DIS Walt Disney CompanyServices9 GE General Electrics CompanyConglomerates10 HD Home Depot, Inc.Services11 HON Honeywell International Inc.Industrial Goods12 HPQ HP Inc.Technology13 IBM International Business Machines Corporation Technology14 INTC Intel CorporationTechnology15 JNJ Johnson & JohnsonHealthcare16 JPM JP Morgan Chase & Co.Financial17 KO Coca-Cola CompanyConsumer Goods18 MCD McDonald's CorporationServices19 MSFT Microsoft CorporationTechnology20 MMM 3M CompanyConglomerates21 MO Altria Group, Inc.Consumer Goods22 PFE Pfizer, Inc.Healthcare23 PG Procter & Gamble CompanyConsumer Goods24 UTX United Technologies CorporationConglomerates25 VZ Verizon Communications Inc.Technology26 WMT Wal-Mart Stores, Inc.Services27 ABIO ARCA biopharma, Inc.Healthcare28 AMGN Amgen Inc.Healthcare

Table 2 :2Numerical resultsNMFC-NMFCH-NMFk # iter error # clusters # iter error # clusters # iter error # clusters3 1528 33.77033500 45.718521 47.584424 2355 27.19664500 42.414821 43.982445 3358 21.08385500 40.250221 38.658546 2523 16.99874500 33.576121 56.405047 5000 14.67066500 38.667521 32.575558 5000 13.54827500 32.178621 46.25355

Table 3 :3NMF Clustersk = 3k = 4k = 5k = 6k = 7k = 83 4 5 6 3 5 8 10 12 14 153 4 5 6 3 4 5 6 11 15 17 187 8 9 10 11 15 16 16 17 18 7 8 10 11 7 8 14 16 20 21 22 2311 14 15 16 19 20 21 19 20 23 13 14 15 16 19 20 24 24 25 26 2817 19 20 21 22 23 24 24 25 26 17 18 19 2022 23 24 25 25 26 2821 22 24 2526 282812 13 18 6 7 12 13 4 5 8 1023 26 12 15 232 4 5 814 17 18 21 22 28269 101 2 27277 111 2 9 2721 28271 2 4 9 3 6 1312 10 13 171618 22 251 2 9 27273 13 141 2 9 111 6 7 1219

Table 4 :4C-NMF Clusters They correspond respectively to Alcoa Inc. (Basic Materials), American International Group Inc. (Financial), General Electrics Company (Conglomerates), ARCA biopharma Inc. (Healthcare), HP Inc. (Technology). Other stocks which are often grouped together are depicted in the Tables with different colours.k = 3k = 4k = 5k = 6k = 7k = 81 2 12 27 1 2 9 12 1 2 9 121 2 27 1 2 6 9 1 2 9 12 27272712 23 273 4 5 6 7 3 4 5 6 7 3 4 5 6 73 4 5 6 7 3 4 5 7 8 3 4 5 6 78 9 10 11 8 10 11 8 10 11 8 9 10 11 10 11 138 10 1113 14 15 13 14 15 13 14 15 12 13 14 15 14 15 1613 14 1516 17 18 16 17 18 16 17 18 16 17 18 19 17 18 1916 17 1819 20 21 19 20 21 19 20 21 20 21 22 23 20 21 2219 20 2122 23 24 22 23 24 22 23 24 24 25 26 28 24 25 2622 23 2425 26 28 25 26 28 25 26 282825 26 28

Table 5 :5CH-NMF Clustersk = 3k = 4k = 5k = 6k = 7k = 81 2 9 12 271 2 9 271 2 9 27 1 2 9 12 271 2 92 273 4 5 6 7 83 4 5 7 8 3 5 7 8 103 4 5 7 83 4 6 7 4 8 10 119 10 11 13 14 10 11 14 1511 18 21 11 14 15 1613 14 17 13 15 16 1715 16 17 18 19 16 17 18 1922 24 25 17 20 22 2320 23 24 18 20 21 2220 21 22 23 24 20 21 22 2324 25 26 2825 26 23 24 25 2625 26 28 24 25 26 28286 13 4 6 12 146 13 5 8 10 111 5 9 1215 16 17 1915 16 1814 1920 23 26 2819 21 22 281213 10 18 19 21126 7273

Acknowledgments

This work was partially funded by the Italian PON 2007-2013 project PON02_00563_3489339 'Puglia@Service'.

Clustering approaches for financial data analysis: a FCai NLe-Khac MKechadi Proceedings of the International Conference on Data Mining (DMIN) the International Conference on Data Mining (DMIN) 2012 Portfolio diversification using subspace factorizations RDe Fréin KDrakakis SRickard CISS 2008. 42nd Annual Conference on IEEE 2008. 2008 Information Sciences and Systems Convex and semi-nonnegative matrix factorizations CH QDing TLi MIJordan IEEE Transactions on Pattern Analysis and Machine Intelligence 32 1 Jan 2010 Analysis of financial data using non-negative matrix factorization KDrakakis SRickard RDe Fréin ACichocki International Mathematical Forum 3 38 2008 Option pricing and portfolio optimization RKorn EKorn Graduate Studies in Mathematics 31 18 2001 Algorithms for non-negative matrix factorization DLee HSSeung Advances in neural information processing systems 2001 Non-negative matrix factorization for stock market pricing TLiu Biomedical Engineering and Informatics. BMEI'09 IEEE 2009 Some methods for classification and analysis of multivariate observations JMacqueen Proceedings of the fifth Berkeley symposium on mathematical statistics and probability the fifth Berkeley symposium on mathematical statistics and probability

Oakland, CA, USA

1967 1 Portfolio selection HMarkowitz The journal of finance 7 1 1952 Yes we can: simplex volume maximization for descriptive web-scale matrix factorization CThurau KKersting CBauckhage Information and knowledge management. CIKM ACM 2010. 2010 Stock trend extraction via matrix factorization JWang Advanced Data Mining and Applications Springer 2012