Fast stepwise regression on linked data

Barbara Żogala-Siudem¹,² and Szymon Jaroszewicz²,³

¹ Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland, zogala@ibspan.waw.pl
² Institute of Computer Science, Polish Academy of Sciences, Warsaw, Poland, s.jaroszewicz@ipipan.waw.pl
³ National Institute of Telecommunications, Warsaw, Poland

Abstract. The main focus of research in machine learning and statistics is on building more advanced and complex models. In practice, however, it is often much more important to use the right variables. One may hope that the recent popularity of open data would allow researchers to easily find relevant variables, but current linked data methodology is not suitable for this purpose, since the number of matching datasets is often overwhelming. This paper proposes a method using correlation-based indexing of linked datasets which can significantly speed up feature selection based on the classical stepwise regression procedure. The technique is efficient enough to be applied at interactive speed to huge collections of publicly available linked open data.

Keywords: stepwise feature selection, linked open data, spatial indexing

1 Introduction

It is well known from statistical modeling practice that including the right variables in the model is often more important than the type of model used. Unfortunately, analysts have to rely on their experience and/or intuition, as there are few tools available to help with this important task.

The rising popularity of linked open data could offer a solution to this problem. The researcher would simply link their data with other variables downloaded from a public database and use them in their model. Currently, several systems exist which allow for automatically linking publicly available data [2, 5, 11, 17, 18]. Unfortunately, those systems are not always sufficient. Consider, for example, a researcher who wants to find out which factors affect some variable available for several countries over several consecutive years. The researcher could link publicly available data (from, say, Eurostat [1] or the United Nations [6]) by country and year to the target modeling variable and build a linear regression model using some method of variable selection. Unfortunately, such an approach is not practical, since there are literally millions of variables available from Eurostat alone and most of them can be linked by country and year. As a result, several gigabytes of data would have to be downloaded and used for modeling.

This paper proposes an alternative approach: a new variable is linked not only on key attributes but also based on its correlation with the target variable. We describe how to use spatial indexing techniques to find correlated variables quickly. Moreover, we demonstrate how such an index can be used to build the stepwise regression models commonly used in statistics. To the best of our knowledge, no current system offers such functionality. The closest to the approach proposed here is the Google Correlate service [14]. It allows the user to submit a time series and find the Google query whose frequency is most correlated with it. However, Google Correlate is currently limited to search engine query frequencies and cannot be used with other data such as publicly available government data collections. Moreover, it only finds a single correlated variable, whereas the approach proposed here automatically builds full statistical models.
In other words, our contribution adds a statistical model construction step on top of a correlation-based index such as Google Correlate. There are several approaches to speeding up variable selection in stepwise regression models, such as streamwise regression [22] or VIF regression [13]. None of them is, however, capable of solving the problem considered here: allowing an analyst to build a model at interactive speed while automatically selecting from millions of available variables.

Let us now introduce the notation. We do not make a distinction between a random variable and the vector of data corresponding to it. Variables/vectors are denoted with lowercase letters x, y; x̄ is the mean of x and cor(x, y) is the correlation between x and y. Matrices (sets of variables) are denoted with boldface uppercase letters, e.g. X. The identity matrix is denoted by I and X^T is the transpose of the matrix X.

2 Finding most correlated variables. Multidimensional indexing

The simplest linear regression model we may think of is a model with only one variable: the one most correlated with the target. An example system building such models in the open data context is the Google Correlate tool [3, 14, 21]. It is based on the fact that finding the variable with the highest correlation is equivalent to finding a nearest neighbor of the response variable after appropriate normalization of the vectors.

In this paper we normalize all input vectors (potential variables to be included in the model) as

$$x' = \frac{x - \bar{x}}{\|x - \bar{x}\|}.$$

That way, each vector is centered at zero and has unit norm, so we can think of the vectors as points on an (n − 1)-sphere. It is easy to see that the correlation coefficient of two vectors x, y is simply equal to the dot product of their normalized counterparts, cor(x, y) = ⟨x′, y′⟩. Note that our normalization is slightly different from the one used in [14], but has the advantage that standard multidimensional indices can be used. After normalization the Euclidean distance between two vectors is directly related to their correlation coefficient:

$$\|x - y\| = \sqrt{2 - 2\,\mathrm{cor}(x, y)}. \qquad (1)$$

The above equation gives us a tool to quickly find the variable most correlated with a given variable: it is simply the one closest to it in the usual geometric sense. Moreover, to find all variables x whose correlation with y is at least η, one needs to find all x for which $\|x - y\| \le \sqrt{2 - 2\eta}$. The idea now is to build an index containing all potential variables and use that index to find correlated variables quickly. Thanks to the relationship with Euclidean distance, multidimensional indexing can be used for this purpose. Building the index may be time consuming, but afterwards finding correlated variables should be very fast. We now give a brief overview of the indexing techniques.

Multidimensional indexing. Multidimensional indices are data structures designed for rapidly finding nearest neighbors in n-dimensional spaces. Typically, two types of queries are supported: nearest neighbor queries return the k vectors in the index which are closest to the supplied query vector, while range queries return all vectors within a given distance from the query. Due to space limitations we do not give an overview of multidimensional indexing techniques; see e.g. [19]. Let us only note that Google Correlate [21] uses a custom designed technique called Asymmetric Hashing [8]. In the current work we use Ball Trees [12] implemented in the Python scikit-learn package. Ball Trees are supposed to work well even for high dimensional data and return exact solutions to both nearest neighbor and range queries. For faster, approximate searches we use the randomized kd-trees implemented in the FLANN package [15] (see also [16]).
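For illustration, the following minimal sketch (our own, using synthetic data and illustrative variable names rather than code from the paper) shows the normalization above and how a scikit-learn Ball Tree answers the corresponding nearest neighbor and range queries:

```python
# Minimal sketch (synthetic data, illustrative names) of the normalization from Section 2
# and of nearest-neighbor / range queries answered with a scikit-learn Ball Tree.
import numpy as np
from sklearn.neighbors import BallTree

def normalize(v):
    """Center v and scale it to unit norm, so that dot products equal correlations."""
    v = v - v.mean()
    return v / np.linalg.norm(v)

rng = np.random.default_rng(0)
candidates = rng.normal(size=(10000, 50))   # 10,000 candidate variables, 50 observations each
y = rng.normal(size=50)                     # target variable

Xn = np.apply_along_axis(normalize, 1, candidates)
yn = normalize(y)

tree = BallTree(Xn)

# Nearest neighbor = most (positively) correlated variable; invert Equation (1) to get cor.
dist, idx = tree.query(yn.reshape(1, -1), k=1)
best_cor = 1.0 - dist[0, 0] ** 2 / 2.0
# To search by absolute correlation one can additionally query with -yn.

# Range query: all variables whose correlation with y is at least eta.
eta = 0.4
hits = tree.query_radius(yn.reshape(1, -1), r=np.sqrt(2.0 - 2.0 * eta))[0]
```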
Of course, finding most correlated variables has already been implemented by Google Correlate. In the next section we extend the technique to building full regression models, which is the main contribution of this paper.

3 Stepwise and stagewise regression

In this section we describe two classical modeling techniques, stagewise and stepwise linear regression, and show how they can be efficiently implemented in the open data context using a multidimensional index.

Stagewise regression is a simple algorithm for variable selection in a regression model which does not take into account interactions between predictor variables; see e.g. [10] for a discussion. The algorithm is shown in Figure 1. The idea is simple: at each step we add the variable most correlated with the residual of the current model. The initial residual is the target variable y and the initial model matrix X contains just a column of ones responsible for the intercept term. The matrix $H_X = X(X^T X)^{-1} X^T$ is the projection matrix onto X; see [9] for details.

Algorithm: Stagewise
1) r ← y; X ← (1, 1, . . . , 1)^T
2) Find the variable x_i most correlated with r
3) Add x_i to the model: X ← [X | x_i]
4) Compute the new residual vector r = y − H_X y
5) If the model has improved: go to 2)

Fig. 1. The stagewise regression algorithm.

The stopping criterion in step 5 is based on the residual sum of squares, $\mathrm{RSS} = r^T r = \|r\|^2$, where r is the vector of residuals (differences between true and predicted values). The disadvantage of RSS is that adding more variables can only decrease the criterion. To prevent adding too many variables to the model, additional penalty terms are included; the two most popular choices are Akaike's AIC [4] and Schwarz's BIC [20]. Here we simply set a hard limit on the number of variables included in the model.

The advantage of stagewise regression is its simplicity: one only needs to compute the correlation of all candidate variables with the residual r. Thanks to this, one can easily implement stagewise regression using the techniques from Section 2, so the approach can trivially be deployed in the proposed setting.

The disadvantage of stagewise regression is that it does not take into account correlations between the new variable and variables already present in the model. Consider the example dataset given in Table 1. The dataset has three predictor variables x1, x2, x3 and a target variable y. The data follows an exact linear relationship: y = 3x1 + x2. It can be seen that the variable most correlated with y is x1, which will be the first variable included in the model. The residual vector of that model, denoted r1, is also given in the table. Clearly, the variable most correlated with r1 is x3, giving the model y = β0 + β1 x1 + β2 x3. But the true model does not include x3 at all! The reason is that x3 is highly correlated with x1, and this correlation is not taken into account by stagewise regression.

Table 1. Example showing the difference between stepwise and stagewise regression.

   x1     x2     x3      y     r1
 0.03  -0.12   0.75  -0.03   0.51
-0.54  -0.10  -0.47  -1.71  -0.15
 0.13  -1.03   0.11  -0.64  -0.27
 0.73  -1.58   0.00   0.61  -0.09
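For illustration, a rough numpy sketch of the stagewise procedure of Figure 1 (our own simplified version with a hard limit on the number of selected variables; function and variable names are illustrative):

```python
# Rough sketch of the stagewise procedure of Figure 1 (illustrative, not the authors' code).
# Xcand holds the candidate variables as columns and y is the target; correlations are
# computed as dot products of normalized vectors, as in Section 2.
import numpy as np

def normalize(v):
    v = v - v.mean()
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def stagewise(Xcand, y, max_vars=5):
    n_obs = len(y)
    model_cols = [np.ones(n_obs)]                  # intercept column
    selected = []
    Xn = np.apply_along_axis(normalize, 0, Xcand)  # normalize every candidate column
    for _ in range(max_vars):
        M = np.column_stack(model_cols)
        H = M @ np.linalg.pinv(M)                  # hat matrix H_X = M (M^T M)^{-1} M^T
        r = y - H @ y                              # residual of the current model
        cors = np.abs(Xn.T @ normalize(r))         # |cor(x_j, r)| for every candidate
        best = int(np.argmax(cors))
        model_cols.append(Xcand[:, best])
        selected.append(best)
    return selected
```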
An improvement on stagewise regression is stepwise regression, proposed in 1960 by Efroymson [7]. The algorithm is given in Figure 2. The main idea is that at each step we try adding every candidate variable to the model, compute the actual residual sum of squares (which takes the correlations between variables into account) and keep the variable which gives the best improvement.

Algorithm: Stepwise
1) r ← y; X ← (1, 1, . . . , 1)^T
2) For each variable x_i: compute the residual of the model obtained by adding x_i to the current model: r_i = y − H_{[X|x_i]} y
3) Find x_{i*}, where i* = arg min_i r_i^T r_i
4) If the model [X|x_{i*}] is better than X:
   a) add x_{i*} to the model: X ← [X|x_{i*}]
   b) go to 2)

Fig. 2. The stepwise regression algorithm.

In the example of Table 1, stepwise regression will choose the correct variables, x1 and then x2, which gives the best possible model. In general, stepwise regression builds better models than stagewise regression, but is more costly computationally: at each step we need to compute the RSS of several regression models, which is much more expensive than simply computing correlations.

4 Fast stepwise selection based on multidimensional indices

Stepwise regression is known to give good predictions, but it becomes inefficient when the number of attributes is large. Building a model consisting of many variables while searching through several million candidates, as is often the case with linked data, would be extremely time consuming, since at each step we would need to compute the RSS of millions of multidimensional models. In this section we present the main contribution of this paper, namely an approach to speeding up the process using a multidimensional index. Our goal is to decrease the number of models whose RSS needs to be computed at each step through efficient filtering based on a multidimensional index.

Assume that k − 1 variables are already in the model and we want to add the k-th one. Let X_{k−1} denote the current model matrix. The gist of our approach is given in the following theorem.

Theorem 1. Assume that the variables $x_1, \ldots, x_{k-1}$ currently in the model are orthogonal, i.e. $X_{k-1}^T X_{k-1} = I$, and let $r = y - H_{X_{k-1}} y$ denote the residual vector of the current model. Consider two variables $x_k$ and $x_{k'}$. Denote $c_{i,k} = \mathrm{cor}(x_i, x_k)$, $c_{i,k'} = \mathrm{cor}(x_i, x_{k'})$, $c_{r,k} = \mathrm{cor}(r, x_k)$, $c_{r,k'} = \mathrm{cor}(r, x_{k'})$. Let $X_k = [X_{k-1}|x_k]$ and $X_{k'} = [X_{k-1}|x_{k'}]$. Further, let $r_k = y - H_{X_k} y$ be the residual vector of the regression model obtained by adding variable $x_k$ to the current model, and let $r_{k'}$ be defined analogously. Then $\|r_{k'}\|^2 \le \|r_k\|^2$ implies

$$\max\left\{|c_{1,k'}|, \ldots, |c_{k-1,k'}|, |c_{r,k'}|\right\} \ge \frac{|c_{r,k}|}{\sqrt{1 - c_{1,k}^2 - \ldots - c_{k-1,k}^2 + (k-1)c_{r,k}^2}}. \qquad (2)$$

The theorem (the proof can be found in the Appendix) gives us a way to implement a more efficient construction of regression models through the stepwise procedure. Each step is implemented as follows. We first find the variable x_k which is most correlated with the current residual r. Then the right hand side of (2) gives a lower bound which any better candidate variable must exceed in its correlation with the current residual or with at least one of the variables currently in the model. Then, based on Equation (1), we can use k range queries (see Section 2) on the spatial index to find all candidate variables. Steps 2 and 3 of the algorithm in Figure 2 are then performed only for the variables returned by those queries. Since step 2 is the most costly step of the stepwise procedure, this can potentially result in huge speedups. The theorem assumes x_1, . . . , x_{k−1} to be orthogonal, which is not always the case. However, we can always orthogonalize them before applying the procedure, using e.g. the QR factorization.
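To make the filtering concrete, the following small numpy sketch (ours, with illustrative function names, not code from the paper) computes the correlation threshold η given by the right hand side of (2) and the corresponding range-query radius implied by Equation (1):

```python
# Sketch of the threshold computation behind the filtering step (illustrative names).
import numpy as np

def correlation_threshold(c_model, c_resid):
    """Right hand side of (2): c_model holds cor(x_i, x_k) for the k-1 (orthogonalized)
    model variables, c_resid is cor(r, x_k) for the residual r."""
    k_minus_1 = len(c_model)
    denom = np.sqrt(1.0 - np.sum(np.square(c_model)) + k_minus_1 * c_resid ** 2)
    return abs(c_resid) / denom

def query_radius(eta):
    """Radius such that ||x' - q'|| <= radius  <=>  cor(x, q) >= eta, by Equation (1)."""
    return np.sqrt(2.0 - 2.0 * eta)

# The theorem requires orthogonal model columns; a reduced QR factorization,
# e.g. Q, _ = np.linalg.qr(X_model), gives an orthonormal basis of the same column space.
```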
The final algorithm is given in Figure 3. It is worth noting that, when an exact index such as the Ball Tree is used, the algorithm described in Figure 3 gives the same results as stepwise regression performed on the full joined data.

Algorithm: Fast stepwise based on a multidimensional index
1) r ← y; X ← (1, 1, . . . , 1)^T
2) Find the variable x_1 most correlated with r   # nearest neighbor query
3) Add x_1 to the model: X ← [X|x_1]
4) Compute the new residual vector r = y − H_X y
5) Find the variable x_k most correlated with r
6) C ← {x_k}   # the set of candidate variables
7) η ← |c_{r,k}| / √(1 − c_{1,k}² − . . . − c_{k−1,k}² + (k−1) c_{r,k}²)
8) For i ← 1, . . . , k − 1:
9)    C ← C ∪ {all variables x such that ‖x − x_i‖² ≤ 2 − 2η}   # range queries
10) C ← C ∪ {all variables x such that ‖x − r‖² ≤ 2 − 2η}   # range query
11) Find the best variable x_{i*} in C using the stepwise procedure and add it to the model
12) If the model has improved significantly: go to 4)

Fig. 3. The fast stepwise regression algorithm based on a multidimensional index.

5 Experimental evaluation

We now present an experimental evaluation of the proposed approach. First we give an illustrative example, then we examine the efficiency.

5.1 An illustrative example

The following example is based on a part of the Eurostat database [1]. The response variable is the infant mortality rate in each country, and the predictors are variables from the part of the database concerning 'Population and social conditions', mainly 'Health'. The combined data set consists of 736 observations (data from 1990 to 2012 for each of the 32 European countries) and 164,460 variables. We decided to select two variables for the model. Missing values in the time series were replaced with the most recent preceding value, or with the next available value if no preceding one existed.

Exact stepwise regression (produced with the regular stepwise procedure or with the Ball Tree index) resulted in the following two variables being added to the model:

– "Causes of death by NUTS 2 regions - Crude death rate (per 100,000 inhabitants) for both men and women aged 65 or older, due to Malignant neoplasms, stated or presumed to be primary, of lymphoid, haematopoietic and related tissue",
– "Health personnel by NUTS 2 regions - number of physiotherapists per inhabitant".

The variables themselves are not likely to be directly related to the target, but are correlated with important factors. The first variable is most probably correlated with general life expectancy, which reflects the efficiency of the medical system. The number of physiotherapists (second variable) is most probably correlated with the number of general health personnel per 100,000 inhabitants. Dealing with correlated variables is an important topic for future research.

An analogous result was obtained using the approximate index implemented in the FLANN package. Since the results are approximate, slightly different attributes were selected, but the RSS remained comparable. Moreover, building the model using the Ball Tree index was almost 8 times faster than stepwise regression on the full data, and using the FLANN index more than 400 times faster!

5.2 Performance evaluation

To assess performance we used a part of the Eurostat [1] database. The response variable was again the infant mortality rate, and the predictors came from the 'Population and social conditions' section, mainly 'Health', 'Education and training' and 'Living conditions and welfare'.
This resulted in a joined dataset consisting of 736 observations (data from 1990 to 2012 for 32 European countries) and over 200,000 attributes. The algorithms used in the comparison are: regular stepwise regression on the full joined data ('regular step'), fast stepwise regression using two types of spatial indices ('step with BallTree' and 'step with FLANN'), and stepwise regression built using the algorithm in Figure 3 with the spatial queries answered by brute-force search ('step with no index').

The first two charts in Figure 4 show how the time to build a model with 3 or 5 variables changes with a growing number of available attributes (i.e. the size of the full joined data). The remaining two charts show how the time changes with a growing number of observations (records). To obtain the smaller datasets we simply drew samples of the attributes or of the observations.

[Figure 4: four charts of model building time in seconds; the top two plot time against the number of attributes (up to 200,000), the bottom two against the number of observations (up to about 700), for models with 3 and 5 variables and for the methods 'step with BallTree', 'step with FLANN', 'regular step' and 'step with no index'.]

Fig. 4. Average time needed to build models with 3 or 5 variables for varying numbers of available variables and observations. 'regular step' is the classical stepwise regression; all others are the proposed fast versions using different spatial indexes and brute-force search.

We can see that the best times are obtained using the FLANN index. It is worth noting that FLANN gives approximate, yet quite precise, results. Slower, but still reasonably fast, model construction can be obtained by using the Ball Tree index, which guarantees an exact solution. All charts show that the bigger the data, the bigger the advantage of using the algorithm shown in Figure 3.

6 Conclusions

The paper presents a method for building regression models on linked open data at interactive speed. The method is based on the use of spatial indexes for efficiently finding candidate variables. The method has been evaluated experimentally on Eurostat data and demonstrated to perform much faster than standard regression implementations.

7 Acknowledgements

The paper is co-funded by the European Union from resources of the European Social Fund, Project PO KL "Information technologies: Research and their interdisciplinary applications", Agreement UDA-POKL.04.01.01-00-051/10-00.

A Proof of Theorem 1

To prove Theorem 1 we need to introduce two lemmas.

Lemma 1. Adding $x_k$ to a least squares model of y based on $X_{k-1} = [x_1|\ldots|x_{k-1}]$ decreases the RSS by $\frac{(x_k^T P_{k-1} y)^2}{x_k^T P_{k-1} x_k}$, where $P_{k-1} := I - H_{X_{k-1}}$.

Proof. $x_k$ can be expressed as the sum of a vector in the space spanned by the columns of $X_{k-1}$ and a vector perpendicular to that space: $x_k = H_{X_{k-1}} x_k + P_{k-1} x_k$. If $x_k$ is a linear function of the columns of $X_{k-1}$, adding it to the model gives no decrease of RSS, so we only need to consider $P_{k-1} x_k$. It is easy to see that if $x_k$ is uncorrelated with each column of $X_{k-1}$, adding it to the model decreases the RSS by $\frac{(x_k^T y)^2}{x_k^T x_k}$.
This is because the RSS is then equal to $y^T P_k y$, where $P_k = P_{k-1} - x_k(x_k^T x_k)^{-1} x_k^T$. Combining these facts with the symmetry and idempotency of $P_{k-1}$, the RSS decreases by

$$\frac{\left((P_{k-1} x_k)^T y\right)^2}{(P_{k-1} x_k)^T P_{k-1} x_k} = \frac{(x_k^T P_{k-1}^T y)^2}{x_k^T P_{k-1}^T P_{k-1} x_k} = \frac{(x_k^T P_{k-1} y)^2}{x_k^T P_{k-1} x_k}.$$

Lemma 2. Assume now that $X_{k-1}$ is orthogonal. If adding $x_{k'}$ to the model gives a lower RSS than adding $x_k$, then

$$\frac{c_{r,k}^2}{1 - c_{1,k}^2 - \ldots - c_{k-1,k}^2} < \frac{c_{r,k'}^2}{1 - c_{1,k'}^2 - \ldots - c_{k-1,k'}^2}. \qquad (3)$$

Proof. From Lemma 1 we know that if $x_{k'}$ causes a greater decrease in RSS, then

$$\frac{(x_k^T P_{k-1} y)^2}{x_k^T P_{k-1} x_k} < \frac{(x_{k'}^T P_{k-1} y)^2}{x_{k'}^T P_{k-1} x_{k'}}.$$

We also know that (since the vectors are normalized) $c_{r,k}^2 = (x_k^T r)^2 = (x_k^T P_{k-1} y)^2$, and using the orthogonality of $X_{k-1}$ we get

$$x_k^T P_{k-1} x_k = x_k^T \left(I - X_{k-1}(X_{k-1}^T X_{k-1})^{-1} X_{k-1}^T\right) x_k = x_k^T \left(I - X_{k-1} X_{k-1}^T\right) x_k = x_k^T x_k - (x_1^T x_k)^2 - \ldots - (x_{k-1}^T x_k)^2 = 1 - c_{1,k}^2 - \ldots - c_{k-1,k}^2,$$

which proves the lemma.

Proof (of Theorem 1). If for some $i = 1, \ldots, k-1$ we have $|c_{i,k'}| \ge \frac{|c_{r,k}|}{\sqrt{1 - c_{1,k}^2 - \ldots - c_{k-1,k}^2 + (k-1)c_{r,k}^2}}$, the inequality holds. Otherwise, for all $i = 1, \ldots, k-1$,

$$|c_{i,k'}| < \frac{|c_{r,k}|}{\sqrt{1 - c_{1,k}^2 - \ldots - c_{k-1,k}^2 + (k-1)c_{r,k}^2}} \qquad (4)$$

and we need to show that this implies $|c_{r,k'}| \ge \frac{|c_{r,k}|}{\sqrt{1 - c_{1,k}^2 - \ldots - c_{k-1,k}^2 + (k-1)c_{r,k}^2}}$. Notice first that the inequalities (4) imply

$$1 - c_{1,k'}^2 - \ldots - c_{k-1,k'}^2 > \frac{1 - c_{1,k}^2 - \ldots - c_{k-1,k}^2}{1 - c_{1,k}^2 - \ldots - c_{k-1,k}^2 + (k-1)c_{r,k}^2}.$$

Using this inequality and Lemma 2 we get the desired result:

$$c_{r,k'}^2 > c_{r,k}^2\,\frac{1 - c_{1,k'}^2 - \ldots - c_{k-1,k'}^2 + c_{r,k'}^2}{1 - c_{1,k}^2 - \ldots - c_{k-1,k}^2 + c_{r,k}^2} > \frac{c_{r,k}^2}{1 - c_{1,k}^2 - \ldots - c_{k-1,k}^2 + (k-1)c_{r,k}^2}.$$

References

1. Eurostat. http://ec.europa.eu/eurostat
2. FeGeLOD. http://www.ke.tu-darmstadt.de/resources/fegelod
3. Google Correlate. http://www.google.com/trends/correlate
4. H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, AC-19(6):716–723, 1974.
5. C. Bizer, T. Heath, and T. Berners-Lee. Linked data - the story so far. International Journal on Semantic Web and Information Systems (IJSWIS), 5(3):1–22, 2009.
6. United Nations Statistics Division. UNdata. http://data.un.org
7. M. A. Efroymson. Multiple Regression Analysis. Wiley, 1960.
8. A. Gersho and R. M. Gray. Vector Quantization and Signal Compression. Springer, 1992.
9. T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer New York Inc., New York, NY, USA, 2001.
10. T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, 2009.
11. T. Heath and C. Bizer. Linked Data: Evolving the Web into a Global Data Space. Synthesis Lectures on the Semantic Web: Theory and Technology. Morgan & Claypool, 1st edition, 2011.
12. A. M. Kibriya and E. Frank. An empirical comparison of exact nearest neighbour algorithms. In PKDD, pages 140–151. Springer, 2007.
13. D. Lin and D. P. Foster. VIF regression: A fast regression algorithm for large data. In ICDM '09: Ninth IEEE International Conference on Data Mining, 2009.
14. M. Mohebbi, D. Vanderkam, J. Kodysh, R. Schonberger, H. Choi, and S. Kumar. Google Correlate whitepaper. 2011.
15. M. Muja and D. Lowe. FLANN - Fast Library for Approximate Nearest Neighbors, 2013.
16. M. Muja and D. G. Lowe. Scalable nearest neighbor algorithms for high dimensional data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36, 2014.
17. H. Paulheim. Explain-a-LOD: using linked open data for interpreting statistics. In IUI, pages 313–314, 2012.
18. H. Paulheim and J. Fürnkranz. Unsupervised generation of data mining features from linked open data. Technical Report TUD-KE-2011-2, Knowledge Engineering Group, Technische Universität Darmstadt, 2011.
19. H. Samet. The Design and Analysis of Spatial Data Structures. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1990.
20. G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.
21. D. Vanderkam, R. Schonberger, H. Rowley, and S. Kumar. Technical report: Nearest neighbor search in Google Correlate. 2013.
22. J. Zhou, D. P. Foster, R. A. Stine, and L. H. Ungar. Streamwise feature selection. Journal of Machine Learning Research, 7:1861–1885, December 2006.