<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Greedy Algorithm for Sparse Monotone Regression</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Alexey</forename><forename type="middle">R</forename><surname>Faizliev</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Saratov State University</orgName>
								<address>
									<settlement>Saratov</settlement>
									<country key="RU">Russia</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Alexander</forename><forename type="middle">A</forename><surname>Gudkov</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Saratov State University</orgName>
								<address>
									<settlement>Saratov</settlement>
									<country key="RU">Russia</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Sergei</forename><forename type="middle">V</forename><surname>Mironov</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Saratov State University</orgName>
								<address>
									<settlement>Saratov</settlement>
									<country key="RU">Russia</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Mikhail</forename><forename type="middle">A</forename><surname>Levshunov</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Saratov State University</orgName>
								<address>
									<settlement>Saratov</settlement>
									<country key="RU">Russia</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Greedy Algorithm for Sparse Monotone Regression</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">9C650CC3F5AD233D3E857FB9980ED70B</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-23T20:35+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>greedy algorithms</term>
					<term>pool-adjacent-violators algorithm</term>
					<term>isotonic regression</term>
					<term>monotone regression</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The problem of constructing the best fitted monotone regression is an NP-hard problem and can be formulated as a convex programming problem with linear constraints. The paper proposes a simple greedy algorithm for finding a sparse monotone regression using a Frank-Wolfe-type approach. A software package for this problem is developed and implemented in R and C++. The proposed method is compared with the well-known pool-adjacent-violators algorithm (PAVA) using simulated data.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Recent years have seen an increasing interest in shape-constrained estimation in statistics <ref type="bibr" target="#b9">[10]</ref>. One such problem is the construction of a monotone regression: finding the best-fitting non-decreasing sequence of points for a given set of points in the plane. A survey of results on monotone regression can be found in the book by Robertson and Dykstra <ref type="bibr" target="#b24">[25]</ref>. The papers of <ref type="bibr">Barlow and Brunk [3]</ref>, Dykstra <ref type="bibr" target="#b15">[16]</ref>, Best and Chakravarti <ref type="bibr" target="#b3">[4]</ref>, and <ref type="bibr">Best [5]</ref> consider the problem of finding a monotone regression in quadratic and convex programming frameworks.</p><p>Using a mathematical programming approach, the works <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b20">21,</ref><ref type="bibr" target="#b30">31]</ref> have recently provided some new results on the topic. The papers <ref type="bibr" target="#b6">[7,</ref><ref type="bibr" target="#b16">17]</ref> extend the problem to partial orders defined by the variables of a multiple regression. 
The paper <ref type="bibr" target="#b7">[8]</ref> investigates a dual active-set algorithm for regularized monotonic regression.</p><p>Monotone regression is widely used in mathematical statistics <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b9">10]</ref>; in the smoothing of empirical data <ref type="bibr" target="#b14">[15]</ref>; in shape-preserving approximation <ref type="bibr" target="#b18">[19]</ref>, <ref type="bibr" target="#b25">[26]</ref>, <ref type="bibr" target="#b29">[30]</ref>, <ref type="bibr" target="#b5">[6]</ref>, <ref type="bibr" target="#b26">[27]</ref>, <ref type="bibr" target="#b12">[13]</ref>; and in shape-preserving dynamic programming <ref type="bibr" target="#b8">[9]</ref>.</p><p>When constructing a monotone regression, we assume a relationship between a predictor x = (x_1, . . . , x_n) and a response y = (y_1, . . . , y_n). In the general case it is expected that</p><formula xml:id="formula_0">x_{i+1} − x_i ≠ const, x_i &lt; x_{i+1}, i = 1, . . . , n − 1.</formula><p>The work was supported by RFBR (grant 16-01-00507).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A sequence</head><formula xml:id="formula_1">z = (z_1, . . . , z_n) ∈ R^n is called monotone if z_i − z_{i−1} ≥ 0, i = 2, . . . , n.</formula></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Denote by ∆_1^n</head><p>the set of all vectors from R^n which are monotone. The problem of constructing a monotone regression can be formulated as a convex programming problem with linear constraints as follows: it is necessary to find a vector z ∈ R^n with the lowest mean square error of approximation to the given vector y ∈ R^n under the condition of monotonicity of z:</p><formula xml:id="formula_2">f(z) = (1/n) Σ_{i=1}^{n} (z_i − y_i)² → min_{z ∈ ∆_1^n}.<label>(1)</label></formula><p>In many situations researchers have no information regarding the mathematical specification of the true regression function. Typically, this involves the y_i's being non-decreasing with the ordered x_i's. Such a situation is called isotonic regression. Isotonic (monotone) regression is a special case of k-monotone regression <ref type="bibr" target="#b23">[24]</ref>.</p><p>It is well known that problem (1) is an NP-hard problem <ref type="bibr" target="#b23">[24]</ref>. In this paper we present a simple greedy algorithm which employs a Frank-Wolfe-type approach for finding a sparse monotone regression. A software package for this problem is developed and implemented in R and C++.</p><p>For the convenience of solving problem (1), we move from the points z_i to their increments ζ_i, where</p><formula xml:id="formula_3">ζ_i = z_{i+1} − z_i, i = 1, . . . , n − 1, ζ_0 = z_1.</formula><p>Then the monotonicity of z corresponds to the non-negativity of the ζ_i's (except ζ_0). The proposed method is compared with the well-known pool-adjacent-violators algorithm (PAVA) using simulated data.</p></div>
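The change of variables from z to its increments ζ can be sketched as follows (an illustrative C++ fragment, not the authors' package; the function names are ours):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Map z = (z_1, ..., z_n) to its increments: zeta_0 = z_1, zeta_i = z_{i+1} - z_i.
std::vector<double> to_increments(const std::vector<double>& z) {
    std::vector<double> zeta(z.size());
    if (z.empty()) return zeta;
    zeta[0] = z[0];
    for (std::size_t i = 1; i < z.size(); ++i)
        zeta[i] = z[i] - z[i - 1];
    return zeta;
}

// Inverse map: z_i is the cumulative sum zeta_0 + ... + zeta_{i-1}.
std::vector<double> from_increments(const std::vector<double>& zeta) {
    std::vector<double> z(zeta.size());
    double acc = 0.0;
    for (std::size_t i = 0; i < zeta.size(); ++i) {
        acc += zeta[i];
        z[i] = acc;
    }
    return z;
}
```

In this parameterization z is monotone exactly when ζ_1, . . . , ζ_{n−1} are non-negative, so the order constraints of problem (1) become simple sign constraints.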
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Algorithms for monotone regression</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">PAVA</head><p>A simple iterative algorithm for solving problem (1) is the Pool-Adjacent-Violators Algorithm (PAVA) <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b23">24]</ref>. The work <ref type="bibr" target="#b3">[4]</ref> examined a generalization of this algorithm. The paper <ref type="bibr" target="#b31">[32]</ref> studied this problem as the problem of identifying the active set and proposed a direct algorithm of the same complexity as PAVA (the dual algorithm).</p><p>PAVA computes a non-decreasing sequence of values z = (z_i)_{i=1}^{n} such that problem (1) is solved. In the simple monotone regression case we have the measurement pairs (x_i, y_i). Let us assume that these pairs are ordered with respect to the predictors. The following (Algorithm 1) is a pseudocode of PAVA for the problem. The generalized pool-adjacent-violators algorithm (GPAVA), a strict generalization of PAVA, was developed in the article <ref type="bibr" target="#b32">[33]</ref>.</p><p>The block values are expanded with respect to the observations i = 1, . . . , n such that the final result is the vector z of length n with elements z_i in increasing order <ref type="bibr" target="#b23">[24]</ref>. </p><formula xml:id="formula_4">z_r^{(l+1)} := s(z_r^{(l)})</formula><p>, where the solver s is the conditional (weighted) mean or (weighted) quantile;</p><formula xml:id="formula_5">• If z_{r+1}^{(l)} ≤ z_r^{(l)} then l := l + 1; until the z-blocks are increasing, i.e. z_{r+1}^{(l)} ≥ z_r^{(l)}</formula><p>for all r; • Return z; end</p></div>
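For concreteness, the pooling step of Algorithm 1 with the least-squares solver (the block mean) can be sketched in C++ as follows; this is an illustrative stack-based variant of PAVA, not the authors' implementation:

```cpp
#include <cstddef>
#include <vector>

// Pool-Adjacent-Violators with the mean solver: returns the non-decreasing z
// minimizing sum_i (z_i - y_i)^2 for observations y ordered by the predictor.
std::vector<double> pava(const std::vector<double>& y) {
    std::vector<double> mean;      // mean value of each block
    std::vector<std::size_t> len;  // number of observations in each block
    for (std::size_t i = 0; i < y.size(); ++i) {
        mean.push_back(y[i]);
        len.push_back(1);
        // Adjacent pooling: merge the last two blocks while their means violate order.
        while (mean.size() > 1 && mean[mean.size() - 2] > mean.back()) {
            std::size_t a = mean.size() - 2, b = mean.size() - 1;
            mean[a] = (mean[a] * len[a] + mean[b] * len[b]) / (len[a] + len[b]);
            len[a] += len[b];
            mean.pop_back();
            len.pop_back();
        }
    }
    // Expand the block means back to a vector of length n.
    std::vector<double> z;
    for (std::size_t b = 0; b < mean.size(); ++b)
        z.insert(z.end(), len[b], mean[b]);
    return z;
}
```

Each observation is pushed once and merged at most once, which gives the linear running time usually quoted for PAVA.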
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Frank-Wolfe type greedy algorithm</head><p>The Frank-Wolfe method (also called the conditional gradient method) solves constrained convex optimization problems in finite-dimensional vector spaces. The method was introduced in 1956. The original algorithm did not use a fixed step size and requires solving a linear programming subproblem at each iteration. The Frank-Wolfe method was further developed by Levitin and Polyak in 1966, and V.F. Demianov and A.M. Rubinov generalized it to the case of arbitrary Banach spaces in 1970 <ref type="bibr" target="#b13">[14]</ref>. Recently, Frank-Wolfe-type methods have attracted increased interest due to the possibility of obtaining sparse solutions, as well as their good scalability <ref type="bibr" target="#b11">[12,</ref><ref type="bibr" target="#b22">23]</ref>. In particular, <ref type="bibr" target="#b21">[22,</ref><ref type="bibr" target="#b33">34]</ref> studied algorithms for solving problems with penalty functions (instead of considering constrained optimization problems). Besides, the paper <ref type="bibr" target="#b33">[34]</ref> interlaces boosting with fixed-rank local optimization.</p><p>As mentioned above, for the computational convenience of problem (1), we moved from the points z_i to the increments</p><formula xml:id="formula_6">ζ_i = z_{i+1} − z_i, i = 1, . . . , n − 1, ζ_0 = z_1.</formula><p>Then problem (1) can be rewritten as follows:</p><formula xml:id="formula_7">g(ζ) := (1/n) Σ_{i=1}^{n} ( Σ_{j=0}^{i−1} ζ_j − y_i )² → min_{ζ ∈ S},<label>(2)</label></formula><p>where S denotes the set of all</p><formula xml:id="formula_8">ζ = (ζ_0, ζ_1, . . . , ζ_{n−1}) ∈ R^n such that ζ_0 ∈ R, (ζ_1, . . . , ζ_{n−1}) ∈ R_+^{n−1} and Σ_{j=0}^{n−1} ζ_j ≤ max_i y_i. Let ∇g(ζ) denote the gradient of the function g at the point ζ.</formula><p>It should be noted that for larger-scale problems the solution can become computationally quite challenging. 
In this regard, the present study proposes to use a greedy algorithm of Frank-Wolfe type for solving this problem.</p><p>The following (Algorithm 2) is a pseudocode of the Frank-Wolfe-type algorithm for problem (2).</p><p>The rate of convergence is estimated according to the following theorem.</p><p>Algorithm 2: Greedy algorithm for sparse monotone regression </p><formula xml:id="formula_9">g(ζ_t) − g* ≤ 4 √( n(n + 1)(2n + 1) / (6n²) ) (max_i y_i − min_i y_i)² / (t + 2),<label>(3)</label></formula><p>where g* is the optimal value of (2).</p><p>Proof. It is known <ref type="bibr" target="#b17">[18]</ref> that for all t ≥ 2:</p><formula xml:id="formula_11">g(ζ_t) − g* ≤ 2L (Diam(S))² / (t + 2),</formula><p>where L is the Lipschitz constant of ∇g and Diam(S) is the diameter of S. Let</p><formula xml:id="formula_12">∇²g(ζ) := ( ∂²g/∂ζ_0², ∂²g/∂ζ_1², . . . , ∂²g/∂ζ_{n−1}² ).</formula><p>It is well known that if ∇g is differentiable then its Lipschitz constant L satisfies the inequality</p><formula xml:id="formula_13">L ≤ sup_ζ ‖∇²g(ζ)‖_2. Then L ≤ sup_ζ √( Σ_{k=0}^{n−1} (∂²g/∂ζ_k²)² ) = (1/n) √( Σ_{k=1}^{n} (2(n − k + 1))² ) = (2/n) √( Σ_{k=1}^{n} k² ) = 2 √( n(n + 1)(2n + 1) / (6n²) ).<label>(4)</label></formula><p>It is also easy to prove that Diam(S) = √2 (max_i y_i − min_i y_i).</p><p>The disadvantage of this method is the dependence of the theoretical rate of convergence on the dimensionality of the problem. The papers <ref type="bibr" target="#b27">[28]</ref>, <ref type="bibr" target="#b19">[20]</ref>, <ref type="bibr" target="#b28">[29]</ref> suggest using the duality gap as a stopping criterion for Frank-Wolfe-type algorithms.</p></div>
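A minimal C++ sketch of Algorithm 2 is given below. It assumes, for simplicity, that y_i ≥ 0 so that the constraint ζ_0 ≥ 0 can be imposed as well; the feasible set then becomes the scaled simplex {ζ ≥ 0, Σ_j ζ_j ≤ max_i y_i}, whose vertices are the origin and the points (max_i y_i)·e_k, so the linear minimization step reduces to picking the coordinate with the most negative partial derivative. This is our reading of the algorithm, not the authors' code:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Frank-Wolfe-type greedy sketch for problem (2), assuming y_i >= 0 so that
// zeta_0 >= 0 may be imposed; the feasible set is then the scaled simplex
// { zeta >= 0, sum_j zeta_j <= ymax }, with vertices 0 and ymax * e_k.
std::vector<double> greedy_monotone(const std::vector<double>& y, int iters) {
    const std::size_t n = y.size();
    const double ymax = *std::max_element(y.begin(), y.end());
    std::vector<double> zeta(n, 0.0);
    for (int t = 0; t < iters; ++t) {
        // Residuals r_i = z_i - y_i, where z = cumulative sums of zeta.
        std::vector<double> r(n), grad(n);
        double acc = 0.0;
        for (std::size_t i = 0; i < n; ++i) { acc += zeta[i]; r[i] = acc - y[i]; }
        // grad[k] = (2/n) * sum_{i >= k} r_i, computed via suffix sums.
        double suffix = 0.0;
        for (std::size_t i = n; i-- > 0; ) { suffix += r[i]; grad[i] = 2.0 * suffix / n; }
        // Linear minimization oracle: the best vertex of the simplex.
        const std::size_t k = static_cast<std::size_t>(
            std::min_element(grad.begin(), grad.end()) - grad.begin());
        std::vector<double> vertex(n, 0.0);
        if (grad[k] < 0.0) vertex[k] = ymax;   // otherwise the origin is optimal
        const double alpha = 2.0 / (t + 2.0);  // step-size rule of Theorem 1
        for (std::size_t i = 0; i < n; ++i)
            zeta[i] += alpha * (vertex[i] - zeta[i]);
    }
    // Recover the monotone sequence z from the increments.
    std::vector<double> z(n);
    double acc = 0.0;
    for (std::size_t i = 0; i < n; ++i) { acc += zeta[i]; z[i] = acc; }
    return z;
}
```

Each iteration moves toward a single vertex, so at most one new nonzero increment appears per step; stopping after N iterations therefore caps the cardinality of the solution at N, which is the source of sparsity discussed below.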
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Empirical Result</head><p>The algorithms have been implemented both in R and C++. We compared the performance of the greedy algorithm (Algorithm 2) with the performance of PAVA (Algorithm 1) using simulated data sets.</p><p>It should be noted that PAVA is significantly faster for small-scale tasks in R. However, once the number of points exceeds roughly 2000, the greedy algorithm spends less time searching for a solution (Fig. <ref type="figure" target="#fig_0">1</ref>).</p><p>Tables 1 and 2 present empirical results for the PAVA and greedy algorithms on a simulated set of points. The simulated points are obtained as the values of the logarithm function with added normally distributed noise: A = {(x_i, y_i) : y_i = ln(x_0 + iΔx) + ϕ_i, ϕ_i ∼ N(0, 1), x_0 = 1, Δx = 1, i = 1, . . . , 10000}. The dimension of the problem is 10000 points. The tables contain information on the errors</p><formula xml:id="formula_16">(1/n) Σ_{i=1}^{n} (z_i − y_i)²,</formula><p>elapsed time, cardinality and the greedy algorithm's iteration number.</p><p>The results show that the error of the greedy algorithm approaches the error of PAVA as the number of iterations of the greedy algorithm increases. While PAVA is better than the greedy algorithm in terms of error, the greedy algorithm's solutions are sparser. It should be noted that the elapsed time for PAVA implemented in C++ is smaller than for the greedy algorithm. However, the greedy algorithm converges faster when the number of iterations is less than 700 for the implementations in R. Both algorithms obtain sparse solutions, but in the greedy algorithm, as opposed to PAVA, we can control the number of nonzero elements (cardinality). Generally, the greedy algorithm's cardinality increases by one at each iteration. 
Consequently, we should limit the number of iterations to obtain a sparser solution.</p><p>Figure <ref type="figure">2</ref> shows simulated points (N = 100) with logarithmic structure and the isotonic regressions, where the green line represents the greedy algorithm's isotonic regression. Table <ref type="table">1</ref> compares PAVA and the greedy algorithm (Greedy) on the simulated data (implementation in language C++):</p><formula xml:id="formula_17">A = {(x_i, y_i) : y_i = ln(x_0 + iΔx) + ϕ_i, ϕ_i ∼ N(0, 1), x_0 = 1, Δx = 1, i = 1, . . . , 10000}.</formula><p>The obtained empirical results for the greedy algorithm show that the rate of convergence on the considered examples is much higher than the theoretical estimates obtained in Theorem 1 suggest. </p></div>
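The simulated data set A and the reported metrics (error and cardinality) can be reproduced along the following lines (a sketch; the random seed and function names are ours):

```cpp
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Simulate A: y_i = ln(x0 + i * dx) + phi_i, phi_i ~ N(0, 1).
std::vector<double> simulate(std::size_t n, double x0 = 1.0, double dx = 1.0,
                             unsigned seed = 42) {
    std::mt19937 gen(seed);
    std::normal_distribution<double> phi(0.0, 1.0);
    std::vector<double> y(n);
    for (std::size_t i = 0; i < n; ++i)
        y[i] = std::log(x0 + static_cast<double>(i + 1) * dx) + phi(gen);
    return y;
}

// The error metric reported in the tables: (1/n) * sum_i (z_i - y_i)^2.
double error(const std::vector<double>& z, const std::vector<double>& y) {
    double s = 0.0;
    for (std::size_t i = 0; i < z.size(); ++i)
        s += (z[i] - y[i]) * (z[i] - y[i]);
    return s / static_cast<double>(z.size());
}

// Cardinality: the number of nonzero entries of the increment vector.
std::size_t cardinality(const std::vector<double>& zeta, double tol = 1e-12) {
    std::size_t c = 0;
    for (double v : zeta)
        if (std::abs(v) > tol) ++c;
    return c;
}
```

With n = 10000 these helpers reproduce the setting of Tables 1 and 2, up to the realization of the noise.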
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Conclusion</head><p>Our research proposes an algorithm for solving the problem of constructing the best fitted monotone regression by using the Frank-Wolfe method. The software was implemented in R and C++. We compared the performance of the greedy algorithm with the performance of PAVA using simulated data sets. While PAVA gives slightly smaller errors than the greedy algorithm, the greedy algorithm obtains significantly sparser solutions. The advantages of the greedy algorithm are its simplicity of implementation, the ability to control cardinality, and a lower elapsed time for the implementation in R on problems of large dimension.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Algorithm 1:</head><label>1</label><figDesc>Pool-Adjacent-Violators Algorithm (PAVA) begin • Let z_j^{(0)} := y_j be the start point, l = 0; • The index for the blocks is r = 1, . . . , B, where at step l = 0 we set B := n, i.e. each observation z_r^{(0)} forms a block; repeat • (Adjacent pooling) Merge values of z^{(l)} into blocks if z Solve f(z) for each block r, i.e., compute the update based on the solver, which gives z_r^{(l+1)}</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Fig. 1 .</head><label>1</label><figDesc>Fig. 1. The dependence of the CPU time on dimension of the problem of the greedy algorithm (green line, 100 iterations) and PAVA (red line) implemented in R.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head></head><label></label><figDesc>Algorithm 2: Greedy algorithm for sparse monotone regression. begin • Let N be the number of iterations; the function g(ζ) and the feasible set S were defined above. Let ∇g(ζ) = (∂g/∂ζ_0, ∂g/∂ζ_1, . . . , ∂g/∂ζ_{n−1}) be the gradient of the function g at a point ζ, where ∂g/∂ζ_k = (2/n) Σ_{i=k}^{n} ( Σ_{j=0}^{i−1} ζ_j − y_i ), k = 0, . . . , n − 1; • Let the zero vector ζ_0 = (0, . . . , 0) be the start point, and let the counter t = 0; • while t &lt; N do • Calculate ∇g(ζ_t), the gradient of the function g at the point ζ_t; • Let ζ̃_t be the solution of the linear optimization problem ⟨∇g(ζ_t), ζ⟩ → min_{ζ ∈ S}, where ⟨∇g(ζ_t), ζ⟩ is a scalar product of vectors; • (Update step) Let ζ_{t+1} = ζ_t + α_t(ζ̃_t − ζ_t), α_t = 2/(t + 2), and then t := t + 1; • Recover the monotone sequence z = (z_1, . . . , z_n) from the vector of increments ζ_N; end</figDesc><note>Theorem 1. Let {ζ_t} be generated according to the Frank-Wolfe method (Algorithm 2) using the step-size rule α_t = 2/(t + 2). Then for all t ≥ 2: g(ζ_t) − g* ≤ 4 √( n(n + 1)(2n + 1) / (6n²) ) (max_i y_i − min_i y_i)² / (t + 2).</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 2 .</head><label>2</label><figDesc>Comparison of algorithms PAVA and greedy algorithm (Greedy) on an example of the simulated data (implementation in language R): A = {(x_i, y_i)}</figDesc><table><row><cell>Algorithm (number of iterations)</cell><cell>Error</cell><cell>Cardinality</cell><cell>Time</cell></row><row><cell>PAVA</cell><cell>0.994</cell><cell>82</cell><cell>4.28</cell></row><row><cell>Greedy (10)</cell><cell>1.169</cell><cell>6</cell><cell>0.09</cell></row><row><cell>Greedy (50)</cell><cell>1.011</cell><cell>30</cell><cell>0.33</cell></row><row><cell>Greedy (100)</cell><cell>0.999</cell><cell>41</cell><cell>0.63</cell></row><row><cell>Greedy (200)</cell><cell>0.996</cell><cell>57</cell><cell>1.23</cell></row><row><cell>Greedy (500)</cell><cell>0.995</cell><cell>76</cell><cell>3.08</cell></row><row><cell>Greedy (1000)</cell><cell>0.994</cell><cell>79</cell><cell>6.28</cell></row><row><cell>Greedy (2000)</cell><cell>0.994</cell><cell>82</cell><cell>12.67</cell></row><row><cell>Greedy (5000)</cell><cell>0.994</cell><cell>82</cell><cell>31.58</cell></row><row><cell>Greedy (10000)</cell><cell>0.994</cell><cell>82</cell><cell>60.91</cell></row></table><note>Fig. 2. Step functions obtained by the greedy algorithm (ε = 0.753) and PAVA (ε = 0.751).</note></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">A fast scaling algorithm for minimizing separable convex functions subject to chain constraints</title>
		<author>
			<persName><forename type="first">R</forename><surname>Ahuja</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Orlin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Operations Research</title>
		<imprint>
			<biblScope unit="volume">49</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="784" to="789" />
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">Efficient algorithms for non-convex isotonic regression through submodular optimization</title>
		<author>
			<persName><forename type="first">F</forename><surname>Bach</surname></persName>
		</author>
		<idno>hal-01569934</idno>
		<ptr target="https://hal.archives-ouvertes.fr/hal-01569934" />
		<imprint>
			<date type="published" when="2017-07">Jul 2017</date>
		</imprint>
	</monogr>
	<note type="report_type">Tech. Rep.</note>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">The isotonic regression problem and its dual</title>
		<author>
			<persName><forename type="first">R</forename><surname>Barlow</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Brunk</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of the American Statistical Association</title>
		<imprint>
			<biblScope unit="volume">67</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="140" to="147" />
			<date type="published" when="1972">1972</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Active set algorithms for isotonic regression: a unifying framework</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J</forename><surname>Best</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Chakravarti</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Mathematical Programming: Series A and B</title>
		<imprint>
			<biblScope unit="volume">47</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="425" to="439" />
			<date type="published" when="1990">1990</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Minimizing separable convex functions subject to simple chain constraints</title>
		<author>
			<persName><forename type="first">M</forename><surname>Best</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Chakravarti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Ubhaya</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">SIAM Journal on Optimization</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="658" to="672" />
			<date type="published" when="2000">2000</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Linear approximation method preserving kmonotonicity</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">I</forename><surname>Boytsov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">P</forename><surname>Sidorov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Siberian electronic mathematical reports</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<biblScope unit="page" from="21" to="27" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">A generalised PAV algorithm for monotonic regression in several variables</title>
		<author>
			<persName><forename type="first">O</forename><surname>Burdakov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Grimvall</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hussian</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">COMPSTAT, Proceedings of the 16th Symposium in Computational Statistics</title>
				<editor>
			<persName><surname>Antoch</surname></persName>
		</editor>
		<imprint>
			<date type="published" when="2004">2004</date>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="page" from="761" to="767" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">A dual active-set algorithm for regularized monotonic regression</title>
		<author>
			<persName><forename type="first">O</forename><surname>Burdakov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Sysoev</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Optimization Theory and Applications</title>
		<imprint>
			<biblScope unit="volume">172</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="929" to="949" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Shape-preserving dynamic programming</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Cai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">L</forename><surname>Judd</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Math. Meth. Oper. Res</title>
		<imprint>
			<biblScope unit="volume">77</biblScope>
			<biblScope unit="page" from="407" to="421" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title level="m" type="main">Aspects of Shape-constrained Estimation in Statistics</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Chen</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
	<note type="report_type">Ph.D. thesis</note>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Polynomial algorithms for isotonic regression</title>
		<author>
			<persName><forename type="first">V</forename><surname>Chepoi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Cogneau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Fichet</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Lecture Notes-Monograph Series</title>
		<imprint>
			<biblScope unit="volume">31</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="147" to="160" />
			<date type="published" when="1997">1997</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Sparse greedy approximation, and the Frank-Wolfe algorithm</title>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">L</forename><surname>Clarkson</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Transactions on Algorithms</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="1" to="30" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Piecewise convex-concave approximation in the minimax norm</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">P</forename><surname>Cullinan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Abstracts of Conference on Approximation and Optimization: Algorithms, Complexity, and Applications</title>
				<editor>
			<persName><forename type="first">I</forename><surname>Demetriou</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Pardalos</surname></persName>
		</editor>
		<meeting><address><addrLine>Athens, Greece</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017-06">June 29-30, 2017</date>
			<biblScope unit="page">4</biblScope>
		</imprint>
		<respStmt>
			<orgName>National and Kapodistrian University of Athens</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Approximate Methods in Optimization Problems</title>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">F</forename><surname>Demyanov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">M</forename><surname>Rubinov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Modern Analytic and Computational Methods in Science and Mathematics</title>
		<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Case-control isotonic regression for investigation of elevation in risk around a point source</title>
		<author>
			<persName><forename type="first">Peter</forename><surname>Diggle</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">S</forename><surname>Tony</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J</forename></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Statistics in medicine</title>
		<imprint>
			<biblScope unit="volume">18</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="1605" to="1613" />
			<date type="published" when="1999">1999</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">An isotonic regression algorithm</title>
		<author>
			<persName><forename type="first">R</forename><surname>Dykstra</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Statistical Planning and Inference</title>
		<imprint>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="355" to="363" />
			<date type="published" when="1981">1981</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">An algorithm for isotonic regression for two or more independent variables</title>
		<author>
			<persName><forename type="first">R</forename><surname>Dykstra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Robertson</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">The Annals of Statistics</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="708" to="719" />
			<date type="published" when="1982">1982</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">An algorithm for quadratic programming</title>
		<author>
			<persName><forename type="first">M</forename><surname>Frank</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Wolfe</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Naval Research Logistics Quarterly</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="issue">1-2</biblScope>
			<biblScope unit="page" from="95" to="110" />
			<date type="published" when="1956">1956</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<title level="m" type="main">Shape-Preserving Approximation by Real and Complex Polynomials</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">G</forename><surname>Gal</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2008">2008</date>
			<publisher>Springer</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">On the convergence of a greedy algorithm for the solution of the problem for the construction of monotone regression</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">A</forename><surname>Gudkov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">V</forename><surname>Mironov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">R</forename><surname>Faizliev</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Izv. Saratov Univ. (N. S.)</title>
		<imprint>
			<biblScope unit="volume">17</biblScope>
			<biblScope unit="page" from="431" to="440" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
	<note>Ser. Math. Mech. Inform.</note>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Algorithms and error estimations for monotone regression on partially preordered sets</title>
		<author>
			<persName><forename type="first">J</forename><surname>Hansohm</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Multivariate Analysis</title>
		<imprint>
			<biblScope unit="volume">98</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="1043" to="1050" />
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Conditional gradient algorithms for norm-regularized smooth convex optimization</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Harchaoui</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Juditsky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Nemirovski</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Mathematical Programming: Series A and B</title>
		<imprint>
			<biblScope unit="volume">152</biblScope>
			<biblScope unit="issue">1-2</biblScope>
			<biblScope unit="page" from="75" to="112" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<title level="m" type="main">Sparse Convex Optimization Methods for Machine Learning</title>
		<author>
			<persName><forename type="first">M</forename><surname>Jaggi</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2011">2011</date>
		</imprint>
		<respStmt>
			<orgName>ETH Zürich</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Ph.D. thesis</note>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Isotone optimization in R: Pool-adjacent-violators algorithm (PAVA) and active set methods</title>
		<author>
			<persName><forename type="first">J</forename><surname>De Leeuw</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Hornik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Mair</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Statistical Software</title>
		<imprint>
			<biblScope unit="volume">32</biblScope>
			<biblScope unit="issue">5</biblScope>
			<biblScope unit="page" from="1" to="24" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<monogr>
		<title level="m" type="main">Order Restricted Statistical Inference</title>
		<author>
			<persName><forename type="first">T</forename><surname>Robertson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Wright</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Dykstra</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1988">1988</date>
			<publisher>John Wiley &amp; Sons</publisher>
			<pubPlace>New York</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<monogr>
		<title level="m" type="main">Local approximation by splines</title>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">T</forename><surname>Shevaldin</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2014">2014</date>
			<publisher>UrO RAN</publisher>
			<pubPlace>Ekaterinburg</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Linear k-monotonicity preserving algorithms and their approximation properties</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">P</forename><surname>Sidorov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">LNCS</title>
		<imprint>
			<biblScope unit="volume">9582</biblScope>
			<biblScope unit="page" from="93" to="106" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Duality gap analysis of weak relaxed greedy algorithms</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">P</forename><surname>Sidorov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">V</forename><surname>Mironov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">LNCS</title>
		<imprint>
			<biblScope unit="volume">10556</biblScope>
			<biblScope unit="page" from="251" to="262" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">Dual convergence estimates for a family of greedy algorithms in Banach spaces</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">P</forename><surname>Sidorov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">V</forename><surname>Mironov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">G</forename><surname>Pleshakov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">LNCS</title>
		<imprint>
			<biblScope unit="volume">10710</biblScope>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note>in press</note>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">On the saturation effect for linear shape-preserving approximation in Sobolev spaces</title>
		<author>
			<persName><forename type="first">S</forename><surname>Sidorov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Miskolc Mathematical Notes</title>
		<imprint>
			<biblScope unit="volume">16</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="1191" to="1197" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<analytic>
		<title level="a" type="main">An algorithm for isotonic regression with arbitrary convex distance function</title>
		<author>
			<persName><forename type="first">U</forename><surname>Stromberg</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computational Statistics &amp; Data Analysis</title>
		<imprint>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="205" to="219" />
			<date type="published" when="1991">1991</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<analytic>
		<title level="a" type="main">Isotonic regression: Another look at the changepoint problem</title>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">B</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Woodroofe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Mentz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Biometrika</title>
		<imprint>
			<biblScope unit="volume">88</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="793" to="804" />
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b32">
	<analytic>
		<title level="a" type="main">Exact algorithms for isotonic regression and related</title>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">L</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">P</forename><surname>Xing</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Physics: Conference Series</title>
		<imprint>
			<biblScope unit="volume">699</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="1" to="9" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b33">
	<analytic>
		<title level="a" type="main">Accelerated training for matrix-norm regularization: A boosting approach</title>
		<author>
			<persName><forename type="first">X</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Schuurmans</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">NIPS&apos;12 Proceedings of the 25th International Conference on Neural Information Processing Systems</title>
		<imprint>
			<date type="published" when="2012">2012</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="2906" to="2914" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
