In-Database Factorized Learning

Hung Q. Ngo¹, XuanLong Nguyen², Dan Olteanu³, and Maximilian Schleich³

¹ LogicBlox, Inc., hung.ngo@logicblox.com
² University of Michigan, xuanlong@umich.edu
³ University of Oxford, {dan.olteanu,max.schleich}@cs.ox.ac.uk

1 Introduction

In this paper, we overview recent contributions on in-database analytics for a class of optimization problems that are important for LogicBlox retail-planning and forecasting applications [4, 5, 7]. The class includes ridge linear regression, polynomial regression, factorization machines, principal component analysis, and classification models. Such problems are typically computed over input data defined by a feature extraction join query on data sources residing inside a database. The query result can have a large number of attributes and records, which leads to large compute times or failure to process the entire dataset for conventional analytics engines.

Pushing analytical computation inside the database engine saves non-trivial time usually spent on data import/export at the interface between database systems and statistical packages. In addition, a large part of the computational challenge for optimization problems can be addressed with conventional database techniques. To show this, we decouple the data-dependent computation from the computation of the optimal solution. The data-dependent step can be phrased as factorized computation of many inter-related aggregates over database joins [3, 7]. We further exploit functional dependencies to reduce the dimensionality of the optimization problem [4]. Motivated by these industrial applications, this line of work has recently attracted increasing interest in academia [1].

2 Problem formulation

We use a unified framework to express and solve optimization problems [4]. In the following, boldface letters (e.g., $\mathbf{x}$, $\mathbf{x}_i$, $\boldsymbol\theta$) denote vectors or matrices, normal-face letters (e.g., $x_i$, $\theta_j$) denote scalars, and $\langle \cdot, \cdot \rangle$ denotes the Frobenius inner product of two matrices.

Let $Q$ be a feature extraction join query and $D$ a database that defines the training dataset $Q(D)$ for an optimization problem. Suppose the problem has $p$ parameters $\boldsymbol\theta = (\theta_1, \ldots, \theta_p) \in \mathbb{R}^p$, as well as a response $y$ and $n$ numeric features $\mathbf{x} = (x_1, \ldots, x_n)$, provided by the data points $(\mathbf{x}, y) \in Q(D)$. For a positive integer $m$, there exist two vector-valued functions $g : \mathbb{R}^p \to \mathbb{R}^m$ and $h : \mathbb{R}^n \to \mathbb{R}^m$. Each component function $g_j$ of $g = (g_j)_{j \in [m]}$ is a multivariate polynomial of the model parameters. Each component function $h_j$ of $h = (h_j)_{j \in [m]}$ is a multivariate monomial of the input features. Using the least-squares loss function with $\ell_2$-regularization, all problems in our class of optimization problems are captured by the following objective function:

$$J(\boldsymbol\theta) := \frac{1}{2|Q(D)|} \sum_{(\mathbf{x},y) \in Q(D)} \big(\langle g(\boldsymbol\theta), h(\mathbf{x})\rangle - y\big)^2 + \frac{\lambda}{2}\|\boldsymbol\theta\|_2^2. \qquad (1)$$

We exemplify the case of ridge linear and polynomial regression [4]. The ridge linear regression model with response $y$ and features $\mathbf{x} = (x_0 = 1, x_1, \ldots, x_n)$ has $p = n + 1$ parameters $\boldsymbol\theta = (\theta_0, \ldots, \theta_n)$ and the objective

$$J(\boldsymbol\theta) = \frac{1}{2|Q(D)|} \sum_{(\mathbf{x},y) \in Q(D)} \Big(\sum_{i=0}^{n} \theta_i x_i - y\Big)^2 + \frac{\lambda}{2}\|\boldsymbol\theta\|_2^2.$$

This has the form (1) with $m = n + 1$, where $g$ and $h$ are the identity functions: $g(\boldsymbol\theta) = \boldsymbol\theta$ and $h(\mathbf{x}) = \mathbf{x}$.

The degree-$d$ polynomial regression model with response $y$ and features $\mathbf{x} = (x_0 = 1, x_1, \ldots, x_n)$ extends the ridge linear regression model with all possible feature interactions up to degree $d$. Thus, each component function $h_j$ defines one interaction (e.g., $x_2 x_3 x_5$) and $g_j$ defines the corresponding parameter ($\theta_{235}$). Formally, the model has $p = m = \sum_{i=0}^{d} n^i$ parameters $\boldsymbol\theta = (\theta_{\mathbf{a}})$, where $\mathbf{a} = (a_1, \ldots, a_n)$ is a tuple of non-negative integers such that $\sum_{i=1}^{n} a_i \le d$. In this case, $g(\boldsymbol\theta) = \boldsymbol\theta$ and $h$ is defined by the component functions $h_{\mathbf{a}}(\mathbf{x}) = \prod_{i=1}^{n} x_i^{a_i}$.
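To make the correspondence concrete, the following Python sketch spells out the pair $(g, h)$ for ridge linear regression and for degree-2 polynomial regression, and evaluates the objective (1) over a list of $(\mathbf{x}, y)$ pairs standing in for $Q(D)$. The function names and the materialized data representation are illustrative choices made here, not part of the systems discussed in this paper.

    import numpy as np

    # Illustrative only: the pair (g, h) that instantiates objective (1)
    # for two models over n continuous features x = (x_1, ..., x_n).

    def g_identity(theta):
        # g is the identity for both regression models below
        return np.asarray(theta)

    def h_linear(x):
        # h(x) = (1, x_1, ..., x_n), so m = n + 1
        return np.concatenate(([1.0], x))

    def h_poly2(x):
        # degree-2 polynomial regression: all monomials x_i * x_j with
        # 0 <= i <= j <= n, where x_0 = 1 also covers degrees 0 and 1
        xs = np.concatenate(([1.0], x))
        return np.array([xs[i] * xs[j]
                         for i in range(len(xs)) for j in range(i, len(xs))])

    def objective(theta, g, h, data, lam):
        # J(theta) from equation (1); data stands in for Q(D)
        loss = sum((np.dot(g(theta), h(x)) - y) ** 2 for x, y in data)
        return loss / (2 * len(data)) + lam / 2 * np.dot(theta, theta)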
3 Efficiently Solving the Optimization Problems

We outline how to efficiently compute solutions to problems of the form (1). First, we consider problems with continuous features [5, 7], and then generalize to continuous and categorical features [4]. The difference is that continuous features are quantitative, whereas categorical features encode qualitative properties (e.g., color or store id). We then show how functional dependencies can be exploited to reduce the dimensionality of the problem.

Solutions for (1) can be computed with a variant of batch gradient descent (BGD), which repeatedly updates the parameters in the direction of the gradient until convergence: $\boldsymbol\theta := \boldsymbol\theta - \alpha \nabla J(\boldsymbol\theta)$. To compute each BGD iteration efficiently, we rewrite (1) to decouple the data-dependent computation from the parameters.

Theorem 1 ([4]). Let $w = \frac{1}{|Q(D)|}$. Define the matrix $\boldsymbol\Sigma = (\sigma_{ij})_{i,j \in [m]}$, the vector $\mathbf{c} = (c_i)_{i \in [m]}$, and the scalar $s_Y$ by

$$\boldsymbol\Sigma = w \cdot \sum_{(\mathbf{x},y) \in Q(D)} h(\mathbf{x})\, h(\mathbf{x})^\top; \qquad \mathbf{c} = w \cdot \sum_{(\mathbf{x},y) \in Q(D)} y \cdot h(\mathbf{x}); \qquad s_Y = w \cdot \sum_{(\mathbf{x},y) \in Q(D)} y^2.$$

Then, (1) becomes

$$J(\boldsymbol\theta) = \frac{1}{2}\, g(\boldsymbol\theta)^\top \boldsymbol\Sigma\, g(\boldsymbol\theta) - \langle g(\boldsymbol\theta), \mathbf{c}\rangle + \frac{s_Y}{2} + \frac{\lambda}{2}\|\boldsymbol\theta\|_2^2$$

$$\nabla J(\boldsymbol\theta) = \frac{\partial g(\boldsymbol\theta)^\top}{\partial \boldsymbol\theta}\, \boldsymbol\Sigma\, g(\boldsymbol\theta) - \frac{\partial g(\boldsymbol\theta)^\top}{\partial \boldsymbol\theta}\, \mathbf{c} + \lambda \boldsymbol\theta \qquad (2)$$

$\boldsymbol\Sigma$, $\mathbf{c}$, and $s_Y$ can be expressed as sum-product functional aggregate queries (FAQs) and computed inside the database. For example, in the case of linear regression, the aggregates in $\boldsymbol\Sigma$ are the sums of all possible pairwise products of features (e.g., the SQL aggregate SUM(X_i * X_j) for all $i \le j \in [n]$).

Our algorithms draw on earlier work on factorized databases [6] and the FAQ framework for computing aggregates over joins [3]. All aggregates can be computed in one pass over the (non-materialized) factorized join [7]. Once these aggregates are computed, each BGD iteration computes the gradient in Equation (2) without scanning the data $Q(D)$. The following runtime bounds do not include log factors in the database size.

Proposition 2 ([5, 7]). Let $Q$ be a feature extraction join query, $\mathsf{fhtw}(Q)$ denote its fractional hypertree width, and $D$ a database. If all variables in $Q$ are continuous, then the aggregates $(\boldsymbol\Sigma, \mathbf{c}, s_Y)$ for an optimization problem (1) can be computed in time $O\big(m^2 \cdot |D|^{\mathsf{fhtw}(Q)}\big)$.

For continuous variables, the sizes of $\boldsymbol\Sigma$ and $\mathbf{c}$ are $O(m^2)$ and $O(m)$, respectively. Thus, using the gradient (2) and the precomputed quantities $(\boldsymbol\Sigma, \mathbf{c}, s_Y)$, the running time of BGD is bounded by a polynomial in $m$, $p$, and the number of iterations, but not in the database size. To put this result into context, if we first computed the join using a worst-case optimal join algorithm and then the aggregates, this would take time $O\big(m^2 \cdot |D|^{\rho^*(Q)}\big)$. For acyclic joins, $\mathsf{fhtw}(Q) = 1$, whereas $\rho^*(Q)$ can be as large as the number of relations in $Q$ [2].
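To illustrate the decoupling in Theorem 1, the following Python sketch handles the special case of ridge linear regression ($g$ is the identity and $h(\mathbf{x}) = \mathbf{x}$ with $x_0 = 1$). It computes $(\boldsymbol\Sigma, \mathbf{c}, s_Y)$ in one scan over a materialized list of $(\mathbf{x}, y)$ pairs; an actual in-database implementation would instead compute these aggregates over the non-materialized factorized join. The subsequent BGD iterations use only $(\boldsymbol\Sigma, \mathbf{c})$ and never touch the data again.

    import numpy as np

    def compute_aggregates(data):
        # One pass over the training data computes Sigma, c, and s_Y from
        # Theorem 1; here data is a materialized list of (x, y) pairs.
        w = 1.0 / len(data)
        m = len(data[0][0])
        Sigma, c, s_Y = np.zeros((m, m)), np.zeros(m), 0.0
        for x, y in data:
            Sigma += np.outer(x, x)   # h(x) h(x)^T with h(x) = x
            c += y * x                # y * h(x)
            s_Y += y * y
        return w * Sigma, w * c, w * s_Y

    def batch_gradient_descent(Sigma, c, lam, alpha=0.1, iterations=1000):
        # Gradient (2) with g(theta) = theta: Sigma @ theta - c + lam * theta.
        # Each iteration costs O(m^2), independent of the database size.
        theta = np.zeros(len(c))
        for _ in range(iterations):
            theta -= alpha * (Sigma @ theta - c + lam * theta)
        return theta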
Our problem formulation (1) naturally captures the case of categorical variables [4]. Such variables are a common source of sparsity in optimization problems, because they are one-hot encoded. Assume city is a categorical variable in the query $Q$ and London, Oxford, and Bristol are the only three cities occurring in the query result. One-hot encoding transforms the component $x_{\mathsf{city}}$ of each $(\mathbf{x}, y) \in Q(D)$ into a 3-dimensional vector $\mathbf{x}_{\mathsf{city}} = [x_{\mathsf{London}}, x_{\mathsf{Oxford}}, x_{\mathsf{Bristol}}]$, where one of the three values is 1 and the others are 0, indicating which city occurs in this data point. If a particular data point has city = "Oxford", then $\mathbf{x}_{\mathsf{city}} = [0, 1, 0]$.

One-hot encoding can blow up the size of the input data, which makes subsequent processing inefficient. To avoid this blow-up, modern analytics engines use a sparse representation of the input data. Transforming the data into a sparse format, however, requires non-trivial time.

In our framework, we can capture categorical features by turning the aggregates required to compute $\boldsymbol\Sigma$, $\mathbf{c}$, and $s_Y$ into group-by aggregates with the categorical features as free variables, as opposed to the sum aggregates without free variables in Theorem 1. In the problem formulation (1), the feature vector $\mathbf{x}$ becomes a vector of vectors, each component $g_j$ of $g = (g_j)_{j \in [m]}$ and $h_j$ of $h = (h_j)_{j \in [m]}$ is a matrix, and each entry $\sigma_{ij}$ of $\boldsymbol\Sigma$, as well as $\mathbf{c}$, is a tensor. The dimensionality of the tensors $(\sigma_{ij})_{i,j \in [m]}$ and $\mathbf{c}$ can be very large, but fortunately these tensors are very sparse and have many repeating values. Thus, we compute a sparse tensor representation of $\boldsymbol\Sigma$ and $\mathbf{c}$ with an algorithm that factorizes and massively shares aggregate computation. We refer the reader to the full report for details on the algorithm [4].

Proposition 3 ([4]). Let $\mathsf{faqw}(i, j)$ denote the FAQ-width of the query corresponding to $\sigma_{ij}$, and $S(i, j)$ denote the size of the sparse representation, i.e., the number of aggregates used to represent the tensor $\sigma_{ij}$. Then, the aggregates $(\boldsymbol\Sigma, \mathbf{c}, s_Y)$ can be computed in time $O\big(\sum_{i,j \in [m]} (|D|^{\mathsf{faqw}(i,j)} + S(i, j))\big)$.

For acyclic queries, the FAQ-width [3] is $\mathsf{faqw}(i, j) = 1$. The gradient (2) can now be computed over each tensor $\sigma_{ij}$, whose size $S(i, j)$ is bounded by the database size due to the one-hot encoding. Thus, BGD is not completely independent of the data size if the feature extraction query has categorical variables. A side effect of one-hot encoding is that the categorical variables become "linearized".

We can exploit functional dependencies (FDs) on categorical variables to reduce the dimensionality of the optimization problem [4]. Suppose we have the FD city → country, where city and country are two categorical variables in $Q$, and $\mathbf{x}_{\mathsf{city}}$ and $\mathbf{x}_{\mathsf{country}}$ are one-hot encoded vectors. Then, using the one-hot encoding example from above, the following identity must hold: $x_{\mathsf{England}} = x_{\mathsf{London}} + x_{\mathsf{Oxford}} + x_{\mathsf{Bristol}}$. We use this linear relationship to rewrite the monomials in each component $h_j$ of $h = (h_j)_{j \in [m]}$ such that all occurrences of $\mathbf{x}_{\mathsf{country}}$ are replaced by the functionally determining quantity $\mathbf{x}_{\mathsf{city}}$. This leads to a reparameterization of the loss term [4].

The direct implication of the reparameterization is that the number of aggregates required to compute $\boldsymbol\Sigma$ and $\mathbf{c}$ can be reduced drastically. The effect of the reparameterization on the parameter space is less obvious, because the $\ell_2$-regularization term is non-linear. Depending on the structure of the FD, however, many of the parameters corresponding to functionally determined statistics can be optimized out [4]. Therefore, the transformed parameter space is also reduced in dimension, which can help speed up convergence in the optimization phase.
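The following Python sketch illustrates one-hot encoding and the FD-based rewriting on the running city/country example; the encoding and names are our own toy construction, not the sparse tensor algorithm of [4].

    # Toy illustration of one-hot encoding and the FD city -> country.
    CITIES = ["London", "Oxford", "Bristol"]
    COUNTRY_OF = {"London": "England", "Oxford": "England", "Bristol": "England"}
    COUNTRIES = sorted(set(COUNTRY_OF.values()))

    def one_hot(value, domain):
        # Indicator vector over the active domain, e.g. Oxford -> [0, 1, 0].
        return [1.0 if v == value else 0.0 for v in domain]

    def country_from_city(x_city):
        # Since city functionally determines country, the one-hot encoding of
        # country is a fixed linear function of the one-hot encoding of city;
        # here x_England = x_London + x_Oxford + x_Bristol.
        x_country = [0.0] * len(COUNTRIES)
        for i, city in enumerate(CITIES):
            x_country[COUNTRIES.index(COUNTRY_OF[city])] += x_city[i]
        return x_country

    x_city = one_hot("Oxford", CITIES)     # [0.0, 1.0, 0.0]
    x_country = country_from_city(x_city)  # [1.0], i.e., x_England = 1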
In practice, the interplay of efficient algorithms for computing $(\boldsymbol\Sigma, \mathbf{c}, s_Y)$ and the exploitation of FDs to reduce the dimensionality of the problem leads to orders-of-magnitude performance improvements over state-of-the-art analytics engines for optimization problems that are common in LogicBlox analytics [4, 7].

References

1. S. Abiteboul et al. Research Directions for Principles of Data Management (Dagstuhl Perspectives Workshop 16151). CoRR, abs/1701.09007, 2017.
2. A. Atserias, M. Grohe, and D. Marx. Size bounds and query plans for relational joins. SIAM Journal on Computing, 42(4):1737-1767, 2013.
3. M. A. Khamis, H. Q. Ngo, and A. Rudra. FAQ: Questions asked frequently. In PODS, pages 13-28, 2016.
4. H. Ngo, X. Nguyen, D. Olteanu, and M. Schleich. In-database learning with sparse tensors. Technical report, CoRR, abs/1703.04780, 2017.
5. D. Olteanu and M. Schleich. Factorized databases. SIGMOD Record, 45(2):5-16, 2016.
6. D. Olteanu and J. Závodný. Size bounds for factorised representations of query results. TODS, 40(1):2, 2015.
7. M. Schleich, D. Olteanu, and R. Ciucanu. Learning linear regression models over factorized joins. In SIGMOD, pages 3-18, 2016.