In-Database Factorized Learning

Hung Q. Ngo¹, XuanLong Nguyen², Dan Olteanu³, and Maximilian Schleich³

¹ LogicBlox, Inc., hung.ngo@logicblox.com
² University of Michigan, xuanlong@umich.edu
³ University of Oxford, {dan.olteanu,max.schleich}@cs.ox.ac.uk

1 Introduction

In this paper, we overview recent contributions on in-database analytics for a class of optimization problems that are important for LogicBlox retail-planning and forecasting applications [4, 5, 7]. The class includes ridge linear regression, polynomial regression, factorization machines, principal component analysis, and classification models. Such problems are typically computed over input data defined by a feature extraction join query on data sources residing inside a database. The query result can have a large number of attributes and records, which leads to large compute times or failure to process the entire dataset for conventional analytics engines.

Pushing analytical computation inside the database engine saves non-trivial time usually spent on data import/export at the interface between database systems and statistical packages. In addition, a large part of the computational challenge for optimization problems can be addressed with conventional database techniques. To show this, we decouple the data-dependent computation from the computation of the optimal solution. The data-dependent step can be phrased as factorized computation of many inter-related aggregates over database joins [3, 7]. We further exploit functional dependencies to reduce the dimensionality of the optimization problem [4]. Motivated by these industrial applications, this line of work has recently attracted increasing interest in academia [1].

2 Problem formulation

We use a unified framework to express and solve optimization problems [4]. In the following, boldface letters (e.g., $\mathbf{x}$, $\mathbf{x}_i$, $\boldsymbol\theta$) denote vectors or matrices, normal-face letters (e.g., $x_i$, $\theta_j$) denote scalars, and $\langle \cdot, \cdot \rangle$ denotes the Frobenius inner product of two matrices.

Let $Q$ be a feature extraction join query and $D$ a database that defines the training dataset $Q(D)$ for an optimization problem. Suppose the problem has $p$ parameters $\boldsymbol\theta = (\theta_1, \ldots, \theta_p) \in \mathbb{R}^p$, as well as a response $y$ and $n$ numeric features $\mathbf{x} = (x_1, \ldots, x_n)$, provided by the data points $(\mathbf{x}, y) \in Q(D)$. For a positive integer $m$, there exist two vector-valued functions $g : \mathbb{R}^p \to \mathbb{R}^m$ and $h : \mathbb{R}^n \to \mathbb{R}^m$. Each component function $g_j$ of $g = (g_j)_{j \in [m]}$ is a multivariate polynomial of the model parameters. Each component function $h_j$ of $h = (h_j)_{j \in [m]}$ is a multivariate monomial of the input features. Using the least-squares loss function with $\ell_2$-regularization, all problems in our class of optimization problems are captured by the following objective function:

$$J(\boldsymbol\theta) := \frac{1}{2|Q(D)|} \sum_{(\mathbf{x},y) \in Q(D)} \big(\langle g(\boldsymbol\theta), h(\mathbf{x})\rangle - y\big)^2 + \frac{\lambda}{2}\|\boldsymbol\theta\|_2^2. \qquad (1)$$

We exemplify the case of ridge linear and polynomial regression [4]. The ridge linear regression model with response $y$ and features $\mathbf{x} = (x_0 = 1, x_1, \ldots, x_n)$ has $p = n + 1$ parameters $\boldsymbol\theta = (\theta_0, \ldots, \theta_n)$ and the objective

$$J(\boldsymbol\theta) = \frac{1}{2|Q(D)|} \sum_{(\mathbf{x},y) \in Q(D)} \Big(\sum_{i=0}^{n} \theta_i x_i - y\Big)^2 + \frac{\lambda}{2}\|\boldsymbol\theta\|_2^2.$$

This has the form (1) with $m = n + 1$, where $g$ and $h$ are the identity functions: $g(\boldsymbol\theta) = \boldsymbol\theta$ and $h(\mathbf{x}) = \mathbf{x}$.

The degree-$d$ polynomial regression model with response $y$ and features $\mathbf{x} = (x_0 = 1, x_1, \ldots, x_n)$ extends the ridge linear regression model with all possible feature interactions up to degree $d$. Thus, each component function $h_j$ defines one interaction (e.g., $x_2 x_3 x_5$) and $g_j$ defines the corresponding parameter ($\theta_{235}$). Formally, the model has $p = m = \sum_{i=0}^{d} n^i$ parameters $\boldsymbol\theta = (\theta_{\mathbf{a}})$, where $\mathbf{a} = (a_1, \ldots, a_n)$ is a tuple of non-negative integers such that $\sum_{i=1}^{n} a_i \le d$. In this case, $g(\boldsymbol\theta) = \boldsymbol\theta$ and $h$ is defined by the component functions $h_{\mathbf{a}}(\mathbf{x}) = \prod_{i=1}^{n} x_i^{a_i}$.
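To make the correspondence concrete, the following Python sketch spells out the pair $(g, h)$ for ridge linear regression and for degree-2 polynomial regression, and evaluates the objective (1) over a list of $(\mathbf{x}, y)$ pairs standing in for $Q(D)$. The function names and the materialized data representation are illustrative choices made here, not part of the systems discussed in this paper.

    import numpy as np

    # Illustrative only: the pair (g, h) that instantiates objective (1)
    # for two models over n continuous features x = (x_1, ..., x_n).

    def g_identity(theta):
        # g is the identity for both regression models below
        return np.asarray(theta)

    def h_linear(x):
        # h(x) = (1, x_1, ..., x_n), so m = n + 1
        return np.concatenate(([1.0], x))

    def h_poly2(x):
        # degree-2 polynomial regression: all monomials x_i * x_j with
        # 0 <= i <= j <= n, where x_0 = 1 also covers degrees 0 and 1
        xs = np.concatenate(([1.0], x))
        return np.array([xs[i] * xs[j]
                         for i in range(len(xs)) for j in range(i, len(xs))])

    def objective(theta, g, h, data, lam):
        # J(theta) from equation (1); data stands in for Q(D)
        loss = sum((np.dot(g(theta), h(x)) - y) ** 2 for x, y in data)
        return loss / (2 * len(data)) + lam / 2 * np.dot(theta, theta)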
3 Efficiently Solving the Optimization Problems

We outline how to efficiently compute solutions to problems of the form (1). First, we consider problems with continuous features [5, 7], and then generalize to continuous and categorical features [4]. The difference is that continuous features are quantitative, whereas categorical features encode qualitative properties (e.g., color or store id). We then show how functional dependencies can be exploited to reduce the dimensionality of the problem.

Solutions for (1) can be computed with a variant of batch gradient descent (BGD), which repeatedly updates the parameters in the direction of the gradient until convergence: $\boldsymbol\theta := \boldsymbol\theta - \alpha \nabla J(\boldsymbol\theta)$. To compute each BGD iteration efficiently, we rewrite (1) to decouple the data-dependent computation from the parameters.

Theorem 1 ([4]). Let $w = \frac{1}{|Q(D)|}$. Define the matrix $\boldsymbol\Sigma = (\sigma_{ij})_{i,j \in [m]}$, the vector $\mathbf{c} = (c_i)_{i \in [m]}$, and the scalar $s_Y$ by

$$\boldsymbol\Sigma = w \cdot \sum_{(\mathbf{x},y) \in Q(D)} h(\mathbf{x})\, h(\mathbf{x})^\top; \qquad \mathbf{c} = w \cdot \sum_{(\mathbf{x},y) \in Q(D)} y \cdot h(\mathbf{x}); \qquad s_Y = w \cdot \sum_{(\mathbf{x},y) \in Q(D)} y^2.$$

Then, (1) becomes

$$J(\boldsymbol\theta) = \frac{1}{2}\, g(\boldsymbol\theta)^\top \boldsymbol\Sigma\, g(\boldsymbol\theta) - \langle g(\boldsymbol\theta), \mathbf{c}\rangle + \frac{s_Y}{2} + \frac{\lambda}{2}\|\boldsymbol\theta\|_2^2$$

$$\nabla J(\boldsymbol\theta) = \frac{\partial g(\boldsymbol\theta)^\top}{\partial \boldsymbol\theta}\, \boldsymbol\Sigma\, g(\boldsymbol\theta) - \frac{\partial g(\boldsymbol\theta)^\top}{\partial \boldsymbol\theta}\, \mathbf{c} + \lambda \boldsymbol\theta \qquad (2)$$

$\boldsymbol\Sigma$, $\mathbf{c}$, and $s_Y$ can be expressed as sum-product functional aggregate queries (FAQs) and computed inside the database. For example, in the case of linear regression, the aggregates in $\boldsymbol\Sigma$ are the sums of all possible pairwise products of features (e.g., the SQL aggregate SUM(X_i * X_j) for all $i \le j \in [n]$).

Our algorithms draw on earlier work on factorized databases [6] and the FAQ framework for computing aggregates over joins [3]. All aggregates can be computed in one pass over the (non-materialized) factorized join [7]. Once these aggregates are computed, each BGD iteration computes the gradient in Equation (2) without scanning the data $Q(D)$. The following runtime bounds do not include log factors in the database size.

Proposition 2 ([5, 7]). Let $Q$ be a feature extraction join query, $\mathsf{fhtw}(Q)$ denote its fractional hypertree width, and $D$ a database. If all variables in $Q$ are continuous, then the aggregates $(\boldsymbol\Sigma, \mathbf{c}, s_Y)$ for an optimization problem (1) can be computed in time $O\big(m^2 \cdot |D|^{\mathsf{fhtw}(Q)}\big)$.

For continuous variables, the sizes of $\boldsymbol\Sigma$ and $\mathbf{c}$ are $O(m^2)$ and $O(m)$, respectively. Thus, using the gradient (2) and the precomputed quantities $(\boldsymbol\Sigma, \mathbf{c}, s_Y)$, the running time of BGD is bounded by a polynomial in $m$, $p$, and the number of iterations, but not in the database size. To put this result into context, if we first computed the join using a worst-case optimal join algorithm and then the aggregates, this would take time $O\big(m^2 \cdot |D|^{\rho^*(Q)}\big)$. For acyclic joins, $\mathsf{fhtw}(Q) = 1$, whereas $\rho^*(Q)$ can be as large as the number of relations in $Q$ [2].
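To illustrate the decoupling in Theorem 1, the following Python sketch handles the special case of ridge linear regression ($g$ is the identity and $h(\mathbf{x}) = \mathbf{x}$ with $x_0 = 1$). It computes $(\boldsymbol\Sigma, \mathbf{c}, s_Y)$ in one scan over a materialized list of $(\mathbf{x}, y)$ pairs; an actual in-database implementation would instead compute these aggregates over the non-materialized factorized join. The subsequent BGD iterations use only $(\boldsymbol\Sigma, \mathbf{c})$ and never touch the data again.

    import numpy as np

    def compute_aggregates(data):
        # One pass over the training data computes Sigma, c, and s_Y from
        # Theorem 1; here data is a materialized list of (x, y) pairs.
        w = 1.0 / len(data)
        m = len(data[0][0])
        Sigma, c, s_Y = np.zeros((m, m)), np.zeros(m), 0.0
        for x, y in data:
            Sigma += np.outer(x, x)   # h(x) h(x)^T with h(x) = x
            c += y * x                # y * h(x)
            s_Y += y * y
        return w * Sigma, w * c, w * s_Y

    def batch_gradient_descent(Sigma, c, lam, alpha=0.1, iterations=1000):
        # Gradient (2) with g(theta) = theta: Sigma @ theta - c + lam * theta.
        # Each iteration costs O(m^2), independent of the database size.
        theta = np.zeros(len(c))
        for _ in range(iterations):
            theta -= alpha * (Sigma @ theta - c + lam * theta)
        return theta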
Our problem formulation (1) naturally captures the case of categorical variables [4]. Such variables are a common source of sparsity in optimization problems, because they are one-hot encoded. Assume city is a categorical variable in the query $Q$ and London, Oxford, and Bristol are the only three cities occurring in the query result. One-hot encoding transforms the component $x_{\mathsf{city}}$ of each $(\mathbf{x}, y) \in Q(D)$ into a 3-dimensional vector $\mathbf{x}_{\mathsf{city}} = [x_{\mathsf{London}}, x_{\mathsf{Oxford}}, x_{\mathsf{Bristol}}]$, where one of the three values is 1 and the others are 0, indicating which city occurs in this data point. If a particular data point has city = "Oxford", then $\mathbf{x}_{\mathsf{city}} = [0, 1, 0]$.

One-hot encoding can blow up the size of the input data, which makes subsequent processing inefficient. To avoid this blow-up, modern analytics engines use a sparse representation of the input data. Transforming the data into a sparse format, however, requires non-trivial time.

In our framework, we can capture categorical features by turning the aggregates required to compute $\boldsymbol\Sigma$, $\mathbf{c}$, and $s_Y$ into group-by aggregates with the categorical features as free variables, as opposed to the sum aggregates without free variables in Theorem 1. In the problem formulation (1), the feature vector $\mathbf{x}$ becomes a vector of vectors, each component $g_j$ of $g = (g_j)_{j \in [m]}$ and $h_j$ of $h = (h_j)_{j \in [m]}$ is a matrix, and each entry $\sigma_{ij}$ of $\boldsymbol\Sigma$, as well as $\mathbf{c}$, is a tensor. The dimensionality of the tensors $(\sigma_{ij})_{i,j \in [m]}$ and $\mathbf{c}$ can be very large, but fortunately these tensors are very sparse and have many repeating values. Thus, we compute a sparse tensor representation of $\boldsymbol\Sigma$ and $\mathbf{c}$ with an algorithm that factorizes and massively shares aggregate computation. We refer the reader to the full report for details on the algorithm [4].

Proposition 3 ([4]). Let $\mathsf{faqw}(i, j)$ denote the FAQ-width of the query corresponding to $\sigma_{ij}$, and $S(i, j)$ denote the size of the sparse representation, i.e., the number of aggregates used to represent the tensor $\sigma_{ij}$. Then, the aggregates $(\boldsymbol\Sigma, \mathbf{c}, s_Y)$ can be computed in time $O\big(\sum_{i,j \in [m]} (|D|^{\mathsf{faqw}(i,j)} + S(i, j))\big)$.

For acyclic queries, the FAQ-width [3] is $\mathsf{faqw}(i, j) = 1$. The gradient (2) can now be computed over each tensor $\sigma_{ij}$, whose size $S(i, j)$ is bounded by the database size due to the one-hot encoding. Thus, BGD is not completely independent of the data size if the feature extraction query has categorical variables. A side effect of one-hot encoding is that the categorical variables become "linearized".

We can exploit functional dependencies (FDs) on categorical variables to reduce the dimensionality of the optimization problem [4]. Suppose we have the FD city → country, where city and country are two categorical variables in $Q$, and $\mathbf{x}_{\mathsf{city}}$ and $\mathbf{x}_{\mathsf{country}}$ are one-hot encoded vectors. Then, using the one-hot encoding example from above, the following identity must hold: $x_{\mathsf{England}} = x_{\mathsf{London}} + x_{\mathsf{Oxford}} + x_{\mathsf{Bristol}}$. We use this linear relationship to rewrite the monomials in each component $h_j$ of $h = (h_j)_{j \in [m]}$ such that all occurrences of $\mathbf{x}_{\mathsf{country}}$ are replaced by the functionally determining quantity $\mathbf{x}_{\mathsf{city}}$. This leads to a reparameterization of the loss term [4].

The direct implication of the reparameterization is that the number of aggregates required to compute $\boldsymbol\Sigma$ and $\mathbf{c}$ can be reduced drastically. The effect of the reparameterization on the parameter space is less obvious, because the $\ell_2$-regularization term is non-linear. Depending on the structure of the FD, however, many of the parameters corresponding to functionally determined statistics can be optimized out [4]. Therefore, the transformed parameter space is also reduced in dimension, which can help speed up convergence in the optimization phase.
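The following Python sketch illustrates one-hot encoding and the FD-based rewriting on the running city/country example; the encoding and names are our own toy construction, not the sparse tensor algorithm of [4].

    # Toy illustration of one-hot encoding and the FD city -> country.
    CITIES = ["London", "Oxford", "Bristol"]
    COUNTRY_OF = {"London": "England", "Oxford": "England", "Bristol": "England"}
    COUNTRIES = sorted(set(COUNTRY_OF.values()))

    def one_hot(value, domain):
        # Indicator vector over the active domain, e.g. Oxford -> [0, 1, 0].
        return [1.0 if v == value else 0.0 for v in domain]

    def country_from_city(x_city):
        # Since city functionally determines country, the one-hot encoding of
        # country is a fixed linear function of the one-hot encoding of city;
        # here x_England = x_London + x_Oxford + x_Bristol.
        x_country = [0.0] * len(COUNTRIES)
        for i, city in enumerate(CITIES):
            x_country[COUNTRIES.index(COUNTRY_OF[city])] += x_city[i]
        return x_country

    x_city = one_hot("Oxford", CITIES)     # [0.0, 1.0, 0.0]
    x_country = country_from_city(x_city)  # [1.0], i.e., x_England = 1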
In practice, the interplay of efficient algorithms for computing $(\boldsymbol\Sigma, \mathbf{c}, s_Y)$ and the exploitation of FDs to reduce the dimensionality of the problem leads to orders-of-magnitude performance improvements over state-of-the-art analytics engines for optimization problems that are common in LogicBlox analytics [4, 7].

References

1. S. Abiteboul et al. Research Directions for Principles of Data Management (Dagstuhl Perspectives Workshop 16151). CoRR, abs/1701.09007, 2017.
2. A. Atserias, M. Grohe, and D. Marx. Size bounds and query plans for relational joins. SIAM Journal on Computing, 42(4):1737-1767, 2013.
3. M. A. Khamis, H. Q. Ngo, and A. Rudra. FAQ: Questions asked frequently. In PODS, pages 13-28, 2016.
4. H. Ngo, X. Nguyen, D. Olteanu, and M. Schleich. In-database learning with sparse tensors. Technical report, CoRR, abs/1703.04780, 2017.
5. D. Olteanu and M. Schleich. Factorized databases. SIGMOD Record, 45(2):5-16, 2016.
6. D. Olteanu and J. Závodný. Size bounds for factorised representations of query results. TODS, 40(1):2, 2015.
7. M. Schleich, D. Olteanu, and R. Ciucanu. Learning linear regression models over factorized joins. In SIGMOD, pages 3-18, 2016.