A Database Framework for Classifier Engineering

Benny Kimelfeld (LogicBlox, Inc. and Technion, Israel)
Christopher Ré (Stanford University)

1 Introduction

In the design of machine-learning solutions, a critical and often the most resource-intensive task is that of feature engineering [7, 4], for which recipes and tooling have been developed [3, 7]. In this vision paper we embark on the establishment of database foundations for feature engineering. We propose a formal framework for classification in the context of a relational database, towards investigating the application of database and knowledge management techniques to assist with the task of feature engineering. We demonstrate the usefulness of this framework by formally defining two key algorithmic challenges within it: (1) separability refers to determining the existence of feature queries that agree with the given training examples, and (2) identifiability is the task of testing for the property of independence among features (given as queries). Moreover, we give preliminary results on these challenges in the context of conjunctive queries. We focus here on Boolean features that are represented as ordinary database queries, and view this work as the basis of various future extensions such as numerical features and more general regression tasks.

2 Formal Framework

We first present our formal framework for classification with binary features within a relational database.

2.1 Classifiers and Learning

In this work, a classifier is a function of the form

  γ : {−1, 1}^n → {−1, 1}

where n is a natural number that we call the arity of γ. A classifier class is a (possibly infinite) family Γ of classifiers. We denote by Γ_n the restriction of Γ to the n-ary classifiers in Γ. An n-ary training collection is a multiset T of pairs ⟨x, y⟩ where x ∈ {−1, 1}^n and y ∈ {−1, 1}. We denote by T_n the set of all n-ary training collections. A cost function for a classifier class Γ is a function of the form

  c : ∪_n (Γ_n × T_n) → R≥0

where R≥0 is the set of nonnegative real numbers. In the context of a classifier class Γ and a cost function c, learning a classifier is the task of finding a classifier γ ∈ Γ_n that minimizes c(γ, T), given a training collection T ∈ T_n.

We illustrate the above definitions on the important class of linear classifiers. An n-ary linear classifier is parameterized by a vector w ∈ R^n, is denoted by Λ_w, and is defined for all a ∈ {−1, 1}^n by

  Λ_w(a) = 1 if a · w ≥ 0, and Λ_w(a) = −1 otherwise,

where "·" denotes the dot product. By Lin we denote the class of linear classifiers. An example of a cost function is the least-squares cost lsq, given by

  lsq(Λ_w, T) = Σ_{⟨x,y⟩∈T} (x · w − y)²

for the arguments Λ_w ∈ Lin_n and T ∈ T_n.
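To make these definitions concrete, here is a minimal Python sketch (illustrative only, not part of the formalism). It assumes a training collection is represented as a list of pairs (x, y), where x is a tuple over {−1, 1} and y ∈ {−1, 1}, and it uses numpy for the dot products.

```python
import numpy as np

def linear_classifier(w):
    """The linear classifier Lambda_w: maps a to 1 if a . w >= 0, else -1."""
    w = np.asarray(w, dtype=float)
    return lambda a: 1 if float(np.dot(a, w)) >= 0 else -1

def lsq(w, T):
    """Least-squares cost of Lambda_w on a training collection T,
    represented as a list of pairs (x, y)."""
    w = np.asarray(w, dtype=float)
    return sum((float(np.dot(x, w)) - y) ** 2 for x, y in T)

# A 2-ary training collection and a candidate weight vector.
T = [((1, 1), 1), ((1, -1), 1), ((-1, -1), -1)]
gamma = linear_classifier([1.0, 0.5])
assert all(gamma(x) == y for x, y in T)  # gamma fully agrees with T
print(lsq([1.0, 0.5], T))                # 0.75
```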
2.2 Relational Formalism

Our relational terminology is as follows. A schema is a pair (A, Σ), where A is a signature that consists of relation symbols, and Σ is a set of logical integrity constraints over A. Each relation symbol R has an associated arity. We assume an infinite set Const of constants. An instance I over a schema S = (A, Σ) associates with every k-ary relation symbol R ∈ A a finite subset R^I of Const^k, such that all the constraints of Σ are satisfied. The active domain of an instance I, denoted adom(I), is the set of all the constants in Const that are mentioned in I.

Let S be a schema. A query (over S) is a function Q that is associated with an arity k and that maps every instance I over S into a finite subset Q(I) of Const^k. A query Q′ contains a query Q if Q(I) ⊆ Q′(I) for all instances I over S; if Q ⊆ Q′ and Q′ ⊆ Q, then Q and Q′ are said to be equivalent. A query Q is additive if for every two instances I1 and I2 such that adom(I1) and adom(I2) are disjoint, we have Q(I1 ∪ I2) = Q(I1) ∪ Q(I2).

A query class is a mapping that associates with every schema S a class of queries over S. An example of a query class is that of the conjunctive queries. A conjunctive query (CQ) is represented by a logical formula q(x) of the form

  ∃y [φ1(x, y, d) ∧ · · · ∧ φm(x, y, d)]

where x and y are disjoint sequences of variables, d is a sequence of constants, and each φi is an atomic query over S (i.e., a formula that consists of a single relation symbol and no logical operators). The result of applying the CQ Q = q(x) to the instance I consists of all the tuples a (of the same length as x) such that q(a) is true in I; this result is denoted by Q(I).

2.3 Classification Framework

We now present our classification framework. An entity schema is a triple (A, Σ, E), where (A, Σ) is a schema and E is a relation symbol in A that represents the entities (as tuples). An instance I over an entity schema (A, Σ, E) is simply an instance over (A, Σ). Intuitively, an instance I over an entity schema (A, Σ, E) represents a set of entities, namely E^I (i.e., the set of tuples in E), along with information about these entities that is contained in the remaining relations R^I. For example, E may be the relation Persons, and A may include, besides Persons, relations such as PersonAddress, PersonCompany, CompanyCity, and so on. If S = (A, Σ, E) is an entity schema, then the components A, Σ and E are denoted by A_S, Σ_S and E_S, respectively.

Let S be an entity schema, and let I be an instance over S. A feature query (over S) is a query π over the schema S such that π ⊆ E_S, where E_S is viewed as the query that, given an instance I, copies the relation E_S^I. In other words, a feature query is a query that selects entities. For example, if E is Persons(ssn, name), then a feature can be the following CQ q(s, n), selecting the persons who work in New York City:

  ∃c [Persons(s, n) ∧ PersonCompany(s, c) ∧ CompanyCity(c, 'NYC')]

If π is a feature query, then π^I denotes the function f : E_S^I → {−1, 1} where f(e) = 1 if e ∈ π(I), and f(e) = −1 otherwise.

A statistic (over S) is a sequence Π = (π1, . . . , πn) of feature queries. We denote by Π^I the function (π1^I, . . . , πn^I) from E_S^I to {−1, 1}^n. A feature schema is a pair (S, Π), where S is an entity schema and Π is a statistic over S; it produces a sequence of feature values for every entity of a given input instance. We say that Π and (S, Π) are in a query class Q if every query in Π belongs to Q. A training instance over S is a pair (I, o), where I is an instance over S and o : E_S^I → {−1, 1} is a function that partitions the entities into positive and negative examples. Given a feature schema (S, Π) and a classifier class Γ, the training instance (I, o) defines the training collection that consists of the pair ⟨Π^I(e), o(e)⟩ for every e ∈ E_S^I.
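The following Python sketch, again purely illustrative, mimics the running example. The encoding of an instance as a dictionary of tuple sets is an assumption, with Persons playing the role of the entity relation E; the feature query is a hand-coded version of the CQ above, and training_collection materializes the pairs ⟨Π^I(e), o(e)⟩.

```python
# A toy instance over the entity schema of the running example.
I = {
    "Persons":       {("111", "Alice"), ("222", "Bob")},
    "PersonCompany": {("111", "AcmeCo"), ("222", "BetaCo")},
    "CompanyCity":   {("AcmeCo", "NYC"), ("BetaCo", "SF")},
}

def works_in_nyc(instance):
    """The CQ q(s, n): persons employed by a company located in NYC
    (the existentially quantified variable c is joined away)."""
    return {(s, n)
            for (s, n) in instance["Persons"]
            for (s2, c) in instance["PersonCompany"] if s2 == s
            for (c2, city) in instance["CompanyCity"] if c2 == c and city == "NYC"}

def feature(pi, instance):
    """Turn a feature query pi into the {-1,1}-valued function pi^I on E^I."""
    selected = pi(instance)
    return lambda e: 1 if e in selected else -1

def training_collection(Pi, instance, o):
    """The training collection induced by the statistic Pi and labeling o:
    one pair <Pi^I(e), o(e)> per entity e in E^I."""
    fs = [feature(pi, instance) for pi in Pi]
    return [(tuple(f(e) for f in fs), o[e]) for e in instance["Persons"]]

o = {("111", "Alice"): 1, ("222", "Bob"): -1}
print(training_collection([works_in_nyc], I, o))
# e.g., [((1,), 1), ((-1,), -1)] (the order over entities is arbitrary)
```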
3 Feature Engineering

In feature engineering, one devises feature queries for a classification task. Next, we discuss two computational challenges that naturally arise in feature engineering.

3.1 Separability

Let (S, Π) be a feature schema, and let Γ be a classifier class. A training instance (I, o) is said to be Γ-separable with respect to (w.r.t.) Π if there exists a classifier γ ∈ Γ that fully agrees with o; that is, γ has the same arity as Π, and γ(Π^I(e)) = o(e) for every e ∈ E_S^I. We define the following core computational problem.

Problem 1 (Separability). Let S be an entity schema, let Q be a query class over S, and let Γ be a classifier class. The separability problem is the following: given a training instance (I, o) over S, determine whether there exists a statistic Π in Q such that (I, o) is Γ-separable w.r.t. Π.

The separability problem, as defined, can be extended in various practical directions. The input can include a bound N on the length n of the statistic Π (hence limiting the model complexity, which results in classifiers that are more efficient and less prone to overfitting). One can allow for approximate agreement with o (e.g., the classifier should agree with o on at least a (1 − ε)-fraction of the entities, or at most k examples may be misclassified). And one can impose various constraints on common query classes Q (e.g., limit the size of the queries, the number of constants, etc., again to curb the model complexity and the potential for overfitting).

The following theorem considers the complexity of testing for separability in the case where the class of queries is that of CQs without constants, which we denote by CQnc. (For CQs with constants the problem is trivial and not interesting, since the positive examples can be hardcoded into the statistic.) It states that, in the absence of the above extensions, the problem can quickly become intractable.

Theorem 1. Let Q be the class CQnc, and let Γ be the class Lin. For every entity schema S, separability is in NP. Moreover, there exists an entity schema S such that separability is NP-complete.

The proof of membership in NP uses the concept of a canonical database [1], and the proof of NP-hardness is by a reduction from the maximum-clique problem. We note that a problem similar to separability has been studied in a recent paper by Cohen and Weiss [2], where data are labeled graphs and features are tree patterns.

3.2 Statistic Identifiability

We denote by 0_m the vector of m zeroes. Let M be an n × k real matrix. A linear column dependence in M is a weight vector w ∈ R^k such that w ≠ 0_k and M · w = 0_n; if M does not have any linear column dependence, then we say that M is linearly column independent. Let (S, Π) be a feature schema, and let I be an instance of S. We fix an arbitrary order over the entities in E_S^I, and denote by ⟦Π^I⟧ the matrix that consists of the rows Π^I(e) for every e ∈ E_S^I, in that order. The second computational problem we define is the following.

Problem 2 (Identifiability). Let Q be a query class. Identifiability is the problem of testing, given a feature schema (S, Π) in Q, whether there exists an instance I over S such that the matrix ⟦Π^I⟧ is linearly column independent; in that case, we say that Π is identifiable.

Identifiability is an important property in the design of machine-learning solutions [5]. In particular, in the case of the classifier class Lin and the cost function lsq, this property implies that there is a single optimal classifier, whereas its absence implies that the space of optimal solutions is unbounded.
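Once a statistic Π and an instance I are fixed, both of the problems above reduce to elementary linear algebra, as the following two sketches illustrate. First, the agreement condition of Section 3.1 for Γ = Lin is a linear feasibility question over the induced training collection: a weight vector w must satisfy x · w ≥ 0 on the positive examples and x · w < 0 on the negative ones. A minimal sketch, assuming scipy is available; the strict inequality is replaced by a margin of 1, which is without loss of generality because positively rescaling w preserves the sign constraints.

```python
import numpy as np
from scipy.optimize import linprog

def is_lin_separable(T):
    """Test whether a training collection T, a list of pairs (x, y) with
    x in {-1,1}^n and y in {-1,1}, is Lin-separable: does some w satisfy
    Lambda_w(x) = y for every pair in T?"""
    if not T:
        return True
    A, b = [], []
    for x, y in T:
        x = np.asarray(x, dtype=float)
        if y == 1:
            A.append(-x)   # x . w >= 0  rewritten as  -x . w <= 0
            b.append(0.0)
        else:
            A.append(x)    # x . w < 0, enforced with a margin: x . w <= -1
            b.append(-1.0)
    n = len(A[0])
    res = linprog(c=np.zeros(n), A_ub=np.array(A), b_ub=np.array(b),
                  bounds=[(None, None)] * n, method="highs")
    return res.status == 0  # LP feasible iff T is Lin-separable
```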
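Second, for a concrete instance I, linear column independence of ⟦Π^I⟧ is simply a full-column-rank condition. Note the gap between this check and Problem 2: identifiability asks whether some instance yields an independent matrix, whereas the sketch below (assuming numpy) tests one given matrix.

```python
import numpy as np

def is_column_independent(M):
    """Test whether M (rows = entities, columns = feature queries) has no
    nonzero w with M @ w = 0, i.e., whether M has full column rank."""
    M = np.asarray(M, dtype=float)
    return np.linalg.matrix_rank(M) == M.shape[1]

# Two identical feature columns make [[Pi^I]] column dependent:
print(is_column_independent([[1, 1], [-1, -1]]))   # False
print(is_column_independent([[1, -1], [-1, -1]]))  # True
```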
Next, we show that in the case of CQs, identifiability amounts to query equivalence. A statistic Π is said to have redundancy if it contains two distinct feature queries that are equivalent.

Theorem 2. Let Q be the query class of additive CQs, and let (S, Π) be a feature schema such that Π is in Q. Then Π is identifiable if and only if Π has no redundancy.

We conclude with comments on Theorem 2. First, this theorem is again proved by applying the concept of a canonical database of a CQ. Second, we can extend the theorem to the class of all CQs, but the condition that characterizes identifiability is significantly more complicated (and will be given in the extended version of this paper). Third, the theorem generalizes to affine independence, which is important for different cost functions such as maximum entropy [6]. Finally, by an immediate application of the NP-completeness of CQ containment [1], we get that identifiability is NP-complete in the case of additive CQs.

Acknowledgments

Benny Kimelfeld is a Taub Fellow, supported by the Taub Foundation. The research of Christopher Ré is supported by DARPA's projects XDATA (FA8750-12-2-0335), DEFT (FA8750-13-2-0039), MEMEX, and SIMPLEX. His research is also supported by NSF CAREER Award IIS-1353606, ONR Awards N000141210041 and N000141310129, NIH Grant U54EB020405 (awarded by NIBIB through funds provided by the trans-NIH BD2K initiative), the Sloan Research Fellowship, the Moore Foundation, American Family Insurance, Google, and Toshiba. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DARPA, AFRL, NSF, ONR, NIH, or the U.S. government.

References

1. A. K. Chandra and P. M. Merlin. Optimal implementation of conjunctive queries in relational data bases. In J. E. Hopcroft, E. P. Friedman, and M. A. Harrison, editors, STOC, pages 77–90. ACM, 1977.
2. S. Cohen and Y. Y. Weiss. Learning tree patterns from example graphs. In M. Arenas and M. Ugarte, editors, 18th International Conference on Database Theory (ICDT 2015), volume 31 of Leibniz International Proceedings in Informatics (LIPIcs), pages 127–143, Dagstuhl, Germany, 2015. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik.
3. I. Guyon, S. Gunn, M. Nikravesh, and L. A. Zadeh. Feature Extraction: Foundations and Applications (Studies in Fuzziness and Soft Computing). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
4. S. Kandel, A. Paepcke, J. M. Hellerstein, and J. Heer. Enterprise data analysis and visualization: An interview study. IEEE Trans. Vis. Comput. Graph., 18(12):2917–2926, 2012.
5. E. L. Lehmann and G. Casella. Theory of Point Estimation, volume 31. Springer, 1998.
6. M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305, 2008.
7. C. Zhang, A. Kumar, and C. Ré. Materialization optimizations for feature selection workloads. In SIGMOD Conference, pages 265–276, 2014.