Teaching the Old Dog New Tricks: Supervised Learning with Constraints

Fabrizio Detassis¹,², Michele Lombardi¹, and Michela Milano¹

¹ DISI, University of Bologna, Italy, email: {fabrizio.detassis2, michele.lombardi2, michela.milano}@unibo.it
² Corresponding author

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. Methods for taking into account external knowledge in Machine Learning models have the potential to address outstanding issues in data-driven AI methods, such as improving safety and fairness, and can simplify training in the presence of scarce data. We propose a simple but effective method for injecting constraints at training time in supervised learning, based on decomposition and bi-level optimization: a master step is in charge of enforcing the constraints, while a learner step takes care of training the model. The process leads to approximate constraint satisfaction. The method is applicable to any ML approach for which the concept of label (or target) is well defined (most regression and classification scenarios), and allows reusing existing training algorithms with no modifications. We require no assumption on the constraints, although their properties affect the shape and complexity of the master problem. Convergence guarantees are hard to provide, but we found that the approach performs well on ML tasks with fairness constraints and on classical datasets with synthetic constraints.

1 Introduction

Techniques to deal with constraints in Machine Learning (ML) have the potential to address outstanding issues in data-driven AI methods. Constraints representing (e.g.) physical laws can be employed to improve generalization; constraints may encode negative patterns (e.g. excluded classes) and relational information (e.g. involving multiple examples); constraints can ensure the satisfaction of desired properties, such as fairness, safety, or lawfulness; they can even be used to extract symbolic information from data.

Such techniques can greatly improve the learning process in terms of versatility and control. As a consequence, our work constitutes a step forward in the direction of a more human-oriented AI, as it is general and allows for the realization of learning models that satisfy specific user requirements. This behaviour is fundamental in contexts where there exist human-specific constraints that the trained AI model has to acknowledge. The present paper aims at enlarging the pool of tools that can be used to inject external knowledge into an AI model, a field that is rapidly expanding thanks to its many potential practical applications.

To the best of the authors' knowledge, the vast majority of approaches for taking into account external knowledge in ML make assumptions that restrict the type of constraints (e.g. differentiability, no relational information) or the type of models (e.g. only Decision Trees, only differentiable approaches), and often require modifications in the employed training algorithms (e.g. specialized loss terms).

We propose a decomposition-based method, referred to as Moving Targets, to augment supervised learning with constraints. A master step "teaches" constraint satisfaction to a learning algorithm by iteratively adjusting the sample labels. The master and learner have no direct knowledge of each other, meaning that: 1) any ML method can be used for the learner, with no modifications; 2) the master can be defined via techniques such as Mathematical or Constraint Programming, to support discrete values or non-differentiable constraints. Our method is also well suited to deal with relational constraints over large populations (e.g. fairness indicators). Moving Targets subsumes the few existing techniques – such as the one by [13] – capable of offering the same degree of versatility.

When constraints conflict with the data, the approach prioritizes constraint satisfaction over accuracy. For this reason, it is not well suited to deal with fuzzy information. Moreover, due to our open setting, it is hard to determine convergence properties. Despite this, we found that the approach performs well (compared to state-of-the-art methods) on classification and regression tasks with fairness constraints, and on classification problems with balance constraints. Due to its combination of simplicity, generality, and the observed empirical performance, Moving Targets can represent a valuable addition to the arsenal of techniques for dealing with constraints in Machine Learning.
The paper is organized as follows: in Section 2 we briefly survey related works on the integration of constraints in ML; in Section 3 we present our method and in Section 4 our empirical evaluation. Concluding remarks are in Section 5.

2 Related Works

Here we provide an overview of representative approaches for integrating constraints in ML, and we discuss their differences with our method.

Most approaches in the literature build over just a few key ideas. One of them is using the constraints to adjust the output of a trained ML model. This is done in DeepProbLog [20], where Neural Networks with probabilistic output (mostly classifiers) are treated as predicates. [25] presents a Neural Theorem Prover using differentiable predicates and the Prolog backward chaining algorithm. The original Markov Logic Networks [24] rely instead on Markov Fields defined over First Order Logic formulas. As a drawback, with these approaches the constraints have no effect on the model parameters, which complicates the analysis of feature importances. Moreover, dealing with relational constraints (e.g. fairness) requires access at prediction time either to a representative population or to its distribution [12, 11].

A second group of approaches operates by using constraint-based expressions as regularization terms at training time. In Semantic Based Regularization [7] constraints are expressed as fuzzy logical formulas over differentiable predicates. Logic Tensor Networks [26] focus on Neural Networks and replace the entire loss function with a fuzzy formula. Differentiable Reasoning [27] uses relational background knowledge in a similar fashion to benefit from unlabeled data. In the context of fairness constraints, this approach has been taken in [1, 10, 28, 5, 15]. Methods in this class account for the constraints by adjusting the model parameters, and can therefore be used to analyze feature importance.
They can deal with relational constraints without additional examples at prediction time; however, they ideally require simultaneous access at training time to all the examples linked by relational constraints (which can be problematic when using mini-batches). They often require properties on the constraints (e.g. differentiability), which may force approximations; they may also be susceptible to numerical issues.

A third idea consists in enforcing constraint satisfaction in the data via a pre-processing step. This is proposed in the context of fairness constraints by [13, 14, 18]. The approach enables the use of standard ML methods with no modification, and can deal with relational constraints on large sets of examples. As a main drawback, bias in the model or the training algorithm may prevent getting close to the pre-processed labels.

Multiple ideas can be combined: domain knowledge has been introduced in differentiable Machine Learning (e.g. Deep Networks) by designing their structure, rather than the loss function; examples include Deep Structured Models in [16] and [19]. These approaches can use constraints to support both training and inference.

Though less related to our approach, constraints can be used to extract symbolic knowledge from data, for example by allowing the training algorithm to adjust the regularizer weights. This approach is considered (e.g.) in [17, 21, 6].

Our approach is closely related to the idea of enforcing constraints by altering the data, and shares the same advantages (versatility, support for relational constraints and feature importance analysis, no differentiability assumptions). We counter the main drawbacks mentioned above by using an iterative algorithm rather than a single pre-processing step.

3 Moving Targets

In this section we present our method, discuss its properties, draw connections with related algorithms, and provide some convergence considerations.

Algorithm 1  Moving Targets
  input: label vector y*, scalar parameters α, β, n
  y¹ = l(y*)                  # pretraining
  for k = 1..n do
    if y^k ∉ C then
      z^k = mα(y^k)           # infeasible master step
    else
      z^k = mβ(y^k)           # feasible master step
    end if
    y^(k+1) = l(z^k)          # learner step
  end for
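For concreteness, the following is a minimal Python sketch of the loop in Algorithm 1 (our illustration, not code from the paper's repository). The learner, master_alpha, master_beta, and is_feasible callables stand for the problem-specific components defined in the rest of this section.

    def moving_targets(y_star, learner, master_alpha, master_beta,
                       is_feasible, n_iters=15):
        # Sketch of Algorithm 1: learner(z) trains a model on labels z and
        # returns its predictions on the training set; master_alpha(y) and
        # master_beta(y) solve problems (4) and (5) below; is_feasible(y)
        # checks membership in the feasible set C.
        y = learner(y_star)              # pretraining: y1 = l(y*)
        for _ in range(n_iters):
            if not is_feasible(y):
                z = master_alpha(y)      # infeasible master step
            else:
                z = master_beta(y)       # feasible master step
            y = learner(z)               # learner step on the adjusted labels
        return y

Since the loop only exchanges label and prediction vectors, any ML library can back the learner callable; this is the basis of the method's model-agnosticism.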
The Algorithm. Our goal is to adjust the parameters of a ML model so as to minimize a loss function with clearly defined labels, under a set of generic constraints. We acknowledge that any constrained learning problem must trade prediction mistakes for a better level of constraint satisfaction, and we attempt to control this process by carefully selecting which mistakes should be made. This is similar to [13, 14, 18], but: 1) we consider generic constraints rather than focusing on fairness; and 2) we rely on an iterative process (which alternates "master" and "learner" steps) to improve the results.

Let L(y, y*) be the loss function, where y is the prediction vector and y* is the label vector. We make the (non-restrictive) assumption that the loss is a pre-metric, i.e. L(y, y*) ≥ 0 and L(y, y*) = 0 iff y = y*. Examples of how to treat common loss functions can be found in Table 1.

Table 1. Notable loss functions (m = #examples, c = #classes).

  Loss Function       | Expression                                                       | Label Space
  Mean Squared Error  | \frac{1}{m} \|y - y^*\|_2^2                                      | \mathbb{R}^m
  Hamming Distance    | \frac{1}{m} \sum_{i=1}^{m} \mathbb{I}[y_i \neq y^*_i]            | \{1..c\}^m
  Cross Entropy       | -\frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{c} y^*_{ij} \log y_{ij}  | [0, 1]^{m \times c}

We then want to solve, in an exact or approximate fashion, the following constrained optimization problem:

  \arg\min_\theta \{ L(y, y^*) \mid y = f(X; \theta),\ y \in C \}        (1)

where f is the ML model and θ its parameter vector. With some abuse of notation, we refer as f(X; θ) to the vector of predictions for the examples in the training set X. Since the model input at training time is known, constraints can be represented as a feasible set C for the sole predictions y.

The problem can be rewritten without loss of generality by introducing a second set B corresponding to the ML model bias. This leads to a formulation in pure label space:

  \arg\min_y \{ L(y, y^*) \mid y \in B \cap C \}        (2)

where B = {y | ∃θ, y = f(X; θ)}. The Moving Targets method is described in Algorithm 1, and starts with a learner step w.r.t. the original label vector y* (pretraining). Each learner step, given a label vector z as input, solves approximately or exactly the problem:

  l(z) = \arg\min_y \{ L(y, z) \mid y \in B \}        (3)

Note that this is a traditional unconstrained learning problem, since B is just the model/algorithm bias. The result of the first learner step gives an initial vector of predictions y¹. Next comes a master step to adjust the label vector: this can take two forms, depending on the current predictions. In case of an infeasibility, i.e. y^k ∉ C, we solve:

  m_\alpha(y) = \arg\min_z \{ L(z, y^*) + \tfrac{1}{\alpha} L(z, y) \mid z \in C \}        (4)

Intuitively, we try to find a feasible label vector z that is close (in terms of loss function value) to both the original labels y* and the current prediction y. A parameter α ∈ (0, ∞) controls which of the two should be preferred. If the input vector is feasible, we instead solve:

  m_\beta(y) = \arg\min_z \{ L(z, y^*) \mid L(z, y) \le \beta,\ z \in C \}        (5)

i.e. we look for a feasible label vector z that is 1) not too far from the current predictions (in the ball defined by L(z, y) ≤ β) and 2) closer (in terms of loss) to the true labels y*. Here, we are seeking an accuracy improvement.

We then make a learner step trying to reach the adjusted labels; the new predictions will be adjusted at the next iteration, and so on. In case of convergence, the predictions y^k and the adjusted labels z^k become stationary (but not necessarily identical). An example run, for a Mean Squared Error loss and convex constraints and bias, is in Figure 1.

[Figure 1. A sample run of our algorithm: the iterates y¹, z²/y³, z³, ... move between the bias set B and the feasible set C, starting from y*.]
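To make Equations (4) and (5) concrete, here is a sketch of both master steps for an MSE loss and a hypothetical convex feasible set C = {z : mean(z) ≤ τ}, written with CVXPY (the library used later for the RLR baseline); the constraint is chosen purely for illustration.

    import cvxpy as cp

    def m_alpha(y_pred, y_star, alpha, tau):
        # Master step (4) for an MSE loss: find z in C close to both y* and y.
        m = len(y_star)
        z = cp.Variable(m)
        loss = (cp.sum_squares(z - y_star) / m
                + cp.sum_squares(z - y_pred) / (alpha * m))
        prob = cp.Problem(cp.Minimize(loss), [cp.sum(z) / m <= tau])
        prob.solve()
        return z.value

    def m_beta(y_pred, y_star, beta, tau):
        # Master step (5): move toward y* within the ball L(z, y) <= beta.
        m = len(y_star)
        z = cp.Variable(m)
        constraints = [cp.sum(z) / m <= tau,                    # z in C
                       cp.sum_squares(z - y_pred) / m <= beta]  # ball around y
        prob = cp.Problem(cp.Minimize(cp.sum_squares(z - y_star) / m),
                          constraints)
        prob.solve()
        return z.value

For this convex case both problems are solved exactly; for the discrete master problems of Section 4 a MIP solver plays the same role.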
Properties. The learner is not directly affected by the constraints, thus enabling the use of arbitrary ML approaches. The master problems do not depend on the ML model, often leading to clean structures that are easier to solve. Since we make no explicit use of mini-batches, we can deal well with relational constraints on large groups (e.g. fairness). The master step can be addressed via any suitable solver, so that discrete variables and non-differentiable constraints can be tackled via (e.g.) Mathematical Programming, Constraint Programming, or SAT Modulo Theories. Depending on the constraints, the loss functions, and the label space (e.g. numeric vs discrete), the master problems may be NP-hard. Even in this case, their clean structure may allow for exact solutions for datasets of practical size. Moreover, for separable loss functions (e.g. all those from Table 1), the master problems can be defined over only the constrained examples, with a possibly significant size reduction. If scalability is still a concern, the master step can be solved in an approximate fashion: this may lead to a lower accuracy and a higher level of constraint violation; however, such issues are partly inevitable, due to algorithm/model bias, and since constraint satisfaction on the training data does not necessarily transfer to unseen examples.

Analysis and Convergence. Due to its open nature and minimal assumptions, establishing the convergence of our method is hard. Here we provide some considerations and connect the approach to existing algorithms. We will make the simplifying assumption that the learner problem from Equation (3) can be solved exactly. This holds for a few cases (e.g. convex ML models trained to close optimality), but in general the assumption will not be strictly satisfied.

We start by observing that Equation (2) is simply the Best Approximation Problem (BAP), which involves finding a point in the intersection of two sets (say B and C) that is as close as possible to a reference point (say y*). For convex sets in Hilbert spaces, the BAP can be solved optimally via Douglas-Rachford splits or the method from [2], both relying on projection operators. Since our learner is essentially a projection operator on B, it would seem sensible to apply these methods in our setting. Unfortunately: 1) we cannot reliably assume convexity; and 2) in a discrete space, the Douglas-Rachford splits may lead to "label" vectors that are meaningless for the learner.

We therefore chose a design that is less elegant, but also less sensitive to which assumptions are valid. In particular: 1) in the mα steps, we use a modification of a suboptimal BAP algorithm to find a feasible prediction vector; then 2) in mβ we apply a modified proximal operator to improve its distance (in terms of loss) w.r.t. the original labels y*. The basis for our mα step is the Alternating Projections (AP) method, discussed e.g. in [4]. The AP simply alternates projection operators on the two sets, which never generates vectors outside of the label space.

Indeed, for α → 0 our mα step becomes a projection of the predictions y^k on C. With this setup we recover the AP behavior, and its convergence to a feasible point for convex sets. For α → ∞ we obtain, essentially, the pre-processing method from [13]: mα becomes a projection of the true labels y* on C, and subsequent iterations have no further effect; convergence to a feasible point is achieved only if the pre-processed labels are in B, which may not be the case (e.g. a quadratic distribution for a linear model). For reasonable α values, our mα step balances the distance (loss) from both the predictions y and the targets y*. Convergence in this case is an open question, but especially in a non-convex or discrete setting (where multiple projections may exist) this modification helps to escape local minima and accelerates progress.

When a feasible prediction vector is obtained, our method switches to the mβ step; we then search for a point in C that is closer to the true labels, but also not too far from the predictions. This is related to the Proximal Gradient method, discussed e.g. in [22], but we limit the distance via a constraint rather than a squared norm, and we search in a ball rather than on a line. As in the proximal gradient, a too large search radius prevents convergence: for β → ∞ the mβ step always returns the same adjusted labels, corresponding to the projection of y* on the set C.
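As a toy illustration of the α → 0 limit (pure Alternating Projections), consider two convex sets in the plane: a hypothetical "bias" set B (the line p₂ = p₁) and a "constraint" set C (the half-plane p₂ ≥ 1). The example is ours and not from the paper:

    import numpy as np

    def project_B(p):  # projection on the line B = {(a, a)}
        a = p.mean()
        return np.array([a, a])

    def project_C(p):  # projection on the half-plane C = {p: p[1] >= 1}
        return np.array([p[0], max(p[1], 1.0)])

    p = np.array([3.0, -1.0])        # reference point y*
    for _ in range(50):
        p = project_B(project_C(p))  # alternate the two projections
    print(p)                         # -> [2. 2.], a point in B ∩ C

Starting from y* = (3, −1) the iterates converge to (2, 2) ∈ B ∩ C, while the closest feasible point to y* is (1, 1): AP finds a feasible point, not necessarily the best approximation, which is what the mα and mβ modifications aim to improve.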
4 Empirical Evaluation

Our experimentation is designed around a few main questions: 1) How does the Moving Targets method work on a variety of constraints, tasks, and datasets? 2) What is the effect of the α, β parameters? 3) How does the approach scale? 4) How different is the behavior with different ML models? 5) How does the method compare with alternative approaches? We proceed to describe our setup and the experiments we performed to address such questions.

Tasks and Constraints. We experiment on three case studies, covering two types of ML tasks and two types of constraints. First, we consider a classification problem augmented with a balance constraint, which forces the distribution over the classes to be approximately uniform. The loss function is given by the Hamming distance and the label space is {1..c}^m. The mα(y) problem is defined as a Mixed Integer Linear Program with binary variables z_ij such that z_ij = 1 iff the adjusted class for the i-th example is j. Formally:

  \min \; \frac{1}{m} \sum_{i=1}^{m} (1 - z_{i,y^*_i}) + \frac{1}{\alpha m} \sum_{i=1}^{m} (1 - z_{i,y_i})        (6)

  \text{s.t.} \; \sum_{j=1}^{c} z_{ij} = 1 \quad \forall i = 1..m        (7)

  \sum_{i=1}^{m} z_{ij} \le \frac{(1+\xi)\,m}{c} \quad \forall j = 1..c        (8)

  z_{ij} \in \{0, 1\} \quad \forall i = 1..m,\ j = 1..c        (9)

The summations in Equation (6) encode the Hamming distance w.r.t. the true labels y* and the predictions y. Equation (7) prevents assigning two classes to the same example. Equation (8) requires an equal count for each class, with tolerance defined by ξ (ξ = 0.05 in all our experiments); the balance constraint is stated in exact form, thanks to the discrete labels. The mα formulation generalizes the knapsack problem and is hence NP-hard; since all examples appear in Equation (8), no problem size reduction is possible. The mβ problem can be derived from mα by changing the objective function and by adding the ball constraint as in Equation (5).

Our second use case is a classification problem with realistic fairness constraints, based on the DIDI indicator from [1]:

  \mathrm{DIDI}_c(X, y) = \sum_{k \in K} \sum_{v \in D_k} \sum_{j=1}^{c} d_{kvj}        (10)

  d_{kvj} = \Big| \frac{1}{m} \sum_{i=1}^{m} \mathbb{I}[y_i = j] - \frac{1}{|X_{k,v}|} \sum_{i \in X_{k,v}} \mathbb{I}[y_i = j] \Big|

where K contains the indices of "protected features" (e.g. ethnicity, gender, etc.), D_k is the set of possible values for the k-th feature, and X_{k,v} is the set of examples having value v for the k-th feature. The mα(y) problem can be defined via the following Mathematical Program:

  \min \; \frac{1}{m} \sum_{i=1}^{m} (1 - z_{i,y^*_i}) + \frac{1}{\alpha m} \sum_{i=1}^{m} (1 - z_{i,y_i})        (11)

subject to Equation (7) and:

  \sum_{k \in K} \sum_{v \in D_k} \sum_{j=1}^{c} d_{kvj} \le \epsilon        (12)

  d_{kvj} \ge \frac{1}{m} \sum_{i=1}^{m} z_{ij} - \frac{1}{|X_{k,v}|} \sum_{i \in X_{k,v}} z_{ij} \quad \forall k, v, j        (13)

  d_{kvj} \ge -\frac{1}{m} \sum_{i=1}^{m} z_{ij} + \frac{1}{|X_{k,v}|} \sum_{i \in X_{k,v}} z_{ij} \quad \forall k, v, j        (14)

  z_{ij} \in \{0, 1\} \quad \forall i = 1..m,\ j = 1..c        (15)

where Equation (12) is the constraint on the DIDI value and Equations (13)-(14) linearize the absolute values in its definition. The DIDI scales with the number of examples and has an intrinsic value due to the discrimination in the data. Therefore, we compute DIDI_tr for the training set, then in our experiments we have ε = 0.2 DIDI_tr. This is again an NP-hard problem defined over all training examples. The mβ formulation can be derived as in the previous case.

Our third case study is a regression problem with fairness constraints, based on a specialized DIDI version from [1]:

  \mathrm{DIDI}_r(X, y) = \sum_{k \in K} \sum_{v \in D_k} d_{kv}        (16)

  d_{kv} = \Big| \frac{1}{m} \sum_{i=1}^{m} y_i - \frac{1}{|X_{k,v}|} \sum_{i \in X_{k,v}} y_i \Big|        (17)

In this case, we use the Mean Squared Error (MSE) as a loss function, and the label space is \mathbb{R}^m. The mα problem can be defined via the following Mathematical Program:

  \min \; \frac{1}{m} \sum_{i=1}^{m} (y^*_i - z_i)^2 + \frac{1}{\alpha m} \sum_{i=1}^{m} (z_i - y_i)^2        (18)

  \text{s.t.} \; \sum_{k \in K} \sum_{v \in D_k} d_{kv} \le \epsilon        (19)

  d_{kv} \ge \frac{1}{m} \sum_{i=1}^{m} z_i - \frac{1}{|X_{k,v}|} \sum_{i \in X_{k,v}} z_i \quad \forall k, v        (20)

  d_{kv} \ge -\frac{1}{m} \sum_{i=1}^{m} z_i + \frac{1}{|X_{k,v}|} \sum_{i \in X_{k,v}} z_i \quad \forall k, v        (21)

  z_i \in \mathbb{R} \quad \forall i = 1..m        (22)

This is a linearly constrained, convex, Quadratic Programming problem that can be solved (unlike our classification examples) in polynomial time. The mβ problem can be derived as in the previous cases: while still convex, mβ is in this case a quadratically constrained problem.
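As an illustration, the balance master problem from Equations (6)-(9) can be sketched as follows; PuLP is used here as an open-source stand-in for the Cplex solver employed in our experiments, and the function name is ours.

    import pulp

    def m_alpha_balance(y_star, y_pred, n_classes, alpha, xi=0.05):
        # Sketch of the MILP (6)-(9). y_star and y_pred are integer class
        # vectors in {0..n_classes-1}.
        m = len(y_star)
        prob = pulp.LpProblem("m_alpha", pulp.LpMinimize)
        z = pulp.LpVariable.dicts("z", (range(m), range(n_classes)),
                                  cat="Binary")
        # objective (6): Hamming distances from the labels y* and predictions y
        prob += ((1.0 / m) * pulp.lpSum(1 - z[i][y_star[i]] for i in range(m))
                 + (1.0 / (alpha * m))
                 * pulp.lpSum(1 - z[i][y_pred[i]] for i in range(m)))
        for i in range(m):  # (7): exactly one class per example
            prob += pulp.lpSum(z[i][j] for j in range(n_classes)) == 1
        for j in range(n_classes):  # (8): approximately uniform class counts
            prob += (pulp.lpSum(z[i][j] for i in range(m))
                     <= (1 + xi) * m / n_classes)
        prob.solve()
        # read back the adjusted class of each example
        return [max(range(n_classes), key=lambda j: z[i][j].value())
                for i in range(m)]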
Datasets, Preparation, and General Setup. We test our method on seven datasets from the UCI Machine Learning repository [9], namely iris (150 examples), redwine (1,599), crime (2,215), whitewine (4,898), adult (32,561), shuttle (43,500), and dota2 (92,650). We use adult for the classification/fairness case study, crime for regression/fairness, and the remaining datasets for the classification/balance case study.

For each experiment, we perform a 5-fold cross validation (with a fixed seed for random reshuffling) to account for noise due to sampling and in the training process. Hence, the training set for each fold will include 80% of the data. All our experiments are run on an Intel Core i7 laptop with 16GB RAM, and we use Cplex 12.8 to solve the master problems. Our code and datasets are publicly available³.

All the datasets for the classification/balance case study are prepared by standardizing all input features (on the training folds) to have zero mean and unit variance. The iris and dota2 datasets are very balanced (the constraint is easily satisfied), while the remaining datasets are quite unbalanced. In the adult (also known as "Census Income") dataset the target is "income" and the protected attribute is "race". We remove the features "education" (duplicated) and "native country" and use a one-hot encoding on all categorical features. Features are normalized between 0 and 1. Our crime dataset is the "Communities and Crime Unnormalized" table. The target is "violentPerPop" and the protected feature is "race". We remove features that are empty almost everywhere and features trivially related to the target ("murders", "robberies", etc.). Features are normalized between 0 and 1 and we select the top 15 ones according to the SelectKBest method of scikit-learn (excluding "race"). The protected feature is then reintroduced.

³ Git repository publicly available.
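Given the protected groups just described ("race" in both adult and crime), the DIDI indicators from Equations (10) and (16)-(17) translate directly into NumPy; in this sketch the grouping into the sets X_{k,v} is assumed to be precomputed.

    import numpy as np

    def didi_r(y, groups):
        # Regression DIDI, Equations (16)-(17): `groups` is a list of index
        # arrays, one per value v of each protected feature k (the X_{k,v}).
        return sum(abs(y.mean() - y[g].mean()) for g in groups if len(g) > 0)

    def didi_c(y, groups, n_classes):
        # Classification DIDI, Equation (10): compare per-class frequencies.
        total = 0.0
        for g in groups:
            if len(g) == 0:
                continue
            for j in range(n_classes):
                total += abs(np.mean(y == j) - np.mean(y[g] == j))
        return total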
Parameter tuning. We know that extreme choices for α and β can dramatically alter the method behavior, but not what effect can be expected for more reasonable values. With this aim, we perform an investigation by running the algorithm for 15 iterations (used in all experiments), with different α and β values. As a ML model, we use a fully-connected, feed-forward Neural Network (NN) with two hidden layers of 32 Rectified Linear Units each. The last layer has either a SoftMax activation (for classification) or a Linear one (for regression). The loss function is respectively the categorical cross-entropy or the MSE. The network is trained with 100 epochs of RMSProp in Keras/Tensorflow 2.0 (default parameters and batch size 64).
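A sketch of the network just described, in Keras/Tensorflow 2.0 (classification variant; for regression, the head would be a single linear unit trained with the MSE):

    from tensorflow import keras

    def build_net(n_inputs, n_classes):
        # Fully-connected NN: two hidden layers of 32 ReLUs, SoftMax head.
        model = keras.Sequential([
            keras.layers.Dense(32, activation="relu",
                               input_shape=(n_inputs,)),
            keras.layers.Dense(32, activation="relu"),
            keras.layers.Dense(n_classes, activation="softmax"),
        ])
        model.compile(optimizer="rmsprop", loss="categorical_crossentropy")
        return model

    # one learner step: a fresh network is fit on the current adjusted labels z
    # build_net(X.shape[1], n_classes).fit(X, z, epochs=100, batch_size=64)

Note that a fresh network is built for each learner step, consistent with the weight reset mentioned below.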
The results are in Table 2. We report a score (row S, higher is better) and a level of constraint violation (row C, lower is better). The S score is the accuracy for classification and the R² coefficient for regression. For the balance constraint, the C score is the standard deviation of the class frequencies; in the fairness case studies, we use the ratio between the DIDI of the predictions and that of the training data. Each cell reports mean and standard deviation for the 5 runs. Near-feasible values are marked with a *; accuracy comparisons are fair only for similar constraint violation scores. We consider a solution as near feasible when the associated constraint score lies within one standard deviation from the imposed score.

All columns labeled with α and β values refer to our method with the corresponding parameters. We explore different values of β (for a fixed α = 1), corresponding to different ball radii in the mβ step; we also explore different values of α, corresponding to a behavior of the mα step progressively closer to that of the Alternating Projections method. The ideal case column refers to a simple projection of the true labels y* on the feasible space C: this corresponds (on the training data) to the best possible accuracy that can be achieved if the constraints are fully satisfied. The Ptr column reports the results of the pretraining step, as defined in Algorithm 1.

Table 2. Effect of parameters α and β on different datasets.

  NN        |   | Ptr         | α=1, β=.01   | α=1, β=.05     | α=1, β=.1      | α=.1, β=.01    | α=0+, β=.1   | Ideal case
  Iris      | S | .970 ± .002 | .99 ± .01    | .997 ± .004    | .997 ± .004    | .99 ± .02      | .995 ± .008  | .9968 ± .0004
            | C | .016 ± .001 | .014* ± .004 | .013* ± .004   | .013* ± .004   | .015* ± .005   | .013* ± .004 | .013 ± .004
  Redwine   | S | .709 ± .005 | .508 ± .006  | .511 ± .009    | .506 ± .006    | .484 ± .007    | .50 ± .01    | .525 ± .002
            | C | .200 ± .001 | .019* ± .001 | .0186* ± .0005 | .0184* ± .0008 | .0188* ± .0004 | .019* ± .001 | .019 ± 0
  Whitewine | S | .644 ± .002 | .446 ± .006  | .437 ± .009    | .439 ± .009    | .40 ± .02      | .401 ± .009  | .524 ± .002
            | C | .189 ± .003 | .015* ± .002 | .013* ± .004   | .015* ± .003   | .014* ± .004   | .014* ± .004 | .015 ± .002
  Shuttle   | S | .999 ± 0    | .39 ± .04    | .37 ± .01      | .375 ± .007    | .37 ± .03      | .37 ± .03    | .3608 ± .0008
            | C | .267 ± 0    | .045* ± .03  | .029 ± .003    | .028 ± .006    | .05 ± .03      | .04 ± .04    | .017 ± 0
  Dota2     | S | .686 ± .002 | .666 ± .007  | .661 ± .002    | .66 ± .01      | .672 ± .004    | .656 ± .006  | .9984 ± .0009
            | C | .070 ± .008 | .04* ± .03   | .04* ± .03     | .09 ± .06      | .023 ± .006    | .07 ± .03    | .025 ± 0
  Adult     | S | .867 ± .001 | .818 ± .005  | .86 ± .02      | .841 ± .006    | .852 ± .004    | .84 ± .02    | .992 ± .0005
            | C | 1 ± .03     | .18* ± .04   | .16* ± .02     | .22* ± .07     | .22* ± .03     | .22* ± .04   | .200 ± .0005
  Crime     | S | .56 ± .02   | .49 ± .01    | .46 ± .04      | .48 ± .03      | .45 ± .05      | .46 ± .06    | .910 ± .007
            | C | .97 ± .03   | .22* ± .07   | .16* ± .08     | .2* ± .1       | .19* ± .02     | .21* ± .04   | .2 ± 0

Results are to be read as follows: the ideal case constitutes the upper bound to the performance of a constrained learner; it exactly matches the constraint threshold while minimizing the loss function. On the other hand, Ptr represents the constraint-agnostic behavior: it retains good performances while not necessarily satisfying the constraints. Our method lies in between the two extreme cases.

The Moving Targets algorithm can significantly improve the satisfaction of non-trivial constraints: this is evident for the (very) unbalanced datasets redwine, whitewine, and shuttle and all fairness use cases, for which feasible (or close) results are almost always obtained. Satisfying very tight constraints (e.g. in the unbalanced datasets) generally comes at a significant cost in terms of accuracy. When the constraints are less demanding, however, we often observe accuracy improvements w.r.t. the pretraining results (even substantial ones for adult and crime): this is not simply a side effect of the larger number of training epochs, since we reset the NN weights at each iteration. Rather, this seems a positive side-effect of using an iterative method to guide training (which may simplify escaping from local optima): further investigation is needed to characterize this phenomenon. Finally, reasonable parameter choices have only a mild effect on the algorithm behavior, thus simplifying its configuration. Empirically, α = 1, β = 0.1 seems to work well and is used for all subsequent experiments.

Scalability. We next turn to investigating the method scalability: from this perspective our examples are worst cases, since they must be defined on all the training data, and in some cases involve NP-hard problems. We report the average time for a master step in Figure 2. The average time for a learner step (100 epochs of our NN) is shown as a reference. At least in our experimentation, the time for a master step is always very reasonable, even for the dota2 dataset, for which we solve NP-hard problems on 74,120 examples. This is mostly due to the clean structure of the mα and mβ problems. Of course, for sufficiently large training sets, exact solutions will become impractical.

[Figure 2. Average master step time, compared to NN training.]
Setup of Alternative Approaches. Here we describe how to set up the alternative approaches that will be used for comparison. Namely, we consider the regularized linear approach from [3], referred to as RLR, and a Neural Network with Semantic Based Regularization [8], referred to as SBR. Both approaches are based on the idea of introducing constraints as regularizers at training time. Hence, their loss function is in the form:

  L(f(X; \theta), y^*) + \mu \, g(f(X; \theta))        (23)

The regularization term must be differentiable. We use SBR only for the case studies with the balance constraint, which we are forced to approximate to obtain differentiability:

  g(f(X; \theta)) = \max_{j=1..c} \sum_{i=1}^{m} f(X; \theta)_{ij}        (24)

i.e. we use the sums of the NN output neurons to approximate the class counts and the maximum as a penalty; this proved superior to other attempts in preliminary tests. The L term is the categorical cross-entropy.

Our SBR approach relies on the NN model from the previous paragraphs. Since access to the network structure is needed to differentiate the regularizer, SBR works best when all the examples linked by relational constraints can be included in the same batch. When this is not viable, the regularizer can be treated stochastically (via subsets of examples), at the cost of one additional approximation. We use a batch size of 2,048 as a compromise between memory usage and noise. The SBR method is trained for 1,600 epochs.

The RLR approach relies on linear models (Logistic or Linear Regression), which are simple enough to consider large groups of examples simultaneously. We use this approach for the fairness use cases. In the crime (regression) dataset, L is the MSE and the regularizer is simply Equation (17). In the adult (classification) dataset, L is the cross-entropy; the regularizer is Equation (10), with the following substitution:

  d_{k,v,j} = \Big| \frac{1}{m} \sum_{i=1}^{m} \theta^\top x_i - \frac{1}{|X_{k,v}|} \sum_{i \in X_{k,v}} \theta^\top x_i \Big|

This is an approximation obtained according to [3] by disregarding the sigmoid in the Logistic Regressor to preserve convexity. We train this approach to convergence using the CVXPY 1.1 library (with default configuration). In RLR and SBR classification, the introduced approximations permit to satisfy the constraints by having equal output for all classes, i.e. completely random predictions. This undesirable behavior is countered by the L term.
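For the balance constraint, the SBR loss from Equations (23)-(24) can be sketched as a custom Keras loss, with the batch-level output sums standing in for the class counts (subject to the batching caveats discussed above):

    import tensorflow as tf

    def sbr_loss(mu):
        # Cross-entropy plus mu * g, with g = max_j sum_i f(X; theta)_ij as
        # in Equations (23)-(24); the sum runs over the current mini-batch.
        ce = tf.keras.losses.CategoricalCrossentropy()
        def loss(y_true, y_pred):
            class_counts = tf.reduce_sum(y_pred, axis=0)  # approximate counts
            return ce(y_true, y_pred) + mu * tf.reduce_max(class_counts)
        return loss

    # model.compile(optimizer="rmsprop", loss=sbr_loss(mu=0.1))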
There is no simple recipe for choosing the value of μ in Equation (23); therefore, we performed experiments with different μ values to characterize its effect. The results are reported in Table 3. In most cases, larger μ values tend as expected to result in better constraint satisfaction, with a few notable exceptions for classification tasks (iris, dota2, and adult). The issue is likely due to the approximations introduced in the regularizers, since it arises even on small datasets that fit in a single mini-batch (iris). Further analysis will be needed to confirm this intuition. The accuracy decreases for a larger μ, as expected, but at a rather rapid pace. In the subsequent experiments, we will use for each dataset the RLR and SBR configurations that offer the best accuracy while being as close to feasible as possible (the "ideal case" column from Table 2): these are the cells in bold font in Table 3.

Table 3. Effect of parameter μ in regularization.

  Dataset   |   | μ=0.01 | μ=0.1 | μ=1
  Iris      | S | 0.984  | 0.97  | 0.4
            | C | 0.006  | 0.03  | 0.13
  Redwine   | S | 0.15   | 0.15  | 0.17
            | C | 0.12   | 0.12  | 0.04
  Whitewine | S | 0.17   | 0.15  | 0.14
            | C | 0.08   | 0.02  | 0.03
  Shuttle   | S | 0.7    | 0.31  | 0.14
            | C | 0.14   | 0.03  | 0.03
  Dota2     | S | 0.61   | 0.48  | 0.49
            | C | 0.2    | 0.5   | 0.5
  Adult     | S | .83    | .75   | .75
            | C | 1.6    | 3.1   | 3
  Crime     | S | .39    | 0.30  | 0.30
            | C | .4     | 0     | 0

Alternative Approaches and ML Models. We can now compare the performance of Moving Targets using different ML models with that of the alternative approaches presented above (with μ = 0.1), plus a pre-processing approach adapted from [13], referred to as NNpp and obtained by setting α, β → ∞ in our method.

For our method, we consider the following ML models: 1) the NN from the previous section with α = 1, β = 0.1; 2a) a Random Forest Classifier with 50 estimators and maximum depth of 5 (used for all classification case studies); 2b) a Gradient Boosted Trees model, with 50 estimators, maximum depth 4, and a minimum threshold of 5 samples per leaf (for the regression case study); 3a) a Logistic Regression model (for classification); 3b) a Linear Regression model (for regression). All models except the NN are implemented using scikit-learn [23]. In the table, the tree ensemble methods are reported on a single column, while another column (LR) groups Logistic and Linear Regression.
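For reference, a sketch of how the scikit-learn models listed above can be instantiated (hyperparameters as stated; anything unspecified is left at its scikit-learn default):

    from sklearn.ensemble import (GradientBoostingRegressor,
                                  RandomForestClassifier)
    from sklearn.linear_model import LinearRegression, LogisticRegression

    rf = RandomForestClassifier(n_estimators=50, max_depth=5)  # classification
    gb = GradientBoostingRegressor(n_estimators=50, max_depth=4,
                                   min_samples_leaf=5)         # regression
    logreg = LogisticRegression()                              # "LR" column
    linreg = LinearRegression()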
Table 4. Benchmarks with different ML models and alternative approaches.

  Dataset   |   | Regularized methods | NN           | LR           | Ensemble trees | NNpp
  Iris      | S | .984 ± .006         | .997 ± .004  | .96 ± .02    | .995 ± .004    | .96 ± .01
            | C | .006* ± .003        | .013* ± .004 | .014* ± .005 | .012* ± .003   | .014* ± .005
  Redwine   | S | .17 ± .05           | .506 ± .006  | .32 ± .01    | .40 ± .02      | .480 ± .001
            | C | .05 ± .01           | .018* ± .001 | .031 ± .005  | .08 ± .01      | .073 ± .006
  Whitewine | S | .15 ± .03           | .439 ± .009  | .025 ± .009  | .37 ± .04      | .47 ± .02
            | C | .02* ± .004         | .015* ± .003 | .027 ± .003  | .10 ± .01      | .084 ± .009
  Shuttle   | S | .31 ± .04           | .375 ± .007  | .332 ± .007  | .51 ± .05      | .5 ± .1
            | C | .03* ± .02          | .028 ± .005  | .023* ± .007 | .11 ± .01      | .2 ± .02
  Dota2     | S | .61 ± .02           | .66 ± .01    | .592 ± .005  | .53 ± .01      | .689 ± .003
            | C | .2* ± .1            | .09 ± .06    | .038 ± 0     | .16 ± .04      | .02* ± .02
  Adult     | S | .834 ± .001         | .841 ± .006  | .805 ± .006  | .76 ± .01      | .865 ± .003
            | C | 1.7 ± .05           | .22* ± .07   | .20* ± .03   | .01* ± .03     | .081* ± .09
  Crime     | S | .30 ± .01           | .48 ± .03    | .369 ± .008  | .49 ± .01      | .484 ± .008
            | C | 0 ± 0               | .2* ± .1     | .02* ± 0     | .24 ± .01      | .19* ± .02

Our algorithm seems to work well with all the considered ML models: tree ensembles and the NN have generally better constraint satisfaction (and higher accuracy for a given level of constraint satisfaction) than linear models, thanks to their larger variance. The preprocessing approach is effective when constraints are easy to satisfy (iris and dota2) and on all the fairness case studies, though less so on the remaining datasets. All Moving Targets approaches tend to perform better and more reliably than RLR and SBR. The case of RLR and LR is especially interesting, as they differ only for the mechanism used to enforce constraint satisfaction. In principle we can expect the two models to have zero gap, for the regularization term has the same formulation and the problem is convex. The gap is then due to an incomplete exploration of the space of the multiplier μ, since for the same multiplier value the gap is null under optimal training (which is the standard for linear regression). The example emphasizes a practical problem that often arises when dealing with regularized loss functions: the value of the multiplier has to be thoroughly calibrated by hand, while Moving Targets allows to directly define the desired constraint threshold and obtain the best solution associated to it.

Generalization. Since our main contribution is an optimization algorithm, we have focused so far on evaluating its performance on the training data, as it simplifies its analysis. Now that the properties of our method are clearer, we can assess the performance of the models we trained on the test data. The results of this evaluation are reported in Table 5, in the form of average ratio between the scores and the level of constraint satisfaction in the test and the train data. With a few exceptions (e.g. satisfiability in iris), the models (especially the tree ensembles and LR) generalize well in terms of both accuracy and constraint satisfaction. Given the tightness of some of the original constraints and the degree to which the labels were altered, this is a remarkable result.

Table 5. Generalization of various models in the test scenario.

  Dataset   |         | NN   | Ensemble Trees | LR
  Iris      | Sts/Str | 0.96 | 0.96           | 0.99
            | Cts/Ctr | 5.68 | 5.17           | 4.31
  Redwine   | Sts/Str | 0.62 | 0.92           | 0.94
            | Cts/Ctr | 1.22 | 1.04           | 1.35
  Whitewine | Sts/Str | 0.70 | 0.96           | 1.00
            | Cts/Ctr | 1.11 | 1.00           | 0.99
  Shuttle   | Sts/Str | 0.99 | 1.00           | 0.99
            | Cts/Ctr | 0.97 | 1.00           | 1.01
  Dota2     | Sts/Str | 0.83 | 1.00           | 0.99
            | Cts/Ctr | 1.10 | 1.00           | 1.03
  Adult     | Sts/Str | 0.99 | 1.00           | 1.00
            | Cts/Ctr | 1.55 | 1.92           | 0.98
  Crime     | Sts/Str | 0.75 | 0.73           | 0.93
            | Cts/Ctr | 0.74 | 1.05           | 1.03

5 Conclusion

In this paper we have introduced Moving Targets, a decomposition approach to augment a generic supervised learning algorithm with constraints, by iteratively adjusting the example labels. The method is designed to prioritize constraint satisfaction over accuracy, and proved to behave very well on a selection of tasks, constraints, and datasets. Its relative simplicity, reasonable scalability, and the ability to handle any classical ML model make it well suited for use in real-world settings.

Many open questions remain: we highlighted limitations of regularization-based techniques that deserve a much deeper analysis. The convergence properties of our method still need to be formally characterized (even in simpler, convex, scenarios). The method scalability should be tested on larger datasets, for which using approximate master steps will be necessary. Given the good performance of the pre-processing approach in Table 4, it may be interesting to skip the pretraining step in our method. Improvements may be possible by using specialized algorithms in specific settings: Douglas-Rachford splits could be applied in a numeric setting, or probabilistic predictions could be employed (when available) to refine the projection operators.

Acknowledgement

This research has been partially funded by the H2020 Project AI4EU, grant agreement 825619.

REFERENCES

[1] Sina Aghaei, Mohammad Javad Azizi, and Phebe Vayanos, 'Learning optimal and fair decision trees for non-discriminative decision-making', in Proceedings of AAAI, IAAI, EAAAI, pp. 1418–1426, (2019).
[2] Francisco J. Aragón Artacho and Rubén Campoy, 'A new projection method for finding the closest point in the intersection of convex sets', Computational Optimization and Applications, 69(1), 99–132, (2018).
[3] Richard Berk, Hoda Heidari, Shahin Jabbari, Matthew Joseph, Michael J. Kearns, Jamie Morgenstern, Seth Neel, and Aaron Roth, 'A convex framework for fair regression', CoRR, abs/1706.02409, (2017).
[4] Stephen Boyd, Jon Dattorro, et al., 'Alternating projections', EE392o, Stanford University, (2003).
[5] Toon Calders and Sicco Verwer, 'Three naive Bayes approaches for discrimination-free classification', Data Mining and Knowledge Discovery, 21(2), 277–292, (2010).
[6] Alessandro Daniele and Luciano Serafini, 'Knowledge enhanced neural networks', in Proceedings of PRICAI 2019, pp. 542–554, (2019).
[7] Michelangelo Diligenti, Marco Gori, and Claudio Saccà, 'Semantic-based regularization for learning and inference', Artificial Intelligence, 244, 143–165, (2017).
[8] Michelangelo Diligenti, Marco Gori, and Claudio Saccà, 'Semantic-based regularization for learning and inference', Artificial Intelligence, 244, 143–165, (2017). Combining Constraint Solving with Mining and Learning.
[9] Dheeru Dua and Casey Graff, 'UCI Machine Learning Repository', (2017).
[10] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel, 'Fairness through awareness', in Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pp. 214–226, (2012).
[11] Benjamin Fish, Jeremy Kun, and Ádám D. Lelkes, 'A confidence-based approach for balancing fairness and accuracy', in Proceedings of the 2016 SIAM International Conference on Data Mining, pp. 144–152. SIAM, (2016).
[12] Moritz Hardt, Eric Price, and Nati Srebro, 'Equality of opportunity in supervised learning', in Advances in Neural Information Processing Systems, pp. 3315–3323, (2016).
[13] Faisal Kamiran and Toon Calders, 'Classifying without discriminating', in 2009 2nd International Conference on Computer, Control and Communication, pp. 1–6. IEEE, (2009).
[14] Faisal Kamiran and Toon Calders, 'Data preprocessing techniques for classification without discrimination', Knowledge and Information Systems, 33(1), 1–33, (2011).
[15] Faisal Kamiran, Toon Calders, and Mykola Pechenizkiy, 'Discrimination aware decision tree learning', in 2010 IEEE International Conference on Data Mining, pp. 869–874. IEEE, (2010).
[16] Guosheng Lin, Chunhua Shen, Anton Van Den Hengel, and Ian Reid, 'Efficient piecewise training of deep structured models for semantic segmentation', in Proceedings of the IEEE CVPR, pp. 3194–3203, (2016).
[17] Marco Lippi and Paolo Frasconi, 'Prediction of protein β-residue contacts by Markov logic networks with grounding-specific weights', Bioinformatics, 25(18), 2326–2333, (2009).
[18] Binh Thanh Luong, Salvatore Ruggieri, and Franco Turini, 'k-NN as an implementation of situation testing for discrimination discovery and prevention', in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 502–510, (2011).
[19] Xuezhe Ma and Eduard Hovy, 'End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF', in Proceedings of ACL, pp. 1064–1074. Association for Computational Linguistics, (2016).
[20] Robin Manhaeve, Sebastijan Dumančić, Angelika Kimmig, Thomas Demeester, and Luc De Raedt, 'DeepProbLog: Neural probabilistic logic programming', arXiv preprint arXiv:1805.10872, (2018).
[21] Giuseppe Marra, Francesco Giannini, Michelangelo Diligenti, and Marco Gori, 'Integrating learning and reasoning with deep logic models', in Proceedings of ECML, (2019).
[22] Neal Parikh, Stephen Boyd, et al., 'Proximal algorithms', Foundations and Trends in Optimization, 1(3), 127–239, (2014).
[23] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, 'Scikit-learn: Machine learning in Python', Journal of Machine Learning Research, 12, 2825–2830, (2011).
[24] Matthew Richardson and Pedro Domingos, 'Markov logic networks', Machine Learning, 62(1-2), 107–136, (2006).
[25] Tim Rocktäschel and Sebastian Riedel, 'End-to-end differentiable proving', in Advances in Neural Information Processing Systems, pp. 3788–3800, (2017).
[26] Luciano Serafini and Artur d'Avila Garcez, 'Logic tensor networks: Deep learning and logical reasoning from data and knowledge', arXiv preprint arXiv:1606.04422, (2016).
[27] Emile van Krieken, Erman Acar, and Frank van Harmelen, 'Semi-supervised learning using differentiable reasoning', Journal of Applied Logic, (2019). To appear.
[28] Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork, 'Learning fair representations', in International Conference on Machine Learning, pp. 325–333, (2013).