Teaching the Old Dog New Tricks: Supervised Learning with Constraints

Fabrizio Detassis¹,², Michele Lombardi¹, and Michela Milano¹

¹ DISI, University of Bologna, Italy, email: {fabrizio.detassis2, michele.lombardi2, michela.milano}@unibo.it
² Corresponding author

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. Methods for taking into account external knowledge in Machine Learning models have the potential to address outstanding issues in data-driven AI methods, such as improving safety and fairness, and can simplify training in the presence of scarce data. We propose a simple but effective method for injecting constraints at training time in supervised learning, based on decomposition and bi-level optimization: a master step is in charge of enforcing the constraints, while a learner step takes care of training the model. The process leads to approximate constraint satisfaction. The method is applicable to any ML approach for which the concept of label (or target) is well defined (most regression and classification scenarios), and allows reusing existing training algorithms with no modifications. We require no assumption on the constraints, although their properties affect the shape and complexity of the master problem. Convergence guarantees are hard to provide, but we found that the approach performs well on ML tasks with fairness constraints and on classical datasets with synthetic constraints.

1 Introduction

Techniques to deal with constraints in Machine Learning (ML) have the potential to address outstanding issues in data-driven AI methods. Constraints representing (e.g.) physical laws can be employed to improve generalization; constraints may encode negative patterns (e.g. excluded classes) and relational information (e.g. involving multiple examples); constraints can ensure the satisfaction of desired properties, such as fairness, safety, or lawfulness; they can even be used to extract symbolic information from data.

Such techniques can greatly improve the learning process in terms of versatility and control. As a consequence, our work constitutes a step forward in the direction of a more human-oriented AI, as it is general and allows for the realization of learning models that satisfy specific user requirements. This behaviour is fundamental in contexts where there exist human-specific constraints that the trained AI model has to acknowledge. The present paper aims at enlarging the pool of tools that can be used to inject external knowledge into an AI model, a field that is rapidly expanding thanks to its many potential practical applications.

To the best of the authors' knowledge, the vast majority of approaches for taking into account external knowledge in ML make assumptions that restrict the type of constraints (e.g. differentiability, no relational information) or the type of models (e.g. only Decision Trees, only differentiable approaches), and often require modifications in the employed training algorithms (e.g. specialized loss terms).

We propose a decomposition-based method, referred to as Moving Targets, to augment supervised learning with constraints. A master step "teaches" constraint satisfaction to a learning algorithm by iteratively adjusting the sample labels. The master and learner have no direct knowledge of each other, meaning that: 1) any ML method can be used for the learner, with no modifications; 2) the master can be defined via techniques such as Mathematical or Constraint Programming, to support discrete values or non-differentiable constraints. Our method is also well suited to deal with relational constraints over large populations (e.g. fairness indicators). Moving Targets subsumes the few existing techniques – such as the one by [13] – capable of offering the same degree of versatility.

When constraints conflict with the data, the approach prioritizes constraint satisfaction over accuracy. For this reason, it is not well suited to deal with fuzzy information. Moreover, due to our open setting, it is hard to determine convergence properties. Despite this, we found that the approach performs well (compared to state-of-the-art methods) on classification and regression tasks with fairness constraints, and on classification problems with balance constraints. Due to its combination of simplicity, generality, and the observed empirical performance, Moving Targets can represent a valuable addition to the arsenal of techniques for dealing with constraints in Machine Learning.
The paper is organized as follows: in Section 2 we briefly survey related works on the integration of constraints in ML; in Section 3 we present our method and in Section 4 our empirical evaluation. Concluding remarks are in Section 5.

2 Related Works

Here we provide an overview of representative approaches for integrating constraints in ML, and we discuss their differences with our method.

Most approaches in the literature build over just a few key ideas. One of them is using the constraints to adjust the output of a trained ML model. This is done in DeepProbLog [20], where Neural Networks with probabilistic output (mostly classifiers) are treated as predicates. [25] presents a Neural Theorem Prover using differentiable predicates and the Prolog backward chaining algorithm. The original Markov Logic Networks [24] rely instead on Markov Fields defined over First Order Logic formulas. As a drawback, with these approaches the constraints have no effect on the model parameters, which complicates the analysis of feature importances. Moreover, dealing with relational constraints (e.g. fairness) requires access at prediction time either to a representative population or to its distribution [12, 11].

A second group of approaches operates by using constraint-based expressions as regularization terms at training time. In Semantic Based Regularization [7] constraints are expressed as fuzzy logical formulas over differentiable predicates. Logic Tensor Networks [26] focus on Neural Networks and replace the entire loss function with a fuzzy formula. Differentiable Reasoning [27] uses relational background knowledge in a similar fashion to benefit from unlabeled data. In the context of fairness constraints, this approach has been taken in [1, 10, 28, 5, 15]. Methods in this class account for the constraints by adjusting the model parameters, and can therefore be used to analyze feature importance.
They can deal with relational constraints without additional examples at prediction time; however, they ideally require simultaneous access at training time to all the examples linked by relational constraints (which can be problematic when using mini-batches). They often require properties on the constraints (e.g. differentiability), which may force approximations; they may also be susceptible to numerical issues.

A third idea consists in enforcing constraint satisfaction in the data via a pre-processing step. This is proposed in the context of fairness constraints by [13, 14, 18]. The approach enables the use of standard ML methods with no modification, and can deal with relational constraints on large sets of examples. As a main drawback, bias in the model or the training algorithm may prevent getting close to the pre-processed labels.

Multiple ideas can be combined: domain knowledge has been introduced in differentiable Machine Learning (e.g. Deep Networks) by designing their structure, rather than the loss function; examples include Deep Structured Models in [16] and [19]. These approaches can use constraints to support both training and inference.

Though less related to our approach, constraints can be used to extract symbolic knowledge from data, for example by allowing the training algorithm to adjust the regularizer weights. This approach is considered (e.g.) in [17, 21, 6].

Our approach is closely related to the idea of enforcing constraints by altering the data, and shares the same advantages (versatility, support for relational constraints and feature importance analysis, no differentiability assumptions). We counter the main drawbacks mentioned above by using an iterative algorithm rather than a single pre-processing step.

3 Moving Targets

In this section we present our method, discuss its properties, draw connections with related algorithms, and provide some convergence considerations.

Algorithm 1  Moving Targets
  input: label vector y*, scalar parameters α, β, n
  y¹ = l(y*)                  # pretraining
  for k = 1..n do
    if y^k ∉ C then
      z^k = mα(y^k)           # infeasible master step
    else
      z^k = mβ(y^k)           # feasible master step
    end if
    y^(k+1) = l(z^k)          # learner step
  end for
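For concreteness, the following is a minimal Python sketch of the loop in Algorithm 1 (our illustration, not code from the paper's repository). The learner, master_alpha, master_beta, and is_feasible callables stand for the problem-specific components defined in the rest of this section.

    def moving_targets(y_star, learner, master_alpha, master_beta,
                       is_feasible, n_iters=15):
        # Sketch of Algorithm 1: learner(z) trains a model on labels z and
        # returns its predictions on the training set; master_alpha(y) and
        # master_beta(y) solve problems (4) and (5) below; is_feasible(y)
        # checks membership in the feasible set C.
        y = learner(y_star)              # pretraining: y1 = l(y*)
        for _ in range(n_iters):
            if not is_feasible(y):
                z = master_alpha(y)      # infeasible master step
            else:
                z = master_beta(y)       # feasible master step
            y = learner(z)               # learner step on the adjusted labels
        return y

Since the loop only exchanges label and prediction vectors, any ML library can back the learner callable; this is the basis of the method's model-agnosticism.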
The Algorithm. Our goal is to adjust the parameters of a ML model so as to minimize a loss function with clearly defined labels, under a set of generic constraints. We acknowledge that any constrained learning problem must trade prediction mistakes for a better level of constraint satisfaction, and we attempt to control this process by carefully selecting which mistakes should be made. This is similar to [13, 14, 18], but: 1) we consider generic constraints rather than focusing on fairness; and 2) we rely on an iterative process (which alternates "master" and "learner" steps) to improve the results.

Let L(y, y*) be the loss function, where y is the prediction vector and y* is the label vector. We make the (non-restrictive) assumption that the loss is a pre-metric, i.e. L(y, y*) ≥ 0 and L(y, y*) = 0 iff y = y*. Examples of how to treat common loss functions can be found in Table 1.

Table 1. Notable loss functions (m = #examples, c = #classes).

  Loss Function       | Expression                                                       | Label Space
  Mean Squared Error  | \frac{1}{m} \|y - y^*\|_2^2                                      | \mathbb{R}^m
  Hamming Distance    | \frac{1}{m} \sum_{i=1}^{m} \mathbb{I}[y_i \neq y^*_i]            | \{1..c\}^m
  Cross Entropy       | -\frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{c} y^*_{ij} \log y_{ij}  | [0, 1]^{m \times c}

We then want to solve, in an exact or approximate fashion, the following constrained optimization problem:

  \arg\min_\theta \{ L(y, y^*) \mid y = f(X; \theta),\ y \in C \}        (1)

where f is the ML model and θ its parameter vector. With some abuse of notation, we refer as f(X; θ) to the vector of predictions for the examples in the training set X. Since the model input at training time is known, constraints can be represented as a feasible set C for the sole predictions y.

The problem can be rewritten without loss of generality by introducing a second set B corresponding to the ML model bias. This leads to a formulation in pure label space:

  \arg\min_y \{ L(y, y^*) \mid y \in B \cap C \}        (2)

where B = {y | ∃θ, y = f(X; θ)}. The Moving Targets method is described in Algorithm 1, and starts with a learner step w.r.t. the original label vector y* (pretraining). Each learner step, given a label vector z as input, solves approximately or exactly the problem:

  l(z) = \arg\min_y \{ L(y, z) \mid y \in B \}        (3)

Note that this is a traditional unconstrained learning problem, since B is just the model/algorithm bias. The result of the first learner step gives an initial vector of predictions y¹. Next comes a master step to adjust the label vector: this can take two forms, depending on the current predictions. In case of an infeasibility, i.e. y^k ∉ C, we solve:

  m_\alpha(y) = \arg\min_z \{ L(z, y^*) + \tfrac{1}{\alpha} L(z, y) \mid z \in C \}        (4)

Intuitively, we try to find a feasible label vector z that is close (in terms of loss function value) to both the original labels y* and the current prediction y. A parameter α ∈ (0, ∞) controls which of the two should be preferred. If the input vector is feasible, we instead solve:

  m_\beta(y) = \arg\min_z \{ L(z, y^*) \mid L(z, y) \le \beta,\ z \in C \}        (5)

i.e. we look for a feasible label vector z that is 1) not too far from the current predictions (in the ball defined by L(z, y) ≤ β) and 2) closer (in terms of loss) to the true labels y*. Here, we are seeking an accuracy improvement.

We then make a learner step trying to reach the adjusted labels; the new predictions will be adjusted at the next iteration, and so on. In case of convergence, the predictions y^k and the adjusted labels z^k become stationary (but not necessarily identical). An example run, for a Mean Squared Error loss and convex constraints and bias, is in Figure 1.

[Figure 1. A sample run of our algorithm: the iterates y¹, z²/y³, z³, ... move between the bias set B and the feasible set C, starting from y*.]
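To make Equations (4) and (5) concrete, here is a sketch of both master steps for an MSE loss and a hypothetical convex feasible set C = {z : mean(z) ≤ τ}, written with CVXPY (the library used later for the RLR baseline); the constraint is chosen purely for illustration.

    import cvxpy as cp

    def m_alpha(y_pred, y_star, alpha, tau):
        # Master step (4) for an MSE loss: find z in C close to both y* and y.
        m = len(y_star)
        z = cp.Variable(m)
        loss = (cp.sum_squares(z - y_star) / m
                + cp.sum_squares(z - y_pred) / (alpha * m))
        prob = cp.Problem(cp.Minimize(loss), [cp.sum(z) / m <= tau])
        prob.solve()
        return z.value

    def m_beta(y_pred, y_star, beta, tau):
        # Master step (5): move toward y* within the ball L(z, y) <= beta.
        m = len(y_star)
        z = cp.Variable(m)
        constraints = [cp.sum(z) / m <= tau,                    # z in C
                       cp.sum_squares(z - y_pred) / m <= beta]  # ball around y
        prob = cp.Problem(cp.Minimize(cp.sum_squares(z - y_star) / m),
                          constraints)
        prob.solve()
        return z.value

For this convex case both problems are solved exactly; for the discrete master problems of Section 4 a MIP solver plays the same role.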
Properties. The learner is not directly affected by the constraints, thus enabling the use of arbitrary ML approaches. The master problems do not depend on the ML model, often leading to clean structures that are easier to solve. Since we make no explicit use of mini-batches, we can deal well with relational constraints on large groups (e.g. fairness). The master step can be addressed via any suitable solver, so that discrete variables and non-differentiable constraints can be tackled via (e.g.) Mathematical Programming, Constraint Programming, or SAT Modulo Theories. Depending on the constraints, the loss functions, and the label space (e.g. numeric vs discrete), the master problems may be NP-hard. Even in this case, their clean structure may allow for exact solutions for datasets of practical size. Moreover, for separable loss functions (e.g. all those from Table 1), the master problems can be defined over only the constrained examples, with a possibly significant size reduction. If scalability is still a concern, the master step can be solved in an approximate fashion: this may lead to a lower accuracy and a higher level of constraint violation; however, such issues are partly inevitable, due to algorithm/model bias, and since constraint satisfaction on the training data does not necessarily transfer to unseen examples.

Analysis and Convergence. Due to its open nature and minimal assumptions, establishing the convergence of our method is hard. Here we provide some considerations and connect the approach to existing algorithms. We will make the simplifying assumption that the learner problem from Equation (3) can be solved exactly. This holds for a few cases (e.g. convex ML models trained to close optimality), but in general the assumption will not be strictly satisfied.

We start by observing that Equation (2) is simply the Best Approximation Problem (BAP), which involves finding a point in the intersection of two sets (say B and C) that is as close as possible to a reference point (say y*). For convex sets in Hilbert spaces, the BAP can be solved optimally via Douglas-Rachford splits or the method from [2], both relying on projection operators. Since our learner is essentially a projection operator on B, it would seem sensible to apply these methods in our setting. Unfortunately: 1) we cannot reliably assume convexity; and 2) in a discrete space, the Douglas-Rachford splits may lead to "label" vectors that are meaningless for the learner.

We therefore chose a design that is less elegant, but also less sensitive to which assumptions are valid. In particular: 1) in the mα steps, we use a modification of a suboptimal BAP algorithm to find a feasible prediction vector; then 2) in mβ we apply a modified proximal operator to improve its distance (in terms of loss) w.r.t. the original labels y*. The basis for our mα step is the Alternating Projections (AP) method, discussed e.g. in [4]. The AP simply alternates projection operators on the two sets, which never generates vectors outside of the label space.

Indeed, for α → 0 our mα step becomes a projection of the predictions y^k on C. With this setup we recover the AP behavior, and its convergence to a feasible point for convex sets. For α → ∞ we obtain, essentially, the pre-processing method from [13]: mα becomes a projection of the true labels y* on C, and subsequent iterations have no further effect; convergence to a feasible point is achieved only if the pre-processed labels are in B, which may not be the case (e.g. a quadratic distribution for a linear model). For reasonable α values, our mα step balances the distance (loss) from both the predictions y and the targets y*. Convergence in this case is an open question, but especially in a non-convex or discrete setting (where multiple projections may exist) this modification helps to escape local minima and accelerates progress.

When a feasible prediction vector is obtained, our method switches to the mβ step; we then search for a point in C that is closer to the true labels, but also not too far from the predictions. This is related to the Proximal Gradient method, discussed e.g. in [22], but we limit the distance via a constraint rather than a squared norm, and we search in a ball rather than on a line. As in the proximal gradient, a too large search radius prevents convergence: for β → ∞ the mβ step always returns the same adjusted labels, corresponding to the projection of y* on the set C.
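As a toy illustration of the α → 0 limit (pure Alternating Projections), consider two convex sets in the plane: a hypothetical "bias" set B (the line p₂ = p₁) and a "constraint" set C (the half-plane p₂ ≥ 1). The example is ours and not from the paper:

    import numpy as np

    def project_B(p):  # projection on the line B = {(a, a)}
        a = p.mean()
        return np.array([a, a])

    def project_C(p):  # projection on the half-plane C = {p: p[1] >= 1}
        return np.array([p[0], max(p[1], 1.0)])

    p = np.array([3.0, -1.0])        # reference point y*
    for _ in range(50):
        p = project_B(project_C(p))  # alternate the two projections
    print(p)                         # -> [2. 2.], a point in B ∩ C

Starting from y* = (3, −1) the iterates converge to (2, 2) ∈ B ∩ C, while the closest feasible point to y* is (1, 1): AP finds a feasible point, not necessarily the best approximation, which is what the mα and mβ modifications aim to improve.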
4 Empirical Evaluation

Our experimentation is designed around a few main questions: 1) How does the Moving Targets method work on a variety of constraints, tasks, and datasets? 2) What is the effect of the α, β parameters? 3) How does the approach scale? 4) How different is the behavior with different ML models? 5) How does the method compare with alternative approaches? We proceed to describe our setup and the experiments we performed to address such questions.

Tasks and Constraints. We experiment on three case studies, covering two types of ML tasks and two types of constraints. First, we consider a classification problem augmented with a balance constraint, which forces the distribution over the classes to be approximately uniform. The loss function is given by the Hamming distance and the label space is {1..c}^m. The mα(y) problem is defined as a Mixed Integer Linear Program with binary variables z_ij such that z_ij = 1 iff the adjusted class for the i-th example is j. Formally:

  \min \; \frac{1}{m} \sum_{i=1}^{m} (1 - z_{i,y^*_i}) + \frac{1}{\alpha m} \sum_{i=1}^{m} (1 - z_{i,y_i})        (6)

  \text{s.t.} \; \sum_{j=1}^{c} z_{ij} = 1 \quad \forall i = 1..m        (7)

  \sum_{i=1}^{m} z_{ij} \le \frac{(1+\xi)\,m}{c} \quad \forall j = 1..c        (8)

  z_{ij} \in \{0, 1\} \quad \forall i = 1..m,\ j = 1..c        (9)

The summations in Equation (6) encode the Hamming distance w.r.t. the true labels y* and the predictions y. Equation (7) prevents assigning two classes to the same example. Equation (8) requires an equal count for each class, with tolerance defined by ξ (ξ = 0.05 in all our experiments); the balance constraint is stated in exact form, thanks to the discrete labels. The mα formulation generalizes the knapsack problem and is hence NP-hard; since all examples appear in Equation (8), no problem size reduction is possible. The mβ problem can be derived from mα by changing the objective function and by adding the ball constraint as in Equation (5).

Our second use case is a classification problem with realistic fairness constraints, based on the DIDI indicator from [1]:

  \mathrm{DIDI}_c(X, y) = \sum_{k \in K} \sum_{v \in D_k} \sum_{j=1}^{c} d_{kvj}        (10)

  d_{kvj} = \Big| \frac{1}{m} \sum_{i=1}^{m} \mathbb{I}[y_i = j] - \frac{1}{|X_{k,v}|} \sum_{i \in X_{k,v}} \mathbb{I}[y_i = j] \Big|

where K contains the indices of "protected features" (e.g. ethnicity, gender, etc.), D_k is the set of possible values for the k-th feature, and X_{k,v} is the set of examples having value v for the k-th feature. The mα(y) problem can be defined via the following Mathematical Program:

  \min \; \frac{1}{m} \sum_{i=1}^{m} (1 - z_{i,y^*_i}) + \frac{1}{\alpha m} \sum_{i=1}^{m} (1 - z_{i,y_i})        (11)

subject to Equation (7) and:

  \sum_{k \in K} \sum_{v \in D_k} \sum_{j=1}^{c} d_{kvj} \le \epsilon        (12)

  d_{kvj} \ge \frac{1}{m} \sum_{i=1}^{m} z_{ij} - \frac{1}{|X_{k,v}|} \sum_{i \in X_{k,v}} z_{ij} \quad \forall k, v, j        (13)

  d_{kvj} \ge -\frac{1}{m} \sum_{i=1}^{m} z_{ij} + \frac{1}{|X_{k,v}|} \sum_{i \in X_{k,v}} z_{ij} \quad \forall k, v, j        (14)

  z_{ij} \in \{0, 1\} \quad \forall i = 1..m,\ j = 1..c        (15)

where Equation (12) is the constraint on the DIDI value and Equations (13)-(14) linearize the absolute values in its definition. The DIDI scales with the number of examples and has an intrinsic value due to the discrimination in the data. Therefore, we compute DIDI_tr for the training set, then in our experiments we have ε = 0.2 DIDI_tr. This is again an NP-hard problem defined over all training examples. The mβ formulation can be derived as in the previous case.

Our third case study is a regression problem with fairness constraints, based on a specialized DIDI version from [1]:

  \mathrm{DIDI}_r(X, y) = \sum_{k \in K} \sum_{v \in D_k} d_{kv}        (16)

  d_{kv} = \Big| \frac{1}{m} \sum_{i=1}^{m} y_i - \frac{1}{|X_{k,v}|} \sum_{i \in X_{k,v}} y_i \Big|        (17)

In this case, we use the Mean Squared Error (MSE) as a loss function, and the label space is \mathbb{R}^m. The mα problem can be defined via the following Mathematical Program:

  \min \; \frac{1}{m} \sum_{i=1}^{m} (y^*_i - z_i)^2 + \frac{1}{\alpha m} \sum_{i=1}^{m} (z_i - y_i)^2        (18)

  \text{s.t.} \; \sum_{k \in K} \sum_{v \in D_k} d_{kv} \le \epsilon        (19)

  d_{kv} \ge \frac{1}{m} \sum_{i=1}^{m} z_i - \frac{1}{|X_{k,v}|} \sum_{i \in X_{k,v}} z_i \quad \forall k, v        (20)

  d_{kv} \ge -\frac{1}{m} \sum_{i=1}^{m} z_i + \frac{1}{|X_{k,v}|} \sum_{i \in X_{k,v}} z_i \quad \forall k, v        (21)

  z_i \in \mathbb{R} \quad \forall i = 1..m        (22)

This is a linearly constrained, convex, Quadratic Programming problem that can be solved (unlike our classification examples) in polynomial time. The mβ problem can be derived as in the previous cases: while still convex, mβ is in this case a quadratically constrained problem.
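As an illustration, the balance master problem from Equations (6)-(9) can be sketched as follows; PuLP is used here as an open-source stand-in for the Cplex solver employed in our experiments, and the function name is ours.

    import pulp

    def m_alpha_balance(y_star, y_pred, n_classes, alpha, xi=0.05):
        # Sketch of the MILP (6)-(9). y_star and y_pred are integer class
        # vectors in {0..n_classes-1}.
        m = len(y_star)
        prob = pulp.LpProblem("m_alpha", pulp.LpMinimize)
        z = pulp.LpVariable.dicts("z", (range(m), range(n_classes)),
                                  cat="Binary")
        # objective (6): Hamming distances from the labels y* and predictions y
        prob += ((1.0 / m) * pulp.lpSum(1 - z[i][y_star[i]] for i in range(m))
                 + (1.0 / (alpha * m))
                 * pulp.lpSum(1 - z[i][y_pred[i]] for i in range(m)))
        for i in range(m):  # (7): exactly one class per example
            prob += pulp.lpSum(z[i][j] for j in range(n_classes)) == 1
        for j in range(n_classes):  # (8): approximately uniform class counts
            prob += (pulp.lpSum(z[i][j] for i in range(m))
                     <= (1 + xi) * m / n_classes)
        prob.solve()
        # read back the adjusted class of each example
        return [max(range(n_classes), key=lambda j: z[i][j].value())
                for i in range(m)]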
Datasets, Preparation, and General Setup. We test our method on seven datasets from the UCI Machine Learning repository [9], namely iris (150 examples), redwine (1,599), crime (2,215), whitewine (4,898), adult (32,561), shuttle (43,500), and dota2 (92,650). We use adult for the classification/fairness case study, crime for regression/fairness, and the remaining datasets for the classification/balance case study.

For each experiment, we perform a 5-fold cross validation (with a fixed seed for random reshuffling) to account for noise due to sampling and in the training process. Hence, the training set for each fold will include 80% of the data. All our experiments are run on an Intel Core i7 laptop with 16GB RAM, and we use Cplex 12.8 to solve the master problems. Our code and datasets are publicly available³.

All the datasets for the classification/balance case study are prepared by standardizing all input features (on the training folds) to have zero mean and unit variance. The iris and dota2 datasets are very balanced (the constraint is easily satisfied), while the remaining datasets are quite unbalanced. In the adult (also known as "Census Income") dataset the target is "income" and the protected attribute is "race". We remove the features "education" (duplicated) and "native country" and use a one-hot encoding on all categorical features. Features are normalized between 0 and 1. Our crime dataset is the "Communities and Crime Unnormalized" table. The target is "violentPerPop" and the protected feature is "race". We remove features that are empty almost everywhere and features trivially related to the target ("murders", "robberies", etc.). Features are normalized between 0 and 1 and we select the top 15 ones according to the SelectKBest method of scikit-learn (excluding "race"). The protected feature is then reintroduced.

³ Git repository publicly available.
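Given the protected groups just described ("race" in both adult and crime), the DIDI indicators from Equations (10) and (16)-(17) translate directly into NumPy; in this sketch the grouping into the sets X_{k,v} is assumed to be precomputed.

    import numpy as np

    def didi_r(y, groups):
        # Regression DIDI, Equations (16)-(17): `groups` is a list of index
        # arrays, one per value v of each protected feature k (the X_{k,v}).
        return sum(abs(y.mean() - y[g].mean()) for g in groups if len(g) > 0)

    def didi_c(y, groups, n_classes):
        # Classification DIDI, Equation (10): compare per-class frequencies.
        total = 0.0
        for g in groups:
            if len(g) == 0:
                continue
            for j in range(n_classes):
                total += abs(np.mean(y == j) - np.mean(y[g] == j))
        return total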
Parameter tuning. We know that extreme choices for α and β can dramatically alter the method behavior, but not what effect can be expected for more reasonable values. With this aim, we perform an investigation by running the algorithm for 15 iterations (used in all experiments), with different α and β values. As a ML model, we use a fully-connected, feed-forward Neural Network (NN) with two hidden layers of 32 Rectified Linear Units each. The last layer has either a SoftMax activation (for classification) or a Linear one (for regression). The loss function is respectively the categorical cross-entropy or the MSE. The network is trained with 100 epochs of RMSProp in Keras/Tensorflow 2.0 (default parameters and batch size 64).
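A sketch of the network just described, in Keras/Tensorflow 2.0 (classification variant; for regression, the head would be a single linear unit trained with the MSE):

    from tensorflow import keras

    def build_net(n_inputs, n_classes):
        # Fully-connected NN: two hidden layers of 32 ReLUs, SoftMax head.
        model = keras.Sequential([
            keras.layers.Dense(32, activation="relu",
                               input_shape=(n_inputs,)),
            keras.layers.Dense(32, activation="relu"),
            keras.layers.Dense(n_classes, activation="softmax"),
        ])
        model.compile(optimizer="rmsprop", loss="categorical_crossentropy")
        return model

    # one learner step: a fresh network is fit on the current adjusted labels z
    # build_net(X.shape[1], n_classes).fit(X, z, epochs=100, batch_size=64)

Note that a fresh network is built for each learner step, consistent with the weight reset mentioned below.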
The results are in Table 2. We report a score (row S, higher is better) and a level of constraint violation (row C, lower is better). The S score is the accuracy for classification and the R² coefficient for regression. For the balance constraint, the C score is the standard deviation of the class frequencies; in the fairness case studies, we use the ratio between the DIDI of the predictions and that of the training data. Each cell reports mean and standard deviation for the 5 runs. Near-feasible values are marked with a *; accuracy comparisons are fair only for similar constraint violation scores. We consider a solution as near feasible when the associated constraint score lies within one standard deviation from the imposed score.

All columns labeled with α and β values refer to our method with the corresponding parameters. We explore different values of β (for a fixed α = 1), corresponding to different ball radii in the mβ step; we also explore different values of α, corresponding to a behavior of the mα step progressively closer to that of the Alternating Projections method. The ideal case column refers to a simple projection of the true labels y* on the feasible space C: this corresponds (on the training data) to the best possible accuracy that can be achieved if the constraints are fully satisfied. The Ptr column reports the results of the pretraining step, as defined in Algorithm 1.

Table 2. Effect of parameters α and β on different datasets.

  NN        |   | Ptr         | α=1, β=.01   | α=1, β=.05     | α=1, β=.1      | α=.1, β=.01    | α=0+, β=.1   | Ideal case
  Iris      | S | .970 ± .002 | .99 ± .01    | .997 ± .004    | .997 ± .004    | .99 ± .02      | .995 ± .008  | .9968 ± .0004
            | C | .016 ± .001 | .014* ± .004 | .013* ± .004   | .013* ± .004   | .015* ± .005   | .013* ± .004 | .013 ± .004
  Redwine   | S | .709 ± .005 | .508 ± .006  | .511 ± .009    | .506 ± .006    | .484 ± .007    | .50 ± .01    | .525 ± .002
            | C | .200 ± .001 | .019* ± .001 | .0186* ± .0005 | .0184* ± .0008 | .0188* ± .0004 | .019* ± .001 | .019 ± 0
  Whitewine | S | .644 ± .002 | .446 ± .006  | .437 ± .009    | .439 ± .009    | .40 ± .02      | .401 ± .009  | .524 ± .002
            | C | .189 ± .003 | .015* ± .002 | .013* ± .004   | .015* ± .003   | .014* ± .004   | .014* ± .004 | .015 ± .002
  Shuttle   | S | .999 ± 0    | .39 ± .04    | .37 ± .01      | .375 ± .007    | .37 ± .03      | .37 ± .03    | .3608 ± .0008
            | C | .267 ± 0    | .045* ± .03  | .029 ± .003    | .028 ± .006    | .05 ± .03      | .04 ± .04    | .017 ± 0
  Dota2     | S | .686 ± .002 | .666 ± .007  | .661 ± .002    | .66 ± .01      | .672 ± .004    | .656 ± .006  | .9984 ± .0009
            | C | .070 ± .008 | .04* ± .03   | .04* ± .03     | .09 ± .06      | .023 ± .006    | .07 ± .03    | .025 ± 0
  Adult     | S | .867 ± .001 | .818 ± .005  | .86 ± .02      | .841 ± .006    | .852 ± .004    | .84 ± .02    | .992 ± .0005
            | C | 1 ± .03     | .18* ± .04   | .16* ± .02     | .22* ± .07     | .22* ± .03     | .22* ± .04   | .200 ± .0005
  Crime     | S | .56 ± .02   | .49 ± .01    | .46 ± .04      | .48 ± .03      | .45 ± .05      | .46 ± .06    | .910 ± .007
            | C | .97 ± .03   | .22* ± .07   | .16* ± .08     | .2* ± .1       | .19* ± .02     | .21* ± .04   | .2 ± 0

Results are to be read as follows: the ideal case constitutes the upper bound to the performance of a constrained learner; it exactly matches the constraint threshold while minimizing the loss function. On the other hand, Ptr represents the constraint-agnostic behavior: it retains good performances while not necessarily satisfying the constraints. Our method lies in between the two extreme cases.

The Moving Targets algorithm can significantly improve the satisfaction of non-trivial constraints: this is evident for the (very) unbalanced datasets redwine, whitewine, and shuttle and all fairness use cases, for which feasible (or close) results are almost always obtained. Satisfying very tight constraints (e.g. in the unbalanced datasets) generally comes at a significant cost in terms of accuracy. When the constraints are less demanding, however, we often observe accuracy improvements w.r.t. the pretraining results (even substantial ones for adult and crime): this is not simply a side effect of the larger number of training epochs, since we reset the NN weights at each iteration. Rather, this seems a positive side-effect of using an iterative method to guide training (which may simplify escaping from local optima): further investigation is needed to characterize this phenomenon. Finally, reasonable parameter choices have only a mild effect on the algorithm behavior, thus simplifying its configuration. Empirically, α = 1, β = 0.1 seems to work well and is used for all subsequent experiments.

Scalability. We next turn to investigating the method scalability: from this perspective our examples are worst cases, since they must be defined on all the training data, and in some cases involve NP-hard problems. We report the average time for a master step in Figure 2. The average time for a learner step (100 epochs of our NN) is shown as a reference. At least in our experimentation, the time for a master step is always very reasonable, even for the dota2 dataset, for which we solve NP-hard problems on 74,120 examples. This is mostly due to the clean structure of the mα and mβ problems. Of course, for sufficiently large training sets, exact solutions will become impractical.

[Figure 2. Average master step time, compared to NN training.]
Setup of Alternative Approaches. Here we describe how to set up the alternative approaches that will be used for comparison. Namely, we consider the regularized linear approach from [3], referred to as RLR, and a Neural Network with Semantic Based Regularization [8], referred to as SBR. Both approaches are based on the idea of introducing constraints as regularizers at training time. Hence, their loss function is in the form:

  L(f(X; \theta), y^*) + \mu \, g(f(X; \theta))        (23)

The regularization term must be differentiable. We use SBR only for the case studies with the balance constraint, which we are forced to approximate to obtain differentiability:

  g(f(X; \theta)) = \max_{j=1..c} \sum_{i=1}^{m} f(X; \theta)_{ij}        (24)

i.e. we use the sums of the NN output neurons to approximate the class counts and the maximum as a penalty; this proved superior to other attempts in preliminary tests. The L term is the categorical cross-entropy.

Our SBR approach relies on the NN model from the previous paragraphs. Since access to the network structure is needed to differentiate the regularizer, SBR works best when all the examples linked by relational constraints can be included in the same batch. When this is not viable, the regularizer can be treated stochastically (via subsets of examples), at the cost of one additional approximation. We use a batch size of 2,048 as a compromise between memory usage and noise. The SBR method is trained for 1,600 epochs.

The RLR approach relies on linear models (Logistic or Linear Regression), which are simple enough to consider large groups of examples simultaneously. We use this approach for the fairness use cases. In the crime (regression) dataset, L is the MSE and the regularizer is simply Equation (17). In the adult (classification) dataset, L is the cross-entropy; the regularizer is Equation (10), with the following substitution:

  d_{k,v,j} = \Big| \frac{1}{m} \sum_{i=1}^{m} \theta^\top x_i - \frac{1}{|X_{k,v}|} \sum_{i \in X_{k,v}} \theta^\top x_i \Big|

This is an approximation obtained according to [3] by disregarding the sigmoid in the Logistic Regressor to preserve convexity. We train this approach to convergence using the CVXPY 1.1 library (with default configuration). In RLR and SBR classification, the introduced approximations permit to satisfy the constraints by having equal output for all classes, i.e. completely random predictions. This undesirable behavior is countered by the L term.
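For the balance constraint, the SBR loss from Equations (23)-(24) can be sketched as a custom Keras loss, with the batch-level output sums standing in for the class counts (subject to the batching caveats discussed above):

    import tensorflow as tf

    def sbr_loss(mu):
        # Cross-entropy plus mu * g, with g = max_j sum_i f(X; theta)_ij as
        # in Equations (23)-(24); the sum runs over the current mini-batch.
        ce = tf.keras.losses.CategoricalCrossentropy()
        def loss(y_true, y_pred):
            class_counts = tf.reduce_sum(y_pred, axis=0)  # approximate counts
            return ce(y_true, y_pred) + mu * tf.reduce_max(class_counts)
        return loss

    # model.compile(optimizer="rmsprop", loss=sbr_loss(mu=0.1))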
There is no simple recipe for choosing the value of μ in Equation (23); therefore, we performed experiments with different μ values to characterize its effect. The results are reported in Table 3. In most cases, larger μ values tend as expected to result in better constraint satisfaction, with a few notable exceptions for classification tasks (iris, dota2, and adult). The issue is likely due to the approximations introduced in the regularizers, since it arises even on small datasets that fit in a single mini-batch (iris). Further analysis will be needed to confirm this intuition. The accuracy decreases for a larger μ, as expected, but at a rather rapid pace. In the subsequent experiments, we will use for each dataset the RLR and SBR configurations that offer the best accuracy while being as close to feasible as possible (the "ideal case" column from Table 2): these are the cells in bold font in Table 3.

Table 3. Effect of parameter μ in regularization.

  Dataset   |   | μ=0.01 | μ=0.1 | μ=1
  Iris      | S | 0.984  | 0.97  | 0.4
            | C | 0.006  | 0.03  | 0.13
  Redwine   | S | 0.15   | 0.15  | 0.17
            | C | 0.12   | 0.12  | 0.04
  Whitewine | S | 0.17   | 0.15  | 0.14
            | C | 0.08   | 0.02  | 0.03
  Shuttle   | S | 0.7    | 0.31  | 0.14
            | C | 0.14   | 0.03  | 0.03
  Dota2     | S | 0.61   | 0.48  | 0.49
            | C | 0.2    | 0.5   | 0.5
  Adult     | S | .83    | .75   | .75
            | C | 1.6    | 3.1   | 3
  Crime     | S | .39    | 0.30  | 0.30
            | C | .4     | 0     | 0

Alternative Approaches and ML Models. We can now compare the performance of Moving Targets using different ML models with that of the alternative approaches presented above (with μ = 0.1), plus a pre-processing approach adapted from [13], referred to as NNpp and obtained by setting α, β → ∞ in our method.

For our method, we consider the following ML models: 1) the NN from the previous section with α = 1, β = 0.1; 2a) a Random Forest Classifier with 50 estimators and maximum depth of 5 (used for all classification case studies); 2b) a Gradient Boosted Trees model, with 50 estimators, maximum depth 4, and a minimum threshold of 5 samples per leaf (for the regression case study); 3a) a Logistic Regression model (for classification); 3b) a Linear Regression model (for regression). All models except the NN are implemented using scikit-learn [23]. In the table, the tree ensemble methods are reported on a single column, while another column (LR) groups Logistic and Linear Regression.
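For reference, a sketch of how the scikit-learn models listed above can be instantiated (hyperparameters as stated; anything unspecified is left at its scikit-learn default):

    from sklearn.ensemble import (GradientBoostingRegressor,
                                  RandomForestClassifier)
    from sklearn.linear_model import LinearRegression, LogisticRegression

    rf = RandomForestClassifier(n_estimators=50, max_depth=5)  # classification
    gb = GradientBoostingRegressor(n_estimators=50, max_depth=4,
                                   min_samples_leaf=5)         # regression
    logreg = LogisticRegression()                              # "LR" column
    linreg = LinearRegression()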
Table 4. Benchmarks with different ML models and alternative approaches.

  Dataset   |   | Regularized methods | NN           | LR           | Ensemble trees | NNpp
  Iris      | S | .984 ± .006         | .997 ± .004  | .96 ± .02    | .995 ± .004    | .96 ± .01
            | C | .006* ± .003        | .013* ± .004 | .014* ± .005 | .012* ± .003   | .014* ± .005
  Redwine   | S | .17 ± .05           | .506 ± .006  | .32 ± .01    | .40 ± .02      | .480 ± .001
            | C | .05 ± .01           | .018* ± .001 | .031 ± .005  | .08 ± .01      | .073 ± .006
  Whitewine | S | .15 ± .03           | .439 ± .009  | .025 ± .009  | .37 ± .04      | .47 ± .02
            | C | .02* ± .004         | .015* ± .003 | .027 ± .003  | .10 ± .01      | .084 ± .009
  Shuttle   | S | .31 ± .04           | .375 ± .007  | .332 ± .007  | .51 ± .05      | .5 ± .1
            | C | .03* ± .02          | .028 ± .005  | .023* ± .007 | .11 ± .01      | .2 ± .02
  Dota2     | S | .61 ± .02           | .66 ± .01    | .592 ± .005  | .53 ± .01      | .689 ± .003
            | C | .2* ± .1            | .09 ± .06    | .038 ± 0     | .16 ± .04      | .02* ± .02
  Adult     | S | .834 ± .001         | .841 ± .006  | .805 ± .006  | .76 ± .01      | .865 ± .003
            | C | 1.7 ± .05           | .22* ± .07   | .20* ± .03   | .01* ± .03     | .081* ± .09
  Crime     | S | .30 ± .01           | .48 ± .03    | .369 ± .008  | .49 ± .01      | .484 ± .008
            | C | 0 ± 0               | .2* ± .1     | .02* ± 0     | .24 ± .01      | .19* ± .02

Our algorithm seems to work well with all the considered ML models: tree ensembles and the NN have generally better constraint satisfaction (and higher accuracy for a given level of constraint satisfaction) than linear models, thanks to their larger variance. The preprocessing approach is effective when constraints are easy to satisfy (iris and dota2) and on all the fairness case studies, though less so on the remaining datasets. All Moving Targets approaches tend to perform better and more reliably than RLR and SBR. The case of RLR and LR is especially interesting, as they differ only for the mechanism used to enforce constraint satisfaction. In principle we can expect the two models to have zero gap, for the regularization term has the same formulation and the problem is convex. The gap is then due to an incomplete exploration of the space of the multiplier μ, since for the same multiplier value the gap is null under optimal training (which is the standard for linear regression). The example emphasizes a practical problem that often arises when dealing with regularized loss functions: the value of the multiplier has to be thoroughly calibrated by hand, while Moving Targets allows to directly define the desired constraint threshold and obtain the best solution associated to it.

Generalization. Since our main contribution is an optimization algorithm, we have focused so far on evaluating its performance on the training data, as it simplifies its analysis. Now that the properties of our method are clearer, we can assess the performance of the models we trained on the test data. The results of this evaluation are reported in Table 5, in the form of average ratio between the scores and the level of constraint satisfaction in the test and the train data. With a few exceptions (e.g. satisfiability in iris), the models (especially the tree ensembles and LR) generalize well in terms of both accuracy and constraint satisfaction. Given the tightness of some of the original constraints and the degree to which the labels were altered, this is a remarkable result.

Table 5. Generalization of various models in the test scenario.

  Dataset   |         | NN   | Ensemble Trees | LR
  Iris      | Sts/Str | 0.96 | 0.96           | 0.99
            | Cts/Ctr | 5.68 | 5.17           | 4.31
  Redwine   | Sts/Str | 0.62 | 0.92           | 0.94
            | Cts/Ctr | 1.22 | 1.04           | 1.35
  Whitewine | Sts/Str | 0.70 | 0.96           | 1.00
            | Cts/Ctr | 1.11 | 1.00           | 0.99
  Shuttle   | Sts/Str | 0.99 | 1.00           | 0.99
            | Cts/Ctr | 0.97 | 1.00           | 1.01
  Dota2     | Sts/Str | 0.83 | 1.00           | 0.99
            | Cts/Ctr | 1.10 | 1.00           | 1.03
  Adult     | Sts/Str | 0.99 | 1.00           | 1.00
            | Cts/Ctr | 1.55 | 1.92           | 0.98
  Crime     | Sts/Str | 0.75 | 0.73           | 0.93
            | Cts/Ctr | 0.74 | 1.05           | 1.03

5 Conclusion

In this paper we have introduced Moving Targets, a decomposition approach to augment a generic supervised learning algorithm with constraints, by iteratively adjusting the example labels. The method is designed to prioritize constraint satisfaction over accuracy, and proved to behave very well on a selection of tasks, constraints, and datasets. Its relative simplicity, reasonable scalability, and the ability to handle any classical ML model make it well suited for use in real-world settings.

Many open questions remain: we highlighted limitations of regularization-based techniques that deserve a much deeper analysis. The convergence properties of our method still need to be formally characterized (even in simpler, convex, scenarios). The method scalability should be tested on larger datasets, for which using approximate master steps will be necessary. Given the good performance of the pre-processing approach in Table 4, it may be interesting to skip the pretraining step in our method. Improvements may be possible by using specialized algorithms in specific settings: Douglas-Rachford splits could be applied in a numeric setting, or probabilistic predictions could be employed (when available) to refine the projection operators.

Acknowledgement

This research has been partially funded by the H2020 Project AI4EU, grant agreement 825619.

REFERENCES

[1] Sina Aghaei, Mohammad Javad Azizi, and Phebe Vayanos, 'Learning optimal and fair decision trees for non-discriminative decision-making', in Proceedings of AAAI, IAAI, EAAAI, pp. 1418–1426, (2019).
[2] Francisco J. Aragón Artacho and Rubén Campoy, 'A new projection method for finding the closest point in the intersection of convex sets', Computational Optimization and Applications, 69(1), 99–132, (2018).
[3] Richard Berk, Hoda Heidari, Shahin Jabbari, Matthew Joseph, Michael J. Kearns, Jamie Morgenstern, Seth Neel, and Aaron Roth, 'A convex framework for fair regression', CoRR, abs/1706.02409, (2017).
[4] Stephen Boyd, Jon Dattorro, et al., 'Alternating projections', EE392o, Stanford University, (2003).
[5] Toon Calders and Sicco Verwer, 'Three naive Bayes approaches for discrimination-free classification', Data Mining and Knowledge Discovery, 21(2), 277–292, (2010).
[6] Alessandro Daniele and Luciano Serafini, 'Knowledge enhanced neural networks', in Proceedings of PRICAI 2019, pp. 542–554, (2019).
[7] Michelangelo Diligenti, Marco Gori, and Claudio Saccà, 'Semantic-based regularization for learning and inference', Artificial Intelligence, 244, 143–165, (2017).
[8] Michelangelo Diligenti, Marco Gori, and Claudio Saccà, 'Semantic-based regularization for learning and inference', Artificial Intelligence, 244, 143–165, (2017). Combining Constraint Solving with Mining and Learning.
[9] Dheeru Dua and Casey Graff, 'UCI Machine Learning Repository', (2017).
[10] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel, 'Fairness through awareness', in Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pp. 214–226, (2012).
[11] Benjamin Fish, Jeremy Kun, and Ádám D. Lelkes, 'A confidence-based approach for balancing fairness and accuracy', in Proceedings of the 2016 SIAM International Conference on Data Mining, pp. 144–152. SIAM, (2016).
[12] Moritz Hardt, Eric Price, and Nati Srebro, 'Equality of opportunity in supervised learning', in Advances in Neural Information Processing Systems, pp. 3315–3323, (2016).
[13] Faisal Kamiran and Toon Calders, 'Classifying without discriminating', in 2009 2nd International Conference on Computer, Control and Communication, pp. 1–6. IEEE, (2009).
[14] Faisal Kamiran and Toon Calders, 'Data preprocessing techniques for classification without discrimination', Knowledge and Information Systems, 33(1), 1–33, (2011).
[15] Faisal Kamiran, Toon Calders, and Mykola Pechenizkiy, 'Discrimination aware decision tree learning', in 2010 IEEE International Conference on Data Mining, pp. 869–874. IEEE, (2010).
[16] Guosheng Lin, Chunhua Shen, Anton Van Den Hengel, and Ian Reid, 'Efficient piecewise training of deep structured models for semantic segmentation', in Proceedings of the IEEE CVPR, pp. 3194–3203, (2016).
[17] Marco Lippi and Paolo Frasconi, 'Prediction of protein β-residue contacts by Markov logic networks with grounding-specific weights', Bioinformatics, 25(18), 2326–2333, (2009).
[18] Binh Thanh Luong, Salvatore Ruggieri, and Franco Turini, 'k-NN as an implementation of situation testing for discrimination discovery and prevention', in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 502–510, (2011).
[19] Xuezhe Ma and Eduard Hovy, 'End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF', in Proceedings of ACL, pp. 1064–1074. Association for Computational Linguistics, (2016).
[20] Robin Manhaeve, Sebastijan Dumančić, Angelika Kimmig, Thomas Demeester, and Luc De Raedt, 'DeepProbLog: Neural probabilistic logic programming', arXiv preprint arXiv:1805.10872, (2018).
[21] Giuseppe Marra, Francesco Giannini, Michelangelo Diligenti, and Marco Gori, 'Integrating learning and reasoning with deep logic models', in Proceedings of ECML, (2019).
[22] Neal Parikh, Stephen Boyd, et al., 'Proximal algorithms', Foundations and Trends in Optimization, 1(3), 127–239, (2014).
[23] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, 'Scikit-learn: Machine learning in Python', Journal of Machine Learning Research, 12, 2825–2830, (2011).
[24] Matthew Richardson and Pedro Domingos, 'Markov logic networks', Machine Learning, 62(1-2), 107–136, (2006).
[25] Tim Rocktäschel and Sebastian Riedel, 'End-to-end differentiable proving', in Advances in Neural Information Processing Systems, pp. 3788–3800, (2017).
[26] Luciano Serafini and Artur d'Avila Garcez, 'Logic tensor networks: Deep learning and logical reasoning from data and knowledge', arXiv preprint arXiv:1606.04422, (2016).
[27] Emile van Krieken, Erman Acar, and Frank van Harmelen, 'Semi-supervised learning using differentiable reasoning', Journal of Applied Logic, (2019). To appear.
[28] Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork, 'Learning fair representations', in International Conference on Machine Learning, pp. 325–333, (2013).