          Injecting Domain Knowledge in Neural Networks:
         a Controlled Experiment on a Constrained Problem
Mattia Silvestri, Michele Lombardi and Michela Milano
University of Bologna, Italy
mattia.silvestri4@unibo.it, michele.lombardi2@unibo.it, michela.milano@unibo.it

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


Abstract.  Given enough data, Deep Neural Networks (DNNs) are capable of learning complex input-output relations with high accuracy. In several domains, however, data is scarce or expensive to retrieve, while a substantial amount of expert knowledge is available. It seems reasonable that if we can inject this additional information in the DNN, we could ease the learning process. One such case is that of Constraint Problems, for which declarative approaches exist and pure ML solutions have obtained mixed success. Using a classical constrained problem as a case study, we perform controlled experiments to probe the impact of progressively adding domain and empirical knowledge in the DNN. Our results are very encouraging, showing that (at least in our setup) embedding domain knowledge at training time can have a considerable effect, and that a small amount of empirical knowledge is sufficient to obtain practically useful results.

1 Introduction

Given enough data, Deep Neural Networks are capable of learning complex input-output relations with high accuracy. In many domains, however, there also exists a substantial degree of expert knowledge: it seems reasonable that if we can inject this additional information in the DNN, we could ease the learning process. Indeed, methods for hybridizing learning and reasoning (or for taking constraints into account at training time) can accelerate convergence or improve accuracy, especially when supervised data is scarce.

In this paper we aim at characterizing this trade-off between implicit knowledge (derived from data) and explicit knowledge (supplied by experts), via a set of controlled experiments. For this purpose, we use a setting that is both scientifically rigorous and practically relevant: that of constrained problems.

Constrained problems involve assigning values to a set of variables, subject to a number of constraints, and possibly with the goal of minimizing a cost metric. Depending on the absence or presence of a cost function, they are formally known as Constraint Satisfaction Problems (CSPs) or Constraint Optimization Problems (COPs).

Constrained problems are classically modeled by domain experts in a fully declarative fashion: however, such models can be hard to design, may rely on simplistic and unquantifiable approximations, and may fail to take into account constraints (or preferences) that are not known to the expert, despite being satisfied in historical solutions. Data-driven methods for constrained problems offer a potential solution for some of these issues, but they may have trouble maintaining feasibility, and they struggle with the (very) limited number of past solutions available for practical use cases.

We use as a benchmark the Partial Latin Square (PLS) completion problem, which requires completing a partially filled n × n square with values in {1..n}, such that no value appears twice on any row or column. Despite its simplicity, the PLS is NP-hard, unless we start from an empty square; the problem has practical applications (e.g. in optical fiber routing), and serves as the basis for more complex problems (e.g. timetabling). Using a classical constrained problem as a case study grants access to reliable domain knowledge (the declarative formulation), and facilitates the generation of empirical data (problem solutions). This combination enables controlled experiments that are difficult to perform on more traditional datasets.

We train a problem-agnostic, data-driven solution approach on a pool of solutions, and we inject domain knowledge (constraints) both at training time and at solution generation time. We then adjust the amount of initial data (empirical knowledge) and of injected constraints (domain knowledge), and assess the ability of the approach to yield feasible solutions. Our results are very encouraging, showing that (at least in our setup) embedding domain knowledge in a data-driven approach can have a considerable effect, and that a small amount of empirical knowledge is sufficient to obtain practically useful results.

As a byproduct of our analysis, we develop general techniques for taking constraints into account in data-driven methods for decision problems, based on easily accessible methods from the Constraint Programming and Machine Learning domains. While such techniques are originally designed for problems with discrete decisions, they should be adaptable to numeric decisions as well. Hence, although our focus remains on a scientific investigation, we also regard this paper as a relevant step towards practical applicability for some data-driven solution methods for constrained problems. We also think that integrating data-driven methods and Constraint Programming will help to develop more human-centered AI systems. Nowadays, the wide adoption of AI in everyday life has introduced the need for methods that are more focused on the requirements of human operators. A practical scenario is that of constraints that allow an autonomous system to guarantee fairness and to meet security guidelines during the decision process. Despite their great success, DNNs, being a purely data-driven approach, may fail to correctly infer and generalize fairness or security constraints, because such constraints may be only partially satisfied in the data. At the same time, declarative models can help to overcome this limitation of data-driven methods. This motivates the interest in developing hybrid approaches that can exploit the advantages of both data-driven and declarative methods.

The paper is organized as follows: Section 2 briefly surveys the related literature and motivates the choice of our baseline techniques; Section 3 discusses the details of the problem and methods we use; Section 4 presents the results of our analysis, while Section 5 provides concluding remarks.
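Before moving on, the benchmark can be made concrete with a minimal declarative model. The sketch below is our own illustration, not code from the paper: it uses the CP-SAT solver of Google OR-Tools (the experiments in Section 4 rely on the or-tools CP solver instead), and the `givens` dictionary for the pre-filled cells is a hypothetical input format.

  # Illustrative sketch (not the authors' code): a declarative PLS model.
  from ortools.sat.python import cp_model

  def pls_model(n, givens):
      model = cp_model.CpModel()
      # One variable per cell, with domain {1..n}.
      x = [[model.NewIntVar(1, n, f'x_{i}_{j}') for j in range(n)]
           for i in range(n)]
      for i in range(n):
          model.AddAllDifferent(x[i])                         # row constraint
          model.AddAllDifferent([x[r][i] for r in range(n)])  # column constraint
      for (i, j), v in givens.items():
          model.Add(x[i][j] == v)                             # pre-filled cells
      return model, x

  # Completing the square amounts to finding any feasible assignment:
  model, x = pls_model(10, {(0, 0): 1, (0, 1): 2})
  status = cp_model.CpSolver().Solve(model)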
2 Related Works and Baseline Choice

The analysis that we aim to perform requires 1) a data-driven technique that can solve a constrained problem, with no access to its structure; moreover, we need 2) methodologies for injecting domain knowledge in such a system, both at training time and after training. In this section, we briefly survey methods available in the literature for such tasks and we motivate our selection of techniques.

Neural Networks for Solving Constrained Problems.  The integration of Machine Learning methods in the solution of constrained problems is an active research topic, which has recently been surveyed in [3]. Many such approaches consider how ML can improve specific steps of the solution process: here, however, we are interested in methods that use learning to replace (entirely or in part) the modeling activity itself. These include Constraint Acquisition approaches (e.g. [4]), which attempt to learn a declarative problem description from feasible/infeasible variable assignments. These approaches may however have trouble dealing with implicit knowledge (e.g. preferences) that cannot be easily stated in a well-defined constraint language. Techniques for encoding Machine Learning models in constrained problems (e.g. [8, 12, 22, 16]) are capable of integrating empirical and domain knowledge, but not at training time; additionally, they require knowing a priori which variables are involved in the constraints to be learned.

Some approaches (e.g. [1, 5]) rely on carefully structured Hopfield Networks to solve constrained problems, but designing these networks (or their training algorithms) requires full problem knowledge. Recently, Reinforcement Learning combined with Pointer Networks [2] or Attention [10] has been used for solving specific classes of constrained problems, with some measure of success. These approaches also require a high degree of problem knowledge to generate the reward signal, and to some degree for the network design. The method from [23] applies Neural Networks to predict the feasibility of a binary CSP, with a very high degree of accuracy; the prediction is however based on a representation of the allowed variable-value pairs, and hence requires explicit information about the problem.

To the best of the authors' knowledge, the only example of a problem-agnostic, data-driven approach for the solution of constrained problems is the one in [9]. Here, a Neural Network is used to learn how to extend a partial variable assignment so as to retain feasibility. Despite its limited practical effectiveness, this method shares the best properties of Constraint Acquisition (no explicit problem information), without being restricted to constraints expressed in a classical declarative language. We therefore chose this approach as our baseline.

Domain Knowledge in Neural Networks.  There are several approaches for incorporating external knowledge in Neural Networks, none of which has been applied so far to constrained decision problems. One method to take domain knowledge into account at training time is so-called Semantic Based Regularization [6], which is based on the idea of converting constraints into regularizing terms in the loss function used by a gradient-descent algorithm. Other techniques include Logic Tensor Networks (LTNs) [20], which replace the entire loss function with a fuzzy formula defined on logical predicates. LTNs are connected to Differentiable Reasoning [21], which uses relational background knowledge to benefit from unlabeled data. Domain knowledge has also been introduced in differentiable Machine Learning (in particular Deep Networks) by properly designing their structure, rather than the loss function: examples include Deep Structured Models (see e.g. [11] and [13], the latter integrating deep learning with Conditional Random Fields).

Integration of external knowledge in Neural Networks after training is considered for example in DeepProbLog [14], where DNNs with probabilistic output (classifiers in particular) are treated as predicates. Markov Logic Networks achieve a similar result via the use of Markov Fields defined over First Order Logic formulas [17], which may be defined via probabilistic ML models. [18] presents a Neural Theorem Prover using differentiable predicates and the Prolog backward chaining algorithm.

Recent works such as [15] are capable of integrating probabilistic reasoning and Neural Networks both during and after training. Even more general is the Differentiable Inductive Logic approach [7], which proposes a framework that can solve ML tasks typical of traditional Inductive Logic Programming systems, but is also robust to noise and errors in the training data.

We use a method loosely based on Semantic Based Regularization for injecting knowledge at training time, as it offers a good compromise between flexibility and simplicity. For injecting knowledge after training, we rely on a very simple method well suited to constrained problems (i.e. using constraint propagation to adjust the output of the ML model).

3 Basic Methods

We reimplement the approach from [9] and extend it via a number of techniques, described in this section together with our evaluation procedure. The actual setup and results of our experiments are presented in Section 4.

Neural Network for the Data-Driven Approach.  The baseline approach is based on training a Neural Network to extend a partial assignment (also called a partial solution) by making one additional assignment, so as to preserve feasibility. Formally, the network is a function:

  f : {0, 1}^m → [0, 1]^m    (1)

whose input and output are m-dimensional vectors. Each element in the vectors is associated to a variable-value pair ⟨z_j, v_j⟩, where z_j is the associated variable and v_j is the associated value. The network input represents partial assignments, assuming that x_j = 1 iff z_j = v_j. Each component f_j(x) of the output is proportional to the probability that the pair ⟨z_j, v_j⟩ is chosen for the next assignment. This is achieved in practice by using an output layer with m neurons with a sigmoid activation function.
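For concreteness, here is a minimal sketch of such a network, written by us in Keras (TensorFlow 2) under the configuration later described in Section 4 (m = 1,000 for the 10 × 10 PLS, three hidden layers of 512 ReLU units); it illustrates the architecture described here, but it is not the authors' code.

  import tensorflow as tf

  def build_network(m):
      # Input: binary encoding of a partial assignment (x_j = 1 iff z_j = v_j).
      # Output: one sigmoid unit per variable-value pair; f_j(x) scores the
      # probability that pair <z_j, v_j> is chosen for the next assignment.
      return tf.keras.Sequential([
          tf.keras.layers.Dense(512, activation='relu', input_shape=(m,)),
          tf.keras.layers.Dense(512, activation='relu'),
          tf.keras.layers.Dense(512, activation='relu'),
          tf.keras.layers.Dense(m, activation='sigmoid'),
      ])

  model = build_network(1000)  # 10 x 10 PLS: 100 cells x 10 values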
Dataset Generation Process.  The input of each training example corresponds to a partial solution x, and the output to a single variable-value assignment (represented as a vector y using a one-hot encoding). The training set is constructed by repeatedly calling the randomized deconstruction procedure of Algorithm 1 on an initial set of full solutions (referred to as the solution pool). Each call generates a number of examples that are used to populate a dataset. At the end of the process we discard multiple copies of identical examples. Two examples may have the same input, but different outputs, since a single partial assignment may have multiple viable completions.

Algorithm 1 DECONSTRUCT(x)
  D = ∅
  while ‖x‖₁ > 0 do
    Let y = 0  # zero vector
    Select a random index i s.t. x_i = 1
    Set x_i = 0, set y_i = 1
    Add the pair ⟨x, y⟩ to D
  return D

Unlike [9], here we sometimes perform multiple calls to Algorithm 1 for the same starting solution. This simple approach makes it possible to investigate independently the effect of the training set size (which depends on the number of examples) and of the amount of actual empirical knowledge (which depends on the size of the solution pool). The method also makes it possible to generate large training sets starting from a very small number of historical solutions.

Training and Knowledge Injection.  The basic training for the NN is the same as for neural classifiers. Since the target output can be assimilated to a class, we process the network output through a softmax operator, and then use the categorical crossentropy H as a loss function. Additionally, we inject domain knowledge at training time via an approach that combines ideas from Semantic Based Regularization (SBR) and Constraint Programming (CP, [19]).

In CP, constraints are associated with algorithms called propagators that can identify provably infeasible values in the domains of the variables. Propagators are generally incomplete, i.e. there is no guarantee they will find all infeasible values. Given a constraint (or a collection of constraints) C, here we will treat its propagator as a multivariate function such that C_j(x) = 1 iff the assignment z_j = v_j has not been marked as infeasible by the propagator, while C_j(x) = 0 otherwise. We then augment the loss function with an SBR-inspired term that discourages provably infeasible pairs, and encourages the remaining ones. Given an example ⟨x, y⟩, we have:

  L_sbr(x) = Σ_{j=0}^{m−1} (C_j(x) − f_j(x))²    (2)

i.e. output on a neuron corresponding to a provably infeasible pair incurs a penalty that grows with the square of f_j(x), while output on the remaining pairs incurs a penalty that grows with the square of 1 − f_j(x), so that high outputs on non-pruned pairs are encouraged. Our full loss is hence given by:

  H((1/Z) f(x), y) + λ L_sbr(x)    (3)

where Z is the partition function and the scalar λ controls the balance between the crossentropy term H and the SBR term. The method can be applied for all known propagators with discrete variables. With some adaptations, it can be made to work for important classes of numerical propagators (e.g. those that enforce Bound Consistency [19]).

Since propagators are incomplete and we are rewarding assignments not marked as infeasible, there is a chance that our SBR term injects incorrect information in the model. This reward mechanism is however crucial to ensure a non-negligible gradient at training time; the incorrect information is balanced by the presence of the crossentropy term, which encourages assignments that are guaranteed feasible (since they originate from full problem solutions).
Evaluation and Knowledge Injection.  We evaluate the approach via a specialized procedure, relying on a randomized solution function for PLS instances. This has signature SOLVE(x, C, h), where x is the starting partial assignment, C is the considered (sub)set of problem constraints, and h is a probability estimator for variable-value pairs (e.g. our trained NN). The function is implemented via the Google or-tools constraint solver, and is based on Depth First Search: at each search node, the solver applies constraint propagation to prune some infeasible values, chooses for branching the first unassigned variable in a static order, then assigns a value chosen at random with probabilities proportional to h(x′), where x′ is the current state of assignments. The SOLVE function returns either a solution, or ⊥ in case of infeasibility.

Our evaluation method tests the ability of the NN to identify individual assignments that are globally feasible, i.e. that can be extended into full solutions. This is done via Algorithm 2, which 1) starts from a given partial solution; 2) relies on a constraint propagator C to discard some of the provably infeasible assignments; 3) uses the NN to make a single (deterministic) assignment; 4) attempts to complete it into a full solution (taking into account all problem constraints, i.e. C_pls). Replacing the NN with a uniform probability estimator yields a baseline. We repeat the process on all partial solutions from a test set, and collect statistics. This approach is identical to one of those in [9], with one major difference, i.e. the use of a constraint propagator for "correcting" the output of the probability estimator. This enables the injection of (incomplete) knowledge at solution construction time, while the original behavior can be recovered by using an empty set of propagators.

Algorithm 2 FEASTEST(X, C, h)
  J* = arg max{h_j(x) | C_j(x) = 1}  # Most likely assignments
  Pick j* uniformly at random from J*
  Set x_{j*} = 1
  if SOLVE(x, C_pls, h_rnd) ≠ ⊥ then
    return 1  # Globally feasible
  else
    return 0  # Globally infeasible

Unlike in typical Machine Learning evaluations, we do not measure the (extremely low) network accuracy: in fact, the accuracy metric in our case is tied to the network's ability to replicate the same sequence of assignments observed at training time, which has almost no practical value.

4 Empirical Analysis

In this section we discuss our experimental analysis, which is designed around some key questions of both scientific and practical import. We focus on the following aspects:

Q1: Does injecting knowledge at training time improve the network's ability to identify feasible assignments?
Q2: What is the effect of injecting domain knowledge after, rather than during, training?
Q3: What is the effect of adjusting the amount of available empirical knowledge?

Taking advantage of our controlled use case, we present a series of experiments that investigate these research directions. Details about the rationale and the setup of each experiment are reported in dedicated sections, but some common configuration can be immediately described.

We focus on 10 × 10 PLS instances, resulting in input and output vectors with 1,000 elements. We use a feed-forward, fully-connected Neural Network with three hidden layers, each with 512 units having a ReLU activation function. This setup is considerably simpler than the one used in the approach we chose as our baseline, but manages to reach very similar results. We employ the Adam optimizer from Keras-TensorFlow 2.0, with default parameters. The number of training epochs and the batch size depend on the experiment.

4.1 Domain Knowledge at Training Time

We start with an experiment designed to address Question 1, i.e. whether injecting domain knowledge at training time may help the NN in the identification of feasible assignments. This is practically motivated by situations in which a domain expert has only partial information about the problem structure, but a pool of historical solutions is available.

For this experiment, the training set for the network is generated using the deconstruction approach from Section 3, starting from a set of 10,000 PLS solutions. Each solution is deconstructed exactly once, yielding a training set of ∼730,000 examples, 25% of which are then removed to form a separate test set. We use mini-batches of 50,000 examples and we stop training after 1,000 epochs.

We compare four approaches: a random approach (referred to as RND), which treats all possible variable-value pairs as equally likely; a model-agnostic neural network (referred to as AGN); a network trained with knowledge about row constraints (referred to as ROWS); and a network trained with knowledge about row and column constraints (referred to as FULL).

The RND approach lacks even the basic knowledge that a variable cannot be assigned twice, since this is not enforced by our input/output encoding. The same holds for AGN, which can however infer such a constraint from the data. Conversely, in ROWS we use our SBR-inspired method (and a Forward Checking propagator) to inject the knowledge that both assigning a variable twice and assigning a value twice on the same row are forbidden. For the FULL approach we do the same, applying the Forward Checking propagator also to the column constraints (i.e. no value can appear twice on the same column). Due to the use of an incomplete propagator, both ROWS and FULL make use of incomplete knowledge. We have empirically determined that λ = 1 for FULL and λ = 0.01 for ROWS work reasonably well.

We evaluate the resulting approaches via the FEASTEST procedure, using the (separated) test set as X, the trained networks (or the uniformly random estimator) as h, and an empty set of constraints (i.e. no propagation at test time). We then produce "feasibility plots" that report on the x-axis the number of assigned variables (filled cells) in the considered partial solutions, and on the y-axis the ratio of suggested assignments that are globally feasible.

[Figure 1. Knowledge Injection at Training Time]

The results of the evaluation are shown in Figure 1. Without propagating any constraint at evaluation time, a purely random choice is extremely unlikely to result in globally feasible assignments: this is expected and only serves as a pessimistic baseline. The AGN approach, relying exclusively on empirical knowledge, behaves considerably better, with high feasibility ratios for almost empty and almost full squares, and a noticeable drop when ∼60% of the square is filled. This trend is a common feature of many of the approaches, and may be connected to a known phase transition in the complexity of this combinatorial problem. Injecting domain knowledge at training time improves the feasibility ratio by a noticeable margin: half of the improvement is observed when moving from AGN to ROWS, suggesting that even partial knowledge about the problem structure can prove very useful.

4.2 Domain Knowledge at Evaluation Time

Our second experiment addresses Question 2, regarding the effect of knowledge injection at test time. This topic relates to scenarios where the expert-supplied information can be incorporated in a solution method (e.g. as a constraint in a declarative model). While not always viable, this situation is frequent enough to deserve a dedicated analysis.

For this experiment, we rely on the same training and test sets as in Section 4.1, and compare the same approaches. As a main difference, we take into account some or all of the problem constraints at evaluation time, by passing a non-empty set C of propagators to the FEASTEST procedure. Also at test time, the constraints are taken into account by means of an incomplete propagator (Forward Checking), and hence all approaches rely on incomplete knowledge.
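The mechanism at work here is easy to sketch. The following snippet (ours, not the authors' code) mirrors the first steps of Algorithm 2: the propagator output masks the estimator scores before the next assignment is chosen.

  import numpy as np

  def masked_choice(h_x, c_mask):
      # Discard pairs pruned by the propagator (C_j(x) = 0), then pick one
      # of the most likely surviving assignments, ties broken at random.
      scores = np.where(c_mask == 1, h_x, -np.inf)
      best = np.flatnonzero(scores == scores.max())
      return np.random.choice(best)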
[Figure 2. Knowledge Injection at Evaluation Time (ROWS)]

[Figure 3. Knowledge Injection at Evaluation Time (FULL)]

The results of this evaluation are presented in Figure 2 (for row constraint propagation at test time) and Figure 3 (for all problem constraints). The RND results in this case are representative of the behavior (at each search node) of a Constraint Programming solver having access to either only the row constraints or the full problem definition: since CP is known to work very well on the PLS, it is expected that the performance of the random approach is significantly boosted by knowledge injection at test time.

All approaches relying either purely (AGN) or partially (ROWS and FULL) on empirical knowledge gain almost no benefit from injecting constraints at evaluation time, though they still perform noticeably better than RND. On one hand, these diminishing returns should be kept in mind whenever enforcing constraints during solution generation is viable. On the other hand, the fact that none of the knowledge-driven approaches is helped by constraint propagation suggests that their advantage comes from information about global feasibility, which they can access from the empirical data.

4.3 Training Set Size and Empirical Information

Next, we proceed to tackle Question 3, by acting on the training set generation process. In classical Machine Learning approaches, the amount of available information is usually measured via the training set size: this is a reasonable approach, since the number of training examples has a strong impact on the ability of an ML method to learn and generalize.

[Figure 4. Effect of Training Set Size (300k examples)]

[Figure 5. Effect of Training Set Size (50k examples)]

We performed experiments to probe the effect of the training set size on the performance of the data-driven approaches. Figure 4 and Figure 5 report results for training sets with respectively 300,000 and 50,000 examples. Knowledge injection at training time has in this case a dramatic effect: while the AGN approach is very sensitive to the available number of examples, the FULL one has only a minor drop in performance when moving from ∼730,000 examples to 50,000 examples. This confirms previous experiences with techniques such as Semantic Based Regularization, although the effect in our case is much more pronounced: the gap is likely due to the fact that, while our crossentropy term in the loss function provides information about a single (globally) feasible assignment, the SBR term provides information about a large number of (locally) feasible assignments.

In our setup, we also have the possibility to apply the deconstruction process multiple times, so that the number of different examples that can be obtained from a single solution grows with the number of possible permutations of the variable indices (i.e. O(n²!) for the PLS). This observation opens up the possibility to generate large training sets from a very small number of starting solutions: this is scientifically interesting, since the "actual" empirical information depends on how many solutions are available; it is also very useful in practice, since in many practical applications only a relatively small number of historical solutions exist.

[Figure 6. Effect of Solution Pool Size (1k solutions)]

[Figure 7. Effect of Solution Pool Size (100 solutions)]

The results of this evaluation are shown in Figure 6 and Figure 7, for solution pools having respectively 1,000 and 100 elements. In both cases the size of the generated training set is comparable to the original, i.e. around 700,000 examples: despite this fact, there is a very significant gap in performance between the AGN approach and FULL. This is likely due once again to the richer information made accessible via the combined use of propagators and our SBR-inspired loss.

From a practical point of view it seems that, as long as enough problem knowledge is available, it is possible to train data-driven methods with a very high feasibility ratio, starting from very small pools of historical solutions. It may be argued that if extensive problem knowledge is available, one may use a more traditional solution approach (e.g. Constraint Programming or Mathematical Programming): even in such a case, however, a (partially) data-driven approach should have a higher chance to preserve the implicit properties (e.g. user preferences) that make the historical solutions desirable.
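To make the distinction between training set size and solution pool size concrete, here is a compact sketch of the generation scheme discussed above. It is our own illustration: solutions are represented as sets of variable-value pair indices, and the helper names are hypothetical.

  import random

  def deconstruct(solution):
      # Algorithm 1: remove assignments one at a time in random order; each
      # step yields (partial assignment, removed assignment) as an example.
      order = list(solution)
      random.shuffle(order)
      current, examples = set(solution), []
      for j in order:
          current.remove(j)
          examples.append((frozenset(current), j))
      return examples

  def build_dataset(pool, calls_per_solution=1):
      # Raising calls_per_solution grows the training set without adding
      # new empirical knowledge (the solution pool stays fixed); identical
      # examples are discarded, as in the generation process of Section 3.
      data = set()
      for solution in pool:
          for _ in range(calls_per_solution):
              data.update(deconstruct(solution))
      return data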
5 Conclusion

We considered injecting domain knowledge in Deep Neural Networks to bridge the gap between expert-designed models and data-driven approaches for constrained problems. We chose the PLS as a case study, and extended an existing NN approach to enable knowledge injection. We performed controlled experiments to investigate three main scientific questions, drawing the following conclusions:

Q1: Injecting domain knowledge at training time improves the ability of the NN approach to identify feasible assignments. Data-driven methods behave significantly better than a naive random baseline.
Q2: Using constraint propagation to filter out some infeasible assignments at test time dramatically improves the behavior of random selection; data-driven methods receive almost no benefit, but they still perform best. This suggests that data-driven methods can infer information about global feasibility from empirical data.
Q3: A purely data-driven approach is very sensitive to the available empirical information. Injecting knowledge at training time improves robustness: if both row and column constraints are considered, a limited performance drop is observed with as few as 100 historical solutions.

Our analysis required the development of a general, SBR-inspired method to turn any constraint propagator into a source of training-time information. Nowadays, there is an increasing need for AI systems that are more human-centered, and in several applications the agent is required to take into account security and fairness constraints during the decision process. One way to reach this goal can be the adoption of a hybrid approach that combines data-driven methods and Constraint Programming. The results achieved in this work thus represent a significant step toward the practical applicability of data-driven approaches for constrained problems, and towards the development of more human-centered AI systems.

Many open questions remain: experimentation with different problem types and scales is needed to make sure that our results hold in general. Embedding the NN in an actual search process (with or without propagation) will provide more insight into the global behavior of the data-driven methods; finally, when applying propagation at training time is viable, it is desirable to adjust training so that it complements, rather than replicates, the effect of propagation.

ACKNOWLEDGEMENTS

This research has been partially funded by the H2020 AI4EU project, grant agreement 825619.

REFERENCES

[1] H. M. Adorf and M. D. Johnston, ‘A discrete stochastic neural network algorithm for constraint satisfaction problems’, in Proc. of IJCNN, pp. 917–924 vol. 3, (June 1990).
[2] Irwan Bello, Hieu Pham, Quoc V. Le, Mohammad Norouzi, and Samy Bengio, ‘Neural combinatorial optimization with reinforcement learning’, arXiv preprint arXiv:1611.09940, (2016).
[3] Yoshua Bengio, Andrea Lodi, and Antoine Prouvost, ‘Machine learning for combinatorial optimization: a methodological tour d’horizon’, arXiv preprint arXiv:1811.06128, (2018).
[4] Christian Bessiere, Frédéric Koriche, Nadjib Lazaar, and Barry O’Sullivan, ‘Constraint acquisition’, Artificial Intelligence, 244, 315–342, (2017).
[5] A. Bouhouch, L. Chakir, and A. El Qadi, ‘Scheduling meeting solved by neural network and min-conflict heuristic’, in Proc. of IEEE CIST, pp. 773–778, (Oct 2016).
[6] Michelangelo Diligenti, Marco Gori, and Claudio Sacca, ‘Semantic-based regularization for learning and inference’, Artificial Intelligence, 244, 143–165, (2017).
[7] Richard Evans and Edward Grefenstette, ‘Learning explanatory rules from noisy data’, Journal of Artificial Intelligence Research, 61, 1–64, (2018).
[8] Matteo Fischetti and Jason Jo, ‘Deep neural networks as 0-1 mixed integer linear programs: A feasibility study’, in Proc. of CPAIOR, (2018).
[9] Andrea Galassi, Michele Lombardi, Paola Mello, and Michela Milano, ‘Model agnostic solution of CSPs via deep learning: A preliminary study’, in Proc. of CPAIOR, ed., Willem-Jan van Hoeve, pp. 254–262, Cham, (2018). Springer International Publishing.
[10] Wouter Kool, H. V. Hoof, and Max Welling, ‘Attention solves your TSP, approximately’, Statistics, 1050, 22, (2018).
[11] Guosheng Lin, Chunhua Shen, Anton Van Den Hengel, and Ian Reid, ‘Efficient piecewise training of deep structured models for semantic segmentation’, in Proc. of the IEEE CVPR, pp. 3194–3203, (2016).
[12] Michele Lombardi, Michela Milano, and Andrea Bartolini, ‘Empirical decision model learning’, Artificial Intelligence, 244, 343–367, (2017).
[13] Xuezhe Ma and Eduard Hovy, ‘End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF’, in Proc. of ACL, pp. 1064–1074. Association for Computational Linguistics, (2016).
[14] Robin Manhaeve, Sebastijan Dumančić, Angelika Kimmig, Thomas Demeester, and Luc De Raedt, ‘DeepProbLog: Neural probabilistic logic programming’, arXiv preprint arXiv:1805.10872, (2018).
[15] Giuseppe Marra, Francesco Giannini, Michelangelo Diligenti, and
     Marco Gori, ‘Integrating learning and reasoning with deep logic mod-
     els’, in Proc. of ECML, (2019).
[16] Velibor V Mišić, ‘Optimization of tree ensembles’, arXiv preprint
     arXiv:1705.10883, (2017).
[17] Matthew Richardson and Pedro Domingos, ‘Markov logic networks’,
     Machine Learning, 62(1-2), 107–136, (2006).
[18] Tim Rocktäschel and Sebastian Riedel, ‘End-to-end differentiable prov-
     ing’, in Advances in Neural Information Processing Systems, pp. 3788–
     3800, (2017).
[19] Francesca Rossi, Peter Van Beek, and Toby Walsh, Handbook of Constraint Programming, Elsevier, 2006.
[20] Luciano Serafini and Artur d’Avila Garcez, ‘Logic tensor networks:
     Deep learning and logical reasoning from data and knowledge’, arXiv
     preprint arXiv:1606.04422, (2016).
[21] Emile Van Krieken, Erman Acar, and Frank Van Harmelen, ‘Semi-
     supervised learning using differentiable reasoning’, Journal of Applied
     Logic, (2019). To appear.
[22] Sicco Verwer, Yingqian Zhang, and Qing Chuan Ye, ‘Auction optimiza-
     tion using regression trees and linear models as integer programs’, Arti-
     ficial Intelligence, 244(Supplement C), 368 – 395, (2017). Combining
     Constraint Solving with Mining and Learning.
[23] Hong Xu, Sven Koenig, and TK Satish Kumar, ‘Towards effective deep
     learning for constraint satisfaction problems’, in Proc. of CPAIOR, pp.
     588–597. Springer, (2018).