<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Injecting Domain Knowledge in Neural Networks: a Controlled Experiment on a Constrained Problem</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mattia Silvestri</string-name>
          <email>mattia.silvestri4@unibo.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michele Lombardi</string-name>
          <email>michele.lombardi2@unibo.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michela Milano</string-name>
          <email>michela.milano@unibo.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Bologna</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Given enough data, Deep Neural Networks (DNNs) are capable of learning complex input-output relations with high accuracy. In several domains, however, data is scarce or expensive to retrieve, while a substantial amount of expert knowledge is available. It seems reasonable that if we can inject this additional information into the DNN, we could ease the learning process. One such case is that of Constraint Problems, for which declarative approaches exist and pure ML solutions have obtained mixed success. Using a classical constrained problem as a case study, we perform controlled experiments to probe the impact of progressively adding domain and empirical knowledge to the DNN. Our results are very encouraging, showing that (at least in our setup) embedding domain knowledge at training time can have a considerable effect, and that a small amount of empirical knowledge is sufficient to obtain practically useful results.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
      <p>Given enough data, Deep Neural Networks are capable of learning
complex input-output relations with high accuracy. In many domains,
however, there exists also a substantial degree of expert knowledge:
it seems reasonable that if we can inject this additional information
in the DNN, we could ease the learning process. Indeed, methods
for hybridizing learning and reasoning (or for taking into account
constraints at training time) can accelerate convergence or improve
the accuracy, especially when supervised data is scarce.</p>
      <p>In this paper we aim at characterizing this trade-off between
implicit knowledge (derived from data) and explicit knowledge
(supplied by experts), via a set of controlled experiments. For this
purpose, we use a setting that is both rigorous enough from a scientific
standpoint and practically relevant: that of constrained problems.</p>
      <p>Constrained problems involve assigning values to a set of
variables, subject to a number of constraints, and possibly with the goal
of minimizing a cost metric. Depending on the lack or presence of
a cost function, they are formally known as Constraint Satisfaction
Problems (CSPs) or Constraint Optimization problems (COPs).</p>
<p>Constrained problems are classically modeled by domain experts in
a fully declarative fashion: however, such models can be hard to
design, may rely on simplistic and unquantifiable approximations, and
may fail to take into account constraints (or preferences) that are not
known to the expert, despite being satisfied in historical solutions.
Data-driven methods for constrained problems offer a potential
solution for some of these issues, but they may have trouble maintaining
feasibility and they struggle with the (very) limited number of past
solutions available for practical use cases.</p>
<p>We use as a benchmark the Partial Latin Square (PLS)
completion problem, which requires completing a partially filled n × n
square with values in {1, …, n}, such that no value appears twice on
any row or column. Despite its simplicity, the PLS is NP-hard,
unless we start from an empty square; the problem has practical
applications (e.g. in optical fiber routing), and serves as the basis for more
complex problems (e.g. timetabling). Using a classical constrained
problem as a case study grants access to reliable domain knowledge
(the declarative formulation), and facilitates the generation of
empirical data (problem solutions). This combination enables controlled
experiments that are difficult to perform on more traditional datasets.</p>
      <p>We train a problem-agnostic, data-driven, solution approach on a
pool of solutions, and we inject domain knowledge (constraints) both
at training time and at solution generation time. We then adjust the
amount of initial data (empirical knowledge) and of injected
constraints (domain knowledge), and assess the ability of the approach
to yield feasible solutions. Our results are very encouraging, showing
that (at least in our setup) embedding domain knowledge in a
data-driven approach can have a considerable effect, and that a small
amount of empirical knowledge is sufficient to obtain practically
useful results.</p>
      <p>As a byproduct of our analysis, we develop general techniques
for taking into account constraints in data-driven methods for
decision problems, based on easily accessible methods from the
Constraint Programming and Machine Learning domains. While such
techniques are originally designed for problems with discrete
decisions, they should be adaptable to numeric decisions as well. Hence,
although our focus remains on a scientific investigation, we also
regard this paper as a relevant step towards practical applicability for
some data-driven solution methods for constrained problems. We
also think that integrating data-driven methods and Constraint
Programming will be helpful to develop more human-centered AI
systems. Nowadays, the wide adoption of AI in everyday life has
introduced the need for methods that are more focused on the
requirements of human operators. A practical scenario is that of
constraints that allow an autonomous system to guarantee fairness and to
meet security guidelines during the decision process. Despite their
great success, since they are a purely data-driven approach, DNNs
may fail to correctly infer and generalize fairness or security
constraints because they may be only partially satisfied in the data. At
the same time, declarative models can help to overcome this
limitation of data-driven methods. This motivates the interest in the
development of hybrid approaches that can exploit the advantages of both
data-driven and declarative methods.</p>
      <p>The paper is organized as follows: Section 2 briefly surveys the
related literature and motivates the choice of our baseline techniques;
Section 3 discusses the details of the problem and methods we use;
Section 4 presents the results of our analysis, while Section 5
provides concluding remarks.
</p>
    </sec>
    <sec id="sec-2">
      <title>Related Works and Baseline Choice</title>
      <p>
        The analysis that we aim to perform requires 1) a data-driven
technique that can solve a constrained problem, with no access to its
structure; moreover, we need 2) methodologies for injecting domain
knowledge in such a system, both at training time and after training.
In this section, we briefly survey methods available in the literature
for such tasks and we motivate our selection of techniques.</p>
      <p>
        <bold>Neural Networks for Solving Constrained Problems</bold> The
integration of Machine Learning methods for the solution of constrained
problems is an active research topic, which has recently been
surveyed in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Many such approaches consider how ML can improve
specific steps of the solution process: here, however, we are
interested in methods that use learning to replace (entirely or in part)
the modeling activity itself. These include Constraint Acquisition
approaches (e.g. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]), which attempt to learn a declarative problem
description from feasible/infeasible variable assignments. These
approaches may however have trouble dealing with implicit knowledge
(e.g. preferences) that cannot be easily stated in a well-defined
constraint language. Techniques for encoding Machine Learning models
in constrained problems (e.g. [
        <xref ref-type="bibr" rid="ref12 ref16 ref22 ref8">8, 12, 22, 16</xref>
        ]) are capable of
integrating empirical and domain knowledge, but not at training time;
additionally, they require knowing a priori which variables are involved
in the constraints to be learned.
      </p>
      <p>
        Some approaches (e.g. [
        <xref ref-type="bibr" rid="ref1 ref5">1, 5</xref>
        ]) rely on carefully structured
Hopfield Networks to solve constrained problems, but designing these
networks (or their training algorithms) requires full problem
knowledge. Recently, reinforcement learning and Pointer Networks [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] or
Attention [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] have been used for solving specific classes of
constrained problems, with some measure of success. These approaches
also require a high degree of problem knowledge to generate the
reward signal, and to some degree for the network design. The method
from [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] applies Neural Networks to predict the feasibility of a
binary CSP, with a very high degree of accuracy; the prediction is
however based on a representation of the allowed variable-value pairs,
and hence requires explicit information about the problem.
      </p>
      <p>
To the best of the authors' knowledge, the only example of a
problem-agnostic, data-driven approach for the solution of constrained
problems is the one in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Here, a Neural Network is used to learn how
to extend a partial variable assignment so as to retain feasibility.
Despite its limited practical effectiveness, this method shares the best
properties of Constraint Acquisition (no explicit problem
information), without being restricted to constraints expressed in a classical
declarative language. We therefore chose this approach as our
baseline.
      </p>
      <p>
<bold>Domain Knowledge in Neural Networks</bold> There are several
approaches for incorporating external knowledge in Neural Networks,
none of which has been applied so far on constrained decision
problems. One method to take domain knowledge into account at
training time is the so-called Semantic Based Regularization [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ],
which is based on the idea of converting constraints into
regularizing terms in the loss function used by a gradient-descent algorithm.
Other techniques include Logic Tensor Networks (LTNs) [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], which
replace the entire loss function with a fuzzy formula defined on
logical predicates. LTNs are connected to Differentiable Reasoning [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ],
which uses relational background knowledge to benefit from
unlabeled data. Domain knowledge has also been introduced in
differentiable Machine Learning (in particular Deep Networks) by properly
designing their structure, rather than the loss function: examples
include Deep Structured Models (see e.g. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], the latter
integrating deep learning with Conditional Random Fields).
      </p>
      <p>
        Integration of external knowledge in Neural Networks after
training is considered for example in DeepProbLog [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], where DNNs
with probabilistic output (classifiers in particular) are treated as
predicates. Markov Logic Networks achieve a similar result via the
use of Markov Fields defined over First Order Logic formulas [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ],
which may be defined via probabilistic ML models. [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] presents a
Neural Theorem Prover using differentiable predicates and the
Prolog backward chaining algorithm.
      </p>
      <p>
        Recent works such as [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] are capable of integrating
probabilistic reasoning and Neural Networks both during and after training.
Even more general is the Differentiable Inductive Logic approach
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], which proposes a framework that can solve ML tasks typical of
traditional Inductive Logic Programming systems, but is also robust
to noise and error in the training data.
      </p>
      <p>We use a method loosely based on Semantic Based Regularization
for injecting knowledge at training time, as it offers a good
compromise between flexibility and simplicity. For injecting knowledge
after training, we rely on a very simple method well suited for
constrained problems (i.e. using constraint propagation to adjust the
output of the ML model).
</p>
    </sec>
    <sec id="sec-3">
      <title>Basic Methods</title>
      <p>
        We reimplement the approach from [
        <xref ref-type="bibr" rid="ref9">9</xref>
] and extend it via a number of
techniques, described in this section together with our evaluation
procedure. The actual setup and results of our experiments are presented
in Section 4.
      </p>
<p><bold>Neural Network for the Data-Driven Approach</bold> The baseline
approach is based on training a Neural Network to extend a partial
assignment (also called a partial solution) by making one additional
assignment, so as to preserve feasibility. Formally, the network is a
function:
f : {0, 1}^m → [0, 1]^m    (1)
whose input and output are m-dimensional vectors. Each element
in the vectors is associated to a variable-value pair ⟨zj, vj⟩, where
zj is the associated variable and vj is the associated value. The
network input represents partial assignments, assuming that xj = 1 iff
zj = vj. Each component fj(x) of the output is proportional to the
probability that pair ⟨zj, vj⟩ is chosen for the next assignment. This
is achieved in practice by using an output layer with m neurons with
a sigmoid activation function.</p>
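      <p>As an illustration, the shape of this estimator can be sketched as a toy forward pass in plain NumPy (the weights here are hypothetical; the actual architecture used in our experiments is reported in Section 4):</p>
      <p>
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def network(x, weights):
    """Toy forward pass of f: maps a 0/1 partial-assignment vector to
    per-pair scores in (0, 1). Hidden layers use ReLU, the output layer
    a sigmoid, as described above."""
    a = np.asarray(x, float)
    for W, b in weights[:-1]:
        a = np.maximum(W @ a + b, 0.0)  # ReLU hidden layer
    W, b = weights[-1]
    return sigmoid(W @ a + b)  # one score per variable-value pair
```
      </p>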
<p><bold>Dataset Generation Process</bold> The input of each training example
corresponds to a partial solution x, and the output to a single variable
value assignment (represented as a vector y using a one-hot
encoding). The training set is constructed by repeatedly calling the
randomized deconstruction procedure Algorithm 1 on an initial set of
full solutions (referred to as solution pool). Each call generates a
number of examples that are used to populate a dataset. At the end
of the process we discard multiple copies of identical examples.</p>
      <sec id="sec-3-1">
        <title>Algorithm 1 DECONSTRUCT(x)</title>
        <p>D = ∅
while ‖x‖₁ &gt; 0 do</p>
        <p>Let y = 0  # zero vector
Select a random index i s.t. xi = 1
Set xi = 0, set yi = 1</p>
        <p>Add the pair ⟨x, y⟩ to D
return D
Two
examples may have the same input, but different output, since a
single partial assignment may have multiple viable completions.</p>
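        <p>The deconstruction procedure can be sketched in Python as follows (function and variable names are ours; x is the 0/1 encoding of a full solution described above):</p>
        <p>
```python
import random

def deconstruct(x):
    """Sketch of Algorithm 1: generate training pairs from a full
    solution x by removing one random assignment at a time. Each pair
    is (reduced partial solution, one-hot vector of the removed pair)."""
    x = list(x)
    examples = []
    while sum(x) > 0:
        i = random.choice([j for j, v in enumerate(x) if v == 1])
        y = [0] * len(x)
        x[i], y[i] = 0, 1          # remove assignment i, mark it in y
        examples.append((list(x), y))
    return examples
```
        </p>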
        <p>
          Unlike [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], here we sometimes perform multiple calls to
Algorithm 1 for the same starting solution. This simple approach enables
us to investigate independently the effect of the training set size (which
depends on the number of examples) and of the amount of actual
empirical knowledge (which depends on the size of the solution pool).
The method also makes it possible to generate large training sets starting
from a very small number of historical solutions.
        </p>
        <p>
<bold>Training and Knowledge Injection</bold> The basic training for the NN
is the same as for neural classifiers. Since the network output can
be assimilated to a class, we process the network output through a
softmax operator, and then we use as a loss function the
categorical cross-entropy H. Additionally, we inject domain knowledge at
training time via an approach that combines ideas of Semantic Based
Regularization (SBR) and Constraint Programming (CP, [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]).
        </p>
        <p>In CP, constraints are associated to algorithms called propagators
that can identify provably infeasible values in the domain of the
variables. Propagators are generally incomplete, i.e. there is no guarantee
they will find all infeasible values. Given a constraint (or a collection
of constraints) C, here we will treat its propagator as a multivariate
function such that Cj(x) = 1 iff assignment zj = vj has not been
marked as infeasible by the propagator, while Cj(x) = 0 otherwise.
We then augment the loss function with an SBR-inspired term that
discourages provably infeasible pairs, and encourages the remaining
ones. Given an example ⟨x, y⟩, we have:</p>
        <p>
Lsbr(x) = Σ_{j=0}^{m−1} (Cj(x) − fj(x))²    (2)

i.e. increasing the output of a neuron corresponding to a provably
infeasible pair incurs a penalty that grows with the square of fj(x);
increasing the output for the remaining pairs incurs a penalty that
grows with the square of 1 − fj(x). Our full loss is hence given by:

L(x, y) = H(y, e^{f(x)}/Z) + λ · Lsbr(x)    (3)

where Z is the partition function and the scalar λ controls the balance
between the cross-entropy term H and the SBR term. The method can
be applied for all known propagators with discrete variables. With
some adaptations, it can be made to work for important classes of
numerical propagators (e.g. those that enforce Bound Consistency
[
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]).
        </p>
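        <p>A minimal NumPy sketch of this combined loss (cross-entropy on the softmax-normalized output plus the λ-weighted SBR term; function names are ours):</p>
        <p>
```python
import numpy as np

def softmax(f):
    e = np.exp(f - np.max(f))        # shift for numerical stability
    return e / e.sum()               # e.sum() is the partition function Z

def full_loss(y, f, C, lam):
    """Sketch of the combined loss: categorical cross-entropy H between
    the one-hot target y and the normalized output, plus lam times the
    SBR term sum_j (Cj - fj)^2."""
    f = np.asarray(f, float)
    p = softmax(f)
    H = -float(np.log(p[int(np.argmax(y))]))
    L_sbr = float(np.sum((np.asarray(C, float) - f) ** 2))
    return H + lam * L_sbr
```
        </p>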
        <p>Since propagators are incomplete and we are rewarding
assignments not marked as infeasible, there is a chance that our SBR term
injects incorrect information in the model. This reward mechanism is
however crucial to ensure a non-negligible gradient at training time:
the incorrect information is balanced by the presence of the
cross-entropy term, which encourages assignments that are guaranteed
feasible (since they originate from full problem solutions).</p>
<p><bold>Evaluation and Knowledge Injection</bold> We evaluate the approach
via a specialized procedure, relying on a randomized solution
function for PLS instances. This has signature SOLVE(x, C, h), where
x is the starting partial assignment, C is the considered (sub)set of
problem constraints, and h is a probability estimator for
variable-value pairs (e.g. our trained NN). The function is implemented via
the Google or-tools constraint solver, and is based on Depth First
Search: at each search node, the solver applies constraint
propagation to prune some infeasible values, chooses for branching the
first unassigned variable in a static order, then assigns a value
chosen at random with probabilities proportional to h(x′), where x′ is
the current state of assignments. The SOLVE function returns either a
solution, or ⊥ in case of infeasibility.</p>
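        <p>The randomized value selection inside SOLVE can be sketched as follows (a simplified stand-in, with names of our own; the actual implementation relies on the or-tools constraint solver):</p>
        <p>
```python
import random

def pick_value(candidates, h_scores):
    """Branching value choice (sketch): sample one of the non-pruned
    variable-value pair indices with probability proportional to the
    estimator's score for that pair."""
    weights = [h_scores[j] for j in candidates]
    return random.choices(candidates, weights=weights, k=1)[0]
```
        </p>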
        <p>
          Our evaluation method tests the ability of the NN to identify
individual assignments that are globally feasible, i.e. that can be extended
into full solutions. This is done via Algorithm 2, which 1) starts from
a given partial solution; 2) relies on a constraint propagator C to
discard some of the provably infeasible assignments; 3) uses the NN to
make a (deterministic) single assignment; 4) attempts to complete it
into a full solution (taking into account all problem constraints, i.e.
Cpls). Replacing the NN with a uniform probability estimator allows
us to obtain a baseline. We repeat the process on all partial solutions
from a test set, and collect statistics. This approach is identical to one
of those in [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], with one major difference, i.e. the use of a constraint
propagator for “correcting” the output of the probability estimator.
This enables injection of (incomplete) knowledge at solution
construction time, while the original behavior can be recovered by using
an empty set of propagators.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>Algorithm 2 FEASTEST(X, C, h)</title>
<p>J = arg max{hj(x) | Cj(x) = 1}  # Most likely assignments
Pick j uniformly at random from J
Set xj = 1
if SOLVE(x, Cpls, hrnd) ≠ ⊥ then</p>
        <p>return 1  # Globally feasible
else
return 0  # Globally infeasible</p>
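        <p>In Python, the core of this evaluation step can be sketched as follows (names are ours; solve stands in for the SOLVE function and returns None on infeasibility):</p>
        <p>
```python
import random

def feastest(x, C, h, solve):
    """Sketch of FEASTEST: make the single most likely assignment not
    pruned by propagator C, then check global feasibility by trying to
    complete x with a full solver."""
    hx, Cx = h(x), C(x)
    allowed = [j for j in range(len(x)) if Cx[j] == 1]
    best = max(hx[j] for j in allowed)
    J = [j for j in allowed if hx[j] == best]  # ties broken at random
    x = list(x)
    x[random.choice(J)] = 1
    return 1 if solve(x) is not None else 0
```
        </p>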
        <p>Unlike in typical Machine Learning evaluations, we do not
measure the (extremely low) network accuracy: in fact, the accuracy
metric in our case is tied to the network ability to replicate the same
sequence of assignments observed at training time, which has almost
no practical value.
</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Empirical Analysis</title>
      <p>In this section we discuss our experimental analysis, which is
designed around some key questions of both scientific and practical
import. We focus on the following aspects:
Q1: Does injecting knowledge at training time improve the network
ability to identify feasible assignments?
Q2: What is the effect of injecting domain knowledge after, rather
than during, training?
Q3: What is the effect of adjusting the amount of available empirical
knowledge?
Taking advantage of our controlled use case, we present a series of
experiments that investigate such research directions. Details about
the rationale and the setup of each experiment are reported in
dedicated sections, but some common configuration can be immediately
described.</p>
<p>We focus on 10 × 10 PLS instances, resulting in input and output
vectors with 1,000 elements. We use a feed-forward, fully-connected
Neural Network with three hidden layers, each with 512 units having
ReLU activation function. This setup is considerably simpler than
the one used in the approach we chose as our baseline, but manages
to reach very similar results. We employ the Adam optimizer from
Keras-TensorFlow2.0, with default parameters. The number of
training epochs and the batch size depend on the experiment.</p>
    </sec>
    <sec id="sec-5">
      <title>Domain Knowledge at Training Time</title>
      <p>We start with an experiment designed to address Question 1, i.e.
whether injecting domain knowledge at training time may help the
NN in the identification of feasible assignments. This is practically
motivated by situations in which a domain expert has only partial
information about the problem structure, but a pool of historical
solutions is available.</p>
      <p>For this experiment, the training set for the network is generated
using the deconstruction approach from Section 3, starting from a set
of 10,000 PLS solutions. Each solution is then deconstructed exactly
once, yielding a training set of 730,000 examples, 25% of which
are then removed to form a separate test set. We use mini-batches of
50,000 examples and we stop training after 1000 epochs.</p>
      <p>We compare four approaches: a random approach (referred to as
RND), which treats all possible variable-value pairs as equally likely;
a model-agnostic neural network (referred to as AGN); a network
trained with knowledge about row constraints (referred to as ROWS);
a network trained with knowledge about row and column constraints
(referred to as FULL).</p>
      <p>The RND approach lacks even the basic knowledge that a
variable cannot be assigned twice, since this is not enforced by our
input/output encoding. The same holds for AGN, which can however
infer such a constraint from data. Conversely, in ROWS we use our
SBR-inspired method (and a Forward Checking propagator) to
inject the knowledge that both assigning a variable twice and assigning a
value twice on the same row are forbidden. For the FULL approach we
do the same, applying the Forward Checking propagator also to
column constraints (i.e. no value can appear twice on the same column).
Due to the use of an incomplete propagator, both ROWS and FULL
make use of incomplete knowledge. We have empirically determined
that λ = 1 for FULL and λ = 0.01 for ROWS works reasonably well.</p>
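      <p>As an illustration of the kind of information injected in ROWS, a Forward-Checking-style row propagator can be sketched as follows (the index convention j = (row·n + col)·n + val is our own assumption, not the one of the original implementation):</p>
      <p>
```python
def row_propagator(x, n):
    """Sketch of a Forward-Checking-style propagator for PLS row
    constraints: C[j] = 0 iff the pair is provably infeasible because
    its cell is already assigned or its value already appears on the
    same row; C[j] = 1 otherwise."""
    C = [1] * (n * n * n)
    for r in range(n):
        used_vals, filled_cols = set(), set()
        for c in range(n):
            for v in range(n):
                if x[(r * n + c) * n + v] == 1:
                    used_vals.add(v)
                    filled_cols.add(c)
        for c in range(n):
            for v in range(n):
                if v in used_vals or c in filled_cols:
                    C[(r * n + c) * n + v] = 0
    return C
```
      </p>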
      <p>We evaluate the resulting approaches via the FEASTEST procedure,
using the (separated) test set as X, the trained networks (or the
uniformly random estimator) as h, and an empty set of constraints (i.e.
no propagation at test time). We then produce “feasibility plots” that
report on the x-axis the number of assigned variables (filled cells) in
the considered partial solutions and on the y-axis the ratio of
suggested assignments that are globally feasible.</p>
      <p>The results of the evaluation are shown in Figure 1. Without
propagating any constraint at evaluation time, a purely random choice
is extremely unlikely to result in globally feasible assignments: this
is expected and only serves as a pessimistic baseline. The AGN
approach, relying exclusively on empirical knowledge, behaves
considerably better, with high feasibility ratios for almost empty and
almost full squares, and a noticeable drop when 60% of the square
is filled. The trend is a common feature for many of the approaches,
and may be connected to a known phase transition in the complexity
of this combinatorial problem. Injecting domain knowledge at
training time improves the feasibility ratio by a noticeable margin: half
of the improvement is observed when moving from AGN to ROWS,
suggesting that even partial knowledge about the problem structure
could prove very useful.</p>
    </sec>
    <sec id="sec-6">
      <title>Domain Knowledge at Evaluation Time</title>
      <p>Our second experiment addresses Question 2, regarding the effect
of knowledge injection at test time. This topic relates to scenarios
where the expert-supplied information can be incorporated in a
solution method (e.g. as a constraint in a declarative model). While not
always viable, this situation is frequent enough to deserve a dedicated
analysis.</p>
      <p>For this experiment, we rely on the same training and test set as in
Section 4.1, and compare the same approaches. As a main difference,
we take into account some or all the problem constraints at evaluation
time, by passing a non-empty set C of propagators to the FEASTEST
procedure. At test time, too, the constraints are taken into account by
means of an incomplete propagator (Forward Checking), and hence
all approaches rely on incomplete knowledge.</p>
      <p>The results of this evaluation are presented in Figure 2 (for row
constraints propagation at test time) and Figure 3 (for all problem
constraints). The RND results in this case are representative of the
behavior (at each search node) of a Constraint Programming solver
having access to either only row constraints or the full problem
definition: since CP is known to work very well on the PLS, it is
therefore expected that the performance of the random approach is
significantly boosted by knowledge injection at test time.</p>
<p>All approaches relying either purely (AGN) or partially (ROWS
and FULL) on empirical knowledge gain almost no benefit from
injecting constraints at evaluation time, though they still perform
noticeably better than RND. On one hand, these diminishing returns
should be kept in mind whenever taking constraints into account
during solution generation is viable. On the other hand, the fact that
the knowledge-driven approaches are not helped by constraint
propagation suggests that their advantage comes from information about
global feasibility, which they can access from the empirical data.</p>
    </sec>
    <sec id="sec-7">
      <title>Training Set Size and Empirical Information</title>
      <p>Next, we proceed to tackle Question 3, by acting on the training set
generation process. In classical Machine Learning approaches, the
amount of available information is usually measured via the training
set size: this is a reasonable approach, since the number of training
examples has a strong impact on the ability of a ML method to learn
and generalize.</p>
      <p>We performed experiments to probe the effect of the training set
size on the performance of the data-driven approaches. Figure 4 and
Figure 5 report results for training sets with respectively 300,000
and 50,000 examples. Knowledge injection at training time has in
this case a dramatic effect: while the AGN approach is very
sensitive to the available number of examples, the FULL one has only a
minor drop in performance when moving from 730,000 examples
to 50,000 examples. This confirms previous experiences with
techniques such as Semantic Based Regularization, although the effect
in our case is much more pronounced: the gap is likely due to the
fact that, while our cross-entropy term in the loss function provides
information about a single (globally) feasible assignment, the SBR
term provides information about a large number of (locally)
feasible assignments.</p>
<p>In our setup, we also have the possibility of applying the
deconstruction process multiple times, so that the number of different examples
that can be obtained from a single solution grows with the number
of possible permutations of the variable indices (i.e. O((n²)!) for the
PLS). This observation opens up the possibility of generating large
training sets from a very small number of starting solutions: this is
scientifically interesting, since the “actual” empirical information
depends on how many solutions are available; this is also very useful in
practice, since in many practical applications only a relatively small
number of historical solutions exists.</p>
      <p>The results of this evaluation are shown in Figure 6 and Figure 7
for solution pools having respectively 1,000 and 100 elements. In
both cases the size of the generated training set is comparable to the
original, i.e. around 700,000 examples: despite this fact, there is a
very significant gap in performance between the AGN approach and
FULL. This is likely due once again to the richer information made
accessible via the combined use of propagators and our SBR-inspired
loss.</p>
      <p>From a practical point of view it seems that, as long as enough
problem knowledge is available, it is possible to train data-driven
methods with very high feasibility ratio, starting from very small
pools of historical solutions. It may be argued that if extensive
problem knowledge is available, one may use a more traditional solution
approach (e.g. Constraint Programming or Mathematical
Programming): even in such a case, however, a (partially) data-driven
approach should have a higher chance to preserve implicit properties
(e.g. user preferences) that make the historical solutions desirable.</p>
    </sec>
    <sec id="sec-8">
      <title>Concluding Remarks</title>
      <p>We considered injecting domain knowledge in Deep Neural
Networks to bridge the gap between expert-designed models and
data-driven approaches for constrained problems. We chose the PLS as a
case study, and extended an existing NN approach to enable
knowledge injection. We performed controlled experiments to investigate
three main scientific questions, drawing the following conclusions:
Q1: Injecting domain knowledge at training time improves the
ability of the NN approach to identify feasible assignments. Data
driven methods behave significantly better than a naive random
baseline.</p>
      <p>Q2: Using constraint propagation to filter out some infeasible
assignments at test time dramatically improves the behavior of
random selection; data-driven methods receive almost no benefit, but
they still perform best. This suggests that data-driven methods can
infer information about global feasibility from empirical data.</p>
      <p>Q3: A pure data-driven approach is very sensitive to the amount of
available empirical information. Injecting knowledge at training time
improves robustness: if both row and column constraints are
considered, only a limited performance drop is observed with as few as 100
historical solutions.</p>
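The test-time filtering examined in Q2 amounts to restricting the network's choice to assignments that propagation has not ruled out. A minimal sketch (our own illustration; names are assumptions, not the paper's API) of this masked selection:

```python
def masked_argmax(scores, feasible):
    """Among the (variable, value) pairs that constraint propagation has not
    pruned, pick the one the network scores highest (test-time filtering
    sketch). scores and feasible are matching 2D lists."""
    best, best_score = None, float("-inf")
    for i, (row_s, row_f) in enumerate(zip(scores, feasible)):
        for j, (s, ok) in enumerate(zip(row_s, row_f)):
            if ok and s > best_score:  # skip propagator-pruned pairs
                best, best_score = (i, j), s
    return best


# The top-scoring assignment (0, 0) is infeasible, so the next-best
# feasible one, (1, 1), is selected instead
scores = [[0.9, 0.1], [0.4, 0.6]]
feasible = [[False, True], [True, True]]
pick = masked_argmax(scores, feasible)
```

Applied on top of a random scorer this filter changes behavior dramatically; on top of a trained network it matters far less, which is what suggests the network has already internalized much of the feasibility information.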
      <p>Our analysis required the development of a general, SBR-inspired
method to turn any constraint propagator into a source of
training-time information. There is an increasing need for AI
systems that are more human-centered, and in several applications the
agent is required to take security and fairness constraints into account
during the decision process. One way to reach this goal is the
adoption of a hybrid approach that combines data-driven
methods and Constraint Programming. With the results achieved in
this work, this paper takes a significant step toward the practical
applicability of data-driven approaches for constraint problems, and
toward the development of more human-centered AI systems.</p>
      <p>Many open questions remain: experiments with different
problem types and scales are needed to verify that our results hold
in general. Embedding the NN in an actual search process (with or
without propagation) will provide more insight into the global
behavior of the data-driven methods; finally, when applying propagation at
training time is viable, it is desirable to adjust training so that it
complements, rather than replicates, the effect of propagation.</p>
    </sec>
    <sec id="sec-8">
      <title>ACKNOWLEDGEMENTS</title>
      <p>This research has been partially funded by the H2020 AI4EU project,
grant agreement 825619.</p>
    </sec>
  </body>
  <back>
  </back>
</article>