=Paper=
{{Paper
|id=Vol-2986/paper2
|storemode=property
|title=Neuro-Symbolic Constraint Programming for Structured Prediction
|pdfUrl=https://ceur-ws.org/Vol-2986/paper2.pdf
|volume=Vol-2986
|authors=Paolo Dragone,Stefano Teso,Andrea Passerini
|dblpUrl=https://dblp.org/rec/conf/nesy/DragoneTP21
}}
==Neuro-Symbolic Constraint Programming for Structured Prediction==
Neuro-Symbolic Constraint Programming for
Structured Prediction
Paolo Dragone1 , Stefano Teso2 and Andrea Passerini2
1
    Twitter, London, United Kingdom
2
    University of Trento, Trento, Italy
                                        Abstract
                                        We propose N e s t e r , a method for injecting neural networks into constrained structured predictors. N e s t e r
                                        first uses a neural network to compute an initial prediction that may or may not satisfy the constraints,
                                        and then applies a constraint-based structured predictor to refine the raw predictions according to hard
                                        and soft constraints. N e s t e r combines the advantages of its two components: the network can learn
                                        complex representations from low-level data while the constraint program on top reasons about the
                                        high-level properties and requirements of the prediction task. An empirical evaluation on handwritten
                                        equation recognition shows that N e s t e r achieves better performance than both the either component in
                                        isolation, especially when training examples are scarce, while scaling to more complex problems than
                                        other neuro-programming approaches. N e s t e r proves especially useful to reduce errors at the semantic
                                        level of the problem, which is particularly challenging for neural network architectures.
                                         Keywords
                                         Machine Learning, Neuro-Symbolic Integration, Structured Prediction, Constraint Programming
1. Introduction
We study neuro-symbolic integration in the context of structured output prediction under
constraints. In structured prediction, one predicts objects β like parse trees, floor plans,
and molecules β encoded by multiple interdependent variables [1]. We tackle the more
complex problem of predicting structures subject to complex logical-numerical constraints
from complex sub-symbolic inputs like images or sensor data.
   This problem is beyond the reach of existing methods. Classical frameworks for
structured prediction [2, 3] lack support for representation learning whereas most deep
structured prediction approaches have no support for constraints [4, 5]. Existing neuro-
symbolic approaches are restricted to logical constraints and either rely on fuzzy logic [6, 7]
or enforce satisfaction of constraints only in expectation [8], failing to deal with hard
constraints. Closest to our work, DeepProbLog [9] supports sub-symbolic data as well
as logic and (discrete) numeric constraints, but relies on an inference procedure that
becomes prohibitively expensive when dealing with structured prediction problems with
a high degree of non-determinism, as highlighted by our experiments.
NeSyβ21: 15th International Workshop on Neural-Symbolic Learning and Reasoning
Envelope-Open pdragone@twitter.com (P. Dragone); stefano.teso@unitn.it (S. Teso); andrea.passerini@unitn.it
(A. Passerini)
Orcid 0000-0002-9147-8514 (P. Dragone); 0000-0002-2340-9461 (S. Teso); 0000-0002-2765-5395 (A. Passerini)
                                       Β© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution
                                       4.0 International (CC BY 4.0).
    CEUR
    Workshop
    Proceedings
                  http://ceur-ws.org
                  ISSN 1613-0073
                                       CEUR Workshop Proceedings (CEUR-WS.org)
Figure 1: How N e s t e r predicts an equation π¦ from a sequence of images π₯. Each image is first classified
by a CNN (the same network for each image) and the predicted labels are grouped into a sequence π¦Μ
that happens to violate both syntactic and semantic constraints. The structured predictor then takes π₯
and π¦Μ and outputs a refined label π¦ that is guaranteed to satisfy all constraints. (Best viewed in color.)
  We propose N e s t e r (NEuro-Symbolic consTrainEd structuRed predictor), an approach
that addresses constrained structured prediction with sub-symbolic inputs and complex
constraints. N e s t e r integrates a neural network and a constrained structured predictor.
The former uses the input data alone to produce an initial guess, i.e., a structure that may
violate some constraints. The structured predictor takes this initial guess and leverages a
constraint satisfaction solver to impose hard and soft constraints on it. Inspired by work
on declarative structured prediction [10], N e s t e r leverages MiniZinc [11] to define the
constraint program and can therefore deal with both hard and soft constraints as well as
categorical and numerical variables. The entire architecture can be trained end-to-end.
  We evaluated N e s t e r on handwritten equation recognition [10], a sequence prediction
task in which the prediction must obey both syntactic and semantic constraints, cf.
Figure 1. Our results show that the neural network and the constrained structured
predictor work in synergy, achieving better results than either model taken in isolation.
Importantly, the neural network alone struggles to achieve low error at the semantic level,
whereas N e s t e r predicts structures that are both syntactically and semantically sound.
2. Structured Prediction under Constraints
The goal of structured prediction is to output a structured object π¦ β π΄ that best
matches a given input π₯ β π³. Learning a mapping π βΆ π³ β π΄ directly is non-trivial. One
strategy is to acquire a scoring function πΉ βΆ π³ Γ π΄ β β that measures the compatibility
between inputs and outputs and then use it to identify a highly compatible output
π¦ = π (π₯) βΆ= argmaxπ¦ β² βπ΄ πΉ (π₯, π¦ β² ) [12]. A principled approach for estimating such scoring
functions are structured SVMs [3], which model πΉ (π₯, π¦) as a linear function of the form
β¨w, π(π₯, π¦)β©, where π βΆ π³ Γ π΄ β βπ is a feature map that transforms input-output pairs
into a joint, π-dimensional feature space, and w β βπ are weights learned from data. Given
a training set of input-output pairs {(π₯π , π¦π )}ππ=1 , where π¦π is a high-quality output for π₯π ,
structured SVMs learn w by solving the following quadratic problem:
                     π
          π 2 1
  min       βwβ + β ππ      s.t. β π, βπ¦ β π΄ β§΅ {π¦π } . β¨w, π(π₯π , π¦π )β© β β¨w, π(π₯π , π¦)β© β₯ Ξ(π¦π , π¦) β ππ
   w, π   2      π π=1
For each example π β [π], the constraints encourage the correct label β¨w, π(π₯π , π¦π )β© to score
above any alternative label π¦ by a task-dependent margin Ξ(π¦π , π¦) that measures how much
π¦ differs from π¦π . The objective minimizes the slacks ππ , which grow when a constraint is
not satisfied, and an β2 regularization term π2 βwβ2 that penalizes overly complex functions.
   Following recent work on structured prediction under constraints [10], we solve the
inference problem argmaxπ¦ β² πΉ (π₯, π¦ β² ) using an off-the-shelf constraint programming solution.
Specifically, we leverage MiniZinc [11], a constraint programming language that interfaces
several state-of-the-art solvers and that allows to inject complex background knowledge β
in the form of logical and algebraic constraints β into the inference (and learning) steps.
For instance, in our equation recognition experiment, we encode equation validity by
transforming the sequence of predicted labels into integers and then imposing a constraint
that the left and right hand sides match, see Section 4.
3. Neural Constrained Structured Prediction
We propose N e s t e r , a neuro-symbolic approach that upgrades the structured SVM pipeline
with representation learning while offering support for complex hard and soft constraints.
N e s t e r combines two components: a neural network and a general-purpose constrained
structured predictor [13] that acts as a βrefinementβ layer on top of the network. The
network is not directly aware of the constraints: given an input π₯ β π³, its task is to
guess an initial, raw prediction π¦Μ β π΄. Furthermore, the network is free to predict
distinct output variables (e.g., individual characters in an equation) independently. The
constrained structured predictor then takes the input π₯ β π³, the networkβs prediction
π¦Μ β π΄, and a set of hard and soft constraints, and outputs a final prediction π¦ β π΄, as
illustrated in Figure 1 for handwritten equation recognition. The hard constraints encode
validity requirements that the output has to satisfy (e.g., the output equation must be
algebraically valid). The soft constraints are weighted constraints that measure how
good the output structure is based on the input, network output and refined output, and
whose weights are learned during training. The constrained structured predictor is in
charge of grouping the neural predictions while enforcing the constraints, thus correcting
potential mistakes and inconsistencies.
    The refinement step is performed by solving the following constraint program:
           argmaxπ¦βπ΄ β¨w, π(π₯, π¦)β© + β¨wπ , ππ (π₯, π¦,Μ π¦)β© + π€πΏ πΏ(π¦, π¦)Μ     s.t. π¦Μ = π(π₯; π )           (1)
where the set of candidates π΄ is defined by the hard constraints; π(π₯; π ) is a neural
network; (w, wπ , π€πΏ ) and π are the parameters of the constrained structured predictor
and of the network, respectively; and π(π₯, π¦), ππ (π₯, π¦,Μ π¦) and πΏ(π¦, π¦)Μ are features. More
specifically, π(π₯, π¦) are the prediction features that, analogously to the ones used in
standard structured prediction models [2], help predicting the right output from the
input. ππ (π₯, π¦,Μ π¦) are the refinement features and are designed to help the structured
predict to spot the mistakes made by the network, by relating input, output and network
predictions. Finally πΏ(π¦, π¦)Μ measures the difference between the raw and refined predictions
and (depending on π€πΏ ) it keeps the refined prediction close to the networkβs guess.
  In N e s t e r , learning can very intuitively be broken up into three steps: i) Bootstrapping
the model by pre-training the neural network or by loading a pre-trained model, ii) Freezing
the network and training the constrained structured prediction model based on its
predictions, and iii) Fine tuning the two components end-to-end. This last step is worth
discussing in detail. Structured SVMs are typically learned with cutting planes [14]
or block-coordinate Frank-Wolfe [15]. In contrast, N e s t e r uses stochastic sub-gradient
descent [16], which works by minimizing the structured hinge loss πΏ(w; π₯π , π¦π , π¦πβ ) = π2 βwβ2 +
β¨w, π(π₯π , π¦πβ ) β π(π₯π , π¦π )β©, where π¦πβ = argmaxπ¦βπ΄ Ξ(π¦π , π¦) β β¨w, π(π₯π , π¦π ) β π(π₯π , π¦)β©. Given π¦πβ ,
the algorithm simply computes a sub-gradient βw πΏ and then performs a descent step.
In contrast to the alternatives, stochastic sub-gradient descent allows N e s t e r to back-
propagate the gradient of the loss through the whole model. This is achieved by chaining
the gradients of πΏ with respect to π¦Μ and the gradient of the neural network output with
                                                                                                         ππ¦Μ
respect to its weights. For the last layer of the network with weights ππΎ : βπΏ = ππΏ                 ππ¦Μ πππΎ
                                                                                                             .
Gradients for the other layers can be obtained by standard backpropagation through the
network.
3.1. Neural Constrained Sequence Prediction
In the following we focus on sequence prediction, a very common structured output task.
Here, outputs π¦ = (π¦1 , β¦ , π¦π ) are variable length sequences of elements from an alphabet
                            π
of π possible symbols {π π }π=1 . For each element, a vector of π input features is typically
available. We indicate with π₯π,π the π-th feature of the π-th element of the input sequence.
As prediction features we employ standard features for correlating input and output
variables commonly used in e.g. conditional random fields [2] for sequence prediction:
                          π                                         πβ1
          ππ,π (π₯, π¦) = βπ=1 π₯π,π β
 Jπ¦π = π π K    ππ1 ,π2 (π₯, π¦) = βπ=1 Jπ¦π = π π1 β§ π¦π+1 = π π2 K         (2)
where Jβ
K evaluates to 1 if the argument is true and 0 otherwise. ππ,π (π₯, π¦) encodes emission
features, correlating the appearance of a specific input feature (e.g., a pixel inside input
images) with each of the emitted symbols. Here π ranges over input features and π over
output symbols. ππ1 ,π2 (π₯, π¦) encodes transition features, correlating the appearance of
consecutive symbols inside the output sequence. Here π1 , π2 ranges over all the possible
combinations of couples of symbols.We define refinement features so as to correlate the
value of each input feature with the overriding of neural network predictions by the
                                     π
refinement layer: ππ,π (π₯, π¦,Μ π¦) = βπ=1 π₯π,π β
 Jπ¦π = π π β§ π¦πΜ β  π π K, where π ranges over input
features and π over output symbols. By learning appropriate weights for them, the
structured predictor can learn to fix the most common mistakes made by the neural
network. Finally, we define πΏ(π¦, π¦)Μ as the Hamming distance between π¦ and π¦,Μ that is,
            π
πΏ(π¦, π¦)Μ = βπ=1 Jπ¦π β  π¦πΜ K.
4. Experiments
We tested our technique on the handwritten equation recognition task described in [10].
All our experiments are implemented using Tensorflow and Pyconstruct [10]. MiniZinc [11]
was used as constraint programming engine and Gurobi (https://www.gurobi.com/) as the
underlying constraint solver.
Setting Training examples represent equations are all of the form π + π = π, where π, π, π
are arbitrary positive integers (of a predefined maximum number of digits). Inputs are
variable length sequences of 9 Γ 9 black and white images and the corresponding outputs
are sequences of symbols that include the digits from 0 to 9, + and =. The dataset
contains 10,000 valid equations from the ICFHRβ14 CROHME competition data and is
used with a 80 βΆ 20 train-test split. Results are reported as learning curves by dividing
the training set into 20 chunks of increasing size.
Neural model As highlighted in Figure 1, we use a CNN to predict the symbol of each
image in the sequence independently. The CNN is composed of two convolutional layers
with 3 Γ 3 filters and ReLU activation, each followed by a 2 Γ 2 max-pool layers, then
a 128 dense layer with dropout regularization (0.5 probability), and finally a softmax
layer with 12 outputs, one for each symbol. The CNN is trained using Adamwith a
cross entropy loss. The left plot in Figure 2 reports the relative misclassification error
of the CNN over the test set, i.e. the percentage of sequences for which the network
made at least one mistake. We subdivide the errors into syntactic errors, for which the
prediction is not properly formatted according to the template π + π = π, and semantic
errors, i.e. well-formatted predictions for which π plus π does not equal π (errors which
are syntactically and semantically correct were a negligible minority) The shaded areas
in the figure show how the different types of errors contribute to the total. While the
total number of errors decreases with increasing training set size, the proportion between
syntactic and semantic errors settles around 33% to 67%, indicating that semantic errors
are generally harder to correct. To make sure that these errors are not simply due to the
fact that the CNN predicts each output symbol independently, we also experimented with
a recurrent architecture stacking an LSTM network over the CNN outputs. Results are
reported in the middle plot of Figure 2. The network makes more errors overall, possibly
due to the increased number of trainable parameters. Relatively speaking, though, the
CNN + LSTM network learns very quickly how to fix syntactic errors, faster than the
CNN. Yet, semantic errors prove much more challenging, and end up being the large
majority of the errors of the network. These results show that even highly expressive
neural architectures can fail to go beyond a syntactic level of understanding, especially
when limited data is available.
Neural constrained model As shown in Figure 1, N e s t e r applied to handwritten equa-
tion recognition stacks a constrained structured layer on top of the independent CNN
predictions. As constrained structured layer (βCSTβ from now on) we used an enhanced
Figure 2: Misclassification error for the CNN (left) and the CNN + LSTM (middle). Right: structural
loss (Hamming distance) for the CNN + CST model and the competitors. Best viewed in color.
version of the constrained model from [10]. Our model extends the original one by taking
the prediction of the neural network π¦Μ as input and combining the prediction features
with the refinement features and the Hamming distance as described in Section 3.1 (see
also Figure 1). The constrained structured model is defined in the MiniZinc language.
Syntactic constraints over the output sequence enforce a single + and a single = and the
fact that + should come before =. Semantic constraints are enforced by first combining
sequences of consecutive digits into corresponding numbers according to positional rules,
and then enforcing π + π = π over the resulting numbers. Prediction and refinement
features as well as Hamming distance between neural network output and final output are
also encoded in MiniZinc. See [17] for further details. We perform a warm start training
of the CNN as in the previous paragraph, and train the CST model using the structured
SVMs method with stochastic subgradient descent [16] for one epoch over the training
set. The structured loss used in this task is the Hamming distance between the predicted
sequence of symbols and the true sequence. Figure 2 (right) shows the test loss of our
method (CNN + CST) together to those of the single components, i.e., the CNN and the
CST in the original variant [10], and the CNN + LSTM cascade. The general trend is
rather clear, the CNN has always an edge over the CST, yet their combination always
performs better than both. As expected, the gap between the CNN and the combined
model is maximal for the smallest dataset, but it remains clearly evident when the curves
start to level off (82.9% relative error reduction at the last iteration). The loss of the
CNN + LSTM model is overall slightly higher than the one of the CNN model, consistent
with what happens for the misclassification error (see the middle plot of Figure 2). An
ablation study shows that the distance feature alone already improves over the plain
CNN model (29.6% relative error reduction at the last iteration), and that all features
contribute to the overall performance. See [17] for details.
   Inference and training time of the CNN + CST model is mostly bounded by the
inference time of CST, which in our experiment is below one second using Gurobi. The
inference time of the CNN is negligible. While the CNN is much more scalable, the
combined model still has low latency, allowing its application to many domains for which
sample complexity is more a important aspect than scalability. Nevertheless, we are
looking at integrating approximation techniques into N e s t e r as future work [18].
Comparison with DeepProbLog DeepProbLog is an expressive neuro-symbolic frame-
work that combines probabilistic-logic programming with neural predicates and has
been used to learn, e.g., to add digits represented as MNIST images [9]. The equation
recognition task we address here, however, is more complex as the numbers have a variable
number of digits and the position of the operators is not known a priori. Modelling this
problem with DeepProbLog requires the use of non-deterministic predicates, which are
extremely expensive. Indeed, even after optimizing the encoding1 , equation recognition
turned out to be too memory hungry to be even executed by DeepProbLog.
5. Related work
Traditional approaches to structure prediction acquire a scoring function over candidate
input-output pairs [2, 3] and admit prior knowledge in the form of ad-hoc constraints
on the output [19, 20]. These ideas have been extended to a fully declarative setup in
which the constraints are specified using a language like MiniZinc [10, 13, 21]. These
approaches rely on manually designed features, which is often sub-optimal compared to
representation learning. N e s t e r is a neuro-symbolic upgrade of these techniques.
   Deep structured output prediction has been tackled with energy-based and search-
based models. Energy-based models like deep value networks [22], structured prediction
energy networks (SPENs) [4] and input convex neural networks [5] implement the scoring
function using a neural network and use gradient ascent to perform inference, but offer no
support for discrete outputs and prior knowledge defined by constraints. Lee et al. [23]
address this issue by casting the constrained inference of SPENs into an instance-specific
learning problem with a constraint-based loss function. Unlike N e s t e r , this method is not
guaranteed to satisfy all hard constraints. Search-based techniques, on the other hand,
sequentially build an output structure by repeatedly applying an auto-regressive models
to predict the next piece given the already predicted pieces. These approaches however
lack support for long-range constraints without an expensive back-tracking procedure.
   Most attempts to combine deep networks and reasoning focus on forcing the network
to learn weights that ultimately produce predictions satisfying the constraints. A popular
line of research integrates constraints into the objective function using fuzzy logic [6, 7].
These approaches cannot guarantee that the predicted outputs satisfy the hard constraints.
The same is true for the semantic loss, which encourages the output of a neural network
to satisfy given constraints with high probability [8]. Furthermore, these approaches
are usually restricted to Boolean variables and logical constraints. Perhaps the most
expressive approach in this class is DeepProbLog, which integrates probabilistic-logic
programming and neural predicates in a principled manner. In contrast to N e s t e r ,
DeepProbLog can answer probabilistic queries other than MAP inference, but it does not
always scale to complex structured-output predictions tasks, as discussed in Section 4.
   1
       The main developer of DeepProbLog helped us in trying to optimize the encoding.
6. Conclusion
We presented N e s t e r , an approach to structured output prediction that combines neural
networks for input processing and representation learning with constraint programming
for high-level reasoning. Our preliminary experiments on a challenging constrained
structured prediction task, namely handwritten equation recognition, show that CNN
and LSTMs learn quickly to correct syntactic errors but fail to grasp and avoid semantic
errors. In contrast, N e s t e r improves recognition performance over both the neural network
and the constrained structured model on their own, especially with smaller training sets,
while generating predictions that are consistent by design.
   A key assumption underlying our method is that the constraints themselves are correct.
We plan on investigating settings in which this is not the case, for instance because of
modeling mistakes on the designerβs end or because the background knowledge has been
acquired semi-automatically [24]. One option is to encode the background knowledge in
terms of soft constraints. This is however non-trivial and left to future work.
Acknowledgments
We are grateful to Carlo NicolΓ² and Edoardo Battocchio for running preliminary experi-
ments and to Robin Manhaeve for support on DeepProbLog. The research of ST and
AP was partially supported by TAILOR, a project funded by EU Horizon 2020 research
and innovation programme under GA No 952215.
References
 [1] G. Bakir, T. Hofmann, B. SchΓΆlkopf, A. Smola, B. Taskar, S. Vishwanathan, Pre-
     dicting Structured Data, 2007.
 [2] J. Lafferty, A. McCallum, F. Pereira, Conditional random fields: Probabilistic
     models for segmenting and labeling sequence data, in: ICML, 2001.
 [3] I. Tsochantaridis, T. Hofmann, T. Joachims, Y. Altun, Support vector machine
     learning for interdependent and structured output spaces, in: ICML, 2004.
 [4] D. Belanger, A. McCallum, Structured prediction energy networks, in: ICML, 2016.
 [5] B. Amos, L. Xu, J. Kolter, Input convex neural networks, in: ICML, 2017.
 [6] M. Diligenti, M. Gori, V. Scoca, Learning efficiently in semantic based regularization,
     in: ECML-PKDD, 2016.
 [7] I. Donadello, L. Serafini, A. dβAvila Garcez, Logic tensor networks for semantic
     image interpretation, in: IJCAI, 2017.
 [8] J. Xu, Z. Zhang, T. Friedman, Y. Liang, G. Van den Broeck, A semantic loss
     function for deep learning with symbolic knowledge, in: ICML, 2017.
 [9] R. Manhaeve, S. DumanΔiΔ, A. Kimmig, T. Demeester, L. De Raedt, DeepProblog:
     Neural probabilistic logic programming, in: NeurIPS, 2018.
[10] P. Dragone, S. Teso, A. Passerini, Pyconstruct: Constraint programming meets
     structured prediction, in: IJCAI, 2018.
[11] N. Nethercote, P. Stuckey, R. Becket, S. Brand, G. Duck, G. Tack, Minizinc: Towards
     a standard cp modelling language, in: CP, 2007.
[12] Y. LeCun, S. Chopra, R. Hadsell, F. J. Huang, et al., A tutorial on energy-based
     learning, in: Predicting Structured Data, MIT Press, 2006.
[13] P. Dragone, S. Teso, A. Passerini, Constructive preference elicitation, Frontiers in
     Robotics and AI 4 (2018).
[14] T. Joachims, T. Finley, C. Yu, Cutting-plane training of structural svms, Machine
     Learning (2009).
[15] S. Lacoste-Julien, M. Jaggi, M. Schmidt, P. Pletscher, Block-coordinate frank-wolfe
     optimization for structural svms, in: ICML, 2013.
[16] N. Ratliff, J. Bagnell, M. Zinkevich, (approximate) subgradient methods for struc-
     tured prediction, in: AISTATS, 2007.
[17] P. Dragone, S. Teso, A. Passerini, Neuro-symbolic constraint programming for
     structured prediction, CoRR abs/2103.17232 (2021).
[18] J. Mandi, T. Guns, Interior Point Solving for LP-based prediction+ optimisation,
     in: NeurIPS, 2020.
[19] T. Kristjansson, A. Culotta, P. Viola, A. McCallum, Interactive information extrac-
     tion with constrained conditional random fields, in: AAAI, 2004.
[20] E. Fersini, E. Messina, G. Felici, D. Roth, Soft-constrained inference for named
     entity recognition, Information Processing & Management (2014).
[21] P. Dragone, G. Pellegrini, M. Vescovi, K. Tentori, A. Passerini, No More Ready-Made
     Deals: Constructive Recommendation for Telco Service Bundling, in: RecSys, 2018.
[22] M. Gygli, M. Norouzi, A. Angelova, Deep value networks learn to evaluate and
     iteratively refine structured outputs, in: ICML, 2017.
[23] J. Lee, M. Wick, J. Tristan, J. Carbonell, Enforcing output constraints via sgd: A
     step towards neural lagrangian relaxation, in: NIPS, 2017.
[24] L. De Raedt, A. Passerini, S. Teso, Learning constraints from examples, in: AAAI,
     2018.