Zero-shot Multi-Domain Dialog State Tracking Using Prescriptive Rules

Edgar Altszyler*1,2, Pablo Brusco*3, Nikoletta Basiou†4, John Byrnes5 and Dimitra Vergyri5
1 Departamento de Computación, FCEyN, Universidad de Buenos Aires, Argentina
2 Instituto de Investigación en Ciencias de la Computación, CONICET-UBA, Argentina
3 ASAPP, USA
4 Amazon, Alexa-AI
5 SRI International, USA

15th International Workshop on Neural-Symbolic Learning and Reasoning
*These authors contributed equally to this work. This work was performed while the authors were still at SRI.
†Work done prior to joining Amazon.
ealtszyler@dc.uba.ar (E. Altszyler); pbrusco@dc.uba.ar (P. Brusco); nbbasiou@amazon.com (N. Basiou); john.byrnes@sri.com (J. Byrnes); dimitra.vergyri@sri.com (D. Vergyri)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
In this work, we present a framework for incorporating declarative logical rules in state-of-the-art neural networks, enabling them to learn how to handle unseen labels without the introduction of any new training data. The rules are integrated into existing networks without modifying their architecture, through an additional term in the network's loss function that penalizes states of the network that do not obey the designed rules. As a case study, the framework is applied to an existing neural-based Dialog State Tracker. Our experiments demonstrate that the inclusion of logical rules allows the prediction of unseen labels without deteriorating the predictive capacity of the original system.

Keywords
Zero-shot Learning, Prescriptive rules, Neural network, Natural Language Understanding

1. Introduction
When deploying machine-learning-based systems, it is common for users to detect problems related to functionalities that do not meet the expected requirements. In particular, in dialog systems, problems arise when, for a certain input, the model makes a prediction that differs from the decision the user would have inferred. This is often due to the model structure and to the inherent characteristics of the dataset used to train the models. Similarly, new user requirements for an existing system may require outputting unseen labels (labels not present in the training data). In such situations, a typical solution is to collect new annotated data aligned with the expected functionality. However, collecting new data every time such a need arises is an expensive and time-consuming effort, so an alternative approach is desired.
In this work, we propose a solution to the above-mentioned problems by incorporating prescriptive logical rules into learned neural network models. These rules are designed by domain experts and can influence the system output, enabling also the prediction of unseen labels. Similar to other works [1, 2, 3, 4, 5], we use differentiable first-order logic (FOL), which has proven useful for integrating knowledge into a neural-symbolic system. We apply our logic rules framework to Dialog State Tracking, a challenging and complex task in the field of Dialog Systems.
More specifically, we extend the Multi-Domain Neural Belief State Tracker (MDNBT) proposed in [6] and recently incorporated as one of the state-of-the-art dialog state trackers in ConvLab, an open-source multi-domain end-to-end dialog system platform released for the Eighth Dialog System Technology Challenge (DSTC8) [7].
The main contributions of our work are the following: a) we enhance a neural-based Dialog State Tracker with logic rules without degrading the performance of the base system; b) we show that the addition of the logic rules allows the prediction of unseen labels, which can be very useful in the case of unlabeled or partially labeled data.

2. Related Work
Due to the increasing popularity of neural network models for supervised learning, there is a growing body of work on the inclusion of structural knowledge as a tool for biasing certain model decisions and as a way to mitigate the uninterpretability of results. One way to introduce this knowledge is to integrate logical rules through the use of FOL, a declarative language that can represent high-level knowledge [1, 2, 3, 4, 5, 8, 9, 10, 11, 12, 13, inter alia]. In the majority of prior works, different types of posterior regularization terms are implemented to affect the optimization process. For example, in a seminal work by Hu et al. [1], the authors propose a general teacher-student framework in which the model simultaneously learns from labeled data and logical rules through an iterative process that shifts the parameters of CNN and RNN networks in the tasks of sentiment classification and named entity recognition. Although these works generally apply rules as functions of the network inputs and outputs, some also allow predicates over the internal values of the network's neurons [5]. These rules can be included in existing neural network architectures to guide training and prediction without additional learning parameters [1, 11]. There are also works that use these techniques in semi-supervised or even zero-shot learning settings. For example, in [1, 3] it is shown that rules can be applied over unlabeled data. Furthermore, Donadello and Serafini [14] showed that the use of rules in a semantic image interpretation task allows predicting unseen visual relationships. However, to our knowledge, this type of framework has not been applied to the fully-unsupervised problem of unseen-label prediction.

3. Our approach
The proposed framework consists of the addition of a plug-in component into an existing computational graph with no extra learning parameters. A rules-dependent loss term is introduced into the system's loss function as a way of integrating rules into the learning process while allowing the use of off-the-shelf optimizers. By including rules as part of the training process, we generate extra cost when the network does not satisfy a certain rule.

3.1. Neural Belief State Tracker
A Dialog State Tracker (DST) is a key component in task-based spoken dialog systems. It models the user's intent at any point of an ongoing conversation [15], which is then used by the downstream dialog management component to choose the next system response. DST models estimate the belief state, which is the system's internal probability distribution over possible dialog states, by taking into account the user goals at every turn as extracted by a Spoken Language Understanding (SLU) component.
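For intuition, the belief state can be pictured as one probability distribution per slot over the candidate values listed in the ontology. The following minimal sketch is our illustration only; the slot names, probabilities and the "none" option are assumptions and are not taken from the MDNBT code.

# Illustrative belief state after a turn like "I would like an expensive restaurant".
belief_state = {
    "restaurant-pricerange": {"cheap": 0.02, "moderate": 0.05, "expensive": 0.90, "none": 0.03},
    "restaurant-area": {"north": 0.88, "south": 0.03, "east": 0.03, "west": 0.03, "none": 0.03},
}

# The tracked dialog state keeps, for each slot, its most probable value (if any).
dialog_state = {
    slot: max(dist, key=dist.get)
    for slot, dist in belief_state.items()
    if max(dist, key=dist.get) != "none"
}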
The dialog states are defined by a domain-specific ontology that lists the slot-value pairs describing the constraints the users can express (e.g. price range-expensive, price range-cheap, area-west, area-east, etc.) [16]. In our work, we use the Multi-Domain Neural Belief State Tracker (MDNBT), which jointly identifies the domain and tracks the belief states corresponding to that domain by utilizing the semantic similarity between dialog utterances and ontology terms [6] (implementation available at https://github.com/osmanio2/multi-domain-belief-tracking). MDNBT is implemented as a multi-layer network with Bi-LSTMs that model the user and system utterances and RNNs with a memory cell that model the flow of the conversation. We use the same set of parameters as Ramadan et al. [6].
For our experiments, we used the MultiWOZ 2.0 dataset [17], following Ramadan et al. [6]. This dataset contains 2480 single-domain dialogues and 7375 multi-domain dialogues, in which at least two domains are involved throughout each conversation. The ontology contains a total of 663 domain-slot-value triples, distributed across 27 slots in 5 domains (restaurant, hotel, attraction, train, taxi). Here we show an example of two turns inside a conversation with their corresponding state labels:

1. utterance: Hi, can you help me find a place to eat on the northside?
   {True state: restaurant-area-north}
   system: Yes, I have 15 options, do you have any preferences for the price range?
2. utterance: Yes, I would like an expensive restaurant
   {True state: restaurant-area-north, restaurant-pricerange-expensive}
   system: There are 2 expensive places, an Italian restaurant and a gastropub.

3.2. Rules definition
Rules are defined as formulas in a relaxation of FOL that represents truth values in a continuous domain (FOL fuzzy logic), in which the satisfaction of rules is a differentiable function that can be maximized to perform learning [2, 11]. A formula's truthiness is represented as a real number that indicates the degree of truth or falsity of a relation defined over entities of the system. (In our experiments, we are not interested in FOL functions and quantifiers, and we leave them out of the discussion.)
We formulated two types of rules for this study: type R1, which triggers when a specific keyword (in this case expensive) is explicitly uttered in a specific domain (in this case hotel), and type R2, which preserves belief-state predictions related to the price-range slots across turns when the user's price-range intent does not change. Examples of the two types of rules for the HOTEL domain are given below:

R1: IF the user's utterance contains a word like EXPENSIVE AND also contains a word like HOTEL THEN the prediction in the domain HOTEL, slot PRICERANGE should be EXPENSIVE.

R2: IF the previous prediction for the domain HOTEL, slot PRICERANGE was EXPENSIVE AND the user did NOT utter a word like MODERATE AND did NOT utter a word like CHEAP THEN the next prediction in the domain HOTEL, slot PRICERANGE should be EXPENSIVE.

In these examples, underlined words represent predicates; bold-uppercase terms represent logical connectives; bold-lowercase terms represent nodes in the computational graph of the network (or concepts that are mapped into embeddings, such as the user's utterance); finally, italic-uppercase words refer to the constants of our system.
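Before being compiled into the network, rules of these two types can be written down declaratively. The sketch below is our own illustration of how the two rule templates might be encoded as data; the class and field names are hypothetical, and the "word like X" semantics is only named here, not implemented.

from dataclasses import dataclass

@dataclass
class KeywordRule:           # R1-style: keywords in the utterance imply a slot value
    trigger_words: tuple     # words whose presence (by embedding similarity) fires the rule
    domain: str
    slot: str
    value: str

@dataclass
class PersistenceRule:       # R2-style: keep the previous prediction unless contradicted
    domain: str
    slot: str
    value: str
    blocking_words: tuple    # words that cancel the carry-over (e.g. the other price ranges)

r1_hotel_expensive = KeywordRule(("expensive", "hotel"), "hotel", "pricerange", "expensive")
r2_hotel_expensive = PersistenceRule("hotel", "pricerange", "expensive", ("moderate", "cheap"))

Each declared rule is then turned into a differentiable formula over nodes of the computational graph, as described in the following subsections.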
As we will discuss in further detail in the next sections, rules are implemented in the model's computational graph through the addition of new operations applied over existing nodes. When the graph is evaluated for a specific instance, each rule produces a number that indicates the truthiness of that rule for the instance under evaluation. The loss function is then a function of these truthiness values, allowing the network to learn the rules.

3.2.1. Learning mechanism
Unsatisfied rules generate a cost that the optimizer minimizes in conjunction with the misclassification cost. Our system is trained not only to learn from labels (by minimizing the cross-entropy loss function) but also to learn how to make the set of rules ℛ as true as possible (a concept called best satisfiability, as presented in [18]). For this, we use the simplest posterior regularization approach, in which the objective function of the rule-based model is the sum of the loss function of the base MDNBT model (ℒ_MDNBT) and the rules' loss function (ℒ_rules) [11]:

ℒ(θ; 𝒟; 𝒴; ℛ) = ℒ_MDNBT(θ; 𝒟; 𝒴) + w · ℒ_rules(θ; 𝒟; ℛ)

where θ represents the model parameters (weights and biases), 𝒟 and 𝒴 refer to the dataset and its labels respectively, ℛ represents the set of rules of our system, w is a weighting hyperparameter that we call the rules' weight, and ℒ_rules is defined as the sum of the losses of the individual rules:

ℒ_rules(θ; 𝒟; ℛ) = Σ_{r∈ℛ} ℒ_r(θ; 𝒟) = Σ_{r∈ℛ} (1 − truthiness_r(θ; 𝒟))

Here, each individual rule loss ℒ_r is defined in terms of truthiness_r, i.e. the degree of truth of the rule. Finally, the optimization problem consists of finding the optimal weights and biases for the network given the composed loss function:

θ* = argmin_θ ℒ(θ; 𝒟; 𝒴; ℛ)    (1)

The backpropagation mechanism computes gradients that update all trainable parameters in the network that are reachable from the loss function. For weights and biases to be reachable, the operations that define the loss function have to be fully differentiable. Thus the logical operations that define the truthiness of the rules need to be differentiable.

3.2.2. Formulas and Predicates
We define two types of formulas: (i) atomic formulas, i.e. predicates applied to constants and nodes of the computational graph; and (ii) composed formulas, which are built up from atomic formulas using the Boolean connectives.
Predicates define relations among entities of the neural network. For example, in the case of the MDNBT model, we can refer to the embedding representation of a word in the input data; to the current belief states (b_t), a slot-specific distribution of probabilities estimated by the MDNBT output layer [6]; to previous belief states (b_{t-1}); etc. Predicates may also refer to values external to the computational graph (i.e. constants), such as a pre-trained word embedding for the word HOTEL. When the computational graph is evaluated, each predicate returns a truthiness value, that is, a value between 0 and 1, with 1 being the highest confidence in the truth of the predicate.
The implementation of a predicate, as opposed to that of a logic connective, strongly depends on the underlying architecture and on the input representations. For example, the predicate "contains a word like CHEAP" can be expressed as a cosine similarity function over nodes of the computational graph: In_{thresh=t}(utt, CHEAP). This function checks whether there is any word in the utterance whose cosine similarity with the word embedding for cheap is greater than a threshold t.
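As an illustration of Sections 3.2.1 and 3.2.2, the following is a minimal differentiable sketch of the In and Assert predicates and of the composed loss. It is our own sketch, written with PyTorch-style tensors rather than the original MDNBT TensorFlow code; the function names, the sigmoid used as a soft threshold, and the constants t and k are our assumptions.

import torch
import torch.nn.functional as F

def in_predicate(utt_emb, w_emb, t=0.8, k=20.0):
    # utt_emb: (num_words, dim) word embeddings for the utterance; w_emb: (dim,) keyword embedding.
    sims = F.cosine_similarity(utt_emb, w_emb.unsqueeze(0), dim=1)  # similarity per utterance word
    return torch.sigmoid(k * (sims.max() - t))                      # soft, differentiable threshold

def assert_predicate(belief, index):
    # Truthiness of "the belief state asserts this slot-value": the predicted probability itself.
    return belief[index]

def rules_loss(truthiness_values):
    # L_rules = sum over rules of (1 - truthiness_r)
    return sum(1.0 - tr for tr in truthiness_values)

def total_loss(mdnbt_loss, truthiness_values, w=30.0):
    # Composed objective: base MDNBT loss plus the rules' weight w times the rules loss.
    return mdnbt_loss + w * rules_loss(truthiness_values)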
3.2.3. Logical operators
Logic operators (namely ∧, ∨, ¬, →) are implemented through the following equations, based on the product t-norm for conjunction, the s-norm for disjunction, and the residuum of the t-norm for the implication (see [19] and [11] for further details):

¬X ≝ 1 − X
X ∧ Y ≝ X · Y
X ∨ Y ≝ ¬(¬X ∧ ¬Y) = X + Y − XY
X → Y ≝ ¬(X ∧ ¬Y) = 1 − X(1 − Y)

Having defined all the aforementioned components, we can represent rules in terms of logic formulas. For example, R1 is:

R1 ≡ (In_{thresh=t}(utt, EXPENSIVE) ∧ In_{thresh=t}(utt, HOTEL)) → Assert(b_t, HOTEL-PRICERANGE-EXPENSIVE)

where Assert(b_t, HOTEL-PRICERANGE-EXPENSIVE) is a predicate that returns the model's belief-state probability b_t at the index corresponding to the HOTEL-PRICERANGE-EXPENSIVE state.

3.2.4. Antecedent and Consequent learning
When a gradient-based method is used to solve the optimization problem of Eq. 1, the parameters θ of the network are updated in the opposite direction to the gradient of ℒ, and the rule loss term contributes its own share of this update. For example, for a rule of the form r = X → Y, whose loss is ℒ_r = X(1 − Y), the update associated with the rule is

Δ_r θ = −λ (d_X(ℒ_r) d_θ(X) + d_Y(ℒ_r) d_θ(Y))

where λ is the learning rate and the partial derivatives of the implication are d_X(ℒ_r) = 1 − Y and d_Y(ℒ_r) = −X. For example, when the implication is not satisfied (X = 1 and Y = 0), the partial derivatives are d_X(ℒ_r) = 1 and d_Y(ℒ_r) = −1, and the update is

Δ_r θ = −λ d_θ(X) + λ d_θ(Y)

The network will therefore update θ in the direction of growth of Y and in the direction of decrease of X. That is, during the learning step the antecedent tends to decrease and the consequent tends to increase simultaneously. Depending on the rule, one may want the antecedent or the consequent learning to be frozen (i.e. the learning process occurs only through one of them). This is the case for some if-type rules like R2, in which we are not interested in learning how to make the condition true, but rather want to make the then-branch true whenever the condition is met. For these cases, one can use two alternatives: either implementing predicates with non-differentiable functions such as argmax, or programmatically stopping the back-propagation in the corresponding subterms.
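The connectives above, the formula for R1, and the freezing of the antecedent of R2 can be sketched as follows. This is again our illustration, assuming PyTorch-style autograd, where detach() plays the role of programmatically stopping back-propagation; the function names are ours.

import torch

# Product-logic connectives over truthiness tensors in [0, 1].
def fuzzy_not(x):        return 1.0 - x
def fuzzy_and(x, y):     return x * y
def fuzzy_or(x, y):      return x + y - x * y
def fuzzy_implies(x, y): return 1.0 - x * (1.0 - y)

def rule_r1(in_expensive, in_hotel, assert_expensive):
    # R1: (In(utt, EXPENSIVE) AND In(utt, HOTEL)) -> Assert(b_t, HOTEL-PRICERANGE-EXPENSIVE)
    return fuzzy_implies(fuzzy_and(in_expensive, in_hotel), assert_expensive)

def rule_r2(prev_assert_expensive, in_moderate, in_cheap, assert_expensive):
    # R2: carry the previous prediction over, with the antecedent frozen so that
    # gradients only push the consequent towards truth.
    antecedent = fuzzy_and(prev_assert_expensive,
                           fuzzy_and(fuzzy_not(in_moderate), fuzzy_not(in_cheap)))
    return fuzzy_implies(antecedent.detach(), assert_expensive)  # detach() stops back-propagation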
4. Results and Discussion
The main question we address in this section is: can we extend a system with a new slot without any additional data and without degrading the existing system? To answer this question, we simulated this scenario by removing all existing annotations for the PRICERANGE slot in the training set of the MultiWOZ dataset. Next, we built a set of twelve rules designed to learn the PRICERANGE slot values. Six of these rules were of the form R1 and six of the form R2 (as described in Section 3.2), addressing the six possible combinations of domains (hotel and restaurant) and price ranges (cheap, moderate and expensive). We compare our models to a base MDNBT model that does not contain rules and in which we removed the PRICERANGE slot-value pairs from the ontology.
In our experiments, we used 100 Bi-LSTM cells and trained the models from scratch using the ADAM optimizer [20] with batch size 64 for 600 epochs. A dropout rate [21] of 50% was used in all the intermediate representations. All weights were initialized from a normal distribution with zero mean and unit variance, and biases were initialized to zero. For the rule-based MDNBT we trained models with four different values of the rules' weight, i.e., w = 10, 30, 100, 300.

Figure 1: Performance of the rule-based MDNBT models for the PRICERANGE slots (left) and for the rest of the slots (right). Each star shows the F1 of one of 3 runs per weight, and the mean values are shown as circles. The mean F1 on the PRICERANGE slots is 0.093, 0.493, 0.521 and 0.436 for w = 10, 30, 100, 300 respectively, while the mean F1 on the other slots is 0.823, 0.697, 0.832, and 0.581 for w = 10, 30, 100, and 300 respectively. The performance of the base MDNBT model is also included in the right plot, computed as the mean over 6 runs (μ = 0.817 ± 0.085).

The F1 performance of the rule-based MDNBT models for the PRICERANGE slot and for all the remaining slots is depicted in Fig. 1. F1 is measured by considering the correct and incorrect predictions in each slot of each domain. The performance of the base MDNBT model for the remaining slots is also shown (right plot). From the left panel, it is evident that the rule-based MDNBT model shows predictive capability over the PRICERANGE slot values without any training data (zero-shot learning). The performance gained on the PRICERANGE slots, however, comes at a cost to the remaining slots in the ontology that depends on the rules' weight. As can be seen from the right plot, for lower rules' weight values (e.g. w = 10) there is no appreciable performance drop compared to the base MDNBT (0.7% relative decrease), while for large rules' weights (e.g. w = 300) there is a considerable degradation of performance (29% relative decrease). Selecting an appropriate rules' weight (e.g. w = 30) therefore offers a good trade-off between the performance on unlabeled data (i.e. the PRICERANGE slot) and the performance on labeled data (i.e. the remaining slots in the ontology).
Our results do not show a significant performance degradation on the other slots (right panel) for low rules' weight values (w = 10, 30), as neither shows a significant difference with the base model (two-sided t-test, p-value > 0.1 for both values of w). However, with w = 100 or w = 300 we see a notable decrease in the general performance (p-value = 0.06 and p-value = 0.004, respectively, with two-sided t-tests). That is, the system learns how to identify price ranges at the expense of producing unwanted effects on the performance of the rest of the slots.
From the experimental results, we observe that it is possible to integrate rules into an existing system to allow the prediction of unseen labels without degrading the predictive capabilities over the rest of the labels (as is the case with w = 30). However, it is necessary to pay special attention to the trade-off between learning the rules and the degradation of the system. In particular, it is important to notice that suitable weights depend on the number of times the rules are actually satisfied, the number of rules, and the design properties of the system.

5. Scope and limitations
In this work we show how a set of rules can be incorporated into a neural network to predict a new category that did not exist in the training set. It is worth noting that this rule-based setup depends entirely on the coverage of the set of rules established by the domain experts, and also that the rules, weights, predicates, and value/slot names are problem specific.
The presented approach is specifically suited to the case in which new requirements for a system can easily be covered by rules, allowing the creation of an updated version of the system without the need for additional annotations.

6. Future Work
As we described in Section 2, there are other methods for integrating rules in neural networks, and there are also newer state-of-the-art Dialog State Tracking models that achieve better metrics [22]. Since the objective of this work was to show that it is possible to perform zero-shot learning by adding rules to an existing neural network, we have chosen a simple and transparent rule integration method and have not focused on the selection of the state tracking model. In the future, we plan to compare different methods of integrating rules into state-of-the-art models.
Campagna et al. [23] have used data synthesis techniques for transferring knowledge into new domains (zero-shot transfer learning). In the future, we plan to adapt this method to our task (i.e. predicting slots not seen during training) in order to compare it with our approach. We believe that the integration of rules in neural networks has the benefit of using existing information in the internal states and of writing predicates over, for example, the model's predictions during training.

7. Conclusions
This paper presented how the addition of prescriptive logical rules designed by domain experts can enable neural networks to predict unseen labels without the need for creating new labeled training data. The rules are integrated into an existing neural network without modifying the original architecture. A posterior regularization approach was used to introduce the rules into the learning process, penalizing the objective function when the inputs and the internal state of the network do not obey one of the designed rules. Our rules-based framework was applied to and tested on an existing neural-based Dialog State Tracker, where rules were implemented so that the model learns to identify PRICERANGE labels, which were not seen during training. Our experiments showed that the inclusion of logical rules allows the prediction of new labels without jeopardizing the predictive capacity on the rest of the data. It is finally worth noting that our rules-based solution is independent of the neural network model and thus can be applied to any application (and neural network model) given the formulation of appropriate rules.

References
[1] Z. Hu, X. Ma, Z. Liu, E. Hovy, E. Xing, Harnessing deep neural networks with logic rules, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Berlin, Germany, 2016, pp. 2410–2420. URL: https://www.aclweb.org/anthology/P16-1228. doi:10.18653/v1/P16-1228.
[2] K. Sikka, A. Silberfarb, J. Byrnes, I. Sur, E. Chow, A. Divakaran, R. Rohwer, Deep adaptive semantic logic (DASL): Compiling declarative knowledge into deep neural networks, arXiv preprint arXiv:2003.07344 (2020).
[3] J. Xu, Z. Zhang, T. Friedman, Y. Liang, G. Van Den Broeck, A semantic loss function for deep learning with symbolic knowledge, 35th International Conference on Machine Learning, ICML 2018 12 (2018) 8752–8760. arXiv:1711.11157v2.
[4] M. Fischer, M. Balunovic, D. Drachsler-Cohen, T. Gehr, C. Zhang, M. Vechev, DL2: Training and querying neural networks with logic, in: International Conference on Machine Learning, 2019, pp. 1931–1941.
[5] T. Li, V. Srikumar, Augmenting neural networks with first-order logic, arXiv preprint arXiv:1906.06298 (2019). doi:10.18653/v1/P19-1028.
[6] O. Ramadan, P. Budzianowski, M. Gasic, Large-scale multi-domain belief tracking with knowledge sharing, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2018, pp. 432–437. doi:10.18653/v1/P18-2069.
[7] S. Lee, Q. Zhu, R. Takanobu, X. Li, Y. Zhang, Z. Zhang, J. Li, B. Peng, X. Li, M. Huang, J. Gao, ConvLab: Multi-domain end-to-end dialog system platform, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019. doi:10.18653/v1/P19-3011.
[8] B. Zhang, X. Xu, X. Li, X. Chen, Y. Ye, Z. Wang, Sentiment analysis through critic learning for optimizing convolutional neural networks with rules, Neurocomputing 356 (2019) 21–30. doi:10.1016/j.neucom.2019.04.038.
[9] G. Marra, F. Giannini, M. Diligenti, M. Gori, Integrating learning and reasoning with deep logic models, arXiv preprint arXiv:1901.04195 (2019). doi:10.1007/978-3-030-46147-8_31.
[10] B. Chen, Z. Hao, X. Cai, R. Cai, W. Wen, J. Zhu, G. Xie, Embedding logic rules into recurrent neural networks, IEEE Access 7 (2019) 14938–14946. doi:10.1109/ACCESS.2019.2892140.
[11] E. van Krieken, E. Acar, F. van Harmelen, Analyzing differentiable fuzzy logic operators, arXiv preprint arXiv:2002.06100 (2020).
[12] G. Marra, M. Diligenti, F. Giannini, M. Gori, M. Maggini, Relational neural machines, arXiv preprint arXiv:2002.02193 (2020).
[13] M. Diligenti, M. Gori, C. Sacca, Semantic-based regularization for learning and inference, Artificial Intelligence 244 (2017) 143–165.
[14] I. Donadello, L. Serafini, Compensating supervision incompleteness with prior knowledge in semantic image interpretation, in: 2019 International Joint Conference on Neural Networks (IJCNN), IEEE, 2019, pp. 1–8.
[15] S. Young, Cognitive user interfaces, IEEE Signal Processing Magazine (2010). doi:10.1109/MSP.2010.935874.
[16] N. Mrkšić, I. Vulić, Fully statistical neural belief tracking, arXiv preprint arXiv:1805.11350 (2018).
[17] P. Budzianowski, T.-H. Wen, B.-H. Tseng, I. Casanueva, S. Ultes, O. Ramadan, M. Gašić, MultiWOZ - a large-scale multi-domain Wizard-of-Oz dataset for task-oriented dialogue modelling, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018. doi:10.18653/v1/D18-1547.
[18] I. Donadello, L. Serafini, A. d'Avila Garcez, Logic tensor networks for semantic image interpretation, in: Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, 2017, pp. 1596–1602. URL: https://doi.org/10.24963/ijcai.2017/221. doi:10.24963/ijcai.2017/221.
[19] L. Serafini, A. D. Garcez, Logic tensor networks: Deep learning and logical reasoning from data and knowledge, CEUR Workshop Proceedings 1768 (2016). arXiv:1606.04422v2.
[20] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, in: Proceedings of ICLR, 2014.
[21] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting, The Journal of Machine Learning Research 15 (2014) 1929–1958.
[22] H. Lee, J. Lee, T.-Y. Kim, SUMBT: Slot-utterance matching for universal and scalable belief tracking, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 5478–5483. doi:10.18653/v1/P19-1546.
[23] G. Campagna, A. Foryciarz, M. Moradshahi, M. Lam, Zero-shot transfer learning with synthesized data for multi-domain dialogue state tracking, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 122–132. doi:10.18653/v1/2020.acl-main.12.