A Fast and Robust BERT-based Dialogue State Tracker for Schema-Guided Dialogue Dataset

Vahid Noroozi (vnoroozi@nvidia.com), Yang Zhang (yangzhang@nvidia.com), Evelina Bakhturina (ebakhturina@nvidia.com), Tomasz Kornuta (tkornuta@nvidia.com)
NVIDIA, USA

ABSTRACT
Dialog State Tracking (DST) is one of the most crucial modules for goal-oriented dialogue systems. In this paper, we introduce FastSGT (Fast Schema Guided Tracker), a fast and robust BERT-based model for state tracking in goal-oriented dialogue systems. The proposed model is designed for the Schema-Guided Dialogue (SGD) dataset, which contains natural language descriptions for all the entities including user intents, services, and slots. The model incorporates two carry-over procedures for handling the extraction of values that are not explicitly mentioned in the current user utterance. It also uses multi-head attention projections in some of the decoders to better model the encoder outputs.

In the conducted experiments we compare FastSGT to the baseline model for the SGD dataset. Our model keeps the efficiency in terms of computation and memory consumption while improving the accuracy significantly. Additionally, we present ablation studies measuring the impact of different parts of the model on its performance. We also show the effectiveness of data augmentation for improving the accuracy without increasing the amount of computational resources.

KEYWORDS
goal-oriented dialogue systems, dialogue state tracking, schema guided dialogues

ACM Reference Format:
Vahid Noroozi, Yang Zhang, Evelina Bakhturina, and Tomasz Kornuta. 2020. A Fast and Robust BERT-based Dialogue State Tracker for Schema-Guided Dialogue Dataset. In Proceedings of KDD Workshop on Conversational Systems Towards Mainstream Adoption (KDD Converse'20). ACM, New York, NY, USA, 8 pages.

1 INTRODUCTION
Goal-oriented dialogue systems are a category of dialogue systems designed to solve one or multiple specific goals or tasks (e.g. flight reservation, hotel reservation, food ordering, appointment scheduling) [21]. Traditionally, goal-oriented dialogue systems are set up as a pipeline with four main modules: 1-Natural Language Understanding (NLU), 2-Dialogue State Tracking (DST), 3-Dialog Policy Manager, and 4-Response Generator. NLU extracts the semantic information from each dialogue turn, which includes e.g. user intents and slot values mentioned by the user or the system. DST takes the extracted entities and builds the state of the user goal by aggregating and tracking the information across all turns of the dialogue. The Dialog Policy Manager is responsible for deciding the next action of the system based on the current state. Finally, the Response Generator converts the system action into natural language text understandable by the user.

The NLU and DST modules have been shown to train successfully with data-driven approaches [21]. Building on recent advances in language understanding driven by models like BERT [3], researchers have successfully combined NLU and DST into a single unified module, called Word-Level Dialog State Tracking (WL-DST) [7, 14, 18]. WL-DST models take the user or system utterances in natural language as input and predict the state at each turn. The model we propose in this paper falls into this class of algorithms.

Most of the previously published public datasets, such as MultiWOZ [2] or M2M [16], use a fixed list of defined slots for each domain without any information on the semantics of the slots and other entities in the dataset. As a result, systems developed on these datasets fail to understand the semantic similarity between domains and slots. The capability of sharing knowledge between slots and domains might help a model to work across multiple domains and/or services, as well as to handle unseen slots and APIs when the new APIs and slots are similar in functionality to those present in the training data.

The Schema-Guided Dialogue (SGD) dataset [14] was created to overcome these challenges by defining and including schemas for the services. A schema can be interpreted as an ontology encompassing the naming and definition of the entities, properties and relations between the concepts. In other words, a schema defines not only the structure of the underlying data (relations between all the services, slots, intents and values), but also provides descriptions of most of the entities expressed in natural language. As a result, dialogue systems can exploit that rich information to capture more general semantic meanings of the concepts. Additionally, the availability of the schema enables the model to use the power of pre-trained models like BERT to transfer or share knowledge between different services and domains. The recent emergence of the SGD dataset has triggered a new line of research on dialogue systems based on schemas, e.g. [1, 4, 9, 15].
Many state-of-the-art models proposed for the SGD dataset, despite showing impressive performance in terms of accuracy, appear not to be very efficient in terms of computational complexity and memory consumption, e.g. [9, 11, 12, 19]. To address these issues, we introduce a fast and flexible WL-DST model called Fast Schema Guided Tracker (FastSGT). The source code of the model is publicly available at: https://github.com/NVIDIA/NeMo. The main contributions of the paper are as follows:

• FastSGT is able to predict the whole state at each turn with just a single pass through the model, which lowers both the training and inference time.
• The model employs carry-over mechanisms for transferring values between slots, enabling switching between services and accepting the values offered by the system during the dialogue.
• We propose an attention-based projection that attends over all the tokens of the main encoder to better model the encoded utterances.
• We evaluate the model on the SGD dataset [14] and show that it has significantly higher accuracy than the baseline model of SGD, while keeping the efficiency in terms of computation and memory utilization.
• We show the effectiveness of augmentation on the SGD dataset without increasing the number of training steps.

2 RELATED WORKS
The availability of schema descriptions for services, intents and slots enables NLU/DST models to share and transfer knowledge between different services that have similar slots and intents. Considering the recent advances in natural language understanding and the rise of Transformer-based models [17] like BERT [3] or RoBERTa [10], this looks like a promising approach for training a unified model on datasets aggregated from different sources. We categorize the models proposed for the SGD dataset into two main categories: multi-pass and single-pass models.

2.1 Multi-pass Models
The general principle of operation of multi-pass models [6, 11, 12, 15, 19] lies in passing the descriptions of every slot and intent as inputs to BERT-like encoders to produce their embeddings. As a result, the encoders are executed several times per single dialogue turn. Passing the descriptions to the model along with the user or system utterances enables the model to have a better understanding of the task and facilitates learning the similarity between intents and slots. The SPDD model [12] is a multi-pass model which showed one of the highest performances in terms of accuracy on the SGD dataset. For instance, in order to predict the user state for a service with 4 intents and 10 slots and 3 slots being active in a given turn, this model needs 27 passes through the encoder (4 for intents, 10 for requested statuses, 10 for statuses, and 3 for values). Such approaches handle unseen services well and achieve high accuracy, but seem not to be practical in many cases when time or resources are limited.

One obvious disadvantage of multi-pass models is their lack of efficiency. The other disadvantage is their memory consumption. They typically use multiple BERT-like models (e.g. five in SPDD) for predicting intents, requested slots, slot statuses, categorical values, and non-categorical values. This significantly increases the memory consumption compared to most of the single-pass models with a single encoder.

2.2 Single-pass Models
The works that incorporate the single-pass approach [1, 14] rely on BERT-like models to encode the descriptions of services, slots, intents and slot values into representations, called schema embeddings. The main difference lies in the fact that this procedure is executed just once, before the actual training starts, removing the need to pass the descriptions through the model for each one of the turns/predictions.

While these models are very efficient and robust in terms of training and inference time, they have shown significantly lower performance in terms of accuracy compared to the multi-pass approaches. On the other hand, multi-pass models need significantly more computational resources for training and inference, and the usage of additional BERT-based encoders also increases the memory usage drastically.
3 THE FASTSGT MODEL
The FastSGT (Fast Schema Guided Tracker) model belongs to the category of single-pass models, keeping the flexibility along with memory and computational efficiency. Our model is based on the baseline model proposed for SGD [14] with some improvements in the decoding modules. The model architecture is illustrated in Fig. 1. It consists of four main modules: 1-Utterance Encoder, 2-Schema Encoder, 3-State Decoder, and 4-State Tracker. The first three modules constitute the NLU component and are based on neural networks, whereas the State Tracker is a rule-based module. We used BERT [3] for both encoders in our model, but similar models like RoBERTa [10] or XLNet [20] can also be used.

Assume we have a dialogue of $N$ turns. Each turn consists of the preceding system utterance ($S_t$) and the user utterance ($U_t$). Let $D = \{(S_1, U_1), (S_2, U_2), ..., (S_N, U_N)\}$ be the collection of turns in the dialogue.

The Utterance Encoder is a BERT model which encodes the user and system utterances at each turn. The Schema Encoder is also a BERT model which encodes the schema descriptions of intents, slots, and values into schema embeddings. These schema embeddings help the decoders to transfer or share knowledge between different services by providing some language understanding of each slot, intent, or value. The schema and utterance embeddings are passed to the State Decoder, a multi-task module. This module consists of five sub-modules producing the information necessary to track the state of the dialogue. Finally, the State Tracker module takes the previous state along with the current outputs of the State Decoder and predicts the current state of the dialogue by aggregating and summarizing the information across turns. Details of all model components are presented in the following subsections.

[Figure 1: The overall architecture of FastSGT (Fast Schema Guided Tracker) with exemplary inputs from a restaurant service. The figure shows the Utterance Encoder and the Schema Encoder (which fills the Schema Embedding Memory) feeding the five decoders (Intent, Requested Slot, Slot Status, Categorical Value, Non-categorical Value), whose outputs the State Tracker uses to update the previous state (slots {city: San Jose, restaurant_name: Billy Berk's}, intent FindRestaurants) into the next state (adding party_size: 2). Example encoder inputs include "[CLS] How many people? [SEP] Please find a table for two people. [SEP]" for the Utterance Encoder and "[CLS] A leading provider for restaurant search and reservations [SEP] Party size for a reservation [SEP]" for the Schema Encoder.]
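To make the tracked quantities concrete, the sketch below shows one possible in-memory representation of a dialogue turn and of the state that the State Tracker maintains; the class and field names are illustrative assumptions, not the names used in the released implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class DialogueTurn:
    """One turn of the dialogue: the preceding system utterance S_t and the user utterance U_t."""
    system_utterance: str
    user_utterance: str

@dataclass
class DialogueState:
    """The user state tracked after each turn (see Section 3.4)."""
    active_intent: str = "NONE"                                  # at most one active intent per service
    slot_values: Dict[str, str] = field(default_factory=dict)    # e.g. {"city": "San Jose"}
    requested_slots: List[str] = field(default_factory=list)     # slots requested in the last user utterance

# Example matching Figure 1: the user adds a party size to an existing restaurant search.
previous_state = DialogueState(active_intent="FindRestaurants",
                               slot_values={"city": "San Jose", "restaurant_name": "Billy Berk's"})
turn = DialogueTurn(system_utterance="How many people?",
                    user_utterance="Please find a table for two people.")
```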
3.1 Utterance Encoder
This module is responsible for encoding the current turn of the dialogue. At each turn $(S_t, U_t)$, the preceding system utterance is concatenated with the user utterance, separated by the special token [SEP], resulting in $T_t$, which serves as the input to the utterance encoder module:

$$T_t = [CLS] \oplus S_t \oplus [SEP] \oplus U_t \oplus [SEP] \quad (1)$$

The output of the first token passed to the encoder is denoted as $Y_{cls}$ and is interpreted as a sentence-level representation of the turn, whereas the token-level representations are denoted as $Y_{tok} = [Y_{tok}^1, Y_{tok}^2, ..., Y_{tok}^M]$, where $M$ is the total number of tokens in $T_t$.
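For illustration, the following is a minimal sketch of this encoding step using the Hugging Face transformers BERT interface (not the released NeMo implementation; variable names are ours). Passing the system and user utterances as a sentence pair yields exactly the [CLS] S_t [SEP] U_t [SEP] layout of Eq. (1).

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
encoder = BertModel.from_pretrained("bert-base-cased")

system_utt = "How many people?"
user_utt = "Please find a table for two people."

# Sentence-pair encoding produces: [CLS] system tokens [SEP] user tokens [SEP]   (Eq. 1)
inputs = tokenizer(system_utt, user_utt, return_tensors="pt",
                   padding="max_length", truncation=True, max_length=128)

with torch.no_grad():
    outputs = encoder(**inputs)

y_tok = outputs.last_hidden_state   # token-level representations Y_tok, shape (1, M, hidden)
y_cls = y_tok[:, 0]                 # sentence-level representation Y_cls (first token)
```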
3.2 Schema Encoder
The Schema Encoder uses the descriptions of intents, slots, and services to produce embedding vectors which represent the semantics of slots, intents and slot values. To build these schema representations we instantiate a BERT model with the same weights as the Utterance Encoder. However, this module is used just once, before the training starts, and all the schema embeddings are stored in a memory to be reused during training. This means they are kept fixed during training. This way of handling the schema embeddings is one of the main reasons behind the efficiency of our model compared to the multi-pass models in terms of computation time.

We used the same approach introduced in [14] for encoding the schemas. For a service with $N_I$ intents, $N_C$ categorical slots and $N_{NC}$ non-categorical slots, the representations of the intents are denoted as $I_i$, $1 \le i \le N_I$. Schema embeddings for the categorical and non-categorical slots are denoted as $S_i^C$, $1 \le i \le N_C$, and $S_i^{NC}$, $1 \le i \le N_{NC}$, respectively. The embeddings for the values of the $k$-th categorical slot of a service with $N_V^k$ possible values are denoted as $V_i^k$, $1 \le i \le N_V^k$.

Generally, the input to the Schema Encoder is the concatenation of two sequences, with the [SEP] token used as the separator and the [CLS] token indicating the beginning of the sequence. The Schema Encoder produces four types of schema embeddings: intents, categorical slots, non-categorical slots and categorical slot values. For a single intent embedding $I_i$, the first sequence is the corresponding service description and the second one is the intent description. For each categorical slot embedding $S_i^C$ and non-categorical slot embedding $S_i^{NC}$, the service description is concatenated with the description of the slot. To produce the schema embedding for a possible value of a categorical slot, the description of the slot is used as the first sequence along with the value itself as the second sequence.

These sequences are given one by one to the Schema Encoder before the main training starts, and the output embedding of the first token, $Y_{cls}$, is extracted and stored as the schema representation, forming the Schema Embeddings Memory.
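A minimal sketch of how the Schema Embeddings Memory can be pre-computed before training under the pairing scheme described above. The helper names are ours, and the service dictionary is assumed to follow the fields of the SGD schema files (descriptions, categorical flags, possible values).

```python
import torch

def encode_pair(tokenizer, encoder, first: str, second: str) -> torch.Tensor:
    """Encode '[CLS] first [SEP] second [SEP]' and return the Y_cls vector."""
    inputs = tokenizer(first, second, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        return encoder(**inputs).last_hidden_state[:, 0].squeeze(0)

def build_schema_memory(tokenizer, encoder, service: dict) -> dict:
    """Pre-compute all schema embeddings of one service; run once before training."""
    memory = {"intents": [], "cat_slots": [], "noncat_slots": [], "values": {}}
    for intent in service["intents"]:
        memory["intents"].append(
            encode_pair(tokenizer, encoder, service["description"], intent["description"]))
    for slot in service["slots"]:
        slot_emb = encode_pair(tokenizer, encoder, service["description"], slot["description"])
        if slot["is_categorical"]:
            memory["cat_slots"].append(slot_emb)
            # one embedding per possible value of the categorical slot
            memory["values"][slot["name"]] = [
                encode_pair(tokenizer, encoder, slot["description"], value)
                for value in slot["possible_values"]]
        else:
            memory["noncat_slots"].append(slot_emb)
    return memory   # stored and kept frozen during training
```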
3.3 State Decoder
The Schema Embeddings Memory along with the outputs of the Utterance Encoder are used as inputs to the State Decoder to predict the values necessary for state tracking. The State Decoder module consists of five sub-modules, each employing a set of projection transformations to decode their inputs. We use the two following projection layers in the decoder sub-modules:

1) Single-token projection: this projection transformation, introduced in [14], takes a schema embedding vector and the $Y_{cls}$ output of the Utterance Encoder as its inputs. The projection for predicting $p$ outputs for task $K$ is defined as $F_{FC}^K(x, y; p)$ for two input vectors $x, y \in \mathbb{R}^q$, where $q$ is the embedding size, $p$ is the size of the output (e.g. the number of classes), the first input $x$ is a schema embedding vector, and $y$ is the sentence-level output embedding vector produced by the Utterance Encoder. The sources of the inputs $x$ and $y$ depend on the task and the sub-module. The function $F_{FC}^K(x, y; p)$ for projection $K$ is defined as:

$$h_1 = GELU(W_1^K y + b_1^K) \quad (2)$$
$$h_2 = GELU(W_2^K (x \oplus h_1) + b_2^K) \quad (3)$$
$$F_{FC}^K(x, y; p) = Softmax(W_3^K h_2 + b_3^K) \quad (4)$$

where $W_i^K$ and $b_i^K$, $1 \le i \le 3$, are the learnable parameters of the projection, and $GELU$ is the activation function introduced in [5]. The symbol $\oplus$ indicates the concatenation of two vectors. The Softmax function is used to normalize the outputs as a distribution over the targets. This projection is used by the Intent, Requested Slot and Non-categorical Value Decoders.

2) Attention-based projection: the single-token projection takes just one vector from the outputs of the Utterance Encoder. For the Slot Status Decoder and the Categorical Value Decoder we propose to use a more powerful projection layer based on the multi-head attention mechanism [17]. We use the schema embedding vector $x$ as the query to attend to the token representations $Y_{tok}$ output by the Utterance Encoder. The idea is that domain-specific and slot-specific information can be extracted more efficiently from the collection of token-level representations than from the sentence-level encoded vector $Y_{cls}$ alone. The multi-head attention-based projection function $F_{MHA}^K(x, Y_{tok}; p)$ for task $K$, producing targets of size $p$, is defined as:

$$h_1 = MultiHeadAtt(query = x, keys = Y_{tok}, values = Y_{tok}) \quad (5)$$
$$F_{MHA}^K(x, Y_{tok}; p) = Softmax(W_1^K h_1 + b_1^K) \quad (6)$$

where $MultiHeadAtt$ is the multi-head attention function introduced in [17], and $W_1^K$ and $b_1^K$ are the learnable parameters of a linear projection applied after the multi-head attention. To accommodate padded utterances we use attention masking to mask out the padded portion of the utterance.
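For illustration, the following is a minimal PyTorch sketch of the two projection heads defined in Eqs. (2)-(6); the module and argument names are ours, and details such as output shapes and masking may differ from the released NeMo implementation.

```python
import torch
import torch.nn as nn

class SingleTokenProjection(nn.Module):
    """F_FC(x, y; p) of Eqs. (2)-(4): x is a schema embedding, y is Y_cls."""
    def __init__(self, hidden: int, num_outputs: int):
        super().__init__()
        self.fc1 = nn.Linear(hidden, hidden)
        self.fc2 = nn.Linear(2 * hidden, hidden)
        self.fc3 = nn.Linear(hidden, num_outputs)
        self.act = nn.GELU()

    def forward(self, x, y):
        h1 = self.act(self.fc1(y))                        # Eq. (2)
        h2 = self.act(self.fc2(torch.cat([x, h1], -1)))   # Eq. (3), concatenation x ⊕ h1
        return torch.softmax(self.fc3(h2), dim=-1)        # Eq. (4)

class AttentionProjection(nn.Module):
    """F_MHA(x, Y_tok; p) of Eqs. (5)-(6): the schema embedding queries all encoded tokens."""
    def __init__(self, hidden: int, num_outputs: int, num_heads: int = 16):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden, num_heads, batch_first=True)
        self.fc = nn.Linear(hidden, num_outputs)

    def forward(self, x, y_tok, pad_mask=None):
        # x: (batch, hidden) schema embedding used as query; y_tok: (batch, M, hidden)
        h1, _ = self.attn(query=x.unsqueeze(1), key=y_tok, value=y_tok,
                          key_padding_mask=pad_mask)           # Eq. (5), padding masked out
        return torch.softmax(self.fc(h1.squeeze(1)), dim=-1)   # Eq. (6)
```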
3.3.1 Intent Decoder. Each service schema contains a list of all possible user intents. For example, in a service for reserving flight tickets, we may have intents for searching for a flight or cancelling a ticket. At each turn, for each service, at most one intent can be active. For all services, we additionally consider an intent called NONE which indicates that there is no active intent for the current turn. An embedding vector $I_0$ is considered as the schema embedding for the NONE intent. It is a learnable embedding which is shared among all the services.

The inputs to the Intent Decoder for a service are the schema embeddings $I_i$, $0 \le i \le N_I$, from the Schema Embeddings Memory and the $Y_{cls}$ output of the Utterance Encoder. The predicted output of this sub-module is the active intent of the current turn $I_{active}$, defined as:

$$I_{active} = \operatorname{argmax}_{0 \le i \le N_I} F_{FC}^K(I_i, Y_{cls}; p = N_I) \quad (7)$$

3.3.2 Slot Request Decoder. At each turn, the user may request information about a slot instead of informing the system about a slot value. For example, when a user asks for the flight time when using a ticket reservation service, the flight_time slot of the service is requested by the user. This is a binary prediction task for each slot: slot $S_i^j$ is requested when $F_{FC}^K(S_i^j, Y_{cls}; p = 1) > 0.5$, $j \in \{C, NC\}$. The same prediction is done for both categorical and non-categorical slots.

3.3.3 Slot Status Decoder. We consider four different statuses for each slot, namely: inactive, active, dont_care, and carry_over. If the value of a slot has not changed from the previous state to the current user state, then the slot's status is "inactive". If a slot's value is updated in the current user state to "dont_care", then the status of the slot is set to "dont_care", which means the user does not care about the value of this slot. If the value for the slot is updated and its value is mentioned in the current user utterance, then its status is "active". There are many cases where the value for a slot does not exist in the last user utterance and instead comes from previous utterances in the dialogue. For such cases, the status is set to "carry_over", which means we should search the previous system or user utterances in the dialogue to find the value for this slot. More details of the carry-over mechanisms are described in subsection 3.4.

The status of the categorical slot $r$ is defined as:

$$S_r = \operatorname{argmax}_{0 \le i \le N_C} F_{MHA}^K(S_i, Y_{tok}; p = 4) \quad (8)$$

A similar decoder is used for the status of the non-categorical slots:

$$S_r = \operatorname{argmax}_{0 \le i \le N_{NC}} F_{MHA}^K(S_i, Y_{tok}; p = 4) \quad (9)$$

where $S_r$ is the status of the $r$-th non-categorical slot.

3.3.4 Categorical Value Decoder. The list of possible values for categorical slots is fixed and known. For the active service, this sub-module takes the $Y_{cls}$ and the schema embeddings of the values for all the categorical slots $V_i^k$, $1 \le k \le N_C$, $1 \le i \le N_V^k$, as input and uses the $F_{MHA}^K(x, Y_{tok}; p = 1)$ projection on each value embedding and $Y_{tok}$. Then the predictions of the values are normalized by another Softmax layer, resulting in a distribution over all possible values for each slot. The value with the maximum probability is selected as the value of the slot. If a slot's status is not "active" or "carry_over", the prediction for that slot is ignored during training.

3.3.5 Non-categorical Value Decoder. For non-categorical slots there is no predefined list of possible values, and they can take any value. For such slots, we use the spanning network approach [14] to decode and extract the slot's value directly from the user or system utterances.

This decoder uses two projection layers, one for predicting the start of the span, and one for the end of the span. The projection layers are shared among all the non-categorical slots; the only difference is their inputs. For each non-categorical slot, the token outputs of the Utterance Encoder together with the schema embedding of the slot are given as inputs to the two projection layers. These predict the probability of each token being the start or the end of the span for the value. The token with the maximum probability of being the start token is considered as the start, and the token with the maximum probability of being the end token which appears after the start token is considered as the end of the span.

For slots with no value mentioned in the current turn, we train the network to predict the first token, [CLS], as both the start and end token. The non-categorical slots with status "inactive" are ignored during training.
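The start/end selection rule of the Non-categorical Value Decoder can be sketched as follows (simplified, ignoring batching; the function name is ours):

```python
import torch

def select_span(start_logits: torch.Tensor, end_logits: torch.Tensor):
    """Pick the most likely start token and the most likely end token that does
    not precede it. Index 0 is the [CLS] token, so a (0, 0) span means no value
    is present in the current turn."""
    start = int(torch.argmax(start_logits))
    end = start + int(torch.argmax(end_logits[start:]))   # restrict end to positions >= start
    return start, end

# tokens[start:end + 1] is decoded back to text and assigned to the slot, unless the
# span is (0, 0) or falls outside the user utterance, in which case the carry-over
# mechanism of Section 3.4 takes over.
```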
3.4 State Tracker
The user's state at each turn includes the active intent, the collection of slots and their corresponding values, and the list of requested slots. This module takes all the predictions from the decoders along with the previous user state and produces the current user state. The list of requested slots and the active intent are produced directly from the outputs of the Slot Request Decoder and the Intent Decoder. Producing the collection of slots and their values requires more processing.

We start with the list of slots and values from the previous state, and update it as follows. For each slot, if its status is "inactive", no update or change is applied and the value from the previous state is carried over. If its status is predicted as "dont_care", then the explicit value "don't care" is set as the value for the slot. If the slot status is "active", then we should retrieve a value for that slot. We initially check the output of the value decoders: if a value is predicted, then that value is assigned to the slot. If the "#CARRYOVER#" special value is predicted for a categorical slot, or the start and end pointers for a non-categorical slot point to a value outside the span of the user utterance, then we kick off the carry-over mechanism. The carry-over mechanisms try to search the previous states or system actions for a value. When the "carry_over" status is predicted for any slot, the same carry-over procedures are triggered.

The carry-over procedures [12] enable the model to retrieve a value for a slot from the preceding system utterance or even previous turns in the dialogue. There are many cases where the user accepts values offered by the system and the value is not mentioned explicitly in the user utterance. In our system we have implemented two different carry-over procedures. The value may be offered in the last system utterance, or even in previous turns by the system; the procedure to retrieve values in these cases is called in-service carry-over. There are also cases where a switch happens between two services in multi-domain dialogues. A dialogue may contain more than one service and the user may switch between these services. When a switch happens, we may need to carry some values from a slot in the previous service to another slot in the current service. This carry-over procedure is called cross-service carry-over.

3.4.1 In-service Carry-over. We trigger this procedure in three cases: 1-the status of a slot is predicted as "carry_over", 2-the span found for a non-categorical slot is not within the user utterance, 3-the "#CARRYOVER#" value is predicted for a categorical slot with "active" or "carry_over" status. The in-service carry-over procedure tries to retrieve a value for a slot from the previous system utterances in the dialogue. We search the system actions starting from the most recent system utterance and then move backwards, looking for a value mentioned for that slot. If multiple values are found, the most recent value is used for the slot. If no value could be found, the slot is not updated in the current state.

3.4.2 Cross-service Carry-over. Carrying values from previous services to the current one when a switch happens in a turn is done by the cross-service carry-over procedure. The previous service and slots are called sources, and the new service and slots are called targets. To perform the carry-over, we need to build, for each slot, a list of candidate slots from which a carry-over can happen. We create this carry-over candidate list from the training data. We process all the dialogues in the training data, and count the number of times a value from a source service and slot is carried over to a target service and slot when a switch happens. These counts are normalized by the number of switches between the two services in the whole training data to give a better estimate of the likelihood of carry-overs. In our experiments, candidates with likelihoods less than 0.1 are ignored. The carry-over relation between two slots is considered symmetric, and statistics from both sides are aggregated. The candidate list for each slot thus contains the slots of other services that are looked up to find a carry-over value.

When a switch between two services happens in the dialogue, the cross-service carry-over procedure is triggered. It looks into the candidates of all the slots of the new service and carries over any value it finds from the previous service. If multiple values for a slot are found in the dialogue, the most recent one is used. A sketch of both procedures is given below.
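The following sketch illustrates the two carry-over procedures under simplified, assumed data structures: system actions stored as per-turn lists of (slot, value) pairs, and pre-extracted service-switch records for building the candidate list.

```python
from collections import Counter
from typing import List, Optional, Tuple

def in_service_carry_over(slot: str,
                          system_actions_history: List[List[Tuple[str, str]]]) -> Optional[str]:
    """Search previous system actions, most recent turn first, for a value offered
    for `slot`. Returns None if nothing is found, in which case the slot is left
    unchanged in the current state."""
    for turn_actions in reversed(system_actions_history):
        for action_slot, value in turn_actions:
            if action_slot == slot:
                return value
    return None

def build_carryover_candidates(dialogues, min_likelihood: float = 0.1):
    """Estimate from the training dialogues how often a value moves from a source
    (service, slot) to a target (service, slot) when the user switches services,
    normalized by the number of switches between the two services."""
    carry_counts, switch_counts = Counter(), Counter()
    for dialogue in dialogues:
        for switch in dialogue["service_switches"]:          # assumed pre-extracted switch records
            switch_counts[(switch["from_service"], switch["to_service"])] += 1
            for src_slot, tgt_slot in switch["carried_slots"]:
                carry_counts[(switch["from_service"], src_slot,
                              switch["to_service"], tgt_slot)] += 1
    candidates = {}
    for (src_srv, src_slot, tgt_srv, tgt_slot), count in carry_counts.items():
        likelihood = count / switch_counts[(src_srv, tgt_srv)]
        if likelihood >= min_likelihood:                     # candidates below 0.1 are ignored
            candidates.setdefault((tgt_srv, tgt_slot), []).append((src_srv, src_slot, likelihood))
    return candidates
```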
3.4.3 State Summarization. At each turn, the predictions of the decoders are used to update the previous state and produce the current state. The slots and their values from the previous state are updated with the new predictions if their statuses are predicted as active, and are otherwise kept unchanged. Some WL-DST models, e.g. TRADE [18] or the model introduced in [19], require the entire dialogue history, or part of it, to be passed as input. FastSGT decodes just the current turn and updates only the parts of the state that need to change at each turn, which helps the computational efficiency and robustness of the model.
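A sketch of the per-turn state update described above, assuming the decoder outputs have been gathered into a dictionary mapping each slot to a (status, value) pair, with value set to None when no usable value was extracted from the current turn:

```python
def update_state(prev_slot_values: dict, decoder_outputs: dict, lookup_carry_over) -> dict:
    """Produce the slot-value mapping of the current state.

    decoder_outputs maps each slot name to a (status, value) pair; value is None
    when the decoders did not extract a usable value from the current turn
    (e.g. "#CARRYOVER#" was predicted, or the span fell outside the user utterance).
    lookup_carry_over(slot) stands for the procedures of Sections 3.4.1 and 3.4.2."""
    state = dict(prev_slot_values)                   # "inactive" slots stay unchanged
    for slot, (status, value) in decoder_outputs.items():
        if status == "dont_care":
            state[slot] = "dont care"
        elif status in ("active", "carry_over"):
            if status == "active" and value is not None:
                state[slot] = value                  # value found in the current user utterance
            else:
                carried = lookup_carry_over(slot)    # search previous system utterances / services
                if carried is not None:
                    state[slot] = carried            # if nothing is found the slot stays unchanged
    return state
```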
4 EXPERIMENTS
4.1 Experimental Settings
We ran all our experiments on systems with 8 V100 GPUs using mixed precision training [13] to make the training process faster. We used BERT-base-cased for the encoders. All of the models were trained for at most 160 epochs to reduce the variance in the results, although most of them converged in fewer than 60 epochs. We repeated each experiment three times and report the average in all tables.

For each of the BERT-based encoders and attention-based projection layers we used 16 heads. We optimized the model using the Adam [8] optimizer with default parameter settings. The batch size was set to 128 per GPU, with a maximum learning rate of 4e-4. Linear decay annealing was used with a warm-up of 2% of the total steps. A dropout rate of 0.2 was used for regularization in all modules. We tuned all the parameters to get the best joint goal accuracy on the dev set of the datasets.

4.2 Datasets
The SGD dataset is a multi-domain goal-oriented dataset that contains over 16k conversations between a human user and a virtual assistant, spanning 16 domains. The SGD dataset provides schemas that contain details of the APIs' interfaces. The schemas define the list of slots supported by each service and the possible values for categorical slots, along with the supported intents.

In the original SGD dataset, the test and validation splits contain Unseen Services, i.e. services that do not exist in the training set. By design, those services are in most cases similar to the services in the training set, with similar descriptions but different names. We created another version of the dataset by merging all the dialogues and splitting them again randomly into 70%/15%/15% train/validation/test. In this version of the dataset, called SGD+, all services are seen. It allows a better evaluation of our model, as the focus of our model is on seen services, and by using the original split we would be ignoring a significant part of the dataset (some services simply were not present in the original Seen Services split).

4.3 Evaluation Metrics
We evaluated our proposed model and compared it with the baseline on the following metrics:

• Active Intent Accuracy: the fraction of user turns with correctly predicted active intent, where the active intent represents the user's intent in the current turn that is being fulfilled by the system.
• Requested Slot F1: the macro-averaged F1 score of the requested slots. Turns with no requested slots are skipped. Requested slots are the slots that were requested by the user in the most recent utterance.
• Average Goal Accuracy: the average accuracy of predicting the value of a slot correctly, where the user goal represents the user's constraints mentioned in the dialogue up until the current user turn.
• Joint Goal Accuracy: the average accuracy of predicting all slot values in a dialogue turn correctly. This is the strictest of these metrics. A simplified sketch of the two goal-accuracy metrics follows this list.
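For clarity, the sketch below shows one way to compute the two goal-accuracy metrics; the official SGD evaluation script applies additional matching rules (e.g. fuzzy matching for non-categorical values), so this is only an approximation.

```python
def goal_accuracies(predicted_states, ground_truth_states):
    """predicted_states / ground_truth_states: one slot->value dict per user turn,
    describing the dialogue state at that turn."""
    slot_correct = slot_total = joint_correct = 0
    for pred, gold in zip(predicted_states, ground_truth_states):
        turn_correct = True
        for slot, gold_value in gold.items():
            slot_total += 1
            if pred.get(slot) == gold_value:
                slot_correct += 1
            else:
                turn_correct = False
        joint_correct += int(turn_correct and set(pred) == set(gold))
    average_ga = slot_correct / max(slot_total, 1)                 # per-slot accuracy
    joint_ga = joint_correct / max(len(ground_truth_states), 1)    # all slots in a turn must match
    return average_ga, joint_ga
```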
4.4 Performance Evaluation
In this section, we evaluate and compare FastSGT with the baseline SGD model proposed in [14] on the two datasets, SGD and SGD+. A thorough comparison of both models with regard to all the evaluation metrics explained in subsection 4.3 is reported in Tables 1 and 2. We report the metrics calculated on turns from all services versus just the seen services. The performance of the models trained and evaluated on single-domain dialogues and on all dialogues is reported separately; all dialogues include both single-domain and multi-domain dialogues.

There were some issues in the original evaluation process of the SGD baseline which we had to fix. First, some services were considered seen services during the evaluation of single-domain dialogues although they do not actually exist in the training data. The other issue was that turns which come after an unseen service in multi-domain dialogues could be counted as seen by the original evaluation; the errors from unseen services may propagate through the dialogue and affect some of the metrics for seen services. We fixed this by counting a turn only if no earlier turn in the dialogue comes from an unseen service. These fixes helped to improve the results. However, to have a fair comparison we also report the performance of the baseline model and ours with and without these fixes; in Table 1 they are denoted by +/- in the column labelled "Eval Fix".

The performance of our model on all the metrics and on both versions of the dataset is better than the baseline. The main advantage of our model compared to the baseline comes from the carry-over mechanisms. These procedures are enhanced by the capability of predicting "carry_over" statuses for all slots, and by the "#CARRYOVER#" value for the categorical slots. During our error analysis, we found that a great number of errors come from cases where the value for a slot is not explicitly mentioned in the user utterance. The other advantage of FastSGT over the SGD baseline is the attention-based projection functions, which enable our model to better model the utterance encoder's outputs.

Table 1: Evaluation results of our model compared to the SGD baseline [14] on the dev/test sets of the SGD dataset. The results are reported separately for single-domain dialogues and all dialogues. We also report the results without the fixes in the evaluation process of the baseline for both seen and all services.

                                       Single-domain Dialogues                            All Dialogues
Model         Services  Eval Fix   Intent Acc   Slot Req F1  Average GA   Joint GA      Intent Acc   Slot Req F1  Average GA   Joint GA
SGD Baseline  all       -          96.00/88.05  96.50/95.60  77.60/68.40  48.60/35.60   90.80/90.60  97.30/96.50  74.00/56.00  41.10/25.40
FastSGT       all       -          96.45/88.60  96.55/94.65  81.11/71.22  56.66/39.77   91.58/90.33  97.53/96.33  78.22/60.66  52.06/29.20
SGD Baseline  seen      -          99.03/78.22  98.74/96.83  88.12/92.17  68.61/73.94   96.44/94.50  99.47/99.29  79.86/67.77  54.68/41.63
FastSGT       seen      -          98.94/77.53  98.80/96.89  92.98/94.12  83.13/80.25   96.61/94.18  99.66/99.55  88.78/76.52  71.34/55.23
SGD Baseline  seen      +          99.00/75.12  96.08/99.22  90.84/91.42  71.14/68.94   96.08/91.64  99.62/99.66  83.33/81.03  61.15/60.05
FastSGT       seen      +          98.86/73.98  99.64/99.24  96.54/95.31  88.03/81.56   96.26/91.44  99.65/99.64  92.33/92.12  79.65/78.55

Table 2: Evaluation results of our model compared to the SGD baseline [14] on the SGD+ dataset for single-domain dialogues and also all dialogues. All dialogues include all the single-domain and multi-domain dialogues.

                            Single-domain Dialogues                            All Dialogues
Model         Intent Acc   Slot Req F1  Average GA   Joint GA      Intent Acc   Slot Req F1  Average GA   Joint GA
SGD Baseline  95.31/95.70  99.49/99.66  91.85/91.67  71.76/71.86   96.34/96.45  99.69/99.69  81.25/81.15  56.07/56.11
FastSGT       95.36/95.67  99.52/99.59  95.57/95.44  83.97/84.31   96.17/96.36  99.69/99.69  95.33/95.46  82.03/82.81

4.5 Data Augmentation
One of the main challenges in building goal-oriented dialogue systems is the lack of sufficient high quality annotated data. In this section, we study the effect of augmentation on the performance of our model. We created augmented versions of the SGD and SGD+ datasets by randomly replacing the values of the non-categorical slots with other values seen for the same slot in other dialogues. This enables us to create new dialogues from the available dialogues in an offline manner. However, augmentation in multi-turn dialogues can be challenging, as changing the value of a slot may require updates in the rest of the dialogue to keep the altered dialogue consistent. We carefully updated the whole dialogue with each slot value replacement to maintain the integrity of the dialogue's content and annotations (a sketch of this replacement step is shown after Table 4).

Augmentation for categorical slots was not possible since the dataset does not provide the unique position of the categorical slot values in the dialogue utterances. Also, we did not try the augmentation on multi-domain dialogues, as switching between services makes it more challenging to maintain the consistency of the dialogue.

We augmented the data to 10x of the original data, but kept the number of training steps fixed to have a fair comparison by keeping the total amount of computation the same. The results reported in Table 4 show the effectiveness of augmentation on both the SGD and SGD+ datasets for single-domain dialogues. The experiments on the SGD dataset are done just for the seen services.

Table 4: FastSGT with and without data augmentation (Aug.) on seen single-domain dialogues of the SGD and SGD+ datasets. All results are reported for dev/test sets.

Dataset  Aug.  Intent Acc   Slot Req F1  Average GA   Joint GA
SGD      -     98.86/73.98  99.64/99.24  96.54/95.31  88.03/81.56
SGD      +     98.74/73.97  99.59/99.31  96.70/96.31  88.66/83.12
SGD+     -     95.36/95.67  99.52/99.59  95.57/95.44  83.97/84.31
SGD+     +     95.31/95.76  99.49/99.65  96.23/96.19  84.93/86.24
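A sketch of the offline value-replacement step for non-categorical slots; the dialogue and value-pool structures are simplified assumptions, and in practice the span annotations that index into the rewritten utterances also have to be recomputed.

```python
import random

def augment_dialogue(dialogue, value_pool, rng=random):
    """Replace non-categorical slot values with other values observed for the same
    slot in other dialogues (value_pool: slot -> list of values). Every utterance
    containing the old value is rewritten so the text stays consistent with the
    new annotation; span offsets would also need to be recomputed afterwards."""
    for slot, old_value in list(dialogue["slot_values"].items()):
        candidates = [v for v in value_pool.get(slot, []) if v != old_value]
        if not candidates:
            continue
        new_value = rng.choice(candidates)
        dialogue["slot_values"][slot] = new_value
        for turn in dialogue["turns"]:
            turn["utterance"] = turn["utterance"].replace(old_value, new_value)
    return dialogue
```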
4.6 Ablation Study
In this section, we study and evaluate the effectiveness of different aspects of our model. The performance of different variants of FastSGT is reported in Table 3. In each variant, one part of the model is disabled to show the effect of each feature on the final performance. As can be seen, both carry-over mechanisms are very effective, and most of the improvements of our model result from this part of the model. The experiments show the effectiveness of the cross-service carry-over for multi-domain dialogues, which was expected. We do not report the results of removing the cross-service carry-over for single-domain dialogues as it does not affect such dialogues.

The performance of the variant of the model without the carry-over status or carry-over values for categorical slots is still good. The main reason is that, while this variant lacks these features, it can still handle carry-over for non-categorical slots by predicting the "active" status together with an out-of-bound span prediction. The experiments also show that the attention-based projections help the accuracy of the model, which demonstrates the effectiveness of exploiting all the encoded outputs via the attention mechanism instead of just the first output of the encoder.

Table 3: The results of the ablation study of FastSGT on test/dev sets of SGD+.

                              Single-domain Dialogues                            All Dialogues
Model                       Intent Acc   Slot Req F1  Average GA   Joint GA      Intent Acc   Slot Req F1  Average GA   Joint GA
SGD Baseline                95.31/95.70  99.49/99.66  91.85/91.67  71.76/71.86   96.34/96.45  99.69/99.69  81.25/81.15  56.07/56.11
FastSGT                     95.36/95.67  99.52/99.59  95.57/95.44  83.97/84.31   96.17/96.36  99.69/99.69  95.33/95.46  82.03/82.81
- Attention-based Layer     95.23/95.72  99.51/99.62  95.28/95.04  82.85/83.04   96.32/96.48  99.69/99.70  95.14/95.44  81.80/82.76
- Carry-over Value/Status   95.12/95.87  99.54/99.63  95.30/95.40  81.88/82.36   96.32/96.50  99.68/99.71  95.08/95.27  80.84/81.65
- In-service Carry-over     95.28/95.83  99.50/99.65  91.48/91.18  71.48/71.34   96.25/96.41  99.67/99.70  92.07/92.01  72.62/72.58
- Cross-service Carry-over  -            -            -            -             96.28/96.43  99.68/99.71  84.31/84.37  62.90/63.54

5 CONCLUSIONS
In this paper we proposed FastSGT, an efficient and robust state tracker model for goal-oriented dialogue systems. In many cases in multi-turn dialogues, the value of a slot is not mentioned explicitly in the user utterance. To cope with this issue, the proposed model incorporates two carry-over procedures to retrieve the value of such slots from previous system utterances or other services. Additionally, FastSGT utilizes a multi-head attention mechanism in some of the decoders to better model the encoder outputs.

We ran several experiments and compared our model with the baseline model of the SGD dataset. The results indicate that our model has significantly higher accuracy, while at the same time being efficient in terms of memory utilization and computational resources. In additional experiments, we studied the effectiveness of augmentation for our model and showed that data augmentation can improve the performance of the model. Finally, we also included ablation studies measuring the impact of different parts of the model on its performance.
REFERENCES
[1] Vevake Balaraman and Bernardo Magnini. 2020. Domain-aware dialogue state tracker for multi-domain dialogue systems. arXiv preprint arXiv:2001.07526 (2020).
[2] Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gasic. 2018. MultiWOZ - A Large-Scale Multi-Domain Wizard-of-Oz Dataset for Task-Oriented Dialogue Modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 5016-5026.
[3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 4171-4186.
[4] Pavel Gulyaev, Eugenia Elistratova, Vasily Konovalov, Yuri Kuratov, Leonid Pugachev, and Mikhail Burtsev. 2020. Goal-oriented multi-task BERT-based dialogue state tracker. arXiv preprint arXiv:2002.02450 (2020).
[5] Dan Hendrycks and Kevin Gimpel. 2016. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415 (2016).
[6] John Chan, Junyuan Zheng, and Onkar Salvi. 2020. Candidate Attended Dialogue State Tracking Using BERT. DSTC8 Workshop, AAAI-20 (2020).
[7] Sungdong Kim, Sohee Yang, Gyuwan Kim, and Sang-Woo Lee. 2019. Efficient Dialogue State Tracking by Selectively Overwriting Memory. arXiv preprint arXiv:1911.03906 (2019).
[8] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[9] Shuyu Lei, Shuaipeng Liu, Mengjun Sen, Huixing Jiang, and Xiaojie Wang. 2020. Zero-shot state tracking and user adoption tracking on schema-guided dialogue. In Dialog System Technology Challenge Workshop at AAAI.
[10] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
[11] Yue Ma, Zengfeng Zeng, Dawei Zhu, Xuan Li, Yiying Yang, Xiaoyuan Yao, Kaijie Zhou, and Jianping Shen. 2019. An End-to-End Dialogue State Tracking System with Machine Reading Comprehension and Wide & Deep Classification. arXiv preprint arXiv:1912.09297 (2019).
[12] Yunbo Cao, Miao Li, and Haoqi Xiong. 2020. The SPPD System for Schema Guided Dialogue State Tracking Challenge. DSTC8 Workshop, AAAI-20 (2020).
[13] Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et al. 2017. Mixed precision training. arXiv preprint arXiv:1710.03740 (2017).
[14] Abhinav Rastogi, Xiaoxue Zang, Srinivas Sunkara, Raghav Gupta, and Pranav Khaitan. 2019. Towards scalable multi-domain conversational agents: The schema-guided dialogue dataset. arXiv preprint arXiv:1909.05855 (2019).
[15] Yu-Ping Ruan, Zhen-Hua Ling, Jia-Chen Gu, and Quan Liu. 2020. Fine-tuning BERT for schema-guided zero-shot dialogue state tracking. arXiv preprint arXiv:2002.00181 (2020).
[16] Pararth Shah, Dilek Hakkani-Tür, Gokhan Tür, Abhinav Rastogi, Ankur Bapna, Neha Nayak, and Larry Heck. 2018. Building a conversational agent overnight with dialogue self-play. arXiv preprint arXiv:1801.04871 (2018).
[17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998-6008.
[18] Chien-Sheng Wu, Andrea Madotto, Ehsan Hosseini-Asl, Caiming Xiong, Richard Socher, and Pascale Fung. 2019. Transferable Multi-Domain State Generator for Task-Oriented Dialogue Systems. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 808-819.
[19] Kevin Knight, Xing Shi, and Scot Fang. 2020. A BERT-based Unified Span Detection Framework for Schema-Guided Dialogue State Tracking. DSTC8 Workshop, AAAI-20 (2020).
[20] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems. 5754-5764.
[21] Zheng Zhang, Ryuichi Takanobu, Minlie Huang, and Xiaoyan Zhu. 2020. Recent advances and challenges in task-oriented dialog system. arXiv preprint arXiv:2003.07490 (2020).