A Fast and Robust BERT-based Dialogue State Tracker for Schema-Guided Dialogue Dataset

Vahid Noroozi (vnoroozi@nvidia.com), Yang Zhang (yangzhang@nvidia.com), Evelina Bakhturina (ebakhturina@nvidia.com), Tomasz Kornuta (tkornuta@nvidia.com)
NVIDIA, USA

ABSTRACT
Dialog State Tracking (DST) is one of the most crucial modules for goal-oriented dialogue systems. In this paper, we introduce FastSGT (Fast Schema Guided Tracker), a fast and robust BERT-based model for state tracking in goal-oriented dialogue systems. The proposed model is designed for the Schema-Guided Dialogue (SGD) dataset, which contains natural language descriptions for all the entities including user intents, services, and slots. The model incorporates two carry-over procedures for handling the extraction of values that are not explicitly mentioned in the current user utterance. It also uses multi-head attention projections in some of the decoders to better model the encoder outputs.

In the conducted experiments we compare FastSGT to the baseline model for the SGD dataset. Our model keeps the efficiency in terms of computation and memory consumption while improving the accuracy significantly. Additionally, we present ablation studies measuring the impact of different parts of the model on its performance. We also show the effectiveness of data augmentation for improving the accuracy without increasing the amount of computational resources.

KEYWORDS
goal-oriented dialogue systems, dialogue state tracking, schema guided dialogues

ACM Reference Format:
Vahid Noroozi, Yang Zhang, Evelina Bakhturina, and Tomasz Kornuta. 2020. A Fast and Robust BERT-based Dialogue State Tracker for Schema-Guided Dialogue Dataset. In Proceedings of KDD Workshop on Conversational Systems Towards Mainstream Adoption (KDD Converse'20). ACM, New York, NY, USA, 8 pages.

1 INTRODUCTION
Goal-oriented dialogue systems are a category of dialogue systems designed to solve one or multiple specific goals or tasks (e.g. flight reservation, hotel reservation, food ordering, appointment scheduling) [21]. Traditionally, goal-oriented dialogue systems are set up as a pipeline with four main modules: 1-Natural Language Understanding (NLU), 2-Dialogue State Tracking (DST), 3-Dialog Policy Manager, and 4-Response Generator. NLU extracts the semantic information from each dialogue turn, which includes e.g. user intents and slot values mentioned by the user or the system. DST takes the extracted entities and builds the state of the user goal by aggregating and tracking the information across all turns of the dialogue. The Dialog Policy Manager is responsible for deciding the next action of the system based on the current state. Finally, the Response Generator converts the system action into natural language text understandable by the user.

The NLU and DST modules have been shown to train successfully with data-driven approaches [21]. Building on recent advances in language understanding driven by models like BERT [3], researchers have successfully combined NLU and DST into a single unified module, called Word-Level Dialog State Tracking (WL-DST) [7, 14, 18]. WL-DST models take the user or system utterances in natural language as input and predict the state at each turn. The model we propose in this paper falls into this class of algorithms.

Most of the previously published public datasets, such as MultiWOZ [2] or M2M [16], use a fixed list of defined slots for each domain without any information on the semantics of the slots and other entities in the dataset. As a result, systems developed on these datasets fail to understand the semantic similarity between domains and slots. The capability of sharing knowledge between slots and domains might help a model to work across multiple domains and/or services, as well as to handle unseen slots and APIs when the new APIs and slots are similar in functionality to those present in the training data.

The Schema-Guided Dialogue (SGD) dataset [14] was created to overcome these challenges by defining and including schemas for the services. A schema can be interpreted as an ontology encompassing the naming and definition of the entities, properties and relations between the concepts. In other words, a schema defines not only the structure of the underlying data (relations between all the services, slots, intents and values), but also provides descriptions of most of the entities expressed in natural language. As a result, dialogue systems can exploit that rich information to capture more general semantic meanings of the concepts. Additionally, the availability of the schema enables the model to use the power of pre-trained models like BERT to transfer or share knowledge between different services and domains. The recent emergence of the SGD dataset has triggered a new line of research on dialogue systems based on schemas, e.g. [1, 4, 9, 15].
Many state-of-the-art models proposed for the SGD dataset, despite showing impressive performance in terms of accuracy, appear not to be very efficient in terms of computational complexity and memory consumption, e.g. [9, 11, 12, 19]. To address these issues, we introduce a fast and flexible WL-DST model called Fast Schema Guided Tracker (FastSGT). The source code of the model is publicly available at: https://github.com/NVIDIA/NeMo. The main contributions of the paper are as follows:

• FastSGT is able to predict the whole state at each turn with just a single pass through the model, which lowers both the training and inference time.
• The model employs carry-over mechanisms for transferring values between slots, enabling switching between services and accepting the values offered by the system during the dialogue.
• We propose an attention-based projection that attends over all the tokens of the main encoder to better model the encoded utterances.
• We evaluate the model on the SGD dataset [14] and show that it has significantly higher accuracy than the baseline model of SGD, while keeping the efficiency in terms of computation and memory utilization.
• We show the effectiveness of augmentation on the SGD dataset without increasing the number of training steps.

2 RELATED WORKS
The availability of schema descriptions for services, intents and slots enables NLU/DST models to share and transfer knowledge between different services that have similar slots and intents. Considering the recent advances in natural language understanding and the rise of Transformer-based models [17] like BERT [3] or RoBERTa [10], this looks like a promising approach for training a unified model on datasets aggregated from different sources. We categorize the models proposed for the SGD dataset into two main categories: multi-pass and single-pass models.

2.1 Multi-pass Models
The general principle of operation of multi-pass models [6, 11, 12, 15, 19] lies in passing the descriptions of every slot and intent as inputs to BERT-like encoders to produce their embeddings. As a result, the encoders are executed several times per single dialogue turn. Passing the descriptions to the model along with the user or system utterances enables the model to have a better understanding of the task and facilitates learning the similarity between intents and slots. The SPDD model [12] is a multi-pass model which showed one of the highest performances in terms of accuracy on the SGD dataset. For instance, in order to predict the user state for a service with 4 intents and 10 slots and 3 slots being active in a given turn, this model needs 27 passes through the encoder (4 for intents, 10 for requested statuses, 10 for statuses, and 3 for values). Such approaches handle unseen services well and achieve high accuracy, but seem not to be practical in many cases when time or resources are limited.

One obvious disadvantage of multi-pass models is their lack of efficiency. The other disadvantage is their memory consumption. They typically use multiple BERT-like models (e.g. five in SPDD) for predicting intents, requested slots, slot statuses, categorical values, and non-categorical values. This significantly increases the memory consumption compared to most of the single-pass models with a single encoder.

2.2 Single-pass Models
The works that incorporate the single-pass approach [1, 14] rely on BERT-like models to encode the descriptions of services, slots, intents and slot values into representations, called schema embeddings. The main difference lies in the fact that this procedure is executed just once, before the actual training starts, removing the need to pass the descriptions through the model for each one of the turns/predictions.

While these models are very efficient and robust in terms of training and inference time, they have shown significantly lower performance in terms of accuracy compared to the multi-pass approaches. On the other hand, multi-pass models need significantly more computational resources for training and inference, and the usage of additional BERT-based encoders also increases the memory usage drastically.
3 THE FASTSGT MODEL
The FastSGT (Fast Schema Guided Tracker) model belongs to the category of single-pass models, keeping the flexibility along with memory and computational efficiency. Our model is based on the baseline model proposed for SGD [14] with some improvements in the decoding modules. The model architecture is illustrated in Fig. 1. It consists of four main modules: 1-Utterance Encoder, 2-Schema Encoder, 3-State Decoder, and 4-State Tracker. The first three modules constitute the NLU component and are based on neural networks, whereas the State Tracker is a rule-based module. We used BERT [3] for both encoders in our model, but similar models like RoBERTa [10] or XLNet [20] can also be used.

Assume we have a dialogue of $N$ turns. Each turn consists of the preceding system utterance ($S_t$) and the user utterance ($U_t$). Let $D = \{(S_1, U_1), (S_2, U_2), ..., (S_N, U_N)\}$ be the collection of turns in the dialogue.

The Utterance Encoder is a BERT model which encodes the user and system utterances at each turn. The Schema Encoder is also a BERT model which encodes the schema descriptions of intents, slots, and values into schema embeddings. These schema embeddings help the decoders to transfer or share knowledge between different services by providing some language understanding of each slot, intent, or value. The schema and utterance embeddings are passed to the State Decoder, a multi-task module. This module consists of five sub-modules producing the information necessary to track the state of the dialogue. Finally, the State Tracker module takes the previous state along with the current outputs of the State Decoder and predicts the current state of the dialogue by aggregating and summarizing the information across turns. Details of all model components are presented in the following subsections.

[Figure 1: The overall architecture of FastSGT (Fast Schema Guided Tracker) with exemplary inputs from a restaurant service. The figure shows the Utterance Encoder and the Schema Encoder (which fills the Schema Embedding Memory) feeding the five decoders (Intent, Requested Slot, Slot Status, Categorical Value, Non-categorical Value), whose outputs the State Tracker uses to update the previous state (slots {city: San Jose, restaurant_name: Billy Berk's}, intent FindRestaurants) into the next state (adding party_size: 2). Example encoder inputs include "[CLS] How many people? [SEP] Please find a table for two people. [SEP]" for the Utterance Encoder and "[CLS] A leading provider for restaurant search and reservations [SEP] Party size for a reservation [SEP]" for the Schema Encoder.]
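To make the tracked quantities concrete, the sketch below shows one possible in-memory representation of a dialogue turn and of the state that the State Tracker maintains; the class and field names are illustrative assumptions, not the names used in the released implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class DialogueTurn:
    """One turn of the dialogue: the preceding system utterance S_t and the user utterance U_t."""
    system_utterance: str
    user_utterance: str

@dataclass
class DialogueState:
    """The user state tracked after each turn (see Section 3.4)."""
    active_intent: str = "NONE"                                  # at most one active intent per service
    slot_values: Dict[str, str] = field(default_factory=dict)    # e.g. {"city": "San Jose"}
    requested_slots: List[str] = field(default_factory=list)     # slots requested in the last user utterance

# Example matching Figure 1: the user adds a party size to an existing restaurant search.
previous_state = DialogueState(active_intent="FindRestaurants",
                               slot_values={"city": "San Jose", "restaurant_name": "Billy Berk's"})
turn = DialogueTurn(system_utterance="How many people?",
                    user_utterance="Please find a table for two people.")
```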
3.1 Utterance Encoder
This module is responsible for encoding the current turn of the dialogue. At each turn $(S_t, U_t)$, the preceding system utterance is concatenated with the user utterance, separated by the special token [SEP], resulting in $T_t$, which serves as the input to the utterance encoder module:

$$T_t = [CLS] \oplus S_t \oplus [SEP] \oplus U_t \oplus [SEP] \quad (1)$$

The output of the first token passed to the encoder is denoted as $Y_{cls}$ and is interpreted as a sentence-level representation of the turn, whereas the token-level representations are denoted as $Y_{tok} = [Y_{tok}^1, Y_{tok}^2, ..., Y_{tok}^M]$, where $M$ is the total number of tokens in $T_t$.
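For illustration, the following is a minimal sketch of this encoding step using the Hugging Face transformers BERT interface (not the released NeMo implementation; variable names are ours). Passing the system and user utterances as a sentence pair yields exactly the [CLS] S_t [SEP] U_t [SEP] layout of Eq. (1).

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
encoder = BertModel.from_pretrained("bert-base-cased")

system_utt = "How many people?"
user_utt = "Please find a table for two people."

# Sentence-pair encoding produces: [CLS] system tokens [SEP] user tokens [SEP]   (Eq. 1)
inputs = tokenizer(system_utt, user_utt, return_tensors="pt",
                   padding="max_length", truncation=True, max_length=128)

with torch.no_grad():
    outputs = encoder(**inputs)

y_tok = outputs.last_hidden_state   # token-level representations Y_tok, shape (1, M, hidden)
y_cls = y_tok[:, 0]                 # sentence-level representation Y_cls (first token)
```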
3.2 Schema Encoder
The Schema Encoder uses the descriptions of intents, slots, and services to produce embedding vectors which represent the semantics of slots, intents and slot values. To build these schema representations we instantiate a BERT model with the same weights as the Utterance Encoder. However, this module is used just once, before the training starts, and all the schema embeddings are stored in a memory to be reused during training. This means they are kept fixed during training. This way of handling the schema embeddings is one of the main reasons behind the efficiency of our model compared to the multi-pass models in terms of computation time.

We used the same approach introduced in [14] for encoding the schemas. For a service with $N_I$ intents, $N_C$ categorical slots and $N_{NC}$ non-categorical slots, the representations of the intents are denoted as $I_i$, $1 \le i \le N_I$. Schema embeddings for the categorical and non-categorical slots are denoted as $S_i^C$, $1 \le i \le N_C$, and $S_i^{NC}$, $1 \le i \le N_{NC}$, respectively. The embeddings for the values of the $k$-th categorical slot of a service with $N_V^k$ possible values are denoted as $V_i^k$, $1 \le i \le N_V^k$.

Generally, the input to the Schema Encoder is the concatenation of two sequences, with the [SEP] token used as the separator and the [CLS] token indicating the beginning of the sequence. The Schema Encoder produces four types of schema embeddings: intents, categorical slots, non-categorical slots and categorical slot values. For a single intent embedding $I_i$, the first sequence is the corresponding service description and the second one is the intent description. For each categorical slot embedding $S_i^C$ and non-categorical slot embedding $S_i^{NC}$, the service description is concatenated with the description of the slot. To produce the schema embedding for a possible value of a categorical slot, the description of the slot is used as the first sequence along with the value itself as the second sequence.

These sequences are given one by one to the Schema Encoder before the main training starts, and the output embedding of the first token, $Y_{cls}$, is extracted and stored as the schema representation, forming the Schema Embeddings Memory.
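A minimal sketch of how the Schema Embeddings Memory can be pre-computed before training under the pairing scheme described above. The helper names are ours, and the service dictionary is assumed to follow the fields of the SGD schema files (descriptions, categorical flags, possible values).

```python
import torch

def encode_pair(tokenizer, encoder, first: str, second: str) -> torch.Tensor:
    """Encode '[CLS] first [SEP] second [SEP]' and return the Y_cls vector."""
    inputs = tokenizer(first, second, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        return encoder(**inputs).last_hidden_state[:, 0].squeeze(0)

def build_schema_memory(tokenizer, encoder, service: dict) -> dict:
    """Pre-compute all schema embeddings of one service; run once before training."""
    memory = {"intents": [], "cat_slots": [], "noncat_slots": [], "values": {}}
    for intent in service["intents"]:
        memory["intents"].append(
            encode_pair(tokenizer, encoder, service["description"], intent["description"]))
    for slot in service["slots"]:
        slot_emb = encode_pair(tokenizer, encoder, service["description"], slot["description"])
        if slot["is_categorical"]:
            memory["cat_slots"].append(slot_emb)
            # one embedding per possible value of the categorical slot
            memory["values"][slot["name"]] = [
                encode_pair(tokenizer, encoder, slot["description"], value)
                for value in slot["possible_values"]]
        else:
            memory["noncat_slots"].append(slot_emb)
    return memory   # stored and kept frozen during training
```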
3.3 State Decoder
The Schema Embeddings Memory along with the outputs of the Utterance Encoder are used as inputs to the State Decoder to predict the values necessary for state tracking. The State Decoder module consists of five sub-modules, each employing a set of projection transformations to decode their inputs. We use the two following projection layers in the decoder sub-modules:

1) Single-token projection: this projection transformation, introduced in [14], takes a schema embedding vector and the $Y_{cls}$ output of the Utterance Encoder as its inputs. The projection for predicting $p$ outputs for task $K$ is defined as $F_{FC}^K(x, y; p)$ for two input vectors $x, y \in \mathbb{R}^q$, where $q$ is the embedding size, $p$ is the size of the output (e.g. the number of classes), the first input $x$ is a schema embedding vector, and $y$ is the sentence-level output embedding vector produced by the Utterance Encoder. The sources of the inputs $x$ and $y$ depend on the task and the sub-module. The function $F_{FC}^K(x, y; p)$ for projection $K$ is defined as:

$$h_1 = GELU(W_1^K y + b_1^K) \quad (2)$$
$$h_2 = GELU(W_2^K (x \oplus h_1) + b_2^K) \quad (3)$$
$$F_{FC}^K(x, y; p) = Softmax(W_3^K h_2 + b_3^K) \quad (4)$$

where $W_i^K$ and $b_i^K$, $1 \le i \le 3$, are the learnable parameters of the projection, and $GELU$ is the activation function introduced in [5]. The symbol $\oplus$ indicates the concatenation of two vectors. The Softmax function is used to normalize the outputs as a distribution over the targets. This projection is used by the Intent, Requested Slot and Non-categorical Value Decoders.

2) Attention-based projection: the single-token projection takes just one vector from the outputs of the Utterance Encoder. For the Slot Status Decoder and the Categorical Value Decoder we propose to use a more powerful projection layer based on the multi-head attention mechanism [17]. We use the schema embedding vector $x$ as the query to attend to the token representations $Y_{tok}$ output by the Utterance Encoder. The idea is that domain-specific and slot-specific information can be extracted more efficiently from the collection of token-level representations than from the sentence-level encoded vector $Y_{cls}$ alone. The multi-head attention-based projection function $F_{MHA}^K(x, Y_{tok}; p)$ for task $K$, producing targets of size $p$, is defined as:

$$h_1 = MultiHeadAtt(query = x, keys = Y_{tok}, values = Y_{tok}) \quad (5)$$
$$F_{MHA}^K(x, Y_{tok}; p) = Softmax(W_1^K h_1 + b_1^K) \quad (6)$$

where $MultiHeadAtt$ is the multi-head attention function introduced in [17], and $W_1^K$ and $b_1^K$ are the learnable parameters of a linear projection applied after the multi-head attention. To accommodate padded utterances we use attention masking to mask out the padded portion of the utterance.
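For illustration, the following is a minimal PyTorch sketch of the two projection heads defined in Eqs. (2)-(6); the module and argument names are ours, and details such as output shapes and masking may differ from the released NeMo implementation.

```python
import torch
import torch.nn as nn

class SingleTokenProjection(nn.Module):
    """F_FC(x, y; p) of Eqs. (2)-(4): x is a schema embedding, y is Y_cls."""
    def __init__(self, hidden: int, num_outputs: int):
        super().__init__()
        self.fc1 = nn.Linear(hidden, hidden)
        self.fc2 = nn.Linear(2 * hidden, hidden)
        self.fc3 = nn.Linear(hidden, num_outputs)
        self.act = nn.GELU()

    def forward(self, x, y):
        h1 = self.act(self.fc1(y))                        # Eq. (2)
        h2 = self.act(self.fc2(torch.cat([x, h1], -1)))   # Eq. (3), concatenation x ⊕ h1
        return torch.softmax(self.fc3(h2), dim=-1)        # Eq. (4)

class AttentionProjection(nn.Module):
    """F_MHA(x, Y_tok; p) of Eqs. (5)-(6): the schema embedding queries all encoded tokens."""
    def __init__(self, hidden: int, num_outputs: int, num_heads: int = 16):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden, num_heads, batch_first=True)
        self.fc = nn.Linear(hidden, num_outputs)

    def forward(self, x, y_tok, pad_mask=None):
        # x: (batch, hidden) schema embedding used as query; y_tok: (batch, M, hidden)
        h1, _ = self.attn(query=x.unsqueeze(1), key=y_tok, value=y_tok,
                          key_padding_mask=pad_mask)           # Eq. (5), padding masked out
        return torch.softmax(self.fc(h1.squeeze(1)), dim=-1)   # Eq. (6)
```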
3.3.1 Intent Decoder. Each service schema contains a list of all possible user intents. For example, in a service for reserving flight tickets, we may have intents for searching for a flight or cancelling a ticket. At each turn, for each service, at most one intent can be active. For all services, we additionally consider an intent called NONE which indicates that there is no active intent for the current turn. An embedding vector $I_0$ is considered as the schema embedding for the NONE intent. It is a learnable embedding which is shared among all the services.

The inputs to the Intent Decoder for a service are the schema embeddings $I_i$, $0 \le i \le N_I$, from the Schema Embeddings Memory and the $Y_{cls}$ output of the Utterance Encoder. The predicted output of this sub-module is the active intent of the current turn $I_{active}$, defined as:

$$I_{active} = \operatorname{argmax}_{0 \le i \le N_I} F_{FC}^K(I_i, Y_{cls}; p = N_I) \quad (7)$$

3.3.2 Slot Request Decoder. At each turn, the user may request information about a slot instead of informing the system about a slot value. For example, when a user asks for the flight time when using a ticket reservation service, the flight_time slot of the service is requested by the user. This is a binary prediction task for each slot: slot $S_i^j$ is requested when $F_{FC}^K(S_i^j, Y_{cls}; p = 1) > 0.5$, $j \in \{C, NC\}$. The same prediction is done for both categorical and non-categorical slots.

3.3.3 Slot Status Decoder. We consider four different statuses for each slot, namely: inactive, active, dont_care, and carry_over. If the value of a slot has not changed from the previous state to the current user state, then the slot's status is "inactive". If a slot's value is updated in the current user state to "dont_care", then the status of the slot is set to "dont_care", which means the user does not care about the value of this slot. If the value for the slot is updated and its value is mentioned in the current user utterance, then its status is "active". There are many cases where the value for a slot does not exist in the last user utterance and instead comes from previous utterances in the dialogue. For such cases, the status is set to "carry_over", which means we should search the previous system or user utterances in the dialogue to find the value for this slot. More details of the carry-over mechanisms are described in subsection 3.4.

The status of the categorical slot $r$ is defined as:

$$S_r = \operatorname{argmax}_{0 \le i \le N_C} F_{MHA}^K(S_i, Y_{tok}; p = 4) \quad (8)$$

A similar decoder is used for the status of the non-categorical slots:

$$S_r = \operatorname{argmax}_{0 \le i \le N_{NC}} F_{MHA}^K(S_i, Y_{tok}; p = 4) \quad (9)$$

where $S_r$ is the status of the $r$-th non-categorical slot.

3.3.4 Categorical Value Decoder. The list of possible values for categorical slots is fixed and known. For the active service, this sub-module takes the $Y_{cls}$ and the schema embeddings of the values for all the categorical slots $V_i^k$, $1 \le k \le N_C$, $1 \le i \le N_V^k$, as input and uses the $F_{MHA}^K(x, Y_{tok}; p = 1)$ projection on each value embedding and $Y_{tok}$. Then the predictions of the values are normalized by another Softmax layer, resulting in a distribution over all possible values for each slot. The value with the maximum probability is selected as the value of the slot. If a slot's status is not "active" or "carry_over", the prediction for that slot is ignored during training.

3.3.5 Non-categorical Value Decoder. For non-categorical slots there is no predefined list of possible values, and they can take any value. For such slots, we use the spanning network approach [14] to decode and extract the slot's value directly from the user or system utterances.

This decoder uses two projection layers, one for predicting the start of the span, and one for the end of the span. The projection layers are shared among all the non-categorical slots; the only difference is their inputs. For each non-categorical slot, the token outputs of the Utterance Encoder together with the schema embedding of the slot are given as inputs to the two projection layers. These predict the probability of each token being the start or the end of the span for the value. The token with the maximum probability of being the start token is considered as the start, and the token with the maximum probability of being the end token which appears after the start token is considered as the end of the span.

For slots with no value mentioned in the current turn, we train the network to predict the first token, [CLS], as both the start and end token. The non-categorical slots with status "inactive" are ignored during training.
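The start/end selection rule of the Non-categorical Value Decoder can be sketched as follows (simplified, ignoring batching; the function name is ours):

```python
import torch

def select_span(start_logits: torch.Tensor, end_logits: torch.Tensor):
    """Pick the most likely start token and the most likely end token that does
    not precede it. Index 0 is the [CLS] token, so a (0, 0) span means no value
    is present in the current turn."""
    start = int(torch.argmax(start_logits))
    end = start + int(torch.argmax(end_logits[start:]))   # restrict end to positions >= start
    return start, end

# tokens[start:end + 1] is decoded back to text and assigned to the slot, unless the
# span is (0, 0) or falls outside the user utterance, in which case the carry-over
# mechanism of Section 3.4 takes over.
```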
3.4 State Tracker
The user's state at each turn includes the active intent, the collection of slots and their corresponding values, and the list of requested slots. This module takes all the predictions from the decoders along with the previous user state and produces the current user state. The list of requested slots and the active intent are produced directly from the outputs of the Slot Request Decoder and the Intent Decoder. Producing the collection of slots and their values requires more processing.

We start with the list of slots and values from the previous state, and update it as follows. For each slot, if its status is "inactive", no update or change is applied and the value from the previous state is carried over. If its status is predicted as "dont_care", then the explicit value "don't care" is set as the value for the slot. If the slot status is "active", then we should retrieve a value for that slot. We initially check the output of the value decoders: if a value is predicted, then that value is assigned to the slot. If the "#CARRYOVER#" special value is predicted for a categorical slot, or the start and end pointers for a non-categorical slot point to a value outside the span of the user utterance, then we kick off the carry-over mechanism. The carry-over mechanisms try to search the previous states or system actions for a value. When the "carry_over" status is predicted for any slot, the same carry-over procedures are triggered.

The carry-over procedures [12] enable the model to retrieve a value for a slot from the preceding system utterance or even previous turns in the dialogue. There are many cases where the user accepts values offered by the system and the value is not mentioned explicitly in the user utterance. In our system we have implemented two different carry-over procedures. The value may be offered in the last system utterance, or even in previous turns by the system; the procedure to retrieve values in these cases is called in-service carry-over. There are also cases where a switch happens between two services in multi-domain dialogues. A dialogue may contain more than one service and the user may switch between these services. When a switch happens, we may need to carry some values from a slot in the previous service to another slot in the current service. This carry-over procedure is called cross-service carry-over.

3.4.1 In-service Carry-over. We trigger this procedure in three cases: 1-the status of a slot is predicted as "carry_over", 2-the span found for a non-categorical slot is not within the user utterance, 3-the "#CARRYOVER#" value is predicted for a categorical slot with "active" or "carry_over" status. The in-service carry-over procedure tries to retrieve a value for a slot from the previous system utterances in the dialogue. We search the system actions starting from the most recent system utterance and then move backwards, looking for a value mentioned for that slot. If multiple values are found, the most recent value is used for the slot. If no value could be found, the slot is not updated in the current state.

3.4.2 Cross-service Carry-over. Carrying values from previous services to the current one when a switch happens in a turn is done by the cross-service carry-over procedure. The previous service and slots are called sources, and the new service and slots are called targets. To perform the carry-over, we need to build, for each slot, a list of candidate slots from which a carry-over can happen. We create this carry-over candidate list from the training data. We process all the dialogues in the training data, and count the number of times a value from a source service and slot is carried over to a target service and slot when a switch happens. These counts are normalized by the number of switches between the two services in the whole training data to give a better estimate of the likelihood of carry-overs. In our experiments, candidates with likelihoods less than 0.1 are ignored. The carry-over relation between two slots is considered symmetric, and statistics from both sides are aggregated. The candidate list for each slot thus contains the slots of other services that are looked up to find a carry-over value.

When a switch between two services happens in the dialogue, the cross-service carry-over procedure is triggered. It looks into the candidates of all the slots of the new service and carries over any value it finds from the previous service. If multiple values for a slot are found in the dialogue, the most recent one is used. A sketch of both procedures is given below.
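The following sketch illustrates the two carry-over procedures under simplified, assumed data structures: system actions stored as per-turn lists of (slot, value) pairs, and pre-extracted service-switch records for building the candidate list.

```python
from collections import Counter
from typing import List, Optional, Tuple

def in_service_carry_over(slot: str,
                          system_actions_history: List[List[Tuple[str, str]]]) -> Optional[str]:
    """Search previous system actions, most recent turn first, for a value offered
    for `slot`. Returns None if nothing is found, in which case the slot is left
    unchanged in the current state."""
    for turn_actions in reversed(system_actions_history):
        for action_slot, value in turn_actions:
            if action_slot == slot:
                return value
    return None

def build_carryover_candidates(dialogues, min_likelihood: float = 0.1):
    """Estimate from the training dialogues how often a value moves from a source
    (service, slot) to a target (service, slot) when the user switches services,
    normalized by the number of switches between the two services."""
    carry_counts, switch_counts = Counter(), Counter()
    for dialogue in dialogues:
        for switch in dialogue["service_switches"]:          # assumed pre-extracted switch records
            switch_counts[(switch["from_service"], switch["to_service"])] += 1
            for src_slot, tgt_slot in switch["carried_slots"]:
                carry_counts[(switch["from_service"], src_slot,
                              switch["to_service"], tgt_slot)] += 1
    candidates = {}
    for (src_srv, src_slot, tgt_srv, tgt_slot), count in carry_counts.items():
        likelihood = count / switch_counts[(src_srv, tgt_srv)]
        if likelihood >= min_likelihood:                     # candidates below 0.1 are ignored
            candidates.setdefault((tgt_srv, tgt_slot), []).append((src_srv, src_slot, likelihood))
    return candidates
```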
3.4.3 State Summarization. At each turn, the predictions of the decoders are used to update the previous state and produce the current state. The slots and their values from the previous state are updated with the new predictions if their statuses are predicted as active, and are otherwise kept unchanged. Some WL-DST models, e.g. TRADE [18] or the model introduced in [19], require the entire dialogue history, or part of it, to be passed as input. FastSGT decodes just the current turn and updates only the parts of the state that need to change at each turn, which helps the computational efficiency and robustness of the model.
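A sketch of the per-turn state update described above, assuming the decoder outputs have been gathered into a dictionary mapping each slot to a (status, value) pair, with value set to None when no usable value was extracted from the current turn:

```python
def update_state(prev_slot_values: dict, decoder_outputs: dict, lookup_carry_over) -> dict:
    """Produce the slot-value mapping of the current state.

    decoder_outputs maps each slot name to a (status, value) pair; value is None
    when the decoders did not extract a usable value from the current turn
    (e.g. "#CARRYOVER#" was predicted, or the span fell outside the user utterance).
    lookup_carry_over(slot) stands for the procedures of Sections 3.4.1 and 3.4.2."""
    state = dict(prev_slot_values)                   # "inactive" slots stay unchanged
    for slot, (status, value) in decoder_outputs.items():
        if status == "dont_care":
            state[slot] = "dont care"
        elif status in ("active", "carry_over"):
            if status == "active" and value is not None:
                state[slot] = value                  # value found in the current user utterance
            else:
                carried = lookup_carry_over(slot)    # search previous system utterances / services
                if carried is not None:
                    state[slot] = carried            # if nothing is found the slot stays unchanged
    return state
```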
4 EXPERIMENTS
4.1 Experimental Settings
We ran all our experiments on systems with 8 V100 GPUs using mixed precision training [13] to make the training process faster. We used BERT-base-cased for the encoders. All of the models were trained for at most 160 epochs to reduce the variance in the results, although most of them converged in fewer than 60 epochs. We repeated each experiment three times and report the average in all tables.

For each of the BERT-based encoders and attention-based projection layers we used 16 heads. We optimized the model using the Adam [8] optimizer with default parameter settings. The batch size was set to 128 per GPU, with a maximum learning rate of 4e-4. Linear decay annealing was used with a warm-up of 2% of the total steps. A dropout rate of 0.2 was used for regularization in all modules. We tuned all the parameters to get the best joint goal accuracy on the dev set of the datasets.

4.2 Datasets
The SGD dataset is a multi-domain goal-oriented dataset that contains over 16k conversations between a human user and a virtual assistant, spanning 16 domains. The SGD dataset provides schemas that contain details of the APIs' interfaces. The schemas define the list of slots supported by each service and the possible values for categorical slots, along with the supported intents.

In the original SGD dataset, the test and validation splits contain Unseen Services, i.e. services that do not exist in the training set. By design, those services are in most cases similar to the services in the training set, with similar descriptions but different names. We created another version of the dataset by merging all the dialogues and splitting them again randomly into 70%/15%/15% train/validation/test. In this version of the dataset, called SGD+, all services are seen. It allows a better evaluation of our model, as the focus of our model is on seen services, and by using the original split we would be ignoring a significant part of the dataset (some services simply were not present in the original Seen Services split).

4.3 Evaluation Metrics
We evaluated our proposed model and compared it with the baseline on the following metrics:

• Active Intent Accuracy: the fraction of user turns with correctly predicted active intent, where the active intent represents the user's intent in the current turn that is being fulfilled by the system.
• Requested Slot F1: the macro-averaged F1 score of the requested slots. Turns with no requested slots are skipped. Requested slots are the slots that were requested by the user in the most recent utterance.
• Average Goal Accuracy: the average accuracy of predicting the value of a slot correctly, where the user goal represents the user's constraints mentioned in the dialogue up until the current user turn.
• Joint Goal Accuracy: the average accuracy of predicting all slot values in a dialogue turn correctly. This is the strictest of these metrics. A simplified sketch of the two goal-accuracy metrics follows this list.
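For clarity, the sketch below shows one way to compute the two goal-accuracy metrics; the official SGD evaluation script applies additional matching rules (e.g. fuzzy matching for non-categorical values), so this is only an approximation.

```python
def goal_accuracies(predicted_states, ground_truth_states):
    """predicted_states / ground_truth_states: one slot->value dict per user turn,
    describing the dialogue state at that turn."""
    slot_correct = slot_total = joint_correct = 0
    for pred, gold in zip(predicted_states, ground_truth_states):
        turn_correct = True
        for slot, gold_value in gold.items():
            slot_total += 1
            if pred.get(slot) == gold_value:
                slot_correct += 1
            else:
                turn_correct = False
        joint_correct += int(turn_correct and set(pred) == set(gold))
    average_ga = slot_correct / max(slot_total, 1)                 # per-slot accuracy
    joint_ga = joint_correct / max(len(ground_truth_states), 1)    # all slots in a turn must match
    return average_ga, joint_ga
```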
4.4 Performance Evaluation
In this section, we evaluate and compare FastSGT with the baseline SGD model proposed in [14] on the two datasets, SGD and SGD+. A thorough comparison of both models with regard to all the evaluation metrics explained in subsection 4.3 is reported in Tables 1 and 2. We report the metrics calculated on turns from all services versus just the seen services. The performance of the models trained and evaluated on single-domain dialogues and on all dialogues is reported separately; all dialogues include both single-domain and multi-domain dialogues.

There were some issues in the original evaluation process of the SGD baseline which we had to fix. First, some services were considered seen services during the evaluation of single-domain dialogues although they do not actually exist in the training data. The other issue was that turns which come after an unseen service in multi-domain dialogues could be counted as seen by the original evaluation; the errors from unseen services may propagate through the dialogue and affect some of the metrics for seen services. We fixed this by counting a turn only if no earlier turn in the dialogue comes from an unseen service. These fixes helped to improve the results. However, to have a fair comparison we also report the performance of the baseline model and ours with and without these fixes; in Table 1 they are denoted by +/- in the column labelled "Eval Fix".

The performance of our model on all the metrics and on both versions of the dataset is better than the baseline. The main advantage of our model compared to the baseline comes from the carry-over mechanisms. These procedures are enhanced by the capability of predicting "carry_over" statuses for all slots, and by the "#CARRYOVER#" value for the categorical slots. During our error analysis, we found that a great number of errors come from cases where the value for a slot is not explicitly mentioned in the user utterance. The other advantage of FastSGT over the SGD baseline is the attention-based projection functions, which enable our model to better model the utterance encoder's outputs.

Table 1: Evaluation results of our model compared to the SGD baseline [14] on the dev/test sets of the SGD dataset. The results are reported separately for single-domain dialogues and all dialogues. We also report the results without the fixes in the evaluation process of the baseline for both seen and all services.

                                       Single-domain Dialogues                            All Dialogues
Model         Services  Eval Fix   Intent Acc   Slot Req F1  Average GA   Joint GA      Intent Acc   Slot Req F1  Average GA   Joint GA
SGD Baseline  all       -          96.00/88.05  96.50/95.60  77.60/68.40  48.60/35.60   90.80/90.60  97.30/96.50  74.00/56.00  41.10/25.40
FastSGT       all       -          96.45/88.60  96.55/94.65  81.11/71.22  56.66/39.77   91.58/90.33  97.53/96.33  78.22/60.66  52.06/29.20
SGD Baseline  seen      -          99.03/78.22  98.74/96.83  88.12/92.17  68.61/73.94   96.44/94.50  99.47/99.29  79.86/67.77  54.68/41.63
FastSGT       seen      -          98.94/77.53  98.80/96.89  92.98/94.12  83.13/80.25   96.61/94.18  99.66/99.55  88.78/76.52  71.34/55.23
SGD Baseline  seen      +          99.00/75.12  96.08/99.22  90.84/91.42  71.14/68.94   96.08/91.64  99.62/99.66  83.33/81.03  61.15/60.05
FastSGT       seen      +          98.86/73.98  99.64/99.24  96.54/95.31  88.03/81.56   96.26/91.44  99.65/99.64  92.33/92.12  79.65/78.55

Table 2: Evaluation results of our model compared to the SGD baseline [14] on the SGD+ dataset for single-domain dialogues and also all dialogues. All dialogues include all the single-domain and multi-domain dialogues.

                            Single-domain Dialogues                            All Dialogues
Model         Intent Acc   Slot Req F1  Average GA   Joint GA      Intent Acc   Slot Req F1  Average GA   Joint GA
SGD Baseline  95.31/95.70  99.49/99.66  91.85/91.67  71.76/71.86   96.34/96.45  99.69/99.69  81.25/81.15  56.07/56.11
FastSGT       95.36/95.67  99.52/99.59  95.57/95.44  83.97/84.31   96.17/96.36  99.69/99.69  95.33/95.46  82.03/82.81

4.5 Data Augmentation
One of the main challenges in building goal-oriented dialogue systems is the lack of sufficient high quality annotated data. In this section, we study the effect of augmentation on the performance of our model. We created augmented versions of the SGD and SGD+ datasets by randomly replacing the values of the non-categorical slots with other values seen for the same slot in other dialogues. This enables us to create new dialogues from the available dialogues in an offline manner. However, augmentation in multi-turn dialogues can be challenging, as changing the value of a slot may require updates in the rest of the dialogue to keep the altered dialogue consistent. We carefully updated the whole dialogue with each slot value replacement to maintain the integrity of the dialogue's content and annotations (a sketch of this replacement step is shown after Table 4).

Augmentation for categorical slots was not possible since the dataset does not provide the unique position of the categorical slot values in the dialogue utterances. Also, we did not try the augmentation on multi-domain dialogues, as switching between services makes it more challenging to maintain the consistency of the dialogue.

We augmented the data to 10x of the original data, but kept the number of training steps fixed to have a fair comparison by keeping the total amount of computation the same. The results reported in Table 4 show the effectiveness of augmentation on both the SGD and SGD+ datasets for single-domain dialogues. The experiments on the SGD dataset are done just for the seen services.

Table 4: FastSGT with and without data augmentation (Aug.) on seen single-domain dialogues of the SGD and SGD+ datasets. All results are reported for dev/test sets.

Dataset  Aug.  Intent Acc   Slot Req F1  Average GA   Joint GA
SGD      -     98.86/73.98  99.64/99.24  96.54/95.31  88.03/81.56
SGD      +     98.74/73.97  99.59/99.31  96.70/96.31  88.66/83.12
SGD+     -     95.36/95.67  99.52/99.59  95.57/95.44  83.97/84.31
SGD+     +     95.31/95.76  99.49/99.65  96.23/96.19  84.93/86.24
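A sketch of the offline value-replacement step for non-categorical slots; the dialogue and value-pool structures are simplified assumptions, and in practice the span annotations that index into the rewritten utterances also have to be recomputed.

```python
import random

def augment_dialogue(dialogue, value_pool, rng=random):
    """Replace non-categorical slot values with other values observed for the same
    slot in other dialogues (value_pool: slot -> list of values). Every utterance
    containing the old value is rewritten so the text stays consistent with the
    new annotation; span offsets would also need to be recomputed afterwards."""
    for slot, old_value in list(dialogue["slot_values"].items()):
        candidates = [v for v in value_pool.get(slot, []) if v != old_value]
        if not candidates:
            continue
        new_value = rng.choice(candidates)
        dialogue["slot_values"][slot] = new_value
        for turn in dialogue["turns"]:
            turn["utterance"] = turn["utterance"].replace(old_value, new_value)
    return dialogue
```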
4.6 Ablation Study
In this section, we study and evaluate the effectiveness of different aspects of our model. The performance of different variants of FastSGT is reported in Table 3. In each variant, one part of the model is disabled to show the effect of each feature on the final performance. As can be seen, both carry-over mechanisms are very effective, and most of the improvements of our model result from this part of the model. The experiments show the effectiveness of the cross-service carry-over for multi-domain dialogues, which was expected. We do not report the results of removing the cross-service carry-over for single-domain dialogues as it does not affect such dialogues.

The performance of the variant of the model without the carry-over status or carry-over values for categorical slots is still good. The main reason is that, while this variant lacks these features, it can still handle carry-over for non-categorical slots by predicting the "active" status together with an out-of-bound span prediction. The experiments also show that the attention-based projections help the accuracy of the model, which demonstrates the effectiveness of exploiting all the encoded outputs via the attention mechanism instead of just the first output of the encoder.

Table 3: The results of the ablation study of FastSGT on test/dev sets of SGD+.

                              Single-domain Dialogues                            All Dialogues
Model                       Intent Acc   Slot Req F1  Average GA   Joint GA      Intent Acc   Slot Req F1  Average GA   Joint GA
SGD Baseline                95.31/95.70  99.49/99.66  91.85/91.67  71.76/71.86   96.34/96.45  99.69/99.69  81.25/81.15  56.07/56.11
FastSGT                     95.36/95.67  99.52/99.59  95.57/95.44  83.97/84.31   96.17/96.36  99.69/99.69  95.33/95.46  82.03/82.81
- Attention-based Layer     95.23/95.72  99.51/99.62  95.28/95.04  82.85/83.04   96.32/96.48  99.69/99.70  95.14/95.44  81.80/82.76
- Carry-over Value/Status   95.12/95.87  99.54/99.63  95.30/95.40  81.88/82.36   96.32/96.50  99.68/99.71  95.08/95.27  80.84/81.65
- In-service Carry-over     95.28/95.83  99.50/99.65  91.48/91.18  71.48/71.34   96.25/96.41  99.67/99.70  92.07/92.01  72.62/72.58
- Cross-service Carry-over  -            -            -            -             96.28/96.43  99.68/99.71  84.31/84.37  62.90/63.54

5 CONCLUSIONS
In this paper we proposed FastSGT, an efficient and robust state tracker model for goal-oriented dialogue systems. In many cases in multi-turn dialogues, the value of a slot is not mentioned explicitly in the user utterance. To cope with this issue, the proposed model incorporates two carry-over procedures to retrieve the value of such slots from previous system utterances or other services. Additionally, FastSGT utilizes a multi-head attention mechanism in some of the decoders to better model the encoder outputs.

We ran several experiments and compared our model with the baseline model of the SGD dataset. The results indicate that our model has significantly higher accuracy, while at the same time being efficient in terms of memory utilization and computational resources. In additional experiments, we studied the effectiveness of augmentation for our model and showed that data augmentation can improve the performance of the model. Finally, we also included ablation studies measuring the impact of different parts of the model on its performance.
REFERENCES
[1] Vevake Balaraman and Bernardo Magnini. 2020. Domain-aware dialogue state tracker for multi-domain dialogue systems. arXiv preprint arXiv:2001.07526 (2020).
[2] Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gasic. 2018. MultiWOZ - A Large-Scale Multi-Domain Wizard-of-Oz Dataset for Task-Oriented Dialogue Modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 5016-5026.
[3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 4171-4186.
[4] Pavel Gulyaev, Eugenia Elistratova, Vasily Konovalov, Yuri Kuratov, Leonid Pugachev, and Mikhail Burtsev. 2020. Goal-oriented multi-task BERT-based dialogue state tracker. arXiv preprint arXiv:2002.02450 (2020).
[5] Dan Hendrycks and Kevin Gimpel. 2016. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415 (2016).
[6] John Chan, Junyuan Zheng, and Onkar Salvi. 2020. Candidate Attended Dialogue State Tracking Using BERT. DSTC8 Workshop, AAAI-20 (2020).
[7] Sungdong Kim, Sohee Yang, Gyuwan Kim, and Sang-Woo Lee. 2019. Efficient Dialogue State Tracking by Selectively Overwriting Memory. arXiv preprint arXiv:1911.03906 (2019).
[8] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[9] Shuyu Lei, Shuaipeng Liu, Mengjun Sen, Huixing Jiang, and Xiaojie Wang. 2020. Zero-shot state tracking and user adoption tracking on schema-guided dialogue. In Dialog System Technology Challenge Workshop at AAAI.
[10] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
[11] Yue Ma, Zengfeng Zeng, Dawei Zhu, Xuan Li, Yiying Yang, Xiaoyuan Yao, Kaijie Zhou, and Jianping Shen. 2019. An End-to-End Dialogue State Tracking System with Machine Reading Comprehension and Wide & Deep Classification. arXiv preprint arXiv:1912.09297 (2019).
[12] Yunbo Cao, Miao Li, and Haoqi Xiong. 2020. The SPPD System for Schema Guided Dialogue State Tracking Challenge. DSTC8 Workshop, AAAI-20 (2020).
[13] Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et al. 2017. Mixed precision training. arXiv preprint arXiv:1710.03740 (2017).
[14] Abhinav Rastogi, Xiaoxue Zang, Srinivas Sunkara, Raghav Gupta, and Pranav Khaitan. 2019. Towards scalable multi-domain conversational agents: The schema-guided dialogue dataset. arXiv preprint arXiv:1909.05855 (2019).
[15] Yu-Ping Ruan, Zhen-Hua Ling, Jia-Chen Gu, and Quan Liu. 2020. Fine-tuning BERT for schema-guided zero-shot dialogue state tracking. arXiv preprint arXiv:2002.00181 (2020).
[16] Pararth Shah, Dilek Hakkani-Tür, Gokhan Tür, Abhinav Rastogi, Ankur Bapna, Neha Nayak, and Larry Heck. 2018. Building a conversational agent overnight with dialogue self-play. arXiv preprint arXiv:1801.04871 (2018).
[17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998-6008.
[18] Chien-Sheng Wu, Andrea Madotto, Ehsan Hosseini-Asl, Caiming Xiong, Richard Socher, and Pascale Fung. 2019. Transferable Multi-Domain State Generator for Task-Oriented Dialogue Systems. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 808-819.
[19] Kevin Knight, Xing Shi, and Scot Fang. 2020. A BERT-based Unified Span Detection Framework for Schema-Guided Dialogue State Tracking. DSTC8 Workshop, AAAI-20 (2020).
[20] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems. 5754-5764.
[21] Zheng Zhang, Ryuichi Takanobu, Minlie Huang, and Xiaoyan Zhu. 2020. Recent advances and challenges in task-oriented dialog system. arXiv preprint arXiv:2003.07490 (2020).