=Paper=
{{Paper
|id=Vol-2666/KDD_Converse20_paper_12
|storemode=property
|title=IQ-Net: A DNN Model for Estimating Interaction-level Dialogue Quality with Conversational Agents
|pdfUrl=https://ceur-ws.org/Vol-2666/KDD_Converse20_paper_12.pdf
|volume=Vol-2666
|authors=Yuan Ling,Benjamin Yao,Guneet Kohli,Tuan-Hung Pham,Chenlei Guo
|dblpUrl=https://dblp.org/rec/conf/kdd/LingYKPG20
}}
==IQ-Net: A DNN Model for Estimating Interaction-level Dialogue Quality with Conversational Agents ==
Yuan Ling, Benjamin Yao, Guneet Kohli, Tuan-Hung Pham, Chenlei Guo
{yualing, benjamy, gkohli, hupha, guochenl}@amazon.com
Amazon Alexa AI, Seattle, WA

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT

An automated metric to evaluate dialogue quality is critical for continuously optimizing large-scale conversational agent systems such as Alexa. Previous approaches for tackling this problem often rely on a limited set of manually designed and/or heuristic features, which cannot be easily scaled to a large number of domains or scenarios. In this paper, we present Interaction-Quality-Network (IQ-Net), a novel DNN model that allows us to predict interaction-level dialogue quality directly from raw dialogue contents and system metadata without human-engineered NLP features. The IQ-Net architecture is compatible with several pre-trained neural network embeddings and architectures such as CNN, ELMo, and BERT. Through an ablation study in Alexa, we demonstrate that several variants of IQ-Net outperform a baseline model with manually engineered features (3.89% improvement in F1 score, 3.15% in accuracy, and 6.1% in precision), while also reducing the effort needed to extend to new domains/use cases.

CCS CONCEPTS
• Computing methodologies → Natural language processing.

KEYWORDS
online evaluation, intelligent conversational agents (ICAs), evaluation metrics, defect detection

ACM Reference Format:
Yuan Ling, Benjamin Yao, Guneet Kohli, Tuan-Hung Pham, Chenlei Guo. 2020. IQ-Net: A DNN Model for Estimating Interaction-level Dialogue Quality with Conversational Agents. In Proceedings of KDD Workshop on Conversational Systems Towards Mainstream Adoption (KDD Converse'20). ACM, New York, NY, USA, 7 pages.

Table 1: Example dialogues and interaction-quality assessments. Note that assessments are for the 1st turn in each dialogue.
1. Assessment: non-defect (success)
User request: play five little ducks
ICA response: Ok, playing five little ducks
User request: (after 2 mins...) play five little ducks
2. Assessment: defect, user paraphrase
User request: play three little ducks
ICA response: Sorry, I cannot find the song
User request: play five little ducks
3. Assessment: non-defect, user confirmation
User request: turn off light
ICA response: did you mean Lamp One?
User request: yeah
4. Assessment: defect, user correction and ASR error
User request: buy
ICA response: you have one item in your shopping cart {name of the item}, do you want to buy it?
User request: bye-bye.
1 INTRODUCTION

As voice-controlled intelligent conversational agents (ICAs), such as Alexa, Siri, and Google Assistant, become increasingly popular, ICAs have become a new paradigm for accessing information. They represent a hybrid of search and dialogue systems that conversationally interact with users to execute a wide range of actions (e.g., searching the Web, setting alarms, and making phone calls) [9, 31]. These ICAs are complex systems with many components, such as automatic speech recognition (ASR), natural language understanding (NLU), language generation, and dialog management. As a result, there are generally two categories of evaluation metrics for ICAs [32]: (1) component metrics, which measure the quality of each individual component, such as Word Error Rate (WER) and NLU accuracy; and (2) end-to-end (E2E) metrics, which are designed to measure overall dialogue quality and/or user satisfaction. In this paper, we focus on "online" or "automated" E2E metrics, which are built with machine learning models designed to predict user satisfaction. Defining online E2E metrics is a popular subject of research [10] due to its vital role in the continuous optimization of ICA systems [15].

Prior attempts to model online E2E metrics can be roughly grouped into two categories: (1) dialogue-level metrics, such as the popular PARADISE framework [39], which aims to predict dialogue-level satisfaction ratings provided by surveyed users; references [8, 36] formulate dialogue interaction as a reinforcement learning task and aim to predict reward from dialogue history using different variants of DNN models; (2) turn-level (or exchange-level) metrics, such as Interaction Quality (IQ) [33], which predict per-turn dialogue quality either provided by users [13] or annotated by human raters [34]. In particular, IQ models have gained popularity recently because of publicly available benchmarks such as the CMU Let's Go bus information system [34]. Various methods exist to predict Interaction Quality, for example Hidden Markov Models [38], Support Vector Machines [12, 33], and Recurrent Neural Networks [28, 29]. However, these approaches rely heavily on dialog-system-internal features such as ASR confidence and SLU semantic parse. While such features are very effective in a small-scale closed-loop system, they are unreliable in a large organization like Alexa, where many teams constantly update various components in parallel. For example, the ASR confidence score can shift significantly between two ASR model versions and hence becomes an unreliable input for an E2E online metric, which is, in part, designed to measure ASR's impact on user satisfaction.

Therefore, instead of relying on system-internal signals, we draw on the intuition that human raters can reliably judge the quality of a turn by looking at the context of the dialogue [4], without ever needing to know the ASR confidence score. Table 1 shows four example dialogues with interaction-quality assessments by humans.

While these examples demonstrate how humans can easily judge the quality of a turn using dialogue context, they are also non-trivial cases for an ML model to predict. For example, examples #1 and #2 share a similar user paraphrase structure, but example #1 is non-defective (successful) because the ICA response is relevant, while #2 is clearly defective because the ICA responded with "Sorry, I cannot ...". On the other hand, while examples #3 and #4 share a similar query/response structure (the ICA asking for confirmation), #3 is non-defective because the user gives a positive confirmation in the next turn, while #4 is a defect because the user issues a correction in the next turn. To capture such a diverse set of dialogue patterns, it is clear that we need to leverage the semantic meaning of the dialogue context. While it is possible to manually extract features such as "paraphrasing" and "cohesion between response and request", as proposed by Bodigutla et al. [4], the complexity of an open-domain system like Alexa limits the efficacy of such an approach. Therefore, in this paper we present Interaction-Quality-Network (IQ-Net), an E2E DNN model that allows us to predict interaction-level dialogue quality directly from raw dialogue contents and system metadata without human-engineered NLP features. In contrast to existing related work, we contribute to the IQ modeling literature from the following perspectives.

• Instead of focusing on a few specific tasks and system-internal features [21, 33], IQ-Net is a generic interaction quality model that can be used for evaluating Interaction Quality across multiple domains and various systems.
• Unlike a previous multi-domain user satisfaction evaluation model [4], which relies on manually engineered dialogue features that are hard to scale, IQ-Net is capable of capturing a variety of dialogue patterns and can be easily extended to new domains/use cases as long as annotated examples are available.

The rest of the paper is organized as follows. Section 2 reviews existing work. Section 3 presents our methods to estimate interaction-level dialogue quality. Section 4 presents our experimental results. We conclude the paper in Section 5.

2 RELATED WORK

In this section, we summarize related work on evaluation methods/metrics for search systems and on error analysis for ICAs to put our contributions in context.

Evaluation is a central component of information search systems [19]. For text-based information retrieval, relevant documents/pages are annotated manually to evaluate search system performance. Query-based metrics such as mean average precision (MAP) and normalized discounted cumulative gain (nDCG) [20] are frequently used to evaluate system performance. However, the human annotation process is expensive and error-prone; in addition, the user's individual intent is commonly not taken into consideration. To alleviate this issue, some research models user satisfaction/behaviors to improve the evaluation of a system's performance [1] by incorporating the following signals: (1) user behaviors, including clicks, dwell time, mouse movements, scrolling behaviors, and abandonment [16]; (2) context-specific features such as viewport metrics [24], touch-related features, and acoustic signals [23]; and (3) query-based features, such as query refinement, query length, and frequency in logs. While metrics for evaluating traditional search systems might not be used directly to evaluate ICAs, some of their components, such as query refinement, can be adapted to evaluate ICAs.

Compared with text-based information retrieval, voice-based information retrieval is quite different [21, 23] for two reasons. First, voice-based interactions are conversational; in some scenarios, the user expects the search system to be able to refer to previous interactions to understand the current request. Second, the voice input can introduce automatic speech recognition (ASR) errors into downstream applications and affect user satisfaction negatively. Research on Spoken Dialogue Systems (SDS) attempts to model user satisfaction at the turn level as a continuous process over time [13, 18]. An annotated and standardized corpus, such as the Let's Go Bus Information System dataset from CMU [34], was developed for classification and evaluation tasks regarding task success prediction, dialogue quality estimation, and emotion recognition. Based on this dataset, an evaluation metric called Interaction Quality [10, 34] was developed with features related to ASR, Spoken Language Understanding (SLU), and the Dialog Manager at the exchange, dialog, and window levels.

ICAs differ from traditional SDS in that they support personalization and a wide range of tasks. Dialogue systems can be categorized into three groups: task-oriented systems, conversational agents, and interactive question answering systems [10]. ICAs are designed to handle all of these tasks, which makes the evaluation of ICAs very challenging. In addition, as voice-only ICAs evolve into voice-enabled multi-modal ICAs, evaluation becomes even more complex. Recent user studies on ICAs compare differences regarding features [25], performance, ASR errors [17], and user experiences [3] across different ICAs. Surveys with questionnaires [2, 5] have been conducted to understand functional and topical use of ICAs by individuals; however, these studies are limited to predefined scenarios of interaction.

Currently, there is limited research on building automatic metrics for evaluating ICAs' performance. Jiang et al. [21] built separate models for evaluating user satisfaction in several domains, including Chat, Device Control, Communication, Location, Calendar, and Weather. The models consider several types of features, including user-system interactions, click features, request features, response features, and acoustic features. This work automated online evaluation for ICAs; however, it did not consider the variability of interface and interaction, and its scope is limited to ICAs on mobile devices and several specific scenarios/domains. Bodigutla et al. [4] introduce a Response Quality annotation schema, which showed high correlation with explicit turn-level user satisfaction ratings. That paper developed a method for evaluating turn-level user satisfaction in multi-domain conversations with an ICA using five features: user request rephrasing, cohesion between response and request, aggregate topic popularity, unactionable user request, and the diversity of topics in a session. The turn-level user satisfaction rating is further used as a feature to improve dialogue-level satisfaction estimation. Other current research on the evaluation of ICAs focuses more on user satisfaction estimation and goal success prediction, which are more suitable for dialog-level or task-level evaluation due to the dialogue style of interactions [15, 22, 30]. A user's frustration in the middle of a task or a dialogue might not be captured, and such approaches often lack interpretability in terms of the root causes of user frustration.
Finally, it is not obvious how one should define task and session boundaries for ICAs [30]; thus, it is critical to evaluate ICAs at the turn level.

The complexity of ICAs' components makes it difficult to determine which component causes an error or user frustration. Researchers have studied system errors in search and dialogue systems. For ICAs, the error root causes can be categorized into groups [31, 32], including ASR errors, NLU errors, unsupported system actions, no language generation, back-end failures, endpoint errors, and uninterpretable inputs. These errors can be the root causes of users reformulating their queries [27, 31].

Figure 1: IQ-Net architecture diagram. The model considers the information of the current turn utterance (U_t), the response (R_t), and the next turn utterance (U_t+1). Semantic encoding layers share weights. Position encoders are shared among U_t and U_t+1. The diagram uses a BERT encoder as an example; for IQ-Net (CNN), we replace the BERT encoder with a CNN encoder.

3 METHODOLOGY

In this section, we present IQ-Net, a DNN model for estimating interaction-level dialogue quality. First, we introduce the overall architecture and training procedure of IQ-Net. Then, we explain how we represent the dialogue context in detail. Next, we introduce the system metadata used in IQ-Net.

3.1 IQ-Net

The IQ-Net model is presented in Figure 1. IQ-Net includes two major components: (1) dialogue context representations and (2) a list of features derived from system metadata.

For modeling the dialogue context, we consider the user's request text plus the ICA's response text in consecutive turns. As shown in Figure 1, the dialogue context representation part takes the current turn request and response (U1 and R1) and more requests from following turns (U2, U3, ...) as inputs. For simplicity, we only consider the next turn's request; thus, the inputs can be represented as <U1, R1, U2>. We will support more following turns in future work. We assume <U1, R1> captures the relevancy between the user request and the Alexa response, and <U1, U2> captures patterns from the user's repeat/dialog behavior.

We map the word indices of U1, R1, and U2 into fixed-dimension vectors through pre-trained word embeddings [11, 26]. The word embedding representations of U1, R1, and U2 all go through a sentence encoder E, which can be pre-trained on individual datasets. We use both a CNN encoder E_CNN and a BERT encoder E_BERT as sentence encoders in our experiments for comparison. We concatenate the hidden representations h_U1 and h_R1, and h_U1 and h_U2, accordingly, then concatenate the results of each part, followed by a feed-forward network and an activation function.

The final outputs from the dialogue context representations are combined with all other features derived from system metadata to predict a defect/non-defect outcome for the interaction-level dialogue quality of the first turn:

p(Defect = true | <U_t, R_t, U_t+1>, f_M)    (1)

where f_M represents a list of metadata features (described in Section 3.3). The objective function for the overall task is

L_Θ = Σ_{(U_t, R_t, U_t+1)} l(F(U_t, R_t, U_t+1, f_M), y)    (2)

where F(·) is the function that represents IQ-Net, l is the standard cross-entropy loss, and y is the ground-truth label.
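To make the architecture concrete, the following is a minimal sketch of the IQ-Net forward pass and training objective, assuming a PyTorch-style implementation. The small CNN encoder here is only a stand-in for the pre-trained E_CNN / E_BERT encoders, and the layer sizes, vocabulary size, and metadata feature dimension are illustrative assumptions, not the values used in the paper.

<pre>
# Minimal, illustrative sketch of the IQ-Net forward pass (not the production model).
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    # Shared ("Siamese") sentence encoder E; a small CNN stand-in for E_CNN / E_BERT.
    def __init__(self, vocab_size=10000, emb_dim=100, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.conv = nn.Conv1d(emb_dim, hidden_dim, kernel_size=3, padding=1)

    def forward(self, token_ids):                      # (batch, seq_len) word indices
        x = self.embed(token_ids).transpose(1, 2)      # (batch, emb_dim, seq_len)
        h = torch.relu(self.conv(x))                   # (batch, hidden_dim, seq_len)
        return h.max(dim=2).values                     # max-pooled sentence vector

class IQNet(nn.Module):
    def __init__(self, hidden_dim=128, meta_dim=16):
        super().__init__()
        self.encoder = SentenceEncoder(hidden_dim=hidden_dim)  # weights shared for U1, R1, U2
        self.ff_ur = nn.Linear(2 * hidden_dim, hidden_dim)     # feed-forward over <U1, R1>
        self.ff_uu = nn.Linear(2 * hidden_dim, hidden_dim)     # feed-forward over <U1, U2>
        self.classifier = nn.Linear(2 * hidden_dim + meta_dim, 1)

    def forward(self, u1, r1, u2, meta):
        h_u1, h_r1, h_u2 = self.encoder(u1), self.encoder(r1), self.encoder(u2)
        pair_ur = torch.relu(self.ff_ur(torch.cat([h_u1, h_r1], dim=1)))  # request-response relevance
        pair_uu = torch.relu(self.ff_uu(torch.cat([h_u1, h_u2], dim=1)))  # rephrase/confirm/deny patterns
        combined = torch.cat([pair_ur, pair_uu, meta], dim=1)             # append metadata features f_M
        return torch.sigmoid(self.classifier(combined)).squeeze(1)        # p(Defect = true | <U1, R1, U2>, f_M)

# Training with the standard cross-entropy objective of Eq. (2) on annotated defect labels y.
model, loss_fn = IQNet(), nn.BCELoss()
u1, r1, u2 = (torch.randint(1, 10000, (4, 12)) for _ in range(3))
meta, y = torch.rand(4, 16), torch.randint(0, 2, (4,)).float()
loss = loss_fn(model(u1, r1, u2, meta), y)
loss.backward()
</pre>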
3.2 Dialogue Context Representations

Here, we explain in detail how the dialogue context is represented with <U1, U2> modeling and <U1, R1> modeling.

3.2.1 <U1, U2> Modeling. The <U1, U2> pair contains the user's dialog behavior/patterns. For example, ICA users tend to re-express the same intention with a follow-up request after an unsuccessful attempt in the previous request. We refer to the follow-up request as a "rephrase" of the previous request. Identifying rephrasing patterns between request pairs can help discover defects/frictions.

In addition to the rephrasing behavior between two consecutive user requests, the user can also express a confirm or deny intention in a follow-up request, as shown in Table 2.

Table 2: <U1, U2> confirm/deny examples.
Example 1: confirm
User request: Alexa, add paper to my cart
Alexa response: Do you mean paper towel?
User request: Yes.
Example 2: deny
User request: Alexa, add paper to my cart
Alexa response: Do you mean paper towel?
User request: No. add A4 paper to my cart.

Such patterns in <U1, U2> reflect the user's real intention through the corresponding repeat/confirm/deny behaviors, which can be learned by the proposed IQ-Net.

3.2.2 <U1, R1> Modeling. The semantic relevance of the ICA's response to the user's request can be an effective feature for defect prediction. When an ICA responds to a user's request with an irrelevant answer, the metric should capture this as defective. However, it is difficult to discover such a defect when the ICA provides a complete but incorrect response and the user chooses to abandon the interaction without rephrasing the request. The relevance between the request and the response text can potentially help with defect identification. IQ-Net takes the user request and response text (U1 and R1) as inputs, similar to <U1, U2> modeling. We adopt the frequently used "Siamese" architecture [6] to measure request-response similarities in the projected space, as shown in Figure 1.
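As a small illustration of the shared-encoder ("Siamese") idea, a request-response relevance score can be computed as below. This reuses the illustrative SentenceEncoder from the sketch above and is an assumption about one plausible realization (cosine similarity in the projected space), not the exact scoring head used in IQ-Net.

<pre>
import torch.nn.functional as F

def request_response_relevance(encoder, u1_ids, r1_ids):
    # U1 and R1 go through the same encoder (shared weights) and are compared
    # in the projected space; a low score hints at an irrelevant response.
    h_u1 = encoder(u1_ids)
    h_r1 = encoder(r1_ids)
    return F.cosine_similarity(h_u1, h_r1, dim=1)
</pre>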
3.3 System Metadata

Table 3 introduces the signals and features from system metadata that we use for evaluating ICAs. We adopt several signals, such as termination and barge-in, to reflect the user's implicit feedback/actions. We then introduce generic features, such as user domain/intent information from NLU outputs, and several system action types.

Table 3: Feature list derived from system metadata.
Feature | Symbol | Feature type
[User interruption]
barge-in | f_BI | {0, 1} binary value
termination | f_TM | {0, 1} binary value
gap time | f_GT | continuous value
[User intent]
intent | f_UI | categorical feature
domain | f_UD | categorical feature
[System action]
dialog status | f_DS | categorical feature
promptID | f_PI | {0, 1} binary value
SLU score bin | f_SB | categorical feature

Figure 2: User interruption signals: (a) barge-in, (b) termination.

3.3.1 User Interruption Signals. User barge-in: Barge-in is a frequently used feature for evaluation in SDS [4, 10, 34]. When a customer interrupts with a follow-up request while the ICA is responding or playing, the turn is labeled as a barge-in. As shown in Figure 2(a), we build a rule-based barge-in model. When (1) the ICA is talking or playing, (2) the delay between the previous utterance and the current one is less than a certain period of time (e.g., 45 seconds), and (3) the user intent is not in the intent set {"VolumeUp", "VolumeDown", "SetVolume"}, we set the current turn's barge-in value f_BI = 1; otherwise, f_BI = 0.

User termination: We define a user termination as a customer expressing a terminating intent. As shown in Figure 2(b), our termination detection is rule-based. If the user's intent is a terminating action (e.g., StopIntent, ExitAppIntent) and the delay between the previous utterance and the current one is less than a certain period of time (e.g., 45 seconds), we set the termination value f_TM = 1; otherwise, f_TM = 0.

Gap time: The gap time between two requests is an important indicator of a user interruption. We use the time difference as a feature, represented as f_GT.

3.3.2 User Intent Signals. An NLU component allows ICAs to produce interpretations of an input sentence. The NLU component accepts recognized speech inputs and produces intents, domains, and slots for the input utterance to support the user request [7, 37]. We use the domain and intent outputs from NLU as signals to reflect the user's intention. We cover dozens of domains and thousands of intents and use them as categorical features: f_UD for domain features and f_UI for intent features.

3.3.3 System Action Signals. Dialog status: Dialog Management (DM) is a key component of spoken language interactions with ICAs. It makes user inputs actionable by asking appropriate questions to help customers achieve a goal. DM can detect when a valid task completes or when there is trouble in the dialog, and it records this information. Following previous work on dialog act modeling [35], we use DM status values as system action signals for defect detection. Compared with the work in [21], we focus on more generic DM status categories here:

• SUCCESS: The ICA is able to act and deliver what it thinks the user wants (not ground truth).
• IN_PROGRESS: The ICA is in the process of executing a task or is prompting for additional information.
• USER_ABANDONED: The user abandons an in-progress dialog, either explicitly or implicitly.
• INVALID: The Spoken Language Understanding (SLU) could interpret the utterance, but the ICA cannot handle it. For example, the input may express a task that is unactionable due to user dependencies (e.g., account linking for music purchases) or is currently unsupported.
• ICA_ABANDONED: The SLU stops trying, or the ICA reaches the maximum number of turns.
• FAULT: The ASR encounters internal errors, the NLU service fails, or the app fails.

We represent the DM status as the categorical feature f_DS.

System prompts: promptID is a free-form system status code provided by ICA speechlets to indicate whether a speechlet can handle the request. For example, when the ICA responds "Sorry, I'm not sure", the promptID is "NotUnderstood". promptIDs can be categorized and mapped to different types of frictions, such as SLU frictions, errors or retries, coverage gaps, unsupported use cases, and required user actions. We convert the promptID into a binary feature: if the promptID is mapped to any friction type, the feature value f_PI = 1; otherwise, f_PI = 0.

SLU score bin: The SLU score represents the confidence of what the SLU understands as the desired intent/slot output for the utterance. The SLU score bin is a categorical feature that groups the confidence score into high/medium/low bins. Compared to volatile features such as "ASR confidence" or "entity resolution score", the SLU score bin is a stable feature with low variation over time. Hence, we use it as a feature, represented as f_SB.
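To make the metadata featurization concrete, here is a minimal sketch of how the Table 3 features could be derived for one turn. The 45-second threshold, the volume and terminating intent examples, and the "NotUnderstood" friction promptID come from the text above; the Turn record fields, the SLU bin boundaries, and the exact string values are illustrative assumptions, not the production definitions.

<pre>
from dataclasses import dataclass

# Illustrative per-turn log record; field names are assumptions for this sketch.
@dataclass
class Turn:
    intent: str                      # NLU intent
    domain: str                      # NLU domain
    dialog_status: str               # DM status, e.g. "SUCCESS", "FAULT"
    prompt_id: str                   # speechlet prompt code, e.g. "NotUnderstood"
    slu_score: float                 # SLU confidence
    gap_seconds: float               # delay since the previous utterance
    ica_is_talking_or_playing: bool

GAP_THRESHOLD_SECONDS = 45.0
VOLUME_INTENTS = {"VolumeUp", "VolumeDown", "SetVolume"}
TERMINATING_INTENTS = {"StopIntent", "ExitAppIntent"}   # examples from Section 3.3.1
FRICTION_PROMPT_IDS = {"NotUnderstood"}                 # promptIDs mapped to a friction type

def barge_in(turn: Turn) -> int:
    # f_BI = 1 when the user interrupts while the ICA is still talking/playing.
    return int(turn.ica_is_talking_or_playing
               and turn.gap_seconds < GAP_THRESHOLD_SECONDS
               and turn.intent not in VOLUME_INTENTS)

def termination(turn: Turn) -> int:
    # f_TM = 1 when a terminating intent quickly follows the previous utterance.
    return int(turn.intent in TERMINATING_INTENTS
               and turn.gap_seconds < GAP_THRESHOLD_SECONDS)

def slu_score_bin(score: float) -> str:
    # f_SB: bucket the volatile SLU confidence into stable bins (boundaries illustrative).
    if score >= 0.8:
        return "high"
    return "medium" if score >= 0.5 else "low"

def metadata_features(turn: Turn) -> dict:
    # Collects the feature list of Table 3 for a single turn.
    return {
        "f_BI": barge_in(turn),                              # binary
        "f_TM": termination(turn),                           # binary
        "f_GT": turn.gap_seconds,                            # continuous
        "f_UI": turn.intent,                                 # categorical
        "f_UD": turn.domain,                                 # categorical
        "f_DS": turn.dialog_status,                          # categorical
        "f_PI": int(turn.prompt_id in FRICTION_PROMPT_IDS),  # binary friction flag
        "f_SB": slu_score_bin(turn.slu_score),               # categorical bin
    }
</pre>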
4 EXPERIMENTS

In this section, we discuss our experimental results. First, we present the IQ-Net model's performance with different encoders (CNN and BERT) and compare it with our baseline method. Then, we conduct an ablation study to understand the importance of each feature. We also conduct additional analysis of specific examples.

4.1 Datasets

We collect an annotated turn-level user-perceived defect dataset for our experiments by following the same annotation workflow as described in [4, 31]. We randomly sampled data for annotation. The dataset contains hundreds of thousands of samples. Two first-turn examples with label = 1 and label = 0 are as follows.

Defect = 1 example:
User request: Do I have any appointments today?
ICA response: Appointment is [definition of appointment]
User request: Tell me my appointments today

Defect = 0 example:
User request: Turn on the lights
ICA response: Ok
User request: Thank you

4.2 Main Results

Table 4 shows the results of IQ-Net with two different encoders, CNN and BERT, compared with the baseline.

Table 4: Results. Baseline: the baseline method is a Gradient Boosting Decision Tree (GBDT) [14] with the same features mentioned in Section 3. IQ-Net (CNN): our proposed method with a CNN encoder. IQ-Net (BERT): our proposed method with a BERT encoder.
Perf (%) | Accuracy | F1 | Recall | Precision
Baseline
meta_data only | 82.54 | 74.30 | 75.48 | 73.16
+ <U1, U2> | +0.27 | +0.64 | +1.36 | -0.03
+ <U1, R1> | +0.22 | +0.57 | +1.3 | -0.11
+ <U1, R1, U2> | +0.08 | +0.73 | +2.56 | -0.92
IQ-Net (CNN)
+ <U1, U2> | +0.35 | +1.21 | +3.41 | -0.75
+ <U1, R1> | +1.82 | +2.39 | +1.43 | +3.31
+ <U1, R1, U2> | +2.63 | +3.74 | +3.28 | +4.18
IQ-Net (BERT)
+ <U1, U2> | +0.82 | +1.09 | +0.71 | +1.44
+ <U1, R1> | +2.69 | +3.83 | +3.38 | +4.25
+ <U1, R1, U2> | +3.23 | +4.62 | +4.14 | +5.18

As shown in Table 4, IQ-Net (with either the CNN encoder or the BERT encoder) performs better than the baseline method. IQ-Net (BERT) outperforms the metadata-only baseline with an improvement of 3.23% in accuracy, 4.62% in F1 score, and 5.18% in precision, and it outperforms the baseline with full features with an improvement of 3.15% in accuracy, 3.89% in F1 score, and 6.1% in precision.
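For reference, the "meta_data only" baseline row of Table 4 could be set up roughly as below, assuming a scikit-learn pipeline over the Table 3 features with categorical fields one-hot encoded. The column names, encoder choice, and classifier settings are assumptions for illustration, not the paper's exact configuration.

<pre>
# Rough sketch of a GBDT baseline over the Table 3 metadata features (assumed setup).
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

categorical = ["f_UI", "f_UD", "f_DS", "f_SB"]
numeric = ["f_BI", "f_TM", "f_GT", "f_PI"]

baseline = Pipeline([
    ("encode", ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
        remainder="passthrough")),        # binary/continuous features pass through unchanged
    ("gbdt", GradientBoostingClassifier()),
])
# baseline.fit(X_train[categorical + numeric], y_train)  # X_train: per-turn feature table (hypothetical)
</pre>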
4.3 Ablation Study

Ablation over different features: We perform ablation experiments over the list of features used in IQ-Net (BERT) to better understand their relative importance. In Table 5, we show how much the overall model performance degrades when each specific feature is removed. In particular, removing the context representation f_<U1,R1,U2> impacts overall performance the most: accuracy decreases by 3.237% and F1 score by 4.615%.

Table 5: Results of IQ-Net (BERT) feature ablation.
Perf (%) | Accuracy | F1 | Recall | Precision
IQ-Net (BERT) | 85.77 | 78.92 | 79.62 | 78.34
− f_<U1,U2> | -0.542 | -0.792 | -0.758 | -0.824
− f_<U1,R1> | -2.412 | -3.530 | -3.423 | -3.632
− f_<U1,R1,U2> | -3.237 | -4.615 | -4.131 | -5.076
− f_BI | 0.018 | 0.050 | 0.139 | -0.036
− f_TM | -0.195 | -0.076 | 0.727 | -0.836
− f_GT | -0.888 | -1.752 | -3.255 | -0.243
− f_UI | 0.466 | 0.306 | -1.172 | 1.787
− f_UD | -0.265 | -0.215 | 0.445 | -0.842
− f_DS | -0.350 | -0.523 | -0.547 | -0.500
− f_PI | -0.074 | -0.064 | 0.107 | -0.228
− f_SB | -0.209 | -0.221 | 0.109 | -0.537

4.4 Case Analysis

As shown in Table 4, using both <U1, U2> and <U1, R1> as context helps IQ-Net perform better than considering only one of the signals. We look into examples where the former makes a correct prediction while the latter fails to do so. For the defect = 1 example below, the predicted probability of the second turn being a rephrase of the first turn is 0.432, and the predicted relevance score between the request-response pair is 0.906. Thus, this defect would not be easily captured if only one of the <U1, U2> or <U1, R1> pairs were considered as context. However, the overall IQ-Net can detect it as a defect by considering both at the same time.

Defect = 1 example:
User request: where is university located?
ICA response: University, Hillsborough County, ...
User request: where is yale university

5 CONCLUSION

In this paper, we propose building an automated metric to evaluate dialogue quality at the turn level for ICAs. We propose IQ-Net, a model tuned end-to-end from raw dialogue context and system metadata that allows us to predict interaction-level dialogue quality. Experimental results show that our method outperforms the baseline method and works well across different domains as well as various intents. We also conduct an ablation study on individual features to understand the contribution of each feature to the model's prediction ability.

REFERENCES

[1] Eugene Agichtein, Eric Brill, and Susan Dumais. 2006. Improving web search ranking by incorporating user behavior information. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 19–26.
[2] Snjezana Babic, Tihomir Orehovacki, and Darko Etinger. 2018. Perceived user experience and performance of intelligent personal assistants employed in higher education settings. In 2018 41st International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO). IEEE, 0830–0834.
[3] Ana Berdasco, Gustavo López, Ignacio Diaz, Luis Quesada, and Luis A Guerrero. 2019. User Experience Comparison of Intelligent Personal Assistants: Alexa, Google Assistant, Siri and Cortana. In Multidisciplinary Digital Publishing Institute Proceedings, Vol. 31. 51.
[4] Praveen Kumar Bodigutla, Lazaros Polymenakos, and Spyros Matsoukas. 2019. Multi-domain Conversation Quality Evaluation via User Satisfaction Estimation. arXiv preprint arXiv:1911.08567 (2019).
[5] Thomas M Brill, Laura Munoz, and Richard J Miller. 2019. Siri, Alexa, and other digital assistants: a study of customer satisfaction with artificial intelligence applications. Journal of Marketing Management 35, 15-16 (2019), 1401–1436.
[6] Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. 1994. Signature verification using a "Siamese" time delay neural network. In Advances in Neural Information Processing Systems. 737–744.
[7] Eunah Cho, He Xie, and William M Campbell. 2019. Paraphrase generation for semi-supervised learning in NLU. In Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation. 45–54.
[8] Heriberto Cuayáhuitl, Seonghan Ryu, Donghyeon Lee, and Jihie Kim. 2018. A study on dialogue reward prediction for open-ended conversational agents. arXiv preprint arXiv:1812.00350 (2018).
[9] Allan de Barcelos Silva, Marcio Miguel Gomes, Cristiano André da Costa, Rodrigo da Rosa Righi, Jorge Luis Victoria Barbosa, Gustavo Pessin, Geert De Doncker, and Gustavo Federizzi. 2020. Intelligent Personal Assistants: A Systematic Literature Review. Expert Systems with Applications (2020), 113193.
[10] Jan Deriu, Alvaro Rodrigo, Arantxa Otegi, Guillermo Echegoyen, Sophie Rosset, Eneko Agirre, and Mark Cieliebak. 2019. Survey on evaluation methods for dialogue systems. arXiv preprint arXiv:1905.04071 (2019).
[11] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[12] Layla El Asri, Hatim Khouzaimi, Romain Laroche, and Olivier Pietquin. 2014. Ordinal regression for interaction quality prediction. In Proceedings of ICASSP. https://www.microsoft.com/en-us/research/publication/ordinal-regression-interaction-quality-prediction-2/
[13] Klaus-Peter Engelbrecht, Florian Gödde, Felix Hartard, Hamed Ketabdar, and Sebastian Möller. 2009. Modeling user satisfaction with hidden Markov models. In Proceedings of the SIGDIAL 2009 Conference. 170–177.
[14] Jerome H Friedman. 2001. Greedy function approximation: a gradient boosting machine. Annals of Statistics (2001), 1189–1232.
[15] Jianfeng Gao, Michel Galley, Lihong Li, et al. 2019. Neural approaches to conversational AI. Foundations and Trends in Information Retrieval 13, 2-3 (2019), 127–298.
[16] Ahmed Hassan and Ryen W White. 2013. Personalized models of search satisfaction. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management. 2009–2018.
[17] David Herbert and Byeong Kang. 2019. Comparative Analysis of Intelligent Personal Agent Performance. In Pacific Rim Knowledge Acquisition Workshop. Springer, 127–141.
[18] Ryuichiro Higashinaka, Yasuhiro Minami, Kohji Dohsaka, and Toyomi Meguro. 2010. Issues in predicting user satisfaction transitions in dialogues: Individual differences, evaluation criteria, and prediction models. In International Workshop on Spoken Dialogue Systems Technology. Springer, 48–60.
[19] Katja Hofmann, Lihong Li, Filip Radlinski, et al. 2016. Online evaluation for information retrieval. Foundations and Trends in Information Retrieval 10, 1 (2016), 1–117.
[20] Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS) 20, 4 (2002), 422–446.
[21] Jiepu Jiang, Ahmed Hassan Awadallah, Rosie Jones, Umut Ozertem, Imed Zitouni, Ranjitha Gurunath Kulkarni, and Omar Zia Khan. 2015. Automatic online evaluation of intelligent assistants. In Proceedings of the 24th International Conference on World Wide Web. 506–516.
[22] Julia Kiseleva, Kyle Williams, Ahmed Hassan Awadallah, Aidan C Crook, Imed Zitouni, and Tasos Anastasakos. 2016. Predicting user satisfaction with intelligent assistants. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. 45–54.
[23] Julia Kiseleva, Kyle Williams, Jiepu Jiang, Ahmed Hassan Awadallah, Aidan C Crook, Imed Zitouni, and Tasos Anastasakos. 2016. Understanding user satisfaction with intelligent assistants. In Proceedings of the 2016 ACM Conference on Human Information Interaction and Retrieval. 121–130.
[24] Dmitry Lagun, Chih-Hung Hsieh, Dale Webster, and Vidhya Navalpakkam. 2014. Towards better measurement of attention and satisfaction in mobile search. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval. 113–122.
[25] Tihomir Orehovački, Snježana Babić, and Darko Etinger. 2018. Modelling the perceived pragmatic and hedonic quality of intelligent personal assistants. In International Conference on Intelligent Human Systems Integration. Springer, 589–594.
[26] Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 1532–1543.
[27] Pragaash Ponnusamy, Alireza Roshan Ghias, Chenlei Guo, and Ruhi Sarikaya. 2019. Feedback-Based Self-Learning in Large-Scale Conversational AI Agents. arXiv preprint arXiv:1911.02557 (2019).
[28] Louisa Pragst, Stefan Ultes, and Wolfgang Minker. 2017. Recurrent Neural Network Interaction Quality Estimation. Springer Singapore, Singapore, 381–393. https://doi.org/10.1007/978-981-10-2585-3_31
[29] Niklas Rach, Wolfgang Minker, and Stefan Ultes. 2017. Interaction Quality Estimation Using Long Short-Term Memories. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue. Association for Computational Linguistics, Saarbrücken, Germany, 164–169. https://doi.org/10.18653/v1/W17-5520
[30] Shumpei Sano, Nobuhiro Kaji, and Manabu Sassano. 2016. Prediction of prospective user engagement with intelligent assistants. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 1203–1212.
[31] Shumpei Sano, Nobuhiro Kaji, and Manabu Sassano. 2017. Predicting causes of reformulation in intelligent assistants. arXiv preprint arXiv:1707.03968 (2017).
[32] Ruhi Sarikaya. 2017. The technology behind personal digital assistants: An overview of the system architecture and key components. IEEE Signal Processing Magazine 34, 1 (2017), 67–81.
[33] Alexander Schmitt and Stefan Ultes. 2015. Interaction quality: assessing the quality of ongoing spoken dialog interaction by experts—and how it relates to user satisfaction. Speech Communication 74 (2015), 12–36.
[34] Alexander Schmitt, Stefan Ultes, and Wolfgang Minker. 2012. A Parameterized and Annotated Spoken Dialog Corpus of the CMU Let's Go Bus Information System. In LREC. 3369–3373.
[35] Andreas Stolcke, Klaus Ries, Noah Coccaro, Elizabeth Shriberg, Rebecca Bates, Daniel Jurafsky, Paul Taylor, Rachel Martin, Carol Van Ess-Dykema, and Marie Meteer. 2000. Dialogue act modeling for automatic tagging and recognition of conversational speech. Computational Linguistics 26, 3 (2000), 339–373.
[36] Pei-Hao Su, Milica Gašić, and Steve Young. 2018. Reward estimation for dialogue policy optimisation. Computer Speech & Language 51 (2018), 24–43.
[37] Gokhan Tur and Renato De Mori. 2011. Spoken language understanding: Systems for extracting semantic information from speech. John Wiley & Sons.
[38] Stefan Ultes and Wolfgang Minker. 2014. Interaction Quality Estimation in Spoken Dialogue Systems Using Hybrid-HMMs. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL). Association for Computational Linguistics, Philadelphia, PA, USA, 208–217. https://doi.org/10.3115/v1/W14-4328
[39] Marilyn A Walker, Diane J Litman, Candace A Kamm, and Alicia Abella. 1997. PARADISE: A framework for evaluating spoken dialogue agents. arXiv preprint cmp-lg/9704004 (1997).