IQ-Net: A DNN Model for Estimating Interaction-level Dialogue
            Quality with Conversational Agents
                            Yuan Ling, Benjamin Yao, Guneet Kohli, Tuan-Hung Pham, Chenlei Guo
                                                    {yualing,benjamy,gkohli,hupha,guochenl}@amazon.com
                                                                     Amazon Alexa AI
                                                                        Seattle, WA
ABSTRACT
An automated metric to evaluate dialogue quality is critical for continuously optimizing large-scale conversational agent systems such as Alexa. Previous approaches for tackling this problem often rely on a limited set of manually designed and/or heuristic features, which cannot be easily scaled to a large number of domains or scenarios. In this paper, we present Interaction-Quality-Network (IQ-Net), a novel DNN model that allows us to predict interaction-level dialogue quality directly from raw dialogue contents and system metadata, without human-engineered NLP features. The IQ-Net architecture is compatible with several pre-trained neural network embeddings and architectures such as CNN, ELMo, and BERT. Through an ablation study in Alexa, we demonstrate that several variants of IQ-Net outperform a baseline model with manually engineered features (3.89% improvement in F1 score, 3.15% in accuracy, and 6.1% in precision), while also reducing the effort needed to extend to new domains and use-cases.

CCS CONCEPTS
• Computing methodologies → Natural language processing.

KEYWORDS
online evaluation, intelligent conversational agents (ICAs), evaluation metrics, defect detection

ACM Reference Format:
Yuan Ling, Benjamin Yao, Guneet Kohli, Tuan-Hung Pham, Chenlei Guo. 2020. IQ-Net: A DNN Model for Estimating Interaction-level Dialogue Quality with Conversational Agents. In Proceedings of KDD Workshop on Conversational Systems Towards Mainstream Adoption (KDD Converse'20). ACM, New York, NY, USA, 7 pages.

1 INTRODUCTION
As voice-controlled intelligent conversational agents (ICAs) such as Alexa, Siri, and Google Assistant become increasingly popular, ICAs have become a new paradigm for accessing information. They represent a hybrid of search and dialogue systems that conversationally interact with users to execute a wide range of actions (e.g., searching the Web, setting alarms, and making phone calls) [9, 31]. These ICAs are complex systems with many components, such as automatic speech recognition (ASR), natural language understanding (NLU), language generation, and dialog management. As a result, there are generally two categories of evaluation metrics for ICAs [32]: (1) component metrics, which measure the quality of each individual component, such as Word Error Rate (WER) and NLU accuracy; and (2) end-to-end (E2E) metrics, which are designed to measure the entire dialogue quality and/or user satisfaction. In this paper, we focus on "online" or "automated" E2E metrics, which are built with machine learning models designed to predict user satisfaction. Defining online E2E metrics is a popular subject of research [10] due to its vital role in the continuous optimization of ICA systems [15].

Prior attempts to model online E2E metrics can be roughly grouped into two categories: (1) dialogue-level metrics, such as the popular PARADISE framework [39], which aims to predict dialogue-level satisfaction ratings provided by surveyed users; references [8, 36] formulate dialogue interaction as a reinforcement learning task, which aims to predict reward from dialogue history using different variants of DNN models; (2) turn-level (or exchange-level) metrics, such as Interaction Quality (IQ) [33], which predict per-turn dialogue quality either provided by users [13] or annotated by human raters [34]. In particular, IQ models have gained popularity recently because of publicly available benchmarks such as the CMU Let's Go bus information system [34]. Various methods exist to predict Interaction Quality, for example using Hidden Markov
Models [38], Support Vector Machines [12, 33], and Recurrent Neural Networks [28, 29]. However, these approaches rely heavily on dialog-system-internal features such as ASR confidence and SLU semantic parse. While these features are very effective in a small-scale closed-loop system, they are unreliable in a large organization like Alexa, where many teams constantly update various components in parallel. For example, the ASR-confidence score can shift significantly between two ASR model versions and hence is an unreliable input for an E2E online metric, which is, in part, designed to measure ASR's impact on user satisfaction. Therefore, instead of relying on system-internal signals, we draw on the intuition that human raters can reliably judge the quality of a turn by looking at the context of the dialogue [4] (without ever needing to know the ASR confidence score). Table 1 shows four example dialogues with interaction-quality assessments by humans.

1. Assessment: non-defect (success)
User request: play five little ducks
ICA response: Ok, playing five little ducks
User request: (after 2 mins...) play five little ducks
2. Assessment: defect, user paraphrase
User request: play three little ducks
ICA response: Sorry, I cannot find the song
User request: play five little ducks
3. Assessment: non-defect, user confirmation
User request: turn off light
ICA response: did you mean Lamp One?
User request: yeah
4. Assessment: defect, user correction and ASR error
User request: buy
ICA response: you have one item in your shopping cart {name of the item}, do you want to buy it?
User request: bye-bye.
Table 1: Example dialogues and interaction-quality assessment. Note that assessments are for the 1st turn in each dialogue.

While these examples demonstrate how humans can easily judge the quality of a turn using dialogue context, they are non-trivial cases for an ML model to predict. For example, examples #1 and #2 share a similar user-paraphrase structure, but #1 is non-defective (successful) because the ICA response is relevant, while #2 is clearly defective because the ICA responded with "sorry, I cannot ...". On the other hand, while examples #3 and #4 share a similar query/response structure (the ICA asking for confirmation), #3 is non-defective because the user gives a positive confirmation in the next turn, while #4 is a defect because the user issues a correction in the next turn. To capture such a diverse set of dialogue patterns, it is clear that we need to leverage the semantic meaning of the dialogue context. While it is possible to manually extract features such as "paraphrasing" and "cohesion between response and request", as proposed by Bodigutla et al. [4], the complexity of an open-domain system like Alexa limits the efficacy of such an approach. Therefore, in this paper we present Interaction-Quality-Network (IQ-Net), an E2E DNN model that allows us to predict interaction-level dialogue quality directly from raw dialogue contents and system metadata, without human-engineered NLP features. In contrast to existing related work, we contribute to the IQ modeling literature from the following perspectives.

    • Instead of focusing on a few specific tasks and system-internal features [21, 33], IQ-Net is a generic interaction quality model that can be used for evaluating Interaction Quality across multiple domains and various systems;
    • Unlike a previous multi-domain user satisfaction evaluation model [4], which relies on manually engineered dialogue features that are hard to scale, IQ-Net is capable of capturing a variety of dialogue patterns and can be easily extended to new domains/use-cases as long as we have annotated examples.

The rest of the paper is organized as follows. Section 2 reviews existing work. Section 3 presents our methods to estimate interaction-level dialogue quality. Section 4 presents our experimental results. We conclude the paper in Section 5.

2 RELATED WORK
In this section, we summarize the related work on evaluation methods/metrics for search systems and error analysis for ICAs to put our contributions in context.

Evaluation is a central component of information search systems [19]. For text-based information retrieval, relevant documents/pages are annotated manually to evaluate search system performance. Query-based metrics such as mean average precision (MAP) and normalized discounted cumulative gain (nDCG) [20] are frequently used to evaluate system performance. However, the human annotation process is expensive and error-prone; in addition, the user's individual intent is commonly not taken into consideration. To alleviate this issue, some research models user satisfaction/behaviors to improve the evaluation of a system's performance [1] by incorporating the following signals: 1) user behaviors, including clicks, dwell time, mouse movements, scrolling behaviors, and abandonment [16]; 2) context-specific features, such as viewport metrics [24], touch-related features, and acoustic signals [23]; 3) query-based features, such as query refinement, query length, and frequency in logs. While the metrics for evaluating traditional search systems might not be used directly to evaluate ICAs, some of the metric components, such as query refinement, can be adapted to evaluate ICAs.

Compared with text-based information retrieval, voice-based information retrieval is quite different [21, 23] for two reasons. First, voice-based interactions are conversational; in some scenarios, the user expects the search system to be able to refer to previous interactions to understand the current request. Second, the voice input can introduce automatic speech recognition (ASR) errors into downstream applications and negatively affect user satisfaction. Research on Spoken Dialogue Systems (SDS) attempts to model user satisfaction at turn level as a continuous process over time [13, 18]. An annotated and standardized corpus, such as the Let's Go Bus Information System dataset from CMU [34], was developed for classification and evaluation tasks regarding task success prediction, dialogue quality estimation, and emotion recognition. Based on this dataset, an evaluation metric called Interaction Quality [10, 34] was developed with features related to ASR, Spoken Language Understanding (SLU), and the Dialog Manager at the exchange, dialog, and window level.

ICAs differ from traditional SDS in that they support personalization and a wide range of tasks. Dialogue systems can be categorized into three groups: task-oriented systems, conversational agents, and interactive question answering systems [10]. ICAs are designed to handle all of these tasks, which makes the evaluation of ICAs very challenging. In addition, as voice-only ICAs evolve into voice-enabled multi-modal ICAs, evaluation becomes even more complex. Recent user studies on ICAs compare differences regarding features [25], performance, ASR error [17], and user experiences [3] across different ICAs. Surveys with questionnaires [2, 5] have been conducted to understand functional and topical use of ICAs by individuals; however, these studies are limited to predefined scenarios of interactions.

Currently, there is limited research on building automatic metrics for evaluating ICAs' performance. Jiang et al. [21] built separate models for evaluating user satisfaction in several domains, including Chat, Device Control, Communication, Location, Calendar, and Weather. The models consider several types of features, including user-system interactions, click features, request features, response features, and acoustic features. This work automated the online evaluation of ICAs. However, it did not consider the variability of interface and interaction; in addition, its scope is limited to ICAs
on mobile devices and several specific scenarios/domains. Bodigutla et al. [4] introduce a Response Quality annotation schema, which shows high correlation with explicit turn-level user satisfaction ratings. That paper developed a method for evaluating user satisfaction at turn level in multi-domain conversations with ICAs using five features: user request rephrasing, cohesion between response and request, aggregate topic popularity, unactionable user request, and the diversity of topics in a session. The turn-level user satisfaction rating is further used as a feature to improve dialogue-level satisfaction estimation. Other current research on the evaluation of ICAs focuses more on user satisfaction estimation and goal success prediction, which are more suitable for dialog-level or task-level evaluation due to the dialogue style of interactions [15, 22, 30]. A user's frustration in the middle of a task or a dialogue might not be captured, and such approaches also often lack interpretability in terms of the root causes of user frustration. Finally, it is not obvious how one should define task and session boundaries for ICAs [30]; thus, it is critical to evaluate ICAs at turn level.

The complexity of ICAs' components makes it difficult to determine which component causes an error or user frustration. Researchers have studied system errors in search and dialogue systems. For ICAs, the error root causes can be categorized into groups [31, 32], including ASR errors, NLU errors, unsupported system actions, no language generation, back-end failures, endpoint errors, and uninterpretable inputs. These errors can be the root causes of users reformulating their queries [27, 31].

3 METHODOLOGY
In this section, we present IQ-Net: a DNN model for estimating interaction-level dialogue quality. First, we introduce the overall architecture and training procedure of IQ-Net. Then, we explain in detail how we represent the dialogue context. Next, we introduce the system metadata used in IQ-Net.

3.1 IQ-Net
The IQ-Net model is presented in Figure 1. IQ-Net includes two major components: (1) dialogue context representations and (2) a list of features derived from system metadata.

[Figure 1: IQ-Net Architecture Diagram. The model considers the information of the current turn utterance (U_t), response (R_t), and next turn utterance (U_{t+1}). Semantic encoding layers share weights. Position encoders are shared among U_t and U_{t+1}. This diagram uses the BERT encoder as an example; for IQ-Net(CNN), we replace the BERT encoder with a CNN encoder.]

For modeling dialogue context, we consider the user's request text plus the ICA's response text in consecutive turns. As shown in Figure 1, the dialogue context representation part takes the current turn's request and response (U_1 and R_1) and requests from following turns (U_2, U_3, ...) as inputs. For simplicity, we only consider the next turn's request; thus, the inputs can be represented as <U_1, R_1, U_2>. We will support more following turns in future work. We assume <U_1, R_1> captures the relevancy between the user request and the Alexa response, and <U_1, U_2> captures patterns from the user's repeat/dialog behavior.

We map the word indices of U_1, R_1, and U_2 into fixed-dimension vectors through pre-trained word embeddings [11, 26]. The word embedding representations of U_1, R_1, and U_2 all go through a sentence encoder E, which can be pre-trained on individual datasets. We use both a CNN encoder E_CNN and a BERT encoder E_BERT as sentence encoders in our experiments for comparison. We concatenate the hidden representations h_{U_1} and h_{R_1}, and h_{U_1} and h_{U_2}, accordingly. The concatenated result for each part is followed by a feed-forward network and an activation function.

The final outputs from the dialogue context representations are combined with all other features derived from system metadata to predict a defect/non-defect outcome for the interaction-level dialogue quality of the first turn:

    $p(\mathrm{Defect} = \mathrm{true} \mid \langle U_t, R_t, U_{t+1} \rangle, f_M)$    (1)

where $f_M$ represents a list of metadata features (described in Section 3.3).

The objective function for the overall task is

    $L_\Theta = \sum_{(U_t, R_t, U_{t+1})} l\big(F(U_t, R_t, U_{t+1}, f_M), y\big)$    (2)

where $F(\cdot)$ is a function that represents IQ-Net, $l$ is the standard cross-entropy loss, and $y$ is the ground-truth label.
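To make the architecture concrete, the following is a minimal PyTorch-style sketch of the forward pass and the Eq. (1)/(2) objective, assuming a shared sentence encoder as in Figure 1. All names and sizes here (IQNet, enc_dim, hidden, the separate pair FFNs) are our illustrative assumptions, not the production implementation.

# Minimal sketch of IQ-Net, assuming `encoder` is a shared sentence
# encoder (CNN or BERT) mapping token ids to a (batch, enc_dim) vector.
# Names and dimensions are illustrative, not the authors' actual code.
import torch
import torch.nn as nn

class IQNet(nn.Module):
    def __init__(self, encoder: nn.Module, enc_dim: int, meta_dim: int, hidden: int = 128):
        super().__init__()
        self.encoder = encoder  # shared weights across U1, R1, U2
        # One feed-forward network per concatenated pair, as in Figure 1.
        self.ffn_ur = nn.Sequential(nn.Linear(2 * enc_dim, hidden), nn.ReLU())
        self.ffn_uu = nn.Sequential(nn.Linear(2 * enc_dim, hidden), nn.ReLU())
        # Combine both context parts with the metadata features f_M.
        self.classifier = nn.Linear(2 * hidden + meta_dim, 1)

    def forward(self, u1, r1, u2, f_meta):
        h_u1, h_r1, h_u2 = self.encoder(u1), self.encoder(r1), self.encoder(u2)
        pair_ur = self.ffn_ur(torch.cat([h_u1, h_r1], dim=-1))  # <U1, R1> relevance
        pair_uu = self.ffn_uu(torch.cat([h_u1, h_u2], dim=-1))  # <U1, U2> rephrase/confirm
        logits = self.classifier(torch.cat([pair_ur, pair_uu, f_meta], dim=-1))
        return logits.squeeze(-1)  # sigmoid of this gives p(Defect = true | ., f_M)

# Training with the standard cross-entropy loss of Eq. (2):
#   loss = nn.BCEWithLogitsLoss()(model(u1, r1, u2, f_meta), y.float())

In the full system, the encoder would be the pre-trained E_CNN or E_BERT described above, fine-tuned end-to-end with the classifier.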
3.2 Dialogue Context Representations
Here, we explain in detail how the dialogue context is represented with <U_1, U_2> modeling and <U_1, R_1> modeling.

3.2.1 <U_1, U_2> Modeling. The <U_1, U_2> pair contains the user's dialog behavior/patterns. For example, ICA users tend to re-express the same intention with follow-up requests after an unsuccessful previous request. We refer to the follow-up request as a "rephrase" of the previous request. Identifying the rephrasing pattern between a request pair can help discover defects/frictions.

In addition to the rephrasing behavior between two consecutive user requests, the user can also express a confirm or deny intention in a follow-up request, as shown in Table 2.

Example 1: confirm
User request: Alexa, add paper to my cart
Alexa response: Do you mean paper towel?
User request: Yes.
Example 2: deny
User request: Alexa, add paper to my cart
Alexa response: Do you mean paper towel?
User request: No. add A4 paper to my cart.
Table 2: <U_1, U_2> confirm/deny examples.

Such patterns in <U_1, U_2> reflect the user's real intention through the corresponding repeat/confirm/deny behaviors, which can be learned by the proposed IQ-Net.

3.2.2 <U_1, R_1> Modeling. The semantic relevance of the ICA's response to the user's request can be an effective feature for defect prediction. When an ICA responds to a user's request with an irrelevant answer, the metric should capture this as defective. However, it is difficult to discover such a defect when the ICA provides a complete but incorrect response and the user chooses to abandon the interaction without rephrasing the request. The relevance between the request and the response text can potentially help with defect identification.

IQ-Net takes the user request and response text (U_1 and R_1) as inputs, similar to <U_1, U_2> modeling. We adopt the frequently used "Siamese" architecture [6] to measure request-response similarities in the projected space, as shown in Figure 1.
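As a sketch of this Siamese idea: the same encoder projects both U_1 and R_1, and similarity is computed in the projected space. The projection size and the use of cosine similarity below are illustrative assumptions; the paper does not specify them.

# Sketch of "Siamese" <U1, R1> relevance scoring: one encoder with
# shared weights, similarity computed in a learned projection space.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseRelevance(nn.Module):
    def __init__(self, encoder: nn.Module, enc_dim: int, proj_dim: int = 64):
        super().__init__()
        self.encoder = encoder                # shared between request and response
        self.proj = nn.Linear(enc_dim, proj_dim)

    def forward(self, u1, r1):
        z_u = F.normalize(self.proj(self.encoder(u1)), dim=-1)
        z_r = F.normalize(self.proj(self.encoder(r1)), dim=-1)
        return (z_u * z_r).sum(-1)            # cosine similarity in projected space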
3.3 System Metadata
Table 3 introduces the signals and features derived from system metadata for evaluating ICAs. We adopt several signals, such as termination and barge-in, to reflect the user's implicit feedback/actions. We also introduce generic features, such as user domain/intent information from NLU outputs, and several system action types.

Feature               Symbol   Feature Type
[User interruption]
barge-in              f_BI     {0, 1} binary value
termination           f_TM     {0, 1} binary value
gap time              f_GT     continuous value
[User intent]
intent                f_UI     categorical feature
domain                f_UD     categorical feature
[System Action]
dialog status         f_DS     categorical feature
promptID              f_PI     {0, 1} binary value
SLU score bin         f_SB     categorical feature
Table 3: Feature List Derived from System Metadata.

[Figure 2: User interruption signals. (a) barge-in; (b) termination.]

3.3.1 User Interruption Signals. User Barge-in: Barge-in is a frequently used feature for evaluation in SDS [4, 10, 34]. When a customer interrupts with a follow-up request while the ICA is responding or playing, the turn is labeled as a barge-in. As shown in Figure 2(a), we build a rule-based barge-in model. When (1) the ICA is talking or playing, (2) the delay between the previous utterance and the current one is less than a certain period of time (e.g., 45 seconds), and (3) the user intent is not in the intent set {"VolumeUp", "VolumeDown", "SetVolume"}, we set the current turn's barge-in value f_BI = 1; otherwise, f_BI = 0.

User Termination: We define a user termination as when a customer expresses a terminating intent. As shown in Figure 2(b), our termination detection is rule-based. If the user's intent is a terminating action (e.g., StopIntent, ExitAppIntent) and the delay between the previous utterance and the current one is less than a certain period of time (e.g., 45 seconds), we set the termination value f_TM = 1; otherwise, f_TM = 0.

Gap Time: The gap time between two requests is an important indicator of a user interruption. We use the time difference as a feature, represented as f_GT.
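The interruption rules above translate directly into code. Below is a hedged sketch of the rule-based labeling, assuming a simple per-turn record; the field names (device_state, gap_seconds, intent) are hypothetical, not Alexa's actual log schema.

# Rule-based user-interruption labeling, as described in Section 3.3.1.
# The Turn fields below are illustrative assumptions about the log format.
from dataclasses import dataclass

GAP_THRESHOLD_SECONDS = 45.0
VOLUME_INTENTS = {"VolumeUp", "VolumeDown", "SetVolume"}
TERMINATING_INTENTS = {"StopIntent", "ExitAppIntent"}

@dataclass
class Turn:
    device_state: str   # e.g., "talking", "playing", "idle"
    gap_seconds: float  # delay since the previous utterance
    intent: str         # NLU intent of the current utterance

def barge_in(turn: Turn) -> int:
    """f_BI = 1 if the user interrupts while the ICA is responding/playing."""
    return int(turn.device_state in {"talking", "playing"}
               and turn.gap_seconds < GAP_THRESHOLD_SECONDS
               and turn.intent not in VOLUME_INTENTS)

def termination(turn: Turn) -> int:
    """f_TM = 1 if the user quickly issues a terminating intent."""
    return int(turn.intent in TERMINATING_INTENTS
               and turn.gap_seconds < GAP_THRESHOLD_SECONDS)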
3.3.2 User Intent Signals. An NLU component allows ICAs to produce interpretations of an input sentence. The NLU component accepts recognized speech inputs and produces intents, domains, and slots for the input utterance to support the user request [7, 37]. We use the domain and intent outputs from NLU as signals reflecting the user's intention. We cover dozens of domains and thousands of intents, and use them as categorical features: f_UD for domain features and f_UI for intent features.

3.3.3 System Action Signals. Dialog Status: Dialog Management (DM) is a key component of spoken language interactions with ICAs. It makes user inputs actionable by asking appropriate questions to help customers achieve a goal. DM can detect when a valid task completes or when there is trouble in the dialog, and it records this information. Following previous work on dialog act modeling [35], we use DM status values as system action signals for defect detection. Compared with the work in [21], we focus on more generic DM status categories here:

    • SUCCESS: The ICA is able to act and deliver what it thinks the user wants (not ground truth).
    • IN_PROGRESS: The ICA is in the process of executing a task or is prompting for additional information.
    • USER_ABANDONED: The user abandons an in-progress dialog, either explicitly or implicitly.
    • INVALID: The Spoken Language Understanding (SLU) could interpret the utterance, but the ICA cannot handle it. For example, the input may express a task that is unactionable due to user dependencies (e.g., account linking for music purchases), or is currently unsupported.
    • ICA_ABANDONED: The SLU stops trying / the ICA reaches the maximum number of turns.
    • FAULT: The ASR encounters an internal error, the NLU service fails, or the app fails.

We represent the DM status as the categorical feature f_DS.

System Prompts: promptID is a free-form system status code provided by ICA speechlets to indicate whether a speechlet can handle the request. For example, when the ICA responds "Sorry, I'm not sure", the promptID is "NotUnderstood". promptIDs can be categorized and mapped to different types of frictions, such as SLU frictions, errors or retries, coverage gaps, unsupported use cases, and required user actions. We convert the promptID into a binary feature: if the promptID maps to any friction type, the feature value f_PI = 1; otherwise, f_PI = 0.

SLU Score Bin: The SLU score represents the confidence of what the SLU understood as the desired intent/slot output for the utterance. The SLU score bin is a categorical feature that groups the confidence score into high/medium/low bins. Compared to volatile features such as "ASR confidence" or "entity resolution score", the SLU score bin is a stable feature with low variation over time. Hence, we use it as a feature, represented as f_SB.
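To illustrate how the Table 3 signals could be assembled into the metadata feature vector f_M, here is a minimal sketch. The one-hot encoding and the tiny vocabularies shown are illustrative assumptions; the real system covers dozens of domains and thousands of intents, where learned embeddings would be more practical.

# Sketch: assembling the metadata feature vector f_M from Table 3.
# Vocabularies and one-hot encoding are illustrative choices only.
import numpy as np

DOMAINS = ["Music", "SmartHome", "Shopping"]  # hypothetical subset
STATUSES = ["SUCCESS", "IN_PROGRESS", "USER_ABANDONED",
            "INVALID", "ICA_ABANDONED", "FAULT"]
SLU_BINS = ["high", "medium", "low"]

def one_hot(value: str, vocab: list) -> np.ndarray:
    vec = np.zeros(len(vocab))
    if value in vocab:
        vec[vocab.index(value)] = 1.0
    return vec

def metadata_features(f_bi: int, f_tm: int, gap_seconds: float,
                      domain: str, dialog_status: str,
                      f_pi: int, slu_bin: str) -> np.ndarray:
    """Concatenate binary, continuous, and categorical features into f_M."""
    return np.concatenate([
        [f_bi, f_tm, gap_seconds, f_pi],    # binary + continuous signals
        one_hot(domain, DOMAINS),           # f_UD (f_UI handled analogously)
        one_hot(dialog_status, STATUSES),   # f_DS
        one_hot(slu_bin, SLU_BINS),         # f_SB
    ])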
4 EXPERIMENTS
In this section, we discuss our experimental results. First, we present the IQ-Net model's performance with different encoders (CNN and BERT) and compare it with our baseline method. Then, we conduct an ablation study to understand the importance of each feature. Finally, we conduct additional analysis of specific examples.

4.1 Datasets
We collect an annotated turn-level user-perceived defect dataset for our experiments by following the same annotation workflow as described in [4, 31]. We randomly sampled data for annotation. The dataset contains hundreds of thousands of samples.

Two first-turn examples with label = 1 and label = 0 are as follows.

Defect = 1 Example:
User request: Do I have any appointments today?
ICA response: Appointment is [definition of appointment]
User request: Tell me my appointments today

Defect = 0 Example:
User request: Turn on the lights
ICA response: Ok
User request: Thank you

4.2 Main Results
Table 4 shows the results of IQ-Net with two different encoders, CNN and BERT, compared with the baseline.

Perf(%)            Accuracy   F1      Recall   Precision
Baseline
meta_data only     82.54      74.30   75.48    73.16
+ <U1, U2>         +0.27      +0.64   +1.36    -0.03
+ <U1, R1>         +0.22      +0.57   +1.3     -0.11
+ <U1, R1, U2>     +0.08      +0.73   +2.56    -0.92
IQ-Net (CNN)
+ <U1, U2>         +0.35      +1.21   +3.41    -0.75
+ <U1, R1>         +1.82      +2.39   +1.43    +3.31
+ <U1, R1, U2>     +2.63      +3.74   +3.28    +4.18
IQ-Net (BERT)
+ <U1, U2>         +0.82      +1.09   +0.71    +1.44
+ <U1, R1>         +2.69      +3.83   +3.38    +4.25
+ <U1, R1, U2>     +3.23      +4.62   +4.14    +5.18
Table 4: Results. Baseline: the baseline method is a Gradient Boosting Decision Tree (GBDT) [14] with the same features mentioned in Section 3. IQ-Net(CNN): our proposed method with a CNN encoder. IQ-Net(BERT): our proposed method with a BERT encoder.

As shown in Table 4, IQ-Net (with either the CNN encoder or the BERT encoder) outperforms the baseline method. IQ-Net (BERT) outperforms the metadata-only baseline with an improvement of 3.23% in accuracy, 4.62% in F1 score, and 5.18% in precision, and it outperforms the baseline with full features with an improvement of 3.15% in accuracy, 3.89% in F1 score, and 6.1% in precision.
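For reference, a baseline of the kind described in the Table 4 caption could be built with an off-the-shelf GBDT. The sketch below uses scikit-learn's GradientBoostingClassifier and assumes (X, y) feature matrices like those assembled in the Section 3.3 sketch; the hyperparameters and split are illustrative, not the authors' setup.

# Hedged sketch of a GBDT baseline [14] over the Section 3 features.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

def train_gbdt_baseline(X, y):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    model = GradientBoostingClassifier()  # default hyperparameters; illustrative
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    return {
        "accuracy": accuracy_score(y_te, pred),
        "f1": f1_score(y_te, pred),
        "recall": recall_score(y_te, pred),
        "precision": precision_score(y_te, pred),
    }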
4.3 Ablation Study
Ablation over different features: We perform ablation experiments over the list of features used in IQ-Net(BERT) to better understand their relative importance. Table 5 shows how much the overall model performance degrades when we remove each specific feature. In particular, removing the context representation f_<U1,R1,U2> impacts the overall performance the most: accuracy decreases by 3.237% and F1 score by 4.615%.

Perf(%)            Accuracy   F1       Recall   Precision
IQ-Net(BERT)       85.77      78.92    79.62    78.34
- f_<U1,U2>        -0.542     -0.792   -0.758   -0.824
- f_<U1,R1>        -2.412     -3.530   -3.423   -3.632
- f_<U1,R1,U2>     -3.237     -4.615   -4.131   -5.076
- f_BI             0.018      0.050    0.139    -0.036
- f_TM             -0.195     -0.076   0.727    -0.836
- f_GT             -0.888     -1.752   -3.255   -0.243
- f_UI             0.466      0.306    -1.172   1.787
- f_UD             -0.265     -0.215   0.445    -0.842
- f_DS             -0.350     -0.523   -0.547   -0.500
- f_PI             -0.074     -0.064   0.107    -0.228
- f_SB             -0.209     -0.221   0.109    -0.537
Table 5: Results of IQ-Net(BERT) on Feature Ablation.
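The leave-one-feature-out procedure behind Table 5 can be expressed compactly. A sketch follows, assuming a hypothetical train_and_evaluate helper that retrains IQ-Net(BERT) on a given feature subset and returns metric values.

# Sketch of the leave-one-feature-out ablation behind Table 5.
# train_and_evaluate is a hypothetical helper, not part of the paper.
ALL_FEATURES = ["ctx_u1_u2", "ctx_u1_r1", "ctx_u1_r1_u2",
                "f_BI", "f_TM", "f_GT", "f_UI", "f_UD",
                "f_DS", "f_PI", "f_SB"]

def ablation_table(train_and_evaluate):
    full = train_and_evaluate(ALL_FEATURES)  # e.g., {"accuracy": ..., "f1": ...}
    deltas = {}
    for feat in ALL_FEATURES:
        reduced = train_and_evaluate([f for f in ALL_FEATURES if f != feat])
        deltas[feat] = {m: reduced[m] - full[m] for m in full}  # negative = degradation
    return full, deltas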
4.4 Case Analysis
As shown in Table 4, using both <U_1, U_2> and <U_1, R_1> as context gives IQ-Net better performance than considering only one of the signals. We look into examples where the combined model makes a correct prediction while a single-signal model fails. For the defect = 1 example below, the predicted probability of the second turn being a rephrase of the first turn is 0.432, and the predicted relevance score between the request-response pair is 0.906. Thus, this defect example is not easily captured when considering only one of the <U_1, U_2> or <U_1, R_1> pairs as context. However, the overall IQ-Net can detect it as a defect by considering both at the same time.

Defect = 1 Example:
User request: where is university located?
ICA response: University, Hillsborough County, ...
User request: where is yale university

5 CONCLUSION
In this paper, we propose an automated metric to evaluate dialogue quality at turn level for ICAs. We propose the IQ-Net model, tuned end-to-end on raw dialogue context and system metadata, to predict interaction-level dialogue quality. Experimental results show that our method outperforms the baseline method and works well across different domains as well as various intents. We conduct an ablation study on individual features to understand the contribution of each feature to the model's prediction ability.

REFERENCES
[1] Eugene Agichtein, Eric Brill, and Susan Dumais. 2006. Improving web search ranking by incorporating user behavior information. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval. 19–26.
[2] Snjezana Babic, Tihomir Orehovacki, and Darko Etinger. 2018. Perceived user experience and performance of intelligent personal assistants employed in higher education settings. In 2018 41st International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO). IEEE, 0830–0834.
[3] Ana Berdasco, Gustavo López, Ignacio Diaz, Luis Quesada, and Luis A Guerrero. 2019. User Experience Comparison of Intelligent Personal Assistants: Alexa, Google Assistant, Siri and Cortana. In Multidisciplinary Digital Publishing Institute Proceedings, Vol. 31. 51.
[4] Praveen Kumar Bodigutla, Lazaros Polymenakos, and Spyros Matsoukas. 2019. Multi-domain Conversation Quality Evaluation via User Satisfaction Estimation. arXiv preprint arXiv:1911.08567 (2019).
[5] Thomas M Brill, Laura Munoz, and Richard J Miller. 2019. Siri, Alexa, and other digital assistants: a study of customer satisfaction with artificial intelligence applications. Journal of Marketing Management 35, 15-16 (2019), 1401–1436.
[6] Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. 1994. Signature verification using a "siamese" time delay neural network. In Advances in neural information processing systems. 737–744.
[7] Eunah Cho, He Xie, and William M Campbell. 2019. Paraphrase generation for semi-supervised learning in NLU. In Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation. 45–54.
[8] Heriberto Cuayáhuitl, Seonghan Ryu, Donghyeon Lee, and Jihie Kim. 2018. A study on dialogue reward prediction for open-ended conversational agents. arXiv preprint arXiv:1812.00350 (2018).
[9] Allan de Barcelos Silva, Marcio Miguel Gomes, Cristiano André da Costa, Rodrigo da Rosa Righi, Jorge Luis Victoria Barbosa, Gustavo Pessin, Geert De Doncker, and Gustavo Federizzi. 2020. Intelligent Personal Assistants: A Systematic Literature Review. Expert Systems with Applications (2020), 113193.
[10] Jan Deriu, Alvaro Rodrigo, Arantxa Otegi, Guillermo Echegoyen, Sophie Rosset, Eneko Agirre, and Mark Cieliebak. 2019. Survey on evaluation methods for dialogue systems. arXiv preprint arXiv:1905.04071 (2019).
[11] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[12] Layla El Asri, Hatim Khouzaimi, Romain Laroche, and Olivier Pietquin. 2014. Ordinal regression for interaction quality prediction. In Proceedings of ICASSP. https://www.microsoft.com/en-us/research/publication/ordinal-regression-interaction-quality-prediction-2/
[13] Klaus-Peter Engelbrecht, Florian Gödde, Felix Hartard, Hamed Ketabdar, and Sebastian Möller. 2009. Modeling user satisfaction with hidden Markov models. In Proceedings of the SIGDIAL 2009 Conference. 170–177.
[14] Jerome H Friedman. 2001. Greedy function approximation: a gradient boosting machine. Annals of Statistics (2001), 1189–1232.
[15] Jianfeng Gao, Michel Galley, Lihong Li, et al. 2019. Neural approaches to conversational AI. Foundations and Trends in Information Retrieval 13, 2-3 (2019), 127–298.
[16] Ahmed Hassan and Ryen W White. 2013. Personalized models of search satisfaction. In Proceedings of the 22nd ACM international conference on Information & Knowledge Management. 2009–2018.
[17] David Herbert and Byeong Kang. 2019. Comparative Analysis of Intelligent Personal Agent Performance. In Pacific Rim Knowledge Acquisition Workshop. Springer, 127–141.
[18] Ryuichiro Higashinaka, Yasuhiro Minami, Kohji Dohsaka, and Toyomi Meguro. 2010. Issues in predicting user satisfaction transitions in dialogues: Individual differences, evaluation criteria, and prediction models. In International Workshop on Spoken Dialogue Systems Technology. Springer, 48–60.
[19] Katja Hofmann, Lihong Li, Filip Radlinski, et al. 2016. Online evaluation for information retrieval. Foundations and Trends in Information Retrieval 10, 1 (2016), 1–117.
[20] Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS) 20, 4 (2002), 422–446.
[21] Jiepu Jiang, Ahmed Hassan Awadallah, Rosie Jones, Umut Ozertem, Imed Zitouni, Ranjitha Gurunath Kulkarni, and Omar Zia Khan. 2015. Automatic online evaluation of intelligent assistants. In Proceedings of the 24th International Conference on World Wide Web. 506–516.
[22] Julia Kiseleva, Kyle Williams, Ahmed Hassan Awadallah, Aidan C Crook, Imed Zitouni, and Tasos Anastasakos. 2016. Predicting user satisfaction with intelligent assistants. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval. 45–54.
[23] Julia Kiseleva, Kyle Williams, Jiepu Jiang, Ahmed Hassan Awadallah, Aidan C Crook, Imed Zitouni, and Tasos Anastasakos. 2016. Understanding user satisfaction with intelligent assistants. In Proceedings of the 2016 ACM Conference on Human Information Interaction and Retrieval. 121–130.
[24] Dmitry Lagun, Chih-Hung Hsieh, Dale Webster, and Vidhya Navalpakkam. 2014. Towards better measurement of attention and satisfaction in mobile search. In Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval. 113–122.
[25] Tihomir Orehovački, Snježana Babić, and Darko Etinger. 2018. Modelling the perceived pragmatic and hedonic quality of intelligent personal assistants. In International Conference on Intelligent Human Systems Integration. Springer, 589–594.
[26] Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 1532–1543.
[27] Pragaash Ponnusamy, Alireza Roshan Ghias, Chenlei Guo, and Ruhi Sarikaya. 2019. Feedback-Based Self-Learning in Large-Scale Conversational AI Agents. arXiv preprint arXiv:1911.02557 (2019).
[28] Louisa Pragst, Stefan Ultes, and Wolfgang Minker. 2017. Recurrent Neural Network Interaction Quality Estimation. Springer Singapore, Singapore, 381–393. https://doi.org/10.1007/978-981-10-2585-3_31
[29] Niklas Rach, Wolfgang Minker, and Stefan Ultes. 2017. Interaction Quality Estimation Using Long Short-Term Memories. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue. Association for Computational Linguistics, Saarbrücken, Germany, 164–169. https://doi.org/10.18653/v1/W17-5520
[30] Shumpei Sano, Nobuhiro Kaji, and Manabu Sassano. 2016. Prediction of prospective user engagement with intelligent assistants. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 1203–1212.
[31] Shumpei Sano, Nobuhiro Kaji, and Manabu Sassano. 2017. Predicting causes of reformulation in intelligent assistants. arXiv preprint arXiv:1707.03968 (2017).
[32] Ruhi Sarikaya. 2017. The technology behind personal digital assistants: An overview of the system architecture and key components. IEEE Signal Processing Magazine 34, 1 (2017), 67–81.
[33] Alexander Schmitt and Stefan Ultes. 2015. Interaction quality: assessing the quality of ongoing spoken dialog interaction by experts—and how it relates to user satisfaction. Speech Communication 74 (2015), 12–36.
[34] Alexander Schmitt, Stefan Ultes, and Wolfgang Minker. 2012. A Parameterized and Annotated Spoken Dialog Corpus of the CMU Let's Go Bus Information System. In LREC. 3369–3373.
[35] Andreas Stolcke, Klaus Ries, Noah Coccaro, Elizabeth Shriberg, Rebecca Bates, Daniel Jurafsky, Paul Taylor, Rachel Martin, Carol Van Ess-Dykema, and Marie Meteer. 2000. Dialogue act modeling for automatic tagging and recognition of conversational speech. Computational Linguistics 26, 3 (2000), 339–373.
[36] Pei-Hao Su, Milica Gašić, and Steve Young. 2018. Reward estimation for dialogue policy optimisation. Computer Speech & Language 51 (2018), 24–43.
[37] Gokhan Tur and Renato De Mori. 2011. Spoken language understanding: Systems for extracting semantic information from speech. John Wiley & Sons.
[38] Stefan Ultes and Wolfgang Minker. 2014. Interaction Quality Estimation in Spoken Dialogue Systems Using Hybrid-HMMs. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL). Association for Computational Linguistics, Philadelphia, PA, U.S.A., 208–217. https://doi.org/10.3115/v1/W14-4328
[39] Marilyn A Walker, Diane J Litman, Candace A Kamm, and Alicia Abella. 1997. PARADISE: A framework for evaluating spoken dialogue agents. arXiv preprint cmp-lg/9704004 (1997).

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).