Test Collections and Measures for Evaluating Customer-Helpdesk Dialogues

Zhaohao Zeng (Waseda University, Japan) zhaohao@fuji.waseda.jp
Cheng Luo (Tsinghua University, P.R. China) chengluo@tsinghua.edu.cn
Lifeng Shang (Huawei Noah's Ark Lab, HK) shang.lifeng@huawei.com
Hang Li (Toutiao AI Lab, P.R. China) lihang.lh@bytedance.com
Tetsuya Sakai (Waseda University, Japan) tetsuyasakai@acm.org

EVIA 2017, co-located with NTCIR-13, 5 December 2017, Tokyo, Japan. Copying permitted for private and academic purposes. © 2017 Copyright held by the author.

ABSTRACT

We address the problem of evaluating textual, task-oriented dialogues between the customer and the helpdesk, such as those that take the form of online chats. As an initial step towards evaluating automatic helpdesk dialogue systems, we have constructed a test collection comprising 3,700 real Customer-Helpdesk multi-turn dialogues by mining Weibo, a major Chinese social media platform. We have annotated each dialogue with multiple subjective quality annotations and nugget annotations, where a nugget is a minimal sequence of posts by the same utterer that helps towards problem solving. In addition, 10% of the dialogues have been manually translated into English. We have made our test collection DCH-1 publicly available for research purposes. We also propose a simple nugget-based evaluation measure for task-oriented dialogue evaluation, which we call UCH, and explore its usefulness and limitations.

KEYWORDS

dialogues; evaluation; helpdesk; measures; nuggets; test collections

1 INTRODUCTION

Whenever a user of a commercial product or a service encounters a problem, an effective way to solve it would be to contact the helpdesk. Efficient and successful dialogues are desirable both for the customer and the company that sells the product/service. Recent advances in artificial intelligence suggest that, in the not-too-distant future, these human-human Customer-Helpdesk dialogues will be replaced by human-machine ones. In order to build and efficiently tune automatic helpdesk systems, reliable automatic evaluation methods for task-oriented dialogues are required.

Figure 1 shows an example of a Customer-Helpdesk dialogue. It can be observed that it is initiated by Customer's report of a particular problem she is facing, which we call a trigger. This is an example of a successful dialogue, for Helpdesk provides an actual solution to the problem and Customer acknowledges that the problem has been solved. Unlike classical closed-domain task-oriented dialogues, Helpdesk may have to handle diverse requests, which makes it impossible for us to solve the problems with the pre-defined slot filling schemes that are required by many existing evaluation measures for task-oriented dialogues (see Section 2.2).

C: I copied a picture from my PC to my mobile phone, but it kind of looks fuzzy on the phone. How can I solve this? P.S. I'm no good at computers and mobile phones. [Trigger]
H: Please synchronise your PC and phone using iTunes first, and then upload your picture. [Solution]
C: I'd done the synchronisation but did not upload it with XXX Mobile Assistant. I managed to do so by following your advice. You are a real expert, thank you! [Confirmation]
H: You are very welcome. If you have any problems using XXX Mobile Phone Software, please contact us again, or visit XXX.com.

Figure 1: An example of a dialogue between Customer (C) and Helpdesk (H).

In the present study, we address the problem of evaluating textual Customer-Helpdesk dialogues, such as those that take the form of online chats. As an initial step towards evaluating automatic helpdesk dialogue systems, we have constructed a test collection comprising 3,700 real Customer-Helpdesk multi-turn dialogues by mining Weibo (http://www.weibo.com), a major Chinese social media platform. We have annotated each dialogue with subjective quality annotations (task statement, task accomplishment, customer satisfaction, helpdesk appropriateness, customer appropriateness) as well as nugget annotations, where a nugget is a minimal sequence of posts by the same utterer that helps towards problem solving. In addition, 10% of the dialogues have been manually translated into English. We have made our test collection DCH-1 (Dialogues between Customer and Helpdesk) publicly available for research purposes, along with a smaller pilot collection DCH-0, which contains 234 dialogues (http://waseda.box.com/DCH-0-1).

We also propose a simple nugget-based evaluation measure for task-oriented dialogue evaluation, which we call UCH (Utility for Customer and Helpdesk), and explore its usefulness and limitations. We believe that, while subjective dialogue evaluation can evaluate the dialogue as a whole, automatic evaluation methods will eventually require more local pieces of evidence from the dialogue text for close diagnosis. For this reason, we collected both subjective annotations and nugget annotations for each dialogue, in the hope that automatic evaluation measures defined as a function of nuggets will eventually be able to predict subjective scores with reasonable accuracy. Another possible benefit of constructing nuggets is that a set of nuggets collected from a dialogue may also be useful for evaluating a different dialogue that discusses a similar problem.

2 RELATED WORK

2.1 Evaluating Non-Task-Oriented Dialogues

Evaluating generated responses in non-task-oriented dialogues is a difficult problem. Galley et al. [3] proposed Discriminative BLEU, which generalises BLEU [13], a machine translation evaluation measure that compares the system output with multiple reference translations at the n-gram level. Discriminative BLEU introduces positive and negative weights to human references (i.e., gold standard responses) in the computation of n-gram-based precision, which is the primary component of BLEU. Because it is difficult to obtain multiple hand-crafted references for conversational data, they automatically mine candidate responses from a corpus of conversations and then have annotators rate the quality of the candidates. The reference weights reflect the result of the quality annotations.

Higashinaka et al. [5] ran the first Dialogue Breakdown Detection Challenge using Japanese human-machine chat corpora, to evaluate the system's ability to detect the point in a given dialogue where it becomes difficult to continue due to the system's inappropriate response. This effort used 1,146 text chat dialogues for training and another 100 for development and testing. After each system utterance in the dialogue, participating systems were required to provide a diagnosis: "NB" (not a breakdown), "PB" (possible breakdown), or "B" (breakdown). They were also required to submit a probability distribution over the three labels. To define the gold standard data for this task, multiple annotators were hired, so that a gold probability distribution could be constructed for each utterance. By comparing the best gold label with the system's output, accuracy, precision, recall and F-measure were computed. Moreover, by comparing the gold distribution over the three labels with the system's distribution, Jensen-Shannon Divergence and Mean Squared Error were computed. Using a distribution as the gold standard probably reflects the view that there can be multiple acceptable choices within a dialogue, as suggested also by other studies [1, 3]. The third Dialogue Breakdown Detection Challenge workshop will be held as part of the Dialogue System Technology Challenges on December 10, 2017 (http://workshop.colips.org/dstc6/).
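To make the distribution-based comparison concrete, the sketch below (our illustration, not code from the challenge) computes Jensen-Shannon divergence and mean squared error between a gold label distribution and a system-submitted distribution over the three breakdown labels; the toy numbers are invented.

```python
import numpy as np

LABELS = ["NB", "PB", "B"]  # not a breakdown / possible breakdown / breakdown

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def mean_squared_error(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.mean((p - q) ** 2))

# Toy example: gold distribution derived from multiple annotators vs. a system's estimate.
gold = [0.2, 0.5, 0.3]
system = [0.1, 0.6, 0.3]
print(js_divergence(gold, system), mean_squared_error(gold, system))
```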
At NTCIR-12, the first Short Text Conversation (STC) task was run using Weibo data (for the Chinese subtask) and Twitter data (for the Japanese subtask), attracting 22 participating teams [20]. The STC task required participating systems to return a valid comment in response to an input tweet (given without any prior context). Instead of relying on natural language generation, systems were required to search a repository of past tweets and return a ranked list as possible responses. Information retrieval evaluation measures were used to evaluate the participating systems. Gold standard labels were created manually by hiring multiple annotators, who used the following axes to decide on a single graded label (L0, L1 or L2): coherence, topical relevance, context-independence, and non-repetitiveness. The second STC task (STC-2) at NTCIR-13 attracted 22 participating teams for the Chinese subtask, and allowed participants to submit not only retrieved responses but also generated ones [19].

2.2 Evaluating Task-Oriented Dialogues

Two decades ago, Walker et al. [21] proposed the PARADISE (PARAdigm for Dialogue System Evaluation) framework for evaluating task-oriented spoken dialogue systems. The basic idea is to collect a variety of real human-machine dialogues for a specific task (e.g., train timetable lookup) as well as subjective ratings of user satisfaction for each dialogue, and to use task success and cost as explanatory variables so that the user satisfaction measures for new dialogues can be estimated by means of linear regression. PARADISE requires an attribute-value matrix that represents the task: for example, for the train timetable domain, attributes such as "depart-city," "arrival-city" and "depart-time" must be specified in advance. This is in contrast to our helpdesk case because, while it is task-oriented, the required attributes depend on the customer's problem and cannot be listed exhaustively in advance. In this respect, helpdesk dialogues probably lie somewhere in between non-task-oriented dialogues and the slot-filling dialogues that PARADISE deals with.
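As a rough illustration of the regression idea behind PARADISE (our sketch, not the original framework or its data), user satisfaction can be fitted on a task-success indicator and a cost feature such as the number of turns, and then predicted for new dialogues:

```python
import numpy as np

# Toy data: one row per dialogue -> [task_success (0/1), cost (e.g., number of turns)].
features = np.array([[1, 5], [1, 9], [0, 12], [1, 7], [0, 15]], dtype=float)
satisfaction = np.array([4.5, 3.8, 2.0, 4.1, 1.5])  # invented subjective ratings

# Ordinary least squares with an intercept term.
design = np.hstack([np.ones((features.shape[0], 1)), features])
coef, *_ = np.linalg.lstsq(design, satisfaction, rcond=None)

# Estimated satisfaction for a new dialogue: task succeeded, 8 turns.
print(float(np.array([1.0, 1.0, 8.0]) @ coef))
```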
The PARADISE framework was subsequently used in the DARPA COMMUNICATOR Program, which evaluated spoken dialogue systems in the travel planning domain [22]. The effort produced the Communicator 2000 Corpus, consisting of 662 dialogues based on nine different systems, with per-call survey results on dialogue efficiency, dialogue quality, task success and user satisfaction. Here, a new utterance tagging scheme called DATE (Dialogue Act Tagging for Evaluation) was introduced, which enables three orthogonal annotations along the axes of speech-act (e.g., "request-info," "apology"), task-subtask (e.g., "origin," "destination," "date") and conversational-domain ("about-task," "about-communication," or "situation-frame"). Again, unlike our case, their task-subtask annotation scheme needs to be defined in advance.

Lowe et al. [9] released the Ubuntu Dialogue Corpus, which contains 930,000 human-human dialogues extracted from Ubuntu chats. Their effort is more similar to ours than the aforementioned studies on task-oriented dialogue evaluation in that they focus primarily on unstructured dialogues rather than slot-filling. However, while they automatically disentangled the chats to form dyadic dialogues, their original chat logs usually involve more than two parties, which makes their data different from our dyadic Customer-Helpdesk DCH-1 dataset. They formed a response selection test data set by setting aside 2% of the corpus and forming (context, response, flag) triplets based on this set. Here, context is the sequence of utterances that appear prior to the response in the dialogue; response is either the actual correct response from the dialogue or a randomly chosen utterance from outside the dialogue (but within the test set); flag is one for the correct response and zero for incorrect responses. For each correct response, they generated nine additional triplets containing different incorrect responses. Thus, response selection systems are given a context and ten choices of responses, and are required to select one or more responses. They use recall at k as the evaluation measure, where k is the size of the set of responses selected by the system, and therefore "recall at 1" reduces to accuracy. Note that this evaluation setting does not require annotations for defining the gold standard. They do not consider ranked lists of responses as is done at STC.
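For concreteness, recall at k in this one-correct-in-ten setting can be computed as in the sketch below (our own illustration with invented candidate IDs); with k = 1 it reduces to accuracy.

```python
def recall_at_k(selections, gold):
    """selections: for each context, the set of k responses the system selected;
    gold: the correct response for each context."""
    hits = sum(1 for chosen, correct in zip(selections, gold) if correct in chosen)
    return hits / len(gold)

# Two contexts; the system selects k = 2 of the ten candidate responses for each.
selections = [{"r3", "r7"}, {"r2", "r9"}]
gold = ["r3", "r1"]
print(recall_at_k(selections, gold))  # 0.5; with k = 1 this is simply accuracy
```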
The most straightforward approach to evaluating dialogues is to collect subjective assessments from the user who actually experienced the dialogue. Hone and Graham [6] used a large questionnaire to evaluate an in-car speech interface and, by means of factor analysis, identified system response accuracy, likeability, cognitive demand, annoyance, habitability and speed as the key factors in subjective evaluation; their approach is known as SASSI (Subjective Assessment of Speech System Interfaces). Hartikainen et al. [4] applied a service quality assessment method from marketing to the evaluation of a telephone-based email application; their method is known as SERVQUAL. Paek [12] discusses SASSI, SERVQUAL and PARADISE in a survey of spoken dialogue evaluation, along with his Wizard-of-Oz approach of using human performance to replace a system component in order to define a gold standard.

2.3 Evaluating Textual Information Access

While the aforementioned BLEU [13] is basically equivalent to an n-gram-based precision, ROUGE [7], a BLEU-inspired measure designed for text summarisation evaluation, is basically a suite of measures including n-gram-based (or skip-gram-based) recall and F-measure. Just as BLEU requires multiple reference translations, ROUGE requires multiple reference summaries. Note that the basic units of comparison, namely n-grams and the like, are automatically extracted from both the references and the system output.

In contrast to the above automatically extracted units of comparison, manually devised nuggets have been used in both summarisation evaluation [11] and question answering evaluation. In the TREC Question Answering (QA) tracks, a nugget is defined as "a fact for which the annotator could make a binary decision as to whether a response contained that nugget" [8]. Having constructed nuggets, (weighted) recall, precision and F-measure scores can be computed, except that the precision computation requires special handling: while one can count the number of nuggets present or missing in the system output, one cannot count the number of "non-nuggets" (i.e., irrelevant pieces of information) in the same output, since "non-nuggets" are never defined. Hence, nugget precision, which is supposed to quantify the amount of irrelevant information in the output, cannot be defined. To work around this problem, a fixed-length "allowance" was introduced at the TREC QA tracks so that nugget precision could be defined based solely on the system output length. The TREC QA tracks also used a measure called POURPRE, which replaces the manual nugget matching step with automatic nugget matching based on unigrams. The NTCIR ACLIA (Advanced Cross-lingual Information Access) Task adapted these methods for evaluating QA with Asian languages [10].

As was discussed above, traditional evaluation measures for summarisation and question answering employ variants of recall, precision and F-measure based on small textual units. Hence, they regard the system output as a set of n-grams, nuggets, and so on. In contrast, Sakai, Kato and Song [18] introduced a nugget-based evaluation measure called S-measure for evaluating textual summaries for mobile search, which incorporates a decay factor for nugget weights based on nugget positions. Just as evaluation measures for ranked retrieval define a decay function over document ranks, S-measure defines a linear decay function over the text, using the offset positions of the nuggets. This reflects the view that important nuggets should be presented first and that we should minimise the amount of text that the user has to read. Sakai and Kato [17] complement S-measure with a precision-like measure called T-measure, which, unlike the aforementioned allowance-based precision used at the TREC QA tracks, takes into account the fact that different pieces of information require different textual lengths. They define an "iUnit" (information unit) as "an atomic piece of information that stands alone and is useful to the user."

Sakai and Dou [16] generalised the idea of S-measure to handle various textual information access tasks, including web search. Their measure, known as U-measure, constructs a string called a trailtext, which is a concatenation of all the texts that the user has read (obtained by observation or by assuming a user model). Then, over the trailtext, a linear decay function is defined (see Section 4).

3 DESIGNING AND BUILDING DCH-1

3.1 Overview

Our ultimate goal is automatic evaluation of human-machine Customer-Helpdesk dialogues. As a first step towards it, we built two test collections based on real (i.e., human-human) Customer-Helpdesk dialogues, which we call DCH-0 and DCH-1.

DCH-0, our smaller collection, was used to establish an efficient and reliable test collection construction procedure. For example, although we started constructing DCH-0 by using the number of posts in each dialogue for sampling dialogues of different lengths, where a post refers to a piece of timestamped text entered by either Customer or Helpdesk, we quickly realised that posts are often a mere artifact of the Weibo users' arbitrary hits of the ENTER key, and that they are not suitable as the basic semantic unit. Based on this experience, we used the utterance block, formed by merging all consecutive posts by the same utterer, as the basis for measuring the length of a dialogue in DCH-1.

Table 1 provides some statistics of DCH-0 and DCH-1. As shown in the table, 184 of the 3,700 DCH-1 dialogues are "triggerless," by which we mean that Customer and Helpdesk exchange remarks even though Customer does not seem to be facing any problem (cf. Figure 1). (We tried filtering out these triggerless dialogues for the analyses reported in Section 5, but the effect of this on our results was not substantial.) Below, we discuss the construction and validation of DCH-1.

Table 1: Test collection statistics. *Only 40 dialogues from DCH-0 were annotated with nuggets.

                                        DCH-0            DCH-1
Source                                  www.weibo.com
Language                                Chinese
Data timestamps                         Jan. 2013 - Sep. 2016
#Dialogues                              234              3,700
#English translations                   40               370
#Helpdesk accounts                      16               161
Avg. #posts/dialogue                    13.402           4.512
Avg. #utterance blocks/dialogue         12.021           4.162
Avg. post length (#chars)               35.011           44.568
Avg. utterance block length (#chars)    39.031           48.313
#Annotators/dialogue                    2                3
Subjective annotation criteria          TS, TA, CS, HA, CA (see Section 3.4)
Nugget types                            CNUG0, CNUG, HNUG, CNUG*, HNUG* (see Section 3.5)
Triggerless dialogues                   1*               184
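To illustrate how the utterance blocks used above are obtained from raw posts (our sketch; the post representation is hypothetical), consecutive posts by the same utterer are simply concatenated:

```python
def to_utterance_blocks(posts):
    """Merge consecutive posts by the same utterer ('C' or 'H') into utterance blocks.

    posts: list of (utterer, text) pairs in dialogue order.
    Returns a list of (utterer, merged_text) pairs.
    """
    blocks = []
    for utterer, text in posts:
        if blocks and blocks[-1][0] == utterer:
            blocks[-1] = (utterer, blocks[-1][1] + " " + text)  # extend the current block
        else:
            blocks.append((utterer, text))
    return blocks

# A Customer who hits ENTER twice produces two posts but only one utterance block.
posts = [("C", "My photo looks fuzzy on the phone."),
         ("C", "How can I fix this?"),
         ("H", "Please synchronise your PC and phone first.")]
print(to_utterance_blocks(posts))  # 3 posts -> 2 utterance blocks
```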
3.2 Dialogue Mining

The 3,700 Helpdesk dialogues contained in the DCH-1 test collection were mined from Weibo in September 2016 as follows. (1) We collected an initial set of Weibo accounts by searching for Weibo account names that contained keywords such as "assistant" and "helper" (in Chinese). We denote this set by A0. (2) For each account name a in A0, we added the prefix "@" to a and used the resulting string as a query for retrieving up to 40 conversational threads (i.e., an initial post plus the comments on it) that contain a mention of the official account. (Weibo's interface for conversational threads is somewhat different from Twitter's: comments to a post are not displayed on the main timeline; they are displayed under each post only if the "comments" button is clicked.) We then filtered out accounts that did not respond to over one half of these threads. We denote the filtered set of "active" accounts as A. (3) For each account a in A, we retrieved all threads that contain a mention of a from January 2013 to September 2016, and extracted Customer-Helpdesk dyadic dialogues from them. We then kept those that consist of at least one utterance block by Customer and one by Helpdesk. As a result, 21,669 dialogues were obtained. This collection is denoted as D0. (4) As D0 is too large for annotation, we sampled 3,700 dialogues from it as follows. For i = 2, 3, ..., 6, we randomly sampled 700 dialogues that contained i utterance blocks. In addition, we randomly sampled 200 that contained i = 7 utterance blocks; we could not sample 700 dialogues for i = 7, as D0 did not contain enough dialogues that are very long.

10% (370) of the Chinese dialogues in DCH-1 were manually translated into English by a professional translation company for research purposes.
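The length-stratified sampling in step (4) above amounts to the following (a sketch that assumes D0 is available as a list of dialogue records with a precomputed utterance-block count; the field name is hypothetical):

```python
import random

def sample_dch1(d0, seed=0):
    """Sample 700 dialogues for each utterance-block count i = 2..6 and 200 for i = 7."""
    rng = random.Random(seed)
    quotas = {2: 700, 3: 700, 4: 700, 5: 700, 6: 700, 7: 200}
    sampled = []
    for i, quota in quotas.items():
        pool = [d for d in d0 if d["n_utterance_blocks"] == i]
        sampled.extend(rng.sample(pool, quota))
    return sampled  # 5 * 700 + 200 = 3,700 dialogues in total
```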
3.3 Annotators

We hired 16 Chinese undergraduate students from the Faculty of Science and Engineering at Waseda University, so that each Chinese dialogue was annotated independently by three annotators. The assignment of dialogues to annotators was randomised. Given a dialogue, each annotator first read the entire dialogue carefully, then gave it ratings according to the five subjective annotation criteria described in Section 3.4, and finally identified nuggets within the same dialogue, where nuggets were defined as described in Section 3.5. An initial face-to-face instruction and training session for the annotators was organised by the first author of this paper at Waseda University; subsequently, the annotators were allowed to do their annotation work online, using a web-browser-based tool, at a location and time convenient to them. The number of dialogues assigned to each annotator was 3,700 * 3 / 16 = 693.75 on average; all of them completed their work within two weeks, as they were initially asked to do. The actual annotation time spent by each annotator was 18-20 hours.

3.4 Subjective Annotation

By subjective annotation, we mean manual quantification of the quality of a dialogue as a whole. As there are two players involved in a Customer-Helpdesk dialogue, we wanted to accommodate the following two viewpoints:

Customer's viewpoint: Does Helpdesk solve Customer's problem efficiently? Customer may want a solution quickly while providing minimal information to Helpdesk.
Helpdesk's viewpoint: Does Customer provide accurate and sufficient information so that Helpdesk can provide the right solution? Helpdesk also wants to solve Customer's problem through minimal interactions, as these interactions translate directly into cost for the company.

Moreover, we wanted to assess customer satisfaction, as this is of utmost importance for both parties. While customer satisfaction ratings should ideally be collected from the real customer at the time of dialogue termination, we had no choice but to collect surrogate, post-hoc ratings by the annotators instead.

By considering the above points as well as our results from the smaller DCH-0 collection, we finally devised the following five subjective annotation criteria:

Task Statement: Whether the task (i.e., the problem to be solved) is clearly stated by Customer (denoted by TS);
Task Accomplishment: Whether the task is actually accomplished (denoted by TA);
Customer Satisfaction: Whether Customer is likely to have been satisfied with the dialogue, and to what degree (denoted by CS);
Helpdesk Appropriateness: Whether Helpdesk provided appropriate information (denoted by HA);
Customer Appropriateness: Whether Customer provided appropriate information (denoted by CA).

Figure 2 shows the actual instructions for the annotators: note that CS is on a five-point scale (-2 to 2), while the other four criteria are on a three-point scale (-1 to 1).

Figure 2: Subjective annotation criteria.

Table 2 shows the inter-rater agreement (for three assessors) of the subjective labels in terms of Fleiss' κ [2] and Randolph's κ_free [14]; κ_free is known to be more suitable when the labels are heavily skewed across the categories, which is indeed the case here. "2+ agree" means the proportion of dialogues for which at least two annotators agree, e.g., (-1, -1); "3 agree" means the proportion of dialogues for which all three annotators agree, e.g., (-1, -1, -1).

Table 2: Inter-annotator agreement of the subjective annotations for DCH-1 (3,700 dialogues, 3 annotators per dialogue). Note that Fleiss' κ and Randolph's κ_free treat the ratings as nominal categories. 2+ agree means the proportion of dialogues for which at least two annotators agree; 3 agree means the proportion of dialogues for which all three annotators agree. For CS, 2 and 1 were treated as 1, and -2 and -1 were treated as -1.

      2+ agree   3 agree   Fleiss' κ   κ_free
TS    .981       .729      .301        .719
TA    .925       .361      .273        .324
CS    .938       .349      .276        .318
HA    .873       .309      .197        .245
CA    .857       .288      .141        .216

It can be observed that the agreement among the three assessors is low, except perhaps for TS, which reflects the highly subjective nature of this labelling task. While it may be possible to improve the inter-assessor agreement a little in our future work by revising the labelling instructions, it should be stressed that our labelling task is not document relevance assessment, and that it is inherently highly subjective. We believe that, as our future work, hiring more than three assessors and preserving their different viewpoints in the test collection is more important than trying to force them into reaching an agreement.
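For reference, the two agreement coefficients can be computed as in the sketch below (our illustration of the standard formulas, not the scripts used for Table 2). Rows are dialogues, columns are rating categories, and each cell counts how many of the three annotators chose that category.

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa. counts: (n_items, n_categories); each row sums to the number of raters."""
    counts = np.asarray(counts, dtype=float)
    n_items = counts.shape[0]
    n_raters = counts[0].sum()
    p_j = counts.sum(axis=0) / (n_items * n_raters)                       # category proportions
    p_i = (np.sum(counts ** 2, axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar, p_e = p_i.mean(), float(np.sum(p_j ** 2))
    return (p_bar - p_e) / (1 - p_e)

def randolph_kappa_free(counts):
    """Randolph's free-marginal kappa: chance agreement is fixed at 1/k for k categories."""
    counts = np.asarray(counts, dtype=float)
    n_raters = counts[0].sum()
    k = counts.shape[1]
    p_i = (np.sum(counts ** 2, axis=1) - n_raters) / (n_raters * (n_raters - 1))
    return (p_i.mean() - 1.0 / k) / (1.0 - 1.0 / k)

# Toy example: 4 dialogues, 3 annotators, a criterion rated on {-1, 0, 1} (the three columns).
counts = np.array([[0, 0, 3], [0, 1, 2], [3, 0, 0], [0, 0, 3]])
print(fleiss_kappa(counts), randolph_kappa_free(counts))
```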
3.5 Nugget Annotation

We had three annotators independently identify nuggets for each dialogue as follows. At the instruction and training session, the annotators were given the diagram shown in Figure 3, which reflects our view that accumulating nuggets will eventually solve Customer's problem, together with a written definition of nuggets, as described below. (1) A nugget is a post, or a sequence of consecutive posts, by the same utterer (i.e., either Customer or Helpdesk). (2) It can neither partially nor wholly overlap with another nugget. (3) It should be minimal: that is, it should not contain irrelevant posts at the start, the end or in the middle. An irrelevant post is one that does not contribute to the Customer transition (see Figure 3). (4) It helps Customer transition from the Current State (including the Initial State) towards the Target State (i.e., when the problem is solved).

Figure 3: Task accomplishment as state transitions, and the role of a nugget. (The diagram shows different paths that lead from Customer's current state to the target state: Customer's initial state (facing a problem), intermediate states where the problem is not quite solved yet but Customer is a little closer to the target state, and Customer's target state (problem solved). A nugget contributes such a transition; Helpdesk-Customer interactions that do not directly lead Customer to an intermediate state or the target state do not.)

Note that we utilise Weibo posts as the atomic building blocks for forming nuggets. This takes into account the remark by Wang et al. [23]: "Experience from question answering evaluations has shown that users disagree about the granularity of nuggets—for example, whether a piece of text encodes one or more nuggets and how to treat partial semantic overlap between two pieces of text." Note also that, according to our definition, an utterance block (i.e., maximal consecutive posts by the same utterer) generally subsumes one or more nuggets.

Compared to the traditional nugget-based information access evaluation discussed in Section 2.3, there are two unique features in nugget-based helpdesk dialogue evaluation: (1) a dialogue involves two parties, Customer and Helpdesk; and (2) even within the same utterer, nuggets are not homogeneous, by which we mean that some nuggets may play special roles. In particular, since the dialogues we consider are task-oriented (but not closed-domain, which makes slot-filling approaches infeasible), there must be some nuggets that represent the state of identifying the task and some that represent the state of accomplishing it.

Based on the above considerations, we defined the following mutually exclusive nugget types:

CNUG0: Customer's trigger nuggets. These are nuggets that define Customer's initial problem, which directly caused Customer to contact Helpdesk.
HNUG: Helpdesk's regular nuggets. These are nuggets in Helpdesk's utterances that are useful from Customer's point of view.
CNUG: Customer's regular nuggets. These are nuggets in Customer's utterances that are useful from Helpdesk's point of view.
HNUG*: Helpdesk's goal nuggets. These are nuggets in Helpdesk's utterances which provide Customer with a solution to the problem.
CNUG*: Customer's goal nuggets. These are nuggets in Customer's utterances which tell Helpdesk that Customer's problem has been solved.

Each nugget type may or may not be present in a dialogue, and multiple nuggets of the same type may be present in a dialogue.

Using a pull-down menu on our web-browser-based tool, the assessors categorised each post into CNUG0, CNUG, HNUG, CNUG*, HNUG*, or NAN (not a nugget). Then, consecutive posts with the same label (e.g., CNUG followed by CNUG) were automatically merged to form a nugget.
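The merging step can be sketched as follows (our illustration; the labelled Figure 1 posts at the end are one hypothetical annotator's choices):

```python
def posts_to_nuggets(labelled_posts):
    """Merge consecutive posts that share the same nugget label into nuggets.

    labelled_posts: (label, text) pairs in dialogue order, where label is one of
    'CNUG0', 'CNUG', 'HNUG', 'CNUG*', 'HNUG*' or 'NAN' (not a nugget).
    Returns (label, merged_text) nuggets; NAN posts are dropped.
    """
    nuggets, prev_label = [], None
    for label, text in labelled_posts:
        if label != "NAN":
            if label == prev_label:
                nuggets[-1] = (label, nuggets[-1][1] + " " + text)  # extend the previous nugget
            else:
                nuggets.append((label, text))
        prev_label = label
    return nuggets

# The Figure 1 dialogue, post by post, as one annotator might label it:
labelled = [("CNUG0", "I copied a picture from my PC to my mobile phone, but it looks fuzzy."),
            ("HNUG*", "Please synchronise your PC and phone using iTunes first."),
            ("CNUG*", "I managed to do so by following your advice. Thank you!"),
            ("NAN", "You are very welcome.")]
print(posts_to_nuggets(labelled))  # three single-post nuggets; the last post is not a nugget
```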
Table 3 shows the inter-annotator agreement of the nugget annotations, where the posts are used as the basis for comparison. The 3,700 dialogues in DCH-1 contain a total of 7,155 Helpdesk posts, all of which were annotated independently by three annotators, producing a total of 21,465 annotations. A direct comparison with the subjective annotation agreement shown in Table 2 would be difficult, since both the annotation units (dialogues vs. nuggets) and the annotation schemes (numerical ratings vs. nugget types) are different. However, it can be observed that the agreement for Customer nuggets is substantially higher than that for Helpdesk nuggets. A possible explanation for this would be that it is easier for annotators to judge the contribution of Customer's utterances for reaching his/her target state than to judge that of Helpdesk, at least for regular nuggets: while Helpdesk often asks Customer for more information regarding the problem context, it is Customer's utterances that actually provide that information.

Table 3: Inter-annotator agreement of the nugget annotations for DCH-1 (3,700 dialogues, 3 annotators per dialogue). 2+ agree means the proportion of posts for which at least two annotators agree; 3 agree means the proportion of posts for which all three annotators agree. NAN means "not a nugget." 95% CIs for κ are also shown.

                                         2+ agree   3 agree   Fleiss' κ           κ_free
Helpdesk posts (HNUG/HNUG*/NAN)          .907       .299      .174 [.165, .184]   .253
Customer posts (CNUG0/CNUG/CNUG*/NAN)    .959       .491      .488 [.481, .496]   .529

While directly comparing the inter-annotator agreement of the subjective annotations and the nugget annotations seems difficult, we would like to compare their intra-annotator consistency by making each annotator process the same dialogue multiple times in our future work.

4 UCH: A DIALOGUE EVALUATION MEASURE

We now propose an evaluation measure that leverages nuggets for quantifying the quality of Customer-Helpdesk dialogues. We regard a Customer-Helpdesk dialogue as a trailtext in the sense of U-measure, which may or may not contain nuggets. Let pos denote the position (i.e., offset from the beginning of the dialogue) of a nugget; for ideographic languages such as Chinese and Japanese, we use the number of characters to define the offset position. Given a patience parameter L, we define a decay function over the trailtext as [16]:

  D(pos) = max(0, 1 - pos/L) .    (1)

This is for discounting the value of a nugget that appears later in the dialogue; at position L, the value of any nugget wears out completely. In our experiments, we let L = L_max = 916, as this is the number of (Chinese) characters in the longest dialogue from the DCH-1 collection. The benefit of introducing L is discussed in Section 5.2.

Let N and M denote the number of Customer's non-goal nuggets and Helpdesk's non-goal nuggets identified within a dialogue, respectively; for simplicity, let us assume that there is at most one Customer's goal nugget (c*) and at most one Helpdesk's goal nugget (h*) in a dialogue. Let {c_1, ..., c_N, c*} denote the set of nuggets from Customer's posts, and let {h_1, ..., h_M, h*} denote that from Helpdesk's posts. Let pos(c_i) (i ∈ {1, ..., N, *}) be the position of nugget c_i; pos(h_j) (j ∈ {1, ..., M, *}) is defined similarly. Given the gain value g(c_i) of each non-goal nugget, a simple evaluation measure based solely on Customer's utterances can be computed as:

  UC = Σ_{c_i ∈ {c_1, ..., c_N, c*}} g(c_i) D(pos(c_i)) .    (2)

In the present study, we define the gain value of CNUG* as g(c*) = 1 + Σ_{i=1}^{N} g(c_i). This is an attempt at reflecting the view that task accomplishment is what matters most. To be more specific, when the discounting function is ignored and dialogues are regarded as sets of nuggets, having only the goal nugget is better than having all the regular nuggets. Similarly, given the gain value g(h_j) of each non-goal nugget, a measure based solely on Helpdesk's utterances can be computed as:

  UH = Σ_{h_j ∈ {h_1, ..., h_M, h*}} g(h_j) D(pos(h_j)) ,    (3)

where g(h*) = 1 + Σ_{j=1}^{M} g(h_j). Finally, for a given parameter α (0 ≤ α ≤ 1) that specifies the contribution of Helpdesk's utterances relative to Customer's, we can define the following combined measure:

  UCH_α = (1 - α) UC + α UH .    (4)

By default, we use α = 0.5. Note that UCH_0.5 is equivalent to computing a single U-measure score without distinguishing between Customer's and Helpdesk's nuggets. The choice of α is discussed in Section 5.3.

Since we have three independent nugget annotations per dialogue, we tried two approaches to computing a single score for a given dialogue: Average UCH (AUCH) simply computes a UCH score for each annotator and then takes the average for that dialogue; Consolidated UCH (CUCH) merges the nuggets from the multiple annotators first and then computes a single UCH score. We only report results with AUCH, which consistently outperformed CUCH in our experiments.
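To make Eqs. (1)-(4) and AUCH concrete, here is a minimal Python sketch (our illustration, not released code). Nugget positions are character offsets from the beginning of the dialogue, and the gain values of regular nuggets are assumed to be given; the goal-nugget gain is derived from them as defined above.

```python
L_MAX = 916  # number of characters in the longest DCH-1 dialogue

def decay(pos, L=L_MAX):
    """Eq. (1): linear, position-based discount over the trailtext."""
    return max(0.0, 1.0 - pos / L)

def utility(regular, goal_pos=None, L=L_MAX):
    """Eqs. (2)/(3): discounted gain sum for one utterer.

    regular:  list of (pos, gain) pairs for the utterer's non-goal nuggets.
    goal_pos: position of the goal nugget (CNUG*/HNUG*) if present; its gain is
              1 + the sum of the regular gains.
    """
    total = sum(gain * decay(pos, L) for pos, gain in regular)
    if goal_pos is not None:
        goal_gain = 1.0 + sum(gain for _, gain in regular)
        total += goal_gain * decay(goal_pos, L)
    return total

def uch(cust_regular, cust_goal_pos, help_regular, help_goal_pos, alpha=0.5, L=L_MAX):
    """Eq. (4): UCH_alpha = (1 - alpha) * UC + alpha * UH."""
    uc = utility(cust_regular, cust_goal_pos, L)
    uh = utility(help_regular, help_goal_pos, L)
    return (1.0 - alpha) * uc + alpha * uh

def auch(per_annotator_scores):
    """Average UCH (AUCH): the mean of the per-annotator UCH scores for one dialogue."""
    return sum(per_annotator_scores) / len(per_annotator_scores)

# Toy dialogue (regular gains set to 1 for illustration):
score = uch(cust_regular=[(0, 1.0)], cust_goal_pos=300,
            help_regular=[(120, 1.0)], help_goal_pos=250)
print(score, auch([score, 0.9 * score, 1.1 * score]))
```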
5 ANALYSIS WITH UCH

This section addresses the following questions: How does UCH correlate with subjective ratings? (Section 5.1); Is the patience parameter L useful for estimating subjective ratings? (Section 5.2); and Which utterer plays the major role when estimating subjective ratings with UCH? (Section 5.3).

In the analyses reported below, we use the z-score of each subjective rating before averaging the ratings over the three annotators. That is, for each annotator and each subjective criterion, we first compute the mean and standard deviation of the raw ratings, and then process each raw rating by subtracting the mean and dividing by the standard deviation. This is to remove each annotator's inherent scoring tendency.
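The per-annotator standardisation, and the Kendall's τ computation used in Section 5.1, can be sketched as follows (our illustration with toy numbers; it uses scipy, and the paper does not state which implementation or confidence-interval method was actually used):

```python
import numpy as np
from scipy.stats import kendalltau

def zscore_per_annotator(ratings):
    """ratings: (n_annotators, n_dialogues) raw ratings for one criterion.
    Standardises each annotator's ratings to remove his/her scoring tendency."""
    ratings = np.asarray(ratings, dtype=float)
    mean = ratings.mean(axis=1, keepdims=True)
    std = ratings.std(axis=1, keepdims=True)
    return (ratings - mean) / std

# Toy data: 3 annotators x 5 dialogues of CS ratings, and AUCH scores for the same dialogues.
cs_raw = np.array([[1, 2, -1, 0, 2],
                   [0, 1, -2, -1, 1],
                   [2, 2,  0, 1, 2]])
avg_cs = zscore_per_annotator(cs_raw).mean(axis=0)   # average z-scored CS per dialogue
auch_scores = np.array([0.8, 1.4, 0.1, 0.3, 1.2])
tau, p_value = kendalltau(auch_scores, avg_cs)
print(tau, p_value)
```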
5.1 Correlation with Subjective Annotations

Table 4 shows the Kendall's τ values between AUCH and the average subjective ratings for the DCH-1 collection, with 95% confidence intervals. It can be observed that AUCH is reasonably highly correlated with HA (.414, 95% CI [.398, .432]) and CA (.434, 95% CI [.417, .450]). That is, even though the inter-annotator agreement for the appropriateness criteria is relatively low (Table 2), AUCH manages to estimate the average appropriateness with reasonable accuracy. On the other hand, the table shows that the τ between AUCH and CS is very low, albeit statistically significant (.118, 95% CI [.097, .141]). One possible explanation for this might be that the CS ratings themselves are not as reliable as we would have liked. First, as we have discussed in Section 3.4, the annotators are not the actual customers; second, our manual inspection of some of the dialogues from DCH-0 and DCH-1 suggests that an annotator's ratings may be influenced by his/her prior impression of the product/service or the company, rather than by the contents of the particular dialogue in question.

Table 4: Kendall's τ between AUCH and average subjective ratings for DCH-1 (3,700 dialogues), with 95% CIs.

      AUCH
TS    .267 [.237, .277]
TA    .256 [.244, .289]
CS    .118 [.097, .141]
HA    .414 [.398, .432]
CA    .434 [.417, .450]

5.2 The Patience Parameter L

As was explained in Section 4, UCH inherits the patience parameter L from S-measure [18] and U-measure [16], which discounts the value of a nugget based on its position within the dialogue. As mentioned earlier, we let L = L_max = 916 by default, as this is the length of the longest dialogue within DCH-1. Using a small L means that the decay function becomes steep and that we do not tolerate long dialogues; using an extremely large L is equivalent to switching off the decay function, thereby treating the dialogue as a set of nuggets (see Eq. 1).

Figure 4 shows the effect of L on the τ between average CS and AUCH. It can be observed that, at least for DCH-1, L = L_max/4 = 229 seems to be a good choice if AUCH is to be used for estimating customer satisfaction. This suggests that user satisfaction may be linked to user patience, and that considering nugget positions as UCH does is of some use. However, as was discussed earlier, the reliability of the CS ratings deserves a closer investigation in our future work.

Figure 4: Effect of L on the τ between average customer satisfaction and AUCH. (The x-axis varies L over values ranging from L_max/16 to ∞; the y-axis shows Kendall's τ between 0.00 and 0.20.)

5.3 The Contribution Parameter α

As Eq. 4 shows, UCH can strike a balance between Customer's utterances and Helpdesk's; a small α means that we rely more on Customer nuggets for computing UCH. Figure 5 shows the effect of α on the τ between AUCH and the different average subjective ratings. The trends are the same for TS, TA, CS, and CA: the smaller the α, the higher the rank correlation. That is, to achieve the highest τ, it is best to rely entirely on Customer utterances, i.e., to completely ignore Helpdesk utterances.

Interestingly, however, the trend is different for HA: the curve for HA suggests that α = 0.5, our default value, is in fact the best choice. That is, to achieve the highest τ with Helpdesk Appropriateness, treating Customer's and Helpdesk's nuggets equally appears to be a good choice. While it is obvious that Helpdesk's utterances need to be taken into account in order to estimate Helpdesk Appropriateness, the curve implies that Customer's utterances also play an important part in the estimation. These results suggest that different subjective annotation criteria require different balances between Customer's and Helpdesk's utterances.

Figure 5: Effect of α on the τ between average subjective ratings and AUCH. (One curve per criterion: TS, TA, CS, HA and CA; the x-axis varies α from 0.0 to 1.0; the y-axis shows Kendall's τ between 0.0 and about 0.6.)
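The analyses of Sections 5.2 and 5.3 amount to recomputing AUCH under different L or α values and re-measuring Kendall's τ. Because UCH_α is a linear combination of UC and UH, the α sweep only needs the per-dialogue UC and UH scores, as in this sketch (our illustration with invented numbers):

```python
import numpy as np
from scipy.stats import kendalltau

def alpha_sweep(uc, uh, target, alphas=np.linspace(0.0, 1.0, 11)):
    """uc, uh: per-dialogue UC and UH scores (e.g., averaged over annotators);
    target: per-dialogue average subjective rating (z-scored).
    Returns Kendall's tau for each alpha, mirroring Figure 5; sweeping L instead
    would require recomputing UC and UH with the new decay function."""
    uc, uh, target = (np.asarray(x, dtype=float) for x in (uc, uh, target))
    return {round(float(a), 2): kendalltau((1 - a) * uc + a * uh, target)[0] for a in alphas}

# Toy example with five dialogues:
uc = [1.2, 0.4, 0.9, 0.2, 1.5]
uh = [0.3, 0.8, 0.5, 0.1, 0.9]
avg_ca = [1.1, -0.2, 0.6, -0.9, 1.3]
print(alpha_sweep(uc, uh, avg_ca))
```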
6 CONCLUSIONS

As an initial step towards evaluating automatic dialogue systems, we constructed DCH-1, which contains 3,700 real Customer-Helpdesk multi-turn dialogues mined from Weibo. We have annotated each dialogue with subjective quality annotations (TS, TA, CS, HA, and CA) and nugget annotations, with three annotators per dialogue. In addition, 10% of the dialogues have been manually translated into English. We described how we constructed the test collection and the philosophy behind it. We also proposed UCH, a simple nugget-based evaluation measure for task-oriented dialogue evaluation, and explored its usefulness and limitations. Our main findings on UCH based on the DCH-1 collection are as follows.

(1) UCH correlates better with subjective ratings that reflect the appropriateness of utterances (HA and CA) than with customer satisfaction (CS);
(2) The patience parameter L of UCH, which considers the positions of nuggets within a dialogue, may be a useful feature for enhancing the correlation with customer satisfaction;
(3) For the majority of our subjective annotation criteria, customer utterances seem to play a much more important role than helpdesk utterances do in enabling UCH to achieve high correlations with subjective ratings, according to our analysis of the parameter α.

Our future work includes the following:

• Comparing the subjective annotations and the nugget annotations in terms of intra-annotator agreement;
• Investigating the reliability of offline customer satisfaction ratings by comparing them with real customer ratings collected right after the termination of a helpdesk dialogue;
• Collecting subjective and nugget annotations for the English subcollection of DCH-1, and comparing across Chinese and English;
• Devising ways for automatic nugget identification and automatic categorisation of nuggets into different nugget types.

The NTCIR-14 Short Text Conversation task (STC-3) will feature a new subtask that is based on the present study: given a dialogue, participating systems are required to estimate the distribution of subjective scores such as user satisfaction over multiple annotators, as well as the distribution of nugget types (e.g., trigger, regular, goal, not-a-nugget) over multiple assessors for each utterance [15].

REFERENCES

[1] DeVault, D., Leuski, A. and Sagae, K.: Toward Learning and Evaluation of Dialogue Policies with Text Examples, Proceedings of SIGDIAL 2011, pp. 39-48 (2011).
[2] Fleiss, J. L.: Measuring Nominal Scale Agreement among Many Raters, Psychological Bulletin, Vol. 76, No. 5, pp. 378-382 (1971).
[3] Galley, M., Brockett, C., Sordoni, A., Ji, Y., Auli, M., Quirk, C., Mitchell, M., Gao, J. and Dolan, B.: ∆BLEU: A Discriminative Metric for Generation Tasks with Intrinsically Diverse Targets, Proceedings of ACL 2015, pp. 445-450 (2015).
[4] Hartikainen, M., Salonen, E.-P. and Turunen, M.: Subjective Evaluation of Spoken Dialogue Systems Using SERVQUAL Method, Proceedings of INTERSPEECH 2004-ICSLP (2004).
[5] Higashinaka, R., Funakoshi, K., Kobayashi, Y. and Inaba, M.: The Dialogue Breakdown Detection Challenge: Task Description, Datasets, and Evaluation Metrics, Proceedings of LREC 2016 (2016).
[6] Hone, K. S. and Graham, R.: Towards a Tool for the Subjective Assessment of Speech System Interfaces (SASSI), Natural Language Engineering, Vol. 6, No. 3-4, pp. 287-303 (2000).
[7] Lin, C.-Y.: ROUGE: A Package for Automatic Evaluation of Summaries, Proceedings of the Workshop on Text Summarization Branches Out, pp. 74-81 (2004).
[8] Lin, J. and Demner-Fushman, D.: Will Pyramids Built of Nuggets Topple Over?, Proceedings of HLT/NAACL 2006, pp. 383-390 (2006).
[9] Lowe, R., Pow, N., Serban, I. V. and Pineau, J.: The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems, Proceedings of SIGDIAL 2015, pp. 285-294 (2015).
[10] Mitamura, T., Shima, H., Sakai, T., Kando, N., Mori, T., Takeda, K., Lin, C.-Y., Song, R., Lin, C.-J. and Lee, C.-W.: Overview of the NTCIR-8 ACLIA Tasks: Advanced Cross-Lingual Information Access, Proceedings of NTCIR-8, pp. 15-24 (2010).
[11] Nenkova, A., Passonneau, R. and McKeown, K.: The Pyramid Method: Incorporating Human Content Selection Variation in Summarization Evaluation, ACM Transactions on Speech and Language Processing, Vol. 4, No. 2 (2007).
[12] Paek, T.: Toward Evaluation that Leads to Best Practices: Reconciling Dialog Evaluation in Research and Industry, Bridging the Gap: Academic and Industrial Research in Dialogue Technologies Workshop Proceedings, pp. 40-47 (2007).
[13] Papineni, K., Roukos, S., Ward, T. and Zhu, W.-J.: BLEU: a Method for Automatic Evaluation of Machine Translation, Proceedings of ACL 2002, pp. 311-318 (2002).
[14] Randolph, J. J.: Free-marginal Multirater Kappa (Multirater κ_free): An Alternative to Fleiss' Fixed Marginal Multirater Kappa, Joensuu Learning and Instruction Symposium 2005 (2005).
[15] Sakai, T.: Towards Automatic Evaluation of Multi-Turn Dialogues: A Task Design that Leverages Inherently Subjective Annotations, Proceedings of EVIA 2017 (2017).
[16] Sakai, T. and Dou, Z.: Summaries, Ranked Retrieval and Sessions: A Unified Framework for Information Access Evaluation, Proceedings of ACM SIGIR 2013, pp. 473-482 (2013).
[17] Sakai, T. and Kato, M. P.: One Click One Revisited: Enhancing Evaluation based on Information Units, Proceedings of AIRS 2012 (2012).
[18] Sakai, T., Kato, M. P. and Song, Y.-I.: Click the Search Button and Be Happy: Evaluating Direct and Immediate Information Access, Proceedings of ACM CIKM 2011, pp. 621-630 (2011).
[19] Shang, L., Sakai, T., Li, H., Higashinaka, R., Miyao, Y., Arase, Y. and Nomoto, M.: Overview of the NTCIR-13 Short Text Conversation Task, Proceedings of NTCIR-13 (2017).
[20] Shang, L., Sakai, T., Lu, Z., Li, H., Higashinaka, R. and Miyao, Y.: Overview of the NTCIR-12 Short Text Conversation Task, Proceedings of NTCIR-12, pp. 473-484 (2016).
[21] Walker, M. A., Litman, D. J., Kamm, C. A. and Abella, A.: PARADISE: A Framework for Evaluating Spoken Dialogue Agents, Proceedings of ACL 1997, pp. 271-280 (1997).
[22] Walker, M. A., Passoneau, R. and Boland, J. E.: Quantitative and Qualitative Evaluation of Darpa Communicator Spoken Dialogue Systems, Proceedings of ACL 2001, pp. 515-522 (2001).
[23] Wang, Y., Sherman, G., Lin, J. and Efron, M.: Assessor Differences and User Preferences in Tweet Timeline Generation, Proceedings of ACM SIGIR 2015, pp. 615-624 (2015).