Test Collections and Measures for Evaluating Customer-Helpdesk Dialogues

Zhaohao Zeng (Waseda University, Japan) zhaohao@fuji.waseda.jp
Cheng Luo (Tsinghua University, P.R. China) chengluo@tsinghua.edu.cn
Lifeng Shang (Huawei Noah's Ark Lab, HK) shang.lifeng@huawei.com
Hang Li (Toutiao AI Lab, P.R. China) lihang.lh@bytedance.com
Tetsuya Sakai (Waseda University, Japan) tetsuyasakai@acm.org

EVIA 2017, co-located with NTCIR-13, 5 December 2017, Tokyo, Japan. Copying permitted for private and academic purposes. © 2017 Copyright held by the author.

ABSTRACT

We address the problem of evaluating textual, task-oriented dialogues between the customer and the helpdesk, such as those that take the form of online chats. As an initial step towards evaluating automatic helpdesk dialogue systems, we have constructed a test collection comprising 3,700 real Customer-Helpdesk multi-turn dialogues by mining Weibo, a major Chinese social media platform. We have annotated each dialogue with multiple subjective quality annotations and nugget annotations, where a nugget is a minimal sequence of posts by the same utterer that helps towards problem solving. In addition, 10% of the dialogues have been manually translated into English. We have made our test collection DCH-1 publicly available for research purposes. We also propose a simple nugget-based evaluation measure for task-oriented dialogue evaluation, which we call UCH, and explore its usefulness and limitations.

KEYWORDS

dialogues; evaluation; helpdesk; measures; nuggets; test collections

1 INTRODUCTION

Whenever a user of a commercial product or a service encounters a problem, an effective way to solve it would be to contact the helpdesk. Efficient and successful dialogues are desirable both for the customer and the company that sells the product/service. Recent advances in artificial intelligence suggest that, in the not-too-distant future, these human-human Customer-Helpdesk dialogues will be replaced by human-machine ones. In order to build and efficiently tune automatic helpdesk systems, reliable automatic evaluation methods for task-oriented dialogues are required.

Figure 1 shows an example of a Customer-Helpdesk dialogue. It can be observed that it is initiated by Customer's report of a particular problem she is facing, which we call a trigger. This is an example of a successful dialogue, for Helpdesk provides an actual solution to the problem and Customer acknowledges that the problem has been solved. Unlike classical closed-domain task-oriented dialogues, Helpdesk may have to handle diverse requests, which makes it impossible for us to solve the problems with the pre-defined slot filling schemes that are required by many existing evaluation measures for task-oriented dialogues (see Section 2.2).

C: I copied a picture from my PC to my mobile phone, but it kind of looks fuzzy on the phone. How can I solve this? P.S. I'm no good at computers and mobile phones. [Trigger]
H: Please synchronise your PC and phone using iTunes first, and then upload your picture. [Solution]
C: I'd done the synchronisation but did not upload it with XXX Mobile Assistant. I managed to do so by following your advice. You are a real expert, thank you! [Confirmation]
H: You are very welcome. If you have any problems using XXX Mobile Phone Software, please contact us again, or visit XXX.com.

Figure 1: An example of a dialogue between Customer (C) and Helpdesk (H).

In the present study, we address the problem of evaluating textual Customer-Helpdesk dialogues, such as those that take the form of online chats. As an initial step towards evaluating automatic helpdesk dialogue systems, we have constructed a test collection comprising 3,700 real Customer-Helpdesk multi-turn dialogues by mining Weibo (http://www.weibo.com), a major Chinese social media platform. We have annotated each dialogue with subjective quality annotations (task statement, task accomplishment, customer satisfaction, helpdesk appropriateness, customer appropriateness) as well as nugget annotations, where a nugget is a minimal sequence of posts by the same utterer that helps towards problem solving. In addition, 10% of the dialogues have been manually translated into English. We have made our test collection DCH-1 (Dialogues between Customer and Helpdesk) publicly available for research purposes, along with a smaller pilot collection DCH-0, which contains 234 dialogues (http://waseda.box.com/DCH-0-1).

We also propose a simple nugget-based evaluation measure for task-oriented dialogue evaluation, which we call UCH (Utility for Customer and Helpdesk), and explore its usefulness and limitations. We believe that, while subjective dialogue evaluation can evaluate the dialogue as a whole, automatic evaluation methods will eventually require more local pieces of evidence from the dialogue text for close diagnosis. For this reason, we collected both subjective annotations and nugget annotations for each dialogue, in the hope that automatic evaluation measures defined as a function of nuggets will eventually be able to predict subjective scores with reasonable accuracy. Another possible benefit of constructing nuggets is that a set of nuggets collected from a dialogue may also be useful for evaluating a different dialogue that discusses a similar problem.

2 RELATED WORK

2.1 Evaluating Non-Task-Oriented Dialogues

Evaluating generated responses in non-task-oriented dialogues is a difficult problem. Galley et al. [3] proposed Discriminative BLEU, which generalises BLEU [13], a machine translation evaluation measure that compares the system output with multiple reference translations at the n-gram level. Discriminative BLEU introduces positive and negative weights to human references (i.e., gold standard responses) in the computation of n-gram-based precision, which is the primary component of BLEU. Because it is difficult to obtain multiple hand-crafted references for conversational data, they automatically mine candidate responses from a corpus of conversations and then have annotators rate the quality of the candidates. The reference weights reflect the result of the quality annotations.

Higashinaka et al. [5] ran the first Dialogue Breakdown Detection Challenge using Japanese human-machine chat corpora, to evaluate the system's ability to detect the point in a given dialogue where it becomes difficult to continue due to the system's inappropriate response. This effort used 1,146 text chat dialogues for training and another 100 for development and testing. After each system utterance in the dialogue, participating systems were required to provide a diagnosis: "NB" (not a breakdown), "PB" (possible breakdown), or "B" (breakdown). They were also required to submit a probability distribution over the three labels. To define the gold standard data for this task, multiple annotators were hired, so that a gold probability distribution could be constructed for each utterance. By comparing the best gold label with the system's output, accuracy, precision, recall and F-measure were computed. Moreover, by comparing the gold distribution over the three labels with the system's distribution, Jensen-Shannon Divergence and Mean Squared Error were computed. Using a distribution as the gold standard probably reflects the view that there can be multiple acceptable choices within a dialogue, as suggested also by other studies [1, 3]. The third Dialogue Breakdown Detection Challenge workshop will be held as part of the Dialogue System Technology Challenges on December 10, 2017 (http://workshop.colips.org/dstc6/).
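To make the distribution-based comparison concrete, the sketch below (our illustration, not code from the challenge) computes Jensen-Shannon divergence and mean squared error between a gold label distribution and a system-submitted distribution over the three breakdown labels; the toy numbers are invented.

```python
import numpy as np

LABELS = ["NB", "PB", "B"]  # not a breakdown / possible breakdown / breakdown

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def mean_squared_error(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.mean((p - q) ** 2))

# Toy example: gold distribution derived from multiple annotators vs. a system's estimate.
gold = [0.2, 0.5, 0.3]
system = [0.1, 0.6, 0.3]
print(js_divergence(gold, system), mean_squared_error(gold, system))
```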
At NTCIR-12, the first Short Text Conversation (STC) task was run using Weibo data (for the Chinese subtask) and Twitter data (for the Japanese subtask), attracting 22 participating teams [20]. The STC task required participating systems to return a valid comment in response to an input tweet (given without any prior context). Instead of relying on natural language generation, systems were required to search a repository of past tweets and return a ranked list as possible responses. Information retrieval evaluation measures were used to evaluate the participating systems. Gold standard labels were created manually by hiring multiple annotators, who used the following axes to decide on a single graded label (L0, L1 or L2): coherence, topical relevance, context-independence, and non-repetitiveness. The second STC task (STC-2) at NTCIR-13 attracted 22 participating teams for the Chinese subtask, and allowed participants to submit not only retrieved responses but also generated ones [19].

2.2 Evaluating Task-Oriented Dialogues

Two decades ago, Walker et al. [21] proposed the PARADISE (PARAdigm for Dialogue System Evaluation) framework for evaluating task-oriented spoken dialogue systems. The basic idea is to collect a variety of real human-machine dialogues for a specific task (e.g., train timetable lookup) as well as subjective ratings of user satisfaction for each dialogue, and to use task success and cost as explanatory variables so that the user satisfaction measures for new dialogues can be estimated by means of linear regression. PARADISE requires an attribute-value matrix that represents the task: for example, for the train timetable domain, attributes such as "depart-city," "arrival-city" and "depart-time" must be specified in advance. This is in contrast to our helpdesk case because, while it is task-oriented, the required attributes depend on the customer's problem and cannot be listed exhaustively in advance. In this respect, helpdesk dialogues probably lie somewhere in between non-task-oriented dialogues and the slot-filling dialogues that PARADISE deals with.
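As a rough illustration of the regression idea behind PARADISE (our sketch, not the original framework or its data), user satisfaction can be fitted on a task-success indicator and a cost feature such as the number of turns, and then predicted for new dialogues:

```python
import numpy as np

# Toy data: one row per dialogue -> [task_success (0/1), cost (e.g., number of turns)].
features = np.array([[1, 5], [1, 9], [0, 12], [1, 7], [0, 15]], dtype=float)
satisfaction = np.array([4.5, 3.8, 2.0, 4.1, 1.5])  # invented subjective ratings

# Ordinary least squares with an intercept term.
design = np.hstack([np.ones((features.shape[0], 1)), features])
coef, *_ = np.linalg.lstsq(design, satisfaction, rcond=None)

# Estimated satisfaction for a new dialogue: task succeeded, 8 turns.
print(float(np.array([1.0, 1.0, 8.0]) @ coef))
```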
The PARADISE framework was subsequently used in the DARPA COMMUNICATOR Program, which evaluated spoken dialogue systems in the travel planning domain [22]. The effort produced the Communicator 2000 Corpus, consisting of 662 dialogues based on nine different systems, with per-call survey results on dialogue efficiency, dialogue quality, task success and user satisfaction. Here, a new utterance tagging scheme called DATE (Dialogue Act Tagging for Evaluation) was introduced, which enables three orthogonal annotations along the axes of speech-act (e.g., "request-info," "apology"), task-subtask (e.g., "origin," "destination," "date") and conversational-domain ("about-task," "about-communication," or "situation-frame"). Again, unlike our case, their task-subtask annotation scheme needs to be defined in advance.

Lowe et al. [9] released the Ubuntu Dialogue Corpus, which contains 930,000 human-human dialogues extracted from Ubuntu chats. Their effort is more similar to ours than the aforementioned studies on task-oriented dialogue evaluation in that they focus primarily on unstructured dialogues rather than slot-filling. However, while they automatically disentangled the chats to form dyadic dialogues, their original chat logs usually involve more than two parties, which makes their data different from our dyadic Customer-Helpdesk DCH-1 dataset. They formed a response selection test data set by setting aside 2% of the corpus and forming (context, response, flag) triplets based on this set. Here, context is the sequence of utterances that appear prior to the response in the dialogue; response is either the actual correct response from the dialogue or a randomly chosen utterance from outside the dialogue (but within the test set); flag is one for the correct response and zero for incorrect responses. For each correct response, they generated nine additional triplets containing different incorrect responses. Thus, response selection systems are given a context and ten choices of responses, and are required to select one or more responses. They use recall at k as the evaluation measure, where k is the size of the set of responses selected by the system, and therefore "recall at 1" reduces to accuracy. Note that this evaluation setting does not require annotations for defining the gold standard. They do not consider ranked lists of responses as is done at STC.
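For concreteness, recall at k in this one-correct-in-ten setting can be computed as in the sketch below (our own illustration with invented candidate IDs); with k = 1 it reduces to accuracy.

```python
def recall_at_k(selections, gold):
    """selections: for each context, the set of k responses the system selected;
    gold: the correct response for each context."""
    hits = sum(1 for chosen, correct in zip(selections, gold) if correct in chosen)
    return hits / len(gold)

# Two contexts; the system selects k = 2 of the ten candidate responses for each.
selections = [{"r3", "r7"}, {"r2", "r9"}]
gold = ["r3", "r1"]
print(recall_at_k(selections, gold))  # 0.5; with k = 1 this is simply accuracy
```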
The most straightforward approach to evaluating dialogues is to collect subjective assessments from the user who actually experienced the dialogue. Hone and Graham [6] used a large questionnaire to evaluate an in-car speech interface and, by means of factor analysis, identified system response accuracy, likeability, cognitive demand, annoyance, habitability and speed as the key factors in subjective evaluation; their approach is known as SASSI (Subjective Assessment of Speech System Interfaces). Hartikainen et al. [4] applied a service quality assessment method from marketing to the evaluation of a telephone-based email application; their method is known as SERVQUAL. Paek [12] discusses SASSI, SERVQUAL and PARADISE in a survey of spoken dialogue evaluation, along with his Wizard-of-Oz approach of using human performance to replace a system component in order to define a gold standard.

2.3 Evaluating Textual Information Access

While the aforementioned BLEU [13] is basically equivalent to an n-gram-based precision, ROUGE [7], a BLEU-inspired measure designed for text summarisation evaluation, is basically a suite of measures including n-gram-based (or skip-gram-based) recall and F-measure. Just as BLEU requires multiple reference translations, ROUGE requires multiple reference summaries. Note that the basic units of comparison, namely n-grams and the like, are automatically extracted from both the references and the system output.

In contrast to the above automatically extracted units of comparison, manually devised nuggets have been used in both summarisation evaluation [11] and question answering evaluation. In the TREC Question Answering (QA) tracks, a nugget is defined as "a fact for which the annotator could make a binary decision as to whether a response contained that nugget" [8]. Having constructed nuggets, (weighted) recall, precision and F-measure scores can be computed, except that the precision computation requires special handling: while one can count the number of nuggets present or missing in the system output, one cannot count the number of "non-nuggets" (i.e., irrelevant pieces of information) in the same output, since "non-nuggets" are never defined. Hence, nugget precision, which is supposed to quantify the amount of irrelevant information in the output, cannot be defined. To work around this problem, a fixed-length "allowance" was introduced at the TREC QA tracks so that nugget precision could be defined based solely on the system output length. The TREC QA tracks also used a measure called POURPRE, which replaces the manual nugget matching step with automatic nugget matching based on unigrams. The NTCIR ACLIA (Advanced Cross-lingual Information Access) Task adapted these methods for evaluating QA with Asian languages [10].

As was discussed above, traditional evaluation measures for summarisation and question answering employ variants of recall, precision and F-measure based on small textual units. Hence, they regard the system output as a set of n-grams, nuggets, and so on. In contrast, Sakai, Kato and Song [18] introduced a nugget-based evaluation measure called S-measure for evaluating textual summaries for mobile search, which incorporates a decay factor for nugget weights based on nugget positions. Just as evaluation measures for ranked retrieval define a decay function over document ranks, S-measure defines a linear decay function over the text, using the offset positions of the nuggets. This reflects the view that important nuggets should be presented first and that we should minimise the amount of text that the user has to read. Sakai and Kato [17] complement S-measure with a precision-like measure called T-measure, which, unlike the aforementioned allowance-based precision used at the TREC QA tracks, takes into account the fact that different pieces of information require different textual lengths. They define an "iUnit" (information unit) as "an atomic piece of information that stands alone and is useful to the user."

Sakai and Dou [16] generalised the idea of S-measure to handle various textual information access tasks, including web search. Their measure, known as U-measure, constructs a string called a trailtext, which is a concatenation of all the texts that the user has read (obtained by observation or by assuming a user model). Then, over the trailtext, a linear decay function is defined (see Section 4).

3 DESIGNING AND BUILDING DCH-1

3.1 Overview

Our ultimate goal is automatic evaluation of human-machine Customer-Helpdesk dialogues. As a first step towards it, we built two test collections based on real (i.e., human-human) Customer-Helpdesk dialogues, which we call DCH-0 and DCH-1.

DCH-0, our smaller collection, was used to establish an efficient and reliable test collection construction procedure. For example, although we started constructing DCH-0 by using the number of posts in each dialogue for sampling dialogues of different lengths, where a post refers to a piece of timestamped text entered by either Customer or Helpdesk, we quickly realised that posts are often a mere artifact of the Weibo users' arbitrary hits of the ENTER key, and that they are not suitable as the basic semantic unit. Based on this experience, we used the utterance block, formed by merging all consecutive posts by the same utterer, as the basis for measuring the length of a dialogue in DCH-1.

Table 1 provides some statistics of DCH-0 and DCH-1. As shown in the table, 184 of the 3,700 DCH-1 dialogues are "triggerless," by which we mean that Customer and Helpdesk exchange remarks even though Customer does not seem to be facing any problem (cf. Figure 1). (We tried filtering out these triggerless dialogues for the analyses reported in Section 5, but the effect of this on our results was not substantial.) Below, we discuss the construction and validation of DCH-1.

Table 1: Test collection statistics. *Only 40 dialogues from DCH-0 were annotated with nuggets.

                                        DCH-0            DCH-1
Source                                  www.weibo.com
Language                                Chinese
Data timestamps                         Jan. 2013 - Sep. 2016
#Dialogues                              234              3,700
#English translations                   40               370
#Helpdesk accounts                      16               161
Avg. #posts/dialogue                    13.402           4.512
Avg. #utterance blocks/dialogue         12.021           4.162
Avg. post length (#chars)               35.011           44.568
Avg. utterance block length (#chars)    39.031           48.313
#Annotators/dialogue                    2                3
Subjective annotation criteria          TS, TA, CS, HA, CA (see Section 3.4)
Nugget types                            CNUG0, CNUG, HNUG, CNUG*, HNUG* (see Section 3.5)
Triggerless dialogues                   1*               184
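To illustrate how the utterance blocks used above are obtained from raw posts (our sketch; the post representation is hypothetical), consecutive posts by the same utterer are simply concatenated:

```python
def to_utterance_blocks(posts):
    """Merge consecutive posts by the same utterer ('C' or 'H') into utterance blocks.

    posts: list of (utterer, text) pairs in dialogue order.
    Returns a list of (utterer, merged_text) pairs.
    """
    blocks = []
    for utterer, text in posts:
        if blocks and blocks[-1][0] == utterer:
            blocks[-1] = (utterer, blocks[-1][1] + " " + text)  # extend the current block
        else:
            blocks.append((utterer, text))
    return blocks

# A Customer who hits ENTER twice produces two posts but only one utterance block.
posts = [("C", "My photo looks fuzzy on the phone."),
         ("C", "How can I fix this?"),
         ("H", "Please synchronise your PC and phone first.")]
print(to_utterance_blocks(posts))  # 3 posts -> 2 utterance blocks
```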
3.2 Dialogue Mining

The 3,700 Helpdesk dialogues contained in the DCH-1 test collection were mined from Weibo in September 2016 as follows. (1) We collected an initial set of Weibo accounts by searching for Weibo account names that contained keywords such as "assistant" and "helper" (in Chinese). We denote this set by A0. (2) For each account name a in A0, we added the prefix "@" to a and used the resulting string as a query for retrieving up to 40 conversational threads (i.e., an initial post plus the comments on it) that contain a mention of the official account. (Weibo's interface for conversational threads is somewhat different from Twitter's: comments to a post are not displayed on the main timeline; they are displayed under each post only if the "comments" button is clicked.) We then filtered out accounts that did not respond to over one half of these threads. We denote the filtered set of "active" accounts as A. (3) For each account a in A, we retrieved all threads that contain a mention of a from January 2013 to September 2016, and extracted Customer-Helpdesk dyadic dialogues from them. We then kept those that consist of at least one utterance block by Customer and one by Helpdesk. As a result, 21,669 dialogues were obtained. This collection is denoted as D0. (4) As D0 is too large for annotation, we sampled 3,700 dialogues from it as follows. For i = 2, 3, ..., 6, we randomly sampled 700 dialogues that contained i utterance blocks. In addition, we randomly sampled 200 that contained i = 7 utterance blocks; we could not sample 700 dialogues for i = 7, as D0 did not contain enough dialogues that are very long.

10% (370) of the Chinese dialogues in DCH-1 were manually translated into English by a professional translation company for research purposes.
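The length-stratified sampling in step (4) above amounts to the following (a sketch that assumes D0 is available as a list of dialogue records with a precomputed utterance-block count; the field name is hypothetical):

```python
import random

def sample_dch1(d0, seed=0):
    """Sample 700 dialogues for each utterance-block count i = 2..6 and 200 for i = 7."""
    rng = random.Random(seed)
    quotas = {2: 700, 3: 700, 4: 700, 5: 700, 6: 700, 7: 200}
    sampled = []
    for i, quota in quotas.items():
        pool = [d for d in d0 if d["n_utterance_blocks"] == i]
        sampled.extend(rng.sample(pool, quota))
    return sampled  # 5 * 700 + 200 = 3,700 dialogues in total
```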
3.3 Annotators

We hired 16 Chinese undergraduate students from the Faculty of Science and Engineering at Waseda University, so that each Chinese dialogue was annotated independently by three annotators. The assignment of dialogues to annotators was randomised. Given a dialogue, each annotator first read the entire dialogue carefully, then gave it ratings according to the five subjective annotation criteria described in Section 3.4, and finally identified nuggets within the same dialogue, where nuggets were defined as described in Section 3.5. An initial face-to-face instruction and training session for the annotators was organised by the first author of this paper at Waseda University; subsequently, the annotators were allowed to do their annotation work online, using a web-browser-based tool, at a location and time convenient to them. The number of dialogues assigned to each annotator was 3,700 * 3 / 16 = 693.75 on average; all of them completed their work within two weeks, as they were initially asked to do. The actual annotation time spent by each annotator was 18-20 hours.

3.4 Subjective Annotation

By subjective annotation, we mean manual quantification of the quality of a dialogue as a whole. As there are two players involved in a Customer-Helpdesk dialogue, we wanted to accommodate the following two viewpoints:

Customer's viewpoint: Does Helpdesk solve Customer's problem efficiently? Customer may want a solution quickly while providing minimal information to Helpdesk.
Helpdesk's viewpoint: Does Customer provide accurate and sufficient information so that Helpdesk can provide the right solution? Helpdesk also wants to solve Customer's problem through minimal interactions, as these interactions translate directly into cost for the company.

Moreover, we wanted to assess customer satisfaction, as this is of utmost importance for both parties. While customer satisfaction ratings should ideally be collected from the real customer at the time of dialogue termination, we had no choice but to collect surrogate, post-hoc ratings by the annotators instead.

By considering the above points as well as our results from the smaller DCH-0 collection, we finally devised the following five subjective annotation criteria:

Task Statement: Whether the task (i.e., the problem to be solved) is clearly stated by Customer (denoted by TS);
Task Accomplishment: Whether the task is actually accomplished (denoted by TA);
Customer Satisfaction: Whether Customer is likely to have been satisfied with the dialogue, and to what degree (denoted by CS);
Helpdesk Appropriateness: Whether Helpdesk provided appropriate information (denoted by HA);
Customer Appropriateness: Whether Customer provided appropriate information (denoted by CA).

Figure 2 shows the actual instructions for the annotators: note that CS is on a five-point scale (-2 to 2), while the other four criteria are on a three-point scale (-1 to 1).

Figure 2: Subjective annotation criteria.

Table 2 shows the inter-rater agreement (for three assessors) of the subjective labels in terms of Fleiss' κ [2] and Randolph's κ_free [14]; κ_free is known to be more suitable when the labels are heavily skewed across the categories, which is indeed the case here. "2+ agree" means the proportion of dialogues for which at least two annotators agree, e.g., (-1, -1); "3 agree" means the proportion of dialogues for which all three annotators agree, e.g., (-1, -1, -1).

Table 2: Inter-annotator agreement of the subjective annotations for DCH-1 (3,700 dialogues, 3 annotators per dialogue). Note that Fleiss' κ and Randolph's κ_free treat the ratings as nominal categories. 2+ agree means the proportion of dialogues for which at least two annotators agree; 3 agree means the proportion of dialogues for which all three annotators agree. For CS, 2 and 1 were treated as 1, and -2 and -1 were treated as -1.

      2+ agree   3 agree   Fleiss' κ   κ_free
TS    .981       .729      .301        .719
TA    .925       .361      .273        .324
CS    .938       .349      .276        .318
HA    .873       .309      .197        .245
CA    .857       .288      .141        .216

It can be observed that the agreement among the three assessors is low, except perhaps for TS, which reflects the highly subjective nature of this labelling task. While it may be possible to improve the inter-assessor agreement a little in our future work by revising the labelling instructions, it should be stressed that our labelling task is not document relevance assessment, and that it is inherently highly subjective. We believe that, as our future work, hiring more than three assessors and preserving their different viewpoints in the test collection is more important than trying to force them into reaching an agreement.
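For reference, the two agreement coefficients can be computed as in the sketch below (our illustration of the standard formulas, not the scripts used for Table 2). Rows are dialogues, columns are rating categories, and each cell counts how many of the three annotators chose that category.

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa. counts: (n_items, n_categories); each row sums to the number of raters."""
    counts = np.asarray(counts, dtype=float)
    n_items = counts.shape[0]
    n_raters = counts[0].sum()
    p_j = counts.sum(axis=0) / (n_items * n_raters)                       # category proportions
    p_i = (np.sum(counts ** 2, axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar, p_e = p_i.mean(), float(np.sum(p_j ** 2))
    return (p_bar - p_e) / (1 - p_e)

def randolph_kappa_free(counts):
    """Randolph's free-marginal kappa: chance agreement is fixed at 1/k for k categories."""
    counts = np.asarray(counts, dtype=float)
    n_raters = counts[0].sum()
    k = counts.shape[1]
    p_i = (np.sum(counts ** 2, axis=1) - n_raters) / (n_raters * (n_raters - 1))
    return (p_i.mean() - 1.0 / k) / (1.0 - 1.0 / k)

# Toy example: 4 dialogues, 3 annotators, a criterion rated on {-1, 0, 1} (the three columns).
counts = np.array([[0, 0, 3], [0, 1, 2], [3, 0, 0], [0, 0, 3]])
print(fleiss_kappa(counts), randolph_kappa_free(counts))
```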
3.5 Nugget Annotation

We had three annotators independently identify nuggets for each dialogue as follows. At the instruction and training session, the annotators were given the diagram shown in Figure 3, which reflects our view that accumulating nuggets will eventually solve Customer's problem, together with a written definition of nuggets, as described below. (1) A nugget is a post, or a sequence of consecutive posts, by the same utterer (i.e., either Customer or Helpdesk). (2) It can neither partially nor wholly overlap with another nugget. (3) It should be minimal: that is, it should not contain irrelevant posts at the start, the end or in the middle. An irrelevant post is one that does not contribute to the Customer transition (see Figure 3). (4) It helps Customer transition from the Current State (including the Initial State) towards the Target State (i.e., when the problem is solved).

Figure 3: Task accomplishment as state transitions, and the role of a nugget. (The diagram shows different paths that lead from Customer's current state to the target state: Customer's initial state (facing a problem), intermediate states where the problem is not quite solved yet but Customer is a little closer to the target state, and Customer's target state (problem solved). A nugget contributes such a transition; Helpdesk-Customer interactions that do not directly lead Customer to an intermediate state or the target state do not.)

Note that we utilise Weibo posts as the atomic building blocks for forming nuggets. This takes into account the remark by Wang et al. [23]: "Experience from question answering evaluations has shown that users disagree about the granularity of nuggets—for example, whether a piece of text encodes one or more nuggets and how to treat partial semantic overlap between two pieces of text." Note also that, according to our definition, an utterance block (i.e., maximal consecutive posts by the same utterer) generally subsumes one or more nuggets.

Compared to the traditional nugget-based information access evaluation discussed in Section 2.3, there are two unique features in nugget-based helpdesk dialogue evaluation: (1) a dialogue involves two parties, Customer and Helpdesk; and (2) even within the same utterer, nuggets are not homogeneous, by which we mean that some nuggets may play special roles. In particular, since the dialogues we consider are task-oriented (but not closed-domain, which makes slot-filling approaches infeasible), there must be some nuggets that represent the state of identifying the task and some that represent the state of accomplishing it.

Based on the above considerations, we defined the following mutually exclusive nugget types:

CNUG0: Customer's trigger nuggets. These are nuggets that define Customer's initial problem, which directly caused Customer to contact Helpdesk.
HNUG: Helpdesk's regular nuggets. These are nuggets in Helpdesk's utterances that are useful from Customer's point of view.
CNUG: Customer's regular nuggets. These are nuggets in Customer's utterances that are useful from Helpdesk's point of view.
HNUG*: Helpdesk's goal nuggets. These are nuggets in Helpdesk's utterances which provide Customer with a solution to the problem.
CNUG*: Customer's goal nuggets. These are nuggets in Customer's utterances which tell Helpdesk that Customer's problem has been solved.

Each nugget type may or may not be present in a dialogue, and multiple nuggets of the same type may be present in a dialogue.

Using a pull-down menu on our web-browser-based tool, the assessors categorised each post into CNUG0, CNUG, HNUG, CNUG*, HNUG*, or NAN (not a nugget). Then, consecutive posts with the same label (e.g., CNUG followed by CNUG) were automatically merged to form a nugget.
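The merging step can be sketched as follows (our illustration; the labelled Figure 1 posts at the end are one hypothetical annotator's choices):

```python
def posts_to_nuggets(labelled_posts):
    """Merge consecutive posts that share the same nugget label into nuggets.

    labelled_posts: (label, text) pairs in dialogue order, where label is one of
    'CNUG0', 'CNUG', 'HNUG', 'CNUG*', 'HNUG*' or 'NAN' (not a nugget).
    Returns (label, merged_text) nuggets; NAN posts are dropped.
    """
    nuggets, prev_label = [], None
    for label, text in labelled_posts:
        if label != "NAN":
            if label == prev_label:
                nuggets[-1] = (label, nuggets[-1][1] + " " + text)  # extend the previous nugget
            else:
                nuggets.append((label, text))
        prev_label = label
    return nuggets

# The Figure 1 dialogue, post by post, as one annotator might label it:
labelled = [("CNUG0", "I copied a picture from my PC to my mobile phone, but it looks fuzzy."),
            ("HNUG*", "Please synchronise your PC and phone using iTunes first."),
            ("CNUG*", "I managed to do so by following your advice. Thank you!"),
            ("NAN", "You are very welcome.")]
print(posts_to_nuggets(labelled))  # three single-post nuggets; the last post is not a nugget
```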
Table 3 shows the inter-annotator agreement of the nugget annotations, where the posts are used as the basis for comparison. The 3,700 dialogues in DCH-1 contain a total of 7,155 Helpdesk posts, all of which were annotated independently by three annotators, producing a total of 21,465 annotations. A direct comparison with the subjective annotation agreement shown in Table 2 would be difficult, since both the annotation units (dialogues vs. nuggets) and the annotation schemes (numerical ratings vs. nugget types) are different. However, it can be observed that the agreement for Customer nuggets is substantially higher than that for Helpdesk nuggets. A possible explanation for this would be that it is easier for annotators to judge the contribution of Customer's utterances for reaching his/her target state than to judge that of Helpdesk, at least for regular nuggets: while Helpdesk often asks Customer for more information regarding the problem context, it is Customer's utterances that actually provide that information.

Table 3: Inter-annotator agreement of the nugget annotations for DCH-1 (3,700 dialogues, 3 annotators per dialogue). 2+ agree means the proportion of posts for which at least two annotators agree; 3 agree means the proportion of posts for which all three annotators agree. NAN means "not a nugget." 95% CIs for κ are also shown.

                                         2+ agree   3 agree   Fleiss' κ           κ_free
Helpdesk posts (HNUG/HNUG*/NAN)          .907       .299      .174 [.165, .184]   .253
Customer posts (CNUG0/CNUG/CNUG*/NAN)    .959       .491      .488 [.481, .496]   .529

While directly comparing the inter-annotator agreement of the subjective annotations and the nugget annotations seems difficult, we would like to compare their intra-annotator consistency by making each annotator process the same dialogue multiple times in our future work.

4 UCH: A DIALOGUE EVALUATION MEASURE

We now propose an evaluation measure that leverages nuggets for quantifying the quality of Customer-Helpdesk dialogues. We regard a Customer-Helpdesk dialogue as a trailtext in the sense of U-measure, which may or may not contain nuggets. Let pos denote the position (i.e., offset from the beginning of the dialogue) of a nugget; for ideographic languages such as Chinese and Japanese, we use the number of characters to define the offset position. Given a patience parameter L, we define a decay function over the trailtext as [16]:

  D(pos) = max(0, 1 - pos/L) .    (1)

This is for discounting the value of a nugget that appears later in the dialogue; at position L, the value of any nugget wears out completely. In our experiments, we let L = L_max = 916, as this is the number of (Chinese) characters in the longest dialogue from the DCH-1 collection. The benefit of introducing L is discussed in Section 5.2.

Let N and M denote the number of Customer's non-goal nuggets and Helpdesk's non-goal nuggets identified within a dialogue, respectively; for simplicity, let us assume that there is at most one Customer's goal nugget (c*) and at most one Helpdesk's goal nugget (h*) in a dialogue. Let {c_1, ..., c_N, c*} denote the set of nuggets from Customer's posts, and let {h_1, ..., h_M, h*} denote that from Helpdesk's posts. Let pos(c_i) (i ∈ {1, ..., N, *}) be the position of nugget c_i; pos(h_j) (j ∈ {1, ..., M, *}) is defined similarly. Given the gain value g(c_i) of each non-goal nugget, a simple evaluation measure based solely on Customer's utterances can be computed as:

  UC = Σ_{c_i ∈ {c_1, ..., c_N, c*}} g(c_i) D(pos(c_i)) .    (2)

In the present study, we define the gain value of CNUG* as g(c*) = 1 + Σ_{i=1}^{N} g(c_i). This is an attempt at reflecting the view that task accomplishment is what matters most. To be more specific, when the discounting function is ignored and dialogues are regarded as sets of nuggets, having only the goal nugget is better than having all the regular nuggets. Similarly, given the gain value g(h_j) of each non-goal nugget, a measure based solely on Helpdesk's utterances can be computed as:

  UH = Σ_{h_j ∈ {h_1, ..., h_M, h*}} g(h_j) D(pos(h_j)) ,    (3)

where g(h*) = 1 + Σ_{j=1}^{M} g(h_j). Finally, for a given parameter α (0 ≤ α ≤ 1) that specifies the contribution of Helpdesk's utterances relative to Customer's, we can define the following combined measure:

  UCH_α = (1 - α) UC + α UH .    (4)

By default, we use α = 0.5. Note that UCH_0.5 is equivalent to computing a single U-measure score without distinguishing between Customer's and Helpdesk's nuggets. The choice of α is discussed in Section 5.3.

Since we have three independent nugget annotations per dialogue, we tried two approaches to computing a single score for a given dialogue: Average UCH (AUCH) simply computes a UCH score for each annotator and then takes the average for that dialogue; Consolidated UCH (CUCH) merges the nuggets from the multiple annotators first and then computes a single UCH score. We only report results with AUCH, which consistently outperformed CUCH in our experiments.
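To make Eqs. (1)-(4) and AUCH concrete, here is a minimal Python sketch (our illustration, not released code). Nugget positions are character offsets from the beginning of the dialogue, and the gain values of regular nuggets are assumed to be given; the goal-nugget gain is derived from them as defined above.

```python
L_MAX = 916  # number of characters in the longest DCH-1 dialogue

def decay(pos, L=L_MAX):
    """Eq. (1): linear, position-based discount over the trailtext."""
    return max(0.0, 1.0 - pos / L)

def utility(regular, goal_pos=None, L=L_MAX):
    """Eqs. (2)/(3): discounted gain sum for one utterer.

    regular:  list of (pos, gain) pairs for the utterer's non-goal nuggets.
    goal_pos: position of the goal nugget (CNUG*/HNUG*) if present; its gain is
              1 + the sum of the regular gains.
    """
    total = sum(gain * decay(pos, L) for pos, gain in regular)
    if goal_pos is not None:
        goal_gain = 1.0 + sum(gain for _, gain in regular)
        total += goal_gain * decay(goal_pos, L)
    return total

def uch(cust_regular, cust_goal_pos, help_regular, help_goal_pos, alpha=0.5, L=L_MAX):
    """Eq. (4): UCH_alpha = (1 - alpha) * UC + alpha * UH."""
    uc = utility(cust_regular, cust_goal_pos, L)
    uh = utility(help_regular, help_goal_pos, L)
    return (1.0 - alpha) * uc + alpha * uh

def auch(per_annotator_scores):
    """Average UCH (AUCH): the mean of the per-annotator UCH scores for one dialogue."""
    return sum(per_annotator_scores) / len(per_annotator_scores)

# Toy dialogue (regular gains set to 1 for illustration):
score = uch(cust_regular=[(0, 1.0)], cust_goal_pos=300,
            help_regular=[(120, 1.0)], help_goal_pos=250)
print(score, auch([score, 0.9 * score, 1.1 * score]))
```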
5 ANALYSIS WITH UCH

This section addresses the following questions: How does UCH correlate with subjective ratings? (Section 5.1); Is the patience parameter L useful for estimating subjective ratings? (Section 5.2); and Which utterer plays the major role when estimating subjective ratings with UCH? (Section 5.3).

In the analyses reported below, we use the z-score of each subjective rating before averaging the ratings over the three annotators. That is, for each annotator and each subjective criterion, we first compute the mean and standard deviation of the raw ratings, and then process each raw rating by subtracting the mean and dividing by the standard deviation. This is to remove each annotator's inherent scoring tendency.
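The per-annotator standardisation, and the Kendall's τ computation used in Section 5.1, can be sketched as follows (our illustration with toy numbers; it uses scipy, and the paper does not state which implementation or confidence-interval method was actually used):

```python
import numpy as np
from scipy.stats import kendalltau

def zscore_per_annotator(ratings):
    """ratings: (n_annotators, n_dialogues) raw ratings for one criterion.
    Standardises each annotator's ratings to remove his/her scoring tendency."""
    ratings = np.asarray(ratings, dtype=float)
    mean = ratings.mean(axis=1, keepdims=True)
    std = ratings.std(axis=1, keepdims=True)
    return (ratings - mean) / std

# Toy data: 3 annotators x 5 dialogues of CS ratings, and AUCH scores for the same dialogues.
cs_raw = np.array([[1, 2, -1, 0, 2],
                   [0, 1, -2, -1, 1],
                   [2, 2,  0, 1, 2]])
avg_cs = zscore_per_annotator(cs_raw).mean(axis=0)   # average z-scored CS per dialogue
auch_scores = np.array([0.8, 1.4, 0.1, 0.3, 1.2])
tau, p_value = kendalltau(auch_scores, avg_cs)
print(tau, p_value)
```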
5.1 Correlation with Subjective Annotations

Table 4 shows the Kendall's τ values between AUCH and the average subjective ratings for the DCH-1 collection, with 95% confidence intervals. It can be observed that AUCH is reasonably highly correlated with HA (.414, 95% CI [.398, .432]) and CA (.434, 95% CI [.417, .450]). That is, even though the inter-annotator agreement for the appropriateness criteria is relatively low (Table 2), AUCH manages to estimate the average appropriateness with reasonable accuracy. On the other hand, the table shows that the τ between AUCH and CS is very low, albeit statistically significant (.118, 95% CI [.097, .141]). One possible explanation for this might be that the CS ratings themselves are not as reliable as we would have liked. First, as we have discussed in Section 3.4, the annotators are not the actual customers; second, our manual inspection of some of the dialogues from DCH-0 and DCH-1 suggests that an annotator's ratings may be influenced by his/her prior impression of the product/service or the company, rather than by the contents of the particular dialogue in question.

Table 4: Kendall's τ between AUCH and average subjective ratings for DCH-1 (3,700 dialogues), with 95% CIs.

      AUCH
TS    .267 [.237, .277]
TA    .256 [.244, .289]
CS    .118 [.097, .141]
HA    .414 [.398, .432]
CA    .434 [.417, .450]

5.2 The Patience Parameter L

As was explained in Section 4, UCH inherits the patience parameter L from S-measure [18] and U-measure [16], which discounts the value of a nugget based on its position within the dialogue. As mentioned earlier, we let L = L_max = 916 by default, as this is the length of the longest dialogue within DCH-1. Using a small L means that the decay function becomes steep and that we do not tolerate long dialogues; using an extremely large L is equivalent to switching off the decay function, thereby treating the dialogue as a set of nuggets (see Eq. 1).

Figure 4 shows the effect of L on the τ between average CS and AUCH. It can be observed that, at least for DCH-1, L = L_max/4 = 229 seems to be a good choice if AUCH is to be used for estimating customer satisfaction. This suggests that user satisfaction may be linked to user patience, and that considering nugget positions as UCH does is of some use. However, as was discussed earlier, the reliability of the CS ratings deserves a closer investigation in our future work.

Figure 4: Effect of L on the τ between average customer satisfaction and AUCH. (The x-axis varies L over values ranging from L_max/16 to ∞; the y-axis shows Kendall's τ between 0.00 and 0.20.)

5.3 The Contribution Parameter α

As Eq. 4 shows, UCH can strike a balance between Customer's utterances and Helpdesk's; a small α means that we rely more on Customer nuggets for computing UCH. Figure 5 shows the effect of α on the τ between AUCH and the different average subjective ratings. The trends are the same for TS, TA, CS, and CA: the smaller the α, the higher the rank correlation. That is, to achieve the highest τ, it is best to rely entirely on Customer utterances, i.e., to completely ignore Helpdesk utterances.

Interestingly, however, the trend is different for HA: the curve for HA suggests that α = 0.5, our default value, is in fact the best choice. That is, to achieve the highest τ with Helpdesk Appropriateness, treating Customer's and Helpdesk's nuggets equally appears to be a good choice. While it is obvious that Helpdesk's utterances need to be taken into account in order to estimate Helpdesk Appropriateness, the curve implies that Customer's utterances also play an important part in the estimation. These results suggest that different subjective annotation criteria require different balances between Customer's and Helpdesk's utterances.

Figure 5: Effect of α on the τ between average subjective ratings and AUCH. (One curve per criterion: TS, TA, CS, HA and CA; the x-axis varies α from 0.0 to 1.0; the y-axis shows Kendall's τ between 0.0 and about 0.6.)
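The analyses of Sections 5.2 and 5.3 amount to recomputing AUCH under different L or α values and re-measuring Kendall's τ. Because UCH_α is a linear combination of UC and UH, the α sweep only needs the per-dialogue UC and UH scores, as in this sketch (our illustration with invented numbers):

```python
import numpy as np
from scipy.stats import kendalltau

def alpha_sweep(uc, uh, target, alphas=np.linspace(0.0, 1.0, 11)):
    """uc, uh: per-dialogue UC and UH scores (e.g., averaged over annotators);
    target: per-dialogue average subjective rating (z-scored).
    Returns Kendall's tau for each alpha, mirroring Figure 5; sweeping L instead
    would require recomputing UC and UH with the new decay function."""
    uc, uh, target = (np.asarray(x, dtype=float) for x in (uc, uh, target))
    return {round(float(a), 2): kendalltau((1 - a) * uc + a * uh, target)[0] for a in alphas}

# Toy example with five dialogues:
uc = [1.2, 0.4, 0.9, 0.2, 1.5]
uh = [0.3, 0.8, 0.5, 0.1, 0.9]
avg_ca = [1.1, -0.2, 0.6, -0.9, 1.3]
print(alpha_sweep(uc, uh, avg_ca))
```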
6 CONCLUSIONS

As an initial step towards evaluating automatic dialogue systems, we constructed DCH-1, which contains 3,700 real Customer-Helpdesk multi-turn dialogues mined from Weibo. We have annotated each dialogue with subjective quality annotations (TS, TA, CS, HA, and CA) and nugget annotations, with three annotators per dialogue. In addition, 10% of the dialogues have been manually translated into English. We described how we constructed the test collection and the philosophy behind it. We also proposed UCH, a simple nugget-based evaluation measure for task-oriented dialogue evaluation, and explored its usefulness and limitations. Our main findings on UCH based on the DCH-1 collection are as follows.

(1) UCH correlates better with subjective ratings that reflect the appropriateness of utterances (HA and CA) than with customer satisfaction (CS);
(2) The patience parameter L of UCH, which considers the positions of nuggets within a dialogue, may be a useful feature for enhancing the correlation with customer satisfaction;
(3) For the majority of our subjective annotation criteria, customer utterances seem to play a much more important role than helpdesk utterances do in enabling UCH to achieve high correlations with subjective ratings, according to our analysis of the parameter α.

Our future work includes the following:

• Comparing the subjective annotations and the nugget annotations in terms of intra-annotator agreement;
• Investigating the reliability of offline customer satisfaction ratings by comparing them with real customer ratings collected right after the termination of a helpdesk dialogue;
• Collecting subjective and nugget annotations for the English subcollection of DCH-1, and comparing across Chinese and English;
• Devising ways for automatic nugget identification and automatic categorisation of nuggets into different nugget types.

The NTCIR-14 Short Text Conversation task (STC-3) will feature a new subtask that is based on the present study: given a dialogue, participating systems are required to estimate the distribution of subjective scores such as user satisfaction over multiple annotators, as well as the distribution of nugget types (e.g., trigger, regular, goal, not-a-nugget) over multiple assessors for each utterance [15].

REFERENCES

[1] DeVault, D., Leuski, A. and Sagae, K.: Toward Learning and Evaluation of Dialogue Policies with Text Examples, Proceedings of SIGDIAL 2011, pp. 39-48 (2011).
[2] Fleiss, J. L.: Measuring Nominal Scale Agreement among Many Raters, Psychological Bulletin, Vol. 76, No. 5, pp. 378-382 (1971).
[3] Galley, M., Brockett, C., Sordoni, A., Ji, Y., Auli, M., Quirk, C., Mitchell, M., Gao, J. and Dolan, B.: ∆BLEU: A Discriminative Metric for Generation Tasks with Intrinsically Diverse Targets, Proceedings of ACL 2015, pp. 445-450 (2015).
[4] Hartikainen, M., Salonen, E.-P. and Turunen, M.: Subjective Evaluation of Spoken Dialogue Systems Using SERVQUAL Method, Proceedings of INTERSPEECH 2004-ICSLP (2004).
[5] Higashinaka, R., Funakoshi, K., Kobayashi, Y. and Inaba, M.: The Dialogue Breakdown Detection Challenge: Task Description, Datasets, and Evaluation Metrics, Proceedings of LREC 2016 (2016).
[6] Hone, K. S. and Graham, R.: Towards a Tool for the Subjective Assessment of Speech System Interfaces (SASSI), Natural Language Engineering, Vol. 6, No. 3-4, pp. 287-303 (2000).
[7] Lin, C.-Y.: ROUGE: A Package for Automatic Evaluation of Summaries, Proceedings of the Workshop on Text Summarization Branches Out, pp. 74-81 (2004).
[8] Lin, J. and Demner-Fushman, D.: Will Pyramids Built of Nuggets Topple Over?, Proceedings of HLT/NAACL 2006, pp. 383-390 (2006).
[9] Lowe, R., Pow, N., Serban, I. V. and Pineau, J.: The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems, Proceedings of SIGDIAL 2015, pp. 285-294 (2015).
[10] Mitamura, T., Shima, H., Sakai, T., Kando, N., Mori, T., Takeda, K., Lin, C.-Y., Song, R., Lin, C.-J. and Lee, C.-W.: Overview of the NTCIR-8 ACLIA Tasks: Advanced Cross-Lingual Information Access, Proceedings of NTCIR-8, pp. 15-24 (2010).
[11] Nenkova, A., Passonneau, R. and McKeown, K.: The Pyramid Method: Incorporating Human Content Selection Variation in Summarization Evaluation, ACM Transactions on Speech and Language Processing, Vol. 4, No. 2 (2007).
[12] Paek, T.: Toward Evaluation that Leads to Best Practices: Reconciling Dialog Evaluation in Research and Industry, Bridging the Gap: Academic and Industrial Research in Dialogue Technologies Workshop Proceedings, pp. 40-47 (2007).
[13] Papineni, K., Roukos, S., Ward, T. and Zhu, W.-J.: BLEU: a Method for Automatic Evaluation of Machine Translation, Proceedings of ACL 2002, pp. 311-318 (2002).
[14] Randolph, J. J.: Free-marginal Multirater Kappa (Multirater κ_free): An Alternative to Fleiss' Fixed Marginal Multirater Kappa, Joensuu Learning and Instruction Symposium 2005 (2005).
[15] Sakai, T.: Towards Automatic Evaluation of Multi-Turn Dialogues: A Task Design that Leverages Inherently Subjective Annotations, Proceedings of EVIA 2017 (2017).
[16] Sakai, T. and Dou, Z.: Summaries, Ranked Retrieval and Sessions: A Unified Framework for Information Access Evaluation, Proceedings of ACM SIGIR 2013, pp. 473-482 (2013).
[17] Sakai, T. and Kato, M. P.: One Click One Revisited: Enhancing Evaluation based on Information Units, Proceedings of AIRS 2012 (2012).
[18] Sakai, T., Kato, M. P. and Song, Y.-I.: Click the Search Button and Be Happy: Evaluating Direct and Immediate Information Access, Proceedings of ACM CIKM 2011, pp. 621-630 (2011).
[19] Shang, L., Sakai, T., Li, H., Higashinaka, R., Miyao, Y., Arase, Y. and Nomoto, M.: Overview of the NTCIR-13 Short Text Conversation Task, Proceedings of NTCIR-13 (2017).
[20] Shang, L., Sakai, T., Lu, Z., Li, H., Higashinaka, R. and Miyao, Y.: Overview of the NTCIR-12 Short Text Conversation Task, Proceedings of NTCIR-12, pp. 473-484 (2016).
[21] Walker, M. A., Litman, D. J., Kamm, C. A. and Abella, A.: PARADISE: A Framework for Evaluating Spoken Dialogue Agents, Proceedings of ACL 1997, pp. 271-280 (1997).
[22] Walker, M. A., Passoneau, R. and Boland, J. E.: Quantitative and Qualitative Evaluation of Darpa Communicator Spoken Dialogue Systems, Proceedings of ACL 2001, pp. 515-522 (2001).
[23] Wang, Y., Sherman, G., Lin, J. and Efron, M.: Assessor Differences and User Preferences in Tweet Timeline Generation, Proceedings of ACM SIGIR 2015, pp. 615-624 (2015).