Towards Automatic Evaluation of Multi-Turn Dialogues: A Task Design that Leverages Inherently Subjective Annotations

Tetsuya Sakai
Waseda University, Tokyo, Japan
tetsuyasakai@acm.org

ABSTRACT

This paper proposes a design of a shared task whose ultimate goal is automatic evaluation of multi-turn, dyadic, textual helpdesk dialogues. The proposed task takes the form of an offline evaluation, where participating systems are given a dialogue as input, and output at least one of the following: (1) an estimated distribution of the annotators' quality ratings for that dialogue; and (2) an estimated distribution of the annotators' nugget type labels for each utterance block (i.e., a maximal sequence of consecutive posts by the same utterer) in that dialogue. This shared task should help researchers build automatic helpdesk dialogue systems that respond appropriately to inquiries by considering the diverse views of customers. The proposed task has been accepted as part of the NTCIR-14 Short Text Conversation (STC-3) task. While estimated and gold distributions are traditionally compared by means of root mean squared error, Jensen-Shannon divergence and the like, we propose a pilot measure that considers the order of the probability bins for the dialogue quality subtask, which we call Symmetric Normalised Order-aware Divergence (SNOD).

KEYWORDS

dialogues; divergence; evaluation; nuggets; probability distributions; test collections

Copying permitted for private and academic purposes. EVIA 2017, co-located with NTCIR-13, Tokyo, Japan. © 2017 Copyright held by the author.

1 INTRODUCTION

More and more companies are providing online customer services where a customer can exchange realtime textual messages about the company's services and products with a (probably human) helpdesk operator. This means more convenience for the customers, but more burden on the companies. Hence, research in automatic helpdesk dialogue systems is highly practical as a means to reduce the cost for the companies. To design and tune automatic dialogue systems efficiently and at a low cost, automatic evaluation of dialogue quality is desirable.

As an initial step towards automatic evaluation of helpdesk dialogue systems, this paper proposes a design of a shared task. The proposed task takes the form of an offline evaluation, where participating systems are given a dialogue as input, and output at least one of the following: (1) an estimated distribution of the annotators' quality ratings for that dialogue; and (2) an estimated distribution of the annotators' nugget type labels for each utterance block (i.e., a maximal sequence of consecutive posts by the same utterer) in that dialogue. This shared task should help researchers build automatic helpdesk dialogue systems that respond appropriately to inquiries by considering the diverse views of customers. The proposed task has been accepted as part of the NTCIR-14 Short Text Conversation (STC-3) task.

While estimated and gold distributions are traditionally compared by means of root mean squared error, Jensen-Shannon divergence [6] and the like, we propose a pilot measure that considers the order of the probability bins for the dialogue quality subtask, which we call Symmetric Normalised Order-aware Divergence (SNOD).

2 RELATED WORK

2.1 Dialogue Evaluation in Brief

While to our knowledge the task proposed in the present paper is novel, dialogue evaluation is not a new problem. For example, it was in 1997 that Walker et al. [13] proposed the PARADISE (PARAdigm for DIalogue System Evaluation) framework for evaluating spoken dialogue systems in the train timetable domain. In 2000, Hone and Graham [5] proposed the questionnaire-based SASSI (Subjective Assessment of Speech System Interfaces) for evaluating an in-car speech interface. However, existing studies along these lines of research mostly focus on closed-domain applications. The topics that helpdesks need to deal with are far more diverse.

Recently, Lowe et al. [7] released the Ubuntu dialogue corpus and proposed a response selection task: systems are given a dialogue context, one correct response immediately following the context plus nine "fake" responses sampled from outside the dialogue, and are required to select one or more appropriate responses from them. Their effort is more similar to ours in that the topics discussed in the dialogues are more diverse than those dealt with by traditional dialogue evaluation. However, since their "correct" response is the original response from the dialogue in their task, their task does not involve manual annotations at all. In contrast, the present study addresses the problem of annotators' subjective decisions that may be unanimous in some cases but contradictory in others. In fact, our proposal is to preserve the diverse views in the annotations "as is" and leverage them at the step of evaluation measure calculation, as we shall describe in Section 3.

There are also a few recent efforts in evaluating non-task-oriented dialogues, or dialogues without a specific purpose (e.g. [2]). The Dialogue Breakdown Detection Challenge [3, 4] (Section 2.2) and the NTCIR Short Text Conversation task [12] (Section 2.3) are also non-task-oriented. However, we are more interested in helpdesk dialogues that try to solve a specific problem that the customer is facing.

2.2 Dialogue Breakdown Detection

The Dialogue Breakdown Detection Challenge (DBDC) [4] provides human-machine non-task-oriented chats to participating systems. Participating systems are required to examine each machine utterance, and determine the likelihood that the utterance caused a dialogue breakdown (i.e., a point where it becomes difficult to continue a proper conversation any further due to an inappropriate utterance). More specifically, the system is required, for each machine utterance, to output an estimated distribution of multiple annotators over three categories: NB (not a breakdown), PB (possible breakdown), and B (breakdown). This enabled the task to evaluate systems by comparing the system's estimated distribution with the gold distribution of the annotators in terms of Mean Squared Error and Jensen-Shannon divergence (See Section 3.2.1). The Third DBDC [3] will be concluded at Dialog System Technology Challenges (DSTC6) on December 10, 2017¹.

Our proposed task was directly inspired by DBDC, which reflects the view that the annotations by different people can be inherently different, and that systems should be aware of that. We believe that this is particularly important for dialogue systems that need to face diverse customers, often in the absence of absolute truths. Thus, instead of trying to consolidate multiple annotations to form a single gold label, we represent the gold data as a distribution of annotators; we also require systems to produce estimated distributions, rather than an estimated judgement of an "average" person². One important point to note is that while the probability bins (i.e., the categories) of DBDC are ordered (e.g., PB is closer to NB than B is), the aforementioned measures do not take this into account. In the present study, we introduce a pilot measure called Symmetric Normalised Order-aware Divergence (SNOD) as an attempt to solve this issue.

¹ http://workshop.colips.org/dstc6/
² See Maddalena et al. [8] and Sakai [9] for related discussions in the context of information retrieval evaluation.

2.3 Short Text Conversation

The NTCIR Short Text Conversation (STC) task [11, 12], the largest task in NTCIR-12 and -13, also handles non-task-oriented dialogues. However, their task setting has so far considered single-turn dialogues only: given a Chinese Weibo³ post (in the Chinese subtask), can participating systems either retrieve or generate an appropriate response?

While the STC task also hires multiple assessors and requires them to label tweets based on four criteria (fluent, coherent, self-sufficient, substantial⁴), they consolidate the labels of the multiple assessors to form the final graded relevance level (e.g., relevant and highly relevant). While Sakai's unanimity-aware gains [9] were applied for the NTCIR-13 STC-2 Chinese subtask to weight unanimous ratings more heavily compared to controversial ones, the task did not involve direct comparisons of gold and system distributions.

As was mentioned earlier, the framework proposed in the present study has been accepted as part of the NTCIR-14 STC-3 task.

³ http://weibo.com
⁴ http://ntcirstc.noahlab.com.hk/STC2/submission evaluation/EvaluationCriteriaCN.pdf

2.4 DCH-1 Test Collection

Recently, Zeng et al. [14] reported on a Chinese helpdesk-customer dialogue test collection and proposed a nugget-based evaluation measure called UCH, which was adapted from an information retrieval evaluation measure called U-measure [10]. They hired three annotators per dialogue (helpdesk-customer interactions mined from Weibo) and obtained dialogue-level quality annotations as well as nugget annotations, where a nugget is a minimal sequence of consecutive posts by the same utterer that helps towards problem solving. In essence, a nugget is a "relevant" portion within an utterance block.

Each of the three annotators independently provided the following dialogue-level quality labels for each dialogue [14]:

TS Task Statement: whether the task (i.e., the problem to be solved) is clearly stated by Customer (scores: {-1, 0, 1});
TA Task Accomplishment: whether the task is actually accomplished (scores: {-1, 0, 1});
CS Customer Satisfaction: whether Customer is likely to have been satisfied with the dialogue, and to what degree (scores: {-2, -1, 0, 1, 2});
HA Helpdesk Appropriateness: whether Helpdesk provided appropriate information (scores: {-1, 0, 1});
CA Customer Appropriateness: whether Customer provided appropriate information (scores: {-1, 0, 1}).

Moreover, they independently identified the following types of nuggets within each utterance block [14]:

CNUG0 Customer's trigger nuggets. These are nuggets that define Customer's initial problem, which directly caused Customer to contact Helpdesk.
HNUG Helpdesk's regular nuggets. These are nuggets in Helpdesk's utterances that are useful from Customer's point of view.
CNUG Customer's regular nuggets. These are nuggets in Customer's utterances that are useful from Helpdesk's point of view.
HNUG∗ Helpdesk's goal nuggets. These are nuggets in Helpdesk's utterances which provide the Customer with a solution to the problem.
CNUG∗ Customer's goal nuggets. These are nuggets in Customer's utterances which tell Helpdesk that Customer's problem has been solved.

In our proposed task design, we tentatively use the aforementioned annotation scheme of DCH-1, so that we can discuss our ideas with concrete examples. However, it should be noted that our proposal does not require that the dialogue-level and nugget annotations are done in exactly the same way as above. If we do use the above schema in a new task, however, it would enable us to directly utilise the DCH-1 test collection as training data for the participants, as we shall describe in the next section.

3 PROPOSED TASK DESIGN

Our ultimate goal is the automatic evaluation of Helpdesk-Customer (be it human-human or human-machine) dialogues; as a first step, we propose the following shared task.

3.1 Task Definition

Participating teams are provided with training data, for example, the aforementioned DCH-1 test collection with multiple dialogue-level and nugget annotations per dialogue. Then, in the test phase, each team is given a new set of dialogues as input. Let D be the set of dialogues in the test set. Two subtasks are described below. It is hoped that these offline (i.e., laboratory-based) tasks will serve as initial steps towards evaluating real customer-helpdesk dialogue systems.

3.1.1 Dialogue Quality Subtask. First, participating systems are given a list of possible dialogue quality levels {1, 2, . . . , L} and the number of annotators a. Then, for each d ∈ D, participating systems are required to return an estimated distribution of annotators over the quality levels. For example, if L = 5 (five levels) for Customer Satisfaction (See Section 2.4) and a = 10, a participating system might return (2, 2, 2, 2, 2) (i.e., two annotators for each quality level). Note that the gold distribution can also be represented similarly, e.g., (0, 0, 1, 4, 5). Thus, the probability bins (i.e., dialogue quality levels) are ordered, just like those in the Dialogue Breakdown Detection Challenge (See Section 2.2).

If a system can thus accurately estimate the dialogue quality (e.g., customer satisfaction, task accomplishment, etc.) from different people's viewpoints, that system can potentially serve as a component of a dialogue system for self-diagnosis and self-improvement for satisfying diverse customers.

3.1.2 Nugget Detection Subtask. First, participating systems are given a list of Customer nugget types (e.g., {CNUG0, CNUG, CNUG∗, NaN}) and a list of Helpdesk nugget types (e.g., {HNUG, HNUG∗, NaN}). For each d ∈ D, participating systems are required to return, for every utterance block in the dialogue, an estimated distribution of the annotators over nugget types. For example, if a = 10 and we have the nugget types from DCH-1, a participating system may return, for a particular Customer utterance block, an estimated distribution (3, 4, 3, 0), which means "Three annotators said CNUG0; four said CNUG; three said CNUG∗; none said NaN." Similarly, for a particular Helpdesk utterance block, the same system may return (4, 4, 2), which means "Four annotators said HNUG; four said HNUG∗; two said NaN." Note that the gold distribution for each utterance block can be represented similarly⁵, and that the probability bins (i.e., nugget types) are nominal (i.e., unordered). If a system can accurately detect nuggets and their types, that will help researchers utilise nugget-based evaluation measures without having to manually construct nuggets. Nugget-based evaluation measures may provide more fine-grained diagnoses of systems' failures than dialogue-level annotations: for example, if designed appropriately, they may be able to tell us exactly where in the dialogue a problem occurred, and why.

⁵ In the DCH-1 collection, nuggets were generally identified as "relevant" parts within an utterance block. However, treating entire utterance blocks as nuggets may facilitate both the annotation and evaluation steps.
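To make the expected system output concrete, the following is a minimal sketch of how the estimates described above could be represented and turned into probability distributions. The dictionary layout and field names are hypothetical (the official submission format is not specified here); the counts are simply the running examples from Sections 3.1.1 and 3.1.2.

```python
# Hypothetical representation of one system's estimates for a single dialogue.
# The field names are illustrative only; the counts follow the examples above.
estimate = {
    "dialogue_id": "example-001",
    "quality": {"CS": [2, 2, 2, 2, 2]},   # a = 10 annotators over L = 5 satisfaction levels
    "nugget_counts": [
        {"utterer": "customer", "counts": [3, 4, 3, 0]},  # CNUG0, CNUG, CNUG*, NaN
        {"utterer": "helpdesk", "counts": [4, 4, 2]},     # HNUG, HNUG*, NaN
    ],
}

def to_distribution(counts):
    """Turn annotator counts a(i) into probabilities p(i) = a(i) / a."""
    a = sum(counts)
    return [c / a for c in counts]

print(to_distribution(estimate["quality"]["CS"]))               # [0.2, 0.2, 0.2, 0.2, 0.2]
print(to_distribution(estimate["nugget_counts"][0]["counts"]))  # [0.3, 0.4, 0.3, 0.0]
```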
3.2 Evaluation Measures

3.2.1 Comparing Two Distributions with Existing Measures. Both of the aforementioned subtasks require a comparison of the system's estimated probability distribution with the gold distribution. Figure 1 shows two examples where the estimated distribution is compared with the gold distribution when there are five bins (i.e., dialogue quality levels or nugget types). One might consider variational distance [6], which forms the basis of mean absolute error (MAE) [1], as a candidate measure for comparing the estimated distribution p with the gold distribution p*:

V(p, p*) = \sum_i |p(i) - p*(i)| ,  (1)

where p(i), p*(i) are the estimated and true probabilities for the i-th bin. Dividing it by two (representing the case with a complete lack of overlap) would ensure the [0, 1] range. However, accumulating the per-bin errors in this way is not ideal for our purpose, because variational distance cannot penalise "outlier" probabilities. For example, we argue that Figure 1(X) should be rated higher than (Y), because the latter distribution is too skewed compared to the gold distribution; the system is falsely confident that Bin 1 has a very high probability. However, the variational distance is clearly 0.4 (0.2 when normalised) for both (X) and (Y): the two systems are treated as equivalent according to this measure. For this reason, we prefer the measures discussed below over variational distance or MAE.

Figure 1: Examples of true and estimated probability distributions. (Each of panels (X) and (Y) compares a five-bin true distribution p* with an estimated distribution p.)

Root mean squared error (RMSE) is often used along with MAE in the research community. This approach is more suitable for our purpose because of its ability to penalise outliers. In our case, we can define a measure based on Sum of Squares (SS) first:

SS(p, p*) = \sum_i (p(i) - p*(i))^2 .  (2)
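As a quick numerical illustration of Eqs. (1) and (2), the sketch below contrasts two made-up estimates against a made-up gold distribution. These numbers are not the exact distributions of Figure 1, only an analogous pair, and NumPy is assumed to be available.

```python
import numpy as np

def variational_distance(p, p_star):
    """Eq. (1): sum of absolute per-bin differences."""
    return float(np.abs(np.asarray(p) - np.asarray(p_star)).sum())

def sum_of_squares(p, p_star):
    """Eq. (2): sum of squared per-bin differences."""
    return float(((np.asarray(p) - np.asarray(p_star)) ** 2).sum())

p_gold = [0.2, 0.2, 0.2, 0.2, 0.2]
p_mild = [0.3, 0.3, 0.2, 0.1, 0.1]   # errors spread over several bins
p_skew = [0.4, 0.2, 0.2, 0.1, 0.1]   # the same total error concentrated on Bin 1

for p in (p_mild, p_skew):
    print(variational_distance(p, p_gold), sum_of_squares(p, p_gold))
```

Both estimates obtain the same variational distance (0.4), but SS separates them (0.04 versus 0.06), which is the behaviour that motivates the squared-error and divergence measures in this section.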
Since the largest possible value of SS is 1^2 + 1^2 = 2, we can use Root Normalised Sum of Squares (RNSS), which has the [0, 1] range:

RNSS(p, p*) = \sqrt{ SS(p, p*) / 2 } .  (3)

For the examples in Figure 1, the RNSS of (X) is 0.1414 while that of (Y) is 0.1732; hence (X) outperforms (Y). The reader is referred to Chai and Draxler [1] for a discussion of the advantages of RMSE (which is similar to RNSS) over MAE.

Another measure that can distinguish the difference between Figure 1(X) and (Y) is the (normalised, symmetric version of) Jensen-Shannon divergence (JSD) [6], which we denote as JSD(p, p*)⁶. First, for probability distributions p1 and p2, the Kullback-Leibler divergence (KLD), which is not symmetric, is defined as:

KLD(p1, p2) = \sum_{i : p1(i) > 0} p1(i) \log_2 ( p1(i) / p2(i) ) .  (4)

Note that the above is undefined if p2(i) = 0: JSD avoids this limitation as described below.

For a given pair of distributions p and p*, let pM be a probability distribution such that, for every bin i, pM(i) = (p(i) + p*(i))/2. Then, JSD, which is symmetric, is defined as:

JSD(p, p*) = ( KLD(p, pM) + KLD(p*, pM) ) / 2 .  (5)

Thus, by introducing pM, we can avoid the aforementioned limitation of KLD, since p1(i) > 0 implies that pM(i) > 0 also. Moreover, provided that the logarithm base in Eq. 4 is two, the above JSD has the [0, 1] range. Lin [6] proves that the above form of JSD is bounded above by the normalised variational distance (See Eq. 1):

JSD(p, p*) ≤ V(p, p*) / 2 .  (6)

For the examples shown in Figure 1, JSD(p, p*) = (0.0408 + 0.0372)/2 = 0.0390 for (X), and JSD(p, p*) = (0.0490 + 0.0490)/2 = 0.0490 for (Y). Again, (X) is considered to be superior.

⁶ The original definition of the Jensen-Shannon divergence assigns a weight to each probability distribution. Our definition of JSD equals the "L divergence" of Lin [6] divided by two.
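The following is one possible implementation of RNSS (Eq. 3) and JSD (Eqs. 4-5). The function names are ours and NumPy is assumed, so this is a non-authoritative sketch rather than the official scorer; the example call uses the two distributions from Section 3.1.1 (estimate (2, 2, 2, 2, 2) and gold (0, 0, 1, 4, 5) with a = 10).

```python
import numpy as np

def rnss(p, p_star):
    """Eq. (3): root normalised sum of squares, in the [0, 1] range."""
    p, p_star = np.asarray(p, dtype=float), np.asarray(p_star, dtype=float)
    return float(np.sqrt(((p - p_star) ** 2).sum() / 2.0))

def kld(p1, p2):
    """Eq. (4): Kullback-Leibler divergence with base-2 logs, over bins where p1(i) > 0."""
    p1, p2 = np.asarray(p1, dtype=float), np.asarray(p2, dtype=float)
    nonzero = p1 > 0
    return float((p1[nonzero] * np.log2(p1[nonzero] / p2[nonzero])).sum())

def jsd(p, p_star):
    """Eq. (5): symmetric Jensen-Shannon divergence via the mixture distribution pM."""
    p, p_star = np.asarray(p, dtype=float), np.asarray(p_star, dtype=float)
    p_m = (p + p_star) / 2.0
    return (kld(p, p_m) + kld(p_star, p_m)) / 2.0

p_est, p_gold = [0.2, 0.2, 0.2, 0.2, 0.2], [0.0, 0.0, 0.1, 0.4, 0.5]
print(rnss(p_est, p_gold), jsd(p_est, p_gold))
```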
3.2.2 Comparing Two Distributions with Order-Aware Measures. For the dialogue quality subtask, the probability bins are ordinal, but the aforementioned measures do not take that into account. For example, compare Figure 2(a) with (d), and (b) with (c) (the left half in each figure), where we have L = 3 ordinal bins and the true and the estimated distributions are represented in blue and red, respectively. Because RNSS and JSD are summations of differences across the bins, they give the same score to (a) and (d) (RNSS=1, JSD=1), and to (b) and (c) (RNSS=0.8819, JSD=1). However, for ordinal bins, it is clear that (d) is better than (a), and (c) is better than (b). The problem is that there is no notion of distance between different bins. Hence we propose a new measure for comparing two distributions where bins are ordinal.

Let A be the set of bins used in the task, where |A| = L (> 1). First, we define the sets of bins with nonzero probabilities B* = {i | p*(i) > 0} (⊆ A) and B = {i | p(i) > 0} (⊆ A). Then, given estimated and gold distributions p and p*, we define Order-aware Divergence as:

OD(p, p*) = (1 / |B*|) \sum_{i ∈ B*} \sum_{j ∈ A, j ≠ i} |i - j| (p(j) - p*(j))^2 .  (7)

It can be observed that OD is not symmetric: for every nonzero bin i of p*, it computes a sum of weighted squares for the other bins, where the weight is given as the distance between i and every other bin j. Hence, B* = B is a sufficient condition that implies the symmetry of OD. We will come back to this point later with a few examples.

Symmetric Order-aware Divergence (SOD) can easily be defined as:

SOD(p, p*) = ( OD(p, p*) + OD(p*, p) ) / 2 .  (8)

To ensure that the measure has the [0, 1] range, we should consider the maximum possible value of OD for a given L: it is clear from the definition of OD that in situations where p(1) = 1 and p*(L) = 1, that is, when both estimated and gold distributions occupy exactly one bin and the two bins are as far apart as possible from each other, the worst-case OD is given by (L - 1) * 1^2 = L - 1. Hence, Normalised Order-aware Divergence (NOD) and Symmetric Normalised Order-aware Divergence (SNOD) may be defined as:

NOD(p, p*) = OD(p, p*) / (L - 1) ,  (9)

SNOD(p, p*) = SOD(p, p*) / (L - 1) .  (10)

Note that SNOD is symmetric, but NOD is generally not.

Figure 2, which we have mentioned earlier, contains the NOD and SNOD scores for (a)-(d). The right halves of the figures, (a)'-(d)', which swap the estimated and gold distributions, are used for computing SNOD. It can be observed that the SNOD score goes down as we move from (a) to (d). Hence (d) is considered better than (a), and (c) is considered better than (b). In particular, note that while the (S)NOD for (a) is 1, the maximum possible value, that for (d) is 0.5, reflecting the linear weighting scheme of OD.

Figure 2: Examples of SNOD scores where L = 3, p*(1) = 1. (SNOD for panels (a)-(d): 1, 0.6944, 0.6111, 0.5.)

Figure 3 provides a few other examples with L = 3: this time, the gold distribution is uniform. While RNSS and JSD give the same score to (I) and (II) (RNSS=0.5774, JSD=0.4591), and to (III) and (IV) (RNSS=0.3333, JSD=0.2075), it can be observed that the SNOD score goes down as we move from (I) to (IV).

Figure 3: Examples of SNOD scores where L = 3, p*(1) = p*(2) = p*(3) = 1/3. (SNOD for panels (I)-(IV): 0.2407, 0.2130, 0.1111, 0.1019.)

Finally, we compute the SNOD scores for the examples given in Figure 1, where L = 5: the results are shown in Figure 4. It can be observed that SNOD prefers (X) to the more skewed (Y). Moreover, note that B* = B holds for these examples, since both probability distributions cover all the bins. Hence NOD(p, p*) = NOD(p*, p) = SNOD(p, p*) holds⁷.

Figure 4: Examples of SNOD scores for L = 5, with the probability distributions discussed in Figure 1. (SNOD: (X) 0.0227, (Y) 0.0380.)

⁷ Another sufficient condition for the symmetry of (N)OD is: |B*| = |B| = 1 and B* ≠ B. That is, p*(i) = 1 for a particular i and p(j) = 1 for a particular j (≠ i). See Figure 2(a) and (d).

To sum up, we propose to use RNSS, JSD, and SNOD for comparing the probability distributions in the dialogue quality subtask (since the bins are ordered), and to use RNSS and JSD for comparing the probability distributions in the nugget detection subtask (since the bins are nominal).
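Analogously, a minimal sketch of the order-aware measures of Eqs. (7)-(10) might look as follows (the helper names are ours, and bins 1, . . . , L are mapped to 0-based list indices). The two calls reproduce the values reported above for Figure 2(a) and (d).

```python
def od(p, p_star):
    """Eq. (7): Order-aware Divergence (not symmetric); the bins are ordinal."""
    L = len(p)
    b_star = [i for i in range(L) if p_star[i] > 0]  # B*: bins where the gold probability is nonzero
    total = sum(abs(i - j) * (p[j] - p_star[j]) ** 2
                for i in b_star for j in range(L) if j != i)
    return total / len(b_star)

def nod(p, p_star):
    """Eq. (9): normalised OD, in the [0, 1] range."""
    return od(p, p_star) / (len(p) - 1)

def snod(p, p_star):
    """Eq. (10): symmetric normalised order-aware divergence."""
    return (od(p, p_star) + od(p_star, p)) / 2.0 / (len(p) - 1)

# Figure 2(a): gold mass on bin 1, estimated mass on bin 3 (the farthest bin, L = 3).
print(snod([0, 0, 1], [1, 0, 0]))   # 1.0, the maximum possible value
# Figure 2(d): estimated mass on bin 2, which is closer to the gold bin.
print(snod([0, 1, 0], [1, 0, 0]))   # 0.5
```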
3.2.3 Dialogue Quality Measures. The Dialogue Quality subtask needs to compare, for each dialogue, the system's estimated distribution of a annotators over L quality levels with the gold distribution. Let a(i) be the system's estimated number of annotators who chose Level i for a dialogue, and let a*(i) be the corresponding true number, so that \sum_{i=1}^{L} a(i) = \sum_{i=1}^{L} a*(i) = a. Hence, for each dialogue d, we can construct probability distributions p, p* by letting p(i) = a(i)/a, p*(i) = a*(i)/a for i = 1, . . . , L, and compute M(d) = M(p, p*), where M ∈ {RNSS, JSD, SNOD}. Figure 5(a) shows a conceptual diagram of how these measures are computed.

The participating systems can then be compared in terms of mean RNSS, mean JSD, and mean SNOD:

meanM = (1 / |D|) \sum_{d ∈ D} M(d) ,  (11)

where M ∈ {RNSS, JSD, SNOD}.

3.2.4 Nugget Detection Measures. The Nugget Detection subtask first needs to evaluate, for each utterance block, the accuracy of the system's estimated distribution of annotators over nugget types; then consolidate the results for the entire dialogue⁸.

⁸ This is the macroaveraging approach, where we assume that each dialogue is as important as any other, as it represents a particular customer experience. An alternative would be the microaveraging approach, which views each utterance block to be as important as any other. The latter implies that longer dialogues impact the overall system performance more heavily, which is not necessarily what we want in the present study.

Let TC be the number of possible Customer nugget types (including NaN), and let TH be the number of possible Helpdesk nugget types (including NaN). For example, if the Customer nugget types are CNUG0, CNUG, CNUG∗, and NaN, then TC = 4; if the Helpdesk nugget types are HNUG, HNUG∗, and NaN, then TH = 3. Let BC(d) be the set of Customer utterance blocks of a given test dialogue d, and let BH(d) be the set of Helpdesk utterance blocks for d.

For each Customer block bC ∈ BC(d), let a(i) be the system's estimated number of annotators who chose the i-th Customer nugget type (1 ≤ i ≤ TC) for bC; let a*(i) be the corresponding true number of annotators. Note that for any block bC, \sum_{i=1}^{TC} a(i) = \sum_{i=1}^{TC} a*(i) = a, since we have a total of a annotators. Hence, for each Customer utterance block bC, we can construct probability distributions p, p* by letting p(i) = a(i)/a, p*(i) = a*(i)/a for i = 1, . . . , TC, and compute M(bC) = M(p, p*), where M ∈ {RNSS, JSD}. Figure 5(b) shows a conceptual diagram of how the measures are computed for a Customer utterance block. Similarly, for each Helpdesk block bH ∈ BH(d), we can compute M(bH) where M ∈ {RNSS, JSD}.

The entire dialogue d can then be evaluated by (weighted) average RNSS and (weighted) average JSD:

waM(d) = (α / |BC(d)|) \sum_{bC ∈ BC(d)} M(bC) + ((1 - α) / |BH(d)|) \sum_{bH ∈ BH(d)} M(bH) ,  (12)

where α (0 ≤ α ≤ 1) is a parameter for emphasising Customer or Helpdesk results and where M ∈ {RNSS, JSD}.

Finally, the participating systems can be compared in terms of mean (weighted) Average RNSS and mean (weighted) Average JSD:

meanwaM = (1 / |D|) \sum_{d ∈ D} waM(d) .  (13)

Figure 5: Conceptual diagrams of the proposed subtasks and the evaluation measures: (a) dialogue quality evaluation measures; (b) nugget detection evaluation measures.
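For completeness, a small sketch of how per-block scores could be aggregated with Eqs. (12) and (13) is given below; the per-block numbers are invented purely for illustration, and the function names are ours.

```python
from statistics import mean

def wa_m(customer_scores, helpdesk_scores, alpha=0.5):
    """Eq. (12): weighted average of per-block scores (e.g., RNSS or JSD) for one dialogue."""
    return alpha * mean(customer_scores) + (1 - alpha) * mean(helpdesk_scores)

def mean_wa_m(per_dialogue_scores):
    """Eq. (13): mean of waM over the test dialogues D."""
    return mean(per_dialogue_scores)

# Two hypothetical dialogues with invented per-block JSD values.
d1 = wa_m(customer_scores=[0.05, 0.10], helpdesk_scores=[0.02, 0.08, 0.04])
d2 = wa_m(customer_scores=[0.20], helpdesk_scores=[0.15, 0.25])
print(mean_wa_m([d1, d2]))
```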
4 CONCLUSIONS

This paper proposed a design of a shared task whose ultimate goal is automatic evaluation of multi-turn, dyadic, textual helpdesk dialogues. The proposed task takes the form of an offline evaluation, where participating systems are given a dialogue as input, and output at least one of the following: (1) an estimated distribution of the annotators' quality ratings for that dialogue; and (2) an estimated distribution of the annotators' nugget type labels for each utterance block in that dialogue. This shared task should help researchers build automatic helpdesk dialogue systems that respond appropriately to inquiries by considering the diverse views of customers. The proposed framework has been accepted as part of the NTCIR-14 Short Text Conversation task; we plan to provide the proposed tasks for both Chinese and English dialogues.

We also proposed SNOD, a pilot measure that considers the order of the probability bins for the dialogue quality subtask. In our future work, the properties of the measures considered in this paper will be examined with real dialogue data.

ACKNOWLEDGEMENTS

I thank the EVIA reviewers who gave me constructive comments, especially Reviewer 1 who pointed out the limitation of RNSS and JSD for the purpose of comparing two distributions where the categories are ordered. This led me to my proposal of SNOD.

REFERENCES

[1] T. Chai and R.R. Draxler. 2014. Root Mean Square Error (RMSE) or Mean Absolute Error (MAE)? – Arguments against avoiding RMSE in the Literature. Geoscientific Model Development 7 (2014), 1247–1250.
[2] Michel Galley, Chris Brockett, Alessandro Sordoni, Yangfeng Ji, Michael Auli, Chris Quirk, Margaret Mitchell, Jianfeng Gao, and Bill Dolan. 2015. ∆BLEU: A Discriminative Metric for Generation Tasks with Intrinsically Diverse Targets. In Proceedings of ACL 2015. 445–450.
[3] Ryuichiro Higashinaka, Kotaro Funakoshi, Michimasa Inaba, Yuiko Tsunomori, Tetsuro Takahashi, and Nobuhiro Kaji. 2017. Overview of Dialogue Breakdown Detection Challenge 3. In Proceedings of Dialog System Technology Challenge 6 (DSTC6) Workshop.
[4] Ryuichiro Higashinaka, Kotaro Funakoshi, Yuka Kobayashi, and Michimasa Inaba. 2016. The Dialogue Breakdown Detection Challenge: Task Description, Datasets, and Evaluation Metrics. In Proceedings of LREC 2016.
[5] Kate S. Hone and Robert Graham. 2000. Towards a Tool for the Subjective Assessment of Speech System Interfaces (SASSI). Natural Language Engineering 6, 3-4 (2000), 287–303.
[6] Jianhua Lin. 1991. Divergence Measures Based on the Shannon Entropy. IEEE Transactions on Information Theory 37, 1 (1991), 145–151.
[7] Ryan Lowe, Nissan Pow, Iulian V. Serban, and Joelle Pineau. 2015. The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems. In Proceedings of SIGDIAL 2015. 285–294.
[8] Eddy Maddalena, Kevin Roitero, Gianluca Demartini, and Stefano Mizzaro. 2017. Considering Assessor Agreement in IR Evaluation. In Proceedings of ACM ICTIR 2017. 75–82.
[9] Tetsuya Sakai. 2017. Unanimity-Aware Gain for Highly Subjective Assessments. In Proceedings of EVIA 2017.
[10] Tetsuya Sakai and Zhicheng Dou. 2013. Summaries, Ranked Retrieval and Sessions: A Unified Framework for Information Access Evaluation. In Proceedings of ACM SIGIR 2013. 473–482.
[11] Lifeng Shang, Tetsuya Sakai, Hang Li, Ryuichiro Higashinaka, Yusuke Miyao, Yuki Arase, and Masako Nomoto. 2017. Overview of the NTCIR-13 Short Text Conversation Task. In Proceedings of NTCIR-13.
[12] Lifeng Shang, Tetsuya Sakai, Zhengdong Lu, Hang Li, Ryuichiro Higashinaka, and Yusuke Miyao. 2016. Overview of the NTCIR-12 Short Text Conversation Task. In Proceedings of NTCIR-12. 473–484.
[13] Marilyn A. Walker, Diane J. Litman, Candace A. Kamm, and Alicia Abella. 1997. PARADISE: A Framework for Evaluating Spoken Dialogue Agents. In Proceedings of ACL 1997. 271–280.
[14] Zhaohao Zeng, Cheng Luo, Lifeng Shang, Hang Li, and Tetsuya Sakai. 2017. Test Collections and Measures for Evaluating Customer-Helpdesk Dialogues. In Proceedings of EVIA 2017.