<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards Automatic Evaluation of Multi-Turn Dialogues: A Task Design that Leverages Inherently Subjective Annotations</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tetsuya Sakai</string-name>
          <email>tetsuyasakai@acm.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Waseda University</institution>
          ,
          <addr-line>Tokyo</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <fpage>24</fpage>
      <lpage>30</lpage>
      <abstract>
        <p>is paper proposes a design of a shared task whose ultimate goal is automatic evaluation of multi-turn, dyadic, textual helpdesk dialogues. e proposed task takes the form of an oine evaluation, where participating systems are given a dialogue as input, and output at least one of the following: (1) an estimated distribution of the annotators' quality ratings for that dialogue; and (2) an estimated distribution of the annotators' nugget type labels for each uerance block (i.e., a maximal sequence of consecutive posts by the same uerer) in that dialogue. is shared task should help researchers build automatic helpdesk dialogue systems that respond appropriately to inquiries by considering the diverse views of customers. e proposed task has been accepted as part of the NTCIR-14 Short Text Conversation (STC-3) task. While estimated and gold distributions are traditionally compared by means of root mean squared error, Jensen-Shannon divergence and the like, we propose a pilot measure that considers the order of the probability bins for the dialogue quality subtask, which we call Symmetric Normalised Order-aware Divergence (SNOD).</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>More and more companies are providing online customer services where a customer can exchange realtime textual messages about the company’s services and products with a (probably human) helpdesk operator. This means more convenience for the customers, but more burden on the companies. Hence, research in automatic helpdesk dialogue systems is highly practical as a means to reduce the cost for the companies. To design and tune automatic dialogue systems efficiently and cost-effectively, automatic evaluation of dialogue quality is desirable.</p>
      <p>As an initial step towards automatic evaluation of helpdesk dialogue systems, this paper proposes a design of a shared task. The proposed task takes the form of an offline evaluation, where participating systems are given a dialogue as input, and output at least one of the following: (1) an estimated distribution of the annotators’ quality ratings for that dialogue; and (2) an estimated distribution of the annotators’ nugget type labels for each utterance block (i.e., a maximal sequence of consecutive posts by the same utterer) in that dialogue. This shared task should help researchers build automatic helpdesk dialogue systems that respond appropriately to inquiries by considering the diverse views of customers. The proposed task has been accepted as part of the NTCIR-14 Short Text Conversation (STC-3) task.</p>
      <p>EVIA 2017, co-located with NTCIR-13, Tokyo, Japan. © 2017 Copyright held by the author. Copying permitted for private and academic purposes.</p>
      <p>
        While estimated and gold distributions are traditionally compared by means of root mean squared error, Jensen-Shannon divergence [
        <xref ref-type="bibr" rid="ref7">6</xref>
        ] and the like, we propose a pilot measure that considers the order of the probability bins for the dialogue quality subtask, which we call Symmetric Normalised Order-aware Divergence (SNOD).
      </p>
    </sec>
    <sec id="sec-2">
      <title>RELATED WORK</title>
    </sec>
    <sec id="sec-3">
      <title>Dialogue Evaluation in Brief</title>
      <p>
        While to our knowledge the task proposed in the present paper is
novel, dialogue evaluation is not a new problem. For example, it was
in 1997 that Walker et al. [
        <xref ref-type="bibr" rid="ref14">13</xref>
        ] proposed the PARADISE (PARAdigm
for Dialogue System Evaluation) framework for evaluating spoken
dialogue systems in the train timetable domain. In 2000, Hone and
Graham [
        <xref ref-type="bibr" rid="ref6">5</xref>
        ] proposed the questionnaire-based SASSI (Subjective
Assessment of Speech System Interfaces) for evaluating an in-car
speech interface. However, existing studies along these lines of
research mostly focus on closed-domain applications. The topics
that helpdesks need to deal with are far more diverse.
      </p>
      <p>
        Recently, Lowe et al. [
        <xref ref-type="bibr" rid="ref8">7</xref>
        ] released the Ubuntu dialogue corpus
and proposed a response selection task: systems are given a dialogue
context, one correct response immediately following the context
plus nine “fake” responses sampled from outside the dialogue, and
are required to select one or more appropriate responses from them.
eir eort is more similar to ours in that the topics discussed in
the dialogues are more diverse than those dealt with by traditional
dialogue evaluation. However, since their “correct” response is the
original response from the dialogue in their task, their task does not
involve manual annotations at all. In contrast, the present study
addresses the problem of annotators’ subjective decisions that may
be unanimous in some cases but contradictory in others. In fact,
our proposal is to preserve the diverse views in the annotations “as
is” and leverage them at the step of evaluation measure calculation,
as we shall describe in Section 3.
      </p>
      <p>
There are also a few recent efforts in evaluating non-task-oriented
dialogues, or dialogues without a specific purpose (e.g. [
        <xref ref-type="bibr" rid="ref2">2</xref>
]). The
Dialogue Breakdown Detection Challenge [
        <xref ref-type="bibr" rid="ref3 ref5">3, 4</xref>
        ] (Section 2.2) and
the NTCIR Short Text Conversation task [
        <xref ref-type="bibr" rid="ref13">12</xref>
        ] (Section 2.3) are also
non-task-oriented. However, we are more interested in helpdesk
dialogues that try to solve a specific problem that the customer is
facing.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Dialogue Breakdown Detection</title>
      <p>
        The Dialogue Breakdown Detection Challenge (DBDC) [
        <xref ref-type="bibr" rid="ref5">4</xref>
        ] provides human-machine non-task-oriented chats to participating systems. Participating systems are required to examine each machine utterance, and determine the likelihood that the utterance caused a dialogue breakdown (i.e., a point where it becomes difficult to continue a proper conversation any further due to an inappropriate utterance). More specifically, the system is required, for each machine utterance, to output an estimated distribution of multiple annotators over three categories: NB (not a breakdown), PB (possible breakdown), and B (breakdown). This enabled the task to evaluate systems by comparing the system’s estimated distribution with the gold distribution of the annotators in terms of Mean Squared Error and Jensen-Shannon divergence (see Section 3.2.1). The Third DBDC [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] will be concluded at the Dialog System Technology Challenges (DSTC6) workshop on December 10, 2017 (http://workshop.colips.org/dstc6/).
      </p>
      <p>Our proposed task was directly inspired by DBDC, which reflects the view that the annotations by different people can be inherently different, and that systems should be aware of that. We believe that this is particularly important for dialogue systems that need to face diverse customers, often in the absence of absolute truths. Thus, instead of trying to consolidate multiple annotations to form a single gold label, we represent the gold data as a distribution of annotators; we also require systems to produce estimated distributions, rather than an estimated judgement of an “average” person (see Maddalena et al. [
        <xref ref-type="bibr" rid="ref9">8</xref>
        ] and Sakai [
        <xref ref-type="bibr" rid="ref10">9</xref>
        ] for related discussions in the context of information retrieval evaluation). One important point to note is that while the probability bins (i.e., the categories) of DBDC are ordered (e.g., PB is closer to NB than B is), the aforementioned measures do not take this into account. In the present study, we introduce a pilot measure called Symmetric Normalised Order-aware Divergence (SNOD) as an attempt to solve this issue.</p>
    </sec>
    <sec id="sec-5">
      <title>Short Text Conversation</title>
      <p>
        The NTCIR Short Text Conversation (STC) task [
        <xref ref-type="bibr" rid="ref12 ref13">11, 12</xref>
        ], the largest task in NTCIR-12 and -13, also handles non-task-oriented dialogues. However, their task setting has so far considered single-turn dialogues only: given a Chinese Weibo (http://weibo.com) post (in the Chinese subtask), can participating systems either retrieve or generate an appropriate response?
      </p>
      <p>
        While the STC task also hires multiple assessors and requires them to label tweets based on four criteria (fluent, coherent, self-sufficient, substantial; see http://ntcirstc.noahlab.com.hk/STC2/submission_evaluation/EvaluationCriteriaCN.pdf), it consolidates the labels of the multiple assessors to form the final graded relevance level (e.g., relevant and highly relevant). While Sakai’s unanimity-aware gains [
        <xref ref-type="bibr" rid="ref10">9</xref>
        ] were applied for the NTCIR-13 STC-2 Chinese subtask to weight unanimous ratings more heavily compared to controversial ones, the task did not involve direct comparisons of gold and system distributions.
      </p>
      <p>As was mentioned earlier, the framework proposed in the present study has been accepted as part of the NTCIR-14 STC-3 task.</p>
    </sec>
    <sec id="sec-6">
      <title>DCH-1 Test Collection</title>
      <p>
        Recently, Zeng et al. [
        <xref ref-type="bibr" rid="ref15">14</xref>
        ] reported on a Chinese helpdesk-customer dialogue test collection and proposed a nugget-based evaluation measure called UCH, which was adapted from an information retrieval evaluation measure called U-measure [
        <xref ref-type="bibr" rid="ref11">10</xref>
        ]. They hired three annotators per dialogue (helpdesk-customer interactions mined from Weibo) and obtained dialogue-level quality annotations as well as nugget annotations, where a nugget is a minimal sequence of consecutive posts by the same utterer that helps towards problem solving. In essence, a nugget is a “relevant” portion within an utterance block.
      </p>
      <p>
        Each of the three annotators independently provided the following dialogue-level quality labels for each dialogue [
        <xref ref-type="bibr" rid="ref15">14</xref>
        ]:
      </p>
      <p>TS Task Statement: whether the task (i.e., the problem to be solved) is clearly stated by Customer (scores: {−1, 0, 1});</p>
      <p>TA Task Accomplishment: whether the task is actually accomplished (scores: {−1, 0, 1});</p>
      <p>CS Customer Satisfaction: whether Customer is likely to have been satisfied with the dialogue, and to what degree (scores: {−2, −1, 0, 1, 2});</p>
      <p>HA Helpdesk Appropriateness: whether Helpdesk provided appropriate information (scores: {−1, 0, 1});</p>
      <p>CA Customer Appropriateness: whether Customer provided appropriate information (scores: {−1, 0, 1}).</p>
      <p>
        Moreover, they independently identified the following types of nuggets within each utterance block [
        <xref ref-type="bibr" rid="ref15">14</xref>
        ]:
      </p>
      <p>CNUG0 Customer’s trigger nuggets. These are nuggets that define Customer’s initial problem, which directly caused Customer to contact Helpdesk.</p>
      <p>HNUG Helpdesk’s regular nuggets. These are nuggets in Helpdesk’s utterances that are useful from Customer’s point of view.</p>
      <p>CNUG Customer’s regular nuggets. These are nuggets in Customer’s utterances that are useful from Helpdesk’s point of view.</p>
      <p>HNUG* Helpdesk’s goal nuggets. These are nuggets in Helpdesk’s utterances which provide the Customer with a solution to the problem.</p>
      <p>CNUG* Customer’s goal nuggets. These are nuggets in Customer’s utterances which tell Helpdesk that Customer’s problem has been solved.</p>
      <p>In our proposed task design, we tentatively use the aforementioned annotation scheme of DCH-1, so that we can discuss our ideas with concrete examples. However, it should be noted that our proposal does not require that the dialogue-level and nugget annotations are done in exactly the same way as above. If we do use the above schema in a new task, however, it would enable us to directly utilise the DCH-1 test collection as training data for the participants, as we shall describe in the next section.</p>
    </sec>
    <sec id="sec-7">
      <title>PROPOSED TASK DESIGN</title>
      <p>Our ultimate goal is the automatic evaluation of Helpdesk-Customer (be it human-human or human-machine) dialogues; as a first step, we propose the following shared task.</p>
    </sec>
    <sec id="sec-8">
      <title>Task Definition</title>
      <p>Participating teams are provided with training data, for example, the aforementioned DCH-1 test collection with multiple dialogue-level and nugget annotations per dialogue. Then, in the test phase, each team is given a new set of dialogues as input. Let D be the set of dialogues in the test set. Two subtasks are described below. It is hoped that these offline (i.e., laboratory-based) tasks will serve as initial steps towards evaluating real customer-helpdesk dialogue systems.</p>
      <p>3.1.1 Dialogue Quality Subtask. First, participating systems are given a list of possible dialogue quality levels {1, 2, …, L} and the number of annotators a. Then, for each d ∈ D, participating systems are required to return an estimated distribution of annotators over the quality levels. For example, if L = 5 (five levels) for Customer Satisfaction (see Section 2.4) and a = 10, a participating system might return (2, 2, 2, 2, 2) (i.e., two annotators for each quality level). Note that the gold distribution can also be represented similarly, e.g., (0, 0, 1, 4, 5). Thus, the probability bins (i.e., dialogue quality levels) are ordered, just like those in the Dialogue Breakdown Detection Challenge (see Section 2.2).</p>
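      <p>To make the required output format concrete, the count-to-distribution conversion can be sketched as follows (a minimal Python illustration; the variable names and example counts other than those quoted above are our own):</p>

```python
# Sketch: representing quality-rating distributions for one dialogue.
# 'a' annotators each pick one of L quality levels; a system returns
# estimated per-level annotator counts, which normalise to probabilities.

L = 5   # number of quality levels (e.g., Customer Satisfaction)
a = 10  # number of annotators

estimated_counts = [2, 2, 2, 2, 2]  # example from the text
gold_counts = [0, 0, 1, 4, 5]       # example gold distribution

def to_distribution(counts, a):
    """Convert per-bin annotator counts into a probability distribution."""
    assert sum(counts) == a, "counts must account for every annotator"
    return [c / a for c in counts]

p = to_distribution(estimated_counts, a)   # [0.2, 0.2, 0.2, 0.2, 0.2]
p_star = to_distribution(gold_counts, a)   # [0.0, 0.0, 0.1, 0.4, 0.5]
print(p, p_star)
```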
      <p>If a system can thus accurately estimate the dialogue quality (e.g., customer satisfaction, task accomplishment, etc.) from different people’s viewpoints, that system can potentially serve as a component of a dialogue system for self-diagnosis and self-improvement, helping to satisfy diverse customers.</p>
      <p>3.1.2 Nugget Detection Subtask. First, participating systems are given a list of Customer nugget types (e.g., {CNUG0, CNUG, CNUG*, NaN}) and a list of Helpdesk nugget types (e.g., {HNUG, HNUG*, NaN}). For each d ∈ D, participating systems are required to return, for every utterance block in the dialogue, an estimated distribution of the annotators over nugget types. For example, if a = 10 and we have the nugget types from DCH-1, a participating system may return, for a particular Customer utterance block, an estimated distribution (3, 4, 3, 0), which means “Three annotators said CNUG0; four said CNUG; three said CNUG*; none said NaN.” Similarly, for a particular Helpdesk utterance block, the same system may return (4, 4, 2), which means “Four annotators said HNUG; four said HNUG*; two said NaN.” Note that the gold distribution for each utterance block can be represented similarly (in the DCH-1 collection, nuggets were generally identified as “relevant” parts within an utterance block; however, treating entire utterance blocks as nuggets may facilitate both the annotation and evaluation steps), and that the probability bins (i.e., nugget types) are nominal (i.e., unordered).</p>
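      <p>The per-block output can be sketched in the same spirit (a minimal Python illustration; the block identifiers are invented, and here NaN is simply the “not a nugget” label, not a floating-point value):</p>

```python
# Sketch: estimated annotator distributions over nugget types, one per
# utterance block, using the DCH-1 type lists quoted in the text.

a = 10  # total number of annotators
CUSTOMER_TYPES = ["CNUG0", "CNUG", "CNUG*", "NaN"]
HELPDESK_TYPES = ["HNUG", "HNUG*", "NaN"]

# Hypothetical system output for one dialogue: counts per nugget type,
# keyed by an invented utterance-block id (C = Customer, H = Helpdesk).
estimated = {
    "C1": dict(zip(CUSTOMER_TYPES, [3, 4, 3, 0])),  # example from the text
    "H1": dict(zip(HELPDESK_TYPES, [4, 4, 2])),
}

def block_distribution(counts, a):
    """Normalise one block's annotator counts into probabilities."""
    assert sum(counts.values()) == a
    return {t: c / a for t, c in counts.items()}

for block, counts in estimated.items():
    print(block, block_distribution(counts, a))
```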
      <p>If a system can accurately detect nuggets and their types, that will help researchers utilise nugget-based evaluation measures without having to manually construct nuggets. Nugget-based evaluation measures may provide more fine-grained diagnoses of systems’ failures than dialogue-level annotations: for example, if designed appropriately, they may be able to tell us exactly where in the dialogue a problem occurred, and why.</p>
    </sec>
    <sec id="sec-9">
      <title>Evaluation Measures</title>
      <p>
        3.2.1 Comparing Two Distributions with Existing Measures. Both of the aforementioned subtasks require a comparison of the system’s estimated probability distribution with the gold distribution. Figure 1 shows two examples where the estimated distribution is compared with the gold distribution when there are five bins (i.e., dialogue quality levels or nugget types). One might consider variational distance [
        <xref ref-type="bibr" rid="ref7">6</xref>
        ], which forms the basis of mean absolute error (MAE) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], as a candidate measure for comparing the estimated distribution p with the gold distribution p*:
V(p, p*) = Σ_i |p(i) − p*(i)| ,  (1)
where p(i) and p*(i) are the estimated and true probabilities for the i-th bin. Dividing it by two (representing the case with a complete lack of overlap) would ensure the [0, 1] range. However, accumulating the per-bin errors in this way is not ideal for our purpose, because variational distance cannot penalise “outlier” probabilities. For example, we argue that Figure 1(X) should be rated higher than (Y), because the latter distribution is too skewed compared to the gold distribution; the system is falsely confident that Bin 1 has a very high probability. However, the variational distance is clearly 0.4 (0.2 when normalised) for both (X) and (Y): the two systems are treated as equivalent according to this measure. For this reason, we prefer the measures discussed below over variational distance or MAE.
      </p>
      <p>[Figure 1: two estimated distributions, (X) and (Y), each compared with the same gold distribution over five bins.]</p>
      <p>Root mean squared error (RMSE) is often used along with MAE in the research community. This approach is more suitable for our purpose because of its ability to penalise outliers. In our case, we can define a measure based on Sum of Squares (SS) first:
SS(p, p*) = Σ_i (p(i) − p*(i))² .  (2)
Since the largest possible value of SS is 1² + 1² = 2, we can use Root Normalised Sum of Squares (RNSS), which has the [0, 1] range:
RNSS(p, p*) = √( SS(p, p*) / 2 ) .  (3)
For the examples in Figure 1, the RNSS of (X) is 0.1414 while that of (Y) is 0.1732; hence (X) outperforms (Y). The reader is referred to Chai and Draxler [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] for a discussion of the advantages of RMSE (which is similar to RNSS) over MAE.
      </p>
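      <p>RNSS can be sketched directly from Eqs. 2-3. The five-bin distributions below are hypothetical reconstructions chosen to reproduce the RNSS values quoted in the text (0.1414 and 0.1732), not the exact Figure 1 values:</p>

```python
import math

# RNSS(p, p*) = sqrt(SS(p, p*) / 2), where SS is the sum of squared
# per-bin differences; dividing by the maximum SS of 2 keeps it in [0, 1].

def rnss(p, p_star):
    ss = sum((pi - qi) ** 2 for pi, qi in zip(p, p_star))
    return math.sqrt(ss / 2)

gold = [0.2, 0.2, 0.2, 0.2, 0.2]  # hypothetical gold distribution
x = [0.3, 0.1, 0.2, 0.3, 0.1]     # RNSS = 0.1414
y = [0.4, 0.1, 0.1, 0.2, 0.2]     # RNSS = 0.1732: the skewed, over-confident
                                  # estimate is now penalised more heavily
print(round(rnss(x, gold), 4), round(rnss(y, gold), 4))
```

As a further check, a point-mass estimate against a uniform three-bin gold distribution gives RNSS = 0.5774, the value the text quotes for Figure 3(I).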
      <p>
        Another measure that can distinguish the difference between Figure 1(X) and (Y) is the (normalised, symmetric version of) Jensen-Shannon divergence (JSD) [
        <xref ref-type="bibr" rid="ref7">6</xref>
        ], which we denote as JSD(p, p*). (The original definition of the Jensen-Shannon divergence assigns a weight to each probability distribution; our definition of JSD equals the “L divergence” of Lin [6] divided by two.) First, for probability distributions p1 and p2, the Kullback-Leibler divergence (KLD), which is not symmetric, is defined as:
KLD(p1, p2) = Σ_i p1(i) log2( p1(i) / p2(i) ) .  (4)
Note that the above is undefined if p2(i) = 0; JSD avoids this limitation as described below. For a given pair of distributions p and p*, let pM be a probability distribution such that, for every bin i, pM(i) = ( p(i) + p*(i) ) / 2. Then JSD, which is symmetric, is defined as:
JSD(p, p*) = ( KLD(p, pM) + KLD(p*, pM) ) / 2 .  (5)
Thus, by introducing pM, we can avoid the aforementioned limitation of KLD, since p(i) &gt; 0 implies that pM(i) &gt; 0 also. Moreover, provided that the logarithm base in Eq. 4 is two, the above JSD has the [0, 1] range. Lin [
        <xref ref-type="bibr" rid="ref7">6</xref>
        ] proves that the above form of JSD is bounded above by the normalised variational distance (see Eq. 1):
JSD(p, p*) ≤ V(p, p*) / 2 .  (6)
For the examples shown in Figure 1, JSD(p, p*) = (0.0408 + 0.0372)/2 = 0.0390 for (X), and JSD(p, p*) = (0.0490 + 0.0490)/2 = 0.0490 for (Y). Again, (X) is considered to be superior.
      </p>
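      <p>A minimal sketch of Eqs. 4-5, checked against the uniform-gold example from Figure 3, whose JSD of 0.4591 is quoted in the text:</p>

```python
import math

# JSD(p, p*) is the mean of the KL divergences of p and p* from their
# midpoint distribution pM (Eqs. 4-5). With base-2 logarithms it lies
# in [0, 1], and pM(i) is positive wherever either input has mass,
# so the measure is always defined.

def kld(p1, p2):
    """Kullback-Leibler divergence, base 2; 0 * log(0/x) is taken as 0."""
    return sum(p * math.log2(p / q) for p, q in zip(p1, p2) if p > 0)

def jsd(p, p_star):
    pm = [(pi + qi) / 2 for pi, qi in zip(p, p_star)]
    return (kld(p, pm) + kld(p_star, pm)) / 2

uniform = [1 / 3, 1 / 3, 1 / 3]
print(round(jsd([1, 0, 0], uniform), 4))  # 0.4591, as quoted for Figure 3(I)
```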
      <p>3.2.2 Comparing Two Distributions with Order-Aware Measures. For the dialogue quality subtask, the probability bins are ordinal, but the aforementioned measures do not take that into account. For example, compare Figure 2(a) with (d), and (b) with (c) (the left half of each figure), where we have L = 3 ordinal bins and the true and estimated distributions are represented in blue and red, respectively. Because RNSS and JSD are summations of differences across the bins, they give the same score to (a) and (d) (RNSS=1, JSD=1), and to (b) and (c) (RNSS=0.8819, JSD=1). However, for ordinal bins, it is clear that (d) is better than (a), and (c) is better than (b). The problem is that there is no notion of distance between different bins. Hence we propose a new measure for comparing two distributions where bins are ordinal.</p>
      <p>Let A be the set of bins used in the task, where |A| = L (&gt; 1). First, we define the sets of bins with nonzero probabilities B* = {i | p*(i) &gt; 0} (⊆ A) and B = {i | p(i) &gt; 0} (⊆ A). Then, given estimated and gold distributions p and p*, we define Order-aware Divergence as:
OD(p, p*) = (1/|B*|) Σ_{i ∈ B*} Σ_{j ∈ A, j ≠ i} |i − j| ( p(j) − p*(j) )² .  (7)
It can be observed that OD is not symmetric: for every nonzero bin i of p*, it computes a sum of weighted squares for the other bins, where the weight is given as the distance between i and every other bin j. Hence, B = B* is a sufficient condition that implies the symmetry of OD. We will come back to this point later with a few examples.</p>
      <p>Symmetric Order-aware Divergence (SOD) can easily be defined as:
SOD(p, p*) = ( OD(p, p*) + OD(p*, p) ) / 2 .  (8)</p>
      <p>To ensure that the measure has the [0, 1] range, we should consider the maximum possible value of OD for a given L: it is clear from the definition of OD that in situations such as p(1) = 1 and p*(L) = 1, that is, when both estimated and gold distributions occupy exactly one bin and the two bins are as far apart as possible from each other, the worst-case OD is given by (L − 1) · 1² = L − 1. Hence, Normalised Order-aware Divergence (NOD) and Symmetric Normalised Order-aware Divergence (SNOD) may be defined as:
NOD(p, p*) = OD(p, p*) / (L − 1) ,  (9)
SNOD(p, p*) = SOD(p, p*) / (L − 1) .  (10)
Note that SNOD is symmetric, but NOD is generally not.</p>
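      <p>The definitions above can be sketched directly. The example distributions are our reconstructions of Figure 2, panels (a) and (d), consistent with the NOD values the text quotes for them (1 and 0.5):</p>

```python
# Order-aware Divergence (Eq. 7) and its symmetric / normalised
# variants (Eqs. 8-10). Bins are 1-indexed, as in the text.

def od(p, p_star):
    bins = range(1, len(p) + 1)
    b_star = [i for i in bins if p_star[i - 1] > 0]  # nonzero gold bins
    total = 0.0
    for i in b_star:
        for j in bins:
            if j != i:
                total += abs(i - j) * (p[j - 1] - p_star[j - 1]) ** 2
    return total / len(b_star)

def nod(p, p_star):
    return od(p, p_star) / (len(p) - 1)

def snod(p, p_star):
    return (nod(p, p_star) + nod(p_star, p)) / 2

# Reconstructions of Figure 2 with L = 3 and all gold mass on bin 1:
gold = [1, 0, 0]
print(nod([0, 0, 1], gold), snod([0, 0, 1], gold))  # 1.0 1.0  (panel (a))
print(nod([0, 1, 0], gold), snod([0, 1, 0], gold))  # 0.5 0.5  (panel (d))
```

Putting all the estimated mass in the adjacent bin thus scores 0.5 rather than the maximum 1, reflecting the linear distance weighting.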
      <p>Figure 2, which we have mentioned earlier, contains the NOD and SNOD scores for (a)-(d). The right half of the figure, (a)’-(d)’, which swaps the estimated and gold distributions, is used for computing SNOD. It can be observed that the SNOD score goes down as we move from (a) to (d). Hence (d) is considered better than (a), and (c) is considered better than (b). In particular, note that while the (S)NOD for (a) is 1, the maximum possible value, that for (d) is 0.5, reflecting the linear weighting scheme of OD.</p>
      <p>Figure 3 provides a few other examples with L = 3: this time, the gold distribution is uniform. While RNSS and JSD give the same score to (I) and (II) (RNSS=0.5774, JSD=0.4591), and to (III) and (IV) (RNSS=0.3333, JSD=0.2075), it can be observed that the SNOD score goes down as we move from (I) to (IV).</p>
      <p>Finally, we compute the SNOD scores for the examples given in Figure 1, where L = 5: the results are shown in Figure 4. It can be observed that SNOD prefers (X) to the more skewed (Y). Moreover, note that B = B* holds for these examples, since both probability distributions cover all the bins. Hence NOD(p, p*) = NOD(p*, p) = SNOD(p, p*) holds. (Another sufficient condition for the symmetry of (N)OD is |B| = |B*| = 1 and B ≠ B*; that is, p*(i) = 1 for a particular i and p(j) = 1 for a particular j (≠ i). See Figure 2(a) and (d).)</p>
      <p>To sum up, we propose to use RNSS, JSD, and SNOD for comparing the probability distributions in the dialogue quality subtask (since the bins are ordered), and to use RNSS and JSD for comparing the probability distributions in the nugget detection subtask (since the bins are nominal).</p>
      <p>
        3.2.3 Dialogue Quality Measures. The Dialogue Quality subtask needs to compare, for each dialogue, the system’s estimated distribution of a annotators over L quality levels with the gold distribution. Let a(i) be the system’s estimated number of annotators who chose Level i for a dialogue, and let a*(i) be the corresponding true number, so that Σ_{i=1}^{L} a(i) = Σ_{i=1}^{L} a*(i) = a. Hence, for each dialogue d, we can construct probability distributions p, p* by letting p(i) = a(i)/a and p*(i) = a*(i)/a for i = 1, …, L, and compute M(d) = M(p, p*), where M ∈ {RNSS, JSD, SNOD}. Figure 5(a) shows a conceptual diagram of how these measures are computed.
      </p>
      <p>[Figure 2: ordinal-bin examples (a)-(d), with NOD and SNOD scores; Figure 3: examples (I)-(IV) against a uniform gold distribution; Figure 4: NOD and SNOD scores for the Figure 1 examples.]</p>
      <p>3.2.4 Nugget Detection Measures. The Nugget Detection subtask first needs to evaluate, for each utterance block, the accuracy of the system’s estimated distribution of annotators over nugget types, and then consolidate the results for the entire dialogue. (This is the macroaveraging approach, where we assume that each dialogue is as important as any other, as it represents a particular customer experience. An alternative would be the microaveraging approach, which views each utterance block to be as important as any other; the latter implies that longer dialogues impact the overall system performance more heavily, which is not necessarily what we want in the present study.)</p>
      <p>Let TC be the number of possible Customer nugget types (including NaN), and let TH be the number of possible Helpdesk nugget types (including NaN). For example, if the Customer nugget types are CNUG0, CNUG, CNUG*, and NaN, then TC = 4; if the Helpdesk nugget types are HNUG, HNUG*, and NaN, then TH = 3. Let BC(d) be the set of Customer utterance blocks of a given test dialogue d, and let BH(d) be the set of Helpdesk utterance blocks for d.</p>
      <p>For each Customer block bC ∈ BC(d), let a(i) be the system’s estimated number of annotators who chose the i-th Customer nugget type (1 ≤ i ≤ TC) for bC; let a*(i) be the corresponding true number of annotators. Note that for any block bC, Σ_{i=1}^{TC} a(i) = Σ_{i=1}^{TC} a*(i) = a, since we have a total of a annotators. Hence, for each Customer utterance block bC, we can construct probability distributions p, p* by letting p(i) = a(i)/a and p*(i) = a*(i)/a for i = 1, …, TC, and compute M(bC) = M(p, p*), where M ∈ {RNSS, JSD}. Figure 5(b) shows a conceptual diagram of how the measures are computed for a Customer utterance block.</p>
      <p>Similarly, for each Helpdesk block bH ∈ BH(d), we can compute M(bH), where M ∈ {RNSS, JSD}.</p>
      <p>e entire dialogue d can then be evaluated by (weighted) average
RNSS and (weighted) average JSD:</p>
      <p>α Õ
=
+
jBC ¹dºj bC 2BC ¹dº
1 α Õ
jBH ¹dºj bH 2BH ¹dº</p>
      <p>M¹bC º
M¹bH º ;
(12)
(13)
where α ¹0 α 1º is a parameter for emphasising Customer or
Helpdesk results and where M 2 fRNSS; JSDg.</p>
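      <p>A minimal sketch of this consolidation step, with invented per-block scores standing in for real M(b) values computed by RNSS or JSD:</p>

```python
# Weighted average of per-block measure scores for one dialogue, and the
# macroaverage over the test set. The per-block scores below are
# invented placeholders, not real evaluation output.

def wa_measure(customer_scores, helpdesk_scores, alpha=0.5):
    """waM(d): alpha weights the Customer side, (1 - alpha) the Helpdesk side."""
    c = sum(customer_scores) / len(customer_scores)
    h = sum(helpdesk_scores) / len(helpdesk_scores)
    return alpha * c + (1 - alpha) * h

# Hypothetical test set of two dialogues, each a pair of
# (Customer block scores, Helpdesk block scores):
dialogues = [
    ([0.10, 0.30], [0.20]),
    ([0.40], [0.10, 0.20, 0.30]),
]
wam = [wa_measure(c, h) for c, h in dialogues]
mean_wam = sum(wam) / len(wam)  # macroaverage: each dialogue counts equally
print([round(w, 3) for w in wam], round(mean_wam, 3))
```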
      <p>Finally, the participating systems can be compared in terms of mean (weighted) Average RNSS and mean (weighted) Average JSD:
meanwaM = (1/|D|) Σ_{d ∈ D} waM(d) .  (13)</p>
    </sec>
    <sec id="sec-11">
      <title>CONCLUSIONS</title>
      <p>This paper proposed a design of a shared task whose ultimate goal is automatic evaluation of multi-turn, dyadic, textual helpdesk dialogues. The proposed task takes the form of an offline evaluation, where participating systems are given a dialogue as input, and output at least one of the following: (1) an estimated distribution of the annotators’ quality ratings for that dialogue; and (2) an estimated distribution of the annotators’ nugget type labels for each utterance block in that dialogue. This shared task should help researchers build automatic helpdesk dialogue systems that respond appropriately to inquiries by considering the diverse views of customers. The proposed framework has been accepted as part of the NTCIR-14 Short Text Conversation task; we plan to provide the proposed tasks for both Chinese and English dialogues.</p>
      <p>We also proposed SNOD, a pilot measure that considers the order of the probability bins for the dialogue quality subtask. In our future work, the properties of the measures considered in this paper will be examined with real dialogue data.</p>
    </sec>
    <sec id="sec-10">
      <title>ACKNOWLEDGEMENTS</title>
      <p>I thank the EVIA reviewers who gave me constructive comments, especially Reviewer 1 who pointed out the limitation of RNSS and JSD for the purpose of comparing two distributions where the categories are ordered. This led me to my proposal of SNOD.</p>
      <p>[Figure 5: conceptual diagrams of (a) the dialogue quality evaluation measures and (b) the nugget detection evaluation measures: for each dialogue or utterance block, the participating system’s estimated distribution over quality levels (Level 1, …, Level L) or nugget types (Type 1, …, Type T) is compared with the gold distribution.]</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.</given-names>
            <surname>Chai</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.R.</given-names>
            <surname>Draxler</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Root Mean Square Error (RMSE) or Mean Absolute Error (MAE)? - Arguments against avoiding RMSE in the Literature</article-title>
          .
          <source>Geoscientific Model Development</source>
          <volume>7</volume>
          (
          <year>2014</year>
          ),
          <fpage>1247</fpage>
          -
          <lpage>1250</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Michel</given-names>
            <surname>Galley</surname>
          </string-name>
          , Chris Brocke,
          <string-name>
            <surname>Alessandro</surname>
            <given-names>Sordoni</given-names>
          </string-name>
          , Yangfeng Ji, Michael Auli, Chris irk, Margaret Mitchell,
          <string-name>
            <surname>Jianfeng Gao</surname>
            , and
            <given-names>Bill</given-names>
          </string-name>
          <string-name>
            <surname>Dolan</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>ΔBLEU: A Discriminative Metric for Generation Tasks with Intrinsically Diverse Targets</article-title>
          .
          <source>In Proceedings of ACL</source>
          <year>2015</year>
          .
          <fpage>445</fpage>
          -
          <lpage>450</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Ryuichiro</given-names>
            <surname>Higashinaka</surname>
          </string-name>
          , Kotaro Funakoshi, Michimasa Inaba, Yuiko Tsunomori, Tetsuro Takahashi, and
          <string-name>
            <given-names>Nobuhiro</given-names>
            <surname>Kaji</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Overview of Dialogue Breakdown Detection Challenge 3</article-title>
          .
          <source>In Proceedings of Dialog System Technology Challenge</source>
          <volume>6</volume>
          (
          <issue>DSTC6</issue>
          ) Workshop.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Ryuichiro</given-names>
            <surname>Higashinaka</surname>
          </string-name>
          , Kotaro Funakoshi, Yuka Kobayashi, and
          <string-name>
            <given-names>Michimasa</given-names>
            <surname>Inaba</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>e Dialogue Breakdown Detection Challenge: Task Description, Datasets, and Evaluation Metrics</article-title>
          .
          <source>In Proceedings of LREC</source>
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Kate S.</given-names>
            <surname>Hone</surname>
          </string-name>
          and
          <string-name>
            <given-names>Robert</given-names>
            <surname>Graham</surname>
          </string-name>
          .
          <year>2000</year>
          .
          <article-title>Towards a Tool for the Subjective Assessment of Speech System Interfaces (SASSI)</article-title>
          .
          <source>Natural Language Engineering</source>
          <volume>6</volume>
          ,
          <issue>3-4</issue>
          (
          <year>2000</year>
          ),
          <fpage>287</fpage>
          -
          <lpage>303</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Jianhua</given-names>
            <surname>Lin</surname>
          </string-name>
          .
          <year>1991</year>
          .
          <article-title>Divergence Measures Based on the Shannon Entropy</article-title>
          .
          <source>IEEE Transactions on Information Theory</source>
          <volume>37</volume>
          ,
          <issue>1</issue>
          (
          <year>1991</year>
          ),
          <fpage>145</fpage>
          -
          <lpage>151</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Ryan</given-names>
            <surname>Lowe</surname>
          </string-name>
          , Nissan Pow,
          <string-name>
            <given-names>Iulian V.</given-names>
            <surname>Serban</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Joelle</given-names>
            <surname>Pineau</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>e Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems</article-title>
          .
          <source>In Proceedings of SIGDIAL</source>
          <year>2015</year>
          .
          <fpage>285</fpage>
          -
          <lpage>294</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Eddy</given-names>
            <surname>Maddalena</surname>
          </string-name>
          , Kevin Roitero, Gianluca Demartini, and
          <string-name>
            <given-names>Stefano</given-names>
            <surname>Mizzaro</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Considering Assessor Agreement in IR Evaluation</article-title>
          .
          <source>In Proceedings of ACM ICTIR</source>
          <year>2017</year>
          .
          <fpage>75</fpage>
          -
          <lpage>82</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Tetsuya</given-names>
            <surname>Sakai</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Unanimity-Aware Gain for Highly Subjective Assessments</article-title>
          .
          <source>In Proceedings of EVIA</source>
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Tetsuya</given-names>
            <surname>Sakai</surname>
          </string-name>
          and
          <string-name>
            <given-names>Zhicheng</given-names>
            <surname>Dou</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Summaries, Ranked Retrieval and Sessions: A Unified Framework for Information Access Evaluation</article-title>
          .
          <source>In Proceedings of ACM SIGIR</source>
          <year>2013</year>
          .
          <fpage>473</fpage>
          -
          <lpage>482</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Lifeng</given-names>
            <surname>Shang</surname>
          </string-name>
          , Tetsuya Sakai,
          <string-name>
            <given-names>Hang</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ryuichiro</given-names>
            <surname>Higashinaka</surname>
          </string-name>
          , Yusuke Miyao, Yuki Arase, and
          <string-name>
            <given-names>Masako</given-names>
            <surname>Nomoto</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Overview of the NTCIR-13 Short Text Conversation Task</article-title>
          .
          <source>In Proceedings of NTCIR-13.</source>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Lifeng</given-names>
            <surname>Shang</surname>
          </string-name>
          , Tetsuya Sakai, Zhengdong Lu,
          <string-name>
            <given-names>Hang</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ryuichiro</given-names>
            <surname>Higashinaka</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Yusuke</given-names>
            <surname>Miyao</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Overview of the NTCIR-12 Short Text Conversation Task</article-title>
          .
          <source>In Proceedings of NTCIR-12</source>
          .
          <fpage>473</fpage>
          -
          <lpage>484</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Marilyn A.</given-names>
            <surname>Walker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Diane J.</given-names>
            <surname>Litman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Candace A.</given-names>
            <surname>Kamm</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Alicia</given-names>
            <surname>Abella</surname>
          </string-name>
          .
          <year>1997</year>
          .
          <article-title>PARADISE: A Framework for Evaluating Spoken Dialogue Agents</article-title>
          .
          <source>In Proceedings of ACL</source>
          <year>1997</year>
          .
          <fpage>271</fpage>
          -
          <lpage>280</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Zhaohao</given-names>
            <surname>Zeng</surname>
          </string-name>
          , Cheng Luo, Lifeng Shang,
          <string-name>
            <given-names>Hang</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Tetsuya</given-names>
            <surname>Sakai</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Test Collections and Measures for Evaluating Customer-Helpdesk Dialogues</article-title>
          .
          <source>In Proceedings of EVIA</source>
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>