Towards Automatic Evaluation of Multi-Turn Dialogues: A Task Design that Leverages Inherently Subjective Annotations

Tetsuya Sakai
Waseda University, Tokyo, Japan
tetsuyasakai@acm.org

ABSTRACT

This paper proposes a design of a shared task whose ultimate goal is automatic evaluation of multi-turn, dyadic, textual helpdesk dialogues. The proposed task takes the form of an offline evaluation, where participating systems are given a dialogue as input, and output at least one of the following: (1) an estimated distribution of the annotators' quality ratings for that dialogue; and (2) an estimated distribution of the annotators' nugget type labels for each utterance block (i.e., a maximal sequence of consecutive posts by the same utterer) in that dialogue. This shared task should help researchers build automatic helpdesk dialogue systems that respond appropriately to inquiries by considering the diverse views of customers. The proposed task has been accepted as part of the NTCIR-14 Short Text Conversation (STC-3) task. While estimated and gold distributions are traditionally compared by means of root mean squared error, Jensen-Shannon divergence and the like, we propose a pilot measure that considers the order of the probability bins for the dialogue quality subtask, which we call Symmetric Normalised Order-aware Divergence (SNOD).

KEYWORDS

dialogues; divergence; evaluation; nuggets; probability distributions; test collections

Copying permitted for private and academic purposes. EVIA 2017, co-located with NTCIR-13, Tokyo, Japan. © 2017 Copyright held by the author.

1 INTRODUCTION

More and more companies are providing online customer services where a customer can exchange realtime textual messages about the company's services and products with a (probably human) helpdesk operator. This means more convenience for the customers, but more burden on the companies. Hence, research in automatic helpdesk dialogue systems is highly practical as a means to reduce the cost for the companies. To design and tune automatic dialogue systems efficiently and at a low cost, automatic evaluation of dialogue quality is desirable.

As an initial step towards automatic evaluation of helpdesk dialogue systems, this paper proposes a design of a shared task. The proposed task takes the form of an offline evaluation, where participating systems are given a dialogue as input, and output at least one of the following: (1) an estimated distribution of the annotators' quality ratings for that dialogue; and (2) an estimated distribution of the annotators' nugget type labels for each utterance block (i.e., a maximal sequence of consecutive posts by the same utterer) in that dialogue. This shared task should help researchers build automatic helpdesk dialogue systems that respond appropriately to inquiries by considering the diverse views of customers. The proposed task has been accepted as part of the NTCIR-14 Short Text Conversation (STC-3) task.

While estimated and gold distributions are traditionally compared by means of root mean squared error, Jensen-Shannon divergence [6] and the like, we propose a pilot measure that considers the order of the probability bins for the dialogue quality subtask, which we call Symmetric Normalised Order-aware Divergence (SNOD).

2 RELATED WORK

2.1 Dialogue Evaluation in Brief

While to our knowledge the task proposed in the present paper is novel, dialogue evaluation is not a new problem. For example, it was in 1997 that Walker et al. [13] proposed the PARADISE (PARAdigm for DIalogue System Evaluation) framework for evaluating spoken dialogue systems in the train timetable domain. In 2000, Hone and Graham [5] proposed the questionnaire-based SASSI (Subjective Assessment of Speech System Interfaces) for evaluating an in-car speech interface. However, existing studies along these lines of research mostly focus on closed-domain applications. The topics that helpdesks need to deal with are far more diverse.

Recently, Lowe et al. [7] released the Ubuntu dialogue corpus and proposed a response selection task: systems are given a dialogue context, one correct response immediately following the context plus nine "fake" responses sampled from outside the dialogue, and are required to select one or more appropriate responses from them. Their effort is more similar to ours in that the topics discussed in the dialogues are more diverse than those dealt with by traditional dialogue evaluation. However, since their "correct" response is the original response from the dialogue in their task, their task does not involve manual annotations at all. In contrast, the present study addresses the problem of annotators' subjective decisions that may be unanimous in some cases but contradictory in others. In fact, our proposal is to preserve the diverse views in the annotations "as is" and leverage them at the step of evaluation measure calculation, as we shall describe in Section 3.

There are also a few recent efforts in evaluating non-task-oriented dialogues, or dialogues without a specific purpose (e.g. [2]). The Dialogue Breakdown Detection Challenge [3, 4] (Section 2.2) and the NTCIR Short Text Conversation task [12] (Section 2.3) are also non-task-oriented. However, we are more interested in helpdesk dialogues that try to solve a specific problem that the customer is facing.

2.2 Dialogue Breakdown Detection

The Dialogue Breakdown Detection Challenge (DBDC) [4] provides human-machine non-task-oriented chats to participating systems. Participating systems are required to examine each machine utterance, and determine the likelihood that the utterance caused a dialogue breakdown (i.e., a point where it becomes difficult to continue a proper conversation any further due to an inappropriate utterance). More specifically, the system is required, for each machine utterance, to output an estimated distribution of multiple annotators over three categories: NB (not a breakdown), PB (possible breakdown), and B (breakdown). This enabled the task to evaluate systems by comparing the system's estimated distribution with the gold distribution of the annotators in terms of Mean Squared Error and Jensen-Shannon divergence (See Section 3.2.1). The Third DBDC [3] will be concluded at Dialog System Technology Challenges (DSTC6) on December 10, 2017¹.

Our proposed task was directly inspired by DBDC, which reflects the view that the annotations by different people can be inherently different, and that systems should be aware of that. We believe that this is particularly important for dialogue systems that need to face diverse customers, often in the absence of absolute truths. Thus, instead of trying to consolidate multiple annotations to form a single gold label, we represent the gold data as a distribution of annotators; we also require systems to produce estimated distributions, rather than an estimated judgement of an "average" person². One important point to note is that while the probability bins (i.e., the categories) of DBDC are ordered (e.g., PB is closer to NB than B is), the aforementioned measures do not take this into account. In the present study, we introduce a pilot measure called Symmetric Normalised Order-aware Divergence (SNOD) as an attempt to solve this issue.

¹ http://workshop.colips.org/dstc6/
² See Maddalena et al. [8] and Sakai [9] for related discussions in the context of information retrieval evaluation.

2.3 Short Text Conversation

The NTCIR Short Text Conversation (STC) task [11, 12], the largest task in NTCIR-12 and -13, also handles non-task-oriented dialogues. However, their task setting has so far considered single-turn dialogues only: given a Chinese Weibo³ post (in the Chinese subtask), can participating systems either retrieve or generate an appropriate response?

While the STC task also hires multiple assessors and requires them to label tweets based on four criteria (fluent, coherent, self-sufficient, substantial⁴), they consolidate the labels of the multiple assessors to form the final graded relevance level (e.g., relevant and highly relevant). While Sakai's unanimity-aware gains [9] were applied for the NTCIR-13 STC-2 Chinese subtask to weight unanimous ratings more heavily compared to controversial ones, the task did not involve direct comparisons of gold and system distributions.

As was mentioned earlier, the framework proposed in the present study has been accepted as part of the NTCIR-14 STC-3 task.

³ http://weibo.com
⁴ http://ntcirstc.noahlab.com.hk/STC2/submission evaluation/EvaluationCriteriaCN.pdf

2.4 DCH-1 Test Collection

Recently, Zeng et al. [14] reported on a Chinese helpdesk-customer dialogue test collection and proposed a nugget-based evaluation measure called UCH, which was adapted from an information retrieval evaluation measure called U-measure [10]. They hired three annotators per dialogue (helpdesk-customer interactions mined from Weibo) and obtained dialogue-level quality annotations as well as nugget annotations, where a nugget is a minimal sequence of consecutive posts by the same utterer that helps towards problem solving. In essence, a nugget is a "relevant" portion within an utterance block.

Each of the three annotators independently provided the following dialogue-level quality labels for each dialogue [14]:

TS Task Statement: whether the task (i.e., the problem to be solved) is clearly stated by Customer (scores: {-1, 0, 1});
TA Task Accomplishment: whether the task is actually accomplished (scores: {-1, 0, 1});
CS Customer Satisfaction: whether Customer is likely to have been satisfied with the dialogue, and to what degree (scores: {-2, -1, 0, 1, 2});
HA Helpdesk Appropriateness: whether Helpdesk provided appropriate information (scores: {-1, 0, 1});
CA Customer Appropriateness: whether Customer provided appropriate information (scores: {-1, 0, 1}).

Moreover, they independently identified the following types of nuggets within each utterance block [14]:

CNUG0 Customer's trigger nuggets. These are nuggets that define Customer's initial problem, which directly caused Customer to contact Helpdesk.
HNUG Helpdesk's regular nuggets. These are nuggets in Helpdesk's utterances that are useful from Customer's point of view.
CNUG Customer's regular nuggets. These are nuggets in Customer's utterances that are useful from Helpdesk's point of view.
HNUG∗ Helpdesk's goal nuggets. These are nuggets in Helpdesk's utterances which provide the Customer with a solution to the problem.
CNUG∗ Customer's goal nuggets. These are nuggets in Customer's utterances which tell Helpdesk that Customer's problem has been solved.

In our proposed task design, we tentatively use the aforementioned annotation scheme of DCH-1, so that we can discuss our ideas with concrete examples. However, it should be noted that our proposal does not require that the dialogue-level and nugget annotations are done in exactly the same way as above. If we do use the above schema in a new task, however, it would enable us to directly utilise the DCH-1 test collection as training data for the participants, as we shall describe in the next section.

3 PROPOSED TASK DESIGN

Our ultimate goal is the automatic evaluation of Helpdesk-Customer (be it human-human or human-machine) dialogues; as a first step, we propose the following shared task.

3.1 Task Definition

Participating teams are provided with training data, for example, the aforementioned DCH-1 test collection with multiple dialogue-level and nugget annotations per dialogue. Then, in the test phase, each team is given a new set of dialogues as input. Let D be the set of dialogues in the test set. Two subtasks are described below. It is hoped that these offline (i.e., laboratory-based) tasks will serve as initial steps towards evaluating real customer-helpdesk dialogue systems.

3.1.1 Dialogue Quality Subtask. First, participating systems are given a list of possible dialogue quality levels {1, 2, . . . , L} and the number of annotators a. Then, for each d ∈ D, participating systems are required to return an estimated distribution of annotators over the quality levels. For example, if L = 5 (five levels) for Customer Satisfaction (See Section 2.4) and a = 10, a participating system might return (2, 2, 2, 2, 2) (i.e., two annotators for each quality level). Note that the gold distribution can also be represented similarly, e.g., (0, 0, 1, 4, 5). Thus, the probability bins (i.e., dialogue quality levels) are ordered, just like those in the Dialogue Breakdown Detection Challenge (See Section 2.2).

If a system can thus accurately estimate the dialogue quality (e.g., customer satisfaction, task accomplishment, etc.) from different people's viewpoints, that system can potentially serve as a component of a dialogue system for self-diagnosis and self-improvement for satisfying diverse customers.

3.1.2 Nugget Detection Subtask. First, participating systems are given a list of Customer nugget types (e.g., {CNUG0, CNUG, CNUG∗, NaN}) and a list of Helpdesk nugget types (e.g., {HNUG, HNUG∗, NaN}). For each d ∈ D, participating systems are required to return, for every utterance block in the dialogue, an estimated distribution of the annotators over nugget types. For example, if a = 10 and we have the nugget types from DCH-1, a participating system may return, for a particular Customer utterance block, an estimated distribution (3, 4, 3, 0), which means "Three annotators said CNUG0; four said CNUG; three said CNUG∗; none said NaN." Similarly, for a particular Helpdesk utterance block, the same system may return (4, 4, 2), which means "Four annotators said HNUG; four said HNUG∗; two said NaN." Note that the gold distribution for each utterance block can be represented similarly⁵, and that the probability bins (i.e., nugget types) are nominal (i.e., unordered). If a system can accurately detect nuggets and their types, that will help researchers utilise nugget-based evaluation measures without having to manually construct nuggets. Nugget-based evaluation measures may provide more fine-grained diagnoses of systems' failures than dialogue-level annotations: for example, if designed appropriately, they may be able to tell us exactly where in the dialogue a problem occurred, and why.

⁵ In the DCH-1 collection, nuggets were generally identified as "relevant" parts within an utterance block. However, treating entire utterance blocks as nuggets may facilitate both the annotation and evaluation steps.
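To make the expected system output concrete, the following is a minimal sketch of how the estimates described above could be represented and turned into probability distributions. The dictionary layout and field names are hypothetical (the official submission format is not specified here); the counts are simply the running examples from Sections 3.1.1 and 3.1.2.

```python
# Hypothetical representation of one system's estimates for a single dialogue.
# The field names are illustrative only; the counts follow the examples above.
estimate = {
    "dialogue_id": "example-001",
    "quality": {"CS": [2, 2, 2, 2, 2]},   # a = 10 annotators over L = 5 satisfaction levels
    "nugget_counts": [
        {"utterer": "customer", "counts": [3, 4, 3, 0]},  # CNUG0, CNUG, CNUG*, NaN
        {"utterer": "helpdesk", "counts": [4, 4, 2]},     # HNUG, HNUG*, NaN
    ],
}

def to_distribution(counts):
    """Turn annotator counts a(i) into probabilities p(i) = a(i) / a."""
    a = sum(counts)
    return [c / a for c in counts]

print(to_distribution(estimate["quality"]["CS"]))               # [0.2, 0.2, 0.2, 0.2, 0.2]
print(to_distribution(estimate["nugget_counts"][0]["counts"]))  # [0.3, 0.4, 0.3, 0.0]
```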
3.2 Evaluation Measures

3.2.1 Comparing Two Distributions with Existing Measures. Both of the aforementioned subtasks require a comparison of the system's estimated probability distribution with the gold distribution. Figure 1 shows two examples where the estimated distribution is compared with the gold distribution when there are five bins (i.e., dialogue quality levels or nugget types). One might consider variational distance [6], which forms the basis of mean absolute error (MAE) [1], as a candidate measure for comparing the estimated distribution p with the gold distribution p*:

V(p, p*) = \sum_i |p(i) - p*(i)| ,  (1)

where p(i), p*(i) are the estimated and true probabilities for the i-th bin. Dividing it by two (representing the case with a complete lack of overlap) would ensure the [0, 1] range. However, accumulating the per-bin errors in this way is not ideal for our purpose, because variational distance cannot penalise "outlier" probabilities. For example, we argue that Figure 1(X) should be rated higher than (Y), because the latter distribution is too skewed compared to the gold distribution; the system is falsely confident that Bin 1 has a very high probability. However, the variational distance is clearly 0.4 (0.2 when normalised) for both (X) and (Y): the two systems are treated as equivalent according to this measure. For this reason, we prefer the measures discussed below over variational distance or MAE.

Figure 1: Examples of true and estimated probability distributions. (Each of panels (X) and (Y) compares a five-bin true distribution p* with an estimated distribution p.)

Root mean squared error (RMSE) is often used along with MAE in the research community. This approach is more suitable for our purpose because of its ability to penalise outliers. In our case, we can define a measure based on Sum of Squares (SS) first:

SS(p, p*) = \sum_i (p(i) - p*(i))^2 .  (2)
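As a quick numerical illustration of Eqs. (1) and (2), the sketch below contrasts two made-up estimates against a made-up gold distribution. These numbers are not the exact distributions of Figure 1, only an analogous pair, and NumPy is assumed to be available.

```python
import numpy as np

def variational_distance(p, p_star):
    """Eq. (1): sum of absolute per-bin differences."""
    return float(np.abs(np.asarray(p) - np.asarray(p_star)).sum())

def sum_of_squares(p, p_star):
    """Eq. (2): sum of squared per-bin differences."""
    return float(((np.asarray(p) - np.asarray(p_star)) ** 2).sum())

p_gold = [0.2, 0.2, 0.2, 0.2, 0.2]
p_mild = [0.3, 0.3, 0.2, 0.1, 0.1]   # errors spread over several bins
p_skew = [0.4, 0.2, 0.2, 0.1, 0.1]   # the same total error concentrated on Bin 1

for p in (p_mild, p_skew):
    print(variational_distance(p, p_gold), sum_of_squares(p, p_gold))
```

Both estimates obtain the same variational distance (0.4), but SS separates them (0.04 versus 0.06), which is the behaviour that motivates the squared-error and divergence measures in this section.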
Since the largest possible value of SS is 1^2 + 1^2 = 2, we can use Root Normalised Sum of Squares (RNSS), which has the [0, 1] range:

RNSS(p, p*) = \sqrt{ SS(p, p*) / 2 } .  (3)

For the examples in Figure 1, the RNSS of (X) is 0.1414 while that of (Y) is 0.1732; hence (X) outperforms (Y). The reader is referred to Chai and Draxler [1] for a discussion of the advantages of RMSE (which is similar to RNSS) over MAE.

Another measure that can distinguish the difference between Figure 1(X) and (Y) is the (normalised, symmetric version of) Jensen-Shannon divergence (JSD) [6], which we denote as JSD(p, p*)⁶. First, for probability distributions p1 and p2, the Kullback-Leibler divergence (KLD), which is not symmetric, is defined as:

KLD(p1, p2) = \sum_{i : p1(i) > 0} p1(i) \log_2 ( p1(i) / p2(i) ) .  (4)

Note that the above is undefined if p2(i) = 0: JSD avoids this limitation as described below.

For a given pair of distributions p and p*, let pM be a probability distribution such that, for every bin i, pM(i) = (p(i) + p*(i))/2. Then, JSD, which is symmetric, is defined as:

JSD(p, p*) = ( KLD(p, pM) + KLD(p*, pM) ) / 2 .  (5)

Thus, by introducing pM, we can avoid the aforementioned limitation of KLD, since p1(i) > 0 implies that pM(i) > 0 also. Moreover, provided that the logarithm base in Eq. 4 is two, the above JSD has the [0, 1] range. Lin [6] proves that the above form of JSD is bounded above by the normalised variational distance (See Eq. 1):

JSD(p, p*) ≤ V(p, p*) / 2 .  (6)

For the examples shown in Figure 1, JSD(p, p*) = (0.0408 + 0.0372)/2 = 0.0390 for (X), and JSD(p, p*) = (0.0490 + 0.0490)/2 = 0.0490 for (Y). Again, (X) is considered to be superior.

⁶ The original definition of the Jensen-Shannon divergence assigns a weight to each probability distribution. Our definition of JSD equals the "L divergence" of Lin [6] divided by two.
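The following is one possible implementation of RNSS (Eq. 3) and JSD (Eqs. 4-5). The function names are ours and NumPy is assumed, so this is a non-authoritative sketch rather than the official scorer; the example call uses the two distributions from Section 3.1.1 (estimate (2, 2, 2, 2, 2) and gold (0, 0, 1, 4, 5) with a = 10).

```python
import numpy as np

def rnss(p, p_star):
    """Eq. (3): root normalised sum of squares, in the [0, 1] range."""
    p, p_star = np.asarray(p, dtype=float), np.asarray(p_star, dtype=float)
    return float(np.sqrt(((p - p_star) ** 2).sum() / 2.0))

def kld(p1, p2):
    """Eq. (4): Kullback-Leibler divergence with base-2 logs, over bins where p1(i) > 0."""
    p1, p2 = np.asarray(p1, dtype=float), np.asarray(p2, dtype=float)
    nonzero = p1 > 0
    return float((p1[nonzero] * np.log2(p1[nonzero] / p2[nonzero])).sum())

def jsd(p, p_star):
    """Eq. (5): symmetric Jensen-Shannon divergence via the mixture distribution pM."""
    p, p_star = np.asarray(p, dtype=float), np.asarray(p_star, dtype=float)
    p_m = (p + p_star) / 2.0
    return (kld(p, p_m) + kld(p_star, p_m)) / 2.0

p_est, p_gold = [0.2, 0.2, 0.2, 0.2, 0.2], [0.0, 0.0, 0.1, 0.4, 0.5]
print(rnss(p_est, p_gold), jsd(p_est, p_gold))
```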
3.2.2 Comparing Two Distributions with Order-Aware Measures. For the dialogue quality subtask, the probability bins are ordinal, but the aforementioned measures do not take that into account. For example, compare Figure 2(a) with (d), and (b) with (c) (the left half in each figure), where we have L = 3 ordinal bins and the true and the estimated distributions are represented in blue and red, respectively. Because RNSS and JSD are summations of differences across the bins, they give the same score to (a) and (d) (RNSS=1, JSD=1), and to (b) and (c) (RNSS=0.8819, JSD=1). However, for ordinal bins, it is clear that (d) is better than (a), and (c) is better than (b). The problem is that there is no notion of distance between different bins. Hence we propose a new measure for comparing two distributions where bins are ordinal.

Let A be the set of bins used in the task, where |A| = L (> 1). First, we define the sets of bins with nonzero probabilities B* = {i | p*(i) > 0} (⊆ A) and B = {i | p(i) > 0} (⊆ A). Then, given estimated and gold distributions p and p*, we define Order-aware Divergence as:

OD(p, p*) = (1 / |B*|) \sum_{i ∈ B*} \sum_{j ∈ A, j ≠ i} |i - j| (p(j) - p*(j))^2 .  (7)

It can be observed that OD is not symmetric: for every nonzero bin i of p*, it computes a sum of weighted squares for the other bins, where the weight is given as the distance between i and every other bin j. Hence, B* = B is a sufficient condition that implies the symmetry of OD. We will come back to this point later with a few examples.

Symmetric Order-aware Divergence (SOD) can easily be defined as:

SOD(p, p*) = ( OD(p, p*) + OD(p*, p) ) / 2 .  (8)

To ensure that the measure has the [0, 1] range, we should consider the maximum possible value of OD for a given L: it is clear from the definition of OD that in situations where p(1) = 1 and p*(L) = 1, that is, when both estimated and gold distributions occupy exactly one bin and the two bins are as far apart as possible from each other, the worst-case OD is given by (L - 1) * 1^2 = L - 1. Hence, Normalised Order-aware Divergence (NOD) and Symmetric Normalised Order-aware Divergence (SNOD) may be defined as:

NOD(p, p*) = OD(p, p*) / (L - 1) ,  (9)

SNOD(p, p*) = SOD(p, p*) / (L - 1) .  (10)

Note that SNOD is symmetric, but NOD is generally not.

Figure 2, which we have mentioned earlier, contains the NOD and SNOD scores for (a)-(d). The right halves of the figures, (a)'-(d)', which swap the estimated and gold distributions, are used for computing SNOD. It can be observed that the SNOD score goes down as we move from (a) to (d). Hence (d) is considered better than (a), and (c) is considered better than (b). In particular, note that while the (S)NOD for (a) is 1, the maximum possible value, that for (d) is 0.5, reflecting the linear weighting scheme of OD.

Figure 2: Examples of SNOD scores where L = 3, p*(1) = 1. (SNOD for panels (a)-(d): 1, 0.6944, 0.6111, 0.5.)

Figure 3 provides a few other examples with L = 3: this time, the gold distribution is uniform. While RNSS and JSD give the same score to (I) and (II) (RNSS=0.5774, JSD=0.4591), and to (III) and (IV) (RNSS=0.3333, JSD=0.2075), it can be observed that the SNOD score goes down as we move from (I) to (IV).

Figure 3: Examples of SNOD scores where L = 3, p*(1) = p*(2) = p*(3) = 1/3. (SNOD for panels (I)-(IV): 0.2407, 0.2130, 0.1111, 0.1019.)

Finally, we compute the SNOD scores for the examples given in Figure 1, where L = 5: the results are shown in Figure 4. It can be observed that SNOD prefers (X) to the more skewed (Y). Moreover, note that B* = B holds for these examples, since both probability distributions cover all the bins. Hence NOD(p, p*) = NOD(p*, p) = SNOD(p, p*) holds⁷.

Figure 4: Examples of SNOD scores for L = 5, with the probability distributions discussed in Figure 1. (SNOD: (X) 0.0227, (Y) 0.0380.)

⁷ Another sufficient condition for the symmetry of (N)OD is: |B*| = |B| = 1 and B* ≠ B. That is, p*(i) = 1 for a particular i and p(j) = 1 for a particular j (≠ i). See Figure 2(a) and (d).

To sum up, we propose to use RNSS, JSD, and SNOD for comparing the probability distributions in the dialogue quality subtask (since the bins are ordered), and to use RNSS and JSD for comparing the probability distributions in the nugget detection subtask (since the bins are nominal).
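Analogously, a minimal sketch of the order-aware measures of Eqs. (7)-(10) might look as follows (the helper names are ours, and bins 1, . . . , L are mapped to 0-based list indices). The two calls reproduce the values reported above for Figure 2(a) and (d).

```python
def od(p, p_star):
    """Eq. (7): Order-aware Divergence (not symmetric); the bins are ordinal."""
    L = len(p)
    b_star = [i for i in range(L) if p_star[i] > 0]  # B*: bins where the gold probability is nonzero
    total = sum(abs(i - j) * (p[j] - p_star[j]) ** 2
                for i in b_star for j in range(L) if j != i)
    return total / len(b_star)

def nod(p, p_star):
    """Eq. (9): normalised OD, in the [0, 1] range."""
    return od(p, p_star) / (len(p) - 1)

def snod(p, p_star):
    """Eq. (10): symmetric normalised order-aware divergence."""
    return (od(p, p_star) + od(p_star, p)) / 2.0 / (len(p) - 1)

# Figure 2(a): gold mass on bin 1, estimated mass on bin 3 (the farthest bin, L = 3).
print(snod([0, 0, 1], [1, 0, 0]))   # 1.0, the maximum possible value
# Figure 2(d): estimated mass on bin 2, which is closer to the gold bin.
print(snod([0, 1, 0], [1, 0, 0]))   # 0.5
```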
3.2.3 Dialogue Quality Measures. The Dialogue Quality subtask needs to compare, for each dialogue, the system's estimated distribution of a annotators over L quality levels with the gold distribution. Let a(i) be the system's estimated number of annotators who chose Level i for a dialogue, and let a*(i) be the corresponding true number, so that \sum_{i=1}^{L} a(i) = \sum_{i=1}^{L} a*(i) = a. Hence, for each dialogue d, we can construct probability distributions p, p* by letting p(i) = a(i)/a, p*(i) = a*(i)/a for i = 1, . . . , L, and compute M(d) = M(p, p*), where M ∈ {RNSS, JSD, SNOD}. Figure 5(a) shows a conceptual diagram of how these measures are computed.

The participating systems can then be compared in terms of mean RNSS, mean JSD, and mean SNOD:

meanM = (1 / |D|) \sum_{d ∈ D} M(d) ,  (11)

where M ∈ {RNSS, JSD, SNOD}.

3.2.4 Nugget Detection Measures. The Nugget Detection subtask first needs to evaluate, for each utterance block, the accuracy of the system's estimated distribution of annotators over nugget types; then consolidate the results for the entire dialogue⁸.

⁸ This is the macroaveraging approach, where we assume that each dialogue is as important as any other, as it represents a particular customer experience. An alternative would be the microaveraging approach, which views each utterance block to be as important as any other. The latter implies that longer dialogues impact the overall system performance more heavily, which is not necessarily what we want in the present study.

Let TC be the number of possible Customer nugget types (including NaN), and let TH be the number of possible Helpdesk nugget types (including NaN). For example, if the Customer nugget types are CNUG0, CNUG, CNUG∗, and NaN, then TC = 4; if the Helpdesk nugget types are HNUG, HNUG∗, and NaN, then TH = 3. Let BC(d) be the set of Customer utterance blocks of a given test dialogue d, and let BH(d) be the set of Helpdesk utterance blocks for d.

For each Customer block bC ∈ BC(d), let a(i) be the system's estimated number of annotators who chose the i-th Customer nugget type (1 ≤ i ≤ TC) for bC; let a*(i) be the corresponding true number of annotators. Note that for any block bC, \sum_{i=1}^{TC} a(i) = \sum_{i=1}^{TC} a*(i) = a, since we have a total of a annotators. Hence, for each Customer utterance block bC, we can construct probability distributions p, p* by letting p(i) = a(i)/a, p*(i) = a*(i)/a for i = 1, . . . , TC, and compute M(bC) = M(p, p*), where M ∈ {RNSS, JSD}. Figure 5(b) shows a conceptual diagram of how the measures are computed for a Customer utterance block. Similarly, for each Helpdesk block bH ∈ BH(d), we can compute M(bH) where M ∈ {RNSS, JSD}.

The entire dialogue d can then be evaluated by (weighted) average RNSS and (weighted) average JSD:

waM(d) = (α / |BC(d)|) \sum_{bC ∈ BC(d)} M(bC) + ((1 - α) / |BH(d)|) \sum_{bH ∈ BH(d)} M(bH) ,  (12)

where α (0 ≤ α ≤ 1) is a parameter for emphasising Customer or Helpdesk results and where M ∈ {RNSS, JSD}.

Finally, the participating systems can be compared in terms of mean (weighted) Average RNSS and mean (weighted) Average JSD:

meanwaM = (1 / |D|) \sum_{d ∈ D} waM(d) .  (13)

Figure 5: Conceptual diagrams of the proposed subtasks and the evaluation measures: (a) dialogue quality evaluation measures; (b) nugget detection evaluation measures.
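For completeness, a small sketch of how per-block scores could be aggregated with Eqs. (12) and (13) is given below; the per-block numbers are invented purely for illustration, and the function names are ours.

```python
from statistics import mean

def wa_m(customer_scores, helpdesk_scores, alpha=0.5):
    """Eq. (12): weighted average of per-block scores (e.g., RNSS or JSD) for one dialogue."""
    return alpha * mean(customer_scores) + (1 - alpha) * mean(helpdesk_scores)

def mean_wa_m(per_dialogue_scores):
    """Eq. (13): mean of waM over the test dialogues D."""
    return mean(per_dialogue_scores)

# Two hypothetical dialogues with invented per-block JSD values.
d1 = wa_m(customer_scores=[0.05, 0.10], helpdesk_scores=[0.02, 0.08, 0.04])
d2 = wa_m(customer_scores=[0.20], helpdesk_scores=[0.15, 0.25])
print(mean_wa_m([d1, d2]))
```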
4 CONCLUSIONS

This paper proposed a design of a shared task whose ultimate goal is automatic evaluation of multi-turn, dyadic, textual helpdesk dialogues. The proposed task takes the form of an offline evaluation, where participating systems are given a dialogue as input, and output at least one of the following: (1) an estimated distribution of the annotators' quality ratings for that dialogue; and (2) an estimated distribution of the annotators' nugget type labels for each utterance block in that dialogue. This shared task should help researchers build automatic helpdesk dialogue systems that respond appropriately to inquiries by considering the diverse views of customers. The proposed framework has been accepted as part of the NTCIR-14 Short Text Conversation task; we plan to provide the proposed tasks for both Chinese and English dialogues.

We also proposed SNOD, a pilot measure that considers the order of the probability bins for the dialogue quality subtask. In our future work, the properties of the measures considered in this paper will be examined with real dialogue data.

ACKNOWLEDGEMENTS

I thank the EVIA reviewers who gave me constructive comments, especially Reviewer 1 who pointed out the limitation of RNSS and JSD for the purpose of comparing two distributions where the categories are ordered. This led me to my proposal of SNOD.

REFERENCES

[1] T. Chai and R.R. Draxler. 2014. Root Mean Square Error (RMSE) or Mean Absolute Error (MAE)? – Arguments against avoiding RMSE in the Literature. Geoscientific Model Development 7 (2014), 1247–1250.
[2] Michel Galley, Chris Brockett, Alessandro Sordoni, Yangfeng Ji, Michael Auli, Chris Quirk, Margaret Mitchell, Jianfeng Gao, and Bill Dolan. 2015. ∆BLEU: A Discriminative Metric for Generation Tasks with Intrinsically Diverse Targets. In Proceedings of ACL 2015. 445–450.
[3] Ryuichiro Higashinaka, Kotaro Funakoshi, Michimasa Inaba, Yuiko Tsunomori, Tetsuro Takahashi, and Nobuhiro Kaji. 2017. Overview of Dialogue Breakdown Detection Challenge 3. In Proceedings of Dialog System Technology Challenge 6 (DSTC6) Workshop.
[4] Ryuichiro Higashinaka, Kotaro Funakoshi, Yuka Kobayashi, and Michimasa Inaba. 2016. The Dialogue Breakdown Detection Challenge: Task Description, Datasets, and Evaluation Metrics. In Proceedings of LREC 2016.
[5] Kate S. Hone and Robert Graham. 2000. Towards a Tool for the Subjective Assessment of Speech System Interfaces (SASSI). Natural Language Engineering 6, 3-4 (2000), 287–303.
[6] Jianhua Lin. 1991. Divergence Measures Based on the Shannon Entropy. IEEE Transactions on Information Theory 37, 1 (1991), 145–151.
[7] Ryan Lowe, Nissan Pow, Iulian V. Serban, and Joelle Pineau. 2015. The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems. In Proceedings of SIGDIAL 2015. 285–294.
[8] Eddy Maddalena, Kevin Roitero, Gianluca Demartini, and Stefano Mizzaro. 2017. Considering Assessor Agreement in IR Evaluation. In Proceedings of ACM ICTIR 2017. 75–82.
[9] Tetsuya Sakai. 2017. Unanimity-Aware Gain for Highly Subjective Assessments. In Proceedings of EVIA 2017.
[10] Tetsuya Sakai and Zhicheng Dou. 2013. Summaries, Ranked Retrieval and Sessions: A Unified Framework for Information Access Evaluation. In Proceedings of ACM SIGIR 2013. 473–482.
[11] Lifeng Shang, Tetsuya Sakai, Hang Li, Ryuichiro Higashinaka, Yusuke Miyao, Yuki Arase, and Masako Nomoto. 2017. Overview of the NTCIR-13 Short Text Conversation Task. In Proceedings of NTCIR-13.
[12] Lifeng Shang, Tetsuya Sakai, Zhengdong Lu, Hang Li, Ryuichiro Higashinaka, and Yusuke Miyao. 2016. Overview of the NTCIR-12 Short Text Conversation Task. In Proceedings of NTCIR-12. 473–484.
[13] Marilyn A. Walker, Diane J. Litman, Candace A. Kamm, and Alicia Abella. 1997. PARADISE: A Framework for Evaluating Spoken Dialogue Agents. In Proceedings of ACL 1997. 271–280.
[14] Zhaohao Zeng, Cheng Luo, Lifeng Shang, Hang Li, and Tetsuya Sakai. 2017. Test Collections and Measures for Evaluating Customer-Helpdesk Dialogues. In Proceedings of EVIA 2017.