<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards Automatic Evaluation of Multi-Turn Dialogues: A Task Design that Leverages Inherently Subjective Annotations</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tetsuya Sakai</string-name>
          <email>tetsuyasakai@acm.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Waseda University</institution>
          ,
          <addr-line>Tokyo</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <fpage>24</fpage>
      <lpage>30</lpage>
      <abstract>
        <p>is paper proposes a design of a shared task whose ultimate goal is automatic evaluation of multi-turn, dyadic, textual helpdesk dialogues. e proposed task takes the form of an oine evaluation, where participating systems are given a dialogue as input, and output at least one of the following: (1) an estimated distribution of the annotators' quality ratings for that dialogue; and (2) an estimated distribution of the annotators' nugget type labels for each uerance block (i.e., a maximal sequence of consecutive posts by the same uerer) in that dialogue. is shared task should help researchers build automatic helpdesk dialogue systems that respond appropriately to inquiries by considering the diverse views of customers. e proposed task has been accepted as part of the NTCIR-14 Short Text Conversation (STC-3) task. While estimated and gold distributions are traditionally compared by means of root mean squared error, Jensen-Shannon divergence and the like, we propose a pilot measure that considers the order of the probability bins for the dialogue quality subtask, which we call Symmetric Normalised Order-aware Divergence (SNOD).</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>More and more companies are providing online customer services where a customer can exchange realtime textual messages about the company’s services and products with a (probably human) helpdesk operator. This means more convenience for the customers, but more burden on the companies. Hence, research in automatic helpdesk dialogue systems is highly practical as a means to reduce the cost for the companies. To design and tune automatic dialogue systems efficiently and cost-effectively, automatic evaluation of dialogue quality is desirable.</p>
      <p>As an initial step towards automatic evaluation of helpdesk dialogue systems, this paper proposes a design of a shared task. The proposed task takes the form of an offline evaluation, where participating systems are given a dialogue as input, and output at least one of the following: (1) an estimated distribution of the annotators’ quality ratings for that dialogue; and (2) an estimated distribution of the annotators’ nugget type labels for each utterance block (i.e., a maximal sequence of consecutive posts by the same utterer) in that dialogue. This shared task should help researchers build automatic helpdesk dialogue systems that respond appropriately to inquiries by considering the diverse views of customers. The proposed task has been accepted as part of the NTCIR-14 Short Text Conversation (STC-3) task.</p>
      <p>EVIA 2017, co-located with NTCIR-13, Tokyo, Japan. © 2017 Copyright held by the author. Copying permitted for private and academic purposes.</p>
      <p>
        While estimated and gold distributions are traditionally compared by means of root mean squared error, Jensen-Shannon divergence [
        <xref ref-type="bibr" rid="ref7">6</xref>
        ] and the like, we propose a pilot measure that considers the order of the probability bins for the dialogue quality subtask, which we call Symmetric Normalised Order-aware Divergence (SNOD).
      </p>
    </sec>
    <sec id="sec-2">
      <title>RELATED WORK</title>
    </sec>
    <sec id="sec-3">
      <title>Dialogue Evaluation in Brief</title>
      <p>
        While to our knowledge the task proposed in the present paper is
novel, dialogue evaluation is not a new problem. For example, it was
in 1997 that Walker et al. [
        <xref ref-type="bibr" rid="ref14">13</xref>
        ] proposed the PARADISE (PARAdigm
for Dialogue System Evaluation) framework for evaluating spoken
dialogue systems in the train timetable domain. In 2000, Hone and
Graham [
        <xref ref-type="bibr" rid="ref6">5</xref>
        ] proposed the questionnaire-based SASSI (Subjective
Assessment of Speech System Interfaces) for evaluating an in-car
speech interface. However, existing studies along these lines of
research mostly focus on closed-domain applications. The topics
that helpdesks need to deal with are far more diverse.
      </p>
      <p>
        Recently, Lowe et al. [
        <xref ref-type="bibr" rid="ref8">7</xref>
        ] released the Ubuntu dialogue corpus
and proposed a response selection task: systems are given a dialogue
context, one correct response immediately following the context
plus nine “fake” responses sampled from outside the dialogue, and
are required to select one or more appropriate responses from them.
eir eort is more similar to ours in that the topics discussed in
the dialogues are more diverse than those dealt with by traditional
dialogue evaluation. However, since their “correct” response is the
original response from the dialogue in their task, their task does not
involve manual annotations at all. In contrast, the present study
addresses the problem of annotators’ subjective decisions that may
be unanimous in some cases but contradictory in others. In fact,
our proposal is to preserve the diverse views in the annotations “as
is” and leverage them at the step of evaluation measure calculation,
as we shall describe in Section 3.
      </p>
      <p>
There are also a few recent efforts in evaluating non-task-oriented
dialogues, or dialogues without a specific purpose (e.g. [
        <xref ref-type="bibr" rid="ref2">2</xref>
]). The
Dialogue Breakdown Detection Challenge [
        <xref ref-type="bibr" rid="ref3 ref5">3, 4</xref>
        ] (Section 2.2) and
the NTCIR Short Text Conversation task [
        <xref ref-type="bibr" rid="ref13">12</xref>
        ] (Section 2.3) are also
non-task-oriented. However, we are more interested in helpdesk
dialogues that try to solve a specific problem that the customer is
facing.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Dialogue Breakdown Detection</title>
      <p>
        The Dialogue Breakdown Detection Challenge (DBDC) [
        <xref ref-type="bibr" rid="ref5">4</xref>
        ] provides human-machine non-task-oriented chats to participating systems. Participating systems are required to examine each machine utterance, and determine the likelihood that the utterance caused a dialogue breakdown (i.e., a point where it becomes difficult to continue a proper conversation any further due to an inappropriate utterance). More specifically, the system is required, for each machine utterance, to output an estimated distribution of multiple annotators over three categories: NB (not a breakdown), PB (possible breakdown), and B (breakdown). This enabled the task to evaluate systems by comparing the system’s estimated distribution with the gold distribution of the annotators in terms of Mean Squared Error and Jensen-Shannon divergence (see Section 3.2.1). The Third DBDC [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] will be concluded at the Dialog System Technology Challenges (DSTC6) workshop on December 10, 2017 (http://workshop.colips.org/dstc6/).
      </p>
      <p>Our proposed task was directly inspired by DBDC, which reflects the view that the annotations by different people can be inherently different, and that systems should be aware of that. We believe that this is particularly important for dialogue systems that need to face diverse customers, often in the absence of absolute truths. Thus, instead of trying to consolidate multiple annotations to form a single gold label, we represent the gold data as a distribution of annotators; we also require systems to produce estimated distributions, rather than an estimated judgement of an “average” person (see Maddalena et al. [
        <xref ref-type="bibr" rid="ref9">8</xref>
        ] and Sakai [
        <xref ref-type="bibr" rid="ref10">9</xref>
        ] for related discussions in the context of information retrieval evaluation). One important point to note is that while the probability bins (i.e., the categories) of DBDC are ordered (e.g., PB is closer to NB than B is), the aforementioned measures do not take this into account. In the present study, we introduce a pilot measure called Symmetric Normalised Order-aware Divergence (SNOD) as an attempt to solve this issue.</p>
    </sec>
    <sec id="sec-5">
      <title>Short Text Conversation</title>
      <p>
        The NTCIR Short Text Conversation (STC) task [
        <xref ref-type="bibr" rid="ref12 ref13">11, 12</xref>
        ], the largest task in NTCIR-12 and -13, also handles non-task-oriented dialogues. However, their task setting has so far considered single-turn dialogues only: given a Chinese Weibo (http://weibo.com) post (in the Chinese subtask), can participating systems either retrieve or generate an appropriate response?
      </p>
      <p>
        While the STC task also hires multiple assessors and requires them to label tweets based on four criteria (fluent, coherent, self-sufficient, substantial; see http://ntcirstc.noahlab.com.hk/STC2/submission_evaluation/EvaluationCriteriaCN.pdf), it consolidates the labels of the multiple assessors to form the final graded relevance level (e.g., relevant and highly relevant). While Sakai’s unanimity-aware gains [
        <xref ref-type="bibr" rid="ref10">9</xref>
        ] were applied for the NTCIR-13 STC-2 Chinese subtask to weight unanimous ratings more heavily compared to controversial ones, the task did not involve direct comparisons of gold and system distributions.
      </p>
      <p>As was mentioned earlier, the framework proposed in the present study has been accepted as part of the NTCIR-14 STC-3 task.</p>
    </sec>
    <sec id="sec-6">
      <title>DCH-1 Test Collection</title>
      <p>
        Recently, Zeng et al. [
        <xref ref-type="bibr" rid="ref15">14</xref>
        ] reported on a Chinese helpdesk-customer dialogue test collection and proposed a nugget-based evaluation measure called UCH, which was adapted from an information retrieval evaluation measure called U-measure [
        <xref ref-type="bibr" rid="ref11">10</xref>
        ]. They hired three annotators per dialogue (helpdesk-customer interactions mined from Weibo) and obtained dialogue-level quality annotations as well as nugget annotations, where a nugget is a minimal sequence of consecutive posts by the same utterer that helps towards problem solving. In essence, a nugget is a “relevant” portion within an utterance block.
      </p>
      <p>
        Each of the three annotators independently provided the following dialogue-level quality labels for each dialogue [
        <xref ref-type="bibr" rid="ref15">14</xref>
        ]:
      </p>
      <p>TS Task Statement: whether the task (i.e., the problem to be solved) is clearly stated by Customer (scores: {−1, 0, 1});</p>
      <p>TA Task Accomplishment: whether the task is actually accomplished (scores: {−1, 0, 1});</p>
      <p>CS Customer Satisfaction: whether Customer is likely to have been satisfied with the dialogue, and to what degree (scores: {−2, −1, 0, 1, 2});</p>
      <p>HA Helpdesk Appropriateness: whether Helpdesk provided appropriate information (scores: {−1, 0, 1});</p>
      <p>CA Customer Appropriateness: whether Customer provided appropriate information (scores: {−1, 0, 1}).</p>
      <p>
        Moreover, they independently identified the following types of nuggets within each utterance block [
        <xref ref-type="bibr" rid="ref15">14</xref>
        ]:
      </p>
      <p>CNUG0 Customer’s trigger nuggets. These are nuggets that define Customer’s initial problem, which directly caused Customer to contact Helpdesk.</p>
      <p>HNUG Helpdesk’s regular nuggets. These are nuggets in Helpdesk’s utterances that are useful from Customer’s point of view.</p>
      <p>CNUG Customer’s regular nuggets. These are nuggets in Customer’s utterances that are useful from Helpdesk’s point of view.</p>
      <p>HNUG* Helpdesk’s goal nuggets. These are nuggets in Helpdesk’s utterances which provide the Customer with a solution to the problem.</p>
      <p>CNUG* Customer’s goal nuggets. These are nuggets in Customer’s utterances which tell Helpdesk that Customer’s problem has been solved.</p>
      <p>In our proposed task design, we tentatively use the aforementioned annotation scheme of DCH-1, so that we can discuss our ideas with concrete examples. However, it should be noted that our proposal does not require that the dialogue-level and nugget annotations are done in exactly the same way as above. If we do use the above schema in a new task, however, it would enable us to directly utilise the DCH-1 test collection as training data for the participants, as we shall describe in the next section.</p>
    </sec>
    <sec id="sec-7">
      <title>PROPOSED TASK DESIGN</title>
      <p>Our ultimate goal is the automatic evaluation of Helpdesk-Customer (be it human-human or human-machine) dialogues; as a first step, we propose the following shared task.</p>
    </sec>
    <sec id="sec-8">
      <title>Task Definition</title>
      <p>Participating teams are provided with training data, for example, the aforementioned DCH-1 test collection with multiple dialogue-level and nugget annotations per dialogue. Then, in the test phase, each team is given a new set of dialogues as input. Let D be the set of dialogues in the test set. Two subtasks are described below. It is hoped that these offline (i.e., laboratory-based) tasks will serve as initial steps towards evaluating real customer-helpdesk dialogue systems.</p>
      <p>3.1.1 Dialogue Quality Subtask. First, participating systems are given a list of possible dialogue quality levels {1, 2, …, L} and the number of annotators a. Then, for each d ∈ D, participating systems are required to return an estimated distribution of annotators over the quality levels. For example, if L = 5 (five levels) for Customer Satisfaction (see Section 2.4) and a = 10, a participating system might return (2, 2, 2, 2, 2) (i.e., two annotators for each quality level). Note that the gold distribution can also be represented similarly, e.g., (0, 0, 1, 4, 5). Thus, the probability bins (i.e., dialogue quality levels) are ordered, just like those in the Dialogue Breakdown Detection Challenge (see Section 2.2).</p>
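      <p>To make the required output format concrete, the count-to-distribution conversion can be sketched as follows (a minimal Python illustration; the variable names and example counts other than those quoted above are our own):</p>

```python
# Sketch: representing quality-rating distributions for one dialogue.
# 'a' annotators each pick one of L quality levels; a system returns
# estimated per-level annotator counts, which normalise to probabilities.

L = 5   # number of quality levels (e.g., Customer Satisfaction)
a = 10  # number of annotators

estimated_counts = [2, 2, 2, 2, 2]  # example from the text
gold_counts = [0, 0, 1, 4, 5]       # example gold distribution

def to_distribution(counts, a):
    """Convert per-bin annotator counts into a probability distribution."""
    assert sum(counts) == a, "counts must account for every annotator"
    return [c / a for c in counts]

p = to_distribution(estimated_counts, a)   # [0.2, 0.2, 0.2, 0.2, 0.2]
p_star = to_distribution(gold_counts, a)   # [0.0, 0.0, 0.1, 0.4, 0.5]
print(p, p_star)
```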
      <p>If a system can thus accurately estimate the dialogue quality (e.g., customer satisfaction, task accomplishment, etc.) from different people’s viewpoints, that system can potentially serve as a component of a dialogue system for self-diagnosis and self-improvement, helping to satisfy diverse customers.</p>
      <p>3.1.2 Nugget Detection Subtask. First, participating systems are given a list of Customer nugget types (e.g., {CNUG0, CNUG, CNUG*, NaN}) and a list of Helpdesk nugget types (e.g., {HNUG, HNUG*, NaN}). For each d ∈ D, participating systems are required to return, for every utterance block in the dialogue, an estimated distribution of the annotators over nugget types. For example, if a = 10 and we have the nugget types from DCH-1, a participating system may return, for a particular Customer utterance block, an estimated distribution (3, 4, 3, 0), which means “Three annotators said CNUG0; four said CNUG; three said CNUG*; none said NaN.” Similarly, for a particular Helpdesk utterance block, the same system may return (4, 4, 2), which means “Four annotators said HNUG; four said HNUG*; two said NaN.” Note that the gold distribution for each utterance block can be represented similarly (in the DCH-1 collection, nuggets were generally identified as “relevant” parts within an utterance block; however, treating entire utterance blocks as nuggets may facilitate both the annotation and evaluation steps), and that the probability bins (i.e., nugget types) are nominal (i.e., unordered).</p>
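      <p>The per-block output can be sketched in the same spirit (a minimal Python illustration; the block identifiers are invented, and here NaN is simply the “not a nugget” label, not a floating-point value):</p>

```python
# Sketch: estimated annotator distributions over nugget types, one per
# utterance block, using the DCH-1 type lists quoted in the text.

a = 10  # total number of annotators
CUSTOMER_TYPES = ["CNUG0", "CNUG", "CNUG*", "NaN"]
HELPDESK_TYPES = ["HNUG", "HNUG*", "NaN"]

# Hypothetical system output for one dialogue: counts per nugget type,
# keyed by an invented utterance-block id (C = Customer, H = Helpdesk).
estimated = {
    "C1": dict(zip(CUSTOMER_TYPES, [3, 4, 3, 0])),  # example from the text
    "H1": dict(zip(HELPDESK_TYPES, [4, 4, 2])),
}

def block_distribution(counts, a):
    """Normalise one block's annotator counts into probabilities."""
    assert sum(counts.values()) == a
    return {t: c / a for t, c in counts.items()}

for block, counts in estimated.items():
    print(block, block_distribution(counts, a))
```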
      <p>If a system can accurately detect nuggets and their types, that will help researchers utilise nugget-based evaluation measures without having to manually construct nuggets. Nugget-based evaluation measures may provide more fine-grained diagnoses of systems’ failures than dialogue-level annotations: for example, if designed appropriately, they may be able to tell us exactly where in the dialogue a problem occurred, and why.</p>
    </sec>
    <sec id="sec-9">
      <title>Evaluation Measures</title>
      <p>
        3.2.1 Comparing Two Distributions with Existing Measures. Both of the aforementioned subtasks require a comparison of the system’s estimated probability distribution with the gold distribution. Figure 1 shows two examples where the estimated distribution is compared with the gold distribution when there are five bins (i.e., dialogue quality levels or nugget types). One might consider variational distance [
        <xref ref-type="bibr" rid="ref7">6</xref>
        ], which forms the basis of mean absolute error (MAE) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], as a candidate measure for comparing the estimated distribution p with the gold distribution p*:
V(p, p*) = Σ_i |p(i) − p*(i)| ,  (1)
where p(i) and p*(i) are the estimated and true probabilities for the i-th bin. Dividing it by two (representing the case with a complete lack of overlap) would ensure the [0, 1] range. However, accumulating the per-bin errors in this way is not ideal for our purpose, because variational distance cannot penalise “outlier” probabilities. For example, we argue that Figure 1(X) should be rated higher than (Y), because the latter distribution is too skewed compared to the gold distribution; the system is falsely confident that Bin 1 has a very high probability. However, the variational distance is clearly 0.4 (0.2 when normalised) for both (X) and (Y): the two systems are treated as equivalent according to this measure. For this reason, we prefer the measures discussed below over variational distance or MAE.
      </p>
      <p>[Figure 1: two estimated distributions, (X) and (Y), each compared with the same gold distribution over five bins.]</p>
      <p>Root mean squared error (RMSE) is often used along with MAE in the research community. This approach is more suitable for our purpose because of its ability to penalise outliers. In our case, we can define a measure based on Sum of Squares (SS) first:
SS(p, p*) = Σ_i (p(i) − p*(i))² .  (2)
Since the largest possible value of SS is 1² + 1² = 2, we can use Root Normalised Sum of Squares (RNSS), which has the [0, 1] range:
RNSS(p, p*) = √( SS(p, p*) / 2 ) .  (3)
For the examples in Figure 1, the RNSS of (X) is 0.1414 while that of (Y) is 0.1732; hence (X) outperforms (Y). The reader is referred to Chai and Draxler [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] for a discussion of the advantages of RMSE (which is similar to RNSS) over MAE.
      </p>
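      <p>RNSS can be sketched directly from Eqs. 2-3. The five-bin distributions below are hypothetical reconstructions chosen to reproduce the RNSS values quoted in the text (0.1414 and 0.1732), not the exact Figure 1 values:</p>

```python
import math

# RNSS(p, p*) = sqrt(SS(p, p*) / 2), where SS is the sum of squared
# per-bin differences; dividing by the maximum SS of 2 keeps it in [0, 1].

def rnss(p, p_star):
    ss = sum((pi - qi) ** 2 for pi, qi in zip(p, p_star))
    return math.sqrt(ss / 2)

gold = [0.2, 0.2, 0.2, 0.2, 0.2]  # hypothetical gold distribution
x = [0.3, 0.1, 0.2, 0.3, 0.1]     # RNSS = 0.1414
y = [0.4, 0.1, 0.1, 0.2, 0.2]     # RNSS = 0.1732: the skewed, over-confident
                                  # estimate is now penalised more heavily
print(round(rnss(x, gold), 4), round(rnss(y, gold), 4))
```

As a further check, a point-mass estimate against a uniform three-bin gold distribution gives RNSS = 0.5774, the value the text quotes for Figure 3(I).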
      <p>
        Another measure that can distinguish the difference between Figure 1(X) and (Y) is the (normalised, symmetric version of) Jensen-Shannon divergence (JSD) [
        <xref ref-type="bibr" rid="ref7">6</xref>
        ], which we denote as JSD(p, p*). (The original definition of the Jensen-Shannon divergence assigns a weight to each probability distribution; our definition of JSD equals the “L divergence” of Lin [6] divided by two.) First, for probability distributions p1 and p2, the Kullback-Leibler divergence (KLD), which is not symmetric, is defined as:
KLD(p1, p2) = Σ_i p1(i) log2( p1(i) / p2(i) ) .  (4)
Note that the above is undefined if p2(i) = 0; JSD avoids this limitation as described below. For a given pair of distributions p and p*, let pM be a probability distribution such that, for every bin i, pM(i) = ( p(i) + p*(i) ) / 2. Then JSD, which is symmetric, is defined as:
JSD(p, p*) = ( KLD(p, pM) + KLD(p*, pM) ) / 2 .  (5)
Thus, by introducing pM, we can avoid the aforementioned limitation of KLD, since p(i) &gt; 0 implies that pM(i) &gt; 0 also. Moreover, provided that the logarithm base in Eq. 4 is two, the above JSD has the [0, 1] range. Lin [
        <xref ref-type="bibr" rid="ref7">6</xref>
        ] proves that the above form of JSD is bounded above by the normalised variational distance (see Eq. 1):
JSD(p, p*) ≤ V(p, p*) / 2 .  (6)
For the examples shown in Figure 1, JSD(p, p*) = (0.0408 + 0.0372)/2 = 0.0390 for (X), and JSD(p, p*) = (0.0490 + 0.0490)/2 = 0.0490 for (Y). Again, (X) is considered to be superior.
      </p>
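      <p>A minimal sketch of Eqs. 4-5, checked against the uniform-gold example from Figure 3, whose JSD of 0.4591 is quoted in the text:</p>

```python
import math

# JSD(p, p*) is the mean of the KL divergences of p and p* from their
# midpoint distribution pM (Eqs. 4-5). With base-2 logarithms it lies
# in [0, 1], and pM(i) is positive wherever either input has mass,
# so the measure is always defined.

def kld(p1, p2):
    """Kullback-Leibler divergence, base 2; 0 * log(0/x) is taken as 0."""
    return sum(p * math.log2(p / q) for p, q in zip(p1, p2) if p > 0)

def jsd(p, p_star):
    pm = [(pi + qi) / 2 for pi, qi in zip(p, p_star)]
    return (kld(p, pm) + kld(p_star, pm)) / 2

uniform = [1 / 3, 1 / 3, 1 / 3]
print(round(jsd([1, 0, 0], uniform), 4))  # 0.4591, as quoted for Figure 3(I)
```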
      <p>3.2.2 Comparing Two Distributions with Order-Aware Measures. For the dialogue quality subtask, the probability bins are ordinal, but the aforementioned measures do not take that into account. For example, compare Figure 2(a) with (d), and (b) with (c) (the left half of each figure), where we have L = 3 ordinal bins and the true and estimated distributions are represented in blue and red, respectively. Because RNSS and JSD are summations of differences across the bins, they give the same score to (a) and (d) (RNSS=1, JSD=1), and to (b) and (c) (RNSS=0.8819, JSD=1). However, for ordinal bins, it is clear that (d) is better than (a), and (c) is better than (b). The problem is that there is no notion of distance between different bins. Hence we propose a new measure for comparing two distributions where bins are ordinal.</p>
      <p>Let A be the set of bins used in the task, where |A| = L (&gt; 1). First, we define the sets of bins with nonzero probabilities B* = {i | p*(i) &gt; 0} (⊆ A) and B = {i | p(i) &gt; 0} (⊆ A). Then, given estimated and gold distributions p and p*, we define Order-aware Divergence as:
OD(p, p*) = (1/|B*|) Σ_{i ∈ B*} Σ_{j ∈ A, j ≠ i} |i − j| ( p(j) − p*(j) )² .  (7)
It can be observed that OD is not symmetric: for every nonzero bin i of p*, it computes a sum of weighted squares for the other bins, where the weight is given as the distance between i and every other bin j. Hence, B = B* is a sufficient condition that implies the symmetry of OD. We will come back to this point later with a few examples.</p>
      <p>Symmetric Order-aware Divergence (SOD) can easily be defined as:
SOD(p, p*) = ( OD(p, p*) + OD(p*, p) ) / 2 .  (8)</p>
      <p>To ensure that the measure has the [0, 1] range, we should consider the maximum possible value of OD for a given L: it is clear from the definition of OD that in situations such as p(1) = 1 and p*(L) = 1, that is, when both estimated and gold distributions occupy exactly one bin and the two bins are as far apart as possible from each other, the worst-case OD is given by (L − 1) · 1² = L − 1. Hence, Normalised Order-aware Divergence (NOD) and Symmetric Normalised Order-aware Divergence (SNOD) may be defined as:
NOD(p, p*) = OD(p, p*) / (L − 1) ,  (9)
SNOD(p, p*) = SOD(p, p*) / (L − 1) .  (10)
Note that SNOD is symmetric, but NOD is generally not.</p>
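      <p>The definitions above can be sketched directly. The example distributions are our reconstructions of Figure 2, panels (a) and (d), consistent with the NOD values the text quotes for them (1 and 0.5):</p>

```python
# Order-aware Divergence (Eq. 7) and its symmetric / normalised
# variants (Eqs. 8-10). Bins are 1-indexed, as in the text.

def od(p, p_star):
    bins = range(1, len(p) + 1)
    b_star = [i for i in bins if p_star[i - 1] > 0]  # nonzero gold bins
    total = 0.0
    for i in b_star:
        for j in bins:
            if j != i:
                total += abs(i - j) * (p[j - 1] - p_star[j - 1]) ** 2
    return total / len(b_star)

def nod(p, p_star):
    return od(p, p_star) / (len(p) - 1)

def snod(p, p_star):
    return (nod(p, p_star) + nod(p_star, p)) / 2

# Reconstructions of Figure 2 with L = 3 and all gold mass on bin 1:
gold = [1, 0, 0]
print(nod([0, 0, 1], gold), snod([0, 0, 1], gold))  # 1.0 1.0  (panel (a))
print(nod([0, 1, 0], gold), snod([0, 1, 0], gold))  # 0.5 0.5  (panel (d))
```

Putting all the estimated mass in the adjacent bin thus scores 0.5 rather than the maximum 1, reflecting the linear distance weighting.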
      <p>Figure 2, which we have mentioned earlier, contains the NOD and SNOD scores for (a)-(d). The right half of the figure, (a)’-(d)’, which swaps the estimated and gold distributions, is used for computing SNOD. It can be observed that the SNOD score goes down as we move from (a) to (d). Hence (d) is considered better than (a), and (c) is considered better than (b). In particular, note that while the (S)NOD for (a) is 1, the maximum possible value, that for (d) is 0.5, reflecting the linear weighting scheme of OD.</p>
      <p>Figure 3 provides a few other examples with L = 3: this time, the gold distribution is uniform. While RNSS and JSD give the same score to (I) and (II) (RNSS=0.5774, JSD=0.4591), and to (III) and (IV) (RNSS=0.3333, JSD=0.2075), it can be observed that the SNOD score goes down as we move from (I) to (IV).</p>
      <p>Finally, we compute the SNOD scores for the examples given in Figure 1, where L = 5: the results are shown in Figure 4. It can be observed that SNOD prefers (X) to the more skewed (Y). Moreover, note that B = B* holds for these examples, since both probability distributions cover all the bins. Hence NOD(p, p*) = NOD(p*, p) = SNOD(p, p*) holds. (Another sufficient condition for the symmetry of (N)OD is |B| = |B*| = 1 and B ≠ B*; that is, p*(i) = 1 for a particular i and p(j) = 1 for a particular j (≠ i). See Figure 2(a) and (d).)</p>
      <p>To sum up, we propose to use RNSS, JSD, and SNOD for comparing the probability distributions in the dialogue quality subtask (since the bins are ordered), and to use RNSS and JSD for comparing the probability distributions in the nugget detection subtask (since the bins are nominal).</p>
      <p>
        3.2.3 Dialogue Quality Measures. The Dialogue Quality subtask needs to compare, for each dialogue, the system’s estimated distribution of a annotators over L quality levels with the gold distribution. Let a(i) be the system’s estimated number of annotators who chose Level i for a dialogue, and let a*(i) be the corresponding true number, so that Σ_{i=1}^{L} a(i) = Σ_{i=1}^{L} a*(i) = a. Hence, for each dialogue d, we can construct probability distributions p, p* by letting p(i) = a(i)/a and p*(i) = a*(i)/a for i = 1, …, L, and compute M(d) = M(p, p*), where M ∈ {RNSS, JSD, SNOD}. Figure 5(a) shows a conceptual diagram of how these measures are computed.
      </p>
      <p>[Figure 2: ordinal-bin examples (a)-(d), with NOD and SNOD scores; Figure 3: examples (I)-(IV) against a uniform gold distribution; Figure 4: NOD and SNOD scores for the Figure 1 examples.]</p>
      <p>3.2.4 Nugget Detection Measures. The Nugget Detection subtask first needs to evaluate, for each utterance block, the accuracy of the system’s estimated distribution of annotators over nugget types, and then consolidate the results for the entire dialogue. (This is the macroaveraging approach, where we assume that each dialogue is as important as any other, as it represents a particular customer experience. An alternative would be the microaveraging approach, which views each utterance block to be as important as any other; the latter implies that longer dialogues impact the overall system performance more heavily, which is not necessarily what we want in the present study.)</p>
      <p>Let TC be the number of possible Customer nugget types (including NaN), and let TH be the number of possible Helpdesk nugget types (including NaN). For example, if the Customer nugget types are CNUG0, CNUG, CNUG*, and NaN, then TC = 4; if the Helpdesk nugget types are HNUG, HNUG*, and NaN, then TH = 3. Let BC(d) be the set of Customer utterance blocks of a given test dialogue d, and let BH(d) be the set of Helpdesk utterance blocks for d.</p>
      <p>For each Customer block bC ∈ BC(d), let a(i) be the system’s estimated number of annotators who chose the i-th Customer nugget type (1 ≤ i ≤ TC) for bC; let a*(i) be the corresponding true number of annotators. Note that for any block bC, Σ_{i=1}^{TC} a(i) = Σ_{i=1}^{TC} a*(i) = a, since we have a total of a annotators. Hence, for each Customer utterance block bC, we can construct probability distributions p, p* by letting p(i) = a(i)/a and p*(i) = a*(i)/a for i = 1, …, TC, and compute M(bC) = M(p, p*), where M ∈ {RNSS, JSD}. Figure 5(b) shows a conceptual diagram of how the measures are computed for a Customer utterance block.</p>
      <p>Similarly, for each Helpdesk block bH ∈ BH(d), we can compute M(bH), where M ∈ {RNSS, JSD}.</p>
      <p>e entire dialogue d can then be evaluated by (weighted) average
RNSS and (weighted) average JSD:</p>
      <p>α Õ
=
+
jBC ¹dºj bC 2BC ¹dº
1 α Õ
jBH ¹dºj bH 2BH ¹dº</p>
      <p>M¹bC º
M¹bH º ;
(12)
(13)
where α ¹0 α 1º is a parameter for emphasising Customer or
Helpdesk results and where M 2 fRNSS; JSDg.</p>
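      <p>A minimal sketch of this consolidation step, with invented per-block scores standing in for real M(b) values computed by RNSS or JSD:</p>

```python
# Weighted average of per-block measure scores for one dialogue, and the
# macroaverage over the test set. The per-block scores below are
# invented placeholders, not real evaluation output.

def wa_measure(customer_scores, helpdesk_scores, alpha=0.5):
    """waM(d): alpha weights the Customer side, (1 - alpha) the Helpdesk side."""
    c = sum(customer_scores) / len(customer_scores)
    h = sum(helpdesk_scores) / len(helpdesk_scores)
    return alpha * c + (1 - alpha) * h

# Hypothetical test set of two dialogues, each a pair of
# (Customer block scores, Helpdesk block scores):
dialogues = [
    ([0.10, 0.30], [0.20]),
    ([0.40], [0.10, 0.20, 0.30]),
]
wam = [wa_measure(c, h) for c, h in dialogues]
mean_wam = sum(wam) / len(wam)  # macroaverage: each dialogue counts equally
print([round(w, 3) for w in wam], round(mean_wam, 3))
```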
      <p>Finally, the participating systems can be compared in terms of mean (weighted) Average RNSS and mean (weighted) Average JSD:
meanwaM = (1/|D|) Σ_{d ∈ D} waM(d) .  (13)</p>
    </sec>
    <sec id="sec-11">
      <title>CONCLUSIONS</title>
      <p>This paper proposed a design of a shared task whose ultimate goal is automatic evaluation of multi-turn, dyadic, textual helpdesk dialogues. The proposed task takes the form of an offline evaluation, where participating systems are given a dialogue as input, and output at least one of the following: (1) an estimated distribution of the annotators’ quality ratings for that dialogue; and (2) an estimated distribution of the annotators’ nugget type labels for each utterance block in that dialogue. This shared task should help researchers build automatic helpdesk dialogue systems that respond appropriately to inquiries by considering the diverse views of customers. The proposed framework has been accepted as part of the NTCIR-14 Short Text Conversation task; we plan to provide the proposed tasks for both Chinese and English dialogues.</p>
      <p>We also proposed SNOD, a pilot measure that considers the order of the probability bins for the dialogue quality subtask. In our future work, the properties of the measures considered in this paper will be examined with real dialogue data.</p>
    </sec>
    <sec id="sec-10">
      <title>ACKNOWLEDGEMENTS</title>
      <p>I thank the EVIA reviewers who gave me constructive comments, especially Reviewer 1 who pointed out the limitation of RNSS and JSD for the purpose of comparing two distributions where the categories are ordered. This led me to my proposal of SNOD.</p>
      <p>[Figure 5: conceptual diagrams of (a) the dialogue quality evaluation measures and (b) the nugget detection evaluation measures: for each dialogue or utterance block, the participating system’s estimated distribution over quality levels (Level 1, …, Level L) or nugget types (Type 1, …, Type T) is compared with the gold distribution.]</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.</given-names>
            <surname>Chai</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.R.</given-names>
            <surname>Draxler</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Root Mean Square Error (RMSE) or Mean Absolute Error (MAE)? - Arguments against avoiding RMSE in the Literature</article-title>
          .
          <source>Geoscientific Model Development</source>
          <volume>7</volume>
          (
          <year>2014</year>
          ),
          <fpage>1247</fpage>
          -
          <lpage>1250</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Michel</given-names>
            <surname>Galley</surname>
          </string-name>
          , Chris Brocke,
          <string-name>
            <surname>Alessandro</surname>
            <given-names>Sordoni</given-names>
          </string-name>
          , Yangfeng Ji, Michael Auli, Chris irk, Margaret Mitchell,
          <string-name>
            <surname>Jianfeng Gao</surname>
            , and
            <given-names>Bill</given-names>
          </string-name>
          <string-name>
            <surname>Dolan</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>ΔBLEU: A Discriminative Metric for Generation Tasks with Intrinsically Diverse Targets</article-title>
          .
          <source>In Proceedings of ACL</source>
          <year>2015</year>
          .
          <fpage>445</fpage>
          -
          <lpage>450</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Ryuichiro</given-names>
            <surname>Higashinaka</surname>
          </string-name>
          , Kotaro Funakoshi, Michimasa Inaba, Yuiko Tsunomori, Tetsuro Takahashi, and
          <string-name>
            <given-names>Nobuhiro</given-names>
            <surname>Kaji</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Overview of Dialogue Breakdown Detection Challenge 3</article-title>
          .
          <source>In Proceedings of Dialog System Technology Challenge</source>
          <volume>6</volume>
          (
          <issue>DSTC6</issue>
          ) Workshop.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Ryuichiro</given-names>
            <surname>Higashinaka</surname>
          </string-name>
          , Kotaro Funakoshi, Yuka Kobayashi, and
          <string-name>
            <given-names>Michimasa</given-names>
            <surname>Inaba</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>e Dialogue Breakdown Detection Challenge: Task Description, Datasets, and Evaluation Metrics</article-title>
          .
          <source>In Proceedings of LREC</source>
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Kate S.</given-names>
            <surname>Hone</surname>
          </string-name>
          and
          <string-name>
            <given-names>Robert</given-names>
            <surname>Graham</surname>
          </string-name>
          .
          <year>2000</year>
          .
          <article-title>Towards a Tool for the Subjective Assessment of Speech System Interfaces (SASSI)</article-title>
          .
          <source>Natural Language Engineering</source>
          <volume>6</volume>
          ,
          <issue>3-4</issue>
          (
          <year>2000</year>
          ),
          <fpage>287</fpage>
          -
          <lpage>303</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Jianhua</given-names>
            <surname>Lin</surname>
          </string-name>
          .
          <year>1991</year>
          .
          <article-title>Divergence Measures Based on the Shannon Entropy</article-title>
          .
          <source>IEEE Transactions on Information Theory</source>
          <volume>37</volume>
          ,
          <issue>1</issue>
          (
          <year>1991</year>
          ),
          <fpage>145</fpage>
          -
          <lpage>151</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Ryan</given-names>
            <surname>Lowe</surname>
          </string-name>
          , Nissan Pow,
          <string-name>
            <given-names>Iulian V.</given-names>
            <surname>Serban</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Joelle</given-names>
            <surname>Pineau</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>e Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems</article-title>
          .
          <source>In Proceedings of SIGDIAL</source>
          <year>2015</year>
          .
          <fpage>285</fpage>
          -
          <lpage>294</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Eddy</given-names>
            <surname>Maddalena</surname>
          </string-name>
          , Kevin Roitero, Gianluca Demartini, and
          <string-name>
            <given-names>Stefano</given-names>
            <surname>Mizzaro</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Considering Assessor Agreement in IR Evaluation</article-title>
          .
          <source>In Proceedings of ACM ICTIR</source>
          <year>2017</year>
          .
          <fpage>75</fpage>
          -
          <lpage>82</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Tetsuya</given-names>
            <surname>Sakai</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Unanimity-Aware Gain for Highly Subjective Assessments</article-title>
          .
          <source>In Proceedings of EVIA</source>
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Tetsuya</given-names>
            <surname>Sakai</surname>
          </string-name>
          and
          <string-name>
            <given-names>Zhicheng</given-names>
            <surname>Dou</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Summaries, Ranked Retrieval and Sessions: A Unified Framework for Information Access Evaluation</article-title>
          .
          <source>In Proceedings of ACM SIGIR</source>
          <year>2013</year>
          .
          <fpage>473</fpage>
          -
          <lpage>482</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Lifeng</given-names>
            <surname>Shang</surname>
          </string-name>
          , Tetsuya Sakai,
          <string-name>
            <given-names>Hang</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ryuichiro</given-names>
            <surname>Higashinaka</surname>
          </string-name>
          , Yusuke Miyao, Yuki Arase, and
          <string-name>
            <given-names>Masako</given-names>
            <surname>Nomoto</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Overview of the NTCIR-13 Short Text Conversation Task</article-title>
          .
          <source>In Proceedings of NTCIR-13.</source>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Lifeng</given-names>
            <surname>Shang</surname>
          </string-name>
          , Tetsuya Sakai, Zhengdong Lu,
          <string-name>
            <given-names>Hang</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ryuichiro</given-names>
            <surname>Higashinaka</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Yusuke</given-names>
            <surname>Miyao</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Overview of the NTCIR-12 Short Text Conversation Task</article-title>
          .
          <source>In Proceedings of NTCIR-12</source>
          .
          <fpage>473</fpage>
          -
          <lpage>484</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Marilyn A.</given-names>
            <surname>Walker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Diane J.</given-names>
            <surname>Litman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Candace A.</given-names>
            <surname>Kamm</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Alicia</given-names>
            <surname>Abella</surname>
          </string-name>
          .
          <year>1997</year>
          .
          <article-title>PARADISE: A Framework for Evaluating Spoken Dialogue Agents</article-title>
          .
          <source>In Proceedings of ACL</source>
          <year>1997</year>
          .
          <fpage>271</fpage>
          -
          <lpage>280</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Zhaohao</given-names>
            <surname>Zeng</surname>
          </string-name>
          , Cheng Luo, Lifeng Shang,
          <string-name>
            <given-names>Hang</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Tetsuya</given-names>
            <surname>Sakai</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Test Collections and Measures for Evaluating Customer-Helpdesk Dialogues</article-title>
          .
          <source>In Proceedings of EVIA</source>
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>