<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Test Collections and Measures for Evaluating Customer-Helpdesk Dialogues</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Zhaohao Zeng</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cheng Luo</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lifeng Shang</string-name>
          <email>shang.lifeng@huawei.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hang Li</string-name>
          <email>lihang.lh@bytedance.com</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tetsuya Sakai</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Waseda University</institution>
          ,
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Tsinghua University</institution>
          ,
          <country country="CN">P.R.China</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Huawei Noah's Ark Lab</institution>
          ,
          <country country="HK">Hong Kong</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Toutiao AI Lab</institution>
          ,
          <country country="CN">P.R.China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <abstract>
        <p>We address the problem of evaluating textual, task-oriented dialogues between the customer and the helpdesk, such as those that take the form of online chats. As an initial step towards evaluating automatic helpdesk dialogue systems, we have constructed a test collection comprising 3,700 real Customer-Helpdesk multi-turn dialogues by mining Weibo, a major Chinese social media platform. We have annotated each dialogue with multiple subjective quality annotations and nugget annotations, where a nugget is a minimal sequence of posts by the same utterer that helps towards problem solving. In addition, 10% of the dialogues have been manually translated into English. We have made our test collection DCH-1 publicly available for research purposes. We also propose a simple nugget-based evaluation measure for task-oriented dialogue evaluation, which we call UCH, and explore its usefulness and limitations.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>KEYWORDS</title>
      <p>dialogues; evaluation; helpdesk; measures; nuggets; test
collections</p>
    </sec>
    <sec id="sec-2">
      <title>INTRODUCTION</title>
      <p>Whenever a user of a commercial product or a service encounters
a problem, an effective way to solve it would be to contact the
helpdesk. Efficient and successful dialogues are desirable both for
the customer and the company that sells the product/service.
Recent advances in artificial intelligence suggest that, in the
not-too-distant future, these human-human Customer-Helpdesk dialogues
will be replaced by human-machine ones. In order to build and
efficiently tune automatic helpdesk systems, reliable automatic
evaluation methods for task-oriented dialogues are required.</p>
      <p>Figure 1 shows an example of a Customer-Helpdesk dialogue. It
can be observed that it is initiated by Customer’s report of a
particular problem she is facing, which we call a trigger. This is an
example of a successful dialogue, for Helpdesk provides an actual
solution to the problem and Customer acknowledges that the problem
has been solved. Unlike the classical closed-domain task-oriented
dialogues, Helpdesk may have to handle diverse requests, which
makes it impossible for us to solve the problems by pre-defined
slot filling schemes that are required by many existing evaluation
measures for task-oriented dialogues (See Section 2.2).</p>
      <p>[Figure 1: An example Customer-Helpdesk dialogue. C (Trigger): “I copied
a picture from my PC to my mobile phone, but it kind of looks fuzzy on the
phone. How can I solve this? P.S. I’m no good at computers and mobile
phones.” H (Solution): “Please synchronise your PC and phone using iTunes
first, and then upload your picture.” C (Confirmation): “I’d done the
synchronisation but did not upload it with XXX Mobile Assistant. I managed
to do so by following your advice. You are a real expert, thank you!”
H: “You are very welcome. If you have any problems using XXX Mobile Phone
Software, please contact us again, or visit XXX.com.”]</p>
      <p>In the present study, we address the problem of evaluating
textual Customer-Helpdesk dialogues, such as those that take the form
of online chats. As an initial step towards evaluating automatic
helpdesk dialogue systems, we have constructed a test collection
comprising 3,700 real customer-helpdesk multi-turn dialogues by
mining Weibo (http://www.weibo.com), a major Chinese social media platform. We have
annotated each dialogue with subjective quality annotations (task
statement, task accomplishment, customer satisfaction, helpdesk
appropriateness, customer appropriateness) as well as nugget annotations,
where a nugget is a minimal sequence of posts by the same
utterer that helps towards problem solving. In addition, 10% of the
dialogues have been manually translated into English. We have
made our test collection DCH-1 (Dialogues between Customer and
Helpdesk) publicly available for research purposes, along with a
smaller pilot collection DCH-0, which contains 234 dialogues (http://waseda.box.com/DCH-0-1).</p>
      <p>We also propose a simple nugget-based evaluation measure for
task-oriented dialogue evaluation, which we call UCH (Utility for
Customer and Helpdesk), and explore its usefulness and
limitations. We believe that, while subjective dialogue evaluation can
evaluate the dialogue as a whole, automatic evaluation methods
will eventually require more local pieces of evidence from the
dialogue text for close diagnosis. For this reason, we collected both
subjective annotations and nugget annotations for each dialogue,
in the hope that automatic evaluation measures defined as a
function of nuggets will eventually be able to predict subjective scores
with reasonable accuracy. Another possible benefit of
constructing nuggets is that a set of nuggets collected from a dialogue may
also be useful for evaluating a different dialogue that discusses a
similar problem.</p>
    </sec>
    <sec id="sec-3">
      <title>RELATED WORK</title>
    </sec>
    <sec id="sec-4">
      <title>Evaluating Non-Task-Oriented Dialogues</title>
      <p>
        Evaluating generated responses in non-task-oriented dialogues is
a difficult problem. Galley et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] proposed Discriminative BLEU,
which generalises BLEU [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], a machine translation evaluation
measure that compares the system output with multiple reference
translations at the n-gram level. Discriminative BLEU introduces
positive and negative weights to human references (i.e., gold standard
responses) in the computation of n-gram-based precision, which
is the primary component of BLEU. Because it is difficult to
obtain multiple hand-crafted references for conversational data, they
automatically mine candidate responses from corpora of
conversations and then have the annotators rate the quality of the
candidates. The reference weights reflect the result of the quality
annotations.
      </p>
      <p>
        Higashinaka et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] ran the first Dialogue Breakdown Detection
Challenge using Japanese human-machine chat corpora, to
evaluate the system’s ability to detect the point in a given dialogue
where it becomes difficult to continue due to the system's
inappropriate response. This effort used 1,146 text chat dialogues for
training and another 100 for development and testing. After each
system utterance in the dialogue, participating systems were
required to provide a diagnosis: “NB” (not a breakdown), “PB”
(possible breakdown), or “B” (breakdown). They were also required
to submit a probability distribution over the three labels. To
define the gold standard data for this task, multiple annotators were
hired, so that a gold probability distribution can be constructed for
each utterance. By comparing the best gold label with the system's
output, accuracy, precision, recall and F-measure were computed.
Moreover, by comparing the gold distribution over the three
labels with the system’s distribution, Jensen-Shannon Divergence
and Mean Squared Error were computed. Using a distribution as
the gold standard probably reflects the view that there can be
multiple acceptable choices within a dialogue, as suggested also by other
studies [
        <xref ref-type="bibr" rid="ref1 ref3">1, 3</xref>
        ]. The third Dialogue Breakdown Detection Challenge
workshop will be held as part of Dialogue System Technology
Challenges on December 10, 2017 (http://workshop.colips.org/dstc6/).
      </p>
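      <p>As a rough illustration of this distribution-based comparison (not the
challenge's official scorer), the sketch below computes the Jensen-Shannon
Divergence and Mean Squared Error between a gold label distribution and a
system-submitted one; the label order and the example distributions are our
own assumptions.</p>
      <preformat>
# Sketch of distribution-based scoring for breakdown detection (illustrative only).
# Label order assumed: NB (not a breakdown), PB (possible breakdown), B (breakdown).
import math

def jensen_shannon_divergence(p, q):
    """JSD between two probability distributions over the same label set."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def mean_squared_error(p, q):
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q)) / len(p)

gold = [0.6, 0.3, 0.1]    # hypothetical gold distribution from multiple annotators
system = [0.5, 0.2, 0.3]  # hypothetical system-submitted distribution
print(jensen_shannon_divergence(gold, system), mean_squared_error(gold, system))
      </preformat>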
      <p>
        At NTCIR-12, the first Short Text Conversation (STC) task was
run using Weibo data (for the Chinese subtask) and Twitter data
(for the Japanese subtask), attracting 22 participating teams [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ].
The STC task required participating systems to return a valid
comment in response to an input tweet (given without any prior
context). Instead of relying on natural language generation, systems
were required to search a repository of past tweets and return a
ranked list as possible responses. Information retrieval evaluation
measures were used to evaluate the participating systems. Gold
standard labels were created manually by hiring multiple
annotators who used the following axes to decide on a single graded label
(L0, L1 or L2): coherence, topical relevance, context-independence,
and non-repetitiveness. The second STC task (STC-2) at NTCIR-13
attracted 22 participating teams for the Chinese subtask, which
allowed participants to submit not only retrieved responses but also
generated ones [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
      </p>
    </sec>
    <sec id="sec-5">
      <title>Evaluating Task-Oriented Dialogues</title>
      <p>
        Two decades ago, Walker et al. [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] proposed the PARADISE
(PARAdigm for Dialogue System Evaluation) framework for evaluating
task-oriented spoken dialogue systems. The basic idea is to collect
a variety of real human-machine dialogues for a specific task (e.g.,
train timetable lookup) as well as subjective ratings of user
satisfaction for each dialogue, and use task success and cost as explanatory
variables so that the user satisfaction measures for new dialogues
can be estimated by means of linear regression. PARADISE
requires an attribute-value matrix that represents the task: for
example, for the train timetable domain, attributes such as “depart-city,”
“arrival-city” and “depart-time” must be specified in advance. This
is in contrast to our helpdesk case because, while it is task-oriented,
the required attributes depend on the customer’s problem and
cannot be listed exhaustively in advance. In this respect, helpdesk
dialogues probably lie somewhere in between non-task-oriented
dialogues and the slot-filling dialogues that PARADISE deals with.
      </p>
      <p>
        The PARADISE framework was subsequently used in the
DARPA COMMUNICATOR Program that evaluated spoken
dialogue systems in the travel planning domain [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. The effort
produced the Communicator 2000 Corpus consisting of 662 dialogues
based on nine different systems, with per-call survey results on
dialogue efficiency, dialogue quality, task success and user
satisfaction. Here, a new utterance tagging scheme called DATE (Dialogue
Act Tagging for Evaluation) was introduced, which enables three
orthogonal annotations along the axes of speech-act (e.g.,
“request-info,” “apology”), task-subtask (e.g., “origin,” “destination,” “date”)
and conversational-domain (“about-task,” “about-communication,”
or “situation-frame”). Again, unlike our case, their task-subtask
annotation scheme needs to be defined in advance.
      </p>
      <p>
        Lowe et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] released the Ubuntu Dialogue Corpus, which
contains 930,000 human-human dialogues extracted from Ubuntu
chats. Their effort is more similar to ours than the aforementioned
studies on task-oriented dialogue evaluation in that they focus
primarily on unstructured dialogues rather than slot-filling. However,
while they automatically disentangled the chats to form dyadic
dialogues, their original chat logs usually involve more than two
parties, which makes it different from our dyadic customer-helpdesk
DCH-1 dataset. They formed a response selection test data set
by setting aside 2% of the corpus and forming (context, response,
flag) triplets based on this set. Here, context is the sequence of
utterances that appear prior to the response in the dialogue;
response is either the actual correct response from the dialogue or a
randomly chosen utterance from outside the dialogue (but within
the test set); flag is one for the correct response and zero for
incorrect responses. For each correct response, they generated nine
additional triplets containing different incorrect responses. Thus,
response selection systems are given a context and ten choices of
responses, and required to select one or more responses. They use
recall at k as the evaluation measure, where k is the size of the
set of responses selected by the system and therefore “recall at 1”
reduces to accuracy. Note that this evaluation setting does not
require annotations for defining the gold standard. They do not
consider ranked lists of responses as is done at STC.
      </p>
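      <p>The following minimal sketch illustrates how "recall at k" behaves in this
one-correct-out-of-ten setting; the data layout is our own assumption, not the
corpus's actual format.</p>
      <preformat>
# Minimal sketch of "recall at k" for 1-of-10 response selection (illustrative).
def recall_at_k(ranked_candidate_flags, k):
    """ranked_candidate_flags: 0/1 flags for the ten candidates, ordered by the
    system's preference; exactly one flag is 1 (the actual correct response)."""
    return 1.0 if 1 in ranked_candidate_flags[:k] else 0.0

# Example: the correct response was ranked 2nd among the ten candidates.
flags = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
print(recall_at_k(flags, 1), recall_at_k(flags, 2))  # 0.0, 1.0 ("recall at 1" = accuracy)
      </preformat>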
      <p>
        The most straightforward approach to evaluating dialogues is
to collect subjective assessments from the user who actually
experienced the dialogue. Hone and Graham [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] used a large
questionnaire to evaluate an in-car speech interface and identified system
response accuracy, likeability, cognitive demand, annoyance,
habitability and speed as the key factors in subjective evaluation by
means of factor analysis; their approach is known as SASSI
(Subjective Assessment of Speech System Interfaces). Hartikainen et
al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] applied a service quality assessment from marketing to the
evaluation of a telephone-based email application; their method is
known as SERVQUAL. Paek [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] discusses SASSI, SERVQUAL and
PARADISE in a survey paper that discusses spoken dialogue
evaluation, along with his Wizard-of-Oz approach of using human
performance to replace a system component in order to define a gold
standard.
      </p>
    </sec>
    <sec id="sec-6">
      <title>Evaluating Textual Information Access</title>
      <p>
        While the aforementioned BLEU [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] is basically equivalent to an
n-gram-based precision, ROUGE [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], a BLEU-inspired measure
designed for text summarisation evaluation, is basically a suite of
measures including n-gram-based (or skip-gram-based) recall and
F-measure. Just as BLEU requires multiple reference translations,
ROUGE requires multiple reference summaries. Note that the
basic units of comparison, namely n-grams etc., are automatically
extracted from both the references and the system output.
      </p>
      <p>
        In contrast to the above automatically extracted units of
comparison, manually-devised nuggets have been used in both
summarisation evaluation [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and question answering evaluation. In
the TREC Question Answering (QA) tracks, a nugget is defined as
“a fact for which the annotator could make a binary decision as to
whether a response contained that nugget” [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Having constructed
nuggets, (weighted) recall, precision and F-measure scores can be
computed, except that the precision computation requires special
handling: while one can count the number of nuggets present or
missing in the system output, one cannot count the number of
“non-nuggets” (i.e., irrelevant pieces of information) in the same
output, since “non-nuggets” are never defined. Hence, nugget
precision, which is supposed to quantify the amount of irrelevant
information in the output, cannot be defined. To work around this
problem, a fixed-length “allowance” was introduced at the TREC
QA tracks so that nugget precision could be defined based solely on
the system output length. The TREC QA tracks also used a measure
called POURPRE, which replaces the manual nugget matching step
with automatic nugget matching based on unigrams. The NTCIR
ACLIA (Advanced Cross-lingual Information Access) Task adapted
these methods for evaluating QA with Asian languages [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
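      <p>To make the allowance idea concrete, here is a hedged sketch of
allowance-based nugget recall and precision; the allowance constant (100
characters per matched nugget) and the example weights are illustrative
assumptions rather than the exact TREC definition.</p>
      <preformat>
# Hedged sketch of allowance-based nugget scoring in the spirit of the TREC QA tracks.
def nugget_recall(matched_weights, all_weights):
    """Weighted recall: weight of nuggets found in the output over the total weight."""
    return sum(matched_weights) / sum(all_weights)

def nugget_precision(output_length, num_matched, allowance_per_nugget=100):
    """Length-based precision: "non-nuggets" cannot be counted, so only the
    output length and a fixed allowance per matched nugget are used."""
    allowance = allowance_per_nugget * num_matched
    if allowance >= output_length:
        return 1.0
    return 1.0 - (output_length - allowance) / output_length

print(nugget_recall([1.0, 0.5], [1.0, 0.5, 1.0, 0.5]),
      nugget_precision(output_length=350, num_matched=2))
      </preformat>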
      <p>
        As was discussed above, traditional evaluation measures for
summarisation and question answering employ variants of recall,
precision and F-measure based on small textual units. Hence, they
regard the system output as a set of n-grams, nuggets, and so on.
In contrast, Sakai, Kato and Song [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] introduced a nugget-based
evaluation measure called
        S-measure for evaluating textual summaries for mobile search,
by incorporating a decay factor for nugget weights based on nugget
positions. Just as information retrieval for ranked retrieval
defines a decay function over ranks of documents, S-measure defines
a linear decay function over the text, using offset positions of the
nuggets. This reflects the view that important nuggets should be
presented first and that we should minimise the amount of text that
the user has to read. Sakai and Kato [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] complement S-measure
with a precision-like measure called T-measure, which, unlike the
aforementioned allowance-based precision used at the TREC QA
track, takes into account the fact that different pieces of
information require different textual lengths. They define an "iUnit"
(information unit) as “an atomic piece of information that stands alone
and is useful to the user.”
      </p>
      <p>
        Sakai and Dou [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] generalised the idea of S-measure to
handle various textual information access tasks, including web search.
Their measure, known as U-measure, constructs a string called
trailtext, which is a concatenation of all the texts that the user has read
(obtained by observation or by assuming a user model). Then, over
the trailtext, a linear decay function is defined (See Section 4).
      </p>
    </sec>
    <sec id="sec-7">
      <title>DESIGNING AND BUILDING DCH-1</title>
    </sec>
    <sec id="sec-8">
      <title>Overview</title>
      <p>Our ultimate goal is automatic evaluation of human-machine
Customer-Helpdesk dialogues. As a first step towards it, we built
two test collections based on real (i.e., human-human)
Customer-Helpdesk dialogues, which we call DCH-0 and DCH-1.</p>
      <p>DCH-0, our smaller collection, was used to establish an efficient
and reliable test collection construction procedure. For example,
although we started constructing DCH-0 by using the number of
posts in each dialogue for sampling dialogues of different lengths,
where a post refers to a piece of timestamped text entered by either</p>
      <p>Customer or Helpdesk, we quickly realised that posts are often a
mere artifact of the Weibo users’ arbitrary hits of the ENTER key,
and that they are not suitable as the basic semantic unit. Based on
this experience, we used the utterance block as the basis for
measuring the length of a dialogue in DCH-1, formed by merging all
consecutive posts by the same utterer.</p>
      <p>Table 1 provides some statistics of DCH-0 and DCH-1. As shown
in the table, 184 of the 3,700 DCH-1 dialogues are “triggerless,” by
which we mean that Customer and Helpdesk exchange remarks
even though Customer does not seem to be facing any problem (cf.
Figure 1). (We tried filtering out these triggerless dialogues for the
analyses reported in Section 5, but the effect of this on our results was
not substantial.) Below, we discuss the construction and validation of
DCH-1.</p>
    </sec>
    <sec id="sec-8-1">
      <title>Dialogue Mining</title>
      <p>The 3,700 Helpdesk dialogues contained in the DCH-1 test
collection were mined from Weibo in September 2016 as follows. (1) We
collected an initial set of Weibo accounts by searching Weibo
account names that contained keywords such as “assistant” and
“helper” (in Chinese). We denote this set by A0. (2) For each
account name a in A0, we added a prefix “@” to a and used the string
as a query for searching up to 40 conversational threads (i.e.,
initial post plus comments on it) that contain a mention of the official
account. (Weibo’s interface for conversational threads is somewhat different
from Twitter’s: comments to a post are not displayed on the main timeline;
they are displayed under each post only if the “comments” button is clicked.)
We then filtered out accounts that did not respond to
over one half of these threads. We denote the filtered set of
“active” accounts as A. (3) For each account a in A, we retrieved all
threads that contain a mention of a from January 2013 to
September 2016, and extracted Customer-Helpdesk dyadic dialogues from
them. We then kept those that consist of at least one utterance
block by Customer and one by Helpdesk. As a result, 21,669
dialogues were obtained. This collection is denoted as D0. (4) As D0 is
too large for annotation, we sampled 3,700 dialogues from it as
follows. For i = 2, 3, ..., 6, we randomly sampled 700 dialogues that
contained i utterance blocks. In addition, we randomly sampled
200 that contained i = 7 utterance blocks; we could not sample
700 dialogues for i = 7 as D0 did not contain enough dialogues
that are very long.</p>
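      <p>A small sketch of the sampling in step (4) is given below; the function and
variable names are ours, and num_blocks_of stands for whatever routine counts the
utterance blocks of a dialogue.</p>
      <preformat>
# Sketch of step (4): 700 dialogues per utterance-block count of 2-6, plus 200 with
# 7 blocks (D0 contained too few very long dialogues). Names are illustrative.
import random

def sample_dch1(d0_dialogues, num_blocks_of):
    sampled = []
    for i in range(2, 7):                      # i = 2, 3, ..., 6
        pool = [d for d in d0_dialogues if num_blocks_of(d) == i]
        sampled.extend(random.sample(pool, 700))
    pool7 = [d for d in d0_dialogues if num_blocks_of(d) == 7]
    sampled.extend(random.sample(pool7, 200))
    return sampled                             # 5 * 700 + 200 = 3,700 dialogues
      </preformat>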
      <p>10% (370) of the Chinese dialogues in DCH-1 were manually
translated into English by a professional translation company for
research purposes.</p>
    </sec>
    <sec id="sec-9">
      <title>Annotators</title>
      <p>We hired 16 Chinese undergraduate students from the Faculty of
Science and Engineering at Waseda University so that each
Chinese dialogue was annotated independently by three annotators.
The assignment of dialogues to annotators was randomised; given
a dialogue, each annotator first read the entire dialogue carefully,
and then gave it ratings according to the five subjective annotation
criteria described in Section 3.4; finally, he/she identified nuggets
within the same dialogue, where nuggets were defined as described
in Section 3.5. An initial face-to-face instruction and training
session for the annotators was organised by the first author of this
paper at Waseda University; subsequently, the annotators were
allowed to do their annotation work online using a
web-browser-based tool at a location and time convenient to them. The number of
dialogues assigned to each annotator was 3,700 × 3 / 16 = 693.75
on average; all of them completed their work within two weeks as
they were initially asked to do. The actual annotation time spent
by each annotator was 18-20 hours.
</p>
    </sec>
    <sec id="sec-10">
      <title>Subjective Annotation</title>
      <p>By subjective annotation, we mean manual quantification of the
quality of a dialogue as a whole. As there are two players involved
in a Customer-Helpdesk dialogue, we wanted to accommodate the
following two viewpoints:</p>
      <p>Customer’s viewpoint Does Helpdesk solve Customer’s
problem efficiently? Customer may want a
solution quickly while providing minimal information to
Helpdesk.</p>
      <p>Helpdesk’s viewpoint Does Customer provide accurate
and sufficient information so that Helpdesk can provide
the right solution? Helpdesk also wants to solve
Customer’s problem through minimal interactions, as these
interactions translate directly into cost for the company.
Moreover, we wanted to assess customer satisfaction as this is of
utmost importance for both parties. While customer satisfaction
ratings should ideally be collected from the real customer at the
time of dialogue termination, we had no choice but to collect
surrogate, post-hoc ratings by the annotators instead.</p>
      <p>By considering the above points as well as our results from the
smaller DCH-0 collection, we finally devised the following five
subjective annotation criteria:</p>
      <p>Task Statement Whether the task (i.e., the problem to be
solved) is clearly stated by Customer (denoted by TS);
Task Accomplishment Whether the task is actually
accomplished (denoted by TA);
Customer Satisfaction Whether Customer is likely to have
been satisfied with the dialogue, and to what degree
(denoted by CS);</p>
      <sec id="sec-10-1">
        <title>Helpdesk Appropriateness Whether Helpdesk provided</title>
        <p>appropriate information (denoted by HA);</p>
      </sec>
      <sec id="sec-10-2">
        <title>Customer Appropriateness Whether Customer provided</title>
        <p>appropriate information (denoted by CA).</p>
        <p>Figure 2 shows the actual instructions for annotators: note that
CS is on a five-point scale (−2 to 2), while the other four are on a
three-point scale (−1 to 1).</p>
        <p>
          Table 2 shows the inter-rater agreement (for three assessors)
of the subjective labels in terms of Fleiss’ κ [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] and Randolph’s
κfree [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]; κfree is known to be more suitable when the labels are
heavily skewed across the categories, which is indeed the case here.
“2+ agree” means the proportion of dialogues for which at least two
annotators agree, e.g., (−1, −1); “3 agree” means the proportion of
dialogues for which all three annotators agree, e.g., (−1, −1, −1).
        </p>
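      <p>For reference, the two agreement statistics can be sketched as follows; the
ratings matrix in the example is made up and only illustrates the computation.</p>
      <preformat>
# Hedged sketch of Fleiss' kappa and Randolph's free-marginal kappa (illustrative).
def fleiss_kappa(counts, free_marginal=False):
    """counts[i][j] = number of raters who assigned category j to item i."""
    num_items = len(counts)
    num_cats = len(counts[0])
    n = sum(counts[0])                         # raters per item (assumed constant)
    p_bar = sum((sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts) / num_items
    if free_marginal:                          # Randolph's kappa_free
        p_e = 1.0 / num_cats
    else:                                      # Fleiss' kappa
        totals = [sum(row[j] for row in counts) for j in range(num_cats)]
        p_e = sum((t / (num_items * n)) ** 2 for t in totals)
    return (p_bar - p_e) / (1 - p_e)

# Example: 4 dialogues, 3 raters each, ratings {-1, 0, 1} mapped to columns 0..2.
ratings = [[0, 0, 3], [1, 2, 0], [0, 3, 0], [2, 1, 0]]
print(fleiss_kappa(ratings), fleiss_kappa(ratings, free_marginal=True))
      </preformat>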
        <p>It can be observed that the agreement among the three assessors
is low, except perhaps for TS, which reflects the highly subjective
nature of this labelling task. While it may be possible to improve
the inter-assessor agreement a little in our future work by revising
the labelling instructions, it should be stressed that our labelling
task is not document relevance assessments, and that it is
inherently highly subjective. We believe that, as our future work,
hiring more than three assessors and preserving their different
viewpoints in the test collection is more important than trying to force
them into reaching an agreement.
</p>
      </sec>
    </sec>
    <sec id="sec-11">
      <title>Nugget Annotation</title>
      <p>We had three annotators independently identify nuggets for each
dialogue as follows. At the instruction and training session,
annotators were given the diagram shown in Figure 3, which reflects our
view that accumulating nuggets will eventually solve Customer’s
problem, together with a written definition of nuggets, as described
below. (1) A nugget is a post, or a sequence of consecutive posts
by the same utterer (i.e., either Customer or Helpdesk). (2) It can
neither partially nor wholly overlap with another nugget. (3) It
should be minimal: that is, it should not contain irrelevant posts
at the start, the end or in the middle. An irrelevant post is one
that does not contribute to the Customer transition (See Figure 3).
(4) It helps Customer transition from Current State (including
Initial State) towards Target State (i.e., when the problem is solved).</p>
      <p>
        Note that we utilise Weibo posts as the atomic building blocks
for forming nuggets; this takes into account the remark by Wang
et al. [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]: “Experience from question answering evaluations has
shown that users disagree about the granularity of nuggets—for
example, whether a piece of text encodes one or more nuggets and how
to treat partial semantic overlap between two pieces of text.” Note
also that according to our definition, an utterance block (i.e.,
maximal consecutive posts by the same utterer) generally subsumes
one or more nuggets.
      </p>
      <p>Compared to traditional nugget-based information access
evaluation that was discussed in Section 2.3, there are two unique
features in nugget-based helpdesk dialogue evaluation: (1) A dialogue
involves two parties, Customer and Helpdesk; (2) Even within the
same utterer, nuggets are not homogeneous, by which we mean
that some nuggets may play special roles. In particular, since the
dialogues we consider are task-oriented (but not closed-domain,
which makes slot filling approaches infeasible), there must be some
nuggets that represent the state of identifying the task and those
that represent the state of accomplishing it.</p>
      <p>Based on the above considerations, we defined the following
four mutually exclusive nugget types:</p>
      <p>CNUG0 Customer’s trigger nuggets. These are nuggets that
define Customer’s initial problem, which directly caused
Customer to contact Helpdesk.</p>
      <p>HNUG Helpdesk’s regular nuggets. These are nuggets in
Helpdesk’s uterances that are useful from Customer’s
point of view.</p>
      <p>CNUG Customer’s regular nuggets. These are nuggets in
Customer’s uterances that are useful from Helpdesk’s
point of view.</p>
      <p>HNUG Helpdesk’s goal nuggets. These are nuggets in
Helpdesk’s uterances which provide the Customer with
a solution to the problem.</p>
      <p>CNUG Customer’s goal nuggets. These are nuggets in
Customer’s uterances which tell Helpdesk that Customer’s
problem has been solved.</p>
      <p>Each nugget type may or may not be present in a dialogue.
Multiple nuggets of the same type may be present in a dialogue.</p>
      <p>Using a pull-down menu on our web-browser-based tool,
assessors categorised each post into CNUG0, CNUG, HNUG, CNUG*,
HNUG*, or NAN (not a nugget). Then, consecutive posts with
the same label (e.g., CNUG followed by CNUG) were automatically
merged to form a nugget.</p>
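      <p>The merging step can be sketched as follows; the data structures and the way
the goal labels are written are our own assumptions.</p>
      <preformat>
# Sketch of the automatic merging step: consecutive posts with the same nugget label
# (other than NAN) form a single nugget. Data structures are illustrative.
from itertools import groupby

def merge_posts_into_nuggets(labelled_posts):
    """labelled_posts: list of (post_text, label) pairs, label in
    {"CNUG0", "CNUG", "HNUG", "CNUG*", "HNUG*", "NAN"}."""
    nuggets = []
    for label, group in groupby(labelled_posts, key=lambda p: p[1]):
        if label != "NAN":
            nuggets.append((label, [text for text, _ in group]))
    return nuggets
      </preformat>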
      <p>Table 3 shows the inter-annotator agreement of the nugget
annotations, where the posts are used as the basis for comparison.
The 3,700 dialogues in DCH-1 contain a total of 7,155 Helpdesk
posts, all of which were annotated independently by three
annotators, producing a total of 21,465 annotations. A direct comparison
with the subjective annotation agreement shown in Table 2 would
be difficult, since both the annotation unit (dialogues vs. nuggets)
and the annotation schemes (numerical ratings vs. nugget types)
are different. However, it can be observed that the agreement for
Customer nuggets is substantially higher than for the Helpdesk
nuggets. A possible explanation for this would be that it is easier
for annotators to judge the contribution of Customer's utterances
for reaching his/her target state than to judge that of Helpdesk, at
least for regular nuggets: while Helpdesk often asks Customer for
more information regarding the problem context, it is Customer's
utterances that actually provide that information.</p>
      <p>[Figure 3: The Customer state transition view behind nugget annotation.
A nugget's contribution moves Customer from the initial state (facing a
problem) towards the target state (problem solved), possibly via
intermediate states where the problem is not quite solved yet but Customer
is a little closer to the target state; different paths can lead from
Customer's current state to the target state, and some Helpdesk-Customer
interactions do not directly lead Customer to an intermediate or target
state.]</p>
      <p>While directly comparing the inter-annotator agreement of
subjective annotation and nugget annotation seems difficult, we would
like to compare the intra-annotator consistency by making each
annotator process the same dialogue multiple times in our future
work.
</p>
    </sec>
    <sec id="sec-12">
      <title>UCH: A DIALOGUE EVALUATION</title>
    </sec>
    <sec id="sec-13">
      <title>MEASURE</title>
      <p>
        We now propose an evaluation measure that leverages nuggets for
quantifying the quality of Customer-Helpdesk dialogues. We
regard a Customer-Helpdesk dialogue as a trailtext of U-measure,
which may or may not contain nuggets. Let pos denote the
position (i.e., offset from the beginning of the dialogue) of a nugget; for
ideographic languages such as Chinese and Japanese, we use the
number of characters to define the offset position. Given a patience
parameter L, we define a decay function over the trailtext as [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]:
D(pos) = max(0, 1 − pos/L).   (1)
This is for discounting the value of a nugget that appears later in
the dialogue; at position L, the value of any nugget wears out
completely. In our experiments, we let L = Lmax = 916 as this is
the number of (Chinese) characters in the longest dialogue from
the DCH-1 collection. The benefit of introducing L is discussed in
Section 5.2.
      </p>
      <p>Let N and M denote the number of Customer’s non-goal nuggets
and Helpdesk’s non-goal nuggets identified within a dialogue,
respectively; for simplicity, let us assume that there is at most one
Customer’s goal nugget (c ) and at most one Helpdesk’s goal nugget
(h ) in a dialogue. Let fc1; : : : ; cN ; c g denote the set of nuggets
from Customer’s posts, and let fh1; : : : ; hM ; h g denote that from
Helpdesk’s posts. Let pos(ci ) (i 2 f1; : : : ; N ; g) be the position of
nugget ci ; pos(hj ) (j 2 f1; : : : ; M; g) is defined similarly.</p>
      <p>Given the gain value of each non-goal nugget (g(ci)), a simple
evaluation measure based solely on Customer's utterances can be
computed as:
UC = Σ_{ci ∈ {c1, ..., cN, c*}} g(ci) D(pos(ci)).   (2)
In the present study, we define the gain value of CNUG* as g(c*) =
1 + Σ_{i=1}^{N} g(ci). This is an attempt at reflecting the view that task
accomplishment is what matters most. To be more specific, when
the discounting function is ignored and dialogues are regarded as
sets of nuggets, then having only the goal nugget is better than
having all the regular nuggets. Similarly, given the gain value of each
non-goal nugget (g(hj)), a measure solely based on Helpdesk's
utterances can be computed as:</p>
      <p>UH = Σ_{hj ∈ {h1, ..., hM, h*}} g(hj) D(pos(hj)),   (3)
where g(h*) = 1 + Σ_{j=1}^{M} g(hj). Finally, for a given parameter α
(0 ≤ α ≤ 1) that specifies the contribution of Helpdesk's utterances
relative to Customer's, we can define the following combined
measure:
UCH_α = (1 − α) UC + α UH.   (4)</p>
      <p>By default, we use α = 0.5. Note that UCH_0.5 is equivalent to
computing a single U-measure score without distinguishing between
Customer’s and Helpdesk’s nuggets. The choice of α is discussed
in Section 5.3.</p>
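      <p>A minimal sketch of UCH as defined by Eqs. 1-4 is given below; the data
structures are our own, and the unit gain values are used only for illustration.</p>
      <preformat>
# Sketch of UCH computed from one annotator's nuggets (Eqs. 1-4); illustrative only.
L_MAX = 916  # length (in Chinese characters) of the longest DCH-1 dialogue

def decay(pos, L=L_MAX):
    """Eq. 1: position-based discount; a nugget at or beyond position L contributes nothing."""
    return max(0.0, 1.0 - pos / L)

def utility(regular_nuggets, goal_position, L=L_MAX):
    """regular_nuggets: list of (gain, position); goal_position: position of the goal
    nugget or None. The goal gain is 1 + the sum of the regular gains (Eqs. 2 and 3)."""
    score = sum(g * decay(pos, L) for g, pos in regular_nuggets)
    if goal_position is not None:
        goal_gain = 1.0 + sum(g for g, _ in regular_nuggets)
        score += goal_gain * decay(goal_position, L)
    return score

def uch(customer_nuggets, customer_goal, helpdesk_nuggets, helpdesk_goal, alpha=0.5, L=L_MAX):
    """Eq. 4: UCH_alpha = (1 - alpha) * UC + alpha * UH."""
    uc = utility(customer_nuggets, customer_goal, L)
    uh = utility(helpdesk_nuggets, helpdesk_goal, L)
    return (1 - alpha) * uc + alpha * uh

# Example with unit gains: two Customer regular nuggets, one Helpdesk goal nugget.
print(uch([(1.0, 10), (1.0, 120)], None, [], 200))
      </preformat>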
      <p>Since we have three independent nugget annotations per
dialogue, we tried two approaches to computing a single score for
a given dialogue: Average UCH (AUCH) simply computes a UCH
score for each annotator and then takes the average for that dialogue;
Consolidated UCH (CUCH) merges the nuggets from multiple
annotators first and then computes a single UCH score. We only report
on results with AUCH, which consistently outperformed CUCH in
our experiments.</p>
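      <p>AUCH is then simply the mean of the per-annotator UCH scores for a dialogue,
as in this small sketch:</p>
      <preformat>
# AUCH: average the per-annotator UCH scores of one dialogue (illustrative).
def auch(per_annotator_uch_scores):
    return sum(per_annotator_uch_scores) / len(per_annotator_uch_scores)

print(auch([1.8, 2.1, 1.5]))  # three annotators for one dialogue
      </preformat>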
    </sec>
    <sec id="sec-14">
      <title>ANALYSIS WITH UCH</title>
      <p>This section addresses the following questions: How does UCH
correlate with subjective ratings? (Section 5.1); Is the patience
parameter L useful for estimating subjective ratings? (Section 5.2); and
Which utterer plays the major role when estimating subjective
ratings with UCH? (Section 5.3).</p>
      <p>In the analysis reported below, we use the z-score of each
subjective rating before averaging them over the three annotators. That
is, for each annotator and subjective criterion, we first compute the
mean and standard deviation of the raw ratings, and then process
each raw rating by subtracting the mean and then dividing by the
standard deviation. This is to remove each annotator’s inherent
scoring tendency.
</p>
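      <p>The per-annotator normalisation can be sketched as follows; whether the
population or sample standard deviation was used is not stated, so the choice
below is an assumption.</p>
      <preformat>
# Per-annotator z-score normalisation of raw subjective ratings (illustrative).
import statistics

def z_scores(raw_ratings):
    """raw_ratings: all ratings given by one annotator for one criterion."""
    mean = statistics.mean(raw_ratings)
    stdev = statistics.pstdev(raw_ratings)   # population standard deviation (assumption)
    return [(r - mean) / stdev for r in raw_ratings]

print(z_scores([1, 0, -1, 1, 1]))
      </preformat>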
    </sec>
    <sec id="sec-15">
      <title>Correlation with Subjective Annotations</title>
      <p>
        Table 4 shows the Kendall’s τ values between AUCH and the
average subjective ratings for the DCH-1 collection, with 95%
confidence intervals. It can be observed that AUCH is reasonably highly
correlated with HA (.414, 95% CI[.398, .432]) and CA (.434, 95%
CI[.417, .450]). That is, even though the inter-annotator agreement
for appropriateness is relatively low (Table 2), AUCH manages to
estimate the average appropriateness with reasonable accuracy. On
the other hand, the table shows that the τ between AUCH and
CS is very low, albeit statistically significant (.118, 95% CI[.097,
.141]). One possible explanation for this might be that the CS
ratings themselves are not as reliable as we would have liked. First, as
we have discussed in Section 3.4, the annotators are not the actual
customers; second, our manual inspection of some of the dialogues
from DCH-0 and DCH-1 suggests that the annotator's ratings may
be influenced by his/her prior impression of the product/service or
the company, rather than the contents of the particular dialogue
in question.</p>
    </sec>
    <sec id="sec-15-1">
      <title>The Patience Parameter L</title>
      <p>
        As was explained in Section 4, UCH inherits the patience
parameter L from S-measure [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] and U-measure [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], to discount the
value of a nugget based on its position within the dialogue. As we
have mentioned earlier, we let L = Lmax = 916 by default, as this
is the length of the longest dialogue within DCH-1. Using a small
L means that the decay function becomes steep and that we do not
tolerate long dialogues; using an extremely large L is equivalent to
switching off the decay function, thereby treating the dialogue as
a set of nuggets (See Eq. 1).
      </p>
      <p>Figure 4 shows the efect of L on the τ between average CS and
AUCH. It can be observed that, at least for DCH-1, L = Lmax/4 =
229 seems to be a good choice if AUCH is to be used for estimating
customer satisfaction. This suggests that user satisfaction may be
linked to user patience, and that considering nugget positions as
UCH does is of some use. However, as was discussed earlier, the
reliability of the CS ratings deserves a closer investigation in our
future work.</p>
      <p>[Figure 4: Effect of the patience parameter L (swept from Lmax/16 to ∞) on the τ between average CS and AUCH.]</p>
    </sec>
    <sec id="sec-15-2">
      <title>The Contribution Parameter α</title>
      <p>As Eq. 4 shows, UCH can decide on a balance between Customer's
utterances and Helpdesk's; a small α means that we rely more on
Customer nuggets for computing UCH. Figure 5 shows the effect of
α on the τ between AUCH and different average subjective ratings.
The trends are the same for TS, TA, CS, and CA: the smaller the α,
the higher the rank correlation. That is, to achieve the highest τ, it
is best to rely entirely on Customer utterances, i.e., to completely
ignore Helpdesk utterances.</p>
      <p>Interestingly, however, the trend is different for HA: the curve
for HA suggests that α = 0.5, our default value, is in fact the best
choice. That is, to achieve the highest τ with Helpdesk
Appropriateness, treating Customer’s and Helpdesk’s nuggets equally
appears to be a good choice. While it is obvious that Helpdesk’s
utterances need to be taken into account in order to estimate Helpdesk
Appropriateness, the curve implies that Customer's utterances also
play an important part in the estimation. These results suggest that
</p>
    </sec>
    <sec id="sec-16">
      <title>CONCLUSIONS</title>
      <p>As an initial step towards evaluating automatic dialogue
systems, we constructed DCH-1, which contains 3,700 real
Customer-Helpdesk multi-turn dialogues mined from Weibo. We have
annotated each dialogue with subjective quality annotations (TS, TA,
CS, HA, and CA) and nugget annotations, with three annotators
per dialogue. In addition, 10% of the dialogues have been manually
translated into English. We described how we constructed the test
collection and the philosophy behind it. We also proposed UCH,
a simple nugget-based evaluation measure for task-oriented
dialogue evaluation, and explored its usefulness and limitations. Our
main findings on UCH based on the DCH-1 collection are as
follows.</p>
      <p>(1) UCH correlates better with subjective ratings that reflect
the appropriateness of utterances (HA and CA) than with
customer satisfaction (CS);
(2) The patience parameter L of UCH, which considers the
positions of nuggets within a dialogue, may be a useful
feature for enhancing the correlation with customer
satisfaction;
(3) For the majority of our subjective annotation criteria,
customer utterances seem to play a much more important role
for UCH to achieve high correlations with subjective
ratings than helpdesk utterances do, according to our
analysis on the parameter α.</p>
      <sec id="sec-16-1">
        <title>Our future work includes the following:</title>
        <p>Comparing subjective annotation and nugget annotation
in terms of intra-annotator agreement;
Investigating the reliability of offline customer
satisfaction ratings by comparing them with real customer
ratings collected right after the termination of a helpdesk
dialogue;</p>
        <p>
          Collecting subjective and nugget annotations for the
English subcollection of DCH-1, and comparing across
Chinese and English;
Devising ways for automatic nugget identification and
automatic categorisation of nuggets into different nugget
types;
The NTCIR-14 Short Text Conversation task (STC-3) will feature
a new subtask that is based on the present study: given a dialogue,
participating systems are required to estimate the distribution of
subjective scores such as user satisfaction over multiple annotators,
as well as the distribution of nugget types (e.g. trigger, regular,
goal, not-a-nugget) over multiple assessors for each utterance [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
        </p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>DeVault</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leuski</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Sagae</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Toward Learning and Evaluation of Dialogue Policies with Text Examples</article-title>
          ,
          <source>Proceedings of SIGDIAL</source>
          <year>2011</year>
          , pp.
          <fpage>39</fpage>
          -
          <lpage>48</lpage>
          (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Fleiss</surname>
            ,
            <given-names>J. L.</given-names>
          </string-name>
          :
          <article-title>Measuring Nominal Scale Agreement among Many Raters, Psychological Bulletin</article-title>
          , Vol.
          <volume>76</volume>
          , No.
          <issue>5</issue>
          , pp.
          <fpage>378</fpage>
          -
          <lpage>382</lpage>
          (
          <year>1971</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Galley</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brockett</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sordoni</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ji</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Auli</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Quirk</surname>
            , C., Mitchell,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Dolan</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>∆BLEU: A Discriminative Metric for Generation Tasks with Intrinsically Diverse Targets</article-title>
          ,
          <source>Proceedings of ACL</source>
          <year>2015</year>
          , pp.
          <fpage>445</fpage>
          -
          <lpage>450</lpage>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Hartikainen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salonen</surname>
            ,
            <given-names>E.-P.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Turunen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Subjective Evaluation of Spoken Dialogue Systems Using SERVQUAL Method</article-title>
          ,
          <source>Proceedings of INTERSPEECH 2004-ICSLP</source>
          (
          <year>2004</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Higashinaka</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Funakoshi</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kobayashi</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Inaba</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>The Dialogue Breakdown Detection Challenge: Task Description, Datasets, and Evaluation Metrics</article-title>
          ,
          <source>Proceedings of LREC</source>
          <year>2016</year>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Hone</surname>
            ,
            <given-names>K. S.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Graham</surname>
          </string-name>
          , R.:
          <article-title>Towards a Tool for the Subjective Assessment of Speech System Interfaces (SASSI</article-title>
          ),
          <source>Natural Language Engineering</source>
          , Vol.
          <volume>6</volume>
          , No.
          <fpage>3</fpage>
          -
          <issue>4</issue>
          , pp.
          <fpage>287</fpage>
          -
          <lpage>303</lpage>
          (
          <year>2000</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          -Y.:
          <article-title>ROUGE: A Package for Automatic Evaluation of Summaries</article-title>
          ,
          <source>Proceedings of the Workshop on Text Summarization Branches Out</source>
          , pp.
          <fpage>74</fpage>
          -
          <lpage>81</lpage>
          (
          <year>2004</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Demner-Fushman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Will Pyramids Built of Nuggets Topple Over?</article-title>
          ,
          <source>Proceedings of HLT/NAACL</source>
          <year>2006</year>
          , pp.
          <fpage>383</fpage>
          -
          <lpage>390</lpage>
          (
          <year>2006</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Lowe</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pow</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Serban</surname>
            ,
            <given-names>I. V.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Pineau</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems</article-title>
          ,
          <source>Proceedings of SIGDIAL</source>
          <year>2015</year>
          , pp.
          <fpage>285</fpage>
          -
          <lpage>294</lpage>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Mitamura</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shima</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sakai</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kando</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mori</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Takeda</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
          </string-name>
          , C.-Y.,
          <string-name>
            <surname>Song</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          -J. and
          <string-name>
            <surname>Lee</surname>
          </string-name>
          , C.-W.:
          <article-title>Overview of the NTCIR-8 ACLIA Tasks: Advanced Cross-Lingual Information Access</article-title>
          ,
          <source>Proceedings of NTCIR-8</source>
          , pp.
          <fpage>15</fpage>
          -
          <lpage>24</lpage>
          (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Nenkova</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Passonneau</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          and
          <string-name>
            <surname>McKeown</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>The Pyramid Method: Incorporating Human Content Selection Variation in Summarization Evaluation</article-title>
          ,
          <source>ACM Transactions on Speech and Language Processing</source>
          , Vol.
          <volume>4</volume>
          , No.
          <volume>2</volume>
          (
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Paek</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Toward Evaluation that Leads to Best Practices: Reconciling Dialog Evaluation in Research and Industry</article-title>
          ,
          <source>Bridging the Gap: Academic and Industrial Research in Dialogue Technologies Workshop Proceedings</source>
          , pp.
          <fpage>40</fpage>
          -
          <lpage>47</lpage>
          (
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Papineni</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roukos</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ward</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Zhu</surname>
          </string-name>
          , W.-J.:
          <article-title>BLEU: a Method for Automatic Evaluation of Machine Translation</article-title>
          ,
          <source>Proceedings of ACL</source>
          <year>2002</year>
          , pp.
          <fpage>311</fpage>
          -
          <lpage>318</lpage>
          (
          <year>2002</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Randolph</surname>
            ,
            <given-names>J. J.</given-names>
          </string-name>
          :
          <article-title>Free-marginal Multirater Kappa (Multirater κfree): An Alternative to Fleiss' Fixed Marginal Multirater Kappa</article-title>
          ,
          <source>Joensuu Learning and Instruction Symposium</source>
          <year>2005</year>
          (
          <year>2005</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Sakai</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Towards Automatic Evaluation of Multi-Turn Dialogues: A Task Design that Leverages Inherently Subjective Annotations</article-title>
          ,
          <source>Proceedings of EVIA</source>
          <year>2017</year>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Sakai</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Dou</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>Summaries, Ranked Retrieval and Sessions: A Unified Framework for Information Access Evaluation</article-title>
          ,
          <source>Proceedings of ACM SIGIR</source>
          <year>2013</year>
          , pp.
          <fpage>473</fpage>
          -
          <lpage>482</lpage>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Sakai</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Kato</surname>
            ,
            <given-names>M. P.</given-names>
          </string-name>
          :
          <article-title>One Click One Revisited: Enhancing Evaluation based on Information Units</article-title>
          ,
          <source>Proceedings of AIRS</source>
          <year>2012</year>
          (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Sakai</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kato</surname>
            ,
            <given-names>M. P.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Song</surname>
            ,
            <given-names>Y.-I.</given-names>
          </string-name>
          :
          <article-title>Click the Search Button and Be Happy: Evaluating Direct and Immediate Information Access</article-title>
          ,
          <source>Proceedings of ACM CIKM</source>
          <year>2011</year>
          , pp.
          <fpage>621</fpage>
          -
          <lpage>630</lpage>
          (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Shang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sakai</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Higashinaka</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miyao</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Arase</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Nomoto</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Overview of the NTCIR-13 Short Text Conversation Task</article-title>
          ,
          <source>Proceedings of NTCIR-13</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Shang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sakai</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Higashinaka</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Miyao</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Overview of the NTCIR-12 Short Text Conversation Task</article-title>
          ,
          <source>Proceedings of NTCIR-12</source>
          , pp.
          <fpage>473</fpage>
          -
          <lpage>484</lpage>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Walker</surname>
            ,
            <given-names>M. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Litman</surname>
            ,
            <given-names>D. J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kamm</surname>
            ,
            <given-names>C. A.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Abella</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>PARADISE: A Framework for Evaluating Spoken Dialogue Agents</article-title>
          ,
          <source>Proceedings of ACL</source>
          <year>1997</year>
          , pp.
          <fpage>271</fpage>
          -
          <lpage>280</lpage>
          (
          <year>1997</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <surname>Walker</surname>
            ,
            <given-names>M. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Passonneau</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Boland</surname>
            ,
            <given-names>J. E.</given-names>
          </string-name>
          :
          <article-title>Quantitative and Qualitative Evaluation of DARPA Communicator Spoken Dialogue Systems</article-title>
          ,
          <source>Proceedings of ACL</source>
          <year>2001</year>
          , pp.
          <fpage>515</fpage>
          -
          <lpage>522</lpage>
          (
          <year>2001</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sherman</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Efron</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Assessor Differences and User Preferences in Tweet Timeline Generation</article-title>
          ,
          <source>Proceedings of ACM SIGIR</source>
          <year>2015</year>
          , pp.
          <fpage>615</fpage>
          -
          <lpage>624</lpage>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>