<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>November</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>A Closer Look at Evaluation Measures for Ordinal Quantification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tetsuya Sakai</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Waseda University</institution>
          ,
          <addr-line>Tokyo</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>1</volume>
      <issue>2021</issue>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>In his ACL 2021 paper [1], Sakai compared several evaluation measures in the context of Ordinal Quantification (OQ) tasks in terms of system ranking similarity, system ranking consistency (i.e., robustness to the choice of test data), and discriminative power (i.e., ability to find many statistically significant differences). Based on his experimental results, he recommended the use of his RNOD (Root Normalised Order-aware Divergence) measure along with NMD (Normalised Match Distance, i.e., normalised Earth Mover's Distance). The present study follows up on his discriminative power experiments, by taking a much closer look at the statistical significance test results obtained from each evaluation measure. Our new analyses show that (1) RNOD is the overall winner among the OQ measures in terms of pooled discriminative power (i.e., discriminative power across multiple data sets); (2) NMD behaves noticeably differently from RNOD and from measures that cannot handle ordinal classes; (3) NMD tends to favour a popularity-based baseline (which accesses the gold distributions) over a uniform-distribution baseline, thus contradicting the other measures in terms of statistical significance. As both RNOD and NMD have their merits, we recommend that the organisers of OQ tasks use both of them to evaluate the systems from multiple angles.</p>
      </abstract>
      <kwd-group>
        <kwd>evaluation</kwd>
        <kwd>evaluation measures</kwd>
        <kwd>distributions</kwd>
        <kwd>ordinal classes</kwd>
        <kwd>ordinal quantification</kwd>
        <kwd>prevalence estimation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Quantification (or prevalence estimation) tasks are highly practical [2, 3, 4]. While classification evaluation deals with a confusion matrix whose rows and columns represent gold and estimated classes, quantification evaluation compares a gold distribution over the classes with an estimated distribution. Put another way, while classification cares about exactly which of the given instances (with masked gold labels) are classified into each class (represented in the cells of the confusion matrix), quantification only cares about how many of the instances are classified into each class. In general, a quantification task involves multiple test cases; each test case has its own set of instances, and the number of instances can vary across cases. Hence a quantification evaluation measure computes a score for each test case by comparing the gold and estimated distributions, and the measure score can be averaged across the test cases. Statistical significance tests can therefore be applied to compare the systems.</p>
      <p>In the present study, if a task involves the comparison of an estimated probability mass function with a gold probability mass function for each test case, we regard it as a quantification task, regardless of what the exact input to the estimation system is. (Note that a frequency distribution of instances can easily be converted to a probability mass function.) In particular, we define Ordinal Quantification (OQ) as a task that requires systems to estimate a probability mass function over ordinal classes for each test case.<sup>1</sup> Examples of OQ tasks defined in this way include the following.</p>
      <def-list>
        <def-item>
          <term>SemEval 2016/2017 Task 4 Subtask E</term>
          <def><p>Given a set of tweets about a topic, estimate the distribution of the tweets over five classes: highly negative, negative, neutral, positive, highly positive [5, 6].</p></def>
        </def-item>
        <def-item>
          <term>Dialogue Breakdown Detection Challenge</term>
          <def><p>For each system utterance within a human-machine dialogue, estimate the distribution of gold labels given by the annotators, where the possible labels are NB (not a breakdown), PB (possible breakdown), B (breakdown) [7].</p></def>
        </def-item>
        <def-item>
          <term>NTCIR Dialogue Quality Subtasks</term>
          <def><p>For each customer-helpdesk dialogue, estimate the distribution of dialogue quality ratings given by the annotators, where the possible ratings are −2, −1, 0, 1, 2 [8, 9].</p></def>
        </def-item>
      </def-list>
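      <p>The conversion from a frequency distribution to a probability mass function mentioned above can be sketched as follows. This is a minimal illustration only: the function name <monospace>votes_to_pmf</monospace> and the annotator votes are hypothetical and are not taken from any of the task data sets.</p>
      <preformat>
```python
from collections import Counter


def votes_to_pmf(votes, classes):
    """Turn raw labels (one per instance or annotator) into a probability
    mass function over the given ordered list of classes."""
    if not votes:
        raise ValueError("at least one vote is required")
    counts = Counter(votes)
    total = len(votes)
    # Classes that received no vote get probability 0.
    return [counts[c] / total for c in classes]


# Five-point dialogue quality scale, as in the NTCIR Dialogue Quality subtasks.
ratings = [-2, -1, 0, 1, 2]
# Hypothetical votes from ten annotators for a single dialogue.
votes = [0, 1, 1, 2, 1, 0, -1, 1, 2, 1]

pmf = votes_to_pmf(votes, ratings)
print(pmf)  # [0.0, 0.1, 0.2, 0.5, 0.2]
```
      </preformat>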
      <sec id="sec-1-1">
        <p>To evaluate systems (or runs [22]) submitted to OQ tasks, evaluation measures that can handle ordinal classes should be used. More specifically, “nominal quantification” measures such as Mean Absolute Error (MAE), (Root) Mean Squared Error ((R)MSE), and Jensen-Shannon Divergence (JSD) are not adequate, as they are based on simple averaging/summing across classes [1].</p>
      </sec>
      <sec id="sec-1-2">
        <p><sup>1</sup> Interval classes are also ordinal by definition.</p>
      </sec>
      <sec id="sec-1-3">
        <p>To see why, consider a gold distribution for the aforementioned SemEval task, where all tweets for a topic are in the highly positive class; consider a system which puts all tweets in highly negative (i.e., an utter failure), and consider another which puts all tweets in positive (i.e., a near miss that is off by only one class). It is clear that the above measures rate both systems as utter failures, because they ignore the distance between classes.</p>
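        <p>The example above can be checked numerically. The following is a minimal sketch, assuming MAE is the per-class mean of absolute differences and using NMD (defined in Section 3) as an order-aware contrast; the function names are illustrative and not part of any official evaluation tool.</p>
        <preformat>
```python
def mae(p, gold):
    """Mean Absolute Error across classes (a nominal measure)."""
    return sum(abs(a - b) for a, b in zip(p, gold)) / len(gold)


def nmd(p, gold):
    """Normalised Match Distance: sum of absolute differences between
    cumulative distributions, divided by (number of classes - 1)."""
    cum_p = cum_g = total = 0.0
    for a, b in zip(p, gold):
        cum_p += a
        cum_g += b
        total += abs(cum_p - cum_g)
    return total / (len(gold) - 1)


# Classes: highly negative, negative, neutral, positive, highly positive.
gold = [0, 0, 0, 0, 1]           # every tweet is highly positive
utter_failure = [1, 0, 0, 0, 0]  # every tweet put in highly negative
near_miss = [0, 0, 0, 1, 0]      # every tweet put in positive

print(mae(utter_failure, gold), mae(near_miss, gold))  # 0.4 0.4
print(nmd(utter_failure, gold), nmd(near_miss, gold))  # 1.0 0.25
```
        </preformat>
        <p>In this sketch the nominal measure cannot separate the two systems, whereas the order-aware measure penalises the utter failure four times as heavily as the near miss.</p>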
        <p>To the best of our knowledge, only two families of
measures are known to be suitable for evaluating OQ
systems: the Earth Mover’s Distance family [10, 11, 12],
which is based on cumulative distributions of the gold
and estimated distributions, and Sakai’s Order-aware
Divergence family, proposed in 2017-2018 [13, 14]. Recently,
Sakai [1] reported on extensive experiments for
comparing the above two families as well as nominal
quantification measures in the context of evaluating OQ systems
submitted to the SemEval and NTCIR tasks. His
recommendation was to use Root Normalised Order-aware
Divergence (RNOD) as the primary measure, and
Normalised Match Distance (NMD) as the secondary
measure, where NMD is simply a normalised version of Earth
Mover’s Distance [5]. In that study, RNOD was preferred
over NMD because it was the overall winner across the
data sets in terms of system ranking
consistency (i.e., the ability to provide stable system rankings
regardless of the choice of test data) [15] and
discriminative power (i.e., the ability to obtain many statistically
significant differences under the same experimental
condition) [16, 17].</p>
        <p>The present study follows up on Sakai’s
experiments [1], by taking a much closer look at the statistical
significance test results obtained from each evaluation
measure. We also leverage additional sets of OQ data
from NTCIR that were not previously used. Our new
analyses show that (1) RNOD is the overall winner among the
OQ measures in terms of pooled discriminative power (i.e.,
discriminative power across multiple data sets); (2) NMD
behaves noticeably differently from RNOD and from
measures that cannot handle ordinal classes; (3) NMD tends
to favour a popularity-based baseline (which accesses
the gold distributions) over a uniform-distribution
baseline, thus contradicting the other measures in terms of
statistical significance. As both RNOD and NMD have
their merits, we recommend that the organisers of OQ tasks
use both of them to evaluate the systems from multiple
angles.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <p>The aforementioned OQ tasks of SemEval (2016/2017 Task 4 Subtask E) [5, 6] used Earth Mover’s Distance (EMD) as the evaluation measure, remarking that “EMD is currently the only known measure for ordinal quantification.” Their EMD is the same as Match Distance [14, 10], and the present study uses its normalised version, called Normalised Match Distance (NMD) [14, 1]. NMD has been used as one of the evaluation measures for the aforementioned OQ tasks of NTCIR (Dialogue Quality) [8, 9].</p>
        <p>In 2017, Sakai [13] proposed Order-aware Divergence
(OD), Normalised OD (NOD), and Symmetric Normalised
OD (SNOD) for OQ tasks, by explicitly incorporating
the notion of “distance” between classes. Subsequently,
Sakai [14] proposed Root Normalised OD (RNOD) and
Root Symmetric Normalised OD (RSNOD), as the
computation of OD involves sums of squares. The OQ tasks of
NTCIR have used RSNOD along with NMD [8, 9]. Sakai’s
recent recommendation for OQ tasks [1] is to use RNOD
as the primary measure and NMD as the secondary
measure, for the reasons discussed in Section 1.</p>
        <p>Although the aforementioned Dialogue Breakdown
Detection Challenge (DBDC) [7] is an OQ task, the
official evaluation measures used there for comparing two
distributions were MSE and JSD, which are nominal quantification
measures and therefore cannot consider
the ordinal nature of the classes. Subsequently, the organisers of DBDC used
their Japanese and English DBDC task data to compare
these official DBDC measures with NMD and RSNOD
in terms of system ranking consistency and
discriminative power; they reported that RSNOD was the overall
winner [18].</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Quantification Measures</title>
      <p>Table 1 provides a brief qualitative summary of the
measures considered in this study. Due to lack of space, we
refer the reader to Sakai [1] for the definitions of nominal
quantification measures; here, we define only the ordinal
quantification measures.</p>
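      <p>As a companion to the definitions in this section, the ordinal measures (Eqs. 1-6) can be sketched in Python as follows. This is an illustrative sketch rather than the official evaluation code: the function names are our own, classes are represented by 0-based list indices (which leaves the distance |i − j| unchanged), and the inputs are the four-class System A and System B example discussed in this section.</p>
      <preformat>
```python
import math


def nmd(p, gold):
    """Eq. 1: Normalised Match Distance."""
    cum_p = cum_g = total = 0.0
    for est, ref in zip(p, gold):
        cum_p += est
        cum_g += ref
        total += abs(cum_p - cum_g)
    return total / (len(gold) - 1)


def dw(i, p, gold):
    """Eq. 2: Distance-Weighted sum of squares for class i."""
    return sum(abs(i - j) * (p[j] - gold[j]) ** 2 for j in range(len(gold)))


def od(p, gold):
    """Eq. 3: average DW over classes with a non-zero gold probability."""
    support = [i for i, g in enumerate(gold) if g > 0]
    return sum(dw(i, p, gold) for i in support) / len(support)


def rnod(p, gold):
    """Eq. 5: Root Normalised Order-aware Divergence."""
    return math.sqrt(od(p, gold) / (len(gold) - 1))


def rsnod(p, gold):
    """Eqs. 4 and 6: root-normalised symmetric variant."""
    sod = (od(p, gold) + od(gold, p)) / 2
    return math.sqrt(sod / (len(gold) - 1))


gold = [0.25, 0.25, 0.25, 0.25]
system_a = [0.25, 0.35, 0.15, 0.25]  # error in the middle of the scale
system_b = [0.25, 0.25, 0.35, 0.15]  # error at the end of the scale

print(round(od(system_a, gold), 3), round(od(system_b, gold), 3))  # 0.02 0.025
print(math.isclose(nmd(system_a, gold), nmd(system_b, gold)))      # True
```
      </preformat>
      <p>The printed values agree with the OD scores of 0.020 and 0.025 reported for the example, and with the observation that NMD rates the two systems equally.</p>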
      <p>Let  denote the set of ordinal classes, represented by
consecutive integers for convenience. Let  denote the
estimated probability for Class , so that ∑︀∈  = 1. RNADW (Root Normalised ADW) can also be defined:
Similarly, let * denote the gold probability. We also
denote the entire probability mass functions by  and * , √︃ ADW(, * )
respectively. Let cp = ∑︀≤  , and cp* = ∑︀≤  *. RNADW(, * ) = || − 1 .
NMD is given by [14]:
(8)
NMD (, * ) = ∑︀∈ |cp − cp* | . (1) WeWwhiellnevtahleuraeteatrheisovnalyriatnwtoincolausrsfeustu(|re|wo=rk. 2) and
|| − 1 therefore the distinction between nominal and
ordi</p>
      <p>We now define R(S)NOD. First, let the Distance- nal classes becomes unnecessary, it can be shown that
Weighted sum of squares for Class  be: NMD(, * ) = RNOD( || * ) = RNOD(* || )(=
RSNOD(, * )). See the Appendix for a proof.</p>
      <p>DW  = ∑︁   ( − * )2 ,   = | − | . (2) OD-based measures tend to emphasise errors near
ei∈ ther end of the ordinal scale, as the following example
DW was designed to quantify the overall error from the illustrates. Consider a situation with || = 4 ordinal
viewpoint of a particular gold class : it tries to measure classes and a uniform gold distribution: * = 0.25( =
how much of its probability * has been misallocated 1, . . . , 4). If we compare System A which returns 1 =
to other classes  ∈ ( ̸= ), by assuming that the 4 = 0.25, 2 = 0.35, 3 = 0.15 and System B which
diference between  and * is directly caused by a mis- retruns 1 = 2 = 0.25, 3 = 0.35, 4 = 0.15, then
allocation of part of * ; the weight   is designed to from Eq. 2, DW1 = DW4 = 0.03, DW2 = DW3 = 0.01
penalise the misallocation based on the distance between and therefore OD = 0.020 for System A; whereas,
the ordinal classes. Note that the   in Eq. 2 assumes DW1 = 0.05, DW2 = 0.03, DW3 = DW4 = 0.01 and
equidistance; we shall discuss an alternative in Section 6. therfore OD = 0.025 for System B. Hence System A</p>
      <p>Let * = { ∈ |* &gt; 0}. That is, * (⊆ ) is the is considered slightly better. On the other hand, it can
set of classes with a non-zero gold probability. Order- easily be verified that A and B are considered equally
aware Divergence is defined as: efective in terms of NMD. It should be noted that this
diference does not say which measure is “correct” as
OD ( ‖ * ) = 1 ∑︁ DW  , (3) an OQ evaluation measure, as both measures take the
|* | ∈* ordinal nature of the classes into account. (Similarly, we
cannot say whether (say) JSD is superior to NVD for a
with its symmetric version: nominal quantification task just because they difer.)
SOD (, * ) =</p>
      <p>OD ( ‖ * ) + OD (* ‖ )
2
.</p>
      <p>(4)
RNOD and RSNOD are defined as:
4. Data
2Similarly, it is clear from Eqs. 2 and 3 that * =  (i.e., there
is no gold probability that is zero) is a suficient condition for OD to
be symmetric [13]. Another suficient condition for guaranteeing
the symmetry of OD is: |* | = 1 and |{ ∈  |  &gt; 0}| =
1 (i.e., both the gold and estimated distributions have exactly one
positive probability).</p>
      <p>√︃ Table 2 provides an overview of the eight OQ task data
RNOD ( ‖ * ) = OD ( ‖ * ) , (5) sets that we used for our analysis. The three STC-3
|| − 1 (Short Text Conversation 3) data sets [8] were not used in
√︃ Sakai [1], but the specifications of the Dialogue Quality
RSNOD (, * ) = SOD (, * ) . (6) (DQ) subtask at STC-3 are identical to those of DialEval-1
|| − 1 (Dialogue Evaluation 1) [9]. As can be seen, all data sets</p>
      <p>come with five ordinal classes. For the two SemEval data</p>
      <p>Note that Eq. 3 averages over * rather than  because sets, the classes are tweet polarities, namely, highly
negaoHfowwheavterD, Wit isiaslmsoepanoststioblreeptoredseefinnet,aavsadriisacnutsosefdOaDboavse. ttihvee,snixegNaTtiCveI,Rndeuattarasle,tpso,stihtievec,lahsigsehslyarpeosfivieti-vpeo[in5,t 6s]c.aFleor
follows; let us call it ADW (Average DW): dialogue quality ratings (− 2 through 2) based on three
ADW (, * ) = 1 ∑︁ DW  . (7) diferent viewpoints, namely, A-score (task
accomplish|| ∈ ment), E-score (dialogue efectiveness), and S-score
(customer satisfaction) [8, 9]. Hence, for example, DialEval-1
From Eqs. 2 and 7, it is clear that ADW is symmetric.2 A DQ-A is the data set containing the gold and estimated
root-normalised measure based on ADW, which we call probability distributions for the A-score estimation
“subsubtask” of the NTCIR-15 DialEval-1 task. The NTCIR
dialogue data were provided in both Chinese and English
(manually translated from the original Chinese text) to
the participants, and the participants were allowed to
submit Chinese and/or English runs. On the other hand,</p>
    </sec>
    <sec id="sec-4">
      <title>5. Analysis</title>
      <p>the gold distributions were constructed solely based on
the original Chinese dialogues. Hence, both Chinese and
English runs are evaluated using the same gold distribu- 5.1. Pooled Discriminative Power
tions. The gold distributions of the STC-3 and DialEval-1
data were constructed based on votes from 19 and 20
assessors for each dialogue, respectively [8].</p>
      <p>The NTCIR data sets are larger than the SemEval data
sets both in terms of the test data sample size  and
in terms of the number of runs to be evaluated. Hence
our results with the NTCIR data may be more reliable,
especially regarding statistical significance test results.</p>
      <p>Sakai [1] presented discriminative power curves [16] for
NMD, R(S)NOD, NVD, RNS, and JSD using the SemEval
and DialEval-1 data sets. Given a data set with submitted
runs, a discriminative power curve is obtained by
obtaining a -value for every system pair (using a randomised
Tukey HSD test with  = 5000 trials [19]) and sorting
them in descending order. Curves that are closer to the
100
125
390
#Runs used
12
14
origin represent discriminative measures, i.e., those that not examine which measures agree or disagree with each
can give us confident conclusions from experiments. A other in terms of significance test results. This section
highly discriminative measure is not necessarily “correct,” addresses exactly this question.
but we do want measures to be discriminative to some ex- Table 4 breaks down the number of significant
diftent; otherwise, we will not be able to conclude anything ferences () shown in Table 3 by comparing the
refrom experiments [20]. sults of every pair of measures. More specifically, we</p>
      <p>Here, we revisit Sakai’s results by focusing on the present Statistical Significance Overlaps (SSO’s), defined
commonly-used significance level of  = 0.05, to view as SSO = /( +  + ), where  is the number of
sigand summarise the discriminative power results in a more nificant diferences found with the first measure only, 
quantitative manner. More specifically, for data set , is the number of significant diferences found with both
let DP  = /, where  is the total number of measures, and  is the number of significant diferences
system pairs ( = ( − 1)/2 if there are  found with the second measure only. That is, the 
systems) and (≤ ) is the number of those found for the first measure is  + , and that for the second
to be statistically significantly diferent at  . To provide measure is  + .
a quantitative summary of discriminative power results If the SSO for a pair of evaluation measures is high,
over a set  = {} of data sets, we define pooled dis- that means that the two measures tend to give us similar
criminative power as follows: conclusions as to which system pairs are statistically
significantly diferent. However, it can be observed that SSO
PDP = ∑︁ / ∑︁  . (9) is not always high, as underlined in Table 4. In
particu∈ ∈ lar, note that the SSOs of NMD with other measures are
We also report on additional results with the three STC-3 particularly low for DialEval-1 DQ-A and DQ-S (Parts (f)
data sets; these were not discussed in Sakai [1]. and (h) of the table). In Section 5.1, we have pointed</p>
      <p>Table 3 shows the individual and pooled discrimina- out that NMD performs very poorly with these two data
tive power results for each of our eight OQ data sets. sets in terms of discriminative power. The discriminative
For example, for NMD, the discriminative power with power results alone could mean two situations: (i) NMD
Sem16T4E is 38/66 = 57.6% and higher than the other manages to find only a subset of the significant
difermeasures, but as it sufers with the NTCIR data, the ences found by the other measures; or (ii) NMD finds
pooled discriminative power is only 42.0%. In particular, significant diferences outside those found by the other
note that NMD performs very poorly with DialEval-1 DQ- measures, and the diferences found are relatively few.
A and DQ-S data sets: with DialEval-1 DQ-A, NMD finds Table 4(f) and (h) reveals that the truth is Situation (ii).
only 84 statistically significant diferences at  = 0.05, For example, from Table 4(h), we can see that the SSO
while RSNOD, RNOD, NVD, RNSS, and JSD find as many between NMD and RNOD is only 52.2% (with only 71
as 119, 133, 138, 113, and 135, respectively; similarly, with diferences found significant by both measures), and that
DialEval DQ-S, NMD finds only 82 statistically signifi- NMD found as many as 11 significant diferences that
cant diferences at  = 0.05, while RSNOD, RNOD, NVD, were not considered significant by RNOD. Similarly, the
RNSS, and JSD find as many as 115, 125, 120, 129, and 127, SSO between NMD and NVD is only 48.5% (with only
respectively. This apparent breakdown of NMD for these 66 diferences found significant by both measures), and
two data sets was also visualised in the discriminative NMD found as many as 16 significant diferences that
curves of Sakai [1, Figure 2]. were not considered significant by NVD. This outlier</p>
      <p>Our findings in terms of pooled discriminative power tendency of NMD is consistent with Sakai’s observation
are: regarding system ranking similarity [1, Table 5].</p>
      <p>The above analysis examined the overlaps of
signif• The most discriminative measures are RNOD and icantly diferent system pairs based on two-sided tests.</p>
      <p>NVD (but recall that NVD is a nominal quantifi- However, the overlaps in fact contain a few
contradiccation measure). tions: a statistical significance contradiction occurs when
• RNOD outperforms NMD; one measure says “System  statistically significantly
• RSNOD slightly underperforms RNOD, suggest- outperforms System ” while another says “System 
ing that making the measure symmetric is not statistically significantly outperforms System .” Which
beneficial [1]. system outperforms another is determined by the mean
scores of  and  (smaller the better in our case).
Al5.2. Significance Overlaps and though such situations are very rare, we have found them
useful for understanding the properties of the measures,</p>
      <p>Contradictions as discussed below.</p>
      <p>Discriminative power only considers how many signifi- Table 5 shows the number of contradictions, which
cant diferences each measure manages to obtain; it does can be used together with Table 4. (There were no
contra(e) STC-3 DQ-S
(f) DialEval-1 DQ-A</p>
      <sec id="sec-4-1">
        <title>RSNOD</title>
        <p>(g) DialEval-1 DQ-E</p>
      </sec>
      <sec id="sec-4-2">
        <title>RSNOD</title>
        <p>(h) DialEval-1 DQ-S</p>
      </sec>
      <sec id="sec-4-3">
        <title>RSNOD</title>
      </sec>
      <sec id="sec-4-4">
        <title>RSNOD</title>
        <p>70.7 (9/29/3)</p>
      </sec>
      <sec id="sec-4-5">
        <title>RSNOD</title>
        <p>76.0 (10/38/2)</p>
      </sec>
      <sec id="sec-4-6">
        <title>RSNOD</title>
        <p>91.7 (5/66/1)</p>
      </sec>
      <sec id="sec-4-7">
        <title>RSNOD</title>
        <p>94.4 (0/68/4)
RSNOD
98.5 (0/65/1)
65.0 (4/80/39)
77.1 (15/101/15)
0
0
dictions in the Sem16T4E, Sem17T4E, and STC-3 DQ-E • A_BL-popularity-E vs. A_BL-uniform-E
results.) For example, while Table 4(c) shows that NMD
and RNOD detected a statistical significance for the same However, these essentially constitute one contradiction,
68 run pairs from STC-3 DQ-A (with an SSO of 95.8%), Ta- as the contents of the Chinese and English runs
A_BLble 5(I) shows that 4 of them were actually contradictions. popularity-{C,E} are the same, as are those of
A_BLHence, if we choose to remove the contradictions prior to uniform-{C,E}. These are the Popularity and Uniform
computing the SSO, it would be 64/(3+64+0) = 95.5%. baseline runs provided by the organisers of the NTCIR
However, “practical” contradictions in the five NTCIR tasks, which rely on the following simple strategies.3
data sets occur less frequently than what Table 5 suggests, Popularity Access the gold data, and return an
“estias explained below. mated” distribution where the class that is most</p>
        <p>Table 6 provides an exhaustive list of the exact run frequent in the gold distribution is given a
probpairs listed as contradictions in Table 5. For example, ability of 1, and others are given a 0. Note that
Table 6(I) reveals that the 4 contradictions mentioned in this is a type of oracle run.
Table 5(a) are:
• A_BL-popularity-C vs. A_BL-uniform-C
• A_BL-popularity-C vs. A_BL-uniform-E
• A_BL-popularity-E vs. A_BL-uniform-C</p>
        <sec id="sec-4-7-1">
          <title>Uniform Always return a uniform distribution.</title>
        </sec>
        <sec id="sec-4-7-2">
          <title>3The SemEval tasks also had a few baseline runs, including a</title>
          <p>run that always assigns a probability of 1 to the Positive class [5, 6].</p>
          <p>They did not have Popularity and Uniform baselines.
Note that the prefix “A_BL” means that the run is a base- RNSS and JSD conclude the exact opposite. As for the
line for the A-score estimation subsubtask; similar base- results for DialEval-1 DQ-E (Table 6(IV)), they are
essenline runs are present in the E-score and S-score estimation tially identical to those of STC-3 DQ-A (Table 6(I)).
subsubtask data from NTCIR, with prefixes “E_BL” and In Table 6(III) and (V), we see non-baseline runs. Note
“S_BL.” that, for example, A_NKUST-run0-C and
A_NKUST</p>
          <p>Sakai [1] kept both the Chinese and English versions run0-E are actually diferent runs, unlike the situations
of the baselines in his experiments even though their with the aforementioned baseline runs. Thus, for
examcontents are the same, because they have diferent file ple, the 8 contradictions shown in Table 5(III) between
names and were listed as distinct runs in the oficial NMD/RSNOD and RNOD/RNSS/JSD are essentially for
evaluations [8, 9]. Hence we follow suit. However, it the following 3 cases, as shown in Table 6(III).
is clear from the above that the contradiction in STC-3
DQ-A is essentially a single instance: while NMD con- • A_BL-popularity-{C,E} vs. A_BL-uniform-{C,E} (4
cludes that Popularity statistically significantly outper- run combinations)
forms Uniform, RNOD, NVD, RNSS, and JSD concludes • A_BL-popularity-{C,E} vs. A_NKUST-run0-C (2
the exact opposite. Similarly, Table 6(II) reveals that the 4 run combinations)
contradictions shown in Table 5 also concerns Popularity • A_BL-popularity-{C,E} vs. A_NKUST-run0-E (2
vs. Uniform: NMD and RSNOD conclude that Popular- run combinations)
ity statistically significantly outperforms Uniform, while</p>
        </sec>
        <sec id="sec-4-7-3">
          <title>In every case, NMD and RSNOD conclude that Popular</title>
          <p>ity statistically significantly outperforms the other run, where  (∙ ) is the score according to measure  for
which is in direct disagreement with the other measures. a run’s estimated distribution for a particular dialogue.</p>
          <p>We also observe that RSNOD behaves similarly to Note that, since these measures give smaller scores to
NMD from Table 5(II), (III), and (V) and the accompa- better systems, a negative delta means Popularity is
prenying details in Table 6(II), (III), and (V). Moreover, there ferred while a positive delta means Uniform is preferred.
are no contradictions between NMD and RSNOD. These Figure 1 shows scatter plots of score deltas by
comresults suggest that the properties of RSNOD lie some- paring NMD with NVD, RNSS, JSD, and RNOD; Figure 2
where between NMD and RNOD: this is in line with shows similar scatterplots by comparing RNOD with
Sakai’s observation regarding the system ranking simi- NVD, RNSS, and JSD. The two figures are arranged to
falarity for the DialEval-1 DQ-A and DQ-S data [1, Table 5]. cilitate comparisons across NMD and RNOD. (To reduce
Put another way, introducing symmetry appears to bring the number of measure combinations, RSNOD is omitted
RNOD closer to NMD. in this analysis.) Within each green box, the number of</p>
          <p>Our findings regarding statistical significance overlap instances in the 2nd and 4th quadrants (i.e., dialogues for
and contradictions can be summarised as follows. which two measures disagree as to which of Popularity
and Uniform is better) is shown, together with a
Pearson correlation with a 95%CI. From the figures, we can
observe the following.
• The sets of significant diferences found by NMD
are generally not subsets of those found by the
other, more discriminative measures. NMD
behaves markedly diferently from other measures
regarding which system pairs are statistically
significant.
• There are a few contradictions between NMD and
four other measures (RNOD, NVD, RNSS, JSD)
in terms of significance test results, and all of
these contradictions involve a Popularity baseline,
which access the gold distributions. NMD tends
to rate Popularity higher than Uniform, thus
directly contradicting the other measures.
• From the viewpoint of contradictions, RSNOD</p>
          <p>behaves somewhat similarly to NMD.
Δ =  (Popularity) −  (Uniform) ,
(10)
• The correlations between NMD and the nominal
quantification measures (NVD, RNSS, and JSD)
are lower compared to those between RNOD and
the nominal quantification measures, as the
Pearson correlations and the scatterplots show.
• More importantly, NMD disagrees more often
with the nominal quantification measures than
RNOD does. All of these disagreements of NMD
happen in the 2nd quadrant: that is, while NMD
says that Popularity outperforms Uniform, the
other three measures say otherwise.
• From Figure 1(d), NMD and RNOD disagree for
a total of 77 dialogues: for 75 of them, NMD
says that Popularity outperforms Uniform while
RNOD says otherwise; for the remaining 2
dialogues, NMD says that Popularity underperforms
Uniform while RNOD says otherwise.
5.3. Popularity vs. Uniform baselines</p>
        </sec>
        <sec id="sec-4-7-4">
          <title>In Section 5.2, we showed that NMD can contradict with</title>
          <p>RNOD, NVD, RNSS or JSD in terms of statistical
significance. In particular, we have seen cases where NMD
favours Popularity over Uniform, contrary to the con- In summary, for 62-78 dialogues out of 300 (21-26%), NMD
clusions of the other measures. We find this behaviour rates Popularity higher than Uniform, disagreeing with
of NMD generally intuitive, as Popularity accesses the the other measures.
gold data and utilises that knowledge, while Uniform is
noninformative and practically useless. However, note
that whether Popularity should actually be rated higher
depends on what the gold distribution looks like: for
example, if the gold distributions is almost flat, we would
like the measure to prefer Uniform over Popularity.</p>
          <p>To examine the above tendency of NMD, this section
focusses on the comparison between Popularity and
Uniform. First, we focus on contradictions regarding
Popularity vs. Uniform from the DialEval-1 DQ-A data set,
as we found the highest number of conflicts (not limited
to Popularity vs. Uniform) in this data set among the
ifve data sets shown in Table 5-6. For each evaluation
measure  and for each dialogue, we first compute the
score delta (e.g. Δ  ):
-0.8
0.8 RΔN</p>
          <p>O</p>
          <p>D
-0.8
-0.6
-0.4
-0.2
0
popularity − A_BL-uniform from DialEval-1 DQ-A): NMD vs</p>
        </sec>
      </sec>
      <sec id="sec-4-8">
        <title>NVD/RNSS/JSD/RNOD. Number of instances in the 2nd and</title>
      </sec>
      <sec id="sec-4-9">
        <title>4th quadrants, as well as Pearson correlations (with 95%CIs,</title>
        <p>= 300), are also shown.
-0.8
0.8 RΔN</p>
        <p>SS</p>
        <sec id="sec-4-9-1">
          <p>Whether a measure prefers Popularity or Uniform depends on what the gold distribution for each dialogue looks like. To closely examine situations where NMD favours Popularity while disagreeing with the other measures, we discuss two actual dialogues from the DialEval-1 DQ-A data below, which were selected as follows. Because we are primarily interested in how and why NMD and RNOD behave differently, we ranked the 300 dialogues by how much the ΔNMD and ΔRNOD values differ, that is, by ΔNMD − ΔRNOD. Figure 3 shows the gold, Popularity, and Uniform distributions of the top two dialogues in terms of this difference. In Figure 3(a) (181st dialogue, ΔNMD − ΔRNOD = −0.150 − 0.115 = −0.265), it can be observed that Class 1 has the highest gold probability, and therefore that Popularity sets the probability of Class 1 to be 1. This is how Popularity “cheats.” As shown in the pink box, all measures except NMD have positive Δ’s; that is, they say that Popularity underperforms Uniform. In contrast, NMD prefers Popularity. (Recall that for quantification measure scores, smaller means better.) In Figure 3(b) (18th dialogue, ΔNMD − ΔRNOD = 0 − 0.263 = −0.263), Class −2 has the highest gold probability. For this dialogue, NMD says that Popularity and Uniform are equally effective, while all other measures prefer Uniform. It is clear from these examples that it is difficult to say whether one measure is “correct” or not; we can only say that NMD tends to prefer Popularity over Uniform compared to the other measures.</p>
          <p>Figure 3: Top two dialogues from DialEval-1 DQ-A when ranked by ΔNMD − ΔRNOD: (a) 181st dialogue (−0.265) and (b) 18th dialogue (−0.263). Each panel shows the gold, Popularity, and Uniform distributions over Classes −2 to 2.</p>
          <p>Using the DialEval-1 DQ-A data set, we have so far discussed how NMD tends to favour Popularity over Uniform. To generalise this observation, Table 7 shows how often each measure prefers one of the two baselines, for each of the six NTCIR data sets that contain these baselines. For example, NMD prefers Uniform for 175 dialogues and prefers Popularity for 197 dialogues from the DTC-3 DQ-A data set. (For the remaining 390 − 175 − 197 = 18 dialogues, the two baselines are tied.) It can be observed from the TOTAL row that while RNOD, NVD, RNSS, and JSD prefer Uniform far more often (the probability that Popularity is preferred is far below 50%), NMD prefers Popularity more often than it prefers Uniform. As for RSNOD, it does prefer Uniform more often, just like RNOD and the others, but the tendency is less clear; again, its property lies somewhere between NMD and RNOD.</p>
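          <p>The per-dialogue bookkeeping behind such preference counts can be sketched in Python as follows. This is a minimal illustration and not the code used in the study: the function names and the tie tolerance are assumptions, and the only data used are the ΔNMD and ΔRNOD values of the two dialogues discussed above.</p>
          <preformat><![CDATA[
```python
from typing import Sequence, Tuple


def preference_counts(deltas: Sequence[float], tol: float = 1e-12) -> Tuple[int, int, int]:
    """Count (prefers Uniform, prefers Popularity, ties) from score deltas.

    Each delta is the Popularity score minus the Uniform score. Smaller
    scores are better, so a positive delta means Uniform is preferred.
    """
    prefers_uniform = sum(1 for d in deltas if d > tol)
    prefers_popularity = sum(1 for d in deltas if d < -tol)
    ties = len(deltas) - prefers_uniform - prefers_popularity
    return prefers_uniform, prefers_popularity, ties


def sign_conflicts(deltas_a: Sequence[float], deltas_b: Sequence[float],
                   tol: float = 1e-12) -> int:
    """Count items on which two measures prefer opposite baselines
    (points in the 2nd and 4th quadrants of a delta-vs-delta scatterplot)."""
    return sum(
        1
        for a, b in zip(deltas_a, deltas_b)
        if (a > tol and b < -tol) or (a < -tol and b > tol)
    )


# Deltas of the 181st and 18th DialEval-1 DQ-A dialogues, as quoted in the text.
delta_nmd = [-0.150, 0.0]
delta_rnod = [0.115, 0.263]

print(preference_counts(delta_nmd))    # (Uniform, Popularity, ties) under NMD
print(preference_counts(delta_rnod))   # (Uniform, Popularity, ties) under RNOD
print(sign_conflicts(delta_nmd, delta_rnod))
```
]]></preformat>
          <p>Applying preference_counts to the full list of deltas for one measure and one data set would yield one cell group of a table such as Table 7; a tie is counted separately rather than assigned to either baseline.</p>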
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Conclusions</title>
      <p>The present study re-examined the OQ measures (NMD
and R(S)NOD) along with nominal quantification
measures (NVD, RNSS, and JSD) on SemEval and NTCIR
data sets, based on statistical significance test results. Our
main findings are as follows.</p>
      <p>• According to our pooled discriminative power
results (Table 3), the most discriminative
measures are RNOD and NVD (but recall that NVD
is a nominal quantification measure and is not
appropriate for OQ).
• The sets of statistically significant differences
found by NMD are generally not subsets of those
found by other, more discriminative measures
like RNOD.
• NMD sometimes contradicts RNOD and
the nominal quantification measures in statistical
terms, by preferring a Popularity baseline over a
Uniform baseline.</p>
      <sec id="sec-5-1">
        <p>The tendency of NMD to rate Popularity higher than
Uniform is generally intuitive, since the former “cheats”
by accessing the gold data while the latter is the laziest
approach possible. However, it is difficult to say whether
NMD is more appropriate than RNOD, as the preference
between Popularity and Uniform should depend on what
the gold distribution looks like (e.g., is it almost flat?).
On the other hand, the strength of RNOD is that it is
statistically stable, as demonstrated in terms of system
ranking consistency [1] and pooled discriminative power.
Based on these arguments, we recommend using both
RNOD and NMD for evaluating OQ systems, to examine
them from multiple angles.</p>
        <p>Our future work includes exploring variants of RNOD.
More specifically, while Eq. 2 relies on $\delta_{ij} = |i - j|$ and
therefore assumes equidistance, an alternative $\delta_{ij}$ that is
free from this assumption could be considered. Inspired
by the distance function used in Krippendorff’s alpha for
ordinal classes [21, 22], one possibility is:</p>
        <p>\[ \delta_{ij} = \left( \sum_{k=\min(i,j)}^{\max(i,j)} p^*_k \right) - \frac{p^*_i + p^*_j}{2} . \tag{11} \]</p>
      </sec>
      <sec id="sec-5-2">
        <p>That is, we could utilise the gold probabilities that lie
between Classes $i$ and $j$ to define the distance. This can
also be combined with the RNADW measure that we
defined in Section 3.</p>
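        <p>As a rough illustration of the alternative distance in Eq. 11, the following Python sketch computes it from a gold distribution. The function name, the list representation, and the 0-based class indices are assumptions made for this example only.</p>
        <preformat><![CDATA[
```python
from typing import Sequence


def gold_mass_distance(gold: Sequence[float], i: int, j: int) -> float:
    """Distance between classes i and j (0-based) in the style of Eq. 11.

    Sums the gold probabilities from the lower to the higher class
    (inclusive) and subtracts the mean of the two endpoint probabilities.
    """
    lo, hi = min(i, j), max(i, j)
    return sum(gold[lo:hi + 1]) - (gold[i] + gold[j]) / 2.0


# With a flat gold distribution the distance is proportional to |i - j|.
flat = [0.25, 0.25, 0.25, 0.25]
print(gold_mass_distance(flat, 0, 3), gold_mass_distance(flat, 1, 2))

# With a skewed gold distribution, adjacent classes are no longer equidistant.
skewed = [0.5, 0.0, 0.0, 0.5]
print(gold_mass_distance(skewed, 0, 1), gold_mass_distance(skewed, 1, 2))
```
]]></preformat>
        <p>Under this sketch the distance from a class to itself is zero, and for a uniform gold distribution over n classes the distance reduces to |i − j|/n, so the equidistance of Eq. 2 is recovered up to scaling in that special case.</p>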
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <sec id="sec-6-1">
        <p>We thank the reviewers of the LQ 2021 workshop for their feedback on the initial version of this paper, and the organisers of the workshop for giving us the opportunity to publish our work.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>A. Proof That RNOD Equals NMD when $|C| = 2$</title>
      <p>Note that $cp_1 = p_1$ and $cp^*_1 = p^*_1$ in general.
Furthermore, when $|C| = 2$, note that $cp_2 = cp^*_2 = 1$. Hence,
from Eq. 1,</p>
      <p>\[ \mathrm{NMD}(p, p^*) = |cp_1 - cp^*_1| + |cp_2 - cp^*_2| = |p_1 - p^*_1| + 0 = |p_1 - p^*_1| . \tag{12} \]</p>
      <p>On the other hand, note that when $|C| = 2$, $(p_2 - p^*_2)^2 = (1 - p_1 - 1 + p^*_1)^2 = (p_1 - p^*_1)^2$. To
compute RNOD, the following three cases need to be
considered. Case 1, when $p^*_1 &gt; 0$ and $p^*_2 &gt; 0$: from Eq. 3,
$\mathrm{OD}(p \,\|\, p^*) = (DW_1 + DW_2)/2 = ((p_2 - p^*_2)^2 + (p_1 - p^*_1)^2)/2 = 2(p_1 - p^*_1)^2/2 = (p_1 - p^*_1)^2$. Hence from
Eq. 5,</p>
      <p>\[ \mathrm{RNOD}(p \,\|\, p^*) = \sqrt{\mathrm{OD}(p \,\|\, p^*)} = |p_1 - p^*_1| . \tag{13} \]</p>
      <p>Case 2, when $p^*_1 = 1$ and $p^*_2 = 0$: from Eq. 3, $\mathrm{OD}(p \,\|\, p^*) = DW_1 = (p_2 - p^*_2)^2 = (p_1 - p^*_1)^2$. Therefore,
Eq. 13 holds for this case as well. Case 3, when $p^*_1 = 0$
and $p^*_2 = 1$: from Eq. 3, $\mathrm{OD}(p \,\|\, p^*) = DW_2 = (p_1 - p^*_1)^2$, and Eq. 13 holds for this case as well. In summary,
$\mathrm{NMD}(p, p^*) = \mathrm{RNOD}(p \,\|\, p^*)$.</p>
      <p>Finally, following similar steps as above, we can also
obtain:</p>
      <p>\[ \mathrm{RNOD}(p^* \,\|\, p) = \sqrt{\mathrm{OD}(p^* \,\|\, p)} = |p_1 - p^*_1| . \tag{14} \]</p>
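      <p>The two-class identity can also be checked numerically. The Python sketch below is only an illustration: it assumes that NMD is the sum of absolute differences between cumulative probabilities divided by the number of classes minus one, that DW for class i is the sum over j of |i − j| times the squared probability difference, that OD averages DW over the classes with positive probability in the second argument, and that RNOD is the square root of OD divided by the number of classes minus one. These forms are consistent with the steps of this appendix, but Eqs. 1-5 themselves are defined earlier in the paper and all function names here are hypothetical.</p>
      <preformat><![CDATA[
```python
import math
from typing import Sequence


def nmd(p: Sequence[float], gold: Sequence[float]) -> float:
    """Assumed NMD: summed |cumulative p - cumulative gold| / (classes - 1)."""
    cum_p = cum_g = total = 0.0
    for p_i, g_i in zip(p, gold):
        cum_p += p_i
        cum_g += g_i
        total += abs(cum_p - cum_g)
    return total / (len(p) - 1)


def rnod(p: Sequence[float], gold: Sequence[float]) -> float:
    """Assumed RNOD(p || gold): sqrt of OD normalised by (classes - 1)."""
    n = len(p)
    # Only classes with positive probability in the second argument contribute.
    support = [i for i in range(n) if gold[i] > 0]
    dw = [
        sum(abs(i - j) * (p[j] - gold[j]) ** 2 for j in range(n))
        for i in support
    ]
    od = sum(dw) / len(support)
    return math.sqrt(od / (n - 1))


def max_gap_two_classes(steps: int = 20) -> float:
    """Largest |NMD - RNOD| over a grid of two-class distribution pairs,
    checking both RNOD(p || gold) and RNOD(gold || p)."""
    worst = 0.0
    for a in range(steps + 1):
        for b in range(steps + 1):
            p = (a / steps, 1 - a / steps)
            gold = (b / steps, 1 - b / steps)
            reference = nmd(p, gold)
            worst = max(
                worst,
                abs(reference - rnod(p, gold)),
                abs(reference - rnod(gold, p)),
            )
    return worst


print(max_gap_two_classes())  # expected to be zero up to floating-point error
```
]]></preformat>
      <p>The grid includes the boundary cases in which one class has gold probability 0 or 1, which correspond to Cases 2 and 3 above.</p>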
      <p>In summary, $\mathrm{NMD}(p, p^*) = \mathrm{RNOD}(p \,\|\, p^*) = \mathrm{RNOD}(p^* \,\|\, p)$ when $|C| = 2$. Q.E.D.</p>
    </sec>
    <sec id="sec-8">
      <title>References</title>
      <p>[1] T. Sakai, Evaluating evaluation measures for ordinal classification and ordinal quantification, in: Proceedings of ACL-IJCNLP 2021, 2021, pp. 2759–2769. URL: https://aclanthology.org/2021.acl-long.214.pdf.</p>
      <p>[2] A. Esuli, F. Sebastiani, Sentiment quantification, IEEE Intelligent Systems 25 (2010) 72–75.</p>
      <p>[3] W. Gao, F. Sebastiani, From classification to quantification in tweet sentiment analysis, Social Network Analysis and Mining 6 (2016) 1–22.</p>
      <p>[4] F. Sebastiani, Evaluation measures for quantification: an axiomatic approach, Information Retrieval Journal 23 (2020) 255–288.</p>
      <p>[5] P. Nakov, A. Ritter, S. Rosenthal, V. Stoyanov, F. Sebastiani, SemEval-2016 task 4: Sentiment analysis in Twitter, in: Proceedings of the 10th International Workshop on Semantic Evaluation, SemEval ’16, Association for Computational Linguistics, San Diego, California, 2016. URL: https://www.aclweb.org/anthology/S16-1001.pdf.</p>
      <p>[6] S. Rosenthal, N. Farra, P. Nakov, SemEval-2017 task 4: Sentiment analysis in Twitter, in: Proceedings of the 11th International Workshop on Semantic Evaluation, SemEval ’17, Association for Computational Linguistics, Vancouver, Canada, 2017. URL: https://www.aclweb.org/anthology/S17-2088.pdf.</p>
      <p>[7] R. Higashinaka, K. Funakoshi, M. Inaba, Y. Tsunomori, T. Takahashi, N. Kaji, Overview of Dialogue Breakdown Detection Challenge 3, in: Proceedings of Dialog System Technology Challenge 6 (DSTC6) Workshop, 2017. URL: http://workshop.colips.org/dstc6/papers/track3_overview_higashinaka.pdf.</p>
      <p>[8] Z. Zeng, S. Kato, T. Sakai, Overview of the NTCIR-14 short text conversation task: Dialogue quality and nugget detection subtasks, in: Proceedings of NTCIR-14, 2019, pp. 289–315. URL: http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings14/pdf/ntcir/01-NTCIR14-OV-STC-ZengZ.pdf.</p>
      <p>[9] Z. Zeng, S. Kato, T. Sakai, I. Kang, Overview of the NTCIR-15 dialogue evaluation task (DialEval-1), in: Proceedings of NTCIR-15, 2020, pp. 13–34. URL: http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings15/pdf/ntcir/01-NTCIR15-OV-DIALEVAL-ZengZ.pdf.</p>
      <p>[10] M. Werman, S. Peleg, A. Rosenfeld, A distance metric for multidimensional histograms, Computer Vision, Graphics, and Image Processing 32 (1985) 328–336.</p>
      <p>[11] Y. Rubner, C. Tomasi, L. J. Guibas, The earth mover’s distance as a metric for image retrieval, International Journal of Computer Vision 40 (2000) 99–121.</p>
      <p>[12] E. Levina, P. Bickel, The earth mover’s distance is the Mallows distance: Some insights from statistics, in: Proceedings of ICCV 2001, 2001, pp. 251–256.</p>
      <p>[13] T. Sakai, Towards automatic evaluation of multi-turn dialogues: A task design that leverages inherently subjective annotations, in: Proceedings of EVIA 2017, 2017, pp. 24–30. URL: http://ceur-ws.org/Vol-2008/paper_4.pdf.</p>
      <p>[14] T. Sakai, Comparing two binned probability distributions for information access evaluation, in: Proceedings of ACM SIGIR 2018, 2018, pp. 1073–1076. URL: https://dl.acm.org/doi/pdf/10.1145/3209978.3210073.</p>
      <p>[15] T. Sakai, On the instability of diminishing return IR measures, in: Proceedings of ECIR 2021 Part I (LNCS 12656), 2021, pp. 572–586.</p>
      <p>[16] T. Sakai, Evaluating evaluation metrics based on the bootstrap, in: Proceedings of ACM SIGIR 2006, 2006, pp. 525–532.</p>
      <p>[17] T. Sakai, Alternatives to bpref, in: Proceedings of ACM SIGIR 2007, 2007, pp. 71–78.</p>
      <p>[18] Y. Tsunomori, R. Higashinaka, T. Takahashi, M. Inaba, Selection of evaluation metrics for dialogue breakdown detection in dialogue breakdown detection challenge 3 (in Japanese), Transactions of the Japanese Society for Artificial Intelligence 35 (2020). URL: https://www.jstage.jst.go.jp/article/tjsai/35/1/35_DSI-G/_pdf/-char/ja.</p>
      <p>[19] T. Sakai, Laboratory Experiments in Information Retrieval: Sample Sizes, Effect Sizes, and Statistical Power, Springer, 2018.</p>
      <p>[20] T. Sakai, Metrics, statistics, tests, in: PROMISE Winter School 2013: Bridging between Information Retrieval and Databases (LNCS 8173), Springer, 2014, pp. 116–163.</p>
      <p>[21] K. Krippendorff, Content Analysis: An Introduction to Its Methodology (Fourth Edition), SAGE Publications, 2018.</p>
      <p>[22] T. Sakai, How to run an evaluation task, in: Information Retrieval Evaluation in a Changing World, Springer, 2019, pp. 71–102.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>