
November

A Closer Look at Evaluation Measures for Ordinal Quantification

Tetsuya Sakai

Waseda University, Tokyo, Japan

2021


In his ACL 2021 paper [1], Sakai compared several evaluation measures in the context of Ordinal Quantification (OQ) tasks in terms of system ranking similarity, system ranking consistency (i.e., robustness to the choice of test data), and discriminative power (i.e., ability to find many statistically significant differences). Based on his experimental results, he recommended the use of his RNOD (Root Normalised Order-aware Divergence) measure along with NMD (Normalised Match Distance, i.e., normalised Earth Mover's Distance). The present study follows up on his discriminative power experiments, by taking a much closer look at the statistical significance test results obtained from each evaluation measure. Our new analyses show that (1) RNOD is the overall winner among the OQ measures in terms of pooled discriminative power (i.e., discriminative power across multiple data sets); (2) NMD behaves noticeably differently from RNOD and from measures that cannot handle ordinal classes; (3) NMD tends to favour a popularity-based baseline (which accesses the gold distributions) over a uniform-distribution baseline, thus contradicting the other measures in terms of statistical significance. As both RNOD and NMD have their merits, we recommend that the organisers of OQ tasks use both of them to evaluate the systems from multiple angles.

Keywords: evaluation, evaluation measures, distributions, ordinal classes, ordinal quantification, prevalence estimation

1. Introduction

Quantification (or prevalence estimation) tasks are highly practical [2, 3, 4]. While classification evaluation deals with a confusion matrix whose rows and columns represent gold and estimated classes, quantification evaluation compares a gold distribution over the classes with an estimated distribution. Put another way, while classification cares about exactly which of the given instances (with masked gold labels) are classified into each class (represented in the cells of the confusion matrix), quantification only cares about how many of the instances are classified into each class. In general, a quantification task involves multiple test cases; each test case has its own set of instances, and the number of instances can vary across cases. Hence a quantification evaluation measure computes a score for each test case by comparing the gold and estimated distributions, and the measure score can be averaged across the cases. Hence statistical significance tests can be applied to compare the systems.

In the present study, if a task involves the comparison of an estimated probability mass function with a gold probability mass function for each test case, we regard it as a quantification task, regardless of what the exact input to the estimation system is. (Note that a frequency distribution of instances can easily be converted to a probability mass function.) In particular, we define Ordinal Quantification (OQ) as a task that requires systems to estimate a probability mass function over ordinal classes for each of the test cases.¹ Examples of OQ tasks defined in this way include the following.

SemEval 2016/2017 Task 4 Subtask E: Given a set of tweets about a topic, estimate the distribution of the tweets over five classes: highly negative, negative, neutral, positive, highly positive [5, 6].

Dialogue Breakdown Detection Challenge: For each system utterance within a human-machine dialogue, estimate the distribution of gold labels given by annotators, where the possible labels are NB (not a breakdown), PB (possible breakdown), B (breakdown) [7].

NTCIR Dialogue Quality Subtasks: For each customer-helpdesk dialogue, estimate the distribution of dialogue quality ratings given by annotators, where the possible ratings are −2, −1, 0, 1, 2 [8, 9].

¹ Interval classes are also ordinal by definition.

To evaluate systems (or runs [22]) submitted to OQ tasks, evaluation measures that can handle ordinal classes should be used. More specifically, “nominal quantification” measures such as Mean Absolute Error (MAE), (Root) Mean Squared Error ((R)MSE), and Jensen-Shannon Divergence (JSD) are not adequate, as they are based on simple averaging/summing across classes [1]. To see why, consider a gold distribution for the aforementioned SemEval task, where all tweets for a topic are in the highly positive class; consider a system which puts all tweets in highly negative (i.e., an utter failure), and consider another which puts all tweets in positive (i.e., only one class away from the gold class). It is clear that the above measures rate both systems as utter failures.
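The SemEval example above can be checked numerically. The following is a minimal Python sketch written for illustration only (it is not taken from any task's official scorer, and the function and variable names are assumptions). It computes MAE for the two hypothetical systems and, for contrast, the sum of absolute differences between cumulative distributions, which is the quantity underlying Earth Mover's Distance for ordinal classes.

```python
# Classes in order: highly negative, negative, neutral, positive, highly positive.
gold = [0.0, 0.0, 0.0, 0.0, 1.0]      # all tweets are highly positive
system_x = [1.0, 0.0, 0.0, 0.0, 0.0]  # puts everything in highly negative
system_y = [0.0, 0.0, 0.0, 1.0, 0.0]  # puts everything in positive


def mean_absolute_error(est, gold):
    """Nominal measure: average absolute difference, class by class."""
    return sum(abs(e - g) for e, g in zip(est, gold)) / len(gold)


def cumulative(dist):
    """Running sums of a probability mass function."""
    total, out = 0.0, []
    for prob in dist:
        total += prob
        out.append(total)
    return out


def cumulative_abs_diff(est, gold):
    """Order-aware quantity: absolute differences between cumulative sums."""
    return sum(abs(a - b) for a, b in zip(cumulative(est), cumulative(gold)))


for name, run in (("X", system_x), ("Y", system_y)):
    print(name, mean_absolute_error(run, gold), cumulative_abs_diff(run, gold))
```

MAE cannot separate the two systems because it ignores which class the misplaced probability mass landed in, whereas the cumulative comparison penalises System X four times as heavily as System Y.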

To the best of our knowledge, only two families of measures are known to be suitable for evaluating OQ systems: the Earth Mover’s Distance family [10, 11, 12], which is based on cumulative distributions of the gold and estimated distributions, and Sakai’s Order-aware Divergence family, proposed in 2017-2018 [13, 14]. Recently, Sakai [1] reported on extensive experiments for comparing the above two families as well as nominal quantification measures in the context of evaluating OQ systems submitted to the SemEval and NTCIR tasks. His recommendation was to use Root Normalised Order-aware Divergence (RNOD) as the primary measure, and Normalised Match Distance (NMD) as the secondary measure, where NMD is simply a normalised version of Earth Mover’s Distance [5]. In that study, RNOD was preferred over NMD because it was the overall winner when viewed across the data sets in terms of system ranking consistency (i.e., the ability to provide stable system rankings regardless of the choice of test data) [15] and discriminative power (i.e., the ability to obtain many statistically significant differences under the same experimental condition) [16, 17].

The present study follows up on Sakai’s experiments [1], by taking a much closer look at the statistical significance test results obtained from each evaluation measure. We also leverage additional sets of OQ data from NTCIR that were not previously used. Our new analyses show that (1) RNOD is the overall winner among the OQ measures in terms of pooled discriminative power (i.e., discriminative power across multiple data sets); (2) NMD behaves noticeably differently from RNOD and from measures that cannot handle ordinal classes; (3) NMD tends to favour a popularity-based baseline (which accesses the gold distributions) over a uniform-distribution baseline, thus contradicting the other measures in terms of statistical significance. As both RNOD and NMD have their merits, we recommend that the organisers of OQ tasks use both of them to evaluate the systems from multiple angles.

2. Related Work

The aforementioned OQ tasks of SemEval (2016/2017 Task 4 Subtask E) [5, 6] used Earth Mover’s Distance (EMD) as the evaluation measure, remarking that “EMD is currently the only known measure for ordinal quantification.” Their EMD is the same as Match Distance [14, 10], and the present study uses its normalised version, called Normalised Match Distance (NMD) [14, 1]. NMD has been used as one of the evaluation measures for evaluating the aforementioned OQ tasks of NTCIR (Dialogue Quality) [8, 9].

In 2017, Sakai [13] proposed Order-aware Divergence (OD), Normalised OD (NOD), and Symmetric Normalised OD (SNOD) for OQ tasks, by explicitly incorporating the notion of “distance” between classes. Subsequently, Sakai [14] proposed Root Normalised OD (RNOD) and Root Symmetric Normalised OD (RSNOD), as the computation of OD involves sums of squares. The OQ tasks of NTCIR have used RSNOD along with NMD [8, 9]. Sakai’s recent recommendation for OQ tasks [1] is to use RNOD as the primary measure and NMD as the secondary measure, for the reasons discussed in Section 1.

Although the aforementioned Dialogue Breakdown Detection Challenge (DBDC) [7] is an OQ task, the official evaluation measures used there for comparing two distributions were MSE and JSD, which cannot consider the ordinal nature of classes (i.e., nominal quantification measures). Subsequently, the organisers of DBDC used their Japanese and English DBDC task data to compare these official DBDC measures with NMD and RSNOD in terms of system ranking consistency and discriminative power; they reported that RSNOD was the overall winner [18].

3. Quantification Measures

Table 1 provides a brief qualitative summary of the measures considered in this study. Due to lack of space, we refer the reader to Sakai [1] for the definitions of nominal quantification measures; here, we define only the ordinal quantification measures.
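The ordinal measures defined in this section (NMD in Eq. 1 and the OD family in Eqs. 2-6) are short enough to sketch directly in code. The following is a minimal Python sketch, not the official evaluation tool of any task. It assumes that each distribution is an equal-length list of probabilities indexed by class position (0-based indices, which leaves the class distance |i − j| unchanged), and all function names are hypothetical. The two four-class systems are those of the worked example given in this section.

```python
import math


def cumulative(dist):
    """Running sums of a probability mass function."""
    total, out = 0.0, []
    for prob in dist:
        total += prob
        out.append(total)
    return out


def nmd(p, gold):
    """Normalised Match Distance (Eq. 1)."""
    diffs = zip(cumulative(p), cumulative(gold))
    return sum(abs(a - b) for a, b in diffs) / (len(p) - 1)


def dw(i, p, gold):
    """Distance-weighted sum of squares for class i (Eq. 2), delta_ij = |i - j|."""
    return sum(abs(i - j) * (p[j] - gold[j]) ** 2 for j in range(len(p)))


def od(p, gold):
    """Order-aware Divergence OD(p || gold) (Eq. 3).

    Averages DW over the classes whose probability in `gold` is non-zero.
    """
    support = [i for i, g in enumerate(gold) if g > 0]
    return sum(dw(i, p, gold) for i in support) / len(support)


def rnod(p, gold):
    """Root Normalised Order-aware Divergence (Eq. 5)."""
    return math.sqrt(od(p, gold) / (len(p) - 1))


def rsnod(p, gold):
    """Root Symmetric Normalised Order-aware Divergence (Eqs. 4 and 6)."""
    sod = (od(p, gold) + od(gold, p)) / 2
    return math.sqrt(sod / (len(p) - 1))


gold = [0.25, 0.25, 0.25, 0.25]
system_a = [0.25, 0.35, 0.15, 0.25]
system_b = [0.25, 0.25, 0.35, 0.15]
print("OD :", round(od(system_a, gold), 3), round(od(system_b, gold), 3))
print("NMD:", round(nmd(system_a, gold), 4), round(nmd(system_b, gold), 4))
```

Up to floating-point rounding, the OD values agree with the 0.020 and 0.025 reported in the worked example, while the two NMD values coincide. With two classes, `nmd`, `rnod`, and `rsnod` return the same value, in line with the result proved in the Appendix.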

Let $C$ denote the set of ordinal classes, represented by consecutive integers for convenience. Let $p_i$ denote the estimated probability for Class $i$, so that $\sum_{i \in C} p_i = 1$. Similarly, let $p^*_i$ denote the gold probability. We also denote the entire probability mass functions by $p$ and $p^*$, respectively. Let $cp_i = \sum_{k \le i} p_k$, and $cp^*_i = \sum_{k \le i} p^*_k$. NMD is given by [14]:

$$\mathrm{NMD}(p, p^*) = \frac{\sum_{i \in C} |cp_i - cp^*_i|}{|C| - 1} . \tag{1}$$

We now define R(S)NOD. First, let the Distance-Weighted sum of squares for Class $i$ be:

$$\mathrm{DW}_i = \sum_{j \in C} \delta_{ij} (p_j - p^*_j)^2 , \quad \delta_{ij} = |i - j| . \tag{2}$$

$\mathrm{DW}_i$ was designed to quantify the overall error from the viewpoint of a particular gold class $i$: it tries to measure how much of its probability $p^*_i$ has been misallocated to other classes $j \in C$ ($j \ne i$), by assuming that the difference between $p_j$ and $p^*_j$ is directly caused by a misallocation of part of $p^*_i$; the weight $\delta_{ij}$ is designed to penalise the misallocation based on the distance between the ordinal classes. Note that the $\delta_{ij}$ in Eq. 2 assumes equidistance; we shall discuss an alternative in Section 6.

Let $C^* = \{ i \in C \mid p^*_i > 0 \}$. That is, $C^*$ ($\subseteq C$) is the set of classes with a non-zero gold probability. Order-aware Divergence is defined as:

$$\mathrm{OD}(p \,\|\, p^*) = \frac{1}{|C^*|} \sum_{i \in C^*} \mathrm{DW}_i , \tag{3}$$

with its symmetric version:

$$\mathrm{SOD}(p, p^*) = \frac{\mathrm{OD}(p \,\|\, p^*) + \mathrm{OD}(p^* \,\|\, p)}{2} . \tag{4}$$

RNOD and RSNOD are defined as:

$$\mathrm{RNOD}(p \,\|\, p^*) = \sqrt{\frac{\mathrm{OD}(p \,\|\, p^*)}{|C| - 1}} , \tag{5}$$

$$\mathrm{RSNOD}(p, p^*) = \sqrt{\frac{\mathrm{SOD}(p, p^*)}{|C| - 1}} . \tag{6}$$

Note that Eq. 3 averages over $C^*$ rather than $C$ because of what $\mathrm{DW}_i$ is meant to represent, as discussed above. However, it is also possible to define a variant of OD as follows; let us call it ADW (Average DW):

$$\mathrm{ADW}(p, p^*) = \frac{1}{|C|} \sum_{i \in C} \mathrm{DW}_i . \tag{7}$$

From Eqs. 2 and 7, it is clear that ADW is symmetric.² A root-normalised measure based on ADW, which we call RNADW (Root Normalised ADW), can also be defined:

$$\mathrm{RNADW}(p, p^*) = \sqrt{\frac{\mathrm{ADW}(p, p^*)}{|C| - 1}} . \tag{8}$$

We will evaluate this variant in our future work.

² Similarly, it is clear from Eqs. 2 and 3 that $C^* = C$ (i.e., there is no gold probability that is zero) is a sufficient condition for OD to be symmetric [13]. Another sufficient condition for guaranteeing the symmetry of OD is: $|C^*| = 1$ and $|\{ i \in C \mid p_i > 0 \}| = 1$ (i.e., both the gold and estimated distributions have exactly one positive probability).

When there are only two classes ($|C| = 2$) and therefore the distinction between nominal and ordinal classes becomes unnecessary, it can be shown that $\mathrm{NMD}(p, p^*) = \mathrm{RNOD}(p \,\|\, p^*) = \mathrm{RNOD}(p^* \,\|\, p)$ ($= \mathrm{RSNOD}(p, p^*)$). See the Appendix for a proof.

OD-based measures tend to emphasise errors near either end of the ordinal scale, as the following example illustrates. Consider a situation with $|C| = 4$ ordinal classes and a uniform gold distribution: $p^*_i = 0.25$ ($i = 1, \ldots, 4$). If we compare System A which returns $p_1 = p_4 = 0.25$, $p_2 = 0.35$, $p_3 = 0.15$ and System B which returns $p_1 = p_2 = 0.25$, $p_3 = 0.35$, $p_4 = 0.15$, then from Eq. 2, $\mathrm{DW}_1 = \mathrm{DW}_4 = 0.03$, $\mathrm{DW}_2 = \mathrm{DW}_3 = 0.01$ and therefore $\mathrm{OD} = 0.020$ for System A; whereas, $\mathrm{DW}_1 = 0.05$, $\mathrm{DW}_2 = 0.03$, $\mathrm{DW}_3 = \mathrm{DW}_4 = 0.01$ and therefore $\mathrm{OD} = 0.025$ for System B. Hence System A is considered slightly better. On the other hand, it can easily be verified that A and B are considered equally effective in terms of NMD. It should be noted that this difference does not say which measure is “correct” as an OQ evaluation measure, as both measures take the ordinal nature of the classes into account. (Similarly, we cannot say whether (say) JSD is superior to NVD for a nominal quantification task just because they differ.)

4. Data

Table 2 provides an overview of the eight OQ task data sets that we used for our analysis. The three STC-3 (Short Text Conversation 3) data sets [8] were not used in Sakai [1], but the specifications of the Dialogue Quality (DQ) subtask at STC-3 are identical to those of DialEval-1 (Dialogue Evaluation 1) [9]. As can be seen, all data sets come with five ordinal classes. For the two SemEval data sets, the classes are tweet polarities, namely, highly negative, negative, neutral, positive, highly positive [5, 6]. For the six NTCIR data sets, the classes are five-point scale dialogue quality ratings (−2 through 2) based on three different viewpoints, namely, A-score (task accomplishment), E-score (dialogue effectiveness), and S-score (customer satisfaction) [8, 9]. Hence, for example, DialEval-1 DQ-A is the data set containing the gold and estimated probability distributions for the A-score estimation “subsubtask” of the NTCIR-15 DialEval-1 task. The NTCIR dialogue data were provided in both Chinese and English (manually translated from the original Chinese text) to the participants, and the participants were allowed to submit Chinese and/or English runs. On the other hand, the gold distributions were constructed solely based on the original Chinese dialogues. Hence, both Chinese and English runs are evaluated using the same gold distributions. The gold distributions of the STC-3 and DialEval-1 data were constructed based on votes from 19 and 20 assessors for each dialogue, respectively [8].

The NTCIR data sets are larger than the SemEval data sets both in terms of the test data sample size and in terms of the number of runs to be evaluated. Hence our results with the NTCIR data may be more reliable, especially regarding statistical significance test results.

5. Analysis

5.1. Pooled Discriminative Power
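Since the analyses of Section 5 rest on counting statistically significant system pairs, the bookkeeping involved can be sketched up front. The following is a minimal Python sketch with hypothetical function names. It assumes that a p-value for every system pair has already been obtained (e.g., from a randomised Tukey HSD test), and it covers per-data-set discriminative power, the pooled version of Eq. 9, and the Statistical Significance Overlap (SSO) used in Section 5.2.

```python
def discriminative_power(p_values, alpha=0.05):
    """Return (significant pairs, total pairs, ratio) for one data set."""
    significant = sum(1 for pv in p_values if pv < alpha)
    return significant, len(p_values), significant / len(p_values)


def pooled_discriminative_power(counts):
    """Pooled discriminative power (Eq. 9).

    `counts` holds one (significant pairs, total pairs) tuple per data set.
    """
    return sum(s for s, _ in counts) / sum(t for _, t in counts)


def sso(first_only, both, second_only):
    """Statistical Significance Overlap between two measures."""
    return both / (first_only + both + second_only)


def sso_from_sets(sig_first, sig_second):
    """SSO computed from two sets of significantly different system pairs."""
    both = len(sig_first & sig_second)
    return sso(len(sig_first - sig_second), both, len(sig_second - sig_first))


# NMD vs. RNOD on STC-3 DQ-A: 3 pairs for the first measure only,
# 68 pairs for both, 0 pairs for the second measure only (Section 5.2).
print(round(100 * sso(3, 68, 0), 1))
```

Pooling sums the numerators and denominators before dividing, so data sets with more system pairs carry more weight than they would in a plain average of per-data-set ratios. For the DialEval-1 DQ-S comparison of NMD and RNOD in Section 5.2, the RNOD-only count follows from the reported totals as 125 − 71 = 54, so the corresponding call would be `sso(11, 71, 54)`.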

Sakai [1] presented discriminative power curves [16] for NMD, R(S)NOD, NVD, RNSS, and JSD using the SemEval and DialEval-1 data sets. Given a data set with submitted runs, a discriminative power curve is obtained by obtaining a p-value for every system pair (using a randomised Tukey HSD test with 5000 trials [19]) and sorting them in descending order. Curves that are closer to the origin represent discriminative measures, i.e., those that can give us confident conclusions from experiments. A highly discriminative measure is not necessarily “correct,” but we do want measures to be discriminative to some extent; otherwise, we will not be able to conclude anything from experiments [20].

Here, we revisit Sakai’s results by focusing on the commonly-used significance level of $\alpha = 0.05$, to view and summarise the discriminative power results in a more quantitative manner. More specifically, for data set $D$, let $\mathrm{DP}_D = s_D / t_D$, where $t_D$ is the total number of system pairs ($t_D = m_D (m_D - 1)/2$ if there are $m_D$ systems) and $s_D$ ($\le t_D$) is the number of those found to be statistically significantly different at $\alpha$. To provide a quantitative summary of discriminative power results over a set $\mathcal{D} = \{D\}$ of data sets, we define pooled discriminative power as follows:

$$\mathrm{PDP} = \sum_{D \in \mathcal{D}} s_D \Big/ \sum_{D \in \mathcal{D}} t_D . \tag{9}$$

We also report on additional results with the three STC-3 data sets; these were not discussed in Sakai [1].

Table 3 shows the individual and pooled discriminative power results for each of our eight OQ data sets. For example, for NMD, the discriminative power with Sem16T4E is 38/66 = 57.6% and higher than the other measures, but as it suffers with the NTCIR data, the pooled discriminative power is only 42.0%. In particular, note that NMD performs very poorly with the DialEval-1 DQ-A and DQ-S data sets: with DialEval-1 DQ-A, NMD finds only 84 statistically significant differences at $\alpha = 0.05$, while RSNOD, RNOD, NVD, RNSS, and JSD find as many as 119, 133, 138, 113, and 135, respectively; similarly, with DialEval-1 DQ-S, NMD finds only 82 statistically significant differences at $\alpha = 0.05$, while RSNOD, RNOD, NVD, RNSS, and JSD find as many as 115, 125, 120, 129, and 127, respectively. This apparent breakdown of NMD for these two data sets was also visualised in the discriminative power curves of Sakai [1, Figure 2].

Our findings in terms of pooled discriminative power are:

• The most discriminative measures are RNOD and NVD (but recall that NVD is a nominal quantification measure).
• RNOD outperforms NMD.
• RSNOD slightly underperforms RNOD, suggesting that making the measure symmetric is not beneficial [1].

5.2. Significance Overlaps and Contradictions

Discriminative power only considers how many significant differences each measure manages to obtain; it does not examine which measures agree or disagree with each other in terms of significance test results. This section addresses exactly this question.

Table 4 breaks down the number of significant differences ($s_D$) shown in Table 3 by comparing the results of every pair of measures. More specifically, we present Statistical Significance Overlaps (SSOs), defined as $\mathrm{SSO} = b/(a + b + c)$, where $a$ is the number of significant differences found with the first measure only, $b$ is the number of significant differences found with both measures, and $c$ is the number of significant differences found with the second measure only. That is, the $s_D$ found for the first measure is $a + b$, and that for the second measure is $b + c$.

If the SSO for a pair of evaluation measures is high, that means that the two measures tend to give us similar conclusions as to which system pairs are statistically significantly different. However, it can be observed that SSO is not always high, as underlined in Table 4. In particular, note that the SSOs of NMD with other measures are particularly low for DialEval-1 DQ-A and DQ-S (Parts (f) and (h) of the table). In Section 5.1, we have pointed out that NMD performs very poorly with these two data sets in terms of discriminative power. The discriminative power results alone could mean two situations: (i) NMD manages to find only a subset of the significant differences found by the other measures; or (ii) NMD finds significant differences outside those found by the other measures, and the differences found are relatively few. Table 4(f) and (h) reveal that the truth is Situation (ii). For example, from Table 4(h), we can see that the SSO between NMD and RNOD is only 52.2% (with only 71 differences found significant by both measures), and that NMD found as many as 11 significant differences that were not considered significant by RNOD. Similarly, the SSO between NMD and NVD is only 48.5% (with only 66 differences found significant by both measures), and NMD found as many as 16 significant differences that were not considered significant by NVD. This outlier tendency of NMD is consistent with Sakai’s observation regarding system ranking similarity [1, Table 5].

The above analysis examined the overlaps of significantly different system pairs based on two-sided tests. However, the overlaps in fact contain a few contradictions: a statistical significance contradiction occurs when one measure says “System $X$ statistically significantly outperforms System $Y$” while another says “System $Y$ statistically significantly outperforms System $X$.” Which system outperforms another is determined by the mean scores of $X$ and $Y$ (smaller the better in our case). Although such situations are very rare, we have found them useful for understanding the properties of the measures, as discussed below.

Table 5 shows the number of contradictions, which can be used together with Table 4. (There were no contradictions in the Sem16T4E, Sem17T4E, and STC-3 DQ-E results.) For example, while Table 4(c) shows that NMD and RNOD detected a statistically significant difference for the same 68 run pairs from STC-3 DQ-A (with an SSO of 95.8%), Table 5(I) shows that 4 of them were actually contradictions. Hence, if we choose to remove the contradictions prior to computing the SSO, it would be 64/(3+64+0) = 95.5%. However, “practical” contradictions in the five NTCIR data sets occur less frequently than what Table 5 suggests, as explained below.

Table 6 provides an exhaustive list of the exact run pairs listed as contradictions in Table 5. For example, Table 6(I) reveals that the 4 contradictions mentioned in Table 5(I) are:

• A_BL-popularity-C vs. A_BL-uniform-C
• A_BL-popularity-C vs. A_BL-uniform-E
• A_BL-popularity-E vs. A_BL-uniform-C
• A_BL-popularity-E vs. A_BL-uniform-E

However, these essentially constitute one contradiction, as the contents of the Chinese and English runs A_BL-popularity-{C,E} are the same, as are those of A_BL-uniform-{C,E}. These are the Popularity and Uniform baseline runs provided by the organisers of the NTCIR tasks, which rely on the following simple strategies.³

Popularity: Access the gold data, and return an “estimated” distribution where the class that is most frequent in the gold distribution is given a probability of 1, and others are given a 0. Note that this is a type of oracle run.

Uniform: Always return a uniform distribution.

³ The SemEval tasks also had a few baseline runs, including a run that always assigns a probability of 1 to the Positive class [5, 6].

They did not have Popularity and Uniform baselines.

Note that the prefix “A_BL” means that the run is a baseline for the A-score estimation subsubtask; similar baseline runs are present in the E-score and S-score estimation subsubtask data from NTCIR, with prefixes “E_BL” and “S_BL.”

Sakai [1] kept both the Chinese and English versions of the baselines in his experiments even though their contents are the same, because they have different file names and were listed as distinct runs in the official evaluations [8, 9]. Hence we follow suit. However, it is clear from the above that the contradiction in STC-3 DQ-A is essentially a single instance: while NMD concludes that Popularity statistically significantly outperforms Uniform, RNOD, NVD, RNSS, and JSD conclude the exact opposite. Similarly, Table 6(II) reveals that the 4 contradictions shown in Table 5 also concern Popularity vs. Uniform: NMD and RSNOD conclude that Popularity statistically significantly outperforms Uniform, while RNSS and JSD conclude the exact opposite. As for the results for DialEval-1 DQ-E (Table 6(IV)), they are essentially identical to those of STC-3 DQ-A (Table 6(I)).

In Table 6(III) and (V), we see non-baseline runs. Note that, for example, A_NKUST-run0-C and A_NKUST-run0-E are actually different runs, unlike the situations with the aforementioned baseline runs. Thus, for example, the 8 contradictions shown in Table 5(III) between NMD/RSNOD and RNOD/RNSS/JSD are essentially for the following 3 cases, as shown in Table 6(III).

• A_BL-popularity-{C,E} vs. A_BL-uniform-{C,E} (4 run combinations)
• A_BL-popularity-{C,E} vs. A_NKUST-run0-C (2 run combinations)
• A_BL-popularity-{C,E} vs. A_NKUST-run0-E (2 run combinations)

In every case, NMD and RSNOD conclude that Popularity statistically significantly outperforms the other run, which is in direct disagreement with the other measures.

We also observe that RSNOD behaves similarly to NMD from Table 5(II), (III), and (V) and the accompanying details in Table 6(II), (III), and (V). Moreover, there are no contradictions between NMD and RSNOD. These results suggest that the properties of RSNOD lie somewhere between NMD and RNOD: this is in line with Sakai’s observation regarding the system ranking similarity for the DialEval-1 DQ-A and DQ-S data [1, Table 5]. Put another way, introducing symmetry appears to bring RNOD closer to NMD.

Our findings regarding statistical significance overlap and contradictions can be summarised as follows.

• The sets of significant differences found by NMD are generally not subsets of those found by the other, more discriminative measures. NMD behaves markedly differently from other measures regarding which system pairs are statistically significant.
• There are a few contradictions between NMD and four other measures (RNOD, NVD, RNSS, JSD) in terms of significance test results, and all of these contradictions involve a Popularity baseline, which accesses the gold distributions. NMD tends to rate Popularity higher than Uniform, thus directly contradicting the other measures.
• From the viewpoint of contradictions, RSNOD behaves somewhat similarly to NMD.

5.3. Popularity vs. Uniform baselines

In Section 5.2, we showed that NMD can contradict RNOD, NVD, RNSS or JSD in terms of statistical significance. In particular, we have seen cases where NMD favours Popularity over Uniform, contrary to the conclusions of the other measures. We find this behaviour of NMD generally intuitive, as Popularity accesses the gold data and utilises that knowledge, while Uniform is noninformative and practically useless. However, note that whether Popularity should actually be rated higher depends on what the gold distribution looks like: for example, if the gold distribution is almost flat, we would like the measure to prefer Uniform over Popularity.

To examine the above tendency of NMD, this section focusses on the comparison between Popularity and Uniform. First, we focus on contradictions regarding Popularity vs. Uniform from the DialEval-1 DQ-A data set, as we found the highest number of conflicts (not limited to Popularity vs. Uniform) in this data set among the five data sets shown in Tables 5 and 6. For each evaluation measure $M$ and for each dialogue, we first compute the score delta (e.g., $\Delta_{\mathrm{NMD}}$):

$$\Delta_M = M(\mathrm{Popularity}) - M(\mathrm{Uniform}) , \tag{10}$$

where $M(\cdot)$ is the score according to measure $M$ for a run’s estimated distribution for a particular dialogue. Note that, since these measures give smaller scores to better systems, a negative delta means Popularity is preferred while a positive delta means Uniform is preferred.

Figure 1 shows scatter plots of score deltas by comparing NMD with NVD, RNSS, JSD, and RNOD; Figure 2 shows similar scatter plots by comparing RNOD with NVD, RNSS, and JSD. The two figures are arranged to facilitate comparisons across NMD and RNOD. (To reduce the number of measure combinations, RSNOD is omitted in this analysis.) Within each green box, the number of instances in the 2nd and 4th quadrants (i.e., dialogues for which two measures disagree as to which of Popularity and Uniform is better) is shown, together with a Pearson correlation with a 95% CI. From the figures, we can observe the following.

[Figure 1: Score deltas (A_BL-popularity − A_BL-uniform from DialEval-1 DQ-A): NMD vs NVD/RNSS/JSD/RNOD. Number of instances in the 2nd and 4th quadrants, as well as Pearson correlations (with 95% CIs, 300 dialogues), are also shown.]

• The correlations between NMD and the nominal quantification measures (NVD, RNSS, and JSD) are lower compared to those between RNOD and the nominal quantification measures, as the Pearson correlations and the scatter plots show.
• More importantly, NMD disagrees more often with the nominal quantification measures than RNOD does. All of these disagreements of NMD happen in the 2nd quadrant: that is, while NMD says that Popularity outperforms Uniform, the other three measures say otherwise.
• From Figure 1(d), NMD and RNOD disagree for a total of 77 dialogues: for 75 of them, NMD says that Popularity outperforms Uniform while RNOD says otherwise; for the remaining 2 dialogues, NMD says that Popularity underperforms Uniform while RNOD says otherwise.

In summary, for 62-78 dialogues out of 300 (21-26%), NMD rates Popularity higher than Uniform, disagreeing with the other measures.

Whether a measure prefers Popularity or Uniform depends on what the gold distribution for each dialogue looks like. To closely examine situations where NMD favours Popularity while disagreeing with the other measures, we shall discuss two actual dialogues from the DialEval-1 DQ-A data below, which were selected as follows. First, because we are primarily interested in how and why NMD and RNOD behave differently, we ranked the 300 dialogues by how the $\Delta_{\mathrm{NMD}}$ and $\Delta_{\mathrm{RNOD}}$ values differ, that is, $d = \Delta_{\mathrm{NMD}} - \Delta_{\mathrm{RNOD}}$. Figure 3 shows the gold, Popularity, and Uniform distributions for the top two dialogues in terms of $d$. In Figure 3(a) (181st dialogue, $d = -0.150 - 0.115 = -0.265$), it can be observed that Class 1 has the highest gold probability, and therefore that Popularity sets the probability of Class 1 to be 1. This is how Popularity “cheats.” As shown in the pink box, all measures except NMD have positive $\Delta$’s; that is, they say that Popularity underperforms Uniform. In contrast, NMD prefers Popularity. (Recall that for quantification measure scores, smaller means better.) In Figure 3(b) (18th dialogue, $d = 0 - 0.263 = -0.263$), Class −2 has the highest gold probability. For this dialogue, NMD says that Popularity and Uniform are equally effective, while all other measures prefer Uniform. It is clear from these examples that it is difficult to say whether one measure is “correct” or not; we can only say that NMD tends to prefer Popularity over Uniform compared to the other measures.

[Figure 3: Top two dialogues from DialEval-1 DQ-A when ranked by $d = \Delta_{\mathrm{NMD}} - \Delta_{\mathrm{RNOD}}$: 181st dialogue ($d = -0.265$) and 18th dialogue ($d = -0.263$).]

Using the DialEval-1 DQ-A data set, we have so far discussed how NMD tends to favour Popularity over Uniform. To generalise this observation, Table 7 shows how often each measure prefers one of the two baselines, for each of the six NTCIR data sets that contain these baselines. For example, NMD prefers Uniform for 175 dialogues and prefers Popularity for 197 dialogues from the STC-3 DQ-A data set. (For the remaining 390 − 175 − 197 = 18 dialogues, the two baselines are tied.) It can be observed from the TOTAL row that while RNOD, NVD, RNSS, and JSD prefer Uniform far more often (where the probability that Popularity is preferred is far below 50%), NMD prefers Popularity more often than it prefers Uniform. As for RSNOD, it does prefer Uniform more often just like RNOD and others, but the tendency is less clear; again, its property lies somewhere between NMD and RNOD.
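The per-dialogue comparison of the two baselines can be reproduced in a few lines. The following is a minimal Python sketch with hypothetical function names. The gold distribution is an invented example over the five rating classes (it is not taken from the NTCIR data), and ties in the gold distribution are broken by taking the first most frequent class, which is an assumption about the Popularity baseline rather than a documented rule.

```python
import math


def popularity_baseline(gold):
    """Oracle baseline: probability 1 for the most frequent gold class."""
    top = max(range(len(gold)), key=lambda i: gold[i])  # first class on ties
    return [1.0 if i == top else 0.0 for i in range(len(gold))]


def uniform_baseline(num_classes):
    """Non-informative baseline: the same probability for every class."""
    return [1.0 / num_classes] * num_classes


def nmd(p, gold):
    """Normalised Match Distance (Eq. 1)."""
    total, cp_p, cp_g = 0.0, 0.0, 0.0
    for prob, gprob in zip(p, gold):
        cp_p += prob
        cp_g += gprob
        total += abs(cp_p - cp_g)
    return total / (len(p) - 1)


def rnod(p, gold):
    """Root Normalised Order-aware Divergence (Eqs. 2, 3 and 5)."""
    n = len(p)
    support = [i for i in range(n) if gold[i] > 0]
    od = sum(
        abs(i - j) * (p[j] - gold[j]) ** 2 for i in support for j in range(n)
    ) / len(support)
    return math.sqrt(od / (n - 1))


def score_delta(measure, gold):
    """Eq. 10: negative favours Popularity, positive favours Uniform."""
    pop = popularity_baseline(gold)
    uni = uniform_baseline(len(gold))
    return measure(pop, gold) - measure(uni, gold)


gold = [0.0, 0.2, 0.5, 0.3, 0.0]  # hypothetical gold distribution (ratings -2..2)
print("delta NMD :", round(score_delta(nmd, gold), 4))
print("delta RNOD:", round(score_delta(rnod, gold), 4))
```

For this invented distribution the NMD delta is negative and the RNOD delta is positive, so the two measures land in the disagreement quadrant discussed above: NMD prefers Popularity while RNOD prefers Uniform. Tallying the sign of `score_delta` over all dialogues of a data set would give counts of the kind reported in Table 7.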

6. Conclusions

The present study re-examined the OQ measures (NMD and R(S)NOD) along with nominal quantification measures (NVD, RNSS, and JSD) using SemEval and NTCIR data sets, using statistical significance test results. Our main findings are as follows.

• According to our pooled discriminative power results (Table 3), the most discriminative measures are RNOD and NVD (but recall that NVD is a nominal quantification measure and is not appropriate for OQ).
• The sets of statistically significant differences found by NMD are generally not subsets of those found by other, more discriminative measures like RNOD.
• NMD sometimes contradicts RNOD and the nominal quantification measures in statistical terms, by preferring a Popularity baseline over a Uniform baseline.

The tendency of NMD to rate Popularity higher than Uniform is generally intuitive, since the former “cheats” by accessing the gold data while the latter is the laziest approach possible. However, it is difficult to say whether NMD is more appropriate than RNOD, as the preference between Popularity and Uniform should depend on what the gold distribution looks like (e.g., Is it almost flat?). On the other hand, the strength of RNOD is that it is statistically stable, as demonstrated in terms of system ranking consistency [1] and pooled discriminative power. Based on these arguments, we recommend using both RNOD and NMD for evaluating OQ systems, to examine them from multiple angles.

Our future work includes exploring variants of RNOD. More specifically, while Eq. 2 relies on $\delta_{ij} = |i - j|$ and therefore assumes equidistance, an alternative that is free from this assumption could be considered. Inspired by the distance function used in Krippendorff’s alpha for ordinal classes [21, 22], one possibility is:

$$\delta_{ij} = \left( \sum_{k=\min(i,j)}^{\max(i,j)} p^*_k \right) - \frac{p^*_i + p^*_j}{2} . \tag{11}$$

That is, we could utilise the gold probabilities that lie between Classes $i$ and $j$ to define the distance. This can also be combined with the RNADW measure that we defined in Section 3.
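To make the proposed distance concrete, the sketch below compares it with the equidistant distance on a toy gold distribution. It assumes that the distance is the sum of the gold probabilities from Class $\min(i,j)$ to Class $\max(i,j)$, minus half the sum of the two endpoint probabilities; the function names, the 0-based class indices, and the toy distribution are illustrative choices, not part of the paper.

```python
from typing import Sequence


def equidistant_delta(i: int, j: int) -> float:
    """Distance that treats adjacent classes as equally far apart: |i - j|."""
    return float(abs(i - j))


def gold_based_delta(gold: Sequence[float], i: int, j: int) -> float:
    """Gold-mass distance between classes i and j (0-based indices).

    Assumed form: sum of gold probabilities from min(i, j) to max(i, j),
    minus half of the two endpoint probabilities.
    """
    lo, hi = min(i, j), max(i, j)
    return sum(gold[lo:hi + 1]) - (gold[i] + gold[j]) / 2.0


# Hypothetical gold distribution over five ordinal classes.
toy_gold = [0.1, 0.2, 0.4, 0.2, 0.1]

# Adjacent pairs are equidistant under |i - j| but not under the
# gold-based distance, which grows with the gold mass they span.
d01 = gold_based_delta(toy_gold, 0, 1)  # approximately 0.15
d12 = gold_based_delta(toy_gold, 1, 2)  # approximately 0.30
d04 = gold_based_delta(toy_gold, 0, 4)  # approximately 0.90
```

Under this form the distance is symmetric and is zero for $i = j$, and two adjacent classes are further apart when more gold probability mass is concentrated on them.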

Acknowledgments

We thank the reviewers of the LQ 2021 workshop for their feedback on the initial version of this paper, and the organisers of the workshop for giving us the opportunity to publish our work.

A. Proof That RNOD equals NMD when $|C| = 2$

Note that $cp_1 = p_1$ and $cp^*_1 = p^*_1$ in general. Furthermore, when $|C| = 2$, note that $cp_2 = cp^*_2 = 1$. Hence, from Eq. 1,

$$\mathrm{NMD}(p, p^*) = |cp_1 - cp^*_1| + |cp_2 - cp^*_2| = |p_1 - p^*_1| + 0 = |p_1 - p^*_1| . \tag{12}$$

On the other hand, note that when $|C| = 2$, $(p_2 - p^*_2)^2 = (1 - p_1 - 1 + p^*_1)^2 = (p_1 - p^*_1)^2$. To compute RNOD, the following three cases need to be considered.

Case 1, when $p^*_1 > 0$ and $p^*_2 > 0$: from Eq. 3, $\mathrm{OD}(p \| p^*) = (DW_1 + DW_2)/2 = ((p_2 - p^*_2)^2 + (p_1 - p^*_1)^2)/2 = 2(p_1 - p^*_1)^2/2 = (p_1 - p^*_1)^2$. Hence from Eq. 5,

$$\mathrm{RNOD}(p \| p^*) = \sqrt{\mathrm{OD}(p \| p^*)} = |p_1 - p^*_1| . \tag{13}$$

Case 2, when $p^*_1 = 1$ and $p^*_2 = 0$: from Eq. 3, $\mathrm{OD}(p \| p^*) = DW_1 = (p_2 - p^*_2)^2 = (p_1 - p^*_1)^2$. Therefore, Eq. 13 holds for this case as well.

Case 3, when $p^*_1 = 0$ and $p^*_2 = 1$: from Eq. 3, $\mathrm{OD}(p \| p^*) = DW_2 = (p_1 - p^*_1)^2$, and Eq. 13 holds for this case as well.

In summary, $\mathrm{NMD}(p, p^*) = \mathrm{RNOD}(p \| p^*)$.

Finally, following similar steps as above, we can also obtain:

$$\mathrm{RNOD}(p^* \| p) = \sqrt{\mathrm{OD}(p^* \| p)} = |p_1 - p^*_1| . \tag{14}$$
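The identities above can also be checked numerically. The sketch below is a sanity check under assumed definitions rather than a reference implementation: it takes NMD to be the sum of absolute differences between cumulative probabilities divided by $|C| - 1$, $DW_i$ to be $\sum_j |i - j| (p_j - p^*_j)^2$, OD to be the mean of $DW_i$ over the classes whose probability is positive in the second argument, and RNOD to be $\sqrt{\mathrm{OD}/(|C| - 1)}$. The function names and the grid of test values are illustrative.

```python
import math
from itertools import accumulate
from typing import Sequence


def nmd(p: Sequence[float], gold: Sequence[float]) -> float:
    """Normalised Match Distance (assumed form): cumulative L1 gap / (|C| - 1)."""
    cp, cg = list(accumulate(p)), list(accumulate(gold))
    return sum(abs(a - b) for a, b in zip(cp, cg)) / (len(p) - 1)


def rnod(p: Sequence[float], gold: Sequence[float]) -> float:
    """Root Normalised Order-aware Divergence of p from gold (assumed form)."""
    n = len(p)
    # Only classes with positive probability in the second argument contribute.
    support = [i for i in range(n) if gold[i] > 0]
    dw = [sum(abs(i - j) * (p[j] - gold[j]) ** 2 for j in range(n)) for i in support]
    od = sum(dw) / len(support)
    return math.sqrt(od / (n - 1))


# Binary case: sweep p_1 and p*_1 over a grid that includes the boundary
# values 0 and 1, so that all three cases of the proof are exercised.
grid = [0.0, 0.25, 0.5, 0.75, 1.0]
max_gap = 0.0
for p1 in grid:
    for g1 in grid:
        p, gold = [p1, 1.0 - p1], [g1, 1.0 - g1]
        target = abs(p1 - g1)
        for value in (nmd(p, gold), rnod(p, gold), rnod(gold, p)):
            max_gap = max(max_gap, abs(value - target))
```

After the loop, `max_gap` holds the largest deviation of any of the three quantities from $|p_1 - p^*_1|$ over the grid, which should be zero up to floating-point error if the assumed definitions match Eqs. 1, 3, and 5.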

In summary, $\mathrm{NMD}(p, p^*) = \mathrm{RNOD}(p \| p^*) = \mathrm{RNOD}(p^* \| p)$ when $|C| = 2$. Q.E.D.

References

[1] T. Sakai, Evaluating evaluation measures for ordinal classification and ordinal quantification, in: Proceedings of ACL-IJCNLP 2021, 2021, pp. 2759–2769. URL: https://aclanthology.org/2021.acl-long.214.pdf
[2] A. Esuli, F. Sebastiani, Sentiment quantification, IEEE Intelligent Systems 25 (2010) 72–75.
[3] W. Gao, F. Sebastiani, From classification to quantification in tweet sentiment analysis, Social Network Analysis and Mining 6 (2016) 1–22.
[4] F. Sebastiani, Evaluation measures for quantification: an axiomatic approach, Information Retrieval Journal 23 (2020) 255–288.
[5] P. Nakov, A. Ritter, S. Rosenthal, V. Stoyanov, F. Sebastiani, SemEval-2016 task 4: Sentiment analysis in Twitter, in: Proceedings of the 10th International Workshop on Semantic Evaluation, SemEval ’16, Association for Computational Linguistics, San Diego, California, 2016. URL: https://www.aclweb.org/anthology/S16-1001.pdf
[6] S. Rosenthal, N. Farra, P. Nakov, SemEval-2017 task 4: Sentiment analysis in Twitter, in: Proceedings of the 11th International Workshop on Semantic Evaluation, SemEval ’17, Association for Computational Linguistics, Vancouver, Canada, 2017. URL: https://www.aclweb.org/anthology/S17-2088.pdf
[7] R. Higashinaka, K. Funakoshi, M. Inaba, Y. Tsunomori, T. Takahashi, N. Kaji, Overview of Dialogue Breakdown Detection Challenge 3, in: Proceedings of Dialog System Technology Challenge 6 (DSTC6) Workshop, 2017. URL: http://workshop.colips.org/dstc6/papers/track3_overview_higashinaka.pdf
[8] Z. Zeng, S. Kato, T. Sakai, Overview of the NTCIR-14 short text conversation task: Dialogue quality and nugget detection subtasks, in: Proceedings of NTCIR-14, 2019, pp. 289–315. URL: http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings14/pdf/ntcir/01-NTCIR14-OV-STC-ZengZ.pdf
[9] Z. Zeng, S. Kato, T. Sakai, I. Kang, Overview of the NTCIR-15 dialogue evaluation task (DialEval-1), in: Proceedings of NTCIR-15, 2020, pp. 13–34. URL: http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings15/pdf/ntcir/01-NTCIR15-OV-DIALEVAL-ZengZ.pdf
[10] M. Werman, S. Peleg, A. Rosenfeld, A distance metric for multidimensional histograms, Computer Vision, Graphics, and Image Processing 32 (1985) 328–336.
[11] Y. Rubner, C. Tomasi, L. J. Guibas, The earth mover’s distance as a metric for image retrieval, International Journal of Computer Vision 40 (2000) 99–121.
[12] E. Levina, P. Bickel, The earth mover’s distance is the Mallows distance: Some insights from statistics, in: Proceedings of ICCV 2001, 2001, pp. 251–256.
[13] T. Sakai, Towards automatic evaluation of multi-turn dialogues: A task design that leverages inherently subjective annotations, in: Proceedings of EVIA 2017, 2017, pp. 24–30. URL: http://ceur-ws.org/Vol-2008/paper_4.pdf
[14] T. Sakai, Comparing two binned probability distributions for information access evaluation, in: Proceedings of ACM SIGIR 2018, 2018, pp. 1073–1076. URL: https://dl.acm.org/doi/pdf/10.1145/3209978.3210073
[15] T. Sakai, On the instability of diminishing return IR measures, in: Proceedings of ECIR 2021 Part I (LNCS 12656), 2021, pp. 572–586.
[16] T. Sakai, Evaluating evaluation metrics based on the bootstrap, in: Proceedings of ACM SIGIR 2006, 2006, pp. 525–532.
[17] T. Sakai, Alternatives to bpref, in: Proceedings of ACM SIGIR 2007, 2007, pp. 71–78.
[18] Y. Tsunomori, R. Higashinaka, T. Takahashi, M. Inaba, Selection of evaluation metrics for dialogue breakdown detection in dialogue breakdown detection challenge 3 (in Japanese), Transactions of the Japanese Society for Artificial Intelligence 35 (2020). URL: https://www.jstage.jst.go.jp/article/tjsai/35/1/35_DSI-G/_pdf/-char/ja
[19] T. Sakai, Laboratory Experiments in Information Retrieval: Sample Sizes, Effect Sizes, and Statistical Power, Springer, 2018.
[20] T. Sakai, Metrics, statistics, tests, in: PROMISE Winter School 2013: Bridging between Information Retrieval and Databases (LNCS 8173), Springer, 2014, pp. 116–163.
[21] K. Krippendorff, Content Analysis: An Introduction to Its Methodology (Fourth Edition), SAGE Publications, 2018.
[22] T. Sakai, How to run an evaluation task, in: Information Retrieval Evaluation in a Changing World, Springer, 2019, pp. 71–102.