<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Evaluating Evaluation Measures with Worst-Case Confidence Interval Widths</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Tetsuya</forename><surname>Sakai</surname></persName>
							<email>tetsuyasakai@acm.org</email>
							<affiliation key="aff0">
								<orgName type="institution">Waseda University</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Evaluating Evaluation Measures with Worst-Case Confidence Interval Widths</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">9749A3C1B173487A013A77490CD26689</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-25T03:46+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>ANOVA</term>
					<term>confidence intervals</term>
					<term>effect sizes</term>
					<term>evaluation measures</term>
					<term>p-values</term>
					<term>sample sizes</term>
					<term>statistical significance</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>IR evaluation measures are often compared in terms of rank correlation between two system rankings, agreement with the users' preferences, the swap method, and discriminative power. While we view the agreement with real users as the most important, this paper proposes to use the Worst-case Confidence interval Width (WCW) curves to supplement it in test-collection environments. WCW is the worst-case width of a confidence interval (CI) for the difference between any two systems, given a topic set size. We argue that WCW curves are more useful than the swap method and discriminative power, since they provide a statistically well-founded overview of the comparison of measures over various topic set sizes, and visualise what levels of differences across measures might be of practical importance. First, we prove that Sakai's ANOVA-based topic set size design tool can be used for discussing WCW instead of his CI-based tool that cannot handle large topic set sizes. We then provide some case studies of evaluating evaluation measures using WCW curves based on the ANOVA-based tool, using data from TREC and NTCIR.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>IR systems are built to satisfy users' information needs, but it is not practical to make the users evaluate the systems all the time for the purpose of improving them; that would annoy the users, not satisfy them! Hence, we often turn to IR evaluation measures in laboratory experiments. But which IR measures are good?</p><p>In laboratory studies, evaluation measures are often compared in terms of rank correlation between two system rankings (e.g. <ref type="bibr" target="#b8">[9]</ref>), agreement with the users' document preferences (e.g. <ref type="bibr" target="#b6">[7]</ref>), the swap method (e.g. <ref type="bibr" target="#b7">[8]</ref>), and discriminative power (e.g. <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b3">4]</ref>). Since IR evaluation measures are often regarded as surrogates of user satisfaction or user performance measurements, we view the agreement with users as the most important, although it needs to be said that user preference studies often use hired assessors such as crowd workers instead of real users with an information need. Moreover, studies involving human assessors obviously incur costs.</p><p>To supplement user-based studies of IR evaluation measures, we propose to use Worst-case Confidence interval Width (WCW) curves in test-collection environments. WCW is the worst-case width of a confidence interval (CI) for the difference between any two systems, given a topic set size. We argue that WCW curves are more useful than the swap method and discriminative power, since they provide a statistically well-founded overview of the comparison of measures over various topic set sizes, and visualise what levels of differences across measures might be of practical importance. To this end, we leverage one of the publicly available topic set size design Excel tools of Sakai <ref type="bibr" target="#b5">[6]</ref>. 
First, we prove that Sakai's ANOVA-based topic set size design tool<ref type="foot" target="#foot_0">1</ref> can be used for discussing WCW instead of his CI-based tool<ref type="foot" target="#foot_1">2</ref>, which cannot handle large topic set sizes (see Section 2). We then provide some case studies of evaluating evaluation measures using WCW curves based on the ANOVA-based tool, using data from TREC and NTCIR.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">PRIOR ART IN EVALUATING EVALUATION MEASURES</head><p>When a new IR evaluation measure is invented, a system ranking according to this measure (averaged over a set of topics) is often compared with another according to a well-established measure; rank correlation measures such as Kendall's τ or the top-heavy τ_ap <ref type="bibr" target="#b8">[9]</ref> are often used to quantify the similarity between two rankings. However, this approach cannot tell us whether a measure is good or bad, due to the lack of a "correct" system ranking. It merely tells us whether a new measure is similar to an existing one or not; it only serves as a sanity check. In a preference agreement study, for a given query, a user sees two Search Engine Result Pages (SERPs) side by side, and says that SERP_1 is better than SERP_2 ("SERP_1 &gt; SERP_2"). If an evaluation measure also says "SERP_1 &gt; SERP_2," this is a preference agreement; if it says "SERP_1 &lt; SERP_2," this is a preference disagreement. We can count the number of agreements over different queries and SERP pairs, and use it for comparing the "goodness" of evaluation measures. In practice, this approach also has a few limitations: (a) the judges employed in the preference assessments are often not real search engine users with an information need; (b) human assessments can be unreliable and/or inconsistent; and (c) hiring judges comes at a cost, no matter how small. The swap method <ref type="bibr" target="#b7">[8]</ref> may be used to measure the consistency (i.e., "preference agreement with itself") of evaluation measures across different topic sets. Given a set of n topics, the set is split in half, and the number of inconsistent preferences (e.g., SERP_1 &gt; SERP_2 with the first half but SERP_1 &lt; SERP_2 with the second half) is counted, using different systems and different splits. 
As this method can only consider half the original topic set size, Voorhees and Buckley used a simple extrapolation method to estimate what will happen for topic set sizes larger than n. However, estimating the swap rate for (say) n = 100 topics based on observations with (say) n = 10, 25, 50 topics may not be reliable. To directly consider the size n, bootstrap samples <ref type="bibr" target="#b2">[3]</ref> can be used to replace the sampling-without-replacement approach of Voorhees and Buckley, but this method cannot consider topic set sizes larger than n either.</p><p>Given a set of runs and an evaluation measure, a p-value can be obtained for every system pair using an appropriate significance test, and the sorted p-values can be plotted against the system pairs <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b3">4]</ref>: this is called the discriminative power curve. While highly discriminative measures are useful in the sense that they can obtain more statistically significant results in a given environment with exactly n topics, discriminative power does not provide a view over different choices of topics. Moreover, it is not clear, for example, whether a measure with 90% discriminative power should actually be preferred over one with 80% discriminative power.</p><p>Sakai <ref type="bibr" target="#b5">[6]</ref> released three Excel tools based on topic set size design, which determines the number of topics n to create for a new test collection given a set of statistical requirements. His ANOVA-based tool takes the following as input: α (Type I error probability), β (Type II error probability), m (the number of systems to be compared in one-way ANOVA), σ^2 (an estimate of the within-system variance for a particular evaluation measure), and minD (minimum detectable range); the tool returns the topic set size n that ensures 100(1 − β)% statistical power whenever the true difference between the best and the worst among the m systems is minD or larger. 
In contrast, his CI-based tool takes the following as input: α, σ_t^2 (an estimate of the variance of the between-system differences in terms of a particular evaluation measure), and δ, which is exactly what we call WCW in this study; the tool returns the topic set size n that ensures that the width of the 100(1 − α)% CI for any system pair is no larger than δ. Following Sakai, we simply let σ_t^2 = 2σ^2 for any evaluation measure.</p><p>While the relationship between minD for ANOVA and n can be plotted for different evaluation measures, this seems problematic as a way to compare evaluation measures, since, for example, a minD of 0.1 in terms of one measure is not equivalent to a minD of 0.1 in terms of another. In contrast, if we plot δ against n, this is probably a more valid comparison since, at least for any normalised measures that lie in the [0, 1] score range, we usually want the CI width to be as small as possible. This is why we propose to plot δ against topic set sizes to compare different measures. However, Sakai's CI-based tool cannot handle large topic set sizes: the limitation of his CI-based tool is due to that of Excel's GAMMA function: GAMMA(172) is greater than 10^307 and cannot be computed <ref type="bibr" target="#b5">[6]</ref>. Hence, we start by proving that his ANOVA-based tool can be used instead of the less robust CI-based one, for IR researchers to compare the statistical reliability of evaluation measures based on WCW.</p></div>
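The GAMMA limitation just described is a floating-point overflow rather than a fundamental obstacle. As a minimal, standard-library-only Python sketch (our illustration; the tools discussed in this paper are Excel-based), working with the log-gamma function instead of Γ itself avoids the overflow that Excel's GAMMA(172) hits:

```python
import math

# 171! is about 1.24e309, which exceeds the largest IEEE-754 double
# (~1.8e308), so Gamma(172) overflows in Python just as in Excel.
try:
    math.gamma(172)
    overflowed = False
except OverflowError:
    overflowed = True

# Computing ln(Gamma(172)) = ln(171!) in log space is unproblematic.
log_gamma_172 = math.lgamma(172)
print(overflowed, round(log_gamma_172, 2))
```

Any computation that needs ratios of large Γ values can therefore be carried out at arbitrary topic set sizes by keeping intermediate quantities in log space and exponentiating only at the end.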
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">PROOF THAT ANOVA-BASED TOPIC SET SIZE DESIGN CAN BE USED INSTEAD OF CI-BASED ONE</head><p>According to Sakai's CI-based topic set size design, the initial topic set size estimate for ensuring that the CI width for the difference in means for any two systems is no larger than δ (&gt; 0) is given by <ref type="bibr" target="#b5">[6]</ref>:</p><formula xml:id="formula_0">n_CI = 4{z_inv(α/2)}^2 σ_t^2 / δ^2 = 4{z_inv(α/2)}^2 (2σ^2) / δ^2,<label>(1)</label></formula><p>where z_inv(P) is the upper z-value<ref type="foot" target="#foot_2">3</ref> for probability P. Subsequently, this estimate is incremented until it actually satisfies the requirement (α, δ). Thus, while the actual CI relies on a t-distribution, the method starts off with a standard normal distribution by assuming that the variance estimate σ_t^2 is perfectly accurate<ref type="foot" target="#foot_3">4</ref>. This is why Eq. 1 involves a z-value rather than a t-value.</p><p>Meanwhile, according to Sakai's ANOVA-based topic set size design, the initial topic set size estimate for ensuring 100(1 − β)% statistical power whenever the true difference between the best and the worst systems is minD or larger is given by <ref type="bibr" target="#b5">[6]</ref>:</p><formula xml:id="formula_1">n_ANOVA = 2σ^2 λ / minD^2,<label>(2)</label></formula><p>where λ is a noncentrality parameter of a noncentral χ^2 distribution with ϕ = m − 1 degrees of freedom; as discussed below, linear formulae are available for estimating λ from ϕ <ref type="bibr" target="#b1">[2]</ref>. As Eq. 2 is based on a series of approximations, n_ANOVA is then incremented until it actually satisfies the requirement (α, β, minD, m). 
Sakai <ref type="bibr" target="#b5">[6]</ref> observed that, for the data he considered, "the topic set size required based on the CI-based design with α = 0.05 and δ = c is almost the same as the topic set size required based on the ANOVA-based design with (α, β, m) = (0.05, 0.20, 10) and minD = c, for any c." We analytically explain and generalise his observation as follows. From Eqs. 1 and 2, we have:</p><formula xml:id="formula_3">n_ANOVA / n_CI = λ δ^2 / (4{z_inv(α/2)}^2 minD^2) = (λ / 4{z_inv(α/2)}^2) (δ/minD)^2.<label>(3)</label></formula><p>Here, note that 4{z_inv(α/2)}^2 is a constant for a given α; also, λ is a constant given α, β and m. Figure <ref type="figure">1</ref> visualises the relationship between the two constants for α = 0.01, 0.05 and β = 0.10, 0.20, while varying the number of systems m. The linear formulae for approximating λ based on ϕ = m − 1 <ref type="bibr" target="#b5">[6]</ref> are provided in the bottom half of the figure. Figure <ref type="figure">1</ref> shows that </p><formula xml:id="formula_5">λ ≈ 4{z_inv(α/2)}^2<label>(4)</label></formula><formula xml:id="formula_6">n_ANOVA / n_CI ≈ (δ/minD)^2.<label>(5)</label></formula><p>Thus, when one of the four conditions (a)-(d) listed with Figure 1 holds, by letting δ = minD in Eq. 5 we obtain n_ANOVA/n_CI ≈ 1, that is, n_ANOVA ≈ n_CI, regardless of the variance estimate σ^2. Q.E.D.</p><p>Henceforth, we only consider the popular Cohen's five-eighty convention <ref type="bibr" target="#b0">[1]</ref>, i.e., (α, β) = (0.05, 0.20) <ref type="foot" target="#foot_4">5</ref>, and leverage Condition (a). Figure <ref type="figure">1</ref>: The noncentrality parameter λ vs. 4{z_inv(α/2)}^2. Table <ref type="table">1</ref>: σ^2: estimates of within-system variances. md stands for measurement depth (i.e., document cutoff). </p></div>
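The argument above can also be checked numerically. The following Python sketch is our own illustration, not Sakai's Excel tools: it solves the noncentral χ^2 power condition for λ under Condition (a), compares λ with 4{z_inv(α/2)}^2, and evaluates the initial estimates of Eqs. 1 and 2 with δ = minD. The values σ^2 = 0.05 and δ = 0.1 are arbitrary illustrative choices, and the incomplete-gamma series is a textbook implementation, kept standard-library-only.

```python
import math
from statistics import NormalDist

def reg_lower_gamma(s, x):
    """Regularized lower incomplete gamma P(s, x), via its power series."""
    if x <= 0.0:
        return 0.0
    term, total, k = 1.0, 1.0, s + 1.0
    while term > 1e-16 * total:
        term *= x / k
        total += term
        k += 1.0
    return total * math.exp(s * math.log(x) - x - math.lgamma(s + 1.0))

def chi2_cdf(x, df):
    return reg_lower_gamma(df / 2.0, x / 2.0)

def ncx2_power(lam, df, crit):
    """P(noncentral chi^2(df, lam) > crit): Poisson mixture of central chi^2 CDFs."""
    half, w, cdf = lam / 2.0, math.exp(-lam / 2.0), 0.0
    for j in range(500):
        cdf += w * chi2_cdf(crit, df + 2 * j)
        w *= half / (j + 1)
    return 1.0 - cdf

def bisect(f, lo, hi, tol=1e-9):
    """Root of an increasing f on [lo, hi] with f(lo) < 0 < f(hi)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if f(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0

alpha, beta, m = 0.05, 0.20, 10                 # Condition (a)
phi = m - 1                                     # chi^2 degrees of freedom
crit = bisect(lambda x: chi2_cdf(x, phi) - (1 - alpha), 0.0, 100.0)
lam = bisect(lambda l: ncx2_power(l, phi, crit) - (1 - beta), 0.0, 100.0)

z = NormalDist().inv_cdf(1 - alpha / 2)         # z_inv(alpha/2)
four_z_sq = 4.0 * z * z                         # the constant in Eq. 1

sigma2, delta = 0.05, 0.10                      # illustrative; delta = minD
n_ci = four_z_sq * (2.0 * sigma2) / delta ** 2  # Eq. 1 initial estimate
n_anova = 2.0 * sigma2 * lam / delta ** 2       # Eq. 2 initial estimate
print(round(lam, 2), round(four_z_sq, 2), round(n_ci, 1), round(n_anova, 1))
```

Under Condition (a), λ and 4{z_inv(α/2)}^2 come out close to each other, so the two initial topic set size estimates differ by only a few topics, consistent with the observation quoted above.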
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">WCW-BASED EVALUATION OF EVALUATION MEASURES: CASE STUDIES</head><p>Having proven that the ANOVA-based tool can be used instead of the less robust CI-based tool, we now demonstrate how different evaluation measures can be compared using WCW curves obtained with the ANOVA-based tool.</p><p>Table <ref type="table">1</ref> shows the variance estimates σ^2 of various evaluation measures reported in the literature <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b5">6]</ref>. For the purpose of the present study, detailed knowledge of each evaluation measure is not necessary; the measures with a prefix "std-AB" denote standardised versions of the original measures, where the raw score for each topic is transformed based on a set of known systems, to absorb the hardness of that topic as well as its variation across systems <ref type="bibr" target="#b4">[5]</ref>. Given a topic-by-run score matrix for a particular evaluation measure, σ^2 can easily be obtained as the residual variance of ANOVA. While some evaluation measures are substantially less stable across topics than others (e.g., compare nERR and nDCG in Table 1(a)), it is not clear just from this table how such differences will actually impact our evaluation results.</p><p>Figure <ref type="figure" target="#fig_3">3</ref> shows the WCW curves that correspond to the variances shown in Table <ref type="table">1</ref>, for α = 0.05, i.e., 95% CIs. 
The advantages of the proposed WCW-based comparison of evaluation measures are as follows:</p><p>• Unlike discriminative power and the swap method, we can easily consider a wide range of topic set sizes; • For a particular topic set size, we can easily compare across different evaluation measures, since an evaluation measure with a small WCW is usually more desirable than one with a large WCW under the same condition; • The WCW curves can visualise the differences among measures that practically matter. For example, from Figure <ref type="figure" target="#fig_3">3</ref>(b), when the topic set size is n = 50, it is clear that the WCW of nDCG and that of Q are about the same (around 0.16), while those of AP and nERR are substantially larger (around 0.23). Similarly, from Figure <ref type="figure" target="#fig_3">3</ref>(d), while it is clear that the standardised ("std-AB") measures have substantially lower WCW values than the unstandardised ones, the differences within the set of standardised measures are probably not of practical importance, as indicated by the near-perfect overlaps of the curves.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label></label><figDesc>holds when: Condition (a) α = 0.05, β = 0.20, m = 10; or Condition (b) α = 0.05, β = 0.10, m = 5; or Condition (c) α = 0.01, β = 0.20, m = 18; or Condition (d) α = 0.01, β = 0.10, m = 10. Hence, whenever one of the above four conditions holds true, then from Eqs. 3 and 4 we obtain:</figDesc></figure>
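The pipeline of this section, from a topic-by-run score matrix to a WCW value at a given topic set size, can be sketched as follows. This is a Python sketch under our own simplifying assumptions: σ^2 is taken as the one-way ANOVA residual (pooled within-system) variance, δ is obtained by solving Eq. 1 for δ at a given n (the initial estimate only, ignoring the tools' subsequent increment step), and the 3-system, 4-topic score matrix is made up purely for illustration.

```python
from statistics import NormalDist

def within_system_variance(scores):
    """Residual (within-system) variance of one-way ANOVA;
    scores[i][j] is the score of system i on topic j."""
    m, n = len(scores), len(scores[0])
    ss_within = 0.0
    for row in scores:
        mean = sum(row) / n
        ss_within += sum((x - mean) ** 2 for x in row)
    return ss_within / (m * (n - 1))    # residual degrees of freedom

def wcw_estimate(sigma2, n, alpha=0.05):
    """Eq. 1 solved for delta at a given n: a first-order WCW estimate."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return (4.0 * z * z * 2.0 * sigma2 / n) ** 0.5

# Hypothetical 3-system x 4-topic matrix of some [0, 1]-valued measure.
scores = [
    [0.30, 0.50, 0.40, 0.60],
    [0.35, 0.55, 0.45, 0.65],
    [0.20, 0.40, 0.30, 0.50],
]
sigma2 = within_system_variance(scores)
for n in (25, 50, 100, 200):            # one point per curve position
    print(n, round(wcw_estimate(sigma2, n), 3))
```

Plotting wcw_estimate(σ^2, n) against n for each measure's σ^2 yields, up to the increment step, the kind of WCW curves shown in Figure 3; a measure whose curve lies lower reaches a given CI width with fewer topics.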
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head></head><label></label><figDesc>(a) TREC03-04Robust (md = 1000), from Sakai<ref type="bibr" target="#b5">[6]</ref>; (b) TREC11-12WebAdhoc (md = 10), from Sakai<ref type="bibr" target="#b5">[6]</ref>. Figure 2 compares, for different and quite extreme values of the variance estimate σ^2, the topic set size curve using the CI-based tool with α = 0.05 and one using the ANOVA-based tool with α = 0.05, β = 0.20, m = 10. Due to the aforementioned limitation of the CI-based tool, it was not possible to obtain the entire curves with this tool. On the other hand, it is clear that the ANOVA-based curves can serve as highly accurate surrogates for the CI-based curves and can handle large topic set sizes. In summary, to discuss WCW, we can always use the more robust ANOVA-based tool and treat the minD values as if they are δ values.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: The actual relationship between δ for CI and minD for ANOVA in topic set size design.</figDesc><graphic coords="3,317.96,495.17,240.94,135.53" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: WCW curves for 95% CIs.</figDesc><graphic coords="4,53.80,503.75,240.94,135.53" type="bitmap" /></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">http://www.f.waseda.jp/tetsuya/CIKM2014/samplesizeANOVA.xlsx</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">http://www.f.waseda.jp/tetsuya/FIT2014/samplesizeCI.xlsx</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">NORM.S.INV(1 − P) with Microsoft Excel.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">Replacing the true population variance of a standard normal distribution with a sample variance constitutes the very definition of a t-distribution.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4">Note that "eighty" refers to the statistical power: 100(1 − β)%.</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">CONCLUSIONS AND FUTURE WORK</head><p>We proposed to evaluate evaluation measures by comparing the WCW for various topic set sizes, using an existing ANOVA-based tool instead of the less robust CI-based tool. We proved the relationship between these two topic set size design methods, and demonstrated the advantages of WCW curves over well-known methods such as the swap test and discriminative power. It is hoped that this method will supplement user-based studies of evaluation measures.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">The Essential Guide to Effect Sizes</title>
		<author>
			<persName><forename type="first">Paul</forename><forename type="middle">D</forename><surname>Ellis</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2010">2010</date>
			<publisher>Cambridge University Press</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">How to Design the Sample Size</title>
		<author>
			<persName><forename type="first">Yasushi</forename><surname>Nagata</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2003">2003</date>
			<publisher>Asakura Shoten</publisher>
		</imprint>
	</monogr>
	<note>in Japanese</note>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Evaluating Evaluation Metrics based on the Bootstrap</title>
		<author>
			<persName><forename type="first">Tetsuya</forename><surname>Sakai</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of ACM SIGIR</title>
				<meeting>ACM SIGIR</meeting>
		<imprint>
			<date type="published" when="2006">2006</date>
			<biblScope unit="page" from="525" to="532" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Alternatives to Bpref</title>
		<author>
			<persName><forename type="first">Tetsuya</forename><surname>Sakai</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of ACM SIGIR</title>
				<meeting>ACM SIGIR</meeting>
		<imprint>
			<date type="published" when="2007">2007</date>
			<biblScope unit="page" from="71" to="78" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">The Effect of Score Standardisation on Topic Set Size Design</title>
		<author>
			<persName><forename type="first">Tetsuya</forename><surname>Sakai</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of AIRS 2016</title>
				<meeting>AIRS 2016</meeting>
		<imprint>
			<date type="published" when="2016">2016. 9994</date>
			<biblScope unit="page" from="16" to="28" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Topic Set Size Design</title>
		<author>
			<persName><forename type="first">Tetsuya</forename><surname>Sakai</surname></persName>
		</author>
		<ptr target="//link.springer.com/content/pdf/10.1007%2Fs10791-015-9273-z.pdf" />
	</analytic>
	<monogr>
		<title level="j">Information Retrieval Journal</title>
		<imprint>
			<biblScope unit="volume">19</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="256" to="283" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Do User Preferences and Evaluation Measures Line Up?</title>
		<author>
			<persName><forename type="first">Mark</forename><surname>Sanderson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Monica</forename><forename type="middle">Lestari</forename><surname>Paramita</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Paul</forename><surname>Clough</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Evangelos</forename><surname>Kanoulas</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of ACM SIGIR</title>
				<meeting>ACM SIGIR</meeting>
		<imprint>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="555" to="562" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">The Effect of Topic Set Size on Retrieval Experiment Error</title>
		<author>
			<persName><forename type="first">Ellen</forename><forename type="middle">M</forename><surname>Voorhees</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Chris</forename><surname>Buckley</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of ACM SIGIR</title>
				<meeting>ACM SIGIR</meeting>
		<imprint>
			<date type="published" when="2002">2002</date>
			<biblScope unit="page" from="316" to="323" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">A New Rank Correlation Coefficient for Information Retrieval</title>
		<author>
			<persName><forename type="first">Emine</forename><surname>Yilmaz</surname></persName>
		</author>
	<author>
		<persName><forename type="first">Javed</forename><forename type="middle">A</forename><surname>Aslam</surname></persName>
	</author>
	<author>
		<persName><forename type="first">Stephen</forename><surname>Robertson</surname></persName>
	</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of ACM SIGIR</title>
				<meeting>ACM SIGIR</meeting>
		<imprint>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="587" to="594" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
