Evaluating Evaluation Measures with Worst-Case Confidence Interval Widths

Tetsuya Sakai
Waseda University
tetsuyasakai@acm.org

EVIA 2017, co-located with NTCIR-13, Tokyo, Japan.
© 2017 Copyright held by the author. Copying permitted for private and academic purposes.

ABSTRACT

IR evaluation measures are often compared in terms of rank correlation between two system rankings, agreement with the users' preferences, the swap method, and discriminative power. While we view the agreement with real users as the most important, this paper proposes to use Worst-case Confidence interval Width (WCW) curves to supplement it in test-collection environments. WCW is the worst-case width of a confidence interval (CI) for the difference between any two systems, given a topic set size. We argue that WCW curves are more useful than the swap method and discriminative power, since they provide a statistically well-founded overview of the comparison of measures over various topic set sizes, and visualise what levels of differences across measures might be of practical importance. First, we prove that Sakai's ANOVA-based topic set size design tool can be used for discussing WCW instead of his CI-based tool, which cannot handle large topic set sizes. We then provide some case studies of evaluating evaluation measures using WCW curves based on the ANOVA-based tool, using data from TREC and NTCIR.

CCS CONCEPTS

• Information systems → Retrieval effectiveness;

KEYWORDS

ANOVA; confidence intervals; effect sizes; evaluation measures; p-values; sample sizes; statistical significance
1 INTRODUCTION

IR systems are built to satisfy users' information needs, but it is not practical to make the users evaluate the systems all the time for the purpose of improving them—that would annoy the users, not satisfy them! Hence, we often turn to IR evaluation measures in laboratory experiments. But which IR measures are good?

In laboratory studies, evaluation measures are often compared in terms of rank correlation between two system rankings (e.g. [9]), agreement with the users' document preferences (e.g. [7]), the swap method (e.g. [8]), and discriminative power (e.g. [3, 4]). Since IR evaluation measures are often regarded as surrogates of user satisfaction or user performance measurements, we view the agreement with users as the most important, although it needs to be said that user preference studies often use hired assessors such as crowd workers instead of real users with an information need. Moreover, studies involving human assessors obviously incur costs.

To supplement user-based studies of IR evaluation measures, we propose to use Worst-case Confidence interval Width (WCW) curves in test-collection environments. WCW is the worst-case width of a confidence interval (CI) for the difference between any two systems, given a topic set size. We argue that WCW curves are more useful than the swap method and discriminative power, since they provide a statistically well-founded overview of the comparison of measures over various topic set sizes, and visualise what levels of differences across measures might be of practical importance. To this end, we leverage one of the publicly available topic set size design Excel tools of Sakai [6]. First, we prove that Sakai's ANOVA-based topic set size design tool¹ can be used for discussing WCW instead of his CI-based tool², which cannot handle large topic set sizes (see Section 2). We then provide some case studies of evaluating evaluation measures using WCW curves based on the ANOVA-based tool, using data from TREC and NTCIR.

¹ http://www.f.waseda.jp/tetsuya/CIKM2014/samplesizeANOVA.xlsx
² http://www.f.waseda.jp/tetsuya/FIT2014/samplesizeCI.xlsx
2 PRIOR ART IN EVALUATING EVALUATION MEASURES

When a new IR evaluation measure is invented, a system ranking according to this measure (averaged over a set of topics) is often compared with another according to a well-established measure; rank correlation measures such as Kendall's τ or the top-heavy τ_ap [9] are often used to quantify the similarity between two rankings. However, this approach cannot tell us whether a measure is good or bad, due to the lack of a "correct" system ranking. It merely tells us whether a new measure is similar to an existing one or not; it only serves as a sanity check.

For a given query, a user sees two Search Engine Result Pages (SERPs) side by side, and says that SERP 1 is better than SERP 2 ("SERP1 > SERP2"). If an evaluation measure also says "SERP1 > SERP2," this is a preference agreement; if it says "SERP1 < SERP2," this is a preference disagreement. We can count the number of agreements over different queries and SERP pairs, and use it for comparing the "goodness" of evaluation measures. In practice, this approach also has a few limitations: (a) the judges employed in the preference assessments are often not real search engine users with an information need; (b) human assessments can be unreliable and/or inconsistent; and (c) hiring judges comes at a cost, no matter how small.

The swap method [8] may be used to measure the consistency (i.e., "preference agreement with itself") of evaluation measures across different topic sets. Given a set of n topics, the set is split in half, and the number of inconsistent preferences (e.g., SERP1 > SERP2 with the first half but SERP1 < SERP2 with the second half) is counted, using different systems and different splits. As this method can
only consider half the original topic set size, Voorhees and Buckley used a simple extrapolation method to estimate what will happen for topic set sizes larger than n. However, estimating the swap rate for (say) n = 100 topics based on observations with (say) n = 10, 25, 50 topics may not be reliable. To directly consider the size n, bootstrap samples [3] can be used to replace the sampling-without-replacement approach of Voorhees and Buckley, but this method cannot consider topic set sizes larger than n either.
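To make the counting step concrete, the following is a minimal sketch of the swap method's core, under simplifying assumptions of ours: per-topic scores are given as a topic-by-run matrix, halves are drawn by random permutation, and preferences are compared by sign only (Voorhees and Buckley [8] additionally bin system pairs by the size of the score difference). All names here are ours, not part of any released tool.

```python
import numpy as np

def swap_rate(scores: np.ndarray, n_trials: int = 1000, seed: int = 0) -> float:
    """Fraction of system pairs whose preference flips between two
    disjoint half-splits of the topic set.

    scores: (n_topics, n_systems) matrix of per-topic scores
            under one evaluation measure.
    """
    rng = np.random.default_rng(seed)
    n_topics, n_systems = scores.shape
    swaps = total = 0
    for _ in range(n_trials):
        perm = rng.permutation(n_topics)
        half1, half2 = perm[: n_topics // 2], perm[n_topics // 2:]
        mean1 = scores[half1].mean(axis=0)  # mean score of each system, half 1
        mean2 = scores[half2].mean(axis=0)  # mean score of each system, half 2
        for i in range(n_systems):
            for j in range(i + 1, n_systems):
                total += 1
                # a "swap": the sign of the difference flips across the halves
                if (mean1[i] - mean1[j]) * (mean2[i] - mean2[j]) < 0:
                    swaps += 1
    return swaps / total
```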
Given a set of runs and an evaluation measure, a p-value can be obtained for every system pair using an appropriate significance test, and the sorted p-values can be plotted against the system pairs [3, 4]: this is called the discriminative power curve. While highly discriminative measures are useful in the sense that they can obtain more statistically significant results in a given environment with exactly n topics, discriminative power does not provide a view over different choices of topics. Moreover, it is not clear, for example, whether a measure with 90% discriminative power should actually be preferred over one with 80% discriminative power.
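As an illustration of how such a curve is obtained, here is a minimal sketch; note that [3, 4] use bootstrap-based significance tests, whereas we substitute a paired t-test purely for brevity, and all names are ours.

```python
import numpy as np
from itertools import combinations
from scipy.stats import ttest_rel

def discriminative_power_curve(scores: np.ndarray) -> np.ndarray:
    """Sorted p-values over all system pairs for one evaluation measure.

    scores: (n_topics, n_systems) matrix of per-topic scores.
    Plotting the returned values against their ranks gives the
    discriminative power curve described above.
    """
    n_systems = scores.shape[1]
    pvals = [ttest_rel(scores[:, i], scores[:, j]).pvalue
             for i, j in combinations(range(n_systems), 2)]
    return np.sort(np.asarray(pvals))
```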
Sakai [6] released three Excel tools based on topic set size design, which determines the number of topics n to create for a new test collection given a set of statistical requirements. His ANOVA-based tool takes the following as input: α (Type I error probability), β (Type II error probability), m (the number of systems to be compared in one-way ANOVA), σ̂² (an estimate of the within-system variance for a particular evaluation measure), and minD (minimum detectable range); the tool returns the topic set size n that ensures 100(1 − β)% statistical power whenever the true difference between the best and the worst among the m systems is minD or larger. In contrast, his CI-based tool takes the following as input: α, σ̂_t² (an estimate of the variance of the between-system differences in terms of a particular evaluation measure), and δ, which is exactly what we call WCW in this study; the tool returns the topic set size n that ensures that the width of the 100(1 − α)% CI for any system pair is no larger than δ. Following Sakai, we simply let σ̂_t² = 2σ̂² for any evaluation measure.

While the relationship between minD for ANOVA and n can be plotted for different evaluation measures, this seems problematic as a way to compare evaluation measures, since, for example, a minD of 0.1 in terms of one measure is not equivalent to a minD of 0.1 in terms of another. In contrast, if we plot δ against n, this is probably a more valid comparison since, at least for any normalised measures that lie in the [0, 1] score range, we usually want the CI width to be as small as possible. This is why we propose to plot δ against topic set sizes to compare different measures. However, Sakai's CI-based tool cannot handle large topic set sizes: this limitation is due to that of Excel's GAMMA function: GAMMA(172) is greater than 10^307 and cannot be computed [6]. Hence, we start by proving that his ANOVA-based tool can be used instead of the less robust CI-based one, for IR researchers to compare the statistical reliability of evaluation measures based on WCW.
3 PROOF THAT ANOVA-BASED TOPIC SET SIZE DESIGN CAN BE USED INSTEAD OF CI-BASED ONE

According to Sakai's CI-based topic set size design, the initial topic set size estimate for ensuring that the CI width for the difference in means for any two systems is no larger than δ (> 0) is given by [6]:

    n_CI = 4{z_inv(α/2)}² σ̂_t² / δ² = 4{z_inv(α/2)}² (2σ̂²) / δ² ,   (1)

where z_inv(P) is the upper z-value³ for probability P. Subsequently, this estimate is incremented until it actually satisfies the requirement (α, δ). Thus, while the actual CI relies on a t-distribution, the method starts off with a standard normal distribution by assuming that the variance estimate σ̂_t² is perfectly accurate⁴. This is why Eq. 1 involves a z-value rather than a t-value.
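To make the procedure concrete, here is a sketch of our reading of the CI-based iteration: start from the z-based estimate of Eq. 1, then increment n until the expected width of the t-based CI is no larger than δ. The expected-width criterion (whose Gamma-function ratio is the source of the Excel limitation mentioned in Section 2) is our assumption about the exact stopping rule; computing the ratio in log space via gammaln avoids the overflow.

```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import norm, t

def n_ci(alpha: float, delta: float, var_t: float) -> int:
    """Topic set size such that the expected 100(1-alpha)% CI width for
    the difference between two systems is <= delta: the Eq. 1 estimate,
    incremented until the t-based width requirement is met.
    var_t is the variance estimate of the per-topic score differences."""
    z = norm.ppf(1 - alpha / 2)
    n = max(2, int(np.ceil(4 * z**2 * var_t / delta**2)))  # Eq. 1
    while True:
        # E[s] = sigma * sqrt(2/(n-1)) * Gamma(n/2) / Gamma((n-1)/2),
        # with the Gamma ratio computed in log space for large n
        e_s = np.sqrt(var_t * 2.0 / (n - 1)) * np.exp(
            gammaln(n / 2) - gammaln((n - 1) / 2))
        width = 2 * t.ppf(1 - alpha / 2, n - 1) * e_s / np.sqrt(n)
        if width <= delta:
            return n
        n += 1
```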
In contrast, according to Sakai's ANOVA-based topic set size design, the initial topic set size estimate for ensuring 100(1 − β)% statistical power whenever the true difference between the best and the worst systems is minD or larger is given by [6]:

    n_ANOVA = 2σ̂²λ / minD² ,   (2)

where λ is the noncentrality parameter of a noncentral χ² distribution with ϕ = m − 1 degrees of freedom; as discussed below, linear formulae are available for estimating λ from ϕ [2]. As Eq. 2 is based on a series of approximations, n_ANOVA is then incremented until it actually satisfies the requirement (α, β, minD, m).
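Analogously, the following sketch computes the initial estimate of Eq. 2. Instead of the linear approximation formulae of [2], it solves for λ numerically, interpreting λ as the noncentrality at which a χ²-based test with ϕ = m − 1 degrees of freedom attains power 1 − β at level α (our reading of the design); the final increment-and-check step of the actual tool is omitted.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import chi2, ncx2

def noncentrality(alpha: float, beta: float, m: int) -> float:
    """lambda such that a noncentral chi-square variable with phi = m - 1
    degrees of freedom exceeds the level-alpha critical value with
    probability 1 - beta (i.e., the test attains the required power)."""
    phi = m - 1
    crit = chi2.ppf(1 - alpha, phi)
    # power increases monotonically in lambda, so bracket and bisect
    return brentq(lambda lam: ncx2.sf(crit, phi, lam) - (1 - beta), 1e-9, 1e3)

def n_anova(alpha: float, beta: float, m: int, var: float, min_d: float) -> int:
    """Initial topic set size estimate of Eq. 2, given the within-system
    variance estimate var (= sigma^2) and the minimum detectable range."""
    lam = noncentrality(alpha, beta, m)
    return int(np.ceil(2 * var * lam / min_d**2))
```

With (α, β, m) = (0.05, 0.20, 10), this yields λ ≈ 15.4, which is indeed close to 4{z_inv(0.025)}² ≈ 15.37, consistent with Condition (a) discussed below.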




Sakai [6] observed that, for the data he considered, "the topic set size required based on the CI-based design with α = 0.05 and δ = c is almost the same as the topic set size required based on the ANOVA-based design with (α, β, m) = (0.05, 0.20, 10) and minD = c, for any c." We analytically explain and generalise his observation as follows. From Eqs. 1 and 2, we have:

    n_ANOVA / n_CI = λδ² / (4{z_inv(α/2)}² minD²) = (λ / 4{z_inv(α/2)}²) (δ/minD)² .   (3)

Here, note that 4{z_inv(α/2)}² is a constant for a given α; also, λ is a constant given α, β and m. Figure 1 visualises the relationship between the two constants for α = 0.01, 0.05 and β = 0.10, 0.20, while varying the number of systems m. The linear formulae for approximating λ based on ϕ = m − 1 [6] are provided in the bottom half of the figure. Figure 1 shows that

    λ ≈ 4{z_inv(α/2)}²   (4)

holds when:

    Condition (a) α = 0.05, β = 0.20, m = 10; or
    Condition (b) α = 0.05, β = 0.10, m = 5; or
    Condition (c) α = 0.01, β = 0.20, m = 18; or
    Condition (d) α = 0.01, β = 0.10, m = 10.

Hence, whenever one of the above four conditions holds true, from Eqs. 3 and 4 we obtain:

    n_ANOVA / n_CI ≈ (δ/minD)² .   (5)

Thus, when one of the above four conditions holds, by letting δ = minD in Eq. 5 we obtain n_ANOVA/n_CI ≈ 1, that is, n_ANOVA ≈ n_CI, regardless of the variance estimate σ̂². Q.E.D.

Henceforth, we only consider the popular Cohen's five-eighty convention [1], i.e., (α, β) = (0.05, 0.20)⁵, and leverage Condition (a) mentioned above.

³ NORM.S.INV(1 − P) with Microsoft Excel.
⁴ Replacing the true population variance of a standard normal distribution with a sample variance constitutes the very definition of a t-distribution.
⁵ Note that "eighty" refers to the statistical power: 100(1 − β)%.
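Putting the two sketches together, the claimed equivalence under Condition (a) can be spot-checked numerically; the variance 0.0456 is the nDCG entry of Table 1(a) in Section 4, and everything else (the δ grid, the reuse of our hypothetical n_ci and n_anova functions above) is our choice.

```python
# Spot check of n_ANOVA ~= n_CI under Condition (a): (alpha, beta, m) = (0.05, 0.20, 10).
var = 0.0456  # within-system variance of nDCG, Table 1(a)
for delta in (0.05, 0.10, 0.20):
    # n_ci expects the variance of score *differences*, i.e. 2 * var
    print(delta, n_ci(0.05, delta, 2 * var), n_anova(0.05, 0.20, 10, var, delta))
# By Eq. 5 with delta = minD, the two estimates should roughly agree.
```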
[Figure 1: The noncentrality parameter λ vs. 4{z_inv(α/2)}².]

Table 1: σ̂²: estimates of within-system variances. md stands for measurement depth (i.e., document cutoff).

(a) TREC03-04Robust (md = 1000), from Sakai [6]
    AP        .0471    nDCG         .0456
    Q         .0465    nERR         .1145
(b) TREC11-12WebAdhoc (md = 10), from Sakai [6]
    AP        .0824    nDCG         .0441
    Q         .0368    nERR         .0863
(c) TREC11-12WebDiversity (md = 10), from Sakai [6]
    α-nDCG    .0779    D-nDCG       .0340
    nERR-IA   .0842    D#-nDCG      .0504
(d) NTCIR-12 STC1C (md = 10), from Sakai [5]
    nG@1      .1144    std-AB nG@1  .0193
    P+        .0943    std-AB P+    .0186
    nERR      .0867    std-AB nERR  .0182



Figure 2 compares, for different and quite extreme values of the variance estimate σ̂², the topic set size curve using the CI-based tool with α = 0.05 and one using the ANOVA-based tool with α = 0.05, β = 0.20, m = 10. Due to the aforementioned limitation of the CI-based tool, it was not possible to obtain the entire curves with this tool. On the other hand, it is clear that the ANOVA-based curves can serve as highly accurate surrogates for the CI-based curves and can handle large topic set sizes. In summary, to discuss WCW, we can always use the more robust ANOVA-based tool and treat the minD values as if they were δ values.

[Figure 2: The actual relationship between δ for CI and minD for ANOVA in topic set size design.]

4 WCW-BASED EVALUATION OF EVALUATION MEASURES: CASE STUDIES

Having proven that the ANOVA-based tool can be used instead of the less robust CI-based tool, we now demonstrate how different evaluation measures can be compared using WCW curves obtained with the ANOVA-based tool.

Table 1 shows the variance estimates σ̂² of various evaluation measures reported in the literature [5, 6]. For the purposes of the present study, knowledge of each evaluation measure is not necessary; the measures with the prefix "std-AB" denote standardised versions of the original measures, where the raw score for each topic is transformed based on a set of known systems, to absorb the hardness of that topic as well as its variation across systems [5]. Given a topic-by-run score matrix for a particular evaluation measure, σ̂² can easily be obtained as the residual variance of ANOVA, as sketched below. While some evaluation measures are substantially less stable across topics than others (e.g., compare nERR and nDCG in Table 1(a)), it is not clear just from this table how such differences will actually impact our evaluation results.
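As a minimal sketch of that estimate (assuming, per our reading of [6], a one-way ANOVA with systems as groups and topics as replicates; the function name is ours):

```python
import numpy as np

def residual_variance(scores: np.ndarray) -> float:
    """Pooled within-system variance of a topic-by-run score matrix:
    the residual variance of one-way ANOVA with systems as groups,
    i.e. SSE / (m * (n - 1)) for n topics and m systems."""
    n_topics, n_systems = scores.shape
    sse = ((scores - scores.mean(axis=0)) ** 2).sum()  # deviations from each system's mean
    return float(sse / (n_systems * (n_topics - 1)))
```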






Figure 3 shows the WCW curves that correspond to the variances shown in Table 1, for α = 0.05, i.e., 95% CIs. For each evaluation measure, δ is plotted against the required topic set size n; each curve was obtained by entering different values of minD (i.e., δ) into the ANOVA-based tool (with α = 0.05, β = 0.20, m = 10) and recording the resultant n (a sketch of this loop is given after the list below). The advantages of the proposed WCW-based comparison of evaluation measures are as follows:

• Unlike discriminative power and the swap method, we can easily consider a wide range of topic set sizes;
• For a particular topic set size, we can easily compare across different evaluation measures, since an evaluation measure with a small WCW is usually more desirable than one with a large WCW under the same condition;
• The WCW curves can visualise the differences among measures that practically matter.
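A sketch of this loop, reusing the hypothetical n_anova function from Section 3 (the δ grid is our choice, and only the initial Eq. 2 estimate is used rather than the tool's exact output):

```python
import numpy as np

def wcw_curve(var: float, alpha: float = 0.05, beta: float = 0.20, m: int = 10):
    """(n, delta) pairs tracing a WCW curve for one evaluation measure:
    for each target worst-case CI width delta, the required topic set size."""
    deltas = np.arange(0.02, 0.51, 0.01)
    return [(n_anova(alpha, beta, m, var, d), d) for d in deltas]

# e.g. nDCG (.0441) vs. AP (.0824) from Table 1(b)
for name, var in [("nDCG", 0.0441), ("AP", 0.0824)]:
    print(name, wcw_curve(var)[:3])
```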
For example, from Figure 3(b), when the topic set size is n = 50, it is clear that the WCW of nDCG and that of Q are about the same (around 0.16), while those of AP and nERR are substantially larger (around 0.23). Similarly, from Figure 3(d), while it is clear that the standardised ("std-AB") measures have substantially lower WCW values than the unstandardised ones, the differences within the set of standardised measures are probably not of practical importance, as indicated by the near-perfect overlaps of the curves.

[Figure 3: WCW curves for 95% CIs.]

5 CONCLUSIONS AND FUTURE WORK

We proposed to evaluate evaluation measures by comparing the WCW for various topic set sizes, using an existing ANOVA-based tool instead of the less robust CI-based tool. We proved the relationship between these two topic set size design methods, and demonstrated the advantages of WCW curves over well-known methods such as the swap method and discriminative power. It is hoped that this method will supplement user-based studies of evaluation measures.


REFERENCES

[1] Paul D. Ellis. 2010. The Essential Guide to Effect Sizes. Cambridge University Press.
[2] Yasushi Nagata. 2003. How to Design the Sample Size (in Japanese). Asakura Shoten.
[3] Tetsuya Sakai. 2006. Evaluating Evaluation Metrics based on the Bootstrap. In Proceedings of ACM SIGIR 2006. 525–532.
[4] Tetsuya Sakai. 2007. Alternatives to Bpref. In Proceedings of ACM SIGIR 2007. 71–78.
[5] Tetsuya Sakai. 2016. The Effect of Score Standardisation on Topic Set Size Design. In Proceedings of AIRS 2016 (LNCS 9994). 16–28.
[6] Tetsuya Sakai. 2016. Topic Set Size Design. Information Retrieval Journal 19, 3 (2016), 256–283. http://link.springer.com/content/pdf/10.1007%2Fs10791-015-9273-z.pdf
[7] Mark Sanderson, Monica Lestari Paramita, Paul Clough, and Evangelos Kanoulas. 2010. Do User Preferences and Evaluation Measures Line Up?. In Proceedings of ACM SIGIR 2010. 555–562.
[8] Ellen M. Voorhees and Chris Buckley. 2002. The Effect of Topic Set Size on Retrieval Experiment Error. In Proceedings of ACM SIGIR 2002. 316–323.
[9] Emine Yilmaz, Javed A. Aslam, and Stephen Robertson. 2008. A New Rank Correlation Coefficient for Information Retrieval. In Proceedings of ACM SIGIR 2008. 587–594.
