<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Evaluating Evaluation Measures with Worst-Case Confidence Interval Widths</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tetsuya Sakai</string-name>
          <email>tetsuyasakai@acm.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Waseda University</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <fpage>16</fpage>
      <lpage>19</lpage>
      <abstract>
        <p>IR evaluation measures are often compared in terms of rank correlation between two system rankings, agreement with the users' preferences, the swap method, and discriminative power. While we view the agreement with real users as the most important, this paper proposes to use the Worst-case Confidence interval Width (WCW) curves to supplement it in test-collection environments. WCW is the worst-case width of a confidence interval (CI) for the difference between any two systems, given a topic set size. We argue that WCW curves are more useful than the swap method and discriminative power, since they provide a statistically well-founded overview of the comparison of measures over various topic set sizes, and visualise what levels of differences across measures might be of practical importance. First, we prove that Sakai's ANOVA-based topic set size design tool can be used for discussing WCW instead of his CI-based tool that cannot handle large topic set sizes. We then provide some case studies of evaluating evaluation measures using WCW curves based on the ANOVA-based tool, using data from TREC and NTCIR.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>CCS CONCEPTS</title>
      <p>• Information systems → Retrieval effectiveness</p>
      <p>KEYWORDS: ANOVA; confidence intervals; effect sizes; evaluation measures; p-values; sample sizes; statistical significance</p>
    </sec>
    <sec id="sec-2">
      <title>1 INTRODUCTION</title>
      <p>IR systems are built to satisfy users’ information needs, but it is not
practical to make the users evaluate the systems all the time for
the purpose of improving them—that would annoy the users, not
satisfy them! Hence, we often turn to IR evaluation measures in
laboratory experiments. But which IR measures are good?</p>
      <p>
        In laboratory studies, evaluation measures are often compared
in terms of rank correlation between two system rankings (e.g. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]),
agreement with the users’ document preferences (e.g. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]), the swap
method (e.g. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]), and discriminative power (e.g. [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]). Since IR
evaluation measures are often regarded as surrogates of user
satisfaction or user performance measurements, we view the agreement
with users as the most important, although it needs to be said that
user preference studies often use hired assessors such as crowd
workers instead of real users with an information need. Moreover,
studies involving human assessors obviously incur costs.
      </p>
      <p>
        To supplement user-based studies of IR evaluation measures,
we propose to use Worst-case Confidence interval Width (WCW)
curves in test-collection environments. WCW is the worst-case
width of a confidence interval (CI) for the difference between any
two systems, given a topic set size. We argue that WCW curves
are more useful than the swap method and discriminative power,
since they provide a statistically well-founded overview of the
comparison of measures over various topic set sizes, and visualise
what levels of differences across measures might be of practical
importance. To this end, we leverage one of the publicly available
topic set size design Excel tools of Sakai [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. First, we prove that
Sakai’s ANOVA-based topic set size design tool
(http://www.f.waseda.jp/tetsuya/CIKM2014/samplesizeANOVA.xlsx) can be used for
discussing WCW instead of his CI-based tool
(http://www.f.waseda.jp/tetsuya/FIT2014/samplesizeCI.xlsx), which cannot handle
large topic set sizes (see Section 2). We then provide some case
studies of evaluating evaluation measures using WCW curves based
on the ANOVA-based tool, using data from TREC and NTCIR.
      </p>
    </sec>
    <sec id="sec-3">
      <title>2 PRIOR ART IN EVALUATING EVALUATION MEASURES</title>
      <p>
        When a new IR evaluation measure is invented, a system ranking
according to this measure (averaged over a set of topics) is often
compared with another according to a well-established measure;
rank correlation measures such as Kendall’s τ or the top-heavy
τ_ap [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] are often used to quantify the similarity between two
rankings. However, this approach cannot tell us whether a measure is
good or bad, due to the lack of a “correct” system ranking. It merely
tells us whether a new measure is similar to an existing one or not;
it only serves as a sanity check.
      </p>
      <p>For a given query, a user sees two Search Engine Result Pages
(SERPs) side by side, and says that SERP1 is better than SERP2
(“SERP1 &gt; SERP2”). If an evaluation measure also says “SERP1 &gt;
SERP2,” this is a preference agreement; if it says “SERP1 &lt; SERP2,”
this is a preference disagreement. We can count the number of
agreements over different queries and SERP pairs, and use it for
comparing the “goodness” of evaluation measures. In practice, this
approach also has a few limitations: (a) the judges employed in
the preference assessments are often not real search engine users
with an information need; (b) human assessments can be unreliable
and/or inconsistent; and (c) hiring judges comes at a cost, no matter
how small.</p>
      <p>
        The swap method [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] may be used to measure the consistency (i.e.,
“preference agreement with itself”) of evaluation measures across
different topic sets. Given a set of n topics, the set is split in half, and
the number of inconsistent preferences (e.g., SERP1 &gt; SERP2 with
the first half but SERP1 &lt; SERP2 with the second half) is counted,
using different systems and different splits. As this method can
only consider half the original topic set size, Voorhees and Buckley
used a simple extrapolation method to estimate what will happen
for topic set sizes larger than n. However, estimating the swap
rate for (say) n = 100 topics based on observations with (say)
n = 10, 25, 50 topics may not be reliable. To directly consider the
size n, bootstrap samples [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] can be used to replace the
sampling-without-replacement approach of Voorhees and Buckley, but this
method cannot consider topic set sizes larger than n either.
      </p>
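      <p>To make the half-split check concrete, the following is a minimal sketch in Python; it is our own simplification (the function and variable names are ours), not Voorhees and Buckley’s exact procedure, which additionally extrapolates to larger topic set sizes.</p>

```python
import random

def swap_rate(scores, trials=200, seed=0):
    """Estimate the swap rate of an evaluation measure.

    scores maps each system name to its list of per-topic scores.
    For each random half-split of the topic set, every system pair
    whose ordering by total score flips between the two halves is
    counted as one inconsistent ("swapped") preference.
    """
    rng = random.Random(seed)
    systems = list(scores)
    n = len(scores[systems[0]])
    swaps = comparisons = 0
    for _ in range(trials):
        topics = list(range(n))
        rng.shuffle(topics)
        half1, half2 = topics[: n // 2], topics[n // 2 :]
        for i in range(len(systems)):
            for j in range(i + 1, len(systems)):
                a, b = scores[systems[i]], scores[systems[j]]
                d1 = sum(a[t] - b[t] for t in half1)
                d2 = sum(a[t] - b[t] for t in half2)
                if d1 * d2 < 0:  # the two halves disagree on the ordering
                    swaps += 1
                comparisons += 1
    return swaps / comparisons
```

      <p>If one system strictly dominates another on every topic, no split can flip their ordering, so the estimated swap rate is zero.</p>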
      <p>
        Given a set of runs and an evaluation measure, a p-value can be
obtained for every system pair using an appropriate significance
test, and the sorted p-values can be plotted against the system
pairs [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]: this is called the discriminative power curve. While
highly discriminative measures are useful in the sense that they can
obtain more statistically significant results in a given environment
with exactly n topics, discriminative power does not provide a
view over different choices of topics. Moreover, it is not clear, for
example, whether a measure with 90% discriminative power should actually
be preferred over one with 80% discriminative power.
      </p>
      <p>
        Sakai [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] released three Excel tools based on topic set size design,
which determines the number of topics n to create for a new test
collection given a set of statistical requirements. His ANOVA-based
tool takes the following as input: α (Type I error probability), β
(Type II error probability), m (the number of systems to be compared
in one-way ANOVA), σ̂² (an estimate of the within-system variance
for a particular evaluation measure), and minD (minimum detectable
range); the tool returns the topic set size n that ensures 100(1 − β)%
statistical power whenever the true difference between the best and
the worst among the m systems is minD or larger. In contrast, his
CI-based tool takes the following as input: α, σ̂_t² (an estimate of the
variance of the between-system differences in terms of a particular
evaluation measure), and δ, which is exactly what we call WCW
in this study; the tool returns the topic set size n that ensures that
the width of the 100(1 − α)% CI for any system pair is no larger
than δ. Following Sakai, we simply let σ̂_t² = 2σ̂² for any evaluation
measure.
      </p>
      <p>
        While the relationship between minD for ANOVA and n can be
plotted for different evaluation measures, this seems problematic as
a way to compare evaluation measures, since, for example, a minD
of 0.1 in terms of one measure is not equivalent to a minD of 0.1 in
terms of another. In contrast, if we plot δ against n, this is probably
a more valid comparison since, at least for any normalised measures
that lie in the [0, 1] score range, we usually want the CI width to be
as small as possible. This is why we propose to plot δ against topic
set sizes to compare different measures. However, Sakai’s CI-based
tool cannot handle large topic set sizes: the limitation of his
CI-based tool is due to that of Excel’s GAMMA function: GAMMA(172) is
greater than 10^307 and cannot be computed [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Hence, we start by
proving that his ANOVA-based tool can be used instead of the less
robust CI-based one, for IR researchers to compare the statistical
reliability of evaluation measures based on WCW.
      </p>
    </sec>
    <sec id="sec-5">
      <title>3 PROOF THAT ANOVA-BASED TOPIC SET SIZE DESIGN CAN BE USED INSTEAD OF CI-BASED ONE</title>
      <p>
        According to Sakai’s CI-based topic set size design, the initial topic
set size estimate for ensuring that the CI width for the difference in
means for any two systems is no larger than δ (&gt; 0) is given by [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]:
n_CI = 4{z_inv(α/2)}² σ̂_t² / δ² = 4{z_inv(α/2)}² (2σ̂²) / δ² ,   (1)
where z_inv(P) is the upper z-value for probability P
(NORM.S.INV(1 − P) in Microsoft Excel). Subsequently,
this estimate is incremented until it actually satisfies the
requirement (α, δ). Thus, while the actual CI relies on a t-distribution, the
method starts off with a standard normal distribution by assuming
that the variance estimate σ̂_t² is perfectly accurate (replacing the
true population variance of a standard normal distribution with a
sample variance constitutes the very definition of a t-distribution).
This is why Eq. 1 involves a z-value rather than a t-value.
      </p>
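      <p>As a minimal illustration of Eq. 1 (a Python sketch; the function name is ours, and the standard library’s normal quantile stands in for Excel’s NORM.S.INV):</p>

```python
import math
from statistics import NormalDist

def n_ci_initial(alpha, var_hat, delta):
    """Initial topic set size estimate of Eq. 1.

    var_hat is the within-system variance estimate (sigma-hat squared);
    following Sakai, the variance of the between-system differences is
    taken to be sigma_t^2 = 2 * var_hat. delta is the target CI width.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)  # upper z-value z_inv(alpha/2)
    return math.ceil(4 * z * z * (2 * var_hat) / delta ** 2)
```

      <p>For example, with α = 0.05, σ̂² = 0.05 and δ = 0.1 this initial estimate is 154 topics; the actual tool then increments the estimate until the (α, δ) requirement is satisfied under the t-distribution.</p>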
      <p>
        Meanwhile, according to Sakai’s ANOVA-based topic set size
design, the initial topic set size estimate for ensuring 100(1 − β)%
statistical power whenever the true difference between the best
and the worst systems is minD or larger is given by [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]:
n_ANOVA = 2σ̂²λ / minD² ,   (2)
where λ is a noncentrality parameter of a noncentral χ² distribution
with φ = m − 1 degrees of freedom; as discussed below, linear
formulae are available for estimating λ from φ [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. As Eq. 2 is based
on a series of approximations, n_ANOVA is then incremented until it
actually satisfies the requirement (α, β, minD, m).
      </p>
      <p>
        Sakai [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] observed that, for the data he considered, “the topic
set size required based on the CI-based design with α = 0.05 and
δ = c is almost the same as the topic set size required based on the
ANOVA-based design with (α, β, m) = (0.05, 0.20, 10) and minD = c,
for any c.” We analytically explain and generalise his observation
as follows. From Eqs. 1 and 2, we have:
      </p>
      <p>
n_ANOVA / n_CI = λδ² / (4{z_inv(α/2)}² minD²) = (λ / 4{z_inv(α/2)}²) (δ / minD)² .   (3)
Here, note that 4{z_inv(α/2)}² is a constant for a given α; also, λ is
a constant given α, β and m. Figure 1 visualises the relationship
between the two constants for α = 0.01, 0.05 and β = 0.10, 0.20,
while varying the number of systems m. The linear formulae for
approximating λ based on φ = m − 1 [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] are provided in the bottom
half of the figure. Figure 1 shows that
λ ≈ 4{z_inv(α/2)}²   (4)
holds when:
      </p>
      <p>Condition (a) α = 0.05, β = 0.20, m = 10; or
Condition (b) α = 0.05, β = 0.10, m = 5; or
Condition (c) α = 0.01, β = 0.20, m = 18; or
Condition (d) α = 0.01, β = 0.10, m = 10.</p>
      <p>Hence, whenever one of the above four conditions holds true, then
from Eqs. 3 and 4 we obtain:
n_ANOVA / n_CI ≈ (δ / minD)² .   (5)
Thus, when one of the above four conditions holds, by letting δ =
minD in Eq. 5 we obtain n_ANOVA / n_CI ≈ 1, that is, n_ANOVA ≈ n_CI,
regardless of the variance estimate σ̂². Q.E.D.</p>
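      <p>The constant in Eq. 4 is easy to verify numerically; a small Python sketch (the function name is ours):</p>

```python
from statistics import NormalDist

def ci_constant(alpha):
    """The constant 4 * {z_inv(alpha/2)}^2 appearing in Eqs. 3-5."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return 4 * z * z

# Under Conditions (a)-(d), lambda must roughly equal this constant:
# about 15.37 for alpha = 0.05, and about 26.54 for alpha = 0.01.
```

      <p>Condition (a), for instance, states that the λ obtained for (α, β, m) = (0.05, 0.20, 10) approximately equals 15.37.</p>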
      <p>
        Henceforth, we only consider the popular Cohen’s five-eighty
convention [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], i.e., (α, β) = (0.05, 0.20) (note that “eighty” refers
to the statistical power, 100(1 − β)%), and leverage Condition (a)
mentioned above. Figure 2 compares, for different and quite extreme
values of the variance estimate σ̂², the topic set size curve using
the CI-based tool with α = 0.05 and one using the ANOVA-based
tool with α = 0.05, β = 0.20, m = 10. Due to the aforementioned
limitation of the CI-based tool, it was not possible to obtain the
entire curves with this tool. On the other hand, it is clear that
the ANOVA-based curves can serve as highly accurate surrogates
for the CI-based curves and can handle large topic set sizes. In
summary, to discuss WCW, we can always use the more robust
ANOVA-based tool and treat the minD values as if they are δ values.
      </p>
    </sec>
    <sec id="sec-8">
      <title>4 WCW-BASED EVALUATION OF EVALUATION MEASURES: CASE STUDIES</title>
      <p>Having proven that the ANOVA-based tool can be used instead of
the less robust CI-based tool, we now demonstrate how different
evaluation measures can be compared using WCW curves obtained
with the ANOVA-based tool.</p>
      <p>
        Table 1 shows the variance estimates σ̂² of various evaluation
measures reported in the literature [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ]. For the purpose of the
present study, knowledge of each evaluation measure is not
necessary; the measures with the prefix “std-AB” denote standardised
versions of the original measures, where the raw score for each
topic is transformed based on a set of known systems, to absorb
the hardness of that topic as well as its variation across systems [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
Given a topic-by-run score matrix for a particular evaluation
measure, σ̂² can easily be obtained as the residual variance of ANOVA.
While some evaluation measures are substantially less stable across
topics than others (e.g., compare nERR and nDCG in Table 1(a)), it
is not clear just from this table how such differences will actually
impact our evaluation results.
      </p>
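      <p>As a sketch of how σ̂² can be computed from a topic-by-run score matrix, the following Python function (our own illustration, assuming two-way ANOVA without replication, i.e., topic and run main effects) returns the residual variance:</p>

```python
def residual_variance(scores):
    """Residual variance of two-way ANOVA without replication.

    scores[i][j] is the score of run j on topic i. The residual sum
    of squares removes the topic and run main effects and is divided
    by its (n - 1)(m - 1) degrees of freedom.
    """
    n, m = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * m)
    topic_mean = [sum(row) / m for row in scores]
    run_mean = [sum(scores[i][j] for i in range(n)) / n for j in range(m)]
    ss_res = sum(
        (scores[i][j] - topic_mean[i] - run_mean[j] + grand) ** 2
        for i in range(n)
        for j in range(m)
    )
    return ss_res / ((n - 1) * (m - 1))
```

      <p>A purely additive matrix (each score being a topic effect plus a run effect) has zero residual variance; any interaction or noise makes it positive.</p>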
      <p>Figure 3 shows the WCW curves that correspond to the variances
shown in Table 1, for α = 0.05, i.e., 95% CIs. For each evaluation
measure, δ is plotted against the required topic set size n; the
curve was obtained by entering different values of minD (i.e., δ)
into the ANOVA-based tool (with α = 0.05, β = 0.20, m = 10)
and recording the resultant n. The advantages of the proposed
WCW-based comparison of evaluation measures are as follows:
• Unlike discriminative power and the swap method, we can
easily consider a wide range of topic set sizes;
• For a particular topic set size, we can easily compare across
different evaluation measures, since an evaluation measure
with a small WCW is usually more desirable than one with
a large WCW under the same condition;
• The WCW curves can visualise the differences among
measures that practically matter.</p>
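      <p>For illustration, inverting Eq. 1 gives the closed-form approximation δ(n) = sqrt(4{z_inv(α/2)}² · 2σ̂² / n) to a WCW curve; the tool’s iterative adjustment makes the exact curve differ slightly, and the variance value below is hypothetical rather than taken from Table 1:</p>

```python
import math
from statistics import NormalDist

def wcw_approx(var_hat, n, alpha=0.05):
    """Approximate WCW (delta) for topic set size n, by inverting Eq. 1."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return math.sqrt(4 * z * z * 2 * var_hat / n)

# Trace an approximate WCW curve for a hypothetical variance estimate.
curve = [(n, round(wcw_approx(0.04, n), 3)) for n in (25, 50, 100, 200)]
```

      <p>Since δ(n) is proportional to 1/sqrt(n), quadrupling the topic set size halves the worst-case CI width.</p>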
      <p>For example, from Figure 3(b), when the topic set size is n = 50, it
is clear that the WCW of nDCG and that of Q are about the same
(around 0.16), while those of AP and nERR are substantially larger
(around 0.23). Similarly, from Figure 3(d), while it is clear that the
standardised (“std-AB”) measures have substantially lower WCW
values than the unstandardised ones, the differences within the set
of standardised measures are probably not of practical importance,
as indicated by the near-perfect overlaps of the curves.</p>
    </sec>
    <sec id="sec-10">
      <title>5 CONCLUSIONS AND FUTURE WORK</title>
      <p>We proposed to evaluate evaluation measures by comparing the
WCW for various topic set sizes, using an existing ANOVA-based
tool instead of the less robust CI-based tool. We proved the
relationship between these two topic set size design methods, and
demonstrated the advantages of WCW curves over well-known
methods such as the swap test and discriminative power. It is hoped
that this method will supplement user-based studies of evaluation
measures.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Paul D.</given-names>
            <surname>Ellis</surname>
          </string-name>
          .
          <year>2010</year>
          . The Essential Guide to Effect Sizes. Cambridge University Press.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Yasushi</given-names>
            <surname>Nagata</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>How to Design the Sample Size (in Japanese)</article-title>
          .
          <source>Asakura Shoten.</source>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Tetsuya</given-names>
            <surname>Sakai</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>Evaluating Evaluation Metrics based on the Bootstrap</article-title>
          .
          <source>In Proceedings of ACM SIGIR</source>
          <year>2006</year>
          .
          <volume>525</volume>
          -
          <fpage>532</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Tetsuya</given-names>
            <surname>Sakai</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Alternatives to Bpref</article-title>
          .
          <source>In Proceedings of ACM SIGIR</source>
          <year>2007</year>
          .
          <volume>71</volume>
          -
          <fpage>78</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Tetsuya</given-names>
            <surname>Sakai</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>The Effect of Score Standardisation on Topic Set Size Design</article-title>
          .
          <source>In Proceedings of AIRS 2016 (LNCS 9994)</source>
          .
          <fpage>16</fpage>
          -
          <lpage>28</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Tetsuya</given-names>
            <surname>Sakai</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Topic Set Size Design</article-title>
          .
          <source>Information Retrieval Journal</source>
          <volume>19</volume>
          ,
          <issue>3</issue>
          (
          <year>2016</year>
          ),
          <fpage>256</fpage>
          -
          <lpage>283</lpage>
          . hp://link.springer.com/content/pdf/10.1007%
          <fpage>2Fs10791</fpage>
          -
          <fpage>015</fpage>
          -9273-z.pdf
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Mark</given-names>
            <surname>Sanderson</surname>
          </string-name>
          , Monica Lestari Paramita, Paul Clough, and
          <string-name>
            <given-names>Evangelos</given-names>
            <surname>Kanoulas</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Do User Preferences and Evaluation Measures Line Up?</article-title>
          .
          <source>In Proceedings of ACM SIGIR</source>
          <year>2010</year>
          .
          <volume>555</volume>
          -
          <fpage>562</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Ellen M.</given-names>
            <surname>Voorhees</surname>
          </string-name>
          and
          <string-name>
            <given-names>Chris</given-names>
            <surname>Buckley</surname>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>The Effect of Topic Set Size on Retrieval Experiment Error</article-title>
          .
          <source>In Proceedings of ACM SIGIR</source>
          <year>2002</year>
          .
          <volume>316</volume>
          -
          <fpage>323</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Emine</given-names>
            <surname>Yilmaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Javed A.</given-names>
            <surname>Aslam</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Stephen</given-names>
            <surname>Robertson</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>A New Rank Correlation Coefficient for Information Retrieval</article-title>
          .
          <source>In Proceedings of ACM SIGIR</source>
          <year>2008</year>
          .
          <volume>587</volume>
          -
          <fpage>594</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>