<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A formal theory to determine scale properties of evaluation measures?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marco Ferrante</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicola Ferro</string-name>
          <email>ferro@dei.unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Silvia Pontarollo</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Information Engineering, University of Padua</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Mathematics, University of Padua</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Evaluation measures are the basis for quantifying the performance of information access systems and the way in which their values can be processed to perform statistical analyses depends on the scales on which these measures are de ned. For example, mean and variance should be computed only when relying on interval scales. We de ne a formal theory of evaluation measures, based on the representational theory of measurement, which allows us to determine whether and when measures are interval scales. We found that common set- based retrieval measures { namely Precision, Recall, and F-measure { always are interval scales in the case of binary relevance while this does not happen in the multi-graded relevance case. In the case of rank-based retrieval measures { namely AP, gRBP, DCG, and ERR { only gRBP is an interval scale when we choose a speci c value of the parameter p and de ne a speci c total order among systems while all the other measures are not interval scales.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        Measurement scales play a central role [
        <xref ref-type="bibr" rid="ref4 ref5">4,5</xref>
        ] since they determine the operations
that can be performed with the measured values and, as a consequence, the
statistical analyses that can be applied. Stevens [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] identi es four major types
of scales with increasing properties: (i) the nominal scale consists of discrete
unordered values, i.e. categories; (ii) the ordinal scale introduces a natural order
among the values; (iii) the interval scale preserves the equality of intervals or
di erences; and (iv) the ratio scale preserves the equality of ratios.
      </p>
      <p>
        In experimental evaluation we daily perform operations, such as computing
means and variances, which are also the basic \ingredients" of the more
sophisticated statistical signi cance tests we use to compare systems and assess their
di erences [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. However, all these operations can be performed only from interval
Copyright c 2019 for the individual papers by the papers' authors. Copying
permitted for private and academic purposes. This volume is published and copyrighted
by its editors. SEBD 2019, June 16-19, 2019, Castiglione della Pescaia, Italy.
? Extended abstract of [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
scales onwards but, due to our limited knowledge of evaluation measures, we do
not actually know which scales they rely on.
      </p>
      <p>
        This paper sets a theory of evaluation measures to formally investigate their
properties and to study whether and when they use an interval scale. We frame
our work within the representational theory of measurement [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], which is the
measurement theory adopted in both physical and social sciences. In particular,
we develop a fully comprehensive framework, comprising both set-based
measures { namely Precision, Recall and F-measure { dealing with unordered result
lists, and rank-based measures { namely, Average Precision (AP), Discounted
Cumulated Gain (DCG), Rank-Biased Precision (RBP), and Expected
Reciprocal Rank (ERR) { dealing with ranked result lists. Moreover, we consider both
binary relevance, i.e. when documents can be just relevant or not relevant, and
multi-graded relevance [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], i.e. when documents can have di erent degrees of
relevance, such as not relevant, partially relevant, and highly relevant.
      </p>
      <p>The paper is organized as follows. Section 2 introduces how to determine
when a measure is an interval scale, according to the representational theory of
measurement; Section 3 introduces our basic formalisms; Sections 4 and 5 discuss
set-based and rank-based evaluation measures, respectively; nally, Section 6
wraps up the discussion and outlooks some future work.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Representational Theory of Measurement</title>
      <p>Given E, a weakly ordered empirical structure is a pair (E; ) where, for every
a; b; c 2 E,
a
a
b or b
b and b
a;
c ) a</p>
      <p>c (transitivity).</p>
      <p>Note that if a; b 2 E are such that a b and b a, then we write a b and
we say that a and b are equivalent elements of E for . When the antisymmetric
relation holds, that is when a b and b a implies that a and b are the same
element (namely a = b), we talk about a total order.</p>
      <p>An interval on the empirical structure is an element (a; b) 2 E E and we
introduce a notion of di erence ab over intervals, to act as a signed distance
we exploit to compare intervals. Once we have a notion of di erence ab, we can
de ne a weak order d between the ab di erences and, consequently, among
intervals. We can proceed as follows: if two elements a; b 2 E are such that a b,
then the interval [a; b] is null and, consequently, we set ab d ba; if a b we
agree upon choosing aa d ab which, in turn implies that aa d ba, that
is there exist a kind of \zero" and the inverse with respect to this \zero".
De nition 1. Let E be a nite (not empty) set of objects. Let d be a binary
relation on E E that satis es, for each a; b; c; d; a0; b0; c0 2 E, the following
axioms:
i.
ii. if
d is weak order;
ab d cd, then
dc d</p>
      <p>ba;
iii. Weak Monotonicity: if
iv. Solvability Condition: if
such that ad0 d cd
ab</p>
      <p>d
d
aa
d00b:</p>
      <p>d
Then (E; d) is a di erence structure.</p>
      <p>a0b0 and
cd
d
bc d b0c0 then ac d a0c0 ;
ab; then there exists d0; d00 2 R</p>
      <p>The representation theorem for di erence structures states:
Theorem 1. Let E be a nite (not empty) set of objects and let (E; d) be a
di erence structure. Then, there exist an interval scale measurement M : E ! R
such that for every a; b; c; d 2 E
ab
d
cd , M(b)</p>
      <p>M(a)</p>
      <p>M(d)</p>
      <p>M(c) :</p>
      <p>This theorem ensures us that, if there is a di erence structure on the
empirical set E, then there exists an interval scale M over it.</p>
      <p>
        Therefore, to study whether evaluation measures are interval scales or not, [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
proceeded as follows:
1. De ne a total ordering among system runs, which allows us also to introduce
the notion of interval among runs;
2. Since this set is graded of a given rank n, there exists a unique rank function
which assigns a natural number to each run;
3. De ne the length of an interval as the natural distance ab := `(a; b) :=
`([a; b]) = (b) (a);
4. Check whether the set with the above natural length is a di erence structure
or not;
5. In this case we have a di erence structure and we can de ne an interval scale
      </p>
      <p>M as the rank function itself;
6. We can eventually check whether evaluation measures are a linear positive
transformation of this interval scale M and determine whether they are an
interval scale.
3</p>
    </sec>
    <sec id="sec-3">
      <title>Basic Formalism</title>
      <p>Let (REL, ) be a nite and totally ordered set of relevance degrees. We set
REL = fa0; a1; : : : ; acg with ai ai+1 for all i 2 f0; : : : ; c 1g; REL has a
minimum a0, called the \not relevant" relevance degree. Let us consider a nite
set of documents D and a set of topics T . For each pair (t; d) 2 T D, the
ground-truth GT is a map which assigns a relevance degree rel 2 REL to a
document d with respect to a topic t. Let N be a positive natural number called
the length of the run. We assume that all the runs have same length N , since this
is what typically happens in real evaluation settings when you compare systems.
We de ne D(N ) as the set of all the possible N retrieved documents.</p>
      <p>A run r : T ! D(N ) retrieves N documents belonging to D(N ) in response
to a topic t 2 T:</p>
      <p>Let R(N ) be the set of N judged documents, that is the set of all the N
possible combinations of relevance degrees.</p>
      <p>We call judged run of length N the function r^ from T D(N ) into R(N )
which assigns a relevance degree to each retrieved document, i.e. a judged run r^
is the application of the ground-truth GT function to each element of the run r.</p>
      <p>We de ne the gain function g : REL ! R+ as the map that assigns a
positive real number to any relevance degree. We set, without loss of generality,
g(a0) = 0 and we require g to be strictly increasing.</p>
      <p>We de ne the indicator function for the relevance degrees as a(aj ) =
j 8j 2 f0; : : : ; cg: Note that a is a particular gain function.</p>
      <p>Given the gain function g, the recall base RB : T ! R+ is the map de ned
as RB(t) = PjjD=j1 g(GT (t; dj )): In the binary relevance case when c = 1 and
REL = fa0; a1g, the gain function usually is g(a1) = a(a1) = 1 and RB
counts the total number of relevant documents for a topic.</p>
      <p>An evaluation measure is a function M : R(N ) ! R+ which maps a judged
run r^ into a positive real number which quanti es its e ectiveness. Note that
most of the evaluation measures are normalized and thus the co-domain is the
[0; 1] interval.
4</p>
    </sec>
    <sec id="sec-4">
      <title>Set-based Measures</title>
      <p>Let us start by introducing an order relation on the set of judged runs. Let
r^; s^ 2 R(N ) such that r^ 6= s^, and let k be the biggest relevance degree at which
the two runs di er for the rst time, i.e. k = maxfj c : fi : r^i = aj g 6= fi :
s^i = aj g g. We strictly order any pair of distinct system runs as follows
r^
s^ ,
fi : r^i = akg &lt; fi : s^i = akg :
(1)</p>
      <p>R(N ) is a totally ordered set with respect to the ordering de ned by (1).
As for any totally order set, R(N ) is a poset consisting of only one maximal
chain (the whole set); therefore it is graded of rank jR(N )j 1, where R(N ) =
N+c since it consists of all the N combinations of c + 1 = jRELj objects with
N
repetition. Since R(N ) is graded of rank jR(N )j 1, there exists a unique rank
function (r^) : R(N ) ! N such that (0^) = 0 and (s^) = (r^) + 1 if s^ covers r^:</p>
      <p>N
(r^) = X
j=1
where r^ = fr^1; : : : ; r^N g 2 R(N ) with r^i r^i+1 for any i &lt; N .</p>
      <p>The natural distance is then given by `(r^; s^) = (s^) (r^), for r^; s^ 2 R(N )
such that r^ s^; and we can de ne the di erence as r^s^ = `(r^; s^) if r^ s^,
otherwise r^s^ = `(s^; r^): (R(N ); d) is a di erence structure. Thus the rank
function is an interval scale and we are able to de ne a new measure that follows:
De nition 2. The Set-Based Total Order (SBTO) measure on (R(N ); d) is:</p>
      <p>N
SBTO(r^) = (r^) = X
j=1
This measure satis es the condition imposed by Theorem 1. Thus, SBTO is an
interval scale on (R(N ); d).</p>
      <p>Let us explore more deeply how the SBTO measure works. The rst
relevance degree immediately above not relevant, i.e. a1, always gives a constant
contribution, independently from how many a1 documents are retrieved, since:
a(a1) + N</p>
      <p>N j + 1
j
=
1 + N
N</p>
      <p>j
j + 1
= 1 :</p>
      <p>However, when we consider higher relevance degrees, i.e. ak with k &gt; 1, the
binomial coe cient strictly depends and changes on the basis of how many of
them are retrieved. Indeed, a(ak) is constant for all the documents with the
same relevance degree ak, but the term N j decreases as the number of ak
retrieved documents increases due to N being constant and j increasing, i.e. the
binomial coe cient is decreasing in the number of ak retrieved documents. In
other terms, each additional ak retrieved document gives a contribution smaller
than the previously retrieved ones by a discount factor j.
4.1</p>
      <p>Binary Relevance Case
When c = 1, i.e. in the binary relevance case, the ordering (1) just orders judged
runs by how many relevant documents they retrieve, i.e. by their total mass of
relevance:
r^
s^ ,</p>
      <p>N
X a(r^i)
i=1</p>
      <p>N
X a(s^i) ;
i=1
since there is only one relevant relevance degree a1.</p>
      <p>Therefore the rank function becomes</p>
      <p>N
(r^) = X a(r^i) = M(r^) :</p>
      <p>i=1
This follows easily from (3), using the fact that a(r^i) 2 f0; 1g for any i
when c = 1.</p>
      <p>Let now g be the gain function, and let us consider Precision
N
Precision (P)(r^) =
1 XN g(r^i)
N i=1 g(a1)
=</p>
      <p>N i=1
1 XN a(r^i) =</p>
      <p>M(r^)</p>
      <p>N
;
since g(a0) = 0 = a(a0) and c = 1. Thus Precision is an interval scale, as it is
a linear positive transformation of M.</p>
      <p>Similarly, Recall</p>
      <p>Recall (R)(r^) =
Neither Generalized Precision nor Generalized Recall are a positive linear
transformation of M de ned in (3). Indeed, in these measures, the individual
contribution of each retrieved document r^j is independent from the contribution of any
other retrieved document r^k. However, the previous discussion on the measure
de ned in (3) pointed out that, for each relevance degree ak with k &gt; 1, the
individual contribution of an ak retrieved document depends on how many ak
retrieved documents there are in the run. Therefore neither gP nor gR are an
interval scale, since they cannot be a linear transformation of M.</p>
      <p>Moreover they are not even an ordinal scale which, again, implies they cannot
be an interval scale too. Indeed, a measure M0 is an ordinal scale on R(N ) if, for
every r^; s^ 2 R(N ), the following statement is true:
r^
s^ , M0(r^)</p>
      <p>M0(s^) :
Let us consider r^ = fa1; : : : ; a1g and s^ = fa2; a0; : : : ; a0g, two runs of length
N . We have r^ s^. Moreover, since Generalized Recall (gR) and Generalized
Precision (gP) are both proportional to G(v^) := PiN=1 g(v^i)=g(ac), for any v^ 2
R(N ), we can prove that G( ) is not on an ordinal scale with respect to the
order (1). Since g(a0) = 0, G(r^) = N g(a1)=g(ac) while G(s^) = g(a2)=g(ac).
From the fact that the gain function g is a positive strictly increasing function
and it is de ned independently from the length N of the runs, by choosing a N
big enough we can have G(r^) &gt; G(s^).
5</p>
    </sec>
    <sec id="sec-5">
      <title>Rank-based Measures</title>
      <p>Top-heaviness is a central property in Information Retrieval (IR), stating that
the higher a system ranks relevant documents the better it is. If we apply this
property at each rank position and we take to extremes the importance of having
a relevant document ranked higher, we can de ne a strong top-heaviness property
which, in turn, will induce a total ordering among runs.</p>
      <p>
        Let r^; s^ 2 R(N ) such that r^ 6= s^, then there exists k = minfj
s^[j]g &lt; 1, and we order system runs as follows
r^
s^ , r^[k]
This order prefers a single relevant document ranked higher to any number of
relevant documents, with same relevance degree or higher, ranked just below it
(u^[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]; : : : ; u^[m]; a0; ac; : : : ; ac);
(u^[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]; : : : ; u^[m]; aj ; a0; : : : ; a0) ;
for any 1 j c; for any length N 2 N and any m 2 f0; 1; : : : ; N
why we call it strong top-heaviness.
      </p>
      <p>R(N ) is totally ordered with respect to
Therefore, there is a unique rank function
which is given by:
and is graded of rank (c + 1)N
: R(N ) ! f0; 1; : : : ; (c + 1)N</p>
      <p>1g. This is</p>
      <p>N
(r^) = X a(r^[i])(c + 1)N i;</p>
      <p>i=1
where a is the indicator function.</p>
      <p>
        Let us set a (r^) = ( a(r^[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]); : : : ; a(r^[N ])). If we look at a (r^) as a string,
the rank function is exactly the conversion in base 10 of the number in base c + 1
identi ed by a (r^) and the ordering among runs corresponds to the ordering
among numbers in base c + 1.
      </p>
      <p>The natural distance is then given by `(r^; s^) = (s^) (r^), for r^; s^ 2 R(N )
such that r^ s^; and we can de ne the di erence as r^s^ = `(r^; s^) if r^ s^,
otherwise r^s^ = `(s^; r^): (R(N ); d) is a di erence structure. As done before
in the set-based case, an interval scale measure on (R(N ); d) is given by the
rank function itself.</p>
      <p>De nition 3. The Rank-Based Total Order (RBTO) measure on (R(N ); d)
is:</p>
      <p>N
RBTO(r^) = (r^) = X a(r^[i])(c + 1)N i</p>
      <p>i=1
This measure satis es the condition imposed by Theorem 1. Thus, RBTO is an
interval scale on (R(N ); d).</p>
      <p>Let G = minj2f1;:::;cg(g(aj ) g(aj 1))=g(ac) &gt; 0 be the normalized smallest
gap between the gain of two consecutive relevance degrees.</p>
      <p>Graded Rank-Biased Precision (gRBP)p with p &gt; G=(G + 1) and other IR
measures { namely AP, DCG, and ERR { are not even an ordinal scale on
R(N ), as the following example shows. Let r^ = (a1; a0; a2; a0; a1) and s^ =
(a1; a1; a0; a0; a0) be two runs on R(5) with c = 2 and let us use the indicator
function as gain function g. We have r^ s^. Then DCG2(r^) = 1 + 2= log2 3 +
1= log2 5 &gt; 1 + 1 = DCG2(s^); ERR(r^) = 1=4 + 3=16 + 3=320 &gt; 1=4 + 3=32 =
ERR(s^); nally, since g(a2) = a(a2) = 2, 2gRBPp(r^) = (1 p)(1 + 2p2 + p4) &gt;
(1 p)(1 + p) = 2gRBP(s^) for p &amp; 0:454, and such an example can be found
for any other values of p &gt; G=(G + 1); where G = 1=2. AP is a binary measure
and, just to stay with the same data above, we adopt a lenient mapping of
multi-graded to binary relevance degrees setting g(a1) = g(a2) = 1 and thus
RB AP(r^) = 1 + 2=3 + 3=5 &gt; 1 + 1 = RB AP(s^), where RB is the recall base.</p>
      <p>As a consequence, being not an ordinal scale, gRBPp with p &gt; G=(G+1), AP,
DCG, and ERR cannot be an interval scale too, since an interval scale measure
is also an ordinal scale. gRBPp with p G=(G + 1) is interval if and only if
p = G=(1 + G) = (c + 1) 1 and the gain function is equal to g(ai) = K a(ai);
for any i 2 f0; : : : ; Ng and for any K &gt; 0 xed.
6</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion and Future Works</title>
      <p>We developed a theory of evaluation measures to explore whether and when both
set-based and rank-based IR measures are interval scales. This is a fundamental
question since the validity of the descriptive statistics, such as mean and variance,
and the statistical signi cance tests we daily use to compare IR systems depends
on its answer.</p>
      <p>We summarize here the main ndings of our framework:
{ set-based evaluation measures:
binary relevance: precision, recall, F-measure are interval scales;
multi-graded relevance: gP and gR are neither ordinal nor interval scales;
{ rank-based evaluation measures:
binary relevance: RBP is an interval scale only for p = 1=2 and it is an
ordinal scale for p &lt; 1=2; RBP for p &gt; 1=2 and AP are neither ordinal
nor interval scales;
multi-graded relevance: gRBP is an interval scale only for p = G=(G + 1)
and when the gain function is equal to g(ai) = K a(ai); gRBP is an
ordinal scale when p &lt; G=(G + 1); gRBP for p &gt; G=(G + 1), DCG, and
ERR are neither ordinal nor interval scales.</p>
      <p>As future work, we will investigate these new interval scale measures from an
experimental point of view, e.g. by performing an analysis of their robustness to
pool downsampling or of their discriminative power, as well as the exploration of
how they behave in statistical signi cance testing with respect to the traditional
evaluation measures which, instead, violate the interval scale assumption.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Carterette</surname>
            ,
            <given-names>B.A.</given-names>
          </string-name>
          :
          <article-title>Multiple Testing in Statistical Analysis of Systems-Based Information Retrieval Experiments</article-title>
          .
          <source>ACM TOIS 30(1)</source>
          , 4:
          <issue>1</issue>
          {4:
          <issue>34</issue>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Ferrante</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pontarollo</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>A General Theory of IR Evaluation Measures</article-title>
          .
          <source>IEEE TKDE 31(3)</source>
          ,
          <volume>409</volume>
          {422 (March
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. Kekalainen, J., Jarvelin, K.:
          <article-title>Using Graded Relevance Assessments in IR Evaluation</article-title>
          .
          <source>JASIST</source>
          <volume>53</volume>
          (
          <issue>13</issue>
          ),
          <volume>1120</volume>
          {1129 (November
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Krantz</surname>
            ,
            <given-names>D.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luce</surname>
            ,
            <given-names>R.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suppes</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tversky</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <source>Foundations of Measurement. Additive and Polynomial Representations</source>
          , vol.
          <volume>1</volume>
          . Academic Press, USA (
          <year>1971</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Stevens</surname>
            ,
            <given-names>S.S.</given-names>
          </string-name>
          :
          <source>On the Theory of Scales of Measurement. Science, New Series</source>
          <volume>103</volume>
          (
          <issue>2684</issue>
          ),
          <volume>677</volume>
          {680 (
          <year>June 1946</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>