An interval-like scale property for IR evaluation measures
                 Marco Ferrante                                             Nicola Ferro                                    Silvia Pontarollo
                Dept. Mathematics                                Dept. Information Engineering                            Dept. Mathematics
            University of Padua, Italy                             University of Padua, Italy                           University of Padua, Italy
             ferrante@math.unipd.it                                    ferro@dei.unipd.it                               spontaro@math.unipd.it

ABSTRACT                                                                                   an interval scale is somehow arbitrary and, as a consequence, also
Evaluation measures play an important role in IR experimental                              some of the descriptive statistics you compute about it.
evaluation and their properties determine the kind of statistical                              Therefore, researchers started to study what IR effectiveness
analyses we can conduct.                                                                   measures are, not only from an empirical perspective, e.g., [4, 5, 19],
   It has been previously shown that it is questionable that IR ef-                        but also from a theoretical one, e.g., [1–3, 6, 10, 22, 26].
fectiveness measures are on an interval-scale and this implies that                            In this paper, we stem from the recent work of [11] and we
computing means and variances is not a permissible operation.                              move a step forward in understanding when and to what extent IR
   In this paper, we investigate whether it is possible to relax a bit                     effectiveness measures are on an interval scale.
the definition of interval scale, introducing the notion of interval-                          [11] investigated whether IR effectiveness measures are on an
like scale, and to what extent IR effectiveness measures comply                            interval scale in the perspective of the representational theory of
with this relaxed definition.                                                              measurement [15], which is the measurement theory adopted in
                                                                                           both physical and social sciences. According to this framework, the
CCS CONCEPTS                                                                               key point is to understand how real world objects, i.e., system runs
                                                                                           in our case, are related to each other since measure properties are
•Information systems → Retrieval effectiveness;
                                                                                           then derived from these relations. Moreover, it is important that
                                                                                           these relations among real world objects are intuitive and sensible
KEYWORDS                                                                                   to “everybody” and that they can be commonly agreed on.
evaluation measures; representational theory of measurement; in-                               Therefore, [11] pointed out that the main issues in determining
terval scale                                                                               the scale of IR effectiveness measures are: (i) to understand how
                                                                                           runs are empirically and intuitively ordered; (ii) to define what
                                                                                           an interval of runs is; and, (iii) to determine how these intervals
1    INTRODUCTION                                                                          are ordered. Once you settled all these aspects, you can check
Evaluation plays a central role in Information Retrieval (IR) and a                        whether an effectiveness measure comply with them or not and thus
lot of attention is devoted to improving our evaluation method-                            determine whether it is on an interval scale or not. In particular,
ologies and practices. For example, since many years, there is a                           [11] found that under a strong top-heaviness notion of ordering
continued interest on how to properly apply statistical techniques                         among runs, only Rank-Biased Precision (RBP) [16] with p = 12 is on
to the analysis of IR experimental data, e.g., on the appropriate use                      an interval scale while RBP for other values of p and other popular
of statistical testing [7, 13, 20, 23], on the normalization of measure                    measures – namely AP, Discounted Cumulated Gain (DCG) [14],
values for cross-collection comparison [27], or on moving towards                          and Expected Reciprocal Rank (ERR) [9] – are not. Moreover, using
Bayesian inference [8, 21], just to name a few.                                            a weak top-heaviness notion of ordering among runs, [11] found
   However, all these studies rely on some, often hidden and implicit,                     that all the previously mentioned IR effectiveness measures are not
assumptions on what IR effectiveness measures are. In particular,                          on an interval scale.
measurement scales [15, 25] determine the operations that is admis-                            Strong top-heaviness provides us with a total ordering among
sible to perform with measure values and, as a consequence, the                            runs and, as discussed above, there is at least one case of IR mea-
statistical analyses that can be applied. [25] identifies four major                       sure on an interval scale; however, the way in which strong top-
types of scales with increasing properties: (i) the nominal scale                          heaviness orders runs may give raise to disagreement or corner
consists of discrete unordered values, i.e. categories; (ii) the ordinal                   cases. For example, strong top-heaviness ranks the run (1, 0, 0, 0)
scale introduces a natural order among the values; (iii) the inter-                        with just one top relevant document before the run (0, 1, 1, 1) with
val scale preserves the equality of intervals or differences; and (iv)                     all relevant documents except for the first position; thus, there
the ratio scale preserves the equality of ratios. Operations such as                       might be disagreement on whether this is an appropriate ordering
computing the mean or the variance are possible just on interval                           for these runs. On the other hand, weak top-heaviness provides
and ratio scales and they constitute the basis of many of the statis-                      us with a much more intuitive partial ordering based on two basic
tical techniques mentioned above. However, are we sure that IR                             operations – swapping two consecutive documents in a ranking
effectiveness measures are on an interval scale? For example, [17]                         and replacing a not relevant document with a relevant one [10];
points out that the assumption of Average Precision (AP) being on                          however, none of the IR evaluation measures is on interval-scale
                                                                                           using weak top-heaviness.
Copying permitted for private and academic purposes.                                           The problem with IR effectiveness measures emerging from [11]
8th International Workshop on Evaluating Information Access (EVIA 2017), co-located        is two-fold: on the one side, both strong and weak top-heaviness
with NTCIR-13, 5 December 2017, Tokyo, Japan.
© 2017 Copyright held by the author.
                                                                                           create equi-spaced intervals of runs, as expected by the definition


                                                                                      10
of interval scale, but IR effectiveness measures do not respect this                          Given (E, ), we have to define a difference ∆ab between two
equi-spacing; on the other side, both strong and weak top-heaviness                        elements a, b ∈ E, which is a kind of signed distance we exploit
do not account enough for the importance and the effect of the                             to compare intervals. Then, we have to define a weak order d
rank of a document in a run, since they both rely on the notion of                         between these ∆ab differences. We can proceed as follows: if two
natural distance in a poset (partially ordered set) [24] which flattens                    elements a, b ∈ E are such that a ∼ b, i.e. a  b and b  a, then
things too much, shrinking everything into a single number.                                the interval [a, b] is null and, consequently, we set ∆ab ∼d ∆ba ; if
   In this paper, we take a different approach to the ordering of in-                      a  b we agree upon choosing ∆aa ≺d ∆ab which, in turn implies
tervals of runs, not based on single numbers, as the natural distance                      that ∆aa d ∆ba .
of [11] does, but using vectors instead. This new ordering is richer
                                                                                           Definition 1. Let E be a finite (not empty) set of objects. Let d be a
and more expressive than that induced by the natural distances in
                                                                                           binary relation on E ×E that satisfies, for each a, b, c, d, a 0, b 0, c 0 ∈ E,
the strong and weak top-heaviness cases and allows us to introduce
                                                                                           the following axioms:
the notion of interval-like scale, i.e., something richer than an ordi-
nal scale but a bit less powerful than an interval scale, since runs                              i. d is weak order;
are ordered, intervals of runs are ordered too but intervals may                                 ii. if ∆ab d ∆cd , then ∆dc d ∆ba ;
not be equi-spaced. In particular, we find that, under reasonable                               iii. if ∆ab d ∆a 0b 0 and ∆bc d ∆b 0c 0 then ∆ac d ∆a 0c 0 ;
assumptions, DCG and RBP are on a interval-like scale while AP                                  iv. Solvability Condition: if ∆aa d ∆cd d ∆ab , then there
and ERR are not.                                                                                     exists d 0, d 00 ∈ E such that ∆ad 0 ∼d ∆cd ∼d ∆d 00b .
   The paper is organized as follows: Section 2 recaps some basic                          Then (E, d ) is a difference structure.
concepts about the representational theory of measurement and
                                                                                              Particular attention has to be paid to the Solvability Condition
posets; Section 3 deals with interval-like scales; finally, Section 4
                                                                                           which ensures the existence of an equally spaced gradation be-
wraps up the discussion and outlooks some future work.
                                                                                           tween the elements of E, indispensable to construct an interval
                                                                                           scale measurement.
2 BACKGROUND                                                                                  The representation theorem for difference structures states:
2.1 Representational Theory of Measurement                                                    Theorem 1. Let E be a finite (not empty) set of objects and let
A relational structure [15, 18] is an ordered pair X = X , R X of                          (E, d ) be a difference structure. Then there exist a measurement scale
a domain set X and a set of relations R X on X , where the relations in                    M : E → R such that for every a, b, c, d ∈ E
R X may have different arities, i.e. they can be unary, binary, ternary                                ∆ab d ∆cd ⇔ M(a) − M(b) ≤ M(c) − M(d) .
relations and so on. Given two relational structures X and Y, a
homomorphism M : X → Y from X to Y is a mapping M = M, MR                                     This theorem ensures us that, if there is a difference structure
where: (i) M is a function that maps X into M(X ) ⊆ Y , i.e. for each                      on the empirical set E, then there exists an interval scale M.
element of the domain set there exists one corresponding image                                As anticipated in Section 1, we will introduce the notion of
element; (ii) MR is a function that maps R X into MR (R X ) ⊆ RY such                      interval-like scale which corresponds to removing the solvability
that ∀r ∈ R X , r and MR (r ) have the same arity, i.e. for each relation                  condition from the definition of difference structure and obtaining
on the domain set there exists one (and it is usually, and often                           a new partial ordering of the intervals of runs.
implicitly, assumed: and only one) corresponding image        relation;
(iii) ∀r ∈ R X , ∀x i ∈ X , if r (x 1 , . . . , x n ) then MR (r ) M(x 1 ), . . . ,        2.3     Posets
                                                                                          A partially ordered set P, poset for short, is a set with a partial order
M(x n ) , i.e. if a relation holds for some elements of the domain set                      defined on it [24]. A partial order  is a binary relation over P
then the image relation must hold for the image elements.                                  which is reflexive, antisymmetric and transitive. Given s, t ∈ P, we
    A relational structure E is called empirical if its domain set E                       say that s and t are comparable if s  t or t  s, otherwise they are
spans over the entities under consideration in the real world, i.e. the                    incomparable.
system runs in our case; a relational structure S is called symbolic                           A closed interval is a subset of P defined as [s, t] B {u ∈ P : s 
if its domain set S spans over a given set of numbers. A measure-                          u  t }, where s, t ∈ P and s  t. Moreover we say that t covers s
ment (scale) is the homomorphism M = M, MR from the real                                   if s  t and [s, t] = {s, t }, that is there does not exist u ∈ P such
world to the symbolic world and a measure is the number assigned                           that s ≺ u ≺ t .
to an entity by this mapping.                                                                  We can represent a finite poset P by using the Hasse diagram
                                                                                           which is a graph where vertices are the elements of P, edges rep-
2.2     Measurement Scales                                                                 resent the covers relations, and if s ≺ t then s is below t in the
                                                                                           diagram.
[11] relied on the notion of difference structure [15, 18] to introduce
                                                                                               A subset C of a poset P is a chain if any two elements of C are
a definition of interval among system runs in such a way that it
                                                                                           comparable: a chain is a totally ordered subset of a poset. If C is a
ensures the existence of an interval scale.
                                                                                           finite chain, the length of C, `(C), is defined by `(C) = |C | − 1. A
   Given E, a weakly ordered empirical structure is a pair (E, )
                                                                                           maximal chain of P is a chain that is not a proper subset of any
where, for every a, b, c ∈ E,
                                                                                           other chain of P.
       • a  b or b  a;                                                                       If every maximal chain of P has the same length n, we say that
       • a  b and b  c ⇒ a  c.                                                          P is graded of rank n; in particular there exists a unique function


                                                                                      11
ρ : P → {0, 1, . . . , n}, called the rank function, such that ρ(s) = 0,                   which, as discussed in Section 1, has the drawback of flattening
if s is a minimal element of P, and ρ(t) = ρ(s) + 1, if t covers s.                        everything into a single number.
    Finally, since any interval on a graded poset is graded, the length                       To define the length of an interval we adopt the following strat-
of an interval [s, t] is given by `(s, t) B `([s, t]) = ρ(t) − ρ(s), also                  egy: given rˆ, ŝ ∈ REL N with rˆ  ŝ, we count how many replace-
called the natural distance.                                                               ments in the last position and how many forward single-step swaps
                                                                                           at each depth are necessary to go from rˆ to ŝ following a maximal
3 INTERVAL-LIKE SCALES                                                                     chain in REL N . In order to do this, it is useful to define the cu-
3.1 Preliminary Definitions                                                                mulative sums of a vector v = (v[1], . . . , v[N ]), denoted using the
                                                                                                                                                       Íj
                                                                                           capital letter as V = (V [1], . . . , V [N ]), where V [j] = i=1 v[i].
Given N , the length of the run, we define the set of retrieved
                                                                                              Let us start with a simple example.
documents as D(N ) = {(d 1 , . . . , d N ) : di ∈ D, di , d j for any i ,
j}, i.e. the ranked list of retrieved documents without duplicates, and                    Example. Consider the two judged runs in REL4
                                                            Ð |D |
the universe set of retrieved documents as D := N =1 D(N ).                                                        rˆ = (0, 1, 1, 0) ,
A run r t , retrieving a ranked list of documents D(N ) in response
                                                                                                                       0̂ = (0, 0, 0, 0) .
to a topic t ∈ T , is a function from T into D
                                                                                           Since 0̂ ≺ rˆ, in order to construct a chain from 0̂ to rˆ with the
                           t 7→ r t = (d 1 , . . . , d N )
                                                                                           two basic operators (replacement in last position and single-step
We denote by r t [j] the j-th element of the vector r t ,                                  forward swap) we get
i.e. r t [j] = d j .
                                                                                                                     0̂ = (0, 0, 0, 0) ,
    We define the universe set of judged documents as R :=
Ð |D |         N                    N                                                                               0̂1 = (0, 0, 0, 1) ,
   N =1 REL , where REL is the set of the ranked lists of judged
retrieved documents with length fixed to N . Since in our case                                                      0̂2 = (0, 0, 1, 0) ,
REL = {0, 1}, REL N = {0, 1} N refers to the space of all N −length
                                                                                                                    0̂4 = (0, 1, 0, 0) ,
vectors consisting of 0 and 1. As for the set-based case, we denote
by RBt the recall base, i.e. the total number of relevant documents                                                 0̂5 = (0, 1, 0, 1) ,
for a topic.                                                                                                        0̂6 = (0, 1, 1, 0) = rˆ .
    We call judged run the function rˆt from T × D into R, which
assigns a relevance degree to each retrieved document in the ranked                        We have made two replacement in the fourth position, one swap
list                                                                                       in the second position and two in the third one. Recall that with
                   (t, r t ) 7→ rˆt = GT (t, d 1 ), . . . , GT (t, d N )
                                                                                          swap at depth i we mean that a forward swap from position i − 1
                                                                                           to position i was done. We can count how many of these basic
We denote by rˆt [j] the j-th element of the vector rˆt , i.e. rˆt [j] =                   operations in each position are needed to go from 0̂ to rˆ just taking
GT (t, d j ).                                                                              the cumulative sums of rˆ. Indeed we get
    As for the set-based case, we can simplify the               notation omitting
the dependence on topics, rˆ B rˆ[1], . . . , rˆ[N ] , RB, and so on.                                                  R̂ = (0, 1, 2, 2) ,
                                                                                           and each entry k < D of R̂, R̂[k], counts the number of swaps made
3.2     Ordering between Intervals                                                         in position k, while R̂[N ] counts the number of replacement, i.e.
Let us start recalling the ordering between runs adopted in this                           the total mass of rˆ, to go from 0̂ to rˆ.
paper and based on the following two monotonicity-like properties
                                                                                              More generally, given two vectors rˆ, ŝ ∈ REL N , with rˆ ≺ ŝ,
proposed by [10]:
                                                                                           in order to collect the number of basic operations made at each
      • Replacement A measure of retrieval effectiveness should                            position to go from rˆ to ŝ, we can compute this vector of length N
        not decrease when replacing a document with another one                            first between 0̂ and rˆ and between 0̂ and ŝ, namely R̂ and S, ˆ and
        in the same rank position with higher degree of relevance.                         then subtract the two vectors. Precisely Sˆ − R̂ leads to a new vector
      • Swap If we swap a less relevant document with a more                               of length N , where each entry k equals the number of swaps or
        relevant one in a lower rank position, the measure should                          replacements (if k = N ) needed to go from rˆ to ŝ.
        not decrease.
   These two properties lead to the following partial ordering among                       Example. In order to better understand this mechanism, let us
system runs                                                                                consider a second example. Consider the two judged runs in REL4
                       k               k
                                                                                                                   rˆ = (0, 1, 0, 0) ,
                                                                                                                       ŝ = (1, 0, 1, 0) .
                       Õ               Õ
          rˆ  ŝ ⇔          rˆ[j] ≤         ŝ[j] ∀k ∈ {1, . . . , N } .      (1)
                       j=1             j=1                                                 In order to construct a chain from rˆ to ŝ with the two basic operators
This ordering considers a run bigger than another one when, for                            (replacement in last position and single-step forward swap) we get
each rank position, it has more relevant documents than the other                                                      rˆ = (0, 1, 0, 0) ,
one up to that rank.
                                                                                                                       v̂ = (1, 0, 0, 0) ,
   This is the same ordering of runs used by [11] in the weak top-
heaviness case but, differently from [11], we now introduce a differ-                                                 ŵ = (1, 0, 0, 1) ,
ent notion of length of an interval, not based on the natural distance                                                 ŝ = (1, 0, 1, 0) .


                                                                                      12
We have made a swap in the first and third position and a replace-                     Moreover, when computing the difference vector ∆               ® ·· between
ment in the fourth position, that we can collect in a vector as                     two comparable runs rˆ, ŝ, in this work we write ∆ŝ r̂ whenever rˆ  ŝ:
                                                                                                                                          ®
                                    t = (1, 0, 1, 1) .                  (2)         if we instead consider ∆   ® r̂ ŝ , then we are counting the backward
                                                                                    swaps from ŝ to rˆ and ∆r̂ ŝ [i] ≤ 0 for all i ∈ {1, . . . , N }.
                                                                                                             ®
On the other hand it is easy to compute R̂ = (0, 1, 1, 1) and Sˆ =
                                                                                       Since here ∆
                                                                                                  ® ·· is no more a scalar but a vector, we have to define
(1, 1, 2, 2). Therefore
                                                                                    the partial order among intervals of runs d as follow:
                               Sˆ − R̂ = (1, 0, 1, 1) = t ,
                                                                                    Definition 3. Given [ˆr, ŝ], [û, v̂] ⊆ REL N ,
as we wanted to show.
                                                                                                                       ∆
                                                                                                                       ® v̂ û d ∆
                                                                                                                                  ® ŝ r̂
   Let us consider a second, more complicated, example.                             if and only if
Example. Consider the two judged runs                                                                   ∆
                                                                                                        ® v̂ û [i] ≤ ∆
                                                                                                                      ® ŝ r̂ [i], ∀i ∈ {1, . . . , N }.
                        rˆ = (1, 0, 0, 0, 0, 1, 1, 0, 1, 0) ,
                                                                                    Example. With respect to the previous example, where t = Sˆ − R̂ =
                         ŝ = (1, 1, 0, 0, 1, 0, 1, 0, 0, 1) .                      (0, 1, 1, 1, 2, 1, 1, 1, 0, 1), the vector ∆
                                                                                                                               ® ŝ r̂ is given by
Clearly rˆ  ŝ. Moreover                                                                                ∆
                                                                                                         ® ŝ r̂ = T = (0, 1, 2, 3, 5, 6, 7, 8, 8, 9).
                         Sˆ = (1, 2, 2, 2, 3, 3, 4, 4, 4, 5) ,                      Let now û, v̂ ∈ {0, 1}10 be as follows
                        R̂ = (1, 1, 1, 1, 1, 2, 3, 3, 4, 4).                                             û = (1, 0, 0, 1, 0, 1, 1, 1, 0, 0) ,
Thus                                                                                                        v̂ = (1, 0, 1, 1, 1, 0, 1, 0, 0, 0) .
                      Sˆ − R̂ = (0, 1, 1, 1, 2, 1, 1, 1, 0, 1).                     Clearly û  v̂ and
Let t = Sˆ − R̂. For any i < 10, t[i] tells us how many swaps one                                          ∆
                                                                                                           ® v̂ û = (0, 0, 1, 2, 4, 5, 6, 6, 6, 6) .
needs to do at depth i to make the smallest run coincide with the
biggest one. Moreover, if the total number of relevant relevance-                   Thus we can conclude that the difference between ŝ and rˆ is greater
degrees is not equal for both, as in this example, the last entry of                than the difference between v̂ and û.
t, t[N ], is exactly the number or replacements on rˆ one needs to                        Note that the last entry of ∆    ® ·· always equals the natural distance
make, and coincide with i ŝ(i) − i rˆ(i).
                           Í        Í
                                                                                    as defined in Section 2.3 and used by [11]. Indeed, given two compa-
    Given an interval [ˆr , ŝ], if we take the cumulative sums of t =              rable runs r,     ˆ ŝ ∈ REL N , with rˆ  ŝ, ∆
                                                                                                                                   ® ŝ r̂ [N ] counts the total number
Sˆ − R̂ we obtain the vector T of the cumulative sums of t that                     of forward swaps of length one and/or replacements done from
counts, for every i ≤ N , the total number of swaps (or replacements,               rˆ to match ŝ. Since swaps of length one and replacements in the
if i = N ) made from depth 1 to i between the endpoints of the                      last positions are elementary operations as observed above, then
given interval. The vector T can be seen as a new and generalized                   ∆® ŝ r̂ [N ] is just counting the length of every maximal chain in [ˆr , ŝ],
definition of the length of the interval [ˆr , ŝ], which replaces the              i.e., exactly the natural distance.
natural distance used by [11].                                                            This definition of difference vector solves some of the problems
    According to this new distance, we say that the interval [rˆ1 , sˆ1 ]           encountered with the difference defined using the natural distance,
is smaller than or equal to the interval [rˆ2 , sˆ2 ] if, for the vectors T1        as the following example shows.
and T2 of their cumulative sums, it holds that T1 [i] ≤ T2 [i] for any              Example. Let rˆ, ŝ, û, v̂ be defined as follows:
i ≤ n. It is worth noticing that, if we take as definition of length
                                                                                                           rˆ = (0, 1, 0, 0, 0, 0, 0, 0, 0, 0) ,
any convex linear combination of the values (T [i], . . . ,T [n]), the
intervals comparable for the previous ordering remain comparable.                                            ŝ = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0) ,
Other intervals become comparable for any fixed linear combina-                                              û = (0, 0, 0, 0, 0, 0, 0, 0, 0, 1) ,
tion, but it is not possible to say in advance they are ordered in the                                      v̂ = (0, 0, 0, 0, 0, 0, 0, 0, 1, 0) ,
same way by any two of these combinations.
    We are now able to define a difference in this setting:                         where rˆ  ŝ and û  v̂.
                                                                                       As already discussed, the natural distance induces a difference
Definition 2. Given r, ˆ ŝ ∈ REL N , with rˆ  ŝ, the difference ∆
                                                                   ® ŝ r̂          between runs that does not keep track or the rank. In this case, the
is a vector of length N such that                                                   natural distance would that both the pairs r, ˆ ŝ, and û, v̂, have both
                                   i
                                   Õ                                                difference equal to 1, even if these two pair differs a lot in terms of
                   ∆
                   ® ŝ r̂ [i] B         (i − j + 1) ŝ[j] − rˆ[j] ,
                                                                                   where differences actually happen in the ranking.
                                   j=1                                                 Instead, ∆
                                                                                                ® ·· shows a bigger difference between rˆ and ŝ compared
                                                                                    to the other two runs, because their differences happen in higher
for all i ∈ {1, . . . , N }.
                                                                                    and more important rank positions:
    It can be easily proved that ∆  ® ŝ r̂ is exactly the vector T de-
                                                                                                            ∆
                                                                                                            ® ŝ r̂ = (1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
fined above. Indeed, by construction, given r,        ˆ ŝ ∈ REL N with
                Íj                                                                                         ∆
                                                                                                           ® v̂ û = (0, 0, 0, 0, 0, 0, 0, 0, 1, 1) ,
rˆ  ŝ, t[j] = n=1 (ŝ[n] − rˆ[n]). Therefore T [i] = ij=1 t[j] =
                                                             Í
Íi Íj                       Íi
   j=1 n=1 (ŝ[n] − rˆ[n]) = j=1 (i − j + 1) ŝ[j] − rˆ[j] .                        and ∆
                                                                                        ® ŝ r̂ [i] ≥ ∆
                                                                                                      ® v̂ û [i] for every i ∈ {1, . . . , 10}.
                                                           

                                                                               13
   Therefore, this new and more expressive difference matches                                 Instead, RBPp and DCG show a greater agreement with the in-
better with the intuition that the higher the rank position at which                        equalities between intervals induced by ∆® ·· , even if sometimes they
it happens, the more important the same difference between two                              do not respect these relations: this happens when the endpoints of
runs.                                                                                       an interval do not have an equal number of relevant documents.
   The vector ∆
              ® ·· is thus useful to compare, when possible, intervals
on REL N , paying the necessary attention on the ranking. As a                              Example. Let us consider rˆ, ŝ, û ∈ {0, 1}10 :
consequence, a measure that satisfy these relations among intervals,                                                 rˆ = (0, 0, 1, 0, 1, 1, 0, 0, 1, 0) ,
although not interval scale, could be viewed as something more                                                       ŝ = (0, 1, 0, 1, 0, 1, 1, 1, 1, 0) ,
powerful than a measure on ordinal scale. Indeed, when the above
                                                                                                                     û = (1, 1, 0, 1, 1, 1, 0, 1, 0, 0) .
differences between intervals are comparable, one direction of iff
on Theorem 1 is still satisfied.                                                                Clearly rˆ  ŝ  û and one can prove that
   Therefore we can say that a measure M of retrieval effectiveness
                                                                                                                 ∆
                                                                                                                 ® ŝ r̂ = (0, 1, 1, 2, 2, 2, 3, 5, 7, 9) ,
is interval-like if, given a distance (potentially vector) ∆ ·· , an
ordering d between distances, and given rˆ, ŝ, û, v̂ ∈ REL N , the                                            ∆
                                                                                                                 ® û ŝ = (1, 2, 3, 4, 6, 8, 9, 10, 10, 10) ,
following relation holds:
                                                                                            that is ∆
                                                                                                    ® ŝ r̂ [i] ≤ ∆
                                                                                                                  ® û ŝ [i] ∀i, with strict inequality for some i. While
             ∆ŝ r̂ d ∆v̂ û ⇒ M(ŝ) − M(ˆr ) ≤ M(v̂) − M(û).                             û and ŝ has the same number of relevant documents, rˆ has two
   The next section is discusses whether some well-known IR mea-                            relevant documents less than ŝ. In particular DCG(ŝ) − DCG(ˆr ) >
sures are interval-like with respect to the difference introduced in                        DCG(û)−DCG(ŝ) and, for p > 0.85, RBPp (ŝ)−RBPp (ˆr ) > RBPp (û)−
Definition 2.                                                                               RBPp (ŝ), against the inequality given by the difference vectors.
                                                                                               Therefore, we can say that RBPp and DCG are interval-like with
3.3     Interval-like Scale Measures                                                        respect to the difference introduced in Definition 2 and consider-
We tested some measures of retrieval effectiveness – namely AP,                             ing only intervals where the endpoints have an equal number of
RBPp , ERR, DCG – on intervals with comparable differences ac-                              relevant documents. While AP and ERR are not even interval-like
cording to the above definition.                                                            since the relations between intervals often fail to be complied with.
   ERR shows the strongest discordance with our definition of
difference, since often it does not respect the relations between                           4    CONCLUSIONS AND FUTURE WORK
intervals induced by ∆® ·· , as the next example shows.                                     In this paper, we conducted a formal study to propose a new and
Example. Let us consider the following four runs r,
                                                 ˆ ŝ, û, v̂ ∈ {0, 1}10 :                  more expressive way of providing an empirical ordering of intervals
                                                                                            of runs in order to determine how close IR effectiveness measure
                        rˆ = (0, 0, 0, 0, 0, 0, 1, 1, 1, 0) ,                               are to be on an interval scale. Indeed, previous work [10, 11] has
                        ŝ = (0, 0, 0, 0, 0, 1, 0, 1, 1, 0) ,                               shown that they are on an ordinal scale, under some conditions,
                        û = (1, 1, 0, 1, 0, 1, 1, 0, 1, 1) ,                               but not on an interval scale. We have introduced the notion of
                                                                                            interval-like scale, a kind of interval scale which admits intervals
                        v̂ = (1, 1, 1, 0, 0, 1, 1, 0, 1, 1) .
                                                                                            to not be equi-spaced, and we have shown that both DCG and RBP
Clearly rˆ  ŝ  û  v̂. It seems fair to think that rˆ and ŝ give rise                  are on this scale, under reasonable conditions, while AP and ERR
to a smaller interval compared to [û, v̂] – note that the endpoints of                     are not.
both intervals differ by a swap of length one, but made in different                           Future work will concern an empirical investigation of the dif-
positions. Moreover it is easy to prove that ∆  ® ŝ r̂ [i] ≤ ∆
                                                              ® v̂ û [i] ∀i. But           ferent theoretical properties of evaluation measures we have found
while the measures RBPp , AP and DCG agree with the previous                                in order to determine the impact and severity of not complying
statement, ERR does not, since ERR(ŝ) −ERR(ˆr ) > ERR(v̂) −ERR(û).                        with them when you compute descriptive statistics, like mean and
                                                                                            variance, and when you conduct statistical significance tests.
   Another measure that does not always respect the relations
between distances is AP.                                                                    REFERENCES
Example. Let us consider the following runs rˆ, ŝ, û ∈ {0, 1}10 :                          [1] E. Amigó, J. Gonzalo, and M. F. Verdejo. 2013. A General Evaluation Measure
                                                                                                 for Document Organization Tasks. In Proc. 36th Annual International ACM SIGIR
                        rˆ = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0) ,                                    Conference on Research and Development in Information Retrieval (SIGIR 2013),
                                                                                                 G. J. F. Jones, P. Sheridan, D. Kelly, M. de Rijke, and T. Sakai (Eds.). ACM Press,
                        ŝ = (0, 1, 0, 0, 1, 0, 0, 0, 0, 1) ,                                    New York, USA, 643–652.
                        û = (0, 1, 0, 0, 1, 1, 1, 0, 0, 1) .
                                                                                             [2] P. Bollman. 1984. Two Axioms for Evaluation Measures in Information Retrieval.
                                                                                                 In Proc. of the Third Joint BCS and ACM Symposium on Research and Development
                                                                                                 in Information Retrieval, C. J. van Rijsbergen (Ed.). Cambridge University Press,
   Clearly rˆ  ŝ and ŝ  û. The readers can agree to consider the                            UK, 233–245.
interval [ˆr , ŝ] strictly bigger than [ŝ, û], since from û to ŝ we have                [3] P. Bollmann and V. S. Cherniavsky. 1980. Measurement-theoretical investigation
lost only two relevant documents, while from ŝ to rˆ the information                            of the MZ-metric. In Proc. 3rd Annual International ACM SIGIR Conference on Re-
                                                                                                 search and Development in Information Retrieval (SIGIR 1980), C. J. van Rijsbergen
lost seems to be higher. Moreover ∆        ® ŝ r̂ [i] ≥ ∆
                                                         ® û ŝ [i] ∀i, with strict             (Ed.). ACM Press, New York, USA, 256–267.
inequality for some i. However while the measures RBPp , ERR and                             [4] C. Buckley and E. M. Voorhees. 2000. Evaluating Evaluation Measure Stability.
                                                                                                 In Proc. 23rd Annual International ACM SIGIR Conference on Research and De-
DCG agree with this relation between the two intervals, AP does                                  velopment in Information Retrieval (SIGIR 2000), E. Yannakoudakis, N. J. Belkin,
not, since AP(ŝ) − AP(ˆr ) < AP(û) − AP(ŝ).                                                   M.-K. Leong, and P. Ingwersen (Eds.). ACM Press, New York, USA, 33–40.


                                                                                       14
 [5] C. Buckley and E. M. Voorhees. 2004. Retrieval Evaluation with Incomplete                   (SIGIR 2008), T.-S. Chua, M.-K. Leong, D. W. Oard, and F. Sebastiani (Eds.). ACM
     Information. In Proc. 27th Annual International ACM SIGIR Conference on Research            Press, New York, USA, 51–58.
     and Development in Information Retrieval (SIGIR 2004), M. Sanderson, K. Järvelin,
     J. Allan, and P. Bruza (Eds.). ACM Press, New York, USA, 25–32.
 [6] L. Busin and S. Mizzaro. 2013. Axiometrics: An Axiomatic Approach to Informa-
     tion Retrieval Effectiveness Metrics. In Proc. 4th International Conference on the
     Theory of Information Retrieval (ICTIR 2013), O. Kurland, D. Metzler, C. Lioma,
     B. Larsen, and P. Ingwersen (Eds.). ACM Press, New York, USA, 22–29.
 [7] B. A. Carterette. 2012. Multiple Testing in Statistical Analysis of Systems-Based
     Information Retrieval Experiments. ACM Transactions on Information Systems
     (TOIS) 30, 1 (2012), 4:1–4:34.
 [8] B. A. Carterette. 2015. Bayesian Inference for Information Retrieval Evaluation.
     In Proc. 1st ACM SIGIR International Conference on the Theory of Information
     Retrieval (ICTIR 2015), J. Allan, W. B. Croft, A. P. de Vries, C. Zhai, N. Fuhr, and
     Y. Zhang (Eds.). ACM Press, New York, USA, 31–40.
 [9] O. Chapelle, D. Metzler, Y. Zhang, and P. Grinspan. 2009. Expected Reciprocal
     Rank for Graded Relevance. In Proc. 18th International Conference on Information
     and Knowledge Management (CIKM 2009), D. W.-L. Cheung, I.-Y. Song, W. W.
     Chu, X. Hu, and J. J. Lin (Eds.). ACM Press, New York, USA, 621–630.
[10] M. Ferrante, N. Ferro, and M. Maistro. 2015. Towards a Formal Framework
     for Utility-oriented Measurements of Retrieval Effectiveness. In Proc. 1st ACM
     SIGIR International Conference on the Theory of Information Retrieval (ICTIR 2015),
     J. Allan, W. B. Croft, A. P. de Vries, C. Zhai, N. Fuhr, and Y. Zhang (Eds.). ACM
     Press, New York, USA, 21–30.
[11] M. Ferrante, N. Ferro, and S. Pontarollo. 2017. Are IR Evaluation Measures on an
     Interval Scale?. In Proc. 3rd ACM SIGIR International Conference on the Theory of
     Information Retrieval (ICTIR 2017), J. Kamps, E. Kanoulas, M. de Rijke, H. Fang,
     and E. Yilmaz (Eds.). ACM Press, New York, USA, 67–74.
[12] S. Foldes. 2013. On distances and metrics in discrete ordered sets. arXiv.org,
     Combinatorics (math.CO) arXiv:1307.0244 (June 2013).
[13] D. A. Hull. 1993. Using Statistical Testing in the Evaluation of Retrieval Experi-
     ments. In Proc. 16th Annual International ACM SIGIR Conference on Research and
     Development in Information Retrieval (SIGIR 1993), R. Korfhage, E. Rasmussen,
     and P. Willett (Eds.). ACM Press, New York, USA, 329–338.
[14] K. Järvelin and J. Kekäläinen. 2002. Cumulated Gain-Based Evaluation of IR
     Techniques. ACM Transactions on Information Systems (TOIS) 20, 4 (October
     2002), 422–446.
[15] D. H. Krantz, R. D. Luce, P. Suppes, and A. Tversky. 1971. Foundations of Mea-
     surement. Additive and Polynomial Representations. Vol. 1. Academic Press, New
     York, USA.
[16] A. Moffat and J. Zobel. 2008. Rank-biased Precision for Measurement of Retrieval
     Effectiveness. ACM Transactions on Information Systems (TOIS) 27, 1 (2008),
     2:1–2:27.
[17] S. Robertson. 2006. On GMAP: and Other Transformations. In Proc. 15th Interna-
     tional Conference on Information and Knowledge Management (CIKM 2006), P. S.
     Yu, V. Tsotras, E. A. Fox, and C.-B. Liu (Eds.). ACM Press, New York, USA, 78–83.
[18] G. B. Rossi. 2014. Measurement and Probability. A Probabilistic Theory of Mea-
     surement with Applications. Springer-Verlag, New York, USA.
[19] T. Sakai. 2006. Evaluating Evaluation Metrics based on the Bootstrap. In Proc.
     29th Annual International ACM SIGIR Conference on Research and Development
     in Information Retrieval (SIGIR 2006), E. N. Efthimiadis, S. Dumais, D. Hawking,
     and K. Järvelin (Eds.). ACM Press, New York, USA, 525–532.
[20] T. Sakai. 2014. Statistical Reform in Information Retrieval? SIGIR Forum 48, 1
     (June 2014), 3–12.
[21] T. Sakai. 2017. The Probability that Your Hypothesis Is Correct, Credible Intervals,
     and Effect Sizes for IR Evaluation. In Proc. 40th Annual International ACM SIGIR
     Conference on Research and Development in Information Retrieval (SIGIR 2017),
     N. Kando, T. Sakai, H. Joho, H. Li, A. P. de Vries, and R. W. White (Eds.). ACM
     Press, New York, USA, 25–34.
[22] F. Sebastiani. 2015. An Axiomatically Derived Measure for the Evaluation of
     Classification Algorithms. In Proc. 1st ACM SIGIR International Conference on the
     Theory of Information Retrieval (ICTIR 2015), J. Allan, W. B. Croft, A. P. de Vries,
     C. Zhai, N. Fuhr, and Y. Zhang (Eds.). ACM Press, New York, USA, 11–20.
[23] M. D. Smucker, J. Allan, and B. A. Carterette. 2007. A Comparison of Statistical
     Significance Tests for Information Retrieval Evaluation. In Proc. 16th International
     Conference on Information and Knowledge Management (CIKM 2007), M. J. Silva,
     A. A. F. Laender, R. Baeza-Yates, D. L. McGuinness, B. Olstad, Ø. H. Olsen, and
     A. and Falcão (Eds.). ACM Press, New York, USA, 623–632.
[24] R. P. Stanley. 2012. Enumerative Combinatorics – Volume 1 (2nd ed.). Cambridge
     Studies in Advanced Mathematics, Vol. 49. Cambridge University Press, Cam-
     bridge, UK.
[25] S. S. Stevens. 1946. On the Theory of Scales of Measurement. Science, New Series
     103, 2684 (June 1946), 677–680.
[26] C. J. van Rijsbergen. 1974. Foundations of Evaluation. Journal of Documentation
     30, 4 (1974), 365–373.
[27] W. Webber, A. Moffat, and J. Zobel. 2008. Score Standardization for Inter-
     Collection Comparison of Retrieval Systems. In Proc. 31st Annual International
     ACM SIGIR Conference on Research and Development in Information Retrieval


                                                                                            15