An interval-like scale property for IR evaluation measures Marco Ferrante Nicola Ferro Silvia Pontarollo Dept. Mathematics Dept. Information Engineering Dept. Mathematics University of Padua, Italy University of Padua, Italy University of Padua, Italy ferrante@math.unipd.it ferro@dei.unipd.it spontaro@math.unipd.it ABSTRACT an interval scale is somehow arbitrary and, as a consequence, also Evaluation measures play an important role in IR experimental some of the descriptive statistics you compute about it. evaluation and their properties determine the kind of statistical Therefore, researchers started to study what IR effectiveness analyses we can conduct. measures are, not only from an empirical perspective, e.g., [4, 5, 19], It has been previously shown that it is questionable that IR ef- but also from a theoretical one, e.g., [1–3, 6, 10, 22, 26]. fectiveness measures are on an interval-scale and this implies that In this paper, we stem from the recent work of [11] and we computing means and variances is not a permissible operation. move a step forward in understanding when and to what extent IR In this paper, we investigate whether it is possible to relax a bit effectiveness measures are on an interval scale. the definition of interval scale, introducing the notion of interval- [11] investigated whether IR effectiveness measures are on an like scale, and to what extent IR effectiveness measures comply interval scale in the perspective of the representational theory of with this relaxed definition. measurement [15], which is the measurement theory adopted in both physical and social sciences. According to this framework, the CCS CONCEPTS key point is to understand how real world objects, i.e., system runs in our case, are related to each other since measure properties are •Information systems → Retrieval effectiveness; then derived from these relations. Moreover, it is important that these relations among real world objects are intuitive and sensible KEYWORDS to “everybody” and that they can be commonly agreed on. evaluation measures; representational theory of measurement; in- Therefore, [11] pointed out that the main issues in determining terval scale the scale of IR effectiveness measures are: (i) to understand how runs are empirically and intuitively ordered; (ii) to define what an interval of runs is; and, (iii) to determine how these intervals 1 INTRODUCTION are ordered. Once you settled all these aspects, you can check Evaluation plays a central role in Information Retrieval (IR) and a whether an effectiveness measure comply with them or not and thus lot of attention is devoted to improving our evaluation method- determine whether it is on an interval scale or not. In particular, ologies and practices. For example, since many years, there is a [11] found that under a strong top-heaviness notion of ordering continued interest on how to properly apply statistical techniques among runs, only Rank-Biased Precision (RBP) [16] with p = 12 is on to the analysis of IR experimental data, e.g., on the appropriate use an interval scale while RBP for other values of p and other popular of statistical testing [7, 13, 20, 23], on the normalization of measure measures – namely AP, Discounted Cumulated Gain (DCG) [14], values for cross-collection comparison [27], or on moving towards and Expected Reciprocal Rank (ERR) [9] – are not. Moreover, using Bayesian inference [8, 21], just to name a few. a weak top-heaviness notion of ordering among runs, [11] found However, all these studies rely on some, often hidden and implicit, that all the previously mentioned IR effectiveness measures are not assumptions on what IR effectiveness measures are. In particular, on an interval scale. measurement scales [15, 25] determine the operations that is admis- Strong top-heaviness provides us with a total ordering among sible to perform with measure values and, as a consequence, the runs and, as discussed above, there is at least one case of IR mea- statistical analyses that can be applied. [25] identifies four major sure on an interval scale; however, the way in which strong top- types of scales with increasing properties: (i) the nominal scale heaviness orders runs may give raise to disagreement or corner consists of discrete unordered values, i.e. categories; (ii) the ordinal cases. For example, strong top-heaviness ranks the run (1, 0, 0, 0) scale introduces a natural order among the values; (iii) the inter- with just one top relevant document before the run (0, 1, 1, 1) with val scale preserves the equality of intervals or differences; and (iv) all relevant documents except for the first position; thus, there the ratio scale preserves the equality of ratios. Operations such as might be disagreement on whether this is an appropriate ordering computing the mean or the variance are possible just on interval for these runs. On the other hand, weak top-heaviness provides and ratio scales and they constitute the basis of many of the statis- us with a much more intuitive partial ordering based on two basic tical techniques mentioned above. However, are we sure that IR operations – swapping two consecutive documents in a ranking effectiveness measures are on an interval scale? For example, [17] and replacing a not relevant document with a relevant one [10]; points out that the assumption of Average Precision (AP) being on however, none of the IR evaluation measures is on interval-scale using weak top-heaviness. Copying permitted for private and academic purposes. The problem with IR effectiveness measures emerging from [11] 8th International Workshop on Evaluating Information Access (EVIA 2017), co-located is two-fold: on the one side, both strong and weak top-heaviness with NTCIR-13, 5 December 2017, Tokyo, Japan. © 2017 Copyright held by the author. create equi-spaced intervals of runs, as expected by the definition 10 of interval scale, but IR effectiveness measures do not respect this Given (E, ), we have to define a difference ∆ab between two equi-spacing; on the other side, both strong and weak top-heaviness elements a, b ∈ E, which is a kind of signed distance we exploit do not account enough for the importance and the effect of the to compare intervals. Then, we have to define a weak order d rank of a document in a run, since they both rely on the notion of between these ∆ab differences. We can proceed as follows: if two natural distance in a poset (partially ordered set) [24] which flattens elements a, b ∈ E are such that a ∼ b, i.e. a  b and b  a, then things too much, shrinking everything into a single number. the interval [a, b] is null and, consequently, we set ∆ab ∼d ∆ba ; if In this paper, we take a different approach to the ordering of in- a  b we agree upon choosing ∆aa ≺d ∆ab which, in turn implies tervals of runs, not based on single numbers, as the natural distance that ∆aa d ∆ba . of [11] does, but using vectors instead. This new ordering is richer Definition 1. Let E be a finite (not empty) set of objects. Let d be a and more expressive than that induced by the natural distances in binary relation on E ×E that satisfies, for each a, b, c, d, a 0, b 0, c 0 ∈ E, the strong and weak top-heaviness cases and allows us to introduce the following axioms: the notion of interval-like scale, i.e., something richer than an ordi- nal scale but a bit less powerful than an interval scale, since runs i. d is weak order; are ordered, intervals of runs are ordered too but intervals may ii. if ∆ab d ∆cd , then ∆dc d ∆ba ; not be equi-spaced. In particular, we find that, under reasonable iii. if ∆ab d ∆a 0b 0 and ∆bc d ∆b 0c 0 then ∆ac d ∆a 0c 0 ; assumptions, DCG and RBP are on a interval-like scale while AP iv. Solvability Condition: if ∆aa d ∆cd d ∆ab , then there and ERR are not. exists d 0, d 00 ∈ E such that ∆ad 0 ∼d ∆cd ∼d ∆d 00b . The paper is organized as follows: Section 2 recaps some basic Then (E, d ) is a difference structure. concepts about the representational theory of measurement and Particular attention has to be paid to the Solvability Condition posets; Section 3 deals with interval-like scales; finally, Section 4 which ensures the existence of an equally spaced gradation be- wraps up the discussion and outlooks some future work. tween the elements of E, indispensable to construct an interval scale measurement. 2 BACKGROUND The representation theorem for difference structures states: 2.1 Representational Theory of Measurement Theorem 1. Let E be a finite (not empty) set of objects and let A relational structure [15, 18] is an ordered pair X = X , R X of (E, d ) be a difference structure. Then there exist a measurement scale a domain set X and a set of relations R X on X , where the relations in M : E → R such that for every a, b, c, d ∈ E R X may have different arities, i.e. they can be unary, binary, ternary ∆ab d ∆cd ⇔ M(a) − M(b) ≤ M(c) − M(d) . relations and so on. Given two relational structures X and Y, a homomorphism M : X → Y from X to Y is a mapping M = M, MR This theorem ensures us that, if there is a difference structure where: (i) M is a function that maps X into M(X ) ⊆ Y , i.e. for each on the empirical set E, then there exists an interval scale M. element of the domain set there exists one corresponding image As anticipated in Section 1, we will introduce the notion of element; (ii) MR is a function that maps R X into MR (R X ) ⊆ RY such interval-like scale which corresponds to removing the solvability that ∀r ∈ R X , r and MR (r ) have the same arity, i.e. for each relation condition from the definition of difference structure and obtaining on the domain set there exists one (and it is usually, and often a new partial ordering of the intervals of runs. implicitly, assumed: and only one) corresponding image  relation; (iii) ∀r ∈ R X , ∀x i ∈ X , if r (x 1 , . . . , x n ) then MR (r ) M(x 1 ), . . . , 2.3 Posets  A partially ordered set P, poset for short, is a set with a partial order M(x n ) , i.e. if a relation holds for some elements of the domain set  defined on it [24]. A partial order  is a binary relation over P then the image relation must hold for the image elements. which is reflexive, antisymmetric and transitive. Given s, t ∈ P, we A relational structure E is called empirical if its domain set E say that s and t are comparable if s  t or t  s, otherwise they are spans over the entities under consideration in the real world, i.e. the incomparable. system runs in our case; a relational structure S is called symbolic A closed interval is a subset of P defined as [s, t] B {u ∈ P : s  if its domain set S spans over a given set of numbers. A measure- u  t }, where s, t ∈ P and s  t. Moreover we say that t covers s ment (scale) is the homomorphism M = M, MR from the real if s  t and [s, t] = {s, t }, that is there does not exist u ∈ P such world to the symbolic world and a measure is the number assigned that s ≺ u ≺ t . to an entity by this mapping. We can represent a finite poset P by using the Hasse diagram which is a graph where vertices are the elements of P, edges rep- 2.2 Measurement Scales resent the covers relations, and if s ≺ t then s is below t in the diagram. [11] relied on the notion of difference structure [15, 18] to introduce A subset C of a poset P is a chain if any two elements of C are a definition of interval among system runs in such a way that it comparable: a chain is a totally ordered subset of a poset. If C is a ensures the existence of an interval scale. finite chain, the length of C, `(C), is defined by `(C) = |C | − 1. A Given E, a weakly ordered empirical structure is a pair (E, ) maximal chain of P is a chain that is not a proper subset of any where, for every a, b, c ∈ E, other chain of P. • a  b or b  a; If every maximal chain of P has the same length n, we say that • a  b and b  c ⇒ a  c. P is graded of rank n; in particular there exists a unique function 11 ρ : P → {0, 1, . . . , n}, called the rank function, such that ρ(s) = 0, which, as discussed in Section 1, has the drawback of flattening if s is a minimal element of P, and ρ(t) = ρ(s) + 1, if t covers s. everything into a single number. Finally, since any interval on a graded poset is graded, the length To define the length of an interval we adopt the following strat- of an interval [s, t] is given by `(s, t) B `([s, t]) = ρ(t) − ρ(s), also egy: given rˆ, ŝ ∈ REL N with rˆ  ŝ, we count how many replace- called the natural distance. ments in the last position and how many forward single-step swaps at each depth are necessary to go from rˆ to ŝ following a maximal 3 INTERVAL-LIKE SCALES chain in REL N . In order to do this, it is useful to define the cu- 3.1 Preliminary Definitions mulative sums of a vector v = (v[1], . . . , v[N ]), denoted using the Íj capital letter as V = (V [1], . . . , V [N ]), where V [j] = i=1 v[i]. Given N , the length of the run, we define the set of retrieved Let us start with a simple example. documents as D(N ) = {(d 1 , . . . , d N ) : di ∈ D, di , d j for any i , j}, i.e. the ranked list of retrieved documents without duplicates, and Example. Consider the two judged runs in REL4 Ð |D | the universe set of retrieved documents as D := N =1 D(N ). rˆ = (0, 1, 1, 0) , A run r t , retrieving a ranked list of documents D(N ) in response 0̂ = (0, 0, 0, 0) . to a topic t ∈ T , is a function from T into D Since 0̂ ≺ rˆ, in order to construct a chain from 0̂ to rˆ with the t 7→ r t = (d 1 , . . . , d N ) two basic operators (replacement in last position and single-step We denote by r t [j] the j-th element of the vector r t , forward swap) we get i.e. r t [j] = d j . 0̂ = (0, 0, 0, 0) , We define the universe set of judged documents as R := Ð |D | N N 0̂1 = (0, 0, 0, 1) , N =1 REL , where REL is the set of the ranked lists of judged retrieved documents with length fixed to N . Since in our case 0̂2 = (0, 0, 1, 0) , REL = {0, 1}, REL N = {0, 1} N refers to the space of all N −length 0̂4 = (0, 1, 0, 0) , vectors consisting of 0 and 1. As for the set-based case, we denote by RBt the recall base, i.e. the total number of relevant documents 0̂5 = (0, 1, 0, 1) , for a topic. 0̂6 = (0, 1, 1, 0) = rˆ . We call judged run the function rˆt from T × D into R, which assigns a relevance degree to each retrieved document in the ranked We have made two replacement in the fourth position, one swap list in the second position and two in the third one. Recall that with (t, r t ) 7→ rˆt = GT (t, d 1 ), . . . , GT (t, d N )  swap at depth i we mean that a forward swap from position i − 1 to position i was done. We can count how many of these basic We denote by rˆt [j] the j-th element of the vector rˆt , i.e. rˆt [j] = operations in each position are needed to go from 0̂ to rˆ just taking GT (t, d j ). the cumulative sums of rˆ. Indeed we get As for the set-based case, we can simplify the  notation omitting the dependence on topics, rˆ B rˆ[1], . . . , rˆ[N ] , RB, and so on. R̂ = (0, 1, 2, 2) , and each entry k < D of R̂, R̂[k], counts the number of swaps made 3.2 Ordering between Intervals in position k, while R̂[N ] counts the number of replacement, i.e. Let us start recalling the ordering between runs adopted in this the total mass of rˆ, to go from 0̂ to rˆ. paper and based on the following two monotonicity-like properties More generally, given two vectors rˆ, ŝ ∈ REL N , with rˆ ≺ ŝ, proposed by [10]: in order to collect the number of basic operations made at each • Replacement A measure of retrieval effectiveness should position to go from rˆ to ŝ, we can compute this vector of length N not decrease when replacing a document with another one first between 0̂ and rˆ and between 0̂ and ŝ, namely R̂ and S, ˆ and in the same rank position with higher degree of relevance. then subtract the two vectors. Precisely Sˆ − R̂ leads to a new vector • Swap If we swap a less relevant document with a more of length N , where each entry k equals the number of swaps or relevant one in a lower rank position, the measure should replacements (if k = N ) needed to go from rˆ to ŝ. not decrease. These two properties lead to the following partial ordering among Example. In order to better understand this mechanism, let us system runs consider a second example. Consider the two judged runs in REL4 k k rˆ = (0, 1, 0, 0) , ŝ = (1, 0, 1, 0) . Õ Õ rˆ  ŝ ⇔ rˆ[j] ≤ ŝ[j] ∀k ∈ {1, . . . , N } . (1) j=1 j=1 In order to construct a chain from rˆ to ŝ with the two basic operators This ordering considers a run bigger than another one when, for (replacement in last position and single-step forward swap) we get each rank position, it has more relevant documents than the other rˆ = (0, 1, 0, 0) , one up to that rank. v̂ = (1, 0, 0, 0) , This is the same ordering of runs used by [11] in the weak top- heaviness case but, differently from [11], we now introduce a differ- ŵ = (1, 0, 0, 1) , ent notion of length of an interval, not based on the natural distance ŝ = (1, 0, 1, 0) . 12 We have made a swap in the first and third position and a replace- Moreover, when computing the difference vector ∆ ® ·· between ment in the fourth position, that we can collect in a vector as two comparable runs rˆ, ŝ, in this work we write ∆ŝ r̂ whenever rˆ  ŝ: ® t = (1, 0, 1, 1) . (2) if we instead consider ∆ ® r̂ ŝ , then we are counting the backward swaps from ŝ to rˆ and ∆r̂ ŝ [i] ≤ 0 for all i ∈ {1, . . . , N }. ® On the other hand it is easy to compute R̂ = (0, 1, 1, 1) and Sˆ = Since here ∆ ® ·· is no more a scalar but a vector, we have to define (1, 1, 2, 2). Therefore the partial order among intervals of runs d as follow: Sˆ − R̂ = (1, 0, 1, 1) = t , Definition 3. Given [ˆr, ŝ], [û, v̂] ⊆ REL N , as we wanted to show. ∆ ® v̂ û d ∆ ® ŝ r̂ Let us consider a second, more complicated, example. if and only if Example. Consider the two judged runs ∆ ® v̂ û [i] ≤ ∆ ® ŝ r̂ [i], ∀i ∈ {1, . . . , N }. rˆ = (1, 0, 0, 0, 0, 1, 1, 0, 1, 0) , Example. With respect to the previous example, where t = Sˆ − R̂ = ŝ = (1, 1, 0, 0, 1, 0, 1, 0, 0, 1) . (0, 1, 1, 1, 2, 1, 1, 1, 0, 1), the vector ∆ ® ŝ r̂ is given by Clearly rˆ  ŝ. Moreover ∆ ® ŝ r̂ = T = (0, 1, 2, 3, 5, 6, 7, 8, 8, 9). Sˆ = (1, 2, 2, 2, 3, 3, 4, 4, 4, 5) , Let now û, v̂ ∈ {0, 1}10 be as follows R̂ = (1, 1, 1, 1, 1, 2, 3, 3, 4, 4). û = (1, 0, 0, 1, 0, 1, 1, 1, 0, 0) , Thus v̂ = (1, 0, 1, 1, 1, 0, 1, 0, 0, 0) . Sˆ − R̂ = (0, 1, 1, 1, 2, 1, 1, 1, 0, 1). Clearly û  v̂ and Let t = Sˆ − R̂. For any i < 10, t[i] tells us how many swaps one ∆ ® v̂ û = (0, 0, 1, 2, 4, 5, 6, 6, 6, 6) . needs to do at depth i to make the smallest run coincide with the biggest one. Moreover, if the total number of relevant relevance- Thus we can conclude that the difference between ŝ and rˆ is greater degrees is not equal for both, as in this example, the last entry of than the difference between v̂ and û. t, t[N ], is exactly the number or replacements on rˆ one needs to Note that the last entry of ∆ ® ·· always equals the natural distance make, and coincide with i ŝ(i) − i rˆ(i). Í Í as defined in Section 2.3 and used by [11]. Indeed, given two compa- Given an interval [ˆr , ŝ], if we take the cumulative sums of t = rable runs r, ˆ ŝ ∈ REL N , with rˆ  ŝ, ∆ ® ŝ r̂ [N ] counts the total number Sˆ − R̂ we obtain the vector T of the cumulative sums of t that of forward swaps of length one and/or replacements done from counts, for every i ≤ N , the total number of swaps (or replacements, rˆ to match ŝ. Since swaps of length one and replacements in the if i = N ) made from depth 1 to i between the endpoints of the last positions are elementary operations as observed above, then given interval. The vector T can be seen as a new and generalized ∆® ŝ r̂ [N ] is just counting the length of every maximal chain in [ˆr , ŝ], definition of the length of the interval [ˆr , ŝ], which replaces the i.e., exactly the natural distance. natural distance used by [11]. This definition of difference vector solves some of the problems According to this new distance, we say that the interval [rˆ1 , sˆ1 ] encountered with the difference defined using the natural distance, is smaller than or equal to the interval [rˆ2 , sˆ2 ] if, for the vectors T1 as the following example shows. and T2 of their cumulative sums, it holds that T1 [i] ≤ T2 [i] for any Example. Let rˆ, ŝ, û, v̂ be defined as follows: i ≤ n. It is worth noticing that, if we take as definition of length rˆ = (0, 1, 0, 0, 0, 0, 0, 0, 0, 0) , any convex linear combination of the values (T [i], . . . ,T [n]), the intervals comparable for the previous ordering remain comparable. ŝ = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0) , Other intervals become comparable for any fixed linear combina- û = (0, 0, 0, 0, 0, 0, 0, 0, 0, 1) , tion, but it is not possible to say in advance they are ordered in the v̂ = (0, 0, 0, 0, 0, 0, 0, 0, 1, 0) , same way by any two of these combinations. We are now able to define a difference in this setting: where rˆ  ŝ and û  v̂. As already discussed, the natural distance induces a difference Definition 2. Given r, ˆ ŝ ∈ REL N , with rˆ  ŝ, the difference ∆ ® ŝ r̂ between runs that does not keep track or the rank. In this case, the is a vector of length N such that natural distance would that both the pairs r, ˆ ŝ, and û, v̂, have both i Õ difference equal to 1, even if these two pair differs a lot in terms of ∆ ® ŝ r̂ [i] B (i − j + 1) ŝ[j] − rˆ[j] ,  where differences actually happen in the ranking. j=1 Instead, ∆ ® ·· shows a bigger difference between rˆ and ŝ compared to the other two runs, because their differences happen in higher for all i ∈ {1, . . . , N }. and more important rank positions: It can be easily proved that ∆ ® ŝ r̂ is exactly the vector T de- ∆ ® ŝ r̂ = (1, 1, 1, 1, 1, 1, 1, 1, 1, 1) fined above. Indeed, by construction, given r, ˆ ŝ ∈ REL N with Íj ∆ ® v̂ û = (0, 0, 0, 0, 0, 0, 0, 0, 1, 1) , rˆ  ŝ, t[j] = n=1 (ŝ[n] − rˆ[n]). Therefore T [i] = ij=1 t[j] = Í Íi Íj Íi j=1 n=1 (ŝ[n] − rˆ[n]) = j=1 (i − j + 1) ŝ[j] − rˆ[j] . and ∆ ® ŝ r̂ [i] ≥ ∆ ® v̂ û [i] for every i ∈ {1, . . . , 10}.  13 Therefore, this new and more expressive difference matches Instead, RBPp and DCG show a greater agreement with the in- better with the intuition that the higher the rank position at which equalities between intervals induced by ∆® ·· , even if sometimes they it happens, the more important the same difference between two do not respect these relations: this happens when the endpoints of runs. an interval do not have an equal number of relevant documents. The vector ∆ ® ·· is thus useful to compare, when possible, intervals on REL N , paying the necessary attention on the ranking. As a Example. Let us consider rˆ, ŝ, û ∈ {0, 1}10 : consequence, a measure that satisfy these relations among intervals, rˆ = (0, 0, 1, 0, 1, 1, 0, 0, 1, 0) , although not interval scale, could be viewed as something more ŝ = (0, 1, 0, 1, 0, 1, 1, 1, 1, 0) , powerful than a measure on ordinal scale. Indeed, when the above û = (1, 1, 0, 1, 1, 1, 0, 1, 0, 0) . differences between intervals are comparable, one direction of iff on Theorem 1 is still satisfied. Clearly rˆ  ŝ  û and one can prove that Therefore we can say that a measure M of retrieval effectiveness ∆ ® ŝ r̂ = (0, 1, 1, 2, 2, 2, 3, 5, 7, 9) , is interval-like if, given a distance (potentially vector) ∆ ·· , an ordering d between distances, and given rˆ, ŝ, û, v̂ ∈ REL N , the ∆ ® û ŝ = (1, 2, 3, 4, 6, 8, 9, 10, 10, 10) , following relation holds: that is ∆ ® ŝ r̂ [i] ≤ ∆ ® û ŝ [i] ∀i, with strict inequality for some i. While ∆ŝ r̂ d ∆v̂ û ⇒ M(ŝ) − M(ˆr ) ≤ M(v̂) − M(û). û and ŝ has the same number of relevant documents, rˆ has two The next section is discusses whether some well-known IR mea- relevant documents less than ŝ. In particular DCG(ŝ) − DCG(ˆr ) > sures are interval-like with respect to the difference introduced in DCG(û)−DCG(ŝ) and, for p > 0.85, RBPp (ŝ)−RBPp (ˆr ) > RBPp (û)− Definition 2. RBPp (ŝ), against the inequality given by the difference vectors. Therefore, we can say that RBPp and DCG are interval-like with 3.3 Interval-like Scale Measures respect to the difference introduced in Definition 2 and consider- We tested some measures of retrieval effectiveness – namely AP, ing only intervals where the endpoints have an equal number of RBPp , ERR, DCG – on intervals with comparable differences ac- relevant documents. While AP and ERR are not even interval-like cording to the above definition. since the relations between intervals often fail to be complied with. ERR shows the strongest discordance with our definition of difference, since often it does not respect the relations between 4 CONCLUSIONS AND FUTURE WORK intervals induced by ∆® ·· , as the next example shows. In this paper, we conducted a formal study to propose a new and Example. Let us consider the following four runs r, ˆ ŝ, û, v̂ ∈ {0, 1}10 : more expressive way of providing an empirical ordering of intervals of runs in order to determine how close IR effectiveness measure rˆ = (0, 0, 0, 0, 0, 0, 1, 1, 1, 0) , are to be on an interval scale. Indeed, previous work [10, 11] has ŝ = (0, 0, 0, 0, 0, 1, 0, 1, 1, 0) , shown that they are on an ordinal scale, under some conditions, û = (1, 1, 0, 1, 0, 1, 1, 0, 1, 1) , but not on an interval scale. We have introduced the notion of interval-like scale, a kind of interval scale which admits intervals v̂ = (1, 1, 1, 0, 0, 1, 1, 0, 1, 1) . to not be equi-spaced, and we have shown that both DCG and RBP Clearly rˆ  ŝ  û  v̂. It seems fair to think that rˆ and ŝ give rise are on this scale, under reasonable conditions, while AP and ERR to a smaller interval compared to [û, v̂] – note that the endpoints of are not. both intervals differ by a swap of length one, but made in different Future work will concern an empirical investigation of the dif- positions. Moreover it is easy to prove that ∆ ® ŝ r̂ [i] ≤ ∆ ® v̂ û [i] ∀i. But ferent theoretical properties of evaluation measures we have found while the measures RBPp , AP and DCG agree with the previous in order to determine the impact and severity of not complying statement, ERR does not, since ERR(ŝ) −ERR(ˆr ) > ERR(v̂) −ERR(û). with them when you compute descriptive statistics, like mean and variance, and when you conduct statistical significance tests. Another measure that does not always respect the relations between distances is AP. REFERENCES Example. Let us consider the following runs rˆ, ŝ, û ∈ {0, 1}10 : [1] E. Amigó, J. Gonzalo, and M. F. Verdejo. 2013. A General Evaluation Measure for Document Organization Tasks. In Proc. 36th Annual International ACM SIGIR rˆ = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0) , Conference on Research and Development in Information Retrieval (SIGIR 2013), G. J. F. Jones, P. Sheridan, D. Kelly, M. de Rijke, and T. Sakai (Eds.). ACM Press, ŝ = (0, 1, 0, 0, 1, 0, 0, 0, 0, 1) , New York, USA, 643–652. û = (0, 1, 0, 0, 1, 1, 1, 0, 0, 1) . [2] P. Bollman. 1984. Two Axioms for Evaluation Measures in Information Retrieval. In Proc. of the Third Joint BCS and ACM Symposium on Research and Development in Information Retrieval, C. J. van Rijsbergen (Ed.). Cambridge University Press, Clearly rˆ  ŝ and ŝ  û. The readers can agree to consider the UK, 233–245. interval [ˆr , ŝ] strictly bigger than [ŝ, û], since from û to ŝ we have [3] P. Bollmann and V. S. Cherniavsky. 1980. Measurement-theoretical investigation lost only two relevant documents, while from ŝ to rˆ the information of the MZ-metric. In Proc. 3rd Annual International ACM SIGIR Conference on Re- search and Development in Information Retrieval (SIGIR 1980), C. J. van Rijsbergen lost seems to be higher. Moreover ∆ ® ŝ r̂ [i] ≥ ∆ ® û ŝ [i] ∀i, with strict (Ed.). ACM Press, New York, USA, 256–267. inequality for some i. However while the measures RBPp , ERR and [4] C. Buckley and E. M. Voorhees. 2000. Evaluating Evaluation Measure Stability. In Proc. 23rd Annual International ACM SIGIR Conference on Research and De- DCG agree with this relation between the two intervals, AP does velopment in Information Retrieval (SIGIR 2000), E. Yannakoudakis, N. J. Belkin, not, since AP(ŝ) − AP(ˆr ) < AP(û) − AP(ŝ). M.-K. Leong, and P. Ingwersen (Eds.). ACM Press, New York, USA, 33–40. 14 [5] C. Buckley and E. M. Voorhees. 2004. Retrieval Evaluation with Incomplete (SIGIR 2008), T.-S. Chua, M.-K. Leong, D. W. Oard, and F. Sebastiani (Eds.). ACM Information. In Proc. 27th Annual International ACM SIGIR Conference on Research Press, New York, USA, 51–58. and Development in Information Retrieval (SIGIR 2004), M. Sanderson, K. Järvelin, J. Allan, and P. Bruza (Eds.). ACM Press, New York, USA, 25–32. [6] L. Busin and S. Mizzaro. 2013. Axiometrics: An Axiomatic Approach to Informa- tion Retrieval Effectiveness Metrics. In Proc. 4th International Conference on the Theory of Information Retrieval (ICTIR 2013), O. Kurland, D. Metzler, C. Lioma, B. Larsen, and P. Ingwersen (Eds.). ACM Press, New York, USA, 22–29. [7] B. A. Carterette. 2012. Multiple Testing in Statistical Analysis of Systems-Based Information Retrieval Experiments. ACM Transactions on Information Systems (TOIS) 30, 1 (2012), 4:1–4:34. [8] B. A. Carterette. 2015. Bayesian Inference for Information Retrieval Evaluation. In Proc. 1st ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR 2015), J. Allan, W. B. Croft, A. P. de Vries, C. Zhai, N. Fuhr, and Y. Zhang (Eds.). ACM Press, New York, USA, 31–40. [9] O. Chapelle, D. Metzler, Y. Zhang, and P. Grinspan. 2009. Expected Reciprocal Rank for Graded Relevance. In Proc. 18th International Conference on Information and Knowledge Management (CIKM 2009), D. W.-L. Cheung, I.-Y. Song, W. W. Chu, X. Hu, and J. J. Lin (Eds.). ACM Press, New York, USA, 621–630. [10] M. Ferrante, N. Ferro, and M. Maistro. 2015. Towards a Formal Framework for Utility-oriented Measurements of Retrieval Effectiveness. In Proc. 1st ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR 2015), J. Allan, W. B. Croft, A. P. de Vries, C. Zhai, N. Fuhr, and Y. Zhang (Eds.). ACM Press, New York, USA, 21–30. [11] M. Ferrante, N. Ferro, and S. Pontarollo. 2017. Are IR Evaluation Measures on an Interval Scale?. In Proc. 3rd ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR 2017), J. Kamps, E. Kanoulas, M. de Rijke, H. Fang, and E. Yilmaz (Eds.). ACM Press, New York, USA, 67–74. [12] S. Foldes. 2013. On distances and metrics in discrete ordered sets. arXiv.org, Combinatorics (math.CO) arXiv:1307.0244 (June 2013). [13] D. A. Hull. 1993. Using Statistical Testing in the Evaluation of Retrieval Experi- ments. In Proc. 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 1993), R. Korfhage, E. Rasmussen, and P. Willett (Eds.). ACM Press, New York, USA, 329–338. [14] K. Järvelin and J. Kekäläinen. 2002. Cumulated Gain-Based Evaluation of IR Techniques. ACM Transactions on Information Systems (TOIS) 20, 4 (October 2002), 422–446. [15] D. H. Krantz, R. D. Luce, P. Suppes, and A. Tversky. 1971. Foundations of Mea- surement. Additive and Polynomial Representations. Vol. 1. Academic Press, New York, USA. [16] A. Moffat and J. Zobel. 2008. Rank-biased Precision for Measurement of Retrieval Effectiveness. ACM Transactions on Information Systems (TOIS) 27, 1 (2008), 2:1–2:27. [17] S. Robertson. 2006. On GMAP: and Other Transformations. In Proc. 15th Interna- tional Conference on Information and Knowledge Management (CIKM 2006), P. S. Yu, V. Tsotras, E. A. Fox, and C.-B. Liu (Eds.). ACM Press, New York, USA, 78–83. [18] G. B. Rossi. 2014. Measurement and Probability. A Probabilistic Theory of Mea- surement with Applications. Springer-Verlag, New York, USA. [19] T. Sakai. 2006. Evaluating Evaluation Metrics based on the Bootstrap. In Proc. 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2006), E. N. Efthimiadis, S. Dumais, D. Hawking, and K. Järvelin (Eds.). ACM Press, New York, USA, 525–532. [20] T. Sakai. 2014. Statistical Reform in Information Retrieval? SIGIR Forum 48, 1 (June 2014), 3–12. [21] T. Sakai. 2017. The Probability that Your Hypothesis Is Correct, Credible Intervals, and Effect Sizes for IR Evaluation. In Proc. 40th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2017), N. Kando, T. Sakai, H. Joho, H. Li, A. P. de Vries, and R. W. White (Eds.). ACM Press, New York, USA, 25–34. [22] F. Sebastiani. 2015. An Axiomatically Derived Measure for the Evaluation of Classification Algorithms. In Proc. 1st ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR 2015), J. Allan, W. B. Croft, A. P. de Vries, C. Zhai, N. Fuhr, and Y. Zhang (Eds.). ACM Press, New York, USA, 11–20. [23] M. D. Smucker, J. Allan, and B. A. Carterette. 2007. A Comparison of Statistical Significance Tests for Information Retrieval Evaluation. In Proc. 16th International Conference on Information and Knowledge Management (CIKM 2007), M. J. Silva, A. A. F. Laender, R. Baeza-Yates, D. L. McGuinness, B. Olstad, Ø. H. Olsen, and A. and Falcão (Eds.). ACM Press, New York, USA, 623–632. [24] R. P. Stanley. 2012. Enumerative Combinatorics – Volume 1 (2nd ed.). Cambridge Studies in Advanced Mathematics, Vol. 49. Cambridge University Press, Cam- bridge, UK. [25] S. S. Stevens. 1946. On the Theory of Scales of Measurement. Science, New Series 103, 2684 (June 1946), 677–680. [26] C. J. van Rijsbergen. 1974. Foundations of Evaluation. Journal of Documentation 30, 4 (1974), 365–373. [27] W. Webber, A. Moffat, and J. Zobel. 2008. Score Standardization for Inter- Collection Comparison of Retrieval Systems. In Proc. 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 15