
An interval-like scale property for IR evaluation measures

Marco Ferrante

ferrante@math.unipd.it 1

Nicola Ferro

ferro@dei.unipd.it 0

Silvia Pontarollo

spontaro@math.unipd.it 1

0 Dept. Information Engineering, University of Padua, Italy
1 Dept. Mathematics, University of Padua, Italy

2017

10 15

Evaluation measures play an important role in IR experimental evaluation and their properties determine the kind of statistical analyses we can conduct. It has been previously shown that it is questionable whether IR effectiveness measures are on an interval scale, which implies that computing means and variances is not a permissible operation. In this paper, we investigate whether it is possible to slightly relax the definition of interval scale, introducing the notion of interval-like scale, and to what extent IR effectiveness measures comply with this relaxed definition.

CCS CONCEPTS

• Information systems → Retrieval effectiveness;

1 INTRODUCTION

Evaluation plays a central role in Information Retrieval (IR) and a lot of attention is devoted to improving our evaluation methodologies and practices. For example, for many years there has been a continued interest in how to properly apply statistical techniques to the analysis of IR experimental data, e.g., on the appropriate use of statistical testing [7, 13, 20, 23], on the normalization of measure values for cross-collection comparison [27], or on moving towards Bayesian inference [8, 21], just to name a few.

However, all these studies rely on some, often hidden and implicit, assumptions on what IR effectiveness measures are. In particular, measurement scales [15, 25] determine the operations that are admissible on measure values and, as a consequence, the statistical analyses that can be applied. [25] identifies four major types of scales with increasing properties: (i) the nominal scale consists of discrete unordered values, i.e. categories; (ii) the ordinal scale introduces a natural order among the values; (iii) the interval scale preserves the equality of intervals or differences; and (iv) the ratio scale preserves the equality of ratios. Operations such as computing the mean or the variance are possible only on interval and ratio scales and they constitute the basis of many of the statistical techniques mentioned above. However, are we sure that IR effectiveness measures are on an interval scale? For example, [17] points out that the assumption of Average Precision (AP) being on an interval scale is somewhat arbitrary and so, as a consequence, are some of the descriptive statistics computed from it.

Therefore, researchers started to study what IR effectiveness measures are, not only from an empirical perspective, e.g., [4, 5, 19], but also from a theoretical one, e.g., [1–3, 6, 10, 22, 26].

In this paper, we start from the recent work of [11] and we move a step forward in understanding when and to what extent IR effectiveness measures are on an interval scale.

[11] investigated whether IR effectiveness measures are on an interval scale from the perspective of the representational theory of measurement [15], which is the measurement theory adopted in both physical and social sciences. According to this framework, the key point is to understand how real world objects, i.e., system runs in our case, are related to each other, since measure properties are then derived from these relations. Moreover, it is important that these relations among real world objects are intuitive and sensible to "everybody" and that they can be commonly agreed on.

Therefore, [11] pointed out that the main issues in determining the scale of IR effectiveness measures are: (i) to understand how runs are empirically and intuitively ordered; (ii) to define what an interval of runs is; and, (iii) to determine how these intervals are ordered. Once all these aspects are settled, one can check whether an effectiveness measure complies with them or not and thus determine whether it is on an interval scale or not. In particular, [11] found that under a strong top-heaviness notion of ordering among runs, only Rank-Biased Precision (RBP) [16] with $p = 1/2$ is on an interval scale, while RBP for other values of $p$ and other popular measures, namely AP, Discounted Cumulated Gain (DCG) [14], and Expected Reciprocal Rank (ERR) [9], are not. Moreover, using a weak top-heaviness notion of ordering among runs, [11] found that none of the previously mentioned IR effectiveness measures is on an interval scale.

Strong top-heaviness provides us with a total ordering among runs and, as discussed above, there is at least one case of an IR measure on an interval scale; however, the way in which strong top-heaviness orders runs may give rise to disagreement or corner cases. For example, strong top-heaviness ranks the run $(1, 0, 0, 0)$, with just one top relevant document, before the run $(0, 1, 1, 1)$, with all relevant documents except for the first position; thus, there might be disagreement on whether this is an appropriate ordering for these runs. On the other hand, weak top-heaviness provides us with a much more intuitive partial ordering based on two basic operations, namely swapping two consecutive documents in a ranking and replacing a not relevant document with a relevant one [10]; however, none of the IR evaluation measures is on an interval scale using weak top-heaviness.

The problem with IR effectiveness measures emerging from [11] is two-fold: on the one side, both strong and weak top-heaviness create equi-spaced intervals of runs, as expected by the definition of interval scale, but IR effectiveness measures do not respect this equi-spacing; on the other side, neither strong nor weak top-heaviness accounts enough for the importance and the effect of the rank of a document in a run, since they both rely on the notion of natural distance in a poset (partially ordered set) [24], which flattens things too much, shrinking everything into a single number.

In this paper, we take a different approach to the ordering of intervals of runs, not based on single numbers, as the natural distance of [11] does, but using vectors instead. This new ordering is richer and more expressive than that induced by the natural distances in the strong and weak top-heaviness cases and allows us to introduce the notion of interval-like scale, i.e., something richer than an ordinal scale but a bit less powerful than an interval scale, since runs are ordered and intervals of runs are ordered too, but intervals may not be equi-spaced. In particular, we find that, under reasonable assumptions, DCG and RBP are on an interval-like scale while AP and ERR are not.

The paper is organized as follows: Section 2 recaps some basic concepts about the representational theory of measurement and posets; Section 3 deals with interval-like scales; finally, Section 4 wraps up the discussion and outlines some future work.

2 BACKGROUND

2.1 Representational Theory of Measurement

A relational structure [15, 18] is an ordered pair $\mathcal{X} = \langle X, R_X \rangle$ of a domain set $X$ and a set of relations $R_X$ on $X$, where the relations in $R_X$ may have different arities, i.e. they can be unary, binary, ternary relations and so on. Given two relational structures $\mathcal{X}$ and $\mathcal{Y}$, a homomorphism $\mathbf{M} : \mathcal{X} \to \mathcal{Y}$ from $\mathcal{X}$ to $\mathcal{Y}$ is a mapping $\mathbf{M} = \langle M, M_R \rangle$ where: (i) $M$ is a function that maps $X$ into $M(X) \subseteq Y$, i.e. for each element of the domain set there exists one corresponding image element; (ii) $M_R$ is a function that maps $R_X$ into $M_R(R_X) \subseteq R_Y$ such that $\forall r \in R_X$, $r$ and $M_R(r)$ have the same arity, i.e. for each relation on the domain set there exists one (and, as is usually and often implicitly assumed, only one) corresponding image relation; (iii) $\forall r \in R_X$, $\forall x_i \in X$, if $r(x_1, \ldots, x_n)$ then $M_R(r)\big(M(x_1), \ldots, M(x_n)\big)$, i.e. if a relation holds for some elements of the domain set then the image relation must hold for the image elements.

A relational structure $\mathcal{E}$ is called empirical if its domain set $E$ spans over the entities under consideration in the real world, i.e. the system runs in our case; a relational structure $\mathcal{S}$ is called symbolic if its domain set $S$ spans over a given set of numbers. A measurement (scale) is the homomorphism $\mathbf{M} = \langle M, M_R \rangle$ from the real world to the symbolic world and a measure is the number assigned to an entity by this mapping.

2.2 Measurement Scales

[11] relied on the notion of difference structure [15, 18] to introduce a definition of interval among system runs in such a way that it ensures the existence of an interval scale.

Given $E$, a weakly ordered empirical structure is a pair $(E, \preceq)$ where, for every $a, b, c \in E$: (i) $a \preceq b$ or $b \preceq a$; and (ii) $a \preceq b$ and $b \preceq c \Rightarrow a \preceq c$.

Given $(E, \preceq)$, we have to define a difference $\Delta_{ab}$ between two elements $a, b \in E$, which is a kind of signed distance we exploit to compare intervals. Then, we have to define a weak order $\preceq_d$ between these $\Delta_{ab}$ differences. We can proceed as follows: if two elements $a, b \in E$ are such that $a \sim b$, i.e. $a \preceq b$ and $b \preceq a$, then the interval $[a, b]$ is null and, consequently, we set $\Delta_{ab} \sim_d \Delta_{ba}$; if $a \prec b$ we agree upon choosing $\Delta_{aa} \prec_d \Delta_{ab}$ which, in turn, implies that $\Delta_{aa} \succ_d \Delta_{ba}$.

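Definition 1. Let $E$ be a finite (not empty) set of objects. Let $\preceq_d$ be a binary relation on $E \times E$ that satisfies, for each $a, b, c, d, a', b', c' \in E$, the following axioms:
i. $\preceq_d$ is a weak order;
ii. if $\Delta_{ab} \preceq_d \Delta_{cd}$, then $\Delta_{dc} \preceq_d \Delta_{ba}$;
iii. if $\Delta_{ab} \preceq_d \Delta_{a'b'}$ and $\Delta_{bc} \preceq_d \Delta_{b'c'}$, then $\Delta_{ac} \preceq_d \Delta_{a'c'}$;
iv. Solvability Condition: if $\Delta_{aa} \preceq_d \Delta_{cd} \preceq_d \Delta_{ab}$, then there exist $d', d'' \in E$ such that $\Delta_{ad'} \sim_d \Delta_{cd} \sim_d \Delta_{d''b}$.
Then $(E, \preceq_d)$ is a difference structure.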

Particular attention has to be paid to the Solvability Condition, which ensures the existence of an equally spaced gradation between the elements of $E$, indispensable to construct an interval scale measurement.

The representation theorem for difference structures states:

Theorem 1. Let $E$ be a finite (not empty) set of objects and let $(E, \preceq_d)$ be a difference structure. Then there exists a measurement scale $M : E \to \mathbb{R}$ such that for every $a, b, c, d \in E$

$\Delta_{ab} \preceq_d \Delta_{cd} \iff M(b) - M(a) \le M(d) - M(c)$.

This theorem ensures that, if there is a difference structure on the empirical set $E$, then there exists an interval scale $M$.

As anticipated in Section 1, we will introduce the notion of interval-like scale, which corresponds to removing the solvability condition from the definition of difference structure and obtaining a new partial ordering of the intervals of runs.

2.3 Posets

A partially ordered set $P$, poset for short, is a set with a partial order $\preceq$ defined on it [24]. A partial order $\preceq$ is a binary relation over $P$ which is reflexive, antisymmetric and transitive. Given $s, t \in P$, we say that $s$ and $t$ are comparable if $s \preceq t$ or $t \preceq s$, otherwise they are incomparable.

A closed interval is a subset of $P$ defined as $[s, t] := \{u \in P : s \preceq u \preceq t\}$, where $s, t \in P$ and $s \preceq t$. Moreover, we say that $t$ covers $s$ if $s \prec t$ and $[s, t] = \{s, t\}$; that is, there does not exist $u \in P$ such that $s \prec u \prec t$.

We can represent a finite poset $P$ by using the Hasse diagram, which is a graph where vertices are the elements of $P$, edges represent the cover relations, and if $s \prec t$ then $s$ is below $t$ in the diagram.

A subset $C$ of a poset $P$ is a chain if any two elements of $C$ are comparable: a chain is a totally ordered subset of a poset. If $C$ is a finite chain, the length of $C$, $\ell(C)$, is defined by $\ell(C) = |C| - 1$. A maximal chain of $P$ is a chain that is not a proper subset of any other chain of $P$.

If every maximal chain of $P$ has the same length $n$, we say that $P$ is graded of rank $n$; in particular there exists a unique function $\rho : P \to \{0, 1, \ldots, n\}$, called the rank function, such that $\rho(s) = 0$, if $s$ is a minimal element of $P$, and $\rho(t) = \rho(s) + 1$, if $t$ covers $s$.

Finally, since any interval on a graded poset is graded, the length of an interval $[s, t]$ is given by $\ell(s, t) := \ell([s, t]) = \rho(t) - \rho(s)$, also called the natural distance.

3 INTERVAL-LIKE SCALES

3.1 Preliminary Definitions

Given $N$, the length of the run, we define the set of retrieved documents as $D(N) = \{(d_1, \ldots, d_N) : d_i \in D, d_i \ne d_j \text{ for any } i \ne j\}$, i.e. the ranked list of retrieved documents without duplicates, and the universe set of retrieved documents as $\mathcal{D} := \bigcup_{N=1}^{|D|} D(N)$. A run $r_t$, retrieving a ranked list of documents $D(N)$ in response to a topic $t \in T$, is a function from $T$ into $\mathcal{D}$

$t \mapsto r_t = (d_1, \ldots, d_N)$.

We denote by $r_t[j]$ the $j$-th element of the vector $r_t$, i.e. $r_t[j] = d_j$.

We define the universe set of judged documents as $\mathcal{R} := \bigcup_{N=1}^{|D|} REL^N$, where $REL^N$ is the set of the ranked lists of judged retrieved documents with length fixed to $N$. Since in our case $REL = \{0, 1\}$, $REL^N = \{0, 1\}^N$ refers to the space of all vectors of length $N$ consisting of 0 and 1. As for the set-based case, we denote by $RB_t$ the recall base, i.e. the total number of relevant documents for a topic.

We call judged run the function $\hat{r}_t$ from $T \times \mathcal{D}$ into $\mathcal{R}$, which assigns a relevance degree to each retrieved document in the ranked list

$(t, r_t) \mapsto \hat{r}_t = \big(GT(t, d_1), \ldots, GT(t, d_N)\big)$.

We denote by $\hat{r}_t[j]$ the $j$-th element of the vector $\hat{r}_t$, i.e. $\hat{r}_t[j] = GT(t, d_j)$.

As for the set-based case, we can simplify the notation by omitting the dependence on topics, writing $\hat{r} := (\hat{r}[1], \ldots, \hat{r}[N])$, $RB$, and so on.

3.2 Ordering between Intervals

Let us start by recalling the ordering between runs adopted in this paper, which is based on the following two monotonicity-like properties proposed by [10]:

Replacement: A measure of retrieval effectiveness should not decrease when replacing a document with another one in the same rank position with higher degree of relevance.

Swap: If we swap a less relevant document with a more relevant one in a lower rank position, the measure should not decrease.

These two properties lead to the following partial ordering among system runs

$\hat{r} \preceq \hat{s} \iff \sum_{j=1}^{k} \hat{r}[j] \le \sum_{j=1}^{k} \hat{s}[j] \quad \forall k \in \{1, \ldots, N\}$.   (1)

This ordering considers a run bigger than another one when, for each rank position, it has at least as many relevant documents as the other one up to that rank.
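As an illustration, ordering (1) can be checked with a few lines of Python. This is only a sketch for binary judged runs represented as tuples of 0 and 1; the function names `cumulative_sums` and `precedes` are ours and do not come from [10] or [11].

```python
from itertools import accumulate

def cumulative_sums(run):
    """Return the prefix sums of a binary judged run."""
    return list(accumulate(run))

def precedes(r, s):
    """True if r precedes s under ordering (1): every prefix of s holds
    at least as many relevant documents as the same prefix of r."""
    if len(r) != len(s):
        raise ValueError("runs must have the same length N")
    return all(a <= b for a, b in zip(cumulative_sums(r), cumulative_sums(s)))

# The two runs discussed in Section 1 are incomparable under ordering (1),
# which is why the ordering is only partial.
a = (1, 0, 0, 0)
b = (0, 1, 1, 1)
print(precedes(a, b), precedes(b, a))  # False False
```

Two runs for which neither call returns `True` are incomparable, so no interval between them is defined.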

This is the same ordering of runs used by [11] in the weak top-heaviness case but, differently from [11], we now introduce a different notion of length of an interval, not based on the natural distance which, as discussed in Section 1, has the drawback of flattening everything into a single number.

To define the length of an interval we adopt the following strategy: given $\hat{r}, \hat{s} \in REL^N$ with $\hat{r} \preceq \hat{s}$, we count how many replacements in the last position and how many forward single-step swaps at each depth are necessary to go from $\hat{r}$ to $\hat{s}$ following a maximal chain in $REL^N$. In order to do this, it is useful to define the cumulative sums of a vector $v = (v[1], \ldots, v[N])$, denoted using the capital letter as $V = (V[1], \ldots, V[N])$, where $V[j] = \sum_{i=1}^{j} v[i]$.

Let us start with a simple example.

Example. Consider the two judged runs in $REL^4$

$\hat{0} = (0, 0, 0, 0)$, $\hat{r} = (0, 1, 1, 0)$.

Since $\hat{0} \preceq \hat{r}$, in order to construct a chain from $\hat{0}$ to $\hat{r}$ with the two basic operators (replacement in last position and single-step forward swap) we get, for instance,

$(0, 0, 0, 0) \to (0, 0, 0, 1) \to (0, 0, 1, 0) \to (0, 1, 0, 0) \to (0, 1, 0, 1) \to (0, 1, 1, 0)$.

We have made two replacements in the fourth position, one swap in the second position and two in the third one. Recall that by swap at depth $i$ we mean that a forward swap from position $i + 1$ to position $i$ was done. We can count how many of these basic operations in each position are needed to go from $\hat{0}$ to $\hat{r}$ just by taking the cumulative sums of $\hat{r}$. Indeed we get

$\hat{R} = (0, 1, 2, 2)$,

and each entry $k < N$ of $\hat{R}$, $\hat{R}[k]$, counts the number of swaps made in position $k$, while $\hat{R}[N]$ counts the number of replacements, i.e. the total mass of $\hat{r}$, to go from $\hat{0}$ to $\hat{r}$.

More generally, given two vectors $\hat{r}, \hat{s} \in REL^N$, with $\hat{r} \preceq \hat{s}$, in order to collect the number of basic operations made at each position to go from $\hat{r}$ to $\hat{s}$, we can compute this vector of length $N$ first between $\hat{0}$ and $\hat{r}$ and between $\hat{0}$ and $\hat{s}$, namely $\hat{R}$ and $\hat{S}$, and then subtract the two vectors. Precisely, $\hat{S} - \hat{R}$ leads to a new vector of length $N$, where each entry $k$ equals the number of swaps or replacements (if $k = N$) needed to go from $\hat{r}$ to $\hat{s}$.

Example. In order to better understand this mechanism, let us consider a second example. Consider the two judged runs in $REL^4$

$\hat{r} = (0, 1, 0, 0)$, $\hat{s} = (1, 0, 1, 0)$.

In order to construct a chain from $\hat{r}$ to $\hat{s}$ with the two basic operators (replacement in last position and single-step forward swap) we get

$\hat{r} = (0, 1, 0, 0)$, $\hat{v} = (1, 0, 0, 0)$, $\hat{w} = (1, 0, 0, 1)$, $\hat{s} = (1, 0, 1, 0)$.

We have made a swap in the first and third positions and a replacement in the fourth position, which we can collect in the vector $\hat{S} - \hat{R} = (1, 0, 1, 1)$.

Moreover, for a pair of runs of length 10 with $\hat{S} - \hat{R} = (0, 1, 1, 1, 2, 1, 1, 1, 0, 1)$, let $t = \hat{S} - \hat{R}$. For any $i < 10$, $t[i]$ tells us how many swaps one needs to do at depth $i$ to make the smallest run coincide with the biggest one. Moreover, if the total number of relevant documents is not equal for both, as in this example, the last entry of $t$, $t[N]$, is exactly the number of replacements on $\hat{r}$ one needs to make, and coincides with $\sum_i \hat{s}[i] - \sum_i \hat{r}[i]$.
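The counting mechanism described above can be sketched in Python as follows. The helper name `operation_counts` is ours; the runs are those of the second example, and the sketch assumes binary judged runs with $\hat{r} \preceq \hat{s}$.

```python
from itertools import accumulate

def operation_counts(r, s):
    """Return t = S - R, the difference of the cumulative sums of s and r.

    Entry k (for k < N) counts the single-step forward swaps at depth k,
    while the last entry counts the replacements in the last position.
    """
    R = list(accumulate(r))
    S = list(accumulate(s))
    t = [b - a for a, b in zip(R, S)]
    if any(x < 0 for x in t):
        raise ValueError("r does not precede s, so no chain from r to s exists")
    return t

r = (0, 1, 0, 0)
s = (1, 0, 1, 0)
t = operation_counts(r, s)
print(t)                         # [1, 0, 1, 1]
print(t[-1] == sum(s) - sum(r))  # True
```

The last line checks the remark that the final entry equals the difference in the number of relevant documents between the two runs.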

Given an interval $[\hat{r}, \hat{s}]$, if we take the cumulative sums of $t = \hat{S} - \hat{R}$ we obtain the vector $T$ that counts, for every $i \le N$, the total number of swaps (or replacements, if $i = N$) made from depth 1 to $i$ between the endpoints of the given interval. The vector $T$ can be seen as a new and generalized definition of the length of the interval $[\hat{r}, \hat{s}]$, which replaces the natural distance used by [11].

According to this new distance, we say that the interval $[\hat{r}_1, \hat{s}_1]$ is smaller than or equal to the interval $[\hat{r}_2, \hat{s}_2]$ if, for the vectors $T_1$ and $T_2$ of their cumulative sums, it holds that $T_1[i] \le T_2[i]$ for any $i \le N$. It is worth noticing that, if we take as definition of length any convex linear combination of the values $(T[1], \ldots, T[N])$, the intervals comparable for the previous ordering remain comparable. Other intervals become comparable for any fixed linear combination, but it is not possible to say in advance that they are ordered in the same way by any two of these combinations.
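The remark on convex combinations can be illustrated with a small Python sketch. The two intervals and the two weight vectors below are our own illustrative choices, not taken from the paper: the intervals are incomparable under the componentwise ordering, and two different convex combinations order them in opposite ways.

```python
from itertools import accumulate

def difference_vector(r, s):
    """Cumulative sums of t = S - R, i.e. the vector T of the text."""
    t = [b - a for a, b in zip(accumulate(r), accumulate(s))]
    return list(accumulate(t))

def weighted_length(T, weights):
    """Scalar length obtained as a convex combination of the entries of T."""
    assert abs(sum(weights) - 1.0) < 1e-9 and all(w >= 0 for w in weights)
    return sum(w * x for w, x in zip(weights, T))

# Two hypothetical intervals in REL^4 whose vectors T are incomparable.
T1 = difference_vector((0, 1, 0, 0), (1, 0, 0, 0))  # [1, 1, 1, 1]
T2 = difference_vector((0, 0, 0, 0), (0, 0, 1, 0))  # [0, 0, 1, 2]

top_heavy = (0.7, 0.1, 0.1, 0.1)     # illustrative weights favouring early ranks
bottom_heavy = (0.1, 0.1, 0.1, 0.7)  # illustrative weights favouring late ranks

print(weighted_length(T1, top_heavy) > weighted_length(T2, top_heavy))        # True
print(weighted_length(T1, bottom_heavy) < weighted_length(T2, bottom_heavy))  # True
```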

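We are now able to define a difference in this setting:

Definition 2. Given $\hat{r}, \hat{s} \in REL^N$, with $\hat{r} \preceq \hat{s}$, the difference $\vec{\Delta}_{\hat{s}\hat{r}}$ is a vector of length $N$ such that

$\vec{\Delta}_{\hat{s}\hat{r}}[i] := \sum_{j=1}^{i} (i - j + 1) \big(\hat{s}[j] - \hat{r}[j]\big)$, for all $i \in \{1, \ldots, N\}$.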

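It can be easily proved that $\vec{\Delta}_{\hat{s}\hat{r}}$ is exactly the vector $T$ defined above. Indeed, by construction, given $\hat{r}, \hat{s} \in REL^N$ with $\hat{r} \preceq \hat{s}$, $t[j] = \sum_{n=1}^{j} \big(\hat{s}[n] - \hat{r}[n]\big)$. Therefore $T[i] = \sum_{j=1}^{i} t[j] = \sum_{j=1}^{i} \sum_{n=1}^{j} \big(\hat{s}[n] - \hat{r}[n]\big) = \sum_{j=1}^{i} (i - j + 1) \big(\hat{s}[j] - \hat{r}[j]\big)$.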
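The identity between Definition 2 and the vector $T$ can also be checked numerically. The following Python sketch implements both forms (function names are ours) and compares them on the runs of the second example of this section.

```python
from itertools import accumulate

def difference_vector(r, s):
    """Closed form of Definition 2 (0-based indices, so the weight
    i - j + 1 is unchanged with respect to the 1-based formula)."""
    n = len(r)
    return [
        sum((i - j + 1) * (s[j] - r[j]) for j in range(i + 1))
        for i in range(n)
    ]

def difference_vector_by_cumsum(r, s):
    """Same vector obtained as the cumulative sums of t = S - R."""
    t = [b - a for a, b in zip(accumulate(r), accumulate(s))]
    return list(accumulate(t))

r = (0, 1, 0, 0)
s = (1, 0, 1, 0)
print(difference_vector(r, s))            # [1, 1, 2, 3]
print(difference_vector_by_cumsum(r, s))  # [1, 1, 2, 3]
```

The cumulative-sum form needs only two linear passes, while the closed form is quadratic in $N$; for the run lengths used in IR evaluation either is adequate.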

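Moreover, when computing the difference vector $\vec{\Delta}$ between two comparable runs $\hat{r}, \hat{s}$, in this work we write $\vec{\Delta}_{\hat{s}\hat{r}}$ whenever $\hat{r} \preceq \hat{s}$: if we instead consider $\vec{\Delta}_{\hat{r}\hat{s}}$, then we are counting the backward swaps from $\hat{s}$ to $\hat{r}$ and $\vec{\Delta}_{\hat{r}\hat{s}}[i] \le 0$ for all $i \in \{1, \ldots, N\}$.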

Since here $\vec{\Delta}$ is no longer a scalar but a vector, we have to define the partial order $\preceq_d$ among intervals of runs as follows:

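Definition 3. Given $[\hat{r}, \hat{s}], [\hat{u}, \hat{v}] \subseteq REL^N$, $\vec{\Delta}_{\hat{v}\hat{u}} \preceq_d \vec{\Delta}_{\hat{s}\hat{r}}$ if and only if $\vec{\Delta}_{\hat{v}\hat{u}}[i] \le \vec{\Delta}_{\hat{s}\hat{r}}[i]$, $\forall i \in \{1, \ldots, N\}$.

Example. With respect to the previous example, where $t = \hat{S} - \hat{R} = (0, 1, 1, 1, 2, 1, 1, 1, 0, 1)$, the vector $\vec{\Delta}_{\hat{s}\hat{r}}$ is given by

$\vec{\Delta}_{\hat{s}\hat{r}} = T = (0, 1, 2, 3, 5, 6, 7, 8, 8, 9)$.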

Let now $\hat{u}, \hat{v} \in \{0, 1\}^{10}$ be as follows

$\hat{u} = (1, 0, 0, 1, 0, 1, 1, 1, 0, 0)$, $\hat{v} = (1, 0, 1, 1, 1, 0, 1, 0, 0, 0)$.

Clearly $\hat{u} \preceq \hat{v}$ and

$\vec{\Delta}_{\hat{v}\hat{u}} = (0, 0, 1, 2, 4, 5, 6, 6, 6, 6)$.

Thus we can conclude that the difference between $\hat{s}$ and $\hat{r}$ is greater than the difference between $\hat{v}$ and $\hat{u}$.
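The comparison of Definition 3 on this example can be reproduced with the following Python sketch. The vector for the interval $[\hat{r}, \hat{s}]$ is copied from the text, while the one for $[\hat{u}, \hat{v}]$ is recomputed from the runs; the function names are ours.

```python
from itertools import accumulate

def difference_vector(r, s):
    """Difference vector of Definition 2, computed via cumulative sums."""
    t = [b - a for a, b in zip(accumulate(r), accumulate(s))]
    return list(accumulate(t))

def interval_leq(delta_1, delta_2):
    """Componentwise comparison of two difference vectors (Definition 3)."""
    return all(x <= y for x, y in zip(delta_1, delta_2))

delta_sr = [0, 1, 2, 3, 5, 6, 7, 8, 8, 9]  # value of T given in the text
u = (1, 0, 0, 1, 0, 1, 1, 1, 0, 0)
v = (1, 0, 1, 1, 1, 0, 1, 0, 0, 0)
delta_vu = difference_vector(u, v)

print(delta_vu)                          # [0, 0, 1, 2, 4, 5, 6, 6, 6, 6]
print(interval_leq(delta_vu, delta_sr))  # True
print(interval_leq(delta_sr, delta_vu))  # False
```

When both calls to `interval_leq` return `False`, the two intervals are incomparable under this partial order.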

Note that the last entry of $\vec{\Delta}$ always equals the natural distance as defined in Section 2.3 and used by [11]. Indeed, given two comparable runs $\hat{r}, \hat{s} \in REL^N$, with $\hat{r} \preceq \hat{s}$, $\vec{\Delta}_{\hat{s}\hat{r}}[N]$ counts the total number of forward swaps of length one and/or replacements done from $\hat{r}$ to match $\hat{s}$. Since swaps of length one and replacements in the last position are elementary operations, as observed above, $\vec{\Delta}_{\hat{s}\hat{r}}[N]$ is just counting the length of every maximal chain in $[\hat{r}, \hat{s}]$, i.e., exactly the natural distance.
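For small $N$ this claim can be verified exhaustively. The Python sketch below is our own check, with $N = 4$ chosen only to keep the search tiny: it enumerates $REL^N$, computes the length of a longest chain between every comparable pair by brute force, and compares it with the last entry of the difference vector.

```python
from functools import lru_cache
from itertools import accumulate, product

def precedes(r, s):
    """Ordering (1): prefix sums of r never exceed those of s."""
    return all(a <= b for a, b in zip(accumulate(r), accumulate(s)))

def difference_vector(r, s):
    """Difference vector of Definition 2, via cumulative sums."""
    t = [b - a for a, b in zip(accumulate(r), accumulate(s))]
    return list(accumulate(t))

def natural_distance(r, s, runs):
    """Length of a longest chain from r to s, found by exhaustive search."""
    @lru_cache(maxsize=None)
    def longest_from(x):
        if x == s:
            return 0
        # Elements strictly above x that still lie inside the interval [r, s].
        above = [y for y in runs if y != x and precedes(x, y) and precedes(y, s)]
        return 1 + max(longest_from(y) for y in above)
    return longest_from(r)

N = 4
runs = list(product((0, 1), repeat=N))
mismatches = [
    (r, s)
    for r in runs
    for s in runs
    if precedes(r, s) and natural_distance(r, s, runs) != difference_vector(r, s)[-1]
]
print(mismatches)  # []
```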

This definition of difference vector solves some of the problems encountered with the difference defined using the natural distance, as the following example shows.

Example. Let $\hat{r}, \hat{s}, \hat{u}, \hat{v}$ be defined as follows:

$\hat{r} = (0, 1, 0, 0, 0, 0, 0, 0, 0, 0)$, $\hat{s} = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)$,
$\hat{u} = (0, 0, 0, 0, 0, 0, 0, 0, 0, 1)$, $\hat{v} = (0, 0, 0, 0, 0, 0, 0, 0, 1, 0)$,

where $\hat{r} \preceq \hat{s}$ and $\hat{u} \preceq \hat{v}$.

As already discussed, the natural distance induces a difference between runs that does not keep track of the rank. In this case, the natural distance would say that both pairs $\hat{r}, \hat{s}$ and $\hat{u}, \hat{v}$ have difference equal to 1, even though the two pairs differ a lot in terms of where the differences actually happen in the ranking.

Instead, $\vec{\Delta}$ shows a bigger difference between $\hat{r}$ and $\hat{s}$ compared to the other two runs, because their differences happen in higher and more important rank positions:

$\vec{\Delta}_{\hat{s}\hat{r}} = (1, 1, 1, 1, 1, 1, 1, 1, 1, 1)$
$\vec{\Delta}_{\hat{v}\hat{u}} = (0, 0, 0, 0, 0, 0, 0, 0, 1, 1)$,

and $\vec{\Delta}_{\hat{s}\hat{r}}[i] \ge \vec{\Delta}_{\hat{v}\hat{u}}[i]$ for every $i \in \{1, \ldots, 10\}$.

Therefore, this new and more expressive difference better matches the intuition that the higher the rank position at which a difference between two runs happens, the more important that difference is.

The vector $\vec{\Delta}$ is thus useful to compare, when possible, intervals on $REL^N$, paying the necessary attention to the ranking. As a consequence, a measure that satisfies these relations among intervals, although not on an interval scale, could be viewed as something more powerful than a measure on an ordinal scale. Indeed, when the above differences between intervals are comparable, one direction of the "if and only if" in Theorem 1 is still satisfied.

Therefore we can say that a measure $M$ of retrieval effectiveness is interval-like if, given a distance (potentially a vector) $\Delta$, an ordering $\preceq_d$ between distances, and given $\hat{r}, \hat{s}, \hat{u}, \hat{v} \in REL^N$, the following relation holds:

$\Delta_{\hat{s}\hat{r}} \preceq_d \Delta_{\hat{v}\hat{u}} \Rightarrow M(\hat{s}) - M(\hat{r}) \le M(\hat{v}) - M(\hat{u})$.

The next section discusses whether some well-known IR measures are interval-like with respect to the difference introduced in Definition 2.
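A minimal Python sketch of this check for a single pair of intervals is given below. It assumes the difference vector of Definition 2 and, purely as an example measure, the usual formulation of RBP, $(1 - p) \sum_i \hat{r}[i] p^{i-1}$, with $p = 0.8$; the function names and the tolerance are our own choices. The runs are those of the last example of Section 3.2.

```python
from itertools import accumulate

def difference_vector(r, s):
    """Difference vector of Definition 2, via cumulative sums."""
    t = [b - a for a, b in zip(accumulate(r), accumulate(s))]
    return list(accumulate(t))

def interval_leq(delta_1, delta_2):
    """Componentwise order between difference vectors (Definition 3)."""
    return all(x <= y for x, y in zip(delta_1, delta_2))

def rbp(run, p=0.8):
    """Rank-Biased Precision in its usual form (assumed here)."""
    return (1 - p) * sum(rel * p ** k for k, rel in enumerate(run))

def check_pair(measure, r, s, u, v, tol=1e-12):
    """Return None when the premise (interval [r, s] not larger than
    [u, v]) fails, otherwise whether M(s) - M(r) <= M(v) - M(u)."""
    if not interval_leq(difference_vector(r, s), difference_vector(u, v)):
        return None
    return measure(s) - measure(r) <= measure(v) - measure(u) + tol

r = (0, 1, 0, 0, 0, 0, 0, 0, 0, 0)
s = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
u = (0, 0, 0, 0, 0, 0, 0, 0, 0, 1)
v = (0, 0, 0, 0, 0, 0, 0, 0, 1, 0)

print(check_pair(rbp, u, v, r, s))  # True
print(check_pair(rbp, r, s, u, v))  # None
```

A single `True` only shows that the measure respects the ordering on that pair of intervals; establishing that a measure is interval-like requires the implication to hold for all comparable pairs.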

3.3 Interval-like Scale Measures

We tested some measures of retrieval effectiveness, namely AP, RBP$_p$, ERR, and DCG, on intervals with comparable differences according to the above definition.

ERR shows the strongest discordance with our definition of difference, since it often does not respect the relations between intervals induced by $\vec{\Delta}$, as the next example shows.

Example. Let us consider the following four runs $\hat{r}, \hat{s}, \hat{u}, \hat{v} \in \{0, 1\}^{10}$:

$\hat{r} = (0, 0, 0, 0, 0, 0, 1, 1, 1, 0)$, $\hat{s} = (0, 0, 0, 0, 0, 1, 0, 1, 1, 0)$,
$\hat{u} = (1, 1, 0, 1, 0, 1, 1, 0, 1, 1)$, $\hat{v} = (1, 1, 1, 0, 0, 1, 1, 0, 1, 1)$.

Clearly $\hat{r} \preceq \hat{s} \preceq \hat{u} \preceq \hat{v}$. It seems fair to think that $\hat{r}$ and $\hat{s}$ give rise to a smaller interval compared to $[\hat{u}, \hat{v}]$: note that the endpoints of both intervals differ by a swap of length one, but made in different positions. Moreover it is easy to prove that $\vec{\Delta}_{\hat{s}\hat{r}}[i] \le \vec{\Delta}_{\hat{v}\hat{u}}[i]$ $\forall i$. But while the measures RBP$_p$, AP and DCG agree with the previous statement, ERR does not, since $ERR(\hat{s}) - ERR(\hat{r}) > ERR(\hat{v}) - ERR(\hat{u})$.
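The ERR part of this example can be reproduced with the Python sketch below. It assumes the common cascade formulation of ERR in which a relevant document satisfies the user with probability 1/2 (the binary case of the mapping in [9]); this parameter and the function names are our assumptions, since the text does not spell out the exact variant used.

```python
from itertools import accumulate

def difference_vector(r, s):
    """Difference vector of Definition 2, via two rounds of cumulative sums."""
    t = [b - a for a, b in zip(accumulate(r), accumulate(s))]
    return list(accumulate(t))

def err(run, stop_prob=0.5):
    """ERR for binary relevance; stop_prob is the assumed probability that
    a relevant document satisfies the user (non-relevant ones never do)."""
    score, p_reach = 0.0, 1.0
    for rank, rel in enumerate(run, start=1):
        p_stop = stop_prob if rel else 0.0
        score += p_reach * p_stop / rank
        p_reach *= 1.0 - p_stop
    return score

r = (0, 0, 0, 0, 0, 0, 1, 1, 1, 0)
s = (0, 0, 0, 0, 0, 1, 0, 1, 1, 0)
u = (1, 1, 0, 1, 0, 1, 1, 0, 1, 1)
v = (1, 1, 1, 0, 0, 1, 1, 0, 1, 1)

d_sr, d_vu = difference_vector(r, s), difference_vector(u, v)
premise = all(a <= b for a, b in zip(d_sr, d_vu))
gain_sr = err(s) - err(r)
gain_vu = err(v) - err(u)

print(premise)            # True
print(gain_sr > gain_vu)  # True
```

Under these assumptions the interval $[\hat{r}, \hat{s}]$ is the smaller one according to the difference vectors, yet ERR assigns it the larger score difference.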

Another measure that does not always respect the relations between distances is AP.

Example. Let us consider the following runs $\hat{r}, \hat{s}, \hat{u} \in \{0, 1\}^{10}$:

$\hat{r} = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0)$, $\hat{s} = (0, 1, 0, 0, 1, 0, 0, 0, 0, 1)$, $\hat{u} = (0, 1, 0, 0, 1, 1, 1, 0, 0, 1)$.

Clearly $\hat{r} \preceq \hat{s}$ and $\hat{s} \preceq \hat{u}$. The reader may agree to consider the interval $[\hat{r}, \hat{s}]$ strictly bigger than $[\hat{s}, \hat{u}]$, since from $\hat{u}$ to $\hat{s}$ we have lost only two relevant documents, while from $\hat{s}$ to $\hat{r}$ the information lost seems to be higher. Moreover $\vec{\Delta}_{\hat{s}\hat{r}}[i] \ge \vec{\Delta}_{\hat{u}\hat{s}}[i]$ $\forall i$, with strict inequality for some $i$. However, while the measures RBP$_p$, ERR and DCG agree with this relation between the two intervals, AP does not, since $AP(\hat{s}) - AP(\hat{r}) < AP(\hat{u}) - AP(\hat{s})$.
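The AP part of this example can be checked with the following Python sketch. It assumes the standard definition of AP, i.e. the sum of the precision values at the ranks of the relevant retrieved documents divided by the recall base; the recall base value of 5 is a hypothetical choice (any positive value common to the three runs gives the same comparison), and the function names are ours.

```python
from itertools import accumulate

def difference_vector(r, s):
    """Difference vector of Definition 2, via cumulative sums."""
    t = [b - a for a, b in zip(accumulate(r), accumulate(s))]
    return list(accumulate(t))

def average_precision(run, recall_base):
    """Standard AP for a binary judged run (assumed formulation)."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(run, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / recall_base

RB = 5  # hypothetical recall base shared by the three runs

r = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
s = (0, 1, 0, 0, 1, 0, 0, 0, 0, 1)
u = (0, 1, 0, 0, 1, 1, 1, 0, 0, 1)

d_sr, d_us = difference_vector(r, s), difference_vector(s, u)
premise = all(a >= b for a, b in zip(d_sr, d_us))
gain_sr = average_precision(s, RB) - average_precision(r, RB)
gain_us = average_precision(u, RB) - average_precision(s, RB)

print(premise)            # True
print(gain_sr < gain_us)  # True
```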

Instead, RBP$_p$ and DCG show a greater agreement with the inequalities between intervals induced by $\vec{\Delta}$, even if sometimes they do not respect these relations: this happens when the endpoints of an interval do not have an equal number of relevant documents.

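Example. Let us consider $\hat{r}, \hat{s}, \hat{u} \in \{0, 1\}^{10}$ with $\hat{r} \preceq \hat{s} \preceq \hat{u}$ for which the interval $[\hat{r}, \hat{s}]$ is smaller than $[\hat{s}, \hat{u}]$, that is $\vec{\Delta}_{\hat{s}\hat{r}}[i] \le \vec{\Delta}_{\hat{u}\hat{s}}[i]$ $\forall i$, with strict inequality for some $i$. While $\hat{u}$ and $\hat{s}$ have the same number of relevant documents, $\hat{r}$ has two relevant documents fewer than $\hat{s}$. In particular $DCG(\hat{s}) - DCG(\hat{r}) > DCG(\hat{u}) - DCG(\hat{s})$ and, for $p > 0.85$, $RBP_p(\hat{s}) - RBP_p(\hat{r}) > RBP_p(\hat{u}) - RBP_p(\hat{s})$, against the inequality given by the difference vectors.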

Therefore, we can say that RBP$_p$ and DCG are interval-like with respect to the difference introduced in Definition 2 when considering only intervals whose endpoints have an equal number of relevant documents, while AP and ERR are not even interval-like, since they often fail to comply with the relations between intervals.

4 CONCLUSIONS AND FUTURE WORK

In this paper, we conducted a formal study to propose a new and more expressive way of providing an empirical ordering of intervals of runs in order to determine how close IR effectiveness measures are to being on an interval scale. Indeed, previous work [10, 11] has shown that they are on an ordinal scale, under some conditions, but not on an interval scale. We have introduced the notion of interval-like scale, a kind of interval scale which allows intervals not to be equi-spaced, and we have shown that both DCG and RBP are on this scale, under reasonable conditions, while AP and ERR are not.

Future work will concern an empirical investigation of the different theoretical properties of evaluation measures we have found, in order to determine the impact and severity of not complying with them when computing descriptive statistics, like mean and variance, and when conducting statistical significance tests.

REFERENCES

[1] E. Amigó, Gonzalo, and M. F. Verdejo. 2013. A General Evaluation Measure for Document Organization Tasks. In Proc. 36th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2013), G. J. F. Jones, Sheridan, Kelly, M. de Rijke, and T. Sakai (Eds.). ACM Press, New York, USA, 643-652.

[2] Bollmann. 1984. Two Axioms for Evaluation Measures in Information Retrieval. In Proc. of the Third Joint BCS and ACM Symposium on Research and Development in Information Retrieval, C. J. van Rijsbergen (Ed.). Cambridge University Press, UK, 233-245.

[3] Bollmann and V. S. Cherniavsky. 1980. Measurement-theoretical investigation of the MZ-metric. In Proc. 3rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 1980), C. J. van Rijsbergen (Ed.). ACM Press, New York, USA, 256-267.

[4] Buckley and E. M. Voorhees. 2000. Evaluating Evaluation Measure Stability. In Proc. 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2000), E. Yannakoudakis, N. J. Belkin, M.-K. Leong, and P. Ingwersen (Eds.). ACM Press, New York, USA, 33-40.

[5] Buckley and E. M. Voorhees. 2004. Retrieval Evaluation with Incomplete Information. In Proc. 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2004), Sanderson, K. Järvelin, J. Allan, and P. Bruza (Eds.). ACM Press, New York, USA, 25-32.

[6] Busin and Mizzaro. 2013. Axiometrics: An Axiomatic Approach to Information Retrieval Effectiveness Metrics. In Proc. 4th International Conference on the Theory of Information Retrieval (ICTIR 2013), Kurland, Metzler, Lioma, Larsen, and P. Ingwersen (Eds.). ACM Press, New York, USA, 22-29.

[7] B. A. Carterette. 2012. Multiple Testing in Statistical Analysis of Systems-Based Information Retrieval Experiments. ACM Transactions on Information Systems (TOIS) 30, 1 (2012), 4:1-4:34.

[8] B. A. Carterette. 2015. Bayesian Inference for Information Retrieval Evaluation. In Proc. 1st ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR 2015), Allan, W. B. Croft, A. P. de Vries, C. Zhai, N. Fuhr, and Y. Zhang (Eds.). ACM Press, New York, USA, 31-40.

[9] Chapelle, Metzler, Zhang, and Grinspan. 2009. Expected Reciprocal Rank for Graded Relevance. In Proc. 18th International Conference on Information and Knowledge Management (CIKM 2009), D. W.-L. Cheung, I.-Y. Song, W. W. Chu, Hu, and J. J. Lin (Eds.). ACM Press, New York, USA, 621-630.

[10] Ferrante, Ferro, and Maistro. 2015. Towards a Formal Framework for Utility-oriented Measurements of Retrieval Effectiveness. In Proc. 1st ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR 2015), Allan, W. B. Croft, A. P. de Vries, C. Zhai, N. Fuhr, and Y. Zhang (Eds.). ACM Press, New York, USA, 21-30.

[11] Ferrante, Ferro, and Pontarollo. 2017. Are IR Evaluation Measures on an Interval Scale? In Proc. 3rd ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR 2017), Kamps, E. Kanoulas, M. de Rijke, Fang, and E. Yilmaz (Eds.). ACM Press, New York, USA, 67-74.

[12] Foldes. 2013. On distances and metrics in discrete ordered sets. arXiv.org, Combinatorics (math.CO) arXiv:1307.0244 (June 2013).

[13] D. A. Hull. 1993. Using Statistical Testing in the Evaluation of Retrieval Experiments. In Proc. 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 1993), R. Korfhage, E. Rasmussen, and P. Willett (Eds.). ACM Press, New York, USA, 329-338.

[14] Järvelin and J. Kekäläinen. 2002. Cumulated Gain-Based Evaluation of IR Techniques. ACM Transactions on Information Systems (TOIS) 20, 4 (October 2002), 422-446.

[15] D. H. Krantz, R. D. Luce, Suppes, and Tversky. 1971. Foundations of Measurement. Additive and Polynomial Representations. Vol. 1. Academic Press, New York, USA.

[16] Moffat and Zobel. 2008. Rank-biased Precision for Measurement of Retrieval Effectiveness. ACM Transactions on Information Systems (TOIS) 27, 1 (2008), 2:1-2:27.

[17] Robertson. 2006. On GMAP: and Other Transformations. In Proc. 15th International Conference on Information and Knowledge Management (CIKM 2006), P. S. Yu, Tsotras, E. A. Fox, and C.-B. Liu (Eds.). ACM Press, New York, USA, 78-83.

[18] G. B. Rossi. 2014. Measurement and Probability. A Probabilistic Theory of Measurement with Applications. Springer-Verlag, New York, USA.

[19] Sakai. 2006. Evaluating Evaluation Metrics based on the Bootstrap. In Proc. 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2006), E. N. Efthimiadis, S. Dumais, Hawking, and K. Järvelin (Eds.). ACM Press, New York, USA, 525-532.

[20] Sakai. 2014. Statistical Reform in Information Retrieval? SIGIR Forum 48, 1 (June 2014), 3-12.

[21] Sakai. 2017. The Probability that Your Hypothesis Is Correct, Credible Intervals, and Effect Sizes for IR Evaluation. In Proc. 40th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2017), Kando, Sakai, Joho, Li, A. P. de Vries, and R. W. White (Eds.). ACM Press, New York, USA, 25-34.

[22] Sebastiani. 2015. An Axiomatically Derived Measure for the Evaluation of Classification Algorithms. In Proc. 1st ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR 2015), Allan, W. B. Croft, A. P. de Vries, C. Zhai, N. Fuhr, and Y. Zhang (Eds.). ACM Press, New York, USA, 11-20.

[23] M. D. Smucker, J. Allan, and B. A. Carterette. 2007. A Comparison of Statistical Significance Tests for Information Retrieval Evaluation. In Proc. 16th International Conference on Information and Knowledge Management (CIKM 2007), M. J. Silva, A. A. F. Laender, Baeza-Yates, D. L. McGuinness, Olstad, Ø. H. Olsen, and A. Falcão (Eds.). ACM Press, New York, USA, 623-632.

[24] R. P. Stanley. 2012. Enumerative Combinatorics - Volume 1 (2nd ed.). Cambridge Studies in Advanced Mathematics, Vol. 49. Cambridge University Press, Cambridge, UK.

[25] S. S. Stevens. 1946. On the Theory of Scales of Measurement. Science, New Series 103, 2684 (June 1946), 677-680.

[26] C. J. van Rijsbergen. 1974. Foundations of Evaluation. Journal of Documentation 30, 4 (1974), 365-373.

[27] Webber, A. Moffat, and Zobel. 2008. Score Standardization for Inter-Collection Comparison of Retrieval Systems. In Proc. 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2008), T.-S. Chua, M.-K. Leong, D. W. Oard, and F. Sebastiani (Eds.). ACM Press, New York, USA, 51-58.