=Paper=
{{Paper
|id=Vol-2008/paper_11
|storemode=property
|title=An Interval-Like Scale Property for IR Evaluation Measures
|pdfUrl=https://ceur-ws.org/Vol-2008/paper_11.pdf
|volume=Vol-2008
|authors=Marco Ferrante,Nicola Ferro,Silvia Pontarollo
|dblpUrl=https://dblp.org/rec/conf/ntcir/FerranteFP17
}}
==An Interval-Like Scale Property for IR Evaluation Measures==
An interval-like scale property for IR evaluation measures
Marco Ferrante Nicola Ferro Silvia Pontarollo
Dept. Mathematics Dept. Information Engineering Dept. Mathematics
University of Padua, Italy University of Padua, Italy University of Padua, Italy
ferrante@math.unipd.it ferro@dei.unipd.it spontaro@math.unipd.it
ABSTRACT an interval scale is somehow arbitrary and, as a consequence, also
Evaluation measures play an important role in IR experimental some of the descriptive statistics you compute about it.
evaluation and their properties determine the kind of statistical Therefore, researchers started to study what IR effectiveness
analyses we can conduct. measures are, not only from an empirical perspective, e.g., [4, 5, 19],
It has been previously shown that it is questionable that IR ef- but also from a theoretical one, e.g., [1–3, 6, 10, 22, 26].
fectiveness measures are on an interval-scale and this implies that In this paper, we stem from the recent work of [11] and we
computing means and variances is not a permissible operation. move a step forward in understanding when and to what extent IR
In this paper, we investigate whether it is possible to relax a bit effectiveness measures are on an interval scale.
the definition of interval scale, introducing the notion of interval- [11] investigated whether IR effectiveness measures are on an
like scale, and to what extent IR effectiveness measures comply interval scale in the perspective of the representational theory of
with this relaxed definition. measurement [15], which is the measurement theory adopted in
both physical and social sciences. According to this framework, the
CCS CONCEPTS key point is to understand how real world objects, i.e., system runs
in our case, are related to each other since measure properties are
•Information systems → Retrieval effectiveness;
then derived from these relations. Moreover, it is important that
these relations among real world objects are intuitive and sensible
KEYWORDS to “everybody” and that they can be commonly agreed on.
evaluation measures; representational theory of measurement; in- Therefore, [11] pointed out that the main issues in determining
terval scale the scale of IR effectiveness measures are: (i) to understand how
runs are empirically and intuitively ordered; (ii) to define what
an interval of runs is; and, (iii) to determine how these intervals
1 INTRODUCTION are ordered. Once you settled all these aspects, you can check
Evaluation plays a central role in Information Retrieval (IR) and a whether an effectiveness measure comply with them or not and thus
lot of attention is devoted to improving our evaluation method- determine whether it is on an interval scale or not. In particular,
ologies and practices. For example, since many years, there is a [11] found that under a strong top-heaviness notion of ordering
continued interest on how to properly apply statistical techniques among runs, only Rank-Biased Precision (RBP) [16] with p = 12 is on
to the analysis of IR experimental data, e.g., on the appropriate use an interval scale while RBP for other values of p and other popular
of statistical testing [7, 13, 20, 23], on the normalization of measure measures – namely AP, Discounted Cumulated Gain (DCG) [14],
values for cross-collection comparison [27], or on moving towards and Expected Reciprocal Rank (ERR) [9] – are not. Moreover, using
Bayesian inference [8, 21], just to name a few. a weak top-heaviness notion of ordering among runs, [11] found
However, all these studies rely on some, often hidden and implicit, that all the previously mentioned IR effectiveness measures are not
assumptions on what IR effectiveness measures are. In particular, on an interval scale.
measurement scales [15, 25] determine the operations that is admis- Strong top-heaviness provides us with a total ordering among
sible to perform with measure values and, as a consequence, the runs and, as discussed above, there is at least one case of IR mea-
statistical analyses that can be applied. [25] identifies four major sure on an interval scale; however, the way in which strong top-
types of scales with increasing properties: (i) the nominal scale heaviness orders runs may give raise to disagreement or corner
consists of discrete unordered values, i.e. categories; (ii) the ordinal cases. For example, strong top-heaviness ranks the run (1, 0, 0, 0)
scale introduces a natural order among the values; (iii) the inter- with just one top relevant document before the run (0, 1, 1, 1) with
val scale preserves the equality of intervals or differences; and (iv) all relevant documents except for the first position; thus, there
the ratio scale preserves the equality of ratios. Operations such as might be disagreement on whether this is an appropriate ordering
computing the mean or the variance are possible just on interval for these runs. On the other hand, weak top-heaviness provides
and ratio scales and they constitute the basis of many of the statis- us with a much more intuitive partial ordering based on two basic
tical techniques mentioned above. However, are we sure that IR operations – swapping two consecutive documents in a ranking
effectiveness measures are on an interval scale? For example, [17] and replacing a not relevant document with a relevant one [10];
points out that the assumption of Average Precision (AP) being on however, none of the IR evaluation measures is on interval-scale
using weak top-heaviness.
Copying permitted for private and academic purposes. The problem with IR effectiveness measures emerging from [11]
8th International Workshop on Evaluating Information Access (EVIA 2017), co-located is two-fold: on the one side, both strong and weak top-heaviness
with NTCIR-13, 5 December 2017, Tokyo, Japan.
© 2017 Copyright held by the author.
create equi-spaced intervals of runs, as expected by the definition
10
of interval scale, but IR effectiveness measures do not respect this Given (E, ), we have to define a difference ∆ab between two
equi-spacing; on the other side, both strong and weak top-heaviness elements a, b ∈ E, which is a kind of signed distance we exploit
do not account enough for the importance and the effect of the to compare intervals. Then, we have to define a weak order d
rank of a document in a run, since they both rely on the notion of between these ∆ab differences. We can proceed as follows: if two
natural distance in a poset (partially ordered set) [24] which flattens elements a, b ∈ E are such that a ∼ b, i.e. a b and b a, then
things too much, shrinking everything into a single number. the interval [a, b] is null and, consequently, we set ∆ab ∼d ∆ba ; if
In this paper, we take a different approach to the ordering of in- a b we agree upon choosing ∆aa ≺d ∆ab which, in turn implies
tervals of runs, not based on single numbers, as the natural distance that ∆aa d ∆ba .
of [11] does, but using vectors instead. This new ordering is richer
Definition 1. Let E be a finite (not empty) set of objects. Let d be a
and more expressive than that induced by the natural distances in
binary relation on E ×E that satisfies, for each a, b, c, d, a 0, b 0, c 0 ∈ E,
the strong and weak top-heaviness cases and allows us to introduce
the following axioms:
the notion of interval-like scale, i.e., something richer than an ordi-
nal scale but a bit less powerful than an interval scale, since runs i. d is weak order;
are ordered, intervals of runs are ordered too but intervals may ii. if ∆ab d ∆cd , then ∆dc d ∆ba ;
not be equi-spaced. In particular, we find that, under reasonable iii. if ∆ab d ∆a 0b 0 and ∆bc d ∆b 0c 0 then ∆ac d ∆a 0c 0 ;
assumptions, DCG and RBP are on a interval-like scale while AP iv. Solvability Condition: if ∆aa d ∆cd d ∆ab , then there
and ERR are not. exists d 0, d 00 ∈ E such that ∆ad 0 ∼d ∆cd ∼d ∆d 00b .
The paper is organized as follows: Section 2 recaps some basic Then (E, d ) is a difference structure.
concepts about the representational theory of measurement and
Particular attention has to be paid to the Solvability Condition
posets; Section 3 deals with interval-like scales; finally, Section 4
which ensures the existence of an equally spaced gradation be-
wraps up the discussion and outlooks some future work.
tween the elements of E, indispensable to construct an interval
scale measurement.
2 BACKGROUND The representation theorem for difference structures states:
2.1 Representational Theory of Measurement Theorem 1. Let E be a finite (not empty) set of objects and let
A relational structure [15, 18] is an ordered pair X = X , R X of (E, d ) be a difference structure. Then there exist a measurement scale
a domain set X and a set of relations R X on X , where the relations in M : E → R such that for every a, b, c, d ∈ E
R X may have different arities, i.e. they can be unary, binary, ternary ∆ab d ∆cd ⇔ M(a) − M(b) ≤ M(c) − M(d) .
relations and so on. Given two relational structures X and Y, a
homomorphism M : X → Y from X to Y is a mapping M = M, MR This theorem ensures us that, if there is a difference structure
where: (i) M is a function that maps X into M(X ) ⊆ Y , i.e. for each on the empirical set E, then there exists an interval scale M.
element of the domain set there exists one corresponding image As anticipated in Section 1, we will introduce the notion of
element; (ii) MR is a function that maps R X into MR (R X ) ⊆ RY such interval-like scale which corresponds to removing the solvability
that ∀r ∈ R X , r and MR (r ) have the same arity, i.e. for each relation condition from the definition of difference structure and obtaining
on the domain set there exists one (and it is usually, and often a new partial ordering of the intervals of runs.
implicitly, assumed: and only one) corresponding image relation;
(iii) ∀r ∈ R X , ∀x i ∈ X , if r (x 1 , . . . , x n ) then MR (r ) M(x 1 ), . . . , 2.3 Posets
A partially ordered set P, poset for short, is a set with a partial order
M(x n ) , i.e. if a relation holds for some elements of the domain set defined on it [24]. A partial order is a binary relation over P
then the image relation must hold for the image elements. which is reflexive, antisymmetric and transitive. Given s, t ∈ P, we
A relational structure E is called empirical if its domain set E say that s and t are comparable if s t or t s, otherwise they are
spans over the entities under consideration in the real world, i.e. the incomparable.
system runs in our case; a relational structure S is called symbolic A closed interval is a subset of P defined as [s, t] B {u ∈ P : s
if its domain set S spans over a given set of numbers. A measure- u t }, where s, t ∈ P and s t. Moreover we say that t covers s
ment (scale) is the homomorphism M = M, MR from the real if s t and [s, t] = {s, t }, that is there does not exist u ∈ P such
world to the symbolic world and a measure is the number assigned that s ≺ u ≺ t .
to an entity by this mapping. We can represent a finite poset P by using the Hasse diagram
which is a graph where vertices are the elements of P, edges rep-
2.2 Measurement Scales resent the covers relations, and if s ≺ t then s is below t in the
diagram.
[11] relied on the notion of difference structure [15, 18] to introduce
A subset C of a poset P is a chain if any two elements of C are
a definition of interval among system runs in such a way that it
comparable: a chain is a totally ordered subset of a poset. If C is a
ensures the existence of an interval scale.
finite chain, the length of C, `(C), is defined by `(C) = |C | − 1. A
Given E, a weakly ordered empirical structure is a pair (E, )
maximal chain of P is a chain that is not a proper subset of any
where, for every a, b, c ∈ E,
other chain of P.
• a b or b a; If every maximal chain of P has the same length n, we say that
• a b and b c ⇒ a c. P is graded of rank n; in particular there exists a unique function
11
ρ : P → {0, 1, . . . , n}, called the rank function, such that ρ(s) = 0, which, as discussed in Section 1, has the drawback of flattening
if s is a minimal element of P, and ρ(t) = ρ(s) + 1, if t covers s. everything into a single number.
Finally, since any interval on a graded poset is graded, the length To define the length of an interval we adopt the following strat-
of an interval [s, t] is given by `(s, t) B `([s, t]) = ρ(t) − ρ(s), also egy: given rˆ, ŝ ∈ REL N with rˆ ŝ, we count how many replace-
called the natural distance. ments in the last position and how many forward single-step swaps
at each depth are necessary to go from rˆ to ŝ following a maximal
3 INTERVAL-LIKE SCALES chain in REL N . In order to do this, it is useful to define the cu-
3.1 Preliminary Definitions mulative sums of a vector v = (v[1], . . . , v[N ]), denoted using the
Íj
capital letter as V = (V [1], . . . , V [N ]), where V [j] = i=1 v[i].
Given N , the length of the run, we define the set of retrieved
Let us start with a simple example.
documents as D(N ) = {(d 1 , . . . , d N ) : di ∈ D, di , d j for any i ,
j}, i.e. the ranked list of retrieved documents without duplicates, and Example. Consider the two judged runs in REL4
Ð |D |
the universe set of retrieved documents as D := N =1 D(N ). rˆ = (0, 1, 1, 0) ,
A run r t , retrieving a ranked list of documents D(N ) in response
0̂ = (0, 0, 0, 0) .
to a topic t ∈ T , is a function from T into D
Since 0̂ ≺ rˆ, in order to construct a chain from 0̂ to rˆ with the
t 7→ r t = (d 1 , . . . , d N )
two basic operators (replacement in last position and single-step
We denote by r t [j] the j-th element of the vector r t , forward swap) we get
i.e. r t [j] = d j .
0̂ = (0, 0, 0, 0) ,
We define the universe set of judged documents as R :=
Ð |D | N N 0̂1 = (0, 0, 0, 1) ,
N =1 REL , where REL is the set of the ranked lists of judged
retrieved documents with length fixed to N . Since in our case 0̂2 = (0, 0, 1, 0) ,
REL = {0, 1}, REL N = {0, 1} N refers to the space of all N −length
0̂4 = (0, 1, 0, 0) ,
vectors consisting of 0 and 1. As for the set-based case, we denote
by RBt the recall base, i.e. the total number of relevant documents 0̂5 = (0, 1, 0, 1) ,
for a topic. 0̂6 = (0, 1, 1, 0) = rˆ .
We call judged run the function rˆt from T × D into R, which
assigns a relevance degree to each retrieved document in the ranked We have made two replacement in the fourth position, one swap
list in the second position and two in the third one. Recall that with
(t, r t ) 7→ rˆt = GT (t, d 1 ), . . . , GT (t, d N )
swap at depth i we mean that a forward swap from position i − 1
to position i was done. We can count how many of these basic
We denote by rˆt [j] the j-th element of the vector rˆt , i.e. rˆt [j] = operations in each position are needed to go from 0̂ to rˆ just taking
GT (t, d j ). the cumulative sums of rˆ. Indeed we get
As for the set-based case, we can simplify the notation omitting
the dependence on topics, rˆ B rˆ[1], . . . , rˆ[N ] , RB, and so on. R̂ = (0, 1, 2, 2) ,
and each entry k < D of R̂, R̂[k], counts the number of swaps made
3.2 Ordering between Intervals in position k, while R̂[N ] counts the number of replacement, i.e.
Let us start recalling the ordering between runs adopted in this the total mass of rˆ, to go from 0̂ to rˆ.
paper and based on the following two monotonicity-like properties
More generally, given two vectors rˆ, ŝ ∈ REL N , with rˆ ≺ ŝ,
proposed by [10]:
in order to collect the number of basic operations made at each
• Replacement A measure of retrieval effectiveness should position to go from rˆ to ŝ, we can compute this vector of length N
not decrease when replacing a document with another one first between 0̂ and rˆ and between 0̂ and ŝ, namely R̂ and S, ˆ and
in the same rank position with higher degree of relevance. then subtract the two vectors. Precisely Sˆ − R̂ leads to a new vector
• Swap If we swap a less relevant document with a more of length N , where each entry k equals the number of swaps or
relevant one in a lower rank position, the measure should replacements (if k = N ) needed to go from rˆ to ŝ.
not decrease.
These two properties lead to the following partial ordering among Example. In order to better understand this mechanism, let us
system runs consider a second example. Consider the two judged runs in REL4
k k
rˆ = (0, 1, 0, 0) ,
ŝ = (1, 0, 1, 0) .
Õ Õ
rˆ ŝ ⇔ rˆ[j] ≤ ŝ[j] ∀k ∈ {1, . . . , N } . (1)
j=1 j=1 In order to construct a chain from rˆ to ŝ with the two basic operators
This ordering considers a run bigger than another one when, for (replacement in last position and single-step forward swap) we get
each rank position, it has more relevant documents than the other rˆ = (0, 1, 0, 0) ,
one up to that rank.
v̂ = (1, 0, 0, 0) ,
This is the same ordering of runs used by [11] in the weak top-
heaviness case but, differently from [11], we now introduce a differ- ŵ = (1, 0, 0, 1) ,
ent notion of length of an interval, not based on the natural distance ŝ = (1, 0, 1, 0) .
12
We have made a swap in the first and third position and a replace- Moreover, when computing the difference vector ∆ ® ·· between
ment in the fourth position, that we can collect in a vector as two comparable runs rˆ, ŝ, in this work we write ∆ŝ r̂ whenever rˆ ŝ:
®
t = (1, 0, 1, 1) . (2) if we instead consider ∆ ® r̂ ŝ , then we are counting the backward
swaps from ŝ to rˆ and ∆r̂ ŝ [i] ≤ 0 for all i ∈ {1, . . . , N }.
®
On the other hand it is easy to compute R̂ = (0, 1, 1, 1) and Sˆ =
Since here ∆
® ·· is no more a scalar but a vector, we have to define
(1, 1, 2, 2). Therefore
the partial order among intervals of runs d as follow:
Sˆ − R̂ = (1, 0, 1, 1) = t ,
Definition 3. Given [ˆr, ŝ], [û, v̂] ⊆ REL N ,
as we wanted to show.
∆
® v̂ û d ∆
® ŝ r̂
Let us consider a second, more complicated, example. if and only if
Example. Consider the two judged runs ∆
® v̂ û [i] ≤ ∆
® ŝ r̂ [i], ∀i ∈ {1, . . . , N }.
rˆ = (1, 0, 0, 0, 0, 1, 1, 0, 1, 0) ,
Example. With respect to the previous example, where t = Sˆ − R̂ =
ŝ = (1, 1, 0, 0, 1, 0, 1, 0, 0, 1) . (0, 1, 1, 1, 2, 1, 1, 1, 0, 1), the vector ∆
® ŝ r̂ is given by
Clearly rˆ ŝ. Moreover ∆
® ŝ r̂ = T = (0, 1, 2, 3, 5, 6, 7, 8, 8, 9).
Sˆ = (1, 2, 2, 2, 3, 3, 4, 4, 4, 5) , Let now û, v̂ ∈ {0, 1}10 be as follows
R̂ = (1, 1, 1, 1, 1, 2, 3, 3, 4, 4). û = (1, 0, 0, 1, 0, 1, 1, 1, 0, 0) ,
Thus v̂ = (1, 0, 1, 1, 1, 0, 1, 0, 0, 0) .
Sˆ − R̂ = (0, 1, 1, 1, 2, 1, 1, 1, 0, 1). Clearly û v̂ and
Let t = Sˆ − R̂. For any i < 10, t[i] tells us how many swaps one ∆
® v̂ û = (0, 0, 1, 2, 4, 5, 6, 6, 6, 6) .
needs to do at depth i to make the smallest run coincide with the
biggest one. Moreover, if the total number of relevant relevance- Thus we can conclude that the difference between ŝ and rˆ is greater
degrees is not equal for both, as in this example, the last entry of than the difference between v̂ and û.
t, t[N ], is exactly the number or replacements on rˆ one needs to Note that the last entry of ∆ ® ·· always equals the natural distance
make, and coincide with i ŝ(i) − i rˆ(i).
Í Í
as defined in Section 2.3 and used by [11]. Indeed, given two compa-
Given an interval [ˆr , ŝ], if we take the cumulative sums of t = rable runs r, ˆ ŝ ∈ REL N , with rˆ ŝ, ∆
® ŝ r̂ [N ] counts the total number
Sˆ − R̂ we obtain the vector T of the cumulative sums of t that of forward swaps of length one and/or replacements done from
counts, for every i ≤ N , the total number of swaps (or replacements, rˆ to match ŝ. Since swaps of length one and replacements in the
if i = N ) made from depth 1 to i between the endpoints of the last positions are elementary operations as observed above, then
given interval. The vector T can be seen as a new and generalized ∆® ŝ r̂ [N ] is just counting the length of every maximal chain in [ˆr , ŝ],
definition of the length of the interval [ˆr , ŝ], which replaces the i.e., exactly the natural distance.
natural distance used by [11]. This definition of difference vector solves some of the problems
According to this new distance, we say that the interval [rˆ1 , sˆ1 ] encountered with the difference defined using the natural distance,
is smaller than or equal to the interval [rˆ2 , sˆ2 ] if, for the vectors T1 as the following example shows.
and T2 of their cumulative sums, it holds that T1 [i] ≤ T2 [i] for any Example. Let rˆ, ŝ, û, v̂ be defined as follows:
i ≤ n. It is worth noticing that, if we take as definition of length
rˆ = (0, 1, 0, 0, 0, 0, 0, 0, 0, 0) ,
any convex linear combination of the values (T [i], . . . ,T [n]), the
intervals comparable for the previous ordering remain comparable. ŝ = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0) ,
Other intervals become comparable for any fixed linear combina- û = (0, 0, 0, 0, 0, 0, 0, 0, 0, 1) ,
tion, but it is not possible to say in advance they are ordered in the v̂ = (0, 0, 0, 0, 0, 0, 0, 0, 1, 0) ,
same way by any two of these combinations.
We are now able to define a difference in this setting: where rˆ ŝ and û v̂.
As already discussed, the natural distance induces a difference
Definition 2. Given r, ˆ ŝ ∈ REL N , with rˆ ŝ, the difference ∆
® ŝ r̂ between runs that does not keep track or the rank. In this case, the
is a vector of length N such that natural distance would that both the pairs r, ˆ ŝ, and û, v̂, have both
i
Õ difference equal to 1, even if these two pair differs a lot in terms of
∆
® ŝ r̂ [i] B (i − j + 1) ŝ[j] − rˆ[j] ,
where differences actually happen in the ranking.
j=1 Instead, ∆
® ·· shows a bigger difference between rˆ and ŝ compared
to the other two runs, because their differences happen in higher
for all i ∈ {1, . . . , N }.
and more important rank positions:
It can be easily proved that ∆ ® ŝ r̂ is exactly the vector T de-
∆
® ŝ r̂ = (1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
fined above. Indeed, by construction, given r, ˆ ŝ ∈ REL N with
Íj ∆
® v̂ û = (0, 0, 0, 0, 0, 0, 0, 0, 1, 1) ,
rˆ ŝ, t[j] = n=1 (ŝ[n] − rˆ[n]). Therefore T [i] = ij=1 t[j] =
Í
Íi Íj Íi
j=1 n=1 (ŝ[n] − rˆ[n]) = j=1 (i − j + 1) ŝ[j] − rˆ[j] . and ∆
® ŝ r̂ [i] ≥ ∆
® v̂ û [i] for every i ∈ {1, . . . , 10}.
13
Therefore, this new and more expressive difference matches Instead, RBPp and DCG show a greater agreement with the in-
better with the intuition that the higher the rank position at which equalities between intervals induced by ∆® ·· , even if sometimes they
it happens, the more important the same difference between two do not respect these relations: this happens when the endpoints of
runs. an interval do not have an equal number of relevant documents.
The vector ∆
® ·· is thus useful to compare, when possible, intervals
on REL N , paying the necessary attention on the ranking. As a Example. Let us consider rˆ, ŝ, û ∈ {0, 1}10 :
consequence, a measure that satisfy these relations among intervals, rˆ = (0, 0, 1, 0, 1, 1, 0, 0, 1, 0) ,
although not interval scale, could be viewed as something more ŝ = (0, 1, 0, 1, 0, 1, 1, 1, 1, 0) ,
powerful than a measure on ordinal scale. Indeed, when the above
û = (1, 1, 0, 1, 1, 1, 0, 1, 0, 0) .
differences between intervals are comparable, one direction of iff
on Theorem 1 is still satisfied. Clearly rˆ ŝ û and one can prove that
Therefore we can say that a measure M of retrieval effectiveness
∆
® ŝ r̂ = (0, 1, 1, 2, 2, 2, 3, 5, 7, 9) ,
is interval-like if, given a distance (potentially vector) ∆ ·· , an
ordering d between distances, and given rˆ, ŝ, û, v̂ ∈ REL N , the ∆
® û ŝ = (1, 2, 3, 4, 6, 8, 9, 10, 10, 10) ,
following relation holds:
that is ∆
® ŝ r̂ [i] ≤ ∆
® û ŝ [i] ∀i, with strict inequality for some i. While
∆ŝ r̂ d ∆v̂ û ⇒ M(ŝ) − M(ˆr ) ≤ M(v̂) − M(û). û and ŝ has the same number of relevant documents, rˆ has two
The next section is discusses whether some well-known IR mea- relevant documents less than ŝ. In particular DCG(ŝ) − DCG(ˆr ) >
sures are interval-like with respect to the difference introduced in DCG(û)−DCG(ŝ) and, for p > 0.85, RBPp (ŝ)−RBPp (ˆr ) > RBPp (û)−
Definition 2. RBPp (ŝ), against the inequality given by the difference vectors.
Therefore, we can say that RBPp and DCG are interval-like with
3.3 Interval-like Scale Measures respect to the difference introduced in Definition 2 and consider-
We tested some measures of retrieval effectiveness – namely AP, ing only intervals where the endpoints have an equal number of
RBPp , ERR, DCG – on intervals with comparable differences ac- relevant documents. While AP and ERR are not even interval-like
cording to the above definition. since the relations between intervals often fail to be complied with.
ERR shows the strongest discordance with our definition of
difference, since often it does not respect the relations between 4 CONCLUSIONS AND FUTURE WORK
intervals induced by ∆® ·· , as the next example shows. In this paper, we conducted a formal study to propose a new and
Example. Let us consider the following four runs r,
ˆ ŝ, û, v̂ ∈ {0, 1}10 : more expressive way of providing an empirical ordering of intervals
of runs in order to determine how close IR effectiveness measure
rˆ = (0, 0, 0, 0, 0, 0, 1, 1, 1, 0) , are to be on an interval scale. Indeed, previous work [10, 11] has
ŝ = (0, 0, 0, 0, 0, 1, 0, 1, 1, 0) , shown that they are on an ordinal scale, under some conditions,
û = (1, 1, 0, 1, 0, 1, 1, 0, 1, 1) , but not on an interval scale. We have introduced the notion of
interval-like scale, a kind of interval scale which admits intervals
v̂ = (1, 1, 1, 0, 0, 1, 1, 0, 1, 1) .
to not be equi-spaced, and we have shown that both DCG and RBP
Clearly rˆ ŝ û v̂. It seems fair to think that rˆ and ŝ give rise are on this scale, under reasonable conditions, while AP and ERR
to a smaller interval compared to [û, v̂] – note that the endpoints of are not.
both intervals differ by a swap of length one, but made in different Future work will concern an empirical investigation of the dif-
positions. Moreover it is easy to prove that ∆ ® ŝ r̂ [i] ≤ ∆
® v̂ û [i] ∀i. But ferent theoretical properties of evaluation measures we have found
while the measures RBPp , AP and DCG agree with the previous in order to determine the impact and severity of not complying
statement, ERR does not, since ERR(ŝ) −ERR(ˆr ) > ERR(v̂) −ERR(û). with them when you compute descriptive statistics, like mean and
variance, and when you conduct statistical significance tests.
Another measure that does not always respect the relations
between distances is AP. REFERENCES
Example. Let us consider the following runs rˆ, ŝ, û ∈ {0, 1}10 : [1] E. Amigó, J. Gonzalo, and M. F. Verdejo. 2013. A General Evaluation Measure
for Document Organization Tasks. In Proc. 36th Annual International ACM SIGIR
rˆ = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0) , Conference on Research and Development in Information Retrieval (SIGIR 2013),
G. J. F. Jones, P. Sheridan, D. Kelly, M. de Rijke, and T. Sakai (Eds.). ACM Press,
ŝ = (0, 1, 0, 0, 1, 0, 0, 0, 0, 1) , New York, USA, 643–652.
û = (0, 1, 0, 0, 1, 1, 1, 0, 0, 1) .
[2] P. Bollman. 1984. Two Axioms for Evaluation Measures in Information Retrieval.
In Proc. of the Third Joint BCS and ACM Symposium on Research and Development
in Information Retrieval, C. J. van Rijsbergen (Ed.). Cambridge University Press,
Clearly rˆ ŝ and ŝ û. The readers can agree to consider the UK, 233–245.
interval [ˆr , ŝ] strictly bigger than [ŝ, û], since from û to ŝ we have [3] P. Bollmann and V. S. Cherniavsky. 1980. Measurement-theoretical investigation
lost only two relevant documents, while from ŝ to rˆ the information of the MZ-metric. In Proc. 3rd Annual International ACM SIGIR Conference on Re-
search and Development in Information Retrieval (SIGIR 1980), C. J. van Rijsbergen
lost seems to be higher. Moreover ∆ ® ŝ r̂ [i] ≥ ∆
® û ŝ [i] ∀i, with strict (Ed.). ACM Press, New York, USA, 256–267.
inequality for some i. However while the measures RBPp , ERR and [4] C. Buckley and E. M. Voorhees. 2000. Evaluating Evaluation Measure Stability.
In Proc. 23rd Annual International ACM SIGIR Conference on Research and De-
DCG agree with this relation between the two intervals, AP does velopment in Information Retrieval (SIGIR 2000), E. Yannakoudakis, N. J. Belkin,
not, since AP(ŝ) − AP(ˆr ) < AP(û) − AP(ŝ). M.-K. Leong, and P. Ingwersen (Eds.). ACM Press, New York, USA, 33–40.
14
[5] C. Buckley and E. M. Voorhees. 2004. Retrieval Evaluation with Incomplete (SIGIR 2008), T.-S. Chua, M.-K. Leong, D. W. Oard, and F. Sebastiani (Eds.). ACM
Information. In Proc. 27th Annual International ACM SIGIR Conference on Research Press, New York, USA, 51–58.
and Development in Information Retrieval (SIGIR 2004), M. Sanderson, K. Järvelin,
J. Allan, and P. Bruza (Eds.). ACM Press, New York, USA, 25–32.
[6] L. Busin and S. Mizzaro. 2013. Axiometrics: An Axiomatic Approach to Informa-
tion Retrieval Effectiveness Metrics. In Proc. 4th International Conference on the
Theory of Information Retrieval (ICTIR 2013), O. Kurland, D. Metzler, C. Lioma,
B. Larsen, and P. Ingwersen (Eds.). ACM Press, New York, USA, 22–29.
[7] B. A. Carterette. 2012. Multiple Testing in Statistical Analysis of Systems-Based
Information Retrieval Experiments. ACM Transactions on Information Systems
(TOIS) 30, 1 (2012), 4:1–4:34.
[8] B. A. Carterette. 2015. Bayesian Inference for Information Retrieval Evaluation.
In Proc. 1st ACM SIGIR International Conference on the Theory of Information
Retrieval (ICTIR 2015), J. Allan, W. B. Croft, A. P. de Vries, C. Zhai, N. Fuhr, and
Y. Zhang (Eds.). ACM Press, New York, USA, 31–40.
[9] O. Chapelle, D. Metzler, Y. Zhang, and P. Grinspan. 2009. Expected Reciprocal
Rank for Graded Relevance. In Proc. 18th International Conference on Information
and Knowledge Management (CIKM 2009), D. W.-L. Cheung, I.-Y. Song, W. W.
Chu, X. Hu, and J. J. Lin (Eds.). ACM Press, New York, USA, 621–630.
[10] M. Ferrante, N. Ferro, and M. Maistro. 2015. Towards a Formal Framework
for Utility-oriented Measurements of Retrieval Effectiveness. In Proc. 1st ACM
SIGIR International Conference on the Theory of Information Retrieval (ICTIR 2015),
J. Allan, W. B. Croft, A. P. de Vries, C. Zhai, N. Fuhr, and Y. Zhang (Eds.). ACM
Press, New York, USA, 21–30.
[11] M. Ferrante, N. Ferro, and S. Pontarollo. 2017. Are IR Evaluation Measures on an
Interval Scale?. In Proc. 3rd ACM SIGIR International Conference on the Theory of
Information Retrieval (ICTIR 2017), J. Kamps, E. Kanoulas, M. de Rijke, H. Fang,
and E. Yilmaz (Eds.). ACM Press, New York, USA, 67–74.
[12] S. Foldes. 2013. On distances and metrics in discrete ordered sets. arXiv.org,
Combinatorics (math.CO) arXiv:1307.0244 (June 2013).
[13] D. A. Hull. 1993. Using Statistical Testing in the Evaluation of Retrieval Experi-
ments. In Proc. 16th Annual International ACM SIGIR Conference on Research and
Development in Information Retrieval (SIGIR 1993), R. Korfhage, E. Rasmussen,
and P. Willett (Eds.). ACM Press, New York, USA, 329–338.
[14] K. Järvelin and J. Kekäläinen. 2002. Cumulated Gain-Based Evaluation of IR
Techniques. ACM Transactions on Information Systems (TOIS) 20, 4 (October
2002), 422–446.
[15] D. H. Krantz, R. D. Luce, P. Suppes, and A. Tversky. 1971. Foundations of Mea-
surement. Additive and Polynomial Representations. Vol. 1. Academic Press, New
York, USA.
[16] A. Moffat and J. Zobel. 2008. Rank-biased Precision for Measurement of Retrieval
Effectiveness. ACM Transactions on Information Systems (TOIS) 27, 1 (2008),
2:1–2:27.
[17] S. Robertson. 2006. On GMAP: and Other Transformations. In Proc. 15th Interna-
tional Conference on Information and Knowledge Management (CIKM 2006), P. S.
Yu, V. Tsotras, E. A. Fox, and C.-B. Liu (Eds.). ACM Press, New York, USA, 78–83.
[18] G. B. Rossi. 2014. Measurement and Probability. A Probabilistic Theory of Mea-
surement with Applications. Springer-Verlag, New York, USA.
[19] T. Sakai. 2006. Evaluating Evaluation Metrics based on the Bootstrap. In Proc.
29th Annual International ACM SIGIR Conference on Research and Development
in Information Retrieval (SIGIR 2006), E. N. Efthimiadis, S. Dumais, D. Hawking,
and K. Järvelin (Eds.). ACM Press, New York, USA, 525–532.
[20] T. Sakai. 2014. Statistical Reform in Information Retrieval? SIGIR Forum 48, 1
(June 2014), 3–12.
[21] T. Sakai. 2017. The Probability that Your Hypothesis Is Correct, Credible Intervals,
and Effect Sizes for IR Evaluation. In Proc. 40th Annual International ACM SIGIR
Conference on Research and Development in Information Retrieval (SIGIR 2017),
N. Kando, T. Sakai, H. Joho, H. Li, A. P. de Vries, and R. W. White (Eds.). ACM
Press, New York, USA, 25–34.
[22] F. Sebastiani. 2015. An Axiomatically Derived Measure for the Evaluation of
Classification Algorithms. In Proc. 1st ACM SIGIR International Conference on the
Theory of Information Retrieval (ICTIR 2015), J. Allan, W. B. Croft, A. P. de Vries,
C. Zhai, N. Fuhr, and Y. Zhang (Eds.). ACM Press, New York, USA, 11–20.
[23] M. D. Smucker, J. Allan, and B. A. Carterette. 2007. A Comparison of Statistical
Significance Tests for Information Retrieval Evaluation. In Proc. 16th International
Conference on Information and Knowledge Management (CIKM 2007), M. J. Silva,
A. A. F. Laender, R. Baeza-Yates, D. L. McGuinness, B. Olstad, Ø. H. Olsen, and
A. and Falcão (Eds.). ACM Press, New York, USA, 623–632.
[24] R. P. Stanley. 2012. Enumerative Combinatorics – Volume 1 (2nd ed.). Cambridge
Studies in Advanced Mathematics, Vol. 49. Cambridge University Press, Cam-
bridge, UK.
[25] S. S. Stevens. 1946. On the Theory of Scales of Measurement. Science, New Series
103, 2684 (June 1946), 677–680.
[26] C. J. van Rijsbergen. 1974. Foundations of Evaluation. Journal of Documentation
30, 4 (1974), 365–373.
[27] W. Webber, A. Moffat, and J. Zobel. 2008. Score Standardization for Inter-
Collection Comparison of Retrieval Systems. In Proc. 31st Annual International
ACM SIGIR Conference on Research and Development in Information Retrieval
15