<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An interval-like scale property for IR evaluation measures</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marco Ferrante</string-name>
          <email>ferrante@math.unipd.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicola Ferro</string-name>
          <email>ferro@dei.unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Silvia Pontarollo</string-name>
          <email>spontaro@math.unipd.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dept. Information Engineering, University of Padua</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Dept. Mathematics, University of Padua</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <fpage>10</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>Evaluation measures play an important role in IR experimental evaluation and their properties determine the kind of statistical analyses we can conduct. It has previously been shown that it is questionable whether IR effectiveness measures are on an interval scale, which implies that computing means and variances is not a permissible operation. In this paper, we investigate whether it is possible to slightly relax the definition of interval scale, introducing the notion of interval-like scale, and to what extent IR effectiveness measures comply with this relaxed definition.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>CCS CONCEPTS</title>
      <p>• Information systems → Retrieval effectiveness;</p>
    </sec>
    <sec id="sec-2">
      <title>1 INTRODUCTION</title>
      <p>
        Evaluation plays a central role in Information Retrieval (IR) and a
lot of attention is devoted to improving our evaluation
methodologies and practices. For example, for many years there has been
continued interest in how to properly apply statistical techniques
to the analysis of IR experimental data, e.g., on the appropriate use
of statistical testing [
        <xref ref-type="bibr" rid="ref13 ref20 ref23 ref7">7, 13, 20, 23</xref>
        ], on the normalization of measure
values for cross-collection comparison [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ], or on moving towards
Bayesian inference [
        <xref ref-type="bibr" rid="ref21 ref8">8, 21</xref>
        ], just to name a few.
      </p>
      <p>
        However, all these studies rely on some, often hidden and implicit,
assumptions about what IR effectiveness measures are. In particular,
measurement scales [
        <xref ref-type="bibr" rid="ref15 ref25">15, 25</xref>
        ] determine the operations that are
admissible on measure values and, as a consequence, the
statistical analyses that can be applied. [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] identifies four major
types of scales with increasing properties: (i) the nominal scale
consists of discrete unordered values, i.e. categories; (ii) the ordinal
scale introduces a natural order among the values; (iii) the
interval scale preserves the equality of intervals or differences; and (iv)
the ratio scale preserves the equality of ratios. Operations such as
computing the mean or the variance are possible only on interval
and ratio scales and they constitute the basis of many of the
statistical techniques mentioned above. However, are we sure that IR
effectiveness measures are on an interval scale? For example, [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]
points out that the assumption of Average Precision (AP) being on
an interval scale is somewhat arbitrary and, as a consequence, so are
some of the descriptive statistics computed from it.
      </p>
      <p>
        Therefore, researchers started to study what IR effectiveness
measures are, not only from an empirical perspective, e.g., [
        <xref ref-type="bibr" rid="ref19 ref4 ref5">4, 5, 19</xref>
        ],
but also from a theoretical one, e.g., [
        <xref ref-type="bibr" rid="ref1 ref10 ref2 ref22 ref26 ref3 ref6">1–3, 6, 10, 22, 26</xref>
        ].
      </p>
      <p>
        In this paper, we build on the recent work of [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and we
move a step forward in understanding when and to what extent IR
effectiveness measures are on an interval scale.
      </p>
      <p>
        [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] investigated whether IR effectiveness measures are on an
interval scale from the perspective of the representational theory of
measurement [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], which is the measurement theory adopted in
both physical and social sciences. According to this framework, the
key point is to understand how real world objects, i.e., system runs
in our case, are related to each other since measure properties are
then derived from these relations. Moreover, it is important that
these relations among real world objects are intuitive and sensible
to “everybody” and that they can be commonly agreed on.
      </p>
      <p>
        Therefore, [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] pointed out that the main issues in determining
the scale of IR effectiveness measures are: (i) to understand how
runs are empirically and intuitively ordered; (ii) to define what
an interval of runs is; and, (iii) to determine how these intervals
are ordered. Once all these aspects are settled, one can check
whether an effectiveness measure complies with them or not and thus
determine whether it is on an interval scale or not. In particular,
[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] found that under a strong top-heaviness notion of ordering
among runs, only Rank-Biased Precision (RBP) [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] with p = 1/2 is on
an interval scale while RBP for other values of p and other popular
measures – namely AP, Discounted Cumulated Gain (DCG) [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ],
and Expected Reciprocal Rank (ERR) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] – are not. Moreover, using
a weak top-heaviness notion of ordering among runs, [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] found
that all the previously mentioned IR effectiveness measures are not
on an interval scale.
      </p>
      <p>
        Strong top-heaviness provides us with a total ordering among
runs and, as discussed above, there is at least one case of IR
measure on an interval scale; however, the way in which strong
top-heaviness orders runs may give rise to disagreement or corner
cases. For example, strong top-heaviness ranks the run (1, 0, 0, 0)
with just one top relevant document before the run (0, 1, 1, 1) with
all relevant documents except for the first position; thus, there
might be disagreement on whether this is an appropriate ordering
for these runs. On the other hand, weak top-heaviness provides
us with a much more intuitive partial ordering based on two basic
operations, namely swapping two consecutive documents in a ranking
and replacing a not relevant document with a relevant one [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ];
however, none of the IR evaluation measures is on an interval scale
using weak top-heaviness.
      </p>
      <p>
        The problem with IR effectiveness measures emerging from [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]
is two-fold: on the one side, both strong and weak top-heaviness
create equi-spaced intervals of runs, as expected by the definition
of interval scale, but IR effectiveness measures do not respect this
equi-spacing; on the other side, both strong and weak top-heaviness
do not account enough for the importance and the effect of the
rank of a document in a run, since they both rely on the notion of
natural distance in a poset (partially ordered set) [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] which flattens
things too much, shrinking everything into a single number.
      </p>
      <p>
        In this paper, we take a different approach to the ordering of
intervals of runs, not based on single numbers, as the natural distance
of [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] does, but using vectors instead. This new ordering is richer
and more expressive than that induced by the natural distances in
the strong and weak top-heaviness cases and allows us to introduce
the notion of interval-like scale, i.e., something richer than an
ordinal scale but a bit less powerful than an interval scale, since runs
are ordered and intervals of runs are ordered too, but intervals may
not be equi-spaced. In particular, we find that, under reasonable
assumptions, DCG and RBP are on an interval-like scale while AP
and ERR are not.
      </p>
      <p>e paper is organized as follows: Section 2 recaps some basic
concepts about the representational theory of measurement and
posets; Section 3 deals with interval-like scales; nally, Section 4
wraps up the discussion and outlooks some future work.
2
2.1</p>
    </sec>
    <sec id="sec-3">
      <title>2 BACKGROUND</title>
    </sec>
    <sec id="sec-4">
      <title>2.1 Representational Theory of Measurement</title>
      <p>
        A relational structure [
        <xref ref-type="bibr" rid="ref15 ref18">15, 18</xref>
        ] is an ordered pair <bold>X</bold> = ⟨X, R<sub>X</sub>⟩ of
a domain set X and a set of relations R<sub>X</sub> on X, where the relations in
R<sub>X</sub> may have different arities, i.e. they can be unary, binary, ternary
relations and so on. Given two relational structures <bold>X</bold> and <bold>Y</bold>, a
homomorphism <bold>M</bold> : <bold>X</bold> → <bold>Y</bold> from <bold>X</bold> to <bold>Y</bold> is a mapping <bold>M</bold> = ⟨M, M<sub>R</sub>⟩
where: (i) M is a function that maps X into M(X) ⊆ Y, i.e. for each
element of the domain set there exists one corresponding image
element; (ii) M<sub>R</sub> is a function that maps R<sub>X</sub> into M<sub>R</sub>(R<sub>X</sub>) ⊆ R<sub>Y</sub> such
that ∀r ∈ R<sub>X</sub>, r and M<sub>R</sub>(r) have the same arity, i.e. for each relation
on the domain set there exists one (and it is usually, and often
implicitly, assumed: and only one) corresponding image relation;
(iii) ∀r ∈ R<sub>X</sub>, ∀x<sub>i</sub> ∈ X, if r(x<sub>1</sub>, …, x<sub>n</sub>) then M<sub>R</sub>(r)(M(x<sub>1</sub>), …,
M(x<sub>n</sub>)), i.e. if a relation holds for some elements of the domain set
then the image relation must hold for the image elements.
      </p>
      <p>A relational structure <bold>E</bold> is called empirical if its domain set E
spans over the entities under consideration in the real world, i.e. the
system runs in our case; a relational structure <bold>S</bold> is called symbolic
if its domain set S spans over a given set of numbers. A
measurement (scale) is the homomorphism <bold>M</bold> = ⟨M, M<sub>R</sub>⟩ from the real
world to the symbolic world and a measure is the number assigned
to an entity by this mapping.</p>
    </sec>
    <sec id="sec-5">
      <title>2.2 Measurement Scales</title>
      <p>
        [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] relied on the notion of difference structure [
        <xref ref-type="bibr" rid="ref15 ref18">15, 18</xref>
        ] to introduce
a definition of interval among system runs in such a way that it
ensures the existence of an interval scale.
      </p>
      <p>Given E, a weakly ordered empirical structure is a pair (E, ⪯)
where, for every a, b, c ∈ E: (i) a ⪯ b or b ⪯ a; (ii) if a ⪯ b and b ⪯ c,
then a ⪯ c.</p>
      <p>Given (E, ⪯), we have to define a difference Δ<sub>ab</sub> between two
elements a, b ∈ E, which is a kind of signed distance we exploit
to compare intervals. Then, we have to define a weak order ⪯<sub>d</sub>
between these Δ<sub>ab</sub> differences. We can proceed as follows: if two
elements a, b ∈ E are such that a ∼ b, i.e. a ⪯ b and b ⪯ a, then
the interval [a, b] is null and, consequently, we set Δ<sub>ab</sub> ∼<sub>d</sub> Δ<sub>ba</sub>; if
a ≺ b we agree upon choosing Δ<sub>aa</sub> ≻<sub>d</sub> Δ<sub>ab</sub> which, in turn, implies
that Δ<sub>aa</sub> ≺<sub>d</sub> Δ<sub>ba</sub>.</p>
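      <p>Definition 1. Let E be a finite (not empty) set of objects. Let ⪯<sub>d</sub> be a
binary relation on E × E that satisfies, for each a, b, c, d, a′, b′, c′ ∈ E,
the following axioms:
i. ⪯<sub>d</sub> is a weak order;
ii. if Δ<sub>ab</sub> ⪯<sub>d</sub> Δ<sub>cd</sub>, then Δ<sub>dc</sub> ⪯<sub>d</sub> Δ<sub>ba</sub>;
iii. if Δ<sub>ab</sub> ⪯<sub>d</sub> Δ<sub>a′b′</sub> and Δ<sub>bc</sub> ⪯<sub>d</sub> Δ<sub>b′c′</sub>, then Δ<sub>ac</sub> ⪯<sub>d</sub> Δ<sub>a′c′</sub>;
iv. Solvability Condition: if Δ<sub>aa</sub> ⪯<sub>d</sub> Δ<sub>cd</sub> ⪯<sub>d</sub> Δ<sub>ab</sub>, then there
exist d′, d″ ∈ E such that Δ<sub>ad′</sub> ∼<sub>d</sub> Δ<sub>cd</sub> ∼<sub>d</sub> Δ<sub>d″b</sub>.
Then (E, ⪯<sub>d</sub>) is a difference structure.</p>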
      <p>Particular attention has to be paid to the Solvability Condition,
which ensures the existence of an equally spaced gradation
between the elements of E, indispensable to construct an interval
scale measurement.</p>
      <p>e representation theorem for dierence structures states:
Theorem 1. Let E be a nite (not empty) set of objects and let
¹E; d º be a dierence structure. en there exist a measurement scale
M : E ! R such that for every a; b; c; d 2 E
Δab
d Δcd , M a
¹ º</p>
      <p>M b
¹ º</p>
      <p>M c
¹ º</p>
      <p>M¹dº :
is theorem ensures us that, if there is a dierence structure
on the empirical set E, then there exists an interval scale M.</p>
      <p>As anticipated in Section 1, we will introduce the notion of
interval-like scale, which corresponds to removing the solvability
condition from the definition of difference structure and obtaining
a new partial ordering of the intervals of runs.</p>
    </sec>
    <sec id="sec-6">
      <title>2.3 Posets</title>
      <p>
        A partially ordered set P, poset for short, is a set with a partial order ⪯
defined on it [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. A partial order ⪯ is a binary relation over P
which is reflexive, antisymmetric and transitive. Given s, t ∈ P, we
say that s and t are comparable if s ⪯ t or t ⪯ s, otherwise they are
incomparable.
      </p>
      <p>A closed interval is a subset of P defined as [s, t] := {u ∈ P : s ⪯
u ⪯ t}, where s, t ∈ P and s ⪯ t. Moreover we say that t covers s
if s ≺ t and [s, t] = {s, t}; that is, there does not exist u ∈ P such
that s ≺ u ≺ t.</p>
      <p>We can represent a finite poset P by using the Hasse diagram,
which is a graph where vertices are the elements of P, edges
represent the cover relations, and if s ≺ t then s is below t in the
diagram.</p>
      <p>A subset C of a poset P is a chain if any two elements of C are
comparable: a chain is a totally ordered subset of a poset. If C is a
finite chain, the length of C, ℓ(C), is defined by ℓ(C) = |C| − 1. A
maximal chain of P is a chain that is not a proper subset of any
other chain of P.</p>
      <p>If every maximal chain of P has the same length n, we say that
P is graded of rank n; in particular there exists a unique function
ρ : P → {0, 1, …, n}, called the rank function, such that ρ(s) = 0
if s is a minimal element of P, and ρ(t) = ρ(s) + 1 if t covers s.</p>
      <p>Finally, since any interval on a graded poset is graded, the length
of an interval [s, t] is given by ℓ(s, t) := ℓ([s, t]) = ρ(t) − ρ(s), also
called the natural distance.</p>
    </sec>
    <sec id="sec-7">
      <title>3 INTERVAL-LIKE SCALES</title>
    </sec>
    <sec id="sec-8">
      <title>3.1 Preliminary Definitions</title>
      <p>Given N, the length of the run, we define the set of retrieved
documents as D(N) = {(d<sub>1</sub>, …, d<sub>N</sub>) : d<sub>i</sub> ∈ D, d<sub>i</sub> ≠ d<sub>j</sub> for any i ≠ j},
i.e. the ranked list of retrieved documents without duplicates, and
the universe set of retrieved documents as 𝒟 := ⋃<sub>N=1</sub><sup>|D|</sup> D(N).
A run r<sub>t</sub>, retrieving a ranked list of documents D(N) in response
to a topic t ∈ T, is a function from T into 𝒟</p>
      <p>t ↦ r<sub>t</sub> = (d<sub>1</sub>, …, d<sub>N</sub>).
We denote by r<sub>t</sub>[j] the j-th element of the vector r<sub>t</sub>,
i.e. r<sub>t</sub>[j] = d<sub>j</sub>.</p>
      <p>We define the universe set of judged documents as ℛ :=
⋃<sub>N=1</sub><sup>|D|</sup> REL<sup>N</sup>, where REL<sup>N</sup> is the set of the ranked lists of judged
retrieved documents with length fixed to N. Since in our case
REL = {0, 1}, REL<sup>N</sup> = {0, 1}<sup>N</sup> refers to the space of all vectors of length N
consisting of 0 and 1. As for the set-based case, we denote
by RB<sub>t</sub> the recall base, i.e. the total number of relevant documents
for a topic.</p>
      <p>We call judged run the function r̂<sub>t</sub> from T × 𝒟 into ℛ, which
assigns a relevance degree to each retrieved document in the ranked
list</p>
      <p>(t, r<sub>t</sub>) ↦ r̂<sub>t</sub> = (GT(t, d<sub>1</sub>), …, GT(t, d<sub>N</sub>)).
We denote by r̂<sub>t</sub>[j] the j-th element of the vector r̂<sub>t</sub>, i.e. r̂<sub>t</sub>[j] =
GT(t, d<sub>j</sub>).</p>
      <p>As for the set-based case, we can simplify the notation by omitting
the dependence on topics: r̂ := (r̂[1], …, r̂[N]), RB, and so on.</p>
    </sec>
    <sec id="sec-9">
      <title>3.2 Ordering between Intervals</title>
      <p>
        Let us start by recalling the ordering between runs adopted in this
paper and based on the following two monotonicity-like properties
proposed by [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]:
      </p>
      <p>Replacement: a measure of retrieval effectiveness should
not decrease when replacing a document with another one
in the same rank position with a higher degree of relevance.
Swap: if we swap a less relevant document with a more
relevant one in a lower rank position, the measure should
not decrease.</p>
      <p>ese two properties lead to the following partial ordering among
system runs
rˆ
sˆ ,
k
Õ rˆ»j¼
j=1
k
Õ sˆ»j¼ 8k 2 f1; : : : ; N g :
j=1
(1)
is ordering considers a run bigger than another one when, for
each rank position, it has more relevant documents than the other
one up to that rank.</p>
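      <p>The ordering in Eq. (1) amounts to a componentwise comparison of prefix sums. The following is a minimal Python sketch, assuming judged runs are encoded as equal-length tuples of 0/1 relevance values; the function names <monospace>cumulative</monospace>, <monospace>precedes</monospace> and <monospace>comparable</monospace> are illustrative and are not part of the original formulation.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from itertools import accumulate


def cumulative(run):
    """Prefix sums of a binary judged run, e.g. (0, 1, 1, 0) -> (0, 1, 2, 2)."""
    return tuple(accumulate(run))


def precedes(r, s):
    """True when r is below or equal to s in the partial ordering of Eq. (1)."""
    if len(r) != len(s):
        raise ValueError("runs must have the same length")
    return all(a <= b for a, b in zip(cumulative(r), cumulative(s)))


def comparable(r, s):
    """True when the two runs are ordered one way or the other."""
    return precedes(r, s) or precedes(s, r)


print(precedes((0, 1, 0, 0), (1, 0, 1, 0)))    # True
print(comparable((1, 0, 0, 0), (0, 1, 1, 1)))  # False
```
]]></preformat>
      <p>Under this partial ordering the runs (1, 0, 0, 0) and (0, 1, 1, 1), which strong top-heaviness ranks one before the other, turn out to be incomparable.</p>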
      <p>
        This is the same ordering of runs used by [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] in the weak
top-heaviness case but, differently from [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], we now introduce a
different notion of length of an interval, not based on the natural distance
which, as discussed in Section 1, has the drawback of flattening
everything into a single number.
      </p>
      <p>To define the length of an interval we adopt the following
strategy: given r̂, ŝ ∈ REL<sup>N</sup> with r̂ ⪯ ŝ, we count how many
replacements in the last position and how many forward single-step swaps
at each depth are necessary to go from r̂ to ŝ following a maximal
chain in REL<sup>N</sup>. In order to do this, it is useful to define the
cumulative sums of a vector v = (v[1], …, v[N]), denoted using the
capital letter as V = (V[1], …, V[N]), where V[j] = Σ<sub>i=1</sub><sup>j</sup> v[i].</p>
      <p>Let us start with a simple example.</p>
      <sec id="sec-9-1">
        <title>Example. Consider the two judged runs in REL<sup>4</sup></title>
        <p>0̂ = (0, 0, 0, 0) and r̂ = (0, 1, 1, 0). Since 0̂ ⪯ r̂, in order to construct a chain from 0̂ to r̂ with the
two basic operators (replacement in last position and single-step
forward swap) we get, for instance,
(0, 0, 0, 0) ≺ (0, 0, 0, 1) ≺ (0, 0, 1, 0) ≺ (0, 1, 0, 0) ≺ (0, 1, 0, 1) ≺ (0, 1, 1, 0).
We have made two replacements in the fourth position, one swap
in the second position and two in the third one. Recall that with
swap at depth i we mean that a forward swap from position i + 1
to position i was done. We can count how many of these basic
operations in each position are needed to go from 0̂ to r̂ just by taking
the cumulative sums of r̂. Indeed we get</p>
        <p>R̂ = (0, 1, 2, 2),
and each entry R̂[k] with k &lt; N counts the number of swaps made
in position k, while R̂[N] counts the number of replacements, i.e.
the total mass of r̂, to go from 0̂ to r̂.</p>
        <p>More generally, given two vectors r̂, ŝ ∈ REL<sup>N</sup>, with r̂ ⪯ ŝ,
in order to collect the number of basic operations made at each
position to go from r̂ to ŝ, we can compute this vector of length N
first between 0̂ and r̂ and between 0̂ and ŝ, namely R̂ and Ŝ, and
then subtract the two vectors. Precisely, Ŝ − R̂ leads to a new vector
of length N, where each entry k equals the number of swaps or
replacements (if k = N) needed to go from r̂ to ŝ.</p>
        <p>Example. In order to better understand this mechanism, let us
consider a second example. Consider the two judged runs in REL<sup>4</sup>
r̂ = (0, 1, 0, 0),
ŝ = (1, 0, 1, 0).
In order to construct a chain from r̂ to ŝ with the two basic operators
(replacement in last position and single-step forward swap) we get
r̂ = (0, 1, 0, 0),
v̂ = (1, 0, 0, 0),
ŵ = (1, 0, 0, 1),
ŝ = (1, 0, 1, 0).
We have made a swap in the first and third position and a
replacement in the fourth position, which we can collect in a vector as
Ŝ − R̂ = (1, 1, 2, 2) − (0, 1, 1, 1) = (1, 0, 1, 1).</p>
        <p>Moreover, consider two runs r̂ ⪯ ŝ in REL<sup>10</sup> for which
Ŝ − R̂ = (0, 1, 1, 1, 2, 1, 1, 1, 0, 1).
Let t = Ŝ − R̂. For any i &lt; 10, t[i] tells us how many swaps one
needs to do at depth i to make the smallest run coincide with the
biggest one. Moreover, if the total number of relevant
documents is not equal for both, as in this example, the last entry of
t, t[N], is exactly the number of replacements on r̂ one needs to
make, and coincides with Σ<sub>i</sub> ŝ[i] − Σ<sub>i</sub> r̂[i].</p>
        <p>
          Given an interval [r̂, ŝ], if we take the cumulative sums of t =
Ŝ − R̂ we obtain the vector T that
counts, for every i ≤ N, the total number of swaps (and replacements,
if i = N) made from depth 1 to i between the endpoints of the
given interval. The vector T can be seen as a new and generalized
definition of the length of the interval [r̂, ŝ], which replaces the
natural distance used by [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
        </p>
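        <p>The two steps above (subtracting the cumulative sums to obtain t, then accumulating t to obtain T) can be sketched as follows. This is a minimal Python sketch, assuming runs are tuples of 0/1 values with r̂ ⪯ ŝ; the names <monospace>operation_counts</monospace> and <monospace>interval_length</monospace> are hypothetical helpers introduced only for illustration. It uses the runs r̂ = (0, 1, 0, 0) and ŝ = (1, 0, 1, 0) of the second example.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from itertools import accumulate


def operation_counts(r, s):
    """Vector t = S - R: swaps needed at each depth (replacements in the last entry)."""
    R = accumulate(r)
    S = accumulate(s)
    return tuple(b - a for a, b in zip(R, S))


def interval_length(r, s):
    """Vector T: cumulative sums of t, the generalized length of the interval [r, s]."""
    return tuple(accumulate(operation_counts(r, s)))


r = (0, 1, 0, 0)
s = (1, 0, 1, 0)
print(operation_counts(r, s))  # (1, 0, 1, 1)
print(interval_length(r, s))   # (1, 1, 2, 3)
```
]]></preformat>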
        <p>According to this new distance, we say that the interval [r̂<sub>1</sub>, ŝ<sub>1</sub>]
is smaller than or equal to the interval [r̂<sub>2</sub>, ŝ<sub>2</sub>] if, for the vectors T<sub>1</sub>
and T<sub>2</sub> of their cumulative sums, it holds that T<sub>1</sub>[i] ≤ T<sub>2</sub>[i] for any
i ≤ N. It is worth noticing that, if we take as definition of length
any convex linear combination of the values (T[1], …, T[N]), the
intervals comparable for the previous ordering remain comparable.
Other intervals become comparable for any fixed linear
combination, but it is not possible to say in advance that they are ordered in the
same way by any two of these combinations.</p>
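        <p>We are now able to define a difference in this setting:
Definition 2. Given r̂, ŝ ∈ REL<sup>N</sup>, with r̂ ⪯ ŝ, the difference Δ⃗<sub>ŝr̂</sub>
is a vector of length N such that
<disp-formula><tex-math>\vec{\Delta}_{\hat{s}\hat{r}}[i] := \sum_{j=1}^{i} (i - j + 1)\,\bigl(\hat{s}[j] - \hat{r}[j]\bigr)</tex-math></disp-formula>
for all i ∈ {1, …, N}.</p>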
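        <p>It can be easily proved that Δ⃗<sub>ŝr̂</sub> is exactly the vector T
defined above. Indeed, by construction, given r̂, ŝ ∈ REL<sup>N</sup> with
r̂ ⪯ ŝ, t[j] = Σ<sub>n=1</sub><sup>j</sup> (ŝ[n] − r̂[n]). Therefore, since each term ŝ[n] − r̂[n]
appears once for every j from n to i, i.e. i − n + 1 times,
<disp-formula><tex-math>T[i] = \sum_{j=1}^{i} t[j] = \sum_{j=1}^{i} \sum_{n=1}^{j} \bigl(\hat{s}[n] - \hat{r}[n]\bigr) = \sum_{j=1}^{i} (i - j + 1)\,\bigl(\hat{s}[j] - \hat{r}[j]\bigr).</tex-math></disp-formula></p>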
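        <p>The identity between the closed form of Definition 2 and the vector T can also be checked numerically. The following is a minimal Python sketch, assuming binary runs of length 4; the function names are illustrative, and the exhaustive loop over all pairs of runs is only a sanity check, not a proof.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from itertools import accumulate, product


def delta_closed_form(r, s):
    """Definition 2: entry i is the sum over j of (i - j + 1) * (s[j] - r[j])."""
    n = len(r)
    return tuple(
        sum((i - j + 1) * (s[j - 1] - r[j - 1]) for j in range(1, i + 1))
        for i in range(1, n + 1)
    )


def delta_cumulative(r, s):
    """Vector T: cumulative sums of t, where t is the difference of prefix sums."""
    t = [b - a for a, b in zip(accumulate(r), accumulate(s))]
    return tuple(accumulate(t))


# Compare the two constructions on every pair of binary runs of length 4.
runs = list(product((0, 1), repeat=4))
mismatches = [
    (r, s) for r in runs for s in runs
    if delta_closed_form(r, s) != delta_cumulative(r, s)
]
print(len(mismatches))  # 0
```
]]></preformat>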
        <p>Moreover, when computing the difference vector Δ⃗ between
two comparable runs r̂, ŝ, in this work we write Δ⃗<sub>ŝr̂</sub> whenever r̂ ⪯ ŝ:
if we instead consider Δ⃗<sub>r̂ŝ</sub>, then we are counting the backward
swaps from ŝ to r̂ and Δ⃗<sub>r̂ŝ</sub>[i] ≤ 0 for all i ∈ {1, …, N}.</p>
        <p>Since here Δ⃗ is no longer a scalar but a vector, we have to define
the partial order ⪯<sub>d</sub> among intervals of runs as follows:</p>
      </sec>
      <sec id="sec-9-2">
        <title>Definition 3. Given [r̂, ŝ], [û, v̂] ⊆ REL<sup>N</sup>,</title>
        <p>Δ⃗<sub>v̂û</sub> ⪯<sub>d</sub> Δ⃗<sub>ŝr̂</sub>
if and only if
Δ⃗<sub>v̂û</sub>[i] ≤ Δ⃗<sub>ŝr̂</sub>[i]
for all i ∈ {1, …, N}.
Example. With respect to the previous example, where t = Ŝ − R̂ =
(0, 1, 1, 1, 2, 1, 1, 1, 0, 1), the vector Δ⃗<sub>ŝr̂</sub> is given by
Δ⃗<sub>ŝr̂</sub> = T = (0, 1, 2, 3, 5, 6, 7, 8, 8, 9).</p>
        <sec id="sec-9-2-1">
          <title>Let now û, v̂ ∈ {0, 1}<sup>10</sup> be as follows</title>
          <p>û = (1, 0, 0, 1, 0, 1, 1, 1, 0, 0),
v̂ = (1, 0, 1, 1, 1, 0, 1, 0, 0, 0).
Clearly û ⪯ v̂ and</p>
          <p>Δ⃗<sub>v̂û</sub> = (0, 0, 1, 2, 4, 5, 6, 6, 6, 6).
Thus we can conclude that the difference between ŝ and r̂ is greater
than the difference between v̂ and û.</p>
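          <p>The comparison of Definition 3 can be reproduced on this example. The following is a minimal Python sketch, assuming runs are tuples of 0/1 values; <monospace>delta</monospace> and <monospace>interval_leq</monospace> are hypothetical helper names, and the vector for the interval [r̂, ŝ] is copied from the text above because the corresponding runs are not listed.</p>
          <preformat preformat-type="code"><![CDATA[
```python
from itertools import accumulate


def delta(r, s):
    """Difference vector for r below s: cumulative sums of the prefix-sum difference."""
    t = [b - a for a, b in zip(accumulate(r), accumulate(s))]
    return tuple(accumulate(t))


def interval_leq(d1, d2):
    """Definition 3: d1 is smaller than or equal to d2 when it is so in every entry."""
    return all(a <= b for a, b in zip(d1, d2))


u = (1, 0, 0, 1, 0, 1, 1, 1, 0, 0)
v = (1, 0, 1, 1, 1, 0, 1, 0, 0, 0)
delta_vu = delta(u, v)
delta_sr = (0, 1, 2, 3, 5, 6, 7, 8, 8, 9)  # vector T reported in the text

print(delta_vu)                          # (0, 0, 1, 2, 4, 5, 6, 6, 6, 6)
print(interval_leq(delta_vu, delta_sr))  # True
print(interval_leq(delta_sr, delta_vu))  # False
```
]]></preformat>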
          <p>
            Note that the last entry of Δ⃗ always equals the natural distance
as defined in Section 2.3 and used by [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ]. Indeed, given two
comparable runs r̂, ŝ ∈ REL<sup>N</sup>, with r̂ ⪯ ŝ, Δ⃗<sub>ŝr̂</sub>[N] counts the total number
of forward swaps of length one and/or replacements done from
r̂ to match ŝ. Since swaps of length one and replacements in the
last position are elementary operations, as observed above,
Δ⃗<sub>ŝr̂</sub>[N] is just counting the length of every maximal chain in [r̂, ŝ],
i.e., exactly the natural distance.
          </p>
          <p>is denition of dierence vector solves some of the problems
encountered with the dierence dened using the natural distance,
as the following example shows.</p>
          <p>Example. Let r̂, ŝ, û, v̂ be defined as follows:
r̂ = (0, 1, 0, 0, 0, 0, 0, 0, 0, 0),
ŝ = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0),
û = (0, 0, 0, 0, 0, 0, 0, 0, 0, 1),
v̂ = (0, 0, 0, 0, 0, 0, 0, 0, 1, 0),
where r̂ ⪯ ŝ and û ⪯ v̂.</p>
          <p>As already discussed, the natural distance induces a difference
between runs that does not keep track of the rank. In this case, the
natural distance would assign to both pairs, r̂, ŝ and û, v̂, a
difference equal to 1, even if these two pairs differ a lot in terms of
where differences actually happen in the ranking.</p>
          <p>Instead, Δ⃗ shows a bigger difference between r̂ and ŝ compared
to the other two runs, because their differences happen in higher
and more important rank positions:
Δ⃗<sub>ŝr̂</sub> = (1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
Δ⃗<sub>v̂û</sub> = (0, 0, 0, 0, 0, 0, 0, 0, 1, 1),
and Δ⃗<sub>ŝr̂</sub>[i] ≥ Δ⃗<sub>v̂û</sub>[i] for every i ∈ {1, …, 10}.</p>
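          <p>The contrast between the natural distance and the difference vector in this example can be verified directly. The following is a minimal Python sketch, assuming runs are tuples of 0/1 values; the helper name <monospace>delta</monospace> is illustrative, and the natural distance is read off as the last entry of the difference vector, as noted above.</p>
          <preformat preformat-type="code"><![CDATA[
```python
from itertools import accumulate


def delta(r, s):
    """Difference vector for r below s: cumulative sums of the prefix-sum difference."""
    t = [b - a for a, b in zip(accumulate(r), accumulate(s))]
    return tuple(accumulate(t))


r = (0, 1, 0, 0, 0, 0, 0, 0, 0, 0)
s = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
u = (0, 0, 0, 0, 0, 0, 0, 0, 0, 1)
v = (0, 0, 0, 0, 0, 0, 0, 0, 1, 0)

delta_sr = delta(r, s)
delta_vu = delta(u, v)

# The last entry is the natural distance: identical for the two intervals.
natural_sr, natural_vu = delta_sr[-1], delta_vu[-1]
# The full vectors, instead, tell the two intervals apart.
dominates = all(a >= b for a, b in zip(delta_sr, delta_vu))

print(natural_sr, natural_vu)  # 1 1
print(delta_sr)                # (1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
print(delta_vu)                # (0, 0, 0, 0, 0, 0, 0, 0, 1, 1)
print(dominates)               # True
```
]]></preformat>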
          <p>erefore, this new and more expressive dierence matches
beer with the intuition that the higher the rank position at which
it happens, the more important the same dierence between two
runs.</p>
          <p>e vector Δ® is thus useful to compare, when possible, intervals
on RELN , paying the necessary aention on the ranking. As a
consequence, a measure that satisfy these relations among intervals,
although not interval scale, could be viewed as something more
powerful than a measure on ordinal scale. Indeed, when the above
dierences between intervals are comparable, one direction of i
on eorem 1 is still satised.</p>
          <p>erefore we can say that a measure M of retrieval eectiveness
is interval-like if, given a distance (potentially vector) Δ , an
ordering d between distances, and given rˆ; sˆ; uˆ; vˆ 2 RELN , the
following relation holds:</p>
          <p>Δsˆrˆ d Δvˆuˆ ) M¹sˆº M¹rˆº M¹vˆº M¹uˆº:
e next section is discusses whether some well-known IR
measures are interval-like with respect to the dierence introduced in
Denition 2.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-10">
      <title>3.3 Interval-like Scale Measures</title>
      <p>We tested some measures of retrieval effectiveness, namely AP,
RBP<sub>p</sub>, ERR and DCG, on intervals with comparable differences
according to the above definition.</p>
      <p>ERR shows the strongest discordance with our definition of
difference, since it often does not respect the relations between
intervals induced by Δ⃗, as the next example shows.</p>
      <p>Example. Let us consider the following four runs r̂, ŝ, û, v̂ ∈ {0, 1}<sup>10</sup>:
r̂ = (0, 0, 0, 0, 0, 0, 1, 1, 1, 0),
ŝ = (0, 0, 0, 0, 0, 1, 0, 1, 1, 0),
û = (1, 1, 0, 1, 0, 1, 1, 0, 1, 1),
v̂ = (1, 1, 1, 0, 0, 1, 1, 0, 1, 1).
Clearly r̂ ⪯ ŝ ⪯ û ⪯ v̂. It seems fair to think that r̂ and ŝ give rise
to a smaller interval compared to [û, v̂]: the endpoints of
both intervals differ by a swap of length one, but made in different
positions. Moreover it is easy to prove that Δ⃗<sub>ŝr̂</sub>[i] ≤ Δ⃗<sub>v̂û</sub>[i] for all i. But
while the measures RBP<sub>p</sub>, AP and DCG agree with the previous
statement, ERR does not, since ERR(ŝ) − ERR(r̂) &gt; ERR(v̂) − ERR(û).</p>
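      <p>The behaviour described in this example can be reproduced numerically. The following is a minimal Python sketch under explicit assumptions that are not stated in the text: DCG uses the discount 1 / log<sub>2</sub>(rank + 1), ERR assigns a stopping probability of 0.5 to each relevant document (binary relevance), RBP uses the arbitrary persistence p = 0.8, and AP is normalized by a hypothetical recall base of 10. All function names are illustrative.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from itertools import accumulate
from math import log2


def delta(r, s):
    """Difference vector for r below s: cumulative sums of the prefix-sum difference."""
    t = [b - a for a, b in zip(accumulate(r), accumulate(s))]
    return tuple(accumulate(t))


def rbp(run, p=0.8):
    # Rank-biased precision; the persistence p = 0.8 is an arbitrary choice.
    return (1 - p) * sum(g * p ** i for i, g in enumerate(run))


def dcg(run):
    # Assumed discount: 1 / log2(rank + 1), with ranks starting at 1.
    return sum(g / log2(i + 2) for i, g in enumerate(run))


def ap(run, recall_base=10):
    # Average precision; recall_base = 10 is a hypothetical value.
    hits, total = 0, 0.0
    for rank, g in enumerate(run, start=1):
        if g:
            hits += 1
            total += hits / rank
    return total / recall_base


def err(run, stop_prob=0.5):
    # Assumed: a relevant document satisfies the user with probability 0.5.
    value, keep_going = 0.0, 1.0
    for rank, g in enumerate(run, start=1):
        satisfied = stop_prob * g
        value += keep_going * satisfied / rank
        keep_going *= 1 - satisfied
    return value


r = (0, 0, 0, 0, 0, 0, 1, 1, 1, 0)
s = (0, 0, 0, 0, 0, 1, 0, 1, 1, 0)
u = (1, 1, 0, 1, 0, 1, 1, 0, 1, 1)
v = (1, 1, 1, 0, 0, 1, 1, 0, 1, 1)

# The interval [r, s] is smaller than or equal to [u, v] in every entry.
smaller = all(a <= b for a, b in zip(delta(r, s), delta(u, v)))
# A measure agrees when its gap on [r, s] does not exceed its gap on [u, v].
agrees = {
    name: m(s) - m(r) <= m(v) - m(u)
    for name, m in (("RBP", rbp), ("DCG", dcg), ("AP", ap), ("ERR", err))
}
print(smaller)  # True
print(agrees)   # {'RBP': True, 'DCG': True, 'AP': True, 'ERR': False}
```
]]></preformat>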
      <p>Another measure that does not always respect the relations
between distances is AP.</p>
      <p>Example. Let us consider the following runs r̂, ŝ, û ∈ {0, 1}<sup>10</sup>:
r̂ = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
ŝ = (0, 1, 0, 0, 1, 0, 0, 0, 0, 1),
û = (0, 1, 0, 0, 1, 1, 1, 0, 0, 1).</p>
      <p>Clearly r̂ ⪯ ŝ and ŝ ⪯ û. Readers may agree to consider the
interval [r̂, ŝ] strictly bigger than [ŝ, û], since from û to ŝ we have
lost only two relevant documents, while from ŝ to r̂ the information
lost seems to be higher. Moreover Δ⃗<sub>ŝr̂</sub>[i] ≥ Δ⃗<sub>ûŝ</sub>[i] for all i, with strict
inequality for some i. However, while the measures RBP<sub>p</sub>, ERR and
DCG agree with this relation between the two intervals, AP does
not, since AP(ŝ) − AP(r̂) &lt; AP(û) − AP(ŝ).</p>
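      <p>The AP counterexample can be checked in the same way. The following is a minimal Python sketch, assuming AP is normalized by a hypothetical recall base of 10 (the same value for all three runs, so the sign of the comparison does not depend on it); the function names are illustrative.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from itertools import accumulate


def delta(r, s):
    """Difference vector for r below s: cumulative sums of the prefix-sum difference."""
    t = [b - a for a, b in zip(accumulate(r), accumulate(s))]
    return tuple(accumulate(t))


def ap(run, recall_base=10):
    # Average precision; recall_base = 10 is a hypothetical value.
    hits, total = 0, 0.0
    for rank, g in enumerate(run, start=1):
        if g:
            hits += 1
            total += hits / rank
    return total / recall_base


r = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
s = (0, 1, 0, 0, 1, 0, 0, 0, 0, 1)
u = (0, 1, 0, 0, 1, 1, 1, 0, 0, 1)

delta_sr = delta(r, s)
delta_us = delta(s, u)
# The interval [r, s] is at least as large as [s, u] in every entry ...
larger = all(a >= b for a, b in zip(delta_sr, delta_us))
# ... yet AP assigns it the smaller gap.
ap_reversed = ap(s) - ap(r) < ap(u) - ap(s)

print(delta_sr)     # (0, 1, 2, 3, 5, 7, 9, 11, 13, 16)
print(delta_us)     # (0, 0, 0, 0, 0, 1, 3, 5, 7, 9)
print(larger)       # True
print(ap_reversed)  # True
```
]]></preformat>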
      <p>Instead, RBP<sub>p</sub> and DCG show a greater agreement with the
inequalities between intervals induced by Δ⃗, even if sometimes they
do not respect these relations: this happens when the endpoints of
an interval do not have an equal number of relevant documents.</p>
      <sec id="sec-10-1">
        <title>Example. Let us consider r̂, ŝ, û ∈ {0, 1}<sup>10</sup></title>
        <p>such that Δ⃗<sub>ŝr̂</sub>[i] ≤ Δ⃗<sub>ûŝ</sub>[i] for all i, with strict inequality for some i. While
û and ŝ have the same number of relevant documents, r̂ has two
relevant documents fewer than ŝ. In particular DCG(ŝ) − DCG(r̂) &gt;
DCG(û) − DCG(ŝ) and, for p &gt; 0.85, RBP<sub>p</sub>(ŝ) − RBP<sub>p</sub>(r̂) &gt; RBP<sub>p</sub>(û) −
RBP<sub>p</sub>(ŝ), against the inequality given by the difference vectors.</p>
        <p>erefore, we can say that RBPp and DCG are interval-like with
respect to the dierence introduced in Denition 2 and
considering only intervals where the endpoints have an equal number of
relevant documents. While AP and ERR are not even interval-like
since the relations between intervals oen fail to be complied with.</p>
      </sec>
    </sec>
    <sec id="sec-11">
      <title>4 CONCLUSIONS AND FUTURE WORK</title>
      <p>
        In this paper, we conducted a formal study to propose a new and
more expressive way of providing an empirical ordering of intervals
of runs in order to determine how close IR effectiveness measures
are to being on an interval scale. Indeed, previous work [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ] has
shown that they are on an ordinal scale, under some conditions,
but not on an interval scale. We have introduced the notion of
interval-like scale, a kind of interval scale which allows intervals
not to be equi-spaced, and we have shown that both DCG and RBP
are on this scale, under reasonable conditions, while AP and ERR
are not.
      </p>
      <p>Future work will concern an empirical investigation of the
different theoretical properties of evaluation measures we have found
in order to determine the impact and severity of not complying
with them when you compute descriptive statistics, like mean and
variance, and when you conduct statistical significance tests.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E.</given-names>
            <surname>Amigó</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. F.</given-names>
            <surname>Verdejo</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>A General Evaluation Measure for Document Organization Tasks</article-title>
          .
          <source>In Proc. 36th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR</source>
          <year>2013</year>
          ),
          <string-name>
            <given-names>G. J. F.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sheridan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kelly</surname>
          </string-name>
          , M. de Rijke, and T. Sakai (Eds.). ACM Press, New York, USA,
          <fpage>643</fpage>
          -
          <lpage>652</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Bollmann</surname>
          </string-name>
          .
          <year>1984</year>
          .
          <article-title>Two Axioms for Evaluation Measures in Information Retrieval</article-title>
          .
          <source>In Proc. of the ird Joint BCS and ACM Symposium on Research and Development in Information Retrieval</source>
          ,
          <string-name>
            <given-names>C. J.</given-names>
            <surname>van Rijsbergen</surname>
          </string-name>
          (Ed.). Cambridge University Press, UK,
          <fpage>233</fpage>
          -
          <lpage>245</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Bollmann</surname>
          </string-name>
          and
          <string-name>
            <given-names>V. S.</given-names>
            <surname>Cherniavsky</surname>
          </string-name>
          .
          <year>1980</year>
          .
          <article-title>Measurement-theoretical investigation of the MZ-metric</article-title>
          .
          <source>In Proc. 3rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR</source>
          <year>1980</year>
          ), C. J. van Rijsbergen (Ed.). ACM Press, New York, USA,
          <fpage>256</fpage>
          -
          <lpage>267</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Buckley</surname>
          </string-name>
          and
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Voorhees</surname>
          </string-name>
          .
          <year>2000</year>
          .
          <article-title>Evaluating Evaluation Measure Stability</article-title>
          .
          <source>In Proc. 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR</source>
          <year>2000</year>
          ), E. Yannakoudakis,
          <string-name>
            <given-names>N. J.</given-names>
            <surname>Belkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-K.</given-names>
            <surname>Leong</surname>
          </string-name>
          , and P. Ingwersen (Eds.). ACM Press, New York, USA,
          <fpage>33</fpage>
          -
          <lpage>40</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C.</given-names>
            <surname>Buckley</surname>
          </string-name>
          and
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Voorhees</surname>
          </string-name>
          .
          <year>2004</year>
          .
          <article-title>Retrieval Evaluation with Incomplete Information</article-title>
          .
          <source>In Proc. 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR</source>
          <year>2004</year>
          ),
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanderson</surname>
          </string-name>
          , K. Järvelin, J. Allan, and P. Bruza (Eds.). ACM Press, New York, USA,
          <fpage>25</fpage>
          -
          <lpage>32</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>L.</given-names>
            <surname>Busin</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Mizzaro</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Axiometrics: An Axiomatic Approach to Information Retrieval Effectiveness Metrics</article-title>
          .
          <source>In Proc. 4th International Conference on the Theory of Information Retrieval (ICTIR</source>
          <year>2013</year>
          ),
          <string-name>
            <given-names>O.</given-names>
            <surname>Kurland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Metzler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lioma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Larsen</surname>
          </string-name>
          , and P. Ingwersen (Eds.). ACM Press, New York, USA,
          <fpage>22</fpage>
          -
          <lpage>29</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>B. A.</given-names>
            <surname>Carterette</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Multiple Testing in Statistical Analysis of Systems-Based Information Retrieval Experiments</article-title>
          .
          <source>ACM Transactions on Information Systems (TOIS) 30</source>
          ,
          <issue>1</issue>
          (
          <year>2012</year>
          ),
          <fpage>4:1</fpage>
          -
          <lpage>4:34</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B. A.</given-names>
            <surname>Carterette</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Bayesian Inference for Information Retrieval Evaluation</article-title>
          .
          <source>In Proc. 1st ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR</source>
          <year>2015</year>
          ),
          <string-name>
            <given-names>J.</given-names>
            <surname>Allan</surname>
          </string-name>
          , W. B. Cro,
          <string-name>
            <surname>A. P. de Vries</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Zhai</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Fuhr</surname>
          </string-name>
          , and Y. Zhang (Eds.). ACM Press, New York, USA,
          <fpage>31</fpage>
          -
          <lpage>40</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>O.</given-names>
            <surname>Chapelle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Metzler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Grinspan</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Expected Reciprocal Rank for Graded Relevance</article-title>
          .
          <source>In Proc. 18th International Conference on Information and Knowledge Management (CIKM</source>
          <year>2009</year>
          ), D. W.-L. Cheung,
          <string-name>
            <given-names>I.-Y.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. W.</given-names>
            <surname>Chu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Hu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Lin</surname>
          </string-name>
          (Eds.). ACM Press, New York, USA,
          <fpage>621</fpage>
          -
          <lpage>630</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ferrante</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Maistro</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Towards a Formal Framework for Utility-oriented Measurements of Retrieval Effectiveness</article-title>
          .
          <source>In Proc. 1st ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR</source>
          <year>2015</year>
          ),
          <string-name>
            <given-names>J.</given-names>
            <surname>Allan</surname>
          </string-name>
          , W. B. Cro,
          <string-name>
            <surname>A. P. de Vries</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Zhai</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Fuhr</surname>
          </string-name>
          , and Y. Zhang (Eds.). ACM Press, New York, USA,
          <fpage>21</fpage>
          -
          <lpage>30</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ferrante</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Pontarollo</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Are IR Evaluation Measures on an Interval Scale?</article-title>
          .
          <source>In Proc. 3rd ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR</source>
          <year>2017</year>
          ),
          <string-name>
            <given-names>J.</given-names>
            <surname>Kamps</surname>
          </string-name>
          , E. Kanoulas, M. de Rijke,
          <string-name>
            <given-names>H.</given-names>
            <surname>Fang</surname>
          </string-name>
          , and E. Yilmaz (Eds.). ACM Press, New York, USA,
          <fpage>67</fpage>
          -
          <lpage>74</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Foldes</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>On distances and metrics in discrete ordered sets</article-title>
          . arXiv.org,
          <source>Combinatorics (math.CO) arXiv:1307.0244 (June</source>
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Hull</surname>
          </string-name>
          .
          <year>1993</year>
          .
          <article-title>Using Statistical Testing in the Evaluation of Retrieval Experiments</article-title>
          .
          <source>In Proc. 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR</source>
          <year>1993</year>
          ), R. Korage, E. Rasmussen, and P. Wille (Eds.). ACM Press, New York, USA,
          <fpage>329</fpage>
          -
          <lpage>338</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>K.</given-names>
            <surname>Järvelin</surname>
          </string-name>
          and J. Kekäläinen.
          <year>2002</year>
          .
          <article-title>Cumulated Gain-Based Evaluation of IR Techniques</article-title>
          .
          <source>ACM Transactions on Information Systems (TOIS) 20</source>
          ,
          <issue>4</issue>
          (
          <year>October 2002</year>
          ),
          <fpage>422</fpage>
          -
          <lpage>446</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>D. H.</given-names>
            <surname>Krantz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. D.</given-names>
            <surname>Luce</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Suppes</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Tversky</surname>
          </string-name>
          .
          <year>1971</year>
          .
          <article-title>Foundations of Measurement. Additive and Polynomial Representations</article-title>
          . Vol.
          <volume>1</volume>
          . Academic Press, New York, USA.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Moffat</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Zobel</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Rank-biased Precision for Measurement of Retrieval Effectiveness</article-title>
          .
          <source>ACM Transactions on Information Systems (TOIS) 27</source>
          ,
          <issue>1</issue>
          (
          <year>2008</year>
          ),
          <fpage>2:1</fpage>
          -
          <lpage>2:27</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>S.</given-names>
            <surname>Robertson</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>On GMAP: and Other Transformations</article-title>
          .
          <source>In Proc. 15th International Conference on Information and Knowledge Management (CIKM</source>
          <year>2006</year>
          ),
          <string-name>
            <given-names>P. S.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Tsotras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. A.</given-names>
            <surname>Fox</surname>
          </string-name>
          , and C.-B. Liu (Eds.). ACM Press, New York, USA,
          <fpage>78</fpage>
          -
          <lpage>83</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>G. B.</given-names>
            <surname>Rossi</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Measurement and Probability. A Probabilistic Theory of Measurement with Applications</article-title>
          . Springer-Verlag, New York, USA.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>T.</given-names>
            <surname>Sakai</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>Evaluating Evaluation Metrics based on the Bootstrap</article-title>
          .
          <source>In Proc. 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR</source>
          <year>2006</year>
          ), E. N. Ehimiadis, S. Dumais,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hawking</surname>
          </string-name>
          ,
          and K. Järvelin (Eds.). ACM Press, New York, USA,
          <fpage>525</fpage>
          -
          <lpage>532</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>T.</given-names>
            <surname>Sakai</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Statistical Reform in Information Retrieval?</article-title>
          <source>SIGIR Forum 48</source>
          ,
          <issue>1</issue>
          (
          <year>June 2014</year>
          ),
          <fpage>3</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>T.</given-names>
            <surname>Sakai</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>e Probability that Your Hypothesis Is Correct, Credible Intervals, and Eect Sizes for IR Evaluation</article-title>
          .
          <source>In Proc. 40th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR</source>
          <year>2017</year>
          ),
          <string-name>
            <given-names>N.</given-names>
            <surname>Kando</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Sakai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Joho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. P.</given-names>
            <surname>de Vries</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R. W.</given-names>
            <surname>White</surname>
          </string-name>
          (Eds.). ACM Press, New York, USA,
          <fpage>25</fpage>
          -
          <lpage>34</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>F.</given-names>
            <surname>Sebastiani</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>An Axiomatically Derived Measure for the Evaluation of Classification Algorithms</article-title>
          .
          <source>In Proc. 1st ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR</source>
          <year>2015</year>
          ),
          <string-name>
            <given-names>J.</given-names>
            <surname>Allan</surname>
          </string-name>
          , W. B. Cro,
          <string-name>
            <surname>A. P. de Vries</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Zhai</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Fuhr</surname>
          </string-name>
          , and Y. Zhang (Eds.). ACM Press, New York, USA,
          <fpage>11</fpage>
          -
          <lpage>20</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Smucker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Allan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B. A.</given-names>
            <surname>Carterette</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>A Comparison of Statistical Significance Tests for Information Retrieval Evaluation</article-title>
          .
          <source>In Proc. 16th International Conference on Information and Knowledge Management (CIKM</source>
          <year>2007</year>
          ),
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Silva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A. F.</given-names>
            <surname>Laender</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Baeza-Yates</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. L.</given-names>
            <surname>McGuinness</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Olstad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ø. H.</given-names>
            <surname>Olsen</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Falcão</surname>
          </string-name>
          (Eds.). ACM Press, New York, USA,
          <fpage>623</fpage>
          -
          <lpage>632</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>R. P.</given-names>
            <surname>Stanley</surname>
          </string-name>
          .
          <year>2012</year>
          . Enumerative Combinatorics - Volume
          <volume>1</volume>
          (2nd ed.).
          <source>Cambridge Studies in Advanced Mathematics</source>
          , Vol.
          <volume>49</volume>
          . Cambridge University Press, Cambridge, UK.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Stevens</surname>
          </string-name>
          .
          <year>1946</year>
          .
          <article-title>On the eory of Scales of Measurement</article-title>
          . Science, New Series 103,
          <issue>2684</issue>
          (
          <year>June 1946</year>
          ),
          <fpage>677</fpage>
          -
          <lpage>680</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>C. J.</given-names>
            <surname>van Rijsbergen</surname>
          </string-name>
          .
          <year>1974</year>
          .
          <article-title>Foundations of Evaluation</article-title>
          .
          <source>Journal of Documentation 30</source>
          ,
          <issue>4</issue>
          (
          <year>1974</year>
          ),
          <fpage>365</fpage>
          -
          <lpage>373</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>W.</given-names>
            <surname>Webber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moffat</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Zobel</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Score Standardization for Inter-Collection Comparison of Retrieval Systems</article-title>
          .
          <source>In Proc. 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR</source>
          <year>2008</year>
          ), T.-S. Chua,
          <string-name>
            <given-names>M.-K.</given-names>
            <surname>Leong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. W.</given-names>
            <surname>Oard</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Sebastiani</surname>
          </string-name>
          (Eds.). ACM Press, New York, USA,
          <fpage>51</fpage>
          -
          <lpage>58</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>