Introduction

Basis of a Formal Framework for Information Retrieval Evaluation Measurements?

Marco Ferrante

ferrante@math.unipd.it 1 2

Nicola Ferro

ferro@dei.unipd.it 0 2

Maria Maistro

maistro@dei.unipd.it 0 2 0 Dept. of Information Engineering, University of Padua , Italy 1 Dept. of Mathematics, University of Padua , Italy 2 Preliminary De nitions

In this paper we present a formal framework, based on the representational theory of measurement and we de ne and study the properties of utility-oriented measurements of retrieval e ectiveness like AP, RBP, ERR and many other popular IR evaluation measures. Let us consider a set of documents D and a set of topics T and let (REL, ) be a totally ordered set of relevance degrees, where we assume the existence of a minimum that we call the non-relevant relevance degree nr = min(REL). Then, given a positive natural number n called the length of the run, we de ne the set of retrieved documents as D(n) = f(d1; : : : ; dn) : di 2 D; di 6= dj for any i 6= jg, i.e. the ranked list of retrieved documents without duplicates, and the universe set of retrieved documents as D := SjnD=j1 D(n). A run rt, retrieving a ranked list of documents D(n) in response to a topic t 2 T , is a function from T into D, t 7! rt = (d1; : : : ; dn).

Introduction

Even if Information Retrieval (IR) has been deeply rooted in experimentation since its inception, especially with respect to the evaluation area, we still lack a deep comprehension about what the evaluation measures are and we mostly rely just to empirical evidence.

We, as others [ 1,2,6 ], think that, in order to achieve a better understanding of evaluation measures, the development of a rigorous theory is needed. Therefore, we start to lay the foundations for a formal framework which di ers from previous attempts to formalize IR evaluation measures since it explicitly puts these metrics in the wake of the measurement theory, it distinguishes between issues due to the intrinsic di culties in comparing runs and those due to the numerical properties of measures, and it is minimal, consisting of just one axiom (De nition 1). ? Extended abstract of [ 4 ].

De ned the universe set of judged documents as R := SjnD=j1 RELn, we call judged run the function r^t from T D into R, which assigns a relevance degree to each retrieved document in the ranked list, and we denote by r^t[j] the j-th element of the vector r^t, i.e. r^t[j] = GT (t; dj ), where GT is the groundtruth, i.e. a map which assigns the relevance degree. 3

Measurement and Measure

The representational theory of measurement [ 5 ] aims at providing a formal basis to our intuition concerning the action of measuring. More formally, a relational structure is an ordered pair X = X; RX of a domain set X and a set of relations RX on X. Given two relational structures X and Y, a homomorphism M : X ! Y is a mapping M = M; MR where M : X ! Y and MR : RX ! RY is a function which maps each relation on the domain set into one and only one corresponding image relation, with the condition that 8r 2 RX ; 8xi 2 X, if r(x1; : : : ; xn) then MR(r) M(x1); : : : ; M(xn) . A relational structure E is called empirical if its domain set E spans over the entities in the real world, while it is called symbolic, S, if its domain set S spans over a given set of symbols.

Hence, more precisely a Measurement is a homomorphism M = M; MR from the real world to a symbolic world. Consequently, a measure is the number or symbol assigned to an entity by this mapping in order to characterize an attribute [ 3 ].

A key point in de ning a measurement is to start from a clear empirical relational structure, in our case it is represented by the set of all the runs with an ordering relation based on the utility expressed as \amount" of relevance, E = T D; . Since the set D lacks of many desirable properties, for example, inclusion and union, we will focus on a partial ordering among runs of the same length. In particular, we will restrict ourselves only to those cases where the ordering is intuitive and it is possible to nd a commonly shared agreement.

Therefore, let rt and st be two runs with length n, we introduce the above mentioned partial ordering among runs as rt st , fj

k : r^t[j] relg fj 8rel 2 REL and k 2 f1; : : : ; ng k : s^t[k] relg which counts, for each relevance degree and rank position, how many items are above that relevance degree and, if a run has higher counts for each relevance degree and rank position, it is considered greater than another one.

For example, if we have four relevance degrees REL = f0; 1; 2; 3g, the run r^t = (0; 1; 1; 2; 2) is smaller than the run s^t = (0; 1; 1; 2; 3) but the run r^t = (0; 1; 1; 2; 2) is not comparable to the run w^t = (0; 1; 1; 1; 3). 4

Utility-oriented E ectiveness Measurements of Retrieval

We de ne an utility-oriented measurement of retrieval e ectiveness as an homomorphism between the empirical relational structure E = T D; , and the symbolic relational structure S = R0+; , that is a mapping which assign to any run a non negative number, i.e. a utility-oriented measure of retrieval e ectiveness.

De nition 1. A function M : T D ! R0+ de ned as M = (r^t), i.e. the composition of a judged run r^t with a scoring function : R ! R0+ is a utilityoriented measurement of retrieval e ectiveness if and only if for any two runs rt and st with the same length n such that rt st, then (r^t) (s^t).

Even if the previous de nition ts our purposes, it could be di cult to check it in practice. Therefore, we introduce two \monotonicity-like" properties, replacement and swap, which are equivalent to the required monotonicity (Theorem 1).

Replacement If we replace a less relevant document with a more relevant one in the same rank position, a utility-oriented measurement of retrieval e ectiveness should not decrease.

Swap If we swap a less relevant document in a higher rank position with a more relevant one in a lower rank position, a utility-oriented measurement of retrieval e ectiveness should not decrease.

Theorem 1 (Equivalence). A scoring function de ned from R into R0+ leads to a utility-oriented measurement of retrieval e ectiveness M if and only if it satis es the Replacement and the Swap properties.

As a nal remark, note that for any two runs rt and st such that rt st, De nition 1 ensures that any two utility-oriented measurements M1 and M2 will order rt below st, i.e. M1(rt) M1(st) and M2(rt) M2(st). On the contrary, when two runs are not comparable, i.e. when they are outside the partial ordering , we can nd two utility-oriented measurements M1 and M2 which order them di erently. 5

Balancing

In this section, we explore the behaviour of utility-oriented measurements when two runs are not comparable, according to the the partial ordering . Let rt and st be two runs with length n, qmin is the minimum relevance degree above not relevant and qmax is the maximum relevance degree, hence 0 < qmin qmax < 1. We de ne the Balancing Index as a function of the run length: B(n) = max nb 2 N : M rt : r^t[ 1 ] = qmax; r^t[j] = 0; 1 < j

M st : s^t[i] = 0; 1 i < b; s^t[j] = qmin; b j

n n o

As an example, consider the binary case REL = f0; 1g and runs of length 5. The balancing index seeks the maximum rank position of the rst relevant document, for which M (1; 0; 0; 0; 0) has score greater or equal than M (0; 0; 0; 0; 1) or M (0; 0; 0; 1; 1) or M (0; 0; 1; 1; 1) or M (0; 1; 1; 1; 1) . 210800 AEPRR 160 nnDDCCGG,, lloogg120 140 pR1B0P, p = 0.50 Ixedn120 RRBBPP,, pp == 00..9850 ing100 cn laa 80 B 60 40 20 00 20 40 60 L8e0ngth1o0f0the r1u2n0 140 160 180 200

Conclusions and Future Work

In this paper we have laid the foundations of a formal framework for de ning what a utility-oriented measurement of retrieval e ectiveness is, on the basis of the representational theory of measurement.

Future work will concern the exploration of measurements core problems, the exploitation of the theory of measurements scales, and the application of the proposed framework to other cases, i.e. measures based on diversity.

Amigo ,

Gonzalo , and

M. F.

Verdejo . A General Evaluation Measure for Document Organization Tasks . In SIGIR 2013 , pp. 643 { 652 .

Busin and

Mizzaro . Axiometrics: An Axiomatic Approach to Information Retrieval E ectiveness Metrics . In ICTIR 2013 , pp. 22 { 29 .

N. E.

Fenton and

Bieman . Software Metrics: A Rigorous & Practical Approach . Chapman and Hall/CRC, USA, 3rd edition, 2014 .

Ferrante ,

Ferro , and

Maistro . Towards a Formal Framework for Utilityoriented Measurements of Retrieval E ectiveness . In ICTIR 2015 , pp. 21 { 30 .

D. H.

Krantz ,

R. D.

Luce ,

Suppes , and

Tversky . Foundations of Measurement. Additive and Polynomial Representations , volume 1 . Academic Press, USA, 1971 .

6. A. Mo at. Seven Numeric Properties of E ectiveness Metrics . In AIRS 2013 , pp. 1 { 12 . LNCS 8281.