=Paper=
{{Paper
|id=Vol-1653/paper_20
|storemode=property
|title=Basis of a Formal Framework for Information Retrieval Evaluation Measurements
|pdfUrl=https://ceur-ws.org/Vol-1653/paper_20.pdf
|volume=Vol-1653
|authors=Nicola Ferro,Marco Ferrante,Maria Maistro
|dblpUrl=https://dblp.org/rec/conf/iir/FerroFM16
}}
==Basis of a Formal Framework for Information Retrieval Evaluation Measurements==
Marco Ferrante¹, Nicola Ferro², and Maria Maistro²
¹ Dept. of Mathematics, University of Padua, Italy, ferrante@math.unipd.it
² Dept. of Information Engineering, University of Padua, Italy, ferro@dei.unipd.it, maistro@dei.unipd.it
(Extended abstract of [4].)
Abstract. In this paper we present a formal framework based on the representational theory of measurement, and we define and study the properties of utility-oriented measurements of retrieval effectiveness, such as AP, RBP, ERR, and many other popular IR evaluation measures.
1 Introduction
Even if Information Retrieval (IR) has been deeply rooted in experimentation since its inception, especially in the evaluation area, we still lack a deep comprehension of what evaluation measures are, and we mostly rely on empirical evidence alone.
We, as others [1,2,6], think that, in order to achieve a better understanding of evaluation measures, the development of a rigorous theory is needed. Therefore, we start to lay the foundations of a formal framework which differs from previous attempts to formalize IR evaluation measures in three respects: it explicitly places these metrics within measurement theory; it distinguishes between issues due to the intrinsic difficulty of comparing runs and those due to the numerical properties of measures; and it is minimal, consisting of just one axiom (Definition 1).
2 Preliminary Definitions
Let us consider a set of documents D and a set of topics T, and let (REL, ≤) be a totally ordered set of relevance degrees, where we assume the existence of a minimum that we call the non-relevant relevance degree nr = min(REL).
Then, given a positive natural number n, called the length of the run, we define the set of retrieved documents as D(n) = {(d_1, ..., d_n) : d_i ∈ D, d_i ≠ d_j for any i ≠ j}, i.e. the ranked lists of retrieved documents without duplicates, and the universe set of retrieved documents as 𝒟 := ∪_{n=1}^{|D|} D(n). A run r_t, retrieving a ranked list of documents in D(n) in response to a topic t ∈ T, is a function from T into 𝒟, t ↦ r_t = (d_1, ..., d_n).
Having defined the universe set of judged documents as ℛ := ∪_{n=1}^{|D|} REL^n, we call judged run the function r̂_t from T × 𝒟 into ℛ which assigns a relevance degree to each retrieved document in the ranked list, and we denote by r̂_t[j] the j-th element of the vector r̂_t, i.e. r̂_t[j] = GT(t, d_j), where GT is the ground-truth, i.e. a map which assigns a relevance degree to each (topic, document) pair.
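As a plain illustration of these definitions (not part of the original framework; the topic and document identifiers below are hypothetical), a minimal Python sketch of runs, the ground-truth GT, and judged runs:

```python
REL = [0, 1, 2, 3]   # totally ordered relevance degrees
NR = min(REL)        # the non-relevant relevance degree nr

# Hypothetical ground truth: a map from (topic, document) pairs to
# relevance degrees; unjudged documents default to nr.
GT = {("t1", "d3"): 2, ("t1", "d7"): 1}

def judged_run(topic, run):
    """Turn a ranked list of documents into the judged run r̂_t,
    i.e. the vector with r̂_t[j] = GT(t, d_j)."""
    assert len(set(run)) == len(run), "runs contain no duplicates"
    return tuple(GT.get((topic, d), NR) for d in run)

print(judged_run("t1", ("d3", "d1", "d7")))  # (2, 0, 1)
```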
3 Measurement and Measure
The representational theory of measurement [5] aims at providing a formal basis to our intuition concerning the action of measuring. More formally, a relational structure is an ordered pair X = ⟨X, R_X⟩ of a domain set X and a set of relations R_X on X. Given two relational structures X and Y, a homomorphism M : X → Y is a mapping M = ⟨M, M_R⟩, where M : X → Y and M_R : R_X → R_Y is a function which maps each relation on the domain set into one and only one corresponding image relation, with the condition that ∀r ∈ R_X and ∀x_i ∈ X, if r(x_1, ..., x_n) holds then M_R(r)(M(x_1), ..., M(x_n)) holds. A relational structure E is called empirical if its domain set E spans over the entities in the real world, while it is called symbolic, S, if its domain set S spans over a given set of symbols.
Hence, more precisely, a Measurement is a homomorphism M = ⟨M, M_R⟩ from an empirical relational structure to a symbolic one. Consequently, a measure is the number or symbol assigned to an entity by this mapping in order to characterize an attribute [3].
A key point in defining a measurement is to start from a clear empirical relational structure; in our case, it is represented by the set of all the runs together with an ordering relation based on the utility expressed as the "amount" of relevance, E = ⟨T × 𝒟, ⪯⟩. Since the set 𝒟 lacks many desirable properties, for example inclusion and union, we focus on a partial ordering among runs of the same length. In particular, we restrict ourselves to those cases where the ordering is intuitive and a commonly shared agreement can be found.
Therefore, let r_t and s_t be two runs of length n; we introduce the above-mentioned partial ordering among runs as

r_t ⪯ s_t ⇔ |{j ≤ k : r̂_t[j] ≥ rel}| ≤ |{j ≤ k : ŝ_t[j] ≥ rel}|   ∀rel ∈ REL, ∀k ∈ {1, ..., n}

which counts, for each relevance degree rel and rank position k, how many documents up to rank k have relevance degree at least rel; if a run has counts at least as high as those of another run for every relevance degree and rank position, it is considered greater than or equal to that run.
For example, if we have four relevance degrees REL = {0, 1, 2, 3}, the run r̂_t = (0, 1, 1, 2, 2) is smaller than the run ŝ_t = (0, 1, 1, 2, 3), but the run r̂_t = (0, 1, 1, 2, 2) is not comparable to the run ŵ_t = (0, 1, 1, 1, 3): up to rank 4, r̂_t has more documents with relevance degree at least 2, while ŵ_t has more documents with relevance degree 3.
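A minimal Python sketch of this comparison, assuming integer relevance degrees as in the example above, checks the cumulative counts for every relevance degree and rank position:

```python
from itertools import accumulate

def preceq(r_hat, s_hat, rel_degrees):
    """Return True if r_hat ⪯ s_hat: for every relevance degree rel
    and every rank position k, s_hat has at least as many documents
    with relevance >= rel among its first k positions as r_hat."""
    assert len(r_hat) == len(s_hat), "only runs of equal length are compared"
    for rel in rel_degrees:
        r_counts = accumulate(int(d >= rel) for d in r_hat)
        s_counts = accumulate(int(d >= rel) for d in s_hat)
        if any(rc > sc for rc, sc in zip(r_counts, s_counts)):
            return False
    return True

REL = [0, 1, 2, 3]
r, s, w = (0, 1, 1, 2, 2), (0, 1, 1, 2, 3), (0, 1, 1, 1, 3)
print(preceq(r, s, REL))                     # True:  r ⪯ s
print(preceq(r, w, REL), preceq(w, r, REL))  # False False: incomparable
```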
4 Utility-oriented Measurements of Retrieval Effectiveness
We define a utility-oriented measurement of retrieval effectiveness as a homomorphism between the empirical relational structure E = ⟨T × 𝒟, ⪯⟩ and the symbolic relational structure S = ⟨ℝ₀⁺, ≤⟩, that is, a mapping which assigns to each run a non-negative number, i.e. a utility-oriented measure of retrieval effectiveness.
Definition 1. A function M : T × 𝒟 → ℝ₀⁺ defined as M = µ(r̂_t), i.e. the composition of a judged run r̂_t with a scoring function µ : ℛ → ℝ₀⁺, is a utility-oriented measurement of retrieval effectiveness if and only if for any two runs r_t and s_t with the same length n such that r_t ⪯ s_t, then µ(r̂_t) ≤ µ(ŝ_t).
Even if the previous definition fits our purposes, it can be difficult to check in practice. Therefore, we introduce two "monotonicity-like" properties, replacement and swap, which together are equivalent to the required monotonicity (Theorem 1).
Replacement If we replace a less relevant document with a more relevant one
in the same rank position, a utility-oriented measurement of retrieval effec-
tiveness should not decrease.
Swap If we swap a less relevant document in a higher rank position with a
more relevant one in a lower rank position, a utility-oriented measurement
of retrieval effectiveness should not decrease.
Theorem 1 (Equivalence). A scoring function µ defined from ℛ into ℝ₀⁺ leads to a utility-oriented measurement of retrieval effectiveness M if and only if it satisfies the Replacement and the Swap properties.
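As an illustration of how the two properties can be checked in practice, the following sketch brute-forces them over all binary judged runs of a given length; the scoring function here is the standard binary RBP formula, and the helper names are our own:

```python
from itertools import product

def rbp(judged, p=0.8):
    """Rank-Biased Precision with persistence p, assuming binary
    relevance: RBP = (1 - p) * sum_j p^(j-1) * rel_j."""
    return (1 - p) * sum(rel * p ** j for j, rel in enumerate(judged))

def satisfies_replacement_and_swap(mu, n, rels=(0, 1)):
    """Brute-force check of the Replacement and Swap properties of a
    scoring function mu over all judged runs of length n."""
    for run in product(rels, repeat=n):
        score = mu(run)
        for i in range(n):
            # Replacement: raising the relevance degree at rank i
            # must not decrease the score.
            for higher in (r for r in rels if r > run[i]):
                if mu(run[:i] + (higher,) + run[i + 1:]) < score:
                    return False
            # Swap: exchanging a less relevant document at an earlier
            # rank i with a more relevant one at a later rank j
            # must not decrease the score.
            for j in range(i + 1, n):
                if run[j] > run[i]:
                    swapped = list(run)
                    swapped[i], swapped[j] = swapped[j], swapped[i]
                    if mu(tuple(swapped)) < score:
                        return False
    return True

print(satisfies_replacement_and_swap(rbp, n=5))  # True
```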
As a final remark, note that for any two runs r_t and s_t such that r_t ⪯ s_t, Definition 1 ensures that any two utility-oriented measurements M_1 and M_2 will order r_t below s_t, i.e. M_1(r_t) ≤ M_1(s_t) and M_2(r_t) ≤ M_2(s_t). On the contrary, when two runs are not comparable, i.e. when they fall outside the partial ordering ⪯, we can find two utility-oriented measurements M_1 and M_2 which order them differently.
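A concrete instance of this remark, assuming binary relevance: the judged runs (1, 0, 0) and (0, 1, 1) are not comparable under ⪯, and precision at cut-offs 1 and 3 already orders them in opposite ways:

```python
def precision_at(k):
    """Precision at cut-off k, assuming binary relevance degrees."""
    return lambda judged: sum(judged[:k]) / k

r, s = (1, 0, 0), (0, 1, 1)
p_at_1, p_at_3 = precision_at(1), precision_at(3)
print(p_at_1(r), p_at_1(s))  # 1.0 0.0    -> P@1 prefers r
print(p_at_3(r), p_at_3(s))  # 0.33 0.66  -> P@3 prefers s
```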
5 Balancing
In this section, we explore the behaviour of utility-oriented measurements when two runs are not comparable according to the partial ordering ⪯. Let r_t and s_t be two runs of length n, let q_min be the minimum relevance degree above non-relevant, and let q_max be the maximum relevance degree; hence 0 < q_min ≤ q_max < ∞. We define the Balancing Index as a function of the run length:
B(n) = max{ b ∈ ℕ : M(r_t : r̂_t[1] = q_max, r̂_t[j] = 0 for 1 < j ≤ n)
                    ≤ M(s_t : ŝ_t[i] = 0 for 1 ≤ i < b, ŝ_t[j] = q_min for b ≤ j ≤ n) }
As an example, consider the binary case REL = {0, 1} and runs of length 5. The balancing index seeks the maximum rank position of the first relevant document for which the corresponding run still scores at least as much as the top-heavy run, i.e. the largest b such that M(1, 0, 0, 0, 0) ≤ M(0, 1, 1, 1, 1) (b = 2), or M(0, 0, 1, 1, 1) (b = 3), or M(0, 0, 0, 1, 1) (b = 4), or M(0, 0, 0, 0, 1) (b = 5).
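The following sketch computes B(n) directly from the definition (our own illustration, here with the binary RBP formula; the exact values depend on the measure and its parameters):

```python
def rbp(judged, p=0.8):
    """Binary Rank-Biased Precision with persistence p."""
    return (1 - p) * sum(rel * p ** j for j, rel in enumerate(judged))

def balancing_index(mu, n, q_min=1, q_max=1):
    """B(n): the largest b such that a run with only minimally relevant
    documents from rank b to n still scores at least as much as a run
    whose single, maximally relevant document sits at rank 1."""
    top_heavy = (q_max,) + (0,) * (n - 1)
    best = 0
    for b in range(1, n + 1):
        tail = (0,) * (b - 1) + (q_min,) * (n - b + 1)
        if mu(top_heavy) <= mu(tail):
            best = b
    return best

print(balancing_index(rbp, n=20))  # 7 for p = 0.8
```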
[Figure: line plot of the balancing index B(n) versus run length (0 to 200) for AP, ERR, nDCG (log2 and log10 discounts), P@10, and RBP with p = 0.50, 0.80, and 0.95.]
Fig. 1. Balancing index for AP, RBP, ERR, P10, nDCG for different run lengths.
Figure 1 reports the balancing index of several evaluation measurements at different run lengths. It can be noted that Expected Reciprocal Rank (ERR) is the most top-heavy measurement, since B(n) = 1; Average Precision (AP) and Normalized Discounted Cumulated Gain (nDCG) are not strongly top-heavy, since B(n) → n; Rank-Biased Precision (RBP) falls somewhere in between.
This index explicitly points out that top heaviness is a property of the measurements concerning the area where runs are not a priori comparable, and this is why measurements may behave differently from one another, being more or less top-heavy.
6 Conclusions and Future Work
In this paper we have laid the foundations of a formal framework for defining
what a utility-oriented measurement of retrieval effectiveness is, on the basis of
the representational theory of measurement.
Future work will concern the exploration of the core problems of measurement, the exploitation of the theory of measurement scales, and the application of the proposed framework to other cases, e.g. measures based on diversity.
References
1. E. Amigó, J. Gonzalo, and M. F. Verdejo. A General Evaluation Measure for Document Organization Tasks. In SIGIR 2013, pp. 643–652.
2. L. Busin and S. Mizzaro. Axiometrics: An Axiomatic Approach to Information Retrieval Effectiveness Metrics. In ICTIR 2013, pp. 22–29.
3. N. E. Fenton and J. Bieman. Software Metrics: A Rigorous & Practical Approach. Chapman and Hall/CRC, USA, 3rd edition, 2014.
4. M. Ferrante, N. Ferro, and M. Maistro. Towards a Formal Framework for Utility-oriented Measurements of Retrieval Effectiveness. In ICTIR 2015, pp. 21–30.
5. D. H. Krantz, R. D. Luce, P. Suppes, and A. Tversky. Foundations of Measurement. Additive and Polynomial Representations, volume 1. Academic Press, USA, 1971.
6. A. Moffat. Seven Numeric Properties of Effectiveness Metrics. In AIRS 2013, pp. 1–12. LNCS 8281.