<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Basis of a Formal Framework for Information Retrieval Evaluation Measurements?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marco Ferrante</string-name>
          <email>ferrante@math.unipd.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicola Ferro</string-name>
          <email>ferro@dei.unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maria Maistro</string-name>
          <email>maistro@dei.unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dept. of Information Engineering, University of Padua</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Dept. of Mathematics, University of Padua</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Preliminary De nitions</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper we present a formal framework, based on the representational theory of measurement and we de ne and study the properties of utility-oriented measurements of retrieval e ectiveness like AP, RBP, ERR and many other popular IR evaluation measures. Let us consider a set of documents D and a set of topics T and let (REL, ) be a totally ordered set of relevance degrees, where we assume the existence of a minimum that we call the non-relevant relevance degree nr = min(REL). Then, given a positive natural number n called the length of the run, we de ne the set of retrieved documents as D(n) = f(d1; : : : ; dn) : di 2 D; di 6= dj for any i 6= jg, i.e. the ranked list of retrieved documents without duplicates, and the universe set of retrieved documents as D := SjnD=j1 D(n). A run rt, retrieving a ranked list of documents D(n) in response to a topic t 2 T , is a function from T into D, t 7! rt = (d1; : : : ; dn).</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Even if Information Retrieval (IR) has been deeply rooted in experimentation
since its inception, especially with respect to the evaluation area, we still lack
a deep comprehension about what the evaluation measures are and we mostly
rely just to empirical evidence.</p>
      <p>
        We, as others [
        <xref ref-type="bibr" rid="ref1 ref2 ref6">1,2,6</xref>
        ], think that, in order to achieve a better understanding
of evaluation measures, the development of a rigorous theory is needed.
Therefore, we start to lay the foundations for a formal framework which di ers from
previous attempts to formalize IR evaluation measures since it explicitly puts
these metrics in the wake of the measurement theory, it distinguishes between
issues due to the intrinsic di culties in comparing runs and those due to the
numerical properties of measures, and it is minimal, consisting of just one axiom
(De nition 1).
? Extended abstract of [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>De ned the universe set of judged documents as R := SjnD=j1 RELn, we
call judged run the function r^t from T D into R, which assigns a relevance
degree to each retrieved document in the ranked list, and we denote by r^t[j] the
j-th element of the vector r^t, i.e. r^t[j] = GT (t; dj ), where GT is the
groundtruth, i.e. a map which assigns the relevance degree.
3</p>
    </sec>
    <sec id="sec-2">
      <title>Measurement and Measure</title>
      <p>
        The representational theory of measurement [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] aims at providing a formal basis
to our intuition concerning the action of measuring. More formally, a relational
structure is an ordered pair X = X; RX of a domain set X and a set of
relations RX on X. Given two relational structures X and Y, a homomorphism
M : X ! Y is a mapping M = M; MR where M : X ! Y and MR : RX ! RY
is a function which maps each relation on the domain set into one and only one
corresponding image relation, with the condition that 8r 2 RX ; 8xi 2 X, if
r(x1; : : : ; xn) then MR(r) M(x1); : : : ; M(xn) . A relational structure E is called
empirical if its domain set E spans over the entities in the real world, while it
is called symbolic, S, if its domain set S spans over a given set of symbols.
      </p>
      <p>
        Hence, more precisely a Measurement is a homomorphism M = M; MR
from the real world to a symbolic world. Consequently, a measure is the number
or symbol assigned to an entity by this mapping in order to characterize an
attribute [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>A key point in de ning a measurement is to start from a clear empirical
relational structure, in our case it is represented by the set of all the runs with
an ordering relation based on the utility expressed as \amount" of relevance,
E = T D; . Since the set D lacks of many desirable properties, for example,
inclusion and union, we will focus on a partial ordering among runs of the same
length. In particular, we will restrict ourselves only to those cases where the
ordering is intuitive and it is possible to nd a commonly shared agreement.</p>
      <p>Therefore, let rt and st be two runs with length n, we introduce the above
mentioned partial ordering among runs as
rt
st , fj</p>
      <p>k : r^t[j] relg fj
8rel 2 REL and k 2 f1; : : : ; ng
k : s^t[k]
relg
which counts, for each relevance degree and rank position, how many items are
above that relevance degree and, if a run has higher counts for each relevance
degree and rank position, it is considered greater than another one.</p>
      <p>For example, if we have four relevance degrees REL = f0; 1; 2; 3g, the run
r^t = (0; 1; 1; 2; 2) is smaller than the run s^t = (0; 1; 1; 2; 3) but the run r^t =
(0; 1; 1; 2; 2) is not comparable to the run w^t = (0; 1; 1; 1; 3).
4</p>
    </sec>
    <sec id="sec-3">
      <title>Utility-oriented</title>
    </sec>
    <sec id="sec-4">
      <title>E ectiveness</title>
    </sec>
    <sec id="sec-5">
      <title>Measurements of Retrieval</title>
      <p>We de ne an utility-oriented measurement of retrieval e ectiveness as
an homomorphism between the empirical relational structure E = T D; ,
and the symbolic relational structure S = R0+; , that is a mapping which
assign to any run a non negative number, i.e. a utility-oriented measure of
retrieval e ectiveness.</p>
      <p>De nition 1. A function M : T D ! R0+ de ned as M = (r^t), i.e. the
composition of a judged run r^t with a scoring function : R ! R0+ is a
utilityoriented measurement of retrieval e ectiveness if and only if for any two
runs rt and st with the same length n such that rt st, then (r^t) (s^t).</p>
      <p>Even if the previous de nition ts our purposes, it could be di cult to check
it in practice. Therefore, we introduce two \monotonicity-like" properties,
replacement and swap, which are equivalent to the required monotonicity
(Theorem 1).</p>
      <p>Replacement If we replace a less relevant document with a more relevant one
in the same rank position, a utility-oriented measurement of retrieval e
ectiveness should not decrease.</p>
      <p>Swap If we swap a less relevant document in a higher rank position with a
more relevant one in a lower rank position, a utility-oriented measurement
of retrieval e ectiveness should not decrease.</p>
      <p>Theorem 1 (Equivalence). A scoring function de ned from R into R0+
leads to a utility-oriented measurement of retrieval e ectiveness M if and only
if it satis es the Replacement and the Swap properties.</p>
      <p>As a nal remark, note that for any two runs rt and st such that rt st,
De nition 1 ensures that any two utility-oriented measurements M1 and M2 will
order rt below st, i.e. M1(rt) M1(st) and M2(rt) M2(st). On the contrary,
when two runs are not comparable, i.e. when they are outside the partial ordering
, we can nd two utility-oriented measurements M1 and M2 which order them
di erently.
5</p>
    </sec>
    <sec id="sec-6">
      <title>Balancing</title>
      <p>
        In this section, we explore the behaviour of utility-oriented measurements when
two runs are not comparable, according to the the partial ordering . Let rt and
st be two runs with length n, qmin is the minimum relevance degree above not
relevant and qmax is the maximum relevance degree, hence 0 &lt; qmin qmax &lt;
1. We de ne the Balancing Index as a function of the run length:
B(n) = max nb 2 N : M rt : r^t[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] = qmax; r^t[j] = 0; 1 &lt; j
      </p>
      <p>M st : s^t[i] = 0; 1
i &lt; b; s^t[j] = qmin; b
j</p>
      <p>n
n o</p>
      <p>As an example, consider the binary case REL = f0; 1g and runs of length 5.
The balancing index seeks the maximum rank position of the rst relevant
document, for which M (1; 0; 0; 0; 0) has score greater or equal than M (0; 0; 0; 0; 1)
or M (0; 0; 0; 1; 1) or M (0; 0; 1; 1; 1) or M (0; 1; 1; 1; 1) .
210800 AEPRR
160 nnDDCCGG,, lloogg120
140 pR1B0P, p = 0.50
Ixedn120 RRBBPP,, pp == 00..9850
ing100
cn
laa 80
B
60
40
20
00 20 40 60 L8e0ngth1o0f0the r1u2n0 140 160 180 200</p>
    </sec>
    <sec id="sec-7">
      <title>Conclusions and Future Work</title>
      <p>In this paper we have laid the foundations of a formal framework for de ning
what a utility-oriented measurement of retrieval e ectiveness is, on the basis of
the representational theory of measurement.</p>
      <p>Future work will concern the exploration of measurements core problems,
the exploitation of the theory of measurements scales, and the application of the
proposed framework to other cases, i.e. measures based on diversity.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>E.</given-names>
            <surname>Amigo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. F.</given-names>
            <surname>Verdejo</surname>
          </string-name>
          . A
          <article-title>General Evaluation Measure for Document Organization Tasks</article-title>
          .
          <source>In SIGIR 2013</source>
          , pp.
          <volume>643</volume>
          {
          <fpage>652</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>L.</given-names>
            <surname>Busin</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Mizzaro</surname>
          </string-name>
          .
          <article-title>Axiometrics: An Axiomatic Approach to Information Retrieval E ectiveness Metrics</article-title>
          .
          <source>In ICTIR 2013</source>
          , pp.
          <volume>22</volume>
          {
          <fpage>29</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>N. E.</given-names>
            <surname>Fenton</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Bieman</surname>
          </string-name>
          .
          <article-title>Software Metrics: A Rigorous &amp; Practical Approach</article-title>
          . Chapman and Hall/CRC, USA, 3rd edition,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>M.</given-names>
            <surname>Ferrante</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Maistro</surname>
          </string-name>
          .
          <article-title>Towards a Formal Framework for Utilityoriented Measurements of Retrieval E ectiveness</article-title>
          .
          <source>In ICTIR 2015</source>
          , pp.
          <volume>21</volume>
          {
          <fpage>30</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>D. H.</given-names>
            <surname>Krantz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. D.</given-names>
            <surname>Luce</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Suppes</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Tversky</surname>
          </string-name>
          .
          <source>Foundations of Measurement. Additive and Polynomial Representations</source>
          , volume
          <volume>1</volume>
          . Academic Press, USA,
          <year>1971</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>A.</surname>
          </string-name>
          <article-title>Mo at. Seven Numeric Properties of E ectiveness Metrics</article-title>
          .
          <source>In AIRS 2013</source>
          , pp.
          <volume>1</volume>
          {
          <fpage>12</fpage>
          . LNCS 8281.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>