<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The Axiometrics Project</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Eddy Maddalena</string-name>
          <email>eddy.maddalena@uniud.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefano Mizzaro</string-name>
          <email>mizzaro@uniud.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Mathematics and Computer Science, University of Udine, Udine</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <fpage>11</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>The evaluation of retrieval effectiveness has played, and is playing, a central role in Information Retrieval (IR). To evaluate the effectiveness of IR systems, more than 50 (maybe 100) different evaluation metrics have been proposed. In this paper we sketch our Axiometrics project, which aims at a formal account of IR effectiveness metrics. Effectiveness evaluation is of paramount importance in IR, and several effectiveness metrics have been proposed so far. A survey in 2006 [6] collected more than 50 metrics, taking into account only the system-oriented effectiveness metrics; it is likely that about one hundred system-oriented metrics exist today, let alone user-oriented ones or metrics for tasks somehow related to IR, like filtering, clustering, recommendation, summarization, etc. As stated for example in [8], there is nothing close to agreement on a common metric that everyone will use. It is a diffuse opinion that different metrics evaluate different aspects of retrieval behavior [4,8], and each of these metrics has its own advantages and limitations. Metric choice is neither a simple task, nor is it without consequences: an inadequate metric might mean wasting research efforts improving systems toward a wrong target. It is clear that a better understanding of the formal properties of effectiveness metrics would help. This paper describes the Axiometrics project [5,7]: we propose an axiomatic approach to effectiveness metrics, aiming to define some basic axioms, formulated in a general way, that any reasonable metric should satisfy.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Although formal approaches are of high importance in the IR field, they have
mainly focused on the retrieval process rather than on effectiveness metrics
themselves. However, some research specific to effectiveness metrics does exist,
and it is briefly discussed here. An early attempt was made by Swets [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ],
who lists some desirable properties, as quoted in [11, pp. 119-120]. Also, van
Rijsbergen himself in [11, Chapter 7] follows an axiomatic approach. In [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], Bollmann
proposes the axiom of monotonicity and the Archimedean axiom, and their
implication is presented as a theorem. In Yao [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] a new effectiveness metric that
compares the relative order of documents is proposed and proved to be
appropriate through an axiomatic approach. More recently, Amigo et al. in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] focus
their formal analysis on evaluation metrics for text clustering algorithms, finding
four basic formal constraints, and in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] present a unified comparative view of
proposed metrics for the task of document filtering.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Measurement and Similarity</title>
      <p>
        We propose to rely on measurement theory [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] to formalize IR effectiveness
metrics. Measurement is defined as a process aimed at determining a relationship
between a physical quantity and a unit of measurement. A particularly debated
issue is how the measurement is expressed. Stevens proposed the four standard
measurement scales [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]: Nominal, Ordinal, Interval, and Ratio. This classification
has become a tradition in various fields and has provided useful insights, although
alternatives exist.
      </p>
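As an illustrative sketch (not from the paper), the practical difference between Stevens' scale types can be seen in which statements survive a rescaling of the values: an ordinal claim ("this document scores above that one") is preserved by any strictly increasing transformation, while an interval claim (equality of score gaps) generally is not.

```python
# Illustrative only: an ordinal statement survives a strictly increasing
# rescaling of the scores, while an interval statement does not.
import math

scores = {"d1": 3.0, "d2": 2.0, "d3": 1.0}
rescaled = {d: math.exp(s) for d, s in scores.items()}  # strictly increasing map

# Ordinal statement ("d1 scores above d2"): preserved by the rescaling.
assert (scores["d1"] > scores["d2"]) == (rescaled["d1"] > rescaled["d2"])

# Interval statement ("gap d1-d2 equals gap d2-d3"): broken by the same map.
gap = lambda m, a, b: m[a] - m[b]
assert gap(scores, "d1", "d2") == gap(scores, "d2", "d3")
assert gap(rescaled, "d1", "d2") != gap(rescaled, "d2", "d3")
```

This is why, for example, statements about rank order are meaningful on an ordinal scale, whereas statements about score differences require at least an interval scale.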
      <p>The evaluation process in IR is based on two quantities: (i) an automated
estimation, by an IR system, of the relevance of a document, and (ii) a human
(user or assessor) estimation of the relevance of a document. We can exploit
measurement to model these quantities: given a query, a system tries to measure
the relevance of the documents to the query, for example to rank the
documents; given (a description of) an information need, an assessor tries to measure
the relevance of the documents to the need. We therefore have two kinds of
relevance measurements (and measures as well): one made by a system and
referred to in the following as system relevance measure(ment), and one made by
a human and referred to in the following as human (or user/assessor) relevance
measure(ment).</p>
      <p>By using a notion of measure(ment) that is common to both system and
human, we can define a notion of similarity between them. Ideally, an IR system
should both use the same measurement scale as the human assessor and
provide the same measurement as the human assessor. However, systems are far
from perfect, and therefore the very same measurement is almost never
provided. The aim of an IR system is thus to provide the measurement that
is most similar to the assessor/user measurement. Moreover, the scales are often
different: the scale of the human measurement can be fixed a priori, e.g., when a test collection provides
human relevance assessments, whereas the scale of the system measurement depends on the retrieval algorithm
at hand, and different approaches use different scales. Of course, two
measurements expressed on two different scales cannot be identical (e.g., a rank cannot
be identical to a measurement expressed on a category scale, the usual ad-hoc
retrieval situation). Thus, similarity needs to be defined over different scales.</p>
    </sec>
    <sec id="sec-3">
      <title>IR Effectiveness Metric</title>
      <p>On the basis of the concepts of measurement, measurement scales, and similarity,
we now turn to modeling the effectiveness metric itself. An effectiveness
metric provides a numerical representation of the similarity between two relevance
measurements. A metric is then a function that takes as arguments two
measurements, a system relevance measurement ρ̂ and a human relevance measurement ρ, a set of documents D, and a set of queries Q, and provides
as output a numeric value (usually in ℝ): metric : D × Q → ℝ.</p>
      <p>A metric is defined on the basis of five components: the scales of the two
measurements, scale(ρ̂) and scale(ρ); a
notion of similarity sim; how the values on single documents are averaged over
a set of documents D (we denote the corresponding averaging function by
avgD); and how these averages are averaged over a set of queries Q (avgQ). We
can write metric(scale(ρ̂), scale(ρ), sim, avgD, avgQ) to describe a metric.</p>
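The five-component decomposition can be sketched in code. The following is a minimal illustration under our own naming (make_metric, sim, avg_docs, avg_queries are hypothetical names, not the paper's implementation): a metric is assembled from a per-document similarity, an aggregation over documents, and an aggregation over queries.

```python
# A minimal sketch of the five-component view of a metric: the two scale
# components are implicit in the values passed in; sim, avg_docs, and
# avg_queries are supplied explicitly.

def make_metric(sim, avg_docs, avg_queries):
    """Build an effectiveness metric from its components.

    sim(system_value, human_value): similarity of the two relevance
        measurements on one document, e.g. in [0, 1].
    avg_docs(values): aggregate per-document similarities for one query.
    avg_queries(values): aggregate per-query scores over a query set.
    """
    def metric(system, human):
        # system, human: {query: {doc: relevance value}}
        per_query = [
            avg_docs([sim(system[q][d], human[q][d]) for d in human[q]])
            for q in human
        ]
        return avg_queries(per_query)
    return metric

# Example instantiation: binary agreement as sim, arithmetic means as both
# averaging functions.
mean = lambda xs: sum(xs) / len(xs)
agreement = make_metric(lambda s, h: 1.0 if s == h else 0.0, mean, mean)

human  = {"q1": {"d1": 1, "d2": 0}}
system = {"q1": {"d1": 1, "d2": 1}}
print(agreement(system, human))  # 0.5: the system agrees on one of two documents
```

Swapping in other sim and averaging functions would yield other metrics within the same skeleton, which is the point of the decomposition.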
      <p>
        By using suitable similarity functions, the framework can hopefully model
most, if not all, known metrics [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
    </sec>
    <sec id="sec-4">
      <title>Axioms and Theorems</title>
      <p>
        In [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], we have proposed 13 axioms and 5 theorems: they define properties that,
ceteris paribus, any effectiveness metric should satisfy. Given the space limits,
we can only briefly sketch some of them.
      </p>
      <p>Axiom 1 (Document monotonicity). Let q be a query, D and D′ two sets
of documents such that D ∩ D′ = ∅, ρ a human relevance measurement, and
ρ̂ and ρ̂′ two system relevance measurements such that:¹

metric_{q,D}(ρ̂, ρ) { &gt;, =, &gt; } metric_{q,D}(ρ̂′, ρ)

and

metric_{q,D′}(ρ̂, ρ) { &gt;, =, = } metric_{q,D′}(ρ̂′, ρ).

Then

metric_{q,D∪D′}(ρ̂, ρ) { &gt;, =, &gt; } metric_{q,D∪D′}(ρ̂′, ρ).</p>
      <p>(&gt;)
A similar axiom holds for query sets (omitted for space limits). Another axiom
states that if system relevance measures on two documents d and d0 are equally
correct, system relevance of d is higher than system relevance of d0, and d0 is not
less relevant than d, then the e ectiveness metric should be more a ected by d
than by d0 (represented by A).
1 In this axiom the equal = and greater than &gt; signs have obviously to be paired in
the appropriate way, \row by row". We use this notation for the sake of brevity.
Axiom 2 (System relevance) Let q be a query, d and d0 two documents,
and two (human and system) relevance measurements such that simq;d ( ; ) =
simq;d0 ( ; ), (d) &gt; (d0), and (d) (d0). Then d Ametric( ; ) d0.
This entails as a corollary the often stated property that early rank positions
a ect a metric value more than later rank positions. A symmetric axiom can also
be stated on user relevance measurement: a metric should weigh more, and be
more a ected, by more relevant documents. This is perhaps less intuitive than
the previous one, but it does indeed seem natural in this framework.
6</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions and Future Work</title>
      <p>We propose a framework based on the notions of measure, measurement, and
similarity to define axioms and to derive theorems on IR effectiveness metrics.
Our approach aims at a threefold contribution: (i) the proposal of using
measurement to model in a uniform way both system output and human relevance
assessment, and the analysis of the different measurement scales used in IR; (ii)
the notion of similarity among different measurement scales and the consequent
definition of metric; and (iii) the axioms and theorems. In the future, we will seek
new axioms and theorems that can allow us to define and discover new
metric properties. We will also focus on aspects such as filtering, recommendation,
reformulation, summarization, novelty, and difficulty of queries.</p>
      <p>Acknowledgments</p>
      <p>We thank Julio Gonzalo and Enrique Amigo for long and interesting discussions,
Evangelos Kanoulas and Enrique Alfonseca for helping to frame the Axiometrics
research project, Arjen de Vries for suggesting the name "Axiometrics", and
the organizers of (and participants in) SWIRL 2012. This work has been partially
supported by a Google Research Award.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>E.</given-names>
            <surname>Amigo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Artiles</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Verdejo</surname>
          </string-name>
          .
          <article-title>A comparison of extrinsic clustering evaluation metrics based on formal constraints</article-title>
          .
          <source>Information Retrieval</source>
          ,
          <volume>12</volume>
          (
          <issue>4</issue>
          ):
          <fpage>461</fpage>
          –
          <lpage>486</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>E.</given-names>
            <surname>Amigo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Verdejo</surname>
          </string-name>
          .
          <article-title>A comparison of evaluation metrics for document filtering</article-title>
          .
          <source>In CLEF</source>
          , volume
          <volume>6941</volume>
          <source>of LNCS</source>
          , pages
          <fpage>38</fpage>
          –
          <lpage>49</lpage>
          . Springer,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>P.</given-names>
            <surname>Bollmann</surname>
          </string-name>
          .
          <article-title>Two axioms for evaluation measures in information retrieval</article-title>
          .
          <source>In SIGIR '84</source>
          , pages
          <fpage>233</fpage>
          –
          <lpage>245</lpage>
          , Swinton, UK,
          <year>1984</year>
          . British Computer Society.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>C.</given-names>
            <surname>Buckley</surname>
          </string-name>
          and
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Voorhees</surname>
          </string-name>
          .
          <article-title>Evaluating evaluation measure stability</article-title>
          .
          <source>In SIGIR '00</source>
          , pages
          <fpage>33</fpage>
          –
          <lpage>40</lpage>
          , New York, NY, USA,
          <year>2000</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>L.</given-names>
            <surname>Busin</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Mizzaro</surname>
          </string-name>
          .
          <article-title>Axiometrics: An Axiomatic Approach to Information Retrieval Effectiveness Metrics</article-title>
          .
          <source>In ICTIR 2013, Proceedings of the 4th International Conference on the Theory of Information Retrieval</source>
          , pages
          <fpage>22</fpage>
          –
          <lpage>29</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>G.</given-names>
            <surname>Demartini</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Mizzaro</surname>
          </string-name>
          .
          <article-title>A Classification of IR Effectiveness Metrics</article-title>
          .
          <source>In ECIR</source>
          <year>2006</year>
          , volume
          <volume>3936</volume>
          <source>of LNCS</source>
          , pages
          <fpage>488</fpage>
          –
          <lpage>491</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>E.</given-names>
            <surname>Maddalena</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Mizzaro</surname>
          </string-name>
          .
          <article-title>Axiometrics: Axioms of information retrieval effectiveness metrics</article-title>
          .
          <source>In Proceedings of the Second Australasian Web Conference. Australian Computer Society</source>
          , Inc.,
          <year>2014</year>
          , to appear.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>S.</given-names>
            <surname>Robertson</surname>
          </string-name>
          .
          <article-title>On GMAP: and other transformations</article-title>
          .
          <source>In CIKM '06</source>
          , pages
          <fpage>78</fpage>
          –
          <lpage>83</lpage>
          , New York, USA,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Stevens</surname>
          </string-name>
          .
          <article-title>On the theory of scales of measurement</article-title>
          .
          <source>Science</source>
          ,
          <volume>103</volume>
          (
          <issue>2684</issue>
          ):
          <fpage>677</fpage>
          –
          <lpage>680</lpage>
          ,
          <year>1946</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Swets</surname>
          </string-name>
          .
          <article-title>Information retrieval systems</article-title>
          .
          <source>Science</source>
          ,
          <volume>141</volume>
          :
          <fpage>245</fpage>
          –
          <lpage>250</lpage>
          ,
          <year>1963</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>C. J.</given-names>
            <surname>van Rijsbergen</surname>
          </string-name>
          .
          <source>Information Retrieval</source>
          . Butterworths, 2nd edition
          ,
          <year>1979</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Wikipedia</surname>
          </string-name>
          .
          <article-title>Measurement | Wikipedia, the free encyclopedia</article-title>
          . http://en.wikipedia.org/wiki/Measurement,
          <year>2012</year>
          . [Last visit:
          <year>October 2013</year>
          ].
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>Y. Y.</given-names>
            <surname>Yao</surname>
          </string-name>
          .
          <article-title>Measuring retrieval effectiveness based on user preference of documents</article-title>
          .
          <source>Journal of the American Society for Information Science</source>
          ,
          <volume>46</volume>
          (
          <issue>2</issue>
          ):
          <fpage>133</fpage>
          –
          <lpage>145</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>