<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The Axiometrics Project</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Eddy Maddalena</string-name>
          <email>eddy.maddalena@uniud.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefano Mizzaro</string-name>
          <email>mizzaro@uniud.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Mathematics and Computer Science, University of Udine, Udine</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <fpage>11</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>The evaluation of retrieval effectiveness has played, and is playing, a central role in Information Retrieval (IR). To evaluate the effectiveness of IR systems, more than 50 (maybe 100) different evaluation metrics have been proposed. In this paper we sketch our Axiometrics project, which aims at a formal account of IR effectiveness metrics. Effectiveness evaluation is of paramount importance in IR, and several effectiveness metrics have been proposed so far. A survey in 2006 [6] collected more than 50 metrics, taking into account only the system-oriented effectiveness metrics; it is likely that about one hundred system-oriented metrics exist today, let alone user-oriented ones or metrics for tasks somehow related to IR, like filtering, clustering, recommendation, summarization, etc. As stated for example in [8], there is nothing close to agreement on a common metric that everyone will use. It is a diffuse opinion that different metrics evaluate different aspects of retrieval behavior [4,8], and each of these metrics has its own advantages and limitations. Metric choice is neither a simple task, nor is it without consequences: an inadequate metric might mean wasting research efforts improving systems toward a wrong target. It is clear that a better understanding of the formal properties of effectiveness metrics would help. This paper describes the Axiometrics project [5,7]: we propose an axiomatic approach to effectiveness metrics, aiming to define some basic axioms, formulated in a general way, that any reasonable metric should satisfy.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Although formal approaches are of high importance in the IR field, they have
mainly focused on the retrieval process rather than on effectiveness metrics
themselves. However, some research specific to effectiveness metrics does exist,
and it is briefly discussed here. An early attempt was made by Swets [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ],
who lists some desirable properties, as quoted in [11, pp. 119-120]. Also, van
Rijsbergen himself in [11, Chapter 7] follows an axiomatic approach. In [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], Bollmann
proposes the axiom of monotonicity and the Archimedean axiom, and their
implication is presented as a theorem. In Yao [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] a new effectiveness metric that
compares the relative order of documents is proposed and proved to be
appropriate through an axiomatic approach. More recently, Amigo et al. in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] focus
their formal analysis on evaluation metrics for text clustering algorithms, finding
four basic formal constraints, and in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] present a unified comparative view of
proposed metrics for the task of document filtering.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Measurement and Similarity</title>
      <p>
        We propose to rely on measurement theory [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] to formalize IR effectiveness
metrics. Measurement is defined as a process aimed at determining a relationship
between a physical quantity and a unit of measurement. A particularly debated
issue is how the measurement is expressed. Stevens proposed the four standard
measurement scales [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]: Nominal, Ordinal, Interval, and Ratio. This classification
has become a tradition in various fields and has provided useful insights, although
alternatives exist.
      </p>
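As an illustrative sketch (not from the paper), the practical difference between Stevens' scale types can be seen in which statements survive a rescaling of the values: an ordinal claim ("this document scores above that one") is preserved by any strictly increasing transformation, while an interval claim (equality of score gaps) generally is not.

```python
# Illustrative only: an ordinal statement survives a strictly increasing
# rescaling of the scores, while an interval statement does not.
import math

scores = {"d1": 3.0, "d2": 2.0, "d3": 1.0}
rescaled = {d: math.exp(s) for d, s in scores.items()}  # strictly increasing map

# Ordinal statement ("d1 scores above d2"): preserved by the rescaling.
assert (scores["d1"] > scores["d2"]) == (rescaled["d1"] > rescaled["d2"])

# Interval statement ("gap d1-d2 equals gap d2-d3"): broken by the same map.
gap = lambda m, a, b: m[a] - m[b]
assert gap(scores, "d1", "d2") == gap(scores, "d2", "d3")
assert gap(rescaled, "d1", "d2") != gap(rescaled, "d2", "d3")
```

This is why, for example, statements about rank order are meaningful on an ordinal scale, whereas statements about score differences require at least an interval scale.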
      <p>The evaluation process in IR is based on two quantities: (i) an automated
estimation, by an IR system, of the relevance of a document, and (ii) a human
(user or assessor) estimation of the relevance of a document. We can exploit
measurement to model these quantities: given a query, a system tries to measure
the relevance of the documents to the query, for example to rank the
documents; given (a description of) an information need, an assessor tries to measure
the relevance of the documents to the need. We therefore have two kinds of
relevance measurements (and measures as well): one made by a system and
referred to in the following as system relevance measure(ment), and one made by
a human and referred to in the following as human (or user/assessor) relevance
measure(ment).</p>
      <p>By using a notion of measure(ment) that is common to both system and
human, we can define a notion of similarity between them. Ideally, an IR system
should both use the same measurement scale as the human assessor and
provide the same measurement as the human assessor. However, systems are far
from perfect, and therefore the very same measurement is almost never
provided. The aim of an IR system is thus to provide the measurement that
is most similar to the assessor/user measurement. Moreover, the scales are often
different: the scale of the human measurement can be fixed a priori, e.g., when a test collection provides
human relevance assessments, whereas the scale of the system measurement depends on the retrieval algorithm
at hand, and different approaches use different scales. Of course, two
measurements expressed on two different scales cannot be identical (e.g., a rank cannot
be identical to a measurement expressed on a category scale, the usual ad-hoc
retrieval situation). Thus, similarity needs to be defined over different scales.</p>
    </sec>
    <sec id="sec-3">
      <title>IR Effectiveness Metric</title>
      <p>On the basis of the concepts of measurement, measurement scales, and similarity,
we now turn to modeling the effectiveness metric itself. An effectiveness
metric provides a numerical representation of the similarity between two relevance
measurements. A metric is then a function that takes as arguments two
measurements, a system relevance measurement ρ̂ and a human relevance measurement ρ, a set of documents D, and a set of queries Q, and provides
as output a numeric value (usually in ℝ): metric : D × Q → ℝ.</p>
      <p>A metric is defined on the basis of five components: the scales of the two
measurements, scale(ρ̂) and scale(ρ); a
notion of similarity sim; how the values on single documents are averaged over
a set of documents D (we denote the corresponding averaging function by
avgD); and how these averages are averaged over a set of queries Q (avgQ). We
can write metric(scale(ρ̂), scale(ρ), sim, avgD, avgQ) to describe a metric.</p>
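The five-component decomposition can be sketched in code. The following is a minimal illustration under our own naming (make_metric, sim, avg_docs, avg_queries are hypothetical names, not the paper's implementation): a metric is assembled from a per-document similarity, an aggregation over documents, and an aggregation over queries.

```python
# A minimal sketch of the five-component view of a metric: the two scale
# components are implicit in the values passed in; sim, avg_docs, and
# avg_queries are supplied explicitly.

def make_metric(sim, avg_docs, avg_queries):
    """Build an effectiveness metric from its components.

    sim(system_value, human_value): similarity of the two relevance
        measurements on one document, e.g. in [0, 1].
    avg_docs(values): aggregate per-document similarities for one query.
    avg_queries(values): aggregate per-query scores over a query set.
    """
    def metric(system, human):
        # system, human: {query: {doc: relevance value}}
        per_query = [
            avg_docs([sim(system[q][d], human[q][d]) for d in human[q]])
            for q in human
        ]
        return avg_queries(per_query)
    return metric

# Example instantiation: binary agreement as sim, arithmetic means as both
# averaging functions.
mean = lambda xs: sum(xs) / len(xs)
agreement = make_metric(lambda s, h: 1.0 if s == h else 0.0, mean, mean)

human  = {"q1": {"d1": 1, "d2": 0}}
system = {"q1": {"d1": 1, "d2": 1}}
print(agreement(system, human))  # 0.5: the system agrees on one of two documents
```

Swapping in other sim and averaging functions would yield other metrics within the same skeleton, which is the point of the decomposition.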
      <p>
        By using suitable similarity functions, the framework can hopefully model
most, if not all, known metrics [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
    </sec>
    <sec id="sec-4">
      <title>Axioms and Theorems</title>
      <p>
        In [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], we have proposed 13 axioms and 5 theorems: they define properties that,
ceteris paribus, any effectiveness metric should satisfy. Given the space limits,
we can only briefly sketch some of them.
      </p>
      <p>Axiom 1 (Document monotonicity). Let q be a query, D and D′ two sets
of documents such that D ∩ D′ = ∅, ρ a human relevance measurement, and
ρ̂ and ρ̂′ two system relevance measurements such that:¹

metric_{q,D}(ρ̂, ρ) { &gt;, =, &gt; } metric_{q,D}(ρ̂′, ρ)

and

metric_{q,D′}(ρ̂, ρ) { &gt;, =, = } metric_{q,D′}(ρ̂′, ρ).

Then

metric_{q,D∪D′}(ρ̂, ρ) { &gt;, =, &gt; } metric_{q,D∪D′}(ρ̂′, ρ).</p>
      <p>(&gt;)
A similar axiom holds for query sets (omitted for space limits). Another axiom
states that if system relevance measures on two documents d and d0 are equally
correct, system relevance of d is higher than system relevance of d0, and d0 is not
less relevant than d, then the e ectiveness metric should be more a ected by d
than by d0 (represented by A).
1 In this axiom the equal = and greater than &gt; signs have obviously to be paired in
the appropriate way, \row by row". We use this notation for the sake of brevity.
Axiom 2 (System relevance) Let q be a query, d and d0 two documents,
and two (human and system) relevance measurements such that simq;d ( ; ) =
simq;d0 ( ; ), (d) &gt; (d0), and (d) (d0). Then d Ametric( ; ) d0.
This entails as a corollary the often stated property that early rank positions
a ect a metric value more than later rank positions. A symmetric axiom can also
be stated on user relevance measurement: a metric should weigh more, and be
more a ected, by more relevant documents. This is perhaps less intuitive than
the previous one, but it does indeed seem natural in this framework.
6</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions and Future Work</title>
      <p>We propose a framework based on the notions of measure, measurement, and
similarity to define axioms and to derive theorems on IR effectiveness metrics.
Our approach aims at a threefold contribution: (i) the proposal of using
measurement to model in a uniform way both system output and human relevance
assessment, and the analysis of the different measurement scales used in IR; (ii)
the notion of similarity among different measurement scales and the consequent
definition of metric; and (iii) the axioms and theorems. In the future, we will seek
new axioms and theorems that can allow us to define and discover new
metric properties. We will also focus on aspects such as filtering, recommendation,
reformulation, summarization, novelty, and difficulty of queries.</p>
      <p>Acknowledgments</p>
      <p>We thank Julio Gonzalo and Enrique Amigo for long and interesting discussions,
Evangelos Kanoulas and Enrique Alfonseca for helping to frame the Axiometrics
research project, Arjen de Vries for suggesting the name "Axiometrics", and
the organizers of (and participants in) SWIRL 2012. This work has been partially
supported by a Google Research Award.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>E.</given-names>
            <surname>Amigo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Artiles</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Verdejo</surname>
          </string-name>
          .
          <article-title>A comparison of extrinsic clustering evaluation metrics based on formal constraints</article-title>
          .
          <source>Information Retrieval</source>
          ,
          <volume>12</volume>
          (
          <issue>4</issue>
          ):
          <fpage>461</fpage>
          –
          <lpage>486</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>E.</given-names>
            <surname>Amigo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Verdejo</surname>
          </string-name>
          .
          <article-title>A comparison of evaluation metrics for document filtering</article-title>
          .
          <source>In CLEF</source>
          , volume
          <volume>6941</volume>
          <source>of LNCS</source>
          , pages
          <fpage>38</fpage>
          –
          <lpage>49</lpage>
          . Springer,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>P.</given-names>
            <surname>Bollmann</surname>
          </string-name>
          .
          <article-title>Two axioms for evaluation measures in information retrieval</article-title>
          .
          <source>In SIGIR '84</source>
          , pages
          <fpage>233</fpage>
          –
          <lpage>245</lpage>
          , Swinton, UK,
          <year>1984</year>
          . British Computer Society.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>C.</given-names>
            <surname>Buckley</surname>
          </string-name>
          and
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Voorhees</surname>
          </string-name>
          .
          <article-title>Evaluating evaluation measure stability</article-title>
          .
          <source>In SIGIR '00</source>
          , pages
          <fpage>33</fpage>
          –
          <lpage>40</lpage>
          , New York, NY, USA,
          <year>2000</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>L.</given-names>
            <surname>Busin</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Mizzaro</surname>
          </string-name>
          .
          <article-title>Axiometrics: An Axiomatic Approach to Information Retrieval Effectiveness Metrics</article-title>
          .
          <source>In ICTIR 2013, Proceedings of the 4th International Conference on the Theory of Information Retrieval</source>
          , pages
          <fpage>22</fpage>
          –
          <lpage>29</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>G.</given-names>
            <surname>Demartini</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Mizzaro</surname>
          </string-name>
          .
          <article-title>A Classification of IR Effectiveness Metrics</article-title>
          .
          <source>In ECIR</source>
          <year>2006</year>
          , volume
          <volume>3936</volume>
          <source>of LNCS</source>
          , pages
          <fpage>488</fpage>
          –
          <lpage>491</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>E.</given-names>
            <surname>Maddalena</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Mizzaro</surname>
          </string-name>
          .
          <article-title>Axiometrics: Axioms of information retrieval effectiveness metrics</article-title>
          .
          <source>In Proceedings of the Second Australasian Web Conference. Australian Computer Society</source>
          , Inc.,
          <year>2014</year>
          , to appear.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>S.</given-names>
            <surname>Robertson</surname>
          </string-name>
          .
          <article-title>On GMAP: and other transformations</article-title>
          .
          <source>In CIKM '06</source>
          , pages
          <fpage>78</fpage>
          –
          <lpage>83</lpage>
          , New York, USA,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Stevens</surname>
          </string-name>
          .
          <article-title>On the theory of scales of measurement</article-title>
          .
          <source>Science</source>
          ,
          <volume>103</volume>
          (
          <issue>2684</issue>
          ):
          <fpage>677</fpage>
          –
          <lpage>680</lpage>
          ,
          <year>1946</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Swets</surname>
          </string-name>
          .
          <article-title>Information retrieval systems</article-title>
          .
          <source>Science</source>
          ,
          <volume>141</volume>
          :
          <fpage>245</fpage>
          –
          <lpage>250</lpage>
          ,
          <year>1963</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>C. J.</given-names>
            <surname>van Rijsbergen</surname>
          </string-name>
          .
          <source>Information Retrieval</source>
          . Butterworths, 2nd edition
          ,
          <year>1979</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Wikipedia</surname>
          </string-name>
          .
          <article-title>Measurement | Wikipedia, the free encyclopedia</article-title>
          . http://en.wikipedia.org/wiki/Measurement,
          <year>2012</year>
          . [Last visit:
          <year>October 2013</year>
          ].
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>Y. Y.</given-names>
            <surname>Yao</surname>
          </string-name>
          .
          <article-title>Measuring retrieval effectiveness based on user preference of documents</article-title>
          .
          <source>Journal of the American Society for Information Science</source>
          ,
          <volume>46</volume>
          (
          <issue>2</issue>
          ):
          <fpage>133</fpage>
          –
          <lpage>145</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>