
Information Research

User Evaluation of Multidimensional Relevance Assessment

Célia da Costa Pereira
pereira@dti.unimi.it
Università degli Studi di Milano, Dipartimento di Tecnologie dell'Informazione, Via Bramante 65, I-26013 Crema (CR), Italy

Mauro Dragoni
dragoni@dti.unimi.it
Università degli Studi di Milano, Dipartimento di Tecnologie dell'Informazione, Via Bramante 65, I-26013 Crema (CR), Italy

Gabriella Pasi
pasi@disco.unimib.it
Università degli Studi di Milano Bicocca, Dipartimento di Informatica, Sistemistica e Comunicazione, Viale Sarca 336, I-20126 Milano (MI), Italy

2010

8 3 27 28

In this paper a user evaluation is proposed to assess the effectiveness of systems based on multidimensional relevance assessment. First, we introduce our approach to multidimensional modeling and aggregation, and the criteria used for the experiments. Then, we describe how the user evaluation has been performed, and finally, we discuss the results obtained.

1. INTRODUCTION

In the first traditional approaches to Information Retrieval (IR), relevance was modeled as "topicality", and its numeric assessment was based on the matching function related to the adopted IR model (Boolean model, vector space model, probabilistic model, or fuzzy model). However, relevance is, by its very nature, the result of several components or dimensions. Cooper [2] can be considered one of the first researchers who had intuitions about the multidimensional nature of the concept of relevance. He defined relevance as topical relevance with utility. Mizzaro, who has written an interesting article on the history of relevance [8], proposed a relevance model in which relevance is represented as a four-dimensional relationship between an information resource (surrogate, document, and information) and a representation of the user's problem (query, request, real information need, and perceived information need). A further judgment is made according to the topic, task, or context, at a particular point in time. The dimensions pointed out by Mizzaro are in line with the five manifestations of relevance suggested by Saracevic [10]: system or algorithmic relevance, topical or subject relevance, cognitive relevance or pertinence, situational relevance or utility, and motivational or affective relevance. However, the concept of dimension used in this paper, which is similar to that used by Xu and Chen in [12], is somewhat different from that used by Mizzaro and Saracevic. They defined several kinds of relevance and called them dimensions of relevance, while we define relevance as a concept of concepts, i.e., as a point in an n-dimensional space composed of n criteria. The document score is then the result of a particular combination of those n space components, as explained in [3, 4].

One of the problems raised by considering relevance as a multidimensional property of documents is how to aggregate the related relevance scores. In [3, 4] an approach for prioritized aggregation of multidimensional relevance has been proposed. The proposed aggregation scheme is user-dependent: a user can be differently interested in each dimension. The computation of the overall relevance score to be associated with each retrieved document is then based on the aggregation of the scores representing the satisfaction of the considered dimensions. A problem raised by this new approach is how to evaluate its effectiveness. In fact, there is no test collection suited to evaluating such a model. In this paper, we first recall the models for aggregating multiple dimension evaluations for relevance assessment presented in [3] and [4]. We then focus on observing how document rankings are modified after applying the two operators to the different typologies of users (different dimension orderings).

The paper is organized as follows. Section 2 recalls the aggregation models used in the paper. Section 3 presents the performed user evaluation and, finally, Section 4 concludes the paper.

2. PRIORITIZED MULTICRITERIA AGGREGATION

In this section, after a brief background on the representation of a multicriteria decision making problem, two prioritized approaches for aggregating distinct relevance assessments are briefly presented.

2.1 Problem Representation

The presented multicriteria decision making approaches have the following components:

• the set $C$ of the $n$ considered criteria: $C = \{C_1, \dots, C_n\}$, with $C_i$ being the function evaluating the $i$-th criterion;
• the collection of documents $D$;
• an aggregation function $F$ to calculate, for each document $d \in D$, a score $F(C(d)) = RSV(d)$ on the basis of the evaluation scores of the considered criteria (see footnote 1).

Footnote 1: Actually, it corresponds to $F(C_1(d), \dots, C_n(d))$.

$C_j(d)$ represents the satisfaction score of document $d$ with respect to criterion $j$. The weight associated with each criterion $C_i \in C$, with $i \neq 1$, is document- and user-dependent. It depends on the preference order of $C_i$ for the user, and also on both the weight associated with criterion $C_{i-1}$ and the satisfaction degree of the document with respect to $C_{i-1}$ (see footnote 2). Formally, if we consider document $d$, each criterion $C_i$ has an importance $\lambda_i \in [0, 1]$.

Notice that different users can have a different preference order over the criteria and, therefore, it is possible to obtain different importance weights for the same document for different users. We suppose that $C_i \succ C_j$ if $i < j$. This is just a representational convention which means that the most preferred criteria have lower indexes.

We suppose that:

• for each document $d$, the weight of the most important criterion $C_1$ is set to 1, i.e., by definition we have $\forall d,\ \lambda_1 = 1$;
• the weights of the other criteria $C_i$, $i \in [2, n]$, are calculated as follows:

$$\lambda_i = \lambda_{i-1} \cdot C_{i-1}(d), \tag{1}$$

where $C_{i-1}(d)$ is the degree of satisfaction of criterion $C_{i-1}$ by document $d$, and $\lambda_{i-1}$ is the importance weight of criterion $C_{i-1}$.

2.2 The Prioritized Scoring Model

This operator allows us to calculate the overall score value from several criteria, where the weight of each criterion depends both on the weights and on the satisfaction degrees of the more important criteria: the higher the satisfaction degree of a more important criterion, the more the satisfaction degree of a less important criterion influences the overall score.

Operator $F_s$ is defined as follows: $F_s : [0, 1]^n \to [0, n]$ and it is such that, for any document $d$,

$$F_s(C_1(d), \dots, C_n(d)) = \sum_{i=1}^{n} \lambda_i \cdot C_i(d). \tag{2}$$

The $RSV_s$ of the alternative $d$ is then given by:

$$RSV_s(d) = F_s(C_1(d), \dots, C_n(d)). \tag{3}$$

Formalizations and properties of this operator are presented in [3].
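As an illustration only, the weight recursion of Section 2.1 and the scoring operator can be sketched in a few lines of Python. The function names and the example satisfaction degrees are hypothetical, and the sketch assumes a strict preference order (no two criteria share the same priority), with the scores already listed from the most to the least preferred criterion.

```python
from typing import List, Sequence


def prioritized_weights(scores: Sequence[float]) -> List[float]:
    """Importance weights for scores ordered from most to least preferred.

    lambda_1 = 1 and lambda_i = lambda_{i-1} * C_{i-1}(d).
    Assumes a strict preference order (no ties between criteria).
    """
    weights: List[float] = []
    for i in range(len(scores)):
        if i == 0:
            weights.append(1.0)
        else:
            weights.append(weights[i - 1] * scores[i - 1])
    return weights


def prioritized_scoring(scores: Sequence[float]) -> float:
    """F_s: sum of lambda_i * C_i(d); the result lies in [0, n]."""
    weights = prioritized_weights(scores)
    return sum(w * c for w, c in zip(weights, scores))


# Hypothetical satisfaction degrees, sorted by the user's preference order.
example_scores = [0.5, 0.8, 0.4, 1.0]
print(prioritized_weights(example_scores))
print(prioritized_scoring(example_scores))
```

In this sketch a fully satisfied last criterion (1.0) contributes little, because the weights of the less preferred criteria shrink with every partially satisfied criterion that precedes them.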

2.3 The Prioritized "min" Operator

In this section a prioritized "min" (or "and") operator is recalled [4]. This operator computes the overall satisfaction degree for a user whose overall satisfaction degree is strongly dependent on the degree of the least satisfied criterion. The peculiarity of such an operator, which also distinguishes it from the traditional "min" operator, is that the extent to which the least satisfied criterion is considered depends on its importance for the user. If it is not important at all, its satisfaction degree should not be considered, while if it is the most important criterion for the user, only its satisfaction degree is considered. This way, if we consider a document $d$ for which the least satisfied criterion $C_k$ is also the least important one, the overall satisfaction degree will be greater than $C_k(d)$; it will not be $C_k(d)$, as would be the case with the traditional "min" operator. The less important the criterion, the lower its chances of representing the overall satisfaction degree.

Footnote 2: If more than one criterion has the same priority order, the average weight and the average satisfaction degree are considered.

The aggregation operator $F_m$ is defined as follows. $F_m : [0, 1]^n \to [0, 1]$ is such that, for every document $d$,

$$F_m(C_1(d), \dots, C_n(d)) = \min_{i=1,\dots,n} \left( \{C_i(d)\}^{\lambda_i} \right). \tag{4}$$

Formalizations and properties of this operator are presented in [4].
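The behavior described in this section can be checked with a small Python sketch. It is a minimal illustration rather than the implementation used in [4]: the function names and the example values are hypothetical, and a strict preference order is assumed. Each satisfaction degree is raised to the power of its importance weight, so a criterion with a weight close to 0 is pushed towards 1 and rarely determines the minimum.

```python
from typing import List, Sequence


def prioritized_weights(scores: Sequence[float]) -> List[float]:
    """lambda_1 = 1 and lambda_i = lambda_{i-1} * C_{i-1}(d)."""
    weights: List[float] = []
    for i in range(len(scores)):
        weights.append(1.0 if i == 0 else weights[i - 1] * scores[i - 1])
    return weights


def prioritized_min(scores: Sequence[float]) -> float:
    """F_m: minimum over i of C_i(d) ** lambda_i; the result lies in [0, 1]."""
    if not scores:
        raise ValueError("at least one criterion is required")
    weights = prioritized_weights(scores)
    return min(c ** w for c, w in zip(scores, weights))


# Hypothetical document: the least satisfied criterion (0.1) is the least
# important one, so the overall degree stays above 0.1.
least_important_is_weak = [0.9, 0.8, 0.1]
# Same degrees, but the least satisfied criterion is now the most important.
most_important_is_weak = [0.1, 0.9, 0.8]

print(prioritized_min(least_important_is_weak), min(least_important_is_weak))
print(prioritized_min(most_important_is_weak), min(most_important_is_weak))
```

When the weakly satisfied criterion is the most preferred one, the sketch returns the same value as the traditional min; when it is the least preferred one, the returned value is higher.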

3. USER EVALUATION OF THE PRIORITIZED AGGREGATION OPERATORS

In [3, 4] the proposed approach for prioritized aggregation of the considered relevance dimensions has been applied to personalized IR without loss of generality. The considered personalized approach relies on four relevance dimensions: aboutness, coverage, appropriateness, and reliability. Aboutness is computed as the similarity between the document vector and the query vector. The scores of the coverage and appropriateness criteria are computed based on the similarity between the document vector and a vector of terms representing the user profile. Reliability represents the degree of trust the user has in the source from which the document comes.

3.1 Preliminary Assumptions

The prioritized aggregation approach is based on the user's indication (either explicit or implicit) of the importance order of the relevance dimensions. In [3, 4] different user behaviors have been described. A user who formulates a query with the idea of locating documents which are about the query and which also cover all his interests, and who at the same time does not care whether the document also focuses on additional topics, can be called a "coverage seeker". If, on the contrary, the user's intent is to privilege documents which perfectly fit his interests, the user is called an "appropriateness seeker".

A user who formulates a query which has no intersection with his interests, or a user who does not have a defined list of interests ("interest neutral"), will not give any importance to the coverage and appropriateness criteria. Users of this kind are just looking for a satisfactory answer to their current concern, as expressed by their query.

Finally, users who are cautious about the trustworthiness of the origin of the retrieved documents ("cautious") will give more importance to the reliability criterion than to the others. For example, coverage seeker users can be defined as follows:

CARAp: coverage $\succ$ aboutness $\succ$ reliability $\succ$ appropriateness.

3.2 Experiments

In this section, the impact of the proposed prioritized aggregation operators in the personalized IR setting is evaluated. In Section 3.2.1 we present the settings used to perform the experiments, while in Section 3.2.2 we discuss the obtained results.

3.2.1

The traditional way to evaluate an information retrieval system is based on a test collection composed of a document collection, a set of queries, and a set of relevance judgments which classify each document as relevant or not for each query. Precision and recall are then computed to evaluate the effectiveness of the system. Unfortunately, there is no test collection suited to evaluating a system based on approaches like the one proposed in this paper. It is important to notice that, in the case of a user-independent aggregation of the multiple numeric relevance assessments, a traditional system evaluation could be applied. In fact, if for example the single assessment scores are aggregated by a mean operator, the system would produce the same result for the same query and the same document, independently of the user judgments. When applying the prioritized aggregation that we have proposed, the same document evaluated with respect to the same query can produce distinct assessment scores depending on the adopted prioritized scheme, which is user-dependent.
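The contrast between a user-independent and a user-dependent aggregation can be made concrete with a short Python sketch. The satisfaction degrees below are invented for illustration, the function names are hypothetical, and the two preference orders follow the CARAp and ACApR naming used in this paper; the sketch assumes strict orders with no ties.

```python
from statistics import mean
from typing import Dict, Sequence


def mean_score(scores: Dict[str, float]) -> float:
    """User-independent baseline: plain average of the criteria scores."""
    return mean(scores.values())


def prioritized_scoring_for_user(scores: Dict[str, float],
                                 order: Sequence[str]) -> float:
    """Prioritized scoring for one user's preference order over the criteria."""
    total, weight = 0.0, 1.0  # the most preferred criterion has weight 1
    for name in order:
        total += weight * scores[name]
        weight *= scores[name]  # weight of the next, less preferred criterion
    return total


# Hypothetical satisfaction degrees of one document for one query.
doc = {"aboutness": 0.9, "coverage": 0.2,
       "appropriateness": 0.6, "reliability": 0.7}

carap = ["coverage", "aboutness", "reliability", "appropriateness"]
acapr = ["aboutness", "coverage", "appropriateness", "reliability"]

print("mean :", mean_score(doc))
print("CARAp:", prioritized_scoring_for_user(doc, carap))
print("ACApR:", prioritized_scoring_for_user(doc, acapr))
```

The average is the same whoever issues the query, whereas the two orderings yield clearly different scores for the same document, which is why a single set of user-independent relevance judgments is not sufficient here.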

The evaluation approach proposed in this paper is based on an analysis of how document rankings are modified according to the prioritized aggregations associated with the user typologies that we have identified in Section 3.1.

The relevance criteria and their aggregation discussed in the previous sections have been implemented on top of the well-known Apache Lucene open-source API (see footnote 3). The Reuters RCV1 Collection (over 800,000 documents) has been used. The method that we have used to generate both queries and user profiles has been inspired by the approach presented by Sanderson in [9]. In this work the author presents a method to perform simple IR evaluations by using the Reuters collection, which has neither queries nor relevance judgments, but has one or more subject codes associated with each document.

He splits the collection into two parts, a query set "Q" and a test set "T", and documents are randomly assigned to one of the two subsets. Then, all subject codes are grouped in a set "S". For each subject code sx, all documents tagged with the subject code sx are extracted from the set "Q". From these documents, the pairs (word, weight) are generated to create a query. Then, the query is performed on the set "T". The precision/recall curves are calculated by considering as relevant the documents that contain the subject code sx.
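The procedure above can be sketched as follows. This is a simplified Python illustration, not Sanderson's actual implementation: the document structure (dictionaries with "id", "codes" and "text" keys), the function names, and the use of raw term counts as weights are all assumptions made for the example, since the weighting scheme is not specified here.

```python
import random
from collections import Counter
from typing import Dict, List, Set, Tuple

Document = Dict[str, object]  # hypothetical shape: {"id", "codes", "text"}


def split_collection(docs: List[Document],
                     seed: int = 0) -> Tuple[List[Document], List[Document]]:
    """Randomly assign each document to the query set Q or the test set T."""
    rng = random.Random(seed)
    query_set: List[Document] = []
    test_set: List[Document] = []
    for doc in docs:
        (query_set if rng.random() < 0.5 else test_set).append(doc)
    return query_set, test_set


def build_query(query_set: List[Document], subject_code: str,
                top_k: int = 5) -> List[Tuple[str, int]]:
    """(word, weight) pairs from the documents of Q tagged with subject_code.

    Assumption: the weight is the raw term count over those documents.
    """
    counts: Counter = Counter()
    for doc in query_set:
        if subject_code in doc["codes"]:
            counts.update(str(doc["text"]).lower().split())
    return counts.most_common(top_k)


def relevant_ids(test_set: List[Document], subject_code: str) -> Set[object]:
    """Documents of T considered relevant: those tagged with subject_code."""
    return {doc["id"] for doc in test_set if subject_code in doc["codes"]}
```

A query built by `build_query` would then be run against an index of the test set, and the retrieved documents compared with `relevant_ids` to obtain precision and recall.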

We have been inspired by Sanderson's approach to build both the queries and the user profiles. The queries have been created as described above. The user profiles have been created in the following way. The set "Q" has been split into different subsets based on the subject code of each document (e.g., "sport", "science", "economy", etc.). Each subset of "Q" represents the set of documents known by the users interested in that particular topic. For example, the subset that contains all documents tagged with the subject code "sport" represents the set of documents known by the users interested in sports.

We have indexed each subset of "Q" and, for each created index, we have calculated the TF-IDF of each term. Then, we have computed a normalized ranking of these terms and we have extracted the most significant ones. The TF-IDF of each term represents the interest degree of that term in the profile, that is, how well the term represents the user's interests.

Footnote 3: See http://lucene.apache.org/.
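A minimal sketch of this profile construction is given below in plain Python, without Lucene. The exact TF-IDF variant and the normalization used in the experiments are not stated in this section, so the sketch assumes a collection-level term frequency, a smoothed logarithmic IDF, and normalization by the highest weight so that the top term receives an interest degree of 1; the function name and `top_k` parameter are hypothetical.

```python
import math
from collections import Counter
from typing import List, Tuple


def build_profile(subset_docs: List[str], top_k: int = 15) -> List[Tuple[str, float]]:
    """Return the top_k (term, interest degree) pairs for one subset of Q.

    Assumptions: whitespace tokenization, tf summed over the subset,
    idf = log(1 + N / df), weights divided by the maximum weight.
    """
    if not subset_docs:
        return []
    n_docs = len(subset_docs)
    tf: Counter = Counter()
    df: Counter = Counter()
    for text in subset_docs:
        tokens = text.lower().split()
        tf.update(tokens)
        df.update(set(tokens))  # each document counts once per term
    raw = {term: tf[term] * math.log(1 + n_docs / df[term]) for term in tf}
    if not raw:
        return []
    top_weight = max(raw.values())
    ranked = sorted(raw.items(), key=lambda item: item[1], reverse=True)[:top_k]
    return [(term, weight / top_weight) for term, weight in ranked]


# Hypothetical miniature subset standing in for the documents of one topic.
profile = build_profile(["gene gene disease", "gene drug", "disease disease disease"])
for term, degree in profile:
    print(f"{term:10s} {degree:.3f}")
```

The resulting list can be treated like a weighted term vector, in the same way as a document or a query.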

An example of a user profile is illustrated in Table 1. For example, the users associated with the "BIOTECH" profile have, with respect to the term "disease", an interest degree of 0.419. Each profile is viewed as a long-term information need; therefore, it is treated in the same way as documents or queries.

To study the behavior of the system, we have carried out a user evaluation as proposed in [1, 5, 6].

The user evaluation described in this paper has been inspired by the one suggested in [7] that simply consists in a procedure in which a set of at least 6 users performs a set of at least 6 queries.

In these experiments we have considered eight users with eight different profiles, each one associated with a subset of "Q" (Table 2).

Table 1: "BIOTECH" user profile.

Term          Interest degree
scientist     1.000
researcher    0.563
disease       0.419
cancer        0.410
human         0.406
gene          0.402
study         0.386
clone         0.281
animal        0.279
planet        0.267
patient       0.260
brain         0.259
people        0.254
experiment    0.249
drug          0.247

The aims of these experiments are to verify that: (i) when a user performs queries in-line with his interests, by applying a prioritized aggregation operator, the system produces an improved ranking with respect to the one produced by simply averaging the scores, and (ii) when a user performs queries that are not-in-line with his interests, by applying a prioritized aggregation operator, the quality of the produced rank does not decrease with respect to the situation in which the prioritized aggregation operators are not applied.

Two kinds of queries have been considered: those which are in line with the interests contained in the user's profile, Qi, and those which are not in line with the interests contained in the user's profile, Qn. Table 2 illustrates the set Qi and shows the associations between the user profiles and the performed queries. In these preliminary experiments only one query has been generated for each user. For instance, for User 1, the set Qi is composed only of the query Q1, while the set Qn is composed of all the other queries from Q2 to Q8.

For User 2, the set Qi is composed only of the query Q2, while the set Qn is composed of the query Q1 and the queries from Q3 to Q8, and so on for the other users.

Table 2: User profiles and performed queries.

User     Profile Name    Query
User1    SPACE           Q1: "space shuttle missions"
User2    BIOTECH         Q2: "drug disease"
User3    HITECH          Q3: "information technology"
User4    CRIMINOLOGY     Q4: "police arrest sentence fraud"
User5    DEFENSE         Q5: "russia military navy troops"
User6    DISASTER        Q6: "flood earthquake hurricane"
User7    FASHION         Q7: "collection italian versace"
User8    SPORT           Q8: "premiership league season score"

When a user submits a query, the matching between the query vector and each document vector is made first (aboutness); then, on each document, the coverage and the appropriateness criteria are evaluated by comparing the document vector with the user's profile vector. Finally, the value of the reliability criterion, which corresponds to the degree to which the user trusts the source from which the document comes, is taken into account. These are the values to be aggregated: aboutness, coverage, appropriateness, and reliability.

The evaluation of the produced rank is made by the eight real users that used the system. Each user analyzed the top 10 documents returned by the system and assessed, for each document, if it is relevant or not.

3.2.2 Discussion of the Results

In this section we present the obtained results. For space reasons some ranks have not been included; however, the complete archive of the ranks produced in these experiments is available online (see footnote 4). For convenience, only the top 10 ranked documents are reported in each table. The rationale behind this decision is the fact that the majority of search result click activity (89.8%) happens on the first page of search results [11], that is, generally, users only consider the first 10 (20) documents. The baseline rank for the "Scoring" operator is obtained by applying the average operator to calculate document assessment. Such a rank corresponds to the average assessment of the documents considering the four criteria and without considering priorities among the criteria. Instead, the baseline rank for the "Min" operator is obtained by applying the standard min operator. Table 3 illustrates an example of rank produced by the average operator after performing a query in Qi, while Table 4 illustrates an example of rank produced by the standard min operator after performing a query in Qi. The entries marked with an asterisk before the title have been considered relevant with respect to both the performed query and the user profile.

We can notice that there are more non-relevant documents in the top 10 list resulting from the application of the average operator than in the list resulting from the application of the standard min operator. This is due to the compensatory nature of the average operator.

We illustrate the behavior of the system by taking into account different kinds of aggregations applied to User 1, the user associated with the "SPACE" profile. In particular, we present in Tables 5 to 10 the results obtained by applying both the Prioritized "Scoring" Operator and the Prioritized "Min" Operator, with the aggregations ACApR, CApAR, and ApCAR.

We can notice that the proposed document rankings are improved with respect to the baseline rankings, for both operators and for the considered aggregations, in the sense that the number of relevant documents in the top 10 is greater than the number of relevant documents in the baseline ranking: non-relevant documents are pushed down in the ranking.
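The comparison criterion used here, the number of relevant documents among the top 10, is simple enough to state as code. The following Python sketch uses hypothetical function names and invented relevance flags (True for an entry marked with an asterisk); it only illustrates how two rankings can be compared under this criterion.

```python
from typing import Sequence


def relevant_in_top_k(judgments: Sequence[bool], k: int = 10) -> int:
    """Number of documents judged relevant among the first k positions."""
    return sum(1 for is_relevant in judgments[:k] if is_relevant)


def improves_on_baseline(candidate: Sequence[bool],
                         baseline: Sequence[bool], k: int = 10) -> bool:
    """True when the candidate ranking has more relevant documents in the top k."""
    return relevant_in_top_k(candidate, k) > relevant_in_top_k(baseline, k)


# Invented judgments for two rankings of the same query (True = relevant).
baseline_flags = [True, False, True, False, True, True, False, True, False, True]
prioritized_flags = [True, True, True, True, True, True, True, True, True, False]

print(relevant_in_top_k(baseline_flags), relevant_in_top_k(prioritized_flags))
print(improves_on_baseline(prioritized_flags, baseline_flags))
```

Dividing the count by k gives precision at k, which may be more convenient when lists of different lengths are compared.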

We can also notice that, while the document in the 9th position of the top 10 documents in Table 3 is deemed sufficiently topical for the user with profile "SPACE", the same document is not even considered in the top 10 list of any table corresponding to the prioritized "Scoring" operator. This is due to the fact that, even though the document satisfies the query because it contains information about space mission, its content is instead related to space exploration. Instead, for example, the document in the first position in the scoring baseline rank is also proposed in almost all the top ten lists (scoring and min), including the min baseline rank. An exception is Table 6, where that document does not appear. The reason is that this document comes from a source with a very low degree of reliability.

Footnote 4: http://www.dti.unimi.it/dragoni/files/MultirelevanceUserEvaluation.rar

Different considerations have to be made when the user's query is not in line with his profile (i.e., the user's query is in the set Qn). We will discuss two different scenarios. In the first one, the user associated with the "BIOTECH" profile executes the query associated with the "FASHION" profile, while in the second scenario, the user associated with the "CRIMINOLOGY" profile executes the query associated with the "SPACE" profile. We have noticed that, for the scoring operator, the results for all aggregations are in general similar to the baseline. The previous considerations are not valid for the prioritized min operator. This is due to its definition. Indeed, if just one criterion is weakly satisfied, the overall assessment is very low. Now, if users make queries not in line with their profile, criteria like coverage and appropriateness are weakly satisfied, and then the overall value is low. Instead, when considering the prioritized min operator, the result depends also on the importance degree of the least satisfied criterion. We can conclude that the (prioritized) min operator should not be used for users who make queries that are not in line with their profile.

4. CONCLUSION AND FUTURE WORK

In this paper, a user evaluation for aggregating multiple criteria has been presented and discussed.

The experimental results have been obtained thanks to a case study on personalized Information Retrieval with multicriteria relevance. These results show that: (i) the proposed operators improve the ranking of the documents which are related to the user's interests, when the user formulates an interest-related query; (ii) for the "scoring" operator, when a user has no interests or formulates a query which is not related to his interests, the ranking of the documents is similar to the ranking obtained by using the average operator; and (iii) for the "min" operator, when the user formulates a non-interest-related query, this operator is not suitable.

R.   Document Title                                          Score
1    *Shuttle Atlantis blasts off on schedule.               0.626
2    Countdown starts for Sunday shuttle launch.             0.575
3    *Shuttle finally takes Lucid off space station Mir.     0.573
4    U.S. spacewoman breaks another record.                  0.573
5    *Shuttle Discovery heads for Florida.                   0.572
6    *Shuttle Atlantis heads for Mir despite problem.        0.568
7    Scientists delighted with U.S. shuttle flight.          0.567
8    *U.S. shuttle launched on mission to Mir.               0.563
9    Boeing-Lockheed group signs $7 billion shuttle pact.    0.562
10   *U.S. shuttle leaves space station Mir.                 0.561

Results for "SPACE" profile by applying the Prioritized Scoring Operator and ACApR aggregation.

R.   Document Title
1    *Shuttle Discovery takes off on schedule.
2    *Shuttle Atlantis blasts off on schedule.
3    *U.S. space shuttle heads home.
4    *Shuttle Discovery heads for Florida.
5    *U.S. shuttle crew set up space laboratory.
6    *Columbia shuttle mission extended one day.
7    *Shuttle Atlantis heads for Mir despite problem.
8    *Shuttle Discovery lands in Florida.
9    *U.S. space shuttle crew set for Thursday landing.
10   *U.S. shuttle will not flush Mir's water.

Results for "SPACE" profile by applying the Prioritized Min Operator and ACApR aggregation.

R.   Document Title
1    *Shuttle Atlantis to return home on Wednesday.
2    *With spacewalk off, shuttle astronauts relax.
3    *U.S. space shuttle heads for rendezvous with Mir.
4    *U.S. shuttle crew prepares to retrieve satellite.
5    *Shuttle-deployed telescope ready for action.
6    *Space shuttle deploys U.S.-German satellite.
7    *Shuttle crew prepares for nighttime landing.
8    *Hubble service crew prepares to return home.
9    *Satellites line up behind shuttle Columbia.
10   RUSSIA: Sticken Mir crew stands down, says worst over.

Results for "SPACE" profile by applying the Prioritized Scoring Operator and CApAR aggregation.

R.   Document Title
1    *Russians aim to fix Mir before US Shuttle arrives.
2    *Russians hope to fix Mir before Shuttle arrives.
3    *With spacewalk off, shuttle astronauts relax.
4    Countdown continues for U.S. spacewoman's return.
5    *Shuttle Columbia blasts off to mission.
6    *Shuttle Atlantis blasts off on schedule.
7    *Navigational problem crops up on shuttle mission.
8    *U.S. shuttle launched on mission to Mir.
9    Sticken Mir crew stands down, says worst over.
10   *Astronaut Lucid tones up for ride home.

Results for "SPACE" profile by applying the Prioritized Min Operator and CApAR aggregation.

R.   Document Title
1    *Part of planned space station arrives in Florida.
2    *French astronaut to join Russian space mission.
3    *Russia, hurt by Mars failure, sends probe to space.
4    *Astronauts board shuttle for U.S. launch.
5    *Shuttle Columbia blasts off to mission.
6    *Shuttle Atlantis blasts off on schedule.
7    *Shuttle Discovery lands in Florida.
8    *U.S. space shuttle crew set for Thursday landing.
9    *U.S. shuttle leaves space station Mir.
10   Lack of funds threaten Russia's space programme.

Score column values recovered from the source (their association with the tables above is not recoverable):
0.364 0.364 0.362 0.351 0.336 0.332 0.332 0.314 0.303 0.299
0.250 0.242 0.231 0.228 0.228 0.225 0.216 0.215 0.210 0.204
141 -1 69 48 89 208 3 63 117 -2

Table 10: Results for "SPACE" profile by applying [...]

5. REFERENCES

[1] P. Borlund. The IIR evaluation model: a framework for evaluation of interactive information retrieval systems.
[...] effectiveness. Journal of the American Society for [...]
[...] Multidimensional relevance: A new aggregation [...]
[...] prioritized "and" aggregation operator for multidimensional relevance assessment. In AI*IA 2009, to appear, 2009.
[5] Ingwersen. Information Retrieval Interaction. Taylor Graham, 1992.
[6] Ingwersen. Cognitive perspectives of information retrieval interaction: elements of a cognitive IR theory.
[...] Information Seeking and Retrieval in Context Series.
[...] retrieval interaction: Extension and applications. Journal of American Society for Information Science, 34:313-327, 1997.
[11] Spink, Jansen, Blakely, and Koshman. A study of results overlap and uniqueness among major web search engines. Inf. Process. Manage., 42(5):1379-1391, 2006.
[12] Y. C. Xu and Chen. Relevance judgment: What do information users consider beyond topicality? J. Am. Soc. Inf. Sci. Technol., 57(7):961-973, 2006.