=Paper=
{{Paper
|id=Vol-512/paper-3
|storemode=property
|title=Incorporating User Behavior Information in IR Evaluation
|pdfUrl=https://ceur-ws.org/Vol-512/paper03.pdf
|volume=Vol-512
|dblpUrl=https://dblp.org/rec/conf/sigir/YilmazSCR09
}}
==Incorporating User Behavior Information in IR Evaluation==
Emine Yilmaz, Milad Shokouhi, Nick Craswell, Stephen Robertson
Microsoft Research Cambridge
Cambridge, UK
{eminey, milads, nickcr, ser}@microsoft.com
Copyright is held by the author/owner(s). SIGIR'09, July 19-23, 2009, Boston, USA.

ABSTRACT

Many evaluation measures in Information Retrieval (IR) can be viewed as simple user models. Meanwhile, search logs provide us with information about how real users search. This paper describes our attempts to reconcile click log information with user-centric IR measures, bringing the measures into agreement with the logs. Studying the discount curves of NDCG and RBP leads us to extend them, incorporating the probability of click in their discount curves. We measure the accuracy of user models by calculating 'session likelihood'. This leads us to propose a new IR evaluation measure, Expected Browsing Utility (EBU), based on a more sophisticated user model. EBU has better session likelihood than existing measures; therefore we argue it is a better user-centric IR measure.

1. INTRODUCTION

This paper is concerned with user-centric IR evaluation, where an evaluation measure should model the reaction of a real user to a list of results, evaluating the utility of the list of documents to the user. Web search experiments usually employ an IR measure that focuses only on top-ranked results, under the assumption that Web users deal 'shallowly' with the ranked list. This is probably correct, but we might ask: how can we be sure that Web search users are shallow, and how should we choose the degree of shallowness? In this paper, our solution is to make IR evaluation consistent with real user click behavior. We still evaluate based on relevance judgments on a list of search results, but the importance of each search result is brought in line with the probability of clicking that result.

In our experiments we use click logs of a search engine (bing.com) taken from January 2009, combined with relevance judgments for 2057 queries. For each judged query we extracted the top-10 results for up to 1000 real query instances, and the pattern of clicks in the form of 10 Booleans (so each result is either clicked or not clicked). More than 91% of all top-10 query-URL pairs were judged on the 5-level scale {Perfect, Excellent, Good, Fair, Bad}. Unjudged documents are assumed to be Bad. We divide the queries into two sets of equal size: training and test.
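As an illustration (not part of the paper), each session in this setup can be thought of as a query instance carrying ten relevance labels and ten click Booleans. The sketch below shows one way such a record could be represented in Python; the field names and example values are invented.

# Illustrative session record; structure and values are assumptions for this sketch.
from dataclasses import dataclass
from typing import List

LABELS = ["Bad", "Fair", "Good", "Excellent", "Perfect"]  # 5-level relevance scale

@dataclass
class Session:
    query: str
    labels: List[str]   # relevance label of each top-10 result ("Bad" if unjudged)
    clicks: List[bool]  # True if the result at that rank was clicked in this query instance

example = Session(
    query="example query",
    labels=["Perfect", "Good", "Bad", "Fair", "Bad", "Good", "Bad", "Bad", "Fair", "Bad"],
    clicks=[True, False, False, False, False, False, False, False, False, False],
)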
A key difference between user-centric IR evaluation measures, such as Normalized Discounted Cumulative Gain (NDCG) [2] and Rank Biased Precision (RBP) [3], is the choice of discount function. Many experiments with NDCG apply a discount at rank r of 1/log(r + 1). Another metric, RBP, has a persistence parameter p, so that the probability of seeing position r is p^(r−1). Note that some evaluation measures, such as Average Precision, are not easily interpretable as a user model. Such measures are beyond the scope of this paper, since we focus on user-centric evaluation.

The next section considers the discount curves of NDCG and RBP, in contrast to real click behavior. Noting a discrepancy, we extend the two metrics based on information about the probability of click on each relevance label. Having done so, the discount curves are more in line with real user behavior. However, the curves do not incorporate information about the user's probability of returning to the results list, having clicked on a result. Therefore the next section introduces our new evaluation measure, Expected Browsing Utility (EBU). Finally, we introduce Session Likelihood, a test for whether an evaluation measure is in agreement with click logs. Under that test, EBU is most in line with real user behavior, and therefore we argue it is a superior user-centric evaluation measure.

2. DISCOUNT FUNCTIONS AND CLICKS

One of the key factors differentiating the evaluation metrics is their discount functions. Most user-centric IR evaluation metrics in the literature can be written in the form

  ∑_{r=1}^{N} p(user observes document at rank r) · gain(r),

as the discount function is assumed to model the probability that the user observes a document at a given rank. Therefore, the quality of a metric is directly dependent on how accurately the discount function estimates this probability. In the case of Web search, this probability should ideally correspond to the probability that the user clicks on a document at rank r. Hence, one can compare the evaluation metrics based on how their discount functions (their assumed probabilities of click) compare with the actual probability that the user clicks on a document. Discount functions that are more consistent with click patterns are more flexible in explaining, and evaluating, the users' Web search behavior.
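As an illustration of this comparison, the sketch below (not the authors' code) treats each metric's discount at rank r as its assumed probability of click and computes the RMS error against an observed click curve. The 'actual' values used here are placeholders, not the measurements behind Figure 1.

# Minimal sketch: discounts as assumed click probabilities, compared by RMS error.
import math

def ndcg_discount(r: int, variant: str = "log") -> float:
    # NDCG discounts: 1/log2(r + 1) (a common form; the base is not fixed by the paper) or 1/r.
    return 1.0 / math.log2(r + 1) if variant == "log" else 1.0 / r

def rbp_discount(r: int, p: float) -> float:
    # RBP: the probability of seeing position r is p^(r-1).
    return p ** (r - 1)

def rms_error(assumed, actual) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(assumed, actual)) / len(actual))

# Hypothetical observed click probabilities for ranks 1..5 (placeholders only).
actual = [0.62, 0.25, 0.17, 0.13, 0.11]
ranks = range(1, 6)
print("NDCG log RMS:", rms_error([ndcg_discount(r, "log") for r in ranks], actual))
print("RBP p=0.4 RMS:", rms_error([rbp_discount(r, 0.4) for r in ranks], actual))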
Next, we compare the user models associated with the underlying discount functions of RBP and NDCG. The top two plots in Figure 1 show the average probability of click (averaged over all sessions in the test data) per rank. We then compare this actual probability of click with the click probability assumed by different evaluation metrics. As mentioned above, this probability corresponds to the discount function used in the definition of the metrics.

[Figure 1: P(click) vs. rank for different metrics. Each panel plots probability of click against rank (1 to 5). Top row: actual click probabilities vs. the discounts of NDCG (log discount, RMS=0.385; 1/r discount, RMS=0.262) and of RBP (p=0.2 to 0.6, RMS between 0.169 and 0.284). Bottom row: the same comparison for the extended versions of the metrics (NDCG log, RMS=0.148; NDCG 1/r, RMS=0.098; RBP p=0.2 to 0.6, RMS between 0.045 and 0.113).]
The upper left and right plots compare the discount functions of NDCG (with the commonly used 1/log(r + 1) and 1/r discounts) and of RBP (with p ∈ {0.2, 0.3, 0.4, 0.5, 0.6}) with the actual click probability, respectively. For comparison purposes, the plots report the Root Mean Squared (RMS) error between the probability of click assumed by a metric and the actual probability of click. It can be seen that the probability of click assumed by these two metrics is quite different from the actual click probability.

As the discount functions in NDCG and RBP are not derived from search logs, it is not surprising that they are not successful in predicting clicks. In the following section, we show how extending such metrics by incorporating the quality of snippets can significantly improve the discount functions for predicting the probabilities of clicks.

3. MODELING THE IMPACT OF SNIPPETS

One reason for the discrepancy between the described discount functions and the click patterns is that these metrics do not account for the fact that users only click on some documents, depending on the relevance of the summary (snippet). Both RBP and NDCG assume that the user always clicks on the document at the first rank, whereas the actual probability of click calculated from our training search logs shows that the probability that the user clicks on the first-ranked document is only slightly higher than 0.6.

To address this issue, we enhance the NDCG and RBP user models by incorporating the snippet quality factor and considering its impact on the probability of clicks. We hypothesize that the probability that the user clicks on a document, which reflects the quality of its summary, is a direct function of the relevance of the associated document. Table 1 supports our claim by showing p(C|summary) ~ p(C|relevance) obtained using the training dataset. (For simplicity, we assume that the quality of summaries and the relevance of documents are strongly correlated; that is, relevant documents have relevant summaries and vice versa.) It can be seen that the probability that the user clicks on a document tends to increase as the level of relevance of the document increases. Note that Bad and Fair documents deviate slightly from this trend, with nearly identical click probabilities. This is caused by the fact that (1) documents judged as Fair tend to be only slightly relevant to the user's information need, hence they are effectively Bad to the user, and (2) unjudged documents are treated as Bad in our computations.

Table 1: Probability of click given the relevance
  Relevance    P(click|relevance)
  Bad          0.5101
  Fair         0.5042
  Good         0.5343
  Excellent    0.6530
  Perfect      0.8371

Motivated by these observations, we extend NDCG and RBP to incorporate the summary quality into their discount functions as follows: if the discount function of the metric dictates that the user visits a document at rank r with probability p(d_r), then the probability that the user clicks on the document at rank r can be computed as p(d_r) · p(C|summary_r) (where the click probabilities are shown in Table 1). The bottom two plots in Figure 1 show how the extended versions of the metrics then compare with the actual click probability. It can be seen that the extended versions of these metrics can approximate the actual probability of click substantially better than the standard versions.
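A minimal sketch of this extension, under our reading of the section: the assumed click probability at rank r becomes p(d_r) · p(C|summary_r), with p(C|summary) approximated by the p(C|relevance) values of Table 1. The example ranking is invented for illustration.

# Sketch of the click-extended discounts (not the authors' code).
import math

P_CLICK_GIVEN_REL = {  # Table 1
    "Bad": 0.5101, "Fair": 0.5042, "Good": 0.5343, "Excellent": 0.6530, "Perfect": 0.8371,
}

def ndcg_discount(r: int) -> float:
    return 1.0 / math.log2(r + 1)

def rbp_discount(r: int, p: float = 0.4) -> float:
    return p ** (r - 1)

def extended_click_probs(labels, discount):
    # labels[r-1] is the relevance label of the document at rank r.
    return [discount(r) * P_CLICK_GIVEN_REL[labels[r - 1]] for r in range(1, len(labels) + 1)]

# Hypothetical ranked list of relevance labels (illustration only).
labels = ["Perfect", "Good", "Fair", "Bad", "Bad"]
print(extended_click_probs(labels, ndcg_discount))
print(extended_click_probs(labels, lambda r: rbp_discount(r, 0.4)))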
We would like to note that Turpin et al. [4] recently also suggested, independently of our work, that document summary information should be incorporated in retrieval evaluation. They showed that using the summary information in evaluation may alter the conclusions regarding the relative quality of search engines. However, their work mainly focuses on average precision as the evaluation metric.

4. EXPECTED BROWSING UTILITY (EBU)

All the metrics described so far assume that the probability that the user will continue the search at each rank is independent of (1) whether the user has clicked on a document or not, and (2) the relevance of the document seen by the user. Intuitively, we expect the search behavior of users to change based on the relevance of the last visited document. That is, visiting a highly relevant document that perfectly satisfies the user's information need (e.g., a navigational answer) should be strongly correlated with the probability of terminating the search session.

We confirmed our hypothesis by computing the probabilities of continuing the search session conditioned on the relevance of the last clicked document. The results generated from our training set are shown in Table 2. It can be seen that if the document is very relevant to the information need (e.g., Perfect), then the user is likely to stop browsing the results, as he has found the information he was looking for. On the other hand, if the user clicks on a document that is not relevant to his information need (e.g., Bad), then he is again likely to stop browsing, as he is frustrated with the result he has clicked on and thinks that documents retrieved lower than that will probably be even less relevant.

Table 2: Probability of continue given the relevance
  Relevance    P(cont|relevance_r)
  Bad          0.5171
  Fair         0.5727
  Good         0.6018
  Excellent    0.4082
  Perfect      0.1903

Motivated by the probabilities of click and continue shown in Tables 1 and 2, we propose a novel user model in which: (1) when a user visits a document, the user may or may not click the document, depending on the quality of the summary, and (2) the relevance of a document visited by a user directly affects whether the user continues the search or not. Figure 2 shows the user model associated with our metric.

[Figure 2: The user browsing model associated with the new evaluation metric. A flowchart in which the user examines the next document and clicks it with probability p(click|summary); after no click the user continues with probability p(continue|noclick) or exits; after viewing a clicked webpage the user continues with probability p(continue|nonrel) if it was not relevant, or p(continue|rel) if it was relevant, and otherwise exits.]

The associated user model can be described as follows: the user starts examining the ranked list of documents from top to bottom. At each step, the user first observes only the summary (e.g., the snippet and the URL) of the document. Based on the quality of the summary, with some probability p(C|summary) the user clicks on the document. If the user does not click on the document, then with probability p(cont|noclick) he/she continues examining the next document, or terminates the search session with probability 1 − p(cont|noclick).

If the user clicks on the document, then he or she can assess the relevance of the document. If the document did not contain any relevant information, then the user continues examining with probability p(cont|nonrel), or stops with probability 1 − p(cont|nonrel). If the clicked document was relevant, then the user continues examining with probability p(cont|rel) (which depends on the relevance of the clicked document).

A similar user model has been suggested by Dupret et al. [1]. However, their work is mainly focused on predicting future clicks, while our goal is to integrate the probabilities of clicks with evaluating the search results.

We use past click data together with relevance information to model the user search behavior. At each result position r, our model computes the expected probability of examining the document, p(E_r), as follows: we first assume that the user always examines the very first document, hence p(E_1) = 1. Now suppose the user has examined the document at rank r − 1 and we would like to compute p(E_r). Given that the user has already examined the document at rank r − 1, according to our model, with probability p(C|summary_{r−1}) the user clicks on the document at rank r − 1, observes its relevance, and continues browsing the ranked list with probability p(cont|rel_{r−1}). Alternatively, with probability 1 − p(C|summary_{r−1}) the user does not click on the document at rank r − 1 and continues browsing with probability p(cont|noclick). Overall, the probability that the user will examine the document at rank r can be written as:

  p(E_r) = p(E_{r−1}) · [p(C|summary_{r−1}) · p(cont|rel_{r−1}) + (1 − p(C|summary_{r−1})) · p(cont|noclick)]

Given that the user has examined the document at rank r, the probability that the user clicks on this document is p(C|summary_r). That is, the user clicks on a document at rank r with probability p(C_r) = p(E_r) · p(C|summary_r). Therefore, in total, the Expected Browsing Utility (EBU) that the user receives from the output of the search engine is

  EBU = ∑_{r=1}^{N} p(C_r) · rel_r

(divided by the EBU value of an optimal list so that the metric is between 0 and 1), where rel_r is the relevance of the document at rank r. In EBU, the importance of a document d depends on (1) its relevance and (2) the probability that d is clicked and viewed by the user.
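The following sketch illustrates the EBU computation under stated assumptions: p(C|summary) is taken from Table 1, p(cont|rel) from Table 2, p(cont|noclick) is a placeholder value (the paper estimates it from the training logs), the gain mapping for rel_r is an assumed graded gain, and the optimal list is obtained by sorting the same documents by relevance (our reading of "optimal list").

# Sketch of EBU; probability and gain choices marked as assumptions are not from the paper.
P_CLICK = {"Bad": 0.5101, "Fair": 0.5042, "Good": 0.5343, "Excellent": 0.6530, "Perfect": 0.8371}
P_CONT_REL = {"Bad": 0.5171, "Fair": 0.5727, "Good": 0.6018, "Excellent": 0.4082, "Perfect": 0.1903}
P_CONT_NOCLICK = 0.5  # placeholder; estimated from training logs in the paper
GAIN = {"Bad": 0, "Fair": 1, "Good": 2, "Excellent": 3, "Perfect": 4}  # assumed gain mapping

def expected_utility(labels):
    """Unnormalized EBU: sum over ranks of p(C_r) * rel_r."""
    utility, p_examine = 0.0, 1.0  # p(E_1) = 1
    for label in labels:
        p_click = p_examine * P_CLICK[label]  # p(C_r) = p(E_r) * p(C|summary_r)
        utility += p_click * GAIN[label]
        # p(E_{r+1}) = p(E_r) * [p(C|summary_r)*p(cont|rel_r) + (1 - p(C|summary_r))*p(cont|noclick)]
        p_examine *= P_CLICK[label] * P_CONT_REL[label] + (1 - P_CLICK[label]) * P_CONT_NOCLICK
    return utility

def ebu(labels):
    """EBU normalized by the utility of an ideal (relevance-sorted) list."""
    ideal = sorted(labels, key=lambda l: GAIN[l], reverse=True)
    denom = expected_utility(ideal)
    return expected_utility(labels) / denom if denom > 0 else 0.0

print(ebu(["Good", "Perfect", "Bad", "Fair", "Excellent"]))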
Figure 3 shows the same curves using EBU as the metric (computed using the probabilities from Tables 1 and 2). Comparing the EBU curves with those in Figure 1, it can be seen that EBU is better than both versions of NDCG and RBP.

[Figure 3: P(click) vs. rank for EBU. The plot compares the actual click probabilities per rank with those predicted by EBU (RMS=0.041).]
5. EVALUATING EVALUATION METRICS

In the above experiments we focused on the average click probability, i.e., the average probability that a user will click on a document at some rank r. Ideally, one would like to be able to infer the individual clicks per session; this way, the evaluation of user satisfaction per user session would be much more accurate. Hence, in the second set of experiments, we compare the probability of click dictated by the discount function of a metric with the actual click observations per session.

For that, we use the click probability dictated by an evaluation metric as a generative model and then compute the probability that this distribution would generate the sessions that were observed in the test data (i.e., the session likelihood). Instead of computing the session likelihood, one can also compute the session log likelihood. Let P(C_r|M) be the probability of click at rank r dictated by the discount function of the metric M, and let the likelihood of a particular session s given this metric be

  P(s|M) = ∏_{∀r, doc_r ∈ C_s} P(C_r|M) · ∏_{∀r, doc_r ∈ NC_s} (1 − P(C_r|M)),

where C_s and NC_s correspond to the documents clicked and not clicked in session s, respectively, and doc_r refers to the document at rank r in session s. The session log likelihood can then be written as:

  log P(sessions|M) = log [ ∏_{∀s ∈ sessions} P(s|M) ] = ∑_{∀s ∈ sessions} log P(s|M)
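A sketch of the session log likelihood computation described above. Clamping the probabilities away from 0 and 1 is our implementation choice; the paper does not say how degenerate click probabilities (e.g., a discount of exactly 1 at rank 1) are handled.

# Sketch of session (log) likelihood, following the formulas above.
import math

def session_log_likelihood(p_click_per_rank, clicks, eps=1e-6):
    # p_click_per_rank: per-rank click probabilities dictated by a metric M.
    # clicks: observed Boolean click pattern for one session.
    ll = 0.0
    for p, clicked in zip(p_click_per_rank, clicks):
        p = min(max(p, eps), 1 - eps)  # keep the log finite (implementation choice)
        ll += math.log(p) if clicked else math.log(1 - p)
    return ll

def corpus_log_likelihood(sessions, p_click_for):
    # p_click_for(session) returns the metric's per-rank click probabilities for that session.
    return sum(session_log_likelihood(p_click_for(s), s["clicks"]) for s in sessions)

# Toy usage with a fixed probability curve (illustration only).
sessions = [{"clicks": [True, False, False]}, {"clicks": [False, True, False]}]
print(corpus_log_likelihood(sessions, lambda s: [0.6, 0.25, 0.15]))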
The first column in Table 3 shows the session log likelihood for each metric. For comparison purposes, the second column in the table shows the average probability of observing the sessions in the test data. It can be seen that EBU can predict the behavior of an individual user (i.e., per session) much better than all the other metrics.

Table 3: Likelihood of individual sessions given each evaluation metric.
  Metric        Session log likelihood    P(click per session)
  RBP, p=0.2    -2.3859                   0.0920
  RBP, p=0.3    -2.1510                   0.1164
  RBP, p=0.4    -2.0570                   0.1278
  RBP, p=0.5    -2.0732                   0.1258
  RBP, p=0.6    -2.2007                   0.1107
  NDCG, log     -2.3064                   0.0996
  NDCG, 1/r     -2.0435                   0.1296
  EBU           -1.9371                   0.1441

6. CONCLUSIONS

Most evaluation metrics in information retrieval aim at evaluating the satisfaction of the user given a ranked list of documents. Hence, these metrics are based on some underlying user model which is assumed to capture the way users search. However, most of these user models are based on unrealistic assumptions.

In this paper, we showed how click logs can be used to devise enhanced evaluation measures. We first extended two commonly used evaluation metrics, NDCG and RBP, to incorporate the probability of click in their discount curves. We then introduced EBU, a new evaluation metric that is based on a more sophisticated user model than the other metrics. Finally, using a novel methodology for evaluating evaluation measures (referred to as session likelihood), we compared these different metrics and showed that EBU is a better metric in terms of modeling user behavior.

7. REFERENCES

[1] G. E. Dupret and B. Piwowarski. A user browsing model to predict search engine click data from past observations. In SIGIR '08: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 331-338, Singapore, Singapore, 2008. ACM.
[2] K. Järvelin and J. Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20(4):422-446, 2002.
[3] A. Moffat, W. Webber, and J. Zobel. Strategic system comparisons via targeted relevance judgments. In SIGIR '07: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 375-382, Amsterdam, The Netherlands, 2007. ACM.
[4] A. Turpin, F. Scholer, K. Järvelin, M. Wu, and J. S. Culpepper. Including summaries in system evaluation. In SIGIR '09: Proceedings of the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Boston, MA, USA, 2009. ACM. To appear.