     Incorporating user behavior information in IR evaluation

                       Emine Yilmaz Milad Shokouhi Nick Craswell Stephen Robertson
                                                    Microsoft Research Cambridge
                                                           Cambridge, UK
                                            {eminey, milads, nickcr, ser}@microsoft.com



Copyright is held by the author/owner(s).
SIGIR'09, July 19-23, 2009, Boston, USA.

ABSTRACT
Many evaluation measures in Information Retrieval (IR) can be viewed as simple user models. Meanwhile, search logs provide us with information about how real users search. This paper describes our attempts to reconcile click log information with user-centric IR measures, bringing the measures into agreement with the logs. Studying the discount curves of NDCG and RBP leads us to extend them, incorporating the probability of click in their discount curves. We measure the accuracy of user models by calculating 'session likelihood'. This leads us to propose a new IR evaluation measure, Expected Browsing Utility (EBU), based on a more sophisticated user model. EBU has better session likelihood than existing measures; therefore we argue it is a better user-centric IR measure.

1. INTRODUCTION
This paper is concerned with user-centric IR evaluation, where an evaluation measure should model the reaction of a real user to a list of results, evaluating the utility of the list of documents to the user. Web search experiments usually employ an IR measure that focuses only on top-ranked results, under the assumption that Web users deal 'shallowly' with the ranked list. This is probably correct, but we might ask: how can we be sure that Web search users are shallow, and how should we choose the degree of shallowness? In this paper, our solution is to make IR evaluation consistent with real user click behavior. We still evaluate based on relevance judgments on a list of search results, but the importance of each search result is brought in line with the probability of clicking that result.

In our experiments we use click logs of a search engine (bing.com) taken from January 2009, combined with relevance judgments for 2057 queries. For each judged query we extracted the top-10 results for up to 1000 real query instances, and the pattern of clicks in the form of 10 Booleans (so each result is either clicked or not clicked). More than 91% of all top-10 query-URL pairs were judged on the 5-level scale {Perfect, Excellent, Good, Fair, Bad}. Unjudged documents are assumed to be Bad. We divide the queries into two sets of equal size: training and test.

A key difference between user-centric IR evaluation measures, such as Normalized Discounted Cumulative Gain (NDCG) [2] and Rank Biased Precision (RBP) [3], is the choice of discount function. Many experiments with NDCG apply a discount at rank r of 1/log(r + 1). Another metric, RBP, has a persistence parameter p so that the probability of seeing position r is p^{r-1}. Note that some evaluation measures, such as Average Precision, are not easily interpretable as a user model. Such measures are beyond the scope of this paper, since we focus on user-centric evaluation.

The next section considers the discount curves of NDCG and RBP, in contrast to real click behavior. Noting a discrepancy, we extend the two metrics based on information about the probability of click on each relevance label. Having done so, the discount curves are more in line with real user behavior. However, the curves do not incorporate information about the user's probability of returning to the results list, having clicked on a result. Therefore the next section introduces our new evaluation measure, Expected Browsing Utility (EBU). Finally, we introduce Session Likelihood, a test of whether an evaluation measure is in agreement with click logs. Under that test, EBU is most in line with real user behavior; therefore we argue it is a superior user-centric evaluation measure.

2. DISCOUNT FUNCTIONS AND CLICKS
One of the key factors differentiating the evaluation metrics is their discount functions. Most user-centric IR evaluation metrics in the literature can be written in the form

    ∑_{r=1}^{N} p(user observes document at rank r) · gain(r),

since the discount function is assumed to model the probability that the user observes a document at a given rank. Therefore, the quality of a metric is directly dependent on how accurately the discount function estimates this probability. In the case of Web search, this probability should ideally correspond to the probability that the user clicks on a document at rank r. Hence, one can compare the evaluation metrics based on how their discount functions (their assumed probabilities of click) compare with the actual probability that the user clicks on a document. Discount functions that are more consistent with click patterns are more flexible in explaining and evaluating users' Web search behavior.

Next, we compare the user models associated with the underlying discount functions of RBP and NDCG. The top two plots in Figure 1 show the average probability of click (averaged over all sessions in the test data) per rank. We then compare this actual probability of click with the click probability assumed by different evaluation metrics. As mentioned above, this probability corresponds to the discount function used in the definition of the metrics.
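To make this comparison concrete, the assumed click probabilities and their RMS errors can be computed directly from the metric definitions. The following Python sketch is a minimal illustration, not the code used in our experiments; it assumes sessions is a list of the 10-element Boolean click vectors described in the introduction, and it uses log base 2 for the NDCG discount, a common choice that the 1/log(r + 1) notation leaves unspecified:

    import math

    def ndcg_log_discount(r):
        # NDCG discount at rank r: 1 / log2(r + 1) (base 2 assumed here)
        return 1.0 / math.log2(r + 1)

    def ndcg_reciprocal_discount(r):
        # The alternative 1/r NDCG discount
        return 1.0 / r

    def rbp_discount(r, p):
        # RBP: the probability of seeing position r is p^(r-1)
        return p ** (r - 1)

    def empirical_click_prob(sessions, rank):
        # Fraction of query instances in which the result at `rank` was clicked
        return sum(s[rank - 1] for s in sessions) / len(sessions)

    def rms_error(discount, sessions, max_rank=5):
        # Root mean squared gap between assumed and observed click probability
        gaps = [(discount(r) - empirical_click_prob(sessions, r)) ** 2
                for r in range(1, max_rank + 1)]
        return math.sqrt(sum(gaps) / len(gaps))

For instance, rms_error(lambda r: rbp_discount(r, 0.4), sessions) corresponds to the kind of RMS figure reported for RBP with p = 0.4 in the legend of Figure 1.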
[Figure 1: P(click) vs. rank for different metrics. Four panels plot the actual probability of click at ranks 1-5 against the click probability assumed by each metric. Top left: NDCG with the log discount (RMS = 0.385) and the 1/r discount (RMS = 0.262). Top right: RBP with p = 0.2, 0.3, 0.4, 0.5, 0.6 (RMS = 0.169, 0.173, 0.191, 0.227, 0.284). Bottom left: the extended (snippet-aware) NDCG, log discount (RMS = 0.148) and 1/r discount (RMS = 0.098). Bottom right: the extended RBP, p = 0.2 to 0.6 (RMS = 0.070, 0.050, 0.045, 0.069, 0.113).]
The upper left and right plots in Figure 1 compare the discount functions of NDCG (with the commonly used 1/log(r + 1) and 1/r discounts) and RBP (with p ∈ {0.2, 0.3, 0.4, 0.5, 0.6}) with the actual click probability, respectively. For comparison purposes, the plots report the Root Mean Squared (RMS) error between the probability of click assumed by a metric and the actual probability of click. It can be seen that the probability of click assumed by these two metrics is quite different from the actual click probability.

As the discount functions in NDCG and RBP are not derived from search logs, it is not surprising that they are not successful in predicting clicks. In the following section, we show how extending such metrics by incorporating the quality of snippets can significantly improve the discount functions for predicting the probabilities of clicks.

3. MODELING THE IMPACT OF SNIPPETS
One reason for the discrepancy between the described discount functions and the click patterns is that these metrics do not account for the fact that users only click on some documents, depending on the relevance of the summary (snippet). Both RBP and NDCG assume that the user always clicks on the document at the first rank, whereas the actual probability of click calculated from our training search logs shows that the probability that the user clicks on the first-ranked document is only slightly higher than 0.6.

To address this issue, we enhance the NDCG and RBP user models by incorporating the snippet quality factor and considering its impact on the probability of clicks. We hypothesize that the probability that the user clicks on a document (i.e., the quality of the summary) is a direct function of the relevance of the associated document. Table 1 supports our claim by showing p(C|summary) ∼ p(C|relevance) obtained using the training dataset.¹

¹ For simplicity, we assume that the quality of summaries and the relevance of documents are strongly correlated; that is, relevant documents have relevant summaries and vice versa.

Table 1: Probability of click given relevance.
    Relevance     P(click|relevance)
    Bad               0.5101
    Fair              0.5042
    Good              0.5343
    Excellent         0.6530
    Perfect           0.8371

It can be seen that the probability that the user clicks on a document tends to increase as the level of relevance of the document increases. Note that Bad and Fair documents deviate slightly from this trend, having nearly identical click probabilities. This is caused by the fact that (1) documents judged as Fair tend to be only slightly relevant to the user's information need and hence are effectively Bad to the user, and (2) unjudged documents are treated as Bad in our computations.

Motivated by these observations, we extend NDCG and RBP to incorporate summary quality into their discount functions as follows: if the discount function of the metric dictates that the user visits the document at rank r with probability p(d_r), then the probability that the user clicks on the document at rank r can be computed as p(d_r) · p(C|summary_r), where the click probabilities are those shown in Table 1. The bottom two plots in Figure 1 show how the extended versions of the metrics then compare with the actual click probability. It can be seen that the extended versions of these metrics approximate the actual probability of click substantially better than the standard versions.
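Concretely, the extension reweights a metric's examination probability by the click probability of the document's relevance label. Below is a minimal sketch under the same assumptions as the earlier one, hard-coding the Table 1 estimates and reusing the rbp_discount helper:

    # P(click | relevance) estimated from the training logs (Table 1)
    P_CLICK_GIVEN_REL = {
        'Bad': 0.5101, 'Fair': 0.5042, 'Good': 0.5343,
        'Excellent': 0.6530, 'Perfect': 0.8371,
    }

    def extended_click_prob(discount, labels):
        # labels: relevance labels of the ranked results, top to bottom;
        # relevance serves as a proxy for summary quality (see footnote 1)
        return [discount(r) * P_CLICK_GIVEN_REL[label]
                for r, label in enumerate(labels, start=1)]

For example, extended_click_prob(lambda r: rbp_discount(r, 0.4), ['Perfect', 'Good', 'Bad']) yields the per-rank click probabilities that the extended RBP with p = 0.4 assigns to that list.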
We would like to note that Turpin et al. [4] recently also suggested, independently of our work, that document summary information should be incorporated in retrieval evaluation. They showed that using summary information in evaluation may alter the conclusions regarding the relative quality of search engines. However, their work mainly focuses on average precision as the evaluation metric.

4. EXPECTED BROWSING UTILITY (EBU)
All the metrics described so far assume that the probability that the user will continue searching at each rank is independent of (1) whether the user has clicked on a document or not, and (2) the relevance of the documents seen by the user. Intuitively, we expect the search behavior of users to change based on the relevance of the last visited document. That is, visiting a highly relevant document that perfectly satisfies the user's information need (e.g., a navigational answer) should be strongly correlated with the probability of terminating the search session.

We confirmed our hypothesis by computing the probabilities of continuing the search session conditioned on the relevance of the last clicked document. The results generated from our training set are shown in Table 2. It can be seen that if the document is very relevant to the information need (e.g., Perfect), then the user is likely to stop browsing the results, as he has found the information he was looking for. On the other hand, if the user clicks on a document that is not relevant to his information need (e.g., Bad), then he is again likely to stop browsing, as he is frustrated with the result he has clicked on and thinks documents retrieved lower than that will probably be even less relevant.

Table 2: Probability of continue given relevance.
    Relevance     P(cont|relevance)
    Bad               0.5171
    Fair              0.5727
    Good              0.6018
    Excellent         0.4082
    Perfect           0.1903

Motivated by the probabilities of click and continue shown in Tables 1 and 2, we propose a novel user model in which (1) when a user visits a document, the user may or may not click the document depending on the quality of the summary, and (2) the relevance of a document visited by a user directly affects whether the user continues the search or not.

[Figure 2: The user browsing model associated with the new evaluation metric. Flowchart: the user examines the next document and observes its summary; with probability p(click|summary) the user clicks and views the web page, otherwise the user examines more with probability p(continue|noclick) or exits. If the viewed page is not relevant, the user examines more with probability p(continue|nonrel) or exits; if it is relevant but the user is not satisfied, the user continues with probability p(continue|rel), otherwise exits.]

Figure 2 shows the user model associated with our metric. It can be described as follows: the user starts examining the ranked list of documents from top to bottom. At each step, the user first observes only the summary (e.g., the snippet and the URL) of the document. Based on the quality of the summary, the user clicks on the document with some probability p(C|summary). If the user does not click on the document, then with probability p(cont|noclick) he/she continues examining the next document, or terminates the search session with probability 1 − p(cont|noclick).

If the user clicks on the document, then he or she can assess the relevance of the document. If the document did not contain any relevant information, then the user continues examining with probability p(cont|nonrel) or stops with probability 1 − p(cont|nonrel). If the clicked document was relevant, then the user continues examining with probability p(cont|rel) (which depends on the relevance of the clicked document).

A similar user model has been suggested by Dupret et al. [1]. However, their work is mainly focused on predicting future clicks, while our goal is to integrate the probabilities of clicks into the evaluation of search results.

We use past click data together with relevance information to model the user's search behavior. At each result position r, our model computes the expected probability of examining the document, p(E_r), as follows. We first assume that the user always examines the very first document, hence p(E_1) = 1. Now, suppose the user has examined the document at rank r − 1 and we would like to compute p(E_r). Given that the user has already examined the document at rank r − 1, according to our model, with probability p(C|summary_{r−1}) the user clicks on the document at rank r − 1, observes its relevance, and continues browsing the ranked list with probability p(cont|rel_{r−1}). Alternatively, with probability 1 − p(C|summary_{r−1}) the user does not click on the document at rank r − 1 and continues browsing with probability p(cont|noclick). Overall, the probability that the user will examine the document at rank r can be written as:

    p(E_r) = p(E_{r−1}) · [p(C|summary_{r−1}) · p(cont|rel_{r−1}) + (1 − p(C|summary_{r−1})) · p(cont|noclick)]

Given that the user has examined the document at rank r, the probability that the user clicks on this document is p(C|summary_r). That is, the user clicks on a document at rank r with probability p(C_r) = p(E_r) · p(C|summary_r).

Therefore, in total, the Expected Browsing Utility (EBU) that the user receives from the output of the search engine is EBU = ∑_{r=1}^{N} p(C_r) · rel_r, divided by the EBU value of an optimal list so that the metric is between 0 and 1, where rel_r is the relevance of the document at rank r. In EBU, the importance of a document d depends on (1) its relevance and (2) the probability that d is clicked and viewed by the user.
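The recursion above translates directly into code. The sketch below is a minimal illustration, reusing P_CLICK_GIVEN_REL from the earlier sketch; the gain mapping and the value of p(cont|noclick) are hypothetical, since the paper estimates the latter from its logs without reporting the number, and we assume (as with NDCG) that the optimal list orders documents by decreasing relevance:

    # P(continue | relevance of the clicked document) (Table 2)
    P_CONT_GIVEN_REL = {
        'Bad': 0.5171, 'Fair': 0.5727, 'Good': 0.6018,
        'Excellent': 0.4082, 'Perfect': 0.1903,
    }
    P_CONT_NOCLICK = 0.5  # hypothetical placeholder; estimated from logs in the paper

    def ebu_unnormalized(labels, gain):
        # labels: relevance labels top to bottom; gain(label) gives rel_r
        p_examine, utility = 1.0, 0.0
        for label in labels:
            p_click = P_CLICK_GIVEN_REL[label]   # p(C|summary_r), Table 1
            p_c_r = p_examine * p_click          # p(C_r) = p(E_r) * p(C|summary_r)
            utility += p_c_r * gain(label)       # accumulate p(C_r) * rel_r
            # p(E_{r+1}) = p(E_r) * [p(C|summary_r) * p(cont|rel_r)
            #              + (1 - p(C|summary_r)) * p(cont|noclick)]
            p_examine *= (p_click * P_CONT_GIVEN_REL[label]
                          + (1.0 - p_click) * P_CONT_NOCLICK)
        return utility

    def ebu(labels, gain):
        # Normalize by the EBU of an optimal ordering so the score is in [0, 1]
        order = ['Perfect', 'Excellent', 'Good', 'Fair', 'Bad']
        ideal = sorted(labels, key=order.index)
        return ebu_unnormalized(labels, gain) / ebu_unnormalized(ideal, gain)

A call such as ebu(['Good', 'Perfect', 'Bad'], {'Bad': 0, 'Fair': 1, 'Good': 2, 'Excellent': 3, 'Perfect': 4}.get) illustrates the shape of the computation, with the dictionary standing in for an unspecified gain mapping.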
Figure 3 shows the same comparison using EBU as the metric (computed using the probabilities from Tables 1 and 2). Comparing the EBU curve with those in Figure 1, it can be seen that EBU approximates the actual click probability better than both versions of NDCG and all tested settings of RBP.

[Figure 3: P(click) vs. rank for EBU. The plot compares the actual per-rank click probabilities with those assumed by EBU (RMS = 0.041).]

5. EVALUATING EVALUATION METRICS
In the above experiments we focused on the average click probability, i.e., the average probability that a user will click on a document at some rank r. Ideally, one would like to be able to infer the individual clicks per session; this way, the evaluation of user satisfaction per user session would be much more accurate. Hence, in the second set of experiments, we compare the probability of click dictated by the discount function of a metric with the actual click observations per session.

For that, we use the click probability dictated by an evaluation metric as a generative model and then compute the probability that this distribution would generate the sessions that were observed in the test data (i.e., the session likelihood). Instead of computing the session likelihood, one can also compute the session log likelihood. Let P(C_r|M) be the probability of click at rank r dictated by the discount function of the metric M, and let the likelihood of a particular session s given this metric be

    P(s|M) = ∏_{r: doc_r ∈ C_s} P(C_r|M) · ∏_{r: doc_r ∈ NC_s} (1 − P(C_r|M)),

where C_s and NC_s correspond to the documents clicked and not clicked in session s, respectively, and doc_r refers to the document at rank r in session s. The session log likelihood can then be written as:

    log P(sessions|M) = log ∏_{s ∈ sessions} P(s|M) = ∑_{s ∈ sessions} log P(s|M)
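A minimal sketch of this computation, assuming sessions is the list of Boolean click vectors used earlier and p_click gives P(C_r|M) per rank (e.g., the output of extended_click_prob); we also assume, since the paper does not state it, that Table 3 reports the average over sessions:

    import math

    def session_log_likelihood(sessions, p_click, eps=1e-12):
        # p_click[r-1] is P(C_r|M), the click probability that metric M's
        # discount function assigns to rank r
        total = 0.0
        for s in sessions:
            for clicked, p in zip(s, p_click):
                p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
                total += math.log(p if clicked else 1.0 - p)
        # Average per session (assumed to match the scale of Table 3)
        return total / len(sessions)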
Table 3: Likelihood of individual sessions given each evaluation metric.
    Metric          Session Log Likelihood    P(click per session)
    RBP, p = 0.2        −2.3859                   0.0920
    RBP, p = 0.3        −2.1510                   0.1164
    RBP, p = 0.4        −2.0570                   0.1278
    RBP, p = 0.5        −2.0732                   0.1258
    RBP, p = 0.6        −2.2007                   0.1107
    NDCG, log           −2.3064                   0.0996
    NDCG, 1/r           −2.0435                   0.1296
    EBU                 −1.9371                   0.1441

The first column in Table 3 shows the session log likelihood for each metric. For comparison purposes, the second column shows the average probability of observing the sessions in the test data. It can be seen that EBU predicts the behavior of an individual user (i.e., per session) much better than all the other metrics.

6. CONCLUSIONS
Most evaluation metrics in information retrieval aim at evaluating the satisfaction of the user given a ranked list of documents. Hence, these metrics are based on underlying user models which are assumed to model the way users search. However, most of these user models rest on unrealistic assumptions.

In this paper, we showed how click logs can be used to devise enhanced evaluation measures. We first extended two commonly used evaluation metrics, NDCG and RBP, to incorporate the probability of click in their discount curves. We then introduced EBU, a new evaluation metric based on a more sophisticated user model than the other metrics. Finally, using a novel methodology for evaluating evaluation measures (referred to as session likelihood), we compared these metrics and showed that EBU is a better metric in terms of modeling user behavior.

7. REFERENCES
[1] G. E. Dupret and B. Piwowarski. A user browsing model to predict search engine click data from past observations. In SIGIR '08: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 331-338, Singapore, 2008. ACM.
[2] K. Järvelin and J. Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20(4):422-446, 2002.
[3] A. Moffat, W. Webber, and J. Zobel. Strategic system comparisons via targeted relevance judgments. In SIGIR '07: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 375-382, Amsterdam, The Netherlands, 2007. ACM.
[4] A. Turpin, F. Scholer, K. Järvelin, M. Wu, and J. S. Culpepper. Including summaries in system evaluation. In SIGIR '09: Proceedings of the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Boston, MA, USA, 2009. ACM. To appear.