Did We Get It Right? Predicting Query Performance in E-commerce Search

Rohan Kumar, Flipkart, rohankumar@flipkart.com
Mohit Kumar, Flipkart, k.mohit@flipkart.com
Neil Shah*, Carnegie Mellon University, neilshah@cs.cmu.edu
Christos Faloutsos, Carnegie Mellon University, christos@cs.cmu.edu

ABSTRACT
In this paper, we address the problem of evaluating whether results served by an e-commerce search engine for a query are good or not. This is a critical question in evaluating any e-commerce search engine. While this question is traditionally answered using simple metrics like query click-through rate (CTR), we observe that in e-commerce search, such metrics can be misleading. Upon inspection, we find cases where CTR is high but the results are poor and vice versa. Similar cases exist for other metrics like time to click which are often also used for evaluating search engines. We aim to learn the quality of the results served by the search engine based on users' interactions with the results. Although this problem has been studied in the web search context, this is the first study for e-commerce search, to the best of our knowledge. Despite certain commonalities with evaluating web search engines, there are several major differences such as underlying reasons for search failure, and availability of rich user interaction data with products (e.g. adding a product to the cart). We study large-scale user interaction logs from Flipkart's[1] search engine, analyze behavioral patterns and build models to classify queries based on user behavior signals. We demonstrate the feasibility and efficacy of such models in accurately predicting query performance. Our classifier is able to achieve an average AUC of 0.75 on a held-out test set.

KEYWORDS
Information Retrieval, Evaluation, Query Performance, e-commerce, mobile search behavior, implicit feedback

ACM Reference format:
Rohan Kumar, Mohit Kumar, Neil Shah, and Christos Faloutsos. 2018. Did We Get It Right? Predicting Query Performance in E-commerce Search. In Proceedings of ACM SIGIR Workshop on eCommerce, Ann Arbor, Michigan, USA, July 2018 (SIGIR 2018 eCom), 7 pages. DOI: 10.1145/nnnnnnn.nnnnnnn

* Dr. Shah is now at Snap Inc.
[1] Flipkart is the largest e-commerce platform in India.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s). Copyright © 2018 by the paper's authors. Copying permitted for private and academic purposes. In: J. Degenhardt, G. Di Fabbrizio, S. Kallumadi, M. Kumar, Y.-C. Lin, A. Trotman, H. Zhao (eds.): Proceedings of the SIGIR 2018 eCom workshop, 12 July, 2018, Ann Arbor, Michigan, USA, published at http://ceur-ws.org

[Figure 1: Mobile app e-commerce results page for the query "sling bags women lavie", showing relevant products.]

1 INTRODUCTION
Search engines are a fundamental component of most modern Internet applications, and evaluating their performance on a query is not only needed for evaluating their overall performance, but is also critical in the iterative process of improving the algorithms that power them. This is important since bad performance of a search engine leads to customer attrition as described in White and Dumais [21]. Traditionally, the performance of a search engine on a query is measured using metrics derived from ordinal ratings of the search results given by human experts [4, 13, 23]. However, obtaining such manual judgments is prohibitive for the large document collections and high number of unique queries commonly encountered in most modern Internet applications. While one could solicit explicit feedback on the quality of search results from the users of a search engine, this may be detrimental to their experience of the application.

More recent work [8] has focused on automating the evaluation of search engine performance by using implicit feedback on the quality of search results derived from various user activity signals generated by the interactions between users and the results presented to them. Most of this work has been done for Internet search engines, while in this paper we focus on e-commerce search engines. The users of e-commerce applications tend to look for products and services, and thus the queries typically encountered by e-commerce search engines are fundamentally different from the informational and navigational queries typically encountered by Internet search engines.
The most popular user activity signal in the aforementioned work is clicks, and it is used to define the Click-Through Rate (CTR) metric. The CTR of a query is often used as a proxy for the performance of the search engine on that query, and this approximation is based on the assumption that clicks on search results are a reliable indicator of performance. However, Hassan et al. [10] point out that while clicks are a useful indicator of performance, they can nevertheless be quite noisy.

We validate this observation for e-commerce search by studying the distributions of the ordinal ratings of search results given by human experts to queries having a wide range of CTR values, randomly sampled from Flipkart search query-logs. We discretized the CTR values into 5 buckets with the bucket boundaries at the 20th, 40th, 60th, and 80th percentiles of the CTR values of our sampled queries. The distributions of search result ratings across these percentile-based CTR buckets are shown in Figure 2. The details of how queries are sampled from our query-logs and how the associated search results are rated by human experts are given in Section 3.
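For illustration, this analysis can be sketched in a few lines of pandas; the column names below (ctr, expert_rating) are hypothetical stand-ins for fields in our sampled query log rather than the actual schema.

import pandas as pd

# Hypothetical frame of sampled queries: one row per query, with its CTR and
# the expert rating on the 5-point PBAGE scale (1 = Poor ... 5 = Excellent).
queries = pd.read_csv("sampled_queries.csv")  # columns: query, ctr, expert_rating

# Discretize CTR into 5 buckets whose boundaries are the 20th/40th/60th/80th
# percentiles of the sampled queries' CTR values.
queries["ctr_bucket"] = pd.qcut(
    queries["ctr"],
    q=[0, 0.2, 0.4, 0.6, 0.8, 1.0],
    labels=["0-20%", "20-40%", "40-60%", "60-80%", "80-100%"],
)

# Distribution of expert ratings within each CTR bucket, normalized per bucket
# (rows sum to 1), i.e. the quantity visualized in Figure 2.
rating_dist = (
    queries.groupby("ctr_bucket")["expert_rating"]
    .value_counts(normalize=True)
    .unstack(fill_value=0)
)
print(rating_dist)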
[Figure 2: Distributions of search result ratings across percentile-based CTR buckets.]

From Figure 2 it is evident that while the fraction of queries whose results are rated as poor decreases as we go from the lowest CTR bucket to the highest CTR bucket, a significant fraction of queries whose results are rated as bad still exists even in the highest CTR bucket. Figure 1 shows an example of a search engine results page (SERP) that appears in Flipkart's mobile app for the query "sling bags women lavie". The query has good results even though it belongs to the 0-20% CTR bucket from Figure 2. This highlights the need for a richer set of user activity signals beyond click behavior. Guo et al. [8] made use of such signals, but their focus was on Internet search, where the set of user activity signals available is limited in comparison to e-commerce search, where we have additional signals available such as the time taken to click an add-to-cart or buy-now button. Using a richer set of such user activity signals, we build a classification model to predict whether the results for any query from our query-logs would be rated as bad or good by human experts, and thus automate the evaluation of our search engine performance. Such a system also serves as a first step towards building a system to predict user satisfaction at the level of individual user activity sessions, as studied in [6, 9, 18]. Our classifier is able to achieve an average AUC of 0.75 on a held-out test set. On certain product categories like Mobile Phones, we achieve an average AUC of 0.88 on the held-out test set.

Summarily, the primary contributions of our work are:
(1) We identify a rich set of user activity signals that help predict whether the results for any search query would be rated as bad or good by human experts.
(2) We demonstrate that it is possible to use user activity signals to automate the evaluation of search engine performance for e-commerce applications.
(3) We analyze the performance of our classifier and derive insights into the effectiveness of automated systems for evaluating search engine performance that are of particular interest to e-commerce applications.

2 RELATED WORK

2.1 Query Performance
Evaluating search engine performance has been well-studied in the domain of web search. Topical relevance based metrics like nDCG [13], expected reciprocal rank [4] and weighted information gain [23] require explicit human-labeled relevance judgments for query-document pairs, which are prohibitively costly to obtain at scale for real-world web-scale evaluation.

Several methods were proposed to automatically measure various characteristics of the documents retrieved for a query, which can then be used for measuring overall system performance. The Clarity score [5] evaluates query performance by measuring the relative entropy between the query language model and the corresponding collection language model. The Robustness score [22] exploits the fact that query-level ranking robustness is correlated with retrieval performance; it is measured as the expected value of Spearman's rho between the ranked lists from the original collection and a corrupted collection. Carmel et al. [1] find the Jensen-Shannon divergence between queries, relevant documents and the entire collection to be an indicator of query performance. However, [23] experimentally show the ineffectiveness of these metrics in measuring search performance on web-scale engines.

User click behavior has been used as an alternative to expert judgments for automatically tuning retrieval algorithms (predicting document relevance) as well as for estimating IR evaluation metrics [3, 7, 8, 15]. Kim et al. [16] show that naively analysing user clicks alone may not indicate satisfaction, and that dwell time per click is a better indicator of query-level satisfaction. Guo et al. [8] also make use of interaction features and engine switches as signals to predict DCG@3.

2.2 Search Session Performance
There has been considerable work in the area of analyzing user satisfaction at a session level rather than at an individual query level.
Fox et al. [6] conducted one of the first studies that found an association between explicit ratings and implicit measures of user interest, concluding that user satisfaction can be predicted using such implicit signals. Hassan et al. [9] show empirically that user behavior alone can give an accurate picture of the success of the user's web search goals, without considering the relevance of the documents displayed. There have been studies focusing on graded satisfaction [14] as well as on specific user behaviors like query reformulation [10, 18] and interaction sequences [17] for understanding satisfaction.

2.3 E-commerce Search Performance
Most studies have been geared towards web search, where user search goals are different from those in product/e-commerce search. However, there has been some recent work in the context of product search. Singh et al. [19] study user behavior in the e-commerce search context in the specific scenario where the search engine does not retrieve any results. [20] is a recent study that addresses the user's session satisfaction in product search. They approach the problem by first identifying a taxonomy of user intents while interacting with product search, and then analyzing the user's behavior in the context of the defined taxonomy. They predict user session satisfaction by utilizing the interaction behavior, building separate models for different intents and demonstrating that user behavior differs under different intents. Our work, while building upon the learnings from these studies, differs in that we are interested in measuring only the aggregate query performance instead of the more user-centric task of session satisfaction. Su et al. [20] mention the example where the results expected by two different users for the same query iphone may differ, and thus the users may be individually dissatisfied even though the results shown are "relevant." We aim to address the simpler, albeit more business-critical, problem of understanding a query's result relevance in a user-agnostic fashion.
The underlying reasons for a search engine's poor performance on a query include factors like incorrect spell error handling, vocabulary gap [2], selection gap (when the e-commerce platform does not sell a particular item, e.g. chocolate when packaged food items are not sold), and more. Thus, understanding and measuring user-agnostic query performance can help improve the core relevance algorithm of the search engine.

3 QUERY PERFORMANCE JUDGEMENTS
At Flipkart, regular search quality analysis is done for a random sample of queries (stratified on query volume segment) from search logs by a team of quality experts. They are requested to rate queries on a five-point scale (PBAGE: Poor-1, Bad-2, Average-3, Good-4, Excellent-5) based on result relevance. To ensure the consistency of labeling across experts, inter-rater agreement is continuously monitored. In this work, we make use of the expert editorial judgments for the month of January 2018.

We selected 18,613 queries from this randomized set of expert-labeled queries which occurred more than 100 times in a week, in order to ensure reasonable user activity data. This set of queries corresponded to 127M query impressions, 149M clicks and 14M other interactions (e.g. filter applications, sort applications) from activity by 21M users collectively spending almost 4M hours on the platform. The data is collected from Flipkart's mobile app, significantly reducing the chances of bot traffic. All user behavior data is captured for the same week in which the query was labeled by an expert. We assume the search system, and hence user activity, remains constant throughout the week, as there are no manual or algorithmic fixes applied during the week.

4 SIGNALS OF USER BEHAVIOR
Table 1 lists the metrics, along with their descriptions, that we extracted for every query instance. We characterize the user behavior metrics as Activity time, Positional and Activity aggregates. We characterize the non-user metrics as Query text characteristics and Meta aspects.

Table 1: Features used to distinguish between good and poorly performing search queries

Activity time
  timeToFirstClick    Time taken to click the first product
  timeToFirstCart     Time taken to add a product to the cart
  queryDuration       Total dwell time of the query
Positional
  posFirstClick       Position of the first product clicked
Activity aggregates
  numClicks           Number of clicks
  numSwipes           Number of swipes
  numCarts            Number of cart adds
  numFilters          Number of times a filter was applied
  numSorts            Number of times the user changed sorting
  numImpressions      Number of product impressions in the viewport
  clickSuccess        Any product clicked for the query
  cartSuccess         Any product added to the cart for the query
Query text characteristics
  charQueryLen        Length of the query in characters
  wordQueryLen        Length of the query in words
  LMScore             Query language model perplexity score
  querySim            Similarity to the next query
  containsSP          Query contains specifiers
  containsMT          Query contains modifiers
  containsRS          Query contains range specifiers
  containsUnits       Query contains units like liters
Meta aspects
  queryCat            Category (mobile phones, books etc.) of the query based on taxonomy
  queryType           Type of the query (specific product, broad category etc.)
  queryCount          Frequency of the query
  isAutoSuggestUsed   Auto-completed query or not
  isGoodNetwork       Network type is WiFi or 4G
  numProductsFound    Number of products matching the query
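To make the feature definitions in Table 1 concrete, the sketch below shows how a few of the behavioral signals could be rolled up per query from an impression-level log; the schema and column names are hypothetical stand-ins, and the production extraction pipeline is not reproduced here.

import pandas as pd

# Illustrative only: one row per query impression, with nullable interaction
# columns; this is an assumed schema, not Flipkart's actual log format.
events = pd.read_csv("query_impressions.csv")
# columns: query, num_clicks, num_cart_adds, num_swipes,
#          seconds_to_first_click (NaN if no click), first_click_position (NaN if no click)

per_query = events.groupby("query").agg(
    numClicks=("num_clicks", "sum"),
    numCarts=("num_cart_adds", "sum"),
    numSwipes=("num_swipes", "sum"),
    # Fraction of impressions with at least one click / cart add.
    clickSuccess=("num_clicks", lambda s: (s > 0).mean()),
    cartSuccess=("num_cart_adds", lambda s: (s > 0).mean()),
    # Time and position signals are averaged only over impressions where they
    # exist; NaNs (impressions with no click) are skipped by pandas' mean.
    timeToFirstClick=("seconds_to_first_click", "mean"),
    posFirstClick=("first_click_position", "mean"),
)
print(per_query.head())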
Activity time features capture the time taken by the user for various activities. timeToFirstClick is the time taken by the user to click a product after the results are displayed. timeToFirstCart is similar to timeToFirstClick, except that it captures the time taken to add a product to the cart. queryDuration is the total time spent interacting with the query results, including all interactions with product pages, the cart etc. Figures 3a-3c show the distribution of Activity time features with respect to query performance. Interestingly, we observe that the time taken for the first click increases with query performance. This is counter-intuitive in that when the query performance is good, it still takes users longer to click. This is however potentially explained by Figure 4a, which shows the distribution of the number of clicks against query performance. We observe that when the query performance is low the total number of clicks is lower, and it increases with query performance. Intuitively, users usually don't click any products when the query performance is poor, but when they do click products for a poorly performing query, they do so faster. A similar pattern is observed for the add-to-cart behavior in Figures 3b and 4c.

Positional features correspond to the position of result interaction. posFirstClick captures which position the user clicked first. A lower position value indicates that the results were shown near the top of the page. We observe that the average position of the first result click increases with improving query performance. This is correlated with the previous observation: the time to first click of poorly performing queries is lower, and correspondingly the user is clicking results at lower positions (faster). The total number of clicks is low when query performance is low. Similar to the Activity time features, users usually don't click products when the query performance is poor, but when they do click products for a poorly performing query, they do so at lower positions.

[Figure 3: Normalized distributions of Activity time and Positional feature values with respect to performance score. Panels: (a) time to first click, (b) time to first cart, (c) query duration, (d) first click position.]

Activity aggregates features capture the aggregated summary of the user's actions for a query. We observe that all the activity aggregates are positively correlated with query performance, i.e. increasing user activity indicates better query performance. The number of product clicks (numClicks: Figure 4a), product swipes (numSwipes: Figure 4b), cart additions (numCarts: Figure 4c), filters applied (numFilters: Figure 4d), sorts applied (numSorts: Figure 4e), product impressions per query (numImpressions: Figure 4f), query successful click-through rate (clickSuccess: Figure 4g) and query successful cart conversion rate (cartSuccess: Figure 4h) are all positively correlated with query performance.

[Figure 4: Normalized distributions of Activity aggregates with respect to performance score. Panels: (a) number of clicks, (b) number of swipes, (c) number of cart adds, (d) number of filter applications, (e) number of sort applications, (f) number of product card impressions, (g) click-through rate, (h) conversion rate.]

Query text characteristics features capture the textual properties of the query. charQueryLen and wordQueryLen are the length of the query in characters and words respectively. LMScore is the perplexity score of the query based on a language model [12] trained on the query logs. querySim is the text similarity between the current query and the following query, defined by the measure described in Hassan et al. [11]. We also make use of certain domain-dependent text features indicating whether the query contains specifiers (e.g. "greater than"), modifier phrases (e.g. "least expensive"), range specifiers (e.g. "between") or units (e.g. "liters", "gb"). The intuition here is that search engines may face difficulty in product retrieval when queries contain such phrases, which require semantic understanding.
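As a minimal illustration of these query text characteristics, the sketch below computes the length features and the contains* flags for a query. The phrase lists are small samples chosen for the example rather than our production lexicons, and the LMScore and querySim components are omitted.

import re

# Illustrative phrase lists (assumptions for this sketch, not the real lexicons).
SPECIFIERS = ("greater than", "less than", "under", "above")
MODIFIERS = ("least expensive", "cheapest", "best", "latest")
RANGE_WORDS = ("between",)
UNITS = ("liters", "litres", "gb", "kg", "inch")

def query_text_features(query):
    q = query.lower()
    return {
        "charQueryLen": len(q),
        "wordQueryLen": len(q.split()),
        "containsSP": any(p in q for p in SPECIFIERS),
        "containsMT": any(p in q for p in MODIFIERS),
        "containsRS": any(p in q for p in RANGE_WORDS),
        "containsUnits": any(re.search(rf"\b{u}\b", q) for u in UNITS),
    }

print(query_text_features("washing machine between 6 kg and 8 kg"))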
Meta aspect features include additional information about the query. queryCat indicates the e-commerce product category. These are broad lines of business, namely Mobile Phones, Books, Electronics, Lifestyle, and Home and Furniture. Each query is assumed to belong to one of these categories. The intuition for using this feature is that query performance and user behavior might depend on the specific category. queryType indicates the type of query, which is classified into three kinds, namely "Product", "FacetCategory" and "Category". Queries in which the exact product that the user is looking for is mentioned are called "Product" queries (e.g. iPhone X). Queries which refer to a broad group of products are called "Category" queries (e.g. shoes). "FacetCategory" queries typically contain one or more attributes followed by a category (e.g. red Nike shoes). For both queryCat and queryType, we make use of modules which are able to assign appropriate values for a given query (details of these modules are outside the scope of this paper). queryCount is the total number of times the query was issued by users in the past week. isAutoSuggestUsed indicates whether the user selected the query from the suggested queries (auto-suggest); the intuition is that queries suggested by the search engine typically perform better than queries typed by the user. isGoodNetwork indicates whether the user had a good Internet connection (defined as WiFi or LTE) while issuing the query. This is important, as the user experience and behavior might be altered if he/she doesn't have a good Internet connection, leading to a bad experience independent of the search engine's performance. numProductsFound indicates the total number of products found in the search index for the query. The intuition here is that the number of products found, in conjunction with the type of the query, may indicate whether the search engine is not able to retrieve relevant results.

5 EXPERIMENTS

5.1 Experimental setup
In this work, we formulate the problem of query performance prediction as a binary classification task, as is done in [20]. As described in Section 3, we obtained expert judgments for 18,613 queries on a 5-point scale. Similar to [20], we label "poor," "bad" and "average" queries as DSAT and "good" and "excellent" queries as SAT. This results in 6,949 DSAT and 11,664 SAT queries. We treat DSAT as the positive class (classifier target), as future interventions based on the model's predictions will be for this class.

We aggregate the metrics described in the previous section across all instances of the query in the week to obtain the aggregate user behavior corresponding to the query. For metrics which may not have values for all query instances (e.g. timeToFirstClick), we include only the instances for which values are present in the aggregate calculation. These aggregate metrics are used as features for the classification model. We experiment with various descriptive statistics for the features, namely average, median, standard deviation and inter-quartile range.[5] We bin each numeric feature into 10 percentile buckets and convert them to one-hot encoded features. We also defined certain interaction features such as clickSuccess × queryCount.

[5] In all the figures above, we show qualitative analysis of the features with only the "averaged" metric, which sufficiently indicates the patterns.

We split the labeled data into an 80% training and a 20% test set. During training, we performed feature selection using recursive feature elimination along with model hyper-parameter tuning. The hyper-parameter tuning is done using five-fold cross validation with class stratification and optimized for the area under the ROC curve (AUC).
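A minimal scikit-learn sketch of this setup is given below. The feature matrix and labels are assumed to come from the per-query aggregates described above (random data stands in for them here), and the binning, feature-elimination and tuning settings are illustrative rather than the exact configuration used in our experiments.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.preprocessing import KBinsDiscretizer

# X: per-query aggregate features (rows = queries), y: 1 for DSAT, 0 for SAT.
# Random placeholders below; in practice these come from the aggregation step.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 20))
y = rng.integers(0, 2, size=2000)

# Bin each numeric feature into 10 percentile buckets and one-hot encode the bins.
binner = KBinsDiscretizer(n_bins=10, encode="onehot-dense", strategy="quantile")
X_binned = binner.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(
    X_binned, y, test_size=0.2, stratify=y, random_state=42
)

# Recursive feature elimination around a random forest, with class-stratified
# five-fold cross validation, optimized for area under the ROC curve.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
selector = RFECV(
    estimator=RandomForestClassifier(n_estimators=100, random_state=42),
    step=10, cv=cv, scoring="roc_auc",
)
X_train_sel = selector.fit_transform(X_train, y_train)

# Hyper-parameter tuning on the selected features (illustrative grid).
search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10, 20]},
    cv=cv, scoring="roc_auc",
)
search.fit(X_train_sel, y_train)

# Held-out evaluation, mirroring the 80/20 split described in the text.
test_scores = search.predict_proba(selector.transform(X_test))[:, 1]
print("held-out AUC:", roc_auc_score(y_test, test_scores))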
5.2 Results
We analyze the results of our model along the following aspects: performance of the learnt classifier, feature importance, performance across e-commerce categories, performance across query types and performance across query volume. We use AUC to evaluate the prediction performance.

5.2.1 Performance of classifier. We train a binary random forest model based on the methodology described earlier in Section 5.1. Figure 5 shows the ROC curve and Figure 6 shows the precision-recall curve. The overall test AUC obtained is 0.75. We observe that the classifier is able to achieve reasonably good performance, thus establishing that it is feasible to predict query performance based on user interaction signals.

[Figure 5: Receiver operating characteristic curve for binary classification of query performance.]
[Figure 6: Precision-recall curve for binary classification of query performance.]

One application of this predictive model is to enable automated interventions for unsatisfactory queries, i.e. when the classifier is confident that the results are poor, we can enable certain interventions like triggering an interactive intent solicitation module. Towards that end, we need a reasonably high-precision operating point. Based on discussions with the business/product team, the operating point that can be used is 85% precision, where we are able to achieve 20% recall with the current model.

5.2.2 Feature importance. Given below is the list of the top-10 most important features based on the Gini index:
(1) numSwipes
(2) clickSuccess
(3) queryType
(4) wordQueryLen
(5) numProductsFound
(6) cartSuccess
(7) numFilters
(8) numClicks
(9) numSorts
(10) queryCount

We observe a mix of features from various groups among the top features. It is interesting to see the number of page-to-page swipes as a very indicative feature of query performance. We conjecture that users tend to click and swipe more in exploratory searches when they are satisfied with the initial results and want to continue exploring the same result set without reformulation. As expected, clickSuccess, cartSuccess and numClicks are indicative of query performance. queryType in conjunction with numProductsFound is a good indicator, since we expect a small number of products for "Product" queries and a larger number of products for "Category" queries. Interestingly, numFilters and numSorts, which indicate further refinement of results, are also indicative of query performance; based on Figures 4d and 4e we observe a positive correlation with query performance. One surprising observation is that none of the Activity time features are amongst the top 10 features; even though they are indicative, they are less indicative than other structured features like the filters and sorts applied.
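Such a Gini-based ranking can be read directly off a fitted random forest through its impurity-based importances; the small sketch below assumes a fitted model and the list of feature names from the setup in Section 5.1.

import pandas as pd

def top_features(model, feature_names, k=10):
    # `model` is assumed to be a fitted sklearn RandomForestClassifier;
    # feature_importances_ holds the impurity-based (Gini) importances.
    importances = pd.Series(model.feature_importances_, index=feature_names)
    return importances.sort_values(ascending=False).head(k)

# Example usage (names hypothetical): top_features(search.best_estimator_, feature_names)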
5.2.3 Performance across categories. Table 2 shows the performance of the model across the e-commerce categories (described in Section 4). We observe that the model is able to predict query performance in the "Mobile" category considerably better than in all other categories. We conjecture this is due to the model's performance across query types (detailed below in Section 5.2.4). The "Mobile" category has 7x more "Product" queries compared to the "Lifestyle" category. Additionally, the "Mobile" category has 3x fewer "Facet Category" queries. The model is able to perform much better for the "Mobile" category because the underlying query type distribution is biased towards "Product" queries. This is fairly important from a business perspective, as the "Mobile" category contributes a significant portion of overall sales.

Table 2: Prediction performance for different product categories
Product Category     AUC
Books                0.70
Electronics          0.74
Home And Furniture   0.72
Lifestyle            0.70
Mobile Phones        0.90

5.2.4 Performance across query types. Table 3 shows the results across query types. There are three query types, namely "Product," "Facet Category" and "Category," as discussed in Section 4. We observe that performance on "Product" queries, where the user's intent and language is very specific, is significantly better than on the other query types. We conjecture that indicators like numProductsFound and numClicks are particularly indicative of query performance for "Product" queries.

Table 3: Prediction performance for different query types
Query Type       AUC
Category         0.74
Facet Category   0.72
Product          0.87

5.2.5 Performance across query volume segments. Queries are categorized into three segments based on weekly volume: Head, TorsoHigh and TorsoBottom. Table 4 shows that classifier performance improves as the volume increases. The average queryCount for queries belonging to the Head segment is about 34x that of queries belonging to the TorsoBottom segment. Despite the huge difference in the amount of data available per query, the classifier is able to predict performance for queries in all three segments reasonably well.

Table 4: Prediction performance for different query volume segments
Volume Segment   AUC
Head             0.76
TorsoHigh        0.75
TorsoBottom      0.72

6 CONCLUSION AND FUTURE WORK
Measuring search engine performance is essential to building and improving retrieval algorithms. Query performance evaluation allows for a fine-grained measure of search performance. CTR can be a noisy metric in that high-CTR queries may still have poor performance, and vice versa. A more sophisticated analysis of search behavior is needed to distinguish poorly and well performing queries. In this work, we successfully demonstrate that query performance can be predicted based on users' interactions with the result set. This is the first study to our knowledge that has collectively defined these signals in the context of query performance prediction for e-commerce search. Specifically, we propose and use several user interaction signals that help characterize query performance, and these signals enabled us to achieve good classification performance. Notably, our model achieved an overall AUC of 0.75 on the binary SAT/DSAT prediction task. We have analyzed the results across various factors like the category of the query, query type and query volume.
Key takeaways from the performance analysis are: (a) we achieve a significantly higher AUC of 0.90 on certain categories like "Mobile," making the result very promising from a business impact perspective; (b) classifier performance varies across query types ("Product," "Facet Category" and "Category") and is best for "Product" queries; and (c) classifier performance improves with engagement volume, and is better for Head queries than for TorsoBottom queries.

Future Work. The study can be extended to have a finer prediction target of issue type, like spell error, vocabulary gap, selection gap etc., which would make the classifier's predictions more easily actionable by giving finer details on the query. Even richer signals of user activity can be used for prediction. For example, the notion of good dwell time (healthy engagement such as reading or voting on reviews) and bad dwell time (unhealthy engagement such as changing the seller) might be used. Reducing the number of observations required (currently set to 100) for robustly predicting query performance would be another avenue of future work. This would allow the classifier to scale to an even larger number of queries which do not have many instances in a fixed time period.

7 ACKNOWLEDGEMENTS
We thank Mr. Priyank Patel and Mr. Subhadeep Maji for their helpful comments.

REFERENCES
[1] David Carmel, Elad Yom-Tov, Adam Darlow, and Dan Pelleg. 2006. What makes a query difficult?. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 390–397.
[2] Claudio Carpineto and Giovanni Romano. 2012. A survey of automatic query expansion in information retrieval. ACM Computing Surveys (CSUR) 44, 1 (2012), 1.
[3] Ben Carterette and Rosie Jones. 2008. Evaluating search engines by modeling the relationship between relevance and clicks. In Advances in Neural Information Processing Systems. 217–224.
[4] Olivier Chapelle, Donald Metlzer, Ya Zhang, and Pierre Grinspan. 2009. Expected reciprocal rank for graded relevance. In Proceedings of the 18th ACM conference on Information and knowledge management. ACM, 621–630.
[5] Steve Cronen-Townsend, Yun Zhou, and W Bruce Croft. 2002. Predicting query performance. In Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 299–306.
[6] Steve Fox, Kuldeep Karnawat, Mark Mydland, Susan Dumais, and Thomas White. 2005. Evaluating implicit measures to improve web search. ACM Transactions on Information Systems (TOIS) 23, 2 (2005), 147–168.
[7] Fan Guo and Chao Liu. 2009. Statistical Models for Web Search Click Log Analysis. In Tutorial at the 19th ACM International Conference on Information & Knowledge Management (CIKM '09). ACM.
[8] Qi Guo, Ryen W White, Susan T Dumais, Jue Wang, and Blake Anderson. 2010. Predicting query performance using query, result, and user interaction features. In Adaptivity, Personalization and Fusion of Heterogeneous Information. Le Centre de Hautes Etudes Internationales d'Informatique Documentaire, 198–201.
[9] Ahmed Hassan, Rosie Jones, and Kristina Lisa Klinkner. 2010. Beyond DCG: User Behavior As a Predictor of a Successful Search. In Proceedings of the Third ACM International Conference on Web Search and Data Mining (WSDM '10). ACM, New York, NY, USA, 221–230. https://doi.org/10.1145/1718487.1718515
[10] Ahmed Hassan, Xiaolin Shi, Nick Craswell, and Bill Ramsey. 2013. Beyond Clicks: Query Reformulation As a Predictor of Search Satisfaction. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management (CIKM '13). ACM, New York, NY, USA, 2019–2028. https://doi.org/10.1145/2505515.2505682
[11] Ahmed Hassan, Ryen W White, Susan T Dumais, and Yi-Min Wang. 2014. Struggling or exploring?: disambiguating long search sessions. In Proceedings of the 7th ACM international conference on Web search and data mining. ACM, 53–62.
[12] Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation. Association for Computational Linguistics, 187–197.
[13] Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS) 20, 4 (2002), 422–446.
[14] Jiepu Jiang, Ahmed Hassan Awadallah, Xiaolin Shi, and Ryen W. White. 2015. Understanding and Predicting Graded Search Satisfaction. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining (WSDM '15). ACM, New York, NY, USA, 57–66. https://doi.org/10.1145/2684822.2685319
[15] Thorsten Joachims, Laura Granka, Bing Pan, Helene Hembrooke, and Geri Gay. 2005. Accurately interpreting clickthrough data as implicit feedback. In Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 154–161.
[16] Youngho Kim, Ahmed Hassan, Ryen W. White, and Imed Zitouni. 2014. Modeling Dwell Time to Predict Click-level Satisfaction. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining (WSDM '14). ACM, New York, NY, USA, 193–202. https://doi.org/10.1145/2556195.2556220
[17] Rishabh Mehrotra, Imed Zitouni, Ahmed Hassan Awadallah, Ahmed El Kholy, and Madian Khabsa. 2017. User Interaction Sequences for Search Satisfaction Prediction. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '17). ACM, New York, NY, USA, 165–174. https://doi.org/10.1145/3077136.3080833
[18] Daan Odijk, Ryen W. White, Ahmed Hassan Awadallah, and Susan T. Dumais. 2015. Struggling and Success in Web Search. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management (CIKM '15). ACM, New York, NY, USA, 1551–1560. https://doi.org/10.1145/2806416.2806488
[19] Gyanit Singh, Nish Parikh, and Neel Sundaresan. 2011. User Behavior in Zero-recall Ecommerce Queries. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '11). ACM, New York, NY, USA, 75–84. https://doi.org/10.1145/2009916.2009930
[20] Ning Su, Jiyin He, Yiqun Liu, Min Zhang, and Shaoping Ma. 2018. User Intent, Behaviour, and Perceived Satisfaction in Product Search. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining (WSDM '18). ACM, New York, NY, USA, 547–555. https://doi.org/10.1145/3159652.3159714
[21] Ryen W White and Susan T Dumais. 2009. Characterizing and predicting search engine switching behavior. In Proceedings of the 18th ACM conference on Information and knowledge management. ACM, 87–96.
[22] Yun Zhou and W Bruce Croft. 2006. Ranking robustness: a novel framework to predict query performance. In Proceedings of the 15th ACM international conference on Information and knowledge management. ACM, 567–574.
[23] Yun Zhou and W Bruce Croft. 2007. Query performance prediction in web search environments. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 543–550.