Did We Get It Right? Predicting Query Performance in E-commerce Search

Rohan Kumar, Flipkart, rohankumar@flipkart.com
Mohit Kumar, Flipkart, k.mohit@flipkart.com
Neil Shah*, Carnegie Mellon University, neilshah@cs.cmu.edu
Christos Faloutsos, Carnegie Mellon University, christos@cs.cmu.edu

ABSTRACT
In this paper, we address the problem of evaluating whether results served by an e-commerce search engine for a query are good or not. This is a critical question in evaluating any e-commerce search engine. While this question is traditionally answered using simple metrics like query click-through rate (CTR), we observe that in e-commerce search, such metrics can be misleading. Upon inspection, we find cases where CTR is high but the results are poor and vice versa. Similar cases exist for other metrics like time to click which are often also used for evaluating search engines. We aim to learn the quality of the results served by the search engine based on users' interactions with the results. Although this problem has been studied in the web search context, this is the first study for e-commerce search, to the best of our knowledge. Despite certain commonalities with evaluating web search engines, there are several major differences such as underlying reasons for search failure, and availability of rich user interaction data with products (e.g. adding a product to the cart). We study large-scale user interaction logs from Flipkart's[1] search engine, analyze behavioral patterns and build models to classify queries based on user behavior signals. We demonstrate the feasibility and efficacy of such models in accurately predicting query performance. Our classifier is able to achieve an average AUC of 0.75 on a held-out test set.

KEYWORDS
Information Retrieval, Evaluation, Query Performance, e-commerce, mobile search behavior, implicit feedback

ACM Reference format:
Rohan Kumar, Mohit Kumar, Neil Shah, and Christos Faloutsos. 2018. Did We Get It Right? Predicting Query Performance in E-commerce Search. In Proceedings of ACM SIGIR Workshop on eCommerce, Ann Arbor, Michigan, USA, July 2018 (SIGIR 2018 eCom), 7 pages. DOI: 10.1145/nnnnnnn.nnnnnnn

* Dr. Shah is now at Snap Inc.
[1] Flipkart is the largest e-commerce platform in India.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s). Copyright © 2018 by the paper's authors. Copying permitted for private and academic purposes. In: J. Degenhardt, G. Di Fabbrizio, S. Kallumadi, M. Kumar, Y.-C. Lin, A. Trotman, H. Zhao (eds.): Proceedings of the SIGIR 2018 eCom workshop, 12 July, 2018, Ann Arbor, Michigan, USA, published at http://ceur-ws.org

[Figure 1: Mobile app e-commerce results page for the query "sling bags women lavie", showing relevant products.]

1 INTRODUCTION
Search engines are a fundamental component of most modern Internet applications, and evaluating their performance on a query is not only needed for evaluating their overall performance, but is also critical in the iterative process of improving the algorithms that power them. This is important since bad performance of a search engine leads to customer attrition as described in White and Dumais [21]. Traditionally, the performance of a search engine on a query is measured using metrics derived from ordinal ratings of the search results given by human experts [4, 13, 23]. However, obtaining such manual judgments is prohibitive for the large document collections and high number of unique queries commonly encountered in most modern Internet applications. While one could solicit explicit feedback on the quality of search results from the users of a search engine, this may be detrimental to their experience of the application.

More recent work [8] has focused on automating the evaluation of search engine performance by using implicit feedback on the quality of search results derived from various user activity signals generated by the interactions between users and the results presented to them. Most of this work has been done for Internet search engines, while in this paper we focus on e-commerce search engines. The users of e-commerce applications tend to look for products and services, and thus the queries typically encountered by e-commerce search engines are fundamentally different from the informational and navigational queries typically encountered by Internet search engines.
The most popular user activity signal in the aforementioned work is clicks, and it is used to define the Click-Through Rate (CTR) metric. The CTR of a query is often used as a proxy for the performance of the search engine on that query, and this approximation is based on the assumption that clicks on search results are a reliable indicator of performance. However, Hassan et al. [10] point out that while clicks are a useful indicator of performance, they can nevertheless be quite noisy.

We validate this observation for e-commerce search by studying the distributions of the ordinal ratings of search results given by human experts to queries having a wide range of CTR values, randomly sampled from Flipkart search query-logs. We discretized the CTR values into 5 buckets with the bucket boundaries at the 20th, 40th, 60th, and 80th percentiles of the CTR values of our sampled queries. The distributions of search result ratings across these percentile-based CTR buckets are shown in Figure 2. The details of how queries are sampled from our query-logs and how the associated search results are rated by human experts are given in Section 3.
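For illustration, this analysis can be sketched in a few lines of pandas; the column names below (ctr, expert_rating) are hypothetical stand-ins for fields in our sampled query log rather than the actual schema.

import pandas as pd

# Hypothetical frame of sampled queries: one row per query, with its CTR and
# the expert rating on the 5-point PBAGE scale (1 = Poor ... 5 = Excellent).
queries = pd.read_csv("sampled_queries.csv")  # columns: query, ctr, expert_rating

# Discretize CTR into 5 buckets whose boundaries are the 20th/40th/60th/80th
# percentiles of the sampled queries' CTR values.
queries["ctr_bucket"] = pd.qcut(
    queries["ctr"],
    q=[0, 0.2, 0.4, 0.6, 0.8, 1.0],
    labels=["0-20%", "20-40%", "40-60%", "60-80%", "80-100%"],
)

# Distribution of expert ratings within each CTR bucket, normalized per bucket
# (rows sum to 1), i.e. the quantity visualized in Figure 2.
rating_dist = (
    queries.groupby("ctr_bucket")["expert_rating"]
    .value_counts(normalize=True)
    .unstack(fill_value=0)
)
print(rating_dist)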
[Figure 2: Distributions of search result ratings across percentile-based CTR buckets.]

From Figure 2 it is evident that while the fraction of queries whose results are rated as poor decreases as we go from the lowest CTR bucket to the highest CTR bucket, a significant fraction of queries whose results are rated as bad still exists even in the highest CTR bucket. Figure 1 shows an example of a search engine results page (SERP) that appears in Flipkart's mobile app for the query "sling bags women lavie". The query has good results even though it belongs to the 0-20% CTR bucket from Figure 2. This highlights the need for a richer set of user activity signals beyond click behavior. Guo et al. [8] made use of such signals, but their focus was on Internet search, where the set of user activity signals available is limited in comparison to e-commerce search, where we have additional signals available such as the time taken to click an add-to-cart or buy-now button. Using a richer set of such user activity signals, we build a classification model to predict whether the results for any query from our query-logs would be rated as bad or good by human experts, and thus automate the evaluation of our search engine performance. Such a system also serves as a first step towards building a system to predict user satisfaction at the level of individual user activity sessions, as studied in [6, 9, 18]. Our classifier is able to achieve an average AUC of 0.75 on a held-out test set. On certain product categories like Mobile Phones, we achieve an average AUC of 0.88 on the held-out test set.

Summarily, the primary contributions of our work are:
(1) We identify a rich set of user activity signals that help predict whether the results for any search query would be rated as bad or good by human experts.
(2) We demonstrate that it is possible to use user activity signals to automate the evaluation of search engine performance for e-commerce applications.
(3) We analyze the performance of our classifier and derive insights into the effectiveness of automated systems for evaluating search engine performance that are of particular interest to e-commerce applications.

2 RELATED WORK

2.1 Query Performance
Evaluating search engine performance has been well-studied in the domain of web search. Topical relevance based metrics like nDCG [13], expected reciprocal rank [4] and weighted information gain [23] require explicit human-labeled relevance judgments for query-document pairs, which are prohibitively costly to obtain at scale for real-world web-scale evaluation.

Several methods were proposed to automatically measure various characteristics of the documents retrieved for a query, which can then be used for measuring overall system performance. The Clarity score [5] evaluates query performance by measuring the relative entropy between the query language model and the corresponding collection language model. The Robustness score [22] exploits the fact that query-level ranking robustness is correlated with retrieval performance; it is measured as the expected value of Spearman's rho between the ranked lists from the original collection and a corrupted collection. Carmel et al. [1] find the Jensen-Shannon divergence between queries, relevant documents and the entire collection to be an indicator of query performance. However, [23] experimentally show the ineffectiveness of these metrics in measuring search performance on web-scale engines.

User click behavior has been used as an alternative to expert judgments for automatically tuning retrieval algorithms (predicting document relevance) as well as for estimating IR evaluation metrics [3, 7, 8, 15]. Kim et al. [16] show that naively analysing user clicks alone may not indicate satisfaction, and that dwell time per click is a better indicator of query-level satisfaction. Guo et al. [8] also make use of interaction features and engine switches as signals to predict DCG@3.

2.2 Search Session Performance
There has been considerable work in the area of analyzing user satisfaction at a session level rather than at an individual query level.
Fox et al. [6] conducted one of the first studies that found an association between explicit ratings and implicit measures of user interest, concluding that user satisfaction can be predicted using such implicit signals. Hassan et al. [9] show empirically that user behavior alone can give an accurate picture of the success of the user's web search goals, without considering the relevance of the documents displayed. There have been studies focusing on graded satisfaction [14] as well as on specific user behaviors like query reformulation [10, 18] and interaction sequences [17] for understanding satisfaction.

2.3 E-commerce Search Performance
Most studies have been geared towards web search, where user search goals are different from those in product/e-commerce search. However, there has been some recent work in the context of product search. Singh et al. [19] study user behavior in the e-commerce search context in the specific scenario where the search engine does not retrieve any results. [20] is a recent study that addresses the user's session satisfaction in product search. They approach the problem by first identifying a taxonomy of user intents while interacting with product search, and then analyzing the user's behavior in the context of the defined taxonomy. They predict user session satisfaction by utilizing the interaction behavior, building separate models for different intents and demonstrating that user behavior differs under different intents. Our work, while building upon the learnings from these studies, differs in that we are interested in measuring only the aggregate query performance instead of the more user-centric task of session satisfaction. Su et al. [20] mention the example where the results expected by two different users for the same query iphone may differ, and thus the users may be individually dissatisfied even though the results shown are "relevant." We aim to address the simpler, albeit more business-critical, problem of understanding a query's result relevance in a user-agnostic fashion.
The underlying reasons for a search engine's poor performance on a query include factors like incorrect spell error handling, vocabulary gap [2], selection gap (when the e-commerce platform does not sell a particular item, e.g. chocolate when packaged food items are not sold), and more. Thus, understanding and measuring user-agnostic query performance can help improve the core relevance algorithm of the search engine.

3 QUERY PERFORMANCE JUDGEMENTS
At Flipkart, regular search quality analysis is done for a random sample of queries (stratified on query volume segment) from search logs by a team of quality experts. They are requested to rate queries on a five-point scale (PBAGE: Poor-1, Bad-2, Average-3, Good-4, Excellent-5) based on result relevance. To ensure the consistency of labeling across experts, inter-rater agreement is continuously monitored. In this work, we make use of the expert editorial judgments for the month of January 2018.

We selected 18,613 queries from this randomized set of expert-labeled queries which occurred more than 100 times in a week, in order to ensure reasonable user activity data. This set of queries corresponded to 127M query impressions, 149M clicks and 14M other interactions (e.g. filter applications, sort applications) from activity by 21M users collectively spending almost 4M hours on the platform. The data is collected from Flipkart's mobile app, significantly reducing the chances of bot traffic. All user behavior data is captured for the same week in which the query was labeled by an expert. We assume the search system, and hence user activity, remains constant throughout the week, as there are no manual or algorithmic fixes applied during the week.

4 SIGNALS OF USER BEHAVIOR
Table 1 lists the metrics, along with their descriptions, that we extracted for every query instance. We characterize the user behavior metrics as Activity time, Positional and Activity aggregates. We characterize the non-user metrics as Query text characteristics and Meta aspects.

Table 1: Features used to distinguish between good and poorly performing search queries

Activity time
  timeToFirstClick    Time taken to click the first product
  timeToFirstCart     Time taken to add a product to the cart
  queryDuration       Total dwell time of the query
Positional
  posFirstClick       Position of the first product clicked
Activity aggregates
  numClicks           Number of clicks
  numSwipes           Number of swipes
  numCarts            Number of cart adds
  numFilters          Number of times a filter was applied
  numSorts            Number of times the user changed sorting
  numImpressions      Number of product impressions in the viewport
  clickSuccess        Any product clicked for the query
  cartSuccess         Any product added to the cart for the query
Query text characteristics
  charQueryLen        Length of the query in characters
  wordQueryLen        Length of the query in words
  LMScore             Query language model perplexity score
  querySim            Similarity to the next query
  containsSP          Query contains specifiers
  containsMT          Query contains modifiers
  containsRS          Query contains range specifiers
  containsUnits       Query contains units like liters
Meta aspects
  queryCat            Category (mobile phones, books etc.) of the query based on taxonomy
  queryType           Type of the query (specific product, broad category etc.)
  queryCount          Frequency of the query
  isAutoSuggestUsed   Auto-completed query or not
  isGoodNetwork       Network type is WiFi or 4G
  numProductsFound    Number of products matching the query
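To make the feature definitions in Table 1 concrete, the sketch below shows how a few of the behavioral signals could be rolled up per query from an impression-level log; the schema and column names are hypothetical stand-ins, and the production extraction pipeline is not reproduced here.

import pandas as pd

# Illustrative only: one row per query impression, with nullable interaction
# columns; this is an assumed schema, not Flipkart's actual log format.
events = pd.read_csv("query_impressions.csv")
# columns: query, num_clicks, num_cart_adds, num_swipes,
#          seconds_to_first_click (NaN if no click), first_click_position (NaN if no click)

per_query = events.groupby("query").agg(
    numClicks=("num_clicks", "sum"),
    numCarts=("num_cart_adds", "sum"),
    numSwipes=("num_swipes", "sum"),
    # Fraction of impressions with at least one click / cart add.
    clickSuccess=("num_clicks", lambda s: (s > 0).mean()),
    cartSuccess=("num_cart_adds", lambda s: (s > 0).mean()),
    # Time and position signals are averaged only over impressions where they
    # exist; NaNs (impressions with no click) are skipped by pandas' mean.
    timeToFirstClick=("seconds_to_first_click", "mean"),
    posFirstClick=("first_click_position", "mean"),
)
print(per_query.head())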
Activity time features capture the time taken by the user for various activities. timeToFirstClick is the time taken by the user to click a product after the results are displayed. timeToFirstCart is similar to timeToFirstClick, except that it captures the time taken to add a product to the cart. queryDuration is the total time spent interacting with the query results, including all interactions with product pages, the cart etc. Figures 3a-3c show the distribution of Activity time features with respect to query performance. Interestingly, we observe that the time taken for the first click increases with query performance. This is counter-intuitive in that when the query performance is good, it still takes users longer to click. This is however potentially explained by Figure 4a, which shows the distribution of the number of clicks against query performance. We observe that when the query performance is low the total number of clicks is lower, and it increases with query performance. Intuitively, users usually don't click any products when the query performance is poor, but when they do click products for a poorly performing query, they do so faster. A similar pattern is observed for the add-to-cart behavior in Figures 3b and 4c.

Positional features correspond to the position of result interaction. posFirstClick captures which position the user clicked first. A lower position value indicates that the results were shown near the top of the page. We observe that the average position of the first result click increases with improving query performance. This is correlated with the previous observation: the time to first click of poorly performing queries is lower, and correspondingly the user is clicking results at lower positions (faster). The total number of clicks is low when query performance is low. Similar to the Activity time features, users usually don't click products when the query performance is poor, but when they do click products for a poorly performing query, they do so at lower positions.

[Figure 3: Normalized distributions of Activity time and Positional feature values with respect to performance score. Panels: (a) time to first click, (b) time to first cart, (c) query duration, (d) first click position.]

Activity aggregates features capture the aggregated summary of the user's actions for a query. We observe that all the activity aggregates are positively correlated with query performance, i.e. increasing user activity indicates better query performance. The number of product clicks (numClicks: Figure 4a), product swipes (numSwipes: Figure 4b), cart additions (numCarts: Figure 4c), filters applied (numFilters: Figure 4d), sorts applied (numSorts: Figure 4e), product impressions per query (numImpressions: Figure 4f), query successful click-through rate (clickSuccess: Figure 4g) and query successful cart conversion rate (cartSuccess: Figure 4h) are all positively correlated with query performance.

[Figure 4: Normalized distributions of Activity aggregates with respect to performance score. Panels: (a) number of clicks, (b) number of swipes, (c) number of cart adds, (d) number of filter applications, (e) number of sort applications, (f) number of product card impressions, (g) click-through rate, (h) conversion rate.]

Query text characteristics features capture the textual properties of the query. charQueryLen and wordQueryLen are the length of the query in characters and words respectively. LMScore is the perplexity score of the query based on a language model [12] trained on the query logs. querySim is the text similarity between the current query and the following query, defined by the measure described in Hassan et al. [11]. We also make use of certain domain-dependent text features indicating whether the query contains specifiers (e.g. "greater than"), modifier phrases (e.g. "least expensive"), range specifiers (e.g. "between") or units (e.g. "liters", "gb"). The intuition here is that search engines may face difficulty in product retrieval when queries contain such phrases, which require semantic understanding.
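As a minimal illustration of these query text characteristics, the sketch below computes the length features and the contains* flags for a query. The phrase lists are small samples chosen for the example rather than our production lexicons, and the LMScore and querySim components are omitted.

import re

# Illustrative phrase lists (assumptions for this sketch, not the real lexicons).
SPECIFIERS = ("greater than", "less than", "under", "above")
MODIFIERS = ("least expensive", "cheapest", "best", "latest")
RANGE_WORDS = ("between",)
UNITS = ("liters", "litres", "gb", "kg", "inch")

def query_text_features(query):
    q = query.lower()
    return {
        "charQueryLen": len(q),
        "wordQueryLen": len(q.split()),
        "containsSP": any(p in q for p in SPECIFIERS),
        "containsMT": any(p in q for p in MODIFIERS),
        "containsRS": any(p in q for p in RANGE_WORDS),
        "containsUnits": any(re.search(rf"\b{u}\b", q) for u in UNITS),
    }

print(query_text_features("washing machine between 6 kg and 8 kg"))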
Meta aspect features include additional information about the query. queryCat indicates the e-commerce product category. These are broad lines of business, namely Mobile Phones, Books, Electronics, Lifestyle, and Home and Furniture. Each query is assumed to belong to one of these categories. The intuition for using this feature is that query performance and user behavior might depend on the specific category. queryType indicates the type of query, which is classified into three kinds, namely "Product", "FacetCategory" and "Category". Queries in which the exact product that the user is looking for is mentioned are called "Product" queries (e.g. iPhone X). Queries which refer to a broad group of products are called "Category" queries (e.g. shoes). "FacetCategory" queries typically contain one or more attributes followed by a category (e.g. red Nike shoes). For both queryCat and queryType, we make use of modules which are able to assign appropriate values for a given query (details of these modules are outside the scope of this paper). queryCount is the total number of times the query was issued by users in the past week. isAutoSuggestUsed indicates whether the user selected the query from the suggested queries (auto-suggest); the intuition is that queries suggested by the search engine typically perform better than queries typed by the user. isGoodNetwork indicates whether the user had a good Internet connection (defined as WiFi or LTE) while issuing the query. This is important, as the user experience and behavior might be altered if he/she doesn't have a good Internet connection, leading to a bad experience independent of the search engine's performance. numProductsFound indicates the total number of products found in the search index for the query. The intuition here is that the number of products found, in conjunction with the type of the query, may indicate whether the search engine is not able to retrieve relevant results.

5 EXPERIMENTS

5.1 Experimental setup
In this work, we formulate the problem of query performance prediction as a binary classification task, as is done in [20]. As described in Section 3, we obtained expert judgments for 18,613 queries on a 5-point scale. Similar to [20], we label "poor," "bad" and "average" queries as DSAT and "good" and "excellent" queries as SAT. This results in 6,949 DSAT and 11,664 SAT queries. We treat DSAT as the positive class (classifier target), as future interventions based on the model's predictions will be for this class.

We aggregate the metrics described in the previous section across all instances of the query in the week to obtain the aggregate user behavior corresponding to the query. For metrics which may not have values for all query instances (e.g. timeToFirstClick), we include only the instances for which values are present in the aggregate calculation. These aggregate metrics are used as features for the classification model. We experiment with various descriptive statistics for the features, namely average, median, standard deviation and inter-quartile range.[5] We bin each numeric feature into 10 percentile buckets and convert them to one-hot encoded features. We also defined certain interaction features such as clickSuccess × queryCount.

[5] In all the figures above, we show qualitative analysis of the features with only the "averaged" metric, which sufficiently indicates the patterns.

We split the labeled data into an 80% training and a 20% test set. During training, we performed feature selection using recursive feature elimination along with model hyper-parameter tuning. The hyper-parameter tuning is done using five-fold cross validation with class stratification and optimized for the area under the ROC curve (AUC).
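A minimal scikit-learn sketch of this setup is given below. The feature matrix and labels are assumed to come from the per-query aggregates described above (random data stands in for them here), and the binning, feature-elimination and tuning settings are illustrative rather than the exact configuration used in our experiments.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.preprocessing import KBinsDiscretizer

# X: per-query aggregate features (rows = queries), y: 1 for DSAT, 0 for SAT.
# Random placeholders below; in practice these come from the aggregation step.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 20))
y = rng.integers(0, 2, size=2000)

# Bin each numeric feature into 10 percentile buckets and one-hot encode the bins.
binner = KBinsDiscretizer(n_bins=10, encode="onehot-dense", strategy="quantile")
X_binned = binner.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(
    X_binned, y, test_size=0.2, stratify=y, random_state=42
)

# Recursive feature elimination around a random forest, with class-stratified
# five-fold cross validation, optimized for area under the ROC curve.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
selector = RFECV(
    estimator=RandomForestClassifier(n_estimators=100, random_state=42),
    step=10, cv=cv, scoring="roc_auc",
)
X_train_sel = selector.fit_transform(X_train, y_train)

# Hyper-parameter tuning on the selected features (illustrative grid).
search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10, 20]},
    cv=cv, scoring="roc_auc",
)
search.fit(X_train_sel, y_train)

# Held-out evaluation, mirroring the 80/20 split described in the text.
test_scores = search.predict_proba(selector.transform(X_test))[:, 1]
print("held-out AUC:", roc_auc_score(y_test, test_scores))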
5.2 Results
We analyze the results of our model along the following aspects: performance of the learnt classifier, feature importance, performance across e-commerce categories, performance across query types and performance across query volume. We use AUC to evaluate the prediction performance.

5.2.1 Performance of classifier. We train a binary random forest model based on the methodology described earlier in Section 5.1. Figure 5 shows the ROC curve and Figure 6 shows the precision-recall curve. The overall test AUC obtained is 0.75. We observe that the classifier is able to achieve reasonably good performance, thus establishing that it is feasible to predict query performance based on user interaction signals.

[Figure 5: Receiver operating characteristic curve for binary classification of query performance.]
[Figure 6: Precision-recall curve for binary classification of query performance.]

One application of this predictive model is to enable automated interventions for unsatisfactory queries, i.e. when the classifier is confident that the results are poor, we can enable certain interventions like triggering an interactive intent solicitation module. Towards that end, we need a reasonably high-precision operating point. Based on discussions with the business/product team, the operating point that can be used is 85% precision, where we are able to achieve 20% recall with the current model.

5.2.2 Feature importance. Given below is the list of the top-10 most important features based on the Gini index:
(1) numSwipes
(2) clickSuccess
(3) queryType
(4) wordQueryLen
(5) numProductsFound
(6) cartSuccess
(7) numFilters
(8) numClicks
(9) numSorts
(10) queryCount

We observe a mix of features from various groups among the top features. It is interesting to see the number of page-to-page swipes as a very indicative feature of query performance. We conjecture that users tend to click and swipe more in exploratory searches when they are satisfied with the initial results and want to continue exploring the same result set without reformulation. As expected, clickSuccess, cartSuccess and numClicks are indicative of query performance. queryType in conjunction with numProductsFound is a good indicator, since we expect a small number of products for "Product" queries and a larger number of products for "Category" queries. Interestingly, numFilters and numSorts, which indicate further refinement of results, are also indicative of query performance; based on Figures 4d and 4e we observe a positive correlation with query performance. One surprising observation is that none of the Activity time features are amongst the top 10 features; even though they are indicative, they are less indicative than other structured features like the filters and sorts applied.
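Such a Gini-based ranking can be read directly off a fitted random forest through its impurity-based importances; the small sketch below assumes a fitted model and the list of feature names from the setup in Section 5.1.

import pandas as pd

def top_features(model, feature_names, k=10):
    # `model` is assumed to be a fitted sklearn RandomForestClassifier;
    # feature_importances_ holds the impurity-based (Gini) importances.
    importances = pd.Series(model.feature_importances_, index=feature_names)
    return importances.sort_values(ascending=False).head(k)

# Example usage (names hypothetical): top_features(search.best_estimator_, feature_names)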
5.2.3 Performance across categories. Table 2 shows the performance of the model across the e-commerce categories (described in Section 4). We observe that the model is able to predict query performance in the "Mobile" category considerably better than in all other categories. We conjecture this is due to the model's performance across query types (detailed below in Section 5.2.4). The "Mobile" category has 7x more "Product" queries compared to the "Lifestyle" category. Additionally, the "Mobile" category has 3x fewer "Facet Category" queries. The model is able to perform much better for the "Mobile" category because the underlying query type distribution is biased towards "Product" queries. This is fairly important from a business perspective, as the "Mobile" category contributes a significant portion of overall sales.

Table 2: Prediction performance for different product categories
Product Category     AUC
Books                0.70
Electronics          0.74
Home And Furniture   0.72
Lifestyle            0.70
Mobile Phones        0.90

5.2.4 Performance across query types. Table 3 shows the results across query types. There are three query types, namely "Product," "Facet Category" and "Category," as discussed in Section 4. We observe that performance on "Product" queries, where the user's intent and language is very specific, is significantly better than on the other query types. We conjecture that indicators like numProductsFound and numClicks are particularly indicative of query performance for "Product" queries.

Table 3: Prediction performance for different query types
Query Type       AUC
Category         0.74
Facet Category   0.72
Product          0.87

5.2.5 Performance across query volume segments. Queries are categorized into three segments based on weekly volume: Head, TorsoHigh and TorsoBottom. Table 4 shows that classifier performance improves as the volume increases. The average queryCount for queries belonging to the Head segment is about 34x that of queries belonging to the TorsoBottom segment. Despite the huge difference in the amount of data available per query, the classifier is able to predict performance for queries in all three segments reasonably well.

Table 4: Prediction performance for different query volume segments
Volume Segment   AUC
Head             0.76
TorsoHigh        0.75
TorsoBottom      0.72

6 CONCLUSION AND FUTURE WORK
Measuring search engine performance is essential to building and improving retrieval algorithms. Query performance evaluation allows for a fine-grained measure of search performance. CTR can be a noisy metric in that high-CTR queries may still have poor performance, and vice versa. A more sophisticated analysis of search behavior is needed to distinguish poorly and well performing queries. In this work, we successfully demonstrate that query performance can be predicted based on users' interactions with the result set. This is the first study to our knowledge that has collectively defined these signals in the context of query performance prediction for e-commerce search. Specifically, we propose and use several user interaction signals that help characterize query performance, and these signals enabled us to achieve good classification performance. Notably, our model achieved an overall AUC of 0.75 on the binary SAT/DSAT prediction task. We have analyzed the results across various factors like the category of the query, query type and query volume.
Key takeaways from the performance analysis are: (a) we achieve a significantly higher AUC of 0.90 on certain categories like "Mobile," making the result very promising from a business impact perspective; (b) classifier performance varies across query types ("Product," "Facet Category" and "Category") and is best for "Product" queries; and (c) classifier performance improves with engagement volume, and is better for Head queries than for TorsoBottom queries.

Future Work. The study can be extended to have a finer prediction target of issue type, like spell error, vocabulary gap, selection gap etc., which would make the classifier's predictions more easily actionable by giving finer details on the query. Even richer signals of user activity can be used for prediction. For example, the notion of good dwell time (healthy engagement such as reading or voting on reviews) and bad dwell time (unhealthy engagement such as changing the seller) might be used. Reducing the number of observations required (currently set to 100) for robustly predicting query performance would be another avenue of future work. This would allow the classifier to scale to an even larger number of queries which do not have many instances in a fixed time period.

7 ACKNOWLEDGEMENTS
We thank Mr. Priyank Patel and Mr. Subhadeep Maji for their helpful comments.

REFERENCES
[1] David Carmel, Elad Yom-Tov, Adam Darlow, and Dan Pelleg. 2006. What makes a query difficult?. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 390–397.
[2] Claudio Carpineto and Giovanni Romano. 2012. A survey of automatic query expansion in information retrieval. ACM Computing Surveys (CSUR) 44, 1 (2012), 1.
[3] Ben Carterette and Rosie Jones. 2008. Evaluating search engines by modeling the relationship between relevance and clicks. In Advances in Neural Information Processing Systems. 217–224.
[4] Olivier Chapelle, Donald Metlzer, Ya Zhang, and Pierre Grinspan. 2009. Expected reciprocal rank for graded relevance. In Proceedings of the 18th ACM conference on Information and knowledge management. ACM, 621–630.
[5] Steve Cronen-Townsend, Yun Zhou, and W Bruce Croft. 2002. Predicting query performance. In Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 299–306.
[6] Steve Fox, Kuldeep Karnawat, Mark Mydland, Susan Dumais, and Thomas White. 2005. Evaluating implicit measures to improve web search. ACM Transactions on Information Systems (TOIS) 23, 2 (2005), 147–168.
[7] Fan Guo and Chao Liu. 2009. Statistical Models for Web Search Click Log Analysis. In Tutorial at the 19th ACM International Conference on Information & Knowledge Management (CIKM '09). ACM.
[8] Qi Guo, Ryen W White, Susan T Dumais, Jue Wang, and Blake Anderson. 2010. Predicting query performance using query, result, and user interaction features. In Adaptivity, Personalization and Fusion of Heterogeneous Information. Le Centre de Hautes Etudes Internationales d'Informatique Documentaire, 198–201.
[9] Ahmed Hassan, Rosie Jones, and Kristina Lisa Klinkner. 2010. Beyond DCG: User Behavior As a Predictor of a Successful Search. In Proceedings of the Third ACM International Conference on Web Search and Data Mining (WSDM '10). ACM, New York, NY, USA, 221–230. https://doi.org/10.1145/1718487.1718515
[10] Ahmed Hassan, Xiaolin Shi, Nick Craswell, and Bill Ramsey. 2013. Beyond Clicks: Query Reformulation As a Predictor of Search Satisfaction. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management (CIKM '13). ACM, New York, NY, USA, 2019–2028. https://doi.org/10.1145/2505515.2505682
[11] Ahmed Hassan, Ryen W White, Susan T Dumais, and Yi-Min Wang. 2014. Struggling or exploring?: disambiguating long search sessions. In Proceedings of the 7th ACM international conference on Web search and data mining. ACM, 53–62.
[12] Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation. Association for Computational Linguistics, 187–197.
[13] Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS) 20, 4 (2002), 422–446.
[14] Jiepu Jiang, Ahmed Hassan Awadallah, Xiaolin Shi, and Ryen W. White. 2015. Understanding and Predicting Graded Search Satisfaction. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining (WSDM '15). ACM, New York, NY, USA, 57–66. https://doi.org/10.1145/2684822.2685319
[15] Thorsten Joachims, Laura Granka, Bing Pan, Helene Hembrooke, and Geri Gay. 2005. Accurately interpreting clickthrough data as implicit feedback. In Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 154–161.
[16] Youngho Kim, Ahmed Hassan, Ryen W. White, and Imed Zitouni. 2014. Modeling Dwell Time to Predict Click-level Satisfaction. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining (WSDM '14). ACM, New York, NY, USA, 193–202. https://doi.org/10.1145/2556195.2556220
[17] Rishabh Mehrotra, Imed Zitouni, Ahmed Hassan Awadallah, Ahmed El Kholy, and Madian Khabsa. 2017. User Interaction Sequences for Search Satisfaction Prediction. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '17). ACM, New York, NY, USA, 165–174. https://doi.org/10.1145/3077136.3080833
[18] Daan Odijk, Ryen W. White, Ahmed Hassan Awadallah, and Susan T. Dumais. 2015. Struggling and Success in Web Search. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management (CIKM '15). ACM, New York, NY, USA, 1551–1560. https://doi.org/10.1145/2806416.2806488
[19] Gyanit Singh, Nish Parikh, and Neel Sundaresan. 2011. User Behavior in Zero-recall Ecommerce Queries. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '11). ACM, New York, NY, USA, 75–84. https://doi.org/10.1145/2009916.2009930
[20] Ning Su, Jiyin He, Yiqun Liu, Min Zhang, and Shaoping Ma. 2018. User Intent, Behaviour, and Perceived Satisfaction in Product Search. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining (WSDM '18). ACM, New York, NY, USA, 547–555. https://doi.org/10.1145/3159652.3159714
[21] Ryen W White and Susan T Dumais. 2009. Characterizing and predicting search engine switching behavior. In Proceedings of the 18th ACM conference on Information and knowledge management. ACM, 87–96.
[22] Yun Zhou and W Bruce Croft. 2006. Ranking robustness: a novel framework to predict query performance. In Proceedings of the 15th ACM international conference on Information and knowledge management. ACM, 567–574.
[23] Yun Zhou and W Bruce Croft. 2007. Query performance prediction in web search environments. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 543–550.