High Accuracy Recall Task

Andrew Trotman, University of Otago, Dunedin, New Zealand, andrew@cs.otago.ac.nz
Surya Kallumadi, Kansas State University, Kansas, USA, surya@ksu.edu
Jon Degenhardt, eBay inc., California, USA, jondegenhardt@gmail.com

ABSTRACT
We identify a new information retrieval task for eCommerce that we call the high accuracy recall task. That task is to identify as many relevant documents, and as few non-relevant documents, as possible, such that regardless of the rank ordering the precision remains high.
We demonstrate a need to investigate this problem, we propose metrics to measure the quality of the results, and we suggest how a document collection might be built and queries might be generated.

CCS CONCEPTS
• Information systems → Retrieval effectiveness;

KEYWORDS
eCommerce, Performance Metrics, Quantitative Analysis

ACM Reference Format:
Andrew Trotman, Surya Kallumadi, and Jon Degenhardt. 2018. High Accuracy Recall Task. In Proceedings of ACM SIGIR Workshop on eCommerce (SIGIR 2018 eCom). ACM, New York, NY, USA, 5 pages. https://doi.org/

1 INTRODUCTION
eCommerce search engines often provide multiple rank orders of the results. Amazon, for example, offers the user 6 orders ranging from "Relevance" to "Avg. Customer Review" and "Price: Low to High"; Trademe offers the user a choice of 10 rank orders.
Search engine evaluation has traditionally been based on measuring the ability of the search engine to place relevant documents at the top of a results list. The working hypothesis is the probabilistic ranking principle – documents in a results list should be ranked in order of most probably relevant to the user, to least probably relevant to the user. For an eCommerce search engine it is necessary to diverge from this principle because of the multiple rank orders.
In this short opinion piece we explore how we might evaluate the quality of an eCommerce search engine offering multiple rank orderings, using Amazon and Trademe as running examples.
First we explore the search interface of these two sites and show that they, indeed, provide the user with the ability to re-sort the results of their query. We then examine the quality of the first page of results for a single query and show that the quality varies for different rank orderings. Indeed, when we examine the multiple orders for a single query we show that these search engines find it difficult to identify documents that are relevant to that one query.
We believe that the problem is a consequence of the quality of the set of documents (1) retrieved by the search engine (and then ranked). If this recall base contains many false positives then it is inevitable that some rank order (either known now, or a future rank order) will place a non-relevant document high in the results list.
There are two ways we might measure the quality of the results. The first is to make no assumptions about the rank order and to measure the quality of the retrieved documents as a set – which we show is infeasible in a large collection. The second is to evaluate using the rank orderings the sites provide, and we propose a metric to accomplish this.
The probabilistic ranking principle also fails for eCommerce because it assumes the user is trying to find a relevant document. In the case of a user browsing an eCommerce site to, for example, get a "feel" for the going price and quality of a used book, they are trying to compare the top few (k) results. We examine this search modality as a case of invested effort – something that has previously been examined as the expected search length (ESL) and tolerance to irrelevance (T2I). We introduce a metric that measures the proportion of non-relevant documents the user will see when they reach the kth relevant document.

(1) In eCommerce it is usual to use the term document to refer to a product listing – which may or may not contain reviews, ratings, and so on.
2 PROBLEM STATEMENT
Modern Internet search engines consist of a document collection and a sophisticated search engine that, given a user query, resolves the query against the collection to produce a list of results. The probabilistic ranking principle [11] states that the results should be presented in order of most likely to be relevant to least likely to be relevant.
The probabilistic ranking principle has been examined and questioned many times. Fuhr [6], for example, suggests that, in practice, it is not suitable for use in an interactive setting. Work at TREC [3] suggests that in a web setting with millions of documents and ambiguous queries it is important to diversify the results in a results list. For example, when searching for "Apple", the best results list appears to contain results about Apple Inc., as well as Apple Corps., and the fruit. This ambiguity resolution is a natural part of Wikipedia, which has 61 links on the "Apple (disambiguation)" page, broken into 8 categories. (2)
The probabilistic ranking principle is directly questioned by the user interfaces of many eCommerce sites. Figure 1 (left) shows the 6 different sort orders on Amazon, ranging from "Relevance" to "Price: Low to High" to "Newest Arrivals". Of these 6, only 1 (Relevance) could be considered to be applying the probabilistic ranking principle. Figure 1 (right) shows the sort orders for Trademe, an Australasian eCommerce site, and its 10 sort orders which, while not dissimilar to those of Amazon, also include "Most Bids" and "Title", neither of which are ordered by the probabilistic ranking principle.

Figure 1: Amazon (left) and Trademe (right) result orderings for query "iPhone X"

(2) https://en.wikipedia.org/wiki/Apple_(disambiguation), visited: 23 April 2018
We note that title ordering has been examined by Sherlock & Trotman [13].
If most of the available rank orderings of eCommerce sites are not "Relevance", then evaluation of the search engine cannot be done on the assumption that it is. That is, the ability to put the most relevant document at the top of the results list is only one facet of the rank orderings to be evaluated when measuring the quality of a site.

3 ALTERNATIVE RANK ORDERS
It has been posited that if the ranking function is effective enough then a few false positive documents in the results list are acceptable because the ranking function will place those documents at the bottom of the list and no-one will see them [8]. This approach is, unfortunately, ineffective with sort orders based on constant document features (such as price).
To illustrate this point we searched for "iPhone X" on both Amazon and Trademe, and ranked using price low to high – something we consider entirely reasonable for a user to do and quite likely a high frequency (or head) query. While using a single query is far from evidence of a systematic problem, it can be considered to be a proof, by example, of the existence of a problem.
Figure 2 (left) shows the results for Amazon while Figure 2 (right) shows the results for Trademe. On Amazon, neither of the first two listings is for a phone (and neither is the advertising). On Trademe, two are for a stylus and two are for cases (but not for the iPhone X). On both Amazon and Trademe none of the results on the first page is for an iPhone X. When ordered by relevance, the top 4 results on both sites (the first page) are all iPhone X.

Figure 2: Amazon (left) and Trademe (right) price low to high results for query "iPhone X"

To demonstrate that this problem is not unique to "price low to high", we issued the same query on Amazon and examined the top result of each of the sort orders. Of the 6 sort orders on Amazon, 3 failed to place an iPhone X at position 1. On Trademe only 2 of the 10 sort orders placed an iPhone X at position 1. A single query is insufficient to draw robust conclusions, but it demonstrates the existence of a problem.
It is reasonable to conclude that the found document set (the recall base) contains false positives which in "Relevance" order are pushed low down in the results list, but in other sort orders these false positives can be presented to the user.

4 EVALUATION
The comparison between Amazon and Trademe shows that not only are there several possible sort orders, but that those orders differ between sites. This suggests that it might not be possible to close the list of sort orders – in other words, Amazon might adopt some new sort orders in the future.
This raises the question of how to evaluate the quality of a search engine in light of sort orders that have not yet been proposed, as well as those that have. We believe that this can be achieved by measuring the quality of the recall base rather than the ranking. The obvious measure is the F1 of precision and recall, at least as far as buying is concerned. We explore this in Section 4.1.
Information retrieval metrics are, in essence, models of users. We are aware of very little work examining user interaction on eCommerce sites (but see Sharma et al. [12]). We assume two models: buying and browsing. When browsing, the user wants to see k relevant documents to compare (for example) their colour, quality, age, and price. We explore metrics for browsing in Section 4.3.

4.1 Buying: All Possible Orderings
The accuracy of a search engine irrespective of the rank order of the documents in the results list is given by the set-wise precision. Precision is defined as the proportion of documents that the search engine returns that are relevant,

    p = \frac{f_r}{f},    (1)

where f_r is the number of known-relevant documents retrieved by the search engine, and f is the number of documents in the results list. Problematically, a strategy for scoring high in set-wise precision is to return only one relevant document – which is clearly not in the interests of the user (unless there is only 1 relevant document in the collection).
A solution is to measure the recall, the proportion of the known relevant documents in the collection that the search engine returns to the user,

    c = \frac{f_r}{r},    (2)

where c is the recall, f_r is the number of known-relevant documents retrieved by the search engine, and r is the number of known-relevant documents in the collection. Problematically, a strategy for scoring high in recall is to return all documents – which is not in the interests of the user because the precision can be expected to be low.
If both set-wise precision and recall are very high then the search engine has returned a large proportion of the relevant documents and putting them in any order should nearly satisfy the probability ranking principle. This is usually measured using the F1 score, the harmonic mean of precision and recall. The F1 score is rank-order invariant. That is, it is a good indicator of quality before the rank order is known. To compute F1, it is necessary to know r.
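For concreteness, the following is a minimal sketch (ours, not part of the original paper) of how these set-wise scores might be computed, assuming the retrieved document identifiers and the set of known-relevant identifiers are available; the function and variable names are illustrative only.

```python
def set_wise_scores(retrieved, known_relevant):
    """Set-wise precision (Eq. 1), recall (Eq. 2), and their harmonic mean F1.

    retrieved      -- iterable of document ids returned by the search engine (f)
    known_relevant -- set of document ids judged relevant in the collection (r)
    """
    retrieved = set(retrieved)
    f = len(retrieved)                      # documents returned
    r = len(known_relevant)                 # known-relevant documents in the collection
    f_r = len(retrieved & known_relevant)   # relevant documents returned

    p = f_r / f if f else 0.0               # precision, Eq. (1)
    c = f_r / r if r else 0.0               # recall, Eq. (2)
    f1 = 2 * p * c / (p + c) if (p + c) else 0.0
    return p, c, f1

# Toy example: 10 results, 6 of them relevant, 8 relevant items exist in total.
print(set_wise_scores(range(10), set(range(4, 12))))   # (0.6, 0.75, 0.666...)
```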
In a large document collection such as those at Amazon (about 550 million listings) (3) and Trademe (about 6 million listings) (4), for a given query, it isn't possible to know the number of relevant documents in the collection (items for sale that the user might want to purchase or browse). So computing set-wise recall is infeasible. We propose three solutions to this: random sampling, reordering, and pooling.
A random sample taken from the document collection could be used. We observe that there are two possible outcomes for a randomly selected document – either it is relevant or it is not – so the distribution is binomial and each randomly selected document is a Bernoulli trial. Assuming the search engine is perfect (precision = recall = 1), an estimate of the proportion of the collection that is relevant is given by

    \hat{p} = \frac{f_r}{N},    (3)

where \hat{p} is the estimated proportion of the collection that is relevant, f_r is the number of found documents, and N is the collection size. The confidence we have in that estimate is

    \hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{N}}.    (4)

Allowing for a confidence interval of 10% of \hat{p},

    \delta = |\hat{p} - (1.1 \times \hat{p})|,    (5)

and for convenience's sake we set \hat{p}_0 = \hat{p} and \hat{p}_1 = 1.1 \times \hat{p}. We can now compute n, the number of samples we need to take from the entire collection to validate that the results list contains at least the number of documents that are relevant. Since

    \delta = z_{1-\alpha/2} \sqrt{\frac{\hat{p}_0(1-\hat{p}_0)}{n}} + z_{1-\beta} \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n}},    (6)

n is given by

    n \geq \frac{z_{1-\alpha/2} \sqrt{\hat{p}_0(1-\hat{p}_0)} + z_{1-\beta} \sqrt{\hat{p}_1(1-\hat{p}_1)}}{\delta}.    (7)

Assuming a document collection of 550 million documents, and about 400 relevant documents (5), \hat{p} = 7 \times 10^{-7}. For a one-tailed 95% confidence level, z_{1-\alpha/2} = 1.645. For a 10% confidence interval, z_{1-\beta} = 1.282, so

    n \geq \frac{1.645\sqrt{7 \times 10^{-7}(1 - 7 \times 10^{-7})} + 1.282\sqrt{8 \times 10^{-7}(1 - 8 \times 10^{-7})}}{7 \times 10^{-8}},    (8)

    n > 35056.    (9)

In other words, tens of thousands of documents in the collection would need to be sampled.
Assuming this was possible, having determined that the result set contains at least the number of documents that are relevant, it is next necessary to randomly sample the result set to determine the proportion of it that is relevant. The same binomial equations can be applied. In this case the expected proportion of documents that are relevant, \hat{p}, is near 1 (so we use 0.9), and the confidence interval and confidence level might remain the same, so n is very small (about 7). From this the F1 measure can be computed (i.e. we know f, r, and f_r).
However, since such a large number of documents must be sampled to determine the number of relevant documents for a given query, this approach is infeasible.

(3) https://www.scrapehero.com/many-products-amazon-sell-january-2018/
(4) https://www.Trademe.co.nz/About-trade-me/Site-stats
(5) Roughly what we observe on Amazon today (mid 2018) for the query "iPhone X".
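As a rough check on this arithmetic, the sketch below (ours) plugs the same values into Equation 7 as printed, rounding \hat{p}_1 to 8 × 10^-7 as in Equation 8; the function name is ours.

```python
import math

def required_sample_size(p0, p1, z_alpha, z_beta):
    """Collection samples needed, Eq. (7) as printed:
    n >= (z_a * sqrt(p0(1-p0)) + z_b * sqrt(p1(1-p1))) / delta."""
    delta = abs(p0 - 1.1 * p0)   # 10% confidence interval around p-hat, Eq. (5)
    return (z_alpha * math.sqrt(p0 * (1 - p0)) +
            z_beta * math.sqrt(p1 * (1 - p1))) / delta

# 400 relevant documents in a collection of 550 million: p-hat = 7e-7.
n = required_sample_size(p0=7e-7, p1=8e-7, z_alpha=1.645, z_beta=1.282)
print(round(n))   # roughly 3.6e4 - tens of thousands, the order of magnitude in Eq. (9)
```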
The second approach, an alternative to sampling the entire document collection, is to permute the results list and compute the precision (for example, P@10) of all possible orderings. In the case of 2000 results the number of permutations is 2000! (more than 6.4 × 10^868), which is too large to compute. Moreover, with no recall component it is not possible to know whether the recall base contains the best items (e.g. the lowest priced item). This is akin to known item finding where the known item is not known in advance, measuring on the assumption that the results list contains it. We do not believe this is a valid way to measure quality.
The third approach, used by Zobel [14], is to estimate the number of relevant documents in the collection using a number of different results lists for the same query. Each of a set of search engines is used to generate a results list for a given query. Then the first results list is examined and the number of relevant documents is noted. Then the second results list is examined and the number of previously unseen relevant documents is noted, and so on for the third and subsequent search engines. This is then plotted and extrapolated to the point at which a new search engine will not find any previously unseen relevant documents. Unfortunately, most search engines today work in essentially the same way (including BM25 ranking) and the diversity is insufficient to consider this a robust way of computing the number of relevant documents in the collection.
Each of the three ways we propose for computing the score of a single query's results list irrespective of the results ordering is infeasible. We now turn our attention to the orderings a site provides rather than all possible orderings.

4.2 Buying: Offered Orderings
A more viable approach to measuring performance is to directly use the rank orderings offered by the site. In the case of Amazon, this would be the 6 orderings listed in Section 2, or the 10 orderings for Trademe. The obvious way is to compute the score for each list and to linearly combine and average,

    p = \sum_{a=1}^{|A|} \lambda_a p_a,    (10)

where p is the precision, p_a is the precision score for ordering a of the A possible orderings, |A| is the number of orderings, and \lambda_a is a weight for ordering a, with \sum_{a=1}^{|A|} \lambda_a = 1. If all rank orders are of equal importance,

    \forall a, \lambda_a = \frac{1}{|A|}.    (11)

However, it is highly unlikely that all rank orderings are of equal importance to a site. On Trademe, "Best match" is the default, and "lowest price" appeals to bargain hunters, so we expect these to be weighted higher (more important) than "Title" or other orders.
One way to compute the \lambda_a weights is to compute the relative proportion of results lists presented in order a; others include the proportion of clickthroughs that come from the given list type, or the proportion of sales from that list type. There are a multitude of possibilities, and most would require on-going observation as the proportions are likely to change based on the quality of the results, time, user location, and client device. In other words, there is a feedback loop.
The individual precisions, p_a, could be computed using any of the standard information retrieval metrics that do not require an estimate of the recall. This might include P@n, Rank Biased Precision [10], or others. We note that P@3 has been used by some eCommerce sites as that is the number of results typically shown in the first page of results on a smart phone [7]. We also note that there is an implicit assumption in these metrics that the recall base is sufficiently large to contain the best answer for the given sort order – but the lowest priced item is the lowest priced item, and it might not be in the recall base.
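A minimal sketch (ours) of the weighted combination in Equations 10 and 11, assuming the per-ordering precision scores (P@3, RBP, or similar) have already been measured; the ordering names, scores, and weights below are hypothetical.

```python
def weighted_precision(precisions, weights=None):
    """Combine per-ordering precision scores p_a into a single score (Eq. 10).

    precisions -- dict: ordering name -> precision score p_a (e.g. P@3 for that ordering)
    weights    -- dict: ordering name -> lambda_a, summing to 1; if omitted, every
                  ordering gets lambda_a = 1/|A| (Eq. 11)
    """
    if weights is None:
        weights = {a: 1.0 / len(precisions) for a in precisions}
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "lambda weights must sum to 1"
    return sum(weights[a] * p_a for a, p_a in precisions.items())

# Hypothetical P@3 scores for three of the offered sort orders.
scores = {"Relevance": 1.0, "Price: Low to High": 0.0, "Newest Arrivals": 1 / 3}
print(weighted_precision(scores))                                   # equal importance: 0.444...
print(weighted_precision(scores, {"Relevance": 0.6,
                                  "Price: Low to High": 0.3,
                                  "Newest Arrivals": 0.1}))          # default-order heavy: 0.633...
```

The choice of weights is exactly the feedback loop described above: any estimate of λ_a drawn from traffic or sales would need to be refreshed as user behaviour shifts.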
4.3 Browsing
A browsing user is interested in comparing the characteristics of multiple items. These might be obvious eCommerce features such as price or delivery time, or they might be more esoteric, such as whether a certain edition of a book is on the market.
We believe that a metric similar to Tolerance to Irrelevance, T2I [5], but for eCommerce, is appropriate for measuring browsing quality. That is, we envisage a user who continues to look down a results list until their tolerance to the irrelevant material is exceeded – we then ask how far down the results list the user is. This is similar to Cooper's Expected Search Length, ESL, of a simple ordering [4],

    ESL = \sum_{i=1}^{k+\varepsilon} \overline{rel}_i,    (12)

where k is the number of relevant documents we're looking for, ε is the maximum number of non-relevant documents we're prepared to tolerate (stopping after k relevant documents are found), and \overline{rel}_i is 1 if the document at position i in the results list is not relevant, and 0 if it is relevant. ESL is the absolute number of irrelevant documents the user must view in order to see k relevant documents for a given query, which is then averaged over a number of queries. It also does not fall in the range [0..1].
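A minimal sketch (ours) of Equation 12 for a single query: scan down one results list, counting non-relevant documents, and stop once k relevant documents have been seen or the tolerance ε is exhausted; the relevance judgements are toy data.

```python
def expected_search_length(rel, k, epsilon):
    """Non-relevant documents viewed while looking for k relevant ones (Eq. 12).

    rel     -- list of 0/1 relevance judgements in results-list order (1 = relevant)
    k       -- number of relevant documents the user wants to compare
    epsilon -- maximum number of non-relevant documents the user will tolerate
    """
    seen_relevant = 0
    seen_non_relevant = 0
    for r in rel:
        if r:
            seen_relevant += 1
            if seen_relevant == k:            # stop once k relevant documents are found
                break
        else:
            seen_non_relevant += 1
            if seen_non_relevant == epsilon:  # tolerance to irrelevance exceeded
                break
    return seen_non_relevant

# Toy results list: the user wants 3 relevant items and will tolerate 5 non-relevant ones.
print(expected_search_length([0, 1, 0, 0, 1, 1, 0], k=3, epsilon=5))   # -> 3
```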
We assume the user is interested in comparing k items, so we measure the effort required to find those k items. More precisely, we measure the inverse of that effort.
The effort to find one relevant document in one results list is simply the position of that item in the results list, rank_1. The inverse of this is the reciprocal rank for the query, RR; the mean over a number of queries, |Q|, is the mean reciprocal rank, MRR,

    MRR = \frac{\sum_{i=1}^{|Q|} \frac{1}{rank_1}}{|Q|}.    (13)

Generalizing this to k relevant documents, RR_k,

    RR_k = \frac{\sum_{i=1}^{k} \frac{i}{rank_i}}{k},    (14)

and the mean of this over the query set,

    MRR_k = \frac{RR_k}{|Q|},    (15)

is the inverse of the effort the user must expend in order to observe k relevant documents. MRR_k is in the range [0..1] where 1 is best.
We observe that MRR_k is exactly equivalent to MAP@k_r where k_r is the position in the results list of the kth relevant document (rather than the more usual kth position in the results list). An obvious extension is MAP@k_r%.
We also note the similarity to r-precision [1], where the precision is measured at position r in the results list and r is the number of relevant documents. Indeed, setting r to k_r on a query by query basis gives the precision at the point at which the user sees k relevant documents.
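A minimal sketch (ours) of Equations 14 and 15: compute RR_k for each query and average over the query set; the relevance vectors are toy data.

```python
def rr_k(rel, k):
    """RR_k (Eq. 14): mean of i / rank_i over the first k relevant documents.

    rel -- list of 0/1 relevance judgements in results-list order (1 = relevant)
    k   -- number of relevant documents the user wants to see
    """
    total, found = 0.0, 0
    for position, r in enumerate(rel, start=1):
        if r:
            found += 1
            total += found / position      # precision at the i-th relevant document
            if found == k:
                break
    return total / k                       # unfound relevant documents contribute 0

def mrr_k(all_rel, k):
    """MRR_k (Eq. 15): the mean of RR_k over a set of queries Q."""
    return sum(rr_k(rel, k) for rel in all_rel) / len(all_rel)

# Two toy queries, looking for k = 2 relevant documents.
queries = [[1, 0, 1, 0], [0, 1, 1, 0]]
print(mrr_k(queries, k=2))   # ((1/1 + 2/3)/2 + (1/2 + 2/3)/2) / 2 ~= 0.708
```

Averaging RR_k over queries in this way is what the text above identifies with MAP truncated at the position of the kth relevant document.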
5 RELEVANCE
It is pertinent to ask what relevance means in the context of an eCommerce site. Goldberg et al. [7] suggest that for buying it might be defined by a book. That book encodes the difference between an individual user's expectation and the meaning of their query. They ask whether basketball shoes are a good answer to the query basketball, or whether the user needs to be trained to ask for what they want, as shopping is akin to known entity finding. Indeed, we accept that the definition of relevance for shopping is hard and requires further exploration, as it is likely to include factors such as price, seller rating, shipping time, and so on. However, a buy signal for a query is very strong evidence of relevance, and such signals might be mined from logs.
We believe that relevance for browsing is even more difficult to define – but it is clearly an item from an item set that the user wants to compare for some purpose. The purpose could be spelled out in a TREC-like topic definition. The set might be mined from user behaviour.

6 TASK PROPOSAL
We showed in Section 2 that both Amazon and Trademe support multiple rank orderings of the results sets, and in Section 3 that those rank orders are not of equal quality. In order to measure the quality of the site, we argued in Section 4.1 that it is not feasible to measure F1 as the number of relevant documents cannot be known, and we instead propose to measure a weighted average of the precision scores of each of the offered results orderings. In this section we provide more details on our proposed task.
We propose to take a dump of a large-scale online eCommerce site such as Amazon or Trademe. This might be achieved either by agreement with the site, by crawling the site, or by extracting documents from a pre-existing crawl. There are several reasons such a site might choose to participate in such a dump. First, none of the data is proprietary; the data is already public facing and free. Second, providing a dump of the data to the research community is a marketing opportunity. Third, the results of research on such a document collection would be directly applicable by the group that makes the data available, rather than requiring "porting" to a new document collection.
Acquiring a query log may be difficult as query data is proprietary, but a set of queries could be mined from a proxy log of a large institute that has not blocked eCommerce sites. The query is embedded in the URL of the results page of both Amazon and Trademe, and extracting the query from that appears to be straightforward.
Values for λ_a could be estimated from a proxy log (although this might introduce bias). Both Amazon and Trademe embed the sort order in the URL. Either the proportions of queries using each sort order, or the proportion that lead to a buy, could be used.
Trademe and Amazon both support list and grid result presentation – and we believe that they should be measured differently. Set-wise evaluation appears, at the outset, to be a better metric for grids, whereas rank-biased metrics appear to be better suited to lists. The quality of both presentation formats should be measured.

7 DISCUSSION
Both Trademe and Amazon support rank orderings that are direct inversions of each other. For example, the results list for "Highest price" should be directly computable from the results list for "Lowest price" by simply inverting the results list, but may not be because of tie breaks.
We believe that a well performing search engine that returns high quality documents irrespective of the rank order must be good at identifying relevant documents, and must have both a low false positive rate and a low false negative rate. Hence, we believe that it will be a high accuracy search engine.
High accuracy recall identification is an interesting problem for many reasons. First, many years of assumptions about the ranking function pushing low quality results down the results list no longer apply – the learning-to-rank pipelines in web search engines may not be applicable. Second, to be usable online, high accuracy with low latency is important. This raises new problems for IR efficiency research, which generally uses algorithms such as WAND [2] or Anytime [9] that assume a pre-computed single rank ordering, while BitFunnel [8] may return too many false positives.
The similarity between some of the rank orderings (e.g. price low to high) and known entity search does not escape us. In the proposed task, however, the known entity is known to exist, but which document it is is not. Indeed, knowing whether or not any search engine has found the lowest priced relevant document does not appear to be easy. We only know that the lowest priced item amongst those assessed has been placed at the top of the list. The metrics we have proposed do not account for whether or not the most-relevant item is in the recall base. We leave for further work the development of metrics that account for this in absolute orderings. An obvious way to address this is to consider non-recalled but relevant documents as non-relevant documents. That is, if there are 3 relevant documents lower in price than the search engine returns, then count that as 3 misses before the results returned by the search engine – though these might be weighted, as a missing cheapest item is a greater mistake than a missing 25th cheapest item.
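To make the "count non-recalled cheaper items as misses" idea concrete, here is a small sketch (ours, with hypothetical prices) that counts how many known-relevant items are cheaper than the cheapest result the engine returned; the weighting by how cheap the missed item is, as suggested above, is not implemented.

```python
def missed_cheaper_items(known_relevant_prices, returned_prices):
    """Count relevant items cheaper than the cheapest returned result.

    These are treated as misses sitting (conceptually) above the returned
    "Price: Low to High" list; weighting a missed cheapest item more heavily
    than a missed 25th-cheapest item is left open, as in the discussion.
    """
    if not returned_prices:
        return len(known_relevant_prices)
    cheapest_returned = min(returned_prices)
    return sum(1 for price in known_relevant_prices if price < cheapest_returned)

# Hypothetical prices: three relevant items are cheaper than anything returned.
print(missed_cheaper_items([799, 849, 899, 999, 1049],
                           returned_prices=[999, 1049, 1099]))   # -> 3
```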
8 CONCLUSIONS
In this short paper we examined two eCommerce sites and showed that they support different sort orders of the results list. We then showed that they are not equally good at ranking when using these sort orders, and hypothesized that the problem is the quality of the recall set – those documents the search engine returns.
We suggested measuring the quality of the recall base irrespective of the presentation order, and showed that this is infeasible as it is not possible to know the number of relevant documents in the collection – and it is not possible to compute it.
We then proposed a weighted precision score as a metric and proposed methods of computing the weights – for buying. For browsing we developed a measure not dissimilar from tolerance to irrelevance, but based on MAP.
Finally we proposed the high accuracy recall task. For this task the search engine must identify as many relevant documents as it can without forfeiting precision – so that regardless of the presentation order the quality of the results is high.
We believe this is an interesting problem to tackle because it raises new questions about ranking, efficiency, and performance measurement. In future work we hope to build the collection and to launch the task.

REFERENCES
[1] J. A. Aslam, E. Yilmaz, and V. Pavlu. 2005. A Geometric Interpretation of R-precision and Its Correlation with Average Precision. In SIGIR 2005. 573–574.
[2] A. Z. Broder, D. Carmel, M. Herscovici, A. Soffer, and J. Zien. 2003. Efficient Query Evaluation Using a Two-level Retrieval Process. In CIKM 2003. 426–434.
[3] K. Collins-Thompson, C. Macdonald, P. N. Bennett, F. Diaz, and E. M. Voorhees. 2014. TREC 2014 Web Track Overview. In TREC 2014.
[4] W. S. Cooper. 1968. Expected Search Length: A Single Measure of Retrieval Effectiveness Based on the Weak Ordering Action of Retrieval Systems. Am. Doc. 19, 1 (1968), 30–41.
[5] A. P. de Vries, G. Kazai, and M. Lalmas. 2004. Tolerance to Irrelevance: A User-effort Oriented Evaluation of Retrieval Systems Without Predefined Retrieval Unit. In RIAO 2004. 463–473.
[6] N. Fuhr. 2008. A Probability Ranking Principle for Interactive Information Retrieval. IRJ 11, 3 (2008), 251–265.
[7] D. Goldberg, A. Trotman, X. Wang, W. Min, and Z. Wan. 2017. Drawing Sound Conclusions from Noisy Judgments. In WWW 2017. 529–537.
[8] B. Goodwin, M. Hopcroft, D. Luu, A. Clemmer, M. Curmei, S. Elnikety, and Y. He. 2017. BitFunnel: Revisiting Signatures for Search. In SIGIR 2017. 605–614.
[9] J. Lin and A. Trotman. 2015. Anytime Ranking for Impact-Ordered Indexes. In ICTIR 2015. 301–304.
[10] A. Moffat and J. Zobel. 2008. Rank-biased Precision for Measurement of Retrieval Effectiveness. ACM TOIS 27, 1 (2008), 2:1–2:27.
[11] S. E. Robertson. 1997. The Probability Ranking Principle in IR. In Readings in Information Retrieval. 281–286.
[12] M. Sharma, P. Sondhi, C. Zhai, and P. Kolari. 2018. A Taxonomy of Queries for E-commerce Search. In SIGIR 2018.
[13] N. Sherlock and A. Trotman. 2011. Efficient Sorting of Search Results by String Attributes. In ADCS 2011.
[14] J. Zobel. 1998. How Reliable Are the Results of Large-scale Information Retrieval Experiments?. In SIGIR 1998. 307–314.