High Accuracy Recall Task

Andrew Trotman, University of Otago, Dunedin, New Zealand, andrew@cs.otago.ac.nz
Surya Kallumadi, Kansas State University, Kansas, USA, surya@ksu.edu
Jon Degenhardt, eBay inc., California, USA, jondegenhardt@gmail.com

ABSTRACT
We identify a new information retrieval task for eCommerce that we call the high accuracy recall task. That task is to identify as many relevant documents, and as few non-relevant documents, as possible, such that regardless of the rank ordering the precision remains high.
We demonstrate a need to investigate this problem, we propose metrics to measure the quality of the results, and we suggest how a document collection might be built and queries might be generated.

CCS CONCEPTS
• Information systems → Retrieval effectiveness;

KEYWORDS
eCommerce, Performance Metrics, Quantitative Analysis

ACM Reference Format:
Andrew Trotman, Surya Kallumadi, and Jon Degenhardt. 2018. High Accuracy Recall Task. In Proceedings of ACM SIGIR Workshop on eCommerce (SIGIR 2018 eCom). ACM, New York, NY, USA, 5 pages. https://doi.org/

1 INTRODUCTION
eCommerce search engines often provide multiple rank orders of the results. Amazon, for example, offers the user 6 orders ranging from "Relevance" to "Avg. Customer Review" and "Price: Low to High"; Trademe offers the user a choice of 10 rank orders.
Search engine evaluation has traditionally been based on measuring the ability of the search engine to place relevant documents at the top of a results list. The working hypothesis is the probabilistic ranking principle – documents in a results list should be ranked in order of most probably relevant to the user, to least probably relevant to the user. For an eCommerce search engine it is necessary to diverge from this principle because of the multiple rank orders.
In this short opinion piece we explore how we might evaluate the quality of an eCommerce search engine offering multiple rank orderings, using Amazon and Trademe as running examples.
First we explore the search interface of these two sites and show that they, indeed, provide the user with the ability to re-sort the results of their query. We then examine the quality of the first page of results for a single query and show that the quality varies for different rank orderings. Indeed, when we examine the multiple orders for a single query we show that these search engines find it difficult to identify documents that are relevant to that one query.
We believe that the problem is a consequence of the quality of the set of documents (1) retrieved by the search engine (and then ranked). If this recall base contains many false positives then it is inevitable that some rank order (either known now, or a future rank order) will place a non-relevant document high in the results list.
There are two ways we might measure the quality of the results. The first is to make no assumptions about the rank order and to measure the quality of the retrieved documents as a set – which we show is infeasible in a large collection. The second is to evaluate using the rank orderings the sites provide, and we propose a metric to accomplish this.
The probabilistic ranking principle also fails for eCommerce because it assumes the user is trying to find a relevant document. In the case of a user browsing an eCommerce site to, for example, get a "feel" for the going price and quality of a used book, they are trying to compare the top few (k) results. We examine this search modality as a case of invested effort – something that has previously been examined as the expected search length (ESL) and tolerance to irrelevance (T2I). We introduce a metric that measures the proportion of non-relevant documents the user will see when they reach the kth relevant document.

(1) In eCommerce it is usual to use the term document to refer to a product listing – which may or may not contain reviews, ratings, and so on.
2 PROBLEM STATEMENT
Modern Internet search engines consist of a document collection and a sophisticated search engine that, given a user query, resolves the query against the collection to produce a list of results. The probabilistic ranking principle [11] states that the results should be presented in order of most likely to be relevant to least likely to be relevant.
The probabilistic ranking principle has been examined and questioned many times. Fuhr [6], for example, suggests that, in practice, it is not suitable for use in an interactive setting. Work at TREC [3] suggests that in a web setting with millions of documents and ambiguous queries it is important to diversify the results in a results list. For example, when searching for "Apple", the best results list appears to contain results about Apple Inc., as well as Apple Corps., and the fruit. This ambiguity resolution is a natural part of Wikipedia, which has 61 links on the "Apple (disambiguation)" page, broken into 8 categories. (2)
The probabilistic ranking principle is directly questioned by the user interfaces of many eCommerce sites. Figure 1 (left) shows the 6 different sort orders on Amazon, ranging from "Relevance" to "Price: Low to High" to "Newest Arrivals". Of these 6, only 1 (Relevance) could be considered to be applying the probabilistic ranking principle. Figure 1 (right) shows the sort orders for Trademe, an Australasian eCommerce site, and its 10 sort orders which, while not dissimilar to those of Amazon, also include "Most Bids" and "Title", neither of which are ordered by the probabilistic ranking principle.

Figure 1: Amazon (left) and Trademe (right) result orderings for query "iPhone X"

(2) https://en.wikipedia.org/wiki/Apple_(disambiguation), visited: 23 April 2018
We note that title ordering has been examined by Sherlock & Trotman [13].
If most of the available rank orderings of eCommerce sites are not "Relevance", then evaluation of the search engine cannot be done on the assumption that it is. That is, the ability to put the most relevant document at the top of the results list is only one facet of the rank orderings to be evaluated when measuring the quality of a site.

3 ALTERNATIVE RANK ORDERS
It has been posited that if the ranking function is effective enough then a few false positive documents in the results list are acceptable because the ranking function will place those documents at the bottom of the list and no-one will see them [8]. This approach is, unfortunately, ineffective with sort orders based on constant document features (such as price).
To illustrate this point we searched for "iPhone X" on both Amazon and Trademe, and ranked using price low to high – something we consider entirely reasonable for a user to do and quite likely a high frequency (or head) query. While using a single query is far from evidence of a systematic problem, it can be considered to be a proof, by example, of the existence of a problem.
Figure 2 (left) shows the results for Amazon while Figure 2 (right) shows the results for Trademe. On Amazon, neither of the first two listings is for a phone (and neither is the advertising). On Trademe, two are for a stylus and two are for cases (but not for the iPhone X). On both Amazon and Trademe none of the results on the first page is for an iPhone X. When ordered by relevance, the top 4 results on both sites (the first page) are all iPhone X.

Figure 2: Amazon (left) and Trademe (right) price low to high results for query "iPhone X"

To demonstrate that this problem is not unique to "price low to high", we issued the same query on Amazon and examined the top result of each of the sort orders. Of the 6 sort orders on Amazon, 3 failed to place an iPhone X at position 1. On Trademe only 2 of the 10 sort orders placed an iPhone X at position 1. A single query is insufficient to draw robust conclusions, but it demonstrates the existence of a problem.
It is reasonable to conclude that the found document set (the recall base) contains false positives which in "Relevance" order are pushed low down in the results list, but in other sort orders these false positives can be presented to the user.

4 EVALUATION
The comparison between Amazon and Trademe shows that not only are there several possible sort orders, but that those orders differ between sites. This suggests that it might not be possible to close the list of sort orders – in other words, Amazon might adopt some new sort orders in the future.
This raises the question of how to evaluate the quality of a search engine in light of sort orders that have not yet been proposed, as well as those that have. We believe that this can be achieved by measuring the quality of the recall base rather than the ranking. The obvious measure is the F1 of precision and recall, at least as far as buying is concerned. We explore this in Section 4.1.
Information retrieval metrics are, in essence, models of users. We are aware of very little work examining user interaction on eCommerce sites (but see Sharma et al. [12]). We assume two models: buying and browsing. When browsing, the user wants to see k relevant documents to compare (for example) their colour, quality, age, and price. We explore metrics for browsing in Section 4.3.

4.1 Buying: All Possible Orderings
The accuracy of a search engine irrespective of the rank order of the documents in the results list is given by the set-wise precision. Precision is defined as the proportion of documents that the search engine returns that are relevant,

    p = \frac{f_r}{f},    (1)

where f_r is the number of known-relevant documents retrieved by the search engine, and f is the number of documents in the results list. Problematically, a strategy for scoring high in set-wise precision is to return only one relevant document – which is clearly not in the interests of the user (unless there is only 1 relevant document in the collection).
A solution is to measure the recall, the proportion of the known relevant documents in the collection that the search engine returns to the user,

    c = \frac{f_r}{r},    (2)

where c is the recall, f_r is the number of known-relevant documents retrieved by the search engine, and r is the number of known-relevant documents in the collection. Problematically, a strategy for scoring high in recall is to return all documents – which is not in the interests of the user because the precision can be expected to be low.
If both set-wise precision and recall are very high then the search engine has returned a large proportion of the relevant documents and putting them in any order should nearly satisfy the probability ranking principle. This is usually measured using the F1 score, the harmonic mean of precision and recall. The F1 score is rank-order invariant. That is, it is a good indicator of quality before the rank order is known. To compute F1, it is necessary to know r.
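For concreteness, the following is a minimal sketch (ours, not part of the original paper) of how these set-wise scores might be computed, assuming the retrieved document identifiers and the set of known-relevant identifiers are available; the function and variable names are illustrative only.

```python
def set_wise_scores(retrieved, known_relevant):
    """Set-wise precision (Eq. 1), recall (Eq. 2), and their harmonic mean F1.

    retrieved      -- iterable of document ids returned by the search engine (f)
    known_relevant -- set of document ids judged relevant in the collection (r)
    """
    retrieved = set(retrieved)
    f = len(retrieved)                      # documents returned
    r = len(known_relevant)                 # known-relevant documents in the collection
    f_r = len(retrieved & known_relevant)   # relevant documents returned

    p = f_r / f if f else 0.0               # precision, Eq. (1)
    c = f_r / r if r else 0.0               # recall, Eq. (2)
    f1 = 2 * p * c / (p + c) if (p + c) else 0.0
    return p, c, f1

# Toy example: 10 results, 6 of them relevant, 8 relevant items exist in total.
print(set_wise_scores(range(10), set(range(4, 12))))   # (0.6, 0.75, 0.666...)
```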
In a large document collection such as those at Amazon (about 550 million listings) (3) and Trademe (about 6 million listings) (4), for a given query, it isn't possible to know the number of relevant documents in the collection (items for sale that the user might want to purchase or browse). So computing set-wise recall is infeasible. We propose three solutions to this: random sampling, reordering, and pooling.
A random sample taken from the document collection could be used. We observe that there are two possible outcomes for a randomly selected document – either it is relevant or it is not – so the distribution is binomial and each randomly selected document is a Bernoulli trial. Assuming the search engine is perfect (precision = recall = 1), an estimate of the proportion of the collection that is relevant is given by

    \hat{p} = \frac{f_r}{N},    (3)

where \hat{p} is the estimated proportion of the collection that is relevant, f_r is the number of found documents, and N is the collection size. The confidence we have in that estimate is

    \hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{N}}.    (4)

Allowing for a confidence interval of 10% of \hat{p},

    \delta = |\hat{p} - (1.1 \times \hat{p})|,    (5)

and for convenience's sake we set \hat{p}_0 = \hat{p} and \hat{p}_1 = 1.1 \times \hat{p}. We can now compute n, the number of samples we need to take from the entire collection to validate that the results list contains at least the number of documents that are relevant. Since

    \delta = z_{1-\alpha/2} \sqrt{\frac{\hat{p}_0(1-\hat{p}_0)}{n}} + z_{1-\beta} \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n}},    (6)

n is given by

    n \geq \frac{z_{1-\alpha/2} \sqrt{\hat{p}_0(1-\hat{p}_0)} + z_{1-\beta} \sqrt{\hat{p}_1(1-\hat{p}_1)}}{\delta}.    (7)

Assuming a document collection of 550 million documents, and about 400 relevant documents (5), \hat{p} = 7 \times 10^{-7}. For a one-tailed 95% confidence level, z_{1-\alpha/2} = 1.645. For a 10% confidence interval, z_{1-\beta} = 1.282, so

    n \geq \frac{1.645\sqrt{7 \times 10^{-7}(1 - 7 \times 10^{-7})} + 1.282\sqrt{8 \times 10^{-7}(1 - 8 \times 10^{-7})}}{7 \times 10^{-8}},    (8)

    n > 35056.    (9)

In other words, tens of thousands of documents in the collection would need to be sampled.
Assuming this was possible, having determined that the result set contains at least the number of documents that are relevant, it is next necessary to randomly sample the result set to determine the proportion of it that is relevant. The same binomial equations can be applied. In this case the expected proportion of documents that are relevant, \hat{p}, is near 1 (so we use 0.9), and the confidence interval and confidence level might remain the same, so n is very small (about 7). From this the F1 measure can be computed (i.e. we know f, r, and f_r).
However, since such a large number of documents must be sampled to determine the number of relevant documents for a given query, this approach is infeasible.

(3) https://www.scrapehero.com/many-products-amazon-sell-january-2018/
(4) https://www.Trademe.co.nz/About-trade-me/Site-stats
(5) Roughly what we observe on Amazon today (mid 2018) for the query "iPhone X".
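As a rough check on this arithmetic, the sketch below (ours) plugs the same values into Equation 7 as printed, rounding \hat{p}_1 to 8 × 10^-7 as in Equation 8; the function name is ours.

```python
import math

def required_sample_size(p0, p1, z_alpha, z_beta):
    """Collection samples needed, Eq. (7) as printed:
    n >= (z_a * sqrt(p0(1-p0)) + z_b * sqrt(p1(1-p1))) / delta."""
    delta = abs(p0 - 1.1 * p0)   # 10% confidence interval around p-hat, Eq. (5)
    return (z_alpha * math.sqrt(p0 * (1 - p0)) +
            z_beta * math.sqrt(p1 * (1 - p1))) / delta

# 400 relevant documents in a collection of 550 million: p-hat = 7e-7.
n = required_sample_size(p0=7e-7, p1=8e-7, z_alpha=1.645, z_beta=1.282)
print(round(n))   # roughly 3.6e4 - tens of thousands, the order of magnitude in Eq. (9)
```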
The second approach, an alternative to sampling the entire document collection, is to permute the results list and compute the precision (for example, P@10) of all possible orderings. In the case of 2000 results the number of permutations is 2000! (more than 6.4 × 10^868), which is too large to compute. Moreover, with no recall component it is not possible to know whether the recall base contains the best items (e.g. the lowest priced item). This is akin to known item finding where the known item is not known in advance, measuring on the assumption that the results list contains it. We do not believe this is a valid way to measure quality.
The third approach, used by Zobel [14], is to estimate the number of relevant documents in the collection using a number of different results lists for the same query. Each of a set of search engines is used to generate a results list for a given query. Then the first results list is examined and the number of relevant documents is noted. Then the second results list is examined and the number of previously unseen relevant documents is noted, and so on for the third and subsequent search engines. This is then plotted and extrapolated to the point at which a new search engine will not find any previously unseen relevant documents. Unfortunately, most search engines today work in essentially the same way (including BM25 ranking) and the diversity is insufficient to consider this a robust way of computing the number of relevant documents in the collection.
Each of the three ways we propose for computing the score of a single query's results list irrespective of the results ordering is infeasible. We now turn our attention to the orderings a site provides rather than all possible orderings.

4.2 Buying: Offered Orderings
A more viable approach to measuring performance is to directly use the rank orderings offered by the site. In the case of Amazon, this would be the 6 orderings listed in Section 2, or the 10 orderings for Trademe. The obvious way is to compute the score for each list and to linearly combine and average,

    p = \sum_{a=1}^{|A|} \lambda_a p_a,    (10)

where p is the precision, p_a is the precision score for ordering a of the A possible orderings, |A| is the number of orderings, and \lambda_a is a weight for ordering a, with \sum_{a=1}^{|A|} \lambda_a = 1. If all rank orders are of equal importance,

    \forall a, \lambda_a = \frac{1}{|A|}.    (11)

However, it is highly unlikely that all rank orderings are of equal importance to a site. On Trademe, "Best match" is the default, and "lowest price" appeals to bargain hunters, so we expect these to be weighted higher (more important) than "Title" or other orders.
One way to compute the \lambda_a weights is to compute the relative proportion of results lists presented in order a; others include the proportion of clickthroughs that come from the given list type, or the proportion of sales from that list type. There are a multitude of possibilities, and most would require on-going observation as the proportions are likely to change based on the quality of the results, time, user location, and client device. In other words, there is a feedback loop.
The individual precisions, p_a, could be computed using any of the standard information retrieval metrics that do not require an estimate of the recall. This might include P@n, Rank Biased Precision [10], or others. We note that P@3 has been used by some eCommerce sites as that is the number of results typically shown in the first page of results on a smart phone [7]. We also note that there is an implicit assumption in these metrics that the recall base is sufficiently large to contain the best answer for the given sort order – but the lowest priced item is the lowest priced item, and it might not be in the recall base.
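A minimal sketch (ours) of the weighted combination in Equations 10 and 11, assuming the per-ordering precision scores (P@3, RBP, or similar) have already been measured; the ordering names, scores, and weights below are hypothetical.

```python
def weighted_precision(precisions, weights=None):
    """Combine per-ordering precision scores p_a into a single score (Eq. 10).

    precisions -- dict: ordering name -> precision score p_a (e.g. P@3 for that ordering)
    weights    -- dict: ordering name -> lambda_a, summing to 1; if omitted, every
                  ordering gets lambda_a = 1/|A| (Eq. 11)
    """
    if weights is None:
        weights = {a: 1.0 / len(precisions) for a in precisions}
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "lambda weights must sum to 1"
    return sum(weights[a] * p_a for a, p_a in precisions.items())

# Hypothetical P@3 scores for three of the offered sort orders.
scores = {"Relevance": 1.0, "Price: Low to High": 0.0, "Newest Arrivals": 1 / 3}
print(weighted_precision(scores))                                   # equal importance: 0.444...
print(weighted_precision(scores, {"Relevance": 0.6,
                                  "Price: Low to High": 0.3,
                                  "Newest Arrivals": 0.1}))          # default-order heavy: 0.633...
```

The choice of weights is exactly the feedback loop described above: any estimate of λ_a drawn from traffic or sales would need to be refreshed as user behaviour shifts.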
4.3 Browsing
A browsing user is interested in comparing the characteristics of multiple items. These might be obvious eCommerce features such as price or delivery time, or they might be more esoteric, such as whether a certain edition of a book is on the market.
We believe that a metric similar to Tolerance to Irrelevance, T2I [5], but for eCommerce, is appropriate for measuring browsing quality. That is, we envisage a user who continues to look down a results list until their tolerance to the irrelevant material is exceeded – we then ask how far down the results list the user is. This is similar to Cooper's Expected Search Length, ESL, of a simple ordering [4],

    ESL = \sum_{i=1}^{k+\varepsilon} \overline{rel}_i,    (12)

where k is the number of relevant documents we're looking for, ε is the maximum number of non-relevant documents we're prepared to tolerate (stopping after k relevant documents are found), and \overline{rel}_i is 1 if the document at position i in the results list is not relevant, and 0 if it is relevant. ESL is the absolute number of irrelevant documents the user must view in order to see k relevant documents for a given query, which is then averaged over a number of queries. It also does not fall in the range [0..1].
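A minimal sketch (ours) of Equation 12 for a single query: scan down one results list, counting non-relevant documents, and stop once k relevant documents have been seen or the tolerance ε is exhausted; the relevance judgements are toy data.

```python
def expected_search_length(rel, k, epsilon):
    """Non-relevant documents viewed while looking for k relevant ones (Eq. 12).

    rel     -- list of 0/1 relevance judgements in results-list order (1 = relevant)
    k       -- number of relevant documents the user wants to compare
    epsilon -- maximum number of non-relevant documents the user will tolerate
    """
    seen_relevant = 0
    seen_non_relevant = 0
    for r in rel:
        if r:
            seen_relevant += 1
            if seen_relevant == k:            # stop once k relevant documents are found
                break
        else:
            seen_non_relevant += 1
            if seen_non_relevant == epsilon:  # tolerance to irrelevance exceeded
                break
    return seen_non_relevant

# Toy results list: the user wants 3 relevant items and will tolerate 5 non-relevant ones.
print(expected_search_length([0, 1, 0, 0, 1, 1, 0], k=3, epsilon=5))   # -> 3
```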
We assume the user is interested in comparing k items, so we measure the effort required to find those k items. More precisely, we measure the inverse of that effort.
The effort to find one relevant document in one results list is simply the position of that item in the results list, rank_1. The inverse of this is the reciprocal rank for the query, RR; the mean over a number of queries, |Q|, is the mean reciprocal rank, MRR,

    MRR = \frac{\sum_{i=1}^{|Q|} \frac{1}{rank_1}}{|Q|}.    (13)

Generalizing this to k relevant documents, RR_k,

    RR_k = \frac{\sum_{i=1}^{k} \frac{i}{rank_i}}{k},    (14)

and the mean of this over the query set,

    MRR_k = \frac{RR_k}{|Q|},    (15)

is the inverse of the effort the user must expend in order to observe k relevant documents. MRR_k is in the range [0..1] where 1 is best.
We observe that MRR_k is exactly equivalent to MAP@k_r where k_r is the position in the results list of the kth relevant document (rather than the more usual kth position in the results list). An obvious extension is MAP@k_r%.
We also note the similarity to r-precision [1], where the precision is measured at position r in the results list and r is the number of relevant documents. Indeed, setting r to k_r on a query by query basis gives the precision at the point at which the user sees k relevant documents.
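A minimal sketch (ours) of Equations 14 and 15: compute RR_k for each query and average over the query set; the relevance vectors are toy data.

```python
def rr_k(rel, k):
    """RR_k (Eq. 14): mean of i / rank_i over the first k relevant documents.

    rel -- list of 0/1 relevance judgements in results-list order (1 = relevant)
    k   -- number of relevant documents the user wants to see
    """
    total, found = 0.0, 0
    for position, r in enumerate(rel, start=1):
        if r:
            found += 1
            total += found / position      # precision at the i-th relevant document
            if found == k:
                break
    return total / k                       # unfound relevant documents contribute 0

def mrr_k(all_rel, k):
    """MRR_k (Eq. 15): the mean of RR_k over a set of queries Q."""
    return sum(rr_k(rel, k) for rel in all_rel) / len(all_rel)

# Two toy queries, looking for k = 2 relevant documents.
queries = [[1, 0, 1, 0], [0, 1, 1, 0]]
print(mrr_k(queries, k=2))   # ((1/1 + 2/3)/2 + (1/2 + 2/3)/2) / 2 ~= 0.708
```

Averaging RR_k over queries in this way is what the text above identifies with MAP truncated at the position of the kth relevant document.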
5 RELEVANCE
It is pertinent to ask what relevance means in the context of an eCommerce site. Goldberg et al. [7] suggest that for buying it might be defined by a book. That book encodes the difference between an individual user's expectation and the meaning of their query. They ask whether basketball shoes are a good answer to the query basketball, or whether the user needs to be trained to ask for what they want, as shopping is akin to known entity finding. Indeed, we accept that the definition of relevance for shopping is hard and requires further exploration, as it is likely to include factors such as price, seller rating, shipping time, and so on. However, a buy signal for a query is very strong evidence of relevance, and such signals might be mined from logs.
We believe that relevance for browsing is even more difficult to define – but it is clearly an item from an item set that the user wants to compare for some purpose. The purpose could be spelled out in a TREC-like topic definition. The set might be mined from user behaviour.

6 TASK PROPOSAL
We showed in Section 2 that both Amazon and Trademe support multiple rank orderings of the results sets, and in Section 3 that those rank orders are not of equal quality. In order to measure the quality of the site, we argued in Section 4.1 that it is not feasible to measure F1 as the number of relevant documents cannot be known, and we instead propose to measure a weighted average of the precision scores of each of the offered results orderings. In this section we provide more details on our proposed task.
We propose to take a dump of a large-scale online eCommerce site such as Amazon or Trademe. This might be achieved either by agreement with the site, by crawling the site, or by extracting documents from a pre-existing crawl. There are several reasons such a site might choose to participate in such a dump. First, none of the data is proprietary; the data is already public facing and free. Second, providing a dump of the data to the research community is a marketing opportunity. Third, the results of research on such a document collection would be directly applicable by the group that makes the data available, rather than requiring "porting" to a new document collection.
Acquiring a query log may be difficult as query data is proprietary, but a set of queries could be mined from a proxy log of a large institute that has not blocked eCommerce sites. The query is embedded in the URL of the results page of both Amazon and Trademe, and extracting the query from that appears to be straightforward.
Values for λ_a could be estimated from a proxy log (although this might introduce bias). Both Amazon and Trademe embed the sort order in the URL. Either the proportions of queries using each sort order, or the proportion that lead to a buy, could be used.
Trademe and Amazon both support list and grid result presentation – and we believe that they should be measured differently. Set-wise evaluation appears, at the outset, to be a better metric for grids, whereas rank-biased metrics appear to be better suited to lists. The quality of both presentation formats should be measured.

7 DISCUSSION
Both Trademe and Amazon support rank orderings that are direct inversions of each other. For example, the results list for "Highest price" should be directly computable from the results list for "Lowest price" by simply inverting the results list, but may not be because of tie breaks.
We believe that a well performing search engine that returns high quality documents irrespective of the rank order must be good at identifying relevant documents, and must have both a low false positive rate and a low false negative rate. Hence, we believe that it will be a high accuracy search engine.
High accuracy recall identification is an interesting problem for many reasons. First, many years of assumptions about the ranking function pushing low quality results down the results list no longer apply – the learning-to-rank pipelines in web search engines may not be applicable. Second, to be usable online, high accuracy with low latency is important. This raises new problems for IR efficiency research, which generally uses algorithms such as WAND [2] or Anytime [9] that assume a pre-computed single rank ordering, while BitFunnel [8] may return too many false positives.
The similarity between some of the rank orderings (e.g. price low to high) and known entity search does not escape us. In the proposed task, however, the known entity is known to exist, but which document it is is not. Indeed, knowing whether or not any search engine has found the lowest priced relevant document does not appear to be easy. We only know that the lowest priced item amongst those assessed has been placed at the top of the list. The metrics we have proposed do not account for whether or not the most-relevant item is in the recall base. We leave for further work the development of metrics that account for this in absolute orderings. An obvious way to address this is to consider non-recalled but relevant documents as non-relevant documents. That is, if there are 3 relevant documents lower in price than the search engine returns, then count that as 3 misses before the results returned by the search engine – though these might be weighted, as a missing cheapest item is a greater mistake than a missing 25th cheapest item.
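To make the "count non-recalled cheaper items as misses" idea concrete, here is a small sketch (ours, with hypothetical prices) that counts how many known-relevant items are cheaper than the cheapest result the engine returned; the weighting by how cheap the missed item is, as suggested above, is not implemented.

```python
def missed_cheaper_items(known_relevant_prices, returned_prices):
    """Count relevant items cheaper than the cheapest returned result.

    These are treated as misses sitting (conceptually) above the returned
    "Price: Low to High" list; weighting a missed cheapest item more heavily
    than a missed 25th-cheapest item is left open, as in the discussion.
    """
    if not returned_prices:
        return len(known_relevant_prices)
    cheapest_returned = min(returned_prices)
    return sum(1 for price in known_relevant_prices if price < cheapest_returned)

# Hypothetical prices: three relevant items are cheaper than anything returned.
print(missed_cheaper_items([799, 849, 899, 999, 1049],
                           returned_prices=[999, 1049, 1099]))   # -> 3
```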
8 CONCLUSIONS
In this short paper we examined two eCommerce sites and showed that they support different sort orders of the results list. We then showed that they are not equally good at ranking when using these sort orders, and hypothesized that the problem is the quality of the recall set – those documents the search engine returns.
We suggested measuring the quality of the recall base irrespective of the presentation order, and showed that this is infeasible as it is not possible to know the number of relevant documents in the collection – and it is not possible to compute it.
We then proposed a weighted precision score as a metric and proposed methods of computing the weights – for buying. For browsing we developed a measure not dissimilar from tolerance to irrelevance, but based on MAP.
Finally we proposed the high accuracy recall task. For this task the search engine must identify as many relevant documents as it can without forfeiting precision – so that regardless of the presentation order the quality of the results is high.
We believe this is an interesting problem to tackle because it raises new questions about ranking, efficiency, and performance measurement. In future work we hope to build the collection and to launch the task.

REFERENCES
[1] J. A. Aslam, E. Yilmaz, and V. Pavlu. 2005. A Geometric Interpretation of R-precision and Its Correlation with Average Precision. In SIGIR 2005. 573–574.
[2] A. Z. Broder, D. Carmel, M. Herscovici, A. Soffer, and J. Zien. 2003. Efficient Query Evaluation Using a Two-level Retrieval Process. In CIKM 2003. 426–434.
[3] K. Collins-Thompson, C. Macdonald, P. N. Bennett, F. Diaz, and E. M. Voorhees. 2014. TREC 2014 Web Track Overview. In TREC 2014.
[4] W. S. Cooper. 1968. Expected Search Length: A Single Measure of Retrieval Effectiveness Based on the Weak Ordering Action of Retrieval Systems. Am. Doc. 19, 1 (1968), 30–41.
[5] A. P. de Vries, G. Kazai, and M. Lalmas. 2004. Tolerance to Irrelevance: A User-effort Oriented Evaluation of Retrieval Systems Without Predefined Retrieval Unit. In RIAO 2004. 463–473.
[6] N. Fuhr. 2008. A Probability Ranking Principle for Interactive Information Retrieval. IRJ 11, 3 (2008), 251–265.
[7] D. Goldberg, A. Trotman, X. Wang, W. Min, and Z. Wan. 2017. Drawing Sound Conclusions from Noisy Judgments. In WWW 2017. 529–537.
[8] B. Goodwin, M. Hopcroft, D. Luu, A. Clemmer, M. Curmei, S. Elnikety, and Y. He. 2017. BitFunnel: Revisiting Signatures for Search. In SIGIR 2017. 605–614.
[9] J. Lin and A. Trotman. 2015. Anytime Ranking for Impact-Ordered Indexes. In ICTIR 2015. 301–304.
[10] A. Moffat and J. Zobel. 2008. Rank-biased Precision for Measurement of Retrieval Effectiveness. ACM TOIS 27, 1 (2008), 2:1–2:27.
[11] S. E. Robertson. 1997. The Probability Ranking Principle in IR. In Readings in Information Retrieval. 281–286.
[12] M. Sharma, P. Sondhi, C. Zhai, and P. Kolari. 2018. A Taxonomy of Queries for E-commerce Search. In SIGIR 2018.
[13] N. Sherlock and A. Trotman. 2011. Efficient Sorting of Search Results by String Attributes. In ADCS 2011.
[14] J. Zobel. 1998. How Reliable Are the Results of Large-scale Information Retrieval Experiments?. In SIGIR 1998. 307–314.